Courses, books

Data Science at the Command Line by Jeroen Janssens, written using bookdown.
10 Free Must-Read Books for Machine Learning and Data Science
STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
Introduction to Data Science: Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
http://datasciencelabs.github.io/2016/ Harvard SPH
https://cs109.github.io/2014 Harvard CS
https://r4ds.had.co.nz/ R for Data Science
Biostat 203B: Introduction to Data Science
Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.

How to prepare data for collaboration

How to share data for collaboration. Page 7 in particular has some variable coding guidelines for raw data.

naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
coding variables: be consistent, no spelling errors
date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol such as "Oct-4" will be interpreted as a date and reformatted in Excel.
missing data: use "NA". Do not leave any cells blank.
using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.
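
The guidelines above can be sketched in R as a small clean-up step before a table is shared. This is only an illustration: the data frame raw, its column names and its values are invented, and the date format of the input is an assumption.

```r
# Hypothetical raw table with the problems the guidelines warn about:
# spaces and dots in headers, non-ISO dates, and a blank cell.
raw <- data.frame(
  "Patient ID" = c("P01", "P02", "P03"),
  "visit.date" = c("03/15/2019", "04/02/2019", "05/21/2019"),
  "Tumor Size" = c("2.3", "", "4.1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Naming variables: lower case, underscore as the only separator.
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))

# Date and time: convert to ISO 8601 (YYYY-MM-DD).
# Assumes the input is month/day/year.
raw$visit_date <- format(as.Date(raw$visit_date, format = "%m/%d/%Y"), "%Y-%m-%d")

# Missing data: turn blank cells into NA, then store the measurement as numeric.
raw$tumor_size[raw$tumor_size == ""] <- NA
raw$tumor_size <- as.numeric(raw$tumor_size)

print(raw)
```

Longer explanations of each column would still go into the code book rather than into the table itself.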

Five types of data:

continuous
ordinal
categorical
missing
censored
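
One possible way to represent the five types above in a base R data frame is sketched below. All column names and values are invented, and coding censoring as a follow-up time plus a 0/1 event flag is just one common convention.

```r
# Toy example: all values are invented.
d <- data.frame(
  weight_kg       = c(61.2, 75.0, NA, 82.4),       # continuous, with one missing value (NA)
  stage           = factor(c("I", "III", "II", "I"),
                           levels = c("I", "II", "III"),
                           ordered = TRUE),         # ordinal: the levels have an order
  sex             = factor(c("F", "M", "F", "M")),  # categorical: no order
  followup_months = c(12, 30, 24, 36),              # observed follow-up time
  event           = c(1, 0, 1, 0)                   # censored when 0: event not observed
)

str(d)
d$stage[2] > d$stage[1]   # comparisons are defined only for ordered factors
sum(is.na(d$weight_kg))   # number of missing values
```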

Some extras from Data organization in spreadsheets (the paper appeared in The American Statistician):

No empty cells
Put one thing in a cell
Make a rectangle
No calculation in the raw data files
Create a data dictionary (same as code book)
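
A data dictionary can be started from the data itself. The sketch below builds a skeleton with one row per variable. The data frame dat, the column set of the dictionary and the output file name are illustrative choices, not something prescribed by the paper.

```r
# Invented example data; in practice this would be read from the raw data file.
dat <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  age_years  = c(54, 61, NA),
  treatment  = c("drug", "placebo", "drug"),
  stringsAsFactors = FALSE
)

# Skeleton of a data dictionary: one row per variable.
dictionary <- data.frame(
  variable    = names(dat),
  type        = unname(vapply(dat, function(x) class(x)[1], character(1))),
  n_missing   = unname(vapply(dat, function(x) sum(is.na(x)), integer(1))),
  description = "",  # to be filled in by hand
  stringsAsFactors = FALSE
)

# Keep the dictionary in its own file, separate from the raw data.
dict_file <- file.path(tempdir(), "data_dictionary.csv")
write.csv(dictionary, dict_file, row.names = FALSE)
print(dictionary)
```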

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))

> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
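
The same idea on a data frame, as a small sketch with invented values. It also shows how to keep only the complete rows and how to count missing values per column.

```r
# Invented data frame with missing values in two different rows.
dF <- data.frame(
  id    = 1:4,
  score = c(10.5, NA, 8.2, 9.9),
  group = c("a", "b", NA, "a")
)

n_incomplete <- sum(!complete.cases(dF))  # rows with at least one NA
complete_rows <- dF[complete.cases(dF), ] # keep only the complete rows
na_per_column <- colSums(is.na(dF))       # missing values per column

n_incomplete
complete_rows
na_per_column
```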

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

options(stringsAsFactors = FALSE)
Use the tidyverse package

Base R approach:

# stringsAsFactors = TRUE keeps LaborStatus a factor (the default is FALSE since R 4.0.0)
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, fix the labels, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

library(dplyr)  # provides %>%, mutate() and recode()

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime` = "Working part time"))
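
The relabelling can also be tried without the CSV file. The sketch below is a third variant in base R that renames factor levels through a named lookup vector. It is not taken from the preprint, and the LaborStatus values are an invented stand-in for the real column.

```r
# Invented stand-in for the LaborStatus column.
LaborStatus <- factor(c("Working fulltime", "Temp not working",
                        "Working parttime", "Retired", "Working fulltime"))

# Old label -> new label. Labels that are not listed stay unchanged.
lookup <- c(
  "Temp not working" = "Temporarily not working",
  "Unempl, laid off" = "Unemployed, laid off",
  "Working fulltime" = "Working full time",
  "Working parttime" = "Working part time"
)

# Rename only the levels that appear in the lookup table.
recoded <- LaborStatus
old <- levels(recoded)
hit <- old %in% names(lookup)
levels(recoded)[hit] <- lookup[old[hit]]

table(recoded)
```

Renaming the levels keeps the variable a factor throughout, so no round trip through as.character() is needed.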

NIH CBIIT

http://datascience.cancer.gov/

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Bioinformatics advice I wish I learned 10 years ago.

Project Organization

proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
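
A layout like the one above can be set up from R. The sketch below assumes that each dated folder under output holds the results produced on that day. The project root in a temporary directory and the result file name are placeholders.

```r
# Assumption: the project root is a temporary directory for this demo.
proj <- file.path(tempdir(), "proj")

# Folder names taken from the layout above.
for (d in c("dev", "doc", "output", "renv", "rmd", "scripts")) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}

# One output subfolder per date, named YYYY-MM-DD so that folders sort chronologically.
out_dir <- file.path(proj, "output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical result file written into today's folder.
write.csv(head(mtcars), file.path(out_dir, "example_result.csv"))
list.files(proj, recursive = TRUE)
```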

Data Organization

data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
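
With a fixed layout like this, paths can be built by a small helper instead of being typed by hand. The function geo_dir below is hypothetical, as is the temporary data root. It only mirrors the public/human/geo branch of the tree above.

```r
# Assumption: the data root lives in a temporary directory for this demo.
data_root <- file.path(tempdir(), "data")

# Hypothetical helper: path to the raw or processed folder of a GEO series.
geo_dir <- function(accession, stage = c("raw", "processed"), organism = "human") {
  stage <- match.arg(stage)
  file.path(data_root, "public", organism, "geo", accession, stage)
}

raw_dir  <- geo_dir("GSE6477", "raw")
proc_dir <- geo_dir("GSE6477", "processed")
dir.create(raw_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(proc_dir, recursive = TRUE, showWarnings = FALSE)

# Path where a derived expression table would be stored.
file.path(proc_dir, "GSE6477_expr.csv")
```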

Container

Data Science for Startups: Containers Building reproducible setups for machine learning