Courses, books

Data Science at the Command Line by Jeroen Janssens written using bookdown.
10 Free Must-Read Books for Machine Learning and Data Science
STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
Introduction to Data Science Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
http://datasciencelabs.github.io/2016/ Harvard SPH
https://cs109.github.io/2014 Harvard CS
https://r4ds.had.co.nz/ R for Data Science

How to prepare data for collaboration

How to share data for collaboration. Especially Page 7 has some (raw data) variable coding guidelines.

naming variables: using meaning variable names, no spacing in column header, avoiding separator (except an underscore)
coding variables: be consistent, no spelling error
date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
missing data: "NA". Not leave any cells blank.
using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.

Five types of data:

continuous
oridinal
categorical
missing
censored

Some extra from Data organization in spreadsheets (the paper appears in American Statistician)

No empty cells
Put one thing in a cell
Make a rectangle
No calculation in the raw data files
Create a data dictionary (same as code book)

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))

> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

options(stringAsFactors=FALSE)
Use the tidyverse package

Base R approach:

GSS <- read.csv("XXX.csv")
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime ` = "Working part time"))

NIH CBIIT

http://datascience.cancer.gov/

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Bioinformatics advice I wish I learned 10 years ago.

Project Organization

proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts

Data Organization

data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA

Container

Data Science for Startups: Containers Building reproducible setups for machine learning