Data science
Jump to navigation
Jump to search
Courses, books
- Data Science at the Command Line by Jeroen Janssens written using bookdown.
- 10 Free Must-Read Books for Machine Learning and Data Science
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- Introduction to Data Science Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
- https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
- http://datasciencelabs.github.io/2016/ Harvard SPH
- https://cs109.github.io/2014 Harvard CS
- https://r4ds.had.co.nz/ R for Data Science
How to prepare data for collaboration
How to share data for collaboration. Especially Page 7 has some (raw data) variable coding guidelines.
- naming variables: using meaning variable names, no spacing in column header, avoiding separator (except an underscore)
- coding variables: be consistent, no spelling error
- date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
- missing data: "NA". Not leave any cells blank.
- using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.
Five types of data:
- continuous
- oridinal
- categorical
- missing
- censored
Some extra from Data organization in spreadsheets (the paper appears in American Statistician)
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculation in the raw data files
- Create a data dictionary (same as code book)
complete.cases()
Count the number of rows in a data frame that have missing values with
sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2) > tmp [,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6 > tmp[2,1] <- NA > complete.cases(tmp) [1] TRUE FALSE TRUE
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringAsFactors=FALSE)
- Use the tidyverse package
Base R approach:
GSS <- read.csv("XXX.csv") GSS$BaseLaborStatus <- GSS$LaborStatus levels(GSS$BaseLaborStatus) summary(GSS$BaseLaborStatus) GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus) GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working" GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off" GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time" GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time" GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
Tidyverse approach:
GSS <- GSS %>% mutate(tidyLaborStatus = recode(LaborStatus, `Temp not working` = "Temporarily not working", `Unempl, laid off` = "Unemployed, laid off", `Working fulltime` = "Working full time", `Working parttime ` = "Working part time"))
NIH CBIIT
http://datascience.cancer.gov/
Reproducibility
Bioinformatics advice I wish I learned 10 years ago from NIH
Project and Data Organization
- Project Organization
proj ├── dev │ ├── clustering.Rmd │ └── dim_reduce.Rmd ├── doc ├── output │ ├── 2019-05-10 │ ├── 2019-05-19 │ └── 2019-05-21 ├── README.Rmd ├── renv ├── rmd └── scripts
- Data Organization
data ├── annotations │ ├── clue_drug_repurposing_hub │ │ ├── repurposing_drugs_20180907.txt │ │ └── repurposing_samples_20180907.txt │ └── ... ├── containers │ └── singularity │ └── sclc-george2015 ├── projects │ ├── nih │ │ ├── mm-feature-selection │ │ ├── mm-p3-variants │ │ └── sclc-doe ├── public │ └── human │ ├── array_express │ ├── geo │ │ └── GSE6477 │ │ ├── processed │ │ │ ├── GSE6477_expr.csv │ │ │ └── sample_metadata.csv │ │ └── raw │ │ ├── GPL96.soft │ │ └── GSE6477_series_matrix.txt.gz └── ref └── human ├── agilent ├── gatk ├── gencode-v30 └── rRNA
Container
Data Science for Startups: Containers Building reproducible setups for machine learning