Data science
Courses, books
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- Introduction to Data Science: Data Analysis and Prediction Algorithms with R, by Rafael A. Irizarry (free)
How to prepare data for collaboration
See How to share data for collaboration. Page 7 in particular has coding guidelines for raw-data variables.
- naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
- coding variables: be consistent and avoid spelling errors
- date and time: use YYYY-MM-DD (the ISO 8601 standard). Excel will interpret a gene symbol such as "Oct-4" as a date and reformat it.
- missing data: enter "NA". Do not leave any cells blank.
- use a code book file (*.docx, for example): put any lengthy explanation of the variables there. See p. 5 for an example.
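As a rough illustration of the guidelines above, the following base R sketch cleans a small made-up table before sharing it. The data frame `raw`, its column names, and the US-style input date format are all assumptions for the example, not part of the cited guide.

```r
# Hypothetical raw data showing problems the guidelines warn about:
# spaces and dots in headers, non-ISO dates, and blank cells
raw <- data.frame(
  "Patient ID" = c(1, 2, 3),
  "visit.date" = c("03/15/2021", "04/01/2021", ""),
  "Gene"       = c("OCT4", "", "TP53"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Naming: replace spaces and other separators with a single underscore
names(raw) <- gsub("[^A-Za-z0-9]+", "_", names(raw))

# Missing data: turn blank cells into NA
raw[raw == ""] <- NA

# Dates: parse the (assumed) month/day/year input and store it as ISO 8601 text
raw$visit_date <- format(as.Date(raw$visit_date, format = "%m/%d/%Y"), "%Y-%m-%d")

# Write the cleaned table with missing values spelled out as "NA"
out_file <- tempfile(fileext = ".csv")
write.csv(raw, out_file, na = "NA", row.names = FALSE)
```

Keeping the dates as text in the CSV means a spreadsheet program is less likely to reinterpret them, although that still depends on how the collaborator opens the file.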
Five types of data:
- continuous
- ordinal
- categorical
- missing
- censored
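One possible way to represent these five types in a base R data frame is sketched below. All values and column names are made up for illustration, and encoding censoring as a time plus an event indicator is only one common convention (the survival package's Surv() is another option).

```r
# Toy data; every value here is invented for the example
d <- data.frame(
  weight_kg = c(61.5, 72.0, NA, 80.3),            # continuous, with one missing value (NA)
  stage     = factor(c("I", "III", "II", "I"),
                     levels = c("I", "II", "III"),
                     ordered = TRUE),             # ordinal: levels have a defined order
  sex       = factor(c("F", "M", "F", "M")),      # categorical: unordered levels
  time_days = c(120, 365, 200, 365),              # follow-up time
  event     = c(1, 0, 1, 0)                       # 1 = event observed, 0 = censored
)

str(d)

# Comparisons are meaningful for ordered factors
early_stage <- d$stage < "III"

# Count the missing values in the continuous variable
n_missing <- sum(is.na(d$weight_kg))
```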
Some additional guidelines from Data organization in spreadsheets (the paper appears in The American Statistician):
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculations in the raw data files
- Create a data dictionary (same as code book)
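A data dictionary can be started directly from the data. The sketch below is a minimal example, assuming a small made-up data frame `dat`. The descriptions are typed by hand and the column layout of `dictionary` is an arbitrary choice, not a format prescribed by the paper.

```r
# Hypothetical rectangular data set: one row per patient visit
dat <- data.frame(
  patient_id = 1:3,
  visit_date = as.Date(c("2021-03-15", "2021-04-01", "2021-04-20")),
  weight_kg  = c(61.5, NA, 80.3)
)

# One dictionary row per variable: name, storage type, NA count, description
dictionary <- data.frame(
  variable    = names(dat),
  type        = vapply(dat, function(x) class(x)[1], character(1)),
  n_missing   = vapply(dat, function(x) sum(is.na(x)), integer(1)),
  description = c("Unique patient identifier",
                  "Date of clinic visit (YYYY-MM-DD)",
                  "Body weight in kilograms"),
  row.names = NULL
)

print(dictionary)
```

Saving `dictionary` as its own file next to the raw data keeps explanations out of the data file itself.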
complete.cases()
Count the number of rows in a data frame that have missing values with
sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
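The same idea applies to a data frame. The following sketch uses a made-up data frame `dF` (the columns are assumptions for illustration) to count incomplete rows, count missing values per column, and keep only the complete rows.

```r
# Hypothetical data frame with missing values in two different columns
dF <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  group = c("a", "b", NA, "d")
)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(dF))

# Number of missing values in each column
na_per_column <- colSums(is.na(dF))

# Keep only the rows without any missing value (na.omit(dF) keeps the same rows)
dF_complete <- dF[complete.cases(dF), ]
```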
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringsAsFactors = FALSE) (this is already the default since R 4.0.0)
- Use the tidyverse package
Base R approach:
# "XXX.csv" is the placeholder file name from the source.
# stringsAsFactors = TRUE keeps LaborStatus a factor on R >= 4.0.0,
# where read.csv() returns character columns by default.
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, rename the values, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
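An alternative base R sketch renames the factor levels directly through a named lookup vector, which avoids the round trip through character. This is not the approach from the preprint. The toy vector `labor`, the `lookup` name, and the extra "Retired" level are assumptions for illustration.

```r
# Toy factor standing in for GSS$LaborStatus (values made up for the example)
labor <- factor(c("Working fulltime", "Working parttime",
                  "Temp not working", "Working fulltime", "Retired"))

# Old level name -> new level name
lookup <- c("Temp not working" = "Temporarily not working",
            "Unempl, laid off" = "Unemployed, laid off",
            "Working fulltime" = "Working full time",
            "Working parttime" = "Working part time")

# Rename only the levels that appear in the lookup; others stay unchanged
old <- levels(labor)
hit <- old %in% names(lookup)
levels(labor)[hit] <- unname(lookup[old[hit]])

table(labor)
```

Because the mapping lives in one vector, a misspelled old name simply fails to match and leaves the level as it was, so it is worth checking levels(labor) afterwards.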
Tidyverse approach:
library(dplyr)  # provides %>%, mutate() and recode()
GSS <- GSS %>%
  mutate(tidyLaborStatus =
           recode(LaborStatus,
                  `Temp not working` = "Temporarily not working",
                  `Unempl, laid off` = "Unemployed, laid off",
                  `Working fulltime` = "Working full time",
                  `Working parttime` = "Working part time"))
NIH CBIIT
http://datascience.cancer.gov/