Data science: Difference between revisions
Revision as of 10:36, 31 July 2021
Courses, books
- Data Science at the Command Line by Jeroen Janssens written using bookdown.
- 10 Free Must-Read Books for Machine Learning and Data Science
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
- http://datasciencelabs.github.io/2016/ Harvard SPH
- https://cs109.github.io/2014 Harvard CS
- Biostat 203B: Introduction to Data Science
- Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.
- The Analytics Edge from edX.org or MOOC/Massive Open Online Courses.
- Introduction to Machine Learning (I2ML)
- Probabilistic Machine Learning: An Introduction
- Telling Stories With Data by Rohan Alexander
Python
- Python Data Science Handbook: Essential Tools for Working with Data
- Getting started with data science using Python from opensource.com
R
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- https://r4ds.had.co.nz/ R for Data Science
- 20 Free Online Books to Learn R and Data Science
- Introduction to Data Science: Data Analysis and Prediction Algorithms with R (also on GitHub), by Rafael A. Irizarry (free)
- Why Is It Called That Way?! – Origin and Meaning of R Package Names.
- dplyr
- lubridate
- ggplot2
- data.table
- tibble
- purrr
- amelia
- magrittr
- batman
- Homeric
- fcuk
- hellno
Python vs R
R, Python & Julia in data science : A comparison
Datacamp
- 32 Completely FREE DataCamp Courses To Take In 2020
- How to Get Free DataCamp Subscription For 2 Months? Microsoft provides a free DataCamp subscription with a Visual Studio Dev Essentials account. You just need to sign up for the account and it's done.
How to prepare data for collaboration
How to share data for collaboration. Page 7 in particular has some variable coding guidelines for raw data.
- naming variables: use meaningful variable names, no spaces in column headers, avoid separators (except an underscore)
- coding variables: be consistent, no spelling errors
- date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol such as "Oct-4" will be interpreted as a date and reformatted in Excel.
- missing data: use "NA". Do not leave any cells blank.
- using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.
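The guidelines above can be checked mechanically before a file is shared. The following is a minimal sketch in base R, not part of the cited guide: the data frame `raw`, its column names and the helper `check_names()` are all hypothetical and only serve as illustration.

```r
# Hypothetical raw data; column names and values are illustrative only.
raw <- data.frame(
  patient_id  = c("p01", "p02", "p03"),
  visit_date  = c("2021-07-31", "2021-08-02", NA),
  tumor_grade = c("low", "high", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag column names that contain anything other than
# letters, digits or underscores (spaces, dots, dashes, ...).
check_names <- function(df) {
  bad <- grep("[^A-Za-z0-9_]", names(df), value = TRUE)
  if (length(bad) > 0) {
    warning("Problematic column names: ", paste(bad, collapse = ", "))
  }
  invisible(bad)
}
bad_names <- check_names(raw)

# Parse dates strictly as ISO 8601 (YYYY-MM-DD); unparseable entries become NA.
raw$visit_date <- as.Date(raw$visit_date, format = "%Y-%m-%d")

# Write missing values explicitly as "NA" rather than leaving cells blank.
out_file <- tempfile(fileext = ".csv")
write.csv(raw, out_file, row.names = FALSE, na = "NA")
```

Writing to a temporary file keeps the sketch free of side effects; in practice the path would point into the shared project folder.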
Five types of data:
- continuous
- ordinal
- categorical
- missing
- censored
Some extra guidelines from Data organization in spreadsheets (the paper appears in The American Statistician)
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculation in the raw data files
- Create a data dictionary (same as code book)
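A data dictionary can itself be kept as a small table alongside the data. The sketch below is one possible layout, assuming a hypothetical data set `dat`; the variable names and descriptions are invented for illustration and the columns of the dictionary are a design choice, not a standard.

```r
# Hypothetical data set; variable names and values are illustrative only.
dat <- data.frame(
  patient_id  = c("p01", "p02"),
  age_years   = c(54, 61),
  tumor_grade = c("low", "high"),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage type, and a human-written description.
data_dictionary <- data.frame(
  variable    = names(dat),
  type        = vapply(dat, function(x) class(x)[1], character(1)),
  description = c(
    "Unique patient identifier",
    "Age at enrollment, in years",
    "Tumor grade: low or high"
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)

data_dictionary
```

Generating the `variable` and `type` columns from the data keeps the dictionary in step with the file; only the descriptions need to be written by hand.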
Gene name errors from Excel
Gene name errors: Lessons not learned
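Gene symbols such as SEPT2 or MARCH1 are silently converted to dates when a file is opened in Excel. A rough way to screen a gene column for this damage is sketched below. The vector `genes` is made up, and the two regular expressions are a heuristic assumption, not an exhaustive list of the formats Excel can produce.

```r
# Hypothetical gene column as it might look after a round trip through Excel.
genes <- c("TP53", "2-Sep", "1-Mar", "BRCA1", "4-Oct", "2021-03-01")

# Heuristic patterns for date-like strings (an assumption, not exhaustive):
# "2-Sep" style and "YYYY-MM-DD" style.
months <- "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
date_like <- grepl(paste0("^[0-9]{1,2}-(", months, ")$"), genes) |
  grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", genes)

# Entries that deserve a manual look.
genes[date_like]

# When reading a file in R, forcing the gene column to character avoids any
# type guessing. This does not undo damage already done by Excel.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("gene,count", "SEPT2,10", "MARCH1,3"), csv_file)
expr <- read.csv(csv_file, colClasses = c(gene = "character"))
```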
complete.cases()
Count the number of rows in a data frame that have missing values with
sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
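The same function works on data frames. A small sketch, assuming a hypothetical data frame `dF` with invented columns, shows counting incomplete rows, dropping them, and locating the missing values by column:

```r
# Hypothetical data frame with missing values in two different rows.
dF <- data.frame(
  id     = 1:4,
  height = c(170, NA, 165, 180),
  weight = c(65, 72, NA, 80)
)

n_incomplete  <- sum(!complete.cases(dF))   # rows with at least one NA
dF_complete   <- dF[complete.cases(dF), ]   # keep only fully observed rows
na_per_column <- colSums(is.na(dF))         # number of NAs in each column
```

`colSums(is.na(dF))` is often worth checking before dropping rows, since a single sparse column can remove most of the data.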
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringsAsFactors = FALSE) (already the default since R 4.0.0)
- Use the tidyverse package
Base R approach:
# Read strings as factors so that levels() below works in R >= 4.0.0
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, fix the labels, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
Tidyverse approach:
library(dplyr)
# recode() takes old = new pairs; old names with spaces or commas need backticks
GSS <- GSS %>%
  mutate(tidyLaborStatus = recode(LaborStatus,
    `Temp not working` = "Temporarily not working",
    `Unempl, laid off` = "Unemployed, laid off",
    `Working fulltime` = "Working full time",
    `Working parttime` = "Working part time"))
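Another base R route, offered here only as a sketch, is to rename the factor levels directly from a lookup vector, which avoids the round trip through character. The vector `LaborStatus` below is a small hypothetical stand-in for the GSS column, and `relabel` is an invented name.

```r
# Hypothetical stand-in for the LaborStatus column of the GSS data.
LaborStatus <- factor(c("Working fulltime", "Temp not working",
                        "Working parttime", "Working fulltime"))

# Lookup vector: names are the old levels, values are the new labels.
relabel <- c(
  "Temp not working" = "Temporarily not working",
  "Unempl, laid off" = "Unemployed, laid off",
  "Working fulltime" = "Working full time",
  "Working parttime" = "Working part time"
)

BaseLaborStatus <- LaborStatus
old <- levels(BaseLaborStatus)
hit <- old %in% names(relabel)

# Replace only the levels that appear in the lookup; others stay untouched.
levels(BaseLaborStatus)[hit] <- unname(relabel[old[hit]])

table(BaseLaborStatus)
```

Keeping the old-to-new mapping in one named vector makes it easy to review and to reuse across several columns.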
NIH CBIIT
http://datascience.cancer.gov/
Seminars
NCI Data science webinar series
Reproducibility
Bioinformatics advice I wish I learned 10 years ago from NIH
Project and Data Organization
- Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
- Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
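The dated output folders in the project layout can be created from within an analysis script. The following is a minimal sketch in base R; the project root under `tempdir()` and the file name `example_table.csv` are hypothetical, chosen so the example has no side effects.

```r
# Hypothetical project root; a temporary directory keeps the sketch harmless.
proj <- file.path(tempdir(), "proj")

# Top-level folders mirroring part of the project layout.
for (d in c("dev", "doc", "output", "scripts")) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}

# One output folder per run date, named in ISO 8601 form (YYYY-MM-DD).
out_dir <- file.path(proj, "output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Save a result into today's folder (mtcars ships with base R).
write.csv(head(mtcars), file.path(out_dir, "example_table.csv"))
```

Because ISO 8601 dates sort lexicographically, the output folders list in chronological order without extra effort.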
Container
Data Science for Startups: Containers Building reproducible setups for machine learning