Data science: Difference between revisions
Revision as of 07:53, 28 March 2022
Courses, books
- Data Science at the Command Line by Jeroen Janssens written using bookdown.
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
- http://datasciencelabs.github.io/2016/ Harvard SPH
- https://cs109.github.io/2014 Harvard CS
- Biostat 203B: Introduction to Data Science
- Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.
- The Analytics Edge from edX.org or MOOC/Massive Open Online Courses.
- Telling Stories With Data by Rohan Alexander
- Best Data Science Books For Beginners
Python
- Python Data Science Handbook: Essential Tools for Working with Data
- Getting started with data science using Python from opensource.com
R
- Coursera -> Data Science Specialization by JHU.
- Data science in a box
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- https://r4ds.had.co.nz/ R for Data Science
- 20 Free Online Books to Learn R and Data Science
- Teaching resources by Irizarry. edx.org. Audit is free.
- Introduction to Data Science
- Data Analysis for the Life Sciences
- Genomics Data Analysis
- Introduction to Data Science or on github Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
- Why Is It Called That Way?! – Origin and Meaning of R Package Names.
- dplyr
- lubridate
- ggplot2
- data.table
- tibble
- purrr
- amelia
- magrittr
- batman
- Homeric
- fcuk
- hellno
Python vs R
R, Python & Julia in data science : A comparison
Datacamp
- 32 Completely FREE DataCamp Courses To Take In 2020
- How to Get Free DataCamp Subscription For 2 Months? Microsoft is providing a free DataCamp subscription with a Visual Studio Dev Essential Account. You just need to sign up for the account and it's done.
Machine Learning
- ML Youtube Courses
- 10 Free Must-Read Books for Machine Learning and Data Science
- Introduction to Machine Learning (I2ML)
- Probabilistic Machine Learning: An Introduction
- The Hitchhiker’s Guide to Responsible Machine Learning
- CS 329S: Machine Learning Systems Design Stanford
How to prepare data for collaboration
How to share data for collaboration. Page 7 in particular has some (raw data) variable coding guidelines.
- naming variables: use meaningful variable names, no spaces in column headers, avoid separators (except an underscore)
- coding variables: be consistent, no spelling errors
- date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
- missing data: use "NA". Do not leave any cells blank.
- using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.
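The naming, date, and missing-data guidelines above can be checked mechanically before a file is shared. The following is only a minimal sketch: `check_raw_data()`, its `date_cols` argument, and the example data frame are hypothetical names made up for illustration, and the allowed-name pattern is one possible reading of "no spaces, underscore only".

```r
# Hypothetical helper: report violations of the guidelines listed above.
check_raw_data <- function(df, date_cols = character(0)) {
  problems <- character(0)

  # Column names: letters, digits and underscores only, starting with a letter.
  bad_names <- names(df)[!grepl("^[A-Za-z][A-Za-z0-9_]*$", names(df))]
  if (length(bad_names) > 0) {
    problems <- c(problems, paste("bad column name:", bad_names))
  }

  # Blank cells: empty or whitespace-only strings should be NA instead.
  for (col in names(df)) {
    values <- df[[col]]
    if (is.character(values) && any(!is.na(values) & trimws(values) == "")) {
      problems <- c(problems, paste("blank cell in column:", col))
    }
  }

  # Dates: must be written as YYYY-MM-DD (NA is allowed).
  for (col in date_cols) {
    values <- as.character(df[[col]])
    ok <- is.na(values) | grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", values)
    if (!all(ok)) {
      problems <- c(problems, paste("non-ISO date in column:", col))
    }
  }
  problems
}

# Made-up example with one violation of each kind.
example <- data.frame(
  sample_id    = c("s1", "s2", "s3"),
  `visit date` = c("2022-03-28", "03/29/2022", NA),
  treatment    = c("drug", "", "placebo"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
result <- check_raw_data(example, date_cols = "visit date")
print(result)
```

The function returns an empty character vector when nothing is flagged, so it can be used as a simple gate before exporting a file.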
Five types of data:
- continuous
- ordinal
- categorical
- missing
- censored
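As a rough illustration, the five types can be represented in base R as follows. The `patients` data frame and its column names are invented for this sketch, and encoding censoring as a follow-up time plus an event indicator is one common convention, not the only one.

```r
# Made-up data; each column illustrates one of the data types above.
patients <- data.frame(
  age       = c(54.2, 61.0, NA, 47.5),               # continuous, with one missing value (NA)
  stage     = factor(c("I", "III", "II", "I"),
                     levels = c("I", "II", "III"),
                     ordered = TRUE),                 # ordinal: levels have an order
  sex       = factor(c("F", "M", "M", "F")),         # categorical: no order
  follow_up = c(12, 30, 24, 8),                      # months of follow-up
  event     = c(1, 0, 1, 0)                          # 1 = event observed, 0 = censored
)

str(patients)
early_stage <- patients$stage < "III"   # comparisons are defined for ordered factors
n_missing_age <- sum(is.na(patients$age))
n_censored <- sum(patients$event == 0)
```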
Some extra tips from Data organization in spreadsheets (the paper appears in The American Statistician)
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculation in the raw data files
- Create a data dictionary (same as code book)
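A data dictionary can be started from the data itself and then completed by hand. This is a minimal sketch; `make_data_dictionary()` and the example columns are hypothetical, and the `description` column is deliberately left empty because it cannot be derived automatically.

```r
# Hypothetical helper: one row per variable, with its type and number of missing values.
make_data_dictionary <- function(df) {
  data.frame(
    variable    = names(df),
    type        = vapply(df, function(col) class(col)[1], character(1), USE.NAMES = FALSE),
    n_missing   = vapply(df, function(col) sum(is.na(col)), integer(1), USE.NAMES = FALSE),
    description = "",   # to be filled in by hand
    stringsAsFactors = FALSE
  )
}

# Made-up example data.
measurements <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  weight_kg = c(70.5, NA, 82.1),
  stringsAsFactors = FALSE
)
dict <- make_data_dictionary(measurements)
print(dict)
```

The result could then be saved next to the raw data, for example with `write.csv(dict, "data_dictionary.csv", row.names = FALSE)`, where the file name is only a suggestion.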
Data Organization in Spreadsheets
Data Organization in Spreadsheets Broman & Woo 2018
Gene name errors from Excel
- Gene name errors: Lessons not learned.
- HGNChelper: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
- Some examples: MARCH3, SEPT8, OCT4, DEC1.
- Gene names, data corruption and Excel: a 2021 update
# x is assumed to be a character vector of gene symbols
length(x)
# [1] 28109
length(grep("march", x, ignore.case = TRUE))
# [1] 11
length(grep("sep", x, ignore.case = TRUE))
# [1] 24
length(grep("oct", x, ignore.case = TRUE))
# [1] 0
length(grep("dec", x, ignore.case = TRUE))
# [1] 6
grep("sep", x, ignore.case = TRUE, value = TRUE)
#  [1] "RNaseP_nuc"  "SEP15"       "SEPHS1"
#  [4] "SEPHS2"      "SEPN1"       "SEPP1"
#  [7] "SEPSECS"     "SEPT1"       "SEPT10"
# [10] "SEPT11"      "SEPT12"      "SEPT14"
# [13] "SEPT2"       "SEPT3"       "SEPT4"
# [16] "SEPT5-GP1BB" "SEPT6"       "SEPT7"
# [19] "SEPT7P2"     "SEPT7P9"     "SEPT8"
# [22] "SEPT9"       "SEPW1"       "septin 9/TNRC6C fusion"
# Count the symbols that contain a character other than a letter, digit, or space
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108
# Some cases:
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "TRA@"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"
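The substring searches above also pick up harmless symbols such as SEPHS1. To flag only symbols that look like dates (MARCH3, SEPT8, OCT4, DEC1), a stricter pattern can be tried. This is a heuristic sketch: the regular expression is an assumption about what a date-like symbol looks like, not a description of Excel's actual conversion rules, and the `symbols` vector is made up.

```r
# Heuristic pattern (an assumption, not Excel's actual rules):
# a month name or abbreviation, an optional hyphen, then one or two digits.
date_like <- paste0(
  "^(JAN|FEB|MARCH|MAR|APR|MAY|JUNE|JUN|JULY|JUL|AUG|SEPT|SEP|OCT|NOV|DEC)",
  "-?[0-9]{1,2}$"
)

# Made-up test vector mixing date-like and ordinary symbols.
symbols <- c("MARCH3", "SEPT8", "OCT4", "DEC1", "SEPHS1", "TP53", "Oct-4")
flagged <- grep(date_like, symbols, ignore.case = TRUE, value = TRUE)
print(flagged)
```

Flagged symbols are candidates for manual review or for checking with a dedicated tool such as HGNChelper.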
All NIH-funded data must be made freely accessible
Data Sharing and Public Access Policies
complete.cases()
Count the number of rows in a data frame that have missing values with
sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1] TRUE FALSE TRUE
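The same idea applied to a data frame, together with two closely related base R calls. The data frame `dF` and its columns are made up for this sketch.

```r
# Made-up data frame with missing values in two columns.
dF <- data.frame(
  id     = 1:4,
  height = c(170, NA, 165, 180),
  weight = c(65, 72, NA, NA)
)

n_incomplete  <- sum(!complete.cases(dF))   # rows with at least one NA
na_per_column <- colSums(is.na(dF))         # NA count for each column
complete_rows <- dF[complete.cases(dF), ]   # keeps the same rows as na.omit(dF)

print(n_incomplete)
print(na_per_column)
print(complete_rows)
```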
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringsAsFactors = FALSE) (already the default since R 4.0.0)
- Use the tidyverse package
Base R approach:
# Read strings as factors so that levels() works (no longer the default since R 4.0.0)
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
# Work on a copy of the original column
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, fix the labels, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
Tidyverse approach:
library(dplyr)

GSS <- GSS %>%
  mutate(tidyLaborStatus =
    recode(LaborStatus,
      `Temp not working` = "Temporarily not working",
      `Unempl, laid off` = "Unemployed, laid off",
      `Working fulltime` = "Working full time",
      `Working parttime` = "Working part time"))
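A third option, shown here only as a sketch, is to keep the old-to-new labels in a named lookup vector and rename the factor levels directly in base R. This is not taken from the preprint; the `labor` vector is made-up data that reuses the labels above.

```r
# Made-up factor using the labels from the examples above.
labor <- factor(c("Working fulltime", "Temp not working", "Retired",
                  "Working parttime", "Working fulltime"))

# Named lookup: names are the old labels, values are the new labels.
lookup <- c(
  "Temp not working" = "Temporarily not working",
  "Unempl, laid off" = "Unemployed, laid off",
  "Working fulltime" = "Working full time",
  "Working parttime" = "Working part time"
)

# Rename only the levels that appear in the lookup; other levels are left as they are.
lev <- levels(labor)
hit <- lev %in% names(lookup)
lev[hit] <- unname(lookup[lev[hit]])
levels(labor) <- lev

print(table(labor))
```

Because only the levels are touched, the column stays a factor throughout and no round trip through character is needed.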
NIH CBIIT
http://datascience.cancer.gov/
Seminars
NCI Data science webinar series
Reproducibility
Bioinformatics advice I wish I learned 10 years ago from NIH
Project and Data Organization
- Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
- Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
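A project skeleton like the one under Project Organization can be created with a few base R calls. This is a sketch under assumptions: `create_project()` is a hypothetical helper, the project is created inside `tempdir()` so nothing is written to a real location, and the `renv` folder is omitted because it is normally created by the renv package itself.

```r
# Hypothetical helper: create the project folders and a dated output folder.
create_project <- function(root) {
  subdirs <- c("dev", "doc", "output", "rmd", "scripts")
  for (d in file.path(root, subdirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # One output folder per run date, named YYYY-MM-DD so that folders sort chronologically.
  run_dir <- file.path(root, "output", format(Sys.Date(), "%Y-%m-%d"))
  dir.create(run_dir, showWarnings = FALSE)

  # Create an empty README only if none exists, to avoid overwriting one.
  readme <- file.path(root, "README.Rmd")
  if (!file.exists(readme)) {
    file.create(readme)
  }
  invisible(run_dir)
}

proj <- file.path(tempdir(), "proj")
run_dir <- create_project(proj)
print(list.files(proj))
```

Calling the function again on a later date adds a new dated folder under `output` and leaves everything else untouched.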
Container
Data Science for Startups: Containers. Building reproducible setups for machine learning