Data science: Difference between revisions
Revision as of 07:53, 28 March 2022
Courses, books
- Data Science at the Command Line by Jeroen Janssens written using bookdown.
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
- http://datasciencelabs.github.io/2016/ Harvard SPH
- https://cs109.github.io/2014 Harvard CS
- Biostat 203B: Introduction to Data Science
- Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.
- The Analytics Edge from edX.org or MOOC/Massive Open Online Courses.
- Telling Stories With Data by Rohan Alexander
- Best Data Science Books For Beginners
Python
- Python Data Science Handbook: Essential Tools for Working with Data
- Getting started with data science using Python from opensource.com
R
- Coursera: Data Science Specialization by Johns Hopkins University.
- Data science in a box
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- https://r4ds.had.co.nz/ R for Data Science
- 20 Free Online Books to Learn R and Data Science
- Teaching resources by Irizarry on edx.org. Auditing is free.
- Introduction to Data Science
- Data Analysis for the Life Sciences
- Genomics Data Analysis
- Introduction to Data Science or on github Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
- Why Is It Called That Way?! – Origin and Meaning of R Package Names.
- dplyr
- lubridate
- ggplot2
- data.table
- tibble
- purrr
- amelia
- magrittr
- batman
- Homeric
- fcuk
- hellno
Python vs R
R, Python & Julia in data science: A comparison
Datacamp
- 32 Completely FREE DataCamp Courses To Take In 2020
- How to get a free DataCamp subscription for 2 months? Microsoft is providing a free DataCamp subscription with a Visual Studio Dev Essentials account. You just need to sign up for the account and it's done.
Machine Learning
- ML Youtube Courses
- 10 Free Must-Read Books for Machine Learning and Data Science
- Introduction to Machine Learning (I2ML)
- Probabilistic Machine Learning: An Introduction
- The Hitchhiker’s Guide to Responsible Machine Learning
- CS 329S: Machine Learning Systems Design Stanford
How to prepare data for collaboration
How to share data for collaboration. Page 7 in particular has guidelines for coding (raw data) variables.
- naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
- coding variables: be consistent and avoid spelling errors
- date and time: YYYY-MM-DD (ISO 8601 standard). Excel will interpret a gene symbol such as "Oct-4" as a date and reformat it.
- missing data: use "NA". Do not leave any cells blank.
- using a code book file (*.docx for example): put any lengthy explanation of the variables here. See p5 for an example.
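The guidelines above can be checked mechanically before a file is shared. The following is a minimal base R sketch, not something taken from the guide itself: the data frame `raw`, its column names, and the helper `check_raw_data` are all hypothetical.

```r
# Hypothetical raw data; names and values are illustrative only
raw <- data.frame(
  sample_id   = c("s01", "s02", "s03"),
  visit_date  = c("2022-03-01", "2022-03-15", NA),  # ISO 8601, missing coded as NA
  tumor_stage = c("II", "III", "II"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the columns that break each guideline
check_raw_data <- function(df, date_cols) {
  # Column headers: letters, digits and underscores only, no spaces
  bad_names <- names(df)[!grepl("^[A-Za-z][A-Za-z0-9_]*$", names(df))]
  # Blank cells: empty strings should have been entered as NA
  blank <- vapply(df, function(col) {
    any(!is.na(col) & trimws(as.character(col)) == "")
  }, logical(1))
  # Dates: every non-missing value must look like YYYY-MM-DD and parse
  bad_dates <- vapply(df[date_cols], function(col) {
    v <- as.character(col[!is.na(col)])
    any(!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", v) |
          is.na(as.Date(v, format = "%Y-%m-%d")))
  }, logical(1))
  list(bad_names     = bad_names,
       blank_cols    = names(df)[blank],
       bad_date_cols = date_cols[bad_dates])
}

res <- check_raw_data(raw, date_cols = "visit_date")
```

All three elements of `res` are empty for this table. A header such as `visit date`, a blank cell, or a date typed as 03/01/2022 would each be reported under the corresponding element.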
Five types of data:
- continuous
- ordinal
- categorical
- missing
- censored
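One way these five types might be represented in an R data frame is sketched below. The table `patients` and all of its columns are hypothetical, and coding censoring as a time plus an event indicator is only one common convention, not something prescribed by the source.

```r
# Hypothetical patient table; every column is illustrative
patients <- data.frame(
  age_years     = c(54.2, 61.0, 47.5),            # continuous
  stage         = factor(c("I", "III", "II"),
                         levels = c("I", "II", "III"),
                         ordered = TRUE),         # ordinal
  sex           = factor(c("F", "M", "F")),       # categorical
  bmi           = c(23.1, NA, 27.8),              # contains a missing value (NA)
  followup_days = c(365, 120, 365),               # censored: observed follow-up time
  event         = c(0, 1, 0)                      # 1 = event observed, 0 = censored
)

str(patients)
```

Storing the stage as an ordered factor lets comparisons such as `patients$stage < "III"` respect the ordering of the levels.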
Some extra tips from Data organization in spreadsheets (the paper appears in The American Statistician)
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculation in the raw data files
- Create a data dictionary (same as code book)
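A data dictionary can be started from the data itself and then completed by hand. The sketch below assumes a small hypothetical table `raw`; the column names and descriptions are made up for illustration.

```r
# Hypothetical raw data; in practice this would be read from the shared file
raw <- data.frame(
  sample_id = c("s01", "s02", "s03"),
  age_years = c(54, 61, NA),
  sex       = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage type, number of missing values,
# and a free-text description that has to be written by a person
data_dictionary <- data.frame(
  variable    = names(raw),
  type        = vapply(raw, function(col) class(col)[1], character(1),
                       USE.NAMES = FALSE),
  n_missing   = vapply(raw, function(col) sum(is.na(col)), integer(1),
                       USE.NAMES = FALSE),
  description = c("Unique sample identifier",
                  "Age at enrollment, in years",
                  "Sex (F or M)"),
  stringsAsFactors = FALSE
)

data_dictionary
```

Writing the dictionary to its own file (for example with `write.csv()`) keeps the raw data file free of explanatory text.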
Data Organization in Spreadsheets
Data Organization in Spreadsheets Broman & Woo 2018
Gene name errors from Excel
- Gene name errors: Lessons not learned.
- HGNChelper: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
- Some examples: MARCH3, SEPT8, OCT4, DEC1.
- Gene names, data corruption and Excel: a 2021 update
length(x)
# [1] 28109
length(grep("march", x, ignore.case=T))
# [1] 11
length(grep("sep", x, ignore.case=T))
# [1] 24
length(grep("oct", x, ignore.case=T))
# [1] 0
length(grep("dec", x, ignore.case=T))
# [1] 6
grep("sep", x, ignore.case=T, value=T)
#  [1] "RNaseP_nuc"  "SEP15"       "SEPHS1"
#  [4] "SEPHS2"      "SEPN1"       "SEPP1"
#  [7] "SEPSECS"     "SEPT1"       "SEPT10"
# [10] "SEPT11"      "SEPT12"      "SEPT14"
# [13] "SEPT2"       "SEPT3"       "SEPT4"
# [16] "SEPT5-GP1BB" "SEPT6"       "SEPT7"
# [19] "SEPT7P2"     "SEPT7P9"     "SEPT8"
# [22] "SEPT9"       "SEPW1"       "septin 9/TNRC6C fusion"

# Count non-alphanumeric symbols from a string
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108
# Some cases:
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "TRA@"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"
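After Excel has converted a symbol, it typically shows up as a day-month string such as "1-Mar" or "2-Sep". The following base R sketch flags such entries, assuming that conversion format; the vector `genes` and the function `flag_date_like` are hypothetical, and the check cannot recover the original symbol.

```r
# Hypothetical gene column as it might look after a round trip through Excel
genes <- c("TP53", "1-Mar", "SEPT8", "2-Sep", "DEC1", "BRCA1", "4-Oct")

# Hypothetical helper: TRUE for entries that look like "<day>-<month abbreviation>"
flag_date_like <- function(x) {
  months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  pattern <- paste0("^[0-9]{1,2}-(", paste(months, collapse = "|"), ")$")
  grepl(pattern, x, ignore.case = TRUE)
}

genes[flag_date_like(genes)]
# [1] "1-Mar" "2-Sep" "4-Oct"
```

Intact symbols such as SEPT8 and DEC1 are not flagged, because the pattern only matches the date-like form.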
All NIH-funded data must be made freely accessible
Data Sharing and Public Access Policies
complete.cases()
Count the number of rows in a data frame that have missing values with
sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
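The same function works on data frames. The sketch below uses a small hypothetical data frame `dF` to count incomplete rows, count missing values per column, and keep only the complete rows.

```r
# Hypothetical data frame with missing values in two different rows
dF <- data.frame(id     = 1:4,
                 height = c(1.70, NA, 1.82, 1.65),
                 weight = c(65, 72, NA, 58))

n_incomplete     <- sum(!complete.cases(dF))   # rows with at least one NA
n_missing_by_col <- colSums(is.na(dF))         # number of NAs in each column
dF_complete      <- dF[complete.cases(dF), ]   # keep the complete rows only

# Restrict the check to selected columns
n_incomplete_height <- sum(!complete.cases(dF[, "height", drop = FALSE]))
```

Here `n_incomplete` is 2, `dF_complete` keeps the rows with `id` 1 and 4, and `n_incomplete_height` is 1.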
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringsAsFactors=FALSE)
- Use the tidyverse package
Base R approach:
# Since R 4.0.0, read.csv() no longer converts strings to factors by default,
# so request it explicitly for the levels() and summary() calls below
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Work on character values, relabel, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
Tidyverse approach:
library(dplyr)

GSS <- GSS %>%
  mutate(tidyLaborStatus = recode(LaborStatus,
    `Temp not working` = "Temporarily not working",
    `Unempl, laid off` = "Unemployed, laid off",
    `Working fulltime` = "Working full time",
    `Working parttime` = "Working part time"))
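In the base R approach above, a mistyped label in a comparison simply matches nothing and no error is raised. One possible alternative, sketched here with hypothetical data, is to relabel the factor levels through a named lookup vector and report any key that matched no level. The names `labor_status`, `relabel`, and `unused_keys` are made up for this example.

```r
# Hypothetical labels, mirroring the GSS example
labor_status <- factor(c("Working fulltime", "Temp not working",
                         "Working parttime", "Retired", "Working fulltime"))

# Named vector: names are the old levels, values are the new labels
relabel <- c("Temp not working" = "Temporarily not working",
             "Unempl, laid off" = "Unemployed, laid off",
             "Working fulltime" = "Working full time",
             "Working parttime" = "Working part time")

old <- levels(labor_status)
hit <- old %in% names(relabel)

# Keys that matched no level: either absent from the data or mistyped
unused_keys <- setdiff(names(relabel), old)

# Relabel only the matched levels; unmatched levels are left unchanged
levels(labor_status)[hit] <- unname(relabel[old[hit]])

levels(labor_status)
# [1] "Retired"                 "Temporarily not working"
# [3] "Working full time"       "Working part time"
```

Here `unused_keys` contains "Unempl, laid off" because that level does not occur in the example data, which is the same signal a typo in a key would produce.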
NIH CBIIT
http://datascience.cancer.gov/
Seminars
NCI Data science webinar series
Reproducibility
Bioinformatics advice I wish I learned 10 years ago from NIH
Project and Data Organization
- Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
- Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
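A project skeleton like the one shown under Project Organization can be created from R. This is a minimal sketch under stated assumptions: it writes into a temporary directory, uses the directory names from the tree, and leaves out renv on the assumption that the renv package creates that directory itself. The variable names are hypothetical.

```r
# Create the skeleton under a temporary directory so nothing is written
# into the working directory; replace `root` with a real path to use it
root <- file.path(tempdir(), "proj")
sub_dirs <- c("dev", "doc", "output", "rmd", "scripts")

for (d in file.path(root, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# One output directory per run, named by ISO 8601 date as in the tree
run_dir <- file.path(root, "output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(run_dir, showWarnings = FALSE)

# Empty placeholder for the project README
file.create(file.path(root, "README.Rmd"))

list.files(root)
```

Dated output directories sort chronologically by name, which is one practical reason for the YYYY-MM-DD format.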
Container
Data Science for Startups: Containers. Building reproducible setups for machine learning