Data science: Difference between revisions

From 太極
= Courses, books =
* [https://www.datascienceatthecommandline.com/ Data Science at the Command Line] by Jeroen Janssens written using [https://bookdown.org/ bookdown].
* [https://stat430.com/ STAT 430: Topics in Applied Statistics] by Dirk Eddelbuettel
* https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
* [http://methods.sagepub.com/Search/Results?products%5b0%5d=17 Data Science, Big Data Analytics, and Digital Methods Videos]. Over 3,200 videos comprising over 120 hours are available.
* [https://www.edx.org/course/the-analytics-edge-2 The Analytics Edge] from edX.org or [http://mooc.org/ MOOC/Massive Open Online Courses].
* [https://www.tellingstorieswithdata.com/ Telling Stories With Data] by Rohan Alexander
* [https://finnstats.com/index.php/2022/02/21/best-data-science-books-for-beginners/ Best Data Science Books For Beginners]


== Python ==
* [https://toptipbio.com/free-datacamp-courses/ 32 Completely FREE DataCamp Courses To Take In 2020]
* How to Get Free DataCamp Subscription For 2 Months? Microsoft is providing a Free DataCamp subscription with Visual Studio Dev Essential Account. You just need to sign up for the account and it's done.
= Machine Learning =
* [https://github.com/dair-ai/ML-YouTube-Courses ML Youtube Courses]
* [https://www.kdnuggets.com/2017/04/10-free-must-read-books-machine-learning-data-science.html 10 Free Must-Read Books for Machine Learning and Data Science]
* [https://github.com/compstat-lmu/lecture_i2ml Introduction to Machine Learning (I2ML)]
* [https://probml.github.io/pml-book/book1.html?s=09 Probabilistic Machine Learning: An Introduction]
* [https://betaandbit.github.io/RML/  The Hitchhiker’s Guide to Responsible Machine Learning]
* [https://stanford-cs329s.github.io/syllabus.html CS 329S: Machine Learning Systems Design] Stanford


= How to prepare data for collaboration =

Revision as of 08:53, 28 March 2022

Courses, books

Python

R

Python vs R

R, Python & Julia in data science : A comparison

Datacamp

Machine Learning

How to prepare data for collaboration

How to share data for collaboration. Page 7 in particular gives guidelines for coding (raw data) variables.

  • naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
  • coding variables: be consistent and avoid spelling errors
  • date and time: use YYYY-MM-DD (ISO 8601 standard). Excel interprets a gene symbol such as "Oct-4" as a date and reformats it.
  • missing data: enter "NA". Do not leave any cells blank.
  • using a code book file (*.docx for example): put any lengthy explanation about variables here. See p5 for an example.
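The naming, date, and missing-data guidelines above can be checked mechanically before a file is shared. The following is a minimal sketch in base R, not part of the cited guide: the helper `bad_names()`, the example table `dat`, and its column names are all hypothetical.

```r
# Hypothetical helper: return column names that break the naming guidelines
# (anything other than letters, digits, and underscores).
bad_names <- function(df) {
  nm <- names(df)
  nm[grepl("[^A-Za-z0-9_]", nm)]
}

# Made-up table; check.names = FALSE keeps the headers exactly as typed.
dat <- data.frame(
  sample_id    = c("s1", "s2", "s3"),
  `visit date` = c("2022-03-28", "03/28/2022", NA),
  `tumor.size` = c(1.2, NA, 3.4),
  check.names  = FALSE
)

# Headers that contain a space or a separator other than an underscore
bad_names(dat)

# Dates that are present but not written as YYYY-MM-DD
visit <- dat[["visit date"]]
is_iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", visit)
bad_dates <- visit[!is.na(visit) & !is_iso]
bad_dates

# Write missing values explicitly as NA instead of leaving cells blank
csv_lines <- capture.output(write.csv(dat, row.names = FALSE, na = "NA"))
csv_lines
```

Here `bad_names(dat)` flags "visit date" and "tumor.size", and `bad_dates` holds the one entry written as "03/28/2022".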

Five types of data:

  • continuous
  • ordinal
  • categorical
  • missing
  • censored

Some extra guidelines from Data organization in spreadsheets (the paper appears in The American Statistician)

  • No empty cells
  • Put one thing in a cell
  • Make a rectangle
  • No calculation in the raw data files
  • Create a data dictionary (same as code book)

Data Organization in Spreadsheets

Data Organization in Spreadsheets Broman & Woo 2018

Gene name errors from Excel

# x: character vector of gene names (not defined in this snippet)
length(x)
# [1] 28109
length(grep("march", x, ignore.case = TRUE))
# [1] 11
length(grep("sep", x, ignore.case = TRUE))
# [1] 24
length(grep("oct", x, ignore.case = TRUE))
# [1] 0
length(grep("dec", x, ignore.case = TRUE))
# [1] 6
grep("sep", x, ignore.case = TRUE, value = TRUE)
 [1] "RNaseP_nuc"             "SEP15"                  "SEPHS1"
 [4] "SEPHS2"                 "SEPN1"                  "SEPP1"
 [7] "SEPSECS"                "SEPT1"                  "SEPT10"
[10] "SEPT11"                 "SEPT12"                 "SEPT14"
[13] "SEPT2"                  "SEPT3"                  "SEPT4"
[16] "SEPT5-GP1BB"            "SEPT6"                  "SEPT7"
[19] "SEPT7P2"                "SEPT7P9"                "SEPT8"
[22] "SEPT9"                  "SEPW1"                  "septin 9/TNRC6C fusion"

# Count the gene names that contain a character other than a letter, digit, or space
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108

# Some cases: 
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "TRA@"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"
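The substring searches above also match names such as "SEPHS1" that Excel would leave alone. A narrower check looks only for symbols shaped like a month name followed by a day number. This is a rough sketch: the pattern approximates Excel's behavior rather than reproducing its parsing rule, and the `genes` vector is a made-up stand-in for `x`.

```r
# Assumption: symbols of the form <month name or abbreviation><1-2 digits>,
# optionally with a hyphen, are the ones at risk of date conversion.
months <- c("JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN", "JUL",
            "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")
date_like <- paste0("^(", paste(months, collapse = "|"), ")-?[0-9]{1,2}$")

# Hypothetical input standing in for the gene name vector x
genes <- c("SEPT2", "MARCH1", "DEC1", "Oct-4", "SEPHS1", "TP53", "SEPT7P2")

# Keep only the symbols that look like dates
at_risk <- grep(date_like, genes, ignore.case = TRUE, value = TRUE)
at_risk
```

With this input the check flags SEPT2, MARCH1, DEC1, and Oct-4, and leaves SEPHS1, TP53, and SEPT7P2 alone.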

All NIH-funded data must be made freely accessible

Data Sharing and Public Access Policies

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
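The same idea applies to a data frame. The sketch below uses a small made-up data frame (the column names are hypothetical) to count incomplete rows, count missing values per column, and keep only the complete rows.

```r
# Made-up data frame with one missing value in each of two rows
dF <- data.frame(id  = 1:4,
                 age = c(34, NA, 51, 47),
                 sex = c("F", "M", NA, "M"))

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(dF))

# Number of missing values in each column
na_per_column <- colSums(is.na(dF))

# Keep only the rows without missing values
dF_complete <- dF[complete.cases(dF), ]

n_incomplete
na_per_column
nrow(dF_complete)
```

Rows 2 and 4 in this example are not both incomplete: rows 2 and 3 each carry one `NA`, so `n_incomplete` is 2 and two rows remain in `dF_complete`.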

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

  • options(stringsAsFactors = FALSE)
  • Use the tidyverse package

Base R approach:

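# stringsAsFactors = TRUE keeps LaborStatus a factor (the default became FALSE in R 4.0.0)
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)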
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

library(dplyr)  # provides %>%, mutate() and recode()

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime` = "Working part time"))
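A third option, not taken from the preprint, is to rename the factor levels directly with a named lookup vector, which avoids the round trip through character. This is a minimal base R sketch; the `labor` vector is made-up data standing in for `GSS$LaborStatus`.

```r
# Made-up values standing in for GSS$LaborStatus
labor <- factor(c("Working fulltime", "Temp not working", "Retired",
                  "Working parttime", "Unempl, laid off", "Working fulltime"))

# Named vector: names are the old labels, values are the new labels
lookup <- c("Temp not working" = "Temporarily not working",
            "Unempl, laid off" = "Unemployed, laid off",
            "Working fulltime" = "Working full time",
            "Working parttime" = "Working part time")

# Rename only the levels that appear in the lookup; others stay unchanged
old <- levels(labor)
hit <- old %in% names(lookup)
levels(labor)[hit] <- unname(lookup[old[hit]])

levels(labor)
```

Levels missing from the lookup, such as "Retired" here, pass through untouched, so a misspelled old label shows up as an unrenamed level instead of failing silently into `NA`.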

NIH CBIIT

http://datascience.cancer.gov/

Seminars

NCI Data science webinar series

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
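The project layout shown above, with its date-stamped output folders, can be set up from R. The sketch below is an assumption-laden illustration: it builds the skeleton under a temporary directory, and it leaves out `renv`, which is normally created by `renv::init()` instead of by hand.

```r
# Hypothetical location; replace tempdir() with the real parent folder
root <- file.path(tempdir(), "proj")

# Top-level folders from the project tree (renv is left to renv::init())
dirs <- c("dev", "doc", "output", "rmd", "scripts")
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# One output folder per run date, named YYYY-MM-DD as in the tree
run_dir <- file.path(root, "output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(run_dir, showWarnings = FALSE)

list.files(root)
```

Naming the output folders in ISO 8601 form means they sort chronologically in any file listing.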

Container

Data Science for Startups: Containers Building reproducible setups for machine learning

Big data

Hadoop

Spark