Data science

From 太極
Jump to navigation Jump to search

Courses, books

Python

R

Python vs R

R, Python & Julia in data science : A comparison

Datacamp

Machine Learning

How to prepare data for collaboration

How to share data for collaboration. Especially Page 7 has some (raw data) variable coding guidelines.

  • naming variables: using meaning variable names, no spacing in column header, avoiding separator (except an underscore)
  • coding variables: be consistent, no spelling error
  • date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
  • missing data: "NA". Not leave any cells blank.
  • using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.

Five types of data:

  • continuous
  • oridinal
  • categorical
  • missing
  • censored

Some extra from Data organization in spreadsheets (the paper appears in American Statistician)

  • No empty cells
  • Put one thing in a cell
  • Make a rectangle
  • No calculation in the raw data files
  • Create a data dictionary (same as code book)

Data Organization in Spreadsheets

Data Organization in Spreadsheets Broman & Woo 2018

Gene name errors from Excel

length(x)
# [1] 28109
length(grep("march", x, ignore.case=T))
# [1] 11
length(grep("sep", x, ignore.case=T))
# [1] 24
length(grep("oct", x, ignore.case=T))
# [1] 0
length(grep("dec", x, ignore.case=T))
# [1] 6
grep("sep", x, ignore.case=T, value=T)
 [1] "RNaseP_nuc"             "SEP15"                  "SEPHS1"
 [4] "SEPHS2"                 "SEPN1"                  "SEPP1"
 [7] "SEPSECS"                "SEPT1"                  "SEPT10"
[10] "SEPT11"                 "SEPT12"                 "SEPT14"
[13] "SEPT2"                  "SEPT3"                  "SEPT4"
[16] "SEPT5-GP1BB"            "SEPT6"                  "SEPT7"
[19] "SEPT7P2"                "SEPT7P9"                "SEPT8"
[22] "SEPT9"                  "SEPW1"                  "septin 9/TNRC6C fusion"

# Count non-alphanumeric symbols from a string
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108

# Some cases: 
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "[email protected]"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"

All NIH-funded data must be made freely accessible

Data Sharing and Public Access Policies

Public online data

Systematic Review of Supervised Machine Learning Models in Prediction of Medical Conditions 2022

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

  • options(stringAsFactors=FALSE)
  • Use the tidyverse package

Base R approach:

GSS <- read.csv("XXX.csv")
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime ` = "Working part time"))

NIH CBIIT

http://datascience.cancer.gov/

Seminars

NCI Data science webinar series

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA

Container

Data Science for Startups: Containers Building reproducible setups for machine learning

Big data

Hadoop

Spark