Courses, books

Data Science at the Command Line by Jeroen Janssens written using bookdown.
STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
STAT 545 Data wrangling, exploration, and analysis with R, Jenny Bryan
https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
http://datasciencelabs.github.io/2016/ Harvard SPH
https://cs109.github.io/2014 Harvard CS
Biostat 203B: Introduction to Data Science
Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.
The Analytics Edge from edX.org or MOOC/Massive Open Online Courses.
Telling Stories With Data by Rohan Alexander
Best Data Science Books For Beginners

Python

Python Data Science Handbook: Essential Tools for Working with Data
Getting started with data science using Python from opensource.com

R

Coursera -> Data Science Specialization by JHS.
Data science in a box
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
R for Data Science
The Art of Data Science Roger D. Peng and Elizabeth Matsui
20 Free Online Books to Learn R and Data Science
Teaching resources by Irizarry. edx.org. Audit is free.
- Introduction to Data Science
- Data Analysis for the Life Sciences
- Genomics Data Analysis
Introduction to Data Science or on github Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
Why Is It Called That Way?! – Origin and Meaning of R Package Names.
- dplyr
- lubridate
- ggplot2
- data.table
- tibble
- purrr
- amelia
- magrittr
- batman
- Homeric
- fcuk
- hellno

Python vs R

R, Python & Julia in data science : A comparison

Datacamp

32 Completely FREE DataCamp Courses To Take In 2020
How to Get Free DataCamp Subscription For 2 Months? Microsoft is providing a Free DataCamp subscription with Visual Studio Dev Essential Account. You just need to sign up for the account and its done.

Machine Learning

How to prepare data for collaboration

How to share data for collaboration. Especially Page 7 has some (raw data) variable coding guidelines.

naming variables: using meaning variable names, no spacing in column header, avoiding separator (except an underscore)
coding variables: be consistent, no spelling error
date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
missing data: "NA". Not leave any cells blank.
using a code book file (*.docx for example): any lengthy explanation about variables should be put here. See p5 for an example.

Five types of data:

continuous
oridinal
categorical
missing
censored

Some extra from Data organization in spreadsheets (the paper appears in American Statistician)

No empty cells
Put one thing in a cell
Make a rectangle
No calculation in the raw data files
Create a data dictionary (same as code book)

Data Organization in Spreadsheets

Data Organization in Spreadsheets Broman & Woo 2018

Gene name errors from Excel

Gene name errors: Lessons not learned.
- HGNChelper: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
- Some examples: MARCH3, SEPT8, OCT4, DEC1.
Gene names, data corruption and Excel: a 2021 update

length(x)
# [1] 28109
length(grep("march", x, ignore.case=T))
# [1] 11
length(grep("sep", x, ignore.case=T))
# [1] 24
length(grep("oct", x, ignore.case=T))
# [1] 0
length(grep("dec", x, ignore.case=T))
# [1] 6
grep("sep", x, ignore.case=T, value=T)
 [1] "RNaseP_nuc"             "SEP15"                  "SEPHS1"
 [4] "SEPHS2"                 "SEPN1"                  "SEPP1"
 [7] "SEPSECS"                "SEPT1"                  "SEPT10"
[10] "SEPT11"                 "SEPT12"                 "SEPT14"
[13] "SEPT2"                  "SEPT3"                  "SEPT4"
[16] "SEPT5-GP1BB"            "SEPT6"                  "SEPT7"
[19] "SEPT7P2"                "SEPT7P9"                "SEPT8"
[22] "SEPT9"                  "SEPW1"                  "septin 9/TNRC6C fusion"

# Count non-alphanumeric symbols from a string
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108

# Some cases: 
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "TRA@"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"

A real example:

> data.frame(GENEID[i, 1], pull(xcsv, 1)[i], row.names = NULL)
   GENEID.i..1. pull.xcsv..1..i.
1         1-Dec             DEC1
2         1-Mar            MARC1
3         2-Mar            MARC2
4         1-Mar           MARCH1
5        10-Mar          MARCH10
6        11-Mar          MARCH11
7         2-Mar           MARCH2
8         3-Mar           MARCH3
9         4-Mar           MARCH4
10        5-Mar           MARCH5
11        6-Mar           MARCH6
12        7-Mar           MARCH7
13        8-Mar           MARCH8
14        9-Mar           MARCH9
15       15-Sep            SEP15
16        1-Sep            SEPT1
17       10-Sep           SEPT10
18       11-Sep           SEPT11
19       12-Sep           SEPT12
20       14-Sep           SEPT14
21        2-Sep            SEPT2
22        3-Sep            SEPT3
23        4-Sep            SEPT4
24        6-Sep            SEPT6
25        7-Sep            SEPT7
26        8-Sep            SEPT8
27        9-Sep            SEPT9

Also it is possible the gene names start with a numeric number.

> grep("^[0-9]", pull(xcsv, 1), value = TRUE)
[1] "5S_rRNA"   "5_8S_rRNA" "6M1-18"    "7M1-2"     "7SK"

Check using the R package

> library(HGNChelper)
> GENEID[grep("^[0-9]", GENEID[,1]), 1] |> checkGeneSymbols()
Maps last updated on: Thu Oct 24 12:31:05 2019
           x Approved    Suggested.Symbol
1    5S_rRNA    FALSE                <NA>
2  5_8S_rRNA    FALSE                <NA>
3     6M1-18    FALSE                <NA>
4      7M1-2    FALSE                <NA>
5        7SK    FALSE               RN7SK
6      1-Dec    FALSE  BHLHE40 /// DELEC1
7      1-Mar    FALSE  MTARC1 /// MARCHF1
8      2-Mar    FALSE  MTARC2 /// MARCHF2
9      1-Mar    FALSE  MTARC1 /// MARCHF1
10    10-Mar    FALSE            MARCHF10
11    11-Mar    FALSE            MARCHF11
12     2-Mar    FALSE  MTARC2 /// MARCHF2
13     3-Mar    FALSE             MARCHF3
14     4-Mar    FALSE             MARCHF4
15     5-Mar    FALSE             MARCHF5
16     6-Mar    FALSE             MARCHF6
17     7-Mar    FALSE             MARCHF7
18     8-Mar    FALSE             MARCHF8
19     9-Mar    FALSE             MARCHF9
20    15-Sep    FALSE             SELENOF
21     1-Sep    FALSE             SEPTIN1
22    10-Sep    FALSE            SEPTIN10
23    11-Sep    FALSE            SEPTIN11
24    12-Sep    FALSE            SEPTIN12
25    14-Sep    FALSE            SEPTIN14
26     2-Sep    FALSE SEPTIN2 /// SEPTIN6
27     3-Sep    FALSE             SEPTIN3
28     4-Sep    FALSE             SEPTIN4
29     6-Sep    FALSE             SEPTIN6
30     7-Sep    FALSE             SEPTIN7
31     8-Sep    FALSE             SEPTIN8
32     9-Sep    FALSE             SEPTIN9

All NIH-funded data must be made freely accessible

Data Sharing and Public Access Policies

Public online data

Systematic Review of Supervised Machine Learning Models in Prediction of Medical Conditions 2022

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))

> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

options(stringAsFactors=FALSE)
Use the tidyverse package

Base R approach:

GSS <- read.csv("XXX.csv")
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime ` = "Working part time"))

NIH CBIIT

http://datascience.cancer.gov/

Seminars

NCI Data science webinar series

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Bioinformatics advice I wish I learned 10 years ago.

Project Organization

proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts

Data Organization

data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA

Container

Data Science for Startups: Containers Building reproducible setups for machine learning

Data science