
Data science: Difference between revisions

From 太極
Brb (talk | contribs)
 
(56 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Courses, books =
* [https://www.datascienceatthecommandline.com/ Data Science at the Command Line] by Jeroen Janssens written using [https://bookdown.org/ bookdown].
* [https://stat430.com/ STAT 430: Topics in Applied Statistics] by Dirk Eddelbuettel
* [https://leanpub.com/datasciencebook Introduction to Data Science] Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
* [https://stat545.com/index.html STAT 545] Data wrangling, exploration, and analysis with R, Jenny Bryan
* https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
* http://datasciencelabs.github.io/2016/ Harvard SPH
* https://cs109.github.io/2014 Harvard CS
* [http://hua-zhou.github.io/teaching/biostatm280-2019winter/schedule.html Biostat 203B: Introduction to Data Science]
* [http://methods.sagepub.com/Search/Results?products%5b0%5d=17 Data Science, Big Data Analytics, and Digital Methods Videos]. Over 3,200 videos comprising over 120 hours are available.
* [https://www.edx.org/course/the-analytics-edge-2 The Analytics Edge] from edX.org or [http://mooc.org/ MOOC/Massive Open Online Courses].
* [https://www.tellingstorieswithdata.com/ Telling Stories With Data] by Rohan Alexander
* [https://finnstats.com/index.php/2022/02/21/best-data-science-books-for-beginners/ Best Data Science Books For Beginners]
* [https://nrennie.rbind.io/data-science-resources/ Data Science Resources] - A curated collection of useful, freely-available data science and visualisation resources by Nicola Rennie.
 
== Debian ==
https://wiki.debian.org/DebianScience
 
== Python ==
* [https://www.amazon.com/Python-Data-Science-Handbook-Essential/dp/1491912057 Python Data Science Handbook: Essential Tools for Working with Data]
* [https://opensource.com/article/19/9/get-started-data-science-python Getting started with data science using Python] from opensource.com
 
== R ==
* [https://www.coursera.org/specializations/jhu-data-science Coursera -> Data Science Specialization] by Johns Hopkins University.
* [https://datasciencebox.org/ Data science in a box]
* [http://faculty.marshall.usc.edu/gareth-james/ISL/ An Introduction to Statistical Learning with Applications in R] by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
* [https://r4ds.had.co.nz/  R for Data Science]
* [https://bookdown.org/rdpeng/artofdatascience/ The Art of Data Science] Roger D. Peng and Elizabeth Matsui
* [http://cmdlinetips.com/2018/01/free-online-resources-books-to-learn-r-and-data-science/ 20 Free Online Books to Learn R and Data Science]
* [https://rafalab.github.io/pages/teaching.html Teaching resources] by Irizarry. edx.org. Audit is free.
** Introduction to Data Science
** Data Analysis for the Life Sciences
** Genomics Data Analysis
* [https://leanpub.com/datasciencebook Introduction to Data Science] or on [https://rafalab.github.io/dsbook/ github] Data Analysis and Prediction Algorithms with R, by Rafael A Irizarry (free)
* [https://www.statworx.com/de/blog/why-is-it-called-that-way-origin-and-meaning-of-r-package-names/ Why Is It Called That Way?! – Origin and Meaning of R Package Names].
** dplyr
** lubridate
** ggplot2
** data.table
** tibble
** purrr
** amelia
** magrittr
** batman
** Homeric
** fcuk
** hellno
* [https://www.r-bloggers.com/2024/08/top-25-r-packages-you-need-to-learn-in-2024/ Top 25 R Packages (You Need To Learn In 2024)]
** janitor to clean column names,
** skimr: quick data summarization
** bslib: next-gen UI for shiny apps
** box: modularize R scripts
** data.table & tidytable: high-performance data manipulation
** renv: reproducibility made easy
** targets: pipeline management for reproducible workflows
** naniar: visualize missing data
** mlr3: advanced machine learning. Ebook: [https://mlr3book.mlr-org.com/ Applied Machine Learning Using mlr3 in R].
** gt: making professional tables
** GWalkR: tableau-like visualizations in R
** torch: Deep learning in R
** Plumber: build APIs in R
** Vetiver: model deployment in R and Python
** fs: efficient file system operations
** correlationfunnel: turn correlations into insights
** clock: super-powered date and time handling
** furrr: parallelized iterative processing
** patchwork: combine multiple plots
** echarts4r: interactive visualizations
** officer: generate microsoft office documents
** golem: production-grade shiny app
** rhino: fullstack shiny development
** ROI: R optimization infrastructure
** mapgl: next-level mapping with Mapbox GL and MapLibre GL
 
== Python vs R ==
[https://www.eoda.de/en/wissen/blog/r-python-julia-in-data-science-ein-vergleich R, Python & Julia in data science : A comparison]
 
== Datacamp ==
* [https://toptipbio.com/free-datacamp-courses/ 32 Completely FREE DataCamp Courses To Take In 2020]
* How to get a free DataCamp subscription for 2 months: Microsoft provides a free DataCamp subscription with a Visual Studio Dev Essentials account. You just need to sign up for the account.
 
= Machine Learning =
* [https://github.com/dair-ai/ML-YouTube-Courses ML Youtube Courses]
* [https://www.kdnuggets.com/2017/04/10-free-must-read-books-machine-learning-data-science.html 10 Free Must-Read Books for Machine Learning and Data Science]
* [https://github.com/compstat-lmu/lecture_i2ml Introduction to Machine Learning (I2ML)]
* [https://probml.github.io/pml-book/book1.html?s=09 Probabilistic Machine Learning: An Introduction]
* [https://betaandbit.github.io/RML/  The Hitchhiker’s Guide to Responsible Machine Learning]
* [https://stanford-cs329s.github.io/syllabus.html CS 329S: Machine Learning Systems Design] Stanford
* [https://twitter.com/Richard_D_Riley/status/1580907524634681347 Stability of Clinical Prediction Models Developed Using Statistical or Machine Learning Approaches], [https://youtu.be/-zRyEbhjcMo video]
 
== Top Machine Learning Algorithms ==
[https://s3.amazonaws.com/assets.datacamp.com/email/other/ML+Cheat+Sheet_2.pdf Top Machine Learning Algorithms] with pros and cons.
 
== 20 Cutting-Edge Statistical Techniques ==
[https://freedium.cfd/https://medium.com/@thedatabeast/20-cutting-edge-statistical-techniques-every-data-scientist-should-master-in-2025-4fbcef24b373 20 Cutting-Edge Statistical Techniques Every Data Scientist Should Master in 2025]


= How to prepare data for collaboration =
Line 24: Line 115:
* No calculation in the raw data files
* Create a '''data dictionary''' (same as '''code book''')
== Data Organization in Spreadsheets ==
[https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 Data Organization in Spreadsheets] Broman & Woo 2018
== Paper naming ==
For example, '''FirstAuthorLastName_etal_ShortDescription_PublicationYear_JournalAbbrev.pdf'''.
== Gene name errors from Excel ==
<ul>
<li>[https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984 Gene name errors: Lessons not learned].
* [https://cran.r-project.org/web/packages/HGNChelper/index.html HGNChelper]: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
* Some examples: MARCH3, SEPT8, OCT4, DEC1.
<li>To avoid the problem, import the file into Excel by going to Data > From Text. After choosing the file to upload, pick '''Delimited''' under the file type, select Comma as the delimiter, and click Next. In the final step, click on the column with the gene names, and select '''Text''' under “Column data format.” Click Finish.
<li>[https://nsaunders.wordpress.com/2021/08/03/gene-names-data-corruption-and-excel-a-2021-update Gene names, data corruption and Excel: a 2021 update]
<syntaxhighlight lang='splus'>
length(x)
# [1] 28109
length(grep("march", x, ignore.case=T))
# [1] 11
length(grep("sep", x, ignore.case=T))
# [1] 24
length(grep("oct", x, ignore.case=T))
# [1] 0
length(grep("dec", x, ignore.case=T))
# [1] 6
grep("sep", x, ignore.case=T, value=T)
[1] "RNaseP_nuc"            "SEP15"                  "SEPHS1"
[4] "SEPHS2"                "SEPN1"                  "SEPP1"
[7] "SEPSECS"                "SEPT1"                  "SEPT10"
[10] "SEPT11"                "SEPT12"                "SEPT14"
[13] "SEPT2"                  "SEPT3"                  "SEPT4"
[16] "SEPT5-GP1BB"            "SEPT6"                  "SEPT7"
[19] "SEPT7P2"                "SEPT7P9"                "SEPT8"
[22] "SEPT9"                  "SEPW1"                  "septin 9/TNRC6C fusion"
# Count non-alphanumeric symbols from a string
ind <- grep("[^[:alnum:] ]", x)
length(ind)
# [1] 1108
# Some cases:
# "5S_rRNA"
# "HGC6.1.1"
# "Ig alpha 1-[alpha]2m"
# "T-cell receptor alpha chain variable ..."
# "TRA@"
# "TRNA_Ala"
# "TTN-AS1"
# "aromatase cytochrome P-450 (P-450AROM)"
# "immunoglobulin epsilon chain constant..."
# "septin 9/TNRC6C fusion"
</syntaxhighlight>
A real example:
<pre>
> data.frame(GENEID[i, 1], pull(xcsv, 1)[i], row.names = NULL)
  GENEID.i..1. pull.xcsv..1..i.
1        1-Dec            DEC1
2        1-Mar            MARC1
3        2-Mar            MARC2
4        1-Mar          MARCH1
5        10-Mar          MARCH10
6        11-Mar          MARCH11
7        2-Mar          MARCH2
8        3-Mar          MARCH3
9        4-Mar          MARCH4
10        5-Mar          MARCH5
11        6-Mar          MARCH6
12        7-Mar          MARCH7
13        8-Mar          MARCH8
14        9-Mar          MARCH9
15      15-Sep            SEP15
16        1-Sep            SEPT1
17      10-Sep          SEPT10
18      11-Sep          SEPT11
19      12-Sep          SEPT12
20      14-Sep          SEPT14
21        2-Sep            SEPT2
22        3-Sep            SEPT3
23        4-Sep            SEPT4
24        6-Sep            SEPT6
25        7-Sep            SEPT7
26        8-Sep            SEPT8
27        9-Sep            SEPT9
</pre>
Gene names may also start with a digit.
<pre>
> grep("^[0-9]", pull(xcsv, 1), value = TRUE)
[1] "5S_rRNA"  "5_8S_rRNA" "6M1-18"    "7M1-2"    "7SK" 
</pre>
Check the symbols with the HGNChelper R package:
<pre>
> library(HGNChelper)
> GENEID[grep("^[0-9]", GENEID[,1]), 1] |> checkGeneSymbols()
Maps last updated on: Thu Oct 24 12:31:05 2019
          x Approved    Suggested.Symbol
1    5S_rRNA    FALSE                <NA>
2  5_8S_rRNA    FALSE                <NA>
3    6M1-18    FALSE                <NA>
4      7M1-2    FALSE                <NA>
5        7SK    FALSE              RN7SK
6      1-Dec    FALSE  BHLHE40 /// DELEC1
7      1-Mar    FALSE  MTARC1 /// MARCHF1
8      2-Mar    FALSE  MTARC2 /// MARCHF2
9      1-Mar    FALSE  MTARC1 /// MARCHF1
10    10-Mar    FALSE            MARCHF10
11    11-Mar    FALSE            MARCHF11
12    2-Mar    FALSE  MTARC2 /// MARCHF2
13    3-Mar    FALSE            MARCHF3
14    4-Mar    FALSE            MARCHF4
15    5-Mar    FALSE            MARCHF5
16    6-Mar    FALSE            MARCHF6
17    7-Mar    FALSE            MARCHF7
18    8-Mar    FALSE            MARCHF8
19    9-Mar    FALSE            MARCHF9
20    15-Sep    FALSE            SELENOF
21    1-Sep    FALSE            SEPTIN1
22    10-Sep    FALSE            SEPTIN10
23    11-Sep    FALSE            SEPTIN11
24    12-Sep    FALSE            SEPTIN12
25    14-Sep    FALSE            SEPTIN14
26    2-Sep    FALSE SEPTIN2 /// SEPTIN6
27    3-Sep    FALSE            SEPTIN3
28    4-Sep    FALSE            SEPTIN4
29    6-Sep    FALSE            SEPTIN6
30    7-Sep    FALSE            SEPTIN7
31    8-Sep    FALSE            SEPTIN8
32    9-Sep    FALSE            SEPTIN9
</pre>
</ul>
== All NIH-funded data must be made freely accessible ==
[https://datascience.cancer.gov/data-sharing/policies Data Sharing and Public Access Policies]
= Public online data =
[https://www.medrxiv.org/content/10.1101/2022.04.22.22274183v1 Systematic Review of Supervised Machine Learning Models in Prediction of Medical Conditions] 2022


= [https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/complete.cases complete.cases()] =
Line 41: Line 269:
[1]  TRUE FALSE  TRUE
[1]  TRUE FALSE  TRUE
</pre>
</pre>
= NICAR data journalism conferences =
[https://www.machlis.com/nicar/ Resources from NICAR data journalism conferences]


= Wrangling categorical data in R =
Line 78: Line 309:
http://datascience.cancer.gov/


= Container =
== Seminars ==
[https://towardsdatascience.com/data-science-for-startups-containers-d1d785bfe5b Data Science for Startups: Containers]
[https://www.youtube.com/playlist?app=desktop&list=PLFAF53BE7B120386E&s=09 NCI Data science webinar series]
 
= Reproducibility =
[https://github.com/nih-byob/presentations/tree/master/2019/01_bioinformatics_tips Bioinformatics advice I wish I learned 10 years ago] from NIH
 
== Project and Data Organization ==
* [https://github.com/nih-byob/presentations/tree/master/2019/01_bioinformatics_tips Bioinformatics advice I wish I learned 10 years ago].
: Project Organization
: <syntaxhighlight lang='bash'>
proj
├── dev
│  ├── clustering.Rmd
│  └── dim_reduce.Rmd
├── doc
├── output
│  ├── 2019-05-10
│  ├── 2019-05-19
│  └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
</syntaxhighlight>
: Data Organization
: <syntaxhighlight lang='bash'>
data
├── annotations
│  ├── clue_drug_repurposing_hub
│  │  ├── repurposing_drugs_20180907.txt
│  │  └── repurposing_samples_20180907.txt
│  └── ...
├── containers
│  └── singularity
│      └── sclc-george2015
├── projects
│  ├── nih
│  │  ├── mm-feature-selection
│  │  ├── mm-p3-variants
│  │  └── sclc-doe
├── public
│  └── human
│      ├── array_express
│      ├── geo
│      │  └── GSE6477
│      │      ├── processed
│      │      │  ├── GSE6477_expr.csv
│      │      │  └── sample_metadata.csv
│      │      └── raw
│      │          ├── GPL96.soft
│      │          └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
</syntaxhighlight>
 
== Container ==
[https://towardsdatascience.com/data-science-for-startups-containers-d1d785bfe5b Data Science for Startups: Containers] Building reproducible setups for machine learning
 
= Big data =
== Hadoop ==
== Spark ==
* [https://opensource.com/article/19/5/visualize-log-data-apache-spark How to analyze log data with Python and Apache Spark]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3087-8 PyBDA: a command line tool for automated analysis of big biological data sets]
* [https://arxiv.org/pdf/1912.11144.pdf Parallel Computing With R: A Brief Review] by Dirk Eddelbuettel
* [https://blog.rstudio.com/2020/01/29/sparklyr-1-1/ sparklyr 1.1: Foundations, Books, Lakes and Barriers]
 
= Edge, fog computing =
[https://www.makeuseof.com/what-is-fog-computing-fog-vs-edge-computing-explained/ What Is Fog Computing? Fog vs. Edge Computing Explained]

Latest revision as of 18:26, 9 August 2025

Courses, books

Debian

https://wiki.debian.org/DebianScience

Python

R

Python vs R

R, Python & Julia in data science : A comparison

Datacamp

Machine Learning

Top Machine Learning Algorithms

Top Machine Learning Algorithms with pros and cons.

20 Cutting-Edge Statistical Techniques

20 Cutting-Edge Statistical Techniques Every Data Scientist Should Master in 2025

How to prepare data for collaboration

How to share data for collaboration. Page 7 in particular has some variable coding guidelines for raw data.

  • naming variables: use meaningful variable names, no spaces in column headers, and avoid separators (except an underscore)
  • coding variables: be consistent and avoid spelling errors
  • date and time: YYYY-MM-DD (ISO 8601 standard). A gene symbol "Oct-4" will be interpreted as a date and reformatted in Excel.
  • missing data: use "NA". Do not leave any cells blank.
  • using a code book file (*.docx for example): put any lengthy explanation about variables here. See p5 for an example.
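
The guidelines above can be sketched in a few lines of base R. This is only an illustration: the data frame, its column names, and its values are made up, and the US-style input date format is an assumption.

```r
# Hypothetical raw data with problematic headers, dates, and a blank cell
raw <- data.frame(
  "Sample ID"   = c("s1", "s2", "s3"),
  "Visit.Date"  = c("03/15/2021", "04/01/2021", "12/24/2021"),
  "Body Weight" = c("70.5", "", "82.1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Naming variables: lower case, with an underscore as the only separator
names(raw) <- tolower(gsub("[^[:alnum:]]+", "_", names(raw)))

# Date and time: parse the (assumed) month/day/year input and store YYYY-MM-DD
raw$visit_date <- format(as.Date(raw$visit_date, format = "%m/%d/%Y"), "%Y-%m-%d")

# Missing data: turn blank cells into NA instead of leaving them empty
raw$body_weight[raw$body_weight == ""] <- NA
raw$body_weight <- as.numeric(raw$body_weight)

# write.csv() writes missing values as NA by default
out_file <- tempfile(fileext = ".csv")
write.csv(raw, out_file, row.names = FALSE)
```

Lengthy explanations of each column would still go into the code book rather than into the header itself.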

Five types of data:

  • continuous
  • ordinal
  • categorical
  • missing
  • censored

Some extra tips from Data Organization in Spreadsheets (the paper appears in The American Statistician):

  • No empty cells
  • Put one thing in a cell
  • Make a rectangle
  • No calculation in the raw data files
  • Create a data dictionary (same as code book)

Data Organization in Spreadsheets

Data Organization in Spreadsheets Broman & Woo 2018

Paper naming

For example, FirstAuthorLastName_etal_ShortDescription_PublicationYear_JournalAbbrev.pdf.

Gene name errors from Excel

  • Gene name errors: Lessons not learned.
    • HGNChelper: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
    • Some examples: MARCH3, SEPT8, OCT4, DEC1.
  • To avoid the problem, import the file into Excel by going to Data > From Text. After choosing the file to upload, pick Delimited under the file type, select Comma as the delimiter, and click Next. In the final step, click on the column with the gene names, and select Text under “Column data format.” Click Finish.
  • Gene names, data corruption and Excel: a 2021 update
    length(x)
    # [1] 28109
    length(grep("march", x, ignore.case=T))
    # [1] 11
    length(grep("sep", x, ignore.case=T))
    # [1] 24
    length(grep("oct", x, ignore.case=T))
    # [1] 0
    length(grep("dec", x, ignore.case=T))
    # [1] 6
    grep("sep", x, ignore.case=T, value=T)
     [1] "RNaseP_nuc"             "SEP15"                  "SEPHS1"
     [4] "SEPHS2"                 "SEPN1"                  "SEPP1"
     [7] "SEPSECS"                "SEPT1"                  "SEPT10"
    [10] "SEPT11"                 "SEPT12"                 "SEPT14"
    [13] "SEPT2"                  "SEPT3"                  "SEPT4"
    [16] "SEPT5-GP1BB"            "SEPT6"                  "SEPT7"
    [19] "SEPT7P2"                "SEPT7P9"                "SEPT8"
    [22] "SEPT9"                  "SEPW1"                  "septin 9/TNRC6C fusion"
    
    # Count non-alphanumeric symbols from a string
    ind <- grep("[^[:alnum:] ]", x)
    length(ind)
    # [1] 1108
    
    # Some cases: 
    # "5S_rRNA"
    # "HGC6.1.1"
    # "Ig alpha 1-[alpha]2m"
    # "T-cell receptor alpha chain variable ..."
    # "TRA@"
    # "TRNA_Ala"
    # "TTN-AS1"
    # "aromatase cytochrome P-450 (P-450AROM)"
    # "immunoglobulin epsilon chain constant..."
    # "septin 9/TNRC6C fusion"

    A real example:

    > data.frame(GENEID[i, 1], pull(xcsv, 1)[i], row.names = NULL)
       GENEID.i..1. pull.xcsv..1..i.
    1         1-Dec             DEC1
    2         1-Mar            MARC1
    3         2-Mar            MARC2
    4         1-Mar           MARCH1
    5        10-Mar          MARCH10
    6        11-Mar          MARCH11
    7         2-Mar           MARCH2
    8         3-Mar           MARCH3
    9         4-Mar           MARCH4
    10        5-Mar           MARCH5
    11        6-Mar           MARCH6
    12        7-Mar           MARCH7
    13        8-Mar           MARCH8
    14        9-Mar           MARCH9
    15       15-Sep            SEP15
    16        1-Sep            SEPT1
    17       10-Sep           SEPT10
    18       11-Sep           SEPT11
    19       12-Sep           SEPT12
    20       14-Sep           SEPT14
    21        2-Sep            SEPT2
    22        3-Sep            SEPT3
    23        4-Sep            SEPT4
    24        6-Sep            SEPT6
    25        7-Sep            SEPT7
    26        8-Sep            SEPT8
    27        9-Sep            SEPT9
    

    Gene names may also start with a digit.

    > grep("^[0-9]", pull(xcsv, 1), value = TRUE)
    [1] "5S_rRNA"   "5_8S_rRNA" "6M1-18"    "7M1-2"     "7SK"   
    

    Check the symbols with the HGNChelper R package:

    > library(HGNChelper)
    > GENEID[grep("^[0-9]", GENEID[,1]), 1] |> checkGeneSymbols()
    Maps last updated on: Thu Oct 24 12:31:05 2019
               x Approved    Suggested.Symbol
    1    5S_rRNA    FALSE                <NA>
    2  5_8S_rRNA    FALSE                <NA>
    3     6M1-18    FALSE                <NA>
    4      7M1-2    FALSE                <NA>
    5        7SK    FALSE               RN7SK
    6      1-Dec    FALSE  BHLHE40 /// DELEC1
    7      1-Mar    FALSE  MTARC1 /// MARCHF1
    8      2-Mar    FALSE  MTARC2 /// MARCHF2
    9      1-Mar    FALSE  MTARC1 /// MARCHF1
    10    10-Mar    FALSE            MARCHF10
    11    11-Mar    FALSE            MARCHF11
    12     2-Mar    FALSE  MTARC2 /// MARCHF2
    13     3-Mar    FALSE             MARCHF3
    14     4-Mar    FALSE             MARCHF4
    15     5-Mar    FALSE             MARCHF5
    16     6-Mar    FALSE             MARCHF6
    17     7-Mar    FALSE             MARCHF7
    18     8-Mar    FALSE             MARCHF8
    19     9-Mar    FALSE             MARCHF9
    20    15-Sep    FALSE             SELENOF
    21     1-Sep    FALSE             SEPTIN1
    22    10-Sep    FALSE            SEPTIN10
    23    11-Sep    FALSE            SEPTIN11
    24    12-Sep    FALSE            SEPTIN12
    25    14-Sep    FALSE            SEPTIN14
    26     2-Sep    FALSE SEPTIN2 /// SEPTIN6
    27     3-Sep    FALSE             SEPTIN3
    28     4-Sep    FALSE             SEPTIN4
    29     6-Sep    FALSE             SEPTIN6
    30     7-Sep    FALSE             SEPTIN7
    31     8-Sep    FALSE             SEPTIN8
    32     9-Sep    FALSE             SEPTIN9
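
One rough way to flag symbols that a spreadsheet has already turned into dates is a regular expression over the ID column. The sketch below uses a made-up vector of gene IDs; the pattern only covers the "day-month" form seen in the examples above, so other date renderings would need their own pattern.

```r
# Hypothetical gene ID column as exported from a spreadsheet
gene_id <- c("TP53", "1-Dec", "2-Mar", "SEPT9", "10-Sep", "7SK", "BRCA1")

# Pattern for Excel-style "day-month" strings such as "1-Mar" or "10-Sep";
# month.abb is the built-in vector of abbreviated English month names
months <- paste(month.abb, collapse = "|")
date_like <- grepl(paste0("^[0-9]{1,2}-(", months, ")$"), gene_id)

gene_id[date_like]
sum(date_like)

# When reading a CSV into R, colClasses = "character" guarantees that
# every column stays text and no type conversion is attempted
csv_file <- tempfile(fileext = ".csv")
writeLines(c("gene,value", "MARCH1,1.2", "SEPT2,3.4"), csv_file)
genes <- read.csv(csv_file, colClasses = "character")
genes$gene
```

Flagged entries can then be passed to checkGeneSymbols() as shown above to look up suggested replacements.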
    

All NIH-funded data must be made freely accessible

Data Sharing and Public Access Policies

Public online data

Systematic Review of Supervised Machine Learning Models in Prediction of Medical Conditions 2022

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
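
The same idea applies to a data frame. A minimal sketch with a made-up data frame (the column names are hypothetical), showing how to count, drop, and selectively check incomplete rows:

```r
# Hypothetical data frame with missing values in two different rows
dF <- data.frame(
  id    = 1:4,
  age   = c(34, NA, 51, 28),
  score = c(0.8, 0.5, NA, 0.9)
)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(dF))

# Keep only the complete rows (the same rows that na.omit(dF) keeps)
dF_complete <- dF[complete.cases(dF), ]

# Restrict the check to selected columns
ok_age <- complete.cases(dF[, "age", drop = FALSE])
```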

NICAR data journalism conferences

Resources from NICAR data journalism conferences

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

  • options(stringsAsFactors=FALSE) (already the default since R 4.0.0)
  • Use the tidyverse package

Base R approach:

GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)  # read strings as factors so levels() below works
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)

Tidyverse approach:

library(dplyr)  # provides %>%, mutate() and recode()

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime` = "Working part time"))
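
A third option, sketched here as an assumption rather than taken from the preprint, is to rename the factor levels directly with a named lookup vector in base R. The `status` vector below is made up to stand in for the GSS column.

```r
# Hypothetical labour status values standing in for the GSS column
status <- factor(c("Working fulltime", "Temp not working",
                   "Working parttime", "Working fulltime", "Retired"))

# Lookup table: old level -> new level
lookup <- c(
  "Temp not working" = "Temporarily not working",
  "Unempl, laid off" = "Unemployed, laid off",
  "Working fulltime" = "Working full time",
  "Working parttime" = "Working part time"
)

# Rename only the levels found in the lookup; other levels are left as they are
old <- levels(status)
hit <- old %in% names(lookup)
levels(status)[hit] <- unname(lookup[old[hit]])

table(status)
```

Because only the levels change, the column stays a factor throughout and no round trip through character is needed.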

NIH CBIIT

http://datascience.cancer.gov/

Seminars

NCI Data science webinar series

Reproducibility

Bioinformatics advice I wish I learned 10 years ago from NIH

Project and Data Organization

Project Organization
proj
├── dev
│   ├── clustering.Rmd
│   └── dim_reduce.Rmd
├── doc
├── output
│   ├── 2019-05-10
│   ├── 2019-05-19
│   └── 2019-05-21
├── README.Rmd
├── renv
├── rmd
└── scripts
Data Organization
data
├── annotations
│   ├── clue_drug_repurposing_hub
│   │   ├── repurposing_drugs_20180907.txt
│   │   └── repurposing_samples_20180907.txt
│   └── ...
├── containers
│   └── singularity
│       └── sclc-george2015
├── projects
│   ├── nih
│   │   ├── mm-feature-selection
│   │   ├── mm-p3-variants
│   │   └── sclc-doe
├── public
│   └── human
│       ├── array_express
│       ├── geo
│       │   └── GSE6477
│       │       ├── processed
│       │       │   ├── GSE6477_expr.csv
│       │       │   └── sample_metadata.csv
│       │       └── raw
│       │           ├── GPL96.soft
│       │           └── GSE6477_series_matrix.txt.gz
└── ref
    └── human
        ├── agilent
        ├── gatk
        ├── gencode-v30
        └── rRNA
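
A project skeleton like the one above could be created from R roughly as follows. This is a sketch under assumptions: the project is placed in a temporary directory, and the renv folder is left out because it is normally created by renv::init().

```r
# Create the project skeleton under a temporary directory (hypothetical location)
proj <- file.path(tempdir(), "proj")
subdirs <- c("dev", "doc", "output", "rmd", "scripts")
for (d in file.path(proj, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# One dated output folder per run, named YYYY-MM-DD as in the tree above
run_dir <- file.path(proj, "output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(run_dir, showWarnings = FALSE)

# Placeholder README at the top level
file.create(file.path(proj, "README.Rmd"))

list.files(proj, recursive = TRUE, include.dirs = TRUE)
```

Naming output folders by ISO date keeps them sorted chronologically in any file listing.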

Container

Data Science for Startups: Containers Building reproducible setups for machine learning

Big data

Hadoop

Spark

Edge, fog computing

What Is Fog Computing? Fog vs. Edge Computing Explained