Data science
Latest revision as of 16:40, 27 August 2024
Courses, books
- Data Science at the Command Line by Jeroen Janssens written using bookdown.
- STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel
- STAT 545: Data wrangling, exploration, and analysis with R, by Jenny Bryan
- https://jhu-advdatasci.github.io/2018/ Johns Hopkins SPH
- http://datasciencelabs.github.io/2016/ Harvard SPH
- https://cs109.github.io/2014 Harvard CS
- Biostat 203B: Introduction to Data Science
- Data Science, Big Data Analytics, and Digital Methods Videos. Over 3,200 videos comprising over 120 hours are available.
- The Analytics Edge from edX.org or MOOC/Massive Open Online Courses.
- Telling Stories With Data by Rohan Alexander
- Best Data Science Books For Beginners
Debian
https://wiki.debian.org/DebianScience
Python
- Python Data Science Handbook: Essential Tools for Working with Data
- Getting started with data science using Python from opensource.com
R
- Coursera: Data Science Specialization by JHU (Johns Hopkins University).
- Data science in a box
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- R for Data Science
- The Art of Data Science Roger D. Peng and Elizabeth Matsui
- 20 Free Online Books to Learn R and Data Science
- Teaching resources by Rafael Irizarry, offered on edx.org (auditing is free):
  - Introduction to Data Science
  - Data Analysis for the Life Sciences
  - Genomics Data Analysis
- Introduction to Data Science: Data Analysis and Prediction Algorithms with R, by Rafael A. Irizarry (Leanpub; also free on GitHub)
- Why Is It Called That Way?! – Origin and Meaning of R Package Names.
  - dplyr
  - lubridate
  - ggplot2
  - data.table
  - tibble
  - purrr
  - amelia
  - magrittr
  - batman
  - Homeric
  - fcuk
  - hellno
- Top 25 R Packages (You Need To Learn In 2024)
  - janitor: clean column names
  - skimr: quick data summarization
  - bslib: next-gen UI for Shiny apps
  - box: modularize R scripts
  - data.table & tidytable: high-performance data manipulation
  - renv: reproducibility made easy
  - targets: pipeline management for reproducible workflows
  - naniar: visualize missing data
  - mlr3: advanced machine learning
  - gt: making professional tables
  - GWalkR: Tableau-like visualizations in R
  - torch: deep learning in R
  - plumber: build APIs in R
  - vetiver: model deployment in R and Python
  - fs: efficient file system operations
  - correlationfunnel: turn correlations into insights
  - clock: super-powered date and time handling
  - furrr: parallelized iterative processing
  - patchwork: combine multiple plots
  - echarts4r: interactive visualizations
  - officer: generate Microsoft Office documents
  - golem: production-grade Shiny apps
  - rhino: full-stack Shiny development
  - ROI: R optimization infrastructure
  - mapgl: next-level mapping with Mapbox GL and MapLibre GL
Python vs R
R, Python & Julia in data science: A comparison
Datacamp
- 32 Completely FREE DataCamp Courses To Take In 2020
- How to get a free DataCamp subscription for 2 months: Microsoft provides a free DataCamp subscription with a Visual Studio Dev Essentials account. Sign up for the account and it's done.
Machine Learning
- ML YouTube Courses
- 10 Free Must-Read Books for Machine Learning and Data Science
- Introduction to Machine Learning (I2ML)
- Probabilistic Machine Learning: An Introduction
- The Hitchhiker’s Guide to Responsible Machine Learning
- CS 329S: Machine Learning Systems Design (Stanford)
- Stability of Clinical Prediction Models Developed Using Statistical or Machine Learning Approaches, video
Top Machine Learning Algorithms
Top Machine Learning Algorithms with pros and cons.
How to prepare data for collaboration
How to share data for collaboration. Page 7 in particular has guidelines for coding (raw data) variables.
- naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
- coding variables: be consistent and avoid spelling errors
- date and time: use YYYY-MM-DD (ISO 8601 standard). Excel will interpret a gene symbol such as "Oct-4" as a date and reformat it.
- missing data: use "NA". Do not leave any cells blank.
- code book: put any lengthy explanation of the variables in a separate code book file (*.docx, for example). See p. 5 for an example.
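Some of these guidelines can be checked automatically before a file is shared. The following is a minimal sketch, not a complete validator: the helper name `check_raw_data()`, its `date_cols` argument, and the example data are all hypothetical and only illustrate the naming, missing-data, and date rules listed above.

```r
# Hypothetical helper: report violations of the raw-data guidelines.
check_raw_data <- function(df, date_cols = character()) {
  problems <- character()

  # Column names: letters, digits, and underscores only (no spaces or dots)
  bad_names <- names(df)[!grepl("^[A-Za-z][A-Za-z0-9_]*$", names(df))]
  if (length(bad_names) > 0) {
    problems <- c(problems, paste("bad column name:", bad_names))
  }

  # Missing data: blank cells should have been recorded as NA
  for (col in names(df)) {
    values <- df[[col]]
    if (is.character(values) && any(trimws(values) == "", na.rm = TRUE)) {
      problems <- c(problems, paste("blank cell in column:", col))
    }
  }

  # Dates: every non-missing value must look like YYYY-MM-DD
  for (col in date_cols) {
    values <- as.character(df[[col]])
    values <- values[!is.na(values)]
    if (!all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", values))) {
      problems <- c(problems, paste("non-ISO date in column:", col))
    }
  }
  problems
}

# Made-up example with one problem of each kind
raw <- data.frame(
  sample_id    = c("s1", "s2", "s3"),
  `visit date` = c("2024-01-15", "01/16/2024", NA),
  treatment    = c("drug", "", "placebo"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
check_raw_data(raw, date_cols = "visit date")
```

The date check only tests the pattern, not whether the date exists; `as.Date()` with an explicit format would be a stricter alternative.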
Five types of data:
- continuous
- ordinal
- categorical
- missing
- censored
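As an illustration, the five types could be represented in an R data frame roughly as follows. This is only a sketch: the column names and values are made up, and storing a censored observation as a follow-up time plus an event indicator is one common convention, not something the source prescribes.

```r
# Illustrative values only; column names are made up for this sketch.
patients <- data.frame(
  age_years      = c(54.2, 61.0, NA),              # continuous; NA marks a missing value
  tumor_grade    = factor(c("low", "high", "medium"),
                          levels = c("low", "medium", "high"),
                          ordered = TRUE),         # ordinal: levels have an order
  sex            = factor(c("F", "M", "F")),       # categorical: no order
  followup_days  = c(120, 365, 200),               # censored: observed time ...
  event_observed = c(TRUE, FALSE, TRUE)            # ... and whether the event was seen
)

str(patients)
sum(is.na(patients$age_years))   # number of missing ages
patients$tumor_grade < "high"    # comparisons work because the factor is ordered
```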
Some extra guidelines from Data organization in spreadsheets (published in The American Statistician):
- No empty cells
- Put one thing in a cell
- Make a rectangle
- No calculation in the raw data files
- Create a data dictionary (same as code book)
Data Organization in Spreadsheets
Data Organization in Spreadsheets, Broman & Woo (2018)
Paper naming
For example, FirstAuthorLastName_etal_ShortDescription_PublicationYear_JournalAbbrev.pdf.
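A small helper can build such file names consistently. This is a sketch under assumptions: the function `paper_filename()` is hypothetical, and the journal abbreviation "AmStat" in the usage line is made up for illustration.

```r
# Hypothetical helper that assembles a file name following the pattern above.
paper_filename <- function(first_author, description, year, journal) {
  # Drop spaces and punctuation so each part is safe in a file name
  clean <- function(s) gsub("[^A-Za-z0-9]", "", s)
  paste0(clean(first_author), "_etal_", clean(description), "_",
         year, "_", clean(journal), ".pdf")
}

paper_filename("Broman", "Data Organization", 2018, "AmStat")
# [1] "Broman_etal_DataOrganization_2018_AmStat.pdf"
```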
Gene name errors from Excel
- Gene name errors: Lessons not learned.
  - HGNChelper: Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols
  - Some examples: MARCH3, SEPT8, OCT4, DEC1.
- Gene names, data corruption and Excel: a 2021 update
    # x is a character vector of gene names
    length(x)
    # [1] 28109
    length(grep("march", x, ignore.case = TRUE))
    # [1] 11
    length(grep("sep", x, ignore.case = TRUE))
    # [1] 24
    length(grep("oct", x, ignore.case = TRUE))
    # [1] 0
    length(grep("dec", x, ignore.case = TRUE))
    # [1] 6
    grep("sep", x, ignore.case = TRUE, value = TRUE)
    #  [1] "RNaseP_nuc"  "SEP15"       "SEPHS1"
    #  [4] "SEPHS2"      "SEPN1"       "SEPP1"
    #  [7] "SEPSECS"     "SEPT1"       "SEPT10"
    # [10] "SEPT11"      "SEPT12"      "SEPT14"
    # [13] "SEPT2"       "SEPT3"       "SEPT4"
    # [16] "SEPT5-GP1BB" "SEPT6"       "SEPT7"
    # [19] "SEPT7P2"     "SEPT7P9"     "SEPT8"
    # [22] "SEPT9"       "SEPW1"       "septin 9/TNRC6C fusion"

    # Count the names that contain a character other than a letter, digit, or space
    ind <- grep("[^[:alnum:] ]", x)
    length(ind)
    # [1] 1108
    # Some cases:
    # "5S_rRNA"
    # "HGC6.1.1"
    # "Ig alpha 1-[alpha]2m"
    # "T-cell receptor alpha chain variable ..."
    # "TRA@"
    # "TRNA_Ala"
    # "TTN-AS1"
    # "aromatase cytochrome P-450 (P-450AROM)"
    # "immunoglobulin epsilon chain constant..."
    # "septin 9/TNRC6C fusion"
A real example:

    > data.frame(GENEID[i, 1], pull(xcsv, 1)[i], row.names = NULL)
       GENEID.i..1. pull.xcsv..1..i.
     1        1-Dec             DEC1
     2        1-Mar            MARC1
     3        2-Mar            MARC2
     4        1-Mar           MARCH1
     5       10-Mar          MARCH10
     6       11-Mar          MARCH11
     7        2-Mar           MARCH2
     8        3-Mar           MARCH3
     9        4-Mar           MARCH4
    10        5-Mar           MARCH5
    11        6-Mar           MARCH6
    12        7-Mar           MARCH7
    13        8-Mar           MARCH8
    14        9-Mar           MARCH9
    15       15-Sep            SEP15
    16        1-Sep            SEPT1
    17       10-Sep           SEPT10
    18       11-Sep           SEPT11
    19       12-Sep           SEPT12
    20       14-Sep           SEPT14
    21        2-Sep            SEPT2
    22        3-Sep            SEPT3
    23        4-Sep            SEPT4
    24        6-Sep            SEPT6
    25        7-Sep            SEPT7
    26        8-Sep            SEPT8
    27        9-Sep            SEPT9
Gene names can also start with a digit.

    > grep("^[0-9]", pull(xcsv, 1), value = TRUE)
    [1] "5S_rRNA"   "5_8S_rRNA" "6M1-18"    "7M1-2"     "7SK"
Check the symbols with the HGNChelper R package:

    > library(HGNChelper)
    > GENEID[grep("^[0-9]", GENEID[,1]), 1] |> checkGeneSymbols()
    Maps last updated on: Thu Oct 24 12:31:05 2019
               x Approved    Suggested.Symbol
     1   5S_rRNA    FALSE                <NA>
     2 5_8S_rRNA    FALSE                <NA>
     3    6M1-18    FALSE                <NA>
     4     7M1-2    FALSE                <NA>
     5       7SK    FALSE               RN7SK
     6     1-Dec    FALSE  BHLHE40 /// DELEC1
     7     1-Mar    FALSE  MTARC1 /// MARCHF1
     8     2-Mar    FALSE  MTARC2 /// MARCHF2
     9     1-Mar    FALSE  MTARC1 /// MARCHF1
    10    10-Mar    FALSE            MARCHF10
    11    11-Mar    FALSE            MARCHF11
    12     2-Mar    FALSE  MTARC2 /// MARCHF2
    13     3-Mar    FALSE             MARCHF3
    14     4-Mar    FALSE             MARCHF4
    15     5-Mar    FALSE             MARCHF5
    16     6-Mar    FALSE             MARCHF6
    17     7-Mar    FALSE             MARCHF7
    18     8-Mar    FALSE             MARCHF8
    19     9-Mar    FALSE             MARCHF9
    20    15-Sep    FALSE             SELENOF
    21     1-Sep    FALSE             SEPTIN1
    22    10-Sep    FALSE            SEPTIN10
    23    11-Sep    FALSE            SEPTIN11
    24    12-Sep    FALSE            SEPTIN12
    25    14-Sep    FALSE            SEPTIN14
    26     2-Sep    FALSE SEPTIN2 /// SEPTIN6
    27     3-Sep    FALSE             SEPTIN3
    28     4-Sep    FALSE             SEPTIN4
    29     6-Sep    FALSE             SEPTIN6
    30     7-Sep    FALSE             SEPTIN7
    31     8-Sep    FALSE             SEPTIN8
    32     9-Sep    FALSE             SEPTIN9
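Before reaching for a lookup package, a quick base R screen can flag entries that look like Excel date conversions such as "1-Mar" or "15-Sep". The sketch below is an assumption-laden illustration: the vector `symbols` is made up, and the pattern only covers the day-month form seen in the examples above, so other Excel date formats would slip through.

```r
# English month abbreviations, as they appear in the converted symbols above
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# One or two digits, a hyphen, then a month abbreviation, and nothing else
date_like <- paste0("^[0-9]{1,2}-(", paste(months, collapse = "|"), ")$")

# Made-up mix of intact and corrupted gene symbols
symbols <- c("1-Dec", "MARCH3", "2-Mar", "SEPT8", "15-Sep", "7SK", "6M1-18")

is_mangled <- grepl(date_like, symbols)
symbols[is_mangled]
# [1] "1-Dec"  "2-Mar"  "15-Sep"
```

Flagged entries cannot be restored from the date alone (for instance "1-Mar" may come from MARC1 or MARCH1, as the table above shows), so the original file is still needed to recover them.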
All NIH-funded data must be made freely accessible
Data Sharing and Public Access Policies
Public online data
Systematic Review of Supervised Machine Learning Models in Prediction of Medical Conditions 2022
complete.cases()
Count the number of rows in a data frame that have missing values with

    sum(!complete.cases(dF))

For example, on a matrix:

    > tmp <- matrix(1:6, 3, 2)
    > tmp
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    > tmp[2,1] <- NA
    > complete.cases(tmp)
    [1]  TRUE FALSE  TRUE
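The same function works on a data frame. A minimal sketch with made-up data (the data frame `dF` and its columns are illustrative only):

```r
# Made-up data frame with missing values in two different rows
dF <- data.frame(
  id     = 1:4,
  weight = c(70.5, NA, 82.1, 64.0),
  group  = c("a", "b", NA, "a")
)

n_incomplete  <- sum(!complete.cases(dF))   # rows with at least one NA
na_per_column <- colSums(is.na(dF))         # which columns hold the NAs
dF_complete   <- dF[complete.cases(dF), ]   # keep only fully observed rows

n_incomplete
# [1] 2
```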
Wrangling categorical data in R
https://peerj.com/preprints/3163.pdf
Some approaches:
- options(stringsAsFactors = FALSE) (this has been the default since R 4.0.0)
- Use the tidyverse package
Base R approach:

    # stringsAsFactors = TRUE is needed in R >= 4.0.0 so that LaborStatus is a factor
    GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
    GSS$BaseLaborStatus <- GSS$LaborStatus
    levels(GSS$BaseLaborStatus)
    summary(GSS$BaseLaborStatus)
    # Convert to character, rewrite the labels, then convert back to a factor
    GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
    GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
    GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
    GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
    GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
    GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
Tidyverse approach:

    library(dplyr)
    GSS <- GSS %>%
      mutate(tidyLaborStatus = recode(LaborStatus,
        `Temp not working` = "Temporarily not working",
        `Unempl, laid off` = "Unemployed, laid off",
        `Working fulltime` = "Working full time",
        `Working parttime` = "Working part time"))
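A third option, sketched here as an alternative rather than anything taken from the paper, is to rename the factor levels directly with a named lookup vector. It needs no extra packages and avoids the round trip through character. The vector `labor` is a made-up stand-in for GSS$LaborStatus.

```r
# Made-up stand-in for the LaborStatus column
labor <- factor(c("Working fulltime", "Temp not working", "Working parttime",
                  "Retired", "Working fulltime"))

# Lookup table: names are the old levels, values are the new labels
relabel <- c("Temp not working" = "Temporarily not working",
             "Unempl, laid off" = "Unemployed, laid off",
             "Working fulltime" = "Working full time",
             "Working parttime" = "Working part time")

# Rename only the levels that appear in the lookup; others stay unchanged
old <- levels(labor)
hit <- old %in% names(relabel)
levels(labor)[hit] <- unname(relabel[old[hit]])

table(labor)
```

Levels that are listed in the lookup but absent from the data (here "Unempl, laid off") are simply ignored.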
NIH CBIIT
http://datascience.cancer.gov/
Seminars
NCI Data science webinar series
Reproducibility
Bioinformatics advice I wish I learned 10 years ago from NIH
Project and Data Organization
- Project Organization

    proj
    ├── dev
    │   ├── clustering.Rmd
    │   └── dim_reduce.Rmd
    ├── doc
    ├── output
    │   ├── 2019-05-10
    │   ├── 2019-05-19
    │   └── 2019-05-21
    ├── README.Rmd
    ├── renv
    ├── rmd
    └── scripts

- Data Organization

    data
    ├── annotations
    │   ├── clue_drug_repurposing_hub
    │   │   ├── repurposing_drugs_20180907.txt
    │   │   └── repurposing_samples_20180907.txt
    │   └── ...
    ├── containers
    │   └── singularity
    │       └── sclc-george2015
    ├── projects
    │   ├── nih
    │   │   ├── mm-feature-selection
    │   │   ├── mm-p3-variants
    │   │   └── sclc-doe
    ├── public
    │   └── human
    │       ├── array_express
    │       ├── geo
    │       │   └── GSE6477
    │       │       ├── processed
    │       │       │   ├── GSE6477_expr.csv
    │       │       │   └── sample_metadata.csv
    │       │       └── raw
    │       │           ├── GPL96.soft
    │       │           └── GSE6477_series_matrix.txt.gz
    └── ref
        └── human
            ├── agilent
            ├── gatk
            ├── gencode-v30
            └── rRNA
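A project skeleton like the one above can be created from R. This is a minimal sketch: the function `create_project()` is hypothetical, it writes into a temporary directory so it is safe to run, and it leaves out the `renv` folder on the assumption that the renv package creates that itself when initialized.

```r
# Hypothetical helper that creates the top-level project folders
create_project <- function(root) {
  dirs <- c("dev", "doc", "output", "rmd", "scripts")
  for (d in file.path(root, dirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # One date-stamped subfolder per analysis run, named YYYY-MM-DD
  today <- format(Sys.Date(), "%Y-%m-%d")
  dir.create(file.path(root, "output", today), showWarnings = FALSE)

  # Placeholder README; an existing one is never overwritten
  readme <- file.path(root, "README.Rmd")
  if (!file.exists(readme)) {
    writeLines("# Project notes", readme)
  }
  invisible(root)
}

proj <- file.path(tempdir(), "proj")
create_project(proj)
list.files(proj)
```

Date-stamped output folders sort chronologically because the names follow ISO 8601.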
Container
Data Science for Startups: Containers. Building reproducible setups for machine learning.
Big data
Hadoop
Spark
- How to analyze log data with Python and Apache Spark
- PyBDA: a command line tool for automated analysis of big biological data sets
- Parallel Computing With R: A Brief Review by Dirk Eddelbuettel
- sparklyr 1.1: Foundations, Books, Lakes and Barriers