Revision as of 11:56, 11 April 2021

Common Workflow Language (CWL)

https://www.commonwl.org/
Workflow systems turn raw data into scientific knowledge. Related tools and terms: pipeline, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), Nextflow. The best approach is to embed the workflow in a container; see Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics by Baichoo 2018.

R

CRAN Task View: Reproducible Research
Rcwl package
- Connecting Bioconductor to other bioinformatics tools using Rcwl from Bioc2020
Reproducible Research: What to do When Your Results Can’t Be Reproduced. It identifies three danger zones:
- R session context
  - R version
  - Package versions
  - Using set.seed() for reproducible randomization
  - Floating point accuracy
- Operating System (OS) context
  - System package versions
  - System locale
  - Environment variables
- Data versioning
A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, and the repro package. The workflow has four elements:
- Git folder of source code (R project)
- Makefile
- Docker software environment (Containerization)
- RMarkdown (dynamic document generation)
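
Several of the R session and OS items in the danger-zone list above can be recorded and checked from within R itself. The following is a minimal base-R sketch, not code from the cited article; the helper name capture_context and the choice of environment variables are assumptions made for illustration.

```r
# Hypothetical helper (not from the cited article): collect context items
# that commonly differ between machines.
capture_context <- function(env_vars = c("LANG", "TZ")) {
  si <- sessionInfo()
  list(
    r_version = R.version.string,                   # R version
    packages  = vapply(si$otherPkgs,                # attached package versions
                       function(p) p$Version, character(1)),
    os        = si$running,                         # operating system
    locale    = Sys.getlocale(),                    # system locale
    env       = Sys.getenv(env_vars, unset = NA)    # selected environment variables
  )
}
ctx <- capture_context()

# set.seed() makes random draws repeatable.
set.seed(42); x <- rnorm(3)
set.seed(42); y <- rnorm(3)
identical(x, y)                      # TRUE

# Floating point accuracy: compare with a tolerance, not with ==.
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```

Saving the ctx list next to the results (for example with saveRDS()) gives a later reader something concrete to compare against when a result cannot be reproduced.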

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

targets package

targets: Democratizing Reproducible Analysis Pipelines, by Will Landau

Snakemake

Papers

High-throughput analysis suggests differences in journal false discovery rate by subject area and impact factor but not open access status

Share your code and data

zenodo.org, which has been used by Demystifying "drop-outs" in single-cell UMI data
OSF, which has been used by Methods for correcting inference based on outcomes predicted by machine learning
codeocean.
- A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. The R code can be downloaded with git (Capsule -> Export -> Clone via Git). The data (a 3.4G zip file) has to be downloaded manually. The environment panel shows which packages have to be installed (apt-get, Bioconductor, R-CRAN, R-Github). "Export" seems to be more complete than "Clone via Git"; it even includes a Dockerfile.
- Consensus Non-negative Matrix factorization (cNMF) v1.2

Misc

4 great free tools that can make your R work more efficient, reproducible and robust
digest: Create Compact Hash Digests of R Objects

memoise: Memoisation of Functions. Great for Shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the second call. I don't use the benchmark() function because it performs the same operation each time, which would favor memoise and mask some detail.

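library(ggplot2) # provides the mpg data set
library(memoise)
# Drop one row, then plot city vs. highway mileage with a lowess curve
plot_mpg2 <- function(mpgdf, row_to_remove) {
  mpgdf <- mpgdf[-row_to_remove,]
  plot(mpgdf$cty, mpgdf$hwy)
  lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
}
m_plot_mpg2 <- memoise(plot_mpg2)
# 1st memoised call: the function runs and its result is cached
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.019   0.003   0.025
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.018   0.003   0.024
# 2nd memoised call with the same arguments: the cached value is returned
# and the function body (including the plotting) is not run again
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.000   0.000   0.001
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.032   0.008   0.047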

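Be careful when using it in a simulation: a memoised function called again with the same arguments returns the cached random draws instead of generating new ones.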

f <- function(n=1e5) { 
  a <- rnorm(n)
  a
} 
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf())
system.time(f3 <- mf())
all.equal(f2, f3) # TRUE
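
To see why the second call returns identical "random" numbers, it helps to look at what memoisation does in principle. The sketch below is a base-R illustration only and is not how the memoise package is implemented; the names simple_memoise and draw, and the idea of passing a seed argument so that each replicate gets its own cache entry, are assumptions made for this example.

```r
# Hypothetical stand-in for memoise::memoise(): cache results by argument values.
simple_memoise <- function(fun) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    # Build a cache key from the (deparsed) arguments of this call.
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fun(...), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE) # later calls: return stored value
  }
}

# Hypothetical simulation function with an optional seed argument.
draw <- function(n = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}
m_draw <- simple_memoise(draw)

a <- m_draw()
b <- m_draw()
identical(a, b)    # TRUE: the second call never reaches rnorm()

# Making the seed part of the arguments gives each replicate its own entry.
r1 <- m_draw(seed = 1)
r2 <- m_draw(seed = 2)
identical(r1, r2)  # FALSE
```

The design point is that the cache key depends only on the arguments, so anything else the function relies on (the random number stream, global variables, files on disk) is invisible to the cache.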

reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
Improving reproducibility in computational biology research 2020

@@ Line 4: / Line 4: @@
 == R ==
+* [https://cran.r-project.org/web/views/ReproducibleResearch.html CRAN Task View: Reproducible Research]
 * [https://bioconductor.org/packages/release/bioc/html/Rcwl.html Rcwl] package
 ** [https://liubuntu.github.io/Bioc2020RCWL/ Connecting Bioconductor to other bioinformatics tools using Rcwl] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
@@ Line 17: / Line 18: @@
 *** Environment variables
 ** Data versioning
-* A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker
+* [https://pure.mpg.de/rest/items/item_3178013_4/component/file_3178471/content A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker], [https://brandmaier.github.io/reproducible-data-analysis-materials/BioPsy2020.html#1 Slides], and the [https://github.com/aaronpeikert/repro repro] package. Four elements
-** [https://pure.mpg.de/rest/items/item_3178013_4/component/file_3178471/content manuscript]
+** Git folder of source code (R project)
-** [https://brandmaier.github.io/reproducible-data-analysis-materials/BioPsy2020.html#1 Slides]
+** Makefile
+** Docker software environment (Containerization)
+** RMarkdown (dynamic document generation)
 = Rmarkdown =

Reproducible: Difference between revisions
