Common Workflow Language (CWL)

https://www.commonwl.org/
Workflow systems turn raw data into scientific knowledge. Related tools and keywords: pipelines, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), Nextflow. The best approach is to embed the workflow in a container; see "Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics" by Baichoo 2018.

R

CRAN Task View: Reproducible Research
Rcwl package
- Connecting Bioconductor to other bioinformatics tools using Rcwl from Bioc2020
Reproducible Research: What to do When Your Results Can’t Be Reproduced. It describes 3 danger zones:
- R session context
  - R version
  - Package versions
  - Using set.seed() for reproducible randomization
  - Floating point accuracy
- Operating System (OS) context
  - System package versions
  - System locale
  - Environment variables
- Data versioning
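
The three danger zones above can be illustrated with a few lines of base R. This is only a sketch and is not taken from the linked post: the helper name `record_context` and the object `ctx` are made up for illustration, and the MD5 checksum of a temporary file merely stands in for whatever data-versioning scheme a project actually uses.

```r
# Hypothetical helper: collect the context that a result depends on.
record_context <- function(data_file = NULL) {
  # Data versioning (stand-in): checksum of the input file, if one is given
  md5 <- NA_character_
  if (!is.null(data_file)) md5 <- unname(tools::md5sum(data_file))
  list(
    r_version = R.version.string,                    # R session: R version
    packages  = vapply(loadedNamespaces(),           # R session: package versions
                       function(p) as.character(getNamespaceVersion(p)),
                       character(1)),
    locale    = Sys.getlocale(),                     # OS: system locale
    env_lang  = Sys.getenv("LANG", unset = NA),      # OS: one environment variable
    data_md5  = md5
  )
}

# Reproducible randomization: the same seed gives the same draws
set.seed(42)
x1 <- rnorm(3)
set.seed(42)
x2 <- rnorm(3)
identical(x1, x2)                   # TRUE

# Floating point accuracy: compare with a tolerance, not with ==
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Record the context together with a checksum of an example data file
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp)
ctx <- record_context(tmp)
str(ctx)
```

Saving the output of such a helper next to the results makes it easier to tell later which of the three areas changed when a result cannot be reproduced.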

A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, Talks & Video. The whole idea is implemented in an R package, the repro package. The package creates an R project template, available in RStudio via New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. Note also that this approach does not make use of the renv package and cannot handle Bioconductor packages. The workflow has four elements:
- Git folder of source code for version control (R project)
- Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
- Docker software environment (Containerization)
- RMarkdown (dynamic document generation)
```
automake() # Create '.repro/Dockerfile_packages', 
           #        '.repro/Makefile_Rmds' & 'Dockerfile'
           # and open <Makefile>

# Modify <Makefile> by following the console output

rerun() # Inspects the files of the project and suggests a way to
        # reproduce it. Follow the console output by opening
        # a terminal and running the shell command below (not R code):
# make docker && make -B DOCKER=TRUE

# The above will generate the output html file in your browser
```

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

targets package

targets: Democratizing Reproducible Analysis Pipelines, by Will Landau

Snakemake

Papers

High-throughput analysis suggests differences in journal false discovery rate by subject area and impact factor but not open access status

Share your code and data

zenodo.org which has been used by Demystifying "drop-outs" in single-cell UMI data
OSF which has been used by Methods for correcting inference based on outcomes predicted by machine learning
codeocean.
- A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. The R code can be downloaded with git (Capsule -> Export -> Clone via Git). The data (3.4G zip file) has to be downloaded manually. The environment panel shows which packages have to be installed (apt-get, Bioconductor, R-CRAN, R-Github). "Export" seems to be more complete than "Clone via Git"; it even includes a Dockerfile.
- Consensus Non-negative Matrix factorization (cNMF) v1.2

Misc

4 great free tools that can make your R work more efficient, reproducible and robust
digest: Create Compact Hash Digests of R Objects

memoise: Memoisation of Functions. Great for shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function since it performs the same operation each time, which would favor memoise and mask some detail.

```
library(ggplot2) # provides the mpg data set
library(memoise)

# Drop one row, then plot city vs highway mileage with a lowess curve
plot_mpg2 <- function(mpgdf, row_to_remove) {
  mpgdf = mpgdf[-row_to_remove,]
  plot(mpgdf$cty, mpgdf$hwy)
  lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
}
m_plot_mpg2 = memoise(plot_mpg2)

# 1st memoised call: nothing is cached yet, so the function runs
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.019   0.003   0.025
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.018   0.003   0.024

# 2nd memoised call with the same arguments: served from the cache
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.000   0.000   0.001
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.032   0.008   0.047
```
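
To get a feeling for how memoisation works, here is a toy version in base R. It is only a sketch of the general idea and not the memoise package's implementation; `simple_memoise`, `slow_square`, and `calls` are made-up names, and building the cache key with `deparse()` is an assumption that is only sensible for small arguments.

```r
# Toy memoiser: results are stored in an environment, keyed by the arguments.
simple_memoise <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    # Assumption: a deparsed argument list is a good enough key for small inputs
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(...), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE) # later calls: look up
  }
}

calls <- 0
slow_square <- function(x) {
  calls <<- calls + 1   # count how often the real function runs
  x^2
}
m_square <- simple_memoise(slow_square)

m_square(4)   # computed
m_square(4)   # same arguments: returned from the cache
m_square(5)   # new argument: computed
calls         # 2
```

The point to take away is that the wrapped function is skipped whenever the arguments have been seen before, so any side effect inside it, such as drawing a plot or counting calls, does not happen on a cached call.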

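Be careful when memoise is used in a simulation: the memoised function returns the cached random draws instead of generating new ones, so the two memoised calls below give identical results.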

```
f <- function(n=1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf()) # 1st call: draws new random numbers
system.time(f3 <- mf()) # 2nd call: returns the cached draws
all.equal(f2, f3) # TRUE
```
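
Two possible ways around this are sketched below; neither comes from the sources above. The first makes the seed an explicit argument, so it becomes part of the cache key. The second clears the cache with `forget()` from the memoise package. The names `sim`, `msim`, and `mdraw` are made up for illustration.

```r
library(memoise)

# Option 1: pass the seed as an argument so each seed gets its own cache entry.
# Note that set.seed() inside the function resets the global RNG state.
sim <- function(n = 5, seed = 1) {
  set.seed(seed)
  rnorm(n)
}
msim <- memoise(sim)

a <- msim(seed = 1)
b <- msim(seed = 2)
identical(a, b)               # FALSE: different seeds, different draws
identical(a, msim(seed = 1))  # TRUE: same seed, same (cached) draws

# Option 2: clear the cache when fresh random numbers are wanted
mdraw <- memoise(function(n = 5) rnorm(n))
d1 <- mdraw()
forget(mdraw)                 # drop all cached results of mdraw
d2 <- mdraw()
identical(d1, d2)             # FALSE: the function ran again
```

Option 1 keeps the speed-up and stays reproducible, since the same seed always maps to the same result whether or not it comes from the cache.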

reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
Improving reproducibility in computational biology research 2020
