Reproducible
Common Workflow Language (CWL)
- https://www.commonwl.org/
- Workflow systems turn raw data into scientific knowledge. Related tools: Pipeline, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), Nextflow. The best approach is to embed the workflow in a container; see Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics by Baichoo 2018.
R
- CRAN Task View: Reproducible Research
- Rcwl package
- Reproducible Research: What to do When Your Results Can’t Be Reproduced. 3 danger zones.
- R session context
- R version
- Packages versions
- Using set.seed() for a reproducible randomization
- Floating point accuracy
- Operating System (OS) context
- System packages versions
- System locale
- Environment variables
- Data versioning
- R session context
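The danger zones listed above can be recorded alongside a result using base R alone. The following is a minimal sketch, not a complete solution: the function name capture_context() and the field names are hypothetical, and the temporary CSV file merely stands in for a real data file.

```r
# Hypothetical helper: collect the context a result depends on.
capture_context <- function(data_files = character(0)) {
  pkgs <- loadedNamespaces()
  list(
    r_version = R.version.string,                       # R version
    packages  = vapply(pkgs,                            # package versions
                       function(p) as.character(packageVersion(p)),
                       character(1)),
    platform  = R.version$platform,                     # OS context
    locale    = Sys.getlocale(),                        # system locale
    env_path  = Sys.getenv("PATH"),                     # one environment variable
    data_md5  = tools::md5sum(data_files)               # data versioning by checksum
  )
}

# A temporary file stands in for a real data set.
tmp <- tempfile(fileext = ".csv")
writeLines("a,b", tmp)
ctx <- capture_context(tmp)

# set.seed() makes a randomization reproducible.
set.seed(2024); x1 <- rnorm(3)
set.seed(2024); x2 <- rnorm(3)
same_draws <- identical(x1, x2)

# Floating point accuracy: compare with a tolerance, not with ==.
exact_equal <- (0.1 + 0.2 == 0.3)
close_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Saving ctx next to the output (for example with saveRDS()) gives a later reader something to compare their own session against.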
- A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, Talks & Video. The whole idea is implemented in the repro R package. The package creates an R project template, which can be used from RStudio -> New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. Note that this approach does not make use of the renv package. It also cannot handle Bioconductor packages. Four elements:
- Git folder of source code for version control (R project)
- Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
- Docker software environment (Containerization)
- RMarkdown (dynamic document generation)
automake() # Create '.repro/Dockerfile_packages',
           # '.repro/Makefile_Rmds' & 'Dockerfile'
           # and open <Makefile>
# Modify <Makefile> by following the console output
rerun()    # inspects the files of a project and suggests a way to
           # reproduce the project. So just follow the console output
           # by opening a terminal and typing
           #   make docker && make -B DOCKER=TRUE
# The above will generate the output html file in your browser
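The dependency idea behind Make (rebuild a file only when it is missing or older than the files it depends on) can be illustrated in base R. This is only a sketch of the concept, not how Make or the repro package is implemented; is_stale(), build_if_stale() and render_stub() are hypothetical names, and the stub stands in for a real rendering step.

```r
# TRUE when `target` is missing or older than any of its dependencies.
is_stale <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Run `build` only when the target is out of date; report whether it ran.
build_if_stale <- function(target, deps, build) {
  if (is_stale(target, deps)) {
    build()
    return(TRUE)
  }
  FALSE
}

src <- tempfile(fileext = ".Rmd")
out <- tempfile(fileext = ".html")
writeLines("# report", src)

# Stand-in for a real render step.
render_stub <- function() writeLines("<html></html>", out)

first  <- build_if_stale(out, src, render_stub)  # output missing, so it builds
second <- build_if_stale(out, src, render_stub)  # output up to date, so it is skipped

# Pretend the source was edited one minute in the future.
Sys.setFileTime(src, Sys.time() + 60)
third  <- build_if_stale(out, src, render_stub)  # source is newer, so it rebuilds
```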
Rmarkdown
Rmarkdown package
packrat and renv
checkpoint
dockr package
'dockr': easy containerization for R
Docker & Singularity
targets package
targets: Democratizing Reproducible Analysis Pipelines Will Landau
Snakemake
- Hypercluster: a flexible tool for parallelized unsupervised clustering optimization
- https://snakemake.readthedocs.io/en/stable/tutorial/setup.html#run-tutorial-for-free-in-the-cloud-via-gitpod
- https://hpc.nih.gov/apps/snakemake.html
Papers
- zenodo.org which has been used by Demystifying "drop-outs" in single-cell UMI data
- OSF which has been used by Methods for correcting inference based on outcomes predicted by machine learning
- codeocean.
- A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. The R code can be downloaded by git (Capsule -> Export -> Clone via Git). The data (3.4G zip file) has to be downloaded manually. The environment panel shows which packages have to be installed (apt-get, Bioconductor, R-CRAN, R-Github). It seems "Export" is more complete than "Clone via Git"; it even includes a Dockerfile.
- Consensus Non-negative Matrix factorization (cNMF) v1.2
Misc
- 4 great free tools that can make your R work more efficient, reproducible and robust
- digest: Create Compact Hash Digests of R Objects
- memoise: Memoisation of Functions. Great for Shiny applications, but you need to understand how it works in order to take advantage of it. The example below is modified from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function since it performs the same operation each time, which favors memoise and masks some detail.
library(ggplot2) # mpg
library(memoise)
plot_mpg2 <- function(mpgdf, row_to_remove) {
  mpgdf = mpgdf[-row_to_remove,]
  plot(mpgdf$cty, mpgdf$hwy)
  lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
}
m_plot_mpg2 = memoise(plot_mpg2)
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.019   0.003   0.025
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.018   0.003   0.024
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.000   0.000   0.001
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.032   0.008   0.047
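- Be careful when it is used in simulation: a memoised function returns the cached random draws, so repeated calls with the same arguments give identical values instead of fresh samples.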
f <- function(n=1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf())
system.time(f3 <- mf())
all.equal(f2, f3) # TRUE
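To see why the cached call returns the same random numbers, it may help to look at a toy memoiser written in base R. This is a simplified sketch and not the memoise package's implementation: simple_memoise() is a hypothetical name, and building the cache key by deparsing the arguments is an assumption made for illustration only. The second half shows one possible workaround for simulations, which is to make the seed an explicit argument so that it becomes part of the cache key.

```r
# Toy memoiser: cache results in an environment, keyed by the deparsed arguments.
simple_memoise <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # later calls: return stored value
  }
}

# Pitfall: the arguments are the same, so the "random" result is reused.
draw <- function(n = 5) rnorm(n)
m_draw <- simple_memoise(draw)
a <- m_draw()
b <- m_draw()
reused <- identical(a, b)

# Workaround: pass the seed as an argument so each seed gets its own cache entry.
draw_seeded <- function(seed, n = 5) {
  set.seed(seed)
  rnorm(n)
}
m_seeded <- simple_memoise(draw_seeded)
r1 <- m_seeded(1)
r2 <- m_seeded(2)
same_seed_same_result <- identical(r1, m_seeded(1))
diff_seed_diff_result <- !identical(r1, r2)
```

With the seed in the argument list, replicates differ across seeds while each individual replicate stays reproducible and cacheable.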
- reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
- Improving reproducibility in computational biology research 2020