Reproducible: Difference between revisions
Revision as of 14:07, 11 April 2021
Common Workflow Language (CWL)
- https://www.commonwl.org/
- Workflow systems turn raw data into scientific knowledge. Related tools and terms: Pipeline, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), Nextflow. The recommended practice is to embed the workflow in a container; see Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics by Baichoo 2018.
R
- CRAN Task View: Reproducible Research
- Rcwl package
- Reproducible Research: What to do When Your Results Can’t Be Reproduced. Three danger zones:
  - R session context
    - R version
    - Package versions
    - Using set.seed() for a reproducible randomization
    - Floating point accuracy
  - Operating System (OS) context
    - System package versions
    - System locale
    - Environment variables
  - Data versioning
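The session and OS items in the danger-zone list above can be checked from within R. The following is a minimal sketch using base R only; the seed value and the names `x`, `y` and `context` are arbitrary choices for illustration and do not come from the cited article.

```r
# Seeding: the same seed gives the same draws within one R version / RNG kind.
set.seed(20210411)   # arbitrary seed chosen for this sketch
x <- rnorm(5)
set.seed(20210411)
y <- rnorm(5)
identical(x, y)      # TRUE

# Floating point accuracy: compare with a tolerance instead of ==.
(0.1 + 0.2) == 0.3                  # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Record the session and OS context next to the results.
context <- list(
  r_version = R.version.string,   # R version
  rng_kind  = RNGkind(),          # RNG settings that set.seed() depends on
  locale    = Sys.getlocale(),    # system locale
  lang_env  = Sys.getenv("LANG"), # one example environment variable
  session   = sessionInfo()       # attached packages and their versions
)
```

Saving `context` (for example with `saveRDS()`) together with the output makes it possible to compare environments later when results differ.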
- A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, Talks & Video. The whole idea is implemented in the R package repro. The package creates an R project template, which is available in RStudio via New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. The workflow has four elements:
- Git folder of source code for version control (R project)
- Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
- Docker software environment (Containerization)
- RMarkdown (dynamic document generation)
  automake() # Create '.repro/Dockerfile_packages',
             # '.repro/Makefile_Rmds' & 'Dockerfile'
             # and open <Makefile>
  # Modify <Makefile> by following the console output
  rerun()    # inspects the files of a project and suggests a way to
             # reproduce the project. So just follow the console output
             # by opening a terminal and typing
  make docker && make -B DOCKER=TRUE
  # The above will generate the output html file in your browser
Rmarkdown
Rmarkdown package
packrat and renv
checkpoint
dockr package
'dockr': easy containerization for R
Docker & Singularity
targets package
targets: Democratizing Reproducible Analysis Pipelines Will Landau
Snakemake
- Hypercluster: a flexible tool for parallelized unsupervised clustering optimization
- https://snakemake.readthedocs.io/en/stable/tutorial/setup.html#run-tutorial-for-free-in-the-cloud-via-gitpod
- https://hpc.nih.gov/apps/snakemake.html
Papers
- zenodo.org, which has been used by Demystifying "drop-outs" in single-cell UMI data
- OSF, which has been used by Methods for correcting inference based on outcomes predicted by machine learning
- codeocean.
- A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. The R code can be downloaded with git (Capsule -> Export -> Clone via Git). The data (a 3.4 GB zip file) has to be downloaded manually. The environment panel shows which packages have to be installed (apt-get, Bioconductor, R-CRAN, R-Github). "Export" seems to be more complete than "Clone via Git"; it even includes a Dockerfile.
- Consensus Non-negative Matrix factorization (cNMF) v1.2
Misc
- 4 great free tools that can make your R work more efficient, reproducible and robust
- digest: Create Compact Hash Digests of R Objects
- memoise: Memoisation of Functions. Great for shiny applications. You need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function since it performs the same operation each time, which favors memoise and masks some detail.
  library(ggplot2) # mpg
  library(memoise)
  plot_mpg2 <- function(mpgdf, row_to_remove) {
    mpgdf = mpgdf[-row_to_remove,]
    plot(mpgdf$cty, mpgdf$hwy)
    lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
  }
  m_plot_mpg2 = memoise(plot_mpg2)
  system.time(m_plot_mpg2(mpg, 12))
  #   user  system elapsed
  #  0.019   0.003   0.025
  system.time(plot_mpg2(mpg, 12))
  #   user  system elapsed
  #  0.018   0.003   0.024
  system.time(m_plot_mpg2(mpg, 12))
  #   user  system elapsed
  #  0.000   0.000   0.001
  system.time(plot_mpg2(mpg, 12))
  #   user  system elapsed
  #  0.032   0.008   0.047
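- Be careful when it is used in simulation: a memoised function returns the cached random draws, so repeated calls with the same arguments give identical "random" results.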
  f <- function(n=1e5) {
    a <- rnorm(n)
    a
  }
  system.time(f1 <- f())
  mf <- memoise::memoise(f)
  system.time(f2 <- mf())
  system.time(f3 <- mf())
  all.equal(f2, f3) # TRUE
- reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
- Improving reproducibility in computational biology research 2020
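To illustrate the simulation caveat noted for memoise above, here is a minimal base R sketch. `simple_memoise()` and `draw()` are hypothetical helpers written for this example; they only mimic the idea of caching on the arguments and are not how the memoise package is implemented. One possible workaround, assumed here, is to pass the seed as an argument so that it becomes part of the cache key.

```r
# Hypothetical helper: cache results in an environment, keyed on the arguments.
simple_memoise <- function(fun) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fun(...), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE) # later calls: return stored value
  }
}

# Hypothetical simulation function; the seed is an explicit argument.
draw <- function(n = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}
m_draw <- simple_memoise(draw)

a <- m_draw()
b <- m_draw()
identical(a, b)                 # TRUE: the second call reuses the cached draws

r1 <- m_draw(seed = 1)
r2 <- m_draw(seed = 2)
identical(r1, r2)               # FALSE: a different seed is a different key
identical(r1, draw(seed = 1))   # TRUE: same seed reproduces the cached draws
```

With the real package, `memoise::forget()` can be used to clear a memoised function's cache when fresh draws are needed.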