Reproducible: Difference between revisions

From 太極
= Common Workflow Language (CWL) =
* https://www.commonwl.org/
* [https://www.nature.com/articles/d41586-019-02619-z Workflow systems turn raw data into scientific knowledge]. Topics include pipelines, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), and [https://www.nextflow.io/ Nextflow]. The best approach is to embed the workflow in a container; see [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2446-1 Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics] by Baichoo et al., 2018.


== R ==

Revision as of 21:37, 1 December 2020

Common Workflow Language (CWL)

R

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

Snakemake

Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Misc

  • 4 great free tools that can make your R work more efficient, reproducible and robust
  • digest: Create Compact Hash Digests of R Objects
  • memoise: Memoisation of Functions. Great for Shiny applications, but you need to understand how it works to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the second call. I don't use the benchmark() function because it repeats the same operation on every run, which would favor memoise and mask some detail.
    library(ggplot2) # mpg 
    library(memoise) 
    plot_mpg2 <- function(mpgdf, row_to_remove) {
      mpgdf = mpgdf[-row_to_remove,]
      plot(mpgdf$cty, mpgdf$hwy)
      lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
    }
    m_plot_mpg2 = memoise(plot_mpg2)
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.019   0.003   0.025
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.018   0.003   0.024
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.000   0.000   0.001
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.032   0.008   0.047
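Be careful when using it in simulations: a memoised function returns the cached result, so repeated calls with the same arguments return the same "random" draws instead of fresh ones.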
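# f() is always called with the same (default) argument, so the memoised
# version returns the cached draws on every call after the first.
f <- function(n=1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf())  # first call: computes and caches the draws
system.time(f3 <- mf())  # second call: served from the cache
all.equal(f2, f3) # TRUE: both calls return the identical vector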
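
To make the caching behaviour above easier to reason about, here is a minimal base-R sketch of the idea behind memoisation. It is not the memoise package's implementation: simple_memoise(), draw() and the rep_id argument are hypothetical names introduced for illustration, and the cache key is simply the deparsed argument list rather than a hash. It shows why a random generator called twice with the same arguments returns identical values, and how adding an argument that varies per replicate forces a fresh draw.

```r
# Hypothetical helper (not part of the memoise package): a bare-bones
# memoiser that stores results in an environment keyed by the arguments.
simple_memoise <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    # Build a text key from the arguments; an assumption made for brevity.
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(...), envir = cache)  # first call with this key: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # later calls: return the stored value
  }
}

# rep_id is not used in the body; it only changes the cache key.
draw <- function(n = 5, rep_id = 1) rnorm(n)
m_draw <- simple_memoise(draw)

first  <- m_draw(5, rep_id = 1)
second <- m_draw(5, rep_id = 1)  # same key: cached values are returned
third  <- m_draw(5, rep_id = 2)  # different key: new random draws

identical(first, second)
identical(first, third)
```

With the real package, memoise::forget() can be used to clear a memoised function's cache when fresh draws are needed.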