Reproducible: Difference between revisions

From 太極


== R ==
* [https://cran.r-project.org/web/views/ReproducibleResearch.html CRAN Task View: Reproducible Research]
* [https://bioconductor.org/packages/release/bioc/html/Rcwl.html Rcwl] package
** [https://liubuntu.github.io/Bioc2020RCWL/ Connecting Bioconductor to other bioinformatics tools using Rcwl] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
*** Environment variables
** Data versioning
* [https://pure.mpg.de/rest/items/item_3178013_4/component/file_3178471/content A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker], [https://brandmaier.github.io/reproducible-data-analysis-materials/BioPsy2020.html#1 Slides], and the [https://github.com/aaronpeikert/repro repro] package. The workflow has four elements:
** Git folder of source code (R project)
** Makefile
** Docker software environment (containerization)
** R Markdown (dynamic document generation)


= Rmarkdown =

Revision as of 12:56, 11 April 2021

Common Workflow Language (CWL)

R

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

targets package

targets: Democratizing Reproducible Analysis Pipelines, by Will Landau

Snakemake

Papers

High-throughput analysis suggests differences in journal false discovery rate by subject area and impact factor but not open access status

Share your code and data

Misc

  • 4 great free tools that can make your R work more efficient, reproducible and robust
  • digest: Create Compact Hash Digests of R Objects
  • memoise: Memoisation of Functions. Great for Shiny applications, but you need to understand how it works to take advantage of it. The example here is modified from Efficient R by moving the data out of the function and passing it in as an argument. The cache takes effect on the second call. I do not use the benchmark() function because it repeats the same operation many times, which favors memoise and masks some detail.
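    library(ggplot2) # provides the mpg data set
    library(memoise)
    # The data frame is an argument, so it is part of the cache key
    plot_mpg2 <- function(mpgdf, row_to_remove) {
      mpgdf <- mpgdf[-row_to_remove, ]
      plot(mpgdf$cty, mpgdf$hwy)
      lines(lowess(mpgdf$cty, mpgdf$hwy), col = 2)
    }
    m_plot_mpg2 <- memoise(plot_mpg2)
    # 1st memoised call: the function runs and the result is cached
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.019   0.003   0.025
    # Plain function, for comparison
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.018   0.003   0.024
    # 2nd memoised call: served from the cache; the body (including the
    # plotting) is not run again
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.000   0.000   0.001
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.032   0.008   0.047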
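Be careful when using memoise in simulations: repeated calls with the same arguments return the cached result, so no new random numbers are drawn.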
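f <- function(n = 1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())    # plain call
mf <- memoise::memoise(f)
system.time(f2 <- mf())   # 1st memoised call: draws the sample and caches it
system.time(f3 <- mf())   # 2nd memoised call: returns the cached sample
all.equal(f2, f3) # TRUE: the same numbers, not a fresh sample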
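
One way to keep memoise usable in a simulation is to make the random seed an explicit argument, so that it becomes part of the cache key. The following is a minimal sketch under that assumption; the names sim, m_sim and the seed argument are hypothetical and not part of the original example. It also shows memoise::forget(), which clears the cache of a memoised function when a fresh draw is wanted.

```r
library(memoise)

# Hypothetical simulation function: the seed is an explicit argument,
# so different seeds map to different cache entries.
sim <- function(n = 1000, seed = 1) {
  set.seed(seed)
  rnorm(n)
}
m_sim <- memoise(sim)

x1 <- m_sim(seed = 1)  # computed and cached
x2 <- m_sim(seed = 1)  # same arguments: returned from the cache
x3 <- m_sim(seed = 2)  # different seed: computed again with new draws

identical(x1, x2)  # TRUE
identical(x1, x3)  # FALSE

# For a function without a seed argument, clear the cache to force new draws.
f <- function(n = 1000) rnorm(n)
mf <- memoise(f)
y1 <- mf()
forget(mf)   # drop everything cached for mf
y2 <- mf()   # runs f() again, so a fresh sample is drawn
```

Note that set.seed() inside sim() also resets the global random number stream, which may or may not be desirable in a larger simulation.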