Reproducible: Difference between revisions

From 太極
Revision as of 13:41, 11 April 2021

Common Workflow Language (CWL)

R

  • A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker (https://pure.mpg.de/rest/items/item_3178013_4/component/file_3178471/content), Slides (https://brandmaier.github.io/reproducible-data-analysis-materials/BioPsy2020.html#1), Talks & Video (https://github.com/aaronpeikert/repro-talk). The whole idea is implemented in the R package repro (https://github.com/aaronpeikert/repro). The package provides an R project template, available in RStudio -> New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. The workflow has four elements:
    • Git folder of source code for version control (R project)
    • Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
    • Docker software environment (containerization)
    • RMarkdown (dynamic document generation)
    library(repro)
    automake() # Create and open the Makefile
    # Modify the Makefile by following the console output
    rerun() # and follow the console output
    # Then open a terminal and run (shell, not R):
    make docker && make -B DOCKER=TRUE
    # The above will generate the output
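
To make the Makefile bullet concrete, here is a minimal base-R sketch of the rule Make applies: rebuild a target only when it is missing or older than one of its dependencies. The helper `build_if_stale()` and the file names are hypothetical and for illustration only; this is neither what repro generates nor how Make is implemented.

```r
# Hypothetical helper: run `recipe` only if `target` is missing
# or older than at least one file in `deps` (the core Make rule).
build_if_stale <- function(target, deps, recipe) {
  stale <- !file.exists(target) ||
    any(file.mtime(deps) > file.mtime(target))
  if (stale) recipe()
  invisible(stale)
}

# Work in a temporary directory so nothing is left behind.
dir <- tempfile("make-sketch-")
dir.create(dir)
data_file   <- file.path(dir, "data.csv")   # dependency
report_file <- file.path(dir, "report.txt") # target

write.csv(data.frame(x = 1:3), data_file, row.names = FALSE)

# The "recipe": how to produce the target from its dependency.
recipe <- function() {
  d <- read.csv(data_file)
  writeLines(sprintf("mean of x: %g", mean(d$x)), report_file)
}

first  <- build_if_stale(report_file, data_file, recipe) # target missing: built
second <- build_if_stale(report_file, data_file, recipe) # up to date: skipped

# Pretend the data changed later than the report was written.
Sys.setFileTime(data_file, Sys.time() + 10)
third  <- build_if_stale(report_file, data_file, recipe) # dependency newer: rebuilt
```

In the commands above, `make -B` does the opposite of this check: it forces every target to be rebuilt regardless of timestamps.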
    

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

targets package

targets: Democratizing Reproducible Analysis Pipelines, by Will Landau

Snakemake

Papers

High-throughput analysis suggests differences in journal false discovery rate by subject area and impact factor but not open access status

Share your code and data

Misc

  • 4 great free tools that can make your R work more efficient, reproducible and robust
  • digest: Create Compact Hash Digests of R Objects
  • memoise: Memoisation of Functions. Great for shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function because it performs the same operation each time, which favors memoise and masks some detail.
    library(ggplot2) # provides the mpg data set
    library(memoise)
    plot_mpg2 <- function(mpgdf, row_to_remove) {
      mpgdf <- mpgdf[-row_to_remove, ]
      plot(mpgdf$cty, mpgdf$hwy)
      lines(lowess(mpgdf$cty, mpgdf$hwy), col = 2)
    }
    m_plot_mpg2 <- memoise(plot_mpg2)
    # 1st memoised call: runs the function and caches the result
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.019   0.003   0.025
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.018   0.003   0.024
    # 2nd memoised call, same arguments: cache hit, the function body is skipped
    system.time(m_plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.000   0.000   0.001
    # The plain function does the full work every time
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.032   0.008   0.047
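Be careful when using it in simulations: a memoised function that generates random numbers returns the cached draw on every later call with the same arguments, so repeated calls are no longer independent.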
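f <- function(n = 1e5) {
  rnorm(n) # a fresh random draw on every real call
}
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf()) # runs f() and caches the draw
system.time(f3 <- mf()) # cache hit: no new random numbers are generated
all.equal(f2, f3) # TRUE: both calls return the same cached draw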
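
To see why this happens, here is a minimal base-R sketch of the idea behind memoisation: results are stored in a cache keyed by the argument values. The helper `simple_memoise()` is hypothetical and much cruder than the memoise package (which hashes the arguments); it is only meant to show the mechanism, and one possible way around the simulation problem by making the seed an explicit argument.

```r
# Hypothetical helper: cache results of `fun`, keyed by the deparsed arguments.
# Only suitable for small, simple arguments; not the memoise implementation.
simple_memoise <- function(fun) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fun(...), envir = cache) # cache miss: compute and store
    }
    get(key, envir = cache, inherits = FALSE) # return the stored result
  }
}

# Same arguments -> same key -> the cached random draw is returned again.
draw <- function(n = 3) rnorm(n)
m_draw <- simple_memoise(draw)
x1 <- m_draw()
x2 <- m_draw()
same_draw <- identical(x1, x2)

# Making the seed an argument puts it into the key:
# a different seed is a cache miss and gives a new draw.
draw_seeded <- function(n = 3, seed) {
  set.seed(seed)
  rnorm(n)
}
m_seeded <- simple_memoise(draw_seeded)
y1 <- m_seeded(seed = 1)
y2 <- m_seeded(seed = 2)
y1_again <- m_seeded(seed = 1) # cache hit for seed 1
```

With the seed in the argument list, each replicate of a simulation has its own cache entry, while repeated requests for the same replicate are still served from the cache.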