Reproducible: Difference between revisions
Revision as of 19:47, 23 July 2020
Common Workflow Language (CWL)
- https://www.commonwl.org/
- Workflow systems turn raw data into scientific knowledge. Topics mentioned: pipelines, Snakemake, Docker, Galaxy, Python, Conda, Workflow Definition Language (WDL), Nextflow. The best approach is to embed the workflow in a container; see "Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics" by Baichoo 2018.
R
- Rcwl package (https://bioconductor.org/packages/release/bioc/html/Rcwl.html)
- Reproducible Research: What to do When Your Results Can’t Be Reproduced (https://appsilon.com/reproducible-research-when-your-results-cant-be-reproduced/). 3 danger zones:
  - R session context
    - R version
    - Package versions
    - Using set.seed() for reproducible randomization
    - Floating-point accuracy
  - Operating system (OS) context
    - System package versions
    - System locale
    - Environment variables
  - Data versioning
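The danger zones listed above can be checked from inside R. The following is a minimal sketch using base R only; the variable names (`ctx`, `draw1`, `data_file`, and so on) and the choice of environment variables are illustrative assumptions, not taken from the linked article.

```r
# Record the R session and OS context alongside the results.
ctx <- list(
  r_version = R.version.string,                    # R version
  session   = sessionInfo(),                       # attached packages and their versions
  locale    = Sys.getlocale(),                     # system locale
  env       = Sys.getenv(c("LANG", "TZ", "PATH"))  # a few environment variables (illustrative choice)
)

# Seeded randomization: the same seed gives the same draws.
set.seed(42)
draw1 <- rnorm(5)
set.seed(42)
draw2 <- rnorm(5)
same_draws <- identical(draw1, draw2)   # TRUE

# Floating-point accuracy: compare with a tolerance, not with ==.
exact_equal <- (0.1 + 0.2 == 0.3)                  # FALSE
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Data versioning: store a checksum of the input file with the results.
# The temporary CSV file here is a stand-in for a real data set.
data_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), data_file, row.names = FALSE)
checksum <- unname(tools::md5sum(data_file))
```

Saving `ctx` and `checksum` next to the analysis output makes it possible to tell later whether a different result came from a changed environment or from changed data.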
Rmarkdown
Rmarkdown package
packrat
checkpoint
dockr package
'dockr': easy containerization for R
Docker & Singularity
Misc
- 4 great free tools that can make your R work more efficient, reproducible and robust
- digest: Create Compact Hash Digests of R Objects
- memoise: Memoisation of Functions. Great for Shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function since it performs the same operation each time (so it favors memoise and masks some detail).
library(ggplot2) # mpg
library(memoise)
plot_mpg2 <- function(mpgdf, row_to_remove) {
  mpgdf = mpgdf[-row_to_remove,]
  plot(mpgdf$cty, mpgdf$hwy)
  lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
}
m_plot_mpg2 = memoise(plot_mpg2)
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.019   0.003   0.025
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.018   0.003   0.024
system.time(m_plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.000   0.000   0.001
system.time(plot_mpg2(mpg, 12))
#   user  system elapsed
#  0.032   0.008   0.047
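- Be careful when memoise is used in a simulation: a memoised function called again with the same arguments returns the cached random draws instead of generating new ones.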
f <- function(n=1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())
mf <- memoise::memoise(f)
system.time(f2 <- mf())
system.time(f3 <- mf())
all.equal(f2, f3) # TRUE
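To see why the two memoised calls above return identical draws, it helps to look at a toy memoiser. The sketch below is written in base R and is not the memoise package's implementation; `simple_memoise`, `sim`, and the `seed` argument are hypothetical names introduced for illustration. It keys the cache on the deparsed arguments, so one possible workaround in a simulation is to make the seed an explicit argument so that each replicate gets its own cache entry.

```r
# Toy memoiser (illustration only, not how the memoise package is implemented).
simple_memoise <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    # Assumption: the arguments are small enough to deparse into a usable key.
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(...), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # later calls: return stored value
  }
}

# Hypothetical simulation function with an optional explicit seed.
sim <- function(n = 3, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}
m_sim <- simple_memoise(sim)

# Same arguments twice: the second call is served from the cache,
# so no new random numbers are drawn.
a <- m_sim()
b <- m_sim()
repeat_is_cached <- identical(a, b)        # TRUE

# Different seeds produce different cache keys, hence different draws.
r1 <- m_sim(n = 3, seed = 1)
r2 <- m_sim(n = 3, seed = 2)
replicates_differ <- !identical(r1, r2)    # TRUE

# Repeating a seed returns the cached replicate unchanged.
r1_again <- m_sim(n = 3, seed = 1)
```

Passing the seed explicitly keeps the speed benefit of caching for repeated requests of the same replicate while still allowing distinct replicates.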
- reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
- Improving reproducibility in computational biology research 2020