Reproducible
Common Workflow Language (CWL)
- https://www.commonwl.org/
- Workflow systems turn raw data into scientific knowledge. Related tools: pipelines, Snakemake, Docker, Galaxy, Python, Conda, Workflow Description Language (WDL), Nextflow. The best approach is to embed the workflow in a container; see Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics (Baichoo et al. 2018).
- Simplifying the development of portable, scalable, and reproducible workflows (Piccolo et al. 2021).
R
- CRAN Task View: Reproducible Research
- Rcwl package
- Reproducible Research: What to do When Your Results Can’t Be Reproduced. Three danger zones (a short R snippet illustrating the session context follows this list):
  - R session context
    - R version
    - Package versions
    - Using set.seed() for a reproducible randomization
    - Floating-point accuracy
  - Operating System (OS) context
    - System package versions
    - System locale
    - Environment variables
  - Data versioning
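A minimal sketch of recording the R session context and fixing the random seed (the object names are only for illustration):

set.seed(2024)    # fix the RNG state so the randomization is reproducible
x <- rnorm(5)
set.seed(2024)
y <- rnorm(5)
identical(x, y)   # TRUE: same seed, same draws
sessionInfo()     # records the R version, OS, locale, and attached package versions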
- A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, Talks & Video. The whole idea is implemented in the repro R package. The package provides an R project template that can be used via RStudio -> New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. Also note that this approach does not use the renv package and cannot handle Bioconductor packages. Four elements:
- Git folder of source code for version control (R project)
- Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
- Docker software environment (Containerization)
- RMarkdown (dynamic document generation)
automake()  # creates '.repro/Dockerfile_packages', '.repro/Makefile_Rmds'
            # and 'Dockerfile', and opens <Makefile>;
            # modify <Makefile> by following the console output
rerun()     # inspects the files of the project and suggests a way to
            # reproduce it; follow the console output by opening a
            # terminal and typing
make docker && make -B DOCKER=TRUE
# The above will generate the output html file in your browser
In the end, it runs the following command (per the console output), where 'reproproject' is the Docker image name in this example (the same as my project name, automatically converted to lower case).
docker run --rm --user 368262265 \
  -v "/Full_Path_To_Project":"/home/rstudio/" \
  reproproject Rscript \
  -e 'rmarkdown::render("/home/rstudio//markdown.Rmd", "all")'
- Advanced Reproducibility in Cancer Informatics
Rmarkdown
Rmarkdown package
packrat
- CRAN & Github
- Bioconductor related issues
- Videos:
- Packrat will not only store all packages, but also all project files.
- Packrat is integrated into RStudio’s user interface, which makes it easy to share projects with co-workers. See Using Packrat with RStudio.
- limitations.
- The XML package needs the OS library libxml2 to be installed, so it is not just an R-package issue.
- Ubuntu goodies
- Git and packrat. The packrat/src directory can be very large. If you don't want it in your git repo, simply add packrat/src/ to .gitignore. This means that anyone accessing the repo will not have the package source code; the files will instead be downloaded from CRAN, or from wherever the source line in the packrat.lock file dictates.
- A scenario where we need packrat: suppose we are developing a package on the current R 3.5.x. Our package requires the 'doRNG' package, which depends on the 'rngtools' package. A few months later R 3.6.0 is released and a new release (1.3.1.1) of 'rngtools' requires R >= 3.6.0. If we now try to install 'doRNG' on R 3.5.x, it fails with the error: dependency 'rngtools' is not available for package 'doRNG'. See the sketch below for one possible workaround.
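One hedged workaround (independent of packrat) is to pin the older 'rngtools' release with the remotes package before installing 'doRNG'; the version number below is only illustrative:

# Install an archived 'rngtools' release that still supports R 3.5.x,
# then install 'doRNG' against it. The exact version to pin may differ.
install.packages("remotes")
remotes::install_version("rngtools", version = "1.3.1")
install.packages("doRNG")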
Create a snapshot
- Do we really need to call packrat::snapshot()? The walkthrough page says it is not needed, but from my testing the lock file is not updated otherwise.
- I got an error when packrat tried to fetch the source code of a Bioconductor package and a local package: packrat tries to fetch the source from CRAN for these two packages.
- In the normal case, the packrat/packrat.lock file contains two entries in the 'Repos' field (line 4).
- The cause of the error is that I ran snapshot() after quitting R and starting it again. The solution is to add the Bioconductor and local repositories to options(repos); see the snippet after this list.
- So what is the point of running snapshot() then?
- Check out the forum.
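A sketch of the fix mentioned above: make the Bioconductor repositories (and a hypothetical local repository) visible before calling snapshot().

# Add the Bioconductor repositories and a local repository to options(repos)
# so packrat::snapshot() can locate the non-CRAN package sources.
options(repos = c(BiocManager::repositories(),
                  local = "file:///home/brb/local-cran"))  # local path is illustrative
packrat::snapshot()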
> dir.create("~/projects/babynames", recu=T) > packrat::init("~/projects/babynames") Initializing packrat project in directory: - "~/projects/babynames" Adding these packages to packrat: _ packrat 0.4.9-3 Fetching sources for packrat (0.4.9-3) ... OK (CRAN current) Snapshot written to '/home/brb/projects/babynames/packrat/packrat.lock' Installing packrat (0.4.9-3) ... OK (built source) Initialization complete! Unloading packages in user library: - packrat Packrat mode on. Using library in directory: - "~/projects/babynames/packrat/lib" > install.packages("reshape2") > packrat::snapshot() > system("tree -L 2 ~/projects/babynames/packrat/") /home/brb/projects/babynames/packrat/ ├── init.R ├── lib │ └── x86_64-pc-linux-gnu ├── lib-ext │ └── x86_64-pc-linux-gnu ├── lib-R # base packages │ └── x86_64-pc-linux-gnu ├── packrat.lock ├── packrat.opts └── src ├── bitops ├── glue ├── magrittr ├── packrat ├── plyr ├── Rcpp ├── reshape2 ├── stringi └── stringr
Restoring snapshots
Suppose a packrat project was created on Ubuntu 16.04 and we now want to repeat the analysis on Ubuntu 18.04. First copy the whole project directory ('babynames') to Ubuntu 18.04. Then delete the library subdirectory ('packrat/lib'), which contains binary files (*.so) that do not work on the new OS. After deleting it, start R from the project directory and run packrat::restore(); it will re-install all missing libraries. Bingo! NOTE: maybe I should use packrat::bundle() instead of manually copying the whole project folder.
Note: some OS level libraries (e.g. libXXX-dev) need to be installed manually beforehand in order for the magic to work.
$ rm -rf ~/projects/babynames/packrat/lib
$ cd ~/projects/babynames/
$ R
> packrat::status()
> remove.packages("plyr")
> packrat::status()
> packrat::restore()
Workflow
setwd("ProjectDir") packrat::init() packrat::on() # packrat::search_path() install.packages() # For personal packages stored locally packrat::set_opts(local.repos = "~/git/R") packrat::install_local("digest") # dir name of the package library(YourPackageName) # double check all dependent ones have been installed packrat::snapshot() packrat::bundle()
A bundle file (*.tar.gz) will be created under ProjectDir/packrat/src directory. Note this tar.gz file includes the whole project folder.
To unbundle the project in a new R environment/directory:
setwd("NewDirectory") # optional packrat::unbundle(FullPathofBundleTarBall, ".") # this will create 'ProjectDir' # CPU is more important than disk speed # At the end, it will show the project has been unbundled and restored at ... setwd("ProjectDir") packrat::packrat_mode() # on .libPaths() # verify library() # Expect to see packages in our bundle # packrat::on()
Example 1: The above method works for packages from Bioconductor; e.g., S4Vectors, which depends only on BiocGenerics & BiocVersion. However, the Bioconductor project does not have a snapshot repository like MRAN, so it is difficult to reproduce the environment of an earlier Bioconductor release.
Example 2: bundle our in-house R package for future reproducibility.
Set Up a Custom CRAN-like Repository
See https://rstudio.github.io/packrat/custom-repos.html. Note that the personal repository name ('sushi' in this example), used in the "Repository" field of the personal package, will also appear in the <packrat/packrat.lock> file. So as long as we work on the same computer, it is easy to restore a packrat project containing packages from a personal repository. A sketch of setting up such a repository follows.
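A minimal sketch of building such a repository with base R tools; the directory layout follows the 'sushi' example and all paths are illustrative:

# Create the source-package layout that install.packages() expects
repo <- "~/sushi"
dir.create(file.path(repo, "src", "contrib"), recursive = TRUE)

# Copy the built source tarball of the personal package into the repository
file.copy("~/git/R/MyPackage_0.1.0.tar.gz", file.path(repo, "src", "contrib"))

# Generate the PACKAGES index files that make the directory a valid repository
tools::write_PACKAGES(file.path(repo, "src", "contrib"), type = "source")

# Register the repository so install.packages()/packrat can find it
options(repos = c(sushi = paste0("file://", normalizePath(repo)), getOption("repos")))
install.packages("MyPackage")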
- packrat::init()
- packrat::snapshot(), packrat::restore()
- packrat::clean()
- packrat::status()
- packrat::install_local() # http://rstudio.github.io/packrat/limitations.html
- packrat::bundle() # see @28:44 of the video, packrat::unbundle() # see @29:17 of the same video. This will rebuild all packages
- packrat::on(), packrat::off()
- packrat::get_opts()
- packrat::set_opts() # http://rstudio.github.io/packrat/limitations.html
- packrat::opts$local.repos("~/local-cran")
- packrat::opts$external.packages(c("devtools")) # break the isolation
- packrat::extlib()
- packrat::with_extlib()
- packrat::project_dir(), .libPaths()
Warning
- If we download and modify a function definition from a CRAN package without changing the DESCRIPTION file or the package name, the snapshot created by packrat::snapshot() will record the package source as CRAN instead of the local repository. This is because (I guess) the DESCRIPTION file still contains the field 'Repository' with the value 'CRAN'; see the check below.
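A quick check of whether the locally modified copy still declares CRAN as its origin (the path is illustrative):

# Inspect the DESCRIPTION of the locally modified package copy; a
# 'Repository: CRAN' field can make packrat record CRAN as the source.
desc <- read.dcf("~/git/R/MyPackage/DESCRIPTION")
if ("Repository" %in% colnames(desc)) desc[, "Repository"]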
Docker
- This is a minimal example that installs a single package each from CRAN, Bioconductor, and GitHub into a Docker image using packrat.
- All operations are done in the container. So the host OS does not need to have R installed.
- The R script installs packrat in the container. It also initializes packrat in the working directory and installs the R packages there, but packrat::snapshot() is called with snapshot.sources = FALSE; the goal is only to generate the packrat.lock file.
- The first part, generating packrat.lock, is not quite right since the file is generated inside the container only; we should use -v in the docker run command. The GitHub repository at https://github.com/joelnitta/docker-packrat-example has fixed the problem.
$ git clone https://github.com/joelnitta/docker-packrat-example.git
$ cd docker-packrat-example

# Step 1: create the 'packrat.lock' file
$ nano install_packages.R
# note: nano is not available in the rstudio container;
# additional OS level packages like libcurl need to be installed
# in rocker/rstudio. Probably rocker/tidyverse is better than rstudio.
$ docker run -it -e DISABLE_AUTH=true -v $(pwd):/home/rstudio/project rocker/tidyverse:3.6.0 bash
# Inside the container now
$ cd /home/rstudio/project
$ time Rscript install_packages.R   # generate 'packrat/packrat.lock'
$ exit
# It took 43 minutes.
# Question: is there an easier way to generate packrat.lock without
# wasting time installing lots of packages?

# Step 2: build the image
# Open another terminal/tab
$ nano Dockerfile
# change the rocker image and R version. Make sure these two are the same as
# we used when we created the 'packrat.lock' file
$ time docker build . -t mycontainer   # It took 45 minutes.
$ docker run -it mycontainer R

# Step 3: check that the packages defined in 'install_packages.R' are installed
packageVersion("minimal")
packageVersion("biospear")
Questions:
- After running packrat::init(), it leaves a hidden .Rprofile file in the current directory. PS: the purpose of the .Rprofile file is to direct R to use the private package library when R is started from the project directory.
#### -- Packrat Autoloader (version 0.5.0) -- ####
source("packrat/init.R")
#### -- End Packrat Autoloader -- ####
- If the 'packrat' directory is accidentally deleted, the next time you launch R it shows an error message because it cannot find the packrat/init.R file sourced by .Rprofile.
- The ownership of the 'packrat' directory will be root now. See this Package Management for Reproducible R Code.
- This sophisticated approach does not save the package source code. If a package has been updated and the version we used has been moved to the CRAN archive, what will happen when we try to restore it? It is probably better to use snapshot.sources = TRUE and run packrat::bundle(), as sketched below.
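A minimal sketch of that alternative, keeping package sources in the snapshot and bundling the project:

# Keep the downloaded source tarballs under packrat/src so an archived CRAN
# version can still be restored later, then bundle the whole project.
packrat::snapshot(snapshot.sources = TRUE)
packrat::bundle()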
renv: successor to the packrat package
- https://rstudio.github.io/renv/index.html
- release 2019-11-6
- Introduction to renv 2021-01-09
- R renv: How to Manage Dependencies in R Projects Easily 2023-03-22
- The renv::migrate() function makes it possible to migrate projects from Packrat to renv.
- Why Package & Environment Management is Critical for Serious Data Science and a workflow.
- Deploying an R Shiny app on Heroku free tier
- Bioconductor related questions
- Installing packages on a PBS-Pro HPC cluster using renv
- Videos
Compare to packrat:
- Many packages are difficult to build from sources. Your system will need to have a compatible compiler toolchain available. In some cases, R packages may depend on C/C++ features that aren't available in an older system toolchain, especially in some older Linux enterprise environments.
- renv no longer attempts to explicitly download and track R package source tarballs within your project. For packages from local sources, refer to this article.
- renv has its own discovery machinery that analyzes your R code to determine which R packages will be included in the lock file. Alternatively, we can capture all packages installed into the project library by using renv::settings$snapshot.type("all").
The renv package does not have a bundle() or unbundle() function.
# mkdir renvdeseq2
setwd("renvdeseq2")
renv::init(bioconductor = TRUE)
# attempts to copy and reuse packages already installed in your R libraries.
# We'll be asked to restart the R session if we are not doing this in RStudio.

renv::install("BiocManager")

# method 1: this will only install packages under the curDir/renv/... folder
BiocManager::install("DESeq2")

# method 2: this will install packages in the ~/.cache/R/renv/... folder;
# therefore, the library can be reused by other projects.
options(repos = BiocManager::repositories())
renv::install("DESeq2")

renv::snapshot()   # create renv.lock
# it seems the lock file "renvdeseq2/renv.lock" does not save any package info
# I just installed from Bioconductor, except the renv package.
# Read https://rstudio.github.io/renv/articles/faq.html
Find R package dependencies in a project
renv::dependencies()
The following line makes snapshot() write all packages in the renv cache directory (e.g., ~/.cache/R/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/) to the renv.lock file. Note that the setting is persistent even if we restart R!
renv::settings$snapshot.type("all") # default is "implicit" renv::snapshot()
Pass renv.lock to other people and/or clone the project repository
# Make sure the 'renv' package has been installed on the remote computer
install.packages("renv")
renv::init()   # install the packages declared in renv.lock
Use renv::migrate() to port a Packrat project to renv.
Reference
See Reference.
Bioconductor
Create an Rmd file and include an R chunk "library(DESeq2)". Then run the following line
renv::init(bioconductor = TRUE)
and it will generate "renv.lock", ".Rprofile" files and "renv" directory.
PS. When we install a fresh R on Ubuntu, we should run "sudo apt install r-base-dev curl libcurl4-openssl-dev libssl-dev libxml2-dev" to install the required system packages before we can successfully run "BiocManager::install('DESeq2')".
renv::dependencies()
?dependencies. Find R packages used within a project. dependencies() will crawl files within your project, looking for R files and the packages used within those R files.
df <- renv::dependencies("Some_Dir")
From my testing, it also searches Rmd files.
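A small sketch of inspecting the result; the column names follow recent renv releases and 'Some_Dir' is a placeholder:

df <- renv::dependencies("Some_Dir")
head(df)                  # one row per detected package reference
sort(unique(df$Package))  # distinct packages referenced in the project
table(df$Package)         # how often each package is referenced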
renv::restore()
See the output message here. This is based on renv 0.16.0 (2022-09-29).
install.packages()
If I open a project that loads an renv environment, then calling "install.packages()" installs new packages into renv's cache folder (e.g., ~/.cache/R/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/ on Linux). Note that the version number is recorded too (e.g., ~/.cache/R/renv/cache/v5/R-4.2/x86_64-pc-linux-gnu/pkgndep/1.2.1).
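A quick way to see where packages are going inside an renv project (renv::paths is documented in ?paths; treat the exact cache layout as version-dependent):

.libPaths()            # the first entry should be the project's private renv library
renv::paths$library()  # project library path managed by renv
renv::paths$cache()    # global package cache, e.g. ~/.cache/R/renv/cache on Linux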
Hash
renv - manually overwrite package version in lock file. The hash is used for caching; it allows renv::restore() to restore a package from the global renv cache if available, thereby avoiding a retrieve + build + install of the package.
If the hash is not set, then renv will not use the cache and will instead always try to retrieve the package from the declared source.
Cache and path customization
?paths (linked from Installing from Local Sources)
On my macOS, it shows ~/Library/Caches/org.R-project.R/R/renv. Specifically, it is ~/Library/Caches/org.R-project.R/R/renv/cache/v5/R-4.2/aarch64-apple-darwin20 for my R 4.2.x.
On my Linux system, I see the source packages (*.tar.gz) are stored at
- ~/.local/share/renv/source/bioconductor/ # Store bioconductor packages
- ~/.local/share/renv/source/repository/ # Store CRAN packages
and the binary packages are stored at
- ~/.local/share/renv/cache/ (~/.local/share/renv/cache/v5/R-4.0/x86_64-pc-linux-gnu/)
Note that once I have used renv::init() to restore a project, the related R packages (binary and/or source) are cached, so the next time we run renv::init(), the packages can be found locally.
So how do we manage the packages in the cache? For example, what if we are developing an R package and we make a change but do not change the version number?
- renv::purge("MyPackage") # remove binary and source
> root <- renv::paths$root()

Welcome to renv! It looks like this is your first time using renv. This is a
one-time message, briefly describing some of renv's functionality.

renv maintains a local cache of data on the filesystem, located at:
- '~/.local/share/renv'

This path can be customized: please see the documentation in `?renv::paths`.

renv will also write to files within the active project folder, including:
- A folder 'renv' in the project directory, and
- A lockfile called 'renv.lock' in the project directory.

In particular, projects using renv will normally use a private, per-project
R library, in which new packages will be installed. This project library is
isolated from other R libraries on your system.

In addition, renv will update files within your project directory, including:
- .gitignore
- .Rbuildignore
- .Rprofile

Please read the introduction vignette with `vignette("renv")` for more information.
You can browse the package documentation online at https://rstudio.github.io/renv/.

Do you want to proceed? [y/N]:
Private R packages
Local R packages
Deprecated?
- https://rstudio.github.io/renv/articles/local-sources.html
- Since local R packages (whether source or binary) are not part of renv.lock, the original location of these packages is not important when we first install them.
- When we try to restore local R packages, we can put the packages' source files into the renv/local directory.
# mkdir renvbiotrip
setwd("renvbiotrip")
renv::init()
# we shall restart R according to the instruction
# * Initializing project ...
# * Discovering package dependencies ... Done!
# * Copying packages into the cache ... Done!
# The following package(s) will be updated in the lockfile:
#
# CRAN ===============================
# - renv   [* -> 0.10.0]
#
# * Lockfile written to '/tmp/renvbiotrip/renv.lock'.
# * Project '/tmp/renvbiotrip' loaded. [renv 0.10.0]
# * renv activated -- please restart the R session.

renv::install("~/Downloads/MyPackage_0.1.1.tar.gz")
# 1. The above command will take care of the dependence. Cool!
#    That is, we don't need to use the remotes package.
# 2. The output will show if packages are installed from
#    'linked cache' or from source

renv::settings$snapshot.type("all")
renv::snapshot()
# It will give a message that some package(s) were installed from an unknown source;
# renv may be unable to restore these packages in the future.
Since the versions of dependency packages change over time, the renv.lock file created yesterday will likely differ from the one created today (package versions and hashes).
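A hedged sketch of diffing two lockfiles, relying only on the fact that renv.lock is JSON; the file names are illustrative:

# Compare package versions recorded in two copies of renv.lock
library(jsonlite)
old <- fromJSON("renv.lock.yesterday", simplifyVector = FALSE)
new <- fromJSON("renv.lock", simplifyVector = FALSE)
old_ver <- sapply(old$Packages, `[[`, "Version")
new_ver <- sapply(new$Packages, `[[`, "Version")
common  <- intersect(names(old_ver), names(new_ver))
changed <- common[old_ver[common] != new_ver[common]]
data.frame(Package = changed, Old = old_ver[changed], New = new_ver[changed])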
Now we are ready to test the restoration.
- Pass renv.lock and MyPackage_0.1.1.tar.gz to other people and/or clone the project repository (different instructions if we pass the project repository?). Suppose we have copied renv.lock to the renvbiotrip/ directory on a new computer.
# mkdir renvbiotrip
## Copy renv.lock to renvbiotrip/
# mkdir renvbiotrip/renv/local
## Copy MyPackage_0.1.1.tar.gz (private packages) to renvbiotrip/renv/local

install.packages("renv")
renv::restore()   # install the packages declared in renv.lock
# The output will show if packages are installed from
# 'linked cache' or from source

library(MyPackage)   # verify
MyPackage::foo()     # test
- We can test renv.lock in a Docker container from another directory to mimic the way of passing the file to other people. For example,
docker run --rm -it -v $(pwd):/home/docker -w /home/docker r-base:4.0.0
- We can create a docker image based on the renv.lock and MyPackage.tar.gz files. See the renvbiotrip repository.
Note that
- If we issue renv::restore() instead of renv::init() on the destination machine, the packages will be installed into the global environment.
- It seems renv::init() is equivalent to renv::activate() AND renv::restore() on the destination machine.
The project library is out of sync with the lockfile
We'll get this message if we start R with a version different from the one recorded in the "renv.lock" file. See install a package on an old version of R, and the sketch below for diagnosing the mismatch.
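A minimal sketch of diagnosing and resolving the mismatch:

renv::status()    # report differences between renv.lock and the project library
renv::restore()   # make the library match the lockfile, or ...
renv::snapshot()  # ... make the lockfile match the library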
graph
- Search for "graph" on https://rstudio.github.io/renv/index.html
- We need to install the igraph package before we can use renv::graph(). It seems no extra software was needed to install igraph. Still, I got an error:
> graph(root = "devtools", leaf = "rlang") Error in inherits(edges, "formula") : argument "edges" is missing, with no default
Docker
- Using renv with Docker. Note that there are two ways to take the Docker approach: one is to include package installation in the Dockerfile, which embeds the packages in the image; the other is to install the appropriate R packages when the container is run.
- Creating Docker Images with renv (see here for 3 example Registries: Rocker Project/R-Hub/RStudio). Make sure <renv.lock> file and the local R package <MyPackage_0.1.0.tar.gz> are in the current directory.
FROM r-base:4.0.0
RUN R -e 'install.packages("renv")'
RUN mkdir -p /home/docker/renv/local
COPY renv.lock /home/docker/
COPY MyPackage_0.1.0.tar.gz /home/docker/renv/local/
WORKDIR /home/docker
RUN R -e 'renv::restore()'
CMD ["R"]
- Running Docker Containers with renv
docker build -t renvmypackage .    # Docker image names must be lower case
docker run --rm -it renvmypackage
# OR
docker run --rm -it -v $(pwd):/home/docker renvbiotrip
Question: how to update a package within a container? 1. Start the container as root and update the packages inside the container. 2. Run system("su docker") to switch to the user 'docker'. 3. Running system("su docker") exits R and drops to a shell; run "whoami" to double check the current user and type "R" to enter R again.
Another simple but inferior way to test the Docker method is the following: assume <renv.lock> is saved in the ProjectDir directory and ProjectDir has neither renv nor .Rprofile. The big drawback of this approach is that the created renv directory and <.Rprofile> belong to the user root.
docker run --rm -it -v ProjectDir:/home r-base:4.0.0
# inside the container's R session:
install.packages("renv")
setwd("/home")
renv::init()
pracpac
pracpac - Practical 'R' Packaging in 'Docker'
Github actions
Chapter 5 Testing with a reproducible environment
checkpoint
dockr package
'dockr': easy containerization for R
Docker & Singularity
targets package
targets: Democratizing Reproducible Analysis Pipelines Will Landau
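A minimal sketch of a targets pipeline; the script must be saved as _targets.R in the project root, and the target names are only illustrative:

# _targets.R
library(targets)
list(
  tar_target(raw_data, rnorm(100)),       # step 1: create (or read) the data
  tar_target(data_mean, mean(raw_data))   # step 2: summarize; reruns only when raw_data changes
)

Run the pipeline with targets::tar_make() and read a result with targets::tar_read(data_mean); unchanged targets are skipped on re-runs, which is what makes the pipeline both reproducible and efficient.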
Dev Containers
Easy R Tutorials with Dev Containers
Snakemake
- Hypercluster: a flexible tool for parallelized unsupervised clustering optimization
- https://snakemake.readthedocs.io/en/stable/tutorial/setup.html#run-tutorial-for-free-in-the-cloud-via-gitpod
- https://hpc.nih.gov/apps/snakemake.html
- Snakemake—a scalable bioinformatics workflow engine (paper, 2012)
Papers
- zenodo.org which has been used by
- OSF which has been used by Methods for correcting inference based on outcomes predicted by machine learning
- codeocean.
- A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. The R code can be downloaded via git (Capsule -> Export -> Clone via Git). The data (3.4G zip file) has to be downloaded manually. The environment panel shows which packages have to be installed (apt-get, Bioconductor, R-CRAN, R-Github). It seems "Export" is more complete than "Clone via Git"; it even includes a Dockerfile.
- Consensus Non-negative Matrix factorization (cNMF) v1.2
Misc
- 4 great free tools that can make your R work more efficient, reproducible and robust
- digest: Create Compact Hash Digests of R Objects
- memoise: Memoisation of Functions. Great for shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function; the cache kicks in on the 2nd call. I don't use the benchmark() function since it performs the same operation each time (which would favor memoise and mask some detail).
library(ggplot2)  # mpg
library(memoise)

plot_mpg2 <- function(mpgdf, row_to_remove) {
  mpgdf = mpgdf[-row_to_remove, ]
  plot(mpgdf$cty, mpgdf$hwy)
  lines(lowess(mpgdf$cty, mpgdf$hwy), col = 2)
}
m_plot_mpg2 = memoise(plot_mpg2)

system.time(m_plot_mpg2(mpg, 12))
#  user  system elapsed
# 0.019   0.003   0.025
system.time(plot_mpg2(mpg, 12))
#  user  system elapsed
# 0.018   0.003   0.024
system.time(m_plot_mpg2(mpg, 12))
#  user  system elapsed
# 0.000   0.000   0.001
system.time(plot_mpg2(mpg, 12))
#  user  system elapsed
# 0.032   0.008   0.047
- And be careful when it is used in simulation.
f <- function(n = 1e5) {
  a <- rnorm(n)
  a
}
system.time(f1 <- f())

mf <- memoise::memoise(f)
system.time(f2 <- mf())
system.time(f3 <- mf())
all.equal(f2, f3)   # TRUE
- reproducible: A Set of Tools that Enhance Reproducibility Beyond Package Management
- Improving reproducibility in computational biology research 2020