R
Install and upgrade R
New release
- R 4.4.0
- R 4.3.0
- What's new in R 4.3.0?
- Extracting from a pipe. The underscore _ can be used to refer to the final value from a pipeline
mtcars |> lm(mpg ~ disp, data = _) |> _$coef
. Previously we need to use this way or this way. If we want to apply some (anonymous) function to each element of a list, use map(), map_dbl() from the purrr.
- R 4.2.0
- Calling if() or while() with a condition of length greater than one gives an error rather than a warning.
- use underscore (_) as a placeholder on the right-hand side (RHS) of a forward pipe. For example, mtcars |> subset(cyl == 4) |> lm(mpg ~ disp, data = _)
- Enhancements to HTML Documentation
- New features in R 4.2.0
- R 4.1.0
- pipe and shorthand for creating a function
- New features in R 4.1.0 anonymous functions (lambda function)
- R 4.0.0
- R 4.0.0 now available, and a look back at R's history
- R 4.0.0 brings numerous and significant changes to syntax, strings, reference counting, grid units, and more, R 4.0: 3 new features
- factor is not default in data frame for character vector
- palette() function has a new default set of colours, and palette.colors() & palette.pals() are new
- r"(YourString)" for raw character constants. See ?Quotes
- R 3.6.0
- What's new in R 3.6.0
- Changes to random number generation
- More functions now support vectors with more than 2 billion elements
- What's new in R 3.6.0
- R 3.5.0
Online Editor
We can run R on web browsers without installing it on local machines (similar to [/ideone.com Ideone.com] for C++. It does not require an account either (cf RStudio).
rdrr.io
It can produce graphics too. The package I am testing (cobs) is available too.
rstudio.cloud
RDocumentation
The interactive engine is based on DataCamp Light
For example, tbl_df function from dplyr package.
The website DataCamp allows to run library() on the Script window. After that, we can use the packages on R Console.
Here is a list of (common) R packages that users can use on the web.
The packages on RDocumentation may be outdated. For example, the current stringr on CRAN is v1.2.0 (2/18/2017) but RDocumentation has v1.1.0 (8/19/2016).
Web Applications
Creating local repository for CRAN and Bioconductor
Parallel Computing
See R parallel.
Cloud Computing
Install R on Amazon EC2
http://randyzwitch.com/r-amazon-ec2/
Bioconductor on Amazon EC2
http://www.bioconductor.org/help/bioconductor-cloud-ami/
Big Data Analysis
- CRAN Task View: High-Performance and Parallel Computing with R
- R for big data in one picture
- Handling large data sets in R
- Big Data Analytics with R by Simon Walkowiak
- pbdR
- https://en.wikipedia.org/wiki/Programming_with_Big_Data_in_R
- Programming with Big Data in R - pbdR George Ostrouchov and Mike Matheson Oak Ridge National Laboratory
bigmemory, biganalytics, bigtabulate
ff, ffbase
- tapply does not work. Using tapply, ave functions for ff vectors in R
- Popularity bigdata / large data packages in R and ffbase useR presentation
- ffbase: statistical functions for large datasets in useR 2013
- ffbase package
biglm
data.table
See data.table.
disk.frame
Split-apply-combine for Maximum Likelihood Estimation of a linear model
Apache arrow
Reproducible Research
Reproducible Environments
https://rviews.rstudio.com/2019/04/22/reproducible-environments/
checkpoint package
Some lessons in R coding
- don't use rand() and srand() in c. The result is platform dependent. My experience is Ubuntu/Debian/CentOS give the same result but they are different from macOS and Windows. Use Rcpp package and R's random number generator instead.
- don't use list.files() directly. The result is platform dependent even different Linux OS. An extra sorting helps!
Useful R packages
Rcpp
http://cran.r-project.org/web/packages/Rcpp/index.html. See more here.
RInside : embed R in C++ code
- http://dirk.eddelbuettel.com/code/rinside.html
- http://dirk.eddelbuettel.com/papers/rfinance2010_rcpp_rinside_tutorial_handout.pdf
Ubuntu
With RInside, R can be embedded in a graphical application. For example, $HOME/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/qt directory includes source code of a Qt application to show a kernel density plot with various options like kernel functions, bandwidth and an R command text box to generate the random data. See my demo on Youtube. I have tested this qtdensity example successfully using Qt 4.8.5.
- Follow the instruction cairoDevice to install required libraries for cairoDevice package and then cairoDevice itself.
- Install Qt. Check 'qmake' command becomes available by typing 'whereis qmake' or 'which qmake' in terminal.
- Open Qt Creator from Ubuntu start menu/Launcher. Open the project file $HOME/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/qt/qtdensity.pro in Qt Creator.
- Under Qt Creator, hit 'Ctrl + R' or the big green triangle button on the lower-left corner to build/run the project. If everything works well, you shall see the interactive program qtdensity appears on your desktop.
With RInside + Wt web toolkit installed, we can also create a web application. To demonstrate the example in examples/wt directory, we can do
cd ~/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/wt make sudo ./wtdensity --docroot . --http-address localhost --http-port 8080
Then we can go to the browser's address bar and type http://localhost:8080 to see how it works (a screenshot is in here).
Windows 7
To make RInside works on Windows OS, try the following
- Make sure R is installed under C:\ instead of C:\Program Files if we don't want to get an error like g++.exe: error: Files/R/R-3.0.1/library/RInside/include: No such file or directory.
- Install RTools
- Instal RInside package from source (the binary version will give an error )
- Create a DOS batch file containing necessary paths in PATH environment variable
@echo off set PATH=C:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;%PATH% set PATH=C:\R\R-3.0.1\bin\i386;%PATH% set PKG_LIBS=`Rscript -e "Rcpp:::LdFlags()"` set PKG_CPPFLAGS=`Rscript -e "Rcpp:::CxxFlags()"` set R_HOME=C:\R\R-3.0.1 echo Setting environment for using R cmd
In the Windows command prompt, run
cd C:\R\R-3.0.1\library\RInside\examples\standard make -f Makefile.win
Now we can test by running any of executable files that make generates. For example, rinside_sample0.
rinside_sample0
As for the Qt application qdensity program, we need to make sure the same version of MinGW was used in building RInside/Rcpp and Qt. See some discussions in
- http://stackoverflow.com/questions/12280707/using-rinside-with-qt-in-windows
- http://www.mail-archive.com/[email protected]/msg04377.html
So the Qt and Wt web tool applications on Windows may or may not be possible.
GUI
Qt and R
- http://cran.r-project.org/web/packages/qtbase/index.html QtDesigner is such a tool, and its output is compatible with the qtbase R package
- http://qtinterfaces.r-forge.r-project.org
tkrplot
On Ubuntu, we need to install tk packages, such as by
sudo apt-get install tk-dev
reticulate - Interface to 'Python'
Hadoop (eg ~100 terabytes)
See also HighPerformanceComputing
- RHadoop
- Hive
- MapReduce. Introduction by Linux Journal.
- http://www.techspritz.com/category/tutorials/hadoopmapredcue/ Single node or multinode cluster setup using Ubuntu with VirtualBox (Excellent)
- Running Hadoop on Ubuntu Linux (Single-Node Cluster)
- Ubuntu 12.04 http://www.youtube.com/watch?v=WN2tJk_oL6E and instruction
- Linux Mint http://blog.hackedexistence.com/installing-hadoop-single-node-on-linux-mint
- http://www.r-bloggers.com/search/hadoop
RHadoop
- RDataMining.com based on Mac.
- Ubuntu 12.04 - Crishantha.com, nikhilshah123sh.blogspot.com.Bighadoop.wordpress contains an example.
- RapReduce in R by RevolutionAnalytics with a few examples.
- https://twitter.com/hashtag/rhadoop
- Bigd8ta.com based on Ubuntu 14.04.
Snowdoop: an alternative to MapReduce algorithm
- http://matloff.wordpress.com/2014/11/26/how-about-a-snowdoop-package/
- http://matloff.wordpress.com/2014/12/26/snowdooppartools-update/comment-page-1/#comment-665
XML
On Ubuntu, we need to install libxml2-dev before we can install XML package.
sudo apt-get update sudo apt-get install libxml2-dev
On CentOS,
yum -y install libxml2 libxml2-devel
XML
- http://giventhedata.blogspot.com/2012/06/r-and-web-for-beginners-part-ii-xml-in.html. It gave an example of extracting the XML-values from each XML-tag for all nodes and save them in a data frame using xmlSApply().
- http://www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
- https://tonybreyal.wordpress.com/2011/11/18/htmltotext-extracting-text-from-html-via-xpath/
- https://www.tutorialspoint.com/r/r_xml_files.htm
- https://www.datacamp.com/community/tutorials/r-data-import-tutorial#xml
- Extracting data from XML PubMed and Zillow are used to illustrate. xmlTreeParse(), xmlRoot(), xmlName() and xmlSApply().
- https://yihui.name/en/2010/10/grabbing-tables-in-webpages-using-the-xml-package/
library(XML) # Read and parse HTML file doc.html = htmlTreeParse('http://apiolaza.net/babel.html', useInternal = TRUE) # Extract all the paragraphs (HTML tag is p, starting at # the root of the document). Unlist flattens the list to # create a character vector. doc.text = unlist(xpathApply(doc.html, '//p', xmlValue)) # Replace all by spaces doc.text = gsub('\n', ' ', doc.text) # Join all the elements of the character vector into a single # character string, separated by spaces doc.text = paste(doc.text, collapse = ' ')
This post http://stackoverflow.com/questions/25315381/using-xpathsapply-to-scrape-xml-attributes-in-r can be used to monitor new releases from github.com.
> library(RCurl) # getURL() > library(XML) # htmlParse and xpathSApply > xData <- getURL("https://github.com/alexdobin/STAR/releases") > doc = htmlParse(xData) > plain.text <- xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue) # I look at the source code and search 2.5.3a and find the tag as # 2.5.3a > plain.text [1] "2.5.3a" "2.5.2b" "2.5.2a" "2.5.1b" "2.5.1a" [6] "2.5.0c" "2.5.0b" "STAR_2.5.0a" "STAR_2.4.2a" "STAR_2.4.1d" > > # try bwa > > xData <- getURL("https://github.com/lh3/bwa/releases") > doc = htmlParse(xData) > xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue) [1] "v0.7.15" "v0.7.13" > # try picard > xData <- getURL("https://github.com/broadinstitute/picard/releases") > doc = htmlParse(xData) > xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue) [1] "2.9.1" "2.9.0" "2.8.3" "2.8.2" "2.8.1" "2.8.0" "2.7.2" "2.7.1" "2.7.0" [10] "2.6.0"
This method can be used to monitor new tags/releases from some projects like Cura, BWA, Picard, STAR. But for some projects like sratools the class attribute in the span element ("css-truncate-target") can be different (such as "tag-name").
xmlview
RCurl
On Ubuntu, we need to install the packages (the first one is for XML package that RCurl suggests)
# Test on Ubuntu 14.04 sudo apt-get install libxml2-dev sudo apt-get install libcurl4-openssl-dev
Scrape google scholar results
No google ID is required
Seems not work
Error in data.frame(footer = xpathLVApply(doc, xpath.base, "/font/span[@class='gs_fl']", : arguments imply differing number of rows: 2, 0
devtools
devtools package depends on Curl. It actually depends on some system files. If we just need to install a package, consider the remotes package which was suggested by the BiocManager package.
# Ubuntu 14.04 sudo apt-get install libcurl4-openssl-dev # Ubuntu 16.04, 18.04 sudo apt-get install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev # Ubuntu 20.04 sudo apt-get install -y libxml2-dev libcurl4-openssl-dev libssl-dev
Lazy-load database XXX is corrupt. internal error -3. It often happens when you use install_github to install a package that's currently loaded; try restarting R and running the app again.
NB. According to the output of apt-cache show r-cran-devtools, the binary package is very old though apt-cache show r-base and supported packages like survival shows the latest version.
httr
httr imports curl, jsonlite, mime, openssl and R6 packages.
When I tried to install httr package, I got an error and some message:
Configuration failed because openssl was not found. Try installing: * deb: libssl-dev (Debian, Ubuntu, etc) * rpm: openssl-devel (Fedora, CentOS, RHEL) * csw: libssl_dev (Solaris) * brew: openssl (Mac OSX) If openssl is already installed, check that 'pkg-config' is in your PATH and PKG_CONFIG_PATH contains a openssl.pc file. If pkg-config is unavailable you can set INCLUDE_DIR and LIB_DIR manually via: R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...' -------------------------------------------------------------------- ERROR: configuration failed for package ‘openssl’
It turns out after I run sudo apt-get install libssl-dev in the terminal (Debian), it would go smoothly with installing httr package. Nice httr!
Real example: see this post. Unfortunately I did not get a table result; I only get an html file (R 3.2.5, httr 1.1.0 on Ubuntu and Debian).
Since httr package was used in many other packages, take a look at how others use it. For example, aRxiv package.
A package to download free Springer books during Covid-19 quarantine, An update to "An adventure in downloading books" (rvest package)
curl
curl is independent of RCurl package.
- http://cran.r-project.org/web/packages/curl/vignettes/intro.html
- https://www.opencpu.org/posts/curl-release-0-8/
library(curl) h <- new_handle() handle_setform(h, name="aaa", email="bbb" ) req <- curl_fetch_memory("http://localhost/d/phpmyql3_scripts/ch02/form2.html", handle = h) rawToChar(req$content)
rOpenSci packages
rOpenSci contains packages that allow access to data repositories through the R statistical programming environment
remotes
Download and install R packages stored in 'GitHub', 'BitBucket', or plain 'subversion' or 'git' repositories. This package is a lightweight replacement of the 'install_*' functions in 'devtools'. Also remotes does not require any extra OS level library (at least on Ubuntu 16.04).
Example:
# https://github.com/henrikbengtsson/matrixstats remotes::install_github('HenrikBengtsson/matrixStats@develop')
DirichletMultinomial
On Ubuntu, we do
sudo apt-get install libgsl0-dev
Create GUI
gWidgets
GenOrd: Generate ordinal and discrete variables with given correlation matrix and marginal distributions
json
Map
leaflet
- rstudio.github.io/leaflet/#installation-and-use
- https://metvurst.wordpress.com/2015/07/24/mapview-basic-interactive-viewing-of-spatial-data-in-r-6/
choroplethr
- http://blog.revolutionanalytics.com/2014/01/easy-data-maps-with-r-the-choroplethr-package-.html
- http://www.arilamstein.com/blog/2015/06/25/learn-to-map-census-data-in-r/
- http://www.arilamstein.com/blog/2015/09/10/user-question-how-to-add-a-state-border-to-a-zip-code-map/
ggplot2
How to make maps with Census data in R
googleVis
See an example from RJSONIO above.
googleAuthR
Create R functions that interact with OAuth2 Google APIs easily, with auto-refresh and Shiny compatibility.
gtrendsR - Google Trends
- Download and plot Google Trends data with R
- Analyzing Google Trends Data in R
- microarray analysis from 2004-04-01
- ngs next generation sequencing from 2004-04-01
- dna sequencing from 2004-01-01.
- rna sequencing from 2004-01-01. It can be seen RNA sequencing >> DNA sequencing.
- Python vs R – Who Is Really Ahead in Data Science, Machine Learning? and The Incredible Growth of Python by David Robinson
quantmod
Maintaining a database of price files in R. It consists of 3 steps.
- Initial data downloading
- Update existing data
- Create a batch file
caret
- http://topepo.github.io/caret/index.html & https://github.com/topepo/caret/
- https://www.r-project.org/conferences/useR-2013/Tutorials/kuhn/user_caret_2up.pdf
- https://github.com/cran/caret source code mirrored on github
- Cheatsheet https://www.rstudio.com/resources/cheatsheets/
- Chapter 21 of "R for Statistical Learning"
Tool for connecting Excel with R
- https://bert-toolkit.com/
- BERT: a newcomer in the R Excel connection
- http://blog.revolutionanalytics.com/2018/08/how-to-use-r-with-excel.html
How to create raw tabular data
Eleven quick tips for properly handling tabular data 2024 PLOS.
write.table
Output a named vector
vec <- c(a = 1, b = 2, c = 3) write.csv(vec, file = "my_file.csv", quote = F) x = read.csv("my_file.csv", row.names = 1) vec2 <- x[, 1] names(vec2) <- rownames(x) all.equal(vec, vec2) # one liner: row names of a 'matrix' become the names of a vector vec3 <- as.matrix(read.csv('my_file.csv', row.names = 1))[, 1] all.equal(vec, vec3)
Avoid leading empty column to header
write.table writes unwanted leading empty column to header when has rownames
write.table(a, 'a.txt', col.names=NA) # Or better by write.table(data.frame("SeqId"=rownames(a), a), "a.txt", row.names=FALSE)
Add blank field AND column names in write.table
- write.table(, row.names = TRUE) will miss one element on the 1st row when "row.names = TRUE" which is enabled by default.
- Suppose x is (n x 2)
- write.table(x, sep="\t") will generate a file with 2 element on the 1st row
- read.table(file) will return an object with a size (n x 2)
- read.delim(file) and read.delim2(file) will also be correct
- Note that write.csv() does not have this issue that write.table() has
- Suppose x is (n x 2)
- Suppose we use write.csv(x, file). The csv file will be ((n+1) x 3) b/c the header row.
- If we use read.csv(file), the object is (n x 3). So we need to use read.csv(file, row.names = 1)
- adding blank field AND column names in write.table(); write.table writes unwanted leading empty column to header when has rownames
write.table(a, 'a.txt', col.names=NA)
- readr::write_tsv() does not include row names in the output file
read.delim(, row.names=1) and write.table(, row.names=TRUE)
How to Use read.delim Function in R
Case 1: no row.names
write.table(df, 'my_data.txt', quote=FALSE, sep='\t', row.names=FALSE) my_df <- read.delim('my_data.txt') # the rownames will be 1, 2, 3, ...
Case 2: with row.names. Note: if we open the text file in Excel, we'll see the 1st row is missing one header at the end. It is actually missing the column name for the 1st column.
write.table(df, 'my_data.txt', quote=FALSE, sep='\t', row.names=TRUE) my_df <- read.delim('my_data.txt') # it will automatically assign the rownames
Read/Write Excel files package
- http://www.milanor.net/blog/?p=779
- flipAPI. One useful feature of DownloadXLSX, which is not supported by the readxl package, is that it can read Excel files directly from the URL.
- xlsx: depends on Java
- openxlsx: not depend on Java. Depend on zip application. On Windows, it seems to be OK without installing Rtools. But it can not read xls file; it works on xlsx file.
- It can't be used to open .xls or .xlm files.
- When I try the package to read an xlsx file, I got a warning: No data found on worksheet. 6/28/2018
- Use R to write multiple tables to a single Excel file
- readxl: it does not depend on anything although it can only read but not write Excel files.
- It is part of tidyverse package. The readxl website provides several articles for more examples.
- readxl webinar.
- One advantage of read_excel (as with read_csv in the readr package) is that the data imports into an easy to print object with three attributes a tbl_df, a tbl and a data.frame.
- For writing to Excel formats, use writexl or openxlsx package.
library(readxl) read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = readxl_progress(), .name_repair = "unique") # Example read_excel(path, range = cell_cols("c:cx"), col_types = "numeric")
- writexl: zero dependency xlsx writer for R
library(writexl) mylst <- list(sheet1name = df1, sheet2name = df2) write_xlsx(mylst, "output.xlsx")
For the Chromosome column, integer values becomes strings (but converted to double, so 5 becomes 5.000000) or NA (empty on sheets).
> head(read_excel("~/Downloads/BRCA.xls", 4)[ , -9], 3) UniqueID (Double-click) CloneID UGCluster 1 HK1A1 21652 Hs.445981 2 HK1A2 22012 Hs.119177 3 HK1A4 22293 Hs.501376 Name Symbol EntrezID 1 Catenin (cadherin-associated protein), alpha 1, 102kDa CTNNA1 1495 2 ADP-ribosylation factor 3 ARF3 377 3 Uroporphyrinogen III synthase UROS 7390 Chromosome Cytoband ChimericClusterIDs Filter 1 5.000000 5q31.2 <NA> 1 2 12.000000 12q13 <NA> 1 3 <NA> 10q25.2-q26.3 <NA> 1
The hidden worksheets become visible (Not sure what are those first rows mean in the output).
> excel_sheets("~/Downloads/BRCA.xls") DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 01 00 00 00 9a 0c 00 00 1a 00 DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 03 00 00 00 9b 0c 00 00 0a 00 DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 02 00 00 00 9a 0c 00 00 06 00 [1] "Experiment descriptors" "Filtered log ratio" "Gene identifiers" [4] "Gene annotations" "CollateInfo" "GeneSubsets" [7] "GeneSubsetsTemp"
The Chinese character works too.
> read_excel("~/Downloads/testChinese.xlsx", 1) 中文 B C 1 a b c 2 1 2 3
To read all worksheets we need a convenient function
read_excel_allsheets <- function(filename) { sheets <- readxl::excel_sheets(filename) sheets <- sheets[-1] # Skip sheet 1 x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_types = "numeric")) names(x) <- sheets x } dcfile <- "table0.77_dC_biospear.xlsx" dc <- read_excel_allsheets(dcfile) # Each component (eg dc1) is a tibble.
readr
Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names.
1.0.0 released. readr 2.0.0 adds built-in support for reading multiple files at once, fast multi-threaded lazy reading and automatic guessing of delimiters among other changes.
Consider a text file where the table (6100 x 22) has duplicated row names and the (1,1) element is empty. The column names are all unique.
- read.delim() will treat the first column as rownames but it does not allow duplicated row names. Even we use row.names=NULL, it still does not read correctly. It does give warnings (EOF within quoted string & number of items read is not a multiple of the number of columns). The dim is 5177 x 22.
- readr::read_delim(Filename, "\t") will miss the last column. The dim is 6100 x 21.
- data.table::fread(Filename, sep = "\t") will detect the number of column names is less than the number of columns. Added 1 extra default column name for the first column which is guessed to be row names or an index. The dim is 6100 x 22. (Winner!)
The readr::read_csv() function is as fast as data.table::fread() function. For files beyond 100MB in size fread() and read_csv() can be expected to be around 5 times faster than read.csv(). See 5.3 of Efficient R Programming book.
Note that data.table::fread() can read a selection of the columns.
Speed comparison
The Fastest Way To Read And Write Files In R. data.table >> readr >> base.
ggplot2
See ggplot2
Data Manipulation & Tidyverse
See Tidyverse.
Data Science
See Data science page
microbenchmark & rbenchmark
- microbenchmark
- rbenchmark (not updated since 2012)
Plot, image
jpeg
If we want to create the image on this wiki left hand side panel, we can use the jpeg package to read an existing plot and then edit and save it.
We can also use the jpeg package to import and manipulate a jpg image. See Fun with Heatmaps and Plotly.
EPS/postscript format
- Don't use postscript().
- Use cairo_ps(). See aving High-Resolution ggplots: How to Preserve Semi-Transparency. It works on base R plots too.
cairo_ps(filename = "survival-curves.eps", width = 7, height = 7, pointsize = 12, fallback_resolution = 300) print(p) # or any base R plots statements dev.off()
- Export a graph to .eps file with R.
- The results looks the same as using cairo_ps().
- The file size by setEPS() + postscript() is quite smaller compared to using cairo_ps().
- However, grep can find the characters shown on the plot generated by cairo_ps() but not setEPS() + postscript().
setEPS() postscript("whatever.eps") # 483 KB plot(rnorm(20000)) dev.off() # grep rnorm whatever.eps # Not found! cairo_ps("whatever_cairo.eps") # 2.4 MB plot(rnorm(20000)) dev.off() # grep rnorm whatever_cairo.eps # Found!
- View EPS files
- Linux: evince. It is installed by default.
- Mac: evince. brew install evince
- Windows. Install ghostscript 9.20 (10.x does not work with ghostview/GSview) and ghostview/GSview (5.0). In Ghostview, open Options -> Advanced Configure. Change Ghostscript DLL path AND Ghostscript include Path according to the ghostscript location ("C:\.
- Edit EPS files: Inkscape
- Step 1: open the EPS file
- Step 2: EPS Input: Determine page orientation from text direction 'Page by page' - OK
- Step 3: PDF Import Settings: default is "Internal import", but we shall choose "Cairo import".
- Step 4: Zoom in first.
- Step 5: Click on Layers and Objects tab on the RHS. Now we can select any lines or letters and edit them as we like. The selected objects are highlighted in the "Layers and Objects" panel. That is, we can select multiple objects using object names. The selected objects can be rotated (Object -> Rotate 90 CW), for example.
- Step 6: We can save the plot as any formats like svg, eps, pdf, html, pdf, ...
png and resolution
It seems people use res=300 as a definition of high resolution.
- Bottom line: fix res=300 and adjust height/width as needed. The default is res=72, height=width=480. If we increase res=300, the text font size will be increased, lines become thicker and the plot looks like a zoom-in.
- Saving high resolution plot in png.
png("heatmap.png", width = 8, height = 6, units='in', res = 300) # we can adjust width/height as we like # the pixel values will be width=8*300 and height=6*300 which is equivalent to # 8*300 * 6*300/10^6 = 4.32 Megapixels (1M pixels = 10^6 pixels) in camera's term # However, if we use png(, width=8*300, height=6*300, units='px'), it will produce # a plot with very large figure body and tiny text font size. # It seems the following command gives the same result as above png("heatmap.png", width = 8*300, height = 6*300, res = 300) # default units="px"
- Chapter 14.5 Outputting to Bitmap (PNG/TIFF) Files by R Graphics Cookbook
- Changing the resolution affects the size (in pixels) of graphical objects like text, lines, and points.
- 10 tips for making your R graphics look their best David Smith
- In Word you can resize the graphic to an appropriate size, but the high resolution gives you the flexibility to choose a size while not compromising on the quality. I'd recommend at least 1200 pixels on the longest side for standard printers.
- ?png. The png function has default settings ppi=72, height=480, width=480, units="px".
- By default no resolution is recorded in the file, except for BMP.
- BMP vs PNG format. If you need a smaller file size and don’t mind a lossless compression, PNG might be a better choice. If you need to retain as much detail as possible and don’t mind a larger file size, BMP could be the way to go.
- Compression: BMP files are raw and uncompressed, meaning they’re large files that retain as much detail as possible. On the other hand, PNG files are compressed but still lossless. This means you can reduce or expand PNGs without losing any information.
- File size: BMPs are larger than PNGs. This is because PNG files automatically compress, and can be compressed again to make the file even smaller.
- Common uses: BMP contains a maximum amount of details while PNGs are good for small illustrations, sketches, drawings, logos and icons.
- Quality: No difference
- Transparency: PNG supports transparency while BMP doesn't
- Some comparison about the ratio
- 11/8.5=1.29 (A4 paper)
- 8/6=1.33 (plot output)
- 1440/900=1.6 (my display)
- Setting resolution and aspect ratios in R
- The difference of res parameter for a simple plot. How to change the resolution of a plot in base R?
- High Resolution Figures in R.
- High resolution graphics with R
- R plot: size and resolution
- How can I increase the resolution of my plot in R?, devEMF package
- See Images -> Anti-alias.
- How to check DPI on PNG
- The width of a PNG file in terms of inches cannot be determined directly from the file itself, as the file contains pixel dimensions, not physical dimensions. However, you can calculate the width in inches if you know the resolution (DPI, dots per inch) of the image. Remember that converting pixel measurements to physical measurements like inches involves a specific resolution (DPI), and different devices may display the same image at different sizes due to having different resolutions.
- Cairo case.
PowerPoint
- For PP presentation, I found it is useful to use svg() to generate a small size figure. Then when we enlarge the plot, the text font size can be enlarged too. According to svg, by default, width = 7, height = 7, pointsize = 12, family = sans.
- Try the following code. The font size is the same for both plots/files. However, the first plot can be enlarged without losing its quality.
svg("svg4.svg", width=4, height=4) plot(1:10, main="width=4, height=4") dev.off() svg("svg7.svg", width=7, height=7) # default plot(1:10, main="width=7, height=7") dev.off()
magick
https://cran.r-project.org/web/packages/magick/
See an example here I created.
Cairo
See White strips problem in png() or tiff().
geDevices
- Saving R Graphics across OSs. Use png(type="cairo-png") or the ragg package which can be incorporated into RStudio.
- Setting the Graphics Device in a RMarkdown Document
cairoDevice
PS. Not sure the advantage of functions in this package compared to R's functions (eg. Cairo_svg() vs svg()).
For ubuntu OS, we need to install 2 libraries and 1 R package RGtk2.
sudo apt-get install libgtk2.0-dev libcairo2-dev
On Windows OS, we may got the error: unable to load shared object 'C:/Program Files/R/R-3.0.2/library/cairoDevice/libs/x64/cairoDevice.dll' . We need to follow the instruction in here.
dpi requirement for publication
For import into PDF-incapable programs (MS Office)
sketcher: photo to sketch effects
httpgd
- https://nx10.github.io/httpgd/ A graphics device for R that is accessible via network protocols. Display graphics on browsers.
- Three tricks to make IDEs other than Rstudio better for R development
igraph
Identifying dependencies of R functions and scripts
https://stackoverflow.com/questions/8761857/identifying-dependencies-of-r-functions-and-scripts
library(mvbutils) foodweb(where = "package:batr") foodweb( find.funs("package:batr"), prune="survRiskPredict", lwd=2) foodweb( find.funs("package:batr"), prune="classPredict", lwd=2)
iterators
Iterator is useful over for-loop if the data is already a collection. It can be used to iterate over a vector, data frame, matrix, file
Iterator can be combined to use with foreach package http://www.exegetic.biz/blog/2013/11/iterators-in-r/ has more elaboration.
Colors
- scales package. This is used in ggplot2 package.
- colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes. Popular! Many reverse imports/suggests; e.g. ComplexHeatmap. See my ggplot2 page.
hcl_palettes(plot = TRUE) # a quick overview hcl_palettes(palette = "Dark 2", n=5, plot = T) q4 <- qualitative_hcl(4, palette = "Dark 3")
- Create color range between two colors in R using colorRampPalette()
- How to expand color palette with ggplot and RColorBrewer
- palette_explorer() function from the tmaptools package. See selecting color palettes with shiny.
- Cookbook for R
- Sequential, diverging and qualitative colour scales/palettes from colorbrewer.org: scale_colour_brewer(), scale_fill_brewer(), ...
- http://colorbrewer2.org/
- It seems there is no choice of getting only 2 colors no matter which set name we can use
- To see the set names used in brewer.pal, see
- RColorBrewer::display.brewer.all()
- Output
- Especially, Set1 from http://colorbrewer2.org/
- To list all R color names, colors().
- Color Chart (include Hex and RGB) & Using Color in R from http://research.stowers.org
- Code to generate rectangles with colored background https://www.r-graph-gallery.com/42-colors-names/
- http://www.bauer.uh.edu/parks/truecolor.htm Interactive RGB, Alpha and Color Picker
- http://deanattali.com/blog/colourpicker-package/ Not sure what it is doing
- How to Choose the Best Colors For Your Data Charts
- How to expand color palette with ggplot and RColorBrewer
- Color names in R
- convert hex value to color names
library(plotrix) sapply(rainbow(4), color.id) # color.id is a function # it is used to identify closest match to a color sapply(palette(), color.id) sapply(RColorBrewer::brewer.pal(4, "Set1"), color.id)
- hsv() function. Matrix-style screensaver in R
Below is an example using the option scale_fill_brewer(palette = "Paired"). See the source code at gist. Note that only set1 and set3 palettes in qualitative scheme can support up to 12 classes.
According to the information from the colorbrew website, qualitative schemes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes.
colortools
Tools that allow users generate color schemes and palettes
colourpicker
A Colour Picker Tool for Shiny and for Selecting Colours in Plots
eyedroppeR
Select colours from an image in R with {eyedroppeR}
rex
Friendly Regular Expressions
formatR
The best strategy to avoid failure is to put comments in complete lines or after complete R expressions.
See also this discussion on stackoverflow talks about R code reformatting.
library(formatR) tidy_source("Input.R", file = "output.R", width.cutoff=70) tidy_source("clipboard") # default width is getOption("width") which is 127 in my case.
Some issues
- Comments appearing at the beginning of a line within a long complete statement. This will break tidy_source().
cat("abcd", # This is my comment "defg")
will result in
> tidy_source("clipboard") Error in base::parse(text = code, srcfile = NULL) : 3:1: unexpected string constant 2: invisible(".BeGiN_TiDy_IdEnTiFiEr_HaHaHa# This is my comment.HaHaHa_EnD_TiDy_IdEnTiFiEr") 3: "defg" ^
- Comments appearing at the end of a line within a long complete statement won't break tidy_source() but tidy_source() cannot re-locate/tidy the comma sign.
cat("abcd" ,"defg" # This is my comment ,"ghij")
will become
cat("abcd", "defg" # This is my comment , "ghij")
Still bad!!
- Comments appearing at the end of a line within a long complete statement breaks tidy_source() function. For example,
cat("</p>", "<HR SIZE=5 WIDTH=\"100%\" NOSHADE>", ifelse(codeSurv == 0,"<h3><a name='Genes'><b><u>Genes which are differentially expressed among classes:</u></b></a></h3>", #4/9/09 "<h3><a name='Genes'><b><u>Genes significantly associated with survival:</u></b></a></h3>"), file=ExternalFileName, sep="\n", append=T)
will result in
> tidy_source("clipboard", width.cutoff=70) Error in base::parse(text = code, srcfile = NULL) : 3:129: unexpected SPECIAL 2: "<HR SIZE=5 WIDTH=\"100%\" NOSHADE>" , 3: ifelse ( codeSurv == 0 , "<h3><a name='Genes'><b><u>Genes which are differentially expressed among classes:</u></b></a></h3>" , %InLiNe_IdEnTiFiEr%
- width.cutoff parameter is not always working. For example, there is no any change for the following snippet though I hope it will move the cat() to the next line.
if (codePF & !GlobalTest & !DoExactPermTest) cat(paste("Multivariate Permutations test was computed based on", NumPermutations, "random permutations"), "<BR>", " ", file = ExternalFileName, sep = "\n", append = T)
- It merges lines though I don't always want to do that. For example
cat("abcd" ,"defg" ,"ghij")
will become
cat("abcd", "defg", "ghij")
styler
https://cran.r-project.org/web/packages/styler/index.html Pretty-prints R code without changing the user's formatting intent.
Download papers
biorxivr
Search and Download Papers from the bioRxiv Preprint Server (biology)
aRxiv
Interface to the arXiv API
pdftools
- http://ropensci.org/blog/2016/03/01/pdftools-and-jeroen
- http://r-posts.com/how-to-extract-data-from-a-pdf-file-with-r/
- https://ropensci.org/technotes/2018/12/14/pdftools-20/
aside: set it aside
An RStudio addin to run long R commands aside your current session.
Teaching
- smovie: Some Movies to Illustrate Concepts in Statistics
Organize R research project
- CRAN Task View: Reproducible Research
- Organizing R Research Projects: CPAT, A Case Study
- Project-oriented workflow. It suggests the here package. Don't use setwd() and rm(list = ls()).
- Practice safe paths. Use projects and the here package.
- In RStudio, if we try to send a few lines of code and one of the line contains setwd(), it will give a message: The working directory was changed to XXX inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
- how to use the `here` package
- No update for the here package after 2020-12. Consider usethis package (Automate project and package setup).
- drake project
- targets package
- ProjectTemplate
How to save (and load) datasets in R (.RData vs .Rds file)
How to save (and load) datasets in R: An overview
Naming convention
- What is your preferred style for naming variables in R?
- Use of period separator: they can get mixed up in simple method dispatch. However, it is used by base R (make.names(), read.table(), et al)
- Use of underscores: really annoying for ESS users
- camelCase: Winner
- However, the survey said (no surprises perhaps) that
- lowerCamelCase was most often used for function names,
- period.separated names most often used for parameters.
- What are file naming conventions?
- Consistent naming conventions in R
- http://adv-r.had.co.nz/Style.html
- Testing for valid variable names
- R reserved words ?Reserved
- R Reserved Words
- Among these words, if, else, repeat, while, function, for, in, next and break are used for conditions, loops and user defined functions.
- Microarray/RNA-seq data
clinicalDesignData # clnDesignData geneExpressionData # gExpData geneAnnotationData # gAnnoData dataClinicalDesign dataGeneExpression dataAnnotation
# Search all variables ending with .Data ls()[grep("\\.Data$", ls())] # Search all variables starting with data_ ls()[grep("^data_", ls())]
Efficient Data Management in R
Efficient Data Management in R. .Rprofile, renv package and dplyr package.
Text to speech
Text-to-Speech with the googleLanguageR package
Speech to text
https://github.com/ggerganov/whisper.cpp and an R package audio.whisper
Weather data
- prism package
- Weatherbase
logR
https://github.com/jangorecki/logR
Progress bar
https://github.com/r-lib/progress#readme
Configurable Progress bars, they may include percentage, elapsed time, and/or the estimated completion time. They work in terminals, in 'Emacs' 'ESS', 'RStudio', 'Windows' 'Rgui' and the 'macOS'.
cron
beepr: Play A Short Sound
https://www.rdocumentation.org/packages/beepr/versions/1.3/topics/beep. Try sound=3 "fanfare", 4 "complete", 5 "treasure", 7 "shotgun", 8 "mario".
utils package
https://www.rdocumentation.org/packages/utils/versions/3.6.2
tools package
- https://www.rdocumentation.org/packages/tools/versions/3.6.2
- Where in the file are there non ASCII characters?, tools::showNonASCIIfile(<filename>)
Different ways of using R
Extending R by John M. Chambers (2016)
10 things R can do that might surprise you
https://simplystatistics.org/2019/03/13/10-things-r-can-do-that-might-surprise-you/
R call C/C++
Mainly talks about .C() and .Call().
Note that scalars and arrays must be passed using pointers. So if we want to access a function not exported from a package, we may need to modify the function to make the arguments as pointers.
- R-Extension manual of course.
- Compiled Code chapter from 'R Packages' by Hadley Wickham
- http://faculty.washington.edu/kenrice/sisg-adv/sisg-07.pdf
- http://www.stat.berkeley.edu/scf/paciorek-cppWorkshop.pdf (Very useful)
- http://www.stat.harvard.edu/ccr2005/
- http://mazamascience.com/WorkingWithData/?p=1099
- Make an R package with C++ code (a playlist from youtube)
- Using R – Calling C code ‘Hello World!’
- Computing tip by Hao Wu
.Call
Be sure to add the PACKAGE parameter to avoid an error like
cvfit <- cv.grpsurvOverlap(X, Surv(time, event), group, cv.ind = cv.ind, seed=1, penalty = 'cMCP') Error in .Call("standardize", X) : "standardize" not resolved from current namespace (grpreg)
NAMESPACE file & useDynLib
- https://cran.r-project.org/doc/manuals/r-release/R-exts.html#useDynLib
- We don't need to include double quotes around the C/Fortran subroutines in .C() or .Fortran()
- digest package example: NAMESPACE and R functions using .Call().
- stats example: NAMESPACE
(From Writing R Extensions manual) Loading is most often done automatically based on the useDynLib() declaration in the NAMESPACE file, but may be done explicitly via a call to library.dynam(). This has the form
library.dynam("libname", package, lib.loc)
library.dynam.unload()
- https://stat.ethz.ch/R-manual/R-devel/library/base/html/library.dynam.html
- http://r-pkgs.had.co.nz/src.html. The library.dynam.unload() function should be placed in .onUnload() function. This function can be saved in any R files.
- digest package example zzz.R
gcc
Coping with varying `gcc` versions and capabilities in R packages
Primitive functions
SEXP
Some examples from packages
- sva package has one C code function
R call Fortran
- R call Fortran 90
- The Need for Speed Part 1: Building an R Package with Fortran (or C) (Very detailed)
Embedding R
- See Writing for R Extensions Manual Chapter 8.
- Talk by Simon Urbanek in UseR 2004.
- Technical report by Friedrich Leisch in 2007.
- https://stat.ethz.ch/pipermail/r-help/attachments/20110729/b7d86ed7/attachment.pl
An very simple example (do not return from shell) from Writing R Extensions manual
The command-line R front-end, R_HOME/bin/exec/R, is one such example. Its source code is in file <src/main/Rmain.c>.
This example can be run by
R_HOME/bin/R CMD R_HOME/bin/exec/R
Note:
- R_HOME/bin/exec/R is the R binary. However, it couldn't be launched directly unless R_HOME and LD_LIBRARY_PATH are set up. Again, this is explained in Writing R Extension manual.
- R_HOME/bin/R is a shell-script front-end where users can invoke it. It sets up the environment for the executable. It can be copied to /usr/local/bin/R. When we run R_HOME/bin/R, it actually runs R_HOME/bin/R CMD R_HOME/bin/exec/R (see line 259 of R_HOME/bin/R as in R 3.0.2) so we know the important role of R_HOME/bin/exec/R.
More examples of embedding can be found in tests/Embedding directory. Read <index.html> for more information about these test examples.
An example from Bioconductor workshop
- What is covered in this section is different from Create and use a standalone Rmath library.
- Use eval() function. See R-Ext 8.1 and 8.2 and 5.11.
- http://stackoverflow.com/questions/2463437/r-from-c-simplest-possible-helloworld (obtained from searching R_tryEval on google)
- http://stackoverflow.com/questions/7457635/calling-r-function-from-c
Example: Create embed.c file. Then build the executable. Note that I don't need to create R_HOME variable.
cd tar xzvf cd R-3.0.1 ./configure --enable-R-shlib make cd tests/Embedding make ~/R-3.0.1/bin/R CMD ./Rtest nano embed.c # Using a single line will give an error and cannot not show the real problem. # ../../bin/R CMD gcc -I../../include -L../../lib -lR embed.c # A better way is to run compile and link separately gcc -I../../include -c embed.c gcc -o embed embed.o -L../../lib -lR -lRblas ../../bin/R CMD ./embed
Note that if we want to call the executable file ./embed directly, we shall set up R environment by specifying R_HOME variable and including the directories used in linking R in LD_LIBRARY_PATH. This is based on the inform provided by Writing R Extensions.
export R_HOME=/home/brb/Downloads/R-3.0.2 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/brb/Downloads/R-3.0.2/lib ./embed # No need to include R CMD in front.
Question: Create a data frame in C? Answer: Use data.frame() via an eval() call from C. Or see the code is stats/src/model.c, as part of model.frame.default. Or using Rcpp as here.
Reference http://bioconductor.org/help/course-materials/2012/Seattle-Oct-2012/AdvancedR.pdf
Create a Simple Socket Server in R
This example is coming from this paper.
Create an R function
simpleServer <- function(port=6543) { sock <- socketConnection ( port=port , server=TRUE) on.exit(close( sock )) cat("\nWelcome to R!\nR>" ,file=sock ) while(( line <- readLines ( sock , n=1)) != "quit") { cat(paste("socket >" , line , "\n")) out<- capture.output (try(eval(parse(text=line )))) writeLines ( out , con=sock ) cat("\nR> " ,file =sock ) } }
Then run simpleServer(). Open another terminal and try to communicate with the server
$ telnet localhost 6543 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. Welcome to R! R> summary(iris[, 3:5]) Petal.Length Petal.Width Species Min. :1.000 Min. :0.100 setosa :50 1st Qu.:1.600 1st Qu.:0.300 versicolor:50 Median :4.350 Median :1.300 virginica :50 Mean :3.758 Mean :1.199 3rd Qu.:5.100 3rd Qu.:1.800 Max. :6.900 Max. :2.500 R> quit Connection closed by foreign host.
Rserve
Note the way of launching Rserve is like the way we launch C program when R was embedded in C. See Example from Bioconductor workshop.
See my Rserve page.
outsider
- outsider: Install and run programs, outside of R, inside of R
- Run bcftools with outsider in R
(Commercial) StatconnDcom
R.NET
rJava
- A primer in using Java from R - part 1
- Note rJava is needed by xlsx package.
Terminal
# jdk 7 sudo apt-get install openjdk-7-* update-alternatives --config java # oracle jdk 8 sudo add-apt-repository -y ppa:webupd8team/java sudo apt-get update echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections sudo apt-get -y install openjdk-8-jdk
and then run the following (thanks to http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now) to fix an error: libjvm.so: cannot open shared object file: No such file or directory.
- Create the file /etc/ld.so.conf.d/java.conf with the following entries:
/usr/lib/jvm/java-8-oracle/jre/lib/amd64 /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server
- And then run sudo ldconfig
Now go back to R
install.packages("rJava")
Done!
If above does not work, a simple way is by (under Ubuntu) running
sudo apt-get install r-cran-rjava
which will create new package 'default-jre' (under /usr/lib/jvm) and 'default-jre-headless'.
RCaller
RApache
Rscript, arguments and commandArgs()
Passing arguments to an R script from command lines Syntax:
$ Rscript --help Usage: /path/to/Rscript [--options] [-e expr [-e expr2 ...] | file] [args]
Example:
args = commandArgs(trailingOnly=TRUE) # test if there is at least one argument: if not, return an error if (length(args)==0) { stop("At least one argument must be supplied (input file).n", call.=FALSE) } else if (length(args)==1) { # default output file args[2] = "out.txt" } cat("args[1] = ", args[1], "\n") cat("args[2] = ", args[2], "\n")
Rscript --vanilla sillyScript.R iris.txt out.txt # args[1] = iris.txt # args[2] = out.txt
Rscript, #! Shebang and optparse package
- Writing R scripts like linux bash files.
- What Is the Shebang (#!) Character Sequence in Linux?
- Where does the output of Rscript go?
- Create a file <shebang.R>.
#!/usr/bin/env Rscript print ("shebang works")
Then in the command line
chmod u+x shebang.R ./shebang.R
- Running R in batch mode on Linux
- optparse package. Check out its vignette.
- getopt: C-Like 'getopt' Behavior.
littler
Provides hash-bang (#!) capability for R
FAQs:
- Difference between Rscript and littler
- Whats the difference between Rscript and R CMD BATCH
- Why (or when) is Rscript (or littler) better than R CMD BATCH?
root@ed5f80320266:/# ls -l /usr/bin/{r,R*} # R 3.5.2 docker container -rwxr-xr-x 1 root root 82632 Jan 26 18:26 /usr/bin/r # binary, can be used for 'shebang' lines, r --help # Example: r --verbose -e "date()" -rwxr-xr-x 1 root root 8722 Dec 20 11:35 /usr/bin/R # text, R --help # Example: R -q -e "date()" -rwxr-xr-x 1 root root 14552 Dec 20 11:35 /usr/bin/Rscript # binary, can be used for 'shebang' lines, Rscript --help # It won't show the startup message when it is used in the command line. # Example: Rscript -e "date()"
We can install littler using two ways.
- install.packages("littler"). This will install the latest version but the binary 'r' program is only available under the package/bin directory (eg ~/R/x86_64-pc-linux-gnu-library/3.4/littler/bin/r). You need to create a soft link in order to access it globally.
- sudo apt install littler. This will install 'r' globally; however, the installed version may be old.
After the installation, vignette contains several examples. The off-line vignette has a table of contents. Nice! The web version of examples does not have the TOC.
r was not meant to run interactively like R. See man r.
RInside: Embed R in C++
See RInside
(From RInside documentation) The RInside package makes it easier to embed R in your C++ applications. There is no code you would execute directly from the R environment. Rather, you write C++ programs that embed R which is illustrated by some the included examples.
The included examples are armadillo, eigen, mpi, qt, standard, threads and wt.
To run 'make' when we don't have a global R, we should modify the file <Makefile>. Also if we just want to create one executable file, we can do, for example, 'make rinside_sample1'.
To run any executable program, we need to specify LD_LIBRARY_PATH variable, something like
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/brb/Downloads/R-3.0.2/lib
The real build process looks like (check <Makefile> for completeness)
g++ -I/home/brb/Downloads/R-3.0.2/include \ -I/home/brb/Downloads/R-3.0.2/library/Rcpp/include \ -I/home/brb/Downloads/R-3.0.2/library/RInside/include -g -O2 -Wall \ -I/usr/local/include \ rinside_sample0.cpp \ -L/home/brb/Downloads/R-3.0.2/lib -lR -lRblas -lRlapack \ -L/home/brb/Downloads/R-3.0.2/library/Rcpp/lib -lRcpp \ -Wl,-rpath,/home/brb/Downloads/R-3.0.2/library/Rcpp/lib \ -L/home/brb/Downloads/R-3.0.2/library/RInside/lib -lRInside \ -Wl,-rpath,/home/brb/Downloads/R-3.0.2/library/RInside/lib \ -o rinside_sample0
Hello World example of embedding R in C++.
#include <RInside.h> // for the embedded R via RInside int main(int argc, char *argv[]) { RInside R(argc, argv); // create an embedded R instance R["txt"] = "Hello, world!\n"; // assign a char* (string) to 'txt' R.parseEvalQ("cat(txt)"); // eval the init string, ignoring any returns exit(0); }
The above can be compared to the Hello world example in Qt.
#include <QApplication.h> #include <QPushButton.h> int main( int argc, char **argv ) { QApplication app( argc, argv ); QPushButton hello( "Hello world!", 0 ); hello.resize( 100, 30 ); app.setMainWidget( &hello ); hello.show(); return app.exec(); }
RFortran
RFortran is an open source project with the following aim:
To provide an easy to use Fortran software library that enables Fortran programs to transfer data and commands to and from R.
It works only on Windows platform with Microsoft Visual Studio installed:(
Call R from other languages
C
Error: “not resolved from current namespace” error, when calling C routines from R
Solution: add getNativeSymbolInfo() around your C/Fortran symbols. Search Google:r dyn.load not resolved from current namespace
JRI
ryp2
http://rpy.sourceforge.net/rpy2.html
Create a standalone Rmath library
R has many math and statistical functions. We can easily use these functions in our C/C++/Fortran. The definite guide of doing this is on Chapter 9 "The standalone Rmath library" of R-admin manual.
Here is my experience based on R 3.0.2 on Windows OS.
Create a static library <libRmath.a> and a dynamic library <Rmath.dll>
Suppose we have downloaded R source code and build R from its source. See Build_R_from_its_source. Then the following 2 lines will generate files <libRmath.a> and <Rmath.dll> under C:\R\R-3.0.2\src\nmath\standalone directory.
cd C:\R\R-3.0.2\src\nmath\standalone make -f Makefile.win
Use Rmath library in our code
set CPLUS_INCLUDE_PATH=C:\R\R-3.0.2\src\include set LIBRARY_PATH=C:\R\R-3.0.2\src\nmath\standalone # It is not LD_LIBRARY_PATH in above. # Created <RmathEx1.cpp> from the book "Statistical Computing in C++ and R" web site # http://math.la.asu.edu/~eubank/CandR/ch4Code.cpp # It is OK to save the cpp file under any directory. # Force to link against the static library <libRmath.a> g++ RmathEx1.cpp -lRmath -lm -o RmathEx1.exe # OR g++ RmathEx1.cpp -Wl,-Bstatic -lRmath -lm -o RmathEx1.exe # Force to link against dynamic library <Rmath.dll> g++ RmathEx1.cpp Rmath.dll -lm -o RmathEx1Dll.exe
Test the executable program. Note that the executable program RmathEx1.exe can be transferred to and run in another computer without R installed. Isn't it cool!
c:\R>RmathEx1 Enter a argument for the normal cdf: 1 Enter a argument for the chi-squared cdf: 1 Prob(Z <= 1) = 0.841345 Prob(Chi^2 <= 1)= 0.682689
Below is the cpp program <RmathEx1.cpp>.
//RmathEx1.cpp #define MATHLIB_STANDALONE #include <iostream> #include "Rmath.h" using std::cout; using std::cin; using std::endl; int main() { double x1, x2; cout << "Enter a argument for the normal cdf:" << endl; cin >> x1; cout << "Enter a argument for the chi-squared cdf:" << endl; cin >> x2; cout << "Prob(Z <= " << x1 << ") = " << pnorm(x1, 0, 1, 1, 0) << endl; cout << "Prob(Chi^2 <= " << x2 << ")= " << pchisq(x2, 1, 1, 0) << endl; return 0; }
Calling R.dll directly
See Chapter 8.2.2 of R Extensions. This is related to embedding R under Windows. The file <R.dll> on Windows is like <libR.so> on Linux.
Create HTML report
ReportingTools (Jason Hackney) from Bioconductor. See Genome->ReportingTools.
htmlTable package
The htmlTable package is intended for generating tables using HTML formatting. This format is compatible with Markdown when used for HTML-output. The most basic table can easily be created by just passing a matrix or a data.frame to the htmlTable-function.
- http://cran.r-project.org/web/packages/htmlTable/vignettes/general.html
- http://gforge.se/2014/01/fast-track-publishing-using-knitr-part-iv/
- News in htmlTable 2.0
formattable
- https://github.com/renkun-ken/formattable
- http://www.magesblog.com/2016/01/formatting-table-output-in-r.html
- Make Beautiful Tables with the Formattable Package
htmltab package
This package is NOT used to CREATE html report but EXTRACT html table.
ztable package
Makes zebra-striped tables (tables with alternating row colors) in LaTeX and HTML formats easily from a data.frame, matrix, lm, aov, anova, glm or coxph objects.
Create academic report
reports package in CRAN and in github repository. The youtube video gives an overview of the package.
Create pdf and epub files
# Idea: # knitr pdflatex # rnw -------> tex ----------> pdf library(knitr) knit("example.rnw") # create example.tex file
- A very simple example <002-minimal.Rnw> from yihui.name works fine on linux.
git clone https://github.com/yihui/knitr-examples.git
- <knitr-minimal.Rnw>. I have no problem to create pdf file on Windows but still cannot generate pdf on Linux from tex file. Some people suggested to run sudo apt-get install texlive-fonts-recommended to install missing fonts. It works!
To see a real example, check out DESeq2 package (inst/doc subdirectory). In addition to DESeq2, I also need to install DESeq, BiocStyle, airway, vsn, gplots, and pasilla packages from Bioconductor. Note that, it is best to use sudo/admin account to install packages.
Or starts with markdown file. Download the example <001-minimal.Rmd> and remove the last line of getting png file from internet.
# Idea: # knitr pandoc # rmd -------> md ----------> pdf git clone https://github.com/yihui/knitr-examples.git cd knitr-examples R -e "library(knitr); knit('001-minimal.Rmd')" pandoc 001-minimal.md -o 001-minimal.pdf # require pdflatex to be installed !!
To create an epub file (not success yet on Windows OS, missing figures on Linux OS)
# Idea: # knitr pandoc # rnw -------> tex ----------> markdown or epub library(knitr) knit("DESeq2.Rnw") # create DESeq2.tex system("pandoc -f latex -t markdown -o DESeq2.md DESeq2.tex")
Convert tex to epub
kable() for tables
Create Tables In LaTeX, HTML, Markdown And ReStructuredText
- https://rmarkdown.rstudio.com/lesson-7.html
- https://stackoverflow.com/questions/20942466/creating-good-kable-output-in-rstudio
- http://kbroman.org/knitr_knutshell/pages/figs_tables.html
- https://blogs.reed.edu/ed-tech/2015/10/creating-nice-tables-using-r-markdown/
- kableExtra package
Create Word report
Using the power of Word
How to go from R to nice tables in Microsoft Word
knitr + pandoc
- http://www.r-statistics.com/2013/03/write-ms-word-document-using-r-with-as-little-overhead-as-possible/
- http://www.carlboettiger.info/2012/04/07/writing-reproducibly-in-the-open-with-knitr.html
- http://rmarkdown.rstudio.com/articles_docx.html
It is better to create rmd file in RStudio. Rstudio provides a template for rmd file and it also provides a quick reference to R markdown language.
# Idea: # knitr pandoc # rmd -------> md --------> docx library(knitr) knit2html("example.rmd") #Create md and html files
and then
FILE <- "example" system(paste0("pandoc -o ", FILE, ".docx ", FILE, ".md"))
Note. For example reason, if I play around the above 2 commands for several times, the knit2html() does not work well. However, if I click 'Knit HTML' button on the RStudio, it then works again.
Another way is
library(pander) name = "demo" knit(paste0(name, ".Rmd"), encoding = "utf-8") Pandoc.brew(file = paste0(name, ".md"), output = paste0(-name, "docx"), convert = "docx")
Note that once we have used knitr command to create a md file, we can use pandoc shell command to convert it to different formats:
- A pdf file: pandoc -s report.md -t latex -o report.pdf
- A html file: pandoc -s report.md -o report.html (with the -c flag html files can be added easily)
- Openoffice: pandoc report.md -o report.odt
- Word docx: pandoc report.md -o report.docx
We can also create the epub file for reading on Kobo ereader. For example, download this file and save it as example.Rmd. I need to remove the line containing the link to http://i.imgur.com/RVNmr.jpg since it creates an error when I run pandoc (not sure if it is the pandoc version I have is too old). Now we just run these 2 lines to get the epub file. Amazing!
knit("example.Rmd") pandoc("example.md", format="epub")
PS. If we don't remove the link, we will get an error message (pandoc 1.10.1 on Windows 7)
> pandoc("Rmd_to_Epub.md", format="epub") executing pandoc -f markdown -t epub -o Rmd_to_Epub.epub "Rmd_to_Epub.utf8md" pandoc.exe: .\.\http://i.imgur.com/RVNmr.jpg: openBinaryFile: invalid argument (Invalid argument) Error in (function (input, format, ext, cfg) : conversion failed In addition: Warning message: running command 'pandoc -f markdown -t epub -o Rmd_to_Epub.epub "Rmd_to_Epub.utf8md"' had status 1
pander
Try pandoc[1] with a minimal reproducible example, you might give a try to my "pander" package [2] too:
library(pander) Pandoc.brew(system.file('examples/minimal.brew', package='pander'), output = tempfile(), convert = 'docx')
Where the content of the "minimal.brew" file is something you might have got used to with Sweave - although it's using "brew" syntax instead. See the examples of pander [3] for more details. Please note that pandoc should be installed first, which is pretty easy on Windows.
- http://johnmacfarlane.net/pandoc/
- http://rapporter.github.com/pander/
- http://rapporter.github.com/pander/#examples
R2wd
Use R2wd package. However, only 32-bit R is allowed and sometimes it can not produce all 'table's.
> library(R2wd) > wdGet() Loading required package: rcom Loading required package: rscproxy rcom requires a current version of statconnDCOM installed. To install statconnDCOM type installstatconnDCOM() This will download and install the current version of statconnDCOM You will need a working Internet connection because installation needs to download a file. Error in if (wdapp[["Documents"]][["Count"]] == 0) wdapp[["Documents"]]$Add() : argument is of length zero
The solution is to launch 32-bit R instead of 64-bit R since statconnDCOM does not support 64-bit R.
Convert from pdf to word
The best rendering of advanced tables is done by converting from pdf to Word. See http://biostat.mc.vanderbilt.edu/wiki/Main/SweaveConvert
rtf
Use rtf package for Rich Text Format (RTF) Output.
xtable
Package xtable will produce html output.
print(xtable(X), type="html")
If you save the file and then open it with Word, you will get serviceable results. I've had better luck copying the output from xtable and pasting it into Excel.
officer
- CRAN. Microsoft Word, Microsoft Powerpoint and HTML documents generation from R.
- The gist includes a comprehensive example that encompasses various elements such as sections, subsections, and tables. It also incorporates a detailed paragraph, along with visual representations created using base R plots and ggplots.
- Add a line space
doc <- body_add_par(doc, "") # Function to add n line spaces body_add_par_n <- function (doc, n) { for(i in 1:n){ doc <- body_add_par(doc, "") } return(doc) } body_add_par_n(3)
- Figures from the documentation of officeverse.
- See Data frame to word table?.
- See Office page for some code.
- How to read and create Word Documents in R where we can extracting tables from Word Documents.
x = read_docx("myfile.docx") content <- docx_summary(x) # a vector grep("nlme", content$text, ignore.case = T, value = T)
Powerpoint
- officer package (formerly ReporteRs). How to create powerpoint reports with R
- flextable (imports officer)
- R data.frame to table image for presentation.
library(gridExtra) grid.newpage() grid.table(mydf)
- Rmarkdown
PDF manipulation
R Graphs Gallery
- Romain François
- R Graph Catalog written using R + Shiny. The source code is available on Github.
- Forest plot. See the packages rmeta and forestplot. The forest plot can be used to plot the quantities like relative risk (with 95% CI) in survival data.
COM client or server
Client
- RDCOMClient where excel.link depends on it.
- How to Execute VBA Code in Excel via R using RDCOMClient
Server
Use R under proxy
http://support.rstudio.org/help/kb/faq/configuring-r-to-use-an-http-proxy
RStudio
- Github
- Installing RStudio (1.0.44) on Ubuntu will not install Java even the source code contains 37.5% Java??
- Preview
rstudio.cloud
Launch RStudio
Create .Rproj file
If you have an existing package that doesn't have an .Rproj file, you can use devtools::use_rstudio("path/to/package") to add it.
With an RStudio project file, you can
- Restore .RData into workspace at startup
- Save workspace to .RData on exit (or save.image("Robj.RData") & load("Robj.RData"))
- Always save history (even if no saving .RData, savehistory(".Rhistory") & loadhistory(".Rhistory"))
- etc
package search
https://github.com/RhoInc/CRANsearcher
Git
Visual Studio
R and Python support now built in to Visual Studio 2017
List files using regular expression
- Extension
list.files(pattern = "\\.txt$")
where the dot (.) is a metacharacter. It is used to refer to any character.
- Start with
list.files(pattern = "^Something")
Using Sys.glob()"' as
> Sys.glob("~/Downloads/*.txt") [1] "/home/brb/Downloads/ip.txt" "/home/brb/Downloads/valgrind.txt"
Hidden tool: rsync in Rtools
c:\Rtools\bin>rsync -avz "/cygdrive/c/users/limingc/Downloads/a.exe" "/cygdrive/c/users/limingc/Documents/" sending incremental file list a.exe sent 323142 bytes received 31 bytes 646346.00 bytes/sec total size is 1198416 speedup is 3.71 c:\Rtools\bin>
Unforunately, if the destination is a network drive, I could get a permission denied (13) error. See also rsync file permissions on windows.
Install rgdal package (geospatial Data) on ubuntu
Terminal
sudo apt-get install libgdal1-dev libproj-dev # https://stackoverflow.com/a/44389304 sudo apt-get install libgdal1i # Ubuntu 16.04 https://stackoverflow.com/a/12143411
R
install.packages("rgdal")
Install sf package
I got the following error even I have installed some libraries.
checking GDAL version >= 2.0.1... no configure: error: sf is not compatible with GDAL versions below 2.0.1
Then I follow the instruction here
sudo apt remove libgdal-dev sudo apt remove libproj-dev sudo apt remove gdal-bin sudo add-apt-repository ppa:ubuntugis/ubuntugis-stable sudo apt update sudo apt-cache policy libgdal-dev # Make sure a version >= 2.0 appears sudo apt install libgdal-dev # works on ubuntu 20.04 too # no need the previous lines
Database
RSQLite
- https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html
- https://github.com/rstats-db/RSQLite
Creating a new database:
library(DBI) mydb <- dbConnect(RSQLite::SQLite(), "my-db.sqlite") dbDisconnect(mydb) unlink("my-db.sqlite") # temporary database mydb <- dbConnect(RSQLite::SQLite(), "") dbDisconnect(mydb)
Loading data:
mydb <- dbConnect(RSQLite::SQLite(), "") dbWriteTable(mydb, "mtcars", mtcars) dbWriteTable(mydb, "iris", iris) dbListTables(mydb) dbListFields(con, "mtcars") dbReadTable(con, "mtcars")
Queries:
dbGetQuery(mydb, 'SELECT * FROM mtcars LIMIT 5') dbGetQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" < 4.6') dbGetQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" < :x', params = list(x = 4.6)) res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4") dbFetch(res)
Batched queries:
dbClearResult(rs) rs <- dbSendQuery(mydb, 'SELECT * FROM mtcars') while (!dbHasCompleted(rs)) { df <- dbFetch(rs, n = 10) print(nrow(df)) } dbClearResult(rs)
Multiple parameterised queries:
rs <- dbSendQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" = :x') dbBind(rs, param = list(x = seq(4, 4.4, by = 0.1))) nrow(dbFetch(rs)) #> [1] 4 dbClearResult(rs)
Statements:
dbExecute(mydb, 'DELETE FROM iris WHERE "Sepal.Length" < 4') #> [1] 0 rs <- dbSendStatement(mydb, 'DELETE FROM iris WHERE "Sepal.Length" < :x') dbBind(rs, param = list(x = 4.5)) dbGetRowsAffected(rs) #> [1] 4 dbClearResult(rs)
sqldf
Manipulate R data frames using SQL. Depends on RSQLite. A use of gsub, reshape2 and sqldf with healthcare data
RPostgreSQL
RMySQL
- http://datascienceplus.com/bringing-the-powers-of-sql-into-r/
- See here about the installation of the required package (libmysqlclient-dev) in Ubuntu.
MongoDB
- http://www.r-bloggers.com/r-and-mongodb/
- http://watson.nci.nih.gov/~sdavis/blog/rmongodb-using-R-with-mongo/
odbc
RODBC
DBI
dbplyr
- To use databases with dplyr, you need to first install dbplyr
- https://db.rstudio.com/dplyr/
- Five commonly used backends: RMySQL, RPostgreSQ, RSQLite, ODBC, bigrquery.
- http://www.datacarpentry.org/R-ecology-lesson/05-r-and-databases.html
Create a new SQLite database:
surveys <- read.csv("data/surveys.csv") plots <- read.csv("data/plots.csv") my_db_file <- "portal-database.sqlite" my_db <- src_sqlite(my_db_file, create = TRUE) copy_to(my_db, surveys) copy_to(my_db, plots) my_db
Connect to a database:
download.file(url = "https://ndownloader.figshare.com/files/2292171", destfile = "portal_mammals.sqlite", mode = "wb") library(dbplyr) library(dplyr) mammals <- src_sqlite("portal_mammals.sqlite")
Querying the database with the SQL syntax:
tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))
Querying the database with the dplyr syntax:
surveys <- tbl(mammals, "surveys") surveys %>% select(year, species_id, plot_id) head(surveys, n = 10) show_query(head(surveys, n = 10)) # show which SQL commands are actually sent to the database
Simple database queries:
surveys %>% filter(weight < 5) %>% select(species_id, sex, weight)
Laziness (instruct R to stop being lazy):
data_subset <- surveys %>% filter(weight < 5) %>% select(species_id, sex, weight) %>% collect()
Complex database queries:
plots <- tbl(mammals, "plots") plots # # The plot_id column features in the plots table surveys # The plot_id column also features in the surveys table # Join databases method 1 plots %>% filter(plot_id == 1) %>% inner_join(surveys) %>% collect()
NoSQL
nodbi: the NoSQL Database Connector
Github
R source
https://github.com/wch/r-source/ Daily update, interesting, should be visited every day. Clicking 1000+ commits to look at daily changes.
If we are interested in a certain branch (say 3.2), look for R-3-2-branch.
R packages (only) source (metacran)
- https://github.com/cran/ by Gábor Csárdi, the author of igraph software.
Bioconductor packages source
Announcement, https://github.com/Bioconductor-mirror
Send local repository to Github in R by using reports package
http://www.youtube.com/watch?v=WdOI_-aZV0Y
My collection
- https://github.com/arraytools
- https://gist.github.com/4383351 heatmap using leukemia data
- https://gist.github.com/4382774 heatmap using sequential data
- https://gist.github.com/4484270 biocLite
How to download
Clone ~ Download.
- Command line
git clone https://gist.github.com/4484270.git
This will create a subdirectory called '4484270' with all cloned files there.
- Within R
library(devtools) source_gist("4484270")
or First download the json file from
https://api.github.com/users/MYUSERLOGIN/gists
and then
library(RJSONIO) x <- fromJSON("~/Downloads/gists.json") setwd("~/Downloads/") gist.id <- lapply(x, "[[", "id") lapply(gist.id, function(x){ cmd <- paste0("git clone https://gist.github.com/", x, ".git") system(cmd) })
Jekyll
An Easy Start with Jekyll, for R-Bloggers
Connect R with Arduino
- https://zhuhao.org/post/connect-arduino-chips-with-r/
- http://lamages.blogspot.com/2012/10/connecting-real-world-to-r-with-arduino.html
- http://jean-robert.github.io/2012/11/11/thermometer-R-using-Arduino-Java.html
- http://bio7.org/?p=2049
- http://www.rforge.net/Arduino/svn.html
Android App
- R Instructor $4.84
- Statistical Distribution (Not R related app)
- Data-driven Introspection of my Android Mobile usage in R
Common plots tips
Create an empty plot
plot.new()
Overlay plots
How to Overlay Plots in R-Quick Guide with Example.
#Step1:-create scatterplot plot(x1, y1) #Step 2:-overlay line plot lines(x2, y2) #Step3:-overlay scatterplot points(x2, y2)
Save the par() and restore it
Example 1: Don't use old.par <- par() directly. no.readonly = FALSE by default. * The `no.readonly = TRUE` argument in the par() function in R is used to get the full list of graphical parameters that can be restored.
- When you call `par()` with no arguments or `par(no.readonly = TRUE)`, it returns an invisible named list of all the graphical parameters. This includes both parameters that can be set and those that are read-only.
- If we use par(old.par) where old.par <- par(), we will get several warning messages like 'In par(op) : graphical parameter "cin" cannot be set'.
old.par <- par(no.readonly = TRUE); par(mar = c(5, 4, 4, 2) - 2) # OR in one step old.par <- par(mar = c(5, 4, 4, 2) - 2) ## do plotting stuff with new settings par(old.par)
Example 2: Use it inside a function with the on.exit(0 function.
ex <- function() { old.par <- par(no.readonly = TRUE) # all par settings which # could be changed. on.exit(par(old.par)) ## ... do lots of par() settings and plots ## ... invisible() #-- now, par(old.par) will be executed }
Example 3: It seems par() inside a function will affect the global environment. But if we use dev.off(), it will reset all parameters.
ex <- function() { par(mar=c(5,4,4,1)) } ex() par()$mar
ex = function() { png("~/Downloads/test.png"); par(mar=c(5,4,4,1)); dev.off()} ex() par()$mar
Grouped boxplots
- Step by step to create a grouped boxplots
- 'at' parameter in boxplot() to change the equal spaced boxplots
- embed par(mar=) in boxplot()
- mtext(line=) to solve the problem the xlab overlapped with labels.
- ggplot2 approach (Hint: facet_grid is used)
Weather Time Line
The plot looks similar to a boxplot though it is not. See a screenshot on Android by Sam Ruston.
Horizontal bar plot
library(ggplot2) dtf <- data.frame(x = c("ETB", "PMA", "PER", "KON", "TRA", "DDR", "BUM", "MAT", "HED", "EXP"), y = c(.02, .11, -.01, -.03, -.03, .02, .1, -.01, -.02, 0.06)) ggplot(dtf, aes(x, y)) + geom_bar(stat = "identity", aes(fill = x), show.legend = FALSE) + coord_flip() + xlab("") + ylab("Fold Change")
Include bar values in a barplot
- https://stats.stackexchange.com/questions/3879/how-to-put-values-over-bars-in-barplot-in-r.
- barplot(), text() and axis() functions. The data can be from a table() object.
- How to label a barplot bar with positive and negative bars with ggplot2
Use text().
Or use geom_text() if we are using the ggplot2 package. See an example here or this.
For stacked barplot, see this post.
Grouped barplots
- https://www.r-graph-gallery.com/barplot/, https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2/ (simpliest, no error bars)
library(ggplot2) # mydata <- data.frame(OUTGRP, INGRP, value) ggplot(mydata, aes(fill=INGRP, y=value, x=OUTGRP)) + geom_bar(position="dodge", stat="identity")
- https://datascienceplus.com/building-barplots-with-error-bars/. The error bars define 2 se (95% interval) for the black-and-white version and 1 se (68% interval) for ggplots. Be careful.
> 1 - 2*(1-pnorm(1)) [1] 0.6826895 > 1 - 2*(1-pnorm(1.96)) [1] 0.9500042
- two bars in one factor (stack). The data can be a 2-dim matrix with numerical values.
- two bars in one factor, Drawing multiple barplots on a graph in R (next to each other)
- Three variables barplots
- More alternatives (not done by R)
Unicode symbols
Mind reader game, and Unicode symbols
Math expression
- ?plotmath
- https://stackoverflow.com/questions/4973898/combining-paste-and-expression-functions-in-plot-labels
- Some cases
- Use expression() function
- Don't need the backslash; use eta instead of \eta. eta will be recognized as a special keyword in expression()
- Use parentheses instead of curly braces; use hat(eta) instead of hat{eta}
- Summary: use expression(hat(eta)) instead of expression(\hat{\eta})
- [] means subscript, while ^ means superscript. See Add Subscript and Superscript to Plot in R
- Spacing can be done with ~.
- Mix math symbols and text using paste()
- Using substitute() and paste() if we need to substitute text (this part is advanced)
# Expressions plot(x,y, xlab = expression(hat(x)[t]), ylab = expression(phi^{rho + a}), main = "Pure Expressions") # Superscript plot(1:10, main = expression("My Title"^2)) # Subscript plot(1:10, main = expression("My Title"[2])) # Expressions with Spacing # '~' is to add space and '*' is to squish characters together plot(1:10, xlab= expression(Delta * 'C')) plot(x,y, xlab = expression(hat(x)[t] ~ z ~ w), ylab = expression(phi^{rho + a} * z * w), main = "Pure Expressions with Spacing") # Expressions with Text plot(x,y, xlab = expression(paste("Text here ", hat(x), " here ", z^rho, " and here")), ylab = expression(paste("Here is some text of ", phi^{rho})), main = "Expressions with Text") # Substituting Expressions plot(x,y, xlab = substitute(paste("Here is ", pi, " = ", p), list(p = py)), ylab = substitute(paste("e is = ", e ), list(e = ee)), main = "Substituted Expressions")
Impose a line to a scatter plot
- abline + lsfit # least squares
plot(cars) abline(lsfit(cars[, 1], cars[, 2])) # OR abline(lm(cars[,2] ~ cars[,1]))
- abline + line # robust line fitting
plot(cars) (z <- line(cars)) abline(coef(z), col = 'green')
- lines
plot(cars) fit <- lm(cars[,2] ~ cars[,1]) lines(cars[,1], fitted(fit), col="blue") lines(stats::lowess(cars), col='red')
How to actually make a quality scatterplot in R: axis(), mtext()
How to actually make a quality scatterplot in R
3D scatterplot
- Scatterplot3d: 3D graphics - R software and data visualization. how to add legend to scatterplot3d in R and consider xpd=TRUE.
- R web > plotly
Rotating x axis labels for barplot
https://stackoverflow.com/questions/10286473/rotating-x-axis-labels-in-r-for-barplot
barplot(mytable,main="Car makes",ylab="Freqency",xlab="make",las=2)
Set R plots x axis to show at y=0
https://stackoverflow.com/questions/3422203/set-r-plots-x-axis-to-show-at-y-0
plot(1:10, rnorm(10), ylim=c(0,10), yaxs="i")
Different colors of axis labels in barplot
See Vary colors of axis labels in R based on another variable
Method 1: Append labels for the 2nd, 3rd, ... color gradually because 'col.axis' argument cannot accept more than one color.
tN <- table(Ni <- stats::rpois(100, lambda = 5)) r <- barplot(tN, col = rainbow(20)) axis(1, 1, LETTERS[1], col.axis="red", col="red") axis(1, 2, LETTERS[2], col.axis="blue", col = "blue")
Method 2: text() which can accept multiple colors in 'col' parameter but we need to find out the (x, y) by ourselves.
barplot(tN, col = rainbow(20), axisnames = F) text(4:6, par("usr")[3]-2 , LETTERS[4:6], col=c("black","red","blue"), xpd=TRUE)
Use text() to draw labels on X/Y-axis including rotation
- adj = 1 means top/right alignment. For left-bottom alignment, set adj = 0. The default is to center the text. [?text
- par("usr") gives the extremes of the user coordinates of the plotting region of the form c(x1, x2, y1, y2).
- par("usr") is determined *after* a plot has been created
- Example of using the "usr" parameter
- https://datascienceplus.com/building-barplots-with-error-bars/
par(mar = c(5, 6, 4, 5) + 0.1) plot(..., xaxt = "n") # "n" suppresses plotting of the axis; need mtext() and axis() to supplement text(x = barCenters, y = par("usr")[3] - 1, srt = 45, adj = 1, labels = myData$names, xpd = TRUE)
Vertically stacked plots with the same x axis
Include labels on the top axis/margin: axis() and mtext()
plot(1:4, rnorm(4), axes = FALSE) axis(3, at=1:4, labels = LETTERS[1:4], tick = FALSE, line = -0.5) # las, cex.axis box() mtext("Groups selected", cex = 0.8, line = 1.5) # default side = 3
See also 15_Questions_All_R_Users_Have_About_Plots
This can be used to annotate each plot with the script name, date, ...
mtext(text=paste("Prepared on", format(Sys.time(), "%d %B %Y at %H:%M")), adj=.99, # text align to right cex=.75, side=3, las=1, line=2)
ggplot2 uses breaks instead of at parameter. See ggplot2 → Add axis on top or right hand side, ggplot2 → scale_x_continus(name, breaks, labels) and the scale_continuous documentation.
Legend tips
Increase/decrease legend font size cex & ggplot2 package case.
plot(rnorm(100)) # op <- par(cex=2) legend("topleft", legend = 1:4, col=1:4, pch=1, lwd=2, lty = 1, cex =2) # par(op)
legend inset. Default is 0. % (from 0 to 1) to draw the legend away from x and y axis. The inset argument with negative values moves the legend outside the plot.
legend("bottomright", inset=.05, )
legend without a box
legend(, bty = "n")
Add a legend title
legend(, title = "")
Add a common legend to multiple plots. Use the layout function.
Superimpose a density plot or any curves
Use lines().
Example 1
plot(cars, main = "Stopping Distance versus Speed") lines(stats::lowess(cars)) plot(density(x), col = "#6F69AC", lwd = 3) lines(density(y), col = "#95DAC1", lwd = 3) lines(density(z), col = "#FFEBA1", lwd = 3)
Example 2
require(survival) n = 10000 beta1 = 2; beta2 = -1 lambdaT = 1 # baseline hazard lambdaC = 2 # hazard of censoring set.seed(1234) x1 = rnorm(n,0) x2 = rnorm(n,0) # true event time T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2)) C <- rweibull(n, shape=1, scale=lambdaC) time = pmin(T,C) status <- 1*(T <= C) status2 <- 1-status plot(survfit(Surv(time, status2) ~ 1), ylab="Survival probability", main = 'Exponential censoring time') xseq <- seq(.1, max(time), length =100) func <- function(x) 1-pweibull(x, shape = 1, scale = lambdaC) lines(xseq, func(xseq), col = 'red') # survival function of Weibull
Example 3. Use ggplot(df, aes(x = x, color = factor(grp))) + geom_density(). Then each density curve will represent data from each "grp".
log scale
If we set y-axis to use log-scale, then what we display is the value log(Y) or log10(Y) though we still label the values using the input. For example, when we plot c(1, 10, 100) using the log scale, it is like we draw log10(c(1, 10, 100)) = c(0,1,2) on the plot but label the axis using the true values c(1, 10, 100).
Custom scales
Using custom scales with the 'scales' package
Time series
Time series stock price plot
- http://blog.revolutionanalytics.com/2015/08/plotting-time-series-in-r.html (ggplot2, xts, dygraphs)
- Visualize your Portfolio’s Performance and Generate a Nice Report with R
- https://timelyportfolio.github.io/rCharts_time_series/history.html
library(quantmod) getSymbols("AAPL") getSymbols("IBM") # similar to AAPL getSymbols("CSCO") # much smaller than AAPL, IBM getSymbols("DJI") # Dow Jones, huge chart_Series(Cl(AAPL), TA="add_TA(Cl(IBM), col='blue', on=1); add_TA(Cl(CSCO), col = 'green', on=1)", col='orange', subset = '2017::2017-08') tail(Cl(DJI))
tidyquant: Getting stock data
The 'largest stock profit or loss' puzzle: efficient computation in R
Timeline plot
- https://stackoverflow.com/questions/20695311/chronological-timeline-with-points-in-time-and-format-date
- vistime - Pretty Timelines in R
Clockify
Circular plot
- http://freakonometrics.hypotheses.org/20667 which uses circlize package; see also the ComplexHeatmap package.
- https://www.biostars.org/p/17728/
- RCircos package from CRAN.
- OmicCircos from Bioconductor.
Word cloud
- Text mining and word cloud fundamentals in R : 5 simple steps you should know
- 7 Alternatives to Word Clouds for Visualizing Long Lists of Data
- Data + Art STEAM Project: Initial Results
- ggwordcloud
Text mining
- tm package. It was used by R code of An estimate of the science-wise false discovery rate and application to the top medical literature.
World map
Visualising SSH attacks with R (rworldmap and rgeolocate packages)
Diagram/flowchart/Directed acyclic diagrams (DAGs)
DiagrammeR
- Graphviz and DiagrammeR
- http://rich-iannone.github.io/DiagrammeR/,
- rmarkdown
- graphviz and mermaid doc and examples
- https://donlelek.github.io/2015-03-31-dags-with-r/
- Data-driven flowcharts in R using DiagrammeR
diagram
Functions for Visualising Simple Graphs (Networks), Plotting Flow Diagrams
DAGitty (browser-based and R package)
dagR
Gmisc
Concept Maps
concept-maps where the diagrams are generated from https://app.diagrams.net/.
flow
flow, How To Draw Flow Diagrams In R
Venn Diagram
hexbin plot
- How to create a hexbin chart in R
- hextri: Hexbin Plots with Triangles. See an example on this https://www.pnas.org/content/117/48/30266#F4 paper] about the postpi method.
Bump chart/Metro map
https://dominikkoch.github.io/Bump-Chart/
Amazing/special plots
See Amazing plot.
Google Analytics
GAR package
http://www.analyticsforfun.com/2015/10/query-your-google-analytics-data-with.html
Linear Programming
http://www.r-bloggers.com/modeling-and-solving-linear-programming-with-r-free-book/
Linear Algebra
- Elegant linear algebra in R with the Matrix package. Matrix package is used.
- Linear Algebra for Machine Learning and Deep Learning in R. MASS library is used.
Amazon Alexa
R and Singularity
https://rviews.rstudio.com/2017/03/29/r-and-singularity/
Teach kids about R with Minecraft
http://blog.revolutionanalytics.com/2017/06/teach-kids-about-r-with-minecraft.html
Secure API keys
Securely store API keys in R scripts with the "secret" package
Credentials and secrets
How to manage credentials and secrets safely in R
Hide a password
keyring package
- https://cran.r-project.org/web/packages/keyring/index.html
- How to hide a password in R with the Keyring package
getPass
Vision and image recognition
- https://www.stoltzmaniac.com/google-vision-api-in-r-rooglevision/ Google vision API IN R] – RoogleVision
- Computer Vision Algorithms for R users and https://github.com/bnosac/image
Creating a Dataset from an Image
Creating a Dataset from an Image in R Markdown using reticulate
Turn pictures into coloring pages
https://gist.github.com/jeroen/53a5f721cf81de2acba82ea47d0b19d0
Numerical optimization
CRAN Task View: Numerical Mathematics, CRAN Task View: Optimization and Mathematical Programming
- uniroot: One Dimensional Root (Zero) Finding. This is used in simulating survival data for predefined censoring rate
- optimize: One Dimensional Optimization
- optim: General-purpose optimization based on Nelder–Mead, quasi-Newton and conjugate-gradient algorithms.
- constrOptim: Linearly Constrained Optimization
- nlm: Non-Linear Minimization
- nls: Nonlinear Least Squares
- torch for optimization. L-BFGS optimizer.
Ryacas: R Interface to the 'Yacas' Computer Algebra System
Doing Maths Symbolically: R as a Computer Algebra System (CAS)
Game
- R Programming with Minecraft
- pixelpuzzle
- Interactive pixel art in R with {pixeltrix}
- Shiny math game
- mazing: Utilities for Making and Plotting Mazes
- snake which is based on raylibr
Music
SAS
sasMap Static code analysis for SAS scripts
R packages
Tricks
Getting help
- http://stackoverflow.com/questions/tagged/r and R page contains resources.
- https://stat.ethz.ch/pipermail/r-help/
- https://stat.ethz.ch/pipermail/r-devel/
Better Coder/coding, best practices
- http://www.mango-solutions.com/wp/2015/10/10-top-tips-for-becoming-a-better-coder/
- Writing Good R Code and Writing Well
- R Code – Best practices
- What best practices do you use for programming in R?
- Best practices in statistical computing Sanchez 2021
E-notation
6.022E23 (or 6.022e23) is equivalent to 6.022×10^23
Getting user's home directory
See What are HOME and working directories?
# Windows normalizePath("~") # "C:\\Users\\brb\\Documents" Sys.getenv("R_USER") # "C:/Users/brb/Documents" Sys.getenv("HOME") # "C:/Users/brb/Documents" # Mac normalizePath("~") # [1] "/Users/brb" Sys.getenv("R_USER") # [1] "" Sys.getenv("HOME") # "/Users/brb" # Linux normalizePath("~") # [1] "/home/brb" Sys.getenv("R_USER") # [1] "" Sys.getenv("HOME") # [1] "/home/brb"
tempdir()
- The path is a per-session temporary directory. On parallel use, R processes forked by functions such as mclapply and makeForkCluster in package parallel share a per-session temporary directory.
- Set temporary folder for R in Rstudio server
Distinguish Windows and Linux/Mac, R.Version()
identical(.Platform$OS.type, "unix") returns TRUE on Mac and Linux.
get_os <- function(){ sysinf <- Sys.info() if (!is.null(sysinf)){ os <- sysinf['sysname'] if (os == 'Darwin') os <- "osx" } else { ## mystery machine os <- .Platform$OS.type if (grepl("^darwin", R.version$os)) os <- "osx" if (grepl("linux-gnu", R.version$os)) os <- "linux" } tolower(os) }
names(R.Version()) # [1] "platform" "arch" "os" "system" # [5] "status" "major" "minor" "year" # [9] "month" "day" "svn rev" "language" # [13] "version.string" "nickname" getRversion() # [1] ‘4.3.0’
Rprofile.site, Renviron.site (all platforms) and Rconsole (Windows only)
- https://cran.r-project.org/doc/manuals/r-release/R-admin.html (Rprofile.site). Put R statements.
- https://cran.r-project.org/doc/manuals/r-release/R-exts.html (Renviron.site). Define environment variables.
- https://cran.r-project.org/doc/manuals/r-release/R-intro.html (Rprofile.site, Renviron.site, Rconsole (Windows only))
- How to store and use webservice keys and authentication details
- Use your .Rprofile to give you important notifications
- *R for Enterprise: Understanding R’s Startup
- *Managing R with .Rprofile, .Renviron, Rprofile.site, Renviron.site, rsession.conf, and repos.conf
If we like to install R packages to a personal directory, follow this. Just add the line
R_LIBS_SITE=F:/R/library
to the file R_HOME/etc/x64/Renviron.site. In R, run Sys.getenv("R_LIBS_SITE") or Sys.getenv("R_LIBS_USER") to query the environment variable. See Environment Variables.
What is the best place to save Rconsole on Windows platform
Put/create the file <Rconsole> under C:/Users/USERNAME/Documents folder so no matter how R was upgraded/downgraded, it always find my preference.
My preferred settings:
- Font: Consolas (it will be shown as "TT Consolas" in Rconsole)
- Size: 12
- background: black
- normaltext: white
- usertext: GreenYellow or orange (close to RStudio's Cobalt theme) or sienna1 or SpringGreen or tan1 or yellow
and others (default options)
- pagebg: white
- pagetext: navy
- highlight: DarkRed
- dataeditbg: white
- dataedittext: navy (View() function)
- dataedituser: red
- editorbg: white (edit() function)
- editortext: black
A copy of the Rconsole is saved in github.
How R starts up
https://rstats.wtf/r-startup.html
startup - Friendly R Startup Configuration
https://github.com/henrikbengtsson/startup
Saving and loading history automatically: .Rprofile & local()
- savehistory("filename"). It will save everything from the beginning to the command savehistory() to a text file.
- .Rprofile will automatically be loaded when R has started from that directory
## Ensure the utils package is loaded if (!"utils" %in% loadedNamespaces()) { library(utils) } ## Path to save/load history history_file <- "~/.Rhistory" # Change this path if needed ## Load history on startup if (file.exists(history_file)) { utils::loadhistory(history_file) } ## Save history on exit .Last <- function() { try(utils::savehistory(history_file)) }
- Don't do things in your .Rprofile that affect how R code runs, such as loading a package like dplyr or ggplot or setting an option such as stringsAsFactors = FALSE. See Project-oriented workflow.
- .Rprofile has been created/used by the packrat package to restore a packrat environment. See the packrat/init.R file and R packages → packrat.
- Customizing Startup from R in Action, Fun with .Rprofile and customizing R startup
- You can also place a .Rprofile file in any directory that you are going to run R from or in the user home directory.
- At startup, R will source the Rprofile.site file. It will then look for a .Rprofile file to source in the current working directory. If it doesn't find it, it will look for one in the user's home directory.
options(continue=" ") # default is "+ " options(prompt="R> ", continue=" ") options(editor="nano") # default is "vi" on Linux # options(htmlhelp=TRUE) local({r <- getOption("repos") r["CRAN"] <- "https://cran.rstudio.com" options(repos=r)}) .First <- function(){ # library(tidyverse) cat("\nWelcome at", date(), "\n") } .Last <- function(){ cat("\nGoodbye at ", date(), "\n") }
- https://stackoverflow.com/questions/16734937/saving-and-loading-history-automatically
- The history file will always be read from the $HOME directory and the history file will be overwritten by a new session. These two problems can be solved if we define R_HISTFILE system variable.
- local() function can be used in .Rprofile file to set up the environment even no new variables will be created (change repository, install packages, load libraries, source R files, run system() function, file/directory I/O, etc)
Linux or Mac
In ~/.profile or ~/.bashrc I put:
export R_HISTFILE=~/.Rhistory
In ~/.Rprofile I put:
if (interactive()) { if (.Platform$OS.type == "unix") .First <- function() try(utils::loadhistory("~/.Rhistory")) .Last <- function() try(savehistory(file.path(Sys.getenv("HOME"), ".Rhistory"))) }
Windows
If you launch R by clicking its icon from Windows Desktop, the R starts in C:\User\$USER\Documents directory. So we can create a new file .Rprofile in this directory.
if (interactive()) { .Last <- function() try(savehistory(file.path(Sys.getenv("HOME"), ".Rhistory"))) }
Disable "Save workspace image?" prompt when exit R?
How to disable "Save workspace image?" prompt in R?
R release versions
rversions: Query the main 'R' 'SVN' repository to find the released versions & dates.
getRversion()
getRversion() [1] ‘4.3.0’
Detect number of running R instances in Windows
- http://stackoverflow.com/questions/15935931/detect-number-of-running-r-instances-in-windows-within-r
C:\Program Files\R>tasklist /FI "IMAGENAME eq Rscript.exe" INFO: No tasks are running which match the specified criteria. C:\Program Files\R>tasklist /FI "IMAGENAME eq Rgui.exe" Image Name PID Session Name Session# Mem Usage ============================================================================ Rgui.exe 1096 Console 1 44,712 K C:\Program Files\R>tasklist /FI "IMAGENAME eq Rserve.exe" Image Name PID Session Name Session# Mem Usage ============================================================================ Rserve.exe 6108 Console 1 381,796 K
In R, we can use
> system('tasklist /FI "IMAGENAME eq Rgui.exe" ', intern = TRUE) [1] "" [2] "Image Name PID Session Name Session# Mem Usage" [3] "============================================================================" [4] "Rgui.exe 1096 Console 1 44,804 K" > length(system('tasklist /FI "IMAGENAME eq Rgui.exe" ', intern = TRUE))-3
Editor
http://en.wikipedia.org/wiki/R_(programming_language)#Editors_and_IDEs
- Emacs + ESS. The ESS is useful in the case I want to tidy R code (the tidy_source() function in the formatR package sometimes gives errors; eg when I tested it on an R file like <GetComparisonResults.R> from BRB-ArrayTools v4.4 stable).
- Edit the file C:\Program Files\GNU Emacs 23.2\site-lisp\site-start.el with something like
(setq-default inferior-R-program-name "c:/program files/r/r-2.15.2/bin/i386/rterm.exe")
- Using Emacs for R 2022
- Rstudio - editor/R terminal/R graphics/file browser/package manager. The new version (0.98) also provides a new feature for debugging step-by-step. See also RStudio Tricks
- geany - I like the feature that it shows defined functions on the side panel even for R code. RStudio can also do this (see the bottom of the code panel).
- Rgedit which includes a feature of splitting screen into two panes and run R in the bottom panel. See here.
- Komodo IDE with browser preview http://www.youtube.com/watch?v=wv89OOw9roI at 4:06 and http://docs.activestate.com/komodo/4.4/editor.html
GUI for Data Analysis
Update to Data Science Software Popularity 6/7/2023
BlueSky Statistics
- https://www.blueskystatistics.com/Default.asp
- A Comparative Review of the BlueSky Statistics GUI for R
Rcmdr
http://cran.r-project.org/web/packages/Rcmdr/index.html. After loading a dataset, click Statistics -> Fit models. Then select Linear regression, Linear model, GLM, Multinomial logit model, Ordinal regression model, Linear mixed model, and Generalized linear mixed model. However, Rcmdr does not include, e.g. random forest, SVM, glmnet, et al.
Deducer
http://cran.r-project.org/web/packages/Deducer/index.html
jamovi
Scope
See
- Assignments within functions in the An Introduction to R manual.
source()
- source() assigns to the global environment, not the calling environment, which might not be what you want/expect. Instead, use source("file.R", local = TRUE) to avoid assigning functions and variables to the global environment.
- source() does not work like C's preprocessor where statements in header files will be literally inserted into the code. It does not work when you define a variable in a function but want to use it outside the function (even through source())
## foo.R ## cat(ArrayTools, "\n") ## End of foo.R # 1. Error predict <- function() { ArrayTools <- "C:/Program Files" # or through load() function source("foo.R") # or through a function call; foo() } predict() # Object ArrayTools not found # 2. OK. Make the variable global predict <- function() { ArrayTools <<- "C:/Program Files' source("foo.R") } predict() ArrayTools # 3. OK. Create a global variable ArrayTools <- "C:/Program Files" predict <- function() { source("foo.R") } predict()
Note that any ordinary assignments done within the function are local and temporary and are lost after exit from the function.
Example 1.
> ttt <- data.frame(type=letters[1:5], JpnTest=rep("999", 5), stringsAsFactors = F) > ttt type JpnTest 1 a 999 2 b 999 3 c 999 4 d 999 5 e 999 > jpntest <- function() { ttt$JpnTest[1] ="N5"; print(ttt)} > jpntest() type JpnTest 1 a N5 2 b 999 3 c 999 4 d 999 5 e 999 > ttt type JpnTest 1 a 999 2 b 999 3 c 999 4 d 999 5 e 999
Example 2. How can we set global variables inside a function? The answer is to use the "<<-" operator or assign(, , envir = .GlobalEnv) function.
Other resource: Advanced R by Hadley Wickham.
Example 3. Writing functions in R, keeping scoping in mind
New environment
- http://adv-r.had.co.nz/Environments.html.
- Environments in R
- load(), attach(), with().
- How to switch to a new environment and stick into it? seems not possible!
Run the same function on a bunch of R objects
mye = new.env() load(<filename>, mye) for(n in names(mye)) n = as_tibble(mye[[n]])
Just look at the contents of rda file without saving to anywhere (?load)
local({ load("myfile.rda") ls() })
Or use attach() which is a wrapper of load(). It creates an environment and slots it into the list right after the global environment, then populates it with the objects we're attaching.
attach("all.rda") # safer and will warn about masked objects w/ same name in .GlobalEnv ls(pos = 2) ## also typically need to cleanup the search path: detach("file:all.rda")
If we want to read data from internet, load() works but not attach().
con <- url("http://some.where.net/R/data/example.rda") ## print the value to see what objects were created. print(load(con)) close(con) # Github example # https://stackoverflow.com/a/62954840
myEnv <- new.env() source("some_other_script.R", local=myEnv) attach(myEnv, name="sourced_scripts") search() ls(2) ls(myEnv) with(myEnv, print(x))
str( , max) function
Use max.level parameter to avoid a long display of the structure of a complex R object. Use give.head = FALSE to hide the attributes. See ?str
If we use str() on a function like str(lm), it is equivalent to args(lm)
For a complicated list object, it is useful to use the max.level argument; e.g. str(, max.level = 1)
For a large data frame, we can use the tibble() function; e.g. mydf %>% tibble()
tidy() function
broom::tidy() provides a simplified form of an R object (obtained from running some analysis). See here.
View all objects present in a package, ls()
https://stackoverflow.com/a/30392688. In the case of an R package created by Rcpp.package.skeleton("mypackage"), we will get
> devtools::load_all("mypackage") > search() [1] ".GlobalEnv" "devtools_shims" "package:mypackage" [4] "package:stats" "package:graphics" "package:grDevices" [7] "package:utils" "package:datasets" "package:methods" [10] "Autoloads" "package:base" > ls("package:mypackage") [1] "_mypackage_rcpp_hello_world" "evalCpp" "library.dynam.unload" [4] "rcpp_hello_world" "system.file"
Note that the first argument of ls() (or detach()) is used to specify the environment. It can be
- an integer (the position in the ‘search’ list);
- the character string name of an element in the search list;
- an explicit ‘environment’ (including using ‘sys.frame’ to access the currently active function calls).
Speedup R code
- Strategies to speedup R code from DataScience+
Profiler
- Understand Code Performance with the profiler (Video)
- xrprof package, Top R tips and news from RStudio Global 2021
&& vs &
See https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/Logic.
- The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The return is a vector.
- The longer form evaluates left to right examining only the first element of each vector. The return is one value.
- The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined.
- The idea of the longer form && in R seems to be the same as the && operator in linux shell; see here.
- Single or double?: AND operator and OR operator in R. The confusion might come from the inconsistency when choosing these operators in different languages. For example, in C, & performs bitwise AND, while && does Boolean logical AND.
- Think of && as a stricter &
c(T,F,T) & c(T,T,T) # [1] TRUE FALSE TRUE c(T,F,T) && c(T,T,T) # [1] TRUE c(T,F,T) && c(F,T,T) # [1] FALSE c(T,F,T) && c(NA,T,T) # [1] NA
# Assume 'b' is not defined > if (TRUE && b==3) cat("end") Error: object 'b' not found > if (FALSE && b==3) cat("end") > # No error since the 2nd condition is never evaluated
It's useful in functions(). We don't need nested if statements. In this case if 'arg' is missing, the argument 'L' is not needed so there is not syntax error.
> foo <- function(arg, L) { # Suppose 'L' is meaningful only if 'arg' is provided # # Evaluate 'L' only if 'arg' is provided # if (!missing(arg) && L) { print("L is true") } else { print("Either arg is missing or L is FALSE") } } > foo() [1] "arg is missing or L is FALSE" > foo("a", F) [1] "arg is missing or L is FALSE" > foo("a", T) [1] "L is true"
Other examples: && is more flexible than &.
nspot <- ifelse(missing(rvm) || !rvm, nrow(exprTrain), sum(filter)) if (!is.null(exprTest) && any(is.na(exprTest))) { ... }
for-loop, control flow
- ?Control
- next can be used to skip the rest of the inner-most loop
- ifelse() Function
Vectorization
- Vectorization (Mathematics) from wikipedia
- Array programming from wikipedia
- Single instruction, multiple data (SIMD) from wikipedia
- What is vectorization stackoverflow
- http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
- https://github.com/vsbuffalo/devnotes/wiki/R-and-Vectorization
- Why Vectorize? statcompute.wordpress.com
- Beware of Vectorize from Jim Hester
- matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors). E.g. col / rowMedians(), col / rowRanks(), and col / rowSds(). Benchmark reports.
sapply vs vectorization
Speed test: sapply vs vectorization
lapply vs for loop
- lapply vs for loop - Performance R
- https://code-examples.net/en/q/286e03a
- R: are *apply loops faster than for loops?
split() and sapply()
split() can be used to split a vector, columns or rows. See How to split a data frame?
- Split divides the data in the vector or data frame x into the groups defined by f. The syntax is
split(x, f, drop = FALSE, …)
- split() + cut(). How to Split Data into Equal Sized Groups in R: A Comprehensive Guide for Beginners
- Split a vector into chunks. split() returns a vector/indices and the indices can be used in lapply() to subset the data. Useful for the split() + lapply() + do.call() or split() + sapply() operations.
d <- 1:10 chunksize <- 4 ceiling(1:10/4) # [1] 1 1 1 1 2 2 2 2 3 3 split(d, ceiling(seq_along(d)/chunksize)) # $`1` # [1] 1 2 3 4 # # $`2` # [1] 5 6 7 8 # # $`3` # [1] 9 10 do.call(c, lapply(split(d, ceiling(seq_along(d)/4)), function(x) sum(x)) ) # 1 2 3 # 10 26 19 # bigmemory vignette planeindices <- split(1:nrow(x), x[,'TailNum']) planeStart <- sapply(planeindices, function(i) birthmonth(x[i, c('Year','Month'), drop=FALSE]))
- Split rows of a data frame/matrix; e.g. rows represents genes. The data frame/matrix is split directly.
split(mtcars,mtcars$cyl) split(data.frame(matrix(1:20, nr=10) ), ceiling(1:10/chunksize)) # data.frame/tibble works split.data.frame(matrix(1:20, nr=10), ceiling(1:10/chunksize)) # split.data.frame() works for matrices
- Split columns of a data frame/matrix.
ma <- cbind(x = 1:10, y = (-4:5)^2, z = 11:20) split(ma, cbind(rep(1,10), rep(2, 10), rep(1,10))) # not an interesting example # $`1` # [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 # # $`2` # [1] 16 9 4 1 0 1 4 9 16 25
- split() + sapply() to merge columns. See below Mean of duplicated columns for more detail.
- split() + sapply() to split a vector. See nsFilter() function which can remove duplicated probesets/rows using unique Entrez Gene IDs (genefilter package). The source code of nsFilter() and findLargest().
tSsp = split.default(testStat, lls) # testStat is a vector of numerics including probeset IDs as names # lls is a vector of entrez IDs (same length as testStat) # tSSp is a list of the same length as unique elements of lls. sapply(tSsp, function(x) names(which.max(x))) # return a vector of probset IDs of length of unique entrez IDs
strsplit and sapply
> namedf <- c("John ABC", "Mary CDE", "Kat FGH") > strsplit(namedf, " ") 1 [1] "John" "ABC" 2 [1] "Mary" "CDE" 3 [1] "Kat" "FGH" > sapply(strsplit(namedf, " "), "[", 1) [1] "John" "Mary" "Kat" > sapply(strsplit(namedf, " "), "[", 2) [1] "ABC" "CDE" "FGH"
Mean of duplicated columns: rowMeans; compute Means by each row
- Reduce columns of a matrix by a function in R. To use rowMedians() instead of rowMeans(), we need to install matrixStats from CRAN.
set.seed(1) x <- matrix(1:60, nr=10); x[1, 2:3] <- NA colnames(x) <- c("b", "b", "b", "c", "a", "a"); x res <- sapply(split(1:ncol(x), colnames(x)), function(i) rowMeans(x[, i, drop=F], na.rm = TRUE)) res # notice the sorting of columns a b c [1,] 46 1 31 [2,] 47 12 32 [3,] 48 13 33 [4,] 49 14 34 [5,] 50 15 35 [6,] 51 16 36 [7,] 52 17 37 [8,] 53 18 38 [9,] 54 19 39 [10,] 55 20 40 # vapply() is safter than sapply(). # The 3rd arg in vapply() is a template of the return value. res2 <- vapply(split(1:ncol(x), colnames(x)), function(i) rowMeans(x[, i, drop=F], na.rm = TRUE), rep(0, nrow(x)))
- colSums, rowSums, colMeans, rowMeans (no group variable). These functions are equivalent to use of ‘apply’ with ‘FUN = mean’ or ‘FUN = sum’ with appropriate margins, but are a lot faster.
rowMeans(x, na.rm=T) # [1] 31 27 28 29 30 31 32 33 34 35 apply(x, 1, mean, na.rm=T) # [1] 31 27 28 29 30 31 32 33 34 35
- matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors)
- From for() loops to the split-apply-combine paradigm for column-wise tasks: the transition for a dinosaur
Mean of duplicated rows: colMeans and rowsum
- colMeans(x, na.rm = FALSE, dims = 1), take mean per columns & sum over rows. It returns a vector. Other similar idea functions include colSums, rowSums, rowMeans.
x <- matrix(1:60, nr=10); x[1, 2:3] <- NA; x rownames(x) <- c(rep("b", 2), rep("c", 3), rep("d", 4), "a") # move 'a' to the last res <- sapply(split(1:nrow(x), rownames(x)), function(i) colMeans(x[i, , drop=F], na.rm = TRUE)) res <- t(res) # transpose is needed since sapply() will form the resulting matrix by columns res # still a matrix, rows are ordered # [,1] [,2] [,3] [,4] [,5] [,6] # a 10.0 20.0 30.0 40.0 50.0 60.0 # b 1.5 12.0 22.0 31.5 41.5 51.5 # c 4.0 14.0 24.0 34.0 44.0 54.0 # d 7.5 17.5 27.5 37.5 47.5 57.5 table(rownames(x)) # a b c d # 1 2 3 4 aggregate(x, list(rownames(x)), FUN=mean, na.rm = T) # EASY, but it becomes a data frame, rows are ordered # Group.1 V1 V2 V3 V4 V5 V6 # 1 a 10.0 20.0 30.0 40.0 50.0 60.0 # 2 b 1.5 12.0 22.0 31.5 41.5 51.5 # 3 c 4.0 14.0 24.0 34.0 44.0 54.0 # 4 d 7.5 17.5 27.5 37.5 47.5 57.5
- Reduce multiple probes by the maximally expressed probe (set) measured by average intensity across arrays
- rowsum(x, group, reorder = TRUE, …). Sum over rows. It returns a matrix. This is very special. It's not the same as rowSums. There is no "colsum" function. It has the speed advantage over sapply+colSums OR aggregate.
group <- rownames(x) rowsum(x, group, na.rm=T)/as.vector(table(group)) # [,1] [,2] [,3] [,4] [,5] [,6] # a 10.0 20.0 30.0 40.0 50.0 60.0 # b 1.5 6.0 11.0 31.5 41.5 51.5 # c 4.0 14.0 24.0 34.0 44.0 54.0 # d 7.5 17.5 27.5 37.5 47.5 57.5
- How to calculate mean/median per group in a dataframe in r where doBy and dplyr are recommended.
- matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors)
- doBy package
- use ave() and unique()
- data.table package
- plyr package
- by() function. Calculating change from baseline in R
- See aggregate Function in R- A powerful tool for data frames & summarize in r, Data Summarization In R
- aggregate() function. Too slow! http://slowkow.com/2015/01/28/data-table-aggregate/. Don't use aggregate post.
> attach(mtcars) dim(mtcars) [1] 32 11 > head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > with(mtcars, table(cyl, vs)) vs cyl 0 1 4 1 10 6 3 4 8 14 0 > aggdata <-aggregate(mtcars, by=list(cyl,vs), FUN=mean, na.rm=TRUE) > print(aggdata) Group.1 Group.2 mpg cyl disp hp drat wt qsec vs 1 4 0 26.00000 4 120.30 91.0000 4.430000 2.140000 16.70000 0 2 6 0 20.56667 6 155.00 131.6667 3.806667 2.755000 16.32667 0 3 8 0 15.10000 8 353.10 209.2143 3.229286 3.999214 16.77214 0 4 4 1 26.73000 4 103.62 81.8000 4.035000 2.300300 19.38100 1 5 6 1 19.12500 6 204.55 115.2500 3.420000 3.388750 19.21500 1 am gear carb 1 1.0000000 5.000000 2.000000 2 1.0000000 4.333333 4.666667 3 0.1428571 3.285714 3.500000 4 0.7000000 4.000000 1.500000 5 0.0000000 3.500000 2.500000 > detach(mtcars) # Another example: select rows with a minimum value from a certain column (yval in this case) > mydf <- read.table(header=T, text=' id xval yval A 1 1 A -2 2 B 3 3 B 4 4 C 5 5 ') > x = mydf$xval > y = mydf$yval > aggregate(mydf[, c(2,3)], by=list(id=mydf$id), FUN=function(x) x[which.min(y)]) id xval yval 1 A 1 1 2 B 3 3 3 C 5 5
Mean by Group
Mean by Group in R (2 Examples) | dplyr Package vs. Base R
aggregate(x = iris$Sepal.Length, # Specify data column by = list(iris$Species), # Specify group indicator FUN = mean) # Specify function (i.e. mean)
library(dplyr) iris %>% # Specify data frame group_by(Species) %>% # Specify group indicator summarise_at(vars(Sepal.Length), # Specify column list(name = mean)) # Specify function
- ave(x, ..., FUN),
- aggregate(x, by, FUN),
- by(x, INDICES, FUN): return is a list
- tapply(): return results as a matrix or array. Useful for ragged array.
Apply family
Vectorize, aggregate, apply, by, eapply, lapply, mapply, rapply, replicate, scale, sapply, split, tapply, and vapply.
The following list gives a hierarchical relationship among these functions.
- apply(X, MARGIN, FUN, ...) – Apply a Functions Over Array Margins
- lapply(X, FUN, ...) – Apply a Function over a List (including a data frame) or Vector X.
- sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) – Apply a Function over a List or Vector
- replicate(n, expr, simplify = "array")
- mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) – Multivariate version of sapply
- Vectorize(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, USE.NAMES = TRUE) - Vectorize a Scalar Function
- Map(FUN, ...) A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
- vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) – similar to sapply, but has a pre-specified type of return value
- rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...) – A recursive version of lapply
- sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) – Apply a Function over a List or Vector
- tapply(V, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) – Apply a Function Over a "Ragged" Array. V is typically a vector where split() will be applied. INDEX is a list of one or more factors.
- aggregate(D, by, FUN, ..., simplify = TRUE, drop = TRUE) - Apply a function to each columns of subset data frame split by factors. FUN (such as mean(), weighted.mean(), sum()) is a simple function applied to a vector. D is typically a data frame. This is used to summarize data.
- by(D, INDICES, FUN, ..., simplify = TRUE) - Apply a Function to each subset data frame split by factors. FUN (such as summary(), lm()) is applied to a data frame. D is typically a data frame.
- eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE) – Apply a Function over values in an environment
Difference between apply vs sapply vs lapply vs tapply?
- apply - When you want to apply a function to the rows or columns or both of a matrix and output is a one-dimensional if only row or column is selected else it is a 2D-matrix
- lapply - When you want to apply a function to each element of a list in turn and get a list back.
- sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
- tapply - When you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
Some short examples:
- stern.nyu.edu.
- Apply Function in R – apply vs lapply vs sapply vs mapply vs tapply vs rapply vs vapply from datasciencemadesimple.com.
- How to use which one (apply family) when?
Apply vs for loop
Note that, apply's performance is not always better than a for loop. See
- http://tolstoy.newcastle.edu.au/R/help/06/05/27255.html (answered by Brian Ripley)
- https://stat.ethz.ch/pipermail/r-help/2014-October/422455.html (has one example)
- R: are *apply loops faster than for loops?. The author said 'an important reason for using *apply() functions may instead be that they fit the functional programming paradigm better, where everything is done using function calls and side effects are reduced'... The scope of the variables defined within f is limited to f, and variables defined outside f cannot be modified inside f (except using the special scoping assignment operator <<-).
- Why loops are slow in R
- Why is `unlist(lapply)` faster than `sapply`?
Progress bar
What is the cost of a progress bar in R?
The package 'pbapply' creates a text-mode progress bar - it works on any platforms. On Windows platform, check out this post. It uses winProgressBar() and setWinProgressBar() functions.
e-Rum 2020 Slides on Progressr by Henrik Bengtsson. progressr 0.8.0: RStudio's progress bar, Shiny progress updates, and absolute progress, progressr 0.10.1: Plyr Now Supports Progress Updates also in Parallel
simplify option in sapply()
library(KEGGREST) names1 <- keggGet(c("hsa05340", "hsa05410")) names2 <- sapply(names1, function(x) x$GENE) length(names2) # same if we use lapply() above # [1] 2 names3 <- keggGet(c("hsa05340")) names4 <- sapply(names3, function(x) x$GENE) length(names4) # may or may not be what we expect # [1] 76 names4 <- sapply(names3, function(x) x$GENE, simplify = FALSE) length(names4) # same if we use lapply() w/o simplify # [1] 1
lapply and its friends Map(), Reduce(), Filter() from the base package for manipulating lists
- Examples of using lapply() + split() on a data frame. See rollingyours.wordpress.com.
- mapply() documentation. Use mapply() to merge lists.
mapply(rep, 1:4, 4:1) mapply(rep, times = 1:4, x = 4:1) mapply(function(x, y) seq_len(x) + y, c(a = 1, b = 2, c = 3), # names from first c(A = 10, B = 0, C = -10)) mapply(c, firstList, secondList, SIMPLIFY=FALSE)
- Finding the Expected value of the maximum of two Bivariate Normal variables with simulation sapply + mapply.
z <- mapply(function(u, v) { max(u, v) }, u = x[, 1], v = x[, 2])
- Map() and Reduce() in functional programming
- Map(), Reduce(), and Filter() from Advanced R by Hadley
- If you have two or more lists (or data frames) that you need to process in parallel, use Map(). One good example is to compute the weighted.mean() function that requires two input objects. Map() is similar to mapply() function and is more concise than lapply(). Advanced R has a comment that Map() is better than mapply().
# Syntax: Map(f, ...) xs <- replicate(5, runif(10), simplify = FALSE) ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE) Map(weighted.mean, xs, ws) # instead of a more clumsy way lapply(seq_along(xs), function(i) { weighted.mean(xsi, wsi) })
- Reduce() reduces a vector, x, to a single value by recursively calling a function, f, two arguments at a time. A good example of using Reduce() function is to read a list of matrix files and merge them. See How to combine multiple matrix frames into one using R?
# Syntax: Reduce(f, x, ...) > m1 <- data.frame(id=letters[1:4], val=1:4) > m2 <- data.frame(id=letters[2:6], val=2:6) > merge(m1, m2, "id", all = T) id val.x val.y 1 a 1 NA 2 b 2 2 3 c 3 3 4 d 4 4 5 e NA 5 6 f NA 6 > m <- list(m1, m2) > Reduce(function(x,y) merge(x,y, "id",all=T), m) id val.x val.y 1 a 1 NA 2 b 2 2 3 c 3 3 4 d 4 4 5 e NA 5 6 f NA 6
- If you have two or more lists (or data frames) that you need to process in parallel, use Map(). One good example is to compute the weighted.mean() function that requires two input objects. Map() is similar to mapply() function and is more concise than lapply(). Advanced R has a comment that Map() is better than mapply().
- Playing Map() and Reduce() in R – Subsetting - using parallel and future packages. Union Multiple Data.Frames with Different Column Names
sapply & vapply
- This discusses why vapply is safer and faster than sapply.
- Vector output: sapply and vapply from Advanced R (Hadley Wickham).
- THOSE “OTHER” APPLY FUNCTIONS…. rapply(), vapply() and eapply() are covered.
- Speed test: sapply vs. vectorization
- sapply can be used in plotting; for example, glmnet relax vignette uses sapply(myList, lines, col="grey") to draw multiple lines simultaneously on a list of matrices.
See parallel::parSapply() for a parallel version of sapply(1:n, function(x)). We can this technique to speed up this example.
rapply - recursive version of lapply
- http://4dpiecharts.com/tag/recursive/
- Search in R source code. Mainly r-source/src/library/stats/R/dendrogram.R.
replicate
https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r
> replicate(5, rnorm(3)) [,1] [,2] [,3] [,4] [,5] [1,] 0.2509130 -0.3526600 -0.3170790 1.064816 -0.53708856 [2,] 0.5222548 1.5343319 0.6120194 -1.811913 -1.09352459 [3,] -1.9905533 -0.8902026 -0.5489822 1.308273 0.08773477
See parSapply() for a parallel version of replicate().
Vectorize
- Vectorize(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, USE.NAMES = TRUE): creates a function wrapper that vectorizes a scalar function. Its value is a list or vector or array. It calls mapply().
> rep(1:4, 4:1) [1] 1 1 1 1 2 2 2 3 3 4 > vrep <- Vectorize(rep.int) > vrep(1:4, 4:1) 1 [1] 1 1 1 1 2 [1] 2 2 2 3 [1] 3 3 4 [1] 4
> rweibull(1, 1, c(1, 2)) # no error but not sure what it gives? [1] 2.17123 > Vectorize("rweibull")(n=1, shape = 1, scale = c(1, 2)) [1] 1.6491761 0.9610109
myfunc <- function(a, b) a*b myfunc(1, 2) # 2 myfunc(3, 5) # 15 myfunc(c(1,3), c(2,5)) # 2 15 Vectorize(myfunc)(c(1,3), c(2,5)) # 2 15 myfunc2 <- function(a, b) if (length(a) == 1) a * b else NA myfunc2(1, 2) # 2 myfunc2(3, 5) # 15 myfunc2(c(1,3), c(2,5)) # NA Vectorize(myfunc2)(c(1, 3), c(2, 5)) # 2 15 Vectorize(myfunc2)(c(1, 3, 6), c(2, 5)) # 2 15 12 # parameter will be re-used
plyr and dplyr packages
Practical Data Science for Stats - a PeerJ Collection
The Split-Apply-Combine Strategy for Data Analysis (plyr package) in J. Stat Software.
A quick introduction to plyr with a summary of apply functions in R and compare them with functions in plyr package.
- plyr has a common syntax -- easier to remember
- plyr requires less code since it takes care of the input and output format
- plyr can easily be run in parallel -- faster
Tutorials
- Introduction to dplyr from http://dplyr.tidyverse.org/.
- A video of dplyr package can be found on vimeo.
- Hands-on dplyr tutorial for faster data manipulation in R from dataschool.io.
Examples of using dplyr:
- Medicines under evaluation
- CBI GEO Metadata Survey
- Logistic Regression by Page Piccinini. mutate(), inner_join() and %>%.
- DESeq2 post analysis select(), gather(), arrange() and %>%.
tibble
Tidy DataFrames but not Tibbles
Tibble objects
- it does not have row names (cf data frame),
- it never changes the type of the inputs (e.g. it never converts strings to factors!),
- it never changes the names of variables
To show all rows or columns of a tibble object,
print(tbObj, n= Inf) print(tbObj, width = Inf)
If we try to do a match on some column of a tibble object, we will get zero matches. The issue is we cannot use an index to get a tibble column.
Subsetting: to extract a column from a tibble object, use [[ or $ or dplyr::pull(). Select Data Frame Columns in R.
TibbleObject$VarName # OR TibbleObject"VarName" # OR pull(TibbleObject, VarName) # won't be a tibble object anymore # For multiple columns, use select() dplyr::select(TibbleObject, -c(VarName1, VarName2)) # still a tibble object # OR dplyr::select(TibbleObject, 2:5) #
Convert a data frame to a tibble See Tibble Data Format in R: Best and Modern Way to Work with Your Data
my_data <- as_tibble(iris) class(my_data)
llply()
llply is equivalent to lapply except that it will preserve labels and can display a progress bar. This is handy if we want to do a crazy thing.
LLID2GOIDs <- lapply(rLLID, function(x) get("org.Hs.egGO")[[x]])
where rLLID is a list of entrez ID. For example,
get("org.Hs.egGO")[["6772"]]
returns a list of 49 GOs.
ddply()
http://lamages.blogspot.com/2012/06/transforming-subsets-of-data-in-r-with.html
ldply()
An R Script to Automatically download PubMed Citation Counts By Year of Publication
Performance/speed comparison
Performance comparison of converting list to data.frame with R language
Using R's set.seed() to set seeds for use in C/C++ (including Rcpp)
http://rorynolan.rbind.io/2018/09/30/rcsetseed/
get_seed()
See the same blog
get_seed <- function() { sample.int(.Machine$integer.max, 1) }
Note: .Machine$integer.max = 2147483647 = 2^31 - 1.
Random seeds
By default, R uses the exact time in milliseconds of the computer's clock when R starts up to generate a seed. See ?Random.
set.seed(as.numeric(Sys.time())) set.seed(as.numeric(Sys.Date())) # same seed for each day
.Machine and the largest integer, double
See ?.Machine.
Linux/Mac 32-bit Windows 64-bit Windows double.eps 2.220446e-16 2.220446e-16 2.220446e-16 double.neg.eps 1.110223e-16 1.110223e-16 1.110223e-16 double.xmin 2.225074e-308 2.225074e-308 2.225074e-308 double.xmax 1.797693e+308 1.797693e+308 1.797693e+308 double.base 2.000000e+00 2.000000e+00 2.000000e+00 double.digits 5.300000e+01 5.300000e+01 5.300000e+01 double.rounding 5.000000e+00 5.000000e+00 5.000000e+00 double.guard 0.000000e+00 0.000000e+00 0.000000e+00 double.ulp.digits -5.200000e+01 -5.200000e+01 -5.200000e+01 double.neg.ulp.digits -5.300000e+01 -5.300000e+01 -5.300000e+01 double.exponent 1.100000e+01 1.100000e+01 1.100000e+01 double.min.exp -1.022000e+03 -1.022000e+03 -1.022000e+03 double.max.exp 1.024000e+03 1.024000e+03 1.024000e+03 integer.max 2.147484e+09 2.147484e+09 2.147484e+09 sizeof.long 8.000000e+00 4.000000e+00 4.000000e+00 sizeof.longlong 8.000000e+00 8.000000e+00 8.000000e+00 sizeof.longdouble 1.600000e+01 1.200000e+01 1.600000e+01 sizeof.pointer 8.000000e+00 4.000000e+00 8.000000e+00
NA when overflow
tmp <- 156287L tmp*tmp # [1] NA # Warning message: # In tmp * tmp : NAs produced by integer overflow .Machine$integer.max # [1] 2147483647
How to select a seed for simulation or randomization
set.seed() allow alphanumeric seeds
https://stackoverflow.com/a/10913336
set.seed(), for loop and saving random seeds
- Detect When the Random Number Generator Was Used
if (interactive()) { invisible(addTaskCallback(local({ last <- .GlobalEnv$.Random.seed function(...) { curr <- .GlobalEnv$.Random.seed if (!identical(curr, last)) { msg <- "NOTE: .Random.seed changed" if (requireNamespace("crayon", quietly=TRUE)) msg <- crayon::blurred(msg) message(msg) last <<- curr } TRUE } }), name = "RNG tracker")) }
- http://r.789695.n4.nabble.com/set-seed-and-for-loop-td3585857.html. This question is legitimate when we want to debug on a certain iteration.
set.seed(1001) data <- vector("list", 30) seeds <- vector("list", 30) for(i in 1:30) { seeds[[i]] <- .Random.seed data[[i]] <- runif(5) } # If we save and load .Random.seed from a file using scan(), make # sure to convert its type from doubles to integers. # Otherwise, .Random.seed will complain! .Random.seed <- seeds[[23]] # restore data.23 <- runif(5) data.23 data[[23]]
- impute.knn
- Duncan Murdoch: This works in this example, but wouldn't work with all RNGs, because some of them save state outside of .Random.seed. See ?.Random.seed for details.
- Uwe Ligges's comment: set.seed() actually generates a seed. See ?set.seed that points us to .Random.seed (and relevant references!) which contains the actual current seed.
- Petr Savicky's comment is also useful in the situation when it is not difficult to re-generate the data.
- Local randomness in R.
sample()
sample() inaccurate on very large populations, fixed in R 3.6.0
- The default method for generating from a discrete uniform distribution (used in ‘sample()’, for instance) has been changed. In prior versions, the probability of generating each integer could vary from equal by up to 0.04% (or possibly more if generating more than a million different integers). See also What's new in R 3.6.0 by David Smith.
# R 3.5.3 set.seed(123) m <- (2/5)*2^32 m > 2^31 # [1] FALSE log10(m) # [1] 9.23502 x <- sample(m, 1000000, replace = TRUE) table(x %% 2) # 0 1 # 400070 599930
- Fast sampling support in dqrng
- Differences of the output of sample()
# R 3.5.3 # docker run --net=host -it --rm r-base:3.5.3 > set.seed(1234) > sample(5) [1] 1 3 2 4 5 # R 3.6.0 # docker run --net=host -it --rm r-base:3.6.0 > set.seed(1234) > sample(5) [1] 4 5 2 3 1 > RNGkind(sample.kind = "Rounding") Warning message: In RNGkind(sample.kind = "Rounding") : non-uniform 'Rounding' sampler used > set.seed(1234) > sample(5) [1] 1 3 2 4 5
Getting different results with set.seed() in RStudio
Getting different results with set.seed(). It's possible that you're loading an R package that is changing the requested random number generator; RNGkind().
dplyr::sample_n()
The function has a parameter weight. For example if we have some download statistics for each day and we want to do sampling based on their download numbers, we can use this function.
Regular Expression
See here.
Read rrd file
- https://en.wikipedia.org/wiki/RRDtool
- http://oss.oetiker.ch/rrdtool/
- https://github.com/pldimitrov/Rrd
- http://plamendimitrov.net/blog/2014/08/09/r-package-for-working-with-rrd-files/
on.exit()
Examples of using on.exit(). In all these examples, add = TRUE is used in the on.exit() call to ensure that each exit action is added to the list of actions to be performed when the function exits, rather than replacing the previous actions.
- Database connections
library(RSQLite) sqlite_get_query <- function(db, sql) { conn <- dbConnect(RSQLite::SQLite(), db) on.exit(dbDisconnect(conn), add = TRUE) dbGetQuery(conn, sql) }
- File connections
read_chars <- function(file_name) { conn <- file(file_name, "r") on.exit(close(conn), add = TRUE) readChar(conn, file.info(file_name)$size) }
- Temporary files
history_lines <- function() { f <- tempfile() on.exit(unlink(f), add = TRUE) savehistory(f) readLines(f, encoding = "UTF-8") }
- Printing messages
myfun = function(x) { on.exit(print("first")) on.exit(print("second"), add = TRUE) return(x) }
file, connection
- cat() and scan() (read data into a vector or list from the console or file)
- read() and write()
- read.table() and write.table()
out = file('tmp.txt', 'w') writeLines("abcd", out) writeLines("eeeeee", out) close(out) readLines('tmp.txt') unlink('tmp.txt') args(writeLines) # function (text, con = stdout(), sep = "\n", useBytes = FALSE) foo <- function() { con <- file() ... on.exit(close(con)) ... }
Error in close.connection(f) : invalid connection. If we want to use close(con), we have to specify how to open the connection; such as
con <- gzfile(FileName, "r") # Or gzfile(FileName, open = 'r') x <- read.delim(con) close(x)
withr package
https://cran.r-project.org/web/packages/withr/index.html . Reverse suggested by languageserver.
Clipboard (?connections), textConnection(), pipe()
- On Windows, we can use readClipboard() and writeClipboard().
source("clipboard") read.table("clipboard")
- Clipboard -> R. Reading/writing clipboard on macOS. Use textConnection() function:
x <- read.delim(textConnection("<USE_KEYBOARD_TO_PASTE_FROM_CLIPBOARD>")) # Or on Mac x <- read.delim(pipe("pbpaste")) # safely ignore the warning: incomplete final line found by readTableHeader on 'pbpaste'
An example is to copy data from this post. In this case we need to use read.table() instead of read.delim().
- R -> clipboard on Mac. Note: pbcopy and pbpaste are macOS terminal commands. See pbcopy & pbpaste: Manipulating the Clipboard from the Command Line.
- pbcopy: takes standard input and places it in the clipboard buffer
- pbpaste: takes data from the clipboard buffer and writes it to the standard output
clip <- pipe("pbcopy", "w") write.table(apply(x, 1, mean), file = clip, row.names=F, col.names=F) # write.table(data.frame(Var1, Var2), file = clip, row.names=F, quote=F, sep="\t") close(clip)
- Clipboard -> Excel.
- Method 1: Paste icon -> Text import wizard -> Delimit (Tab, uncheck Space) or Fixed width depending on the situation -> Finish.
- Method 2: Ctrl+v first. Then choose Data -> Text to Columns. Fixed width -> Next -> Next -> Finish.
- On Linux, we need to install "xclip". See R Copy from Clipboard in Ubuntu Linux. It seems to work.
# sudo apt-get install xclip read.table(pipe("xclip -selection clipboard -o",open="r"))
clipr
clipr: Read and Write from the System Clipboard
read/manipulate binary data
- x <- readBin(fn, raw(), file.info(fn)$size)
- rawToChar(x[1:16])
- See Biostrings C API
String Manipulation
- Handling Strings with R(ebook) by Gaston Sanchez.
- A guide to working with character data in R (6/22/2018)
- Chapter 7 of the book 'Data Manipulation with R' by Phil Spector.
- Chapter 7 of the book 'R Cookbook' by Paul Teetor.
- Chapter 2 of the book 'Using R for Data Management, Statistical Analysis and Graphics' by Horton and Kleinman.
- http://www.endmemo.com/program/R/deparse.php. It includes lots of examples for each R function it lists.
- Four ways to reverse a string in R
- A short note on the startsWith function
format(): padding with zero
ngenes <- 10 genenames <- paste0("bm", gsub(" ", "0", format(1:ngenes))); genenames # [1] "bm01" "bm02" "bm03" "bm04" "bm05" "bm06" "bm07" "bm08" "bm09" "bm10"
noquote()
noqute Print character strings without quotes.
stringr package
- https://stringr.tidyverse.org/index.html
- Vignette compares stringr functions to their base R equivalents
- When I try to use trimws() on data obtained from readxl::read_excell(), I find trimws() does not work but stringr::str_trim() works. trimws bug? leading whitespace not removed.
glue package
- glue. Useful in a loop and some function like ggtitle() or ggsave(). Inside the curly braces {R-Expression}, the expression is evaluated.
library(glue) name <- "John" age <- 30 glue("My name is {name} and I am {age} years old.") # My name is John and I am 30 years old. price <- 9.99 quantity <- 3 total <- glue("The total cost is {round(price * quantity, 2)}.") # Inside the curly braces {}, the expression round(price * quantity, 2) is evaluated. print(total) # The total cost is 29.97.
The syntax of glue() in R is quite similar to Python's print() function when using formatted strings. In Python, you typically use f-strings to embed variables inside strings.
name = "John" age = 30 print(f"My name is {name} and I am {age} years old.") # My name is John and I am 30 years old. price = 9.99 quantity = 3 total = f"The total cost is {price * quantity:.2f}." print(total) # The total cost is 29.97.
- String interpolation
Raw data type
Fun with strings, Cyrillic alphabets
a1 <- "А" a2 <- "A" a1 == a2 # [1] FALSE charToRaw("А") # [1] d0 90 charToRaw("A") # [1] 41
number of characters limit
It's a limit on a (single) input line in the REPL
Comparing strings to numeric
">" coerces the number to a string before comparing.
"10" < 2 # TRUE
HTTPs connection
HTTPS connection becomes default in R 3.2.2. See
- http://blog.rstudio.org/2015/08/17/secure-https-connections-for-r/
- http://blog.revolutionanalytics.com/2015/08/good-advice-for-security-with-r.html
R 3.3.2 patched The internal methods of ‘download.file()’ and ‘url()’ now report if they are unable to follow the redirection of a ‘http://’ URL to a ‘https://’ URL (rather than failing silently)
setInternet2
There was a bug in ftp downloading in R 3.2.2 (r69053) Windows though it is fixed now in R 3.2 patch.
Read the discussion reported on 8/8/2015. The error only happened on ftp not http connection. The final solution is explained in this post. The following demonstrated the original problem.
url <- paste0("ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/", "GCF_000001405.13.assembly.txt") f1 <- tempfile() download.file(url, f1)
It seems the bug was fixed in R 3.2-branch. See 8/16/2015 patch r69089 where a new argument INTERNET_FLAG_PASSIVE was added to InternetOpenUrl() function of wininet library. This article and this post explain differences of active and passive FTP.
The following R command will show the exact svn revision for the R you are currently using.
R.Version()$"svn rev"
If setInternet2(T), then https protocol is supported in download.file().
When setInternet(T) is enabled by default, download.file() does not work for ftp protocol (this is used in getGEO() function of the GEOquery package). If I use setInternet(F), download.file() works again for ftp protocol.
The setInternet2() function is defined in R> src> library> utils > R > windows > sysutils.R.
R up to 3.2.2
setInternet2 <- function(use = TRUE) .Internal(useInternet2(use))
See also
- <src/include/Internal.h> (declare do_setInternet2()),
- <src/main/names.c> (show do_setInternet2() in C)
- <src/main/internet.c> (define do_setInternet2() in C).
Note that: setInternet2(T) becomes default in R 3.2.2. To revert to the previous default use setInternet2(FALSE). See the <doc/NEWS.pdf> file. If we use setInternet2(F), then it solves the bug of getGEO() error. But it disables the https file download using the download.file() function. In R < 3.2.2, it is also possible to download from https by setIneternet2(T).
R 3.3.0
setInternet2 <- function(use = TRUE) { if(!is.na(use)) stop("use != NA is defunct") NA }
Note that setInternet2.Rd says As from \R 3.3.0 it changes nothing, and only \code{use = NA} is accepted. Also NEWS.Rd says setInternet2() has no effect and will be removed in due course.
Finite, Infinite and NaN Numbers: is.finite(), is.infinite(), is.nan()
In R, basically all mathematical functions (including basic Arithmetic), are supposed to work properly with +/-, Inf and NaN as input or output.
See ?is.finite.
How to replace Inf with NA in All or Specific Columns of the Data Frame
replace() function
- replace(vector, index, values)
- https://stackoverflow.com/a/11811147
File/path operations
- list.files(, include.dirs =F, recursive = T, pattern = "\\.csv$", all.files = TRUE)
- file.info()
- dir.create()
- file.create()
- file.copy()
- file.exists()
- basename() - remove the parent path, dirname() - returns the part of the path up to but excluding the last path separator
> file.path("~", "Downloads") [1] "~/Downloads" > dirname(file.path("~", "Downloads")) [1] "/home/brb" > basename(file.path("~", "Downloads")) [1] "Downloads"
- path.expand("~/.Renviron") # "/home/brb/.Renviron"
- normalizePath() # Express File Paths in Canonical Form
> cat(normalizePath(c(R.home(), tempdir())), sep = "\n") /usr/lib/R /tmp/RtmpzvDhAe
- system.file() - Finds the full file names of files in packages etc
> system.file("extdata", "ex1.bam", package="Rsamtools") [1] "/home/brb/R/x86_64-pc-linux-gnu-library/4.0/Rsamtools/extdata/ex1.bam"
- tools::file_path_sans_ext() - remove the file extension or the sub() function.
read/download/source a file from internet
Simple text file http
retail <- read.csv("http://robjhyndman.com/data/ausretail.csv",header=FALSE)
Zip, RData, gz file and url() function
x <- read.delim(gzfile("filename.txt.gz"), nrows=10)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con)
Here url() function is like file(), gzfile(), bzfile(), xzfile(), unz(), pipe(), fifo(), socketConnection(). They are used to create connections. By default, the connection is not opened (except for ‘socketConnection’), but may be opened by setting a non-empty value of argument ‘open’. See ?url.
Another example is Read gzipped csv directly from a url in R
con <- gzcon(url(paste("http://dumps.wikimedia.org/other/articlefeedback/", "aa_combined-20110321.csv.gz", sep=""))) txt <- readLines(con) dat <- read.csv(textConnection(txt))
Another example of using url() is
load(url("http:/www.example.com/example.RData"))
This does not work with load(), dget(), read.table() for files on OneDrive. In fact, I cannot use wget with shared files from OneDrive. The following trick works: How to configure a OneDrive file for use with wget.
Dropbox is easy and works for load(), wget, ...
R download .RData or Directly loading .RData from github from Github.
zip function
This will include 'hallmarkFiles' root folder in the files inside zip.
zip(zipfile = 'myFile.zip', files = dir('hallmarkFiles', full.names = TRUE)) # Verify/view the files. 'list = TRUE' won't extract unzip('testZip.zip', list = TRUE)
downloader package
This package provides a wrapper for the download.file function, making it possible to download files over https on Windows, Mac OS X, and other Unix-like platforms. The RCurl package provides this functionality (and much more) but can be difficult to install because it must be compiled with external dependencies. This package has no external dependencies, so it is much easier to install.
Google drive file based on https using RCurl package
require(RCurl) myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv") read.csv(textConnection(myCsv))
Google sheet file using googlesheets package
Reading data from google sheets into R
Github files https using RCurl package
- http://support.rstudio.org/help/kb/faq/configuring-r-to-use-an-http-proxy
- http://tonybreyal.wordpress.com/2011/11/24/source_https-sourcing-an-r-script-from-github/
x = getURL("https://gist.github.com/arraytools/6671098/raw/c4cb0ca6fe78054da8dbe253a05f7046270d5693/GeneIDs.txt", ssl.verifypeer = FALSE) read.table(text=x)
- gistr package
data summary table
summarytools: create summary tables for vectors and data frames
https://github.com/dcomtois/summarytools. R Package for quickly and neatly summarizing vectors and data frames.
skimr: A frictionless, pipeable approach to dealing with summary statistics
skimr for useful and tidy summary statistics
modelsummary
modelsummary: Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready
broom
Create publication tables using tables package
See p13 for example at here
R's tables packages is the best solution. For example,
> library(tables) > tabular( (Species + 1) ~ (n=1) + Format(digits=2)* + (Sepal.Length + Sepal.Width)*(mean + sd), data=iris ) Sepal.Length Sepal.Width Species n mean sd mean sd setosa 50 5.01 0.35 3.43 0.38 versicolor 50 5.94 0.52 2.77 0.31 virginica 50 6.59 0.64 2.97 0.32 All 150 5.84 0.83 3.06 0.44 > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
and
# This example shows some of the less common options > Sex <- factor(sample(c("Male", "Female"), 100, rep=TRUE)) > Status <- factor(sample(c("low", "medium", "high"), 100, rep=TRUE)) > z <- rnorm(100)+5 > fmt <- function(x) { s <- format(x, digits=2) even <- ((1:length(s)) %% 2) == 0 s[even] <- sprintf("(%s)", s[even]) s } > tabular( Justify(c)*Heading()*z*Sex*Heading(Statistic)*Format(fmt())*(mean+sd) ~ Status ) Status Sex Statistic high low medium Female mean 4.88 4.96 5.17 sd (1.20) (0.82) (1.35) Male mean 4.45 4.31 5.05 sd (1.01) (0.93) (0.75)
fgsea example
(archived) ClinReport: Statistical Reporting in Clinical Trials
https://cran.r-project.org/web/packages/ClinReport/index.html
Append figures to PDF files
How to append a plot to an existing pdf file. Hint: use the recordPlot() function.
Save base graphics as pseudo-objects
Save base graphics as pseudo-objects in R. Note there are some cons with this approach.
pdf(NULL) dev.control(displaylist="enable") plot(df$x, df$y) text(40, 0, "Random") text(60, 2, "Text") lines(stats::lowess(df$x, df$y)) p1.base <- recordPlot() invisible(dev.off()) # Display the saved plot grid::grid.newpage() p1.base
Extracting tables from PDFs
- extracting Tables from PDFs in R using Tabulizer. This needs the rJava package. Linux works fine. Some issue came out on my macOS 10.12 Sierra. Library not loaded: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/lib/server/libjvm.dylib. Referenced from: /Users/XXXXXXX/Library/R/3.5/library/rJava/libs/rJava.so.
-
pdftools - Text Extraction, Rendering and Converting of PDF Documents. pdf_text() and pdf_data() functions.
library(pdftools) pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf" txt <- pdf_text(pdf_file) # length = number of pages # Suppose the table we are interested in is on page 1 cat(txt[1]) # Good but not in a data frame format pdf_data(pdf_file)1 # data frame/tibble format
However, it seems it does not work on Table S6. Tabulizer package is better at this case.
This is another example. 神技能-自动化批量从PDF里面提取表格
- How To Convert PDF To Text On Linux (GUI And Command Line). It works when I tested my PDF file.
sudo apt install poppler-utils pdftotext -layout input.pdf output.txt pdftotext -layout -f 3 -l 4 input.pdf output.txt # from page 3 to 4.
- Convert PDF files into Excel spreadsheets using Adobe Acrobat. See How to extract pages from a PDF. Note the PDF file should not be opened by Excel since it is binary format Excel can't recognize.
- I found it is easier to use copy the column (it works) from PDF and paste them to Excel
- tabulapdf: Extract Tables from PDF Documents
Print tables
addmargins()
- addmargins. Puts Arbitrary Margins On Multidimensional Tables Or Arrays.
- How to put margins on tables or arrays in R?
tableone
- https://cran.r-project.org/web/packages/tableone/
- Table 1 and the Characteristics of Study Population
- 如何快速绘制论文的表1(基本特征三线表)?
- See Table 1 from Tidymodels Machine Learning: Diabetes Classification
Some examples
Simple
Cox models
finalfit package
- https://cran.r-project.org/web/packages/finalfit/index.html. Lots of vignettes.
- Survival. It fits both univariate and multivariate regressions and reports the results for both of them.
- summary_factorlist() from the finalfit package.
- Elegant regression results tables and plots in R: the finalfit package
table1
- https://cran.r-project.org/web/packages/table1/
- Creating a nice looking Table 1 with standardized mean differences (SMD). SMD is the difference in group means divided by the pooled standard deviation (and is defined differently for categorical measures). Note that the pooled standard deviation defined here is different from we see on the t.test when we assume equivalent variance in two samples.
gtsummary
- Presentation-Ready Summary Tables with gtsummary
- gtsummary & on CRAN
- tbl_summary(). The output is in the "Viewer" window.
- An example: Tidymodels Machine Learning: Diabetes Classification. The table is saved in a png file. The column variable is response.
gt*
- gt: Easily Create Presentation-Ready Display Tables
- Introduction to Clinical Tables with the {gt} Package
- Add any Plot to your {gt} table
dplyr
https://stackoverflow.com/a/34587522. The output includes counts and proportions in a publication like fashion.
tables::tabular()
gmodels::CrossTable()
https://www.statmethods.net/stats/frequencies.html
base::prop.table(x, margin)
R> m <- matrix(1:4, 2) R> prop.table(m, 1) # row percentage [,1] [,2] [1,] 0.2500000 0.7500000 [2,] 0.3333333 0.6666667 R> prop.table(m, 2) # column percentage [,1] [,2] [1,] 0.3333333 0.4285714 [2,] 0.6666667 0.5714286
stats::xtabs()
stats::ftable()
> ftable(Titanic, row.vars = 1:3) Survived No Yes Class Sex Age 1st Male Child 0 5 Adult 118 57 Female Child 0 1 Adult 4 140 2nd Male Child 0 11 Adult 154 14 Female Child 0 13 Adult 13 80 3rd Male Child 35 13 Adult 387 75 Female Child 17 14 Adult 89 76 Crew Male Child 0 0 Adult 670 192 Female Child 0 0 Adult 3 20 > ftable(Titanic, row.vars = 1:2, col.vars = "Survived") Survived No Yes Class Sex 1st Male 118 62 Female 4 141 2nd Male 154 25 Female 13 93 3rd Male 422 88 Female 106 90 Crew Male 670 192 Female 3 20 > ftable(Titanic, row.vars = 2:1, col.vars = "Survived") Survived No Yes Sex Class Male 1st 118 62 2nd 154 25 3rd 422 88 Crew 670 192 Female 1st 4 141 2nd 13 93 3rd 106 90 Crew 3 20 > str(Titanic) table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ... - attr(*, "dimnames")=List of 4 ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew" ..$ Sex : chr [1:2] "Male" "Female" ..$ Age : chr [1:2] "Child" "Adult" ..$ Survived: chr [1:2] "No" "Yes" > x <- ftable(mtcars[c("cyl", "vs", "am", "gear")]) > x gear 3 4 5 cyl vs am 4 0 0 0 0 0 1 0 0 1 1 0 1 2 0 1 0 6 1 6 0 0 0 0 0 1 0 2 1 1 0 2 2 0 1 0 0 0 8 0 0 12 0 0 1 0 0 2 1 0 0 0 0 1 0 0 0 > ftable(x, row.vars = c(2, 4)) cyl 4 6 8 am 0 1 0 1 0 1 vs gear 0 3 0 0 0 0 12 0 4 0 0 0 2 0 0 5 0 1 0 1 0 2 1 3 1 0 2 0 0 0 4 2 6 2 0 0 0 5 0 1 0 0 0 0 > > ## Start with expressions, use table()'s "dnn" to change labels > ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4), dnn = c("Cylinders", "V/S", "Transmission", "Gears")) Cylinders 4 6 8 Transmission 0 1 0 1 0 1 V/S Gears 0 3 0 0 0 0 12 0 4 0 0 0 2 0 0 5 0 1 0 1 0 2 1 3 1 0 2 0 0 0 4 2 6 2 0 0 0 5 0 1 0 0 0 0
tracemem, data type, copy
How to avoid copying a long vector
Tell if the current R is running in 32-bit or 64-bit mode
8 * .Machine$sizeof.pointer
where sizeof.pointer returns the number of *bytes* in a C SEXP type and '8' means number of bits per byte.
32- and 64-bit
See R-admin.html.
- For speed you may want to use a 32-bit build, but to handle large datasets a 64-bit build.
- Even on 64-bit builds of R there are limits on the size of R objects, some of which stem from the use of 32-bit integers (especially in FORTRAN code). For example, the dimensionas of an array are limited to 2^31 -1.
- Since R 2.15.0, it is possible to select '64-bit Files' from the standard installer even on a 32-bit version of Windows (2012/3/30).
Handling length 2^31 and more in R 3.0.0
From R News for 3.0.0 release:
There is a subtle change in behaviour for numeric index values 2^31 and larger. These never used to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.
In R 2.15.2, if I try to assign a vector of length 2^31, I will get an error
> x <- seq(1, 2^31) Error in from:to : result would be too long a vector
However, for R 3.0.0 (tested on my 64-bit Ubuntu with 16GB RAM. The R was compiled by myself):
> system.time(x <- seq(1,2^31)) user system elapsed 8.604 11.060 120.815 > length(x) [1] 2147483648 > length(x)/2^20 [1] 2048 > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 183823 9.9 407500 21.8 350000 18.7 Vcells 2147764406 16386.2 2368247221 18068.3 2148247383 16389.9 >
Note:
- 2^31 length is about 2 Giga length. It takes about 16 GB (2^31*8/2^20 MB) memory.
- On Windows, it is almost impossible to work with 2^31 length of data if the memory is less than 16 GB because virtual disk on Windows does not work well. For example, when I tested on my 12 GB Windows 7, the whole Windows system freezes for several minutes before I force to power off the machine.
- My slide in http://goo.gl/g7sGX shows the screenshots of running the above command on my Ubuntu and RHEL machines. As you can see the linux is pretty good at handling large (> system RAM) data. That said, as long as your linux system is 64-bit, you can possibly work on large data without too much pain.
- For large dataset, it makes sense to use database or specially crafted packages like bigmemory or ff or bigstatsr.
- [[<- for index 2^31 fails
NA in index
- Question: what is seq(1, 3)[c(1, 2, NA)]?
Answer: It will reserve the element with NA in indexing and return the value NA for it.
- Question: What is TRUE & NA?
Answer: NA
- Question: What is FALSE & NA?
Answer: FALSE
- Question: c("A", "B", NA) != "" ?
Answer: TRUE TRUE NA
- Question: which(c("A", "B", NA) != "") ?
Answer: 1 2
- Question: c(1, 2, NA) != "" & !is.na(c(1, 2, NA)) ?
Answer: TRUE TRUE FALSE
- Question: c("A", "B", NA) != "" & !is.na(c("A", "B", NA)) ?
Answer: TRUE TRUE FALSE
Conclusion: In order to exclude empty or NA for numerical or character data type, we can use which() or a convenience function keep.complete(x) <- function(x) x != "" & !is.na(x). This will guarantee return logical values and not contain NAs.
Don't just use x != "" OR !is.na(x).
Some functions
- X %>% tidyr::drop_na()
- stats::na.omit() and stats::complete.cases(). NA Omit in R | 3 Example Codes for na.omit (Data Frame, Vector & by Column)
Constant and 'L'
Add 'L' after a constant. For example,
for(i in 1L:n) { } if (max.lines > 0L) { } label <- paste0(n-i+1L, ": ") n <- length(x); if(n == 0L) { }
Vector/Arrays
R indexes arrays from 1 like Fortran, not from 0 like C or Python.
remove integer(0)
How to remove integer(0) from a vector?
Append some elements
append() and its after argument
setNames()
Assign names to a vector
z <- setNames(1:3, c("a", "b", "c")) # OR z <- 1:3; names(z) <- c("a", "b", "c") # OR z <- c("a"=1, "b"=2, "c"=3) # not work if "a", "b", "c" is like x[1], x[2], x[3].
Factor
labels argument
We can specify the factor levels and new labels using the factor() function.
sex <- factor(sex, levels = c("0", "1"), labels = c("Male", "Female")) drug_treatment <- factor(drug_treatment, levels = c("Placebo", "Low dose", "High dose")) health_status <- factor(health_status, levels = c("Healthy", "Alzheimer's")) factor(rev(letters[1:3]), labels = c("A", "B", "C")) # C B A # Levels: A B C
Create a factor/categorical variable from a continuous variable: cut() and dplyr::case_when()
cut( c(0, 10, 30), breaks = c(0, 30, 50, Inf), labels = c("Young", "Middle-aged", "Elderly") ) # Default include.lowest = FALSE # [1] <NA> Young Young
- https://dplyr.tidyverse.org/reference/case_when.html
- Using dplyr’s mutate and case_when functions as alternative for if else statement
- Case when in R using case_when() Dplyr – case_when in R
- How To Convert Continuous Variables Into Categorical By Creating Bins
- ?cut
set.seed(1) x <- rnorm(100) facVar <- cut(x, c(min(x), -1, 1, max(x)), labels = c("low", "medium", "high")) table(facVar, useNA = "ifany") facVar # low medium high <NA> # 10 74 15 1
Note the option include.lowest = TRUE is needed when we use cut() + quantile(); otherwise the smallest data will become NA since the intervals have the format (a, b].
x2 <- cut(x, quantile(x, 0:2/2), include.lowest = TRUE) # split x into 2 levels x2 <- cut(x, quantile(x, 0:3/3), include.lowest = TRUE) # split x into 3 levels library(tidyverse); library(magrittr) set.seed(1) breaks <- quantile(runif(100), probs=seq(0, 1, len=20)) x <- runif(50) bins <- cut(x, breaks=unique(breaks), include.lowest=T, right=T) data.frame(sc=x, bins=bins) %>% group_by(bins) %>% summarise(n=n()) %>% ggplot(aes(x = bins, y = n)) + geom_col(color = "black", fill = "#90AACB") + theme_minimal() + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none") + coord_flip()
- A Guide to Using the cut() Function in R
- tibble object
library(tidyverse) tibble(age_yrs = c(0, 4, 10, 15, 24, 55), age_cat = case_when( age_yrs < 2 ~ "baby", age_yrs < 13 ~ "kid", age_yrs < 20 ~ "teen", TRUE ~ "adult") )
- R tip: Learn dplyr’s case_when() function
case_when( condition1 ~ value1, condition2 ~ value2, TRUE ~ ValueAnythingElse ) # Example case_when( x %%2 == 0 ~ "even", x %%2 == 1 ~ "odd", TRUE ~ "Neither even or odd" )
How to change one of the level to NA
https://stackoverflow.com/a/25354985. Note that the factor level is removed.
x <- factor(c("a", "b", "c", "NotPerformed")) levels(x)[levels(x) == 'NotPerformed'] <- NA
Creating missing values in factors
Concatenating two factor vectors
Not trivial. How to concatenate factors, without them being converted to integer level?.
unlist(list(f1, f2)) # unlist(list(factor(letters[1:5]), factor(letters[5:2])))
droplevels()
droplevels(): drop unused levels from a factor or, more commonly, from factors in a data frame.
factor(x , levels = ...) vs levels(x) <-
Note levels(x) is to set/rename levels, not reorder. Use relevel() or factor() to reorder.
levels() plyr::revalue() forcats::fct_recode() |
rename levels |
factor(, levels) | reorder levels |
sizes <- factor(c("small", "large", "large", "small", "medium")) sizes #> [1] small large large small medium #> Levels: large medium small sizes2 <- factor(sizes, levels = c("small", "medium", "large")) # reorder levels but data is not changed sizes2 # [1] small large large small medium # Levels: small medium large sizes3 <- sizes levels(sizes3) <- c("small", "medium", "large") # rename, not reorder # large -> small # medium -> medium # small -> large sizes3 # [1] large small small large medium # Levels: small medium large
A regression example.
set.seed(1) x <- sample(1:2, 500, replace = TRUE) y <- round(x + rnorm(500), 3) x <- as.factor(x) sample_data <- data.frame(x, y) # create linear model summary(lm( y~x, sample_data)) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.96804 0.06610 14.65 <2e-16 *** # x2 0.99620 0.09462 10.53 <2e-16 *** # Wrong way when we want to change the baseline level to '2' # No change on the model fitting except the apparent change on the variable name in the printout levels(sample_data$x) <- c("2", "1") summary(lm( y~x, sample_data)) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.96804 0.06610 14.65 <2e-16 *** # x1 0.99620 0.09462 10.53 <2e-16 *** # Correct way if we want to change the baseline level to '2' # The estimate was changed by flipping the sign from the original data sample_data$x <- relevel(x, ref = "2") summary(lm( y~x, sample_data)) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.96425 0.06770 29.01 <2e-16 *** # x1 -0.99620 0.09462 -10.53 <2e-16 ***
stats::relevel()
relevel. This function can only be used to change the reference level of a factor variable. It does not directly create an arbitrary order of levels. That is, it is useful in lm() or aov(), etc.
reorder(), levels() and boxplot()
- How to Reorder Boxplots in R: A Comprehensive Guide (tapply() method, simple & effective)
- reorder().This is useful in barplot (ggplot2::geom_col()) where we want to sort the bars by a numerical variable.
# Syntax: # newFac <- with(df, reorder(fac, vec, FUN=mean)) # newFac is like fac except it has a new order (bymedian <- with(InsectSprays, reorder(spray, count, median)) ) class(bymedian) levels(bymedian) boxplot(count ~ bymedian, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count", main = "InsectSprays data", varwidth = TRUE, col = "lightgray") # boxplots are sorted according to the new levels boxplot(count ~ spray, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count", main = "InsectSprays data", varwidth = TRUE, col = "lightgray") # not sorted
- Statistics Sunday: My 2019 Reading (reorder function)
factor() vs ordered()
factor(levels=c("a", "b", "c"), ordered=TRUE) # ordered(0) # Levels: a < b < c factor(levels=c("a", "b", "c")) # factor(0) # Levels: a b c ordered(levels=c("a", "b", "c")) # Error in factor(x, ..., ordered = TRUE) : # argument "x" is missing, with no default
Data frame
- http://adv-r.had.co.nz/Data-structures.html#data-frames. A data frame is a list of equal-length vectors. So a data frame is not a vector nor a matrix though it looks like a matrix.
- http://blog.datacamp.com/15-easy-solutions-data-frame-problems-r/
stringsAsFactors = FALSE
http://www.win-vector.com/blog/2018/03/r-tip-use-stringsasfactors-false/
We can use options(stringsAsFactors=FALSE) forces R to import character data as character objects.
In R 4.0.0, stringAsFactors=FALSE will be default. This also affects read.table() function.
check.names = FALSE
Note this option will not affect rownames. So if the rownames contains special symbols, like dash, space, parentheses, etc, they will not be modified.
> data.frame("1a"=1:2, "2a"=1:2, check.names = FALSE) 1a 2a 1 1 1 2 2 2 > data.frame("1a"=1:2, "2a"=1:2) # default X1a X2a 1 1 1 2 2 2
Create unique rownames: make.unique()
groupCodes <- c(rep("Cont",5), rep("Tre1",5), rep("Tre2",5)) rownames(mydf) <- make.unique(groupCodes)
data.frame() will change rownames
class(df2) # [1] "matrix" "array" rownames(df2)[c(9109, 44999)] # [1] "A1CF" "A1BG-AS1" rownames(data.frame(df2))[c(9109, 44999)] # [1] "A1CF" "A1BG.AS1"
Print a data frame without rownames
# Method 1. rownames(df1) <- NULL # Method 2. print(df1, row.names = FALSE)
Convert data frame factor columns to characters
Convert data.frame columns from factors to characters
# Method 1: bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE) # Method 2: bob[] <- lapply(bob, as.character)
To replace only factor columns:
# Method 1: i <- sapply(bob, is.factor) bob[i] <- lapply(bob[i], as.character) # Method 2: library(dplyr) bob %>% mutate_if(is.factor, as.character) -> bob
Sort Or Order A Data Frame
How To Sort Or Order A Data Frame In R
- df[order(df$x), ], df[order(df$x, decreasing = TRUE), ], df[order(df$x, df$y), ]
- library(plyr); arrange(df, x), arrange(df, desc(x)), arrange(df, x, y)
- library(dplyr); df %>% arrange(x),df %>% arrange(x, desc(x)), df %>% arrange(x, y)
- library(doBy); order(~x, df), order(~ -x, df), order(~ x+y, df)
data.frame to vector
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6)) class(df) # [1] "data.frame" class(t(df)) # [1] "matrix" "array" class(unlist(df)) # [1] "numeric" # Method 1: Convert data frame to matrix using as.matrix() # and then Convert matrix to vector using as.vector() or c() mat <- as.matrix(df) vec1 <- as.vector(mat) # [1] 1 2 3 4 5 6 vec2 <- c(mat) # Method 2: Convert data frame to matrix using t()/transpose # and then Convert matrix to vector using as.vector() or c() vec3 <- as.vector(t(df)) # [1] 1 4 2 5 3 6 vec4 <- c(t(df)) # Not working as.vector(df) # $x # [1] 1 2 3 # $y # [1] 4 5 6 # Method 3: unlist() - easiest solution unlist(df) # x1 x2 x3 y1 y2 y3 # 1 2 3 4 5 6 unlist(data.frame(df), use.names = F) # OR dplyr::pull() # [1] 1 2 3 4 5 6
Q: Why as.vector(df) cannot convert a data frame into a vector?
A: The as.vector function cannot be used directly on a data frame to convert it into a vector because a data frame is a list of vectors (i.e., its columns) and as.vector only removes the attributes of an object to create a vector. When you apply as.vector to a data frame, R does not know how to concatenate these independent columns (which could be of different types) into a single vector. Therefore, it doesn’t perform the operation. Therefore as.vector() returns the underlying list structure of the data frame instead of converting it into a vector.
However, when you transpose the data frame using t(), it gets converted into a matrix. A matrix in R is a vector with dimensions. Therefore, all elements of the matrix must be of the same type. If they are not, R will coerce them to be so. Once you have a matrix, as.vector() can easily convert it into a vector because all elements are of the same type.
Using cbind() to merge vectors together?
It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly. See Advanced R -> Data structures chapter.
cbind NULL and data.frame
cbind can't combine NULL with dataframe. Add as.matrix() will fix the problem.
merge
- All You Need To Know About Merging (Joining) Datasets in R. If we like to merge/join by the rownames, we can use dplyr::rownames_to_column(); see dplyr left_join() by rownames.
- Merge DataFrames by Row Names in R
- How to perform merges (joins) on two or more data frames with base R, tidyverse and data.table
- How to understand the different types of merge
Special character in the matched variable can create a trouble when we use merge() or dplyr::inner_join(). I guess R internally turns df2 (a matrix but not a data frame) to a data frame (so rownames are changed if they contain special character like "-"). This still does not explain the situation when I
class(df1); class(df2) # [1] "data.frame" # 2 x 2 # [1] "matrix" "array" # 52439 x 2 rownames(df1) # [1] "A1CF" "A1BG-AS1" merge(df1, df2[c(9109, 44999), ], by=0) # Row.names 786-0 A498 ACH-000001 ACH-000002 # 1 A1BG-AS1 0 0 7.321358 6.908333 # 2 A1CF 0 0 3.011470 1.189578 merge(df1, df2[c(9109, 38959:44999), ], by= 0) # still correct merge(df1, df2[c(9109, 38958:44999), ], by= 0) # same as merge(df1, df2, by=0) # Row.names 786-0 A498 ACH-000001 ACH-000002 # 1 A1CF 0 0 3.01147 1.189578 rownames(df2)[38958:38959] # [1] "ITFG2-AS1" "ADGRD1-AS1" rownames(df1)[2] <- "A1BGAS1" rownames(df2)[44999] <- "A1BGAS1" merge(df1, df2, by= 0) # Row.names 786-0 A498 ACH-000001 ACH-000002 # 1 A1BGAS1 0 0 7.321358 6.908333 # 2 A1CF 0 0 3.011470 1.189578
is.matrix: data.frame is not necessarily a matrix
See ?matrix. is.matrix returns TRUE if x is a vector and has a "dim" attribute of length 2 and FALSE otherwise.
An example that is a data frame (is.data.frame() returns TRUE) but not a matrix (is.matrix() returns FALSE) is an object returned by
X <- data.frame(x=1:2, y=3:4)
The 'X' object is NOT a vector and it does NOT have the "dim" attribute. It has only 3 attributes: "names", "row.names" & "class". Note that dim() function works fine and returns correctly though there is not "dim" attribute.
Another example that is a data frame but not a matrix is the built-in object cars; see ?matrix. It is not a vector
Convert a data frame to a matrix: as.matrix() vs data.matrix()
If I have a data frame X which recorded the time of some files.
- is.data.frame(X) shows TRUE but is.matrix(X) show FALSE
- as.matrix(X) will keep the time mode. The returned object is not a data frame anymore.
- data.matrix(X) will convert the time to numerical values. So use data.matrix() if the data is numeric. The returned object is not a data frame anymore.
# latex directory contains cache files from knitting an rmarkdown file X <- list.files("latex/", full.names = T) %>% grep("RData", ., value=T) %>% file.info() %>% `[`("mtime") X %>% is.data.frame() # TRUE X %>% is.matrix() # FALSE X %>% as.matrix() %>% is.matrix() # TRUE X %>% data.matrix() %>% is.matrix() # TRUE X %>% as.matrix() %>% "["(1:2, ) # timestamps X %>% data.matrix() %>% "["(1:2, ) # numeric
- The as.matrix() function is used to coerce an object into a matrix. It can be used with various types of R objects, such as vectors, data frames, and arrays.
- The data.matrix() function is specifically designed for converting a data frame into a matrix by coercing all columns to numeric values. If the data frame contains non-numeric columns, such as character or factor columns, data.matrix() will convert them to numeric values if possible (e.g., by converting factors to their integer codes).
- See the following example where as.matrix() and data.matrix() return different resuls.
df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z")) mat <- as.matrix(df) mat # a b # [1,] "1" "x" # [2,] "2" "y" # [3,] "3" "z" class(mat) # [1] "matrix" "array" mat2 <- data.matrix(df) mat2 # a b # [1,] 1 1 # [2,] 2 2 # [3,] 3 3 class(mat2) # [1] "matrix" "array" typeof(mat) # [1] "character" typeof(mat2) # [1] "double"
matrix vs data.frame
Case 1: colnames() is safer than names() if the object could be a data frame or a matrix.
Browse[2]> names(res2$surv.data.new[[index]]) NULL Browse[2]> colnames(res2$surv.data.new[[index]]) [1] "time" "status" "treat" "AKT1" "BRAF" "FLOT2" "MTOR" "PCK2" "PIK3CA" [10] "RAF1" Browse[2]> mode(res2$surv.data.new[[index]]) [1] "numeric" Browse[2]> is.matrix(res2$surv.data.new[[index]]) [1] TRUE Browse[2]> dim(res2$surv.data.new[[index]]) [1] 991 10
Case 2:
ip1 <- installed.packages()[,c(1,3:4)] # class(ip1) = 'matrix' unique(ip1$Priority) # Error in ip1$Priority : $ operator is invalid for atomic vectors unique(ip1[, "Priority"]) # OK ip2 <- as.data.frame(installed.packages()[,c(1,3:4)], stringsAsFactors = FALSE) # matrix -> data.frame unique(ip2$Priority) # OK
The length of a matrix and a data frame is different.
> length(matrix(1:6, 3, 2)) [1] 6 > length(data.frame(matrix(1:6, 3, 2))) [1] 2 > x[1] X1 1 1 2 2 3 3 4 4 5 5 6 6 > x1 [1] 1 2 3 4 5 6
So the length of a data frame is the number of columns. When we use sapply() function on a data frame, it will apply to each column of the data frame.
How to Remove Duplicates
How to Remove Duplicates in R with Example
Convert a matrix (not data frame) of characters to numeric
Just change the mode of the object
tmp <- cbind(a=c("0.12", "0.34"), b =c("0.567", "0.890")); tmp a b 1 0.12 0.567 2 0.34 0.890 > is.data.frame(tmp) # FALSE > is.matrix(tmp) # TRUE > sum(tmp) Error in sum(tmp) : invalid 'type' (character) of argument > mode(tmp) # "character" > mode(tmp) <- "numeric" > sum(tmp) [1] 1.917
Convert Data Frame Row to Vector
as.numeric() or c()
Convert characters to integers
mode(x) <- "integer"
Non-Standard Evaluation
Understanding Non-Standard Evaluation. Part 1: The Basics
Select Data Frame Columns in R
This is part of series of DATA MANIPULATION IN R from datanovia.com
- pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
- select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
- select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
- Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names
Another way is to the dollar sign $ operator (?"$") to extract rows or column from a data frame.
class(USArrests) # "data.frame" USArrests$"Assault"
Note that for both data frame and matrix objects, we need to use the [ operator to extract columns and/or rows.
USArrests[c("Alabama", "Alask"), c("Murder", "Assault")] # Murder Assault # Alabama 13.2 236 # Alaska 10.0 263 USArrests[c("Murder", "Assault")] # all rows tmp <- data(package="datasets") class(tmp$results) # "matrix" "array" tmp$results[, "Item"] # Same method can be used if rownames are available in a matrix
Note for a data.table object, we can extract columns using the column names without double quotes.
data.table(USArrests)[1:2, list(Murder, Assault)]
Add columns to a data frame
How to add columns to a data frame in R
Exclude/drop/remove data frame columns
# method 1 df = subset(mydata, select = -c(x,z) ) # method 2 drop <- c("x","z") df = mydata[,!(names(mydata) %in% drop)] # method 3: dplyr mydata2 = select(mydata, -a, -x, -y) mydata2 = select(mydata, -c(a, x, y)) mydata2 = select(mydata, -a:-y) mydata2 = mydata[,!grepl("^INC",names(mydata))]
Remove Rows from the data frame
Remove Rows from the data frame in R
Danger of selecting rows from a data frame
> dim(cars) [1] 50 2 > data.frame(a=cars[1,], b=cars[2, ]) a.speed a.dist b.speed b.dist 1 4 2 4 10 > dim(data.frame(a=cars[1,], b=cars[2, ])) [1] 1 4 > cars2 = as.matrix(cars) > data.frame(a=cars2[1,], b=cars2[2, ]) a b speed 4 4 dist 2 10
Creating data frame using structure() function
Creating data frame using structure() function in R
Create an empty data.frame
https://stackoverflow.com/questions/10689055/create-an-empty-data-frame
# the column types default as logical per vector(), but are then overridden a = data.frame(matrix(vector(), 5, 3, dimnames=list(c(), c("Date", "File", "User"))), stringsAsFactors=F) str(a) # NA but they are logical , not numeric. a[1,1] <- rnorm(1) str(a) # similar to above a <- data.frame(matrix(NA, nrow = 2, ncol = 3)) # different data type a <- data.frame(x1 = character(), x2 = numeric(), x3 = factor(), stringsAsFactors = FALSE)
Objects from subsetting a row in a data frame vs matrix
- Subsetting creates repeated rows. This will create unexpected rownames.
R> z <- data.frame(x=1:3, y=2:4) R> rownames(z) <- letters[1:3] R> rownames(z)[c(1,1)] [1] "a" "a" R> rownames(z[c(1,1),]) [1] "a" "a.1" R> z[c(1,1), ] x y a 1 2 a.1 1 2
- Convert a dataframe to a vector (by rows) The solution is as.vector(t(mydf[i, ])) or c(mydf[i, ]). My example:
str(trainData) # 'data.frame': 503 obs. of 500 variables: # $ bm001: num 0.429 1 -0.5 1.415 -1.899 ... # $ bm002: num 0.0568 1 0.5 0.3556 -1.16 ... # ... trainData[1:3, 1:3] # bm001 bm002 bm003 # 1 0.4289449 0.05676296 1.657966 # 2 1.0000000 1.00000000 1.000000 # 3 -0.5000000 0.50000000 0.500000 o <- data.frame(time = trainData[1, ], status = trainData[2, ], treat = trainData[3, ], t(TData)) # Warning message: # In data.frame(time = trainData[1, ], status = trainData[2, ], treat = trainData[3, : # row names were found from a short variable and have been discarded
'trees' data from the 'datasets' package
trees[1:3,] # Girth Height Volume # 1 8.3 70 10.3 # 2 8.6 65 10.3 # 3 8.8 63 10.2 # Wrong ways: data.frame(trees[1,] , trees[2,]) # Girth Height Volume Girth.1 Height.1 Volume.1 # 1 8.3 70 10.3 8.6 65 10.3 data.frame(time=trees[1,] , status=trees[2,]) # time.Girth time.Height time.Volume status.Girth status.Height status.Volume # 1 8.3 70 10.3 8.6 65 10.3 data.frame(time=as.vector(trees[1,]) , status=as.vector(trees[2,])) # time.Girth time.Height time.Volume status.Girth status.Height status.Volume # 1 8.3 70 10.3 8.6 65 10.3 data.frame(time=c(trees[1,]) , status=c(trees[2,])) # time.Girth time.Height time.Volume status.Girth status.Height status.Volume # 1 8.3 70 10.3 8.6 65 10.3 # Right ways: # method 1: dropping row names data.frame(time=c(t(trees[1,])) , status=c(t(trees[2,]))) # OR data.frame(time=as.numeric(trees[1,]) , status=as.numeric(trees[2,])) # time status # 1 8.3 8.6 # 2 70.0 65.0 # 3 10.3 10.3 # method 2: keeping row names data.frame(time=t(trees[1,]) , status=t(trees[2,])) # X1 X2 # Girth 8.3 8.6 # Height 70.0 65.0 # Volume 10.3 10.3 data.frame(time=unlist(trees[1,]) , status=unlist(trees[2,])) # time status # Girth 8.3 8.6 # Height 70.0 65.0 # Volume 10.3 10.3 # Method 3: convert a data frame to a matrix is.matrix(trees) # [1] FALSE trees2 <- as.matrix(trees) data.frame(time=trees2[1,] , status=trees2[2,]) # row names are kept # time status # Girth 8.3 8.6 # Height 70.0 65.0 # Volume 10.3 10.3 dim(trees[1,]) # [1] 1 3 dim(trees2[1, ]) # NULL trees[1, ] # notice the row name '1' on the left hand side # Girth Height Volume # 1 8.3 70 10.3 trees2[1, ] # Girth Height Volume # 8.3 70.0 10.3
Convert a list to data frame
How to Convert a List to a Data Frame in R.
# method 1 data.frame(t(sapply(my_list,c))) # method 2 library(dplyr) bind_rows(my_list) # OR bind_cols(my_list) # method 3 library(data.table) rbindlist(my_list)
tibble and data.table
Clean a dataset
How to clean the datasets in R
matrix
Define and subset a matrix
- Matrix in R
- It is clear when a vector becomes a matrix the data is transformed column-wisely (byrow = FALSE, by default).
- When subsetting a matrix, it follows the format: X[rows, colums] or X[y-axis, x-axis].
data <- c(2, 4, 7, 5, 10, 1) A <- matrix(data, ncol = 3) print(A) # [,1] [,2] [,3] # [1,] 2 7 10 # [2,] 4 5 1 A[1:1, 2:3, drop=F] # [,1] [,2] # [1,] 7 10
Prevent automatic conversion of single column to vector
use drop = FALSE such as mat[, 1, drop = FALSE].
complete.cases(): remove rows with missing in any column
It works on a sequence of vectors, matrices and data frames.
NROW vs nrow
?nrow. Use NROW/NCOL instead of nrow/ncol to treat vectors as 1-column matrices.
matrix (column-major order) multiply a vector
- Matrices in R R (like Fortran) are stored in a column-major order. It means array slice A[,1] are contiguous.
> matrix(1:6, 3,2) [,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6 > matrix(1:6, 3,2) * c(1,2,3) # c(1,2,3) will be recycled to form a matrix. Good quiz. [,1] [,2] [1,] 1 4 [2,] 4 10 [3,] 9 18 > matrix(1:6, 3,2) * c(1,2,3,4) # c(1,2,3,4) will be recycled [,1] [,2] [1,] 1 16 [2,] 4 5 [3,] 9 12
add a vector to all rows of a matrix
add a vector to all rows of a matrix. sweep() or rep() is the best.
sparse matrix
R convert matrix or data frame to sparseMatrix
To subset a vector from some column of a sparseMatrix, we need to convert it to a regular vector, as.vector().
Attributes
- Attributes in R
- Data structures in "Advanced R"
Names
Useful functions for dealing with object names. (Un)Setting object names: stats::setNames(), unname() and rlang::set_names()
Print a vector by suppressing names
Use unname. sapply(, , USE.NAMES = FALSE).
format.pval/print p-values/format p values
format.pval(). By default it will show 5 significant digits (getOption("digits")-2).
> set.seed(1); format.pval(c(stats::runif(5), pi^-100, NA)) [1] "0.26551" "0.37212" "0.57285" "0.90821" "0.20168" "< 2e-16" "NA" > format.pval(c(0.1, 0.0001, 1e-27)) [1] "1e-01" "1e-04" "<2e-16" R> pvalue [1] 0.0004632104 R> print(pvalue, digits =20) [1] 0.00046321036188223807528 R> format.pval(pvalue) [1] "0.00046321" R> format.pval(pvalue * 1e-1) [1] "4.6321e-05" R> format.pval(0.00004632) [1] "4.632e-05" R> getOption("digits") [1] 7
Return type
The format.pval() function returns a string, so it’s not appropriate to use the returned object for operations like sorting.
Wrong number of digits in format.pval()
See here. The solution is to apply round() and then format.pval().
x <- c(6.25433625041843e-05, NA, 0.220313341361346, NA, 0.154029880744594, 0.0378437685448703, 0.023358329881356, NA, 0.0262561986351483, 0.000251274794673796) format.pval(x, digits=3) # [1] "6.25e-05" "NA" "0.220313" "NA" "0.154030" "0.037844" "0.023358" # [8] "NA" "0.026256" "0.000251" round(x, 3) |> format.pval(digits=3, eps=.001) # [1] "<0.001" "NA" "0.220" "NA" "0.154" "0.038" "0.023" "NA" # [9] "0.026" "<0.001"
dplr::mutate_if()
library(dplyr) df <- data.frame( char_var = c("A", "B", "C"), num_var1 = c(1.123456, 2.123456, 3.123456), num_var2 = c(4.654321, 5.654321, 6.654321), stringsAsFactors = FALSE ) # Round numerical variables to 4 digits after the decimal point df_rounded <- df %>% mutate_if(is.numeric, round, digits = 4)
Customize R: options()
Change the default R repository, my .Rprofile
Edit global Rprofile file. On *NIX platforms, it's located in /usr/lib/R/library/base/R/Rprofile although local .Rprofile settings take precedence.
For example, I can specify the R mirror I like by creating a single line .Rprofile file under my home directory. Another good choice of repository is cloud.r-project.org.
Type file.edit("~/.Rprofile")
local({ r = getOption("repos") r["CRAN"] = "https://cran.rstudio.com/" options(repos = r) }) options(continue = " ", editor = "nano") message("Hi MC, loading ~/.Rprofile") if (interactive()) { .Last <- function() try(savehistory("~/.Rhistory")) }
Change the default web browser for utils::browseURL()
When I run help.start() function in LXLE, it cannot find its default web browser (seamonkey). The solution is to put
options(browser='seamonkey')
in the .Rprofile of your home directory. If the browser is not in the global PATH, we need to put the full path above.
For one-time only purpose, we can use the browser option in help.start() function:
> help.start(browser="seamonkey") If the browser launched by 'seamonkey' is already running, it is *not* restarted, and you must switch to its window. Otherwise, be patient ...
We can work made a change (or create the file) ~/.Renviron or etc/Renviron. See
- Changing default browser in options().
- https://stat.ethz.ch/R-manual/R-devel/library/utils/html/browseURL.html
Change the default editor
On my Linux and mac, the default editor is "vi". To change it to "nano",
options(editor = "nano")
Change prompt and remove '+' sign
See https://stackoverflow.com/a/1448823.
options(prompt="R> ", continue=" ")
digits
- Read and compute the sum of a numeric matrix file using R vs Python vs C++. Note by default R does not show digits after the decimal point because the number is large.
- Controlling number of decimal digits in print output in R
- ?print.default
- Formatting Decimal places in R, round(). format() where nsmall controls the minimum number of digits to the right of the decimal point
- numerical error in round() causing round to even to fail 2019-12-05
- signif() rounds x to n significant digits.
R> signif(pi, 3) [1] 3.14 R> signif(pi, 5) [1] 3.1416
- The default digits 7 may be too small. For example, if a number is very large, then we may not be able to see (enough) value after the decimal point. The acceptable range is 1-22. See the following examples
In R,
> options()$digits # Default [1] 7 > print(.1+.2, digits=18) [1] 0.300000000000000044 > 100000.07 + .04 [1] 100000.1 > options(digits = 16) > 100000.07 + .04 [1] 100000.11
In Python,
>>> 100000.07 + .04 100000.11
Disable scientific notation in printing: options(scipen)
How to Turn Off Scientific Notation in R?
This also helps with write.table() results. For example, 0.0003 won't become 3e-4 in the output file.
> numer = 29707; denom = 93874 > c(numer/denom, numer, denom) [1] 3.164561e-01 2.970700e+04 9.387400e+04 # Method 1. Without changing the global option > format(c(numer/denom, numer, denom), scientific=FALSE) [1] " 0.3164561" "29707.0000000" "93874.0000000" # Method 2. Change the global option > options(scipen=999) > numer/denom [1] 0.3164561 > c(numer/denom, numer, denom) [1] 0.3164561 29707.0000000 93874.0000000 > c(4/5, numer, denom) [1] 0.8 29707.0 93874.0
Suppress warnings: options() and capture.output()
Use options(). If warn is negative all warnings are ignored. If warn is zero (the default) warnings are stored until the top--level function returns.
op <- options("warn") options(warn = -1) .... options(op) # OR warnLevel <- options()$warn options(warn = -1) ... options(warn = warnLevel)
suppressWarnings( foo() ) foo <- capture.output( bar <- suppressWarnings( {print( "hello, world" ); warning("unwanted" )} ) )
str(iris, max.level=1) %>% capture.output(file = "/tmp/iris.txt")
Converts warnings into errors
options(warn=2)
demo() function
- How to wait for a keypress in R? PS readline() is different from readLines().
for(i in 1:2) { print(i); readline("Press [enter] to continue")}
- Hit 'ESC' or Ctrl+c to skip the prompt "Hit <Return> to see next plot:"
- demo() uses options() to ask users to hit Enter on each plot
op <- options(device.ask.default = ask) # ask = TRUE on.exit(options(op), add = TRUE)
sprintf
paste, paste0, sprintf
this post, 3 R functions that I enjoy
sep vs collapse in paste()
- sep is used if we supply multiple separate objects to paste(). A more powerful function is tidyr::unite() function.
- collapse is used to make the output of length 1. It is commonly used if we have only 1 input object
R> paste("a", "A", sep=",") # multi-vec -> multi-vec [1] "a,A" R> paste(c("Elon", "Taylor"), c("Mask", "Swift")) [1] "Elon Mask" "Taylor Swift" # OR R> sprintf("%s, %s", c("Elon", "Taylor"), c("Mask", "Swift")) R> paste(c("a", "A"), collapse="-") # one-vec/multi-vec -> one-scale [1] "a-A" # When use together, sep first and collapse second R> paste(letters[1:3], LETTERS[1:3], sep=",", collapse=" - ") [1] "a,A - b,B - c,C" R> paste(letters[1:3], LETTERS[1:3], sep=",") [1] "a,A" "b,B" "c,C" R> paste(letters[1:3], LETTERS[1:3], sep=",") |> paste(collapse=" - ") [1] "a,A - b,B - c,C"
Format number as fixed width, with leading zeros
- https://stackoverflow.com/questions/8266915/format-number-as-fixed-width-with-leading-zeros
- https://stackoverflow.com/questions/14409084/pad-with-leading-zeros-to-common-width?rq=1
# sprintf() a <- seq(1,101,25) sprintf("name_%03d", a) [1] "name_001" "name_026" "name_051" "name_076" "name_101" # formatC() paste("name", formatC(a, width=3, flag="0"), sep="_") [1] "name_001" "name_026" "name_051" "name_076" "name_101" # gsub() paste0("bm", gsub(" ", "0", format(5:15))) # [1] "bm05" "bm06" "bm07" "bm08" "bm09" "bm10" "bm11" "bm12" "bm13" "bm14" "bm15"
formatC and prettyNum (prettifying numbers)
R> (x <- 1.2345 * 10 ^ (-8:4)) [1] 1.2345e-08 1.2345e-07 1.2345e-06 1.2345e-05 1.2345e-04 1.2345e-03 [7] 1.2345e-02 1.2345e-01 1.2345e+00 1.2345e+01 1.2345e+02 1.2345e+03 [13] 1.2345e+04 R> formatC(x) [1] "1.234e-08" "1.234e-07" "1.234e-06" "1.234e-05" "0.0001234" "0.001234" [7] "0.01235" "0.1235" "1.234" "12.34" "123.4" "1234" [13] "1.234e+04" R> formatC(x, digits=3) [1] "1.23e-08" "1.23e-07" "1.23e-06" "1.23e-05" "0.000123" "0.00123" [7] "0.0123" "0.123" "1.23" "12.3" " 123" "1.23e+03" [13] "1.23e+04" R> formatC(x, digits=3, format="e") [1] "1.234e-08" "1.234e-07" "1.234e-06" "1.234e-05" "1.234e-04" "1.234e-03" [7] "1.235e-02" "1.235e-01" "1.234e+00" "1.234e+01" "1.234e+02" "1.234e+03" [13] "1.234e+04" R> x <- .000012345 R> prettyNum(x) [1] "1.2345e-05" R> x <- .00012345 R> prettyNum(x) [1] "0.00012345"
format(x, scientific = TRUE) vs round() vs format.pval()
Print numeric data in exponential format, so .0001 prints as 1e-4
format(c(0.00001156, 0.84134, 2.1669), scientific = T, digits=4) # [1] "1.156e-05" "8.413e-01" "2.167e+00" round(c(0.00001156, 0.84134, 2.1669), digits=4) # [1] 0.0000 0.8413 2.1669 format.pval(c(0.00001156, 0.84134, 2.1669)) # output is char vector # [1] "1.156e-05" "0.84134" "2.16690" format.pval(c(0.00001156, 0.84134, 2.1669), digits=4) # [1] "1.156e-05" "0.8413" "2.1669"
Creating publication quality graphs in R
HDF5 : Hierarchical Data Format
HDF5 is an open binary file format for storing and managing large, complex datasets. The file format was developed by the HDF Group, and is widely used in scientific computing.
- https://en.wikipedia.org/wiki/Hierarchical_Data_Format
- HDF5 tutorial and others
- rhdf5 package
- rhdf5 is used by ARCHS4 where you can download R program that will download hdf5 file storing expression and metadata such as gene ID, sample/GSM ID, tissues, et al.
Formats for writing/saving and sharing data
Efficiently Saving and Sharing Data in R
Write unix format files on Windows and vice versa
https://stat.ethz.ch/pipermail/r-devel/2012-April/063931.html
with() and within() functions
- Simplify Your Code with R’s Powerful Functions: with() and within()
- within() is similar to with() except it is used to create new columns and merge them with the original data sets. But if we just want to create a new column, we can just use df$newVar = . The following example is from youtube video.
closePr <- with(mariokart, totalPr - shipPr) head(closePr, 20) mk <- within(mariokart, { closePr <- totalPr - shipPr }) head(mk) # new column closePr mk <- mariokart aggregate(. ~ wheels + cond, mk, mean) # create mean according to each level of (wheels, cond) aggregate(totalPr ~ wheels + cond, mk, mean) tapply(mk$totalPr, mk[, c("wheels", "cond")], mean)
stem(): stem-and-leaf plot (alternative to histogram), bar chart on terminals
- https://en.wikipedia.org/wiki/Stem-and-leaf_display
- Tally plots in R
- https://stackoverflow.com/questions/14736556/ascii-plotting-functions-for-r
- txtplot package
Plot histograms as lines
https://stackoverflow.com/a/16681279. This is useful when we want to compare the distribution from different statistics.
x2=invisible(hist(out2$EB)) y2=invisible(hist(out2$Bench)) z2=invisible(hist(out2$EB0.001)) plot(x=x2$mids, y=x2$density, type="l") lines(y2$mids, y2$density, lty=2, pwd=2) lines(z2$mids, z2$density, lty=3, pwd=2)
Histogram with density line
hist(x, prob = TRUE) lines(density(x), col = 4, lwd = 2)
The overlayed density may looks strange in cases for example counts from single-cell RNASeq or p-values from RNASeq (there is a peak around x=0).
Graphical Parameters, Axes and Text, Combining Plots
15 Questions All R Users Have About Plots
See 15 Questions All R Users Have About Plots. This is a tremendous post. It covers the built-in plot() function and ggplot() from ggplot2 package.
- How To Draw An Empty R Plot? plot.new()
- How To Set The Axis Labels And Title Of The R Plots?
- How To Add And Change The Spacing Of The Tick Marks Of Your R Plot? axis()
- How To Create Two Different X- or Y-axes? par(new=TRUE), axis(), mtext(). ?par.
- How To Add Or Change The R Plot’s Legend? legend()
- How To Draw A Grid In Your R Plot? grid()
- How To Draw A Plot With A PNG As Background? rasterImage() from the png package
- How To Adjust The Size Of Points In An R Plot? cex argument
- How To Fit A Smooth Curve To Your R Data? loess() and lines()
- How To Add Error Bars In An R Plot? arrows()
- How To Save A Plot As An Image On Disc
- How To Plot Two R Plots Next To Each Other? par(mfrow)[which means Multiple Figures (use ROW-wise)], gridBase package, lattice package
- How To Plot Multiple Lines Or Points? plot(), lines()
- How To Fix The Aspect Ratio For Your R Plots? asp parameter
- What Is The Function Of hjust And vjust In ggplot2?
jitter function
- https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/jitter
- jitter(, amount) function adds a random variation between -amount/2 and amount/2 to each element in x
- What does the “jitter” function do in R?
- When to use Jitter
- How to calculate Area Under the Curve (AUC), or the c-statistic, by hand
Scatterplot with the "rug" function
require(stats) # both 'density' and its default method with(faithful, { plot(density(eruptions, bw = 0.15)) rug(eruptions) rug(jitter(eruptions, amount = 0.01), side = 3, col = "light blue") })
See also the stripchart() function which produces one dimensional scatter plots (or dot plots) of the given data.
Identify/Locate Points in a Scatter Plot
- ?identify
- Using the identify function in R
plot(x, y) identify(x, y, labels = names, plot = TRUE) # Use left clicks to select points we want to identify and "esc" to stop the process # This will put the labels on the plot and also return the indices of points # [1] 143 names[143]
Draw a single plot with two different y-axes
Draw Color Palette
Default palette before R 4.0
palette() # black, red, green3, blue, cyan, magenta, yellow, gray
# Example from Coursera "Statistics for Genomic Data Science" by Jeff Leek tropical = c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow') palette(tropical) plot(1:5, 1:5, col=1:5, pch=16, cex=5)
New palette in R 4.0.0
R 4.0: 3 new features, R 4.0.0 now available, and a look back at R's history. For example, we can select "ggplot2" palette to make the base graphics charts that match the color scheme of ggplot2.
R> palette() [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710" [8] "gray62" R> palette.pals() [1] "R3" "R4" "ggplot2" [4] "Okabe-Ito" "Accent" "Dark 2" [7] "Paired" "Pastel 1" "Pastel 2" [10] "Set 1" "Set 2" "Set 3" [13] "Tableau 10" "Classic Tableau" "Polychrome 36" [16] "Alphabet" R> palette.colors(palette='R4') # same as palette() [1] "#000000" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710" [8] "#9E9E9E" R> palette("R3") # nothing return on screen but palette has changed R> palette() [1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" [8] "gray" R> palette("R4") # reset to the default color palette; OR palette("default") R> scales::show_col(palette.colors(palette = "Okabe-Ito")) R> for(id in palette.pals()) { scales::show_col(palette.colors(palette = id)) title(id) readline("Press [enter] to continue") }
The palette function can also be used to change the color palette. See Setting up Color Palettes in R
palette("ggplot2") palette(palette()[-1]) # Remove 'black' # OR palette(palette.colors(palette = "ggplot2")[-1] ) with(iris, plot(Sepal.Length, Petal.Length, col = Species, pch=16)) cc <- palette() palette(c(cc,"purple","brown")) # Add two colors
R> colors() |> length() # [1] 657 R> colors(distinct = T) |> length() # [1] 502
evoPalette
Evolve new colour palettes in R with evoPalette
rtist
rtist: Use the palettes of famous artists in your own visualizations.
SVG
Embed svg in html
svglite
svglite is better R's svg(). It was used by ggsave(). svglite 1.2.0, R Graphics Cookbook.
pdf -> svg
Using Inkscape. See this post.
svg -> png
SVG to PNG using the gyro package
read.table
clipboard
source("clipboard") read.table("clipboard")
inline text
mydf <- read.table(header=T, text=' cond yval A 2 B 2.5 C 1.6 ')
http(s) connection
temp = getURL("https://gist.github.com/arraytools/6743826/raw/23c8b0bc4b8f0d1bfe1c2fad985ca2e091aeb916/ip.txt", ssl.verifypeer = FALSE) ip <- read.table(textConnection(temp), as.is=TRUE)
read only specific columns
Use 'colClasses' option in read.table, read.delim, .... For example, the following example reads only the 3rd column of the text file and also changes its data type from a data frame to a vector. Note that we have include double quotes around NULL.
x <- read.table("var_annot.vcf", colClasses = c(rep("NULL", 2), "character", rep("NULL", 7)), skip=62, header=T, stringsAsFactors = FALSE)[, 1] # system.time(x <- read.delim("Methylation450k.txt", colClasses = c("character", "numeric", rep("NULL", 188)), stringsAsFactors = FALSE))
To know the number of columns, we might want to read the first row first.
library(magrittr) scan("var_annot.vcf", sep="\t", what="character", skip=62, nlines=1, quiet=TRUE) %>% length()
Another method is to use pipe(), cut or awk. See ways to read only selected columns from a file into R
check.names = FALSE in read.table()
gx <- read.table(file, header = T, row.names =1) colnames(gx) %>% grep("[^[:alnum:] ]", ., value = TRUE) # [1] "hCG_1642354" "IGH." "IGHV1.69" "IGKV1.5" "IGKV2.24" "KRTAP13.2" # [7] "KRTAP19.1" "KRTAP2.4" "KRTAP5.9" "KRTAP6.3" "Kua.UEV" gx <- read.table(file, header = T, row.names =1, check.names = FALSE) colnames(gx) %>% grep("[^[:alnum:] ]", ., value = TRUE) # [1] "hCG_1642354" "IGH@" "IGHV1-69" "IGKV1-5" "IGKV2-24" "KRTAP13-2" # [7] "KRTAP19-1" "KRTAP2-4" "KRTAP5-9" "KRTAP6-3" "Kua-UEV"
setNames()
Change the colnames. See an example from tidymodels
Testing for valid variable names
Testing for valid variable names
make.names(): Make syntactically valid names out of character vectors
- make.names()
- A valid variable name consists of letters, numbers and the dot or underline characters. The variable name starts with a letter or the dot not followed by a number. See R variables.
make.names("abc-d") # [1] "abc.d"
Serialization
If we want to pass an R object to C (use recv() function), we can use writeBin() to output the stream size and then use serialize() function to output the stream to a file. See the post on R mailing list.
> a <- list(1,2,3) > a_serial <- serialize(a, NULL) > a_length <- length(a_serial) > a_length [1] 70 > writeBin(as.integer(a_length), connection, endian="big") > serialize(a, connection)
In C++ process, I receive one int variable first to get the length, and then read <length> bytes from the connection.
socketConnection
See ?socketconnection.
Simple example
from the socketConnection's manual.
Open one R session
con1 <- socketConnection(port = 22131, server = TRUE) # wait until a connection from some client writeLines(LETTERS, con1) close(con1)
Open another R session (client)
con2 <- socketConnection(Sys.info()["nodename"], port = 22131) # as non-blocking, may need to loop for input readLines(con2) while(isIncomplete(con2)) { Sys.sleep(1) z <- readLines(con2) if(length(z)) print(z) } close(con2)
Use nc in client
The client does not have to be the R. We can use telnet, nc, etc. See the post here. For example, on the client machine, we can issue
nc localhost 22131 [ENTER]
Then the client will wait and show anything written from the server machine. The connection from nc will be terminated once close(con1) is given.
If I use the command
nc -v -w 2 localhost -z 22130-22135
then the connection will be established for a short time which means the cursor on the server machine will be returned. If we issue the above nc command again on the client machine it will show the connection to the port 22131 is refused. PS. "-w" switch denotes the number of seconds of the timeout for connects and final net reads.
Some post I don't have a chance to read. http://digitheadslabnotebook.blogspot.com/2010/09/how-to-send-http-put-request-from-r.html
Use curl command in client
On the server,
con1 <- socketConnection(port = 8080, server = TRUE)
On the client,
curl --trace-ascii debugdump.txt http://localhost:8080/
Then go to the server,
while(nchar(x <- readLines(con1, 1)) > 0) cat(x, "\n") close(con1) # return cursor in the client machine
Use telnet command in client
On the server,
con1 <- socketConnection(port = 8080, server = TRUE)
On the client,
sudo apt-get install telnet telnet localhost 8080 abcdefg hijklmn qestst
Go to the server,
readLines(con1, 1) readLines(con1, 1) readLines(con1, 1) close(con1) # return cursor in the client machine
Some tutorial about using telnet on http request. And this is a summary of using telnet.
Subsetting
Subset assignment of R Language Definition and Manipulation of functions.
The result of the command x[3:5] <- 13:15 is as if the following had been executed
`*tmp*` <- x x <- "[<-"(`*tmp*`, 3:5, value=13:15) rm(`*tmp*`)
Avoid Coercing Indices To Doubles
Careful on NA value
See the example below. Use one of the following methods to remove NA from the logical variable.
- which()
- base::subset()
- dplyr::filter()
R> mydf = data.frame(a=1:3, b=c(NA,5,6)) R> mydf[mydf$b >5, ] a b NA NA NA 3 3 6 R> mydf[which(mydf$b >5), ] a b 3 3 6 R> mydf %>% dplyr::filter(b > 5) a b 1 3 6 R> subset(mydf, b>5) a b 3 3 6
Implicit looping
set.seed(1) i <- sample(c(TRUE, FALSE), size=10, replace = TRUE) # [1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE sum(i) # [1] 6 x <- 1:10 length(x[i]) # [1] 6 x[i[1:3]] # [1] 1 3 4 6 7 9 10 length(x[i[1:3]]) # [1] 7
modelling
update()
Extract all variable names in lm(), glm(), ...
all.vars(formula(Model)[-2])
as.formula(): use a string in formula in lm(), glm(), ...
- Changing the variable inside an R formula
- How to succinctly write a formula with many variables from a data frame?
? as.formula xnam <- paste("x", 1:25, sep="") fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
outcome <- "mpg" variables <- c("cyl", "disp", "hp", "carb") # Method 1. The 'Call' portion of the model is reported as “formula = f” # our modeling effort, # fully parameterized! f <- as.formula( paste(outcome, paste(variables, collapse = " + "), sep = " ~ ")) print(f) # mpg ~ cyl + disp + hp + carb model <- lm(f, data = mtcars) print(model) # Call: # lm(formula = f, data = mtcars) # # Coefficients: # (Intercept) cyl disp hp carb # 34.021595 -1.048523 -0.026906 0.009349 -0.926863 # Method 2. eval() + bquote() + ".()" format(terms(model)) # or model$terms # [1] "mpg ~ cyl + disp + hp + carb" # The new line of code model <- eval(bquote( lm(.(f), data = mtcars) )) print(model) # Call: # lm(formula = mpg ~ cyl + disp + hp + carb, data = mtcars) # # Coefficients: # (Intercept) cyl disp hp carb # 34.021595 -1.048523 -0.026906 0.009349 -0.926863 # Note if we skip ".()" operator > eval(bquote( lm(f, data = mtcars) )) Call: lm(formula = f, data = mtcars) Coefficients: (Intercept) cyl disp hp carb 34.021595 -1.048523 -0.026906 0.009349 -0.926863
- Changing the variable inside an R formula 1. as.formula() 2. subset by i 3. get() 4. eval(parse()).
reformulate
Simplifying Model Formulas with the R Function ‘reformulate()’
I() function
I() means isolates. See What does the capital letter "I" in R linear regression formula mean?, In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)
Aggregating results from linear model
https://stats.stackexchange.com/a/6862
Replacement function "fun(x) <- a"
What are Replacement Functions in R?
R> xx <- c(1,3,66, 99) R> "cutoff<-" <- function(x, value){ x[x > value] <- Inf x } R> cutoff(xx) <- 65 # xx & 65 are both input R> xx [1] 1 3 Inf Inf R> "cutoff<-"(x = xx, value = 65) [1] 1 3 Inf Inf
The statement fun(x) <- a and R will read x <- "fun<-"(x,a)
S3 and S4 methods and signature
- How S4 works in R https://www.rdocumentation.org/packages/methods/versions/3.5.1/topics/Methods_Details
- Software for Data Analysis: Programming with R by John Chambers
- Programming with Data: A Guide to the S Language by John Chambers
- Extending R by John M. Chambers, 2016
- https://www.rmetrics.org/files/Meielisalp2009/Presentations/Chalabi1.pdf
- A Simple Guide to S3 Methods
- Hands-On Programming with R by Garrett Grolemund
- https://www.stat.auckland.ac.nz/S-Workshop/Gentleman/S4Objects.pdf
- packS4: Toy Example of S4 Package, * A (Not So) Short Introduction to S4
- http://www.cyclismo.org/tutorial/R/s4Classes.html
- https://www.coursera.org/lecture/bioconductor/r-s4-methods-C4dNr
- https://www.bioconductor.org/help/course-materials/2013/UnderstandingRBioc2013/
- http://adv-r.had.co.nz/S4.html, http://adv-r.had.co.nz/OO-essentials.html
- Object-Oriented Programming in R (Part 1): An Introduction, Part 2: S3 Simplified
Debug an S4 function
- showMethods('FUNCTION')
- getMethod('FUNCTION', 'SIGNATURE')
- debug(, signature)
> args(debug) function (fun, text = "", condition = NULL, signature = NULL) > library(genefilter) # Bioconductor > showMethods("nsFilter") Function: nsFilter (package genefilter) eset="ExpressionSet" > debug(nsFilter, signature="ExpressionSet") library(DESeq2) showMethods("normalizationFactors") # show the object class # "DESeqDataSet" in this case. getMethod(`normalizationFactors`, "DESeqDataSet") # get the source code
See the source code of normalizationFactors<- (setReplaceMethod() is used) and the source code of estimateSizeFactors(). We can see how avgTxLength was used in estimateNormFactors().
Another example
library(GSVA) args(gsva) # function (expr, gset.idx.list, ...) showMethods("gsva") # Function: gsva (package GSVA) # expr="ExpressionSet", gset.idx.list="GeneSetCollection" # expr="ExpressionSet", gset.idx.list="list" # expr="matrix", gset.idx.list="GeneSetCollection" # expr="matrix", gset.idx.list="list" # expr="SummarizedExperiment", gset.idx.list="GeneSetCollection" # expr="SummarizedExperiment", gset.idx.list="list" debug(gsva, signature = c(expr="matrix", gset.idx.list="list")) # OR # debug(gsva, signature = c("matrix", "list")) gsva(y, geneSets, method="ssgsea", kcdf="Gaussian") Browse[3]> debug(.gsva) # return(ssgsea(expr, gset.idx.list, alpha = tau, parallel.sz = parallel.sz, # normalization = ssgsea.norm, verbose = verbose, # BPPARAM = BPPARAM)) isdebugged("gsva") # [1] TRUE undebug(gsva)
- getClassDef() in S4 (Bioconductor course).
library(IRanges) ir <- IRanges(start=c(10, 20, 30), width=5) ir class(ir) ## [1] "IRanges" ## attr(,"package") ## [1] "IRanges" getClassDef(class(ir)) ## Class "IRanges" [package "IRanges"] ## ## Slots: ## ## Name: start width NAMES elementType ## Class: integer integer characterORNULL character ## ## Name: elementMetadata metadata ## Class: DataTableORNULL list ## ## Extends: ## Class "Ranges", directly ## Class "IntegerList", by class "Ranges", distance 2 ## Class "RangesORmissing", by class "Ranges", distance 2 ## Class "AtomicList", by class "Ranges", distance 3 ## Class "List", by class "Ranges", distance 4 ## Class "Vector", by class "Ranges", distance 5 ## Class "Annotated", by class "Ranges", distance 6 ## ## Known Subclasses: "NormalIRanges"
Check if a function is an S4 method
isS4(foo)
How to access the slots of an S4 object
- @ will let you access the slots of an S4 object.
- Note that often the best way to do this is to not access the slot directly but rather through an accessor function (e.g. coefs() rather than digging out the coefficients with $ or @). However, often such functions do not exist so you have to access the slots directly. This will mean that your code breaks if the internal implementation changes, however.
- R - S4 Classes and Methods Hansen. getClass() or getClassDef().
setReplaceMethod()
- What's the difference between setMethod(“$<-”) and set setReplaceMethod(“$”)?
- What is setReplaceMethod() and how does it work?
See what methods work on an object
see what methods work on an object, e.g. a GRanges object:
methods(class="GRanges")
Or if you have an object, x:
methods(class=class(x))
View S3 function definition: double colon '::' and triple colon ':::' operators and getAnywhere()
?":::"
- pkg::name returns the value of the exported variable name in namespace pkg
- pkg:::name returns the value of the internal variable name
base::"+" stats:::coef.default predict.ppr # Error: object 'predict.ppr' not found stats::predict.ppr # Error: 'predict.ppr' is not an exported object from 'namespace:stats' stats:::predict.ppr # OR getS3method("predict", "ppr") getS3method("t", "test")
methods() + getAnywhere() functions
Read the source code (include Fortran/C, S3 and S4 methods)
- lookup package
- Read the source
- Find the source code in UseMethod("XXX") for S3 methods.
S3 method is overwritten
For example, the select() method from dplyr is overwritten by grpreg package.
An easy solution is to load grpreg before loading dplyr.
mcols() and DataFrame() from Bioc S4Vectors package
- mcols: Get or set the metadata columns.
- colData: SummarizedExperiment instances from GenomicRanges
- DataFrame: The DataFrame class extends the DataTable virtual class and supports the storage of any type of object (with length and [ methods) as columns.
For example, in Shrinkage of logarithmic fold changes vignette of the DESeq2paper package
> mcols(ddsNoPrior[genes, ]) DataFrame with 2 rows and 21 columns baseMean baseVar allZero dispGeneEst dispFit dispersion dispIter dispOutlier dispMAP <numeric> <numeric> <logical> <numeric> <numeric> <numeric> <numeric> <logical> <numeric> 1 163.5750 8904.607 FALSE 0.06263141 0.03862798 0.0577712 7 FALSE 0.0577712 2 175.3883 59643.515 FALSE 2.25306109 0.03807917 2.2530611 12 TRUE 1.6011440 Intercept strain_DBA.2J_vs_C57BL.6J SE_Intercept SE_strain_DBA.2J_vs_C57BL.6J WaldStatistic_Intercept <numeric> <numeric> <numeric> <numeric> <numeric> 1 6.210188 1.735829 0.1229354 0.1636645 50.515872 2 6.234880 1.823173 0.6870629 0.9481865 9.074686 WaldStatistic_strain_DBA.2J_vs_C57BL.6J WaldPvalue_Intercept WaldPvalue_strain_DBA.2J_vs_C57BL.6J <numeric> <numeric> <numeric> 1 10.60602 0.000000e+00 2.793908e-26 2 1.92280 1.140054e-19 5.450522e-02 betaConv betaIter deviance maxCooks <logical> <numeric> <numeric> <numeric> 1 TRUE 3 210.4045 0.2648753 2 TRUE 9 243.7455 0.3248949
Pipe
- Differences between the base R and magrittr pipes 4/21/2023
- R is Getting an Official Pipe Operator, My Opinion on R’s Upcoming Pipe
- a(b(x)) vs x |> b() |> a(). See this tweet in R-dev 2020-12-04.
e0 <- quote(a(b(x))) e1 <- quote(x |> b() |> a()) identical(e0, e1)
- There are now 3 different R pipes
- Error: The pipe operator requires a function call as RHS.
# native pipe foo |> bar() # magrittr pipe foo %>% bar
- Use the new R pipe built into R 4.1
- The New Native Pipe Operator in R
- Understanding the native R pipe |>
- Navigating the Data Pipes: An R Programming Journey with Mario Bros
Packages take advantage of pipes
- rstatix: Pipe-Friendly Framework for Basic Statistical Tests
findInterval()
Related functions are cuts() and split(). See also
Assign operator
- Earlier versions of R used underscore (_) as an assignment operator.
- Assignments with the = Operator
- In R 1.8.0 (2003), the assign operator has been removed. See NEWS.
- In R 1.9.0 (2004), "_" is allowed in valid names. See NEWS.
Operator precedence
The ':' operator has higher precedence than '-' so 0:N-1 evaluates to (0:N)-1, not 0:(N-1) like you probably wanted.
order(), rank() and sort()
If we want to find the indices of the first 25 genes with the smallest p-values, we can use order(pval)[1:25].
> x = sample(10) > x [1] 4 3 10 7 5 8 6 1 9 2 > order(x) [1] 8 10 2 1 5 7 4 6 9 3 > rank(x) [1] 4 3 10 7 5 8 6 1 9 2 > rank(10*x) [1] 4 3 10 7 5 8 6 1 9 2 > x[order(x)] [1] 1 2 3 4 5 6 7 8 9 10 > sort(x) [1] 1 2 3 4 5 6 7 8 9 10
relate order() and rank()
- Order to rank: rank() = order(order())
set.seed(1) x <- rnorm(5) order(x) # [1] 3 1 2 5 4 rank(x) # [1] 2 3 1 5 4 order(order(x)) # [1] 2 3 1 5 4 all(rank(x) == order(order(x))) # TRUE
- Order to Rank method 2: rank(order()) = 1:n
ord <- order(x) ranks <- integer(length(x)) ranks[ord] <- seq_along(x) ranks # [1] 2 3 1 5 4
- Rank to Order:
ranks <- rank(x) ord <- order(ranks) ord # [1] 3 1 2 5 4
OS-dependent results on sorting string vector
Gene symbol case.
# mac: order(c("DC-UbP", "DC2")) # c(1,2) # linux: order(c("DC-UbP", "DC2")) # c(2,1)
Affymetric id case.
# mac: order(c("202800_at", "2028_s_at")) # [1] 2 1 sort(c("202800_at", "2028_s_at")) # [1] "2028_s_at" "202800_at" # linux order(c("202800_at", "2028_s_at")) # [1] 1 2 sort(c("202800_at", "2028_s_at")) # [1] "202800_at" "2028_s_at"
It does not matter if we include factor() on the character vector.
The difference is related to locale. See
- ?locales in R
- On OS, type locale
- sort() produces different results in Ubuntu and Windows
- To fix the inconsistency problem, we can set the locale in R code to "C" or use the stringr package (the locale is part of str_order()'s arguments).
# both mac and linux stringr::str_order(c("202800_at", "2028_s_at")) # [1] 2 1 stringr::str_order(c("DC-UbP", "DC2")) # [1] 1 2 # Or setting the locale to "C" Sys.setlocale("LC_ALL", "C"); sort(c("DC-UbP", "DC2")) # Or Sys.setlocale("LC_COLLATE", "C"); sort(c("DC-UbP", "DC2")) # But not Sys.setlocale("LC_ALL", "en_US.UTF-8"); sort(c("DC-UbP", "DC2"))
unique()
It seems it does not sort. ?unique.
# mac & linux R> unique(c("DC-UbP", "DC2")) [1] "DC-UbP" "DC2"
do.call
do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
The do.call() function in R: Unlocking Efficiency and Flexibility
Below are some examples from the help.
- Usage
do.call(what, args, quote = FALSE, envir = parent.frame()) # what: either a function or a non-empty character string naming the function to be called. # args: a list of arguments to the function call. The names attribute of args gives the argument names. # quote: a logical value indicating whether to quote the arguments. # envir: an environment within which to evaluate the call. This will be most useful # if what is a character string and the arguments are symbols or quoted expressions.
- do.call() is similar to lapply() but not the same. It seems do.call() can make a simple function vectorized.
> do.call("complex", list(imag = 1:3)) [1] 0+1i 0+2i 0+3i > lapply(list(imag = 1:3), complex) $imag [1] 0+0i > complex(imag=1:3) [1] 0+1i 0+2i 0+3i > do.call(function(x) x+1, list(1:3)) [1] 2 3 4
- Applying do.call with Multiple Arguments
> do.call("sum", list(c(1,2,3,NA), na.rm = TRUE)) [1] 6 > do.call("sum", list(c(1,2,3,NA) )) [1] NA
> tmp <- expand.grid(letters[1:2], 1:3, c("+", "-")) > length(tmp) [1] 3 > tmp[1:4,] Var1 Var2 Var3 1 a 1 + 2 b 1 + 3 a 2 + 4 b 2 + > c(tmp, sep = "") $Var1 [1] a b a b a b a b a b a b Levels: a b $Var2 [1] 1 1 2 2 3 3 1 1 2 2 3 3 $Var3 [1] + + + + + + - - - - - - Levels: + - $sep [1] "" > do.call("paste", c(tmp, sep = "")) [1] "a1+" "b1+" "a2+" "b2+" "a3+" "b3+" "a1-" "b1-" "a2-" "b2-" "a3-" [12] "b3-"
- environment and quote arguments.
> A <- 2 > f <- function(x) print(x^2) > env <- new.env() > assign("A", 10, envir = env) > assign("f", f, envir = env) > f <- function(x) print(x) > f(A) [1] 2 > do.call("f", list(A)) [1] 2 > do.call("f", list(A), envir = env) [1] 4 > do.call(f, list(A), envir = env) [1] 2 # Why? > eval(call("f", A)) [1] 2 > eval(call("f", quote(A))) [1] 2 > eval(call("f", A), envir = env) [1] 4 > eval(call("f", quote(A)), envir = env) [1] 100
- Good use case; see Get all Parameters as List
> foo <- function(a=1, b=2, ...) { list(arg=do.call(c, as.list(match.call())[-1])) } > foo() $arg NULL > foo(a=1) $arg a 1 > foo(a=1, b=2, c=3) $arg a b c 1 2 3
- do.call() + switch(). See an example from Seurat::NormalizeData.
do.call( what = switch( EXPR = margin, '1' = 'rbind', '2' = 'cbind', stop("'margin' must be 1 or 2") ), args = normalized.data ) switch('a', 'a' = rnorm(3), 'b'=rnorm(4)) # switch returns a value do.call(switch('a', 'a' = 'rnorm', 'b'='rexp'), args=list(n=4)) # switch returns a function
- The function we want to call is a string that may change: glmnet
# Suppose we want to call cv.glmnet or cv.coxnet or cv.lognet or cv.elnet .... depending on the case fun = paste("cv", subclass, sep = ".") cvstuff = do.call(fun, list(predmat,y,type.measure,weights,foldid,grouped))
expand.grid, mapply, vapply
A faster way to generate combinations for mapply and vapply
do.call vs mapply
- do.call() is doing what mapply() does but do.call() uses a list instead of multiple arguments. So do.call() more close to base::Map() function.
> mapply(paste, tmp[1], tmp[2], tmp[3], sep = "") Var1 [1,] "a1+" [2,] "b1+" [3,] "a2+" [4,] "b2+" [5,] "a3+" [6,] "b3+" [7,] "a1-" [8,] "b1-" [9,] "a2-" [10,] "b2-" [11,] "a3-" [12,] "b3-" # It does not work if we do not explicitly specify the arguments in mapply() > mapply(paste, tmp, sep = "") Var1 Var2 Var3 [1,] "a" "1" "+" [2,] "b" "1" "+" [3,] "a" "2" "+" [4,] "b" "2" "+" [5,] "a" "3" "+" [6,] "b" "3" "+" [7,] "a" "1" "-" [8,] "b" "1" "-" [9,] "a" "2" "-" [10,] "b" "2" "-" [11,] "a" "3" "-" [12,] "b" "3" "-"
- mapply is useful in generating variables with a vector of parameters. For example suppose we want to generate variables from exponential/weibull distribution and a vector of scale parameters (depending on some covariates). In this case we can use (Simulating Weibull distributions from vectors of parameters in R)
set.seed(1) mapply(rweibull, 1, c(1, 10), MoreArgs=list(n=1)) # [1] 1.326108 9.885284 set.seed(1) x <- replicate(1000, mapply(rweibull, 1, c(1, 10), MoreArgs=list(n=1))) dim(x) # [1] 2 1000 rowMeans(x) # [1] 1.032209 10.104131
set.seed(1); Vectorize(rweibull)(n=1, shape=1, scale=c(1, 10)) # [1] 1.326108 9.885284 set.seed(1); x <- replicate(1000, Vectorize(rweibull)(n=1, shape=1, scale=c(1, 10)))
do.call vs lapply
What's the difference between lapply and do.call? It seems to me the best usage is combining both functions: do.call(..., lapply())
- lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
- do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is widely used, for example, to assemble lists into simpler structures (often with rbind or cbind).
- Map applies a function to the corresponding elements of given vectors... Map is a simple wrapper to mapply which does not attempt to simplify the result, similar to Common Lisp's mapcar (with arguments being recycled, however). Future versions may allow some control of the result type.
> lapply(iris, class) # same as Map(class, iris) $Sepal.Length [1] "numeric" $Sepal.Width [1] "numeric" $Petal.Length [1] "numeric" $Petal.Width [1] "numeric" $Species [1] "factor" > x <- lapply(iris, class) > do.call(c, x) Sepal.Length Sepal.Width Petal.Length Petal.Width Species "numeric" "numeric" "numeric" "numeric" "factor"
https://stackoverflow.com/a/10801902
- lapply applies a function over a list. So there will be several function calls.
- do.call calls a function with a list of arguments (... argument) such as c() or rbind()/cbind() or sum or order or "[" or paste. So there is only one function call.
> X <- list(1:3,4:6,7:9) > lapply(X,mean) 1 [1] 2 2 [1] 5 3 [1] 8 > do.call(sum, X) [1] 45 > sum(c(1,2,3), c(4,5,6), c(7,8,9)) [1] 45 > do.call(mean, X) # Error > do.call(rbind,X) [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 > lapply(X,rbind) 1 [,1] [,2] [,3] [1,] 1 2 3 2 [,1] [,2] [,3] [1,] 4 5 6 3 [,1] [,2] [,3] [1,] 7 8 9 > mapply(mean, X, trim=c(0,0.5,0.1)) [1] 2 5 8 > mapply(mean, X) [1] 2 5 8
Below is a good example to show the difference of lapply() and do.call() - Generating Random Strings.
> set.seed(1) > x <- replicate(2, sample(LETTERS, 4), FALSE) > x 1 [1] "Y" "D" "G" "A" 2 [1] "B" "W" "K" "N" > lapply(x, paste0) 1 [1] "Y" "D" "G" "A" 2 [1] "B" "W" "K" "N" > lapply(x, paste0, collapse= "") 1 [1] "YDGA" 2 [1] "BWKN" > do.call(paste0, x) [1] "YB" "DW" "GK" "AN"
do.call + rbind + lapply
Lots of examples. See for example this one for creating a data frame from a vector.
x <- readLines(textConnection("---CLUSTER 1 --- 3 4 5 6 ---CLUSTER 2 --- 9 10 8 11")) # create a list of where the 'clusters' are clust <- c(grep("CLUSTER", x), length(x) + 1L) # get size of each cluster clustSize <- diff(clust) - 1L # get cluster number clustNum <- gsub("[^0-9]+", "", x[grep("CLUSTER", x)]) result <- do.call(rbind, lapply(seq(length(clustNum)), function(.cl){ cbind(Object = x[seq(clust[.cl] + 1L, length = clustSize[.cl])] , Cluster = .cl ) })) result Object Cluster [1,] "3" "1" [2,] "4" "1" [3,] "5" "1" [4,] "6" "1" [5,] "9" "2" [6,] "10" "2" [7,] "8" "2" [8,] "11" "2"
A 2nd example is to sort a data frame by using do.call(order, list()).
Another example is to reproduce aggregate(). aggregate() = do.call() + by().
attach(mtcars) do.call(rbind, by(mtcars, list(cyl, vs), colMeans)) # the above approach give the same result as the following # except it does not have an extra Group.x columns aggregate(mtcars, list(cyl, vs), FUN=mean)
Run examples
When we call help(FUN), it shows the document in the browser. The browser will show
example(FUN, package = "XXX") was run in the console To view output in the browser, the knitr package must be installed
How to get examples from help file, example()
Code examples in the R package manuals:
# How to run all examples from a man page example(within) # How to check your examples? devtools::run_examples() testthat::test_examples()
See this post. Method 1:
example(acf, give.lines=TRUE)
Method 2:
Rd <- utils:::.getHelpFile(?acf) tools::Rd2ex(Rd)
"[" and "[[" with the sapply() function
Suppose we want to extract string from the id like "ABC-123-XYZ" before the first hyphen.
sapply(strsplit("ABC-123-XYZ", "-"), "[", 1)
is the same as
sapply(strsplit("ABC-123-XYZ", "-"), function(x) x[1])
Dealing with dates
- Simple examples
dates <- c("January 15, 2023", "December 31, 1999") date_objects <- as.Date(dates, format = "%B %d, %Y") # format is for the input # [1] "2023-01-15" "1999-12-31"
- Find difference
# Convert the dates to Date objects date1 <- as.Date("6/29/21", format="%m/%d/%y") date2 <- as.Date("11/9/21", format="%m/%d/%y") # Calculate the difference in days diff_days <- as.numeric(difftime(date2, date1, units="days")) # 133 # In months diff_days / (365.25/12) # 4.36961 # OR using the lubridate package library(lubridate) # Convert the dates to Date objects date1 <- mdy("6/29/21") date2 <- mdy("11/9/21") interval(date1, date2) %/% months(1)
- http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
d1 = date() class(d1) # "character" d2 = Sys.Date() class(d2) # "Date" format(d2, "%a %b %d") library(lubridate); ymd("20140108") # "2014-01-08 UTC" mdy("08/04/2013") # "2013-08-04 UTC" dmy("03-04-2013") # "2013-04-03 UTC" ymd_hms("2011-08-03 10:15:03") # "2011-08-03 10:15:03 UTC" ymd_hms("2011-08-03 10:15:03", tz="Pacific/Auckland") # "2011-08-03 10:15:03 NZST" ?Sys.timezone x = dmy(c("1jan2013", "2jan2013", "31mar2013", "30jul2013")) wday(x[1]) # 3 wday(x[1], label=TRUE) # Tues
- http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
- http://rpubs.com/seandavi/GEOMetadbSurvey2014
- We want our dates and times as class "Date" or the class "POSIXct", "POSIXlt". For more information type ?POSIXlt.
- anytime package
- weeks to Christmas difftime(as.Date(“2019-12-25”), Sys.Date(), units =“weeks”)
- A Comprehensive Introduction to Handling Date & Time in R 2020
- Working with Dates and Times Pt 1
- Three major functions: as.Date(), as.POSIXct(), and as.POSIXlt().
- POSIXct is a class in R that represents date-time data. The ct stands for “calendar time” and it represents the (signed) number of seconds since the beginning of 1970 as a numeric vector1. It stores date time as integer.
- POSIXlt is a class in R that represents date-time data. It stands for “local time” and is a list with components as integer vectors, which can represent a vector of broken-down times. It stores date time as list:sec, min, hour, mday, mon, year, wday, yday, isdst, zone, gmtoff.
- R lubridate: How To Efficiently Work With Dates and Times in R 2023
Nonstandard/non-standard evaluation, deparse/substitute and scoping
- Standard and Non-Standard Evaluation in R
- Nonstandard evaluation from Advanced R book.
- Non-standard evaluation, how tidy eval builds on base R
- Vignette from the lazyeval package. It is needed in three cases
- Labelling: turn an argument into a label
- Formulas
- Dot-dot-dot
- substitute(expr, env) - capture expression. The return mode is a call.
- substitute() is often paired with deparse() to create informative labels for data sets and plots. The return mode of deparse() is character strings.
- Use 'substitute' to include the variable's name in a plot title, e.g.: var <- "abc"; hist(var,main=substitute(paste("Dist of ", var))) will show the title "Dist of var" instead of "Dist of abc" in the title.
- Passing a variable name to a function in R
- Example:
f <- function(x) { substitute(x) } f(1:10) # 1:10 class(f(1:10)) # or mode() # [1] "call" g <- function(x) deparse(substitute(x)) g(1:10) # [1] "1:10" class(g(1:10)) # or mode() # [1] "character"
- quote(expr) - similar to substitute() but do nothing?? noquote - print character strings without quotes
mode(quote(1:10)) # [1] "call"
- eval(expr, envir), evalq(expr, envir) - eval evaluates its first argument in the current scope before passing it to the evaluator: evalq avoids this.
- The parent.frame() is necessary in cases like the stats::update() function used by relax.glmnet().
- Example:
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1)) subset1 <- function(x, condition) { condition_call <- substitute(condition) r <- eval(condition_call, x) x[r, ] } x <- 4 condition <- 4 subset1(sample_df, a== 4) # same as subset(sample_df, a >= 4) subset1(sample_df, a== x) # WRONG! subset1(sample_df, a == condition) # ERROR subset2 <- function(x, condition) { condition_call <- substitute(condition) r <- eval(condition_call, x, parent.frame()) x[r, ] } subset2(sample_df, a == 4) # same as subset(sample_df, a >= 4) subset2(sample_df, a == x) # 👌 subset2(sample_df, a == condition) # 👍
- deparse(expr) - turns unevaluated expressions into character strings. For example,
> deparse(args(lm)) [1] "function (formula, data, subset, weights, na.action, method = \"qr\", " [2] " model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, " [3] " contrasts = NULL, offset, ...) " [4] "NULL" > deparse(args(lm), width=20) [1] "function (formula, data, " " subset, weights, " [3] " na.action, method = \"qr\", " " model = TRUE, x = FALSE, " [5] " y = FALSE, qr = TRUE, " " singular.ok = TRUE, " [7] " contrasts = NULL, " " offset, ...) " [9] "NULL"
- parse(text) - returns the parsed but unevaluated expressions in a list. See Create a Simple Socket Server in R for the application of eval(parse(text)). Be cautious!
Following is another example. Assume we have a bunch of functions (f1, f2, ...; each function implements a different algorithm) with same input arguments format (eg a1, a2). We like to run these function on the same data (to compare their performance).
f1 <- function(x) x+1; f2 <- function(x) x+2; f3 <- function(x) x+3 f1(1:3) f2(1:3) f3(1:3) # Or myfun <- function(f, a) { eval(parse(text = f))(a) } myfun("f1", 1:3) myfun("f2", 1:3) myfun("f3", 1:3) # Or with lapply method <- c("f1", "f2", "f3") res <- lapply(method, function(M) { Mres <- eval(parse(text = M))(1:3) return(Mres) }) names(res) <- method
library() accept both quoted and unquoted strings
How can library() accept both quoted and unquoted strings. The key lines are
if (!character.only) package <- as.character(substitute(package))
Lexical scoping
- R Objects, S Objects, and Lexical Scoping
- Dynamic scoping vs Lexical scoping and the example of optimization
- Indicating local functions in R scripts
The ‘…’ argument
- See Section 10.4 of An Introduction to R. Especially, the expression list(...) evaluates all such arguments and returns them in a named list
- Some notes when using dot-dot-dot (…) in R
- How to check if any arguments were passed via “…” (ellipsis) in R? Is missing(…) valid?
Functions
- https://adv-r.hadley.nz/functions.html
- Writing Better R Functions — Best Practices and Tips. The docstring package and "?" is interesting!
Function argument
Argument matching from R Language Definition manual.
Argument matching is augmented by the functions
Access to the partial matching algorithm used by R is via pmatch.
Check function arguments
Checking the inputs of your R functions: match.arg() , stopifnot()
stopifnot(): function argument sanity check
- stopifnot(). stopifnot is a quick way to check multiple conditions on the input. so for instance. The code stops when either of the three conditions are not satisfied. However, it doesn't produce pretty error messages.
stopifnot(condition1, condition2, ...)
- Mining R 4.0.0 Changelog for Nuggets of Gold
Lazy evaluation in R functions arguments
- http://adv-r.had.co.nz/Functions.html
- https://stat.ethz.ch/pipermail/r-devel/2015-February/070688.html
- https://twitter.com/_wurli/status/1451459394009550850
R function arguments are lazy — they’re only evaluated if they’re actually used.
- Example 1. By default, R function arguments are lazy.
f <- function(x) { 999 } f(stop("This is an error!")) #> [1] 999
- Example 2. If you want to ensure that an argument is evaluated you can use force().
add <- function(x) { force(x) function(y) x + y } adders2 <- lapply(1:10, add) adders2[[1]](10) #> [1] 11 adders2[[10]](10) #> [1] 20
- Example 3. Default arguments are evaluated inside the function.
f <- function(x = ls()) { a <- 1 x } # ls() evaluated inside f: f() # [1] "a" "x" # ls() evaluated in global environment: f(ls()) # [1] "add" "adders" "f"
- Example 4. Laziness is useful in if statements — the second statement below will be evaluated only if the first is true.
x <- NULL if (!is.null(x) && x > 0) { }
Use of functions as arguments
Just Quickly: The unexpected use of functions as arguments
body()
Remove top axis title base plot
Return functions in R
- How and why to return functions in R
- See the doc & example from taskCallback - Create an R-level task callback manager. Top-level Task Callbacks in R.
- Turn R users insane with evil
anonymous function
In R, the main difference between a lambda function (also known as an anonymous function) and a regular function is that a lambda function is defined without a name, while a regular function is defined with a name.
- See Tidyverse page
- But defining functions to use them only once is kind of overkill. That's why you can use so-called anonymous functions in R. For example, lapply(list(1,2,3), function(x) { x * x })
- you can use lambda functions with many other functions in R that take a function as an argument. Some examples include sapply, apply, vapply, mapply, Map, Reduce, Filter, and Find. These functions all work in a similar way to lapply by applying a function to elements of a list or vector.
Reduce(function(x, y) x*y, list(1, 2, 3, 4)) # 24
- purrr anonymous function
- The new pipe and anonymous function syntax in R 4.1.0
- Functional programming from Advanced R
- What are anonymous functions in R.
> (function(x) x * x)(3) [1] 9 > (\(x) x * x)(3) [1] 9
Backtick sign, infix/prefix/postfix operators
The backtick sign ` (not the single quote) refers to functions or variables that have otherwise reserved or illegal names; e.g. '&&', '+', '(', 'for', 'if', etc. See some examples in Advanced R and What do backticks do in R?.
iris %>% `[[`("Species")
infix operator.
1 + 2 # infix + 1 2 # prefix 1 2 + # postfix
Use with functions like sapply, e.g. sapply(1:5, `+`, 3) .
Error handling and exceptions, tryCatch(), stop(), warning() and message()
- http://adv-r.had.co.nz/Exceptions-Debugging.html
- Catch Me If You Can: Exception Handling in R
- Temporarily disable warning messages
# Method1: suppressWarnings(expr) # Method 2: <pre> defaultW <- getOption("warn") options(warn = -1) [YOUR CODE] options(warn = defaultW)
- try() allows execution to continue even after an error has occurred. You can suppress the message with try(..., silent = TRUE).
out <- try({ a <- 1 b <- "x" a + b }) elements <- list(1:10, c(-1, 10), c(T, F), letters) results <- lapply(elements, log) is.error <- function(x) inherits(x, "try-error") succeeded <- !sapply(results, is.error)
- tryCatch(): With tryCatch() you map conditions to handlers (like switch()), named functions that are called with the condition as an input. Note that try() is a simplified version of tryCatch().
tryCatch(expr, ..., finally) show_condition <- function(code) { tryCatch(code, error = function(c) "error", warning = function(c) "warning", message = function(c) "message" ) } show_condition(stop("!")) #> [1] "error" show_condition(warning("?!")) #> [1] "warning" show_condition(message("?")) #> [1] "message" show_condition(10) #> [1] 10
Below is another snippet from available.packages() function,
z <- tryCatch(download.file(....), error = identity) if (!inherits(z, "error")) STATEMENTS
- The return class from tryCatch() may not be fixed.
result <- tryCatch({ # Code that might generate an error or warning log(99) }, warning = function(w) { # Code to handle warnings print(paste("Warning:", w)) }, error = function(e) { # Code to handle errors print(paste("Error:", e)) }, finally = { # Code to always run, regardless of whether an error or warning occurred print("Finished") }) # character type. But if we remove 'finally', it will be numeric.
- Capture message, warnings and errors from a R function
suppressMessages()
suppressMessages(expression)
List data type
Create an empty list
out <- vector("list", length=3L) # OR out <- list() for(j in 1:3) out[[j]] <- myfun(j) outlist <- as.list(seq(nfolds))
Nested list of data frames
An array can only hold data of a single type. read.csv() returns a data frame, which can contain both numerical and character data.
res <- vector("list", 3) names(res) <- paste0("m", 1:3) for (i in seq_along(res)) { res[[i]] <- vector("list", 2) # second-level list with 2 elements names(res[[i]]) <- c("fc", "pre") } res[["m1"]][["fc"]] <- read.csv() head(res$m1$fc) # Same as res[["m1"]][["fc"]]
Using $ in R on a List
How to Use Dollar Sign ($) Operator in R
Calling a function given a list of arguments
> args <- list(c(1:10, NA, NA), na.rm = TRUE) > do.call(mean, args) [1] 5.5 > mean(c(1:10, NA, NA), na.rm = TRUE) [1] 5.5
Descend recursively through lists
x[[c(5,3)]] is the same as x[[5]][[3]]. See ?Extract.
Avoid if-else or switch
?plot.stepfun.
y0 <- c(1,2,4,3) sfun0 <- stepfun(1:3, y0, f = 0) sfun.2 <- stepfun(1:3, y0, f = .2) sfun1 <- stepfun(1:3, y0, right = TRUE) tt <- seq(0, 3, by = 0.1) op <- par(mfrow = c(2,2)) plot(sfun0); plot(sfun0, xval = tt, add = TRUE, col.hor = "bisque") plot(sfun.2);plot(sfun.2, xval = tt, add = TRUE, col = "orange") # all colors plot(sfun1);lines(sfun1, xval = tt, col.hor = "coral") ##-- This is revealing : plot(sfun0, verticals = FALSE, main = "stepfun(x, y0, f=f) for f = 0, .2, 1") for(i in 1:3) lines(list(sfun0, sfun.2, stepfun(1:3, y0, f = 1))[[i]], col = i) legend(2.5, 1.9, paste("f =", c(0, 0.2, 1)), col = 1:3, lty = 1, y.intersp = 1) par(op)
Open a new Window device
X11() or dev.new()
par()
?par
text size (cex) and font size on main, lab & axis
- Graphical Parameters from statmethods.net.
- Overlaying Data Summaries in Dotplots
Examples (default is 1 for each of them):
- cex.main=0.9
- cex.sub
- cex.lab=0.8, font.lab=2 (x/y axis labels)
- cex.axis=0.8, font.axis=2 (axis/tick text/labels)
- col.axis="grey50"
An quick example to increase font size (cex.lab, cex.axis, cex.main) and line width (lwd) in a line plot and cex & lwd in the legend.
plot(x=x$mids, y=x$density, type="l", xlab="p-value", ylab="Density", lwd=2, cex.lab=1.5, cex.axis=1.5, cex.main=1.5, main = "") lines(y$mids, y$density, lty=2, pwd=2) lines(z$mids, z$density, lty=3, pwd=2) legend('topright',legend = c('Method A','Method B','Method C'), lty=c(2,1,3), lwd=c(2,2,2), cex = 1.5, xjust = 0.5, yjust = 0.5)
ggplot2 case (default font size is 11 points):
- plot.title
- plot.subtitle
- axis.title.x, axis.title.y: (x/y axis labels)
- axis.text.x & axis.text.y: (axis/tick text/labels)
ggplot(df, aes(x, y)) + geom_point() + labs(title = "Title", subtitle = "Subtitle", x = "X-axis", y = "Y-axis") + theme(plot.title = element_text(size = 20), plot.subtitle = element_text(size = 15), axis.title.x = element_text(size = 15), axis.title.y = element_text(size = 15), axis.text.x = element_text(size = 10), axis.text.y = element_text(size = 10))
Default font
- ?png. The default font family is Arial on Windows and Helvetica otherwise.
- sans. See Changing the font of R base graphic plots
- Fonts from Cookbook for R. It seems ggplot2 also uses sans as the default font.
- Using different fonts with ggplot2
- R plot font family
- Add custom fonts in R
layout
reset the settings
op <- par(mfrow=c(2,1), mar = c(5,7,4,2) + 0.1) .... par(op) # mfrow=c(1,1), mar = c(5,4,4,2) + .1
mtext (margin text) vs title
- https://datascienceplus.com/adding-text-to-r-plot/
- https://datascienceplus.com/mastering-r-plot-part-2-axis/
mgp (axis tick label locations or axis title)
- The margin line (in ‘mex’ units) for the axis title, axis labels and axis line. Note that ‘mgp[1]’ affects the axis ‘title’ whereas ‘mgp[2:3]’ affect tick mark labels. The default is ‘c(3, 1, 0)’. If we like to make the axis labels closer to an axis, we can use mgp=c(1.5, .5, 0) for example.
- the default is c(3,1,0) which specify the margin line for the axis title, axis labels and axis line.
- the axis title is drawn in the fourth line of the margin starting from the plot region, the axis labels are drawn in the second line and the axis line itself is the first line.
- Setting graph margins in R using the par() function and lots of cow milk
- Move Axis Label Closer to Plot in Base R (2 Examples)
- http://rfunction.com/archives/1302 mgp – A numeric vector of length 3, which sets the axis label locations relative to the edge of the inner plot window. The first value represents the location the labels/axis title (i.e. xlab and ylab in plot), the second the tick-mark labels, and third the tick marks. The default is c(3, 1, 0).
move axis title closer to axis
- Setting a title and a subtitle. Default is around 1.7(?). ?title.
- move axis label closer to axis title(, line). This is useful when we use xaxt='n' to hide the ticks and labels.
title(ylab="Within-cluster variance", line=0, cex.lab=1.2, family="Calibri Light")
pch and point shapes
See here.
- Full circle: pch=16
- Display all possibilities: ggpubr::show_point_shapes()
lty (line type)
Line types in R: Ultimate Guide For R Baseplot and ggplot
See here.
ggpubr::show_line_types()
las (label style)
0: The default, parallel to the axis
1: Always horizontal boxplot(y~x, las=1)
2: Perpendicular to the axis
3: Always vertical
oma (outer margin), xpd, common title for two plots, 3 types of regions, multi-panel plots
- The following trick is useful when we want to draw multiple plots with a common title.
par(mfrow=c(1,2),oma = c(0, 0, 2, 0)) # oma=c(0, 0, 0, 0) by default plot(1:10, main="Plot 1") plot(1:100, main="Plot 2") mtext("Title for Two Plots", outer = TRUE, cex = 1.5) # outer=FALSE by default
- PCA plot example (the plot in the middle)
- For scatterplot3d() function, oma is not useful and I need to use xpd.
- Mastering R plot – Part 3: Outer margins mtext() & par(xpd).
- ?par about xpd option
- If FALSE (default), all plotting is clipped to the plot region,
- If TRUE, all plotting is clipped to the figure region,
- If NA, all plotting is clipped to the device region.
- 3 types of regions. See Creating multi-panel plots and figures using layout() & publication-quality figures with R, part 2
- plot region,
- figure region,
- device region.
- Creating multi-panel plots and figures using layout() includes several tricks including creating a picture-in-picture plot.
no.readonly
R语言里par(no.readonly=TURE)括号里面这个参数什么意思?, R-par()
Non-standard fonts in postscript and pdf graphics
https://cran.r-project.org/doc/Rnews/Rnews_2006-2.pdf#page=41
NULL, NA, NaN, Inf
https://tomaztsql.wordpress.com/2018/07/04/r-null-values-null-na-nan-inf/
save()/load() vs saveRDS()/readRDS() vs dput()/dget() vs dump()/source()
- saveRDS() can only save one R object while save() does not have this constraint.
- saveRDS() doesn’t save the both the object and its name it just saves a representation of the object. As a result, the saved object can be loaded into a named object within R that is different from the name it had when originally serialized. See this post.
x <- 5 saveRDS(x, "myfile.rds") x2 <- readRDS("myfile.rds") identical(mod, mod2, ignore.environment = TRUE)
dput: Writes an ASCII text representation of an R object. The object name is not written (unlike dump).
$ data(pbc, package = "survival") $ names(pbc) $ dput(names(pbc)) c("id", "time", "status", "trt", "age", "sex", "ascites", "hepato", "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", "stage") > iris2 <- iris[1:2, ] > dput(iris2) structure(list(Sepal.Length = c(5.1, 4.9), Sepal.Width = c(3.5, 3), Petal.Length = c(1.4, 1.4), Petal.Width = c(0.2, 0.2), Species = structure(c(1L, 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = 1:2, class = "data.frame")
User 'verbose = TRUE' in load()
When we use load(), it is helpful to add 'verbose =TRUE' to see what objects get loaded.
What are RDS files anyways
qs package
qs: Quick Serialization of R Objects
==, all.equal(), identical()
- ==: exact match
- all.equal: compare R objects x and y testing ‘near equality’
- identical: The safe and reliable way to test two objects for being exactly equal.
x <- 1.0; y <- 0.99999999999 all.equal(x, y) # [1] TRUE identical(x, y) # [1] FALSE
Be careful about using "==" to return an index of matches in the case of data with missing values.
R> c(1,2,NA)[c(1,2,NA) == 1] [1] 1 NA R> c(1,2,NA)[which(c(1,2,NA) == 1)] [1] 1
See also the testhat package.
I found a case when I compare two objects where 1 is generated in Linux and the other is generated in macOS that identical() gives FALSE but all.equal() returns TRUE. The difference has a magnitude only e-17.
waldo
- https://waldo.r-lib.org/ or CRAN. Find and concisely describe the difference between a pair of R objects.
- How To Compare Objects In R
diffobj: Compare/Diff R Objects
https://cran.r-project.org/web/packages/diffobj/index.html
testthat
- https://github.com/r-lib/testthat
- Unit Tests in R
- Start with an End-to-End Test
- A Beautiful Mind: Writing Testable R Code
tinytest
tinytest: Lightweight but Feature Complete Unit Testing Framework
ttdo adds support of the 'diffobj' package for 'diff'-style comparison of R objects.
Numerical Pitfall
Numerical pitfalls in computing variance
.1 - .3/3 ## [1] 0.00000000000000001388
Sys.getpid()
This can be used to monitor R process memory usage or stop the R process. See this post.
Sys.getenv() & make the script more portable
Replace all the secrets from the script and replace them with Sys.getenv("secretname"). You can save the secrets in an .Renviron file next to the script in the same project.
$ for v in 1 2; do MY=$v Rscript -e "Sys.getenv('MY')"; done [1] "1" [1] "2" $ echo $MY 2
How to write R codes
- Code smells and feels from R Consortium
- write simple conditions,
- handle class properly,
- return and exit early,
- polymorphism,
- switch() [e.g., switch(var, value1=out1, value2=out2, value3=out3). Several examples in glmnet ]
- case_when(),
- %||%.
- 5 Tips for Writing Clean R Code – Leave Your Code Reviewer Commentless
- Comments
- Strings
- Loops
- Code Sharing
- Good Programming Practices
How to debug an R code
Locale bug (grep did not handle UTF-8 properly PR#16264)
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16264
Path length in dir.create() (PR#17206)
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17206 (Windows only)
install.package() error, R_LIBS_USER is empty in R 3.4.1 & .libPaths()
- https://support.rstudio.com/hc/en-us/community/posts/115008369408-Since-update-to-R-3-4-1-R-LIBS-USER-is-empty and http://r.789695.n4.nabble.com/R-LIBS-USER-on-Ubuntu-16-04-td4740935.html. Modify /etc/R/Renviron (if you have a sudo right) by uncomment out line 43.
R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/3.4'}
- https://stackoverflow.com/questions/44873972/default-r-personal-library-location-is-null. Modify $HOME/.Renviron by adding a line
R_LIBS_USER="${HOME}/R/${R_PLATFORM}-library/3.4"
- http://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html. Play with .libPaths()
On Mac & R 3.4.0 (it's fine)
> Sys.getenv("R_LIBS_USER") [1] "~/Library/R/3.4/library" > .libPaths() [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"
On Linux & R 3.3.1 (ARM)
> Sys.getenv("R_LIBS_USER") [1] "~/R/armv7l-unknown-linux-gnueabihf-library/3.3" > .libPaths() [1] "/home/$USER/R/armv7l-unknown-linux-gnueabihf-library/3.3" [2] "/usr/local/lib/R/library"
On Linux & R 3.4.1 (*Problematic*)
> Sys.getenv("R_LIBS_USER") [1] "" > .libPaths() [1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library" [3] "/usr/lib/R/library"
I need to specify the lib parameter when I use the install.packages command.
> install.packages("devtools", "~/R/x86_64-pc-linux-gnu-library/3.4") > library(devtools) Error in library(devtools) : there is no package called 'devtools' # Specify lib.loc parameter will not help with the dependency package > library(devtools, lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.4") Error: package or namespace load failed for 'devtools': .onLoad failed in loadNamespace() for 'devtools', details: call: loadNamespace(name) error: there is no package called 'withr' # A solution is to redefine .libPaths > .libPaths(c("~/R/x86_64-pc-linux-gnu-library/3.4", .libPaths())) > library(devtools) # Works
A better solution is to specify R_LIBS_USER in ~/.Renviron file or ~/.bash_profile; see ?Startup.
Using external data from within another package
https://logfc.wordpress.com/2017/03/02/using-external-data-from-within-another-package/
How to run R scripts from the command line
https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line
How to exit a sourced R script
- How to exit a sourced R script
- Problem using the source-function within R-functions The best way to handle the generic sort of problem you are describing is to take those source'd files, and rewrite their content as functions to be called from your other functions.
- ‘source()’ and ‘example()’ have a new optional argument ‘catch.aborts’ which allows continued evaluation of the R code after an error. 4-devel 2023/10/11.
Decimal point & decimal comma
Countries using Arabic numerals with decimal comma (Austria, Belgium, Brazil France, Germany, Netherlands, Norway, South Africa, Spain, Sweden, ...) https://en.wikipedia.org/wiki/Decimal_mark
setting seed locally (not globally) in R
https://stackoverflow.com/questions/14324096/setting-seed-locally-not-globally-in-r
R's internal C API
https://github.com/hadley/r-internals
cleancall package for C resource cleanup
Resource Cleanup in C and the R API
Random number generator
- https://cran.r-project.org/doc/manuals/R-exts.html#Random-numbers
- C code from R with .C(): random value is the same every time
- Random number generators produce collisions: Why, how many and more Marius Hofert 2020 and the published paper in American Statistician (including R code).
- R package examples. party package.
#include <R.h> void myunif(){ GetRNGstate(); double u = unif_rand(); PutRNGstate(); Rprintf("%f\n",u); }
$ R CMD SHLIB r_rand.c $ R R> dyn.load("r_rand.so") R> set.seed(1) R> .C("myunif") 0.265509 list() R> .C("myunif") 0.372124 list() R> set.seed(1) R> .C("myunif") 0.265509 list()
Test For Randomness
Different results in Mac and Linux
Random numbers: multivariate normal
Why MASS::mvrnorm() gives different result on Mac and Linux/Windows?
The reason could be the covariance matrix decomposition - and that may be due to the LAPACK/BLAS libraries. See
- https://stackoverflow.com/questions/11567613/different-random-number-generation-between-os
- https://stats.stackexchange.com/questions/149321/generating-and-working-with-random-vectors-in-r
- Cholesky versus eigendecomposition for drawing samples from a multivariate normal distribution See this example. A little more investigation shows the eigen values differ a little bit on macOS and Linux. See here.
rle() running length encoding
- https://en.wikipedia.org/wiki/Run-length_encoding
- How to Find Consecutive Repeats in R
- R Function of the Day: rle
- Creating nice tables using R Markdown
- https://rosettacode.org/wiki/Run-length_encoding
- R's base::rle() function
- R's Rle class from S4Vectors package which was used in for example IRanges/GRanges/GenomicRanges package
citation()
citation() citation("MASS") toBibtex(citation())
Notes on Citing R and R Packages with examples.
R not responding request to interrupt stop process
R not responding request to interrupt stop process. R is executing (for example) a C / C++ library call that doesn't provide R an opportunity to check for interrupts. It seems to match with the case I'm running (dist() function).
Monitor memory usage
- x <- rnorm(2^27) will create an object of the size 1GB (2^27*8/2^20=1024 MB).
- Windows: memory.size(max=TRUE)
- Linux
- RStudio: htop -p PID where PID is the process ID of /usr/lib/rstudio/bin/rsession, not /usr/lib/rstudio/bin/rstudio. This is obtained by running x <- rnorm(2*1e8). The object size can be obtained through print(object.size(x), units = "auto"). Note that 1e8*8/2^20 = 762.9395.
- R: htop -p PID where PID is the process ID of /usr/lib/R/bin/exec/R. Alternatively, use htop -p `pgrep -f /usr/lib/R/bin/exec/R`
- To find the peak memory usage grep VmPeak /proc/$PID/status
- mem_used() function from pryr package. It is not correct or useful if I use it to check the value compared to the memory returned by jobload in biowulf. So I cannot use it to see the memory used in running mclapply().
- peakRAM: Monitor the Total and Peak RAM Used by an Expression or Function
- Error: protect () : protection stack overflow and ?Memory
References:
- How to monitor CPU/memory usage of a single process?. htop -p $PID is recommended. It only shows the percentage of memory usage.
- Peak memory usage of a linux/unix process grep VmPeak /proc/$PID/status is recommended.
- How can I see the memory usage of a Linux process? pmap $PID | tail -n 1 is recommended. It shows the memory usage in absolute value (eg 1722376K).
- How to check the amount of RAM in R memfree <- as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)); memfree
Monitor Data
Monitoring Data in R with the lumberjack Package
Pushover
Monitoring Website SSL/TLS Certificate Expiration Times with R, {openssl}, {pushoverr}, and {DT}
Resource
Books
- R Development Guide R Contribution Working Group
- An R Community Public Library 2011-11-04
- A list of recommended books http://blog.revolutionanalytics.com/2015/11/r-recommended-reading.html
- Learning R programming by reading books: A book list
- Modern Applied Statistics with S by William N. Venables and Brian D. Ripley
- Seamless R and C++ Integration with Rcpp by Dirk Eddelbuettel
- Advanced R by Hadley Wickham 2014
- http://brettklamer.com/diversions/statistical/compile-hadleys-advanced-r-programming-to-a-pdf/ Compile Hadley's Advanced R to a PDF
- Functional programming and unit testing for data munging with R by Bruno Rodrigues
- R Cookbook by Paul Teetor
- Machine Learning with R by Brett Lantz
- R for Everyone by Jared P. Lander
- The Art of R Programming by Norman Matloff
- Applied Predictive Modeling by Max Kuhn
- R in Action by Robert Kabacoff
- The R Book by Michael J. Crawley
- Regression Modeling Strategies, with Applications to Linear Models, Survival Analysis and Logistic Regression by Frank E. Harrell
- Data Manipulation with R by Phil Spector
- DATA MANIPULATION IN R by ALBOUKADEL KASSAMBARA
- Review of Efficient R Programming
- R packages: Organize, Test, Document, and Share Your Code by Hadley Wicklam 2015
- Text Mining with R: A Tidy Approach and a blog
- Efficient R programming by Colin Gillespie and Robin Lovelace. It works to re-create the html version of the book if we follow their simple instruction in the Appendix. Note that pdf version has advantages of expected output (mathematical notations, tables) over the epub version.
# R 3.4.1 .libPaths(c("~/R/x86_64-pc-linux-gnu-library/3.4", .libPaths())) setwd("/tmp/efficientR/") bookdown::render_book("index.Rmd", output_format = "bookdown::pdf_book") # generated pdf file is located _book/_main.pdf bookdown::render_book("index.Rmd", output_format = "bookdown::epub_book") # generated epub file is located _book/_main.epub. # This cannot be done in RStudio ("parse_dt" not resolved from current namespace (lubridate)) # but it is OK to run in an R terminal
- Learning statistics with R: A tutorial for psychology students and other beginners by Danielle Navarro
- What They Forgot to Teach You About R Jennifer Bryan & Jim Hester
- Evidence-based Software Engineering by Derek M. Jones
- Big Book of R
- R for applied epidemiology and public health
- Epidemiology with R and the Epi package. ci.lin() function to return the CI from glm() fit.
- RStudio → Finding Your Way To R. Beginners/Intermediates/Experts
- Deep R Programming
Videos
- “Do More with R” video tutorials. Search for R video tutorials by task, topic, or package. Most videos are shorter than 10 minutes.
- R-Ladies Global (youtube)
Webinar
useR!
- http://blog.revolutionanalytics.com/2017/07/revisiting-user2017.html
- UseR 2018 workshop and tutorials
- UseR! 2019, tutorial, Better workflow
- UseR! 2020 & 2021
- A Guide to Binge Watching R / Medicine 2021
- UseR! 2022
R consortium
https://www.youtube.com/channel/UC_R5smHVXRYGhZYDJsnXTwg/featured
Blogs, Tips, Socials, Communities
- Google: revolutionanalytics In case you missed it
- Why R is hard to learn by Bob Musenchen.
- My 15 practical tips for a bioinformatician
- The R community is one of R's best features
- Bioinformatics Training at the Harvard Chan Bioinformatics Core
- The R Blog
https://developer.r-project.org/Blog/public/https://blog.r-project.org/ - Top Tips for Learning R from Africa R’s Shelmith Kariuki
- How Do I? …(do that in R) by Sharon Machlis
- Twitter for R programmers
Bug Tracking System
https://bugs.r-project.org/bugzilla3/ and Search existing bug reports. Remember to select 'All' in the Status drop-down list.
Use utils::sessionInfo() or even better the sessioninfo package.
License
Some Notes on GNU Licenses in R Packages
Why Dash uses the mit license (and not a copyleft gpl license)
Interview questions
- Does R store matrices in column-major order or row-major order?
- Matrices are stored in column-major order, which means that elements are arranged and accessed by columns. This is in contrast to languages like Python, where matrices (or arrays) are typically stored in row-major order.
- Explain the difference between == and === in R. Provide an example to illustrate their use.
- The == operator is used for testing equality of values in R. It returns TRUE if the values on the left and right sides are equal, otherwise FALSE. The === operator does not exist in base R.
- What is the purpose of the apply() function in R? How does it differ from the for loop?
- The apply() function in R is used to apply a function over the margins of an array or matrix. It is often used as an alternative to loops for applying a function to each row or column of a matrix.
- Describe the concept of factors in R. How are they used in data manipulation and analysis?
- Factors in R are used to represent categorical data. They are an essential data type for statistical modeling and analysis. Factors store both the unique values that occur in a dataset and the corresponding integer codes used to represent those values.
- What is the significance
of the attach() and detach() functions in R? When should they be used?
- A: The attach() function is used to add a data frame to the search path in R, making it easier to access variables within the data frame. The detach() function is used to remove a data frame from the search path, which can help avoid naming conflicts and reduce memory usage.
- Explain the concept of vectorization in R. How does it impact the performance of R code?
- Vectorization in R refers to the ability to apply operations to entire vectors or arrays at once, without needing to write explicit loops. This can significantly improve the performance of R code, as it allows operations to be performed in a more efficient, vectorized manner by taking advantage of R's underlying C code.
- Describe the difference between data.frame and matrix in R. When would you use one over the other?
- A data.frame in R is a two-dimensional structure that can store different types of data (e.g., numeric, character, factor) in its columns. It is similar to a table in a database.
- A matrix in R is also a two-dimensional structure, but it can only store elements of the same data type. It is more like a mathematical matrix.
- You would use a data.frame when you have heterogeneous data (i.e., different types of data) and need to work with it as a dataset. You would use a matrix when you have homogeneous data (i.e., the same type of data) and need to perform matrix operations.
- What are the benefits of using the dplyr package in R for data manipulation? Provide an example of how you would use dplyr to filter a data frame.
- The dplyr package provides a set of functions that make it easier to manipulate data frames in R.
- It uses a syntax that is easy to read and understand, making complex data manipulations more intuitive.
- To filter a data frame using dplyr, you can use the filter() function. For example, filter(df, column_name == value) would filter df to include only rows where column_name is equal to value.