Tidyverse
Tidyverse
Import | | readr, readxl | haven, DBI, httr +----- Visualize ------+ | | ggplot2, ggvis | | | | Tidy ------------- Transform tibble dplyr Model tidyr | broom +------ Model ---------+
Cheat sheet
The cheat sheets are downloaded from RStudio
- Data Transformation with dply
- Data Import
- Data Import with readr, tibble, and tidyr (not in RStudio anymore?)
Online
- TidyverseSkeptic by Norm Matloff
- R for Data Science and tidyverse package (it is a collection of ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr & forcats 8 packages).
- tidyverse, among others, was used at Mining CRAN DESCRIPTION Files (tbl_df(), %>%, summarise(), count(), mutate(), arrange(), unite(), ggplot(), filter(), select(), ...). Note that there is a problem to reproduce the result. I need to run cran <- cran[, -14] to remove the MD5sum column.
- Compile R for Data Science to a PDF
- Data Wrangling with dplyr and tidyr Cheat Sheet
- Data Wrangling with Tidyverse from the Harvard Chan School of Public Health.
- Best packages for data manipulation in R. It demonstrates to perform the same tasks using data.table and dplyr packages. data.table is faster and it may be a go-to package when performance and memory are the constraints.
- DATA MANIPULATION IN R by Alboukadel Kassambara
- subset data frame columns: pull() [return a vector], select() [return data frame], select_if(), other helper functions
- subset (filter) data frame rows: slice(), filter(), filter_all(), filter_if(), filter_at(), sample_n(), top_n()
- identify and remove duplicate rows: duplicated(), unique(), distinct()
- ordering rows: arrange(), desc()
- cf stats::reorder() to change a factor variable's order based on another variable. So the output is still a vector. It is useful in creating multiple boxplots. On the other hand, arrange() is to change the row order of a data frame and its input is a data frame.
- desc() can be used in arrange() [see ?desc] and reorder() [see ordered barplot ] too.
- desc(x) is just doing the negative operation -x.
- renaming and adding columns: rename()
- compute and add new variables to a data frame: mutate(), transmutate()
- computing summary statistics (pay to view)
- Data manipulation in r using data frames - an extensive article of basics
- The A to Z of tidyverse from Deeply Trivial
- Summer Institute in Statistics for Big Data (SISBID), SISBID 2020 Modules
Animation to explain
tidyexplain - Tidy Animated Verbs
Examples
A Gentle Introduction to Tidy Statistics in R
A Gentle Introduction to Tidy Statistics in R by Thomas Mock on RStudio webinar. Good coverage with step-by-step explanation. See part 1 & part 2 about the data and markdown document. All documents are available in github repository.
Task | R code | Graph |
---|---|---|
Load the libraries | library(tidyverse) library(readxl) library(broom) library(knitr) |
|
Read Excel file | raw_df <- readxl::read_xlsx("ad_treatment.xlsx") dplyr::glimpse(raw_df) |
|
Check distribution | g2 <- ggplot(raw_df, aes(x = age)) + geom_density(fill = "blue") g2 raw_df %>% summarize(min = min(age), max = max(age)) |
File:Check dist.svg |
Data cleaning | raw_df %>% summarize(na_count = sum(is.na(mmse))) |
|
Experimental variables
levels |
# check Ns and levels for our variables table(raw_df$drug_treatment, raw_df$health_status) table(raw_df$drug_treatment, raw_df$health_status, raw_df$sex) # tidy way of looking at variables raw_df %>% group_by(drug_treatment, health_status, sex) %>% count() |
|
Visual Exploratory
Data Analysis |
ggplot(data = raw_df, # add the data aes(x = drug_treatment, y = mmse, # set x, y coordinates color = drug_treatment)) + # color by treatment geom_boxplot() + facet_grid(~health_status) |
File:Onefacet.svg |
Summary Statistics | raw_df %>% glimpse() sum_df <- raw_df %>% mutate( sex = factor(sex, labels = c("Male", "Female")), drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")) ) %>% group_by(sex, health_status, drug_treatment # group by categorical variables ) %>% summarize( mmse_mean = mean(mmse), # calc mean mmse_se = sd(mmse)/sqrt(n()) # calc standard error ) %>% ungroup() # ungrouping variable is a good habit to prevent errors kable(sum_df) write.csv(sum_df, "adx37_sum_stats.csv") |
|
Plotting summary
statistics |
g <- ggplot(data = sum_df, # add the data aes(x = drug_treatment, #set x, y coordinates y = mmse_mean, group = drug_treatment, # group by treatment color = drug_treatment)) + # color by treatment geom_point(size = 3) + # set size of the dots facet_grid(sex~health_status) # create facets by sex and status g |
File:Twofacets.svg |
ANOVA | # set up the statistics df stats_df <- raw_df %>% # start with data mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), sex = factor(sex, labels = c("Male", "Female")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's"))) glimpse(stats_df) # this gives main effects AND interactions ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, data = stats_df) summary(ad_aov) # this extracts ANOVA output into a nice tidy dataframe tidy_ad_aov <- tidy(ad_aov) # which we can save to Excel write.csv(tidy_ad_aov, "ad_aov.csv") |
|
Post-hocs | # pairwise t.tests ad_pairwise <- pairwise.t.test(stats_df$mmse, stats_df$sex:stats_df$drug_treatment:stats_df$health_status, p.adj = "none") # look at the posthoc p.values in a tidy dataframe kable(head(tidy(ad_pairwise))) # call and tidy the tukey posthoc tidy_ad_tukey <- tidy( TukeyHSD(ad_aov, which = 'sex:drug_treatment:health_status')) |
|
Publication plot | sig_df <- tribble( ~drug_treatment, ~ health_status, ~sex, ~mmse_mean, "Low dose", "Alzheimer's", "Male", 17, "High Dose", "Alzheimer's", "Male", 25, "Low dose", "Alzheimer's", "Female", 18, "High Dose", "Alzheimer's", "Female", 24 ) sig_df <- sig_df %>% mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), sex = factor(sex, levels = c("Male", "Female")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's"))) sig_df # plot of cognitive function health and drug treatment g1 <- ggplot(data = sum_df, aes(x = drug_treatment, y = mmse_mean, fill = drug_treatment, group = drug_treatment)) + geom_errorbar(aes(ymin = mmse_mean - mmse_se, ymax = mmse_mean + mmse_se), width = 0.5) + geom_bar(color = "black", stat = "identity", width = 0.7) + facet_grid(sex~health_status) + theme_bw() + scale_fill_manual(values = c("white", "grey", "black")) + theme(legend.position = "NULL", legend.title = element_blank(), axis.title = element_text(size = 20), legend.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text = element_text(size = 12)) + geom_text(data = sig_df, label = "*", size = 8) + labs(x = "\nDrug Treatment", y = "Cognitive Function (MMSE)\n", caption = "\nFigure 1. Effect of novel drug treatment AD-x37 on cognitive function in healthy and demented elderly adults. \nn = 100/treatment group (total n = 600), * indicates significance at p < 0.001") g1 # save the graph! ggsave("ad_publication_graph.png", g1, height = 7, width = 8, units = "in") |
File:Ad public.svg |
Opioid prescribing habits in texas
https://juliasilge.com/blog/texas-opioids/.
- It can read multiple sheets (27 sheets) at a time and merge them by rows.
- case_when(): A general vectorised if. This function allows you to vectorise multiple if_else() statements.
x %>% mutate(group = case_when( PredScore > quantile(PredScore, .5) ~ 'High', PredScore < quantile(PredScore, .5) ~ 'Low', TRUE ~ NA_character_ ))
- fill()
- bind_rows(). Another example.
- full_join(), left_join(), right_join(), inner_join(). See the exercises from Useful dplyr functions (with examples). Suppose df1=50x3, df2=45x3 with 25 overlaps. Then left_join=50x5, right_join=45x5, inner_join=25x5, full_join=70x5.
- gather()
- replace_na()
- str_to_title()
- count()
- top_n()
- kable()
Useful dplyr functions (with examples)
https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/
Supervised machine learning case studies in R
Supervised machine learning case studies in R - A Free, Interactive Course Using Tidy Tools.
Time series data
- Automating update of a fiscal database for the Euro Area
- readxl::read_excel()
- transmute(), as.Date()
- filter(), is.na()
- na.omit(), first()
- filter(), gather(), bind_rows(), arrange()
- group_by(), summarize()
- rdb(), lubridate::year(), magrittr::%<>%, select(), spread(), mutate(), select(), gather()
- filter(), full_join(), transmute(), !is.na()
- bind_rows(), mutate()
- chain() (deprecated!)
- ungroup()
- tibble(), left_join()
- Exploring eu wide data on new car registrations and co2 efficiency (data is available)
Calculating change from baseline
group_by() + mutate() + ungroup(). We can accomplish the task by using split() + lapply() + do.call().
trial_data_chg <- trial_data %>% arrange(USUBJID, AVISITN) %>% group_by(USUBJID) %>% mutate(CHG = AVAL - AVAL[1L]) %>% ungroup() # If the baseline is missing trial_data_chg2 <- trial_data %>% group_by(USUBJID) %>% mutate( CHG = if (any(AVISIT == "Baseline")) AVAL - AVAL[AVISIT == "Baseline"] else NA ) %>% ungroup()
Show all possible group combinations
Install on Ubuntu
sudo apt install r-cran-tidyverse # Ubuntu >= 18.04. However, I get unmet dependencies errors on R 3.5.3. # r-cran-curl : Depends: r-api-3.4 sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2 # Works fine on Ubuntu 16.04, 18.04, 20.04 sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev
80 R packages will be installed after tidyverse has been installed.
For RStudio server docker version (Debian 10), I also need to install zlib1g-dev
Install on Raspberry Pi/(ARM based) Chromebook
In additional to the requirements of installing on Ubuntu, I got an error when it is installing a dependent package fs: undefined symbol: pthread_atfork. The fs package version is 1.2.6. The solution is to add one line in fs/src/Makevars file and then install the "fs" package using the source on the local machine.
5 most useful data manipulation functions
- subset() for making subsets of data (natch)
- merge() for combining data sets in a smart and easy way
- melt()-reshape2 package for converting from wide to long data formats. See an example here where we want to combine multiple columns of values into 1 column. melt() is replaced by gather().
- dcast()-reshape2 package for converting from long to wide data formats (or just use tapply()), and for making summary tables
- ddply()-plyr package for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations
Miscellaneous examples using tibble or dplyr packages
Move a column to rownames
?tibble::column_to_rownames
# It assumes the input data frame has no row names; otherwise we will get # Error: `df` must be a data frame without row names in `column_to_rownames()` # tibble::column_to_rownames(data.frame(x=letters[1:5], y = rnorm(5)), "x")
Move rownames to a variable
https://tibble.tidyverse.org/reference/rownames.html
tibble::rownames_to_column(trees, "newVar") # Still a data frame
Old way add_rownames()
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar") # tibble object
Rename variables
dplyr::rename(os, newName = oldName)
Drop a variable
select(df, -x)
Drop a level
group_by() has a .drop argument so you can also group by factor levels that don't appear in the data. See this example.
Remove rownames
tibble has_rownames(), rownames_to_column(), column_to_rownames()
has_rownames(mtcars) #> [1] TRUE # Remove row names remove_rownames(mtcars) %>% has_rownames() #> [1] FALSE
> tibble::has_rownames(trees) [1] FALSE > rownames(trees) [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" [31] "31" > rownames(trees) <- NULL > rownames(trees) [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" [31] "31"
relocate: change column order
# Move Petal.Width column to appear next to Sepal.Width iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head() # Move Petal.Width to the last column iris %>% relocate(Petal.Width, .after = last_col()) %>% head()
pull: extract a single column
x <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull() y <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull() t.test(x, y)
Anonymous functions
- https://dplyr.tidyverse.org/reference/funs.html
- Is the role of `~` tilde in dplyr limited to non-standard evaluation?
- Use of ~ (tilde) in R programming Language
- lapply and anonymous functions
- Dplyr across: First look at a new Tidyverse function
data.table
Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).
Note: data.table has its own ways (cf base R and dplyr) to subset columns.
Some resources:
- https://www.rdocumentation.org/packages/data.table/versions/1.12.0
- cookbook
- R Packages: dplyr vs data.table
- Cheat sheet from RStudio
- Reading large data tables in R. fread(FILENAME)
- Note that 'x[, 2] always return 2. If you want to do the thing you want, use x[, 2, with=FALSE] or x[, V2] where V2 is the header name. See the FAQ #1 in data.table.
- Understanding data.table Rolling Joins
- Intro to The data.table Package
- Subsetting rows and/or columns
- Alternative to using tapply(), aggregate(), table() to summarize data
- Similarities to SQL, DT[i, j, by]
- R : data.table (with 50 examples) from ListenData
- Describe Data
- Selecting or Keeping Columns
- Rename Variables
- Subsetting Rows / Filtering
- Faster Data Manipulation with Indexing
- Performance Comparison
- Sorting Data
- Adding Columns (Calculation on rows)
- How to write Sub Queries (like SQL)
- Summarize or Aggregate Columns
- GROUP BY (Within Group Calculation)
- Remove Duplicates
- Extract values within a group
- SQL's RANK OVER PARTITION
- Cumulative SUM by GROUP
- Lag and Lead
- Between and LIKE Operator
- Merging / Joins
- Convert a data.table to data.frame
- R Tutorial: data.table from dezyre.com
- Syntax: DT[where, select|update|do, by]
- Keys and setkey()
- Fast grouping using j and by: DT[,sum(v),by=x]
- Fast ordered joins: X[Y,roll=TRUE]
- In the Introduction to data.table vignette, the data.table::order() function is SLOWER than base::order() from my Odroid xu4 (running Ubuntu 14.04.4 trusty on uSD)
odt = data.table(col=sample(1e7)) (t1 <- system.time(ans1 <- odt[base::order(col)])) ## uses order from base R # user system elapsed # 2.730 0.210 2.947 (t2 <- system.time(ans2 <- odt[order(col)])) ## uses data.table's order # user system elapsed # 2.830 0.215 3.052 (identical(ans1, ans2)) # [1] TRUE
- Boost Your Data Munging with R
- rbindlist(). One problem, it uses too much memory. In fact, when I try to analyze R package downloads, the command "dat <- rbindlist(logs)" uses up my 64GB memory (OS becomes unresponsive).
- Convenience features of fread
- The ultimate R data.table cheat sheet from infoworld
OpenMP enabled compiler for Mac. This instruction works on my Mac El Capitan (10.11.6) when I need to upgrade the data.table version from 1.11.4 to 1.11.6.
Question: how to make use multicore with data.table package?
dtplyr
https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/
reshape & reshape2 (superceded by tidyr package)
- Data Shape Transformation With Reshape()
- Use acast() function in reshape2 package. It will convert data.frame used for analysis to a table-like data.frame good for display.
- http://lamages.blogspot.com/2013/10/creating-matrix-from-long-dataframe.html
tidyr
Missing values
Handling Missing Values in R using tidyr
Pivot
- From gather to pivot. pivot_longer()/pivot_wider()
- Data Pivoting with tidyr
- A Tidy Transcriptomics introduction to RNA-Seq analyses
data %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance")
- Using R: setting a colour scheme in ggplot2. Note the new (default) column names value and name after the function pivot_longer(data, cols).
set(1) dat1 <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10)) dat2 <- pivot_longer(dat1, -y) head(dat2, 2) # A tibble: 2 x 3 y name value <dbl> <chr> <dbl> 1 -1.28 x1 0.717 2 -1.28 x2 -0.320 dat3 <- pivot_wider(dat2) range(dat1 - dat3)
unnest()
Benchmark
An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.
- vignette("tidy-data") & Cheat sheet
- Main functions
- Reshape data: gather() & spread(). These two will be deprecated
- Break apart or combine columns/Split cells: separate() & unite()
- Handle missing: drop_na() & fill() & replace_na()
- Other functions
- tidyr::separate() function. If a cell contains many elements separated by ",", we can use this function to create more columns. The opposite function is unite().
- tidyr::separate_rows(). If a cell contains many elements separated by ",", we can use this function to create one more row. See the cheat sheet link above.
- http://blog.rstudio.org/2014/07/22/introducing-tidyr/
- http://rpubs.com/seandavi/GEOMetadbSurvey2014
- http://timelyportfolio.github.io/rCharts_factor_analytics/factors_with_new_R.html
- tidyr vs reshape2
- A tidyr Tutorial from U of Virginia
- Benchmarking cast in R from long data frame to wide matrix
Make wide tables long with gather() (see 6.3.1 of Efficient R Programming)
library(tidyr) library(efficient) data(pew) # wide table dim(pew) # 18 x 10, (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k') pewt <- gather(data = pew, key = Income, value = Count, -religion) dim(pew) # 162 x 3, (religion, Income, Count) args(gather) # function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
where the three arguments of gather() requires:
- data: a data frame in which column names will become row values. If the data is a matrix, use %>% as.data.frame() beforehand.
- key: the name of the categorical variable into which the column names in the original datasets are converted.
- value: the name of cell value columns
In this example, the 'religion' column will not be included (-religion).
dplyr, plyr packages
- plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
- Essential functions: 3 rows functions, 3 column functions and 1 mixed function.
select, mutate, rename, recode +------------------+ filter + + arrange + + group_by + + drop_na + + ungroup + summarise + +------------------+
- These functions works on data frames and tibble objects. Note stats package also has a filter() function for time series data. If we have not loaded the dplyr package, the filter() function below will give an error (count() also is from dplyr).
iris %>% filter(Species == "setosa") %>% count() head(iris %>% filter(Species == "setosa") %>% arrange(Sepal.Length))
- dplyr tutorial from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For select() function, some additional options to select columns based on a specific criteria include
- start_with()/ ends_with() = Select columns that start/end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select columns names that are from a group of names
- Data Transformation in the book R for Data Science. Five key functions in the dplyr package:
- Filter rows: filter(). filter is faster than subset() for very large records. But subset() can both subset rows and select columns.
- Arrange rows: arrange()
- Select columns: select(). Or use $ or [[Number]] or [[NAME]].
- Add new variables: mutate()
- Grouped summaries: group_by() & summarise()
# filter jan1 <- filter(flights, month == 1, day == 1) filter(flights, month == 11 | month == 12) filter(flights, arr_delay <= 120, dep_delay <= 120) df <- tibble(x = c(1, NA, 3)) filter(df, x > 1) filter(df, is.na(x) | x > 1) # arrange arrange(flights, year, month, day) arrange(flights, desc(arr_delay)) # select select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) # mutate flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) # if you only want to keep the new variables transmute(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # summarise() by_day <- group_by(flights, year, month, day) summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) # pipe. Note summarise() can return more than 1 variable. delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay, na.rm = TRUE))
- Efficient R Programming
- Data wrangling: Transformation from R-exercises.
- Express Intro to dplyr by rollingyours.
- the dot.
- stringr and plyr A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors.
- https://randomjohn.github.io/r-maps-with-census-data/ dplyr and stringr are used
- 5 interesting subtle insights from TED videos data analysis in R
- What is tidy eval and why should I care?
- The Seven Key Things You Need To Know About dplyr 1.0.0
select()
Select columns from a data frame
select() + everything()
If we want one particular column (say the dependent variable y) to appear first or last in the dataset. We can use the everything().
iris %>% select(Species, everything()) %>% head() iris %>% select(-Species, everything()) %>% head() # put Species to the last col
plyr::rbind.fill()
Videos
- Hands-on dplyr tutorial for faster data manipulation in R by Data School. At time 17:00, it compares the %>% operator, with() and aggregate() for finding group mean.
- https://youtu.be/aywFompr1F4 (shorter video) by Roger Peng
- https://youtu.be/8SGif63VW6E by Hadley Wickham
- Tidy eval: Programming with dplyr, tidyr, and ggplot2. Bang bang "!!" operator was introduced for use in a function call.
- JULIA SILGE
- “Do More with R” video tutorials
- Learning the R Tidyverse from lynda.com
dbplyr
https://dbplyr.tidyverse.org/articles/dbplyr.html
stringr
- stringr is part of the tidyverse but is not a core package. You need to load it separately.
- https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
- stringr Cheat sheet (2 pages, this will immediately download the pdf file)
- Detect Matches: str_detect(), str_which(), str_count(), str_locate()
- Subset: str_sub(), str_subset(), str_extract(), str_match()
- Manage Lengths: str_length(), str_pad(), str_trunc(), str_trim()
- Mutate Strings: str_sub(), str_replace(), str_replace_all(), str_remove()
- Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
- Joint and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
- Efficient data carpentry → Regular expressions from Efficient R programming book by Gillespie & Lovelace.
magrittr
- Vignettes
- How does the pipe operator actually work?
- magrittr and wrapr Pipes in R, an Examination. Instead of nested statements, it is using pipe operator %>%. So the code is easier to read. Impressive!
x %>% f # f(x) x %>% f(y) # f(x, y) x %>% f(arg=y) # f(x, arg=y) x %>% f(z, .) # f(z, x) x %>% f(y) %>% g(z) # g(f(x, y), z) x %>% select(which(colSums(!is.na(.))>0)) # remove columns with all missing data x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
suppressPackageStartupMessages(library("dplyr")) starwars %>% filter(., height > 200) %>% select(., height, mass) %>% head(.) # instead of starwars %>% filter(height > 200) %>% select(height, mass) %>% head
iris$Species iris[["Species"]] iris %>% `[[`("Species") iris %>% `[[`(5) iris %>% subset(select = "Species")
- Split-apply-combine: group + summarize + sort/arrange + top n. The following example is from Efficient R programming.
data(wb_ineq, package = "efficient") wb_ineq %>% filter(grepl("g", Country)) %>% group_by(Year) %>% summarise(gini = mean(gini, na.rm = TRUE)) %>% arrange(desc(gini)) %>% top_n(n = 5)
- Writing Pipe-friendly Functions
- http://rud.is/b/2015/02/04/a-step-to-the-right-in-r-assignments/
- http://rpubs.com/tjmahr/pipelines_2015
- http://danielmarcelino.com/i-loved-this-crosstable/
- http://moderndata.plot.ly/using-the-pipe-operator-in-r-with-plotly/
- RMSE
f <- function(x) { (y - x) %>% '^'(2) %>% sum %>% '/'(length(x)) %>% sqrt %>% round(2) }
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model diamonds <- ggplot2::diamonds diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat) pryr::object_size(diamonds) pryr::object_size(diamonds2) pryr::object_size(diamonds, diamonds2) rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str() rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2)) # instead of the righthand side. # If a function does not have a data frame based api, you can use %$%. # It explodes out the variables in a data frame. mtcars %$% cor(disp, mpg) # For assignment, magrittr provides the %<>% operator mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by mtcars %<>% transform(cyl = cyl * 2)
Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
When not to use the pipe
- your pipes are longer than (say) 10 steps
- you have multiple inputs or outputs
- Functions that use the current environment: assign(), get(), load()
- Functions that use lazy evaluation: tryCatch(), try()
Dollar sign .$
- A Short Tutorial about Magrittr’s Pipe Operator and Placeholders, Simplify Your Code with %>%
gapminder %>% dplyr::filter(continent == "Asia") %>% {stats::cor(.$lifeExp, .$gdpPercap)} gapminder %>% dplyr::filter(continent == "Asia") %$% {stats::cor(lifeExp, gdpPercap)} gapminder %>% dplyr::mutate(continent = ifelse(.$continent == "Americas", "Western Hemisphere", .$continent))
- Another example Introduction to the msigdbr package
m_list = m_df %>% split(x = .$gene_symbol, f = .$gs_name) m_list2 = m_df %$% split(x = gene_symbol, f = gs_name) all.equal(m_list, m_list2) # [1] TRUE
- Use $ dollar sign at end of of an R magrittr pipeline to return a vector
DF %>% filter(y > 0) %>% .$y
%$%
Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.
lhs %$% rhs # lhs: A list, environment, or a data.frame. # rhs: An expression where the names in lhs is available. iris %>% subset(Sepal.Length > mean(Sepal.Length)) %$% cor(Sepal.Length, Sepal.Width)
set_rownames() and set_colnames()
https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])
purrr: : Functional Programming Tools
- https://purrr.tidyverse.org/
- purrr cookbook
- Functional programming (cf Object-Oriented Programming)
- Getting started with the purrr package in R, especially the map() function.
library(purrr) # map() is a replacement of lapply() # lapply(dat, function(x) mean(x$Open)) map(dat, function(x)mean(x$Open)) # map allows us to bypass the function function. # Using a tilda (~) in place of function and a dot (.) in place of x map(dat, ~mean(.$Open)) # map allows you to specify the structure of your output. map_dbl(dat, ~mean(.$Open)) # map2() is a replacement of mapply() # mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks) map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
- map_dfr() function from "The Joy of Functional Programming (for Data Science)" with Hadley Wickham. It can be used to replace a loop.
data <- map(paths, read.csv) data <- map_dfr(paths, read.csv, id = "path") out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE) out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
- Purr yourself into a math genius
- Write & Read Multiple Excel files with purrr
- Handling errors using purrr's possibly() and safely()
- How to Automate Exploratory Analysis Plots
forcats
https://forcats.tidyverse.org/
JAMA retraction after miscoding – new Finalfit function to check recoding
outer()
Genomic sequence
- chartr
> yourSeq <- "AAAACCCGGGTTTNNN" > chartr("ACGT", "TGCA", yourSeq) [1] "TTTTGGGCCCAAANNN"
broom
broom: Convert Statistical Analysis Objects into Tidy Tibbles
Especially the tidy() function.
R> str(survfit(Surv(time, status) ~ x, data = aml)) List of 17 $ n : int [1:2] 11 12 $ time : num [1:20] 9 13 18 23 28 31 34 45 48 161 ... $ n.risk : num [1:20] 11 10 8 7 6 5 4 3 2 1 ... $ n.event : num [1:20] 1 1 1 1 0 1 1 0 1 0 ... ... R> tidy(survfit(Surv(time, status) ~ x, data = aml)) # A tibble: 20 x 9 time n.risk n.event n.censor estimate std.error conf.high conf.low strata <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 9 11 1 0 0.909 0.0953 1 0.754 x=Maintained 2 13 10 1 1 0.818 0.142 1 0.619 x=Maintained ... 18 33 3 1 0 0.194 0.627 0.664 0.0569 x=Nonmaintained 19 43 2 1 0 0.0972 0.945 0.620 0.0153 x=Nonmaintained 20 45 1 1 0 0 Inf NA NA x=Nonmaintained
lobstr package - dig into the internal representation and structure of R objects
tidymodels
- Introducing our new book, Tidy Modeling with R
- Handle class imbalance in climbing expedition data with tidymodels
Other packages
janitor
How to Clean Data: {janitor} Package