Revision as of 09:17, 8 June 2021

Tidyverse

Tidyverse Homepage
Pipelines for data analysis in R, Data Science in the Tidyverse

   Import
     |
     | readr, readxl
     | haven, DBI, httr   +----- Visualize ------+
     |                    |    ggplot2, ggvis    |
     |                    |                      |
   Tidy ------------- Transform 
   tibble               dplyr                   Model 
   tidyr                  |                    broom
                          +------ Model ---------+

Cheat sheet

The cheat sheets are downloaded from RStudio

Data Transformation with dplyr
Data Import
Data Import with readr, tibble, and tidyr (not in RStudio anymore?)

Online

TidyverseSkeptic by Norm Matloff
R for Data Science and tidyverse package (it is a collection of ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr & forcats 8 packages).
- tidyverse, among others, was used at Mining CRAN DESCRIPTION Files (tbl_df(), %>%, summarise(), count(), mutate(), arrange(), unite(), ggplot(), filter(), select(), ...). Note that the result does not reproduce as written: I need to run cran <- cran[, -14] to remove the MD5sum column.
- Compile R for Data Science to a PDF
The tidyverse style guide by Hadley Wickham
Data Wrangling with dplyr and tidyr Cheat Sheet
Data Wrangling with Tidyverse from the Harvard Chan School of Public Health.
Best packages for data manipulation in R. It demonstrates how to perform the same tasks using the data.table and dplyr packages. data.table is faster and it may be the go-to package when performance and memory are the constraints.
DATA MANIPULATION IN R by Alboukadel Kassambara
- subset data frame columns: pull() [return a vector], select() [return data frame], select_if(), other helper functions
- subset (filter) data frame rows: slice(), filter(), filter_all(), filter_if(), filter_at(), sample_n(), top_n()
- identify and remove duplicate rows: duplicated(), unique(), distinct()
- ordering rows: arrange(), desc()
  - cf stats::reorder() to change a factor variable's order based on another variable. So the output is still a vector. It is useful in creating multiple boxplots. On the other hand, arrange() is to change the row order of a data frame and its input is a data frame.
  - desc() can be used in arrange() [see ?desc] and reorder() [see ordered barplot ] too.
  - For a numeric x, desc(x) simply negates it (-x), so sorting ascending on desc(x) sorts descending on x.
- renaming and adding columns: rename()
- compute and add new variables to a data frame: mutate(), transmute()
- computing summary statistics (pay to view)
Data manipulation in r using data frames - an extensive article of basics
- Data manipulation in r using data frames - an extensive article of basics part2 - aggregation and sorting
The A to Z of tidyverse from Deeply Trivial
Summer Institute in Statistics for Big Data (SISBID), SISBID 2020 Modules
Complete tutorial

Animation to explain

tidyexplain - Tidy Animated Verbs

Examples

A Gentle Introduction to Tidy Statistics in R

A Gentle Introduction to Tidy Statistics in R by Thomas Mock on RStudio webinar. Good coverage with step-by-step explanation. See part 1 & part 2 about the data and markdown document. All documents are available in github repository.

Each task below is followed by its R code and, where available, the resulting graph.

Load the libraries

library(tidyverse)
library(readxl)
library(broom)
library(knitr)

Read Excel file

raw_df <- readxl::read_xlsx("ad_treatment.xlsx")
dplyr::glimpse(raw_df)

Check distribution (Graph: File:Check dist.svg)

g2 <- ggplot(raw_df, aes(x = age)) +
  geom_density(fill = "blue")
g2
raw_df %>% summarize(min = min(age), max = max(age))

Data cleaning

raw_df %>% summarize(na_count = sum(is.na(mmse)))

Experimental variables levels

# check Ns and levels for our variables
table(raw_df$drug_treatment, raw_df$health_status)
table(raw_df$drug_treatment, raw_df$health_status, raw_df$sex)
# tidy way of looking at variables
raw_df %>%
  group_by(drug_treatment, health_status, sex) %>%
  count()

Visual Exploratory Data Analysis (Graph: File:Onefacet.svg)

ggplot(data = raw_df,                      # add the data
       aes(x = drug_treatment, y = mmse,   # set x, y coordinates
           color = drug_treatment)) +      # color by treatment
  geom_boxplot() +
  facet_grid(~health_status)

Summary Statistics

raw_df %>% glimpse()
sum_df <- raw_df %>%
  mutate(
    sex = factor(sex, labels = c("Male", "Female")),
    drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
    health_status = factor(health_status, levels = c("Healthy", "Alzheimer's"))
  ) %>%
  group_by(sex, health_status, drug_treatment) %>%  # group by categorical variables
  summarize(
    mmse_mean = mean(mmse),           # calc mean
    mmse_se = sd(mmse)/sqrt(n())      # calc standard error
  ) %>%
  ungroup()  # ungrouping variable is a good habit to prevent errors
kable(sum_df)
write.csv(sum_df, "adx37_sum_stats.csv")

Plotting summary statistics (Graph: File:Twofacets.svg)

g <- ggplot(data = sum_df,                  # add the data
            aes(x = drug_treatment,         # set x, y coordinates
                y = mmse_mean,
                group = drug_treatment,     # group by treatment
                color = drug_treatment)) +  # color by treatment
  geom_point(size = 3) +                    # set size of the dots
  facet_grid(sex~health_status)             # create facets by sex and status
g

ANOVA

# set up the statistics df
stats_df <- raw_df %>%  # start with data
  mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, labels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))
glimpse(stats_df)
# this gives main effects AND interactions
ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, data = stats_df)
summary(ad_aov)
# this extracts ANOVA output into a nice tidy dataframe
tidy_ad_aov <- tidy(ad_aov)
# which we can save to Excel
write.csv(tidy_ad_aov, "ad_aov.csv")

Post-hocs

# pairwise t.tests
ad_pairwise <- pairwise.t.test(stats_df$mmse,
                               stats_df$sex:stats_df$drug_treatment:stats_df$health_status,
                               p.adj = "none")
# look at the posthoc p.values in a tidy dataframe
kable(head(tidy(ad_pairwise)))
# call and tidy the tukey posthoc
tidy_ad_tukey <- tidy(
  TukeyHSD(ad_aov, which = 'sex:drug_treatment:health_status'))

Publication plot (Graph: File:Ad public.svg)

sig_df <- tribble(
  ~drug_treatment, ~ health_status, ~sex, ~mmse_mean,
  "Low dose", "Alzheimer's", "Male", 17,
  "High Dose", "Alzheimer's", "Male", 25,
  "Low dose", "Alzheimer's", "Female", 18,
  "High Dose", "Alzheimer's", "Female", 24
)
sig_df <- sig_df %>%
  mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, levels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))
sig_df
# plot of cognitive function health and drug treatment
g1 <- ggplot(data = sum_df,
             aes(x = drug_treatment, y = mmse_mean,
                 fill = drug_treatment, group = drug_treatment)) +
  geom_errorbar(aes(ymin = mmse_mean - mmse_se, ymax = mmse_mean + mmse_se), width = 0.5) +
  geom_bar(color = "black", stat = "identity", width = 0.7) +
  facet_grid(sex~health_status) +
  theme_bw() +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "NULL",
        legend.title = element_blank(),
        axis.title = element_text(size = 20),
        legend.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12)) +
  geom_text(data = sig_df, label = "", size = 8) +
  labs(x = "\nDrug Treatment",
       y = "Cognitive Function (MMSE)\n",
       caption = "\nFigure 1. Effect of novel drug treatment AD-x37 on cognitive function in healthy and demented elderly adults. \nn = 100/treatment group (total n = 600), indicates significance at p < 0.001")
g1
# save the graph!
ggsave("ad_publication_graph.png", g1, height = 7, width = 8, units = "in")

Opioid prescribing habits in texas

https://juliasilge.com/blog/texas-opioids/.

It can read multiple sheets (27 sheets) at a time and merge them by rows.

case_when(): A general vectorised if. This function allows you to vectorise multiple if_else() statements.

x %>% mutate(group = case_when(
  PredScore > quantile(PredScore, .5) ~ 'High',
  PredScore < quantile(PredScore, .5) ~ 'Low',
  TRUE ~ NA_character_
))

fill()
bind_rows(). Another example.
full_join(), left_join(), right_join(), inner_join(). See the exercises from Useful dplyr functions (with examples). Suppose df1=50x3, df2=45x3 with 25 overlaps. Then left_join=50x5, right_join=45x5, inner_join=25x5, full_join=70x5.
gather()
replace_na()
str_to_title()
count()
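
The join dimensions quoted above (50x5, 45x5, 25x5, 70x5) can be checked with a small sketch. The data frames df1 and df2, the key column id and the other column names are made up for illustration; the only assumption taken from the text is 50 and 45 rows with 25 shared keys.

```r
library(dplyr)

# Toy data: df1 is 50 x 3 and df2 is 45 x 3, sharing the key column `id`.
# ids 26..50 occur in both, which gives the 25 overlapping keys.
df1 <- tibble(id = 1:50,  a = rnorm(50), b = rnorm(50))
df2 <- tibble(id = 26:70, c = rnorm(45), d = rnorm(45))

# Dimensions of each join result (rows x columns)
dims <- list(
  left  = dim(left_join(df1, df2, by = "id")),
  right = dim(right_join(df1, df2, by = "id")),
  inner = dim(inner_join(df1, df2, by = "id")),
  full  = dim(full_join(df1, df2, by = "id"))
)
dims
```

The column count is 5 because the key column appears once: 3 + 3 - 1.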

top_n(). weight parameter. top_n(n=5, wt=x) won't order rows by weight in the output actually. slice_max(order_by = x, n = 5) does it.

set.seed(1)
d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30)
) 

> d %>% group_by(grp) %>% top_n(5, wt=x)
# A tibble: 15 x 2
# Groups:   grp [3]
       x grp  
   <dbl> <fct>
 1 0.908 1    
 2 0.898 1    
 3 0.945 1    
 4 0.992 1    
 5 0.935 1    
 6 0.827 2    
 7 0.794 2    
 8 0.821 2    
 9 0.789 2    
10 0.861 2    
11 0.913 3    
12 0.875 3    
13 0.892 3    
14 0.864 3    
15 0.961 3 

> d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# A tibble: 15 x 2
# Groups:   grp [3]
       x grp  
   <dbl> <fct>
 1 0.992 1    
 2 0.945 1    
 3 0.935 1    
 4 0.908 1    
 5 0.898 1    
 6 0.861 2    
 7 0.827 2    
 8 0.821 2    
 9 0.794 2    
10 0.789 2    
11 0.961 3    
12 0.913 3    
13 0.892 3    
14 0.875 3    
15 0.864 3

kable()

Useful dplyr functions (with examples)

https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/

Supervised machine learning case studies in R

Supervised machine learning case studies in R - A Free, Interactive Course Using Tidy Tools.

Time series data

Automating update of a fiscal database for the Euro Area
- readxl::read_excel()
- transmute() (transmute() adds new variables and drops any existing ones), as.Date()
- filter(), is.na()
- na.omit(), first()
- filter(), gather(), bind_rows(), arrange()
- group_by(), summarize()
- rdb(), lubridate::year(), magrittr::%<>%, select(), spread(), mutate(), select(), gather()
- filter(), full_join(), transmute(), !is.na()
- bind_rows(), mutate()
- chain() (deprecated!)
- ungroup()
- tibble(), left_join()
Exploring eu wide data on new car registrations and co2 efficiency (data is available)

Calculating change from baseline

group_by() + mutate() + ungroup(). We can accomplish the task by using split() + lapply() + do.call().

trial_data_chg <- trial_data %>%
  arrange(USUBJID, AVISITN) %>%
  group_by(USUBJID) %>%
  mutate(CHG = AVAL - AVAL[1L]) %>%
  ungroup()

# If the baseline is missing
trial_data_chg2 <- trial_data %>%
  group_by(USUBJID) %>%
  mutate(
    CHG = if (any(AVISIT == "Baseline")) AVAL - AVAL[AVISIT == "Baseline"] else NA
  ) %>%
  ungroup()
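
The base R route mentioned above (split() + lapply() + do.call()) might look like the following sketch. The toy trial_data here is invented for illustration and only mimics the column names used in the dplyr version; the first visit of each subject is assumed to be the baseline.

```r
# Hypothetical toy data with the same column names as the dplyr example
trial_data <- data.frame(
  USUBJID = rep(c("S1", "S2"), each = 3),
  AVISITN = rep(0:2, times = 2),
  AVAL    = c(10, 12, 15, 20, 18, 25)
)

# Sort so that the first row within each subject is the baseline visit
trial_data <- trial_data[order(trial_data$USUBJID, trial_data$AVISITN), ]

# split: one data frame per subject
pieces <- split(trial_data, trial_data$USUBJID)

# apply: subtract the first (baseline) value within each subject
pieces <- lapply(pieces, function(d) {
  d$CHG <- d$AVAL - d$AVAL[1L]
  d
})

# combine: stack the pieces back into a single data frame
trial_data_chg <- do.call(rbind, pieces)
rownames(trial_data_chg) <- NULL
trial_data_chg
```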

Split data and fitting models to subsets

https://twitter.com/romain_francois/status/1226967548144635907?s=20

library(dplyr)
iris %>% 
  group_by(Species) %>%
  summarise(broom::tidy(lm(Petal.Length ~ Sepal.Length)))

Show all possible group combinations

Ten Tremendous Tricks in the Tidyverse

https://youtu.be/NDHSBUN_rVU (video).

count(),
add_count(),
summarize() w/ a list column,
fct_reorder() + geom_col() + coord_flip(),
fct_lump(),
scale_x/y_log10(),
crossing(),
separate(),
extract().

Gapminder dataset

Hands-on R and dplyr – Analyzing the Gapminder Dataset

Install on Ubuntu

sudo apt install r-cran-tidyverse

# Ubuntu >= 18.04. However, I get unmet dependencies errors on R 3.5.3.
# r-cran-curl : Depends: r-api-3.4
sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2

# Works fine on Ubuntu 16.04, 18.04, 20.04
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev

80 R packages will be installed after tidyverse has been installed.

For RStudio server docker version (Debian 10), I also need to install zlib1g-dev

Install on Raspberry Pi/(ARM based) Chromebook

In addition to the requirements for installing on Ubuntu, I got an error while installing the dependent package fs: undefined symbol: pthread_atfork. The fs package version is 1.2.6. The solution is to add one line to the fs/src/Makevars file and then install the fs package from source on the local machine.

5 most useful data manipulation functions

subset() for making subsets of data (natch)
merge() for combining data sets in a smart and easy way
melt()-reshape2 package for converting from wide to long data formats. See an example here where we want to combine multiple columns of values into 1 column. melt() is replaced by gather().
dcast()-reshape2 package for converting from long to wide data formats (or just use tapply()), and for making summary tables
ddply()-plyr package for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations
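
As a rough illustration of the base R entries in the list above, the sketch below uses subset(), merge() and the tapply() alternative mentioned for summary tables. The small left and right data frames are made up for the example; mtcars ships with R.

```r
# subset(): keep rows matching a condition and a few columns
four_cyl <- subset(mtcars, cyl == 4, select = c(mpg, wt))

# merge(): combine two data sets on a shared key (inner join by default)
left   <- data.frame(id = 1:3, x = c("a", "b", "c"))
right  <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))
merged <- merge(left, right, by = "id")

# tapply(): a simple summary table, here mean mpg per cylinder count
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
mean_mpg
```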

Miscellaneous examples using tibble or dplyr packages

Move a column to rownames

?tibble::column_to_rownames

# It assumes the input data frame has no row names; otherwise we will get
# Error: `df` must be a data frame without row names in `column_to_rownames()`
# 
tibble::column_to_rownames(data.frame(x=letters[1:5], y = rnorm(5)), "x")

Move rownames to a variable

https://tibble.tidyverse.org/reference/rownames.html

tibble::rownames_to_column(trees, "newVar")
# Still a data frame

Old way add_rownames()

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar")
# tibble object

Rename variables

dplyr::rename(os, newName = oldName)

Drop a variable

select(df, -x)

Drop a level

group_by() has a .drop argument so you can also group by factor levels that don't appear in the data. See this example.
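
A minimal sketch of the .drop argument, using an invented data frame in which the factor level "c" never occurs in the data:

```r
library(dplyr)

# The factor g declares level "c", but no row uses it
df <- tibble(
  g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = 1:3
)

# Default (.drop = TRUE): only the levels present in the data form groups
res_drop <- df %>% group_by(g) %>% summarise(n = n())

# .drop = FALSE: the empty level "c" is kept as a group of size 0
res_keep <- df %>% group_by(g, .drop = FALSE) %>% summarise(n = n())
res_keep
```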

Remove rownames

tibble has_rownames(), rownames_to_column(), column_to_rownames()

has_rownames(mtcars)
#> [1] TRUE

# Remove row names
remove_rownames(mtcars) %>% has_rownames()
#> [1] FALSE

> tibble::has_rownames(trees)
[1] FALSE
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"
> rownames(trees) <- NULL
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"

relocate: change column order

relocate()

# Move Petal.Width column to appear next to Sepal.Width
iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head() 

# Move Petal.Width to the last column
iris %>% relocate(Petal.Width, .after = last_col()) %>% head()

pull: extract a single column

pull()

x <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
y <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()
t.test(x, y)

reorder

iris %>% ggplot(aes(x=Species, y = Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")

# reorder the boxplot based on the Species' median
iris %>% ggplot(aes(x=reorder(Species, Sepal.Width, FUN = median),
                    y=Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")

Anonymous functions

https://dplyr.tidyverse.org/reference/funs.html
Is the role of `~` tilde in dplyr limited to non-standard evaluation?
Use of ~ (tilde) in R programming Language
lapply and anonymous functions
dplyr across: First look at a new Tidyverse function.
- Apply a function (or functions) across multiple columns. across(), if_any(), if_all().
- Select variables that match a pattern. starts_with(), ends_with(), contains(), matches(), num_range().
- data %>% group_by(Var1) %>% summarise(across(contains("SomeKey"), mean, na.rm = TRUE))

ny <- filter(cases, State == "NY") %>%
  select(County = `County Name`, starts_with(c("3", "4")))

daily_totals <- ny %>%
  summarize(
    across(starts_with("4"), sum)
  )

median_and_max <- list(
  med = ~median(.x, na.rm = TRUE),
  max = ~max(.x, na.rm = TRUE)
)

april_median_and_max <- ny %>%
  summarize(
    across(starts_with("4"), median_and_max)
  )

# across(.cols = everything(), .fns = NULL, ..., .names = NULL)

# Rounding the columns Sepal.Length and Sepal.Width
iris %>%
  as_tibble() %>%
  mutate(across(c(Sepal.Length, Sepal.Width), round))

iris %>% summarise(across(contains("Sepal"), ~mean(.x, na.rm = TRUE)))

# filter rows
iris %>% filter(if_any(ends_with("Width"), ~ . > 4))

iris %>% select(starts_with("Sepal"))

iris %>% select(starts_with(c("Petal", "Sepal")))

iris %>% select(contains("Sepal"))

data.table

Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).

Note: data.table has its own ways (cf base R and dplyr) to subset columns.
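
A short sketch of some common column-subsetting forms in data.table. The table DT and its column names V1 and V2 are made up for illustration.

```r
library(data.table)

DT <- data.table(V1 = 1:3, V2 = c("a", "b", "c"))

v      <- DT[, V2]                # unquoted name: returns a plain vector
dt_one <- DT[, .(V2)]             # .() is an alias for list(): returns a data.table
dt_pos <- DT[, 2, with = FALSE]   # select by position: returns a data.table
dt_chr <- DT[, c("V1", "V2")]     # character vector of column names
```

In the DT[i, j, by] form, these are all uses of the j argument with i left empty.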

Some resources:

https://www.rdocumentation.org/packages/data.table/versions/1.12.0
cookbook
R Packages: dplyr vs data.table
Comparing Common Operations in dplyr and data.table
Cheat sheet from RStudio
Reading large data tables in R. fread(FILENAME)
Note that x[, 2] always returns 2 in older versions of data.table. To select the second column, use x[, 2, with=FALSE] or x[, V2] where V2 is the header name. See the FAQ #1 in data.table.
Understanding data.table Rolling Joins
Intro to The data.table Package
- Subsetting rows and/or columns
- Alternative to using tapply(), aggregate(), table() to summarize data
- Similarities to SQL, DT[i, j, by]
R : data.table (with 50 examples) from ListenData
- Describe Data
- Selecting or Keeping Columns
- Rename Variables
- Subsetting Rows / Filtering
- Faster Data Manipulation with Indexing
- Performance Comparison
- Sorting Data
- Adding Columns (Calculation on rows)
- How to write Sub Queries (like SQL)
- Summarize or Aggregate Columns
- GROUP BY (Within Group Calculation)
- Remove Duplicates
- Extract values within a group
- SQL's RANK OVER PARTITION
- Cumulative SUM by GROUP
- Lag and Lead
- Between and LIKE Operator
- Merging / Joins
- Convert a data.table to data.frame
R Tutorial: data.table from dezyre.com
- Syntax: DT[where, select|update|do, by]
- Keys and setkey()
- Fast grouping using j and by: DT[,sum(v),by=x]
- Fast ordered joins: X[Y,roll=TRUE]

In the example from the Introduction to data.table vignette, data.table's order() is SLOWER than base::order() on my Odroid xu4 (running Ubuntu 14.04.4 trusty on uSD)

odt = data.table(col=sample(1e7))
(t1 <- system.time(ans1 <- odt[base::order(col)]))  ## uses order from base R
#   user  system elapsed 
#  2.730   0.210   2.947 
(t2 <- system.time(ans2 <- odt[order(col)]))        ## uses data.table's order
#   user  system elapsed 
#  2.830   0.215   3.052
(identical(ans1, ans2))
# [1] TRUE

Boost Your Data Munging with R
rbindlist(). One problem, it uses too much memory. In fact, when I try to analyze R package downloads, the command "dat <- rbindlist(logs)" uses up my 64GB memory (OS becomes unresponsive).
Convenience features of fread
The ultimate R data.table cheat sheet from infoworld

OpenMP enabled compiler for Mac. This instruction works on my Mac El Capitan (10.11.6) when I need to upgrade the data.table version from 1.11.4 to 1.11.6.

Question: how to make use of multiple cores with the data.table package?

dtplyr

https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/

reshape & reshape2 (superseded by tidyr package)

Data Shape Transformation With Reshape()
Use acast() function in reshape2 package. It will convert data.frame used for analysis to a table-like data.frame good for display.
http://lamages.blogspot.com/2013/10/creating-matrix-from-long-dataframe.html

tidyr

Missing values

Handling Missing Values in R using tidyr

Pivot

Conversion from gather() to pivot_longer()

gather(df, key=KeyName, value = valueName, col1, col2, ...) # No quotes around KeyName and valueName

pivot_longer(df, cols, names_to = "keyName", values_to = "valueName")
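
To see that the two calls carry the same information, here is a small sketch on an invented data frame (the names df, id, col1, col2 are placeholders). The row order differs between the two functions, so the comparison sorts both results first.

```r
library(tidyr)

df <- data.frame(id = 1:2, col1 = c(10, 20), col2 = c(30, 40))

# Older interface: key/value given as bare names
long_old <- gather(df, key = KeyName, value = valueName, col1, col2)

# Newer interface: names_to/values_to given as strings
long_new <- pivot_longer(df, cols = c(col1, col2),
                         names_to = "KeyName", values_to = "valueName")

# gather() stacks column by column, pivot_longer() goes row by row,
# so sort both by id and key before comparing
a <- long_old[order(long_old$id, long_old$KeyName), ]
b <- as.data.frame(long_new)
b <- b[order(b$id, b$KeyName), ]
all.equal(a$valueName, b$valueName)
```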

From gather to pivot. pivot_longer()/pivot_wider()
Data Pivoting with tidyr

A Tidy Transcriptomics introduction to RNA-Seq analyses

data %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance")

Using R: setting a colour scheme in ggplot2. Note the new (default) column names value and name after the function pivot_longer(data, cols).

set.seed(1)  # the printed values below are illustrative and may differ
dat1 <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10))
dat2 <- pivot_longer(dat1, -y)
head(dat2, 2)
# A tibble: 2 x 3
      y name   value
  <dbl> <chr>  <dbl>
1 -1.28 x1     0.717
2 -1.28 x2    -0.320

dat3 <- pivot_wider(dat2)
range(dat1 - dat3)

unnest()

Benchmark

An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.

vignette("tidy-data") & Cheat sheet
Main functions
- Reshape data: gather() & spread(). These two are superseded by pivot_longer() & pivot_wider()
- Break apart or combine columns/Split cells: separate() & unite()
- Handle missing: drop_na() & fill() & replace_na()
Other functions
- tidyr::separate() function. If a cell contains many elements separated by ",", we can use this function to create more columns. The opposite function is unite().
- tidyr::separate_rows(). If a cell contains many elements separated by ",", we can use this function to create more rows, one per element. See the cheat sheet link above.
http://blog.rstudio.org/2014/07/22/introducing-tidyr/
http://rpubs.com/seandavi/GEOMetadbSurvey2014
http://timelyportfolio.github.io/rCharts_factor_analytics/factors_with_new_R.html
tidyr vs reshape2
A tidyr Tutorial from U of Virginia
Benchmarking cast in R from long data frame to wide matrix
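
The difference between separate() and separate_rows() described in the list above can be sketched on a made-up data frame whose tags column holds comma-separated elements:

```r
library(tidyr)

df <- data.frame(id = 1:2, tags = c("a,b", "c,d"))

# separate(): one cell becomes several columns
wide <- separate(df, tags, into = c("tag1", "tag2"), sep = ",")

# separate_rows(): one cell becomes several rows
long <- separate_rows(df, tags, sep = ",")

wide
long
```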

Make wide tables long with gather() (see 6.3.1 of Efficient R Programming)

library(tidyr)
library(efficient)
data(pew) # wide table
dim(pew) # 18 x 10,  (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k') 
pewt <- gather(data = pew, key = Income, value = Count, -religion)
dim(pewt) # 162 x 3,  (religion, Income, Count)

args(gather)
# function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

where the three main arguments of gather() are:

data: a data frame in which column names will become row values. If the data is a matrix, use %>% as.data.frame() beforehand.
key: the name of the categorical variable into which the column names in the original datasets are converted.
value: the name of cell value columns

In this example, the 'religion' column will not be included (-religion).

dplyr, plyr packages

plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
Essential functions: 3 rows functions, 3 column functions and 1 mixed function.

           select, mutate, rename, recode
            +------------------+
filter      +                  +
arrange     +                  +
group_by    +                  +
drop_na     +                  +
ungroup     + summarise        +
            +------------------+

These functions work on data frames and tibble objects. Note that the stats package also has a filter() function for time series data. If we have not loaded the dplyr package, the filter() call below will give an error (count() is also from dplyr).

iris %>% filter(Species == "setosa") %>% count()
head(iris %>% filter(Species == "setosa") %>% arrange(Sepal.Length))

dplyr tutorial from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For select() function, some additional options to select columns based on a specific criteria include
- starts_with()/ ends_with() = Select columns that start/end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select columns names that are from a group of names
Data Transformation in the book R for Data Science. Five key functions in the dplyr package:
- Filter rows: filter(). filter is faster than subset() for very large records. But subset() can both subset rows and select columns.
- Arrange rows: arrange()
- Select columns: select(). Or use $ or [[Number]] or [[NAME]].
- Add new variables: mutate()
- Grouped summaries: group_by() & summarise()

# filter
jan1 <- filter(flights, month == 1, day == 1)
filter(flights, month == 11 | month == 12)
filter(flights, arr_delay <= 120, dep_delay <= 120)
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

# arrange
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))

# select
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

# mutate
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
# if you only want to keep the new variables
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# summarise()
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

# pipe. Note summarise() can return more than 1 variable.
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))

Efficient R Programming
Data wrangling: Transformation from R-exercises.
Express Intro to dplyr by rollingyours.
the dot.
stringr and plyr. A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors.
https://randomjohn.github.io/r-maps-with-census-data/ dplyr and stringr are used
5 interesting subtle insights from TED videos data analysis in R
What is tidy eval and why should I care?
The Seven Key Things You Need To Know About dplyr 1.0.0

select()

Select columns from a data frame

select() + everything()

If we want one particular column (say the dependent variable y) to appear first or last in the dataset, we can use everything().

iris %>% select(Species, everything()) %>% head()
iris %>% select(-Species, everything()) %>% head() # put Species to the last col

plyr::rbind.fill()

Videos

Hands-on dplyr tutorial for faster data manipulation in R by Data School. At time 17:00, it compares the %>% operator, with() and aggregate() for finding group mean.
https://youtu.be/aywFompr1F4 (shorter video) by Roger Peng
https://youtu.be/8SGif63VW6E by Hadley Wickham
Tidy eval: Programming with dplyr, tidyr, and ggplot2. Bang bang "!!" operator was introduced for use in a function call.
JULIA SILGE
- Preprocessing and resampling using #tidytuesday college data
- Bootstrap resampling with #tidytuesday beer production data
“Do More with R” video tutorials
Learning the R Tidyverse from lynda.com

dbplyr

https://dbplyr.tidyverse.org/articles/dbplyr.html

stringr

stringr is one of the core tidyverse packages (since tidyverse 1.2.0), so library(tidyverse) loads it. With older tidyverse versions you need to load it separately.
Handling Strings with R(ebook) by Gaston Sanchez.
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
stringr Cheat sheet (2 pages, this will immediately download the pdf file)
- Detect Matches: str_detect(), str_which(), str_count(), str_locate()
- Subset: str_sub(), str_subset(), str_extract(), str_match()
- Manage Lengths: str_length(), str_pad(), str_trunc(), str_trim()
- Mutate Strings: str_sub(), str_replace(), str_replace_all(), str_remove()
  - Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
- Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
Efficient data carpentry → Regular expressions from Efficient R programming book by Gillespie & Lovelace.
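
A few of the cheat sheet functions listed above, applied to a made-up character vector:

```r
library(stringr)

x <- c("apple pie", "banana split")

detected <- str_detect(x, "an")            # which elements contain "an"?
titled   <- str_to_title(x)                # capitalise each word
replaced <- str_replace_all(x, " ", "_")   # replace every space
joined   <- str_c(x, collapse = "; ")      # join into a single string
lengths  <- str_length(x)                  # number of characters

detected
joined
```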

magrittr

Vignettes
- magrittr 2.0 is coming soon, magrittr 2.0 is here!
How does the pipe operator actually work?
magrittr and wrapr Pipes in R, an Examination. Instead of nested statements, it is using pipe operator %>%. So the code is easier to read. Impressive!

x %>% f     # f(x)
x %>% f(y)  # f(x, y)
x %>% f(arg=y)  # f(x, arg=y)
x %>% f(z, .) # f(z, x)
x %>% f(y) %>% g(z)  #  g(f(x, y), z)

x %>% select(which(colSums(!is.na(.))>0))  # remove columns with all missing data
x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows

Make Arguments Explicit in magrittr/dplyr Pipelines

suppressPackageStartupMessages(library("dplyr"))
starwars %>%
  filter(., height > 200) %>%
  select(., height, mass) %>%
  head(.)
# instead of 
starwars %>%
  filter(height > 200) %>%
  select(height, mass) %>%
  head

Subset an element from a list

iris$Species
iris[["Species"]]

iris %>%
`[[`("Species")

iris %>%
`[[`(5)

iris %>%
  subset(select = "Species")

Split-apply-combine: group + summarize + sort/arrange + top n. The following example is from Efficient R programming.

data(wb_ineq, package = "efficient")
wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

f <- function(x, y) {  # root-mean-square difference between x and y
  (y - x) %>% 
    '^'(2) %>% 
    sum %>%
    '/'(length(x)) %>% 
    sqrt %>% 
    round(2)
}

# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)

rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str()
rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe
    # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2))  
    # instead of the righthand side.

# If a function does not have a data frame based api, you can use %$%.
# It explodes out the variables in a data frame.
mtcars %$% cor(disp, mpg) 

# For assignment, magrittr provides the %<>% operator
mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by
mtcars %<>% transform(cyl = cyl * 2)

Upsides of using magrittr: no need to create intermediate objects, code is easy to read.

When not to use the pipe

your pipes are longer than (say) 10 steps
you have multiple inputs or outputs
Functions that use the current environment: assign(), get(), load()
Functions that use lazy evaluation: tryCatch(), try()

Dollar sign .$

A Short Tutorial about Magrittr’s Pipe Operator and Placeholders, Simplify Your Code with %>%

gapminder %>% dplyr::filter(continent == "Asia") %>% 
  {stats::cor(.$lifeExp, .$gdpPercap)}
gapminder %>% dplyr::filter(continent == "Asia") %$% 
  {stats::cor(lifeExp, gdpPercap)}
gapminder %>%
  dplyr::mutate(continent = ifelse(.$continent == "Americas", "Western Hemisphere", .$continent))

Another example Introduction to the msigdbr package

m_list  = m_df %>% split(x = .$gene_symbol, f = .$gs_name)
m_list2 = m_df %$% split(x = gene_symbol, f = gs_name)
all.equal(m_list, m_list2)
# [1] TRUE

Use $ dollar sign at end of an R magrittr pipeline to return a vector
```
DF %>% filter(y > 0) %>% .$y
```

%$%

Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.

lhs %$% rhs
# lhs:	A list, environment, or a data.frame.
# rhs: An expression where the names in lhs are available.

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

set_rownames() and set_colnames()

https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5])

cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])

purrr: Functional Programming Tools

While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…). See Learn to purrr.
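
The ambiguity of sapply()'s return type can be seen with a zero-length input, as in this small sketch: sapply() falls back to a list, while map_dbl() always returns a double vector or fails.

```r
library(purrr)

double_it <- function(i) i * 2

# sapply() simplifies when it can ...
s_full  <- sapply(1:3, double_it)         # a numeric vector
# ... but with nothing to simplify it returns an empty list
s_empty <- sapply(integer(0), double_it)

# map_dbl() has a fixed return type regardless of the input length
m_full  <- map_dbl(1:3, double_it)
m_empty <- map_dbl(integer(0), double_it)

class(s_empty)
class(m_empty)
```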

https://purrr.tidyverse.org/
cheatsheet
purrr cookbook
Higher-order function
Python Decorator/metaprogramming
Iterating over the lines of a data.frame with purrr
Functional programming (cf Object-Oriented Programming)
- Functional programming for beginners
- 5 Functional Programming Languages You Should Know

What does the tilde mean in this context of R code, What is meaning of first tilde in purrr::map

Getting started with the purrr package in R, especially the map() and map_df() functions.

library(purrr)
# map() is a replacement of lapply()
# lapply(dat, function(x) mean(x$Open))
map(dat, function(x)mean(x$Open))  

# map() lets us skip the function keyword:
# use a tilde (~) in place of function(x) and a dot (.) in place of x
map(dat, ~mean(.$Open))

# map allows you to specify the structure of your output.
map_dbl(dat, ~mean(.$Open))

# map2() is a replacement of mapply()
# mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)
map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
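
The snippet above relies on dat and stocks from the linked tutorial. The sketch below swaps in a small made-up list of data frames (AAA and BBB are hypothetical tickers) so that the map variants can be run and compared with their base R counterparts:

```r
library(purrr)

# Hypothetical stand-in for `dat`: a named list of data frames with an Open column
dat <- list(
  AAA = data.frame(Open = c(10, 12, 14), Close = c(11, 13, 15)),
  BBB = data.frame(Open = c(20, 22),     Close = c(21, 23))
)

res_list <- map(dat, ~ mean(.x$Open))                # always a list, like lapply()
res_dbl  <- map_dbl(dat, ~ mean(.x$Open))            # always a double vector
res_base <- sapply(dat, function(x) mean(x$Open))    # return type depends on the result

# map2_chr() walks two inputs in parallel (like mapply()) and returns character
labels <- map2_chr(dat, names(dat), ~ paste(.y, nrow(.x), "rows"))
```

The typed variants (map_dbl(), map_chr(), ...) raise an error if a result does not fit the promised type, which is the main practical difference from sapply().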

map_dfr() function from "The Joy of Functional Programming (for Data Science)" with Hadley Wickham. It can be used to replace a loop.

data <- map(paths, read.csv)
data <- map_dfr(paths, read.csv, .id = "path")  # note the leading dot: .id is map_dfr()'s argument

out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE)
out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
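
To make the "replace a loop" point concrete, here is a hedged sketch with two throwaway CSV files. The file contents and the names a and b are invented for illustration. It builds the combined data frame once with a for loop and once with map_dfr():

```r
library(purrr)  # map_dfr() also needs dplyr to be installed

# Two temporary CSV files; paths and contents are made up for this example
paths <- c(a = tempfile(fileext = ".csv"), b = tempfile(fileext = ".csv"))
write.csv(data.frame(x = 1:2), paths[["a"]], row.names = FALSE)
write.csv(data.frame(x = 3:5), paths[["b"]], row.names = FALSE)

# Loop version: read each file, tag it with its name, then bind the pieces
pieces <- list()
for (nm in names(paths)) {
  pieces[[nm]] <- cbind(path = nm, read.csv(paths[[nm]]))
}
by_loop <- do.call(rbind, pieces)

# map_dfr() version: .id adds a column holding the names of `paths`
by_map <- map_dfr(paths, read.csv, .id = "path")

unlink(paths)  # clean up the temporary files
```

In purrr 1.0.0 and later map_dfr() is marked as superseded in favour of map() followed by list_rbind(), but it still works.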

Learn to purrr. Lots of good information, for example that the tilde-dot notation is a shorthand for writing an anonymous function.

function(x) {
  x + 10
}
# is the same as
~{.x + 10}

map_dbl(c(1, 4, 7), ~{.x + 10})

A closer look at replicate() and purrr::map() for simulations

twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     ngroup = 2
     group = rep( c("group1", "group2"), each = nrep)
     eps = rnorm(ngroup*nrep, 0, sigma)
     growth = b0 + b1*(group == "group2") + eps
     growthfit = lm(growth ~ group)
     growthfit
}
sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
str(sim_lm, max.level = 1)

map_dbl(sim_lm, ~summary(.x)$r.squared)
# Same as function(x) {} style
map_dbl(sim_lm, function(x) summary(x)$r.squared)
# Same as sapply()
sapply(sim_lm, function(x) summary(x)$r.squared)
map_dfr(sim_lm, broom::tidy, .id = "model")
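
The replicate() call above can also be written with purrr::map() over a dummy index. A minimal sketch, using rnorm() as a stand-in for twogroup_fun() (an assumption made to keep the example short), suggests the two forms draw the same random numbers when started from the same seed:

```r
library(purrr)

# replicate() re-evaluates the expression n times
set.seed(1)
sim_rep <- replicate(3, rnorm(5), simplify = FALSE)

# map() over 1:3 does the same; the index .x is simply ignored
set.seed(1)
sim_map <- map(1:3, ~ rnorm(5))

identical(sim_rep, sim_map)
```

The map() form becomes handy once each run needs its own input, for example a different sample size per element.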

Functional programming from Advanced R.
Functional Programming : Sara Altman, Bill Behrman, Hadley Wickham

forcats

https://forcats.tidyverse.org/

JAMA retraction after miscoding – new Finalfit function to check recoding

outer()

Genomic sequence

chartr

> yourSeq <- "AAAACCCGGGTTTNNN"
> chartr("ACGT", "TGCA", yourSeq)
[1] "TTTTGGGCCCAAANNN"
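
chartr() alone gives the complement. For the reverse complement the characters also have to be reversed. A base R sketch, where rev_comp() is a hypothetical helper name and not an existing function:

```r
# Hypothetical helper: reverse complement of a single DNA string
rev_comp <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)                  # complement base by base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")   # then reverse the characters
}

rev_comp("AAAACCCGGGTTTNNN")
# [1] "NNNAAACCCGGGTTTT"
```

Characters outside ACGT (such as N) pass through chartr() unchanged.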

broom

broom: Convert Statistical Analysis Objects into Tidy Tibbles

Especially the tidy() function.

R> str(survfit(Surv(time, status) ~ x, data = aml))
List of 17
 $ n        : int [1:2] 11 12
 $ time     : num [1:20] 9 13 18 23 28 31 34 45 48 161 ...
 $ n.risk   : num [1:20] 11 10 8 7 6 5 4 3 2 1 ...
 $ n.event  : num [1:20] 1 1 1 1 0 1 1 0 1 0 ...
...

R> tidy(survfit(Surv(time, status) ~ x, data = aml))
# A tibble: 20 x 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata         
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>          
 1     9     11       1        0   0.909     0.0953     1       0.754  x=Maintained   
 2    13     10       1        1   0.818     0.142      1       0.619  x=Maintained   
...
18    33      3       1        0   0.194     0.627      0.664   0.0569 x=Nonmaintained
19    43      2       1        0   0.0972    0.945      0.620   0.0153 x=Nonmaintained
20    45      1       1        0   0       Inf         NA      NA      x=Nonmaintained
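
The same idea applies to other model classes. A minimal sketch with a linear model on the built-in mtcars data (the model mpg ~ wt is chosen only for illustration) shows tidy() for per-term results and glance() for a one-row model summary:

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

coefs <- tidy(fit)         # one row per model term (intercept and wt)
names(coefs)               # term, estimate, std.error, statistic, p.value

fit_stats <- glance(fit)   # a single row of model-level statistics
fit_stats$r.squared
```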

lobstr package - dig into the internal representation and structure of R objects

lobstr 1.0.0

Other packages

tidytext

https://juliasilge.shinyapps.io/learntidytext/

tidytuesdayR

install.packages("tidytuesdayR")
library("tidytuesdayR")
tt_datasets(2020)  # get the exact day of the data we want to load
coffee_ratings <- tt_load("2020-07-07")
print(coffee_ratings)  #  readme(coffee_ratings)

janitor

How to Clean Data: {janitor} Package
