Tidyverse: Difference between revisions

From 太極
(30 intermediate revisions by the same user not shown)
Line 33: Line 33:
** subset data frame columns: '''pull'''() [''return a vector''], '''select'''() [''return data frame''], select_if(), other helper functions
** subset data frame columns: '''pull'''() [''return a vector''], '''select'''() [''return data frame''], select_if(), other helper functions
** subset (filter) data frame rows: slice(), filter(), filter_all(), filter_if(), filter_at(), sample_n(), top_n()
** subset (filter) data frame rows: slice(), filter(), filter_all(), filter_if(), filter_at(), sample_n(), top_n()
** identify and remove duplicate rows: duplicated(), unique(), distinct()
** identify and remove duplicate rows: duplicated(), unique(), distinct(). [https://dplyr.tidyverse.org/reference/distinct.html distinct()] keeps only the specified variables when variables are given, unless .keep_all = TRUE (cf [https://dplyr.tidyverse.org/reference/select.html select()], which keeps all rows).
** ordering rows: arrange(), desc()  
** ordering rows: arrange(), desc()  
*** cf stats::reorder() to change a factor variable's order based on another variable. So the output is still a vector. It is useful in creating multiple boxplots. On the other hand, arrange() is to change the row order of a data frame and its input is a data frame.  
*** cf stats::reorder() to change a factor variable's order based on another variable. So the output is still a vector. It is useful in creating multiple boxplots. On the other hand, arrange() is to change the row order of a data frame and its input is a data frame.  
Line 48: Line 48:


== Animation to explain ==
== Animation to explain ==
[https://github.com/gadenbuie/tidyexplain tidyexplain] - Tidy Animated Verbs
* [https://github.com/gadenbuie/tidyexplain tidyexplain] - Tidy Animated Verbs
* [https://tidydatatutor.com/ Tidy Data Tutor helps you visualize data analysis pipelines]
 
== stringr vs base R ==
[https://stringr.tidyverse.org/articles/from-base.html From base R]


= Examples =
= Examples =
Line 260: Line 264:
<ul>
<ul>
<li>
<li>
[https://dplyr.tidyverse.org/reference/top_n.html top_n()]. [https://stackoverflow.com/a/27766224 weight parameter]. '''top_n(n=5, wt=x)''' won't order rows by weight in the output actually. '''slice_max(order_by = x, n = 5)''' does it.
[https://dplyr.tidyverse.org/reference/top_n.html top_n()]. [https://stackoverflow.com/a/27766224 weight parameter]. '''top_n(n=5, wt=x)''' won't order rows by weight in the output actually. [https://dplyr.tidyverse.org/reference/slice.html? slice_max(order_by = x, n = 5)] does it.
<pre>
<pre>
set.seed(1)
set.seed(1)
Line 315: Line 319:


== Useful dplyr functions (with examples) ==
== Useful dplyr functions (with examples) ==
https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/
* https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/
* [https://datacornering.com/my-top-10-favorite-dplyr-tips-and-tricks/ My top 10 favorite dplyr tips and tricks]
** Rename columns by using the dplyr select function
** Calculate in row context with dplyr
** Rearrange columns quickly with dplyr everything
** Drop unnecessary columns with dplyr
** Use dplyr count or add_count instead of group_by and summarize
** Replace nested ifelse with dplyr case_when function
** Execute calculations across columns conditionally with dplyr
** Filter by calculation of grouped data inside filter function
** Get top and bottom values by each group with dplyr
** Reflow your dplyr code


== Supervised machine learning case studies in R ==
== Supervised machine learning case studies in R ==
Line 411: Line 426:


= Miscellaneous examples using tibble or dplyr packages =
= Miscellaneous examples using tibble or dplyr packages =
== Print all columns or rows ==
* print(x, width = Inf) # all columns
* print(x, n = Inf)    # all rows
== Move a column to rownames ==
== Move a column to rownames ==
?tibble::column_to_rownames  
?tibble::column_to_rownames  
Line 431: Line 451:
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar")
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar")
# tibble object
# tibble object
</pre>
== Remove rows or columns only containing NAs ==
[https://twitter.com/patilindrajeets/status/1462917447359598594 Surgically removing specific rows or columns that only contains `NA`s]
<pre>
library(dplyr)
df <- tibble(x = c(NA, NA, NA),
            y = c(2, 3, NA),
            z = c(NA, 5, NA) )
# removing columns where all elements are NA
df %>% select(where(~ !all(is.na(.x))))
# removing rows where all elements are NA
df %>% filter(if_any(everything(), ~ !is.na(.x)))
</pre>
</pre>


Line 550: Line 585:
</pre>
</pre>


= [https://cran.r-project.org/web/packages/data.table/index.html data.table] =
= Reading and writing data =
[https://www.danielecook.com/speeding-up-reading-and-writing-in-r/ Speeding up Reading and Writing in R]
 
== [https://cran.r-project.org/web/packages/data.table/index.html data.table] ==
Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).
Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).


Line 613: Line 651:
Question: how to make use of multiple cores with the data.table package?
Question: how to make use of multiple cores with the data.table package?


== dtplyr ==
=== dtplyr ===
https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/
https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/


Line 627: Line 665:
== Pivot ==
== Pivot ==
<ul>
<ul>
<li>[https://tidyr.tidyverse.org/articles/pivot.html pivot vignette] </br>
[https://tidyr.tidyverse.org/reference/pivot_wider.html pivot_wider()]
<pre>
R> d2 <- tibble(o=rep(LETTERS[1:2], each=3), n=rep(letters[1:3], 2), v=1:6); d2
# A tibble: 6 × 3
  o    n        v
  <chr> <chr> <int>
1 A    a        1
2 A    b        2
3 A    c        3
4 B    a        4
5 B    b        5
6 B    c        6
R> d1 <- d2%>% pivot_wider(names_from=n, values_from=v); d1
# A tibble: 2 × 4
  o        a    b    c
  <chr> <int> <int> <int>
1 A        1    2    3
2 B        4    5    6
</pre>
[https://tidyr.tidyverse.org/reference/pivot_longer.html pivot_longer()]
<pre>
R> d1 %>% pivot_longer(!o, names_to = 'n', values_to = 'v')
# Pivot all columns except 'o' column
# A tibble: 6 × 3
  o    n        v
  <chr> <chr> <int>
1 A    a        1
2 A    b        2
3 A    c        3
4 B    a        4
5 B    b        5
6 B    c        6
</pre>
<ul>
<li>In addition to the '''names_from''' and '''values_from''' columns, the data must have other columns </li>
<li>For each (combination) of unique value from other columns, the values from '''names_from''' variable must be unique</li>
</ul>
</li>
<li>Conversion from gather() to pivot_longer()
<li>Conversion from gather() to pivot_longer()
<pre>
<pre>
Line 662: Line 740:
* [https://thatdatatho.com/2020/03/28/tidyrs-pivot_longer-and-pivot_wider-examples-tidytuesday-challenge/ pivot_longer()’s Advantage Over gather()]
* [https://thatdatatho.com/2020/03/28/tidyrs-pivot_longer-and-pivot_wider-examples-tidytuesday-challenge/ pivot_longer()’s Advantage Over gather()]
* [https://datascienceplus.com/how-to-carry-column-metadata-in-pivot_longer/ How to carry column metadata in pivot_longer]
* [https://datascienceplus.com/how-to-carry-column-metadata-in-pivot_longer/ How to carry column metadata in pivot_longer]
 
* [https://datawookie.dev/blog/2021/10/working-with-really-wide-data/ Working with Really Wide Data]
== unnest() ==
* [https://tidyr.tidyverse.org/reference/unnest.html help]
* [https://stackoverflow.com/a/38021139 annotate boxplot in ggplot2]
* [https://www.tidymodels.org/learn/statistics/tidy-analysis/ Tidy analysis]


== Benchmark ==
== Benchmark ==
Line 707: Line 781:
= dplyr, plyr packages =
= dplyr, plyr packages =
* plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external '''database'''. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
* plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external '''database'''. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
* [https://twitter.com/kearneymw/status/1476538812406788101?s=20 It's amazing the things one can do in base R (without installing or loading any other #rstats packages)]
* Essential functions: 3 rows functions, 3 column functions and 1 mixed function.
* Essential functions: 3 rows functions, 3 column functions and 1 mixed function.
: <syntaxhighlight lang='rsplus'>
: <syntaxhighlight lang='rsplus'>
Line 790: Line 865:
* [http://www.r-exercises.com/2017/07/19/data-wrangling-transforming-23/ Data wrangling: Transformation] from R-exercises.
* [http://www.r-exercises.com/2017/07/19/data-wrangling-transforming-23/ Data wrangling: Transformation] from R-exercises.
* [https://rollingyours.wordpress.com/2016/06/29/express-intro-to-dplyr/ Express Intro to dplyr] by rollingyours.
* [https://rollingyours.wordpress.com/2016/06/29/express-intro-to-dplyr/ Express Intro to dplyr] by rollingyours.
* [https://martinsbioblogg.wordpress.com/2017/05/21/using-r-when-using-do-in-dplyr-dont-forget-the-dot/ the dot].
<ul>
<li>[https://martinsbioblogg.wordpress.com/2017/05/21/using-r-when-using-do-in-dplyr-dont-forget-the-dot/ the dot].
<pre>
matrix(rnorm(12),4, 3) %>% .[1:2, 1:2]
</pre>
</li>
</ul>
* [http://martinsbioblogg.wordpress.com/2013/03/24/using-r-reading-tables-that-need-a-little-cleaning/ stringr and plyr] A '''data.frame''' is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors.
* [http://martinsbioblogg.wordpress.com/2013/03/24/using-r-reading-tables-that-need-a-little-cleaning/ stringr and plyr] A '''data.frame''' is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors.
* https://randomjohn.github.io/r-maps-with-census-data/ dplyr and stringr are used
* https://randomjohn.github.io/r-maps-with-census-data/ dplyr and stringr are used
Line 799: Line 880:
== select() ==
== select() ==
[https://www.quantargo.com/courses/course-r-introduction/03-dplyr/02-select-columns-data-frame/recipe Select columns from a data frame]
[https://www.quantargo.com/courses/course-r-introduction/03-dplyr/02-select-columns-data-frame/recipe Select columns from a data frame]
<pre>
select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)
</pre>


=== select() + everything() ===
=== select() + everything() ===
Line 806: Line 894:
iris %>% select(-Species, everything()) %>% head() # put Species to the last col
iris %>% select(-Species, everything()) %>% head() # put Species to the last col
</pre>
</pre>
== group_by() ==
[https://dplyr.tidyverse.org/reference/group_by.html ?group_by]
=== group_by() + summarise(), arrange(desc()) ===
[https://r4ds.had.co.nz/transform.html Data transformation] from R for Data Science
[https://www.guru99.com/r-aggregate-function.html#3 Function in summarise()]
* group_by(var1) %>% summarise(varY = mean(var2)) %>% ggplot(aes(x = varX, y = varY, fill = varF)) + geom_bar(stat = "identity") + theme_classic()
* summarise(newvar = sum(var1) / sum(var2))
* arrange(desc(var1), var2) # desc() takes a single vector, so wrap each column separately
* Distinct number of observation: '''n_distinct()'''
*  Count the number of rows: '''n()'''
* nth observation of the group: '''nth()'''
* First observation of the group: '''first()'''
* Last observation of the group: '''last()'''
=== group_by() + nest(), mutate(, map()), unnest(), list-columns ===
[https://r4ds.had.co.nz/many-models.html Many models]  from R for Data Science
<ul>
<li>[https://tidyr.tidyverse.org/reference/nest.html ?unnest],  vignette('rectangle'),  vignette('nest') & vignette('pivot')
<syntaxhighlight lang='rsplus'>
tibble(x = 1:2, y = list(1:4, 2:3)) %>% unnest(y) %>% group_by(x) %>% nest()
# returns to tibble(x = 1:2, y = list(1:4, 2:3)) with 'groups' information
</syntaxhighlight>
</li>
<li>[https://stackoverflow.com/a/38021139 annotate boxplot in ggplot2] </li>
<li>[https://towardsdatascience.com/coding-in-r-nest-and-map-your-way-to-efficient-code-4e44ba58ee4a Coding in R: Nest and map your way to efficient code]
<pre>
      group_by() + nest()    mutate(, map())  unnest()
data  -------------------->  --------------->  ------->
</pre>
<syntaxhighlight lang='rsplus'>
install.packages('gapminder'); library(gapminder)
library(tidyverse)  # dplyr, tidyr, purrr (map) and ggplot2 are all used below
gapminder_nest <- gapminder %>%
  group_by(country) %>%
  nest()  # country, data
          # each row of 'data' is a tibble
gapminder_nest$data[[1]]  # tibble 12 x 5 with the gapminder package
gapminder_nest <- gapminder_nest %>%
          mutate(pop_mean = map(.x = data, ~mean(.x$pop, na.rm = TRUE)))
                                    # country, data, pop_mean
gapminder_nest %>% unnest(pop_mean) # country, data, pop_mean
gapminder_plot <- gapminder_nest %>%
  unnest(pop_mean) %>%
  select(country, pop_mean) %>%
  ungroup() %>%
  top_n(pop_mean, n = -10) %>%
  mutate(pop_mean = pop_mean/10^3)
gapminder_plot %>%
  ggplot(aes(x = reorder(country, pop_mean), y = pop_mean)) +
  geom_point(colour = "#FF6699", size = 5) +
  geom_segment(aes(xend = country, yend = 0), colour = "#FF6699") +
  geom_text(aes(label = round(pop_mean, 0)), hjust = -1) +
  theme_minimal() +
  labs(title = "Countries with smallest mean population from 1952 to 2007",
      subtitle = "(thousands)",
      x = "",
      y = "") +
  theme(legend.position = "none",
        axis.text.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major.y = element_blank()) +
  coord_flip() +
  scale_y_continuous()
</syntaxhighlight>
</li>
<li>[https://www.tidymodels.org/learn/statistics/tidy-analysis/ Tidy analysis] from tidymodels </li>
<li>[https://community.rstudio.com/t/is-nest-mutate-map-unnest-really-the-best-alternative-to-dplyr-do/11009 Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()] </li>
</ul>
== mutate + replace() or ifelse() ==
<ul>
<li>mutate() is similar to [https://stat.ethz.ch/R-manual/R-devel/library/base/html/with.html base::within()] </li>
<li>[https://stackoverflow.com/a/28013895 Change value of variable with dplyr]
<pre>
mtcars %>%
    mutate(mpg=replace(mpg, cyl==4, NA)) %>%
    as.data.frame()
# VS
mtcars$mpg[mtcars$cyl == 4] <- NA
</pre>
</li>
<li>[https://stackoverflow.com/a/35610521 using ifelse()] </li>
<li>[https://stackoverflow.com/a/61602568 using case_when()] </li>
<li>[https://dplyr.tidyverse.org/reference/mutate_all.html Mutate multiple columns] </li>
<li>[https://www.bioinfoblog.com/entry/tidydata/advancedmutate Apply the mutate function to multiple columns at once | mutate_at / mutate_all / mutate_if]
<pre>
mutate_at(data, .vars = vars(starts_with("Petal")), .funs = ~ . * 2) %>% head()
mutate_at(data, .vars = vars(starts_with("Petal")), `*`, 2) %>% head()
</pre>
</li>
<li>[https://dplyr.tidyverse.org/reference/recode.html recode()]
<pre>
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple", b = "Banana", .default = NA_character_)
</pre>
</li>
</ul>
== inner_join, left_join, full_join ==
* [https://dplyr.tidyverse.org/reference/mutate-joins.html Mutating joins]
* [https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti Join Data Frames with the R dplyr Package (9 Examples)]
* [https://www.datasciencemadesimple.com/join-in-r-merge-in-r/ Join in r: how to join (merge) data frames (inner, outer, left, right) in R]
* [https://www.guru99.com/r-dplyr-tutorial.html Dplyr Tutorial: Merge and Join Data in R with Examples]


== plyr::rbind.fill() ==
== plyr::rbind.fill() ==
Line 819: Line 1,018:
** [https://juliasilge.com/blog/tuition-resampling/ Preprocessing and resampling using #tidytuesday college data]
** [https://juliasilge.com/blog/tuition-resampling/ Preprocessing and resampling using #tidytuesday college data]
** [https://juliasilge.com/blog/beer-production/ Bootstrap resampling with #tidytuesday beer production data]
** [https://juliasilge.com/blog/beer-production/ Bootstrap resampling with #tidytuesday beer production data]
* [https://www.infoworld.com/article/3411819/do-more-with-r-video-tutorials.html “Do More with R” video tutorials]
* [https://www.infoworld.com/article/3411819/do-more-with-r-video-tutorials.html “Do More with R” video tutorials] by Sharon Machlis
* [https://www.lynda.com/R-tutorials/Learning-R-Tidyverse/586672-2.html Learning the R Tidyverse] from lynda.com
* [https://www.lynda.com/R-tutorials/Learning-R-Tidyverse/586672-2.html Learning the R Tidyverse] from lynda.com


== dbplyr ==
== dbplyr ==
https://dbplyr.tidyverse.org/articles/dbplyr.html
* https://dbplyr.tidyverse.org/articles/dbplyr.html
* [https://dbplyr.tidyverse.org/reference/translate_sql.html translate_sql()] Translate an R expression to sql. [https://twitter.com/rfunctionaday/status/1452127344093708295 Some examples].


= stringr =
= stringr =
Line 837: Line 1,037:
** Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
** Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
* [https://csgillespie.github.io/efficientR/data-carpentry.html#regular-expressions Efficient data carpentry &#8594; Regular expressions] from Efficient R programming book by Gillespie & Lovelace.
* [https://csgillespie.github.io/efficientR/data-carpentry.html#regular-expressions Efficient data carpentry &#8594; Regular expressions] from Efficient R programming book by Gillespie & Lovelace.
== split ==
[https://statisticsglobe.com/split-data-frame-variable-into-multiple-columns-in-r Split Data Frame Variable into Multiple Columns in R (3 Examples)]
Three ways:
* base::strsplit(x, CHAR)
* [https://stringr.tidyverse.org/reference/str_split.html stringr::str_split_fixed(x, CHAR, 2)]
* [https://tidyr.tidyverse.org/reference/separate.html tidyr::separate(df, col, c("NewVar1", "NewVar2"), CHAR)]
<pre>
x <- c("a-1", "b-2", "c-3")
stringr::str_split_fixed(x, "-", 2)
#      [,1] [,2]
# [1,] "a"  "1"
# [2,] "b"  "2"
# [3,] "c"  "3"
tidyr::separate(data.frame(x), x, c('x1', 'x2'), "-")
  # The first argument must be a data frame
  # The 2nd argument is the column to split; the 3rd gives the new column names
#  x1 x2
# 1  a  1
# 2  b  2
# 3  c  3
</pre>


= [https://github.com/smbache/magrittr magrittr] =
= [https://github.com/smbache/magrittr magrittr] =
Line 932: Line 1,157:
mtcars %<>% transform(cyl = cyl * 2)
mtcars %<>% transform(cyl = cyl * 2)
</syntaxhighlight>
</syntaxhighlight>
* [https://data-and-the-world.onrender.com/posts/magrittr-pipes The Four Pipes of magrittr] and lambda functions.


Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
Line 1,071: Line 1,297:
<li>[https://dcl-prog.stanford.edu/ Functional Programming] : Sara Altman, Bill Behrman, Hadley Wickham</li>
<li>[https://dcl-prog.stanford.edu/ Functional Programming] : Sara Altman, Bill Behrman, Hadley Wickham</li>
</ul>
</ul>
== negate() ==
[https://stackoverflow.com/a/48431135 How to select non-numeric columns using dplyr::select_if]
<syntaxhighlight lang='rsplus'>
library(tidyverse)
iris %>% select_if(negate(is.numeric))
</syntaxhighlight>


= forcats =
= forcats =
Line 1,116: Line 1,349:


= Other packages =
= Other packages =
== Great R packages for data import, wrangling, and visualization ==
[https://www.computerworld.com/article/2921176/great-r-packages-for-data-import-wrangling-visualization.html Great R packages for data import, wrangling, and visualization]


== tidytext ==
== tidytext ==

Revision as of 20:17, 9 January 2022

Tidyverse

   Import
     |
     | readr, readxl
     | haven, DBI, httr   +----- Visualize ------+
     |                    |    ggplot2, ggvis    |
     |                    |                      |
   Tidy ------------- Transform 
   tibble               dplyr                   Model 
   tidyr                  |                    broom
                          +------ Model ---------+

Cheat sheet

The cheat sheets are downloaded from RStudio

Online

Animation to explain

stringr vs base R

From base R

Examples

A Gentle Introduction to Tidy Statistics in R

A Gentle Introduction to Tidy Statistics in R by Thomas Mock on RStudio webinar. Good coverage with step-by-step explanation. See part 1 & part 2 about the data and markdown document. All documents are available in github repository.

Task R code Graph
Load the libraries
library(tidyverse)
library(readxl)
library(broom)
library(knitr)
Read Excel file
raw_df <- readxl::read_xlsx("ad_treatment.xlsx")

dplyr::glimpse(raw_df)
Check distribution
g2 <- ggplot(raw_df, aes(x = age)) +
  geom_density(fill = "blue")
g2
raw_df %>% summarize(min = min(age),
                     max = max(age))
File:Check dist.svg
Data cleaning
raw_df %>% 
  summarize(na_count = sum(is.na(mmse)))
Experimental variables levels

# check Ns and levels for our variables
table(raw_df$drug_treatment, raw_df$health_status)
table(raw_df$drug_treatment, raw_df$health_status, raw_df$sex)

# tidy way of looking at variables
raw_df %>% 
  group_by(drug_treatment, health_status, sex) %>% 
  count()
Visual Exploratory Data Analysis

ggplot(data = raw_df, # add the data
       aes(x = drug_treatment, y = mmse, # set x, y coordinates
           color = drug_treatment)) +    # color by treatment
  geom_boxplot() +
  facet_grid(~health_status)
File:Onefacet.svg
Summary Statistics
raw_df %>% 
  glimpse()
sum_df <- raw_df %>% 
            mutate(
              sex = factor(sex, 
                  labels = c("Male", "Female")),
              drug_treatment =  factor(drug_treatment, 
                  levels = c("Placebo", "Low dose", "High Dose")),
              health_status = factor(health_status, 
                  levels = c("Healthy", "Alzheimer's"))
              ) %>% 
            group_by(sex, health_status, drug_treatment # group by categorical variables
              ) %>%  
            summarize(
              mmse_mean = mean(mmse),      # calc mean
              mmse_se = sd(mmse)/sqrt(n()) # calc standard error
              ) %>%  
            ungroup() # ungrouping variable is a good habit to prevent errors

kable(sum_df)

write.csv(sum_df, "adx37_sum_stats.csv")
Plotting summary statistics

g <- ggplot(data = sum_df, # add the data
       aes(x = drug_treatment,  #set x, y coordinates
           y = mmse_mean,
           group = drug_treatment,  # group by treatment
           color = drug_treatment)) +    # color by treatment
  geom_point(size = 3) + # set size of the dots
  facet_grid(sex~health_status) # create facets by sex and status
g
File:Twofacets.svg
ANOVA
# set up the statistics df
stats_df <- raw_df %>% # start with data
   mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, labels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))

glimpse(stats_df)
# this gives main effects AND interactions
ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, 
        data = stats_df)

summary(ad_aov)


# this extracts ANOVA output into a nice tidy dataframe
tidy_ad_aov <- tidy(ad_aov)
# which we can save to Excel
write.csv(tidy_ad_aov, "ad_aov.csv")
Post-hocs
# pairwise t.tests
ad_pairwise <- pairwise.t.test(stats_df$mmse,
                               stats_df$sex:stats_df$drug_treatment:stats_df$health_status, 
                               p.adj = "none")
# look at the posthoc p.values in a tidy dataframe
kable(head(tidy(ad_pairwise)))


# call and tidy the tukey posthoc
tidy_ad_tukey <- tidy(
                      TukeyHSD(ad_aov, 
                              which = 'sex:drug_treatment:health_status'))
Publication plot
sig_df <- tribble(
  ~drug_treatment, ~ health_status, ~sex, ~mmse_mean,
  "Low dose", "Alzheimer's", "Male", 17,
  "High Dose", "Alzheimer's", "Male", 25,
  "Low dose", "Alzheimer's", "Female", 18, 
  "High Dose", "Alzheimer's", "Female", 24
  )

sig_df <- sig_df %>% 
  mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, levels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))
sig_df
# plot of cognitive function health and drug treatment
g1 <- ggplot(data = sum_df, 
       aes(x = drug_treatment, y = mmse_mean, fill = drug_treatment,  
           group = drug_treatment)) +
  geom_errorbar(aes(ymin = mmse_mean - mmse_se, 
                    ymax = mmse_mean + mmse_se), width = 0.5) +
  geom_bar(color = "black", stat = "identity", width = 0.7) +
  
  facet_grid(sex~health_status) +
  theme_bw() +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        axis.title = element_text(size = 20),
        legend.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12)) +
  geom_text(data = sig_df, label = "*", size = 8) +
  labs(x = "\nDrug Treatment", 
       y = "Cognitive Function (MMSE)\n",
       caption = "\nFigure 1. Effect of novel drug treatment AD-x37 on cognitive function in 
                    healthy and demented elderly adults. 
                  \nn = 100/treatment group (total n = 600), * indicates significance 
                    at p < 0.001")
g1

# save the graph!
ggsave("ad_publication_graph.png", g1, height = 7, width = 8, units = "in")
File:Ad public.svg

Opioid prescribing habits in Texas

https://juliasilge.com/blog/texas-opioids/.

  • It can read multiple sheets (27 sheets) at a time and merge them by rows.
  • case_when(): A general vectorised if. This function allows you to vectorise multiple if_else() statements. How to use the R case_when function.
    x %>% mutate(group = case_when(
      PredScore > quantile(PredScore, .5) ~ 'High',
      PredScore < quantile(PredScore, .5) ~ 'Low',
      TRUE ~ NA_character_
    ))
    
  • top_n(). weight parameter. top_n(n=5, wt=x) won't order rows by weight in the output actually. slice_max(order_by = x, n = 5) does it.
    set.seed(1)
    d <- data.frame(
      x   = runif(90),
      grp = gl(3, 30)
    ) 
    
    > d %>% group_by(grp) %>% top_n(5, wt=x)
    # A tibble: 15 x 2
    # Groups:   grp [3]
           x grp  
       <dbl> <fct>
     1 0.908 1    
     2 0.898 1    
     3 0.945 1    
     4 0.992 1    
     5 0.935 1    
     6 0.827 2    
     7 0.794 2    
     8 0.821 2    
     9 0.789 2    
    10 0.861 2    
    11 0.913 3    
    12 0.875 3    
    13 0.892 3    
    14 0.864 3    
    15 0.961 3 
    
    > d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
    # A tibble: 15 x 2
    # Groups:   grp [3]
           x grp  
       <dbl> <fct>
     1 0.992 1    
     2 0.945 1    
     3 0.935 1    
     4 0.908 1    
     5 0.898 1    
     6 0.861 2    
     7 0.827 2    
     8 0.821 2    
     9 0.794 2    
    10 0.789 2    
    11 0.961 3    
    12 0.913 3    
    13 0.892 3    
    14 0.875 3    
    15 0.864 3 
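
The case_when() bullet above uses an undefined data frame x. A minimal runnable sketch, assuming a small made-up data frame with a PredScore column (the values are hypothetical), could look like this:

```r
library(dplyr)

# Hypothetical scores; only the column name 'PredScore' comes from the text above
x <- data.frame(PredScore = c(1, 4, 5, 8, 9))

res <- x %>%
  mutate(group = case_when(
    PredScore > quantile(PredScore, .5) ~ "High",
    PredScore < quantile(PredScore, .5) ~ "Low",
    TRUE ~ NA_character_   # values equal to the median fall through to NA
  ))
res
```

The conditions are evaluated in order, so the final TRUE branch only catches rows that matched neither comparison (here, the row sitting exactly on the median).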
    

Useful dplyr functions (with examples)

  • https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/
  • My top 10 favorite dplyr tips and tricks
    • Rename columns by using the dplyr select function
    • Calculate in row context with dplyr
    • Rearrange columns quickly with dplyr everything
    • Drop unnecessary columns with dplyr
    • Use dplyr count or add_count instead of group_by and summarize
    • Replace nested ifelse with dplyr case_when function
    • Execute calculations across columns conditionally with dplyr
    • Filter by calculation of grouped data inside filter function
    • Get top and bottom values by each group with dplyr
    • Reflow your dplyr code
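
As an illustration of the "count or add_count instead of group_by and summarize" tip in the list above, here is a small sketch on the built-in mtcars data (the choice of mtcars and of the cyl column is an assumption made for the example):

```r
library(dplyr)

# Long form: group, then summarise
by_hand <- mtcars %>% group_by(cyl) %>% summarise(n = n())

# Short form: count() does both steps
short <- mtcars %>% count(cyl)

# add_count() keeps every original row and appends the group size as column n
with_n <- mtcars %>% add_count(cyl)
```

count() collapses the data to one row per group, while add_count() is the choice when the group size is needed alongside the original rows, for example to filter out small groups.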

Supervised machine learning case studies in R

Supervised machine learning case studies in R - A Free, Interactive Course Using Tidy Tools.

Time series data

Calculating change from baseline

Use group_by() + mutate() + ungroup(). The same task can also be accomplished in base R with split() + lapply() + do.call().

trial_data_chg <- trial_data %>%
  arrange(USUBJID, AVISITN) %>%
  group_by(USUBJID) %>%
  mutate(CHG = AVAL - AVAL[1L]) %>%
  ungroup()

# If the baseline is missing
trial_data_chg2 <- trial_data %>%
  group_by(USUBJID) %>%
  mutate(
    CHG = if (any(AVISIT == "Baseline")) AVAL - AVAL[AVISIT == "Baseline"] else NA
  ) %>%
  ungroup()
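
The base R route mentioned above, split() + lapply() + do.call(), can be sketched as follows. The trial_data values are made up for illustration; only the column names follow the dplyr example.

```r
# Hypothetical trial data; column names follow the dplyr example above
trial_data <- data.frame(
  USUBJID = c("S1", "S1", "S1", "S2", "S2"),
  AVISITN = c(0, 1, 2, 0, 1),
  AVAL    = c(10, 12, 9, 20, 25)
)

# Sort so that the first row within each subject is the baseline visit
trial_data <- trial_data[order(trial_data$USUBJID, trial_data$AVISITN), ]

# One data frame per subject
pieces <- split(trial_data, trial_data$USUBJID)

# Change from the first (baseline) value within each subject
pieces <- lapply(pieces, function(d) {
  d$CHG <- d$AVAL - d$AVAL[1L]
  d
})

# Stack the pieces back into a single data frame
trial_data_chg <- do.call(rbind, pieces)
rownames(trial_data_chg) <- NULL
trial_data_chg
```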

Split data and fitting models to subsets

https://twitter.com/romain_francois/status/1226967548144635907?s=20

library(dplyr)
iris %>% 
  group_by(Species) %>%
  summarise(broom::tidy(lm(Petal.Length ~ Sepal.Length)))
  # dplyr >= 1.1.0 prefers reframe() when a group returns more than one row
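
For comparison, the same split-and-fit idea can be sketched in base R without dplyr or broom. This is an alternative formulation, not the approach from the linked tweet:

```r
# One lm() per species, fitted on the corresponding subset of iris
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Length ~ Sepal.Length, data = d)
})

# Collect the coefficients: rows are model terms, columns are species
coefs <- sapply(fits, coef)
coefs
```

Keeping the fitted models in a list (fits) makes it easy to extract other quantities later, such as residuals or summary statistics.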

Show all possible group combinations

Ten Tremendous Tricks in the Tidyverse

https://youtu.be/NDHSBUN_rVU (video).

  • count(),
  • add_count(),
  • summarize() w/ a list column,
  • fct_reorder() + geom_col() + coord_flip(),
  • fct_lump(),
  • scale_x/y_log10(),
  • crossing(),
  • separate(),
  • extract().

Gapminder dataset

Hands-on R and dplyr – Analyzing the Gapminder Dataset

Install on Ubuntu

sudo apt install r-cran-tidyverse

# Ubuntu >= 18.04. However, I get unmet dependencies errors on R 3.5.3.
# r-cran-curl : Depends: r-api-3.4
sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2

# Works fine on Ubuntu 16.04, 18.04, 20.04
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev

80 R packages will be installed after tidyverse has been installed.

For RStudio server docker version (Debian 10), I also need to install zlib1g-dev

Install on Raspberry Pi/(ARM based) Chromebook

In addition to the requirements for installing on Ubuntu, I got an error while installing the dependent package fs: undefined symbol: pthread_atfork. The fs package version is 1.2.6. The solution is to add one line to the fs/src/Makevars file and then install the "fs" package from source on the local machine.

5 most useful data manipulation functions

  • subset() for making subsets of data (natch)
  • merge() for combining data sets in a smart and easy way
  • melt() (reshape2 package) for converting from wide to long data formats. See an example here where we want to combine multiple columns of values into 1 column. melt() has been replaced by tidyr's gather(), which is in turn superseded by pivot_longer().
  • dcast() (reshape2 package) for converting from long to wide data formats (or just use tapply()), and for making summary tables
  • ddply() (plyr package) for doing split-apply-combine operations, which covers a huge swath of the trickiest data operations
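
The first two functions in the list above are part of base R. A minimal sketch with two made-up data frames (patients and visits are hypothetical names and values):

```r
# Hypothetical data for illustration
patients <- data.frame(id = 1:4, age = c(34, 51, 47, 29))
visits   <- data.frame(id = c(1, 2, 2, 4), score = c(7, 5, 6, 9))

# subset(): keep rows with age > 30 and only the listed columns
older <- subset(patients, age > 30, select = c(id, age))

# merge(): inner join on the shared key; ids without a match are dropped
joined <- merge(older, visits, by = "id")
joined
```

By default merge() performs an inner join; the all, all.x and all.y arguments switch to full, left and right joins respectively.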

Miscellaneous examples using tibble or dplyr packages

Print all columns or rows

  • print(x, width = Inf) # all columns
  • print(x, n = Inf) # all rows

Move a column to rownames

?tibble::column_to_rownames

# It assumes the input data frame has no row names; otherwise we will get
# Error: `df` must be a data frame without row names in `column_to_rownames()`
# 
tibble::column_to_rownames(data.frame(x=letters[1:5], y = rnorm(5)), "x")

Move rownames to a variable

https://tibble.tidyverse.org/reference/rownames.html

tibble::rownames_to_column(trees, "newVar")
# Still a data frame

Old way add_rownames()

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar")
# tibble object

Remove rows or columns only containing NAs

Surgically removing specific rows or columns that contain only `NA`s

library(dplyr)
df <- tibble(x = c(NA, NA, NA),
             y = c(2, 3, NA),
             z = c(NA, 5, NA) )

# removing columns where all elements are NA
df %>% select(where(~ !all(is.na(.x))))

# removing rows where all elements are NA
df %>% filter(if_any(everything(), ~ !is.na(.x)))

Rename variables

dplyr::rename(os, newName = oldName)

Drop a variable

select(df, -x) 

Drop a level

group_by() has a .drop argument so you can also group by factor levels that don't appear in the data. See this example.
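
A minimal sketch of the .drop behaviour, using a made-up data frame df whose factor g has an unused level "c" (the names are illustrative and not taken from the linked example). With the default .drop = TRUE the empty level disappears from the summary; with .drop = FALSE it is kept with a count of zero.

```r
library(dplyr)

# Hypothetical data: level "c" is declared but never observed
df <- data.frame(
  g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = 1:3
)

# Default (.drop = TRUE): only levels present in the data form groups
res_drop <- df %>% group_by(g) %>% summarise(n = n())

# .drop = FALSE: the empty level "c" is kept as a group of size 0
res_keep <- df %>% group_by(g, .drop = FALSE) %>% summarise(n = n())

nrow(res_drop)  # 2
nrow(res_keep)  # 3
```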

Remove rownames

tibble has_rownames(), rownames_to_column(), column_to_rownames()

has_rownames(mtcars)
#> [1] TRUE

# Remove row names
remove_rownames(mtcars) %>% has_rownames()
#> [1] FALSE
> tibble::has_rownames(trees)
[1] FALSE
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"
> rownames(trees) <- NULL
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"
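
The output above is unchanged because a data frame always reports row names: when none are set, R stores compact "automatic" row names and prints them as 1..n. One way to tell the two cases apart in base R is .row_names_info(), which returns a negative count for automatic row names. A short sketch (m is just a scratch copy of mtcars):

```r
# Negative value: automatic row names (consistent with has_rownames() being FALSE)
.row_names_info(trees)    # -31

# Positive value: real, character row names
.row_names_info(mtcars)   # 32

# Dropping the row names of a copy of mtcars switches it to automatic ones
m <- mtcars
rownames(m) <- NULL
.row_names_info(m)        # -32
head(rownames(m), 3)      # "1" "2" "3"
```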

relocate: change column order

relocate()

# Move Petal.Width column to appear next to Sepal.Width
iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head() 

# Move Petal.Width to the last column
iris %>% relocate(Petal.Width, .after = last_col()) %>% head()

pull: extract a single column

pull()

x <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
y <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()
t.test(x, y)

reorder

iris %>% ggplot(aes(x=Species, y = Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")

# reorder the boxplot based on the Species' median
iris %>% ggplot(aes(x=reorder(Species, Sepal.Width, FUN = median),
                    y=Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")
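
To see what reorder() does on its own, independent of ggplot2, one can inspect the factor levels it returns. This base R sketch shows that the result is still a factor, with its levels sorted by the median Sepal.Width of each species.

```r
# Original level order is alphabetical
levels(iris$Species)
# "setosa" "versicolor" "virginica"

# reorder() returns a factor whose levels are sorted by the summary statistic
sp <- reorder(iris$Species, iris$Sepal.Width, FUN = median)
levels(sp)
# "versicolor" "virginica" "setosa"  (medians 2.8, 3.0, 3.4)

# Descending order: negate the variable
sp_desc <- reorder(iris$Species, -iris$Sepal.Width, FUN = median)
levels(sp_desc)
# "setosa" "virginica" "versicolor"
```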

Anonymous functions

ny <- filter(cases, State == "NY") %>%
  select(County = `County Name`, starts_with(c("3", "4")))

daily_totals <- ny %>%
  summarize(
    across(starts_with("4"), sum)
  )

median_and_max <- list(
  med = ~median(.x, na.rm = TRUE),
  max = ~max(.x, na.rm = TRUE)
)

april_median_and_max <- ny %>%
  summarize(
    across(starts_with("4"), median_and_max)
  )
# across(.cols = everything(), .fns = NULL, ..., .names = NULL)

# Rounding the columns Sepal.Length and Sepal.Width
iris %>%
  as_tibble() %>%
  mutate(across(c(Sepal.Length, Sepal.Width), round))

iris %>% summarise(across(contains("Sepal"), ~mean(.x, na.rm = TRUE)))

# filter rows
iris %>% filter(if_any(ends_with("Width"), ~ . > 4))

iris %>% select(starts_with("Sepal"))

iris %>% select(starts_with(c("Petal", "Sepal")))

iris %>% select(contains("Sepal"))
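
The cases data used above is not bundled with R, so here is a self-contained sketch of the same named-list pattern on iris. The .names glue template is an optional argument of across() that controls the output column names; the list name med_and_max is made up for this example.

```r
library(dplyr)

# Named list of lambdas; the names become the {.fn} part of the column names
med_and_max <- list(
  med = ~ median(.x, na.rm = TRUE),
  max = ~ max(.x, na.rm = TRUE)
)

sepal_stats <- iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), med_and_max, .names = "{.col}_{.fn}"))

names(sepal_stats)
# "Species" "Sepal.Length_med" "Sepal.Length_max" "Sepal.Width_med" "Sepal.Width_max"
```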

Reading and writing data

Speeding up Reading and Writing in R

data.table

Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).

Note: data.table has its own ways (cf base R and dplyr) to subset columns.
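
As a rough illustration of those column-subsetting idioms, the sketch below selects the same two columns of mtcars in four ways. The object names (DT, cols, and the result variables) are arbitrary.

```r
library(data.table)

DT <- as.data.table(mtcars)

# 1. List of unquoted names inside .()
by_list <- DT[, .(mpg, cyl)]

# 2. Character vector of names
by_char <- DT[, c("mpg", "cyl")]

# 3. Names stored in a variable: the .. prefix looks 'cols' up outside DT
cols <- c("mpg", "cyl")
by_var <- DT[, ..cols]

# 4. .SDcols restricts .SD (the "subset of data") to the chosen columns
by_sd <- DT[, .SD, .SDcols = cols]

# A single unquoted name returns a plain vector, not a data.table
v <- DT[, mpg]
```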

Some resources:

OpenMP enabled compiler for Mac. This instruction works on my Mac El Capitan (10.11.6) when I need to upgrade the data.table version from 1.11.4 to 1.11.6.

Question: how to make use of multiple cores with the data.table package?

dtplyr

https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/

reshape & reshape2 (superseded by tidyr package)

tidyr

Missing values

Handling Missing Values in R using tidyr
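
The linked article is not reproduced here; as a small sketch of the tidyr verbs usually involved, the following uses a made-up tibble df with missing values in two columns.

```r
library(tidyr)

# Hypothetical data with gaps in 'group' and 'score'
df <- tibble::tibble(
  id    = 1:4,
  group = c("a", NA, NA, "b"),
  score = c(10, NA, 30, NA)
)

complete_rows <- drop_na(df)                      # keep only rows with no NA at all
has_score     <- drop_na(df, score)               # drop rows where 'score' is NA
zero_filled   <- replace_na(df, list(score = 0))  # replace NA in 'score' with 0
carried       <- fill(df, group)                  # carry the last non-missing 'group' downwards
```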

Pivot

  • pivot vignette
    pivot_wider()
    R> d2 <- tibble(o=rep(LETTERS[1:2], each=3), n=rep(letters[1:3], 2), v=1:6); d2
    # A tibble: 6 × 3
      o     n         v
      <chr> <chr> <int>
    1 A     a         1
    2 A     b         2
    3 A     c         3
    4 B     a         4
    5 B     b         5
    6 B     c         6
    R> d1 <- d2%>% pivot_wider(names_from=n, values_from=v); d1
    # A tibble: 2 × 4
      o         a     b     c
      <chr> <int> <int> <int>
    1 A         1     2     3
    2 B         4     5     6
    

    pivot_longer()

    R> d1 %>% pivot_longer(!o, names_to = 'n', values_to = 'v')
    # Pivot all columns except 'o' column
    # A tibble: 6 × 3
      o     n         v
      <chr> <chr> <int>
    1 A     a         1
    2 A     b         2
    3 A     c         3
    4 B     a         4
    5 B     b         5
    6 B     c         6
    
    • Besides the names_from and values_from columns, the data should have other columns that identify each output row
    • For each combination of values in those other columns, the values of the names_from variable must be unique
  • Conversion from gather() to pivot_longer()
    gather(df, key = KeyName, value = valueName, col1, col2, ...) # No quotes around KeyName and valueName
    
    pivot_longer(df, cols, names_to = "keyName", values_to = "valueName")
    
  • A Tidy Transcriptomics introduction to RNA-Seq analyses
    data %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance")
    
  • Using R: setting a colour scheme in ggplot2. Note the new (default) column names value and name after the function pivot_longer(data, cols).
    set.seed(1)  # any seed works; the printed values below are from another run
    dat1 <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10))
    dat2 <- pivot_longer(dat1, -y)
    head(dat2, 2)
    # A tibble: 2 x 3
          y name   value
      <dbl> <chr>  <dbl>
    1 -1.28 x1     0.717
    2 -1.28 x2    -0.320
    
    dat3 <- pivot_wider(dat2)
    range(dat1 - dat3)
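
The uniqueness requirement for pivot_wider() listed above can be seen with a small made-up tibble d in which the pair (o = "A", n = "a") occurs twice. By default tidyr warns and returns list-columns; supplying values_fn is one way to collapse the duplicates.

```r
library(tidyr)

# Hypothetical data: (o = "A", n = "a") appears twice
d <- tibble::tibble(o = c("A", "A", "B"),
                    n = c("a", "a", "a"),
                    v = 1:3)

# Duplicates: the result holds a list-column and a warning is issued
w_list <- suppressWarnings(pivot_wider(d, names_from = n, values_from = v))
class(w_list$a)   # "list"

# Aggregate the duplicates with values_fn
w_sum <- pivot_wider(d, names_from = n, values_from = v, values_fn = sum)
w_sum$a           # 3 3
```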
    

Benchmark

An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.

Make wide tables long with gather() (see 6.3.1 of Efficient R Programming)

library(tidyr)
library(efficient)
data(pew) # wide table
dim(pew) # 18 x 10,  (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k') 
pewt <- gather(data = pew, key = Income, value = Count, -religion)
dim(pewt) # 162 x 3,  (religion, Income, Count)

args(gather)
# function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

The three main arguments of gather() are:

  • data: a data frame in which column names will become row values. If the data is a matrix, use %>% as.data.frame() beforehand.
  • key: the name of the new categorical variable that will hold the original column names.
  • value: the name of the new column that will hold the cell values.

In this example, the 'religion' column will not be included (-religion).
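
Since the pew data requires the efficient package, here is a sketch on a tiny made-up table (wide) showing the same gather() call next to its pivot_longer() counterpart. The two results contain the same rows but in a different order, so they are sorted before comparing.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table in the spirit of 'pew'
wide <- tibble::tibble(religion = c("A", "B"),
                       low  = c(1L, 2L),
                       high = c(3L, 4L))

long_old <- gather(wide, key = Income, value = Count, -religion)
long_new <- pivot_longer(wide, cols = -religion,
                         names_to = "Income", values_to = "Count")

# gather() stacks column by column, pivot_longer() goes row by row
old_sorted <- arrange(long_old, religion, Income)
new_sorted <- arrange(long_new, religion, Income)

identical(old_sorted$Count, new_sorted$Count)    # TRUE
identical(old_sorted$Income, new_sorted$Income)  # TRUE
```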

dplyr, plyr packages

  • plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
  • It's amazing the things one can do in base R (without installing or loading any other #rstats packages)
  • Essential functions: 3 row functions, 3 column functions and 1 mixed function.
           select, mutate, rename, recode
            +------------------+
filter      +                  +
arrange     +                  +
group_by    +                  +
drop_na     +                  +
ungroup     + summarise        +
            +------------------+
  • These functions work on data frames and tibble objects. Note that the stats package also has a filter() function for time series data. If the dplyr package has not been loaded, the filter() call below will give an error (count() is also from dplyr).
iris %>% filter(Species == "setosa") %>% count()
head(iris %>% filter(Species == "setosa") %>% arrange(Sepal.Length))
  • dplyr tutorial from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For select() function, some additional options to select columns based on a specific criteria include
    • starts_with() / ends_with() = Select columns that start/end with a character string
    • contains() = Select columns that contain a character string
    • matches() = Select columns that match a regular expression
    • one_of() = Select columns whose names are in a given set of names (superseded by any_of() and all_of() in current tidyselect)
  • Data Transformation in the book R for Data Science. Five key functions in the dplyr package:
# filter
jan1 <- filter(flights, month == 1, day == 1)
filter(flights, month == 11 | month == 12)
filter(flights, arr_delay <= 120, dep_delay <= 120)
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

# arrange
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))

# select
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

# mutate
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
# if you only want to keep the new variables
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# summarise()
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

# pipe. Note summarise() can return more than 1 variable.
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
  • the dot.
    matrix(rnorm(12),4, 3) %>% .[1:2, 1:2]
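
On the filter() name clash mentioned in the list above: a namespace prefix makes the intent explicit regardless of which packages are attached. A minimal sketch:

```r
# dplyr::filter() subsets rows of a data frame
setosa <- dplyr::filter(iris, Species == "setosa")
nrow(setosa)    # 50

# stats::filter() applies a linear filter to a series
# (here a centred 3-point moving average)
ma <- stats::filter(1:5, rep(1/3, 3))
as.numeric(ma)  # NA 2 3 4 NA
```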
    

select()

Select columns from a data frame

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

select() + everything()

If we want one particular column (say the dependent variable y) to appear first or last in the dataset, we can use everything().

iris %>% select(Species, everything()) %>% head()
iris %>% select(-Species, everything()) %>% head() # put Species to the last col

group_by()

?group_by

group_by() + summarise(), arrange(desc())

Data transformation from R for Data Science

Function in summarise()

  • group_by(var1) %>% summarise(varY = mean(var2)) %>% ggplot(aes(x = varX, y = varY, fill = varF)) + geom_bar(stat = "identity") + theme_classic()
  • summarise(newvar = sum(var1) / sum(var2))
  • arrange(desc(var1), desc(var2))
  • Number of distinct observations: n_distinct()
  • Count the number of rows: n()
  • nth observation of the group: nth()
  • First observation of the group: first()
  • Last observation of the group: last()
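
The helpers listed above can be combined in a single summarise() call. The following sketch uses mtcars grouped by cyl; the output column names are arbitrary.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n          = n(),               # rows per group
    n_gear     = n_distinct(gear),  # distinct gear values per group
    first_mpg  = first(mpg),        # first observation
    second_mpg = nth(mpg, 2),       # 2nd observation
    last_mpg   = last(mpg),         # last observation
    ratio      = sum(hp) / sum(wt)  # a function of two summaries
  ) %>%
  arrange(desc(n))                  # largest group first

by_cyl$cyl  # 8 4 6
by_cyl$n    # 14 11 7
```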

group_by() + nest(), mutate(, map()), unnest(), list-columns

Many models from R for Data Science

  • ?unnest, vignette('rectangle'), vignette('nest') & vignette('pivot')
    tibble(x = 1:2, y = list(1:4, 2:3)) %>% unnest(y) %>% group_by(x) %>% nest()
    # returns to tibble(x = 1:2, y = list(1:4, 2:3)) with 'groups' information
  • annotate boxplot in ggplot2
  • Coding in R: Nest and map your way to efficient code
          group_by() + nest()    mutate(, map())   unnest()
    data  -------------------->  --------------->  ------->
    
    install.packages('gapminder'); library(gapminder); library(tidyverse)
    
    gapminder_nest <- gapminder %>% 
      group_by(country) %>% 
      nest()  # country, data
              # each row of 'data' is a tibble
    
    gapminder_nest$data[[1]]  # a tibble for one country (12 x 5 with the gapminder package)
    
    gapminder_nest <- gapminder_nest %>%
              mutate(pop_mean = map(.x = data, ~mean(.x$pop, na.rm = TRUE)))
                                        # country, data, pop_mean
    
    gapminder_nest %>% unnest(pop_mean) # country, data, pop_mean
    
    gapminder_plot <- gapminder_nest %>% 
      unnest(pop_mean) %>% 
      select(country, pop_mean) %>% 
      ungroup() %>% 
      top_n(pop_mean, n = -10) %>% 
      mutate(pop_mean = pop_mean/10^3)
    gapminder_plot %>% 
      ggplot(aes(x = reorder(country, pop_mean), y = pop_mean)) +
      geom_point(colour = "#FF6699", size = 5) +
      geom_segment(aes(xend = country, yend = 0), colour = "#FF6699") +
      geom_text(aes(label = round(pop_mean, 0)), hjust = -1) +
      theme_minimal() +
      labs(title = "Countries with smallest mean population from 1952 to 2007",
           subtitle = "(thousands)",
           x = "",
           y = "") +
      theme(legend.position = "none",
            axis.text.x = element_blank(),
            plot.title = element_text(size = 14, face = "bold"),
            panel.grid.major.y = element_blank()) +
      coord_flip() +
      scale_y_continuous()
  • Tidy analysis from tidymodels
  • Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()
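
A compact version of the group_by() + nest() + mutate(map()) + unnest() workflow, sketched on mtcars so that it runs without extra data packages. One linear model is fitted per cyl group; the column names fit and slope are chosen for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                       # columns: cyl, data
  mutate(
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),  # list-column of lm objects
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])       # one number per group
  ) %>%
  ungroup()

models %>% select(cyl, slope)

# unnest() expands a list-column back into regular rows
restored <- models %>% select(cyl, data) %>% unnest(data)
dim(restored)  # 32 11
```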

mutate + replace() or ifelse()

inner_join, left_join, full_join
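
A small sketch of how the three joins differ, using two made-up tibbles (left and right) that share an id column.

```r
library(dplyr)

left  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

ij <- inner_join(left, right, by = "id")  # ids 2, 3: rows present in both
lj <- left_join(left, right, by = "id")   # ids 1, 2, 3: all rows of 'left', y is NA for id 1
fj <- full_join(left, right, by = "id")   # ids 1, 2, 3, 4: all rows of either table
```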

plyr::rbind.fill()

Videos

dbplyr

stringr

  • stringr is part of the tidyverse but is not a core package. You need to load it separately.
  • Handling Strings with R (ebook) by Gaston Sanchez.
  • https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
  • stringr Cheat sheet (2 pages, this will immediately download the pdf file)
    • Detect Matches: str_detect(), str_which(), str_count(), str_locate()
    • Subset: str_sub(), str_subset(), str_extract(), str_match()
    • Manage Lengths: str_length(), str_pad(), str_trunc(), str_trim()
    • Mutate Strings: str_sub(), str_replace(), str_replace_all(), str_remove()
      • Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
    • Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_data()
  • Efficient data carpentry → Regular expressions from Efficient R programming book by Gillespie & Lovelace.
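
A few of the functions from the cheat sheet groups above, applied to a short character vector x (made up for illustration):

```r
library(stringr)

x <- c("apple", "banana", "pear")

str_detect(x, "an")                  # FALSE  TRUE FALSE
str_count(x, "a")                    # 1 3 1
str_sub(x, 1, 3)                     # "app" "ban" "pea"
str_length(x)                        # 5 6 4
str_pad("7", width = 3, pad = "0")   # "007"
str_replace_all(x, "a", "A")         # "Apple" "bAnAnA" "peAr"
str_to_title("tidy data")            # "Tidy Data"
str_c("x", 1:3, sep = "_")           # "x_1" "x_2" "x_3"
```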

split

Split Data Frame Variable into Multiple Columns in R (3 Examples)

Three ways:

x <- c("a-1", "b-2", "c-3")

stringr::str_split_fixed(x, "-", 2)
#      [,1] [,2]
# [1,] "a"  "1" 
# [2,] "b"  "2" 
# [3,] "c"  "3" 

tidyr::separate(data.frame(x), x, c('x1', 'x2'), "-")
   # The first argument must be a data frame
   # The 2nd argument is the column to split; the 3rd gives the new column names
#   x1 x2
# 1  a  1
# 2  b  2
# 3  c  3
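
A base R variant, for comparison, needs no extra packages: strsplit() followed by rbind. This is a sketch and not necessarily the approach used in the linked article; the names pieces, m and df are arbitrary.

```r
x <- c("a-1", "b-2", "c-3")

# strsplit() returns a list with one character vector per element of x
pieces <- strsplit(x, "-", fixed = TRUE)

# Bind the pieces row-wise into a 3 x 2 character matrix
m <- do.call(rbind, pieces)

# Optional: turn it into a data frame with named columns
df <- data.frame(x1 = m[, 1], x2 = m[, 2])
```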

magrittr

x %>% f     # f(x)
x %>% f(y)  # f(x, y)
x %>% f(arg=y)  # f(x, arg=y)
x %>% f(z, .) # f(z, x)
x %>% f(y) %>% g(z)  #  g(f(x, y), z)

x %>% select(which(colSums(!is.na(.))>0))  # remove columns with all missing data
x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
suppressPackageStartupMessages(library("dplyr"))
starwars %>%
  filter(., height > 200) %>%
  select(., height, mass) %>%
  head(.)
# instead of 
starwars %>%
  filter(height > 200) %>%
  select(height, mass) %>%
  head
iris$Species
iris[["Species"]]

iris %>%
`[[`("Species")

iris %>%
`[[`(5)

iris %>%
  subset(select = "Species")
  • Split-apply-combine: group + summarize + sort/arrange + top n. The following example is from Efficient R programming.
data(wb_ineq, package = "efficient")
wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
f <- function(x, y) {  # root-mean-square difference between y and x
  (y - x) %>% 
    '^'(2) %>% 
    sum %>%
    '/'(length(x)) %>% 
    sqrt %>% 
    round(2)
}
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)

rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str()
rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe
    # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2))  
    # instead of the righthand side.

# If a function does not have a data frame based api, you can use %$%.
# It explodes out the variables in a data frame.
mtcars %$% cor(disp, mpg) 

# For assignment, magrittr provides the %<>% operator
mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by
mtcars %<>% transform(cyl = cyl * 2)

Upsides of using magrittr: no need to create intermediate objects, code is easy to read.

When not to use the pipe

  • your pipes are longer than (say) 10 steps
  • you have multiple inputs or outputs
  • Functions that use the current environment: assign(), get(), load()
  • Functions that use lazy evaluation: tryCatch(), try()

Dollar sign .$

%$%

Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.

lhs %$% rhs
# lhs: A list, environment, or a data.frame.
# rhs: An expression where the names in lhs are available.

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

set_rownames() and set_colnames()

https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5])

cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])

purrr: Functional Programming Tools

While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…). See Learn to purrr.
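
To make the point about sapply() concrete, the sketch below shows its return type changing with the shape of the results, while the typed map_*() variants always return the promised type (or fail).

```r
library(purrr)

# sapply() picks the return type from the results it sees
r_vec  <- sapply(1:3, function(i) i * 2)       # numeric vector
r_mat  <- sapply(1:3, function(i) c(i, i^2))   # 2 x 3 matrix
r_list <- sapply(1:3, function(i) seq_len(i))  # list (lengths differ)
r_none <- sapply(list(), length)               # empty list, not integer(0)

# map_*() returns the requested type
m_none <- map_int(list(), length)              # integer(0)
m_dbl  <- map_dbl(1:3, function(i) i * 2)      # 2 4 6
```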

  • What does the tilde mean in this context of R code, What is meaning of first tilde in purrr::map
  • Getting started with the purrr package in R, especially the map() and map_df() functions.
    library(purrr)
    # map() is a replacement of lapply()
    # lapply(dat, function(x) mean(x$Open))
    map(dat, function(x)mean(x$Open))  
    
    # map() lets us skip the function keyword by
    # using a tilde (~) in place of function and a dot (.) in place of x
    map(dat, ~mean(.$Open))
    
    # map allows you to specify the structure of your output.
    map_dbl(dat, ~mean(.$Open))
    
    # map2() is a replacement of mapply()
    # mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)
    map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
data <- map(paths, read.csv)
data <- map_dfr(paths, read.csv, .id = "path")

out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE)
out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
  • Learn to purrr. Lots of good information like tilde-dot is a shorthand for functions.
    function(x) {
      x + 10
    }
    # is the same as
    ~{.x + 10}
    
    map_dbl(c(1, 4, 7), ~{.x + 10})
  • A closer look at replicate() and purrr::map() for simulations
    twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
         ngroup = 2
         group = rep( c("group1", "group2"), each = nrep)
         eps = rnorm(ngroup*nrep, 0, sigma)
         growth = b0 + b1*(group == "group2") + eps
         growthfit = lm(growth ~ group)
         growthfit
    }
    sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
    str(sim_lm, max.level = 1)
    
    map_dbl(sim_lm, ~summary(.x)$r.squared)
    # Same as function(x) {} style
    map_dbl(sim_lm, function(x) summary(x)$r.squared)
    # Same as sapply()
    sapply(sim_lm, function(x) summary(x)$r.squared)
    map_dfr(sim_lm, broom::tidy, .id = "model")
  • Functional programming from Advanced R.
  • Functional Programming : Sara Altman, Bill Behrman, Hadley Wickham
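
The map() snippets in the list above rely on objects (dat, stocks, paths) that are not defined on this page. The sketch below uses a made-up stand-in for dat so that the same calls can be run directly; the ticker names and values are invented.

```r
library(purrr)

# Hypothetical stand-in for 'dat': a named list of data frames with an Open column
dat <- list(
  AAA = data.frame(Open = c(1, 2, 3)),
  BBB = data.frame(Open = c(10, 20))
)

as_list <- map(dat, ~ mean(.x$Open))      # list of length 2
means   <- map_dbl(dat, ~ mean(.x$Open))  # named numeric vector: AAA = 2, BBB = 15

# map2_*() walks two inputs in parallel; .x and .y refer to them
labels <- map2_chr(dat, names(dat), ~ paste0(.y, ": ", nrow(.x), " rows"))

# .id adds a column holding the list names (map_dfr() needs dplyr installed)
tab <- map_dfr(dat, ~ data.frame(mean_open = mean(.x$Open)), .id = "ticker")
```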

negate()

How to select non-numeric columns using dplyr::select_if

library(tidyverse)
iris %>% select_if(negate(is.numeric))
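
select_if() still works but has been superseded since dplyr 1.0.0. A sketch of two equivalent spellings with select() and where():

```r
library(dplyr)

# where() wraps a predicate; purrr::negate() or ! inverts it
a <- iris %>% select(where(purrr::negate(is.numeric)))
b <- iris %>% select(!where(is.numeric))

names(a)  # "Species"
names(b)  # "Species"
```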

forcats

https://forcats.tidyverse.org/

JAMA retraction after miscoding – new Finalfit function to check recoding

outer()

Genomic sequence

  • chartr
> yourSeq <- "AAAACCCGGGTTTNNN"
> chartr("ACGT", "TGCA", yourSeq)
[1] "TTTTGGGCCCAAANNN"
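
chartr() gives the complement only. If the reverse complement is wanted, the characters also have to be reversed; a base R sketch follows (rev_comp is a hypothetical helper name, not an existing function).

```r
# Hypothetical helper: reverse complement of a DNA string
# (A<->T, C<->G; any other letter such as N is left unchanged)
rev_comp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)                   # complement each base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the character order
}

rev_comp("AAAACCCGGGTTTNNN")
# "NNNAAACCCGGGTTTT"
```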

broom

broom: Convert Statistical Analysis Objects into Tidy Tibbles

Especially the tidy() function.

R> str(survfit(Surv(time, status) ~ x, data = aml))
List of 17
 $ n        : int [1:2] 11 12
 $ time     : num [1:20] 9 13 18 23 28 31 34 45 48 161 ...
 $ n.risk   : num [1:20] 11 10 8 7 6 5 4 3 2 1 ...
 $ n.event  : num [1:20] 1 1 1 1 0 1 1 0 1 0 ...
...

R> tidy(survfit(Surv(time, status) ~ x, data = aml))
# A tibble: 20 x 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata         
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>          
 1     9     11       1        0   0.909     0.0953     1       0.754  x=Maintained   
 2    13     10       1        1   0.818     0.142      1       0.619  x=Maintained   
...
18    33      3       1        0   0.194     0.627      0.664   0.0569 x=Nonmaintained
19    43      2       1        0   0.0972    0.945      0.620   0.0153 x=Nonmaintained
20    45      1       1        0   0       Inf         NA      NA      x=Nonmaintained
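
The same idea on a simpler model object: a sketch with lm() on mtcars, showing the three main broom verbs. The object names fit and coefs are arbitrary.

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

coefs <- tidy(fit)     # one row per model term
names(coefs)
# "term" "estimate" "std.error" "statistic" "p.value"

glance(fit)            # one-row model summary (r.squared, AIC, ...)
head(augment(fit), 3)  # original data plus .fitted, .resid, ...
```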

lobstr package - dig into the internal representation and structure of R objects

lobstr 1.0.0

Other packages

Great R packages for data import, wrangling, and visualization


tidytext

https://juliasilge.shinyapps.io/learntidytext/

tidytuesdayR

install.packages("tidytuesdayR")
library("tidytuesdayR")
tt_datasets(2020)  # get the exact day of the data we want to load
coffee_ratings <- tt_load("2020-07-07")
print(coffee_ratings)  #  readme(coffee_ratings)

janitor

How to Clean Data: {janitor} Package

funneljoin