Tidyverse: Difference between revisions

From 太極
(32 intermediate revisions by the same user not shown)
Line 56: Line 56:
== Base-R and Tidyverse ==
* [https://matloff.wordpress.com/2022/08/24/base-r-and-tidyverse-code-side-by-side/ Base-R and Tidyverse Code, Side-by-Side]
== tidyverse vs Python pandas ==
[https://www.r-bloggers.com/2024/02/why-pandas-feels-clunky-when-coming-from-r/ Why pandas feels clunky when coming from R]


= Examples =
Line 243: Line 246:
|}
|}


== palmerpenguins data ==
[https://www.r-bloggers.com/2023/11/introduction-to-data-manipulation-in-r-with-dplyr/ Introduction to data manipulation in R with {dplyr}]

== glm() and ggplot2(), mtcars ==
<syntaxhighlight lang='rsplus'>
data(mtcars)
Line 276: Line 282:
<ul>
<ul>
<li>[https://dplyr.tidyverse.org/reference/case_when.html case_when()]: A general vectorised if. This function allows you to vectorise multiple if_else() statements. [https://www.sharpsightlabs.com/blog/case-when-r/ How to use the R case_when function].
<pre>
case_when(
  condition_1 ~ result_1,
  condition_2 ~ result_2,
  ...
  condition_n ~ result_n,
  .default = default_result
)
</pre>
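A minimal sketch of the template above on toy data. The data frame <code>scores</code> and its cutoffs are made up for illustration, and the <code>.default</code> argument assumes dplyr 1.1.0 or later (older versions use <code>TRUE ~ ...</code> as the last branch).
<syntaxhighlight lang='rsplus'>
library(dplyr)

# Hypothetical toy data, not taken from the linked posts
scores <- data.frame(id = 1:5, score = c(95, 72, 58, NA, 85))

graded <- scores %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",     # conditions are checked in order,
    score >= 70 ~ "B",     # the first one that is TRUE wins
    score >= 0  ~ "C",
    .default = "missing"   # used when no condition is TRUE (here: the NA score)
  ))
graded$grade
# [1] "A"       "B"       "C"       "missing" "B"
</syntaxhighlight>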
<pre>
<pre>
x %>% mutate(group = case_when(
x %>% mutate(group = case_when(
Line 326: Line 341:
</ul>
</ul>
* [https://www.rdocumentation.org/packages/knitr/versions/1.25/topics/kable kable()]
== Tidying the Freedom Index ==
https://pacha.dev/blog/2023/06/05/freedom-index/index.html
tidyverse
* gsub()
* read_excel()
* filter()
* pivot_longer()
* case_when()
* fill()
* group_by(), mutate(), row_number(), ungroup()
* pivot_wider()
* drop_na()
* ungroup(), distinct()
* left_join()
ggplot2
* geom_line()
* facet_wrap()
* theme_minimal()
* theme()
* labs()
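
The following is only a toy sketch of how a few of the tidyr functions listed above chain together. It is not the code from the linked post: the table <code>raw</code>, its column names and its values are all hypothetical.
<syntaxhighlight lang='rsplus'>
library(dplyr)
library(tidyr)

# Hypothetical wide table: the country label is filled in only on the first
# row of each block, and there is one column per year
raw <- tibble(
  country   = c("A", NA, "B", NA),
  indicator = c("civil", "political", "civil", "political"),
  `2021`    = c(1, 2, 3, NA),
  `2022`    = c(2, 2, 4, 5)
)

tidy <- raw %>%
  fill(country) %>%                      # carry the country label downward
  pivot_longer(c(`2021`, `2022`),
               names_to = "year", values_to = "score") %>%
  drop_na(score) %>%                     # remove rows with a missing score
  pivot_wider(names_from = indicator,    # one column per indicator
              values_from = score)
</syntaxhighlight>
The result has one row per country and year, with the columns country, year, civil and political.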


== Useful dplyr functions (with examples) ==
Line 542: Line 580:


== select(): extract multiple columns ==
== select(): drop columns ==
[https://www.r-bloggers.com/2024/04/simplifying-data-manipulation-how-to-drop-columns-from-data-frames-in-r/ Simplifying Data Manipulation: How to Drop Columns from Data Frames in R]
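
A short sketch of common ways to drop columns, using a made-up data frame <code>df</code> (the column names are hypothetical). The last line shows a base R equivalent for comparison.
<syntaxhighlight lang='rsplus'>
library(dplyr)

df <- data.frame(x = 1:3, y = letters[1:3], z = c(TRUE, FALSE, TRUE))

drop_one   <- df %>% select(-y)                 # drop a single column by name
drop_many  <- df %>% select(-c(x, z))           # drop several columns at once
drop_match <- df %>% select(-starts_with("z"))  # drop by a tidyselect helper
base_drop  <- df[, !(names(df) %in% c("x", "z")), drop = FALSE]  # base R
</syntaxhighlight>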


== slice(): select rows by index ==
Line 580: Line 621:
== Anonymous functions ==
* See [[R#anonymous_function|R]] page
* https://dplyr.tidyverse.org/reference/funs.html
* [https://stackoverflow.com/q/58845722 Is the role of `~` tilde in dplyr limited to non-standard evaluation?]
* [https://stackoverflow.com/a/14976479 Use of ~ (tilde) in R programming Language]
* [https://campus.datacamp.com/courses/intermediate-r/chapter-4-the-apply-family?ex=4 lapply and anonymous functions]
* [https://www.infoworld.com/article/3537612/dplyr-across-first-look-at-a-new-tidyverse-function.html dplyr across: First look at a new Tidyverse function].
** [https://dplyr.tidyverse.org/reference/across.html Apply a function (or functions) across multiple columns]. across(), if_any(), if_all().
** [https://tidyselect.r-lib.org/reference/starts_with.html Select variables that match a pattern]. starts_with(), ends_with(), contains(), matches(), num_range().
** [https://twitter.com/romain_francois/status/1350078666554933249/photo/2 data %>% group_by(Var1) %>% summarise(across(contains("SomeKey"), mean, na.rm = TRUE))]
<pre>
ny <- filter(cases, State == "NY") %>%
  select(County = `County Name`, starts_with(c("3", "4")))

daily_totals <- ny %>%
  summarize(
    across(starts_with("4"), sum)
  )

median_and_max <- list(
  med = ~median(.x, na.rm = TRUE),
  max = ~max(.x, na.rm = TRUE)
)

april_median_and_max <- ny %>%
  summarize(
    across(starts_with("4"), median_and_max)
  )
</pre>
<pre>
# across(.cols = everything(), .fns = NULL, ..., .names = NULL)

# Rounding the columns Sepal.Length and Sepal.Width
iris %>%
  as_tibble() %>%
  mutate(across(c(Sepal.Length, Sepal.Width), round))

iris %>% summarise(across(contains("Sepal"), ~mean(.x, na.rm = TRUE)))

# filter rows
iris %>% filter(if_any(ends_with("Width"), ~ . > 4))

iris %>% select(starts_with("Sepal"))

iris %>% select(starts_with(c("Petal", "Sepal")))

iris %>% select(contains("Sepal"))
</pre>

== Transformation on multiple columns ==
* [https://datasciencetut.com/how-to-apply-a-transformation-to-multiple-columns-in-r/ How to apply a transformation to multiple columns in R?]
** '''df %>% mutate(across(c(col1, col2), function(x) x*2))'''
** '''df %>% summarise(across(c(col1, col2), mean, na.rm=TRUE))'''
* select() vs '''across()'''
** The across() and select() functions are both used to manipulate columns in a data frame.
** The select() function is used to select columns from a data frame.
** The across() function is used to apply a function to multiple columns in a data frame. It’s often used inside other functions like '''mutate()''' or '''summarize()'''.
:<syntaxhighlight lang='rsplus'>
data.frame(
  x = c(1, 2, 3),
  y = c(4, 5, 6)
) %>%
mutate(across(everything(), ~ .x * 2)) # purrr-style lambda
#   x  y
# 1 2  8
# 2 4 10
# 3 6 12
</syntaxhighlight>
* [https://twitter.com/ChBurkhart/status/1655559927715463169?s=20 Quick tidyverse tip: How to make your summary statistics more human readable with pivot_wider]

= Reading and writing data =
[https://www.danielecook.com/speeding-up-reading-and-writing-in-r/ Speeding up Reading and Writing in R]

== [https://cran.r-project.org/web/packages/data.table/index.html data.table] ==
Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).

Note: '''data.table''' has its own ways (cf base R and '''dplyr''') to subset columns.

Some resources:
* https://www.rdocumentation.org/packages/data.table/versions/1.12.0
* [https://github.com/chuvanan/rdatatable-cookbook cookbook]
* [https://www.waldrn.com/dplyr-vs-data-table/ R Packages: dplyr vs data.table]
* [https://martinctc.github.io/blog/comparing-common-operations-in-dplyr-and-data.table/ Comparing Common Operations in dplyr and data.table]
* [https://github.com/rstudio/cheatsheets/raw/master/datatable.pdf Cheat sheet] from [https://www.rstudio.com/resources/cheatsheets/ RStudio]
* [https://www.r-bloggers.com/importing-data-into-r-part-two/ Reading large data tables in R]. fread(FILENAME)
* Note that in older versions of data.table ''x[, 2]'' always returns 2. To get the second column there, use ''x[, 2, with=FALSE]'' or ''x[, V2]'', where V2 is the column name. See FAQ #1 in [http://datatable.r-forge.r-project.org/datatable-faq.pdf data.table].
* [http://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html Understanding data.table Rolling Joins]
* [https://rollingyours.wordpress.com/2016/06/14/fast-aggregation-of-large-data-with-the-data-table-package/ Intro to The data.table Package]
** Subsetting rows and/or columns
** Alternative to using tapply(), aggregate(), table() to summarize data
** Similarities to SQL, DT[i, j, by]
* [https://www.listendata.com/2016/10/r-data-table.html R : data.table (with 50 examples)] from ListenData
** Describe Data
** Selecting or Keeping Columns
** Rename Variables
** Subsetting Rows / Filtering
** Faster Data Manipulation with Indexing
** Performance Comparison
** Sorting Data
** Adding Columns (Calculation on rows)
** How to write Sub Queries (like SQL)
** Summarize or Aggregate Columns
** GROUP BY (Within Group Calculation)
** Remove Duplicates
** Extract values within a group
** SQL's RANK OVER PARTITION
** Cumulative SUM by GROUP
** Lag and Lead
** Between and LIKE Operator
** Merging / Joins
** Convert a data.table to data.frame
* [https://www.dezyre.com/data-science-in-r-programming-tutorial/r-data-table-tutorial R Tutorial: data.table] from dezyre.com
** Syntax: DT[where, select|update|do, by]
** Keys and setkey()
** Fast grouping using j and by: DT[,sum(v),by=x]
** Fast ordered joins: X[Y,roll=TRUE]
* In the [https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro-vignette.html Introduction to data.table] vignette, the data.table::order() function is SLOWER than base::order() on my Odroid xu4 (running Ubuntu 14.04.4 trusty on uSD) <syntaxhighlight lang='rsplus'>
odt = data.table(col=sample(1e7))
(t1 <- system.time(ans1 <- odt[base::order(col)]))  ## uses order from base R
#   user  system elapsed
#  2.730   0.210   2.947
(t2 <- system.time(ans2 <- odt[order(col)]))        ## uses data.table's order
#   user  system elapsed
#  2.830   0.215   3.052
(identical(ans1, ans2))
# [1] TRUE
</syntaxhighlight>
* [https://jangorecki.github.io/blog/2016-06-30/Boost-Your-Data-Munging-with-R.html Boost Your Data Munging with R]
* [https://www.rdocumentation.org/packages/data.table/versions/1.12.0/topics/rbindlist rbindlist()]. One problem: it can use too much memory. When I tried to analyze R package downloads, the command "dat <- rbindlist(logs)" used up my 64GB of memory (the OS became unresponsive).
* [https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread Convenience features of fread]
* [https://www.infoworld.com/article/3575086/the-ultimate-r-datatable-cheat-sheet.html?s=09 The ultimate R data.table cheat sheet] from infoworld

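
A small sketch of the column-subsetting idioms that data.table adds on top of base R, plus one grouped aggregation in the DT[i, j, by] form. The table <code>DT</code> and its columns are made up for illustration, and a reasonably recent data.table version is assumed.
<syntaxhighlight lang='rsplus'>
library(data.table)

DT <- data.table(x = c("a", "a", "b"), v = 1:3, w = c(10, 20, 30))

by_name <- DT[, .(x, v)]              # .() is an alias for list(); returns a data.table
one_vec <- DT[, v]                    # a bare column name returns a plain vector
cols    <- c("x", "w")
by_var  <- DT[, ..cols]               # the .. prefix looks up `cols` outside DT
by_with <- DT[, cols, with = FALSE]   # older spelling of the same selection
grouped <- DT[, .(total = sum(v)), by = x]  # aggregate v within each value of x
</syntaxhighlight>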
[https://github.com/Rdatatable/data.table/wiki/Installation#openmp-enabled-compiler-for-mac OpenMP enabled compiler for Mac]. These instructions worked on my Mac (El Capitan, 10.11.6) when I needed to upgrade the data.table version from 1.11.4 to 1.11.6.

Question: how to make use of multiple cores with the data.table package?

=== dtplyr ===
https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/

= reshape & reshape2 (superseded by tidyr package) =
* [http://r-exercises.com/2016/07/06/data-shape-transformation-with-reshape/ Data Shape Transformation With Reshape()]
* Use the '''acast()''' function in the reshape2 package. It converts a data.frame used for analysis to a table-like data.frame good for display.
* http://lamages.blogspot.com/2013/10/creating-matrix-from-long-dataframe.html

= [http://cran.r-project.org/web/packages/tidyr/index.html tidyr] =
== Missing values ==
[https://www.programmingwithr.com/handling-missing-values-in-r-using-tidyr/ Handling Missing Values in R using tidyr]

== Pivot ==
<ul>
<li>tidyr package. [https://tidyr.tidyverse.org/articles/pivot.html pivot vignette],
[https://tidyr.tidyverse.org/reference/pivot_wider.html pivot_wider()]
<pre>
R> d2 <- tibble(o=rep(LETTERS[1:2], each=3), n=rep(letters[1:3], 2), v=1:6); d2
# A tibble: 6 × 3
  o     n         v
  <chr> <chr> <int>
1 A     a         1
2 A     b         2
3 A     c         3
4 B     a         4
5 B     b         5
6 B     c         6
R> d1 <- d2 %>% pivot_wider(names_from=n, values_from=v); d1
# A tibble: 2 × 4
  o         a     b     c
  <chr> <int> <int> <int>
1 A         1     2     3
2 B         4     5     6
</pre>
[https://tidyr.tidyverse.org/reference/pivot_longer.html pivot_longer()]
<pre>
R> d1 %>% pivot_longer(!o, names_to = 'n', values_to = 'v')
# Pivot all columns except 'o' column
# A tibble: 6 × 3
  o     n         v
  <chr> <chr> <int>
1 A     a         1
2 A     b         2
3 A     c         3
4 B     a         4
5 B     b         5
6 B     c         6
</pre>
<ul>
<li>In addition to the '''names_from''' and '''values_from''' columns, the data must have other columns </li>
<li>For each (combination) of unique value from other columns, the values from '''names_from''' variable must be unique</li>
</ul>
</li>
<li>Conversion from gather() to pivot_longer()
<pre>
gather(df, key=KeyName, value = valueName, col1, col2, ...) # No quotes around KeyName and valueName

pivot_longer(df, cols, names_to = "keyName", values_to = "valueName")
  # cols can be everything()
  # cols can be column positions (numbers) or column names
</pre>
</li>
</ul>
* [https://www.r-bloggers.com/using-r-from-gather-to-pivot/ From gather to pivot]. [https://tidyr.tidyverse.org/reference/pivot_longer.html pivot_longer()]/pivot_wider()
* [https://blog.methodsconsultants.com/posts/data-pivoting-with-tidyr/ Data Pivoting with tidyr]
<ul>
<li>[https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html A Tidy Transcriptomics introduction to RNA-Seq analyses]
<pre>
data %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance")
</pre>
</li>
<li>[https://onunicornsandgenes.blog/2020/05/17/using-r-setting-a-colour-scheme-in-ggplot2/ Using R: setting a colour scheme in ggplot2]. Note the new (default) column names '''value''' and '''name''' after the function '''pivot_longer(data, cols)'''.
<pre>
set.seed(1)
dat1 <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10))
dat2 <- pivot_longer(dat1, -y)
head(dat2, 2)
# (values below come from the original run and depend on the random seed)
# A tibble: 2 x 3
      y name   value
  <dbl> <chr>  <dbl>
1 -1.28 x1     0.717
2 -1.28 x2    -0.320

dat3 <- pivot_wider(dat2)
range(dat1 - dat3)
</pre>
</li>
</ul>
* [https://thatdatatho.com/2020/03/28/tidyrs-pivot_longer-and-pivot_wider-examples-tidytuesday-challenge/ pivot_longer()’s Advantage Over gather()]
* [https://datascienceplus.com/how-to-carry-column-metadata-in-pivot_longer/ How to carry column metadata in pivot_longer]
* [https://datawookie.dev/blog/2021/10/working-with-really-wide-data/ Working with Really Wide Data]
* [https://towardsdev.com/data-reshaping-with-r-from-wide-to-long-and-back-7a5eb674d73e Data Reshaping with R: From Wide to Long (and back)]

== Benchmark ==
An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.

* [https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html vignette("tidy-data")] & [https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf Cheat sheet]
* Main functions
** Reshape data: [https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/gather gather()] & [https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/spread spread()]. [https://tidyr.tidyverse.org/dev/articles/pivot.html These two are superseded by pivot_longer() and pivot_wider()]
** Break apart or combine columns/Split cells: [https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate separate()] & [https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/unite unite()]
** Handle missing: [https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/drop_na drop_na()] & fill() & replace_na()
* Other functions
** tidyr::separate() function. If a cell contains many elements separated by ",", we can use this function to create more columns. The opposite function is unite().
** [https://rpubs.com/maraaverick/tidyr-separate-rows-FTW tidyr::separate_rows()]. If a cell contains many elements separated by ",", we can use this function to create one row per element. See the cheat sheet link above.
* http://blog.rstudio.org/2014/07/22/introducing-tidyr/
* http://rpubs.com/seandavi/GEOMetadbSurvey2014
* http://timelyportfolio.github.io/rCharts_factor_analytics/factors_with_new_R.html
* [http://www.milanor.net/blog/reshape-data-r-tidyr-vs-reshape2/ tidyr vs reshape2]
* [https://data.library.virginia.edu/a-tidyr-tutorial/ A tidyr Tutorial] from U of Virginia
* [http://r-posts.com/benchmarking-cast-in-r-from-long-data-frame-to-wide-matrix/ Benchmarking cast in R from long data frame to wide matrix]

Make wide tables long with '''gather()''' (see 6.3.1 of Efficient R Programming)
<syntaxhighlight lang='rsplus'>
library(tidyr)
library(efficient)
data(pew) # wide table
dim(pew) # 18 x 10,  (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k')
pewt <- gather(data = pew, key = Income, value = Count, -religion)
dim(pewt) # 162 x 3,  (religion, Income, Count)

args(gather)
# function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
</syntaxhighlight>
where the three arguments of gather() require:
Line 863: Line 860:
</syntaxhighlight>
</syntaxhighlight>
* [http://genomicsclass.github.io/book/pages/dplyr_tutorial.html dplyr tutorial] from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For the select() function, some additional options to select columns based on specific criteria include
** [https://tidyselect.r-lib.org/reference/starts_with.html starts_with()]/ ends_with() = Select columns that start/end with a character string
** contains() = Select columns that contain a character string
** matches() = Select columns that match a regular expression
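
A quick sketch of the helpers listed above, applied to the built-in iris data. The particular patterns are chosen only for illustration.
<syntaxhighlight lang='rsplus'>
library(dplyr)

starts <- iris %>% select(starts_with("Sepal"))  # Sepal.Length, Sepal.Width
ends   <- iris %>% select(ends_with("Width"))    # Sepal.Width, Petal.Width
has    <- iris %>% select(contains("etal"))      # Petal.Length, Petal.Width
regex  <- iris %>% select(matches("^(Sepal|Petal)\\.Length$"))  # both Length columns
</syntaxhighlight>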
Line 925: Line 922:
   group_by(year, month, day) %>% 
   summarise(mean = mean(dep_delay, na.rm = TRUE))
</syntaxhighlight>
* Another example
:<syntaxhighlight lang='r'>
library(dplyr)

data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  gender = c("F", "M", "M", "M", "F"),
  score1 = c(80, 85, 90, 95, 100),
  score2 = c(75, 80, 85, 90, 95)
)
# Example usage of dplyr functions
result <- data %>%
  filter(gender == "M") %>%                # Keep only rows where gender is "M"
  select(name, age, score1, score2) %>%    # Select specific columns (score2 is needed below)
  mutate(score_diff = score1 - score2) %>% # Calculate a new column based on existing columns
  arrange(desc(age)) %>%                   # Arrange rows in descending order of age
  #group_by(gender) %>%                    # Optional grouping (gender must then be kept in select())
  summarize(mean_score1 = mean(score1))    # Mean of score1 over the remaining rows
</syntaxhighlight>
* [https://csgillespie.github.io/efficientR/data-carpentry.html#dplyr Efficient R Programming]
Line 942: Line 958:
* [https://towardsdatascience.com/what-you-need-to-know-about-the-new-dplyr-1-0-0-7eaaaf6d78ac The Seven Key Things You Need To Know About dplyr 1.0.0]


== .x symbol ==
[https://stackoverflow.com/a/56532176 dplyr piping data - difference between `.` and `.x`]

== select() for columns ==
[https://www.quantargo.com/courses/course-r-introduction/03-dplyr/02-select-columns-data-frame/recipe Select columns from a data frame]
<pre>
<pre>
Line 962: Line 975:
</pre>
</pre>


== group_by() ==
=== .$Name ===
* [https://dplyr.tidyverse.org/reference/group_by.html ?group_by] and ungroup(),
Extract a column using piping. The '''.''' represents the data frame that is being piped in, and $Name extracts the ‘Name’ column.
* [https://dplyr.tidyverse.org/articles/grouping.html Grouped data]
<pre>
* Is ungroup() recommended after every group_by()? Always ungroup() when you’ve finished with your calculations. See [https://bookdown.org/yih_huynh/Guide-to-R-Book/groupby.html#ungrouping here] or [https://community.rstudio.com/t/is-ungroup-recommended-after-every-group-by/5296 this].
mtcars %>% .$mpg  # A vector
* You might want to use ungroup() if you want to perform further calculations or manipulations on the data that don’t depend on the grouping. For example, after ungrouping the data, you could add new columns or filter rows without being restricted by the grouping.


<pre>
mtcars %>% select(mpg) # A list
                  +-- mutate() + ungroup()
x -- group_by() --|
                  +-- summarise() # reduce the dimension, no way to get back
</pre>
</pre>
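The difference between extracting a vector and keeping a data frame can be checked directly. This is a small sketch using mtcars; pull() is the dplyr verb for extracting a single column as a vector, and the object names (v1, v2, d1) are only for illustration.

<syntaxhighlight lang='r'>
library(dplyr)

v1 <- mtcars %>% pull(mpg)    # numeric vector, same as mtcars$mpg
v2 <- mtcars %>% .$mpg        # same result via the magrittr placeholder
d1 <- mtcars %>% select(mpg)  # still a data frame, with one column

is.numeric(v1)      # TRUE
identical(v1, v2)   # TRUE
is.data.frame(d1)   # TRUE
</syntaxhighlight>

pull() reads more naturally inside a longer pipeline, while select() is the one to use when later steps still expect a data frame.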


=== Subset rows by group ===
== filter() for rows ==
[https://datasciencetut.com/subset-rows-based-on-their-integer-locations/ Subset rows based on their integer locations-slice in R]
<pre>
mtcars %>% filter(mpg>10)


=== Rank by Group ===
identical(mtcars %>% filter(mpg>10), subset(mtcars, mpg>10))
[https://datasciencetut.com/how-to-rank-by-group-in-r/ How to Rank by Group in R?]
# [1] TRUE
<pre>
df %>% arrange(team, points) %>%
    group_by(team) %>%
    mutate(rank = rank(points))
</pre>
</pre>
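The rank-by-group snippet can be tried on a small made-up data frame. The team and points values below are hypothetical, and ties.method = "min" is one possible choice for ties, not necessarily the one used in the linked article. The ungroup() at the end drops the grouping so that later steps work on the whole table.

<syntaxhighlight lang='r'>
library(dplyr)

# Hypothetical data: three players on each of two teams
df <- data.frame(
  team   = c("A", "A", "A", "B", "B", "B"),
  points = c(10, 25, 25, 7, 30, 12)
)

ranked <- df %>%
  group_by(team) %>%
  mutate(rank = rank(points, ties.method = "min")) %>%  # tied values share the lowest rank
  ungroup() %>%                                         # drop the grouping when done
  arrange(team, rank)

ranked$rank
# [1] 1 2 2 1 2 3
</syntaxhighlight>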


=== group_by() + summarise(), arrange(desc()) ===
=== filter by date ===
[https://r4ds.had.co.nz/transform.html Data transformation] from R for Data Science
[https://datasciencetut.com/what-is-the-best-way-to-filter-by-date-in-r/ What Is the Best Way to Filter by Date in R?]


[https://www.guru99.com/r-aggregate-function.html#3 Function in summarise()]
== arrange (reorder) ==
* group_by(var1) %>% summarise(varY = mean(var2)) %>% ggplot(aes(x = varX, y = varY, fill = varF)) + geom_bar(stat = "identity") + theme_classic()
<ul>
* summarise(newvar = sum(var1) / sum(var2))
<li>Arrange values by a Single Variable:
* arrange(desc(var1), desc(var2))
<pre>
* Number of distinct observations: '''n_distinct()'''
# Create a sample data frame
* Count the number of rows: '''n()'''
students <- data.frame(
* nth observation of the group: '''nth()'''
  Name = c("Ali", "Boby", "Charlie", "Davdas"),
* First observation of the group: '''first()'''
  Score = c(85, 92, 78, 95)
* Last observation of the group: '''last()'''
)
 
# Arrange by Score in ascending order
arrange(students, Score)
#      Name Score
# 1 Charlie    78
# 2     Ali    85
# 3    Boby    92
# 4  Davdas    95
</pre>
<li>Arrange values by Multiple Variables:
This is like the "sort" function in Excel.
<pre>
# Create a sample data frame
transactions <- data.frame(
  Date = c("2024-04-01", "2024-04-01", "2024-04-02", "2024-04-03"),
  Amount = c(100, 150, 200, 75)
)


=== group_by() + summarise() + across() ===
# Arrange by Date in ascending order, then by Amount in descending order
* [https://twitter.com/ChBurkhart/status/1647243881095000069?s=20 Get a summarize from multiple columns without explicitly specifying the column names]
arrange(transactions, Date, desc(Amount))
* [https://dplyr.tidyverse.org/reference/across.html ?across]
#         Date Amount
# 1 2024-04-01    150
# 2 2024-04-01    100
# 3 2024-04-02    200
# 4 2024-04-03     75
</pre>
<li>Arrange values with Missing Values:
<pre>
# Create a sample data frame with missing values
data <- data.frame(
  ID = c(1, 2, NA, 4),
  Value = c(20, NA, 15, 30)
)


=== group_by() + nest(), mutate(, map()), unnest(), list-columns ===
# Arrange by Value in ascending order, placing missing values first
[https://www.rdocumentation.org/packages/tidyr/versions/1.3.0/topics/nest nest(data=)] from the tidyr package creates nested data frames, where '''one column contains another data frame or list'''. This is useful for running an analysis or drawing a plot on each group separately. '''PS:''' group_by() is not required; nest() can be given the columns to nest directly, e.g. nest(data = -marital) or nest(.by = marital).
arrange(data, desc(is.na(Value)), Value)
:<syntaxhighlight lang='rsplus'>
#   ID Value
histogram <- gss_cat |>
# 1  2    NA
  nest(data = -marital) |>  # OR nest(.by = marital). 6x2 tibble. Col1=marital, col2=data.
# 2 NA    15
   mutate(
# 3  1    20
    histogram = pmap(
# 4  4    30
      .l = list(marital, data),
</pre>
      .f = \(marital, data) {
</ul>
        ggplot(data, aes(x = tvhours)) +
          geom_histogram(binwidth = 1) +
          labs(
            title = marital
          )
      }
    )
  )
histogram$histogram[[1]]
</syntaxhighlight>


[https://r4ds.had.co.nz/many-models.html Many models] from R for Data Science
=== arrange and match ===
How to do the following in pipe ''' A <- A[match(id.ref, A$id), ]'''


[https://stackoverflow.com/a/52216391 How to sort rows of a data frame based on a vector using dplyr pipe], [https://stackoverflow.com/a/59730594 Order data frame rows according to vector with specific order]
<ul>
<ul>
<li>[https://tidyr.tidyverse.org/reference/nest.html ?unnest], vignette('rectangle'), vignette('nest') & vignette('pivot')
<li>Data
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='r'>
tibble(x = 1:2, y = list(1:4, 2:3)) %>% unnest(y) %>% group_by(x) %>% nest()
library(dplyr)
# returns to tibble(x = 1:2, y = list(1:4, 2:3)) with 'groups' information
 
# Create a sample dataframe 'A'
set.seed(1); A <- data.frame(
    id = sample(letters[1:5]),
    value = 1:5
    )
print(A)
  id value
1  a    1
2  d    2
3  c    3
4  e    4
5  b    5
 
# Create a reference vector 'id.ref'
id.ref <- c("e", "d", "c", "b", "a")
</syntaxhighlight>
<syntaxhighlight lang='r'>
# Goal:
A[match(id.ref, A$id),]
  id value
4  e    4
2  d    2
3  c    3
5  b    5
1  a    1
</syntaxhighlight>
<li>Method 1 (best): no match() is needed. Brilliant!
<syntaxhighlight lang='r'>
A %>% arrange(factor(id, levels=id.ref))
  id value
1  e    4
2  d    2
3  c    3
4  b    5
5  a    1
# detail:
factor(A$id, levels=id.ref)
[1] a d c e b
Levels: e d c b a
</syntaxhighlight>
</syntaxhighlight>
</li>
<li>[https://stackoverflow.com/a/38021139 annotate boxplot in ggplot2] </li>
<li>[https://towardsdatascience.com/coding-in-r-nest-and-map-your-way-to-efficient-code-4e44ba58ee4a Coding in R: Nest and map your way to efficient code]
<pre>
      group_by() + nest()    mutate(, map())  unnest()
data  -------------------->  --------------->  ------->
</pre>
<syntaxhighlight lang='rsplus'>
install.packages('gapminder'); library(gapminder)


gapminder_nest <- gapminder %>%  
<li>Method 2: complicated
  group_by(country) %>%  
<syntaxhighlight lang='r'>
  nest()  # country, data
A %>%
          # each row of 'data' is a tibble
    mutate(id.match = match(id, id.ref)) %>%
    arrange(id.match) %>%
    select(-id.match)
  id value
1  e    4
2  d    2
3  c    3
4  b    5
5  a    1
# detail:
A %>%
    mutate(id.match = match(id, id.ref))
  id value id.match
1  a    1        5
2  d    2        2
3  c    3        3
4  e    4        1
5  b    5        4
</syntaxhighlight>


gapminder_nest$data[[1]] # tibble 57 x 8
<li>Method 3: a simplified version of Method 2, but it needs match()
<syntaxhighlight lang='r'>
A %>% arrange(match(id, id.ref))
  id value
1  e    4
2  d    2
3  c    3
4  b    5
5  a    1
</syntaxhighlight>
</ul>


gapminder_nest <- gapminder_nest %>%
== group_by() ==
          mutate(pop_mean = map(.x = data, ~mean(.x$pop, na.rm = TRUE)))
* [https://dplyr.tidyverse.org/reference/group_by.html ?group_by] and ungroup(),
                                    # country, data, pop_mean
* [https://dplyr.tidyverse.org/articles/grouping.html Grouped data]
* Is ungroup() recommended after every group_by()? Always ungroup() when you’ve finished with your calculations. See [https://bookdown.org/yih_huynh/Guide-to-R-Book/groupby.html#ungrouping here] or [https://community.rstudio.com/t/is-ungroup-recommended-after-every-group-by/5296 this].
* You might want to use ungroup() if you want to perform further calculations or manipulations on the data that don’t depend on the grouping. For example, after ungrouping the data, you could add new columns or filter rows without being restricted by the grouping.


gapminder_nest %>% unnest(pop_mean) # country, data, pop_mean
<pre>
                  +-- mutate() + ungroup()
x -- group_by() --|
                  +-- summarise() # reduce the dimension, no way to get back
</pre>
 
=== Subset rows by group ===
[https://datasciencetut.com/subset-rows-based-on-their-integer-locations/ Subset rows based on their integer locations-slice in R]
 
=== Rank by Group ===
[https://datasciencetut.com/how-to-rank-by-group-in-r/ How to Rank by Group in R?]
<pre>
df %>% arrange(team, points) %>%
    group_by(team) %>%
    mutate(rank = rank(points))
</pre>


gapminder_plot <- gapminder_nest %>%
=== group_by() + summarise(), arrange(desc()) ===
  unnest(pop_mean) %>%
[https://r4ds.had.co.nz/transform.html Data transformation] from R for Data Science
  select(country, pop_mean) %>%
  ungroup() %>%
  top_n(pop_mean, n = -10) %>%
  mutate(pop_mean = pop_mean/10^3)
gapminder_plot %>%
  ggplot(aes(x = reorder(country, pop_mean), y = pop_mean)) +
  geom_point(colour = "#FF6699", size = 5) +
  geom_segment(aes(xend = country, yend = 0), colour = "#FF6699") +
  geom_text(aes(label = round(pop_mean, 0)), hjust = -1) +
  theme_minimal() +
  labs(title = "Countries with smallest mean population from 1960 to 2016",
      subtitle = "(thousands)",
      x = "",
      y = "") +
  theme(legend.position = "none",
        axis.text.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major.y = element_blank()) +
  coord_flip() +
  scale_y_continuous()
</syntaxhighlight>
</li>
<li>[https://www.tidymodels.org/learn/statistics/tidy-analysis/ Tidy analysis] from tidymodels </li>
<li>[https://community.rstudio.com/t/is-nest-mutate-map-unnest-really-the-best-alternative-to-dplyr-do/11009 Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()] </li>
</ul>


== filter by date ==
[https://www.guru99.com/r-aggregate-function.html#3 Function in summarise()]
[https://datasciencetut.com/what-is-the-best-way-to-filter-by-date-in-r/ What Is the Best Way to Filter by Date in R?]
* group_by(var1) %>% summarise(varY = mean(var2)) %>% ggplot(aes(x = varX, y = varY, fill = varF)) + geom_bar(stat = "identity") + theme_classic()
* summarise(newvar = sum(var1) / sum(var2))
* arrange(desc(var1), desc(var2))
* Number of distinct observations: '''n_distinct()'''
* Count the number of rows: '''n()'''
* nth observation of the group: '''nth()'''
* First observation of the group: '''first()'''
* Last observation of the group: '''last()'''
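The helper functions listed above can be combined in a single summarise() call. The following is a sketch on mtcars; the output column names (n_cars, n_gears and so on) are made up for illustration.

<syntaxhighlight lang='r'>
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars     = n(),                # number of rows in the group
    n_gears    = n_distinct(gear),   # number of distinct gear values
    first_mpg  = first(mpg),         # first observation of the group
    last_mpg   = last(mpg),          # last observation of the group
    second_mpg = nth(mpg, 2),        # 2nd observation of the group
    wt_per_hp  = sum(wt) / sum(hp)   # ratio of two sums
  ) %>%
  arrange(desc(n_cars))              # largest group first

by_cyl$n_cars
# [1] 14 11  7
</syntaxhighlight>

Note that desc() takes a single variable, so sorting by two variables in descending order needs one desc() per variable.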


== ave() - Adding a column of means by group to original data ==
=== group_by() + summarise() + across() ===
* [https://stackoverflow.com/a/7976250 Adding a column of means by group to original data],
* [https://twitter.com/ChBurkhart/status/1647243881095000069?s=20 Get a summarize from multiple columns without explicitly specifying the column names]
* [https://stackoverflow.com/a/6057297 ave(, FUN) for any function instead of average]
* [https://dplyr.tidyverse.org/reference/across.html ?across]
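As a sketch of summarising several columns without naming each one, across() can be combined with where(is.numeric). The iris data and the "mean_" prefix are only illustrative choices.

<syntaxhighlight lang='r'>
library(dplyr)

iris_means <- iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric),              # every numeric column
           ~ mean(.x, na.rm = TRUE),       # function applied to each column
           .names = "mean_{.col}")         # naming pattern for the output columns
  )

names(iris_means)
# [1] "Species"           "mean_Sepal.Length" "mean_Sepal.Width"
# [4] "mean_Petal.Length" "mean_Petal.Width"
</syntaxhighlight>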


== mutate vs tapply ==
=== group_by() + nest(), mutate(, map()), unnest(), list-columns ===
[https://matloff.wordpress.com/2022/08/06/base-r-is-alive-and-well/ Base-R is alive and well]
[https://www.rdocumentation.org/packages/tidyr/versions/1.3.0/topics/nest nest(data=)] from the tidyr package creates nested data frames, where '''one column contains another data frame or list'''. This is useful for running an analysis or drawing a plot on each group separately. '''PS:''' group_by() is not required; nest() can be given the columns to nest directly, e.g. nest(data = -marital) or nest(.by = marital).
:<syntaxhighlight lang='rsplus'>
histogram <- gss_cat |>
  nest(data = -marital) |>  # OR nest(.by = marital). 6x2 tibble. Col1=marital, col2=data.
  mutate(
    histogram = pmap(
      .l = list(marital, data),
      .f = \(marital, data) {
        ggplot(data, aes(x = tvhours)) +
          geom_histogram(binwidth = 1) +
          labs(
            title = marital
          )
      }
    )
  )
histogram$histogram[[1]]
</syntaxhighlight>
 
[https://r4ds.had.co.nz/many-models.html Many models]  from R for Data Science
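The same nest() + map() pattern works for fitting one model per group. This is a minimal sketch on mtcars, assuming a simple lm(mpg ~ wt) per cylinder group; the column names fit and slope are made up for illustration.

<syntaxhighlight lang='r'>
library(dplyr)
library(tidyr)
library(purrr)

models <- mtcars %>%
  nest(data = -cyl) %>%                            # one row per cyl, rest goes into 'data'
  mutate(
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),  # list-column of fitted models
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])       # one number per model
  ) %>%
  select(cyl, slope)

nrow(models)   # 3, one row per cylinder group
</syntaxhighlight>

map() always returns a list, so it is the right choice for model objects or plots; map_dbl() is used for the slope because each model contributes exactly one number.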


== mutate + replace() or ifelse() ==
<ul>
<ul>
<li>mutate() is similar to [https://stat.ethz.ch/R-manual/R-devel/library/base/html/with.html base::within()] </li>
<li>[https://tidyr.tidyverse.org/reference/nest.html ?unnest],  vignette('rectangle'),  vignette('nest') & vignette('pivot')
<li>[https://stackoverflow.com/a/28013895 Change value of variable with dplyr]
<syntaxhighlight lang='rsplus'>
<pre>
tibble(x = 1:2, y = list(1:4, 2:3)) %>% unnest(y) %>% group_by(x) %>% nest()
mtcars %>%
# returns to tibble(x = 1:2, y = list(1:4, 2:3)) with 'groups' information
    mutate(mpg=replace(mpg, cyl==4, NA)) %>%
</syntaxhighlight>
    as.data.frame()
# VS
mtcars$mpg[mtcars$cyl == 4] <- NA
</pre>
</li>
</li>
<li>[https://stackoverflow.com/a/35610521 using ifelse()] </li>
<li>[https://stackoverflow.com/a/38021139 annotate boxplot in ggplot2] </li>
<li>[https://stackoverflow.com/a/61602568 using case_when()] </li>
<li>[https://towardsdatascience.com/coding-in-r-nest-and-map-your-way-to-efficient-code-4e44ba58ee4a Coding in R: Nest and map your way to efficient code]  
<li>[https://dplyr.tidyverse.org/reference/mutate_all.html Mutate multiple columns] </li>
<li>[https://www.bioinfoblog.com/entry/tidydata/advancedmutate Apply the mutate function to multiple columns at once | mutate_at / mutate_all / mutate_if]  
<pre>
<pre>
mutate_at(data, .vars = vars(starts_with("Petal")), .funs = ~ . * 2) %>% head()
      group_by() + nest()   mutate(, map())   unnest()
mutate_at(data, .vars = vars(starts_with("Petal")), `*`, 2) %>% head()
data  -------------------->  --------------->  ------->
</pre>
</pre>
</li>
<syntaxhighlight lang='rsplus'>
<li>[https://dplyr.tidyverse.org/reference/recode.html recode()]
install.packages('gapminder'); library(gapminder)
<pre>
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple", b = "Banana", .default = NA_character_)
</pre>
</li>
</ul>


== Hash table ==
gapminder_nest <- gapminder %>%
<ul>
  group_by(country) %>%
<li>[https://stackoverflow.com/a/7659297 Create new column based on 4 values in another column]. The trick is to create a named vector; like a [https://www.geeksforgeeks.org/python-dictionary/# Dictionary in Python].
  nest()  # country, data
          # each row of 'data' is a tibble


Here is my example:
gapminder_nest$data[[1]]  # tibble 57 x 8
<syntaxhighlight lang='rsplus'>
hashtable <- data.frame(value=1:4, key=c("B", "C", "A", "D"))
input <- c("A", "B", "C", "D", "B", "B", "A", "A") # input to be matched with keys,
                                                  # this could be very long
# Trick: convert the hash table into a named vector
htb <- hashtable$value; names(htb) <- hashtable$key


# return the values according to the names
gapminder_nest <- gapminder_nest %>%
out <- htb[input]; out
          mutate(pop_mean = map(.x = data, ~mean(.x$pop, na.rm = TRUE)))
A B C D B B A A
                                    # country, data, pop_mean
3 1 2 4 1 1 3 3
</syntaxhighlight>
The same lookup can be implemented in Python with a variable of [https://www.w3schools.com/python/python_dictionaries.asp dictionary type/structure].
<syntaxhighlight lang='python'>
hashtable = {'B': 1, 'C': 2, 'A': 3, 'D': 4}
input = ['A', 'B', 'C', 'D', 'B', 'B', 'A', 'A']
out = [hashtable[key] for key in input]
</syntaxhighlight>
Or using C
<syntaxhighlight lang='c'>
#include <stdio.h>


int main() {
gapminder_nest %>% unnest(pop_mean) # country, data, pop_mean
    int hashtable[4] = {3, 1, 2, 4};
    char input[] = {'A', 'B', 'C', 'D', 'B', 'B', 'A', 'A'};
    int out[sizeof(input)/sizeof(input[0])];


    for (int i = 0; i < sizeof(input)/sizeof(input[0]); i++) {
gapminder_plot <- gapminder_nest %>%
        out[i] = hashtable[input[i] - 'A'];
  unnest(pop_mean) %>%
    }
  select(country, pop_mean) %>%
 
  ungroup() %>%
    for (int i = 0; i < sizeof(out)/sizeof(out[0]); i++) {
  top_n(pop_mean, n = -10) %>%
        printf("%d ", out[i]);
  mutate(pop_mean = pop_mean/10^3)
    }
gapminder_plot %>%
    printf("\n");
  ggplot(aes(x = reorder(country, pop_mean), y = pop_mean)) +
 
  geom_point(colour = "#FF6699", size = 5) +
    return 0;
  geom_segment(aes(xend = country, yend = 0), colour = "#FF6699") +
}
  geom_text(aes(label = round(pop_mean, 0)), hjust = -1) +
  theme_minimal() +
  labs(title = "Countries with smallest mean population from 1960 to 2016",
      subtitle = "(thousands)",
      x = "",
      y = "") +
  theme(legend.position = "none",
        axis.text.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major.y = element_blank()) +
  coord_flip() +
  scale_y_continuous()
</syntaxhighlight>
</syntaxhighlight>
<li>[https://cran.r-project.org/web/packages/hash/index.html hash] package
</li>
<li>[https://cran.r-project.org/web/packages/digest/ digest] package
<li>[https://www.tidymodels.org/learn/statistics/tidy-analysis/ Tidy analysis] from tidymodels </li>
<li>[https://community.rstudio.com/t/is-nest-mutate-map-unnest-really-the-best-alternative-to-dplyr-do/11009 Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()] </li>
</ul>
</ul>
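The named-vector lookup from the hash table example can also be used inside mutate() to add a column to a data frame. This is a sketch under the same key/value mapping; the key "E" is added here only to show that an unmatched key gives NA.

<syntaxhighlight lang='r'>
library(dplyr)

lookup <- c(B = 1, C = 2, A = 3, D = 4)   # names are the keys, elements are the values

df <- data.frame(key = c("A", "B", "C", "D", "B", "E"),
                 stringsAsFactors = FALSE)

out <- df %>%
  mutate(value = unname(lookup[key]))     # index by name; unknown keys become NA

out$value
# [1]  3  1  2  4  1 NA
</syntaxhighlight>

unname() drops the names carried over from the lookup vector so that the new column is a plain numeric vector.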


== inner_join, left_join, full_join ==
== across() ==
* [https://dplyr.tidyverse.org/reference/mutate-joins.html Mutating joins]
<ul>
* [https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti Join Data Frames with the R dplyr Package (9 Examples)]
<li>[https://dplyr.tidyverse.org/reference/across.html ?across]. Applying a function or operation to multiple columns in a data frame simultaneously.
* [https://www.datasciencemadesimple.com/join-in-r-merge-in-r/ Join in r: how to join (merge) data frames (inner, outer, left, right) in R]
<pre>
* [https://www.guru99.com/r-dplyr-tutorial.html Dplyr Tutorial: Merge and Join Data in R with Examples]
across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
gdf <-
  tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23) %>%
  group_by(g)
gdf %>% mutate(across(v1:v2, ~ .x + rnorm(1)))
#>      g    v1    v2
#>  <dbl> <dbl> <dbl>
#> 1    1  10.3  20.7
#> 2    1  11.3  21.7
#> 3    2  11.2  22.6
#> 4    3  13.5  22.7
</pre>
<li>[https://www.infoworld.com/article/3537612/dplyr-across-first-look-at-a-new-tidyverse-function.html dplyr across: First look at a new Tidyverse function].
* [https://dplyr.tidyverse.org/reference/across.html Apply a function (or functions) across multiple columns]. across(), if_any(), if_all().
* [https://tidyselect.r-lib.org/reference/starts_with.html Select variables that match a pattern]. starts_with(), ends_with(), contains(), matches(), num_range().
* [https://twitter.com/romain_francois/status/1350078666554933249/photo/2 data %>% group_by(Var1) %>% summarise(across(contains("SomeKey"), mean, na.rm = TRUE))]
<syntaxhighlight lang='rsplus'>
ny <- filter(cases, State == "NY") %>%
  select(County = `County Name`, starts_with(c("3", "4")))


== plyr::rbind.fill() ==
daily_totals <- ny %>%
* https://www.rdocumentation.org/packages/plyr/versions/1.8.6/topics/rbind.fill
  summarize(
* [https://f1000research.com/articles/5-1542 An example usage]
    across(starts_with("4"), sum)
  )


== Videos ==
median_and_max <- list(
* [https://youtu.be/jWjqLW-u3hc Hands-on dplyr tutorial for faster data manipulation in R] by Data School. At time 17:00, it compares the '''%>%''' operator, '''with()''' and '''aggregate()''' for finding group mean.
  med = ~median(.x, na.rm = TRUE),
* https://youtu.be/aywFompr1F4 (shorter video) by Roger Peng
  max = ~max(.x, na.rm = TRUE)
* https://youtu.be/8SGif63VW6E by Hadley Wickham
)
* [https://www.rstudio.com/resources/videos/tidy-eval-programming-with-dplyr-tidyr-and-ggplot2/ Tidy eval: Programming with dplyr, tidyr, and ggplot2]. Bang bang "!!" operator was introduced for use in a function call.
* JULIA SILGE
** [https://juliasilge.com/blog/tuition-resampling/ Preprocessing and resampling using #tidytuesday college data]
** [https://juliasilge.com/blog/beer-production/ Bootstrap resampling with #tidytuesday beer production data]
* [https://www.infoworld.com/article/3411819/do-more-with-r-video-tutorials.html “Do More with R” video tutorials] by Sharon Machlis
* [https://www.lynda.com/R-tutorials/Learning-R-Tidyverse/586672-2.html Learning the R Tidyverse] from lynda.com


== dbplyr ==
april_median_and_max <- ny %>%
* https://dbplyr.tidyverse.org/articles/dbplyr.html
  summarize(
* [https://dbplyr.tidyverse.org/reference/translate_sql.html translate_sql()] Translate an R expression to sql. [https://twitter.com/rfunctionaday/status/1452127344093708295 Some examples].
    across(starts_with("4"), median_and_max)
  )
</syntaxhighlight>
<syntaxhighlight lang='rsplus'>
# across(.cols = everything(), .fns = NULL, ..., .names = NULL)


= stringr =
# Rounding the columns Sepal.Length and Sepal.Width
* stringr is part of the tidyverse but is not a core package. You need to load it separately.
iris %>%
* [http://gastonsanchez.com/blog/resources/how-to/2013/09/22/Handling-and-Processing-Strings-in-R.html Handling Strings with R] (ebook) by Gaston Sanchez.
  as_tibble() %>%
* https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
  mutate(across(c(Sepal.Length, Sepal.Width), round))
* [https://github.com/rstudio/cheatsheets/raw/master/strings.pdf stringr Cheat sheet] (2 pages, this will immediately download the pdf file)
** Detect Matches: '''str_detect()''', str_which(), str_count(), str_locate()
** Subset: '''str_sub()''', str_subset(), str_extract(), str_match()
** Manage Lengths: str_length(), str_pad(), str_trunc(), '''str_trim()'''
** Mutate Strings: '''str_sub()''', '''str_replace()''', str_replace_all(), '''str_remove()'''
*** Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
** Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
* [https://csgillespie.github.io/efficientR/data-carpentry.html#regular-expressions Efficient data carpentry &#8594; Regular expressions] from Efficient R programming book by Gillespie & Lovelace.


== str_replace() ==
iris %>% summarise(across(contains("Sepal"), ~mean(.x, na.rm = TRUE)))
[https://datasciencetut.com/how-to-replace-string-in-column-in-r/ Replace a string in a column]: [https://dplyr.tidyverse.org/reference/across.html dplyr::across()] & str_replace()
<pre>
df <- data.frame(country=c('India', 'USA', 'CHINA', 'Algeria'),
                position=c('1', '1', '2', '3'),
                points=c(22, 25, 29, 13))


df %>%
# filter rows
  mutate(across('country', str_replace, 'India', 'Albania'))
iris %>% filter(if_any(ends_with("Width"), ~ . > 4))


df %>%
iris %>% select(starts_with("Sepal"))
  mutate(across('country', str_replace, 'A|I', ''))
</pre>


== split ==
iris %>% select(starts_with(c("Petal", "Sepal")))
[https://statisticsglobe.com/split-data-frame-variable-into-multiple-columns-in-r Split Data Frame Variable into Multiple Columns in R (3 Examples)]


Three ways:
iris %>% select(contains("Sepal"))
* base::strsplit(x, CHAR)
</syntaxhighlight>
* [https://stringr.tidyverse.org/reference/str_split.html stringr::str_split_fixed(x, CHAR, 2)]
</ul>
* [https://tidyr.tidyverse.org/reference/separate.html tidyr::separate(x, c("NewVar1", "NewVar2"), CHAR)]
<pre>
x <- c("a-1", "b-2", "c-3")


stringr::str_split_fixed(x, "-", 2)
== ave() - Adding a column of means by group to original data ==
#      [,1] [,2]
* [https://stackoverflow.com/a/7976250 Adding a column of means by group to original data],  
# [1,] "a"  "1"
* [https://stackoverflow.com/a/6057297 ave(, FUN) for any function instead of average]
# [2,] "b"  "2"
# [3,] "c"  "3"


tidyr::separate(data.frame(x), x, c('x1', 'x2'), "-")
== mutate vs tapply ==
  # The first argument must be a data frame
[https://matloff.wordpress.com/2022/08/06/base-r-is-alive-and-well/ Base-R is alive and well]
  # The 2nd argument is the column names
#  x1 x2
# 1  a  1
# 2  b  2
# 3  c  3
</pre>
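The first of the three ways, base::strsplit(), is not shown above. A minimal sketch with the same input follows; strsplit() returns a list, so the pieces are bound into a matrix with do.call(rbind, ...), which assumes every element splits into the same number of pieces.

<syntaxhighlight lang='r'>
x <- c("a-1", "b-2", "c-3")

parts <- strsplit(x, "-", fixed = TRUE)  # list of character vectors, one per element
m <- do.call(rbind, parts)               # 3 x 2 character matrix

m[, 1]
# [1] "a" "b" "c"
m[, 2]
# [1] "1" "2" "3"
</syntaxhighlight>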


= [https://github.com/smbache/magrittr magrittr]: pipe =
== mutate + replace() or ifelse() ==
* [https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html Vignettes]
<ul>
** [https://www.tidyverse.org/blog/2020/08/magrittr-2-0/?s=09 magrittr 2.0 is coming soon], [https://www.tidyverse.org/blog/2020/11/magrittr-2-0-is-here/?s=09 magrittr 2.0 is here!]
<li>mutate() is similar to [https://stat.ethz.ch/R-manual/R-devel/library/base/html/with.html base::within()] </li>
* [https://thomasadventure.blog/posts/how-does-the-pipe-operator-actually-work/ How does the pipe operator actually work?]
<li>[https://stackoverflow.com/a/28013895 Change value of variable with dplyr]
* [http://www.win-vector.com/blog/2018/04/magrittr-and-wrapr-pipes-in-r-an-examination/ magrittr and wrapr Pipes in R, an Examination]. The pipe operator '''%>%''' replaces nested function calls, so the code is easier to read.
<pre>
: <syntaxhighlight lang='rsplus'>
mtcars %>%
x %>% f    # f(x)
    mutate(mpg=replace(mpg, cyl==4, NA)) %>%
x %>% f(y) # f(x, y)
    as.data.frame()
x %>% f(arg=y) # f(x, arg=y)
# VS
x %>% f(z, .) # f(z, x)
mtcars$mpg[mtcars$cyl == 4] <- NA
x %>% f(y) %>% g(z) #  g(f(x, y), z)
</pre>
</li>
<li>[https://stackoverflow.com/a/35610521 using ifelse()] </li>
<li>[https://stackoverflow.com/a/61602568 using case_when()] </li>
<li>[https://dplyr.tidyverse.org/reference/mutate_all.html Mutate multiple columns] </li>
<li>[https://www.bioinfoblog.com/entry/tidydata/advancedmutate Apply the mutate function to multiple columns at once | mutate_at / mutate_all / mutate_if]  
<pre>
mutate_at(data, .vars = vars(starts_with("Petal")), .funs = ~ . * 2) %>% head()
mutate_at(data, .vars = vars(starts_with("Petal")), `*`, 2) %>% head()
</pre>
</li>
<li>[https://dplyr.tidyverse.org/reference/recode.html recode()]
<pre>
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple", b = "Banana", .default = NA_character_)
</pre>
</li>
</ul>


x %>% select(which(colSums(!is.na(.))>0))  # remove columns with all missing data
== Hash table ==
x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
<ul>
<li>[https://stackoverflow.com/a/7659297 Create new column based on 4 values in another column]. The trick is to create a named vector; like a [https://www.geeksforgeeks.org/python-dictionary/# Dictionary in Python].
 
Here is my example:
<syntaxhighlight lang='rsplus'>
hashtable <- data.frame(value=1:4, key=c("B", "C", "A", "D"))
input <- c("A", "B", "C", "D", "B", "B", "A", "A") # input to be matched with keys,
                                                  # this could be very long
# Trick: convert the hash table into a named vector
htb <- hashtable$value; names(htb) <- hashtable$key
 
# return the values according to the names
out <- htb[input]; out
A B C D B B A A
3 1 2 4 1 1 3 3
</syntaxhighlight>
</syntaxhighlight>
* [http://www.win-vector.com/blog/2018/03/r-tip-make-arguments-explicit-in-magrittr-dplyr-pipelines/ Make Arguments Explicit in magrittr/dplyr Pipelines]
The same lookup can be implemented in Python with a variable of [https://www.w3schools.com/python/python_dictionaries.asp dictionary type/structure].
: <syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='python'>
suppressPackageStartupMessages(library("dplyr"))
hashtable = {'B': 1, 'C': 2, 'A': 3, 'D': 4}
starwars %>%
input = ['A', 'B', 'C', 'D', 'B', 'B', 'A', 'A']
  filter(., height > 200) %>%
out = [hashtable[key] for key in input]
  select(., height, mass) %>%
  head(.)
# instead of
starwars %>%
  filter(height > 200) %>%
  select(height, mass) %>%
  head
</syntaxhighlight>
</syntaxhighlight>
* [https://stackoverflow.com/questions/27100678/how-to-extract-subset-an-element-from-a-list-with-the-magrittr-pipe Subset an element from a list]
Or using C
: <syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='c'>
iris$Species
#include <stdio.h>
iris[["Species"]]


iris %>%
int main() {
`[[`("Species")
    int hashtable[4] = {3, 1, 2, 4};
    char input[] = {'A', 'B', 'C', 'D', 'B', 'B', 'A', 'A'};
    int out[sizeof(input)/sizeof(input[0])];


iris %>%
    for (int i = 0; i < sizeof(input)/sizeof(input[0]); i++) {
`[[`(5)
        out[i] = hashtable[input[i] - 'A'];
    }


iris %>%
    for (int i = 0; i < sizeof(out)/sizeof(out[0]); i++) {
  subset(select = "Species")
        printf("%d ", out[i]);
    }
    printf("\n");
 
    return 0;
}
</syntaxhighlight>
</syntaxhighlight>
* '''Split-apply-combine''': group + summarize + sort/arrange + top n. The following example is from [https://csgillespie.github.io/efficientR/data-carpentry.html#data-aggregation Efficient R programming].
<li>[https://cran.r-project.org/web/packages/hash/index.html hash] package
: <syntaxhighlight lang='rsplus'>
<li>[https://cran.r-project.org/web/packages/digest/ digest] package
data(wb_ineq, package = "efficient")
</ul>
wb_ineq %>%
 
  filter(grepl("g", Country)) %>%
== inner_join, left_join, full_join ==
  group_by(Year) %>%
* [https://dplyr.tidyverse.org/reference/mutate-joins.html Mutating joins]
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
* [https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti Join Data Frames with the R dplyr Package (9 Examples)]
  arrange(desc(gini)) %>%
* [https://www.datasciencemadesimple.com/join-in-r-merge-in-r/ Join in r: how to join (merge) data frames (inner, outer, left, right) in R]
  top_n(n = 5)
* [https://www.guru99.com/r-dplyr-tutorial.html Dplyr Tutorial: Merge and Join Data in R with Examples]
</syntaxhighlight>
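For the join functions, the difference between inner_join(), left_join() and full_join() is easiest to see on two tiny tables. The data frames below are hypothetical and only meant to show which rows each join keeps.

<syntaxhighlight lang='r'>
library(dplyr)

# Hypothetical tables sharing the key column 'name'
band <- data.frame(name = c("Mick", "John", "Paul"),
                   band = c("Stones", "Beatles", "Beatles"))
inst <- data.frame(name  = c("John", "Paul", "Keith"),
                   plays = c("guitar", "bass", "guitar"))

inner <- inner_join(band, inst, by = "name")  # only names found in both tables
left  <- left_join(band, inst, by = "name")   # all rows of 'band'; Mick gets NA for 'plays'
full  <- full_join(band, inst, by = "name")   # all names from either table

c(nrow(inner), nrow(left), nrow(full))
# [1] 2 3 4
</syntaxhighlight>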
 
* [https://drdoane.com/writing-pipe-friendly-functions/ Writing Pipe-friendly Functions]
== plyr::rbind.fill() ==
* http://rud.is/b/2015/02/04/a-step-to-the-right-in-r-assignments/
* https://www.rdocumentation.org/packages/plyr/versions/1.8.6/topics/rbind.fill
* http://rpubs.com/tjmahr/pipelines_2015
* [https://f1000research.com/articles/5-1542 An example usage]
* http://danielmarcelino.com/i-loved-this-crosstable/
* http://moderndata.plot.ly/using-the-pipe-operator-in-r-with-plotly/
* RMSE
: <syntaxhighlight lang='rsplus'>
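# Root mean squared error between predictions x and observations y
f <- function(x, y) {
  (y - x) %>%
    '^'(2) %>%          # squared errors
    sum %>%
    '/'(length(x)) %>%  # mean of the squared errors
    sqrt %>%
    round(2)
}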
</syntaxhighlight>
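To check a piped RMSE like this one, it can be compared with the usual nested form. This is a sketch with made-up vectors; rmse_pipe() and rmse_base() are hypothetical names, and both take the observations as an explicit second argument.

<syntaxhighlight lang='r'>
library(magrittr)

rmse_pipe <- function(x, y) {
  (y - x) %>% `^`(2) %>% sum() %>% `/`(length(x)) %>% sqrt() %>% round(2)
}
rmse_base <- function(x, y) round(sqrt(mean((y - x)^2)), 2)

pred <- c(1, 2, 3, 4)   # hypothetical predictions
obs  <- c(1, 2, 3, 8)   # hypothetical observations

rmse_pipe(pred, obs)
# [1] 2
identical(rmse_pipe(pred, obs), rmse_base(pred, obs))
# [1] TRUE
</syntaxhighlight>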
* [https://nathaneastwood.github.io/2020/02/01/get-and-set-list-elements-with-magrittr/ Get and Set List Elements with magrittr]
* Videos
** [https://www.rstudio.com/resources/videos/writing-readable-code-with-pipes/ Writing Readable Code with Pipes]
** [https://youtu.be/iIBTI_qiq9g Pipes in R - An Introduction to magrittr package]
: <syntaxhighlight lang='rsplus'>
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat)


pryr::object_size(diamonds)
== Videos ==
pryr::object_size(diamonds2)
* [https://education.rstudio.com/trainers/ RStudio Instructor Training and Certification]
pryr::object_size(diamonds, diamonds2)
* [https://youtu.be/jWjqLW-u3hc Hands-on dplyr tutorial for faster data manipulation in R] by Data School. At time 17:00, it compares the '''%>%''' operator, '''with()''' and '''aggregate()''' for finding group mean.
* https://youtu.be/aywFompr1F4 (shorter video) by Roger Peng
* https://youtu.be/8SGif63VW6E by Hadley Wickham
* [https://www.rstudio.com/resources/videos/tidy-eval-programming-with-dplyr-tidyr-and-ggplot2/ Tidy eval: Programming with dplyr, tidyr, and ggplot2]. Bang bang "!!" operator was introduced for use in a function call.
* JULIA SILGE
** [https://juliasilge.com/blog/tuition-resampling/ Preprocessing and resampling using #tidytuesday college data]
** [https://juliasilge.com/blog/beer-production/ Bootstrap resampling with #tidytuesday beer production data]
* [https://www.infoworld.com/article/3411819/do-more-with-r-video-tutorials.html “Do More with R” video tutorials] by Sharon Machlis
* [https://www.lynda.com/R-tutorials/Learning-R-Tidyverse/586672-2.html Learning the R Tidyverse] from lynda.com
* [https://www.youtube.com/watch?v=AuQOy06Dlr8 What's new in the tidyverse?] by Professor Mine Çetinkaya-Rundel


rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str()
== dbplyr ==
rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe
* https://dbplyr.tidyverse.org/articles/dbplyr.html
    # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2))
* [https://dbplyr.tidyverse.org/reference/translate_sql.html translate_sql()] Translate an R expression to sql. [https://twitter.com/rfunctionaday/status/1452127344093708295 Some examples].
    # instead of the righthand side.


# If a function does not have a data frame based api, you can use %$%.
= stringr =
# It explodes out the variables in a data frame.
<ul>
mtcars %$% cor(disp, mpg)  
<li>stringr is part of the tidyverse but is not a core package. You need to load it separately.
<li>[http://gastonsanchez.com/blog/resources/how-to/2013/09/22/Handling-and-Processing-Strings-in-R.html Handling Strings with R] (ebook) by Gaston Sanchez.
<li>https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
<li>[https://github.com/rstudio/cheatsheets/raw/master/strings.pdf stringr Cheat sheet] (2 pages, this will immediately download the pdf file)
* Detect Matches: '''str_detect()''', str_which(), str_count(), str_locate()
* Subset: '''str_sub()''', str_subset(), str_extract(), str_match()
* Manage Lengths: str_length(), str_pad(), str_trunc(), '''str_trim()'''
* Mutate Strings: '''str_sub()''', '''str_replace()''', str_replace_all(), '''str_remove()'''
** Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
* Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_date()
<li>[https://csgillespie.github.io/efficientR/data-carpentry.html#regular-expressions Efficient data carpentry &#8594; Regular expressions] from Efficient R programming book by Gillespie & Lovelace.
<li>Common functions:
{| class="wikitable"
|-
! `stringr` Function !! Description !! Base R Equivalent
|-
| `str_length()` || Returns the number of characters in each element of a character vector. || `nchar()`
|-
| `str_sub()` || Extracts substrings from a character vector. || `substr()`
|-
| `str_trim()` || Removes leading and trailing whitespace from strings. || `trimws()`
|-
| `str_split()` || Splits a string into pieces based on a delimiter. || `strsplit()`
|-
| `str_replace()` || Replaces the first occurrence of a pattern in each string with another string (`str_replace_all()` replaces every occurrence). || `sub()` (`gsub()` for `str_replace_all()`)
|-
| `str_detect()` || Detects whether a pattern is present in each element of a character vector. || `grepl()`
|-
| `str_subset()` || Returns the elements of a character vector that contain a pattern. || `grep(value = TRUE)`
|-
| `str_count()` || Counts the number of occurrences of a pattern in each element of a character vector. || `gregexpr()` and `lengths()`
|}
</ul>
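The pairings in the table above can be checked directly. The following is a minimal sketch, assuming the stringr package is installed; the sample vector '''fruits''' is made up for illustration.
<syntaxhighlight lang='rsplus'>
library(stringr)

fruits <- c("  apple", "banana ", "cherry")   # made-up sample data

# Each stringr call is followed by the base R call that gives the same result here
trimmed <- str_trim(fruits)        # trimws(fruits)
str_length(trimmed)                # nchar(trimmed)
str_sub(trimmed, 1, 3)             # substr(trimmed, 1, 3)
str_detect(trimmed, "an")          # grepl("an", trimmed)
str_subset(trimmed, "an")          # grep("an", trimmed, value = TRUE)

# str_replace() changes only the first match, like sub();
# str_replace_all() changes every match, like gsub()
first_only <- str_replace("banana", "a", "A")      # "bAnana"
every_one  <- str_replace_all("banana", "a", "A")  # "bAnAnA"
</syntaxhighlight>
Note that the argument order differs: stringr functions take the string first and the pattern second, whereas grepl(), sub() and gsub() take the pattern first.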


# For assignment, magrittr provides the %<>% operator
== str_replace() ==
mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by
[https://datasciencetut.com/how-to-replace-string-in-column-in-r/ Replace a string in a column]: [https://dplyr.tidyverse.org/reference/across.html dplyr::across()] & str_replace()
mtcars %<>% transform(cyl = cyl * 2)
<pre>
</syntaxhighlight>
df <- data.frame(country=c('India', 'USA', 'CHINA', 'Algeria'),
* [https://data-and-the-world.onrender.com/posts/magrittr-pipes The Four Pipes of magrittr] and lambda functions.
                position=c('1', '1', '2', '3'),
                points=c(22, 25, 29, 13))


Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
df %>%
  mutate(across('country', ~ str_replace(.x, 'India', 'Albania')))  # lambda form; extra arguments via ... are deprecated since dplyr 1.1.0


When not to use the pipe
df %>%
* your pipes are longer than (say) 10 steps
  mutate(across('country', ~ str_replace(.x, 'A|I', '')))  # removes only the first match in each string
* you have multiple inputs or outputs
</pre>
* Functions that use the current environment: assign(), get(), load()
 
* Functions that use lazy evaluation: tryCatch(), try()
== split ==
[https://statisticsglobe.com/split-data-frame-variable-into-multiple-columns-in-r Split Data Frame Variable into Multiple Columns in R (3 Examples)]


== Dollar sign .$ ==
Three ways:
<ul>
* base::strsplit(x, CHAR)
<li>[http://thatdatatho.com/2019/03/13/tutorial-about-magrittrs-pipe-operator-and-placeholders/ A Short Tutorial about Magrittr’s Pipe Operator and Placeholders], [https://uc-r.github.io/pipe Simplify Your Code with %>%]
* [https://stringr.tidyverse.org/reference/str_split.html stringr::str_split_fixed(x, CHAR, 2)]
{{Pre}}
* [https://tidyr.tidyverse.org/reference/separate.html tidyr::separate(x, c("NewVar1", "NewVar2"), CHAR)]
gapminder %>% dplyr::filter(continent == "Asia") %>%
  {stats::cor(.$lifeExp, .$gdpPercap)}
gapminder %>% dplyr::filter(continent == "Asia") %$%
  {stats::cor(lifeExp, gdpPercap)}
gapminder %>%
  dplyr::mutate(continent = ifelse(.$continent == "Americas", "Western Hemisphere", .$continent))
</pre>
</li>
<li>Another example [https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html Introduction to the msigdbr package]
<pre>
<pre>
m_list  = m_df %>% split(x = .$gene_symbol, f = .$gs_name)
x <- c("a-1", "b-2", "c-3")
m_list2 = m_df %$% split(x = gene_symbol, f = gs_name)
all.equal(m_list, m_list2)
# [1] TRUE
</pre>
</li>
<li>[https://stackoverflow.com/a/48130912 Use $ dollar sign at end of of an R magrittr pipeline to return a vector]
<pre>
DF %>% filter(y > 0) %>% .$y
</pre>
</li>
</ul>


== %$% ==
stringr::str_split_fixed(x, "-", 2)
Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.
#      [,1] [,2]
<pre>
# [1,] "a"  "1"
lhs %$% rhs
# [2,] "b"  "2"
# lhs: A list, environment, or a data.frame.
# [3,] "c"  "3"
# rhs: An expression where the names in lhs are available.


iris %>%
tidyr::separate(data.frame(x), x, c('x1', 'x2'), "-")
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  # The first argument must be a data frame
   cor(Sepal.Length, Sepal.Width)
  # The 2nd argument is the column names
#   x1 x2
# 1  a  1
# 2  b  2
# 3  c  3
</pre>
</pre>


== set_rownames() and set_colnames() ==
= [https://github.com/smbache/magrittr magrittr]: pipe =
https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract
* [https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html Vignettes]
<pre>
** [https://www.tidyverse.org/blog/2020/08/magrittr-2-0/?s=09 magrittr 2.0 is coming soon], [https://www.tidyverse.org/blog/2020/11/magrittr-2-0-is-here/?s=09 magrittr 2.0 is here!]
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5])
* [https://thomasadventure.blog/posts/how-does-the-pipe-operator-actually-work/ How does the pipe operator actually work?]
* [http://www.win-vector.com/blog/2018/04/magrittr-and-wrapr-pipes-in-r-an-examination/ magrittr and wrapr Pipes in R, an Examination]. The pipe operator '''%>%''' replaces nested function calls, so the code reads from left to right and is easier to follow.
: <syntaxhighlight lang='rsplus'>
x %>% f    # f(x)
x %>% f(y)  # f(x, y)
x %>% f(arg=y)  # f(x, arg=y)
x %>% f(z, .) # f(z, x)
x %>% f(y) %>% g(z)  #  g(f(x, y), z)


cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])
x %>% select(which(colSums(!is.na(.))>0))  # remove columns with all missing data
</pre>
x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
 
</syntaxhighlight>
== dtrackr ==
* [http://www.win-vector.com/blog/2018/03/r-tip-make-arguments-explicit-in-magrittr-dplyr-pipelines/ Make Arguments Explicit in magrittr/dplyr Pipelines]
[https://terminological.github.io/dtrackr/ dtrackr]: Track your Data Pipelines
: <syntaxhighlight lang='rsplus'>
suppressPackageStartupMessages(library("dplyr"))
starwars %>%
  filter(., height > 200) %>%
  select(., height, mass) %>%
  head(.)
# instead of
starwars %>%
  filter(height > 200) %>%
  select(height, mass) %>%
  head
</syntaxhighlight>
* [https://stackoverflow.com/questions/27100678/how-to-extract-subset-an-element-from-a-list-with-the-magrittr-pipe Subset an element from a list]
: <syntaxhighlight lang='rsplus'>
iris$Species
iris[["Species"]]
 
iris %>%
`[[`("Species")


= purrr: Functional Programming Tools =
iris %>%
''While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…).'' See [http://www.rebeccabarter.com/blog/2019-08-19_purrr/ Learn to purrr].
`[[`(5)


* https://purrr.tidyverse.org/
iris %>%
* Chap 21 [https://r4ds.had.co.nz/iteration.html Iteration] from '''R for Data Science''' book
  subset(select = "Species")
* [https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf cheatsheet]
</syntaxhighlight>
* [http://colinfay.me/purrr-cookbook/ purrr cookbook]
* '''Split-apply-combine''': group + summarize + sort/arrange + top n. The following example is from [https://csgillespie.github.io/efficientR/data-carpentry.html#data-aggregation Efficient R programming].
* [https://en.wikipedia.org/wiki/Higher-order_function Higher-order function]
: <syntaxhighlight lang='rsplus'>
* [https://pythonbasics.org/decorators/ Python Decorator/metaprogramming]
data(wb_ineq, package = "efficient")
* [https://www.r-bloggers.com/2020/11/iterating-over-the-lines-of-a-data-frame-with-purrr/ Iterating over the lines of a data.frame with purrr]
wb_ineq %>%
* Functional programming (cf Object-Oriented Programming)
  filter(grepl("g", Country)) %>%
** [http://www.youtube.com/watch?v=vLmaZxegahk Functional programming for beginners]
  group_by(Year) %>%
** [https://www.makeuseof.com/tag/functional-programming-languages/ 5 Functional Programming Languages You Should Know]
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
<ul>
  arrange(desc(gini)) %>%
<li>[https://stackoverflow.com/a/56651232 What does the tilde mean in this context of R code], [https://stackoverflow.com/a/44834671 What is meaning of first tilde in purrr::map] </li>
  top_n(n = 5)
<li>[http://data.library.virginia.edu/getting-started-with-the-purrr-package-in-r/ Getting started with the purrr package in R], especially the [https://www.rdocumentation.org/packages/purrr/versions/0.2.5/topics/map map()] and '''map_df()''' functions.
</syntaxhighlight>
<syntaxhighlight lang='rsplus'>
* [https://drdoane.com/writing-pipe-friendly-functions/ Writing Pipe-friendly Functions]
library(purrr)
* http://rud.is/b/2015/02/04/a-step-to-the-right-in-r-assignments/
# map() is a replacement of lapply()
* http://rpubs.com/tjmahr/pipelines_2015
# lapply(dat, function(x) mean(x$Open))
* http://danielmarcelino.com/i-loved-this-crosstable/
map(dat, function(x) mean(x$Open))
* http://moderndata.plot.ly/using-the-pipe-operator-in-r-with-plotly/
* RMSE
: <syntaxhighlight lang='rsplus'>
f <- function(x) {
  (y - x) %>%
    '^'(2) %>%
    sum %>%
    '/'(length(x)) %>%
    sqrt %>%
    round(2)
}
</syntaxhighlight>
* [https://nathaneastwood.github.io/2020/02/01/get-and-set-list-elements-with-magrittr/ Get and Set List Elements with magrittr]
* Videos
** [https://www.rstudio.com/resources/videos/writing-readable-code-with-pipes/ Writing Readable Code with Pipes]
** [https://youtu.be/iIBTI_qiq9g Pipes in R - An Introduction to magrittr package]
: <syntaxhighlight lang='rsplus'>
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat)


# map() lets us drop the function(x) boilerplate.
pryr::object_size(diamonds)
# Using a tilde (~) in place of function and a dot (.) in place of x
pryr::object_size(diamonds2)
map(dat, ~mean(.$Open))
pryr::object_size(diamonds, diamonds2)
 
rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str()
rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe
    # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2))
    # instead of the righthand side.


# map allows you to specify the structure of your output.
# If a function does not have a data frame based api, you can use %$%.
map_dbl(dat, ~mean(.$Open))
# It explodes out the variables in a data frame.
mtcars %$% cor(disp, mpg)  


# map2() is a replacement of mapply()
# For assignment, magrittr provides the %<>% operator
# mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)
mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by
map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
mtcars %<>% transform(cyl = cyl * 2)
</syntaxhighlight>
</syntaxhighlight>
</li>
* [https://data-and-the-world.onrender.com/posts/magrittr-pipes The Four Pipes of magrittr] and lambda functions.
</ul>
 
* map_dfr() function from [https://youtu.be/bzUmK0Y07ck?t=646 "The Joy of Functional Programming (for Data Science)" with Hadley Wickham]. It can be used to replace a loop.  
Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
:<syntaxhighlight lang='rsplus'>
 
data <- map(paths, read.csv)
When not to use the pipe
data <- map_dfr(paths, read.csv, .id = "path")  # .id is the map_dfr() argument; id would be passed on to read.csv()
* your pipes are longer than (say) 10 steps
* you have multiple inputs or outputs
* Functions that use the current environment: assign(), get(), load()
* Functions that use lazy evaluation: tryCatch(), try()


out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE)
== Dollar sign .$ ==
out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
</syntaxhighlight>
* [http://staff.math.su.se/hoehle/blog/2019/01/04/mathgenius.html Purr yourself into a math genius]
* [https://martinctc.github.io/blog/vignette-write-and-read-multiple-excel-files-with-purrr/ Write & Read Multiple Excel files with purrr]
* [https://aosmith.rbind.io/2020/08/31/handling-errors/ Handling errors using purrr's possibly() and safely()]
* [https://www.business-science.io/code-tools/2020/10/08/automate-plots.html How to Automate Exploratory Analysis Plots]
* [https://www.infoworld.com/article/3601124/error-handling-in-r-with-purrrs-possibly.amp.html Easy error handling in R with purrr’s possibly]
<ul>
<ul>
<li>[http://www.rebeccabarter.com/blog/2019-08-19_purrr/ Learn to purrr]. Lots of good information, for example that the tilde-dot notation is a shorthand for anonymous functions.
<li>[http://thatdatatho.com/2019/03/13/tutorial-about-magrittrs-pipe-operator-and-placeholders/ A Short Tutorial about Magrittr’s Pipe Operator and Placeholders], [https://uc-r.github.io/pipe Simplify Your Code with %>%]
<syntaxhighlight lang='rsplus'>
{{Pre}}
function(x) {
gapminder %>% dplyr::filter(continent == "Asia") %>%
  x + 10
  {stats::cor(.$lifeExp, .$gdpPercap)}
}
gapminder %>% dplyr::filter(continent == "Asia") %$%
# is the same as
  {stats::cor(lifeExp, gdpPercap)}
~{.x + 10}
gapminder %>%
 
  dplyr::mutate(continent = ifelse(.$continent == "Americas", "Western Hemisphere", .$continent))
map_dbl(c(1, 4, 7), ~{.x + 10})
</pre>
</syntaxhighlight>
</li>
<li>Another example [https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html Introduction to the msigdbr package]
<pre>
m_list  = m_df %>% split(x = .$gene_symbol, f = .$gs_name)
m_list2 = m_df %$% split(x = gene_symbol, f = gs_name)
all.equal(m_list, m_list2)
# [1] TRUE
</pre>
</li>
</li>
<li>[https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/ A closer look at replicate() and purrr::map() for simulations]  
<li>[https://stackoverflow.com/a/48130912 Use $ dollar sign at end of of an R magrittr pipeline to return a vector]
<syntaxhighlight lang='rsplus'>
<pre>
twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
DF %>% filter(y > 0) %>% .$y
    ngroup = 2
</pre>
    group = rep( c("group1", "group2"), each = nrep)
    eps = rnorm(ngroup*nrep, 0, sigma)
    growth = b0 + b1*(group == "group2") + eps
    growthfit = lm(growth ~ group)
    growthfit
}
sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
str(sim_lm, max.level = 1)
 
map_dbl(sim_lm, ~summary(.x)$r.squared)
# Same as function(x) {} style
map_dbl(sim_lm, function(x) summary(x)$r.squared)
# Same as sapply()
sapply(sim_lm, function(x) summary(x)$r.squared)
map_dfr(sim_lm, broom::tidy, .id = "model")
</syntaxhighlight>
</li>
</li>
<li>[http://adv-r.had.co.nz/Functional-programming.html Functional programming] from Advanced R.</li>
<li>[https://dcl-prog.stanford.edu/ Functional Programming] : Sara Altman, Bill Behrman, Hadley Wickham</li>
<li>[https://www.brodrigues.co/blog/2022-05-26-safer_programs/ Some learnings from functional programming you can use to write safer programs] </li>
</ul>
</ul>


== map() and map_dbl() ==
== %$% ==
<ul>
Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.
<li>An example from https://purrr.tidyverse.org/
<pre>
<syntaxhighlight lang='rsplus'>
lhs %$% rhs
mtcars |>
# lhs: A list, environment, or a data.frame.
    split(mtcars$cyl) |>  # from base R
# rhs: An expression where the names in lhs are available.
    map(\(df) lm(mpg ~ wt, data = df)) |>
    map(summary) |> map_dbl("r.squared")
#         4        6        8
# 0.5086326 0.4645102 0.4229655
</syntaxhighlight>
<li>Solution by base R lapply() and sapply(). See the article [https://purrr.tidyverse.org/articles/base.html purrr <-> base R]
<syntaxhighlight lang='rsplus'>
mtcars |>
    split(mtcars$cyl) |>
    lapply(function(df) lm(mpg ~ wt, data = df)) |>
    lapply(summary) |>
    sapply(function(x) x$r.squared)
#         4        6        8
# 0.5086326 0.4645102 0.4229655
</syntaxhighlight>
</ul>


== .x  symbol ==
iris %>%
[https://community.rstudio.com/t/function-argument-naming-conventions-x-vs-x/7764/2 Function argument naming conventions (`.x` vs `x`)]. See [https://purrr.tidyverse.org/reference/map.html purrr::map]
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)
</pre>


== negate() ==
== set_rownames() and set_colnames() ==
[https://stackoverflow.com/a/48431135 How to select non-numeric columns using dplyr::select_if]
https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract
<syntaxhighlight lang='rsplus'>
library(tidyverse)
iris %>% select_if(negate(is.numeric))
</syntaxhighlight>
 
== pmap() ==
[https://purrr.tidyverse.org/reference/pmap.html ?pmap] - Map over multiple input simultaneously (in "parallel")
<pre>
<pre>
# Create two lists with multiple elements
data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5])
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)


# Define a function to add the elements of each list
cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])
my_func <- function(x, y) {
</pre>
  x + y
}


# Use pmap to apply the function to each element of the lists in parallel
== match() ==
result <- pmap(list(list1, list2), my_func); result
<syntaxhighlight lang='r'>
[[1]]
a <- 1:3
[1] 11
id <- letters[1:3]
set.seed(1234); id.ref <- sample(id)
id.ref # [1] "b" "c" "a"


[[2]]
a[match(id.ref, id)] # [1] 2 3 1
[1] 22
id.ref %>% match(id) %>% `[`(a, .) # Same result, but harder to read
</syntaxhighlight>


[[3]]
== dtrackr ==
[1] 33
[https://terminological.github.io/dtrackr/ dtrackr]: Track your Data Pipelines
</pre>


A more practical example: running an analysis or a visualization on each level of a group/class variable with nest() + pmap().
= purrr: Functional Programming Tools =
<syntaxhighlight lang='rsplus'>
''While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…).'' See [http://www.rebeccabarter.com/blog/2019-08-19_purrr/ Learn to purrr].
# Create a data frame
df <- mpg %>%
  filter(manufacturer %in% c("audi", "volkswagen")) %>%
  select(manufacturer, year, cty)


# Nest the data by manufacturer
* https://purrr.tidyverse.org/
df_nested <- df %>%
* Chap 21 [https://r4ds.had.co.nz/iteration.html Iteration] from '''R for Data Science''' book
  nest(data = -manufacturer)
* [https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf cheatsheet]
 
* [http://colinfay.me/purrr-cookbook/ purrr cookbook]
# Create a function that takes a data frame and creates a ggplot object
* [https://en.wikipedia.org/wiki/Higher-order_function Higher-order function]
my_plot_func <- function(data, manuf) {
* [https://pythonbasics.org/decorators/ Python Decorator/metaprogramming]
    ggplot(data, aes(x = year, y = cty)) +
* [https://www.r-bloggers.com/2020/11/iterating-over-the-lines-of-a-data-frame-with-purrr/ Iterating over the lines of a data.frame with purrr]
        geom_point() +
* Functional programming (cf Object-Oriented Programming)
        ggtitle(manuf)
** [http://www.youtube.com/watch?v=vLmaZxegahk Functional programming for beginners]
}
** [https://www.makeuseof.com/tag/functional-programming-languages/ 5 Functional Programming Languages You Should Know]
<ul>
<li>[https://stackoverflow.com/a/56651232 What does the tilde mean in this context of R code], [https://stackoverflow.com/a/44834671 What is meaning of first tilde in purrr::map] </li>
<li>[http://data.library.virginia.edu/getting-started-with-the-purrr-package-in-r/ Getting started with the purrr package in R], especially the [https://www.rdocumentation.org/packages/purrr/versions/0.2.5/topics/map map()] and '''map_df()''' functions.
<syntaxhighlight lang='rsplus'>
library(purrr)
# map() is a replacement of lapply()
# lapply(dat, function(x) mean(x$Open))
map(dat, function(x)mean(x$Open))
 
# map() lets us drop the function(x) boilerplate.
# Using a tilde (~) in place of function and a dot (.) in place of x
map(dat, ~mean(.$Open))


# Use pmap to apply the function to each element of the list-column in df_nested
# map allows you to specify the structure of your output.
df_nested_plot <- df_nested %>%
map_dbl(dat, ~mean(.$Open))
    mutate(plot = pmap(list(data, manufacturer), my_plot_func))


df_nested_plot$plot[[1]]  # display the first plot; [[1]] alone would return the manufacturer column
# map2() is a replacement of mapply()
# mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)
map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
</syntaxhighlight>
</syntaxhighlight>
Another example: fitting regressions for data in each group
</li>
<syntaxhighlight lang='rsplus'>
</ul>
library(tidyverse)
* map_dfr() function from [https://youtu.be/bzUmK0Y07ck?t=646 "The Joy of Functional Programming (for Data Science)" with Hadley Wickham]. It can be used to replace a loop.
:<syntaxhighlight lang='rsplus'>
data <- map(paths, read.csv)
data <- map_dfr(paths, read.csv, .id = "path")  # .id is the map_dfr() argument; id would be passed on to read.csv()


# create example data
out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE)
data <- tibble(
out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
  x = rnorm(100),
</syntaxhighlight>
  y = rnorm(100),
* [http://staff.math.su.se/hoehle/blog/2019/01/04/mathgenius.html Purr yourself into a math genius]
  group = sample(c("A", "B", "C"), 100, replace = TRUE)
* [https://martinctc.github.io/blog/vignette-write-and-read-multiple-excel-files-with-purrr/ Write & Read Multiple Excel files with purrr]
)
* [https://aosmith.rbind.io/2020/08/31/handling-errors/ Handling errors using purrr's possibly() and safely()]
 
* [https://www.business-science.io/code-tools/2020/10/08/automate-plots.html How to Automate Exploratory Analysis Plots]
# create a nested dataframe
* [https://www.infoworld.com/article/3601124/error-handling-in-r-with-purrrs-possibly.amp.html Easy error handling in R with purrr’s possibly]
nested_data <- data %>%
<ul>
  nest(data = -group)
<li>[http://www.rebeccabarter.com/blog/2019-08-19_purrr/ Learn to purrr]. Lots of good information, for example that the tilde-dot notation is a shorthand for anonymous functions.
 
<syntaxhighlight lang='rsplus'>
# define a function that runs linear regression on each dataset
function(x) {
lm_func <- function(data) {
   x + 10
   lm(y ~ x, data = data)
}
}
# is the same as
~{.x + 10}


# apply lm_func() to each row of the nested dataframe
map_dbl(c(1, 4, 7), ~{.x + 10})
results <- nested_data %>%
  mutate(model = pmap(list(data), lm_func))
</syntaxhighlight>
</syntaxhighlight>
</li>
<li>[https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/ A closer look at replicate() and purrr::map() for simulations]
<syntaxhighlight lang='rsplus'>
twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
    ngroup = 2
    group = rep( c("group1", "group2"), each = nrep)
    eps = rnorm(ngroup*nrep, 0, sigma)
    growth = b0 + b1*(group == "group2") + eps
    growthfit = lm(growth ~ group)
    growthfit
}
sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
str(sim_lm, max.level = 1)


== purrr vs base R ==
map_dbl(sim_lm, ~summary(.x)$r.squared)
https://purrr.tidyverse.org/dev/articles/base.html
# Same as function(x) {} style
 
map_dbl(sim_lm, function(x) summary(x)$r.squared)
= forcats =
# Same as sapply()
https://forcats.tidyverse.org/
sapply(sim_lm, function(x) summary(x)$r.squared)
 
map_dfr(sim_lm, broom::tidy, .id = "model")
[https://www.datasurg.net/2019/10/15/jama-retraction-after-miscoding-new-finalfit-function-to-check-recoding/ JAMA retraction after miscoding – new Finalfit function to check recoding]
 
= outer() =
 
= Genomic sequence =
* chartr
<syntaxhighlight lang='rsplus'>
> yourSeq <- "AAAACCCGGGTTTNNN"
> chartr("ACGT", "TGCA", yourSeq)
[1] "TTTTGGGCCCAAANNN"
</syntaxhighlight>
</syntaxhighlight>
</li>
<li>[http://adv-r.had.co.nz/Functional-programming.html Functional programming] from Advanced R.</li>
<li>[https://dcl-prog.stanford.edu/ Functional Programming] : Sara Altman, Bill Behrman, Hadley Wickham</li>
<li>[https://www.brodrigues.co/blog/2022-05-26-safer_programs/ Some learnings from functional programming you can use to write safer programs] </li>
</ul>


= broom =
== map() and map_dbl() ==
<ul>
<ul>
<li>[https://cran.r-project.org/web/packages/broom/index.html broom]: Convert Statistical Analysis Objects into Tidy Tibbles
<li>[https://www.spsanderson.com/steveondata/posts/2023-03-26/index.html Mastering the map() Function in R]
<li>Especially the tidy() function.
<li>An example from https://purrr.tidyverse.org/
{{Pre}}
<syntaxhighlight lang='rsplus'>
R> str(survfit(Surv(time, status) ~ x, data = aml))
mtcars |>  
List of 17
    split(mtcars$cyl) |>  # from base R
$ n        : int [1:2] 11 12
    map(\(df) lm(mpg ~ wt, data = df)) |>
$ time    : num [1:20] 9 13 18 23 28 31 34 45 48 161 ...
    map(summary) |> map_dbl("r.squared")
$ n.risk  : num [1:20] 11 10 8 7 6 5 4 3 2 1 ...
#        4        6        8
$ n.event  : num [1:20] 1 1 1 1 0 1 1 0 1 0 ...
# 0.5086326 0.4645102 0.4229655
...
</syntaxhighlight>
 
<li>Solution by base R lapply() and sapply(). See the article [https://purrr.tidyverse.org/articles/base.html purrr <-> base R]
R> tidy(survfit(Surv(time, status) ~ x, data = aml))
<syntaxhighlight lang='rsplus'>
# A tibble: 20 x 9
mtcars |>
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata       
    split(mtcars$cyl) |>
  <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr>         
    lapply(function(df) lm(mpg ~ wt, data = df)) |>
1    9    11      1        0  0.909    0.0953    1      0.754  x=Maintained 
    lapply(summary) |>
2    13    10      1        1  0.818    0.142      1      0.619  x=Maintained 
    sapply(function(x) x$r.squared)
...
#        4        6        8
18    33      3      1        0  0.194    0.627      0.664  0.0569 x=Nonmaintained
# 0.5086326 0.4645102 0.4229655
19    43      2      1        0  0.0972    0.945      0.620  0.0153 x=Nonmaintained
</syntaxhighlight>
20    45      1      1        0  0      Inf        NA      NA      x=Nonmaintained
</ul>
</pre>
<li>[https://www.frontiersin.org/files/Articles/746571/fonc-11-746571-HTML-r1/image_m/fonc-11-746571-t002.jpg Tables from journal papers]
<li>Multiple univariate models
<pre>
library(tidyverse)
library(broom)


mtcars %>%
== tilde ==
   select(-mpg) %>%
* The '''lambda syntax''' and tilde notation provided by purrr allow you to write concise and readable anonymous functions in R.
:<syntaxhighlight lang='rsplus'>
x <- 1:3
map_dbl(x, ~ .x^2)  # [1] 1 4 9
</syntaxhighlight>
:The notation '''~ .x^2''' is equivalent to writing '''function(.x) .x^2 ''' or '''function(z) z^2'''  or '''\(y) y^2'''
:<syntaxhighlight lang='rsplus'>
x <- list(a = 1:3, b = 4:6)
y <- list(a = 10, b = 100)
map2_dbl(x, y, ~ sum(.x * .y))
#  a    b
#  60 1500
</syntaxhighlight>
* https://dplyr.tidyverse.org/reference/funs.html
* [https://stackoverflow.com/a/14976479 Use of ~ (tilde) in R programming Language] (Hint: creating a formula object)
* [https://stackoverflow.com/a/44834671 What is meaning of first tilde in purrr::map] & the blog [https://www.itcodar.com/r/what-is-meaning-of-first-tilde-in-purrr-map.html What Is Meaning of First Tilde in Purrr::Map]
* [https://stackoverflow.com/a/68249687 Meaning of tilde and dot notation in dplyr]
* [https://www.rebeccabarter.com/blog/2019-08-19_purrr Learn to purrr] 2019
* [https://stackoverflow.com/q/58845722 dplyr piping data - difference between `.` and `.x`]
* [https://stackoverflow.com/a/62488532 Use of Tilde (~) and period (.) in R]
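
To see what the tilde shorthand expands to, the formula can be converted by hand. This is a minimal sketch, assuming purrr is installed: '''as_mapper()''' is the purrr helper that turns a formula into a function, and the names '''f''' and '''add_ten''' are made up for illustration.
<syntaxhighlight lang='rsplus'>
library(purrr)

f <- ~ .x + 10
class(f)                  # "formula": on its own, the tilde only builds a formula object

add_ten <- as_mapper(f)   # purrr converts the formula into an ordinary function
add_ten(1)                # 11

# The three spellings below give the same result
r1 <- map_dbl(c(1, 4, 7), ~ .x + 10)
r2 <- map_dbl(c(1, 4, 7), function(x) x + 10)
r3 <- map_dbl(c(1, 4, 7), \(x) x + 10)   # the \(x) shorthand needs R >= 4.1
</syntaxhighlight>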
 
== .x  symbol ==
<ul>
<li>It is used with functions like purrr::map. In the context of an '''anonymous function''', '''.x''' is a '''placeholder''' for the first argument of the function.
* For a single argument function, you can use a single dot ('''.'''). For example, ~ . + 2 is equivalent to function(.) {. + 2}.
* For a two argument function, you can use .x and .y. For example, ~ .x + .y is equivalent to function(.x, .y) {.x + .y}.
* For more arguments, you can use ..1, ..2, ..3, etc
<pre>
# Create a vector
vec <- c(1, 2, 3)
 
# Use purrr::map with an anonymous function
result <- purrr::map(vec, ~ .x * 2)
 
# Print the result
print(result)
[[1]]
[1] 2
 
[[2]]
[1] 4
 
[[3]]
[1] 6
</pre>
<li>[https://stackoverflow.com/a/56532176 dplyr piping data - difference between `.` and `.x`]
<li>[https://community.rstudio.com/t/function-argument-naming-conventions-x-vs-x/7764/2 Function argument naming conventions (`.x` vs `x`)]. See [https://purrr.tidyverse.org/reference/map.html purrr::map]
</ul>
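The three placeholder styles listed above can be compared side by side. This is a minimal sketch, assuming purrr is installed; the input vectors and result names are made up for illustration.
<syntaxhighlight lang='rsplus'>
library(purrr)

# One argument: a single dot
one_arg <- map_dbl(c(1, 2, 3), ~ . + 2)                         # 3 4 5

# Two arguments: .x and .y
two_args <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)      # 11 22 33

# Three or more arguments: ..1, ..2, ..3, ...
three_args <- pmap_dbl(list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9)),
                       ~ ..1 + ..2 * ..3)                       # 29 42 57
</syntaxhighlight>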
 
== negate() ==
[https://stackoverflow.com/a/48431135 How to select non-numeric columns using dplyr::select_if]
<syntaxhighlight lang='rsplus'>
library(tidyverse)
iris %>% select_if(negate(is.numeric))
</syntaxhighlight>
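negate() takes a predicate function and returns a new function that gives the opposite answer. The sketch below assumes dplyr (1.0.0 or later) and purrr are installed. It uses select() with the '''where()''' helper as an alternative spelling of the select_if() call above, since the scoped select_if() verb is, as far as I know, marked as superseded in the dplyr documentation. The name '''not_numeric''' is made up for illustration.
<syntaxhighlight lang='rsplus'>
library(dplyr)
library(purrr)

not_numeric <- negate(is.numeric)   # a function: TRUE when the input is not numeric
not_numeric(1:3)                    # FALSE
not_numeric("a")                    # TRUE

non_num <- iris %>% select(where(not_numeric))
names(non_num)                      # "Species"
</syntaxhighlight>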
 
== pmap() ==
[https://purrr.tidyverse.org/reference/pmap.html ?pmap] - Map over multiple input simultaneously (in "parallel")
<pre>
# Create two lists with multiple elements
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)
 
# Define a function to add the elements of each list
my_func <- function(x, y) {
  x + y
}
 
# Use pmap to apply the function to each element of the lists in parallel
result <- pmap(list(list1, list2), my_func); result
[[1]]
[1] 11
 
[[2]]
[1] 22
 
[[3]]
[1] 33
</pre>
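pmap() also accepts a data frame, because a data frame is a list of columns. Each row then supplies one set of arguments, and the columns are matched to the function's arguments by name. This is a minimal sketch, assuming purrr is installed; '''args_df''' and '''add_xy''' are made-up names.
<syntaxhighlight lang='rsplus'>
library(purrr)

args_df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
add_xy  <- function(x, y) x + y

# Columns x and y are matched to the arguments x and y, one row at a time
by_row <- pmap_dbl(args_df, add_xy)                  # 11 22 33

# With exactly two inputs, map2_dbl() is the simpler equivalent
pairwise <- map2_dbl(args_df$x, args_df$y, add_xy)   # 11 22 33
</syntaxhighlight>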
 
A more practical example: running an analysis or a visualization on each level of a group/class variable with nest() + pmap().
<syntaxhighlight lang='rsplus'>
# Create a data frame
df <- mpg %>%
  filter(manufacturer %in% c("audi", "volkswagen")) %>%
  select(manufacturer, year, cty)
 
# Nest the data by manufacturer
df_nested <- df %>%
  nest(data = -manufacturer)
 
# Create a function that takes a data frame and creates a ggplot object
my_plot_func <- function(data, manuf) {
    ggplot(data, aes(x = year, y = cty)) +
        geom_point() +
        ggtitle(manuf)
}
 
# Use pmap to apply the function to each element of the list-column in df_nested
df_nested_plot <- df_nested %>%
    mutate(plot = pmap(list(data, manufacturer), my_plot_func))
 
df_nested_plot$plot[[1]]  # display the first plot; [[1]] alone would return the manufacturer column
</syntaxhighlight>
Another example: fitting regressions for data in each group
<syntaxhighlight lang='rsplus'>
library(tidyverse)
 
# create example data
data <- tibble(
  x = rnorm(100),
  y = rnorm(100),
  group = sample(c("A", "B", "C"), 100, replace = TRUE)
)
 
# create a nested dataframe
nested_data <- data %>%
  nest(data = -group)
 
# define a function that runs linear regression on each dataset
lm_func <- function(data) {
  lm(y ~ x, data = data)
}
 
# apply lm_func() to each row of the nested dataframe
results <- nested_data %>%
  mutate(model = pmap(list(data), lm_func))
</syntaxhighlight>
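When there is only one list-column to iterate over, map() does the same job as pmap(list(data), ...), and map_dbl() can then pull a single number out of each fitted model. This is a minimal sketch, assuming dplyr, tidyr and purrr are installed; the simulated data and the column names '''model''' and '''slope''' are made up for illustration.
<syntaxhighlight lang='rsplus'>
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
dat <- tibble(
  x = rnorm(90),
  y = rnorm(90),
  group = rep(c("A", "B", "C"), each = 30)
)

results <- dat %>%
  nest(data = -group) %>%
  mutate(
    model = map(data, function(d) lm(y ~ x, data = d)),    # one lm fit per group
    slope = map_dbl(model, function(m) coef(m)[["x"]])     # slope of x in each fit
  )
</syntaxhighlight>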
 
== reduce ==
[https://www.r-bloggers.com/2023/07/reducing-my-for-loop-usage-with-purrrreduce/ Reducing my for loop usage with purrr::reduce()]
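
reduce() combines the elements of a vector or list pairwise with a two-argument function, which often replaces a loop that updates an accumulator. This is a minimal sketch, assuming purrr is installed; the inputs are made up for illustration and are not taken from the linked post.
<syntaxhighlight lang='rsplus'>
library(purrr)

total   <- reduce(c(1, 2, 3, 4, 5), `+`)       # ((((1 + 2) + 3) + 4) + 5) = 15
running <- accumulate(c(1, 2, 3, 4, 5), `+`)   # keeps the intermediate results: 1 3 6 10 15

# Elements common to every vector in a list
sets   <- list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5))
common <- reduce(sets, intersect)              # 3
</syntaxhighlight>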
 
== filter, subset data ==
[https://jcarroll.com.au/2023/08/30/four-filters-for-functional-programming-friends/ Four Filters for Functional (Programming) Friends]
 
== purrr vs base R ==
https://purrr.tidyverse.org/dev/articles/base.html
 
= forcats =
https://forcats.tidyverse.org/
 
[https://www.datasurg.net/2019/10/15/jama-retraction-after-miscoding-new-finalfit-function-to-check-recoding/ JAMA retraction after miscoding – new Finalfit function to check recoding]
 
= outer() =
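A small sketch of base R's outer() on toy vectors (the function <code>f</code> is a hypothetical scalar function used only to show the Vectorize() wrapper):
<syntaxhighlight lang='rsplus'>
x <- 1:3
y <- 1:4

m <- outer(x, y)        # 3 x 4 matrix with entries x[i] * y[j]
outer(x, y, FUN = "+")  # entries x[i] + y[j]

# FUN must be vectorised; wrap a scalar function with Vectorize()
f <- function(a, b) max(a, b)
m2 <- outer(x, y, Vectorize(f))
</syntaxhighlight>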
 
= Genomic sequence =
* chartr
<syntaxhighlight lang='rsplus'>
> yourSeq <- "AAAACCCGGGTTTNNN"
> chartr("ACGT", "TGCA", yourSeq)
[1] "TTTTGGGCCCAAANNN"
</syntaxhighlight>
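chartr() only gives the complement. A reverse complement also needs the string reversed. A minimal base R sketch (the helper name <code>rev_comp</code> is made up here, and it handles a single string):
<syntaxhighlight lang='rsplus'>
rev_comp <- function(seq) {
  comp <- chartr("ACGT", "TGCA", seq)                 # complement each base, N is left alone
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the characters
}

rev_comp("AAAACCCGGGTTTNNN")
# [1] "NNNAAACCCGGGTTTT"
</syntaxhighlight>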
 
= broom =
<ul>
<li>[https://cran.r-project.org/web/packages/broom/index.html broom]: Convert Statistical Analysis Objects into Tidy Tibbles
<li>Especially the tidy() function.
{{Pre}}
R> str(survfit(Surv(time, status) ~ x, data = aml))
List of 17
$ n        : int [1:2] 11 12
$ time    : num [1:20] 9 13 18 23 28 31 34 45 48 161 ...
$ n.risk  : num [1:20] 11 10 8 7 6 5 4 3 2 1 ...
$ n.event  : num [1:20] 1 1 1 1 0 1 1 0 1 0 ...
...
 
R> tidy(survfit(Surv(time, status) ~ x, data = aml))
# A tibble: 20 x 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata       
  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr>         
1    9    11      1        0  0.909    0.0953    1      0.754  x=Maintained 
2    13    10      1        1  0.818    0.142      1      0.619  x=Maintained 
...
18    33      3      1        0  0.194    0.627      0.664  0.0569 x=Nonmaintained
19    43      2      1        0  0.0972    0.945      0.620  0.0153 x=Nonmaintained
20    45      1      1        0  0      Inf        NA      NA      x=Nonmaintained
</pre>
<li>[https://www.frontiersin.org/files/Articles/746571/fonc-11-746571-HTML-r1/image_m/fonc-11-746571-t002.jpg Tables from journal papers]
<li>Multiple univariate models
<pre>
library(tidyverse)
library(broom)
 
mtcars %>%
   select(-mpg) %>%
   names() %>%
   map_dfr(~ tidy(lm(as.formula(paste("mpg ~", .x)), data = mtcars)))
# A tibble: 20 × 5
#  term        estimate std.error statistic  p.value
#  <chr>          <dbl>    <dbl>    <dbl>    <dbl>
# 1 (Intercept)  37.9      2.07      18.3  8.37e-18
# 2 cyl          -2.88    0.322      -8.92  6.11e-10
# 3 (Intercept)  29.6      1.23      24.1  3.58e-21
# 4 disp        -0.0412  0.00471    -8.75  9.38e-10
</pre>
<li>Multivariate model
<pre>
lm(mpg ~ ., data = mtcars) |> tidy()
# A tibble: 11 × 5
#  term        estimate std.error statistic p.value
#  <chr>          <dbl>    <dbl>    <dbl>  <dbl>
# 1 (Intercept)  12.3      18.7        0.657  0.518  
# 2 cyl          -0.111    1.05      -0.107  0.916  
# 3 disp          0.0133    0.0179    0.747  0.463
</pre>
</ul>
 
= lobstr package - dig into the internal representation and structure of R objects =
[https://www.tidyverse.org/articles/2018/12/lobstr/ lobstr 1.0.0]
 
= Other packages =
 
== Great R packages for data import, wrangling, and visualization ==
[https://www.computerworld.com/article/2921176/great-r-packages-for-data-import-wrangling-visualization.html Great R packages for data import, wrangling, and visualization]


== cli package ==
* https://cli.r-lib.org/
* [https://www.r-bloggers.com/2023/11/cliff-notes-about-the-cli-package/ Cliff notes about the cli package]

== tidytext ==

Revision as of 08:57, 7 May 2024

Tidyverse

   Import
     |
     | readr, readxl
     | haven, DBI, httr   +----- Visualize ------+
     |                    |    ggplot2, ggvis    |
     |                    |                      |
   Tidy ------------- Transform 
   tibble               dplyr                   Model 
   tidyr                  |                    broom
                          +------ Model ---------+

Cheat sheet

The cheat sheets are downloaded from RStudio

Books

Going from Beginner to Advanced in the Tidyverse

Online

Animation to explain

Base-R and Tidyverse

tidyverse vs python panda

Why pandas feels clunky when coming from R

Examples

A Gentle Introduction to Tidy Statistics in R

A Gentle Introduction to Tidy Statistics in R by Thomas Mock on RStudio webinar. Good coverage with step-by-step explanation. See part 1 & part 2 about the data and markdown document. All documents are available in github repository.

Task R code Graph
Load the libraries
library(tidyverse)
library(readxl)
library(broom)
library(knitr)
Read Excel file
raw_df <- readxl::read_xlsx("ad_treatment.xlsx")

dplyr::glimpse(raw_df)
Check distribution
g2 <- ggplot(raw_df, aes(x = age)) +
  geom_density(fill = "blue")
g2
raw_df %>% summarize(min = min(age),
                     max = max(age))
File:Check dist.svg
Data cleaning
raw_df %>% 
  summarize(na_count = sum(is.na(mmse)))
Experimental variables

levels

# check Ns and levels for our variables
table(raw_df$drug_treatment, raw_df$health_status)
table(raw_df$drug_treatment, raw_df$health_status, raw_df$sex)

# tidy way of looking at variables
raw_df %>% 
  group_by(drug_treatment, health_status, sex) %>% 
  count()
Visual Exploratory

Data Analysis

ggplot(data = raw_df, # add the data
       aes(x = drug_treatment, y = mmse, # set x, y coordinates
           color = drug_treatment)) +    # color by treatment
  geom_boxplot() +
  facet_grid(~health_status)
File:Onefacet.svg
Summary Statistics
raw_df %>% 
  glimpse()
sum_df <- raw_df %>% 
            mutate(
              sex = factor(sex, 
                  labels = c("Male", "Female")),
              drug_treatment =  factor(drug_treatment, 
                  levels = c("Placebo", "Low dose", "High Dose")),
              health_status = factor(health_status, 
                  levels = c("Healthy", "Alzheimer's"))
              ) %>% 
            group_by(sex, health_status, drug_treatment # group by categorical variables
              ) %>%  
            summarize(
              mmse_mean = mean(mmse),      # calc mean
              mmse_se = sd(mmse)/sqrt(n()) # calc standard error
              ) %>%  
            ungroup() # ungrouping variable is a good habit to prevent errors

kable(sum_df)

write.csv(sum_df, "adx37_sum_stats.csv")
Plotting summary

statistics

g <- ggplot(data = sum_df, # add the data
       aes(x = drug_treatment,  #set x, y coordinates
           y = mmse_mean,
           group = drug_treatment,  # group by treatment
           color = drug_treatment)) +    # color by treatment
  geom_point(size = 3) + # set size of the dots
  facet_grid(sex~health_status) # create facets by sex and status
g
File:Twofacets.svg
ANOVA
# set up the statistics df
stats_df <- raw_df %>% # start with data
   mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, labels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))

glimpse(stats_df)
# this gives main effects AND interactions
ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, 
        data = stats_df)

summary(ad_aov)


# this extracts ANOVA output into a nice tidy dataframe
tidy_ad_aov <- tidy(ad_aov)
# which we can save to Excel
write.csv(tidy_ad_aov, "ad_aov.csv")
Post-hocs
# pairwise t.tests
ad_pairwise <- pairwise.t.test(stats_df$mmse,
                               stats_df$sex:stats_df$drug_treatment:stats_df$health_status, 
                               p.adj = "none")
# look at the posthoc p.values in a tidy dataframe
kable(head(tidy(ad_pairwise)))


# call and tidy the tukey posthoc
tidy_ad_tukey <- tidy(
                      TukeyHSD(ad_aov, 
                              which = 'sex:drug_treatment:health_status'))
Publication plot
sig_df <- tribble(
  ~drug_treatment, ~ health_status, ~sex, ~mmse_mean,
  "Low dose", "Alzheimer's", "Male", 17,
  "High Dose", "Alzheimer's", "Male", 25,
  "Low dose", "Alzheimer's", "Female", 18, 
  "High Dose", "Alzheimer's", "Female", 24
  )

sig_df <- sig_df %>% 
  mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")),
         sex = factor(sex, levels = c("Male", "Female")),
         health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")))
sig_df
# plot of cognitive function health and drug treatment
g1 <- ggplot(data = sum_df, 
       aes(x = drug_treatment, y = mmse_mean, fill = drug_treatment,  
           group = drug_treatment)) +
  geom_errorbar(aes(ymin = mmse_mean - mmse_se, 
                    ymax = mmse_mean + mmse_se), width = 0.5) +
  geom_bar(color = "black", stat = "identity", width = 0.7) +
  
  facet_grid(sex~health_status) +
  theme_bw() +
  scale_fill_manual(values = c("white", "grey", "black")) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        axis.title = element_text(size = 20),
        legend.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12)) +
  geom_text(data = sig_df, label = "*", size = 8) +
  labs(x = "\nDrug Treatment", 
       y = "Cognitive Function (MMSE)\n",
       caption = "\nFigure 1. Effect of novel drug treatment AD-x37 on cognitive function in 
                    healthy and demented elderly adults. 
                  \nn = 100/treatment group (total n = 600), * indicates significance 
                    at p < 0.001")
g1

# save the graph!
ggsave("ad_publication_graph.png", g1, height = 7, width = 8, units = "in")
File:Ad public.svg

palmerpenguins data

Introduction to data manipulation in R with {dplyr}

glm() and ggplot2(), mtcars

data(mtcars)

# Fit a Poisson regression model to predict "mpg" based on "wt"
model <- mtcars %>% 
  select(mpg, wt) %>% 
  mutate(wt = as.numeric(wt)) %>% 
  glm(mpg ~ wt, family = poisson(link = "log"), data = .)

# Print the summary of the model
summary(model)

# Make predictions on new data
new_data <- data.frame(wt = c(2.5, 3.0, 3.5))
predictions <- predict(model, new_data, type = "response")
print(predictions)

# Visualize the results with ggplot2
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  stat_smooth(method = "glm", formula = "y ~ x", 
              method.args = list(family = poisson(link = "log")), 
              se = FALSE, color = "red") +
  labs(x = "Weight", y = "Miles per gallon")

Opioid prescribing habits in Texas

https://juliasilge.com/blog/texas-opioids/.

  • It can read multiple sheets (27 sheets) at a time and merge them by rows.
  • case_when(): A general vectorised if. This function allows you to vectorise multiple if_else() statements. How to use the R case_when function.
    case_when(
      condition_1 ~ result_1,
      condition_2 ~ result_2,
      ...
      condition_n ~ result_n,
      .default = default_result
    )
    
    x %>% mutate(group = case_when(
      PredScore > quantile(PredScore, .5) ~ 'High',
      PredScore < quantile(PredScore, .5) ~ 'Low',
      TRUE ~ NA_character_
    ))
    
  • top_n() and its wt argument: top_n(n = 5, wt = x) keeps the five largest values of x in each group but does not sort the output by x. slice_max(order_by = x, n = 5) returns them sorted, and it supersedes top_n().
    set.seed(1)
    d <- data.frame(
      x   = runif(90),
      grp = gl(3, 30)
    ) 
    
    > d %>% group_by(grp) %>% top_n(5, wt=x)
    # A tibble: 15 x 2
    # Groups:   grp [3]
           x grp  
       <dbl> <fct>
     1 0.908 1    
     2 0.898 1    
    ...
    15 0.961 3 
    
    > d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
    # A tibble: 15 x 2
    # Groups:   grp [3]
           x grp  
       <dbl> <fct>
     1 0.992 1    
     2 0.945 1    
    ...
    15 0.864 3 
    

Tidying the Freedom Index

https://pacha.dev/blog/2023/06/05/freedom-index/index.html

tidyverse

  • gsub()
  • read_excel()
  • filter()
  • pivot_longer()
  • case_when()
  • fill()
  • group_by(), mutate(), row_number(), ungroup()
  • pivot_wider()
  • drop_na()
  • ungroup(), distinct()
  • left_join()

ggplot2

  • geom_line()
  • facet_wrap()
  • theme_minimal()
  • theme()
  • labs()

Useful dplyr functions (with examples)

Supervised machine learning case studies in R

Supervised machine learning case studies in R - A Free, Interactive Course Using Tidy Tools.

Time series data

Calculating change from baseline

Use group_by() + mutate() + ungroup(). The same task can also be done in base R with split() + lapply() + do.call().

trial_data_chg <- trial_data %>%
  arrange(USUBJID, AVISITN) %>%
  group_by(USUBJID) %>%
  mutate(CHG = AVAL - AVAL[1L]) %>%
  ungroup()

# If the baseline is missing
trial_data_chg2 <- trial_data %>%
  group_by(USUBJID) %>%
  mutate(
    CHG = if (any(AVISIT == "Baseline")) AVAL - AVAL[AVISIT == "Baseline"] else NA
  ) %>%
  ungroup()
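
The base R route mentioned above can be sketched as follows. The small trial_data frame is hypothetical and only reuses the column names from the example; real data would replace it.

# Hypothetical data with the same column names as the example above
trial_data <- data.frame(
  USUBJID = rep(c("S1", "S2"), each = 3),
  AVISITN = rep(0:2, times = 2),
  AVAL    = c(10, 12, 15, 20, 18, 25)
)

by_subject <- split(trial_data, trial_data$USUBJID)  # one data frame per subject
by_subject <- lapply(by_subject, function(d) {
  d <- d[order(d$AVISITN), ]                         # make sure the first row is the baseline visit
  d$CHG <- d$AVAL - d$AVAL[1L]                       # change from the first visit
  d
})
trial_data_chg <- do.call(rbind, by_subject)         # stack the pieces back together
rownames(trial_data_chg) <- NULL
trial_data_chg$CHG
# [1]  0  2  5  0 -2  5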

Split data and fitting models to subsets

https://twitter.com/romain_francois/status/1226967548144635907?s=20

library(dplyr)
iris %>% 
  group_by(Species) %>%
  summarise(broom::tidy(lm(Petal.Length ~ Sepal.Length)))
# With dplyr >= 1.1.0, use reframe() in place of summarise() when each group returns several rows

Show all possible group combinations

Ten Tremendous Tricks in the Tidyverse

https://youtu.be/NDHSBUN_rVU (video).

  • count(),
  • add_count(),
  • summarize() w/ a list column,
  • fct_reorder() + geom_col() + coord_flip(),
  • fct_lump(),
  • scale_x/y_log10(),
  • crossing(),
  • separate(),
  • extract().

Gapminder dataset

Hands-on R and dplyr – Analyzing the Gapminder Dataset

Install on Ubuntu

sudo apt install r-cran-tidyverse

# Ubuntu >= 18.04. However, I get unmet dependencies errors on R 3.5.3.
# r-cran-curl : Depends: r-api-3.4
sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2

# Works fine on Ubuntu 16.04, 18.04, 20.04
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev

80 R packages will be installed after tidyverse has been installed.

For RStudio server docker version (Debian 10), I also need to install zlib1g-dev

Install on Raspberry Pi/(ARM based) Chromebook

In addition to the requirements for installing on Ubuntu, I got an error while installing the dependent package fs: undefined symbol: pthread_atfork. The fs package version was 1.2.6. The solution is to add one line to the fs/src/Makevars file and then install the "fs" package from source on the local machine.

5 most useful data manipulation functions

  • subset() for making subsets of data (natch)
  • merge() for combining data sets in a smart and easy way
  • melt()-reshape2 package for converting from wide to long data formats. See an example here where we want to combine multiple columns of values into 1 column. melt() is replaced by gather().
  • dcast()-reshape2 package for converting from long to wide data formats (or just use tapply()), and for making summary tables
  • ddply()-plyr package for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations

Miscellaneous examples using tibble or dplyr packages

Print all columns or rows

?print.tbl_df

  • print(x, width = Inf) # all columns
  • print(x, n = Inf) # all rows

Move a column to rownames

?tibble::column_to_rownames

# It assumes the input data frame has no row names; otherwise we will get
# Error: `df` must be a data frame without row names in `column_to_rownames()`
# 
tibble::column_to_rownames(data.frame(x=letters[1:5], y = rnorm(5)), "x")

Move rownames to a variable

https://tibble.tidyverse.org/reference/rownames.html

tibble::rownames_to_column(trees, "newVar")
# Still a data frame

Old way add_rownames()

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5]) %>% add_rownames("newvar")
# tibble object

Remove rows or columns only containing NAs

Surgically removing specific rows or columns that only contains `NA`s

library(dplyr)
df <- tibble(x = c(NA, NA, NA),
             y = c(2, 3, NA),
             z = c(NA, 5, NA) )

# removing columns where all elements are NA
df %>% select(where(~ !all(is.na(.x))))

# removing rows where all elements are NA
df %>% filter(if_any(everything(), ~ !is.na(.x)))

Rename variables

dplyr::rename(os, newName = oldName)

Drop/remove a variable/column

select(df, -x) # 'x' is the name of the variable 

Drop a level

group_by() has a .drop argument so you can also group by factor levels that don't appear in the data. See this example.
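
A minimal sketch of the .drop argument with a toy data frame (df and its level "c" are made up for illustration):

library(dplyr)

df <- data.frame(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),  # level "c" never occurs
  x = 1:3
)

dropped <- df %>% group_by(f) %>% summarise(n = n())                 # 2 rows, "c" is dropped
kept    <- df %>% group_by(f, .drop = FALSE) %>% summarise(n = n())  # 3 rows, "c" has n = 0
kept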

Remove rownames

tibble has_rownames(), rownames_to_column(), column_to_rownames()

has_rownames(mtcars)
#> [1] TRUE

# Remove row names
remove_rownames(mtcars) %>% has_rownames()
#> [1] FALSE
> tibble::has_rownames(trees)
[1] FALSE
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"
> rownames(trees) <- NULL
> rownames(trees)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31"

relocate: change column order

relocate()

# Move Petal.Width column to appear next to Sepal.Width
iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head() 

# Move Petal.Width to the last column
iris %>% relocate(Petal.Width, .after = last_col()) %>% head()

pull: extract a single column

pull()

x <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
# x <- iris %>% filter(Species == 'setosa') %>% pull(Sepal.Length)
# x <- iris %>% filter(Species == 'setosa') %>% .$Sepal.Length
y <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()
t.test(x, y)

Convert Multiple Columns to Numeric

Convert Multiple Columns to Numeric in R. mutate_at(), mutate_if()
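
mutate_at() and mutate_if() are superseded by across(). A sketch of the same conversion with across(), using a made-up tibble:

library(dplyr)

df <- tibble(id = c("a", "b"), x = c("1", "2"), y = c("3.5", "4.5"))

# Select the columns by name
out <- df %>% mutate(across(c(x, y), as.numeric))

# Or select by type, keeping the id column as character
out2 <- df %>% mutate(across(where(is.character) & !id, as.numeric))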

select(): extract multiple columns

select(): drop columns

Simplifying Data Manipulation: How to Drop Columns from Data Frames in R

slice(): select rows by index

?slice

mtcars %>% slice_max(mpg, n = 1)
#                 mpg cyl disp hp drat    wt qsec vs am gear carb
# Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

mtcars %>% slice(which.max(mpg))
#                 mpg cyl disp hp drat    wt qsec vs am gear carb
# Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

Reorder columns

reorder()

iris %>% ggplot(aes(x=Species, y = Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")

# reorder the boxplot based on the Species' median
iris %>% ggplot(aes(x=reorder(Species, Sepal.Width, FUN = median),
                    y=Sepal.Width)) + 
         geom_boxplot() +
         xlab("Species")

fct_reorder()

10 Tidyverse functions that might save your day

Standardize variables

How to Standardize Data in R?

Anonymous functions

Transformation on multiple columns

  • How to apply a transformation to multiple columns in R?
    • df %>% mutate(across(c(col1, col2), function(x) x*2))
    • df %>% summarise(across(c(col1, col2), \(x) mean(x, na.rm = TRUE)))
  • select() vs across()
    • the across() and select() functions are both used to manipulate columns in a data frame
    • The select() function is used to select columns from a data frame.
    • The across() function is used to apply a function to multiple columns in a data frame. It’s often used inside other functions like mutate() or summarize().
data.frame(
  x = c(1, 2, 3),
  y = c(4, 5, 6)
) %>% 
mutate(across(everything(), ~ .x * 2)) # purrr-style lambda
#  x  y
#1 2  8
#2 4 10
#3 6 12

Reading and writing data

Speeding up Reading and Writing in R

data.table

Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).

Note: data.table has its own ways (cf base R and dplyr) to subset columns.

Some resources:

OpenMP enabled compiler for Mac. This instruction works on my Mac El Capitan (10.11.6) when I need to upgrade the data.table version from 1.11.4 to 1.11.6.

Question: how to make use multicore with data.table package?

dtplyr

https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/

reshape & reshape2 (superseded by tidyr package)

tidyr

Missing values

Handling Missing Values in R using tidyr

Pivot

  • tidyr package. pivot vignette, pivot_wider()
    R> d2 <- tibble(o=rep(LETTERS[1:2], each=3), n=rep(letters[1:3], 2), v=1:6); d2
    # A tibble: 6 × 3
      o     n         v
      <chr> <chr> <int>
    1 A     a         1
    2 A     b         2
    3 A     c         3
    4 B     a         4
    5 B     b         5
    6 B     c         6
    R> d1 <- d2%>% pivot_wider(names_from=n, values_from=v); d1
    # A tibble: 2 × 4
      o         a     b     c
      <chr> <int> <int> <int>
    1 A         1     2     3
    2 B         4     5     6
    

    pivot_longer()

    R> d1 %>% pivot_longer(!o, names_to = 'n', values_to = 'v')
    # Pivot all columns except 'o' column
    # A tibble: 6 × 3
      o     n         v
      <chr> <chr> <int>
    1 A     a         1
    2 A     b         2
    3 A     c         3
    4 B     a         4
    5 B     b         5
    6 B     c         6
    
    • In addition to the names_from and values_from columns, the data must have other columns
    • For each (combination) of unique value from other columns, the values from names_from variable must be unique
  • Conversion from gather() to pivot_longer()
    gather(df, key=KeyName, value = valueName, col1, col2, ...) # No quotes around KeyName and valueName
    
    pivot_longer(df, cols, names_to = "keyName", values_to = "valueName") 
      # cols can be everything()
      # cols can be numerical numbers or column names
    
  • A Tidy Transcriptomics introduction to RNA-Seq analyses
    data %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance")
    
  • Using R: setting a colour scheme in ggplot2. Note the new (default) column names value and name after the function pivot_longer(data, cols).
    set.seed(1)
    dat1 <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10))
    dat2 <- pivot_longer(dat1, -y)
    head(dat2, 2)
    # A tibble: 2 x 3
          y name   value
      <dbl> <chr>  <dbl>
    1 -1.28 x1     0.717
    2 -1.28 x2    -0.320
    
    dat3 <- pivot_wider(dat2)
    range(dat1 - dat3)
    

Benchmark

An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.

Make wide tables long with gather() (see 6.3.1 of Efficient R Programming)

library(tidyr)
library(efficient)
data(pew) # wide table
dim(pew) # 18 x 10,  (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k') 
pewt <- gather(data = pew, key = Income, value = Count, -religion)
dim(pewt) # 162 x 3,  (religion, Income, Count)

args(gather)
# function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

where the three main arguments of gather() are:

  • data: a data frame in which column names will become row values. If the data is a matrix, use %>% as.data.frame() beforehand.
  • key: the name of the categorical variable into which the column names in the original datasets are converted.
  • value: the name of cell value columns

In this example, the 'religion' column will not be included (-religion).
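
gather() is superseded by pivot_longer(). The sketch below shows the two calls side by side on a small made-up table (wide and its counts are illustrative, not the pew data):

library(tidyr)

wide <- tibble::tibble(
  religion   = c("Agnostic", "Atheist"),
  `<$10k`    = c(27, 12),
  `$10--20k` = c(34, 27)
)

long_gather <- gather(wide, key = Income, value = Count, -religion)
long_pivot  <- pivot_longer(wide, cols = -religion,
                            names_to = "Income", values_to = "Count")
dim(long_gather)  # 4 3
dim(long_pivot)   # 4 3

The two results hold the same values but in a different row order: gather() stacks the table column by column, while pivot_longer() goes row by row.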

dplyr, plyr packages

  • plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
  • It's amazing the things one can do in base R (without installing or loading any other #rstats packages)
  • Essential functions: 3 rows functions, 3 column functions and 1 mixed function.
           select, mutate, rename, recode
            +------------------+
filter      +                  +
arrange     +                  +
group_by    +                  +
drop_na     +                  +
ungroup     + summarise        +
            +------------------+
  • These functions work on data frames and tibble objects. Note that the stats package also has a filter() function for time series data. If the dplyr package is not loaded, the filter() call below will give an error (count() is also from dplyr).
iris %>% filter(Species == "setosa") %>% count()
head(iris %>% filter(Species == "setosa") %>% arrange(Sepal.Length))
  • dplyr tutorial from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For select() function, some additional options to select columns based on a specific criteria include
    • starts_with()/ ends_with() = Select columns that start/end with a character string
    • contains() = Select columns that contain a character string
    • matches() = Select columns that match a regular expression
    • one_of() = Select columns names that are from a group of names
  • Data Transformation in the book R for Data Science. Five key functions in the dplyr package:
# filter
jan1 <- filter(flights, month == 1, day == 1)
filter(flights, month == 11 | month == 12)
filter(flights, arr_delay <= 120, dep_delay <= 120)
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

# arrange
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))

# select
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

# mutate
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
# if you only want to keep the new variables
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# summarise()
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

# pipe. Note summarise() can return more than 1 variable.
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
  • Another example
data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  gender = c("F", "M", "M", "M", "F"),
  score1 = c(80, 85, 90, 95, 100),
  score2 = c(75, 80, 85, 90, 95)
)

# Example usage of dplyr functions
result <- data %>%
  filter(gender == "M") %>%                # Keep only rows where gender is "M"
  select(name, age, gender, score1, score2) %>% # Select specific columns (score2 and gender are used below)
  mutate(score_diff = score1 - score2) %>% # Calculate a new column based on existing columns
  arrange(desc(age)) %>%                   # Arrange rows in descending order of age
  #group_by(gender) %>%                    # Optionally group the data by gender
  summarize(mean_score1 = mean(score1))    # Mean of score1 (per group if group_by() is enabled)
  • the dot.
    matrix(rnorm(12),4, 3) %>% .[1:2, 1:2]
    

select() for columns

Select columns from a data frame

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

select() + everything()

If we want one particular column (say the dependent variable y) to appear first or last in the dataset, we can use everything().

iris %>% select(Species, everything()) %>% head()
iris %>% select(-Species, everything()) %>% head() # put Species to the last col

.$Name

Extract a column using piping. The . represents the data frame that is being piped in, and $Name extracts the ‘Name’ column.

mtcars %>% .$mpg  # A vector

mtcars %>% select(mpg) # A data frame with one column

filter() for rows

mtcars %>% filter(mpg>10)

identical(mtcars %>% filter(mpg>10), subset(mtcars, mpg>10))
# [1] TRUE

filter by date

What Is the Best Way to Filter by Date in R?

arrange (reorder)

  • Arrange values by a Single Variable:
    # Create a sample data frame
    students <- data.frame(
      Name = c("Ali", "Boby", "Charlie", "Davdas"),
      Score = c(85, 92, 78, 95)
    )
    
    # Arrange by Score in ascending order
    arrange(students, Score)
    #      Name Score
    # 1 Charlie    78
    # 2     Ali    85
    # 3    Boby    92
    # 4  Davdas    95
    
  • Arrange values by Multiple Variables: This is like the "sort" function in Excel.
    # Create a sample data frame
    transactions <- data.frame(
      Date = c("2024-04-01", "2024-04-01", "2024-04-02", "2024-04-03"),
      Amount = c(100, 150, 200, 75)
    )
    
    # Arrange by Date in ascending order, then by Amount in descending order
    arrange(transactions, Date, desc(Amount))
    #         Date Amount
    # 1 2024-04-01    150
    # 2 2024-04-01    100
    # 3 2024-04-02    200
    # 4 2024-04-03     75
    
  • Arrange values with Missing Values:
    # Create a sample data frame with missing values
    data <- data.frame(
      ID = c(1, 2, NA, 4),
      Value = c(20, NA, 15, 30)
    )
    
    # Arrange by Value in ascending order, placing missing values first
    arrange(data, desc(is.na(Value)), Value)
    #   ID Value
    # 1  2    NA
    # 2 NA    15
    # 3  1    20
    # 4  4    30
    

arrange and match

How to do the following in a pipe: A <- A[match(id.ref, A$id), ]

How to sort rows of a data frame based on a vector using dplyr pipe, Order data frame rows according to vector with specific order

  • Data
    library(dplyr)
    
    # Create a sample dataframe 'A'
    set.seed(1); A <- data.frame(
         id = sample(letters[1:5]),
         value = 1:5
         )
    print(A)
      id value
    1  a     1
    2  d     2
    3  c     3
    4  e     4
    5  b     5
    
    # Create a reference vector 'id.ref'
    id.ref <- c("e", "d", "c", "b", "a")
    # Goal:
    A[match(id.ref, A$id),]
      id value
    4  e     4
    2  d     2
    3  c     3
    5  b     5
    1  a     1
  • Method 1 (best): no match() is needed. Brilliant!
    A %>% arrange(factor(id, levels=id.ref))
      id value
    1  e     4
    2  d     2
    3  c     3
    4  b     5
    5  a     1
    # detail:
    factor(A$id, levels=id.ref)
    [1] a d c e b
    Levels: e d c b a
  • Method 2: complicated
    A %>%
         mutate(id.match = match(id, id.ref)) %>%
         arrange(id.match) %>%
         select(-id.match)
      id value
    1  e     4
    2  d     2
    3  c     3
    4  b     5
    5  a     1
    # detail:
    A %>%
         mutate(id.match = match(id, id.ref)) 
      id value id.match
    1  a     1        5
    2  d     2        2
    3  c     3        3
    4  e     4        1
    5  b     5        4
  • Method 3: a simplified version of Method 2, but it needs match()
    A %>% arrange(match(id, id.ref))
      id value
    1  e     4
    2  d     2
    3  c     3
    4  b     5
    5  a     1

group_by()

  • ?group_by and ?ungroup
  • Grouped data
  • Is ungroup() recommended after every group_by()? Yes: always ungroup() once the grouped calculations are finished.
  • Until you call ungroup(), later verbs such as mutate() and filter() keep operating within each group. Ungroup when the next calculations or manipulations should not depend on the grouping, for example when adding columns or filtering rows across the whole data frame.
                  +-- mutate() + ungroup() 
x -- group_by() --|
                  +-- summarise() # reduce the dimension, no way to get back
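
The two branches of the diagram can be sketched as follows. The data frame scores and its columns are made up for illustration.

```r
library(dplyr)

# Hypothetical data: two teams, three players each
scores <- tibble(
  team   = c("A", "A", "A", "B", "B", "B"),
  points = c(10, 20, 30, 5, 15, 40)
)

# Branch 1: mutate() keeps every row; ungroup() drops the grouping afterwards
centered <- scores %>%
  group_by(team) %>%
  mutate(team_mean = mean(points), diff = points - team_mean) %>%
  ungroup()

# Branch 2: summarise() collapses each group to a single row
team_means <- scores %>%
  group_by(team) %>%
  summarise(team_mean = mean(points))

nrow(centered)           # 6, same as the input
is_grouped_df(centered)  # FALSE after ungroup()
nrow(team_means)         # 2, one row per team
```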

Subset rows by group

Subset rows based on their integer locations-slice in R
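
A minimal sketch of subsetting rows within each group, using slice() for positions and slice_max() for the largest value. The data frame df is hypothetical.

```r
library(dplyr)

# Hypothetical data
df <- tibble(
  team   = c("A", "A", "A", "B", "B"),
  points = c(12, 30, 25, 8, 19)
)

# First row of each group, selected by position
first_rows <- df %>%
  group_by(team) %>%
  slice(1) %>%
  ungroup()

# Row with the highest points in each group
best <- df %>%
  group_by(team) %>%
  slice_max(points, n = 1) %>%
  ungroup()

first_rows$points  # 12  8
best$points        # 30 19
```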

Rank by Group

How to Rank by Group in R?

df %>% arrange(team, points) %>%
    group_by(team) %>%
    mutate(rank = rank(points))

group_by() + summarise(), arrange(desc())

Data transformation from R for Data Science

Function in summarise()

  • group_by(var1) %>% summarise(varY = mean(var2)) %>% ggplot(aes(x = varX, y = varY, fill = varF)) + geom_bar(stat = "identity") + theme_classic()
  • summarise(newvar = sum(var1) / sum(var2))
  • arrange(desc(var1), desc(var2))  # desc() takes a single vector, so wrap each column separately
  • Distinct number of observation: n_distinct()
  • Count the number of rows: n()
  • nth observation of the group: nth()
  • First observation of the group: first()
  • Last observation of the group: last()
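
The helper functions listed above can be combined in a single summarise() call. This is a small sketch on a made-up data frame, not taken from the linked chapter.

```r
library(dplyr)

# Hypothetical data, already in the row order we care about
df <- tibble(
  team   = c("A", "A", "A", "B", "B"),
  player = c("p1", "p2", "p2", "p3", "p4"),
  points = c(12, 30, 25, 8, 19)
)

res <- df %>%
  group_by(team) %>%
  summarise(
    n_rows    = n(),                 # number of rows in the group
    n_players = n_distinct(player),  # number of distinct players
    first_pts = first(points),       # first observation of the group
    second    = nth(points, 2),      # 2nd observation of the group
    last_pts  = last(points)         # last observation of the group
  ) %>%
  arrange(desc(n_rows))

res
```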

group_by() + summarise() + across()
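
A short sketch of the combination named in the heading: one summary function applied to several columns per group. It uses the built-in iris data; the "mean_" prefix is just an illustrative naming choice.

```r
library(dplyr)

# Mean of every numeric column, computed separately for each species
iris_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean, .names = "mean_{.col}"))

names(iris_means)
dim(iris_means)   # 3 rows (one per species), 5 columns
```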

group_by() + nest(), mutate(, map()), unnest(), list-columns

nest(data = ) from the tidyr package creates a nested data frame, in which one column is a list of data frames (a list-column). This is useful when you want to run an analysis or draw a plot for each group separately. Note that group_by() is not needed when the grouping column is given directly, as in nest(data = -marital) or nest(.by = marital).

histogram <- gss_cat |> 
  nest(data = -marital) |>  # OR nest(.by = marital). 6x2 tibble. Col1=marital, col2=data.
  mutate(
    histogram = pmap(
      .l = list(marital, data),
      .f = \(marital, data) {
        ggplot(data, aes(x = tvhours)) +
          geom_histogram(binwidth = 1) +
          labs(
            title = marital
          )
      }
    )
  )
histogram$histogram[[1]]

Many models from R for Data Science

  • ?unnest, vignette('rectangle'), vignette('nest') & vignette('pivot')
    tibble(x = 1:2, y = list(1:4, 2:3)) %>% unnest(y) %>% group_by(x) %>% nest()
    # returns to tibble(x = 1:2, y = list(1:4, 2:3)) with 'groups' information
  • annotate boxplot in ggplot2
  • Coding in R: Nest and map your way to efficient code
          group_by() + nest()    mutate(, map())   unnest()
    data  -------------------->  --------------->  ------->
    
    # install.packages('gapminder')
    library(gapminder)
    library(tidyverse)  # dplyr, tidyr, purrr and ggplot2 are all used below
    
    gapminder_nest <- gapminder %>% 
      group_by(country) %>% 
      nest()  # country, data
              # each row of 'data' is a tibble
    
    gapminder_nest$data[[1]]  # tibble 12 x 5 with the gapminder package (12 survey years per country)
    
    gapminder_nest <- gapminder_nest %>%
              mutate(pop_mean = map(.x = data, ~mean(.x$pop, na.rm = TRUE)))
                                        # country, data, pop_mean
    
    gapminder_nest %>% unnest(pop_mean) # country, data, pop_mean
    
    gapminder_plot <- gapminder_nest %>% 
      unnest(pop_mean) %>% 
      select(country, pop_mean) %>% 
      ungroup() %>% 
      top_n(pop_mean, n = -10) %>% 
      mutate(pop_mean = pop_mean/10^3)
    gapminder_plot %>% 
      ggplot(aes(x = reorder(country, pop_mean), y = pop_mean)) +
      geom_point(colour = "#FF6699", size = 5) +
      geom_segment(aes(xend = country, yend = 0), colour = "#FF6699") +
      geom_text(aes(label = round(pop_mean, 0)), hjust = -1) +
      theme_minimal() +
      labs(title = "Countries with smallest mean population from 1952 to 2007",
           subtitle = "(thousands)",
           x = "",
           y = "") +
      theme(legend.position = "none",
            axis.text.x = element_blank(),
            plot.title = element_text(size = 14, face = "bold"),
            panel.grid.major.y = element_blank()) +
      coord_flip() +
      scale_y_continuous()
  • Tidy analysis from tidymodels
  • Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()

across()

  • ?across. Applying a function or operation to multiple columns in a data frame simultaneously.
    across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
    gdf <-
      tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23) %>%
      group_by(g)
    gdf %>% mutate(across(v1:v2, ~ .x + rnorm(1)))
    #>       g    v1    v2
    #>   <dbl> <dbl> <dbl>
    #> 1     1  10.3  20.7
    #> 2     1  11.3  21.7
    #> 3     2  11.2  22.6
    #> 4     3  13.5  22.7
    
  • dplyr across: First look at a new Tidyverse function.
    ny <- filter(cases, State == "NY") %>%
      select(County = `County Name`, starts_with(c("3", "4")))
    
    daily_totals <- ny %>%
      summarize(
        across(starts_with("4"), sum)
      )
    
    median_and_max <- list(
      med = ~median(.x, na.rm = TRUE),
      max = ~max(.x, na.rm = TRUE)
    )
    
    april_median_and_max <- ny %>%
      summarize(
        across(starts_with("4"), median_and_max)
      )
    
    # across(.cols = everything(), .fns = NULL, ..., .names = NULL)
    
    # Rounding the columns Sepal.Length and Sepal.Width
    iris %>%
      as_tibble() %>%
      mutate(across(c(Sepal.Length, Sepal.Width), round))
    
    iris %>% summarise(across(contains("Sepal"), ~mean(.x, na.rm = TRUE)))
    
    # filter rows
    iris %>% filter(if_any(ends_with("Width"), ~ . > 4))
    
    iris %>% select(starts_with("Sepal"))
    
    iris %>% select(starts_with(c("Petal", "Sepal")))
    
    iris %>% select(contains("Sepal"))

ave() - Adding a column of means by group to original data

mutate vs tapply
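
A minimal comparison of the three approaches on the built-in mtcars data. ave() and group_by() + mutate() both return one value per row, while tapply() returns one value per group.

```r
library(dplyr)

# Base R: ave() returns a vector as long as its input, so it can be stored as a column
mt_base <- mtcars
mt_base$mpg_mean <- ave(mt_base$mpg, mt_base$cyl)  # FUN defaults to mean

# dplyr: the same column with group_by() + mutate()
mt_dplyr <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_mean = mean(mpg)) %>%
  ungroup()

# tapply() returns one value per group instead (comparable to summarise())
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

all.equal(mt_base$mpg_mean, mt_dplyr$mpg_mean)  # TRUE
length(group_means)                             # 3 cylinder groups
```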

Base-R is alive and well

mutate + replace() or ifelse()
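
A small sketch of conditional replacement inside mutate(). The data frame and the column names are invented for illustration.

```r
library(dplyr)

df <- tibble(id = 1:5, value = c(3, -1, 7, NA, -4))

out <- df %>%
  mutate(
    # replace(): overwrite the elements where the condition is TRUE
    no_na   = replace(value, is.na(value), 0),
    # ifelse(): choose between two values element by element (NA stays NA)
    clipped = ifelse(value < 0, 0, value),
    # if_else(): dplyr's stricter variant, with explicit handling of NA
    sign    = if_else(value < 0, "neg", "non-neg", missing = "unknown")
  )

out
```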

Hash table

  • Create new column based on 4 values in another column. The trick is to create a named vector; like a Dictionary in Python. Here is my example:
    hashtable <- data.frame(value=1:4, key=c("B", "C", "A", "D"))
    input <- c("A", "B", "C", "D", "B", "B", "A", "A") # input to be matched with keys, 
                                                       # this could be very long
    # Trick: convert the hash table into a named vector
    htb <- hashtable$value; names(htb) <- hashtable$key
    
    # return the values according to the names
    out <- htb[input]; out
    A B C D B B A A
    3 1 2 4 1 1 3 3

    The same lookup in Python uses a dictionary:

    hashtable = {'B': 1, 'C': 2, 'A': 3, 'D': 4}
    input = ['A', 'B', 'C', 'D', 'B', 'B', 'A', 'A']
    out = [hashtable[key] for key in input]

    Or in C, using an array indexed by each letter's offset from 'A':

    #include <stdio.h>
    
    int main() {
        int hashtable[4] = {3, 1, 2, 4};
        char input[] = {'A', 'B', 'C', 'D', 'B', 'B', 'A', 'A'};
        int out[sizeof(input)/sizeof(input[0])];
    
        for (int i = 0; i < sizeof(input)/sizeof(input[0]); i++) {
            out[i] = hashtable[input[i] - 'A'];
        }
    
        for (int i = 0; i < sizeof(out)/sizeof(out[0]); i++) {
            printf("%d ", out[i]);
        }
        printf("\n");
    
        return 0;
    }
  • hash package
  • digest package

inner_join, left_join, full_join
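
The three joins differ only in which unmatched rows they keep. A minimal sketch with two made-up tables that share an id column:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), y_val = c("B", "C", "D"))

inner <- inner_join(x, y, by = "id")  # only ids present in both: 2, 3
left  <- left_join(x, y, by = "id")   # all ids of x: 1, 2, 3 (y_val is NA for id 1)
full  <- full_join(x, y, by = "id")   # ids from either table: 1, 2, 3, 4

nrow(inner); nrow(left); nrow(full)
```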

plyr::rbind.fill()

Videos

dbplyr

stringr

  • stringr is a core tidyverse package (since tidyverse 1.2.0), so library(tidyverse) attaches it. It can also be loaded on its own with library(stringr).
  • Handling Strings with R (ebook) by Gaston Sanchez.
  • https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
  • stringr Cheat sheet (2 pages, this will immediately download the pdf file)
    • Detect Matches: str_detect(), str_which(), str_count(), str_locate()
    • Subset: str_sub(), str_subset(), str_extract(), str_match()
    • Manage Lengths: str_length(), str_pad(), str_trunc(), str_trim()
    • Mutate Strings: str_sub(), str_replace(), str_replace_all(), str_remove()
      • Case Conversion: str_to_lower(), str_to_upper(), str_to_title()
    • Join and Split: str_c(), str_dup(), str_split_fixed(), str_glue(), str_glue_data()
  • Efficient data carpentry → Regular expressions from Efficient R programming book by Gillespie & Lovelace.
  • Common functions:
    `stringr` function | Description | Base R equivalent
    `str_length()` | Returns the number of characters in each element of a character vector. | `nchar()`
    `str_sub()` | Extracts substrings from a character vector. | `substr()`
    `str_trim()` | Removes leading and trailing whitespace from strings. | `trimws()`
    `str_split()` | Splits a string into pieces based on a delimiter. | `strsplit()`
    `str_replace()` | Replaces the first occurrence of a pattern in each string; `str_replace_all()` replaces every occurrence. | `sub()`; `gsub()` for `str_replace_all()`
    `str_detect()` | Detects whether a pattern is present in each element of a character vector. | `grepl()`
    `str_subset()` | Returns the elements of a character vector that contain a pattern. | `grep(value = TRUE)`
    `str_count()` | Counts the number of occurrences of a pattern in each element of a character vector. | `gregexpr()` with `regmatches()` and `lengths()`
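
A quick side-by-side check of a few stringr functions against base R counterparts. The vector x is made up; note that str_replace() corresponds to sub() (first match only) and str_replace_all() to gsub().

```r
library(stringr)

x <- c("  apple pie ", "banana", "cherry tart")

identical(str_length(x), nchar(x))                           # TRUE
identical(str_trim(x), trimws(x))                            # TRUE
identical(str_detect(x, "an"), grepl("an", x))               # TRUE
identical(str_subset(x, "an"), grep("an", x, value = TRUE))  # TRUE

# First match only versus every match
identical(str_replace(x, "a", "A"), sub("a", "A", x))        # TRUE
identical(str_replace_all(x, "a", "A"), gsub("a", "A", x))   # TRUE

# Counting matches needs two steps in base R
base_count <- lengths(regmatches(x, gregexpr("a", x)))
str_count(x, "a")  # 1 3 1
```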

str_replace()

Replace a string in a column: dplyr::across() & str_replace()

df <- data.frame(country=c('India', 'USA', 'CHINA', 'Algeria'),
                 position=c('1', '1', '2', '3'),
                 points=c(22, 25, 29, 13))

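library(dplyr)
library(stringr)

# An anonymous function is used because passing extra arguments through across() is deprecated (dplyr >= 1.1.0)
df %>%
  mutate(across(country, \(x) str_replace(x, 'India', 'Albania')))

# str_replace() removes only the first match in each string; use str_replace_all() for every match
df %>%
  mutate(across(country, \(x) str_replace(x, 'A|I', '')))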

split

Split Data Frame Variable into Multiple Columns in R (3 Examples)

Two of the ways:

x <- c("a-1", "b-2", "c-3")

stringr::str_split_fixed(x, "-", 2)
#      [,1] [,2]
# [1,] "a"  "1" 
# [2,] "b"  "2" 
# [3,] "c"  "3" 

tidyr::separate(data.frame(x), x, c('x1', 'x2'), "-")
   # The 1st argument must be a data frame
   # The 2nd argument is the column to split; the 3rd gives the new column names
#   x1 x2
# 1  a  1
# 2  b  2
# 3  c  3

magrittr: pipe

x %>% f     # f(x)
x %>% f(y)  # f(x, y)
x %>% f(arg=y)  # f(x, arg=y)
x %>% f(z, .) # f(z, x)
x %>% f(y) %>% g(z)  #  g(f(x, y), z)

x %>% select(which(colSums(!is.na(.))>0))  # remove columns with all missing data
x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
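suppressPackageStartupMessages(library("dplyr"))

# The dot placeholder can be written out explicitly ...
starwars %>%
  filter(., height > 200) %>%
  select(., height, mass) %>%
  head(.)
# ... but it is normally omitted, since %>% fills the first argument:
starwars %>%
  filter(height > 200) %>%
  select(height, mass) %>%
  head

# Extracting a single column: the following all return the Species vector
iris$Species
iris[["Species"]]

iris %>%
`[[`("Species")

iris %>%
`[[`(5)

# whereas subset() returns a one-column data frame
iris %>%
  subset(select = "Species")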
  • Split-apply-combine: group + summarize + sort/arrange + top n. The following example is from Efficient R programming.
data(wb_ineq, package = "efficient")
wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
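# Root-mean-square difference between x and y, written as a pipe
# (y is passed as an argument here rather than looked up in the calling environment)
f <- function(x, y) {
  (y - x) %>% 
    '^'(2) %>% 
    sum %>%
    '/'(length(x)) %>% 
    sqrt %>% 
    round(2)
}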
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)

library(magrittr)  # %T>%, %$% and %<>% are not re-exported by dplyr

rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str()  # str() shows NULL because plot() returns NULL
rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe
    # %T>% works like %>% except that it returns the left-hand side (rnorm(100) %>% matrix(ncol = 2))
    # instead of the right-hand side.

# If a function does not have a data frame based api, you can use %$%.
# It explodes out the variables in a data frame.
mtcars %$% cor(disp, mpg) 

# For assignment, magrittr provides the %<>% operator
mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by
mtcars %<>% transform(cyl = cyl * 2)

Upsides of using magrittr: no need to create intermediate objects, code is easy to read.

When not to use the pipe

  • your pipes are longer than (say) 10 steps
  • you have multiple inputs or outputs
  • Functions that use the current environment: assign(), get(), load()
  • Functions that use lazy evaluation: tryCatch(), try()

Dollar sign .$

%$%

Expose the names in lhs to the rhs expression. This is useful when functions do not have a built-in data argument.

lhs %$% rhs
# lhs: A list, environment, or a data.frame.
# rhs: An expression in which the names in lhs are available.

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

set_rownames() and set_colnames()

https://stackoverflow.com/a/56613460, https://www.rdocumentation.org/packages/magrittr/versions/1.5/topics/extract

data.frame(x=1:5, y=2:6) %>% magrittr::set_rownames(letters[1:5])

cbind(1:5, 2:6) %>% magrittr::set_colnames(letters[1:2])

match()

a <- 1:3
id <- letters[1:3]
set.seed(1234); id.ref <- sample(id) 
id.ref # [1] "b" "c" "a"

a[match(id.ref, id)] # [1] 2 3 1
id.ref %>% match(id) %>% `[`(a, .) # Same, but complicated

dtrackr

dtrackr: Track your Data Pipelines

purrr: Functional Programming Tools

While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…). See Learn to purrr.

  • What does the tilde mean in this context of R code, What is meaning of first tilde in purrr::map
  • Getting started with the purrr package in R, especially the map() and map_df() functions.
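    library(purrr)
    # dat is assumed to be a list of data frames with Open and Close columns,
    # and stocks the matching vector of ticker names (neither is defined here)
    # map() is a replacement of lapply()
    # lapply(dat, function(x) mean(x$Open))
    map(dat, function(x)mean(x$Open))  
    
    # map() lets us drop the function keyword:
    # use a tilde (~) in place of function(x) and a dot (.) in place of x
    map(dat, ~mean(.$Open))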
    
    # map allows you to specify the structure of your output.
    map_dbl(dat, ~mean(.$Open))
    
    # map2() is a replacement of mapply()
    # mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)
    map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))
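data <- map(paths, read.csv)
# .id (with the leading dot) is the map_dfr() argument; a plain id = would be passed on to read.csv().
# The path column is filled from names(paths), or with the index if paths is unnamed.
data <- map_dfr(paths, read.csv, .id = "path")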

out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE)
out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)

map() and map_dbl()

  • Mastering the map() Function in R
  • An example from https://purrr.tidyverse.org/
    mtcars |> 
         split(mtcars$cyl) |>  # from base R
         map(\(df) lm(mpg ~ wt, data = df)) |> 
         map(summary) |> map_dbl("r.squared")
    #         4         6         8 
    # 0.5086326 0.4645102 0.4229655
  • Solution by base R lapply() and sapply(). See the article purrr <-> base R
    mtcars |>
         split(mtcars$cyl) |>
         lapply(function(df) lm(mpg ~ wt, data = df)) |>
         lapply(summary) |>
         sapply(function(x) x$r.squared)
    #         4         6         8 
    # 0.5086326 0.4645102 0.4229655

tilde

  • The lambda syntax and tilde notation provided by purrr allow you to write concise and readable anonymous functions in R.
x <- 1:3
map_dbl(x, ~ .x^2)  # [1] 1 4 9
The notation ~ .x^2 is equivalent to writing function(.x) .x^2 or function(z) z^2 or \(y) y^2
x <- list(a = 1:3, b = 4:6)
y <- list(a = 10, b = 100)
map2_dbl(x, y, ~ sum(.x * .y))
#   a    b 
#  60 1500

.x symbol

  • It is used with functions like purrr::map. In the context of an anonymous function, .x is a placeholder for the first argument of the function.
    • For a single argument function, you can use . (a single dot). For example, ~ . + 2 is equivalent to function(.) {. + 2}.
    • For a two argument function, you can use .x and .y. For example, ~ .x + .y is equivalent to function(.x, .y) {.x + .y}.
    • For more arguments, you can use ..1, ..2, ..3, etc
    # Create a vector
    vec <- c(1, 2, 3)
    
    # Use purrr::map with an anonymous function
    result <- purrr::map(vec, ~ .x * 2)
    
    # Print the result
    print(result)
    [[1]]
    [1] 2
    
    [[2]]
    [1] 4
    
    [[3]]
    [1] 6
    
  • dplyr piping data - difference between `.` and `.x`
  • Function argument naming conventions (`.x` vs `x`). See purrr::map
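
The three placeholder styles described above can be compared in one short sketch. The input vectors are arbitrary; the last call shows the equivalent R >= 4.1 lambda with argument names chosen only for this example.

```r
library(purrr)

# One argument: . (or .x)
one <- map_dbl(1:3, ~ . + 2)                                  # 3 4 5

# Two arguments: .x and .y
two <- map2_dbl(1:3, 4:6, ~ .x + .y)                          # 5 7 9

# Three or more arguments: ..1, ..2, ..3 with pmap()
three <- pmap_dbl(list(1:3, 4:6, 7:9), ~ ..1 + ..2 * ..3)     # 29 42 57

# The same computation with named arguments
three_named <- pmap_dbl(list(1:3, 4:6, 7:9), \(a, b, c) a + b * c)
```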

negate()

How to select non-numeric columns using dplyr::select_if

library(tidyverse)
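iris %>% select_if(negate(is.numeric))
# select_if() is superseded; the current idiom is select(where(...)):
iris %>% select(where(negate(is.numeric)))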

pmap()

?pmap - Map over multiple input simultaneously (in "parallel")

# Create two lists with multiple elements
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)

# Define a function to add the elements of each list
my_func <- function(x, y) {
  x + y
}

# Use pmap to apply the function to each element of the lists in parallel
result <- pmap(list(list1, list2), my_func); result
[[1]]
[1] 11

[[2]]
[1] 22

[[3]]
[1] 33

A more practical example when we want to run analysis or visualization on each element of some group/class variable. nest() + pmap().

# Create a data frame
df <- mpg %>% 
  filter(manufacturer %in% c("audi", "volkswagen")) %>% 
  select(manufacturer, year, cty)

# Nest the data by manufacturer
df_nested <- df %>% 
  nest(data = -manufacturer)

# Create a function that takes a data frame and creates a ggplot object
my_plot_func <- function(data, manuf) {
     ggplot(data, aes(x = year, y = cty)) +
         geom_point() +
         ggtitle(manuf)
 }

# Use pmap to apply the function to each element of the list-column in df_nested
df_nested_plot <- df_nested %>% 
     mutate(plot = pmap(list(data, manufacturer), my_plot_func))

df_nested_plot$plot[[1]]  # the first plot; df_nested_plot[[1]] would return the manufacturer column

Another example: fitting regressions for data in each group

library(tidyverse)

# create example data
data <- tibble(
  x = rnorm(100),
  y = rnorm(100),
  group = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# create a nested dataframe
nested_data <- data %>% 
  nest(data = -group)

# define a function that runs linear regression on each dataset
lm_func <- function(data) {
  lm(y ~ x, data = data)
}

# apply lm_func() to each row of the nested dataframe
results <- nested_data %>% 
  mutate(model = pmap(list(data), lm_func))

reduce

Reducing my for loop usage with purrr::reduce()
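
A minimal sketch of how reduce() replaces an accumulating for loop. The inputs are made up and not taken from the linked post.

```r
library(purrr)

# for-loop version: keep a running total
total <- 0
for (v in 1:5) total <- total + v

# reduce() folds the vector with a two-argument function
total_reduce <- reduce(1:5, `+`)       # 15

# accumulate() keeps the intermediate results
running <- accumulate(1:5, `+`)        # 1 3 6 10 15

# A common use: merge a list of data frames pairwise on a shared key
dfs <- list(
  data.frame(id = 1:3, a = c(10, 20, 30)),
  data.frame(id = 1:3, b = c("x", "y", "z")),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)
merged <- reduce(dfs, \(left, right) merge(left, right, by = "id"))
names(merged)  # "id" "a" "b" "c"
```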

filter, subset data

Four Filters for Functional (Programming) Friends

purrr vs base R

https://purrr.tidyverse.org/dev/articles/base.html

forcats

https://forcats.tidyverse.org/

JAMA retraction after miscoding – new Finalfit function to check recoding
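
A small sketch of common forcats operations, followed by a cross-tabulation of old against new codes, which is a simple base R way to catch the kind of recoding error mentioned above. The factor sizes is invented for illustration.

```r
library(forcats)

sizes <- factor(c("small", "large", "medium", "small", "small", "large"))
levels(sizes)  # alphabetical by default: "large" "medium" "small"

# Put the levels in a meaningful order
ordered_sizes <- fct_relevel(sizes, "small", "medium", "large")

# Order the levels by how often they occur
by_freq <- fct_infreq(sizes)

# Rename the levels (new name = "old name")
short <- fct_recode(sizes, S = "small", M = "medium", L = "large")

# Check the recoding: every count should sit in the expected cell
check <- table(original = sizes, recoded = short)
check
```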

outer()
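
outer() applies a vectorised two-argument function to every pair of elements from two vectors and returns the results as a matrix. A minimal base R sketch:

```r
# Default FUN is multiplication: element [i, j] is x[i] * y[j]
m <- outer(1:3, 1:4)
dim(m)     # 3 4
m[2, 3]    # 6

# Any vectorised function of two arguments works
s <- outer(1:3, 1:4, FUN = function(i, j) i + 10 * j)
s[3, 2]    # 23

# Including functions that return character values
labels <- outer(c("a", "b"), 1:3, paste0)
labels[2, 3]  # "b3"
```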

Genomic sequence

  • chartr
> yourSeq <- "AAAACCCGGGTTTNNN"
> chartr("ACGT", "TGCA", yourSeq)
[1] "TTTTGGGCCCAAANNN"
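
chartr() above gives the complement only. A reverse complement also needs the characters reversed, which can be sketched in base R as follows; the function name reverse_complement is made up for this example.

```r
# Hypothetical helper: complement with chartr(), then reverse each string
reverse_complement <- function(seqs) {
  comp <- chartr("ACGTacgt", "TGCAtgca", seqs)
  vapply(
    strsplit(comp, ""),
    function(chars) paste(rev(chars), collapse = ""),
    character(1)
  )
}

rc <- reverse_complement("AAAACCCGGGTTTNNN")
rc  # "NNNAAACCCGGGTTTT"
```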

broom

  • broom: Convert Statistical Analysis Objects into Tidy Tibbles
  • Especially the tidy() function.
    R> str(survfit(Surv(time, status) ~ x, data = aml))
    List of 17
     $ n        : int [1:2] 11 12
     $ time     : num [1:20] 9 13 18 23 28 31 34 45 48 161 ...
     $ n.risk   : num [1:20] 11 10 8 7 6 5 4 3 2 1 ...
     $ n.event  : num [1:20] 1 1 1 1 0 1 1 0 1 0 ...
    ...
    
    R> tidy(survfit(Surv(time, status) ~ x, data = aml))
    # A tibble: 20 x 9
        time n.risk n.event n.censor estimate std.error conf.high conf.low strata         
       <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>          
     1     9     11       1        0   0.909     0.0953     1       0.754  x=Maintained   
     2    13     10       1        1   0.818     0.142      1       0.619  x=Maintained   
    ...
    18    33      3       1        0   0.194     0.627      0.664   0.0569 x=Nonmaintained
    19    43      2       1        0   0.0972    0.945      0.620   0.0153 x=Nonmaintained
    20    45      1       1        0   0       Inf         NA      NA      x=Nonmaintained
    
  • Tables from journal papers
  • Multiple univariate models
    library(tidyverse)
    library(broom)
    
    mtcars %>%
      select(-mpg) %>%
      names() %>%
      map_dfr(~ tidy(lm(as.formula(paste("mpg ~", .x)), data = mtcars)))
    # A tibble: 20 × 5
    #   term        estimate std.error statistic  p.value
    #   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    # 1 (Intercept)  37.9      2.07       18.3   8.37e-18
    # 2 cyl          -2.88     0.322      -8.92  6.11e-10
    # 3 (Intercept)  29.6      1.23       24.1   3.58e-21
    # 4 disp         -0.0412   0.00471    -8.75  9.38e-10
    
  • Multivariate model
    lm(mpg ~ ., data = mtcars) |> tidy()
    # A tibble: 11 × 5
    #   term        estimate std.error statistic p.value
    #   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
    # 1 (Intercept)  12.3      18.7        0.657  0.518 
    # 2 cyl          -0.111     1.05      -0.107  0.916 
    # 3 disp          0.0133    0.0179     0.747  0.463
    

lobstr package - dig into the internal representation and structure of R objects

lobstr 1.0.0

Other packages

Great R packages for data import, wrangling, and visualization

Great R packages for data import, wrangling, and visualization

cli package

tidytext

https://juliasilge.shinyapps.io/learntidytext/

tidytuesdayR

install.packages("tidytuesdayR")
library("tidytuesdayR")
tt_datasets(2020)  # get the exact day of the data we want to load
coffee_ratings <- tt_load("2020-07-07")
print(coffee_ratings)  #  readme(coffee_ratings)

janitor

funneljoin