Tidyverse: Difference between revisions
Line 251: | Line 251: | ||
** filter(), [https://tidyr.tidyverse.org/reference/gather.html gather()], bind_rows(), arrange() | ** filter(), [https://tidyr.tidyverse.org/reference/gather.html gather()], bind_rows(), arrange() | ||
** group_by(), summarize() | ** group_by(), summarize() | ||
** [https://www.rdocumentation.org/packages/rdbnomics/versions/0.5.0/topics/rdb rdb()], select(), spread(), mutate(), select(), gather() | ** [https://www.rdocumentation.org/packages/rdbnomics/versions/0.5.0/topics/rdb rdb()], [https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/year lubricate::year()], select(), spread(), mutate(), select(), gather() | ||
** filter(), [https://dplyr.tidyverse.org/reference/join.html full_join()], transmute(), !is.na() | ** filter(), [https://dplyr.tidyverse.org/reference/join.html full_join()], transmute(), !is.na() | ||
** [https://dplyr.tidyverse.org/reference/bind.html bind_rows()], mutate() | ** [https://dplyr.tidyverse.org/reference/bind.html bind_rows()], mutate() |
Revision as of 17:52, 18 December 2019
Tidyverse
Import | | readr, readxl | haven, DBI, httr +----- Visualize ------+ | | ggplot2, ggvis | | | | Tidy ------------- Transform tibble dplyr Model tidyr | broom +------ Model ---------+
- TidyverseSkeptic by Norm Matloff
- R for Data Science and tidyverse package (it is a collection of ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr & forcats 8 packages).
- tidyverse, among others, was used at Mining CRAN DESCRIPTION Files (tbl_df(), %>%, summarise(), count(), mutate(), arrange(), unite(), ggplot(), filter(), select(), ...). Note that there is a problem to reproduce the result. I need to run cran <- cran[, -14] to remove the MD5sum column.
- Compile R for Data Science to a PDF
- Data Wrangling with dplyr and tidyr Cheat Sheet
- Data Wrangling with Tidyverse from the Harvard Chan School of Public Health.
- Best packages for data manipulation in R. It demonstrates to perform the same tasks using data.table and dplyr packages. data.table is faster and it may be a go-to package when performance and memory are the constraints.
- DATA MANIPULATION IN R by Alboukadel Kassambara
- subset data frame columns: pull() [return a vector], select() [return data frame], select_if(), other helper functions
- subset (filter) data frame rows: slice(), filter(), filter_all(), filter_if(), filter_at(), sample_n(), top_n()
- identify and remove duplicate rows: duplicated(), unique(), distinct()
- ordering rows: arrange(), desc()
- renaming and adding columns: rename()
- compute and add new variables to a data frame: mutate(), transmutate()
- computing summary statistics (pay to view)
Cheat sheet
The cheat sheets are downloaded from RStudio
- Data Transformation with dply
- Data Import
- Data Import with readr, tibble, and tidyr (not in RStudio anymore?)
Examples
A Gentle Introduction to Tidy Statistics in R
A Gentle Introduction to Tidy Statistics in R by Thomas Mock on RStudio webinar. Good coverage with step-by-step explanation. See part 1 & part 2 about the data and markdown document. All documents are available in github repository.
Task | R code | Graph |
---|---|---|
Load the libraries | library(tidyverse) library(readxl) library(broom) library(knitr) |
|
Read Excel file | raw_df <- readxl::read_xlsx("ad_treatment.xlsx") dplyr::glimpse(raw_df) |
|
Check distribution | g2 <- ggplot(raw_df, aes(x = age)) + geom_density(fill = "blue") g2 raw_df %>% summarize(min = min(age), max = max(age)) |
File:Check dist.svg |
Data cleaning | raw_df %>% summarize(na_count = sum(is.na(mmse))) |
|
Experimental variables
levels |
# check Ns and levels for our variables table(raw_df$drug_treatment, raw_df$health_status) table(raw_df$drug_treatment, raw_df$health_status, raw_df$sex) # tidy way of looking at variables raw_df %>% group_by(drug_treatment, health_status, sex) %>% count() |
|
Visual Exploratory
Data Analysis |
ggplot(data = raw_df, # add the data aes(x = drug_treatment, y = mmse, # set x, y coordinates color = drug_treatment)) + # color by treatment geom_boxplot() + facet_grid(~health_status) |
File:Onefacet.svg |
Summary Statistics | raw_df %>% glimpse() sum_df <- raw_df %>% mutate( sex = factor(sex, labels = c("Male", "Female")), drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's")) ) %>% group_by(sex, health_status, drug_treatment # group by categorical variables ) %>% summarize( mmse_mean = mean(mmse), # calc mean mmse_se = sd(mmse)/sqrt(n()) # calc standard error ) %>% ungroup() # ungrouping variable is a good habit to prevent errors kable(sum_df) write.csv(sum_df, "adx37_sum_stats.csv") |
|
Plotting summary
statistics |
g <- ggplot(data = sum_df, # add the data aes(x = drug_treatment, #set x, y coordinates y = mmse_mean, group = drug_treatment, # group by treatment color = drug_treatment)) + # color by treatment geom_point(size = 3) + # set size of the dots facet_grid(sex~health_status) # create facets by sex and status g |
File:Twofacets.svg |
ANOVA | # set up the statistics df stats_df <- raw_df %>% # start with data mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), sex = factor(sex, labels = c("Male", "Female")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's"))) glimpse(stats_df) # this gives main effects AND interactions ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, data = stats_df) summary(ad_aov) # this extracts ANOVA output into a nice tidy dataframe tidy_ad_aov <- tidy(ad_aov) # which we can save to Excel write.csv(tidy_ad_aov, "ad_aov.csv") |
|
Post-hocs | # pairwise t.tests ad_pairwise <- pairwise.t.test(stats_df$mmse, stats_df$sex:stats_df$drug_treatment:stats_df$health_status, p.adj = "none") # look at the posthoc p.values in a tidy dataframe kable(head(tidy(ad_pairwise))) # call and tidy the tukey posthoc tidy_ad_tukey <- tidy( TukeyHSD(ad_aov, which = 'sex:drug_treatment:health_status')) |
|
Publication plot | sig_df <- tribble( ~drug_treatment, ~ health_status, ~sex, ~mmse_mean, "Low dose", "Alzheimer's", "Male", 17, "High Dose", "Alzheimer's", "Male", 25, "Low dose", "Alzheimer's", "Female", 18, "High Dose", "Alzheimer's", "Female", 24 ) sig_df <- sig_df %>% mutate(drug_treatment = factor(drug_treatment, levels = c("Placebo", "Low dose", "High Dose")), sex = factor(sex, levels = c("Male", "Female")), health_status = factor(health_status, levels = c("Healthy", "Alzheimer's"))) sig_df # plot of cognitive function health and drug treatment g1 <- ggplot(data = sum_df, aes(x = drug_treatment, y = mmse_mean, fill = drug_treatment, group = drug_treatment)) + geom_errorbar(aes(ymin = mmse_mean - mmse_se, ymax = mmse_mean + mmse_se), width = 0.5) + geom_bar(color = "black", stat = "identity", width = 0.7) + facet_grid(sex~health_status) + theme_bw() + scale_fill_manual(values = c("white", "grey", "black")) + theme(legend.position = "NULL", legend.title = element_blank(), axis.title = element_text(size = 20), legend.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text = element_text(size = 12)) + geom_text(data = sig_df, label = "*", size = 8) + labs(x = "\nDrug Treatment", y = "Cognitive Function (MMSE)\n", caption = "\nFigure 1. Effect of novel drug treatment AD-x37 on cognitive function in healthy and demented elderly adults. \nn = 100/treatment group (total n = 600), * indicates significance at p < 0.001") g1 # save the graph! ggsave("ad_publication_graph.png", g1, height = 7, width = 8, units = "in") |
File:Ad public.svg |
Opioid prescribing habits in texas
https://juliasilge.com/blog/texas-opioids/.
- It can read multiple sheets (27 sheets) at a time and merge them by rows.
- case_when(): A general vectorised if
- fill()
- bind_rows()
- full_join(), left_join(), right_join(), inner_join(). See the exercises from Useful dplyr functions (with examples). Suppose df1=50x3, df2=45x3 with 25 overlaps. Then left_join=50x5, right_join=45x5, inner_join=25x5, full_join=70x5.
- gather()
- replace_na()
- str_to_title()
- count()
- top_n()
- kable()
Useful dplyr functions (with examples)
https://sw23993.wordpress.com/2017/07/10/useful-dplyr-functions-wexamples/
Supervised machine learning case studies in R
Supervised machine learning case studies in R - A Free, Interactive Course Using Tidy Tools.
Time series data
- Automating update of a fiscal database for the Euro Area
- readxl::read_excel()
- transmute(), as.Date()
- filter(), is.na()
- na.omit(), first()
- filter(), gather(), bind_rows(), arrange()
- group_by(), summarize()
- rdb(), lubricate::year(), select(), spread(), mutate(), select(), gather()
- filter(), full_join(), transmute(), !is.na()
- bind_rows(), mutate()
- chain() (deprecated!)
- ungroup()
- tibble(), left_join()
- Exploring eu wide data on new car registrations and co2 efficiency (data is available)
Install on Ubuntu
How to install Tidyverse on Ubuntu 16.04 and 17.04
# Ubuntu >= 18.04. However, I get unmet dependencies errors on R 3.5.3. # r-cran-curl : Depends: r-api-3.4 sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2 # Works fine on Ubuntu 16.04, 18.04 sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev
80 R packages will be installed after tidyverse has been installed.
Install on Raspberry Pi/(ARM based) Chromebook
In additional to the requirements of installing on Ubuntu, I got an error when it is installing a dependent package fs: undefined symbol: pthread_atfork. The fs package version is 1.2.6. The solution is to add one line in fs/src/Makevars file and then install the "fs" package using the source on the local machine.
5 most useful data manipulation functions
- subset() for making subsets of data (natch)
- merge() for combining data sets in a smart and easy way
- melt()-reshape2 package for converting from wide to long data formats. See an example here where we want to combine multiple columns of values into 1 column. melt() is replaced by gather().
- dcast()-reshape2 package for converting from long to wide data formats (or just use tapply()), and for making summary tables
- ddply()-plyr package for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations
Miscellaneous examples using tibble or dplyr packages
Move a column to rownames
?tibble::column_to_rownames
# It assumes the input data frame has no row names; otherwise we will get # Error: `df` must be a data frame without row names in `column_to_rownames()` # tibble::column_to_rownames(data.frame(x=letters[1:5], y = rnorm(5)), "x")
Move rownames to a variable
https://tibble.tidyverse.org/reference/rownames.html
tibble::rownames_to_column(trees, "newVar")
Rename variables
dplyr::rename(os, newName = oldName)
Drop a variable
select(df, -x)
Remove rownames
> tibble::has_rownames(trees) [1] FALSE > rownames(trees) [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" [31] "31" > rownames(trees) <- NULL > rownames(trees) [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" [31] "31"
data.table
Fast aggregation of large data (e.g. 100GB in RAM or just several GB size file), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread).
Note: data.table has its own ways (cf base R and dplyr) to subset columns.
Some resources:
- https://www.rdocumentation.org/packages/data.table/versions/1.12.0
- cookbook
- R Packages: dplyr vs data.table
- Cheat sheet from RStudio
- Reading large data tables in R. fread(FILENAME)
- Note that 'x[, 2] always return 2. If you want to do the thing you want, use x[, 2, with=FALSE] or x[, V2] where V2 is the header name. See the FAQ #1 in data.table.
- Understanding data.table Rolling Joins
- Intro to The data.table Package
- Subsetting rows and/or columns
- Alternative to using tapply(), aggregate(), table() to summarize data
- Similarities to SQL, DT[i, j, by]
- R : data.table (with 50 examples) from ListenData
- Describe Data
- Selecting or Keeping Columns
- Rename Variables
- Subsetting Rows / Filtering
- Faster Data Manipulation with Indexing
- Performance Comparison
- Sorting Data
- Adding Columns (Calculation on rows)
- How to write Sub Queries (like SQL)
- Summarize or Aggregate Columns
- GROUP BY (Within Group Calculation)
- Remove Duplicates
- Extract values within a group
- SQL's RANK OVER PARTITION
- Cumulative SUM by GROUP
- Lag and Lead
- Between and LIKE Operator
- Merging / Joins
- Convert a data.table to data.frame
- R Tutorial: data.table from dezyre.com
- Syntax: DT[where, select|update|do, by]
- Keys and setkey()
- Fast grouping using j and by: DT[,sum(v),by=x]
- Fast ordered joins: X[Y,roll=TRUE]
- In the Introduction to data.table vignette, the data.table::order() function is SLOWER than base::order() from my Odroid xu4 (running Ubuntu 14.04.4 trusty on uSD)
odt = data.table(col=sample(1e7)) (t1 <- system.time(ans1 <- odt[base::order(col)])) ## uses order from base R # user system elapsed # 2.730 0.210 2.947 (t2 <- system.time(ans2 <- odt[order(col)])) ## uses data.table's order # user system elapsed # 2.830 0.215 3.052 (identical(ans1, ans2)) # [1] TRUE
- Boost Your Data Munging with R
- rbindlist(). One problem, it uses too much memory. In fact, when I try to analyze R package downloads, the command "dat <- rbindlist(logs)" uses up my 64GB memory (OS becomes unresponsive).
- Convenience features of fread
OpenMP enabled compiler for Mac. This instruction works on my Mac El Capitan (10.11.6) when I need to upgrade the data.table version from 1.11.4 to 1.11.6.
Question: how to make use multicore with data.table package?
dtplyr
https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/
reshape & reshape2 (superceded by tidyr package)
- Data Shape Transformation With Reshape()
- Use acast() function in reshape2 package. It will convert data.frame used for analysis to a table-like data.frame good for display.
- http://lamages.blogspot.com/2013/10/creating-matrix-from-long-dataframe.html
tidyr
Missing values
Handling Missing Values in R using tidyr
Pivot
unnest()
Benchmark
An evolution of reshape2. It's designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines.
- vignette("tidy-data") & Cheat sheet
- Main functions
- Reshape data: gather() & spread(). These two will be deprecated
- Split cells: separate() & unite()
- Handle missing: drop_na() & fill() & replace_na()
- Other functions
- tidyr::separate() function. If a cell contains many elements separated by ",", we can use this function to create more columns. The opposite function is unite().
- tidyr::separate_rows(). If a cell contains many elements separated by ",", we can use this function to create one more row. See the cheat sheet link above.
- http://blog.rstudio.org/2014/07/22/introducing-tidyr/
- http://rpubs.com/seandavi/GEOMetadbSurvey2014
- http://timelyportfolio.github.io/rCharts_factor_analytics/factors_with_new_R.html
- tidyr vs reshape2
- A tidyr Tutorial from U of Virginia
- Benchmarking cast in R from long data frame to wide matrix
Make wide tables long with gather() (see 6.3.1 of Efficient R Programming)
library(tidyr) library(efficient) data(pew) # wide table dim(pew) # 18 x 10, (religion, '<$10k', '$10--20k', '$20--30k', ..., '>150k') pewt <- gather(data = pew, key = Income, value = Count, -religion) dim(pew) # 162 x 3, (religion, Income, Count) args(gather) # function(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
where the three arguments of gather() requires:
- data: a data frame in which column names will become row vaues
- key: the name of the categorical variable into which the column names in the original datasets are converted.
- value: the name of cell value columns
In this example, the 'religion' column will not be included (-religion).
dplyr, plyr packages
- plyr package suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. Another additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of query returned.
- Essential functions: 3 rows functions, 3 column functions and 1 mixed function.
select, mutate, rename +------------------+ filter + + arrange + + group_by + + drop_na + + ungroup + summarise + +------------------+
- These functions works on data frames and tibble objects. Note stats package also has a filter() function for time series data. If we have not loaded the dplyr package, the filter() function below will give an error (count() also is from dplyr).
iris %>% filter(Species == "setosa") %>% count() head(iris %>% filter(Species == "setosa") %>% arrange(Sepal.Length))
- dplyr tutorial from PH525x series (Biomedical Data Science by Rafael Irizarry and Michael Love). For select() function, some additional options to select columns based on a specific criteria include
- start_with()/ ends_with() = Select columns that start/end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select columns names that are from a group of names
- Data Transformation in the book R for Data Science. Five key functions in the dplyr package:
- Filter rows: filter()
- Arrange rows: arrange()
- Select columns: select()
- Add new variables: mutate()
- Grouped summaries: group_by() & summarise()
# filter jan1 <- filter(flights, month == 1, day == 1) filter(flights, month == 11 | month == 12) filter(flights, arr_delay <= 120, dep_delay <= 120) df <- tibble(x = c(1, NA, 3)) filter(df, x > 1) filter(df, is.na(x) | x > 1) # arrange arrange(flights, year, month, day) arrange(flights, desc(arr_delay)) # select select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) # mutate flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) # if you only want to keep the new variables transmute(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # summarise() by_day <- group_by(flights, year, month, day) summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) # pipe. Note summarise() can return more than 1 variable. delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay, na.rm = TRUE))
- Videos
- Hands-on dplyr tutorial for faster data manipulation in R by Data School. At time 17:00, it compares the %>% operator, with() and aggregate() for finding group mean.
- https://youtu.be/aywFompr1F4 (shorter video) by Roger Peng
- https://youtu.be/8SGif63VW6E by Hadley Wickham
- Tidy eval: Programming with dplyr, tidyr, and ggplot2. Bang bang "!!" operator was introduced for use in a function call.
- Efficient R Programming
- Data wrangling: Transformation from R-exercises.
- Express Intro to dplyr by rollingyours.
- the dot.
- stringr and plyr A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors.
- https://randomjohn.github.io/r-maps-with-census-data/ dplyr and stringr are used
- 5 interesting subtle insights from TED videos data analysis in R
- What is tidy eval and why should I care?
stringr
- https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
- stringr Cheat sheet (2 pages, this will immediately download the pdf file)
- Efficient data carpentry → Regular expressions from Efficient R programming book by Gillespie & Lovelace.
magrittr
- Vignettes
- magrittr and wrapr Pipes in R, an Examination. Instead of nested statements, it is using pipe operator %>%. So the code is easier to read. Impressive!
x %>% f # f(x) x %>% f(y) # f(x, y) x %>% f(arg=y) # f(x, arg=y) x %>% f(z, .) # f(z, x) x %>% f(y) %>% g(z) # g(f(x, y), z) x %>% select(which(colSums(!is.na(.))>0)) # remove columns with all missing data x %>% select(which(colSums(!is.na(.))>0)) %>% filter((rowSums(!is.na(.))>0)) # remove all-NA columns _and_ rows
suppressPackageStartupMessages(library("dplyr")) starwars %>% filter(., height > 200) %>% select(., height, mass) %>% head(.) # instead of starwars %>% filter(height > 200) %>% select(height, mass) %>% head
iris$Species iris[["Species"]] iris %>% `[[`("Species") iris %>% `[[`(5) iris %>% subset(select = "Species")
- Split-apply-combine: group + summarize + sort/arrange + top n. The following example is from Efficient R programming.
data(wb_ineq, package = "efficient") wb_ineq %>% filter(grepl("g", Country)) %>% group_by(Year) %>% summarise(gini = mean(gini, na.rm = TRUE)) %>% arrange(desc(gini)) %>% top_n(n = 5)
- Writing Pipe-friendly Functions
- http://rud.is/b/2015/02/04/a-step-to-the-right-in-r-assignments/
- http://rpubs.com/tjmahr/pipelines_2015
- http://danielmarcelino.com/i-loved-this-crosstable/
- http://moderndata.plot.ly/using-the-pipe-operator-in-r-with-plotly/
- RMSE
f <- function(x) { (y - x) %>% '^'(2) %>% sum %>% '/'(length(x)) %>% sqrt %>% round(2) }
# Examples from R for Data Science-Import, Tidy, Transform, Visualize, and Model diamonds <- ggplot2::diamonds diamonds2 <- diamonds %>% dplyr::mutate(price_per_carat = price / carat) pryr::object_size(diamonds) pryr::object_size(diamonds2) pryr::object_size(diamonds, diamonds2) rnorm(100) %>% matrix(ncol = 2) %>% plot() %>% str() rnorm(100) %>% matrix(ncol = 2) %T>% plot() %>% str() # 'tee' pipe # %T>% works like %>% except that it returns the lefthand side (rnorm(100) %>% matrix(ncol = 2)) # instead of the righthand side. # If a function does not have a data frame based api, you can use %$%. # It explodes out the variables in a data frame. mtcars %$% cor(disp, mpg) # For assignment, magrittr provides the %<>% operator mtcars <- mtcars %>% transform(cyl = cyl * 2) # can be simplified by mtcars %<>% transform(cyl = cyl * 2)
Upsides of using magrittr: no need to create intermediate objects, code is easy to read.
When not to use the pipe
- your pipes are longer than (say) 10 steps
- you have multiple inputs or outputs
- Functions that use the current environment: assign(), get(), load()
- Functions that use lazy evaluation: tryCatch(), try()
purrr: : Functional Programming Tools
- https://purrr.tidyverse.org/
- purrr cookbook
- Functional programming (cf Object-Oriented Programming)
- Getting started with the purrr package in R, especially the map() function.
- map_dfr() function from "The Joy of Functional Programming (for Data Science)" with Hadley Wickham. It can be used to replace a loop.
data <- map(paths, read.csv) data <- map_dfr(paths, read.csv, id = "path") out1 <- mtcars %>% map_dbl(mean, na.rm = TRUE) out2 <- mtcars %>% map_dbl(median, na.rm = TRUE)
forcats
https://forcats.tidyverse.org/
JAMA retraction after miscoding – new Finalfit function to check recoding
outer()
Genomic sequence
- chartr
> yourSeq <- "AAAACCCGGGTTTNNN" > chartr("ACGT", "TGCA", yourSeq) [1] "TTTTGGGCCCAAANNN"
broom
broom: Convert Statistical Analysis Objects into Tidy Tibbles