Revision as of 09:48, 14 January 2019

Courses

STAT 430: Topics in Applied Statistics by Dirk Eddelbuettel

How to prepare data for collaboration

The paper "How to share data for collaboration" is a good starting point; page 7 in particular gives coding guidelines for raw-data variables:

naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
coding variables: be consistent and avoid spelling errors
date and time: use YYYY-MM-DD (ISO 8601 standard). Excel interprets a gene symbol such as "Oct-4" as a date and reformats it.
missing data: code it as "NA". Do not leave any cells blank.
code book: keep any lengthy explanation of the variables in a separate code book file (*.docx for example). See p5 for an example.
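A minimal sketch of how a file that follows these guidelines reads into R. The column names and values are invented for illustration and do not come from the paper.

```r
# Hypothetical raw data: underscore-only names, ISO 8601 dates, explicit NA.
csv_text <- "sample_id,gene_symbol,collection_date,expression_level
s01,Oct-4,2019-01-14,2.5
s02,Sox2,2019-01-15,NA
s03,Nanog,NA,3.1"

raw <- read.csv(
  text = csv_text,
  na.strings = "NA",         # missing values are coded explicitly as NA
  stringsAsFactors = FALSE   # keep gene symbols as plain character strings
)

# ISO 8601 dates parse unambiguously; a missing date stays NA.
raw$collection_date <- as.Date(raw$collection_date, format = "%Y-%m-%d")

# Check that every column name uses only letters, digits and underscores.
names_ok <- grepl("^[A-Za-z][A-Za-z0-9_]*$", names(raw))
```

Read this way, the gene symbol "Oct-4" stays a character string; R does not convert it to a date on import.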

Five types of data:

continuous
ordinal
categorical
missing
censored
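One possible way to represent the five types in a single R data frame, as a sketch with invented variable names and values. Censoring is stored here as a follow-up time plus an event indicator, which is a common convention but only one of several.

```r
# Illustrative values only; all variable names are hypothetical.
patients <- data.frame(
  weight_kg = c(70.2, 81.5, NA, 64.0),             # continuous, one value missing (NA)
  tumor_grade = factor(
    c("low", "high", "medium", "low"),
    levels = c("low", "medium", "high"),
    ordered = TRUE                                 # ordinal: levels have an order
  ),
  blood_type = factor(c("A", "O", "B", "A")),      # categorical: no order
  followup_days = c(120, 365, 200, 365),           # time under observation
  event_observed = c(TRUE, FALSE, TRUE, FALSE)     # FALSE = right-censored at followup_days
)

str(patients)
```

The survival package's Surv() is the usual container for such time and status pairs when fitting models.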

Some extra guidelines from "Data organization in spreadsheets" (the paper appears in The American Statistician):

No empty cells
Put one thing in a cell
Make a rectangle
No calculation in the raw data files
Create a data dictionary (same as code book)
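The "no empty cells" and "make a rectangle" rules can be checked mechanically before analysis. The following is a rough sketch on invented CSV lines, not a procedure from the paper.

```r
# Hypothetical file contents: row 2 has a blank cell, row 3 is one field short.
csv_lines <- c("id,height_cm,group", "1,170,a", "2,,b", "3,165")

# "Make a rectangle": every line should have the same number of fields.
con <- textConnection(csv_lines)
n_fields <- count.fields(con, sep = ",")
close(con)
is_rectangle <- length(unique(n_fields)) == 1

# "No empty cells": read everything as text and look for empty strings.
cells <- read.csv(text = csv_lines, colClasses = "character", na.strings = "NA")
has_empty_cell <- any(cells == "", na.rm = TRUE)
```

Reading every column as character is deliberate here: in numeric columns read.csv() would turn a blank cell into NA, which hides the difference between a blank cell and an explicit "NA".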

complete.cases()

Count the number of rows in a data frame that have missing values with

sum(!complete.cases(dF))

> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
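The same idea on a data frame, as a small sketch with invented data: count the incomplete rows, keep only the complete ones, and tally missing values per column.

```r
# Hypothetical data frame with two incomplete rows.
dF <- data.frame(
  id = 1:4,
  height_cm = c(170, NA, 165, 180),
  group = c("a", "b", NA, "b")
)

n_incomplete <- sum(!complete.cases(dF))    # rows with at least one NA
complete_rows <- dF[complete.cases(dF), ]   # keep only fully observed rows
na_per_column <- colSums(is.na(dF))         # number of NAs in each column
```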

Wrangling categorical data in R

https://peerj.com/preprints/3163.pdf

Some approaches:

options(stringsAsFactors = FALSE) (this is already the default since R 4.0.0)
Use the tidyverse package
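A short sketch of what the stringsAsFactors setting changes, passed per call here instead of through options(). The labels are reused from the example on this page; the data frame itself is invented.

```r
labels <- c("Working fulltime", "Working parttime")

# Strings converted to a factor (the default before R 4.0.0).
as_factor <- data.frame(status = labels, stringsAsFactors = TRUE)

# Strings kept as plain character data.
as_text <- data.frame(status = labels, stringsAsFactors = FALSE)

class(as_factor$status)
class(as_text$status)
```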

Base R approach:

# Read the data; stringsAsFactors = TRUE keeps LaborStatus a factor on R >= 4.0.0
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
# Work on a copy so the original column stays untouched
GSS$BaseLaborStatus <- GSS$LaborStatus
# Inspect the existing labels and their counts
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, fix the labels one by one, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
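An alternative base R sketch that relabels the factor levels through a named lookup vector instead of one assignment per label. It runs on a small invented vector, because the real file is not shown here; the "Retired" level is made up to show that unlisted levels pass through unchanged.

```r
# Hypothetical stand-in for GSS$LaborStatus.
labor_status <- factor(c("Working fulltime", "Temp not working",
                         "Working parttime", "Retired", "Working fulltime"))

# Lookup vector: names are the old labels, values are the new ones.
relabel <- c(
  "Temp not working" = "Temporarily not working",
  "Unempl, laid off" = "Unemployed, laid off",
  "Working fulltime" = "Working full time",
  "Working parttime" = "Working part time"
)

# Replace only the levels that appear in the lookup; leave the rest alone.
old_levels <- levels(labor_status)
to_change <- old_levels %in% names(relabel)
new_levels <- old_levels
new_levels[to_change] <- relabel[old_levels[to_change]]
levels(labor_status) <- new_levels
```

Editing the levels directly avoids the round trip through as.character() and keeps all label changes in one place.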

Tidyverse approach:

library(dplyr)  # provides %>%, mutate() and recode()

GSS <- GSS %>%
    mutate(tidyLaborStatus =
        recode(LaborStatus,
            `Temp not working` = "Temporarily not working",
            `Unempl, laid off` = "Unemployed, laid off",
            `Working fulltime` = "Working full time",
            `Working parttime` = "Working part time"))
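With either approach, an old label that is mistyped (for example with a stray trailing space) matches nothing, so the affected values keep their original label. A small base R sketch for catching this before recoding, using invented observed values:

```r
# Hypothetical values as they appear in the data.
observed <- c("Working fulltime", "Working parttime", "Temp not working")

# Old labels as typed in the recoding step; the last one has a trailing space.
old_labels <- c("Temp not working", "Working fulltime", "Working parttime ")

# Labels that match no observed value would be silently skipped.
unmatched <- setdiff(old_labels, unique(observed))

# Trimming whitespace first removes this particular kind of mismatch.
unmatched_trimmed <- setdiff(trimws(old_labels), unique(observed))
```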

@@ Line 1: / Line 1: @@
+= Courses =
 [https://stat430.com/ STAT 430: Topics in Applied Statistics] by Dirk Eddelbuettel
+= How to prepare data for collaboration =
+The paper [https://peerj.com/preprints/3139.pdf How to share data for collaboration] is a good starting point; [https://peerj.com/preprints/3139.pdf#page=7 page 7] in particular gives coding guidelines for raw-data variables:
+* naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
+* coding variables: be consistent and avoid spelling errors
+* date and time: use YYYY-MM-DD (ISO 8601 standard). Excel interprets a gene symbol such as "Oct-4" as a date and reformats it.
+* missing data: code it as "NA". Do not leave any cells blank.
+* '''code book''': keep any lengthy explanation of the variables in a separate code book file (*.docx for example). See p5 for an example.
+Five types of data:
+* continuous
+* ordinal
+* categorical
+* missing
+* censored
+Some extra guidelines from [https://peerj.com/preprints/3183/ Data organization in spreadsheets] (the paper appears in [https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 The American Statistician]):
+* No empty cells
+* Put one thing in a cell
+* Make a rectangle
+* No calculation in the raw data files
+* Create a '''data dictionary''' (same as '''code book''')
+= [https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/complete.cases complete.cases()] =
+Count the number of rows in a data frame that have missing values with
+<syntaxhighlight lang='rsplus'>
+sum(!complete.cases(dF))
+</syntaxhighlight>
+<pre>
+> tmp <- matrix(1:6, 3, 2)
+> tmp
+     [,1] [,2]
+[1,]    1    4
+[2,]    2    5
+[3,]    3    6
+> tmp[2,1] <- NA
+> complete.cases(tmp)
+[1]  TRUE FALSE  TRUE
+</pre>
+= Wrangling categorical data in R =
+https://peerj.com/preprints/3163.pdf
+Some approaches:
+* options(stringsAsFactors = FALSE) (this is already the default since R 4.0.0)
+* Use the '''tidyverse''' package
+Base R approach:
+<syntaxhighlight lang='rsplus'>
+GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)  # keeps LaborStatus a factor on R >= 4.0.0
+GSS$BaseLaborStatus <- GSS$LaborStatus
+levels(GSS$BaseLaborStatus)
+summary(GSS$BaseLaborStatus)
+GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
+GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
+GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
+GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
+GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
+GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
+</syntaxhighlight>
+Tidyverse approach:
+<syntaxhighlight lang='rsplus'>
+library(dplyr)  # provides %>%, mutate() and recode()
+GSS <- GSS %>%
+    mutate(tidyLaborStatus =
+        recode(LaborStatus,
+            `Temp not working` = "Temporarily not working",
+            `Unempl, laid off` = "Unemployed, laid off",
+            `Working fulltime` = "Working full time",
+            `Working parttime` = "Working part time"))
+</syntaxhighlight>

Data science: Difference between revisions
