Data science
= Courses =
[https://stat430.com/ STAT 430: Topics in Applied Statistics] by Dirk Eddelbuettel
= How to prepare data for collaboration =
[https://peerj.com/preprints/3139.pdf How to share data for collaboration]. [https://peerj.com/preprints/3139.pdf#page=7 Page 7] in particular has guidelines for coding raw-data variables:
* naming variables: use meaningful variable names, no spaces in column headers, and no separators other than an underscore
* coding variables: be consistent and avoid spelling errors
* date and time: YYYY-MM-DD (ISO 8601 standard). Excel interprets a gene symbol such as "Oct-4" as a date and reformats it.
* missing data: use "NA". Do not leave any cells blank.
* using a '''code book''' file (*.docx for example): put any lengthy explanation about variables here. See p5 for an example.
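Several of the guidelines above can be applied mechanically before a file is shared. The following is a minimal base R sketch, not taken from the paper; the data frame <code>raw</code>, its column names, and the input date format are made up for illustration.

<syntaxhighlight lang='rsplus'>
# Hypothetical raw data: awkward headers, a blank cell, and a non-ISO date
raw <- data.frame(
  "Sample ID"  = c("s1", "s2", "s3"),
  "visit.date" = c("01/14/2019", "01/15/2019", "01/16/2019"),
  "gene"       = c("Oct-4", "", "Sox2"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Naming: lower case, with every run of non-alphanumeric characters
# replaced by a single underscore
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))

# Missing data: turn blank cells into explicit NA
raw[raw == ""] <- NA

# Date and time: convert month/day/year strings to ISO 8601 (YYYY-MM-DD)
raw$visit_date <- format(as.Date(raw$visit_date, format = "%m/%d/%Y"), "%Y-%m-%d")

names(raw)
raw
</syntaxhighlight>

Writing the result with <code>write.csv(raw, file, row.names = FALSE)</code> stores the missing cell as the literal text NA, which matches the "missing data" guideline.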
Five types of data:
* continuous
* ordinal
* categorical
* missing
* censored
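As a rough illustration of how these five types might be stored in R, here is a small sketch. The data frame <code>d</code> and all of its columns are hypothetical, and coding censoring as a follow-up time plus a 0/1 event indicator is an assumption on my part, not something the paper prescribes.

<syntaxhighlight lang='rsplus'>
# Hypothetical study data illustrating the five types
d <- data.frame(
  weight_kg = c(61.2, 75.8, NA, 82.4),             # continuous; NA marks a missing value
  severity  = factor(c("mild", "severe", "moderate", "mild"),
                     levels = c("mild", "moderate", "severe"),
                     ordered = TRUE),              # ordinal: levels have an order
  site      = factor(c("A", "B", "A", "C")),       # categorical: unordered levels
  time_days = c(120, 365, 200, 365),               # follow-up time in days
  event     = c(1, 0, 1, 0)                        # 1 = event observed, 0 = censored
)

str(d)
is.ordered(d$severity)      # TRUE
is.ordered(d$site)          # FALSE
sum(is.na(d$weight_kg))     # 1
</syntaxhighlight>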
Some extra tips from [https://peerj.com/preprints/3183/ Data organization in spreadsheets] (the paper appears in [https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 The American Statistician]):
* No empty cells
* Put one thing in a cell
* Make a rectangle
* No calculations in the raw data files
* Create a '''data dictionary''' (same as a '''code book''')
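A data dictionary can itself be a small rectangular table kept next to the raw data. The sketch below shows one possible layout with one row per variable; the column names <code>variable</code>, <code>type</code> and <code>description</code>, and the example data, are assumptions for illustration only.

<syntaxhighlight lang='rsplus'>
# Hypothetical raw data
raw <- data.frame(
  sample_id  = c("s1", "s2"),
  visit_date = c("2019-01-14", "2019-01-15"),
  weight_kg  = c(61.2, NA),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage type, and a hand-written description
dictionary <- data.frame(
  variable    = names(raw),
  type        = unname(vapply(raw, function(x) class(x)[1], character(1))),
  description = c("Unique sample identifier",
                  "Visit date, ISO 8601 (YYYY-MM-DD)",
                  "Body weight in kilograms; NA = not measured"),
  stringsAsFactors = FALSE
)

# Store the dictionary in its own file so the raw data file stays untouched
write.csv(dictionary, file.path(tempdir(), "data_dictionary.csv"), row.names = FALSE)
dictionary
</syntaxhighlight>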
= [https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/complete.cases complete.cases()] =
Count the number of rows in a data frame that have missing values with:
<syntaxhighlight lang='rsplus'>
sum(!complete.cases(dF))
</syntaxhighlight>
<pre>
> tmp <- matrix(1:6, 3, 2)
> tmp
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> tmp[2,1] <- NA
> complete.cases(tmp)
[1]  TRUE FALSE  TRUE
</pre>
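The same idea applies to a data frame. The following short sketch uses a made-up data frame <code>dF</code> to count, locate, and drop incomplete rows.

<syntaxhighlight lang='rsplus'>
# Hypothetical data frame with missing values in two different rows
dF <- data.frame(x = c(1, NA, 3, 4),
                 y = c("a", "b", NA, "d"))

sum(!complete.cases(dF))     # 2: number of rows with at least one NA
which(!complete.cases(dF))   # 2 3: positions of those rows

# Keep only the complete rows (the same rows that na.omit(dF) keeps)
clean <- dF[complete.cases(dF), ]
nrow(clean)                  # 2
</syntaxhighlight>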
= Wrangling categorical data in R =
https://peerj.com/preprints/3163.pdf
Some approaches:
* options(stringsAsFactors=FALSE) (this is the default since R 4.0.0)
* Use the '''tidyverse''' package
Base R approach:
<syntaxhighlight lang='rsplus'>
# Since R 4.0.0 read.csv() no longer creates factors by default,
# so request them explicitly to make levels() and summary() informative
GSS <- read.csv("XXX.csv", stringsAsFactors = TRUE)
GSS$BaseLaborStatus <- GSS$LaborStatus
levels(GSS$BaseLaborStatus)
summary(GSS$BaseLaborStatus)
# Convert to character, fix the labels, then convert back to a factor
GSS$BaseLaborStatus <- as.character(GSS$BaseLaborStatus)
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Temp not working"] <- "Temporarily not working"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Unempl, laid off"] <- "Unemployed, laid off"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working fulltime"] <- "Working full time"
GSS$BaseLaborStatus[GSS$BaseLaborStatus == "Working parttime"] <- "Working part time"
GSS$BaseLaborStatus <- factor(GSS$BaseLaborStatus)
</syntaxhighlight>
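A base R alternative is to rename the factor levels in place, which avoids the round trip through character. This is a sketch on toy data standing in for <code>GSS$LaborStatus</code>; the lookup vector <code>relabel</code> is a name introduced here for illustration.

<syntaxhighlight lang='rsplus'>
# Toy stand-in for GSS$LaborStatus (the real file is not reproduced here)
LaborStatus <- factor(c("Working fulltime", "Temp not working",
                        "Working parttime", "Working fulltime"))

# Named lookup: old label -> new label
relabel <- c("Temp not working" = "Temporarily not working",
             "Working fulltime" = "Working full time",
             "Working parttime" = "Working part time")

BaseLaborStatus <- LaborStatus
old <- levels(BaseLaborStatus)
hit <- old %in% names(relabel)          # levels that have a replacement
levels(BaseLaborStatus)[hit] <- unname(relabel[old[hit]])

table(BaseLaborStatus)
</syntaxhighlight>

Levels without an entry in the lookup are left unchanged, and the underlying integer codes of the factor are not touched.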
Tidyverse approach:
<syntaxhighlight lang='rsplus'>
library(dplyr)  # provides %>%, mutate() and recode()
GSS <- GSS %>%
  mutate(tidyLaborStatus =
           recode(LaborStatus,
                  `Temp not working` = "Temporarily not working",
                  `Unempl, laid off` = "Unemployed, laid off",
                  `Working fulltime` = "Working full time",
                  `Working parttime` = "Working part time"))
</syntaxhighlight>
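To try the tidyverse approach without the original file, here is a self-contained sketch on toy data; it assumes the dplyr package is installed, and the three example values are made up. Note that <code>recode()</code> matches the backtick-quoted names exactly, so stray whitespace inside a name prevents a match.

<syntaxhighlight lang='rsplus'>
library(dplyr)

# Toy stand-in for the GSS data
GSS <- data.frame(LaborStatus = c("Working fulltime", "Working parttime", "Retired"),
                  stringsAsFactors = FALSE)

GSS <- GSS %>%
  mutate(tidyLaborStatus = recode(LaborStatus,
                                  `Working fulltime` = "Working full time",
                                  `Working parttime` = "Working part time"))

# Values without a replacement ("Retired") pass through unchanged
GSS
</syntaxhighlight>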
Revision as of 09:48, 14 January 2019