Regression
Linear Regression
- Regression Models for Data Science in R by Brian Caffo
- Regression and Other Stories (book) by Andrew Gelman, Jennifer Hill, Aki Vehtari
Comic
MSE
- Mean squared error (within-sample)
- Mean squared prediction error (out-of-sample)
- Is MSE decreasing with increasing number of explanatory variables?
- Calculate (Root) Mean Squared Error in R (5 Examples)
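A minimal sketch of the two quantities above, using the built-in mtcars data: within-sample MSE/RMSE from the residuals, and an out-of-sample mean squared prediction error from a simple holdout split (the split itself is arbitrary).
<syntaxhighlight lang='rsplus'>
# Within-sample MSE and RMSE from an lm fit
fit  <- lm(mpg ~ wt, data = mtcars)
mse  <- mean(residuals(fit)^2)
rmse <- sqrt(mse)
c(MSE = mse, RMSE = rmse)

# Out-of-sample mean squared prediction error on a holdout split
set.seed(1)
idx  <- sample(nrow(mtcars), 22)                 # 22 rows for training
fit2 <- lm(mpg ~ wt, data = mtcars[idx, ])
mean((mtcars$mpg[-idx] - predict(fit2, mtcars[-idx, ]))^2)
</syntaxhighlight>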
Coefficient of determination R2
- https://en.wikipedia.org/wiki/Coefficient_of_determination.
- R2 is expressed as the ratio of the explained variance to the total variance.
- It is a statistical measure of how well the regression predictions approximate the real data points.
- See the wikipedia page for a list of caveats of R2 including correlation does not imply causation.
- Avoid R-squared to judge regression model performance
- R2 = 1 - RSS/TSS, where RSS is the residual sum of squares and TSS the total sum of squares (model based)
- R2(x~y) = R2(y~x)
- R2 = Pearson correlation^2 (not model based)
- summary(lm())$r.squared. See Extract R-square value with R in linear models
- How to interpret root mean squared error (RMSE) vs standard deviation? R2 value.
- coefficient of determination R^2 (can be negative?)
- The R-squared and nonlinear regression: a difficult marriage?
- How R2 and RMSE are calculated in cross-validation of pls R
- How to Interpret R Squared and Goodness of Fit in Regression Analysis
- Limitations of R-squared: R-squared does not inform if the regression model has an adequate fit or not.
- Low R-squared and High R-squared values. A regression model with high R2 value can lead to – as the statisticians call it – specification bias. See an example.
- Five Reasons Why Your R-squared can be Too High
- R-squared is a biased estimate. R-squared estimates tend to be greater than the correct population value.
- Overfitting your model. This problem occurs when the model is too complex.
- Data mining and chance correlations. Multiple hypotheses.
- Trends in Panel (Time Series) Data
- Form of a Variable - include a different form of the same variable for both the dependent variable and an independent variable.
- Relationship between R squared and Pearson correlation coefficient
- What is the difference between Pearson R and Simple Linear Regression?
- R2 and MSE
- [math]\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }[/math]
- lasso/glmnet
- Beware of R2: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models Alexander 2015. Cites Golbraikh and Tropsha, who showed that the leave-one-out cross-validation R2 (denoted q2 in this case), calculated on training set data, is not a reliable characteristic of model predictivity.
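A minimal sketch relating the quantities above (R2 from sums of squares, summary(lm())$r.squared, and the squared Pearson correlation for a simple regression), using the built-in mtcars data:
<syntaxhighlight lang='rsplus'>
fit <- lm(mpg ~ wt, data = mtcars)
sse <- sum(residuals(fit)^2)                   # SSE (residual sum of squares)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # SST (total sum of squares)
1 - sse/sst                    # R^2 = 1 - SSE/SST
summary(fit)$r.squared         # same value reported by lm()
cor(mtcars$mpg, mtcars$wt)^2   # squared Pearson correlation (simple regression only)
</syntaxhighlight>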
Pearson correlation and linear regression slope
- Simple Linear Regression and Correlation
- Pearson Correlation and Linear Regression, Comparing Correlation and Slope
- [math]\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }[/math]
where [math]\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }[/math].
<syntaxhighlight lang='rsplus'>
set.seed(1)
x <- rnorm(10); y <- rnorm(10)
coef(lm(y ~ x))
# (Intercept)           x
#   0.3170798  -0.5161377
cor(x, y) * sd(y) / sd(x)
# [1] -0.5161377
</syntaxhighlight>
Different models (in R)
http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/
Factor Variables
Regression With Factor Variables
dummy.coef.lm() in R
Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.
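A minimal sketch with the built-in PlantGrowth data, comparing the coded coefficients from coef() with the per-level coefficients returned by dummy.coef():
<syntaxhighlight lang='rsplus'>
fit <- lm(weight ~ group, data = PlantGrowth)  # group has levels ctrl, trt1, trt2
coef(fit)        # intercept + treatment-coded contrasts (grouptrt1, grouptrt2)
dummy.coef(fit)  # the same fit expressed as an effect for every original level
</syntaxhighlight>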
Add Regression Line per Group to Scatterplot
How To Add Regression Line per Group to Scatterplot in ggplot2?
<syntaxhighlight lang='rsplus'>
penguins_df %>%
  ggplot(aes(x = culmen_length_mm, y = flipper_length_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
</syntaxhighlight>
model.matrix, design matrix
- https://en.wikipedia.org/wiki/Design_matrix
- ExploreModelMatrix: Explore design matrices interactively with R/Shiny. Paper on F1000research.
- model.matrix(~A+B) will return 1 + (a-1) + (b-1) columns, where a and b are the numbers of levels of A and B (see the sketch below). See an example Batch effects and confounders.
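A minimal sketch counting the design-matrix columns for two factors with 3 and 2 levels:
<syntaxhighlight lang='rsplus'>
A <- factor(rep(c("a1", "a2", "a3"), each = 4))  # a = 3 levels
B <- factor(rep(c("b1", "b2"), times = 6))       # b = 2 levels
X <- model.matrix(~ A + B)
ncol(X)       # 1 + (3-1) + (2-1) = 4
colnames(X)   # "(Intercept)" "Aa2" "Aa3" "Bb2"
</syntaxhighlight>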
Contrasts in linear regression
- Page 147 of Modern Applied Statistics with S (4th ed)
- https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
- A (sort of) Complete Guide to Contrasts in R by Rose Maier
<syntaxhighlight lang='rsplus'>
mat
##      constant NLvMH  NvL  MvH
## [1,]        1  -0.5  0.5  0.0
## [2,]        1  -0.5 -0.5  0.0
## [3,]        1   0.5  0.0  0.5
## [4,]        1   0.5  0.0 -0.5

mat <- mat[ , -1]
model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat))
summary(model7)
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
## doseNLvMH      3.179      2.152   1.477  0.14215
## doseNvL       -8.723      3.044  -2.866  0.00489 **
## doseMvH       13.232      3.044   4.347 2.84e-05 ***

# double check your contrasts
attributes(model7$qr$qr)$contrasts
## $dose
##      NLvMH  NvL  MvH
## None  -0.5  0.5  0.0
## Low   -0.5 -0.5  0.0
## Med    0.5  0.0  0.5
## High   0.5  0.0 -0.5

library(dplyr)
dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
dose.means
## Source: local data frame [4 x 2]
##
##   dose   y.mean
## 1 None 112.6267
## 2  Low 121.3500
## 3  Med 126.7839
## 4 High 113.5517

# The coefficient estimate for the first contrast (3.18) equals the average of
# the last two groups ((126.78 + 113.55)/2 = 120.17) minus the average of
# the first two groups ((112.63 + 121.35)/2 = 116.99).
</syntaxhighlight>
Multicollinearity
- A toy example
<syntaxhighlight lang='rsplus'>
n <- 100
set.seed(1)
x1 <- rnorm(n)
e <- rnorm(n)*.01
y <- x1 + e
cor(y, e)    # 0.00966967
cor(y, x1)   # 0.9999
lm(y ~ x1) |> summary()   # p < 2e-16

set.seed(2)
x2 <- x1 + rnorm(n)*.1    # x2 = x1 + noise
cor(x1, x2)               # .99
lm(y ~ x1 + x2) |> summary()   # x2 insig
lm(y ~ x2) |> summary()        # x2 sig

set.seed(3)
x3 <- x1 + rnorm(n)*.0001      # x3 = x1 + tiny noise
cor(x1, x3)                    # 1
lm(y ~ x1 + x3) |> summary()   # both insig. SURPRISE!
lm(y ~ x1) |> summary()

x4 <- x1                       # x4 is exactly equal to x1
lm(y ~ x1 + x4) |> summary()   # x4 coef not defined because of singularities
lm(y ~ x4 + x1) |> summary()   # x1 coef not defined because of singularities
</syntaxhighlight>
Consider lasso
<syntaxhighlight lang='rsplus'>
library(glmnet)
fit <- cv.glmnet(x=cbind(x1, x3, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                      s1
# (Intercept) 0.002797165
# x1          0.970839175
# x3          .
# .

fit <- cv.glmnet(x=cbind(x1, x4, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                       s1
# (Intercept) 2.797165e-03
# x1          9.708392e-01
# x4          6.939215e-18
# .

fit <- cv.glmnet(x=cbind(x4, x1, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                       s1
# (Intercept) 2.797165e-03
# x4          9.708392e-01
# x1          6.939215e-18
# .
</syntaxhighlight>
- How to Fix in R: not defined because of singularities
- Multicollinearity in R
- Detecting multicollinearity — it’s not that easy sometimes
- alias: Find Aliases (Dependencies) In A Model
<syntaxhighlight lang='rsplus'>
> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K

Complete :
         (Intercept) block1 block2 block3 block4 block5 N1 P1 K1 N1:P1 N1:K1 P1:K1
N1:P1:K1  0           1      1/3    1/6   -3/10  -1/5    0  0  0  0     0     0
> options(op)
</syntaxhighlight>
Exposure
https://en.mimi.hu/mathematics/exposure_variable.html
Independent variable = predictor = explanatory = exposure variable
Marginal effects
The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.
Confounders, confounding
- What is a Confounding Variable? (Definition & Example). Confounding variable is a variable that can affect the relationship between the two variables under study. Requirements for Confounding Variables:
- It must be correlated with the independent variable.
- It must have a causal relationship with the dependent variable.
- Confounders Introduction to Data Science by Irizarry
- If X and Y are correlated, we call Z a confounder if changes in Z cause changes in both X and Y (a small simulation appears after this list).
- https://en.wikipedia.org/wiki/Confounding
- A method for controlling complex confounding effects in the detection of adverse drug reactions using electronic health records. It provides a rule to identify a confounder.
- http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html (R example)
- Logistic Regression: Confounding and Colinearity
- Identifying a confounder
- Is it possible to have a variable that acts as both an effect modifier and a confounder?
- Which test to use to check if a possible confounder impacts a 0 / 1 result?
- Addressing confounding artifacts in reconstruction of gene co-expression networks Parsana 2019
- Up Your Steps to Lower Blood Pressure, Heart Study Suggests
- Over about five months, participants averaged roughly 7,500 steps per day. Those with a higher daily step count had significantly lower blood pressure.
- the researchers found that systolic blood pressure was about 0.45 points lower for every 1,000 daily steps taken
- The link between daily step count and blood pressure was no longer significant when body mass index (BMI) was taken into account, however.
- Empirical economics with r (part b): confounders, proxies and sources of exogenous variations, causal effects.
- No, you have not controlled for confounders
- Visual Demonstration of Residual Confounding. Don't dichotomize a continuous variable.
- See Randomized block design
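A minimal sketch (simulated data) of the confounding idea above: Z drives both X and Y, so X looks associated with Y until Z is adjusted for.
<syntaxhighlight lang='rsplus'>
set.seed(1)
n <- 1000
z <- rnorm(n)              # confounder
x <- 0.8*z + rnorm(n)      # x is driven by z, not by y
y <- 1.5*z + rnorm(n)      # y is driven by z, not by x
coef(summary(lm(y ~ x)))["x", ]      # x appears strongly associated with y
coef(summary(lm(y ~ x + z)))["x", ]  # the association largely disappears after adjusting for z
</syntaxhighlight>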
Confidence interval vs prediction interval
Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).
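A minimal sketch with simulated data, contrasting the two intervals from predict.lm():
<syntaxhighlight lang='rsplus'>
set.seed(1)
x <- rnorm(30)
y <- 2 + 3*x + rnorm(30)
fit <- lm(y ~ x)
new <- data.frame(x = 1)
predict(fit, new, interval = "confidence")  # narrower: uncertainty about the mean E(Y|X=1)
predict(fit, new, interval = "prediction")  # wider: also includes the error variance of a new observation
</syntaxhighlight>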
- http://www.graphpad.com/support/faqid/1506/
- http://en.wikipedia.org/wiki/Prediction_interval
- http://robjhyndman.com/hyndsight/intervals/
- https://stat.duke.edu/courses/Fall13/sta101/slides/unit7lec3H.pdf
- https://datascienceplus.com/prediction-interval-the-wider-sister-of-confidence-interval/
- Confidence and prediction intervals explained... (with a Shiny app!)
Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance
- Dealing with heteroskedasticity; regression with robust standard errors using R
- performance package check_heteroscedasticity(x, ...) and check_heteroskedasticity(x, ...)
- easystats: Quickly investigate model performance
- Homoscedasticity in Regression Analysis
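A minimal sketch of the check mentioned above, assuming the performance package is installed:
<syntaxhighlight lang='rsplus'>
library(performance)
fit <- lm(mpg ~ wt + hp, data = mtcars)
check_heteroscedasticity(fit)  # tests for non-constant error variance; prints OK or a warning
</syntaxhighlight>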
Linear regression with Map Reduce
https://freakonometrics.hypotheses.org/53269
Relationship between multiple variables
Visualizing the relationship between multiple variables
Model fitting evaluation, Q-Q plot
Generalized least squares
- gls from the nlme package. The errors are allowed to be correlated and/or have unequal variances.
- varClasses: varPower(), varExp(), varConstPower(), varFunc()
- summary()$varBeta (variance of coefficient estimates), summary()$sigma (error sigma)
- intervals()$coef (coefficient estimates), intervals()$varStruct (lower, est, upper of variance function)
- anova()
- 95% prediction intervals: predict(gls, newdata, interval = "prediction", level = .95) OR predict(gls, newdata) +/- qt(0.975, n-2)*se*sqrt(1 + 1/n + xd/ssx) where se = sigma.param*newx^pow.param, xd = (newx-xbar)^2, pow.param = coef(glsObj$modelStruct$varStruct). A minimal gls() sketch with varPower() appears after this list.
- gls() vs. lme() in the nlme package
- How to use Generalized Least Square gls() in r. Chapter 5.2.1 (page 208) in Mixed Effects Models in S and S-Plus by Pinheiro and Bates 2000.
- https://asancpt.github.io/nlme/chapter-8.html
- The lme function by Peter Dalgaard
- http://halweb.uc3m.es/esp/Personal/personas/durban/esp/web/notes/gls.pdf
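A minimal sketch of nlme::gls() with a varPower() variance function on simulated heteroscedastic data, illustrating the components mentioned above (summary()$sigma, intervals()$coef, and the estimated power parameter):
<syntaxhighlight lang='rsplus'>
library(nlme)
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3*x + rnorm(100, sd = 0.5*x)      # error SD grows with x
dat <- data.frame(x, y)
fit <- gls(y ~ x, data = dat, weights = varPower(form = ~ x))
summary(fit)$sigma                  # residual sigma
coef(fit$modelStruct$varStruct)     # estimated power of the variance function
intervals(fit)$coef                 # lower / est. / upper for the regression coefficients
</syntaxhighlight>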
Reduced rank regression
- The book Multivariate Reduced-Rank Regression by Velu, Raja & Reinsel, Gregory C.
- Generalized Low Rank Models (GLRM) and h2o
Singular value decomposition
- MATRIX DECOMPOSITIONS - a shiny app
- Application to rank-k approximation of X, missing data imputation, relationship to PCA, relationship to eigen value decomposition, check multicollinearity, Moore-Penrose pseudoinverse, relationship to NMF, calculation of SVD by hand.
- https://en.wikipedia.org/wiki/Singular_value_decomposition
- Minimum norm least-squares solution to linear equation from mathworks. Solve linear equations with infinite solutions. Underdetermined system (there are fewer equations than unknowns e.g. [math]\displaystyle{ 2x_1 +3x_2 =8 }[/math]. n=1, p=2). The geometry illustration of the problem is useful. The page also provides two ways to find the solution: one is by complete orthogonal decomposition (COD) and the other is by SVD/Moore-Penrose pseudoinverse.
- Moore-Penrose matrix inverse in R
<syntaxhighlight lang='rsplus'>
> a = matrix(c(2, 3), nr=1)
> MASS::ginv(a) * 8
         [,1]
[1,] 1.230769
[2,] 1.846154
# Same solution as matlab lsqminnorm(A,b)

> a %*% MASS::ginv(a)
     [,1]
[1,]    1
> a %*% MASS::ginv(a) %*% a
     [,1] [,2]
[1,]    2    3
> MASS::ginv   # view the source code
</syntaxhighlight>
- Minimal Norm Solution to the least squares problem.
- The Moore-Penrose Inverse and Least Squares
- Why does SVD provide the least squares and least norm solution to 𝐴𝑥=𝑏?
Mahalanobis distance and outliers detection
- The Mahalanobis distance is a measure of the distance between a point P and a distribution D
- It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
- The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
- Distance is not always what it seems
performance::check_outliers() Outliers detection (check for influential observations)
How to Calculate Mahalanobis Distance in R
<syntaxhighlight lang='rsplus'>
set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x))         # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)  # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) :
#   system is computationally singular: reciprocal condition number = 1.93998e-19
</syntaxhighlight>
Type 1 error
Linear Regression And Type I Error
More Data Can Hurt for Linear Regression
Estimating Coefficients for Variables in R
Trying to Trick Linear Regression - Estimating Coefficients for Variables in R
How to interpret the interaction term
How to interpret the interaction term in an lm formula in R? If both x1 and x2 are numerical, the x1:x2 column is simply the element-wise product x1*x2. That is, y ~ x1 + x2 + x1:x2 is equivalent to y ~ x1 + x2 + x3 where x3 = x1*x2. Interpretation largely depends on whether var1 and var2 are both continuous (quite hard to interpret) or whether one of them is, e.g., binary categorical (easier to consider). A small sketch follows.
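A minimal sketch (simulated data) confirming that the interaction column is just the product of the two predictors:
<syntaxhighlight lang='rsplus'>
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 1 + 2*x1 - x2 + 0.5*x1*x2 + rnorm(50)
x3 <- x1 * x2                       # the interaction column, computed by hand
coef(lm(y ~ x1 + x2 + x1:x2))
coef(lm(y ~ x1 + x2 + x3))          # identical estimates; x1:x2 is just the product column
</syntaxhighlight>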
Intercept only model and cross-validation
<syntaxhighlight lang='rsplus'>
n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)

df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
  fit <- lm(y ~ 1, data = df[-i, ])
  pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred)
# -1
</syntaxhighlight>
How about 1000 simulated data?
<syntaxhighlight lang='rsplus'>
foo <- function(n=3, debug=F) {
  x <- rnorm(n)
  y <- 2*x + .5*rnorm(n)
  df <- data.frame(x=x, y=y)
  pred <- double(n)
  for(i in 1:n) {
    fit <- lm(y ~ 1, data = df[-i, ])
    pred[i] <- predict(fit, df[i, ])
  }
  if (debug) {
    cat("num=", n*sum(y*pred)-sum(pred)*sum(y), "\n")
    cat("denom=", sqrt(n*sum(y**2) - sum(y)^2)*sqrt(n*sum(pred**2)-sum(pred)^2), "\n")
    invisible(list(y=y, pred=pred, cor=cor(y, pred)))
  } else {
    cor(y, pred)
  }
}
o <- replicate(1000, foo(n=10))
range(o)
# [1] -1 -1
all.equal(o, rep(-1, 1000))
# TRUE
</syntaxhighlight>
Note that this property does not hold for k-fold CV (as opposed to LOOCV):
<syntaxhighlight lang='rsplus'>
n <- 20; nfold <- 5
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
# plot(x, y)
df <- data.frame(x=x, y=y)

set.seed(1)
folds <- split(sample(1:n), rep(1:nfold, length = n))
pred <- double(n)
for(i in 1:nfold) {
  fit <- lm(y ~ 1, data = df[-folds[[i]], ])
  pred[folds[[i]]] <- predict(fit, df[folds[[i]], ])
}
plot(y, pred)
cor(y, pred)
# -0.6696743
</syntaxhighlight>
See also
lm.fit and multiple responses/genes
- https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm.fit (cf lmFit() from limma)
- Batch effects and confounders. As we can see, lm.fit()$coefficients is a matrix of (# parameters) x (# genes/responses).
Logistic regression
- https://en.wikipedia.org/wiki/Logistic_regression
- Logistic regression or T test?. Choosing between logistic regression and Mann Whitney/t-tests
- The t-test is not significant but the logistic regression is
- The t-test is significant but the logistic regression is not, as in the question
- Simulation data shows when the covariate can perfectly separate in two groups, logistic regression won't work.
<syntaxhighlight lang='rsplus'>
> set.seed(1234); n <- 16; mu=3
> x <- c(rnorm(n), rnorm(n, mu)); y <- rep(0:1, each=n)
> summary(glm(y ~ x, family = binomial)); plot(x, y)
...
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -116.7    76341.5  -0.002    0.999
x               88.5    56801.0   0.002    0.999
...
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
</syntaxhighlight>
- Logistic regression in R. Assessing the fit with a pseudo R2. Alternative pseudo R2. Assessing the significance.
- How to Plot a Logistic Regression Curve in R Simple model.
- Plotting the results of your logistic regression Part 1: Continuous by categorical interaction. Multivariate model.
- Interpret Logistic Regression Coefficients (For Beginners).
- Increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by exp(β).
- exp(β) is the odds ratio that associates smoking with the risk of heart disease.
- If exp(β) = exp(0.38) = 1.46, the smoking group has 1.46 times the odds of the non-smoking group of having heart disease.
- The smoking group has 46% (1.46 – 1 = 0.46) higher odds of having heart disease than the non-smoking group.
- If β = –0.38, then exp(β) = 0.68 and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the odds of heart disease.
- If β = 0, then exp(β) = 1, and the smoking group has the same odds as the non-smoking group of having heart disease. (A small glm() sketch illustrating exp(β) follows this list.)
- How to interpret the intercept? If the intercept is negative, the odds of the outcome when all predictors are zero are below 1, i.e. the probability of having the outcome is < 0.5.
- A Simple Interpretation of Logistic Regression Coefficients. logit(p) = log-odds.
- A 1 unit increase in X₁ results in a b increase in the log-odds of success : failure.
- In other words, if exp(b)=1.14, it means increasing studying hour by 1 unit will have a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed) where p = the probability of passing an exam.
- How to Interpret Logistic Regression Coefficients
- Generalized Linear Models, Part I: The Logistic Model, odds ratio, hypothesis testing, change of the reference factor
- Logistic regression models the log odds: log(p/(1-p)) = beta * X.
- The logistic (inverse-logit) function is f(x) = 1 / (1+exp(-x)); the response Y is modeled by f(b0 + b1*x).
- Interpretation: consider X = (intercept, x), beta = (beta0, beta1) and assume x is 0/1. Then f(beta0) is the proportion of positive results in Y when x = 0, and f(beta0 + beta1) is the proportion of positive results in Y when x = 1. Again, exp(beta1) is the odds ratio. The probabilities can be predicted by using the formula 1 / (1 + exp(-(b0 + b1*x))).
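A minimal sketch (simulated data; the smoking/heart-disease naming is illustrative only) showing the odds-ratio interpretation of exp(β):
<syntaxhighlight lang='rsplus'>
set.seed(1)
n <- 500
smoke   <- rbinom(n, 1, 0.4)                      # hypothetical 0/1 exposure
disease <- rbinom(n, 1, plogis(-1 + 0.38*smoke))  # true beta = 0.38, so OR = exp(0.38), about 1.46
fit <- glm(disease ~ smoke, family = binomial)
exp(coef(fit))["smoke"]                           # estimated odds ratio, close to 1.46
predict(fit, data.frame(smoke = 1), type = "response")  # P(disease | smoke = 1)
</syntaxhighlight>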
Quantile regression
- Quantile regression is a type of regression analysis in which quantiles of the dependent variable (unlike traditional linear regression, which models the mean of the dependent variable) are modeled as a function of independent variables. This allows for a better understanding of the distribution of the dependent variable, especially when the distribution is skewed or has outliers.
- https://en.wikipedia.org/wiki/Quantile_regression
- Basic Quantile Regression
- QUANTILE REGRESSION (HOME MADE, PART 2)
<syntaxhighlight lang='rsplus'>
library(quantreg)
set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, mean = 0, sd = 2)
y[1:10] <- y[1:10] + 10     # first 10 are outliers

# Fit a traditional linear regression model
fit_lm <- lm(y ~ x)

# Fit a quantile regression model for the 50th percentile
fit_qr <- rq(y ~ x, tau = 0.5)

fit_lm$coefficients
# (Intercept)           x
#   0.7961233   0.8759269
fit_qr$coefficients
# (Intercept)           x
#   0.0772395   1.0058387

plot(x, y)
points(x[1:10], y[1:10], col='red', pch=16)
abline(fit_lm)
abline(fit_qr, col = 'blue')
</syntaxhighlight>
Isotonic regression
- What is nearly-isotonic regression?
- Getting predictions from an isotonic regression model
- Pool adjacent violators algorithm assisted learning with application on estimating optimal individualized treatment regimes 2021
Piecewise linear regression
- Trajectory (related to "time")
- Infection Trajectory: See Which Countries are Flattening Their COVID-19 Curve
- Covid-19: How to Predict a Patient's Trajectory and The COVID Inpatient Risk Calculator: CIRC
- Modelling the COVID-19 infection trajectory: A piecewise linear quantile trend model 2021
- Trajectory of long covid symptoms after covid-19 vaccination: community based cohort study 2022. The time series model came from this paper Interrupted time series regression for the evaluation of public health interventions: a tutorial which contains R code.
GEE/generalized estimating equations
- Longitudinal Data Analysis for Discrete and Continuous Outcomes Scott L. Zeger and Kung-Yee Liang 1986
- Analysis of Longitudinal Data (Oxford Statistical Science Series) 2nd edition by Diggle, Peter, Heagerty, Patrick, Liang, Kung-Yee, Zeger
- Youtube
- 12.1 - Introduction to Generalized Estimating Equations
- gee package. First version 1999.
- geepack package. First version 2002.
- Getting Started with Generalized Estimating Equations
- Chapter 15 GEE, Binary Outcome: Respiratory Illness from Encyclopedia of Quantitative Methods in R, vol. 5: Multilevel Models.
- Fast Pure R Implementation of GEE: Application of the Matrix Package
<syntaxhighlight lang='rsplus'>
library(geepack)

# load data
data("cbpp", package = "lme4")
cbpp[1:3, ]
#   herd incidence size period
# 1    1         2   14      1
# 2    1         3   12      2
# 3    1         4    9      3

with(cbpp, table(herd, period))
    period
herd 1 2 3 4
  1  1 1 1 1
  2  1 1 1 0
  3  1 1 1 1
  4  1 1 1 1
  5  1 1 1 1
  6  1 1 1 1
  7  1 1 1 1
  8  1 0 0 0
  9  1 1 1 1
  10 1 1 1 1
  11 1 1 1 1
  12 1 1 1 1
  13 1 1 1 1
  14 1 1 1 1
  15 1 1 1 1

# fit GEE model with logit link
fit <- geeglm(cbind(incidence, size - incidence) ~ period,
              id = herd, family = binomial("logit"),
              corstr = "exchangeable", data = cbpp)

# summary of the model
summary(fit)
Coefficients:
            Estimate Std.err  Wald Pr(>|W|)
(Intercept)   -1.270   0.276 21.24  4.0e-06 ***
period2       -1.156   0.436  7.04   0.0080 **
period3       -1.284   0.491  6.84   0.0089 **
period4       -1.754   0.375 21.85  2.9e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable
Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.134  0.0542
  Link = identity

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.0414   0.129
Number of clusters:   15  Maximum cluster size: 4

summary(fit)$coefficients
            Estimate Std.err  Wald Pr(>|W|)
(Intercept)    -1.27   0.276 21.24 4.04e-06
period2        -1.16   0.436  7.04 7.99e-03
period3        -1.28   0.491  6.84 8.90e-03
period4        -1.75   0.375 21.85 2.94e-06

# compute confidence intervals for coefficients using the robust
# variance-covariance matrix
library(sandwich)
vcov_fit <- vcovHC(fit, type = "HC")
ci <- coef(fit) + cbind(qnorm(0.025), qnorm(0.975)) * sqrt(diag(vcov_fit))
colnames(ci) <- c("2.5 %", "97.5 %")
print(ci)
</syntaxhighlight>
[math]\displaystyle{ \operatorname{logit}(p_{ij}) = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_0 + \beta_1 \cdot \text{period}_j }[/math] where [math]\displaystyle{ p_{ij} }[/math] is the probability of a positive outcome (i.e., an incidence) for the [math]\displaystyle{ i }[/math]-th animal in herd [math]\displaystyle{ j }[/math], and [math]\displaystyle{ \text{period}_j }[/math] is a categorical predictor variable indicating the period of observation (e.g., pre- or post-vaccination). The parameters [math]\displaystyle{ \beta_0 }[/math] and [math]\displaystyle{ \beta_1 }[/math] represent the intercept and the effect of the period variable, respectively.
Simulation
How to simulate data for generalized estimating equations (GEE) with logistic link function?
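A minimal sketch of one common approach: induce within-cluster correlation for binary outcomes through a shared cluster effect, then fit with geepack::geeglm(). (Note that the fitted coefficients are marginal, so they are somewhat attenuated relative to the conditional parameters used in the simulation.)
<syntaxhighlight lang='rsplus'>
library(geepack)
set.seed(1)
n_clusters <- 100; m <- 4                        # 100 clusters of size 4
id <- rep(1:n_clusters, each = m)
x  <- rnorm(n_clusters * m)
b  <- rep(rnorm(n_clusters, sd = 1), each = m)   # shared cluster effect induces correlation
y  <- rbinom(n_clusters * m, 1, plogis(-0.5 + x + b))
dat <- data.frame(y, x, id)
fit <- geeglm(y ~ x, id = id, data = dat, family = binomial, corstr = "exchangeable")
summary(fit)   # marginal (population-averaged) coefficient for x
</syntaxhighlight>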
GEE vs. mixed effects models
- https://stats.stackexchange.com/a/16415
- Getting Started with Generalized Estimating Equations
- It uses "gee", "lme4", "geepack" packages for model fittings
dep_gee2 <- gee(depression ~ diagnose + drug*time, data = dat, id = id, family = binomial, corstr = "exchangeable")
- How does the GEE model with the exchangeable correlation structure compare to a Generalized Mixed-Effect model? We allow the intercept to be conditional on subject. This means we’ll get 340 subject-specific models with different intercepts but identical coefficients for everything else. The model coefficients listed under “Fixed effects” are almost identical to the GEE coefficients. ?glmer for GLMMs/Generalized Linear Mixed-Effects Models.
<syntaxhighlight lang='rsplus'>
library(lme4)  # version 1.1-31
dep_glmer <- glmer(depression ~ diagnose + drug*time + (1|id),
                   data = dat, family = binomial)
summary(dep_glmer, corr = FALSE)
</syntaxhighlight>
dep_gee4 <- geeglm(depression ~ diagnose + drug*time, data = dat, id = id, family = binomial, corstr = "exchangeable")
- "ggeffects" package for plotting. The emmeans function calculates marginal effects and works for both GEE and mixed-effect models.
<syntaxhighlight lang='rsplus'>
library(ggeffects)
library(emmeans)
plot(ggemmeans(dep_glmer, terms = c("time", "drug"),
               condition = c(diagnose = "severe"))) +
  ggplot2::ggtitle("GLMER Effect plot")
plot(ggemmeans(dep_gee4, terms = c("time", "drug"),
               condition = c(diagnose = "severe"))) +
  ggplot2::ggtitle("GEE Effect plot")
</syntaxhighlight>
- While both methods are used for modeling correlated data, there are some key differences between them.
- Assumptions: The key difference between GEE and nlme is that GEE makes fewer assumptions about the underlying data structure. GEE assumes that the mean structure is correctly specified, but the correlation structure is unspecified. In contrast, nlme assumes that both the mean and the variance-covariance structure are correctly specified.
- Model flexibility: GEE is a more flexible model than nlme because it allows for modeling the mean and variance of the response separately. This means that GEE is more suitable for modeling data with non-normal errors, such as count or binary data. In contrast, nlme assumes that the errors are normally distributed and allows for modeling the mean and variance jointly.
- Inference: Another key difference between GEE and nlme is the type of inference they provide. GEE provides marginal inference, which means that it estimates the population-averaged effect of the predictor variables on the response variable. In contrast, nlme provides conditional inference, which means that it estimates the subject-specific effect of the predictor variables on the response variable.
- Computation: The computation involved in fitting GEE and nlme models is also different. GEE uses an iterative algorithm to estimate the model parameters, while nlme uses a maximum likelihood estimation approach. The computational complexity of GEE is lower than that of nlme, making it more suitable for large datasets.
- In summary, GEE and nlme are two popular models used for analyzing longitudinal data. The main differences between the two models are the assumptions they make, model flexibility, inference type, and computational complexity. GEE is a more flexible model that makes fewer assumptions and provides marginal inference, while nlme assumes normality of errors and provides conditional inference.
Deming regression
Tweedie regression
Model selection, AIC and Tweedie regression
Causal inference
- https://en.wikipedia.org/wiki/Causal_inference
- Visual Guides for Causal Inference including Inverse Probability Weighting.
- Introduction to computational causal inference using reproducible Stata, R, and Python code: A tutorial Smith et Al 2021
- Confounding in causal inference: what is it, and what to do about it?
- An introduction to Causal inference
- Causal Inference cheat sheet for data scientists
- twang package
- The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
- Outcome [math]\displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }[/math]
- Causal effect (unobserved) [math]\displaystyle{ \tau = E(Y(1) - Y(0)) }[/math]
Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights. (A small IPTW sketch appears at the end of this section.)
- A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.
- Propensity score matching
- MatchIt - Nonparametric Preprocessing for Parametric Causal Inference (R package).
- Practical Propensity Score Methods Using R (online book)
- Comparison of Propensity Score Methods and Covariate Adjustment: Evaluation in 4 Cardiovascular Studies Stuart Pocock 2017
- Simple examples to understand what confounders, colliders, mediators, and moderators are and how to “control for” variables in R with regression and propensity-score matching
- Statistical Rethinking (2022 Edition) by Richard McElreath. Youtube.
- Regression and Other Stories (book)
- Riddle: Estimate effect of x on y if you only have two noisy measures of x
- Assignment-Control Plots: A Visual Companion for Causal Inference Study Design, The American Statistician
- Regression and Causality by Michael Schomaker
- A Complete Guide to Causal Inference A compilation of the issues that you kept skipping over, and how to do it right.
- Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Wager Athey 2018. causalTree package.
- generalized random forests CRAN. Athey, Wager, Tibshirani. 2019
- Modeling Heterogeneous Treatment Effects with R (video)
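A minimal sketch of inverse probability of treatment weighting (IPTW) on simulated data, as referenced in the inverse-probability-weighting item above; the propensity score comes from a logistic regression and the true treatment effect is 2.
<syntaxhighlight lang='rsplus'>
set.seed(1)
n  <- 2000
x  <- rnorm(n)                                  # measured confounder
tr <- rbinom(n, 1, plogis(0.8*x))               # treatment assignment depends on x
y  <- 1 + 2*tr + 1.5*x + rnorm(n)               # true treatment effect = 2
ps <- fitted(glm(tr ~ x, family = binomial))    # propensity score P(T=1|X)
w  <- ifelse(tr == 1, 1/ps, 1/(1 - ps))         # inverse probability of treatment weights
coef(lm(y ~ tr))["tr"]                 # naive, confounded estimate
coef(lm(y ~ tr, weights = w))["tr"]    # IPTW estimate, close to the true effect of 2
</syntaxhighlight>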