Regression

Linear Regression

Comic

MSE

Coefficient of determination R2

  • https://en.wikipedia.org/wiki/Coefficient_of_determination.
    • R2 is expressed as the ratio of the explained variance to the total variance.
    • It is a statistical measure of how well the regression predictions approximate the real data points.
    • See the wikipedia page for a list of caveats of R2 including correlation does not imply causation.
[math]\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }[/math]
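
A quick numerical check of the identity R2 = 1 - SSE/SST (a minimal sketch using base R with simulated data):

set.seed(1)
x <- rnorm(50); y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)   # sum of squared errors
sst <- sum((y - mean(y))^2)    # total sum of squares
1 - sse/sst                    # matches summary(fit)$r.squared
summary(fit)$r.squared
# 1 - MSE/Var(y) gives the same value when MSE and Var(y) are both divided by the same n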

Pearson correlation and linear regression slope

[math]\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }[/math]

where [math]\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }[/math].

set.seed(1)
x <- rnorm(10); y <-rnorm(10)
coef(lm(y~x))
# (Intercept)           x
#   0.3170798  -0.5161377

cor(x, y)*sd(y)/sd(x)
# [1] -0.5161377

Different models (in R)

http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/

Factor Variables

Regression With Factor Variables

dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.
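
A small illustration (a minimal sketch; PlantGrowth is a base-R data set whose group factor has three levels):

fit <- lm(weight ~ group, data = PlantGrowth)
coef(fit)        # coefficients for the coded (treatment-contrast) variables
dummy.coef(fit)  # coefficients expressed for each original level of group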

Add Regression Line per Group to Scatterplot

How To Add Regression Line per Group to Scatterplot in ggplot2?

# assumes ggplot2 and dplyr are loaded and penguins_df holds the Palmer penguins data
penguins_df %>%
  ggplot(aes(x=culmen_length_mm, 
             y=flipper_length_mm, 
             color=species))+
  geom_point()+
  geom_smooth(method="lm", se = FALSE)  # one least-squares line per species

model.matrix, design matrix

Contrasts in linear regression

  • Page 147 of Modern Applied Statistics with S (4th ed)
  • https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
  • A (sort of) Complete Guide to Contrasts in R by Rose Maier
    mat
    
    ##      constant NLvMH  NvL  MvH
    ## [1,]        1  -0.5  0.5  0.0
    ## [2,]        1  -0.5 -0.5  0.0
    ## [3,]        1   0.5  0.0  0.5
    ## [4,]        1   0.5  0.0 -0.5
    mat <- mat[ , -1]
    
    model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
    summary(model7)
    
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
    ## doseNLvMH      3.179      2.152   1.477  0.14215    
    ## doseNvL       -8.723      3.044  -2.866  0.00489 ** 
    ## doseMvH       13.232      3.044   4.347 2.84e-05 ***
    
    # double check your contrasts
    attributes(model7$qr$qr)$contrasts
    ## $dose
    ##      NLvMH  NvL  MvH
    ## None  -0.5  0.5  0.0
    ## Low   -0.5 -0.5  0.0
    ## Med    0.5  0.0  0.5
    ## High   0.5  0.0 -0.5
    
    library(dplyr)
    dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
    dose.means
    ## Source: local data frame [4 x 2]
    ## 
    ##   dose   y.mean
    ## 1 None 112.6267
    ## 2  Low 121.3500
    ## 3  Med 126.7839
    ## 4 High 113.5517
    
    # The coefficient estimate for the first contrast (3.18) equals the average of 
    # the last two groups ((126.78 + 113.55)/2 = 120.17) minus the average of 
    # the first two groups ((112.63 + 121.35)/2 = 116.99).

Multicollinearity

  • A toy example
    n <- 100
    set.seed(1)
    x1 <- rnorm(n)
    e <- rnorm(n)*.01
    y <- x1 + e
    cor(y, e)  # 0.00966967
    cor(y, x1) # 0.9999
    lm(y ~ x1) |> summary()      # p<2e-16
    
    set.seed(2)
    x2 <- x1 + rnorm(n)*.1       # x2 = x1 + noise
    cor(x1, x2)  # .99
    lm(y ~ x1 + x2) |> summary() # x2 insig
    lm(y~ x2) |> summary()       # x2 sig
    
    set.seed(3)
    x3 <- x1 + rnorm(n)*.0001    # x3 = x1 + tiny noise
    cor(x1, x3) # 1
    lm(y ~ x1 + x3) |> summary() # both insig. SURPRISE!
    lm(y ~ x1) |> summary()
    
    x4 <- x1                     # x4 is exactly equal to x1  
    lm(y~ x1 + x4) |> summary()  # x4 coef not defined because of singularities
    lm(y~ x4 + x1) |> summary()  # x1 coef not defined because of singularities
    

    Consider lasso

    library(glmnet)  # cv.glmnet() comes from the glmnet package
    fit <- cv.glmnet(x=cbind(x1, x3, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                      s1
    # (Intercept) 0.002797165
    # x1          0.970839175
    # x3          .          
    #             .          
    
    fit <- cv.glmnet(x=cbind(x1, x4, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                       s1
    # (Intercept) 2.797165e-03
    # x1          9.708392e-01
    # x4          6.939215e-18
    #             .   
    
    fit <- cv.glmnet(x=cbind(x4, x1, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                      s1
    # (Intercept) 2.797165e-03
    # x4          9.708392e-01
    # x1          6.939215e-18
    #             .   
    
  • How to Fix in R: not defined because of singularities
  • Multicollinearity in R
  • Detecting multicollinearity — it’s not that easy sometimes
  • alias: Find Aliases (Dependencies) In A Model
    > op <- options(contrasts = c("contr.helmert", "contr.poly"))
    > npk.aov <- aov(yield ~ block + N*P*K, npk)
    > alias(npk.aov)
    Model :
    yield ~ block + N * P * K
    
    Complete :
             (Intercept) block1 block2 block3 block4 block5 N1 P1 K1 N1:P1 N1:K1 P1:K1
    N1:P1:K1 0           1      1/3    1/6    -3/10  -1/5   0  0  0  0     0     0
    
    > options(op)
    

Exposure

https://en.mimi.hu/mathematics/exposure_variable.html

Independent variable = predictor = explanatory = exposure variable

Marginal effects

The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.
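
A minimal sketch, assuming a version of marginaleffects that provides avg_slopes() (the model and data are only for illustration):

library(marginaleffects)

fit <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)
avg_slopes(fit)    # average marginal effect (slope) of each predictor
# predictions(fit) returns adjusted predictions for each observation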

Confounders, confounding

Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).
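
A small base-R illustration of the difference (a minimal sketch; the prediction interval is wider because it adds the residual variance):

set.seed(1)
x <- rnorm(30); y <- 1 + 2*x + rnorm(30)
fit <- lm(y ~ x)
new <- data.frame(x = 1)
predict(fit, new, interval = "confidence")  # CI for the mean E(Y|X=1)
predict(fit, new, interval = "prediction")  # PI for a new observation at X=1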

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Linear regression with Map Reduce

https://freakonometrics.hypotheses.org/53269

Relationship between multiple variables

Visualizing the relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

Reduced rank regression

Singular value decomposition

  • Moore-Penrose matrix inverse in R
    > a = matrix(c(2, 3), nr=1)
    > MASS::ginv(a) * 8 
             [,1]
    [1,] 1.230769
    [2,] 1.846154  
    # Same solution as matlab lsqminnorm(A,b)
    
    > a %*% MASS::ginv(a)
         [,1]
    [1,]    1
    > a %*% MASS::ginv(a) %*% a
         [,1] [,2]
    [1,]    2    3
    > MASS::ginv   # view the source code
    

Mahalanobis distance and outliers detection

Mahalanobis distance

  • The Mahalanobis distance is a measure of the distance between a point P and a distribution D
  • It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
  • The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
  • Distance is not always what it seems

performance::check_outliers() Outliers detection (check for influential observations)

How to Calculate Mahalanobis Distance in R

set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x)) # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu) # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) : 
#   system is computationally singular: reciprocal condition number = 1.93998e-19

Type 1 error

Linear Regression And Type I Error

More Data Can Hurt for Linear Regression

Sometimes more data can hurt!

Estimating Coefficients for Variables in R

Trying to Trick Linear Regression - Estimating Coefficients for Variables in R

How to interpret the interaction term

how to interpret the interaction term in an lm formula in R? If both x1 and x2 are numeric, then x1:x2 is computed as x1*x2. That is, y ~ x1 + x2 + x1:x2 is equivalent to y ~ x1 + x2 + x3 where x3 = x1*x2. The cross term is literally the two variables multiplied; interpretation will largely depend on whether var1 and var2 are both continuous (quite hard to interpret, in my opinion) or whether one of them is, e.g., binary categorical (easier to consider).
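
A quick numerical check that, for two numeric predictors, the x1:x2 term is literally their product (a minimal sketch with simulated data):

set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 1 + x1 + 2*x2 + 3*x1*x2 + rnorm(50)

coef(lm(y ~ x1 + x2 + x1:x2))     # same model as y ~ x1*x2
coef(lm(y ~ x1 + x2 + I(x1*x2)))  # identical estimates; x3 = x1*x2 built by hand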

Intercept only model and cross-validation

n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
  fit <- lm(y ~ 1, data = df[-i, ])
  pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred) # -1
# Why exactly -1: with an intercept-only model the LOOCV prediction for
# observation i is mean(y[-i]) = (sum(y) - y[i])/(n-1), a decreasing linear
# function of y[i], so cor(y, pred) is exactly -1.

How about 1000 simulated data?

foo <- function(n=3, debug=F) {
  x <- rnorm(n)
  y <- 2*x + .5*rnorm(n)
  df <- data.frame(x=x, y=y)
  pred <- double(n)
  for(i in 1:n) {
    fit <- lm(y ~ 1, data = df[-i, ])
    pred[i] <- predict(fit, df[i, ])
  }
  if (debug) {
    cat("num=", n*sum(y*pred)-sum(pred)*sum(y), "\n")
    cat("denom=", sqrt(n*sum(y**2) - sum(y)^2)*sqrt(n*sum(pred**2)-sum(pred)^2), "\n")
    invisible(list(y=y, pred=pred, cor=cor(y, pred)))
  } else {
    cor(y, pred)
  }
}

o <- replicate(1000, foo(n=10))
range(o) # [1] -1 -1
all.equal(o, rep(-1, 1000)) # TRUE

Note that this property does not hold for k-fold CV (as opposed to LOOCV):

n <- 20; nfold <- 5
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
#plot(x, y)
df <- data.frame(x=x, y=y)
set.seed(1)
folds <- split(sample(1:n), rep(1:nfold, length = n))
pred <- double(n)
for(i in 1:nfold) {
  fit <- lm(y ~ 1, data = df[-folds[[i]], ])
  pred[folds[[i]]] <- predict(fit, df[folds[[i]], ])
}
plot(y, pred)
cor(y, pred) # -0.6696743

See also

lm.fit and multiple responses/genes
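
A minimal sketch of the idea (assuming the goal is to fit the same design to many response columns, e.g. genes, at once): lm() accepts a matrix response and fits every column in a single pass.

set.seed(1)
n <- 20; G <- 5                      # 20 samples, 5 "genes"
x <- rnorm(n)
Y <- matrix(rnorm(n * G), nrow = n)  # each column is one response/gene
fit <- lm(Y ~ x)
coef(fit)   # 2 x 5 matrix: intercept and slope for every column
# lm.fit(cbind(1, x), Y)$coefficients gives the same numbers without the formula interface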

Logistic regression

Multinomial logistic regression

Generalized linear models

  • ?glm, ?family
  • Examples:
    glm(counts ~ outcome + treatment, family = poisson())
    
    glm(Postwt ~ Prewt + Treat + offset(Prewt),
                    family = gaussian, data = anorexia)
    
    glm(lot1 ~ log(u), data = clotting, family = Gamma)
    
  • Summarize
    Probability Distribution   family Parameter    Typical Use Cases                                                       Default Link Function
    Gaussian                   gaussian (default)  Continuous, normally distributed data                                   Identity
    Binomial                   binomial            Binary (yes/no) data or proportions                                     Logit (probit, cloglog also available)
    Poisson                    poisson             Count data following a Poisson distribution                             Log
    Gamma                      Gamma               Continuous, positive data following a gamma distribution                Inverse
    Inverse Gaussian           inverse.gaussian    Continuous, positive data following an inverse Gaussian distribution    1/mu^2
    Tweedie                    tweedie             Flexible distribution for various data types                            Power value

Quantile regression

library(quantreg)
set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, mean = 0, sd = 2)
y[1:10] <- y[1:10] + 10 # first 10 are outliers

# Fit a traditional linear regression model
fit_lm <- lm(y ~ x)

# Fit a quantile regression model for the 50th percentile
fit_qr <- rq(y ~ x, tau = 0.5)

fit_lm$coefficients
# (Intercept)           x 
#   0.7961233   0.8759269 
fit_qr$coefficients
# (Intercept)           x 
#   0.0772395   1.0058387 

plot(x, y)
points(x[1:10], y[1:10], col='red', pch=16)
abline(fit_lm)
abline(fit_qr, col = 'blue')

Isotonic regression

Piecewise linear regression

Support vector regression

  • An Introduction to Support Vector Regression (SVR).
    • The goal of SVR is to find a function that approximates the relationship between the input and output variables in the training data, with an acceptable amount of error. This is done by mapping the input data into a high-dimensional feature space using a kernel function, and then finding a linear regression function in that space that fits the data with a specified margin of error.
    • SVR has been proven to be an effective tool in real-valued function estimation and, like SVM, it is characterized by the use of kernels, a sparse solution, and VC control of the margin and the number of support vectors (a minimal R sketch follows this list).
  • Awad in Springer
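
A minimal sketch of SVR in R, assuming the e1071 package (kernel, epsilon, and the simulated data are chosen only for illustration):

library(e1071)

set.seed(1)
x <- seq(0, 2*pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)
dat <- data.frame(x, y)

# eps-regression with a radial (RBF) kernel; epsilon sets the acceptable error margin
fit <- svm(y ~ x, data = dat, type = "eps-regression",
           kernel = "radial", epsilon = 0.1)

plot(x, y)
lines(x, predict(fit, dat), col = "blue", lwd = 2)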

Model Misspecification

Choose variables

Variable selection in linear regression models: Choosing the best subset is not always the best choice 2023

GEE/generalized estimating equations

  • https://en.wikipedia.org/wiki/Generalized_estimating_equation
  • Longitudinal Data Analysis for Discrete and Continuous Outcomes Scott L. Zeger and Kung-Yee Liang 1986
  • Analysis of Longitudinal Data (Oxford Statistical Science Series) 2nd edition by Peter Diggle, Patrick Heagerty, Kung-Yee Liang, and Scott Zeger
  • Youtube
  • 12.1 - Introduction to Generalized Estimating Equations from PennState
  • gee package. First version 1999.
  • geepack package. First version 2002.
  • Getting Started with Generalized Estimating Equations
  • Chapter 15 GEE, Binary Outcome: Respiratory Illness from Encyclopedia of Quantitative Methods in R, vol. 5: Multilevel Models.
  • GENERALIZED ESTIMATING EQUATIONS (GEE) by Practical Statistics.
  • Fast Pure R Implementation of GEE: Application of the Matrix Package
  • One example
    library(geepack)
    
    # load data
    data("cbpp", package = "lme4")
    cbpp[1:3, ]
    #   herd incidence size period
    # 1    1         2   14      1
    # 2    1         3   12      2
    # 3    1         4    9      3
    with(cbpp, table(herd, period))
        period
    herd 1 2 3 4
      1  1 1 1 1
      2  1 1 1 0
      3  1 1 1 1
      4  1 1 1 1
      5  1 1 1 1
      6  1 1 1 1
      7  1 1 1 1
      8  1 0 0 0
      9  1 1 1 1
      10 1 1 1 1
      11 1 1 1 1
      12 1 1 1 1
      13 1 1 1 1
      14 1 1 1 1
      15 1 1 1 1
    
    # fit GEE model with logit link
    fit <- geeglm(cbind(incidence, size - incidence) ~ period, id = herd,
                  family = binomial("logit"), corstr = "exchangeable", data = cbpp)
    
    # summary of the model
    summary(fit)
     Coefficients:
                Estimate Std.err  Wald Pr(>|W|)    
    (Intercept)   -1.270   0.276 21.24  4.0e-06 ***
    period2       -1.156   0.436  7.04   0.0080 ** 
    period3       -1.284   0.491  6.84   0.0089 ** 
    period4       -1.754   0.375 21.85  2.9e-06 ***
    ---
    Signif. codes:  
    0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Correlation structure = exchangeable 
    Estimated Scale Parameters:
    
                Estimate Std.err
    (Intercept)    0.134  0.0542
      Link = identity 
    
    Estimated Correlation Parameters:
          Estimate Std.err
    alpha   0.0414   0.129
    Number of clusters:   15  Maximum cluster size: 4 
    
    summary(fit)$coefficients
                Estimate Std.err  Wald Pr(>|W|)
    (Intercept)    -1.27   0.276 21.24 4.04e-06
    period2        -1.16   0.436  7.04 7.99e-03
    period3        -1.28   0.491  6.84 8.90e-03
    period4        -1.75   0.375 21.85 2.94e-06
    
    # compute confidence intervals for the coefficients using a robust
    # (sandwich) variance-covariance matrix
    library(sandwich)
    
    vcov_fit <- vcovHC(fit, type = "HC")
    ci <- coef(fit) + cbind(qnorm(0.025), qnorm(0.975)) * sqrt(diag(vcov_fit))
    colnames(ci) <- c("2.5 %", "97.5 %")
    print(ci)

    [math]\displaystyle{ \operatorname{logit}(p_{ij}) = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_0 + \beta_1 \cdot \text{period}_{ij} + u_j }[/math] where [math]\displaystyle{ p_{ij} }[/math] is the probability of a positive outcome (i.e., an incidence) for

    • the [math]\displaystyle{ i }[/math]-th animal/observation in herd (group/subject) [math]\displaystyle{ j }[/math], and
    • [math]\displaystyle{ \text{period}_{ij} }[/math] is a categorical predictor variable indicating the period of observation for herd/subject [math]\displaystyle{ j }[/math] and observation [math]\displaystyle{ i }[/math].
    • The parameters [math]\displaystyle{ \beta_0 }[/math] and [math]\displaystyle{ \beta_1 }[/math] represent the intercept and the effect of the period variable, respectively.
    • The [math]\displaystyle{ u_j }[/math] is the random subject-specific effect. The random effects [math]\displaystyle{ u_j }[/math] are assumed to follow a multivariate normal distribution with mean 0 and covariance matrix [math]\displaystyle{ \Sigma }[/math].
    • The working correlation structure is specified as "exchangeable", which means that any two observations within the same cluster (herd) are assumed to have the same correlation.

Compare models

  • 015. GEE in Practice: Example on COVID-19 Test Positivity Data. QIC is available in the geepack package.
    • geeglm was used.
    • simple scatterplot by plot() with y=positivity_rate, x=time
    • lattice::xyplot()
    xyplot(positivity_rate ~ time|ID, 
           groups = ID,
           subset = ID %in% sampled_idx,
           data= covid, 
           panel = function(x, y) {
             panel.xyplot(x, y, type='p')
             panel.linejoin(x, y,
                           fun=mean,
                           horizontal = F,
                           lwd=2, col=1)
           })
    • create a factor
    covid$size_factor <-  cut(covid$estimated_pop, 
                       breaks = quantile(unique(covid$estimated_pop)),
                       include.lowest = TRUE,
                       labels = FALSE)  |>  as.factor()
    
    xyplot(positivity_rate ~ time|size_factor, # size_factor is like a facet
           groups = ID,
           data= covid, 
           panel = function(x, y) {
             panel.xyplot(x, y, type='p')
             panel.linejoin(x, y,
                           fun=mean,
                           horizontal = F,
                           lwd=2, col=1)
           })
    • Residual vs. predicted plots from plot(fit) can be used to evaluate which correlation structure is better.
    • QIC() output and which metric can be used for model selection: QIC & CIC are most useful for selecting the correlation structure, while QICu is most useful for comparing models with the same correlation structure but different mean structures (a minimal QIC() sketch follows this list).
  • Analysis of panel data in R using Generalized Estimating Equations which uses the MESS package
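
A minimal sketch of comparing working correlation structures with QIC, assuming the QIC() method provided by recent versions of geepack and reusing the cbpp data from the GEE example above:

fit_exch <- geeglm(cbind(incidence, size - incidence) ~ period, id = herd,
                   family = binomial("logit"), corstr = "exchangeable", data = cbpp)
fit_ind  <- geeglm(cbind(incidence, size - incidence) ~ period, id = herd,
                   family = binomial("logit"), corstr = "independence", data = cbpp)

QIC(fit_exch)  # returns QIC, QICu, CIC, ...; smaller QIC/CIC is preferred
QIC(fit_ind)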

Simulation

  • How to simulate data for generalized estimating equations (GEE) with logistic link function?
  • We generate some example data with repeated measurements. In this simulated data, we have 100 subjects, each measured at 5 time points.
    # Load the geepack package
    library(geepack)
    
    # Generate some example data with repeated measurements
    set.seed(123)
    n <- 100  # Number of subjects
    time <- rep(1:5, each = n)  # Time points
    subject <- rep(1:n, times = 5)  # Subject IDs
    y <- rnorm(n * 5, mean = time * 0.5 + subject * 0.2)  # Simulated response variable
    
    # Create a data frame
    data <- data.frame(subject, time, y)
    
    # Fit a GEE model using geeglm
    # In this example, we assume an exchangeable correlation structure.
    # NOTE: geeglm() treats consecutive rows with the same id as one cluster, so the
    # data should be sorted by subject (e.g. data <- data[order(data$subject), ]);
    # here the rows are ordered by time instead, which is why the output below
    # reports 500 clusters of maximum size 1 and an estimated correlation of 0.
    model <- geeglm(y ~ time, id = subject, data = data, family = gaussian, corstr = "exchangeable")
    
    # Summary of the GEE model
    summary(model)
    # Coefficients:
    #             Estimate Std.err   Wald Pr(>|W|)    
    # (Intercept)  10.1039  0.6222 263.69  < 2e-16 ***
    # time          0.5102  0.1877   7.39  0.00656 ** 
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # Correlation structure = exchangeable 
    # Estimated Scale Parameters:
    #
    #            Estimate Std.err
    # (Intercept)    35.09   1.444
    #   Link = identity 
    #
    # Estimated Correlation Parameters:
    #       Estimate Std.err
    # alpha        0       0
    # Number of clusters:   500  Maximum cluster size: 1 
    
    model2 = geeglm(y ~ time, id = subject, data = data, family = gaussian, corstr = "ind")
    summary(model2)
    #  Coefficients:
    #             Estimate Std.err   Wald Pr(>|W|)    
    # (Intercept)   10.104   0.622 263.68   <2e-16 ***
    # time           0.510   0.188   7.39   0.0066 ** 
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    # 
    # Correlation structure = independence 
    # Estimated Scale Parameters:
    # 
    #             Estimate Std.err
    # (Intercept)     35.1    1.44
    # Number of clusters:   500  Maximum cluster size: 1

GEE vs. mixed effects models

  • https://stats.stackexchange.com/a/16415
  • Getting Started with Generalized Estimating Equations
    • It uses "gee", "lme4", "geepack" packages for model fittings
    dep_gee2 <- gee(depression ~ diagnose + drug*time,
                   data = dat, 
                   id = id, 
                   family = binomial,
                   corstr = "exchangeable")
    How does the GEE model with the exchangeable correlation structure compare to a generalized linear mixed-effects model? We allow the intercept to be conditional on subject, which means we get 340 subject-specific models with different intercepts but identical coefficients for everything else. The model coefficients listed under “Fixed effects” are almost identical to the GEE coefficients. See ?glmer for GLMMs/Generalized Linear Mixed-Effects Models.
    library(lme4) # version 1.1-31
    dep_glmer <- glmer(depression ~ diagnose + drug*time + (1|id), 
                   data = dat, family = binomial)
    summary(dep_glmer, corr = FALSE)
    dep_gee4 <- geeglm(depression ~ diagnose + drug*time,
                       data = dat,
                       id = id,
                       family = binomial,
                       corstr = "exchangeable")
    • "ggeffects" package for plotting. The ggemmeans() function (which builds on the emmeans package) calculates marginal effects and works for both GEE and mixed-effects models.
    library(ggeffects)
    library(emmeans)
    plot(ggemmeans(dep_glmer, terms = c("time", "drug"), 
                   condition = c(diagnose = "severe"))) + 
      ggplot2::ggtitle("GLMER Effect plot")
    
    data.frame(ggemmeans(dep_glmer, terms = c("time", "drug"), 
               condition = c(diagnose = "severe")) ) 
      x predicted std.error conf.low conf.high    group
    1 0     0.207     0.175    0.156     0.269 standard
    2 0     0.197     0.187    0.146     0.262      new
    3 1     0.297     0.119    0.251     0.348 standard
    4 1     0.525     0.126    0.463     0.586      new
    5 2     0.407     0.156    0.335     0.482 standard
    6 2     0.832     0.217    0.764     0.883      new
    filter(dat, diagnose == "severe") %>% 
       group_by(drug, time) %>% 
       summarize(mean = mean(depression))
      drug      time  mean
      <fct>    <int> <dbl>
    1 standard     0 0.21 
    2 standard     1 0.28 
    3 standard     2 0.46 
    4 new          0 0.178
    5 new          1 0.5  
    6 new          2 0.833
    
    plot(ggemmeans(dep_gee4, terms = c("time", "drug"),
                   condition = c(diagnose = "severe"))) + 
      ggplot2::ggtitle("GEE Effect plot")
  • While both methods are used for modeling correlated data, there are some key differences between them.
    • Assumptions: The key difference between GEE and nlme is that GEE makes fewer assumptions about the underlying data structure. GEE assumes that the mean structure is correctly specified, but the correlation structure is unspecified. In contrast, nlme assumes that both the mean and the variance-covariance structure are correctly specified.
    • Model flexibility: GEE is a more flexible model than nlme because it allows for modeling the mean and variance of the response separately. This means that GEE is more suitable for modeling data with non-normal errors, such as count or binary data. In contrast, nlme assumes that the errors are normally distributed and allows for modeling the mean and variance jointly.
    • Inference: Another key difference between GEE and nlme is the type of inference they provide. GEE provides marginal inference, which means that it estimates the population-averaged effect of the predictor variables on the response variable. In contrast, nlme provides conditional inference, which means that it estimates the subject-specific effect of the predictor variables on the response variable.
    • Computation: The computation involved in fitting GEE and nlme models is also different. GEE uses an iterative algorithm to estimate the model parameters, while nlme uses a maximum likelihood estimation approach. The computational complexity of GEE is lower than that of nlme, making it more suitable for large datasets.
    • In summary, GEE and nlme are two popular models used for analyzing longitudinal data. The main differences between the two models are the assumptions they make, model flexibility, inference type, and computational complexity. GEE is a more flexible model that makes fewer assumptions and provides marginal inference, while nlme assumes normality of errors and provides conditional inference.

Deming regression

Deming regression

Tweedie regression

Model selection, AIC and Tweedie regression

Causal inference

  • The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
    Outcome [math]\displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }[/math]
    Causal effect (unobserved) [math]\displaystyle{ \begin{align} \tau = E(Y(1) -Y(0)) \end{align} }[/math]
    where [math]\displaystyle{ E[Y(1)] }[/math] referred to the expected outcome in the hypothetical situation that everyone in the population was assigned to treatment, [math]\displaystyle{ E[Y|T=1] }[/math] refers to the expected outcome for all individuals in the population who are actually assigned to treatment... The key is that the value of [math]\displaystyle{ E[Y|T=1]−E[Y|T=0] }[/math] is only equal to the causal effect, [math]\displaystyle{ E[Y(1)−Y(0)] }[/math] if there are no confounders present.
    Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights. A minimal IPTW sketch follows this list.
  • A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.
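
A minimal sketch of the IPTW recipe described above (propensity score from a logistic regression, stabilized weights); the simulated confounder and effect sizes are made up purely for illustration:

set.seed(1)
n  <- 2000
x  <- rnorm(n)                                # confounder
tr <- rbinom(n, 1, plogis(0.8 * x))           # treatment depends on x
y  <- 1 + 2 * tr + 1.5 * x + rnorm(n)         # true treatment effect = 2

mean(y[tr == 1]) - mean(y[tr == 0])           # naive difference is biased upward

# propensity score P(T=1|X) via logistic regression, then stabilized weights
ps <- predict(glm(tr ~ x, family = binomial), type = "response")
w  <- ifelse(tr == 1, mean(tr) / ps, (1 - mean(tr)) / (1 - ps))

# weighted difference in means recovers roughly 2, the causal effect
weighted.mean(y[tr == 1], w[tr == 1]) - weighted.mean(y[tr == 0], w[tr == 0])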

Seemingly unrelated regressions/SUR

  • Seemingly unrelated regressions from wikipedia
    • Advantages: The SUR model is useful when the error terms of the regression equations are correlated with each other. In this case, the ordinary least squares (OLS) estimator is inefficient and inconsistent. The SUR model can provide more efficient and consistent estimates of the regression coefficients
    • Disadvantages: One disadvantage of using the SUR model is that it requires the assumption that the error terms are correlated across equations. If this assumption is not met, then the SUR model may not provide more efficient estimates than estimating each equation separately. Another disadvantage is that it can be computationally intensive and may require more time to estimate than estimating each equation separately.
  • Seemingly Unrelated Regression (SUR/SURE)
  • How can I perform seemingly unrelated regression in R?
  • spsur package
  • Equations for the simplest case:
    [math]\displaystyle{ \begin{align} y_1 &= b_1x_1 + b_2x_2 + u_1 \\ y_2 &= b_3x_1 + b_4x_2 + u_2 \end{align} }[/math]
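
A minimal sketch of fitting the two-equation system above, assuming the systemfit package (the data are simulated only for illustration):

library(systemfit)

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
u1 <- rnorm(n); u2 <- 0.8 * u1 + rnorm(n, sd = 0.6)  # errors correlated across equations
y1 <- 2 * x1 - 1 * x2 + u1
y2 <- 0.5 * x1 + 3 * x2 + u2
dat <- data.frame(y1, y2, x1, x2)

eqs <- list(eq1 = y1 ~ x1 + x2 - 1,   # "- 1" drops the intercept, matching the equations above
            eq2 = y2 ~ x1 + x2 - 1)
fit_sur <- systemfit(eqs, method = "SUR", data = dat)
summary(fit_sur)   # compare with method = "OLS" (equation-by-equation least squares)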