Regression: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 249: Line 249:
* [https://davidlindelof.com/no-you-have-not-controlled-for-confounders/ No, you have not controlled for confounders]
* [https://davidlindelof.com/no-you-have-not-controlled-for-confounders/ No, you have not controlled for confounders]
* [https://stats.stackexchange.com/questions/558403/visual-demonstration-of-residual-confounding Visual Demonstration of Residual Confounding]. Don't dichotomize a continuous variable.
* [https://stats.stackexchange.com/questions/558403/visual-demonstration-of-residual-confounding Visual Demonstration of Residual Confounding]. Don't dichotomize a continuous variable.
* What is a '''randomized block design'''?
* See [[T-test#Randomized_block_design|Randomized block design]]
** In a randomized block design, the subjects are first divided into blocks based on their similarities, and then they are randomly assigned to the treatment groups within each block.
 
* How to interpret the result from a randomized block design?
** If the results are statistically significant, it means that there is a significant difference between the treatment groups in terms of the response variable. This indicates that the treatment had an effect on the response variable, and that this effect is not likely due to chance alone.
 
* How to incorporate the block variable in the interpretation?
** In a randomized block design, the block variable is a characteristic that is used to group the subjects or experimental units into blocks. The goal of using a block variable is to control for the effects of this characteristic, so that the effects of the experimental variables can be more accurately measured.
** To incorporate the block variable into the interpretation of the results, you will need to consider whether the block variable had an effect on the response variable.
** If the block variable had a significant effect on the response variable, it means that the results may be '''confounded''' by the block variable. In this case, it may be necessary to take the block variable into account when interpreting the results. For example, you might need to consider whether the treatment effects are different for different blocks, or whether the block variable is interacting with the treatment in some way.
 
* What does that mean the result is '''confounded''' by the variable?
** If the results of an experiment are '''confounded''' by a variable, it means that the variable is influencing the results in some way and making it difficult to interpret the effects of the experimental treatment. This can occur when the variable is correlated with both the treatment and the response variable, and it can lead to incorrect conclusions about the relationship between the treatment and the response.
** For example, consider an experiment in which the treatment is a drug that is being tested for its ability to lower blood pressure. If the subjects are divided into blocks based on their age, and the results show that the drug is more effective in younger subjects than in older subjects, this could be because the drug is more effective in younger subjects, or it could be because blood pressure tends to be higher in older subjects and the effect of the drug is being masked. In this case, the results would be confounded by age, and it would be difficult to draw conclusions about the effectiveness of the drug without taking age into account.
** To avoid confounding, it is important to carefully control for any variables that could potentially influence the results of the experiment. This may involve stratifying the sample, using a matching or blocking design, or controlling for the variable in the statistical analysis. By controlling for confounding variables, you can more accurately interpret the results of the experiment and draw valid conclusions about the relationship between the treatment and the response.


== Confidence interval vs prediction interval ==
== Confidence interval vs prediction interval ==

Revision as of 16:49, 10 January 2023

Linear Regression

Comic

MSE

Coefficient of determination R2

[math]\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }[/math]

Pearson correlation and linear regression slope

[math]\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }[/math]

where [math]\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }[/math].

set.seed(1)
x <- rnorm(10); y <-rnorm(10)
coef(lm(y~x))
# (Intercept)           x
#   0.3170798  -0.5161377

cor(x, y)*sd(y)/sd(x)
# [1] -0.5161377

Different models (in R)

http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/

Factor Variables

Regression With Factor Variables

dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.

Add Regression Line per Group to Scatterplot

How To Add Regression Line per Group to Scatterplot in ggplot2?

penguins_df %>%
  ggplot(aes(x=culmen_length_mm, 
             y=flipper_length_mm, 
             color=species))+
  geom_point()+
  geom_smooth(method="lm", se = FALSE)

model.matrix, design matrix

Contrasts in linear regression

  • Page 147 of Modern Applied Statistics with S (4th ed)
  • https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
  • A (sort of) Complete Guide to Contrasts in R by Rose Maier
    mat
    
    ##      constant NLvMH  NvL  MvH
    ## [1,]        1  -0.5  0.5  0.0
    ## [2,]        1  -0.5 -0.5  0.0
    ## [3,]        1   0.5  0.0  0.5
    ## [4,]        1   0.5  0.0 -0.5
    mat <- mat[ , -1]
    
    model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
    summary(model7)
    
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
    ## doseNLvMH      3.179      2.152   1.477  0.14215    
    ## doseNvL       -8.723      3.044  -2.866  0.00489 ** 
    ## doseMvH       13.232      3.044   4.347 2.84e-05 ***
    
    # double check your contrasts
    attributes(model7$qr$qr)$contrasts
    ## $dose
    ##      NLvMH  NvL  MvH
    ## None  -0.5  0.5  0.0
    ## Low   -0.5 -0.5  0.0
    ## Med    0.5  0.0  0.5
    ## High   0.5  0.0 -0.5
    
    library(dplyr)
    dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
    dose.means
    ## Source: local data frame [4 x 2]
    ## 
    ##   dose   y.mean
    ## 1 None 112.6267
    ## 2  Low 121.3500
    ## 3  Med 126.7839
    ## 4 High 113.5517
    
    # The coefficient estimate for the first contrast (3.18) equals the average of 
    # the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of 
    # the first two groups (112.63 + 121.35 /2 = 116.99).

Multicollinearity

  • A toy example
    n <- 100
    set.seed(1)
    x1 <- rnorm(n)
    e <- rnorm(n)*.01
    y <- x1 + e
    cor(y, e)  # 0.00966967
    cor(y, x1) # 0.9999
    lm(y ~ x1) |> summary()      # p<2e-16
    
    set.seed(2)
    x2 <- x1 + rnorm(n)*.1       # x2 = x1 + noise
    cor(x1, x2)  # .99
    lm(y ~ x1 + x2) |> summary() # x2 insig
    lm(y~ x2) |> summary()       # x2 sig
    
    set.seed(3)
    x3 <- x1 + rnorm(n)*.0001    # x3 = x1 + tiny noise
    cor(x1, x3) # 1
    lm(y ~ x1 + x3) |> summary() # both insig. SURPRISE!
    lm(y ~ x1) |> summary()
    
    x4 <- x1                     # x4 is exactly equal to x1  
    lm(y~ x1 + x4) |> summary()  # x4 coef not defined because of singularities
    lm(y~ x4 + x1) |> summary()  # x1 coef not defined because of singularities
    

    Consider lasso

    fit <- cv.glmnet(x=cbind(x1, x3, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                      s1
    # (Intercept) 0.002797165
    # x1          0.970839175
    # x3          .          
    #             .          
    
    fit <- cv.glmnet(x=cbind(x1, x4, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                       s1
    # (Intercept) 2.797165e-03
    # x1          9.708392e-01
    # x4          6.939215e-18
    #             .   
    
    fit <- cv.glmnet(x=cbind(x4, x1, matrix(rnorm(n*10), nr=n)), y=y)
    coefficients(fit, s = "lambda.min")
    # 13 x 1 sparse Matrix of class "dgCMatrix"
    #                      s1
    # (Intercept) 2.797165e-03
    # x4          9.708392e-01
    # x1          6.93 9215e-18
    #             .   
    
  • How to Fix in R: not defined because of singularities
  • Multicollinearity in R
  • Detecting multicollinearity — it’s not that easy sometimes
  • alias: Find Aliases (Dependencies) In A Model
    > op <- options(contrasts = c("contr.helmert", "contr.poly"))
    > npk.aov <- aov(yield ~ block + N*P*K, npk)
    > alias(npk.aov)
    Model :
    yield ~ block + N * P * K
    
    Complete :
             (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
    N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0     0     0
    
    > options(op)
    

Exposure

https://en.mimi.hu/mathematics/exposure_variable.html

Independent variable = predictor = explanatory = exposure variable

Marginal effects

The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.

Confounders, confounding

Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Linear regression with Map Reduce

https://freakonometrics.hypotheses.org/53269

Relationship between multiple variables

Visualizing the relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

Reduced rank regression

Singular value decomposition

  • Moore-Penrose matrix inverse in R
    > a = matrix(c(2, 3), nr=1)
    > MASS::ginv(a) * 8 
             [,1]
    [1,] 1.230769
    [2,] 1.846154  
    # Same solution as matlab lsqminnorm(A,b)
    
    > a %*% MASS::ginv(a)
         [,1]
    [1,]    1
    > a %*% MASS::ginv(a) %*% a
         [,1] [,2]
    [1,]    2    3
    > MASS::ginv   # view the source code
    

Mahalanobis distance and outliers detection

Mahalanobis distance

  • The Mahalanobis distance is a measure of the distance between a point P and a distribution D
  • It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
  • The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
  • Distance is not always what it seems

performance::check_outliers() Outliers detection (check for influential observations)

How to Calculate Mahalanobis Distance in R

set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x)) # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu) # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) : 
#   system is computationally singular: reciprocal condition number = 1.93998e-19

Type 1 error

Linear Regression And Type I Error

More Data Can Hurt for Linear Regression

Sometimes more data can hurt!

Estimating Coefficients for Variables in R

Trying to Trick Linear Regression - Estimating Coefficients for Variables in R

How to interpret the interaction term

how to interpret the interaction term in lm formula in R? If both x1 and x2 are numerical, then x1:x2 is actually x1*x2 in computation. That is y ~ x1 + x2 + x1:x2 is equivalent to y ~ x1 + x2 + x3 where x3 = x1*x2. The cross is literally the two terms multiplied -- interpretation will largely depend on whether var1 and var2 are both continuous (quite hard to interpret, in my opinion) or whether one of these is e.g. binary categorical (easier to consider.)

Intercept only model and cross-validation

n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
  fit <- lm(y ~ 1, data = df[-i, ])
  pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred) # -1

How about 1000 simulated data?

foo <- function(n=3, debug=F) {
  x <- rnorm(n)
  y <- 2*x + .5*rnorm(n)
  df <- data.frame(x=x, y=y)
  pred <- double(n)
  for(i in 1:n) {
    fit <- lm(y ~ 1, data = df[-i, ])
    pred[i] <- predict(fit, df[i, ])
  }
  if (debug) {
    cat("num=", n*sum(y*pred)-sum(pred)*sum(y), "\n")
    cat("denom=", sqrt(n*sum(y**2) - sum(y)^2)*sqrt(n*sum(pred**2)-sum(pred)^2), "\n")
    invisible(list(y=y, pred=pred, cor=cor(y, pred)))
  } else {
    cor(y, pred)
  }
}

o <- replicate(1000, foo(n=10))
range(o) # [1] -1 -1
all.equal(o, rep(-1, 1000)) # TRUE

Note the property will not happen in k-fold CV (not LOOCV)

n <- 20; nfold <- 5
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
#plot(x, y)
df <- data.frame(x=x, y=y)
set.seed(1)
folds <- split(sample(1:n), rep(1:nfold, length = n))
pred <- double(n)
for(i in 1:nfold) {
  fit <- lm(y ~ 1, data = df[-folds[[i]], ])
  pred[folds[[i]]] <- predict(fit, df[folds[[i]], ])
}
plot(y, pred)
cor(y, pred) # -0.6696743

See also

lm.fit and multiple responses/genes

Logistic regression

  • 5個方式超越二元對立
    1. 一切都是遊戲
    2. 練習寬恕
    3. 不評判 感受一切
    4. 信任直覺(高我)
    5. 親修實證
  • https://en.wikipedia.org/wiki/Logistic_regression
  • Logistic regression or T test?. Choosing between logistic regression and Mann Whitney/t-tests
    • The t-test is not significant but the logistic regression is
    • The t-test is significant but the logistic regression is not, as in the question
  • Simulation data shows when the covariate can perfectly separate in two groups, logistic regression won't work.
    > set.seed(1234); n <- 16; mu=3; x <- c(rnorm(n), rnorm(n, mu)); y <- rep(0:1, each=n)
    > summary(glm(y ~ x, family = binomial)); plot(x, y)
    ...
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -116.7    76341.5  -0.002    0.999
    x               88.5    56801.0   0.002    0.999
    ...
    Warning messages:
    1: glm.fit: algorithm did not converge 
    2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
    

    LogisticFail.svg

  • Logistic regression in R. Assessing the fit with a pseudo R2. Alternative pseudo R2. Assessing the significance.
  • How to Plot a Logistic Regression Curve in R Simple model.
  • Plotting the results of your logistic regression Part 1: Continuous by categorical interaction. Multivariate model.
  • Interpret Logistic Regression Coefficients (For Beginners).
    • Increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by exp(β).
    • exp^β is the odds ratio that associates smoking to the risk of heart disease.
    • If exp^β = exp^0.38 = 1.46, the smoking group has a 1.46 times the odds of the non-smoking group of having heart disease.
    • The smoking group has 46% (1.46 – 1 = 0.46) more odds of having heart disease than the non-smoking group.
    • If β = – 0.38, then exp^β = 0.68 and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the relative risk of heart disease.
    • If β = 0, then exp^β =1, the smoking group has the same odds as the non-smoking group of having heart disease.
    • How to interpret the intercept? If the intercept has a negative sign: then the probability of having the outcome will be < 0.5.
  • A Simple Interpretation of Logistic Regression Coefficients. logit(p) = log-odds ratio.
    • A 1 unit increase in X₁ will result in b increase in the log-odds ratio of success : failure.
    • In other words, if exp(b)=1.14, it means increasing studying hour by 1 unit will have a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed) where p = the probability of passing an exam.
  • How to Interpret Logistic Regression Coefficients
  • Generalized Linear Models, Part I: The Logistic Model, odds ratio, hypothesis testing, change of the reference factor
  • Logistic regression log odds log(p/(1-p)) = beta * X
  • logit function f(x) = logit(x) = 1 / (1+exp(-x)) and we model the response Y by f(b0 + b1*x).
  • Interpretation: consider X=(intercept, x), beta = (beta0, beta1) and assume x = 0/1. Then logit(beta0) is the percentage of positive results in Y when x = 0, and logit(beta0 + beta1) is the percentage of positive results in Y when x =1. Again, exp(beta1) is the odds ratio. The probabilities can be predicted by using the formula 1 / (1 + exp (-(b0 + b1*x)) )

Quantile regression

Isotonic regression

Piecewise linear regression

GEE/generalized estimating equations

Causal inference

  • The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
    Outcome [math]\displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }[/math]
    Causal effect (unobserved) [math]\displaystyle{ \begin{align} \tau = E(Y(1) -Y(0)) \end{align} }[/math]
    where [math]\displaystyle{ E[Y(1)] }[/math] referred to the expected outcome in the hypothetical situation that everyone in the population was assigned to treatment, [math]\displaystyle{ E[Y|T=1] }[/math] refers to the expected outcome for all individuals in the population who are actually assigned to treatment... The key is that the value of [math]\displaystyle{ E[Y|T=1]−E[Y|T=0] }[/math] is only equal to the causal effect, [math]\displaystyle{ E[Y(1)−Y(0)] }[/math] if there are no confounders present.
    Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights.
  • A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.

Seemingly unrelated regressions

wikipedia