Regression

From 太極
Jump to navigation Jump to search

Linear Regression

Comic

MSE

Coefficient of determination R2

[math]\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }[/math]

Pearson correlation and linear regression slope

[math]\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }[/math]

where [math]\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }[/math].

set.seed(1)
x <- rnorm(10); y <-rnorm(10)
coef(lm(y~x))
# (Intercept)           x
#   0.3170798  -0.5161377

cor(x, y)*sd(y)/sd(x)
# [1] -0.5161377

Different models (in R)

http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/

Factor Variables

Regression With Factor Variables

dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.

Add Regression Line per Group to Scatterplot

How To Add Regression Line per Group to Scatterplot in ggplot2?

penguins_df %>%
  ggplot(aes(x=culmen_length_mm, 
             y=flipper_length_mm, 
             color=species))+
  geom_point()+
  geom_smooth(method="lm", se = FALSE)

model.matrix, design matrix

Contrasts in linear regression

  • Page 147 of Modern Applied Statistics with S (4th ed)
  • https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
  • A (sort of) Complete Guide to Contrasts in R by Rose Maier
    mat
    
    ##      constant NLvMH  NvL  MvH
    ## [1,]        1  -0.5  0.5  0.0
    ## [2,]        1  -0.5 -0.5  0.0
    ## [3,]        1   0.5  0.0  0.5
    ## [4,]        1   0.5  0.0 -0.5
    mat <- mat[ , -1]
    
    model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
    summary(model7)
    
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
    ## doseNLvMH      3.179      2.152   1.477  0.14215    
    ## doseNvL       -8.723      3.044  -2.866  0.00489 ** 
    ## doseMvH       13.232      3.044   4.347 2.84e-05 ***
    
    # double check your contrasts
    attributes(model7$qr$qr)$contrasts
    ## $dose
    ##      NLvMH  NvL  MvH
    ## None  -0.5  0.5  0.0
    ## Low   -0.5 -0.5  0.0
    ## Med    0.5  0.0  0.5
    ## High   0.5  0.0 -0.5
    
    library(dplyr)
    dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
    dose.means
    ## Source: local data frame [4 x 2]
    ## 
    ##   dose   y.mean
    ## 1 None 112.6267
    ## 2  Low 121.3500
    ## 3  Med 126.7839
    ## 4 High 113.5517
    
    # The coefficient estimate for the first contrast (3.18) equals the average of 
    # the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of 
    # the first two groups (112.63 + 121.35 /2 = 116.99).

Multicollinearity

> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K

Complete :
         (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0     0     0

> options(op)

Exposure

https://en.mimi.hu/mathematics/exposure_variable.html

Independent variable = predictor = explanatory = exposure variable

Marginal effects

The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.

Confounders, confounding

Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Linear regression with Map Reduce

https://freakonometrics.hypotheses.org/53269

Relationship between multiple variables

Visualizing the relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

Reduced rank regression

Singular value decomposition

  • Moore-Penrose matrix inverse in R
    > a = matrix(c(2, 3), nr=1)
    > MASS::ginv(a) * 8 
             [,1]
    [1,] 1.230769
    [2,] 1.846154  
    # Same solution as matlab lsqminnorm(A,b)
    
    > a %*% MASS::ginv(a)
         [,1]
    [1,]    1
    > a %*% MASS::ginv(a) %*% a
         [,1] [,2]
    [1,]    2    3
    > MASS::ginv   # view the source code
    

Mahalanobis distance and outliers detection

Mahalanobis distance

  • The Mahalanobis distance is a measure of the distance between a point P and a distribution D
  • It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
  • The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
  • Distance is not always what it seems

performance::check_outliers() Outliers detection (check for influential observations)

How to Calculate Mahalanobis Distance in R

set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x)) # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu) # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) : 
#   system is computationally singular: reciprocal condition number = 1.93998e-19

Type 1 error

Linear Regression And Type I Error

More Data Can Hurt for Linear Regression

Sometimes more data can hurt!

Estimating Coefficients for Variables in R

Trying to Trick Linear Regression - Estimating Coefficients for Variables in R

How to interpret the interaction term

how to interpret the interaction term in lm formula in R? If both x1 and x2 are numerical, then x1:x2 is actually x1*x2 in computation. That is y ~ x1 + x2 + x1:x2 is equivalent to y ~ x1 + x2 + x3 where x3 = x1*x2. The cross is literally the two terms multiplied -- interpretation will largely depend on whether var1 and var2 are both continuous (quite hard to interpret, in my opinion) or whether one of these is e.g. binary categorical (easier to consider.)

Intercept only model and cross-validation

n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
  fit <- lm(y ~ 1, data = df[-i, ])
  pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred) # -1

How about 1000 simulated data?

foo <- function(n=3, debug=F) {
  x <- rnorm(n)
  y <- 2*x + .5*rnorm(n)
  df <- data.frame(x=x, y=y)
  pred <- double(n)
  for(i in 1:n) {
    fit <- lm(y ~ 1, data = df[-i, ])
    pred[i] <- predict(fit, df[i, ])
  }
  if (debug) {
    cat("num=", n*sum(y*pred)-sum(pred)*sum(y), "\n")
    cat("denom=", sqrt(n*sum(y**2) - sum(y)^2)*sqrt(n*sum(pred**2)-sum(pred)^2), "\n")
    invisible(list(y=y, pred=pred, cor=cor(y, pred)))
  } else {
    cor(y, pred)
  }
}

o <- replicate(1000, foo(n=10))
range(o) # [1] -1 -1
all.equal(o, rep(-1, 1000)) # TRUE

Note the property will not happen in k-fold CV (not LOOCV)

n <- 20; nfold <- 5
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
#plot(x, y)
df <- data.frame(x=x, y=y)
set.seed(1)
folds <- split(sample(1:n), rep(1:nfold, length = n))
pred <- double(n)
for(i in 1:nfold) {
  fit <- lm(y ~ 1, data = df[-folds[[i]], ])
  pred[folds[[i]]] <- predict(fit, df[folds[[i]], ])
}
plot(y, pred)
cor(y, pred) # -0.6696743

See also

lm.fit and multiple responses/genes

Logistic regression

  • 5個方式超越二元對立
    1. 一切都是遊戲
    2. 練習寬恕
    3. 不評判 感受一切
    4. 信任直覺(高我)
    5. 親修實證
  • Logistic regression or T test?. Choosing between logistic regression and Mann Whitney/t-tests
    • The t-test is not significant but the logistic regression is
    • The t-test is significant but the logistic regression is not, as in the question
  • Simulation data shows when the covariate can perfectly separate in two groups, logistic regression won't work.
    > set.seed(1234); n <- 16; mu=3; x <- c(rnorm(n), rnorm(n, mu)); y <- rep(0:1, each=n)
    > summary(glm(y ~ x, family = binomial)); plot(x, y)
    ...
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -116.7    76341.5  -0.002    0.999
    x               88.5    56801.0   0.002    0.999
    ...
    Warning messages:
    1: glm.fit: algorithm did not converge 
    2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
    

    LogisticFail.svg

  • Logistic regression in R. Assessing the fit with a pseudo R2. Alternative pseudo R2. Assessing the significance.
  • How to Plot a Logistic Regression Curve in R Simple model.
  • Plotting the results of your logistic regression Part 1: Continuous by categorical interaction. Multivariate model.
  • Interpret Logistic Regression Coefficients (For Beginners).
    • Increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by exp(β).
    • exp^β is the odds ratio that associates smoking to the risk of heart disease.
    • If exp^β = exp^0.38 = 1.46, the smoking group has a 1.46 times the odds of the non-smoking group of having heart disease.
    • The smoking group has 46% (1.46 – 1 = 0.46) more odds of having heart disease than the non-smoking group.
    • If β = – 0.38, then exp^β = 0.68 and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the relative risk of heart disease.
    • If β = 0, then exp^β =1, the smoking group has the same odds as the non-smoking group of having heart disease.
    • How to interpret the intercept? If the intercept has a negative sign: then the probability of having the outcome will be < 0.5.
  • A Simple Interpretation of Logistic Regression Coefficients. logit(p) = log-odds ratio.
    • A 1 unit increase in X₁ will result in b increase in the log-odds ratio of success : failure.
    • In other words, if exp(b)=1.14, it means increasing studying hour by 1 unit will have a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed) where p = the probability of passing an exam.
  • How to Interpret Logistic Regression Coefficients
  • Generalized Linear Models, Part I: The Logistic Model, odds ratio, hypothesis testing, change of the reference factor

Warning of too few observations

1: In lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground

Quantile regression

Isotonic regression

Causal inference

  • The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
    Outcome [math]\displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }[/math]
    Causal effect (unobserved) [math]\displaystyle{ \begin{align} \tau = E(Y(1) -Y(0)) \end{align} }[/math]
    where [math]\displaystyle{ E[Y(1)] }[/math] referred to the expected outcome in the hypothetical situation that everyone in the population was assigned to treatment, [math]\displaystyle{ E[Y|T=1] }[/math] refers to the expected outcome for all individuals in the population who are actually assigned to treatment... The key is that the value of [math]\displaystyle{ E[Y|T=1]−E[Y|T=0] }[/math] is only equal to the causal effect, [math]\displaystyle{ E[Y(1)−Y(0)] }[/math] if there are no confounders present.
    Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights.
  • A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.