# Regression

Jump to navigation Jump to search

# Linear Regression

## Coefficient of determination R2

\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }

## Pearson correlation and linear regression slope

\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }

where $\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }$.

set.seed(1)
x <- rnorm(10); y <-rnorm(10)
coef(lm(y~x))
# (Intercept)           x
#   0.3170798  -0.5161377

cor(x, y)*sd(y)/sd(x)
# [1] -0.5161377


## dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.

## Add Regression Line per Group to Scatterplot

penguins_df %>%
ggplot(aes(x=culmen_length_mm,
y=flipper_length_mm,
color=species))+
geom_point()+
geom_smooth(method="lm", se = FALSE)


## Contrasts in linear regression

• Page 147 of Modern Applied Statistics with S (4th ed)
• https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
• A (sort of) Complete Guide to Contrasts in R by Rose Maier
mat

##      constant NLvMH  NvL  MvH
## [1,]        1  -0.5  0.5  0.0
## [2,]        1  -0.5 -0.5  0.0
## [3,]        1   0.5  0.0  0.5
## [4,]        1   0.5  0.0 -0.5
mat <- mat[ , -1]

model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
summary(model7)

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
## doseNLvMH      3.179      2.152   1.477  0.14215
## doseNvL       -8.723      3.044  -2.866  0.00489 **
## doseMvH       13.232      3.044   4.347 2.84e-05 ***

# double check your contrasts
attributes(model7$qr$qr)$contrasts ##$dose
##      NLvMH  NvL  MvH
## None  -0.5  0.5  0.0
## Low   -0.5 -0.5  0.0
## Med    0.5  0.0  0.5
## High   0.5  0.0 -0.5

library(dplyr)
dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
dose.means
## Source: local data frame [4 x 2]
##
##   dose   y.mean
## 1 None 112.6267
## 2  Low 121.3500
## 3  Med 126.7839
## 4 High 113.5517

# The coefficient estimate for the first contrast (3.18) equals the average of
# the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of
# the first two groups (112.63 + 121.35 /2 = 116.99).


## Multicollinearity

• A toy example
n <- 100
set.seed(1)
x1 <- rnorm(n)
e <- rnorm(n)*.01
y <- x1 + e
cor(y, e)  # 0.00966967
cor(y, x1) # 0.9999
lm(y ~ x1) |> summary()      # p<2e-16

set.seed(2)
x2 <- x1 + rnorm(n)*.1       # x2 = x1 + noise
cor(x1, x2)  # .99
lm(y ~ x1 + x2) |> summary() # x2 insig
lm(y~ x2) |> summary()       # x2 sig

set.seed(3)
x3 <- x1 + rnorm(n)*.0001    # x3 = x1 + tiny noise
cor(x1, x3) # 1
lm(y ~ x1 + x3) |> summary() # both insig. SURPRISE!
lm(y ~ x1) |> summary()

x4 <- x1                     # x4 is exactly equal to x1
lm(y~ x1 + x4) |> summary()  # x4 coef not defined because of singularities
lm(y~ x4 + x1) |> summary()  # x1 coef not defined because of singularities


Consider lasso

fit <- cv.glmnet(x=cbind(x1, x3, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                      s1
# (Intercept) 0.002797165
# x1          0.970839175
# x3          .
#             .

fit <- cv.glmnet(x=cbind(x1, x4, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                       s1
# (Intercept) 2.797165e-03
# x1          9.708392e-01
# x4          6.939215e-18
#             .

fit <- cv.glmnet(x=cbind(x4, x1, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                      s1
# (Intercept) 2.797165e-03
# x4          9.708392e-01
# x1          6.93 9215e-18
#             .

• How to Fix in R: not defined because of singularities
• Multicollinearity in R
• Detecting multicollinearity — it’s not that easy sometimes
• alias: Find Aliases (Dependencies) In A Model
> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K

Complete :
(Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0     0     0

> options(op)


## Exposure

Independent variable = predictor = explanatory = exposure variable

## Marginal effects

The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.

## Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).

## Reduced rank regression

### Singular value decomposition

• Moore-Penrose matrix inverse in R
> a = matrix(c(2, 3), nr=1)
> MASS::ginv(a) * 8
[,1]
[1,] 1.230769
[2,] 1.846154
# Same solution as matlab lsqminnorm(A,b)

> a %*% MASS::ginv(a)
[,1]
[1,]    1
> a %*% MASS::ginv(a) %*% a
[,1] [,2]
[1,]    2    3
> MASS::ginv   # view the source code


## Mahalanobis distance and outliers detection

• The Mahalanobis distance is a measure of the distance between a point P and a distribution D
• It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
• The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
• Distance is not always what it seems

performance::check_outliers() Outliers detection (check for influential observations)

set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x)) # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu) # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) :
#   system is computationally singular: reciprocal condition number = 1.93998e-19


## How to interpret the interaction term

how to interpret the interaction term in lm formula in R? If both x1 and x2 are numerical, then x1:x2 is actually x1*x2 in computation. That is y ~ x1 + x2 + x1:x2 is equivalent to y ~ x1 + x2 + x3 where x3 = x1*x2. The cross is literally the two terms multiplied -- interpretation will largely depend on whether var1 and var2 are both continuous (quite hard to interpret, in my opinion) or whether one of these is e.g. binary categorical (easier to consider.)

## Intercept only model and cross-validation

n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
fit <- lm(y ~ 1, data = df[-i, ])
pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred) # -1


How about 1000 simulated data?

foo <- function(n=3, debug=F) {
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
fit <- lm(y ~ 1, data = df[-i, ])
pred[i] <- predict(fit, df[i, ])
}
if (debug) {
cat("num=", n*sum(y*pred)-sum(pred)*sum(y), "\n")
cat("denom=", sqrt(n*sum(y**2) - sum(y)^2)*sqrt(n*sum(pred**2)-sum(pred)^2), "\n")
invisible(list(y=y, pred=pred, cor=cor(y, pred)))
} else {
cor(y, pred)
}
}

o <- replicate(1000, foo(n=10))
range(o) # [1] -1 -1
all.equal(o, rep(-1, 1000)) # TRUE


Note the property will not happen in k-fold CV (not LOOCV)

n <- 20; nfold <- 5
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
#plot(x, y)
df <- data.frame(x=x, y=y)
set.seed(1)
folds <- split(sample(1:n), rep(1:nfold, length = n))
pred <- double(n)
for(i in 1:nfold) {
fit <- lm(y ~ 1, data = df[-folds[[i]], ])
pred[folds[[i]]] <- predict(fit, df[folds[[i]], ])
}
plot(y, pred)
cor(y, pred) # -0.6696743


See also

# Logistic regression

• 5個方式超越二元對立
1. 一切都是遊戲
2. 練習寬恕
3. 不評判 感受一切
4. 信任直覺(高我)
5. 親修實證
• https://en.wikipedia.org/wiki/Logistic_regression
• Logistic regression or T test?. Choosing between logistic regression and Mann Whitney/t-tests
• The t-test is not significant but the logistic regression is
• The t-test is significant but the logistic regression is not, as in the question
• Simulation data shows when the covariate can perfectly separate in two groups, logistic regression won't work.
> set.seed(1234); n <- 16; mu=3; x <- c(rnorm(n), rnorm(n, mu)); y <- rep(0:1, each=n)
> summary(glm(y ~ x, family = binomial)); plot(x, y)
...
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -116.7    76341.5  -0.002    0.999
x               88.5    56801.0   0.002    0.999
...
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred


• Logistic regression in R. Assessing the fit with a pseudo R2. Alternative pseudo R2. Assessing the significance.
• How to Plot a Logistic Regression Curve in R Simple model.
• Plotting the results of your logistic regression Part 1: Continuous by categorical interaction. Multivariate model.
• Interpret Logistic Regression Coefficients (For Beginners).
• Increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by exp(β).
• exp^β is the odds ratio that associates smoking to the risk of heart disease.
• If exp^β = exp^0.38 = 1.46, the smoking group has a 1.46 times the odds of the non-smoking group of having heart disease.
• The smoking group has 46% (1.46 – 1 = 0.46) more odds of having heart disease than the non-smoking group.
• If β = – 0.38, then exp^β = 0.68 and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the relative risk of heart disease.
• If β = 0, then exp^β =1, the smoking group has the same odds as the non-smoking group of having heart disease.
• How to interpret the intercept? If the intercept has a negative sign: then the probability of having the outcome will be < 0.5.
• A Simple Interpretation of Logistic Regression Coefficients. logit(p) = log-odds ratio.
• A 1 unit increase in X₁ will result in b increase in the log-odds ratio of success : failure.
• In other words, if exp(b)=1.14, it means increasing studying hour by 1 unit will have a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed) where p = the probability of passing an exam.
• How to Interpret Logistic Regression Coefficients
• Generalized Linear Models, Part I: The Logistic Model, odds ratio, hypothesis testing, change of the reference factor
• Logistic regression log odds log(p/(1-p)) = beta * X
• logit function f(x) = logit(x) = 1 / (1+exp(-x)) and we model the response Y by f(b0 + b1*x).
• Interpretation: consider X=(intercept, x), beta = (beta0, beta1) and assume x = 0/1. Then logit(beta0) is the percentage of positive results in Y when x = 0, and logit(beta0 + beta1) is the percentage of positive results in Y when x =1. Again, exp(beta1) is the odds ratio. The probabilities can be predicted by using the formula 1 / (1 + exp (-(b0 + b1*x)) )

# Quantile regression

library(quantreg)
set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, mean = 0, sd = 2)
y[1:10] <- y[1:10] + 10 # first 10 are outliers

# Fit a traditional linear regression model
fit_lm <- lm(y ~ x)

# Fit a quantile regression model for the 50th percentile
fit_qr <- rq(y ~ x, tau = 0.5)

fit_lm$coefficients # (Intercept) x # 0.7961233 0.8759269 fit_qr$coefficients
# (Intercept)           x
#   0.0772395   1.0058387

plot(x, y)
points(x[1:10], y[1:10], col='red', pch=16)
abline(fit_lm)
abline(fit_qr, col = 'blue')


# Piecewise linear regression

• Trajectory (related to "time")

# Causal inference

• The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
Outcome \displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }
Causal effect (unobserved) \displaystyle{ \begin{align} \tau = E(Y(1) -Y(0)) \end{align} }
where $\displaystyle{ E[Y(1)] }$ referred to the expected outcome in the hypothetical situation that everyone in the population was assigned to treatment, $\displaystyle{ E[Y|T=1] }$ refers to the expected outcome for all individuals in the population who are actually assigned to treatment... The key is that the value of $\displaystyle{ E[Y|T=1]−E[Y|T=0] }$ is only equal to the causal effect, $\displaystyle{ E[Y(1)−Y(0)] }$ if there are no confounders present.
Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights.
• A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.