Linear Regression

Regression Models for Data Science in R by Brian Caffo
Regression and Other Stories (book) by Andrew Gelman, Jennifer Hill, Aki Vehtari

Comic https://xkcd.com/1725/

Coefficient of determination R²

https://en.wikipedia.org/wiki/Coefficient_of_determination.
- R² is expressed as the ratio of the explained variance to the total variance.
- It is a statistical measure of how well the regression predictions approximate the real data points.
- See the wikipedia page for a list of caveats of R²
summary(lm())$r.squared. See Extract R-square value with R in linear models
coefficient of determination R^2 (can be negative?)
The R-squared and nonlinear regression: a difficult marriage?
How R2 and RMSE are calculated in cross-validation of pls R
How to Interpret R Squared and Goodness of Fit in Regression Analysis
- Limitations of R-squared: R-squared does not inform if the regression model has an adequate fit or not.
- Low R-squared and High R-squared values. A regression model with high R2 value can lead to – as the statisticians call it – specification bias. See an example.
Five Reasons Why Your R-squared can be Too High
Relationship between R squared and Pearson correlation coefficient
What is the difference between Pearson R and Simple Linear Regression?
R² and MSE

Different models (in R)

http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/

dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.

model.matrix, design matrix

https://en.wikipedia.org/wiki/Design_matrix
ExploreModelMatrix: Explore design matrices interactively with R/Shiny. Paper on F1000research.

Contrasts in linear regression

Page 147 of Modern Applied Statistics with S (4th ed)
https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.

A (sort of) Complete Guide to Contrasts in R by Rose Maier

mat

##      constant NLvMH  NvL  MvH
## [1,]        1  -0.5  0.5  0.0
## [2,]        1  -0.5 -0.5  0.0
## [3,]        1   0.5  0.0  0.5
## [4,]        1   0.5  0.0 -0.5
mat <- mat[ , -1]

model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
summary(model7)

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
## doseNLvMH      3.179      2.152   1.477  0.14215    
## doseNvL       -8.723      3.044  -2.866  0.00489 ** 
## doseMvH       13.232      3.044   4.347 2.84e-05 ***

# double check your contrasts
attributes(model7$qr$qr)$contrasts
## $dose
##      NLvMH  NvL  MvH
## None  -0.5  0.5  0.0
## Low   -0.5 -0.5  0.0
## Med    0.5  0.0  0.5
## High   0.5  0.0 -0.5

library(dplyr)
dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
dose.means
## Source: local data frame [4 x 2]
## 
##   dose   y.mean
## 1 None 112.6267
## 2  Low 121.3500
## 3  Med 126.7839
## 4 High 113.5517

# The coefficient estimate for the first contrast (3.18) equals the average of 
# the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of 
# the first two groups (112.63 + 121.35 /2 = 116.99).

Multicollinearity

Multicollinearity in R
Detecting multicollinearity — it’s not that easy sometimes
alias: Find Aliases (Dependencies) In A Model

> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K

Complete :
         (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0     0     0

> options(op)

Exposure

https://en.mimi.hu/mathematics/exposure_variable.html

Independent variable = predictor = explanatory = exposure variable

Confounders, confounding

https://en.wikipedia.org/wiki/Confounding
- A method for controlling complex confounding effects in the detection of adverse drug reactions using electronic health records. It provides a rule to identify a confounder.
http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html (R example)
Logistic Regression: Confounding and Colinearity
Identifying a confounder
Is it possible to have a variable that acts as both an effect modifier and a confounder?
Which test to use to check if a possible confounder impacts a 0 / 1 result?
Addressing confounding artifacts in reconstruction of gene co-expression networks Parsana 2019
Up Your Steps to Lower Blood Pressure, Heart Study Suggests
- Over about five months, participants averaged roughly 7,500 steps per day. Those with a higher daily step count had significantly lower blood pressure.
- the researchers found that systolic blood pressure was about 0.45 points lower for every 1,000 daily steps taken
- The link between daily step count and blood pressure was no longer significant when body mass index (BMI) was taken into account, however.
Empirical economics with r (part b): confounders, proxies and sources of exogenous variations, causal effects.
No, you have not controlled for confounders
Visual Demonstration of Residual Confounding. Don't dichotomize a continuous variable.

Causal inference

The intuition behind inverse probability weighting in causal inference*, Confounding in causal inference: what is it, and what to do about it?
Outcome [math]\displaystyle{ \begin{align} Y = T*Y(1) + (1-T)*Y(0) \end{align} }[/math]

Causal effect (unobserved) [math]\displaystyle{ \begin{align} \tau = E(Y(1) -Y(0)) \end{align} }[/math]
where [math]\displaystyle{ E[Y(1)] }[/math] referred to the expected outcome in the hypothetical situation that everyone in the population was assigned to treatment, [math]\displaystyle{ E[Y|T=1] }[/math] refers to the expected outcome for all individuals in the population who are actually assigned to treatment... The key is that the value of [math]\displaystyle{ E[Y|T=1]−E[Y|T=0] }[/math] is only equal to the causal effect, [math]\displaystyle{ E[Y(1)−Y(0)] }[/math] if there are no confounders present.
Inverse-probability weighting removes confounding by creating a “pseudo-population” in which the treatment is independent of the measured confounders... Add a larger weight to the individuals who are underrepresented in the sample and a lower weight to those who are over-represented... propensity score P(T=1|X), logistic regression, stabilized weights.
A Crash Course in Causality: Inferring Causal Effects from Observational Data (Coursera) which includes Inverse Probability of Treatment Weighting (IPTW). R packages used: tableone, ipw, sandwich, survey.

Propensity score matching
- MatchIt - Nonparametric Preprocessing for Parametric Causal Inference (R package).
- Practical Propensity Score Methods Using R (online book)
- Comparison of Propensity Score Methods and Covariate Adjustment: Evaluation in 4 Cardiovascular Studies Stuart Pocock 2017
Simple examples to understand what confounders, colliders, mediators, and moderators are and how to “control for” variables in R with regression and propensity-score matching
Statistical Rethinking (2022 Edition) by Richard McElreath. Youtube.
Regression and Other Stores (book)
Riddle: Estimate effect of x on y if you only have two noisy measures of x

Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Dealing with heteroskedasticity; regression with robust standard errors using R
performance package check_heteroscedasticity(x, ...) and check_heteroskedasticity(x, ...)
easystats: Quickly investigate model performance
Homoscedasticity in Regression Analysis

Linear regression with Map Reduce

https://freakonometrics.hypotheses.org/53269

Relationship between multiple variables

Visualizing the relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

gls from the nlme package. The errors are allowed to be correlated and/or have unequal variances.
- varClasses: varPower(), varExp(), varConstPower(), varFunc()
- summary()$varBeta (variance of coefficient estimates), summary()$sigma (error sigma)
- intervals()$coef (coefficient estimates), intervals()$varStruct (lower, est, upper of variance function)
- anova()
- 95 Prediction intervals: predict(gls, newdata, interval = "prediction", level = .95) OR predict(gls, newdata) +/ qt(0.975,n-2)*se*sqrt(1+1/n+xd/ssx) where se=sigma.param*newx^pow.param, xd=(newx-xbar)^2, pow.param = coef(glsOjb$modelStruct$varStruct).
- gls() vs. lme() in the nlme package
- How to use Generalized Least Square gls() in r. Chapter 5.2.1 (page 208) in Mixed Effects Models in S and S-Plus by Pinheiro and Bates 2000.
- https://asancpt.github.io/nlme/chapter-8.html
- The lme function by Peter Dalgaard
- http://halweb.uc3m.es/esp/Personal/personas/durban/esp/web/notes/gls.pdf

Reduced rank regression

The book Multivariate Reduced-Rank Regression by Velu, Raja & Reinsel, Gregory C.
Generalized Low Rank Models (GLRM) and h2o

Singular value decomposition

MATRIX DECOMPOSITIONS - a shiny app
Application to rank-k approximation of X, missing data imputation, relationship to PCA, relationship to eigen value decomposition, check multicollinearity, Moore-Penrose pseudoinverse, relationship to NMF, calculation of SVD by hand.

https://en.wikipedia.org/wiki/Singular_value_decomposition
- Pseudoinverse or Computing SVD and pseudoinverse
Minimum norm least-squares solution to linear equation from mathworks. Solve linear equations with infinite solutions. Underdetermined system (there are fewer equations than unknowns e.g. [math]\displaystyle{ 2x_1 +3x_2 =8 }[/math]. n=1, p=2). The geometry illustration of the problem is useful. The page also provides two ways to find the solution: one is by complete orthogonal decomposition (COD) and the other is by SVD/Moore-Penrose pseudoinverse.

Moore-Penrose matrix inverse in R

> a = matrix(c(2, 3), nr=1)
> MASS::ginv(a) * 8 
         [,1]
[1,] 1.230769
[2,] 1.846154  
# Same solution as matlab lsqminnorm(A,b)

> a %*% MASS::ginv(a)
     [,1]
[1,]    1
> a %*% MASS::ginv(a) %*% a
     [,1] [,2]
[1,]    2    3
> MASS::ginv   # view the source code

Minimal Norm Solution to the least squares problem.
The Moore-Penrose Inverse and Least Squares
Why does SVD provide the least squares and least norm solution to 𝐴𝑥=𝑏?

The Singular Value Decomposition ( SVD ) Minimum Norm Solution

Mahalanobis distance and outliers detection

Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point P and a distribution D
It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.
The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
Distance is not always what it seems

performance::check_outliers() Outliers detection (check for influential observations)

How to Calculate Mahalanobis Distance in R

set.seed(1234)
x <- matrix(rnorm(200), nc=10)
x0 <- rnorm(10)
mu <- colMeans(x)
mahalanobis(x0, colMeans(x), var(x)) # 17.76527
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu) # 17.76527

# Variance is not full rank
x <- matrix(rnorm(200), nc=20)
x0 <- rnorm(20)
mu <- colMeans(x)
t(x0-mu) %*% MASS::ginv(var(x)) %*% (x0-mu)
mahalanobis(x0, colMeans(x), var(x))
# Error in solve.default(cov, ...) : 
#   system is computationally singular: reciprocal condition number = 1.93998e-19

Type 1 error

Linear Regression And Type I Error

More Data Can Hurt for Linear Regression

Sometimes more data can hurt!

Estimating Coefficients for Variables in R

Trying to Trick Linear Regression - Estimating Coefficients for Variables in R

Intercept only model

n <- 20
set.seed(1)
x <- rnorm(n)
y <- 2*x + .5*rnorm(n)
plot(x, y)
df <- data.frame(x=x, y=y)
pred <- double(n)
for(i in 1:n) {
  fit <- lm(y ~ 1, data = df[-i, ])
  pred[i] <- predict(fit, df[i, ])
}
plot(y, pred)
cor(y, pred) # -1

Regression

Contents

Linear Regression

Coefficient of determination R²

Different models (in R)

dummy.coef.lm() in R

model.matrix, design matrix

Contrasts in linear regression

Multicollinearity

Exposure

Confounders, confounding

Causal inference

Confidence interval vs prediction interval

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Linear regression with Map Reduce

Relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

Reduced rank regression

Singular value decomposition

Mahalanobis distance and outliers detection

Type 1 error

More Data Can Hurt for Linear Regression

Estimating Coefficients for Variables in R

Intercept only model

Quantile regression

Isotonic regression

Navigation menu

Regression

Linear Regression

Coefficient of determination R2

Different models (in R)

dummy.coef.lm() in R

model.matrix, design matrix

Contrasts in linear regression

Multicollinearity

Exposure

Confounders, confounding

Causal inference

Confidence interval vs prediction interval

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Linear regression with Map Reduce

Relationship between multiple variables

Model fitting evaluation, Q-Q plot

Generalized least squares

Reduced rank regression

Singular value decomposition

Mahalanobis distance and outliers detection

Type 1 error

More Data Can Hurt for Linear Regression

Estimating Coefficients for Variables in R

Intercept only model

Quantile regression

Isotonic regression

Navigation menu

Search

Coefficient of determination R²