Latest revision as of 17:54, 6 April 2024

Linear Regression

Regression Models for Data Science in R by Brian Caffo
Regression and Other Stories (book) by Andrew Gelman, Jennifer Hill, Aki Vehtari

Comic

MSE

Mean squared error (within-sample)
Mean squared prediction error (out-of-sample)
Is MSE decreasing with increasing number of explanatory variables?
Calculate (Root) Mean Squared Error in R (5 Examples)

Coefficient of determination R²

https://en.wikipedia.org/wiki/Coefficient_of_determination.
- R² is expressed as the ratio of the explained variance to the total variance.
- It is a statistical measure of how well the regression predictions approximate the real data points.
- See the wikipedia page for a list of caveats of R² including correlation does not imply causation.

Based on the data collected, my tennis ball will reach orbit by tomorrow. R2=0.85 in the following example.

Avoid R-squared to judge regression model performance
- R2 = RSS/TSS (model based)
- R2(x~y) = R2(y~x)
- R2 = Pearson correlation^2 (not model based)
summary(lm())$r.squared. See Extract R-square value with R in linear models
How to interpret root mean squared error (RMSE) vs standard deviation? 𝑅² value.
coefficient of determination R^2 (can be negative?)
- The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation 2021
The R-squared and nonlinear regression: a difficult marriage?
How R2 and RMSE are calculated in cross-validation of pls R
How to Interpret R Squared and Goodness of Fit in Regression Analysis
- Limitations of R-squared: R-squared does not inform if the regression model has an adequate fit or not. It can be arbitrarily low when the model is completely correct. See How to make R-squared useless.
- Low R-squared and High R-squared values. A regression model with high R2 value can lead to – as the statisticians call it – specification bias. See an example.
Five Reasons Why Your R-squared can be Too High
- R-squared is a biased estimate. R-squared estimates tend to be greater than the correct population value.
- Overfitting your model. This problem occurs when the model is too complex.

library(ggplot2)
set.seed(123)
x <- 1:100
y <- x + rnorm(100, sd = 10)
model1 <- lm(y ~ x)
summary(model1)$r.squared  # 0.914

# Now, let's add some noise variables
noise <- matrix(rnorm(100*1000), ncol = 1000)
model2 <- lm(y ~ x + noise)
summary(model2)$r.squared  # 1

- Data mining and chance correlations. Multiple hypotheses.
- Trends in Panel (Time Series) Data
- Form of a Variable - include a different form of the same variable for both the dependent variable and an independent variable.
Relationship between R squared and Pearson correlation coefficient
What is the difference between Pearson R and Simple Linear Regression?
R² and MSE

[math]\displaystyle{ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ &= 1 - \frac{MSE}{Var(y)} \end{align} }[/math]

lasso/glmnet
Beware of R2: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models Alexander 2015. Golbraikh and Tropsha which identified the inadequacy of the leave-one-out cross-validation R2 (denoted as q2 in this case) calculated on training set data as a reliable characteristic of the model predictivity.
The R-squared and nonlinear regression: a difficult marriage?

Pearson correlation and linear regression slope

[math]\displaystyle{ \begin{align} b_1 &= r \frac{S_y}{S_x} \end{align} }[/math]

where [math]\displaystyle{ S_x=\sqrt{\sum(x-\bar{x})^2} }[/math].

set.seed(1)
x <- rnorm(10); y <-rnorm(10)
coef(lm(y~x))
# (Intercept)           x
#   0.3170798  -0.5161377

cor(x, y)*sd(y)/sd(x)
# [1] -0.5161377

Different models (in R)

http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/

Factor Variables

Regression With Factor Variables

dummy.coef.lm() in R

Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.

Add Regression Line per Group to Scatterplot

How To Add Regression Line per Group to Scatterplot in ggplot2?

penguins_df %>%
  ggplot(aes(x=culmen_length_mm, 
             y=flipper_length_mm, 
             color=species))+
  geom_point()+
  geom_smooth(method="lm", se = FALSE)

model.matrix, design matrix

https://en.wikipedia.org/wiki/Design_matrix
ExploreModelMatrix: Explore design matrices interactively with R/Shiny. Paper on F1000research.
model(~A+B) will return 1 + (a-1) + (b-1) columns. See an example Batch effects and confounders.

Contrasts in linear regression

Page 147 of Modern Applied Statistics with S (4th ed)
https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.

A (sort of) Complete Guide to Contrasts in R by Rose Maier

mat

##      constant NLvMH  NvL  MvH
## [1,]        1  -0.5  0.5  0.0
## [2,]        1  -0.5 -0.5  0.0
## [3,]        1   0.5  0.0  0.5
## [4,]        1   0.5  0.0 -0.5
mat <- mat[ , -1]

model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
summary(model7)

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
## doseNLvMH      3.179      2.152   1.477  0.14215    
## doseNvL       -8.723      3.044  -2.866  0.00489 ** 
## doseMvH       13.232      3.044   4.347 2.84e-05 ***

# double check your contrasts
attributes(model7$qr$qr)$contrasts
## $dose
##      NLvMH  NvL  MvH
## None  -0.5  0.5  0.0
## Low   -0.5 -0.5  0.0
## Med    0.5  0.0  0.5
## High   0.5  0.0 -0.5

library(dplyr)
dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
dose.means
## Source: local data frame [4 x 2]
## 
##   dose   y.mean
## 1 None 112.6267
## 2  Low 121.3500
## 3  Med 126.7839
## 4 High 113.5517

# The coefficient estimate for the first contrast (3.18) equals the average of 
# the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of 
# the first two groups (112.63 + 121.35 /2 = 116.99).

Multicollinearity

A toy example

n <- 100
set.seed(1)
x1 <- rnorm(n)
e <- rnorm(n)*.01
y <- x1 + e
cor(y, e)  # 0.00966967
cor(y, x1) # 0.9999
lm(y ~ x1) |> summary()      # p<2e-16

set.seed(2)
x2 <- x1 + rnorm(n)*.1       # x2 = x1 + noise
cor(x1, x2)  # .99
lm(y ~ x1 + x2) |> summary() # x2 insig
lm(y~ x2) |> summary()       # x2 sig

set.seed(3)
x3 <- x1 + rnorm(n)*.0001    # x3 = x1 + tiny noise
cor(x1, x3) # 1
lm(y ~ x1 + x3) |> summary() # both insig. SURPRISE!
lm(y ~ x1) |> summary()

x4 <- x1                     # x4 is exactly equal to x1  
lm(y~ x1 + x4) |> summary()  # x4 coef not defined because of singularities
lm(y~ x4 + x1) |> summary()  # x1 coef not defined because of singularities

Consider lasso

fit <- cv.glmnet(x=cbind(x1, x3, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                      s1
# (Intercept) 0.002797165
# x1          0.970839175
# x3          .          
#             .          

fit <- cv.glmnet(x=cbind(x1, x4, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                       s1
# (Intercept) 2.797165e-03
# x1          9.708392e-01
# x4          6.939215e-18
#             .   

fit <- cv.glmnet(x=cbind(x4, x1, matrix(rnorm(n*10), nr=n)), y=y)
coefficients(fit, s = "lambda.min")
# 13 x 1 sparse Matrix of class "dgCMatrix"
#                      s1
# (Intercept) 2.797165e-03
# x4          9.708392e-01
# x1          6.93 9215e-18
#             .

How to Fix in R: not defined because of singularities
Multicollinearity in R
Detecting multicollinearity — it’s not that easy sometimes

alias: Find Aliases (Dependencies) In A Model

> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K

Complete :
         (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0     0     0

> options(op)

Exposure

https://en.mimi.hu/mathematics/exposure_variable.html

Independent variable = predictor = explanatory = exposure variable

Marginal effects

The marginaleffects package for R. Compute and plot adjusted predictions, contrasts, marginal effects, and marginal means for 69 classes of statistical models in R. Conduct linear and non-linear hypothesis tests using the delta method.

Confounders, confounding

What is a Confounding Variable? (Definition & Example). Confounding variable is a variable that can affect the relationship between the two variables under study. Requirements for Confounding Variables:
- It must be correlated with the independent variable.
- It must have a causal relationship with the dependent variable.
Confounders Introduction to Data Science by Irizarry
- If X and Y are correlated, we call Z a confounder if changes in Z causes changes in both X and Y.
https://en.wikipedia.org/wiki/Confounding
- A method for controlling complex confounding effects in the detection of adverse drug reactions using electronic health records. It provides a rule to identify a confounder.
http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html (R example)
Logistic Regression: Confounding and Colinearity
Identifying a confounder
- Using a change in a continuous explanatory variable to test changes in gene expression with limma. If the second model gives just as much DE as the first, then that is evidence that expression changes between visits are largely explained by the weight loss.
Is it possible to have a variable that acts as both an effect modifier and a confounder?
Which test to use to check if a possible confounder impacts a 0 / 1 result?
Addressing confounding artifacts in reconstruction of gene co-expression networks Parsana 2019
Up Your Steps to Lower Blood Pressure, Heart Study Suggests
- Over about five months, participants averaged roughly 7,500 steps per day. Those with a higher daily step count had significantly lower blood pressure.
- the researchers found that systolic blood pressure was about 0.45 points lower for every 1,000 daily steps taken
- The link between daily step count and blood pressure was no longer significant when body mass index (BMI) was taken into account, however.
Empirical economics with r (part b): confounders, proxies and sources of exogenous variations, causal effects.
No, you have not controlled for confounders
Visual Demonstration of Residual Confounding. Don't dichotomize a continuous variable.
See Randomized block design
Simulating confounders, colliders and mediators
Age is a confounder for some disease.

Confidence interval vs prediction interval

Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).

Homoscedasticity, Heteroskedasticity, Check model for (non-)constant error variance

Dealing with heteroskedasticity; regression with robust standard errors using R
performance package check_heteroscedasticity(x, ...) and check_heteroskedasticity(x, ...)
easystats: Quickly investigate model performance
Homoscedasticity in Regression Analysis
Exploring Variance Inflation Factor (VIF) in R: A Practical Guide.