Bootstrap

= General =
* [https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29 Bootstrap] from Wikipedia.
** This contains an overview of different methods for computing bootstrap confidence intervals.
** [https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/boot.ci boot.ci()] from the 'boot' package provides a short explanation for different methods for computing bootstrap confidence intervals.
** [https://si.biostat.washington.edu/sites/default/files/modules/2017_sisg_1_9_v3.pdf Typically] not useful for correlated (dependent) data.
* [https://github.com/jtleek/slipper Bootstrapping made easy and tidy with slipper]
* [https://cran.r-project.org/web/packages/bootstrap/ bootstrap] package. "An Introduction to the Bootstrap" by B. Efron and R. Tibshirani, 1993
* [https://cran.r-project.org/web/packages/boot/ boot] package. Functions and datasets for bootstrapping from the book [https://books.google.com/books?id=_uKcAgAAQBAJ Bootstrap Methods and Their Application] by A. C. Davison and D. V. Hinkley (1997, CUP). Short course material can be found [https://www.researchgate.net/publication/37434447_Bootstrap_Methods_and_Their_Application here]. The main functions are '''boot()''' and '''boot.ci()'''.
** https://www.rdocumentation.org/packages/boot/versions/1.3-20
** [https://intobioinformatics.wordpress.com/2018/12/29/part-5-corrections-to-optimism-corrected-bootstrapping-series-but-it-still-is-problematic/ Part 5]: Code corrections to optimism corrected bootstrapping series
* [https://youtu.be/N4ZQQqyIf6k Bootstrapping Part 2: Calculating p-values!!!] from StatQuest
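Several of the links above compare methods for computing bootstrap confidence intervals. As a minimal sketch (not taken from any of the linked pages), '''boot.ci()''' can compute several interval types at once for a simple statistic such as the mean:

```r
library(boot)

set.seed(42)
x <- rexp(50)                         # a skewed sample
mean_fun <- function(d, i) mean(d[i]) # statistic takes the data and an index vector

b <- boot(x, statistic = mean_fun, R = 2000)
ci <- boot.ci(b, conf = 0.95, type = c("norm", "basic", "perc", "bca"))

# each type corrects differently for bias and skewness; on skewed data
# the percentile and BCa endpoints are not symmetric around the estimate
ci$percent[4:5] # percentile interval endpoints
ci$bca[4:5]     # BCa interval endpoints
```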
== rsample package ==
* [https://rtichoke.netlify.app/post/assessing_score_reliability/ Using bootstrapped sampling to assess variability in score predictions]. The [https://cran.r-project.org/web/packages/rsample/index.html rsample] (General Resampling Infrastructure) package was used.  
* [https://moderndive.com/8-confidence-intervals.html Chapter 8 Bootstrapping and Confidence Intervals] from the ebook "Statistical Inference via Data Science"
* [https://www.r-bloggers.com/2024/07/bootstrap-confidence-intervals-exports-in-japan/ Bootstrap Confidence Intervals: Exports in Japan]


== Nonparametric bootstrap ==
* This is the most common bootstrap method.
* The upstrap, Crainiceanu & Crainiceanu, Biostatistics 2018


== Parametric bootstrap ==
<ul>
<li>Parametric bootstraps resample a known distribution function, whose parameters are estimated from your sample.
<li>In small samples, a parametric bootstrap approach might be preferred, according to [https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29#Types_of_bootstrap_scheme Wikipedia].
<li>http://www.math.ntu.edu.tw/~hchen/teaching/LargeSample/notes/notebootstrap.pdf#page=3 (no package is used)
<li>[http://influentialpoints.com/Training/nonparametric-or-parametric_bootstrap.htm A parametric or non-parametric bootstrap?]
<li>https://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf#page=11
<li>[https://bioconductor.org/packages/release/bioc/vignettes/simulatorZ/inst/doc/simulatorZ-vignette.pdf simulatorZ] Bioc package
<li>[https://medium.com/@statisticianonline/bootstrap-r-tutorial-learn-about-parametric-and-nonparametric-bootstrap-through-simulation-f6be4d4fc532 Bootstrap R tutorial: Learn about parametric and non-parametric bootstrap through simulation]
{{Pre}}
df <- scan(textConnection("3331.608 3549.913 2809.252 2671.465 3383.177 3945.721 3672.601 3922.416 4647.278 4088.246 3718.874 3724.443 3925.457 3112.920 3495.621 3651.779 3240.194 3867.347 3431.015 4163.725"), sep = " ")
sd(df)
# [1] 464.5509
mean(df)
# [1] 3617.653
set.seed(1)
theta_star <- vector()
for (i in 1:1000){ theta_star[i] <- mean(rnorm(length(df),mean = 3617.7 ,sd = 464.6)) }
theta_boot <- mean(theta_star) # bootstrap estimate of theta_hat theta_boot
theta_boot
# [1] 3615.208
boot_se <- sd(theta_star) # bootstrap standard error of the estimate
boot_se
# [1] 100.106
 
bias <- theta_boot - mean(df)
bias
# -2.444507
CI <- c(theta_boot - 1.96 * boot_se, theta_boot + 1.96 * boot_se)
CI
# 3419.000 3811.416
</pre>
Using the '''boot''' package
{{Pre}}
library(boot)

# ran.gen: tells boot() how to generate random samples under the fitted model
gen_function <- function(x, mle) {
  # Input:
  #  x: observed data
  #  mle: parameter estimates
  data <- rnorm(length(x),mle[1],mle[2])
  return(data)
}
 
# function to calculate sample statistic
theta_star_function <- function(x,i) mean(x[i])
 
set.seed(1)
B <- boot(data = df,
          sim = "parametric",
          ran.gen = gen_function,
          mle = c(3617.7,464.6),
          statistic = theta_star_function,
          R=1000)
B
# PARAMETRIC BOOTSTRAP
# Call:
# boot(data = df, statistic = theta_star_function, R = 1000, sim = "parametric",
#    ran.gen = gen_function, mle = c(3617.7, 464.6))
#
# Bootstrap Statistics :
#    original    bias    std. error
# t1* 3617.653 -2.444507    100.106
plot(B)
</pre>
</ul>
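For contrast with the parametric example above, a nonparametric version of the same calculation resamples the observed values themselves rather than drawing from a fitted normal. This sketch repeats the data definition so the block is self-contained; the bootstrap SE should land close to the parametric value of about 100:

```r
# nonparametric bootstrap of the mean: resample the observed values themselves
df <- scan(textConnection("3331.608 3549.913 2809.252 2671.465 3383.177 3945.721 3672.601 3922.416 4647.278 4088.246 3718.874 3724.443 3925.457 3112.920 3495.621 3651.779 3240.194 3867.347 3431.015 4163.725"), sep = " ")

set.seed(1)
theta_star <- replicate(1000, mean(sample(df, replace = TRUE)))

sd(theta_star) # nonparametric bootstrap SE of the mean
c(mean(theta_star) - 1.96 * sd(theta_star),
  mean(theta_star) + 1.96 * sd(theta_star)) # normal-approximation 95% CI
```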


= Examples =
== Standard error ==
<ul>
<li>Standard error from a mean
{{Pre}}
foo <- function() mean(sample(x, replace = TRUE))
set.seed(1234)
x <- rnorm(300)
set.seed(1)
sd(replicate(10000, foo()))
# [1] 0.05717679
sd(x)/sqrt(length(x)) # the SE of the mean is s/sqrt(n)
# [1] 0.05798401

set.seed(1234)
x <- rpois(300, 2)
set.seed(1)
sd(replicate(10000, foo()))
# [1] 0.08038607
sd(x)/sqrt(length(x)) # the SE of the mean is s/sqrt(n)
# [1] 0.08183151
</pre>
<li>Difference of means from two samples (cf 8.3 The two-sample problem from the book "An Introduction to the Bootstrap" by Efron & Tibshirani)
<syntaxhighlight lang="rsplus">
# Define the two samples
sample1 <- 1:10
sample2 <- 11:20

# Number of bootstrap replicates
nboot <- 100000

# Initialize a vector to store the bootstrap estimates
boot_estimates <- numeric(nboot)

# Run the bootstrap
set.seed(123)
for (i in seq_len(nboot)) {
  # Resample each sample with replacement
  resample1 <- sample(sample1, replace = TRUE)
  resample2 <- sample(sample2, replace = TRUE)

  # Compute the difference of means
  boot_estimates[i] <- mean(resample1) - mean(resample2)
}

# Standard error of the bootstrap estimates
se_boot <- sd(boot_estimates)
cat("Bootstrap SE estimate of difference of means:", se_boot, "\n")
# 1.283541

# Compare with the analytic standard error of the difference of means
sd1 <- sd(sample1); sd2 <- sd(sample2)
n1 <- length(sample1); n2 <- length(sample2)
se_true <- sqrt((sd1^2/n1) + (sd2^2/n2))
cat("True SE of difference of means:", se_true, "\n")
# 1.354006
</syntaxhighlight>
<li>Difference of means from two samples, another example. The bootstrap estimate of the standard error (SE) is often more conservative (larger) than the analytic SE. See [http://www.econ.ucla.edu/liao/papers_pdf/boot_var.pdf Bootstrap Standard Error Estimates and Inference].
<syntaxhighlight lang="rsplus">
library(boot)
# Set a seed for reproducibility
set.seed(123)
# Generate two samples from normal distribution
sample1 <- rnorm(100, mean = 0, sd = 1)
sample2 <- rnorm(100, mean = 1, sd = 1)
# Define a function to compute the difference in means
diff_in_means <- function(data, indices) {
  sample1 <- data$sample1[indices]
  sample2 <- data$sample2[indices]
  # Return the difference in means
  return(mean(sample1) - mean(sample2))
}
# Combine the samples into a list
samples <- list(sample1 = sample1, sample2 = sample2)
# Use the boot() function to perform bootstrap
results <- boot(data = samples, statistic = diff_in_means, R = 1000)
# Print the bootstrap results
print(results)
# Bootstrap Statistics :
#      original      bias    std. error
# t1* -1.168565 -0.001910976  0.2193785
# Compute the true standard error
true_se <- sqrt(var(sample1)/length(sample1) + var(sample2)/length(sample2))
# Print the true standard error
print(paste("True SE: ", true_se))
# [1] "True SE:  0.132977288582124"
</syntaxhighlight>
<li>[https://si.biostat.washington.edu/sites/default/files/modules/2017_sisg_1_9_v3.pdf SE of the sample median]
<pre>
x <- c(1, 5, 8, 3, 7)
median(x) # 5
set.seed(1)
foo <- function(x) median(sample(x, replace = T))
y <- replicate(100, foo(x=x))
sd(y)
# 1.917595
c(median(x)-1.96*sd(y), median(x)+1.96*sd(y))
# [1] 1.241513 8.758487
</pre>
<li>[https://www.statology.org/bootstrapping-in-r/ How to Perform Bootstrapping in R (With Examples)]. boot package was used to estimate the SE of R-squared.
<li>[https://si.biostat.washington.edu/sites/default/files/modules/2017_sisg_1_9_v3.pdf SE for sample relative risk]
</ul>
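The last bullet above links to a slide deck on the SE of a sample relative risk. A minimal sketch with made-up counts (40/100 events in an exposed group, 20/100 in an unexposed group, so RR = 2), resampling each group independently:

```r
# hypothetical data: 40/100 events exposed, 20/100 unexposed (RR = 2)
exposed   <- c(rep(1, 40), rep(0, 60))
unexposed <- c(rep(1, 20), rep(0, 80))

set.seed(1)
rr_star <- replicate(2000, {
  mean(sample(exposed, replace = TRUE)) / mean(sample(unexposed, replace = TRUE))
})

sd(rr_star)      # bootstrap SE of the relative risk
sd(log(rr_star)) # in practice the SE is often computed on the log scale
```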


== Bootstrapping Extreme Value Estimators ==
Bootstrapping Extreme Value Estimators, de Haan, 2022

== R-squared from a regression ==
R in Action, Nonparametric bootstrapping
{{Pre}}
library(boot)

# Compute the bootstrapped 95% confidence interval for R-squared in a linear regression
rsq <- function(data, indices, formula) {
  d <- data[indices, ] # allows boot() to select a bootstrap sample
  fit <- lm(formula, data = d)
  return(summary(fit)$r.square)
} # passing 'formula' through boot() is optional, depending on the problem

# bootstrapping with 1000 replications
set.seed(1234)
bootobject <- boot(data = mtcars, statistic = rsq, R = 1000,
                   formula = mpg ~ wt + disp)
plot(bootobject) # or plot(bootobject, index = 1) if we have multiple statistics
ci <- boot.ci(bootobject, conf = .95, type = c("perc", "bca"))
    # default type is "all", which contains c("norm", "basic", "stud", "perc", "bca").
    # 'bca' (Bias Corrected and Accelerated) by Efron 1987 uses
    # percentiles, but adjusted to account for bias and skewness.
# Level     Percentile            BCa
# 95%   ( 0.6838,  0.8833 )   ( 0.6344,  0.8549 )
# Calculations and Intervals on Original Scale
# Some BCa intervals may be unstable
ci$bca[4:5]
# [1] 0.6343589 0.8549305
# the midpoints of the two intervals differ
mean(c(0.6838, 0.8833))
# [1] 0.78355
mean(c(0.6344, 0.8549))
# [1] 0.74465
summary(lm(mpg ~ wt + disp, data = mtcars))$r.square
# [1] 0.7809306
</pre>

== Coefficients in linear regression ==

== Coefficients in a multiple linear regression ==
== Correlation ==
[https://stats.oarc.ucla.edu/r/faq/how-can-i-generate-bootstrap-statistics-in-r/ A Practical Guide to Bootstrap in R]
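The guide above generates bootstrap statistics with the '''boot''' package. As a minimal sketch (using the built-in mtcars data, not an example from the guide), bootstrapping a Pearson correlation resamples rows jointly so the pairing between the two variables is preserved:

```r
library(boot)

# resample rows jointly so the (mpg, wt) pairing is preserved
cor_fun <- function(data, i) cor(data$mpg[i], data$wt[i])

set.seed(1)
b <- boot(mtcars, statistic = cor_fun, R = 2000)
b                         # prints the bootstrap SE of the correlation
boot.ci(b, type = "perc") # percentile confidence interval
```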
== Stability of clinical prediction models ==
[https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.202200302 Stability of clinical prediction models developed using statistical or machine learning methods] 2023
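The paper above studies how much a fitted prediction model changes across bootstrap resamples. As an illustrative sketch on simulated data (not the paper's exact procedure), one can refit a logistic model on each resample and summarize how much each individual's predicted risk varies:

```r
# simulate a small binary-outcome dataset
set.seed(1)
n <- 200
x <- rnorm(n)
d <- data.frame(x = x, y = rbinom(n, 1, plogis(-0.5 + x)))

# refit the model on 200 bootstrap resamples; predict for the original individuals
pred_mat <- replicate(200, {
  i <- sample(n, replace = TRUE)
  fit <- glm(y ~ x, family = binomial, data = d[i, ])
  predict(fit, newdata = d, type = "response")
})

# per-individual instability: spread of the bootstrap predictions
summary(apply(pred_mat, 1, sd))
```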

Latest revision as of 12:16, 25 July 2024
