T-test

From 太極
Revision as of 21:19, 23 August 2019 by Brb (talk | contribs) (Created page with "= T-statistic = Let <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> be an estimator of parameter ''β'' in some statistical model. Then a '''''t''-stat...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

T-statistic

Let [math]\displaystyle{ \scriptstyle\hat\beta }[/math] be an estimator of parameter β in some statistical model. Then a t-statistic for this parameter is any quantity of the form

[math]\displaystyle{ t_{\hat{\beta}} = \frac{\hat\beta - \beta_0}{\mathrm{s.e.}(\hat\beta)}, }[/math]

where β0 is a non-random, known constant, and [math]\displaystyle{ \scriptstyle s.e.(\hat\beta) }[/math] is the standard error of the estimator [math]\displaystyle{ \scriptstyle\hat\beta }[/math].

Two sample test assuming equal variance

The t statistic (df = [math]\displaystyle{ n_1 + n_2 - 2 }[/math]) to test whether the means are different can be calculated as follows:

[math]\displaystyle{ t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} }[/math]

where

[math]\displaystyle{ s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}. }[/math]

[math]\displaystyle{ s_{X_1X_2} }[/math] is an estimator of the common/pooled standard deviation of the two samples. The square-root of a pooled variance estimator is known as a pooled standard deviation.

  • Pooled variance from Wikipedia
  • The pooled sample variance is an unbiased estimator of the common variance if Xi and Yi are following the normal distribution.
  • (From minitab) The pooled standard deviation is the average spread of all data points about their group mean (not the overall mean). It is a weighted average of each group's standard deviation. The weighting gives larger groups a proportionally greater effect on the overall estimate.
  • Type I error rates in two-sample t-test by simulation

Two sample test assuming unequal variance

The t statistic (Behrens-Welch test statistic) to test whether the population means are different is calculated as:

[math]\displaystyle{ t = {\overline{X}_1 - \overline{X}_2 \over s_{\overline{X}_1 - \overline{X}_2}} }[/math]

where

[math]\displaystyle{ s_{\overline{X}_1 - \overline{X}_2} = \sqrt{{s_1^2 \over n_1} + {s_2^2 \over n_2}}. }[/math]

Here s2 is the unbiased estimator of the variance of the two samples.

The degrees of freedom is evaluated using the Satterthwaite's approximation

[math]\displaystyle{ df = { ({s_1^2 \over n_1} + {s_2^2 \over n_2})^2 \over {({s_1^2 \over n_1})^2 \over n_1-1} + {({s_2^2 \over n_2})^2 \over n_2-1} }. }[/math]

Paired test

Have you ever asked yourself, "how should I approach the classic pre-post analysis?"

Z-value/Z-score

If the population parameters are known, then rather than computing the t-statistic, one can compute the z-score.

Nonparametric test: Wilcoxon rank sum test

Sensitive to differences in location

Nonparametric test: Kolmogorov-Smirnov test

Sensitive to difference in shape and location of the distribution functions of two groups

Limma: Empirical Bayes method

  • Some Bioconductor packages: limma, RnBeads, IMA, minfi packages.
  • The moderated T-statistics used in Limma is defined on Limma's user guide.
  • Diagram of usage ?makeContrasts, ?contrasts.fit, ?eBayes
               lmFit        contrasts.fit           eBayes       topTable
            x ------> fit ------------------> fit2  -----> fit2  --------->
                       ^                       ^
                       |                       |
        model.matrix   |    makeContrasts      |
    class ---------> design ----------> contrasts
  • Examples of contrasts (search contrasts.fit and/or model.matrix from the user guide)
    # Ex 1 (Single channel design):
    design <- model.matrix(~ 0+factor(c(1,1,1,2,2,3,3,3))) # number of samples x number of groups
    colnames(design) <- c("group1", "group2", "group3")
    fit <- lmFit(eset, design)
    contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, 
                                     levels=design)        # number of groups x number of contrasts
    fit2 <- contrasts.fit(fit, contrast.matrix)
    fit2 <- eBayes(fit2)
    topTable(fit2, coef=1, adjust="BH")
    topTable(fit2, coef=1, sort = "none", n = Inf, adjust="BH")$adj.P.Val
    
    # Ex 2 (Common reference design):
    targets <- readTargets("runxtargets.txt")
    design <- modelMatrix(targets, ref="EGFP")
    contrast.matrix <- makeContrasts(AML1,CBFb,AML1.CBFb,AML1.CBFb-AML1,AML1.CBFb-CBFb,
                                     levels=design)
    fit <- lmFit(MA, design)
    fit2 <- contrasts.fit(fit, contrasts.matrix)
    fit2 <- eBayes(fit2)
    
    # Ex 3 (Direct two-color design):
    design <- modelMatrix(targets, ref="CD4")
    contrast.matrix <- cbind("CD8-CD4"=c(1,0),"DN-CD4"=c(0,1),"CD8-DN"=c(1,-1))
    rownames(contrast.matrix) <- colnames(design)
    fit <- lmFit(eset, design)
    fit2 <- contrasts.fit(fit, contrast.matrix)
    
    # Ex 4 (Single channel + Two groups):
    fit <- lmFit(eset, design)
    cont.matrix <- makeContrasts(MUvsWT=MU-WT, levels=design)
    fit2 <- contrasts.fit(fit, cont.matrix)
    fit2 <- eBayes(fit2)
    
    # Ex 5 (Single channel + Several groups):
    f <- factor(targets$Target, levels=c("RNA1","RNA2","RNA3"))
    design <- model.matrix(~0+f)
    colnames(design) <- c("RNA1","RNA2","RNA3")
    fit <- lmFit(eset, design)
    contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1,
                                     levels=design)
    fit2 <- contrasts.fit(fit, contrast.matrix)
    fit2 <- eBayes(fit2)
    
    # Ex 6 (Single channel + Interaction models 2x2 Factorial Designs) :
    cont.matrix <- makeContrasts(
      SvsUinWT=WT.S-WT.U,
      SvsUinMu=Mu.S-Mu.U,
      Diff=(Mu.S-Mu.U)-(WT.S-WT.U),
      levels=design)
    fit2 <- contrasts.fit(fit, cont.matrix)
    fit2 <- eBayes(fit2)
  • Example from user guide 17.3 (Mammary progenitor cell populations)
    setwd("~/Downloads/IlluminaCaseStudy")
    url <- c("http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/probe%20profile.txt.gz",
      "http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/control%20probe%20profile.txt.gz",
      "http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/Targets.txt")
    for(i in url)  system(paste("wget ", i))
    system("gunzip probe%20profile.txt.gz")
    system("gunzip control%20probe%20profile.txt.gz")
    
    source("http://www.bioconductor.org/biocLite.R")
    biocLite("limma")
    biocLite("statmod")
    library(limma)
    targets <- readTargets()
    targets
    
    x <- read.ilmn(files="probe profile.txt",ctrlfiles="control probe profile.txt",
      other.columns="Detection")
    options(digits=3)
    head(x$E)
    boxplot(log2(x$E),range=0,ylab="log2 intensity")
    y <- neqc(x)
    dim(y)
    expressed <- rowSums(y$other$Detection < 0.05) >= 3
    y <- y[expressed,]
    dim(y) # 24691 12
    plotMDS(y,labels=targets$CellType)
    
    ct <- factor(targets$CellType)
    design <- model.matrix(~0+ct)
    colnames(design) <- levels(ct)
    dupcor <- duplicateCorrelation(y,design,block=targets$Donor) # need statmod
    dupcor$consensus.correlation
    
    fit <- lmFit(y, design, block=targets$Donor, correlation=dupcor$consensus.correlation)
    contrasts <- makeContrasts(ML-MS, LP-MS, ML-LP, levels=design)
    fit2 <- contrasts.fit(fit, contrasts)
    fit2 <- eBayes(fit2, trend=TRUE)
    summary(decideTests(fit2, method="global"))
    topTable(fit2, coef=1) # Top ten differentially expressed probes between ML and MS
    #                 SYMBOL TargetID logFC AveExpr     t  P.Value adj.P.Val    B
    # ILMN_1766707     IL17B     <NA> -4.19    5.94 -29.0 2.51e-12  5.19e-08 18.1
    # ILMN_1706051      PLD5     <NA> -4.00    5.67 -27.8 4.20e-12  5.19e-08 17.7
    # ...
    tT <- topTable(fit2, coef=1, number = Inf)
    dim(tT)
    # [1] 24691     8
  • Three groups comparison (What is the difference of A vs Other AND A vs (B+C)/2?). Contrasts comparing one factor to multiple others
    library(limma)
    set.seed(1234)
    n <- 100
    testexpr <- matrix(rnorm(n * 10, 5, 1), nc= 10)
    testexpr[, 6:7] <- testexpr[, 6:7] + 7  # mean is 12
    
    design1 <- model.matrix(~ 0 + as.factor(c(rep(1,5),2,2,3,3,3)))
    design2 <- matrix(c(rep(1,5),rep(0,5),rep(0,5),rep(1,5)),ncol=2)
    colnames(design1) <- LETTERS[1:3]
    colnames(design2) <- c("A", "Other")
    
    fit1 <- lmFit(testexpr,design1)
    contrasts.matrix1 <- makeContrasts("AvsOther"=A-(B+C)/2, levels = design1)
    fit1 <- eBayes(contrasts.fit(fit1,contrasts=contrasts.matrix1))
    
    fit2 <- lmFit(testexpr,design2)
    contrasts.matrix2 <- makeContrasts("AvsOther"=A-Other, levels = design2)
    fit2 <- eBayes(contrasts.fit(fit2,contrasts=contrasts.matrix2))
    
    t1 <- topTable(fit1,coef=1, number = Inf)
    t2 <- topTable(fit2,coef=1, number = Inf)
    
    rbind(head(t1, 3), tail(t1, 3))
    #        logFC  AveExpr         t      P.Value    adj.P.Val         B
    # 92 -5.293932 5.810926 -8.200138 1.147084e-15 1.147084e-13 26.335702
    # 81 -5.045682 5.949507 -7.815607 2.009706e-14 1.004853e-12 23.334600
    # 37 -4.720906 6.182821 -7.312539 7.186627e-13 2.395542e-11 19.625964
    # 27 -2.127055 6.854324 -3.294744 1.034742e-03 1.055859e-03 -1.141991
    # 86 -1.938148 7.153142 -3.002133 2.776390e-03 2.804434e-03 -2.039869
    # 75 -1.876490 6.516004 -2.906626 3.768951e-03 3.768951e-03 -2.314869
    rbind(head(t2, 3), tail(t2, 3))
    #        logFC  AveExpr          t    P.Value adj.P.Val         B
    # 92 -4.518551 5.810926 -2.5022436 0.01253944 0.2367295 -4.587080
    # 81 -4.500503 5.949507 -2.4922492 0.01289503 0.2367295 -4.587156
    # 37 -4.111158 6.182821 -2.2766414 0.02307100 0.2367295 -4.588728
    # 27 -1.496546 6.854324 -0.8287440 0.40749644 0.4158127 -4.595601
    # 86 -1.341607 7.153142 -0.7429435 0.45773401 0.4623576 -4.595807
    # 75 -1.171366 6.516004 -0.6486690 0.51673851 0.5167385 -4.596008
    
    var(as.numeric(testexpr[, 6:10]))
    # [1] 12.38074
    var(as.numeric(testexpr[, 6:7]))
    # [1] 0.8501378
    var(as.numeric(testexpr[, 8:10]))
    # [1] 0.9640699
    As we can see the p-values returned from the first contrast are very small (large mean but small variance) but the p-values returned from the 2nd contrast are large (still large mean but very large variance). The variance from the "Other" group can be calculated from a mixture distribution ( pdf = .4 N(12, 1) + .6 N(5, 1), VarY = E(Y^2) - (EY)^2 where E(Y^2) = .4 (VarX1 + (EX1)^2) + .6 (VarX2 + (EX2)^2) = 73.6 and EY = .4 * 12 + .6 * 5 = 7.8; so VarY = 73.6 - 7.8^2 = 12.76).
  • Correct assumptions of using limma moderated t-test and the paper Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies.
    • Evaluation: statistical power (figure 3, 4, 5), false-positive rate (table 2), execution time and ease of use (table 3)
    • Limma presents several advantages
    • RVM inflates the expected number of false-positives when sample size is small. On the other hand the, RVM is very close to Limma from either their formulas (p3 of the supporting info) or the Hierarchical clustering (figure 2) of two examples.
    • Slides
  • Use Limma to run ordinary T tests
    # where 'fit' is the output from lmFit() or contrasts.fit().
    unmod.t <- fit$coefficients/fit$stdev.unscaled/fit$sigma
    pval <- 2*pt(-abs(unmod.t), fit$df.residual)
    
    # Following the above example
    t.test(testexpr[1, 1:5], testexpr[1, 6:10], var.equal = T)
    # Two Sample t-test
    # 
    # data:  testexpr[1, 1:5] and testexpr[1, 6:10]
    # t = -1.2404, df = 8, p-value = 0.25
    # alternative hypothesis: true difference in means is not equal to 0
    # 95 percent confidence interval:
    #   -7.987791  2.400082
    # sample estimates:
    #   mean of x mean of y 
    # 4.577183  7.371037 
    fit2$coefficients[1] / (fit2$stdev.unscaled[1] * fit2$sigma[1]) # Ordinary t-statistic
    # [1] -1.240416
    fit2$coefficients[1] / (fit2$stdev.unscaled[1] * sqrt(fit2$s2.post[1])) # moderated t-statistic
    # [1] -1.547156
    topTable(fit2,coef=1, sort.by = "none")[1,]
    #       logFC  AveExpr         t   P.Value adj.P.Val         B
    # 1 -2.793855 5.974110 -1.547156 0.1222210 0.2367295 -4.592992
    
    # Square root of the pooled variance
    fit2$sigma[1]
    # [1] 3.561284
    (((5-1)*var(testexpr[1, 1:5]) + (5-1)*var(testexpr[1, 6:10]))/(5+5-2)) %>% sqrt()
    # [1] 3.561284
  • Comparison of ordinary T-statistic, RVM T-statistic and Limma/eBayes moderated T-statistic.
Test statistic for gene g
Ordinary T-test [math]\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{Pooled}/\sqrt{1/n_1 + 1/n_2}} }[/math] [math]\displaystyle{ (S_g^{Pooled})^2 = \frac{(n_1-1)S_{g1}^2 + (n_2-1)S_{g2}^2}{n1+n2-2} }[/math]
RVM [math]\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{RVM}/\sqrt{1/n_1 + 1/n_2}} }[/math] [math]\displaystyle{ (S_g^{RVM})^2 = \frac{(n_1+n_2-2)S_{g}^2 + 2*a*(a*b)^{-1}}{n1+n2-2+2*a} }[/math]
Limma [math]\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{Limma}/\sqrt{1/n_1 + 1/n_2}} }[/math] [math]\displaystyle{ (S_g^{Limma})^2 = \frac{d_0 S_0^2 + d_g S_g^2}{d_0 + d_g} }[/math]
  • In Limma,
    • [math]\displaystyle{ \sigma_g^2 }[/math] assumes an inverse Chi-square distribution with mean [math]\displaystyle{ S_0^2 }[/math] and [math]\displaystyle{ d_0 }[/math] degrees of freedom
    • [math]\displaystyle{ d_0 }[/math] (fit$df.prior) and [math]\displaystyle{ d_g }[/math] are, respectively, prior and residual/empirical degrees of freedom.
    • [math]\displaystyle{ S_0^2 }[/math] (fit$s2.prior) is the prior distribution and [math]\displaystyle{ S_g^2 }[/math] is the pooled variance.
    • [math]\displaystyle{ (S_g^{Limma})^2 }[/math] can be obtained from fit$s2.post.
  • Empirical Bayes estimation of normal means, accounting for uncertainty in estimated standard errors Lu 2019