Batch effect: Difference between revisions

= MultiBaC- Multiomic Batch effect Correction =
[https://www.bioconductor.org/packages/release/bioc/html/MultiBaC.html MultiBaC]
= BatchQC =
* https://www.bioconductor.org/packages/release/bioc/html/BatchQC.html
* [https://www.statology.org/explained-variance/ What is Explained Variance? (Definition & Example)]
** In an ANOVA table, the explained variance is the SS (“sum of squares”) entry in the Between Groups row.
** In a regression model, the explained variance is summarized by R-squared (the coefficient of determination), R^2 = SSR/SST, where SSR is the regression sum of squares and SST is the total sum of squares.
* Some screenshots
[[File:BatchqcSummary.png|200px]]
[[File:BatchqcVariation.png|200px]]
[[File:BatchqcDE.png|200px]]
[[File:BatchqcPCA.png|200px]]
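
The explained-variance notes above can be illustrated with a minimal base-R sketch. The expression values and batch labels are invented for illustration; this is not BatchQC's own code, only the R^2 = SSR/SST arithmetic for one hypothetical gene measured in three batches.

```r
# Toy data (made up): one gene measured in three batches
expr  <- c(5.1, 4.9, 5.3, 6.8, 7.1, 6.9, 5.9, 6.1, 6.0)
batch <- factor(rep(c("b1", "b2", "b3"), each = 3))

fit <- lm(expr ~ batch)
tab <- anova(fit)                      # rows: "batch" (between groups), "Residuals"

ss_between <- tab["batch", "Sum Sq"]   # explained (between-batch) sum of squares
ss_total   <- sum(tab[["Sum Sq"]])     # between-batch + residual sum of squares
r2         <- ss_between / ss_total    # proportion of variance explained by batch

# The same quantity as reported by the regression summary
r2_lm <- summary(fit)$r.squared
```

Here r2 and r2_lm agree, because the between-groups sum of squares of a one-way ANOVA is the regression sum of squares of the equivalent linear model.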


= TCGA =
[https://rdrr.io/bioc/TCGAbiolinks/man/TCGAbatch_Correction.html TCGAbatch_Correction()]

Revision as of 14:23, 21 June 2022

Merging two gene expression studies

ComBat

  • Statistics for Genomic Data Science (Coursera) and https://github.com/jtleek/genstats
  • Some possible batch variables: operators, runs, machines, library kits, laboratories.
  • sva::ComBat() function in the sva package from Bioconductor. The model is [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_g + X \beta_g + \gamma_{ig} + \delta_{ig} \epsilon_{ijg} \end{align} }[/math] where [math]\displaystyle{ Y_{ijg} }[/math] is the expression of gene g for sample j in batch i, [math]\displaystyle{ X=X_{ij} }[/math] consists of covariates of scientific interest (e.g. biological conditions, such as the pathway activation levels in Zhang's 2018 simulation example), and [math]\displaystyle{ \gamma_{ig} }[/math] and [math]\displaystyle{ \delta_{ig} }[/math] characterize the additive and multiplicative batch effects of batch i for gene g. The error terms, [math]\displaystyle{ \epsilon_{ijg} }[/math], are assumed to follow a normal distribution with expected value zero and variance [math]\displaystyle{ \sigma^2_g }[/math]. The batch-corrected data are [math]\displaystyle{ \begin{align} \frac{Y_{ijg} - \hat{\alpha}_g - X \hat{\beta}_g - \hat{\gamma}_{ig}}{\hat{\delta}_{ig}} + \hat{\alpha}_g + X \hat{\beta}_g. \end{align} }[/math]
  • Alternative empirical Bayes models for adjusting for batch effects in genomic studies, Zhang et al., BMC Bioinformatics 2018. The R packages are sva and BatchQC from Bioconductor.
    • Reference batch adjustment: [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_{rg} + X \beta_{rg} + \gamma_{rig} + \delta_{rig} \epsilon_{ijg} \end{align} }[/math] where [math]\displaystyle{ \alpha_{rg} }[/math] is the average gene expression in the chosen reference batch (r). Furthermore, [math]\displaystyle{ \gamma_{rig} }[/math] and [math]\displaystyle{ \delta_{rig} }[/math] represent the additive and multiplicative batch differences between the reference batch and batch i for gene g. The error terms, [math]\displaystyle{ \epsilon_{ijg} }[/math], are assumed to follow a normal distribution with expected value of zero and a reference batch variance [math]\displaystyle{ \sigma^2_{rg} }[/math].
    • Mean-only adjustment for batch effects: [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_{g} + X \beta_{g} + \gamma_{ig} + \epsilon_{ijg} \end{align} }[/math]
  • sva vignette example to remove the batch effect
    BiocManager::install(c("sva", "bladderbatch"))
    library(sva)
    library(bladderbatch)
    data(bladderdata)
    pheno = pData(bladderEset)
    edata = exprs(bladderEset)
    batch = pheno$batch
    table(pheno$cancer)
    # Biopsy Cancer Normal 
    #      9     40      8 
    table(batch)
    # batch
    #  1  2  3  4  5 
    # 11 18  4  5 19 
    
    modcombat = model.matrix(~1, data=pheno)
    combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, 
                          prior.plots=FALSE)
    # This returns an expression matrix, with the same dimensions 
    # as your original dataset (genes x samples).
    # mod: Model matrix for outcome of interest and other covariates besides batch
    # By default, it performs parametric empirical Bayesian adjustments. 
    # If you would like to use nonparametric empirical Bayesian adjustments, 
    # use the par.prior=FALSE option (this will take longer). 
    
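    # Reference-batch adjustment: batch 1 is left unchanged and the
    # other batches are adjusted towards it.
    combat_ref_edata = ComBat(dat=edata, batch=batch, ref.batch=1)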
  • ref.batch is for reference-based batch adjustment. Use the mean.only option if there is no need to adjust the variance. See the paper Alternative empirical Bayes models for adjusting for batch effects in genomic studies, Zhang 2018. In the simulated study, Figure 4 shows that reference-based ComBat clearly separates the pathway-activated samples in Batch 1 and recovers the true data pattern in Batch 2, whereas the original ComBat approach failed in both cases. In Figure 5, when genes are clustered using K-means, reference-based ComBat identifies the role of DE or control genes better than the original ComBat method. In addition to the GitHub repository with the simulation R code, BatchQC::rnaseq_sim() can also be used to simulate data.
  • Merging two gene-expression studies via cross-platform normalization by Shabalin et al., Bioinformatics 2008. This method (called Cross-Platform Normalization, XPN) was used by Ternès, Biometrical Journal 2017.
  • Batch effect removal methods for microarray gene expression data integration: a survey by Lazar et al, Bioinformatics 2012. The R package is inSilicoMerging which has been removed from Bioconductor 3.4.
  • Question: Combine hgu133a&b and hgu133plus2. Adjusting batch effects in microarray expression data using empirical Bayes methods
  • Figure S1 shows the principal component analysis (PCA) before and after batch effect correction for training and validation datasets from another paper
  • Batch effects and GC content of NGS by Michael Love
  • The troublesome batch effect (困扰的batch effect, in Chinese)
  • Some note by Mikhail Dozmorov
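
The location/scale model given for ComBat above can be sketched in base R. This is a simplified illustration and not the sva::ComBat() implementation: it assumes no covariates X, uses plain per-gene estimates with no empirical Bayes shrinkage, and runs on simulated data. The function name adjust_location_scale and its mean_only argument are hypothetical names chosen for this example.

```r
set.seed(1)
n_genes <- 50
batch   <- factor(rep(c("1", "2"), times = c(10, 12)))

# Simulated expression matrix (genes x samples); batch 2 is shifted and more variable
y <- matrix(rnorm(n_genes * length(batch)), nrow = n_genes)
in_b2 <- batch == "2"
y[, in_b2] <- 2 + 1.5 * y[, in_b2]

# Hypothetical helper: (Y - alpha - gamma) / delta + alpha, gene by gene
adjust_location_scale <- function(y, batch, mean_only = FALSE) {
  batch <- factor(batch)
  alpha <- rowMeans(y)                              # alpha_g: overall mean of gene g
  resid <- y
  for (b in levels(batch)) {
    idx   <- batch == b
    gamma <- rowMeans(y[, idx, drop = FALSE]) - alpha        # additive effect gamma_ig
    resid[, idx] <- y[, idx, drop = FALSE] - alpha - gamma
  }
  sigma <- sqrt(rowMeans(resid^2))                  # pooled per-gene sd, sigma_g
  out <- y
  for (b in levels(batch)) {
    idx   <- batch == b
    # multiplicative effect delta_ig: batch sd relative to the pooled sd
    delta <- if (mean_only) 1 else sqrt(rowMeans(resid[, idx, drop = FALSE]^2)) / sigma
    out[, idx] <- resid[, idx, drop = FALSE] / delta + alpha
  }
  out
}

adj      <- adjust_location_scale(y, batch)                    # mean and variance
adj_mean <- adjust_location_scale(y, batch, mean_only = TRUE)  # mean only
```

After the full adjustment every gene has the same mean and the same spread in both batches. With mean_only = TRUE only the means are aligned, which corresponds to the mean-only model listed above. ComBat additionally shrinks the batch-effect estimates across genes, which matters most when batches are small.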

ComBat-Seq

svaseq

Applications

DESeq2

limma::removeBatchEffect()

ComBat or removeBatchEffect() (limma package)

Batch effects: ComBat or removebatcheffects (limma package)? The conclusion to draw from this thread is that correcting for batch directly with programs like ComBat is best avoided. See Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses, Nygaard 2016 (215 citations vs 5372 citations for ComBat).

correcting the batch effects in Limma and SVA answered by Gordon Smyth.

ComBat or blocking in limma

Batch effect: ComBat or blocking in limma? The main difference is that ComBat adjusts for differences in both the mean and the variance across batches, whereas limma (I believe; Gordon, please confirm) assumes that the batch variances are equal and accounts only for mean differences across batches. So if there are large differences in batch variances, it might still be better to use ComBat. If there are no large variance differences, limma should be the best choice.
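
Blocking on batch, as discussed above, amounts to including batch as a factor in the linear model instead of modifying the data first. The sketch below shows this for a single simulated gene. It uses base R lm() as a stand-in for a limma fit, which is an assumption made to keep the example self-contained; limma fits the same kind of model for every gene and moderates the variances. The effect sizes are invented.

```r
set.seed(42)
group <- factor(rep(c("control", "treated"), times = 6))
batch <- factor(rep(c("b1", "b2", "b3"), each = 4))   # 2 control + 2 treated per batch

# Simulated gene: true treatment effect 1, additive batch shifts 0, 2 and -1
y <- 5 + 1 * (group == "treated") +
  c(0, 2, -1)[as.integer(batch)] + rnorm(12, sd = 0.3)

fit_block <- lm(y ~ group + batch)   # batch as a blocking factor
fit_naive <- lm(y ~ group)           # batch ignored

effect_block <- unname(coef(fit_block)["grouptreated"])
sigma_block  <- summary(fit_block)$sigma   # residual sd with blocking
sigma_naive  <- summary(fit_naive)$sigma   # residual sd without blocking
```

In this balanced design both models give the same estimate of the treatment effect, but the blocked model has a much smaller residual standard deviation because the batch mean shifts no longer count as noise. As the quoted answer notes, this handles only mean differences between batches and does not address unequal batch variances.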

HarmonizR, proteomic

HarmonizR enables data harmonization across independent proteomic datasets with appropriate handling of missing values

MultiBaC- Multiomic Batch effect Correction

MultiBaC

BatchQC

Screenshots: BatchqcSummary.png, BatchqcVariation.png, BatchqcDE.png, BatchqcPCA.png

TCGA

TCGAbatch_Correction()