Batch effect: Difference between revisions

From 太極
Line 1: Line 1:
= Merging two gene expression studies, ComBat =
= Merging two gene expression studies, ComBat =
<ul>
<li>[https://www.coursera.org/lecture/statistical-genomics/module-2-overview-1-12-cbqYZ Statistics for Genomic Data Science] (Coursera) and https://github.com/jtleek/genstats
<li>Some possible batch variables: operators,  runs, machines, library kits, laboratories.
<li>[https://www.rdocumentation.org/packages/sva/versions/3.20.0/topics/ComBat sva::ComBat()] function in [http://www.bioconductor.org/packages/release/bioc/html/sva.html sva] package from Bioconductor.
<math>
<math>
\begin{align}
\begin{align}
Line 5: Line 9:
\end{align}
\end{align}
</math>
</math>
where X consists of covariates of scientific interest, while <span style="color: red"><math>\gamma_{ig}</math></span> and <span style="color: red"><math>\delta_{ig}</math></span> characterize the ''additive'' and ''multiplicative'' <span style="color: red">batch effects</span> of batch i for gene g.
where <math>X=X_{ij}</math> consists of covariates of scientific interest (e.g. pathway activation levels in the simulation example of Zhang et al. 2018), while <span style="color: red"><math>\gamma_{ig}</math></span> and <span style="color: red"><math>\delta_{ig}</math></span> characterize the ''additive'' and ''multiplicative'' <span style="color: red">batch effects</span> of batch i for gene g. The error terms, <math>\epsilon_{ijg}</math>, are assumed to follow a normal distribution with expected value zero and variance <math>\sigma^2_g</math>.


The batch corrected data is  
The batch corrected data is  
<math>
<math>
\begin{align}
\begin{align}
\frac{Y_{ijg} - \hat{\alpha_g} - X \hat{\beta_g} - \hat{\gamma_{ig}}}{\hat{\delta_{ig}}} + \hat{\alpha_g} + X \hat{\beta_g}
\frac{Y_{ijg} - \hat{\alpha_g} - X \hat{\beta_g} - \hat{\gamma_{ig}}}{\hat{\delta_{ig}}} + \hat{\alpha_g} + X \hat{\beta_g}.
\end{align}
\end{align}
</math>
</math>
 
<li>[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6 Alternative empirical Bayes models for adjusting for batch effects in genomic studies] Zhang et al. BMC Bioinformatics 2018. The R packages are '''sva''' and [http://www.bioconductor.org/packages/release/bioc/html/BatchQC.html BatchQC] from Bioconductor.
* [https://www.coursera.org/lecture/statistical-genomics/module-2-overview-1-12-cbqYZ Statistics for Genomic Data Science] (Coursera) and https://github.com/jtleek/genstats
<ul>
* Some possible batch variables: operators,  runs, machines, library kits, laboratories.
<li>'''Reference batch adjustment''':
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6 Alternative empirical Bayes models for adjusting for batch effects in genomic studies] Zhang et al. BMC Bioinformatics 2018. The R package is [http://www.bioconductor.org/packages/release/bioc/html/BatchQC.html BatchQC] from Bioconductor.
<math>
* [https://www.rdocumentation.org/packages/sva/versions/3.20.0/topics/ComBat sva::ComBat()] function in [http://www.bioconductor.org/packages/release/bioc/html/sva.html sva] package from Bioconductor.
\begin{align}
** The [https://academic.oup.com/biostatistics/article/8/1/118/252073?searchresult=1 original paper] Johnson 2007 is number 2 of [https://academic.oup.com/biostatistics/pages/highly_cited_articles highly cited articles]. Figure 1 shows the EB adjustment has the advantage of being robust to outliers in small sample sizes (batch sizes are small).
Y_{ijg} = \alpha_{rg} + X \beta_{rg} + \gamma_{rig} + \delta_{rig} \epsilon_{ijg}
** It can remove both known batch effects and other potential latent sources of variation.
\end{align}
** The tutorial includes information on (1) how to estimate the number of latent sources of variation, (2) how to apply the sva package to estimate latent variables such as batch effects, (3) how to directly remove known batch effects using the ComBat function, (4) how to perform differential expression analysis using surrogate variables either directly or with the limma package, and (5) how to apply “frozen” sva to improve prediction and clustering.
</math>
** Figure 1 shows 3 heatmaps. Each contains column annotation including Time, Treatment and Batch variables. a) No adjustment, '''b) standardize each gene within each batch''' (implemented in dChip software), c) EB batch adjustment. Note that there is no strong evidence of batch effects after adjustment in heat maps (b)–(c).
where <math>\alpha_{rg}</math> is the average gene expression in the chosen reference batch (r). Furthermore, <math>\gamma_{rig}</math> and <math>\delta_{rig}</math> represent the additive and multiplicative batch differences between the reference batch and batch i for gene g. The error terms, <math>\epsilon_{ijg}</math>, are assumed to follow a normal distribution with expected value zero and a reference batch variance <math>\sigma^2_{rg}</math>.
** [https://bmccancer.biomedcentral.com/track/pdf/10.1186/s12885-018-4546-8#page=4 Figure S1 shows the principal component analysis (PCA) before and after batch effect correction for training and validation datasets] from another paper
<li>'''Mean-only adjustment for batch effects''':
** [https://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf#page=7 Tutorial example] to remove the batch effect  
<math>
\begin{align}
Y_{ijg} = \alpha_{g} + X \beta_{g} + \gamma_{ig} + \epsilon_{ijg}
\end{align}
</math>
</ul>
<li>[https://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf#page=7 sva vignette example] to remove the batch effect
:<syntaxhighlight lang='r'>
:<syntaxhighlight lang='r'>
BiocManager::install("sva")
BiocManager::install("sva")
Line 52: Line 62:
combat_edata = ComBat(dat=edata, batch=batch, ref.batch=1)
combat_edata = ComBat(dat=edata, batch=batch, ref.batch=1)
</syntaxhighlight>
</syntaxhighlight>
* '''ref.batch''' selects reference-based batch adjustment; use the '''mean.only''' option if the variance does not need adjusting. See the paper [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6 Alternative empirical Bayes models for adjusting for batch effects in genomic studies] Zhang 2018. Figure 4 shows that, in the simulated study, reference-based ComBat clearly reveals the pathway-activated samples in Batch 1 and the true data pattern in Batch 2, whereas the original ComBat approach failed in both cases. In Figure 5, when genes are clustered using K-means, reference-based ComBat identifies the role of DE or control genes better than the original ComBat method.
<li>'''ref.batch''' selects reference-based batch adjustment; use the '''mean.only''' option if the variance does not need adjusting. See the paper [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6 Alternative empirical Bayes models for adjusting for batch effects in genomic studies] Zhang 2018. Figure 4 shows that, in the simulated study, reference-based ComBat clearly reveals the pathway-activated samples in Batch 1 and the true data pattern in Batch 2, whereas the original ComBat approach failed in both cases. In Figure 5, when genes are clustered using K-means, reference-based ComBat identifies the role of DE or control genes better than the original ComBat method.
* [https://academic.oup.com/bioinformatics/article/24/9/1154/206630 Merging two gene-expression studies via cross-platform normalization] by Shabalin et al, Bioinformatics 2008. This method (called '''Cross-Platform Normalization/XPN''') was used by Ternès, Biometrical Journal 2017.
<li>[https://academic.oup.com/bioinformatics/article/24/9/1154/206630 Merging two gene-expression studies via cross-platform normalization] by Shabalin et al, Bioinformatics 2008. This method (called '''Cross-Platform Normalization/XPN''') was used by Ternès, Biometrical Journal 2017.
* [https://academic.oup.com/bib/article/14/4/469/191565 Batch effect removal methods for microarray gene expression data integration: a survey] by Lazar et al, Briefings in Bioinformatics 2012. The R package is '''[http://bioconductor.org/packages/3.3/bioc/html/inSilicoMerging.html inSilicoMerging]''', which was removed in Bioconductor 3.4.
<li>[https://academic.oup.com/bib/article/14/4/469/191565 Batch effect removal methods for microarray gene expression data integration: a survey] by Lazar et al, Briefings in Bioinformatics 2012. The R package is '''[http://bioconductor.org/packages/3.3/bioc/html/inSilicoMerging.html inSilicoMerging]''', which was removed in Bioconductor 3.4.
* [https://support.bioconductor.org/p/25840/ Question: Combine hgu133a&b and hgu133plus2]. [https://academic.oup.com/biostatistics/article/8/1/118/252073 Adjusting batch effects in microarray expression data using empirical Bayes methods]
<li>[https://support.bioconductor.org/p/25840/ Question: Combine hgu133a&b and hgu133plus2]. [https://academic.oup.com/biostatistics/article/8/1/118/252073 Adjusting batch effects in microarray expression data using empirical Bayes methods]
* [https://rdrr.io/bioc/limma/man/removeBatchEffect.html removeBatchEffect()] from limma package
<li>[https://bmccancer.biomedcentral.com/track/pdf/10.1186/s12885-018-4546-8#page=4 Figure S1 shows the principal component analysis (PCA) before and after batch effect correction for training and validation datasets] from another paper
* [https://biodatascience.github.io/compbio/dist/batch.html Batch effects and GC content] of NGS by Michael Love
<li>[https://rdrr.io/bioc/limma/man/removeBatchEffect.html removeBatchEffect()] from limma package
* [https://www.jianshu.com/p/99b3411ad6ad 困扰的batch effect]
<li>[https://biodatascience.github.io/compbio/dist/batch.html Batch effects and GC content] of NGS by Michael Love
* [https://mdozmorov.github.io/BIOS567/assets/presentation_diffexpression/batch.pdf Some note] by Mikhail Dozmorov
<li>[https://www.jianshu.com/p/99b3411ad6ad 困扰的batch effect]
<li>[https://mdozmorov.github.io/BIOS567/assets/presentation_diffexpression/batch.pdf Some note] by Mikhail Dozmorov
</ul>


= MultiBaC- Multiomic Batch effect Correction =
= MultiBaC- Multiomic Batch effect Correction =

Revision as of 09:04, 17 May 2022

Merging two gene expression studies, ComBat

  • Statistics for Genomic Data Science (Coursera) and https://github.com/jtleek/genstats
  • Some possible batch variables: operators, runs, machines, library kits, laboratories.
  • sva::ComBat() function in sva package from Bioconductor. The model is
    [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_g + X \beta_g + \gamma_{ig} + \delta_{ig} \epsilon_{ijg} \end{align} }[/math]
    where [math]\displaystyle{ X=X_{ij} }[/math] consists of covariates of scientific interest (e.g. pathway activation levels in the simulation example of Zhang et al. 2018), while [math]\displaystyle{ \gamma_{ig} }[/math] and [math]\displaystyle{ \delta_{ig} }[/math] characterize the additive and multiplicative batch effects of batch i for gene g. The error terms, [math]\displaystyle{ \epsilon_{ijg} }[/math], are assumed to follow a normal distribution with expected value zero and variance [math]\displaystyle{ \sigma^2_g }[/math]. The batch corrected data is
    [math]\displaystyle{ \begin{align} \frac{Y_{ijg} - \hat{\alpha_g} - X \hat{\beta_g} - \hat{\gamma_{ig}}}{\hat{\delta_{ig}}} + \hat{\alpha_g} + X \hat{\beta_g}. \end{align} }[/math]
  • Alternative empirical Bayes models for adjusting for batch effects in genomic studies Zhang et al. BMC Bioinformatics 2018. The R packages are sva and BatchQC from Bioconductor.
    • Reference batch adjustment: [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_{rg} + X \beta_{rg} + \gamma_{rig} + \delta_{rig} \epsilon_{ijg} \end{align} }[/math] where [math]\displaystyle{ \alpha_{rg} }[/math] is the average gene expression in the chosen reference batch (r). Furthermore, [math]\displaystyle{ \gamma_{rig} }[/math] and [math]\displaystyle{ \delta_{rig} }[/math] represent the additive and multiplicative batch differences between the reference batch and batch i for gene g. The error terms, [math]\displaystyle{ \epsilon_{ijg} }[/math], are assumed to follow a normal distribution with expected value zero and a reference batch variance [math]\displaystyle{ \sigma^2_{rg} }[/math].
    • Mean-only adjustment for batch effects: [math]\displaystyle{ \begin{align} Y_{ijg} = \alpha_{g} + X \beta_{g} + \gamma_{ig} + \epsilon_{ijg} \end{align} }[/math]
  • sva vignette example to remove the batch effect
    # bladderbatch provides the example data (bladderEset) used below
    BiocManager::install(c("sva", "bladderbatch"))
    library(sva)
    library(bladderbatch)
    data(bladderdata)
    pheno = pData(bladderEset)
    edata = exprs(bladderEset)
    batch = pheno$batch
    table(pheno$cancer)
    # Biopsy Cancer Normal 
    #      9     40      8 
    table(batch)
    # batch
    #  1  2  3  4  5 
    # 11 18  4  5 19 
    
    modcombat = model.matrix(~1, data=pheno)
    combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, 
                          prior.plots=FALSE)
    # This returns an expression matrix, with the same dimensions 
    # as your original dataset (genes x samples).
    # mod: Model matrix for outcome of interest and other covariates besides batch
    # By default, it performs parametric empirical Bayesian adjustments. 
    # If you would like to use nonparametric empirical Bayesian adjustments, 
    # use the par.prior=FALSE option (this will take longer). 
    
    # Reference-based adjustment: the other batches are adjusted towards batch 1
    combat_edata = ComBat(dat=edata, batch=batch, ref.batch=1)
  • ref.batch selects reference-based batch adjustment; use the mean.only option if the variance does not need adjusting. See the paper Alternative empirical Bayes models for adjusting for batch effects in genomic studies Zhang 2018. Figure 4 shows that, in the simulated study, reference-based ComBat clearly reveals the pathway-activated samples in Batch 1 and the true data pattern in Batch 2, whereas the original ComBat approach failed in both cases. In Figure 5, when genes are clustered using K-means, reference-based ComBat identifies the role of DE or control genes better than the original ComBat method.
  • Merging two gene-expression studies via cross-platform normalization by Shabalin et al, Bioinformatics 2008. This method (called Cross-Platform Normalization/XPN) was used by Ternès, Biometrical Journal 2017.
  • Batch effect removal methods for microarray gene expression data integration: a survey by Lazar et al, Briefings in Bioinformatics 2012. The R package is inSilicoMerging, which was removed in Bioconductor 3.4.
  • Question: Combine hgu133a&b and hgu133plus2. Adjusting batch effects in microarray expression data using empirical Bayes methods
  • Figure S1 shows the principal component analysis (PCA) before and after batch effect correction for training and validation datasets from another paper
  • removeBatchEffect() from limma package
  • Batch effects and GC content of NGS by Michael Love
  • 困扰的batch effect
  • Some note by Mikhail Dozmorov
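
The three models listed above (standard, reference batch, and mean-only) can be illustrated with a small base R sketch. This is not how sva::ComBat() is implemented: it assumes no covariates X, uses plain per-batch moment estimates instead of empirical Bayes shrinkage, and needs at least two samples per batch with non-zero variance for every gene. The function name adjust_batches and its arguments ref and mean_only are hypothetical, chosen here only to mirror ComBat's ref.batch and mean.only options.

```r
# Plain location/scale batch adjustment for a genes x samples matrix.
# Hypothetical helper for illustration only: no covariates, no EB shrinkage.
adjust_batches <- function(Y, batch, ref = NULL, mean_only = FALSE) {
  batch <- as.factor(batch)
  stopifnot(ncol(Y) == length(batch))
  in_ref <- if (is.null(ref)) rep(TRUE, ncol(Y)) else batch == ref
  stopifnot(sum(in_ref) > 1)

  # alpha_g: gene mean over all samples, or over the reference batch only
  alpha <- rowMeans(Y[, in_ref, drop = FALSE])

  # sigma_g: pooled within-batch SD, or the SD within the reference batch
  if (is.null(ref)) {
    centred <- Y
    for (b in levels(batch)) {
      cols <- batch == b
      centred[, cols] <- Y[, cols, drop = FALSE] -
        rowMeans(Y[, cols, drop = FALSE])
    }
    sigma <- sqrt(rowSums(centred^2) / (ncol(Y) - nlevels(batch)))
  } else {
    sigma <- apply(Y[, in_ref, drop = FALSE], 1, sd)
  }

  adjusted <- Y
  for (b in levels(batch)) {
    cols <- batch == b
    Yb <- Y[, cols, drop = FALSE]
    gamma <- rowMeans(Yb) - alpha                 # additive batch effect
    delta <- if (mean_only) rep(1, nrow(Y)) else  # multiplicative batch effect
      apply(Yb, 1, sd) / sigma
    # Corrected data: (Y - alpha - gamma) / delta + alpha
    adjusted[, cols] <- (Yb - alpha - gamma) / delta + alpha
  }
  adjusted
}

# Simulated data: 5 genes, 2 batches of 10 samples; batch 2 is shifted and scaled
set.seed(1)
batch <- rep(c(1, 2), each = 10)
Y <- matrix(rnorm(5 * 20), nrow = 5)
Y[, batch == 2] <- 2 * Y[, batch == 2] + 3

std  <- adjust_batches(Y, batch)                    # standard model
refd <- adjust_batches(Y, batch, ref = 1)           # reference batch model
mo   <- adjust_batches(Y, batch, mean_only = TRUE)  # mean-only model
```

In this sketch the standard adjustment gives every batch the same per-gene mean and standard deviation, the reference-based adjustment leaves the reference batch untouched, and the mean-only adjustment aligns batch means while keeping each batch's own spread. The real ComBat additionally shrinks the batch effect estimates across genes, which is what makes it robust for small batches.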

MultiBaC- Multiomic Batch effect Correction

MultiBaC

Combat or limma?

Batch effects : ComBat or removebatcheffects (limma package) ?