GSEA: Difference between revisions

From 太極
Jump to navigation Jump to search
(55 intermediate revisions by the same user not shown)
Line 35: Line 35:
* [http://barc.wi.mit.edu/education/hot_topics/enrichment/Gene_list_enrichment_Mar10.pdf Gene List Enrichment Analysis] dhyper(), binom.test(), fisher.test().
* [http://barc.wi.mit.edu/education/hot_topics/enrichment/Gene_list_enrichment_Mar10.pdf Gene List Enrichment Analysis] dhyper(), binom.test(), fisher.test().
* [https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/ Pathway Commons]
* [https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/ Pathway Commons]
** A positive enrichment score indicates that the gene set is overrepresented at the top of the ranked list. This means that the genes in the set are more likely to be up-regulated, with relatively small p-values.
** A negative enrichment score suggests that the gene set is overrepresented at the bottom of the ranked list. This implies that the genes in the set are more likely to be down-regulated.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103537/ Gene set analysis methods: statistical models and methodological differences] 2014
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103537/ Gene set analysis methods: statistical models and methodological differences] 2014
* http://software.broadinstitute.org/gsea/index.jsp, [https://www.gsea-msigdb.org/gsea/doc/subramanian_tamayo_gsea_pnas.pdf  Subramanian, et al 2005] paper
* http://software.broadinstitute.org/gsea/index.jsp, [https://www.gsea-msigdb.org/gsea/doc/subramanian_tamayo_gsea_pnas.pdf  Subramanian, et al 2005] paper
Line 46: Line 48:
** [https://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/gsea1_bw.pdf#page=14 Using GSEA.1.0.R]
** [https://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/gsea1_bw.pdf#page=14 Using GSEA.1.0.R]
* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017
* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017
* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package
* [https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html clusterProfiler] package and the online [http://yulab-smu.top/clusterProfiler-book/index.html book].
* [https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html clusterProfiler] package and the online [http://yulab-smu.top/clusterProfiler-book/index.html book].
* [https://www.biorxiv.org/content/10.1101/659920v2 Gene-set Enrichment with Regularized Regression] Fang 2019
* [https://www.biorxiv.org/content/10.1101/659920v2 Gene-set Enrichment with Regularized Regression] Fang 2019
Line 60: Line 61:


See also [[Arraytools#GSEA|BRB-ArrayTools -> GSEA]].
See also [[Arraytools#GSEA|BRB-ArrayTools -> GSEA]].
== Hypergeometric test ==
<ul>
<li>Suppose you have a bag with 100 balls, 20 of which are red and the rest are blue. You draw 10 balls from the bag without replacement. Let’s say 4 of these balls are red. You want to know if drawing 4 red balls out of 10 is significantly different from random chance. </br>
Here’s how you would set up the hypergeometric test:
* Define the Universe (N): The total number of balls, which is 100.
* Define the Gene Set (M): The total number of red balls, which is 20.
* Define the Number of Successes (DE genes) in the Universe (n): The number of balls you drew, which is 10.
* Define the Number of Successes (DE genes) in the Gene Set (x): The number of red balls you drew, which is 4.
The formula for the hypergeometric test is:
<math>
P(X=x) = \frac{C(M, x) C(N-M, n-x)}{C(N, n)} = \frac{C(20,4) C(80,6)}{C(100, 10)}
</math>
<pre>
    N=100
  +-----------------------------+
  |  M=20              n=10    |
  | (# in a pathway)  (# of DE) |
  |  +-------+--------+-------+ |       
  |  |      |  x=4  |      | |
  |  |      | common |      | |   
  |  +-------+--------+-------+ |
  +-----------------------------+
</pre>
<li>In the context of gene set enrichment analysis,
* Define the Universe (N): The total number of genes that you have profiled in your experiment. Let’s say you have profiled 20,000 genes.
* Define the Gene Set (M): The total number of genes known to be involved in a specific biological pathway. For example, suppose there are 200 genes known to be involved in the “cell cycle” pathway.
* Define the Number of Successes (DE genes) in the Universe (n): The number of genes that are differentially expressed in your experiment. Let’s say you found 2,000 genes to be differentially expressed.
* Define the Number of Successes (DE genes) in the Gene Set (x): The number of differentially expressed genes that are also in your gene set. Suppose 50 of the 2,000 differentially expressed genes are involved in the “cell cycle” pathway.
<pre>
    N=20,000
  +-----------------------------+
  |  M=200            n=2000  |
  | (# in a pathway)  (# of DE) |
  |  +-------+--------+-------+ |       
  |  |      |  x=50  |      | |
  |  |      | common |      | |   
  |  +-------+--------+-------+ |
  +-----------------------------+
</pre>
We calculate the tail probability to report the p-value; <math>P(X \geq 50) = P(X=50) + \cdots + P(X=min(M, n))</math>. If this p-value is below a certain threshold (commonly 0.05), we reject the null hypothesis and conclude that the '''pathway is significantly enriched among the differentially expressed genes'''. It’s also important to remember that statistical significance does not always imply biological significance, and further validation is often required.</math>
</ul>


== Interpretation ==
== Interpretation ==
* Gene Set Enrichment Analysis (GSEA) takes an alternative approach : it focuses on cumulative changes in expression of multiple genes as a group (belonging to a same gene-set/pahtway), which shifts the focus from individual genes to groups of genes. See [http://baderlab.org/CancerStemCellProject/VeroniqueVoisin/AdditionalResources/GSEA this].
* Gene Set Enrichment Analysis (GSEA) takes an alternative approach : it focuses on cumulative changes in expression of multiple genes as a group (belonging to a same gene-set/pahtway), which shifts the focus from individual genes to groups of genes. See [http://baderlab.org/CancerStemCellProject/VeroniqueVoisin/AdditionalResources/GSEA this].
* XXX class '''is associated with''' the XXX gene sets ([https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf#page=12 GSVA vignette])
* XXX class/subtype '''is associated with''' the OOO gene set(s) ([https://www.bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html GSVA vignette])
* XXX subtype is characterized by the expression of XXX markers, thus we expect it to '''correlate with''' the XXX gene set (GSVA vignette)
* XXX subtype (of samples) is characterized by the expression of OOO markers, thus we expect it to '''correlate with''' the OOO2 gene set (GSVA vignette)
* The XXX subtype shows high expression of XXX genes, thus the XXX gene sets '''is highly enriched for''' this group (GSVA vignette)
* The XXX subtype (of samples) shows high expression of OOO genes, thus the OOO gene set '''is highly enriched for''' this group (GSVA vignette). OR if we find a gene set is '''enriched''' in XXX subtype (e.g. sensitive models), it means genes in that gene set are more highly expressed in the XXX subtype samples compared to the YYY subtype samples.
* [https://groups.google.com/g/gsea-help/c/RN5f-dTzwzY Negative ES Interpretation].  
* [https://groups.google.com/g/gsea-help/c/RN5f-dTzwzY Negative ES Interpretation].  
** A positive enrichment score should always reflect enrichment on the positive side of the zero cross (although not necessarily all genes on the positive side) and be enrichment in whichever phenotype was selected to be first in the comparison in the Phenotype selection dialogue. And vice versa, with a negative enrichment score '''reflecting enrichment in the genes on the negative side of the ranked list'''.
** A positive enrichment score should always reflect enrichment on the positive side of the zero cross (although not necessarily all genes on the positive side) and be enrichment in whichever phenotype was selected to be first in the comparison in the Phenotype selection dialogue. And vice versa, with a negative enrichment score '''reflecting enrichment in the genes on the negative side of the ranked list'''.
** GSEA is known to have some issues with highly skewed gene distributions, but that shouldn't affect the raw enrichment scores, just NES and significance calculations when GSEA runs in to a condition where its only sampled from one side of the distribution.
** GSEA is known to have some issues with highly skewed gene distributions, but that shouldn't affect the raw enrichment scores, just NES and significance calculations when GSEA runs in to a condition where its only sampled from one side of the distribution.
* [https://www.biostars.org/p/490403/ Enrichment Score Interpretation]
** If, for example, you provide a gene list ranked by a combination of fold change and p-value (e.g., sign(FC) * log10(pvalue)), then the positive scores are associated with upregulated genes and negative scores are associated with downregulated genes.
* [https://www.biorxiv.org/content/10.1101/060012v3.full.pdf#page=10 fgsea] paper
**  '''The more positive is the value of sr(p) the more enriched the gene set is in the positively-regulated genes (with Si > 0). Accordingly, negative sr(p) corresponds to enrichment in the negatively regulated genes.'''
* The opposite of "enriched" is "depleted". See [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis Wikipedia].


== Calculation ==
== Calculation ==
Line 81: Line 134:
== FDR cutoff ==
== FDR cutoff ==
[https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Why_does_GSEA_use_a_false_discovery_rate_.28FDR.29_of_0.25_rather_than_the_more_classic_0.05.3F Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?]
[https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Why_does_GSEA_use_a_false_discovery_rate_.28FDR.29_of_0.25_rather_than_the_more_classic_0.05.3F Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?]
''Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results. ''
== GSEAtraining ==
https://jokergoo.github.io/GSEAtraining/


= piano =
= piano =
Line 112: Line 170:
</pre>
</pre>
</li>
</li>
<li>[https://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html vignette]  
<li>Paper is still on [https://www.biorxiv.org/content/10.1101/060012v3.full.pdf biorxiv] only. [https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=Fast+gene+set+enrichment+analysis+Gennady+Korotkevich&btnG= Scholar.google.com]. Computer tech lab in Russia.
<li>[https://rdrr.io/bioc/fgsea/man/ Documentation] </li>
<li>[https://rdrr.io/bioc/fgsea/man/collapsePathways.html ?collapsePathways]. Collapse list of enriched pathways to independent ones.
<li>[https://stephenturner.github.io/deseq-to-fgsea/ DESeq results to pathways in 60 Seconds with the fgsea package] </li>
<li>[https://www.biostars.org/p/339934/ Performing GSEA using MSigDB gene sets in R]
</li>
<li>[https://davetang.org/muse/2018/01/10/using-fast-preranked-gene-set-enrichment-analysis-fgsea-package/ Using the fast preranked gene set enrichment analysis (fgsea) package]
</ul>
 
== Are fgsea and Broad Institute GSEA equivalent ==
[https://bioinformatics.stackexchange.com/questions/149/are-fgsea-and-broad-institute-gsea-equivalent Are fgsea and Broad Institute GSEA equivalent?]
 
== Examples ==
<ul>
<li>[https://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html vignette],
<li>[https://rdrr.io/bioc/fgsea/man/plotEnrichment.html plotEnrichment()] for a single pathway (including the output plot). '''No need to run fgsea()'''. [https://github.com/ctlab/fgsea/blob/master/R/plot.R#L148 Source code]. Enrichment score (ES) on the plot is calculated by calcGseaStat()$res. '''The ES value is the same as the one shown in plotGseaTable()''' though plotEnrichment does not return it.
<li>[https://rdrr.io/bioc/fgsea/man/plotGseaTable.html plotGseaTable()] for a bunch of selected pathways (including the output plot).
<li>Basic usage
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='rsplus'>
library(fgsea)
library(fgsea)
Line 121: Line 196:
fgseaRes <- fgsea(pathways = examplePathways,  
fgseaRes <- fgsea(pathways = examplePathways,  
                   stats    = exampleRanks,
                   stats    = exampleRanks,
                   minSize  = 4,
                   minSize  = 4, # default minSize=1, maxSize=Inf
                   maxSize  =10) # choose a small gene set in order to see the detail
                   maxSize  =10)  
# I used a very small maxSize in order to see details later
# So the results here can't be compared with the default
class(fgseaRes)
# [1] "data.table" "data.frame"
dim(fgseaRes)
# [1] 386  8
fgseaRes[1:2, ]
#                                                          pathway      pval      padj
# 1:                        1368092_Rora_activates_gene_expression 0.8630705 0.9386018
# 2: 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression 0.4312268 0.8159487
#      log2err        ES        NES size              leadingEdge
# 1: 0.05412006 -0.3087414 -0.6545067    5 11865,12753,328572,20787
# 2: 0.08312913  0.4209054  1.0360288    9        20893,59027,19883
 
sapply(examplePathways, length) |> range()
# [1]    1 2366
# Question: minSize, maxSize mean 'matched' genes? Can't verify
sum(sapply(examplePathways, length) < 3)
# [1] 104
sum(sapply(examplePathways, length) > 10)
# [1] 939
1457 - 104 - 939
# [1] 414
examplePathways2 <- lapply(examplePathways, function(x) x[x %in% names(exampleRanks)])
sapply(examplePathways2, length) |> range()
# [1]  0 968
sum(sapply(examplePathways2, length) < 3)
# [1] 232
sum(sapply(examplePathways2, length) > 10)
# [1] 730
1457 - 232 - 730
# [1] 495
range(fgseaRes$ES)
# [1] -0.8442416  0.9488163
range(fgseaRes$NES)
# [1] -2.020695  2.075729
order(fgseaRes$ES)[1:5]
# [1]  75 289 320 249 312
order(fgseaRes$NES)[1:5]
# [1] 289  75 102 320 200


# choose the top gene set in order to zoom in & see the detail
head(fgseaRes[order(pval), ], 1)
head(fgseaRes[order(pval), ], 1)
#                            pathway        pval        padj  log2err        ES
#                            pathway        pval        padj  log2err        ES
Line 131: Line 247:


plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks)
plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks)
# still hard to understand the graph
# 12000 total genes though only 6 genes are matched in this top gene set.
debug(plotEnrichment)
# Browse[2]> toPlot
#        x            y
# 1      0  0.000000000
# 2    47 -0.003918626
# 3    48  0.308356695
# 4    322  0.285511939
# 5    323  0.452415897
# 6    407  0.445412396
# 7    408  0.593677191
# 8    447  0.590425565
# 9    448  0.730277162
# 10  617  0.716186784
# 11  618  0.833696389
# 12  638  0.832028888
# 13  639  0.947223612
# 14 12001  0.000000000
examplePathways[["5991601_Phosphorylation_of_Emi1"]]
# [1] "12534"    "18817"    "56371"    "67141"    "107995"    "268697" 
# [7] "434175"    "102643276"
match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks))
# [1] 11678 11593 11362 11553 11953 11383    NA    NA
# Since there are 12000 total genes and exampleRanks have been sorted,
# these matched genes have very high ranks.
# See the plot below.
 
length(exampleRanks)
# [1] 12000
exampleRanks[1:10]
#    170942    109711    18124    12775    72148    16010    11931
# -63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626 -29.42328
#    13849    241230    665113
# -28.83475 -28.65111 -27.81583
sum(exampleRanks < 0)
# [1] 6469
sum(exampleRanks > 0)
# [1] 5531
plot(exampleRanks)
</syntaxhighlight>
[[File:FgseaPlot.png|250px]]  [[File:ExampleRanks.png|250px]]
 
There are 6*2 points (excluding the 0 value at the starting and end) in the figure.
 
<li>The range of NES is not [-1, 1] but ES seems to be in [-1, 1]. The example sort pathways by NES but it can be adapted to sort by pval. [https://github.com/ctlab/fgsea/blob/master/R/plot.R#L28 Source code]. Also ES and NES are not in the same order (see below) though we cannot tell it from the small plot. According to the [https://rdrr.io/bioc/fgsea/man/fgseaMultilevel.html manual],
* ES – enrichment score, same as in Broad GSEA implementation;
* NES – enrichment score normalized to '''mean enrichment''' of random samples of the '''same size''' (the number of genes in the gene set).
* [https://www.biostars.org/p/311473/ Hπow to interpret NES(normalized enrichment score)?]
* [https://www.biostars.org/p/169648/ Negative normalized Enrichment Score (NES) in GSEA analysis]
* [https://www.biostars.org/p/132575/ question about GSEA] about the sign of ES/NES. "Does this means in positive NES upregulated genes are enriched while in negative NES DN genes are enriched?"π Step 1.  rank all genes in your list (L) based on "how well they divide the conditions". Step 2, you want to see whether the genes present in a gene set (S) are at the top or at the bottom of your list...or if they are just spread around randomly. ''A positive NES will indicate that genes in set S will be mostly represented at the top of your list L. a negative NES will indicate that the genes in the set S will be mostly at the bottom of your list L.''
<pre>
with(fgseaRes, plot(abs(ES), pval))  # relationship is not consistent for a lot of pathways
with(fgseaRes, plot(abs(NES), pval)) # relationship is more consistent
with(fgseaRes, plot(ES, NES))
 
head(order(fgseaRes$pval))
# [1] 121 112 289  68 187 188
head(rev(order(abs(fgseaRes$NES))))
# [1] 121 289  68 188 187  22
</pre>
[[File:Fgsea3plots.png|350px]]
 
<li>The computation speed is FAST!
<pre>
system.time(invisible(fgsea(examplePathways, exampleRanks)))
#    user  system elapsed
#  31.162  38.427  15.800
</pre>
</ul>


== plotEnrichment() ==
<ul>
<li>Understand '''plotEnrichment()''' function. Plot shape (concave or convex) <span style="color: red">requires</span> the stats of genes from both in- and out- of the gene set. <span style="color: red">This affects the interpretation of the plot</span>. The plot always starts with 0 and end with 0 in Y (enrichment score). '''plotEnrichment()''' does not return enrichment scores even it calculates them for the plot.
<syntaxhighlight lang='rsplus'>
# Reduce the total number of genes
# Reduce the total number of genes
match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks))
# [1] 11678 11593 11362 11553 11953 11383    NA    NA
i <- match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks))
i <- match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks))
i <- na.omit(i)
i <- na.omit(i)   # Exclude 2 genes in the pathways but not in the data
# plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[i])
 
# GSEA statistic is not defined when all genes are selected
if (FALSE) {
  # Total genes = pathway genes will not work?
  plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[i])
  # Error: GSEA statistic is not defined when all genes are selected
}
 
# PS values in 'exampleRanks' object are sorted from the smallest to the largest
 
# Case 1: include 1 down-regulated gene for our total genes
plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[c(1,i)])
plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[c(1,i)])
# Easy to understand
# saved as fgseaPlotSmall.png
exampleRanks[c(1,i)]
exampleRanks[c(1,i)]
#    170942    12534    18817    56371    67141    107995    268697  
#    170942    12534    18817    56371    67141    107995    268697  
# -63.33703  15.16265  13.46935  10.46504  12.70504  28.36914  10.67534
# -63.33703  15.16265  13.46935  10.46504  12.70504  28.36914  10.67534
#
# Rank and add +1/-1,
# 28.36 15.16 13.46 12.70 10.67 10.46 -63.33
#    +1    +1    +1    +1    +1    +1    -1
debug(plotEnrichment)
# Browse[2]> toPlot
#    x        y
# 1  0 0.0000000
# 2  0 0.0000000
# 3  1 0.3122753
# 4  1 0.3122753
# 5  2 0.4791793
# 6  2 0.4791793
# 7  3 0.6274441
# 8  3 0.6274441
# 9  4 0.7672957
# 10 4 0.7672957
# 11 5 0.8848053
# 12 5 0.8848053
# 13 6 1.0000000
# 14 8 0.0000000
x <-c(28.36, 15.16, 13.46, 12.70, 10.67, 10.46, -63.33)
plot(0:7, c(0, cumsum(x)), type = "b")
# Case 2: include another one up-regulated gene (so all are up-reg) for our total genes
# However, the largest gene (added) is not in the gene set.
# It will make the plot starting from a negative value (ignore 0 from the very left)
plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[c(i, 12000)])
# saved as fgseaPlotSmall2.png
exampleRanks[c(i, 12000)]
#    12534    18817    56371    67141  107995  268697    80876
# 15.16265 13.46935 10.46504 12.70504 28.36914 10.67534 53.28400
#
# Rank and add +1/-1
# 53.28 28.36  15.16 13.46 12.70 10.67 10.46
#    -1    +1    +1    +1    +1    +1    +1
</syntaxhighlight>
</syntaxhighlight>
For figure 1 (case 1), it has 6 points. Figure 2 is the one I generate using plot() and ''exampleRanks''. For figure 3 (case 2), it has 7 points (excluding the 0 value at the starting and end). The curve goes down first because the gene has the largest rank (statistic) and is not in the pathway. So the x-axis location is determined by the rank/statistics and going up or down on the y-axis is determined by whether the gene is in the pathway or not.


Note <span style="color: red">shifting exampleRanks values will change the enrichment scores </span>
[[File:FgseaPlotSmall.png|250px]], [[File:FgseaPlotSmallm.png|250px]],  [[File:FgseaPlotSmall2.png|250px]]
 
<li><span style="color: red">Shifting exampleRanks values will change the enrichment scores </span> because the directions of genes' statistics change?
<pre>
<pre>
plotEnrichment(examplePathways[[1]],
plotEnrichment(examplePathways[[1]],
Line 161: Line 394:
             # same problem if we shift exampleRanks to all negative  
             # same problem if we shift exampleRanks to all negative  
</pre>
</pre>
<span style="color: red">including NE, NES, pval, etc. So it is not rank-based.</span>
<span style="color: red">including NE, NES, pval, etc. So it is not purely rank-based.</span>
<pre>
<pre>
set.seed(42)
set.seed(42)
Line 181: Line 414:
3: -0.2908641 -0.5971951 0.9442724
3: -0.2908641 -0.5971951 0.9442724
</pre>
</pre>
However, scaling won't change anything.
However, scaling won't change anything (because scaling does not change the directions?).
<pre>
<pre>
set.seed(42)
set.seed(42)
Line 191: Line 424:
# [1] TRUE
# [1] TRUE
</pre>
</pre>
</li>
</ul>
<li>[https://rdrr.io/bioc/fgsea/man/collapsePathways.html ?collapsePathways]. Collapse list of enriched pathways to independent ones.
== plotGseaTable() ==
<li>[https://stephenturner.github.io/deseq-to-fgsea/ DESeq results to pathways in 60 Seconds with the fgsea package] </li>
<ul>
<li>[https://www.biostars.org/p/339934/ Performing GSEA using MSigDB gene sets in R]
<li>Obtain enrichment scores by calling fgsea(); i.e. fgseaMultilevel().
</li>
<pre>
examplePathways[["5991601_Phosphorylation_of_Emi1"]] |> length()
# [1] 8
set.seed(1)
fgsea2 <- fgsea(pathways = examplePathways["5991601_Phosphorylation_of_Emi1"],
                stats    = exampleRanks)
tibble::as_tibble(fgsea2)
# A tibble: 1 × 8
#  pathway              pval    padj log2err    ES  NES  size leadi…¹
#  <chr>              <dbl>  <dbl>  <dbl> <dbl> <dbl> <int> <list>
# 1 5991601_Phosphor… 9.39e-7 9.39e-7  0.659 0.947  2.05    6 <chr> 
# … with abbreviated variable name ¹​leadingEdge
 
plotGseaTable(examplePathways["5991601_Phosphorylation_of_Emi1"],
              exampleRanks, fgsea2)
</pre>
[[File:FgseaPlotTop.png|550px]]
 
The signs of NES/ES can be confirmed from the gene rank plot. Below is a plot from 10 enriched pathways
 
[[File:GseaTable.png|350px]]
 
and a plot from 10 non-enriched pathways
 
[[File:GseaTable2.png|350px]]
</ul>
</ul>


= Subramanian algorithm =
= Subramanian algorithm =
In the plot, (x-axis) genes are sorted by their expression across all samples. Y-axis represents enrichment score. See [https://youtu.be/KY6SS4vRchY?t=412 HOW TO PERFORM GSEA - A tutorial on gene set enrichment analysis for RNA-seq]. [https://youtu.be/KY6SS4vRchY?t=429 Bars represents genes being in the gene set]. Genes on the LHS/RHS are more highly expressed in the experimental/control group. Small p means this gene set is enriched in this experimental sample.
In the plot, (x-axis) genes are sorted by their expression across all samples. Y-axis represents enrichment score. See [https://youtu.be/KY6SS4vRchY?t=412 HOW TO PERFORM GSEA - A tutorial on gene set enrichment analysis for RNA-seq]. [https://youtu.be/KY6SS4vRchY?t=429 Bars represents genes being in the gene set]. Genes on the LHS/RHS are more highly expressed in the experimental/control group. Small p means this gene set is enriched in this experimental sample.
== Kolmogorov Smirnov test statistic ==
* In Gene Set Enrichment Analysis (GSEA), the Kolmogorov-Smirnov test is used to calculate an '''enrichment score''' for each gene-set (pathway).
* This score reflects how often members (genes) included in that gene-set (pathway) occur at the top or bottom of the ranked data set.
* The ranking is based on a measure of differential gene expression.
* The enrichment score is calculated using the Kolmogorov-Smirnov statistic, which quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution, or between the empirical distribution functions of two samples.
* In GSEA, the empirical distribution function represents the ranked list of genes, and the reference distribution represents the gene-set (pathway) being tested for enrichment.
* The enrichment score is then used to determine whether the gene-set (pathway) is significantly enriched in the ranked list of genes.
* An example
** Let’s say we have a ranked list of 10 genes based on their differential expression between two conditions, with the most highly expressed genes at the top and the least expressed genes at the bottom. The list might look like this: GeneA, GeneB, GeneC, GeneD, GeneE, GeneF, GeneG, GeneH, GeneI, GeneJ.
** Now let’s say we have a gene-set (pathway) that contains three genes: GeneB, GeneD, and GeneH. We want to determine if this gene-set is significantly enriched in our ranked list of genes.
** To do this, we can use the Kolmogorov-Smirnov test to calculate an enrichment score for the gene-set. The test will calculate the maximum distance between the '''empirical distribution function of our ranked list of genes''' and the '''cumulative distribution function of our gene-set'''.
** The empirical distribution function of our ranked list of genes is a step function that increases by 1/10 at each gene in the list. The cumulative distribution function of our gene-set is also a step function that increases by 1/3 at each gene in the set.
** The Kolmogorov-Smirnov statistic will then calculate the maximum vertical distance between these two functions. '''If this distance is large, it suggests that the members of our gene-set (GeneB, GeneD, and GeneH) are occurring more often at the top or bottom of our ranked list of genes than would be expected by chance.''' This would indicate that our gene-set is significantly enriched in our ranked list of genes.
* To determine if the distance is statistically significant, we can compare the calculated KS statistic to a critical value obtained from a reference distribution. If the KS statistic is greater than the critical value, we can conclude that the gene-set is significantly enriched in the ranked list of genes.
* What is the reference distribution of KS statistic? In the context of Gene Set Enrichment Analysis (GSEA), the reference distribution for the Kolmogorov-Smirnov (KS) statistic is obtained by ''permuting the sample labels'' (''resampling genes'' actually) and recalculating the enrichment score for the gene-set many times. This creates a null distribution of enrichment scores that represents what we would expect to see if the gene-set was not enriched in the ranked list of genes.
** The critical value for the KS statistic is then obtained from this null distribution, typically by taking a high percentile (e.g., the 95th percentile) of the distribution. If the observed KS statistic is greater than this critical value, we can conclude that the gene-set is significantly enriched in the ranked list of genes.
* Why do we need to rank genes first? The ranking of genes is also used in the calculation of the enrichment score for a gene-set. The enrichment score reflects how often members (genes) included in that gene-set (pathway) occur at the top or bottom of the ranked list of genes. If the members of a gene-set occur more often at the top or bottom of the ranked list of genes than would be expected by chance, it suggests that the gene-set is significantly enriched in the ranked list of genes.
* R
<pre>
# Define a ranked list of genes based on their differential expression
# between two conditions, with the most highly expressed genes at the top
# and the least expressed genes at the bottom
ranked_genes <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE",
                  "GeneF", "GeneG", "GeneH", "GeneI", "GeneJ")
# Define a gene-set (pathway) that contains three genes
gene_set <- c("GeneB", "GeneD", "GeneH")
# Calculate the enrichment score for the gene-set using the KS statistic
enrichment_score <- max(ks.test(ranked_genes, gene_set)$statistic)
# Permute the sample labels and recalculate the enrichment score many times
# to create a null distribution of enrichment scores
set.seed(123)
n_permutations <- 1000
null_distribution <- replicate(n_permutations, {
  permuted_genes <- sample(ranked_genes)
  ks.test(permuted_genes, gene_set)$statistic
})
# Calculate the critical value for the KS statistic from the null distribution
critical_value <- quantile(null_distribution, 0.95)
# Determine if the gene-set is significantly enriched in the ranked list of genes
if (enrichment_score > critical_value) {
  print("The gene-set is significantly enriched in the ranked list of genes.")
} else {
  print("The gene-set is not significantly enriched in the ranked list of genes.")
}
</pre>


= Rafael A Irizarry =
= Rafael A Irizarry =
[https://pubmed.ncbi.nlm.nih.gov/20048385/ Gene set enrichment analysis made simple]
[https://pubmed.ncbi.nlm.nih.gov/20048385/ Gene set enrichment analysis made simple]


= RBiomirGS for microRNA/miRNA =
= microRNA/miRNA =
* [http://www.kenstoreylab.com/research-tools/rbiomirgs/ RBiomirGS] R package
* https://en.wikipedia.org/wiki/MicroRNA 小分子核糖核酸
* MicroRNAs (miRNAs) is a family of non-coding RNAs. miRNAs post-transcriptionally regulate the gene expression.
* MicroRNAs (miRNAs) are a class of '''small, non-coding RNA molecules''' (無法進一步轉譯成蛋白質的RNA) that play a role in the regulation of gene expression. They are typically 21-23 nucleotides in length and are found in plants, animals, and some viruses. miRNAs function by binding to complementary sequences in messenger RNA (mRNA) molecules, leading to the '''silencing''' of the '''target gene''' through the degradation of the mRNA or the inhibition of its translation into protein.
** The expression of microRNAs (miRNAs) and their target mRNAs are often inversely correlated.
** This interaction results in the reduction of mRNA and/or protein levels.
** Therefore, '''when a miRNA is highly expressed, the level of its target mRNA is usually low, and vice versa.''' See [https://academic.oup.com/bib/article/14/3/263/254884 Joint analysis of miRNA and mRNA expression data], [https://www.thermofisher.com/blog/behindthebench/mirna-regulated-gene-expression-in-sars-cov-2-research/ miRNA Regulated Gene Expression in SARS-CoV-2 Research].
** MicroRNAs (miRNAs) are small non-coding RNAs that play a crucial role in regulating gene expression. They typically do this by binding to messenger RNAs (mRNAs) and preventing them from being translated into proteins. '''If a pathway undergoes less miRNA inhibition, it means that the miRNAs are not binding to the mRNAs as much, which could allow more protein to be produced.'''
** It is also important to note that one miRNA can target multiple mRNAs, and one mRNA can be targeted by multiple miRNAs.
** In the context of miRNAs, the term ''' “target” ''' refers to the specific mRNA molecules that a miRNA '''interacts with'''. When a miRNA '''binds''' to its target mRNA, it can inhibit the translation of the mRNA into protein or lead to the degradation of the mRNA. This is why the expression levels of a miRNA and its target mRNA are often inversely correlated. So, when I mentioned “target” in the last sentence, I was referring to the mRNA molecules that are '''regulated''' by the miRNA.
 
* A '''target gene''' is a gene whose expression is regulated by a specific regulatory molecule, such as a transcription factor or a microRNA. The regulatory molecule binds to specific sequences in the DNA or RNA of the target gene, leading to changes in the expression of that gene. For example, a transcription factor may bind to the promoter region of a target gene, leading to an increase or decrease in the transcription of that gene into messenger RNA (mRNA). Similarly, a microRNA may bind to the mRNA of a target gene, leading to the degradation of the mRNA or the inhibition of its translation into protein.
* MiRNAs are involved in a wide range of cellular processes, including cell growth, differentiation, development, and apoptosis, and have been implicated in many diseases
 
== RBiomirGS ==
<ul>
<li>[http://www.kenstoreylab.com/research-tools/rbiomirgs/ RBiomirGS] R package which is based on logistic regressions with binary predictor variable (two groups).
<syntaxhighlight lang='r'>
# R 4.3.3
docker run -it --rm --user rstudio bioconductor/bioconductor_docker:RELEASE_3_18 R
 
devtools::install_github("jzhangc/git_RBiomiRGS/RBiomirGS", ref = "74a567a",
        repos = BiocManager::repositories())  # version 0.2.19, as of 4/17/2024
</syntaxhighlight>
<pre>
require("RBiomirGS")
 
# two input files
#  liver.csv
#  kegg.v5.2.entrez.gmt, 186 pathways
raw <- read.csv(file = "liver.csv", header = TRUE, stringsAsFactors = FALSE)
dim(raw)
# [1] 85  3
raw[1:3,]
#              miRNA  FC pvalue
# 1 mmu-miR-let-7f-5p 0.58  0.034
# 2    mmu-miR-1a-5p 0.32  0.037
# 3    mmu-miR-1b-5p 0.50  0.001
 
# Target mRNA mapping
# creating 95 csv files; e.g. "mmu-miR-1a-5p_mRNA.csv"
rbiomirgs_mrnascan(objTitle = "mmu_liver_predicted",
                  mir = raw$miRNA,
                  sp = "mmu", addhsaEntrez = TRUE, queryType = "predicted",
                  parallelComputing = TRUE, clusterType = "PSOCK")
# > head -3 mmu-miR-10b-5p_mRNA.csv
# "database","mature_mirna_acc","mature_mirna_id","target_symbol","target_entrez","target_ensembl","score"
# "diana_microt","MIMAT0000208","mmu-miR-10b-5p","Bdnf","12064","ENSMUSG00000048482","1"
# "diana_microt","MIMAT0000208","mmu-miR-10b-5p","Bcl2l11","12125","ENSMUSG00000027381","1"
 
# GSEA
# generating 3 files:
#  mirnascore.csv
#  mrnascore.csv
#  mirna_mrna_iwls_GS.csv
rbiomirgs_logistic(objTitle = "mirna_mrna_iwls", mirna_DE = raw,
                  var_mirnaName = "miRNA",
                  var_mirnaFC = "FC",
                  var_mirnaP = "pvalue",
                  mrnalist = mmu_liver_predicted_mrna_hsa_entrez_list,
                  mrna_Weight = NULL,
                  gs_file = "kegg.v5.2.entrez.gmt",
                  optim_method = "IWLS",
                  p.adj ="fdr",
                  parallelComputing = FALSE, clusterType = "PSOCK")
> wc -l mirnascore.csv                                           
      86 mirnascore.csv
> head -3 mirnascore.csv                                                                                              13:46:47
"miRNA","S_mirna"
"mmu-miR-let-7f-5p",1.46852108295774
"mmu-miR-1a-5p",1.43179827593301
 
> wc -l mrnascore.csv                                             
  15297 mrnascore.csv
> head -3 mrnascore.csv                                                                                                13:46:53
"EntrezID","S_mrna"
"1",-3.20388726317687
"10",-11.7197772358718
 
> wc -l mirna_mrna_iwls_GS.csv                                                                               
    187 mirna_mrna_iwls_GS.csv
> head -3 mirna_mrna_iwls_GS.csv                                                         
# "GS","converged","loss","gene.tested","coef","std.err","t.value","p.value","adj.p.val"
# "KEGG_GLYCOLYSIS_GLUCONEOGENESIS","Y",0.0231448486023595,54,0.103936359331543,...
</pre>
<syntaxhighlight lang='r'>
# Histogram of all gene sets
# y-axis = model coefficients
# x-axis = gene set
rbiomirgs_bar(gsadfm = mirna_mrna_iwls_GS, n = "all",
              y.rightside = TRUE, yTxtSize = 8, plotWidth = 300, plotHeight = 200,
              xLabel = "Gene set", yLabel = "model coefficient")
 
# Histogram of top 50 ranked KEGG pathways based on absolute value of the coefficient
# (with significant adjusted p values < 0.05)
# y-axis = gene set
# x-axis = log odds ratio
rbiomirgs_bar(gsadfm = mirna_mrna_iwls_GS, gs.name = TRUE,
                    n = 50, y.rightside = FALSE, yTxtSize = 8,
                    plotWidth = 200, plotHeight = 300,
                    xLabel = "Log odss ratio", signif_only = TRUE)
 
# Volcano plot of all gene sets
# y-axis = -log10(p-value). The default cutoff (q_value) is .05. -log10(.05)=1.30.
# x-axis = model coefficient
rbiomirgs_volcano(gsadfm = mirna_mrna_iwls_GS, topgsLabel = TRUE,
                  n =15, gsLabelSize = 3, sigColour = "blue",
                  plotWidth = 250, plotHeight = 220,
                  xLabel = "model coefficient")
</syntaxhighlight>
</li>
[[File:Rbiomirgs barall.png|250px]], [[File:Rbiomirgs bar.png|250px]], [[File:Rbiomirgs volcano.png|250px]]
</ul>
* miRNAs are small molecules that can bind to mRNAs and prevent them from being translated into proteins. This process is known as ‘inhibition’.
 
* [https://peerj.com/articles/4262/ Interpretation]
** See [[Regression#Logistic_regression|Logistic regression]]
** If the '''coefficient is positive''', miRNA inhibition on target mRNAs might be lifted/reduced, thereby leading to '''less suppression''' on the gene set of interest in the '''experimental group'''.
*** When the statement mentions a ‘positive coefficient’, it’s referring to a situation where an increase in miRNA levels is associated with an increase in the levels of target mRNAs. ''This is unusual'' because '''we’d normally expect more miRNA to lead to less mRNA, due to the inhibition process''' I mentioned earlier.
*** So, if the coefficient is positive, it might mean that the miRNA’s inhibitory effect on the mRNA is being lifted or reduced. As a result, there’s less suppression of the mRNA, which could lead to more of its associated protein being produced.
*** If these genes are the ones producing the target mRNAs, then less suppression by the miRNAs would lead to '''increased activity of these genes in the experimental group.'''
** Furthermore, with a positive coefficient, a unit increase in Smrna results in an increased odds ratio of a gene belonging to the gene set of interest.
** It needs to be clarified that a positive model coefficient for a gene set means that the gene set of interest might be under more miRNA-dependent inhibition in the control group, as opposed to being activated under the experimental condition.
 
* [https://mirbase.org/ miRBase]: the microRNA database
* [https://mirbase.org/ miRBase]: the microRNA database
* https://mirdb.org/
* One microRNA/miR is one '''set'''. For human, its name ('''miRBase ID''') looks like [https://en.wikipedia.org/wiki/Mir-181_microRNA_precursor hsa-miR-181a], [https://www.thermofisher.com/order/genome-database/details/microrna/MH10421 miRBase Accession Number is MIMAT0000256].
* One microRNA/miR is one '''set'''. For human, its name ('''miRBase ID''') looks like [https://en.wikipedia.org/wiki/Mir-181_microRNA_precursor hsa-miR-181a], [https://www.thermofisher.com/order/genome-database/details/microrna/MH10421 miRBase Accession Number is MIMAT0000256].
* [https://en.wikipedia.org/wiki/MicroRNA_and_microRNA_target_database MicroRNA and microRNA target database]
* [https://en.wikipedia.org/wiki/MicroRNA_and_microRNA_target_database MicroRNA and microRNA target database]
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-10-218 Correlation of expression profiles between microRNAs and mRNA targets using NCI-60 data]
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-10-218 Correlation of expression profiles between microRNAs and mRNA targets using NCI-60 data]
== MIENTURNET ==
* http://userver.bio.uniroma1.it/apps/mienturnet/
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3105-x MIENTURNET: an interactive web tool for microRNA-target enrichment and network-based analysis] 2019


= multiGSEA =
= multiGSEA =
Line 241: Line 676:
** Functions in the GSEABase package help with reading, parsing and processing the signatures.
** Functions in the GSEABase package help with reading, parsing and processing the signatures.


== ssGSEA ==
== ssGSEA & GSVA package ==
<ul>
<ul>
<li>https://github.com/broadinstitute/ssGSEA2.0 </li>
<li>https://github.com/broadinstitute/ssGSEA2.0 </li>
Line 278: Line 713:
</ul>
</ul>
</li>
</li>
<li>[https://rdrr.io/bioc/GSVA/man/gsva.html gsva()] from the [https://www.bioconductor.org/packages/release/bioc/html/GSVA.html GSVA] package has an option to compute ssGSEA. [https://www.jianshu.com/p/0074965a2bd0 单样本基因集富集分析 --- ssGSEA] ('''it includes the formula from [https://www.nature.com/articles/nature08460#Sec3 Barbie's paper]'''). The output of gsva() is a matrix of ES (# gene sets x # samples). It does not produce plots nor running the permutation tests. Notice the option '''ssgsea.norm'''. Note '''ssgsea.norm = TRUE''' (default) option will scale ES by the absolute difference of the max and min ES; '''Unscaled ES / (max(unscaled ES) - min(unscaled ES))'''. So the scaled ES values depends on the included samples. But it seems the impact of the included samples is small from some real data. See the source code on [https://github.com/rcastelo/GSVA/blob/a795b9e4b4a9e280503ab290dbe6df3d62cbedb9/R/gsva.R#L877 github] and [[R#Debug_an_S4_function|on how to debug an S4 function]].
<li>[https://rdrr.io/bioc/GSVA/man/gsva.html gsva()] from the [https://www.bioconductor.org/packages/release/bioc/html/GSVA.html GSVA] package has an option to compute ssGSEA. [https://www.jianshu.com/p/0074965a2bd0 单样本基因集富集分析 --- ssGSEA] ('''it includes the formula from [https://europepmc.org/article/MED/19847166 Barbie's paper]'''). The output of gsva() is a matrix of ES (# gene sets x # samples). It does not produce plots nor running the permutation tests. Notice the option '''ssgsea.norm'''. Note '''ssgsea.norm = TRUE''' (default) option will scale ES by the absolute difference of the max and min ES; '''Unscaled ES / (max(unscaled ES) - min(unscaled ES))'''. So the scaled ES values depends on the included samples. But it seems the impact of the included samples is small from some real data. See the source code on [https://github.com/rcastelo/GSVA/blob/a795b9e4b4a9e280503ab290dbe6df3d62cbedb9/R/gsva.R#L877 github] and [[R#Debug_an_S4_function|on how to debug an S4 function]].
<pre>
<pre>
library(GSVA)
library(GSVA)
Line 382: Line 817:
</pre>
</pre>
</li>
</li>
<li>Is '''mx.diff''' useful? No.
<li>Is '''mx.diff''' useful? No. [https://support.bioconductor.org/p/9150770/ GSVA with and without absolute scaling method]. The '''mx.diff''' and '''abs.ranking''' only apply when method="gsva". The only parameters that apply when method="ssgsea" are '''tau''' and '''ssgsea.norm'''.
<pre>
<pre>
gsva4 <- gsva(y, geneSets, method="ssgsea", mx.diff = FALSE)
gsva4 <- gsva(y, geneSets, method="ssgsea", mx.diff = FALSE)
Line 435: Line 870:
</li>
</li>
<li>[https://youtu.be/AdxTuwAPskg 【生信分析 3】教你看懂GSEA和ssGSEA分析结果]. [https://youtu.be/AdxTuwAPskg?t=393 No groups/classes in the data] (6:33). Output is a heatmap. Each value is computed sample by sample. Rows = gene set. Columns = (sorted by the 1st gene set) samples.</li>
<li>[https://youtu.be/AdxTuwAPskg 【生信分析 3】教你看懂GSEA和ssGSEA分析结果]. [https://youtu.be/AdxTuwAPskg?t=393 No groups/classes in the data] (6:33). Output is a heatmap. Each value is computed sample by sample. Rows = gene set. Columns = (sorted by the 1st gene set) samples.</li>
<li>[https://support.bioconductor.org/p/9153126/ Gene set analysis: GSVA, Z-Score, and fold-changes]
</ul>
</ul>


== escape ==
== escape ==
[https://bioconductor.org/packages/release/bioc/html/escape.html escape] - Easy single cell analysis platform for enrichment. [https://github.com/ncborcherding/escape Github].
[https://bioconductor.org/packages/release/bioc/html/escape.html escape] - Easy single cell analysis platform for enrichment. [https://github.com/ncborcherding/escape Github].
== sTAM ==
* Paper [https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(20)30193-1 sTAM: An Online Tool for the Discovery of miRNA-Set Level Disease Biomarkers] 2020
* http://mir.rnanut.net/stam


= Benchmark =
= Benchmark =
Line 446: Line 887:
* [http://bioconductor.org/packages/release/bioc/html/phantasus.html phantasus] Visual and interactive gene expression analysis. read.gct(), [https://rdrr.io/bioc/phantasus/man/write.gct.html write.gct()], getGSE(), loadGEO(), gseaPlot().
* [http://bioconductor.org/packages/release/bioc/html/phantasus.html phantasus] Visual and interactive gene expression analysis. read.gct(), [https://rdrr.io/bioc/phantasus/man/write.gct.html write.gct()], getGSE(), loadGEO(), gseaPlot().
* [https://artyomovlab.wustl.edu/phantasus/phantasus-tutorial.html Using Phantasus application]
* [https://artyomovlab.wustl.edu/phantasus/phantasus-tutorial.html Using Phantasus application]
= PhenoExam =
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05122-x PhenoExam: gene set analyses through integration of different phenotype databa]


= Selected papers =
= Selected papers =
Line 462: Line 906:
= GSDA =
= GSDA =
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04110-x Gene-set distance analysis (GSDA): a powerful tool for gene-set association analysis]
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04110-x Gene-set distance analysis (GSDA): a powerful tool for gene-set association analysis]
= Genekitr =
https://github.com/GangLiLab/genekitr. [https://twitter.com/strnr/status/1663537311668109314 twitter]


= MSigDB =
= MSigDB =
Line 471: Line 918:
== Hallmark ==
== Hallmark ==
* [https://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION]/EMT from MsigDB
* [https://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION]/EMT from MsigDB
* Epithelial (170 genes), mesenchymal gene signatures 218-170 genes). Tan et al. 2014 and can be found in the ‘Table S1B. Generic EMT signature for cell line’ ('''Epi/Mes''' column) in the supplementary tables file. See [https://bioconductor.org/packages/release/bioc/vignettes/singscore/inst/doc/singscore.html#22_Sample_scoring_with_a_reduced_number_of_measurements here]. I save the gene lists in [https://gist.github.com/arraytools/7e175795ec62ec940e0fcee8795dcce2#file-emgenes-txt Github].
* Epithelial and mesenchymal gene signatures. See [https://bioconductor.org/packages/release/bioc/vignettes/singscore/inst/doc/singscore.html#22_Sample_scoring_with_a_reduced_number_of_measurements here]. I save the gene lists in [https://gist.github.com/arraytools/7e175795ec62ec940e0fcee8795dcce2#file-emgenes-txt Github].
** 'Table S1A. Generic EMT signature for '''tumour''' ' is really more interesting since Epi is more tumor related. The first 145 are epi, the rest (315-145=170) are mes.
** Epithelial (170 genes), mesenchymal gene signatures  218-170=48 genes). Tan et al. 2014 and can be found in the 'Table S1B. Generic EMT signature for '''cell line''' ’.

Revision as of 12:49, 9 May 2024

Basic

https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis

Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states

Two categories of GSEA procedures:

  • Competitive: compare genes in the test set relative to all other genes.
  • Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by Jiang & Gentleman Bioinformatics 2007

See also BRB-ArrayTools -> GSEA.

Hypergeometric test

  • Suppose you have a bag with 100 balls, 20 of which are red and the rest are blue. You draw 10 balls from the bag without replacement. Let’s say 4 of these balls are red. You want to know if drawing 4 red balls out of 10 is significantly different from random chance.
    Here’s how you would set up the hypergeometric test:
    • Define the Universe (N): The total number of balls, which is 100.
    • Define the Gene Set (M): The total number of red balls, which is 20.
    • Define the Number of Successes (DE genes) in the Universe (n): The number of balls you drew, which is 10.
    • Define the Number of Successes (DE genes) in the Gene Set (x): The number of red balls you drew, which is 4.
    The formula for the hypergeometric test is: [math]\displaystyle{ P(X=x) = \frac{C(M, x) C(N-M, n-x)}{C(N, n)} = \frac{C(20,4) C(80,6)}{C(100, 10)} }[/math]
        N=100
       +-----------------------------+
       |   M=20              n=10    |
       | (# in a pathway)  (# of DE) | 
       |  +-------+--------+-------+ |         
       |  |       |  x=4   |       | |
       |  |       | common |       | |    
       |  +-------+--------+-------+ |
       +-----------------------------+
    
  • In the context of gene set enrichment analysis,
    • Define the Universe (N): The total number of genes that you have profiled in your experiment. Let’s say you have profiled 20,000 genes.
    • Define the Gene Set (M): The total number of genes known to be involved in a specific biological pathway. For example, suppose there are 200 genes known to be involved in the “cell cycle” pathway.
    • Define the Number of Successes (DE genes) in the Universe (n): The number of genes that are differentially expressed in your experiment. Let’s say you found 2,000 genes to be differentially expressed.
    • Define the Number of Successes (DE genes) in the Gene Set (x): The number of differentially expressed genes that are also in your gene set. Suppose 50 of the 2,000 differentially expressed genes are involved in the “cell cycle” pathway.
        N=20,000
       +-----------------------------+
       |   M=200             n=2000  |
       | (# in a pathway)  (# of DE) | 
       |  +-------+--------+-------+ |         
       |  |       |  x=50  |       | |
       |  |       | common |       | |    
       |  +-------+--------+-------+ |
       +-----------------------------+
    

    We calculate the tail probability to report the p-value; [math]\displaystyle{ P(X \geq 50) = P(X=50) + \cdots + P(X=min(M, n)) }[/math]. If this p-value is below a certain threshold (commonly 0.05), we reject the null hypothesis and conclude that the pathway is significantly enriched among the differentially expressed genes. It’s also important to remember that statistical significance does not always imply biological significance, and further validation is often required.</math>

Interpretation

  • Gene Set Enrichment Analysis (GSEA) takes an alternative approach : it focuses on cumulative changes in expression of multiple genes as a group (belonging to a same gene-set/pahtway), which shifts the focus from individual genes to groups of genes. See this.
  • XXX class/subtype is associated with the OOO gene set(s) (GSVA vignette)
  • XXX subtype (of samples) is characterized by the expression of OOO markers, thus we expect it to correlate with the OOO2 gene set (GSVA vignette)
  • The XXX subtype (of samples) shows high expression of OOO genes, thus the OOO gene set is highly enriched for this group (GSVA vignette). OR if we find a gene set is enriched in XXX subtype (e.g. sensitive models), it means genes in that gene set are more highly expressed in the XXX subtype samples compared to the YYY subtype samples.
  • Negative ES Interpretation.
    • A positive enrichment score should always reflect enrichment on the positive side of the zero cross (although not necessarily all genes on the positive side) and be enrichment in whichever phenotype was selected to be first in the comparison in the Phenotype selection dialogue. And vice versa, with a negative enrichment score reflecting enrichment in the genes on the negative side of the ranked list.
    • GSEA is known to have some issues with highly skewed gene distributions, but that shouldn't affect the raw enrichment scores, just NES and significance calculations when GSEA runs in to a condition where its only sampled from one side of the distribution.
  • Enrichment Score Interpretation
    • If, for example, you provide a gene list ranked by a combination of fold change and p-value (e.g., sign(FC) * log10(pvalue)), then the positive scores are associated with upregulated genes and negative scores are associated with downregulated genes.
  • fgsea paper
    • The more positive is the value of sr(p) the more enriched the gene set is in the positively-regulated genes (with Si > 0). Accordingly, negative sr(p) corresponds to enrichment in the negatively regulated genes.
  • The opposite of "enriched" is "depleted". See Wikipedia.

Calculation

  • Subramanian paper
  • https://youtu.be/bT00oJh2x_4
  • *pathwaycommons.org
  • Compute cumulative sum over ranked genes. MD Anderson lecture
    • Increase sum when gene in set, decrease it otherwise. That is, +1/-1 weights in cumulative sum is used to represent whether genes are in the interested gene set.
    • Magnitude of increment depends on correlation of gene with phenotype.
    • Record the maximum deviation from zero as the enrichment score

FDR cutoff

Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?

Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results.

GSEAtraining

https://jokergoo.github.io/GSEAtraining/

piano

edgeR

Over-representation analysis. ?goana and ?kegga. See UserGuide 2.14 Gene ontology (GO) and pathway analysis.

fgsea

fgsea package and download stat

Are fgsea and Broad Institute GSEA equivalent

Are fgsea and Broad Institute GSEA equivalent?

Examples

  • vignette,
  • plotEnrichment() for a single pathway (including the output plot). No need to run fgsea(). Source code. Enrichment score (ES) on the plot is calculated by calcGseaStat()$res. The ES value is the same as the one shown in plotGseaTable() though plotEnrichment does not return it.
  • plotGseaTable() for a bunch of selected pathways (including the output plot).
  • Basic usage
    library(fgsea)
    library(ggplot2)
    data(examplePathways) # a list of length 1457 (pathways)
    data(exampleRanks)    # a vector of length 12000 (genes)
    set.seed(42)
    fgseaRes <- fgsea(pathways = examplePathways, 
                       stats    = exampleRanks,
                       minSize  = 4, # default minSize=1, maxSize=Inf
                       maxSize  =10) 
    # I used a very small maxSize in order to see details later
    # So the results here can't be compared with the default
    class(fgseaRes)
    # [1] "data.table" "data.frame"
    dim(fgseaRes)
    # [1] 386   8
    fgseaRes[1:2, ]
    #                                                          pathway      pval      padj
    # 1:                        1368092_Rora_activates_gene_expression 0.8630705 0.9386018
    # 2: 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression 0.4312268 0.8159487
    #       log2err         ES        NES size              leadingEdge
    # 1: 0.05412006 -0.3087414 -0.6545067    5 11865,12753,328572,20787
    # 2: 0.08312913  0.4209054  1.0360288    9        20893,59027,19883
    
    sapply(examplePathways, length) |> range()
    # [1]    1 2366
    # Question: minSize, maxSize mean 'matched' genes? Can't verify
    sum(sapply(examplePathways, length) < 3)
    # [1] 104
    sum(sapply(examplePathways, length) > 10)
    # [1] 939
    1457 - 104 - 939
    # [1] 414
    examplePathways2 <- lapply(examplePathways, function(x) x[x %in% names(exampleRanks)])
    sapply(examplePathways2, length) |> range()
    # [1]   0 968
    sum(sapply(examplePathways2, length) < 3)
    # [1] 232
    sum(sapply(examplePathways2, length) > 10)
    # [1] 730
    1457 - 232 - 730
    # [1] 495
    range(fgseaRes$ES)
    # [1] -0.8442416  0.9488163
    range(fgseaRes$NES)
    # [1] -2.020695  2.075729
    order(fgseaRes$ES)[1:5]
    # [1]  75 289 320 249 312
    order(fgseaRes$NES)[1:5]
    # [1] 289  75 102 320 200
    
    # choose the top gene set in order to zoom in & see the detail
    head(fgseaRes[order(pval), ], 1)
    #                            pathway         pval         padj   log2err        ES
    # 1: 5991601_Phosphorylation_of_Emi1 2.461461e-07 9.501241e-05 0.6749629 0.9472236
    #         NES size                           leadingEdge
    # 1: 2.082967    6 107995,12534,18817,67141,268697,56371
    
    plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks)
    # 12000 total genes though only 6 genes are matched in this top gene set. 
    debug(plotEnrichment)
    # Browse[2]> toPlot
    #        x            y
    # 1      0  0.000000000
    # 2     47 -0.003918626
    # 3     48  0.308356695
    # 4    322  0.285511939
    # 5    323  0.452415897
    # 6    407  0.445412396
    # 7    408  0.593677191
    # 8    447  0.590425565
    # 9    448  0.730277162
    # 10   617  0.716186784
    # 11   618  0.833696389
    # 12   638  0.832028888
    # 13   639  0.947223612
    # 14 12001  0.000000000
    examplePathways[["5991601_Phosphorylation_of_Emi1"]]
    # [1] "12534"     "18817"     "56371"     "67141"     "107995"    "268697"   
    # [7] "434175"    "102643276"
    match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks)) 
    # [1] 11678 11593 11362 11553 11953 11383    NA    NA
    # Since there are 12000 total genes and exampleRanks have been sorted, 
    # these matched genes have very high ranks.
    # See the plot below.
    
    length(exampleRanks)
    # [1] 12000
    exampleRanks[1:10]
    #    170942    109711     18124     12775     72148     16010     11931 
    # -63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626 -29.42328 
    #     13849    241230    665113 
    # -28.83475 -28.65111 -27.81583 
    sum(exampleRanks < 0)
    # [1] 6469
    sum(exampleRanks > 0)
    # [1] 5531
    plot(exampleRanks)

    FgseaPlot.png ExampleRanks.png

    There are 6*2 points (excluding the 0 value at the starting and end) in the figure.

  • The range of NES is not [-1, 1] but ES seems to be in [-1, 1]. The example sort pathways by NES but it can be adapted to sort by pval. Source code. Also ES and NES are not in the same order (see below) though we cannot tell it from the small plot. According to the manual,
    • ES – enrichment score, same as in Broad GSEA implementation;
    • NES – enrichment score normalized to mean enrichment of random samples of the same size (the number of genes in the gene set).
    • Hπow to interpret NES(normalized enrichment score)?
    • Negative normalized Enrichment Score (NES) in GSEA analysis
    • question about GSEA about the sign of ES/NES. "Does this means in positive NES upregulated genes are enriched while in negative NES DN genes are enriched?"π Step 1. rank all genes in your list (L) based on "how well they divide the conditions". Step 2, you want to see whether the genes present in a gene set (S) are at the top or at the bottom of your list...or if they are just spread around randomly. A positive NES will indicate that genes in set S will be mostly represented at the top of your list L. a negative NES will indicate that the genes in the set S will be mostly at the bottom of your list L.
    with(fgseaRes, plot(abs(ES), pval))  # relationship is not consistent for a lot of pathways
    with(fgseaRes, plot(abs(NES), pval)) # relationship is more consistent
    with(fgseaRes, plot(ES, NES))
    
    head(order(fgseaRes$pval))
    # [1] 121 112 289  68 187 188
    head(rev(order(abs(fgseaRes$NES))))
    # [1] 121 289  68 188 187  22
    

    Fgsea3plots.png

  • The computation speed is FAST!
    system.time(invisible(fgsea(examplePathways, exampleRanks)))
    #    user  system elapsed 
    #  31.162  38.427  15.800 
    

plotEnrichment()

  • Understand plotEnrichment() function. Plot shape (concave or convex) requires the stats of genes from both in- and out- of the gene set. This affects the interpretation of the plot. The plot always starts with 0 and end with 0 in Y (enrichment score). plotEnrichment() does not return enrichment scores even it calculates them for the plot.
    # Reduce the total number of genes
    i <- match(examplePathways[["5991601_Phosphorylation_of_Emi1"]], names(exampleRanks))
    i <- na.omit(i)   # Exclude 2 genes in the pathways but not in the data
    
    if (FALSE) {
      # Total genes = pathway genes will not work?
      plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[i])
      # Error: GSEA statistic is not defined when all genes are selected
    }
    
    # PS values in 'exampleRanks' object are sorted from the smallest to the largest
    
    # Case 1: include 1 down-regulated gene for our total genes
    plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[c(1,i)])
    # saved as fgseaPlotSmall.png
    exampleRanks[c(1,i)]
    #    170942     12534     18817     56371     67141    107995    268697 
    # -63.33703  15.16265  13.46935  10.46504  12.70504  28.36914  10.67534
    #
    # Rank and add +1/-1,
    # 28.36 15.16 13.46 12.70 10.67 10.46 -63.33
    #    +1    +1    +1    +1    +1    +1     -1
    debug(plotEnrichment)
    # Browse[2]> toPlot
    #    x         y
    # 1  0 0.0000000
    # 2  0 0.0000000
    # 3  1 0.3122753
    # 4  1 0.3122753
    # 5  2 0.4791793
    # 6  2 0.4791793
    # 7  3 0.6274441
    # 8  3 0.6274441
    # 9  4 0.7672957
    # 10 4 0.7672957
    # 11 5 0.8848053
    # 12 5 0.8848053
    # 13 6 1.0000000
    # 14 8 0.0000000
    x <-c(28.36, 15.16, 13.46, 12.70, 10.67, 10.46, -63.33)
    plot(0:7, c(0, cumsum(x)), type = "b")
    
    # Case 2: include another one up-regulated gene (so all are up-reg) for our total genes
    # However, the largest gene (added) is not in the gene set.
    # It will make the plot starting from a negative value (ignore 0 from the very left)
    plotEnrichment(examplePathways[["5991601_Phosphorylation_of_Emi1"]], exampleRanks[c(i, 12000)])
    # saved as fgseaPlotSmall2.png
    exampleRanks[c(i, 12000)]
    #    12534    18817    56371    67141   107995   268697    80876 
    # 15.16265 13.46935 10.46504 12.70504 28.36914 10.67534 53.28400
    #
    # Rank and add +1/-1
    # 53.28 28.36  15.16 13.46 12.70 10.67 10.46
    #    -1    +1     +1    +1    +1    +1    +1

    For figure 1 (case 1), it has 6 points. Figure 2 is the one I generate using plot() and exampleRanks. For figure 3 (case 2), it has 7 points (excluding the 0 value at the starting and end). The curve goes down first because the gene has the largest rank (statistic) and is not in the pathway. So the x-axis location is determined by the rank/statistics and going up or down on the y-axis is determined by whether the gene is in the pathway or not.

    FgseaPlotSmall.png, FgseaPlotSmallm.png, FgseaPlotSmall2.png

  • Shifting exampleRanks values will change the enrichment scores because the directions of genes' statistics change?
    plotEnrichment(examplePathways[[1]],
                    exampleRanks) + labs(title="Programmed Cell Death") 
                # like sin() with one period
    range(exampleRanks)
    # [1] -63.33703  53.28400
    plotEnrichment(examplePathways[[1]],
                    exampleRanks+64) + labs(title="Programmed Cell Death")
                # like a time series plot
    plotEnrichment(examplePathways[[1]],
                    exampleRanks-54) + labs(title="Programmed Cell Death")
                # same problem if we shift exampleRanks to all negative 
    

    including NE, NES, pval, etc. So it is not purely rank-based.

    set.seed(42)
    fgseaRes2 <- fgsea(pathways = examplePathways, 
                       stats    = exampleRanks+2, # OR 2*(exampleRanks+2)
                       minSize  = 4,
                       maxSize  =10) 
    identical(fgseaRes, fgseaRes2)
    # [1] FALSE
    R> fgseaRes[1:3, c("ES", "NES", "pval")]
               ES        NES      pval
    1: -0.3087414 -0.6545067 0.8630705
    2:  0.4209054  1.0360288 0.4312268
    3: -0.4065043 -0.8617561 0.6431535
    R> fgseaRes2[1:3, c("ES", "NES", "pval")]
               ES        NES      pval
    1:  0.4554877  0.8284722 0.7201767
    2:  0.5520914  1.1604366 0.2833333
    3: -0.2908641 -0.5971951 0.9442724
    

    However, scaling won't change anything (because scaling does not change the directions?).

    set.seed(42)
    fgseaRes3 <- fgsea(pathways = examplePathways, 
                       stats    = 2*exampleRanks,
                       minSize  = 4,
                       maxSize  =10)
    identical(fgseaRes, fgseaRes3)
    # [1] TRUE
    

plotGseaTable()

  • Obtain enrichment scores by calling fgsea(); i.e. fgseaMultilevel().
    examplePathways[["5991601_Phosphorylation_of_Emi1"]] |> length()
    # [1] 8
    set.seed(1)
    fgsea2 <- fgsea(pathways = examplePathways["5991601_Phosphorylation_of_Emi1"], 
                    stats    = exampleRanks)
    tibble::as_tibble(fgsea2)
    # A tibble: 1 × 8
    #   pathway              pval    padj log2err    ES   NES  size leadi…¹
    #   <chr>               <dbl>   <dbl>   <dbl> <dbl> <dbl> <int> <list> 
    # 1 5991601_Phosphor… 9.39e-7 9.39e-7   0.659 0.947  2.05     6 <chr>  
    # … with abbreviated variable name ¹​leadingEdge
    
    plotGseaTable(examplePathways["5991601_Phosphorylation_of_Emi1"], 
                  exampleRanks, fgsea2)
    

    FgseaPlotTop.png

    The signs of NES/ES can be confirmed from the gene rank plot. Below is a plot from 10 enriched pathways

    GseaTable.png

    and a plot from 10 non-enriched pathways

    GseaTable2.png

Subramanian algorithm

In the plot, (x-axis) genes are sorted by their expression across all samples. Y-axis represents enrichment score. See HOW TO PERFORM GSEA - A tutorial on gene set enrichment analysis for RNA-seq. Bars represents genes being in the gene set. Genes on the LHS/RHS are more highly expressed in the experimental/control group. Small p means this gene set is enriched in this experimental sample.

Kolmogorov Smirnov test statistic

  • In Gene Set Enrichment Analysis (GSEA), the Kolmogorov-Smirnov test is used to calculate an enrichment score for each gene-set (pathway).
  • This score reflects how often members (genes) included in that gene-set (pathway) occur at the top or bottom of the ranked data set.
  • The ranking is based on a measure of differential gene expression.
  • The enrichment score is calculated using the Kolmogorov-Smirnov statistic, which quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution, or between the empirical distribution functions of two samples.
  • In GSEA, the empirical distribution function represents the ranked list of genes, and the reference distribution represents the gene-set (pathway) being tested for enrichment.
  • The enrichment score is then used to determine whether the gene-set (pathway) is significantly enriched in the ranked list of genes.
  • An example
    • Let’s say we have a ranked list of 10 genes based on their differential expression between two conditions, with the most highly expressed genes at the top and the least expressed genes at the bottom. The list might look like this: GeneA, GeneB, GeneC, GeneD, GeneE, GeneF, GeneG, GeneH, GeneI, GeneJ.
    • Now let’s say we have a gene-set (pathway) that contains three genes: GeneB, GeneD, and GeneH. We want to determine if this gene-set is significantly enriched in our ranked list of genes.
    • To do this, we can use the Kolmogorov-Smirnov test to calculate an enrichment score for the gene-set. The test will calculate the maximum distance between the empirical distribution function of our ranked list of genes and the cumulative distribution function of our gene-set.
    • The empirical distribution function of our ranked list of genes is a step function that increases by 1/10 at each gene in the list. The cumulative distribution function of our gene-set is also a step function that increases by 1/3 at each gene in the set.
    • The Kolmogorov-Smirnov statistic will then calculate the maximum vertical distance between these two functions. If this distance is large, it suggests that the members of our gene-set (GeneB, GeneD, and GeneH) are occurring more often at the top or bottom of our ranked list of genes than would be expected by chance. This would indicate that our gene-set is significantly enriched in our ranked list of genes.
  • To determine if the distance is statistically significant, we can compare the calculated KS statistic to a critical value obtained from a reference distribution. If the KS statistic is greater than the critical value, we can conclude that the gene-set is significantly enriched in the ranked list of genes.
  • What is the reference distribution of KS statistic? In the context of Gene Set Enrichment Analysis (GSEA), the reference distribution for the Kolmogorov-Smirnov (KS) statistic is obtained by permuting the sample labels (resampling genes actually) and recalculating the enrichment score for the gene-set many times. This creates a null distribution of enrichment scores that represents what we would expect to see if the gene-set was not enriched in the ranked list of genes.
    • The critical value for the KS statistic is then obtained from this null distribution, typically by taking a high percentile (e.g., the 95th percentile) of the distribution. If the observed KS statistic is greater than this critical value, we can conclude that the gene-set is significantly enriched in the ranked list of genes.
  • Why do we need to rank genes first? The ranking of genes is also used in the calculation of the enrichment score for a gene-set. The enrichment score reflects how often members (genes) included in that gene-set (pathway) occur at the top or bottom of the ranked list of genes. If the members of a gene-set occur more often at the top or bottom of the ranked list of genes than would be expected by chance, it suggests that the gene-set is significantly enriched in the ranked list of genes.
  • R
# Define a ranked list of genes based on their differential expression
# between two conditions, with the most highly expressed genes at the top
# and the least expressed genes at the bottom
ranked_genes <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE",
                  "GeneF", "GeneG", "GeneH", "GeneI", "GeneJ")

# Define a gene-set (pathway) that contains three genes
gene_set <- c("GeneB", "GeneD", "GeneH")

# Calculate the enrichment score for the gene-set using the KS statistic
enrichment_score <- max(ks.test(ranked_genes, gene_set)$statistic)

# Permute the sample labels and recalculate the enrichment score many times
# to create a null distribution of enrichment scores
set.seed(123)
n_permutations <- 1000
null_distribution <- replicate(n_permutations, {
  permuted_genes <- sample(ranked_genes)
  ks.test(permuted_genes, gene_set)$statistic
})

# Calculate the critical value for the KS statistic from the null distribution
critical_value <- quantile(null_distribution, 0.95)

# Determine if the gene-set is significantly enriched in the ranked list of genes
if (enrichment_score > critical_value) {
  print("The gene-set is significantly enriched in the ranked list of genes.")
} else {
  print("The gene-set is not significantly enriched in the ranked list of genes.")
}

Rafael A Irizarry

Gene set enrichment analysis made simple

microRNA/miRNA

  • https://en.wikipedia.org/wiki/MicroRNA 小分子核糖核酸
  • MicroRNAs (miRNAs) are a class of small, non-coding RNA molecules (無法進一步轉譯成蛋白質的RNA) that play a role in the regulation of gene expression. They are typically 21-23 nucleotides in length and are found in plants, animals, and some viruses. miRNAs function by binding to complementary sequences in messenger RNA (mRNA) molecules, leading to the silencing of the target gene through the degradation of the mRNA or the inhibition of its translation into protein.
    • The expression of microRNAs (miRNAs) and their target mRNAs are often inversely correlated.
    • This interaction results in the reduction of mRNA and/or protein levels.
    • Therefore, when a miRNA is highly expressed, the level of its target mRNA is usually low, and vice versa. See Joint analysis of miRNA and mRNA expression data, miRNA Regulated Gene Expression in SARS-CoV-2 Research.
    • MicroRNAs (miRNAs) are small non-coding RNAs that play a crucial role in regulating gene expression. They typically do this by binding to messenger RNAs (mRNAs) and preventing them from being translated into proteins. If a pathway undergoes less miRNA inhibition, it means that the miRNAs are not binding to the mRNAs as much, which could allow more protein to be produced.
    • It is also important to note that one miRNA can target multiple mRNAs, and one mRNA can be targeted by multiple miRNAs.
    • In the context of miRNAs, the term “target” refers to the specific mRNA molecules that a miRNA interacts with. When a miRNA binds to its target mRNA, it can inhibit the translation of the mRNA into protein or lead to the degradation of the mRNA. This is why the expression levels of a miRNA and its target mRNA are often inversely correlated. So, when I mentioned “target” in the last sentence, I was referring to the mRNA molecules that are regulated by the miRNA.
  • A target gene is a gene whose expression is regulated by a specific regulatory molecule, such as a transcription factor or a microRNA. The regulatory molecule binds to specific sequences in the DNA or RNA of the target gene, leading to changes in the expression of that gene. For example, a transcription factor may bind to the promoter region of a target gene, leading to an increase or decrease in the transcription of that gene into messenger RNA (mRNA). Similarly, a microRNA may bind to the mRNA of a target gene, leading to the degradation of the mRNA or the inhibition of its translation into protein.
  • MiRNAs are involved in a wide range of cellular processes, including cell growth, differentiation, development, and apoptosis, and have been implicated in many diseases

RBiomirGS

  • RBiomirGS R package which is based on logistic regressions with binary predictor variable (two groups).
    # R 4.3.3
    docker run -it --rm --user rstudio bioconductor/bioconductor_docker:RELEASE_3_18 R
    
    devtools::install_github("jzhangc/git_RBiomiRGS/RBiomirGS", ref = "74a567a", 
            repos = BiocManager::repositories())  # version 0.2.19, as of 4/17/2024
    require("RBiomirGS")
    
    # two input files
    #   liver.csv
    #   kegg.v5.2.entrez.gmt, 186 pathways
    raw <- read.csv(file = "liver.csv", header = TRUE, stringsAsFactors = FALSE)
    dim(raw)
    # [1] 85  3
    raw[1:3,]
    #               miRNA   FC pvalue
    # 1 mmu-miR-let-7f-5p 0.58  0.034
    # 2     mmu-miR-1a-5p 0.32  0.037
    # 3     mmu-miR-1b-5p 0.50  0.001
    
    # Target mRNA mapping
    # creating 95 csv files; e.g. "mmu-miR-1a-5p_mRNA.csv"
    rbiomirgs_mrnascan(objTitle = "mmu_liver_predicted", 
                       mir = raw$miRNA,
                       sp = "mmu", addhsaEntrez = TRUE, queryType = "predicted",
                       parallelComputing = TRUE, clusterType = "PSOCK")
    # > head -3 mmu-miR-10b-5p_mRNA.csv
    # "database","mature_mirna_acc","mature_mirna_id","target_symbol","target_entrez","target_ensembl","score"
    # "diana_microt","MIMAT0000208","mmu-miR-10b-5p","Bdnf","12064","ENSMUSG00000048482","1"
    # "diana_microt","MIMAT0000208","mmu-miR-10b-5p","Bcl2l11","12125","ENSMUSG00000027381","1"
    
    # GSEA
    # generating 3 files:
    #   mirnascore.csv
    #   mrnascore.csv
    #   mirna_mrna_iwls_GS.csv
    rbiomirgs_logistic(objTitle = "mirna_mrna_iwls", mirna_DE = raw,
                       var_mirnaName = "miRNA", 
                       var_mirnaFC = "FC", 
                       var_mirnaP = "pvalue",
                       mrnalist = mmu_liver_predicted_mrna_hsa_entrez_list, 
                       mrna_Weight = NULL, 
                       gs_file = "kegg.v5.2.entrez.gmt", 
                       optim_method = "IWLS", 
                       p.adj ="fdr", 
                       parallelComputing = FALSE, clusterType = "PSOCK") 
    > wc -l mirnascore.csv                                             
          86 mirnascore.csv
    > head -3 mirnascore.csv                                                                                               13:46:47
    "miRNA","S_mirna"
    "mmu-miR-let-7f-5p",1.46852108295774
    "mmu-miR-1a-5p",1.43179827593301
    
    > wc -l mrnascore.csv                                               
       15297 mrnascore.csv
    > head -3 mrnascore.csv                                                                                                13:46:53
    "EntrezID","S_mrna"
    "1",-3.20388726317687
    "10",-11.7197772358718
    
    > wc -l mirna_mrna_iwls_GS.csv                                                                                 
         187 mirna_mrna_iwls_GS.csv
    > head -3 mirna_mrna_iwls_GS.csv                                                          
    # "GS","converged","loss","gene.tested","coef","std.err","t.value","p.value","adj.p.val"
    # "KEGG_GLYCOLYSIS_GLUCONEOGENESIS","Y",0.0231448486023595,54,0.103936359331543,...
    
    # Histogram of all gene sets
    # y-axis = model coefficients
    # x-axis = gene set
    rbiomirgs_bar(gsadfm = mirna_mrna_iwls_GS, n = "all",
                  y.rightside = TRUE, yTxtSize = 8, plotWidth = 300, plotHeight = 200,
                  xLabel = "Gene set", yLabel = "model coefficient")
    
    # Histogram of top 50 ranked KEGG pathways based on absolute value of the coefficient 
    # (with significant adjusted p values < 0.05)
    # y-axis = gene set
    # x-axis = log odds ratio
    rbiomirgs_bar(gsadfm = mirna_mrna_iwls_GS, gs.name = TRUE, 
                        n = 50, y.rightside = FALSE, yTxtSize = 8, 
                        plotWidth = 200, plotHeight = 300, 
                        xLabel = "Log odss ratio", signif_only = TRUE)
    
    # Volcano plot of all gene sets
    # y-axis = -log10(p-value). The default cutoff (q_value) is .05. -log10(.05)=1.30.
    # x-axis = model coefficient
    rbiomirgs_volcano(gsadfm = mirna_mrna_iwls_GS, topgsLabel = TRUE, 
                      n =15, gsLabelSize = 3, sigColour = "blue", 
                      plotWidth = 250, plotHeight = 220, 
                      xLabel = "model coefficient")
  • Rbiomirgs barall.png, Rbiomirgs bar.png, Rbiomirgs volcano.png
  • miRNAs are small molecules that can bind to mRNAs and prevent them from being translated into proteins. This process is known as ‘inhibition’.
  • Interpretation
    • See Logistic regression
    • If the coefficient is positive, miRNA inhibition on target mRNAs might be lifted/reduced, thereby leading to less suppression on the gene set of interest in the experimental group.
      • When the statement mentions a ‘positive coefficient’, it’s referring to a situation where an increase in miRNA levels is associated with an increase in the levels of target mRNAs. This is unusual because we’d normally expect more miRNA to lead to less mRNA, due to the inhibition process I mentioned earlier.
      • So, if the coefficient is positive, it might mean that the miRNA’s inhibitory effect on the mRNA is being lifted or reduced. As a result, there’s less suppression of the mRNA, which could lead to more of its associated protein being produced.
      • If these genes are the ones producing the target mRNAs, then less suppression by the miRNAs would lead to increased activity of these genes in the experimental group.
    • Furthermore, with a positive coefficient, a unit increase in Smrna results in an increased odds ratio of a gene belonging to the gene set of interest.
    • It needs to be clarified that a positive model coefficient for a gene set means that the gene set of interest might be under more miRNA-dependent inhibition in the control group, as opposed to being activated under the experimental condition.

MIENTURNET

multiGSEA

multiGSEA - Combining GSEA-based pathway enrichment with multi omics data integration.

Plot

Single sample

singscore

  • singscore
    • Paper Single sample scoring of molecular phenotypes. Our approach does not depend upon background samples and scores are thus stable regardless of the composition and number of samples being scored. In contrast, scores obtained by GSVA, z-score, PLAGE and ssGSEA can be unstable when less data are available (NS < 25). Simulation was conducted.
    • SingscoreAMLMutations
    • TotalScore = UpScore + DownScore
    • centerScore and knownDirection(=TRUE by default) parameters used in generateNull() and simpleScore() functions.
    • On the paper, epithelial and mesenchymal gene sets are up-regulated and TGFβ-EMT signature is bidirectional.
    • The pvalue calculation seems wrong. For example, the first sample D_Ctrl_R1 returns p=0.996 but it should be 2*(1-.996) according to the null distribution plot. However, the 1st sample is a control sample so we don't expect it has a large score; its p-value should be large?
  • Gene-set enrichment analysis workshop. Also use ExperimentHub, SummarizedExperiment, emtdata, GSEABase, msigdb, edgeR, limma, vissE, igraph & patchwork packages.
    • Functions in the GSEABase package help with reading, parsing and processing the signatures.

ssGSEA & GSVA package

  • https://github.com/broadinstitute/ssGSEA2.0
  • ssGSEAProjection (v9.1.2). Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample.
  • single sample GSEA (ssGSEA) from http://baderlab.org/
  • How single sample GSEA works
  • Data formats gct (expression data format), gmt (gene set database format).
  • ssGSEA produced by Broad Gene Pattern is different from one produced by gsva package.
    • Actually they are not compatible. See a plot in singscore paper.
    • Gene Pattern also compute p-values testing the Spearman correlation
  • GSVA and SSGSEA for RNA-Seq TPM data.
    • Generally, negative enrichment values imply down-regulation of a signature / pathway; whereas, positive values imply up-regulation.
    • The idea is to then conduct a differential signature / pathway analysis (using, for example, limma) so that you can have, in addition to differentially expressed genes, differentially expressed signatures / pathways.
  • GSVA vignette. In GSEA Subramanian et al. (2005) it is also observed that the empirical null distribution obtained by permuting phenotypes is bimodal and, for this reason, significance is determined independently using the positive and negative sides of the null distribution.
  • Tips:
  • 纯R代码实现ssGSEA算法评估肿瘤免疫浸润程度. The pheatmap package was used to draw the heatmap. Original paper.
    • fpkm expression data was sorted by gene name & median values. Duplicated genes are removed. The data is transformed by log2(x+1) before sent to gsva()
    • NES scores are scaled and truncated to [-2, 2]. The scores are further scaled to have the range [0,1] before sending it to the heatmap function
  • gsva() from the GSVA package has an option to compute ssGSEA. 单样本基因集富集分析 --- ssGSEA (it includes the formula from Barbie's paper). The output of gsva() is a matrix of ES (# gene sets x # samples). It does not produce plots nor running the permutation tests. Notice the option ssgsea.norm. Note ssgsea.norm = TRUE (default) option will scale ES by the absolute difference of the max and min ES; Unscaled ES / (max(unscaled ES) - min(unscaled ES)). So the scaled ES values depends on the included samples. But it seems the impact of the included samples is small from some real data. See the source code on github and on how to debug an S4 function.
    library(GSVA)
    library(heatmaply)
    p <- 10 ## number of genes
    n <- 30 ## number of samples
    nGrp1 <- 15 ## number of samples in group 1
    nGrp2 <- n - nGrp1 ## number of samples in group 2
    
    ## consider three disjoint gene sets
    geneSets <- list(set1=paste("g", 1:3, sep=""),
                     set2=paste("g", 4:6, sep=""),
                     set3=paste("g", 7:10, sep=""))
    
    ## sample data from a normal distribution with mean 0 and st.dev. 1
    set.seed(1234)
    y <- matrix(rnorm(n*p), nrow=p, ncol=n,
                dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))
    
    ## genes in set1 are expressed at higher levels in the last 'nGrp1+1' to 'n' samples
    y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 2
    
    gsva_es <- gsva(y, geneSets, method="ssgsea")
    dim(gsva_es) #  3 x 30
    hist(gsva_es) # bi-modal
    range(gsva_es)
    # [1] -0.4390651  0.5609349
    
    ## build design matrix
    design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))
    ## fit the same linear model now to the GSVA enrichment scores
    fit <- lmFit(gsva_es, design)
    
    ## estimate moderated t-statistics
    fit <- eBayes(fit)
    
    ## set1 is differentially expressed
    topTable(fit, coef="sampleGroup2vs1")
    #           logFC     AveExpr         t      P.Value    adj.P.Val         B
    # set1  0.5045008 0.272674410  8.065096 4.493289e-12 1.347987e-11 17.067380
    # set2 -0.1474396 0.029578749 -2.315576 2.301957e-02 3.452935e-02 -4.461466
    # set3 -0.1266808 0.001380826 -2.060323 4.246231e-02 4.246231e-02 -4.992049
    
    heatmaply(gsva_es)   # easy to see a pattern
                         # samples' clusters are not perfect
    heatmaply(gsva_es, scale = "none")  # 'scale' is not working?
    
    heatmaply(y, Colv = F, Rowv= F, scale = "none") # not easy to see a pattern
    
  • If we set ssgsea.norm=FALSE, do we get the same results when we compute ssgsea using a subset of samples?
    gsva1 <- gsva(y, geneSets, method="ssgsea", ssgsea.norm = FALSE)
    gsva2 <- gsva(y[, 1:2], geneSets, method="ssgsea", ssgsea.norm = FALSE)
    range(abs(gsva1[, 1:2] - gsva2)) # [1] 0 0
    
  • Verify ssgsea.norm = TRUE option.
    gsva_es <- gsva(y, geneSets, method="ssgsea")
    gsva1 <- gsva(y, geneSets, method="ssgsea", ssgsea.norm = FALSE)
    gsva2 <- gsva1 / abs(max(gsva1) - min(gsva1)) # abs(max(gsva1) - min(gsva1)) 8.9
    
    range(gsva_es)              # [1] -0.4390651  0.5609349
    range(abs(gsva_es - gsva2)) # [1] 0 0
    
  • Only ranks matters! If I replace the sample 1 gene expression values with the ranks, the ssgsea scores are not changed at all.
    y2 <- y
    y2[, 1] <- rank(y2[, 1])
    gsva2 <- gsva(y2, geneSets, method="ssgsea")
    gsva_es[, 1]
    #      set1       set2       set3 
    # 0.1927056  0.1699799 -0.1782934 
    gsva2[, 1]
    #      set1       set2       set3 
    # 0.1927056  0.1699799 -0.1782934 
    all.equal(gsva_es, gsva2)
    # [1] TRUE
    
    # How about the reverse ranking? No that will change everything.
    # The one with the smallest value was assigned one according to 
    # the definition of 'rank'
    y3[, 1] <- 11 - y2[, 1]
    gsva3 <- gsva(y3, geneSets, method="ssgsea")
    gsva3[, 1]
    #        set1         set2         set3 
    #-0.054991319 -0.009603802  0.191733634
    
    cbind(y[, 1], y2[, 1], y3[, 1])
    #           [,1] [,2] [,3]
    # g1  -1.2070657    2    9
    # g2   0.2774292    7    4
    # g3   1.0844412   10    1
    # g4  -2.3456977    1   10
    # g5   0.4291247    8    3
    # g6   0.5060559    9    2
    # g7  -0.5747400    4    7
    # g8  -0.5466319    6    5
    # g9  -0.5644520    5    6
    # g10 -0.8900378    3    8
    
  • Is mx.diff useful? No. GSVA with and without absolute scaling method. The mx.diff and abs.ranking only apply when method="gsva". The only parameters that apply when method="ssgsea" are tau and ssgsea.norm.
    gsva4 <- gsva(y, geneSets, method="ssgsea", mx.diff = FALSE)
    all.equal(gsva_es, gsva4)
    # [1] TRUE
    
  • A simple implementation of ssGSEA (single sample gene set enrichment analysis)
  • Use "ssgsea-gui.R". The first question is a folder containing input files GCT. The 2nd question is about gene set database in GMT format. This has to be very restrict. For example, "ptm.sig.db.all.uniprot.human.v1.9.0.gmt" and "ptm.sig.db.all.sitegrpid.human.v1.9.0.gmt" provided in github won't work with the example GCT file.
    setwd("~/github/ssGSEA2.0/")
    source("ssgsea-gui.R")
    # select a folder containing gct files; e.g. PI3K_pert_logP_n2x23936.gct 
    # select a gene set file; e.g. <ptm.sig.db.all.flanking.human.v1.8.1.gmt>
    

    A new folder (e.g. 2021-03-01) will be created under the same parent folder as the gct file folder.

    tree -L 1 ~/github/ssGSEA2.0/example/gct/2021-03-20/                         
    
    ├── PI3K_pert_logP_n2x23936_ssGSEA-combined.gct
    ├── PI3K_pert_logP_n2x23936_ssGSEA-fdr-pvalues.gct
    ├── PI3K_pert_logP_n2x23936_ssGSEA-pvalues.gct
    ├── PI3K_pert_logP_n2x23936_ssGSEA-scores.gct
    ├── PI3K_pert_logP_n2x23936_ssGSEA.RData
    ├── parameters.txt
    ├── rank-plots
    ├── run.log
    └── signature_gct
    
    tree ~/github/ssGSEA2.0/example/gct/2021-03-20/rank-plots | head -3 
    # 102 files. One file per matched gene set
    ├── DISEASE.PSP_Alzheime_2.pdf
    ├── DISEASE.PSP_breast_c_2.pdf
    
    tree ~/github/ssGSEA2.0/example/gct/2021-03-20/signature_gct | head -3                    
    # 102 files. One file per matched gene set
    ├── DISEASE.PSP_Alzheimer.s_disease_n2x23.gct
    ├── DISEASE.PSP_breast_cancer_n2x14.gct
    

    Since the GCT file contains 2 samples (the last 2 columns), ssGSEA produces one rank plot for each gene set (with adjusted p-value & the plot could contain multiple samples). The ES scores are saved in <PI3K_pert_logP_n2x23936_ssGSEA-scores> and adjust p-values are saved in <PI3K_pert_logP_n2x23936_ssGSEA-fdr-pvalues>.

    SsGSEA.png

  • Some possible problems using ssGSEA to replace GSEA for DE analysis? See a toy example on Single Sample GSEA (ssGSEA) and dynamic range of expression. PS: the rank values table is wrong; they should be reversed.
  • In general terms, PLAGE and z-score are parametric and should perform well with close-to-Gaussian expression profiles, and ssGSEA and GSVA are non-parametric and more robust to departures of Gaussianity in gene expression data. See Method and kcdf arguments in gsva package
  • Some discussions from biostars.org. Find -> "ssgsea"
  • Some papers.
  • 【生信分析 3】教你看懂GSEA和ssGSEA分析结果. No groups/classes in the data (6:33). Output is a heatmap. Each value is computed sample by sample. Rows = gene set. Columns = (sorted by the 1st gene set) samples.
  • Gene set analysis: GSVA, Z-Score, and fold-changes

escape

escape - Easy single cell analysis platform for enrichment. Github.

sTAM

Benchmark

Toward a gold standard for benchmarking gene set enrichment analysis Geistlinger, 2021. GSEABenchmarkeR package in Bioconductor.

phantasus

PhenoExam

PhenoExam: gene set analyses through integration of different phenotype databa

Selected papers

Malacards

GSDA

Gene-set distance analysis (GSDA): a powerful tool for gene-set association analysis

Genekitr

https://github.com/GangLiLab/genekitr. twitter

MSigDB

https://www.gsea-msigdb.org/gsea/msigdb/

  • C1: Hallmark
  • C2: Curated gene sets including BioCarta, KEGG, Reactome
  • C5: Oncology

Hallmark

  • HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION/EMT from MsigDB
  • Epithelial and mesenchymal gene signatures. See here. I save the gene lists in Github.
    • 'Table S1A. Generic EMT signature for tumour ' is really more interesting since Epi is more tumor related. The first 145 are epi, the rest (315-145=170) are mes.
    • Epithelial (170 genes), mesenchymal gene signatures 218-170=48 genes). Tan et al. 2014 and can be found in the 'Table S1B. Generic EMT signature for cell line ’.