Tcga: Difference between revisions

From 太極
 
(16 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Resources =
* [https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/CC2024/TCGA/TCGA_download/ Accessing and Downloading TCGA Data]
* [https://www.cancer.gov/about-nci/organization/ccg/blog/2017/tcga-pancan-atlas TCGA to Complete its Final Analysis: the PanCanAtlas], [https://gdc.cancer.gov/about-data/publications/pancanatlas Data download] from gdc portal.
** [https://www.jianshu.com/p/8cf8503dc61c TCGA pan-cancer analysis], [https://www.jianshu.com/p/329795604970 Literature reading: The Cancer Genome Atlas Pan-Cancer analysis project]
Line 7: Line 8:
* [https://www.jianshu.com/p/3c0f74e85825 TCGA cancer names in Chinese and English]
* https://github.com/cBioPortal/cbioportal
* [https://bioconductor.org/packages/release/workflows/html/TCGAWorkflow.html TCGAWorkflow] package. Rank 7/29. It suggests
** [http://bioconductor.org/packages/release/data/experiment/html/TCGAbiolinksGUI.data.html TCGAbiolinksGUI.data] (rank 2/416) Why?
** [[#TCGAbiolinks|TCGAbiolinks]] package (rank 91/2083) & [http://bioconductor.org/packages/release/bioc/html/TCGAbiolinksGUI.html TCGAbiolinksGUI]
** [http://bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html curatedTCGAData] (rank 34/416)
** [http://bioconductor.org/packages/release/data/experiment/html/TCGAWorkflowData.html TCGAWorkflowData] (rank 54/416)
** [http://bioconductor.org/packages/release/data/experiment/html/HarmonizedTCGAData.html HarmonizedTCGAData] (276/416)
** [https://bioconductor.org/packages/release/bioc/html/RTCGAToolbox.html RTCGAToolbox], [https://antonioahn.github.io/post/tcgadata_download/ Downloading RNAseq, me450k and clinical data from TCGA melanoma tumours] (rank 251/2083).
* https://gdac.broadinstitute.org/  
** [https://github.com/RTCGA/RTCGA/blob/master/R/downloadTCGA.R RTCGA::downloadTCGA()] uses the above website. [http://www.bioconductor.org/packages/release/data/experiment/html/RTCGA.RPPA.html RTCGA.RPPA] package. RPPA = reverse phase '''p'''rotein array.
* [https://cran.r-project.org/web/packages/cgdsr/ cgdsr] package
** [https://www.biostars.org/p/219024/ Tutorial: retrieve full TCGA datasets from cBioportal with R]
** [http://qiubio.com/new/book/sessionInfo/#cbioportal Introductory R tutorial for bioinformatics students]
* [http://zyxue.github.io/2017/06/02/understanding-TCGA-mRNA-Level3-analysis-results-files-from-firebrose.html Understanding TCGA mRNA Level3 analysis results files from FireBrowse]
* [https://github.com/jmzeng1314/tcga_example Complete TCGA hands-on guide]
* [https://www.jianshu.com/p/79816a20cbb1 TCGA study 01: data download and cleanup], [https://biowolf.cn/m/view.php?aid=25 Downloading and extracting TCGA clinical data], [https://blog.csdn.net/sayhello1025/article/details/103474816 Tidying TCGA clinical data], [https://www.cnblogs.com/nkwy2012/p/10112581.html TCGA sample naming explained], [https://www.jianshu.com/p/69dc9e1e4f62 TCGA data download and ID conversion]
Line 26: Line 16:
** https://docs.ropensci.org/UCSCXenaTools/
** Several data hubs are available. For example, the TCGA Hub is served from https://tcga.xenahubs.net/
* [https://www.bioconductor.org/packages/release/data/experiment/html/GSE62944.html TCGA processed RNA-Seq data (GSE62944) as a SummarizedExperiment]
** This was used in [https://bioconductor.org/packages/release/bioc/vignettes/GSEABenchmarkeR/inst/doc/GSEABenchmarkeR.html GSEABenchmarkeR] package vignette
* [http://www.linkedomics.org/ LinkedOmics]. [http://www.linkedomics.org/data_download/TCGA-KIRC/ TCGA-KIRC]
* [https://github.com/GerkeLab/TCGAclinical TCGA Pancancer Clinical Data] RData & tsv.
* [https://docs.cbioportal.org/user-guide/faq/ cBioPortal FAQs]
= R packages =
* [https://bioconductor.org/packages/release/bioc/html/cBioPortalData.html cBioPortalData]
* [http://www.bioconductor.org/packages/release/data/experiment/html/RTCGA.RPPA.html RTCGA.RPPA] package. RPPA = reverse phase '''p'''rotein array.
* [https://cran.r-project.org/web/packages/cgdsr/ cgdsr] package
** [https://www.biostars.org/p/219024/ Tutorial: retrieve full TCGA datasets from cBioportal with R]
** [http://qiubio.com/new/book/sessionInfo/#cbioportal Introductory R tutorial for bioinformatics students]
* Comparison of RTCGA, UCSCXenaTools (CRAN), TCGAbiolinks & [https://www.bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html curatedTCGAData]
** [https://support.bioconductor.org/p/109896/ Differences between RTCGA and TCGAbiolinks data]
* [https://github.com/GerkeLab/TCGAclinical TCGA Pancancer Clinical Data] RData & tsv.
 
== TCGAbiolinks: STAR counts ==
<ul>
<li>[https://bioconductor.org/packages/release/workflows/html/TCGAWorkflow.html TCGAWorkflow] package. Rank 7/29. It suggests
* [http://bioconductor.org/packages/release/data/experiment/html/TCGAbiolinksGUI.data.html TCGAbiolinksGUI.data] (rank 2/416) Why?
* [https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html TCGAbiolinks] package (rank 91/2083)
* [http://bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html curatedTCGAData] (rank 34/416)
* [http://bioconductor.org/packages/release/data/experiment/html/TCGAWorkflowData.html TCGAWorkflowData] (rank 54/416)
* [http://bioconductor.org/packages/release/data/experiment/html/HarmonizedTCGAData.html HarmonizedTCGAData] (276/416)
* [https://bioconductor.org/packages/release/bioc/html/RTCGAToolbox.html RTCGAToolbox], [https://antonioahn.github.io/post/tcgadata_download/ Downloading RNAseq, me450k and clinical data from TCGA melanoma tumours] (rank 251/2083).
 
<li>Example from [https://bioconductor.org/packages/release/workflows/vignettes/TCGAWorkflow/inst/doc/TCGAWorkflow.html TCGA Workflow]. LGG=Low Grade Glioma.
* [https://www.cbioportal.org/study/summary?id=lgg_tcga cBioPortal]. Click the download arrow next to '''Brain Lower Grade Glioma (TCGA, Firehose Legacy)'''. The tar.gz file is 165MB. One file is '''data_mrna_seq_v2_rsem.txt'''. All values are non-integer, so they are not raw counts; they are probably either '''expected counts''' or '''FPKM'''.
* [https://portal.gdc.cancer.gov/projects/TCGA-LGG GDC]. TCGA-LGG -> Repository. In the left-hand "Filter" panel, choose Experiment strategy = "RNA-Seq", Data category = "transcriptome profiling", Data type = "Gene expression quantification", Workflow type = '''STAR Counts'''. This will show 516 tsv files.
* [https://xenabrowser.net/datapages/?dataset=TCGA.LGG.sampleMap%2FHiSeqV2&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443 Xena Browser] also has a version based on "RSEM normalized count", stored as log2(norm_count + 1). So this is probably a duplicate of the cBioPortal data.
* The '''RTCGA''' package downloads data from [http://firebrowse.org/?cohort=LGG&download_dialog=true GDAC broadinstitute/firebrowse]. This contains RSEM output files. If I download a file directly from firebrowse, the file 'LGG.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt' has a "raw_count" column, and most values are integers (column sum is about 47132282).
* The '''GSE62944''' package used '''Rsubread::featureCounts()''' to obtain raw counts data from TCGA samples.
* The '''TCGAbiolinks''' package can download all of the '''STAR counts, TPM and FPKM''' data.
<syntaxhighlight lang='r'>
library(TCGAbiolinks)
# Step 1: query
query_exp_lgg <- GDCquery(
  project = "TCGA-LGG",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
 
# Step 2: download
# A total of 2.27 GB. Take a while... One time only:)
system.time(GDCdownload(query_exp_lgg)) # 9 min
 
# Step 3: prepare
# Output is a SummarizedExperiment object
exp_lgg <- GDCprepare(
  query = query_exp_lgg
)
# Available assays in SummarizedExperiment :
#  => unstranded        ---> Raw count data
#  => stranded_first
#  => stranded_second
#  => tpm_unstrand
#  => fpkm_unstrand
#  => fpkm_uq_unstrand
class(exp_lgg)
 
# Step 4: get gene expression matrix
library(SummarizedExperiment)
assayNames(exp_lgg)
# [1] "unstranded"      "stranded_first"  "stranded_second"  "tpm_unstrand"
# [5] "fpkm_unstrand"    "fpkm_uq_unstrand"
data.expr <- assay(exp_lgg, "tpm_unstrand") # default is the 1st assay name/data type: 'unstranded'
dim(data.expr)
# [1] 60660  534
summary(apply(data.expr, 2, sum))
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06
 
# Step 5: get sample information
data.pheno <- colData(exp_lgg)
dim(data.pheno)
# [1] 534 112
 
# Step 6: get feature information
data.feature <- rowData(exp_lgg)
dim(data.feature)
# [1] 60660    10
</syntaxhighlight>
 
<li>In the context of the TCGAbiolinks package, the following assays are available:
* '''unstranded''', '''stranded_first''', '''stranded_second''': '''raw count''' data from STAR's per-gene read counting under three strandedness assumptions: an unstranded library, read 1 on the same strand as the RNA, and read 2 on the same strand as the RNA. '''unstranded''' is the first assay and therefore the default.
* '''tpm_unstrand''': Transcripts Per Million, derived from the unstranded counts. It normalizes for gene length and sequencing depth, so every sample sums to 1e6.
* '''fpkm_unstrand''': Fragments Per Kilobase of transcript per Million mapped reads, also normalized for gene length and sequencing depth.
* '''fpkm_uq_unstrand''': upper-quartile-normalized FPKM, which scales by an upper-quartile gene count instead of the total number of mapped reads.
<li>[https://github.com/tmemklab/tcga_data_download TCGA Biolinks Data Processing]
</ul>
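
The relationship between the count, FPKM and TPM assays described above can be illustrated with a small base R sketch. The count matrix and gene lengths below are made-up toy values, and the library size is taken as the plain column sum; the GDC pipeline may define its FPKM denominator differently (for example, restricting it to protein-coding genes), so this is only meant to show the arithmetic, not to reproduce GDC values.

<syntaxhighlight lang='r'>
# Toy raw counts: 4 genes x 2 samples (hypothetical values, filled column-wise)
counts <- matrix(
  c(100, 200, 300, 400,
     50,  50, 500, 400),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2"))
)
# Hypothetical gene lengths in base pairs
gene_length_bp <- c(gene1 = 1000, gene2 = 2000, gene3 = 1500, gene4 = 4000)

# FPKM: scale by library size (per million reads), then by gene length (per kb)
lib_size <- colSums(counts)
fpkm <- t(t(counts) / lib_size) * 1e6 / (gene_length_bp / 1e3)

# TPM: divide by gene length first, then rescale each sample to sum to 1e6
rate <- counts / (gene_length_bp / 1e3)
tpm <- t(t(rate) / colSums(rate)) * 1e6

colSums(tpm)   # every sample sums to 1e6
colSums(fpkm)  # FPKM column sums differ between samples
</syntaxhighlight>

Because TPM is just FPKM rescaled within each sample, <code>t(t(fpkm) / colSums(fpkm)) * 1e6</code> gives the same matrix as <code>tpm</code>.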
 
== GSE62944 ==
* https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
** 9264 tumor samples and 741 normal samples across 24 cancer types.
** Processed with Rsubread.
** Provides raw counts, FPKM and TPM values.
* GSE62944 - [https://www.bioconductor.org/packages/release/data/experiment/html/GSE62944.html TCGA processed RNA-Seq data as a SummarizedExperiment]
** This was used in [https://bioconductor.org/packages/release/bioc/vignettes/GSEABenchmarkeR/inst/doc/GSEABenchmarkeR.html GSEABenchmarkeR] package vignette
 
= PanCancer atlas vs Firehose legacy =
[https://docs.cbioportal.org/user-guide/faq/#what-are-tcga-firehose-legacy-datasets-and-how-do-they-compare-to-the-publication-associated-datasets-and-the-pancancer-atlas-datasets What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?]
 
= Relationship of TCGA, Firehose and cBioportal =
https://www.biostars.org/p/366545/
 
= Some examples =
* Colon sample case: '''Clinical data'''
** Compared to firebrowse.org, the cBioPortal website provides the clinical data in a convenient format. It also offers several web-based analysis tools, which may not be useful.
Line 38: Line 126:
** Click the download button "Download clinical data for the selected cases"
** Save the file "coadread_tcga_clinical_data.tsv". It is a tab-delimited text file. Column 3 is "Sample ID", column W is Disease Free (Months), column X is Disease Free Status, column BB is Overall Survival (Months), column BC is Overall Survival Status. (By comparison, firebrowse gives a complicated table.)
* Colon sample case: '''Mutation data'''
** http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy) > Query By Gene
Line 50: Line 139:
** Click the left-most tab "Download" > Mutations (OQL is not in effect) Tab Delimited Format
** Save the file mutations.txt for 640 samples.
* Colon sample case: '''RNASeq'''
** http://firebrowse.org/ (good for downloading gene expression data)
Line 63: Line 153:
** Fourth method - GenomicDataCommons R package. [https://www.biostars.org/p/204092/ Tutorial:Protocol To Downlad TCGA Data From GDC]
** Fifth method - GDC Data Transfer Tool [https://www.biostars.org/p/204092/ Tutorial:Protocol To Downlad TCGA Data From GDC]
* Pancreatic sample: '''RNASeq'''
** https://www.cbioportal.org/datasets
** Pancreatic Adenocarcinoma (TCGA, '''PanCancer Atlas''')
** Click the download icon in the same line and save the file paad_tcga_pan_can_atlas_2018.tar.gz
** Unzip the above file and pick "data_mrna_seq_v2_rsem.txt" (non-integer)
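
The two cBioPortal downloads described above (the clinical TSV and "data_mrna_seq_v2_rsem.txt") can be read and matched by sample ID with base R. The sketch below uses small inline stand-ins rather than the real files: the sample IDs and values are made up, the "1:DECEASED"/"0:LIVING" status coding is an assumption about the export format, and the toy clinical table has fewer columns than the real one. Replace the <code>text =</code> arguments with file paths to try it on actual downloads.

<syntaxhighlight lang='r'>
# Toy stand-ins for the two downloaded files (tab-delimited; values are made up)
clin_txt <- paste(
  "Patient ID\tSample ID\tOverall Survival (Months)\tOverall Survival Status",
  "TCGA-XX-0001\tTCGA-XX-0001-01\t12.5\t1:DECEASED",
  "TCGA-XX-0002\tTCGA-XX-0002-01\t40.1\t0:LIVING",
  "TCGA-XX-0003\tTCGA-XX-0003-01\tNA\tNA",
  sep = "\n"
)
expr_txt <- paste(
  "Hugo_Symbol\tEntrez_Gene_Id\tTCGA-XX-0001-01\tTCGA-XX-0002-01",
  "KRAS\t3845\t1520.4\t980.7",
  "TP53\t7157\t2210.9\t1875.2",
  "TP53\t7157\t10.0\t12.5",
  sep = "\n"
)

# Clinical table -> survival time (months) and event indicator (1 = deceased)
clin <- read.delim(text = clin_txt, check.names = FALSE, stringsAsFactors = FALSE)
status <- clin[["Overall Survival Status"]]
surv <- data.frame(
  sample = clin[["Sample ID"]],
  time   = clin[["Overall Survival (Months)"]],
  event  = ifelse(is.na(status), NA_integer_, as.integer(grepl("DECEASED", status))),
  stringsAsFactors = FALSE
)
surv <- surv[complete.cases(surv), ]  # drop samples without survival data

# Expression table -> numeric matrix with one row per gene symbol
expr <- read.delim(text = expr_txt, check.names = FALSE, stringsAsFactors = FALSE)
mat <- as.matrix(expr[, -(1:2)])                  # drop the two annotation columns
ord <- order(rowMeans(mat), decreasing = TRUE)    # highest-expressed row first
keep <- ord[!duplicated(expr$Hugo_Symbol[ord])]   # keep one row per symbol
mat <- mat[keep, , drop = FALSE]
rownames(mat) <- expr$Hugo_Symbol[keep]
log_mat <- log2(mat + 1)

# Keep only samples present in both tables, in the same order
common <- intersect(surv$sample, colnames(log_mat))
surv <- surv[match(common, surv$sample), ]
log_mat <- log_mat[, common, drop = FALSE]
</syntaxhighlight>

Keeping the highest-expressed row for a duplicated gene symbol is only one possible choice; averaging the duplicates would work as well. The resulting <code>time</code> and <code>event</code> columns are in the form expected by <code>survival::Surv()</code>.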


= Tumor vs normal =
Line 323: Line 419:
# Get all metadata
metadata_clean <- recount::all_metadata("tcga")
dim(metadata_clean)
# [1] 11284  864
kable(table(metadata_clean$gdc_cases.project.project_id))
|Var1      | Freq|
|:---------|----:|
|TCGA-ACC  |   79|
|TCGA-BLCA |  433|
|TCGA-BRCA | 1246|
|TCGA-CESC |  309|
|TCGA-CHOL |   45|
|TCGA-COAD |  546|
|TCGA-DLBC |   48|
|TCGA-ESCA |  198|
|TCGA-GBM  |  175|
|TCGA-HNSC |  548|
|TCGA-KICH |   91|
|TCGA-KIRC |  616|
|TCGA-KIRP |  323|
|TCGA-LAML |  126|
|TCGA-LGG  |  532|
|TCGA-LIHC |  424|
|TCGA-LUAD |  601|
|TCGA-LUSC |  555|
|TCGA-MESO |   87|
|TCGA-OV   |  430|
|TCGA-PAAD |  183|
|TCGA-PCPG |  187|
|TCGA-PRAD |  558|
|TCGA-READ |  177|
|TCGA-SARC |  265|
|TCGA-SKCM |  473|
|TCGA-STAD |  453|
|TCGA-TGCT |  156|
|TCGA-THCA |  572|
|TCGA-THYM |  122|
|TCGA-UCEC |  589|
|TCGA-UCS  |   57|
|TCGA-UVM  |   80|


# Get only PAAD project
Line 431: Line 565:
* "A36H" - This is the plate ID: in the standard TCGA barcode layout, the second-to-last field identifies the 96-well plate that held the analyte.
* "07" - This is the center code: the last field identifies the sequencing or characterization center that received the analyte (07 is the University of North Carolina).
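
Splitting a barcode into its fields can be sketched in base R as follows. This assumes the standard seven-field aliquot layout (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, center) and the usual sample type ranges (01-09 tumor, 10-19 normal, 20-29 control). The function name <code>parse_tcga_barcode</code> is hypothetical, and the barcode is only an illustrative example.

<syntaxhighlight lang='r'>
# Hypothetical helper: split a full TCGA aliquot barcode into named fields
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)                       # full aliquot barcodes only
  sample_code <- as.integer(substr(parts[4], 1, 2))   # e.g. "01C" -> 1
  list(
    project     = parts[1],
    tss         = parts[2],                  # tissue source site
    participant = parts[3],
    sample_type = substr(parts[4], 1, 2),
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),    # e.g. D = DNA, R = RNA
    plate       = parts[6],
    center      = parts[7],
    group       = if (sample_code <= 9) "tumor"
                  else if (sample_code <= 19) "normal"
                  else "control"
  )
}

x <- parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01")
x$plate   # "0182"
x$group   # "tumor"
</syntaxhighlight>

The <code>group</code> field is the part most often needed in practice, for example to separate tumor from normal samples before a differential expression analysis.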
= in silico method/data/approach =
* [https://www.future-science.com/doi/10.2144/btn-2018-0179 A simple in silico approach to generate gene-expression profiles from subsets of cancer genomics data].
** It also demonstrates how to use cBioPortal for data analysis.
** It focuses on gene expression, DNA methylation and protein expression.
** [https://www.future-science.com/doi/suppl/10.2144/btn-2018-0179 Supp].
** Fig 1 (Representative analysis). KM plot.
** Fig 2 (Representative analysis). mRNA expression of genes boxplot colored by mutation
** Fig 3 (Representative analysis). mRNA expression vs mutation boxplot
** Fig 4 (Representative analysis). Methylation vs mRNA expression (counts) colored by mutation
** Fig 5 (Representative analysis). RPPA vs mutation. RPPA vs mRNA expression
= Microbiome =
[https://www.nature.com/articles/s41467-022-30512-3#Sec9 Predicting cancer prognosis and drug response from the tumor microbiome] Hermida 2022.

Latest revision as of 16:04, 16 October 2024

Resources

R packages

TCGAbiolinks: STAR counts

  • TCGAWorkflow package. Rank 7/29. It suggests
  • Example from TCGA Workflow. LGG=Low Grade Glioma.
    • cBioPortal. Click the download arrow next to Brain Lower Grade Glioma (TCGA, Firehose Legacy). The tar.gz file is 165MB. One file is data_mrna_seq_v2_rsem.txt. All values are non-integer, so they are not raw counts; they are probably either expected counts or FPKM.
    • GDC. TCGA-LGG -> Repository. In the left-hand "Filter" panel, choose Experiment strategy = "RNA-Seq", Data category = "transcriptome profiling", Data type = "Gene expression quantification", Workflow type = STAR Counts. This will show 516 tsv files.
    • Xena Browser also has a version based on "RSEM normalized count", stored as log2(norm_count + 1). So this is probably a duplicate of the cBioPortal data.
    • The RTCGA package downloads data from GDAC broadinstitute/firebrowse. This contains RSEM output files. If I download a file directly from firebrowse, the file 'LGG.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt' has a "raw_count" column, and most values are integers (column sum is about 47132282).
    • The GSE62944 package used Rsubread::featureCounts() to obtain raw counts data from TCGA samples.
    • The TCGAbiolinks package can download all of the STAR counts, TPM and FPKM data.
    library(TCGAbiolinks)
    # Step 1: query
    query_exp_lgg <- GDCquery(
      project = "TCGA-LGG",
      data.category = "Transcriptome Profiling",
      data.type = "Gene Expression Quantification", 
      workflow.type = "STAR - Counts"
    )
    
    # Step 2: download
    # A total of 2.27 GB. Take a while... One time only:)
    system.time(GDCdownload(query_exp_lgg)) # 9 min
    
    # Step 3: prepare
    # Output is a SummarizedExperiment object
    exp_lgg <- GDCprepare(
      query = query_exp_lgg
    )
    # Available assays in SummarizedExperiment :
    #   => unstranded        ---> Raw count data
    #   => stranded_first
    #   => stranded_second
    #   => tpm_unstrand
    #   => fpkm_unstrand
    #   => fpkm_uq_unstrand
    class(exp_lgg)
    
    # Step 4: get gene expression matrix
    library(SummarizedExperiment)
    assayNames(exp_lgg)
    # [1] "unstranded"       "stranded_first"   "stranded_second"  "tpm_unstrand"
    # [5] "fpkm_unstrand"    "fpkm_uq_unstrand"
    data.expr <- assay(exp_lgg, "tpm_unstrand") # default is the 1st assay name/data type: 'unstranded'
    dim(data.expr)
    # [1] 60660   534
    summary(apply(data.expr, 2, sum))
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #  1e+06   1e+06   1e+06   1e+06   1e+06   1e+06
    
    # Step 5: get sample information
    data.pheno <- colData(exp_lgg)
    dim(data.pheno)
    # [1] 534 112
    
    # Step 6: get feature information
    data.feature <- rowData(exp_lgg)
    dim(data.feature)
    # [1] 60660    10
  • In the context of the TCGAbiolinks package, the following assays are available:
    • unstranded, stranded_first, stranded_second: raw count data from STAR's per-gene read counting under three strandedness assumptions: an unstranded library, read 1 on the same strand as the RNA, and read 2 on the same strand as the RNA. unstranded is the first assay and therefore the default.
    • tpm_unstrand: Transcripts Per Million, derived from the unstranded counts. It normalizes for gene length and sequencing depth, so every sample sums to 1e6.
    • fpkm_unstrand: Fragments Per Kilobase of transcript per Million mapped reads, also normalized for gene length and sequencing depth.
    • fpkm_uq_unstrand: upper-quartile-normalized FPKM, which scales by an upper-quartile gene count instead of the total number of mapped reads.
  • TCGA Biolinks Data Processing

GSE62944

PanCancer atlas vs Firehose legacy

What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?

Relationship of TCGA, Firehose and cBioportal

https://www.biostars.org/p/366545/

Some examples

  • Colon sample case: Clinical data
    • Compared to firebrowse.org, the cBioPortal website provides the clinical data in a convenient format. It also offers several web-based analysis tools, which may not be useful.
    • http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy); click the last icon "View clinical and genomic data of this study"
    • Tab "Summary" and check "Colon Adenocarcinoma 392"
    • Click the download button "Download clinical data for the selected cases"
    • Save the file "coadread_tcga_clinical_data.tsv". It is a tab-delimited text file. Column 3 is "Sample ID", column W is Disease Free (Months), column X is Disease Free Status, column BB is Overall Survival (Months), column BC is Overall Survival Status. (By comparison, firebrowse gives a complicated table.)
  • Colon sample case: Mutation data
    • http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy) > Query By Gene
    • Query
      • Check "Mutations"
      • Uncheck "Putative copy-number alterations from GISTIC"
      • Check "mRNA Expression. Select one of the profiles below: > mRNA expression z-scores relative to diploid samples (microarray)"
      • Select Patient/Case Set: All samples (640)
      • Copy contents in oncomineGene.txt to "Enter Genes:" > Replace gene symbol: MRE11A:MRE11, RB:RB1
      • oncomineGene.txt has 169 genes. It has RB and RB1 genes. This website changes RB to RB1. There are 2 RB1. Remove one duplicate RB1 and total 168 genes.
      • Submit Query
    • Click the left-most tab "Download" > Mutations (OQL is not in effect) Tab Delimited Format
    • Save the file mutations.txt for 640 samples.
  • Pancreatic sample: RNASeq
    • https://www.cbioportal.org/datasets
    • Pancreatic Adenocarcinoma (TCGA, PanCancer Atlas)
    • Click the download icon in the same line and save the file paad_tcga_pan_can_atlas_2018.tar.gz
    • Unzip the above file and pick "data_mrna_seq_v2_rsem.txt" (non-integer)

Tumor vs normal

Drug response

  • Evaluating the molecule-based prediction of clinical drug responses in cancer Ding 2016 Bioinformatics, "bioinformatics_32_19_2891_s1.zip". 572 samples.
    library(readxl)
    library(knitr)  # provides kable(), used below
    dat <- read_excel("~/Downloads/bioinformatics_32_19_2891_s1/bioinfo16_supplementary_tables.xlsx", 
                      "Table S2", skip=2)
    dat <- dat[-1, ]
    dim(dat)
    # [1] 2572   14
    kable(table(dat$Cancer))
    |Var1                                                                    | Freq|
    |:-----------------------------------------------------------------------|----:|
    |Adrenocortical carcinoma (ACC)                                          |   13|
    |Bladder Urothelial Carcinoma (BLCA)                                     |  164|
    |Brain Lower Grade Glioma (LGG)                                          |  162|
    |Breast invasive carcinoma (BRCA)                                        |  389|
    |Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) |   97|
    |Colon adenocarcinoma (COAD)                                             |  192|
    |Esophageal carcinoma (ESCA)                                             |   25|
    |Glioblastoma multiforme (GBM)                                           |   10|
    |Head and Neck squamous cell carcinoma (HNSC)                            |  112|
    |Kidney Chromophobe (KICH)                                               |    2|
    |Kidney renal clear cell carcinoma (KIRC)                                |   14|
    |Kidney renal papillary cell carcinoma (KIRP)                            |   14|
    |Liver hepatocellular carcinoma (LIHC)                                   |   29|
    |Lung adenocarcinoma (LUAD)                                              |  151|
    |Lung squamous cell carcinoma (LUSC)                                     |   69|
    |Mesothelioma (MESO)                                                     |   80|
    |Ovarian serous cystadenocarcinoma (OV)                                  |   11|
    |Pancreatic adenocarcinoma (PAAD)                                        |   99|
    |Pheochromocytoma and Paraganglioma (PCPG)                               |    7|
    |Prostate adenocarcinoma (PRAD)                                          |   45|
    |Rectum adenocarcinoma (READ)                                            |   67|
    |Sarcoma (SARC)                                                          |  101|
    |Skin Cutaneous Melanoma (SKCM)                                          |  137|
    |Stomach adenocarcinoma (STAD)                                           |  243|
    |Testicular Germ Cell Tumors (TGCT)                                      |  159|
    |Thyroid carcinoma (THCA)                                                |   10|
    |Uterine Carcinosarcoma (UCS)                                            |   83|
    |Uterine Corpus Endometrial Carcinoma (UCEC)                             |   87|
    
    kable(table(dat$drug_name))
    |Var1                                               | Freq|
    |:--------------------------------------------------|----:|
    |Aldesleukin                                        |    6|
    |Alverine                                           |    1|
    |Anastrozole                                        |   18|
    |anti-A5B1 integrin monoclonal antibody PF-04605412 |    1|
    |anti-endosialin/TEM1 monoclonal antibody MORAb-004 |    1|
    |autologous vaccine                                 |    1|
    |Axitinib                                           |    3|
    |AZD2171                                            |    1|
    |Bacillus Calmette-Guerin (BCG)                     |    1|
    |BCG                                                |    2|
    |Bevacizumab                                        |   48|
    |Bicalutamide                                       |   17|
    |Bleomycin                                          |   54|
    |BRAF inhibitor                                     |    1|
    |Cabazitaxel                                        |    2|
    |Cabozantinib                                       |    1|
    |Cancer Vax                                         |    2|
    |Capecitabine                                       |   55|
    |Carboplatin                                        |  181|
    |Carmustine                                         |    6|
    |Cetuximab                                          |   21|
    |Chemo, Multi-Agent, NOS                            |    1|
    |Chemo, NOS                                         |    3|
    |Cilengtide                                         |    1|
    |Cisplatin                                          |  330|
    |Copolang                                           |    5|
    |COPOLANG CAPS                                      |    1|
    |Cyclophosphamide                                   |  103|
    |cyclophosphamide, vincristine, and dacarbazine     |    1|
    |Cyclosporine                                       |    1|
    |Dabrafenib                                         |    4|
    |Dacarbazine                                        |   30|
    |Dactinomycin                                       |    3|
    |Dasatinib                                          |    8|
    |Degarelix                                          |    1|
    |Denosumab                                          |    2|
    |Dexamethasone                                      |    6|
    |Didox                                              |    3|
    |Docetaxel                                          |  106|
    |Docetaxel +/- Zactima                              |    1|
    |Doxorubicin                                        |  108|
    |doxorubicin/cyclophosphamide                       |    1|
    |E7389                                              |    2|
    |Enoticumab                                         |    1|
    |Epirubicin                                         |   28|
    |Epoetin alfa                                       |    1|
    |Eribulin                                           |    1|
    |Erlotinib                                          |    7|
    |Etoposide                                          |   87|
    |Everolimus                                         |    6|
    |Everolimus, Gemcitabine, and Cisplatin             |    1|
    |Exemestane                                         |    3|
    |EZN-2968                                           |    1|
    |Fluorouracil                                       |  212|
    |Folfiri                                            |    1|
    |Folfox                                             |    2|
    |FOLFOX                                             |    2|
    |Fotemustine                                        |    3|
    |Fulvestrant                                        |    1|
    |Gefitinib                                          |    2|
    |Gemcitabine                                        |  165|
    |Gemox                                              |    1|
    |Goserelin                                          |    8|
    |GP-100                                             |    1|
    |GP100                                              |    1|
    |HSC vaccine injection                              |    1|
    |Hydrocortisone                                     |    1|
    |Hydroxyurea                                        |    1|
    |Ifosfamid                                          |    1|
    |Ifosfamide                                         |   24|
    |Imatinib                                           |    3|
    |Infliximab                                         |    2|
    |Interferon alfa-n1                                 |    6|
    |Interferon alfacon-1                               |    8|
    |iodine I 131 monoclonal antibody 81C6              |    1|
    |Ipilimumab                                         |   11|
    |Irinotecan                                         |   30|
    |Ixabepilone                                        |    1|
    |Ketoconazole                                       |    1|
    |Lapatinib                                          |    2|
    |Letrozole                                          |    5|
    |Leucovorin                                         |   93|
    |Leuprolide                                         |   16|
    |Levothyroxine                                      |    1|
    |Liothyronine                                       |    7|
    |Lomustine                                          |   11|
    |LY228820                                           |    1|
    |Megestrol acetate                                  |    2|
    |MEL-44                                             |    2|
    |Melphalan                                          |    6|
    |Methotrexate                                       |   15|
    |Methylprednisolone                                 |    1|
    |Mitomycin                                          |    7|
    |Mitotane                                           |    1|
    |Mitoxantrone                                       |    1|
    |Mycophenolic acid                                  |    6|
    |Nilutamide                                         |    2|
    |nivolumab                                          |    1|
    |Ondansetron                                        |    1|
    |Oxaliplatin                                        |   75|
    |Paclitaxel                                         |  172|
    |Pamidronate                                        |    3|
    |Panitumumab                                        |    1|
    |Pazopanib                                          |    6|
    |Pegfilgrastim                                      |    6|
    |Pemetrexed                                         |   44|
    |PI-88                                              |    1|
    |Platinum                                           |    5|
    |PNU-159548                                         |    1|
    |Poly E                                             |    1|
    |Polyplatillen                                      |    2|
    |Procarbazine                                       |    8|
    |px-866                                             |    1|
    |R1507                                              |    1|
    |Raloxifene                                         |    1|
    |recMAGE- A3                                        |    1|
    |recombinant interferon-α2b                         |    1|
    |Regorafenib                                        |    1|
    |RenAmin                                            |    1|
    |Resiquimod                                         |    1|
    |ridaforolimus                                      |    1|
    |rigosertib                                         |    1|
    |Rituximab                                          |    1|
    |Sargramostim                                       |    2|
    |Sorafenib                                          |   17|
    |Streptozocin                                       |    1|
    |Sulindac                                           |    1|
    |Sunitinib                                          |   10|
    |Talimogene Laherparepvec (T-VEC)                   |    1|
    |Tamoxifen                                          |   24|
    |tegafur-gimeracil-oteracil potassium               |    3|
    |Temozolomide                                       |  116|
    |Temsirolimus                                       |    3|
    |Thalidomide                                        |    1|
    |Themozolomide                                      |    2|
    |Threshold-302                                      |    1|
    |Topotecan                                          |    4|
    |Toremifene                                         |    1|
    |Trabectedin                                        |    3|
    |Trametinib                                         |    2|
    |Trastuzumab                                        |   17|
    |Trelstar                                           |    2|
    |triptorelin                                        |    1|
    |Tyrosine kinase inhibitor                          |    1|
    |veliparib                                          |    2|
    |Vemurafenib                                        |    3|
    |Vinblastine                                        |   16|
    |Vincristine                                        |   13|
    |Vinorelbine                                        |   31|
    |Vorinostat                                         |    3|
    |Yervoy                                             |    2|
    |Zoledronate                                        |    2|
    
    kable(table(dat$Cancer[dat$drug_name == "Gemcitabine"]))
    |Var1                                                                    | Freq|
    |:-----------------------------------------------------------------------|----:|
    |Bladder Urothelial Carcinoma (BLCA)                                     |   48|
    |Breast invasive carcinoma (BRCA)                                        |    1|
    |Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) |    2|
    |Esophageal carcinoma (ESCA)                                             |    2|
    |Liver hepatocellular carcinoma (LIHC)                                   |    3|
    |Lung adenocarcinoma (LUAD)                                              |    7|
    |Lung squamous cell carcinoma (LUSC)                                     |   10|
    |Mesothelioma (MESO)                                                     |    6|
    |Pancreatic adenocarcinoma (PAAD)                                        |   60|
    |Pheochromocytoma and Paraganglioma (PCPG)                               |    1|
    |Sarcoma (SARC)                                                          |   22|
    |Skin Cutaneous Melanoma (SKCM)                                          |    1|
    |Uterine Carcinosarcoma (UCS)                                            |    1|
    |Uterine Corpus Endometrial Carcinoma (UCEC)                             |    1|
    
    kable(table(dat$drug_name[dat$Cancer == "Pancreatic adenocarcinoma (PAAD)"]))
    |Var1             | Freq|
    |:----------------|----:|
    |Capecitabine     |    6|
    |Carboplatin      |    1|
    |Cyclophosphamide |    1|
    |Docetaxel        |    1|
    |Doxorubicin      |    1|
    |Erlotinib        |    1|
    |Fluorouracil     |   13|
    |Gemcitabine      |   60|
    |Irinotecan       |    3|
    |Leucovorin       |    4|
    |Oxaliplatin      |    6|
    |Paclitaxel       |    2|
    
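  • The data above were used in "Predicting cancer prognosis and drug response from the tumor microbiome" (Hermida 2022).
  • TCGA immunotherapy treated melanoma data. This uses the recount::all_metadata() function; the example below pulls the metadata for all TCGA projects and then subsets the PAAD project.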
    # Get all metadata 
    metadata_clean <- recount::all_metadata("tcga")
    dim(metadata_clean)
    # [1] 11284   864
    kable(table(metadata_clean$gdc_cases.project.project_id))
    |Var1      | Freq|
    |:---------|----:|
    |TCGA-ACC  |   79|
    |TCGA-BLCA |  433|
    |TCGA-BRCA | 1246|
    |TCGA-CESC |  309|
    |TCGA-CHOL |   45|
    |TCGA-COAD |  546|
    |TCGA-DLBC |   48|
    |TCGA-ESCA |  198|
    |TCGA-GBM  |  175|
    |TCGA-HNSC |  548|
    |TCGA-KICH |   91|
    |TCGA-KIRC |  616|
    |TCGA-KIRP |  323|
    |TCGA-LAML |  126|
    |TCGA-LGG  |  532|
    |TCGA-LIHC |  424|
    |TCGA-LUAD |  601|
    |TCGA-LUSC |  555|
    |TCGA-MESO |   87|
    |TCGA-OV   |  430|
    |TCGA-PAAD |  183|
    |TCGA-PCPG |  187|
    |TCGA-PRAD |  558|
    |TCGA-READ |  177|
    |TCGA-SARC |  265|
    |TCGA-SKCM |  473|
    |TCGA-STAD |  453|
    |TCGA-TGCT |  156|
    |TCGA-THCA |  572|
    |TCGA-THYM |  122|
    |TCGA-UCEC |  589|
    |TCGA-UCS  |   57|
    |TCGA-UVM  |   80|
    
    # Get only PAAD project
    x <- metadata_clean[metadata_clean$gdc_cases.project.project_id == "TCGA-PAAD",]
    dim(x)
    # [1] 183 864
    class(x)
    # [1] "DFrame"
    x$xml_tumor_response_cdus_type
      [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [17] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [33] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [49] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [65] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [81] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
     [97] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    [113] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    [129] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    [145] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    [161] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    [177] <NA> <NA> <NA> <NA> <NA> <NA> <NA>
    Levels:  Complete response Partial response Progression Stable
    
    library(knitr)
    kable(table(toupper(x$cgc_drug_therapy_drug_name)))
    |Var1                  | Freq|
    |:---------------------|----:|
    |5 FU                  |    3|
    |5-FLUOROURACIL        |    3|
    |5-FU                  |    4|
    |5FU                   |    1|
    |ABRAXANE              |    2|
    |CAPECITABINE          |    2|
    |CHEMO, NOS            |    2|
    |CISPLATIN             |    2|
    |CYCLOPHOSPHAMIDE      |    1|
    |DOCETAXEL             |    1|
    |FLUOROURACIL          |    2|
    |FOLINIC ACID          |    1|
    |FU7                   |    1|
    |GEMCITABINE           |   68|
    |GEMCITABINE INJECTION |    1|
    |GEMCITIBINE           |    1|
    |GEMZAR                |    7|
    |LEUCOVORIN            |    3|
    |LEUCOVORIN CALCIUM    |    2|
    |OXALIPLATIN           |    4|
    |XELODA                |    2|
    
    x$gdc_cases.submitter_id
    sum(x$gdc_cases.project.project_id == "TCGA-PAAD")
    # [1] 183
    x$cgc_case_primary_therapy_outcome_success
    x$cgc_case_id == "TCGA-F2-6880"
    x$xml_bcr_patient_barcode
    x$xml_vital_status
    x$xml_tumor_type
    x$xml_primary_therapy_outcome_success
    colnames(x)[c(198, 359)] # Open the dumped xlsx file, search "complete"
    # [1] "cgc_case_primary_therapy_outcome_success" "xml_primary_therapy_outcome_success" 
    # The cgc column is a character vector and the xml column is a factor,
    # so table() on the xml column also lists unused factor levels with a count of 0
    
    # write.table(as.matrix(x), file = "x.txt", sep="\t") NOT WORKING
    writexl::write_xlsx(as.data.frame(x), "~/Downloads/x.xlsx")
    
    kable(table(x[x$gdc_cases.project.project_id == "TCGA-PAAD", 
                  "xml_primary_therapy_outcome_success"] ))
    |Var1                        | Freq|
    |:---------------------------|----:|
    |                            |    0|
    |Complete Remission/Response |   43|
    |Partial Remission/Response  |    8|
    |Progressive Disease         |   40|
    |Stable Disease              |    8|
    |0                           |    0|
    |1                           |    0|
    |2                           |    0|
    |NO                          |    0|
    |YES                         |    0|
    
    x[x$gdc_cases.project.project_id == "TCGA-PAAD" & !is.na(x$xml_primary_therapy_outcome_success), 
      c("xml_bcr_patient_barcode", "xml_vital_status", "cgc_case_pathologic_stage", 
        "cgc_case_primary_therapy_outcome_success")] |> dim()
    # [1] 99  4
    
    x[x$gdc_cases.project.project_id == "TCGA-PAAD", "cgc_case_primary_therapy_outcome_success"] |> table() |> kable()
    |Var1                        | Freq|
    |:---------------------------|----:|
    |Complete Remission/Response |   44|
    |Partial Remission/Response  |    8|
    |Progressive Disease         |   40|
    |Stable Disease              |    8|
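
The drug-name tables above mix generic names, brand names and misspellings (e.g. GEMCITABINE, GEMZAR, GEMCITIBINE), so the counts are split across several rows. A minimal sketch of collapsing them before tabulating follows. The synonym table and the helper name normalize_drug() are assumptions made for illustration, not part of recount or of the original analysis; ambiguous entries such as "FU7" or "CHEMO, NOS" are deliberately left unmapped.

```r
# Hypothetical synonym table: names are raw (upper-cased) entries,
# values are the canonical generic drug name. Extend as needed.
drug_synonyms <- c(
  "5 FU"                  = "FLUOROURACIL",
  "5-FU"                  = "FLUOROURACIL",
  "5FU"                   = "FLUOROURACIL",
  "5-FLUOROURACIL"        = "FLUOROURACIL",
  "GEMZAR"                = "GEMCITABINE",   # brand name
  "GEMCITIBINE"           = "GEMCITABINE",   # misspelling seen in the data
  "GEMCITABINE INJECTION" = "GEMCITABINE",
  "XELODA"                = "CAPECITABINE",  # brand name
  "FOLINIC ACID"          = "LEUCOVORIN",
  "LEUCOVORIN CALCIUM"    = "LEUCOVORIN"
)

# Upper-case and trim the input, then replace any entry found in the
# synonym table; entries without a mapping are returned unchanged.
normalize_drug <- function(x) {
  key <- toupper(trimws(x))
  hit <- key %in% names(drug_synonyms)
  key[hit] <- unname(drug_synonyms[key[hit]])
  key
}

raw <- c("Gemzar", "gemcitabine", "5 FU", "5-FU", "Xeloda", "FU7")
table(normalize_drug(raw))
```

Applied to the PAAD table above, e.g. table(normalize_drug(x$cgc_drug_therapy_drug_name)), the four gemcitabine rows (68 + 1 + 1 + 7) would fall into a single GEMCITABINE row of 77, under the assumption that the mapping above is appropriate for the analysis.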
    

Understand TCGA Barcode/Sample ID

https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

The sample label "TCGA.06.0675.11A.32R.A36H.07" is a TCGA aliquot barcode in which the usual "-" separators have been replaced by "." (this happens, for example, when R converts barcodes into syntactic column names). The barcode encodes the biospecimen hierarchy: project, tissue source site, participant, sample type and vial, portion and analyte, plate, and center. It does not directly encode the tumor type or the anatomical site; those are looked up from the tissue source site and the clinical data.

Here's a breakdown of the label components, following the GDC documentation linked above:

  • "TCGA" - Project. This is the prefix used for all TCGA samples.
  • "06" - Tissue source site (TSS). This identifies the institution that contributed the sample; the GDC code tables map each TSS code to a site and a study.
  • "0675" - Participant. This is the identifier of the study participant within the tissue source site. The first three fields together ("TCGA-06-0675") form the patient barcode.
  • "11A" - Sample type + vial. Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29, so "11" is a normal sample (Solid Tissue Normal), not a primary tumor (which would be "01"). "A" is the vial, i.e. the order of this sample in a sequence of samples (vial = a tube holding the specimen).
  • "32R" - Portion + analyte. "32" is the order of the portion in a sequence of portions cut from the sample, and "R" is the molecular analyte extracted for analysis, here RNA.
  • "A36H" - Plate. This is the identifier of the 96-well plate that held the aliquot.
  • "07" - Center. This is the code of the sequencing or characterization center that received the aliquot; see the GDC center code table.
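
The field layout described in the GDC barcode documentation can be sketched as a small base-R parser. The function name parse_tcga_barcode() and the names of the returned fields are chosen here for illustration; only the positions of the fields and the 01-09 / 10-19 / 20-29 sample-type ranges come from the documentation.

```r
# Split a full TCGA aliquot barcode into its fields.
# Accepts "." as a separator as well as "-", since barcodes used as
# R column names often have "-" converted to ".".
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(gsub(".", "-", barcode, fixed = TRUE), "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7, parts[1] == "TCGA")

  sample_code <- as.integer(substr(parts[4], 1, 2))
  sample_class <- if (sample_code <= 9) {
    "tumor"
  } else if (sample_code <= 19) {
    "normal"
  } else {
    "control"
  }

  list(
    project      = parts[1],
    tss          = parts[2],                       # tissue source site
    participant  = parts[3],
    patient      = paste(parts[1:3], collapse = "-"),
    sample       = substr(parts[4], 1, 2),         # sample type code
    vial         = substr(parts[4], 3, 3),
    portion      = substr(parts[5], 1, 2),
    analyte      = substr(parts[5], 3, 3),         # e.g. "R" for RNA
    plate        = parts[6],
    center       = parts[7],
    sample_class = sample_class
  )
}

b <- parse_tcga_barcode("TCGA.06.0675.11A.32R.A36H.07")
b$patient       # "TCGA-06-0675"
b$sample_class  # "normal"
```

The patient field has the same form as the cgc_case_id values used earlier (e.g. "TCGA-F2-6880"), so it can serve as a key when joining sample-level data to patient-level clinical data.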

in silico method/data/approach

  • A simple in silico approach to generate gene-expression profiles from subsets of cancer genomics data.
    • It also demonstrated how to use cbioportal to do data analysis.
    • It focuses on gene expression, DNA methylation and protein expression.
    • Supp.
    • Fig 1 (Representative analysis). KM plot.
    • Fig 2 (Representative analysis). mRNA expression of genes boxplot colored by mutation
    • Fig 3 (Representative analysis). mRNA expression vs mutation boxplot
    • Fig 4 (Representative analysis). Methylation vs mRNA expression (counts) colored by mutation
    • Fig 5 (Representative analysis). RPPA vs mutation. RPPA vs mRNA expression

Microbiome

Predicting cancer prognosis and drug response from the tumor microbiome Hermida 2022.