Tcga: Difference between revisions
Revision as of 15:48, 11 April 2023
Resources
- TCGA to Complete its Final Analysis: the PanCanAtlas, Data download from gdc portal.
- The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.
- TCGA cancer types: Chinese-English glossary
- https://github.com/cBioPortal/cbioportal
- TCGAWorkflow package. Rank 7/29. It suggests
- TCGAbiolinksGUI.data (rank 2/416) Why?
- TCGAbiolinks package (rank 91/2083) & TCGAbiolinksGUI
- curatedTCGAData (rank 34/416)
- TCGAWorkflowData (rank 54/416)
- HarmonizedTCGAData (276/416)
- RTCGAToolbox, Downloading RNAseq, me450k and clinical data from TCGA melanoma tumours (rank 251/2083).
- https://gdac.broadinstitute.org/
- RTCGA::downloadTCGA() uses the above website. RTCGA.RPPA package. RPPA = reverse phase protein array.
- cgdsr package
- Understanding TCGA mRNA Level3 analysis results files from FireBrowse
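- TCGA hands-on compendium
- TCGA study notes 01: data download and tidying, Downloading and extracting TCGA clinical data, Tidying TCGA clinical data, TCGA sample naming explained in detail, TCGA data download and ID conversion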
- UCSCXenaTools
- UCSCXenaTools: Retrieve Gene Expression and Clinical Information from UCSC Xena for Survival Analysis
- https://docs.ropensci.org/UCSCXenaTools/
- Several hubs are used. For example TCGA Hub is from https://tcga.xenahubs.net/
- TCGA processed RNA-Seq data (GSE62944) as a SummarizedExperiment
- This was used in GSEABenchmarkeR package vignette
- LinkedOmics. TCGA-KIRC
- Comparison of RTCGA, UCSCXenaTools (CRAN), TCGAbiolinks & curatedTCGAData
- TCGA Pancancer Clinical Data RData & tsv.
- cBioPortal FAQs
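PanCancer atlas vs Firehose legacy
What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets? https://docs.cbioportal.org/user-guide/faq/#what-are-tcga-firehose-legacy-datasets-and-how-do-they-compare-to-the-publication-associated-datasets-and-the-pancancer-atlas-datasets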
Relationship of TCGA, Firehose and cBioportal
https://www.biostars.org/p/366545/
Some examples
- Colon sample case: Clinical data
- Compared to firebrowse.org, the cbioportal website provides clinical data in a format that is easier to use. It also offers several web-based analysis tools, though they may not be needed here.
- http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy); click the last icon "View clinical and genomic data of this study"
- Tab "Summary" and check "Colon Adenocarcinoma 392"
- Click the download button "Download clinical data for the selected cases"
- Save the file "coadread_tcga_clinical_data.tsv". It is a tab delimited text file. Column 3 is "Sample ID", column W is Disease Free (Months), column X is Disease Free Status, column BB is Overall Survival (Months), column BC is Overall Survival Status. (Cf. firebrowse gives a complicated table).
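Once the file is saved, the survival columns can be pulled out by name rather than by spreadsheet letter. The following is only a minimal sketch: it builds a tiny made-up stand-in for coadread_tcga_clinical_data.tsv so that it runs on its own, and the "1:DECEASED" / "0:LIVING" status coding is an assumption that should be checked against the real file.

```r
# Toy stand-in for coadread_tcga_clinical_data.tsv (all values are made up).
toy <- c(
  "Study ID\tPatient ID\tSample ID\tOverall Survival (Months)\tOverall Survival Status",
  "coadread_tcga\tTCGA-XX-0001\tTCGA-XX-0001-01\t12.5\t1:DECEASED",
  "coadread_tcga\tTCGA-XX-0002\tTCGA-XX-0002-01\t30.1\t0:LIVING"
)
path <- tempfile(fileext = ".tsv")
writeLines(toy, path)

# check.names = FALSE keeps headers such as "Sample ID" unchanged.
clin <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)

surv <- data.frame(
  sample    = clin[["Sample ID"]],
  os_months = clin[["Overall Survival (Months)"]],
  # Assumed coding "1:DECEASED" / "0:LIVING": keep the leading digit as the event flag.
  os_event  = as.integer(sub(":.*$", "", clin[["Overall Survival Status"]]))
)
print(surv)
```

For the real download, replace `path` with the location of the saved tsv file; the disease-free columns can be handled the same way.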
- Colon sample case: Mutation data
- http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy) > Query By Gene
- Query
- Check "Mutations"
- Uncheck "Putative copy-number alterations from GISTIC"
- Check "mRNA Expression. Select one of the profiles below: > mRNA expression z-scores relative to diploid samples (microarray)"
- Select Patient/Case Set: All samples (640)
- Copy contents in oncomineGene.txt to "Enter Genes:" > Replace gene symbol: MRE11A:MRE11, RB:RB1
- oncomineGene.txt has 169 genes, including both RB and RB1. The website changes RB to RB1, which leaves two RB1 entries. Remove the duplicate RB1 to get 168 genes in total.
- Submit Query
- Click the left-most tab "Download" > Mutations (OQL is not in effect) Tab Delimited Format
- Save the file mutations.txt for 640 samples.
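The gene-list clean-up described above (rename MRE11A and RB, then drop the duplicate RB1) can also be done in R before pasting into "Enter Genes:". This is a minimal sketch on a made-up five-gene vector; for the real list one would presumably start from readLines("oncomineGene.txt").

```r
# Toy stand-in for the contents of oncomineGene.txt.
genes <- c("TP53", "RB", "RB1", "MRE11A", "KRAS")

# Replacements named in the steps above: old symbol -> current symbol.
alias <- c(MRE11A = "MRE11", RB = "RB1")

# Swap in the current symbol wherever an old one is found.
hit <- genes %in% names(alias)
genes[hit] <- unname(alias[genes[hit]])

# Renaming RB creates a second RB1; unique() keeps the first occurrence.
genes <- unique(genes)
print(genes)
```

Checking length(genes) before and after is a quick way to confirm that exactly one entry was dropped (169 to 168 for the full list).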
- Colon sample case: RNASeq
- http://firebrowse.org/ (good for downloading gene expression data)
- Click tab "BROAD GDAC"
- In the row "Colon adenocarcinoma COAD 460", click the last Browse link (the one in the Data column)
- Click "mRNAseq_Preprocess" and save the file gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
- Unzip the above file and pick "COAD.uncv2.mRNAseq_raw_counts.txt"
- Remove rows whose gene name is a question mark (?).
- Remove the "|" and the numeric ID that follow each gene name.
- SLC35E2 has 2 rows in COAD.uncv2.mRNAseq_raw_counts.txt. Pick the first one (SLC35E2|728661).
- Second method - GDC data portal website. See Download data from TCGA, GDC RNA-Seq Data Processing – July 27, 2020 GDC Monthly Webinar (video)
- Third method - TCGAbiolinks R package
- Fourth method - GenomicDataCommons R package. Tutorial:Protocol To Downlad TCGA Data From GDC
- Fifth method - GDC Data Transfer Tool Tutorial:Protocol To Downlad TCGA Data From GDC
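The clean-up steps listed for COAD.uncv2.mRNAseq_raw_counts.txt in the first method can be sketched in base R as follows. The data frame is a toy stand-in: the column names gene_id, s1 and s2, the count values, and the ID after the second SLC35E2 are illustrative assumptions, not taken from the real file.

```r
# Toy stand-in for the raw counts table; gene IDs look like "SYMBOL|number".
counts <- data.frame(
  gene_id = c("?|100130426", "A1BG|1", "SLC35E2|728661", "SLC35E2|9906"),
  s1 = c(0, 10, 5, 7),
  s2 = c(1, 20, 6, 8),
  stringsAsFactors = FALSE
)

# 1. Remove rows whose gene name is "?".
counts <- counts[!startsWith(counts$gene_id, "?"), ]

# 2. Remove "|" and everything after it, leaving the gene symbol.
counts$gene_id <- sub("\\|.*$", "", counts$gene_id)

# 3. Keep the first row of any repeated symbol (here SLC35E2|728661).
counts <- counts[!duplicated(counts$gene_id), ]

rownames(counts) <- counts$gene_id
print(counts)
```

Keeping the first duplicate relies on the rows still being in file order, so the filtering should be done before any sorting.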
- Pancreatic sample: RNASeq
- https://www.cbioportal.org/datasets
- Pancreatic Adenocarcinoma (TCGA, PanCancer Atlas)
- Click the download icon in the same line and save the file paad_tcga_pan_can_atlas_2018.tar.gz
- Unzip the above file and pick "data_mrna_seq_v2_rsem.txt"
Tumor vs normal
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
- https://cloud.r-project.org/web/packages/UCSCXenaShiny/vignettes/api.html
- How can I get the normal sample of TCGA data?
- How do I compare tumor vs normal expression?
- Compare tumor vs normal within or across tissue types
- curatedTCGAData, MultiAssayExperiment and curatedTCGAData Bioconductor 2020 workshop (video)
data(sampleTypes, package = "TCGAutils")
head(sampleTypes)
unique(sampleTypes$Definition) # 17 types
grep("Normal", unique(sampleTypes$Definition), value = TRUE)
# [1] "Blood Derived Normal" "Solid Tissue Normal"
# [3] "Buccal Cell Normal" "EBV Immortalized Normal"
# [5] "Bone Marrow Normal"
- TCGAbiolinks package.
See the example KM Plot for gene of interest (e.g. TP53) using TCGA-PAAD dataset. TCGAquery_SampleTypes(barcode, typesample). For typesample, TP=PRIMARY SOLID TUMOR, NT=Solid Tissue Normal.
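When only the barcodes are at hand, tumor and normal samples can also be separated without any package by reading the two-digit sample-type code (characters 14-15 of a hyphenated barcode). This is a rough sketch, not a replacement for TCGAquery_SampleTypes(); the second barcode is made up for illustration.

```r
barcodes <- c(
  "TCGA-06-0675-11A-32R-A36H-07",  # barcode discussed elsewhere on this page
  "TCGA-XX-0001-01A-11R-A000-07"   # made-up barcode for illustration
)

# Characters 14-15 hold the sample-type code.
type_code <- as.integer(substr(barcodes, 14, 15))

# 01-09 = tumor, 10-19 = normal, 20-29 = control.
group <- cut(type_code, breaks = c(0, 9, 19, 29),
             labels = c("tumor", "normal", "control"))

print(data.frame(barcode = barcodes, type_code = type_code, group = group))
```

The fixed character positions assume the standard hyphenated barcode layout; barcodes truncated before the sample field would yield NA.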
Drug response
- Evaluating the molecule-based prediction of clinical drug responses in cancer Ding 2016 Bioinformatics, "bioinformatics_32_19_2891_s1.zip". 572 samples.
library(readxl)
library(knitr)  # for kable()
dat <- read_excel("~/Downloads/bioinformatics_32_19_2891_s1/bioinfo16_supplementary_tables.xlsx", "Table S2", skip=2)
dat <- dat[-1, ]
dim(dat)
# [1] 2572 14

kable(table(dat$Cancer))
|Var1 | Freq|
|:-----------------------------------------------------------------------|----:|
|Adrenocortical carcinoma (ACC) | 13|
|Bladder Urothelial Carcinoma (BLCA) | 164|
|Brain Lower Grade Glioma (LGG) | 162|
|Breast invasive carcinoma (BRCA) | 389|
|Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 97|
|Colon adenocarcinoma (COAD) | 192|
|Esophageal carcinoma (ESCA) | 25|
|Glioblastoma multiforme (GBM) | 10|
|Head and Neck squamous cell carcinoma (HNSC) | 112|
|Kidney Chromophobe (KICH) | 2|
|Kidney renal clear cell carcinoma (KIRC) | 14|
|Kidney renal papillary cell carcinoma (KIRP) | 14|
|Liver hepatocellular carcinoma (LIHC) | 29|
|Lung adenocarcinoma (LUAD) | 151|
|Lung squamous cell carcinoma (LUSC) | 69|
|Mesothelioma (MESO) | 80|
|Ovarian serous cystadenocarcinoma (OV) | 11|
|Pancreatic adenocarcinoma (PAAD) | 99|
|Pheochromocytoma and Paraganglioma (PCPG) | 7|
|Prostate adenocarcinoma (PRAD) | 45|
|Rectum adenocarcinoma (READ) | 67|
|Sarcoma (SARC) | 101|
|Skin Cutaneous Melanoma (SKCM) | 137|
|Stomach adenocarcinoma (STAD) | 243|
|Testicular Germ Cell Tumors (TGCT) | 159|
|Thyroid carcinoma (THCA) | 10|
|Uterine Carcinosarcoma (UCS) | 83|
|Uterine Corpus Endometrial Carcinoma (UCEC) | 87|

kable(table(dat$drug_name))
|Var1 | Freq|
|:--------------------------------------------------|----:|
|Aldesleukin | 6|
|Alverine | 1|
|Anastrozole | 18|
|anti-A5B1 integrin monoclonal antibody PF-04605412 | 1|
|anti-endosialin/TEM1 monoclonal antibody MORAb-004 | 1|
|autologous vaccine | 1|
|Axitinib | 3|
|AZD2171 | 1|
|Bacillus Calmette-Guerin (BCG) | 1|
|BCG | 2|
|Bevacizumab | 48|
|Bicalutamide | 17|
|Bleomycin | 54|
|BRAF inhibitor | 1|
|Cabazitaxel | 2|
|Cabozantinib | 1|
|Cancer Vax | 2|
|Capecitabine | 55|
|Carboplatin | 181|
|Carmustine | 6|
|Cetuximab | 21|
|Chemo, Multi-Agent, NOS | 1|
|Chemo, NOS | 3|
|Cilengtide | 1|
|Cisplatin | 330|
|Copolang | 5|
|COPOLANG CAPS | 1|
|Cyclophosphamide | 103|
|cyclophosphamide, vincristine, and dacarbazine | 1|
|Cyclosporine | 1|
|Dabrafenib | 4|
|Dacarbazine | 30|
|Dactinomycin | 3|
|Dasatinib | 8|
|Degarelix | 1|
|Denosumab | 2|
|Dexamethasone | 6|
|Didox | 3|
|Docetaxel | 106|
|Docetaxel +/- Zactima | 1|
|Doxorubicin | 108|
|doxorubicin/cyclophosphamide | 1|
|E7389 | 2|
|Enoticumab | 1|
|Epirubicin | 28|
|Epoetin alfa | 1|
|Eribulin | 1|
|Erlotinib | 7|
|Etoposide | 87|
|Everolimus | 6|
|Everolimus, Gemcitabine, and Cisplatin | 1|
|Exemestane | 3|
|EZN-2968 | 1|
|Fluorouracil | 212|
|Folfiri | 1|
|Folfox | 2|
|FOLFOX | 2|
|Fotemustine | 3|
|Fulvestrant | 1|
|Gefitinib | 2|
|Gemcitabine | 165|
|Gemox | 1|
|Goserelin | 8|
|GP-100 | 1|
|GP100 | 1|
|HSC vaccine injection | 1|
|Hydrocortisone | 1|
|Hydroxyurea | 1|
|Ifosfamid | 1|
|Ifosfamide | 24|
|Imatinib | 3|
|Infliximab | 2|
|Interferon alfa-n1 | 6|
|Interferon alfacon-1 | 8|
|iodine I 131 monoclonal antibody 81C6 | 1|
|Ipilimumab | 11|
|Irinotecan | 30|
|Ixabepilone | 1|
|Ketoconazole | 1|
|Lapatinib | 2|
|Letrozole | 5|
|Leucovorin | 93|
|Leuprolide | 16|
|Levothyroxine | 1|
|Liothyronine | 7|
|Lomustine | 11|
|LY228820 | 1|
|Megestrol acetate | 2|
|MEL-44 | 2|
|Melphalan | 6|
|Methotrexate | 15|
|Methylprednisolone | 1|
|Mitomycin | 7|
|Mitotane | 1|
|Mitoxantrone | 1|
|Mycophenolic acid | 6|
|Nilutamide | 2|
|nivolumab | 1|
|Ondansetron | 1|
|Oxaliplatin | 75|
|Paclitaxel | 172|
|Pamidronate | 3|
|Panitumumab | 1|
|Pazopanib | 6|
|Pegfilgrastim | 6|
|Pemetrexed | 44|
|PI-88 | 1|
|Platinum | 5|
|PNU-159548 | 1|
|Poly E | 1|
|Polyplatillen | 2|
|Procarbazine | 8|
|px-866 | 1|
|R1507 | 1|
|Raloxifene | 1|
|recMAGE- A3 | 1|
|recombinant interferon-α2b | 1|
|Regorafenib | 1|
|RenAmin | 1|
|Resiquimod | 1|
|ridaforolimus | 1|
|rigosertib | 1|
|Rituximab | 1|
|Sargramostim | 2|
|Sorafenib | 17|
|Streptozocin | 1|
|Sulindac | 1|
|Sunitinib | 10|
|Talimogene Laherparepvec (T-VEC) | 1|
|Tamoxifen | 24|
|tegafur-gimeracil-oteracil potassium | 3|
|Temozolomide | 116|
|Temsirolimus | 3|
|Thalidomide | 1|
|Themozolomide | 2|
|Threshold-302 | 1|
|Topotecan | 4|
|Toremifene | 1|
|Trabectedin | 3|
|Trametinib | 2|
|Trastuzumab | 17|
|Trelstar | 2|
|triptorelin | 1|
|Tyrosine kinase inhibitor | 1|
|veliparib | 2|
|Vemurafenib | 3|
|Vinblastine | 16|
|Vincristine | 13|
|Vinorelbine | 31|
|Vorinostat | 3|
|Yervoy | 2|
|Zoledronate | 2|

kable(table(dat$Cancer[dat$drug_name == "Gemcitabine"]))
|Var1 | Freq|
|:-----------------------------------------------------------------------|----:|
|Bladder Urothelial Carcinoma (BLCA) | 48|
|Breast invasive carcinoma (BRCA) | 1|
|Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 2|
|Esophageal carcinoma (ESCA) | 2|
|Liver hepatocellular carcinoma (LIHC) | 3|
|Lung adenocarcinoma (LUAD) | 7|
|Lung squamous cell carcinoma (LUSC) | 10|
|Mesothelioma (MESO) | 6|
|Pancreatic adenocarcinoma (PAAD) | 60|
|Pheochromocytoma and Paraganglioma (PCPG) | 1|
|Sarcoma (SARC) | 22|
|Skin Cutaneous Melanoma (SKCM) | 1|
|Uterine Carcinosarcoma (UCS) | 1|
|Uterine Corpus Endometrial Carcinoma (UCEC) | 1|

kable(table(dat$drug_name[dat$Cancer == "Pancreatic adenocarcinoma (PAAD)"]))
|Var1 | Freq|
|:----------------|----:|
|Capecitabine | 6|
|Carboplatin | 1|
|Cyclophosphamide | 1|
|Docetaxel | 1|
|Doxorubicin | 1|
|Erlotinib | 1|
|Fluorouracil | 13|
|Gemcitabine | 60|
|Irinotecan | 3|
|Leucovorin | 4|
|Oxaliplatin | 6|
|Paclitaxel | 2|
- The above data was used by Predicting cancer prognosis and drug response from the tumor microbiome Hermida 2022.
- TCGA immunotherapy-treated melanoma data. This uses the recount::all_metadata() function.
library(knitr)  # for kable()

# Get all metadata
metadata_clean <- recount::all_metadata("tcga")
dim(metadata_clean)
# [1] 11284 864
kable(table(metadata_clean$gdc_cases.project.project_id))
|Var1 | Freq|
|:---------|----:|
|TCGA-ACC | 79|
|TCGA-BLCA | 433|
|TCGA-BRCA | 1246|
|TCGA-CESC | 309|
|TCGA-CHOL | 45|
|TCGA-COAD | 546|
|TCGA-DLBC | 48|
|TCGA-ESCA | 198|
|TCGA-GBM | 175|
|TCGA-HNSC | 548|
|TCGA-KICH | 91|
|TCGA-KIRC | 616|
|TCGA-KIRP | 323|
|TCGA-LAML | 126|
|TCGA-LGG | 532|
|TCGA-LIHC | 424|
|TCGA-LUAD | 601|
|TCGA-LUSC | 555|
|TCGA-MESO | 87|
|TCGA-OV | 430|
|TCGA-PAAD | 183|
|TCGA-PCPG | 187|
|TCGA-PRAD | 558|
|TCGA-READ | 177|
|TCGA-SARC | 265|
|TCGA-SKCM | 473|
|TCGA-STAD | 453|
|TCGA-TGCT | 156|
|TCGA-THCA | 572|
|TCGA-THYM | 122|
|TCGA-UCEC | 589|
|TCGA-UCS | 57|
|TCGA-UVM | 80|

# Get only PAAD project
x <- metadata_clean[metadata_clean$gdc_cases.project.project_id == "TCGA-PAAD",]
dim(x)
# [1] 183 864
class(x)
# [1] "DFrame"
x$xml_tumor_response_cdus_type
# [1] <NA> <NA> <NA> ... (all 183 values are <NA>)
# Levels: Complete response Partial response Progression Stable

kable(table(toupper(x$cgc_drug_therapy_drug_name)))
|Var1 | Freq|
|:---------------------|----:|
|5 FU | 3|
|5-FLUOROURACIL | 3|
|5-FU | 4|
|5FU | 1|
|ABRAXANE | 2|
|CAPECITABINE | 2|
|CHEMO, NOS | 2|
|CISPLATIN | 2|
|CYCLOPHOSPHAMIDE | 1|
|DOCETAXEL | 1|
|FLUOROURACIL | 2|
|FOLINIC ACID | 1|
|FU7 | 1|
|GEMCITABINE | 68|
|GEMCITABINE INJECTION | 1|
|GEMCITIBINE | 1|
|GEMZAR | 7|
|LEUCOVORIN | 3|
|LEUCOVORIN CALCIUM | 2|
|OXALIPLATIN | 4|
|XELODA | 2|

x$gdc_cases.submitter_id
sum(x$gdc_cases.project.project_id == "TCGA-PAAD")
# [1] 183
x$cgc_case_primary_therapy_outcome_success
x$cgc_case_id == "TCGA-F2-6880"
x$xml_bcr_patient_barcode
x$xml_vital_status
x$xml_tumor_type
x$xml_primary_therapy_outcome_success

colnames(x)[c(198, 359)] # Open the dumped xlsx file, search "complete"
# [1] "cgc_case_primary_therapy_outcome_success" "xml_primary_therapy_outcome_success"
# cgc column is characters and xml column is a factor

# write.table(as.matrix(x), file = "x.txt", sep="\t") NOT WORKING
writexl::write_xlsx(as.data.frame(x), "~/Downloads/x.xlsx")

kable(table(x[x$gdc_cases.project.project_id == "TCGA-PAAD", "xml_primary_therapy_outcome_success"] ))
|Var1 | Freq|
|:---------------------------|----:|
| | 0|
|Complete Remission/Response | 43|
|Partial Remission/Response | 8|
|Progressive Disease | 40|
|Stable Disease | 8|
|0 | 0|
|1 | 0|
|2 | 0|
|NO | 0|
|YES | 0|

x[x$gdc_cases.project.project_id == "TCGA-PAAD" & !is.na(x$xml_primary_therapy_outcome_success),
  c("xml_bcr_patient_barcode", "xml_vital_status", "cgc_case_pathologic_stage", "cgc_case_primary_therapy_outcome_success")] |> dim()
# [1] 99 4

x[x$gdc_cases.project.project_id == "TCGA-PAAD", "cgc_case_primary_therapy_outcome_success"] |> table() |> kable()
|Var1 | Freq|
|:---------------------------|----:|
|Complete Remission/Response | 44|
|Partial Remission/Response | 8|
|Progressive Disease | 40|
|Stable Disease | 8|
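The drug-name tables above list the same agent under several spellings (for example "5 FU", "5-FU", "5FU" and "FLUOROURACIL", or "GEMZAR", "GEMCITIBINE" and "GEMCITABINE"). Below is a minimal sketch of collapsing such variants before tabulating. The synonym map is hand-written for illustration and covers only a few of the variants; `normalize_drug` is a hypothetical helper, not part of any package.

```r
# A few raw spellings of the kind seen in the tables above.
raw <- c("5 FU", "5-FU", "5FU", "Fluorouracil",
         "GEMZAR", "GEMCITIBINE", "Gemcitabine Injection", "Gemcitabine")

# Hand-curated map (an assumption): upper-cased variant -> preferred name.
synonyms <- c(
  "5 FU"                  = "FLUOROURACIL",
  "5-FU"                  = "FLUOROURACIL",
  "5FU"                   = "FLUOROURACIL",
  "5-FLUOROURACIL"        = "FLUOROURACIL",
  "GEMZAR"                = "GEMCITABINE",
  "GEMCITIBINE"           = "GEMCITABINE",
  "GEMCITABINE INJECTION" = "GEMCITABINE"
)

# Hypothetical helper: upper-case, trim, then replace known variants.
normalize_drug <- function(x, map) {
  key <- toupper(trimws(x))
  hit <- key %in% names(map)
  key[hit] <- unname(map[key[hit]])
  key
}

drug_counts <- table(normalize_drug(raw, synonyms))
print(drug_counts)
```

Names that are not in the map pass through unchanged (only upper-cased), so the map can be grown incrementally while checking the remaining rare entries by eye.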
Understand TCGA Barcode/Sample ID
https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
The label "TCGA.06.0675.11A.32R.A36H.07" is a TCGA barcode, the standardized identifier that The Cancer Genome Atlas (TCGA) project assigns to each biospecimen. The canonical form uses hyphens (TCGA-06-0675-11A-32R-A36H-07); the dots typically appear when R converts barcodes into syntactic column names. The barcode encodes the tissue source site, the participant, the sample type and vial, the portion and analyte, the plate, and the center that received the aliquot.
Here's a breakdown of the label components:
- "TCGA" - The project name, used as the prefix for all TCGA samples.
- "06" - The tissue source site (TSS), i.e. the institution that collected the sample. It is not a disease code by itself, but each TSS code is tied to one study; in the GDC code tables, 06 is Henry Ford Hospital (glioblastoma multiforme).
- "0675" - The participant (patient) ID, an identifier assigned to each patient whose samples were included in the TCGA study.
- "11A" - The sample type (11) plus vial (A). Code 11 is "Solid Tissue Normal", not a primary tumor. Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29. The vial letter gives the order of the sample in a sequence of samples (vial = a tube for collecting something).
- "32R" - The portion (32) plus analyte (R). The portion number gives the order of the portion taken from the sample; the analyte letter is the type of molecule extracted for analysis, and R means RNA.
- "A36H" - The plate ID, the order of the plate in a sequence of 96-well plates.
- "07" - The center code, identifying the sequencing or characterization center that received the aliquot; in the GDC code tables, 07 is the University of North Carolina.
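The barcode layout documented on the GDC page linked above (project, tissue source site, participant, sample + vial, portion + analyte, plate, center) can be split programmatically. This is a small sketch in base R; `parse_tcga_barcode` is a hypothetical helper written for this page, and it only handles full seven-field aliquot barcodes.

```r
# Hypothetical helper: split a full aliquot-level TCGA barcode into its fields.
parse_tcga_barcode <- function(barcode) {
  # Accept the dotted form that R produces for column names.
  hyphenated <- gsub(".", "-", barcode, fixed = TRUE)
  parts <- strsplit(hyphenated, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)
  list(
    project     = parts[1],
    tss         = parts[2],                 # tissue source site
    participant = parts[3],
    sample_type = substr(parts[4], 1, 2),   # e.g. 11 = Solid Tissue Normal
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),   # e.g. R = RNA
    plate       = parts[6],
    center      = parts[7]
  )
}

fields <- parse_tcga_barcode("TCGA.06.0675.11A.32R.A36H.07")
str(fields)
```

Shorter barcodes (patient-level or sample-level) would need the length check relaxed; the codes themselves still have to be looked up in the GDC code tables.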