Genome
Visualization
- Awesome genome visualization
- Bioconductor > BiocViews > Visualization. Search 'genom' as the keyword.
Ten simple rules
Ten simple rules for developing visualization tools in genomics
IGV
- UserGuide
- Inspect alignments with IGV in Anders2013
- For paired data we will see two reads (one is -> and the other is <- direction) from the same fragment.
nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB ~/binary/IGV_2.3.52/igv.sh
Simulated DNA-Seq
The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.
Whole genome
Whole exome
- (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
- (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.
File:Igv gse48215.png File:Igv srp066363.png
RNA-Seq
- (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
- (Right) GSE46876, UCSC/hg19.
File:Igv anders2013 rna.png File:Igv gse46876 rna.png
Tell DNA or RNA
- DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
- RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).
ChromoMap
ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes
RNA-seq DRaMA
https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest
Gviz
GIVE: Genomic Interactive Visualization Engine
Build your own genome browser
ChromHeatMap
Heat map plotting by genome coordinate.
ggbio
NOISeq package
Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.
rtracklayer
R interface to genome browsers and their annotation tracks
- Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.
ssviz
A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).
Sushi
See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.
cBioPortal, TCGA, PanCanAtlas
- TCGA to Complete its Final Analysis: the PanCanAtlas, Data download from gdc portal.
- The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.
- TCGA癌症中英文对照
- https://github.com/cBioPortal/cbioportal
- TCGAWorkflow package. Rank 7/29. It suggests
- TCGAbiolinksGUI.data (rank 2/416) Why?
- TCGAbiolinks package (rank 91/2083) & TCGAbiolinksGUI
- curatedTCGAData (rank 34/416)
- TCGAWorkflowData (rank 54/416)
- HarmonizedTCGAData (276/416)
- RTCGAToolbox, Downloading RNAseq, me450k and clinical data from TCGA melanoma tumours (rank 251/2083).
- https://gdac.broadinstitute.org/
- RCTGA::downloadTCGA() uses the above website. RTCGA.RPPA package. RPPA = reverse phase protein array.
- cgdsr package
- Understanding TCGA mRNA Level3 analysis results files from FireBrowse
- TCGA实战大全
- TCGA学习01:数据下载与整理, TCGA下载和提取临床数据, TCGA临床数据整理, TCGA样本命名详解. TCGA数据下载与ID转换
- UCSCXenaTools
- UCSCXenaTools: Retrieve Gene Expression and Clinical Information from UCSC Xena for Survival Analysis
- https://docs.ropensci.org/UCSCXenaTools/
- Several hubs are used. For example TCGA Hub is from https://tcga.xenahubs.net/
- TCGA processed RNA-Seq data (GSE62944) as a SummarizedExperiment
- This was used in GSEABenchmarkeR package vignette
- LinkedOmics. TCGA-KIRC
- Comparison of RTCGA, UCSCXenaTools (CRAN), TCGAbiolinks & curatedTCGAData
- TCGA Pancancer Clinical Data RData & tsv.
- Colon sample case: Clinical data
- Compared to firebrowse.org, it seems cbioportal website has a good clinical data format to use. It also has several web-based analyses tool to use that may not be useful.
- http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy); click the last icon "View clinical and genomic data of this study"
- Tab "Summary" and check "Colon Adenocarcinoma 392"
- Click the download button "Download clinical data for the selected cases"
- Save the file "coadread_tcga_clinical_data.tsv". It is a tab delimited text file. Column 3 is "Sample ID", column W is Disease Free (Months), column X is Disease Free Status, column BB is Overall Survival (Months), column BC is Overall Survival Status. (Cf. firebrowse gives a complicated table).
- Colon sample case: Mutation data
- http://www.cbioportal.org/. Bowel > Colorectal Adenocarcinoma (TCGA, Firehose Legacy) > Query By Gene
- Query
- Check "Mutations"
- Uncheck "Putative copy-number alterations from GISTIC"
- Check "mRNA Expression. Select one of the profiles below: > mRNA expression z-scores relative to diploid samples (microarray)"
- Select Patient/Case Set: All samples (640)
- Copy contents in oncomineGene.txt to "Enter Genes:" > Replace gene symbol: MRE11A:MRE11, RB:RB1
- oncomineGene.txt has 169 genes. It has RB and RB1 genes. This website changes RB to RB1. There are 2 RB1. Remove one duplicate RB1 and total 168 genes.
- Submit Query
- Click the left-most tab "Download" > Mutations (OQL is not in effect) Tab Delimited Format
- Save the file mutations.txt for 640 samples.
- Colon sample case: RNASeq
- http://firebrowse.org/ (good for downloading gene expression data)
- Click tab "BROAD GDAC"
- Click last Browse (data column): Colon adenocarcinoma COAD 460 Browse Browse
- Click "mRNAseq_Preprocess" and save the file gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
- Unzip the above file and pick "COAD.uncv2.mRNAseq_raw_counts.txt"
- Remove rows with ? mark gene names.
- Remove | and numbers after gene names.
- SLC35E2 has 2 rows in COAD.uncv2.mRNAseq_raw_counts.txt. Pick the first one (SLC35E2|728661).
- Second method - GDC data portal website. See Download data from TCGA, GDC RNA-Seq Data Processing – July 27, 2020 GDC Monthly Webinar (video)
- Third method - TCGAbiolinks R package
- Fourth method - GenomicDataCommons R package. Tutorial:Protocol To Downlad TCGA Data From GDC
- Fifth method - GDC Data Transfer Tool Tutorial:Protocol To Downlad TCGA Data From GDC
- http://firebrowse.org/ (good for downloading gene expression data)
TCPA
Download. Level 4.
Qualimap
Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.
SeqMonk
SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.
dittoSeq
dittoSeq – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics
SeqCVIBE
SeqCVIBE – interactive analysis, exploration, and visualization of RNA-Seq data
Copy Number
Copy number work flow using Bioconductor
Detect copy number variation (CNV) from the whole exome sequencing
- http://www.ncbi.nlm.nih.gov/pubmed/24172663
- http://www.biomedcentral.com/1471-2164/15/661
- http://www.blog.biodiscovery.com/2014/06/16/comparison-of-cnv-detection-from-whole-exome-sequencing-wes-as-compared-with-snp-microarray/
- Bioconductor - CopyNumberVariation
- exomeCopy package
- exomeCNV package and more info including slides by Fah Sathirapongsasuti.
- ximmer and source
- dudeML: A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data
Whole exome sequencing != whole genome sequencing
Consensus CDS/CCDS
- UCSC
- http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=573539327_9Ga3TjKR7LNFYJ2JIMZKDNZfl9eF&c=chr1&g=ccdsGene
- Schema/Example
- The GenomicFeatures package can download the CCDS regions.
- The CCDS regions are convenient to use as genomic ranges for CNV detection in exome sequencing data, as they are often in the center of the targeted region in exome enrichment protocols and tend to have less variable coverage then the flanking regions -- exomeCopy package vignette.
DBS segmentation algorithm
DBS: a fast and informative segmentation algorithm for DNA copy number analysis
modSaRa2
An accurate and powerful method for copy number variation detection
Visualization
reconCNV: interactive visualization of copy number data from high-throughput sequencing 2021
NGS
File:CentralDogmaMolecular.png
See NGS.
R and Bioconductor packages
Resources
- HarvardX Biomedical Data Science Open Online Training, Bioinformatics Training at the Harvard Chan Bioinformatics Core
- Introduction to Bioinformatics and Computational Biology Xiaole Shirley Liu (ebook & videos)
- CSAMA 2017: Statistical Data Analysis for Genome-Scale Biology
- Some basics of biomaRt (and GenomicRanges)
- Annotating Ranges Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
library(VariantAnnotation) library(AnnotationHub) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(TxDb.Mmusculus.UCSC.mm10.ensGene) library(org.Hs.eg.db) library(org.Mm.eg.db) library(BSgenome.Hsapiens.UCSC.hg19)
- Analysis of RNA-Seq Data with R/Bioconductor by Girke in UC Riverside
- 2018 Data Intensive Biology Summer Institute at UC Davis
- R / Bioconductor for High-Throughput Sequence Analysis 2012 by Martin Morgan1 and Nicolas Delhomme
- Count-based differential expression analysis of RNA sequencing data using R and Bioconductor by Simon Anders
- Sequences, Genomes, and Genes in R / Bioconductor by Martin Morgan 2013.
- Orchestrating high-throughput genomic analysis with Bioconductor by Wolfgang Huber et al 2015.
- RNA-seq Analysis Example This is a script that will do differential gene expression (DGE) analysis for RNA-seq experiments using the bioconductor package edgeR. RPKMs were calculated for bar plots.
- An introduction to R and Bioconductor for the analysis of high-throughput sequencing data by Pascal MARTIN Oct 2018
- Bioinformatics Analysis of Omics Data with the Shell & R
Docker
Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.
Some workflows
RNA-Seq workflow
Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.
STAR Samtools Rsamtools fastq -----> sam ----------> bam ----------> bamfiles -| \ GenomicAlignments DESeq2 --------------------> se --------> dds GenomicFeatures GenomicFeatures / (SummarizedExperiment) (DESeqDataSet) gtf ----------------> txdb ---------------> genes -----|
- https://bioconductor.org/packages/3.12/workflows/
- RNA-Seq workflow
- RnaSeqGeneEdgeRQL
- RNA-Seq differential expression work flow using DESeq2
rnaseqGene
rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression
CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.
Sequence analysis
library(ShortRead) or library(Biostrings) (QA) gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information) GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count) edgeR or DESeq2 (gene expression analysis) library(org.Sc.sgd.db) or library(biomaRt)
Accessing Annotation Data
Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
library(org.Hs.eg.db) # Sample OrgDb Workflow library("hgu95av2.db") # Sample ChipDb Workflow library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow library(Homo.sapiens) # Sample OrganismDb Workflow library(AnnotationHub) # Sample AnnotationHub Workflow library("biomaRt") # Using biomaRt library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type | example package name | contents |
---|---|---|
OrgDb | org.Hs.eg.db | gene based information for Homo sapiens |
TxDb | TxDb.Hsapiens.UCSC.hg19.knownGene | transcriptome ranges for Homo sapiens |
OrganismDb | Homo.sapiens | composite information for Homo sapiens |
BSgenome | BSgenome.Hsapiens.UCSC.hg19 | genome sequence for Homo sapiens |
refGenome |
RNA-Seq Data Analysis using R/Bioconductor
- https://github.com/datacarpentry/rnaseq-data-analysis by Stephen Turner.
- Tutorial: Introduction to Bioconductor for high-throughput sequence analysis by UseR 2015
- systemPipeR Building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. An important feature is support for running command-line software, such as NGS aligners, on both single machines or compute clusters.
- BioC2015
recount2
- https://github.com/leekgroup/recount
- recount2 - A multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets.
- Human RNA-seq data from recount2 and related packages from BioC 2020 Workshop
- recount2 works with DESeq2, edgeR and limma-voom
dbGap
eQTL
Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN
GenomicDataCommons package
- Genomic Data Commons
- Use the GenomicDataCommons package to find and download variants from TCGA (NCI Genomic Data Commons Access) dataset and maftools package for analysis and visualization. See https://seandavi.github.io/talk/2018/02/08/bioconductor-a-potential-hub-in-the-cancer-biomarker-data-ecosystem/
- https://seandavi.github.io/post/2018/03/extracting-clinical-information-using-the-genomicdatacommons-package/
- The Cancer Data Ecosystem: Data and cloud resources for cancer data science
- The GenomicDataCommons and GEOquery Bioconductor Packages
Note:
- The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
- Each patient has 4 categories data and the 'case_id' is common to them:
- demographic: gender, race, year_of_birth, year_of_death
- diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
- exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
- main: disease_type, primary_site
- The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.
Visualization
GenVisR
ComplexHeatmap
Read counts
Rsubread
See this post for about C version of the featureCounts program.
RSEM
- RSEM, rsem-calculate-expression
- RSEM on Biowulf
$ mkdir SeqTestdata/RNASeqFibroblast/output $ sinteractive --cpus-per-task=2 --mem=10g $ module load rsem bowtie STAR $ rsem-calculate-expression -p 2 --paired-end --star \ ../test.SRR493366_1.fastq ../test.SRR493366_2.fastq \ /fdb/rsem/ref_from_genome/hg19 Sample1 # 12 seconds $ ls -lthog total 5.8M -rw-r----- 1 1.6M Nov 24 13:39 Sample1.genes.results -rw-r----- 1 2.5M Nov 24 13:39 Sample1.isoforms.results -rw-r----- 1 1.6M Nov 24 13:39 Sample1.transcript.bam drwxr-x--- 2 4.0K Nov 24 13:39 Sample1.stat $ wc -l Sample1.genes.results 26335 Sample1.genes.results $ wc -l Sample1.isoforms.results 51399 Sample1.isoforms.results $ head -2 Sample1.genes.results gene_id transcript_id(s) length effective_length expected_count TPM FPKM A1BG NM_130786 1766.00 1589.99 0.00 0.00 0.00 $ head -2 Sample1.isoforms.results transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct NM_130786 A1BG 1766 1589.99 0.00 0.00 0.00 0.00 $ head -1 /fdb/rsem/ref_from_genome/hg19.transcripts.fa >NM_130786 $ grep NM_130786 /fdb/igenomes/Homo_sapiens/UCSC/hg19/transcriptInfo.tab NM_130786 2721192635 2721199328 2721188175 2 8 431185
- An RNA-Seq Protocol for Differential Expression Analysis Owens 2019
- Cbioportal -> Breast Invasive Carcinoma (TCGA, PanCancer Atlas) -> Explore selected study -> Original data -> PanCanAtlas Publications -> RNA (Final) - EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv
- Evaluation and comparison of computational tools for RNA-seq isoform quantification 2017
- The Concepts of Mean Fragment Length and Effective Length in RNA Sequencing
- RSEM gene level result file (see here for an example) contains 5 essential columns (and the element saved by tximport() function) excluding transcript_id
- Effective length → length. This is different across samples. This is much shorter than Length (e.g. 105 vs 1).
- Expected count → count. This is the sum of the posterior probability of each read comes from this transcript over all reads.
- TPM → abundance. The sum of all transcripts' TPM is 1 million.
- FPKM (not kept?). FPKM_i = 10^3 / l_bar * TPM_i for gene i. So for each sample FPKM is a scaling of TPM.
R> dfpkm[1:5, 1:3] / txi.rsem$abundance[1:5, 1:3] 144126_210-T_JKQFX5 144126_210-T_JKQFX6 144126_210-T_JKQFX8 5S_rRNA 0.7563603 1.118008 0.8485292 5_8S_rRNA NaN NaN NaN 6M1-18 NaN NaN NaN 7M1-2 NaN NaN NaN 7SK 0.7563751 1.118029 0.8485281
- An example using tximport::tximport() and DESeq2::DESeqDataSetFromTximport. Note it directly uses round(expected_count) to get the integer-value counts. See the source of DESeqDataSetFromTximport() here. The tximport vignette has discussed two suggested ways of importing estimates for use with differential gene expression (DGE) methods in the section of "Downstream DGE in Bioconductor". The vignette does not say anything about "expected_count" from RSEM output.
txi.rsem <- tximport(files, type = "rsem", txIn = F, txOut = F) txi.rsem$length[txi.rsem$length == 0] <- 1 names(txi.rsem) # a list, # length = effective_length (matrix) # counts = expected counts column (matrix), non-integer # abundance = TPM (matrix) # countsFromAbundance = "no" # [1] "abundance" "counts" "length" # [4] "countsFromAbundance" sampleTable <- pheno[, c("EXPID", "PatientID")] rownames(sampleTable) <- colnames(txi.rsem$counts) dds <- DESeq2::DESeqDataSetFromTximport(txi.rsem, sampleTable, ~ PatientID) # using counts and average transcript lengths from tximport # # The DESeqDataSet class enforces non-negative integer values in the "counts" # matrix stored as the first element in the assay list. dds@assays@data@listData$counts[1:5, 1:3] # integer values. How to compute? # https://support.bioconductor.org/p/9134840/ dds@assays@data@listData$avgTxLength[1:5, 1:3] # effective_length plot(txi.rsem$counts[,1], dds@assays@data@listData$counts[,1]) abline(0, 1, col = 'red') # compare expected counts vs integer-value counts # a straight line ddsColl1 <- DESeq2::estimateSizeFactors(dds) # using 'avgTxLength' from assays(dds), correcting for library size # Question: how does the function correct for library size? ddsColl2 <- DESeq2::estimateDispersions(ddsColl1) # gene-wise dispersion estimates # mean-dispersion relationship # final dispersion estimates # Note: it seems estimateDispersions is not required # if we only want to get the normalized count (still need estimateSizeFactors()) # See ArrayTools/R/FilterAndNormalize.R cnts2 <- DESeq2::counts(ddsColl2, normalized = FALSE) all(dds@assays@data@listData$counts == cnts2) # [1] TRUE all(round(txi.rsem$counts) == cnts2 ) # [1] TRUE. So in this case round(expected values) = integer-value counts
- RSEM example on Odyssey
- A Short Tutorial for RSEM
- Hands-on Training in RNA-Seq Data Analysis* which includes Quantification using RSEM and Perform DE analysis. Note the expected count column was used in edgeR.
- Understanding RSEM: raw read counts vs expected counts. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
- RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Abundance estimates are given in terms of two measures. The output file (XXX_RNASeq.RSEM.genes.results) contains 7 columns: gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM.
- Expected counts: The is an estimate of the number of fragments that are derived from a given isoform or gene. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR or DESeq.
- TPM: This is the estimated fraction of transcripts made up by a given isoform or gene. The transcript fraction measure is preferred over the popular RPKM/FPKM measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species.
- The length or effective_length are different (though similar) for different samples for the same gene
- A scatter plot and correlation shows the expected_count and TPM are different
x <- read.delim("144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results") colnames(x) # [1] "gene_id" "transcript_id.s." "length" "effective_length" # [5] "expected_count" "TPM" "FPKM" plot(x[, "TPM"], x[, "expected_count"]) cor(x[, "TPM"], x[, "expected_count"]) # [1] 0.4902708 cor(x[, "TPM"], x[, "expected_count"], method = 'spearman') # [1] 0.9886384 x[1:5, "length"] [1] 105.01 161.00 473.00 68.00 304.47 x2 <- read.delim("144126_210-T_JKQFX6_v2.0.1.4.0_RNASeq.RSEM.genes.results") x2[1:5, "length"] # [1] 105.00 161.00 473.00 68.00 305.27 x[1:5, "effective_length"] # [1] 1.03 16.65 293.88 0.00 129.00 x2[1:5, "effective_length"] # [1] 1.58 17.82 293.05 0.00 128.86
- A benchmark for RNA-seq quantification pipelines. 2016 They compare the STAR, TopHat2, and Bowtie2 mapping methods and the Cufflinks, eXpress , Flux Capacitor, kallisto, RSEM, Sailfish, and Salmon quantification methods. RSEM slightly outperforming the rest.
- Downsample reads from Evaluation of Cell Type Annotation R Packages on Single Cell RNA-seq Data 2020.
Expected_count
Number of reads mapping to that transcript
- Understanding RSEM: raw read counts vs expected counts In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
R> x <- read.delim("41samples/165739~295-R~AM1I30~RNASEQ.genes.results") R> summary(x$expected_count) # Larger than TPM, contradict to the above statement Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 10 1346 556 533634 R> summary(x$TPM) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 0.00 0.16 35.58 7.50 70091.93 R> x[1:5, c("expected_count", "TPM")] expected_count TPM 1 6190.00 70091.93 2 0.00 0.00 3 0.00 0.00 4 0.00 0.00 5 795.01 171.67
- Alignment-based的转录本定量-RSEM/ the sum of the posterior probability of each read comes from this transcript over all reads
Examples
$ wc -l 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results 28110 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results $ head -n 4 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results | cut -f1,3,4,5,6,7 gene_id length effective_length expected_count TPM FPKM 5S_rRNA 105.01 1.03 1513.66 31450.7 23788.06 5_8S_rRNA 161 16.65 0 0 0 6M1-18 473 293.88 0 0 0
Second example
$ wc -l Sample_HS-578T_CB6CRANXX.genes.results 28110 $ head -1 Sample_HS-578T_CB6CRANXX.genes.results gene_id transcript_id.s. length effective_length expected_count TPM FPKM $ tail -4Sample_HS-578T_CB6CRANXX.genes.results septin 9/TNRC6C fusion uc010wto.1 41 0 0 0 0 svRNAa uc022bxg.1 22 0 0 0 0 tRNA Pro uc022bqx.1 65 0 0 0 0 unknown uc002afm.3 1117 922.85 0 0 0 $ wc -l Sample_HS-578T_CB6CRANXX.isoforms.results 78376 $ head -1 Sample_HS-578T_CB6CRANXX.isoforms.results transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct $ tail -4 Sample_HS-578T_CB6CRANXX.isoforms.results uc010wto.1 septin 9/TNRC6C fusion 41 0 0 0 0 0 uc022bxg.1 svRNAa 22 0 0 0 0 0 uc022bqx.1 tRNA Pro 65 0 0 0 0 0 uc002afm.3 unknown 1117 922.85 0 0 0 0
limma
- Differential expression analyses for RNA-sequencing and microarray studies
- Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
- Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.
RSEM
- RSEM expected counts to limma-voom. Choice 1: tximport. Choice 2: If you are working with RSEM gene-level expected counts, then you can just pass them to limma as if they were counts. That's what we do.
- Differential expression of RNA-seq data using limma and voom()
Time Course Experiments
- See Limma's vignette on Section 9.6
- Few time points (ANOVA, contrast)
- Which genes respond at either the 6 hour or 24 hour times in the wild-type?
- Which genes respond (i.e., change over time) in the mutant?
- Which genes respond differently over time in the mutant relative to the wild-type?
- Many time points (regression such as cubic spline, moderated F-test)
- Detect genes with different time trends for treatment vs control.
- Few time points (ANOVA, contrast)
easyRNASeq
Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
ShortRead
Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.
Rsamtools
The Rsamtools package provides an interface to BAM files.
The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts
This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
IRanges
IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.
The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.
If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.
flank
The example is obtained from ?IRanges::flank.
ir3 <- IRanges(c(2,5,1), c(3,7,3)) # IRanges of length 3 # start end width # [1] 2 3 2 # [2] 5 7 3 # [3] 1 3 3 flank(ir3, 2) # start end width # [1] 0 1 2 # [2] 3 4 2 # [3] -1 0 2 # Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE) # For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1] # == == flank(ir3, 2, start=FALSE) # start end width # [1] 4 5 2 # [2] 8 9 2 # [3] 4 5 2 # For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5] # == == flank(ir3, 2, start=c(FALSE, TRUE, FALSE)) # start end width # [1] 4 5 2 # [2] 3 4 2 # [3] 4 5 2 # Combine the ideas of the previous 2 cases. flank(ir3, c(2, -2, 2)) # start end width # [1] 0 1 2 # [2] 5 6 2 # [3] -1 0 2 # The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F) # For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6] # == == flank(ir3, -2, start=F) # start end width # [1] 2 3 2 # [2] 6 7 2 # [3] 2 3 2 # For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7] # == == flank(ir3, 2, both = TRUE) # start end width # [1] 0 3 4 # [2] 3 6 4 # [3] -1 2 4 # The original statement is equivalent to flank(ir3, 2, start=T, both=T) # (From the manual) If both = TRUE, extends the flanking region width positions into the range. # The resulting range thus straddles the end point, with width positions on either side. # For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3] # == # == == == == flank(ir3, 2, start=FALSE, both=TRUE) # start end width # [1] 2 5 4 # [2] 6 9 4 # [3] 2 5 4 # For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5] # == # == == == ==
Both IRanges and GenomicRanges packages provide the flank function.
Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.
- General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
- Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.
Biostrings
- Manipulate Biological Data Using Biostrings Package Exercises (Part 1)
- Manipulate Biological Data Using Biostrings Package Exercises (Part 2) - it covers global, local & overlap alignments.
GenomicRanges
GenomicRanges depends on IRanges package. See the dependency diagram below.
GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics | + +-----+ +- GenomeInfoDb | | GenomicAlignments +--- Rsamtools --+-----+ +--- Biostrings
The package defines some classes
- GRanges
- GRangesList
- GAlignments
- SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.
(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.
GenomicAlignments
Counting reads with summarizeOverlaps vignette
library(GenomicAlignments) library(DESeq) library(edgeR) fls <- list.files(system.file("extdata", package="GenomicAlignments"), recursive=TRUE, pattern="*bam$", full=TRUE) features <- GRanges( seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)), ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 300, 500, 500)), "-", group_id=c(rep("A", 4), rep("B", 5), rep("C", 2))) features # GRanges object with 11 ranges and 1 metadata column: # seqnames ranges strand | group_id # <Rle> <IRanges> <Rle> | <character> # [1] chr2L [1000, 1499] - | A # [2] chr2L [3000, 3499] - | A # [3] chr2L [4000, 4499] - | A # [4] chr2L [7000, 7599] - | A # [5] chr2R [2000, 2899] - | B # ... ... ... ... ... ... # [7] chr2R [3600, 3899] - | B # [8] chr2R [4000, 4899] - | B # [9] chr2R [7500, 7799] - | B # [10] chr3L [5000, 5499] - | C # [11] chr3L [5400, 5899] - | C # ------- # seqinfo: 3 sequences from an unspecified genome; no seqlengths olap # class: SummarizedExperiment # dim: 11 2 # exptData(0): # assays(1): counts # rownames: NULL # rowData metadata column names(1): group_id # colnames(2): sm_treated1.bam sm_untreated1.bam # colData names(0): assays(olap)$counts # sm_treated1.bam sm_untreated1.bam # [1,] 0 0 # [2,] 0 0 # [3,] 0 0 # [4,] 0 0 # [5,] 5 1 # [6,] 5 0 # [7,] 2 0 # [8,] 376 104 # [9,] 0 0 # [10,] 0 0 # [11,] 0 0
Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).
samtools view -h -o outputFile.bam inputFile.sam
A modified R code that works is
################################################### ### code chunk number 11: gff (eval = FALSE) ################################################### library(rtracklayer) fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/", "gtf/drosophila_melanogaster/", "Drosophila_melanogaster.BDGP5.25.62.gtf.gz") gffFile <- file.path(tempdir(), basename(fl)) download.file(fl, gffFile) gff0 <- import(gffFile, asRangedData=FALSE) ################################################### ### code chunk number 12: gff_parse (eval = FALSE) ################################################### idx <- mcols(gff0)$source == "protein_coding" & mcols(gff0)$type == "exon" & seqnames(gff0) == "4" gff <- gff0[idx] ## adjust seqnames to match Bam files seqlevels(gff) <- paste("chr", seqlevels(gff), sep="") chr4genes <- split(gff, mcols(gff)$gene_id) ################################################### ### code chunk number 12: gff_parse (eval = FALSE) ################################################### library(GenomicAlignments) # fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam") fls <- list.files(system.file("extdata", package="pasillaBamSubset"), recursive=TRUE, pattern="*bam$", full=TRUE) path <- system.file("extdata", package="pasillaBamSubset") bamlst <- BamFileList(fls) genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object assays(genehits)$counts ################################################### ### code chunk number 15: pasilla_exoncountset (eval = FALSE) ################################################### library(DESeq) expdata = MIAME( name="pasilla knockdown", lab="Genetics and Developmental Biology, University of Connecticut Health Center", contact="Dr. Brenton Graveley", title="modENCODE Drosophila pasilla RNA Binding Protein RNAi knockdown RNA-Seq Studies", pubMedIds="20921232", url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508", abstract="RNA-seq of 3 biological replicates of from the Drosophila melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs encoding pasilla, a mRNA binding protein and 4 biological replicates of the the untreated cell line.") design <- data.frame( condition=c("untreated", "untreated"), replicate=c(1,1), type=rep("single-read", 2), stringsAsFactors=TRUE) library(DESeq) geneCDS <- newCountDataSet( countData=assay(genehits), conditions=design) experimentData(geneCDS) <- expdata sampleNames(geneCDS) = colnames(genehits) ################################################### ### code chunk number 16: pasilla_genes (eval = FALSE) ################################################### chr4tx <- split(gff, mcols(gff)$transcript_id) txhits <- summarizeOverlaps(chr4tx, bamlst) txCDS <- newCountDataSet(assay(txhits), design) experimentData(txCDS) <- expdata
We can also check out ?summarizeOverlaps to find some fake examples.
tidybulk
- tidybulk: an R tidy framework for modular transcriptomic data analysis, paper
- A Tidy Transcriptomics introduction to RNA-Seq analyses
Bisque
Bisque: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.
chromoMap
Inference
- How well do RNA-Seq differential gene expression tools perform in a complex eukaryote? A case study in A. thaliana Froussios, Bioinformatics 2019
tximport
- Vignette. Search "offset".
- Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences by Soneson, et al. F1000Research. Open peer review.
- http://bioconductor.org/packages/release/bioc/html/tximport.html
$ head -5 quant.sf Name Length EffectiveLength TPM NumReads ENST00000456328.2 1657 1410.79 0.083908 2.46885 ENST00000450305.2 632 410.165 0 0 ENST00000488147.1 1351 1035.94 10.4174 225.073 ENST00000619216.1 68 24 0 0 ENST00000473358.1 712 453.766 0 0
- Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
$ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results gene_id transcript_id.s. length effective_length expected_count TPM FPKM 5S_rRNA uc021ofx.1 105.04 2.23 2039.99 49629.87 35353.97 $ R x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results") dim(x) # [1] 28109 7 names(x) # [1] "gene_id" "transcript_id.s." "length" "effective_length" # [5] "expected_count" "TPM" "FPKM" x[1:3, -2] # gene_id length effective_length expected_count TPM FPKM # 1 5S_rRNA 105.04 2.23 2039.99 49629.87 35353.97 # 2 5_8S_rRNA 161.00 21.19 0.00 0.00 0.00 # 3 6M1-18 473.00 302.74 0.00 0.00 0.00 y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results") dim(y) # [1] 78375 8 names(y) # [1] "transcript_id" "gene_id" "length" "effective_length" # [5] "expected_count" "TPM" "FPKM" "IsoPct" y[1:3, -1] # gene_id length effective_length expected_count TPM FPKM IsoPct # 1 5S_rRNA 110 3.06 0 0 0 0 # 2 5S_rRNA 133 9.08 0 0 0 0 # 3 5S_rRNA 92 0.00 0 0 0 0
DESeq2 or edgeR
- DESeq2 method
- DESeq2 steps
- DESeq2 uses the median of ratio method for normalization: briefly, the counts are divided by sample-specific size factors.
- Four steps: 1. Geometric mean is calculated for each gene across all samples (length=number of genes vector). 2. ratios: The counts for a gene in each sample is then divided by this mean. 3. The median of these (gene) ratios (aka size factor) in a sample is the size factor for that sample (length=number of samples). 4. the raw counts data is divided by these size factors sample by sample.
- Manual implementation. The results (size factor values) are not the same as counts(dds, normalized = TRUE) returns? Ans: see the help of counts()
- counts(). normalization factors is used by default (normalization factors always preempt size factors)
- sizeFactors() - sample by sample (a vector)
- normalizationFactors() - Gene-specific normalization factors for each sample (a matrix)
- DESeq2 with a large number of samples -> use DESEq2 to normalize the data and then use do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test. See this post. Or consider the limma-voom method instead, which will handle 1000 samples in a few seconds without the need for extra memory.
- edgeR normalization factor post. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the paper by Robinson & Oshlack 2010 for more details. Briefly, M-values are defined as the library size-adjusted log-ratio of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE.
- edgeR From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
- Can I feed TCGA normalized count data to EdgeR?
- counts() function and normalized counts.
- Why use Negative binomial distribution in RNA-Seq data?] and the Presentation by Simon Anders.
- XBSeq2 – a fast and accurate quantification of differential expression and differential polyadenylation. By using simulated datasets, they demonstrated that, overall, XBSeq2 performs equally well as XBSeq in terms of several statistical metrics and both perform better than DESeq2 and edgeR.
- DEBrowser: Interactive Differential Expression Analysis and Visualization Tool for Count Data Alper Kucukural 2018
- EdgeR or DESeq2? Comparing the performance of differential expression tools
- StatQuest:
- Differential gene expression and DESeq2 with Michael Love (#60)
- Downstream Bioinformatics Analysis of Omics Data with edgeR
Shrinkage estimators
- The package uses a negative binomial statistical model to fit the count data, and can account for differences in sequencing depth across samples.
- Shrinkage is a technique used to regularize the estimates of the parameters of the negative binomial model.
- The idea behind shrinkage is to pull the estimated values of the parameters towards a prior distribution, which can help to reduce the variability of the estimates and improve the stability of the results.
- The specific shrinkage method used in DESeq2 is called the "shrinkage prior for dispersion" method. This method involves adding a prior distribution on the dispersion parameter of the negative binomial model, which is used to control the degree of overdispersion in the data.
- This prior distribution is designed to shrink the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
- More details:
- The negative binomial model is used to model the count data, y, as a function of the mean, [math]\displaystyle{ \mu }[/math], and the dispersion, [math]\displaystyle{ \alpha }[/math]. The probability mass function of the negative binomial is given by:
- [math]\displaystyle{ \begin{align} P(y | \mu, \alpha) = (y + \alpha - 1)! / (y! * (\alpha - 1)!) * (\mu / (\mu + \alpha))^y * (\alpha / (\mu + \alpha))^\alpha \end{align} }[/math]
- The likelihood of the data is given by:
- [math]\displaystyle{ \begin{align} L(\mu, \alpha | y) = \prod_i [ P(y_i | \mu_i, \alpha) ] \end{align} }[/math]
- The log-likelihood is :
- [math]\displaystyle{ \begin{align} logL(\mu, \alpha | y) = \sum_i [ \log(P(y_i | \mu_i, \alpha)) ] \end{align} }[/math]
- In this model, [math]\displaystyle{ \mu }[/math] is the mean of the negative binomial for each gene and it is modeled as a linear function of the design matrix.
- [math]\displaystyle{ \begin{align} \mu_i = exp(X_i \beta) \end{align} }[/math]
- [math]\displaystyle{ \alpha }[/math] is the dispersion parameter and it's the same for all the genes, following the common practice in RNA-seq analysis
- The shrinkage prior is added on [math]\displaystyle{ \alpha }[/math], it assumes that [math]\displaystyle{ \alpha }[/math] is following a hyper-prior distribution like Gamma distribution
- [math]\displaystyle{ \begin{align} \alpha \sim \Gamma(a_0, b_0) \end{align} }[/math]
- This prior allows the shrinkage of [math]\displaystyle{ \alpha }[/math] estimates from the data towards a common value across all the genes, which can help to reduce the variance of the estimated log-fold changes.
- The goal is to find the values of mu and alpha that maximize the log-likelihood, this is done by using maximum likelihood estimation (MLE) or Bayesian approach where the prior are considered and integrated in the calculation and the result is the posterior probability.
- Dispersion parameter.
- In the context of the negative binomial model used in DESeq2, the dispersion parameter, alpha, is a measure of the degree of overdispersion in the data. In other words, it represents the variability of the data around the mean. A value of alpha greater than 1 indicates that the data is more dispersed (more variable) than would be expected if the data were following a Poisson distribution, which is a common distribution used to model count data. The Poisson distribution has a single parameter, the mean, which represents both the location and the scale of the distribution. In contrast, the negative binomial distribution has two parameters, the mean and the dispersion, which allows for more flexibility in fitting the data.
- The shrinkage method in DESeq2 involves shrinking the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
- What is the formula of the fold change estimator given by DESeq2?
- The fold change estimator given by the DESeq2 package is calculated as the ratio of the estimated mean expression levels in two conditions, with the log2 of this ratio being the log2 fold change. The mean expression levels are calculated using a negative binomial model, which accounts for both the mean and the overdispersion of the data.
- The estimated mean expression level for a gene i in condition j is given by:
- [math]\displaystyle{ \begin{align} \log(\mu_{i,j}) = \beta_j + X_i \gamma \end{align} }[/math]
- where [math]\displaystyle{ \beta_j }[/math] is the overall mean for condition [math]\displaystyle{ j }[/math], [math]\displaystyle{ X_i }[/math] is the design matrix for gene [math]\displaystyle{ i }[/math] and [math]\displaystyle{ \gamma_i }[/math] is the gene-specific effect.
- The log2 fold change is calculated as:
- [math]\displaystyle{ \begin{align} \log2(\mu_{i,j} / \mu_{i,k}) \end{align} }[/math]
- where j and k are the two conditions being compared.
- So, you can see that the fold change estimator depends on the design matrix [math]\displaystyle{ X }[/math] and the parameters of the model, [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \gamma }[/math]. The DESeq2 implementation also includes an estimation of the variance-covariance matrix of the parameters to compute the standard deviation (uncertainty) of these parameters, and therefore the standard deviation on the fold-change estimator. This can help to estimate the significance of the fold change between conditions.
- Hypothesis testing H0: log(FC)=0 using Wald test
- What is the variance of the estimated log-fold changes before and after applying DESeq2 method?
- In RNA-seq data analysis, the log-fold change is a measure of the relative difference in expression between two conditions. The log-fold change is calculated as the log2 of the ratio of the mean expression in one condition to the mean expression in another condition.
- Without using any methods like DESeq2, the variance of the estimated log-fold changes can be high, particularly for genes with low expression levels, which can lead to unreliable results. This high variance is due to the over-dispersion present in RNA-seq data, which results in a large variability in the estimated expression levels even for genes with similar means.
- When using the DESeq2 package, the shrinkage method is applied on dispersion parameter alpha, which helps to reduce the variance of the estimated log-fold changes. By applying a prior on alpha and by shrinking the estimates of alpha towards a common value across all the genes, the method reduces the variability of the estimates. This results in more stable and reliable estimates of the log-fold changes, which can improve the accuracy and robustness of the results of the differential expression analysis.
- Additionally, the DESeq2 package also accounts for differences in sequencing depth across samples, which can also help to reduce the variability of the estimated log-fold changes.
- how DESeq2 package accounts for differences in sequencing depth across samples
- The DESeq2 package accounts for differences in sequencing depth across samples by using the raw count data to estimate the normalized expression levels for each gene in each sample. This normalization process is necessary because sequencing depth can vary widely between samples, leading to differences in the overall number of reads and the apparent expression levels of the genes.
- The package uses a method called regularized-logarithm (rlog) transformation to normalize the data, which is a variance stabilization method that is based on the logarithm of the counts, but also adjusts for the total library size and the mean expression level.
- The method starts by computing a weighted mean of the counts across all samples, which is used as a reference. Next, for each sample, the counts are divided by the library size and then multiplied by the weighted mean of the counts. This scaling step corrects for differences in sequencing depth by making the library sizes comparable between samples.
- Then, the regularized logarithm (rlog) transformation is applied to the scaled counts, which is given by :
- [math]\displaystyle{ \begin{align} vst = \log(counts/sizeFactor + c) \end{align} }[/math]
- where c is a small positive constant added to the counts to stabilize the variance, size_factor is the ratio of library size for each sample over the weighted mean of the library size.
- The rlog transformation can stabilize the variance of the data and make the mean expression levels more comparable between samples. This transformed data can then be used for downstream analysis like calculating the fold changes.
- In addition to rlog transformation the DESeq2 package uses a negative binomial distribution to model the count data, this distribution helps to account for over-dispersion in the data, and shrinkage method on the dispersion parameter is applied as well to improve the stability of results. All of these techniques work together to help correct for sequencing depth differences across samples, which can improve the accuracy of the estimated fold changes and provide more robust results in differential gene expression analysis.
- type='apeglm' shrinkage only for use with 'coef'
Time course experiment
- RNA-seq workflow: gene-level exploratory analysis and differential expression using DESeq2
- Genes with small p values from this test are those which at one or more time points after time 0 showed a strain-specific effect. Tested by LRT.
- Wald tests for the log2 fold changes at individual time points can be investigated using the test argument
- Time course trend analysis from the edgeR's vignette. glmQLFTest()
- Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an ANOVA F-test for a normal linear model.
- Assuming gene expression changes smoothly over time, we can use a polynomial or a cubic spline curve with a certain number of degrees of freedom to model gene expression along time.
- We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The topTags function lists the top set of genes with most significant time effects.
- The total number of genes with significant (5% FDR) changes at different time points can be examined with decideTests.
DESeq2 experimental design and interpretation
DESeq2 experimental design and interpretation
Controlling for batch differences
The variable we are interested in ("condition") is placed after the batch variable.
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design= ~ batch + condition) dds <- DESeq(dds)
OR
dds <- DESeq(dds, test="LRT", reduced=~batch) res <- results(dds)
DESeq2 diagnostic plot
- RNA-Seq differential expression work flow using DESeq2. MA plot, dispersion plot, histogram of p-values, rlog transoformation.
- RNA-seq Analysis in R Annotation and Visualisation of Differential Expression Results. MA plot, Volcano plot, venn diagram, heatmap (ComplexHeatmap).
vst over rlog transformation
- https://twitter.com/mikelove/status/1420671546088173569. See RNA-seq workflow: gene-level exploratory analysis and differential expression. The transformed data can be used in computing sample distances, PCA, MDS, clustering, ....
- Homoscedasticity = homogeneity of variance
Expected counts | round() v /-- vst transformation ---\ Raw counts --> normalized counts -- -- Other analyses such as PCA, Hclust (sample distances). \-- rlog transformation ---/
Simulate negative binomial distribution data
- rnegbin() in sim.counts() from the ssizeRNA package
rnegbin(10000 * 10, lambda, 1 / disp) # 10000 genes, 20 samples # lambda: mean counts from control group, a matrix. # disp: dispersion parameter, a matrix.
Reducing false positives in differential analyses of large RNA sequencing data sets
Reducing false positives in differential analyses of large RNA sequencing data sets
Generalized Linear Models and Plots with edgeR
Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis
EBSeq
An R package for gene and isoform differential expression analysis of RNA-seq data
http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/
prebs
Probe region expression estimation for RNA-seq data for improved microarray comparability
DEXSeq
Inference of differential exon usage in RNA-Seq
rSeqNP
A non-parametric approach for detecting differential expression and splicing from RNA-Seq data
voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data
http://www.biosoft.hacettepe.edu.tr/voomDDA/
Pathway analysis
About the KEGG pathways
- MSigDB database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
- msigdb from Bioconductor
- KEGGREST package directly pull the data from kegg.jp. I can get 337 KEGG pathways for human ('hsa')
BiocManager::install("KEGGREST") library(KEGGREST) res <- keggList("pathway", "hsa") length(res) # 337
GSOAP
GSOAP: a tool for visualization of gene set over-representation analysis
clusterProfiler
fgsea: Fast Gene Set Enrichment Analysis
- Are fgsea and Broad Institute GSEA equivalent?. Note that enrichment score are the same though the way of calculating p-values can be different.
- DESeq results to pathways in 60 Seconds with the fgsea package
- GSEA plot for multiple comparisons
- Using the fast preranked gene set enrichment analysis (fgsea) package 2018
GSEABenchmarkeR: Reproducible GSEA Benchmarking
Towards a gold standard for benchmarking gene set enrichment analysis
hypeR
- hypeR: an R package for geneset enrichment workflows
- Efficient, Scalable, and Reproducible Enrichment Workflows from Bioc2020
GSEPD
GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display
SeqGSEA
http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html
BAGSE
BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020
GeneSetCluster
GeneSetCluster: a tool for summarizing and integrating gene-set analysis results
Pipeline
SPEAQeasy
SPEAQeasy – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses
Nextflow
- Nextflow: Data-driven computational pipelines, Blog.
- RNA-Seq pipeline
GeneTEFlow
GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data
pipeComp
pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools
SARTools
SEQprocess
SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package
GEMmaker
pasilla and pasillaBamSubset Data
pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.
pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).
BitSeq
Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.
ReportingTools
The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.
Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.
It is suggested by e.g. EnrichmentBrowser.
sequences
More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.
QuasR
Bioinformatics paper
CRAN/Bioconductor packages
ssizeRNA
- Sample Size Calculation for RNA-Seq Experimental Design
- B4B: Module 2 - RNAseq power calculation
RNASeqPower
RNASeqPower Sample size for RNAseq studies
RnaSeqSampleSize
Shiny app
rbamtools
Provides an interface to functions of the 'SAMtools' C-Library by Heng Li
refGenome
The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.
WhopGenome
Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.
TCGA2STAT
Simple TCGA Data Access for Integrated Statistical Analysis in R
TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.
source("https://bioconductor.org/biocLite.R") biocLite("CNTools") install.packages("TCGA2STAT")
The getTCGA() function allows to download various kind of data:
- gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
- miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
- mutation data (data.type="Mutation")
- methylation expression (data.type="Methylation")
- copy number changes (data.type="CNA_SNP")
TCGAbiolinks
- https://bioconductor.org/packages/3.12/TCGAbiolinks
- data access through GenomicDataCommons
- provides data both from the legacy Firehose pipeline used by the TCGA publications (alignments based on hg18 and hg19 builds2), and the GDC harmonized GRCh38 pipeline
- The help page of GDCquery does not say it clearly about the option of file.type. How to use TCGAbiolinks to download raw RSEM gene counts for specific type of cancer. file.type= "results" or file.type= "normalized_results".
- An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
library(TCGAbiolinks) library(SummarizedExperiment) query <- GDCquery(project = "TCGA-ACC", data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "normalized_results", experimental.strategy = "RNA-Seq", legacy = TRUE) gdcdir <- file.path("Waldron_PublicData", "GDCdata") GDCdownload(query, method = "api", files.per.chunk = 10, directory = gdcdir) # 79 files ACCse <- GDCprepare(query, directory = gdcdir) ACCse class(ACCse) dim(assay(ACCse)) # 19947 x 79 assay(ACCse)[1:3, 1:2] # symbol id length(unique(rownames(assay(ACCse)))) # 19672 rowData(ACCse)[1:2, ] # DataFrame with 2 rows and 3 columns # gene_id entrezgene ensembl_gene_id # <character> <integer> <character> # A1BG A1BG 1 ENSG00000121410 # A2M A2M 2 ENSG00000175899
- HTSeq counts data example
query2 <- GDCquery(project = "TCGA-ACC", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type="HTSeq - Counts") gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2") GDCdownload(query2, method = "api", files.per.chunk = 10, directory = gdcdir2) # 79 files ACCse2 <- GDCprepare(query2, directory = gdcdir2) ACCse2 dim(assay(ACCse2)) # 56457 x 79 assay(ACCse2)[1:3, 1:2] # ensembl id rowData(ACCse2)[1:2, ] DataFrame with 2 rows and 3 columns ensembl_gene_id external_gene_name original_ensembl_gene_id <character> <character> <character> ENSG00000000003 ENSG00000000003 TSPAN6 ENSG00000000003.13 ENSG00000000005 ENSG00000000005 TNMD ENSG00000000005.5
- Clinical data
acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical") dim(acc_clin) # [1] 92 71
- TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT], dataFilt[,samplesTP],"Normal", "Tumor") # 2nd example dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM, Cond1type = "LGG", Cond2type = "GBM", fdr.cut = 0.01, logFC.cut = 1, method = "glmLRT")
- Enrichment analysis
ansEA <– TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", RegulonList = rownames(dataDEGs)) TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC, GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat, nRGTab = rownames(dataDEGs), nBar = 20)
- mRNA Analysis Pipeline from GDC documentation.
- See the vignette in the SingscoreAMLMutations package.
- Papers
- RangedSummarizedExperiment class
- assay(()
- colData()
- rowData()
- assayNames()
- metadata()
> dim(colData(ACCse)) [1] 79 72 > dim(rowData(ACCse)) [1] 19947 3 > dim(assay(ACCse)) [1] 19947 79 > assayNames(ACCse) [1] "normalized_count" > assayNames(ACCse2) [1] "HTSeq - Counts" > metadata(ACCse) $data_release [1] "Data Release 25.0 - July 22, 2020"
curatedTCGAData
- Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
- data access through ExperimentHub
- provides data from the legacy Firehose pipeline
- Multi-omic Integration and Analysis of cBioPortal and TCGA data with MultiAssayExperiment from Bioc2020
- Public data resources and Bioconductor from Bioc2020
library(curatedTCGAData) library(MultiAssayExperiment) curatedTCGAData(diseaseCode = "*", assays = "*") curatedTCGAData(diseaseCode = "ACC") ACCmae <- curatedTCGAData("ACC", c("RPPAArray", "RNASeq2GeneNorm"), dry.run=FALSE) ACCmae dim(colData(ACCmae)) # 79 (samples) x 822 (features) head(metadata(colData(ACCmae))[["subtypes"]])
- Caveats for working with TCGA data
- Not all TCGA samples are cancer, there are a mix of samples in each of the 33 cancer types.
- Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data.
- There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates.
- Primary tumors should be selected using TCGAutils::TCGAsampleSelect and used as input to the subsetting mechanisms.
caOmicsV
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA ws used
Visualize multi-dimentional cancer genomics data including of patient information, gene expressions, DNA methylations, DNA copy number variations, and SNP/mutations in matrix layout or network layout.
Map2NCBI
The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).
> library(Map2NCBI) > GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/")) # choose [2], [n], and [1] to filter the build and feature information. # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files. > str(GeneList) 'data.frame': 52157 obs. of 15 variables: $ tax_id : chr "9606" "9606" "9606" "9606" ... $ chromosome : chr "1" "1" "1" "1" ... $ chr_start : num 11874 14362 17369 30366 34611 ... $ chr_stop : num 14409 29370 17436 30503 36081 ... $ chr_orient : chr "+" "-" "-" "+" ... $ contig : chr "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ... $ ctg_start : num 1874 4362 7369 20366 24611 ... $ ctg_stop : num 4409 19370 7436 20503 26081 ... $ ctg_orient : chr "+" "-" "-" "+" ... $ feature_name : chr "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ... $ feature_id : chr "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ... $ feature_type : chr "GENE" "GENE" "GENE" "GENE" ... $ group_label : chr "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ... $ transcript : chr "Assembly" "Assembly" "Assembly" "Assembly" ... $ evidence_code: chr "-" "-" "-" "-" ... > GeneList$feature_name[grep("^NAP", GeneList$feature_name)]
TCseq: Time course sequencing data analysis
http://bioconductor.org/packages/devel/bioc/html/TCseq.html
UCSC Xena
- Welcome to UCSC Xena, Source code. Note one should not confuse it with Xeva for analysis PDX data.
- UCSCXenaTools
RTCGA
https://www.bioconductor.org/packages/release/bioc/html/RTCGA.html
genefu
Computation of Gene Expression-Based Signatures in Breast Cancer
GEO
See the internal link at R-GEO.
GREIN: An interactive web platform for re-analyzing GEO RNA-seq data
GEO2RNAseq
GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data
Proteomics
- MatrixQCvis: Shiny-based interactive data-quality exploration for omics data
- R/Bioconductor tools for mass spectrometry-based proteomics
OlinkAnalyze
OlinkRPackage
- OlinkRPackage, CRAN
- Plasma proteomics reveals tissue-specific cell death and mediators of cell-cell interactions in severe COVID-19 patients 2021
- What is NPX? NPX, Normalized Protein eXpression, is Olink’s arbitrary unit which is in Log2 scale.
- Longitudinal proteomic analysis of severe COVID-19 reveals survival-associated signatures, tissue-specific cell death, and cell-cell interactions Filbin 2021. Olink data is available in here.
Protein-protein interaction/PPI
- https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
- Obtaining a protein-protein interaction network for a gene list in R and other related posts.
- Protein-Protein Interaction Graphs
- BioGRID, The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions
$ wc -l BIOGRID-ALL-4.4.214.tab.txt 12:07:59 2454000 BIOGRID-ALL-4.4.214.tab.txt
R> x <- read.delim("BIOGRID-ALL-4.4.214.tab.txt", skip=35, nrows=10) R> dim(x) [1] 10 11 R> x[1:2, ] INTERACTOR_A INTERACTOR_B OFFICIAL_SYMBOL_A OFFICIAL_SYMBOL_B 1 ETG6416 ETG2318 MAP2K4 FLNC 2 ETG84665 ETG88 MYPN ACTN2 ALIASES_FOR_A 1 JNKK|JNKK1|MAPKK4|MEK4|MKK4|PRKMK4|SAPKK-1|SAPKK1|SEK1|SERK1|SKK1 2 CMD1DD|CMH22|MYOP|RCM4 ALIASES_FOR_B EXPERIMENTAL_SYSTEM SOURCE 1 ABP-280|ABP280A|ABPA|ABPL|FLN2|MFM5|MPD4 Two-hybrid Marti A (1997) 2 CMD1AA Two-hybrid Bang ML (2001) PUBMED_ID ORGANISM_A_ID ORGANISM_B_ID 1 9006895 9606 9606 2 11309420 9606 9606
- Defining the extent of gene function using ROC curvature 2022
- https://string-db.org/, STRINGdb R package.
Journals
Biometrical Journal
- https://onlinelibrary.wiley.com/journal/15214036
- Author's Guideline
- Biostatistics (topic) journals from Wiley
Biostatistics
Bioinformatics
Genome Analysis section
BMC Bioinformatics
BioRxiv
PLOS
Software
BRB-SeqTools
https://brb.nci.nih.gov/seqtools/
WebMeV
GeneSpring
RNA-Seq
CCBR Exome Pipeliner
https://ccbr.github.io/Pipeliner/
Tibanna
Tibanna helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.
MOFA: Multi-Omics Factor Analysis
WGCNA
- It uses a network method to create distance matrix for genes. We can further to use Hierarchical clustering to create groups/modules/clusters of genes.
- Weighted gene co-expression network analysis
- Webinar #7 – Introduction to Weighted Gene Co-expression Network Analysis (video)
Benchmarking
Essential guidelines for computational method benchmarking
Simulation
Simulate RNA-Seq
- http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#RNA-Seq_simulators
- https://popmodels.cancercontrol.cancer.gov/gsr/packages/
Maq
Used by TopHat: discovering splice junctions with RNA-Seq
BEERS/Grant G.R. 2011
http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called BEERS and it was used in the STAR software paper.
For the command line options of <reads_simulator.pl> and more details about the config files that are needed/prepared by BEERS, see this gist.
This can generate paired end data but they are in one FASTA file.
$ sudo apt-get install cpanminus $ sudo cpanm Math::Random $ wget http://cbil.upenn.edu/BEERS/beers.tar $ $ tar -xvf beers.tar # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl> $ $ cd ~/Downloads/ $ mkdir beers_output $ mkdir beers_simulator_refseq && cd "$_" $ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz $ tar xzvf simulator_config_refseq.tar.gz $ ls -lth total 1.4G -rw-r--r-- 1 brb brb 44M Sep 16 2010 simulator_config_featurequantifications_refseq -rw-r--r-- 1 brb brb 7.7M Sep 15 2010 simulator_config_geneinfo_refseq -rw-r--r-- 1 brb brb 106M Sep 15 2010 simulator_config_geneseq_refseq -rw-r--r-- 1 brb brb 1.3G Sep 15 2010 simulator_config_intronseq_refseq $ cd ~/Downloads/ $ perl reads_simulator.pl 100 testbeers \ -configstem refseq \ -customcfgdir ~/Downloads/beers_simulator_refseq \ -outdir ~/Downloads/beers_output $ ls -lh beers_output total 3.9M -rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt -rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt -rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt -rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt -rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed -rw-r--r-- 1 brb brb 31K Mar 16 15:25 simulated_reads_testbeers.cig -rw-r--r-- 1 brb brb 22K Mar 16 15:25 simulated_reads_testbeers.fa -rw-r--r-- 1 brb brb 584 Mar 16 15:25 simulated_reads_testbeers.log $ wc -l simulated_reads2genes_testbeers.txt 102 simulated_reads2genes_testbeers.txt $ head -4 simulated_reads2genes_testbeers.txt seq.1 GENE.5600 seq.2 GENE.35506 seq.3 GENE.506 seq.4 GENE.34922 $ tail -4 simulated_reads2genes_testbeers.txt seq.97 GENE.4197 seq.98 GENE.8763 seq.99 GENE.19573 seq.100 GENE.18830 $ wc -l simulated_reads_indels_testbeers.txt 36131 simulated_reads_indels_testbeers.txt $ head -2 simulated_reads_indels_testbeers.txt chr1:6052304-6052531 25 1 G chr2:73899436-73899622 141 3 ATA $ tail -2 simulated_reads_indels_testbeers.txt chr4:68619532-68621804 1298 -2 AA chr21:32554738-32554962 174 1 T $ wc -l simulated_reads_substitutions_testbeers.txt 71678 simulated_reads_substitutions_testbeers.txt $ head -2 simulated_reads_substitutions_testbeers.txt chr22:50902963-50903167 50903077 G->A chr1:6052304-6052531 6052330 G->C $ wc -l simulated_reads_junctions-crossed_testbeers.txt 49 simulated_reads_junctions-crossed_testbeers.txt $ head -2 simulated_reads_junctions-crossed_testbeers.txt seq.1a chrX:49084601-49084713 seq.1b chrX:49084909-49086682 $ cat beers_output/simulated_reads_testbeers.log Simulator run: 'testbeers' started: Thu Mar 16 15:25:39 EDT 2017 num reads: 100 readlength: 100 substitution frequency: 0.001 indel frequency: 0.0005 base error: 0.005 low quality tail length: 10 percent of tails that are low quality: 0 quality of low qulaity tails: 0.8 percent of alt splice forms: 0.2 number of alt splice forms per gene: 2 stem: refseq sum of gene counts: 3,886,863,063 sum of intron counts = 1,304,815,198 sum of intron counts = 2,365,472,596 intron frequency: 0.355507598105262 padded intron frequency: 0.52453796437909 finished at Thu Mar 16 15:25:58 EDT 2017 $ wc -l simulated_reads_testbeers.fa 400 simulated_reads_testbeers.fa $ head simulated_reads_testbeers.fa >seq.1a CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC >seq.1b GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA >seq.2a GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG >seq.2b ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT >seq.3a CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC # Take a look at the true coordinates $ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span chrX 49084529 49084601 + chrX 49084713 49084739 + chrX 49084863 49084909 + chrX 49086682 49086734 + $ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates seq.1a chrX 49084529 73M111N27M 49084529-49084601, 49084713-49084739 + CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC seq.1b chrX 49084863 47M1772N53M 49084863-49084909, 49086682-49086734 - GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA seq.2a chr1 183516256 100M 183516256-183516355 - GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG seq.2b chr1 183515275 100M 183515275-183515374 + ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT $ wc -l simulated_reads_testbeers.fa 400 simulated_reads_testbeers.fa $ wc -l simulated_reads_testbeers.bed 247 simulated_reads_testbeers.bed $ wc -l simulated_reads_testbeers.cig 200 simulated_reads_testbeers.cig
Flux Sammeth 2010
SimNGS
SimSeq
A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.
empiricalFDR.DESeq2
http://biorxiv.org/content/early/2014/12/05/012211
The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.
Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.
polyester
http://biorxiv.org/content/early/2014/12/05/012211
Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.
Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.
Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.
Too many dependencies. Got an error in installation.. It seems it has not considered splice junctions.
seqgendiff
Data-based RNA-seq simulations by binomial thinning
Simulate DNA-Seq
- Software list - https://popmodels.cancercontrol.cancer.gov/gsr/packages/
- A comparison of tools for the simulation of genomic next-generation sequencing data Merly Escalona 2016
wgsim
- Used by Cleaning clinical genomic data: Simple identification and removal of recurrently miscalled variants in single genomes bioRxiv 2017
- (How to) Simulate reads using a reference genome ALT contig
- [http://research.cs.wisc.edu/wham/comparison-using-wgsim/ Comparing WHAM with BWA using wgsim
- http://biobits.org/samtools_primer.html
dwgssim
- Whole Genome Simulator for Next-Generation Sequencing
- Example: How to generate variant calls with bcftools
NEAT
- https://github.com/zstephens/neat-genreads
- Simulating next-generation sequencing datasets from empirical mutation and sequencing models Zachary Stephens, 2016
- If I set 10 as the coverage rate and read length 101, the generated fq file is about 34GB (3.3GB * 10) for each one of the pairs.
DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads
http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html
$ head simDNA_100bp_16del.fasta >Pt-0-100 TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT >Pt-10-110 GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC >Pt-20-120 GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA >Pt-30-130 TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA >Pt-40-140 ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT
Simulate Whole genome
- DWGSIM mentioned by Variant Callers for Next-Generation Sequencing Data: A Comparison Study. For its usage, see http://davetang.org/wiki/tiki-index.php?page=DWGSIM
Simulate whole exome
- https://www.biostars.org/p/66714/ (no final answer)
- Wessim: a whole-exome sequencing simulator based on in silico exome capture Sangwoo Kim 2013 & software
SCSIM
SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data
Variant simulator
Mutation-Simulator
SigProfilerSimulator
Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator
Convert FASTA to FASTQ
It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.
If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.
$ cat test.fasta >Pt-0-50 TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT >Pt-10-60 GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC >Pt-20-70 GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC >Pt-30-80 GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT >Pt-40-90 AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT $ perl ~/Downloads/fasta_to_fastq.pl test.fasta @Pt-0-50 TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @Pt-10-60 GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @Pt-20-70 GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @Pt-30-80 GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @Pt-40-90 AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Alternatively we can use just one line of code by awk
$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \ test.fasta > test.fq $ cat test.fq @Pt-0-50 TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @Pt-10-60 GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @Pt-20-70 GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @Pt-30-80 GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @Pt-40-90 AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Change the 'H' to the quality score value that you need (Depending what phred score scale you are using).
Simulate genetic data
‘Simulating genetic data with R: an example with deleterious variants (and a pun)’
PDX/Xenograft
- https://en.wikipedia.org/wiki/Patient_derived_xenograft
- Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts? by Tso et al, BMC Genomics, 2014.
- Map xenograft reads by Array Suite
- PDXFinder includes PDMR as one of 6 providers. The website source code https://github.com/PDXFinder/pdxfinder.
- A Colorectal Carcinoma example contains 'genomic data'. Each row represents one seq position. No raw FASTQ files available.
- NCI Patient-Derived Models Repository (PDMR).
- ftp://dctdftp.nci.nih.gov/pub/pdm/
- The ftp link can be obtained by clicking 'PDMR Models' -> 'PDMR database' -> Click here to access the PDMR Database -> 'Genomic Analysis'.
- There will be three tabs: NCI Cancer Genome Panel (4246 records), Whole Exome Sequence (785 records) and RNASeq (807 records).
- For Whole Exome Sequence, VCF was provided. For RNASeq, RSEM files per genes or isoforms are available.
- The RNASeq Transcriptome Data Analysis Pipeline and Specifications and Whole Exome Sequencing Data Analysis Pipeline and Specifications are available under SOPs. RNASeq TPM data.
- Reproducible Bioinformatics Project
- Xenome a tool for classifying reads from xenograft samples, Thomas Conway et al 2012.
#!/bin/bash module load gossamer xenome index -M 24 -T 16 -P idx \ -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \ -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
- Next-Generation Sequence Analysis of Cancer Xenograft Models by Fernando J. Rossello et al 2013.
- Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers Bradford et al 2016
- An open-source application for disambiguating two species in next generation sequencing data from grafted samples by Ahdesmäki MJ et al 2016. Disambiguation
- Next-Generation Sequencing Analysis and Algorithms for PDX and CDX Models by Garima Khandelwal et al 2017. bamcmp software.
- Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts by Maurizio Callari et al 2018. Both RNA-Seq and DNA-Seq are considered. Software ICRG.
- XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data Kluin et al 2018. Software in github.
- BBSplit, PDX Whole Exome Seq Analysis Pipeline, RNASeq Transciptome Data Analysis Pipeline from PDMR. Tool to separate human and mouse rna seq reads.
- Whole genomes define concordance of matched primary, xenograft, and organoid models of pancreas cancer Gendoo et al 2019
- Integrative Pharmacogenomics Analysis of Patient-Derived Xenografts 2019
- Genomic data analysis workflows for tumors from patient-derived xenografts (PDXs): challenges and guidelines 2019
- Xeva package. Analysis of patient-derived xenograft (PDX) data. Paper Integrative Pharmacogenomics Analysis of Patient-Derived Xenografts Mer, 2019. Molecular profile and pharmacologic profile.
- GSE78806 - microarray-based Gene expression data, PDX passages, and tissue information
- High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response Gao, 2015. Molecular profiles including mutation, CNA, RNASeq-based gene expression, and pharmacologic profiles.
RNA-Seq
- RNASeq Transciptome Data Analysis Pipeline and Specifications by Mocha
- PDX Mouse reads are removed from the raw FASTQ files using bbsplit (bbtools v37.36).
- The fastq files are mapped to human transcriptome based on exon models from hg19 using Bowtie2 (version 2.2.6) [1]. The resulting SAM files are converted to BAM forma using samtools [2] and the coordinations in BAM are converted to the genomic (hg19) coordinations using RSEM (version 1.2.31). Gene and transcript quantifications are also done using RSEM.
- Removal of Small nucleolar RNAs (snoRNAs)
- Platform GPL16791 Illumina HiSeq 2500
- GPL25431 Illumina HiSeq 4000 (Homo sapiens; Mus musculus)
- GSE118197
- GPL18573 Illumina NextSeq 500
- GSE106336 MYC regulates ductal-neuroendocrine lineage plasticity in pancreatic ductal adenocarcinoma associated with poor outcome and chemoresistance
- Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic Data Analysis. This is posted on rna-seqblog. See the link Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers.
- PDXGEM: patient-derived tumor xenograft-based gene expression model for predicting clinical response to anticancer therapy in cancer patients
DNA-Seq
- Endocrine-Therapy-Resistant ESR1 Variants Revealed by Genomic Characterization of Breast-Cancer-Derived Xenografts
- https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&f=study&term=xenograft&go=Go
- ERP021871
MAF (TCGA, GDC)
- See D3Oncoprint manual. D3oncoprint uses java to create the interface (java -jar D3Oncoprint.jar). The result is shown in a browser. The top part is a heatmap and the bottom part is a table. The heatmap is good only for a small number of variants and there is no way to zoom in if we have a lot of variants (eg 1000). The phenotypes file is optional. The following 3 columns are required and the others are for tooltip and table. The column name from a VCF file input is given below.
- Gene column: GENE
- Variant type column: ExonicFunc.refGene
- Aminoacid change column: AAChange.refGene (this defines the name of variant. The variant uses the single letter aminoacid codes. This information is only used in HTML table. For example if AAChange.refGene=UGT1A7:NM_019077:exon1:c.T387G:p.N129K, then the variant will be N129K. My observation is the variant name for variant type 'frameshift_deletion' is altered in table; R128fs -> R128Xfs)
- https://www.reddit.com/r/bioinformatics/comments/cmdza5/vcf_and_maf_file_formats/ which links to
- The Standing Operating Procedure (SOP) describes procedures for generating consensus/intersect variants calls for reporting in the NCI Patient-Derived Models database
- Convert vcf to maf. https://hpc.nih.gov/apps/vcf2maf.html
- maftools : Summarize, Analyze and Visualize MAF Files (oncoplots)
- https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html
- Use TCGAbiolinks package to download maf file.
- http://firebrowse.org/. For example, BRCA -> Mutation Annotation File (from bar plot on RHS). Pick any data. The (extracted) file name will be *.maf.txt.
DNA Seq Data
NIH
- Go to SRA/Sequence Read Archiveand type the keywords 'Whole Genome Sequencing human'. An example of the procedures to search whole genome sequencing data from human samples:
- Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
- The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
- The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
- I pick the first one Accession:SRP066837 from this page. The page shows the Study type is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
- (Important trick) Click the number next to Run. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
- Download the raw data from any one of them (eg SRR2968056). For whole genome, the Strategy is WGS. For whole exome, the Strategy is called WXS.
- Search the keywords 'nonsynonymous' and 'human' in PMC
Use SRAToolKit instead of wget to download
Don't use the wget command since it requires the specification of right http address.
Downloading SRA data using command line utilities
SRA2R - a package to import SRA data directly into R.
(Method 1) Use the fastq-dump command. For example, the following command (modified from the document will download the first 5 reads and save it to a file called <SRR390728.fastq> (NOT sra format) in the current directory.
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O . # OR /opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar
This will download the files in FASTQ format.
(Method 2) If we need to downloading by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.
(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list
brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz aspera-connect-3.6.2.117442-linux-64.sh brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh Installing Aspera Connect Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only. Restart firefox manually to load the Aspera Connect plug-in Install complete. brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \ -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \ [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra . SRR590795.sra 100% 239MB 535Mb/s 00:06 Completed: 245535K bytes transferred in 7 seconds (272848K bits/sec), in 1 file. brb@T3600 ~/Downloads $
Aspera is typically 10 times faster than FTP according to the website. For this case, wget takes 12s while ascp uses 7s.
Note that the URL on the website's is wrong. I got the correct URL from emailing to ncbi help. Google: ascp "[email protected]"
SRAdb package
https://bioconductor.org/packages/release/bioc/html/SRAdb.html
First we install some required package for XML and RCurl.
sudo apt-get update sudo apt-get install libxml2-dev sudo apt-get install libcurl4-openssl-dev
and then
source("https://bioconductor.org/biocLite.R") biocLite("SRAdb")
SRA
The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud
Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers
SRA Explorer
SRP056969
- Inference of RNA decay rate from transcriptional profiling highlights the regulatory programs of Alzheimer’s disease
- RNA-Seq reveals mRNA stability a marker in Alzheimer’s patients
- REMBRANDTS: REMoving Bias from Rna-seq ANalysis of Differential Transcript Stability
SRP066363 - lung cancer
- Platform: GPL11154 Illumina HiSeq 2000 (Homo sapiens)
- Overall design: RNAseq and DNA copy number analysis of H1975 cells
- Strategy: 6 RNA-Seq and 3 Whole exome. Paired. 9 samples
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066363
- http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74866
SRP015769 or SRP062882 - prostate cancer
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP015769 5 are from normal and 5 are from tumor. Whole Exome Seq.
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062882 6 normal and the rest are tumor.
SRP053134 - breast cancer
Look at the MBases value column. It determines the coverage for each run.
SRP050992 single cell RNA-Seq
Used in Design and computational analysis of single-cell RNA-sequencing experiments
Single cell RNA-Seq
- Exploiting single-cell expression to characterize co-expression replicability
- NGS -> Single cell RNA-Seq
SRP040626 or SRP040540 - Colon and rectal cancer
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040626
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040540
OmicIDX
Tutorials
See the BWA section.
Whole Exome Seq
- 1000genomes. 1000genomes and tcga are two places to get vcf files too.
- Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing (Bao 2014)
- A framework for variation discovery and genotyping using next-generation DNA sequencing data. See the table 1 there.
- Some data from SRA repository.
Whole Genome Seq
- BioProject and
- Search: filter by 'Homo sapiens wgs'
- Project data type: Genome sequencing
- Click 'Date' to sort by it
- 20 hits as of 1/11/2017 (many of them do not have data)
- PRJNA352450 18 experiments
- SRX2341045 20.5M spots, 4.1G bases, 1.8Gb
- PRJNA343545 48 experiments
- SRX2187657 417.4M spots, 126.1G bases, 55.1Gb
- 309109 5 experiments
- SRX1538498 1.7M spots, 346.1M bases, 99.7Mb
- PRJNA289286 5 experiments
- SRX1100298 504M spots, 101.8G bases, 45.3Gb
- PRJNA260389 27 experiments
- SRX699196 34,099,675 spots, 6.8G bases, 4.3Gb size
- 248553 3 experiments
- SRX1026041 1.2G spots, 250.9G bases, 114.2Gb
- 210123 26 experiments
- SRX318496 173.7M spots, 34.7G bases, 23Gb
- 43433 3 experiments (ABI SOLiD System 3.0)
- SRX017230 851.1M spots, 85.1G bases, 68.8Gb
SraRunTable.txt
- http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
- http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click SRP004077
- http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click Runs from the RHS
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click RunInfoTable
Note that (For this study, it has 2377 rows)
- Column A (AssemblyName_s) eg GRCh37
- Column I (library_name_s) eg
- column N (header=Run_s) shows all SRR or ERR accession numbers.
- Column P (Sample_Name)
- Column Y (header=Assay_Type_s) shows WGS.
- Column AB (LibraryLayout_s): PAIRED
Public Data
Public data resources and Bioconductor from Bioc2020.
Package name | Object class | Downloads (Distinct IPs, Jul 2020) |
---|---|---|
GEOquery | SummarizedExperiment | 5754 |
GenomicDataCommons | GDCQuery | 1154 |
TCGAbiolinks curatedTCGAData |
RangedSummarizedExperiment MultiAssayExperimentObjects |
2752 275 |
recount | RangedSummarizedExperiment | 418 |
curatedMetagenomicData | ExperimentHub | 224 |
ISB Cancer Genomics Cloud (ISB-CGC)
https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis
The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.
CCLE, DepMap
- Next-generation characterization of the Cancer Cell Line Encyclopedia 2019
- It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).
- Assembling Clinical Information for the CCLE Data
- Data download Depmap
- Dependency Map (DepMap) portal is to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies analytical and visualization tools CCLE
- sample_info.csv also available from the download page
- CCLE_RNAseq_reads.csv: read counts from RSEM. 1406 x (54358 - 1). Use readr::read_csv(). range(x[, -1]) = 0 13018000. Note: log2(13018000) = 23.634.
- CCLE_expression_full.csv: log2(TPM + 1). 1406 x (53971 - 1). range(x[, -1]) = 0.00000 17.78354
- CCLE_expression.csv: log2(TPM+1). 1406 x 19221 genes. protein coding genes. 33 diseases.
- CCLE_expression_proteincoding_genes_expected_count.csv: 1406 x (19222 - 1). read count (non-integers) data from RSEM for just protein coding genes. range(x[, -1]) = 0 13018000.
- CCLE_expression_transcripts_expected_count.csv: read count data from RSEM. 1406 x (228138-1). Non-integers. range(x[, -1]) = 0 11664000.
- depmap package: Cancer Dependency Map Data Package. The depmap package currently contains eight (kinds) datasets available through ExperimentHub.
- RNA inference knockout data
- CRISPR-Cas9 knockout data
- WES copy number data
- CCLE Reverse Phase Protein Array data
- CCLE RNAseq gene expression data
- Cancer cell lines
- Mutation calls
- Drug Sensitivity
R> eh <- ExperimentHub() R> class(eh) [1] "ExperimentHub" attr(,"package") [1] "ExperimentHub" R> rnai <- eh[["EH2260"]] R> class(rnai) [1] "tbl_df" "tbl" "data.frame"
NCI's Genomic Data Commons (GDC)/TCGA
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).
- NCI's GDC - Genomic Data Commons Data Portal. Researchers can access over 3 PB of bigData from projects like CPTAC, TARGET and of course TCGA.
- MDAnderson download, FAQ, Available Software
- Working with TCGAbiolinks package
- edgeR is used to find DE; see the TCGAanalyze_DEA() function and TCGAanalyze: Analyze data from TCGA
- GEO accession data GSE62944 as a SummarizedExperiment and GEO website.
- BioXpress: an integrated RNA-seq-derived gene expression database for pan-cancer analysis.
- https://github.com/srp33/TCGA_RNASeq_Clinical
- MOGSA: integrative single sample gene-set analysis of multiple omics data 2019. The data was obtained from TCGA and NCI60.
- RNA Sequencing of the NCI-60: Integration into CellMiner and CellMiner CDB
- Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data 2020
GenomicDataCommons package
- See here
- GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation
NCI60
Molecular Characterization of the NCI-60. NCI-ADR-RES and OVCAR-8 being derived from one another, SNB-19 and U251 are derived from the same patient
Case studies
Expression of the SARS-CoV-2 cell receptor gene ACE2 in a wide variety of human tissues
NCI Proteomic Data Commons
https://pdc.cancer.gov/pdc/ vs https://gdc.cancer.gov/
GTEx
- Genotype-Tissue Expression (GTEx) project
- recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor
- Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset
- variancePartition Quantify and interpret divers of variation in multilevel gene expression experiments. Transcriptomic Insight Into the Polygenic Mechanisms Underlying Psychiatric Disorders 2020
NIH LINCS
- http://www.lincsproject.org/, http://www.lincsproject.org/LINCS/tools
- slinky, paper
- cmapR
- GRcalculator: an online tool for calculating and mining dose-response data
- Reproducible Bioconductor workflows using browser-based interactive notebooks and containers
Sharing data
- NCI Data Sharing
- Ten quick tips for sharing open genomic data Brown et al, PLOS 2018
Gene set analysis
Hypergeometric test
- A course from bioinformatics.ca and over-represented pathway.
- How informative are enrichment analyses really?
Next-generation sequencing data
Forums
- Biostars source code https://github.com/ialbert/biostar-central
- https://support.bioconductor.org/ (powered by Biostar too)
Batch effect
Misc
Advice
- Ten great papers for biologists starting out in computational biology
- Bioinformatics advice I wish I learned 10 years ago from NIH
High Performance
Cloud Computing
- Falco: A quick and flexible single-cell RNA-seq processing framework on the cloud
- Getting started with Bioconductor in the cloud
- miCloud: a bioinformatics cloud for seamless execution of complex NGS data analysis pipelines
Merge different datasets (different genechips)
Low read count and filtering
- StatQuest: edgeR and DESeq2, part 2 - Independent Filtering
- glmnet(, exclude). See its vignettte for some examples including computing univariate T-test and excluding 40% genes with low T-statistic. Notice the bias could be introduced if we use both x and y to filter variables unless the same procedure is applied during CV.
- RNA-seq: filtering, quality control and visualisation. As a general rule, a good threshold can be chosen for a CPM value that corresponds to a count of 10.
- Effect of low-expression gene filtering on detection of differentially expressed genes in RNA-seq data 2015
- Removing low count genes for RNA-seq downstream analysis
- RNA-seq analysis in R Pre-processsing RNA-seq data. It keeps all genes where the total number of reads across all samples is greater than 5.
- RNAseq data analysis in R - Notebook. A common way to do this is by filtering out genes having less than 1 count-per-million reads (cpm) in half the samples. The “edgeR” library provides the cpm function which can be used here.
- RNA-seq workflow - gene-level exploratory analysis and differential expression. we will remove those genes which have a total count of less than 5.
- While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. DESeq2 vignette.
- Differential expression analysis workshop material. It referred to this paper From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. It uses the criterion we keep genes that have CPM values above 0.5 in at least two libraries.
- RNA Sequence Analysis in R: edgeR which also uses cpm to filter genes.
Normalization
- How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets 2015
- Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data 2015
- Expression analysis of RNA sequencing data from human neural and glial cell lines depends on technical replication and normalization methods 2018
- A statistical normalization method and differential expression analysis for RNA-seq data between different species Zhou 2019
- RNA-seq analysis some notes from Ming Tang
- A protocol to evaluate RNA sequencing normalization methods Abrams et al 2019. It concluded that transcripts per million (TPM) was the best performing normalization method based on its preservation of biological signal as compared to the other methods tested.
- Data pre-processing 6.6.3 Normalising gene expression distributions. log2 CPM (counts per million) was considered. edgeR::calcNormFactors(x, method = "TMM") was used to normalize counts.
- Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis
- CrossNorm: a novel normalization strategy for microarray data in cancers Cheng 2016
- EnrichmentBrowser::normalize()
- SingleCellExperiment vignette. It lists 4 special assays: counts, logcounts, cpm and tpm.
- 8.3.4 Within sample normalization of the read counts and 8.3.5 Computing different normalization schemes in R from Computational Genomics with R by Altuna Akalin
- CPM is the simplest method. It addresses the sequencing depth bias by normalizing the read counts per gene by dividing each gene’s read count by a certain value and multiplying it by 10^6. There are 3 ways for the denominator: Total Counts Normalization, Upper Quartile Normalization and Median Normalization.
- Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase of million reads) and TPM (transcripts per million).
- RPKM is obtained by dividing the CPM value by another factor, which is the length of the gene per kilobase. FPKM (substitute reads with fragments) is the same as RPKM, but is used for paired-end reads. So RPKM differs from CPM by adding one step.
- TPM also controls for both the library size and the gene lengths, however, with the TPM method, the read counts are first normalized by the gene length (per kilobase), and then gene-length normalized values are divided by the sum of the gene-length normalized values and multiplied by 10^6.
- Library composition or RNA composition. In DESeq2 the read counts are normalized by computing size factors, which addresses the differences not only in the library sizes, but also the library compositions. See 8.3.7 Differential expression analysis
- Different samples contain different active genes
- This procedure corrects for library size and RNA composition bias, which can arise for example when only a small number of genes are very highly expressed in one experiment condition but not in the other.
- DESeq2 first normalizes the count data to account for differences in library sizes and RNA composition between samples.
- A few highly differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. Introduction to DGE gives a nice graphical illustration of RNA composition.
- StatQuest: DESeq2, part 1, Library Normalization - adjusting for differences in library composition
- Sequencing bias detection from RNA composition from the NOISeq's vignette.
- Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions Evans 2018. R code is in github. seqc package from Bioconductor.
- Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data Vandenbon 2021. Methods that correct for differences in gene length (RPKM and FPKM) don't affect correlation value; that is, these methods would be equivalent to CPM normalization.
- TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository Zhao 2021
- Youtube
- Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data 2022
log2 transformation
No matter we use TPM, TMM, FPKM, or DESeq2 normalized counts, we still need to take a log2(x+1) transformation before any analyses.
Quantile normalization
- https://en.wikipedia.org/wiki/Quantile_normalization. The new values are the averaged ordered values per gene based on the original rank.
- Question: Quantile normalizing prior to or after TPM scaling?
- When to use Quantile Normalization? and its R package quantro
- Quantile normalisation in R
- normalize.quantiles() from preprocessCore package.
- for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
- I got into an error when I use the function in RStudio docker container but the solution here (BiocManager::install("preprocessCore", configure.args="--disable-threading")) works.
source('http://bioconductor.org/biocLite.R') biocLite('preprocessCore') #load package library(preprocessCore) #the function expects a matrix #create a matrix using the same example mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8), ncol=3) mat # [,1] [,2] [,3] #[1,] 5 4 3 #[2,] 2 1 4 #[3,] 3 4 6 #[4,] 4 2 8 #quantile normalisation normalize.quantiles(mat) # [,1] [,2] [,3] #[1,] 5.666667 5.166667 2.000000 #[2,] 2.000000 2.000000 3.000000 #[3,] 3.000000 5.166667 4.666667 #[4,] 4.666667 3.000000 5.666667
Distribution, density plot
Density plot showing the distribution of RNA-seq read counts (FPKM) log10(FPKM)
Negative binomial distribution
RNA-seq and Negative binomial distribution
Z-score transformation
- This practice has been used extensively in papers without a clear foundation.
- Analysis of Microarray Data Using Z Score Transformation Cheadle 2003. Z-normalization was calculated per gene.
- For example, ssGSEA score-based Ras dependency indexes derived from gene expression data reveal potential Ras addiction mechanisms with possible clinica implications applies z-scores on each gene AND ssGSEA scores for each pathway/gene signature.
- Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures Zwiener 2014. Often, standardizing gene expressions is implemented as a default in software packages.
- RNAseq: Z score, Intensity, and Resources. For visualization in heatmaps or for other clustering (e.g., k-means, fuzzy) it is useful to use z-scores.
Ensembl to gene symbol
- http://useast.ensembl.org/index.html
- Training
- Ensembl to gene symbol using ensembl_to_symbol().
- A Tidy Transcriptomics introduction to RNA sequencing analyses (Bioc 2020).
- "org.Hs.eg.db" package and mapIds() function. See Converting Gene Symbol to Ensembl ID in R or the vignette of EnhancedVolcano package.
- "biomaRt" package
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
How to use UCSC Table Browser
- An instruction from BitSeq software
- How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols
- Where to download refseq gene coding regions data?
- http://genome.ucsc.edu/cgi-bin/hgTables OR
- Download refGene.txt.gz file from UCSC directly using http links
- Where To Download Genome Annotation Including Exon, Intron, Utr, Intergenic Information?
File:Tablebrowser.png File:Tablebrowser2.png
Note
- the UCSC browser will return the output on browser by default. Users need to use the browser to save the file with self-chosen file name.
- the output does not have a header
- The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1
If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.
$ wc -l hg38Tables.bed 75893 hg38Tables.bed $ head -2 hg38Tables.bed chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754, chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622, $ tail -2 hg38Tables.bed chr22_KI270734v1_random 131493 137393 NM_005675 0 + 131645 136994 0 5 262,161,101,141,549, 0,342,3949,4665,5351, chr22_KI270734v1_random 138078 161852 NM_016335 0 - 138479 161586 0 15 589,89,99,176,147,93,82,80,117,65,150,35,209,313,164, 0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610, $ wc -l hg38CodingExon.bed 577387 hg38CodingExon.bed $ head -2 hg38CodingExon.bed chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 - chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 - $ tail -2 hg38CodingExon.bed chr22_KI270734v1_random 156288 156497 NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r 0 - chr22_KI270734v1_random 161313 161586 NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r 0 - # Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698) $ grep NM_001276352 hg38Tables.bed chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754, $ grep NM_001276352 hg38CodingExon.bed chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 - chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 - chr1 67103237 67103382 NM_001276352_cds_2_0_chr1_67103238_r 0 - chr1 67111576 67111644 NM_001276352_cds_3_0_chr1_67111577_r 0 - chr1 67115351 67115464 NM_001276352_cds_4_0_chr1_67115352_r 0 - chr1 67125751 67125909 NM_001276352_cds_5_0_chr1_67125752_r 0 - chr1 67127165 67127240 NM_001276352_cds_6_0_chr1_67127166_r 0 -
This can be compared to refGene(?) directly downloaded via http
$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz --2018-10-09 15:44:43-- http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163 Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 7457957 (7.1M) [application/x-gzip] Saving to: ‘hg38.refGene.txt.gz’ hg38.refGene.txt.gz 100%[===============================================================>] 7.11M 901KB/s in 10s 2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957] $ zcat hg38.refGene.txt.gz | wc -l 75893 15:45PM /tmp$ zcat hg38.refGene.txt.gz | head -2 1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 63865227,63869295,63873667,63875815,63882718,63889189,63889849, 63865384,63869441,63873816,63875875,63882820,63889238,63891545, 0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0, 1815 NR_110164 chr2 + 161244738 161249050 161249050 161249050 2 161244738,161246874, 161244895,161249050, 0 LINC01806 unk unk -1,-1, $ zcat hg38.refGene.txt.gz | tail -2 1006 NM_130467 chrX + 55220345 55224108 55220599 55224003 5 55220345,55221374,55221766,55222620,55223986, 55220651,55221463,55221875,55222746,55224108, 0 PAGE5 cmpl cmpl 0,1,0,1,1, 637 NM_001364814 chrY - 6865917 6874027 6866072 6872608 7 6865917,6868036,6868731,6868867,6870005,6872554,6873971, 6866078,6868462,6868776,6868909,6870053,6872620,6874027, 0 AMELY cmpl cmpl 0,0,0,0,0,0,-1,
Where to download reference genome
- UCSC and twoBitToFa to UCSC convert .2bit to fasta.
- hg19 from UCSC (chromosome-wise).
Which human reference genome to use?
http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)
GENCODE transcript database
RefSeq categories
See Table 1 of Chapter 18The Reference Sequence (RefSeq) Database.
Category | Description |
---|---|
NC | Complete genomic molecules |
NG | Incomplete genomic region |
NM | mRNA |
NR | ncRNA |
NP | Protein |
XM | predicted mRNA model |
XR | predicted ncRNA model |
XP | predicted Protein model (eukaryotic sequences) |
WP | predicted Protein model (prokaryotic sequences) |
UCSC version & NCBI release corresponding
Gene Annotation
- GeneCards
- Genetics Home Reference from National Library of Medicine
- My Cancer Genome
- Cosmic
- annotables R package.
- ensembldb R package
- https://www.gencodegenes.org/human/releases.html
- For gene symbols, there are NCBI and HUGO. See an example from GSE6532 (annot.tam object).
library(rtracklayer) genes <- readGFF("gencode.v27.annotation.gff3.gz") genes[1:2, 1:5] # DataFrame with 2 rows and 5 columns # seqid source type start end # <factor> <factor> <factor> <integer> <integer> # 1 chr1 HAVANA gene 11869 14409 # 2 chr1 HAVANA transcript 11869 14409 genes[1:100, ] %>% filter(type == "gene") %>% dim() # Error in UseMethod("filter") : # no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'RectangularData', 'SimpleList', 'DataFrame_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')" library(ape) genes2 <- read.gff("gencode.v27.annotation.gff3.gz") genes2[1:2, 1:5] # seqid source type start end # 1 chr1 HAVANA gene 11869 14409 # 2 chr1 HAVANA transcript 11869 14409 genes2[1:100,] %>% filter(type == "gene") %>% dim() # [1] 11 9
How many DNA strands are there in humans?
- http://www.numberof.net/number-of-dna-strands/
- http://www.answers.com/Q/How_many_DNA_strands_are_there_in_humans
How many base pairs in human
- 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
- chromosome 22 has the smallest number of bps (~50 million).
- chromosome 1 has the largest number of bps (245 million base pairs).
- Illumina iGenome Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa file is 3.0GB (so is other genome.fa from human).
Gene, Transcript, Coding/Non-coding exon
- https://hslnews.wordpress.com/2015/07/02/bioinformatics-bite-how-to-find-the-transcription-start-site-of-a-gene/
- According to https://en.wikipedia.org/wiki/Exon, in the human genome only
- 1.1% of the genome is spanned by exons,
- 24% is in introns,
- 75% of the genome being intergenic DNA.
- https://twitter.com/ensembl/status/1276469013707665410?s=20
SNP
Types of SNPs and number of SNPs in each chromosomes
NGS technology
DNA methylation, Epigenetics
- hemimethylated DNA vs allele-specific methylation. DNA-hemimethylation is when only one of two (complementary) strands is methylated.
- methylationArrayAnalysis : A cross-package Bioconductor workflow for analysing methylation array data]
- Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis Du 2010. Figure 3 shows scatterplots of SD vs mean based on technical replicates of using beta or M-value. It can be seen M-value satisfies homoscedasticity but beta does not.
- Is Your DNA Your Destiny? A Primer on Epigenetics
- GC content = (G+C)/(A+T+G+C) x 100%
- How many CpGs (C follows by G)?
- Some papers
- Identification of prostate cancer specific methylation biomarkers from a multi-cancer analysis 2021. Within all bootstrapped datasets, sites with non-zero coefficients in more than 99% LASSO models were kept.
- Analyzing DNA methylation data (part of the book Biomedical Data Science) and the PH525x: Data Analysis for Genomics (edX course). The Github website is on https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The paper Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies the tutorial has mentioned.
devtools::install_github("coloncancermeth","genomicsclass") library(coloncancermeth) # 485512 x 26 data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects dim(meth) dim(pd) length(gr) colnames(pd) table(pd$Status) # 9 normals, 17 cancers normalIndex <- which(pd$Status=="normal") cancerlIndex <- which(pd$Status=="cancer") i=normalIndex[1] plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n") for(i in normalIndex){ lines(density(meth[,i],from=0,to=1),col=1) } ### Add the cancer samples for(i in cancerlIndex){ lines(density(meth[,i],from=0,to=1),col=2) } # finding regions of the genome that are different between cancer and normal samples library(limma) X<-model.matrix(~pd$Status) fit<-lmFit(meth,X) eb <- ebayes(fit) # plot of the region surrounding the top hit library(GenomicRanges) i <- which.min(eb$p.value[,2]) middle <- gr[i,] Index<-gr%over%(middle+10000) cols=ifelse(pd$Status=="normal",1,2) chr=as.factor(seqnames(gr)) pos=start(gr) plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference") matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location") # http://www.ncbi.nlm.nih.gov/pubmed/22422453 # within each chromosome we usually have big gaps creating subgroups of regions to be analyzed chr1Index <- which(chr=="chr1") hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method") library(bumphunter) cl=clusterMaker(chr,pos,maxGap=500) table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them #consider two example regions# ...
Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing
- http://www.rna-seqblog.com/exome-sequencing-vs-rna-seq-to-identify-coding-region-variants/
- http://www.rna-seqblog.com/combined-use-of-exome-and-transcriptome-sequencing/
- A comparison of whole genome and whole transcriptome sequencing
Sequence + Expression
Integrate RNA-Seq and DNA-Seq
- Integrated RNA and DNA sequencing reveals early drivers of metastatic breast cancer by Perou. An R code is provided.
Deconvolve bulk tumor tissue
Performance of computational algorithms to deconvolve heterogeneous bulk tumor tissue depends on experimental factors. Twitter.
Integrate/combine Omics
- awesome-multi-omics by Michael Love
- BloodCancerMultiOmics2017. "Drug-perturbation-based stratification of blood cancer" by Dietrich S, Oles M, Lu J et al. - experimental data and complete analysis.
- OmicsPLS & the paper in BMC Bioinformatics 2018
- MultiAssayExperiment & the paper in AACR 2017
- mixOmics, http://mixomics.org/, the paper on PLOS 2017
- The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models Biostatistics 2018
- Tweedieverse, Github
- Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes 2015, R code for Supplementary Data 2 and 4 and mg14.R.
- A comprehensive database for integrated analysis of omics data in autoimmune diseases 2021
- Integrative survival analysis of breast cancer with gene expression and DNA methylation data Bichindaritz, 2021
- Machine learning to analyse omic-data for COVID-19 diagnosis and prognosis 2023
Gene expression
Expression level is the amount of RNA in cell that was transcribed from that gene. Slides from Alyssa Frazee.
Fusion gene
- https://en.wikipedia.org/wiki/Fusion_gene
- https://github.com/STAR-Fusion/STAR-Fusion,
- Gene fusion discovery
- chimeraviz Visualization tools for gene fusions
- Tools to call fusions & Common issues with fusion calling, visualizing fusion events in IGV
Structural variation
- https://en.wikipedia.org/wiki/Structural_variation
- https://www.ncbi.nlm.nih.gov/dbvar/content/overview/
- Detection of complex structural variation from paired-end sequencing data Joseph G. Arthur, 2018
LUMPY, DELLY, ForestSV, Pindel, breakdancer , SVDetect.
RNASeq + ChipSeq
Labs
Biowulf2 at NIH
- Main site: http://hpc.nih.gov
- User guide: https://hpc.nih.gov/docs/user_guides.html
- Unlock account (60 days inactive) https://hpc.nih.gov/dashboard/
- Transitioning from PBS to Slurm: https://hpc.nih.gov/docs/pbs2slurm.html
- Job Submission 'cheat sheet': https://hpc.nih.gov/docs/biowulf2-handout.pdf
- STAR: https://hpc.nih.gov/apps/STAR.html
BamHash
Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.
Reproducibility
- Improving reproducibility in computational biology research
- SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses by Claire Rioualen et al, 2017.
Selected Papers
- Testing for association between RNA-Seq and high-dimensional data and the Bioconductor package globalSeq.
- The FDA’s Experience with Emerging Genomics Technologies—Past, Present, and Future
- Differential analysis of gene regulation at transcript resolution with RNA-seq Trapnell et al, Nature Biotechnology 31, 46–53 (2013)
- A Study of TP53 RNA Splicing Illustrates Pitfalls of RNA-seq Methodology
- Top RNA-Seq Articles – 2016 from RNA-Seq blog
- Multivariate association analysis with somatic mutation data by He 2017 Biometrics.
- RASflow – An RNA-Seq Analysis Workflow with Snakemake Zhang 2019
- A Survey of Bioinformatics Database and Software Usage through Mining the Literature
- Computational analysis of cancer genome sequencing data Cortés-Ciriano 2021. Focus on point mutations, copy number alterations, structural variations. For RNA-seq data, it focused on gene fusion detection.
Pictures
https://www.flickr.com/photos/genomegov
FISH/Fluorescence In Situ Hybridization
- [https://www.genome.gov/genetics-glossary/Fluorescence-In-Situ-Hybridization
- Fluorescent In Situ Hybridization (FISH) Assay
用DNA做身分鑑識
如何自学入门生物信息学
CRISPR
Staying current
Staying Current in Bioinformatics & Genomics: 2017 Edition
Papers
Common issues in algorithmic bioinformatics papers
What are some common issues I find when reviewing algorithmic bioinformatics conference papers?
Precision Medicine courses
Personalized medicine
- F.D.A. Approves First Gene-Altering Leukemia Treatment
- The FDA Just Approved a New Way of Fighting (lymphoma) Cancer Using Personalized Gene Therapy
- What is the difference between precision medicine and personalized medicine? What about pharmacogenomics? "personalized medicine" is an older term. Help Me Understand Genetics
- Predictive Biomarkers: Progress on the Road to Personalized Cancer Immunotherapy Emens 2021
Cancer and gene markers
- Colorectal cancer patients without KRAS mutations have far better outcomes with EGFR treatment than those with KRAS mutations.
- Two EGFR inhibitors, cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
- Breast cancer.
The shocking truth about space travel
bioSyntax: syntax highlighting for computational biology
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y
Deep learning
Deep learning: new computational modelling techniques for genomics
HRD/homologous recombination deficiency 同源重组修复缺陷
- The inability to repair DNA damage is referred to as homologous recombination deficiency (HRD). DNA damage response from Qiagen.
- openai
- Poly(ADP-ribose) polymerase (PARP) inhibitors are a class of drugs that are designed to inhibit the activity of PARP enzymes. PARP enzymes are proteins that are involved in DNA repair pathways. They help to repair DNA damage and maintain the stability of the genome by adding a chemical group called poly(ADP-ribose) to other proteins.
- PARP inhibitors work by blocking the activity of PARP enzymes, which can interfere with the ability of cells to repair damaged DNA. This can be especially useful in cancer cells, which often rely on PARP enzymes to repair DNA damage and maintain genomic stability. By inhibiting PARP enzymes, PARP inhibitors can sensitizize cancer cells to chemotherapy and other treatments, making them more vulnerable to cell death.
- scarHRD package. Note for the 2 test samples, the return object is a 1x4 matrix.
- HRD-LOH/Loss of Heterozygosity
- LST/Large Scale Transitions
- Number of Telomeric Allelic Imbalances
- HRD-sum (sum of the above 3)
- Homologous Recombination Deficiency: Concepts, Definitions, and Assays Stewart 2022
- 同源重组修复缺陷HRD 分析
- 这篇只有两个Figure的10分+SCI是靠什么取胜的?
- HRD检测方法
- Link to HRD score, genome-wide DNA damage footprint. The HRD value has a range 0 to 101. Most are 0 and the rest follows an exponential distribution.
- Dynamically Accumulating Homologous Recombination Deficiency Score Served as an Important Prognosis Factor in High-Grade Serous Ovarian Cancer Su 2021
- A combined HRD score ≥42 was associated with shorter OS in 33 cancer types.
- However, in ovarian cancer, which ranked the highest HRD score among other cancers, HRD ≥42 cohort was significantly associated with longer OS.
- An HRD score of ≥42 was determined to signify HRD (HR-deficient), and a score of <42 was considered HR-proficient in clinical trials.
- The datasets including 1. signatures–HRD score and genome-wide DNA damage footprint; 2. phenotype-curated clinical data; and 3. gene expression RNAseq-TOIL RSEM TPM were downloaded from https://xenabrowser.net/.
- Identification of molecular subtypes and prognostic signature for hepatocellular carcinoma based on genes associated with homologous recombination deficiency Lin 2021.
- the combat function of R software package SVA was used for batch effect removal.
- https://xenabrowser.net/datapages/ has one set called "TCGA Pan-Cancer (PANCAN) (41 datasets)" -> HRD score, genome-wide DNA damage footprint (n=10,647) Pan-Cancer Atlas Hub.
- https://github.com/GerkeLab/TCGAhrd where HDR can be downloaded from https://gdc.cancer.gov/about-data/publications/PanCan-DDR-2018. Choose 'Zip file with TSV files of DDR Data Resources'. Pay attention to "DDRscores.tsv" and "Scores.tsv", "Samples.tsv".
- Homologous recombination deficiency in TCGA PanCancer Atlas
> load("clinical_and_hrd.RData") > dim(full_dat) [1] 9105 792 > full_dat[1:5, c("HRD_Score", "HRD_LOH", "HRD_LST", "HRD_TAI")] HRD_Score HRD_LOH HRD_LST HRD_TAI 1 7 2 2 3 2 9 3 2 4 3 0 0 0 0 4 8 4 2 2 5 5 1 1 3 > summary( full_dat[1:5, c("HRD_Score", "HRD_LOH", "HRD_LST", "HRD_TAI")]) HRD_Score HRD_LOH HRD_LST HRD_TAI Min. :0.0 Min. :0 Min. :0.0 Min. :0.0 1st Qu.:5.0 1st Qu.:1 1st Qu.:1.0 1st Qu.:2.0 Median :7.0 Median :2 Median :2.0 Median :3.0 Mean :5.8 Mean :2 Mean :1.4 Mean :2.4 3rd Qu.:8.0 3rd Qu.:3 3rd Qu.:2.0 3rd Qu.:3.0 Max. :9.0 Max. :4 Max. :2.0 Max. :4.0 > load("DDRscores.RData") > ls() [1] "dat" "full_dat" > dim(dat) [1] 9125 46 > colnames(dat) [1] "patient_id" "acronym" "mutLoad_silent" [4] "mutLoad_nonsilent" "mutSig1" "mutSig2" [7] "mutSig3" "mutSig4" "mutSig5" [10] "mutSig6" "mutSig7" "mutSig8" [13] "mutSig9" "mutSig10" "mutSig11" [16] "mutSig12" "mutSig13" "mutSig14" [19] "mutSig15" "mutSig16" "mutSig17" [22] "mutSig18" "mutSig19" "mutSig20" [25] "mutSig21" "CNA_n_segs" "CNA_frac_altered" [28] "CNA_n_focal_amp_del" "aneuploidy_score" "aneuploidy_score_prime" [31] "LOH_n_seg" "LOH_frac_altered" "purity" [34] "ploidy" "genome_doublings" "subclonal_frac" [37] "HRD_TAI" "HRD_LST" "HRD_LOH" [40] "HRD_Score" "eCARD" "PARPi7" [43] "PARPi7_bin" "RPS" "tp53_score" [46] "rppa_ddr_score"
SOMAscan assay (proteomic)
- https://somalogic.com/somascan-discovery/
- readat: An R package for reading and working with SomaLogic ADAT files
- https://bitbucket.org/graumannlabtools/readat/src/master/. The R package is not in Bioconductor.
- NCBI-GEO, All list
Scandal
阿茲海默症關鍵論文被揭發疑似造假,16年來全球醫學專家可能都被呼弄 & 阿茲海默症關鍵論文疑造假 誤導外界16年
Terms
RNA vs DNA
- Why RNA is a better measure of a patient’s current health than DNA
- 英美接種疫苗,陰謀論隨之盛行. 核糖核酸=RNA. 脱氧核糖核酸=DNA. 輝瑞公司開發的新冠疫苗,是使用mRNA疫苗原理,即抽取病毒內部分核糖核酸編碼蛋白(或者稱信使核糖核酸)制成疫苗。
- How mRNA Vaccines Work
- How Pfizer's RNA Vaccine Works
- Pfizer’s Covid Vaccine: 11 Things You Need to Know
- COVID-19 vaccine “95% effective”: It doesn’t mean what you think it means!
- Vaccine Technology: How MRNA Changed the Fight Against Covid-19
基因结构
https://zhuanlan.zhihu.com/p/49601643
Pseudogene
https://www.genome.gov/genetics-glossary/Pseudogene. An example: OR7E47P with alias bpl 41-16 or bpl41-16.
PCR
什麼是PCR? 聚合酶鏈鎖反應? 基因叔叔
Epidemiology
Epidemiology for the uninitiated
Cell lines
- cell line (體外). tumor samples (體內)
- A resource for cell line authentication, annotation and quality control
- Cell Lines: Types, Nomenclature, Selection and Maintenance (With Statistics)
- Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer Jiang 2016
- Evaluating cell lines as tumour models by comparison of genomic profiles Domcke 2013
- Google: tumor cell line vs tumor samples
in vivo, in vitro, and in situ
In silico
https://en.wikipedia.org/wiki/In_silico An in silico experiment is one performed on computer or via computer simulation.
In situ
https://en.wikipedia.org/wiki/In_situ 意義大致介於in vivo與in vitro之間。
in vivo 活体内
- Human primary cells and immortal cell lines: differences and advantages
- What is the Difference Between Primary Cell Culture and Cell Line
in vitro 试管内/体外
- https://en.wikipedia.org/wiki/In_vitro
- https://en.wikipedia.org/wiki/In_silico
- https://en.wikipedia.org/wiki/RNA-Seq
RNA sequencing 101
Web
- Introduction to Biology - The Secret of Life from edX. It's one of the top 100 best free online courses posted by ClassCentral.
- Introduction to RNA-Seq including biology overview (DNA, Alternative splcing, mRNA structure, human genome) and sequencing technology.
- Introduction to RNA-Seq for Researchers (youtube)
- RNA Sequencing 101 by Agilent Technologies. Includes the definition of sequencing depth (number of reads per sample) and coverage (number of reads/locus).
- What is a good sequencing depth for bulk RNA-Seq? In general 5 M mapped reads is a good bare minimum for a differential gene expression (DGE) analysis in human. In many cases 5 M – 15 M mapped reads are sufficient. Many published human RNA-Seq experiments have been sequenced with a sequencing depth between 20 M - 50 M reads per sample.
- How to count fastq reads echo $(zcat yourfile.fastq.gz | wc -l)/4 | bc
- Where do we get reads(A,C,T,G) from sample RNA? See page 12 of this pdf from Colin Dewey in U. Wisc.
- Quantification of RNA-Seq data (see the above pdf)
- Convert read counts into expression: RPKM (see the above pdf)
- RPKM and FPKM (Data analysis of RNA-seq from new generation sequencing by 張庭毓) RNA-Seq資料分析研討會與實作課程 / RNA Seq定序 / 次世代定序(NGS) / 高通量基因定序 分析.
- First vs Second generation sequence.
- The Simple Fool’s Guide to Population Genomics via RNASeq: An Introduction to HighThroughput Sequencing Data Analysis. This covers QC, De novo assembly, BLAST, mapping reads to reference sequences, gene expression analysis and variant (SNP) detection.
- An Introduction to Bioinformatics Resources and their Practical Applications from NIH library Bioinformatics Support Program.
- Teaching material from rnaseqforthenextgeneration.org which includes Designing RNA-Seq experiments, Processing RNA-Seq data, and Downstream analyses with RNA-Seq data.
Books
- RNA-seq Data Analysis: A Practical Approach. The pdf version is available on slideshare.net.
- Statistical Analysis of Next Generation Sequencing Data
- Modern Statistics for Modern Biology (free, see Statistics books) by Holmes and Huber. Plots are based on ggplot2.
- Learning Bioinformatics At Home - some resources gathered by the Harvard Informatics group.
- Install all required packages using the R script gives me errors. It would be great if there is a Docker image. Another solution is to run install.packages(pkgsToInstall, Ncpus=4) manually.
- Computational Genomics with R by Altuna Akalin
strand-specific vs non-strand specific experiment
- http://seqanswers.com/forums/showthread.php?t=28025. According to this message and the article (under the paragraph of Read counting) from PLOS, most of the RNA-seq protocols that are used nowadays are not strand-specific.
- http://biology.stackexchange.com/questions/1958/difference-between-strand-specific-and-not-strand-specific-rna-seq-data
- https://www.biostars.org/p/61625/ how to find if this public rnaseq data are prepared by strand-specific assay?
- https://www.biostars.org/p/62747/ Discussion of using IGV to view strand-specific coverage. See also the similar posts on the right hand side.
- https://www.biostars.org/p/44319/ How To Find Stranded Rna-Seq Experiments Data. The text dUTP 2nd strand marking includes a link to stranded rna-seq data.
- forward (+)/ reverse(-) strand in GAlignments objects (p68 of the pdf manual and page 7 of sam format specification.
Understand this info is necessary when we want to use summarizeOverlaps() function (GenomicAlignments) or htseq-count python program to get count data.
This post mentioned to use infer_experiment.py script to check whether the rna-seq run is stranded or not.
The rna-seq experiment used in this tutorial is not stranded-specific.
FASTQ
- FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
Phred quality score
q = -10log10(p) where p = error probability for the base.
q | error probability | base call accuracy |
---|---|---|
10 | 0.1 | 90% |
13 | 0.05 | 95% |
20 | 0.01 | 99% |
30 | 0.001 | 99.9% |
40 | 0.0001 | 99.99% |
50 | 0.00001 | 99.999% |
FASTA
fasta/fa files can be used as reference genome in IGV. But we cannot load these files in order to view them.
Download sequence files
- ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/ Assembled genome sequence and annotation data for RefSeq genome assemblies
- ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
Compute the sequence length of a FASTA file
https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa head -2 file.fa | \ awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' | \ tail -1
FASTA <=> FASTQ conversion
According to this post,
- FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
- FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)
Convert FASTA to FASTQ without quality scores
Biostars. For example, the bioawk by lh3 (Heng Li) worked.
Convert FASTA to FASTQ with quality score file
See the links on the above post.
Convert FASTQ to FASTA using Seqtk
Use the Seqtk program; see this post.
The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.
RPKM (Mortazavi et al. 2008) and cpm (counts per million)
Reads per Kilobase of Exon per Million of Mapped reads.
- RPKMs can only be calculated for those genes for which the gene length and GC content information is available; see the vignette of GSVA
- rpkm function in edgeR package.
- RPKM function in easyRNASeq package.
- TMM > cpm > log2 transformation on the paper Gene expression profiling of 1200 pancreatic ductal adenocarcinoma reveals novel subtypes
- Gene expression quantification from RNA-Seq wikipedia page
- Sequencing depth/coverage: the total number of reads generated in a single experiment is typically normalized by converting counts to fragments, reads, or counts per million mapped reads (FPM, RPM, or CPM).
- Gene length: FPKM, TPM. Longer genes will have more fragments/reads/counts than shorter genes if transcript expression is the same. This is adjusted by dividing the FPM by the length of a gene, resulting in the metric fragments per kilobase of transcript per million mapped reads (FPKM). When looking at groups of genes across samples, FPKM is converted to transcripts per million (TPM) by dividing each FPKM by the sum of FPKMs within a sample.
- Difference between CPM and TPM and which one for downstream analysis?. CPM is basically depth-normalized counts whereas TPM is length normalized (and then normalized by the length-normalized values of the other genes).
- The RNA-seq abundance zoo. The counts per million (CPM) metric takes the raw (or estimated) counts, and performs the first type of normalization I mention in the previous section. That is, it normalized the count by the library size, and then multiplies it by a million (to avoid scary small numbers).
- See also the log(CPM) implemented in Seurat::NormalizeData() for scRNA-seq data.
Idea
- The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
- Longer transcript are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!. As such, the DESeq2 only creates a size factor for each library and normalize the counts by dividing counts by a size factor (scalar) for each library. Note that: H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2 where c is gene length.
Calculation
- Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
- Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
- Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
Formula
RPKM = (10^9 * C)/(N * L), with C = Number of reads mapped to a gene N = Total mapped reads in the experiment L = gene length in base-pairs for a gene
source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR") library(edgeR) set.seed(1234) y <- matrix(rnbinom(20,size=1,mu=10),5,4) [,1] [,2] [,3] [,4] [1,] 0 0 5 0 [2,] 6 2 7 3 [3,] 5 13 7 2 [4,] 3 3 9 11 [5,] 1 2 1 15 d <- DGEList(counts=y, lib.size=1001:1004) # Note that lib.size is optional # By default, lib.size = colSums(counts) cpm(d) # counts per million Sample1 Sample2 Sample3 Sample4 1 0.000 0.000 4985.045 0.000 2 5994.006 1996.008 6979.063 2988.048 3 4995.005 12974.052 6979.063 1992.032 4 2997.003 2994.012 8973.081 10956.175 5 999.001 1996.008 997.009 14940.239 > cpm(d,log=TRUE) Sample1 Sample2 Sample3 Sample4 1 7.961463 7.961463 12.35309 7.961463 2 12.607393 11.132027 12.81875 11.659911 3 12.355838 13.690089 12.81875 11.129470 4 11.663897 11.662567 13.17022 13.451207 5 10.285119 11.132027 10.28282 13.890078 d$genes$Length <- c(1000,2000,500,1500,3000) rpkm(d) Sample1 Sample2 Sample3 Sample4 1 0.0000 0.000 4985.0449 0.000 2 2997.0030 998.004 3489.5314 1494.024 3 9990.0100 25948.104 13958.1256 3984.064 4 1998.0020 1996.008 5982.0538 7304.117 5 333.0003 665.336 332.3363 4980.080 > cpm function (x, ...) UseMethod("cpm") <environment: namespace:edgeR> > showMethods("cpm") Function "cpm": <not an S4 generic function> > cpm.default function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...) { x <- as.matrix(x) if (is.null(lib.size)) lib.size <- colSums(x) if (log) { prior.count.scaled <- lib.size/mean(lib.size) * prior.count lib.size <- lib.size + 2 * prior.count.scaled } lib.size <- 1e-06 * lib.size if (log) log2(t((t(x) + prior.count.scaled)/lib.size)) else t(t(x)/lib.size) } <environment: namespace:edgeR> > rpkm.default function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25, ...) { y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count) gene.length.kb <- gene.length/1000 if (log) y - log2(gene.length.kb) else y/gene.length.kb } <environment: namespace:edgeR>
Here for example the 1st sample and the 2nd gene, its rpkm value is calculated as
# step 1: 6/(1.0e-6 *1001) = 5994.006 # cpm, compute column-wise # step 2: 5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise # Another way # step 1 (RPK) 6/ (2000/1.0e3) = 3 # step 2 (RPKM) 3/ (1.0e-6 * 1001) = 2997.003
Another example. source code of calc_cpm().
library(edgeR) set.seed(1234) y <- matrix(rnbinom(20,size=1,mu=10),5,4) cpm(y) # [,1] [,2] [,3] [,4] #[1,] 0.00 0 172413.79 0.00 #[2,] 400000.00 100000 241379.31 96774.19 #[3,] 333333.33 650000 241379.31 64516.13 #[4,] 200000.00 150000 310344.83 354838.71 #[5,] 66666.67 100000 34482.76 483870.97 calc_cpm <- function (expr_mat) { norm_factor <- colSums(expr_mat) return(t(t(expr_mat)/norm_factor) * 10^6) # Fix a bug in the original code # Not affect silhouette() } calc_cpm(y) # [,1] [,2] [,3] [,4] #[1,] 0.00 0 172413.79 0.00 #[2,] 400000.00 100000 241379.31 96774.19 #[3,] 333333.33 650000 241379.31 64516.13 #[4,] 200000.00 150000 310344.83 354838.71 #[5,] 66666.67 100000 34482.76 483870.97
Critics
Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
(another critic) Union Exon Based Approach
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910
In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.
It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.
FPKM (Trapnell et al. 2010)
Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks). FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
RPKM, FPKM, TPM and DESeq
- RPKM can be calculated using the edgeR package if we have the raw count data (e.g. Rsubread::featureCount()). Rsubread is the only R package that can run in R.
- The youtube video is on here. TPM 比 RPKM/FPKM 好因為 total reads in each experiments are the same.
- Relationship (formula) between RPKM and TPM
- Differences
- The main difference between RPKM and FPKM is that the former is a unit based on single-end reads, while the latter is based on paired-end reads and counts the two reads from the same RNA fragment as one instead of two.
- The difference between RPKM/FPKM and TPM is that the former calculates sample-scaling factors before dividing read counts by gene lengths, while the latter divides read counts by gene lengths first and calculates samples calling factors based on the length-normalized read counts.
- If researchers would like to interpret gene expression levels as the proportions of RNA molecules from different genes in a sample,
- 为什么说FPKM和RPKM都错了?
- Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples (TPM method) by Wagner 2012.
- Between samples normalization.
- How to Normalize Salmon TPM output?
- How do I normalize for my RNA-seq data across different samples in different conditions. Using DESeq2 (paper, estimateSizeFactors()). DESeq paper by Anders 2010 where the NB model and size factor was first used.
> set.seed(1) > dds <- makeExampleDESeqDataSet(m=4) > head(counts(dds)) sample1 sample2 sample3 sample4 gene1 14 1 1 4 gene2 5 17 13 14 gene3 0 12 8 6 gene4 152 62 149 110 gene5 23 36 33 94 gene6 0 1 1 4 > dds <- estimateSizeFactors(dds) > sizeFactors(dds) sample1 sample2 sample3 sample4 1.068930 1.014687 1.010392 1.033559 > head(counts(dds)) sample1 sample2 sample3 sample4 gene1 14 1 1 4 gene2 5 17 13 14 gene3 0 12 8 6 gene4 152 62 149 110 gene5 23 36 33 94 gene6 0 1 1 4 > head(counts(dds, normalized=TRUE)) sample1 sample2 sample3 sample4 gene1 13.097206 0.9855256 0.9897147 3.870122 gene2 4.677574 16.7539354 12.8662916 13.545427 gene3 0.000000 11.8263073 7.9177179 5.805183 gene4 142.198237 61.1025878 147.4674957 106.428358 gene5 21.516838 35.4789219 32.6605863 90.947869 gene6 0.000000 0.9855256 0.9897147 3.870122 # normalized counts is calculated as the following R> head(scale(counts(dds, normalized=F), F, sizeFactors(dds))) sample1 sample2 sample3 sample4 gene1 13.097206 0.9855256 0.9897147 3.870122 gene2 4.677574 16.7539354 12.8662916 13.545427 gene3 0.000000 11.8263073 7.9177179 5.805183 gene4 142.198237 61.1025878 147.4674957 106.428358 gene5 21.516838 35.4789219 32.6605863 90.947869 gene6 0.000000 0.9855256 0.9897147 3.870122 # The situation of DESeqDataSet object created using 'tximport()' is different. See the next item.
- ?estimateSizeFactors(). If tximport is used, the information in assays(dds)[["avgTxLength"]] is automatically used to create appropriate normalization factors. In this case, sizeFactors(dds) will return NULL. See also Debug an S4 function for the source code.
- TPM (generated by RSEM or Salmon or Kallisto) has been suggested as a better unit than RPKM/FPKM. But it cannot be used to do comparison between samples.
- RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome Li 2011, 8k citations from google scholar.
- TPM and FPKM from array suite wiki
- What the FPKM? A review of RNA-Seq expression units. R code for computing effective counts, TPM, and FPKM.
- In RNA-Seq, 2 != 2: Between-sample normalization. Among the most popular and well-accepted BSN (between-sample normalization) methods are TMM and DESeq normalization.
P -- per K -- kilobase (related to gene length) M -- million (related to sequencing depth)
- How to convert transcript level TPM to gene level TPM ?
- R scripts to convert HTseq counts to TPM
- Calculating TPM from featureCounts output, A simpler version. It seems CPM and TPM 差別再 TPM 考慮gene length.
- Comparison of common normalization methods from a workshop from Harvard Chan Bioinformatics Core.
- Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols Zhao 2020
- Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq
TMM (Robinson and Oshlack, 2010)
Trimmed Means of M values (edgeR).
- Trimmed Mean
- RNA Sequence Analysis in R: edgeR
- Normalisation methods implemented in edgeR. TMM, RLE, Upper-quartile.
- Using edgeR package
library(magrittr) library(edgeR) set.seed(1) M <- matrix(rnbinom(10000,mu=5,size=2), ncol=4) out <- DGEList(M) %>% calcNormFactors() %>% cpm()
- Using NOISeq package
library(NOISeq) out2 <- tmm(M, long = 1000, lc = 0, k = 0) out[1:3, 1:3] / out2[1:3, 1:3] # Sample1 Sample2 Sample3 # 1 80.81611 81.32609 NaN # 2 80.81611 81.32609 80.81611 # 3 80.81611 NaN 80.81611
Sample size
Coverage
- Recommended Coverage and Read Depth for NGS Applications from @Genohub.
- bedtools. The bedtools is now hosted on github
- https://github.com/alyssafrazee/polyester
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
- Page 18 of this RNA-Seq 101 from Agilent or Estimating Sequencing Coverage from Illumina.
C = L N / G
where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
- coverage() function in IRanges package.
- bamcov - Quickly calculate and visualize sequence coverage in alignment files
- Coverage and read depth
- What Is The Sequencing 'Depth' ? Coverage = (total number of bases generated) / (size of genome sequenced). So a 30x coverage means, on an average each base has been read by 30 sequences.
- Sequencing depth and coverage: key considerations in genomic analyses
- Qualimap
- https://www.biostars.org/p/5165/
# Assume the bam file is sorted by chromosome location # took 40 min on 5.8G bam file. samtools depth has no threads option:( # it is not right since it only account for regions that were covered with reads samtools depth *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}' # maybe 42 # The following is the right way! The result matches with Qualimap program. samtools depth -a *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}' # maybe 8 # OR LEN=`samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'` # 3095693981 SUM=`samtools depth bamfile | awk '{sum+=$3} END { print "Sum = ", sum}'` # 24473867730 echo $(( $LEN/$SUM ))
5 common genomics file formats
5 genomics file formats you must know (video)
- fastq,
- fastq,
- bam,
- vcf,
- bed (genomic intervals regions)
SAM/Sequence Alignment Format and BAM format specification
- https://samtools.github.io/hts-specs/SAMv1.pdf and samtools webpage.
- http://genome.sph.umich.edu/wiki/SAM
Single-end, pair-end, fragment, insert size
Germline vs Somatic mutation
Germline: inherit from parents. See the Wikipedia page.
Driver vs passenger mutation
https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer
Nonsynonymous mutation
It is related to the genetic code, Wikipedia. There are 20 amino acids though there are 64 codes.
See
- http://evolution.about.com/od/Overview/a/Synonymous-Vs-Nonsynonymous-Mutations.htm
- https://en.wikipedia.org/wiki/Nonsynonymous_substitution
- Understanding SNPs and INDELs in microbial genomes
- An example from https://en.wikipedia.org/wiki/Silent_mutation
- nonsynonymous: ATG to GTG mutation (AUG = Met, GUG = Val)
- synonymous: CAT to CAC mutation (CAU = His, CAC = His)
isma: analysis of mutations detected by multiple pipelines
isma: an R package for the integrative analysis of mutations detected by multiple pipelines
mutSignatures: analysis of cancer mutational signatures
- MutSignatures: an R package for extraction and analysis of cancer mutational signatures
- https://github.com/dami82/mutSignatures
Rediscover: identify mutually exclusive mutations
Rediscover: an R package to identify mutually exclusive mutations
Tumor mutational burden
- https://en.wikipedia.org/wiki/Tumor_mutational_burden
- Treatment Response: An analysis of a large cohort of patients receiving ICI (immune checkpoint inhibitors) therapy revealed that higher TMB levels (≥ 20 mutations/Mb) corresponded to a 58% response rate to ICIs while lower TMB levels (<20 mutations/Mb) reduced response to 20%.
- Cut-offs (High and low TMB status): Different studies have assigned different cut-offs to delineate between high and low TMB status.
- Correlate tumor mutation burden with immune signatures in human cancers Wang 2019
- Significance of tumor mutation burden combined with immune infiltrates in the progression and prognosis of ovarian cancer Bi 2020
- The Challenges of Tumor Mutational Burden as an Immunotherapy Biomarker 2020
- Tumor Mutational Burden as a Predictive Biomarker in Solid Tumors 2020
- maftools Summarize, Analyze and Visualize MAF Files
- 2449 result(s) for 'TMB' from BMC
- How to calculate TMB for somatic data TCGA
- Question: TMB Tumor Mutation Burden
- Mining TCGA database for tumor mutation burden and their clinical significance in bladder cancer
- Establishing guidelines to harmonize tumor mutational burden (TMB): in silico assessment of variation in TMB quantification across diagnostic platforms 2020
- TMB (python - This tool was designed to calculate a Tumor Mutational Burden (TMB) score from a VCF file.
- TMBleR R package, docker, shiny.
- TMB prediction App R, shiny
Missense variants
aminoacid changing variants
Alternative and differential splicing
- Best practices and appropriate workflows to analyse alternative and differential splicing
- AS-CMC – a pan-cancer database of alternative splicing for molecular classification of cancer
Allele vs Gene
http://www.diffen.com/difference/Allele_vs_Gene
- A gene is a stretch of DNA or RNA that determines a certain trait.
- Genes mutate and can take two or more alternative forms; an allele is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
- An allele is found at a fixed spot on a chromosome?
- Chromosomes occur in pairs so organisms have two alleles for each gene — one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be same (homozygous) or different (heterozygotes).
Locus
https://en.wikipedia.org/wiki/Locus_(genetics)
Haplotypes
- http://www.brown.edu/Research/Istrail_Lab/proj_cmsh.php
- http://www.nature.com/nri/journal/v5/n1/fig_tab/nri1532_F2.html
- https://www.sciencenews.org/article/seeking-genetic-fate
- http://www.medscape.com/viewarticle/553400_3
Base quality, Mapping quality, Variant quality
- Fastq base quality: https://en.wikipedia.org/wiki/FASTQ_format
- Mapping quality: http://genome.cshlp.org/content/18/11/1851.long
- Variant quality: http://www.ncbi.nlm.nih.gov/pubmed/21903627
VarSAP
Enrichment of Variant Information for the Variant Standardization and Annotation Pipeline
Mapping quality (MAPQ) vs Alignment score (AS)
http://seqanswers.com/forums/showthread.php?t=66634 & SAM format specification
- MAPQ (5th column): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong} (defined by SAM documentation), rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. MAPQ is a metric that tells you how confident you can be that the read comes from the reported position. So given 1000 reads, for example, read alignments with mapping quality being 30, one of them will be wrong in average (10^(30/-10)=.001). Another example, if MAPQ=70, then the probability mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Users should refer to the alignment program for the 'MAPQ' value it uses.
- AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)
Note:
- MAPQ scores produced by the aligners typically involves the alignment score and other information.
- You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but still the reported position is still much more probable than any other.
- You probably want to filter for MAPQ, but "good" alignment may refer to AS if what you care is similarity between read and reference.
- MAPQ values are really useful but their implementation is a mess by Simon Andrews
gene's isoform
- Bioinformatics Tools for RNA-seq Gene and Isoform Quantification 2016. Figure 1 gives an illustration how isoform can affect the computation of RPKM.
- Evaluation and comparison of computational tools for RNA-seq isoform quantification BMC Genomics 2017.
- Differential Isoform Expression With RNA-Seq: Are We Really There Yet?
- An example of unidentified transcript-level estimates while gene-level estimation is still possible. Figure 1C in Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences
FFPE Tissue vs Frozen Tissue
- What Is FFPE Tissue And What Are Its Uses
- A new model to predict survival in colorectal cancer from DNA / RNA-Seq data
Wild type vs mutant
ns
Not significant
PARP inhibitor
What is a PARP Inhibitor? Dana-Farber Cancer Institute
Undifferentiated cancer
- Medical Definition of Undifferentiated cancer (不好的 tumor)
- What are undifferentiated cells
- Undifferentiated cells are cells that have not yet developed into a specific type of cell and do not possess the characteristics of a fully differentiated cell. They are also known as stem cells or progenitor cells.
- In developmental biology, undifferentiated cells are the cells that have not yet undergone differentiation, the process by which a less specialized cell becomes a more specialized cell, with a specific function and characteristics. These cells have the potential to divide and differentiate into multiple cell types, either through normal development or in response to injury or disease.
- In the context of cancer, undifferentiated cells refer to cells that have not yet developed into a specific type of cancer cell. These cells are sometimes called cancer stem cells, and they are thought to be the cells that give rise to the various types of cells within a tumor. They are believed to be responsible for the maintenance and growth of the tumor, and for its ability to spread to other parts of the body. They can be found within many types of cancer, and are considered to be an important target for cancer therapy, as they are thought to be more resistant to traditional treatments such as chemotherapy and radiation.
- GSE164174
Other software
Partek
- Single Cell RNA-Seq
- Partek Flow Software http://youtu.be/-6aeQPOYuHY
- RNA Seq Analysis with Partek Flow
- A playlist of Single Cell Analysis with Partek Flow Bioinformatics Software (WTH the videos are 720p only)
- Understanding RNA-Seq Data Analysis: A Back-to-basics Overview
- https://partekflow.cit.nih.gov/
dCHIP
MeV
MeV v4.8 (11/18/2011) allows annotation from Bioconductor
IPA from Ingenuity
Login: There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.
Features:
- easily search the scientific literature/integrate diverse biological information.
- build dynamic pathway models
- quickly analyze experimental data/Functional discovery: assign function to genes
- share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.
Start Here
Expression data -> New core analysis -> Functions/Diseases -> Network analysis Canonical pathways | | | Simple or advanced search --------------------+ | | | v | My pathways, Lists <------+ ^ | Creating a custom pathway --------------------+
Resource:
- http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf
- http://libguides.mit.edu/content.php?pid=14149&sid=843471
- http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/
- IPA 5.5 manual http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/ipa_help_manual_5.5_v1.pdf
- Help and supports
- Tutorials which includes
- Search for genes
- Analysis results
- Upload and analyze example data
- Upload and analyze your own expression data
- Visualize connections among genes
- Learn more special features
- Human isoform view
- Transcription factor analysis
- Downstream effects analysis
Notes:
- The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
- The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
- The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
- In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
- The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.
Screenshots:
File:IngenuityAnalysisOutput.png
DAVID Bioinformatics Resource
It offers an integrated annotation combining gene ontology, pathways and protein annotations.
It can be used to identify the pathways associated with a set of genes; e.g. this paper.
GOTrapper
GOTrapper: a tool to navigate through branches of gene ontology hierarchy
qpcR
Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).
GSEA
- http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp
- http://www.hmwu.idv.tw/web/CourseSMDA/MADA/Hank_MicroarrayDataAnalysis-GSEA-20110616.pdf
- MGSEA– a multivariate Gene set enrichment analysis
sandbox.bio: Interactive bioinformatics tutorials
https://sandbox.bio/. An interactive playground for learning bioinformatics command-line tools like bedtools, bowtie2, and samtools.