Genome: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 2,446: Line 2,446:
*** A simulated data shows '''[https://youtu.be/Gi0JdrxRq5s?si=1klGB8taFVnZ1B9v&t=462 FDR is doing a great job limiting number of <span style="color: orange">False Positives</span> in the significant results, but it's not awesome keeping the <span style="color: green">True Positives</span>]'''.
*** A simulated data shows '''[https://youtu.be/Gi0JdrxRq5s?si=1klGB8taFVnZ1B9v&t=462 FDR is doing a great job limiting number of <span style="color: orange">False Positives</span> in the significant results, but it's not awesome keeping the <span style="color: green">True Positives</span>]'''.
*** [https://youtu.be/Gi0JdrxRq5s?si=1pJkbX6_ltrAB0Nx&t=542 Each time we increase the number of bogus tests, we reduce the number of True Positive p-values < .05 that survive the FDR adjustment].
*** [https://youtu.be/Gi0JdrxRq5s?si=1pJkbX6_ltrAB0Nx&t=542 Each time we increase the number of bogus tests, we reduce the number of True Positive p-values < .05 that survive the FDR adjustment].
*** [https://youtu.be/Gi0JdrxRq5s?si=OAsr2lI92sf0429P&t=580 A plot of True/False Positives vs Number of bogus tests]. That means FDR is still having its limitation. The independent filtering helps address this limitation.
*** [https://youtu.be/Gi0JdrxRq5s?si=OAsr2lI92sf0429P&t=580 A plot of True/False Positives vs Number of bogus tests]. That means FDR is still having its limitation. <s>The independent filtering helps address this limitation.</s>
*** [https://youtu.be/Gi0JdrxRq5s?si=qDEhvzHcI7_M3eX_&t=637 edgeR recommends removing all genes except those with > 1 CPM in 2 or more samples]. It compensates for differences in read depth between libraries. The CPM cutoff "1" depends on the data. [https://youtu.be/Gi0JdrxRq5s?si=m6YVfwkIBwZGd1dN&t=924 How can we determine what a good CPM cutoff is?].  
*** [https://youtu.be/Gi0JdrxRq5s?si=qDEhvzHcI7_M3eX_&t=637 edgeR recommends removing all genes except those with > 1 CPM in 2 or more samples]. It compensates for differences in read depth between libraries. The CPM cutoff "1" depends on the data. [https://youtu.be/Gi0JdrxRq5s?si=m6YVfwkIBwZGd1dN&t=924 How can we determine what a good CPM cutoff is?].  
*** [https://youtu.be/Gi0JdrxRq5s?si=LX2m1mlFVZH9REzY&t=983 Differences of edgeR and DESeq2].  
*** [https://youtu.be/Gi0JdrxRq5s?si=LX2m1mlFVZH9REzY&t=983 Differences of edgeR and DESeq2].  

Revision as of 12:39, 8 March 2024

Visualization

Ten simple rules

Ten simple rules for developing visualization tools in genomics

IGV

nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh

Simulated DNA-Seq

The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.

File:Igv dna simul.png

Whole genome

PRJEB1486

File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png

RNA-Seq

  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).

ChromoMap

ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

RNA-seq DRaMA

https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest

Gviz

GIVE: Genomic Interactive Visualization Engine

Build your own genome browser

ChromHeatMap

Heat map plotting by genome coordinate.

ggbio

Wondering how to look at the reads of a gene in samples to check if it was knocked out?

NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.

rtracklayer

R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.

ssviz

A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).

Sushi

See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal, TCGA, PanCanAtlas

See TCGA.

TCPA

Download. Level 4.

Qualimap

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

SeqMonk

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

dittoSeq

dittoSeq – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics

SeqCVIBE

SeqCVIBE – interactive analysis, exploration, and visualization of RNA-Seq data

ggcoverage

ggcoverage: an R package to visualize and annotate genome coverage for various NGS data 2023

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

modSaRa2

An accurate and powerful method for copy number variation detection

Visualization

reconCNV: interactive visualization of copy number data from high-throughput sequencing 2021

NGS

File:CentralDogmaMolecular.png

See NGS.

R and Bioconductor packages

Resources

library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)

Docker

Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

rnaseqGene

rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression

tximport

CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type example package name contents
OrgDb org.Hs.eg.db gene based information for Homo sapiens
TxDb TxDb.Hsapiens.UCSC.hg19.knownGene transcriptome ranges for Homo sapiens
OrganismDb Homo.sapiens composite information for Homo sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19 genome sequence for Homo sapiens
refGenome

RNA-Seq Data Analysis using R/Bioconductor

recount2

recount3

  • Intro RNA-seq LCG-UNAM 2022 (Spanish)
  • Recount3: PCR duplicates.
    • PCR duplication can refer to two different things. It can mean the process of making many copies of a specific DNA region using a technique called Polymerase Chain Reaction (PCR). PCR relies on a thermostable DNA polymerase and requires DNA primers designed specifically for the DNA region of interest.
    • On the other hand, PCR duplication can also refer to a problem that occurs when the same DNA fragment is amplified and sequenced multiple times, resulting in identical reads that can bias many types of high-throughput-sequencing experiments. These identical reads are called PCR duplicates and can be eliminated using various methods such as removing all but one read of identical sequences or using unique molecular identifiers (UMIs) to enable accurate counting and tracking of molecules.
    • UMI stands for Unique Molecular Identifier. It is a complex index added to sequencing libraries before any PCR amplification steps, enabling the accurate bioinformatic identification of PCR duplicates. UMIs are also known as Molecular Barcodes or Random Barcodes. UMIs are valuable tools for both quantitative sequencing applications and also for genomic variant detection, especially the detection of rare mutations. UMI sequence information in conjunction with alignment coordinates enables grouping of sequencing data into read families representing individual sample DNA or RNA fragments.
    • dedup - Deduplicate reads using UMI and mapping coordinates
    • UMIs can be extracted from a fastq file using awk. For example awk 'NR % 4 == 1 {split($0,a,":"); print a[6]}' input.fastq > umis.txt . Here we assume the read header is @SEQ_ID:LANE:TILE:X:Y:UMI, then the UMI sequence is in the 6th field, following the 5th colon.

dbGap

dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

eQTL

Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN

GenomicDataCommons package

Note:

  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has 4 categories data and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.

Visualization

GenVisR

ComplexHeatmap

Read counts

Read, fragment

  • Meaning of the "reads" keyword in terms of RNA-seq or next generation sequencing. A read refers to the sequence of a cluster that is obtained after the end of the sequencing process which is ultimately the sequence of a section of a unique fragment.
  • What is the difference between a Read and a Fragment in RNA-seq?. Diagram , Pair-end, single-end.
  • In the context of RNA-seq, "read" and "fragment" may refer to slightly different things, but they are related concepts.
    • A read is a short sequence of nucleotides that has been generated by a sequencing machine. These reads are typically around 100-150 bases long. RNA-seq experiments generate millions or billions of reads, and these reads are aligned to a reference genome or transcriptome to determine which reads came from which genes or transcripts, this information is used to quantify gene and transcript expression levels.
    • A fragment is a piece of RNA that has been broken up and converted into a read. In RNA-seq, the first step is to convert the RNA into a library of fragments. To do this, the RNA is typically broken up into smaller pieces using a process called fragmentation. Then, adapters are added to the ends of the fragments to allow them to be sequenced. The fragments are then converted into a library of reads that can be sequenced using a next-generation sequencing platform.
    • In summary, a read is a short sequence of nucleotides that has been generated by a sequencing machine, whereas a fragment is a piece of RNA that has been broken up and converted into a read. The process of fragmentation creates a library of fragments that are then converted into reads that can be sequenced.
  • Does one fragment contain 1 read or multiple reads?
    • One fragment in RNA-seq can contain multiple reads, depending on the sequencing technology and library preparation protocol used.
    • In the process of library preparation, RNA is first fragmented into smaller pieces, then adapters are ligated to the ends of the fragments. The fragments are then amplified using PCR, generating multiple copies of the original fragment. These amplified fragments are then sequenced using a next-generation sequencing platform, generating multiple reads per fragment.
    • For example, in Illumina sequencing, fragments are ligated with adapters, then they are clonally amplified using bridge amplification. This allows for the creation of clusters of identical copies of the original fragment on a sequencing flow cell. Then, each cluster is sequenced, generating a large number of reads per fragment.
    • In other technologies like PacBio or Nanopore, the sequencing of a fragment generates only one read, as the technology can read long stretches of DNA, therefore it doesn't need to fragment the RNA prior to sequencing.
    • In summary, one fragment in RNA-seq can contain multiple reads, depending on the sequencing technology and library preparation protocol used. The number of reads per fragment can vary from one to several thousands.
  • How many reads in a fragment on average in illumination sequencing?
    • The number of reads per fragment in Illumina sequencing can vary depending on the sequencing platform and library preparation protocol used, as well as the sequencing depth and the complexity of the sample. However, on average, one fragment can generate several hundred to several thousand reads in Illumina sequencing.
    • When sequencing is performed on the Illumina platform, the process of library preparation includes fragmenting the RNA into smaller pieces, ligating adapters to the ends of the fragments, and then amplifying the fragments using bridge amplification. This allows for the creation of clusters of identical copies of the original fragment on a sequencing flow cell. Then, each cluster is sequenced, generating multiple reads per fragment.
    • The number of reads per fragment can also be affected by the sequencing depth, which refers to the total number of reads generated by the sequencing machine. A higher sequencing depth will result in more reads per fragment, while a lower sequencing depth will result in fewer reads per fragment.
    • In summary, the number of reads per fragment in Illumina sequencing can vary, but on average, one fragment can generate several hundred to several thousand reads. The number of reads per fragment can be influenced by the sequencing platform, library preparation protocol, sequencing depth, and the complexity of the sample.

Rsubread

RSEM

  • RSEM, rsem-calculate-expression
  • RSEM on Biowulf
    $ mkdir SeqTestdata/RNASeqFibroblast/output
    $ sinteractive --cpus-per-task=2 --mem=10g
    $ module load rsem bowtie STAR
    $ rsem-calculate-expression -p 2 --paired-end --star \
    				../test.SRR493366_1.fastq ../test.SRR493366_2.fastq \
    				/fdb/rsem/ref_from_genome/hg19 Sample1 # 12 seconds
    				
    $ ls -lthog
    total 5.8M
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.genes.results
    -rw-r----- 1 2.5M Nov 24 13:39 Sample1.isoforms.results
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.transcript.bam
    drwxr-x--- 2 4.0K Nov 24 13:39 Sample1.stat
    
    $ wc -l Sample1.genes.results
    26335 Sample1.genes.results
    $ wc -l Sample1.isoforms.results
    51399 Sample1.isoforms.results
    
    $ head -2 Sample1.genes.results
    gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
    A1BG	NM_130786	1766.00	1589.99	0.00	0.00	0.00
    $ head -2 Sample1.isoforms.results
    transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
    NM_130786	A1BG	1766	1589.99	0.00	0.00	0.00	0.00
    $ head -1 /fdb/rsem/ref_from_genome/hg19.transcripts.fa
    >NM_130786
    $ grep NM_130786 /fdb/igenomes/Homo_sapiens/UCSC/hg19/transcriptInfo.tab
    NM_130786	2721192635	2721199328	2721188175	2	8	431185
  • RSEM gene level result file (see here for an example) contains 5 essential columns (and the element saved by tximport() function) excluding transcript_id
    • Effective length → length. This is different across samples. This is much shorter than Length (e.g. 105 vs 1).
    • Expected count → count. This is the sum of the posterior probability of each read comes from this transcript over all reads.
    • TPM → abundance. The sum of all transcripts' TPM is 1 million.
    • FPKM (not kept?). FPKM_i = 10^3 / l_bar * TPM_i for gene i. So for each sample FPKM is a scaling of TPM.
    R> dfpkm[1:5, 1:3] / txi.rsem$abundance[1:5, 1:3]
              144126_210-T_JKQFX5 144126_210-T_JKQFX6 144126_210-T_JKQFX8
    5S_rRNA             0.7563603            1.118008           0.8485292
    5_8S_rRNA                 NaN                 NaN                 NaN
    6M1-18                    NaN                 NaN                 NaN
    7M1-2                     NaN                 NaN                 NaN
    7SK                 0.7563751            1.118029           0.8485281
    
  • An example using tximport::tximport() and DESeq2::DESeqDataSetFromTximport. Note it directly uses round(expected_count) to get the integer-value counts. See the source of DESeqDataSetFromTximport() here. The tximport vignette has discussed two suggested ways of importing estimates for use with differential gene expression (DGE) methods in the section of "Downstream DGE in Bioconductor". The vignette does not say anything about "expected_count" from RSEM output.
    txi.rsem <- tximport(files, type = "rsem", txIn = F, txOut = F)
    txi.rsem$length[txi.rsem$length == 0] <- 1
    names(txi.rsem) # a list, 
                    # length = effective_length (matrix)
                    # counts = expected counts column (matrix), non-integer
                    # abundance = TPM (matrix)
                    # countsFromAbundance = "no"
    # [1] "abundance"           "counts"              "length"             
    # [4] "countsFromAbundance"
    
    sampleTable <- pheno[, c("EXPID", "PatientID")]
    rownames(sampleTable) <- colnames(txi.rsem$counts)
    
    dds <- DESeq2::DESeqDataSetFromTximport(txi.rsem, sampleTable, ~ PatientID)
    # using counts and average transcript lengths from tximport
    # 
    # The DESeqDataSet class enforces non-negative integer values in the "counts" 
    #     matrix stored as the first element in the assay list.
    dds@assays@data@listData$counts[1:5, 1:3] # integer values. How to compute?
                                   # https://support.bioconductor.org/p/9134840/
    dds@assays@data@listData$avgTxLength[1:5, 1:3] # effective_length
    
    plot(txi.rsem$counts[,1], dds@assays@data@listData$counts[,1])
    abline(0, 1, col = 'red')      # compare expected counts vs integer-value counts
                                   # a straight line
    
    ddsColl1 <- DESeq2::estimateSizeFactors(dds)
    # using 'avgTxLength' from assays(dds), correcting for library size
    # Question: how does the function correct for library size?
    
    ddsColl2 <- DESeq2::estimateDispersions(ddsColl1)
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # final dispersion estimates
    # Note: it seems estimateDispersions is not required 
    #       if we only want to get the normalized count (still need estimateSizeFactors())
    # See ArrayTools/R/FilterAndNormalize.R
    
    cnts2 <- DESeq2::counts(ddsColl2, normalized = FALSE)
    all(dds@assays@data@listData$counts == cnts2)
    # [1] TRUE
    
    all(round(txi.rsem$counts) == cnts2 )
    # [1] TRUE.     So in this case round(expected values) = integer-value counts
    
  • RSEM example on Odyssey
  • A Short Tutorial for RSEM
  • Hands-on Training in RNA-Seq Data Analysis* which includes Quantification using RSEM and Perform DE analysis. Note the expected count column was used in edgeR.
  • Understanding RSEM: raw read counts vs expected counts. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
  • RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Abundance estimates are given in terms of two measures. The output file (XXX_RNASeq.RSEM.genes.results) contains 7 columns: gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM.
    • Expected counts: The is an estimate of the number of fragments that are derived from a given isoform or gene. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR or DESeq.
    • TPM: This is the estimated fraction of transcripts made up by a given isoform or gene. The transcript fraction measure is preferred over the popular RPKM/FPKM measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species.
  • The length or effective_length are different (though similar) for different samples for the same gene
  • A scatter plot and correlation shows the expected_count and TPM are different
    x <- read.delim("144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    colnames(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM" 
    plot(x[, "TPM"], x[, "expected_count"])
    cor(x[, "TPM"], x[, "expected_count"])
    # [1] 0.4902708
    cor(x[, "TPM"], x[, "expected_count"], method = 'spearman')
    # [1] 0.9886384
    x[1:5, "length"]
    [1] 105.01 161.00 473.00  68.00 304.47
    
    x2 <- read.delim("144126_210-T_JKQFX6_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    x2[1:5, "length"]
    # [1] 105.00 161.00 473.00  68.00 305.27
    x[1:5, "effective_length"]
    # [1]   1.03  16.65 293.88   0.00 129.00
    x2[1:5, "effective_length"]
    # [1]   1.58  17.82 293.05   0.00 128.86
    
  • A benchmark for RNA-seq quantification pipelines. 2016 They compare the STAR, TopHat2, and Bowtie2 mapping methods and the Cufflinks, eXpress , Flux Capacitor, kallisto, RSEM, Sailfish, and Salmon quantification methods. RSEM slightly outperforming the rest.
  • Downsample reads from Evaluation of Cell Type Annotation R Packages on Single Cell RNA-seq Data 2020.

Expected_count

Number of reads mapping to that transcript

  • Understanding RSEM: raw read counts vs expected counts In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
    R> x <- read.delim("41samples/165739~295-R~AM1I30~RNASEQ.genes.results")
    R> summary(x$expected_count)     # Larger than TPM, contradict to the above statement
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          0       0      10    1346     556  533634
    R> summary(x$TPM)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        0.00     0.00     0.16    35.58     7.50 70091.93
    R> x[1:5, c("expected_count", "TPM")]
      expected_count      TPM
    1        6190.00 70091.93
    2           0.00     0.00
    3           0.00     0.00
    4           0.00     0.00
    5         795.01   171.67
    
  • Alignment-based的转录本定量-RSEM/ the sum of the posterior probability of each read comes from this transcript over all reads

Expected counts from RSEM in DESeq2? Yes, RSEM expected counts can be used with DESeq2.

# adding
txi$length[txi$length <= 0] <- 1
# before
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

Examples

$ wc -l 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results
   28110 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results

$ head -n 4 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results | cut -f1,3,4,5,6,7

gene_id	  length  effective_length  expected_count TPM	   FPKM
5S_rRNA	  105.01  1.03	            1513.66	   31450.7 23788.06
5_8S_rRNA 161	  16.65	            0	           0	   0
6M1-18	  473	  293.88	    0	           0	   0

Second example

$ wc -l Sample_HS-578T_CB6CRANXX.genes.results
28110
$ head -1 Sample_HS-578T_CB6CRANXX.genes.results
gene_id	transcript_id.s.	length	effective_length	expected_count	TPM	FPKM
$ tail -4Sample_HS-578T_CB6CRANXX.genes.results
septin 9/TNRC6C fusion	uc010wto.1	41	0	0	0	0
svRNAa	uc022bxg.1	22	0	0	0	0
tRNA Pro	uc022bqx.1	65	0	0	0	0
unknown	uc002afm.3	1117	922.85	0	0	0

$ wc -l Sample_HS-578T_CB6CRANXX.isoforms.results
78376
$ head -1 Sample_HS-578T_CB6CRANXX.isoforms.results
transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
$ tail -4 Sample_HS-578T_CB6CRANXX.isoforms.results
uc010wto.1	septin 9/TNRC6C fusion	41	0	0	0	0	0
uc022bxg.1	svRNAa	22	0	0	0	0	0
uc022bqx.1	tRNA Pro	65	0	0	0	0	0
uc002afm.3	unknown	1117	922.85	0	0	0	0

limma

  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.

RSEM

Within-subject correlation

  • Does this RNAseq experiment require a repeated measures approach?
    • Solution 1: 9.7 Multi-level Experiments of LIMMA user guide. duplicateCorrelation(), lmFit(), makeContrasts(), contrasts.fit() and eBayes()
    • Solution 2: Section 3.5 "Comparisons both between and within subjects" in edgeR. model.matrix(), glmQLFit(), glmQLFTtest(), topTags().

Time Course Experiments

  • See Limma's vignette on Section 9.6
    • Few time points (ANOVA, contrast)
      • Which genes respond at either the 6 hour or 24 hour times in the wild-type?
      • Which genes respond (i.e., change over time) in the mutant?
      • Which genes respond differently over time in the mutant relative to the wild-type?
    • Many time points (regression such as cubic spline, moderated F-test)
      • Detect genes with different time trends for treatment vs control.

easyRNASeq

Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.

ShortRead

Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.

Rsamtools

The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.

IRanges

IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.

flank

The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.

Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.

Biostrings

GenomicRanges

GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.

GenomicAlignments

Counting reads with summarizeOverlaps vignette

library(GenomicAlignments)
library(DESeq)
library(edgeR)

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

assays(olap)$counts
#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).

samtools view -h -o outputFile.bam inputFile.sam

A modified R code that works is

###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
             "gtf/drosophila_melanogaster/",
             "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts

###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              pubMedIds="20921232",
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              replicate=c(1,1),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
library(DESeq)
geneCDS <- newCountDataSet(
                  countData=assay(genehits),
                  conditions=design)

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some fake examples.

tidybulk

Bisque

Bisque: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.

chromoMap

Inference

tximport

  • http://bioconductor.org/packages/release/bioc/html/tximport.html
    $ head -5 quant.sf 
    Name	Length	EffectiveLength	TPM	NumReads
    ENST00000456328.2	1657	1410.79	0.083908	2.46885
    ENST00000450305.2	632	410.165	0	0
    ENST00000488147.1	1351	1035.94	10.4174	225.073
    ENST00000619216.1	68	24	0	0
    ENST00000473358.1	712	453.766	0	0
    
  • Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
    $ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results 
    gene_id	transcript_id.s. length	effective_length  expected_count  TPM	  FPKM
    5S_rRNA	uc021ofx.1	 105.04	2.23	          2039.99	  49629.87 35353.97
    
    $ R
    x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    dim(x)
    # [1] 28109     7
    names(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"            
    x[1:3, -2]
    #     gene_id length effective_length expected_count      TPM     FPKM
    # 1   5S_rRNA 105.04             2.23        2039.99 49629.87 35353.97
    # 2 5_8S_rRNA 161.00            21.19           0.00     0.00     0.00
    # 3    6M1-18 473.00           302.74           0.00     0.00     0.00
    
    y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
    dim(y)
    # [1] 78375     8
    names(y)
    # [1] "transcript_id"    "gene_id"          "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"             "IsoPct" 
    y[1:3, -1]
    #   gene_id length effective_length expected_count TPM FPKM IsoPct
    # 1 5S_rRNA    110             3.06              0   0    0      0
    # 2 5S_rRNA    133             9.08              0   0    0      0
    # 3 5S_rRNA     92             0.00              0   0    0      0
    

File:RSEM PDX.png

DESeq2 or edgeR

Shrinkage estimators

  • The package uses a negative binomial statistical model to fit the count data, and can account for differences in sequencing depth across samples.
    • Shrinkage is a technique used to regularize the estimates of the parameters of the negative binomial model.
    • The idea behind shrinkage is to pull the estimated values of the parameters towards a prior distribution, which can help to reduce the variability of the estimates and improve the stability of the results.
    • The specific shrinkage method used in DESeq2 is called the "shrinkage prior for dispersion" method. This method involves adding a prior distribution on the dispersion parameter of the negative binomial model, which is used to control the degree of overdispersion in the data.
    • This prior distribution is designed to shrink the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
  • More details:
    • The negative binomial model is used to model the count data, y, as a function of the mean, [math]\displaystyle{ \mu }[/math], and the dispersion, [math]\displaystyle{ \alpha }[/math]. The probability mass function of the negative binomial is given by:
    [math]\displaystyle{ \begin{align} P(y | \mu, \alpha) = (y + \alpha - 1)! / (y! * (\alpha - 1)!) * (\mu / (\mu + \alpha))^y * (\alpha / (\mu + \alpha))^\alpha \end{align} }[/math]
    • The likelihood of the data is given by:
    [math]\displaystyle{ \begin{align} L(\mu, \alpha | y) = \prod_i [ P(y_i | \mu_i, \alpha) ] \end{align} }[/math]
    • The log-likelihood is :
    [math]\displaystyle{ \begin{align} logL(\mu, \alpha | y) = \sum_i [ \log(P(y_i | \mu_i, \alpha)) ] \end{align} }[/math]
    In this model, [math]\displaystyle{ \mu }[/math] is the mean of the negative binomial for each gene and it is modeled as a linear function of the design matrix.
    [math]\displaystyle{ \begin{align} \mu_i = exp(X_i \beta) \end{align} }[/math]
    [math]\displaystyle{ \alpha }[/math] is the dispersion parameter and it's the same for all the genes, following the common practice in RNA-seq analysis
    • The shrinkage prior is added on [math]\displaystyle{ \alpha }[/math], it assumes that [math]\displaystyle{ \alpha }[/math] is following a hyper-prior distribution like Gamma distribution
    [math]\displaystyle{ \begin{align} \alpha \sim \Gamma(a_0, b_0) \end{align} }[/math]
    This prior allows the shrinkage of [math]\displaystyle{ \alpha }[/math] estimates from the data towards a common value across all the genes, which can help to reduce the variance of the estimated log-fold changes.
    • The goal is to find the values of mu and alpha that maximize the log-likelihood, this is done by using maximum likelihood estimation (MLE) or Bayesian approach where the prior are considered and integrated in the calculation and the result is the posterior probability.
  • Dispersion parameter.
    • In the context of the negative binomial model used in DESeq2, the dispersion parameter, alpha, is a measure of the degree of overdispersion in the data. In other words, it represents the variability of the data around the mean. A value of alpha greater than 1 indicates that the data is more dispersed (more variable) than would be expected if the data were following a Poisson distribution, which is a common distribution used to model count data. The Poisson distribution has a single parameter, the mean, which represents both the location and the scale of the distribution. In contrast, the negative binomial distribution has two parameters, the mean and the dispersion, which allows for more flexibility in fitting the data.
    • The shrinkage method in DESeq2 involves shrinking the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
  • What is the formula of the fold change estimator given by DESeq2?
    • The fold change estimator given by the DESeq2 package is calculated as the ratio of the estimated mean expression levels in two conditions, with the log2 of this ratio being the log2 fold change. The mean expression levels are calculated using a negative binomial model, which accounts for both the mean and the overdispersion of the data.
    • The estimated mean expression level for a gene i in condition j is given by:
    [math]\displaystyle{ \begin{align} \log(\mu_{i,j}) = \beta_j + X_i \gamma \end{align} }[/math]
    where [math]\displaystyle{ \beta_j }[/math] is the overall mean for condition [math]\displaystyle{ j }[/math], [math]\displaystyle{ X_i }[/math] is the design matrix for gene [math]\displaystyle{ i }[/math] and [math]\displaystyle{ \gamma_i }[/math] is the gene-specific effect.
    • The log2 fold change is calculated as:
    [math]\displaystyle{ \begin{align} \log2(\mu_{i,j} / \mu_{i,k}) \end{align} }[/math]
    where j and k are the two conditions being compared.
    • So, you can see that the fold change estimator depends on the design matrix [math]\displaystyle{ X }[/math] and the parameters of the model, [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \gamma }[/math]. The DESeq2 implementation also includes an estimation of the variance-covariance matrix of the parameters to compute the standard deviation (uncertainty) of these parameters, and therefore the standard deviation on the fold-change estimator. This can help to estimate the significance of the fold change between conditions.
    • Hypothesis testing H0: log(FC)=0 using Wald test
  • What is the variance of the estimated log-fold changes before and after applying DESeq2 method?
    • In RNA-seq data analysis, the log-fold change is a measure of the relative difference in expression between two conditions. The log-fold change is calculated as the log2 of the ratio of the mean expression in one condition to the mean expression in another condition.
    • Without using any methods like DESeq2, the variance of the estimated log-fold changes can be high, particularly for genes with low expression levels, which can lead to unreliable results. This high variance is due to the over-dispersion present in RNA-seq data, which results in a large variability in the estimated expression levels even for genes with similar means.
    • When using the DESeq2 package, the shrinkage method is applied on dispersion parameter alpha, which helps to reduce the variance of the estimated log-fold changes. By applying a prior on alpha and by shrinking the estimates of alpha towards a common value across all the genes, the method reduces the variability of the estimates. This results in more stable and reliable estimates of the log-fold changes, which can improve the accuracy and robustness of the results of the differential expression analysis.
    • Additionally, the DESeq2 package also accounts for differences in sequencing depth across samples, which can also help to reduce the variability of the estimated log-fold changes.
  • how DESeq2 package accounts for differences in sequencing depth across samples
    • The DESeq2 package accounts for differences in sequencing depth across samples by using the raw count data to estimate the normalized expression levels for each gene in each sample. This normalization process is necessary because sequencing depth can vary widely between samples, leading to differences in the overall number of reads and the apparent expression levels of the genes.
    • The package uses a method called regularized-logarithm (rlog) transformation to normalize the data, which is a variance stabilization method that is based on the logarithm of the counts, but also adjusts for the total library size and the mean expression level.
    • The method starts by computing a weighted mean of the counts across all samples, which is used as a reference. Next, for each sample, the counts are divided by the library size and then multiplied by the weighted mean of the counts. This scaling step corrects for differences in sequencing depth by making the library sizes comparable between samples.
    • Then, the regularized logarithm (rlog) transformation is applied to the scaled counts, which is given by :
    [math]\displaystyle{ \begin{align} vst = \log(counts/sizeFactor + c) \end{align} }[/math]
    where c is a small positive constant added to the counts to stabilize the variance, size_factor is the ratio of library size for each sample over the weighted mean of the library size.
    • The rlog transformation can stabilize the variance of the data and make the mean expression levels more comparable between samples. This transformed data can then be used for downstream analysis like calculating the fold changes.
    • In addition to rlog transformation the DESeq2 package uses a negative binomial distribution to model the count data, this distribution helps to account for over-dispersion in the data, and shrinkage method on the dispersion parameter is applied as well to improve the stability of results. All of these techniques work together to help correct for sequencing depth differences across samples, which can improve the accuracy of the estimated fold changes and provide more robust results in differential gene expression analysis.
  • type='apeglm' shrinkage only for use with 'coef'

Time course experiment

  • Time course trend analysis from the edgeR's vignette. glmQLFTest()
    • Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an ANOVA F-test for a normal linear model.
    • Assuming gene expression changes smoothly over time, we can use a polynomial or a cubic spline curve with a certain number of degrees of freedom to model gene expression along time.
    • We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The topTags function lists the top set of genes with most significant time effects.
    • The total number of genes with significant (5% FDR) changes at different time points can be examined with decideTests.
  • RNA-seq data collected at different time points. Identify differentially expressed genes associated with seasonal changes

DESeq2 experimental design and interpretation

DESeq2 experimental design and interpretation

Controlling for batch differences

The variable we are interested in ("condition") is placed after the batch variable.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition)
dds <- DESeq(dds)

OR

dds <- DESeq(dds, test="LRT", reduced=~batch)
res <- results(dds)

DESeq2 diagnostic plot

vst over rlog transformation

Expected counts
     | round()
     v                              /-- vst transformation  ---\
Raw counts --> normalized counts  --                            -- Other analyses such as PCA, Hclust (sample distances).
                                    \-- rlog transformation ---/

Simulate negative binomial distribution data

  • rnegbin() in sim.counts() from the ssizeRNA package
    rnegbin(10000 * 10, lambda, 1 / disp) 
    # 10000 genes, 20 samples
    # lambda: mean counts from control group, a matrix. 
    # disp: dispersion parameter, a matrix.
    

Reducing false positives in differential analyses of large RNA sequencing data sets

edgeR vs DESeq2 vs limma

  • edgeR
    library(edgeR)
    
    # create DGEList object from count data
    counts <- matrix(c(20,30,25,50,45,55,15,20,10,5,10,8,100,120,110,80,90,95), nrow=3, ncol=6, byrow=TRUE)
    rownames(counts) <- c("G1", "G2", "G3")
    colnames(counts) <- c("A1", "A2", "A3", "B1", "B2", "B3")
    counts
    #     A1  A2  A3 B1 B2 B3
    # G1  20  30  25 50 45 55
    # G2  15  20  10  5 10  8
    # G3 100 120 110 80 90 95
    d <- DGEList(counts)
    
    # perform normalization and differential expression analysis
    d <- calcNormFactors(d)
    design <- model.matrix(~0+factor(c(rep("A",3), rep("B",3))))
    d <- estimateDisp(d, design)
    fit <- glmQLFit(d, design)
    res <- glmQLFTest(fit, contrast=c(-1, 1))
    
    # summarize the results and identify significant genes
    summary(res)
    res2 <- topTags(res); res2
    # Coefficient:  -1*factor(c(rep("A", 3), rep("B", 3)))A 1*factor(c(rep("A", 3), rep("B", 3)))B 
    #         logFC   logCPM          F       PValue          FDR
    # G1  1.1683093 17.98614 146.579840 4.380158e-08 1.314048e-07
    # G2 -0.7865080 16.41917   3.056504 1.059256e-01 1.588884e-01
    # G3 -0.1437956 19.34279  40.852893 2.256279e-01 2.256279e-01
    de_genes <- rownames(res2)[which(res2$FDR < 0.05 & abs(res2$log2FoldChange) > 1)]
  • DESeq2. The count data above will result in an error. The error can occur when there is very little variability in the count data, which can happen if the biological samples are very homogeneous or if the sequencing depth is very low. In such cases, it may be difficult to reliably identify differentially expressed genes using DESeq2.
    library(DESeq2)
    
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(3, 3))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = col_data, design = ~ condition)
    
    # differential expression analysis
    dds <- DESeq(dds)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) : 
    #   all gene-wise dispersion estimates are within 2 orders of magnitude
    #   from the minimum value, and so the standard curve fitting techniques will not work.
    #   One can instead use the gene-wise estimates as final estimates:
    #   dds <- estimateDispersionsGeneEst(dds)
    #   dispersions(dds) <- mcols(dds)$dispGeneEst
    #   ...then continue with testing using nbinomWaldTest or nbinomLRT
    

    Try another data.

    count_data <- matrix(c(100, 500, 200, 1000, 300,
                           200, 400, 150, 500, 300,
                           300, 300, 100, 1500, 300,
                           400, 200, 50, 2000, 300), nrow = 5, byrow = TRUE)
    
    colnames(count_data) <- paste0("sample", 1:4)
    rownames(count_data) <- paste0("gene", 1:5)
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(2, 2))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # -- note: fitType='parametric', but the dispersion trend was not well captured by the
    #    function: y = a/x + b, and a local regression fit was automatically substituted.
    #    specify fitType='local' or 'mean' to avoid this message next time.
    # final dispersion estimates
    # fitting model and testing
    # Warning message:
    # In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
    #   Estimated rdf < 1.0; not estimating variance
    
    # differential expression analysis
    dds <- DESeq(dds)
    
    # extract results
    res <- results(dds)
    res
    # log2 fold change (MLE): condition untreated vs treated 
    # Wald test p-value: condition untreated vs treated 
    # DataFrame with 5 rows and 6 columns
    #        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
    #       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
    # gene1   465.558       0.707313  1.243608  0.568758 0.5695201  0.711900
    # gene2   309.591      -0.119330  0.885551 -0.134752 0.8928081  0.892808
    # gene3   413.959      -0.742921  0.513376 -1.447130 0.1478604  0.369651
    # gene4   638.860      -1.372724  1.428977 -0.960634 0.3367360  0.561227
    # gene5   721.470       2.928705  1.536174  1.906493 0.0565863  0.282931
    
    # extract DE genes with adjusted p-value < 0.05 and |log2 fold change| > 1
    DESeq2_DE_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
    
    # print the number of DE genes identified by DESeq2
    cat("DESeq2 identified", nrow(DESeq2_DE_genes), "DE genes.\n")
  • limma-voom. voom is a function in the limma package that modifies RNA-Seq data for use with limma. Differential Expression with Limma-Voom.
    library(limma)
    
    # Create a design matrix with the sample groups
    design_matrix <- model.matrix(~ condition, data = col_data)
    
    # filter out low-expressed genes
    if (FALSE) {
      keep <- rowSums(counts) >= 10
      counts <- counts[keep,]
    }
    
    # normalization using voom
    v <- voom(count_data, design_matrix)
    
    # linear model fitting
    fit <- lmFit(v, design_matrix)
    
    # Calculate the empirical Bayes statistics
    fit <- eBayes(fit)
    
    top.table <- topTable(fit, sort.by = "P", n = Inf)
    top.table
    #            logFC  AveExpr          t    P.Value adj.P.Val         B
    # gene3 -2.3242262 15.67787 -1.8229203 0.08330054 0.3153894 -4.249298
    # gene5  2.0865122 16.34012  1.5960616 0.12615577 0.3153894 -4.376405
    # gene1  0.5610761 16.69722  0.5079444 0.61704872 0.7551329 -4.834183
    # gene2  0.2477959 18.69843  0.3829295 0.70581164 0.7551329 -5.150958
    # gene4  0.2869870 17.11809  0.3161922 0.75513288 0.7551329 -4.963932
    
    # Perform hypothesis testing to identify DE genes
    results <- decideTests(fit)
    
    summary(results)
    #        (Intercept) conditionuntreated
    # Down             0                  0
    # NotSig           0                  5
    # Up               5                  0
    
    # Extract the DE genes
    de_genes <- rownames(count_data)[which(results$all != 0)]

DESeq2 vs edgeR

D vs E?

  • One major difference is in the method used to estimate the dispersion parameter. DESeq2 uses a local regression method, whereas edgeR uses a Cox-Reid profile-adjusted likelihood method. The local regression method estimates the dispersion parameter for each gene independently, whereas the profile-adjusted likelihood method estimates a common dispersion parameter for all genes, with gene-specific scaling factors that depend on the mean expression levels.
  • Another difference is in the approach to normalization. DESeq2 uses a variance-stabilizing transformation to account for differences in library size and composition, whereas edgeR uses a trimmed mean of M-values (TMM) normalization method, which adjusts for library size differences by scaling the counts of each sample to a common effective library size.
  • DESeq2 also uses a different statistical model for differential expression analysis. DESeq2 models the count data as a negative binomial distribution, but includes additional terms to account for batch effects and other sources of variation. It uses a shrinkage estimator to improve the estimation of fold changes and reduce false positives. EdgeR, on the other hand, uses a similar negative binomial model but applies an empirical Bayes method to estimate gene-specific dispersions and to borrow information across genes to improve the power of detection and reduce false positives.

When should I choose DESeq2 and when should I choose edgeR?

  • The choice between DESeq2 and edgeR for differential gene expression analysis depends on several factors, including the experimental design, sample size, and the nature of the biological question being investigated. Here are some general guidelines to help you choose between these two algorithms:
  • Choose DESeq2 when:
    • The experimental design includes multiple batches or covariates that may affect the gene expression levels
    • The sample size is small, typically fewer than 12 samples per group
    • The gene expression levels are highly variable across replicates, and the goal is to identify differentially expressed genes with a low false discovery rate (FDR)
    • The focus is on the fold change rather than the statistical significance of differential expression
  • Choose edgeR when:
    • The experimental design includes several factors, such as treatment, time, and biological replicate, and the goal is to identify the main effects and interaction effects of these factors on gene expression
    • The sample size is moderate to large, typically more than 12 samples per group
    • The gene expression levels are less variable across replicates, and the goal is to achieve high statistical power to detect differentially expressed genes
    • The focus is on both the fold change and the statistical significance of differential expression, and the researcher is interested in performing downstream analyses such as gene set enrichment analysis or pathway analysis.

DESeq2 in Python

PyDESeq2, nice work

Generalized Linear Models and Plots with edgeR

Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis

EBSeq

An R package for gene and isoform differential expression analysis of RNA-seq data

http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/

prebs

Probe region expression estimation for RNA-seq data for improved microarray comparability

DEXSeq

Inference of differential exon usage in RNA-Seq

rSeqNP

A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

http://www.biosoft.hacettepe.edu.tr/voomDDA/

Pathway analysis

About the KEGG pathways

  • MSigDB database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
  • msigdb from Bioconductor
  • KEGGREST package directly pull the data from kegg.jp. I can get 337 KEGG pathways for human ('hsa')
    BiocManager::install("KEGGREST")
    library(KEGGREST)
    res <- keggList("pathway", "hsa") 
    length(res) # 337
    

GSOAP

GSOAP: a tool for visualization of gene set over-representation analysis

clusterProfiler

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis

hypeR

GSEPD

GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display

SeqGSEA

http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html

BAGSE

BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020

GeneSetCluster

GeneSetCluster: a tool for summarizing and integrating gene-set analysis results

Pipeline

SPEAQeasy

SPEAQeasy – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses

Nextflow

GeneTEFlow

GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data

pipeComp

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

SARTools

http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/

SEQprocess

SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

GEMmaker

GEMmaker, Paper

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).

BitSeq

Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.

ReportingTools

The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.

Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.

It is suggested by e.g. EnrichmentBrowser.

sequences

More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.

QuasR

Bioinformatics paper

CRAN/Bioconductor packages

ssizeRNA

RNASeqPower

RNASeqPower Sample size for RNAseq studies

RnaSeqSampleSize

Shiny app

rbamtools

Provides an interface to functions of the 'SAMtools' C-Library by Heng Li

refGenome

The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.

WhopGenome

Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.

TCGA2STAT

Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.

source("https://bioconductor.org/biocLite.R")
biocLite("CNTools")

install.packages("TCGA2STAT")

The getTCGA() function allows to download various kind of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")

TCGAbiolinks

  • An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    query <- GDCquery(project = "TCGA-ACC",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               legacy = TRUE)
    
    gdcdir <- file.path("Waldron_PublicData", "GDCdata")
    GDCdownload(query, method = "api", files.per.chunk = 10,
                directory = gdcdir)  # 79 files
    ACCse <- GDCprepare(query, directory = gdcdir)
    ACCse
    class(ACCse)
    dim(assay(ACCse))  # 19947 x 79
    assay(ACCse)[1:3, 1:2] # symbol id
    length(unique(rownames(assay(ACCse))))   #  19672
    rowData(ACCse)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #          gene_id entrezgene ensembl_gene_id
    #      <character>  <integer>     <character>
    # A1BG        A1BG          1 ENSG00000121410
    # A2M          A2M          2 ENSG00000175899
    
  • HTSeq counts data example. DeMixT. Error when running GDC_prepare.
    query2 <- GDCquery(project = "TCGA-ACC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type="HTSeq - Counts") # or "STAR - Counts"
    gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
    GDCdownload(query2, method = "api", files.per.chunk = 10,
                directory = gdcdir2)  # 79 files
    ACCse2 <- GDCprepare(query2, directory = gdcdir2)
    ACCse2
    dim(assay(ACCse2))  # 56457 x 79
    assay(ACCse2)[1:3, 1:2]  # ensembl id
    rowData(ACCse2)[1:2, ]
    DataFrame with 2 rows and 3 columns
                    ensembl_gene_id external_gene_name original_ensembl_gene_id
                        <character>        <character>              <character>
    ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
    ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
    
  • Clinical data
    acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
    dim(acc_clin)
    # [1] 92 71
    
  • TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
    dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
    dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut =  0.25)
    samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
    samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
    dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                          dataFilt[,samplesTP],"Normal", "Tumor")
    # 2nd example
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                               Cond1type = "LGG", Cond2type = "GBM",
                               fdr.cut = 0.01,  logFC.cut = 1,
                               method = "glmLRT")
    
  • Enrichment analysis
    ansEA <– TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", 
                                    RegulonList = rownames(dataDEGs))
    
    TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                            GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                            GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                            nRGTab = rownames(dataDEGs),
                            nBar = 20)
    
  • mRNA Analysis Pipeline from GDC documentation.