Genome: Difference between revisions

From 太極
Jump to navigation Jump to search
 
(One intermediate revision by the same user not shown)
Line 1,664: Line 1,664:
*** [https://github.com/xia-lab/MetaboAnalystR MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics]
*** [https://github.com/xia-lab/MetaboAnalystR MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics]
*** [https://www.metaboanalyst.ca/docs/RTutorial.xhtml MetaboAnalyst6.0 - from raw spectra to biomarkers, patterns, functions and systems biology]
*** [https://www.metaboanalyst.ca/docs/RTutorial.xhtml MetaboAnalyst6.0 - from raw spectra to biomarkers, patterns, functions and systems biology]
*** [https://www.metaboanalyst.ca/Secure/utils/NameMapView.xhtml Compound Name/ID Standardization]
** [https://cran.r-project.org/web/packages/omu/index.html omu] : A Metabolomics Analysis Tool for Intuitive Figures and Convenient Metadata Collection
** [https://cran.r-project.org/web/packages/omu/index.html omu] : A Metabolomics Analysis Tool for Intuitive Figures and Convenient Metadata Collection
** FELLA
** FELLA
Line 1,674: Line 1,675:
*** Make sure to use '''internalDir = TRUE''' option in '''buildDataFromGraph()''' so the database can be shown in the database dropdown menu when we use '''launchApp()''',
*** Make sure to use '''internalDir = TRUE''' option in '''buildDataFromGraph()''' so the database can be shown in the database dropdown menu when we use '''launchApp()''',
*** https://github.com/b2slab/FELLA/blob/master/R/plotGraph.R
*** https://github.com/b2slab/FELLA/blob/master/R/plotGraph.R
* [https://www.metabolomicsworkbench.org/databases/refmet/index.php RefMet: A Reference list of Metabolite names]


=== Enzyme ===
=== Enzyme ===
Line 3,723: Line 3,726:
* FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
* FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
* [https://support.bioconductor.org/p/9154302/ Differential expression analysis with only FPKM matrix available from total newbie in R]
* [https://support.bioconductor.org/p/9154302/ Differential expression analysis with only FPKM matrix available from total newbie in R]
== Gene length ==
<pre>
library(AnnotationHub)
library(airway)
library(SummarizedExperiment)
# Load the airway dataset
data("airway")
counts <- assay(airway)
counts <- as.matrix(counts)  # Ensure it's a matrix
# TPM Calculation Function
compute_tpm <- function(counts, lengths) {
  length_scaled <- counts / lengths
  scaling_factors <- colSums(length_scaled)
  tpm <- t(t(length_scaled) / scaling_factors) * 1e6
  return(tpm)
}
BiocManager::install("biomaRt")  # If not installed yet
library(biomaRt)
# Connect to Ensembl BioMart for human
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# Fetch gene lengths for human genes
gene_lengths <- getBM(
  attributes = c("ensembl_gene_id", "gene_biotype", "transcript_length"),  # Attribute options
  mart = mart
)
# Make sure to use ENSEMBL IDs (Ensembl Gene IDs) from the airway data
head(gene_lengths)
  ensembl_gene_id  gene_biotype transcript_length
1 ENSG00000210049        Mt_tRNA                71
2 ENSG00000211459        Mt_rRNA              954
3 ENSG00000210077        Mt_tRNA                69
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# Extract the transcripts
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
genes <- genes(txdb)
length(genes)
# [1] 30969
names(genes)[1:5]
# [1] "1"        "10"        "100"      "1000"      "100008586"
# Calculate lengths (this uses the exons of the genes)
gene_lengths <- width(reduce(genes))
gene_lengths <- as.numeric(gene_lengths)
length(gene_lengths)
# [1] 26269
gene_lengths[1:3]
# [1] 2541 1556 6167
</pre>


== [http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ RPKM, FPKM, TPM and DESeq] ==
== [http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ RPKM, FPKM, TPM and DESeq] ==

Latest revision as of 20:34, 13 December 2024

Visualization

Ten simple rules

Ten simple rules for developing visualization tools in genomics

IGV

nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh

Simulated DNA-Seq

The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.

File:Igv dna simul.png

Whole genome

PRJEB1486

File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png

RNA-Seq

  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).

ChromoMap

ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

RNA-seq DRaMA

https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest

Gviz

GIVE: Genomic Interactive Visualization Engine

Build your own genome browser

ChromHeatMap

Heat map plotting by genome coordinate.

ggbio

Wondering how to look at the reads of a gene in samples to check if it was knocked out?

NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.

rtracklayer

R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.

ssviz

A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).

Sushi

See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal, TCGA, PanCanAtlas

See TCGA.

TCPA

Download. Level 4.

Qualimap

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

SeqMonk

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

dittoSeq

dittoSeq – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics

SeqCVIBE

SeqCVIBE – interactive analysis, exploration, and visualization of RNA-Seq data

ggcoverage

ggcoverage: an R package to visualize and annotate genome coverage for various NGS data 2023

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

modSaRa2

An accurate and powerful method for copy number variation detection

Visualization

reconCNV: interactive visualization of copy number data from high-throughput sequencing 2021

NGS

File:CentralDogmaMolecular.png

See NGS.

mNGS

找出病原菌的新武器 :總基因體次世代定序是什麼?

R and Bioconductor packages

Resources

library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)

Docker

Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

rnaseqGene

rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression

tximport

CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type example package name contents
OrgDb org.Hs.eg.db gene based information for Homo sapiens
TxDb TxDb.Hsapiens.UCSC.hg19.knownGene transcriptome ranges for Homo sapiens
OrganismDb Homo.sapiens composite information for Homo sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19 genome sequence for Homo sapiens
refGenome

RNA-Seq Data Analysis using R/Bioconductor

recount2

recount3

  • Intro RNA-seq LCG-UNAM 2022 (Spanish)
  • Recount3: PCR duplicates.
    • PCR duplication can refer to two different things. It can mean the process of making many copies of a specific DNA region using a technique called Polymerase Chain Reaction (PCR). PCR relies on a thermostable DNA polymerase and requires DNA primers designed specifically for the DNA region of interest.
    • On the other hand, PCR duplication can also refer to a problem that occurs when the same DNA fragment is amplified and sequenced multiple times, resulting in identical reads that can bias many types of high-throughput-sequencing experiments. These identical reads are called PCR duplicates and can be eliminated using various methods such as removing all but one read of identical sequences or using unique molecular identifiers (UMIs) to enable accurate counting and tracking of molecules.
    • UMI stands for Unique Molecular Identifier. It is a complex index added to sequencing libraries before any PCR amplification steps, enabling the accurate bioinformatic identification of PCR duplicates. UMIs are also known as Molecular Barcodes or Random Barcodes. UMIs are valuable tools for both quantitative sequencing applications and also for genomic variant detection, especially the detection of rare mutations. UMI sequence information in conjunction with alignment coordinates enables grouping of sequencing data into read families representing individual sample DNA or RNA fragments.
    • dedup - Deduplicate reads using UMI and mapping coordinates
    • UMIs can be extracted from a fastq file using awk. For example awk 'NR % 4 == 1 {split($0,a,":"); print a[6]}' input.fastq > umis.txt . Here we assume the read header is @SEQ_ID:LANE:TILE:X:Y:UMI, then the UMI sequence is in the 6th field, following the 5th colon.

dbGap

dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

eQTL

Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN

GenomicDataCommons package

Note:

  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has 4 categories data and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.

Visualization

GenVisR

ComplexHeatmap

Read counts

Read, fragment

  • Meaning of the "reads" keyword in terms of RNA-seq or next generation sequencing. A read refers to the sequence of a cluster that is obtained after the end of the sequencing process which is ultimately the sequence of a section of a unique fragment.
  • What is the difference between a Read and a Fragment in RNA-seq?. Diagram , Pair-end, single-end.
  • In the context of RNA-seq, "read" and "fragment" may refer to slightly different things, but they are related concepts.
    • A read is a short sequence of nucleotides that has been generated by a sequencing machine. These reads are typically around 100-150 bases long. RNA-seq experiments generate millions or billions of reads, and these reads are aligned to a reference genome or transcriptome to determine which reads came from which genes or transcripts, this information is used to quantify gene and transcript expression levels.
    • A fragment is a piece of RNA that has been broken up and converted into a read. In RNA-seq, the first step is to convert the RNA into a library of fragments. To do this, the RNA is typically broken up into smaller pieces using a process called fragmentation. Then, adapters are added to the ends of the fragments to allow them to be sequenced. The fragments are then converted into a library of reads that can be sequenced using a next-generation sequencing platform.
    • In summary, a read is a short sequence of nucleotides that has been generated by a sequencing machine, whereas a fragment is a piece of RNA that has been broken up and converted into a read. The process of fragmentation creates a library of fragments that are then converted into reads that can be sequenced.
  • Does one fragment contain 1 read or multiple reads?
    • One fragment in RNA-seq can contain multiple reads, depending on the sequencing technology and library preparation protocol used.
    • In the process of library preparation, RNA is first fragmented into smaller pieces, then adapters are ligated to the ends of the fragments. The fragments are then amplified using PCR, generating multiple copies of the original fragment. These amplified fragments are then sequenced using a next-generation sequencing platform, generating multiple reads per fragment.
    • For example, in Illumina sequencing, fragments are ligated with adapters, then they are clonally amplified using bridge amplification. This allows for the creation of clusters of identical copies of the original fragment on a sequencing flow cell. Then, each cluster is sequenced, generating a large number of reads per fragment.
    • In other technologies like PacBio or Nanopore, the sequencing of a fragment generates only one read, as the technology can read long stretches of DNA, therefore it doesn't need to fragment the RNA prior to sequencing.
    • In summary, one fragment in RNA-seq can contain multiple reads, depending on the sequencing technology and library preparation protocol used. The number of reads per fragment can vary from one to several thousands.
  • How many reads in a fragment on average in illumination sequencing?
    • The number of reads per fragment in Illumina sequencing can vary depending on the sequencing platform and library preparation protocol used, as well as the sequencing depth and the complexity of the sample. However, on average, one fragment can generate several hundred to several thousand reads in Illumina sequencing.
    • When sequencing is performed on the Illumina platform, the process of library preparation includes fragmenting the RNA into smaller pieces, ligating adapters to the ends of the fragments, and then amplifying the fragments using bridge amplification. This allows for the creation of clusters of identical copies of the original fragment on a sequencing flow cell. Then, each cluster is sequenced, generating multiple reads per fragment.
    • The number of reads per fragment can also be affected by the sequencing depth, which refers to the total number of reads generated by the sequencing machine. A higher sequencing depth will result in more reads per fragment, while a lower sequencing depth will result in fewer reads per fragment.
    • In summary, the number of reads per fragment in Illumina sequencing can vary, but on average, one fragment can generate several hundred to several thousand reads. The number of reads per fragment can be influenced by the sequencing platform, library preparation protocol, sequencing depth, and the complexity of the sample.

Rsubread

RSEM

  • RSEM, rsem-calculate-expression
  • RSEM on Biowulf
    $ mkdir SeqTestdata/RNASeqFibroblast/output
    $ sinteractive --cpus-per-task=2 --mem=10g
    $ module load rsem bowtie STAR
    $ rsem-calculate-expression -p 2 --paired-end --star \
    				../test.SRR493366_1.fastq ../test.SRR493366_2.fastq \
    				/fdb/rsem/ref_from_genome/hg19 Sample1 # 12 seconds
    				
    $ ls -lthog
    total 5.8M
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.genes.results
    -rw-r----- 1 2.5M Nov 24 13:39 Sample1.isoforms.results
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.transcript.bam
    drwxr-x--- 2 4.0K Nov 24 13:39 Sample1.stat
    
    $ wc -l Sample1.genes.results
    26335 Sample1.genes.results
    $ wc -l Sample1.isoforms.results
    51399 Sample1.isoforms.results
    
    $ head -2 Sample1.genes.results
    gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
    A1BG	NM_130786	1766.00	1589.99	0.00	0.00	0.00
    $ head -2 Sample1.isoforms.results
    transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
    NM_130786	A1BG	1766	1589.99	0.00	0.00	0.00	0.00
    $ head -1 /fdb/rsem/ref_from_genome/hg19.transcripts.fa
    >NM_130786
    $ grep NM_130786 /fdb/igenomes/Homo_sapiens/UCSC/hg19/transcriptInfo.tab
    NM_130786	2721192635	2721199328	2721188175	2	8	431185
  • RSEM gene level result file (see here for an example) contains 5 essential columns (and the element saved by tximport() function) excluding transcript_id
    • Effective length → length. This is different across samples. This is much shorter than Length (e.g. 105 vs 1).
    • Expected count → count. This is the sum of the posterior probability of each read comes from this transcript over all reads.
    • TPM → abundance. The sum of all transcripts' TPM is 1 million.
    • FPKM (not kept?). FPKM_i = 10^3 / l_bar * TPM_i for gene i. So for each sample FPKM is a scaling of TPM.
    R> dfpkm[1:5, 1:3] / txi.rsem$abundance[1:5, 1:3]
              144126_210-T_JKQFX5 144126_210-T_JKQFX6 144126_210-T_JKQFX8
    5S_rRNA             0.7563603            1.118008           0.8485292
    5_8S_rRNA                 NaN                 NaN                 NaN
    6M1-18                    NaN                 NaN                 NaN
    7M1-2                     NaN                 NaN                 NaN
    7SK                 0.7563751            1.118029           0.8485281
    
  • An example using tximport::tximport() and DESeq2::DESeqDataSetFromTximport. Note it directly uses round(expected_count) to get the integer-value counts. See the source of DESeqDataSetFromTximport() here. The tximport vignette has discussed two suggested ways of importing estimates for use with differential gene expression (DGE) methods in the section of "Downstream DGE in Bioconductor". The vignette does not say anything about "expected_count" from RSEM output.
    txi.rsem <- tximport(files, type = "rsem", txIn = F, txOut = F)
    txi.rsem$length[txi.rsem$length == 0] <- 1
    names(txi.rsem) # a list, 
                    # length = effective_length (matrix)
                    # counts = expected counts column (matrix), non-integer
                    # abundance = TPM (matrix)
                    # countsFromAbundance = "no"
    # [1] "abundance"           "counts"              "length"             
    # [4] "countsFromAbundance"
    
    sampleTable <- pheno[, c("EXPID", "PatientID")]
    rownames(sampleTable) <- colnames(txi.rsem$counts)
    
    dds <- DESeq2::DESeqDataSetFromTximport(txi.rsem, sampleTable, ~ PatientID)
    # using counts and average transcript lengths from tximport
    # 
    # The DESeqDataSet class enforces non-negative integer values in the "counts" 
    #     matrix stored as the first element in the assay list.
    dds@assays@data@listData$counts[1:5, 1:3] # integer values. How to compute?
                                   # https://support.bioconductor.org/p/9134840/
    dds@assays@data@listData$avgTxLength[1:5, 1:3] # effective_length
    
    plot(txi.rsem$counts[,1], dds@assays@data@listData$counts[,1])
    abline(0, 1, col = 'red')      # compare expected counts vs integer-value counts
                                   # a straight line
    
    ddsColl1 <- DESeq2::estimateSizeFactors(dds)
    # using 'avgTxLength' from assays(dds), correcting for library size
    # Question: how does the function correct for library size?
    
    ddsColl2 <- DESeq2::estimateDispersions(ddsColl1)
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # final dispersion estimates
    # Note: it seems estimateDispersions is not required 
    #       if we only want to get the normalized count (still need estimateSizeFactors())
    # See ArrayTools/R/FilterAndNormalize.R
    
    cnts2 <- DESeq2::counts(ddsColl2, normalized = FALSE)
    all(dds@assays@data@listData$counts == cnts2)
    # [1] TRUE
    
    all(round(txi.rsem$counts) == cnts2 )
    # [1] TRUE.     So in this case round(expected values) = integer-value counts
    
  • RSEM example on Odyssey
  • A Short Tutorial for RSEM
  • Hands-on Training in RNA-Seq Data Analysis* which includes Quantification using RSEM and Perform DE analysis. Note the expected count column was used in edgeR.
  • Understanding RSEM: raw read counts vs expected counts. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
  • RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Abundance estimates are given in terms of two measures. The output file (XXX_RNASeq.RSEM.genes.results) contains 7 columns: gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM.
    • Expected counts: The is an estimate of the number of fragments that are derived from a given isoform or gene. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR or DESeq.
    • TPM: This is the estimated fraction of transcripts made up by a given isoform or gene. The transcript fraction measure is preferred over the popular RPKM/FPKM measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species.
  • The length or effective_length are different (though similar) for different samples for the same gene
  • A scatter plot and correlation shows the expected_count and TPM are different
    x <- read.delim("144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    colnames(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM" 
    plot(x[, "TPM"], x[, "expected_count"])
    cor(x[, "TPM"], x[, "expected_count"])
    # [1] 0.4902708
    cor(x[, "TPM"], x[, "expected_count"], method = 'spearman')
    # [1] 0.9886384
    x[1:5, "length"]
    [1] 105.01 161.00 473.00  68.00 304.47
    
    x2 <- read.delim("144126_210-T_JKQFX6_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    x2[1:5, "length"]
    # [1] 105.00 161.00 473.00  68.00 305.27
    x[1:5, "effective_length"]
    # [1]   1.03  16.65 293.88   0.00 129.00
    x2[1:5, "effective_length"]
    # [1]   1.58  17.82 293.05   0.00 128.86
    
  • A benchmark for RNA-seq quantification pipelines. 2016 They compare the STAR, TopHat2, and Bowtie2 mapping methods and the Cufflinks, eXpress , Flux Capacitor, kallisto, RSEM, Sailfish, and Salmon quantification methods. RSEM slightly outperforming the rest.
  • Downsample reads from Evaluation of Cell Type Annotation R Packages on Single Cell RNA-seq Data 2020.

Expected_count

Number of reads mapping to that transcript

  • Understanding RSEM: raw read counts vs expected counts In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
    R> x <- read.delim("41samples/165739~295-R~AM1I30~RNASEQ.genes.results")
    R> summary(x$expected_count)     # Larger than TPM, contradict to the above statement
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          0       0      10    1346     556  533634
    R> summary(x$TPM)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        0.00     0.00     0.16    35.58     7.50 70091.93
    R> x[1:5, c("expected_count", "TPM")]
      expected_count      TPM
    1        6190.00 70091.93
    2           0.00     0.00
    3           0.00     0.00
    4           0.00     0.00
    5         795.01   171.67
    
  • Expected counts from RSEM in DESeq2? Yes, RSEM expected counts can be used with DESeq2. Don't use TPM or FPKM because they are already normalized. It seems there is no need to do rounding before calling the tximport() function.
    # adding
    txi$length[txi$length <= 0] <- 1
    # before
    dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
  • Alignment-based的转录本定量-RSEM/ the sum of the posterior probability of each read comes from this transcript over all reads

Examples

$ wc -l 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results
   28110 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results

$ head -n 4 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results | cut -f1,3,4,5,6,7

gene_id	  length  effective_length  expected_count TPM	   FPKM
5S_rRNA	  105.01  1.03	            1513.66	   31450.7 23788.06
5_8S_rRNA 161	  16.65	            0	           0	   0
6M1-18	  473	  293.88	    0	           0	   0

Second example

$ wc -l Sample_HS-578T_CB6CRANXX.genes.results
28110
$ head -1 Sample_HS-578T_CB6CRANXX.genes.results
gene_id	transcript_id.s.	length	effective_length	expected_count	TPM	FPKM
$ tail -4Sample_HS-578T_CB6CRANXX.genes.results
septin 9/TNRC6C fusion	uc010wto.1	41	0	0	0	0
svRNAa	uc022bxg.1	22	0	0	0	0
tRNA Pro	uc022bqx.1	65	0	0	0	0
unknown	uc002afm.3	1117	922.85	0	0	0

$ wc -l Sample_HS-578T_CB6CRANXX.isoforms.results
78376
$ head -1 Sample_HS-578T_CB6CRANXX.isoforms.results
transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
$ tail -4 Sample_HS-578T_CB6CRANXX.isoforms.results
uc010wto.1	septin 9/TNRC6C fusion	41	0	0	0	0	0
uc022bxg.1	svRNAa	22	0	0	0	0	0
uc022bqx.1	tRNA Pro	65	0	0	0	0	0
uc002afm.3	unknown	1117	922.85	0	0	0	0

limma

  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.

RSEM

Within-subject correlation

  • Does this RNAseq experiment require a repeated measures approach?
    • Solution 1: 9.7 Multi-level Experiments of LIMMA user guide. duplicateCorrelation(), lmFit(), makeContrasts(), contrasts.fit() and eBayes()
    • Solution 2: Section 3.5 "Comparisons both between and within subjects" in edgeR. model.matrix(), glmQLFit(), glmQLFTtest(), topTags().

Time Course Experiments

  • See Limma's vignette on Section 9.6
    • Few time points (ANOVA, contrast)
      • Which genes respond at either the 6 hour or 24 hour times in the wild-type?
      • Which genes respond (i.e., change over time) in the mutant?
      • Which genes respond differently over time in the mutant relative to the wild-type?
    • Many time points (regression such as cubic spline, moderated F-test)
      • Detect genes with different time trends for treatment vs control.

easyRNASeq

Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.

ShortRead

Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.

Rsamtools

The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.

IRanges

IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.

flank

The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.

Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.

Biostrings

GenomicRanges

GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.

GenomicAlignments

Counting reads with summarizeOverlaps vignette

library(GenomicAlignments)
library(DESeq)
library(edgeR)

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

assays(olap)$counts
#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).

samtools view -h -o outputFile.bam inputFile.sam

A modified R code that works is

###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
             "gtf/drosophila_melanogaster/",
             "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts

###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              pubMedIds="20921232",
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              replicate=c(1,1),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
library(DESeq)
geneCDS <- newCountDataSet(
                  countData=assay(genehits),
                  conditions=design)

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some fake examples.

tidybulk

Bisque

Bisque: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.

chromoMap

Inference

tximport

  • http://bioconductor.org/packages/release/bioc/html/tximport.html
    $ head -5 quant.sf 
    Name	Length	EffectiveLength	TPM	NumReads
    ENST00000456328.2	1657	1410.79	0.083908	2.46885
    ENST00000450305.2	632	410.165	0	0
    ENST00000488147.1	1351	1035.94	10.4174	225.073
    ENST00000619216.1	68	24	0	0
    ENST00000473358.1	712	453.766	0	0
    
  • Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
    $ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results 
    gene_id	transcript_id.s. length	effective_length  expected_count  TPM	  FPKM
    5S_rRNA	uc021ofx.1	 105.04	2.23	          2039.99	  49629.87 35353.97
    
    $ R
    x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    dim(x)
    # [1] 28109     7
    names(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"            
    x[1:3, -2]
    #     gene_id length effective_length expected_count      TPM     FPKM
    # 1   5S_rRNA 105.04             2.23        2039.99 49629.87 35353.97
    # 2 5_8S_rRNA 161.00            21.19           0.00     0.00     0.00
    # 3    6M1-18 473.00           302.74           0.00     0.00     0.00
    
    y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
    dim(y)
    # [1] 78375     8
    names(y)
    # [1] "transcript_id"    "gene_id"          "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"             "IsoPct" 
    y[1:3, -1]
    #   gene_id length effective_length expected_count TPM FPKM IsoPct
    # 1 5S_rRNA    110             3.06              0   0    0      0
    # 2 5S_rRNA    133             9.08              0   0    0      0
    # 3 5S_rRNA     92             0.00              0   0    0      0
    

File:RSEM PDX.png

DESeq2 or edgeR

Shrinkage estimators

  • The package uses a negative binomial statistical model to fit the count data, and can account for differences in sequencing depth across samples.
    • Shrinkage is a technique used to regularize the estimates of the parameters of the negative binomial model.
    • The idea behind shrinkage is to pull the estimated values of the parameters towards a prior distribution, which can help to reduce the variability of the estimates and improve the stability of the results.
    • The specific shrinkage method used in DESeq2 is called the "shrinkage prior for dispersion" method. This method involves adding a prior distribution on the dispersion parameter of the negative binomial model, which is used to control the degree of overdispersion in the data.
    • This prior distribution is designed to shrink the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
  • More details:
    • The negative binomial model is used to model the count data, y, as a function of the mean, [math]\displaystyle{ \mu }[/math], and the dispersion, [math]\displaystyle{ \alpha }[/math]. The probability mass function of the negative binomial is given by:
    [math]\displaystyle{ \begin{align} P(y | \mu, \alpha) = (y + \alpha - 1)! / (y! * (\alpha - 1)!) * (\mu / (\mu + \alpha))^y * (\alpha / (\mu + \alpha))^\alpha \end{align} }[/math]
    • The likelihood of the data is given by:
    [math]\displaystyle{ \begin{align} L(\mu, \alpha | y) = \prod_i [ P(y_i | \mu_i, \alpha) ] \end{align} }[/math]
    • The log-likelihood is :
    [math]\displaystyle{ \begin{align} logL(\mu, \alpha | y) = \sum_i [ \log(P(y_i | \mu_i, \alpha)) ] \end{align} }[/math]
    In this model, [math]\displaystyle{ \mu }[/math] is the mean of the negative binomial for each gene and it is modeled as a linear function of the design matrix.
    [math]\displaystyle{ \begin{align} \mu_i = exp(X_i \beta) \end{align} }[/math]
    [math]\displaystyle{ \alpha }[/math] is the dispersion parameter and it's the same for all the genes, following the common practice in RNA-seq analysis
    • The shrinkage prior is added on [math]\displaystyle{ \alpha }[/math], it assumes that [math]\displaystyle{ \alpha }[/math] is following a hyper-prior distribution like Gamma distribution
    [math]\displaystyle{ \begin{align} \alpha \sim \Gamma(a_0, b_0) \end{align} }[/math]
    This prior allows the shrinkage of [math]\displaystyle{ \alpha }[/math] estimates from the data towards a common value across all the genes, which can help to reduce the variance of the estimated log-fold changes.
    • The goal is to find the values of mu and alpha that maximize the log-likelihood, this is done by using maximum likelihood estimation (MLE) or Bayesian approach where the prior are considered and integrated in the calculation and the result is the posterior probability.
  • Dispersion parameter.
    • In the context of the negative binomial model used in DESeq2, the dispersion parameter, alpha, is a measure of the degree of overdispersion in the data. In other words, it represents the variability of the data around the mean. A value of alpha greater than 1 indicates that the data is more dispersed (more variable) than would be expected if the data were following a Poisson distribution, which is a common distribution used to model count data. The Poisson distribution has a single parameter, the mean, which represents both the location and the scale of the distribution. In contrast, the negative binomial distribution has two parameters, the mean and the dispersion, which allows for more flexibility in fitting the data.
    • The shrinkage method in DESeq2 involves shrinking the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the variance of the estimated log-fold changes.
  • What is the formula of the fold change estimator given by DESeq2?
    • The fold change estimator given by the DESeq2 package is calculated as the ratio of the estimated mean expression levels in two conditions, with the log2 of this ratio being the log2 fold change. The mean expression levels are calculated using a negative binomial model, which accounts for both the mean and the overdispersion of the data.
    • The estimated mean expression level for a gene i in condition j is given by:
    [math]\displaystyle{ \begin{align} \log(\mu_{i,j}) = \beta_j + X_i \gamma \end{align} }[/math]
    where [math]\displaystyle{ \beta_j }[/math] is the overall mean for condition [math]\displaystyle{ j }[/math], [math]\displaystyle{ X_i }[/math] is the design matrix for gene [math]\displaystyle{ i }[/math] and [math]\displaystyle{ \gamma_i }[/math] is the gene-specific effect.
    • The log2 fold change is calculated as:
    [math]\displaystyle{ \begin{align} \log2(\mu_{i,j} / \mu_{i,k}) \end{align} }[/math]
    where j and k are the two conditions being compared.
    • So, you can see that the fold change estimator depends on the design matrix [math]\displaystyle{ X }[/math] and the parameters of the model, [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \gamma }[/math]. The DESeq2 implementation also includes an estimation of the variance-covariance matrix of the parameters to compute the standard deviation (uncertainty) of these parameters, and therefore the standard deviation on the fold-change estimator. This can help to estimate the significance of the fold change between conditions.
    • Hypothesis testing H0: log(FC)=0 using Wald test
  • What is the variance of the estimated log-fold changes before and after applying DESeq2 method?
    • In RNA-seq data analysis, the log-fold change is a measure of the relative difference in expression between two conditions. The log-fold change is calculated as the log2 of the ratio of the mean expression in one condition to the mean expression in another condition.
    • Without using any methods like DESeq2, the variance of the estimated log-fold changes can be high, particularly for genes with low expression levels, which can lead to unreliable results. This high variance is due to the over-dispersion present in RNA-seq data, which results in a large variability in the estimated expression levels even for genes with similar means.
    • When using the DESeq2 package, the shrinkage method is applied on dispersion parameter alpha, which helps to reduce the variance of the estimated log-fold changes. By applying a prior on alpha and by shrinking the estimates of alpha towards a common value across all the genes, the method reduces the variability of the estimates. This results in more stable and reliable estimates of the log-fold changes, which can improve the accuracy and robustness of the results of the differential expression analysis.
    • Additionally, the DESeq2 package also accounts for differences in sequencing depth across samples, which can also help to reduce the variability of the estimated log-fold changes.
  • how DESeq2 package accounts for differences in sequencing depth across samples
    • The DESeq2 package accounts for differences in sequencing depth across samples by using the raw count data to estimate the normalized expression levels for each gene in each sample. This normalization process is necessary because sequencing depth can vary widely between samples, leading to differences in the overall number of reads and the apparent expression levels of the genes.
    • The package uses a method called regularized-logarithm (rlog) transformation to normalize the data, which is a variance stabilization method that is based on the logarithm of the counts, but also adjusts for the total library size and the mean expression level.
    • The method starts by computing a weighted mean of the counts across all samples, which is used as a reference. Next, for each sample, the counts are divided by the library size and then multiplied by the weighted mean of the counts. This scaling step corrects for differences in sequencing depth by making the library sizes comparable between samples.
    • Then, the regularized logarithm (rlog) transformation is applied to the scaled counts, which is given by :
    [math]\displaystyle{ \begin{align} vst = \log(counts/sizeFactor + c) \end{align} }[/math]
    where c is a small positive constant added to the counts to stabilize the variance, size_factor is the ratio of library size for each sample over the weighted mean of the library size.
    • The rlog transformation can stabilize the variance of the data and make the mean expression levels more comparable between samples. This transformed data can then be used for downstream analysis like calculating the fold changes.
    • In addition to rlog transformation the DESeq2 package uses a negative binomial distribution to model the count data, this distribution helps to account for over-dispersion in the data, and shrinkage method on the dispersion parameter is applied as well to improve the stability of results. All of these techniques work together to help correct for sequencing depth differences across samples, which can improve the accuracy of the estimated fold changes and provide more robust results in differential gene expression analysis.
  • type='apeglm' shrinkage only for use with 'coef'

Time course experiment

  • Time course trend analysis from the edgeR's vignette. glmQLFTest()
    • Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an ANOVA F-test for a normal linear model.
    • Assuming gene expression changes smoothly over time, we can use a polynomial or a cubic spline curve with a certain number of degrees of freedom to model gene expression along time.
    • We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The topTags function lists the top set of genes with most significant time effects.
    • The total number of genes with significant (5% FDR) changes at different time points can be examined with decideTests.
  • RNA-seq data collected at different time points. Identify differentially expressed genes associated with seasonal changes

DESeq2 experimental design and interpretation

DESeq2 experimental design and interpretation

Controlling for batch differences

The variable we are interested in ("condition") is placed after the batch variable.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition)
dds <- DESeq(dds)

OR

dds <- DESeq(dds, test="LRT", reduced=~batch)
res <- results(dds)

DESeq2 diagnostic plot, MA plot

vst over rlog transformation

Expected counts
     | round()
     v                              /-- vst transformation  ---\
Raw counts --> normalized counts  --                            -- Other analyses such as PCA, Hclust (sample distances).
                                    \-- rlog transformation ---/

Simulate negative binomial distribution data

  • rnegbin() in sim.counts() from the ssizeRNA package
    rnegbin(10000 * 10, lambda, 1 / disp) 
    # 10000 genes, 20 samples
    # lambda: mean counts from control group, a matrix. 
    # disp: dispersion parameter, a matrix.
    

Reducing false positives in differential analyses of large RNA sequencing data sets

edgeR vs DESeq2 vs limma

  • edgeR
    library(edgeR)
    
    # create DGEList object from count data
    counts <- matrix(c(20,30,25,50,45,55,15,20,10,5,10,8,100,120,110,80,90,95), nrow=3, ncol=6, byrow=TRUE)
    rownames(counts) <- c("G1", "G2", "G3")
    colnames(counts) <- c("A1", "A2", "A3", "B1", "B2", "B3")
    counts
    #     A1  A2  A3 B1 B2 B3
    # G1  20  30  25 50 45 55
    # G2  15  20  10  5 10  8
    # G3 100 120 110 80 90 95
    d <- DGEList(counts)
    
    # perform normalization and differential expression analysis
    d <- calcNormFactors(d)
    design <- model.matrix(~0+factor(c(rep("A",3), rep("B",3))))
    d <- estimateDisp(d, design)
    fit <- glmQLFit(d, design)
    res <- glmQLFTest(fit, contrast=c(-1, 1))
    
    # summarize the results and identify significant genes
    summary(res)
    res2 <- topTags(res); res2
    # Coefficient:  -1*factor(c(rep("A", 3), rep("B", 3)))A 1*factor(c(rep("A", 3), rep("B", 3)))B 
    #         logFC   logCPM          F       PValue          FDR
    # G1  1.1683093 17.98614 146.579840 4.380158e-08 1.314048e-07
    # G2 -0.7865080 16.41917   3.056504 1.059256e-01 1.588884e-01
    # G3 -0.1437956 19.34279  40.852893 2.256279e-01 2.256279e-01
    de_genes <- rownames(res2)[which(res2$FDR < 0.05 & abs(res2$log2FoldChange) > 1)]
  • DESeq2. The count data above will result in an error. The error can occur when there is very little variability in the count data, which can happen if the biological samples are very homogeneous or if the sequencing depth is very low. In such cases, it may be difficult to reliably identify differentially expressed genes using DESeq2.
    library(DESeq2)
    
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(3, 3))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = col_data, design = ~ condition)
    
    # differential expression analysis
    dds <- DESeq(dds)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) : 
    #   all gene-wise dispersion estimates are within 2 orders of magnitude
    #   from the minimum value, and so the standard curve fitting techniques will not work.
    #   One can instead use the gene-wise estimates as final estimates:
    #   dds <- estimateDispersionsGeneEst(dds)
    #   dispersions(dds) <- mcols(dds)$dispGeneEst
    #   ...then continue with testing using nbinomWaldTest or nbinomLRT
    

    Try another data.

    count_data <- matrix(c(100, 500, 200, 1000, 300,
                           200, 400, 150, 500, 300,
                           300, 300, 100, 1500, 300,
                           400, 200, 50, 2000, 300), nrow = 5, byrow = TRUE)
    
    colnames(count_data) <- paste0("sample", 1:4)
    rownames(count_data) <- paste0("gene", 1:5)
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(2, 2))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # -- note: fitType='parametric', but the dispersion trend was not well captured by the
    #    function: y = a/x + b, and a local regression fit was automatically substituted.
    #    specify fitType='local' or 'mean' to avoid this message next time.
    # final dispersion estimates
    # fitting model and testing
    # Warning message:
    # In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
    #   Estimated rdf < 1.0; not estimating variance
    
    # differential expression analysis
    dds <- DESeq(dds)
    
    # extract results
    res <- results(dds)
    res
    # log2 fold change (MLE): condition untreated vs treated 
    # Wald test p-value: condition untreated vs treated 
    # DataFrame with 5 rows and 6 columns
    #        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
    #       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
    # gene1   465.558       0.707313  1.243608  0.568758 0.5695201  0.711900
    # gene2   309.591      -0.119330  0.885551 -0.134752 0.8928081  0.892808
    # gene3   413.959      -0.742921  0.513376 -1.447130 0.1478604  0.369651
    # gene4   638.860      -1.372724  1.428977 -0.960634 0.3367360  0.561227
    # gene5   721.470       2.928705  1.536174  1.906493 0.0565863  0.282931
    
    # extract DE genes with adjusted p-value < 0.05 and |log2 fold change| > 1
    DESeq2_DE_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
    
    # print the number of DE genes identified by DESeq2
    cat("DESeq2 identified", nrow(DESeq2_DE_genes), "DE genes.\n")
  • limma-voom. voom is a function in the limma package that modifies RNA-Seq data for use with limma. Differential Expression with Limma-Voom.
    library(limma)
    
    # Create a design matrix with the sample groups
    design_matrix <- model.matrix(~ condition, data = col_data)
    
    # filter out low-expressed genes
    if (FALSE) {
      keep <- rowSums(counts) >= 10
      counts <- counts[keep,]
    }
    
    # normalization using voom
    v <- voom(count_data, design_matrix)
    
    # linear model fitting
    fit <- lmFit(v, design_matrix)
    
    # Calculate the empirical Bayes statistics
    fit <- eBayes(fit)
    
    top.table <- topTable(fit, sort.by = "P", n = Inf)
    top.table
    #            logFC  AveExpr          t    P.Value adj.P.Val         B
    # gene3 -2.3242262 15.67787 -1.8229203 0.08330054 0.3153894 -4.249298
    # gene5  2.0865122 16.34012  1.5960616 0.12615577 0.3153894 -4.376405
    # gene1  0.5610761 16.69722  0.5079444 0.61704872 0.7551329 -4.834183
    # gene2  0.2477959 18.69843  0.3829295 0.70581164 0.7551329 -5.150958
    # gene4  0.2869870 17.11809  0.3161922 0.75513288 0.7551329 -4.963932
    
    # Perform hypothesis testing to identify DE genes
    results <- decideTests(fit)
    
    summary(results)
    #        (Intercept) conditionuntreated
    # Down             0                  0
    # NotSig           0                  5
    # Up               5                  0
    
    # Extract the DE genes
    de_genes <- rownames(count_data)[which(results$all != 0)]

DESeq2 vs edgeR

D vs E?

  • One major difference is in the method used to estimate the dispersion parameter. DESeq2 uses a local regression method, whereas edgeR uses a Cox-Reid profile-adjusted likelihood method. The local regression method estimates the dispersion parameter for each gene independently, whereas the profile-adjusted likelihood method estimates a common dispersion parameter for all genes, with gene-specific scaling factors that depend on the mean expression levels.
  • Another difference is in the approach to normalization. DESeq2 uses a variance-stabilizing transformation to account for differences in library size and composition, whereas edgeR uses a trimmed mean of M-values (TMM) normalization method, which adjusts for library size differences by scaling the counts of each sample to a common effective library size.
  • DESeq2 also uses a different statistical model for differential expression analysis. DESeq2 models the count data as a negative binomial distribution, but includes additional terms to account for batch effects and other sources of variation. It uses a shrinkage estimator to improve the estimation of fold changes and reduce false positives. EdgeR, on the other hand, uses a similar negative binomial model but applies an empirical Bayes method to estimate gene-specific dispersions and to borrow information across genes to improve the power of detection and reduce false positives.

When should I choose DESeq2 and when should I choose edgeR?

  • The choice between DESeq2 and edgeR for differential gene expression analysis depends on several factors, including the experimental design, sample size, and the nature of the biological question being investigated. Here are some general guidelines to help you choose between these two algorithms:
  • Choose DESeq2 when:
    • The experimental design includes multiple batches or covariates that may affect the gene expression levels
    • The sample size is small, typically fewer than 12 samples per group
    • The gene expression levels are highly variable across replicates, and the goal is to identify differentially expressed genes with a low false discovery rate (FDR)
    • The focus is on the fold change rather than the statistical significance of differential expression
  • Choose edgeR when:
    • The experimental design includes several factors, such as treatment, time, and biological replicate, and the goal is to identify the main effects and interaction effects of these factors on gene expression
    • The sample size is moderate to large, typically more than 12 samples per group
    • The gene expression levels are less variable across replicates, and the goal is to achieve high statistical power to detect differentially expressed genes
    • The focus is on both the fold change and the statistical significance of differential expression, and the researcher is interested in performing downstream analyses such as gene set enrichment analysis or pathway analysis.

DESeq2 in Python

PyDESeq2, nice work

Generalized Linear Models and Plots with edgeR

Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis

EBSeq

An R package for gene and isoform differential expression analysis of RNA-seq data

http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/

prebs

Probe region expression estimation for RNA-seq data for improved microarray comparability

DEXSeq

Inference of differential exon usage in RNA-Seq

rSeqNP

A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

http://www.biosoft.hacettepe.edu.tr/voomDDA/

Pathway analysis

About the KEGG pathways

  • MSigDB database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
  • msigdb from Bioconductor
  • KEGGREST package directly pull the data from kegg.jp. I can get 337 KEGG pathways for human ('hsa')
    BiocManager::install("KEGGREST")
    library(KEGGREST)
    res <- keggList("pathway", "hsa") 
    length(res) # 337
    

GSOAP

GSOAP: a tool for visualization of gene set over-representation analysis

clusterProfiler

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis

hypeR

GSEPD

GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display

SeqGSEA

http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html

BAGSE

BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020

GeneSetCluster

GeneSetCluster: a tool for summarizing and integrating gene-set analysis results

Pipeline

SPEAQeasy

SPEAQeasy – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses

Nextflow

GeneTEFlow

GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data

pipeComp

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

SARTools

http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/

SEQprocess

SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

GEMmaker

GEMmaker, Paper

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).

BitSeq

Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.

ReportingTools

The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.

Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.

It is suggested by e.g. EnrichmentBrowser.

sequences

More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.

QuasR

Bioinformatics paper

CRAN/Bioconductor packages

ssizeRNA

RNASeqPower

RNASeqPower Sample size for RNAseq studies

RnaSeqSampleSize

Shiny app

rbamtools

Provides an interface to functions of the 'SAMtools' C-Library by Heng Li

refGenome

The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.

WhopGenome

Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.

TCGA2STAT

Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.

source("https://bioconductor.org/biocLite.R")
biocLite("CNTools")

install.packages("TCGA2STAT")

The getTCGA() function allows to download various kind of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")

TCGAbiolinks

  • An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    query <- GDCquery(project = "TCGA-ACC",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               legacy = TRUE)
    
    gdcdir <- file.path("Waldron_PublicData", "GDCdata")
    GDCdownload(query, method = "api", files.per.chunk = 10,
                directory = gdcdir)  # 79 files
    ACCse <- GDCprepare(query, directory = gdcdir)
    ACCse
    class(ACCse)
    dim(assay(ACCse))  # 19947 x 79
    assay(ACCse)[1:3, 1:2] # symbol id
    length(unique(rownames(assay(ACCse))))   #  19672
    rowData(ACCse)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #          gene_id entrezgene ensembl_gene_id
    #      <character>  <integer>     <character>
    # A1BG        A1BG          1 ENSG00000121410
    # A2M          A2M          2 ENSG00000175899
    
  • HTSeq counts data example. DeMixT. Error when running GDC_prepare.
    query2 <- GDCquery(project = "TCGA-ACC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type="HTSeq - Counts") # or "STAR - Counts"
    gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
    GDCdownload(query2, method = "api", files.per.chunk = 10,
                directory = gdcdir2)  # 79 files
    ACCse2 <- GDCprepare(query2, directory = gdcdir2)
    ACCse2
    dim(assay(ACCse2))  # 56457 x 79
    assay(ACCse2)[1:3, 1:2]  # ensembl id
    rowData(ACCse2)[1:2, ]
    DataFrame with 2 rows and 3 columns
                    ensembl_gene_id external_gene_name original_ensembl_gene_id
                        <character>        <character>              <character>
    ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
    ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
    
  • Clinical data
    acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
    dim(acc_clin)
    # [1] 92 71
    
  • TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
    dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
    dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut =  0.25)
    samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
    samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
    dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                          dataFilt[,samplesTP],"Normal", "Tumor")
    # 2nd example
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                               Cond1type = "LGG", Cond2type = "GBM",
                               fdr.cut = 0.01,  logFC.cut = 1,
                               method = "glmLRT")
    
  • Enrichment analysis
    ansEA <– TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", 
                                    RegulonList = rownames(dataDEGs))
    
    TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                            GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                            GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                            nRGTab = rownames(dataDEGs),
                            nBar = 20)
    
  • mRNA Analysis Pipeline from GDC documentation.
  • RangedSummarizedExperiment class
    • assay(()
    • colData()
    • rowData()
    • assayNames()
    • metadata()
    • > dim(colData(ACCse))
      [1] 79 72
      > dim(rowData(ACCse))
      [1] 19947     3
      > dim(assay(ACCse))
      [1] 19947    79
      > assayNames(ACCse)
      [1] "normalized_count"
      > assayNames(ACCse2)
      [1] "HTSeq - Counts"
      > metadata(ACCse)
      $data_release
      [1] "Data Release 25.0 - July 22, 2020"
      
  • TCGAbiolinks to DESEq2. My verified version (R 4.3.2 & Bioc ‘3.17’) available on Github.

curatedTCGAData

  • Public data resources and Bioconductor from Bioc2020
    library(curatedTCGAData)
    library(MultiAssayExperiment)
    curatedTCGAData(diseaseCode = "*", assays = "*")
    curatedTCGAData(diseaseCode = "ACC")
    
    ACCmae <- curatedTCGAData("ACC", c("RPPAArray", "RNASeq2GeneNorm"), 
                              dry.run=FALSE)
    ACCmae
    dim(colData(ACCmae)) # 79 (samples) x 822 (features)
    
    head(metadata(colData(ACCmae))[["subtypes"]])
    
  • Caveats for working with TCGA data
    • Not all TCGA samples are cancer, there are a mix of samples in each of the 33 cancer types.
    • Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data.
    • There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates.
    • Primary tumors should be selected using TCGAutils::TCGAsampleSelect and used as input to the subsetting mechanisms.

caOmicsV

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA ws used

Visualize multi-dimentional cancer genomics data including of patient information, gene expressions, DNA methylations, DNA copy number variations, and SNP/mutations in matrix layout or network layout.

Map2NCBI

The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).

> library(Map2NCBI)
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
  # choose [2], [n], and [1] to filter the build and feature information.
  # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
> str(GeneList)
'data.frame':	52157 obs. of  15 variables:
 $ tax_id       : chr  "9606" "9606" "9606" "9606" ...
 $ chromosome   : chr  "1" "1" "1" "1" ...
 $ chr_start    : num  11874 14362 17369 30366 34611 ...
 $ chr_stop     : num  14409 29370 17436 30503 36081 ...
 $ chr_orient   : chr  "+" "-" "-" "+" ...
 $ contig       : chr  "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
 $ ctg_start    : num  1874 4362 7369 20366 24611 ...
 $ ctg_stop     : num  4409 19370 7436 20503 26081 ...
 $ ctg_orient   : chr  "+" "-" "-" "+" ...
 $ feature_name : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
 $ feature_id   : chr  "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
 $ feature_type : chr  "GENE" "GENE" "GENE" "GENE" ...
 $ group_label  : chr  "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
 $ transcript   : chr  "Assembly" "Assembly" "Assembly" "Assembly" ...
 $ evidence_code: chr  "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]

TCseq: Time course sequencing data analysis

http://bioconductor.org/packages/devel/bioc/html/TCseq.html

UCSC Xena

RTCGA

https://www.bioconductor.org/packages/release/bioc/html/RTCGA.html

genefu

Computation of Gene Expression-Based Signatures in Breast Cancer

GEO

See the internal link at R-GEO.

GREIN: An interactive web platform for re-analyzing GEO RNA-seq data

GEO2RNAseq

GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

Network-based

Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma 2022

Proteomics

OlinkAnalyze

OlinkRPackage

Mass spectrometry (MS)-based proteomics

$ head -5 data_phosphoprotein_quantification.txt' | cut -f 1-5
ENTITY_STABLE_ID	GENE_SYMBOL	PHOSPHOSITE	01CO005	01CO006
AAAS_pS495	AAAS	pS495	NA	-0.365
AAAS_pS525	AAAS	pS525	NA	NA
AAAS_pS541	AAAS	pS541	-0.24	NA
AAED1_pS12	AAED1	pS12	-0.46	-0.424

# R
> summary(x[, 4], na.rm = T)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -5.776  -1.095  -0.608  -0.690  -0.213   2.383   20378 
> summary(as.vector(as.matrix(x[, 4:5])), na.rm = T)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -5.78   -0.81   -0.36   -0.43    0.00    3.13   36304 

Cbioportal cptac.png

Metabolomics Analysis

Enzyme

  • One gene, one enzyme. The one gene, one enzyme hypothesis is the idea that each gene encodes a single enzyme. Today, we know that this idea is generally (but not exactly) correct.
library(KEGGREST)

# Step 1: Retrieve KO entries associated with the enzyme
enzyme_id <- "ec:2.7.1.1"
ko_entries <- keggLink("ko", enzyme_id)

# Step 2: Retrieve genes associated with the KO entries
ko_ids <- sub("ko:", "", ko_entries)  # Extract KO IDs
gene_links <- lapply(ko_ids, function(ko) keggLink("genes", paste0("ko:", ko)))

# Combine results into a data frame
genes_df <- do.call(rbind, lapply(names(gene_links), function(ko) {
  data.frame(KO = ko, Gene = gene_links[[ko]], stringsAsFactors = FALSE)
}))

# View the first few rows of the genes data frame
head(genes_df)
  • Enzymes are the proteins that are encoded by genes.
  • Gene → Transcription → mRNA → Translation → Enzyme.
    • A gene is a segment of DNA that contains the instructions for making a protein. It's like a blueprint or a recipe for creating a specific protein.
    • Transcription: When a gene is "turned on" or expressed, the DNA sequence is transcribed into a complementary RNA molecule called messenger RNA (mRNA). This process is like copying the recipe from the DNA blueprint.
    • mRNA: The mRNA molecule carries the genetic information from the DNA to the ribosome, which is the site of protein synthesis.
    • Translation: The mRNA sequence is translated into a sequence of amino acids, which are the building blocks of proteins. This process is like following the recipe to assemble the ingredients into a final product.
    • Enzyme: The resulting protein, in this case, is an enzyme. Enzymes are biological molecules that catalyze specific chemical reactions, allowing them to speed up or facilitate various biological processes. An enzyme is a type of protein, but not all proteins are enzymes.

Protein-protein interaction/PPI

Drug-Drug Interactions

Understanding Drug-Drug Interactions Using R Shiny

Journals

Biometrical Journal

Biostatistics

Bioinformatics

Genome Analysis section

BMC Bioinformatics

BioRxiv

PLOS

MDPI

https://en.wikipedia.org/wiki/MDPI, All MDPI, Frontiers & Hindawi journals planned to be erased (level 0) from Finnish academic assessment by end of 2024.

Software

BRB-SeqTools

https://brb.nci.nih.gov/seqtools/

WebMeV

GeneSpring

RNA-Seq

CCBR Exome Pipeliner

https://ccbr.github.io/Pipeliner/

Tibanna

Tibanna helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.

MOFA: Multi-Omics Factor Analysis

WGCNA

Benchmarking

Essential guidelines for computational method benchmarking

Simulation

Simulate RNA-Seq

Maq

Used by TopHat: discovering splice junctions with RNA-Seq

BEERS/Grant G.R. 2011

http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called BEERS and it was used in the STAR software paper.

For the command line options of <reads_simulator.pl> and more details about the config files that are needed/prepared by BEERS, see this gist.

This can generate paired end data but they are in one FASTA file.

$ sudo apt-get install cpanminus
$ sudo cpanm Math::Random
$ wget http://cbil.upenn.edu/BEERS/beers.tar
$
$ tar -xvf beers.tar      # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
$
$ cd ~/Downloads/
$ mkdir beers_output  
$ mkdir beers_simulator_refseq && cd "$_"
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
$ tar xzvf simulator_config_refseq.tar.gz
$ ls -lth 
total 1.4G
-rw-r--r-- 1 brb brb  44M Sep 16  2010 simulator_config_featurequantifications_refseq
-rw-r--r-- 1 brb brb 7.7M Sep 15  2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15  2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15  2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
   -configstem refseq \
   -customcfgdir ~/Downloads/beers_simulator_refseq \
   -outdir ~/Downloads/beers_output

$ ls -lh beers_output
total 3.9M
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
-rw-r--r-- 1 brb brb  31K Mar 16 15:25 simulated_reads_testbeers.cig
-rw-r--r-- 1 brb brb  22K Mar 16 15:25 simulated_reads_testbeers.fa
-rw-r--r-- 1 brb brb  584 Mar 16 15:25 simulated_reads_testbeers.log

$ wc -l simulated_reads2genes_testbeers.txt
102 simulated_reads2genes_testbeers.txt
$ head -4 simulated_reads2genes_testbeers.txt
seq.1	GENE.5600
seq.2	GENE.35506
seq.3	GENE.506
seq.4	GENE.34922
$ tail -4 simulated_reads2genes_testbeers.txt
seq.97	GENE.4197
seq.98	GENE.8763
seq.99	GENE.19573
seq.100	GENE.18830
$ wc -l simulated_reads_indels_testbeers.txt
36131 simulated_reads_indels_testbeers.txt
$ head -2 simulated_reads_indels_testbeers.txt
chr1:6052304-6052531	25	1	G
chr2:73899436-73899622	141	3	ATA
$ tail -2 simulated_reads_indels_testbeers.txt
chr4:68619532-68621804	1298	-2	AA
chr21:32554738-32554962	174	1	T
$ wc -l simulated_reads_substitutions_testbeers.txt 
71678  simulated_reads_substitutions_testbeers.txt
$ head -2 simulated_reads_substitutions_testbeers.txt 
chr22:50902963-50903167	50903077	G->A
chr1:6052304-6052531	6052330	G->C
$ wc -l simulated_reads_junctions-crossed_testbeers.txt 
49   simulated_reads_junctions-crossed_testbeers.txt
$ head -2 simulated_reads_junctions-crossed_testbeers.txt 
seq.1a	chrX:49084601-49084713
seq.1b	chrX:49084909-49086682

$ cat beers_output/simulated_reads_testbeers.log
Simulator run: 'testbeers'
started: Thu Mar 16 15:25:39 EDT 2017
num reads: 100
readlength: 100
substitution frequency: 0.001
indel frequency: 0.0005
base error: 0.005
low quality tail length: 10
percent of tails that are low quality: 0
quality of low qulaity tails: 0.8
percent of alt splice forms: 0.2
number of alt splice forms per gene: 2
stem: refseq
sum of gene counts: 3,886,863,063
sum of intron counts = 1,304,815,198
sum of intron counts = 2,365,472,596
intron frequency: 0.355507598105262
padded intron frequency: 0.52453796437909
finished at Thu Mar 16 15:25:58 EDT 2017

$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ head simulated_reads_testbeers.fa
>seq.1a
CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
>seq.1b
GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
>seq.2a
GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
>seq.2b
ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
>seq.3a
CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC

# Take a look at the true coordinates
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
chrX	49084529	49084601	+
chrX	49084713	49084739	+
chrX	49084863	49084909	+
chrX	49086682	49086734	+
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
seq.1a	chrX	49084529	73M111N27M	49084529-49084601, 49084713-49084739	+	CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
seq.1b	chrX	49084863	47M1772N53M	49084863-49084909, 49086682-49086734	-	GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
seq.2a	chr1	183516256	100M	183516256-183516355	-	GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
seq.2b	chr1	183515275	100M	183515275-183515374	+	ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ wc -l simulated_reads_testbeers.bed
247 simulated_reads_testbeers.bed
$ wc -l simulated_reads_testbeers.cig
200 simulated_reads_testbeers.cig

Flux Sammeth 2010

SimNGS

SimSeq

Bioinformatics

A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.

empiricalFDR.DESeq2

http://biorxiv.org/content/early/2014/12/05/012211

The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.

Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.

polyester

http://biorxiv.org/content/early/2014/12/05/012211

Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.

Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.

Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.

Too many dependencies. Got an error in installation.. It seems it has not considered splice junctions.

seqgendiff

Data-based RNA-seq simulations by binomial thinning

Simulate DNA-Seq

wgsim

https://github.com/lh3/wgsim

dwgssim

NEAT

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html

$ head simDNA_100bp_16del.fasta
>Pt-0-100
TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT
>Pt-10-110
GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC
>Pt-20-120
GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA
>Pt-30-130
TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA
>Pt-40-140
ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT

Simulate Whole genome

Simulate whole exome

SCSIM

SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data

Variant simulator

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

Mutation-Simulator

Mutation-Simulator

SigProfilerSimulator

Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator

Convert FASTA to FASTQ

It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.

If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.

$ cat test.fasta
>Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
>Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
>Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
>Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
>Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Alternatively we can use just one line of code by awk

$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
   test.fasta > test.fq
$ cat test.fq
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

Change the 'H' to the quality score value that you need (Depending what phred score scale you are using).

Simulate genetic data

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

PDX/Xenograft

#!/bin/bash
module load gossamer
xenome index -M 24 -T 16 -P idx \
  -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
  -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

RNA-Seq

DNA-Seq

MAF (TCGA, GDC)

DNA Seq Data

NIH

  • Go to SRA/Sequence Read Archiveand type the keywords 'Whole Genome Sequencing human'. An example of the procedures to search whole genome sequencing data from human samples:
    1. Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
    2. The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
    3. The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
    4. I pick the first one Accession:SRP066837 from this page. The page shows the Study type is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
    5. (Important trick) Click the number next to Run. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
    6. Download the raw data from any one of them (eg SRR2968056). For whole genome, the Strategy is WGS. For whole exome, the Strategy is called WXS.
  • Search the keywords 'nonsynonymous' and 'human' in PMC

Use SRAToolKit instead of wget to download

Don't use the wget command since it requires the specification of right http address.

Downloading SRA data using command line utilities

SRA2R - a package to import SRA data directly into R.

(Method 1) Use the fastq-dump command. For example, the following command (modified from the document will download the first 5 reads and save it to a file called <SRR390728.fastq> (NOT sra format) in the current directory.

/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
# OR 
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar

This will download the files in FASTQ format.

(Method 2) If we need to downloading by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra

It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.

(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list

brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz
aspera-connect-3.6.2.117442-linux-64.sh
brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh

Installing Aspera Connect

Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
Restart firefox manually to load the Aspera Connect plug-in

Install complete.

brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
  -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
  [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
SRR590795.sra                                                                           100%  239MB  535Mb/s    00:06
Completed: 245535K bytes transferred in 7 seconds
 (272848K bits/sec), in 1 file.
brb@T3600 ~/Downloads $

Aspera is typically 10 times faster than FTP according to the website. For this case, wget takes 12s while ascp uses 7s.

Note that the URL on the website's is wrong. I got the correct URL from emailing to ncbi help. Google: ascp "[email protected]"

SRAdb package

https://bioconductor.org/packages/release/bioc/html/SRAdb.html

First we install some required package for XML and RCurl.

sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev

and then

source("https://bioconductor.org/biocLite.R")
biocLite("SRAdb")

SRA

The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud

Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers

SRA Explorer

SRP056969

SRP066363 - lung cancer

SRP015769 or SRP062882 - prostate cancer

SRP053134 - breast cancer

Look at the MBases value column. It determines the coverage for each run.

SRP050992 single cell RNA-Seq

Used in Design and computational analysis of single-cell RNA-sequencing experiments

Single cell RNA-Seq

SRP040626 or SRP040540 - Colon and rectal cancer

OmicIDX

OmicIDX on BigQuery

Tutorials

See the BWA section.

Whole Exome Seq

Whole Genome Seq

SraRunTable.txt

  1. http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
  2. http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click SRP004077
  3. http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click Runs from the RHS
  4. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click RunInfoTable

Note that (For this study, it has 2377 rows)

  • Column A (AssemblyName_s) eg GRCh37
  • Column I (library_name_s) eg
  • column N (header=Run_s) shows all SRR or ERR accession numbers.
  • Column P (Sample_Name)
  • Column Y (header=Assay_Type_s) shows WGS.
  • Column AB (LibraryLayout_s): PAIRED

Public Data

Ten Resources for easy access public genomic data 6/7/2023. UCSCXenaTools (TCGA, ICGC, GDC), PharmacoGx, rDGidb, OmicsDI, AnnotationHub, TCGAbiolinks, GenomicDataCommons, cbioportal.

Public data resources and Bioconductor from Bioc2020.

Package name Object class Downloads (Distinct IPs, Jul 2020)
GEOquery SummarizedExperiment 5754
GenomicDataCommons GDCQuery 1154
TCGAbiolinks
curatedTCGAData
RangedSummarizedExperiment
MultiAssayExperimentObjects
2752
275
recount RangedSummarizedExperiment 418
curatedMetagenomicData ExperimentHub 224

Caution

Public RNA-seq data are not representative of global human diversity 2024

ISB Cancer Genomics Cloud (ISB-CGC)

https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis

The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.

ISB-CGC Web Application

CCLE, DepMap

  • Next-generation characterization of the Cancer Cell Line Encyclopedia 2019
  • It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).
  • Assembling Clinical Information for the CCLE Data
  • Data download Depmap
  • Dependency Map (DepMap) portal is to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies analytical and visualization tools CCLE
    • sample_info.csv also available from the download page
    • CCLE_RNAseq_reads.csv: read counts from RSEM. 1406 x (54358 - 1). Use readr::read_csv(). range(x[, -1]) = 0 13018000. Note: log2(13018000) = 23.634.
    • CCLE_expression_full.csv: log2(TPM + 1). 1406 x (53971 - 1). range(x[, -1]) = 0.00000 17.78354
    • CCLE_expression.csv: log2(TPM+1). 1406 x 19221 genes. protein coding genes. 33 diseases.
    • CCLE_expression_proteincoding_genes_expected_count.csv: 1406 x (19222 - 1). read count (non-integers) data from RSEM for just protein coding genes. range(x[, -1]) = 0 13018000.
    • CCLE_expression_transcripts_expected_count.csv: read count data from RSEM. 1406 x (228138-1). Non-integers. range(x[, -1]) = 0 11664000.
  • depmap package: Cancer Dependency Map Data Package. The depmap package currently contains eight (kinds) datasets available through ExperimentHub.
    • RNA inference knockout data
    • CRISPR-Cas9 knockout data
    • WES copy number data
    • CCLE Reverse Phase Protein Array data
    • CCLE RNAseq gene expression data
    • Cancer cell lines
    • Mutation calls
    • Drug Sensitivity
    R> eh <- ExperimentHub()
    R> class(eh)
    [1] "ExperimentHub"
    attr(,"package")
    [1] "ExperimentHub"
    
    R> rnai <- eh[["EH2260"]]
    R> class(rnai)
    [1] "tbl_df"     "tbl"        "data.frame"
    

NCI's Genomic Data Commons (GDC)/TCGA

The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).

GenomicDataCommons package

NCI60

Molecular Characterization of the NCI-60. NCI-ADR-RES and OVCAR-8 being derived from one another, SNB-19 and U251 are derived from the same patient

Case studies

Expression of the SARS-CoV-2 cell receptor gene ACE2 in a wide variety of human tissues

NCI Proteomic Data Commons

https://pdc.cancer.gov/pdc/ vs https://gdc.cancer.gov/

GTEx

NIH LINCS

Sharing data

Gene set analysis

Hypergeometric test

Next-generation sequencing data

Forums

Batch effect

Batch effect

Misc

Advice

High Performance

Cloud Computing

Merge different datasets (different genechips)

Genomic data vs transcriptomic data

  • The main difference between genomic data and transcriptomic data is that genomic data provides information on the complete DNA sequence of an organism, while transcriptomic data provides information on the expression levels of genes.
  • Genomic data:
    • Genomic data refers to the complete DNA sequence of an organism, which includes all of its genes, regulatory regions, and non-coding regions. This type of data provides information on the genetic makeup of an organism, including its potential to develop certain diseases, its evolutionary history, and its overall genetic diversity.
    • Examples of genomic data: Whole genome sequencing (WGS), Genome-wide association studies (GWAS), Copy number variation (CNV) analysis, Comparative genomics, Metagenomics.
  • Transcriptomic data
    • Transcriptomic data, on the other hand, refers to the collection of all RNA transcripts produced by the genes of an organism. RNA transcripts are produced when genes are transcribed into RNA molecules, which are then used as templates to synthesize proteins. Transcriptomic data provides information on the expression levels of genes, which can help researchers understand how genes are regulated and how they contribute to biological processes.
    • Examples of transcriptomic data: RNA-Seq, Microarray, scRNA-Seq, qPCR, Ribosome profiling

Low read count and filtering

  • DESeq2 pre-filtering:
    • While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. DESeq2 vignette.
    • One can also omit this step entirely and just rely on the independent filtering procedures available in results(), either IHW or genefilter::filtered_p().
smallestGroupSize <- 3 # smallest group size
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep, ]

Independent Filtering

edgeR::filterByExpr

Normalization

log2 transformation

No matter we use TPM, TMM, FPKM, or DESeq2 normalized counts, we still need to take a log2(x+1) transformation before any analyses.

Quantile normalization

  • normalize.quantiles() from preprocessCore package. How to Perform Quantile Normalization in R
    • for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
    • I got into an error when I use the function in RStudio docker container but the solution here (BiocManager::install("preprocessCore", configure.args="--disable-threading")) works.
    source('http://bioconductor.org/biocLite.R')
    biocLite('preprocessCore')
    #load package
    library(preprocessCore)
     
    #the function expects a matrix
    #create a matrix using the same example
    mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
                 ncol=3)
    mat
    #     [,1] [,2] [,3]
    #[1,]    5    4    3
    #[2,]    2    1    4
    #[3,]    3    4    6
    #[4,]    4    2    8
     
    #quantile normalisation
    normalize.quantiles(mat)
    #         [,1]     [,2]     [,3]
    #[1,] 5.666667 5.166667 2.000000
    #[2,] 2.000000 2.000000 3.000000
    #[3,] 3.000000 5.166667 4.666667
    #[4,] 4.666667 3.000000 5.666667

Distribution, density plot

Density plot showing the distribution of RNA-seq read counts (FPKM) log10(FPKM)

Negative binomial distribution

RNA-seq and Negative binomial distribution

Z-score transformation

Ensembl to gene symbol

How to use UCSC Table Browser

File:Tablebrowser.png File:Tablebrowser2.png

Note

  1. the UCSC browser will return the output on browser by default. Users need to use the browser to save the file with self-chosen file name.
  2. the output does not have a header
  3. The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1

If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.

$ wc -l hg38Tables.bed 
75893 hg38Tables.bed
$ head -2 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1	201283451	201332993	NM_000299	0	+	201283702	201328836	0	15	453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920,	0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed 
chr22_KI270734v1_random	131493	137393	NM_005675	0	+	131645	136994	0	5	262,161,101,141,549,	0,342,3949,4665,5351,
chr22_KI270734v1_random	138078	161852	NM_016335	0	-	138479	161586	0	15	589,89,99,176,147,93,82,80,117,65,150,35,209,313,164,	0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,

$ wc -l hg38CodingExon.bed 
577387 hg38CodingExon.bed
$ head -2 hg38CodingExon.bed 
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
$ tail -2 hg38CodingExon.bed 
chr22_KI270734v1_random	156288	156497	NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r	0	-
chr22_KI270734v1_random	161313	161586	NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r	0	-

# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
$ grep NM_001276352 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
$ grep NM_001276352 hg38CodingExon.bed
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
chr1	67103237	67103382	NM_001276352_cds_2_0_chr1_67103238_r	0	-
chr1	67111576	67111644	NM_001276352_cds_3_0_chr1_67111577_r	0	-
chr1	67115351	67115464	NM_001276352_cds_4_0_chr1_67115352_r	0	-
chr1	67125751	67125909	NM_001276352_cds_5_0_chr1_67125752_r	0	-
chr1	67127165	67127240	NM_001276352_cds_6_0_chr1_67127166_r	0	-

This can be compared to refGene(?) directly downloaded via http

$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’

hg38.refGene.txt.gz                100%[===============================================================>]   7.11M   901KB/s    in 10s

2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]

$ zcat hg38.refGene.txt.gz | wc -l
75893
15:45PM /tmp$ zcat hg38.refGene.txt.gz | head -2
1072	NM_003288	chr20	+	63865227	63891545	63865365	63889945	7	63865227,63869295,63873667,63875815,63882718,63889189,63889849,	63865384,63869441,63873816,63875875,63882820,63889238,63891545,	0	TPD52L2	cmpl	cmpl	0,1,0,2,2,2,0,
1815	NR_110164	chr2	+	161244738	161249050	161249050	161249050	2	161244738,161246874,	161244895,161249050,	0	LINC01806	unk	unk	-1,-1,

$ zcat hg38.refGene.txt.gz | tail -2
1006	NM_130467	chrX	+	55220345	55224108	55220599	55224003	5	55220345,55221374,55221766,55222620,55223986,	55220651,55221463,55221875,55222746,55224108,	0	PAGE5	cmpl	cmpl	0,1,0,1,1,
637	NM_001364814	chrY	-	6865917	6874027	6866072	6872608	7	6865917,6868036,6868731,6868867,6870005,6872554,6873971,	6866078,6868462,6868776,6868909,6870053,6872620,6874027,	0	AMELY	cmpl	cmpl	0,0,0,0,0,0,-1,

Where to download reference genome

Which human reference genome to use?

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)

UHR, HBR

In RNA sequencing (RNA-seq), Universal Human Reference (UHR) and Human Brain Reference (HBR) are two types of commercially available RNA samples that are often used as control samples and assess the performance and accuracy of RNA-seq assays. See this (github) and Lesson 13: Aligning raw sequences to reference genome.

GENCODE transcript database

RefSeq categories

See Table 1 of Chapter 18The Reference Sequence (RefSeq) Database.

Category Description
NC Complete genomic molecules
NG Incomplete genomic region
NM mRNA
NR ncRNA
NP Protein
XM predicted mRNA model
XR predicted ncRNA model
XP predicted Protein model (eukaryotic sequences)
WP predicted Protein model (prokaryotic sequences)

UCSC version & NCBI release corresponding

Gene Annotation

library(rtracklayer)
genes <- readGFF("gencode.v27.annotation.gff3.gz")
genes[1:2, 1:5]
# DataFrame with 2 rows and 5 columns
#      seqid   source       type     start       end
#   <factor> <factor>   <factor> <integer> <integer>
# 1     chr1   HAVANA gene           11869     14409
# 2     chr1   HAVANA transcript     11869     14409
genes[1:100, ] %>% filter(type == "gene") %>% dim()
# Error in UseMethod("filter") :
#   no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'RectangularData', 'SimpleList', 'DataFrame_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"

library(ape)
genes2 <- read.gff("gencode.v27.annotation.gff3.gz")
genes2[1:2, 1:5]
#   seqid source       type start   end
# 1  chr1 HAVANA       gene 11869 14409
# 2  chr1 HAVANA transcript 11869 14409
genes2[1:100,]  %>% filter(type == "gene") %>% dim()
# [1] 11  9

Genecards

  • https://www.genecards.org/
  • Q: What are genes with gene symbols starting with LINC?; eg LINC00491
    A: Genes with gene symbols starting with "LINC" are long intergenic non-coding RNA (lncRNA) genes. lncRNAs are RNA molecules that are transcribed from the genome but do not encode proteins. Unlike protein-coding genes, lncRNAs do not have a well-defined coding sequence, but they do play important regulatory roles in cellular processes such as gene expression, chromatin structure, and genome stability. Some lncRNAs are specifically expressed in cancer cells and have been implicated in tumor development and progression, making them of interest for cancer research.

How many DNA strands are there in humans?

How many base pairs in human

  • 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
  • chromosome 22 has the smallest number of bps (~50 million).
  • chromosome 1 has the largest number of bps (245 million base pairs).
  • Illumina iGenome Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa file is 3.0GB (so is other genome.fa from human).

Gene, Transcript, Coding/Non-coding exon

SNP

Types of SNPs and number of SNPs in each chromosomes

NGS technology

DNA methylation, Epigenetics

  • Relation of methylation genes and gene expression
    • Methylation of genes can have different effects on gene expression, depending on where the methylation occurs in the gene and the specific context of the gene and the cellular environment. Generally, methylation of promoter regions of genes is associated with reduced gene expression, whereas methylation of gene body regions is less clearly associated with gene expression changes.
    • When DNA is methylated at the promoter region of a gene, it can prevent the binding of transcription factors and RNA polymerase, which are necessary for transcription initiation. Methylation at the promoter region can also recruit proteins that block transcription or promote histone modifications that lead to chromatin compaction, further limiting access to the gene for transcriptional machinery.
    • However, methylation in other regions of the gene, such as the gene body, can have more complex effects on gene expression. In some cases, gene body methylation can be associated with increased expression, while in other cases it may have no effect or even lead to decreased expression. It is thought that gene body methylation may be involved in regulating alternative splicing or RNA stability, among other possible mechanisms.
    • Therefore, the effect of methylation on gene expression is not always straightforward and depends on various factors, including the specific gene, the location of methylation, and the cellular context.
devtools::install_github("coloncancermeth","genomicsclass")
library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
dim(meth)
dim(pd)
length(gr)
colnames(pd)
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")

i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
  lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
  lines(density(meth[,i],from=0,to=1),col=2)
}

# finding regions of the genome that are different between cancer and normal samples
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- ebayes(fit)

# plot of the region surrounding the top hit
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)

plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453

# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")

library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
#consider two example regions#
...

Integrate DNA methylation and gene expression

Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing

Sequence + Expression

Integrate RNA-Seq and DNA-Seq

Immunohistochemistry/IHC

https://en.wikipedia.org/wiki/Immunohistochemistry. Protein expression by IHC.

Deconvolve bulk tumor tissue

Performance of computational algorithms to deconvolve heterogeneous bulk tumor tissue depends on experimental factors. Twitter.

Tumor purity

  • Tumor Purity in Preclinical Mouse Tumor Models 2022
  • Systematic Assessment of Tumor Purity and Its Clinical Implications Haider 2020.
    • With the exception of naive miRNA profiles, purity estimates were inversely correlated with molecular profiles regardless of the underlying purity estimation profile .
    • These data suggest that the presence of genomic and transcriptomic correlates of tumor purity are likely to confound biologic and clinical interpretations.
  • Estimators:
    • DNA: ABSOLUTE, ASCAT, CLONET, INTEGER, OncoSNP
    • RNA: DeMix, ISOpure-R (matlab/R), ESTIMATE (Yoshihara, R)
    • miRNA/microRNA: ISOpure-I
  • TCGA purity estimate by Aran 2015 Systematic pan-cancer analysis of tumour purity - Supplementary Data 1 (xlsx file with columns: Sample ID,Cancer type,ESTIMATE,ABSOLUTE,LUMP,IHC & CPE).
  • Tumor.purity: TCGA samples with their Tumor Purity measures (a data frame with 9364 rows and 7 variables) from TCGAbiolinks package
  • Prediction of tumor purity from gene expression data using machine learning 2021.
    • We selected the CPE as the target variable, which is the median purity value after normalizing values from the other four purity estimates (ESTIMATE, ABSOLUTE, LUMP and IHC).
    • our data set consisted of 8405 tumor samples.
  • How VAF is related to tumor purity?
    • Variant allelic fraction (VAF) is related to tumor purity because it reflects the proportion of cells in a sample that carry a specific genetic variant. In the context of cancer, VAF can be used as a surrogate marker for tumor purity, as the fraction of cells in the sample that carry the variant will depend on the proportion of cancer cells relative to normal cells.
    • The VAF of a specific genetic variant in a cancer sample can be calculated as the ratio of the number of reads supporting the variant to the total number of reads covering that locus. In a sample that is purely composed of cancer cells, the VAF should approach 1, as all cells will carry the variant. In a sample that is mixed with normal cells, the VAF will be lower and proportional to the proportion of cancer cells in the sample.
    • Therefore, by measuring the VAF of one or more genetic variants, it is possible to estimate the tumor purity, which is the proportion of cancer cells in the sample relative to normal cells. This information is important for a variety of downstream analyses, including variant calling, gene expression analysis, and the estimation of the mutational burden, as it can affect the interpretation of the results and the accuracy of the analysis.
  • How gene expression can be used to estimate tumor purity?
    • Gene expression analysis can be used to estimate tumor purity by comparing the expression levels of genes known to be specific to either normal or cancer cells. In a sample that is mixed with normal and cancer cells, the expression levels of these genes will reflect the proportion of normal and cancer cells present in the sample.
    • For example, genes that are highly expressed in normal cells, such as housekeeping genes, can be used as a reference to estimate the proportion of normal cells in the sample. Similarly, genes that are highly expressed in cancer cells, such as oncogenes, can be used to estimate the proportion of cancer cells in the sample.
    • The relative expression levels of these genes can then be used to estimate the tumor purity, either by comparing the expression levels to a reference sample of known purity, or by using mathematical models to estimate the proportion of normal and cancer cells in the sample.
    • It is important to note that this method is not without limitations, as the expression levels of specific genes can be influenced by various factors, such as the presence of cell-to-cell heterogeneity, gene amplification, and epigenetic modifications, among others. Therefore, gene expression analysis should be used in combination with other methods, such as copy number analysis and variant allelic fraction analysis, to obtain a more accurate estimate of tumor purity.
  • Some papers
  • ISOpureR (Quon 2013): intensity or count data. Error term [math]\displaystyle{ e_n }[/math] multinomial distribution.
    library(ISOpureR)
    
    # For reproducible results, set the random seed
    set.seed(123);
    
    # Run ISOpureR Step 1 - Cancer Profile Estimation
    system.time(ISOpureS1model <- ISOpure.step1.CPE(
      tumor.expression.data,  # intensity or count data
      normal.expression.data  # intensity or count data 
    ))
    ISOpureS1model$alphapurities  # tumor purity estimates
    
  • ESTIMATE (Yoshihara 2013): normalized data.
    • Since it is based ssGSEA, only ranks are used. It does not matter we used the log transformed or count data.
    • ssGSEA is based on two gene signatures: Stromal signature (141 genes) and immune signature (141 genes)
    • The formula for calculating ESTIMATE tumor purity was developed in TCGA Affymetrix data (n=1001) including both the ESTIMATE score and ABSOLUTE-based tumor purity.
    • An evolutionary algorithm was used for the mathematical model.
    • Nonlinear least squares method was used to determine the final model estimate.
    • Tumor purity = cos(0.6 + 0.000146 * ESTIMATE score)
    library(estimate)
    OvarianCancerExpr <- system.file("extdata", "sample_input.txt", package="estimate")
    filterCommonGenes(input.f=OvarianCancerExpr, output.f="OV_10412genes.gct", id="GeneSymbol")
    estimateScore("OV_10412genes.gct", "OV_estimate_score.gct", platform="affymetrix")
    
    plotPurity(scores="OV_estimate_score.gct", samples="s516", platform="affymetrix")
    
    scan("OV_estimate_score.gct", "", skip=6)[-c(1:2)] |> as.numeric() # tumor purity estimates
    
  • DeMixT (Wang 2018): count data. Estimation of tumor cell total mRNA expression in 15 cancer types predicts disease progression, Cao 2022 for profile likelihood method & supplementary information for more information about the benchmarking between DeMixT_DE and DeMixT_GS.
    library(DeMixT)
    source("DeMixT_preprocessing.R")
    
    count.mat <- cbind(normal.expression.data, tumor.expression.data)
    colnames(count.mat) <- paste0("sample", 1:ncol(count.mat))
    
    label = factor(c(rep('Normal', ncol(normal.expression.data)),
                     rep('Tumor', ncol(tumor.expression.data))))
    set.seed(1234) # not sure if this is needed
    preprocessed_data = DeMixT_preprocessing(count.mat, label)
    PRAD_filter = preprocessed_data$count.matrix
    
    set.seed(1234)
    Normal.id <- paste0("sample", 1:n1)
    Tumor.id <- paste0("sample", (n1+1):(n1+n2))
    data.Y = SummarizedExperiment(assays = list(counts = PRAD_filter[, Tumor.id]))
    data.N1 <- SummarizedExperiment(assays = list(counts = PRAD_filter[, Normal.id]))
    res = DeMixT(data.Y = data.Y,
                 data.N1 = data.N1,
                 nthread = 64,
                 gene.selection.method = "DE") # default is "GS"
    res$pi[2, ] # tumor purity estimates
    
  • CIBERSORTx (Newman 2019). Web only. Determining cell type abundance and expression from bulk tissues with digital cytometry
  • PUREE (Revkov 2023). Python-based. Only API, no source code.

Integrate/combine Omics

Gene expression

Expression level is the amount of RNA in cell that was transcribed from that gene. Slides from Alyssa Frazee.

Fusion gene

Structural variation

LUMPY, DELLY, ForestSV, Pindel, breakdancer , SVDetect.

Covid-19

Bulk RNA sequencing for analysis of post COVID-19 condition 2024. 13 differentially expressed genes associated with PCC (long Covid) were found. Enriched pathways were related to interferon-signalling and anti-viral immune processes.

RNASeq + ChipSeq

Labs

Biowulf2 at NIH

BamHash

Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.

Reproducibility

Selected Papers

Pictures

https://www.flickr.com/photos/genomegov

FISH/Fluorescence In Situ Hybridization

用DNA做身分鑑識

用DNA做身分鑑識

如何自学入门生物信息学

CRISPR

基因編輯的原理是什麼?一次看懂基因神剪CRISPR

Staying current

Staying Current in Bioinformatics & Genomics: 2017 Edition

Papers

Common issues in algorithmic bioinformatics papers

What are some common issues I find when reviewing algorithmic bioinformatics conference papers?

Precision Medicine courses

Personalized medicine

Cancer and gene markers

  • Colorectal cancer patients without KRAS mutations have far better outcomes with EGFR treatment than those with KRAS mutations.
    • Two EGFR inhibitors, cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
  • Breast cancer.

The shocking truth about space travel

7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station

bioSyntax: syntax highlighting for computational biology

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y

Deep learning

Deep learning: new computational modelling techniques for genomics

HRD/homologous recombination deficiency 同源重组修复缺陷

  • https://en.wikipedia.org/wiki/Homologous_recombination
    • Homologous recombination proficient (HRP) cancer cells can repair DNA damage caused by chemotherapy, making them difficult to treat.
    • drugs have been developed to target homologous recombination via c-Abl inhibition and to exploit (take advantage of) deficiencies in homologous recombination in cancer cells with BRCA mutations.
    • One such drug is Olaparib, a PARP1 inhibitor that targets cancer cells by inhibiting base-excision repair (BER) in HR-deficient cells. However, cancer cells can become resistant to PARP1 inhibitors if they undergo deletions of mutations in BRCA2, restoring their ability to repair DNA by HR.
  • https://www.genome.gov/genetics-glossary/homologous-recombination (graphical illustration). Homologous recombination is a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA.
  • BRCA1 and BRCA2 are genes that produce proteins responsible for repairing damaged DNA within cells. Mutations in these genes can lead to errors in the DNA repair process, resulting in an accumulation of mutations that can cause cancer. This condition is known as Homologous Recombination Deficiency (HRD). PARP inhibitors are a type of targeted therapy that blocks the enzyme Poly (ADP-ribose) polymerase (PARP), which helps repair DNA damage in (cancer) cells. By inhibiting PARP, these drugs prevent cancer cells from repairing their DNA, leading to cell death.
  • The inability to repair DNA damage is referred to as homologous recombination deficiency (HRD). DNA damage response from Qiagen.
  • FDA approved for HRD assessment: 1) Foundation Medicine's FoundationOne CDx in 2020, 2) Myriad myChoice CDx in 2019. PS:
    • CDx stands for Companion Diagnostic. They help identify patients who might benefit from specific treatments, particularly in the context of certain cancer therapies. In the case of HRD (Homologous Recombination Deficiency) assessment, these CDx tests help identify patients who might respond well to PARP inhibitors or other targeted therapies.
    • In the case of HRD (Homologous Recombination Deficiency) assessment, these CDx tests help identify patients who might respond well to PARP inhibitors or other targeted therapies.
  • How HRD and PARP inhibitors relate:
    • Cancer cells with HRD are already struggling to repair DNA damage.
    • When these cells are treated with PARP inhibitors, it becomes even harder for them to repair DNA damage.
    • This combination can lead to cancer cell death, a concept known as "synthetic lethality."
    • In other words, PARP inhibitors (PARPi) are primarily used in patients who are HRD-positive.
  • Role of CDx tests in HRD assessment:
    • These tests analyze tumor samples for signs of HRD.
    • They may look for specific gene mutations (like BRCA1/2) or broader genomic patterns indicative of HRD.
    • The results help predict whether a patient's cancer is likely to respond to PARP inhibitors.
  • Clinical implications:
    • Patients who test positive for HRD are more likely to benefit from PARP inhibitors.
    • Those without HRD might be better suited for other treatment options.
  • Beyond PARP inhibitors:
    • HRD status can also inform the use of platinum-based chemotherapies and potentially other DNA-damaging agents.
  • Cancer types:
    • This approach is particularly relevant in ovarian, breast, prostate, and pancreatic cancers, among others.
  • openai
    • Poly(ADP-ribose) polymerase (PARP) inhibitors are a class of drugs that are designed to inhibit the activity of PARP enzymes. PARP enzymes are proteins that are involved in DNA repair pathways. They help to repair DNA damage and maintain the stability of the genome by adding a chemical group called poly(ADP-ribose) to other proteins.
    • PARP inhibitors work by blocking the activity of PARP enzymes, which can interfere with the ability of cells to repair damaged DNA. This can be especially useful in cancer cells, which often rely on PARP enzymes to repair DNA damage and maintain genomic stability. By inhibiting PARP enzymes, PARP inhibitors can sensitizize cancer cells to chemotherapy and other treatments, making them more vulnerable to cell death.
    • is there drug targeting HRP cancer patients? One approach is to target homologous recombination via c-Abl inhibition. For example, Niraparib is a PARP inhibitor that has been shown to be effective in treating advanced ovarian cancer in both homologous recombination deficient (HRD) and homologous recombination proficient (HRP) patients.
    • is there any drugs target HRD cancer patients? Yes, there are several drugs that have been developed to target HRD cancer cells. One approach is to use PARP inhibitors, which exploit deficiencies in homologous recombination in cancer cells with BRCA mutations. For example, Olaparib is a PARP1 inhibitor that has been shown to be effective in shrinking or stopping the growth of tumors from breast, ovarian and prostate cancers caused by mutations in the BRCA1 or BRCA2 genes. By inhibiting base-excision repair (BER) in HR-deficient cells, Olaparib applies the concept of synthetic lethality to specifically target cancer cells. PS: Examples of PARP inhibitors include niraparib (Zejula), olaparib (Lynparza), talazoparib (Talzenna), and rucaparib (Rubraca).
    • PARP1 is a member of the PARP family of proteins. PARP stands for Poly (ADP-ribose) polymerase. The PARP family comprises 17 members.
  • Loss-of-function genes involved in this pathway can sensitize tumors to poly(adenosine diphosphate [ADP]-ribose) polymerase (PARP) inhibitors and platinum-based (Pt) chemotherapy
    • Certain genes that are involved in a process called Homologous Recombination Repair (HRR) can make cancer cells more susceptible to certain treatments. Specifically, it says that when these loss-of-function genes are present, the cancer cells become more sensitive to two types of chemotherapy: PARP inhibitors and platinum-based chemotherapy.
    • Loss-of-function genes are genes that have mutations that prevent them from functioning properly or at all. By disrupting the normal functioning of the HRR pathway, these loss-of-function genes can make cancer cells more sensitive to PARP inhibitors and platinum-based chemotherapy.
  • scarHRD package. Note for the 2 test samples, the return object is a 1x4 matrix.
    • HRD-LOH/Loss of Heterozygosity
    • LST/Large Scale Transitions
    • Number of Telomeric Allelic Imbalances
    • HRD-sum (sum of the above 3)
  • Harmonization of homologous recombination deficiency testing in ovarian cancer: Results from the MITO16A/MaNGO-OV2 trial 2024
    • BRCA mutation status: the BRCA status was defined positive (BRCA+) when a clinically significant mutation (pathogenic/likely pathogenic) was found in BRCA genes, and negative (BRCA-) if no clinically significant mutation was found. NOTE: Pathogenicity 致病性 refers to the ability of an organism to cause disease or damage in a host.
    • Genomic instability/GI score (same as HRD score)
    • HRD status
    • Sensitivty, Specificity, Agreement, Cohen kappa are used to compare new assays/tests vs reference test (Myriad)
    • RECIST response rate, PFS, OS

Computational Pathology

Bi-allelic, monoallelic

Bi-allelic and monoallelic expression. In most cases, both alleles (the two chromosomal copies) are transcribed; this is known as bi-allelic expression (left). However, a minority of genes show monoallelic expression (right). In these cases, only one allele of a gene is expressed (right).

SOMAscan assay (proteomic)

Scandal

阿茲海默症關鍵論文被揭發疑似造假,16年來全球醫學專家可能都被呼弄 & 阿茲海默症關鍵論文疑造假 誤導外界16年

Terms

RNA vs DNA

基因结构

https://zhuanlan.zhihu.com/p/49601643

Pseudogene

https://www.genome.gov/genetics-glossary/Pseudogene. An example: OR7E47P with alias bpl 41-16 or bpl41-16.

PCR

什麼是PCR? 聚合酶鏈鎖反應? 基因叔叔

Epidemiology

Epidemiology for the uninitiated

Cell lines

in vivo, in vitro, and in situ

In silico 電腦模擬 (in silicon, s=simulation)

  • https://en.wikipedia.org/wiki/In_silico An in silico experiment is one performed on computer or via computer simulation.
  • The main difference between in silico gene expression analysis and experimental gene expression analysis is the method used to study the patterns and levels of gene expression.
    • In silico gene expression analysis involves the use of computational tools and algorithms to analyze large datasets of gene expression data obtained from techniques such as microarrays, RNA sequencing, or single-cell RNA sequencing. This analysis can include identifying differentially expressed genes between samples, clustering genes with similar expression patterns, and predicting the functional roles of genes based on their expression profiles.
    • On the other hand, experimental gene expression analysis involves directly measuring the levels and patterns of gene expression using laboratory techniques. These techniques can include real-time polymerase chain reaction (PCR), northern blotting, western blotting, and immunohistochemistry, among others. These experimental techniques allow researchers to directly measure the levels of specific RNA or protein molecules in biological samples.
    • While in silico gene expression analysis is a rapid and cost-effective way to analyze large datasets of gene expression data, it relies on the accuracy and completeness of the data being analyzed. Experimental gene expression analysis provides a more direct and accurate view of gene expression but can be more time-consuming and expensive. In practice, both in silico and experimental gene expression analysis are valuable tools that can be used to complement each other in the study of gene expression and its role in various biological processes and diseases.
  • In silico gene expression analysis--an overview Murray 2007
  • A simple in silico approach to generate gene-expression profiles from subsets of cancer genomics data 2019

In situ 原處 (介於in vivo與in vitro之間)

  • https://en.wikipedia.org/wiki/In_situ 意義大致介於in vivo與in vitro之間。
  • Something that’s performed in situ means that it’s observed in its natural context, but outside of a living organism. In vivo is Latin for “within the living.” It refers to work that’s performed in a whole, living organism .
  • A good example of this is a technique called in situ hybridization (ISH). ISH can be used to look for a specific nucleic acid (DNA or RNA) within something like a tissue sample. Specialized probes are used to bind to a specific nucleic acid sequence that the researcher is looking to find. These probes are tagged with things like radioactivity or fluorescence. This allows the researcher to see where the nucleic acid is located within the tissue sample. ISH allows the researcher to observe where a nucleic acid is located within its natural context, yet outside of a living organism. Examples are microarray experiments.

in vivo 活体内

Syngenic

Syngeneic tumor models are experimental models used in cancer research that use genetically identical animals to study the growth and spread of cancer cells. In these models, a malignant tumor is induced in one animal and then transplanted into another animal of the same genetic background. This allows researchers to study the interactions between the host immune system and the cancer cells, as well as the response of the tumor to various treatments.

Syngeneic tumor models are often used in combination with other experimental models, such as xenograft models (where the cancer cells are transplanted into a genetically different animal) or cell line models (where the cancer cells are grown in a laboratory). By using a combination of these models, researchers can gain a more complete understanding of the biology of cancer and develop new treatments for cancer patients.

The syngeneic model is an important tool for studying the role of the immune system in cancer, as the genetically identical animals allow researchers to control for genetic differences that might impact the immune response. Additionally, because the host immune system in these models is functional and can mount a response against the transplanted cancer cells, the syngeneic model provides a more realistic representation of the host-tumor interaction than other models that rely on immunodeficient animals.

in vitro 试管内/体外

RUO: research use only

RUO stands for "Research Use Only". In the context of clinical trials and laboratory research, it refers to in vitro diagnostic products (IVDs) that are intended to be used in non-clinical studies, including to gather data for submission as required by regulatory authorities³. These products are not intended for use in diagnostic procedures. They are often used by medical laboratories and other institutions for research purposes. However, if these products are used for purposes other than research, it could have legal implications. It's important to note that RUO products are not subject to the same regulatory controls as in-vitro diagnostic medical devices (CE-IVDs) that must comply with the applicable legal requirements.

RNA sequencing 101

Web

Books

strand-specific vs non-strand specific experiment

Understand this info is necessary when we want to use summarizeOverlaps() function (GenomicAlignments) or htseq-count python program to get count data.

This post mentioned to use infer_experiment.py script to check whether the rna-seq run is stranded or not.

The rna-seq experiment used in this tutorial is not stranded-specific.

FASTQ

  • FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Phred quality score

q = -10log10(p) where p = error probability for the base.

q error probability base call accuracy
10 0.1 90%
13 0.05 95%
20 0.01 99%
30 0.001 99.9%
40 0.0001 99.99%
50 0.00001 99.999%

FASTA

fasta/fa files can be used as reference genome in IGV. But we cannot load these files in order to view them.

Download sequence files

Compute the sequence length of a FASTA file

https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

head -2 file.fa | \
    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}'  | \
    tail -1

FASTA <=> FASTQ conversion

According to this post,

  • FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
  • FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)

Convert FASTA to FASTQ without quality scores

Biostars. For example, the bioawk by lh3 (Heng Li) worked.

Convert FASTA to FASTQ with quality score file

See the links on the above post.

Convert FASTQ to FASTA using Seqtk

Use the Seqtk program; see this post.

The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.

RPKM (Mortazavi et al. 2008) and cpm (counts per million)

Reads per Kilobase of Exon per Million of Mapped reads.

  • RPKMs can only be calculated for those genes for which the gene length and GC content information is available; see the vignette of GSVA
  • rpkm function in edgeR package.
  • RPKM function in easyRNASeq package.
  • TMM > cpm > log2 transformation on the paper Gene expression profiling of 1200 pancreatic ductal adenocarcinoma reveals novel subtypes
  • Gene expression quantification from RNA-Seq wikipedia page
    • Sequencing depth/coverage: the total number of reads generated in a single experiment is typically normalized by converting counts to fragments, reads, or counts per million mapped reads (FPM, RPM, or CPM).
    • Gene length: FPKM, TPM. Longer genes will have more fragments/reads/counts than shorter genes if transcript expression is the same. This is adjusted by dividing the FPM by the length of a gene, resulting in the metric fragments per kilobase of transcript per million mapped reads (FPKM). When looking at groups of genes across samples, FPKM is converted to transcripts per million (TPM) by dividing each FPKM by the sum of FPKMs within a sample.
  • Difference between CPM and TPM and which one for downstream analysis?. CPM is basically depth-normalized counts whereas TPM is length normalized (and then normalized by the length-normalized values of the other genes).
  • The RNA-seq abundance zoo. The counts per million (CPM) metric takes the raw (or estimated) counts, and performs the first type of normalization I mention in the previous section. That is, it normalized the count by the library size, and then multiplies it by a million (to avoid scary small numbers).
  • See also the log(CPM) implemented in Seurat::NormalizeData() for scRNA-seq data.

Idea

  • The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
  • Longer transcript are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!. As such, the DESeq2 only creates a size factor for each library and normalize the counts by dividing counts by a size factor (scalar) for each library. Note that: H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2 where c is gene length.

Calculation

  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

Formula

RPKM = (10^9 * C)/(N * L), with 

C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)

set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
     [,1] [,2] [,3] [,4]
[1,]    0    0    5    0
[2,]    6    2    7    3
[3,]    5   13    7    2
[4,]    3    3    9   11
[5,]    1    2    1   15

d <- DGEList(counts=y, lib.size=1001:1004)
# Note that lib.size is optional
# By default, lib.size = colSums(counts)
cpm(d) # counts per million
   Sample1   Sample2  Sample3   Sample4
1    0.000     0.000 4985.045     0.000
2 5994.006  1996.008 6979.063  2988.048
3 4995.005 12974.052 6979.063  1992.032
4 2997.003  2994.012 8973.081 10956.175
5  999.001  1996.008  997.009 14940.239
> cpm(d,log=TRUE)
    Sample1   Sample2  Sample3   Sample4
1  7.961463  7.961463 12.35309  7.961463
2 12.607393 11.132027 12.81875 11.659911
3 12.355838 13.690089 12.81875 11.129470
4 11.663897 11.662567 13.17022 13.451207
5 10.285119 11.132027 10.28282 13.890078

d$genes$Length <- c(1000,2000,500,1500,3000)
rpkm(d)
    Sample1   Sample2    Sample3  Sample4
1    0.0000     0.000  4985.0449    0.000
2 2997.0030   998.004  3489.5314 1494.024
3 9990.0100 25948.104 13958.1256 3984.064
4 1998.0020  1996.008  5982.0538 7304.117
5  333.0003   665.336   332.3363 4980.080

> cpm
function (x, ...)
UseMethod("cpm")
<environment: namespace:edgeR>
> showMethods("cpm")

Function "cpm":
 <not an S4 generic function>
> cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    x <- as.matrix(x)
    if (is.null(lib.size))
        lib.size <- colSums(x)
    if (log) {
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
        lib.size <- lib.size + 2 * prior.count.scaled
    }
    lib.size <- 1e-06 * lib.size
    if (log)
        log2(t((t(x) + prior.count.scaled)/lib.size))
    else t(t(x)/lib.size)
}
<environment: namespace:edgeR>
> rpkm.default
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
    gene.length.kb <- gene.length/1000
    if (log)
        y - log2(gene.length.kb)
    else y/gene.length.kb
}
<environment: namespace:edgeR>

Here for example the 1st sample and the 2nd gene, its rpkm value is calculated as

# step 1:
6/(1.0e-6 *1001) = 5994.006    # cpm, compute column-wise
# step 2:
5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise

# Another way
# step 1 (RPK) 
6/ (2000/1.0e3) = 3
# step 2 (RPKM)
3/ (1.0e-6 * 1001) = 2997.003

Another example. source code of calc_cpm().

library(edgeR)
set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
cpm(y)
#          [,1]   [,2]      [,3]      [,4]
#[1,]      0.00      0 172413.79      0.00
#[2,] 400000.00 100000 241379.31  96774.19
#[3,] 333333.33 650000 241379.31  64516.13
#[4,] 200000.00 150000 310344.83 354838.71
#[5,]  66666.67 100000  34482.76 483870.97

calc_cpm <- function (expr_mat) {
    norm_factor <- colSums(expr_mat)
    return(t(t(expr_mat)/norm_factor) * 10^6)
    # Fix a bug in the original code
    # Not affect silhouette()
}

calc_cpm(y)
#          [,1]   [,2]      [,3]      [,4]
#[1,]      0.00      0 172413.79      0.00
#[2,] 400000.00 100000 241379.31  96774.19
#[3,] 333333.33 650000 241379.31  64516.13
#[4,] 200000.00 150000 310344.83 354838.71
#[5,]  66666.67 100000  34482.76 483870.97

Critics

Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!

CPM vs TPM

Both has the property that the sumof reads is 1 million(10^6). But TPM includes gene length normalization (TPM accounts for variations in gene length (done first) and sequencing depth (done second)). So if want to find DE genes between samples, it is common to use the TPM normalization method.

(another critic) Union Exon Based Approach

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910

In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.

It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.

FPKM (Trapnell et al. 2010)

  • Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks).
  • FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
  • Differential expression analysis with only FPKM matrix available from total newbie in R

Gene length

library(AnnotationHub)
library(airway)
library(SummarizedExperiment)

# Load the airway dataset
data("airway")
counts <- assay(airway)
counts <- as.matrix(counts)  # Ensure it's a matrix

# TPM Calculation Function
compute_tpm <- function(counts, lengths) {
  length_scaled <- counts / lengths
  scaling_factors <- colSums(length_scaled)
  tpm <- t(t(length_scaled) / scaling_factors) * 1e6
  return(tpm)
}

BiocManager::install("biomaRt")  # If not installed yet
library(biomaRt)
# Connect to Ensembl BioMart for human
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# Fetch gene lengths for human genes
gene_lengths <- getBM(
  attributes = c("ensembl_gene_id", "gene_biotype", "transcript_length"),  # Attribute options
  mart = mart
)

# Make sure to use ENSEMBL IDs (Ensembl Gene IDs) from the airway data
head(gene_lengths)
  ensembl_gene_id   gene_biotype transcript_length
1 ENSG00000210049        Mt_tRNA                71
2 ENSG00000211459        Mt_rRNA               954
3 ENSG00000210077        Mt_tRNA                69

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# Extract the transcripts
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
genes <- genes(txdb)
length(genes)
# [1] 30969
names(genes)[1:5]
# [1] "1"         "10"        "100"       "1000"      "100008586"

# Calculate lengths (this uses the exons of the genes)
gene_lengths <- width(reduce(genes))
gene_lengths <- as.numeric(gene_lengths)
length(gene_lengths)
# [1] 26269
gene_lengths[1:3]
# [1] 2541 1556 6167

RPKM, FPKM, TPM and DESeq

> set.seed(1)
> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14       1       1       4
gene2       5      17      13      14
gene3       0      12       8       6
gene4     152      62     149     110
gene5      23      36      33      94
gene6       0       1       1       4
> dds <- estimateSizeFactors(dds)
> sizeFactors(dds)
 sample1  sample2  sample3  sample4
1.068930 1.014687 1.010392 1.033559
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14       1       1       4
gene2       5      17      13      14
gene3       0      12       8       6
gene4     152      62     149     110
gene5      23      36      33      94
gene6       0       1       1       4
> head(counts(dds, normalized=TRUE))
         sample1    sample2     sample3    sample4
gene1  13.097206  0.9855256   0.9897147   3.870122
gene2   4.677574 16.7539354  12.8662916  13.545427
gene3   0.000000 11.8263073   7.9177179   5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6   0.000000  0.9855256   0.9897147   3.870122

# normalized counts is calculated as the following
R> head(scale(counts(dds, normalized=F), F, sizeFactors(dds)))
         sample1    sample2     sample3    sample4
gene1  13.097206  0.9855256   0.9897147   3.870122
gene2   4.677574 16.7539354  12.8662916  13.545427
gene3   0.000000 11.8263073   7.9177179   5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6   0.000000  0.9855256   0.9897147   3.870122

# The situation of DESeqDataSet object created using 'tximport()' is different. See the next item.
P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)

TMM (Robinson and Oshlack, 2010)

Trimmed Means of M values (edgeR).

  • TMM relies on the assumption that most genes are not differentially expressed. See the paper. DESeq2 does not rely on this assumption.
  • TMM will not work well for samples where the library size is so small that most of the counts become zero. A library size of 1 million is on the small side, but is probably ok. Is there any reason to think that normalization (e.g. TMM) doesn't work well with samples that that have very different raw counts?
  • Many normalization RNA-Seq normalization methods perform poorly on samples with extreme composition bias. For instance, in one sample a large number of reads comes from rRNAs while in another they have been removed more efficiently. Most scaling based methods, including RPKM and CPM, will underestimate the expression of weaker expressed genes in the presence of extremely abundant mRNAs (less sequencing real estate available for them). The TMM methods tries to correct this bias.
  • Does EdgeR trimmed mean of M values (TMM) account for gene length? No. In general, edgeR does not need to adjust for gene length in DE analyses because gene length cancels out of DE comparisons.
  • Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq
  • Q: Does TMM method require count data?
    • A: Yes, the TMM method requires RNA-seq count data as input.
    • The TMM method uses these count data to calculate the scaling factors that adjust for differences in library size and gene length, as well as the effects of highly expressed genes.
    • Before applying the TMM method, it is important to ensure that the count data has been properly preprocessed and filtered to remove low-quality reads, adapter sequences, and other artifacts.
  • Q: can TMM method be applied to non-integer data?
    • A: It is possible to apply TMM to non-integer data, such as normalized expression values or FPKM (fragments per kilobase of transcript per million mapped reads) values, by rounding the values to the nearest integer.
    • In practice, the TMM method can be applied to non-integer data by first converting the data to counts, for example, by multiplying the expression values by a scaling factor that represents the average library size, and then rounding the resulting values to the nearest integer. The TMM method can then be applied to the rounded count data as usual.
  • StatQuest: edgeR, part 1, Library Normalization. Good explanation about reference sample selection.
  • Trimmed Mean
  • RNA Sequence Analysis in R: edgeR
  • Normalisation methods implemented in edgeR. TMM, RLE, Upper-quartile.
  • Using edgeR package
    library(magrittr)
    library(edgeR)
    set.seed(1)
    M <- matrix(rnbinom(10000,mu=5,size=2), ncol=4)
    
    out <- DGEList(M) %>% calcNormFactors() %>% cpm()
    
  • Using NOISeq package
    library(NOISeq)
    out2 <- tmm(M, long = 1000, lc = 0, k = 0)
    out[1:3, 1:3] / out2[1:3, 1:3]
    #    Sample1  Sample2  Sample3
    # 1 80.81611 81.32609      NaN
    # 2 80.81611 81.32609 80.81611
    # 3 80.81611      NaN 80.81611
    

Sample size

Power-> RNA-seq

Coverage

~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
C = L N / G

where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.

# Assume the bam file is sorted by chromosome location
# took 40 min on 5.8G bam file. samtools depth has no threads option:(
# it is not right since it only account for regions that were covered with reads
samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'    # maybe 42

# The following is the right way! The result matches with Qualimap program.
samtools depth -a *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  # maybe 8
# OR
LEN=`samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'`   # 3095693981
SUM=`samtools depth bamfile | awk '{sum+=$3} END { print "Sum = ", sum}'`   # 24473867730
echo $(( $LEN/$SUM ))

5 common genomics file formats

5 genomics file formats you must know (video)

  • fastq,
  • fastq,
  • bam,
  • vcf,
  • bed (genomic intervals regions)

SAM/Sequence Alignment Format and BAM format specification

Single-end, pair-end, fragment, insert size

Germline vs Somatic mutation

  • Germline: inherit from parents. See the Wikipedia page.
  • Somatic SNVs are mutations that occur in the cells of a tumor. These mutations can be found in multiple copies of the same gene, while germline SNVs are mutations that are found in a single copy of the gene, usually the original copy.
  • Somatic & Germline Mutations

Pathogenic mutation

  • A pathogenic mutation is a change in the genetic sequence that causes a specific genetic disease. To determine if a change found in the gene is something that causes disease, a laboratory looks at many different factors. For example, they look at the type of change found. Some changes, like nonsense mutations or frameshift mutations, almost always result in a major problem with the protein produced, so they are often labeled as pathogenic mutations. Laboratories will also check the scientific literature and databases to see if the particular change has been reported in other individuals with the genetic disease. Lastly, they look to see if the change is in an area of the gene that is conserved across species, meaning that the area where the change is located is the same in lots of animals, thus may be an important area for the function of the protein.
  • pathogenic variant. See Genetic Disorders
  • Pathogenic means able to cause or produce disease.
  • What are some examples of genetic diseases caused by pathogenic mutations? Cystic fibrosis, Duchenne muscular dystrophy, Familial hypercholesterolemia, Hemochromatosis, Sickle cell disease, Tay-Sachs disease ...

Driver vs passenger mutation

https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer

Nonsynonymous mutation

It is related to the genetic code, Wikipedia. There are 20 amino acids though there are 64 codes.

See

isma: analysis of mutations detected by multiple pipelines

isma: an R package for the integrative analysis of mutations detected by multiple pipelines

mutSignatures: analysis of cancer mutational signatures

Rediscover: identify mutually exclusive mutations

Rediscover: an R package to identify mutually exclusive mutations

Tumor mutational burden

Types of mutations

Cytogenetic alternations

Alternative and differential splicing

Allele vs Gene

http://www.diffen.com/difference/Allele_vs_Gene

  • A gene is a stretch of DNA or RNA that determines a certain trait.
  • Genes mutate and can take two or more alternative forms; an allele is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
  • An allele is found at a fixed spot on a chromosome?
  • Chromosomes occur in pairs so organisms have two alleles for each gene — one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be same (homozygous) or different (heterozygotes).

Locus

https://en.wikipedia.org/wiki/Locus_(genetics)

Haplotypes

Base quality, Mapping quality, Variant quality

VarSAP

Enrichment of Variant Information for the Variant Standardization and Annotation Pipeline

The Clinical Knowledgebase (CKB)

https://ckb.jax.org/gene/show?geneId=7157 (TP53)

Mapping quality (MAPQ) vs Alignment score (AS)

http://seqanswers.com/forums/showthread.php?t=66634 & SAM format specification

  • MAPQ (5th column): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong} (defined by SAM documentation), rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. MAPQ is a metric that tells you how confident you can be that the read comes from the reported position. So given 1000 reads, for example, read alignments with mapping quality being 30, one of them will be wrong in average (10^(30/-10)=.001). Another example, if MAPQ=70, then the probability mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Users should refer to the alignment program for the 'MAPQ' value it uses.
  • AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)

Note:

  1. MAPQ scores produced by the aligners typically involves the alignment score and other information.
  2. You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but still the reported position is still much more probable than any other.
  3. You probably want to filter for MAPQ, but "good" alignment may refer to AS if what you care is similarity between read and reference.
  4. MAPQ values are really useful but their implementation is a mess by Simon Andrews

gene's isoform

FFPE Tissue vs Frozen Tissue

Wild type vs mutant

ns

Not significant

PARP inhibitor

  • What is a PARP Inhibitor? Dana-Farber Cancer Institute
  • PARP is an enzyme/a family of proteins that help repair damaged DNA in cells. When DNA is damaged, PARP detects the damage and signals other enzymes to come and fix it. This helps maintain the stability of the cell’s genetic material and prevent cell death.
  • Is PARP good or bad? PARP is neither inherently good nor bad. It is a protein that plays an important role in maintaining the stability of the cell’s genetic material by helping to repair damaged DNA.
    • In normal cells, PARP helps prevent cell death and maintain genomic stability.
    • In cancer cells, PARP can help the cancer cells survive and continue to grow by repairing their DNA. This is why PARP inhibitors (PARPi) are used in cancer treatment to block the function of PARP and prevent cancer cells from repairing their DNA.
  • PARP inhibitors are a type of targeted cancer therapy, not a traditional chemotherapy.
  • PARPi therapy is a cancer treatment that blocks the PARP enzyme, which helps repair DNA damage in cancer cells
    • PARP Inhibitors: Clinical Relevance, Mechanisms of Action and Tumor Resistance
    • List of PARP inhibitors: olaparib, niraparib, rucaparib, and talazoparib.
    • Olaparib is a medication for the maintenance treatment of BRCA-mutated advanced ovarian cancer in adults. It is a PARP inhibitor, inhibiting poly ADP ribose polymerase (PARP), an enzyme involved in DNA repair. Others include Letrozole, Avastin.
    • Maintenance therapy is called so because it is the ongoing treatment of cancer with medication after the cancer has responded to the first recommended treatment. The main goals of maintenance therapy are
      • To prevent the cancer’s return
      • To delay the growth of advanced cancer after the initial treatment
  • PARP inhibitors are a class of drugs that inhibit the activity of PARP enzymes. By blocking PARP’s ability to help repair DNA damage, these drugs can make it more difficult for cancer cells to survive DNA damage caused by other treatments, such as chemotherapy or radiation therapy. This can make these treatments more effective against certain types of cancer.
  • PARP inhibitors are drugs that block the action of the PARP enzymes, which are involved in DNA repair. There are several PARP inhibitors available, including olaparib (Lynparza), niraparib (Zejula), rucaparib (Rubraca), and talazoparib (Talzenna). These drugs are approved for some types of cancer, such as ovarian and prostate cancer, depending on the presence of certain genetic mutations.

Inhibitor genes and activator genes/enhancer genes

  • Inhibitor genes: Inhibitor genes are genes that code for proteins that can regulate or inhibit the activity of other genes or proteins in a cell. These inhibitor proteins can interact with other proteins to prevent them from functioning or alter their activity.
    • TP53 gene, which codes for the p53 protein, a tumor suppressor protein that plays a critical role in regulating cell division and preventing the formation of cancerous tumors. Mutations in the TP53 gene can result in loss of p53 function and an increased risk of cancer.
  • Activator genes: Activator genes play crucial roles in a variety of biological processes, including embryonic development, immune system responses, and the regulation of gene expression. For example, the NF-kB gene codes for a transcription factor that activates genes involved in immune system responses.
    • The MYC gene is an oncogene that codes for a transcription factor that promotes cell growth and proliferation. Dysregulation or overexpression of MYC can contribute to the development of many types of cancer.
    • Overexpression of the HER2 (human epidermal growth factor receptor 2) gene is commonly observed in certain types of cancer, particularly breast cancer. Approximately 20-25% of breast cancer cases overexpress HER2, which is associated with a more aggressive form of the disease.
    • Overexpression of the epidermal growth factor receptor (EGFR) gene is a common genetic alteration observed in glioblastoma multiforme (GBM) patients.
  • It's important to note that the distinction between inhibitor and activator genes is not always clear-cut, as many genes can have both inhibitory and activating effects depending on the context and the specific proteins they interact with.
  • Normal/cancer cells and PARP Inhibition:
    • Normal cells can tolerate DNA damage caused by PARP inhibition due to their efficient homologous recombination (HR) mechanism.
    • In contrast, cancer cells with a deficient HR struggle to manage the DNA double-strand breaks (DSBs) and are especially sensitive to the effects of PARP inhibitors (PARPi)
    • PARP has been found to be overexpressed in various types of cancers, including breast, ovarian, and oral cancers, compared to their corresponding normal healthy tissues.
    • This overexpression makes inhibition of PARP activity an attractive strategy for cancer therapeutics. By disrupting PARP functions, it impairs DNA damage repair (DDR) pathways in cancer cells.
    • After cancer patients receive PARP inhibitor (PARPi) drugs, the expression of PARP genes in cancer patients tends to be lower compared to normal patients.

Undifferentiated cancer

  • Medical Definition of Undifferentiated cancer (不好的 tumor)
  • What are undifferentiated cells
    • Undifferentiated cells are cells that have not yet developed into a specific type of cell and do not possess the characteristics of a fully differentiated cell. They are also known as stem cells or progenitor cells.
    • In developmental biology, undifferentiated cells are the cells that have not yet undergone differentiation, the process by which a less specialized cell becomes a more specialized cell, with a specific function and characteristics. These cells have the potential to divide and differentiate into multiple cell types, either through normal development or in response to injury or disease.
    • In the context of cancer, undifferentiated cells refer to cells that have not yet developed into a specific type of cancer cell. These cells are sometimes called cancer stem cells, and they are thought to be the cells that give rise to the various types of cells within a tumor. They are believed to be responsible for the maintenance and growth of the tumor, and for its ability to spread to other parts of the body. They can be found within many types of cancer, and are considered to be an important target for cancer therapy, as they are thought to be more resistant to traditional treatments such as chemotherapy and radiation.
  • GSE164174

NCI Information Technology for Cancer Research program /ITCR

https://itcr.cancer.gov/. Videos. It sponsors several programs like Bioconductor, GenePattern, UCSC Xena, IGV, PDX Finder, WebMeV, et al.

Other software

Partek

dCHIP

MeV

MeV v4.8 (11/18/2011) allows annotation from Bioconductor

IPA from Ingenuity

Login: There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.

Features:

  • easily search the scientific literature/integrate diverse biological information.
  • build dynamic pathway models
  • quickly analyze experimental data/Functional discovery: assign function to genes
  • share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.

Start Here

Expression data -> New core analysis -> Functions/Diseases -> Network analysis
                                        Canonical pathways        |
                                              |                   |
Simple or advanced search --------------------+                   |
                                              |                   |
                                              v                   |
                                        My pathways, Lists <------+
                                              ^
                                              |
Creating a custom pathway --------------------+

Resource:

Notes:

  • The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
  • The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
  • The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
  • In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
  • The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.

Screenshots:

File:IngenuityAnalysisOutput.png

DAVID Bioinformatics Resource

It offers an integrated annotation combining gene ontology, pathways and protein annotations.

It can be used to identify the pathways associated with a set of genes; e.g. this paper.

GOTrapper

GOTrapper: a tool to navigate through branches of gene ontology hierarchy

qpcR

Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).

GSEA

sandbox.bio: Interactive bioinformatics tutorials

https://sandbox.bio/. An interactive playground for learning bioinformatics command-line tools like bedtools, bowtie2, and samtools.

GWAS

Genome-wide association studies in R