Genome: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 2,100: Line 2,100:
* [https://en.wikipedia.org/wiki/GC-content GC content]  = (G+C)/(A+T+G+C) x 100%
* [https://en.wikipedia.org/wiki/GC-content GC content]  = (G+C)/(A+T+G+C) x 100%
* How many CpGs (C follows by G)?
* How many CpGs (C follows by G)?
* Some papers
** [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04416-w Identification of prostate cancer specific methylation biomarkers from a multi-cancer analysis] 2021
* [http://genomicsclass.github.io/book/pages/methylation.html Analyzing DNA methylation data] (part of the book [http://genomicsclass.github.io/book/ Biomedical Data Science]) and the [https://www.class-central.com/mooc/1615/edx-ph525x-data-analysis-for-genomics PH525x: Data Analysis for Genomics] (edX course). The Github website is on https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The paper [http://ije.oxfordjournals.org/content/41/1/200.long Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies] the tutorial has mentioned.
* [http://genomicsclass.github.io/book/pages/methylation.html Analyzing DNA methylation data] (part of the book [http://genomicsclass.github.io/book/ Biomedical Data Science]) and the [https://www.class-central.com/mooc/1615/edx-ph525x-data-analysis-for-genomics PH525x: Data Analysis for Genomics] (edX course). The Github website is on https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The paper [http://ije.oxfordjournals.org/content/41/1/200.long Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies] the tutorial has mentioned.
<source lang="rsplus">
<source lang="rsplus">

Revision as of 21:45, 13 October 2021

Visualization

IGV

nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh

Simulated DNA-Seq

The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.

File:Igv dna simul.png

Whole genome

PRJEB1486

File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png

RNA-Seq

  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).

RNA-seq DRaMA

https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest

Gviz

GIVE: Genomic Interactive Visualization Engine

Build your own genome browser

ChromHeatMap

Heat map plotting by genome coordinate.

ggbio

NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.

rtracklayer

R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.

ssviz

A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).

Sushi

See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal and TCGA

TCPA

Download. Level 4.

Qualimap

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

SeqMonk

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

dittoSeq

dittoSeq – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

modSaRa2

An accurate and powerful method for copy number variation detection

Visualization

reconCNV: interactive visualization of copy number data from high-throughput sequencing 2021

NGS

File:CentralDogmaMolecular.png

See NGS.

R and Bioconductor packages

Resources

library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)

Docker

Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

rnaseqGene

rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression

tximport

CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type example package name contents
OrgDb org.Hs.eg.db gene based information for Homo sapiens
TxDb TxDb.Hsapiens.UCSC.hg19.knownGene transcriptome ranges for Homo sapiens
OrganismDb Homo.sapiens composite information for Homo sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19 genome sequence for Homo sapiens
refGenome

RNA-Seq Data Analysis using R/Bioconductor

recount2

dbGap

dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

eQTL

Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN

GenomicDataCommons package

Note:

  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has 4 categories data and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.

Visualization

GenVisR

ComplexHeatmap

limma

  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.

RSEM

Time Course Experiments

  • See Limma's vignette on Section 9.6
    • Few time points (ANOVA, contrast)
      • Which genes respond at either the 6 hour or 24 hour times in the wild-type?
      • Which genes respond (i.e., change over time) in the mutant?
      • Which genes respond differently over time in the mutant relative to the wild-type?
    • Many time points (regression such as cubic spline, moderated F-test)
      • Detect genes with different time trends for treatment vs control.

easyRNASeq

Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.

ShortRead

Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.

Rsamtools

The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.

IRanges

IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.

flank

The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.

Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.

Biostrings

GenomicRanges

GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.

GenomicAlignments

Counting reads with summarizeOverlaps vignette

library(GenomicAlignments)
library(DESeq)
library(edgeR)

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

assays(olap)$counts
#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).

samtools view -h -o outputFile.bam inputFile.sam

A modified R code that works is

###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
             "gtf/drosophila_melanogaster/",
             "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts

###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              pubMedIds="20921232",
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              replicate=c(1,1),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
library(DESeq)
geneCDS <- newCountDataSet(
                  countData=assay(genehits),
                  conditions=design)

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some fake examples.

tidybulk

Bisque

Bisque: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.

chromoMap

Rsubread

See this post for about C version of the featureCounts program.

featureCounts vs HTSeq-count

featureCounts or htseq-count?

featurecounts的使用说明

Inference

tximport

  • http://bioconductor.org/packages/release/bioc/html/tximport.html
    $ head -5 quant.sf 
    Name	Length	EffectiveLength	TPM	NumReads
    ENST00000456328.2	1657	1410.79	0.083908	2.46885
    ENST00000450305.2	632	410.165	0	0
    ENST00000488147.1	1351	1035.94	10.4174	225.073
    ENST00000619216.1	68	24	0	0
    ENST00000473358.1	712	453.766	0	0
    
  • Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
    $ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results 
    gene_id	transcript_id.s. length	effective_length  expected_count  TPM	  FPKM
    5S_rRNA	uc021ofx.1	 105.04	2.23	          2039.99	  49629.87 35353.97
    
    $ R
    x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    dim(x)
    # [1] 28109     7
    names(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"            
    x[1:3, -2]
    #     gene_id length effective_length expected_count      TPM     FPKM
    # 1   5S_rRNA 105.04             2.23        2039.99 49629.87 35353.97
    # 2 5_8S_rRNA 161.00            21.19           0.00     0.00     0.00
    # 3    6M1-18 473.00           302.74           0.00     0.00     0.00
    
    y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
    dim(y)
    # [1] 78375     8
    names(y)
    # [1] "transcript_id"    "gene_id"          "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"             "IsoPct" 
    y[1:3, -1]
    #   gene_id length effective_length expected_count TPM FPKM IsoPct
    # 1 5S_rRNA    110             3.06              0   0    0      0
    # 2 5S_rRNA    133             9.08              0   0    0      0
    # 3 5S_rRNA     92             0.00              0   0    0      0
    

File:RSEM PDX.png

DESeq2 or edgeR

Shrinkage estimators

type='apeglm' shrinkage only for use with 'coef'

Time course experiment

  • Time course trend analysis from the edgeR's vignette. glmQLFTest()
    • Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an ANOVA F-test for a normal linear model.
    • Assuming gene expression changes smoothly over time, we can use a polynomial or a cubic spline curve with a certain number of degrees of freedom to model gene expression along time.
    • We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The topTags function lists the top set of genes with most significant time effects.
    • The total number of genes with significant (5% FDR) changes at different time points can be examined with decideTests.

DESeq2 experimental design and interpretation

DESeq2 experimental design and interpretation

vst over rlog transformation

https://twitter.com/mikelove/status/1420671546088173569. See RNA-seq workflow: gene-level exploratory analysis and differential expression. The transformed data can be used in computing sample distances, PCA, MDS, clustering, ....

Expected counts
     | round()
     v                              /-- vst transformation  ---\
Raw counts --> normalized counts  --                            -- Other analyses such as PCA, Hclust (sample distances).
                                    \-- rlog transformation ---/

Generalized Linear Models and Plots with edgeR

Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis

EBSeq

An R package for gene and isoform differential expression analysis of RNA-seq data

http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/

prebs

Probe region expression estimation for RNA-seq data for improved microarray comparability

DEXSeq

Inference of differential exon usage in RNA-Seq

rSeqNP

A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

http://www.biosoft.hacettepe.edu.tr/voomDDA/

Pathway analysis

About the KEGG pathways

  • MSigDB database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
  • KEGGREST package directly pull the data from kegg.jp. I can get 337 KEGG pathways for human ('hsa')
    BiocManager::install("KEGGREST")
    library(KEGGREST)
    res <- keggList("pathway", "hsa") 
    length(res) # 337
    

GSOAP

GSOAP: a tool for visualization of gene set over-representation analysis

clusterProfiler

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis

hypeR

GSEPD

GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display

SeqGSEA

http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html

BAGSE

BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020

GeneSetCluster

GeneSetCluster: a tool for summarizing and integrating gene-set analysis results

Pipeline

SPEAQeasy

SPEAQeasy – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses

Nextflow

GeneTEFlow

GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data

pipeComp

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

SARTools

http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/

SEQprocess

SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).

BitSeq

Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.

ReportingTools

The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.

Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.

It is suggested by e.g. EnrichmentBrowser.

sequences

More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.

QuasR

Bioinformatics paper

CRAN/Bioconductor packages

ssizeRNA

Sample Size Calculation for RNA-Seq Experimental Design

RnaSeqSampleSize

Shiny app

rbamtools

Provides an interface to functions of the 'SAMtools' C-Library by Heng Li

refGenome

The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.

WhopGenome

Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.

TCGA2STAT

Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.

source("https://bioconductor.org/biocLite.R")
biocLite("CNTools")

install.packages("TCGA2STAT")

The getTCGA() function allows to download various kind of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")

TCGAbiolinks

  • An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    query <- GDCquery(project = "TCGA-ACC",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               legacy = TRUE)
    
    gdcdir <- file.path("Waldron_PublicData", "GDCdata")
    GDCdownload(query, method = "api", files.per.chunk = 10,
                directory = gdcdir)  # 79 files
    ACCse <- GDCprepare(query, directory = gdcdir)
    ACCse
    class(ACCse)
    dim(assay(ACCse))  # 19947 x 79
    assay(ACCse)[1:3, 1:2] # symbol id
    length(unique(rownames(assay(ACCse))))   #  19672
    rowData(ACCse)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #          gene_id entrezgene ensembl_gene_id
    #      <character>  <integer>     <character>
    # A1BG        A1BG          1 ENSG00000121410
    # A2M          A2M          2 ENSG00000175899
    
  • HTSeq counts data example
    query2 <- GDCquery(project = "TCGA-ACC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type="HTSeq - Counts") 
    gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
    GDCdownload(query2, method = "api", files.per.chunk = 10,
                directory = gdcdir2)  # 79 files
    ACCse2 <- GDCprepare(query2, directory = gdcdir2)
    ACCse2
    dim(assay(ACCse2))  # 56457 x 79
    assay(ACCse2)[1:3, 1:2]  # ensembl id
    rowData(ACCse2)[1:2, ]
    DataFrame with 2 rows and 3 columns
                    ensembl_gene_id external_gene_name original_ensembl_gene_id
                        <character>        <character>              <character>
    ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
    ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
    
  • Clinical data
    acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
    dim(acc_clin)
    # [1] 92 71
    
  • TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
    dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
    dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut =  0.25)
    samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
    samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
    dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                          dataFilt[,samplesTP],"Normal", "Tumor")
    # 2nd example
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                               Cond1type = "LGG", Cond2type = "GBM",
                               fdr.cut = 0.01,  logFC.cut = 1,
                               method = "glmLRT")
    
  • Enrichment analysis
    ansEA <– TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", 
                                    RegulonList = rownames(dataDEGs))
    
    TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                            GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                            GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                            nRGTab = rownames(dataDEGs),
                            nBar = 20)
    
  • mRNA Analysis Pipeline from GDC documentation.