Genome
Visualization
See also Bioconductor > BiocViews > Visualization. Search 'genom' as the keyword.
IGV
nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB ~/binary/IGV_2.3.52/igv.sh
Gviz
ChromHeatMap
Heat map plotting by genome coordinate.
ggbio
NOISeq package
Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.
rtracklayer
R interface to genome browsers and their annotation tracks
- Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.
ssviz
A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).
Sushi
See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.
Copy Number
Copy number work flow using Bioconductor
Detect copy number variation (CNV) from the whole exome sequencing
- http://www.ncbi.nlm.nih.gov/pubmed/24172663
- http://www.biomedcentral.com/1471-2164/15/661
- http://www.blog.biodiscovery.com/2014/06/16/comparison-of-cnv-detection-from-whole-exome-sequencing-wes-as-compared-with-snp-microarray/
- exomeCopy package
- exomeCNV package and more info including slides by Fah Sathirapongsasuti.
Whole exome sequencing != whole genome sequencing
RNA seq
Introduction to Sequence Data Analysis
- http://www.cbcb.umd.edu/~hcorrada/CMSC858B/
- http://watson.nci.nih.gov/~sdavis/tutorials/biowulf-2011/
- http://cbsu.tc.cornell.edu/ngw2010/Day3_lecture1.pdf
- http://static.msi.umn.edu/tutorial/lifescience/RNA-Seq-Module-1.pdf
NIH only
- Some training material from NIH:
- Biowulf seminar by Steven Fellini & Susan Chacko.
- Bash Scripting and Linux commands
Quality control
FastQC
Trim Galore!
QoRTs
A comprehensive toolset for quality control and data processing of RNA-Seq experiments.
Alignment
Bowtie
Extremely fast, general purpose short read aligner.
bowtie needs to have an index of the genome in order to perform its alignment functionality. For example, to build a bowtie index against UCSC hg19
bowtie-build /data/ngs/public/sequences/hg19/genome.fa hg19
The reference genome index <genome.fa> can be generated by following Sean Davis' instruction. Note that genome sequence indexes (including Bowtie indexes) as well as GTF transcript annotation files for many commonly used reference genomes can be directly downloaded from http://tophat.cbcb.umd.edu/igenomes.shtml. Even the index file can be directly downloaded without going through Bowtie program, bowtie program is still needed by Tophat program where Tophat's job is to align the RNA-seq data to reference genome.
PS: these genome sequence indexes files are quite big; for example 21 GB for hg19.
BWA
Used by whole-exome sequencing. For example, http://bib.oxfordjournals.org/content/15/2/256.full and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3865592/. Whole Exome Analysis.
Tophat
Aligns RNA-Seq reads to the genome using Bowtie/Discovers splice sites. It does so by splitting longer reads into small sections and aligning those to the genome. It then looks for potential splice sites between pairs of sections to construct a final alignment.
Linux part.
$ type -a tophat # Find out which command the shell executes: tophat is /home/mli/binary/tophat $ ls -l ~/binary
Quick test of Tophat program
$ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz $ tar xzvf test_data.tar.gz $ cd ~/tophat_test_data/test_data $ PATH=$PATH:/home/mli/bowtie-0.12.8 $ export PATH $ ls reads_1.fq test_ref.1.ebwt test_ref.3.bt2 test_ref.rev.1.bt2 test_ref.rev.2.ebwt reads_2.fq test_ref.2.bt2 test_ref.4.bt2 test_ref.rev.1.ebwt test_ref.1.bt2 test_ref.2.ebwt test_ref.fa test_ref.rev.2.bt2 $ tophat -r 20 test_ref reads_1.fq reads_2.fq $ # This will generate a new folder <tophat_out> $ ls tophat_out accepted_hits.bam deletions.bed insertions.bed junctions.bed logs prep_reads.info unmapped.bam
TopHat accepts FASTQ and FASTA files of sequencing reads as input. Alignments are reported in BAM files. BAM is the compressed, binary version of SAM43, a flexible and general purpose read alignment format. SAM and BAM files are produced by most next-generation sequence alignment tools as output, and many downstream analysis tools accept SAM and BAM as input. There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps most popular among these are the SAM tools (http://samtools.sourceforge.net/) and the Picard tools (http://picard.sourceforge.net/).
Note that if the data is DNA-Seq, we can merely use Bowtie2 or BWA tools since we don't have to worry about splicing.
An example of using Tophat2 (paired end in this case, 5 threads) is
tophat2 --no-coverage-search -p 5 \ -o "Sample1" \ -G ~/iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \ --transcriptome-index=transcriptome_data/known \ ~/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \ myfastq_R1.fastq.gz myfastq_R2.fastq.gz
STAR (Spliced Transcripts Alignment to a Reference)
Note that the readme file says HARDWARE/SOFTWARE REQUIREMENTS:
- x86-64 compatible processors
- 64 bit Linux or Mac OS X
- 30GB of RAM for human genome
See the blog on gettinggeneticsdone.com for a comparison of speed and memory requirement.
In short, the notable increase in speed comes at the price of a larger memory requirement.
RNASequel
RNASequel: accurate and repeat tolerant realignment of RNA-seq reads
SAMtools
- SAMtools: Primer / Tutorial by Ethan Cerami. It covers installing samtools, bcftools, generating simulated reads, align reads, convert sam to bam, sorting & indexing, identifying genomic variants, understand vcf format, visualize reads on a command line.
Cufflinks package
Transcriptome assembly and differential expression analysis for RNA-Seq.
Both Cufflinks and Cuffdiff accept SAM and BAM files as input. It is not uncommon for a single lane of Illumina HiSeq sequencing to produce FASTQ and BAM files with a combined size of 20 GB or larger. Laboratories planning to perform more than a small number of RNA-seq experiments should consider investing in robust storage infrastructure, either by purchasing their own hardware or through cloud storage services.
Tuxedo protocol
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks by Trapnell, et al 2012.
- https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite
- bowtie2 - fast alignment
- tophat2 - splice alignment (rna-seq reads, rna are spliced, introns are removed, some reads may span over 2 exons)
- cufflinks - transcript assembly & quantitation
- cuffdiff2 - differential expression
Cufflinks - assemble reads into transcript
Installation
# about 47MB on ver 2.2.1 for Linux binary version wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz sudo tar xzvf ~/Downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz -C /opt/RNA-Seq/bin/ export PATH=$PATH:/opt/RNA-Seq/bin/cufflinks-2.2.1.Linux_x86_64/ # test cufflinks -h
Cufflinks uses this map (done from Tophat) against the genome to assemble the reads into transcripts.
# Quantifying Known Transcripts using Cufflinks cufflinks -o OutputDirectory/ -G refseq.gtf mappedReads.bam # De novo Transcript Discovery using Cufflinks cufflinks -o OutputDirectory/ mappedReads.bam
It can be used to calculate FPKM.
- https://www.biostars.org/p/11378/
- http://www.partek.com/Tutorials/microarray/User_Guides/UnderstandingReads.pdf
Cuffcompare - compares transcript assemblies to annotation
Cuffmerge - merges two or more transcript assemblies
Cuffdiff
Finds differentially expressed genes and transcripts/Detect differential splicing and promoter use.
Cuffdiff takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis.
Follow the tutorial, we can quickly test the cuffdiff program.
$ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam $ cufflinks ./test_data.sam $ ls -l total 56 -rw-rw-r-- 1 mli mli 221 2013-03-05 15:51 genes.fpkm_tracking -rw-rw-r-- 1 mli mli 231 2013-03-05 15:51 isoforms.fpkm_tracking -rw-rw-r-- 1 mli mli 0 2013-03-05 15:51 skipped.gtf -rw-rw-r-- 1 mli mli 41526 2009-09-26 19:15 test_data.sam -rw-rw-r-- 1 mli mli 887 2013-03-05 15:51 transcripts.gtf
CummeRbund
Plots abundance and differential expression results from Cuffdiff. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website
The tool appears on the paper Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks by Trapnell, et al 2012.
Finding differentially expressed genes
Big pictures
BWA/Bowtie samtools fa ---------> sam ------> sam/bam (sorted indexed, short reads), vcf or tophat Rsamtools GenomeFeatures edgeR (normalization) ---------> --------------> table of counts --------->
Readings
- Introduction to RNA Sequencing and Analysis Kukurba KR, Montgomery SB. (2015) RNA Sequencing and Analysis. Cold Spring Harb Protoc
- RNA-Seq: a revolutionary tool for transcriptomics
- Alignment and visualization from bioinformatics.ca.
- The RNA-seqlopedia from the Cresko Lab of the University of Oregon.
- RNA-Seq Analysis by Simon Andrews.
Youtube videos
- Analyze public dataset by using Galaxy and IGV David Coil from UC Davis Genoem Center
Download the raw fastq data GSE19602 from GEO and uncompress fastq.bz2 to fastq (~700MB) file. NOTE: the data downloaded from ncbi is actually sra file format. We can use fastq_dump program in SRA_toolkit to convert sra to fastq. http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
~/Downloads/sratoolkit.2.3.2-ubuntu64/bin/fastq-dump ~/Downloads/SRR034580.sra
If we want to run Galaxy locally, we can install it simply by 2 command lines
hg clone https://bitbucket.org/galaxy/galaxy-dist/ cd galaxy-dist hg update stable
To run Galaxy locally, we do
cd galaxy-dist sh run.sh --reload
The command line will show Starting server in PID XXXX. serving on http://127.0.0.1:8080. We can use Ctrl + C to stop the Galaxy process.
Note: One problem with this simple instruction is we have not created a user yet.
- Upload one fastq data. Click 'refresh' icon on the history panel to ensure the data is uploaded. Don't use 'refresh' button on the browser; it will cause an error for the current process.
- FASTQ Groomer. Convert the data to Galaxy needs. NGS: QC and manipulation => Illumina FASTQ. FASTQ quality scores type: Sanger. (~10 minutes). This part uses CPU not memory.
- Open a new browser tab and go to ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/. Right click the file all.cDNA and copy link location. In Galaxy click 'Upload File from your computer' paste URL to URL/Text entry.
- Scroll down Galaxy and select NGS:Mapping -> Map with BWA. PS. for some reason, BWA is not available. So I use Bowtie2 instead. The output of Bowtie2 is bam file.
- For reference genome, choose 'Use one from the history'. Galaxy automatically find the reference file 'ftp://ftp.plantbiology....' from history.
- Library mate-paired => Single-end.
- FASTQ file => 2: FASTQ Groomer on data 1.
- BWA settings to use => Commonly Used.
- Execute (~ 15 minutes)
- We can view the alignment file (sam format) created from BWA by using UCSV or IGV (input is bam or bai format). We now use NGS: SAM Tools to convert sam file to bam file. Click 'SAM-to-BAM converts SAM format to BAM format' tool.
- Choose the source for the reference list => History
- Converts SAM file => 4: Map with BWA on data 2 and data 3.
- Using reference file => 3:ftp://ftp.plantbiology.....
- Execute (~5 minutes)
- We want to create bai file which is a shortcut to IGV. It breaks the data into smaller accessible chunks. So when you pull up a certain cDNA, it goes straight to the subset. Go to the history, click the pencil icon (Edit Attributes) on the file SAM-to-BAM on data 3 and data 4.
- Look at 'Convert to new format' section. Go ahead and click 'Convert'. (< 1 minute). This will create another file.
- Use browser and go to ftp website to download all.cDNA file to desktop. The desktop should contain 3 files - all.cDNA, rice.bam and rice.bai files for IGV.
- Goto http://www.broadinstitute.org/software/igv/download to download IGV which is a java-based application. I need to install java machine first by install openjdk-7-jdk. IGV by default will launch 'Human hg18' genome. Launch IGV by cd IGV_2.2.13;java -Xmx750m -jar igv.jar. I found the IGV input requires sam+bai OR bam+bai. So we need to click the pencil icon to create bai file first before we want to upload sam or bam file to IGV.
- Goto File => Import Genome. Call it 'rice' and select 'all.cDNA' sequence file. Click 'Save' button.
- Goto File => Upload from File => rice.bam.
- Top right panel is cDNA
- Middle right panel has a lot of 'boxes' which is a read. If we zoom in, we can see some read points to left (backward) while some points to right (forward). On the top is a histogram. For example, a base may be covered by a lot of reads then the histogram will show the high frequence.
- If we keep zoom in, we can see color at the Bottom right panel. Keeping zoom in, we can see the base G, C, T, A themselves.
- Using IGV, we can 1. examine coverage.
- We can 2. check 'alternative splicing'. (not for this cDNA)
- We can 3. examine SNPs for base change. If we see gray color (dark gray is hight quality read, light gray means low quality read), it means they are perfect match. If we see color, it means there is a change. For example, a read is 'C' but in fact it should be 'A'. If a case has many high quality reads, and half of them are 'G' but the reference genome shows 'A'. This is most likely a SNP. This is heterogeisity.
- Tophat - align RNA seq data to genomic DNA
- Suppose we have use Galaxy to upload 2 data. One is SRR034580 and we have run FASTQ Groomer on data 1. The second data is SRR034584 and we also have run FASTQ Groomer on data 2. We also have uploaded reference genome sequence.
- Goto Galaxy and find NGS: RNA Analysis => Tophat.
- reference genome => Use one from the history
- RNA-Seq FASTQ file => 2; FASTQ Groomer on data 1.
- Execute. This will create 2 files. One is splice junctions and the other is accepted_hits. We queue the job and run another Tophat with the 2nd 'groomer'ed data file. We are going to work on accepted_hits file.
- While the queue are running, we can click on 'pencil' icon on 'accepted_hits' job and run the utlity 'Convert to new format' (Bam to Bai). We should do this for both 'accepted_hits' files.
- For some reason, the execution failed: An error occurred with this dataset: TopHat v2.0.7 TopHat v2.0.7 Error indexing reference sequence /bin/sh: 1: bowtie-build: not found.
- Cufflinks. We will estimate transcript abundance by using FPKM (RPKM).
- SAM or BAM file of alignmed RNA-Seq reads => tophat on data 2.. accepted_hits
- Use Reference Annotation - No (choose Yes if we want annotation. This requires GTF format. See http://genome.ucsc.edu/FAQ/FAQformat.html#format4. We don't have it for rice.)
- Execute. This will create 3 files. Gene expression, transcript expression and assembled transcripts.
- We also run Cufflinks for 2nd accepted_hits file. (~ 25 minutes)
- Cuffcompare. Compare one to each other.
- GTF file produced by Cufflinks => assembled transcript from the 1st data
- Use another GTF file produced by Cufflinks => Yes. It automatically find the other one.
- Execute. (< 10 minutes). This will create 7 files. Transcript accuracy, tmap file & refmap flie from each assembled transcripts, combined transcripts and transcript tracking.
- We are interested in combined transcripts file (to use in Cuffdiff).
- Cuffdiff.
- Transcripts => combined transcripts.
- SAM or BAM file of aligned RNA-Seq reads => 1st accepted_hits
- SAM or BAM file or aligned RNA-Seq reads => 2nd accepted_hits
- Execute. This will generate 11 files. Isoform expression, gene expression, TSS groups expression, CDS Expression FPKM Tracking, isoform FPKM tracking, gene FPKM tracking, TSS groups FPKM tracking, CDS FPKM tracking, splicing diff, promoters diff, CDS diff. We are interested in 'gene expression' file. We can save it and open it in Excel.
- IGV - 2 RNA-Seq datasets aligned to genomic DNA using Tophat
- Load the reference genome rice (see above)
- Upload from file => rice4.bam. Upload from file => rice5.bam.
- Alternative RNA splicing.
edX course
PH525.5x Case Study: RNA-seq data analysis. The course notes are forming a book. Check out https://github.com/genomicsclass/labs and http://genomicsclass.github.io/book/.
Variant calling
Overview/papers
- A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data by Heng Li.
- http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration
- http://www.slideshare.net/thomaskeane/overview-of-methods-for-variant-calling-from-nextgeneration-sequence-data-9608507
- http://www.slideshare.net/jandot/nextgeneration-sequencing-variation-discovery
Workflow
- Sequence reads -> Quality control -> Mapping -> Variant calling -> Filtering & Annotation -> Querying.
Software
Variant detector/discovery, genotyping
- A survey of tools for variant analysis of next-generation genome sequencing data
- Evaluation of variant detection software for pooled next-generation sequence data
- Reliable Identification of Genomic Variants from RNA-Seq Data
- Samtools
- https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools
- http://petridishtalk.com/2013/02/01/variant-discovery-annotation-filtering-with-samtools-the-gatk/
- http://lab.loman.net/2012/10/31/a-simple-guide-to-variant-calling-with-bwa-samtools-varscan2/
- Samtools and GATK
- Difference Between Samtools And Gatk Algorithms
- GATK by BROAD Institute.
- freebayes
- Bioconductor VariantTools package can export to VCF and VariantAnnotation package can read VCF format files. See also the Annotating Variants workflow in Bioconductor. See also SIFT.Hsapiens.dbSNP137, COSMIC.67, and PolyPhen.Hsapiens.dbSNP131 packages.
- R packages seqminer
vcf format
- One row is one snp.
- variant category (snp, insertion, deletion, mixed) http://www.1000genomes.org/node/101
- Count number of snps: vcftools. See https://www.biostars.org/p/49386/ & https://www.biostars.org/p/109086/ & https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools.
grep -c -v "^#" XXX.vcf
- Count number of indels
grep -c "INDEL" XXX.vcf
samtools mpileup
- http://samtools.sourceforge.net/mpileup.shtml
- http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html. Li's paper is available in NCBI PMC.
export bdge_bowtie_PATH=/opt/RNA-Seq/bin/bowtie2-2.2.1 export bdge_tophat_PATH=/opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64 export bdge_samtools_PATH=/opt/RNA-Seq/bin/samtools-0.1.19:/opt/RNA-Seq/bin/samtools-0.1.19/bcftools:/opt/RNA-Seq/bin/samtools-0.1.19/misc export PATH=$bdge_bowtie_PATH:$bdge_samtools_PATH:$bdge_tophat_PATH:$PATH samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \ TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
GATK
- Need to register and create an account w/ password in order to download the software (require Java).
- GATK guide and requirements.
- Best Practices workflow for SNP and indel calling on RNAseq data using GATK. It recommend using HaplotypeCaller.
- UnifiedGenotyper for variant call.
- https://wikis.utexas.edu/display/bioiteam/Variant+calling+with+GATK
- Identification of somatic and germline mutations using whole exome sequencing of congenital acute lymphoblastic leukemia: a paper that has used GATK for germline and somatic analyses
- http://gatkforums.broadinstitute.org/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail
# Download GenomeAnalysisTK-3.4-46.tar.bz2 from gatk website sudo mkdir /opt/RNA-Seq/bin/gatk sudo tar jxvf ~/Downloads/GenomeAnalysisTK-3.4-46.tar.bz2 -C /opt/RNA-Seq/bin/gatk ls /opt/RNA-Seq/bin/gatk # GenomeAnalysisTK.jar resources
bcftools
bcftools — utilities for variant calling and manipulating VCFs and their binary counterparts BCFs.
Installation
wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2 sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/ cd /opt/RNA-Seq/bin/bcftools-1.2/ sudo make # create bcftools, plot-vcfstats, vcfutils.pl commands
Example
samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \ TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
htslib
- bgzip – Block compression/decompression utility. The output file .gz is in a binary format.
- tabix – Generic indexer for TAB-delimited genome position files. The output file tbi is in a binary format.
Installation
wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2 sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/ cd /opt/RNA-Seq/bin/htslib-1.2.1/ sudo make # create tabix, htsfile, bgzip commands
Example
export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH export PATH=/opt/RNA-Seq/bin/htslib-1.2.1/:$PATH bgzip -c var.raw.vcf > var.raw.vcf.gz tabix var.raw.vcf.gz # need to index all vcf files (very fast in this step) bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf # 2 min
vcftools
wget http://sourceforge.net/projects/vcftools/files/vcftools_0.1.12b.tar.gz/download \ -o vcftools_0.1.12b.tar.gz tar -xzvf vcftools_0.1.12b.tar.gz sudo mv vcftools_0.1.12b /opt/RNA-Seq/bin/ export PERL5LIB=/opt/RNA-Seq/bin/vcftools_0.1.12b/perl/ /opt/RNA-Seq/bin/vcftools_0.1.12b/ make export PATH=$PATH:/opt/RNA-Seq/bin/vcftools_0.1.12b/bin ls bin # fill-aa vcf-annotate vcf-convert vcf-phased-join vcf-subset # fill-an-ac vcf-compare vcf-fix-ploidy vcf-query vcftools # fill-fs vcf-concat vcf-indel-stats vcf-shuffle-cols vcf-to-tab # fill-ref-md5 vcf-consensus vcf-isec vcf-sort vcf-tstv # man1 vcf-contrast vcf-merge vcf-stats vcf-validator
Some example
$ cd ~/SRP032789 $ vcftools --vcf GSM1261016_IP2-50_var.flt.vcf VCFtools - v0.1.12b (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --vcf GSM1261016_IP2-50_var.flt.vcf After filtering, kept 1 out of 1 Individuals After filtering, kept 193609 out of a possible 193609 Sites Run Time = 1.00 seconds $ wc -l GSM1261016_IP2-50_var.flt.vcf 193636 GSM1261016_IP2-50_var.flt.vcf $ vcf-indel-stats < GSM1261016_IP2-50_var.flt.vcf > out.txt Use of uninitialized value in pattern match (m//) at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49. Use of uninitialized value in concatenation (.) or string at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49. <: No such file or directory at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 18. main::error('<: No such file or directory') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 50 main::init_regions('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 71 main::do_stats('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 9
Online course on Variant calling
- edX: HarvardX: PH525.6x Case Study: Variant Discovery and Genotyping. Course notes is at their Github page.
Variant Annotation
See also the paper A survey of tools for variant analysis of next-generation genome sequencing data.
dbSNP
SNPlocs data R package for Human. Some clarification about SNPlocs.Hsapiens.dbSNP.20120608 package.
> library(BSgenome) > available.SNPs() [1] "SNPlocs.Hsapiens.dbSNP141.GRCh38" [2] "SNPlocs.Hsapiens.dbSNP142.GRCh37" [3] "SNPlocs.Hsapiens.dbSNP.20090506" [4] "SNPlocs.Hsapiens.dbSNP.20100427" [5] "SNPlocs.Hsapiens.dbSNP.20101109" [6] "SNPlocs.Hsapiens.dbSNP.20110815" [7] "SNPlocs.Hsapiens.dbSNP.20111119" [8] "SNPlocs.Hsapiens.dbSNP.20120608" [9] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
An example:
wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2 sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/ cd /opt/RNA-Seq/bin/bcftools-1.2/ sudo make wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2 sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/ cd /opt/RNA-Seq/bin/htslib-1.2.1/ sudo make # create tabix, htsfile, bgzip commands export bdge_bowtie_PATH=/opt/RNA-Seq/bin/bowtie2-2.2.1 export bdge_tophat_PATH=/opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64 export bdge_samtools_PATH=/opt/RNA-Seq/bin/samtools-0.1.19 export PATH=$bdge_bowtie_PATH:$bdge_samtools_PATH:$bdge_tophat_PATH:$PATH export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH export PATH=/opt/RNA-Seq/bin/htslib-1.2.1/:$PATH wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz.tbi wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz cd TNBC1 mv ~/Downloads/common_all_20150603.vcf.gz* . bgzip -c var.raw.vcf > var.raw.vcf.gz tabix var.raw.vcf.gz bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf
Any found in dbSNP?
grep -c 'rs[0-9]' raw_snps.vcf
SnpEff & SnpSift
- http://snpeff.sourceforge.net/SnpEff.html (Basic information)
- http://snpeff.sourceforge.net/protocol.html (Usage examples)
- https://wiki.hpcc.msu.edu/display/Bioinfo/SnpEff+-+The+Basics
- https://docs.uabgrid.uab.edu/wiki/Galaxy_DNA-Seq_Tutorial
- http://www.genome.gov/Pages/Research/DIR/DIRNewsFeatures/Next-Gen101/Teer_VariantAnnotation.pdf
- http://veda.cs.uiuc.edu/course2013/pres/fields/lab/Variant_Calling_v1.pptx
SnpEff: Genetic variant annotation and effect prediction toolbox.
- Input: vcf & reference genome database (eg GRCh38.79).
- Output: vcf & <snpEff_summary.html> & < snpEff_genes.txt> files.
wget http://iweb.dl.sourceforge.net/project/snpeff/snpEff_latest_core.zip sudo unzip snpEff_latest_core.zip -d /opt/RNA-Seq/bin export PATH=/opt/RNA-Seq/bin/snpEff/:$PATH # Next we want to download snpEff database. # Instead of using the command line (very slow < 1MB/s), # java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar databases | grep GRCh # java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar download GRCh38.79 # we just go to the file using the the web browser # http://sourceforge.net/projects/snpeff/files/databases/v4_1/ # 525MB mv ~/Downloads/snpEff_v4_1_GRCh38.79.zip . unzip snpEff_v4_1GRCh38.79.zip sudo java -Xmx4G -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar \ -i vcf -o vcf GRCh38.79 var_annot.vcf > var_annot_snpEff.vcf
SnpSift: SnpSift helps filtering and manipulating genomic annotated files (VCF). Once you annotated your files using SnpEff, you can use SnpSift to help you filter large genomic datasets in order to find the most significant variants
COSMIC
- Youtube videos
- Histogram of BRAF (melanoma).
De novo genome assembly
Single Cell RNA-Seq
The best RNA-Seq analysis interface
Systematically evaluating interfaces for RNA-seq analysis from a life scientist perspective
Monitor Software Version Change
Circos Plot
Circos is a popular tool for summarizing genomic events in a tumor genome.
R and Bioconductor packages
- Some basics of biomaRt (and GenomicRanges)
- Annotating Ranges Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
library(VariantAnnotation) library(AnnotationHub) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(TxDb.Mmusculus.UCSC.mm10.ensGene) library(org.Hs.eg.db) library(org.Mm.eg.db) library(BSgenome.Hsapiens.UCSC.hg19)
- Analysis of RNA-Seq Data with R/Bioconductor by Girke in UC Riverside
- R / Bioconductor for High-Throughput Sequence Analysis 2012 by Martin Morgan1 and Nicolas Delhomme
- Count-based differential expression analysis of RNA sequencing data using R and Bioconductor by Simon Anders
- Sequences, Genomes, and Genes in R / Bioconductor by Martin Morgan 2013.
- Orchestrating high-throughput genomic analysis with Bioconductor by Wolfgang Huber et al 2015.
- RNA-seq Analysis Example This is a script that will do differential gene expression (DGE) analysis for RNA-seq experiments using the bioconductor package edgeR. RPKMs were calculated for bar plots.
Some workflows
RNA-Seq workflow
Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.
STAR Samtools Rsamtools fastq -----> sam ----------> bam ----------> bamfiles -| \ GenomicAlignments DESeq2 --------------------> se --------> dds GenomicFeatures GenomicFeatures / (SummarizedExperiment) (DESeqDataSet) gtf ----------------> txdb ---------------> genes -----|
Sequence analysis
library(ShortRead) or library(Biostrings) (QA) gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information) GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count) edgeR or DESeq2 (gene expression analysis) library(org.Sc.sgd.db) or library(biomaRt)
Accessing Annotation Data
Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
library(org.Hs.eg.db) # Sample OrgDb Workflow library("hgu95av2.db") # Sample ChipDb Workflow library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow library(Homo.sapiens) # Sample OrganismDb Workflow library(AnnotationHub) # Sample AnnotationHub Workflow library("biomaRt") # Using biomaRt library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type | example package name | contents |
---|---|---|
OrgDb | org.Hs.eg.db | gene based information for Homo sapiens |
TxDb | TxDb.Hsapiens.UCSC.hg19.knownGene | transcriptome ranges for Homo sapiens |
OrganismDb | Homo.sapiens | composite information for Homo sapiens |
BSgenome | BSgenome.Hsapiens.UCSC.hg19 | genome sequence for Homo sapiens |
refGenome |
RNA-Seq Data Analysis using R/Bioconductor
- https://github.com/datacarpentry/rnaseq-data-analysis by Stephen Turner.
- Tutorial: Introduction to Bioconductor for high-throughput sequence analysis by UseR 2015
- systemPipeR Building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. An important feature is support for running command-line software, such as NGS aligners, on both single machines or compute clusters.
- BioC2015
limma
- Differential expression analyses for RNA-sequencing and microarray studies
- Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
- Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.
easyRNASeq
Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
ShortRead
Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.
Rsamtools
The Rsamtools package provides an interface to BAM files.
The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts
This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
IRanges
IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.
The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.
If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.
flank
The example is obtained from ?IRanges::flank.
ir3 <- IRanges(c(2,5,1), c(3,7,3)) # IRanges of length 3 # start end width # [1] 2 3 2 # [2] 5 7 3 # [3] 1 3 3 flank(ir3, 2) # start end width # [1] 0 1 2 # [2] 3 4 2 # [3] -1 0 2 # Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE) # For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1] # == == flank(ir3, 2, start=FALSE) # start end width # [1] 4 5 2 # [2] 8 9 2 # [3] 4 5 2 # For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5] # == == flank(ir3, 2, start=c(FALSE, TRUE, FALSE)) # start end width # [1] 4 5 2 # [2] 3 4 2 # [3] 4 5 2 # Combine the ideas of the previous 2 cases. flank(ir3, c(2, -2, 2)) # start end width # [1] 0 1 2 # [2] 5 6 2 # [3] -1 0 2 # The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F) # For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6] # == == flank(ir3, -2, start=F) # start end width # [1] 2 3 2 # [2] 6 7 2 # [3] 2 3 2 # For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7] # == == flank(ir3, 2, both = TRUE) # start end width # [1] 0 3 4 # [2] 3 6 4 # [3] -1 2 4 # The original statement is equivalent to flank(ir3, 2, start=T, both=T) # (From the manual) If both = TRUE, extends the flanking region width positions into the range. # The resulting range thus straddles the end point, with width positions on either side. # For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3] # == # == == == == flank(ir3, 2, start=FALSE, both=TRUE) # start end width # [1] 2 5 4 # [2] 6 9 4 # [3] 2 5 4 # For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5] # == # == == == ==
Both IRanges and GenomicRanges packages provide the flank function.
Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.
- General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
- Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.
GenomicRanges
GenomicRanges depends on IRanges package. See the dependency diagram below.
GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics | + +-----+ +- GenomeInfoDb | | GenomicAlignments +--- Rsamtools --+-----+ +--- Biostrings
The package defines some classes
- GRanges
- GRangesList
- GAlignments
- SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.
(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.
GenomicAlignments
Counting reads with summarizeOverlaps vignette
library(GenomicAlignments) library(DESeq) library(edgeR) fls <- list.files(system.file("extdata", package="GenomicAlignments"), recursive=TRUE, pattern="*bam$", full=TRUE) features <- GRanges( seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)), ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 300, 500, 500)), "-", group_id=c(rep("A", 4), rep("B", 5), rep("C", 2))) features # GRanges object with 11 ranges and 1 metadata column: # seqnames ranges strand | group_id # <Rle> <IRanges> <Rle> | <character> # [1] chr2L [1000, 1499] - | A # [2] chr2L [3000, 3499] - | A # [3] chr2L [4000, 4499] - | A # [4] chr2L [7000, 7599] - | A # [5] chr2R [2000, 2899] - | B # ... ... ... ... ... ... # [7] chr2R [3600, 3899] - | B # [8] chr2R [4000, 4899] - | B # [9] chr2R [7500, 7799] - | B # [10] chr3L [5000, 5499] - | C # [11] chr3L [5400, 5899] - | C # ------- # seqinfo: 3 sequences from an unspecified genome; no seqlengths olap # class: SummarizedExperiment # dim: 11 2 # exptData(0): # assays(1): counts # rownames: NULL # rowData metadata column names(1): group_id # colnames(2): sm_treated1.bam sm_untreated1.bam # colData names(0): assays(olap)$counts # sm_treated1.bam sm_untreated1.bam # [1,] 0 0 # [2,] 0 0 # [3,] 0 0 # [4,] 0 0 # [5,] 5 1 # [6,] 5 0 # [7,] 2 0 # [8,] 376 104 # [9,] 0 0 # [10,] 0 0 # [11,] 0 0
Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).
samtools view -h -o outputFile.bam inputFile.sam
A modified R code that works is
################################################### ### code chunk number 11: gff (eval = FALSE) ################################################### library(rtracklayer) fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/", "gtf/drosophila_melanogaster/", "Drosophila_melanogaster.BDGP5.25.62.gtf.gz") gffFile <- file.path(tempdir(), basename(fl)) download.file(fl, gffFile) gff0 <- import(gffFile, asRangedData=FALSE) ################################################### ### code chunk number 12: gff_parse (eval = FALSE) ################################################### idx <- mcols(gff0)$source == "protein_coding" & mcols(gff0)$type == "exon" & seqnames(gff0) == "4" gff <- gff0[idx] ## adjust seqnames to match Bam files seqlevels(gff) <- paste("chr", seqlevels(gff), sep="") chr4genes <- split(gff, mcols(gff)$gene_id) ################################################### ### code chunk number 12: gff_parse (eval = FALSE) ################################################### library(GenomicAlignments) # fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam") fls <- list.files(system.file("extdata", package="pasillaBamSubset"), recursive=TRUE, pattern="*bam$", full=TRUE) path <- system.file("extdata", package="pasillaBamSubset") bamlst <- BamFileList(fls) genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object assays(genehits)$counts ################################################### ### code chunk number 15: pasilla_exoncountset (eval = FALSE) ################################################### library(DESeq) expdata = MIAME( name="pasilla knockdown", lab="Genetics and Developmental Biology, University of Connecticut Health Center", contact="Dr. Brenton Graveley", title="modENCODE Drosophila pasilla RNA Binding Protein RNAi knockdown RNA-Seq Studies", pubMedIds="20921232", url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508", abstract="RNA-seq of 3 biological replicates of from the Drosophila melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs encoding pasilla, a mRNA binding protein and 4 biological replicates of the the untreated cell line.") design <- data.frame( condition=c("untreated", "untreated"), replicate=c(1,1), type=rep("single-read", 2), stringsAsFactors=TRUE) library(DESeq) geneCDS <- newCountDataSet( countData=assay(genehits), conditions=design) experimentData(geneCDS) <- expdata sampleNames(geneCDS) = colnames(genehits) ################################################### ### code chunk number 16: pasilla_genes (eval = FALSE) ################################################### chr4tx <- split(gff, mcols(gff)$transcript_id) txhits <- summarizeOverlaps(chr4tx, bamlst) txCDS <- newCountDataSet(assay(txhits), design) experimentData(txCDS) <- expdata
We can also check out ?summarizeOverlaps to find some fake examples.
Rsubread
See this post for about C version of the featureCounts program.
Inference
DESeq or edgeR
- DESeq2 method
- DESeq2 with a large number of samples -> use DESEq2 to normalize the data and then use do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test. See this post. Or consider the limma-voom method instead, which will handle 1000 samples in a few seconds without the need for extra memory.
- edgeR normalization factor post. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the paper by Robinson & Oshlack 2010 for more details. Briefly, M-values are defined as the library size-adjusted log-ratio of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE.
- Can I feed TCGA normalized count data to EdgeR?
- counts() function and normalized counts.
prebs
Probe region expression estimation for RNA-seq data for improved microarray comparability
DEXSeq
Inference of differential exon usage in RNA-Seq
rSeqNP
A non-parametric approach for detecting differential expression and splicing from RNA-Seq data
pasilla and pasillaBamSubset Data
pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.
pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).
BitSeq
Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.
ReportingTools
The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.
sequences
More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.
QuasR
Bioinformatics paper
CRAN packages
rbamtools
Provides an interface to functions of the 'SAMtools' C-Library by Heng Li
refGenome
The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.
WhopGenome
Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.
Simulate RNA-Seq
http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#RNA-Seq_simulators
SimSeq
A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.
empiricalFDR.DESeq2
http://biorxiv.org/content/early/2014/12/05/012211
The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.
Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.
polyester
http://biorxiv.org/content/early/2014/12/05/012211
Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.
Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.
Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.
RSEM
GEO
See the internal link at R-GEO.
Gene set analysis
Hypergeometric test
Misc
Merge different datasets (different genechips)
Normalization
How to use UCSC Table Browser
- An instruction from BitSeq software
UCSC version & NCBI release corresponding
Gene Annotation
- GeneCards
- Genetics Home Reference from National Library of Medicine
- My Cancer Genome
- Cosmic
How many DNA strands are there in humans?
- http://www.numberof.net/number-of-dna-strands/
- http://www.answers.com/Q/How_many_DNA_strands_are_there_in_humans
How many base pairs in human
- chromosome 22 has the smallest number of bps (~50 million).
- chromosome 1 has the largest number of bps (245 base pairs).
NGS technology
DNA methylation
- GC content = (G+C)/(A+T+G+C) x 100%
- How many CpGs (C follows by G)?
- Analyzing DNA methylation data (part of the book Biomedical Data Science) and the PH525x: Data Analysis for Genomics (edX course). The Github website is on https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The paper Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies the tutorial has mentioned.
devtools::install_github("coloncancermeth","genomicsclass") library(coloncancermeth) # 485512 x 26 data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects dim(meth) dim(pd) length(gr) colnames(pd) table(pd$Status) # 9 normals, 17 cancers normalIndex <- which(pd$Status=="normal") cancerlIndex <- which(pd$Status=="cancer") i=normalIndex[1] plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n") for(i in normalIndex){ lines(density(meth[,i],from=0,to=1),col=1) } ### Add the cancer samples for(i in cancerlIndex){ lines(density(meth[,i],from=0,to=1),col=2) } # finding regions of the genome that are different between cancer and normal samples library(limma) X<-model.matrix(~pd$Status) fit<-lmFit(meth,X) eb <- ebayes(fit) # plot of the region surrounding the top hit library(GenomicRanges) i <- which.min(eb$p.value[,2]) middle <- gr[i,] Index<-gr%over%(middle+10000) cols=ifelse(pd$Status=="normal",1,2) chr=as.factor(seqnames(gr)) pos=start(gr) plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference") matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location") # http://www.ncbi.nlm.nih.gov/pubmed/22422453 # within each chromosome we usually have big gaps creating subgroups of regions to be analyzed chr1Index <- which(chr=="chr1") hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method") library(bumphunter) cl=clusterMaker(chr,pos,maxGap=500) table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them #consider two example regions# ...
Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing
- http://www.rna-seqblog.com/exome-sequencing-vs-rna-seq-to-identify-coding-region-variants/
- http://www.rna-seqblog.com/combined-use-of-exome-and-transcriptome-sequencing/
- A comparison of whole genome and whole transcriptome sequencing
Sequence + Expression
RNASeq + ChipSeq
Labs
Terms
RNA sequencing 101
Web
- Introduction to RNA-Seq including biology overview (DNA, Alternative splcing, mRNA structure, human genome) and sequencing technology.
- RNA Sequencing 101 by Agilent Technologies. Includes the definition of sequencing depth (number of reads per sample) and coverage (number of reads/locus).
- Where do we get reads(A,C,T,G) from sample RNA? See page 12 of this pdf from Colin Dewey in U. Wisc.
- Quantification of RNA-Seq data (see the above pdf)
- Convert read counts into expression: RPKM (see the above pdf)
- RPKM and FPKM (Data analysis of RNA-seq from new generation sequencing by 張庭毓) RNA-Seq資料分析研討會與實作課程 / RNA Seq定序 / 次世代定序(NGS) / 高通量基因定序 分析.
- First vs Second generation sequence.
- The Simple Fool’s Guide to Population Genomics via RNASeq: An Introduction to HighThroughput Sequencing Data Analysis. This covers QC, De novo assembly, BLAST, mapping reads to reference sequences, gene expression analysis and variant (SNP) detection.
- An Introduction to Bioinformatics Resources and their Practical Applications from NIH library Bioinformatics Support Program.
- Teaching material from rnaseqforthenextgeneration.org which includes Designing RNA-Seq experiments, Processing RNA-Seq data, and Downstream analyses with RNA-Seq data.
Books
- RNA-seq Data Analysis: A Practical Approach. The pdf version is available on slideshare.net.
- Statistical Analysis of Next Generation Sequencing Data
strand-specific vs non-strand specific experiment
- http://seqanswers.com/forums/showthread.php?t=28025. According to this message and the article (under the paragraph of Read counting) from PLOS, most of the RNA-seq protocols that are used nowadays are not strand-specific.
- http://biology.stackexchange.com/questions/1958/difference-between-strand-specific-and-not-strand-specific-rna-seq-data
- https://www.biostars.org/p/61625/ how to find if this public rnaseq data are prepared by strand-specific assay?
- https://www.biostars.org/p/62747/ Discussion of using IGV to view strand-specific coverage. See also the similar posts on the right hand side.
- https://www.biostars.org/p/44319/ How To Find Stranded Rna-Seq Experiments Data. The text dUTP 2nd strand marking includes a link to stranded rna-seq data.
- forward (+)/ reverse(-) strand in GAlignments objects (p68 of the pdf manual and page 7 of sam format specification.
Understand this info is necessary when we want to use summarizeOverlaps() function (GenomicAlignments) or htseq-count python program to get count data.
This post mentioned to use infer_experiment.py script to check whether the rna-seq run is stranded or not.
The rna-seq experiment used in this tutorial is not stranded-specific.
FASTQ
- FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
FASTA <=> FASTQ conversion
Convert FASTA to FASTQ without quality scores
Biostars. For example, the bioawk by lh3 (Heng Li) worked.
Convert FASTA to FASTQ with quality score file
See the links on the above post.
Convert FASTQ to FASTA using Seqtk
Use the Seqtk program; see this post.
The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.
RPKM (Mortazavi et al. 2008)
Reads per Kilobase of Exon per Million of Mapped reads.
- rpkm function in edgeR package.
- RPKM function in easyRNASeq package.
Idea
- The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
- Longer transcript are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!
Calculation
- Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
- Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
- Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
Formula
(tag count * 1,000,000) / (total number of tags * kilobase of transcript) = tag count / (total number of tags * transcript length) * 1.0e6
source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR") library(edgeR) set.seed(1234) y <- matrix(rnbinom(20,size=1,mu=10),5,4) [,1] [,2] [,3] [,4] [1,] 0 0 5 0 [2,] 6 2 7 3 [3,] 5 13 7 2 [4,] 3 3 9 11 [5,] 1 2 1 15 d <- DGEList(counts=y, lib.size=1001:1004) # Note that lib.size is optional # By default, lib.size = colSums(counts) cpm(d) # counts per million Sample1 Sample2 Sample3 Sample4 1 0.000 0.000 4985.045 0.000 2 5994.006 1996.008 6979.063 2988.048 3 4995.005 12974.052 6979.063 1992.032 4 2997.003 2994.012 8973.081 10956.175 5 999.001 1996.008 997.009 14940.239 > cpm(d,log=TRUE) Sample1 Sample2 Sample3 Sample4 1 7.961463 7.961463 12.35309 7.961463 2 12.607393 11.132027 12.81875 11.659911 3 12.355838 13.690089 12.81875 11.129470 4 11.663897 11.662567 13.17022 13.451207 5 10.285119 11.132027 10.28282 13.890078 d$genes$Length <- c(1000,2000,500,1500,3000) rpkm(d) Sample1 Sample2 Sample3 Sample4 1 0.0000 0.000 4985.0449 0.000 2 2997.0030 998.004 3489.5314 1494.024 3 9990.0100 25948.104 13958.1256 3984.064 4 1998.0020 1996.008 5982.0538 7304.117 5 333.0003 665.336 332.3363 4980.080 > cpm function (x, ...) UseMethod("cpm") <environment: namespace:edgeR> > showMethods("cpm") Function "cpm": <not an S4 generic function> > cpm.default function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...) { x <- as.matrix(x) if (is.null(lib.size)) lib.size <- colSums(x) if (log) { prior.count.scaled <- lib.size/mean(lib.size) * prior.count lib.size <- lib.size + 2 * prior.count.scaled } lib.size <- 1e-06 * lib.size if (log) log2(t((t(x) + prior.count.scaled)/lib.size)) else t(t(x)/lib.size) } <environment: namespace:edgeR> > rpkm.default function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25, ...) { y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count) gene.length.kb <- gene.length/1000 if (log) y - log2(gene.length.kb) else y/gene.length.kb } <environment: namespace:edgeR>
Here for example the 1st sample and the 2nd gene, its rpkm value is calculated as
# step 1: 6/(1.0e-6 *1001) = 5994.006 # cpm, compute column-wise # step 2: 5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise # Another way # step 1 (RPK) 6/ (2000/1.0e3) = 3 # step 2 (RPKM) 3/ (1.0e-6 * 1001) = 2997.003
Critics
Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
FPKM (Trapnell et al. 2010)
Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks). FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
RPKM, FPKM and TPM
- The youtube video is on here
- R code for computing effective counts, TPM, and FPKM
This is for the read normalization
P -- per K -- kilobase (related to gene length) M -- million (related to sequencing depth)
TMM (Robinson and Oshlack, 2010)
Trimmed Means of M values (EdgeR).
Coverage
- bedtools. The bedtools is now hosted on github
- https://github.com/alyssafrazee/polyester
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
- Page 18 of this RNA-Seq 101 from Agilent or Estimating Sequencing Coverage from Illumina.
C = L N / G
where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
- coverage() function in IRanges package.
- Coverage and read depth
- What Is The Sequencing 'Depth' ? Coverage = (total number of bases generated) / (size of genome sequenced). So a 30x coverage means, on an average each base has been read by 30 sequences.
- Sequencing depth and coverage: key considerations in genomic analyses
SAM/Sequence Alignment Format and BAM format specification
- https://samtools.github.io/hts-specs/SAMv1.pdf and samtools webpage.
- http://genome.sph.umich.edu/wiki/SAM
Germline vs Somatic mutation
Germline: inherit from parents. See the Wikipedia page.
Driver vs passenger mutation
https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer
Other software
Partek
- Partek Flow Software http://youtu.be/-6aeQPOYuHY
dCHIP
MeV
MeV v4.8 (11/18/2011) allows annotation from Bioconductor
IPA from Ingenuity
Login: There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.
Features:
- easily search the scientific literature/integrate diverse biological information.
- build dynamic pathway models
- quickly analyze experimental data/Functional discovery: assign function to genes
- share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.
Start Here
Expression data -> New core analysis -> Functions/Diseases -> Network analysis Canonical pathways | | | Simple or advanced search --------------------+ | | | v | My pathways, Lists <------+ ^ | Creating a custom pathway --------------------+
Resource:
- http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf
- http://libguides.mit.edu/content.php?pid=14149&sid=843471
- http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/
- IPA 5.5 manual http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/ipa_help_manual_5.5_v1.pdf
- Help and supports
- Tutorials which includes
- Search for genes
- Analysis results
- Upload and analyze example data
- Upload and analyze your own expression data
- Visualize connections among genes
- Learn more special features
- Human isoform view
- Transcription factor analysis
- Downstream effects analysis
Notes:
- The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
- The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
- The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
- In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
- The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.
Screenshots:
DAVID Bioinformatics Resource
qpcR
Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).