Genome: Difference between revisions
Revision as of 21:35, 6 April 2013
Visualization
IGV
RNA seq
fa --(BWA/Bowtie or tophat)--> sam --(samtools)--> sam/bam (sorted, indexed, short reads), vcf
sam/bam --(Rsamtools, GenomicFeatures)--> table of counts --(edgeR, normalization)-->
Youtube videos
Download the raw fastq data GSE19602 from GEO and uncompress the fastq.bz2 file to a fastq file (~700MB).
- Upload one fastq file.
- FASTQ Groomer: convert the data to the format Galaxy needs. FASTQ quality scores type: Sanger. (~10 minutes)
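Before choosing "Sanger" as the quality score type, it can help to check which quality offset the file appears to use. The following is a minimal sketch, not part of the original workflow. The function name `guess_fastq_offset` is hypothetical, the input is assumed to use plain four-line FASTQ records, and the thresholds are only a heuristic: Sanger (Phred+33) quality characters can fall below ASCII 59, while Phred+64 characters never fall below ASCII 64.

```shell
# Guess the quality offset of a four-line-per-record FASTQ file.
# Prints "phred33", "phred64-likely" or "ambiguous" (heuristic only).
guess_fastq_offset() {
    awk '
        BEGIN {
            # Build a character -> ASCII code lookup table.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        NR % 4 == 0 {                     # every 4th line is a quality string
            n = length($0)
            for (j = 1; j <= n; j++) {
                c = ord[substr($0, j, 1)]
                if (c != "" && c < min) min = c
            }
        }
        END {
            if (min == 127)     print "ambiguous"       # no quality data seen
            else if (min < 59)  print "phred33"         # only possible with offset 33
            else if (min >= 64) print "phred64-likely"  # consistent with offset 64
            else                print "ambiguous"
        }
    ' "$1"
}

# Example (hypothetical file name):
# guess_fastq_offset sample.fastq
```

A result of "phred33" is consistent with the Sanger setting; any other result only means the file should be inspected more closely.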
Bowtie
Extremely fast, general purpose short read aligner
Tophat
Aligns RNA-Seq reads to the genome using Bowtie and discovers splice sites.
Linux part.
$ type -a tophat # Find out which command the shell executes:
tophat is /home/mli/binary/tophat
$ ls -l ~/binary
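The same check can be scripted for several tools at once. This is a small sketch using the POSIX `command -v` builtin; the function name `require_tools` is hypothetical and not part of TopHat or Bowtie.

```shell
# Report which of the given commands can be found on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: TopHat relies on a Bowtie binary being on PATH as well.
# require_tools tophat bowtie
```

If a tool is reported missing, its directory can be appended to PATH before running it.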
Quick test of Tophat program
$ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz
$ tar xzvf test_data.tar.gz
$ cd ~/tophat_test_data/test_data
$ PATH=$PATH:/home/mli/bowtie-0.12.8
$ export PATH
$ ls
reads_1.fq      test_ref.1.ebwt  test_ref.3.bt2  test_ref.rev.1.bt2   test_ref.rev.2.ebwt
reads_2.fq      test_ref.2.bt2   test_ref.4.bt2  test_ref.rev.1.ebwt
test_ref.1.bt2  test_ref.2.ebwt  test_ref.fa     test_ref.rev.2.bt2
$ tophat -r 20 test_ref reads_1.fq reads_2.fq
$ # This will generate a new folder <tophat_out>
$ ls tophat_out
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed  logs  prep_reads.info  unmapped.bam
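After a test run, a quick way to confirm that TopHat finished is to check that the main result files listed above exist. The sketch below does only that. The function name `check_tophat_out` is hypothetical, and the file list is taken from the directory listing above rather than from the TopHat documentation.

```shell
# Check that a TopHat output directory contains the main result files.
# Usage: check_tophat_out [dir]   (dir defaults to tophat_out)
check_tophat_out() {
    dir=${1:-tophat_out}
    status=0
    for f in accepted_hits.bam junctions.bed insertions.bed deletions.bed; do
        if [ -e "$dir/$f" ]; then
            echo "ok:      $dir/$f"
        else
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, run from the test_data directory:
# check_tophat_out tophat_out
```

The check only tests for existence; it says nothing about whether the alignments themselves are sensible.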
TopHat accepts FASTQ and FASTA files of sequencing reads as input. Alignments are reported in BAM files. BAM is the compressed, binary version of SAM, a flexible and general purpose read alignment format. SAM and BAM files are produced by most next-generation sequence alignment tools as output, and many downstream analysis tools accept SAM and BAM as input. There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps the most popular among these are SAMtools (http://samtools.sourceforge.net/) and the Picard tools (http://picard.sourceforge.net/).
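Because SAM is tab-delimited text, simple summaries can be computed with standard tools. The sketch below counts alignment records and uses bit 0x4 of the FLAG column (column 2), which marks a read as unmapped. It assumes the BAM file has already been converted to SAM, for example with `samtools view -h accepted_hits.bam > accepted_hits.sam`. The function name `count_sam_records` is hypothetical.

```shell
# Count alignment records in a SAM file and split them into mapped/unmapped.
# Header lines start with "@"; FLAG bit 0x4 (value 4) means "unmapped".
count_sam_records() {
    awk -F '\t' '
        /^@/ { next }                               # skip header lines
        {
            total++
            if (int($2 / 4) % 2 == 1) unmapped++    # test bit 0x4 without bitwise ops
        }
        END {
            printf "total=%d mapped=%d unmapped=%d\n", total, total - unmapped, unmapped
        }
    ' "$1"
}

# Example (hypothetical file name):
# count_sam_records accepted_hits.sam
```

Note that this counts records, not reads: a read aligned to several locations contributes several records.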
Cufflinks package
Both Cufflinks and Cuffdiff accept SAM and BAM files as input. It is not uncommon for a single lane of Illumina HiSeq sequencing to produce FASTQ and BAM files with a combined size of 20 GB or larger. Laboratories planning to perform more than a small number of RNA-seq experiments should consider investing in robust storage infrastructure, either by purchasing their own hardware or through cloud storage services.
Cufflinks
Cufflinks uses the map of reads against the genome (produced by TopHat) to assemble the reads into transcripts.
Cuffcompare
Compares transcript assemblies to annotation
Cuffmerge
Merges two or more transcript assemblies
Cuffdiff
Finds differentially expressed genes and transcripts; detects differential splicing and promoter use.
Cuffdiff takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis.
Following the tutorial, we can quickly test the installation. Note that the test below runs the cufflinks program on the test data.
$ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam
$ cufflinks ./test_data.sam
$ ls -l
total 56
-rw-rw-r-- 1 mli mli   221 2013-03-05 15:51 genes.fpkm_tracking
-rw-rw-r-- 1 mli mli   231 2013-03-05 15:51 isoforms.fpkm_tracking
-rw-rw-r-- 1 mli mli     0 2013-03-05 15:51 skipped.gtf
-rw-rw-r-- 1 mli mli 41526 2009-09-26 19:15 test_data.sam
-rw-rw-r-- 1 mli mli   887 2013-03-05 15:51 transcripts.gtf
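To get a first impression of the assembly, the feature types in transcripts.gtf can be tallied. In GTF, column 3 holds the feature type (for example "transcript" or "exon"). The sketch below is a generic GTF summary, not a Cufflinks utility, and the function name `count_gtf_features` is hypothetical.

```shell
# Count how often each feature type (column 3) occurs in a GTF file.
count_gtf_features() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }      # skip comments and malformed lines
        { n[$3]++ }
        END { for (f in n) printf "%s\t%d\n", f, n[f] }
    ' "$1" | sort
}

# Example:
# count_gtf_features transcripts.gtf
```

The output is one line per feature type, with a tab between the type and its count.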
CummeRbund
Plots abundance and differential expression results from Cuffdiff.
Other software
dCHIP
IPA from Ingenuity
Login: There is a web-start version (https://analysis.ingenuity.com/pa) and a Java applet version (https://analysis.ingenuity.com/pa/login/choice.jsp). Alternatively, double-click the downloaded file <IpaApplication.jnlp> in the local download folder.
Features:
- easily search the scientific literature/integrate diverse biological information.
- build dynamic pathway models
- quickly analyze experimental data/Functional discovery: assign function to genes
- share research and collaborate.
On the other hand, IPA is web based, so analyses take time to run. Once a submitted analysis is done, an email is sent to the user.
Start Here
- Expression data -> New core analysis -> Functions/Diseases, Canonical pathways -> Network analysis -> My pathways, Lists
- Simple or advanced search -> My pathways, Lists
- Creating a custom pathway -> My pathways, Lists
Resource:
- http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf
- http://libguides.mit.edu/content.php?pid=14149&sid=843471
- http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/
- IPA 5.5 manual http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/ipa_help_manual_5.5_v1.pdf
- Help and support
- Tutorials, which include:
- Search for genes
- Analysis results
- Upload and analyze example data
- Upload and analyze your own expression data
- Visualize connections among genes
- Learn more special features
- Human isoform view
- Transcription factor analysis
- Downstream effects analysis
Notes:
- The input data file can be an Excel file with at least one gene ID column and the expression values in the last columns (the same layout the BRB-ArrayTools general format importer requires).
- Because IPA is web-based, the data must be uploaded, and projects/analyses are not saved locally. The uploaded data can take different forms; see http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. IPA uses the terms Single Observation and Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (e.g. wild-type). A Multiple Observation dataset contains more than one experimental condition (e.g. a time course experiment or a dose response experiment) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations may be uploaded into IPA in a single file.
- The instructions at http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions list the gene identifier types IPA accepts.
- In the prostate example data tutorial, the term 'fold change' is used in place of log2 gene expression. The tutorial also uses 1.5 as the fold change cutoff.
- The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.
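If the expression values at hand are log2 ratios, they could be converted to fold changes and filtered at the 1.5 cutoff before upload. The sketch below rests on several assumptions that are not taken from the IPA documentation: the input is a tab-delimited file with a header line, the gene ID is in column 1 and the log2 ratio in column 2, and down-regulation is written as a negative fold change (a ratio of 0.5 becomes -2). The function name `log2_to_fold_change` is hypothetical.

```shell
# Convert log2 ratios (column 2) to signed fold changes and keep only
# genes whose absolute fold change reaches the cutoff (default 1.5).
# Usage: log2_to_fold_change file.tsv [cutoff]
log2_to_fold_change() {
    cutoff=${2:-1.5}
    awk -F '\t' -v OFS='\t' -v cutoff="$cutoff" '
        NR == 1 { print $1, "fold_change"; next }   # replace the header line
        {
            fc = exp($2 * log(2))        # 2 raised to the log2 ratio
            if (fc < 1) fc = -1 / fc     # assumed convention: 0.5 -> -2
            if (fc >= cutoff || fc <= -cutoff) printf "%s\t%.3f\n", $1, fc
        }
    ' "$1"
}

# Example (hypothetical file name):
# log2_to_fold_change expression_log2.tsv 1.5 > expression_fc.tsv
```

Whether the tool expects signed fold changes, ratios, or log ratios should be confirmed against the upload instructions before relying on this conversion.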
Screenshots: