Genome: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 56: Line 56:
## Use browser and go to ftp website to download '''all.cDNA''' file to desktop. The desktop should contain 3 files - all.cDNA, rice.bam and rice.bai files for IGV.
## Use browser and go to ftp website to download '''all.cDNA''' file to desktop. The desktop should contain 3 files - all.cDNA, rice.bam and rice.bai files for IGV.
# Goto http://www.broadinstitute.org/software/igv/download to download IGV which is a java-based application. I need to install java machine first by install openjdk-7-jdk. IGV by default will launch 'Human hg18' genome. Launch IGV by
# Goto http://www.broadinstitute.org/software/igv/download to download IGV which is a java-based application. I need to install java machine first by install openjdk-7-jdk. IGV by default will launch 'Human hg18' genome. Launch IGV by
<pre>
'''IGV_2.2.13/java -Xmx750m -jar igv.jar'''
IGV_2.2.13/java -Xmx750m -jar igv.jar
</pre>
## Goto File => Import Genome. Call it 'rice' and select 'all.cDNA' sequence file. Click 'Save' button.
## Goto File => Import Genome. Call it 'rice' and select 'all.cDNA' sequence file. Click 'Save' button.
## Goto File => Upload from File => rice.bam.
## Goto File => Upload from File => rice.bam.

Revision as of 14:44, 7 April 2013

Visualization

IGV

RNA seq

   BWA/Bowtie     samtools        
fa ---------> sam ------> sam/bam (sorted indexed, short reads), vcf
   or tophat

Rsamtools    GenomeFeatures                  edgeR (normalization)
--------->   --------------> table of counts --------->

Youtube videos

Download the raw fastq data GSE19602 from GEO and uncompress fastq.bz2 to fastq (~700MB) file. NOTE: the data downloaded from ncbi is actually sra file format. We can use fastq_dump program in SRA_toolkit to convert sra to fastq. http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software

~/Downloads/sratoolkit.2.3.2-ubuntu64/bin/fastq-dump ~/Downloads/SRR034580.sra 

If we want to run Galaxy locally, we can install it simply by 2 command lines

hg clone https://bitbucket.org/galaxy/galaxy-dist/
cd galaxy-dist
hg update stable

To run Galaxy locally, we do

cd galaxy-dist
sh run.sh --reload

The command line will show Starting server in PID XXXX. serving on http://127.0.0.1:8080. We can use Ctrl + C to stop the Galaxy process.

Note: One problem with this simple instruction is we have not created a user yet.

  1. Upload one fastq data. Click 'refresh' icon on the history panel to ensure the data is uploaded. Don't use 'refresh' button on the browser; it will cause an error for the current process.
  2. FASTQ Groomer. Convert the data to Galaxy needs. NGS: QC and manipulation => Illumina FASTQ. FASTQ quality scores type: Sanger. (~10 minutes). This part uses CPU not memory.
  3. Open a new browser tab and go to ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/. Right click the file all.cDNA and copy link location. In Galaxy click 'Upload File from your computer' paste URL to URL/Text entry.
  4. Scroll down Galaxy and select NGS:Mapping -> Map with BWA. PS. for some reason, BWA is not available. So I use Bowtie2 instead. The output of Bowtie2 is bam file.
    1. For reference genome, choose 'Use one from the history'. Galaxy automatically find the reference file 'ftp://ftp.plantbiology....' from history.
    2. Library mate-paired => Single-end.
    3. FASTQ file => 2: FASTQ Groomer on data 1.
    4. BWA settings to use => Commonly Used.
    5. Execute (~ 15 minutes)
  5. We can view the alignment file (sam format) created from BWA by using UCSV or IGV (input is bam or bai format). We now use NGS: SAM Tools to convert sam file to bam file. Click 'SAM-to-BAM converts SAM format to BAM format' tool.
    1. Choose the source for the reference list => History
    2. Converts SAM file => 4: Map with BWA on data 2 and data 3.
    3. Using reference file => 3:ftp://ftp.plantbiology.....
    4. Execute (~5 minutes)
  6. We want to create bai file which is a shortcut to IGV. It breaks the data into smaller accessible chunks. So when you pull up a certain cDNA, it goes straight to the subset. Go to the history, click the pencil icon (Edit Attributes) on the file SAM-to-BAM on data 3 and data 4.
    1. Look at 'Convert to new format' section. Go ahead and click 'Convert'. (< 1 minute). This will create another file.
    2. Use browser and go to ftp website to download all.cDNA file to desktop. The desktop should contain 3 files - all.cDNA, rice.bam and rice.bai files for IGV.
  7. Goto http://www.broadinstitute.org/software/igv/download to download IGV which is a java-based application. I need to install java machine first by install openjdk-7-jdk. IGV by default will launch 'Human hg18' genome. Launch IGV by

IGV_2.2.13/java -Xmx750m -jar igv.jar

    1. Goto File => Import Genome. Call it 'rice' and select 'all.cDNA' sequence file. Click 'Save' button.
    2. Goto File => Upload from File => rice.bam.
    3. Top right panel is cDNA
    4. Middle right panel has a lot of 'boxes' which is a read. If we zoom in, we can see some read points to left (backward) while some points to right (forward). On the top is a histogram. For example, a base may be covered by a lot of reads then the histogram will show the high frequence.
    5. If we keep zoom in, we can see color at the Bottom right panel. Keeping zoom in, we can see the base G, C, T, A themselves.
    6. Using IGV, we can 1. examine coverage.
    7. We can 2. check 'alternative splicing'. (not for this cDNA)
    8. We can 3. examine SNPs for base change. If we see gray color (dark gray is hight quality read, light gray means low quality read), it means they are perfect match. If we see color, it means there is a change. For example, a read is 'C' but in fact it should be 'A'. If a case has many high quality reads, and half of them are 'G' but the reference genome shows 'A'. This is most likely a SNP. This is heterogeisity.
  1. Tophat - align RNA seq data to genomic DNA
    1. Suppose we have use Galaxy to upload 2 data. One is SRR034580 and we have run FASTQ Groomer on data 1. The second data is SRR034584 and we also have run FASTQ Groomer on data 2. We also have uploaded reference genome sequence.
    2. Goto Galaxy and find NGS: RNA Analysis => Tophat.
    3. reference genome => Use one from the history
    4. RNA-Seq FASTQ file => 2; FASTQ Groomer on data 1.
    5. Execute. This will create 2 files. One is splice junctions and the other is accepted_hits. We queue the job and run another Tophat with the 2nd 'groomer'ed data file. We are going to work on accepted_hits file.
    6. While the queue are running, we can click on 'pencil' icon on 'accepted_hits' job and run the utlity 'Convert to new format' (Bam to Bai). We should do this for both 'accepted_hits' files.
  2. Cufflinks. We will estimate transcript abundance by using FPKM (RPKM).
    1. SAM or BAM file of alignmed RNA-Seq reads => tophat on data 2.. accepted_hits
    2. Use Reference Annotation - No (choose Yes if we want annotation. This requires GTF format. See http://genome.ucsc.edu/FAQ/FAQformat.html#format4. We don't have it for rice.)
    3. Execute. This will create 3 files. Gene expression, transcript expression and assembled transcripts.
    4. We also run Cufflinks for 2nd accepted_hits file. (~ 25 minutes)
  3. Cuffcompare. Compare one to each other.
    1. GTF file produced by Cufflinks => assembled transcript from the 1st data
    2. Use another GTF file produced by Cufflinks => Yes. It automatically find the other one.
    3. Execute. (< 10 minutes). This will create 7 files. Transcript accuracy, tmap file & refmap flie from each assembled transcripts, combined transcripts and transcript tracking.
    4. We are interested in combined transcripts file (to use in Cuffdiff).
  4. Cuffdiff.
    1. Transcripts => combined transcripts.
    2. SAM or BAM file of aligned RNA-Seq reads => 1st accepted_hits
    3. SAM or BAM file or aligned RNA-Seq reads => 2nd accepted_hits
    4. Execute. This will generate 11 files. Isoform expression, gene expression, TSS groups expression, CDS Expression FPKM Tracking, isoform FPKM tracking, gene FPKM tracking, TSS groups FPKM tracking, CDS FPKM tracking, splicing diff, promoters diff, CDS diff. We are interested in 'gene expression' file. We can save it and open it in Excel.
  5. IGV - 2 RNA-Seq datasets aligned to genomic DNA using Tophat
    1. Load the reference genome rice (see above)
    2. Upload from file => rice4.bam. Upload from file => rice5.bam.
    3. Alternative RNA splicing.

Bowtie

Extremely fast, general purpose short read aligner

Tophat

Aligns RNA-Seq reads to the genome using Bowtie/Discovers splice sites.


Linux part.

$ type -a tophat # Find out which command the shell executes:
tophat is /home/mli/binary/tophat
$ ls -l ~/binary

Quick test of Tophat program

$ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz
$ tar xzvf test_data.tar.gz
$ cd ~/tophat_test_data/test_data
$ PATH=$PATH:/home/mli/bowtie-0.12.8
$ export PATH
$ ls
reads_1.fq      test_ref.1.ebwt  test_ref.3.bt2  test_ref.rev.1.bt2   test_ref.rev.2.ebwt
reads_2.fq      test_ref.2.bt2   test_ref.4.bt2  test_ref.rev.1.ebwt
test_ref.1.bt2  test_ref.2.ebwt  test_ref.fa     test_ref.rev.2.bt2
$ tophat -r 20 test_ref reads_1.fq reads_2.fq
$ # This will generate a new folder <tophat_out>
$ ls tophat_out
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed  logs  prep_reads.info  unmapped.bam

TopHat accepts FASTQ and FASTA files of sequencing reads as input. Alignments are reported in BAM files. BAM is the compressed, binary version of SAM43, a flexible and general purpose read alignment format. SAM and BAM files are produced by most next-generation sequence alignment tools as output, and many downstream analysis tools accept SAM and BAM as input. There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps most popular among these are the SAM tools (http://samtools.sourceforge.net/) and the Picard tools (http://picard.sourceforge.net/).

Cufflinks package

Both Cufflinks and Cuffdiff accept SAM and BAM files as input. It is not uncommon for a single lane of Illumina HiSeq sequencing to produce FASTQ and BAM files with a combined size of 20 GB or larger. Laboratories planning to perform more than a small number of RNA-seq experiments should consider investing in robust storage infrastructure, either by purchasing their own hardware or through cloud storage services.

Cufflinks

Cufflinks uses this map (done from Tophat) against the genome to assemble the reads into transcripts.

Cuffcompare

Compares transcript assemblies to annotation

Cuffmerge

Merges two or more transcript assemblies

Cuffdiff

Finds differentially expressed genes and transcripts/Detect differential splicing and promoter use.

Cuffdiff takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis.

Follow the tutorial, we can quickly test the cuffdiff program.

$ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam
$ cufflinks ./test_data.sam
$ ls -l
total 56
-rw-rw-r-- 1 mli mli   221 2013-03-05 15:51 genes.fpkm_tracking
-rw-rw-r-- 1 mli mli   231 2013-03-05 15:51 isoforms.fpkm_tracking
-rw-rw-r-- 1 mli mli     0 2013-03-05 15:51 skipped.gtf
-rw-rw-r-- 1 mli mli 41526 2009-09-26 19:15 test_data.sam
-rw-rw-r-- 1 mli mli   887 2013-03-05 15:51 transcripts.gtf

CummeRbund

Plots abundance and differential expression results from Cuffdiff.

Other software

dCHIP

IPA from Ingenuity

Login: There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.

Features:

  • easily search the scientific literature/integrate diverse biological information.
  • build dynamic pathway models
  • quickly analyze experimental data/Functional discovery: assign function to genes
  • share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.

Start Here

Expression data -> New core analysis -> Functions/Diseases -> Network analysis
                                        Canonical pathways        |
                                              |                   |
Simple or advanced search --------------------+                   |
                                              |                   |
                                              v                   |
                                        My pathways, Lists <------+
                                              ^
                                              |
Creating a custom pathway --------------------+

Resource:

Notes:

  • The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
  • The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
  • The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
  • In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
  • The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.

Screenshots:

IngenuityAnalysisOutput.png

DAVID Bioinformatics Resource