Anders2013
The data is used in the paper "Count-based differential ....." by Anders et al 2013.
Required tools
The version number indicated below is the one I use. It may be updated when you are ready to download that.
- bowtie (bowtie2-2.1.0-linux-x86_64.zip): http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.1.0/ (Mac, Linux, Windows binaries)
- tophat (tophat-2.0.10.Linux_x86_64.tar.gz): http://tophat.cbcb.umd.edu/downloads/ (Linux and Mac binaries)
- samtools (samtools-0.1.19.tar.bz2): http://samtools.sourceforge.net/ (require GCC to compile (build-essential) & zlib1g-dev & libncurses5-dev packages)
- sra toolkit (sratoolkit.2.3.4-2-ubuntu64.tar.gz): http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software (Mac, Ubuntu & CentOS binaries)
- HTseq (HTSeq-0.5.4p5.tar.gz): http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html (require Python 2)
- IGV (IGV_2.3.26.zip): http://www.broadinstitute.org/software/igv/download (require Java and registration to download the program)
- FastQC (fastqc_v0.10.1.zip) http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (require Java)
Installation
If binary executables are available, we don't need to do anything special except adding the path containing the binary to the PATH variable.
If only source code is available, make sure the required tool GCC is installed (Under Ubuntu, we can use sudo apt-get install build-essential). Then we can compile and then install the program by using ./configure, make and make install.
In this exercise, all source code or binary programs were extracted to $HOME/Downloads/ directory where each program has its own directory.
The PATH variable
Open ~/.basrhc file using any text editor (such as nano or vi) and add the path containing the binary of each program to PATH variable.
tail ~/.bashrc export PATH=$PATH:~/Downloads/bowtie2-2.1.0/ export PATH=$PATH:~/Downloads/tophat-2.0.10.Linux_x86_64/ export PATH=$PATH:~/Downloads/samtools-0.1.19/ export PATH=$PATH:~/Downloads/samtools-0.1.19/bcftools/ export PATH=$PATH:~/Downloads/samtools-0.1.19/misc/ export PATH=$PATH:~/Downloads/sratoolkit.2.3.4-2-ubuntu64/bin export PATH=$PATH:~/Downloads/HTSeq-0.5.4p5/build/scripts-2.7/
Data directory
Download <nprot.2013.099-S1.zip> from the paper's web site and extract it to ~/Anders2013 directory.
brb@brbweb4:~/Anders2013$ ls CG8144_RNAi-1.count counts.csv README.txt Untreated-1.count Untreated-6.count CG8144_RNAi-3.count __MACOSX samples.csv Untreated-3.count CG8144_RNAi-4.count nprot.2013.099-S1.zip SraRunInfo.csv Untreated-4.count
The raw data is from GSE18508 'modENCODE Drosophila RNA Binding Protein RNAi RNA-Seq Studies'.
Pipeline
The following summary is extracted from http://www.bioconductor.org/help/course-materials/2013/CSAMA2013/tuesday/afternoon/DESeq_protocol.pdf
It would be interesting to monitor the disk space usage before, during and after the execution.
- Download example data (22 files in SRA format)
Download Supplement file and read <SraRunInfo.csv> in R. We will choose a subset of data based on "LibraryName" column.
sri = read.csv("SraRunInfo.csv", stringsAsFactors=FALSE) keep = grep("CG8144|Untreated-",sri$LibraryName) sri = sri[keep,] fs = basename(sri$download_path) for(i in 1:nrow(sri)) download.file(sri$download_path[i], fs[i]) names(sri) [1] "Run" "ReleaseDate" "spots" "bases" [5] "avgLength" "size_MB" "AssemblyName" "download_path" [9] "Experiment" "LibraryName" "LibraryStrategy" "LibrarySelection" [13] "LibrarySource" "LibraryLayout" "InsertSize" "Platform" [17] "Model" "SRAStudy" "BioProject" "Study_Pubmed_id" [21] "ProjectID" "Sample" "BioSample" "TaxID" [25] "ScientificName" "SampleName" "Submission" "Consent" apply(sri,2, function(x) length(unique(x))) Run ReleaseDate spots bases 22 22 22 22 avgLength size_MB AssemblyName download_path 5 22 1 22 Experiment LibraryName LibraryStrategy LibrarySelection 7 7 1 1 LibrarySource LibraryLayout InsertSize Platform 1 2 2 1 Model SRAStudy BioProject Study_Pubmed_id 1 1 1 1 ProjectID Sample BioSample TaxID 1 7 1 1 ScientificName SampleName Submission Consent 1 1 1 1 sri[1:2,] Run ReleaseDate spots bases avgLength size_MB 1 SRR031714 2010-01-15 10:42:00 5327425 394229450 74 396 2 SRR031715 2010-01-21 14:24:18 5248396 388381304 74 390 AssemblyName 1 NA 2 NA download_path 1 ftp://ftp-private.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR031/SRR031714/SRR031714.sra 2 ftp://ftp-private.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR031/SRR031715/SRR031715.sra Experiment LibraryName LibraryStrategy LibrarySelection LibrarySource 1 SRX014459 Untreated-3 RNA-Seq cDNA TRANSCRIPTOMIC 2 SRX014459 Untreated-3 RNA-Seq cDNA TRANSCRIPTOMIC LibraryLayout InsertSize Platform Model SRAStudy 1 PAIRED 200 ILLUMINA Illumina Genome Analyzer II SRP001537 2 PAIRED 200 ILLUMINA Illumina Genome Analyzer II SRP001537 BioProject Study_Pubmed_id ProjectID Sample BioSample 1 GEO Series accession: GSE18508 20921232 168994 SRS008447 NA 2 GEO Series accession: GSE18508 20921232 168994 SRS008447 NA TaxID ScientificName SampleName Submission Consent 1 7227 Drosophila melanogaster Drosophila melanogaster SRA010243 public 2 7227 Drosophila melanogaster Drosophila melanogaster SRA010243 public
- Convert SRA to FASTQ (sra -> fastq)
This takes a long time and it is not run in parallel.