|
|
Line 107: |
Line 107: |
| [[:File:CentralDogmaMolecular.png]] | | [[:File:CentralDogmaMolecular.png]] |
|
| |
|
| == Introduction to Sequence Data Analysis ==
| | See [[NGS|NGS]]. |
| * [http://www.rna-seqblog.com/review-of-rna-seq-data-analysis-tools/ Review of RNA-Seq Data Analysis Tools]
| |
| * [https://arxiv.org/abs/1804.06050 Modeling and analysis of RNA-seq data: a review from a statistical perspective] Li et al 2018
| |
| * [https://youtu.be/QNd5wkozSJ0 Introduction to RNA Sequencing] Malachi Griffith from the Genome Institute at Washington University
| |
| * [https://www.youtube.com/watch?v=hksQlJLwKqo&feature=youtu.be RNA-Seq workshop] from NYU
| |
| * [http://www.rna-seqblog.com/modern-rna-seq-differential-expression-analyses-transcript-level-or-gene-level/ Modern RNA-seq differential expression analyses: transcript-level or gene-level]
| |
| * http://www.pasteur.fr/~tekaia/BCGA2014/TALKS/Pabinger_Tools4VariantAnalysisOfNGS_EMBO2014.pdf (QC, duplicates, SAM, BAM, picard, variant calling, somatic variant, structural variants, CNV, variant filter, variant annotation, tools for vcf, visualization, pipeline). The source code is available on [https://github.com/tadKeys/BioinformaticsAndGenomesAnalyses2014 github].
| |
| * https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools
| |
| * http://www.rnaseq.wiki/ or https://github.com/griffithlab/rnaseq_tutorial/wiki from The Griffith Lab. It has a [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393 paper] too at PLOS.
| |
| * http://bioinformatics.ca/workshops/2015/high-throughput-biology-sequence-networks-2015#material
| |
| * http://www.cbcb.umd.edu/~hcorrada/CMSC858B/
| |
| * http://watson.nci.nih.gov/~sdavis/tutorials/biowulf-2011/
| |
| * http://cbsu.tc.cornell.edu/ngw2010/Day3_lecture1.pdf
| |
| * http://static.msi.umn.edu/tutorial/lifescience/RNA-Seq-Module-1.pdf
| |
| * http://www.rna-seqblog.com/introduction-to-rna-seq-analysis/ (focus on statistical part)
| |
| * [https://www.youtube.com/watch?v=Ojy82R86Bm4 BroadE: GATK/Mapping and processing RNAseq] from youtube.
| |
| * [http://www.rna-seqblog.com/researchers-at-kansas-state-university-suggest-reconsidering-your-standard-rna-seq-data-management-pipeline/ Low-count transcripts] in RNA-Seq analysis pipeline
| |
| * [http://tv.qiagenbioinformatics.com/video/14353472/rna-seq-analysis-of-human-breast-cancer-data RNA-seq analysis of human breast cancer data] from QIAGEN. Biomedical Genomics Workbench 3.0.1 is used.
| |
| * [https://www.youtube.com/playlist?list=PLjiXAZO27elABzLA0aHKS9chVA2TldoPF RNA-seq Chipster tutorials]
| |
| * [https://discoveringthegenome.org/ Discovering the Genome]
| |
| | |
| NIH only
| |
| * [https://helix.nih.gov/About_Us/training.html Some training material from NIH]:
| |
| * [http://biowulf.nih.gov/biowulf-seminar-3feb2015.pdf Biowulf seminar] by Steven Fellini & Susan Chacko.
| |
| * [https://helix.nih.gov/Documentation/Talks/BashScripting_LinuxCommands.pdf Bash Scripting and Linux commands]
| |
| | |
| === RNA-seq: Basics, Applications and Protocol ===
| |
| https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461
| |
| | |
| === Why Do You Need To Use Cdna For Rna-Seq? ===
| |
| https://www.biostars.org/p/54969/
| |
| | |
| === RNA-Seq vs DNA-Seq ===
| |
| * With DNA, you'd be randomly sequencing the entire genome
| |
| * DNA-Seq cannot be used for measuring gene expression.
| |
| | |
| === Total RNA-Seq ===
| |
| * [https://www.gatc-biotech.com/en/expertise/transcriptomics/total-rna-seq.html Total RNA-Seq vs. mRNA sequencing] from gatc-biotech.com
| |
| * [https://www.illumina.com/techniques/sequencing/rna-sequencing/total-rna-seq.html A comprehensive picture of the transcriptome] from illumina.com
| |
| * https://en.wikipedia.org/wiki/RNA-Seq RNA-Seq is used to analyze the continuously changing cellular transcriptome.
| |
| | |
| === How to know that your RNA-seq is stranded or not? ===
| |
| https://www.biostars.org/p/98756/
| |
| | |
| === Workshops ===
| |
| * [http://bioinformatics.ucdavis.edu/training/documentation/ UC Davis] Bioinformatics Core.
| |
| * [https://bioconductor.org/help/course-materials/2017/CSAMA/ CSAMA 2017]: Statistical Data Analysis for Genome-Scale Biology
| |
| | |
| === Blogs ===
| |
| * [https://blog.genohub.com/2013/08/07/11-top-next-generation-sequencing-blogs/ 11 Top Next Generation Sequencing Blogs]
| |
| * [https://seandavi.github.io/talk/ Sean Davis]
| |
| | |
| == Automation ==
| |
| * [http://www.rna-seqblog.com/implementation-of-an-open-source-software-solution-for-laboratory-information-management-and-automated-rna-seq-data-analysis/ Implementation of an Open Source Software solution for Laboratory Information Management and automated RNA-seq data analysis]
| |
| * [http://www.rna-seqblog.com/trapline-a-standardized-and-automated-pipeline-for-rna-sequencing-data-analysis-evaluation-and-annotation/ TRAPLINE – a standardized and automated pipeline for RNA sequencing data analysis, evaluation and annotation]
| |
| | |
| == Quality control ==
| |
| For base quality score, the quality value, Q(sanger) = - 10log10(prob) where prob = probability that the call at base b is correct. Sanger ('''Phred quality scores''') is between 0 and 93. In practice the maximum quality score is ~40. Quality values below 20 are typically considered low.
| |
| | |
| === Phred quality score and scales ===
| |
| * http://blog.nextgenetics.net/?e=33
| |
| * https://en.wikipedia.org/wiki/FASTQ_format#Encoding
| |
| * http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ contains a link to a python script to guess the encoding format
| |
| | |
| The original Phred scaling of '''33-126''' (representing scores '''0-93''') is also called the '''Sanger''' scale. There are also 2 other scales used by Illumina that shifts the range up:
| |
| | |
| * '''Illumina 1.0''' format uses ASCII 59 - 126 representing scores '''-5 - 62'''.
| |
| * '''Illumina 1.3+''' format uses ASCII 64 - 126 representing scores '''0 - 62'''.
| |
| * '''Illumina 1.8''', the quality scores have basically returned to the use of the Sanger format (Phred+33).
| |
| | |
| === FastQC (java based) ===
| |
| One problem is the reads in trimmed fastq files from one end may not appear in other end for paired data. So consider Trimmomatic.
| |
| | |
| === [http://qualimap.bioinfo.cipf.es/ Qualimap] (java based) ===
| |
| Qualimap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to the detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.
| |
| | |
| The first time when we launch Qualimap we may get a message: The following R packages are missing: optparse, NOISeq, Repitools. Features dependent on these packages are disabled. See user manual for details. OK.
| |
| | |
| === [http://www.usadellab.org/cms/index.php?page=trimmomatic Trimmomatic] ===
| |
| https://hpc.nih.gov/apps/trimmomatic.html
| |
| | |
| === [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ Trim Galore!] ===
| |
| https://hpc.nih.gov/apps/trimgalore.html
| |
| | |
| === [http://www.biomedcentral.com/1471-2105/16/224 QoRTs] ===
| |
| A comprehensive toolset for quality control and data processing of RNA-Seq experiments.
| |
| | |
| === [https://bioinf.shenwei.me/seqkit/ SeqKit] ===
| |
| A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
| |
| | |
| === FastqCleaner ===
| |
| [https://www.biorxiv.org/content/early/2018/08/16/393140 An interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files] and [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8 BMC Bioinformatics]
| |
| | |
| == Alignment and Indexing ==
| |
| * https://en.wikipedia.org/wiki/Sequence_alignment
| |
| * https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
| |
| * [http://www.nature.com/nmeth/journal/vaop/ncurrent/pdf/nmeth.4106.pdf Simulation-based comprehensive benchmarking of RNA-seq aligners]
| |
| * [http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2722.html Systematic evaluation of spliced alignment programs for RNA-seq data]
| |
| * (Video) http://bioinformatics.ca//files/public/RNA-seq_2014_Montreal_Module2.mp4, http://www.rna-seqblog.com/rna-seq-alignment-and-visualization/
| |
| * (Video) https://youtu.be/n77EAk8C1es RNA-SEQ: Mapping to a Reference Genome
| |
| * RNA-Seq alignment methods emphasize gene count while DNA-Seq alignment methods (no junctions) emphasize variant detection. The read length in RNA-Seq alignment does not have be long but it has to be long in DNA-Seq since DNA-Seq cares about a good coverage of the whole exome/genome.
| |
| | |
| === SAM file format ===
| |
| [[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|SAM format]]
| |
| | |
| Here are some important fields
| |
| | |
| * Col 1: Read name
| |
| * Col 2: FLAG [0,2^16-1]
| |
| * Col 3: RNAME/Chrom
| |
| * Col 4: Pos [0,2^31-1]
| |
| * Col 5: MAPQ [0,2^8-1]
| |
| * Col 6: CIGAR
| |
| * Col 7: RNEXT (se as "=" if RNEXT is equal to RNAME)
| |
| * Col 8: PNEXT - Mate position [0,2^31-1]
| |
| * Col 9: Insert size [-2^31+1, 2^31-1]
| |
| * Col 10: SEQ
| |
| * Col 11: QUAL
| |
| * Col 14 (not fixed): AS
| |
| | |
| Note pair-end data,
| |
| <pre>
| |
| 1. Fastq files
| |
| A_1.fastq A_2.fastq
| |
| read1 read1
| |
| read2 read2
| |
| ... ...
| |
| | |
| 2. SAM files (sorted by read name)
| |
| read1
| |
| read1
| |
| read2
| |
| read2
| |
| ...
| |
| </pre>
| |
| | |
| For example, consider paired-end reads ''SRR925751.192'' that was extracted from so called inproper paired reads in samtools. If we extract the paired reads by ''grep -w SRR925751.192 bt20/bwa.sam'' we will get
| |
| <pre>
| |
| SRR925751.192 97 chr1 13205 60 101M = 13429 325 .....
| |
| SRR925751.192 145 chr1 13429 60 101M = 13205 -325 .....
| |
| </pre>
| |
| By entering the flag values 97 & 145 into the SAM flag entry on http://broadinstitute.github.io/picard/explain-flags.html, we see
| |
| * FLAG value 97 means read paired, mate reverse strand, first in pair.
| |
| * FLAG value 145 means read paired, read reverse strand, second in pair.
| |
| The two reads have positions. As we can see the '''insert size''' (it's the name shown on samtools help page and IGV program) is defined by start positions instead of end positions)
| |
| <pre>
| |
| 13205 13305 13429 13529
| |
| |-------------> <----------|
| |
| </pre>
| |
| | |
| Consider a properly paired reads ''SRR925751.1''. If we extract the paired reads, we will get
| |
| <pre>
| |
| SRR925751.1 99 chr1 10010 0 90M11S = 10016 95 .....
| |
| SRR925751.1 147 chr1 10016 0 12S89M = 10010 -95 .....
| |
| </pre>
| |
| Again by entering the flag values 99 and 147 into the Decoding SAM flags website, we see
| |
| * FLAG value 99 means read paired, '''read mapped in proper pair''', mate reverse strand, first in pair.
| |
| * FLAG value 147 means read paired, '''read mapped in proper pair''', read reverse strand, second in pair.
| |
| <pre>
| |
| 10010 10099
| |
| |------------>
| |
| 10016 10104
| |
| <-------------|
| |
| </pre>
| |
| The insert size here is 10104-10010+1=95. Running the [https://www.biostars.org/p/16556/ following command],
| |
| <syntaxhighlight lang='bash'>
| |
| cat foo.bam | awk '{ if ($9 >0) {S+=$9; T+=1}} END {print "Mean: "S/T}'
| |
| </syntaxhighlight>
| |
| I get 133.242 for the average insert size from properly paired reads on my data (GSE48215subset, read length=101).
| |
| | |
| If I run the same command on in-properly paired reads, I got 1.9e07 for the average insert size.
| |
| | |
| === Bowtie2 (RNA/DNA) ===
| |
| Extremely fast, general purpose short read aligner.
| |
| | |
| ==== Create index files ====
| |
| bowtie needs to have an '''index''' of the genome in order to perform its alignment functionality. For example, to build a bowtie index against UCSC hg19 (See [http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#indexing-a-reference-genome Getting started with Bowtie 2 -> Indexing a reference genome])
| |
| <syntaxhighlight lang='bash'>
| |
| bowtie2-build /data/ngs/public/sequences/hg19/genome.fa hg19
| |
| </syntaxhighlight>
| |
| | |
| Even the index file can be directly downloaded without going through Bowtie program, Bowtie program is still needed by Tophat program where Tophat's job is to align the RNA-seq data to reference genome.
| |
| | |
| ==== Mapping ====
| |
| To run alignment,
| |
| <syntaxhighlight lang='bash'>
| |
| bowtie2 -p 4 -x /data/ngs/public/sequences/hg19 XXX.fastq -S XXX.bt2.sam
| |
| </syntaxhighlight>
| |
| At the end of alignment, it will show how many (and percent) of reads are aligned 0 times, exactly 1 time, and >1 times.
| |
| | |
| We can transform them into a bam format ('-@' means threads and '-T' means the reference file, it is optional)
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -@ 16 -T XXX.fa XXX.bt2.sam > XXX.bt2.bam
| |
| </syntaxhighlight>
| |
| and view the bam file
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view XXX.bt2.bam
| |
| </syntaxhighlight>
| |
| | |
| Note that it is faster to use pipe to directly output the file in a bam format (reducing files I/O) if the aligner does not provide an option to output a binary bam format.
| |
| | |
| === BWA (DNA) ===
| |
| * BWA MEM Algorithm https://www.biostars.org/p/140502/
| |
| * https://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf By Heng Li
| |
| * http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf
| |
| * https://bioinformatics.cancer.gov/sites/default/files/course_material/Data%20Analysis%20for%20Exome%20Sequencing%20Data_0318.ppt
| |
| | |
| Used by whole-exome sequencing. For example, http://bib.oxfordjournals.org/content/15/2/256.full and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3865592/. [https://igor.sbgenomics.com/public/pipelines/534522f6d79f0049c0c9444e/ Whole Exome Analysis].
| |
| ==== Creating index file ====
| |
| Without creating index files, we will get an error ''[E::bwa_idx_load_from_disk] fail to locate the index files'' when we run the 'bwa mem' command (though the command only requires fa file in its arguments).
| |
| | |
| [http://gatkforums.broadinstitute.org/wdl/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk Prepare a reference for use with BWA and GATK]
| |
| <syntaxhighlight lang='bash'>
| |
| bwa index genome.fa # generate 4 new files amb, ann, pac and bwt (not enough for running 'bwa mem')
| |
| bwa index -a bwtsw genome.fa # generate 5 new files amb, ann, pac, bwt and sa (right thing to do)
| |
| </syntaxhighlight>
| |
| where '''-a bwtsw''' specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.
| |
| | |
| It takes 3 hours to create the index files on the combined human+mouse genomes. Though there is no multhreads optionin bwa index, we can use the '''-b''' option to increase the block size in order to speed up the time. See https://github.com/lh3/bwa/issues/104. The default value is 10000000 (1.0e8). See the following output from ''bwa index'':
| |
| <pre>
| |
| [bwa_index] Pack FASTA... 46.43 sec
| |
| [bwa_index] Construct BWT for the packed sequence...
| |
| [BWTIncCreate] textLength=11650920420, availableWord=831801828
| |
| [BWTIncConstructFromPacked] 10 iterations done. 99999988 characters processed.
| |
| [BWTIncConstructFromPacked] 20 iterations done. 199999988 characters processed.
| |
| ...
| |
| [BWTIncConstructFromPacked] 1230 iterations done. 11647338276 characters processed.
| |
| [bwt_gen] Finished constructing BWT in 1232 iterations.
| |
| [bwa_index] 4769.18 seconds elapse.
| |
| [bwa_index] Update BWT... 45.78 sec
| |
| [bwa_index] Pack forward-only FASTA... 32.66 sec
| |
| [bwa_index] Construct SA from BWT and Occ... 2054.29 sec
| |
| [main] Version: 0.7.15-r1140
| |
| [main] CMD: bwa index -a bwtsw genomecb.fa
| |
| [main] Real time: 7027.396 sec; CPU: 6948.342 sec
| |
| </pre>
| |
| | |
| Note that another index file '''fai''' is created by samtools
| |
| <syntaxhighlight lang='bash'>
| |
| samtools faidx genome.fa
| |
| </syntaxhighlight>
| |
| | |
| For example, the BWAIndex folder contains the following files
| |
| <syntaxhighlight lang='bash'>
| |
| $ ls -l ~/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/
| |
| total 12
| |
| lrwxrwxrwx 1 brb brb 22 Mar 10 03:40 genome.fa -> version0.6.0/genome.fa
| |
| lrwxrwxrwx 1 brb brb 26 Mar 10 03:37 genome.fa.amb -> version0.6.0/genome.fa.amb
| |
| lrwxrwxrwx 1 brb brb 26 Mar 10 03:40 genome.fa.ann -> version0.6.0/genome.fa.ann
| |
| lrwxrwxrwx 1 brb brb 26 Mar 10 03:40 genome.fa.bwt -> version0.6.0/genome.fa.bwt
| |
| -rw-r--r-- 1 brb brb 783 Apr 12 14:46 genome.fa.fai
| |
| lrwxrwxrwx 1 brb brb 26 Mar 10 03:40 genome.fa.pac -> version0.6.0/genome.fa.pac
| |
| lrwxrwxrwx 1 brb brb 25 Mar 10 03:37 genome.fa.sa -> version0.6.0/genome.fa.sa
| |
| drwxrwxr-x 2 brb brb 4096 Mar 15 2012 version0.5.x
| |
| drwxrwxr-x 2 brb brb 4096 Mar 15 2012 version0.6.0
| |
| </syntaxhighlight>
| |
| | |
| ==== Mapping ====
| |
| <syntaxhighlight lang='bash'>
| |
| bwa mem # display the help
| |
| bwa mem -t 4 XXX.fa XXX.fastq > XXX.sam
| |
| more XXX.sam
| |
| samtools view -dT XXX.fa XXX.sam > XXX.bam # transform to bam format
| |
| samtools view XXX.bam | more
| |
| </syntaxhighlight>
| |
| and output directly to a binary format
| |
| <syntaxhighlight lang='bash'>
| |
| bwa mem -t 4 XXX.fa XXX.fastq | samtools view -bS - > XXX.bam # transform to bam format directly
| |
| </syntaxhighlight>
| |
| where '-S' means Ignoring for compatibility with previous samtools versions. Previously this option was required if input was in SAM format, but now the correct format is automatically detected by examining the first few characters of input. See http://www.htslib.org/doc/samtools.html.
| |
| | |
| And [https://github.com/ekg/alignment-and-variant-calling-tutorial this is a tutorial] to use bwa and freebayes to do variant calling by Freebayes's author.
| |
| | |
| [https://wikis.utexas.edu/display/bioiteam/Variant+calling+tutorial#Variantcallingtutorial-CallingvariantsinreadsmappedbyBWAorBowtie2 This tutorial] is using a whole genome data [http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR030257 SRR030257] from [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP001369 SRP001369]
| |
| | |
| [http://gatkforums.broadinstitute.org/wdl/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk BWA for whole genome and GATK]
| |
| | |
| Best pipeline for human whole exome sequencing
| |
| * https://www.biostars.org/p/1268/
| |
| | |
| ==== Summary report of a BAM file: samtools flagstat ====
| |
| Unlike Tophat and STAR, BWA mem does not provide a summary report of percentage of mapped reads. To get the percentage of mapped reads, use [https://www.biostars.org/p/14709/ samtoolss flagstat] command
| |
| <syntaxhighlight lang='bash'>
| |
| $ export PATH=$PATH:/opt/SeqTools/bin/samtools-1.3
| |
| | |
| # fastq files from ExomeLungCancer/test.SRR2923335_*.fastq
| |
| # 'bwa mem' and 'samtools fixmate' have been run to generate <accepted_hits.bam>
| |
| | |
| $ wc -l ../test.SRR2923335_1.fastq # 5000 reads in each of PAIRED end fastq files
| |
| 20000 ../test.SRR2923335_1.fastq
| |
| | |
| $ samtools flagstat accepted_hits.bam
| |
| 10006 + 0 in total (QC-passed reads + QC-failed reads)
| |
| 6 + 0 secondary
| |
| 0 + 0 supplementary
| |
| 0 + 0 duplicates
| |
| 9971 + 0 mapped (99.65% : N/A)
| |
| 10000 + 0 paired in sequencing
| |
| 5000 + 0 read1
| |
| 5000 + 0 read2
| |
| 8502 + 0 properly paired (85.02% : N/A)
| |
| 9934 + 0 with itself and mate mapped
| |
| 31 + 0 singletons (0.31% : N/A)
| |
| 1302 + 0 with mate mapped to a different chr
| |
| 1231 + 0 with mate mapped to a different chr (mapQ>=5)
| |
| </syntaxhighlight>
| |
| | |
| See also https://wikis.utexas.edu/display/CoreNGSTools/Alignment#Alignment-Exercise#4:BWA-MEM-HumanmRNA-seq for alignment (including BWA mem and samtools flagstat).
| |
| | |
| * Properly mapped (inward oriented & same chromosome & good insert size): ''samtools view -f 3 -F 2 foo.bam''. But the result is not the same as flagstat??
| |
| * Itself and mate mapped: ''samtools view -f 1 -F 12 foo.bam''. But the result is not the same as flagstat??
| |
| * Singleton: ''samtools view -F 4 -f 8 foo.bam''. The flags indicate that the current read is mapped but its mate isn't. On GSE48215subset data, the return number of reads does not match with flagstat result but ''samtools view -f 4 -F 8 foo.bam'' (current read is unmapped and mate is mapped) gives the same result as the flagstat command.
| |
| | |
| ==== Stringent criterion ====
| |
| * https://sourceforge.net/p/bio-bwa/mailman/message/31968535/
| |
| * http://seqanswers.com/forums/showthread.php?p=186581
| |
| | |
| ==== Applied to RNA-Seq data? ====
| |
| https://www.biostars.org/p/130451/. ''BWA isn't going to handle spliced alignment terribly well, so it's not normally recommended for RNAseq datasets.''
| |
| | |
| === [https://github.com/lh3/minimap2 Minimap2] ===
| |
| * BWA replacement. Fast and almost identical mapping
| |
| * https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty191/4994778
| |
| * https://arxiv.org/abs/1708.01492
| |
| * [https://www.overleaf.com/read/ddwtrgmngxms#/39316160/ Latex] source
| |
| * [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2592-5 IMOS: improved Meta-aligner and Minimap2 On Spark]
| |
| | |
| === [http://ccb.jhu.edu/software/tophat/index.shtml Tophat] (RNA) ===
| |
| Aligns RNA-Seq reads to the genome using '''Bowtie'''/Discovers splice sites. It does so by splitting longer reads into small sections and aligning those to the genome. It then looks for potential splice sites between pairs of sections to construct a final alignment.
| |
| | |
| Linux part.
| |
| <pre>
| |
| $ type -a tophat # Find out which command the shell executes:
| |
| tophat is /home/mli/binary/tophat
| |
| $ ls -l ~/binary
| |
| </pre>
| |
| | |
| Quick test of Tophat program
| |
| <pre>
| |
| $ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz
| |
| $ tar xzvf test_data.tar.gz
| |
| $ cd ~/tophat_test_data/test_data
| |
| $ PATH=$PATH:/home/mli/bowtie-0.12.8
| |
| $ export PATH
| |
| $ ls
| |
| reads_1.fq test_ref.1.ebwt test_ref.3.bt2 test_ref.rev.1.bt2 test_ref.rev.2.ebwt
| |
| reads_2.fq test_ref.2.bt2 test_ref.4.bt2 test_ref.rev.1.ebwt
| |
| test_ref.1.bt2 test_ref.2.ebwt test_ref.fa test_ref.rev.2.bt2
| |
| $ tophat -r 20 test_ref reads_1.fq reads_2.fq
| |
| $ # This will generate a new folder <tophat_out>
| |
| $ ls tophat_out
| |
| accepted_hits.bam deletions.bed insertions.bed junctions.bed logs prep_reads.info unmapped.bam
| |
| </pre>
| |
| | |
| TopHat accepts FASTQ and FASTA files of sequencing reads as input. Alignments are reported in BAM files. BAM is the compressed, binary version of SAM43, a flexible and general purpose read alignment format. SAM and BAM files are produced by most next-generation sequence alignment tools as output, and many downstream analysis tools accept SAM and BAM as input. There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps most popular among these are the SAM tools (http://samtools.sourceforge.net/) and the Picard tools (http://picard.sourceforge.net/).
| |
| | |
| Note that if the data is DNA-Seq, we can merely use Bowtie2 or BWA tools since we don't have to worry about splicing.
| |
| | |
| An example of using Tophat2 (paired end in this case, 5 threads) is
| |
| <syntaxhighlight lang="bash">
| |
| tophat2 --no-coverage-search -p 5 \
| |
| -o "Sample1" \
| |
| -G ~/iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
| |
| --transcriptome-index=transcriptome_data/known \
| |
| ~/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \
| |
| myfastq_R1.fastq.gz myfastq_R2.fastq.gz
| |
| </syntaxhighlight>
| |
| | |
| To find out the alignment rate for ALL bam files (eg we have ctrl1, ctrl2, test1, test2 directories and each of them have align_summary.txt file),
| |
| <syntaxhighlight lang="bash">
| |
| grep "overall read mapping rate" */align_summary.txt
| |
| </syntaxhighlight>
| |
| | |
| ==== Novel transcripts ====
| |
| * [http://www.rna-seqblog.com/current-limitations-of-rna-seq-analysis-for-detection-of-novel-transcripts/ The identification and characterization of novel transcripts from RNA-seq data]
| |
| | |
| ==== How does TopHat find junctions? ====
| |
| https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/cufflinks-and-tophat
| |
| | |
| ==== Can Tophat Be Used For Mapping Dna-Seq ====
| |
| https://www.biostars.org/p/63878/
| |
| * In contrast to DNA-sequence alignment, RNA-seq mapping algorithms have two additional challenges. First, because genes in eukaryotic genomes contain introns, and because reads sequenced from mature mRNA transcripts do not include these introns, any RNA-seq alignment program must be able to handle gapped (or spliced) alignment with very large gaps.
| |
| * TopHat is designed to map reads to a reference allowing splicing. In your case, the reads are not spliced because are genomic, so don't waste your time and resources and use Bowtie/BWA directly.
| |
| | |
| === [http://ccb.jhu.edu/software/hisat2/manual.shtml Hisat] (RNA/DNA) ===
| |
| * https://bioinformatics.ca/workshops/2017/informatics-rna-seq-analysis-2017
| |
| * [https://hpc.nih.gov/apps/hisat.html Hisat on Biowulf]
| |
| * [https://www.biostars.org/p/177714/ The comparison between HISAT2 and Tophat2]
| |
| | |
| === [https://code.google.com/p/rna-star/ STAR] (Spliced Transcripts Alignment to a Reference, RNA) ===
| |
| | |
| [http://sci-hub.cc/10.1007/978-1-4939-3572-7_13 Optimizing RNA-Seq Mapping with STAR] by Alexander Dobin and Thomas R. Gingeras
| |
| | |
| Its manual is on [https://github.com/alexdobin/STAR github]. The [http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1114s51/full 2015 paper] includes scripts to run STAR.
| |
| | |
| Note that the readme file says HARDWARE/SOFTWARE REQUIREMENTS:
| |
| * x86-64 compatible processors
| |
| * 64 bit Linux or Mac OS X
| |
| * '''30GB of RAM for human genome'''. In fact, it requires '''34GB''' (tested on UCSC/hg38). See the following '''jobhist''' output from running indexing on Biowulf.
| |
| <pre>
| |
| Submission Command : sbatch --cpus-per-task=2 --mem=64g --time=4:00:00 createStarIndex
| |
| | |
| Jobid Partition State Nodes CPUs Walltime Runtime MemReq MemUsed Nodelist
| |
| 44714895 norm COMPLETED 1 2 04:00:00 02:54:21 64.0GB/node 33.6GB cn3144
| |
| </pre>
| |
| | |
| See the blog on [http://www.gettinggeneticsdone.com/2012/11/star-ultrafast-universal-rna-seq-aligner.html gettinggeneticsdone.com] for a comparison of speed and memory requirement.
| |
| | |
| In short, the notable increase in speed comes at the price of a larger memory requirement.
| |
| | |
| To build STAR on Ubuntu (14.04)
| |
| <syntaxhighlight lang="bash">
| |
| wget https://github.com/alexdobin/STAR/archive/STAR_2.4.2a.tar.gz
| |
| sudo tar -xzf STAR_2.4.2a.tar.gz -C /opt/RNA-Seq/bin
| |
| | |
| cd /opt/RNA-Seq/bin/STAR-STAR_2.4.2a
| |
| cd source
| |
| sudo make STAR
| |
| </syntaxhighlight>
| |
| | |
| ==== Create index folder ====
| |
| <syntaxhighlight lang='bash'>
| |
| STAR --runMode genomeGenerate --runThreadN 11 \
| |
| --genomeDir STARindex \
| |
| --genomeFastaFiles genome.fa \
| |
| --sjdbGTFfile genes.gtf \
| |
| --sjdbOverhang 100
| |
| </syntaxhighlight>
| |
| where 100 = read length (one side) -1 = 101 - 100.
| |
| | |
| ==== STARindex folder ====
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| brb@T3600 ~/SRP050992 $ ls ~/SRP012607/STARindex/
| |
| chrLength.txt chrStart.txt geneInfo.tab SA sjdbList.fromGTF.out.tab
| |
| chrNameLength.txt exonGeTrInfo.tab Genome SAindex sjdbList.out.tab
| |
| chrName.txt exonInfo.tab genomeParameters.txt sjdbInfo.txt transcriptInfo.tab
| |
| brb@T3600 ~/SRP050992 $ cat ~/SRP012607/STARindex/genomeParameters.txt
| |
| ### STAR --runMode genomeGenerate --runThreadN 11 --genomeDir STARindex --genomeFastaFiles /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa --sjdbGTFfile /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf --sjdbOverhang 100
| |
| versionGenome 20201
| |
| genomeFastaFiles /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
| |
| genomeSAindexNbases 14
| |
| genomeChrBinNbits 18
| |
| genomeSAsparseD 1
| |
| sjdbOverhang 100
| |
| sjdbFileChrStartEnd -
| |
| sjdbGTFfile /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf
| |
| sjdbGTFchrPrefix -
| |
| sjdbGTFfeatureExon exon
| |
| sjdbGTFtagExonParentTranscript transcript_id
| |
| sjdbGTFtagExonParentGene gene_id
| |
| sjdbInsertSave Basic
| |
| </pre>
| |
| | |
| ==== Splice junction ====
| |
| * [https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf Manual]. Search 'sjdbOverhang' (length of the donor/acceptor sequence on each side of the junctions). The --sjdbOverhang is used only at the genome generation step, and tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. If you have 100b reads, the ideal value of --sjdbOverhang is 99, which allows the 100b read to map 99b on one side, 1b on the other side. One can think of --sjdbOverhang as the maximum possible overhang for your reads. See [https://www.biostars.org/p/93883/ A post].
| |
| * For the <SJ.out.tab> format see section 4.4 of STAR manual. The first few lines of the file looks like (It is strange that the values in column 7, 8 are so large)
| |
| <pre>
| |
| $ head SJ.out.tab
| |
| chrM 711 764 1 1 1 2 1 44
| |
| chrM 717 1968 1 1 1 1 0 43
| |
| chrM 759 1189 0 0 1 11 3 30
| |
| chrM 795 830 2 2 1 335 344 30
| |
| chrM 822 874 1 1 1 1 1 17
| |
| chrM 831 917 2 2 1 733 81 38
| |
| chrM 831 3135 2 2 1 1 0 30
| |
| chrM 846 1126 0 0 1 4 0 50
| |
| chrM 876 922 1 1 1 26 9 31
| |
| chrM 962 1052 2 2 1 3 0 30
| |
| ^ ^ ^ ^ ^ ^ ^ ^ ^
| |
| | | | | | | | | |
| |
| chrom start end strand intron annotated | # of |
| |
| 0=undef motif # of multi mapped|
| |
| 1=+ uniquely mapped max spliced alignment overhang
| |
| 2=- reads
| |
| </pre>
| |
| * [https://ccb.jhu.edu/software/tophat/manual.shtml Tophat]. Search overhang, donor/acceptor keywords. Check out <junctions.bed>. Tophat suggest users use this second option (--coverage-search) for short reads (< 45bp) and with a small number of reads (<= 10 million).
| |
| * [http://www.ccb.jhu.edu/software/hisat/manual.shtml HiSAT] comes with a python script that can extract splice sites from a GTF file. See [https://www.biostars.org/p/146700/ this post].
| |
| <syntaxhighlight lang='bash'>
| |
| $ cd /tmp
| |
| $ /opt/SeqTools/bin/hisat2-2.0.4/hisat2_extract_splice_sites.py \
| |
| ~/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > splicesites.txt
| |
| $ head splicesites.txt
| |
| chr1 12226 12612 +
| |
| chr1 12720 13220 +
| |
| chr1 14828 14969 -
| |
| chr1 15037 15795 -
| |
| chr1 15946 16606 -
| |
| chr1 16764 16857 -
| |
| chr1 17054 17232 -
| |
| chr1 17367 17605 -
| |
| chr1 17741 17914 -
| |
| chr1 18060 18267 -
| |
| </syntaxhighlight>
| |
| * To extract the splice alignment from bam files using '''samtools''', see [https://www.biostars.org/p/12626/ this] and [https://www.biostars.org/p/89581/ this] posts.
| |
| | |
| ==== Screen output ====
| |
| <pre>
| |
| Oct 07 19:25:19 ..... started STAR run
| |
| Oct 07 19:25:19 ... starting to generate Genome files
| |
| Oct 07 19:27:08 ... starting to sort Suffix Array. This may take a long time...
| |
| Oct 07 19:27:38 ... sorting Suffix Array chunks and saving them to disk...
| |
| Oct 07 20:07:07 ... loading chunks from disk, packing SA...
| |
| Oct 07 20:22:52 ... finished generating suffix array
| |
| Oct 07 20:22:52 ... generating Suffix Array index
| |
| Oct 07 20:26:32 ... completed Suffix Array index
| |
| Oct 07 20:26:32 ..... processing annotations GTF
| |
| Oct 07 20:26:38 ..... inserting junctions into the genome indices
| |
| Oct 07 20:30:15 ... writing Genome to disk ...
| |
| Oct 07 20:31:00 ... writing Suffix Array to disk ...
| |
| Oct 07 20:37:24 ... writing SAindex to disk
| |
| Oct 07 20:37:39 ..... finished successfully
| |
| </pre>
| |
| | |
| ==== Log.final.out ====
| |
| An example of the Log.final.out file
| |
| <pre>
| |
| $ cat Log.final.out
| |
| Started job on | Jul 21 13:12:11
| |
| Started mapping on | Jul 21 13:35:06
| |
| Finished on | Jul 21 13:57:42
| |
| Mapping speed, Million of reads per hour | 248.01
| |
| | |
| Number of input reads | 93418927
| |
| Average input read length | 202
| |
| UNIQUE READS:
| |
| Uniquely mapped reads number | 61537064
| |
| Uniquely mapped reads % | 65.87%
| |
| Average mapped length | 192.62
| |
| Number of splices: Total | 15658546
| |
| Number of splices: Annotated (sjdb) | 15590438
| |
| Number of splices: GT/AG | 15400163
| |
| Number of splices: GC/AG | 178602
| |
| Number of splices: AT/AC | 14658
| |
| Number of splices: Non-canonical | 65123
| |
| Mismatch rate per base, % | 0.86%
| |
| Deletion rate per base | 0.02%
| |
| Deletion average length | 1.61
| |
| Insertion rate per base | 0.03%
| |
| Insertion average length | 1.40
| |
| MULTI-MAPPING READS:
| |
| Number of reads mapped to multiple loci | 4962849
| |
| % of reads mapped to multiple loci | 5.31%
| |
| Number of reads mapped to too many loci | 62555
| |
| % of reads mapped to too many loci | 0.07%
| |
| UNMAPPED READS:
| |
| % of reads unmapped: too many mismatches | 0.00%
| |
| % of reads unmapped: too short | 28.70%
| |
| % of reads unmapped: other | 0.05%
| |
| CHIMERIC READS:
| |
| Number of chimeric reads | 0
| |
| % of chimeric reads | 0.00%
| |
| </pre>
| |
| Note that 65.87 (uniquely mapped) + 5.31 (mapped to multiple loci) + 0.07 (mapped too many loci) + 28.70 (unmapped too short) + 0.05 (unmapped other) = 100
| |
| | |
| By default, STAR only outputs reads that map to <=10 loci, others are considered "mapped to too many loci". You can increase this threshold by increasing --outFilterMultimapNmax. See [https://groups.google.com/forum/#!topic/rna-star/sgVFpzSVFyY here].
| |
| | |
| === [http://subread.sourceforge.net/ Subread] (RNA/DNA) ===
| |
| * http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
| |
| * [http://genomespot.blogspot.com/2015/03/hisat-vs-star-vs-tophat2-vs-olego-vs.html HISAT vs STAR vs TopHat2 vs Olego vs SubJunc] by Mark Ziemann
| |
| * Get a 'Segmentation fault' error on GSE46876 (SRR902884.fastq, single-end) rna-seq data
| |
| | |
| In practice,
| |
| # For DNA-Seq data, the '''subread''' aligner is used.
| |
| # For RNA-Seq data,
| |
| #* the '''subread''' aligner is used when selecting gene counting analysis only.
| |
| #* the '''subjunc''' aligner is used when variant calling analysis is selected.
| |
| | |
| === RNASequel ===
| |
| [http://nar.oxfordjournals.org/content/early/2015/06/16/nar.gkv594.full RNASequel: accurate and repeat tolerant realignment of RNA-seq reads]
| |
| | |
| === [https://github.com/amplab/snap snap] ===
| |
| Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
| |
| | |
| snap is the best choice so far. star/hisat2 are as fast but not as good for genomic reads. accurate mappers need to do alignment anyway
| |
| | |
| Keep in mind SNAP's index is 10-15x bigger than input. 4Gbp->50GB index. SNAP loads index into memory. HISAT2 index is similar size as input
| |
| | |
| === Lessons ===
| |
| * Tophat and STAR needs index files. So if we want to run the alignment using multiple nodes, we want to first run the alignment on a single sample first. Then we can run alignment on others samples using multiple nodes.
| |
| * The index files of Tophat only depends on the reference genome. However the index files of STAR depends on both the reference genome and the read length of the data.
| |
| | |
| === Alignment Algorithms ===
| |
| * [https://biology.stackexchange.com/questions/11263/what-is-the-difference-between-local-and-global-sequence-alignments What is the difference between local and global sequence alignments?]
| |
| * [https://en.wikipedia.org/wiki/Sequence_alignment#Global_and_local_alignments Sequence alignment]
| |
| * [https://www.cs.umd.edu/class/fall2011/cmsc858s/ CMSC 858s: Computational Genomics]
| |
| * [https://en.wikipedia.org/wiki/Substitution_matrix Substitution matrix]
| |
| | |
| === Alignment free ===
| |
| * [https://www.biorxiv.org/content/early/2018/01/11/246967 Limitation of alignment-free tools in total RNA-seq quantification]
| |
| | |
| == SAMtools ==
| |
| '''SAMtools''' bundle include '''samtools''', '''bcftools''', '''bgzip''', '''tabix''', ''' wgsim''' and '''htslib'''. samtools and bcftools are based on htslib.
| |
| | |
| * https://en.wikipedia.org/wiki/SAMtools
| |
| * http://www.htslib.org/doc/samtools.html
| |
| * http://www.htslib.org/doc/samtools-1.2.html (it is strange bam2fq appears in v1.2 but not in v1.3 documentation)
| |
| * [http://biobits.org/samtools_primer.html SAMtools: Primer / Tutorial] by Ethan Cerami. It covers installing samtools, bcftools, generating simulated reads, align reads, convert sam to bam, sorting & indexing, identifying genomic variants, understand vcf format, visualize reads on a command line.
| |
| * http://davetang.org/wiki/tiki-index.php?page=SAMTools contains many recipes like counting number of mapped reads
| |
| | |
| <syntaxhighlight lang='bash'>
| |
| $ /opt/SeqTools/bin/samtools-1.3/samtools
| |
| | |
| Program: samtools (Tools for alignments in the SAM format)
| |
| Version: 1.3 (using htslib 1.3)
| |
| | |
| Usage: samtools <command> [options]
| |
| | |
| Commands:
| |
| -- Indexing
| |
| dict create a sequence dictionary file
| |
| faidx index/extract FASTA
| |
| index index alignment
| |
| | |
| -- Editing
| |
| calmd recalculate MD/NM tags and '=' bases
| |
| fixmate fix mate information
| |
| reheader replace BAM header
| |
| rmdup remove PCR duplicates
| |
| targetcut cut fosmid regions (for fosmid pool only)
| |
| addreplacerg adds or replaces RG tags
| |
| | |
| -- File operations
| |
| collate shuffle and group alignments by name
| |
| cat concatenate BAMs
| |
| merge merge sorted alignments
| |
| mpileup multi-way pileup
| |
| sort sort alignment file
| |
| split splits a file by read group
| |
| quickcheck quickly check if SAM/BAM/CRAM file appears intact
| |
| fastq converts a BAM to a FASTQ
| |
| fasta converts a BAM to a FASTA
| |
| | |
| -- Statistics
| |
| bedcov read depth per BED region
| |
| depth compute the depth
| |
| flagstat simple stats
| |
| idxstats BAM index stats
| |
| phase phase heterozygotes
| |
| stats generate stats (former bamcheck)
| |
| | |
| -- Viewing
| |
| flags explain BAM flags
| |
| tview text alignment viewer
| |
| view SAM<->BAM<->CRAM conversion
| |
| depad convert padded BAM to unpadded BAM
| |
| </syntaxhighlight>
| |
| | |
| To compile the new version (v1.x) of samtools,
| |
| <syntaxhighlight lang='bash'>
| |
| git clone https://github.com/samtools/samtools.git
| |
| git clone https://github.com/samtools/htslib.git
| |
| cd samtools
| |
| make
| |
| </syntaxhighlight>
| |
| | |
| === Multi-thread option ===
| |
| Only '''view''' and '''sort''' have multi-thread ('''-@''') option.
| |
| | |
| === samtools sort ===
| |
| A lot of temporary files will be created.
| |
| | |
| For example, if my output file is called <bwa_homo2_sort.bam>, there are 80 <bwa_homo2_sort.bam.tmp.XXXX.bam> files (each is about 210 MB) generated.
| |
| <pre>
| |
| samtools sort -@ $SLURM_CPUS_PER_TASK -O bam -n bwa_homo2.sam -o bwa_homo2_sort.bam
| |
| </pre>
| |
| | |
| === Decoding SAM flags ===
| |
| http://broadinstitute.github.io/picard/explain-flags.html
| |
| | |
| === samtools flagstat: Know how many alignment a bam file contains ===
| |
| <syntaxhighlight lang='bash'>
| |
| brb@brb-T3500:~/GSE11209$ /opt/RNA-Seq/bin/samtools-0.1.19/samtools flagstat dT_bio_s.bam
| |
| 1393561 + 0 in total (QC-passed reads + QC-failed reads)
| |
| 0 + 0 duplicates
| |
| 1393561 + 0 mapped (100.00%:-nan%)
| |
| 0 + 0 paired in sequencing
| |
| 0 + 0 read1
| |
| 0 + 0 read2
| |
| 0 + 0 properly paired (-nan%:-nan%)
| |
| 0 + 0 with itself and mate mapped
| |
| 0 + 0 singletons (-nan%:-nan%)
| |
| 0 + 0 with mate mapped to a different chr
| |
| 0 + 0 with mate mapped to a different chr (mapQ>=5)
| |
| </syntaxhighlight>
| |
| | |
| We can see how many reads have multiple alignments by comparing the number of reads and the number of alignments output from '''samtools flagstat'''.
| |
| <syntaxhighlight lang='bash'>
| |
| brb@brb-T3500:~/GSE11209$ wc -l SRR002062.fastq
| |
| 13778616 SRR002062.fastq
| |
| </syntaxhighlight>
| |
| | |
| Note: there is no multithread option in samtools flagstat.
| |
| | |
| === samtools flagstat source code ===
| |
| * https://www.biostars.org/p/12475/
| |
| * https://github.com/samtools/samtools/blob/develop/bam_stat.c
| |
| | |
| === samtools view ===
| |
| See also [[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|Anders2013-sam/bam]] and http://genome.sph.umich.edu/wiki/SAM
| |
| | |
| Note that we can use both '''-f''' and '''-F''' flags together.
| |
| <pre>
| |
| $ /opt/SeqTools/bin/samtools-1.3/samtools view -h
| |
| | |
| Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
| |
| | |
| Options:
| |
| -b output BAM
| |
| -C output CRAM (requires -T)
| |
| -1 use fast BAM compression (implies -b)
| |
| -u uncompressed BAM output (implies -b)
| |
| -h include header in SAM output
| |
| -H print SAM header only (no alignments)
| |
| -c print only the count of matching records
| |
| -o FILE output file name [stdout]
| |
| -U FILE output reads not selected by filters to FILE [null]
| |
| -t FILE FILE listing reference names and lengths (see long help) [null]
| |
| -L FILE only include reads overlapping this BED FILE [null]
| |
| -r STR only include reads in read group STR [null]
| |
| -R FILE only include reads with read group listed in FILE [null]
| |
| -q INT only include reads with mapping quality >= INT [0]
| |
| -l STR only include reads in library STR [null]
| |
| -m INT only include reads with number of CIGAR operations consuming
| |
| query sequence >= INT [0]
| |
| -f INT only include reads with all bits set in INT set in FLAG [0]
| |
| -F INT only include reads with none of the bits set in INT set in FLAG [0]
| |
| -x STR read tag to strip (repeatable) [null]
| |
| -B collapse the backward CIGAR operation
| |
| -s FLOAT integer part sets seed of random number generator [0];
| |
| rest sets fraction of templates to subsample [no subsampling]
| |
| -@, --threads INT
| |
| number of BAM/CRAM compression threads [0]
| |
| -? print long help, including note about region specification
| |
| -S ignored (input format is auto-detected)
| |
| --input-fmt-option OPT[=VAL]
| |
| Specify a single input file format option in the form
| |
| of OPTION or OPTION=VALUE
| |
| -O, --output-fmt FORMAT[,OPT[=VAL]]...
| |
| Specify output format (SAM, BAM, CRAM)
| |
| --output-fmt-option OPT[=VAL]
| |
| Specify a single output file format option in the form
| |
| of OPTION or OPTION=VALUE
| |
| -T, --reference FILE
| |
| Reference sequence FASTA FILE [null]
| |
| </pre>
| |
| | |
| '''View bam files''':
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view XXX.bam # No header
| |
| samtools view -@ 16 -h XXX.bam # include header in SAM output
| |
| samtools view -@ 16 -H xxx.bam # See the header only. The header may contains the reference genome fasta information.
| |
| </syntaxhighlight>
| |
| Note that according to [http://samtools.github.io/hts-specs/SAMv1.pdf SAM pdf], the header is optional.
| |
| | |
| '''Sam and bam files conversion''':
| |
| <syntaxhighlight lang='bash'>
| |
| /opt/RNA-Seq/bin/samtools-0.1.19/samtools view --help
| |
| | |
| # sam -> bam (need reference genome file)
| |
| samtools view -@ 16 -b XXX.fa XXX.bt2.sam > XXX.bt2.bam
| |
| | |
| # bam -> sam
| |
| samtools view -@ 16 -h -o out.sam in.bam
| |
| </syntaxhighlight>
| |
| | |
| '''Subset''':
| |
| <syntaxhighlight lang='bash'>
| |
| # assuming bam file is sorted & indexed
| |
| samtools view -@ 16 XXX.bam "chr22:24000000-25000000" | more
| |
| </syntaxhighlight>
| |
| | |
| === Remove unpaired reads ===
| |
| According to the samtools manual, the 7th column represents the chromosome name of the mate/next read. If the field is set as '*' when the information is unavailable and set as '=' if RNEXT is identical RNAME.
| |
| | |
| When we use '''samtools view''' command to manually removed certain reads from a '''pair-end''' sam/bam file, the bam file will has a [[#SAM_validation_error|problem]] when it is used in '''picard MarkDuplicates'''. A solution is to run the following line to remove these unpaired reads (so we don't need to the '''VALIDATION_STRINGENCY''' parameter)
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -h accepted_hits_bwa.bam | awk -F "\t" '$7!="*" { print $0 }' > accepted_hits_bwa2.sam
| |
| </syntaxhighlight>
| |
| | |
| === '''less''' bam files ===
| |
| http://martinghunt.github.io/2016/01/25/less-foo.bam.html
| |
| | |
| === Convert SAM to BAM ===
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -b input.sam > output.bam
| |
| # OR
| |
| samtools view -b input.sam -o output.bam
| |
| </syntaxhighlight>
| |
| There is no need to add "-S". The header will be kept in the bam file.
| |
| | |
| === Convert BAM to SAM ===
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -h input.bam -o output.sam
| |
| </syntaxhighlight>
| |
| where '-h' is to ensure the converted SAM file contains the header information. Generally, it is useful to store only sorted BAM files as they use much less disk space and are faster to process.
| |
| | |
| === Convert BAM to FASTQ ===
| |
| * http://www.metagenomics.wiki/tools/samtools/converting-bam-to-fastq
| |
| * http://seqanswers.com/forums/showthread.php?t=7061
| |
| | |
| <syntaxhighlight lang='bash'>
| |
| # Method 1: samtools
| |
| samtools bam2fq SAMPLE.bam > SAMPLE.fastq
| |
| cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_r1.fastq
| |
| cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_r2.fastq
| |
| | |
| # Method 2: picard
| |
| java -Xmx2g -jar Picard/SamToFastq.jar I=SAMPLE.bam F=SAMPLE_r1.fastq F2=SAMPLE_r2.fastq
| |
| | |
| # Method 3: bam2fastx
| |
| # http://manpages.ubuntu.com/manpages/quantal/man1/bam2fastx.1.html
| |
| | |
| # Method 4: Bedtools - bamtofastq
| |
| # http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html
| |
| # Single fastq
| |
| bedtools bamtofastq -i input.bam -fq output.fastq
| |
| # BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ
| |
| samtools sort -n input.bam -o input_sorted.bam # sort by read namem (-n)
| |
| bedtools bamtofastq -i input_sorted.bam -fq output_r1.fastq -fq2 output_r2.fastq
| |
| | |
| # Method 5: Bamtools
| |
| # http://github.com/pezmaster31/bamtools
| |
| </syntaxhighlight>
| |
| | |
| Consider the ExomeLungCancer example
| |
| <syntaxhighlight lang='bash'>
| |
| # ExomeLungCancer/test.SRR2923335_*.fastq
| |
| | |
| $ # Method 1
| |
| $ samtools bam2fq bwa.sam > SAMPLE.fastq
| |
| $ cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_r1.fastq
| |
| $ cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_r2.fastq
| |
| $ head -n 4 SAMPLE_r1.fastq
| |
| @SRR2923335.1/1
| |
| NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
| |
| +
| |
| #1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
| |
| $ head -n 4 output_r2.fastq
| |
| @SRR2923335.1/2
| |
| AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
| |
| +
| |
| CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>
| |
| | |
| $ # Method 4
| |
| $ samtools sort -n accepted_hits.bam -o accepted_sortbyname.bam
| |
| $ /opt/SeqTools/bin/bedtools2/bin/bedtools bamtofastq -i accepted_sortbyname.bam -fq output_r1.fastq -fq2 output_r2.fastq
| |
| | |
| # Compare the first read
| |
| # Original fastq file
| |
| $ head -n 4 ../test.SRR2923335_1.fastq
| |
| @SRR2923335.1 1 length=100
| |
| NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
| |
| +SRR2923335.1 1 length=100
| |
| #1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
| |
| $ head -n 4 ../test.SRR2923335_2.fastq
| |
| @SRR2923335.1 1 length=100
| |
| AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
| |
| +SRR2923335.1 1 length=100
| |
| CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>
| |
| | |
| $ ##################################################################################################
| |
| $ head -n 4 output_r1.fastq
| |
| @SRR2923335.1/1
| |
| NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
| |
| +
| |
| #1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
| |
| $ head -n 4 output_r2.fastq
| |
| @SRR2923335.1/2
| |
| AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
| |
| +
| |
| CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>
| |
| </syntaxhighlight>
| |
| | |
| === Using CRAM files ===
| |
| * CRAM files are more dense than BAM files. CRAM files are smaller than BAM by taking advantage of an additional external "reference sequence" file.
| |
| * http://www.htslib.org/workflow/#mapping_to_cram
| |
| * [https://genome.ucsc.edu/goldenPath/help/cram.html CRAM] format
| |
| * The CRAM format was used (to replace the BAM format) in 1000genome
| |
| | |
| === Extract single chromosome ===
| |
| https://www.biostars.org/p/46327/
| |
| <syntaxhighlight lang='bash'>
| |
| samtools sort accepted_hits.bam -o accepted_hits_sorted.bam
| |
| samtools index accepted_hits_sorted.bam
| |
| samtools view -h accepted_hits_sorted.bam chr22 > accepted_hits_sub.sam
| |
| </syntaxhighlight>
| |
| | |
| === Primary, secondary, supplementary alignment ===
| |
| * Multiple mapping and primary from [https://samtools.github.io/hts-specs/SAMv1.pdf#page=2 SAM format specification]
| |
| * https://www.biostars.org/p/138116/
| |
| * supplementary alignment = chimeric alignments = non-linear alignments. It's often the case that the sample we're sequencing has structural variations when compared to the reference sequence. Imagine a 100bp read. Let us suppose that the first 50bp align to chr1 and the last 50bp to chr6.
| |
| * [https://www.biostars.org/p/101533/ How to extract unique mapped results from Bowtie2 bam results?] ''If you don't extract primary or unique reads from the sam/bam, the reads that maps equally well to multiple sequences will cause a serious bias in quantification of repeated elements.''
| |
| | |
| To exact primary alignment reads ([https://www.biostars.org/p/276833/ here]), use
| |
| <pre>
| |
| samtools view -F 260 input.bam
| |
| </pre>
| |
| | |
| === Extract reads based on read IDs ===
| |
| https://www.biostars.org/p/68358/#68362
| |
| <pre>
| |
| samtools view Input.bam chrA:x-y | cut -f1 > idFile.txt
| |
| | |
| LC_ALL=C grep -w -F -f idFile.txt < in.sam > subset.sam
| |
| </pre>
| |
| | |
| === Extract mapped/unmapped reads from BAM ===
| |
| http://www.htslib.org/doc/samtools.html
| |
| <pre>
| |
| 0x1 PAIRED paired-end (or multiple-segment) sequencing technology
| |
| 0x2 PROPER_PAIR each segment properly aligned according to the aligner
| |
| 0x4 UNMAP segment unmapped
| |
| 0x8 MUNMAP next segment in the template unmapped
| |
| 0x10 REVERSE SEQ is reverse complemented
| |
| 0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
| |
| 0x40 READ1 the first segment in the template
| |
| 0x80 READ2 the last segment in the template
| |
| 0x100 SECONDARY secondary alignment
| |
| 0x200 QCFAIL not passing quality controls
| |
| 0x400 DUP PCR or optical duplicate
| |
| 0x800 SUPPLEMENTARY supplementary alignment
| |
| </pre>
| |
| * https://www.biostars.org/p/59281/ & https://www.biostars.org/p/56246/
| |
| <syntaxhighlight lang='bash'>
| |
| # get the unmapped reads from a bam file use :
| |
| samtools view -f 4 file.bam > unmapped.sam
| |
| | |
| # get the output in bam use :
| |
| samtools view -b -f 4 file.bam > unmapped.bam
| |
| | |
| # get only the mapped reads, the parameter 'F', which works like -v of grep
| |
| samtools view -b -F 4 file.bam > mapped.bam
| |
| | |
| # get only properly aligned/properly paired reads (https://www.biostars.org/p/111481/)
| |
| #
| |
| # Q: What Does The "Proper Pair" Bitwise Flag?
| |
| # A1: https://www.biostars.org/p/8318/
| |
| # Properly paired means the read itself as well as its mate are both mapped and
| |
| # they were mapped within a reasonable distance given the expected distance
| |
| # A2: https://www.biostars.org/p/12475/
| |
| # It means means both mates of a read pair map to the same chromosome, oriented towards each other,
| |
| # and with a sensible insert size.
| |
| samtools view -b -f 0x2 accepted_hits.bam > mappedPairs.bam
| |
| </syntaxhighlight>
| |
| * https://www.biostars.org/p/14518/ So for all pair-end reads where one end is mapped and the other end isn't mapped, this should output all the mapped end entries:
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -u -f 8 -F 260 map.bam > oneEndMapped.bam
| |
| </syntaxhighlight>
| |
| and this will output all the unmapped end entries:
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -u -f 4 -F 264 map.bam > oneEndUnmapped.bam
| |
| </syntaxhighlight>
| |
| | |
| === Count number of mapped/unmapped reads from BAM - samtools idxstats ===
| |
| <syntaxhighlight lang='bash'>
| |
| # count the number of mapped reads
| |
| samtools view -c -F0x4 accepted_hits.bam
| |
| 9971
| |
| # count the number of unmapped reads
| |
| samtools view -c -f0x4 accepted_hits.bam
| |
| 35
| |
| </syntaxhighlight>
| |
| The result can be checked with samtools idxstats command. See https://pepebioinformatics.wordpress.com/2014/05/29/samtools-count-mapped-and-unmapped-reads-in-bam-file/
| |
| <syntaxhighlight lang='bash'>
| |
| $ samtools sort accepted_hits.bam -o accepted_hits_sorted.bam
| |
| $ samtools index accepted_hits_sorted.bam accepted_hits_sorted.bai # BAI file is binary
| |
| $ ls -t
| |
| accepted_hits_sorted.bai accepted_hits_sorted.bam accepted_hits.bam
| |
| $ samtools idxstats accepted_hits_sorted.bam
| |
| 1 249250621 949 1
| |
| 2 243199373 807 0
| |
| 3 198022430 764 0
| |
| 4 191154276 371 1
| |
| 5 180915260 527 0
| |
| 6 171115067 411 3
| |
| 7 159138663 888 5
| |
| 8 146364022 434 0
| |
| 9 141213431 409 2
| |
| 10 135534747 408 1
| |
| 11 135006516 490 2
| |
| 12 133851895 326 1
| |
| 13 115169878 149 1
| |
| 14 107349540 399 3
| |
| 15 102531392 249 1
| |
| 16 90354753 401 4
| |
| 17 81195210 466 1
| |
| 18 78077248 99 0
| |
| 19 59128983 503 1
| |
| 20 63025520 275 1
| |
| 21 48129895 87 0
| |
| 22 51304566 228 2
| |
| X 155270560 315 1
| |
| Y 59373566 6 0
| |
| MT 16569 10 0
| |
| * 0 0 4
| |
| $ samtools idxstats accepted_hits_sorted.bam | awk '{s+=$3+$4} END {print s}'
| |
| 10006
| |
| $ samtools idxstats accepted_hits_sorted.bam | awk '{s+=$3} END {print s}'
| |
| 9971
| |
| </syntaxhighlight>
| |
| | |
| === Find unmapped reads ===
| |
| * http://seqanswers.com/forums/showthread.php?t=5787
| |
| | |
| === Find multi-mapped reads ===
| |
| http://seqanswers.com/forums/showthread.php?t=16427
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -F 4 file.bam | awk '{printf $1"\n"}' | sort | uniq -d | wc -l
| |
| # uniq -d will only print duplicate lines
| |
| </syntaxhighlight>
| |
| | |
| === Realignment and recalibration ===
| |
| https://www.biostars.org/p/141784/
| |
| | |
| # Realignment and recalibration won't change these metrics output from samtools flagstat.
| |
| # '''Recalibration is typically not needed anymore (the same goes for realignment if you're using something like GATK's haplotype caller).'''
| |
| # Realignment just locally realigns things and typically the assigned MAPQ values don't change (unmapped mates etc. also won't change).
| |
| # Recalibration typically affects only base qualities.
| |
| | |
| === htslib ===
| |
| Suppose we download a vcf.gz from [ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/ NCBI] ftp site. We want to subset chromosome 1 and index the new file.
| |
| | |
| * '''tabix''' - subset a vcf.gz file or index a vcf.gz file
| |
| <syntaxhighlight lang='bash'>
| |
| tabix -h common_all_20160601.vcf.gz 1: > common_all_20160601_1.vcf
| |
| | |
| $ tabix -h
| |
| | |
| Version: 1.3
| |
| Usage: tabix [OPTIONS] [FILE] [REGION [...]]
| |
| | |
| Indexing Options:
| |
| -0, --zero-based coordinates are zero-based
| |
| -b, --begin INT column number for region start [4]
| |
| -c, --comment CHAR skip comment lines starting with CHAR [null]
| |
| -C, --csi generate CSI index for VCF (default is TBI)
| |
| -e, --end INT column number for region end (if no end, set INT to -b) [5]
| |
| -f, --force overwrite existing index without asking
| |
| -m, --min-shift INT set minimal interval size for CSI indices to 2^INT [14]
| |
| -p, --preset STR gff, bed, sam, vcf
| |
| -s, --sequence INT column number for sequence names (suppressed by -p) [1]
| |
| -S, --skip-lines INT skip first INT lines [0]
| |
| | |
| Querying and other options:
| |
| -h, --print-header print also the header lines
| |
| -H, --only-header print only the header lines
| |
| -l, --list-chroms list chromosome names
| |
| -r, --reheader FILE replace the header with the content of FILE
| |
| -R, --regions FILE restrict to regions listed in the file
| |
| -T, --targets FILE similar to -R but streams rather than index-jumps
| |
| </syntaxhighlight>
| |
| * '''bgzip''' - create a vcf.gz file from a vcf file
| |
| <syntaxhighlight lang='bash'>
| |
| bgzip -c common_all_20160601_1.vcf > common_all_20160601_1.vcf.gz
| |
| | |
| # Indexing. Output is <common_all_20160601_1.vcf.gz.tbi>
| |
| tabix -f -p vcf common_all_20160601_1.vcf.gz
| |
| | |
| $ bgzip -h
| |
| | |
| Version: 1.3
| |
| Usage: bgzip [OPTIONS] [FILE] ...
| |
| Options:
| |
| -b, --offset INT decompress at virtual file pointer (0-based uncompressed offset)
| |
| -c, --stdout write on standard output, keep original files unchanged
| |
| -d, --decompress decompress
| |
| -f, --force overwrite files without asking
| |
| -h, --help give this help
| |
| -i, --index compress and create BGZF index
| |
| -I, --index-name FILE name of BGZF index file [file.gz.gzi]
| |
| -r, --reindex (re)index compressed file
| |
| -s, --size INT decompress INT bytes (uncompressed size)
| |
| -@, --threads INT number of compression threads to use [1]
| |
| </syntaxhighlight>
| |
| | |
| ==== hts-nlim ====
| |
| [https://www.biorxiv.org/content/biorxiv/early/2018/02/08/261735.full.pdf hts-nim: scripting high-performance genomic analyses] and [https://github.com/brentp/hts-nim source] in github.
| |
| | |
| Examples include bam filtering (bam files), read counts in regions (bam & bed files) and quality control variant call files (vcf files). https://github.com/brentp/hts-nim-tools
| |
| | |
| == [http://samstat.sourceforge.net/ SAMStat] ==
| |
| Displaying sequence statistics for next generation sequencing. SAMStat reports nucleotide composition, length distribution, base quality distribution, mapping statistics, mismatch, insertion and deletion error profiles, di-nucleotide and 10-mer over-representation. The output is a single html5 page which can be interpreted by a non-specialist.
| |
| | |
| Usage
| |
| <syntaxhighlight lang='bash'>
| |
| samstat <file.sam> <file.bam> <file.fa> <file.fq> ....
| |
| </syntaxhighlight>
| |
| For each input file SAMStat will create a single html page named after the input file name plus a dot html suffix.
| |
| | |
| == [http://broadinstitute.github.io/picard/ Picard] ==
| |
| A set of tools (in Java) for working with next generation sequencing data in the BAM (http://samtools.sourceforge.net) format.
| |
| | |
| https://github.com/broadinstitute/picard
| |
| | |
| Note that As of version 2.0.1 (Nov. 2015) Picard requires Java 1.8 (jdk8u66). The last version to support Java 1.7 was release 1.141. Use the following to check your Java version
| |
| <syntaxhighlight lang='bash'>
| |
| brb@T3600 ~/github/picard $ java -version
| |
| java version "1.7.0_101"
| |
| OpenJDK Runtime Environment (IcedTea 2.6.6) (7u101-2.6.6-0ubuntu0.14.04.1)
| |
| OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
| |
| </syntaxhighlight>
| |
| | |
| See my [https://taichimd.us/mediawiki/index.php/Linux#JRE_and_JDK Linux -> JRE and JDK] page.
| |
| | |
| Some people reported errors or complain about memory usage. See [http://seqanswers.com/forums/showthread.php?t=6204 this post] from seqanswers.com. And HTSeq python program is another option.
| |
| <syntaxhighlight lang='bash'>
| |
| sudo apt-get install ant
| |
| git clone https://github.com/broadinstitute/picard.git
| |
| cd picard
| |
| git clone https://github.com/samtools/htsjdk.git
| |
| ant -lib lib/ant package-commands # We will see 'BUILD SUCCESSFUL'
| |
| # It will create <dist/picard.jar>
| |
| </syntaxhighlight>
| |
| | |
| === [http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq SamToFastq] ===
| |
| <syntaxhighlight lang='bash'>
| |
| java jvm-args -jar picard.jar PicardCommandName OPTION1=value1 OPTION2=value2...
| |
| # For example
| |
| cd ~/github/freebayes/test/tiny
| |
| java -Xmx2g -jar ~/github/picard/dist/picard.jar SamToFastq \
| |
| INPUT=NA12878.chr22.tiny.bam \
| |
| FASTQ=tmp_1.fq \
| |
| SECOND_END_FASTQ=tmp_2.fq \
| |
| INCLUDE_NON_PF_READS=True \
| |
| VALIDATION_STRINGENCY=SILENT
| |
| wc -l tmp_1.fq
| |
| # 6532 tmp_1.fq
| |
| wc -l tmp_2.fq
| |
| # 6532 tmp_2.fq
| |
| </syntaxhighlight>
| |
| | |
| === SAM validation error; Mate Alignment start should be 0 because reference name = * ===
| |
| [https://software.broadinstitute.org/gatk/guide/article?id=7571 Errors in SAM/BAM files can be diagnosed with ValidateSamFile]
| |
| | |
| When I run picard with MarkDuplicates, AddOrReplaceReadGroups or ReorderSam parameter, I will get an error on the bam file aligned to human+mouse genomes but excluding mouse mappings:
| |
| <pre>
| |
| Ignoring SAM validation error: ERROR: Record 3953, Read name D00748:53:C8KPMANXX:7:1109:11369:5479, Mate Alignment start should be 0 because reference name = *.'
| |
| </pre>
| |
| | |
| https://www.biostars.org/p/9876/ & https://www.biostars.org/p/108148/. Many of the validation errors reported by Picard are "technically" errors, but do not have any impact for the vast majority of downstream processing.
| |
| | |
| For example,
| |
| <syntaxhighlight lang='bash'>
| |
| java -Xmx10g -jar picard.jar MarkDuplicates VALIDATION_STRINGENCY=LENIENT METRICS_FILE=MarkDudup.metrics INPUT=input.bam OUTPUT=output.bam
| |
| # OR to avoid messages
| |
| java -Xmx10g -jar picard.jar MarkDuplicates VALIDATION_STRINGENCY=SILENT METRICS_FILE=MarkDudup.metrics INPUT=input.bam OUTPUT=output.bam
| |
| </syntaxhighlight>
| |
| | |
| Another (different) error message is:
| |
| <pre>
| |
| SAM validation error: ERROR: Record 16642, Read name SRR925751.3835, First of pair flag should not be set for unpaired read.
| |
| </pre>
| |
| | |
| == [https://github.com/arq5x/bedtools2 BEDtools] ==
| |
| The bed format is similar to gtf format but with more compact representation. We can use the apt-get method to install it on Linux environment.
| |
| | |
| <syntaxhighlight lang="bash">
| |
| bedtools intersect
| |
| bedtools bamtobed
| |
| bedtools bedtobam
| |
| bedtools getfasta
| |
| bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
| |
| </syntaxhighlight>
| |
| | |
| For example,
| |
| <syntaxhighlight lang="bash">
| |
| bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | wc -l # in this case every line is one exon
| |
| bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | cut -f9 | cut -d ' ' -f2 | more
| |
| bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | cut -f9 | cut -d ' ' -f2 | sort -u | wc -l
| |
| | |
| bedtools intersect -wo -a RefSeq.bed -b XXX.bed | more # one gene is one line with multiple intervals
| |
| </syntaxhighlight>
| |
| | |
| One good use of bedtools is to find the intersection of bam and bed files (to double check the insertion/deletion/junction reads in bed files are in accepted_hits.bam). See [http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html this page]
| |
| <syntaxhighlight lang="bash">
| |
| $ bedtools --version
| |
| bedtools v2.17.0
| |
| $ bedtools intersect -abam accepted_hits.bam -b junctions.bed | samtools view - | head -n 3
| |
| </syntaxhighlight>
| |
| | |
| We can use bedtool to reverse bam files to fastq files. See https://www.biostars.org/p/152956/
| |
| | |
| === [http://bedtools.readthedocs.org/en/latest/content/installation.html Installation] ===
| |
| <syntaxhighlight lang="bash">
| |
| git clone https://github.com/arq5x/bedtools2.git
| |
| cd bedtools2
| |
| make
| |
| </syntaxhighlight>
| |
| | |
| === bamtofastq ===
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| cd ~/github/freebayes/test/tiny # Working directory
| |
| | |
| ~/github/samtools/samtools sort -n NA12878.chr22.tiny.bam NA12878.chr22.tiny.qsort
| |
| | |
| ~/github/bedtools2/bin/bamToFastq -i NA12878.chr22.tiny.qsort.bam \
| |
| -fq aln.end1.fq \
| |
| -fq2 aln.end2.fq 2> bamToFastq.warning.out
| |
| | |
| ## 60 warnings; see <bamToFastq.warning.out>
| |
| *****WARNING: Query ST-E00118:53:H02GVALXX:1:1102:1487:23724 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
| |
| | |
| wc -l tmp.out # 60
| |
| wc -l aln.end1.fq # 6548
| |
| wc -l aln.end2.fq # 6548
| |
| # recall tmp_1.fq is 6532
| |
| </pre>
| |
| | |
| == [http://cole-trapnell-lab.github.io/cufflinks/ Cufflinks package] ==
| |
| Transcriptome assembly and differential expression analysis for RNA-Seq.
| |
| | |
| ''Both Cufflinks and Cuffdiff accept SAM and BAM files as input. It is not uncommon for a single lane of Illumina HiSeq sequencing to produce FASTQ and BAM files with a combined size of 20 GB or larger. Laboratories planning to perform more than a small number of RNA-seq experiments should consider investing in robust storage infrastructure, either by purchasing their own hardware or through cloud storage services.''
| |
| | |
| === Tuxedo protocol ===
| |
| * [http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks] by Trapnell, et al 2012.
| |
| * https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite
| |
| | |
| # bowtie2 - fast alignment
| |
| # tophat2 - splice alignment (rna-seq reads, rna are spliced, introns are removed, some reads may span over 2 exons)
| |
| # cufflinks - transcript assembly & quantitation
| |
| # cuffdiff2 - differential expression
| |
| | |
| === [http://cole-trapnell-lab.github.io/cufflinks/getting_started/ Cufflinks - assemble reads into transcript] ===
| |
| Installation
| |
| <pre>
| |
| # about 47MB on ver 2.2.1 for Linux binary version
| |
| wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
| |
| sudo tar xzvf ~/Downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz -C /opt/RNA-Seq/bin/
| |
| export PATH=$PATH:/opt/RNA-Seq/bin/cufflinks-2.2.1.Linux_x86_64/
| |
| | |
| # test
| |
| cufflinks -h
| |
| </pre>
| |
| | |
| Cufflinks uses this map (done from Tophat) against the genome to assemble the reads into '''transcripts'''.
| |
| * http://homer.salk.edu/homer/basicTutorial/rnaseqCufflinks.html
| |
| <pre>
| |
| # Quantifying Known Transcripts using Cufflinks
| |
| cufflinks -o OutputDirectory/ -G refseq.gtf mappedReads.bam
| |
| | |
| # De novo Transcript Discovery using Cufflinks
| |
| cufflinks -o OutputDirectory/ mappedReads.bam
| |
| </pre>
| |
| | |
| The output files are genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf and '''transcripts.gtf'''.
| |
| | |
| It can be used to calculate FPKM.
| |
| * https://www.biostars.org/p/11378/
| |
| * http://www.partek.com/Tutorials/microarray/User_Guides/UnderstandingReads.pdf
| |
| | |
| === Cuffcompare - compares transcript assemblies to annotation ===
| |
| | |
| === Cuffmerge - merges two or more transcript assemblies ===
| |
| First create a text file <assemblies.txt>
| |
| <pre>
| |
| ./dT_bio/transcripts.gtf
| |
| ./dT_ori/transcripts.gtf
| |
| ./dT_tech/transcripts.gtf
| |
| ./RH_bio/transcripts.gtf
| |
| ./RH_ori/transcripts.gtf
| |
| ./RH_tech/transcripts.gtf
| |
| </pre>
| |
| | |
| Then run
| |
| <pre>
| |
| cd GSE11209
| |
| cuffmerge -g genes.gtf -s genome.fa assemblies.txt
| |
| </pre>
| |
| | |
| === Cuffdiff ===
| |
| Finds differentially expressed genes and transcripts/Detect differential splicing and promoter use.
| |
| | |
| Cuffdiff takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis.
| |
| | |
| Follow the [http://cufflinks.cbcb.umd.edu/tutorial.html tutorial], we can quickly test the cuffdiff program.
| |
| <pre>
| |
| $ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam
| |
| $ cufflinks ./test_data.sam
| |
| $ ls -l
| |
| total 56
| |
| -rw-rw-r-- 1 mli mli 221 2013-03-05 15:51 genes.fpkm_tracking
| |
| -rw-rw-r-- 1 mli mli 231 2013-03-05 15:51 isoforms.fpkm_tracking
| |
| -rw-rw-r-- 1 mli mli 0 2013-03-05 15:51 skipped.gtf
| |
| -rw-rw-r-- 1 mli mli 41526 2009-09-26 19:15 test_data.sam
| |
| -rw-rw-r-- 1 mli mli 887 2013-03-05 15:51 transcripts.gtf
| |
| </pre>
| |
| | |
| In real data,
| |
| <pre>
| |
| cd GSE11209
| |
| cuffdiff -p 5 -o cuffdiff_out -b genome.fa -L dT,RH \
| |
| -u merged_asm/merged.gtf \
| |
| ./dT_bio/accepted_hits.bam,./dT_ori/accepted_hits.bam,./dT_tech/accepted_hits.bam \
| |
| ./RH_bio/accepted_hits.bam,./RH_ori/accepted_hits.bam,./RH_tech/accepted_hits.bam
| |
| </pre>
| |
| | |
| '''N.B.''': the FPKM value (<genes.fpkm_tracking>) is generated per condition not per sample. If we want FPKM per sample, we want to specify one condition for each sample.
| |
| | |
| The method of constructing test statistics can be found [https://07110005642076687011.googlegroups.com/attach/51b6f3bc81fbea69/Cufflinks_manual_old.pdf?part=4&vt=ANaJVrECk0GsgeKTeGugnJolIXKsZP1M4igDfMSDjo3aR5ALy5r6aLbSBgBi-stEepWMYX2nhGKrph7RRnqm66XJMIcON8Zf-Q4WBoIraXCemBBsiBu4qJs online].
| |
| | |
| === Pipeline ===
| |
| * https://www.biostars.org/p/123993/
| |
| | |
| == [http://compbio.mit.edu/cummeRbund/ CummeRbund] ==
| |
| Plots abundance and differential expression results from Cuffdiff. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website
| |
| | |
| The tool appears on the paper [http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks] by Trapnell, et al 2012.
| |
| | |
| == Finding differentially expressed genes ==
| |
| === Big pictures ===
| |
| <pre>
| |
| BWA/Bowtie samtools
| |
| fa ---------> sam ------> sam/bam (sorted indexed, short reads), vcf
| |
| or tophat
| |
| | |
| Rsamtools GenomeFeatures edgeR (normalization)
| |
| ---------> --------------> table of counts --------->
| |
| </pre>
| |
| | |
| === Readings ===
| |
| * [http://www.rna-seqblog.com/introduction-to-rna-sequencing-and-analysis/ Introduction to RNA Sequencing and Analysis] Kukurba KR, Montgomery SB. (2015) RNA Sequencing and Analysis. Cold Spring Harb Protoc
| |
| * [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/?tool=pubmed RNA-Seq: a revolutionary tool for transcriptomics]
| |
| * Alignment and visualization from [http://mygoblet.org/training-portal/materials/rna-seq-analysis-2014-module-2-alignment-and-visualization bioinformatics.ca].
| |
| * The [http://rnaseq.uoregon.edu/ RNA-seqlopedia] from the Cresko Lab of the University of Oregon.
| |
| * [http://slideplayer.com/slide/4463815/ RNA-Seq Analysis] by Simon Andrews.
| |
| * [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2261-8 Benchmarking differential expression analysis tools for RNA-Seq: normalization-based vs. log-ratio transformation-based methods] by Thomas P. Quinn 2018. Analysis scripts (R) are included.
| |
| | |
| === Technical replicate vs biological replicate ===
| |
| We sequence more DNA from the same DNA "library", which is called a "technical replicate". If we perform a new experiment with a new organism/tissue/population of cells, which is called a "biological replicate".
| |
| | |
| === Youtube videos ===
| |
| * Analyze public dataset by using Galaxy and IGV [http://www.youtube.com/watch?v=dTRZjXuQnYU&list=UULQ1j5ge-WYUE1iBEOGpK5A David Coil from UC Davis Genoem Center]
| |
| Download the raw fastq data GSE19602 from [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19602 GEO] and uncompress fastq.bz2 to fastq (~700MB) file. NOTE: the data downloaded from ncbi is actually sra file format. We can use fastq_dump program in SRA_toolkit to convert sra to fastq. http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
| |
| <pre>
| |
| ~/Downloads/sratoolkit.2.3.2-ubuntu64/bin/fastq-dump ~/Downloads/SRR034580.sra
| |
| </pre>
| |
| | |
| If we want to run Galaxy locally, we can install it simply by 2 command lines
| |
| <pre>
| |
| hg clone https://bitbucket.org/galaxy/galaxy-dist/
| |
| cd galaxy-dist
| |
| hg update stable
| |
| </pre>
| |
| | |
| To run Galaxy locally, we do
| |
| <pre>
| |
| cd galaxy-dist
| |
| sh run.sh --reload
| |
| </pre>
| |
| The command line will show ''Starting server in PID XXXX. serving on http://127.0.0.1:8080''. We can use Ctrl + C to stop the Galaxy process.
| |
| | |
| Note: One problem with this simple instruction is we have not created a user yet.
| |
| | |
| # Upload one fastq data. Click 'refresh' icon on the history panel to ensure the data is uploaded. Don't use 'refresh' button on the browser; it will cause an error for the current process.
| |
| # '''FASTQ Groomer'''. Convert the data to Galaxy needs. NGS: QC and manipulation => Illumina FASTQ. FASTQ quality scores type: Sanger. (~10 minutes). This part uses CPU not memory.
| |
| # Open a new browser tab and go to ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/. Right click the file '''all.cDNA''' and copy link location. In Galaxy click 'Upload File from your computer' paste URL to URL/Text entry.
| |
| # Scroll down Galaxy and select NGS:Mapping -> '''Map with BWA'''. PS. for some reason, BWA is not available. So I use '''Bowtie2''' instead. The output of Bowtie2 is bam file.
| |
| ## For reference genome, choose 'Use one from the history'. Galaxy automatically find the reference file 'ftp://ftp.plantbiology....' from history.
| |
| ## Library mate-paired => Single-end.
| |
| ## FASTQ file => 2: FASTQ Groomer on data 1.
| |
| ## BWA settings to use => Commonly Used.
| |
| ## Execute (~ 15 minutes)
| |
| # We can view the alignment file (sam format) created from BWA by using UCSV or IGV (input is bam or bai format). We now use '''NGS: SAM Tools''' to convert sam file to bam file. Click 'SAM-to-BAM converts SAM format to BAM format' tool.
| |
| ## Choose the source for the reference list => History
| |
| ## Converts SAM file => 4: Map with BWA on data 2 and data 3.
| |
| ## Using reference file => 3:ftp://ftp.plantbiology.....
| |
| ## Execute (~5 minutes)
| |
| # We want to create bai file which is a shortcut to IGV. It breaks the data into smaller accessible chunks. So when you pull up a certain cDNA, it goes straight to the subset. Go to the history, click the pencil icon (Edit Attributes) on the file '''SAM-to-BAM on data 3 and data 4'''.
| |
| ## Look at 'Convert to new format' section. Go ahead and click 'Convert'. (< 1 minute). This will create another file.
| |
| ## Use browser and go to ftp website to download '''all.cDNA''' file to desktop. The desktop should contain 3 files - all.cDNA, rice.bam and rice.bai files for IGV.
| |
| # Goto http://www.broadinstitute.org/software/igv/download to download IGV which is a java-based application. I need to install java machine first by install openjdk-7-jdk. IGV by default will launch 'Human hg18' genome. Launch IGV by '''cd IGV_2.2.13;java -Xmx750m -jar igv.jar'''. I found the IGV input requires sam+bai OR bam+bai. So we need to click the pencil icon to create bai file first before we want to upload sam or bam file to IGV.
| |
| ## Goto File => Import Genome. Call it 'rice' and select 'all.cDNA' sequence file. Click 'Save' button.
| |
| ## Goto File => Upload from File => rice.bam.
| |
| ## Top right panel is cDNA
| |
| ## Middle right panel has a lot of 'boxes' which is a read. If we zoom in, we can see some read points to left (backward) while some points to right (forward). On the top is a histogram. For example, a base may be covered by a lot of reads then the histogram will show the high frequence.
| |
| ## If we keep zoom in, we can see color at the Bottom right panel. Keeping zoom in, we can see the base G, C, T, A themselves.
| |
| ## Using IGV, we can 1. examine coverage.
| |
| ## We can 2. check 'alternative splicing'. (not for this cDNA)
| |
| ## We can 3. examine SNPs for base change. If we see gray color (dark gray is hight quality read, light gray means low quality read), it means they are perfect match. If we see color, it means there is a change. For example, a read is 'C' but in fact it should be 'A'. If a case has many high quality reads, and half of them are 'G' but the reference genome shows 'A'. This is most likely a SNP. This is heterogeisity.
| |
| # '''Tophat''' - align RNA seq data to genomic DNA
| |
| ## Suppose we have use Galaxy to upload 2 data. One is SRR034580 and we have run FASTQ Groomer on data 1. The second data is SRR034584 and we also have run FASTQ Groomer on data 2. We also have uploaded reference genome sequence.
| |
| ## Goto Galaxy and find NGS: RNA Analysis => Tophat.
| |
| ## reference genome => Use one from the history
| |
| ## RNA-Seq FASTQ file => 2; FASTQ Groomer on data 1.
| |
| ## Execute. This will create 2 files. One is '''splice junctions''' and the other is '''accepted_hits'''. We queue the job and run another Tophat with the 2nd 'groomer'ed data file. We are going to work on '''accepted_hits''' file.
| |
| ## While the queue are running, we can click on 'pencil' icon on 'accepted_hits' job and run the utlity 'Convert to new format' (Bam to Bai). We should do this for both 'accepted_hits' files.
| |
| ## For some reason, the execution failed: An error occurred with this dataset: TopHat v2.0.7 TopHat v2.0.7 Error indexing reference sequence /bin/sh: 1: bowtie-build: not found.
| |
| # '''Cufflinks'''. We will estimate transcript abundance by using FPKM (RPKM).
| |
| ## SAM or BAM file of alignmed RNA-Seq reads => tophat on data 2.. accepted_hits
| |
| ## Use Reference Annotation - No (choose Yes if we want annotation. This requires GTF format. See http://genome.ucsc.edu/FAQ/FAQformat.html#format4. We don't have it for rice.)
| |
| ## Execute. This will create 3 files. Gene expression, transcript expression and assembled transcripts.
| |
| ## We also run Cufflinks for 2nd accepted_hits file. (~ 25 minutes)
| |
| # '''Cuffcompare'''. Compare one to each other.
| |
| ## GTF file produced by Cufflinks => assembled transcript from the 1st data
| |
| ## Use another GTF file produced by Cufflinks => Yes. It automatically find the other one.
| |
| ## Execute. (< 10 minutes). This will create 7 files. Transcript accuracy, tmap file & refmap flie from each assembled transcripts, combined transcripts and transcript tracking.
| |
| ## We are interested in combined transcripts file (to use in Cuffdiff).
| |
| # '''Cuffdiff'''.
| |
| ## Transcripts => combined transcripts.
| |
| ## SAM or BAM file of aligned RNA-Seq reads => 1st accepted_hits
| |
| ## SAM or BAM file or aligned RNA-Seq reads => 2nd accepted_hits
| |
| ## Execute. This will generate 11 files. Isoform expression, gene expression, TSS groups expression, CDS Expression FPKM Tracking, isoform FPKM tracking, gene FPKM tracking, TSS groups FPKM tracking, CDS FPKM tracking, splicing diff, promoters diff, CDS diff. We are interested in 'gene expression' file. We can save it and open it in Excel.
| |
| # IGV - 2 RNA-Seq datasets aligned to '''genomic''' DNA using Tophat
| |
| ## Load the reference genome rice (see above)
| |
| ## Upload from file => rice4.bam. Upload from file => rice5.bam.
| |
| ## Alternative RNA splicing.
| |
| | |
| === edX course ===
| |
| PH525.5x Case Study: RNA-seq data analysis. The course notes are forming a book. Check out https://github.com/genomicsclass/labs and http://genomicsclass.github.io/book/.
| |
| | |
| == Variant calling ==
| |
| * Variants include (See p21 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery])
| |
| ** Germline SNPs & indels
| |
| ** Somatic SNVs & indels
| |
| ** Somatic CNVs
| |
| | |
| === Overview/papers ===
| |
| * [http://www.nature.com/articles/srep17875 Systematic comparison of variant calling pipelines using gold standard personal exome variants] by Hwang 2015.
| |
| * [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/ A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data] by Heng Li.
| |
| * http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration
| |
| * http://www.slideshare.net/thomaskeane/overview-of-methods-for-variant-calling-from-nextgeneration-sequence-data-9608507
| |
| * http://www.slideshare.net/jandot/nextgeneration-sequencing-variation-discovery
| |
| * [https://www.biorxiv.org/content/early/2017/09/23/192872 Strelka2: Fast and accurate variant calling for clinical sequencing applications] Sep 2017
| |
| | |
| === Workflow ===
| |
| * Sequence reads -> Quality control -> Mapping -> Mark Duplicates -> Local realignment around INDELS -> Base quality score recalibration -> Variant calling -> Filtering & Annotation -> Querying.
| |
| * http://www.ngscourse.org/Course_Materials/variant_calling/presentation/2015-Cambridge_variant_calling.pdf <span style="color: red">Lots of pictures to explain the concepts!!</span>
| |
| * [https://informatics.sydney.edu.au/training/coursedocs/RNASeqGalaxy_Camperdown_June2018.pdf Introduction to RNA-Seq on Galaxy Analysis for differential expression] Tracy Chew
| |
| * http://seqanswers.com/wiki/How-to/exome_analysis
| |
| ** Alignment step includes
| |
| *** '''mark PCR duplicates''' some reads will be exact copies of each other. These reads are called PCR duplicates due to amplification biases in PCR. They share the same sequence and the same alignment position and could cause trouble during SNP calling as possibly some allele is overrepresented due to amplification biases. See [http://gatkforums.broadinstitute.org/gatk/discussion/6747/how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar#section3 Conceptual overview of duplicate flagging] and [http://wiki.bits.vib.be/index.php/Call_variants_with_samtools_1.0#Mark_duplicates_using_Picard_tools Call variants with samtools 1.0]
| |
| *** '''local alignment around indels''' Indels within reads often lead to false positive SNPs at the end of sequence reads. To prevent this artifact, local realignment around indels is done using local realignment tools from the Genome Analysis Tool Kit.
| |
| *** '''quality recalibration'''
| |
| ** SNP calling step includes producing raw SNP calls and filter SNPs
| |
| ** Annotation
| |
| * WGS/WES Mapping to Variant Calls http://www.htslib.org/workflow/
| |
| Step 1: Mapping
| |
| <syntaxhighlight lang='bash'>
| |
| bwa index <ref.fa>
| |
| | |
| bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' <ref.fa> <read1.fa> <read1.fa> > lane.sam
| |
| samtools fixmate -O bam <lane.sam> <lane_fixmate.bam>
| |
| samtools sort -O bam -o <lane_sorted.bam> -T </tmp/lane_temp> <lane_fixmate.sam>
| |
| </syntaxhighlight>
| |
| Step 2: Improvement
| |
| <syntaxhighlight lang='bash'>
| |
| java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <lane.bam> \
| |
| -o <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf>
| |
| java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <lane.bam> \
| |
| -targetIntervals <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf> -o <lane_realigned.bam>
| |
| | |
| java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R <ref.fa> \
| |
| -knownSites >bundle/b38/dbsnp_142.b38.vcf> -I <lane.bam> -o <lane_recal.table>
| |
| java -Xmx2g -jar GenomeAnalysisTK.jar -T PrintReads -R <ref.fa> -I <lane.bam> \
| |
| --BSQR <lane_recal.table> -o <lane_recal.bam>
| |
| | |
| java -Xmx2g -jar MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT \
| |
| INPUT=<lane_1.bam> INPUT=<lane_2.bam> INPUT=<lane_3.bam> OUTPUT=<library.bam>
| |
| | |
| samtools merge <sample.bam> <library1.bam> <library2.bam> <library3.bam>
| |
| samtools index <sample.bam>
| |
| </syntaxhighlight>
| |
| Step 3: Variant calling
| |
| <syntaxhighlight lang='bash'>
| |
| samtools mpileup -ugf <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | \
| |
| bcftools call -vmO z -o <study.vcf.gz>
| |
| </syntaxhighlight>
| |
| | |
| === Non-R Software ===
| |
| Variant detector/discovery, genotyping
| |
| | |
| * [http://www.biomedcentral.com/1471-2105/16/235# Evaluation of variant detection software for pooled next-generation sequence data]
| |
| * [http://bib.oxfordjournals.org/content/15/2/256.abstract A survey of tools for variant analysis of next-generation genome sequencing data]
| |
| * [http://www.biomedcentral.com/1471-2105/16/235 Evaluation of variant detection software for pooled next-generation sequence data]
| |
| * [http://www.sciencedirect.com/science/article/pii/S0002929713003832 Reliable Identification of Genomic Variants from RNA-Seq Data]
| |
| * [https://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html Bioinformatics at COMAV]
| |
| | |
| ==== Variant Identification ====
| |
| * [http://samtools.sourceforge.net/ Samtools]
| |
| ** https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools
| |
| ** http://petridishtalk.com/2013/02/01/variant-discovery-annotation-filtering-with-samtools-the-gatk/
| |
| ** http://lab.loman.net/2012/10/31/a-simple-guide-to-variant-calling-with-bwa-samtools-varscan2/
| |
| ** [http://ged.msu.edu/angus/tutorials-2013/snp_tutorial.html Samtools and GATK]
| |
| ** [https://www.biostars.org/p/57149/ Difference Between Samtools And Gatk Algorithms]
| |
| * [https://www.broadinstitute.org/gatk/ GATK] by BROAD Institute.
| |
| * [http://varscan.sourceforge.net/ varscan2]
| |
| * [https://github.com/chapmanb/bcbio-nextgen bcbio] - Validated, scalable, community developed variant calling and RNA-seq analysis. [http://bcb.io/2015/09/17/hg38-validation/ Validated variant calling with human genome build 38].
| |
| * [https://github.com/ekg/freebayes freebayes] and [https://github.com/ekg/alignment-and-variant-calling-tutorial a complete tutorial] from alignment to variant calling.
| |
| | |
| GUI software
| |
| * [http://snver.sourceforge.net/snvergui/ SNVerGUI]
| |
| | |
| ==== Variant Annotation ====
| |
| See [[#Variant_Annotation_2|Variant Annotation]]
| |
| | |
| === Biocoductor and R packages ===
| |
| * Bioconductor [http://www.bioconductor.org/packages/release/bioc/html/VariantTools.html VariantTools] package can export to VCF and [http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html VariantAnnotation] package can read VCF format files. See also the [http://bioconductor.org/help/workflows/variants/ Annotating Variants workflow] in Bioconductor. See also [http://bioconductor.org/packages/release/data/annotation/html/SIFT.Hsapiens.dbSNP137.html SIFT.Hsapiens.dbSNP137], [http://bioconductor.org/packages/release/data/experiment/html/COSMIC.67.html COSMIC.67], and [http://bioconductor.org/packages/release/data/annotation/html/PolyPhen.Hsapiens.dbSNP131.html PolyPhen.Hsapiens.dbSNP131] packages.
| |
| * R packages [http://cran.r-project.org/web/packages/seqminer/index.html seqminer]
| |
| | |
| ==== fundamental - VariantAnnotation ====
| |
| readVcf() can be used to read a *.vcf or *.vcf.gz file.
| |
| | |
| ==== dbSNP - SNPlocs.Hsapiens.dbSNP.20101109 ====
| |
| Compare the rs numbers of the data and dbSNP.
| |
| | |
| ==== variant wrt genes - TxDb.Hsapiens.UCSC.hg19.knownGene ====
| |
| Look for the column '''LOCATION''' which has values ''coding'', ''fiveUTR'', ''threeUTR'', ''intron'', ''intergenic'', ''spliceSite'', and ''promoter''.
| |
| | |
| ==== amino acid change for the non-synonymous variants - BSgenome.Hsapiens.UCSC.hg19 ====
| |
| Look for the column '''CONSEQUENCE''' which has values ''synonymous'' or ''nonsynonymous''.
| |
| | |
| ==== SIFT & Polyphen for predict the impact of amino acid substitution on a human protein - PolyPhen.Hsapiens.dbSNP131 ====
| |
| Look for the column '''PREDICTION''' which has values ''possibly damaging'' or ''benign''.
| |
| | |
| ==== rentrez tutorial ====
| |
| https://ropensci.org/tutorials/rentrez_tutorial.html
| |
| | |
| ==== Accessing and Manipulating Biological Databases Exercises ====
| |
| http://www.r-exercises.com/2017/05/04/accessing-and-manipulating-biological-databases-exercises-part-2/? utm_source=rss&utm_medium=rss&utm_campaign=accessing-and-manipulating-biological-databases-exercises-part-2
| |
| | |
| === vcf ===
| |
| ==== [http://www.1000genomes.org/wiki/Analysis/vcf4.0 vcf format] ====
| |
| * One row is one variant or one position/nucleotide in genome.
| |
| * Mapping quality is called MQ on the INFO column of a '''vcf''' file. It is on the 5th column (no header) or the MQ field on the last column of a '''sam''' file.
| |
| ** MQ denotes [https://en.wikipedia.org/wiki/Root_mean_square RMS] mapping quality in '''vcf'''.
| |
| ** MapQ = -10 log10(P) where P = probability that this mapping is NOT the correct one.
| |
| ** https://en.wikipedia.org/wiki/Phred_quality_score. If Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000 (1/10^(MAPQ/10)=1/10^3)
| |
| ** When MAPQ=0, it means that the read maps to two or more places with equal score. See [https://www.biostars.org/p/4015/ here].
| |
| ** When MAPQ=255, it means the mapping quality is not available. See https://biostar.usegalaxy.org/p/4077/
| |
| ** http://davetang.org/muse/2011/09/14/mapping-qualities/
| |
| ** [http://drive5.com/usearch/manual/quality_score.html Fastq] file
| |
| * variant category (snp, insertion, deletion, mixed) http://www.1000genomes.org/node/101
| |
| * [http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it What is a VCF and how should I interpret it?] from broad.
| |
| * See also [[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|SAM/BAM format]].
| |
| | |
| Below is an example of each record/row
| |
| {| class="wikitable"
| |
| | 1
| |
| | #CHROM
| |
| | 2
| |
| |-
| |
| | 2
| |
| | POS
| |
| | 14370
| |
| |-
| |
| | 3
| |
| | ID
| |
| | rs6054257 or .
| |
| |-
| |
| | 4
| |
| | REF
| |
| | G
| |
| |-
| |
| | 5
| |
| | ALT
| |
| | A
| |
| |-
| |
| | 6
| |
| | QUAL
| |
| | 29
| |
| |-
| |
| | 7
| |
| | FILTER
| |
| | PASS or .
| |
| |-
| |
| | 8
| |
| | INFO
| |
| | NS=3;DP=14;AF=0.5;DB;H2
| |
| |-
| |
| | opt
| |
| | FORMAT
| |
| | GT:AD:DP:GQ:PL
| |
| |-
| |
| | opt
| |
| | NA00001
| |
| | 0/1:3,2:5:34:34,0,65
| |
| |}
| |
| where the meanings of GT(genotype), AD(Allelic depths for the ref and alt alleles in the order listed), HQ (haplotype quality), GQ (genotype quality)... can be found in the header of the VCF file.
| |
| | |
| ==== Examples ====
| |
| See the [https://github.com/arraytools/brb-seqtools/tree/master/testdata/GSE48215subset samtools and GATK output from GSE48215subset] data included in BRB-SeqTools.
| |
| | |
| ==== Count number of rows ====
| |
| <syntaxhighlight lang='bash'>
| |
| grep -cv "#" vcffile
| |
| </syntaxhighlight>
| |
| | |
| ==== Filtering ====
| |
| * [https://www.biostars.org/p/44378/ Filtering A Sam File For Quality Scores]. For example, to filter out reads with MAQP < 30 in SAM/BAM files
| |
| <syntaxhighlight lang='bash'>
| |
| samtools view -b -q 30 input.bam > filtered.bam
| |
| </syntaxhighlight>
| |
| * To filter vcf based on MQ, use for example,
| |
| <syntaxhighlight lang='bash'>
| |
| bcftools filter -i"QUAL >= 20 && DP >= 5 && MQ >= 30" INPUTVCF > OUTPUTVCF
| |
| </syntaxhighlight>
| |
| | |
| ==== Open with LibreOffice Calc ====
| |
| Use delimiter 'Tab' only and uncheck the others.
| |
| | |
| It will then get the correct header #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (separated by ;), FORMAT (separated by :), SampleID (separated by :) where INFO field contains site-level annotations and FORMAT&SampleID fields contain sample-level annotations.
| |
| | |
| ==== Count number of records, snps and indels ====
| |
| * [http://vcftools.sourceforge.net/documentation.html#file vcftools].
| |
| * https://www.biostars.org/p/49386/, https://www.biostars.org/p/109086/, https://www.biostars.org/p/50923/, https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools.
| |
| | |
| * Summary statistics
| |
| <syntaxhighlight lang='bash'>
| |
| bcftools stats XXX.vcf
| |
| | |
| # SN, Summary numbers:
| |
| # SN [2]id [3]key [4]value
| |
| SN 0 number of samples: 0
| |
| SN 0 number of records: 71712
| |
| SN 0 number of no-ALTs: 0
| |
| SN 0 number of SNPs: 65950
| |
| SN 0 number of MNPs: 0
| |
| SN 0 number of indels: 5762
| |
| SN 0 number of others: 0
| |
| SN 0 number of multiallelic sites: 0
| |
| SN 0 number of multiallelic SNP sites: 0
| |
| </syntaxhighlight>
| |
| * Remove the header
| |
| <syntaxhighlight lang='bash'>
| |
| grep -v "^#" input.vcf > output.vcf
| |
| </syntaxhighlight>
| |
| * Count number of records
| |
| <syntaxhighlight lang='bash'>
| |
| grep -c -v "^#" XXX.vcf # -c means count, -v is invert search
| |
| </syntaxhighlight>
| |
| | |
| ==== [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php VariantsToTable] tool from Broad GATK ====
| |
| * https://www.broadinstitute.org/gatk/blog?id=7089
| |
| | |
| ==== [https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html VariantAnnotation] package ====
| |
| * [http://stackoverflow.com/questions/21598212/extract-sample-data-from-vcf-files readVcf(), ScanVcfParam() and readInfo()] functions.
| |
| * [https://www.bioconductor.org/help/workflows/variants/ scanVcfHeader() and scanVcfParam()] functions.
| |
| | |
| To install the R package, first we need to install required software in Linux
| |
| <syntaxhighlight lang='bash'>
| |
| sudo apt-get update
| |
| sudo apt-get install libxml2-dev
| |
| sudo apt-get install libcurl4-openssl-dev
| |
| </syntaxhighlight>
| |
| and then
| |
| <syntaxhighlight lang='bash'>
| |
| source("http://bioconductor.org/biocLite.R")
| |
| biocLite("VariantAnnotation")
| |
| | |
| library(VariantAnnotation)
| |
| vcf <- readVcf("filename.vcf", "hg19")
| |
| # Header
| |
| header(vcf)
| |
| | |
| # Sample names
| |
| samples(header(vcf))
| |
| | |
| # Geno
| |
| geno(header(vcf))
| |
| | |
| # Genomic positions
| |
| head(rowRanges(vcf), 3)
| |
| ref(vcf)[1:5]
| |
| # Variant quality
| |
| qual(vcf)[1:5]
| |
| alt(vcf)[1:5]
| |
| | |
| # INFO
| |
| info(vcf)[1:3, ]
| |
| # Get the Depth (DP)
| |
| hist(info(vcf)$DP, xlab="DP")
| |
| summary(info(vcf)$DP, xlab="DP")
| |
| # Get the Mapping quality (MQ/MapQ)
| |
| hist(info(vcf)$MQ, xlab="MQ")
| |
| | |
| </syntaxhighlight>
| |
| | |
| Example:
| |
| | |
| We first show how to use linux command line to explore the VCF file. Then we show how to use the VariantAnnotation package.
| |
| <syntaxhighlight lang='bash'>
| |
| # The vcf file is created from running samtools on SRR1656687.fastq
| |
| | |
| nskip=$(grep "^#" ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf | wc -l)
| |
| echo $nskip # 53
| |
| awk 'NR==53, NR==53' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf
| |
| #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam
| |
| | |
| nall=$(cat ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf | wc -l)
| |
| echo $nall # 302174
| |
| | |
| # Generate <variantsOnly.vcf> which contains only the variant body
| |
| awk 'NR==54, NR==302174' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf > variantsOnly.vcf
| |
| | |
| # Generate <infoField.txt> which contains only the INFO field
| |
| cut -f 8 variantsOnly.vcf > infoField.txt
| |
| | |
| # Generate the final product
| |
| head -n 1 infoField.txt
| |
| DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22
| |
| head -n 1 infoField.txt | cut -f1- -d ";" --output-delimiter " "
| |
| ## Method 1: use cut
| |
| ### Problem: the result is not a table since not all fields appear in all variants.
| |
| cut -f1- -d ";" --output-delimiter=$'\t' infoField.txt > infoFieldTable.txt
| |
| ## Method 2: use sed
| |
| ### Problem: same as Method 1.
| |
| sed 's/;/\t/g' infoField.txt > infoFieldTable2.txt
| |
| ### OR if we want to replace the original file
| |
| sed -i 's/;/\t/g' infoField.txt
| |
| | |
| $ head -n 3 infoField.txt
| |
| DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22
| |
| DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11
| |
| DP=5;VDB=0.0481133;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=1,0,2,1;MQ=26
| |
| | |
| $ grep "^##INFO=<ID=" BMBC2_liver3_IMPACT_raw.vcf
| |
| ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
| |
| ##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
| |
| ##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
| |
| ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
| |
| ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
| |
| ##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
| |
| ##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
| |
| ##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
| |
| ##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
| |
| ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
| |
| ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
| |
| ##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
| |
| ##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
| |
| ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
| |
| ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
| |
| ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
| |
| ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
| |
| odroid@odroid:~/Downloads$
| |
| | |
| $ head -n2 variantsOnly.vcf
| |
| 1 10285 . T C 19.3175 . DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22 GT:PL 1/1:46,2,0
| |
| 1 10333 . C T 14.9851 . DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11 GT:PL 1/1:42,12,0
| |
| | |
| $ awk 'NR==53, NR==58' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf
| |
| #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam
| |
| 1 10285 . T C 19.3175 . DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22 GT:PL 1/1:46,2,0
| |
| 1 10333 . C T 14.9851 . DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11 GT:PL 1/1:42,12,0
| |
| 1 16257 . G C 24.566 . DP=5;VDB=0.0481133;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=1,0,2,1;MQ=26 GT:PL 0/1:57,0,10
| |
| 1 16534 . C T 13.9287 . DP=5;VDB=0.87268;SGB=-0.556411;MQ0F=0.4;AC=2;AN=2;DP4=0,0,4,0;MQ=11 GT:PL 1/1:41,12,0
| |
| 1 16571 . G A 12.4197 . DP=7;VDB=0.560696;SGB=-0.556411;RPB=1;MQB=1;MQSB=0;BQB=1;MQ0F=0.714286;AC=2;AN=2;DP4=1,0,2,2;MQ=8 GT:PL 1/1:39,8,0
| |
| odroid@odroid:~/Downloads$
| |
| </syntaxhighlight>
| |
| As you can see it is not easy to create a table from the INFO field.
| |
| | |
| Now we use the VariantAnnotation package.
| |
| <syntaxhighlight lang='rsplus'>
| |
| > library("VariantAnnotation")
| |
| > v=readVcf("BMBC2_liver3_IMPACT_raw.vcf", "hg19")
| |
| > v
| |
| class: CollapsedVCF
| |
| dim: 302121 1
| |
| rowRanges(vcf):
| |
| GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
| |
| info(vcf):
| |
| DataFrame with 17 columns: INDEL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, ...
| |
| info(header(vcf)):
| |
| Number Type Description
| |
| INDEL 0 Flag Indicates that the variant is an INDEL.
| |
| IDV 1 Integer Maximum number of reads supporting an indel
| |
| IMF 1 Float Maximum fraction of reads supporting an indel
| |
| DP 1 Integer Raw read depth
| |
| VDB 1 Float Variant Distance Bias for filtering splice-site arte...
| |
| RPB 1 Float Mann-Whitney U test of Read Position Bias (bigger is...
| |
| MQB 1 Float Mann-Whitney U test of Mapping Quality Bias (bigger ...
| |
| BQB 1 Float Mann-Whitney U test of Base Quality Bias (bigger is ...
| |
| MQSB 1 Float Mann-Whitney U test of Mapping Quality vs Strand Bia...
| |
| SGB 1 Float Segregation based metric.
| |
| MQ0F 1 Float Fraction of MQ0 reads (smaller is better)
| |
| ICB 1 Float Inbreeding Coefficient Binomial test (bigger is better)
| |
| HOB 1 Float Bias in the number of HOMs number (smaller is better)
| |
| AC A Integer Allele count in genotypes for each ALT allele, in th...
| |
| AN 1 Integer Total number of alleles in called genotypes
| |
| DP4 4 Integer Number of high-quality ref-forward , ref-reverse, al...
| |
| MQ 1 Integer Average mapping quality
| |
| geno(vcf):
| |
| SimpleList of length 2: GT, PL
| |
| geno(header(vcf)):
| |
| Number Type Description
| |
| GT 1 String Genotype
| |
| PL G Integer List of Phred-scaled genotype likelihoods
| |
| >
| |
| > header(v)
| |
| class: VCFHeader
| |
| samples(1): dedup.bam
| |
| meta(2): META contig
| |
| fixed(2): FILTER ALT
| |
| info(17): INDEL IDV ... DP4 MQ
| |
| geno(2): GT PL
| |
| >
| |
| > dim(info(v))
| |
| [1] 302121 17
| |
| > info(v)[1:4, ]
| |
| DataFrame with 4 rows and 17 columns
| |
| INDEL IDV IMF DP VDB RPB
| |
| <logical> <integer> <numeric> <integer> <numeric> <numeric>
| |
| 1:10285_T/C FALSE NA NA 4 0.6400000 1
| |
| 1:10333_C/T FALSE NA NA 6 0.3256920 NA
| |
| 1:16257_G/C FALSE NA NA 5 0.0481133 1
| |
| 1:16534_C/T FALSE NA NA 5 0.8726800 NA
| |
| MQB BQB MQSB SGB MQ0F ICB
| |
| <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
| |
| 1:10285_T/C 1 1 NA -0.453602 0.000000 NA
| |
| 1:10333_C/T NA NA NA -0.556411 0.333333 NA
| |
| 1:16257_G/C 1 1 1 -0.511536 0.000000 1
| |
| 1:16534_C/T NA NA NA -0.556411 0.400000 NA
| |
| HOB AC AN DP4 MQ
| |
| <numeric> <IntegerList> <integer> <IntegerList> <integer>
| |
| 1:10285_T/C NA 2 2 0,1,0,... 22
| |
| 1:10333_C/T NA 2 2 0,0,0,... 11
| |
| 1:16257_G/C 0.5 1 2 1,0,2,... 26
| |
| 1:16534_C/T NA 2 2 0,0,4,... 11
| |
| > write.table(info(v)[1:4,], file="infoFieldTable3.txt", sep="\t", quote=F)
| |
| > head(rowRanges(v), 3)
| |
| GRanges object with 3 ranges and 5 metadata columns:
| |
| seqnames ranges strand | paramRangeID REF
| |
| <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
| |
| 1:10285_T/C 1 [10285, 10285] * | <NA> T
| |
| 1:10333_C/T 1 [10333, 10333] * | <NA> C
| |
| 1:16257_G/C 1 [16257, 16257] * | <NA> G
| |
| ALT QUAL FILTER
| |
| <DNAStringSetList> <numeric> <character>
| |
| 1:10285_T/C C 19.3175 .
| |
| 1:10333_C/T T 14.9851 .
| |
| 1:16257_G/C C 24.5660 .
| |
| -------
| |
| seqinfo: 25 sequences from hg19 genome
| |
| > qual(v)[1:5]
| |
| [1] 19.3175 14.9851 24.5660 13.9287 12.4197
| |
| > alt(v)[1:5]
| |
| DNAStringSetList of length 5
| |
| [[1]] C
| |
| [[2]] T
| |
| [[3]] C
| |
| [[4]] T
| |
| [[5]] A
| |
| | |
| > q('no')
| |
| </syntaxhighlight>
| |
| | |
| ==== [https://cran.r-project.org/web/packages/vcfR/index.html vcfR] package ====
| |
| | |
| ==== Filtering strategy ====
| |
| * https://support.bioconductor.org/p/62817/ using '''QUAL''' and '''DP'''.
| |
| | |
| Note that the '''QUAL''' (variant quality score) values can go very large. For example in GSE48215, samtools gives QUAL a range of [3,228] but gatk gives a range of [3, 10042]. [http://gatkforums.broadinstitute.org/wdl/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it This post] says '''QUAL is not often a very useful property for evaluating the quality of a variant call'''.
| |
| <syntaxhighlight lang='rsplus'>
| |
| # vcfs is vcf file ran by samtools using GSE48215subset
| |
| # vcfg is vcf file ran by gatk using GSE48215subset
| |
| > summary(vcfs)
| |
| Length Class Mode
| |
| 1604 CollapsedVCF S4
| |
| > summary(vcfg)
| |
| Length Class Mode
| |
| 868 CollapsedVCF S4
| |
| | |
| > summary(qual(vcfs))
| |
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| |
| 3.013 10.580 23.650 45.510 45.400 228.000
| |
| > summary(qual(vcfg))
| |
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| |
| 3.98 21.77 47.74 293.90 116.10 10040.00
| |
| | |
| > range(info(vcfs)$DP)
| |
| [1] 1 336
| |
| > range(info(vcfg)$DP)
| |
| [1] 1 317
| |
| | |
| > range(info(vcfs)$MQ)
| |
| [1] 4 60
| |
| > range(info(vcfg)$MQ)
| |
| [1] 21.00 65.19
| |
| | |
| # vcfg2 is vcf file ran by gatk using GSE48215
| |
| > vcfg2 <- readVcf("~/GSE48215/outputgcli/bt20_raw.vcf", "hg19")
| |
| > summary(vcfg2)
| |
| Length Class Mode
| |
| 506851 CollapsedVCF S4
| |
| > range(qual(vcfg2))
| |
| [1] 0.00 73662.77
| |
| > summary(qual(vcfg2))
| |
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| |
| 0.00 21.77 21.77 253.70 62.74 73660.00
| |
| | |
| > range(info(vcfg2)$DP)
| |
| [1] 0 3315
| |
| > summary(info(vcfg2)$DP)
| |
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| |
| 0.00 2.00 2.00 14.04 4.00 3315.00
| |
| > range(info(vcfg2)$MQ)
| |
| [1] NaN NaN
| |
| > range(info(vcfg2)$MQ, na.rm=T)
| |
| [1] 20 70
| |
| </syntaxhighlight>
| |
| | |
| * vaf and '''AD''' field (stands for '''allele depth'''/number of reads, 2 values for each '''genotype''' per sample, '''reference:alternative''').
| |
| | |
| These two values will usually, but not always sum to the DP value. '''Reads that are not used for calling are not counted in the DP measure, but are included in AD'''. See [https://www.biostars.org/p/15675/ Question: Understanding Vcf File Format]
| |
| | |
| An example
| |
| <pre>
| |
| GT:AD:DP:GQ:PL 0/1:1,2:3:30:67,0,30
| |
| </pre>
| |
| | |
| R code to compute VAF by using AD information only (AD = 1,2 in this case)
| |
| <syntaxhighlight lang='rsplus'>
| |
| require(VariantAnnotation)
| |
| vcf <- readVcf(file, refg)
| |
| vaf <- sapply(geno(vcf)$AD, function(x) x[2]/sum(x))
| |
| </syntaxhighlight>
| |
| | |
| ==== extract allele frequency ====
| |
| https://en.wikipedia.org/wiki/Minor_allele_frequency
| |
| | |
| Calculate allele frequency from vcf files.
| |
| * https://gatkforums.broadinstitute.org/gatk/discussion/6202/vcf-file-and-allele-frequency. In the code below, vaf is calculated from AD and DP was obtained directly from DP field.
| |
| <syntaxhighlight lang='rsplus'>
| |
| readvcf2 <- function(file, refg="hg19") {
| |
| vcf <- readVcf(file, refg)
| |
| if (!is.null(geno(vcf)$DP) && ncol(geno(vcf)$DP) > 1) stop("More than one sample were found!")
| |
| # http://gatkforums.broadinstitute.org/gatk/discussion/6202/vcf-file-and-allele-frequency
| |
| # READ AD
| |
| if ("AD" %in% names(geno(vcf))) {
| |
| # note it is possible DP does not exist but AD exists (mutect2 output)
| |
| # vaf <- sapply(geno(vcf)$AD, function(x) x[2]) / geno(vcf)$DP
| |
| # vaf <- vaf[, 1]
| |
| vaf <- sapply(geno(vcf)$AD, function(x) x[2]/sum(x))
| |
| } else {
| |
| vaf <- NULL
| |
| }
| |
| # READ DP
| |
| if (is.null(geno(vcf)$DP) ||
| |
| (is.na(geno(vcf)$DP[1]) && is.null(info(vcf)$DP)) ||
| |
| (is.na(geno(vcf)$DP[1]) && is.na(info(vcf)$DP[1]))) {
| |
| cat("Warning: no DP information can be retrieved!\n")
| |
| dp <- NULL
| |
| } else {
| |
| if (!is.null(geno(vcf)$DP) && !is.na(geno(vcf)$DP[1])) {
| |
| dp <- geno(vcf)$DP
| |
| if (is.matrix(dp)) dp <- dp[, 1]
| |
| } else {
| |
| dp <- info(vcf)$DP
| |
| }
| |
| }
| |
| if (! "chr" %in% substr(unique(seqnames(vcf)), 1, 3)) {
| |
| chr <- paste('chr', as.character(seqnames(vcf)), sep='')
| |
| } else {
| |
| chr <- as.character(seqnames(vcf))
| |
| }
| |
| strt <- paste(chr, start(vcf)) # ?BiocGenerics::start
| |
| return(list(dp=dp, vaf=vaf, start=strt, chr=unique(as.character(seqnames(vcf)))))
| |
| }
| |
| </syntaxhighlight>
| |
| * http://samtools.github.io/hts-specs/VCFv4.1.pdf, http://samtools.github.io/hts-specs/VCFv4.2.pdf
| |
| * [https://vcftools.github.io/documentation.html vcftools]
| |
| <pre>
| |
| vcftools --vcf file.vcf --freq --out output # No DP, No AD. No useful information
| |
| vcftools --vcf file.vcf --freq -c > output # same as above
| |
| </pre>
| |
| | |
| ==== How can I extract only insertions from a VCF file? ====
| |
| https://bioinformatics.stackexchange.com/questions/769/how-can-i-extract-only-insertions-from-a-vcf-file
| |
| | |
| ==== subset vcf file ====
| |
| Using [[#htslib:_bgzip_and_tabix|tabix]].
| |
| | |
| ==== Adding/removing 'chr' to/from vcf files ====
| |
| https://www.biostars.org/p/98582/
| |
| | |
| <pre>
| |
| awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf # Not enough
| |
| | |
| awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*)/,m)) print m[1]"chr"m[2]; else print $0 }' no_chr.vcf > with_chr.vcf
| |
| </pre>
| |
| | |
| === SAMtools (samtools, bcftools, htslib) ===
| |
| * http://samtools.sourceforge.net/mpileup.shtml
| |
| * http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html. Li's paper is available in [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/ NCBI PMC].
| |
| * [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/ A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data] by Heng Li.
| |
| * https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf
| |
| | |
| <syntaxhighlight lang='bash'>
| |
| export seqtools_samtools_PATH=/opt/SeqTools/bin/samtools-1.3:/opt/SeqTools/bin/samtools-1.3/misc
| |
| export PATH=$seqtools_samtools_PATH:$PATH
| |
| | |
| samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted
| |
| samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai
| |
| samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \
| |
| TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
| |
| </syntaxhighlight>
| |
| where '-u' in mpileup means uncompressed and '-f' means faidx indexed reference sequence file. If we do not use pipe command, the output from samtools mpileup is a bcf file which can be viewed by using 'bcftools view XXX.bcf | more' command.
| |
| | |
| Note that the ''samtools mpileup'' can be used in different ways. For example,
| |
| <syntaxhighlight lang='bash'>
| |
| samtools mpileup -f XXX.fa XXX.bam > XXX.mpileup
| |
| samtools mpileup -v -u -f XXX.fa XXX.bam > XXX.vcf
| |
| samtools mpileup -g -f XXX.fa XXX.bam > XXX.bcf
| |
| </syntaxhighlight>
| |
| | |
| ==== [http://samtools.github.io/bcftools/bcftools.html bcftools] ====
| |
| bcftools — utilities for variant calling and manipulating VCFs and their binary counterparts BCFs. bcftools was one of components in '''SAMtools''' software (not anymore, see http://www.htslib.org/download/)
| |
| | |
| Installation
| |
| <syntaxhighlight lang="bash">
| |
| wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2
| |
| sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/
| |
| cd /opt/RNA-Seq/bin/bcftools-1.2/
| |
| sudo make # create bcftools, plot-vcfstats, vcfutils.pl commands
| |
| </syntaxhighlight>
| |
| | |
| Example: Add or remove or update annotations. http://samtools.github.io/bcftools/bcftools.html
| |
| <syntaxhighlight lang="bash">
| |
| # Remove three fields
| |
| bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
| |
| | |
| # Remove all INFO fields and all FORMAT fields except for GT and PL
| |
| bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
| |
| | |
| # Add ID, QUAL and INFO/TAG, not replacing TAG if already present
| |
| bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
| |
| | |
| # Update 'ID' column in VCF file, https://www.biostars.org/p/227652/
| |
| # Note that the column header CHROM,FROM,.. only needs to appear in input.vcf.gz;
| |
| # they may not appear in the annotation file
| |
| # For vcf files, there is a comment sign '#' on the header line containing CHROM,FROM,...
| |
| bcftools annotate -c CHROM,FROM,TO,ID,INFO/MLEAC,INFO/MLEAF -a annotation.vcf.gz -o output.vcf input.vcf.gz
| |
| | |
| # Carry over all INFO and FORMAT annotations except FORMAT/GT
| |
| bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
| |
| | |
| # Annotate from a tab-delimited file with six columns (the fifth is ignored),
| |
| # first indexing with tabix. The coordinates are 1-based.
| |
| tabix -s1 -b2 -e2 annots.tab.gz
| |
| bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
| |
| | |
| # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
| |
| tabix -s1 -b2 -e3 annots.tab.gz
| |
| bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG inut.vcf
| |
| </syntaxhighlight>
| |
| | |
| Example: variant call
| |
| <syntaxhighlight lang="bash">
| |
| samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted
| |
| samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai
| |
| samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \
| |
| TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
| |
| # Or
| |
| samtools mpileup -g -f XXX.fa XXX.bam > sample.bcf
| |
| bcftools call -v -m -O z -o var.raw.vcf.gz sample.bcf
| |
| zcat var.raw.vcf.gz | more
| |
| zcat var.raw.vcf.gz | grep -v "^#" | wc -l
| |
| samtools tview -p 17:1234567 XXX.bam XXX.fa | more # IGV alternative
| |
| </syntaxhighlight>
| |
| where '-v' means exports variants only, '-m' for multiallelic-caller and '-O z' means compressed vcf format.
| |
| | |
| Example: count the number of snps, indels, et al in the vcf file, use
| |
| <syntaxhighlight lang="bash">
| |
| bcftools stats xxx.vcf | more
| |
| </syntaxhighlight>
| |
| | |
| Example: filter based on variant quality, depth, mapping quality
| |
| <syntaxhighlight lang="bash">
| |
| bcftools filter -i"QUAL >= 20 && DP >= 5 && MQ >= 60" inputVCF > output.VCF
| |
| </syntaxhighlight>
| |
| where '-i' include only sites for which EXPRESSION is true.
| |
| | |
| Example: change the header (dedup.bam -> dedup), use
| |
| <syntaxhighlight lang="bash">
| |
| $ grep dedup.bam dT_ori_raw.vcf
| |
| ##samtoolsCommand=samtools mpileup -go temp.bcf -uf /home/brb/GSE11209-master/annotation/genome.fa dedup.bam
| |
| #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam
| |
| $ echo dedup > sampleName
| |
| $ bcftools reheader -s sampleName dT_ori_raw.vcf -o dT_ori_raw2.vcf
| |
| $ grep dedup.bam dT_ori_raw2.vcf
| |
| ##samtoolsCommand=samtools mpileup -go temp.bcf -uf /home/brb/GSE11209-master/annotation/genome.fa dedup.bam
| |
| $ diff dT_ori_raw.vcf dT_ori_raw2.vcf
| |
| 45c45
| |
| < #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam
| |
| ---
| |
| > #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup
| |
| </syntaxhighlight>
| |
| | |
| What bcftools commands are used in BRB-SeqTools?
| |
| * bcftools filter: apply fixed-threshold filters.
| |
| <pre>bcftools filter -i "QUAL >= 1 && DP >= 60 && MQ >= 1" INPUT.vcf > filtered.vcf</pre>
| |
| * bcftools norm: Left-align and normalize indels. http://annovar.openbioinformatics.org/en/latest/articles/VCF/
| |
| <pre>
| |
| bcftools norm -m-both -o splitted.vcf filtered.vcf
| |
| bcftools norm -f genomeRef -o leftnormalized.vcf splitted.vcf
| |
| </pre>
| |
| * bcftools annotate: add/remove or update annotations
| |
| <pre>
| |
| bcftools annotate -c ID -a dbSNPVCF leftnormalized.vcf.gz > dbsnp_anno.vcf
| |
| bcftools annotate -c ID,+GENE -a cosmicVCF dbsnp_anno.vcf.gz > cosmic_dbsnp.vcf
| |
| # +GENE: add annotations without overwriting existing values
| |
| # In this case, it is likely GENE does not appear in dbsnp_anno.vcf.gz
| |
| # It is better +INFO/GENE instead of GENE in '-c' parameter.
| |
| </pre>
| |
| * bcftools query: Extracts fields from VCF or BCF files and outputs them in user-defined format.
| |
| <pre>
| |
| bcftools query -f '%INFO/AC\n' input.vcf > AC.txt
| |
| bcftools query -f '%INFO/MLEAC\n' input.vcf > MLEAC.txt
| |
| </pre>
| |
| | |
| ==== [http://www.htslib.org/doc/tabix.html htslib]: bgzip and tabix ====
| |
| * bgzip – Block compression/decompression utility. The output file .gz is in a binary format.
| |
| * tabix – Generic indexer for TAB-delimited genome position files. The output file tbi is in a binary format.
| |
| | |
| Installation
| |
| <syntaxhighlight lang="bash">
| |
| wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2
| |
| sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/
| |
| cd /opt/RNA-Seq/bin/htslib-1.2.1/
| |
| sudo make # create tabix, htsfile, bgzip commands
| |
| </syntaxhighlight>
| |
| | |
| Example
| |
| <syntaxhighlight lang="bash">
| |
| export PATH=/opt/SeqTools/bin/samtools-1.3/htslib-1.3:$PATH
| |
| export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH
| |
| # zip and index
| |
| bgzip -c var.raw.vcf > var.raw.vcf.gz # var.raw.vcf will not be kept
| |
| tabix var.raw.vcf.gz # create var.raw.vcf.ga.tbi, index vcf files (very fast in this step)
| |
| bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf # 2 min
| |
| | |
| # subset based on chromosome 1, include 'chr' and position range if necessary
| |
| tabix -h var.raw.vcf.gz 1: > chr1.vcf
| |
| tabix -h var.raw.vcf.gz chr1:10,000,000-20,000,000
| |
| | |
| # subset a vcf file using a bed file
| |
| bgzip -c raw.vcf > raw.vcf.gz # without "-c", the original vcf file will not be kept
| |
| tabix -p vcf raw.vcf.gz # create tbi file
| |
| tabix -R test.bed raw.vcf.gz > testout.vcf
| |
| </syntaxhighlight>
| |
| | |
| === GATK (Java) ===
| |
| * Source code is at [https://github.com/broadgsa/gatk-protected/ github] (up to 3.8) and building instruction https://gatkforums.broadinstitute.org/gatk/discussion/comment/8443.
| |
| * Source code for GATK4 https://github.com/broadinstitute/gatk
| |
| * GATK4 tutorial
| |
| ** https://software.broadinstitute.org/gatk/documentation/presentations
| |
| ** [https://drive.google.com/drive/folders/0BzI1CyccGsZicXl0anMwQnFYVEU Call somatic SNVs and indels using GATK4 Mutect2]
| |
| ** [https://drive.google.com/drive/folders/1dCTLwqZz1oPG_1PWZgAdyTS6lZc5Upd4 Call Somatic CNVs using GATK4.beta]
| |
| * <strike>Need to create an account w/ password and log in in order to see and download the software</strike> (not anymore).
| |
| * Downloaded file is called GenomeAnalysisTK-3.5.tar.bz2 or GenomeAnalysisTK-3.6.tar.bz2. Note that the current version 3.6 (released on 6/1/2016) requires Java 1.8.
| |
| <syntaxhighlight lang='bash'>
| |
| # Download GenomeAnalysisTK-3.4-46.tar.bz2 from gatk website
| |
| sudo mkdir /opt/RNA-Seq/bin/gatk
| |
| sudo tar jxvf ~/Downloads/GenomeAnalysisTK-3.4-46.tar.bz2 -C /opt/RNA-Seq/bin/gatk
| |
| ls /opt/RNA-Seq/bin/gatk
| |
| # GenomeAnalysisTK.jar resources
| |
| </syntaxhighlight>
| |
| * (Blog) https://software.broadinstitute.org/gatk/blog
| |
| * Ubuntu 14 only has Java 1.6 and 1.7. Ubuntu 16 only has Java 1.8 and 1.9.
| |
| * Picard and HTSJDK also rely on Java.
| |
| * Different results on 2 second run. http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results
| |
| * [https://www.broadinstitute.org/gatk/guide/ GATK guide] and [https://www.broadinstitute.org/gatk/guide/tagged?tag=requirements requirements].
| |
| * [https://www.broadinstitute.org/gatk/guide/article?id=3891 Best Practices workflow for SNP and indel calling on RNAseq data using GATK]. It recommend using [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php HaplotypeCaller].
| |
| * [https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf Indel realignment] Input=BAM, Output=BAM. Big improvement for Base Quality Score Recalibration when run on realigned BAM files (artificial SNPs are replaced with real indels).
| |
| * [https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-3-Base_recalibration.pdf Base Quality Score Recalibration] Input=BAM, known sites, Output=BAM.
| |
| * [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php UnifiedGenotyper] for variant call.
| |
| * https://wikis.utexas.edu/display/bioiteam/Variant+calling+with+GATK
| |
| * [http://www.biomedcentral.com/content/pdf/1471-2407-13-55.pdf Identification of somatic and germline mutations using whole exome sequencing of congenital acute lymphoblastic leukemia]: a paper that has used GATK for germline and somatic analyses
| |
| * http://gatkforums.broadinstitute.org/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail
| |
| * [https://www.qiagenbioinformatics.com/blog/clinical/new-plugin-to-support-your-bwa-gatk-pipelines/?utm_source=newsletter&utm_medium=email&utm_campaign=06.2016&mkt_tok=eyJpIjoiWkdWa09UYzRZakkwWWpBNCIsInQiOiJ3Z2VtODJTTGpTTzltbGQ0ZVFyMXJRZWNNaHhyN3ZacCtKMnlwWjQrNVdJcGttN3lvQTN6TFlBRVF1cHFFY3lzT3FjRzA1Tk5jcllReCtSTlhCaDlcL1ppSHQ5eTl4SjRrRk1uSDlCQWJMTTg9In0%3D New plugin to support your BWA-GATK pipelines for '''Biomedical Genomics Server Solution''']
| |
| | |
| ==== Citing papers ====
| |
| https://software.broadinstitute.org/gatk/documentation/article.php?id=6201
| |
| | |
| # McKenna et al. 2010 "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data"
| |
| # DePristo et al. 2011 "A framework for variation discovery and genotyping using next-generation DNA sequencing data"
| |
| # Van der Auwera et al. 2013 "rom FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline"
| |
| | |
| ==== Container version and Java ====
| |
| For some reason, running GATK 3.8 on Biowulf (CentOS + Sun Java) does not give any variants. But in my Ubuntu (OpenJDK), it does.
| |
| | |
| It is strange the website recommends Sun Java, but the container version uses OpenJDK.
| |
| <pre>
| |
| $ docker run -it --rm broadinstitute/gatk
| |
| (gatk) root@9c630a926bc1:/gatk# java -version
| |
| openjdk version "1.8.0_131"
| |
| OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
| |
| OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
| |
| | |
| $ docker run -it --rm broadinstitute/gatk3:3.8-0
| |
| root@80b7efa0c3ac:/usr# java -version
| |
| openjdk version "1.8.0_102"
| |
| OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-1~bpo8+1-b14)
| |
| OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
| |
| </pre>
| |
| | |
| See the example '[[Docker#Run_a_shell_script_on_host|how to run a shell script on host]]' on how to run GATK from the host command line (the container will be deleted after the job is done; similar to what 'singularity' does).
| |
| | |
| ==== Pipeline ====
| |
| * [https://software.broadinstitute.org/gatk/documentation/quickstart.php Quick Start Guide]: installation, best practice, how to run, run pipelines with WDL, ...
| |
| * [https://gencore.bio.nyu.edu/tag/gatk/ Variant Calling Pipeline: FastQ to Annotated SNPs in Hours]
| |
| * https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1
| |
| * https://gatkforums.broadinstitute.org/gatk/discussion/3891/best-practices-for-variant-calling-on-rnaseq
| |
| | |
| ==== IGV ====
| |
| * https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf
| |
| | |
| ==== dbSNP file ====
| |
| For running GATK best practices, dbsnp file has to be downloaded using the GATK version (with 'chr') for Ensembl but non-GATK (without 'chr') for UCSC. See [[Anders2013#GATK|Anders -> GATK]].
| |
| | |
| * ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/ common_all_xxxxxx.vcf.gz and tbi files
| |
| * ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/VCF/GATK/ common_all_xxxxxx.vcf.gz and tbi files
| |
| | |
| ==== [https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates MarkDuplicates] by Picard ====
| |
| '''Sequencing error''' propagated in duplicates. See p14 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].
| |
| | |
| Reads have to be sorted by coordinates (using eg '''picard.jar SortSam''' OR '''samtools sort''') first.
| |
| | |
| * [https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates MarkDuplicates]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/2799/howto-map-and-mark-duplicates (howto) Map and mark duplicates]
| |
| | |
| <pre>
| |
| java -Djava.io.tmpdir="./tmpJava" \
| |
| -Xmx10g -jar $PICARDJARPATH/picard.jar \
| |
| MarkDuplicates \
| |
| VALIDATION_STRINGENCY=SILENT \
| |
| METRICS_FILE=MarkDudup.metrics \
| |
| INPUT=sorted.bam \
| |
| OUTPUT=bad.bam
| |
| </pre>
| |
| ===== Check if sam is sorted =====
| |
| http://plindenbaum.blogspot.com/2011/02/testing-if-bam-file-is-sorted-using.html
| |
| | |
| ==== [https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups Read group assignment] by Picard ====
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
| |
| -jar /opt/SeqTools/bin/picard-tools-2.1.1/picard.jar AddOrReplaceReadGroups \
| |
| INPUT=BMBC2_liver3_IMPACT.bam \
| |
| OUTPUT=rg_added_sorted.bam \
| |
| RGID=1 \
| |
| RGLB=rna \
| |
| RGPL=illumina \
| |
| RGPU=UNKNOWN \
| |
| RGSM=BMBC2_liver3_IMPACT
| |
| </pre>
| |
| | |
| ==== Split reads into exon (RNA-seq only) ====
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| samtools index reorder.bam
| |
| | |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
| |
| -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
| |
| -T SplitNCigarReads \
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
| |
| -I reorder.bam \
| |
| -o split.bam \
| |
| -U ALLOW_N_CIGAR_READS \
| |
| -fixNDN \
| |
| -allowPotentiallyMisencodedQuals
| |
| | |
| samtools index split.bam
| |
| </pre>
| |
| | |
| ==== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php Indel realignment] ====
| |
| Local alignment around indels corrects '''mapping errors'''. See p15 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].
| |
| | |
| Notes
| |
| * (from its documentation) '''indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.'''
| |
| * It require the following two steps before running indel realignment
| |
| *# Generate an intervals (see next subsection) file
| |
| *# sorted bam file (necessary?)
| |
| * The '-known' option is not necessary. In [https://approachedinthelimit.wordpress.com/2016/06/29/updated-gatk-pipeline-to-haplotypecaller-gvcf/ this GATK workflow with HaplotypeCaller], it does not use '-known' or '-knownSites' in the pipeline.
| |
| | |
| ===== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php Create realignment targets (*.intervals)] using '-known' option =====
| |
| | |
| The following code follows [https://software.broadinstitute.org/gatk/documentation/article.php?id=38 Local Realignment around Indels]. That is, we don't need to use the '-I' parameter as in [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php example from the RealignerTargetCreator] documentation.
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Xmx10g -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
| |
| -T RealignerTargetCreator \
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
| |
| -o /home/brb/SeqTestdata/usefulvcf/hg19/gatk/BRB-SeqTools_indels_only.intervals \
| |
| -known /home/brb/SeqTestdata/usefulvcf/hg19/gatk/common_all_20160601.vcf \
| |
| -nt 11
| |
| </pre>
| |
| | |
| Indel realignment requires an intervals file.
| |
| | |
| ===== [https://broadinstitute.github.io/picard/command-line-overview.html#ReorderSam ReorderSam] by Picard =====
| |
| Question: is this step necessary? [https://software.broadinstitute.org/gatk/documentation/article.php?id=38 Local Realignment around Indels] documentation does not have emphasized this step.
| |
| | |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
| |
| -jar /opt/SeqTools/bin/picard-tools-2.1.1/picard.jar \
| |
| ReorderSam \
| |
| INPUT=rg_added_sorted.bam \
| |
| OUTPUT=reorder.bam \
| |
| REFERENCE=/home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa
| |
| </pre>
| |
| | |
| ===== Indel realignment using '-known' option =====
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
| |
| -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
| |
| -T IndelRealigner \
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
| |
| -I split.bam \
| |
| -targetIntervals /home/brb/SeqTestdata/usefulvcf/hg38/gatk/BRB-SeqTools_indels_only.intervals \
| |
| -known /home/brb/SeqTestdata/usefulvcf/hg38/gatk/common_all_20170710.vcf.gz \
| |
| -o realigned_reads.bam \
| |
| -allowPotentiallyMisencodedQuals
| |
| </pre>
| |
| | |
| ==== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php Base quality score recalibration] using '-knownSites' option ====
| |
| Base recalibrartion corrects for '''machine errors'''. See p16 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].
| |
| | |
| The base recalibration process involves two key steps
| |
| # builds a model of covariation based on the data and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore errors and indicative of poor base quality.
| |
| # Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).
| |
| | |
| Afterwards, it (I guess it is in the PrintReads command) adjusts the base quality scores in the data based on the model
| |
| | |
| (from the documentation) '''-knownSites''' parameter: This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.
| |
| | |
| In terms of the software implementation, this workflow actually includes two commands: '''BaseRecalibrator''' and '''PrintReads'''.
| |
| | |
| In my experience running the [https://taichimd.us/mediawiki/index.php/Seqtools#DNA-Seq PrintReads command is very very slow] even I have used the multi-threaded mode option (-nct). See a discussion [https://gatkforums.broadinstitute.org/gatk/discussion/3051/parallelizing-printreads parallelizing PrintReads].
| |
| | |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
| |
| -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
| |
| -T BaseRecalibrator \
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
| |
| -I realigned_reads.bam \
| |
| -nct 11 \
| |
| -knownSites /home/brb/SeqTestdata/usefulvcf/hg38/gatk/common_all_20170710.vcf.gz \
| |
| -o recal_data.table
| |
| -allowPotentiallyMisencodedQuals
| |
| | |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g
| |
| -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar
| |
| -T PrintReads
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa
| |
| -I realigned_reads.bam
| |
| -nct 11
| |
| -BQSR recal_data.table
| |
| -o recal.bam
| |
| </pre>
| |
| | |
| ==== Variant call by [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php HaplotypeCaller] (germline) and [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php MuTec2] (somatic) ====
| |
| | |
| * [http://www.pathologystudent.com/?p=8539 Germline vs. somatic mutations]
| |
| * HaplotypeCaller
| |
| ** ''The (HaplotypeCaller) algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so '''its use is not recommended for somatic (cancer) variant discovery'''. For that purpose, use MuTect2 instead.''
| |
| ** HaplotypeCaller theory/properties https://wiki.nbic.nl/images/1/13/Wim_2013_07_12.pdf#page=6
| |
| ** Local denovo assembly ([https://en.wikipedia.org/wiki/De_novo_transcriptome_assembly ''De novo'' transcriptome assembly]) based variant caller. '''Whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region.''' This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other.
| |
| ** Calls SNP, INDEL, MNP and small SV simultaneously
| |
| ** Removes mapping artifacts
| |
| ** More sensitive and accurate than the Unified Genotyper (UG)
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/4148/hc-overview-how-the-haplotypecaller-works How the HaplotypeCaller works?] ([https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2015-Seq_and_Variant_Discovery])
| |
| *# Define '''active regions''' (substantial evidence of variation relative to the reference)
| |
| *# Determine haplotypes by '''re-assembly''' of the active region
| |
| *# Determine '''likelihoods of the haplotypes''' given the read data
| |
| *# Assign sample '''genotypes'''
| |
| | |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g
| |
| -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar
| |
| -T HaplotypeCaller --genotyping_mode DISCOVERY
| |
| -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa
| |
| -I recal.bam
| |
| -stand_call_conf 30
| |
| -o /home/brb/SRP049647/outputvc/BMBC2_liver3_IMPACT_raw.vcf
| |
| -allowPotentiallyMisencodedQuals
| |
| -nct 11
| |
| </pre>
| |
| | |
| ==== Variant call by VarDict ====
| |
| https://github.com/AstraZeneca-NGS/VarDict
| |
| | |
| ==== Random results and downsampling ====
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results HaplotypeCaller on whole genome or chromosome by chromosome: different results] Downsampling is random, and a few different reads can make a difference.
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/3094/downsampling-with-haplotypecaller Downsampling with HaplotypeCaller]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/1323/downsampling Downsampling details] '''We do not recommend changing the downsampling settings in the tool.'''
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/3989/downsampling-to-coverage-and-the-3-x-haplotypecaller What I want to obtain is the same result as when running just HC on BAM files in plain VCF output]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/9943/questions-about-downsampling You can use PrintReads to downsample first then run HaplotypeCaller on the downsampled BAM]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/8223/haplotypecaller-generates-diff-results-on-different-cpus HaplotypeCaller generates diff results on different CPUs]. you might well get into reproducibility problems if you use single process multi-thread parallelism (i.e. -nt N where N > 1) regardless of what pair HMM implementation you are using.
| |
| | |
| ==== Why expected variants are not called ====
| |
| * [https://software.broadinstitute.org/gatk/documentation/article.php?id=7869 I expect to see a variant at a specific site, but it's not getting called]
| |
| * [https://software.broadinstitute.org/gatk/documentation/article.php?id=5484 Generate a "bamout file" showing how HaplotypeCaller has remapped sequence reads]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/2803/howto-call-variants-with-haplotypecaller Call variants with HaplotypeCaller]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/4851/inconsistency-among-the-depth-in-the-vcf-file-and-in-the-original-bam-file-and-hc-bamout-bam-file Inconsistency among the depth in the vcf file and in the original bam file and HC -bamout bam file]
| |
| * To include all trimmed, downsampled, filtered and uninformative reads when -bamout is specified, add the '''--emitDroppedReads''' argument.
| |
| | |
| ==== Why is HaplotypeCaller dropping half of my reads? ====
| |
| https://gatkforums.broadinstitute.org/gatk/discussion/4844/why-is-haplotypecaller-dropping-half-of-my-reads
| |
| | |
| ==== Missing mapping quality score ====
| |
| https://www.biostars.org/p/79367/#79369
| |
| | |
| On GSE48215 case, I got 34 missing MQ from bwa + gatk but no missing MQ from bwa + samtools???
| |
| | |
| ==== VQSR 质控 ====
| |
| https://zhuanlan.zhihu.com/p/34878471
| |
| | |
| ==== idx file ====
| |
| It is a binary file. It is one of two ouptut from the GATK's Haplotype calling.
| |
| | |
| Unfortunately there is no documentation about its spec. [https://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it It can be generated if VCF files can be validated] by [https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php ValidateVariants]
| |
| | |
| ==== Djava.io.tmpdir ====
| |
| * [https://samnicholls.net/2015/11/11/grokking-gatk/ Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/7623/haplotypecaller-djava-io-tmpdir the number of files (such as bamschedule.* files) in the tmp is so large(about 11000)]
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/4757/is-there-any-speed-benefit-to-redirecting-djava-io-tmpdir-to-local-scratch Is there any speed benefit to redirecting Djava.io.tmpdir to local scratch?]
| |
| * http://people.duke.edu/~ccc14/duke-hts-2017/Statistics/08032017/GATK-pipeline-sample.html
| |
| * How large is the tmp directory required? Less than 1G is sufficient for 12GB bam file.
| |
| | |
| ==== how to deal with errors ====
| |
| * [https://gatkforums.broadinstitute.org/gatk/discussion/1429/error-bam-file-has-a-read-with-mismatching-number-of-bases-and-base-qualities ERROR : BAM file has a read with mismatching number of bases and base qualities] (works) with indelrealigner. [https://gatkforums.broadinstitute.org/gatk/discussion/comment/41013 GATK 3.8 log4j error] (not useful). The solution is to add the option '''--filter_mismatching_base_and_quals''' to IndelRealigner.
| |
| | |
| <pre>
| |
| $ grep -n ERROR swarm_58155698_0.e --color
| |
| 513:ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/usr/local/apps/GATK/3.8-0/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
| |
| 514:ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
| |
| 535:##### ERROR ------------------------------------------------------------------------------------------
| |
| 536:##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
| |
| 537:##### ERROR
| |
| 538:##### ERROR This means that one or more arguments or inputs in your command are incorrect.
| |
| 539:##### ERROR The error message below tells you what is the problem.
| |
| 540:##### ERROR
| |
| 541:##### ERROR If the problem is an invalid argument, please check the online documentation guide
| |
| 542:##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
| |
| 543:##### ERROR
| |
| 544:##### ERROR Visit our website and forum for extensive documentation and answers to
| |
| 545:##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
| |
| 546:##### ERROR
| |
| 547:##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
| |
| 548:##### ERROR
| |
| 549:##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@4a7427f9 is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: BAM file has a read with mismatching number of bases and base qualities. Offender: homo_simulated_Error0_Mu0-j1-chr2-r10408427 [55 bases] [0 quals]. You can use --defaultBaseQualities to assign a default base quality for all reads, but this can be dangerous in you don't know what you are doing.
| |
| 550:##### ERROR -----------------------------------------------------------------------------
| |
| </pre>
| |
| * [https://software.broadinstitute.org/gatk/documentation/article?id=6470 ERROR SAM/BAM/CRAM file appears to be using the wrong encoding for quality scores] in BaseRecalibrator. Adding '--fix_misencoded_quality_scores' does not help.
| |
| <pre>
| |
| ##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
| |
| ##### ERROR
| |
| ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
| |
| ##### ERROR The error message below tells you what is the problem.
| |
| ##### ERROR
| |
| ##### ERROR If the problem is an invalid argument, please check the online documentation guide
| |
| ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
| |
| ##### ERROR
| |
| ##### ERROR Visit our website and forum for extensive documentation and answers to
| |
| ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
| |
| ##### ERROR
| |
| ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
| |
| ##### ERROR
| |
| ##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@6c0bf8f4 appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (68) with BAQ correction factor of 4. Please see https://software.broadinstitute.org/gatk/documentation/article?id=6470 for more details and options related to this error.
| |
| | |
| After adding '--fix_misencoded_quality_scores' does not help.
| |
| | |
| #### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
| |
| </pre>
| |
| | |
| === Novocraft ===
| |
| https://hpc.nih.gov/apps/novocraft.html
| |
| | |
| Running index is very quick (2 minutes for hg19). The following commands will generate a file <hg19index> which is about 7.8GB in size.
| |
| <syntaxhighlight lang='bash'>
| |
| $ sinteractive --mem=32g -c 16
| |
| $ module load novocraft
| |
| $ novoindex hg19index $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
| |
| | |
| # novoindex (3.8) - Universal k-mer index constructor.
| |
| # (C) 2008 - 2011 NovoCraft Technologies Sdn Bhd
| |
| # novoindex hg19index /home/limingc/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
| |
| # Creating 55 indexing threads.
| |
| # Building with 14-mer and step of 2 bp.
| |
| tcmalloc: large alloc 1073750016 bytes == 0x13f8000 @ 0x4008e4 0x56c989 0x40408c 0x40127b 0x4d20bb 0x402845
| |
| tcmalloc: large alloc 8344305664 bytes == 0x41402000 @ 0x4008e4 0x56d6d3 0x40423a 0x40127b 0x4d20bb 0x402845
| |
| # novoindex construction dT = 116.1s
| |
| # Index memory size 7.771Gbyte.
| |
| # Done.
| |
| </syntaxhighlight>
| |
| | |
| === [https://github.com/ekg/freebayes freebayes] ===
| |
| The program gave different errors for some real datasets downloaded from GEO.
| |
| | |
| A small test data included in the software works.
| |
| <syntaxhighlight lang='bash'>
| |
| brb@brb-P45T-A:~/github$ sudo apt-get update
| |
| brb@brb-P45T-A:~/github$ sudo apt-get install cmake
| |
| brb@brb-P45T-A:~/github$ git clone --recursive git://github.com/ekg/freebayes.git
| |
| brb@brb-P45T-A:~/github$ cd freebayes/
| |
| brb@brb-P45T-A:~/github/freebayes$ make
| |
| brb@brb-P45T-A:~/github/freebayes$ make test # Got some errors. So don't worry to 'make test'
| |
| brb@brb-P45T-A:~/github/freebayes/test/tiny$ ../../bin/freebayes -f q.fa NA12878.chr22.tiny.bam > tmp.vcf
| |
| brb@brb-P45T-A:~/github/freebayes/test/tiny$ ls -lt
| |
| total 360
| |
| -rw-rw-r-- 1 brb brb 15812 Oct 9 10:52 tmp.vcf
| |
| -rw-rw-r-- 1 brb brb 287213 Oct 9 09:33 NA12878.chr22.tiny.bam
| |
| -rw-rw-r-- 1 brb brb 96 Oct 9 09:33 NA12878.chr22.tiny.bam.bai
| |
| -rw-rw-r-- 1 brb brb 16307 Oct 9 09:33 NA12878.chr22.tiny.giab.vcf
| |
| -rw-rw-r-- 1 brb brb 12565 Oct 9 09:33 q.fa
| |
| -rw-rw-r-- 1 brb brb 16 Oct 9 09:33 q.fa.fai
| |
| -rw-rw-r-- 1 brb brb 2378 Oct 9 09:33 q_spiked.vcf.gz
| |
| -rw-rw-r-- 1 brb brb 91 Oct 9 09:33 q_spiked.vcf.gz.tbi
| |
| -rw-rw-r-- 1 brb brb 4305 Oct 9 09:33 q.vcf.gz
| |
| -rw-rw-r-- 1 brb brb 102 Oct 9 09:33 q.vcf.gz.tbi
| |
| brb@brb-P45T-A:~/github/freebayes/test/tiny$ ~/github/samtools/samtools view NA12878.chr22.tiny.bam | wc -l
| |
| 3333
| |
| brb@brb-P45T-A:~/github/freebayes/test/tiny$ wc -l tmp.vcf
| |
| 76 tmp.vcf
| |
| brb@brb-P45T-A:~/github/freebayes/test/tiny$ wc -l q.fa
| |
| 207 q.fa
| |
| </syntaxhighlight>
| |
| | |
| To debug the code, we can
| |
| <syntaxhighlight lang='bash'>
| |
| ../../bin/freebayes -f q.fa NA12878.chr22.tiny.bam > tmpFull.out 2>&1
| |
| </syntaxhighlight>
| |
| Then compare tmp.vcf and tmpFull.out files. We see
| |
| * total sites: 12280
| |
| * the first variant call happens at position 186. Look at tmpFull.out file, we see most of positions just show 3 lines in the output but variants like position 186 show a lot of output (line 643 to 736).
| |
| * the output 'processing position XXX' was generated from AlleleParser:toNextPosition() which was called by AlleleParser::getNextAlleles() which was called by freebayes.cpp::main() line 126, the while() loop.
| |
| | |
| ==== freeBayes vs HaplotypeCaller ====
| |
| * https://www.biostars.org/p/174510/
| |
| * [http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/ Updated comparison of variant detection methods: Ensemble, FreeBayes and minimal BAM preparation pipelines]
| |
| | |
| === [http://wiki.bits.vib.be/index.php/Varscan2 Varscan2] ===
| |
| * http://dkoboldt.github.io/varscan/
| |
| * https://github.com/dkoboldt/varscan
| |
| | |
| === vcftools ===
| |
| https://vcftools.github.io/examples.html
| |
| | |
| This toolset can be used to perform the following operations on VCF files:
| |
| | |
| * Filter out specific variants
| |
| * Compare files
| |
| * Summarize variants
| |
| * Convert to different file types
| |
| * Validate and merge files
| |
| * Create intersections and subsets of variants
| |
| | |
| <syntaxhighlight lang="bash">
| |
| wget http://sourceforge.net/projects/vcftools/files/vcftools_0.1.12b.tar.gz/download \
| |
| -o vcftools_0.1.12b.tar.gz
| |
| tar -xzvf vcftools_0.1.12b.tar.gz
| |
| sudo mv vcftools_0.1.12b /opt/RNA-Seq/bin/
| |
| export PERL5LIB=/opt/RNA-Seq/bin/vcftools_0.1.12b/perl/
| |
| /opt/RNA-Seq/bin/vcftools_0.1.12b/
| |
| make
| |
| export PATH=$PATH:/opt/RNA-Seq/bin/vcftools_0.1.12b/bin
| |
| ls bin
| |
| # fill-aa vcf-annotate vcf-convert vcf-phased-join vcf-subset
| |
| # fill-an-ac vcf-compare vcf-fix-ploidy vcf-query vcftools
| |
| # fill-fs vcf-concat vcf-indel-stats vcf-shuffle-cols vcf-to-tab
| |
| # fill-ref-md5 vcf-consensus vcf-isec vcf-sort vcf-tstv
| |
| # man1 vcf-contrast vcf-merge vcf-stats vcf-validator
| |
| </syntaxhighlight>
| |
| Some example
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
| |
| $ cd ~/SRP032789
| |
| $ vcftools --vcf GSM1261016_IP2-50_var.flt.vcf
| |
| | |
| VCFtools - v0.1.12b
| |
| (C) Adam Auton and Anthony Marcketta 2009
| |
| | |
| Parameters as interpreted:
| |
| --vcf GSM1261016_IP2-50_var.flt.vcf
| |
| | |
| After filtering, kept 1 out of 1 Individuals
| |
| After filtering, kept 193609 out of a possible 193609 Sites
| |
| Run Time = 1.00 seconds
| |
| | |
| $ wc -l GSM1261016_IP2-50_var.flt.vcf
| |
| 193636 GSM1261016_IP2-50_var.flt.vcf
| |
| | |
| $ vcf-indel-stats < GSM1261016_IP2-50_var.flt.vcf > out.txt
| |
| Use of uninitialized value in pattern match (m//) at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
| |
| Use of uninitialized value in concatenation (.) or string at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
| |
| <: No such file or directory at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 18.
| |
| main::error('<: No such file or directory') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 50
| |
| main::init_regions('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 71
| |
| main::do_stats('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 9
| |
| </pre>
| |
| | |
| To compare two vcf files, see https://vcftools.github.io/documentation.html
| |
| <pre>
| |
| ./vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
| |
| </pre>
| |
| | |
| === Online course on Variant calling ===
| |
| * edX: HarvardX: PH525.6x Case Study: Variant Discovery and Genotyping. Course notes is at their [https://github.com/hbc/edX/blob/master/edX_Notes.md Github] page.
| |
| | |
| === [http://genome.cshlp.org/content/27/8/1450.full GenomeVIP] ===
| |
| a cloud platform for genomic variant discovery and interpretation
| |
| | |
| === PCA ===
| |
| [http://bwlewis.github.io/1000_genomes_examples/PCA.html PCA of genomic variant data across one chromosome from 2,504 people from the 1000 genomes project]
| |
| | |
| == Variant Annotation ==
| |
| See also the paper [http://bib.oxfordjournals.org/content/15/2/256.abstract A survey of tools for variant analysis of next-generation genome sequencing data].
| |
| | |
| [https://github.com/seandavi/awesome-cancer-variant-databases Awesome-cancer-variant-databases] - A community-maintained repository of cancer clinical knowledge bases and databases focused on cancer variants.
| |
| | |
| === [http://www.ncbi.nlm.nih.gov/SNP/ dbSNP] ===
| |
| [https://support.bioconductor.org/p/33140/ SNPlocs data R package for Human]. Some [http://grokbase.com/t/r/bioconductor/131gt2jyfg/bioc-problem-locating-snp-by-rsid-for-snplocs-hsapiens-dbsnp-20120608-package-bioconductor-x clarification] about SNPlocs.Hsapiens.dbSNP.20120608 package.
| |
| <pre>
| |
| > library(BSgenome)
| |
| > available.SNPs()
| |
| [1] "SNPlocs.Hsapiens.dbSNP141.GRCh38"
| |
| [2] "SNPlocs.Hsapiens.dbSNP142.GRCh37"
| |
| [3] "SNPlocs.Hsapiens.dbSNP.20090506"
| |
| [4] "SNPlocs.Hsapiens.dbSNP.20100427"
| |
| [5] "SNPlocs.Hsapiens.dbSNP.20101109"
| |
| [6] "SNPlocs.Hsapiens.dbSNP.20110815"
| |
| [7] "SNPlocs.Hsapiens.dbSNP.20111119"
| |
| [8] "SNPlocs.Hsapiens.dbSNP.20120608"
| |
| [9] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
| |
| </pre>
| |
| | |
| [https://support.bioconductor.org/p/30078/ Query dbSNP]
| |
| | |
| An example:
| |
| <pre>
| |
| wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2
| |
| sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/
| |
| cd /opt/RNA-Seq/bin/bcftools-1.2/
| |
| sudo make
| |
| | |
| wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2
| |
| sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/
| |
| cd /opt/RNA-Seq/bin/htslib-1.2.1/
| |
| sudo make # create tabix, htsfile, bgzip commands
| |
| | |
| export bdge_bowtie_PATH=/opt/RNA-Seq/bin/bowtie2-2.2.1
| |
| export bdge_tophat_PATH=/opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64
| |
| export bdge_samtools_PATH=/opt/RNA-Seq/bin/samtools-0.1.19
| |
| export PATH=$bdge_bowtie_PATH:$bdge_samtools_PATH:$bdge_tophat_PATH:$PATH
| |
| export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH
| |
| export PATH=/opt/RNA-Seq/bin/htslib-1.2.1/:$PATH
| |
| | |
| wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz.tbi
| |
| wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz
| |
| cd TNBC1
| |
| mv ~/Downloads/common_all_20150603.vcf.gz* .
| |
| bgzip -c var.raw.vcf > var.raw.vcf.gz
| |
| tabix var.raw.vcf.gz
| |
| bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf
| |
| </pre>
| |
| | |
| Any found in dbSNP?
| |
| <pre>
| |
| grep -c 'rs[0-9]' raw_snps.vcf
| |
| </pre>
| |
| | |
| === ANNOVAR ===
| |
| [http://annovar.openbioinformatics.org/en/latest/ ANNOVAR] and the web based interface to ANNOVAR [http://wannovar.usc.edu/index.php wANNOVAR]. ANNOVAR can annotate genetic variants using
| |
| * Gene-based annotation
| |
| * Region-based annotation
| |
| * Filter-based annotation
| |
| * Other functionalities
| |
| | |
| Note that
| |
| * [https://github.com/JhuangLab/annovarR annovarR] R package. The wrapper functions of annovarR unified the interface of many published annotation tools, such as VEP, ANNOVAR, vcfanno and AnnotationDbi.
| |
| * It is correct to use the original download link from the email to download the latest version.
| |
| * annovar folder needs to be placed under a directory with write permission
| |
| * If we run annovar with a new reference genome, the code will need to download some database. When annovar is downloading the database, the cpu is resting so the whole process looks idling.
| |
| * Annovar will print out a warning with information about changes if there is a new version available.
| |
| * BRB-SeqTools (''grep "\.pl" preprocessgui/*.*'') uses '''annotate_variation.pl''' (5), '''convert2annovar.pl''' (2), '''table_annovar.pl''' (4), '''variants_reduction.pl''' (1).
| |
| * In [https://hpc.nih.gov/apps/ANNOVAR.html nih/biowulf], it creates ''$ANNOVAR_HOME'' and ''$ANNOVAR_DATA'' environment variables.
| |
| * To check the version of my local copy
| |
| <syntaxhighlight lang='bash'>
| |
| ~/annovar/annotate_variation.pl
| |
| </syntaxhighlight>
| |
| * Contents of annovar
| |
| <syntaxhighlight lang='bash'>
| |
| brb@T3600 ~ $ ls -l ~/annovar/
| |
| total 476
| |
| -rwxr-xr-x 1 brb brb 212090 Mar 7 14:59 annotate_variation.pl
| |
| -rwxr-xr-x 1 brb brb 13589 Mar 7 14:59 coding_change.pl
| |
| -rwxr-xr-x 1 brb brb 166582 Mar 7 14:59 convert2annovar.pl
| |
| drwxr-xr-x 2 brb brb 4096 Jun 18 2015 example
| |
| drwxr-xr-x 3 brb brb 4096 Mar 21 09:59 humandb
| |
| -rwxr-xr-x 1 brb brb 19419 Mar 7 14:59 retrieve_seq_from_fasta.pl
| |
| -rwxr-xr-x 1 brb brb 34682 Mar 7 14:59 table_annovar.pl
| |
| -rwxr-xr-x 1 brb brb 21774 Mar 7 14:59 variants_reduction.pl
| |
| brb@T3600 ~ $ ls -lh ~/annovar/humandb | head
| |
| total 36G
| |
| -rw-r--r-- 1 brb brb 927 Mar 21 09:59 annovar_downdb.log
| |
| drwxr-xr-x 2 brb brb 4.0K Jun 18 2015 genometrax-sample-files-gff
| |
| -rw-r--r-- 1 brb brb 20K Mar 7 14:59 GRCh37_MT_ensGeneMrna.fa
| |
| -rw-r--r-- 1 brb brb 3.1K Mar 7 14:59 GRCh37_MT_ensGene.txt
| |
| -rw-r--r-- 1 brb brb 1.4G Dec 15 2014 hg19_AFR.sites.2014_10.txt
| |
| -rw-r--r-- 1 brb brb 87M Dec 15 2014 hg19_AFR.sites.2014_10.txt.idx
| |
| -rw-r--r-- 1 brb brb 2.8G Dec 15 2014 hg19_ALL.sites.2014_10.txt
| |
| -rw-r--r-- 1 brb brb 89M Dec 15 2014 hg19_ALL.sites.2014_10.txt.idx
| |
| -rw-r--r-- 1 brb brb 978M Dec 15 2014 hg19_AMR.sites.2014_10.txt
| |
| </syntaxhighlight>
| |
| | |
| === [http://snpeff.sourceforge.net/ SnpEff & SnpSift] ===
| |
| * http://snpeff.sourceforge.net/SnpEff.html (basic information)
| |
| * http://snpeff.sourceforge.net/SnpEff_manual.html (Manual)
| |
| * http://snpeff.sourceforge.net/protocol.html (6 examples)
| |
| * https://wiki.hpcc.msu.edu/display/Bioinfo/SnpEff+-+The+Basics
| |
| * https://docs.uabgrid.uab.edu/wiki/Galaxy_DNA-Seq_Tutorial
| |
| * http://www.genome.gov/Pages/Research/DIR/DIRNewsFeatures/Next-Gen101/Teer_VariantAnnotation.pdf
| |
| * http://veda.cs.uiuc.edu/course2013/pres/fields/lab/Variant_Calling_v1.pptx (if you click the link, chrome will dl the pptx file)
| |
| * In BRB-SeqTools, '''snpEff.jar''' (2) and '''SnpSift.jar''' (5)
| |
| * NIH/biowulf also has installed [https://hpc.nih.gov/apps/snpEff.html snpEff & SnpSift]
| |
| | |
| ==== SnpEff: Genetic variant '''annotation''' and '''effect prediction''' toolbox ====
| |
| # Input: vcf & reference genome database (eg GRCh38.79).
| |
| # Output: vcf & <snpEff_summary.html> & < snpEff_genes.txt> files.
| |
| <syntaxhighlight lang='bash'>
| |
| wget http://iweb.dl.sourceforge.net/project/snpeff/snpEff_latest_core.zip
| |
| sudo unzip snpEff_latest_core.zip -d /opt/RNA-Seq/bin
| |
| export PATH=/opt/RNA-Seq/bin/snpEff/:$PATH
| |
| | |
| # Next we want to download snpEff database.
| |
| # 1. Need to pay attention the database is snpEff version dependent
| |
| # 2. Instead of using the command line (very slow < 1MB/s),
| |
| # java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar databases | grep GRCh
| |
| # java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar download GRCh38.79
| |
| # we just go to the file using the the web browser
| |
| # http://sourceforge.net/projects/snpeff/files/databases/v4_1/ # 525MB
| |
| # 3. the top folder of the zip file is called 'data'. We will need to unzip it to the snpEff directory.
| |
| # That is, snpEff +- data
| |
| # |- examples
| |
| # |- galaxy
| |
| # +- scripts
| |
| mv ~/Downloads/snpEff_v4_1_GRCh38.79.zip .
| |
| unzip snpEff_v4_1GRCh38.79.zip
| |
| sudo java -Xmx4G -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar \
| |
| -i vcf -o vcf GRCh38.79 var_annot.vcf > var_annot_snpEff.vcf
| |
| | |
| brb@T3600 ~ $ ls -l /opt/SeqTools/bin/snpEff/
| |
| total 44504
| |
| drwxrwxr-x 5 brb brb 4096 May 5 11:24 data
| |
| drwxr-xr-x 2 brb brb 4096 Feb 17 16:37 examples
| |
| drwxr-xr-x 3 brb brb 4096 Feb 17 16:37 galaxy
| |
| drwxr-xr-x 3 brb brb 4096 Feb 17 16:37 scripts
| |
| -rw-r--r-- 1 brb brb 6138594 Dec 5 10:49 snpEff.config
| |
| -rw-r--r-- 1 brb brb 20698856 Dec 5 10:49 snpEff.jar
| |
| -rw-r--r-- 1 brb brb 18712032 Dec 5 10:49 SnpSift.jar
| |
| </syntaxhighlight>
| |
| | |
| '''Output file''' (compare ''cosmic_dbsnp_rem.vcf'' and ''snpeff_anno.vcf'' at [https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp here]):
| |
| * add 5 lines at the header. Search "ID=ANN" at [https://github.com/arraytools/vc-annotation/blob/master/snpeff/tmp/snpeff_anno.vcf snpeff_anno.vcf]
| |
| <pre>
| |
| ##SnpEffVersion="4.2 (build 2015-12-05), by Pablo Cingolani"
| |
| ##SnpEffCmd="SnpEff -no-downstream -no-upstream ....
| |
| ##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID |...
| |
| ##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. ...
| |
| ##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. ...
| |
| </pre>
| |
| * added functional annotations '''ANN''' in the INFO field. See an example at [http://snpeff.sourceforge.net/SnpEff_manual.html Basic example: Annotate using SnpEff]
| |
| <pre>
| |
| ANN=A|missense_variant|MODERATE|ISG15|ISG15|transcript|NM_005101.3|protein_coding|2/2|c.248G>A|p.Ser83Asn|355/666|248/498|83/165||
| |
| </pre>
| |
| | |
| ==== SnpSift: SnpSift helps filtering and manipulating genomic annotated files ====
| |
| Once you annotated your files using SnpEff, you can use SnpSift to help you filter large genomic datasets in order to find the most significant variants
| |
| | |
| '''[http://snpeff.sourceforge.net/SnpSift.html#filter SnpSift filter]'''
| |
| <syntaxhighlight lang='bash'>
| |
| cat "$outputDir/tmp/$tmpfd/snpeff_anno.vcf" | \
| |
| java -jar "$seqtools_snpeff/SnpSift.jar" \
| |
| filter "(ANN[*].BIOTYPE = 'protein_coding') | (ANN[*].EFFECT has 'splice')" \
| |
| > "$outputDir/tmp/$tmpfd/snpeff_proteincoding.vcf"
| |
| </syntaxhighlight>
| |
| | |
| the output file (compare snpeff_anno.vcf and snpeff_proteincoding.vcf at https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp here]
| |
| * the header will add 3 lines
| |
| <pre>
| |
| ##SnpSiftVersion="SnpSift 4.2 (build 2015-12-05), by Pablo Cingolani"
| |
| ##SnpSiftCmd="SnpSift filter '(ANN[*].BIOTYPE = 'protein_coding') | (ANN[*].EFFECT has 'splice')'"
| |
| ##FILTER=<ID=SnpSift,Description="SnpSift 4.2 ...
| |
| </pre>
| |
| * KEEP variants satisfying the filter criterion (In the ANN field, BIOTYPE=protein_coding and EFFECT=splice). So it will reduce the variants size. No more fields are added.
| |
| | |
| '''[http://snpeff.sourceforge.net/SnpSift.html#dbNSFP SnpSift dbNSFP]'''
| |
| <syntaxhighlight lang='bash'>
| |
| java -jar "$seqtools_snpeff/SnpSift.jar" dbNSFP -f $dbnsfpField \
| |
| -v -db "$dbnsfpFile" "$outputDir/tmp/$tmpfd/nonsyn_splicing2.vcf" \
| |
| > "$outputDir/tmp/$tmpfd/nonsyn_splicing_dbnsfp.vcf"
| |
| </syntaxhighlight>
| |
| Compare ''nonsyn_splicing2.vcf'' and ''[https://github.com/arraytools/vc-annotation/blob/master/snpeff/tmp/nonsyn_splicing_dbnsfp.vcf nonsyn_splicing_dbnsfp.vcf]'' at [https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp github]
| |
| output file
| |
| * the header adds 29 lines.
| |
| <pre>
| |
| ##SnpSiftCmd="SnpSift dbnsfp -f SIFT_score,SIFT_pred, ...
| |
| ##INFO=<ID=dbNSFP_GERP___RS,Number=A,Type=Float,Description="Field 'GERP++_RS' from dbNSFP">
| |
| ##INFO=<ID=dbNSFP_CADD_phred,Number=A,Type=Float,Description="Field 'CADD_phred' from dbNSFP">
| |
| ...
| |
| </pre>
| |
| * the INFO field will add
| |
| <pre>
| |
| dbNSFP_CADD_phred=2.276;dbNSFP_CADD_raw=-0.033441;dbNSFP_FATHMM_pred=.,.,T; ...
| |
| </pre>
| |
| | |
| '''[http://snpeff.sourceforge.net/SnpSift.html#Extract SnpSift extractFields]'''
| |
| <pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >java -jar "$seqtools_snpeff/SnpSift.jar" extractFields \
| |
| -s "," -e "." "$outputDir/tmp/$tmpfd/nonsyn_splicing_dbnsfp.vcf" CHROM POS ID REF ALT "ANN[0].EFFECT" "ANN[0].IMPACT" "ANN[0].GENE" "ANN[0].GENEID" "ANN[0].FEATURE" "ANN[0].FEATUREID" "ANN[0].BIOTYPE" "ANN[0].HGVS_C" "ANN[0].HGVS_P" dbNSFP_SIFT_score dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_score dbNSFP_Polyphen2_HDIV_pred dbNSFP_Polyphen2_HVAR_score dbNSFP_Polyphen2_HVAR_pred dbNSFP_LRT_score dbNSFP_LRT_pred dbNSFP_MutationTaster_score dbNSFP_MutationTaster_pred dbNSFP_MutationAssessor_score dbNSFP_MutationAssessor_pred dbNSFP_FATHMM_score dbNSFP_FATHMM_pred dbNSFP_PROVEAN_score dbNSFP_PROVEAN_pred dbNSFP_VEST3_score dbNSFP_CADD_raw dbNSFP_CADD_phred dbNSFP_MetaSVM_score dbNSFP_MetaSVM_pred dbNSFP_MetaLR_score dbNSFP_MetaLR_pred dbNSFP_GERP___NR dbNSFP_GERP___RS dbNSFP_phyloP100way_vertebrate dbNSFP_phastCons100way_vertebrate dbNSFP_SiPhy_29way_logOdds \
| |
| > "$outputDir/tmp/$tmpfd/annoTable.txt"
| |
| </pre>
| |
| The output file will
| |
| * remove header from the input VCF file
| |
| * break the INFO field into several columns; Only the columns we specify in the command will be kept.
| |
| | |
| ==== Local ====
| |
| <syntaxhighlight lang='bash'>
| |
| $ ls -l /opt/SeqTools/bin/snpEff/
| |
| total 44504
| |
| drwxrwxr-x 5 brb brb 4096 May 5 11:24 data
| |
| drwxr-xr-x 2 brb brb 4096 Feb 17 16:37 examples
| |
| drwxr-xr-x 3 brb brb 4096 Feb 17 16:37 galaxy
| |
| drwxr-xr-x 3 brb brb 4096 Feb 17 16:37 scripts
| |
| -rw-r--r-- 1 brb brb 6138594 Dec 5 10:49 snpEff.config
| |
| -rw-r--r-- 1 brb brb 20698856 Dec 5 10:49 snpEff.jar
| |
| -rw-r--r-- 1 brb brb 18712032 Dec 5 10:49 SnpSift.jar
| |
| | |
| $ ls -l /opt/SeqTools/bin/snpEff/data
| |
| total 12
| |
| drwxr-xr-x 2 brb brb 4096 May 5 11:24 GRCh38.82
| |
| drwxrwxr-x 2 brb brb 4096 Feb 5 09:28 hg19
| |
| drwxr-xr-x 2 brb brb 4096 Mar 8 09:44 hg38
| |
| </syntaxhighlight>
| |
| | |
| ==== Biowulf ====
| |
| The following code fixed some typos on biowulf website.
| |
| <syntaxhighlight lang='bash'>
| |
| echo $SNPSIFT_JAR
| |
| # Make sure the database (GRCH37.75 in this case) exists
| |
| ls /usr/local/apps/snpEff/4.2/data
| |
| # CanFam3.1.75 Felis_catus_6.2.75 GRCh37.75 GRCh38.82 GRCm38.81 GRCz10.82 hg38
| |
| # CanFam3.1.81 Felis_catus_6.2.81 GRCh37.GTEX GRCh38.p2.RefSeq GRCm38.82 hg19 hg38kg
| |
| # CanFam3.1.82 Felis_catus_6.2.82 GRCh38.81 GRCm38.75 GRCz10.81 hg19kg Zv9.75
| |
| # Use snpEff to annotate against GRCh37.75
| |
| snpEff -v -lof -motif -hgvs -nextProt GRCh37.75 protocols/ex1.vcf > ex1.eff.vcf # 25 minutes
| |
| # create ex1.eff.vcf (475MB), snpEff_genes.txt (2.5MB) and snpEff_summary.html (22MB)
| |
| # Use SnpSift to pull out 'HIGH IMPACT' or 'MODERATE IMPACT' variants
| |
| cat ex1.eff.vcf | \
| |
| java -jar $SNPSIFT_JAR filter "((EFF[*].IMPACT = 'HIGH') | (EFF[*].IMPACT = 'MODERATE'))" \
| |
| > ex1.filtered.vcf
| |
| # ex1.filtered.vcf (8.2MB), 2 minutes
| |
| # Use SnpSift to annotate against the dbNSFP database
| |
| java -jar $SNPSIFT_JAR dbnsfp -v -db /fdb/dbNSFP2/dbNSFP2.9.txt.gz ex1.eff.vcf \
| |
| > file.annotated.vcf
| |
| # file.annotated.vcf (479 MB), 11 minutes
| |
| </syntaxhighlight>
| |
| and the output
| |
| <syntaxhighlight lang='bash'>
| |
| $ ls -lth | head
| |
| total 994M
| |
| -rw-r----- 1 479M May 15 11:47 file.annotated.vcf
| |
| -rw-r----- 1 8.2M May 15 11:31 ex1.filtered.vcf
| |
| -rw-r----- 1 476M May 15 10:31 ex1.eff.vcf
| |
| -rw-r----- 1 2.5M May 15 10:30 snpEff_genes.txt
| |
| -rw-r----- 1 22M May 15 10:30 snpEff_summary.html
| |
| lrwxrwxrwx 1 39 May 15 10:01 protocols -> /usr/local/apps/snpEff/4.2/../protocols
| |
| </syntaxhighlight>
| |
| | |
| ==== Strange error ====
| |
| * Run 1: Error
| |
| <pre>
| |
| $ java -Xmx4G -jar "$seqtools_snpeff/snpEff.jar" -canon -no-downstream -no-upstream -no-intergenic -no-intron -no-utr \
| |
| -noNextProt -noMotif $genomeVer -s "$outputDir/tmp/$tmpfd/annodbsnpRemove.html" "$outputDir/tmp/$tmpfd/cosmic_dbsnp_rem.vcf"
| |
| | |
| java.lang.RuntimeException: ERROR: Cannot read file '/opt/SeqTools/bin/snpEff/./data/hg38/snpEffectPredictor.bin'.
| |
| You can try to download the database by running the following command:
| |
| java -jar snpEff.jar download hg38
| |
| | |
| at ca.mcgill.mcb.pcingola.snpEffect.SnpEffectPredictor.load(SnpEffectPredictor.java:62)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.Config.loadSnpEffectPredictor(Config.java:517)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.loadDb(SnpEff.java:339)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:956)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:939)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:978)
| |
| at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:136)
| |
| | |
| | |
| NEW VERSION!
| |
| There is a new SnpEff version available:
| |
| Version : 4.3P
| |
| Release date : 2017-06-06
| |
| Download URL : http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
| |
| </pre>
| |
| * Run 2: OK
| |
| <pre>
| |
| $ java -Xmx4G -jar "$seqtools_snpeff/snpEff.jar" -canon -no-downstream -no-upstream -no-intergenic -no-intron -no-utr \
| |
| -noNextProt -noMotif $genomeVer -s "$outputDir/tmp/$tmpfd/annodbsnpRemove.html" "$outputDir/tmp/$tmpfd/cosmic_dbsnp_rem.vcf"
| |
| | |
| ##fileformat=VCFv4.2
| |
| ##FILTER=<ID=PASS,Description="All filters passed">
| |
| ##samtoolsVersion=1.3+htslib-1.3
| |
| ...
| |
| </pre>
| |
| | |
| === ANNOVAR and SnpEff examples ===
| |
| https://github.com/arraytools/brb-seqtools/tree/master/testdata/GSE48215subset/output
| |
| | |
| === [http://cancer.sanger.ac.uk/cosmic COSMIC] ===
| |
| * [https://www.youtube.com/channel/UC0W354Ttrh0BZCjt1D3I2HA Youtube videos]
| |
| * [http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=BRAF#histo Histogram] of BRAF (melanoma).
| |
| | |
| === AnnTools ===
| |
| === GNS-SNP ===
| |
| === SeattleSeq ===
| |
| === SVA ===
| |
| === VARIANT ===
| |
| === VEP ===
| |
| [https://informatics.sydney.edu.au/training/coursedocs/DNASeqOnArtemis_Camden_Aug2018_1.pdf#page=46 DNA Sequencing analysis on Artemis Mapping and Variant Calling] Tracy Chew et al
| |
| | |
| === Web tools ===
| |
| * [http://genome.ucsc.edu/cgi-bin/hgVai UCSC Variant Annotation Integrator]
| |
| * [http://snp.gs.washington.edu/SeattleSeqAnnotation137/ SeattleSeq Annotation 137]
| |
| * [http://www.ensembl.org/Homo_sapiens/Tools/VEP Variant Effect Predictor] and [http://useast.ensembl.org/info/docs/tools/vep/script/index.html?redirect=no VEP script]
| |
| | |
| === pcgr ===
| |
| * https://github.com/sigven/pcgr
| |
| * [https://academic.oup.com/bioinformatics/article/34/10/1778/4764004 Personal Cancer Genome Reporter: variant interpretation report for precision oncology] 2017
| |
| | |
| === SPDI ===
| |
| [https://www.biorxiv.org/content/10.1101/537449v1 SPDI: Data Model for Variants and Applications at NCBI]
| |
| | |
| == De novo genome assembly ==
| |
| * [http://bioinformatics.oxfordjournals.org/content/early/2015/06/21/bioinformatics.btv383.abstract Bandage: interactive visualisation of de novo genome assemblies]
| |
| | |
| == Single Cell RNA-Seq ==
| |
| * https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis
| |
| * [http://journal.frontiersin.org/article/10.3389/fgene.2017.00062/full Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods] by Dal Molin et al 2017.
| |
| * [http://bioinformatics.oxfordjournals.org/content/31/13/2225.short?rss=1 Normalization and noise reduction for single cell RNA-seq experiments] by Bo Ding et al 2015.
| |
| * [http://www.rna-seqblog.com/a-step-by-step-workflow-for-low-level-analysis-of-single-cell-rna-seq-data-with-bioconductor/ A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor] by Lun 2016.
| |
| * (Video) [https://youtu.be/IrlNcJwPClQ?list=PLEyKDyF1qdObdFBc3JncwXAnMUHlcd0ap Analysis of single cell RNA-seq data - Lecture 1] by BioinformaticsTraining
| |
| * (Video) [https://youtu.be/8_-oPq1Tg1E Single-cell isolation by a modular single-cell pipette for RNA-sequencing] by labonachipVideos
| |
| * [https://f1000research.com/articles/6-595/v1 Gene length and detection bias in single cell RNA sequencing protocols] and the script & data are available online. Belinda Phipson1 et al 2017.
| |
| * [https://f1000research.com/articles/6-1158/v1 Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference]. '''It has open peer reviews too'''.
| |
| * [https://academic.oup.com/biostatistics/article-abstract/4599254 Missing data and technical variability in single-cell RNA-sequencing experiments]
| |
| * [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0467-4 A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications] 2017
| |
| * [https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby007/4831233 How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives] by Alessandra Dal Molin 2018
| |
| | |
| === How many cells are in the human body? ===
| |
| [https://www.medicalnewstoday.com/articles/318342.php 30-40 trillion (10<sup>12</sup>) cells]
| |
| | |
| === [https://bioconductor.org/packages/scone scone] package: normalization ===
| |
| [https://www.biorxiv.org/content/biorxiv/early/2017/12/16/235382.full.pdf Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq]
| |
| | |
| === [http://bioconductor.org/packages/release/bioc/html/monocle.html monocle] package ===
| |
| | |
| === [http://bioconductor.org/packages/release/bioc/html/sincell.html sincell] package ===
| |
| | |
| === [https://github.com/diazlab/SCell SCell] ===
| |
| [http://www.rna-seqblog.com/scell-integrated-analysis-of-single-cell-rna-seq-data/ SCell – integrated analysis of single-cell RNA-seq data]
| |
| | |
| === [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2897-6 GEVM] ===
| |
| Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).
| |
| | |
| === NMFEM ===
| |
| [http://www.rna-seqblog.com/nmfem-detecting-heterogeneity-in-single-cell-rna-seq-data-by-non-negative-matrix-factorization/ Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization]
| |
| | |
| === Seurat ===
| |
| * http://satijalab.org/seurat/
| |
| * https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
| |
| | |
| === Splatter: Simulation Of Single-Cell RNA Sequencing Data ===
| |
| http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1
| |
| | |
| === Scater ===
| |
| [https://academic.oup.com/bioinformatics/article/33/8/1179/2907823 Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R]
| |
| | |
| === [https://github.com/miaozhun/DEsingle DEsingle] ===
| |
| [https://www.rna-seqblog.com/desingle-detecting-three-types-of-differential-expression-in-single-cell-rna-seq-data/ DEsingle – detecting three types of differential expression in single-cell RNA-seq data]
| |
| | |
| == RNA-Seq analysis interface ==
| |
| * [http://bib.oxfordjournals.org/content/early/2015/06/23/bib.bbv036.full Systematically evaluating interfaces for RNA-seq analysis from a life scientist perspective]
| |
| * [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2486-6 iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data]
| |
| * [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2702-z DEvis: an R package for aggregation and visualization of differential expression data]
| |
| | |
| == Co-expression in RNA-Seq ==
| |
| * [http://www.rna-seqblog.com/co-expression-network-analysis-using-rna-seq-data/ Co-expression network analysis using RNA-Seq data]
| |
| * [http://bioconductor.org/packages/devel/bioc/html/coseq.html coseq] - Co-Expression Analysis of Sequencing Data
| |
| * [http://bib.oxfordjournals.org/content/early/2017/01/10/bib.bbw139.full Gene co-expression analysis for functional classification and gene–disease predictions]
| |
| | |
| == Monitor Software Version Change ==
| |
| | |
| == Circos Plot ==
| |
| Circos is a popular tool for summarizing genomic events in a tumor genome.
| |
| | |
| * [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004274 Genome Modeling System: A Knowledge Management Platform for Genomics]
| |
| * [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2564-9 MS-Helios: a Circos wrapper to visualize multi-omic datasets]
| |
| | |
| == Cancer & Mutation ==
| |
| === [https://www.mycancergenome.org/ My Cancer Genome] ===
| |
| | |
| === [https://civic.genome.wustl.edu/#/home CIViC - Clinical Interpretations of Variants in Cancer] ===
| |
| | |
| === NCBI ===
| |
| * [https://www.youtube.com/watch?v=fC9rYghqUTo NCBI Resources and Variant Interpretation Tools for the Clinical Community]: ClinVar, MedGen, GTR, Variation Viewer
| |
|
| |
|
| = R and Bioconductor packages = | | = R and Bioconductor packages = |
Visualization
See also Bioconductor > BiocViews > Visualization. Search 'genom' as the keyword.
nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh
Simulated DNA-Seq
The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.
File:Igv dna simul.png
Whole genome
PRJEB1486
File:Igv prjeb1486 wgs.png
Whole exome
- (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
- (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.
File:Igv gse48215.png File:Igv srp066363.png
RNA-Seq
- (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
- (Right) GSE46876, UCSC/hg19.
File:Igv anders2013 rna.png File:Igv gse46876 rna.png
Tell DNA or RNA
- DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
- RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).
GIVE: Genomic Interactive Visualization Engine
Build your own genome browser
Heat map plotting by genome coordinate.
Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.
R interface to genome browsers and their annotation tracks
- Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.
A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).
See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.
The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.
https://github.com/cBioPortal/cbioportal
Tutorial: retrieve full TCGA datasets from cBioportal with R
Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.
SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.
Copy Number
Detect copy number variation (CNV) from the whole exome sequencing
Consensus CDS/CCDS
DBS segmentation algorithm
DBS: a fast and informative segmentation algorithm for DNA copy number analysis
modSaRa2
An accurate and powerful method for copy number variation detection
NGS
File:CentralDogmaMolecular.png
See NGS.
R and Bioconductor packages
Resources
library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
Docker
Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.
Some workflows
Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.
STAR Samtools Rsamtools
fastq -----> sam ----------> bam ----------> bamfiles -|
\ GenomicAlignments DESeq2
--------------------> se --------> dds
GenomicFeatures GenomicFeatures / (SummarizedExperiment) (DESeqDataSet)
gtf ----------------> txdb ---------------> genes -----|
library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)
Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
library(org.Hs.eg.db) # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens) # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt") # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type
|
example package name
|
contents
|
OrgDb
|
org.Hs.eg.db
|
gene based information for Homo sapiens
|
TxDb
|
TxDb.Hsapiens.UCSC.hg19.knownGene
|
transcriptome ranges for Homo sapiens
|
OrganismDb
|
Homo.sapiens
|
composite information for Homo sapiens
|
BSgenome
|
BSgenome.Hsapiens.UCSC.hg19
|
genome sequence for Homo sapiens
|
|
refGenome
|
|
RNA-Seq Data Analysis using R/Bioconductor
recount2
Note:
- The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
- Each patient has 4 categories data and the 'case_id' is common to them:
- demographic: gender, race, year_of_birth, year_of_death
- diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
- exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
- main: disease_type, primary_site
- The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.
Visualization
- Differential expression analyses for RNA-sequencing and microarray studies
- Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
- Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.
easyRNASeq
Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
ShortRead
Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.
The Rsamtools package provides an interface to BAM files.
The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts
This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.
The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.
If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.
flank
The example is obtained from ?IRanges::flank.
ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
# start end width
# [1] 2 3 2
# [2] 5 7 3
# [3] 1 3 3
flank(ir3, 2)
# start end width
# [1] 0 1 2
# [2] 3 4 2
# [3] -1 0 2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
# == ==
flank(ir3, 2, start=FALSE)
# start end width
# [1] 4 5 2
# [2] 8 9 2
# [3] 4 5 2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
# == ==
flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
# start end width
# [1] 4 5 2
# [2] 3 4 2
# [3] 4 5 2
# Combine the ideas of the previous 2 cases.
flank(ir3, c(2, -2, 2))
# start end width
# [1] 0 1 2
# [2] 5 6 2
# [3] -1 0 2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
# == ==
flank(ir3, -2, start=F)
# start end width
# [1] 2 3 2
# [2] 6 7 2
# [3] 2 3 2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
# == ==
flank(ir3, 2, both = TRUE)
# start end width
# [1] 0 3 4
# [2] 3 6 4
# [3] -1 2 4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range.
# The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
# ==
# == == == ==
flank(ir3, 2, start=FALSE, both=TRUE)
# start end width
# [1] 2 5 4
# [2] 6 9 4
# [3] 2 5 4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
# ==
# == == == ==
Both IRanges and GenomicRanges packages provide the flank function.
Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.
- General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
- Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.
GenomicRanges depends on IRanges package. See the dependency diagram below.
GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
| +
+-----+ +- GenomeInfoDb
| |
GenomicAlignments +--- Rsamtools --+-----+
+--- Biostrings
The package defines some classes
- GRanges
- GRangesList
- GAlignments
- SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.
(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.
Counting reads with summarizeOverlaps vignette
library(GenomicAlignments)
library(DESeq)
library(edgeR)
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
recursive=TRUE, pattern="*bam$", full=TRUE)
features <- GRanges(
seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000,
7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900,
300, 500, 500)), "-",
group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features
# GRanges object with 11 ranges and 1 metadata column:
# seqnames ranges strand | group_id
# <Rle> <IRanges> <Rle> | <character>
# [1] chr2L [1000, 1499] - | A
# [2] chr2L [3000, 3499] - | A
# [3] chr2L [4000, 4499] - | A
# [4] chr2L [7000, 7599] - | A
# [5] chr2R [2000, 2899] - | B
# ... ... ... ... ... ...
# [7] chr2R [3600, 3899] - | B
# [8] chr2R [4000, 4899] - | B
# [9] chr2R [7500, 7799] - | B
# [10] chr3L [5000, 5499] - | C
# [11] chr3L [5400, 5899] - | C
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment
# dim: 11 2
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):
assays(olap)$counts
# sm_treated1.bam sm_untreated1.bam
# [1,] 0 0
# [2,] 0 0
# [3,] 0 0
# [4,] 0 0
# [5,] 5 1
# [6,] 5 0
# [7,] 2 0
# [8,] 376 104
# [9,] 0 0
# [10,] 0 0
# [11,] 0 0
Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).
samtools view -h -o outputFile.bam inputFile.sam
A modified R code that works is
###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
"gtf/drosophila_melanogaster/",
"Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)
###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" &
mcols(gff0)$type == "exon" &
seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)
###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)
# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts
###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)
expdata = MIAME(
name="pasilla knockdown",
lab="Genetics and Developmental Biology, University of
Connecticut Health Center",
contact="Dr. Brenton Graveley",
title="modENCODE Drosophila pasilla RNA Binding Protein RNAi
knockdown RNA-Seq Studies",
pubMedIds="20921232",
url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
abstract="RNA-seq of 3 biological replicates of from the Drosophila
melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs
encoding pasilla, a mRNA binding protein and 4 biological replicates
of the the untreated cell line.")
design <- data.frame(
condition=c("untreated", "untreated"),
replicate=c(1,1),
type=rep("single-read", 2), stringsAsFactors=TRUE)
library(DESeq)
geneCDS <- newCountDataSet(
countData=assay(genehits),
conditions=design)
experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)
###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design)
experimentData(txCDS) <- expdata
We can also check out ?summarizeOverlaps to find some fake examples.
See this post for about C version of the featureCounts program.
featureCounts vs HTSeq-count
Inference
DESeq or edgeR
- DESeq2 method
- DESeq2 with a large number of samples -> use DESEq2 to normalize the data and then use do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test. See this post. Or consider the limma-voom method instead, which will handle 1000 samples in a few seconds without the need for extra memory.
- edgeR normalization factor post. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the paper by Robinson & Oshlack 2010 for more details. Briefly, M-values are defined as the library size-adjusted log-ratio of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE.
- edgeR From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
- Can I feed TCGA normalized count data to EdgeR?
- counts() function and normalized counts.
- Why use Negative binomial distribution in RNA-Seq data?] and the Presentation by Simon Anders.
- XBSeq2 – a fast and accurate quantification of differential expression and differential polyadenylation. By using simulated datasets, they demonstrated that, overall, XBSeq2 performs equally well as XBSeq in terms of several statistical metrics and both perform better than DESeq2 and edgeR.
- DEBrowser: Interactive Differential Expression Analysis and Visualization Tool for Count Data Alper Kucukural 2018
An R package for gene and isoform differential expression analysis of RNA-seq data
http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/
Probe region expression estimation for RNA-seq data for improved microarray comparability
Inference of differential exon usage in RNA-Seq
A non-parametric approach for detecting differential expression and splicing from RNA-Seq data
voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data
http://www.biosoft.hacettepe.edu.tr/voomDDA/
Pathway analysis
fgsea: Fast Gene Set Enrichment Analysis
GSEABenchmarkeR: Reproducible GSEA Benchmarking
Towards a gold standard for benchmarking gene set enrichment analysis
GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display
Pipeline
http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/
SEQprocess
SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package
pasilla and pasillaBamSubset Data
pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.
pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).
Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.
ReportingTools
The ReportingTools software package enables users to easily display reports of analysis
results generated from sources such as microarray and sequencing data.
More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.
Bioinformatics paper
CRAN packages
Sample Size Calculation for RNA-Seq Experimental Design
Shiny app
Provides an interface to functions of the 'SAMtools' C-Library by Heng Li
The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.
Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.
Simple TCGA Data Access for Integrated Statistical Analysis in R
TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.
source("https://bioconductor.org/biocLite.R")
biocLite("CNTools")
install.packages("TCGA2STAT")
The getTCGA() function allows to download various kind of data:
- gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
- miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
- mutation data (data.type="Mutation")
- methylation expression (data.type="Methylation")
- copy number changes (data.type="CNA_SNP")
curatedTCGAData
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA ws used
Visualize multi-dimentional cancer genomics data including of patient information, gene expressions, DNA methylations, DNA copy number variations, and SNP/mutations in matrix layout or network layout.
The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).
> library(Map2NCBI)
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
# choose [2], [n], and [1] to filter the build and feature information.
# The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
> str(GeneList)
'data.frame': 52157 obs. of 15 variables:
$ tax_id : chr "9606" "9606" "9606" "9606" ...
$ chromosome : chr "1" "1" "1" "1" ...
$ chr_start : num 11874 14362 17369 30366 34611 ...
$ chr_stop : num 14409 29370 17436 30503 36081 ...
$ chr_orient : chr "+" "-" "-" "+" ...
$ contig : chr "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
$ ctg_start : num 1874 4362 7369 20366 24611 ...
$ ctg_stop : num 4409 19370 7436 20503 26081 ...
$ ctg_orient : chr "+" "-" "-" "+" ...
$ feature_name : chr "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
$ feature_id : chr "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
$ feature_type : chr "GENE" "GENE" "GENE" "GENE" ...
$ group_label : chr "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
$ transcript : chr "Assembly" "Assembly" "Assembly" "Assembly" ...
$ evidence_code: chr "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]
TCseq: Time course sequencing data analysis
http://bioconductor.org/packages/devel/bioc/html/TCseq.html
GEO
See the internal link at R-GEO.
GREIN: An interactive web platform for re-analyzing GEO RNA-seq data
Journals
Biometrical Journal
Genome Analysis section
PLOS
Software
BRB-SeqTools
https://brb.nci.nih.gov/seqtools/
GeneSpring
RNA-Seq
CCBR Exome Pipeliner
https://ccbr.github.io/Pipeliner/
MOFA: Multi-Omics Factor Analysis
Benchmarking
Essential guidelines for computational method benchmarking
Simulation
Simulate RNA-Seq
Used by TopHat: discovering splice junctions with RNA-Seq
BEERS/Grant G.R. 2011
http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called BEERS and it was used in the STAR software paper.
For the command line options of <reads_simulator.pl> and more details about the config files that are needed/prepared by BEERS, see this gist.
This can generate paired end data but they are in one FASTA file.
$ sudo apt-get install cpanminus
$ sudo cpanm Math::Random
$ wget http://cbil.upenn.edu/BEERS/beers.tar
$
$ tar -xvf beers.tar # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
$
$ cd ~/Downloads/
$ mkdir beers_output
$ mkdir beers_simulator_refseq && cd "$_"
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
$ tar xzvf simulator_config_refseq.tar.gz
$ ls -lth
total 1.4G
-rw-r--r-- 1 brb brb 44M Sep 16 2010 simulator_config_featurequantifications_refseq
-rw-r--r-- 1 brb brb 7.7M Sep 15 2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15 2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15 2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
-configstem refseq \
-customcfgdir ~/Downloads/beers_simulator_refseq \
-outdir ~/Downloads/beers_output
$ ls -lh beers_output
total 3.9M
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
-rw-r--r-- 1 brb brb 31K Mar 16 15:25 simulated_reads_testbeers.cig
-rw-r--r-- 1 brb brb 22K Mar 16 15:25 simulated_reads_testbeers.fa
-rw-r--r-- 1 brb brb 584 Mar 16 15:25 simulated_reads_testbeers.log
$ wc -l simulated_reads2genes_testbeers.txt
102 simulated_reads2genes_testbeers.txt
$ head -4 simulated_reads2genes_testbeers.txt
seq.1 GENE.5600
seq.2 GENE.35506
seq.3 GENE.506
seq.4 GENE.34922
$ tail -4 simulated_reads2genes_testbeers.txt
seq.97 GENE.4197
seq.98 GENE.8763
seq.99 GENE.19573
seq.100 GENE.18830
$ wc -l simulated_reads_indels_testbeers.txt
36131 simulated_reads_indels_testbeers.txt
$ head -2 simulated_reads_indels_testbeers.txt
chr1:6052304-6052531 25 1 G
chr2:73899436-73899622 141 3 ATA
$ tail -2 simulated_reads_indels_testbeers.txt
chr4:68619532-68621804 1298 -2 AA
chr21:32554738-32554962 174 1 T
$ wc -l simulated_reads_substitutions_testbeers.txt
71678 simulated_reads_substitutions_testbeers.txt
$ head -2 simulated_reads_substitutions_testbeers.txt
chr22:50902963-50903167 50903077 G->A
chr1:6052304-6052531 6052330 G->C
$ wc -l simulated_reads_junctions-crossed_testbeers.txt
49 simulated_reads_junctions-crossed_testbeers.txt
$ head -2 simulated_reads_junctions-crossed_testbeers.txt
seq.1a chrX:49084601-49084713
seq.1b chrX:49084909-49086682
$ cat beers_output/simulated_reads_testbeers.log
Simulator run: 'testbeers'
started: Thu Mar 16 15:25:39 EDT 2017
num reads: 100
readlength: 100
substitution frequency: 0.001
indel frequency: 0.0005
base error: 0.005
low quality tail length: 10
percent of tails that are low quality: 0
quality of low qulaity tails: 0.8
percent of alt splice forms: 0.2
number of alt splice forms per gene: 2
stem: refseq
sum of gene counts: 3,886,863,063
sum of intron counts = 1,304,815,198
sum of intron counts = 2,365,472,596
intron frequency: 0.355507598105262
padded intron frequency: 0.52453796437909
finished at Thu Mar 16 15:25:58 EDT 2017
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ head simulated_reads_testbeers.fa
>seq.1a
CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
>seq.1b
GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
>seq.2a
GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
>seq.2b
ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
>seq.3a
CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC
# Take a look at the true coordinates
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
chrX 49084529 49084601 +
chrX 49084713 49084739 +
chrX 49084863 49084909 +
chrX 49086682 49086734 +
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
seq.1a chrX 49084529 73M111N27M 49084529-49084601, 49084713-49084739 + CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
seq.1b chrX 49084863 47M1772N53M 49084863-49084909, 49086682-49086734 - GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
seq.2a chr1 183516256 100M 183516256-183516355 - GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
seq.2b chr1 183515275 100M 183515275-183515374 + ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ wc -l simulated_reads_testbeers.bed
247 simulated_reads_testbeers.bed
$ wc -l simulated_reads_testbeers.cig
200 simulated_reads_testbeers.cig
Flux Sammeth 2010
Bioinformatics
A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.
http://biorxiv.org/content/early/2014/12/05/012211
The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.
Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.
http://biorxiv.org/content/early/2014/12/05/012211
Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.
Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.
Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.
Too many dependencies. Got an error in installation.. It seems it has not considered splice junctions.
Simulate DNA-Seq
wgsim
https://github.com/lh3/wgsim
NEAT
DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads
http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html
$ head simDNA_100bp_16del.fasta
>Pt-0-100
TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT
>Pt-10-110
GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC
>Pt-20-120
GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA
>Pt-30-130
TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA
>Pt-40-140
ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT
Simulate Whole genome
Simulate whole exome
Variant simulator
sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs
Convert FASTA to FASTQ
It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.
If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.
$ cat test.fasta
>Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
>Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
>Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
>Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
>Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Alternatively we can use just one line of code by awk
$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
test.fasta > test.fq
$ cat test.fq
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Change the 'H' to the quality score value that you need (Depending what phred score scale you are using).
Simulate genetic data
‘Simulating genetic data with R: an example with deleterious variants (and a pun)’
PDX/Xenograft
#!/bin/bash
module load gossamer
xenome index -M 24 -T 16 -P idx \
-H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
-G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
RNA-Seq
DNA-Seq
DNA Seq Data
NIH
- Go to SRA/Sequence Read Archiveand type the keywords 'Whole Genome Sequencing human'. An example of the procedures to search whole genome sequencing data from human samples:
- Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
- The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
- The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
- I pick the first one Accession:SRP066837 from this page. The page shows the Study type is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
- (Important trick) Click the number next to Run. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
- Download the raw data from any one of them (eg SRR2968056). For whole genome, the Strategy is WGS. For whole exome, the Strategy is called WXS.
- Search the keywords 'nonsynonymous' and 'human' in PMC
Use SRAToolKit instead of wget to download
Don't use the wget command since it requires the specification of right http address.
Downloading SRA data using command line utilities
SRA2R - a package to import SRA data directly into R.
(Method 1) Use the fastq-dump command. For example, the following command (modified from the document will download the first 5 reads and save it to a file called <SRR390728.fastq> (NOT sra format) in the current directory.
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
# OR
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar
This will download the files in FASTQ format.
(Method 2) If we need to downloading by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.
(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list
brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz
aspera-connect-3.6.2.117442-linux-64.sh
brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh
Installing Aspera Connect
Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
Restart firefox manually to load the Aspera Connect plug-in
Install complete.
brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
-i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
[email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
SRR590795.sra 100% 239MB 535Mb/s 00:06
Completed: 245535K bytes transferred in 7 seconds
(272848K bits/sec), in 1 file.
brb@T3600 ~/Downloads $
Aspera is typically 10 times faster than FTP according to the website. For this case, wget takes 12s while ascp uses 7s.
Note that the URL on the website's is wrong. I got the correct URL from emailing to ncbi help. Google: ascp "[email protected]"
SRAdb package
https://bioconductor.org/packages/release/bioc/html/SRAdb.html
First we install some required package for XML and RCurl.
sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev
and then
source("https://bioconductor.org/biocLite.R")
biocLite("SRAdb")
SRA
Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers
SRA Explorer
SRP056969
SRP066363 - lung cancer
SRP015769 or SRP062882 - prostate cancer
SRP053134 - breast cancer
Look at the MBases value column. It determines the coverage for each run.
SRP050992 single cell RNA-Seq
Used in Design and computational analysis of single-cell RNA-sequencing experiments
Single cell RNA-Seq
Exploiting single-cell expression to characterize co-expression replicability
SRP040626 or SRP040540 - Colon and rectal cancer
OmicIDX
OmicIDX on BigQuery
Tutorials
See the BWA section.
Whole Exome Seq
Whole Genome Seq
- BioProject and
- Search: filter by 'Homo sapiens wgs'
- Project data type: Genome sequencing
- Click 'Date' to sort by it
- 20 hits as of 1/11/2017 (many of them do not have data)
- PRJNA352450 18 experiments
- PRJNA343545 48 experiments
- 309109 5 experiments
- PRJNA289286 5 experiments
- PRJNA260389 27 experiments
- SRX699196 34,099,675 spots, 6.8G bases, 4.3Gb size
- 248553 3 experiments
- 210123 26 experiments
- 43433 3 experiments (ABI SOLiD System 3.0)
SraRunTable.txt
- http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
- http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click SRP004077
- http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click Runs from the RHS
- http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click RunInfoTable
Note that (For this study, it has 2377 rows)
- Column A (AssemblyName_s) eg GRCh37
- Column I (library_name_s) eg
- column N (header=Run_s) shows all SRR or ERR accession numbers.
- Column P (Sample_Name)
- Column Y (header=Assay_Type_s) shows WGS.
- Column AB (LibraryLayout_s): PAIRED
Public Data
ISB Cancer Genomics Cloud (ISB-CGC)
https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis
The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.
ISB-CGC Web Application
CCLE
Next-generation characterization of the Cancer Cell Line Encyclopedia 2019
It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).ß
NCI's Genomic Data Commons (GDC)/TCGA
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).
GTEx
Sharing data
Gene set analysis
Hypergeometric test
Next-generation sequencing data
Misc
Advice
Bioinformatics advice I wish I learned 10 years ago from NIH
High Performance
Cloud Computing
Merge different datasets (different genechips)
Normalization
Ensembl
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
File:Tablebrowser.png File:Tablebrowser2.png
Note
- the UCSC browser will return the output on browser by default. Users need to use the browser to save the file with self-chosen file name.
- the output does not have a header
- The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1
If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.
$ wc -l hg38Tables.bed
75893 hg38Tables.bed
$ head -2 hg38Tables.bed
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed
chr22_KI270734v1_random 131493 137393 NM_005675 0 + 131645 136994 0 5 262,161,101,141,549, 0,342,3949,4665,5351,
chr22_KI270734v1_random 138078 161852 NM_016335 0 - 138479 161586 0 15 589,89,99,176,147,93,82,80,117,65,150,35,209,313,164, 0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,
$ wc -l hg38CodingExon.bed
577387 hg38CodingExon.bed
$ head -2 hg38CodingExon.bed
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
$ tail -2 hg38CodingExon.bed
chr22_KI270734v1_random 156288 156497 NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r 0 -
chr22_KI270734v1_random 161313 161586 NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r 0 -
# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
$ grep NM_001276352 hg38Tables.bed
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
$ grep NM_001276352 hg38CodingExon.bed
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
chr1 67103237 67103382 NM_001276352_cds_2_0_chr1_67103238_r 0 -
chr1 67111576 67111644 NM_001276352_cds_3_0_chr1_67111577_r 0 -
chr1 67115351 67115464 NM_001276352_cds_4_0_chr1_67115352_r 0 -
chr1 67125751 67125909 NM_001276352_cds_5_0_chr1_67125752_r 0 -
chr1 67127165 67127240 NM_001276352_cds_6_0_chr1_67127166_r 0 -
This can be compared to refGene(?) directly downloaded via http
$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43-- http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’
hg38.refGene.txt.gz 100%[===============================================================>] 7.11M 901KB/s in 10s
2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]
$ zcat hg38.refGene.txt.gz | wc -l
75893
15:45PM /tmp$ zcat hg38.refGene.txt.gz | head -2
1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 63865227,63869295,63873667,63875815,63882718,63889189,63889849, 63865384,63869441,63873816,63875875,63882820,63889238,63891545, 0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0,
1815 NR_110164 chr2 + 161244738 161249050 161249050 161249050 2 161244738,161246874, 161244895,161249050, 0 LINC01806 unk unk -1,-1,
$ zcat hg38.refGene.txt.gz | tail -2
1006 NM_130467 chrX + 55220345 55224108 55220599 55224003 5 55220345,55221374,55221766,55222620,55223986, 55220651,55221463,55221875,55222746,55224108, 0 PAGE5 cmpl cmpl 0,1,0,1,1,
637 NM_001364814 chrY - 6865917 6874027 6866072 6872608 7 6865917,6868036,6868731,6868867,6870005,6872554,6873971, 6866078,6868462,6868776,6868909,6870053,6872620,6874027, 0 AMELY cmpl cmpl 0,0,0,0,0,0,-1,
Where to download reference genome
Which human reference genome to use?
http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)
See Table 1 of Chapter 18The Reference Sequence (RefSeq) Database.
Category
|
Description
|
NC
|
Complete genomic molecules
|
NG
|
Incomplete genomic region
|
NM
|
mRNA
|
NR
|
ncRNA
|
NP
|
Protein
|
XM
|
predicted mRNA model
|
XR
|
predicted ncRNA model
|
XP
|
predicted Protein model (eukaryotic sequences)
|
WP
|
predicted Protein model (prokaryotic sequences)
|
UCSC version & NCBI release corresponding
Gene Annotation
How many DNA strands are there in humans?
How many base pairs in human
- 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
- chromosome 22 has the smallest number of bps (~50 million).
- chromosome 1 has the largest number of bps (245 million base pairs).
- Illumina iGenome Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa file is 3.0GB (so is other genome.fa from human).
Gene, Transcript, Coding/Non-coding exon
SNP
Types of SNPs and number of SNPs in each chromosomes
NGS technology
DNA methylation
devtools::install_github("coloncancermeth","genomicsclass")
library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
dim(meth)
dim(pd)
length(gr)
colnames(pd)
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")
i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
lines(density(meth[,i],from=0,to=1),col=2)
}
# finding regions of the genome that are different between cancer and normal samples
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- ebayes(fit)
# plot of the region surrounding the top hit
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453
# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
#consider two example regions#
...
Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing
Sequence + Expression
Integrate RNA-Seq and DNA-Seq
Integrate/combine Omics
Gene expression
Expression level is the amount of RNA in cell that was transcribed from that gene. Slides from Alyssa Frazee.
Quantile normalization
- When to use Quantile Normalization? and its R package quantro
- normalize.quantiles() from preprocessCore package. Note for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
source('http://bioconductor.org/biocLite.R')
biocLite('preprocessCore')
#load package
library(preprocessCore)
#the function expects a matrix
#create a matrix using the same example
mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
ncol=3)
mat
# [,1] [,2] [,3]
#[1,] 5 4 3
#[2,] 2 1 4
#[3,] 3 4 6
#[4,] 4 2 8
#quantile normalisation
normalize.quantiles(mat)
# [,1] [,2] [,3]
#[1,] 5.666667 5.166667 2.000000
#[2,] 2.000000 2.000000 3.000000
#[3,] 3.000000 5.166667 4.666667
#[4,] 4.666667 3.000000 5.666667
Merging two gene expression studies
- Alternative empirical Bayes models for adjusting for batch effects in genomic studies Zhang et al. BMC Bioinformatics 2018. The R package is BatchQC from Bioconductor.
- Combat() function in sva package from Bioconductor.
- It can remove both known batch effects and other potential latent sources of variation.
- The tutorial includes information on (1) how to estimate the number of latent sources of variation, (2) how to apply the sva package to estimate latent variables such as batch effects, (3) how to directly remove known batch effects using the ComBat function, (4) how to perform differential expression analysis using surrogate variables either directly or with thelimma package, and (4) how to apply “frozen” sva to improve prediction and clustering.
- Tutorial example to remove the batch effect
library(sva)
library(bladderbatch)
data(bladderdata)
pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat,
par.prior=TRUE, prior.plots=FALSE)
# This returns an expression matrix, with the same dimensions
# as your original dataset.
# By default, it performs parametric empirical Bayesian adjustments.
# If you would like to use nonparametric empirical Bayesian adjustments,
# use the par.prior=FALSE option (this will take longer).
Fusion gene
https://en.wikipedia.org/wiki/Fusion_gene
Structural variation
LUMPY, DELLY, ForestSV, Pindel, breakdancer , SVDetect.
RNASeq + ChipSeq
Labs
Biowulf2 at NIH
Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.
Selected Papers
Pictures
https://www.flickr.com/photos/genomegov
用DNA做身分鑑識
用DNA做身分鑑識
如何自学入门生物信息学
https://zhuanlan.zhihu.com/p/32065916
Staying current
Staying Current in Bioinformatics & Genomics: 2017 Edition
Papers
Common issues in algorithmic bioinformatics papers
What are some common issues I find when reviewing algorithmic bioinformatics conference papers?
Precision Medicine courses
Personalized medicine
Cancer and gene markers
- Colorectal cancer patients without KRAS mutations have far better outcomes with EGFR treatment than those with KRAS mutations.
- Two EGFR inhibitors, cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
- Breast cancer.
The shocking truth about space travel
7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station
bioSyntax: syntax highlighting for computational biology
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y
Deep learning
Deep learning: new computational modelling techniques for genomics
Terms
基因结构
https://zhuanlan.zhihu.com/p/49601643
RNA sequencing 101
Web
- Introduction to RNA-Seq including biology overview (DNA, Alternative splcing, mRNA structure, human genome) and sequencing technology.
- RNA Sequencing 101 by Agilent Technologies. Includes the definition of sequencing depth (number of reads per sample) and coverage (number of reads/locus).
- Where do we get reads(A,C,T,G) from sample RNA? See page 12 of this pdf from Colin Dewey in U. Wisc.
- Quantification of RNA-Seq data (see the above pdf)
- Convert read counts into expression: RPKM (see the above pdf)
- RPKM and FPKM (Data analysis of RNA-seq from new generation sequencing by 張庭毓) RNA-Seq資料分析研討會與實作課程 / RNA Seq定序 / 次世代定序(NGS) / 高通量基因定序 分析.
- First vs Second generation sequence.
- The Simple Fool’s Guide to Population Genomics via RNASeq: An Introduction to HighThroughput Sequencing Data Analysis. This covers QC, De novo assembly, BLAST, mapping reads to reference sequences, gene expression analysis and variant (SNP) detection.
- An Introduction to Bioinformatics Resources and their Practical Applications from NIH library Bioinformatics Support Program.
- Teaching material from rnaseqforthenextgeneration.org which includes Designing RNA-Seq experiments, Processing RNA-Seq data, and Downstream analyses with RNA-Seq data.
Books
strand-specific vs non-strand specific experiment
Understand this info is necessary when we want to use summarizeOverlaps() function (GenomicAlignments) or htseq-count python program to get count data.
This post mentioned to use infer_experiment.py script to check whether the rna-seq run is stranded or not.
The rna-seq experiment used in this tutorial is not stranded-specific.
FASTQ
- FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
Phred quality score
q = -10log10(p) where p = error probability for the base.
q
|
error probability
|
base call accuracy
|
10
|
0.1
|
90%
|
13
|
0.05
|
95%
|
20
|
0.01
|
99%
|
30
|
0.001
|
99.9%
|
40
|
0.0001
|
99.99%
|
50
|
0.00001
|
99.999%
|
FASTA
fasta/fa files can be used as reference genome in IGV. But we cannot load these files in order to view them.
Download sequence files
Compute the sequence length of a FASTA file
https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
head -2 file.fa | \
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' | \
tail -1
FASTA <=> FASTQ conversion
According to this post,
- FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
- FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)
Convert FASTA to FASTQ without quality scores
Biostars. For example, the bioawk by lh3 (Heng Li) worked.
Convert FASTA to FASTQ with quality score file
See the links on the above post.
Convert FASTQ to FASTA using Seqtk
Use the Seqtk program; see this post.
The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.
RPKM (Mortazavi et al. 2008)
Reads per Kilobase of Exon per Million of Mapped reads.
Idea
- The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
- Longer transcript are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!. As such, the DESeq2 only creates a size factor for each library and normalize the counts by dividing counts by a size factor (scalar) for each library. Note that: H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2 where c is gene length.
Calculation
- Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
- Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
- Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
Formula
RPKM = (10^9 * C)/(N * L), with
C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
[,1] [,2] [,3] [,4]
[1,] 0 0 5 0
[2,] 6 2 7 3
[3,] 5 13 7 2
[4,] 3 3 9 11
[5,] 1 2 1 15
d <- DGEList(counts=y, lib.size=1001:1004)
# Note that lib.size is optional
# By default, lib.size = colSums(counts)
cpm(d) # counts per million
Sample1 Sample2 Sample3 Sample4
1 0.000 0.000 4985.045 0.000
2 5994.006 1996.008 6979.063 2988.048
3 4995.005 12974.052 6979.063 1992.032
4 2997.003 2994.012 8973.081 10956.175
5 999.001 1996.008 997.009 14940.239
> cpm(d,log=TRUE)
Sample1 Sample2 Sample3 Sample4
1 7.961463 7.961463 12.35309 7.961463
2 12.607393 11.132027 12.81875 11.659911
3 12.355838 13.690089 12.81875 11.129470
4 11.663897 11.662567 13.17022 13.451207
5 10.285119 11.132027 10.28282 13.890078
d$genes$Length <- c(1000,2000,500,1500,3000)
rpkm(d)
Sample1 Sample2 Sample3 Sample4
1 0.0000 0.000 4985.0449 0.000
2 2997.0030 998.004 3489.5314 1494.024
3 9990.0100 25948.104 13958.1256 3984.064
4 1998.0020 1996.008 5982.0538 7304.117
5 333.0003 665.336 332.3363 4980.080
> cpm
function (x, ...)
UseMethod("cpm")
<environment: namespace:edgeR>
> showMethods("cpm")
Function "cpm":
<not an S4 generic function>
> cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
...)
{
x <- as.matrix(x)
if (is.null(lib.size))
lib.size <- colSums(x)
if (log) {
prior.count.scaled <- lib.size/mean(lib.size) * prior.count
lib.size <- lib.size + 2 * prior.count.scaled
}
lib.size <- 1e-06 * lib.size
if (log)
log2(t((t(x) + prior.count.scaled)/lib.size))
else t(t(x)/lib.size)
}
<environment: namespace:edgeR>
> rpkm.default
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
...)
{
y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
gene.length.kb <- gene.length/1000
if (log)
y - log2(gene.length.kb)
else y/gene.length.kb
}
<environment: namespace:edgeR>
Here for example the 1st sample and the 2nd gene, its rpkm value is calculated as
# step 1:
6/(1.0e-6 *1001) = 5994.006 # cpm, compute column-wise
# step 2:
5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise
# Another way
# step 1 (RPK)
6/ (2000/1.0e3) = 3
# step 2 (RPKM)
3/ (1.0e-6 * 1001) = 2997.003
Critics
Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
(another critic) Union Exon Based Approach
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910
In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.
It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.
FPKM (Trapnell et al. 2010)
Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks).
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
- The youtube video is on here. TPM 比 RPKM/FPKM 好因為 total reads in each experiments are the same.
- Differences
- The main difference between RPKM and FPKM is that the former is a unit based on single-end reads, while the latter is based on paired-end reads and counts the two reads from the same RNA fragment as one instead of two.
- The difference between RPKM/FPKM and TPM is that the former calculates sample-scaling factors before dividing read counts by gene lengths, while the latter divides read counts by gene lengths first and calculates samples calling factors based on the length-normalized read counts.
- If researchers would like to interpret gene expression levels as the proportions of RNA molecules from different genes in a sample,
- 为什么说FPKM和RPKM都错了?
- Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples (TPM method) by Wagner 2012.
- Between samples normalization.
> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
sample1 sample2 sample3 sample4
gene1 0 1 1 6
gene2 2 0 0 3
gene3 18 9 19 12
gene4 12 25 13 13
gene5 22 26 10 8
gene6 6 5 8 6
> dds <- estimateSizeFactors(dds)
> head(counts(dds))
sample1 sample2 sample3 sample4
gene1 0 1 1 6
gene2 2 0 0 3
gene3 18 9 19 12
gene4 12 25 13 13
gene5 22 26 10 8
gene6 6 5 8 6
> head(counts(dds, normalized=TRUE))
sample1 sample2 sample3 sample4
gene1 0.00000 0.9654796 0.9858756 5.732657
gene2 1.96066 0.0000000 0.0000000 2.866328
gene3 17.64594 8.6893164 18.7316365 11.465314
gene4 11.76396 24.1369899 12.8163829 12.420756
gene5 21.56726 25.1024695 9.8587560 7.643542
gene6 5.88198 4.8273980 7.8870048 5.732657
- TPM has been suggested as a better unit than RPKM/FPKM. But it cannot be used to do comparison between samples].
P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)
TMM (Robinson and Oshlack, 2010)
Trimmed Means of M values (EdgeR).
Sample size
Coverage
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
C = L N / G
where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
# Assume the bam file is sorted by chromosome location
# took 40 min on 5.8G bam file. samtools depth has no threads option:(
# it is not right since it only account for regions that were covered with reads
samtools depth *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}' # maybe 42
# The following is the right way! The result matches with Qualimap program.
samtools depth -a *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}' # maybe 8
# OR
LEN=`samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'` # 3095693981
SUM=`samtools depth bamfile | awk '{sum+=$3} END { print "Sum = ", sum}'` # 24473867730
echo $(( $LEN/$SUM ))
SAM/Sequence Alignment Format and BAM format specification
Single-end, pair-end, fragment, insert size
Germline vs Somatic mutation
Germline: inherit from parents. See the Wikipedia page.
Driver vs passenger mutation
https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer
Nonsynonymous mutation
It is related to the genetic code, Wikipedia. There are 20 amino acids though there are 64 codes.
See
isma: analysis of mutations detected by multiple pipelines
isma: an R package for the integrative analysis of mutations detected by multiple pipelines
Missense variants
aminoacid changing variants
Alternative and differential splicing
Best practices and appropriate workflows to analyse alternative and differential splicing
Allele vs Gene
http://www.diffen.com/difference/Allele_vs_Gene
- A gene is a stretch of DNA or RNA that determines a certain trait.
- Genes mutate and can take two or more alternative forms; an allele is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
- An allele is found at a fixed spot on a chromosome?
- Chromosomes occur in pairs so organisms have two alleles for each gene — one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be same (homozygous) or different (heterozygotes).
Locus
https://en.wikipedia.org/wiki/Locus_(genetics)
Base quality, Mapping quality, Variant quality
Mapping quality (MAPQ) vs Alignment score (AS)
http://seqanswers.com/forums/showthread.php?t=66634 & SAM format specification
- MAPQ (5th column): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong} (defined by SAM documentation), rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. MAPQ is a metric that tells you how confident you can be that the read comes from the reported position. So given 1000 reads, for example, read alignments with mapping quality being 30, one of them will be wrong in average (10^(30/-10)=.001). Another example, if MAPQ=70, then the probability mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Users should refer to the alignment program for the 'MAPQ' value it uses.
- AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)
Note:
- MAPQ scores produced by the aligners typically involves the alignment score and other information.
- You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but still the reported position is still much more probable than any other.
- You probably want to filter for MAPQ, but "good" alignment may refer to AS if what you care is similarity between read and reference.
- MAPQ values are really useful but their implementation is a mess by Simon Andrews
Other software
Partek
MeV v4.8 (11/18/2011) allows annotation from Bioconductor
IPA from Ingenuity
Login:
There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.
Features:
- easily search the scientific literature/integrate diverse biological information.
- build dynamic pathway models
- quickly analyze experimental data/Functional discovery: assign function to genes
- share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.
Start Here
Expression data -> New core analysis -> Functions/Diseases -> Network analysis
Canonical pathways |
| |
Simple or advanced search --------------------+ |
| |
v |
My pathways, Lists <------+
^
|
Creating a custom pathway --------------------+
Resource:
Notes:
- The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
- The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
- The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
- In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
- The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.
Screenshots:
File:IngenuityAnalysisOutput.png
It offers an integrated annotation combining gene ontology, pathways and protein annotations.
It can be used to identify the pathways associated with a set of genes; e.g. this paper.
GOTrapper
GOTrapper: a tool to navigate through branches of gene ontology hierarchy
Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).
GSEA
GWAS
Genome-wide association studies in R