Wiki for BRB-SeqTools. Like Galaxy, BRB-SeqTools is not a method.

Windows 10 Bash shell

Gene counting and variant call (both Samtools and GATK) works fine. For variant annotation, see the comment #2 below.

Need to install Xming. Before calling ./SeqTools from the Bash shell, run export DISPLAY=:0 first.
- If we like to start Xming automatically when Windows boots, follow the instruction How to Make a Program Run at Startup on Any Computer.
- Press Windows + R (or 'run' in the search box) and click the Enter key. Type “shell:startup” into the Run dialog, and press Enter. Now drag-and-drop the Xming shortcut from the “All Apps” list in the Start menu directly into this folder. Reboot Windows to make this change to work. It also helps to download the BRB-SeqTools icon (read below).
Automatic setup:
- Need to install unzip utility sudo apt-get install unzip.
- Install fontconfig library sudo apt-get install libfontconfig1-dev.
- New gnome-terminal windows (use apt-get install) cannot be opened from Bash shell. This affect Automatic setup tools in Tools manager and profile manager.
- Java JDK from ppa:webup8team does not work. We need to download/install it <jdk-8u112-linux-x64.tar.gz> from Oracle website. To make the installation silently, we need to add two lines to the installation script. See here for apt-get approach and here for tarball approach.
- An issue in pandoc: timer-create function not implemented. In other words, if we run variant annotation, we will get a bug report message. The main output files (1 vcf and 2 texts files) are generated but the pdf/html files cannot be created.
It is useful to create a Windows icon on the Windows desktop for quick access to BRB-SeqTools program. The BRB-SeqTools icon <BRB-SeqTools.lnk> can be found on Github. A modified automatic setup script <install_rnaseq.sh> can be also found there.

Performance

A subset of GSE48215 (about 1/10 of the original FASTQ files) created to run the benchmark.

mkdir GSE48215_22000000
head -n 22000000 GSE48215/SRR925751_1.fastq  > GSE48215_22000000/SRR925751_1.fastq
head -n 22000000 GSE48215/SRR925751_2.fastq  > GSE48215_22000000/SRR925751_2.fastq

The reference genome file is based on UCSC_hg19_chr1 as part of the DNA-Seq sample data.

	time (min)
Ubuntu 14.04 host	11
Ubuntu 16.04 vm	26
Windows 10 vm	32

Both virtual machines have 6 cores CPU and 16GB memory. For this dataset, about 8GB memory is enough. VirtualBox 5.0.30 was used.

Software List

See Tools Manager -> Automatic setup. A developer version of the shell script is available on Github. Note: GATK and annovar will not be installed automatically due to the license issue.

Program	Major language	Version	Linux OS	Mac OS	Repository	Monitor
bowtie2	C++	2.2.6	src & binary zip	src & binary zip	Github & SourceForge	c-d
tophat	C++	2.1.0	Linux binary tar.gz	Mac binary tar.gz	jhu.edu	c-d
bwa	C	0.7.12	src	src	Github	cron job
star	C++	2.5.1b	one binary tar.gz	one binary tar.gz	Github	cron job
picard	Java	1.141			Github	cron job
samtools	C	1.3	src	src	Github	c-d
GATK*	Java	3.6			broadinstitute.org	c-d
bcftools	C	1.3	src	src	Github	c-d
htslib	C	1.3	src	src	Github	c-d
annovar*	Perl	2016Feb01			openbioinformatics.org	c-d
sratoolkit	Shell	2.7.0	Linux binary tar.gz	Mac binary tar.gz	nih.gov & Github	c-d
fastqc	Java	0.11.5		dmg	bioinformatics.babraham.ac.uk	??
fastx	C	0.0.13	Linux binary tar.bz2	Mac binary tar.bz2	hannonlab.cshl.edu	??
snpeff	Java	4_2			SourceForge & Github	c-d
htseq	Python	0.6.1p1	src	src	python.org	c-d
R	R	3.3.x	apt-get	app	cran.rstudio.com	c-d
pandoc	Haskell	1.16.0.2	deb	pkg	Github	??
latex		Ubuntu repository	apt-get	pkg	Ubuntu/tug.org	??
lftp		Ubuntu repository	apt-get	homebrew	Ubuntu/Rudix	??
avfs		Ubuntu repository	apt-get	src	Ubuntu/SourceForge	??
Java (jdk)		1.8.0 (8u112)	tar.gz	dmg	oracle.com	??
subread	C	1.5.2	src & binary	src & binary	SourceForge	c-d

c-d: changedetection.com
cron job: shell script to check daily
The file name for GATK is GenomeAnalysisTK-3.7.tar.bz2 (~13 MB)
The file name for ANNOVAR is annovar.latest.tar.gz (~113 MB)

Download failure

Several software repositories (eg Github, not sourceforge) are hosted by Amazon S3. So be ware of possible Amazon AWS outage.

Hard Disk Space

Tools Manager: Automatic setup will download 1.2GB data and take about 3GB disk space. Automatic setup will also download tools required for automatic method in Reference Genome Profile Manager (avfs) and cosmic download in variant annotator (lftp).
Reference Genome Profile Manager: All human genomes need to download 20GB data except hg19 that will download 40GB data.
Variant Annotation: ~~Each of~~ snpEff ~~and ANNOVAR~~ will download 14GB database for dbNSFP. Total is 28GB for either GRCh37/hg19 or GRCh38/hg38. annovar/humandb folder will have its own dbnsfp file eg hg38_dbnsfp*.txt.

Virtual machine

For a 100GB dynamic allocated space VM,

space in GB	total	used	avail	vdi	ova
After Ubuntu installation	89	3.7	76	4.5	1.7
After running Automatic setup (Tools Manager)	89	6.9	72	8.4	4.2
After running the RNS-Seq sample data	89
After running the DNS-Seq sample data	89
After running snpEff on the DNS-Seq sample data	89

RAM requirement

Index step in STAR requires more than 30GB memory (30GB is not enough as we tested on macOS) for human genome. If there is a memory problem, SeqTools will abort and generate a BugReport.tar.gz file.

samples.txt file

Input: fastq

Assume we want to run GATK for variant call.

echo -e 'LibraryName\tLibraryLayout\tfastq1\tfastq2\tReadGroup\tSequencerManufacturer
Sample1\tPAIRED\tSample1_read1.fq\tSample1_read2.fq\t1\tillumina' > samples.txt

Input: bam

Assume we want to run GATK for variant call.

echo -e 'LibraryName\tfastq1\tReadGroup\tSequencerManufacturer
Sample1\tSample1.bam\t1\tillumina' > samples.txt

Put common paths in $PATH; source command

For convenience, we can create a file to store the common directories

$ cat ~/bin/setpath 
export PATH=/opt/SeqTools/bin/samtools-1.4:$PATH
export PATH=/opt/SeqTools/bin/gatk-3.8:$PATH
export PATH=/opt/SeqTools/bin/picard-tools-2.1.1/:$PATH
export PATH=/opt/SeqTools/bin/samtools-1.4/htslib-1.4/:$PATH

To use this file, we just need to run source ~/bin/setpath. Dada!! We can use samtools, bgzip, tabix, ... without specifying their full paths.

See

Predefined Datasets Locations

Demo data: testdata (parent directory is determined by the user)
Reference genome from the automatic method: BRB_SeqTools_autosetup_reference_genome_files on v1.0 and RefGenProfiles on v1.2 (parent directory is determined by the user)
Database files from somatic mutation annotator tool: ~/variantAnnoDatabase

$ find .
.
./.DS_Store
./cosmic
./cosmic/.DS_Store
./cosmic/GRCh37
./cosmic/GRCh37/CosmicCodingMuts.vcf.gz
./cosmic/GRCh37/CosmicCodingMuts.vcf.gz.tbi
./cosmic/GRCh38
./cosmic/GRCh38/CosmicCodingMuts.vcf
./cosmic/GRCh38/CosmicCodingMuts.vcf.gz
./cosmic/GRCh38/CosmicCodingMuts.vcf.gz.tbi
./dbNSFP
./dbNSFP/dbNSFP3.2a_hg19.txt.gz     # 13 GB
./dbNSFP/dbNSFP3.2a_hg19.txt.gz.data_types
./dbNSFP/dbNSFP3.2a_hg19.txt.gz.tbi
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz   # 13 GB
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz.data_types
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz.tbi
./dbsnp
./dbsnp/GRCh37
./dbsnp/GRCh37/common_all_20160601.vcf.gz
./dbsnp/GRCh37/common_all_20160601.vcf.gz.tbi
./dbsnp/GRCh37/log.txt
./dbsnp/GRCh38
./dbsnp/GRCh38/log.txt

Note that the dbNSFP in this directory was used by SNPEFF only. ANNOVAR has its own dbnsfp files.

GATK

For GATK 3.8, the (only) binary file is GenomeAnalysisTK.jar. See https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

For GATK 4.0.0.0 (released 1/9/2018), the binary files are

gatk
gatk-package-[version]-local.jar
gatk-package-[version]-spark.jar

So the script in BRB-SeqTools won't work with GATK 4.0.0.0 release.

Starting with version 4.0, GATK contains a copy of the Picard toolkit.

The GATK in Biowulf is located in /usr/local/apps/GATK.

A practical introduction to GATK 4 on Biowulf (NIH HPC)

ANNOVAR

The result depends on several factors.

ANNOVAR version
ANNOVAR_DATA
cosmicVCF CosmicCodingMuts.vcf.gz
dbSNPVCF
genomeRef
minQual
minReadDepth
minMapQual

Pay attention to the ANNOVAR version that was used. When I use the version 2017-07-16 to run <table_annovar.pl>, I got a message

WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file!
...
ERROR running system command: <annotate_variation.pl -geneanno -buildver hg38 -dbtype knownGene .....

To check your current ANNOVAR version, type

annotate_variation.pl

See FAQ 41 on How to check if new version of ANNOVAR is available?

Reference sequencing and annotation files from iGenomes

Note that the file size may change as the file is updated.

	size (bytes)	md5sum
Ensembl GRCh37 Homo_sapiens_Ensembl_GRCh37.tar.gz	19971514224	3077c2b593615e418160ae878009b4b5
NCBI GRCh38 Homo_sapiens_NCBI_GRCh38.tar.gz	15848139211	61d263698f0283075f63b1514a16045d
UCSC hg19 Homo_sapiens_UCSC_hg19.tar.gz	45468620403	b64a925c6b7cc0391b3ee188ecec4e48
UCSC hg38 Homo_sapiens_UCSC_hg38.tar.gz	16006984068	97d98dcb25fb041be2f2139322a84cdb

Note that

The download speed is about 3MB/s at home and 7.8~11.4MB/s at office (shown at the end of stdout using wget).
The tarball has a file structure that starts at 'Mus-musculus' or 'Homo_sapiens' folder name (no 'igenomes').

Timing

RNA-Seq

DNA-Seq

Running BWA-MEM & GATK from xenograft data.

Note: compress a 14G fq to 4.2G fq.gz took 30 minutes.

Step	Output file	Time (h:m), Size [Data 1]	Time, Size [Data 2]
Source in fq Source in fq.gz		28G 4.2G	66G 16G
bwa mem	bwa.bam	0:30, 11G	1:00, 19G
pdx specific (sort by pos) pdx specific (remove mouse primary) pdx specific (remove mouse secondary) pdx specific (sort by names)	bwa_sorted.bam bwa_homo.bam bwa_homo2.sam bwa_homo2_sort.bam	0:10, 9.1G 0:10, 9.1G 0:15, 35G 0:18, 11G	0:16, 11G 0:12, 9.8G 0:24, 77G 0:30, 18G
samtools fixmate	accepted_hits.bam	0:40, 11G	1:15, 18G
samtools sort (sort by pos)	sorted.bam	0:10, 9.1G	0:15, 9.8G
Picard MarkDuplicates	bad.bam	1:00, 9.3G	1:35, 11G
Picard AddOrReplaceReadGroups	rg_added_sorted.bam	0:30, 9.3G	0:40, 11G
Picard ReorderSam	split.bam	0:30, 9.3G	0:50, 11G
GATK RealignerTargetCreator	BRB-SeqTools_indels_only.intervals
GATK IndelRealigner (need known SNPs for '-targetIntervals')	realigned_reads.bam	0:35, 9.2G	1:00, 11G
GATK BaseRecalibrator (need known SNPs) GATK PrintReads	recal_data.table recal.bam	0:30, 998K 2:00, 12G	1:00, 215K 5:50, 24G
GATK HaplotypeCaller	vcf	2:30, 3.6M	2:50, 58M
Total		9:46, 28G (src) + 192G (lib)	17:43, 16G (src) + 226G (lib)

Tips

Tutorial videos

Directory name specification in command line

Use a full path instead of "~/" to avoid a problem of identifying vcf files.

Cosmic download failure

The command line to download the <CosmicCodingMuts.vcf.gz> file (~50 MB) on macOS is

$ /usr/local/bin/lftp sftp://EMAIL:[email protected]:22  \
  -e "set sftp:auto-confirm yes; get /cosmic/grch38/cosmic/v78/VCF/CosmicCodingMuts.vcf.gz; bye"

sftp:auto-confirm: no such variable. Use `set -a' to look at all variables.
get: /cosmic/grch38/cosmic/v78/VCF/CosmicCodingMuts.vcf.gz: Fatal error: Host key verification failed

To further investigate the problem.

$ sftp EMAIL:[email protected]:22
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for sftp-cancer.sanger.ac.uk has changed,
and the key for the corresponding IP address 193.62.203.28
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:OTiODuNLpRbXhKGYgMe8XWsa1jA3JM6x4k/NUeEnJEU.
Please contact your system administrator.
Add correct host key in /Users/USERNAME/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/USERNAME/.ssh/known_hosts:8
ECDSA host key for sftp-cancer.sanger.ac.uk has changed and you have requested strict checking.
Host key verification failed.
Connection closed
$ grep sanger ~/.ssh/known_hosts
sftp-cancer.sanger.ac.uk,193.62.203.24 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNlj3i1Fccf8QVjXW9iwzNKQ3p4RKl6RL+RdVsFHWlgsg6HvLo7Y0te7oXx2YR8+06vRyW0eePTcJRiFAIeNmcU=
$

Use ssh-keygen -R hostname to delete the offending host key from the <known_hosts>.

For the sftp:auto-confirm: no such variable message see https://github.com/eqyiel/grunt-lftp/issues/1; need to upgrade lftp program to version 4.6.2.

Other software

Bioinformatics Recipes. It is like GEO2R.

The user interface ("Execute" on LHS menu) allows non-programmers to run analysis with required inputs
It can generate shell script ("Code" on LHS menu) to run locally (it assumes all required programs are in $PATH).

This tool teaches some tools; e.g. esearch and efetch for downloading data from NCBI PRJNA601630.