Seqtools: Difference between revisions
Revision as of 07:47, 10 June 2017
Wiki for BRB-SeqTools
Windows 10 Bash shell
Gene counting and variant calling (both Samtools and GATK) work fine. For variant annotation, see the comment #2 below.
- Xming needs to be installed. Before calling ./SeqTools from the Bash shell, run export DISPLAY=:0 first.
- To start Xming automatically when Windows boots, follow the instructions in How to Make a Program Run at Startup on Any Computer.
- Press Windows + R (or type 'run' in the search box and press Enter). Type “shell:startup” into the Run dialog and press Enter. Then drag and drop the Xming shortcut from the “All Apps” list in the Start menu into this folder. Reboot Windows for the change to take effect. It also helps to download the BRB-SeqTools icon (see below).
- Automatic setup:
- Install the unzip utility: sudo apt-get install unzip.
- Install the fontconfig library: sudo apt-get install libfontconfig1-dev.
- New gnome-terminal windows (installed with apt-get) cannot be opened from the Bash shell. This affects the Automatic setup tools in the Tools Manager and the Profile Manager.
- The Java JDK from ppa:webupd8team does not work. Download <jdk-8u112-linux-x64.tar.gz> from the Oracle website and install it manually. To make the installation silent, two lines need to be added to the installation script. See here for the apt-get approach and here for the tarball approach.
- An issue in pandoc: the timer-create function is not implemented. As a result, running variant annotation produces a bug report message. The main output files (1 vcf and 2 text files) are generated, but the pdf/html files cannot be created.
- It is useful to create an icon on the Windows desktop for quick access to the BRB-SeqTools program. The BRB-SeqTools icon <BRB-SeqTools.lnk> can be found on Github, along with a modified automatic setup script <install_rnaseq.sh>.
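The DISPLAY step above can be wrapped in a small helper so it is not forgotten before each launch. This is only a minimal sketch, not part of BRB-SeqTools: the function name `ensure_display` is hypothetical, and it assumes Xming is already running on display :0.

```shell
#!/bin/sh
# Hypothetical helper for the Windows 10 Bash shell (not shipped with BRB-SeqTools).

# Point X clients at the local Xming server unless DISPLAY is already set.
ensure_display() {
    if [ -z "${DISPLAY:-}" ]; then
        DISPLAY=:0
        export DISPLAY
    fi
}

ensure_display
echo "Using DISPLAY=$DISPLAY"
```

Once the helper has run, ./SeqTools can be started from the same shell. An existing DISPLAY value is left untouched, so the helper should also be harmless on a regular Linux desktop.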
Performance
A subset of GSE48215 (about 1/10 of the original FASTQ files) was created to run the benchmark.
mkdir GSE48215_22000000
head -n 22000000 GSE48215/SRR925751_1.fastq > GSE48215_22000000/SRR925751_1.fastq
head -n 22000000 GSE48215/SRR925751_2.fastq > GSE48215_22000000/SRR925751_2.fastq
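Assuming the usual four-lines-per-record FASTQ layout, the 22,000,000 lines kept by the head commands above correspond to 5,500,000 reads per file. Taking the same multiple of four from both mate files keeps the read pairs in sync. The sketch below expresses the subset in reads instead of lines; `fastq_head` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep the first N reads of a FASTQ file, assuming 4 lines per record.
# fastq_head is a hypothetical helper name, not part of BRB-SeqTools.
fastq_head() {
    reads=$1   # number of reads to keep
    src=$2     # input FASTQ file
    dest=$3    # output FASTQ file
    head -n "$((reads * 4))" "$src" > "$dest"
}

# 22,000,000 lines / 4 lines per record = 5,500,000 reads
echo "$((22000000 / 4)) reads"
```

For example, `fastq_head 5500000 GSE48215/SRR925751_1.fastq GSE48215_22000000/SRR925751_1.fastq` would reproduce the first head command.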
The reference genome file is based on UCSC_hg19_chr1 as part of the DNA-Seq sample data.
Platform | time (min) |
---|---|
Ubuntu 14.04 host | 11 |
Ubuntu 16.04 vm | 26 |
Windows 10 vm | 32 |
Both virtual machines have 6 CPU cores and 16GB of memory. For this dataset, about 8GB of memory is enough. VirtualBox 5.0.30 was used.
Software List
See Tools Manager -> Automatic setup. A developer version of the shell script is available on Github. Note: GATK and annovar (marked with * below) are not installed automatically because of license restrictions.
Program | Major language | Version | Linux OS | Mac OS | Repository | Monitor |
---|---|---|---|---|---|---|
bowtie2 | C++ | 2.2.6 | src & binary zip | src & binary zip | Github & SourceForge | c-d |
tophat | C++ | 2.1.0 | Linux binary tar.gz | Mac binary tar.gz | jhu.edu | c-d |
bwa | C | 0.7.12 | src | src | Github | cron job |
star | C++ | 2.5.1b | one binary tar.gz | one binary tar.gz | Github | cron job |
picard | Java | 1.141 | | | Github | cron job |
samtools | C | 1.3 | src | src | Github | c-d |
GATK* | Java | 3.6 | | | broadinstitute.org | c-d |
bcftools | C | 1.3 | src | src | Github | c-d |
htslib | C | 1.3 | src | src | Github | c-d |
annovar* | Perl | 2016Feb01 | | | openbioinformatics.org | ?? |
sratoolkit | Shell | 2.7.0 | Linux binary tar.gz | Mac binary tar.gz | nih.gov & Github | c-d |
fastqc | Java | 0.11.5 | | dmg | bioinformatics.babraham.ac.uk | ?? |
fastx | C | 0.0.13 | Linux binary tar.bz2 | Mac binary tar.bz2 | hannonlab.cshl.edu | ?? |
snpeff | Java | 4_2 | | | SourceForge & Github | c-d |
htseq | Python | 0.6.1p1 | src | src | python.org | c-d |
R | R | 3.3.x | apt-get | app | cran.rstudio.com | c-d |
pandoc | Haskell | 1.16.0.2 | deb | pkg | Github | ?? |
latex | | Ubuntu repository | apt-get | pkg | Ubuntu/tug.org | ?? |
lftp | | Ubuntu repository | apt-get | homebrew | Ubuntu/Rudix | ?? |
avfs | | Ubuntu repository | apt-get | src | Ubuntu/SourceForge | ?? |
Java (jdk) | | 1.8.0 (8u112) | tar.gz | dmg | oracle.com | ?? |
subread | C | 1.5.2 | src & binary | src & binary | SourceForge | c-d |
- c-d: changedetection.com
- cron job: shell script to check daily
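The daily check script itself is not shown on this page. As a rough illustration only, the sketch below compares a pinned version from the table with a "latest" version string that is assumed to have been fetched by some other means. The function name `report_update` and the "latest" values are placeholders, not real release data.

```shell
#!/bin/sh
# Hypothetical daily version check; report_update is an illustrative name.
# A crontab entry such as "0 6 * * * /path/to/check_versions.sh" would run it once a day.

# Print a notice when the pinned version differs from the latest known one.
report_update() {
    tool=$1
    pinned=$2
    latest=$3
    if [ "$pinned" != "$latest" ]; then
        echo "$tool: pinned $pinned, latest $latest"
    fi
}

# Pinned versions come from the table above; the "latest" values are placeholders.
report_update bwa 0.7.12 0.7.12
report_update star 2.5.1b 2.5.2a
```

The comparison is a plain string inequality, so any change in the upstream version string is reported, whether or not it is actually newer.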
Download failure
Several software repositories (e.g. Github, but not SourceForge) are hosted on Amazon S3, so beware of possible Amazon AWS outages.
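One way to ride out a short outage is to retry a failed download a few times before giving up. The following is a minimal sketch under stated assumptions: `retry` is a hypothetical helper, and `RETRY_DELAY` (seconds between attempts, default 30) is an assumed setting rather than anything used by Automatic setup.

```shell
#!/bin/sh
# Sketch: retry a command a few times with a pause between attempts.
# retry is a hypothetical helper; RETRY_DELAY (seconds) is an assumed knob.
retry() {
    attempts=$1
    shift
    n=1
    # Run the command; on failure, wait and try again until attempts are used up.
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-30}"
    done
}
```

A download could then be wrapped as `retry 5 wget -c "$url"`, where `$url` stands for the download link and `-c` lets wget continue a partial file.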
Hard Disk Space
- Tools Manager: Automatic setup downloads 1.2GB of data and takes about 3GB of disk space. It also downloads the tools required by the automatic method in the Reference Genome Profile Manager (avfs) and by the cosmic download in the variant annotator (lftp).
- Reference Genome Profile Manager: Each human genome requires a download of about 20GB, except hg19, which requires about 40GB.
- Variant Annotation: snpEff and ANNOVAR each download a 14GB dbNSFP database. The total is 28GB for GRCh37/hg19 and GRCh38/hg38 together.
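Given these sizes, it can be worth checking free disk space before starting a download. The sketch below is one possible check; `has_free_gb` is a hypothetical helper name, and the 28GB threshold is simply the dbNSFP total quoted above.

```shell
#!/bin/sh
# Sketch: check that a directory has enough free space before a large download.
# has_free_gb is a hypothetical helper name.
has_free_gb() {
    dir=$1
    need_gb=$2
    # With -P and -k, df prints one line per filesystem and column 4 is the
    # available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$((need_gb * 1024 * 1024))" ]
}

# 28GB is the dbNSFP total quoted above for snpEff plus ANNOVAR.
target="${HOME:-.}"
if has_free_gb "$target" 28; then
    echo "enough space for the dbNSFP databases under $target"
else
    echo "less than 28GB free under $target"
fi
```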
Virtual machine
For a VM with 100GB of dynamically allocated disk space:
space in GB | total | used | avail | vdi | ova |
---|---|---|---|---|---|
After Ubuntu installation | 89 | 3.7 | 76 | 4.5 | 1.7 |
After running Automatic setup (Tools Manager) | 89 | 6.9 | 72 | 8.4 | 4.2 |
After running the RNA-Seq sample data | 89 | | | | |
After running the DNA-Seq sample data | 89 | | | | |
After running snpEff on the DNA-Seq sample data | 89 | | | | |
Predefined Locations
- Demo data: testdata (parent directory is determined by the user)
- Reference genome from the automatic method: BRB_SeqTools_autosetup_reference_genome_files on v1.0 and RefGenProfiles on v1.2 (parent directory is determined by the user)
- Database files from somatic mutation annotator tool: ~/variantAnnoDatabase
$ find .
.
./.DS_Store
./cosmic
./cosmic/.DS_Store
./cosmic/GRCh37
./cosmic/GRCh37/CosmicCodingMuts.vcf.gz
./cosmic/GRCh37/CosmicCodingMuts.vcf.gz.tbi
./cosmic/GRCh38
./cosmic/GRCh38/CosmicCodingMuts.vcf
./cosmic/GRCh38/CosmicCodingMuts.vcf.gz
./cosmic/GRCh38/CosmicCodingMuts.vcf.gz.tbi
./dbNSFP
./dbNSFP/dbNSFP3.2a_hg19.txt.gz # 13 GB
./dbNSFP/dbNSFP3.2a_hg19.txt.gz.data_types
./dbNSFP/dbNSFP3.2a_hg19.txt.gz.tbi
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz # 13 GB
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz.data_types
./dbNSFP/dbNSFP3.2a_hg38_sorted.txt.gz.tbi
./dbsnp
./dbsnp/GRCh37
./dbsnp/GRCh37/common_all_20160601.vcf.gz
./dbsnp/GRCh37/common_all_20160601.vcf.gz.tbi
./dbsnp/GRCh37/log.txt
./dbsnp/GRCh38
./dbsnp/GRCh38/log.txt
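In the listing above, each compressed database file is accompanied by a .tbi index. A quick way to spot a file whose index is missing is sketched below; `missing_tbi` is a hypothetical helper name, and the sketch only checks that an index file exists, not that it is valid.

```shell
#!/bin/sh
# Sketch: report compressed database files that have no .tbi index next to them.
# missing_tbi is a hypothetical helper name.
missing_tbi() {
    find "$1" -type f -name '*.gz' | sort | while IFS= read -r f; do
        [ -f "$f.tbi" ] || echo "$f"
    done
}

# Default location of the annotation databases, as listed above.
db_dir="${HOME:-.}/variantAnnoDatabase"
if [ -d "$db_dir" ]; then
    missing_tbi "$db_dir"
fi
```

No output means every .gz file found has an index beside it.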
Reference sequence and annotation files from iGenomes
Note that the file size may change as the file is updated.
Genome | File | size (bytes) | md5sum |
---|---|---|---|
Ensembl GRCh37 | Homo_sapiens_Ensembl_GRCh37.tar.gz | 19971514224 | 3077c2b593615e418160ae878009b4b5 |
NCBI GRCh38 | Homo_sapiens_NCBI_GRCh38.tar.gz | 15848139211 | 61d263698f0283075f63b1514a16045d |
UCSC hg19 | Homo_sapiens_UCSC_hg19.tar.gz | 45468620403 | b64a925c6b7cc0391b3ee188ecec4e48 |
UCSC hg38 | Homo_sapiens_UCSC_hg38.tar.gz | 16006984068 | 97d98dcb25fb041be2f2139322a84cdb |
Note that the download speed was about 3MB/s at home and 7.8~11.4MB/s at the office (as reported by wget when the download finishes).
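The sizes and md5sums above can be used to confirm that a download completed intact. The sketch below shows one way to do it; `verify_download` is a hypothetical helper, and it assumes the md5sum tool is available (on macOS the corresponding tool is named md5).

```shell
#!/bin/sh
# Sketch: compare a downloaded file against the size and md5sum listed above.
# verify_download is a hypothetical helper; it assumes md5sum is installed.
verify_download() {
    file=$1
    want_size=$2
    want_md5=$3
    have_size=$(wc -c < "$file" | tr -d ' ')
    if [ "$have_size" != "$want_size" ]; then
        echo "$file: size $have_size, expected $want_size"
        return 1
    fi
    # Hash only after the cheap size check has passed.
    have_md5=$(md5sum < "$file" | cut -d ' ' -f 1)
    if [ "$have_md5" != "$want_md5" ]; then
        echo "$file: md5 $have_md5, expected $want_md5"
        return 1
    fi
    echo "$file: OK"
}
```

For example: `verify_download Homo_sapiens_UCSC_hg38.tar.gz 16006984068 97d98dcb25fb041be2f2139322a84cdb`. Because the files may be updated upstream, a mismatch does not necessarily mean the download is corrupt.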
Tips
Tutorial videos
https://www.youtube.com/playlist?list=PL6A4OqNJzh1l1CnCRdO_Q7o_-0K5CjrP4
Cosmic download failure
The command line to download the <CosmicCodingMuts.vcf.gz> file (~50 MB) on macOS is
$ /usr/local/bin/lftp sftp://EMAIL:[email protected]:22 \
    -e "set sftp:auto-confirm yes; get /cosmic/grch38/cosmic/v78/VCF/CosmicCodingMuts.vcf.gz; bye"
sftp:auto-confirm: no such variable. Use `set -a' to look at all variables.
get: /cosmic/grch38/cosmic/v78/VCF/CosmicCodingMuts.vcf.gz: Fatal error: Host key verification failed
To investigate the problem further, connect with sftp directly and inspect the stored host key:
$ sftp EMAIL:[email protected]:22
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for sftp-cancer.sanger.ac.uk has changed,
and the key for the corresponding IP address 193.62.203.28
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:OTiODuNLpRbXhKGYgMe8XWsa1jA3JM6x4k/NUeEnJEU.
Please contact your system administrator.
Add correct host key in /Users/USERNAME/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/USERNAME/.ssh/known_hosts:8
ECDSA host key for sftp-cancer.sanger.ac.uk has changed and you have requested strict checking.
Host key verification failed.
Connection closed
$ grep sanger ~/.ssh/known_hosts
sftp-cancer.sanger.ac.uk,193.62.203.24 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNlj3i1Fccf8QVjXW9iwzNKQ3p4RKl6RL+RdVsFHWlgsg6HvLo7Y0te7oXx2YR8+06vRyW0eePTcJRiFAIeNmcU=
The stored key no longer matches the one the server presents. Use ssh-keygen -R hostname to delete the offending host key from <known_hosts>. See here.
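The inspection and removal steps can be sketched as two small helpers. Both function names are hypothetical. The grep-based listing mirrors the grep command shown above and only works when <known_hosts> stores host names unhashed.

```shell
#!/bin/sh
# Sketch: inspect and remove a stale host key. Both function names are hypothetical.

# List matching entries with their line numbers (unhashed known_hosts files only).
list_host_entries() {
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    grep -n -F "$host" "$file"
}

# Remove every key stored for the host; ssh-keygen saves the previous
# contents of the file with a .old suffix.
forget_host() {
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    ssh-keygen -R "$host" -f "$file"
}
```

For the case above, `forget_host sftp-cancer.sanger.ac.uk` would remove the old entry. On the next connection, the new fingerprint should be checked against a trusted source before it is accepted.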