Genome: Difference between revisions

From 太極
= Visualization =
* [https://github.com/cmdcolin/awesome-genome-visualization Awesome genome visualization]
* [http://www.bioconductor.org/packages/release/BiocViews.html#___Visualization Bioconductor > BiocViews > Visualization]. Search 'genom' as the keyword.
 
== Ten simple rules ==
[https://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1010622 Ten simple rules for developing visualization tools in genomics]


== [http://software.broadinstitute.org/software/igv/ IGV] ==
The following shows three simulated DNA-Seq data sets: the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read, and the bottom has 8 SNPs per read.


[[File:Igv dna simul.png|200px]]
[[:File:Igv dna simul.png]]


=== Whole genome ===
[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=ERP002259 PRJEB1486]


[[File:Igv prjeb1486 wgs.png|200px]]
[[:File:Igv prjeb1486 wgs.png]]


=== Whole exome ===
* (Right) 1 of 3 whole exome data from [https://www.ncbi.nlm.nih.gov/sra?term=SRP066363 SRP066363], UCSC hg19.   


[[File:Igv gse48215.png|200px]] [[File:Igv srp066363.png|200px]]
[[:File:Igv gse48215.png]] [[:File:Igv srp066363.png]]


=== RNA-Seq ===
* (Right) [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46876 GSE46876], UCSC/hg19.  


[[File:Igv anders2013 rna.png|200px]] [[File:Igv gse46876 rna.png|200px]]
[[:File:Igv anders2013 rna.png]] [[:File:Igv gse46876 rna.png]]


=== Tell DNA or RNA ===
* DNA: whether it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
* RNA: the focus is on expression, so the coverage changes a lot. The bases are still shown as A, C, G, T (not A, C, G, U).
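
One rough way to put a number on "more even" coverage is the coefficient of variation (standard deviation divided by mean) of the per-base depth. The sketch below assumes three-column input (chromosome, position, depth), which is the layout printed by <code>samtools depth</code> for a single BAM file. The helper name <code>coverage_cv</code> and the toy depth values are made up for illustration.

```shell
# coverage_cv is a hypothetical helper: it reads "chrom pos depth" rows on stdin
# and prints the coefficient of variation (sd / mean) of the depth column.
coverage_cv() {
  awk '{ n++; s += $3; ss += $3 * $3 }
       END { if (n == 0 || s == 0) { print "NA"; exit }
             m = s / n; v = ss / n - m * m; if (v < 0) v = 0
             printf "%.2f\n", sqrt(v) / m }'
}

# Toy depth profiles (made-up numbers): an even, DNA-like profile ...
printf 'chr1\t%d\t30\n' 1 2 3 4 | coverage_cv
# ... and an uneven, RNA-like profile.
printf 'chr1\t1\t0\nchr1\t2\t100\nchr1\t3\t0\nchr1\t4\t100\n' | coverage_cv
```

On real data the function could be fed with something like <code>samtools depth sample.bam | coverage_cv</code>, where <code>sample.bam</code> is a placeholder file name. A value near 0 suggests even coverage; larger values suggest the peaked profile expected from expression data.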
=== ChromoMap ===
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04556-z ChromoMap]: an R package for interactive visualization of multi-omics data and annotation of chromosomes
== RNA-seq DRaMA ==
https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from [https://blog.rstudio.com/2020/07/13/winners-of-the-2nd-shiny-contest/ 2nd Annual Shiny Contest]


== [http://www.bioconductor.org/packages/release/bioc/html/Gviz.html Gviz] ==


== [http://www.bioconductor.org/packages/release/bioc/html/ggbio.html ggbio] ==
[https://support.bioconductor.org/p/9152800/ Wondering how to look at the reads of a gene in samples to check if it was knocked out?]


== [http://www.bioconductor.org/packages/release/bioc/html/NOISeq.html NOISeq] package ==
See fig on p22 of [http://www.bioconductor.org/packages/release/bioc/vignettes/Sushi/inst/doc/Sushi.pdf Sushi vignette] where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.


== [http://www.cbioportal.org/ cBioPortal] and TCGA ==
== [http://www.cbioportal.org/ cBioPortal], TCGA, PanCanAtlas ==
The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.
See [[Tcga|TCGA]].


https://github.com/cBioPortal/cbioportal
== TCPA ==
 
[https://www.tcpaportal.org/tcpa/download.html Download]. Level 4.
[https://www.biostars.org/p/219024/ Tutorial: retrieve full TCGA datasets from cBioportal with R]


== [http://qualimap.bioinfo.cipf.es/ Qualimap] ==
== [http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/ SeqMonk] ==
SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.
== dittoSeq ==
[https://rna-seqblog.com/dittoseq-universal-user-friendly-single-cell-and-bulk-rna-sequencing-visualization-toolkit-bioinformatics/ dittoSeq] – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics
== SeqCVIBE ==
[https://www.rna-seqblog.com/seqcvibe-interactive-analysis-exploration-and-visualization-of-rna-seq-data/ SeqCVIBE – interactive analysis, exploration, and visualization of RNA-Seq data]
== ggcoverage ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05438-2 ggcoverage: an R package to visualize and annotate genome coverage for various NGS data] 2023


= Copy Number =
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2565-8 DBS: a fast and informative segmentation algorithm for DNA copy number analysis]


= NGS =
== modSaRa2 ==
[[File:CentralDogmaMolecular.png|300px]]
[https://academic.oup.com/bioinformatics/article/35/17/2891/5288773 An accurate and powerful method for copy number variation detection]


== Introduction to Sequence Data Analysis ==
== Visualization ==
* [http://www.rna-seqblog.com/review-of-rna-seq-data-analysis-tools/ Review of RNA-Seq Data Analysis Tools]
[https://academic.oup.com/bioinformatics/article/37/8/1164/5895301 reconCNV: interactive visualization of copy number data from high-throughput sequencing] 2021
* [https://arxiv.org/abs/1804.06050 Modeling and analysis of RNA-seq data: a review from a statistical perspective] Li et al 2018
* [https://youtu.be/QNd5wkozSJ0 Introduction to RNA Sequencing] Malachi Griffith from the Genome Institute at Washington University
* [https://www.youtube.com/watch?v=hksQlJLwKqo&feature=youtu.be RNA-Seq workshop] from NYU
* [http://www.rna-seqblog.com/modern-rna-seq-differential-expression-analyses-transcript-level-or-gene-level/ Modern RNA-seq differential expression analyses: transcript-level or gene-level]
* http://www.pasteur.fr/~tekaia/BCGA2014/TALKS/Pabinger_Tools4VariantAnalysisOfNGS_EMBO2014.pdf (QC, duplicates, SAM, BAM, picard, variant calling, somatic variant, structural variants, CNV, variant filter, variant annotation, tools for vcf, visualization, pipeline). The source code is available on [https://github.com/tadKeys/BioinformaticsAndGenomesAnalyses2014 github].
* https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools
* http://www.rnaseq.wiki/ or https://github.com/griffithlab/rnaseq_tutorial/wiki from The Griffith Lab. It has a [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393 paper] too at PLOS.
* http://bioinformatics.ca/workshops/2015/high-throughput-biology-sequence-networks-2015#material
* http://www.cbcb.umd.edu/~hcorrada/CMSC858B/
* http://watson.nci.nih.gov/~sdavis/tutorials/biowulf-2011/
* http://cbsu.tc.cornell.edu/ngw2010/Day3_lecture1.pdf
* http://static.msi.umn.edu/tutorial/lifescience/RNA-Seq-Module-1.pdf
* http://www.rna-seqblog.com/introduction-to-rna-seq-analysis/ (focus on statistical part)
* [https://www.youtube.com/watch?v=Ojy82R86Bm4 BroadE: GATK/Mapping and processing RNAseq] from youtube.
* [http://www.rna-seqblog.com/researchers-at-kansas-state-university-suggest-reconsidering-your-standard-rna-seq-data-management-pipeline/ Low-count transcripts] in RNA-Seq analysis pipeline
* [http://tv.qiagenbioinformatics.com/video/14353472/rna-seq-analysis-of-human-breast-cancer-data RNA-seq analysis of human breast cancer data] from QIAGEN. Biomedical Genomics Workbench 3.0.1 is used.
* [https://www.youtube.com/playlist?list=PLjiXAZO27elABzLA0aHKS9chVA2TldoPF RNA-seq Chipster tutorials]
* [https://discoveringthegenome.org/ Discovering the Genome]


NIH only
= NGS =
* [https://helix.nih.gov/About_Us/training.html Some training material from NIH]:
[[:File:CentralDogmaMolecular.png]]
* [http://biowulf.nih.gov/biowulf-seminar-3feb2015.pdf Biowulf seminar] by Steven Fellini  & Susan Chacko.
* [https://helix.nih.gov/Documentation/Talks/BashScripting_LinuxCommands.pdf Bash Scripting and Linux commands]


=== RNA-seq: Basics, Applications and Protocol ===
See [[NGS|NGS]].
https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461


=== Why do you need to use cDNA for RNA-Seq? ===
== mNGS ==
https://www.biostars.org/p/54969/
[https://www.top1health.com/Article/93836 A new weapon for identifying pathogens: what is metagenomic next-generation sequencing (mNGS)?] (Chinese)


=== RNA-Seq vs DNA-Seq ===
= R and Bioconductor packages =
* With DNA, you'd be randomly sequencing the entire genome
== Resources ==
* DNA-Seq cannot be used for measuring gene expression.
* [https://cran.r-project.org/web/views/Omics.html CRAN Task View: Genomics, Proteomics, Metabolomics, Transcriptomics, and Other Omics]
* [http://rafalab.github.io/pages/harvardx.html HarvardX Biomedical Data Science Open Online Training], [https://hbctraining.github.io/main/ Bioinformatics Training at the Harvard Chan Bioinformatics Core]
* [https://liulab-dfci.github.io/bioinfo-combio/ Introduction to Bioinformatics and Computational Biology] Xiaole Shirley Liu (ebook & videos)
* [https://bioconductor.org/help/course-materials/2017/CSAMA/ CSAMA 2017: Statistical Data Analysis for Genome-Scale Biology]
* [https://nsaunders.wordpress.com/2015/04/28/some-basics-of-biomart/ Some basics of biomaRt] (and GenomicRanges)
* [http://master.bioconductor.org/help/workflows/annotation/AnnotatingRanges/ Annotating Ranges] Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
<pre>
library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
</pre>
* [http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf Analysis of RNA-Seq Data with R/Bioconductor] by Girke in UC Riverside
* [http://ivory.idyll.org/dibsi/index.html 2018 Data Intensive Biology Summer Institute at UC Davis]
* [http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2012/121029_HTS/martin_morgan1_nicolas_delhomme2_embo2012_rbioconductor.pdf R / Bioconductor for High-Throughput Sequence Analysis 2012] by Martin Morgan and Nicolas Delhomme
* [http://www-huber.embl.de/pub/pdf/nprot.2013.099.pdf Count-based differential expression analysis of RNA sequencing data using R and Bioconductor] by Simon Anders
* [http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2013/131021_HTS/genesandgenomes.pdf Sequences, Genomes, and Genes in R / Bioconductor] by Martin Morgan 2013.
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4509590/ Orchestrating high-throughput genomic analysis with Bioconductor] by Wolfgang Huber et al 2015.
* [https://rpubs.com/achitsaz/97976 RNA-seq Analysis Example] This is a script that will do differential gene expression (DGE) analysis for RNA-seq experiments using the bioconductor package edgeR. RPKMs were calculated for bar plots.
* [https://github.com/pgpmartin/BioC_For_NGS/blob/master/BioC_for_NGS_PMartin.pdf An introduction to R and Bioconductor for the analysis of high-throughput sequencing data] by Pascal MARTIN Oct 2018
* [https://morphoscape.wordpress.com/2022/07/28/bioinformatics-analysis-of-omics-data-with-the-shell-r/ Bioinformatics Analysis of Omics Data with the Shell & R]


=== Total RNA-Seq ===
=== Docker ===
* [https://www.gatc-biotech.com/en/expertise/transcriptomics/total-rna-seq.html Total RNA-Seq vs. mRNA sequencing] from gatc-biotech.com
[https://github.com/jhuanglab/bioinstaller Bioinstaller]: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on [https://cran.r-project.org/web/packages/BioInstaller/ CRAN].
* [https://www.illumina.com/techniques/sequencing/rna-sequencing/total-rna-seq.html A comprehensive picture of the transcriptome] from illumina.com
* https://en.wikipedia.org/wiki/RNA-Seq RNA-Seq is used to analyze the continuously changing cellular transcriptome.


=== How to know that your RNA-seq is stranded or not? ===
== Some workflows ==
https://www.biostars.org/p/98756/
=== RNA-Seq workflow ===  
Gene-level exploratory analysis and differential expression. A non-strand-specific, paired-end RNA-seq experiment was used for the tutorial.
<pre>
      STAR      Samtools        Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments      DESeq2
                                                          --------------------> se --------> dds
      GenomicFeatures        GenomicFeatures            /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|
</pre>
* https://bioconductor.org/packages/3.12/workflows/
* [http://www.bioconductor.org/help/workflows/rnaseqGene/ RNA-Seq workflow]
* [https://www.bioconductor.org/packages/release/workflows/html/RnaSeqGeneEdgeRQL.html RnaSeqGeneEdgeRQL]
* [http://www.sthda.com/english/wiki/rna-seq-differential-expression-work-flow-using-deseq2 RNA-Seq differential expression work flow using DESeq2]
* [https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/ mRNA Analysis Pipeline] from NCI GDC


=== Workshops ===
=== rnaseqGene ===
* [http://bioinformatics.ucdavis.edu/training/documentation/ UC Davis] Bioinformatics Core.
[http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html rnaseqGene] - RNA-seq workflow: gene-level exploratory analysis and differential expression
* [https://bioconductor.org/help/course-materials/2017/CSAMA/ CSAMA 2017]: Statistical Data Analysis for Genome-Scale Biology


=== Blogs ===
[http://master.bioconductor.org/packages/release/bioc/html/tximport.html tximport]
* [https://blog.genohub.com/2013/08/07/11-top-next-generation-sequencing-blogs/ 11 Top Next Generation Sequencing Blogs]
* [https://seandavi.github.io/talk/ Sean Davis]


== Automation ==
[https://codeocean.com/capsule/2296412/tree/v1 CodeOcean] - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). [https://codeocean.com/pricing Plan].
* [http://www.rna-seqblog.com/implementation-of-an-open-source-software-solution-for-laboratory-information-management-and-automated-rna-seq-data-analysis/ Implementation of an Open Source Software solution for Laboratory Information Management and automated RNA-seq data analysis]
* [http://www.rna-seqblog.com/trapline-a-standardized-and-automated-pipeline-for-rna-sequencing-data-analysis-evaluation-and-annotation/ TRAPLINE – a standardized and automated pipeline for RNA sequencing data analysis, evaluation and annotation]


== Quality control ==
=== [http://master.bioconductor.org/help/workflows/high-throughput-sequencing/ Sequence analysis] ===
For the base quality score, the quality value is Q(sanger) = -10 log10(prob), where prob is the probability that the call at base b is incorrect. Sanger ('''Phred quality scores''') values range from 0 to 93. In practice the maximum quality score is ~40. Quality values below 20 are typically considered low.
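
As a quick check of the Phred relationship, the sketch below inverts the formula: an error probability of 10^(-Q/10) corresponds to quality Q. The helper name <code>phred_to_prob</code> is hypothetical and only serves this illustration.

```shell
# phred_to_prob is a hypothetical helper: it converts a Phred quality score Q
# into the probability that the base call is wrong, 10^(-Q/10).
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_prob 20   # about 1 wrong call in 100
phred_to_prob 40   # about 1 wrong call in 10,000
```

This makes the "below 20 is low" rule of thumb concrete: Q20 already means roughly one miscalled base per hundred.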
<pre>
library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)
</pre>


=== [http://master.bioconductor.org/help/workflows/annotation/annotation/ Accessing Annotation Data] ===
Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
<source lang="rsplus">
library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")    # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
</source>

{| class="wikitable"
! Object type
! example package name
! contents
|-
| OrgDb
| org.Hs.eg.db
| gene based information for Homo sapiens
|-
| TxDb
| TxDb.Hsapiens.UCSC.hg19.knownGene
| transcriptome ranges for Homo sapiens
|-
| OrganismDb
| Homo.sapiens
| composite information for Homo sapiens
|-
| BSgenome
| BSgenome.Hsapiens.UCSC.hg19
| genome sequence for Homo sapiens
|-
|
| [http://cran.r-project.org/web/packages/refGenome/index.html refGenome]
|
|}

=== Phred quality score and scales ===
* http://blog.nextgenetics.net/?e=33
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding
* http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ contains a link to a Python script to guess the encoding format

The original Phred encoding, ASCII '''33-126''' (representing scores '''0-93'''), is also called the '''Sanger''' scale. Illumina has also used two other scales that shift the range up:

* '''Illumina 1.0''' format uses ASCII 59 - 126 representing scores '''-5 - 62'''.
* '''Illumina 1.3+''' format uses ASCII 64 - 126 representing scores '''0 - 62'''.
* '''Illumina 1.8''': the quality scores have basically returned to the Sanger format (Phred+33).
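
The Phred encodings described in this section differ only in the ASCII offset that is subtracted from each quality character: 33 for Sanger and Illumina 1.8, 64 for the older Illumina formats. A minimal sketch of the decoding step follows; the helper name <code>qual_char_to_score</code> is hypothetical.

```shell
# qual_char_to_score is a hypothetical helper: it decodes one FASTQ quality
# character into its score (ASCII code minus offset).
# The offset defaults to 33 (Sanger / Illumina 1.8); pass 64 for older Illumina files.
qual_char_to_score() {
  char=$1
  offset=${2:-33}
  code=$(printf '%d' "'$char")   # POSIX printf: a leading quote yields the character code
  echo $((code - offset))
}

qual_char_to_score 'I'       # 'I' is ASCII 73, so Phred+33 gives 40
qual_char_to_score 'h' 64    # 'h' is ASCII 104, so Phred+64 gives 40
```

The same character therefore means very different qualities under the two offsets, which is why guessing the encoding (for example from the lowest character seen in a file) matters before filtering on quality.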


=== FastQC (java based) ===
== RNA-Seq Data Analysis using R/Bioconductor ==
One problem with paired-end data is that a read kept in the trimmed fastq file for one end may no longer have its mate in the file for the other end. So consider Trimmomatic, which keeps paired reads in sync.
* https://github.com/datacarpentry/rnaseq-data-analysis by Stephen Turner.
* [https://support.bioconductor.org/p/69677/ Tutorial: Introduction to Bioconductor for high-throughput sequence analysis] by UseR 2015
* [http://bioconductor.org/packages/release/bioc/html/systemPipeR.html systemPipeR] Building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. An important feature is support for running command-line software, such as NGS aligners, on both single machines or compute clusters.
** http://girke.bioinformatics.ucr.edu/GEN242/mydoc_systemPipeVARseq_05.html
* [http://bioconductor.org/help/course-materials/2015/BioC2015/ BioC2015]


=== [http://qualimap.bioinfo.cipf.es/ Qualimap] (java based) ===
=== recount2 ===
Qualimap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.
* https://github.com/leekgroup/recount
* [https://jhubiostatistics.shinyapps.io/recount/ recount2] - A multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets.
* [https://github.com/LieberInstitute/recountWorkshop2020 Human RNA-seq data from recount2 and related packages] from [https://bioc2020.bioconductor.org/workshops BioC 2020 Workshop]
** recount2 works with DESeq2, edgeR and limma-voom


The first time we launch Qualimap we may get the message: The following R packages are missing: optparse, NOISeq, Repitools. Features dependent on these packages are disabled. See user manual for details. OK.
=== recount3 ===
* [https://lcolladotor-github-io.translate.goog/rnaseq_LCG-UNAM_2022/datos-de-rna-seq-a-trav%C3%A9s-de-recount3.html?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp Intro RNA-seq LCG-UNAM 2022] (Spanish)
* [https://support.bioconductor.org/p/9152538/ Recount3: PCR duplicates].
** [https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/biotechnology/a/polymerase-chain-reaction-pcr PCR duplication] can refer to two different things. It can mean the process of making many copies of a specific DNA region using a technique called Polymerase Chain Reaction (PCR). PCR relies on a thermostable DNA polymerase and requires DNA primers designed specifically for the DNA region of interest.
** On the other hand, [https://www.nature.com/articles/nmeth.4268 PCR duplication] can also refer to a problem that occurs when the same DNA fragment is amplified and sequenced multiple times, resulting in identical reads that can bias many types of high-throughput-sequencing experiments. These identical reads are called [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4933-1 PCR duplicates] and can be eliminated using various methods such as removing all but one read of identical sequences or using '''unique molecular identifiers (UMIs)''' to enable accurate counting and tracking of molecules.
** '''UMI''' stands for Unique Molecular Identifier. [https://dnatech.genomecenter.ucdavis.edu/faqs/what-are-umis-and-why-are-they-used-in-high-throughput-sequencing/ It is a complex index added to sequencing libraries before any PCR amplification steps, enabling the accurate bioinformatic identification of PCR duplicates]. UMIs are also known as '''Molecular Barcodes''' or '''Random Barcodes'''. UMIs are valuable tools for both quantitative sequencing applications and also for genomic variant detection, especially the detection of rare mutations. UMI sequence information in conjunction with alignment coordinates enables grouping of sequencing data into read families representing individual sample DNA or RNA fragments.
** [https://umi-tools.readthedocs.io/en/latest/reference/dedup.html '''dedup'''] - Deduplicate reads using UMI and mapping coordinates
** UMIs can be extracted from a fastq file using awk, for example '''awk 'NR % 4 == 1 {split($0,a,":"); print a[6]}' input.fastq > umis.txt'''. This assumes the read header is '''@SEQ_ID:LANE:TILE:X:Y:UMI''', so the UMI sequence is the 6th field, following the 5th colon.
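
The awk command above can be tried on a tiny hand-made file. The sketch below writes two made-up records whose headers follow the assumed <code>@SEQ_ID:LANE:TILE:X:Y:UMI</code> layout, extracts the UMIs, and tallies them. Real headers often carry extra fields, so the field index may need adjusting.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
tmpdir=$(mktemp -d)

# Two made-up FASTQ records with headers in the assumed @SEQ_ID:LANE:TILE:X:Y:UMI layout.
cat > "$tmpdir/input.fastq" <<'EOF'
@SEQ1:1:1101:1000:2000:ACGTAC
GATTACA
+
IIIIIII
@SEQ2:1:1101:1001:2001:TTGCAA
CATTAGA
+
IIIIIII
EOF

# Header lines are lines 1, 5, 9, ...; the UMI is the 6th colon-separated field.
awk 'NR % 4 == 1 {split($0, a, ":"); print a[6]}' "$tmpdir/input.fastq" > "$tmpdir/umis.txt"

# Tally how often each UMI occurs.
sort "$tmpdir/umis.txt" | uniq -c
```

Counting UMIs this way only summarizes the barcodes. Removing PCR duplicates also needs the mapping coordinates, which is what the '''dedup''' command linked above uses.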


=== [http://www.usadellab.org/cms/index.php?page=trimmomatic Trimmomatic] ===
== dbGap ==
https://hpc.nih.gov/apps/trimmomatic.html
[https://academic.oup.com/bioinformatics/article/36/4/1305/5556117 dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)]


=== [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ Trim Galore!] ===
== eQTL ==
https://hpc.nih.gov/apps/trimgalore.html
[https://www.coursera.org/lecture/statistical-genomics/combining-data-types-eqtl-6-04-NkorV Statistics for Genomic Data Science] (Coursera) and [https://cran.r-project.org/web/packages/MatrixEQTL/index.html MatrixEQTL] from CRAN


=== [http://www.biomedcentral.com/1471-2105/16/224 QoRTs] ===
== [https://bioconductor.org/packages/release/bioc/html/GenomicDataCommons.html GenomicDataCommons] package ==
A comprehensive toolset for quality control and data processing of RNA-Seq experiments.
* Genomic Data Commons
** https://gdc.cancer.gov/
** https://www.cancer.gov/about-nci/organization/ccg/research/computational-genomics/gdc
** [https://portal.gdc.cancer.gov/ Data Portal]. A list of [https://portal.gdc.cancer.gov/projects Projects]
* Use the GenomicDataCommons package to find and download variants from TCGA (NCI Genomic Data Commons Access) dataset and maftools package for analysis and visualization. See  https://seandavi.github.io/talk/2018/02/08/bioconductor-a-potential-hub-in-the-cancer-biomarker-data-ecosystem/
* https://seandavi.github.io/post/2018/03/extracting-clinical-information-using-the-genomicdatacommons-package/
* [https://docs.google.com/presentation/d/1bjnW67aemW90kFcq_S5rGorX96Xjrp9jt3tM4PpNuHI The Cancer Data Ecosystem: Data and cloud resources for cancer data science]
* [https://docs.google.com/presentation/d/1z0fgKWQrshn9rB2HJ4oiNd4V2tZUeomvre4p3usw6hc The GenomicDataCommons and GEOquery Bioconductor Packages]


=== [https://bioinf.shenwei.me/seqkit/ SeqKit] ===
Note:
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
# The TCGA data such as [https://portal.gdc.cancer.gov/projects/TCGA-LUAD TCGA-LUAD] are not part of clinical trials (described [https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD here]).
# Each patient has 4 categories of data, and the 'case_id' is common to them:
#* demographic: gender, race, year_of_birth, year_of_death
#* diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
#* exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
#* main: disease_type, primary_site
# The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.


=== FastqCleaner ===
== Visualization ==
[https://www.biorxiv.org/content/early/2018/08/16/393140 An interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files] and [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8 BMC Bioinformatics]
=== [https://bioconductor.org/packages/release/bioc/html/GenVisR.html GenVisR] ===


== Alignment and Indexing ==
=== [http://www.bioconductor.org/packages/devel/bioc/html/ComplexHeatmap.html ComplexHeatmap] ===
* https://en.wikipedia.org/wiki/Sequence_alignment
* https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
* [http://www.nature.com/nmeth/journal/vaop/ncurrent/pdf/nmeth.4106.pdf Simulation-based comprehensive benchmarking of RNA-seq aligners]
* [http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2722.html Systematic evaluation of spliced alignment programs for RNA-seq data]
* (Video) http://bioinformatics.ca//files/public/RNA-seq_2014_Montreal_Module2.mp4, http://www.rna-seqblog.com/rna-seq-alignment-and-visualization/
* (Video) https://youtu.be/n77EAk8C1es RNA-SEQ: Mapping to a Reference Genome
* RNA-Seq alignment methods emphasize gene counts, while DNA-Seq alignment methods (no junctions) emphasize variant detection. The read length in RNA-Seq alignment does not have to be long, but it has to be long in DNA-Seq since DNA-Seq cares about good coverage of the whole exome/genome.


=== SAM file format ===
== Read counts ==
[[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|SAM format]]
=== Read, fragment ===
* [https://biology.stackexchange.com/a/30150 Meaning of the "reads" keyword in terms of RNA-seq or next generation sequencing]. A read refers to the sequence of a cluster that is obtained after the end of the sequencing process which is ultimately the sequence of a section of a unique fragment.
* [https://www.biostars.org/p/106291/ What is the difference between a Read and a Fragment in RNA-seq?]. Diagram, paired-end, single-end.
* In the context of RNA-seq, "read" and "fragment" may refer to slightly different things, but they are related concepts.
** A '''read''' is a short sequence of nucleotides generated by a sequencing machine. Illumina reads are typically around 100-150 bases long. RNA-seq experiments generate millions or billions of reads, which are aligned to a reference genome or transcriptome to determine which genes or transcripts they came from. This information is used to quantify gene and transcript expression levels.
** A '''fragment''' is a piece of RNA (or of the cDNA made from it) produced during library preparation. The RNA is typically broken into smaller pieces by a process called fragmentation, and '''adapters''' are added to the ends of the fragments so that they can be sequenced. The resulting library of fragments is what gets loaded onto a next-generation sequencing platform.
** In summary, a fragment is the physical molecule in the library, whereas a read is the sequence the machine reports from one end of a fragment (or, on long-read platforms, from most or all of it).
* Does one fragment contain 1 read or multiple reads?
** It depends on the sequencing layout. With single-end sequencing each sequenced fragment yields one read; with paired-end sequencing each fragment yields two reads, one from each end.
** In library preparation, RNA is first fragmented into smaller pieces, then adapters are ligated to the ends of the fragments. The fragments are then amplified using PCR, generating multiple copies of the original fragment. If several copies of the same original fragment are sequenced, they appear as PCR duplicates, that is, identical reads that trace back to one original molecule.
** For example, in '''Illumina''' sequencing, adapter-ligated fragments are clonally amplified using bridge amplification. This creates '''clusters''' of identical copies of the original fragment on a sequencing flow cell. Each '''cluster''' is read as a single unit, so it produces one read (single-end) or one read pair (paired-end), not many reads.
** In other technologies like '''PacBio''' or '''Nanopore''', a molecule is sequenced as one long read. Because these platforms can read long stretches, the RNA or cDNA does not need to be fragmented before sequencing.
** In summary, '''the number of reads per sequenced fragment is one for single-end and two for paired-end sequencing.''' PCR duplicates can make the same original molecule show up more than once.
* How many reads per fragment on average in Illumina sequencing?
** One for single-end runs and two for paired-end runs. This does not depend on the sequencing depth or on the complexity of the sample.
** Bridge amplification produces many identical copies of a fragment within a cluster, but the copies are imaged together and give a single sequence per cluster and read direction.
** Sequencing depth refers to the total number of reads generated by the sequencing machine. A higher sequencing depth means that more fragments are sampled, so each gene or transcript receives more reads; it does not mean more reads per fragment.
** In summary, the number of reads per fragment in Illumina sequencing is fixed by the layout (single-end or paired-end), while the number of reads per gene is influenced by expression level, library preparation protocol, sequencing depth, and the complexity of the sample.
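
In a paired-end library each sequenced fragment contributes one read to each of the two FASTQ files, so the read and fragment counts can be checked directly. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines). The helper name <code>count_reads</code> and the toy records are made up; the file names <code>A_1.fastq</code> and <code>A_2.fastq</code> are used only as an example of a mate pair.

```shell
# count_reads is a hypothetical helper: a FASTQ record spans 4 lines,
# so the number of reads is the line count divided by 4.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Build a toy paired-end data set: two fragments, each with a read in both files.
tmpdir=$(mktemp -d)
for mate in 1 2; do
  printf '@read1/%s\nACGT\n+\nIIII\n@read2/%s\nTTGA\n+\nIIII\n' "$mate" "$mate" \
    > "$tmpdir/A_$mate.fastq"
done

r1=$(count_reads "$tmpdir/A_1.fastq")
r2=$(count_reads "$tmpdir/A_2.fastq")
echo "reads: $((r1 + r2)), fragments: $r1"
```

For this toy input the script reports 4 reads and 2 fragments, which is the distinction that matters when a tool reports counts "per fragment" rather than "per read".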


Here are some important fields
=== [http://bioconductor.org/packages/release/bioc/html/Rsubread.html Rsubread] ===
* See [https://support.bioconductor.org/p/65604/ this post] about the C version of the [http://bioinf.wehi.edu.au/featureCounts/ featureCounts] program.
* [https://www.biostars.org/p/96176/ featureCounts vs HTSeq-count]
* [http://bioinformatics.cvr.ac.uk/blog/featurecounts-or-htseq-count/ featureCounts or htseq-count?]
* [https://www.jianshu.com/p/9cc4e8657d62 featureCounts usage guide]
* [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944 GSE62944] TCGA data processed by Rahman M.


* Col 1: Read name
=== RSEM ===
* Col 2: FLAG [0,2^16-1]
<ul>
* Col 3: RNAME/Chrom
<li>[http://deweylab.github.io/RSEM/ RSEM], [https://deweylab.github.io/RSEM/rsem-calculate-expression.html  rsem-calculate-expression] </li>
* Col 4: Pos  [0,2^31-1]
<li>[https://hpc.nih.gov/apps/rsem.html RSEM on Biowulf]
* Col 5: MAPQ [0,2^8-1]
<syntaxhighlight lang='sh'>
* Col 6: CIGAR
$ mkdir SeqTestdata/RNASeqFibroblast/output
* Col 7: RNEXT (set to "=" if RNEXT is equal to RNAME)
$ sinteractive --cpus-per-task=2 --mem=10g
* Col 8: PNEXT - Mate position [0,2^31-1]
$ module load rsem bowtie STAR
* Col 9: Insert size [-2^31+1, 2^31-1]
$ rsem-calculate-expression -p 2 --paired-end --star \
* Col 10: SEQ
../test.SRR493366_1.fastq ../test.SRR493366_2.fastq \
* Col 11: QUAL
/fdb/rsem/ref_from_genome/hg19 Sample1 # 12 seconds
* Col 14 (not fixed): AS
$ ls -lthog
total 5.8M
-rw-r----- 1 1.6M Nov 24 13:39 Sample1.genes.results
-rw-r----- 1 2.5M Nov 24 13:39 Sample1.isoforms.results
-rw-r----- 1 1.6M Nov 24 13:39 Sample1.transcript.bam
drwxr-x--- 2 4.0K Nov 24 13:39 Sample1.stat


Note for paired-end data,
$ wc -l Sample1.genes.results
<pre>
26335 Sample1.genes.results
1. Fastq files
$ wc -l Sample1.isoforms.results
  A_1.fastq  A_2.fastq
51399 Sample1.isoforms.results
  read1      read1
  read2      read2
  ...         ...


2. SAM files (sorted by read name)
$ head -2 Sample1.genes.results
  read1
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
  read1
A1BG NM_130786 1766.00 1589.99 0.00 0.00 0.00
  read2
$ head -2 Sample1.isoforms.results
  read2
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
  ...
NM_130786 A1BG 1766 1589.99 0.00 0.00 0.00 0.00
</pre>
$ head -1 /fdb/rsem/ref_from_genome/hg19.transcripts.fa
 
>NM_130786
For example, consider the paired-end read ''SRR925751.192'', which was extracted from the so-called improperly paired reads in samtools. If we extract the pair by ''grep -w SRR925751.192 bt20/bwa.sam'' we will get
$ grep NM_130786 /fdb/igenomes/Homo_sapiens/UCSC/hg19/transcriptInfo.tab
NM_130786 2721192635 2721199328 2721188175 2 8 431185
</syntaxhighlight>
</li>
</ul>
* [http://cshprotocols.cshlp.org/content/2019/6/pdb.prot098368.full An RNA-Seq Protocol for Differential Expression Analysis] Owens 2019
* [https://www.cbioportal.org/ Cbioportal] -> [https://www.cbioportal.org/study/summary?id=brca_tcga_pan_can_atlas_2018 Breast Invasive Carcinoma (TCGA, PanCancer Atlas)] -> Explore selected study -> Original data -> [https://gdc.cancer.gov/about-data/publications/pancanatlas PanCanAtlas Publications] -> RNA (Final) - EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4002-1 Evaluation and comparison of computational tools for RNA-seq isoform quantification] 2017
* [https://www.rna-seqblog.com/the-concepts-of-mean-fragment-length-and-effective-length-in-rna-sequencing/ The Concepts of Mean Fragment Length and Effective Length in RNA Sequencing]
<ul>
<li>RSEM gene level result file (see [[#tximport|here for an example]]) contains 5 essential columns (and the element saved by '''tximport'''() function) excluding transcript_id
<ul>
<li>[https://groups.google.com/g/rsem-users/c/IaZmviqghJc Effective length] &rarr; length. This differs across samples and can be much shorter than the length (e.g. an effective length of 1 vs. a length of 105). </li>
<li>[http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html Expected count] &rarr; count. This is the sum, over all reads, of the posterior probability that each read comes from this transcript. </li>
<li>TPM &rarr; abundance. The sum of all transcripts' TPM is 1 million. </li>
<li>FPKM (not kept?). FPKM_i = 10^3 / l_bar * TPM_i for gene i. So for each sample FPKM is a scaling of TPM. </li>
</ul>
<pre>
<pre>
SRR925751.192  97  chr1  13205  60  101M = 13429  325 .....
R> dfpkm[1:5, 1:3] / txi.rsem$abundance[1:5, 1:3]
SRR925751.192 145  chr1  13429  60  101M = 13205 -325 .....
          144126_210-T_JKQFX5 144126_210-T_JKQFX6 144126_210-T_JKQFX8
5S_rRNA            0.7563603            1.118008          0.8485292
5_8S_rRNA                NaN                NaN                NaN
6M1-18                    NaN                NaN                NaN
7M1-2                    NaN                NaN                NaN
7SK                0.7563751            1.118029          0.8485281
</pre>
</pre>
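The relation FPKM_i = 10^3 / l_bar * TPM_i noted above can be checked numerically. The following is a minimal sketch, not RSEM's own code: the function name <code>tpm_and_fpkm</code> and the toy counts and effective lengths are assumptions made for illustration, and l_bar is taken to be the mean effective length weighted by transcript fraction (TPM / 10^6). It shows that, within one sample, FPKM / TPM is the same constant for every gene, which is why the ratios in the R output agree within a column but differ between columns.

```python
def tpm_and_fpkm(expected_counts, effective_lengths):
    """Return (tpm, fpkm) lists. Effective lengths must be positive."""
    # Per-base read rate of each gene: count divided by effective length
    rates = [c / l for c, l in zip(expected_counts, effective_lengths)]
    total_rate = sum(rates)
    total_count = sum(expected_counts)
    # TPM: rates rescaled to sum to one million
    tpm = [1e6 * r / total_rate for r in rates]
    # FPKM: fragments per kilobase of effective length per million fragments
    fpkm = [1e9 * c / (l * total_count)
            for c, l in zip(expected_counts, effective_lengths)]
    return tpm, fpkm


# Toy data (hypothetical): three genes in one sample
counts = [100.0, 200.0, 300.0]
eff_len = [1000.0, 500.0, 1500.0]

tpm, fpkm = tpm_and_fpkm(counts, eff_len)

# Mean effective length weighted by transcript fraction (TPM / 1e6)
l_bar = sum(t / 1e6 * l for t, l in zip(tpm, eff_len))

ratios = [f / t for f, t in zip(fpkm, tpm)]
print("FPKM / TPM per gene:", ratios)
print("10^3 / l_bar       :", 1e3 / l_bar)
```

Genes with zero TPM give 0/0 in the ratio, which matches the NaN rows in the R output.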
By entering the flag values 97 & 145 into the SAM flag entry on http://broadinstitute.github.io/picard/explain-flags.html, we see
<li>An example using [https://www.rdocumentation.org/packages/tximport/versions/1.0.3/topics/tximport tximport::tximport()] and [https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class DESeq2::DESeqDataSetFromTximport]. <span style="color: red">Note it directly uses round(expected_count) to get the integer-value counts.</span> See the source of DESeqDataSetFromTximport() [https://github.com/mikelove/DESeq2/blob/6aa81f9906e192eea1fb21cdc492b26ee4a472f1/R/AllClasses.R#L405 here]. The [https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html tximport vignette] has discussed two suggested ways of importing estimates for use with differential gene expression (DGE) methods in the section of "Downstream DGE in Bioconductor". The vignette does not say anything about "expected_count" from RSEM output.
* FLAG value 97 means read paired, mate reverse strand, first in pair.
* FLAG value 145 means read paired, read reverse strand, second in pair.
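The two reads have the positions shown below. As we can see, the '''insert size''' (the name used on the samtools help page and in IGV) spans the outer ends of the pair, from the leftmost mapped base of the forward read to the rightmost mapped base of the reverse read: 13529 - 13205 + 1 = 325.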
<pre>
<pre>
  13205      13305          13429    13529
txi.rsem <- tximport(files, type = "rsem", txIn = F, txOut = F)
    |------------->             <----------|
txi.rsem$length[txi.rsem$length == 0] <- 1
</pre>
names(txi.rsem) # a list,
                # length = effective_length (matrix)
                # counts = expected counts column (matrix), non-integer
                # abundance = TPM (matrix)
                # countsFromAbundance = "no"
# [1] "abundance"          "counts"              "length"              
# [4] "countsFromAbundance"


Consider a properly paired read pair, ''SRR925751.1''. If we extract the paired reads, we will get
sampleTable <- pheno[, c("EXPID", "PatientID")]
<pre>
rownames(sampleTable) <- colnames(txi.rsem$counts)
SRR925751.1  99  chr1  10010  0  90M11S  =  10016  95  .....
SRR925751.1  147  chr1  10016  0  12S89M  =  10010 -95 .....
</pre>
Again by entering the flag values 99 and 147 into the Decoding SAM flags website, we see
* FLAG value 99 means read paired, '''read mapped in proper pair''', mate reverse strand, first in pair.
* FLAG value 147 means read paired, '''read mapped in proper pair''', read reverse strand, second in pair.
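The flag values 97, 145, 99 and 147 discussed in this section can also be decoded without the website, because FLAG is a bit field. The sketch below is an illustration only: the function name <code>decode_flag</code> is an assumption, and the bit descriptions are worded to match the flag-explanation page.

```python
# SAM FLAG bits and their meanings
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for flag in (97, 145, 99, 147):
    print(flag, "=", ", ".join(decode_flag(flag)))
```

The only difference between 97/145 and 99/147 is bit 0x2 (read mapped in proper pair).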
<pre>
    10010        10099
      |------------>
    10016          10104
        <-------------|
</pre>
The insert size here is 10104-10010+1=95. Running the [https://www.biostars.org/p/16556/ following command],
<syntaxhighlight lang='bash'>
samtools view foo.bam | awk '{ if ($9 > 0) {S+=$9; T+=1}} END {print "Mean: " S/T}'
</syntaxhighlight>
I get 133.242 for the average insert size from properly paired reads on my data (GSE48215subset, read length=101).


If I run the same command on improperly paired reads, I get 1.9e07 for the average insert size.
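The insert sizes of the two example pairs above (325 and 95) can be reproduced from columns 4 (POS) and 6 (CIGAR) of the SAM records. This is a minimal sketch under stated assumptions: the helper names <code>reference_length</code>, <code>parse_sam_line</code> and <code>insert_size</code> are hypothetical, the records are truncated after column 9, and both mates are assumed to map to the same chromosome.

```python
import re


def reference_length(cigar):
    """Reference bases covered by an alignment (M, D, N, = and X consume the reference)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")


def parse_sam_line(line):
    """Split one SAM record into its first nine mandatory columns."""
    f = line.split()
    return {"qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
            "mapq": int(f[4]), "cigar": f[5], "rnext": f[6],
            "pnext": int(f[7]), "tlen": int(f[8])}


def insert_size(rec1, rec2):
    """Distance from the leftmost mapped base to the rightmost mapped base of a pair."""
    start = min(rec1["pos"], rec2["pos"])
    end = max(r["pos"] + reference_length(r["cigar"]) - 1 for r in (rec1, rec2))
    return end - start + 1


improper_pair = [
    parse_sam_line("SRR925751.192  97 chr1 13205 60 101M   = 13429  325"),
    parse_sam_line("SRR925751.192 145 chr1 13429 60 101M   = 13205 -325"),
]
proper_pair = [
    parse_sam_line("SRR925751.1  99 chr1 10010 0 90M11S = 10016  95"),
    parse_sam_line("SRR925751.1 147 chr1 10016 0 12S89M = 10010 -95"),
]

for pair in (improper_pair, proper_pair):
    print(pair[0]["qname"], "computed:", insert_size(*pair),
          "column 9:", pair[0]["tlen"])
```

Soft-clipped bases (S) do not consume the reference, which is why the second pair ends at 10104 rather than 10116.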
dds <- DESeq2::DESeqDataSetFromTximport(txi.rsem, sampleTable, ~ PatientID)
# using counts and average transcript lengths from tximport
#
# The DESeqDataSet class enforces non-negative integer values in the "counts"
#    matrix stored as the first element in the assay list.
dds@assays@data@listData$counts[1:5, 1:3] # integer values. How to compute?
                              # https://support.bioconductor.org/p/9134840/
dds@assays@data@listData$avgTxLength[1:5, 1:3] # effective_length


=== Bowtie2 (RNA/DNA) ===
plot(txi.rsem$counts[,1], dds@assays@data@listData$counts[,1])
Extremely fast, general purpose short read aligner.
abline(0, 1, col = 'red')      # compare expected counts vs integer-value counts
                              # a straight line


==== Create index files ====
ddsColl1 <- DESeq2::estimateSizeFactors(dds)
bowtie needs to have an '''index''' of the genome in order to perform its alignment functionality. For example, to build a bowtie index against UCSC hg19 (See [http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#indexing-a-reference-genome Getting started with Bowtie 2 -> Indexing a reference genome])
# using 'avgTxLength' from assays(dds), correcting for library size
<syntaxhighlight lang='bash'>
# Question: how does the function correct for library size?
bowtie2-build /data/ngs/public/sequences/hg19/genome.fa hg19
</syntaxhighlight>


Even though the index files can be downloaded directly without running Bowtie, the Bowtie program is still needed by Tophat, whose job is to align the RNA-seq data to the reference genome. 
ddsColl2 <- DESeq2::estimateDispersions(ddsColl1)
# gene-wise dispersion estimates
# mean-dispersion relationship
# final dispersion estimates
# Note: it seems estimateDispersions is not required
#      if we only want to get the normalized count (still need estimateSizeFactors())
# See ArrayTools/R/FilterAndNormalize.R


==== Mapping ====
cnts2 <- DESeq2::counts(ddsColl2, normalized = FALSE)
To run alignment,
all(dds@assays@data@listData$counts == cnts2)
<syntaxhighlight lang='bash'>
# [1] TRUE
bowtie2 -p 4 -x /data/ngs/public/sequences/hg19 XXX.fastq -S XXX.bt2.sam
</syntaxhighlight>
At the end of alignment, it will show how many (and percent) of reads are aligned 0 times, exactly 1 time, and >1 times.


We can convert the SAM file to BAM format ('-@' sets the number of threads; '-T', which is optional, gives the reference file)
all(round(txi.rsem$counts) == cnts2 )
<syntaxhighlight lang='bash'>
# [1] TRUE.    So in this case round(expected values) = integer-value counts
samtools view -b -@ 16 -T XXX.fa XXX.bt2.sam > XXX.bt2.bam
</pre>
</syntaxhighlight>
</li>
and view the bam file
</ul>
<syntaxhighlight lang='bash'>
* [https://informatics.fas.harvard.edu/rsem-example-on-odyssey.html RSEM example on Odyssey]
samtools view XXX.bt2.bam
* [https://github.com/bli25broad/RSEM_tutorial A Short Tutorial for RSEM]
</syntaxhighlight>
* [https://ycl6.gitbook.io/rna-seq-data-analysis/ Hands-on Training in RNA-Seq Data Analysis*] which includes Quantification using RSEM and Perform DE analysis. Note the expected count column was used in edgeR.
* [https://biowize.wordpress.com/2014/03/04/understanding-rsem-raw-read-counts-vs-expected-counts/ Understanding RSEM: raw read counts vs expected counts]. These '''“expected counts”''' can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323 RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome]. Abundance estimates are given in terms of two measures. The output file (XXX_RNASeq.RSEM.genes.results) contains 7 columns: '''gene_id, transcript_id(s), length,  effective_length, expected_count, TPM, FPKM'''.
** Expected counts: This is an estimate of the number of fragments that are derived from a given isoform or gene.  '''This count is generally a non-integer value''' and is the expectation of the number of alignable and unfiltered fragments that are derived from an isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR or DESeq.
** TPM: This is the estimated fraction of transcripts made up by a given isoform or gene. The transcript fraction measure is preferred over the popular RPKM/FPKM measures because '''it is independent of the mean expressed transcript length''' and is thus more comparable across samples and species.
<ul>
<li>The ''length'' or ''effective_length'' are different (though similar) for different samples for the same gene </li>
<li>A scatter plot and correlation shows the expected_count and TPM are different
<pre>
x <- read.delim("144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results")
colnames(x)
# [1] "gene_id"          "transcript_id.s." "length"          "effective_length"
# [5] "expected_count"  "TPM"              "FPKM"
plot(x[, "TPM"], x[, "expected_count"])
cor(x[, "TPM"], x[, "expected_count"])
# [1] 0.4902708
cor(x[, "TPM"], x[, "expected_count"], method = 'spearman')
# [1] 0.9886384
x[1:5, "length"]
# [1] 105.01 161.00 473.00  68.00 304.47


Note that it is faster to use pipe to directly output the file in a bam format (reducing files I/O) if the aligner does not provide an option to output a binary bam format.
x2 <- read.delim("144126_210-T_JKQFX6_v2.0.1.4.0_RNASeq.RSEM.genes.results")
x2[1:5, "length"]
# [1] 105.00 161.00 473.00  68.00 305.27
x[1:5, "effective_length"]
# [1]  1.03  16.65 293.88  0.00 129.00
x2[1:5, "effective_length"]
# [1]  1.58  17.82 293.05  0.00 128.86
</pre>
</li>
</ul>
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0940-1 A benchmark for RNA-seq quantification pipelines]. 2016. They compare the STAR, TopHat2, and Bowtie2 '''mapping methods''' and the Cufflinks, eXpress, Flux Capacitor, kallisto, RSEM, Sailfish, and Salmon '''quantification methods'''. '''RSEM slightly outperforms the rest.'''
* [https://github.com/qianhuiSenn/scRNA_cell_deconv_benchmark/blob/master/Scripts/Downsample/downsample_read_pancreas_code.txt Downsample reads] from '' Evaluation of Cell Type Annotation R Packages on Single Cell RNA-seq Data'' 2020.


=== BWA (DNA) ===
==== Expected_count ====
* BWA MEM Algorithm https://www.biostars.org/p/140502/
Number of reads mapping to that transcript
* https://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf By Heng Li
* http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf
* https://bioinformatics.cancer.gov/sites/default/files/course_material/Data%20Analysis%20for%20Exome%20Sequencing%20Data_0318.ppt


Used by whole-exome sequencing. For example, http://bib.oxfordjournals.org/content/15/2/256.full and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3865592/. [https://igor.sbgenomics.com/public/pipelines/534522f6d79f0049c0c9444e/ Whole Exome Analysis].
<ul>
==== Creating index file ====
<li>[https://biowize.wordpress.com/2014/03/04/understanding-rsem-raw-read-counts-vs-expected-counts/ Understanding RSEM: raw read counts vs expected counts] In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
Without creating index files, we will get an error ''[E::bwa_idx_load_from_disk] fail to locate the index files'' when we run the 'bwa mem' command (though the command only requires fa file in its arguments).
 
[http://gatkforums.broadinstitute.org/wdl/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk Prepare a reference for use with BWA and GATK]
<syntaxhighlight lang='bash'>
bwa index genome.fa          # generate 4 new files amb, ann, pac and bwt (not enough for running 'bwa mem')
bwa index -a bwtsw genome.fa # generate 5 new files amb, ann, pac, bwt and sa (right thing to do)
</syntaxhighlight>
where '''-a bwtsw''' specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.
 
It takes 3 hours to create the index files on the combined human+mouse genomes. Though there is no multithreading option in bwa index, we can use the '''-b''' option to increase the block size in order to speed it up. See https://github.com/lh3/bwa/issues/104. The default value is 10000000 (1.0e7). See the following output from ''bwa index'':
<pre>
<pre>
[bwa_index] Pack FASTA... 46.43 sec
R> x <- read.delim("41samples/165739~295-R~AM1I30~RNASEQ.genes.results")
[bwa_index] Construct BWT for the packed sequence...
R> summary(x$expected_count)    # Larger than TPM, which contradicts the above statement
[BWTIncCreate] textLength=11650920420, availableWord=831801828
  Min. 1st Qu. Median    Mean 3rd Qu.   Max.
[BWTIncConstructFromPacked] 10 iterations done. 99999988 characters processed.
      0      0      10    1346    556  533634
[BWTIncConstructFromPacked] 20 iterations done. 199999988 characters processed.
R> summary(x$TPM)
...
    Min. 1st Qu.   Median    Mean  3rd Qu.     Max.
[BWTIncConstructFromPacked] 1230 iterations done. 11647338276 characters processed.
    0.00    0.00    0.16    35.58    7.50 70091.93
[bwt_gen] Finished constructing BWT in 1232 iterations.
R> x[1:5, c("expected_count", "TPM")]
[bwa_index] 4769.18 seconds elapse.
  expected_count      TPM
[bwa_index] Update BWT... 45.78 sec
1        6190.00 70091.93
[bwa_index] Pack forward-only FASTA... 32.66 sec
2          0.00    0.00
[bwa_index] Construct SA from BWT and Occ... 2054.29 sec
3          0.00    0.00
[main] Version: 0.7.15-r1140
4          0.00    0.00
[main] CMD: bwa index -a bwtsw genomecb.fa
5        795.01  171.67
[main] Real time: 7027.396 sec; CPU: 6948.342 sec
</pre>
</pre>
</li>
<li>[https://www.bioinfo-scrounger.com/archives/482/ Alignment-based transcript quantification: RSEM]. The expected count is '''the sum, over all reads, of the posterior probability that each read comes from this transcript'''.
</ul>
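The definition of the expected count as a sum of posterior probabilities, and the remark that raw mapped-read counts count multireads several times, can be illustrated with a toy example. This is only a sketch: the posterior probabilities below are made up for illustration (RSEM estimates them with its EM algorithm), and the function names <code>expected_counts</code> and <code>raw_counts</code> are hypothetical.

```python
# One dict per read: posterior probability that the read came from each transcript.
# The values are invented for illustration; each read's probabilities sum to 1.
posteriors = [
    {"T1": 1.0},             # uniquely mapped read
    {"T1": 0.7, "T2": 0.3},  # multiread shared by T1 and T2
    {"T2": 0.5, "T3": 0.5},  # multiread shared by T2 and T3
    {"T3": 1.0},             # uniquely mapped read
]


def expected_counts(reads):
    """Sum the posterior probabilities over all reads, per transcript."""
    totals = {}
    for read in reads:
        for transcript, prob in read.items():
            totals[transcript] = totals.get(transcript, 0.0) + prob
    return totals


def raw_counts(reads):
    """Count every transcript a read maps to, so a multiread is counted more than once."""
    totals = {}
    for read in reads:
        for transcript in read:
            totals[transcript] = totals.get(transcript, 0) + 1
    return totals


expected = expected_counts(posteriors)
raw = raw_counts(posteriors)
print("expected counts:", expected, "total:", sum(expected.values()))
print("raw counts     :", raw, "total:", sum(raw.values()))
```

The expected counts are non-integer and add up to the number of reads, whereas the raw counts add up to more than the number of reads.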


Note that another index file '''fai''' is created by samtools
[https://support.bioconductor.org/p/90672/ Expected counts from RSEM in DESeq2?] Yes, RSEM expected counts can be used with DESeq2.
<syntaxhighlight lang='bash'>
:<syntaxhighlight lang='rsplus'>
samtools faidx genome.fa
# adding
txi$length[txi$length <= 0] <- 1
# before
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
</syntaxhighlight>
</syntaxhighlight>


For example, the BWAIndex folder contains the following files
==== Examples ====
<syntaxhighlight lang='bash'>
{{Pre}}
$ ls -l ~/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/
$ wc -l 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results
total 12
  28110 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results
lrwxrwxrwx 1 brb brb  22 Mar 10 03:40 genome.fa -> version0.6.0/genome.fa
lrwxrwxrwx 1 brb brb  26 Mar 10 03:37 genome.fa.amb -> version0.6.0/genome.fa.amb
lrwxrwxrwx 1 brb brb  26 Mar 10 03:40 genome.fa.ann -> version0.6.0/genome.fa.ann
lrwxrwxrwx 1 brb brb  26 Mar 10 03:40 genome.fa.bwt -> version0.6.0/genome.fa.bwt
-rw-r--r-- 1 brb brb  783 Apr 12 14:46 genome.fa.fai
lrwxrwxrwx 1 brb brb  26 Mar 10 03:40 genome.fa.pac -> version0.6.0/genome.fa.pac
lrwxrwxrwx 1 brb brb  25 Mar 10 03:37 genome.fa.sa -> version0.6.0/genome.fa.sa
drwxrwxr-x 2 brb brb 4096 Mar 15  2012 version0.5.x
drwxrwxr-x 2 brb brb 4096 Mar 15  2012 version0.6.0
</syntaxhighlight>


==== Mapping ====
$ head -n 4 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results | cut -f1,3,4,5,6,7
<syntaxhighlight lang='bash'>
bwa mem  # display the help
bwa mem -t 4 XXX.fa XXX.fastq > XXX.sam
more XXX.sam
samtools view -bT XXX.fa XXX.sam > XXX.bam # transform to bam format
samtools view XXX.bam | more
</syntaxhighlight>
and output directly to a binary format
<syntaxhighlight lang='bash'>
bwa mem -t 4 XXX.fa XXX.fastq | samtools view -bS - > XXX.bam  # transform to bam format directly
</syntaxhighlight>
where '-S' is ignored and is accepted only for compatibility with previous samtools versions. Previously this option was required if the input was in SAM format, but now the correct format is automatically detected by examining the first few characters of input. See http://www.htslib.org/doc/samtools.html.


And [https://github.com/ekg/alignment-and-variant-calling-tutorial this is a tutorial] to use bwa and freebayes to do variant calling by Freebayes's author.
gene_id   length  effective_length  expected_count TPM   FPKM
5S_rRNA   105.01  1.03             1513.66   31450.7 23788.06
5_8S_rRNA 161   16.65             0           0   0
6M1-18   473   293.88     0           0   0
</pre>


[https://wikis.utexas.edu/display/bioiteam/Variant+calling+tutorial#Variantcallingtutorial-CallingvariantsinreadsmappedbyBWAorBowtie2 This tutorial] uses whole-genome data [http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR030257 SRR030257] from [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP001369 SRP001369]
Second example
{{Pre}}
$ wc -l Sample_HS-578T_CB6CRANXX.genes.results
28110
$ head -1 Sample_HS-578T_CB6CRANXX.genes.results
gene_id transcript_id.s. length effective_length expected_count TPM FPKM
$ tail -4 Sample_HS-578T_CB6CRANXX.genes.results
septin 9/TNRC6C fusion uc010wto.1 41 0 0 0 0
svRNAa uc022bxg.1 22 0 0 0 0
tRNA Pro uc022bqx.1 65 0 0 0 0
unknown uc002afm.3 1117 922.85 0 0 0


[http://gatkforums.broadinstitute.org/wdl/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk BWA for whole genome and GATK]
$ wc -l Sample_HS-578T_CB6CRANXX.isoforms.results
78376
$ head -1 Sample_HS-578T_CB6CRANXX.isoforms.results
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
$ tail -4 Sample_HS-578T_CB6CRANXX.isoforms.results
uc010wto.1 septin 9/TNRC6C fusion 41 0 0 0 0 0
uc022bxg.1 svRNAa 22 0 0 0 0 0
uc022bqx.1 tRNA Pro 65 0 0 0 0 0
uc002afm.3 unknown 1117 922.85 0 0 0 0
</pre>


Best pipeline for human whole exome sequencing
== [http://www.bioconductor.org/packages/release/bioc/html/limma.html limma] ==
* https://www.biostars.org/p/1268/
* [http://nar.oxfordjournals.org/content/early/2015/01/20/nar.gkv007.long Differential expression analyses for RNA-sequencing and microarray studies]
* [http://bioinf.wehi.edu.au/RNAseqCaseStudy/ Case Study] using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). ''Here we illustrate how to use two Bioconductor packages - '''Rsubread''' and '''limma''' - to perform a complete RNA-seq analysis, including '''Subread''' read mapping, '''featureCounts''' read summarization, '''voom''' normalization and '''limma''' differential expression analysis.''
* Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See [https://support.bioconductor.org/p/47217/ this post].


==== Summary report of a BAM file: samtools flagstat ====
=== RSEM ===
Unlike Tophat and STAR, BWA mem does not provide a summary report of the percentage of mapped reads. To get the percentage of mapped reads, use the [https://www.biostars.org/p/14709/ samtools flagstat] command
* [https://support.bioconductor.org/p/84749/ RSEM expected counts to limma-voom]. Choice 1: '''tximport'''. Choice 2: If you are working with '''RSEM gene-level expected counts''', then you can just pass them to limma as if they were counts. That's what we do.
<syntaxhighlight lang='bash'>
* [https://stat.ethz.ch/pipermail/bioconductor/2013-November/056309.html Differential expression of RNA-seq data using limma and voom()]
$ export PATH=$PATH:/opt/SeqTools/bin/samtools-1.3


# fastq files from ExomeLungCancer/test.SRR2923335_*.fastq
=== Within-subject correlation ===
# 'bwa mem' and 'samtools fixmate' have been run to generate <accepted_hits.bam>
* [https://support.bioconductor.org/p/9152932/ Does this RNAseq experiment require a repeated measures approach?]
** Solution 1: 9.7 Multi-level Experiments of LIMMA user guide. '''duplicateCorrelation()''', '''lmFit()''', '''makeContrasts()''', '''contrasts.fit()''' and '''eBayes()'''
** Solution 2: Section 3.5 "Comparisons both between and within subjects" in edgeR. model.matrix(), glmQLFit(), glmQLFTest(), topTags(). 


$ wc -l ../test.SRR2923335_1.fastq  # 5000 reads in each of PAIRED end fastq files
=== Time Course Experiments ===
20000 ../test.SRR2923335_1.fastq
* See Limma's [https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf#page=48 vignette] on Section 9.6
** Few time points (ANOVA, contrast)
*** Which genes respond at either the 6 hour or 24 hour times in the wild-type?
*** Which genes respond (i.e., change over time) in the mutant?
*** Which genes respond differently over time in the mutant relative to the wild-type?
** Many time points (regression such as cubic spline,  moderated F-test)
*** Detect genes with different time trends for treatment vs control.


$ samtools flagstat accepted_hits.bam
== [https://bioconductor.org/packages/release/bioc/html/easyRNASeq.html easyRNASeq] ==
10006 + 0 in total (QC-passed reads + QC-failed reads)
Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
6 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
9971 + 0 mapped (99.65% : N/A)
10000 + 0 paired in sequencing
5000 + 0 read1
5000 + 0 read2
8502 + 0 properly paired (85.02% : N/A)
9934 + 0 with itself and mate mapped
31 + 0 singletons (0.31% : N/A)
1302 + 0 with mate mapped to a different chr
1231 + 0 with mate mapped to a different chr (mapQ>=5)
</syntaxhighlight>


See also https://wikis.utexas.edu/display/CoreNGSTools/Alignment#Alignment-Exercise#4:BWA-MEM-HumanmRNA-seq for alignment (including BWA mem and samtools flagstat).
== ShortRead ==
Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.


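* Properly mapped (inward oriented & same chromosome & good insert size): ''samtools view -f 3 foo.bam''. The original attempt, ''samtools view -f 3 -F 2 foo.bam'', both requires and excludes the proper-pair bit (0x2), so it returns no reads and cannot match flagstat.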
== [http://www.bioconductor.org/packages/release/bioc/html/Rsamtools.html Rsamtools] ==
* Itself and mate mapped: ''samtools view -f 1 -F 12 foo.bam''. But the result is not the same as flagstat??
The Rsamtools package provides an interface to BAM files.  
* Singleton: ''samtools view -F 4 -f 8 foo.bam''. The flags indicate that the current read is mapped but its mate isn't. On GSE48215subset data, the return number of reads does not match with flagstat result but ''samtools view -f 4 -F 8 foo.bam'' (current read is unmapped and mate is mapped) gives the same result as the flagstat command.
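How the ''-f'' and ''-F'' options in the commands above select reads can be mimicked on a handful of FLAG values. This is a sketch, not samtools code: the function names <code>passes</code> and <code>view</code> and the list of example flags are assumptions for illustration. The last call also excludes secondary (0x100) and supplementary (0x800) alignments; my understanding is that flagstat skips those records when it counts the pair categories, which would be one source of mismatches, but treat that as an assumption to verify against your samtools version.

```python
def passes(flag, require=0, exclude=0):
    """Mimic 'samtools view -f REQUIRE -F EXCLUDE' for one FLAG value.

    -f keeps a record only if ALL required bits are set;
    -F drops a record if ANY excluded bit is set.
    """
    return (flag & require) == require and (flag & exclude) == 0


def view(flags, require=0, exclude=0):
    """Return the FLAG values that would be kept by the filter."""
    return [f for f in flags if passes(f, require, exclude)]


# Hypothetical example flags:
#  99/147 proper pair, 97/145 paired but not proper,
#  73  = paired, mate unmapped, first in pair (a singleton),
#  133 = paired, read unmapped, second in pair (the singleton's mate),
#  355 = 99 + 256, a secondary alignment of a properly paired read
flags = [99, 147, 97, 145, 73, 133, 355]

print("-f 3 -F 2      :", view(flags, require=3, exclude=2))  # contradictory filter
print("-f 3           :", view(flags, require=3))
print("-f 1 -F 12     :", view(flags, require=1, exclude=12))
print("-F 4 -f 8      :", view(flags, require=8, exclude=4))
print("-f 4 -F 8      :", view(flags, require=4, exclude=8))
print("-f 1 -F 0x90c  :", view(flags, require=1, exclude=12 | 0x900))
```

Because bit 0x2 cannot be both required and excluded, the first filter keeps nothing regardless of the data.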


==== Stringent criterion ====
The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts
* https://sourceforge.net/p/bio-bwa/mailman/message/31968535/
* http://seqanswers.com/forums/showthread.php?p=186581


==== Applied to RNA-Seq data? ====
This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
https://www.biostars.org/p/130451/. ''BWA isn't going to handle spliced alignment terribly well, so it's not normally recommended for RNAseq datasets.''


=== [https://github.com/lh3/minimap2 Minimap2] ===
== [http://www.bioconductor.org/packages/release/bioc/html/IRanges.html IRanges] ==
* BWA replacement. Fast and almost identical mapping
IRanges is a fundamental package (see how many packages depend on it) to other packages like '''GenomicRanges''', '''GenomicFeatures''' and '''GenomicAlignments'''. The package defines the IRanges class.  
* https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty191/4994778
* https://arxiv.org/abs/1708.01492
* [https://www.overleaf.com/read/ddwtrgmngxms#/39316160/ Latex] source
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2592-5 IMOS: improved Meta-aligner and Minimap2 On Spark]


=== [http://ccb.jhu.edu/software/tophat/index.shtml Tophat] (RNA) ===
The '''plotRanges'''() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.
Aligns RNA-Seq reads to the genome using '''Bowtie'''/Discovers splice sites. It does so by splitting longer reads into small sections and aligning those to the genome. It then looks for potential splice sites between pairs of sections to construct a final alignment.  


Linux part.
If we want to make the same plot using the ggplot2 package, we can follow the example in [http://stackoverflow.com/questions/21506724/how-to-plot-overlapping-ranges-with-ggplot2 this post]. Note that disjointBins() returns a vector giving the bin number of each range, which is used as its position on the y-axis.
<pre>
$ type -a tophat # Find out which command the shell executes:
tophat is /home/mli/binary/tophat
$ ls -l ~/binary
</pre>


Quick test of Tophat program
=== flank ===
The example is obtained from ?IRanges::flank.
<pre>
<pre>
$ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz
ir3 <- IRanges(c(2,5,1), c(3,7,3))
$ tar xzvf test_data.tar.gz
# IRanges of length 3
$ cd ~/tophat_test_data/test_data
#    start end width
$ PATH=$PATH:/home/mli/bowtie-0.12.8
# [1]    2  3     2
$ export PATH
# [2]    5   7    3
$ ls
# [3]    1   3     3
reads_1.fq      test_ref.1.ebwt  test_ref.3.bt2  test_ref.rev.1.bt2  test_ref.rev.2.ebwt
reads_2.fq      test_ref.2.bt2   test_ref.4.bt2  test_ref.rev.1.ebwt
test_ref.1.bt2  test_ref.2.ebwt  test_ref.fa     test_ref.rev.2.bt2
$ tophat -r 20 test_ref reads_1.fq reads_2.fq
$ # This will generate a new folder <tophat_out>
$ ls tophat_out
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed  logs  prep_reads.info  unmapped.bam
</pre>


TopHat accepts FASTQ and FASTA files of sequencing reads as input. Alignments are reported in BAM files. BAM is the compressed, binary version of SAM, a flexible and general purpose read alignment format. SAM and BAM files are produced by most next-generation sequence alignment tools as output, and many downstream analysis tools accept SAM and BAM as input. There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps most popular among these are the SAM tools (http://samtools.sourceforge.net/) and the Picard tools (http://picard.sourceforge.net/).
flank(ir3, 2)
#    start end width
# [1]    0  1    2
# [2]    3  4    2
# [3]    -1  0    2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                    == ==


Note that if the data is DNA-Seq, we can merely use Bowtie2 or BWA tools since we don't have to worry about splicing.
flank(ir3, 2, start=FALSE)
#    start end width
# [1]    4  5    2
# [2]    8  9    2
# [3]    4  5    2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == ==


An example of using Tophat2 (paired end in this case, 5 threads) is
flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
<syntaxhighlight lang="bash">
#    start end width
tophat2  --no-coverage-search -p 5 \
# [1]    4  5     2
      -o "Sample1" \
# [2]    3  4    2
      -G ~/iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
# [3]    4  5    2
      --transcriptome-index=transcriptome_data/known \
# Combine the ideas of the previous 2 cases.
      ~/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \
      myfastq_R1.fastq.gz myfastq_R2.fastq.gz
</syntaxhighlight>


To find the alignment rate for all BAM files (e.g., we have ctrl1, ctrl2, test1 and test2 directories, each containing an align_summary.txt file),
flank(ir3, c(2, -2, 2))
<syntaxhighlight lang="bash">
#    start end width
grep "overall read mapping rate" */align_summary.txt
# [1]    0  1    2
</syntaxhighlight>
# [2]    5  6    2
# [3]    -1  0    2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                  == ==


==== Novel transcripts ====
flank(ir3, -2, start=F)
* [http://www.rna-seqblog.com/current-limitations-of-rna-seq-analysis-for-detection-of-novel-transcripts/ The identification and characterization of novel transcripts from RNA-seq data]
#    start end width
# [1]    2  3    2
# [2]    6  7    2
# [3]    2  3    2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                      == ==


==== How does TopHat find junctions? ====
flank(ir3, 2, both = TRUE)
https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/cufflinks-and-tophat
#    start end width
 
# [1]    0  3    4
==== Can Tophat Be Used For Mapping Dna-Seq ====
# [2]    3  6    4
https://www.biostars.org/p/63878/
# [3]    -1  2    4
* In contrast to DNA-sequence alignment, RNA-seq mapping algorithms have two additional challenges. First, because genes in eukaryotic genomes contain introns, and because reads sequenced from mature mRNA transcripts do not include these introns, any RNA-seq alignment program must be able to handle gapped (or spliced) alignment with very large gaps.
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
* TopHat is designed to map reads to a reference allowing splicing. In your case, the reads are not spliced because they are genomic, so don't waste your time and resources; use Bowtie/BWA directly.
# (From the manual) If both = TRUE, extends the flanking region width positions into the range.  
 
#        The resulting range thus straddles the end point, with width positions on either side.
=== [http://ccb.jhu.edu/software/hisat2/manual.shtml Hisat] (RNA/DNA) ===
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
* https://bioinformatics.ca/workshops/2017/informatics-rna-seq-analysis-2017
#                                            ==
* [https://hpc.nih.gov/apps/hisat.html Hisat on Biowulf]
#                                      == == == ==
* [https://www.biostars.org/p/177714/ The comparison between HISAT2 and Tophat2]
 
=== [https://code.google.com/p/rna-star/ STAR] (Spliced Transcripts Alignment to a Reference, RNA) ===
 
[http://sci-hub.cc/10.1007/978-1-4939-3572-7_13 Optimizing RNA-Seq Mapping with STAR] by  Alexander Dobin and Thomas R. Gingeras


Its manual is on [https://github.com/alexdobin/STAR github]. The [http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1114s51/full 2015 paper] includes scripts to run STAR.
flank(ir3, 2, start=FALSE, both=TRUE)
 
#    start end width
Note that the readme file says HARDWARE/SOFTWARE REQUIREMENTS:
# [1]     2  5    4
* x86-64 compatible processors
# [2]     6  9    4
* 64 bit Linux or Mac OS X
# [3]    2  5    4
* '''30GB of RAM for human genome'''. In fact, it requires '''34GB''' (tested on UCSC/hg38). See the following '''jobhist''' output from running indexing on Biowulf.
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
<pre>
#                                          ==
Submission Command : sbatch --cpus-per-task=2 --mem=64g --time=4:00:00 createStarIndex
#                                      == == == ==


Jobid        Partition      State  Nodes  CPUs      Walltime      Runtime        MemReq  MemUsed  Nodelist
44714895          norm  COMPLETED      1    2      04:00:00      02:54:21    64.0GB/node  33.6GB  cn3144
</pre>
</pre>


See the blog on [http://www.gettinggeneticsdone.com/2012/11/star-ultrafast-universal-rna-seq-aligner.html gettinggeneticsdone.com] for a comparison of speed and memory requirement.
Both IRanges and GenomicRanges packages provide the '''flank''' function.


In short, the notable increase in speed comes at the price of a larger memory requirement.
'''Flanking region''' is also a common term in high-throughput sequencing. The [http://www.broadinstitute.org/igv/book/export/html/6 IGV] user guide also has some options related to flanking.
* General tab: '''Feature flanking regions (base pairs)'''. IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
* Alignments tab: '''Splice junction track options'''. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.
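
The flank() results listed earlier reduce to simple arithmetic when the width is positive and both=FALSE: with start=TRUE the flank is [start - width, start - 1], and with start=FALSE it is [end + 1, end + width]. The shell sketch below only reproduces that arithmetic. The helper name flank_range is hypothetical (it is not part of IRanges), and negative widths and both=TRUE are not handled.

```bash
# Hypothetical helper: flank_range START END WIDTH [start|end]
# Mirrors flank(x, width, start=TRUE/FALSE, both=FALSE) for a positive width only.
flank_range() {
  s=$1; e=$2; w=$3; side=${4:-start}
  if [ "$side" = start ]; then
    # region of length WIDTH immediately before the range
    echo "$(( s - w )) $(( s - 1 ))"
  else
    # region of length WIDTH immediately after the range
    echo "$(( e + 1 )) $(( e + w ))"
  fi
}

flank_range 2 3 2        # [2,3], start=TRUE
flank_range 1 3 2        # [1,3], start=TRUE
flank_range 5 7 2 end    # [5,7], start=FALSE
```

The three calls correspond to rows of the flank(ir3, 2) and flank(ir3, 2, start=FALSE) outputs shown earlier.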


To build STAR on Ubuntu (14.04)
== [https://www.bioconductor.org/packages/release/bioc/html/Biostrings.html Biostrings] ==
<syntaxhighlight lang="bash">
* [http://www.r-exercises.com/2017/05/21/manipulate-biological-data-using-biostrings-package-part-1/ Manipulate Biological Data Using Biostrings Package Exercises (Part 1)]
wget https://github.com/alexdobin/STAR/archive/STAR_2.4.2a.tar.gz
* [http://www.r-exercises.com/2017/05/28/manipulate-biological-data-using-biostrings-package-exercises-part-2/ Manipulate Biological Data Using Biostrings Package Exercises (Part 2)] - it covers global, local & overlap alignments.
sudo tar -xzf STAR_2.4.2a.tar.gz -C /opt/RNA-Seq/bin


cd /opt/RNA-Seq/bin/STAR-STAR_2.4.2a
== [http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html GenomicRanges] ==
cd source
GenomicRanges depends on [http://www.bioconductor.org/packages/release/bioc/html/IRanges.html IRanges] package. See the dependency diagram below.
sudo make STAR
<pre>
</syntaxhighlight>
GenomicFeatures ------ GenomicRanges -+- IRanges -- BiocGenerics
 
                        |            +
==== Create index folder ====
                  +-----+            +- GenomeInfoDb
<syntaxhighlight lang='bash'>
                  |                      |
STAR --runMode genomeGenerate --runThreadN 11 \
GenomicAlignments +--- Rsamtools --+-----+
    --genomeDir STARindex \
                                    +--- Biostrings
    --genomeFastaFiles genome.fa \
    --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100
</syntaxhighlight>
where 100 = read length (one side) - 1 = 101 - 1. 
 
==== STARindex folder ====
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
brb@T3600 ~/SRP050992 $ ls ~/SRP012607/STARindex/
chrLength.txt      chrStart.txt      geneInfo.tab          SA           sjdbList.fromGTF.out.tab
chrNameLength.txt  exonGeTrInfo.tab  Genome                SAindex      sjdbList.out.tab
chrName.txt        exonInfo.tab      genomeParameters.txt sjdbInfo.txt  transcriptInfo.tab
brb@T3600 ~/SRP050992 $ cat ~/SRP012607/STARindex/genomeParameters.txt
### STAR  --runMode genomeGenerate  --runThreadN 11  --genomeDir STARindex  --genomeFastaFiles /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa      --sjdbGTFfile /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf  --sjdbOverhang 100
versionGenome  20201
genomeFastaFiles        /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
genomeSAindexNbases    14
genomeChrBinNbits      18
genomeSAsparseD 1
sjdbOverhang    100
sjdbFileChrStartEnd    -
sjdbGTFfile    /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf
sjdbGTFchrPrefix        -
sjdbGTFfeatureExon      exon
sjdbGTFtagExonParentTranscript  transcript_id
sjdbGTFtagExonParentGene        gene_id
sjdbInsertSave  Basic
</pre>
</pre>


==== Splice junction ====
The package defines some classes
* [https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf Manual]. Search 'sjdbOverhang' (length of the donor/acceptor sequence on each side of the junctions). The --sjdbOverhang is used only at the genome generation step, and tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. If you have 100b reads, the ideal value of --sjdbOverhang is 99, which allows the 100b read to map 99b on one side, 1b on the other side. One can think of --sjdbOverhang as the maximum possible overhang for your reads. See [https://www.biostars.org/p/93883/ A post].
* GRanges
* For the <SJ.out.tab> format see section 4.4 of the STAR manual. The first few lines of the file look like the following (strangely, the values in columns 7 and 8 are quite large)
* GRangesList
<pre>
* GAlignments
$ head SJ.out.tab
* SummarizedExperiment: it has the following slots - exptData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), exptData(), mcols(), ... The mcols() method is defined in the S4Vectors package.
chrM    711    764    1      1      1      2      1      44
chrM    717    1968    1      1      1      1      0      43
chrM    759    1189    0      0      1      11      3      30
chrM    795    830    2      2      1      335    344    30
chrM    822    874    1      1      1      1      1      17
chrM    831    917    2      2      1      733    81      38
chrM    831    3135    2      2      1      1      0      30
chrM    846    1126    0      0      1      4      0      50
chrM    876    922    1      1      1      26      9      31
chrM    962    1052    2      2      1      3      0      30
^      ^      ^      ^      ^      ^      ^      ^        ^
|      |      |      |      |      |      |      |        |
chrom  start  end    strand  intron annotated  |    # of      |
                      0=undef  motif          # of  multi mapped|
                      1=+                  uniquely mapped    max spliced alignment overhang 
                      2=-                      reads
</pre>
* [https://ccb.jhu.edu/software/tophat/manual.shtml Tophat]. Search for the overhang and donor/acceptor keywords. Check out <junctions.bed>. TopHat suggests that users enable this second option (--coverage-search) for short reads (< 45bp) and with a small number of reads (<= 10 million).
* [http://www.ccb.jhu.edu/software/hisat/manual.shtml HiSAT] comes with a python script that can extract splice sites from a GTF file. See [https://www.biostars.org/p/146700/ this post].
<syntaxhighlight lang='bash'>
$ cd /tmp
$ /opt/SeqTools/bin/hisat2-2.0.4/hisat2_extract_splice_sites.py \
  ~/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > splicesites.txt
$ head splicesites.txt
chr1    12226  12612  +
chr1    12720  13220  +
chr1    14828  14969  -
chr1    15037  15795  -
chr1    15946  16606  -
chr1    16764  16857  -
chr1    17054  17232  -
chr1    17367  17605  -
chr1    17741  17914  -
chr1    18060  18267  -
</syntaxhighlight>
* To extract spliced alignments from bam files using '''samtools''', see [https://www.biostars.org/p/12626/ this post] and [https://www.biostars.org/p/89581/ this post].
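
Since column 7 of SJ.out.tab holds the number of uniquely mapped reads crossing each junction, weakly supported junctions can be dropped with a one-line awk filter. The sketch below builds a toy file from a few of the rows shown above. The threshold min_unique=10 is an arbitrary value chosen for illustration, not a STAR default.

```bash
# Toy SJ.out.tab (tab-separated) using rows from the listing above.
sj_file=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chrM 711 764  1 1 1 2   1   44 \
  chrM 759 1189 0 0 1 11  3   30 \
  chrM 795 830  2 2 1 335 344 30 \
  chrM 831 3135 2 2 1 1   0   30 \
  chrM 876 922  1 1 1 26  9   31 > "$sj_file"

# Keep junctions supported by at least min_unique uniquely mapped reads (column 7).
min_unique=10   # hypothetical threshold
filtered=$(awk -F '\t' -v min="$min_unique" '$7 >= min' "$sj_file")
printf '%s\n' "$filtered"

n_kept=$(printf '%s\n' "$filtered" | wc -l | tr -d ' ')
echo "kept $n_kept of 5 junctions"
rm -f "$sj_file"
```

The same filter can be pointed at a real SJ.out.tab by replacing "$sj_file" with its path.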


==== Screen output ====
(As of Jan 6, 2015) The introduction of the GenomicRanges vignette mentions that a ''GAlignments'' object created from a 'BAM' file discards some information, such as the SEQ, QNAME, QUAL and MAPQ fields and any other information not needed for the operations described in that document. This means that multi-reads don't receive any special treatment. Also, paired-end reads are treated as single-end reads and the pairing information is lost. This might change in the future.
<pre>
Oct 07 19:25:19 ..... started STAR run
Oct 07 19:25:19 ... starting to generate Genome files
Oct 07 19:27:08 ... starting to sort Suffix Array. This may take a long time...
Oct 07 19:27:38 ... sorting Suffix Array chunks and saving them to disk...
Oct 07 20:07:07 ... loading chunks from disk, packing SA...
Oct 07 20:22:52 ... finished generating suffix array
Oct 07 20:22:52 ... generating Suffix Array index
Oct 07 20:26:32 ... completed Suffix Array index
Oct 07 20:26:32 ..... processing annotations GTF
Oct 07 20:26:38 ..... inserting junctions into the genome indices
Oct 07 20:30:15 ... writing Genome to disk ...
Oct 07 20:31:00 ... writing Suffix Array to disk ...
Oct 07 20:37:24 ... writing SAindex to disk
Oct 07 20:37:39 ..... finished successfully
</pre>


==== Log.final.out ====
== [http://www.bioconductor.org/packages/release/bioc/html/GenomicAlignments.html GenomicAlignments] ==
An example of the Log.final.out file
=== Counting reads with summarizeOverlaps vignette ===
<pre>
<pre>
$ cat Log.final.out
library(GenomicAlignments)
                                Started job on |      Jul 21 13:12:11
library(DESeq)
                            Started mapping on |      Jul 21 13:35:06
library(edgeR)
                                    Finished on |      Jul 21 13:57:42
      Mapping speed, Million of reads per hour |      248.01


                          Number of input reads |      93418927
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
                      Average input read length |      202
    recursive=TRUE, pattern="*bam$", full=TRUE)
                                    UNIQUE READS:
                  Uniquely mapped reads number |      61537064
                        Uniquely mapped reads % |      65.87%
                          Average mapped length |      192.62
                      Number of splices: Total |      15658546
            Number of splices: Annotated (sjdb) |      15590438
                      Number of splices: GT/AG |      15400163
                      Number of splices: GC/AG |      178602
                      Number of splices: AT/AC |      14658
              Number of splices: Non-canonical |      65123
                      Mismatch rate per base, % |      0.86%
                        Deletion rate per base |      0.02%
                        Deletion average length |      1.61
                        Insertion rate per base |      0.03%
                      Insertion average length |      1.40
                            MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |      4962849
            % of reads mapped to multiple loci |      5.31%
        Number of reads mapped to too many loci |      62555
            % of reads mapped to too many loci |      0.07%
                                  UNMAPPED READS:
      % of reads unmapped: too many mismatches |      0.00%
                % of reads unmapped: too short |      28.70%
                    % of reads unmapped: other |      0.05%
                                  CHIMERIC READS:
                      Number of chimeric reads |      0
                            % of chimeric reads |      0.00%
</pre>
Note that 65.87 (uniquely mapped) + 5.31 (mapped to multiple loci) + 0.07 (mapped to too many loci) + 0.00 (unmapped, too many mismatches) + 28.70 (unmapped, too short) + 0.05 (unmapped, other) = 100.
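
The sum can be checked with awk. This is only a sketch: it assumes the percentage lines keep the labels shown in the example above and that '|' separates the label from the value. The heredoc copies the relevant lines from that example, and a real Log.final.out can be substituted for "$log_file".

```bash
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
                   Uniquely mapped reads number |       61537064
                        Uniquely mapped reads % |       65.87%
             % of reads mapped to multiple loci |       5.31%
             % of reads mapped to too many loci |       0.07%
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       28.70%
                     % of reads unmapped: other |       0.05%
                            % of chimeric reads |       0.00%
EOF

# Add up the mapped and unmapped percentages; the chimeric line is not part of the sum.
total=$(awk -F '|' '
  /Uniquely mapped reads %|% of reads mapped to|% of reads unmapped/ {
    gsub(/[ \t%]/, "", $2)   # strip blanks and the percent sign
    sum += $2
  }
  END { printf "%.2f", sum }
' "$log_file")
echo "total: $total%"
rm -f "$log_file"
```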


By default, STAR only outputs reads that map to <=10 loci, others are considered "mapped to too many loci". You can increase this threshold by increasing --outFilterMultimapNmax. See [https://groups.google.com/forum/#!topic/rna-star/sgVFpzSVFyY here].
features <- GRanges(
 
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
=== [http://subread.sourceforge.net/ Subread] (RNA/DNA) ===
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000,
* http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900,
* [http://genomespot.blogspot.com/2015/03/hisat-vs-star-vs-tophat2-vs-olego-vs.html HISAT vs STAR vs TopHat2 vs Olego vs SubJunc] by Mark Ziemann
        300, 500, 500)), "-",
* Get a 'Segmentation fault' error on GSE46876 (SRR902884.fastq, single-end) rna-seq data
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
 
features
In practice,
# For DNA-Seq data, the '''subread''' aligner is used.
# For RNA-Seq data,  
#* the '''subread''' aligner is used when selecting gene counting analysis only.
#* the '''subjunc''' aligner is used when variant calling analysis is selected.
 
=== RNASequel ===
[http://nar.oxfordjournals.org/content/early/2015/06/16/nar.gkv594.full RNASequel: accurate and repeat tolerant realignment of RNA-seq reads]


=== [https://github.com/amplab/snap snap] ===
# GRanges object with 11 ranges and 1 metadata column:
Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
#      seqnames      ranges strand  |    group_id
#          <Rle>    <IRanges>  <Rle>  | <character>
#  [1]    chr2L [1000, 1499]      -  |          A
#  [2]    chr2L [3000, 3499]      -  |          A
#  [3]    chr2L [4000, 4499]      -  |          A
#  [4]    chr2L [7000, 7599]      -  |          A
#  [5]    chr2R [2000, 2899]      -  |          B
#  ...      ...          ...    ... ...        ...
#  [7]   chr2R [3600, 3899]      -  |          B
#  [8]    chr2R [4000, 4899]      -  |          B
#  [9]    chr2R [7500, 7799]      -  |          B
#  [10]    chr3L [5000, 5499]      -  |          C
#  [11]    chr3L [5400, 5899]      -  |          C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment
# dim: 11 2
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):


snap is the best choice so far. star/hisat2 are as fast but not as good for genomic reads. accurate mappers need to do alignment anyway
assays(olap)$counts
 
#      sm_treated1.bam sm_untreated1.bam
Keep in mind SNAP's index is 10-15x bigger than input. 4Gbp->50GB index. SNAP loads index into memory. HISAT2 index is similar size as input
#  [1,]              0                0
 
#  [2,]              0                0
=== Lessons ===
#  [3,]              0                0
* Tophat and STAR need index files. So if we want to run the alignment on multiple nodes, we should first run the alignment on a single sample. Then we can run the alignment on the other samples using multiple nodes.
#  [4,]              0                0
* The index files of Tophat depend only on the reference genome. However, the index files of STAR depend on both the reference genome and the read length of the data.
#  [5,]              5                1
#  [6,]              5                0
#  [7,]              2                0
#  [8,]            376              104
#  [9,]              0                0
# [10,]              0                0
# [11,]              0                0
</pre>


=== Alignment Algorithms ===
Pasilla data. Note that it is not clear where to find the bam files. According to the [https://support.bioconductor.org/p/50162/ message], we can download the SAM files first and then convert them to BAM files with samtools (not verified yet).
* [https://biology.stackexchange.com/questions/11263/what-is-the-difference-between-local-and-global-sequence-alignments What is the difference between local and global sequence alignments?]
* [https://en.wikipedia.org/wiki/Sequence_alignment#Global_and_local_alignments Sequence alignment]
* [https://www.cs.umd.edu/class/fall2011/cmsc858s/ CMSC 858s: Computational Genomics]
* [https://en.wikipedia.org/wiki/Substitution_matrix Substitution matrix]
 
=== Alignment free ===
* [https://www.biorxiv.org/content/early/2018/01/11/246967 Limitation of alignment-free tools in total RNA-seq quantification]
 
== SAMtools ==
The '''SAMtools''' bundle includes '''samtools''', '''bcftools''', '''bgzip''', '''tabix''', '''wgsim''' and '''htslib'''. samtools and bcftools are based on htslib.
 
* https://en.wikipedia.org/wiki/SAMtools
* http://www.htslib.org/doc/samtools.html
* http://www.htslib.org/doc/samtools-1.2.html (it is strange bam2fq appears in v1.2 but not in v1.3 documentation)
* [http://biobits.org/samtools_primer.html SAMtools: Primer / Tutorial] by Ethan Cerami. It covers installing samtools and bcftools, generating simulated reads, aligning reads, converting sam to bam, sorting & indexing, identifying genomic variants, understanding the vcf format, and visualizing reads on the command line.
* http://davetang.org/wiki/tiki-index.php?page=SAMTools contains many recipes, such as counting the number of mapped reads
 
<syntaxhighlight lang='bash'>
$ /opt/SeqTools/bin/samtools-1.3/samtools
 
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3 (using htslib 1.3)
 
Usage:  samtools <command> [options]
 
Commands:
  -- Indexing
    dict          create a sequence dictionary file
    faidx          index/extract FASTA
    index          index alignment
 
  -- Editing
    calmd          recalculate MD/NM tags and '=' bases
    fixmate        fix mate information
    reheader      replace BAM header
    rmdup          remove PCR duplicates
    targetcut      cut fosmid regions (for fosmid pool only)
    addreplacerg  adds or replaces RG tags
 
  -- File operations
    collate        shuffle and group alignments by name
    cat            concatenate BAMs
    merge          merge sorted alignments
    mpileup        multi-way pileup
    sort          sort alignment file
    split          splits a file by read group
    quickcheck    quickly check if SAM/BAM/CRAM file appears intact
    fastq          converts a BAM to a FASTQ
    fasta          converts a BAM to a FASTA
 
  -- Statistics
    bedcov        read depth per BED region
    depth          compute the depth
    flagstat      simple stats
    idxstats      BAM index stats
    phase          phase heterozygotes
    stats          generate stats (former bamcheck)
 
  -- Viewing
    flags          explain BAM flags
    tview          text alignment viewer
    view          SAM<->BAM<->CRAM conversion
    depad          convert padded BAM to unpadded BAM
</syntaxhighlight>
 
To compile the new version (v1.x) of samtools,
<syntaxhighlight lang='bash'>
git clone https://github.com/samtools/samtools.git
git clone https://github.com/samtools/htslib.git
cd samtools
make
</syntaxhighlight>
 
=== Multi-thread option ===
Only '''view''' and '''sort''' have a multi-thread ('''-@''') option.
 
=== samtools sort ===
A lot of temporary files will be created.
 
For example, if my output file is called <bwa_homo2_sort.bam>, 80 <bwa_homo2_sort.bam.tmp.XXXX.bam> files (each about 210 MB) are generated.
<pre>
<pre>
samtools sort -@ $SLURM_CPUS_PER_TASK -O bam -n bwa_homo2.sam -o bwa_homo2_sort.bam
samtools view -h -o outputFile.bam inputFile.sam
</pre>
</pre>


=== Decoding SAM flags ===
A modified R code that works is
http://broadinstitute.github.io/picard/explain-flags.html
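
The FLAG field is a bit mask, so it can also be decoded with shell arithmetic. The sketch below uses the bit meanings from the SAM specification (0x1 paired, 0x2 properly paired, 0x4 unmapped, and so on up to 0x800 supplementary). decode_flag is a hypothetical helper written for illustration. samtools 1.3 itself provides a '''flags''' subcommand for the same purpose.

```bash
# Hypothetical helper: print the names of the bits set in a SAM FLAG value.
decode_flag() {
  flag=$1
  out=""
  bit=1
  # Names listed in bit order: 0x1, 0x2, 0x4, ..., 0x800
  for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
              READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$(( bit * 2 ))
  done
  echo "${out:-NONE}"
}

decode_flag 99    # 99 = 1 + 2 + 32 + 64
decode_flag 4     # unmapped read
```

The same bit values are what the '''-f''' (all bits set) and '''-F''' (none of the bits set) options of samtools view test against.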
<pre>
###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
            "gtf/drosophila_melanogaster/",
            "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)


=== samtools flagstat: Know how many alignments a bam file contains ===
###################################################
<syntaxhighlight lang='bash'>
### code chunk number 12: gff_parse (eval = FALSE)
brb@brb-T3500:~/GSE11209$ /opt/RNA-Seq/bin/samtools-0.1.19/samtools flagstat dT_bio_s.bam
###################################################
1393561 + 0 in total (QC-passed reads + QC-failed reads)
idx <- mcols(gff0)$source == "protein_coding" &
0 + 0 duplicates
          mcols(gff0)$type == "exon" &
1393561 + 0 mapped (100.00%:-nan%)
          seqnames(gff0) == "4"
0 + 0 paired in sequencing
gff <- gff0[idx]
0 + 0 read1
## adjust seqnames to match Bam files
0 + 0 read2
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
0 + 0 properly paired (-nan%:-nan%)
chr4genes <- split(gff, mcols(gff)$gene_id)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
</syntaxhighlight>


We can see how many reads have multiple alignments by comparing the number of reads and the number of alignments output from '''samtools flagstat'''.
###################################################
<syntaxhighlight lang='bash'>
### code chunk number 12: gff_parse (eval = FALSE)
brb@brb-T3500:~/GSE11209$ wc -l SRR002062.fastq
###################################################
13778616 SRR002062.fastq
library(GenomicAlignments)
</syntaxhighlight>
 
Note: there is no multithread option in samtools flagstat.
 
=== samtools flagstat source code ===
* https://www.biostars.org/p/12475/
* https://github.com/samtools/samtools/blob/develop/bam_stat.c


=== samtools view ===
# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
See also [[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|Anders2013-sam/bam]] and http://genome.sph.umich.edu/wiki/SAM
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
 
    recursive=TRUE, pattern="*bam$", full=TRUE)
Note that we can use both '''-f''' and '''-F''' flags together.
path <- system.file("extdata", package="pasillaBamSubset")
<pre>
bamlst <- BamFileList(fls)
$ /opt/SeqTools/bin/samtools-1.3/samtools view -h
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts


Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)


Options:
expdata = MIAME(
  -b      output BAM
              name="pasilla knockdown",
  -C      output CRAM (requires -T)
              lab="Genetics and Developmental Biology, University of  
  -1      use fast BAM compression (implies -b)
                  Connecticut Health Center",
  -u      uncompressed BAM output (implies -b)
              contact="Dr. Brenton Graveley",
  -h      include header in SAM output
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi
  -H      print SAM header only (no alignments)
                  knockdown RNA-Seq Studies",
  -c      print only the count of matching records
              pubMedIds="20921232",
  -o FILE  output file name [stdout]
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
  -U FILE  output reads not selected by filters to FILE [null]
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
  -t FILE  FILE listing reference names and lengths (see long help) [null]
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs
  -L FILE  only include reads overlapping this BED FILE [null]
                  encoding pasilla, a mRNA binding protein and 4 biological replicates
  -r STR  only include reads in read group STR [null]
                  of the the untreated cell line.")
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT  only include reads with mapping quality >= INT [0]
  -l STR  only include reads in library STR [null]
  -m INT  only include reads with number of CIGAR operations consuming
          query sequence >= INT [0]
  -f INT  only include reads with all bits set in INT set in FLAG [0]
  -F INT  only include reads with none of the bits set in INT set in FLAG [0]
  -x STR  read tag to strip (repeatable) [null]
  -B      collapse the backward CIGAR operation
  -s FLOAT integer part sets seed of random number generator [0];
          rest sets fraction of templates to subsample [no subsampling]
  -@, --threads INT
          number of BAM/CRAM compression threads [0]
  -?      print long help, including note about region specification
  -S      ignored (input format is auto-detected)
      --input-fmt-option OPT[=VAL]
              Specify a single input file format option in the form
              of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
              Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
              Specify a single output file format option in the form
              of OPTION or OPTION=VALUE
  -T, --reference FILE
              Reference sequence FASTA FILE [null]
</pre>


'''View bam files''':
design <- data.frame(
<syntaxhighlight lang='bash'>
              condition=c("untreated", "untreated"),
samtools view XXX.bam    # No header
              replicate=c(1,1),
samtools view -@ 16 -h XXX.bam # include header in SAM output
              type=rep("single-read", 2), stringsAsFactors=TRUE)
samtools view -@ 16 -H xxx.bam # See the header only. The header may contain the reference genome fasta information.
library(DESeq)
</syntaxhighlight>
geneCDS <- newCountDataSet(
Note that according to [http://samtools.github.io/hts-specs/SAMv1.pdf SAM pdf], the header is optional.
                  countData=assay(genehits),
 
                  conditions=design)
'''Sam and bam files conversion''':
<syntaxhighlight lang='bash'>
/opt/RNA-Seq/bin/samtools-0.1.19/samtools view --help


# sam -> bam (need reference genome file)
experimentData(geneCDS) <- expdata
samtools view -@ 16 -b -T XXX.fa XXX.bt2.sam > XXX.bt2.bam
sampleNames(geneCDS) = colnames(genehits)


# bam -> sam
###################################################
samtools view -@ 16 -h -o out.sam in.bam
### code chunk number 16: pasilla_genes (eval = FALSE)
</syntaxhighlight>
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design)
experimentData(txCDS) <- expdata
</pre>


'''Subset''':
We can also check out ?summarizeOverlaps to find some fake examples.
<syntaxhighlight lang='bash'>
# assuming bam file is sorted & indexed
samtools view -@ 16 XXX.bam "chr22:24000000-25000000" | more
</syntaxhighlight>


=== Remove unpaired reads ===
== tidybulk ==
According to the samtools manual, the 7th column (RNEXT) holds the reference (chromosome) name of the mate/next read. The field is set to '*' when the information is unavailable and to '=' when RNEXT is identical to RNAME. 
* [https://bioconductor.org/packages/release/bioc/html/tidybulk.html tidybulk]: an R tidy framework for modular transcriptomic data analysis, [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7 paper]
* [https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html A Tidy Transcriptomics introduction to RNA-Seq analyses]


When we use the '''samtools view''' command to manually remove certain reads from a '''paired-end''' sam/bam file, the bam file will have a [[#SAM_validation_error|problem]] when it is used in '''picard MarkDuplicates'''. A solution is to run the following line to remove these unpaired reads (so we don't need to set the '''VALIDATION_STRINGENCY''' parameter)
== Bisque ==
<syntaxhighlight lang='bash'>
[https://github.com/cozygene/bisque?s=09 Bisque]: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.
samtools view -h accepted_hits_bwa.bam | awk -F "\t" '$7!="*" { print $0 }' > accepted_hits_bwa2.sam
</syntaxhighlight>
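
To see what the awk filter keeps, it can be run on a small hand-made SAM file. The sketch below uses made-up records (read names, positions and sequences are purely illustrative). Header lines have fewer than seven fields, so their $7 is empty and they pass the filter. Alignment records whose RNEXT is '*' are dropped.

```bash
sam_file=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t99\tchr1\t10\t60\t5M\t=\t50\t45\tACGTA\tIIIII\n'
  printf 'r1\t147\tchr1\t50\t60\t5M\t=\t10\t-45\tACGTA\tIIIII\n'
  printf 'r2\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n'
} > "$sam_file"

# Same condition as the command above: keep lines whose 7th field is not "*".
kept=$(awk -F '\t' '$7 != "*"' "$sam_file")
printf '%s\n' "$kept"

n_header=$(printf '%s\n' "$kept" | grep -c '^@')
n_reads=$(printf '%s\n' "$kept" | grep -vc '^@')
echo "header lines: $n_header, alignment records: $n_reads"
rm -f "$sam_file"
```

Both header lines and the two r1 records survive, while r2 (RNEXT '*') is removed.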


=== '''less''' bam files ===
== [https://cran.r-project.org/web/packages/chromoMap/ chromoMap] ==
http://martinghunt.github.io/2016/01/25/less-foo.bam.html


=== Convert SAM to BAM ===
== Inference ==
<syntaxhighlight lang='bash'>
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz089/5307752 How well do RNA-Seq differential gene expression tools perform in a complex eukaryote? A case study in A. thaliana] Froussios, Bioinformatics 2019
samtools view -b input.sam  >  output.bam
# OR
samtools view -b input.sam -o output.bam
</syntaxhighlight>
There is no need to add "-S". The header will be kept in the bam file.


=== Convert BAM to SAM ===
=== tximport ===
<syntaxhighlight lang='bash'>
* [http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor Vignette]. Search "offset".
samtools view -h input.bam -o output.sam
* [https://f1000research.com/articles/4-1521 Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences] by Soneson, et al. [https://f1000research.com/gateways/bioconductor F1000Research]. Open peer review.
</syntaxhighlight>
** [https://f1000researchdata.s3.amazonaws.com/datasets/7563/e22fa771-ee2b-486a-8cf4-31b645163d6b_gse64570_quantification.html#definition-of-gene-to-transcript-mapping GSE64570]
where '-h' is to ensure the converted SAM file contains the header information. Generally, it is useful to store only sorted BAM files as they use much less disk space and are faster to process.
** [https://f1000researchdata.s3.amazonaws.com/datasets/7563/080b3afa-c02e-407a-b08b-3a1e84aff2db_gse69244_quantification.html GSE69244]
** [https://f1000researchdata.s3.amazonaws.com/datasets/7563/3ca97d58-f9bb-48e4-9e76-85d92f21f40a_GSE72165_quantification.html GSE72165]
<ul>
<li>http://bioconductor.org/packages/release/bioc/html/tximport.html
<pre>
$ head -5 quant.sf
Name Length EffectiveLength TPM NumReads
ENST00000456328.2 1657 1410.79 0.083908 2.46885
ENST00000450305.2 632 410.165 0 0
ENST00000488147.1 1351 1035.94 10.4174 225.073
ENST00000619216.1 68 24 0 0
ENST00000473358.1 712 453.766 0 0
</pre>
</li>
<li>Another real data from [https://pdmdb.cancer.gov/web/apex/f?p=101:34:::NO:34:: PDMR -> PDX]. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
{{Pre}}
$ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results
gene_id transcript_id.s. length effective_length  expected_count  TPM   FPKM
5S_rRNA uc021ofx.1 105.04 2.23           2039.99   49629.87 35353.97


=== Convert BAM to FASTQ ===
$ R
* http://www.metagenomics.wiki/tools/samtools/converting-bam-to-fastq
x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
* http://seqanswers.com/forums/showthread.php?t=7061
dim(x)
# [1] 28109    7
names(x)
# [1] "gene_id"          "transcript_id.s." "length"          "effective_length"
# [5] "expected_count"  "TPM"              "FPKM"           
x[1:3, -2]
#    gene_id length effective_length expected_count      TPM    FPKM
# 1  5S_rRNA 105.04            2.23        2039.99 49629.87 35353.97
# 2 5_8S_rRNA 161.00            21.19          0.00    0.00    0.00
# 3    6M1-18 473.00          302.74          0.00    0.00    0.00


<syntaxhighlight lang='bash'>
y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
# Method 1: samtools
dim(y)
samtools bam2fq SAMPLE.bam > SAMPLE.fastq
# [1] 78375    8
cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_r1.fastq
names(y)
cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_r2.fastq
# [1] "transcript_id"    "gene_id"          "length"          "effective_length"
# [5] "expected_count"  "TPM"              "FPKM"            "IsoPct"
y[1:3, -1]
#  gene_id length effective_length expected_count TPM FPKM IsoPct
# 1 5S_rRNA    110            3.06              0  0    0      0
# 2 5S_rRNA    133            9.08              0  0    0      0
# 3 5S_rRNA    92            0.00              0  0    0      0
</pre>
</li>
</ul>


# Method 2: picard
[[:File:RSEM PDX.png]]
java -Xmx2g -jar Picard/SamToFastq.jar I=SAMPLE.bam F=SAMPLE_r1.fastq F2=SAMPLE_r2.fastq


# Method 3: bam2fastx
=== DESeq2 or edgeR ===
# http://manpages.ubuntu.com/manpages/quantal/man1/bam2fastx.1.html
* [http://genomebiology.com/2014/15/12/550#sec4 DESeq2 method]
* [https://biocorecrg.github.io/PHINDaccess_RNAseq_2020/differential_expression.html DESeq2 steps]
** DESeq2 uses the median of ratio method for normalization: briefly, the counts are divided by sample-specific size factors.
** Four steps: 1. The '''geometric mean''' of each gene is calculated across all samples (a vector whose length is the number of genes). 2. '''Ratios''': the counts for a gene in each sample are divided by this mean. 3. The '''median of these (gene) ratios''' within a sample is the size factor for that sample (a vector whose length is the number of samples). 4. The raw counts are divided by these size factors, sample by sample.
** [https://github.com/oxwang/fda_scRNA-seq/blob/master/3_Normalization/Code/HCC1395/10X_LLU.R#L148 Manual implementation]. Why are the results (size factor values) not the same as what counts(dds, normalized = TRUE) returns? Answer: see the help page of counts()
** [https://rdrr.io/bioc/DESeq2/man/counts.html counts()]. '''normalization factors''' is used by default  (''normalization factors'' always preempt ''size factors'')
** [https://rdrr.io/bioc/DESeq2/man/sizeFactors.html sizeFactors()] - sample by sample (a vector)
** [https://rdrr.io/bioc/DESeq2/man/normalizationFactors.html normalizationFactors()] - '''Gene-specific''' normalization factors for each sample (a matrix)
* DESeq2 with a large number of samples -> use DESeq2 to normalize the data and then do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test. See [https://support.bioconductor.org/p/60432/ this post]. Or consider the limma-voom method instead, which will handle 1000 samples in a few seconds without the need for extra memory.
* edgeR normalization factor [https://support.bioconductor.org/p/65683/ post]. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the [http://genomebiology.com/2010/11/3/r25 paper by Robinson & Oshlack 2010] for more details. Briefly, M-values are defined as the library size-adjusted log-ratio of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE.
* edgeR [http://f1000research.com/articles/5-1438/v1 From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline]
* [https://support.bioconductor.org/p/65890/ Can I feed TCGA normalized count data to EdgeR?]
* [https://support.bioconductor.org/p/66067/ counts() function and normalized counts].
* [https://support.bioconductor.org/p/74572/ Why use Negative binomial distribution in RNA-Seq data?] and the [http://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lect/L05-deseq2-anders.pdf Presentation] by Simon Anders.
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1803-9 XBSeq2] – a fast and accurate quantification of differential expression and differential polyadenylation. By using simulated datasets, they demonstrated that, overall, XBSeq2 performs equally well as XBSeq in terms of several statistical metrics and both perform better than DESeq2 and edgeR.
* [https://www.biorxiv.org/content/early/2018/08/27/399931 DEBrowser: Interactive Differential Expression Analysis and Visualization Tool for Count Data] Alper Kucukural 2018
* [https://genomespot.blogspot.com/2020/10/edger-or-deseq2-comparing-performance.html?s=09 EdgeR or DESeq2? Comparing the performance of differential expression tools]
* StatQuest:
** [https://youtu.be/Wdt6jdi-NQo edgeR, part 1, Library Normalization] (TMM),
** [https://youtu.be/UFB993xufUU DESeq2, part 1, Library Normalization]
** [https://youtu.be/Gi0JdrxRq5s edgeR and DESeq2, part 2 - Independent Filtering]
* [https://bioinformatics.chat/deseq2?s=09 Differential gene expression and DESeq2 with Michael Love (#60)]
* [https://morphoscape.wordpress.com/2022/08/09/downstream-bioinformatics-analysis-of-omics-data-with-edger/ Downstream Bioinformatics Analysis of Omics Data with edgeR]


# Method 4: Bedtools - bamtofastq
==== Shrinkage estimators ====
# http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html
<ul>
# Single fastq
<li> The package uses a negative binomial statistical model to fit the count data, and can account for differences in sequencing depth across samples.
bedtools bamtofastq -i input.bam -fq output.fastq
* Shrinkage is a technique used to '''regularize the estimates of the parameters''' of the negative binomial model.
#  BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ
* The idea behind shrinkage is to '''pull the estimated values of the parameters towards a prior distribution''', which can help to '''reduce the variability of the estimates''' and '''improve the stability of the results'''.
samtools sort -n input.bam -o input_sorted.bam  # sort by read name (-n)
* The specific shrinkage method used in DESeq2 is called the "shrinkage prior for dispersion" method. This method involves '''adding a prior distribution on the dispersion parameter''' of the negative binomial model, which is used to control the degree of overdispersion in the data.
bedtools bamtofastq -i input_sorted.bam -fq output_r1.fastq -fq2 output_r2.fastq
* This prior distribution is designed to '''shrink the estimated values of the dispersion parameter''' towards a common value across all the genes in the dataset, which can help to '''reduce the variance of the estimated log-fold changes'''.
<li>More details:
* The negative binomial model is used to model the count data, y, as a function of the '''mean''', <math>\mu</math>, and the '''dispersion''', <math>\alpha</math>. The probability mass function of the negative binomial is given by:
: <math>
\begin{align}
P(y \mid \mu, \alpha) = \frac{\Gamma(y + \alpha^{-1})}{y! \, \Gamma(\alpha^{-1})} \left( \frac{\mu}{\mu + \alpha^{-1}} \right)^{y} \left( \frac{\alpha^{-1}}{\mu + \alpha^{-1}} \right)^{\alpha^{-1}}
\end{align}
</math>
* The likelihood of the data is given by:
: <math>
\begin{align}
L(\mu, \alpha | y) = \prod_i [ P(y_i | \mu_i, \alpha) ]
\end{align}
</math>
* The log-likelihood is :
: <math>
\begin{align}
\log L(\mu, \alpha | y) = \sum_i [ \log(P(y_i | \mu_i, \alpha)) ]
\end{align}
</math>
: In this model, <math>\mu</math> is the mean of the negative binomial for each gene and it is modeled as a linear function of the design matrix.
: <math>
\begin{align}
\mu_i = \exp(X_i \beta)
\end{align}
</math>
: <math>\alpha</math> is the dispersion parameter. It is written here without a gene index for simplicity; DESeq2 estimates one dispersion per gene.
* '''The shrinkage prior is added on <math>\alpha</math>''', it assumes that <math>\alpha</math> is following a hyper-prior distribution like Gamma distribution
: <math>
\begin{align}
\alpha \sim \Gamma(a_0, b_0)
\end{align}
</math>
: This prior allows the shrinkage of <math>\alpha</math> estimates from the data towards a common value across all the genes, which can help to reduce the variance of the estimated log-fold changes.
* The goal is to find the values of <math>\mu</math> and <math>\alpha</math> that maximize the log-likelihood. This is done either by maximum likelihood estimation (MLE) or by a Bayesian approach, in which the prior is combined with the likelihood and the result is a posterior distribution.
<li>Dispersion parameter.
* In the context of the negative binomial model used in DESeq2, the dispersion parameter, alpha, is a measure of the degree of overdispersion in the data. In other words, it represents the variability of the data around the mean beyond what the mean alone predicts: the variance is <math>\mu + \alpha \mu^2</math>. Any value of alpha greater than 0 indicates that the data is more dispersed (more variable) than would be expected under a Poisson distribution, which is a common distribution used to model count data and is recovered as alpha approaches 0. The Poisson distribution has a single parameter, the mean, which determines both the location and the scale of the distribution. In contrast, the negative binomial distribution has two parameters, the mean and the dispersion, which allows for more flexibility in fitting the data.
* The shrinkage method in DESeq2 involves shrinking the estimated values of the dispersion parameter towards a common value across all the genes in the dataset, which can help to reduce the '''variance''' of the '''estimated log-fold changes'''.


# Method 5: Bamtools
<li>What is the formula of the fold change estimator given by DESeq2?
# http://github.com/pezmaster31/bamtools
* The fold change estimator given by the DESeq2 package is calculated as the ratio of the estimated mean expression levels in two conditions, with the log2 of this ratio being the log2 fold change. The mean expression levels are calculated using a negative binomial model, which accounts for both the mean and the overdispersion of the data.
</syntaxhighlight>
* The estimated mean expression level for a gene i in condition j is given by:
: <math>
\begin{align}
\log(\mu_{i,j}) = \beta_j + X_i \gamma
\end{align}
</math>
: where <math>\beta_j</math> is the overall mean for condition <math>j</math>, <math>X_i</math> is the design matrix for gene <math>i</math> and <math>\gamma_i</math> is the gene-specific effect.
* The log2 fold change is calculated as:
: <math>
\begin{align}
\log_2(\mu_{i,j} / \mu_{i,k})
\end{align}
</math>
: where j and k are the two conditions being compared.
* So, you can see that the fold change estimator depends on the design matrix <math>X</math> and the parameters of the model, <math>\beta</math> and <math>\gamma</math>. The DESeq2 implementation also includes an estimation of the variance-covariance matrix of the parameters to compute the standard deviation (uncertainty) of these parameters, and therefore the standard deviation on the fold-change estimator. This can help to estimate the significance of the fold change between conditions.
* Hypothesis testing H0: log(FC)=0 using Wald test


Consider the ExomeLungCancer example
<li>What is the variance of the estimated log-fold changes before and after applying DESeq2 method?
<syntaxhighlight lang='bash'>
* In RNA-seq data analysis, the log-fold change is a measure of the relative difference in expression between two conditions. The log-fold change is calculated as the log2 of the ratio of the mean expression in one condition to the mean expression in another condition.
# ExomeLungCancer/test.SRR2923335_*.fastq
* Without using any methods like DESeq2, the variance of the estimated log-fold changes can be high, particularly for genes with low expression levels, which can lead to unreliable results. This high variance is due to the over-dispersion present in RNA-seq data, which results in a large variability in the estimated expression levels even for genes with similar means.
* When using the DESeq2 package, the shrinkage method is applied on dispersion parameter alpha, which helps to reduce the variance of the estimated log-fold changes. By applying a prior on alpha and by shrinking the estimates of alpha towards a common value across all the genes, the method reduces the variability of the estimates. This results in more stable and reliable estimates of the log-fold changes, which can improve the accuracy and robustness of the results of the differential expression analysis.
* Additionally, the DESeq2 package also accounts for differences in sequencing depth across samples, which can also help to reduce the variability of the estimated log-fold changes.


$ # Method 1
<li>How the DESeq2 package accounts for differences in sequencing depth across samples
$ samtools bam2fq bwa.sam > SAMPLE.fastq
* The DESeq2 package accounts for differences in sequencing depth across samples by using the raw count data to estimate the normalized expression levels for each gene in each sample. This normalization process is necessary because sequencing depth can vary widely between samples, leading to differences in the overall number of reads and the apparent expression levels of the genes.
$ cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_r1.fastq
* Normalization itself is done with sample-specific size factors (the median of ratios method described above). For exploratory analysis the package also offers the regularized-logarithm (rlog) transformation, a variance stabilization method that is based on the logarithm of the counts and also adjusts for the library size and the mean expression level.
$ cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_r2.fastq
* The method starts by computing a weighted mean of the counts across all samples, which is used as a reference. Next, for each sample, the counts are divided by the library size and then multiplied by the weighted mean of the counts. This scaling step corrects for differences in sequencing depth by making the library sizes comparable between samples.
$ head -n 4 SAMPLE_r1.fastq
* Then, the regularized logarithm (rlog) transformation is applied to the scaled counts, which is given by :
@SRR2923335.1/1
: <math>
NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
\begin{align}
+
vst = \log(counts/sizeFactor + c)
#1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
\end{align}
$ head -n 4 output_r2.fastq
</math>
@SRR2923335.1/2
: where c is a small positive constant added to the counts to stabilize the variance, size_factor is the ratio of library size for each sample over the weighted mean of the library size.
AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
* The rlog transformation can stabilize the variance of the data and make the mean expression levels more comparable between samples. This transformed data can then be used for downstream analysis like calculating the fold changes.
+
* In addition to rlog transformation the DESeq2 package uses a negative binomial distribution to model the count data, this distribution helps to account for over-dispersion in the data, and shrinkage method on the dispersion parameter is applied as well to improve the stability of results. All of these techniques work together to help correct for sequencing depth differences across samples, which can improve the accuracy of the estimated fold changes and provide more robust results in differential gene expression analysis.
CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>


$ # Method 4
<li>[https://www.biostars.org/p/448959/#484944 type='apeglm' shrinkage only for use with 'coef']
$ samtools sort -n accepted_hits.bam -o accepted_sortbyname.bam
</ul>
$ /opt/SeqTools/bin/bedtools2/bin/bedtools bamtofastq -i accepted_sortbyname.bam -fq output_r1.fastq -fq2 output_r2.fastq


# Compare the first read
==== Time course experiment ====
# Original fastq file
* [http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#time-course-experiments RNA-seq workflow: gene-level exploratory analysis and differential expression] using DESeq2
$ head -n 4 ../test.SRR2923335_1.fastq
** Genes with small p values from this test are those which at one or more time points after time 0 showed a strain-specific effect. Tested by '''LRT'''.
@SRR2923335.1 1 length=100
** '''Wald tests''' for the log2 fold changes at individual time points can be investigated using the test argument
NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
+SRR2923335.1 1 length=100
#1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
$ head -n 4 ../test.SRR2923335_2.fastq
@SRR2923335.1 1 length=100
AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
+SRR2923335.1 1 length=100
CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>


$ ##################################################################################################
* [https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf Time course trend analysis] from the edgeR's vignette. '''glmQLFTest()'''
$ head -n 4 output_r1.fastq
** Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an '''ANOVA''' F-test for a normal linear model.
@SRR2923335.1/1
** Assuming gene expression changes smoothly over time, we can use a '''polynomial''' or a '''cubic spline curve''' with a certain number of degrees of freedom to model gene expression along time.
NGCAAGTAGAGGGGCGCACACCTACCCGCAGTTGTTCACAGCCATGAGGTCGCCAACTAGCAGGAAGGGGTAGGTCAGCATGCTCACTGCAATCTGAAAC
** We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The ''topTags'' function lists the top set of genes with most significant time effects.
+
** The total number of genes with significant (5% FDR) changes at different time points can be examined with ''decideTests''.
#1=DDFDFHHHHHJIJJJJIJJJJJJJJIJJJJJIJIHHHHHFFFFFEECEDDDDDDDDDDDDDDDDDDD>BDDACCDDDDDDDDDDDDDDCCDDEDDDC
* [https://support.bioconductor.org/p/9150999/ RNA-seq data collected at different time points]. Identify differentially expressed genes associated with seasonal changes
$ head -n 4 output_r2.fastq
@SRR2923335.1/2
AACCCTGGTAATGGCTGGAGGCAGNNCTTGGTACAGNGTGTNNGCGTGGTGTGTCNNTGCTNNCTGGGCCGGGGTGGGTCACTGGCACTCAGGCCTCTCT
+
CCCFFFFFFHHGHJJJJJEHJIJI##11CCG?DFGI#0)88##0/77CF=@FDDG##--;?##,,5=BABBDDD)9@D59ADCDDBDDDDDDDCDDDDC>
</syntaxhighlight>
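The <code>grep -A 3</code> split used in Method 1 relies on GNU grep (<code>--no-group-separator</code>) and would also match a quality line that happens to start with <code>@</code> and end with <code>/1</code>. A minimal awk sketch that inspects only the first line of each 4-line record is given below. The two toy reads and the helper names <code>split_mate</code> and <code>toy_fastq</code> are invented for illustration, and the sketch assumes every FASTQ record is exactly four lines.

```shell
# Print the 4-line FASTQ records whose header line ends with the given suffix.
# Usage: split_mate /1 < interleaved.fastq > r1.fastq
split_mate() {
  awk -v suffix="$1" '
    NR % 4 == 1 {
      # Header line of a record: keep the record if the name ends with the suffix.
      keep = (substr($0, length($0) - length(suffix) + 1) == suffix)
    }
    keep { print }
  '
}

# Toy interleaved FASTQ (hypothetical data). The quality string of the second
# read deliberately looks like a "/1" header line.
toy_fastq() {
  printf '@read1/1\nACGT\n+\nFFFF\n'
  printf '@read1/2\nTTGAC\n+\n@FF/1\n'
}

toy_fastq | split_mate /1
```

With this input the <code>/1</code> output contains only the first record; the misleading quality line of the second read is ignored because it is not in a header position.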


=== Using CRAM files ===
==== DESeq2 experimental design and interpretation ====
* CRAM files are more dense than BAM files. CRAM files are smaller than BAM by taking advantage of an additional external "reference sequence" file.
[https://rstudio-pubs-static.s3.amazonaws.com/329027_593046fb6d7a427da6b2c538caf601e1.html DESeq2 experimental design and interpretation]
* http://www.htslib.org/workflow/#mapping_to_cram
* [https://genome.ucsc.edu/goldenPath/help/cram.html CRAM] format
* The CRAM format was used (to replace the BAM format) in the 1000 Genomes Project


=== Extract single chromosome ===
==== Controlling for batch differences ====
https://www.biostars.org/p/46327/
The variable we are interested in ("condition") is placed after the batch variable.
<syntaxhighlight lang='bash'>
<pre>
samtools sort accepted_hits.bam -o accepted_hits_sorted.bam
dds <- DESeqDataSetFromMatrix(countData = cts,
samtools index accepted_hits_sorted.bam
                              colData = coldata,
samtools view -h accepted_hits_sorted.bam chr22 > accepted_hits_sub.sam
                              design= ~ batch + condition)
</syntaxhighlight>
dds <- DESeq(dds)
</pre>
OR
<pre>
dds <- DESeq(dds, test="LRT", reduced=~batch)
res <- results(dds)
</pre>


=== Primary, secondary, supplementary alignment ===
==== DESeq2 diagnostic plot, MA plot ====
* Multiple mapping and primary from [https://samtools.github.io/hts-specs/SAMv1.pdf#page=2 SAM format specification]
* [http://www.sthda.com/english/wiki/rna-seq-differential-expression-work-flow-using-deseq2 RNA-Seq differential expression work flow using DESeq2]. MA plot, dispersion plot, histogram of p-values, rlog transformation.
* https://www.biostars.org/p/138116/
* [https://bioinformatics-core-shared-training.github.io/Bulk_RNAseq_Course_2021/Markdowns/11_Annotation_and_Visualisation.html RNA-seq Analysis in R Annotation and Visualisation of Differential Expression Results]. MA plot, Volcano plot, venn diagram, heatmap (ComplexHeatmap).
* supplementary alignment = chimeric alignments = non-linear alignments. It's often the case that the sample we're sequencing has structural variations when compared to the reference sequence. Imagine a 100bp read. Let us suppose that the first 50bp align to chr1 and the last 50bp to chr6.
* R packages:
* [https://www.biostars.org/p/101533/ How to extract unique mapped results from Bowtie2 bam results?] ''If you don't extract primary or unique reads from the sam/bam, the reads that maps equally well to multiple sequences will cause a serious bias in quantification of repeated elements.''
** [https://rpkgs.datanovia.com/ggpubr/reference/ggmaplot.html ggpubr::ggmaplot]. It can label DE genes.
** [https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#ma-plot DESeq2::plotMA]
** limma::plotMA
** edgeR::maPlot
** [https://bioconductor.org/packages/release/bioc/vignettes/Glimma/inst/doc/limma_edger.html Glimma::glimmaMA]
** [https://federicomarini.github.io/ideal/reference/plot_ma.html ideal::plot_ma]


To extract primary alignment reads ([https://www.biostars.org/p/276833/ here]), use
==== vst over rlog transformation ====
* https://twitter.com/mikelove/status/1420671546088173569. See [https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html RNA-seq workflow: gene-level exploratory analysis and differential expression]. The transformed data can be used in computing sample distances, PCA, MDS, clustering, ....
* [https://support.bioconductor.org/p/9150245/ normTransform() vs vst()]. [https://rdrr.io/bioc/DESeq2/man/normTransform.html ?normTransform]=log2(x+1). [https://zhuanlan.zhihu.com/p/443966955 DESeq2除了差异分析还有什么要关注的?]
* [https://en.wikipedia.org/wiki/Homoscedasticity Homoscedasticity] = homogeneity of variance
<pre>
<pre>
samtools view -F 260 input.bam
Expected counts
    | round()
    v                              /-- vst transformation  ---\
Raw counts --> normalized counts  --                            -- Other analyses such as PCA, Hclust (sample distances).
                                    \-- rlog transformation ---/
</pre>
</pre>


=== Extract reads based on read IDs ===
==== Simulate negative binomial distribution data ====
https://www.biostars.org/p/68358/#68362
<ul>
<li>rnegbin() in [https://github.com/cran/ssizeRNA/blob/master/R/sim.counts.R#L114 sim.counts()] from the [https://cran.r-project.org/web/packages/ssizeRNA/index.html ssizeRNA] package
<pre>
<pre>
samtools  view Input.bam chrA:x-y  | cut -f1 > idFile.txt
rnegbin(10000 * 10, lambda, 1 / disp)
# 10000 genes, 20 samples
# lambda: mean counts from control group, a matrix.  
# disp: dispersion parameter, a matrix.
</pre>
</li>
</ul>


LC_ALL=C grep -w -F -f idFile.txt  < in.sam > subset.sam
==== Reducing false positives in differential analyses of large RNA sequencing data sets ====
</pre>
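Note that <code>grep -w -F -f</code> looks for the IDs anywhere in each line, so an ID that also appears as a whole word in another column (for example inside an optional tag) matches as well. A hedged alternative that compares only the first column (QNAME) is sketched below with awk. The helper name <code>filter_by_id</code>, the temporary files and the toy records are all made up for illustration; header lines are passed through unchanged.

```shell
# Keep header lines and the alignment records whose QNAME is listed in an ID file.
# Usage: filter_by_id idFile.txt in.sam > subset.sam
filter_by_id() {
  awk -F '\t' '
    NR == FNR { ids[$1] = 1; next }   # first file: remember every read ID
    /^@/      { print; next }         # SAM header lines are kept as they are
    $1 in ids { print }               # records: exact match on column 1 only
  ' "$1" "$2"
}

# Toy input (hypothetical data): readB carries "readA" inside an optional tag.
tmpdir=$(mktemp -d)
printf 'readA\n' > "$tmpdir/idFile.txt"
{
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'readA\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n'
  printf 'readB\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tFFFF\tXX:Z:readA\n'
} > "$tmpdir/in.sam"

filter_by_id "$tmpdir/idFile.txt" "$tmpdir/in.sam" | cut -f 1
```

Here only the header and the readA record are printed, whereas the grep command would also keep readB because of its tag.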
* [https://www.rna-seqblog.com/reducing-false-positives-in-differential-analyses-of-large-rna-sequencing-data-sets/ Reducing false positives in differential analyses of large RNA sequencing data sets]
* [https://towardsdatascience.com/deseq2-and-edger-should-no-longer-be-the-default-choice-for-large-sample-differential-gene-8fdf008deae9 DESeq2 and edgeR should no longer be the default choices for large-sample differential gene expression analysis]


=== Extract mapped/unmapped reads from BAM ===
=== edgeR vs DESeq2 vs limma ===
http://www.htslib.org/doc/samtools.html
<ul>
<pre>
<li>edgeR
0x1 PAIRED         paired-end (or multiple-segment) sequencing technology
<syntaxhighlight lang='rsplus'>
0x2 PROPER_PAIR each segment properly aligned according to the aligner
library(edgeR)
0x4 UNMAP         segment unmapped
0x8 MUNMAP         next segment in the template unmapped
0x10 REVERSE         SEQ is reverse complemented
0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
0x40 READ1         the first segment in the template
0x80 READ2         the last segment in the template
0x100 SECONDARY secondary alignment
0x200 QCFAIL         not passing quality controls
0x400 DUP         PCR or optical duplicate
0x800 SUPPLEMENTARY supplementary alignment
</pre>
* https://www.biostars.org/p/59281/ & https://www.biostars.org/p/56246/
<syntaxhighlight lang='bash'>
# get the unmapped reads from a bam file use :
samtools view -f 4 file.bam > unmapped.sam


# get the output in bam use :
# create DGEList object from count data
samtools view -b -f 4 file.bam > unmapped.bam
counts <- matrix(c(20,30,25,50,45,55,15,20,10,5,10,8,100,120,110,80,90,95), nrow=3, ncol=6, byrow=TRUE)
rownames(counts) <- c("G1", "G2", "G3")
colnames(counts) <- c("A1", "A2", "A3", "B1", "B2", "B3")
counts
#    A1  A2  A3 B1 B2 B3
# G1  20  30  25 50 45 55
# G2  15  20  10  5 10  8
# G3 100 120 110 80 90 95
d <- DGEList(counts)


# get only the mapped reads, the parameter 'F', which works like -v of grep
# perform normalization and differential expression analysis
samtools view -b -F 4 file.bam > mapped.bam
d <- calcNormFactors(d)
design <- model.matrix(~0+factor(c(rep("A",3), rep("B",3))))
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
res <- glmQLFTest(fit, contrast=c(-1, 1))


# get only properly aligned/properly paired reads (https://www.biostars.org/p/111481/)
# summarize the results and identify significant genes
#
summary(res)
# Q: What does the "proper pair" bitwise flag mean?
res2 <- topTags(res); res2
# A1: https://www.biostars.org/p/8318/
# Coefficient: -1*factor(c(rep("A", 3), rep("B", 3)))A 1*factor(c(rep("A", 3), rep("B", 3)))B
#    Properly paired means the read itself as well as its mate are both mapped and
#        logFC  logCPM          F      PValue          FDR
#    they were mapped within a reasonable distance given the expected distance
# G1  1.1683093 17.98614 146.579840 4.380158e-08 1.314048e-07
# A2: https://www.biostars.org/p/12475/
# G2 -0.7865080 16.41917  3.056504 1.059256e-01 1.588884e-01
#     It means both mates of a read pair map to the same chromosome, oriented towards each other,
# G3 -0.1437956 19.34279 40.852893 2.256279e-01 2.256279e-01
#    and with a sensible insert size.  
de_genes <- rownames(res2$table)[res2$table$FDR < 0.05 & abs(res2$table$logFC) > 1]
samtools view -b -f 0x2 accepted_hits.bam > mappedPairs.bam
</syntaxhighlight>
* https://www.biostars.org/p/14518/ So for all pair-end reads where one end is mapped and the other end isn't mapped, this should output all the mapped end entries:
<syntaxhighlight lang='bash'>
samtools view -u -f 8 -F 260 map.bam > oneEndMapped.bam
</syntaxhighlight>
and this will output all the unmapped end entries:
<syntaxhighlight lang='bash'>
samtools view -u  -f 4 -F 264 map.bam  > oneEndUnmapped.bam
</syntaxhighlight>
</syntaxhighlight>
</li>
<li>DESeq2. The count data above will result in an error. The error can occur when there is very little variability in the count data, which can happen if the biological samples are very homogeneous or if the sequencing depth is very low. In such cases, it may be difficult to reliably identify differentially expressed genes using DESeq2.
<pre>
library(DESeq2)
col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(3, 3))))


=== Count number of mapped/unmapped reads from BAM - samtools idxstats ===
# create a DESeq2 dataset object
<syntaxhighlight lang='bash'>
dds <- DESeqDataSetFromMatrix(countData = counts, colData = col_data, design = ~ condition)
# count the number of mapped reads
samtools view -c -F0x4 accepted_hits.bam
9971
# count the number of unmapped reads
samtools view -c -f0x4 accepted_hits.bam
35
</syntaxhighlight>
The result can be checked with samtools idxstats command. See https://pepebioinformatics.wordpress.com/2014/05/29/samtools-count-mapped-and-unmapped-reads-in-bam-file/
<syntaxhighlight lang='bash'>
$ samtools sort accepted_hits.bam -o accepted_hits_sorted.bam
$ samtools index accepted_hits_sorted.bam accepted_hits_sorted.bai # BAI file is binary
$ ls -t
accepted_hits_sorted.bai  accepted_hits_sorted.bam  accepted_hits.bam
$ samtools idxstats accepted_hits_sorted.bam
1 249250621 949 1
2 243199373 807 0
3 198022430 764 0
4 191154276 371 1
5 180915260 527 0
6 171115067 411 3
7 159138663 888 5
8 146364022 434 0
9 141213431 409 2
10 135534747 408 1
11 135006516 490 2
12 133851895 326 1
13 115169878 149 1
14 107349540 399 3
15 102531392 249 1
16 90354753 401 4
17 81195210 466 1
18 78077248 99 0
19 59128983 503 1
20 63025520 275 1
21 48129895 87 0
22 51304566 228 2
X 155270560 315 1
Y 59373566 6 0
MT 16569 10 0
* 0 0 4
$ samtools idxstats accepted_hits_sorted.bam | awk '{s+=$3+$4} END {print s}'
10006
$ samtools idxstats accepted_hits_sorted.bam | awk '{s+=$3} END {print s}'
9971
</syntaxhighlight>


=== Find unmapped reads ===
# differential expression analysis
* http://seqanswers.com/forums/showthread.php?t=5787
dds <- DESeq(dds)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates
# mean-dispersion relationship
# Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) :
#  all gene-wise dispersion estimates are within 2 orders of magnitude
#  from the minimum value, and so the standard curve fitting techniques will not work.
#  One can instead use the gene-wise estimates as final estimates:
#  dds <- estimateDispersionsGeneEst(dds)
#  dispersions(dds) <- mcols(dds)$dispGeneEst
#  ...then continue with testing using nbinomWaldTest or nbinomLRT
</pre>
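The samtools <code>-f</code> and <code>-F</code> options used in the filters above are bitmasks over the SAM FLAG field: <code>-f</code> keeps a read only if all the given bits are set, and <code>-F</code> drops it if any of the given bits is set. The sketch below mimics that rule with shell arithmetic. The helper name <code>keep_read</code> and the example flags 73 and 133 are illustrative assumptions, not taken from a real BAM file.

```bash
# keep_read FLAG REQUIRE EXCLUDE   (hypothetical helper, not part of samtools)
# Prints "keep" if every bit in REQUIRE is set in FLAG and no bit in EXCLUDE is set.
keep_read() {
  if [ $(( $1 & $2 )) -eq "$2" ] && [ $(( $1 & $3 )) -eq 0 ]; then
    echo keep
  else
    echo drop
  fi
}

# Bit values from the SAM specification:
#   4 = read unmapped, 8 = mate unmapped, 256 = secondary alignment
# 260 = 256 + 4 and 264 = 256 + 8.

# Assumed example pair: flag 73 (paired, mate unmapped, first in pair) is the
# mapped end; flag 133 (paired, read unmapped, second in pair) is its mate.
keep_read 73  8 260   # -f 8 -F 260 : mapped end is kept
keep_read 133 8 260   # -f 8 -F 260 : unmapped end is dropped
keep_read 133 4 264   # -f 4 -F 264 : unmapped end is kept
keep_read 73  4 264   # -f 4 -F 264 : mapped end is dropped
```

So <code>-f 8 -F 260</code> selects mapped, primary reads whose mate is unmapped, and <code>-f 4 -F 264</code> selects the unmapped mates themselves.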
Try another data set.
<syntaxhighlight lang='rsplus'>
count_data <- matrix(c(100, 500, 200, 1000, 300,
                      200, 400, 150, 500, 300,
                      300, 300, 100, 1500, 300,
                      400, 200, 50, 2000, 300), nrow = 5, byrow = TRUE)


=== Find multi-mapped reads ===
colnames(count_data) <- paste0("sample", 1:4)
http://seqanswers.com/forums/showthread.php?t=16427
rownames(count_data) <- paste0("gene", 1:5)
<syntaxhighlight lang='bash'>
col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(2, 2))))
samtools view -F 4 file.bam | awk '{printf $1"\n"}' | sort | uniq -d | wc -l
# uniq -d will only print duplicate lines
</syntaxhighlight>


=== Realignment and recalibration ===
# create a DESeq2 dataset object
https://www.biostars.org/p/141784/
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates
# mean-dispersion relationship
# -- note: fitType='parametric', but the dispersion trend was not well captured by the
#    function: y = a/x + b, and a local regression fit was automatically substituted.
#    specify fitType='local' or 'mean' to avoid this message next time.
# final dispersion estimates
# fitting model and testing
# Warning message:
# In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
#  Estimated rdf < 1.0; not estimating variance


# Realignment and recalibration won't change these metrics output from samtools flagstat.
# differential expression analysis
# '''Recalibration is typically not needed anymore (the same goes for realignment if you're using something like GATK's haplotype caller).'''
dds <- DESeq(dds)
# Realignment just locally realigns things and typically the assigned MAPQ values don't change (unmapped mates etc. also won't change).
# Recalibration typically affects only base qualities.


=== htslib ===
# extract results
Suppose we download a vcf.gz file from the [ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/ NCBI] FTP site. We want to subset chromosome 1 and index the new file.
res <- results(dds)
res
# log2 fold change (MLE): condition untreated vs treated
# Wald test p-value: condition untreated vs treated
# DataFrame with 5 rows and 6 columns
#        baseMean log2FoldChange    lfcSE      stat    pvalue      padj
#      <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
# gene1  465.558      0.707313  1.243608  0.568758 0.5695201  0.711900
# gene2  309.591      -0.119330  0.885551 -0.134752 0.8928081  0.892808
# gene3  413.959      -0.742921  0.513376 -1.447130 0.1478604  0.369651
# gene4  638.860      -1.372724  1.428977 -0.960634 0.3367360  0.561227
# gene5  721.470      2.928705  1.536174  1.906493 0.0565863  0.282931


* '''tabix''' - subset a vcf.gz file or index a vcf.gz file
# extract DE genes with adjusted p-value < 0.05 and |log2 fold change| > 1
<syntaxhighlight lang='bash'>
DESeq2_DE_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
tabix -h common_all_20160601.vcf.gz 1: > common_all_20160601_1.vcf


$ tabix -h
# print the number of DE genes identified by DESeq2
cat("DESeq2 identified", nrow(DESeq2_DE_genes), "DE genes.\n")
</syntaxhighlight>
</li>


Version: 1.3
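<li>limma-voom. voom is a function in the limma package that transforms RNA-Seq counts to log2 counts per million and estimates the mean-variance relationship to compute precision weights, so that the data can be analyzed with limma's linear models. [https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html Differential Expression with Limma-Voom].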
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]
<syntaxhighlight lang='rsplus'>
library(limma)


Indexing Options:
# Create a design matrix with the sample groups
  -0, --zero-based          coordinates are zero-based
design_matrix <- model.matrix(~ condition, data = col_data)
  -b, --begin INT            column number for region start [4]
  -c, --comment CHAR        skip comment lines starting with CHAR [null]
  -C, --csi                  generate CSI index for VCF (default is TBI)
  -e, --end INT              column number for region end (if no end, set INT to -b) [5]
  -f, --force                overwrite existing index without asking
  -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
  -p, --preset STR          gff, bed, sam, vcf
  -s, --sequence INT        column number for sequence names (suppressed by -p) [1]
  -S, --skip-lines INT      skip first INT lines [0]


Querying and other options:
# filter out low-expressed genes
  -h, --print-header        print also the header lines
if (FALSE) {
  -H, --only-header          print only the header lines
  keep <- rowSums(counts) >= 10
  -l, --list-chroms          list chromosome names
  counts <- counts[keep,]
  -r, --reheader FILE        replace the header with the content of FILE
}
  -R, --regions FILE        restrict to regions listed in the file
  -T, --targets FILE        similar to -R but streams rather than index-jumps
</syntaxhighlight>
* '''bgzip''' - create a vcf.gz file from a vcf file
<syntaxhighlight lang='bash'>
bgzip -c common_all_20160601_1.vcf > common_all_20160601_1.vcf.gz


# Indexing. Output is <common_all_20160601_1.vcf.gz.tbi>
# normalization using voom
tabix -f -p vcf common_all_20160601_1.vcf.gz
v <- voom(count_data, design_matrix)


$ bgzip -h
# linear model fitting
fit <- lmFit(v, design_matrix)


Version: 1.3
# Calculate the empirical Bayes statistics
Usage:  bgzip [OPTIONS] [FILE] ...
fit <- eBayes(fit)
Options:
  -b, --offset INT        decompress at virtual file pointer (0-based uncompressed offset)
  -c, --stdout            write on standard output, keep original files unchanged
  -d, --decompress        decompress
  -f, --force            overwrite files without asking
  -h, --help              give this help
  -i, --index            compress and create BGZF index
  -I, --index-name FILE  name of BGZF index file [file.gz.gzi]
  -r, --reindex          (re)index compressed file
  -s, --size INT          decompress INT bytes (uncompressed size)
  -@, --threads INT      number of compression threads to use [1]
</syntaxhighlight>


==== hts-nim ====
top.table <- topTable(fit, sort.by = "P", n = Inf)
[https://www.biorxiv.org/content/biorxiv/early/2018/02/08/261735.full.pdf hts-nim: scripting high-performance genomic analyses] and its [https://github.com/brentp/hts-nim source] on GitHub.
top.table
#            logFC  AveExpr          t    P.Value adj.P.Val        B
# gene3 -2.3242262 15.67787 -1.8229203 0.08330054 0.3153894 -4.249298
# gene5  2.0865122 16.34012  1.5960616 0.12615577 0.3153894 -4.376405
# gene1  0.5610761 16.69722  0.5079444 0.61704872 0.7551329 -4.834183
# gene2  0.2477959 18.69843  0.3829295 0.70581164 0.7551329 -5.150958
# gene4  0.2869870 17.11809  0.3161922 0.75513288 0.7551329 -4.963932


Examples include BAM filtering (bam files), read counts in regions (bam & bed files) and quality control of variant call files (vcf files). https://github.com/brentp/hts-nim-tools
# Perform hypothesis testing to identify DE genes
results <- decideTests(fit)


== [http://samstat.sourceforge.net/ SAMStat] ==
summary(results)
Displaying sequence statistics for next generation sequencing. SAMStat reports nucleotide composition, length distribution, base quality distribution, mapping statistics, mismatch, insertion and deletion error profiles, di-nucleotide and 10-mer over-representation. The output is a single html5 page which can be interpreted by a non-specialist.
#        (Intercept) conditionuntreated
# Down            0                  0
# NotSig          0                  5
# Up              5                  0


Usage
# Extract the DE genes
<syntaxhighlight lang='bash'>
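# decideTests() returns a matrix with one column per coefficient, so index by column name
de_genes <- rownames(results)[results[, "conditionuntreated"] != 0]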
samstat <file.sam>  <file.bam>  <file.fa>  <file.fq> ....
</syntaxhighlight>
</syntaxhighlight>
For each input file SAMStat will create a single html page named after the input file name plus a dot html suffix.
</li>
</ul>


== [http://broadinstitute.github.io/picard/ Picard] ==
==== DESeq2 vs edgeR ====
A set of tools (in Java) for working with next generation sequencing data in the BAM (http://samtools.sourceforge.net) format.
D vs E?


https://github.com/broadinstitute/picard
* One difference is in how the '''dispersion''' parameter is estimated. Both packages start from gene-wise estimates and share information across genes. DESeq2 fits a trend of dispersion against mean expression (a '''parametric''' curve by default, with '''local regression''' as a fallback) and shrinks the gene-wise estimates toward that trend. edgeR estimates common, trended and tagwise dispersions using a '''Cox-Reid profile-adjusted likelihood''' and shrinks the tagwise values with an '''empirical Bayes''' approach.
* Another difference is in '''normalization'''. DESeq2 computes size factors with the '''median-of-ratios''' method, whereas edgeR uses '''trimmed mean of M-values (TMM)''' normalization, which rescales each sample's library size to an effective library size. The '''variance-stabilizing transformation''' in DESeq2 is meant for visualization and clustering, not for the differential expression test itself.
* Both packages model the counts with a '''negative binomial''' generalized linear model, so covariates such as batch can be included in the design of either one. DESeq2 uses a Wald test by default (a likelihood ratio test is also available) and offers '''shrinkage estimators''' for log fold changes. edgeR offers exact tests, likelihood ratio tests and quasi-likelihood F-tests.


Note that as of version 2.0.1 (Nov. 2015), Picard requires Java 1.8 (jdk8u66). The last version to support Java 1.7 was release 1.141. Use the following to check your Java version:
When should I choose DESeq2 and when should I choose edgeR?
<syntaxhighlight lang='bash'>
* The choice between DESeq2 and edgeR for differential gene expression analysis depends on several factors, including the experimental design, sample size, and the nature of the biological question being investigated. Here are some general guidelines to help you choose between these two algorithms:
brb@T3600 ~/github/picard $ java -version
* Choose DESeq2 when:
java version "1.7.0_101"
** The experimental design includes multiple batches or covariates that may affect the gene expression levels
OpenJDK Runtime Environment (IcedTea 2.6.6) (7u101-2.6.6-0ubuntu0.14.04.1)
** The '''sample size is small''', typically fewer than 12 samples per group
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
** The gene expression levels are highly variable across replicates, and the goal is to identify differentially expressed genes with a low false discovery rate (FDR)
</syntaxhighlight>
** The focus is on the fold change rather than the statistical significance of differential expression
* Choose edgeR when:
** The experimental design includes several factors, such as treatment, time, and biological replicate, and the goal is to identify the main effects and interaction effects of these factors on gene expression
** The sample size is moderate to large, typically more than 12 samples per group
** The gene expression levels are less variable across replicates, and the goal is to achieve high statistical power to detect differentially expressed genes
** The focus is on both the fold change and the statistical significance of differential expression, and the researcher is interested in performing downstream analyses such as gene set enrichment analysis or pathway analysis.


See my [https://taichimd.us/mediawiki/index.php/Linux#JRE_and_JDK Linux -> JRE and JDK] page.
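Checking the Java requirement for Picard can be scripted. The sketch below extracts the major version from a version string such as the <code>1.7.0_101</code> reported above; <code>java_major</code> is a hypothetical helper name, and the parsing assumes the usual <code>1.x.y_z</code> (Java 8 and earlier) or <code>x.y.z</code> (Java 9 and later) layouts.

```bash
# java_major VERSION_STRING   (hypothetical helper)
# "1.7.0_101" -> 7, "1.8.0_66" -> 8, "11.0.2" -> 11
java_major() {
  echo "$1" | awk -F'[._]' '{ if ($1 == 1) print $2; else print $1 }'
}

java_major "1.7.0_101"   # 7
java_major "1.8.0_66"    # 8
java_major "11.0.2"      # 11

# With a real installation the string can be taken from "java -version",
# which writes to stderr:
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   [ "$(java_major "$ver")" -ge 8 ] || echo "Picard >= 2.0.1 needs Java 1.8 or later"
```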
=== DESeq2 in Python ===
[https://github.com/owkin/PyDESeq2 PyDESeq2], [https://twitter.com/mikelove/status/1651542670785781763 nice work]


Some people have reported errors or complained about memory usage; see [http://seqanswers.com/forums/showthread.php?t=6204 this post] on seqanswers.com. The HTSeq Python program is another option.
=== Generalized Linear Models and Plots with edgeR ===
<syntaxhighlight lang='bash'>
[https://morphoscape.wordpress.com/2020/09/26/generalized-linear-models-and-plots-with-edger-advanced-differential-expression-analysis/ Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis]
sudo apt-get install ant
git clone https://github.com/broadinstitute/picard.git
cd picard
git clone https://github.com/samtools/htsjdk.git
ant -lib lib/ant package-commands # We will see 'BUILD SUCCESSFUL'
                                  # It will create <dist/picard.jar>
</syntaxhighlight>


=== [http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq SamToFastq] ===
=== [http://www.bioconductor.org/packages/devel/bioc/html/EBSeq.html EBSeq] ===
<syntaxhighlight lang='bash'>
An R package for gene and isoform differential expression analysis of RNA-seq data
java jvm-args -jar picard.jar PicardCommandName OPTION1=value1 OPTION2=value2...
# For example
cd ~/github/freebayes/test/tiny
java -Xmx2g -jar ~/github/picard/dist/picard.jar SamToFastq \
  INPUT=NA12878.chr22.tiny.bam \
  FASTQ=tmp_1.fq \
  SECOND_END_FASTQ=tmp_2.fq \
  INCLUDE_NON_PF_READS=True \
  VALIDATION_STRINGENCY=SILENT
wc -l tmp_1.fq
# 6532 tmp_1.fq
wc -l tmp_2.fq
# 6532 tmp_2.fq
</syntaxhighlight>


=== SAM validation error; Mate Alignment start should be 0 because reference name = * ===
http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/
[https://software.broadinstitute.org/gatk/guide/article?id=7571 Errors in SAM/BAM files can be diagnosed with ValidateSamFile]


When I run Picard's MarkDuplicates, AddOrReplaceReadGroups or ReorderSam on a BAM file that was aligned to the human+mouse genomes and then had the mouse mappings excluded, I get the following error:
=== [http://www.bioconductor.org/packages/release/bioc/html/prebs.html prebs] ===
<pre>
Probe region expression estimation for RNA-seq data for improved microarray comparability
Ignoring SAM validation error: ERROR: Record 3953, Read name D00748:53:C8KPMANXX:7:1109:11369:5479, Mate Alignment start should be 0 because reference name = *.'
</pre>


https://www.biostars.org/p/9876/ & https://www.biostars.org/p/108148/. Many of the validation errors reported by Picard are "technically" errors, but do not have any impact for the vast majority of downstream processing.
=== [http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html DEXSeq] ===
Inference of differential exon usage in RNA-Seq


For example,
=== [http://www-personal.umich.edu/~jianghui/rseqnp/ rSeqNP] ===
<syntaxhighlight lang='bash'>
A non-parametric approach for detecting differential expression and splicing from RNA-Seq data
java -Xmx10g -jar picard.jar MarkDuplicates VALIDATION_STRINGENCY=LENIENT METRICS_FILE=MarkDudup.metrics INPUT=input.bam OUTPUT=output.bam
 
# OR to avoid messages
=== [https://peerj.com/articles/3890/ voomDDA]: discovery of diagnostic biomarkers and classification of RNA-seq data ===
java -Xmx10g -jar picard.jar MarkDuplicates VALIDATION_STRINGENCY=SILENT METRICS_FILE=MarkDudup.metrics INPUT=input.bam OUTPUT=output.bam
http://www.biosoft.hacettepe.edu.tr/voomDDA/
</syntaxhighlight>
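To see which records trigger the "Mate Alignment start should be 0 because reference name = *" message, one can scan the SAM columns directly: RNEXT is column 7 and PNEXT is column 8. The sketch below runs the check on two made-up records; the function name <code>sam_records</code> and the toy reads are assumptions for illustration only.

```bash
# Two toy SAM records (tab-separated, 11 mandatory columns).
# The second has RNEXT "*" but a non-zero PNEXT, the pattern Picard reports.
sam_records() {
  printf 'r1\t73\tchr1\t100\t50\t10M\t=\t100\t0\tACGTACGTAC\tIIIIIIIIII\n'
  printf 'r2\t73\tchr1\t200\t50\t10M\t*\t200\t0\tACGTACGTAC\tIIIIIIIIII\n'
}

# Print the names of reads whose mate reference is "*" while the mate position is not 0.
bad_mates=$(sam_records | awk -F'\t' '$1 !~ /^@/ && $7 == "*" && $8 != 0 { print $1 }')
echo "$bad_mates"   # r2

# On a real file the same awk filter can be fed from samtools:
#   samtools view input.bam | awk -F'\t' '$7 == "*" && $8 != 0 { print $1 }' | head
```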
 
== Pathway analysis ==


Another (different) error message is:
=== About the KEGG pathways ===
* [https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp#C2 MSigDB] database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
* [https://www.bioconductor.org/packages/release/data/experiment/html/msigdb.html msigdb] from Bioconductor
<ul>
<li>The [http://www.bioconductor.org/packages/release/bioc/html/KEGGREST.html KEGGREST] package pulls the data directly from kegg.jp. I get 337 KEGG pathways for human ('hsa').
<pre>
<pre>
SAM validation error: ERROR: Record 16642, Read name SRR925751.3835, First of pair flag should not be set for unpaired read.
BiocManager::install("KEGGREST")
library(KEGGREST)
res <- keggList("pathway", "hsa")
length(res) # 337
</pre>
</pre>
</li>
</ul>


== [https://github.com/arq5x/bedtools2 BEDtools] ==
=== GSOAP ===
The BED format is similar to the GTF format but more compact. On Linux, bedtools can be installed with apt-get.
[https://academic.oup.com/bioinformatics/article/36/9/2923/5715574 GSOAP: a tool for visualization of gene set over-representation analysis]


<syntaxhighlight lang="bash">
=== clusterProfiler ===
bedtools intersect
* [https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html clusterProfiler]
bedtools bamtobed
* [http://yulab-smu.top/clusterProfiler-book/chapter6.html#kegg-gene-set-enrichment-analysis Chapter 6 KEGG analysis]
bedtools bedtobam
bedtools getfasta
bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
</syntaxhighlight>


For example,
=== [http://bioconductor.org/packages/release/bioc/html/fgsea.html fgsea:] Fast Gene Set Enrichment Analysis ===
<syntaxhighlight lang="bash">
* [https://bioinformatics.stackexchange.com/a/166 Are fgsea and Broad Institute GSEA equivalent?]. Note that the enrichment scores are the same, though the way the p-values are calculated can differ.
bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | wc -l  # in this case every line is one exon
* [https://stephenturner.github.io/deseq-to-fgsea/ DESeq results to pathways in 60 Seconds with the fgsea package]
bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | cut -f9 | cut -d ' ' -f2 | more
* [https://mgrcbioinfo.github.io/my_GSEA_plot/ GSEA plot for multiple comparisons]
bedtools intersect -wo -a RefSeq.gtf -b XXX.bed | cut -f9 | cut -d ' ' -f2 | sort -u | wc -l
* [https://davetang.org/muse/2018/01/10/using-fast-preranked-gene-set-enrichment-analysis-fgsea-package/ Using the fast preranked gene set enrichment analysis (fgsea) package] 2018


bedtools intersect -wo -a RefSeq.bed -b XXX.bed | more    # one gene is one line with multiple intervals
=== [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR]: Reproducible GSEA Benchmarking ===
</syntaxhighlight>
[https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis]
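The <code>cut -f9 | cut -d ' ' -f2 | sort -u | wc -l</code> pipeline in the bedtools intersect examples counts distinct genes rather than exons, because column 9 of a GTF line holds the attributes and the second space-separated token is the quoted gene id. The sketch below replays it on three made-up GTF lines; <code>gtf_hits</code> and the gene names are hypothetical stand-ins for real <code>bedtools intersect -wo</code> output.

```bash
# Three toy GTF exon lines: two exons of GENE_A and one exon of GENE_B.
gtf_hits() {
  printf 'chr1\tRefSeq\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";\n'
  printf 'chr1\tRefSeq\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";\n'
  printf 'chr2\tRefSeq\texon\t500\t600\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_2";\n'
}

# One line per exon
n_exons=$(gtf_hits | wc -l | tr -d ' ')

# Column 9 -> second token (the quoted gene id) -> unique values
n_genes=$(gtf_hits | cut -f9 | cut -d ' ' -f2 | sort -u | wc -l | tr -d ' ')

echo "exons: $n_exons, genes: $n_genes"   # exons: 3, genes: 2
```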


One good use of bedtools is to find the intersection of bam and bed files (to double check the insertion/deletion/junction reads in bed files are in accepted_hits.bam). See [http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html this page]
=== hypeR ===
<syntaxhighlight lang="bash">
* [https://academic.oup.com/bioinformatics/article/36/4/1307/5566242 hypeR: an R package for geneset enrichment workflows]
$ bedtools --version
* [https://montilab.github.io/hypeR-workshop/ Efficient, Scalable, and Reproducible Enrichment Workflows] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
bedtools v2.17.0
$ bedtools intersect -abam accepted_hits.bam -b junctions.bed | samtools view - | head -n 3
</syntaxhighlight>


We can use bedtools to convert BAM files back to FASTQ files. See https://www.biostars.org/p/152956/
=== [https://bioconductor.org/packages/release/bioc/html/rgsepd.html GSEPD] ===
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2697-5 GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display]


=== [http://bedtools.readthedocs.org/en/latest/content/installation.html Installation] ===
=== SeqGSEA ===
<syntaxhighlight lang="bash">
http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html
git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make
</syntaxhighlight>


=== bamtofastq ===
=== BAGSE ===
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
[https://academic.oup.com/bioinformatics/article/36/6/1689/5614816 BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis] 2020
cd ~/github/freebayes/test/tiny # Working directory


~/github/samtools/samtools sort -n NA12878.chr22.tiny.bam NA12878.chr22.tiny.qsort
=== GeneSetCluster ===
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03784-z GeneSetCluster: a tool for summarizing and integrating gene-set analysis results]


~/github/bedtools2/bin/bamToFastq -i NA12878.chr22.tiny.qsort.bam \
== Pipeline ==
                      -fq aln.end1.fq \
                      -fq2 aln.end2.fq  2> bamToFastq.warning.out


## 60 warnings; see <bamToFastq.warning.out>
=== SPEAQeasy ===
*****WARNING: Query ST-E00118:53:H02GVALXX:1:1102:1487:23724 is marked as paired, but its mate does not occur next to it in your BAM file.  Skipping.  
[https://rna-seqblog.com/speaqeasy-a-scalable-pipeline-for-expression-analysis-and-quantification-for-r-bioconductor-powered-rna-seq-analyses/ SPEAQeasy] – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses


wc -l bamToFastq.warning.out  # 60
=== Nextflow ===
wc -l aln.end1.fq  # 6548
* [https://www.nextflow.io/ Nextflow]: Data-driven computational pipelines, [https://www.nextflow.io/blog.html Blog].
wc -l aln.end2.fq  # 6548
* [https://www.nextflow.io/example4.html RNA-Seq pipeline]
                  # recall tmp_1.fq is 6532
</pre>
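The <code>wc -l</code> numbers in the bamToFastq example are line counts, not read counts. Assuming the usual four lines per FASTQ record (no wrapped sequence lines), they convert to reads as sketched below; <code>reads_from_lines</code> is a hypothetical helper name.

```bash
# reads_from_lines LINES   (hypothetical helper)
# Number of FASTQ records, assuming exactly 4 lines per record.
reads_from_lines() {
  echo $(( $1 / 4 ))
}

bedtools_reads=$(reads_from_lines 6548)   # aln.end1.fq from bamToFastq
picard_reads=$(reads_from_lines 6532)     # tmp_1.fq from Picard SamToFastq

echo "$bedtools_reads"                          # 1637
echo "$picard_reads"                            # 1633
echo $(( bedtools_reads - picard_reads ))       # 4
```

So the two tools differ by 4 reads per end on this test file.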


== [http://cole-trapnell-lab.github.io/cufflinks/ Cufflinks package] ==
=== GeneTEFlow ===
Transcriptome assembly and differential expression analysis for RNA-Seq.
[https://rna-seqblog.com/geneteflow-a-nextflow-based-pipeline-for-analysing-gene-and-transposable-elements-expression-from-rna-seq-data/ GeneTEFlow] – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data


''Both Cufflinks and Cuffdiff accept SAM and BAM files as input. It is not uncommon for a single lane of Illumina HiSeq sequencing to produce FASTQ and BAM files with a combined size of 20 GB or larger. Laboratories planning to perform more than a small number of RNA-seq experiments should consider investing in robust storage infrastructure, either by purchasing their own hardware or through cloud storage services.''
=== pipeComp ===
[https://github.com/plger/pipeComp/ pipeComp], a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools


=== Tuxedo protocol ===
=== [https://github.com/PF2-pasteur-fr/SARTools SARTools] ===
* [http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks] by Trapnell, et al 2012.
http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/
* https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite


# bowtie2 - fast alignment
=== SEQprocess ===
# tophat2 - splice alignment (rna-seq reads, rna are spliced, introns are removed, some reads may span over 2 exons)
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2676-x SEQprocess]: a modularized and customizable pipeline framework for NGS processing in R package
# cufflinks - transcript assembly & quantitation
# cuffdiff2 - differential expression


=== [http://cole-trapnell-lab.github.io/cufflinks/getting_started/ Cufflinks - assemble reads into transcript] ===
=== GEMmaker ===
Installation
[https://github.com/SystemsGenetics/GEMmaker GEMmaker], [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04629-7 Paper]
<pre>
# about 47MB on ver 2.2.1 for Linux binary version
wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
sudo tar xzvf ~/Downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz  -C /opt/RNA-Seq/bin/
export PATH=$PATH:/opt/RNA-Seq/bin/cufflinks-2.2.1.Linux_x86_64/


# test
== pasilla and pasillaBamSubset Data ==
cufflinks -h
pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.
</pre>


Cufflinks uses the alignments against the genome produced by TopHat to assemble the reads into '''transcripts'''.
pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).
* http://homer.salk.edu/homer/basicTutorial/rnaseqCufflinks.html
<pre>
# Quantifying Known Transcripts using Cufflinks
cufflinks -o OutputDirectory/ -G refseq.gtf mappedReads.bam


# De novo Transcript Discovery using Cufflinks
== [http://www.bioconductor.org/packages/release/bioc/html/BitSeq.html BitSeq] ==
cufflinks -o OutputDirectory/ mappedReads.bam
Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of [http://www.hiit.fi/u/ahonkela/ Antti Honkela].
</pre>


The output files are  genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf and '''transcripts.gtf'''.
== ReportingTools ==
The [http://www.bioconductor.org/packages/release/bioc/html/ReportingTools.html ReportingTools] software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.


It can be used to calculate FPKM.
Figures can be included in a cell in output table. See [http://www.bioconductor.org/packages/release/bioc/vignettes/ReportingTools/inst/doc/microarrayAnalysis.pdf#page=3 Using ReportingTools in an Analysis of Microarray Data].
* https://www.biostars.org/p/11378/
* http://www.partek.com/Tutorials/microarray/User_Guides/UnderstandingReads.pdf


=== Cuffcompare - compares transcript assemblies to annotation ===
It is suggested by e.g. EnrichmentBrowser.


=== Cuffmerge - merges two or more transcript assemblies ===
== [http://cran.r-project.org/web/packages/sequences/index.html sequences] ==
First create a text file <assemblies.txt>
More or less an educational package. It contains C and C++ source code. It is used in Advanced R programming and package development.
<pre>
./dT_bio/transcripts.gtf
./dT_ori/transcripts.gtf
./dT_tech/transcripts.gtf
./RH_bio/transcripts.gtf
./RH_ori/transcripts.gtf
./RH_tech/transcripts.gtf
</pre>


Then run
== [http://www.bioconductor.org/packages/release/bioc/html/QuasR.html QuasR] ==
<pre>
[http://bioinformatics.oxfordjournals.org/content/early/2014/12/09/bioinformatics.btu781.short?rss=1&ssource=mfr Bioinformatics] paper
cd GSE11209
cuffmerge -g genes.gtf -s genome.fa assemblies.txt
</pre>
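The <assemblies.txt> file for cuffmerge can also be generated with a loop instead of being typed by hand. This is a minimal sketch that assumes the six sample directory names listed above; it writes to a temporary directory (the variable <code>outdir</code> is an assumption) so that nothing in the working directory is overwritten.

```bash
# Write one transcripts.gtf path per sample directory.
outdir=$(mktemp -d)
for sample in dT_bio dT_ori dT_tech RH_bio RH_ori RH_tech; do
  printf './%s/transcripts.gtf\n' "$sample"
done > "$outdir/assemblies.txt"

# Sanity checks: six lines, first one is the dT_bio assembly.
n_lines=$(wc -l < "$outdir/assemblies.txt" | tr -d ' ')
first_line=$(head -n 1 "$outdir/assemblies.txt")
echo "$n_lines"      # 6
echo "$first_line"   # ./dT_bio/transcripts.gtf
```

In a real run the output would be redirected to <code>assemblies.txt</code> inside the project directory (GSE11209 in the example above) before calling cuffmerge.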


=== Cuffdiff ===
== CRAN/Bioconductor packages ==
Finds differentially expressed genes and transcripts, and detects differential splicing and promoter use.
=== [https://cran.r-project.org/web/packages/ssizeRNA/index.html ssizeRNA] ===
* Sample Size Calculation for RNA-Seq Experimental Design
* [https://youtu.be/SapuPiaiNjs?t=1327 B4B: Module 2 - RNAseq power calculation]


Cuffdiff takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis.
=== RNASeqPower ===
[https://bioconductor.org/packages/release/bioc/html/RNASeqPower.html RNASeqPower] Sample size for RNAseq studies


Following the [http://cufflinks.cbcb.umd.edu/tutorial.html tutorial], we can quickly test the cuffdiff program.
=== [http://master.bioconductor.org/packages/devel/bioc/html/RnaSeqSampleSize.html RnaSeqSampleSize] ===
<pre>
[https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/ Shiny] app
$ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam
$ cufflinks ./test_data.sam
$ ls -l
total 56
-rw-rw-r-- 1 mli mli  221 2013-03-05 15:51 genes.fpkm_tracking
-rw-rw-r-- 1 mli mli  231 2013-03-05 15:51 isoforms.fpkm_tracking
-rw-rw-r-- 1 mli mli    0 2013-03-05 15:51 skipped.gtf
-rw-rw-r-- 1 mli mli 41526 2009-09-26 19:15 test_data.sam
-rw-rw-r-- 1 mli mli  887 2013-03-05 15:51 transcripts.gtf
</pre>


In real data,
=== [https://cran.r-project.org/web/packages/rbamtools/index.html rbamtools] ===
<pre>
Provides an interface to functions of the 'SAMtools' C-Library by Heng Li
cd GSE11209
cuffdiff -p 5 -o cuffdiff_out -b genome.fa -L dT,RH \
        -u merged_asm/merged.gtf \
        ./dT_bio/accepted_hits.bam,./dT_ori/accepted_hits.bam,./dT_tech/accepted_hits.bam \
        ./RH_bio/accepted_hits.bam,./RH_ori/accepted_hits.bam,./RH_tech/accepted_hits.bam
</pre>


'''N.B.''': the FPKM value (<genes.fpkm_tracking>) is generated per condition, not per sample. To get FPKM per sample, specify one condition for each sample.
=== [http://cran.r-project.org/web/packages/refGenome/index.html refGenome]  ===
The package contains functionality for importing and managing downloaded genome annotation data from the Ensembl genome browser (European Bioinformatics Institute) and from the UCSC genome browser (University of California, Santa Cruz), as well as annotation routines for genomic positions and splice site positions.


The method of constructing test statistics can be found [https://07110005642076687011.googlegroups.com/attach/51b6f3bc81fbea69/Cufflinks_manual_old.pdf?part=4&vt=ANaJVrECk0GsgeKTeGugnJolIXKsZP1M4igDfMSDjo3aR5ALy5r6aLbSBgBi-stEepWMYX2nhGKrph7RRnqm66XJMIcON8Zf-Q4WBoIraXCemBBsiBu4qJs online].
=== [http://cran.r-project.org/web/packages/WhopGenome/index.html WhopGenome]  ===
Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO, and can read, modify and write PLINK .PED-format pedigree files.


=== Pipeline ===
=== [https://cran.r-project.org/web/packages/TCGA2STAT/index.html TCGA2STAT] ===
* https://www.biostars.org/p/123993/
Simple TCGA Data Access for Integrated Statistical Analysis in R


== [http://compbio.mit.edu/cummeRbund/ CummeRbund] ==
TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.
Plots abundance and differential expression results from Cuffdiff. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website
<syntaxhighlight lang='rsplus'>
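# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("CNTools")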


The tool appears in the paper [http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks] by Trapnell, et al 2012.
install.packages("TCGA2STAT")
</syntaxhighlight>


== Finding differentially expressed genes ==
The getTCGA() function can download various kinds of data:
=== Big pictures ===
* '''gene expression''' which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
* '''miRNA expression''' which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
* '''mutation''' data (data.type="Mutation")
* '''methylation expression''' (data.type="Methylation")
* '''copy number changes''' (data.type="CNA_SNP")
 
=== TCGAbiolinks ===
* https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html. Many vignettes.
* https://bioconductor.org/packages/3.12/TCGAbiolinks
** data access through GenomicDataCommons
** provides data both from the legacy Firehose pipeline used by the TCGA publications (alignments based on hg18 and hg19 builds), and the GDC harmonized GRCh38 pipeline
* The [https://rdrr.io/bioc/TCGAbiolinks/man/GDCquery.html help page of GDCquery] does not clearly document the '''file.type''' option. See [https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/59 How to use TCGAbiolinks to download raw RSEM gene counts for specific type of cancer]: use file.type = "results" or file.type = "normalized_results".
<ul>
<li>An example from [https://github.com/waldronlab/PublicDataResources Public Data Resources in Bioconductor] workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
<pre>
<pre>
  BWA/Bowtie    samtools       
library(TCGAbiolinks)
fa ---------> sam ------> sam/bam (sorted indexed, short reads), vcf
library(SummarizedExperiment)
  or tophat
query <- GDCquery(project = "TCGA-ACC",
                          data.category = "Gene expression",
                          data.type = "Gene expression quantification",
                          platform = "Illumina HiSeq",
                          file.type  = "normalized_results",
                          experimental.strategy = "RNA-Seq",
                          legacy = TRUE)


Rsamtools    GenomeFeatures                  edgeR (normalization)
gdcdir <- file.path("Waldron_PublicData", "GDCdata")
--------->   --------------> table of counts --------->
GDCdownload(query, method = "api", files.per.chunk = 10,
            directory = gdcdir)  # 79 files
ACCse <- GDCprepare(query, directory = gdcdir)
ACCse
class(ACCse)
dim(assay(ACCse))  # 19947 x 79
assay(ACCse)[1:3, 1:2] # symbol id
length(unique(rownames(assay(ACCse))))  #  19672
rowData(ACCse)[1:2, ]
# DataFrame with 2 rows and 3 columns
#          gene_id entrezgene ensembl_gene_id
#      <character>  <integer>    <character>
# A1BG        A1BG          1 ENSG00000121410
# A2M          A2M          2 ENSG00000175899
</pre>
</li>
<li>HTSeq counts data example. [https://github.com/wwylab/DeMixT/issues/19 DeMixT]. [https://support.bioconductor.org/p/9149780/ Error when running GDC_prepare].
{{Pre}}
query2 <- GDCquery(project = "TCGA-ACC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type="HTSeq - Counts") # or "STAR - Counts"
gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
GDCdownload(query2, method = "api", files.per.chunk = 10,
            directory = gdcdir2)  # 79 files
ACCse2 <- GDCprepare(query2, directory = gdcdir2)
ACCse2
dim(assay(ACCse2))  # 56457 x 79
assay(ACCse2)[1:3, 1:2]  # ensembl id
rowData(ACCse2)[1:2, ]
DataFrame with 2 rows and 3 columns
                ensembl_gene_id external_gene_name original_ensembl_gene_id
                    <character>        <character>              <character>
ENSG00000000003 ENSG00000000003            TSPAN6      ENSG00000000003.13
ENSG00000000005 ENSG00000000005              TNMD        ENSG00000000005.5
</pre>
</pre>
 
</li>
=== Readings ===
<li>Clinical data
* [http://www.rna-seqblog.com/introduction-to-rna-sequencing-and-analysis/ Introduction to RNA Sequencing and Analysis] Kukurba KR, Montgomery SB. (2015) RNA Sequencing and Analysis. Cold Spring Harb Protoc
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/?tool=pubmed RNA-Seq: a revolutionary tool for transcriptomics]
* Alignment and visualization from [http://mygoblet.org/training-portal/materials/rna-seq-analysis-2014-module-2-alignment-and-visualization bioinformatics.ca].
* The [http://rnaseq.uoregon.edu/ RNA-seqlopedia] from the Cresko Lab of the University of Oregon.
* [http://slideplayer.com/slide/4463815/ RNA-Seq Analysis] by Simon Andrews.
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2261-8 Benchmarking differential expression analysis tools for RNA-Seq: normalization-based vs. log-ratio transformation-based methods] by Thomas P. Quinn 2018. Analysis scripts (R) are included.
 
=== Technical replicate vs biological replicate ===
If we sequence more DNA from the same DNA "library", the result is called a "technical replicate". If we perform a new experiment with a new organism/tissue/population of cells, the result is called a "biological replicate".
 
=== Youtube videos ===
* Analyze a public dataset by using Galaxy and IGV [http://www.youtube.com/watch?v=dTRZjXuQnYU&list=UULQ1j5ge-WYUE1iBEOGpK5A David Coil from UC Davis Genome Center]
Download the raw fastq data GSE19602 from [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19602 GEO] and uncompress fastq.bz2 to a fastq (~700MB) file. NOTE: the data downloaded from NCBI is actually in sra file format. We can use the fastq-dump program in the SRA Toolkit to convert sra to fastq. http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
<pre>
<pre>
~/Downloads/sratoolkit.2.3.2-ubuntu64/bin/fastq-dump ~/Downloads/SRR034580.sra
acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
dim(acc_clin)
# [1] 92 71
</pre>
</pre>
 
</li>
If we want to run Galaxy locally, we can install it simply by 2 command lines
<li>[https://www.rdocumentation.org/packages/TCGAbiolinks/versions/1.2.5/topics/TCGAanalyze_DEA TCGAanalyze_DEA()]. Differential Expression Analysis (DEA) using the '''edgeR''' package.
<pre>
<pre>
hg clone https://bitbucket.org/galaxy/galaxy-dist/
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
cd galaxy-dist
dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut =  0.25)
hg update stable
samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                      dataFilt[,samplesTP],"Normal", "Tumor")
# 2nd example
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                          Cond1type = "LGG", Cond2type = "GBM",
                          fdr.cut = 0.01,  logFC.cut = 1,
                          method = "glmLRT")
</pre>
</pre>
</li>
<li>Enrichment analysis
<pre>
ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM",
                                RegulonList = rownames(dataDEGs))


To run Galaxy locally, we do
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                        nRGTab = rownames(dataDEGs),
                        nBar = 20)
</pre>
</li>
<li>[https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/ mRNA Analysis Pipeline] from GDC documentation.
</li>
</ul>
* See the vignette in the [https://bioconductor.org/packages/release/workflows/html/SingscoreAMLMutations.html SingscoreAMLMutations] package.
* Papers
** [https://academic.oup.com/nar/article/44/8/e71/2465925 TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data]
** [https://f1000research.com/articles/5-1542 TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages]
** [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006701 New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx]
<ul>
<li>[https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html RangedSummarizedExperiment] class
<ul>
<li>assay() </li>
<li>colData() </li>
<li>rowData() </li>
<li>assayNames() </li>
<li>metadata() </li>
<pre>
<pre>
cd galaxy-dist
> dim(colData(ACCse))
sh run.sh --reload
[1] 79 72
> dim(rowData(ACCse))
[1] 19947    3
> dim(assay(ACCse))
[1] 19947    79
> assayNames(ACCse)
[1] "normalized_count"
> assayNames(ACCse2)
[1] "HTSeq - Counts"
> metadata(ACCse)
$data_release
[1] "Data Release 25.0 - July 22, 2020"
</pre>
</pre>
The command line will show ''Starting server in PID XXXX. serving on http://127.0.0.1:8080''. We can use Ctrl + C to stop the Galaxy process.
</ul>
<li>[https://rpubs.com/tiagochst/TCGAbiolinks_to_DESEq2 TCGAbiolinks to DESEq2]. My verified version (R 4.3.2 & Bioc ‘3.17’) available on [https://gist.github.com/arraytools/6e6142f6fabb31e54e188ea1fb0deeee Github].


Note: One problem with this simple instruction is we have not created a user yet.
</ul>


# Upload one fastq data. Click 'refresh' icon on the history panel to ensure the data is uploaded. Don't use 'refresh' button on the browser; it will cause an error for the current process.
=== curatedTCGAData ===
# '''FASTQ Groomer'''. Convert the data to the format Galaxy needs. NGS: QC and manipulation => Illumina FASTQ. FASTQ quality scores type: Sanger. (~10 minutes). This step is CPU-bound rather than memory-bound.
* [https://www.bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects]
# Open a new browser tab and go to ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/. Right click the file '''all.cDNA''' and copy the link location. In Galaxy, click 'Upload File from your computer' and paste the URL into the URL/Text entry.
** data access through ExperimentHub
# Scroll down in Galaxy and select NGS:Mapping -> '''Map with BWA'''. PS. For some reason, BWA was not available, so I used '''Bowtie2''' instead. The output of Bowtie2 is a bam file.
** provides data from the legacy Firehose pipeline
## For the reference genome, choose 'Use one from the history'. Galaxy automatically finds the reference file 'ftp://ftp.plantbiology....' in the history.
* [https://waldronlab.io/MultiAssayWorkshop/ Multi-omic Integration and Analysis of cBioPortal and TCGA data with MultiAssayExperiment] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
## Library mate-paired => Single-end.
<ul>
## FASTQ file => 2: FASTQ Groomer on data 1.
<li>[https://waldronlab.io/PublicDataResources/ Public data resources and Bioconductor] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
## BWA settings to use => Commonly Used.
<pre>
## Execute (~ 15 minutes)
library(curatedTCGAData)
# We can view the alignment file (sam format) created by BWA using UCSC or IGV (input is bam or bai format). We now use '''NGS: SAM Tools''' to convert the sam file to a bam file. Click the 'SAM-to-BAM converts SAM format to BAM format' tool.
library(MultiAssayExperiment)
## Choose the source for the reference list => History
curatedTCGAData(diseaseCode = "*", assays = "*")
## Converts SAM file => 4: Map with BWA on data 2 and data 3.
curatedTCGAData(diseaseCode = "ACC")
## Using reference file => 3:ftp://ftp.plantbiology.....
## Execute (~5 minutes)
# We want to create a bai file, which is an index that lets IGV jump straight to a region. It breaks the data into smaller accessible chunks, so when you pull up a certain cDNA, IGV goes straight to that subset. Go to the history and click the pencil icon (Edit Attributes) on the file '''SAM-to-BAM on data 3 and data 4'''.
## Look at the 'Convert to new format' section and click 'Convert' (< 1 minute). This will create another file.
## Use a browser to go to the FTP site and download the '''all.cDNA''' file to the desktop. The desktop should now contain 3 files for IGV: all.cDNA, rice.bam and rice.bai.
# Go to http://www.broadinstitute.org/software/igv/download to download IGV, which is a Java-based application. I needed to install Java first by installing openjdk-7-jdk. IGV launches the 'Human hg18' genome by default. Launch IGV with '''cd IGV_2.2.13;java -Xmx750m -jar igv.jar'''. I found that IGV requires sam+bai OR bam+bai as input, so we need to click the pencil icon to create the bai file before loading a sam or bam file into IGV.
## Go to File => Import Genome. Call it 'rice' and select the 'all.cDNA' sequence file. Click the 'Save' button.
## Go to File => Upload from File => rice.bam.
## Top right panel is cDNA
## The middle right panel has a lot of 'boxes', each of which is a read. If we zoom in, we can see that some reads point left (backward) while others point right (forward). On the top is a histogram. For example, if a base is covered by a lot of reads, the histogram shows a high frequency there.
## If we keep zooming in, we can see colors in the bottom right panel. Zooming in further, we can see the bases G, C, T, A themselves.
## Using IGV, we can 1. examine coverage.
## We can 2. check 'alternative splicing'. (not for this cDNA)
## We can 3. examine SNPs for base changes. Gray reads (dark gray is a high quality read, light gray is a low quality read) are perfect matches. Color marks a change, for example a read shows 'C' where the reference has 'A'. If a position has many high quality reads, half of which show 'G' while the reference genome shows 'A', it is most likely a SNP. This is heterozygosity.
# '''Tophat''' - align RNA seq data to genomic DNA
## Suppose we have used Galaxy to upload 2 datasets. One is SRR034580, and we have run FASTQ Groomer on data 1. The second is SRR034584, and we have also run FASTQ Groomer on data 2. We have also uploaded the reference genome sequence.
## Goto Galaxy and find NGS: RNA Analysis => Tophat.
## reference genome => Use one from the history
## RNA-Seq FASTQ file => 2; FASTQ Groomer on data 1.
## Execute. This will create 2 files: '''splice junctions''' and '''accepted_hits'''. We queue the job and run another Tophat with the 2nd 'groomer'ed data file. We are going to work on the '''accepted_hits''' file.
## While the queue is running, we can click on the 'pencil' icon on the 'accepted_hits' job and run the utility 'Convert to new format' (Bam to Bai). We should do this for both 'accepted_hits' files.
## For some reason, the execution failed: An error occurred with this dataset: TopHat v2.0.7 TopHat v2.0.7 Error indexing reference sequence /bin/sh: 1: bowtie-build: not found.
# '''Cufflinks'''. We will estimate transcript abundance by using FPKM (RPKM).
## SAM or BAM file of aligned RNA-Seq reads => tophat on data 2.. accepted_hits
## Use Reference Annotation - No (choose Yes if we want annotation. This requires GTF format. See http://genome.ucsc.edu/FAQ/FAQformat.html#format4. We don't have it for rice.)
## Execute. This will create 3 files. Gene expression, transcript expression and assembled transcripts.
## We also run Cufflinks for 2nd accepted_hits file. (~ 25 minutes)
# '''Cuffcompare'''. Compare one to each other.
## GTF file produced by Cufflinks => assembled transcript from the 1st data
## Use another GTF file produced by Cufflinks => Yes. It automatically finds the other one.
## Execute. (< 10 minutes). This will create 7 files: transcript accuracy, a tmap file and a refmap file from each set of assembled transcripts, combined transcripts and transcript tracking.
## We are interested in combined transcripts file (to use in Cuffdiff).
# '''Cuffdiff'''.
## Transcripts => combined transcripts.
## SAM or BAM file of aligned RNA-Seq reads => 1st accepted_hits
## SAM or BAM file of aligned RNA-Seq reads => 2nd accepted_hits
## Execute. This will generate 11 files. Isoform expression, gene expression, TSS groups expression, CDS Expression FPKM Tracking, isoform FPKM tracking, gene FPKM tracking, TSS groups FPKM tracking, CDS FPKM tracking, splicing diff, promoters diff, CDS diff. We are interested in 'gene expression' file. We can save it and open it in Excel.
# IGV - 2 RNA-Seq datasets aligned to '''genomic''' DNA using Tophat
## Load the reference genome rice (see above)
## Upload from file => rice4.bam. Upload from file => rice5.bam.
## Alternative RNA splicing.


=== edX course ===
ACCmae <- curatedTCGAData("ACC", c("RPPAArray", "RNASeq2GeneNorm"),
PH525.5x Case Study: RNA-seq data analysis. The course notes are forming a book. Check out https://github.com/genomicsclass/labs and http://genomicsclass.github.io/book/.
                          dry.run=FALSE)
ACCmae
dim(colData(ACCmae)) # 79 (samples) x 822 (features)


== Variant calling ==
head(metadata(colData(ACCmae))[["subtypes"]])
* Variants include (See p21 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery])
</pre>
** Germline SNPs & indels
</li>
** Somatic SNVs & indels
</ul>
** Somatic CNVs
* Caveats for working with TCGA data
** Not all TCGA samples are cancer; there is a mix of sample types in each of the 33 cancer types.
** Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data.
** There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates.
** Primary tumors should be selected using '''TCGAutils::TCGAsampleSelect''' and used as input to the subsetting mechanisms.


=== Overview/papers ===
=== [https://bioconductor.org/packages/release/bioc/html/caOmicsV.html caOmicsV] ===
* [http://www.nature.com/articles/srep17875 Systematic comparison of variant calling pipelines using gold standard personal exome variants] by Hwang 2015.
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA was used
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/ A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data] by Heng Li.
* http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration
* http://www.slideshare.net/thomaskeane/overview-of-methods-for-variant-calling-from-nextgeneration-sequence-data-9608507
* http://www.slideshare.net/jandot/nextgeneration-sequencing-variation-discovery
* [https://www.biorxiv.org/content/early/2017/09/23/192872 Strelka2: Fast and accurate variant calling for clinical sequencing applications] Sep 2017


=== Workflow ===
Visualize multi-dimensional cancer genomics data, including patient information, gene expression, DNA methylation, DNA copy number variation, and SNPs/mutations, in matrix layout or network layout.
* Sequence reads -> Quality control -> Mapping -> Mark Duplicates -> Local realignment around INDELS -> Base quality score recalibration -> Variant calling -> Filtering & Annotation -> Querying.
* http://www.ngscourse.org/Course_Materials/variant_calling/presentation/2015-Cambridge_variant_calling.pdf <span style="color: red">Lots of pictures to explain the concepts!!</span>
* [https://informatics.sydney.edu.au/training/coursedocs/RNASeqGalaxy_Camperdown_June2018.pdf Introduction to RNA-Seq on Galaxy Analysis for differential expression] Tracy Chew
* http://seqanswers.com/wiki/How-to/exome_analysis
** Alignment step includes
*** '''mark PCR duplicates''' some reads will be exact copies of each other. These reads are called PCR duplicates due to amplification biases in PCR. They share the same sequence and the same alignment position and could cause trouble during SNP calling as possibly some allele is overrepresented due to amplification biases. See [http://gatkforums.broadinstitute.org/gatk/discussion/6747/how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar#section3 Conceptual overview of duplicate flagging] and [http://wiki.bits.vib.be/index.php/Call_variants_with_samtools_1.0#Mark_duplicates_using_Picard_tools Call variants with samtools 1.0]
*** '''local alignment around indels''' Indels within reads often lead to false positive SNPs at the end of sequence reads. To prevent this artifact, local realignment around indels is done using local realignment tools from the Genome Analysis Tool Kit.
*** '''quality recalibration'''
** SNP calling step includes producing raw SNP calls and filtering SNPs
** Annotation
* WGS/WES Mapping to Variant Calls http://www.htslib.org/workflow/
Step 1: Mapping
<syntaxhighlight lang='bash'>
bwa index <ref.fa>


bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' <ref.fa> <read1.fa> <read2.fa> > lane.sam
=== [https://cran.r-project.org/web/packages/Map2NCBI/index.html Map2NCBI] ===
samtools fixmate -O bam <lane.sam> <lane_fixmate.bam>
The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).
samtools sort -O bam -o <lane_sorted.bam> -T </tmp/lane_temp> <lane_fixmate.bam>
</syntaxhighlight>
Step 2: Improvement
<syntaxhighlight lang='bash'>
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <lane.bam> \
    -o <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf>
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <lane.bam> \
    -targetIntervals <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf> -o <lane_realigned.bam>


java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R <ref.fa> \
<syntaxhighlight lang='rsplus'>
    -knownSites <bundle/b38/dbsnp_142.b38.vcf> -I <lane.bam> -o <lane_recal.table>
> library(Map2NCBI)
java -Xmx2g -jar GenomeAnalysisTK.jar -T PrintReads -R <ref.fa> -I <lane.bam> \
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
     --BQSR <lane_recal.table> -o <lane_recal.bam>
  # choose [2], [n], and [1] to filter the build and feature information.
 
  # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
java -Xmx2g -jar MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT \
> str(GeneList)
    INPUT=<lane_1.bam> INPUT=<lane_2.bam> INPUT=<lane_3.bam> OUTPUT=<library.bam>
'data.frame': 52157 obs. of  15 variables:
 
$ tax_id      : chr  "9606" "9606" "9606" "9606" ...
samtools merge <sample.bam> <library1.bam> <library2.bam> <library3.bam>
$ chromosome  : chr  "1" "1" "1" "1" ...
samtools index <sample.bam>
$ chr_start    : num  11874 14362 17369 30366 34611 ...
</syntaxhighlight>
$ chr_stop     : num  14409 29370 17436 30503 36081 ...
Step 3: Variant calling
$ chr_orient  : chr  "+" "-" "-" "+" ...
<syntaxhighlight lang='bash'>
$ contig      : chr  "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
samtools mpileup -ugf <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | \
$ ctg_start    : num  1874 4362 7369 20366 24611 ...
    bcftools call -vmO z -o <study.vcf.gz>
$ ctg_stop    : num  4409 19370 7436 20503 26081 ...
$ ctg_orient  : chr  "+" "-" "-" "+" ...
$ feature_name : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
$ feature_id  : chr  "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
$ feature_type : chr  "GENE" "GENE" "GENE" "GENE" ...
$ group_label  : chr  "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
$ transcript  : chr  "Assembly" "Assembly" "Assembly" "Assembly" ...
$ evidence_code: chr  "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]
</syntaxhighlight>
</syntaxhighlight>


=== Non-R Software ===
=== TCseq: Time course sequencing data analysis ===
Variant detector/discovery, genotyping
http://bioconductor.org/packages/devel/bioc/html/TCseq.html


* [http://www.biomedcentral.com/1471-2105/16/235# Evaluation of variant detection software for pooled next-generation sequence data]
=== UCSC Xena ===
* [http://bib.oxfordjournals.org/content/15/2/256.abstract A survey of tools for variant analysis of next-generation genome sequencing data]
* [https://xena.ucsc.edu/welcome-to-ucsc-xena/ Welcome to UCSC Xena], [https://github.com/ucscXena Source code]. Note that it should not be confused with [https://bioconductor.org/packages/release/bioc/html/Xeva.html Xeva], which is for analyzing PDX data.
* [https://github.com/ropensci/UCSCXenaTools UCSCXenaTools]
* [http://www.sciencedirect.com/science/article/pii/S0002929713003832 Reliable Identification of Genomic Variants from RNA-Seq Data]
* It was used by this tumor purity paper [https://academic.oup.com/bib/article/22/6/bbab163/6265216#312129108 Prediction of tumor purity from gene expression data using machine learning].
* [https://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html Bioinformatics at COMAV]


==== Variant Identification ====
=== RTCGA ===
* [http://samtools.sourceforge.net/ Samtools]
https://www.bioconductor.org/packages/release/bioc/html/RTCGA.html
** https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools
** http://petridishtalk.com/2013/02/01/variant-discovery-annotation-filtering-with-samtools-the-gatk/
** http://lab.loman.net/2012/10/31/a-simple-guide-to-variant-calling-with-bwa-samtools-varscan2/
** [http://ged.msu.edu/angus/tutorials-2013/snp_tutorial.html Samtools and GATK]
** [https://www.biostars.org/p/57149/ Difference Between Samtools And Gatk Algorithms]
* [https://www.broadinstitute.org/gatk/ GATK] by BROAD Institute.
* [http://varscan.sourceforge.net/ varscan2]
* [https://github.com/chapmanb/bcbio-nextgen bcbio] - Validated, scalable, community developed variant calling and RNA-seq analysis. [http://bcb.io/2015/09/17/hg38-validation/ Validated variant calling with human genome build 38].
* [https://github.com/ekg/freebayes freebayes] and [https://github.com/ekg/alignment-and-variant-calling-tutorial a complete tutorial] from alignment to variant calling.


GUI software
=== genefu ===
* [http://snver.sourceforge.net/snvergui/ SNVerGUI]
[https://bioconductor.org/packages/release/bioc/html/genefu.html Computation of Gene Expression-Based Signatures in Breast Cancer]


==== Variant Annotation ====
== GEO ==
See [[#Variant_Annotation_2|Variant Annotation]]
See the internal link at [[R#GEO_.28Gene_Expression_Omnibus.29|R-GEO]].


=== Bioconductor and R packages ===
[https://www.biorxiv.org/content/early/2018/05/19/326223 GREIN: An interactive web platform for re-analyzing GEO RNA-seq data]
* Bioconductor [http://www.bioconductor.org/packages/release/bioc/html/VariantTools.html VariantTools] package can export to VCF and [http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html VariantAnnotation] package can read VCF format files. See also the [http://bioconductor.org/help/workflows/variants/ Annotating Variants workflow] in Bioconductor. See also [http://bioconductor.org/packages/release/data/annotation/html/SIFT.Hsapiens.dbSNP137.html SIFT.Hsapiens.dbSNP137],  [http://bioconductor.org/packages/release/data/experiment/html/COSMIC.67.html COSMIC.67], and [http://bioconductor.org/packages/release/data/annotation/html/PolyPhen.Hsapiens.dbSNP131.html PolyPhen.Hsapiens.dbSNP131] packages.
* R packages [http://cran.r-project.org/web/packages/seqminer/index.html seqminer]


==== fundamental - VariantAnnotation ====
=== GEO2RNAseq ===
readVcf() can be used to read a *.vcf or *.vcf.gz file.
[https://www.biorxiv.org/content/10.1101/771063v1.full GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data]


==== dbSNP - SNPlocs.Hsapiens.dbSNP.20101109 ====
== Network-based ==
Compare the rs numbers of the data and dbSNP.
[https://www.nature.com/articles/s41598-022-19019-5 Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma] 2022


==== variant wrt genes - TxDb.Hsapiens.UCSC.hg19.knownGene ====
== Proteomics ==
Look for the column '''LOCATION''' which has values ''coding'', ''fiveUTR'', ''threeUTR'', ''intron'', ''intergenic'', ''spliceSite'', and ''promoter''.
* [https://www.bioconductor.org/packages/release/bioc/html/MatrixQCvis.html MatrixQCvis: Shiny-based interactive data-quality exploration for omics data]
* [https://www.bioconductor.org/help/course-materials/2016/BioC2016/ConcurrentWorkshops4/Gatto/Bioc2016.html R/Bioconductor tools for mass spectrometry-based proteomics]


==== amino acid change for the non-synonymous variants - BSgenome.Hsapiens.UCSC.hg19 ====
=== OlinkAnalyze ===
Look for the column '''CONSEQUENCE''' which has values ''synonymous'' or ''nonsynonymous''.
* https://cran.r-project.org/web/packages/OlinkAnalyze/index.html
* [https://github.com/arnav-mehta/covid19-proteomics/blob/main/20201023_COVID%20pos%20vs%20neg.R 20201023_COVID pos vs neg.R]


==== SIFT & PolyPhen for predicting the impact of amino acid substitution on a human protein - PolyPhen.Hsapiens.dbSNP131 ====
=== OlinkRPackage ===
Look for the column '''PREDICTION''' which has values ''possibly damaging'' or ''benign''.
* [https://github.com/Olink-Proteomics/OlinkRPackage OlinkRPackage], [https://cran.r-project.org/web/packages/OlinkAnalyze/index.html CRAN]
* [https://github.com/arnav-mehta/covid19-proteomics Plasma proteomics reveals tissue-specific cell death and mediators of cell-cell interactions in severe COVID-19 patients] 2021
* [https://www.olink.com/faq/what-is-npx/ What is NPX?] NPX, Normalized Protein eXpression, is Olink’s arbitrary unit which is in Log2 scale.
* [https://www.sciencedirect.com/science/article/pii/S2666379121001154#appsec2 Longitudinal proteomic analysis of severe COVID-19 reveals survival-associated signatures, tissue-specific cell death, and cell-cell interactions] Filbin 2021. Olink data is available [https://data.mendeley.com/datasets/nf853r8xsj/1 here].
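
Because NPX is on a log2 scale, a difference between two NPX values corresponds to a fold change of 2 raised to that difference. The following is a minimal sketch of that arithmetic only; the function name <code>npx_fold_change</code> and the NPX values are made up for illustration, and NPX remains a relative unit, not an absolute concentration.

```shell
# Fold change implied by a difference in NPX (log2 scale): 2^(npx_a - npx_b).
# npx_fold_change is a hypothetical helper, not part of any Olink tool.
npx_fold_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%g\n", 2 ^ (a - b) }'
}

npx_fold_change 6.5 5.5   # a difference of 1 NPX
npx_fold_change 7 4       # a difference of 3 NPX
```

A difference of 1 NPX therefore reads as roughly a doubling of the relative protein level.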


==== rentrez tutorial ====
== Mass spectrometry (MS)-based proteomics ==
https://ropensci.org/tutorials/rentrez_tutorial.html
* This is NOT RPPA (reverse phase protein arrays)
* Phosphoprotein 磷蛋白 -level
* MS-based proteomics has several advantages. It can analyze all the proteins in a system and is unbiased and hypothesis-free. MS methods are also ideally suited to discover and quantify post-translational modifications (PTMs) on proteins. MS-based targeted proteomic methods have increased specificity and multiplexing capabilities compared to classical immunological methods.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6731080/ Integration and Analysis of CPTAC Proteomics Data in the Context of Cancer Genomics in the cBioPortal] 2019
* https://proteomics.cancer.gov/data-portal
* [https://docs.cbioportal.org/user-guide/faq/#how-can-i-query-phosphoprotein-levels-in-the-portal How can I query phosphoprotein levels in the portal?]
* R/Bioconductor
** [https://bioconductor.org/packages/release/bioc/html/MSnbase.html MSnbase] from Bioconductor
** https://github.com/lgatto/RforProteomics, [https://lgatto.github.io/bioc-ms-prot/lab.html Bioconductor tools for mass spectrometry and proteomics]
** [https://support.bioconductor.org/p/134743/ Tutorial:Mass Spectrometry Data Analysis with the Spectra Package]
** [https://www.physalia-courses.org/courses-workshops/course58/ R/BIOCONDUCTOR FOR MASS SPECTROMETRY AND PROTEOMICS]
* An example to download data from cbioportal.org
** Go to https://www.cbioportal.org/datasets and search "CPTAC" through the search box or the find function in the browser. It returns 8 hits.
** For example, [https://www.cbioportal.org/study/summary?id=coad_cptac_2019 Colon Cancer (CPTAC-2 Prospective, Cell 2019)] contains several molecular profiles to download, including phosphoprotein site level expression.
** Click the download button to download the data, "coad_cptac_2019.tar.gz". For this case, "wc -l" shows 31263 rows for the phosphoprotein quantification data and 8067 rows for the protein quantification data.
<pre>
$ head -5 data_phosphoprotein_quantification.txt | cut -f 1-5
ENTITY_STABLE_ID GENE_SYMBOL PHOSPHOSITE 01CO005 01CO006
AAAS_pS495 AAAS pS495 NA -0.365
AAAS_pS525 AAAS pS525 NA NA
AAAS_pS541 AAAS pS541 -0.24 NA
AAED1_pS12 AAED1 pS12 -0.46 -0.424


==== Accessing and Manipulating Biological Databases Exercises ====
# R
http://www.r-exercises.com/2017/05/04/accessing-and-manipulating-biological-databases-exercises-part-2/?utm_source=rss&utm_medium=rss&utm_campaign=accessing-and-manipulating-biological-databases-exercises-part-2
> summary(x[, 4], na.rm = T)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   NA's
-5.776  -1.095  -0.608  -0.690  -0.213  2.383  20378
> summary(as.vector(as.matrix(x[, 4:5])), na.rm = T)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  -5.78  -0.81  -0.36  -0.43    0.00    3.13  36304
</pre>
[[File:Cbioportal cptac.png|350px]]
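
The NA counts reported by summary() above can also be checked from the shell before loading the file into R. This is a minimal sketch, assuming a tab-delimited file laid out like data_phosphoprotein_quantification.txt with sample columns starting at column 4; the toy file copies only the five lines shown above, and the variable names are illustrative.

```shell
# Toy copy of the first five lines shown above (tab-delimited).
toy=$(mktemp)
printf 'ENTITY_STABLE_ID\tGENE_SYMBOL\tPHOSPHOSITE\t01CO005\t01CO006\n' > "$toy"
printf 'AAAS_pS495\tAAAS\tpS495\tNA\t-0.365\n' >> "$toy"
printf 'AAAS_pS525\tAAAS\tpS525\tNA\tNA\n' >> "$toy"
printf 'AAAS_pS541\tAAAS\tpS541\t-0.24\tNA\n' >> "$toy"
printf 'AAED1_pS12\tAAED1\tpS12\t-0.46\t-0.424\n' >> "$toy"

# Count missing values (NA) per sample column, skipping the header row.
na_per_sample=$(awk -F'\t' '
  NR == 1 { for (i = 4; i <= NF; i++) name[i] = $i; ncol = NF; next }
  { for (i = 4; i <= NF; i++) if ($i == "NA") na[i]++ }
  END { for (i = 4; i <= ncol; i++) printf "%s\t%d\n", name[i], na[i] }
' "$toy")
echo "$na_per_sample"
rm -f "$toy"
```

On the real file, replace "$toy" with the path to the downloaded table.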


=== vcf ===
== Metabolomics Analysis ==  
==== [http://www.1000genomes.org/wiki/Analysis/vcf4.0 vcf format] ====
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9032224/ Guide to Metabolomics Analysis: A Bioinformatics Workflow]
* One row is one variant or one position/nucleotide in genome.
* Mapping quality is called MQ in the INFO column of a '''vcf''' file. In a '''sam''' file it is in the 5th column (the columns have no header) or in the MQ field of the last column.
** MQ denotes [https://en.wikipedia.org/wiki/Root_mean_square RMS] mapping quality in '''vcf'''.
** MapQ = -10 log10(P) where P = probability that this mapping is NOT the correct one.
** https://en.wikipedia.org/wiki/Phred_quality_score.  If Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000 (1/10^(MAPQ/10)=1/10^3)
** When MAPQ=0, it means that the read maps to two or more places with equal score. See [https://www.biostars.org/p/4015/ here].
** When MAPQ=255, it means the mapping quality is not available. See https://biostar.usegalaxy.org/p/4077/
** http://davetang.org/muse/2011/09/14/mapping-qualities/
** [http://drive5.com/usearch/manual/quality_score.html Fastq] file
* variant category (snp, insertion, deletion, mixed) http://www.1000genomes.org/node/101
* [http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it What is a VCF and how should I interpret it?] from broad.
* See also [[Anders2013#sam.2Fbam.2C_.22samtools_view.22_and_Rsamtools|SAM/BAM format]].
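
The Phred relation quoted in the list above, Q = -10 log10(P), can be inverted to get the error probability P = 10^(-Q/10). A small sketch of that arithmetic; the function name phred_to_prob is made up for this illustration.

```shell
# Convert a Phred-scaled quality Q into an error probability P = 10^(-Q/10).
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

p30=$(phred_to_prob 30)   # 1 in 1000
p20=$(phred_to_prob 20)   # 1 in 100
echo "Q30 -> $p30, Q20 -> $p20"
```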


Below is an example of each record/row
== Protein-protein interaction/PPI ==
{| class="wikitable"
<ul>
| 1
<li>https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
| #CHROM
<li>[https://www.r-bloggers.com/2012/06/obtaining-a-protein-protein-interaction-network-for-a-gene-list-in-r/ Obtaining a protein-protein interaction network for a gene list in R] and other related posts.
| 2
<li>[https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter11.html Protein-Protein Interaction Graphs]
|-
<li>[https://downloads.thebiogrid.org/BioGRID/ BioGRID], [https://onlinelibrary.wiley.com/doi/10.1002/pro.3978 The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions]
| 2
<pre>
| POS
$ wc -l BIOGRID-ALL-4.4.214.tab.txt                                                                              12:07:59
| 14370
2454000 BIOGRID-ALL-4.4.214.tab.txt
|-
</pre>
| 3
<pre>
| ID
R> x <- read.delim("BIOGRID-ALL-4.4.214.tab.txt", skip=35, nrows=10)
| rs6054257 or .
R> dim(x)
|-
[1] 10 11
| 4
R> x[1:2, ]
| REF
  INTERACTOR_A INTERACTOR_B OFFICIAL_SYMBOL_A OFFICIAL_SYMBOL_B
| G
1      ETG6416      ETG2318            MAP2K4              FLNC
|-
2    ETG84665        ETG88              MYPN            ACTN2
| 5
                                                      ALIASES_FOR_A
| ALT
1 JNKK|JNKK1|MAPKK4|MEK4|MKK4|PRKMK4|SAPKK-1|SAPKK1|SEK1|SERK1|SKK1
| A
2                                            CMD1DD|CMH22|MYOP|RCM4
|-
                            ALIASES_FOR_B EXPERIMENTAL_SYSTEM          SOURCE
| 6
1 ABP-280|ABP280A|ABPA|ABPL|FLN2|MFM5|MPD4          Two-hybrid  Marti A (1997)
| QUAL
2                                  CMD1AA          Two-hybrid  Bang ML (2001)
| 29
  PUBMED_ID ORGANISM_A_ID ORGANISM_B_ID
|-
1   9006895          9606          9606
| 7
2 11309420          9606          9606
| FILTER
</pre>
| PASS or .
<li>[https://academic.oup.com/bioinformatics/article/38/24/5390/6769888 Defining the extent of gene function using ROC curvature] 2022
|-
<li>https://string-db.org/, [https://bioconductor.org/packages/release/bioc/html/STRINGdb.html STRINGdb] R package. When using the web service, remember to
| 8
* minimum required interaction score: 0.90 instead of the default 0.40
| INFO
* hide disconnected nodes in the network
| NS=3;DP=14;AF=0.5;DB;H2
<li>Cytoscape
|-
* [https://zhuanlan.zhihu.com/p/111001011 Detailed guide to protein interaction analysis with the Cytoscape stringApp]
| opt
* [https://zhuanlan.zhihu.com/p/455182867 Making polished protein-protein interaction (PPI) network figures with the STRING website and Cytoscape]
| FORMAT
</ul>
| GT:AD:DP:GQ:PL
|-
| opt
| NA00001
| 0/1:3,2:5:34:34,0,65
|}
where the meanings of GT(genotype), AD(Allelic depths for the ref and alt alleles in the order listed), HQ (haplotype quality), GQ (genotype quality)... can be found in the header of the VCF file.
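
To make the pairing between the FORMAT column and a sample column concrete, the sketch below splits both on ':' and prints key=value pairs for the example record in the table above. It is only an illustration of the layout; the variable names are arbitrary.

```shell
# FORMAT keys and the matching per-sample values from the example record.
format='GT:AD:DP:GQ:PL'
sample='0/1:3,2:5:34:34,0,65'

# Split both strings on ':' and pair them up position by position.
decoded=$(awk -v f="$format" -v s="$sample" 'BEGIN {
  n = split(f, key, ":"); split(s, val, ":")
  for (i = 1; i <= n; i++) printf "%s=%s\n", key[i], val[i]
}')
echo "$decoded"
```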


==== Examples ====
== Drug-Drug Interactions ==
See the [https://github.com/arraytools/brb-seqtools/tree/master/testdata/GSE48215subset samtools and GATK output from GSE48215subset] data included in BRB-SeqTools.
[https://appsilon.com/drug-drug-interactions-r-shiny/ Understanding Drug-Drug Interactions Using R Shiny]


==== Count number of rows ====
= Journals =
<syntaxhighlight lang='bash'>
* Impact factor: [https://www.bioxbio.com/subject/medicine Top Journals in medicine]
grep -cv "^#" vcffile
* [https://www.scijournal.org/ SCI Journal]
</syntaxhighlight>


==== Filtering ====
== Biometrical Journal ==
* [https://www.biostars.org/p/44378/ Filtering A Sam File For Quality Scores]. For example, to filter out reads with MAPQ < 30 in SAM/BAM files:
* https://onlinelibrary.wiley.com/journal/15214036
<syntaxhighlight lang='bash'>
* [https://onlinelibrary.wiley.com/page/journal/15214036/homepage/forauthors.html Author's Guideline]
samtools view -b -q 30 input.bam > filtered.bam
* [https://onlinelibrary.wiley.com/results/global-subject-codes/st30?target=topic-title-results&startPage=&PubType=journal Biostatistics (topic) journals] from Wiley
</syntaxhighlight>
* To filter vcf based on MQ, use for example,
<syntaxhighlight lang='bash'>
bcftools filter -i"QUAL >= 20 && DP >= 5 && MQ >= 30" INPUTVCF > OUTPUTVCF
</syntaxhighlight>
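
When bcftools is not at hand, the same kind of threshold filter can be approximated with awk. This is a rough sketch, not a replacement for bcftools: it assumes QUAL is in column 6 and that MQ appears as a key=value entry in the INFO column (column 8). The toy records, the shortened INFO strings and the thresholds are chosen only for illustration.

```shell
# Toy VCF with shortened INFO fields.
toy=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t10285\t.\tT\tC\t19.3175\t.\tDP=4;MQ=22\n'
  printf '1\t16257\t.\tG\tC\t24.566\t.\tDP=5;MQ=26\n'
  printf '1\t16571\t.\tG\tA\t12.4197\t.\tDP=7;MQ=8\n'
} > "$toy"

# Keep header lines, plus records with QUAL >= min_qual and INFO/MQ >= min_mq.
kept=$(awk -F'\t' -v min_qual=20 -v min_mq=20 '
  /^#/ { print; next }
  {
    mq = ""
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++) if (kv[i] ~ /^MQ=/) mq = substr(kv[i], 4)
    if (mq != "" && $6 + 0 >= min_qual && mq + 0 >= min_mq) print
  }
' "$toy")
echo "$kept"
rm -f "$toy"
```

Records without an MQ entry are dropped here; whether that is the desired behaviour depends on the caller.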


==== Open with LibreOffice Calc ====
== [https://academic.oup.com/biostatistics/issue Biostatistics] ==
Use delimiter 'Tab' only and uncheck the others.


Calc will then show the correct header: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (separated by ;), FORMAT (separated by :), SampleID (separated by :), where the INFO field contains site-level annotations and the FORMAT and SampleID fields contain sample-level annotations.
== [https://academic.oup.com/bioinformatics Bioinformatics] ==
[https://academic.oup.com/bioinformatics/search-results?f_TocHeadingTitle=GENOME%20ANALYSIS Genome Analysis] section


==== Count number of records, snps and indels ====
== [https://bmcbioinformatics.biomedcentral.com/ BMC Bioinformatics] ==
* [http://vcftools.sourceforge.net/documentation.html#file vcftools].
* https://www.biostars.org/p/49386/, https://www.biostars.org/p/109086/, https://www.biostars.org/p/50923/, https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools.


* Summary statistics
== [https://www.biorxiv.org/ BioRxiv] ==
<syntaxhighlight lang='bash'>
bcftools stats XXX.vcf


# SN, Summary numbers:
== PLOS ==
# SN    [2]id  [3]key  [4]value
SN      0      number of samples:      0
SN      0      number of records:      71712
SN      0      number of no-ALTs:      0
SN      0      number of SNPs: 65950
SN      0      number of MNPs: 0
SN      0      number of indels:      5762
SN      0      number of others:      0
SN      0      number of multiallelic sites:  0
SN      0      number of multiallelic SNP sites:      0
</syntaxhighlight>
* Remove the header
<syntaxhighlight lang='bash'>
grep -v "^#" input.vcf > output.vcf 
</syntaxhighlight>
* Count number of records
<syntaxhighlight lang='bash'>
grep -c -v "^#" XXX.vcf  # -c means count, -v is invert search
</syntaxhighlight>
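
For a quick look without bcftools, records can be classified by the lengths of REF and ALT. The sketch below is a simplification and makes assumptions that bcftools stats does not: each record has a single ALT allele, and anything that is not a one-base substitution is counted as an indel (so MNPs and multiallelic sites would be misclassified). The toy records are invented for the example.

```shell
# Toy VCF: two substitutions, one deletion, one insertion.
toy=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t50\t.\t.\n'
  printf '1\t200\t.\tAT\tA\t50\t.\t.\n'
  printf '1\t300\t.\tC\tCTT\t50\t.\t.\n'
  printf '1\t400\t.\tG\tT\t50\t.\t.\n'
} > "$toy"

# Skip header lines; REF is column 4 and ALT is column 5.
counts=$(awk -F'\t' '
  /^#/ { next }
  { n++; if (length($4) == 1 && length($5) == 1) snp++; else indel++ }
  END { printf "records=%d snps=%d indels=%d\n", n, snp, indel }
' "$toy")
echo "$counts"
rm -f "$toy"
```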


==== [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php VariantsToTable] tool from Broad GATK ====
== MDPI ==
* https://www.broadinstitute.org/gatk/blog?id=7089
https://en.wikipedia.org/wiki/MDPI, [https://twitter.com/MKarhulahti/status/1723296939649683689 All MDPI, Frontiers & Hindawi] journals planned to be erased (level 0) from Finnish academic assessment by end of 2024.


==== [https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html VariantAnnotation] package ====
= Software =
* [http://stackoverflow.com/questions/21598212/extract-sample-data-from-vcf-files readVcf(), ScanVcfParam() and readInfo()] functions.
== BRB-SeqTools ==
* [https://www.bioconductor.org/help/workflows/variants/ scanVcfHeader() and scanVcfParam()] functions.
https://brb.nci.nih.gov/seqtools/


To install the R package, first we need to install required software in Linux
== [http://mev.tm4.org/#/welcome WebMeV] ==
<syntaxhighlight lang='bash'>
* [http://cancerres.aacrjournals.org/content/77/21/e11 WebMeV: A Cloud Platform for Analyzing and Visualizing Cancer Genomic Data]
sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev
</syntaxhighlight>
and then
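<syntaxhighlight lang='rsplus'>
# biocLite() has been retired; current Bioconductor releases install packages with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("VariantAnnotation")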


library(VariantAnnotation)
== GeneSpring ==
vcf <- readVcf("filename.vcf", "hg19")
RNA-Seq
# Header
header(vcf)


# Sample names
== CCBR Exome Pipeliner ==
samples(header(vcf))
https://ccbr.github.io/Pipeliner/


# Geno
== Tibanna ==
geno(header(vcf))
[https://data.4dnucleome.org/ Tibanna] helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.


# Genomic positions
== [https://github.com/PMBio/MOFA MOFA]: Multi-Omics Factor Analysis ==
head(rowRanges(vcf), 3)
ref(vcf)[1:5]
# Variant quality
qual(vcf)[1:5]
alt(vcf)[1:5]


# INFO
== WGCNA ==
info(vcf)[1:3, ]
* It uses a network method to create a distance matrix for genes. Hierarchical clustering can then be applied to that matrix to create groups/modules/clusters of genes.
# Get the Depth (DP)
* [https://en.wikipedia.org/wiki/Weighted_correlation_network_analysis Weighted gene '''co-expression''' network analysis]
hist(info(vcf)$DP, xlab="DP")
* [https://youtu.be/OpWEHazyQLA Webinar #7 – Introduction to Weighted Gene Co-expression Network Analysis] (video)
summary(info(vcf)$DP)
# Get the Mapping quality (MQ/MapQ)
hist(info(vcf)$MQ, xlab="MQ")


</syntaxhighlight>
== Benchmarking ==
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1738-8 Essential guidelines for computational method benchmarking]


Example:  
= Simulation =
* [https://www.biostars.org/p/128762/ NGS reads simulation]


We first show how to use the Linux command line to explore the VCF file. Then we show how to use the VariantAnnotation package.
== Simulate RNA-Seq ==
<syntaxhighlight lang='bash'>
* http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#RNA-Seq_simulators
# The vcf file is created from running samtools on SRR1656687.fastq
* https://popmodels.cancercontrol.cancer.gov/gsr/packages/


nskip=$(grep "^#" ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf | wc -l)
=== [http://maq.sourceforge.net Maq] ===
echo $nskip  # 53
Used by [https://academic.oup.com/bioinformatics/article/25/9/1105/203994/TopHat-discovering-splice-junctions-with-RNA-Seq TopHat: discovering splice junctions with RNA-Seq]
awk 'NR==53, NR==53' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam


nall=$(cat ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf | wc -l)
=== BEERS/Grant G.R. 2011 ===
echo $nall  # 302174
http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called [http://cbil.upenn.edu/BEERS/ BEERS] and it was used in the [https://academic.oup.com/bioinformatics/article/29/1/15/272537/STAR-ultrafast-universal-RNA-seq-aligner STAR] software paper.


# Generate <variantsOnly.vcf> which contains only the variant body
For the command line options of <'''reads_simulator.pl'''> and more details about the config files that are needed/prepared by BEERS, see [https://gist.github.com/arraytools/dd62bcca60cc36a1d1769d1a4a7d226b this gist].
awk 'NR==54, NR==302174' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf > variantsOnly.vcf


# Generate <infoField.txt> which contains only the INFO field
This can generate paired-end data, but the reads are all written to one FASTA file.
cut -f 8 variantsOnly.vcf > infoField.txt


# Generate the final product
<syntaxhighlight lang='bash'>
head -n 1 infoField.txt
$ sudo apt-get install cpanminus
DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22
$ sudo cpanm Math::Random
head -n 1 infoField.txt | cut -f1- -d ";" --output-delimiter " "
$ wget http://cbil.upenn.edu/BEERS/beers.tar
## Method 1: use cut
$
### Problem: the result is not a table since not all fields appear in all variants.
$ tar -xvf beers.tar      # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
cut -f1- -d ";" --output-delimiter=$'\t' infoField.txt > infoFieldTable.txt
$
## Method 2: use sed
$ cd ~/Downloads/
### Problem: same as Method 1.  
$ mkdir beers_output 
sed 's/;/\t/g' infoField.txt > infoFieldTable2.txt
$ mkdir beers_simulator_refseq && cd "$_"
### OR if we want to replace the original file
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
sed -i 's/;/\t/g' infoField.txt
$ tar xzvf simulator_config_refseq.tar.gz
$ ls -lth
total 1.4G
-rw-r--r-- 1 brb brb  44M Sep 16  2010 simulator_config_featurequantifications_refseq
-rw-r--r-- 1 brb brb 7.7M Sep 15  2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15  2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15  2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
  -configstem refseq \
  -customcfgdir ~/Downloads/beers_simulator_refseq \
  -outdir ~/Downloads/beers_output


$ head -n 3 infoField.txt
$ ls -lh beers_output
DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22
total 3.9M
DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
DP=5;VDB=0.0481133;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=1,0,2,1;MQ=26
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
 
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
$ grep "^##INFO=<ID=" BMBC2_liver3_IMPACT_raw.vcf
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
-rw-r--r-- 1 brb brb  31K Mar 16 15:25 simulated_reads_testbeers.cig
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
-rw-r--r-- 1 brb brb  22K Mar 16 15:25 simulated_reads_testbeers.fa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
-rw-r--r-- 1 brb brb  584 Mar 16 15:25 simulated_reads_testbeers.log
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
odroid@odroid:~/Downloads$
 
$ head -n2 variantsOnly.vcf
1 10285 . T C 19.3175 . DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22 GT:PL 1/1:46,2,0
1 10333 . C T 14.9851 . DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11 GT:PL 1/1:42,12,0


$ awk 'NR==53, NR==58' ~/Downloads/BMBC2_liver3_IMPACT_raw.vcf
$ wc -l simulated_reads2genes_testbeers.txt
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT dedup.bam
102 simulated_reads2genes_testbeers.txt
1 10285 . T C 19.3175 . DP=4;VDB=0.64;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,0,2;MQ=22 GT:PL 1/1:46,2,0
$ head -4 simulated_reads2genes_testbeers.txt
1 10333 . C T 14.9851 . DP=6;VDB=0.325692;SGB=-0.556411;MQ0F=0.333333;AC=2;AN=2;DP4=0,0,0,4;MQ=11 GT:PL 1/1:42,12,0
seq.1 GENE.5600
1 16257 . G C 24.566 . DP=5;VDB=0.0481133;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=1,0,2,1;MQ=26 GT:PL 0/1:57,0,10
seq.2 GENE.35506
1 16534 . C T 13.9287 . DP=5;VDB=0.87268;SGB=-0.556411;MQ0F=0.4;AC=2;AN=2;DP4=0,0,4,0;MQ=11 GT:PL 1/1:41,12,0
seq.3 GENE.506
1 16571 . G A 12.4197 . DP=7;VDB=0.560696;SGB=-0.556411;RPB=1;MQB=1;MQSB=0;BQB=1;MQ0F=0.714286;AC=2;AN=2;DP4=1,0,2,2;MQ=8 GT:PL 1/1:39,8,0
seq.4 GENE.34922
odroid@odroid:~/Downloads$
$ tail -4 simulated_reads2genes_testbeers.txt
</syntaxhighlight>
seq.97 GENE.4197
As you can see it is not easy to create a table from the INFO field.  
seq.98 GENE.8763
 
seq.99 GENE.19573
Now we use the VariantAnnotation package.
seq.100 GENE.18830
<syntaxhighlight lang='rsplus'>
$ wc -l simulated_reads_indels_testbeers.txt
> library("VariantAnnotation")
36131 simulated_reads_indels_testbeers.txt
> v=readVcf("BMBC2_liver3_IMPACT_raw.vcf", "hg19")
$ head -2 simulated_reads_indels_testbeers.txt
> v
chr1:6052304-6052531 25 1 G
class: CollapsedVCF
chr2:73899436-73899622 141 3 ATA
dim: 302121 1
$ tail -2 simulated_reads_indels_testbeers.txt
rowRanges(vcf):
chr4:68619532-68621804 1298 -2 AA
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
chr21:32554738-32554962 174 1 T
info(vcf):
$ wc -l simulated_reads_substitutions_testbeers.txt
  DataFrame with 17 columns: INDEL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, ...
71678 simulated_reads_substitutions_testbeers.txt
info(header(vcf)):
$ head -2 simulated_reads_substitutions_testbeers.txt  
        Number Type    Description                                           
chr22:50902963-50903167 50903077 G->A
  INDEL 0      Flag    Indicates that the variant is an INDEL.              
chr1:6052304-6052531 6052330 G->C
  IDV  1      Integer Maximum number of reads supporting an indel           
$ wc -l simulated_reads_junctions-crossed_testbeers.txt
  IMF  1      Float  Maximum fraction of reads supporting an indel         
49   simulated_reads_junctions-crossed_testbeers.txt
  DP    1      Integer Raw read depth                                       
$ head -2 simulated_reads_junctions-crossed_testbeers.txt
  VDB  1      Float  Variant Distance Bias for filtering splice-site arte...
seq.1a chrX:49084601-49084713
  RPB  1      Float  Mann-Whitney U test of Read Position Bias (bigger is...
seq.1b chrX:49084909-49086682
  MQB  1      Float  Mann-Whitney U test of Mapping Quality Bias (bigger ...
  BQB  1      Float  Mann-Whitney U test of Base Quality Bias (bigger is ...
  MQSB  1      Float  Mann-Whitney U test of Mapping Quality vs Strand Bia...
  SGB  1      Float  Segregation based metric.                            
  MQ0F  1      Float  Fraction of MQ0 reads (smaller is better)             
  ICB  1      Float  Inbreeding Coefficient Binomial test (bigger is better)
  HOB  1      Float  Bias in the number of HOMs number (smaller is better) 
  AC    A      Integer Allele count in genotypes for each ALT allele, in th...
  AN    1      Integer Total number of alleles in called genotypes           
  DP4  4      Integer Number of high-quality ref-forward , ref-reverse, al...
  MQ    1      Integer Average mapping quality                               
geno(vcf):
  SimpleList of length 2: GT, PL
geno(header(vcf)):
      Number Type    Description                             
  GT 1     String  Genotype                               
  PL G     Integer List of Phred-scaled genotype likelihoods
>
> header(v)
class: VCFHeader
samples(1): dedup.bam
meta(2): META contig
fixed(2): FILTER ALT
info(17): INDEL IDV ... DP4 MQ
geno(2): GT PL
>
> dim(info(v))
[1] 302121    17
> info(v)[1:4, ]
DataFrame with 4 rows and 17 columns
                INDEL      IDV      IMF        DP      VDB      RPB
            <logical> <integer> <numeric> <integer> <numeric> <numeric>
1:10285_T/C    FALSE        NA        NA        4 0.6400000        1
1:10333_C/T     FALSE        NA        NA        6 0.3256920        NA
1:16257_G/C    FALSE        NA        NA        5 0.0481133        1
1:16534_C/T    FALSE        NA        NA        5 0.8726800        NA
                  MQB      BQB      MQSB      SGB      MQ0F      ICB
            <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
1:10285_T/C        1        1        NA -0.453602  0.000000        NA
1:10333_C/T        NA        NA        NA -0.556411 0.333333        NA
1:16257_G/C        1        1        1 -0.511536  0.000000        1
1:16534_C/T        NA        NA        NA -0.556411  0.400000        NA
                  HOB            AC        AN          DP4        MQ
            <numeric> <IntegerList> <integer> <IntegerList> <integer>
1:10285_T/C        NA            2        2    0,1,0,...        22
1:10333_C/T        NA            2        2    0,0,0,...        11
1:16257_G/C      0.5            1        2    1,0,2,...        26
1:16534_C/T        NA            2        2    0,0,4,...        11
> write.table(info(v)[1:4,], file="infoFieldTable3.txt", sep="\t", quote=F)
> head(rowRanges(v), 3)
GRanges object with 3 ranges and 5 metadata columns:
              seqnames        ranges strand | paramRangeID            REF
                <Rle>     <IRanges>  <Rle> |    <factor> <DNAStringSet>
  1:10285_T/C        1 [10285, 10285]      * |        <NA>              T
  1:10333_C/T        1 [10333, 10333]      * |        <NA>              C
  1:16257_G/C        1 [16257, 16257]      * |        <NA>              G
                            ALT      QUAL      FILTER
              <DNAStringSetList> <numeric> <character>
  1:10285_T/C                  C   19.3175          .
  1:10333_C/T                  T  14.9851          .
   1:16257_G/C                  C  24.5660          .
  -------
  seqinfo: 25 sequences from hg19 genome
> qual(v)[1:5]
[1] 19.3175 14.9851 24.5660 13.9287 12.4197
> alt(v)[1:5]
DNAStringSetList of length 5
[[1]] C
[[2]] T
[[3]] C
[[4]] T
[[5]] A
 
> q('no')
</syntaxhighlight>
 
==== [https://cran.r-project.org/web/packages/vcfR/index.html vcfR] package ====
 
==== Filtering strategy ====
* https://support.bioconductor.org/p/62817/ using '''QUAL''' and '''DP'''.
 
Note that the '''QUAL''' (variant quality score) values can become very large. For example, in GSE48215, samtools gives QUAL a range of [3, 228] but GATK gives a range of [3, 10042]. [http://gatkforums.broadinstitute.org/wdl/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it This post] says '''QUAL is not often a very useful property for evaluating the quality of a variant call'''.
<syntaxhighlight lang='rsplus'>
# vcfs is the VCF file produced by samtools from GSE48215subset
# vcfg is the VCF file produced by GATK from GSE48215subset
> summary(vcfs)
      Length        Class        Mode
        1604 CollapsedVCF          S4
> summary(vcfg)
      Length        Class        Mode
        868 CollapsedVCF          S4
 
> summary(qual(vcfs))
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.013  10.580  23.650  45.510  45.400 228.000
> summary(qual(vcfg))
    Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
    3.98    21.77    47.74  293.90  116.10 10040.00
 
> range(info(vcfs)$DP)
[1]  1 336
> range(info(vcfg)$DP)
[1]  1 317
 
> range(info(vcfs)$MQ)
[1]  4 60
> range(info(vcfg)$MQ)
[1] 21.00 65.19
 
# vcfg2 is the VCF file produced by GATK from the full GSE48215 data
> vcfg2 <- readVcf("~/GSE48215/outputgcli/bt20_raw.vcf", "hg19")
> summary(vcfg2)
      Length        Class        Mode
      506851 CollapsedVCF          S4
> range(qual(vcfg2))
[1]    0.00 73662.77
> summary(qual(vcfg2))
    Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
    0.00    21.77    21.77  253.70    62.74 73660.00


> range(info(vcfg2)$DP)
$ cat beers_output/simulated_reads_testbeers.log
[1]    0 3315
Simulator run: 'testbeers'
> summary(info(vcfg2)$DP)
started: Thu Mar 16 15:25:39 EDT 2017
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
num reads: 100
  0.00    2.00    2.00  14.04    4.00 3315.00
readlength: 100
> range(info(vcfg2)$MQ)
substitution frequency: 0.001
[1] NaN NaN
indel frequency: 0.0005
> range(info(vcfg2)$MQ, na.rm=T)
base error: 0.005
[1] 20 70
low quality tail length: 10
</syntaxhighlight>
percent of tails that are low quality: 0
 
quality of low qulaity tails: 0.8
* VAF (variant allele frequency) and the '''AD''' field ('''allele depth''', i.e. the number of reads; two values for each '''genotype''' per sample, '''reference:alternative''').
percent of alt splice forms: 0.2
 
number of alt splice forms per gene: 2
These two values will usually, but not always sum to the DP value. '''Reads that are not used for calling are not counted in the DP measure, but are included in AD'''. See [https://www.biostars.org/p/15675/ Question: Understanding Vcf File Format]
stem: refseq
 
sum of gene counts: 3,886,863,063
An example
sum of intron counts = 1,304,815,198
<pre>
sum of intron counts = 2,365,472,596
GT:AD:DP:GQ:PL 0/1:1,2:3:30:67,0,30
intron frequency: 0.355507598105262
</pre>
padded intron frequency: 0.52453796437909
 
finished at Thu Mar 16 15:25:58 EDT 2017
R code to compute VAF by using AD information only (AD = 1,2 in this case)
<syntaxhighlight lang='rsplus'>
require(VariantAnnotation)
vcf <- readVcf(file, refg)
vaf <- sapply(geno(vcf)$AD, function(x) x[2]/sum(x))
</syntaxhighlight>


==== extract allele frequency ====
$ wc -l simulated_reads_testbeers.fa
https://en.wikipedia.org/wiki/Minor_allele_frequency
400 simulated_reads_testbeers.fa
$ head simulated_reads_testbeers.fa
>seq.1a
CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
>seq.1b
GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
>seq.2a
GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
>seq.2b
ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
>seq.3a
CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC


Calculate allele frequency from vcf files.
# Take a look at the true coordinates
* https://gatkforums.broadinstitute.org/gatk/discussion/6202/vcf-file-and-allele-frequency. In the code below, vaf is calculated from AD and DP was obtained directly from DP field.
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
<syntaxhighlight lang='rsplus'>
chrX 49084529 49084601 +
readvcf2 <- function(file, refg="hg19") {
chrX 49084713 49084739 +
  vcf <- readVcf(file, refg)
chrX 49084863 49084909 +
  if (!is.null(geno(vcf)$DP) && ncol(geno(vcf)$DP) > 1)  stop("More than one sample were found!")
chrX 49086682 49086734 +
  # http://gatkforums.broadinstitute.org/gatk/discussion/6202/vcf-file-and-allele-frequency
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
  # READ AD
seq.1a chrX 49084529 73M111N27M 49084529-49084601, 49084713-49084739 + CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
  if ("AD" %in% names(geno(vcf))) {
seq.1b chrX 49084863 47M1772N53M 49084863-49084909, 49086682-49086734 - GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
    # note it is possible DP does not exist but AD exists (mutect2 output)
seq.2a chr1 183516256 100M 183516256-183516355 - GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
    # vaf <- sapply(geno(vcf)$AD, function(x) x[2]) / geno(vcf)$DP
seq.2b chr1 183515275 100M 183515275-183515374 + ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
    # vaf <- vaf[, 1]
$ wc -l simulated_reads_testbeers.fa
    vaf <- sapply(geno(vcf)$AD, function(x) x[2]/sum(x))
400 simulated_reads_testbeers.fa
  } else {
$ wc -l simulated_reads_testbeers.bed
    vaf <- NULL
247 simulated_reads_testbeers.bed
  }
$ wc -l simulated_reads_testbeers.cig
  # READ DP
200 simulated_reads_testbeers.cig
  if (is.null(geno(vcf)$DP) ||
      (is.na(geno(vcf)$DP[1]) && is.null(info(vcf)$DP)) ||
      (is.na(geno(vcf)$DP[1]) && is.na(info(vcf)$DP[1]))) {
    cat("Warning: no DP information can be retrieved!\n")
    dp <- NULL
  } else {
    if (!is.null(geno(vcf)$DP) && !is.na(geno(vcf)$DP[1])) {
      dp <- geno(vcf)$DP
      if (is.matrix(dp))  dp <- dp[, 1]
    } else {
      dp <- info(vcf)$DP
    }
  }
  if (! "chr" %in% substr(unique(seqnames(vcf)), 1, 3)) {
    chr <- paste('chr', as.character(seqnames(vcf)), sep='')
  } else {
    chr <- as.character(seqnames(vcf))
  }
  strt <- paste(chr, start(vcf)) # ?BiocGenerics::start
  return(list(dp=dp, vaf=vaf, start=strt, chr=unique(as.character(seqnames(vcf)))))
}
</syntaxhighlight>
</syntaxhighlight>
* http://samtools.github.io/hts-specs/VCFv4.1.pdf, http://samtools.github.io/hts-specs/VCFv4.2.pdf
* [https://vcftools.github.io/documentation.html vcftools]
<pre>
vcftools --vcf file.vcf --freq --out output # No DP, No AD. No useful information
vcftools --vcf file.vcf --freq -c > output  # same as above
</pre>
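
Since the vcftools --freq output above does not use AD, the AD-based VAF (alternative depth divided by the sum of the allele depths) can also be sketched directly on the command line. This is a minimal illustration that assumes a single-sample record with a FORMAT string and a matching sample string; the function name vaf_from_ad is made up for this example.

```shell
# Print VAF = AD[2] / sum(AD) for one FORMAT/sample pair, or NA if AD is absent.
vaf_from_ad() {
  # $1 = FORMAT string, $2 = sample string
  awk -v f="$1" -v s="$2" 'BEGIN {
    n = split(f, key, ":"); split(s, val, ":")
    for (i = 1; i <= n; i++) if (key[i] == "AD") ad = val[i]
    if (ad == "") { print "NA"; exit }
    m = split(ad, depth, ",")
    total = 0
    for (j = 1; j <= m; j++) total += depth[j]
    if (total == 0) print "NA"; else printf "%.4f\n", depth[2] / total
  }'
}

vaf=$(vaf_from_ad 'GT:AD:DP:GQ:PL' '0/1:1,2:3:30:67,0,30')   # AD = 1,2
no_ad=$(vaf_from_ad 'GT:PL' '1/1:46,2,0')                    # no AD field
echo "$vaf $no_ad"
```

With AD = 1,2 the result is 2/3, matching the x[2]/sum(x) expression in the R code.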


==== How can I extract only insertions from a VCF file? ====
=== [http://sammeth.net/confluence/display/SIM/Home Flux] Sammeth 2010 ===
https://bioinformatics.stackexchange.com/questions/769/how-can-i-extract-only-insertions-from-a-vcf-file


==== subset vcf file ====
=== [http://www.ebi.ac.uk/goldman-srv/simNGS/ SimNGS] ===
Using [[#htslib:_bgzip_and_tabix|tabix]].


==== Adding/removing 'chr' to/from vcf files ====
=== [http://cran.r-project.org/web/packages/SimSeq/index.html SimSeq]  ===
https://www.biostars.org/p/98582/
[http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124.abstract Bioinformatics]


<pre>
A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf # Not enough


awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*)/,m)) print m[1]"chr"m[2]; else print $0 }' no_chr.vcf > with_chr.vcf
=== [http://cran.r-project.org/web/packages/empiricalFDR.DESeq2/index.html empiricalFDR.DESeq2] ===
</pre>
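The second awk command relies on the three-argument form of match(), which is a gawk extension, so it may fail under mawk or BusyBox awk. A portable sketch of the same idea, using only sub(), could look like the following. The function name add_chr is hypothetical, and special cases such as MT vs chrM are not handled.

```shell
# Hypothetical helper: add a 'chr' prefix to data lines and ##contig headers.
# Uses only POSIX awk features (no gawk-specific match() with an array).
add_chr() {
  awk '
    /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=chr"); print; next }
    /^#/             { print; next }       # other header lines are unchanged
                     { print "chr" $0 }    # data lines: prefix the CHROM field
  ' "$@"
}
```

Usage: add_chr no_chr.vcf > with_chr.vcf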
http://biorxiv.org/content/early/2014/12/05/012211


=== SAMtools (samtools, bcftools, htslib) ===
The key function is '''simulateCounts''', which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.  
* http://samtools.sourceforge.net/mpileup.shtml
* http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html. Li's paper is available in [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/ NCBI PMC].
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/ A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data] by Heng Li.
* https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf


<syntaxhighlight lang='bash'>
Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.
export seqtools_samtools_PATH=/opt/SeqTools/bin/samtools-1.3:/opt/SeqTools/bin/samtools-1.3/misc
export PATH=$seqtools_samtools_PATH:$PATH


samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted
=== [http://www.bioconductor.org/packages/release/bioc/html/polyester.html polyester] ===
samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai
http://biorxiv.org/content/early/2014/12/05/012211
samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \
            TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
</syntaxhighlight>
where '-u' in mpileup means uncompressed output and '-f' gives the faidx-indexed reference sequence file. Without the pipe, the output of samtools mpileup is a BCF file, which can be viewed with the 'bcftools view XXX.bcf | more' command.


Note that the ''samtools mpileup'' can be used in different ways. For example,
Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.  
<syntaxhighlight lang='bash'>
samtools mpileup -f XXX.fa XXX.bam > XXX.mpileup
samtools mpileup -v -u -f XXX.fa XXX.bam > XXX.vcf
samtools mpileup -g -f XXX.fa XXX.bam > XXX.bcf
</syntaxhighlight>


==== [http://samtools.github.io/bcftools/bcftools.html bcftools] ====
'''Input''': reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.
bcftools: utilities for variant calling and for manipulating VCFs and their binary counterparts, BCFs. bcftools used to be one of the components of the '''SAMtools''' software; it is now distributed separately (see http://www.htslib.org/download/).


Installation
'''Output''': FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.
<syntaxhighlight lang="bash">
wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2
sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/
cd /opt/RNA-Seq/bin/bcftools-1.2/
sudo make # create bcftools, plot-vcfstats, vcfutils.pl commands
</syntaxhighlight>


Example: Add or remove or update annotations. http://samtools.github.io/bcftools/bcftools.html
Too many dependencies. <strike>Got an error in installation.</strike> It does not seem to consider splice junctions.
<syntaxhighlight lang="bash">
# Remove three fields
bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz


# Remove all INFO fields and all FORMAT fields except for GT and PL
=== seqgendiff ===
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3450-9 Data-based RNA-seq simulations by binomial thinning]


# Add ID, QUAL and INFO/TAG, not replacing TAG if already present
== Simulate DNA-Seq ==
bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
* Software list - https://popmodels.cancercontrol.cancer.gov/gsr/packages/
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/ A comparison of tools for the simulation of genomic next-generation sequencing data] Merly Escalona 2016


# Update 'ID' column in VCF file, https://www.biostars.org/p/227652/
=== wgsim ===
# Note that the column header CHROM,FROM,.. only needs to appear in input.vcf.gz;
https://github.com/lh3/wgsim
#      they may not appear in the annotation file
# For vcf files, there is a comment sign '#' on the header line containing CHROM,FROM,...
bcftools annotate -c CHROM,FROM,TO,ID,INFO/MLEAC,INFO/MLEAF -a annotation.vcf.gz -o output.vcf input.vcf.gz


# Carry over all INFO and FORMAT annotations except FORMAT/GT
* Used by [https://www.biorxiv.org/content/biorxiv/early/2017/12/20/237107.full.pdf#page=3 Cleaning clinical genomic data: Simple identification and removal of recurrently miscalled variants in single genomes] bioRxiv 2017
bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
* [https://gatkforums.broadinstitute.org/gatk/discussion/7859/how-to-simulate-reads-using-a-reference-genome-alt-contig (How to) Simulate reads using a reference genome ALT contig]
* [http://research.cs.wisc.edu/wham/comparison-using-wgsim/ Comparing WHAM with BWA using wgsim]
* http://biobits.org/samtools_primer.html


# Annotate from a tab-delimited file with six columns (the fifth is ignored),
=== dwgsim ===
# first indexing with tabix. The coordinates are 1-based.
* [https://github.com/nh13/dwgsim Whole Genome Simulator for Next-Generation Sequencing]
tabix -s1 -b2 -e2 annots.tab.gz
* Example: [https://www.bioinformatics.recipes/recipe/view/recipe-variants-bcftools/#info How to generate variant calls with bcftools]
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf


# Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
=== NEAT ===
tabix -s1 -b2 -e3 annots.tab.gz
* https://github.com/zstephens/neat-genreads
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5125660/ Simulating next-generation sequencing datasets from empirical mutation and sequencing models] Zachary Stephens, 2016
</syntaxhighlight>
* If I set 10 as the coverage rate and read length 101, the generated fq file is about 34GB (3.3GB * 10) for each one of the pairs.


Example: variant call
=== DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads ===
<syntaxhighlight lang="bash">
http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html
samtools sort TNBC1/accepted_hits.bam TNBC1/accepted_hits-sorted
samtools index TNBC1/accepted_hits-sorted.bam TNBC1/accepted_hits-sorted.bai
samtools mpileup -uf ~/igenome/human/NCBI/build37.2/genome.fa \
                    TNBC1/accepted_hits-sorted.bam | bcftools view -vcg - > TNBC1/var.raw.vcf
# Or
samtools mpileup -g -f XXX.fa XXX.bam > sample.bcf
bcftools call -v -m -O z -o var.raw.vcf.gz sample.bcf
zcat var.raw.vcf.gz | more
zcat var.raw.vcf.gz | grep -v "^#" | wc -l
samtools tview -p 17:1234567 XXX.bam XXX.fa  | more    # IGV alternative
</syntaxhighlight>
where '-v' outputs variant sites only, '-m' selects the multiallelic caller and '-O z' writes a compressed VCF.


Example: to count the number of SNPs, indels, etc. in the VCF file, use
<syntaxhighlight lang='bash'>
<syntaxhighlight lang="bash">
$ head simDNA_100bp_16del.fasta
bcftools stats xxx.vcf | more
>Pt-0-100
</syntaxhighlight>
TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT
 
>Pt-10-110
Example: filter based on variant quality, depth, mapping quality
GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC
<syntaxhighlight lang="bash">
>Pt-20-120
bcftools filter -i"QUAL >= 20 && DP >= 5 && MQ >= 60" inputVCF > output.VCF
GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA
>Pt-30-130
TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA
>Pt-40-140
ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT
</syntaxhighlight>
</syntaxhighlight>
where '-i' includes only sites for which EXPRESSION is true.
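To make the effect of such an include expression concrete, here is a rough awk imitation for simple uncompressed VCF files. It is only a sketch: the function name vcf_filter is invented, DP and MQ are read from the INFO column, and dropping sites that lack one of the values is an assumption of this sketch rather than a statement about how bcftools treats missing data.

```shell
# Hypothetical helper: keep header lines plus sites with
# QUAL >= MIN_QUAL, INFO/DP >= MIN_DP and INFO/MQ >= MIN_MQ.
# Usage: vcf_filter MIN_QUAL MIN_DP MIN_MQ < in.vcf > out.vcf
vcf_filter() {
  awk -F'\t' -v minq="$1" -v mindp="$2" -v minmq="$3" '
    # Return the value of KEY in a semicolon-separated INFO string, or "".
    function info_value(info, key,   n, i, kv, parts) {
      n = split(info, parts, ";")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, "=")
        if (kv[1] == key) return kv[2]
      }
      return ""
    }
    /^#/ { print; next }                   # keep all header lines
    {
      dp = info_value($8, "DP")
      mq = info_value($8, "MQ")
      if ($6 != "." && dp != "" && mq != "" &&
          $6 + 0 >= minq && dp + 0 >= mindp && mq + 0 >= minmq) print
    }'
}
```

For real data the bcftools command is the better choice; the sketch is only meant to show what the three conditions test.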


Example: change the header (dedup.bam -> dedup), use
=== Simulate Whole genome ===
<syntaxhighlight lang="bash">
* [https://github.com/nh13/dwgsim DWGSIM] mentioned by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3785481/ Variant Callers for Next-Generation Sequencing Data: A Comparison Study]. For its usage, see http://davetang.org/wiki/tiki-index.php?page=DWGSIM
$ grep dedup.bam dT_ori_raw.vcf                                                 
 
##samtoolsCommand=samtools mpileup -go temp.bcf -uf /home/brb/GSE11209-master/annotation/genome.fa dedup.bam
=== Simulate whole exome ===
#CHROM  POS    ID      REF    ALT    QUAL    FILTER  INFO    FORMAT  dedup.bam
* https://www.biostars.org/p/66714/ (no final answer)
$ echo dedup > sampleName
* [https://academic.oup.com/bioinformatics/article/29/8/1076/225073 Wessim: a whole-exome sequencing simulator based on in silico exome capture] Sangwoo Kim 2013 & [http://sak042.github.io/Wessim/ software]
$ bcftools reheader -s sampleName dT_ori_raw.vcf -o dT_ori_raw2.vcf
$ grep dedup.bam dT_ori_raw2.vcf
##samtoolsCommand=samtools mpileup -go temp.bcf -uf /home/brb/GSE11209-master/annotation/genome.fa dedup.bam
$ diff dT_ori_raw.vcf dT_ori_raw2.vcf
45c45
< #CHROM        POS    ID      REF    ALT    QUAL    FILTER  INFO    FORMAT  dedup.bam
---
> #CHROM        POS    ID      REF    ALT    QUAL    FILTER  INFO    FORMAT  dedup
</syntaxhighlight>


What bcftools commands are used in BRB-SeqTools?
=== SCSIM ===
* bcftools filter: apply fixed-threshold filters.
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03550-1 SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data]
<pre>bcftools filter -i "QUAL >= 1 && DP >= 60 && MQ >= 1" INPUT.vcf > filtered.vcf</pre>
* bcftools norm: Left-align and normalize indels. http://annovar.openbioinformatics.org/en/latest/articles/VCF/
<pre>
bcftools norm -m-both -o splitted.vcf filtered.vcf
bcftools norm -f genomeRef -o leftnormalized.vcf splitted.vcf
</pre>
* bcftools annotate: add/remove or update annotations
<pre>
bcftools annotate -c ID -a dbSNPVCF leftnormalized.vcf.gz > dbsnp_anno.vcf
bcftools annotate -c ID,+GENE -a cosmicVCF dbsnp_anno.vcf.gz > cosmic_dbsnp.vcf
# +GENE: add annotations without overwriting existing values
# In this case, it is likely GENE does not appear in dbsnp_anno.vcf.gz
# It is better to use +INFO/GENE instead of GENE in the '-c' parameter.
</pre>
* bcftools query: Extracts fields from VCF or BCF files and outputs them in user-defined format.
<pre>
bcftools query -f '%INFO/AC\n' input.vcf > AC.txt
bcftools query -f '%INFO/MLEAC\n' input.vcf > MLEAC.txt
</pre>


==== [http://www.htslib.org/doc/tabix.html htslib]: bgzip and tabix ====
== Variant simulator ==
* bgzip – Block compression/decompression utility. The output file .gz is in a binary format.
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2611-1 sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs]
* tabix – Generic indexer for TAB-delimited genome position files. The output file tbi is in a binary format.


Installation
== Mutation-Simulator ==
<syntaxhighlight lang="bash">
[https://github.com/mkpython3/Mutation-Simulator Mutation-Simulator]
wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2
sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/
cd /opt/RNA-Seq/bin/htslib-1.2.1/
sudo make  # create tabix, htsfile, bgzip commands
</syntaxhighlight>


Example
== SigProfilerSimulator ==
<syntaxhighlight lang="bash">
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03772-3 Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator]
export PATH=/opt/SeqTools/bin/samtools-1.3/htslib-1.3:$PATH
export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH
# zip and index
bgzip -c var.raw.vcf > var.raw.vcf.gz # with "-c", var.raw.vcf is kept
tabix var.raw.vcf.gz  # create var.raw.vcf.gz.tbi, index vcf files (very fast in this step)
bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf # 2 min


# subset based on chromosome 1, include 'chr' and position range if necessary
== Convert FASTA to FASTQ ==
tabix -h var.raw.vcf.gz 1: > chr1.vcf
It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.
tabix -h var.raw.vcf.gz chr1:10,000,000-20,000,000


# subset a vcf file using a bed file
If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.
bgzip -c raw.vcf > raw.vcf.gz # without "-c", the original vcf file will not be kept
tabix -p vcf raw.vcf.gz  # create tbi file
tabix -R test.bed raw.vcf.gz > testout.vcf
</syntaxhighlight>


=== GATK (Java) ===
* Source code is at [https://github.com/broadgsa/gatk-protected/ github] (up to 3.8) and building instruction https://gatkforums.broadinstitute.org/gatk/discussion/comment/8443.
* Source code for GATK4 https://github.com/broadinstitute/gatk
* GATK4 tutorial
** https://software.broadinstitute.org/gatk/documentation/presentations
** [https://drive.google.com/drive/folders/0BzI1CyccGsZicXl0anMwQnFYVEU Call somatic SNVs and indels using GATK4 Mutect2]
** [https://drive.google.com/drive/folders/1dCTLwqZz1oPG_1PWZgAdyTS6lZc5Upd4 Call Somatic CNVs using GATK4.beta]
* <strike>Need to create an account w/ password and log in in order to see and download the software</strike> (not anymore).
* Downloaded file is called GenomeAnalysisTK-3.5.tar.bz2 or GenomeAnalysisTK-3.6.tar.bz2. Note that the current version 3.6 (released on 6/1/2016) requires Java 1.8.
<syntaxhighlight lang='bash'>
<syntaxhighlight lang='bash'>
# Download GenomeAnalysisTK-3.4-46.tar.bz2 from gatk website
$ cat test.fasta
sudo mkdir /opt/RNA-Seq/bin/gatk
>Pt-0-50
sudo tar jxvf ~/Downloads/GenomeAnalysisTK-3.4-46.tar.bz2 -C /opt/RNA-Seq/bin/gatk
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
ls /opt/RNA-Seq/bin/gatk
>Pt-10-60
# GenomeAnalysisTK.jar  resources
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
</syntaxhighlight>
>Pt-20-70
* (Blog) https://software.broadinstitute.org/gatk/blog
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
* Ubuntu 14 only has Java 1.6 and 1.7. Ubuntu 16 only has Java 1.8 and 1.9.
>Pt-30-80
* Picard and HTSJDK also rely on Java.
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
* Different results on a second run. http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results
>Pt-40-90
* [https://www.broadinstitute.org/gatk/guide/ GATK guide] and [https://www.broadinstitute.org/gatk/guide/tagged?tag=requirements requirements].
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
* [https://www.broadinstitute.org/gatk/guide/article?id=3891 Best Practices workflow for SNP and indel calling on RNAseq data using GATK]. It recommends using [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php HaplotypeCaller].
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta
* [https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf Indel realignment] Input=BAM, Output=BAM. Big improvement for Base Quality Score Recalibration when run on realigned BAM files (artificial SNPs are replaced with real indels).
@Pt-0-50
* [https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-3-Base_recalibration.pdf Base Quality Score Recalibration] Input=BAM, known sites, Output=BAM.
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
* [https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php UnifiedGenotyper] for variant call.
+
* https://wikis.utexas.edu/display/bioiteam/Variant+calling+with+GATK
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
* [http://www.biomedcentral.com/content/pdf/1471-2407-13-55.pdf Identification of somatic and germline mutations using whole exome sequencing of congenital acute lymphoblastic leukemia]: a paper that has used GATK for germline and somatic analyses
@Pt-10-60
* http://gatkforums.broadinstitute.org/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
* [https://www.qiagenbioinformatics.com/blog/clinical/new-plugin-to-support-your-bwa-gatk-pipelines/?utm_source=newsletter&utm_medium=email&utm_campaign=06.2016&mkt_tok=eyJpIjoiWkdWa09UYzRZakkwWWpBNCIsInQiOiJ3Z2VtODJTTGpTTzltbGQ0ZVFyMXJRZWNNaHhyN3ZacCtKMnlwWjQrNVdJcGttN3lvQTN6TFlBRVF1cHFFY3lzT3FjRzA1Tk5jcllReCtSTlhCaDlcL1ppSHQ5eTl4SjRrRk1uSDlCQWJMTTg9In0%3D New plugin to support your BWA-GATK pipelines for '''Biomedical Genomics Server Solution''']
+
 
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
==== Citing papers ====
@Pt-20-70
https://software.broadinstitute.org/gatk/documentation/article.php?id=6201
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
</syntaxhighlight>


# McKenna et al. 2010 "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data"
Alternatively we can use just one line of code by [https://www.reddit.com/r/bioinformatics/comments/32pu00/fasta_to_fastq_converter/ awk]
# DePristo et al. 2011 "A framework for variation discovery and genotyping using next-generation DNA sequencing data"
<syntaxhighlight lang='bash'>
# Van der Auwera et al. 2013 "From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline"
$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
  test.fasta > test.fq
$ cat test.fq
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
</syntaxhighlight>
Change the 'H' to the quality character that you need (depending on which Phred score scale you are using).
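The awk one-liner reads only the first sequence line of each record ($2), so wrapped FASTA sequences would be cut short, and the quality character is hard-coded. A slightly longer sketch that joins wrapped lines and takes the Phred score as a number is given here. It assumes the Sanger (Phred+33) encoding, in which 40 maps to 'I' and 39 to 'H'; the function name fasta_to_fastq is hypothetical.

```shell
# Hypothetical helper: convert FASTA to FASTQ with a constant quality score.
# Assumption: Phred+33 (Sanger) encoding, so the character is chr(score + 33).
# Usage: fasta_to_fastq PHRED < in.fasta > out.fq
fasta_to_fastq() {
  awk -v q="$1" '
    BEGIN { qc = sprintf("%c", q + 33) }   # quality character for this score
    # Emit the record collected so far as four FASTQ lines.
    function flush_record(   i, qual) {
      if (name == "") return
      qual = ""
      for (i = 0; i < length(seq); i++) qual = qual qc
      printf "@%s\n%s\n+\n%s\n", name, seq, qual
    }
    /^>/ { flush_record(); name = substr($0, 2); seq = ""; next }
         { seq = seq $0 }                  # join wrapped sequence lines
    END  { flush_record() }
  '
}
```

Usage: fasta_to_fastq 40 < test.fasta > test.fq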


==== Container version and Java ====
== Simulate genetic data ==
For some reason, running GATK 3.8 on Biowulf (CentOS + Sun Java) does not give any variants. But in my Ubuntu (OpenJDK), it does.
[https://onunicornsandgenes.blog/2019/06/16/simulating-genetic-data-with-r-an-example-with-deleterious-variants-and-a-pun/ ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’]


It is strange the website recommends Sun Java, but the container version uses OpenJDK.
= PDX/Xenograft =
<pre>
* https://en.wikipedia.org/wiki/Patient_derived_xenograft
$ docker run -it --rm broadinstitute/gatk
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-1172 Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts?] by Tso et al, BMC Genomics, 2014.
(gatk) root@9c630a926bc1:/gatk# java -version
* [http://www.arrayserver.com/wiki/index.php?title=Align_Ion_Torrent_reads Map xenograft reads] by [http://www.omicsoft.com/array-studio/ Array Suite]
openjdk version "1.8.0_131"
* [http://www.pdxfinder.org/ PDXFinder] includes PDMR as one of 6 providers. The website source code https://github.com/PDXFinder/pdxfinder.
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
** [http://www.pdxfinder.org/data/pdx/IRCC/CRC0120LM#variation A Colorectal Carcinoma example] contains 'genomic data'. Each row represents one seq position. No raw FASTQ files available.
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
* NCI [https://pdmr.cancer.gov/database/default.htm Patient-Derived Models Repository (PDMR)].
 
** '''passage number''' (P0 represented first '''implant'''). [https://bitesizebio.com/13685/cell-culture-passage-number-explained/ Understanding Cell Passage Number and How to Calculate it for Cell Cultures]. A passage number refers to the '''number of times''' a culture of cells has been '''sub-cultivated''' or transferred to a new environment. For example, if a culture of cells is started and then '''split''' into two new cultures, each of these new cultures would be considered a "passage 2" culture. The original culture would be considered "passage 1." This is commonly used in in-vitro cell culture studies to track the age, growth, and changes of the cells over time. It is also used to ensure the consistent quality of the cells or organisms being used in experiments.
$ docker run -it --rm broadinstitute/gatk3:3.8-0
** ftp://dctdftp.nci.nih.gov/pub/pdm/
root@80b7efa0c3ac:/usr# java -version
** The ftp link can be obtained by clicking 'PDMR Models' -> 'PDMR database' -> Click here to access the PDMR Database -> 'Genomic Analysis'.
openjdk version "1.8.0_102"
** There will be three tabs: NCI Cancer Genome Panel (4246 records), Whole Exome Sequence (785 records) and RNASeq (807 records).
OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-1~bpo8+1-b14)
** For Whole Exome Sequence, '''VCF''' was provided. For RNASeq, '''RSEM''' files per genes or isoforms are available.
OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
** The '''RNASeq Transcriptome Data Analysis Pipeline and Specifications''' and '''Whole Exome Sequencing Data Analysis Pipeline and Specifications''' are available under SOPs. [https://pdmdb.cancer.gov/pls/apex/f?p=101:34:0::NO:34:: RNASeq] TPM data.
</pre>
* [https://www.rna-seqblog.com/reproducible-bioinformatics-project/ Reproducible Bioinformatics Project]
* [https://academic.oup.com/bioinformatics/article/28/12/i172/269972 Xenome] a tool for classifying reads from xenograft samples, Thomas Conway et al 2012.  
** [https://en.wikipedia.org/wiki/K-mer K-mer]
** The program is bundled in '''[https://github.com/data61/gossamer/blob/master/docs/xenome.md Gossamer]''' (Github)
** [https://hpc.nih.gov/apps/gossamer.html Biowulf] It is noted that Gossamer runs in a Singularity container
** Indexing took 13 hours when I set 16 threads and 24GB memory (25.4GB was used). A set of 23 files with prefix 'idx' will be generated.
: <syntaxhighlight lang='bash'>
#!/bin/bash
module load gossamer
xenome index -M 24 -T 16 -P idx \
  -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
  -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
</syntaxhighlight>
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0074432 Next-Generation Sequence Analysis of Cancer Xenograft Models] by Fernando J. Rossello et al 2013.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4991491/ Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers] Bradford et al 2016
* [https://f1000research.com/articles/5-2741/v1 An open-source application for disambiguating two species in next generation sequencing data from grafted samples] by Ahdesmäki MJ et al 2016. [https://github.com/AstraZeneca-NGS/disambiguate Disambiguation]
* [http://mcr.aacrjournals.org/content/molcanres/15/8/1012.full.pdf Next-Generation Sequencing Analysis and Algorithms for PDX and CDX Models] by Garima Khandelwal et al 2017. [https://github.com/CRUKMI-ComputationalBiology/bamcmp bamcmp] software.
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4414-y Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts] by Maurizio Callari et al 2018. Both RNA-Seq and DNA-Seq are considered. Software [https://github.com/cclab-brca/ICRG ICRG].
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2353-5 XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data] Kluin et al 2018. Software in [https://github.com/PeeperLab/XenofilteR github].
* [https://hpc.nih.gov/apps/bbtools.html BBSplit], [https://pdmr.cancer.gov/content/docs/MCCRD_SOP0011_PDX_Whole_Exome_Seq_Analysis_Pipeline.pdf PDX Whole Exome Seq Analysis Pipeline], [https://pdmr.cancer.gov/content/docs/MCCRD_SOP0012_PDX_RNASeq_Analysis_Pipeline.pdf RNASeq Transcriptome Data Analysis Pipeline] from PDMR. [https://www.biostars.org/p/143019/ Tool to separate human and mouse rna seq reads].
* [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006596 Whole genomes define concordance of matched primary, xenograft, and organoid models of pancreas cancer] Gendoo et al 2019
* [https://cancerres.aacrjournals.org/content/79/17/4539 Integrative Pharmacogenomics Analysis of Patient-Derived Xenografts] 2019
* [https://pubmed.ncbi.nlm.nih.gov/31262303/ Genomic data analysis workflows for tumors from patient-derived xenografts (PDXs): challenges and guidelines] 2019
* [https://bioconductor.org/packages/release/bioc/html/Xeva.html Xeva] package. Analysis of patient-derived xenograft (PDX) data. Paper [https://aacrjournals.org/cancerres/article/79/17/4539/638195/Integrative-Pharmacogenomics-Analysis-of-Patient Integrative Pharmacogenomics Analysis of Patient-Derived Xenografts] Mer, 2019. ''Molecular profile and pharmacologic profile''.
** GSE78806 - microarray-based Gene expression data, PDX passages, and tissue information
** [https://www.nature.com/articles/nm.3954 High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response] Gao, 2015. Molecular profiles including mutation, CNA, RNASeq-based gene expression, and pharmacologic profiles.


See the example '[[Docker#Run_a_shell_script_on_host|how to run a shell script on host]]' on how to run GATK from the host command line (the container will be deleted after the job is done; similar to what 'singularity' does).
== RNA-Seq ==
* [https://pdmr.cancer.gov/content/docs/MCCRD_SOP0012_PDX_RNASeq_Analysis_Pipeline.pdf RNASeq Transcriptome Data Analysis Pipeline and Specifications] by Mocha
*# ''PDX Mouse reads are removed from the raw FASTQ files using bbsplit (bbtools v37.36).''
*# The fastq files are mapped to the human transcriptome based on exon models from hg19 using Bowtie2 (version 2.2.6) [1]. The resulting SAM files are converted to BAM format using samtools [2] and the coordinates in the BAM files are converted to genomic (hg19) coordinates using RSEM (version 1.2.31). Gene and transcript quantifications are also done using RSEM.
*# Removal of Small nucleolar RNAs (snoRNAs)
* Platform [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16791 GPL16791] Illumina HiSeq 2500
** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1702792
** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1887215
* [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL25431 GPL25431] Illumina HiSeq 4000 (Homo sapiens; Mus musculus)
** GSE118197
* [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18573 GPL18573] Illumina NextSeq 500
** GSE106336  [https://www.nature.com/articles/s41467-017-01967-6 MYC] regulates ductal-neuroendocrine lineage plasticity in pancreatic ductal adenocarcinoma associated with poor outcome and chemoresistance
* [https://youtu.be/j4qpJ8sVjT0 Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic Data Analysis]. This is posted on [https://www.rna-seqblog.com/mastering-rna-seq-data-analysis-a-critical-approach-to-transcriptomic-data-analysis/ rna-seqblog]. See the link [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4991491/ Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers].
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03633-z PDXGEM]: patient-derived tumor xenograft-based gene expression model for predicting clinical response to anticancer therapy in cancer patients


==== Pipeline ====
== DNA-Seq ==
* [https://software.broadinstitute.org/gatk/documentation/quickstart.php Quick Start Guide]: installation, best practice, how to run, run pipelines with WDL, ...
* [http://www.sciencedirect.com/science/article/pii/S2211124713004634 Endocrine-Therapy-Resistant ESR1 Variants Revealed by Genomic Characterization of Breast-Cancer-Derived Xenografts]  
* [https://gencore.bio.nyu.edu/tag/gatk/ Variant Calling Pipeline: FastQ to Annotated SNPs in Hours]
** [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/molecular.cgi?study_id=phs000611.v1.p1&phv=195714&phd=&pha=&pht=3502&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 dbGaP]
* https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1
** dbGaP accession [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=phs000611 phs000611].
* https://gatkforums.broadinstitute.org/gatk/discussion/3891/best-practices-for-variant-calling-on-rnaseq
* https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&f=study&term=xenograft&go=Go
 
* [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=ERP021871 ERP021871]
==== IGV ====
* https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf


==== dbSNP file ====
== MAF (TCGA, GDC) ==
For running GATK best practices, dbsnp file has to be downloaded using the GATK version (with 'chr') for Ensembl but non-GATK (without 'chr') for UCSC. See [[Anders2013#GATK|Anders -> GATK]].
* See [https://brb.nci.nih.gov/d3oncoprint/D3Oncoprint-UserManual.pdf#page=17 D3Oncoprint] manual. D3oncoprint uses java to create the interface (java -jar D3Oncoprint.jar). The result is shown in a browser. The top part is a heatmap and the bottom part is a table. The heatmap is good only for a small number of variants and there is no way to zoom in if we have a lot of variants (eg 1000). The phenotypes file is optional. The following 3 columns are required and the others are for tooltip and table. The column name from a VCF file input is given below.
** '''Gene column''': GENE
** '''Variant type column''': ExonicFunc.refGene
** '''Aminoacid change column''': AAChange.refGene (this defines the name of '''variant'''. The variant uses the single letter aminoacid codes. This information is only used in HTML table. For example if AAChange.refGene=UGT1A7:NM_019077:exon1:c.T387G:p.N129K, then the variant will be N129K. My observation is the variant name for variant type 'frameshift_deletion' is altered in table; R128fs -> R128Xfs)
* https://www.reddit.com/r/bioinformatics/comments/cmdza5/vcf_and_maf_file_formats/ which links to
** MAF: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/
** VCF: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692?id=11005
** [https://www.reddit.com/r/bioinformatics/comments/cmdza5/vcf_and_maf_file_formats/ VCF and MAF file formats?]
* [https://pdmr.cancer.gov/content/docs/MCCRD_SOP0023_v2.0_PDX_consensus_files.pdf The Standing Operating Procedure (SOP) describes procedures for generating consensus/intersect variants calls for reporting in the NCI Patient-Derived Models database]
* Convert vcf to maf. https://hpc.nih.gov/apps/vcf2maf.html
* [https://www.bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html#10_variant_annotations maftools] : Summarize, Analyze and Visualize MAF Files (oncoplots)
* https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html
** [https://www.biostars.org/p/295833/ Use TCGAbiolinks package] to download maf file.
** http://firebrowse.org/. For example, BRCA -> Mutation Annotation File (from bar plot on RHS). Pick any data. The (extracted) file name will be *.maf.txt.
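The AAChange.refGene example in the D3Oncoprint notes above (UGT1A7:NM_019077:exon1:c.T387G:p.N129K giving the variant name N129K) can be reproduced with plain shell parameter expansion. This is only a minimal sketch: <code>protein_change</code> is a hypothetical helper name, not part of D3Oncoprint, and taking the first entry of a comma-separated value is an assumption about how multi-transcript annotations are written.

```shell
#!/usr/bin/env bash
# Extract the short protein change (e.g. N129K) from an AAChange.refGene value.
# protein_change is a hypothetical helper, not a D3Oncoprint function.
protein_change() {
  local aachange=$1
  # Assumption: several transcripts may be listed comma-separated; keep the first.
  local first=${aachange%%,*}
  case $first in
    # Strip everything up to and including the last ":p." marker.
    *:p.*) printf '%s\n' "${first##*:p.}" ;;
    # No protein-level change recorded: print an empty line.
    *)     printf '%s\n' "" ;;
  esac
}

protein_change 'UGT1A7:NM_019077:exon1:c.T387G:p.N129K'   # prints N129K
```

The frameshift renaming noted above (R128fs -> R128Xfs) is not modelled here.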


* ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/ common_all_xxxxxx.vcf.gz and tbi files
= DNA Seq Data =
* ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/VCF/GATK/ common_all_xxxxxx.vcf.gz and tbi files
== NIH ==
* Go to [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj SRA/Sequence Read Archive] and type the keywords 'Whole Genome Sequencing human'. An example of the procedure for finding whole genome sequencing data from human samples:
*# Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
*# The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
*# The results are sorted by accession number (ignoring the first 3 letters, such as DRP). The accession number has the format SRPxxxx, so I just go to the last page (page 98).
*# I pick the first one Accession:SRP066837 from this page. The page shows the '''Study type''' is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
*# <span style="color: red">(Important trick)</span> Click the number next to '''Run'''. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
*# Download the raw data from any one of them (eg SRR2968056). For whole genome, the '''Strategy''' is ''WGS''. For whole exome, the '''Strategy''' is called ''WXS''.
* Search the keywords 'nonsynonymous' and 'human' in [http://www.ncbi.nlm.nih.gov/pmc/?term=nonsynonymous+human PMC]


==== [https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates MarkDuplicates] by Picard ====
=== Use [http://www.ncbi.nlm.nih.gov/books/NBK158900/ SRAToolKit] instead of wget to download ===
'''Sequencing error''' propagated in duplicates. See p14 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].
Don't use the ''wget'' command, since it requires specifying the exact HTTP address.


Reads have to be sorted by coordinates (using eg '''picard.jar SortSam''' OR '''samtools sort''') first.
[http://www.ncbi.nlm.nih.gov/books/NBK158899/ Downloading SRA data using command line utilities]


* [https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates MarkDuplicates]
[https://github.com/NCBI-Hackathons/SRA2R SRA2R] - a package to import SRA data directly into R.
* [https://gatkforums.broadinstitute.org/gatk/discussion/2799/howto-map-and-mark-duplicates (howto) Map and mark duplicates]


<pre>
(Method 1) Use the '''[http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump fastq-dump]''' command. For example, the following command (modified from the [http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using document]) will download the first 5 reads and save them to a file called <SRR390728.fastq> ('''NOT sra format''') in the current directory.
java -Djava.io.tmpdir="./tmpJava" \
<syntaxhighlight lang='bash'>
  -Xmx10g -jar $PICARDJARPATH/picard.jar  \
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
  MarkDuplicates \
# OR
  VALIDATION_STRINGENCY=SILENT \
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar
  METRICS_FILE=MarkDudup.metrics \
</syntaxhighlight>
  INPUT=sorted.bam \
This will download the files in FASTQ format.
  OUTPUT=bad.bam
</pre>
===== Check if sam is sorted =====
http://plindenbaum.blogspot.com/2011/02/testing-if-bam-file-is-sorted-using.html
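A quicker, weaker check than the one in the linked post is to look at the sort order declared in the SAM header: the <code>@HD</code> line carries an <code>SO:</code> tag, and <code>SO:coordinate</code> is what MarkDuplicates expects. The sketch below only reads header text from stdin; <code>is_coordinate_sorted</code> is a hypothetical helper name, and the header states what the file claims, not whether the records really are in order.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when the SAM header on stdin declares SO:coordinate.
# is_coordinate_sorted is a hypothetical helper; it trusts the header only.
is_coordinate_sorted() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Self-contained demo on a hand-written header.
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n' \
  | is_coordinate_sorted && echo "coordinate-sorted"
```

With samtools installed, the header of a real file can be piped in with <code>samtools view -H sorted.bam | is_coordinate_sorted</code>.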


==== [https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups Read group assignment] by Picard ====
(Method 2) If we need to download by wget or FTP (works for the ‘SRR’, ‘ERR’, or ‘DRR’ series):
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
<syntaxhighlight lang='bash'>
java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
  -jar /opt/SeqTools/bin/picard-tools-2.1.1/picard.jar AddOrReplaceReadGroups \
</syntaxhighlight>
  INPUT=BMBC2_liver3_IMPACT.bam \
It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.
  OUTPUT=rg_added_sorted.bam \
  RGID=1 \
  RGLB=rna \
  RGPL=illumina \
  RGPU=UNKNOWN \
  RGSM=BMBC2_liver3_IMPACT
</pre>


==== Split reads into exon (RNA-seq only) ====
(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
<syntaxhighlight lang='bash'>
samtools index reorder.bam
brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz
aspera-connect-3.6.2.117442-linux-64.sh
brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh


java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
Installing Aspera Connect
  -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
  -T SplitNCigarReads \
  -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
  -I reorder.bam \
  -o split.bam \
  -U ALLOW_N_CIGAR_READS \
  -fixNDN \
  -allowPotentiallyMisencodedQuals


samtools index split.bam
Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
</pre>
Restart firefox manually to load the Aspera Connect plug-in


==== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php Indel realignment] ====
Install complete.
Local alignment around indels corrects '''mapping errors'''. See p15 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].


Notes
brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
* (from its documentation) '''indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.'''
  -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
* It requires the following two steps before running indel realignment
  anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
*# Generate an intervals (see next subsection) file
SRR590795.sra                                                                          100%  239MB  535Mb/s    00:06
*# sorted bam file (necessary?)
Completed: 245535K bytes transferred in 7 seconds
* The '-known' option is not necessary. In [https://approachedinthelimit.wordpress.com/2016/06/29/updated-gatk-pipeline-to-haplotypecaller-gvcf/ this GATK workflow with HaplotypeCaller], it does not use '-known' or '-knownSites' in the pipeline.
(272848K bits/sec), in 1 file.
brb@T3600 ~/Downloads $
</syntaxhighlight>
''Aspera is typically 10 times faster than FTP'' according to the website. For this case, wget takes 12s while ascp uses 7s.


===== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php Create realignment targets (*.intervals)] using '-known' option  =====
Note that the URL on the website is wrong. I got the correct URL by emailing NCBI help. Google: ascp "anonftp@ftp-private.ncbi.nlm.nih.gov"
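The wget and ascp examples above share the same directory layout: series prefix, then the first six characters of the accession, then the full accession. A minimal sketch that rebuilds the wget URL from a run accession is shown below. <code>sra_ftp_url</code> is a hypothetical helper name, the layout is inferred only from those two examples, and it has not been checked against what the NCBI servers currently serve.

```shell
#!/usr/bin/env bash
# Build the ByRun FTP URL for an SRR/ERR/DRR accession.
# sra_ftp_url is a hypothetical helper; the path layout is inferred from the
# wget example (SRR304976) and may not match the current NCBI layout.
sra_ftp_url() {
  local acc=$1
  local series=${acc:0:3}   # SRR, ERR or DRR
  local bucket=${acc:0:6}   # e.g. SRR304
  printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$series" "$bucket" "$acc" "$acc"
}

sra_ftp_url SRR304976
```

The printed URL can be passed to wget as in Method 2.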


The following code follows [https://software.broadinstitute.org/gatk/documentation/article.php?id=38 Local Realignment around Indels]. That is, we don't need to use the '-I' parameter as in [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php example from the RealignerTargetCreator] documentation.
=== SRAdb package ===
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
https://bioconductor.org/packages/release/bioc/html/SRAdb.html
java -Xmx10g -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
      -T RealignerTargetCreator \
      -R /home/brb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
      -o  /home/brb/SeqTestdata/usefulvcf/hg19/gatk/BRB-SeqTools_indels_only.intervals \
      -known  /home/brb/SeqTestdata/usefulvcf/hg19/gatk/common_all_20160601.vcf  \
      -nt 11
</pre>


Indel realignment requires an intervals file.
First we install some required packages for XML and RCurl.
 
<syntaxhighlight lang='bash'>
===== [https://broadinstitute.github.io/picard/command-line-overview.html#ReorderSam ReorderSam] by Picard =====
sudo apt-get update
Question: is this step necessary? The [https://software.broadinstitute.org/gatk/documentation/article.php?id=38 Local Realignment around Indels] documentation does not emphasize this step.
sudo apt-get install libxml2-dev
 
sudo apt-get install libcurl4-openssl-dev
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
</syntaxhighlight>
java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
and then
  -jar /opt/SeqTools/bin/picard-tools-2.1.1/picard.jar  \
<syntaxhighlight lang='rsplus'>
  ReorderSam \
source("https://bioconductor.org/biocLite.R")
  INPUT=rg_added_sorted.bam \
biocLite("SRAdb")
  OUTPUT=reorder.bam \
</syntaxhighlight>
  REFERENCE=/home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa
</pre>


===== Indel realignment using '-known' option =====
== SRA ==
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
[https://ncbiinsights.ncbi.nlm.nih.gov/2021/05/27/nih-open-access-cloud-sra/ The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud]
java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
  -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
  -I split.bam \
  -targetIntervals /home/brb/SeqTestdata/usefulvcf/hg38/gatk/BRB-SeqTools_indels_only.intervals \
  -known /home/brb/SeqTestdata/usefulvcf/hg38/gatk/common_all_20170710.vcf.gz \
  -o realigned_reads.bam \
  -allowPotentiallyMisencodedQuals
</pre>


==== [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php Base quality score recalibration] using '-knownSites' option ====
Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers
Base recalibration corrects for '''machine errors'''. See p16 on [https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2016-Seq_and_Variant_Discovery].


The base recalibration process involves two key steps
=== SRA Explorer ===
# builds a model of covariation based on the data and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore errors and indicative of poor base quality.  
* https://ewels.github.io/sra-explorer/
# Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).
* Source code https://github.com/ewels/sra-explorer


Afterwards, it (I guess it is in the PrintReads command) adjusts the base quality scores in the data based on the model
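The empirical error probability described above, p(error) = num mismatches / num observations, maps to a Phred-scaled quality as Q = -10 * log10(p). The sketch below is only that plain ratio; <code>empirical_quality</code> is a hypothetical helper name, and GATK's own calculation may apply smoothing or priors that are not shown here.

```shell
#!/usr/bin/env bash
# Phred-scaled empirical quality from mismatch and observation counts.
# empirical_quality is a hypothetical helper, not a GATK command.
empirical_quality() {
  # $1 = number of mismatches, $2 = number of observations
  awk -v m="$1" -v n="$2" 'BEGIN {
    # The log is undefined for zero counts, so report NA instead.
    if (n <= 0 || m <= 0) { print "NA"; exit }
    p = m / n                                  # empirical p(error)
    printf "%.1f\n", -10 * log(p) / log(10)    # Q = -10 * log10(p)
  }'
}

empirical_quality 10 10000   # p(error) = 0.001, prints 30.0
```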
=== SRP056969 ===
* [https://www.nature.com/articles/s41467-017-00867-z Inference of RNA decay rate from transcriptional profiling highlights the regulatory programs of Alzheimer’s disease]
* [http://www.rna-seqblog.com/rna-seq-reveals-mrna-stability-a-marker-in-alzheimers-patients/ RNA-Seq reveals mRNA stability a marker in Alzheimer’s patients]
* REMBRANDTS: REMoving Bias from Rna-seq ANalysis of Differential Transcript Stability


(from the documentation) '''-knownSites''' parameter: This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.
=== SRP066363 - lung cancer ===
* Platform: GPL11154 Illumina HiSeq 2000 (Homo sapiens)
* Overall design: RNAseq and DNA copy number analysis of H1975 cells
* Strategy: 6 RNA-Seq and 3 Whole exome. Paired. [https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=302630 9 samples]
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066363
* http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74866


In terms of the software implementation, this workflow actually includes two commands: '''BaseRecalibrator''' and '''PrintReads'''.
=== SRP015769 or SRP062882 - prostate cancer ===
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP015769 5 are from normal and 5 are from tumor. Whole Exome Seq.
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062882 6 normal and the rest are tumor.  


In my experience, running the [https://taichimd.us/mediawiki/index.php/Seqtools#DNA-Seq PrintReads command is very slow] even when using the multi-threaded mode option (-nct). See the discussion [https://gatkforums.broadinstitute.org/gatk/discussion/3051/parallelizing-printreads parallelizing PrintReads].
=== SRP053134 - breast cancer ===
* http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1785051


<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
Look at the MBases value column. It determines the coverage for each run.
java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
  -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
  -I realigned_reads.bam \
  -nct 11 \
  -knownSites /home/brb/SeqTestdata/usefulvcf/hg38/gatk/common_all_20170710.vcf.gz \
  -o recal_data.table \
  -allowPotentiallyMisencodedQuals


java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
=== [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64016 SRP050992] single cell RNA-Seq ===
  -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
Used in [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0927-y Design and computational analysis of single-cell RNA-sequencing experiments]
  -T PrintReads \
  -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
  -I realigned_reads.bam \
  -nct 11 \
  -BQSR recal_data.table \
  -o recal.bam
</pre>


==== Variant call by [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php HaplotypeCaller] (germline) and [https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php MuTect2] (somatic) ====
=== Single cell RNA-Seq ===
* [http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0964-6 Exploiting single-cell expression to characterize co-expression replicability]
* [[NGS#Single_Cell_RNA-Seq|NGS -> Single cell RNA-Seq]]


* [http://www.pathologystudent.com/?p=8539 Germline vs. somatic mutations]
=== SRP040626 or SRP040540 - Colon and rectal cancer ===
* HaplotypeCaller
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040626
** ''The (HaplotypeCaller) algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so '''its use is not recommended for somatic (cancer) variant discovery'''. For that purpose, use MuTect2 instead.''
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040540
** HaplotypeCaller theory/properties https://wiki.nbic.nl/images/1/13/Wim_2013_07_12.pdf#page=6
** Local denovo assembly ([https://en.wikipedia.org/wiki/De_novo_transcriptome_assembly ''De novo'' transcriptome assembly]) based variant caller. '''Whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region.''' This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other.
** Calls SNP, INDEL, MNP and small SV simultaneously
** Removes mapping artifacts
** More sensitive and accurate than the Unified Genotyper (UG)
* [https://gatkforums.broadinstitute.org/gatk/discussion/4148/hc-overview-how-the-haplotypecaller-works How the HaplotypeCaller works?] ([https://software.broadinstitute.org/gatk/documentation/presentations Broad Presentation -> Pipeline Talks -> MPG_Primer_2015-Seq_and_Variant_Discovery])
*# Define '''active regions''' (substantial evidence of variation relative to the reference)
*# Determine haplotypes by '''re-assembly''' of the active region
*# Determine '''likelihoods of the haplotypes''' given the read data
*# Assign sample '''genotypes'''


<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
=== OmicIDX ===
java -Djava.io.tmpdir="/home/brb/SRP049647/outputvc/tmpJava" -Xmx10g \
[https://seandavi.github.io/2019/06/omicidx-on-bigquery/ OmicIDX on BigQuery]
  -jar /opt/SeqTools/bin/gatk/GenomeAnalysisTK.jar \
  -T HaplotypeCaller --genotyping_mode DISCOVERY \
  -R /home/brb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/../WholeGenomeFasta/genome.fa \
  -I recal.bam \
  -stand_call_conf 30 \
  -o /home/brb/SRP049647/outputvc/BMBC2_liver3_IMPACT_raw.vcf \
  -allowPotentiallyMisencodedQuals \
  -nct 11
</pre>


==== Variant call by VarDict ====
== Tutorials ==
https://github.com/AstraZeneca-NGS/VarDict
See the [[#BWA|BWA]] section.  


==== Random results and downsampling ====
== Whole Exome Seq ==
* [https://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results HaplotypeCaller on whole genome or chromosome by chromosome: different results] Downsampling is random, and a few different reads can make a difference.
* [http://www.1000genomes.org/category/exome 1000genomes]. 1000genomes and tcga are two places to get vcf files too.
* [https://gatkforums.broadinstitute.org/gatk/discussion/3094/downsampling-with-haplotypecaller Downsampling with HaplotypeCaller]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4179624/ Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing] (Bao 2014)
* [https://gatkforums.broadinstitute.org/gatk/discussion/1323/downsampling Downsampling details] '''We do not recommend changing the downsampling settings in the tool.'''
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/ A framework for variation discovery and genotyping using next-generation DNA sequencing data]. See the table 1 there.
* [https://gatkforums.broadinstitute.org/gatk/discussion/3989/downsampling-to-coverage-and-the-3-x-haplotypecaller What I want to obtain is the same result as when running just HC on BAM files in plain VCF output]
* Some data from SRA repository.  
* [https://gatkforums.broadinstitute.org/gatk/discussion/9943/questions-about-downsampling You can use PrintReads to downsample first then run HaplotypeCaller on the downsampled BAM]
** http://sra.dnanexus.com/?q=cancer+exome&result_type=Study
* [https://gatkforums.broadinstitute.org/gatk/discussion/8223/haplotypecaller-generates-diff-results-on-different-cpus HaplotypeCaller generates diff results on different CPUs]. you might well get into reproducibility problems if you use single process multi-thread parallelism (i.e. -nt N where N > 1) regardless of what pair HMM implementation you are using.
** http://www.ncbi.nlm.nih.gov/sra/?term=WXS
** [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP008740 SRP008740] See [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/ A survey of tools for variant analysis of next-generation sequencing data] (Pabinger 2014)


==== Why expected variants are not called ====
== Whole Genome Seq ==
* [https://software.broadinstitute.org/gatk/documentation/article.php?id=7869 I expect to see a variant at a specific site, but it's not getting called]
* [https://www.ncbi.nlm.nih.gov/bioproject/browse/ BioProject] and
* [https://software.broadinstitute.org/gatk/documentation/article.php?id=5484 Generate a "bamout file" showing how HaplotypeCaller has remapped sequence reads]
** Search: filter by 'Homo sapiens wgs'
* [https://gatkforums.broadinstitute.org/gatk/discussion/2803/howto-call-variants-with-haplotypecaller Call variants with HaplotypeCaller]
** Project data type: Genome sequencing
* [https://gatkforums.broadinstitute.org/gatk/discussion/4851/inconsistency-among-the-depth-in-the-vcf-file-and-in-the-original-bam-file-and-hc-bamout-bam-file Inconsistency among the depth in the vcf file and in the original bam file and HC -bamout bam file]
** Click 'Date' to sort by it
* To include all trimmed, downsampled, filtered and uninformative reads when -bamout is specified, add the '''--emitDroppedReads''' argument.
** 20 hits as of 1/11/2017 (many of them do not have data)
 
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA352450 PRJNA352450] 18 experiments
==== Why is HaplotypeCaller dropping half of my reads? ====
** [https://www.ncbi.nlm.nih.gov/sra/SRX2341045 SRX2341045] 20.5M spots, 4.1G bases, 1.8Gb
https://gatkforums.broadinstitute.org/gatk/discussion/4844/why-is-haplotypecaller-dropping-half-of-my-reads
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA343545 PRJNA343545] 48 experiments
 
** [https://www.ncbi.nlm.nih.gov/sra/SRX2187657 SRX2187657] 417.4M spots, 126.1G bases, 55.1Gb
==== Missing mapping quality score ====
* [https://www.ncbi.nlm.nih.gov/bioproject/309109 309109] 5 experiments
https://www.biostars.org/p/79367/#79369
** [https://www.ncbi.nlm.nih.gov/sra/SRX1538498 SRX1538498] 1.7M spots, 346.1M bases, 99.7Mb
 
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA289286 PRJNA289286] 5 experiments
On GSE48215 case, I got 34 missing MQ from bwa + gatk but no missing MQ from bwa + samtools???
** [https://www.ncbi.nlm.nih.gov/sra/SRX1100298 SRX1100298] 504M spots, 101.8G bases, 45.3Gb
 
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA260389 PRJNA260389] 27 experiments
==== VQSR quality control ====
** [https://www.ncbi.nlm.nih.gov/sra/SRX699196 SRX699196] 34,099,675 spots, 6.8G bases, 4.3Gb size
https://zhuanlan.zhihu.com/p/34878471
* [https://www.ncbi.nlm.nih.gov/bioproject/248553 248553] 3 experiments
 
** [https://www.ncbi.nlm.nih.gov/sra/SRX1026041 SRX1026041] 1.2G spots, 250.9G bases, 114.2Gb
==== idx file ====
* [https://www.ncbi.nlm.nih.gov/bioproject/210123 210123] 26 experiments
It is a binary file. It is one of the two outputs of GATK's HaplotypeCaller.
** [https://www.ncbi.nlm.nih.gov/sra/SRX318496 SRX318496] 173.7M spots, 34.7G bases, 23Gb
 
* [https://www.ncbi.nlm.nih.gov/bioproject/43433 43433] 3 experiments (ABI SOLiD System 3.0)
Unfortunately there is no documentation about its spec. [https://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it  It can be generated if VCF files can be validated] by [https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php ValidateVariants]
** [https://www.ncbi.nlm.nih.gov/sra/SRX017230 SRX017230]  851.1M spots, 85.1G bases, 68.8Gb
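The "bases" figures listed above give a rough idea of sequencing depth: divide the number of sequenced bases by the genome size. The sketch below assumes a human genome size of about 3.1 Gb (an assumption, not a value from this page) and ignores duplicates, unmapped reads and uneven coverage; <code>approx_coverage</code> is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough depth = sequenced bases / genome size.
# approx_coverage is a hypothetical helper; 3.1 Gb is an assumed genome size.
approx_coverage() {
  # $1 = sequenced bases in Gb, $2 = genome size in Gb (optional)
  awk -v bases="$1" -v genome="${2:-3.1}" 'BEGIN { printf "%.1f\n", bases / genome }'
}

approx_coverage 126.1   # SRX2187657: prints 40.7
approx_coverage 4.1     # SRX2341045: prints 1.3
```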


==== Djava.io.tmpdir ====
== SraRunTable.txt ==
* [https://samnicholls.net/2015/11/11/grokking-gatk/ Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)]
# http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
* [https://gatkforums.broadinstitute.org/gatk/discussion/7623/haplotypecaller-djava-io-tmpdir the number of files (such as bamschedule.* files) in the tmp is so large(about 11000)]
# http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click ''SRP004077''
* [https://gatkforums.broadinstitute.org/gatk/discussion/4757/is-there-any-speed-benefit-to-redirecting-djava-io-tmpdir-to-local-scratch Is there any speed benefit to redirecting Djava.io.tmpdir to local scratch?]
# http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click '''Runs''' from the RHS
* http://people.duke.edu/~ccc14/duke-hts-2017/Statistics/08032017/GATK-pipeline-sample.html
# http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click '''RunInfoTable'''
* How large does the tmp directory need to be? Less than 1 GB is sufficient for a 12 GB BAM file.


==== how to deal with errors ====
Notes (for this study, the table has 2377 rows):
* [https://gatkforums.broadinstitute.org/gatk/discussion/1429/error-bam-file-has-a-read-with-mismatching-number-of-bases-and-base-qualities ERROR : BAM file has a read with mismatching number of bases and base qualities] (works) with indelrealigner. [https://gatkforums.broadinstitute.org/gatk/discussion/comment/41013 GATK 3.8 log4j error] (not useful). The solution is to add the option '''--filter_mismatching_base_and_quals''' to IndelRealigner.
* Column A (AssemblyName_s) eg GRCh37
* Column I (library_name_s) eg
* Column N (header=Run_s) shows all SRR or ERR accession numbers.
* Column P (Sample_Name)
* Column Y (header=Assay_Type_s) shows '''WGS'''.  
* Column AB (LibraryLayout_s): PAIRED
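The column headers listed above make it easy to pull out run accessions by assay type. The sketch below looks columns up by header name rather than by letter, so it keeps working if the column order differs. It assumes the run table is tab-delimited with a header row; <code>wgs_paired_runs</code> is a hypothetical helper name, and the accessions in the demo are made up.

```shell
#!/usr/bin/env bash
# Print Run_s for rows with Assay_Type_s == WGS and LibraryLayout_s == PAIRED.
# wgs_paired_runs is a hypothetical helper; a tab-delimited table is assumed.
wgs_paired_runs() {
  awk -F'\t' '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $col["Assay_Type_s"] == "WGS" && $col["LibraryLayout_s"] == "PAIRED" {
      print $col["Run_s"]
    }
  ' "$1"
}

# Demo on a tiny made-up table (accessions are placeholders, not real runs).
demo=$(mktemp)
printf 'Run_s\tAssay_Type_s\tLibraryLayout_s\n' > "$demo"
printf 'SRR0000001\tWGS\tPAIRED\n' >> "$demo"
printf 'SRR0000002\tWXS\tPAIRED\n' >> "$demo"
printf 'SRR0000003\tWGS\tSINGLE\n' >> "$demo"
wgs_paired_runs "$demo"   # prints SRR0000001
rm -f "$demo"
```

The resulting accession list can then be fed one by one to fastq-dump.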


<pre>
= Public Data =
$ grep -n ERROR swarm_58155698_0.e --color
[https://twitter.com/tangming2005/status/1666437518907133954 Ten Resources for easy access public genomic data] 6/7/2023. UCSCXenaTools (TCGA, ICGC, GDC), PharmacoGx, rDGidb, [https://www.omicsdi.org/#/ OmicsDI], AnnotationHub, TCGAbiolinks, GenomicDataCommons, cbioportal.
513:ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/usr/local/apps/GATK/3.8-0/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
514:ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
535:##### ERROR ------------------------------------------------------------------------------------------
536:##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
537:##### ERROR
538:##### ERROR This means that one or more arguments or inputs in your command are incorrect.
539:##### ERROR The error message below tells you what is the problem.
540:##### ERROR
541:##### ERROR If the problem is an invalid argument, please check the online documentation guide
542:##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
543:##### ERROR
544:##### ERROR Visit our website and forum for extensive documentation and answers to
545:##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
546:##### ERROR
547:##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
548:##### ERROR
549:##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@4a7427f9 is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: BAM file has a read with mismatching number of bases and base qualities. Offender: homo_simulated_Error0_Mu0-j1-chr2-r10408427 [55 bases] [0 quals]. You can use --defaultBaseQualities to assign a default base quality for all reads, but this can be dangerous in you don't know what you are doing.
550:##### ERROR -----------------------------------------------------------------------------
</pre>
* [https://software.broadinstitute.org/gatk/documentation/article?id=6470 ERROR SAM/BAM/CRAM file appears to be using the wrong encoding for quality scores] in BaseRecalibrator. Adding '--fix_misencoded_quality_scores' does not help.
<pre>
##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@6c0bf8f4 appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (68) with BAQ correction factor of 4. Please see https://software.broadinstitute.org/gatk/documentation/article?id=6470 for more details and options related to this error.


After adding '--fix_misencoded_quality_scores', the following error appears instead:
[https://waldronlab.io/PublicDataResources/ Public data resources and Bioconductor] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020].


#### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
{| class="wikitable"
</pre>
|-
! Package name
! Object class
! Downloads (Distinct IPs, Jul 2020)
|-
| [https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html GEOquery]
| SummarizedExperiment
| 5754
|-
| [https://www.bioconductor.org/packages/release/bioc/html/GenomicDataCommons.html GenomicDataCommons]
| GDCQuery
| 1154
|-
| [https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html TCGAbiolinks]<br />[https://www.bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html curatedTCGAData]
| RangedSummarizedExperiment<br />MultiAssayExperimentObjects
| 2752 <br />275
|-
| [http://www.bioconductor.org/packages/release/bioc/html/recount.html recount]
| RangedSummarizedExperiment
| 418
|-
| [https://www.bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html curatedMetagenomicData]
| ExperimentHub
| 224
|}


=== Novocraft ===
== ISB Cancer Genomics Cloud (ISB-CGC) ==
https://hpc.nih.gov/apps/novocraft.html
https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis


Building the index is very quick (about 2 minutes for hg19). The following commands generate a file <hg19index> which is about 7.8GB in size.
The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.
<syntaxhighlight lang='bash'>
$ sinteractive --mem=32g -c 16
$ module load novocraft
$ novoindex hg19index $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa


# novoindex (3.8) - Universal k-mer index constructor.
[https://github.com/isb-cgc/ISB-CGC-Webapp ISB-CGC Web Application]
# (C) 2008 - 2011 NovoCraft Technologies Sdn Bhd
# novoindex hg19index /home/limingc/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
# Creating 55 indexing threads.
# Building with 14-mer and step of 2 bp.
tcmalloc: large alloc 1073750016 bytes == 0x13f8000 @  0x4008e4 0x56c989 0x40408c 0x40127b 0x4d20bb 0x402845
tcmalloc: large alloc 8344305664 bytes == 0x41402000 @  0x4008e4 0x56d6d3 0x40423a 0x40127b 0x4d20bb 0x402845
# novoindex construction dT = 116.1s
# Index memory size  7.771Gbyte.
# Done.
</syntaxhighlight>


=== [https://github.com/ekg/freebayes freebayes] ===
== CCLE, DepMap ==
The program gave different errors for some real datasets downloaded from GEO.  
* [https://www.nature.com/articles/s41586-019-1186-3 Next-generation characterization of the Cancer Cell Line Encyclopedia] 2019
* It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).
* [https://bioinformatics.mdanderson.org/Supplements/ResidualDisease/Reports/assembleCCLEClinical.html Assembling Clinical Information for the CCLE Data]
* Data download [https://depmap.org/portal/download/all/ Depmap]
* The [https://depmap.org/portal/ Dependency Map (DepMap)] portal aims to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies, along with analytical and visualization tools. See also [https://depmap.org/portal/ccle/ CCLE]
** '''sample_info.csv''' also available from the download page
** '''CCLE_RNAseq_reads.csv''': read counts from RSEM. 1406 x (54358 - 1). Use '''readr::read_csv()'''. range(x[, -1]) = 0 13018000. Note: log2(13018000) = 23.634.
** '''CCLE_expression_full.csv''': log2(TPM + 1). 1406 x (53971 - 1). range(x[, -1]) = 0.00000 17.78354
** '''CCLE_expression.csv''': log2(TPM+1). 1406 x 19221 genes. protein coding genes. 33 diseases.
** '''CCLE_expression_proteincoding_genes_expected_count.csv''': 1406 x (19222 - 1). read count (non-integers) data from RSEM for just protein coding genes. range(x[, -1]) = 0 13018000.
** '''CCLE_expression_transcripts_expected_count.csv''': read count data from RSEM. 1406 x (228138-1). Non-integers. range(x[, -1]) = 0 11664000.
<ul>
<li>[https://bioconductor.org/packages/release/data/experiment/html/depmap.html depmap] package: Cancer Dependency Map Data Package. The depmap package currently contains eight kinds of datasets, available through [http://www.bioconductor.org/packages/release/bioc/html/ExperimentHub.html ExperimentHub].
* RNA interference (RNAi) knockout data
* CRISPR-Cas9 knockout data
* WES copy number data
* CCLE Reverse Phase Protein Array data
* CCLE RNAseq gene expression data
* Cancer cell lines
* Mutation calls
* Drug Sensitivity
<pre>
R> eh <- ExperimentHub()
R> class(eh)
[1] "ExperimentHub"
attr(,"package")
[1] "ExperimentHub"


A small test dataset included with the software works.
R> rnai <- eh[["EH2260"]]
<syntaxhighlight lang='bash'>
R> class(rnai)
brb@brb-P45T-A:~/github$ sudo apt-get update
[1] "tbl_df"     "tbl"        "data.frame"
brb@brb-P45T-A:~/github$ sudo apt-get install cmake
</pre>
brb@brb-P45T-A:~/github$ git clone --recursive https://github.com/ekg/freebayes.git
</li>
brb@brb-P45T-A:~/github$ cd freebayes/
</ul>
brb@brb-P45T-A:~/github/freebayes$ make
brb@brb-P45T-A:~/github/freebayes$ make test  # Got some errors, so don't worry about 'make test'
brb@brb-P45T-A:~/github/freebayes/test/tiny$ ../../bin/freebayes -f q.fa NA12878.chr22.tiny.bam > tmp.vcf
brb@brb-P45T-A:~/github/freebayes/test/tiny$ ls -lt
total 360
-rw-rw-r-- 1 brb brb  15812 Oct  9 10:52 tmp.vcf
-rw-rw-r-- 1 brb brb 287213 Oct  9 09:33 NA12878.chr22.tiny.bam
-rw-rw-r-- 1 brb brb     96 Oct  9 09:33 NA12878.chr22.tiny.bam.bai
-rw-rw-r-- 1 brb brb  16307 Oct  9 09:33 NA12878.chr22.tiny.giab.vcf
-rw-rw-r-- 1 brb brb  12565 Oct  9 09:33 q.fa
-rw-rw-r-- 1 brb brb    16 Oct  9 09:33 q.fa.fai
-rw-rw-r-- 1 brb brb  2378 Oct  9 09:33 q_spiked.vcf.gz
-rw-rw-r-- 1 brb brb    91 Oct  9 09:33 q_spiked.vcf.gz.tbi
-rw-rw-r-- 1 brb brb  4305 Oct  9 09:33 q.vcf.gz
-rw-rw-r-- 1 brb brb    102 Oct  9 09:33 q.vcf.gz.tbi
brb@brb-P45T-A:~/github/freebayes/test/tiny$ ~/github/samtools/samtools view NA12878.chr22.tiny.bam | wc -l
3333
brb@brb-P45T-A:~/github/freebayes/test/tiny$ wc -l tmp.vcf
76 tmp.vcf
brb@brb-P45T-A:~/github/freebayes/test/tiny$ wc -l q.fa
207 q.fa
</syntaxhighlight>


To debug the code, we can capture both stdout and stderr in a file:
== NCI's Genomic Data Commons (GDC)/TCGA ==
<syntaxhighlight lang='bash'>
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).
../../bin/freebayes -f q.fa NA12878.chr22.tiny.bam > tmpFull.out 2>&1
</syntaxhighlight>
Then compare tmp.vcf and tmpFull.out files. We see
* total sites: 12280
* the first variant call happens at position 186. In the tmpFull.out file, most positions produce just 3 lines of output, but variant positions such as 186 produce a lot of output (lines 643 to 736).
* the output 'processing position XXX' is generated by AlleleParser::toNextPosition(), which is called by AlleleParser::getNextAlleles(), which in turn is called from the while() loop at line 126 of main() in freebayes.cpp.


==== freeBayes vs HaplotypeCaller ====
* [https://portal.gdc.cancer.gov/ NCI's GDC] - Genomic Data Commons Data Portal. Researchers can access over 3 PB of bigData from projects like CPTAC, TARGET and of course TCGA.
* https://www.biostars.org/p/174510/
** [https://gdc.cancer.gov/support/gdc-webinars GDC webinars]
* [http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/ Updated comparison of variant detection methods: Ensemble, FreeBayes and minimal BAM preparation pipelines]
** [https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations Study Abbreviations]
* [https://tcpaportal.org/tcpa/download.html MDAnderson] download, [https://tcpaportal.org/tcpa/faq.html FAQ], [https://bioinformatics.mdanderson.org/public-software/tcpa/ Available Software]
* [https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#tcgaanalyze_dea__tcgaanalyze_leveltab:_differential_expression_analysis_(dea) Working with TCGAbiolinks package]
** edgeR is used to find DE; see the TCGAanalyze_DEA() function and [https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html#TCGAanalyze_DEA__TCGAanalyze_LevelTab:_Differential_expression_analysis_(DEA) TCGAanalyze: Analyze data from TCGA]
* [https://bioconductor.org/packages/release/data/experiment/html/GSE62944.html GEO accession data GSE62944 as a SummarizedExperiment] and [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944 GEO] website.
* [https://www.ncbi.nlm.nih.gov/pubmed/25819073 BioXpress: an integrated RNA-seq-derived gene expression database for pan-cancer analysis.]
* https://github.com/srp33/TCGA_RNASeq_Clinical
* [https://www.biorxiv.org/content/early/2018/12/21/046904 MOGSA: integrative single sample gene-set analysis of multiple omics data] 2019. The data was obtained from TCGA and NCI60.
* [https://cancerres.aacrjournals.org/content/79/13/3514 RNA Sequencing of the NCI-60: Integration into CellMiner and CellMiner CDB]
* [https://www.tandfonline.com/doi/abs/10.1080/01621459.2020.1730853?journalCode=uasa20 Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data] 2020


=== [http://wiki.bits.vib.be/index.php/Varscan2 Varscan2] ===
=== GenomicDataCommons package ===
* http://dkoboldt.github.io/varscan/
* [[#GenomicDataCommons_package|See here]]
* https://github.com/dkoboldt/varscan
* [https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/ GenomicDataCommons] Example: UUID to TCGA and TARGET Barcode Translation


=== vcftools ===
=== NCI60 ===
https://vcftools.github.io/examples.html
[https://dtp.cancer.gov/discovery_development/nci-60/characterization.htm Molecular Characterization of the NCI-60]. NCI-ADR-RES and OVCAR-8 are derived from one another; SNB-19 and U251 are derived from the same patient.


This toolset can be used to perform the following operations on VCF files:
=== Case studies ===
[https://link.springer.com/content/pdf/10.1186/s40249-020-00662-x.pdf Expression of the SARS-CoV-2 cell receptor gene ACE2 in a wide variety of human tissues]


* Filter out specific variants
== NCI Proteomic Data Commons ==
* Compare files
https://pdc.cancer.gov/pdc/ vs https://gdc.cancer.gov/
* Summarize variants
* Convert to different file types
* Validate and merge files
* Create intersections and subsets of variants


<syntaxhighlight lang="bash">
== GTEx ==
wget http://sourceforge.net/projects/vcftools/files/vcftools_0.1.12b.tar.gz/download \
* [https://www.gtexportal.org/home/ Genotype-Tissue Expression (GTEx) project]
    -O vcftools_0.1.12b.tar.gz
* [https://master.bioconductor.org/packages/release/workflows/html/recountWorkflow.html recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor]
tar -xzvf vcftools_0.1.12b.tar.gz
* [https://www.biorxiv.org/content/10.1101/602367v1 Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset]
sudo mv vcftools_0.1.12b /opt/RNA-Seq/bin/
* [https://bioconductor.org/packages/release/bioc/html/variancePartition.html variancePartition] Quantify and interpret drivers of variation in multilevel gene expression experiments. [https://www.biologicalpsychiatryjournal.com/article/S0006-3223(20)31674-7/fulltext Transcriptomic Insight Into the Polygenic Mechanisms Underlying Psychiatric Disorders] 2020
export PERL5LIB=/opt/RNA-Seq/bin/vcftools_0.1.12b/perl/
cd /opt/RNA-Seq/bin/vcftools_0.1.12b/
make
export PATH=$PATH:/opt/RNA-Seq/bin/vcftools_0.1.12b/bin
ls bin
# fill-aa      vcf-annotate  vcf-convert      vcf-phased-join  vcf-subset
# fill-an-ac    vcf-compare    vcf-fix-ploidy  vcf-query        vcftools
# fill-fs      vcf-concat    vcf-indel-stats  vcf-shuffle-cols  vcf-to-tab
# fill-ref-md5  vcf-consensus  vcf-isec        vcf-sort          vcf-tstv
# man1          vcf-contrast  vcf-merge        vcf-stats        vcf-validator
</syntaxhighlight>
Some examples:
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
$ cd ~/SRP032789
$ vcftools --vcf GSM1261016_IP2-50_var.flt.vcf


VCFtools - v0.1.12b
== NIH LINCS ==
(C) Adam Auton and Anthony Marcketta 2009
* http://www.lincsproject.org/, http://www.lincsproject.org/LINCS/tools
* [https://bioconductor.org/packages/release/bioc/vignettes/slinky/inst/doc/LINCS-analysis.html slinky], [https://academic.oup.com/bioinformatics/article/35/17/3176/5284904 paper]
* [https://bioconductor.org/packages/release/bioc/html/cmapR.html cmapR]
* [https://pubmed.ncbi.nlm.nih.gov/29065900/ GRcalculator: an online tool for calculating and mining dose-response data]
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6381817/ Reproducible Bioconductor workflows using browser-based interactive notebooks and containers]


Parameters as interpreted:
== Sharing data ==
--vcf GSM1261016_IP2-50_var.flt.vcf
* [https://datascience.cancer.gov/data-sharing NCI Data Sharing]
* [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006472 Ten quick tips for sharing open genomic data] Brown et al, PLOS 2018


After filtering, kept 1 out of 1 Individuals
= Gene set analysis =
After filtering, kept 193609 out of a possible 193609 Sites
* [https://www.biostars.org/p/88926/ Over Representation Vs Enrichment Analysis]
Run Time = 1.00 seconds


$ wc -l GSM1261016_IP2-50_var.flt.vcf
== Hypergeometric test ==
193636 GSM1261016_IP2-50_var.flt.vcf
* [http://mygoblet.org/training-portal/courses/pathway-and-network-analysis-omics-data-2014 A course from bioinformatics.ca] and [http://mygoblet.org/sites/default/files/materrials/Pathways_2014_Module2.pdf over-represented pathway].
* [http://blog.nextgenetics.net/?e=94 How informative are enrichment analyses really?]


$ vcf-indel-stats < GSM1261016_IP2-50_var.flt.vcf > out.txt
== Next-generation sequencing data ==
Use of uninitialized value in pattern match (m//) at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
* [http://bioinformatics.oxfordjournals.org/content/32/17/i611.full Gene-set association tests for next-generation sequencing data]
Use of uninitialized value in concatenation (.) or string at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
<: No such file or directory at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 18.
main::error('<: No such file or directory') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 50
main::init_regions('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 71
main::do_stats('HASH(0xd77cb8)') called at /opt/RNA-Seq/bin/vcftools_0.1.12b/bin/vcf-indel-stats line 9
</pre>
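The <code>wc -l</code> count above (193636) is larger than the number of sites reported by vcftools (193609), most likely because <code>wc -l</code> also counts the VCF header lines, which start with <code>#</code>. A minimal sketch of separating the two counts is shown below. The file <code>toy.vcf</code> and its contents are made up for illustration and are not taken from the dataset above.

```shell
# Build a tiny illustrative VCF (hypothetical content, not real data).
tmpdir=$(mktemp -d)
vcf="$tmpdir/toy.vcf"
printf '##fileformat=VCFv4.1\n'                        >  "$vcf"
printf '##source=toy\n'                                >> "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n'             >> "$vcf"
printf 'chr1\t200\trs123\tC\tT\t60\tPASS\t.\n'         >> "$vcf"
printf 'chr2\t300\t.\tG\tGA\t40\tPASS\t.\n'            >> "$vcf"

# All lines, header lines (starting with '#'), and variant records.
total_lines=$(wc -l < "$vcf" | tr -d ' ')
header_lines=$(grep -c '^#' "$vcf")
record_lines=$(grep -vc '^#' "$vcf")

echo "total=$total_lines header=$header_lines records=$record_lines"
```

On a real file, the records count is the number to compare with the "Sites" figure printed by vcftools.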


To compare two vcf files, see https://vcftools.github.io/documentation.html
= Forums =
<pre>
* Biostars source code https://github.com/ialbert/biostar-central
./vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
* https://support.bioconductor.org/ (powered by Biostar too)
</pre>
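As a rough cross-check of such a comparison, the sites shared by two VCF files can also be listed with standard Unix tools. The sketch below is not how vcftools implements <code>--diff</code>; it matches records on CHROM and POS only, and the two toy files and the helper <code>site_keys</code> are made up for illustration.

```shell
# Two tiny illustrative VCF files (hypothetical content).
tmpdir=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\n' > "$tmpdir/a.vcf"
printf 'chr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\nchr2\t300\t.\tG\tA\n' >> "$tmpdir/a.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\n' > "$tmpdir/b.vcf"
printf 'chr1\t200\t.\tC\tT\nchr2\t300\t.\tG\tA\nchr3\t400\t.\tT\tC\n' >> "$tmpdir/b.vcf"

# Hypothetical helper: print sorted, unique CHROM:POS keys of the records.
site_keys() {
  grep -v '^#' "$1" | awk -F'\t' '{ print $1 ":" $2 }' | LC_ALL=C sort -u
}

site_keys "$tmpdir/a.vcf" > "$tmpdir/a.sites"
site_keys "$tmpdir/b.vcf" > "$tmpdir/b.sites"

# comm needs sorted input: -12 = in both, -23 = only in a, -13 = only in b.
shared=$(LC_ALL=C comm -12 "$tmpdir/a.sites" "$tmpdir/b.sites" | wc -l | tr -d ' ')
only_a=$(LC_ALL=C comm -23 "$tmpdir/a.sites" "$tmpdir/b.sites" | wc -l | tr -d ' ')
only_b=$(LC_ALL=C comm -13 "$tmpdir/a.sites" "$tmpdir/b.sites" | wc -l | tr -d ' ')

echo "shared=$shared only_a=$only_a only_b=$only_b"
```

Because REF and ALT are ignored, two different alleles at the same position count as the same site in this sketch.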


=== Online course on Variant calling ===
= Batch effect =
* edX: HarvardX: PH525.6x Case Study: Variant Discovery and Genotyping. Course notes are on their [https://github.com/hbc/edX/blob/master/edX_Notes.md Github] page.
[[Batch_effect|Batch effect]]


=== [http://genome.cshlp.org/content/27/8/1450.full GenomeVIP] ===
= Misc =
A cloud platform for genomic variant discovery and interpretation.
== Advice ==
* [https://widdowquinn.github.io/ten_great_papers/ Ten great papers for biologists starting out in computational biology]
* [https://github.com/nih-byob/presentations/tree/master/2019/01_bioinformatics_tips Bioinformatics advice I wish I learned 10 years ago] from NIH


=== PCA ===
== High Performance ==
[http://bwlewis.github.io/1000_genomes_examples/PCA.html PCA of genomic variant data across one chromosome from 2,504 people from the 1000 genomes project]
* https://www.youtube.com/watch?v=M3RVfv6lUtc NYCMC


== Variant Annotation ==
== Cloud Computing ==
See also the paper [http://bib.oxfordjournals.org/content/15/2/256.abstract A survey of tools for variant analysis of next-generation genome sequencing data].
* [https://github.com/VCCRI/Falco/ '''Falco''': A quick and flexible single-cell RNA-seq processing framework on the cloud]
* [https://youtu.be/cP5rvWoJDOQ Getting started with Bioconductor in the cloud]
* [https://www.rna-seqblog.com/micloud-a-bioinformatics-cloud-for-seamless-execution-of-complex-ngs-data-analysis-pipelines/ miCloud]: a bioinformatics cloud for seamless execution of complex NGS data analysis pipelines


[https://github.com/seandavi/awesome-cancer-variant-databases Awesome-cancer-variant-databases] - A community-maintained repository of cancer clinical knowledge bases and databases focused on cancer variants.
== Merge different datasets (different genechips) ==
* https://support.bioconductor.org/p/65506/


=== [http://www.ncbi.nlm.nih.gov/SNP/ dbSNP] ===
== Genomic data vs transcriptomic data ==
[https://support.bioconductor.org/p/33140/ SNPlocs data R package for Human]. Some [http://grokbase.com/t/r/bioconductor/131gt2jyfg/bioc-problem-locating-snp-by-rsid-for-snplocs-hsapiens-dbsnp-20120608-package-bioconductor-x clarification] about SNPlocs.Hsapiens.dbSNP.20120608 package.
* The main difference between genomic data and transcriptomic data is that genomic data provides information on the complete DNA sequence of an organism, while transcriptomic data provides information on the expression levels of genes.  
<pre>
* Genomic data:
> library(BSgenome)
** Genomic data refers to the complete DNA sequence of an organism, which includes all of its genes, regulatory regions, and non-coding regions. This type of data provides information on the genetic makeup of an organism, including its potential to develop certain diseases, its evolutionary history, and its overall genetic diversity.  
> available.SNPs()
** Examples of genomic data: Whole genome sequencing (WGS), Genome-wide association studies (GWAS), Copy number variation (CNV) analysis, Comparative genomics, Metagenomics.
[1] "SNPlocs.Hsapiens.dbSNP141.GRCh38"   
* Transcriptomic data
[2] "SNPlocs.Hsapiens.dbSNP142.GRCh37"   
** Transcriptomic data, on the other hand, refers to the collection of all RNA transcripts produced by the genes of an organism. RNA transcripts are produced when genes are transcribed into RNA molecules, which are then used as templates to synthesize proteins. Transcriptomic data provides information on the expression levels of genes, which can help researchers understand how genes are regulated and how they contribute to biological processes.
[3] "SNPlocs.Hsapiens.dbSNP.20090506"   
** Examples of transcriptomic data: RNA-Seq, Microarray, scRNA-Seq, qPCR, Ribosome profiling
[4] "SNPlocs.Hsapiens.dbSNP.20100427"   
[5] "SNPlocs.Hsapiens.dbSNP.20101109"   
[6] "SNPlocs.Hsapiens.dbSNP.20110815"   
[7] "SNPlocs.Hsapiens.dbSNP.20111119"   
[8] "SNPlocs.Hsapiens.dbSNP.20120608"   
[9] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
</pre>


[https://support.bioconductor.org/p/30078/ Query dbSNP]
== Low read count and filtering ==
* DESeq2 '''pre-filtering''':
** ''While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2.'' [https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering DESeq2 vignette].
** One can also omit this step entirely and just rely on the '''independent filtering''' procedures available in results(), either IHW or genefilter::filtered_p().
:<syntaxhighlight lang='r'>
smallestGroupSize <- 3 # smallest group size
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep, ]
</syntaxhighlight>
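The pre-filtering rule above (keep a gene if at least <code>smallestGroupSize</code> samples have a count of 10 or more) can also be sketched outside R on a tab-delimited count table. This only illustrates the rule and is not part of DESeq2; the file <code>counts.tsv</code>, the gene names and the awk variables <code>min_count</code> and <code>min_samples</code> are made up.

```shell
# Toy count table: first column is the gene, remaining columns are samples.
tmpdir=$(mktemp -d)
counts="$tmpdir/counts.tsv"
printf 'gene\ts1\ts2\ts3\ts4\n'   >  "$counts"
printf 'geneA\t12\t30\t11\t0\n'   >> "$counts"   # 3 samples >= 10: keep
printf 'geneB\t0\t2\t15\t40\n'    >> "$counts"   # 2 samples >= 10: drop
printf 'geneC\t100\t90\t80\t70\n' >> "$counts"   # 4 samples >= 10: keep
printf 'geneD\t9\t9\t9\t9\n'      >> "$counts"   # 0 samples >= 10: drop

# Print genes with at least min_samples samples at or above min_count.
kept=$(awk -F'\t' -v min_count=10 -v min_samples=3 '
  NR == 1 { next }                       # skip the header row
  {
    n = 0
    for (i = 2; i <= NF; i++) if ($i + 0 >= min_count) n++
    if (n >= min_samples) print $1
  }' "$counts")

echo "$kept"
```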


An example:
* '''glmnet(, exclude)'''. See its [https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet.pdf#page=27 vignette] for some examples, including computing a univariate t-test and excluding the 40% of genes with low t-statistics. Note that <span style="color: red">bias</span> could be introduced if both x and y are used to filter variables, unless the same procedure is applied during CV.
<pre>
* [https://combine-australia.github.io/RNAseq-R/slides/RNASeq_filtering_qc.pdf#page=10 RNA-seq: filtering, quality control and visualisation]. As a general rule, a good threshold can be chosen for a '''CPM''' value that corresponds to a count of 10.
wget https://github.com/samtools/bcftools/releases/download/1.2/bcftools-1.2.tar.bz2
* [https://pubmed.ncbi.nlm.nih.gov/26737772/ Effect of low-expression gene filtering on detection of differentially expressed genes in RNA-seq data] 2015
sudo tar jxf bcftools-1.2.tar.bz2 -C /opt/RNA-Seq/bin/
* [https://seqqc.wordpress.com/2020/02/17/removing-low-count-genes-for-rna-seq-downstream-analysis/ Removing low count genes for RNA-seq downstream analysis]
cd /opt/RNA-Seq/bin/bcftools-1.2/
* [https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/02_Preprocessing_Data.nb.html#filtering-the-genes RNA-seq analysis in R: Pre-processing RNA-seq data]. ''It keeps all genes where the total number of reads across all samples is greater than 5.''
sudo make
* [http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/RNAseq_DE_analysis_with_R.html RNAseq data analysis in R - Notebook]. ''A common way to do this is by filtering out genes having less than 1 count-per-million reads ('''cpm''') in half the samples. The “edgeR” library provides the '''cpm''' function which can be used here.''
* [https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html#pre-filtering-rows-with-very-small-counts RNA-seq workflow - gene-level exploratory analysis and differential expression]. ''we will remove those genes which have a total count of less than 5.''
* [https://biocorecrg.github.io/RNAseq_course_2019/differential_expression.html  Differential expression analysis] workshop material. It referred to this paper [https://f1000research.com/articles/5-1438/v2 From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline]. It uses the criterion ''we keep genes that have '''CPM''' values above 0.5 in at least two libraries''.
* [https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html RNA Sequence Analysis in R: edgeR] which also uses '''cpm''' to filter genes.


wget https://github.com/samtools/htslib/releases/download/1.2.1/htslib-1.2.1.tar.bz2
=== Independent Filtering ===
sudo tar jxf htslib-1.2.1.tar.bz2 -C /opt/RNA-Seq/bin/
<ul>
cd /opt/RNA-Seq/bin/htslib-1.2.1/
<li>[https://bioconductor.org/packages/release/bioc/html/genefilter.html genefilter] package. [https://bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/independent_filtering_plots.pdf Independent filtering vignette].
sudo make  # create tabix, htsfile, bgzip commands
* In the 1st plot, the legend represents different threshold values (θ). For instance, when θ = 0.1, we filter out 10% of hypotheses before conducting multiple testing.
* The 1st plot shows that, at a fixed FDR cutoff, the larger the theta, the more hypotheses are rejected. However, the plot does not cover larger theta values.
* The second plot indicates the optimal theta threshold for filtering genes. 
* In this example, removing 60% of the genes rejects the most hypotheses, i.e. returns the most discoveries.
* It can also be seen that very large values of theta could reduce power.
* The '''filter parameter''' in [https://www.rdocumentation.org/packages/genefilter/versions/1.54.2/topics/filtered_p filtered_p() or filtered_R()] represents '''ranks'''. That is, using, for example, S2(=rowVars()) or S3=rank(S2) as the '''filter''' returns the same result.
:<syntaxhighlight lang='r'>
# filtered_p: Returns BH-adjusted p-values for the different thresholds defined in theta.
#             This can be used to generate the 1st plot.
filtered_p(filter, test, theta, data, method = "none")
# filtered_R: Returns the number of rejections at the specified BH cutoff (alpha)
#             for the different thresholds.
#             This can be used to generate the 2nd & 3rd plots and find the optimal threshold.
filtered_R(alpha, filter, test, theta, data, method = "none")
</syntaxhighlight>
* The last plot uses the sample mean instead of the sample variance as the filter statistic. As we can see, only 10% of genes are filtered out.
<br />
[[File:Filtered_p.png|220px]] [[File:Filtered R.png|220px]] [[File:Filtered R mean.png|220px]]


export bdge_bowtie_PATH=/opt/RNA-Seq/bin/bowtie2-2.2.1
<li>DESeq2 [https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt Independent filtering of results]. '''Independent filtering is performed AFTER the differential expression analysis (DEA) in DESeq2.'''
export bdge_tophat_PATH=/opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64
* The results function of the DESeq2 package performs independent filtering by default using the '''mean of normalized counts''' as a '''filter statistic'''.
export bdge_samtools_PATH=/opt/RNA-Seq/bin/samtools-0.1.19
* A threshold on the '''filter statistic''' is found which optimizes the number of adjusted p values lower than a significance level alpha.
export PATH=$bdge_bowtie_PATH:$bdge_samtools_PATH:$bdge_tophat_PATH:$PATH
* The adjusted p values for the genes which do not pass the filter threshold are set to '''NA'''.
export PATH=/opt/RNA-Seq/bin/bcftools-1.2/:$PATH
export PATH=/opt/RNA-Seq/bin/htslib-1.2.1/:$PATH


wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz.tbi
<li>[https://youtu.be/Gi0JdrxRq5s StatQuest: edgeR and DESeq2, part 2]. Mitigate the multiple testing problem. [https://statquest.org/statquest-filtering-genes-with-low-read-counts/ code], [https://techal.org/mitigating-the-multiple-testing-problem-independent-filtering-with-edger-and-deseq2 written guide].
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/common_all_20150603.vcf.gz
* Simulated data show '''[https://youtu.be/Gi0JdrxRq5s?si=1klGB8taFVnZ1B9v&t=462 FDR is doing a great job limiting number of <span style="color: orange">False Positives</span> in the significant results, but it's not awesome keeping the <span style="color: green">True Positives</span>]'''.
cd TNBC1
* [https://youtu.be/Gi0JdrxRq5s?si=1pJkbX6_ltrAB0Nx&t=542 Each time we increase the number of bogus tests, we reduce the number of True Positive p-values < .05 that survive the FDR adjustment].
mv ~/Downloads/common_all_20150603.vcf.gz* .
* [https://youtu.be/Gi0JdrxRq5s?si=OAsr2lI92sf0429P&t=580 A plot of True/False Positives vs Number of bogus tests]. That means FDR still has its limitations. <s>The independent filtering helps address this limitation.</s>
bgzip -c var.raw.vcf > var.raw.vcf.gz
* [https://youtu.be/Gi0JdrxRq5s?si=qDEhvzHcI7_M3eX_&t=637 edgeR recommends removing all genes except those with > 1 CPM in 2 or more samples]. It compensates for differences in read depth between libraries. The CPM cutoff "1" depends on the data. [https://youtu.be/Gi0JdrxRq5s?si=m6YVfwkIBwZGd1dN&t=924 How can we determine what a good CPM cutoff is?].  
tabix var.raw.vcf.gz
* [https://youtu.be/Gi0JdrxRq5s?si=LX2m1mlFVZH9REzY&t=983 Differences of edgeR and DESeq2].  
bcftools annotate -c ID -a common_all_20150603.vcf.gz var.raw.vcf.gz > var_annot.vcf
* [https://youtu.be/Gi0JdrxRq5s?si=vZc0m5wdoABAkrQ6&t=1160 Y-axis is number of significant genes, X-axis is the threshold/quantiles]. Choose the threshold/quantile such that the number of significant genes is maximized.
</pre>
* [https://youtu.be/Gi0JdrxRq5s?si=1cdGVRdEIigXp6IW&t=1267 Advice]. We see the filtering was done after calculating p-values.  


Any found in dbSNP?
<li>[https://bioconductor.org/help/course-materials/2011/CSAMA/Thursday/Morning%20Talks/110629-multtestindepfilt-huber.pdf Sensitivity, Specificity, ROC, Multiple testing, Independent filtering] by Wolfgang Huber.
<pre>
<li>[https://pubmed.ncbi.nlm.nih.gov/20460310/ Independent filtering increases detection power for high-throughput experiments]. Richard Bourgon 2010.
grep -c 'rs[0-9]' raw_snps.vcf
<li>[https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/05b_wald_test_results.html Exploring DESeq2 results: Wald test]. 3. Genes with low mean normalized counts.
</pre>
<li>[https://www.r-bloggers.com/2012/09/deseq-vs-edger-comparison/ DESeq vs edgeR Comparison].
</ul>
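Note that <code>grep -c 'rs[0-9]'</code> counts every line that contains an rs-like string anywhere, including header or INFO text. A slightly stricter sketch restricts the match to the ID column (column 3) of non-header records. The toy VCF below is made up for illustration.

```shell
# Toy annotated VCF (hypothetical content); the header mentions "rs123" on purpose.
tmpdir=$(mktemp -d)
vcf="$tmpdir/annot.vcf"
printf '##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID, e.g. rs123">\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\n'          >> "$vcf"
printf 'chr1\t100\trs111\tA\tG\n'             >> "$vcf"
printf 'chr1\t200\t.\tC\tT\n'                 >> "$vcf"
printf 'chr2\t300\trs222;rs333\tG\tA\n'       >> "$vcf"

# Loose count: any line containing rs<digit>, header lines included.
loose=$(grep -c 'rs[0-9]' "$vcf")

# Stricter count: non-header records whose ID column starts with rs<digits>.
strict=$(awk -F'\t' '!/^#/ && $3 ~ /^rs[0-9]+/ { n++ } END { print n + 0 }' "$vcf")

echo "loose=$loose strict=$strict"
```

Here the loose count includes the header line that happens to mention an rs number, while the stricter count reflects only annotated records.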


=== ANNOVAR ===
=== edgeR::filterByExpr ===
[http://annovar.openbioinformatics.org/en/latest/ ANNOVAR] and the web based interface to ANNOVAR [http://wannovar.usc.edu/index.php wANNOVAR]. ANNOVAR can annotate genetic variants using
* [https://rdrr.io/bioc/edgeR/man/filterByExpr.html ?filterByExpr]
* Gene-based annotation
* [https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf#page=71 limma] user guide
* Region-based annotation
* [https://support.bioconductor.org/p/9151145/ Is it okay to increase filtering to get more significant Adjusted P Values for DEGs found using Limma?]
* Filter-based annotation
* [https://support.bioconductor.org/p/9157095/ DESeq2 filtering vs edgeR filtering]
* Other functionalities


Note that
== Normalization ==
* [https://github.com/JhuangLab/annovarR annovarR] R package. The wrapper functions of annovarR unified the interface of many published annotation tools, such as VEP, ANNOVAR, vcfanno and AnnotationDbi.
* [http://nar.oxfordjournals.org/content/early/2015/07/21/nar.gkv736.long How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets] 2015
* It is correct to use the original download link from the email to download the latest version.
* [http://www.biomedcentral.com/1471-2105/16/347 Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data] 2015
* annovar folder needs to be placed under a directory with write permission
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2382-0 Expression analysis of RNA sequencing data from human neural and glial cell lines depends on technical replication and normalization methods] 2018
* If we run annovar with a new reference genome, it will need to download some databases. While annovar is downloading a database, the CPU is mostly idle, so the whole process can look as if it is hanging.
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2745-1 A statistical normalization method and differential expression analysis for RNA-seq data between different species] Zhou 2019
* Annovar will print out a warning with information about changes if there is a new version available.
* [https://github.com/crazyhottommy/RNA-seq-analysis RNA-seq analysis] some notes from Ming Tang
* BRB-SeqTools (''grep "\.pl" preprocessgui/*.*'') uses '''annotate_variation.pl''' (5), '''convert2annovar.pl''' (2), '''table_annovar.pl''' (4), '''variants_reduction.pl''' (1).
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3247-x A protocol to evaluate RNA sequencing normalization methods] Abrams et al 2019. It concluded that transcripts per million ('''TPM''') was the best performing normalization method based on its preservation of biological signal as compared to the other methods tested.
* In [https://hpc.nih.gov/apps/ANNOVAR.html nih/biowulf], it creates ''$ANNOVAR_HOME'' and ''$ANNOVAR_DATA'' environment variables.
* [https://bioconductor.github.io/BiocWorkshops/rna-seq-analysis-is-easy-as-1-2-3-with-limma-glimma-and-edger.html#data-pre-processing Data pre-processing] 6.6.3 Normalising gene expression distributions. log2 CPM (counts per million) was considered. edgeR::calcNormFactors(x, method = "TMM") was used to normalize counts.
* To check the version of my local copy
* [https://morphoscape.wordpress.com/2020/09/26/generalized-linear-models-and-plots-with-edger-advanced-differential-expression-analysis/ Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis]
<syntaxhighlight lang='bash'>
* [https://www.nature.com/articles/srep18898 CrossNorm: a novel normalization strategy for microarray data in cancers] Cheng 2016
~/annovar/annotate_variation.pl
* [https://rdrr.io/bioc/EnrichmentBrowser/man/normalize.html EnrichmentBrowser::normalize()]
</syntaxhighlight>
* [http://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html SingleCellExperiment] vignette. It lists 4 special assays: counts, logcounts, '''cpm''' and '''tpm'''.
* Contents of annovar
* [https://compgenomr.github.io/book/gene-expression-analysis-using-high-throughput-sequencing-technologies.html#within-sample-normalization-of-the-read-counts 8.3.4 Within sample normalization of the read counts] and [https://compgenomr.github.io/book/gene-expression-analysis-using-high-throughput-sequencing-technologies.html#computing-different-normalization-schemes-in-r 8.3.5 Computing different normalization schemes in R] from Computational Genomics with R by Altuna Akalin
<syntaxhighlight lang='bash'>
** '''CPM''' is the simplest method. It addresses the '''sequencing depth''' bias by normalizing the read counts per gene by dividing each gene’s read count by a certain value and multiplying it by 10^6. There are 3 ways for the denominator: Total Counts Normalization, Upper Quartile Normalization and Median Normalization.  
brb@T3600 ~ $ ls -l ~/annovar/
** Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase per million mapped reads) and TPM (transcripts per million).
total 476
** '''RPKM''' is obtained by dividing the CPM value by another factor, the gene length in kilobases. FPKM (substitute ''reads'' with ''fragments'') is the same as RPKM, but is used for paired-end reads. So RPKM differs from CPM by one additional step.
-rwxr-xr-x 1 brb brb 212090 Mar  7 14:59 annotate_variation.pl
** '''TPM''' also controls for both the library size and the gene lengths. With the TPM method, however, the read counts are first normalized by the gene length (per kilobase), and then the gene-length normalized values are divided by their sum and multiplied by 10^6.
-rwxr-xr-x 1 brb brb  13589 Mar  7 14:59 coding_change.pl
** '''Library composition''' or '''RNA composition'''. In DESeq2 the read counts are normalized by computing size factors, which addresses the differences not only in the library sizes, but also the ''library compositions''. See [https://compgenomr.github.io/book/gene-expression-analysis-using-high-throughput-sequencing-technologies.html#differential-expression-analysis 8.3.7 Differential expression analysis]
-rwxr-xr-x 1 brb brb 166582 Mar  7 14:59 convert2annovar.pl
*** [https://youtu.be/Wdt6jdi-NQo?t=54 Different samples contain different active genes]
drwxr-xr-x 2 brb brb  4096 Jun 18  2015 example
*** [https://chipster.csc.fi/manual/deseq2.html This procedure corrects for library size and RNA composition bias, which can arise for example when only a small number of genes are very highly expressed in one experiment condition but not in the other].
drwxr-xr-x 3 brb brb  4096 Mar 21 09:59 humandb
*** [https://hbctraining.github.io/DGE_workshop_salmon/lessons/01_DGE_setup_and_overview.html DESeq2 first normalizes the count data to account for differences in library sizes and RNA composition between samples].
-rwxr-xr-x 1 brb brb  19419 Mar  7 14:59 retrieve_seq_from_fasta.pl
*** A few ''highly'' differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. [https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html Introduction to DGE] gives a nice graphical illustration of RNA composition.
-rwxr-xr-x 1 brb brb  34682 Mar  7 14:59 table_annovar.pl
*** [https://youtu.be/UFB993xufUU?t=215 StatQuest: DESeq2, part 1, Library Normalization] - adjusting for differences in library composition
-rwxr-xr-x 1 brb brb  21774 Mar  7 14:59 variants_reduction.pl
*** '''Sequencing bias detection''' from [https://www.bioconductor.org/packages/release/bioc/vignettes/NOISeq/inst/doc/NOISeq.pdf#page=12 RNA composition] from the NOISeq's vignette.
brb@T3600 ~ $ ls -lh ~/annovar/humandb | head
* [https://academic.oup.com/bib/article/19/5/776/3056951#supplementary-data Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions] Evans 2018. R code is in github. [http://www.bioconductor.org/packages/release/data/experiment/html/seqc.html seqc] package from Bioconductor.
total 36G
* [https://www.biorxiv.org/content/10.1101/2021.03.11.435043v2 Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data] Vandenbon 2021. Methods that correct for differences in gene length (RPKM and FPKM) do not affect the correlation values; that is, for this purpose these methods are equivalent to '''CPM''' normalization.
-rw-r--r-- 1 brb brb  927 Mar 21 09:59 annovar_downdb.log
* [https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-021-02936-w TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository] Zhao 2021
drwxr-xr-x 2 brb brb 4.0K Jun 18  2015 genometrax-sample-files-gff
* Youtube
-rw-r--r-- 1 brb brb  20K Mar  7 14:59 GRCh37_MT_ensGeneMrna.fa
** [https://youtu.be/UFB993xufUU StatQuest: DESeq2, part 1, Library Normalization], [http://bioconductor.org/books/release/OSCA/data-infrastructure.html scran::computeSumFactors()]
-rw-r--r-- 1 brb brb 3.1K Mar  7 14:59 GRCh37_MT_ensGene.txt
** [https://youtu.be/Wdt6jdi-NQo StatQuest: edgeR, part 1, Library Normalization] (TMM)
-rw-r--r-- 1 brb brb 1.4G Dec 15  2014 hg19_AFR.sites.2014_10.txt
** [https://youtu.be/TTUrtCY2k-w RPKM, FPKM and TPM, Clearly Explained!!!]
-rw-r--r-- 1 brb brb  87M Dec 15  2014 hg19_AFR.sites.2014_10.txt.idx
** [https://www.youtube.com/watch?v=tlf6wYJrwKY&list=PLblh5JKOoLUJo2Q6xK4tZElbIvAACEykp High Throughput Sequencing]
-rw-r--r-- 1 brb brb 2.8G Dec 15  2014 hg19_ALL.sites.2014_10.txt
* [https://www.rna-seqblog.com/robust-normalization-and-transformation-techniques-for-constructing-gene-coexpression-networks-from-rna-seq-data/ Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data] 2022
-rw-r--r-- 1 brb brb  89M Dec 15  2014 hg19_ALL.sites.2014_10.txt.idx
-rw-r--r-- 1 brb brb 978M Dec 15  2014 hg19_AMR.sites.2014_10.txt
</syntaxhighlight>
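The CPM, RPKM and TPM definitions quoted above from ''Computational Genomics with R'' can be sketched with a few lines of awk. This is only an illustration under stated assumptions: the function name ''normalize_counts'', the three-column input layout (gene, gene length in bp, raw count) and the toy numbers are all hypothetical, and the total count is used as the CPM denominator.

<syntaxhighlight lang='bash'>
# Hypothetical helper: reads "gene <TAB> length_bp <TAB> raw_count" from stdin
# and prints CPM, RPKM and TPM for each gene.
normalize_counts() {
  awk -F'\t' -v OFS='\t' '
    { gene[NR] = $1; len[NR] = $2; cnt[NR] = $3
      total += $3                   # library size (total counts)
      rpk[NR] = $3 / ($2 / 1000)    # reads per kilobase of gene length
      rpk_sum += rpk[NR] }
    END {
      print "gene", "CPM", "RPKM", "TPM"
      for (i = 1; i <= NR; i++) {
        cpm  = cnt[i] / total * 1e6      # depth-normalized count
        rpkm = cpm / (len[i] / 1000)     # CPM divided by gene length in kb
        tpm  = rpk[i] / rpk_sum * 1e6    # length-normalize first, then rescale to 10^6
        printf "%s\t%.1f\t%.1f\t%.1f\n", gene[i], cpm, rpkm, tpm
      }
    }'
}

# Toy data (made up for illustration)
printf 'geneA\t2000\t100\ngeneB\t1000\t100\ngeneC\t500\t300\n' | normalize_counts
# gene   CPM       RPKM       TPM
# geneA  200000.0  100000.0   66666.7
# geneB  200000.0  200000.0   133333.3
# geneC  600000.0  1200000.0  800000.0
</syntaxhighlight>

Note that the TPM column sums to 10^6 by construction, while the RPKM column does not; this is the practical difference between the two orders of normalization.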


=== [http://snpeff.sourceforge.net/ SnpEff & SnpSift] ===
=== log2 transformation ===
* http://snpeff.sourceforge.net/SnpEff.html (basic information)
Whether we use TPM, TMM, FPKM, or DESeq2 normalized counts, we still need to apply a log2(x+1) transformation before any analyses.
* http://snpeff.sourceforge.net/SnpEff_manual.html (Manual)
* http://snpeff.sourceforge.net/protocol.html (6 examples)
* https://wiki.hpcc.msu.edu/display/Bioinfo/SnpEff+-+The+Basics
* https://docs.uabgrid.uab.edu/wiki/Galaxy_DNA-Seq_Tutorial
* http://www.genome.gov/Pages/Research/DIR/DIRNewsFeatures/Next-Gen101/Teer_VariantAnnotation.pdf
* http://veda.cs.uiuc.edu/course2013/pres/fields/lab/Variant_Calling_v1.pptx (clicking the link downloads the pptx file)
* In BRB-SeqTools, '''snpEff.jar''' (2) and '''SnpSift.jar''' (5)
* NIH/biowulf also has installed [https://hpc.nih.gov/apps/snpEff.html snpEff & SnpSift]
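
A minimal sketch of the log2(x+1) transformation mentioned in the log2 transformation note, applied to a tab-separated expression table. The function name ''log2p1'' and the table layout (a header row, gene IDs in the first column) are assumptions made for the example.

<syntaxhighlight lang='bash'>
# Hypothetical helper: apply log2(x + 1) to every value column of a table on stdin.
log2p1() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { print; next }    # keep the header row unchanged
    { for (i = 2; i <= NF; i++)
        $i = sprintf("%.4f", log($i + 1) / log(2))   # awk log() is the natural log
      print }'
}

# Toy data (made up for illustration)
printf 'gene\ts1\ts2\ngeneA\t0\t3\ngeneB\t7\t1023\n' | log2p1
</syntaxhighlight>

The +1 pseudocount keeps zero counts finite: a value of 0 maps to 0 rather than to minus infinity.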


==== SnpEff: Genetic variant '''annotation''' and '''effect prediction''' toolbox ====
=== Quantile normalization ===
# Input: vcf & reference genome database (eg GRCh38.79).  
* https://en.wikipedia.org/wiki/Quantile_normalization. The new values are the '''averaged''' '''ordered''' values per gene based on the original rank.
# Output: vcf, ''snpEff_summary.html'' and ''snpEff_genes.txt'' files.
* [https://www.r-bloggers.com/2024/03/mastering-quantile-normalization-in-r-a-step-by-step-guide/ Mastering Quantile Normalization in R: A Step-by-Step Guide]. The quantiles of the normalized data are consistent across the different datasets.
<syntaxhighlight lang='bash'>
* [https://www.biostars.org/p/286278/ Question: Quantile normalizing prior to or after TPM scaling?]
wget http://iweb.dl.sourceforge.net/project/snpeff/snpEff_latest_core.zip
* [https://www.biorxiv.org/content/biorxiv/early/2014/12/04/012203.full.pdf When to use Quantile Normalization?] and its R package [http://www.bioconductor.org/packages/release/bioc/html/quantro.html quantro]
sudo unzip snpEff_latest_core.zip -d /opt/RNA-Seq/bin
* [https://davetang.org/muse/2014/07/07/quantile-normalisation-in-r/ Quantile normalisation in R]
export PATH=/opt/RNA-Seq/bin/snpEff/:$PATH
<ul>
 
<li>normalize.quantiles() from preprocessCore package. [https://www.statology.org/quantile-normalization-in-r/ How to Perform Quantile Normalization in R]
# Next we want to download snpEff database.
* for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
# 1. Note that the database is snpEff version dependent
* I ran into an error when using the function in an RStudio docker container, but the solution [https://support.bioconductor.org/p/122925/#9135989 here] ('''BiocManager::install("preprocessCore", configure.args="--disable-threading")''') works.
# 2. Instead of using the command line (very slow < 1MB/s),
<syntaxhighlight lang='rsplus'>
#  java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar databases | grep GRCh
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
#  java -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar download GRCh38.79
BiocManager::install('preprocessCore')
# we just download the file using the web browser
#load package
#  http://sourceforge.net/projects/snpeff/files/databases/v4_1/ # 525MB
library(preprocessCore)
# 3. the top folder of the zip file is called 'data'. We will need to unzip it to the snpEff directory.
   
#  That is, snpEff +- data
#the function expects a matrix
#                  |- examples
#create a matrix using the same example
#                  |- galaxy
mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
#                  +- scripts
            ncol=3)
mv ~/Downloads/snpEff_v4_1_GRCh38.79.zip .
mat
unzip snpEff_v4_1_GRCh38.79.zip
#     [,1] [,2] [,3]
sudo java -Xmx4G -jar /opt/RNA-Seq/bin/snpEff/snpEff.jar \
#[1,]    5    4    3
                      -i vcf -o vcf GRCh38.79 var_annot.vcf > var_annot_snpEff.vcf
#[2,]    2    1    4
 
#[3,]    3    4    6
brb@T3600 ~ $ ls -l /opt/SeqTools/bin/snpEff/
#[4,]    4    2    8
total 44504
drwxrwxr-x 5 brb brb    4096 May  5 11:24 data
#quantile normalisation
drwxr-xr-x 2 brb brb    4096 Feb 17 16:37 examples
normalize.quantiles(mat)
drwxr-xr-x 3 brb brb    4096 Feb 17 16:37 galaxy
#        [,1]    [,2]    [,3]
drwxr-xr-x 3 brb brb    4096 Feb 17 16:37 scripts
#[1,] 5.666667 5.166667 2.000000
-rw-r--r-- 1 brb brb  6138594 Dec  5 10:49 snpEff.config
#[2,] 2.000000 2.000000 3.000000
-rw-r--r-- 1 brb brb 20698856 Dec  5 10:49 snpEff.jar
#[3,] 3.000000 5.166667 4.666667
-rw-r--r-- 1 brb brb 18712032 Dec  5 10:49 SnpSift.jar
#[4,] 4.666667 3.000000 5.666667
</syntaxhighlight>
 
'''Output file''' (compare ''cosmic_dbsnp_rem.vcf'' and ''snpeff_anno.vcf'' at [https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp here]):
* add 5 lines at the header. Search "ID=ANN" at [https://github.com/arraytools/vc-annotation/blob/master/snpeff/tmp/snpeff_anno.vcf snpeff_anno.vcf]
<pre>
##SnpEffVersion="4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpEffCmd="SnpEff -no-downstream -no-upstream ....
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID |...
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. ...
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. ...
</pre>
* added functional annotations '''ANN''' in the INFO field. See an example at [http://snpeff.sourceforge.net/SnpEff_manual.html Basic example: Annotate using SnpEff]
<pre>
ANN=A|missense_variant|MODERATE|ISG15|ISG15|transcript|NM_005101.3|protein_coding|2/2|c.248G>A|p.Ser83Asn|355/666|248/498|83/165||
</pre>
 
==== SnpSift: SnpSift helps filtering and manipulating genomic annotated files ====
Once you have annotated your files using SnpEff, you can use SnpSift to filter large genomic datasets in order to find the most significant variants.
 
'''[http://snpeff.sourceforge.net/SnpSift.html#filter SnpSift filter]'''
<syntaxhighlight lang='bash'>
cat "$outputDir/tmp/$tmpfd/snpeff_anno.vcf" | \
    java -jar "$seqtools_snpeff/SnpSift.jar" \
      filter "(ANN[*].BIOTYPE = 'protein_coding') | (ANN[*].EFFECT has 'splice')"  \
    > "$outputDir/tmp/$tmpfd/snpeff_proteincoding.vcf"
</syntaxhighlight>
</syntaxhighlight>
</li>
</ul>
=== Distribution, density plot ===
[https://www.researchgate.net/figure/Density-plot-showing-the-distribution-of-RNA-seq-read-counts-FPKM-of-PEG-treated_fig2_318575379 Density plot showing the distribution of RNA-seq read counts (FPKM)] log10(FPKM)


The output file (compare ''snpeff_anno.vcf'' and ''snpeff_proteincoding.vcf'' at [https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp here]):
=== Negative binomial distribution ===
* the header will add 3 lines
[https://support.bioconductor.org/p/74572/ RNA-seq and Negative binomial distribution]
<pre>
##SnpSiftVersion="SnpSift 4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpSiftCmd="SnpSift filter '(ANN[*].BIOTYPE = 'protein_coding') | (ANN[*].EFFECT has 'splice')'"
##FILTER=<ID=SnpSift,Description="SnpSift 4.2 ...
</pre>
* KEEP variants satisfying the filter criterion (in the ANN field, BIOTYPE=protein_coding '''or''' EFFECT contains 'splice'; the ''|'' in the filter expression is a logical OR). So the number of variants is reduced. No new fields are added.


'''[http://snpeff.sourceforge.net/SnpSift.html#dbNSFP SnpSift dbNSFP]'''
== Z-score transformation ==
<syntaxhighlight lang='bash'>
* This practice has been used extensively in papers without a clear foundation.
java -jar "$seqtools_snpeff/SnpSift.jar" dbNSFP -f $dbnsfpField \
* [https://www.jmdjournal.org/article/S1525-1578(10)60455-2/fulltext Analysis of Microarray Data Using Z Score Transformation] Cheadle 2003. Z-normalization was calculated '''per gene'''.
    -v -db "$dbnsfpFile" "$outputDir/tmp/$tmpfd/nonsyn_splicing2.vcf" \
** For example, [https://www.nature.com/articles/s41598-020-66986-8 ssGSEA score-based Ras dependency indexes derived from gene expression data reveal potential Ras addiction mechanisms with possible clinical implications] applies z-scores on each gene AND ssGSEA scores for each pathway/gene signature.
    > "$outputDir/tmp/$tmpfd/nonsyn_splicing_dbnsfp.vcf"
* [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085150 Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures] Zwiener 2014. Often, standardizing gene expressions is implemented as a default in software packages.
</syntaxhighlight>
* [https://bioinformatics.stackexchange.com/a/2181 RNAseq: Z score, Intensity, and Resources]. ''For visualization in heatmaps or for other clustering (e.g., k-means, fuzzy) it is useful to use z-scores.''
Compare ''nonsyn_splicing2.vcf'' and ''[https://github.com/arraytools/vc-annotation/blob/master/snpeff/tmp/nonsyn_splicing_dbnsfp.vcf nonsyn_splicing_dbnsfp.vcf]'' at [https://github.com/arraytools/vc-annotation/tree/master/snpeff/tmp github]
output file
* the header adds 29 lines.
<pre>
##SnpSiftCmd="SnpSift dbnsfp -f SIFT_score,SIFT_pred, ...
##INFO=<ID=dbNSFP_GERP___RS,Number=A,Type=Float,Description="Field 'GERP++_RS' from dbNSFP">
##INFO=<ID=dbNSFP_CADD_phred,Number=A,Type=Float,Description="Field 'CADD_phred' from dbNSFP">
...
</pre>
* the INFO field will add
<pre>
dbNSFP_CADD_phred=2.276;dbNSFP_CADD_raw=-0.033441;dbNSFP_FATHMM_pred=.,.,T; ...
</pre>
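
A minimal sketch of the '''per gene''' z-score transformation discussed in the Z-score section (each gene's values across samples are centered and scaled). The function name ''zscore_rows'' and the input layout (gene ID followed by one value per sample, tab-separated) are assumptions; the sample standard deviation (n - 1 denominator) is used, and genes with zero variance are reported as NA.

<syntaxhighlight lang='bash'>
# Hypothetical helper: z-score each row (gene) of a tab-separated table on stdin.
zscore_rows() {
  awk -F'\t' -v OFS='\t' '
    { n = NF - 1; sum = 0; ss = 0
      for (i = 2; i <= NF; i++) sum += $i
      mean = sum / n
      for (i = 2; i <= NF; i++) ss += ($i - mean) ^ 2
      sd = (n > 1) ? sqrt(ss / (n - 1)) : 0     # sample standard deviation
      out = $1
      for (i = 2; i <= NF; i++)
        out = out OFS (sd > 0 ? sprintf("%.3f", ($i - mean) / sd) : "NA")
      print out }'
}

# Toy data (made up for illustration)
printf 'geneA\t2\t4\t6\n' | zscore_rows
# geneA  -1.000  0.000  1.000
</syntaxhighlight>

In practice the input would usually be log-transformed expression values rather than raw counts.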


'''[http://snpeff.sourceforge.net/SnpSift.html#Extract SnpSift extractFields]'''
== Ensembl to gene symbol ==
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >java -jar "$seqtools_snpeff/SnpSift.jar" extractFields \
* http://useast.ensembl.org/index.html (seems down). [http://training.ensembl.org/exercises Training].
    -s "," -e "." "$outputDir/tmp/$tmpfd/nonsyn_splicing_dbnsfp.vcf" CHROM POS ID REF ALT "ANN[0].EFFECT" "ANN[0].IMPACT" "ANN[0].GENE" "ANN[0].GENEID" "ANN[0].FEATURE" "ANN[0].FEATUREID" "ANN[0].BIOTYPE" "ANN[0].HGVS_C" "ANN[0].HGVS_P" dbNSFP_SIFT_score dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_score dbNSFP_Polyphen2_HDIV_pred dbNSFP_Polyphen2_HVAR_score dbNSFP_Polyphen2_HVAR_pred dbNSFP_LRT_score dbNSFP_LRT_pred dbNSFP_MutationTaster_score dbNSFP_MutationTaster_pred dbNSFP_MutationAssessor_score dbNSFP_MutationAssessor_pred dbNSFP_FATHMM_score dbNSFP_FATHMM_pred dbNSFP_PROVEAN_score dbNSFP_PROVEAN_pred dbNSFP_VEST3_score dbNSFP_CADD_raw dbNSFP_CADD_phred dbNSFP_MetaSVM_score dbNSFP_MetaSVM_pred dbNSFP_MetaLR_score dbNSFP_MetaLR_pred dbNSFP_GERP___NR dbNSFP_GERP___RS dbNSFP_phyloP100way_vertebrate dbNSFP_phastCons100way_vertebrate dbNSFP_SiPhy_29way_logOdds \
* Online tools
    > "$outputDir/tmp/$tmpfd/annoTable.txt"
** [https://www.syngoportal.org/convert SynGO - ID conversion tool]
</pre>
** [https://www.biotools.fr/human/ensembl_symbol_converter ENSEMBL Gene ID to Gene Symbol Converter]
The output file will
* R. [https://medium.com/computational-biology/gene-id-mapping-using-r-14ff50eec9ba Gene ID mapping using R].
* remove header from the input VCF file
** [https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.34.4/topics/AnnotationDb-objects AnnotationDbi::mapIds()] function + [https://www.biostars.org/p/239681/ "org.Hs.eg.db"] package. It works well.
* break the INFO field into several columns; Only the columns we specify in the command will be kept.
*** Vignette of [https://www.bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html EnhancedVolcano] package
*** [https://www.r-bloggers.com/2016/07/converting-gene-names-in-r-with-annotationdbi/ Converting Gene Names in R with AnnotationDbi]
*** my [https://gist.github.com/arraytools/6e6142f6fabb31e54e188ea1fb0deeee TCGAbiolinks-GBM.Rmd].
** getBM() from [https://stackoverflow.com/a/58875340 "biomaRt"] package
** [https://rdrr.io/bioc/tidybulk/man/ensembl_symbol_mapping.html ensembl_to_symbol()] from tidybulk package.
** [https://github.com/stemangiola/bioc_2020_tidytranscriptomics A Tidy Transcriptomics introduction to RNA sequencing analyses] (Bioc 2020).


==== Local ====
== How to use [http://genome.ucsc.edu/cgi-bin/hgTables UCSC Table Browser] ==
<syntaxhighlight lang='bash'>
* An instruction from [http://bitseq.github.io/howto/index BitSeq] software
$ ls -l /opt/SeqTools/bin/snpEff/
* [https://www.biostars.org/p/93011/ How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols]
total 44504
* [https://www.biostars.org/p/156637/ Where to download refseq gene coding regions data?]
drwxrwxr-x 5 brb brb    4096 May  5 11:24 data
** http://genome.ucsc.edu/cgi-bin/hgTables OR
drwxr-xr-x 2 brb brb    4096 Feb 17 16:37 examples
** Download '''refGene.txt.gz''' file from UCSC directly using http links
drwxr-xr-x 3 brb brb    4096 Feb 17 16:37 galaxy
* [https://www.biostars.org/p/94823/ Where To Download Genome Annotation Including Exon, Intron, Utr, Intergenic Information?]
drwxr-xr-x 3 brb brb    4096 Feb 17 16:37 scripts
-rw-r--r-- 1 brb brb  6138594 Dec  5 10:49 snpEff.config
-rw-r--r-- 1 brb brb 20698856 Dec  5 10:49 snpEff.jar
-rw-r--r-- 1 brb brb 18712032 Dec  5 10:49 SnpSift.jar


$ ls -l /opt/SeqTools/bin/snpEff/data
[[:File:Tablebrowser.png]] [[:File:Tablebrowser2.png]]
total 12
drwxr-xr-x 2 brb brb 4096 May  5 11:24 GRCh38.82
drwxrwxr-x 2 brb brb 4096 Feb  5 09:28 hg19
drwxr-xr-x 2 brb brb 4096 Mar  8 09:44 hg38
</syntaxhighlight>


==== Biowulf ====
Note
The following code fixes some typos in the example on the Biowulf website.
# By default the UCSC Table Browser returns the output in the browser window. Users need to save it from the browser under a file name of their choice.
<syntaxhighlight lang='bash'>
# the output does not have a header
echo $SNPSIFT_JAR
# The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1
# Make sure the database (GRCh37.75 in this case) exists
ls /usr/local/apps/snpEff/4.2/data
# CanFam3.1.75  Felis_catus_6.2.75  GRCh37.75    GRCh38.82 GRCm38.81  GRCz10.82  hg38
# CanFam3.1.81  Felis_catus_6.2.81  GRCh37.GTEX  GRCh38.p2.RefSeq  GRCm38.82  hg19      hg38kg
# CanFam3.1.82  Felis_catus_6.2.82  GRCh38.81    GRCm38.75 GRCz10.81  hg19kg    Zv9.75
# Use snpEff to annotate against GRCh37.75
snpEff -v -lof -motif -hgvs -nextProt GRCh37.75 protocols/ex1.vcf > ex1.eff.vcf # 25 minutes
          # create ex1.eff.vcf (475MB), snpEff_genes.txt (2.5MB) and snpEff_summary.html (22MB)
# Use SnpSift to pull out 'HIGH IMPACT' or 'MODERATE IMPACT' variants
cat ex1.eff.vcf | \
  java -jar $SNPSIFT_JAR filter "((EFF[*].IMPACT = 'HIGH') | (EFF[*].IMPACT = 'MODERATE'))"  \
  > ex1.filtered.vcf
          # ex1.filtered.vcf (8.2MB), 2 minutes
# Use SnpSift to annotate against the dbNSFP database
java -jar $SNPSIFT_JAR dbnsfp -v -db /fdb/dbNSFP2/dbNSFP2.9.txt.gz ex1.eff.vcf \
  > file.annotated.vcf
          # file.annotated.vcf (479 MB), 11 minutes
</syntaxhighlight>
and the output
<syntaxhighlight lang='bash'>
$ ls -lth | head
total 994M
-rw-r----- 1  479M May 15 11:47 file.annotated.vcf
-rw-r----- 1  8.2M May 15 11:31 ex1.filtered.vcf
-rw-r----- 1  476M May 15 10:31 ex1.eff.vcf
-rw-r----- 1  2.5M May 15 10:30 snpEff_genes.txt
-rw-r----- 1  22M May 15 10:30 snpEff_summary.html
lrwxrwxrwx 1    39 May 15 10:01 protocols -> /usr/local/apps/snpEff/4.2/../protocols
</syntaxhighlight>


==== Strange error ====
If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.
* Run 1: Error
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
<pre>
$ wc -l hg38Tables.bed
$ java -Xmx4G -jar "$seqtools_snpeff/snpEff.jar" -canon -no-downstream -no-upstream -no-intergenic -no-intron -no-utr \
75893 hg38Tables.bed
    -noNextProt -noMotif $genomeVer -s "$outputDir/tmp/$tmpfd/annodbsnpRemove.html" "$outputDir/tmp/$tmpfd/cosmic_dbsnp_rem.vcf"
$ head -2 hg38Tables.bed
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed
chr22_KI270734v1_random 131493 137393 NM_005675 0 + 131645 136994 0 5 262,161,101,141,549, 0,342,3949,4665,5351,
chr22_KI270734v1_random 138078 161852 NM_016335 0 - 138479 161586 0 15 589,89,99,176,147,93,82,80,117,65,150,35,209,313,164, 0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,


java.lang.RuntimeException: ERROR: Cannot read file '/opt/SeqTools/bin/snpEff/./data/hg38/snpEffectPredictor.bin'.
$ wc -l hg38CodingExon.bed
You can try to download the database by running the following command:
577387 hg38CodingExon.bed
java -jar snpEff.jar download hg38
$ head -2 hg38CodingExon.bed
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
$ tail -2 hg38CodingExon.bed
chr22_KI270734v1_random 156288 156497 NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r 0 -
chr22_KI270734v1_random 161313 161586 NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r 0 -


at ca.mcgill.mcb.pcingola.snpEffect.SnpEffectPredictor.load(SnpEffectPredictor.java:62)
# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
at ca.mcgill.mcb.pcingola.snpEffect.Config.loadSnpEffectPredictor(Config.java:517)
$ grep NM_001276352 hg38Tables.bed
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.loadDb(SnpEff.java:339)
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:956)
$ grep NM_001276352 hg38CodingExon.bed
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:939)
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:978)
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:136)
chr1 67103237 67103382 NM_001276352_cds_2_0_chr1_67103238_r 0 -
 
chr1 67111576 67111644 NM_001276352_cds_3_0_chr1_67111577_r 0 -
 
chr1 67115351 67115464 NM_001276352_cds_4_0_chr1_67115352_r 0 -
NEW VERSION!
chr1 67125751 67125909 NM_001276352_cds_5_0_chr1_67125752_r 0 -
There is a new SnpEff version available:
chr1 67127165 67127240 NM_001276352_cds_6_0_chr1_67127166_r 0 -
Version      : 4.3P
Release date : 2017-06-06
Download URL : http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
</pre>
* Run 2: OK
<pre>
$ java -Xmx4G -jar "$seqtools_snpeff/snpEff.jar" -canon -no-downstream -no-upstream -no-intergenic -no-intron -no-utr \
    -noNextProt -noMotif $genomeVer -s "$outputDir/tmp/$tmpfd/annodbsnpRemove.html" "$outputDir/tmp/$tmpfd/cosmic_dbsnp_rem.vcf"
 
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.3+htslib-1.3
...
</pre>
</pre>
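
The relationship between the two Table Browser outputs above can be sketched as follows: each "Whole Genome" row is a BED12 record whose exon blocks (columns 10-12), clipped to the coding region (thickStart/thickEnd, columns 7-8), give the "Coding Exons" rows. The function name ''bed12_coding_exons'' is hypothetical, and the output keeps the plain transcript name instead of the ''_cds_N_...'' names that UCSC generates.

<syntaxhighlight lang='bash'>
# Hypothetical helper: expand BED12 records on stdin into coding-exon intervals.
bed12_coding_exons() {
  awk -v OFS='\t' '
    { split($11, size, ","); split($12, start, ",")
      for (i = 1; i <= $10; i++) {       # $10 = blockCount (number of exons)
        s = $2 + start[i]; e = s + size[i]
        if (s < $7) s = $7               # clip to thickStart (CDS start)
        if (e > $8) e = $8               # clip to thickEnd (CDS end)
        if (s < e) print $1, s, e, $4, $6    # skip exons that are entirely UTR
      } }'
}

# The NM_001276352 record shown above
echo 'chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,' |
  bed12_coding_exons
</syntaxhighlight>

For this transcript the sketch prints 7 intervals, from chr1:67093579-67093604 to chr1:67127165-67127240, which match the 7 ''NM_001276352_cds_*'' rows in ''hg38CodingExon.bed''; the last 2 of the 9 exons lie outside the coding region.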


=== ANNOVAR and SnpEff examples ===
This can be compared to '''refGene'''(?) directly downloaded via http
https://github.com/arraytools/brb-seqtools/tree/master/testdata/GSE48215subset/output
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’


=== [http://cancer.sanger.ac.uk/cosmic COSMIC] ===
hg38.refGene.txt.gz                100%[===============================================================>]   7.11M  901KB/s    in 10s
* [https://www.youtube.com/channel/UC0W354Ttrh0BZCjt1D3I2HA Youtube videos]
* [http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=BRAF#histo Histogram] of BRAF (melanoma).


=== AnnTools ===
2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]
=== GNS-SNP ===
=== SeattleSeq ===
=== SVA ===
=== VARIANT ===
=== VEP ===
[https://informatics.sydney.edu.au/training/coursedocs/DNASeqOnArtemis_Camden_Aug2018_1.pdf#page=46 DNA Sequencing analysis on Artemis Mapping and Variant Calling] Tracy Chew et al


=== Web tools ===
$ zcat hg38.refGene.txt.gz | wc -l
* [http://genome.ucsc.edu/cgi-bin/hgVai UCSC Variant Annotation Integrator]
75893
* [http://snp.gs.washington.edu/SeattleSeqAnnotation137/ SeattleSeq Annotation 137]
$ zcat hg38.refGene.txt.gz | head -2
* [http://www.ensembl.org/Homo_sapiens/Tools/VEP Variant Effect Predictor] and [http://useast.ensembl.org/info/docs/tools/vep/script/index.html?redirect=no VEP script]
1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 63865227,63869295,63873667,63875815,63882718,63889189,63889849, 63865384,63869441,63873816,63875875,63882820,63889238,63891545, 0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0,
1815 NR_110164 chr2 + 161244738 161249050 161249050 161249050 2 161244738,161246874, 161244895,161249050, 0 LINC01806 unk unk -1,-1,


=== pcgr ===
$ zcat hg38.refGene.txt.gz | tail -2
* https://github.com/sigven/pcgr
1006 NM_130467 chrX + 55220345 55224108 55220599 55224003 5 55220345,55221374,55221766,55222620,55223986, 55220651,55221463,55221875,55222746,55224108, 0 PAGE5 cmpl cmpl 0,1,0,1,1,
* [https://academic.oup.com/bioinformatics/article/34/10/1778/4764004 Personal Cancer Genome Reporter: variant interpretation report for precision oncology] 2017
637 NM_001364814 chrY - 6865917 6874027 6866072 6872608 7 6865917,6868036,6868731,6868867,6870005,6872554,6873971, 6866078,6868462,6868776,6868909,6870053,6872620,6874027, 0 AMELY cmpl cmpl 0,0,0,0,0,0,-1,
</pre>
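
To line the downloaded ''refGene.txt.gz'' up against a Table Browser BED file, the transcript-level columns can be pulled out as BED6. This is a sketch only: the function name ''refgene_to_bed6'' is hypothetical, and the column positions (2 = accession, 3 = chrom, 4 = strand, 5 = txStart, 6 = txEnd, 13 = gene symbol) are read off the rows shown above.

<syntaxhighlight lang='bash'>
# Hypothetical helper: convert refGene.txt rows on stdin to BED6,
# with "accession|symbol" in the name column.
refgene_to_bed6() {
  awk -v OFS='\t' '{ print $3, $5, $6, $2 "|" $13, 0, $4 }'
}

# First row of the file shown above
echo '1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 63865227,63869295,63873667,63875815,63882718,63889189,63889849, 63865384,63869441,63873816,63875875,63882820,63889238,63891545, 0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0,' |
  refgene_to_bed6
# chr20  63865227  63891545  NM_003288|TPD52L2  0  +
</syntaxhighlight>

On the real file the same function would be used as ''zcat hg38.refGene.txt.gz | refgene_to_bed6''.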


=== SPDI ===
== Where to download reference genome ==
[https://www.biorxiv.org/content/10.1101/537449v1 SPDI: Data Model for Variants and Applications at NCBI]
* [http://hgdownload.cse.ucsc.edu/downloads.html UCSC]. Use [https://genome.ucsc.edu/goldenpath/help/twoBit.html twoBitToFa] to convert .2bit to fasta (see also this [https://www.biostars.org/p/9700/ Biostars post]).
 
* [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ hg19] from UCSC (chromosome-wise).
== De novo genome assembly ==
* [http://bioinformatics.oxfordjournals.org/content/early/2015/06/21/bioinformatics.btv383.abstract Bandage: interactive visualisation of de novo genome assemblies]


== Single Cell RNA-Seq ==
== Which human reference genome to use? ==
* https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis
http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)
* [http://journal.frontiersin.org/article/10.3389/fgene.2017.00062/full Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods] by Dal Molin et al 2017.
* [http://bioinformatics.oxfordjournals.org/content/31/13/2225.short?rss=1 Normalization and noise reduction for single cell RNA-seq experiments] by Bo Ding et al 2015.
* [http://www.rna-seqblog.com/a-step-by-step-workflow-for-low-level-analysis-of-single-cell-rna-seq-data-with-bioconductor/ A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor] by Lun 2016.
* (Video) [https://youtu.be/IrlNcJwPClQ?list=PLEyKDyF1qdObdFBc3JncwXAnMUHlcd0ap Analysis of single cell RNA-seq data - Lecture 1] by BioinformaticsTraining
* (Video) [https://youtu.be/8_-oPq1Tg1E Single-cell isolation by a modular single-cell pipette for RNA-sequencing] by labonachipVideos
* [https://f1000research.com/articles/6-595/v1 Gene length and detection bias in single cell RNA sequencing protocols] and the script & data are available online. Belinda Phipson et al 2017.
* [https://f1000research.com/articles/6-1158/v1 Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference]. '''It has open peer reviews too'''.
* [https://academic.oup.com/biostatistics/article-abstract/4599254 Missing data and technical variability in single-cell RNA-sequencing experiments]
* [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0467-4 A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications] 2017
* [https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby007/4831233 How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives] by Alessandra Dal Molin 2018


=== How many cells are in the human body? ===
== UHR, HBR ==
[https://www.medicalnewstoday.com/articles/318342.php 30-40 trillion (10<sup>12</sup>) cells]
In RNA sequencing (RNA-seq), Universal Human Reference (UHR) and Human Brain Reference (HBR) are two commercially available RNA samples that are often used as control samples to assess the performance and accuracy of RNA-seq assays. See [https://rnabio.org/module-01-inputs/0001/05/01/RNAseq_Data/ this] ([https://github.com/griffithlab/rnaseq_tutorial/wiki/RNAseq-Data github]) and [https://bioinformatics.ccr.cancer.gov/docs/b4b/Module2_RNA_Sequencing/Lesson13/ Lesson 13: Aligning raw sequences to reference genome].


=== [https://bioconductor.org/packages/scone scone] package: normalization ===
== GENCODE transcript database ==
[https://www.biorxiv.org/content/biorxiv/early/2017/12/16/235382.full.pdf Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq]
* [https://www.gencodegenes.org/ GENCODE]
* [https://chitka-kalyan.blogspot.com/2014/02/creating-gencode-transcript-database-in.html Create a GENCODE transcript database in R]


=== [http://bioconductor.org/packages/release/bioc/html/monocle.html monocle] package ===
== [https://en.wikipedia.org/wiki/RefSeq#RefSeq_categories RefSeq categories] ==
See Table 1 of [https://www.ncbi.nlm.nih.gov/books/NBK21091/ Chapter 18: The Reference Sequence (RefSeq) Database].


=== [http://bioconductor.org/packages/release/bioc/html/sincell.html sincell] package ===
{| class="wikitable centered" style="text-align:center"
 
|+
=== [https://github.com/diazlab/SCell SCell] ===
|- class="hintergrundfarbe6"
[http://www.rna-seqblog.com/scell-integrated-analysis-of-single-cell-rna-seq-data/ SCell – integrated analysis of single-cell RNA-seq data]
! Category
 
! Description
=== [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2897-6 GEVM] ===
|-  
Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).
| NC
 
| Complete genomic molecules
=== NMFEM ===
|-  
[http://www.rna-seqblog.com/nmfem-detecting-heterogeneity-in-single-cell-rna-seq-data-by-non-negative-matrix-factorization/ Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization]
| NG
 
| Incomplete genomic region
=== Seurat ===
|-  
* http://satijalab.org/seurat/
| NM
* https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
| [https://en.wikipedia.org/wiki/MRNA mRNA]
 
|-
=== Splatter: Simulation Of Single-Cell RNA Sequencing Data ===
| NR
http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1
| [https://en.wikipedia.org/wiki/Non-coding_RNA ncRNA]
 
|-
=== Scater ===
| NP
[https://academic.oup.com/bioinformatics/article/33/8/1179/2907823 Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R]
| [https://en.wikipedia.org/wiki/Protein Protein]
 
|-
=== [https://github.com/miaozhun/DEsingle DEsingle] ===
| XM
[https://www.rna-seqblog.com/desingle-detecting-three-types-of-differential-expression-in-single-cell-rna-seq-data/ DEsingle – detecting three types of differential expression in single-cell RNA-seq data]
| predicted [[mRNA]] model
 
|-  
== RNA-Seq analysis interface ==
| XR
* [http://bib.oxfordjournals.org/content/early/2015/06/23/bib.bbv036.full Systematically evaluating interfaces for RNA-seq analysis from a life scientist perspective]
| predicted [[ncRNA]] model
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2486-6 iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data]
|-
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2702-z DEvis: an R package for aggregation and visualization of differential expression data]
| XP
 
| predicted [[Protein]] model (eukaryotic sequences)
== Co-expression in RNA-Seq ==
|-  
* [http://www.rna-seqblog.com/co-expression-network-analysis-using-rna-seq-data/ Co-expression network analysis using RNA-Seq data]
| WP
* [http://bioconductor.org/packages/devel/bioc/html/coseq.html coseq] - Co-Expression Analysis of Sequencing Data
| predicted [[Protein]] model (prokaryotic sequences)
* [http://bib.oxfordjournals.org/content/early/2017/01/10/bib.bbw139.full Gene co-expression analysis for functional classification and gene–disease predictions]
|}
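
The prefix table above can be turned into a small lookup. The following base R sketch is illustrative only: the helper name <code>refseq_category</code> is hypothetical (it is not part of any package), and the accession strings are used purely for their prefixes.

```r
# Lookup table copied from the RefSeq category table above.
refseq_prefix <- c(
  NC = "Complete genomic molecule",
  NG = "Incomplete genomic region",
  NM = "mRNA",
  NR = "ncRNA",
  NP = "Protein",
  XM = "Predicted mRNA model",
  XR = "Predicted ncRNA model",
  XP = "Predicted protein model (eukaryotic sequences)",
  WP = "Predicted protein model (prokaryotic sequences)"
)

# Hypothetical helper: map an accession such as "NM_000546.6" to its category.
# The prefix is taken to be everything before the first underscore;
# accessions with an unknown prefix (or no underscore) return NA.
refseq_category <- function(accession) {
  prefix <- sub("_.*$", "", accession)
  unname(refseq_prefix[prefix])
}

# Example strings, used only to exercise the prefixes.
refseq_category(c("NM_000546.6", "NP_000537.3", "XR_001234567.1", "ENST00000269305"))
```

Anything that is not a RefSeq-style accession (for example an Ensembl transcript ID) falls through to <code>NA</code>, which makes mixed identifier lists easy to screen.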
 
 
== Monitor Software Version Change ==
== UCSC version & NCBI release corresponding ==
 
* http://genome.ucsc.edu/FAQ/FAQreleases.html
== Circos Plot ==
Circos is a popular tool for summarizing genomic events in a tumor genome.
 
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004274 Genome Modeling System: A Knowledge Management Platform for Genomics]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2564-9 MS-Helios: a Circos wrapper to visualize multi-omic datasets]
 
== Cancer & Mutation ==
=== [https://www.mycancergenome.org/ My Cancer Genome] ===


=== [https://civic.genome.wustl.edu/#/home CIViC - Clinical Interpretations of Variants in Cancer] ===
== Gene Annotation ==
* [http://www.genecards.org/ GeneCards]
* [http://ghr.nlm.nih.gov/GenesBySymbol Genetics Home Reference] from National Library of Medicine
* [http://www.mycancergenome.org/ My Cancer Genome]
* [http://cancer.sanger.ac.uk/cosmic Cosmic]
* [http://www.gettinggeneticsdone.com/2015/11/annotables-convert-gene-ids.html annotables] R package.
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz031/5301311?rss=1 ensembldb] R package
* https://www.gencodegenes.org/human/releases.html
* For gene symbols, there are NCBI and HUGO. See an example from GSE6532 (annot.tam object).


=== NCBI ===
{{Pre}}
* [https://www.youtube.com/watch?v=fC9rYghqUTo NCBI Resources and Variant Interpretation Tools for the Clinical Community]: ClinVar, MedGen, GTR, Variation Viewer
library(rtracklayer)
genes <- readGFF("gencode.v27.annotation.gff3.gz")
genes[1:2, 1:5]
# DataFrame with 2 rows and 5 columns
#      seqid  source      type    start      end
#  <factor> <factor>  <factor> <integer> <integer>
# 1    chr1  HAVANA gene          11869    14409
# 2    chr1  HAVANA transcript    11869    14409
genes[1:100, ] %>% filter(type == "gene") %>% dim()
# Error in UseMethod("filter") :
#  no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'RectangularData', 'SimpleList', 'DataFrame_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"


= R and Bioconductor packages =
library(ape)
== Resources ==
genes2 <- read.gff("gencode.v27.annotation.gff3.gz")
* [http://rafalab.github.io/pages/harvardx.html HarvardX Biomedical Data Science Open Online Training], [https://hbctraining.github.io/main/ Bioinformatics Training at the Harvard Chan Bioinformatics Core]
genes2[1:2, 1:5]
* [https://bioconductor.org/help/course-materials/2017/CSAMA/ CSAMA 2017: Statistical Data Analysis for Genome-Scale Biology]
#  seqid source      type start  end
* [https://nsaunders.wordpress.com/2015/04/28/some-basics-of-biomart/ Some basics of biomaRt] (and GenomicRanges)
# 1  chr1 HAVANA      gene 11869 14409
* [http://master.bioconductor.org/help/workflows/annotation/AnnotatingRanges/ Annotating Ranges] Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
# 2  chr1 HAVANA transcript 11869 14409
<pre>
genes2[1:100,] %>% filter(type == "gene") %>% dim()
library(VariantAnnotation)
# [1] 11  9
library(AnnotationHub)
</pre>
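
The <code>filter()</code> error shown above arises because <code>rtracklayer::readGFF()</code> returns a Bioconductor <code>DataFrame</code> (an S4 object) rather than a base <code>data.frame</code>, whereas <code>ape::read.gff()</code> returns a plain <code>data.frame</code>. One possible workaround is sketched below on a toy <code>DataFrame</code> (built by hand to mirror the two printed rows, not read from the GENCODE file). For real <code>readGFF()</code> output, list-like columns may need extra handling after conversion; that part is an assumption and is not covered here.

```r
library(S4Vectors)

# Toy stand-in for the object returned by rtracklayer::readGFF();
# the two rows mirror the printed output above.
genes <- DataFrame(
  seqid  = factor(c("chr1", "chr1")),
  source = factor(c("HAVANA", "HAVANA")),
  type   = factor(c("gene", "transcript")),
  start  = c(11869L, 11869L),
  end    = c(14409L, 14409L)
)

# Option 1: base-style subsetting works directly on a DataFrame.
gene_rows <- genes[genes$type == "gene", ]

# Option 2: convert to a plain data.frame first; after this step
# dplyr verbs such as filter() can also be applied.
genes_df <- as.data.frame(genes)
gene_rows_df <- subset(genes_df, type == "gene")

dim(gene_rows)
dim(gene_rows_df)
```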
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
 
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
=== Genecards ===
library(org.Hs.eg.db)
<ul>
library(org.Mm.eg.db)
<li>https://www.genecards.org/
library(BSgenome.Hsapiens.UCSC.hg19)
<li>Q: What are genes with gene symbols starting with LINC?; eg [https://www.genecards.org/cgi-bin/carddisp.pl?gene=LINC00491 LINC00491] </br>
</pre>
A: Genes with gene symbols starting with "LINC" are long intergenic non-coding RNA (lincRNA) genes, a class of long non-coding RNA (lncRNA). lncRNAs are RNA molecules that are transcribed from the genome but '''do not encode proteins'''. Unlike protein-coding genes, lncRNAs do not have a well-defined coding sequence, but they do play important regulatory roles in cellular processes such as gene expression, chromatin structure, and genome stability. Some lncRNAs are specifically expressed in cancer cells and have been implicated in tumor development and progression, making them of interest for cancer research.
* [http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf Analysis of RNA-Seq Data with R/Bioconductor] by Girke in UC Riverside
</ul>
* [http://ivory.idyll.org/dibsi/index.html 2018 Data Intensive Biology Summer Institute at UC Davis]
* [http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2012/121029_HTS/martin_morgan1_nicolas_delhomme2_embo2012_rbioconductor.pdf R / Bioconductor for High-Throughput Sequence Analysis 2012] by Martin Morgan and Nicolas Delhomme
* [http://www-huber.embl.de/pub/pdf/nprot.2013.099.pdf Count-based differential expression analysis of RNA sequencing data using R and Bioconductor] by Simon Anders
* [http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2013/131021_HTS/genesandgenomes.pdf Sequences, Genomes, and Genes in R / Bioconductor] by Martin Morgan 2013.
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4509590/ Orchestrating high-throughput genomic analysis with Bioconductor] by Wolfgang Huber et al 2015.
* [https://rpubs.com/achitsaz/97976 RNA-seq Analysis Example] This is a script that will do differential gene expression (DGE) analysis for RNA-seq experiments using the bioconductor package edgeR. RPKMs were calculated for bar plots.
* [https://github.com/pgpmartin/BioC_For_NGS/blob/master/BioC_for_NGS_PMartin.pdf An introduction to R and Bioconductor for the analysis of high-throughput sequencing data] by Pascal MARTIN Oct 2018


=== Docker ===
== How many [https://en.wikipedia.org/wiki/DNA DNA] strands are there in humans? ==
[https://github.com/jhuanglab/bioinstaller Bioinstaller]: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on [https://cran.r-project.org/web/packages/BioInstaller/ CRAN].
* http://www.numberof.net/number-of-dna-strands/
* http://www.answers.com/Q/How_many_DNA_strands_are_there_in_humans


== Some workflows ==
== How many base pairs in human ==
=== [http://www.bioconductor.org/help/workflows/rnaseqGene/ RNA-Seq workflow] ===
* 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
Gene-level exploratory analysis and differential expression. A non-strand-specific, paired-end RNA-seq experiment was used for the tutorial.
* chromosome 22 has the smallest number of bps (~50 million).  
<pre>
* chromosome 1 has the largest number of bps (245 million base pairs).
      STAR      Samtools        Rsamtools
* The Illumina iGenome '''Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa''' file is 3.0GB (as are the other human genome.fa files).
fastq -----> sam ----------> bam  ----------> bamfiles  -|
 
                                                          \  GenomicAlignments      DESeq2
== Gene, Transcript, Coding/Non-coding exon ==
                                                          --------------------> se --------> dds
* https://hslnews.wordpress.com/2015/07/02/bioinformatics-bite-how-to-find-the-transcription-start-site-of-a-gene/
      GenomicFeatures        GenomicFeatures            /       (SummarizedExperiment) (DESeqDataSet)
* According to https://en.wikipedia.org/wiki/Exon, in the human genome only
  gtf ----------------> txdb ---------------> genes -----|
** 1.1% of the genome is spanned by exons,
</pre>
** 24% is in introns,
=== [http://master.bioconductor.org/help/workflows/high-throughput-sequencing/ Sequence analysis] ===
** and 75% is intergenic DNA.
<pre>
* https://twitter.com/ensembl/status/1276469013707665410?s=20
library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicAlignments::summarizeOverlaps or GenomicRanges::countOverlaps (count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)
</pre>


=== [http://master.bioconductor.org/help/workflows/annotation/annotation/ Accessing Annotation Data] ===
== SNP ==
Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
[https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism Types of SNPs and number of SNPs in each chromosomes]
<source lang="rsplus">
library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")    # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
</source>


{| class="wikitable"
== NGS technology ==
! Object type
* [https://en.wikipedia.org/wiki/Illumina_(company) Illumina - Solexa]
! example package name
* [https://en.wikipedia.org/wiki/ABI_Solid_Sequencing ABI - SOLiD]
! contents
* [https://en.wikipedia.org/wiki/454_Life_Sciences Roche 454]
|-
| OrgDb
| org.Hs.eg.db
| gene based information for Homo sapiens
|-
| TxDb
| TxDb.Hsapiens.UCSC.hg19.knownGene
| transcriptome ranges for Homo sapiens
|-
| OrganismDb
| Homo.sapiens
| composite information for Homo sapiens
|-
| BSgenome
| BSgenome.Hsapiens.UCSC.hg19
| genome sequence for Homo sapiens
|-
|
| [http://cran.r-project.org/web/packages/refGenome/index.html refGenome]
|
|}


== RNA-Seq Data Analysis using R/Bioconductor ==
== DNA methylation, Epigenetics ==
* https://github.com/datacarpentry/rnaseq-data-analysis by Stephen Turner.
* Relation of methylation genes and gene expression
* [https://support.bioconductor.org/p/69677/ Tutorial: Introduction to Bioconductor for high-throughput sequence analysis] by UseR 2015
** Methylation of genes can have different effects on gene expression, depending on where the methylation occurs in the gene and the specific context of the gene and the cellular environment. Generally, '''methylation of ''promoter regions'' of genes is associated with reduced gene expression''', whereas methylation of ''gene body'' regions is less clearly associated with gene expression changes.
* [http://bioconductor.org/packages/release/bioc/html/systemPipeR.html systemPipeR] Building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. An important feature is support for running command-line software, such as NGS aligners, on both single machines and compute clusters.
** When DNA is methylated at the promoter region of a gene, it can prevent the binding of transcription factors and RNA polymerase, which are necessary for transcription initiation. Methylation at the promoter region can also recruit proteins that block transcription or promote histone modifications that lead to chromatin compaction, further limiting access to the gene for transcriptional machinery.
** http://girke.bioinformatics.ucr.edu/GEN242/mydoc_systemPipeVARseq_05.html
** However, methylation in other regions of the gene, such as the gene body, can have more complex effects on gene expression. In some cases, gene body methylation can be associated with increased expression, while in other cases it may have no effect or even lead to decreased expression. It is thought that gene body methylation may be involved in regulating alternative splicing or RNA stability, among other possible mechanisms.
* [http://bioconductor.org/help/course-materials/2015/BioC2015/ BioC2015]
** Therefore, the effect of methylation on gene expression is not always straightforward and depends on various factors, including the specific gene, the location of methylation, and the cellular context.


=== recount2 ===
* [https://wikidiff.com/hypermethylation/hypomethylation Hypermethylation vs Hypomethylation - What's the difference?]
* https://github.com/leekgroup/recount
* [https://jhubiostatistics.shinyapps.io/recount/ recount2] - A multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets.


== [https://bioconductor.org/packages/release/bioc/html/GenomicDataCommons.html GenomicDataCommons] package ==
[https://youtu.be/e4caywLliW0 DNA Methylation - Biochemistry (USMLE Step 1)]
* Genomic Data Commons
** DNA methylation inactivates DNA transcription
** https://gdc.cancer.gov/
** https://www.cancer.gov/about-nci/organization/ccg/research/computational-genomics/gdc
** [https://portal.gdc.cancer.gov/ Data Portal]. A list of [https://portal.gdc.cancer.gov/projects Projects]
* Use the GenomicDataCommons package to find and download variants from TCGA (NCI Genomic Data Commons Access) dataset and maftools package for analysis and visualization. See  https://seandavi.github.io/talk/2018/02/08/bioconductor-a-potential-hub-in-the-cancer-biomarker-data-ecosystem/
* https://seandavi.github.io/post/2018/03/extracting-clinical-information-using-the-genomicdatacommons-package/
* [https://docs.google.com/presentation/d/1bjnW67aemW90kFcq_S5rGorX96Xjrp9jt3tM4PpNuHI The Cancer Data Ecosystem: Data and cloud resources for cancer data science]


Note:
* The level of mRNA expression is inversely related to the extent of methylation. See [https://www.future-science.com/doi/suppl/10.2144/btn-2018-0179/suppl_file/supplementary_figure_s6.pdf this screenshot example from TCGA]
# The TCGA data such as [https://portal.gdc.cancer.gov/projects/TCGA-LUAD TCGA-LUAD] are not part of clinical trials (described [https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD here]).
# Each patient has four categories of data, and the 'case_id' is common to them:
#* demographic: gender, race, year_of_birth, year_of_death
#* diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
#* exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
#* main: disease_type, primary_site
# The originally downloaded data (clinical.tsv file) contains a column 'treatment_or_therapy', but it has missing values for all patients.


== Visualization ==
* [http://nathansheffield.com/wordpress/what-is-hemimethylated-dna/ hemimethylated DNA vs allele-specific methylation]. DNA-hemimethylation is when only one of two (complementary) strands is methylated.
=== [https://bioconductor.org/packages/release/bioc/html/GenVisR.html GenVisR] ===
* [https://bioconductor.org/packages/devel/workflows/vignettes/methylationArrayAnalysis/inst/doc/methylationArrayAnalysis.html methylationArrayAnalysis ]: A cross-package Bioconductor workflow for analysing methylation array data
 
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-587 Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis] Du 2010. Figure 3 shows scatterplots of SD vs mean based on technical replicates of using beta or M-value. It can be seen M-value satisfies homoscedasticity but beta does not.
=== [http://www.bioconductor.org/packages/devel/bioc/html/ComplexHeatmap.html ComplexHeatmap] ===
* [https://health.usnews.com/health-care/for-better/articles/is-your-dna-your-destiny-a-primer-on-epigenetics Is Your DNA Your Destiny? A Primer on Epigenetics]
* [https://en.wikipedia.org/wiki/GC-content GC content] = (G+C)/(A+T+G+C) x 100%
* How many CpGs (C followed by G)?
* Some papers
** [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04416-w Identification of prostate cancer specific methylation biomarkers from a multi-cancer analysis] 2021. Within all bootstrapped datasets, sites with non-zero coefficients in more than 99% LASSO models were kept.
* [http://genomicsclass.github.io/book/pages/methylation.html Analyzing DNA methylation data] (part of the book [http://genomicsclass.github.io/book/ Biomedical Data Science]) and the [https://www.class-central.com/mooc/1615/edx-ph525x-data-analysis-for-genomics PH525x: Data Analysis for Genomics] (edX course). The GitHub repository is at https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The tutorial mentions the paper [http://ije.oxfordjournals.org/content/41/1/200.long Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies].
<source lang="rsplus">
devtools::install_github("genomicsclass/coloncancermeth") # "user/repo" form
library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
dim(meth)
dim(pd)
length(gr)
colnames(pd)
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")


== [http://www.bioconductor.org/packages/release/bioc/html/limma.html limma] ==
i=normalIndex[1]
* [http://nar.oxfordjournals.org/content/early/2015/01/20/nar.gkv007.long Differential expression analyses for RNA-sequencing and microarray studies]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
* [http://bioinf.wehi.edu.au/RNAseqCaseStudy/ Case Study] using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). ''Here we illustrate how to use two Bioconductor packages - '''Rsubread''' and '''limma''' - to perform a complete RNA-seq analysis, including '''Subread''' read mapping, '''featureCounts''' read summarization, '''voom''' normalization and '''limma''' differential expression analysis.''
for(i in normalIndex){
* Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See [https://support.bioconductor.org/p/47217/ this post].
  lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
  lines(density(meth[,i],from=0,to=1),col=2)
}


== easyRNASeq ==
# finding regions of the genome that are different between cancer and normal samples
Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- ebayes(fit)


== ShortRead ==
# plot of the region surrounding the top hit
Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)


== [http://www.bioconductor.org/packages/release/bioc/html/Rsamtools.html Rsamtools] ==
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
The Rsamtools package provides an interface to BAM files.
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453


The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts
# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")


This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
#consider two example regions#
...
</source>
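
Three of the quantities mentioned in the list above (GC content, the number of CpG sites, and the Beta-value versus M-value comparison) can be sketched in base R. This is a minimal illustration under stated assumptions: the function names and the toy sequence are made up for this example, and the M-value is computed with the simple logit form <code>log2(beta / (1 - beta))</code>, ignoring any offset term that array pipelines may add.

```r
# GC content as defined above: (G + C) / (A + T + G + C) x 100%.
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  bases <- bases[bases %in% c("A", "C", "G", "T")]  # ignore N and other symbols
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Number of CpG sites, i.e. positions where C is immediately followed by G.
count_cpg <- function(seq) {
  hits <- gregexpr("CG", toupper(seq), fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Beta-value to M-value conversion (logit on the log2 scale, no offset).
beta_to_m <- function(beta) log2(beta / (1 - beta))

toy_seq <- "ACGCGTTACGGA"   # made-up sequence for illustration
gc_content(toy_seq)         # percentage of G and C among the 12 bases
count_cpg(toy_seq)          # number of "CG" dinucleotides
beta_to_m(c(0.2, 0.5, 0.8)) # symmetric around 0 at beta = 0.5
```

For whole genomes, Biostrings functions such as <code>letterFrequency()</code> and <code>countPattern()</code> are the more practical choice; the base R version is only meant to make the definitions concrete.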


== [http://www.bioconductor.org/packages/release/bioc/html/IRanges.html IRanges] ==
=== Integrate DNA methylation and gene expression ===
IRanges is a fundamental package (see how many packages depend on it) underlying other packages such as '''GenomicRanges''', '''GenomicFeatures''' and '''GenomicAlignments'''. The package defines the IRanges class.
* [https://bioconductor.org/packages/release/bioc/html/iNETgrate.html iNETgrate]: Integrates DNA methylation data with gene expression in a single gene network. [https://www.nature.com/articles/s41598-023-48237-8 Paper] 2023.
* [https://bioconductor.org/packages/release/bioc/html/ELMER.html ELMER] Inferring Regulatory Element Landscapes and Transcription Factor Networks Using Cancer Methylomes and [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6546131/ Paper] 2019.
* [https://www.nature.com/articles/s41392-019-0081-6 Integrative analysis of DNA methylation and gene expression identified cervical cancer-specific diagnostic biomarkers] 2019
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4625-x EnrichedHeatmap: an R/Bioconductor package for comprehensive visualization of genomic signal associations] 2018
* [https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/casestudy.html#Case_study_n_3:_Integration_of_methylation_and_expression_for_ACC TCGAbiolinks vignette]. Case 3. Create a scatterplot of log10(FDR gene expression) vs log10(FDR dna methylation). Case 4. ELMER - Identify putative target genes for differentially methylated distal probes, using methylation vs. expression correlation. Identify master regulatory Transcription Factors (TF) whose expression associate with DNA methylation changes at multiple regulatory regions.


The '''plotRanges'''() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.
== Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing ==
* http://www.rna-seqblog.com/exome-sequencing-vs-rna-seq-to-identify-coding-region-variants/
* http://www.rna-seqblog.com/combined-use-of-exome-and-transcriptome-sequencing/
* [http://www.genomebiology.com/2010/11/5/R57 A comparison of whole genome and whole transcriptome sequencing]


If we want to make the same plot using the ggplot2 package, we can follow the example in [http://stackoverflow.com/questions/21506724/how-to-plot-overlapping-ranges-with-ggplot2 this post]. Note that disjointBins() returns a vector of bin numbers, one per range, which gives the position of each range on the y-axis.
== Sequence + Expression ==
* [http://www.ncbi.nlm.nih.gov/pubmed/26177635 Integrated sequence and expression analysis of ovarian cancer structural variants underscores the importance of gene fusion regulation]


=== flank ===
== Integrate RNA-Seq and DNA-Seq ==
The example is obtained from ?IRanges::flank.
* [https://www.jci.org/articles/view/96153 Integrated RNA and DNA sequencing reveals early drivers of metastatic breast cancer] by Perou. An R code is provided.
 
== Immunohistochemistry/IHC ==
https://en.wikipedia.org/wiki/Immunohistochemistry. Protein expression by IHC.
 
== Deconvolve bulk tumor tissue ==
[https://www.biorxiv.org/content/10.1101/2022.12.04.519045v1 Performance of computational algorithms to deconvolve heterogeneous bulk tumor tissue depends on experimental factors]. [https://twitter.com/ariel_hippen/status/1602379847187107841 Twitter].
 
== Tumor purity ==
* [https://aacrjournals.org/cancerrescommun/article/2/5/353/696472/Tumor-Purity-in-Preclinical-Mouse-Tumor Tumor Purity in Preclinical Mouse Tumor Models] 2022
* [https://ascopubs.org/doi/10.1200/PO.20.00016 Systematic Assessment of Tumor Purity and Its Clinical Implications] Haider 2020.
** With the exception of naive miRNA profiles, ''' ''purity estimates'' were inversely correlated with ''molecular profiles'' '''regardless of the underlying purity estimation profile.
** ''These data suggest that the presence of genomic and transcriptomic correlates of tumor purity are likely to confound biologic and clinical interpretations.''
* Estimators:
** DNA: ABSOLUTE, ASCAT, CLONET, INTEGER, OncoSNP
** RNA: DeMix, ISOpure-R (matlab/R), ESTIMATE ([https://bioinformatics.mdanderson.org/public-software/estimate/ Yoshihara, R])
** miRNA/microRNA: ISOpure-I
* TCGA purity estimate by Aran 2015 [https://www.nature.com/articles/ncomms9971 Systematic pan-cancer analysis of tumour purity] - Supplementary Data 1 (xlsx file with columns: Sample ID,Cancer type,ESTIMATE,ABSOLUTE,LUMP,IHC & CPE).
* [https://rdrr.io/bioc/TCGAbiolinks/man/Tumor.purity.html Tumor.purity: TCGA samples with their Tumor Purity measures] (a data frame with 9364 rows and 7 variables) from [https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html TCGAbiolinks] package
* [https://academic.oup.com/bib/article/22/6/bbab163/6265216#312129108 Prediction of tumor purity from gene expression data using machine learning] 2021.
** We selected the '''CPE''' as the target variable, which is the median purity value after normalizing values from the other four purity estimates (ESTIMATE, ABSOLUTE, LUMP and IHC).
** our data set consisted of 8405 tumor samples.
* How is VAF related to tumor purity?
** '''Variant allelic fraction''' (VAF) is related to tumor purity because it reflects the proportion of cells in a sample that carry a specific genetic variant. In the context of cancer, '''VAF can be used as a surrogate marker for tumor purity''', as the fraction of cells in the sample that carry the variant will depend on the proportion of cancer cells relative to normal cells.
** The VAF of a specific genetic variant in a cancer sample can be calculated as the ratio of the number of reads supporting the variant to the total number of reads covering that locus. '''In a sample that is purely composed of cancer cells, the VAF of a clonal variant should approach 1 if every copy of the locus carries the variant (for example after loss of heterozygosity), and about 0.5 if the variant is heterozygous in a diploid region'''. In a sample that is mixed with normal cells, the VAF of a somatic variant will be lower, roughly in proportion to the fraction of cancer cells in the sample.
** Therefore, by measuring the VAF of one or more genetic variants, it is possible to estimate the tumor purity, which is the proportion of cancer cells in the sample relative to normal cells. This information is important for a variety of downstream analyses, including '''variant calling''', gene expression analysis, and the estimation of the '''mutational burden''', as it can affect the interpretation of the results and the accuracy of the analysis.
* How can gene expression be used to estimate tumor purity?
** Gene expression analysis can be used to estimate tumor purity by comparing the expression levels of genes known to be specific to either '''normal or cancer cells'''. In a sample that is mixed with normal and cancer cells, the expression levels of these genes will reflect the proportion of normal and cancer cells present in the sample.
** For example, genes that are highly expressed in '''non-cancerous cells''' in the sample, such as '''stromal or immune signature genes''' (the approach taken by ESTIMATE), can be used as a reference to estimate the proportion of normal cells in the sample. Similarly, genes that are highly expressed in '''cancer cells''', such as some '''oncogenes''', can be used to estimate the proportion of cancer cells in the sample.
** The relative expression levels of these genes can then be used to estimate the tumor purity, either by comparing the expression levels to a reference sample of known purity, or by using mathematical models to estimate the proportion of normal and cancer cells in the sample.
** It is important to note that this method is not without limitations, as the expression levels of specific genes can be influenced by various factors, such as the presence of cell-to-cell heterogeneity, gene amplification, and epigenetic modifications, among others. Therefore, gene expression analysis should be used in combination with other methods, such as copy number analysis and variant allelic fraction analysis, to obtain a more accurate estimate of tumor purity.
* Some papers
** [https://onlinelibrary.wiley.com/doi/full/10.1002/cam4.3505 Tumor purity as a prognosis and immunotherapy relevant feature in gastric cancer]
** [https://genomemedicine.biomedcentral.com/articles/10.1186/gm524 Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine]. Tumor purity affects the detection of somatic mutations. The paper uses a plot to explain the difference between a heterozygous somatic SNV and a heterozygous germline SNV.
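The relationship between VAF and tumor purity described above can be sketched in R. This is only a minimal illustration, not a published method: it assumes clonal, heterozygous somatic SNVs in copy-neutral diploid regions, where the expected VAF is purity / 2, so purity is roughly 2 x median VAF. The function name <code>purity_from_vaf</code> and the read counts are hypothetical.
<pre>
# Hypothetical helper: estimate tumor purity from somatic SNV read counts.
# Assumption: clonal heterozygous variants in diploid, copy-neutral regions,
# where expected VAF = purity / 2.
purity_from_vaf <- function(alt_reads, total_reads) {
  stopifnot(length(alt_reads) == length(total_reads), all(total_reads > 0))
  vaf <- alt_reads / total_reads   # variant allelic fraction per locus
  min(1, 2 * median(vaf))          # cap the estimate at 1
}

# Made-up read counts for five somatic SNVs
alt   <- c(30, 28, 35, 31, 26)
total <- c(100, 95, 110, 100, 90)
purity_from_vaf(alt, total)        # about 0.6 (median VAF is 0.30)
</pre>
Copy number changes, subclonal variants and sampling noise all break the purity / 2 assumption, which is why dedicated tools model copy number and VAF jointly.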
<ul>
<li>ISOpureR (Quon 2013): intensity or count data. Error term <math>e_n</math> multinomial distribution.
<pre>
<pre>
ir3 <- IRanges(c(2,5,1), c(3,7,3))
library(ISOpureR)
# IRanges of length 3
 
#     start end width
# For reproducible results, set the random seed
# [1]    2  3    2
set.seed(123);
# [2]    5  7    3
# [3]    1  3    3


flank(ir3, 2)
# Run ISOpureR Step 1 - Cancer Profile Estimation
#     start end width
system.time(ISOpureS1model <- ISOpure.step1.CPE(
# [1]    0  1    2
  tumor.expression.data, # intensity or count data
# [2]    3  4    2
  normal.expression.data  # intensity or count data
# [3]    -1  0    2
))
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
ISOpureS1model$alphapurities  # tumor purity estimates
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
</pre>
#                                    == ==
</li>
<li>ESTIMATE (Yoshihara 2013): normalized data.
* Since it is based on ssGSEA, only ranks are used, so it does not matter whether log-transformed or count data are supplied.
* ssGSEA is based on two gene signatures: Stromal signature (141 genes) and immune signature (141 genes)
* The formula for calculating ESTIMATE tumor purity was developed on TCGA Affymetrix data (n=1001) for which both the '''ESTIMATE score''' and the '''ABSOLUTE-based tumor purity''' were available.
* An evolutionary algorithm was used for the mathematical model.
* Nonlinear least squares method was used to determine the final model estimate.
* Tumor purity = cos(0.6 + 0.000146 * ESTIMATE score)
<pre>
library(estimate)
OvarianCancerExpr <- system.file("extdata", "sample_input.txt", package="estimate")
filterCommonGenes(input.f=OvarianCancerExpr, output.f="OV_10412genes.gct", id="GeneSymbol")
estimateScore("OV_10412genes.gct", "OV_estimate_score.gct", platform="affymetrix")


flank(ir3, 2, start=FALSE)
plotPurity(scores="OV_estimate_score.gct", samples="s516", platform="affymetrix")
#    start end width
# [1]    4  5    2
# [2]    8  9    2
# [3]    4  5    2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == ==


flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
scan("OV_estimate_score.gct", "", skip=6)[-c(1:2)] |> as.numeric() # tumor purity estimates
#    start end width
</pre>
# [1]     4  5    2
</li>
# [2]     3  4    2
<li>[https://wwylab.github.io/DeMixT/tutorial.html DeMixT] (Wang 2018): count data. [https://www.nature.com/articles/s41587-022-01342-x Estimation of tumor cell total mRNA expression in 15 cancer types predicts disease progression], [https://www.nature.com/articles/s41587-022-01342-x Cao 2022] for profile likelihood method & supplementary information for more information about the benchmarking between DeMixT_DE and DeMixT_GS.
# [3]     4  5    2
<pre>
# Combine the ideas of the previous 2 cases.
library(DeMixT)
source("DeMixT_preprocessing.R")


flank(ir3, c(2, -2, 2))
count.mat <- cbind(normal.expression.data, tumor.expression.data)
#    start end width
colnames(count.mat) <- paste0("sample", 1:ncol(count.mat))
# [1]    0  1    2
# [2]    5  6    2
# [3]    -1  0    2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                  == ==


flank(ir3, -2, start=F)
label = factor(c(rep('Normal', ncol(normal.expression.data)),
#     start end width
                rep('Tumor', ncol(tumor.expression.data))))
# [1]    2  3    2
set.seed(1234) # not sure if this is needed
# [2]    6  7    2
preprocessed_data = DeMixT_preprocessing(count.mat, label)
# [3]    2  3    2
PRAD_filter = preprocessed_data$count.matrix
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                      == ==


flank(ir3, 2, both = TRUE)
set.seed(1234)
#    start end width
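n1 <- ncol(normal.expression.data)  # number of normal samples
n2 <- ncol(tumor.expression.data)   # number of tumor samples
Normal.id <- paste0("sample", 1:n1)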
# [1]    0  3    4
Tumor.id <- paste0("sample", (n1+1):(n1+n2))
# [2]     3  6    4
data.Y = SummarizedExperiment(assays = list(counts = PRAD_filter[, Tumor.id]))
# [3]    -1  2    4
data.N1 <- SummarizedExperiment(assays = list(counts = PRAD_filter[, Normal.id]))
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
res = DeMixT(data.Y = data.Y,
# (From the manual) If both = TRUE, extends the flanking region width positions into the range.  
            data.N1 = data.N1,
#        The resulting range thus straddles the end point, with width positions on either side.
            nthread = 64,
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
            gene.selection.method = "DE") # default is "GS"
#                                            ==
res$pi[2, ] # tumor purity estimates
#                                      == == == ==
</pre>
</li>
<li>[https://cibersortx.stanford.edu/ CIBERSORTx] (Newman 2019). Web only. [https://www.nature.com/articles/s41587-019-0114-2 Determining cell type abundance and expression from bulk tissues with digital cytometry]</li>
<li>PUREE (Revkov 2023). [https://github.com/skandlab/PUREE Python-based]. Only API, no source code.
</li>
</ul>
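The ESTIMATE purity formula quoted in the list above can be applied directly to a vector of ESTIMATE scores. The sketch below uses the rounded coefficients exactly as written above; the function name <code>estimate_purity</code> and the example scores are hypothetical, and the formula was fitted on Affymetrix data, so applying it to other platforms is an assumption.
<pre>
# Purity formula as quoted above (rounded coefficients).
# Assumption: input is the ESTIMATE score (stromal + immune score).
estimate_purity <- function(estimate_score) {
  cos(0.6 + 0.000146 * estimate_score)
}

scores <- c(-2000, 0, 2000, 4000)   # made-up ESTIMATE scores
round(estimate_purity(scores), 3)   # purity falls as the score rises
</pre>
A higher ESTIMATE score means more stromal and immune signal, so the estimated purity decreases as the score increases over this range.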


flank(ir3, 2, start=FALSE, both=TRUE)
== Integrate/combine Omics ==
#    start end width
* [https://github.com/mikelove/awesome-multi-omics?s=09 awesome-multi-omics] by Michael Love
# [1]     2  5    4
* [https://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1011224 Ten quick tips for avoiding pitfalls in multi-omics data integration analyses] 2023
# [2]     6  9    4
* [http://www.bioconductor.org/packages/release/data/experiment/html/BloodCancerMultiOmics2017.html BloodCancerMultiOmics2017]. "Drug-perturbation-based stratification of blood cancer" by Dietrich S, Oles M, Lu J et al. - experimental data and complete analysis.
# [3]     2  5    4
* [https://cran.r-project.org/web/packages/OmicsPLS/index.html OmicsPLS] & [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2371-3 the paper] in BMC Bioinformatics 2018
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
* [https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html MultiAssayExperiment] & [http://cancerres.aacrjournals.org/content/77/21/e39 the paper] in AACR 2017
#                                          ==
** [https://waldronlab.io/MultiAssayWorkshop/ Multi-omic Integration and Analysis of cBioPortal and TCGA data with MultiAssayExperiment] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020]
#                                      == == == ==
* [https://cran.r-project.org/web/packages/mixOmics/index.html mixOmics], http://mixomics.org/, [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752 the paper] on PLOS 2017
* [https://academic.oup.com/biostatistics/advance-article/doi/10.1093/biostatistics/kxy044/5092386 The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models] Biostatistics 2018
* [https://cran.r-project.org/web/packages/tweedie/ Tweedieverse], [https://github.com/himelmallick/Tweedieverse/ Github]
* [https://www.nature.com/articles/ncomms6901#Sec17 Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes] 2015, [https://github.com/gerstung-lab/MDS-expression R code for Supplementary Data 2 and 4] and [https://github.com/mg14/mg14 mg14.R].
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04268-4 A comprehensive database for integrated analysis of omics data in autoimmune diseases] 2021
* [https://academic.oup.com/bioinformatics/article/37/17/2601/6157728 Integrative survival analysis of breast cancer with gene expression and DNA methylation data] Bichindaritz, 2021
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05127-6 Machine learning to analyse omic-data for COVID-19 diagnosis and prognosis] 2023
 
== Gene expression ==
The expression level of a gene is the amount of RNA in a cell that was transcribed from that gene. [https://speakerdeck.com/alyssafrazee/high-resolution-gene-expression-analysis Slides] from Alyssa Frazee.


</pre>
== Fusion gene ==
* https://en.wikipedia.org/wiki/Fusion_gene
* https://github.com/STAR-Fusion/STAR-Fusion
* Gene fusion discovery
** [https://youtu.be/TzjxFkDHO4M?t=3713 Fusion prediction accuracy] comparison.
** [https://youtu.be/TzjxFkDHO4M?t=4686 Fusion gene visualized in IGV]
* [https://bioconductor.org/packages/release/bioc/html/chimeraviz.html chimeraviz] Visualization tools for gene fusions
* [https://youtu.be/5V-NMDvR2l8?t=1432 Tools to call fusions] & Common issues with fusion calling, [https://youtu.be/5V-NMDvR2l8?t=3151 visualizing fusion events in IGV]


Both IRanges and GenomicRanges packages provide the '''flank''' function.
== Structural variation ==
* https://en.wikipedia.org/wiki/Structural_variation
* https://www.ncbi.nlm.nih.gov/dbvar/content/overview/
* [https://www.biorxiv.org/content/biorxiv/early/2018/02/01/200170.full.pdf Detection of complex structural variation from paired-end sequencing data] Joseph G. Arthur, 2018


'''Flanking region''' is also a common term in high-throughput sequencing. The [http://www.broadinstitute.org/igv/book/export/html/6 IGV] user guide also has some options related to flanking.
[https://github.com/arq5x/lumpy-sv LUMPY], [https://github.com/dellytools/delly DELLY], [https://sites.google.com/site/sebatlab/software-data ForestSV], [http://gmt.genome.wustl.edu/packages/pindel/ Pindel], [http://breakdancer.sourceforge.net/ breakdancer] , [http://svdetect.sourceforge.net/Site/Home.html SVDetect].
* General tab: '''Feature flanking regions (base pairs)'''. IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
* Alignments tab: '''Splice junction track options'''. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.


== [https://www.bioconductor.org/packages/release/bioc/html/Biostrings.html Biostrings] ==
== Covid-19 ==
* [http://www.r-exercises.com/2017/05/21/manipulate-biological-data-using-biostrings-package-part-1/ Manipulate Biological Data Using Biostrings Package Exercises (Part 1)]
[https://www.rna-seqblog.com/bulk-rna-sequencing-for-analysis-of-post-covid-19-condition/ Bulk RNA sequencing for analysis of post COVID-19 condition] 2024. 13 differentially expressed genes associated with PCC (long Covid) were found. Enriched pathways were related to interferon-signalling and anti-viral immune processes.
* [http://www.r-exercises.com/2017/05/28/manipulate-biological-data-using-biostrings-package-exercises-part-2/ Manipulate Biological Data Using Biostrings Package Exercises (Part 2)] - it covers global, local & overlap alignments.


== [http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html GenomicRanges] ==
== RNASeq + ChipSeq ==
GenomicRanges depends on [http://www.bioconductor.org/packages/release/bioc/html/IRanges.html IRanges] package. See the dependency diagram below.
* [http://www.nature.com/jhg/journal/vaop/ncurrent/full/jhg201584a.html Elucidating the mechanisms of transcription regulation during heart development by next-generation sequencing]
<pre>
GenomicFeatures ------ GenomicRanges -+- IRanges -- BiocGenerics
                        |            +
                  +-----+            +- GenomeInfoDb
                  |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings
</pre>


The package defines some classes
== Labs ==
* GRanges
* [http://salzberg-lab.org/courses/ Steven Salzberg]
* GRangesList
* GAlignments
* SummarizedExperiment: it has the following slots - exptData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), exptData(), mcols(), ... The mcols() method is defined in the S4Vectors package.


(As of Jan 6, 2015) The introduction in the GenomicRanges vignette mentions that a ''GAlignments'' object created from a 'BAM' file discards some information, such as the SEQ, QNAME, QUAL and MAPQ fields and any other information that is not needed. This means that multi-reads don't receive any special treatment. Also, paired-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.
== Biowulf2 at NIH ==
* Main site: http://hpc.nih.gov
* User guide: https://hpc.nih.gov/docs/user_guides.html
* Unlock account (60 days inactive) https://hpc.nih.gov/dashboard/
* Transitioning from PBS to Slurm: https://hpc.nih.gov/docs/pbs2slurm.html
* Job Submission 'cheat sheet': https://hpc.nih.gov/docs/biowulf2-handout.pdf
* STAR: https://hpc.nih.gov/apps/STAR.html


== [http://www.bioconductor.org/packages/release/bioc/html/GenomicAlignments.html GenomicAlignments] ==
== [https://github.com/DecodeGenetics/BamHash BamHash] ==
=== Counting reads with summarizeOverlaps vignette ===
Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.
<pre>
library(GenomicAlignments)
library(DESeq)
library(edgeR)


fls <- list.files(system.file("extdata", package="GenomicAlignments"),
== Reproducibility ==
    recursive=TRUE, pattern="*bam$", full=TRUE)
* [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007881 Improving reproducibility in computational biology research]
* [http://www.biorxiv.org/content/early/2017/07/19/165191 SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses] by Claire Rioualen et al, 2017.
 
== Selected Papers ==
* [http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0961-5 Testing for association between RNA-Seq and high-dimensional data] and the Bioconductor package globalSeq.
* [http://link.springer.com/article/10.1208%2Fs12248-016-9917-y The FDA’s Experience with Emerging Genomics Technologies—Past, Present, and Future]
* [http://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html Differential analysis of gene regulation at transcript resolution with RNA-seq] Trapnell et al, Nature Biotechnology 31, 46–53 (2013)
* [http://cancerres.aacrjournals.org/content/early/2016/12/01/0008-5472.CAN-16-1624.long A Study of TP53 RNA Splicing Illustrates Pitfalls of RNA-seq Methodology]
* [http://www.rna-seqblog.com/top-rna-seq-articles-2016/ Top RNA-Seq Articles – 2016] from RNA-Seq blog
* [http://onlinelibrary.wiley.com/doi/10.1111/biom.12745/full Multivariate association analysis with somatic mutation data] by He 2017 Biometrics.
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3433-x RASflow – An RNA-Seq Analysis Workflow with Snakemake] Zhang 2019
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0157989 A Survey of Bioinformatics Database and Software Usage through Mining the Literature]
* [https://www.nature.com/articles/s41576-021-00431-y Computational analysis of cancer genome sequencing data] Cortés-Ciriano 2021. Focus on point mutations, copy number alterations, structural variations. For RNA-seq data, it focused on gene fusion detection.
 
== Pictures ==
https://www.flickr.com/photos/genomegov
 
== FISH/Fluorescence In Situ Hybridization ==
* https://www.genome.gov/genetics-glossary/Fluorescence-In-Situ-Hybridization
* [https://youtu.be/b81DcJC1jAs Fluorescent In Situ Hybridization (FISH) Assay]


features <- GRanges(
== Using DNA for identity verification ==
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
[https://www.worldjournal.com/5120826/article-《社會傳真》用dna做身分鑑識/ Using DNA for identity verification] (in Chinese)
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000,
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900,
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features


# GRanges object with 11 ranges and 1 metadata column:
== How to teach yourself bioinformatics ==
#      seqnames      ranges strand  |    group_id
* https://zhuanlan.zhihu.com/p/32065916
#          <Rle>    <IRanges>  <Rle>  | <character>
* [http://juang.bst.ntu.edu.tw/JRH/biotech.htm Introduction to biotechnology] (in Chinese)
#  [1]    chr2L [1000, 1499]      -  |          A
#  [2]    chr2L [3000, 3499]      -  |          A
#  [3]    chr2L [4000, 4499]      -  |          A
#  [4]    chr2L [7000, 7599]      -  |          A
#  [5]    chr2R [2000, 2899]      -  |          B
#  ...      ...          ...   ... ...        ...
#  [7]   chr2R [3600, 3899]      -  |          B
#  [8]    chr2R [4000, 4899]      -  |          B
#  [9]    chr2R [7500, 7799]      -  |          B
#  [10]    chr3L [5000, 5499]      -  |          C
#  [11]    chr3L [5400, 5899]      -  |          C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment
# dim: 11 2
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):


assays(olap)$counts
== CRISPR ==
#      sm_treated1.bam sm_untreated1.bam
[https://health.udn.com/health/story/6008/6028809 What is the principle behind gene editing? Understanding the CRISPR "gene scissors" at a glance] (in Chinese)
#  [1,]               0                0
#  [2,]              0                0
#  [3,]              0                0
#  [4,]              0                0
#  [5,]              5                1
#  [6,]              5                0
#  [7,]              2                0
#  [8,]            376              104
#  [9,]              0                0
# [10,]              0                0
# [11,]              0                0
</pre>


Pasilla data. It is not clear where to find the BAM files. According to the [https://support.bioconductor.org/p/50162/ message], we can download the SAM files first and then convert them to BAM files with samtools (not verified yet).
== Staying current ==
<pre>
[http://www.gettinggeneticsdone.com/2017/02/staying-current-in-bioinformatics-genomics-2017.html Staying Current in Bioinformatics & Genomics: 2017 Edition]
samtools view -b -o outputFile.bam inputFile.sam
</pre>


A modified R code that works is
== Papers ==
<pre>
* [http://www.nature.com/nature/journal/vaop/ncurrent/full/nature24286.html DNA sequencing at 40: past, present and future]
###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
            "gtf/drosophila_melanogaster/",
            "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)


###################################################
== Common issues in algorithmic bioinformatics papers ==
### code chunk number 12: gff_parse (eval = FALSE)
[http://medvedevgroup.com/2019/08/11/what-are-some-common-issues-i-find-when-reviewing-algorithmic-bioinformatics-conference-papers/ What are some common issues I find when reviewing algorithmic bioinformatics conference papers?]
###################################################
idx <- mcols(gff0)$source == "protein_coding" &
          mcols(gff0)$type == "exon" &
          seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)


###################################################
== Precision Medicine courses ==
### code chunk number 12: gff_parse (eval = FALSE)
* [https://gmi.ucsf.edu/cme-outreach/ UCSF]
###################################################
* [http://openonlinecourses.com/ehr/PrecisionAndPredictiveMedicine.asp Precision & Predictive Medicine]
library(GenomicAlignments)
 
== Personalized medicine ==
* [https://www.nytimes.com/2017/08/30/health/gene-therapy-cancer.html F.D.A. Approves First Gene-Altering Leukemia Treatment]
* [http://time.com/4989537/blood-cancer-gene-therapy/ The FDA Just Approved a New Way of Fighting (lymphoma) Cancer Using Personalized Gene Therapy]
* [https://ghr.nlm.nih.gov/primer/precisionmedicine/precisionvspersonalized What is the difference between precision medicine and personalized medicine? What about pharmacogenomics?] "personalized medicine" is an older term. [https://ghr.nlm.nih.gov/primer Help Me Understand Genetics]
* [https://academic.oup.com/jnci/article/113/12/1601/6212056 Predictive Biomarkers: Progress on the Road to Personalized Cancer Immunotherapy] Emens 2021


# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
== Cancer and gene markers ==
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
* '''Colorectal cancer''' patients without '''KRAS mutations''' have far better outcomes with '''anti-EGFR treatment''' than those with KRAS mutations.
    recursive=TRUE, pattern="*bam$", full=TRUE)
** Two '''EGFR inhibitors''', cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
path <- system.file("extdata", package="pasillaBamSubset")
* '''Breast cancer'''. 
bamlst <- BamFileList(fls)
** [https://en.wikipedia.org/wiki/Trastuzumab Trastuzumab]
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
** [https://en.wikipedia.org/wiki/Tamoxifen Tamoxifen]
assays(genehits)$counts


###################################################
== The shocking truth about space travel ==
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
[https://www.morningticker.com/2018/03/the-shocking-truth-about-space-travel/ 7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station]
###################################################
library(DESeq)


expdata = MIAME(
== bioSyntax: syntax highlighting for computational biology ==
              name="pasilla knockdown",
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y
              lab="Genetics and Developmental Biology, University of
 
                  Connecticut Health Center",
== Deep learning ==
              contact="Dr. Brenton Graveley",
[https://www.nature.com/articles/s41576-019-0122-6 Deep learning: new computational modelling techniques for genomics]
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi
 
                  knockdown RNA-Seq Studies",
== HRD/homologous recombination deficiency ==
              pubMedIds="20921232",
* https://en.wikipedia.org/wiki/Homologous_recombination
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
** Homologous recombination proficient (HRP) cancer cells can repair DNA damage caused by chemotherapy, making them difficult to treat.
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
** Drugs have been developed to target homologous recombination via c-Abl inhibition and to exploit (take advantage of) deficiencies in homologous recombination in cancer cells with BRCA mutations.
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs
** One such drug is Olaparib, a PARP1 inhibitor that targets cancer cells by inhibiting base-excision repair (BER) in HR-deficient cells. However, cancer cells can become resistant to PARP1 inhibitors if they acquire deletions that remove the mutations in BRCA2, restoring their ability to repair DNA by HR.
                  encoding pasilla, a mRNA binding protein and 4 biological replicates
* https://www.genome.gov/genetics-glossary/homologous-recombination (graphical illustration). Homologous recombination is a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA.
                  of the the untreated cell line.")
 
* BRCA1 and BRCA2 are genes that produce proteins responsible for repairing damaged DNA within cells. Mutations in these genes can lead to errors in the DNA repair process, resulting in an accumulation of mutations that can cause cancer. This '''condition''' is known as '''Homologous Recombination Deficiency (HRD)'''. [https://en.wikipedia.org/wiki/PARP_inhibitor PARP inhibitors] are a type of targeted therapy that blocks the enzyme Poly (ADP-ribose) polymerase (PARP), which helps repair DNA damage in (cancer) cells. By inhibiting PARP, these drugs prevent cancer cells from repairing their DNA, leading to cell death.


design <- data.frame(
* The '''inability to repair DNA damage''' is referred to as homologous recombination deficiency (HRD). [https://www.qiagen.com/us/applications/oncology/solid-tumor/dna-damage-response/hrd DNA damage response] from Qiagen.  
              condition=c("untreated", "untreated"),
* openai
              replicate=c(1,1),
**  Poly(ADP-ribose) polymerase '''(PARP) inhibitors''' are a class of '''drugs''' that are designed to '''inhibit the activity of PARP enzymes'''. PARP enzymes are proteins that are involved in DNA repair pathways. They help to repair DNA damage and maintain the stability of the genome by adding a chemical group called poly(ADP-ribose) to other proteins.
              type=rep("single-read", 2), stringsAsFactors=TRUE)
** PARP inhibitors work by blocking the activity of PARP enzymes, which can interfere with the ability of cells to repair damaged DNA. This can be especially useful in '''cancer cells''', which often rely on PARP enzymes to repair DNA damage and maintain genomic stability. By inhibiting PARP enzymes, PARP inhibitors can sensitize cancer cells to chemotherapy and other treatments, making them more vulnerable to cell death.
library(DESeq)
** Is there a drug targeting HRP cancer patients? One approach is to target homologous recombination via c-Abl inhibition. For example, [https://pubmed.ncbi.nlm.nih.gov/34635996/ Niraparib] is a PARP inhibitor that has been shown to be effective in treating advanced ovarian cancer in both homologous recombination deficient (HRD) and homologous recombination proficient (HRP) patients.
geneCDS <- newCountDataSet(
** Are there any drugs that target HRD cancer patients? Yes, several drugs have been developed to target HRD cancer cells. One approach is to use PARP inhibitors, which exploit deficiencies in homologous recombination in cancer cells with BRCA mutations. For example, Olaparib is a PARP1 inhibitor that has been shown to be effective in shrinking or stopping the growth of tumors from breast, ovarian and prostate cancers caused by mutations in the BRCA1 or BRCA2 genes. By inhibiting base-excision repair (BER) in HR-deficient cells, Olaparib applies the concept of synthetic lethality to specifically target cancer cells. PS: '''Examples of PARP inhibitors''' include niraparib (Zejula), olaparib (Lynparza), talazoparib (Talzenna), and rucaparib (Rubraca).
                  countData=assay(genehits),
** PARP1 is a ''member'' of the PARP family of proteins. PARP stands for Poly (ADP-ribose) polymerase. The [https://en.wikipedia.org/wiki/Poly_%28ADP-ribose%29_polymerase PARP] family comprises 17 members.
                  conditions=design)


experimentData(geneCDS) <- expdata
* '''Loss-of-function''' mutations in genes involved in this pathway can '''sensitize''' tumors to poly(adenosine diphosphate [ADP]-ribose) polymerase (PARP) inhibitors and '''platinum-based (Pt)''' chemotherapy.
sampleNames(geneCDS) = colnames(genehits)
** Certain genes involved in a process called Homologous Recombination Repair (HRR) can make cancer cells more '''susceptible''' to certain treatments. Specifically, when these genes carry loss-of-function mutations, the cancer cells become more sensitive to two types of treatment: PARP inhibitors and platinum-based chemotherapy.
** '''Loss-of-function''' mutations prevent a gene from functioning properly or at all. By disrupting the normal functioning of the HRR pathway, they can make cancer cells more sensitive to PARP inhibitors and platinum-based chemotherapy.
* [https://github.com/sztup/scarHRD scarHRD] package. Note for the 2 test samples, the return object is a 1x4 matrix.
** HRD-LOH/Loss of Heterozygosity
** LST/Large Scale Transitions
** Number of Telomeric Allelic Imbalances
** HRD-sum (sum of the above 3)
* [https://academic.oup.com/oncolo/article/27/3/167/6515681 Homologous Recombination Deficiency: Concepts, Definitions, and Assays] Stewart 2022
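* [https://pzweuj.github.io/2021/06/10/HRD.html Analysis of homologous recombination deficiency (HRD)] (in Chinese)
* [https://cloud.tencent.com/developer/article/1701897 How did this 10+ impact factor SCI paper with only two figures succeed?] (in Chinese)
* [https://blog.csdn.net/fanyucai1/article/details/119741040 HRD detection methods] (in Chinese)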
* [https://github.com/ucscXena/gitbookdocs/blob/master/overview-of-features/genomic-signatures.md Link] to HRD score, genome-wide DNA damage footprint. The HRD value has a range 0 to 101. Most are 0 and the rest follows an exponential distribution.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640082/ Dynamically Accumulating Homologous Recombination Deficiency Score Served as an Important Prognosis Factor in High-Grade Serous Ovarian Cancer] Su 2021
** A combined HRD score ≥42 was associated with shorter OS in 33 cancer types.
** However, in ovarian cancer, which had the highest HRD scores among the cancer types, the HRD ≥42 cohort was significantly associated with longer OS.
** An HRD score of ≥42 was determined to signify HRD (HR-deficient), and a score of <42 was considered HR-proficient in clinical trials.
** The datasets, including (1) signatures: HRD score and genome-wide DNA damage footprint; (2) phenotype: curated clinical data; and (3) gene expression RNAseq: TOIL RSEM TPM, were downloaded from https://xenabrowser.net/.
* [https://www.nature.com/articles/s41598-021-03432-3 Identification of molecular subtypes and prognostic signature for hepatocellular carcinoma based on genes associated with homologous recombination deficiency] Lin 2021.
** The ComBat function of the R package sva was used for batch effect removal.
* https://xenabrowser.net/datapages/ has one set called "TCGA Pan-Cancer (PANCAN) (41 datasets)" -> HRD score, genome-wide DNA damage footprint (n=10,647) Pan-Cancer Atlas Hub.
* https://github.com/GerkeLab/TCGAhrd  where HRD scores can be downloaded from https://gdc.cancer.gov/about-data/publications/PanCan-DDR-2018. Choose 'Zip file with TSV files of DDR Data Resources'. Pay attention to "DDRscores.tsv", "Scores.tsv" and "Samples.tsv".
<ul>
<li>[https://github.com/GerkeLab/TCGAhrd Homologous recombination deficiency in TCGA PanCancer Atlas]
<pre>
> load("clinical_and_hrd.RData")
> dim(full_dat)
[1] 9105  792
> full_dat[1:5, c("HRD_Score", "HRD_LOH", "HRD_LST", "HRD_TAI")]
  HRD_Score HRD_LOH HRD_LST HRD_TAI
1        7      2      2      3
2        9      3      2      4
3        0      0      0      0
4        8      4      2      2
5        5      1      1      3
> summary( full_dat[1:5, c("HRD_Score", "HRD_LOH", "HRD_LST", "HRD_TAI")])
  HRD_Score      HRD_LOH    HRD_LST      HRD_TAI
Min.  :0.0  Min.  :0  Min.  :0.0  Min.  :0.0
1st Qu.:5.0  1st Qu.:1  1st Qu.:1.0  1st Qu.:2.0
Median :7.0  Median :2  Median :2.0  Median :3.0
Mean  :5.8  Mean  :2  Mean  :1.4  Mean  :2.4
3rd Qu.:8.0  3rd Qu.:3  3rd Qu.:2.0  3rd Qu.:3.0
Max.  :9.0  Max.  :4  Max.  :2.0  Max.  :4.0


###################################################
> load("DDRscores.RData")
### code chunk number 16: pasilla_genes (eval = FALSE)
> ls()
###################################################
[1] "dat"      "full_dat"
chr4tx <- split(gff, mcols(gff)$transcript_id)
> dim(dat)
txhits <- summarizeOverlaps(chr4tx, bamlst)
[1] 9125  46
txCDS <- newCountDataSet(assay(txhits), design)
> colnames(dat)
experimentData(txCDS) <- expdata
[1] "patient_id"            "acronym"                "mutLoad_silent"
[4] "mutLoad_nonsilent"      "mutSig1"                "mutSig2"
[7] "mutSig3"                "mutSig4"                "mutSig5"
[10] "mutSig6"                "mutSig7"                "mutSig8"
[13] "mutSig9"                "mutSig10"              "mutSig11"
[16] "mutSig12"              "mutSig13"              "mutSig14"
[19] "mutSig15"              "mutSig16"              "mutSig17"
[22] "mutSig18"              "mutSig19"              "mutSig20"
[25] "mutSig21"              "CNA_n_segs"            "CNA_frac_altered"
[28] "CNA_n_focal_amp_del"    "aneuploidy_score"      "aneuploidy_score_prime"
[31] "LOH_n_seg"              "LOH_frac_altered"      "purity"
[34] "ploidy"                "genome_doublings"      "subclonal_frac"
[37] "HRD_TAI"                "HRD_LST"                "HRD_LOH"
[40] "HRD_Score"              "eCARD"                  "PARPi7"
[43] "PARPi7_bin"            "RPS"                    "tp53_score"
[46] "rppa_ddr_score"
</pre>
</pre>
<li>[https://www.frontiersin.org/articles/10.3389/fonc.2021.746571/full#h13 HRD status is highly '''correlated''' with platinum chemotherapy sensitivity in patients with high-grade serous ovarian cancer (HGSOC)] 2022. Findings showed that the HRD score of platinum treatment-sensitive patients was slightly higher than that of Pt-resistant patients. Analysis showed that [https://ecancer.org/en/news/23005-hrd-detection-predicts-sensitivity-to-platinum-based-chemotherapy-for-ovarian-cancer-patients-in-china patients with positive HRD status had a significantly longer progression-free survival (PFS) compared to those with negative HRD status].
* [https://ovarianresearch.biomedcentral.com/articles/10.1186/s13048-019-0511-7 Effect of BRCA mutational status on survival outcome in advanced-stage high-grade serous ovarian cancer] 2019. The median PFS was 22.9 months for the BRCA mutation group compared to 16.9 months for the wild-type BRCA group. 6 months was used as a cutoff.
* [https://www.healthday.com/health-news/cancer/ovarian-cancer-prognosis-may-depend-on-gene-mutations-651331.html Ovarian Cancer Prognosis May Depend on Gene Mutations]. Five-year survival rates were 61 percent for those with the BRCA2 mutation, 46 percent for those with the BRCA1 mutation and 36 percent for those with neither mutation, the investigators found.
* [https://ovarianresearch.biomedcentral.com/articles/10.1186/s13048-023-01129-x Homologous recombination deficiency status predicts response to platinum-based chemotherapy in Chinese patients with high-grade serous ovarian carcinoma] 2023.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10232447/ HRD effects on first-line adjuvant chemotherapy and PARPi maintenance therapy in Chinese ovarian cancer patients] 2023


We can also check out ?summarizeOverlaps to find some fake examples.
<li>[https://www.nature.com/articles/s41467-020-19406-4 Pan-cancer landscape of homologous recombination deficiency] 2020


== [https://cran.r-project.org/web/packages/chromoMap/ chromoMap] ==
<li>[https://www.medicalnewstoday.com/articles/hrd-positive-ovarian-cancer What to know about HRD testing for ovarian cancer]


== [http://bioconductor.org/packages/release/bioc/html/Rsubread.html Rsubread] ==
<li>[https://www.azprecisionmed.com/tumor-type/ovarian-cancer/hrd-testing.html HRD in Ovarian Cancer]
See [https://support.bioconductor.org/p/65604/ this post] about the C version of the [http://bioinf.wehi.edu.au/featureCounts/ featureCounts] program.


[https://www.biostars.org/p/96176/ featureCounts vs HTSeq-count]
<li>SAP for [https://classic.clinicaltrials.gov/ProvidedDocs/44/NCT01891344/SAP_001.pdf from Clovis Oncology] & Rucaparib (a PARP inhibitor)


== Inference ==
<li>gLOH vs LOH: The gLOH score and the LOH score are two different measures of Homologous Recombination Deficiency (HRD) in tumors. The gLOH score measures genomic loss of heterozygosity (gLOH) while the LOH score measures loss of heterozygosity (LOH) in a specific region or gene.
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz089/5307752 How well do RNA-Seq differential gene expression tools perform in a complex eukaryote? A case study in A. thaliana] Froussios, Bioinformatics 2019
 


=== DESeq or edgeR ===
<li>Labs:
* [http://genomebiology.com/2014/15/12/550#sec4 DESeq2 method]
* ACTHRD from [https://www.actgenomics.com/patients_product.php?id=5 ACT Genomics 行動基因]. ACTHRD™ detects HRD status by LOH score and 24 HRR-related genes to evaluate whether a tumor is suitable for PARP inhibitors.
* DESeq2 with a large number of samples -> use DESeq2 to normalize the data and then do a Wilcoxon rank-sum test on the normalized counts, for each gene separately, or, even better, use a permutation test. See [https://support.bioconductor.org/p/60432/ this post]. Or consider the limma-voom method instead, which will handle 1000 samples in a few seconds without the need for extra memory.
* AmoyDx. [https://pubmed.ncbi.nlm.nih.gov/36136896/ In-house testing for homologous recombination repair deficiency (HRD) testing in ovarian carcinoma: a feasibility study comparing AmoyDx HRD Focus panel with Myriad myChoiceCDx assay]
* edgeR normalization factor [https://support.bioconductor.org/p/65683/ post]. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the [http://genomebiology.com/2010/11/3/r25 paper by Robinson & Oshlack 2010] for more details. Briefly, M-values are defined as the library size-adjusted log-ratio of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE.
* [https://bionano.com/hrd-testing/ Bionano]
* edgeR [http://f1000research.com/articles/5-1438/v1 From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline]
* [https://www.webull.com/news/25116275 Burning Rock] announced that it has obtained a license for the MyChoice® tumor test in China
* [https://support.bioconductor.org/p/65890/ Can I feed TCGA normalized count data to EdgeR?]
* [https://info.foundationmedicine.com/hubfs/FMI%20Labels/FoundationOne_CDx_Label_Technical_Info.pdf FoundationMedicine (?FMI) technical information]. ''Positive homologous recombination deficiency (HRD) status (F1CDx HRD defined as tBRCA-positive and/or LOH high) in ovarian cancer patients is associated with improved progression-free survival (PFS) from Rubraca (rucaparib) maintenance therapy in accordance with the Rubraca product label. '' [https://www.accessdata.fda.gov/cdrh_docs/pdf17/P170019S006C.pdf FDA doc]. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8926248/ Clinical and analytical validation of FoundationOne®CDx, a comprehensive genomic profiling assay for solid tumors] 2022.
* [https://support.bioconductor.org/p/66067/ counts() function and normalized counts].
* [https://support-docs.illumina.com/SW/DRAGEN_Analysis_Workflows/Content/SW/Informatics/APP/HRD_appT500ctDNAlocal.htm Illumina]
* [https://support.bioconductor.org/p/74572/ Why use Negative binomial distribution in RNA-Seq data?] and the [http://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lect/L05-deseq2-anders.pdf Presentation] by Simon Anders.
* [https://myriad.com/genetic-tests/mychoicecdx-tumor-test/ Myriad]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1803-9 XBSeq2] – a fast and accurate quantification of differential expression and differential polyadenylation. By using simulated datasets, they demonstrated that, overall, XBSeq2 performs equally well as XBSeq in terms of several statistical metrics and both perform better than DESeq2 and edgeR.
* [https://pillarbiosci.com/news/xing-genomic-services-receives-nata-accreditation-for-pillar-biosciences-hrd-gene-panel-enabling-cost-effective-parpi-response-prediction/ Pillar Biosciences]
* [https://www.biorxiv.org/content/early/2018/08/27/399931 DEBrowser: Interactive Differential Expression Analysis and Visualization Tool for Count Data] Alper Kucukural 2018
* [https://www.sophiagenetics.com/press-releases/sophia-genetics-launches-new-deep-learning-capabilities-to-support-the-detection-of-homologous-recombination-deficiencies/ Sophia Genetics]
* Tempus [https://www.tempus.com/oncology/algorithmic-tests/ AI-DRIVEN HRD TEST]
* Thermo fisher scientific [https://assets.thermofisher.com/TFS-Assets/CSD/Reference-Materials/hrd-biomarker-guide.pdf Everything you need to know about homologous recombination repair (HRR) and homologous recombination deficiency (HRD) testing]
</ul>
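The TMM idea described in the edgeR normalization item above can be sketched in a few lines of base R. This is a simplified illustration, not edgeR's implementation: it is unweighted, trims only on M-values, and assumes that 30% is trimmed from each tail. The function name <code>tmm_factor_sketch</code> and the toy counts are hypothetical.

```r
# Simplified TMM-style factor between two libraries (a sketch, not edgeR's code).
# obs, ref: count vectors for the same genes in two libraries.
tmm_factor_sketch <- function(obs, ref, trim = 0.3) {
  keep <- obs > 0 & ref > 0                      # log-ratios need positive counts
  # M-values: library size-adjusted log2 ratio of counts
  m <- log2((obs[keep] / sum(obs)) / (ref[keep] / sum(ref)))
  # Trimmed mean of the M-values, returned on the raw (non-log) scale
  2^mean(m, trim = trim)
}

# Toy data: every gene is doubled in 'obs' except gene 1, which is strongly DE
ref <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
obs <- 2 * ref
obs[1] <- 1000
tmm_factor_sketch(obs, ref)
```

Because the single DE gene falls in the trimmed tail, the factor reflects only the non-DE genes, which is the "most genes are not DE" assumption at work. For real analyses use edgeR's own <code>calcNormFactors()</code>, which also trims on absolute expression and weights the genes.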


=== [http://www.bioconductor.org/packages/devel/bioc/html/EBSeq.html EBSeq] ===
== Computational Pathology ==
An R package for gene and isoform differential expression analysis of RNA-seq data
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6852275/ Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the Digital Pathology Association] 2019
* [https://link.springer.com/referenceworkentry/10.1007/978-3-030-80962-1_334-1 Challenges in Computational Pathology of Biomarker-Driven Predictive and Prognostic Immunotherapy]


http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/
== Bi-allelic, monoallelic ==
[https://www.nature.com/scitable/content/bi-allelic-and-monoallelic-expression-8816761/ Bi-allelic and monoallelic expression]. In most cases, both alleles (the two chromosomal copies) are transcribed; this is known as bi-allelic expression (left). However, a minority of genes show monoallelic expression (right). In these cases, only one allele of a gene is expressed (right).


=== [http://www.bioconductor.org/packages/release/bioc/html/prebs.html prebs] ===
== SOMAscan assay (proteomic) ==
Probe region expression estimation for RNA-seq data for improved microarray comparability
<ul>
<li>https://somalogic.com/somascan-discovery/, https://somalogic.com/somascan-platform/, https://somalogic.com/user-manuals/
* [https://somalogic.com/wp-content/uploads/2022/08/SL00000572_Rev4_2022-01_SomaScan-Assay-v4.1.pdf SomaScan Assay v4.1] 7k
* [https://github.com/SomaLogic/SomaDataIO SomaDataIO] R package (currently 6.0.0), [https://somalogic.github.io/SomaDataIO/ Documentation including Vignettes].
:<syntaxhighlight lang='r'>
f <- "SS-1234567_v4.1_Serum.hybNorm.medNormInt.plateScale.calibration.anmlQC.qcCheck.anmlSMP.adat"
my_adat <- read_adat(f)
eset <- adat2eSet(my_adat)


=== [http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html DEXSeq] ===
exprs(eset) |> dim() # 7596 x ns
Inference of differential exon usage in RNA-Seq


=== [http://www-personal.umich.edu/~jianghui/rseqnp/ rSeqNP] ===
pData(phenoData(eset)) # ns x 33
A non-parametric approach for detecting differential expression and splicing from RNA-Seq data
table(pData(phenoData(eset))$SampleType)


=== [https://peerj.com/articles/3890/ voomDDA]: discovery of diagnostic biomarkers and classification of RNA-seq data ===
pData(featureData(eset)) # 7596 x 19
http://www.biosoft.hacettepe.edu.tr/voomDDA/
j2 <- pData(featureData(eset))$Organism == "Human" &
    (pData(featureData(eset))$Type %in% c("Protein", "Non-Human")) 
sum(j2) # 7289
</syntaxhighlight>
* [https://somalogic.github.io/SomaScan.db/ SomaScan.db] R package. v4.0 is 5K.
* [https://investors.somalogic.com/news-releases/news-release-details/somalogic-announces-new-somascanr-11k-platform SomaScan 11K Assay v5.0]


== Pathway analysis ==
<li>[https://pubmed.ncbi.nlm.nih.gov/27146037/ readat: An R package for reading and working with SomaLogic ADAT files]
=== [http://bioconductor.org/packages/release/bioc/html/fgsea.html fgsea:] Fast Gene Set Enrichment Analysis ===
* https://bitbucket.org/graumannlabtools/readat/src/master/. The R package is not in Bioconductor.
* [https://stephenturner.github.io/deseq-to-fgsea/ DESeq results to pathways in 60 Seconds with the fgsea package]
* [https://mgrcbioinfo.github.io/my_GSEA_plot/ GSEA plot for multiple comparisons]


=== [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR]: Reproducible GSEA Benchmarking ===
<li>[https://rdrr.io/github/andreagrioni/ToolViz/ ToolViz] package.
[https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis]
<li>[https://www.ncbi.nlm.nih.gov/gds/?term=somascan NCBI-GEO], [https://www.ncbi.nlm.nih.gov/geo/browse/?view=platforms&search=somascan&display=20 All list]


=== [https://bioconductor.org/packages/release/bioc/html/rgsepd.html GSEPD] ===
<li>[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9561184/ Assessment of variability in the plasma 7k SomaScan proteomics assay] 2022
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2697-5 GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display]
</ul>


== Pipeline ==
== Scandal ==
=== [https://github.com/PF2-pasteur-fr/SARTools SARTools] ===
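[https://www.techbang.com/posts/98333-alzheimers-papers-are-suspected-of-being-fraudulent-setting Key Alzheimer's paper exposed as suspected fraud; medical experts worldwide may have been misled for 16 years] & [https://news.ltn.com.tw/news/world/breakingnews/4000907 Key Alzheimer's paper suspected of fraud, misleading the field for 16 years] (both in Chinese)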
http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/
 
= Terms =
== RNA vs DNA ==
* [https://medcitynews.com/2020/09/why-rna-is-a-better-measure-of-a-patients-current-health-than-dna/ Why RNA is a better measure of a patient’s current health than DNA]
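* [http://www.hkcna.hk/content/2020/1216/868269.shtml As vaccinations begin in the UK and US, conspiracy theories flourish] (in Chinese). Ribonucleic acid (核糖核酸) = RNA. Deoxyribonucleic acid (脱氧核糖核酸) = DNA. The COVID-19 vaccine developed by Pfizer works on the mRNA vaccine principle: part of the virus's RNA that encodes a protein (also called messenger RNA) is used to make the vaccine.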
** [https://vitals.lifehacker.com/how-mrna-vaccines-work-1845895792 How mRNA Vaccines Work]
** [https://www.acsh.org/news/2020/10/21/how-pfizers-rna-vaccine-works-15104 How Pfizer's RNA Vaccine Works]
** Pfizer’s Covid Vaccine: 11 Things You Need to Know
** [https://blog.ephorie.de/covid-19-vaccine-95-effective-it-doesnt-mean-what-you-think-it-means COVID-19 vaccine “95% effective”: It doesn’t mean what you think it means!]
** [https://www.courthousenews.com/vaccine-technology-how-mrna-changed-the-fight-against-covid-19/ Vaccine Technology: How MRNA Changed the Fight Against Covid-19]


=== SEQprocess ===
== Gene structure ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2676-x SEQprocess]: a modularized and customizable pipeline framework for NGS processing in R package
https://zhuanlan.zhihu.com/p/49601643


== pasilla and pasillaBamSubset Data ==
== Pseudogene ==
pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.
https://www.genome.gov/genetics-glossary/Pseudogene. An example: [https://www.genecards.org/cgi-bin/carddisp.pl?gene=OR7E47P OR7E47P] with alias [https://genome.weizmann.ac.il/horde/card/index/symbol:OR7E47P bpl 41-16 or bpl41-16].


pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).
== PCR ==
[https://unclegene6666.pixnet.net/blog/post/302562043 What is PCR? Polymerase chain reaction?] by Uncle Gene (基因叔叔), in Chinese


== [http://www.bioconductor.org/packages/release/bioc/html/BitSeq.html BitSeq] ==
== Epidemiology ==
Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of [http://www.hiit.fi/u/ahonkela/ Antti Honkela].
[https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated Epidemiology for the uninitiated]


== ReportingTools ==
== Cell lines ==
The ReportingTools software package enables users to easily display reports of analysis
* cell line (in vitro). tumor samples (in vivo)
results generated from sources such as microarray and sequencing data.
* [https://www.nature.com/articles/nature14397 A resource for cell line authentication, annotation and quality control]
* [https://www.biologydiscussion.com/cell/cell-lines/cell-lines-types-nomenclature-selection-and-maintenance-with-statistics/10517 Cell Lines: Types, Nomenclature, Selection and Maintenance (With Statistics)]
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001206/ Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer] Jiang 2016
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3715866/ Evaluating cell lines as tumour models by comparison of genomic profiles] Domcke 2013
* Google: tumor cell line vs tumor samples


== [http://cran.r-project.org/web/packages/sequences/index.html sequences] ==
== in vivo, in vitro, and in situ ==
More or less an educational package. It contains two C/C++ source files. It is used in Advanced R programming and package development.
=== In silico (computer simulation; in silicon, s=simulation) ===
* https://en.wikipedia.org/wiki/In_silico An in silico experiment is one performed on computer or via computer simulation.
* The main difference between '''in silico''' gene expression analysis and '''experimental''' gene expression analysis is the method used to study the patterns and levels of gene expression.
** In silico gene expression analysis involves the use of computational tools and algorithms to analyze large datasets of gene expression data obtained from techniques such as microarrays, RNA sequencing, or single-cell RNA sequencing. This analysis can include identifying differentially expressed genes between samples, clustering genes with similar expression patterns, and predicting the functional roles of genes based on their expression profiles.
** On the other hand, experimental gene expression analysis involves directly measuring the levels and patterns of gene expression using laboratory techniques. These techniques can include real-time polymerase chain reaction (PCR), northern blotting, western blotting, and immunohistochemistry, among others. These experimental techniques allow researchers to directly measure the levels of specific RNA or protein molecules in biological samples.
** While in silico gene expression analysis is a rapid and cost-effective way to analyze large datasets of gene expression data, it relies on the accuracy and completeness of the data being analyzed. Experimental gene expression analysis provides a more direct and accurate view of gene expression but can be more time-consuming and expensive. In practice, both in silico and experimental gene expression analysis are valuable tools that can be used to complement each other in the study of gene expression and its role in various biological processes and diseases.
* [https://molecular-cancer.biomedcentral.com/articles/10.1186/1476-4598-6-50 In silico gene expression analysis--an overview] Murray 2007
* [https://www.future-science.com/doi/10.2144/btn-2018-0179 A simple in silico approach to generate gene-expression profiles from subsets of cancer genomics data] 2019
** TCGA from cbioportal was used
** Detailed steps, including screenshots, are available in the [https://www.future-science.com/doi/suppl/10.2144/btn-2018-0179 Supplementary material]


== [http://www.bioconductor.org/packages/release/bioc/html/QuasR.html QuasR] ==
=== In situ (in the original place; between in vivo and in vitro) ===
[http://bioinformatics.oxfordjournals.org/content/early/2014/12/09/bioinformatics.btu781.short?rss=1&ssource=mfr Bioinformatics] paper
* https://en.wikipedia.org/wiki/In_situ The meaning lies roughly between in vivo and in vitro.
* Something that’s performed in situ means that it’s observed in its natural context, but '''outside of a living organism'''. In vivo is Latin for “within the living.” It refers to work that’s performed in a whole, living organism.
* A good example of this is a technique called '''in situ hybridization (ISH)'''. ISH can be used to look for a specific '''nucleic acid''' (DNA or RNA) within something like a tissue sample. Specialized probes are used to bind to a specific nucleic acid sequence that the researcher is looking to find. These probes are tagged with things like radioactivity or fluorescence. This allows the researcher to see where the nucleic acid is located within the tissue sample. ISH allows the researcher to observe where a nucleic acid is located within its natural context, yet outside of a living organism. Examples are microarray experiments.


== CRAN packages ==
=== in vivo (within a living body) ===
=== [https://cran.r-project.org/web/packages/ssizeRNA/index.html ssizeRNA] ===
* IN VIVO describes a medical experiment or a test that is performed on a '''living organism''', e.g. a human being or a laboratory animal.
Sample Size Calculation for RNA-Seq Experimental Design
* [https://www.promocell.com/in-the-lab/human-primary-cells-and-immortal-cell-lines/ Human primary cells and immortal cell lines: differences and advantages]
* [https://pediaa.com/what-is-the-difference-between-primary-cell-culture-and-cell-line/ What is the Difference Between Primary Cell Culture and Cell Line]


=== [http://master.bioconductor.org/packages/devel/bioc/html/RnaSeqSampleSize.html RnaSeqSampleSize] ===
'''Syngenic'''
[https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/ Shiny] app
* https://en.wikipedia.org/wiki/Syngenic
* [https://www.crownbio.com/model-systems/in-vivo Syngeneic models]


=== [https://cran.r-project.org/web/packages/rbamtools/index.html rbamtools] ===
''Syngeneic tumor models'' are experimental models used in cancer research that use '''genetically identical animals''' to study the growth and spread of cancer cells. In these models, a malignant tumor is induced in one animal and then transplanted into another animal of the same genetic background. This allows researchers to study the interactions between the host immune system and the cancer cells, as well as the response of the tumor to various treatments.
Provides an interface to functions of the 'SAMtools' C-Library by Heng Li


=== [http://cran.r-project.org/web/packages/refGenome/index.html refGenome]  ===
''Syngeneic tumor models'' are often used in combination with other experimental models, such as '''xenograft models''' (where the cancer cells are transplanted into a genetically different animal) or '''cell line models''' (where the cancer cells are grown in a laboratory). By using a combination of these models, researchers can gain a more complete understanding of the biology of cancer and develop new treatments for cancer patients.
The package contains functionality for importing and managing downloaded genome annotation data from the Ensembl genome browser (European Bioinformatics Institute) and from the UCSC genome browser (University of California, Santa Cruz), as well as annotation routines for genomic positions and splice site positions.


=== [http://cran.r-project.org/web/packages/WhopGenome/index.html WhopGenome]  ===
The ''syngeneic model'' is an important tool for studying the role of the immune system in cancer, as the genetically identical animals allow researchers to control for genetic differences that might impact the immune response. Additionally, because the host immune system in these models is functional and can mount a response against the transplanted cancer cells, the syngeneic model provides a more realistic representation of the host-tumor interaction than other models that rely on immunodeficient animals.
Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO, and can read, modify and write PLINK .PED-format pedigree files.


=== [https://cran.r-project.org/web/packages/TCGA2STAT/index.html TCGA2STAT] ===
=== in vitro (in a test tube / outside the body) ===
Simple TCGA Data Access for Integrated Statistical Analysis in R
* https://en.wikipedia.org/wiki/In_vitro
* https://en.wikipedia.org/wiki/In_silico
* https://en.wikipedia.org/wiki/RNA-Seq


TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.
== RUO: research use only ==
<syntaxhighlight lang='rsplus'>
'''RUO''' stands for "Research Use Only". In the context of clinical trials and laboratory research, it refers to in vitro diagnostic products (IVDs) that are intended to be used in non-clinical studies, including to gather data for submission as required by regulatory authorities. These products are not intended for use in diagnostic procedures. They are often used by medical laboratories and other institutions for research purposes. However, if these products are used for purposes other than research, it could have legal implications. It's important to note that RUO products are not subject to the same regulatory controls as in-vitro diagnostic medical devices (CE-IVDs) that must comply with the applicable legal requirements.
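# biocLite() has been replaced by BiocManager in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("CNTools")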


install.packages("TCGA2STAT")
* [https://www.fda.gov/regulatory-information/search-fda-guidance-documents/distribution-in-vitro-diagnostic-products-labeled-research-use-only-or-investigational-use-only Distribution of In Vitro Diagnostic Products Labeled for Research Use Only or Investigational Use Only]
</syntaxhighlight>
* [https://blog.microbiologics.com/ivd-in-vitro-diagnostic-versus-ruo-research-use-only/ In Vitro Diagnostic Use (IVD) versus Research Use Only (RUO) in the Clinical Laboratory]


The getTCGA() function can download various kinds of data:
== RNA sequencing 101 ==
* '''gene expression''' which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
Web
* '''miRNA expression''' which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
* [https://www.edx.org/course/introduction-to-biology-the-secret-of-life-24 Introduction to Biology - The Secret of Life] from edX. It's one of the top 100 best [https://www.classcentral.com/collection/top-free-online-courses free online courses] posted by ClassCentral.
* '''mutation''' data (data.type="Mutation")
* [http://ctehr.tamu.edu/media/864063/jason-seq-pres.pdf Introduction to RNA-Seq] including biology overview (DNA, alternative splicing, mRNA structure, human genome) and sequencing technology.
* '''methylation expression''' (data.type="Methylation")
* [https://youtu.be/7BLS_YY9HeM Introduction to RNA-Seq for Researchers] (youtube)
* '''copy number changes''' (data.type="CNA_SNP")
* [http://www.chem.agilent.com/Library/eseminars/Public/RNA%20Sequencing%20101.pdf RNA Sequencing 101] by Agilent Technologies. Includes the definition of sequencing depth (number of reads per sample) and coverage (number of reads/locus).
** [https://www.ecseq.com/support/ngs/what-is-a-good-sequencing-depth-for-bulk-rna-seq What is a good sequencing depth for bulk RNA-Seq?] In general 5 M mapped reads is a good bare minimum for a differential gene expression (DGE) analysis in human. In many cases 5 M – 15 M mapped reads are sufficient. Many published human RNA-Seq experiments have been sequenced with a sequencing depth between 20 M - 50 M reads per sample.  
** [https://www.biostars.org/p/139006/ How to count fastq reads] ''echo $(zcat yourfile.fastq.gz | wc -l)/4 | bc''
* Where do we get reads(A,C,T,G) from sample RNA? See page 12 of this [https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf pdf] from Colin Dewey in U. Wisc.
* Quantification of RNA-Seq data (see the above pdf)
* Convert read counts into expression: RPKM (see the above pdf)
* RPKM and FPKM ([https://docs.google.com/file/d/0B23nQZpa5ce0ak9jNEdlMEVqemc/edit Data analysis of RNA-seq from new generation sequencing] by 張庭毓). RNA-Seq data analysis workshop and hands-on course / RNA-Seq sequencing / next-generation sequencing (NGS) / high-throughput gene sequencing analysis.
* [http://yourgene.pixnet.net/blog/post/66237799 First vs Second] generation sequence.
* [http://sfg.stanford.edu/SFG.pdf The Simple Fool’s Guide to Population Genomics via RNA-Seq]: An Introduction to High-Throughput Sequencing Data Analysis. This covers QC, De novo assembly, BLAST, mapping reads to reference sequences, gene expression analysis and variant (SNP) detection.
* [http://nihlibrary.nih.gov/Services/Bioinformatics/Documents/Lipsett100928talk_web.pdf An Introduction to Bioinformatics Resources and their Practical Applications] from [http://nihlibrary.nih.gov/Services/Bioinformatics/Pages/default.aspx NIH library Bioinformatics Support Program].
* [http://www.rnaseqforthenextgeneration.org/resources/index.html Teaching material] from rnaseqforthenextgeneration.org which includes Designing RNA-Seq experiments, Processing RNA-Seq data, and Downstream analyses with RNA-Seq data.
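Two of the items above, counting FASTQ reads and converting read counts into RPKM, can be sketched in base R. This is a minimal illustration under stated assumptions: each FASTQ record occupies exactly 4 lines (no wrapped sequence lines), and RPKM is taken as 10^9 × reads mapped to the gene / (total mapped reads × gene length in bp). The function names and the toy data are hypothetical.

```r
# Count reads in a (gzipped) FASTQ file; assumes 4 lines per record.
# R analogue of: echo $(zcat yourfile.fastq.gz | wc -l)/4 | bc
count_fastq_reads <- function(path) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  length(readLines(con)) / 4
}

# RPKM = 1e9 * reads mapped to gene / (total mapped reads * gene length in bp)
rpkm <- function(counts, gene_length_bp, total_mapped = sum(counts)) {
  1e9 * counts / (total_mapped * gene_length_bp)
}

# Toy FASTQ file with two reads (made-up data)
fq <- tempfile(fileext = ".fastq.gz")
con <- gzfile(fq, "w")
writeLines(c("@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"), con)
close(con)
n_reads <- count_fastq_reads(fq)

# Toy counts and gene lengths for three genes (made-up data)
counts  <- c(geneA = 500, geneB = 1500, geneC = 8000)
lengths <- c(geneA = 1000, geneB = 3000, geneC = 2000)
rpkm_values <- rpkm(counts, lengths)
print(n_reads)
print(rpkm_values)
```

Note that geneA and geneB get the same RPKM even though geneB has three times the reads, because it is also three times as long; this length correction is the point of the measure. For large files, reading all lines into memory is wasteful and the shell one-liner is preferable.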


=== curatedTCGAData ===
== Books ==
* [http://www.amazon.com/RNA-seq-Data-Analysis-Mathematical-Computational/dp/1466595000 RNA-seq Data Analysis: A Practical Approach]. The pdf version is available on slideshare.net.
* [http://www.amazon.com/Statistical-Generation-Sequencing-Frontiers-Probability/dp/3319072110/ref=pd_bxgy_b_img_y Statistical Analysis of Next Generation Sequencing Data]
* [https://www.huber.embl.de/msmb/ Modern Statistics for Modern Biology] (free, see [[Statistics#Books_2|Statistics books]]) by Holmes and Huber. Plots are based on ggplot2.
* [https://github.com/harvardinformatics/learning-bioinformatics-at-home Learning Bioinformatics At Home] - some resources gathered by the Harvard Informatics group.
** Installing all required packages using [https://www.huber.embl.de/msmb/ the R script] gives me errors. It would be great if there were a Docker image. Another solution is to run install.packages(pkgsToInstall, Ncpus=4) manually.
* [https://compgenomr.github.io/book/?s=09 Computational Genomics with R] by Altuna Akalin


=== [https://bioconductor.org/packages/release/bioc/html/caOmicsV.html caOmicsV] ===
== strand-specific vs non-strand specific experiment ==
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA was used
* http://seqanswers.com/forums/showthread.php?t=28025. According to this message and the [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126545 article] (under the paragraph of '''Read counting''') from PLOS, ''most of the RNA-seq protocols that are used nowadays are not strand-specific''.
* http://biology.stackexchange.com/questions/1958/difference-between-strand-specific-and-not-strand-specific-rna-seq-data
* https://www.biostars.org/p/61625/ how to find if this public rnaseq data are prepared by strand-specific assay?
* https://www.biostars.org/p/62747/ Discussion of using IGV to view strand-specific coverage. See also the similar posts on the right hand side.
* https://www.biostars.org/p/44319/ How To Find Stranded Rna-Seq Experiments Data. The text ''dUTP 2nd strand marking'' includes a link to stranded rna-seq data.
* forward (+) / reverse (-) strand in GAlignments objects ([http://www.bioconductor.org/packages/release/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf p68 of the pdf manual] and [https://samtools.github.io/hts-specs/SAMv1.pdf page 7 of sam format specification]).
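As a small illustration of the forward/reverse strand information mentioned in the last item, the alignment strand can be read from the FLAG field of a SAM record. The sketch below uses base R only; the function name <code>sam_strand</code> is hypothetical. It relies on two FLAG bits from the SAM specification: 0x4 (read unmapped) and 0x10 (read aligned to the reverse strand).

```r
# Alignment strand from SAM FLAG values:
# bit 0x4  set -> read is unmapped (strand reported as "*")
# bit 0x10 set -> read aligned to the reverse (-) strand
sam_strand <- function(flag) {
  flag <- as.integer(flag)
  unmapped <- bitwAnd(flag, 4L) != 0
  reverse  <- bitwAnd(flag, 16L) != 0
  ifelse(unmapped, "*", ifelse(reverse, "-", "+"))
}

# Example FLAG values: 0 and 99 are forward, 16 and 147 are reverse, 4 is unmapped
strands <- sam_strand(c(0, 16, 99, 147, 4))
print(strands)
```

This gives only the strand the read was aligned to. Whether that matches the strand of the originating transcript depends on the library protocol (and, for paired-end data, on which mate is examined), which is why a tool such as infer_experiment.py is useful for checking a data set.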


Visualize multi-dimensional cancer genomics data, including patient information, gene expression, DNA methylation, DNA copy number variations, and SNPs/mutations, in a matrix layout or network layout.
Understanding this information is necessary when we want to use the summarizeOverlaps() function (GenomicAlignments) or the htseq-count Python program to get count data.


=== [https://cran.r-project.org/web/packages/Map2NCBI/index.html Map2NCBI] ===
[https://www.biostars.org/p/98756/ This post] suggests using the [http://rseqc.sourceforge.net/ infer_experiment.py script] to check whether an RNA-seq run is stranded or not.
The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).


<syntaxhighlight lang='rsplus'>
The RNA-seq experiment used in [http://www.bioconductor.org/help/workflows/rnaseqGene/ this tutorial] is not strand-specific.
> library(Map2NCBI)
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
  # choose [2], [n], and [1] to filter the build and feature information.
  # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
> str(GeneList)
'data.frame': 52157 obs. of  15 variables:
$ tax_id      : chr  "9606" "9606" "9606" "9606" ...
$ chromosome  : chr  "1" "1" "1" "1" ...
$ chr_start    : num  11874 14362 17369 30366 34611 ...
$ chr_stop    : num  14409 29370 17436 30503 36081 ...
$ chr_orient  : chr  "+" "-" "-" "+" ...
$ contig      : chr  "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
$ ctg_start    : num  1874 4362 7369 20366 24611 ...
$ ctg_stop    : num  4409 19370 7436 20503 26081 ...
$ ctg_orient  : chr  "+" "-" "-" "+" ...
$ feature_name : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
$ feature_id  : chr  "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
$ feature_type : chr  "GENE" "GENE" "GENE" "GENE" ...
$ group_label  : chr  "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
$ transcript  : chr  "Assembly" "Assembly" "Assembly" "Assembly" ...
$ evidence_code: chr  "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]
</syntaxhighlight>


=== TCseq: Time course sequencing data analysis ===
== FASTQ ==
http://bioconductor.org/packages/devel/bioc/html/TCseq.html
* [http://en.wikipedia.org/wiki/FASTQ_format FASTQ=FASTA + Qual]. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
=== Phred quality score ===
q = -10 log<sub>10</sub>(p), where p is the <span style="color: red">error</span> probability for the base.
{| class="wikitable"
! q
! <span style="color: red">error</span> probability
! base call accuracy
|-
| 10
| 0.1
| 90%
|-
| 13
| 0.05
| 95%
|-
| 20
| 0.01
| 99%
|-
| 30
| 0.001
| 99.9%
|-
| 40
| 0.0001
| 99.99%
|-
| 50
| 0.00001
| 99.999%
|}
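
The table can be reproduced from the formula. The following is a minimal sketch in base R. The helper names phred_to_prob, prob_to_phred and ascii_to_phred are made up for illustration, and the default offset of 33 assumes the Phred+33 (Sanger / Illumina 1.8+) encoding of the FASTQ quality line; older files may use a different offset.

<syntaxhighlight lang="rsplus">
# Convert between Phred quality scores and error probabilities.
phred_to_prob <- function(q) 10^(-q / 10)
prob_to_phred <- function(p) -10 * log10(p)

# Decode a FASTQ quality string into Phred scores.
# offset = 33 is an assumption (Phred+33 encoding).
ascii_to_phred <- function(qual, offset = 33) {
  utf8ToInt(qual) - offset
}

phred_to_prob(c(10, 20, 30))   # 0.1, 0.01, 0.001
prob_to_phred(0.05)            # about 13.01
ascii_to_phred("I5+")          # 40 20 10
</syntaxhighlight>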


== GEO ==
== FASTA ==
See the internal link at [[R#GEO_.28Gene_Expression_Omnibus.29|R-GEO]].
FASTA (.fasta/.fa) files can be used as a reference genome in IGV, but they cannot be loaded as data files for viewing.


[https://www.biorxiv.org/content/early/2018/05/19/326223 GREIN: An interactive web platform for re-analyzing GEO RNA-seq data]
=== Download sequence files ===
* ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/ Assembled genome sequence and annotation data for RefSeq genome assemblies
* ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/


= Journals =
=== Compute the sequence length of a FASTA file ===
== Biometrical Journal ==
https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file
* https://onlinelibrary.wiley.com/journal/15214036
<syntaxhighlight lang='bash'>
* [https://onlinelibrary.wiley.com/page/journal/15214036/homepage/forauthors.html Author's Guideline]
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
* [https://onlinelibrary.wiley.com/results/global-subject-codes/st30?target=topic-title-results&startPage=&PubType=journal Biostatistics (topic) journals] from Wiley


== [https://academic.oup.com/biostatistics/issue Biostatistics] ==
head -2 file.fa | \
    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}'  | \
    tail -1
</syntaxhighlight>


== [https://academic.oup.com/bioinformatics Bioinformatics] ==
== FASTA <=> FASTQ conversion ==
[https://academic.oup.com/bioinformatics/search-results?f_TocHeadingTitle=GENOME%20ANALYSIS Genome Analysis] section
According to [https://www.quora.com/Bioinformatics-What-is-the-difference-between-fasta-fastq-and-sam-files this post],


== [https://bmcbioinformatics.biomedcentral.com/ BMC Bioinformatics] ==
* FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
* FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)


== [https://www.biorxiv.org/ BioRxiv] ==
=== Convert FASTA to FASTQ without quality scores ===


== PLOS ==
[https://www.biostars.org/p/99886/ Biostars]. For example, the [https://github.com/lh3/bioawk bioawk] by lh3 (Heng Li) worked.


= Software =
=== Convert FASTA to FASTQ with quality score file ===  
== BRB-SeqTools ==
See the links on the above post.
https://brb.nci.nih.gov/seqtools/


== [http://mev.tm4.org/#/welcome WebMeV] ==
=== Convert FASTQ to FASTA using Seqtk ===
* [http://cancerres.aacrjournals.org/content/77/21/e11 WebMeV: A Cloud Platform for Analyzing and Visualizing Cancer Genomic Data]
Use the [https://github.com/lh3/seqtk Seqtk] program; see [https://www.biostars.org/p/85929/ this post].


== GeneSpring ==
The '''Seqtk''' program by lh3 can be used to sample reads from a fastq file including paired-end; see [https://www.biostars.org/p/69348/ this post].
RNA-Seq


== CCBR Exome Pipeliner ==
== RPKM (Mortazavi et al. 2008) and cpm (counts per million) ==
https://ccbr.github.io/Pipeliner/
Reads per Kilobase of Exon per Million of Mapped reads.


== [https://github.com/PMBio/MOFA MOFA]: Multi-Omics Factor Analysis ==
* RPKMs can only be calculated for those genes for which the '''gene length''' and '''GC content''' information is available; see the vignette of [https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf#page=17  GSVA]
* rpkm function in [https://support.bioconductor.org/p/59317/ edgeR] package.
* RPKM function in [https://support.bioconductor.org/p/50413/ easyRNASeq] package.
* TMM > cpm > log2 transformation in the paper [https://bmccancer.biomedcentral.com/track/pdf/10.1186/s12885-018-4546-8#page=3 Gene expression profiling of 1200 pancreatic ductal adenocarcinoma reveals novel subtypes]
* [https://en.wikipedia.org/wiki/RNA-Seq#Gene_expression_quantification Gene expression quantification] from RNA-Seq wikipedia page
** Sequencing depth/coverage: the total number of reads generated in a single experiment is typically normalized by converting counts to fragments, reads, or counts per million mapped reads ('''FPM, RPM, or CPM''').
** Gene length: '''FPKM, TPM'''. Longer genes will have more fragments/reads/counts than shorter genes if transcript expression is the same. This is adjusted by dividing the FPM by the length of a gene, resulting in the metric fragments per kilobase of transcript per million mapped reads (FPKM). When looking at groups of genes across samples, FPKM is converted to transcripts per million (TPM) by dividing each FPKM by the sum of FPKMs within a sample.
* [https://bioinformatics.stackexchange.com/a/2299 Difference between CPM and TPM and which one for downstream analysis?]. '''CPM''' is basically depth-normalized counts whereas '''TPM''' is length normalized (and then normalized by the length-normalized values of the other genes).
* [http://robpatro.com/blog/?p=235 The RNA-seq abundance zoo]. The counts per million ('''CPM''') metric takes the raw (or estimated) counts, and performs the first type of normalization I mention in the previous section.  That is, it normalized the count by the library size, and then multiplies it by a million (to avoid scary small numbers).
* See also the [[ScRNA#Normalization|log(CPM)]] implemented in Seurat::NormalizeData() for scRNA-seq data.


== Benchmarking ==
Idea
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1738-8 Essential guidelines for computational method benchmarking]
* The more we sequence, the more reads we expect from each gene. '''This is the most relevant correction of this method.'''
* Longer transcripts are expected to generate more reads. '''The latter is only relevant for comparisons among different genes which we rarely perform!''' As such, DESeq2 only creates a size factor for each library and normalizes the counts by dividing them by that library's size factor (a scalar). Note that H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2, where c is the gene length.


= Simulation =
Calculation
* [https://www.biostars.org/p/128762/ NGS reads simulation]
# Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
# Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
# Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.


== Simulate RNA-Seq ==
Formula
* http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#RNA-Seq_simulators
<pre>
* https://popmodels.cancercontrol.cancer.gov/gsr/packages/
RPKM = (10^9 * C)/(N * L), with


=== [http://maq.sourceforge.net Maq] ===
C = Number of reads mapped to a gene
Used by [https://academic.oup.com/bioinformatics/article/25/9/1105/203994/TopHat-discovering-splice-junctions-with-RNA-Seq TopHat: discovering splice junctions with RNA-Seq]
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
</pre>
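
The three calculation steps and the closed formula above give the same numbers. The following is a minimal sketch in base R on made-up counts; the object names (counts, gene_length, lib_size) are illustrative only, and the library size N is assumed to be the column sum of the count matrix.

<syntaxhighlight lang="rsplus">
# Toy input: 3 genes x 2 samples of raw read counts (made-up numbers).
counts <- matrix(c(6, 5, 3,
                   2, 13, 3),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 2000, g2 = 500, g3 = 1500)  # L, in base pairs
lib_size <- colSums(counts)                       # N, total reads per sample

# Steps 1-2: divide each column by its "per million" factor to get RPM.
rpm <- t(t(counts) / (lib_size / 1e6))

# Step 3: divide each row by the gene length in kilobases to get RPKM.
rpkm <- rpm / (gene_length / 1000)

# Closed formula: RPKM = 10^9 * C / (N * L).
rpkm_formula <- 1e9 * counts / outer(gene_length, lib_size)
all.equal(rpkm, rpkm_formula)   # TRUE
</syntaxhighlight>

The order of the two divisions does not matter for RPKM, because both are simple scalings of the same count.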


=== BEERS/Grant G.R. 2011 ===
<syntaxhighlight lang="rsplus">
http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called [http://cbil.upenn.edu/BEERS/ BEERS] and it was used in the [https://academic.oup.com/bioinformatics/article/29/1/15/272537/STAR-ultrafast-universal-RNA-seq-aligner STAR] software paper.
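if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")   # biocLite() has been retired in current Bioconductor releases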
library(edgeR)


For the command line options of <'''reads_simulator.pl'''> and more details about the config files that are needed/prepared by BEERS, see [https://gist.github.com/arraytools/dd62bcca60cc36a1d1769d1a4a7d226b this gist].
set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    5    0
# [2,]    6    2    7    3
# [3,]    5   13    7    2
# [4,]    3    3    9   11
# [5,]    1    2    1   15


This can generate paired end data but they are in one FASTA file.
d <- DGEList(counts=y, lib.size=1001:1004)
 
# Note that lib.size is optional
<syntaxhighlight lang='bash'>
# By default, lib.size = colSums(counts)
$ sudo apt-get install cpanminus
cpm(d) # counts per million
$ sudo cpanm Math::Random
  Sample1  Sample2  Sample3  Sample4
$ wget http://cbil.upenn.edu/BEERS/beers.tar
1    0.000    0.000 4985.045    0.000
$
2 5994.006  1996.008 6979.063  2988.048
$ tar -xvf beers.tar      # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
3 4995.005 12974.052 6979.063 1992.032
$
4 2997.003  2994.012 8973.081 10956.175
$ cd ~/Downloads/
5  999.001  1996.008  997.009 14940.239
$ mkdir beers_output  
> cpm(d,log=TRUE)
$ mkdir beers_simulator_refseq && cd "$_"
    Sample1  Sample2  Sample3  Sample4
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
1  7.961463  7.961463 12.35309  7.961463
$ tar xzvf simulator_config_refseq.tar.gz
2 12.607393 11.132027 12.81875 11.659911
$ ls -lth
3 12.355838 13.690089 12.81875 11.129470
total 1.4G
4 11.663897 11.662567 13.17022 13.451207
-rw-r--r-- 1 brb brb 44M Sep 16 2010 simulator_config_featurequantifications_refseq
5 10.285119 11.132027 10.28282 13.890078
-rw-r--r-- 1 brb brb 7.7M Sep 15 2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15  2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15  2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
  -configstem refseq \
  -customcfgdir ~/Downloads/beers_simulator_refseq \
  -outdir ~/Downloads/beers_output


$ ls -lh beers_output
d$genes$Length <- c(1000,2000,500,1500,3000)
total 3.9M
rpkm(d)
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
    Sample1  Sample2    Sample3  Sample4
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
1   0.0000    0.000  4985.0449    0.000
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
2 2997.0030  998.004  3489.5314 1494.024
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
3 9990.0100 25948.104 13958.1256 3984.064
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
4 1998.0020  1996.008  5982.0538 7304.117
-rw-r--r-- 1 brb brb 31K Mar 16 15:25 simulated_reads_testbeers.cig
5 333.0003  665.336  332.3363 4980.080
-rw-r--r-- 1 brb brb  22K Mar 16 15:25 simulated_reads_testbeers.fa
 
-rw-r--r-- 1 brb brb  584 Mar 16 15:25 simulated_reads_testbeers.log
> cpm
function (x, ...)
UseMethod("cpm")
<environment: namespace:edgeR>
> showMethods("cpm")


$ wc -l simulated_reads2genes_testbeers.txt
Function "cpm":
102 simulated_reads2genes_testbeers.txt
<not an S4 generic function>
$ head -4 simulated_reads2genes_testbeers.txt
> cpm.default
seq.1 GENE.5600
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
seq.2 GENE.35506
    ...)
seq.3 GENE.506
{
seq.4 GENE.34922
    x <- as.matrix(x)
$ tail -4 simulated_reads2genes_testbeers.txt
    if (is.null(lib.size))
seq.97 GENE.4197
        lib.size <- colSums(x)
seq.98 GENE.8763
    if (log) {
seq.99 GENE.19573
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
seq.100 GENE.18830
        lib.size <- lib.size + 2 * prior.count.scaled
$ wc -l simulated_reads_indels_testbeers.txt
    }
36131 simulated_reads_indels_testbeers.txt
    lib.size <- 1e-06 * lib.size
$ head -2 simulated_reads_indels_testbeers.txt
    if (log)
chr1:6052304-6052531 25 1 G
        log2(t((t(x) + prior.count.scaled)/lib.size))
chr2:73899436-73899622 141 3 ATA
    else t(t(x)/lib.size)
$ tail -2 simulated_reads_indels_testbeers.txt
}
chr4:68619532-68621804 1298 -2 AA
<environment: namespace:edgeR>
chr21:32554738-32554962 174 1 T
> rpkm.default
$ wc -l simulated_reads_substitutions_testbeers.txt
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
71678  simulated_reads_substitutions_testbeers.txt
    ...)
$ head -2 simulated_reads_substitutions_testbeers.txt
{
chr22:50902963-50903167 50903077 G->A
    y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
chr1:6052304-6052531 6052330 G->C
    gene.length.kb <- gene.length/1000
$ wc -l simulated_reads_junctions-crossed_testbeers.txt
    if (log)
49  simulated_reads_junctions-crossed_testbeers.txt
        y - log2(gene.length.kb)
$ head -2 simulated_reads_junctions-crossed_testbeers.txt
    else y/gene.length.kb
seq.1a chrX:49084601-49084713
}
seq.1b chrX:49084909-49086682
<environment: namespace:edgeR>
</syntaxhighlight>
 
For example, the RPKM value of the 2nd gene in the 1st sample is calculated as
<syntaxhighlight lang="rsplus">
# step 1: cpm, computed column-wise
6/(1.0e-6 * 1001)       # 5994.006
# step 2: rpkm, computed row-wise
5994.006/(2000/1.0e3)   # 2997.003


$ cat beers_output/simulated_reads_testbeers.log
# Another way
Simulator run: 'testbeers'
# step 1 (RPK)
started: Thu Mar 16 15:25:39 EDT 2017
6/(2000/1.0e3)          # 3
num reads: 100
# step 2 (RPKM)
readlength: 100
3/(1.0e-6 * 1001)       # 2997.003
substitution frequency: 0.001
</syntaxhighlight>
indel frequency: 0.0005
 
base error: 0.005
Another example. [https://github.com/oxwang/fda_scRNA-seq/blob/master/3_Normalization/Code/HCC1395/10X_LLU.R#L52 source code] of calc_cpm().
low quality tail length: 10
<pre>
percent of tails that are low quality: 0
library(edgeR)
quality of low qulaity tails: 0.8
set.seed(1234)
percent of alt splice forms: 0.2
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
number of alt splice forms per gene: 2
cpm(y)
stem: refseq
#          [,1]  [,2]      [,3]      [,4]
sum of gene counts: 3,886,863,063
#[1,]      0.00      0 172413.79      0.00
sum of intron counts = 1,304,815,198
#[2,] 400000.00 100000 241379.31  96774.19
sum of intron counts = 2,365,472,596
#[3,] 333333.33 650000 241379.31  64516.13
intron frequency: 0.355507598105262
#[4,] 200000.00 150000 310344.83 354838.71
padded intron frequency: 0.52453796437909
#[5,]  66666.67 100000  34482.76 483870.97
finished at Thu Mar 16 15:25:58 EDT 2017
 
calc_cpm <- function (expr_mat) {
    # This version fixes a bug in the original code;
    # silhouette() results are not affected.
    norm_factor <- colSums(expr_mat)
    return(t(t(expr_mat)/norm_factor) * 10^6)
}


$ wc -l simulated_reads_testbeers.fa
calc_cpm(y)
400 simulated_reads_testbeers.fa
#          [,1]  [,2]      [,3]      [,4]
$ head simulated_reads_testbeers.fa
#[1,]      0.00      0 172413.79      0.00
>seq.1a
#[2,] 400000.00 100000 241379.31  96774.19
CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
#[3,] 333333.33 650000 241379.31  64516.13
>seq.1b
#[4,] 200000.00 150000 310344.83 354838.71
GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
#[5,]  66666.67 100000  34482.76 483870.97
>seq.2a
</pre>
GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
>seq.2b
ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
>seq.3a
CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC


# Take a look at the true coordinates
=== Criticisms ===
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
* [http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf RPKM/FPKM is not suitable for statistical testing] (p11):
chrX 49084529 49084601 +
chrX 49084713 49084739 +
chrX 49084863 49084909 +
chrX 49086682 49086734 +
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
seq.1a chrX 49084529 73M111N27M 49084529-49084601, 49084713-49084739 + CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
seq.1b chrX 49084863 47M1772N53M 49084863-49084909, 49086682-49086734 - GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
seq.2a chr1 183516256 100M 183516256-183516355 - GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
seq.2b chr1 183515275 100M 183515275-183515374 + ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ wc -l simulated_reads_testbeers.bed
247 simulated_reads_testbeers.bed
$ wc -l simulated_reads_testbeers.cig
200 simulated_reads_testbeers.cig
</syntaxhighlight>


=== [http://sammeth.net/confluence/display/SIM/Home Flux] Sammeth 2010 ===
''Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!''


=== [http://www.ebi.ac.uk/goldman-srv/simNGS/ SimNGS] ===
* [http://blog.nextgenetics.net/?e=51 RPKM measure is inconsistent among samples]


=== [http://cran.r-project.org/web/packages/SimSeq/index.html SimSeq]  ===
=== CPM vs TPM ===
[http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124.abstract Bioinformatics]
Both have the property that the values within a sample sum to 1 million (10^6). TPM additionally normalizes for gene length: it accounts for gene length first and sequencing depth second. So to find DE genes between samples, it is common to use the TPM normalization method.
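
The difference in the order of operations can be sketched in base R. The counts and gene lengths below are made up, and the object names are illustrative only. The last lines check the relationship mentioned elsewhere on this page, namely that TPM can also be obtained by dividing each RPKM by the per-sample sum of RPKMs and multiplying by one million.

<syntaxhighlight lang="rsplus">
# Toy input: 3 genes x 2 samples of raw read counts (made-up numbers).
counts <- matrix(c(10, 20, 30,
                   5, 40, 5),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)  # base pairs

# CPM: sequencing depth normalization only.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# TPM: gene length first (reads per kilobase), sequencing depth second.
rpk <- counts / (gene_length / 1000)
tpm <- t(t(rpk) / colSums(rpk)) * 1e6

colSums(cpm)   # every sample sums to 1e6
colSums(tpm)   # every sample sums to 1e6

# TPM from RPKM: rescale RPKM so that each sample sums to 1e6.
rpkm <- cpm / (gene_length / 1000)
tpm_from_rpkm <- t(t(rpkm) / colSums(rpkm)) * 1e6
all.equal(tpm, tpm_from_rpkm)   # TRUE
</syntaxhighlight>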


A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.
=== (another criticism) Union Exon Based Approach ===
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910


=== [http://cran.r-project.org/web/packages/empiricalFDR.DESeq2/index.html empiricalFDR.DESeq2] ===
In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.
http://biorxiv.org/content/early/2014/12/05/012211


The key function is '''simulateCounts''', which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.  
It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.


Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.
== FPKM (Trapnell et al. 2010) ==


=== [http://www.bioconductor.org/packages/release/bioc/html/polyester.html polyester] ===
* Fragments per Kilobase of exon per Million of Mapped fragments (Cufflinks).
http://biorxiv.org/content/early/2014/12/05/012211
* FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
* [https://support.bioconductor.org/p/9154302/ Differential expression analysis with only FPKM matrix available from total newbie in R]


Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.  
== [http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ RPKM, FPKM, TPM and DESeq] ==
* RPKM can be calculated using the '''edgeR''' package if we have the raw count data (e.g. from Rsubread::'''featureCounts'''()). [http://combine-australia.github.io/RNAseq-R/07-rnaseq-day2.html Rsubread is the only R package that can run in R].
* The YouTube video is [https://www.youtube.com/watch?t=119&v=TTUrtCY2k-w here]. TPM is better than RPKM/FPKM because the totals in each experiment are the same.
* [https://reneshbedre.github.io/blog/expression_units.html Relationship (formula) between RPKM and TPM]
* [https://arxiv.org/pdf/1804.06050.pdf#page=5 Differences]
** The main difference between RPKM and FPKM is that the former is a unit based on single-end reads, while the latter is based on paired-end reads and counts the two reads from the same RNA fragment as one instead of two.
** The difference between RPKM/FPKM and TPM is that the former calculates sample-scaling factors before dividing read counts by gene lengths, while the latter divides read counts by gene lengths first and calculates sample-scaling factors based on the length-normalized read counts.
** If researchers would like to interpret gene expression levels as the proportions of RNA molecules from different genes in a sample,
* [https://zhuanlan.zhihu.com/p/55988984 Why are FPKM and RPKM both said to be wrong?] (in Chinese)
* [http://diytranscriptomics.com/Reading/files/wagnerTPM.pdf Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples] (TPM method) by Wagner 2012.
* Between samples normalization.
** [https://groups.google.com/forum/#!topic/sailfish-users/jBf9SGiH1AM How to Normalize Salmon TPM output?]
** [https://www.biostars.org/p/287296/ How do I normalize for my RNA-seq data across different samples in different conditions]. Using DESeq2 ([https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-014-0550-8 paper], [https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/estimateSizeFactors estimateSizeFactors()]). The NB model and size factors were first used in the [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218662/pdf/gb-2010-11-10-r106.pdf DESeq] paper by Anders 2010.
: <syntaxhighlight lang='rsplus'>
> set.seed(1)
> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14      1      1      4
gene2      5      17      13      14
gene3      0      12      8      6
gene4    152      62    149    110
gene5      23      36      33      94
gene6      0      1      1      4
> dds <- estimateSizeFactors(dds)
> sizeFactors(dds)
sample1  sample2  sample3  sample4
1.068930 1.014687 1.010392 1.033559
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14      1      1      4
gene2      5      17      13      14
gene3      0      12      8      6
gene4    152      62    149    110
gene5      23      36      33      94
gene6      0      1      1      4
> head(counts(dds, normalized=TRUE))
        sample1    sample2    sample3    sample4
gene1  13.097206  0.9855256  0.9897147  3.870122
gene2  4.677574 16.7539354  12.8662916  13.545427
gene3  0.000000 11.8263073  7.9177179  5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6  0.000000  0.9855256  0.9897147  3.870122


'''Input''': reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.
# normalized counts is calculated as the following
R> head(scale(counts(dds, normalized=F), F, sizeFactors(dds)))
        sample1    sample2    sample3    sample4
gene1  13.097206  0.9855256  0.9897147  3.870122
gene2  4.677574 16.7539354  12.8662916  13.545427
gene3  0.000000 11.8263073  7.9177179  5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6  0.000000  0.9855256  0.9897147  3.870122


'''Output''': FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.
# The situation of DESeqDataSet object created using 'tximport()' is different. See the next item.
</syntaxhighlight>
* [https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/estimateSizeFactors ?estimateSizeFactors()]. If tximport is used, the information in <nowiki>assays(dds)[["avgTxLength"]] </nowiki> is automatically used to create appropriate normalization factors. In this case, sizeFactors(dds) will return NULL. See also [[R#Debug_an_S4_function|Debug an S4 function]] for the source code.
* TPM (generated by [https://deweylab.github.io/RSEM/ RSEM] or [https://combine-lab.github.io/salmon/ Salmon] or [https://pachterlab.github.io/kallisto/starting Kallisto]) has been suggested as a better unit than RPKM/FPKM. But it cannot be used to do comparison between samples.
** [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163565/ RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome] Li 2011, 8k citations from google scholar.
** [http://www.arrayserver.com/wiki/index.php?title=TPM_and_FPKM TPM and FPKM] from array suite wiki
** [https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ What the FPKM? A review of RNA-Seq expression units]. R code for computing effective counts, TPM, and FPKM.
** [https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/ In RNA-Seq, 2 != 2: Between-sample normalization]. Among the most popular and well-accepted BSN (between-sample normalization) methods are TMM and DESeq normalization.
: <syntaxhighlight lang='bash'>
P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)
</syntaxhighlight>
* [https://www.biostars.org/p/329625/ How to convert transcript level TPM to gene level TPM ?]
* [https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/salmon_kalliso_STAR_compare.md R scripts to convert HTseq counts to TPM]
* [https://www.biostars.org/p/171766/ Calculating TPM from featureCounts output], [https://gist.github.com/slowkow/c6ab0348747f86e2748b A simpler version]. It seems the difference between CPM and TPM is that TPM takes gene length into account.
* [https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html Comparison of common normalization methods] from a workshop from Harvard Chan Bioinformatics Core.
* [https://rnajournal.cshlp.org/content/early/2020/04/13/rna.074922.120 Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols] Zhao 2020
** It can be reasonable to assume that the partitioning of total RNA among the different compartments ('''ribosomal RNA, pre-mRNA, mitochondrial RNA, genomic pre-mRNA and poly(A)+ RNA''') of the transcriptome is '''comparable''' across samples in a given RNA-seq project.
* [https://reneshbedre.github.io/blog/expression_units.html#tmm-trimmed-mean-of-m-values Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq]
* [https://pubmed.ncbi.nlm.nih.gov/34158060/ TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository] Zhao 2021
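
The size factors used in the DESeq2 example in this section follow the median-of-ratios idea from the DESeq paper. The following is a minimal sketch in base R, not DESeq2's own code: size_factors_mor is a hypothetical helper, the counts are made up, and genes with a zero count in any sample are assumed to be excluded from the reference.

<syntaxhighlight lang="rsplus">
# Median-of-ratios size factors (sketch).
size_factors_mor <- function(counts) {
  log_counts <- log(counts)
  # Log geometric mean of each gene across samples (the pseudo-reference).
  log_geo_mean <- rowMeans(log_counts)
  # Genes with any zero count have a log geometric mean of -Inf; drop them.
  keep <- is.finite(log_geo_mean)
  # Per sample: median of the log ratios to the reference, back-transformed.
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

# Toy input: sample s2 is sequenced twice as deeply as s1 (made-up numbers).
counts <- matrix(c(10, 20, 30, 0,
                   20, 40, 60, 5),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))
sf <- size_factors_mor(counts)
sf                                       # s2 is twice s1

# Normalized counts: divide each column by its size factor,
# analogous to scale(counts, FALSE, sizeFactors) in the example above.
norm_counts <- sweep(counts, 2, sf, "/")
</syntaxhighlight>

After normalization the first three genes have equal values in both samples, which is what a pure depth difference should produce.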


Too many dependencies. <strike>Got an error in installation.</strike> It does not seem to consider splice junctions.
== TMM (Robinson and Oshlack, 2010) ==
Trimmed Mean of M-values (edgeR).
<ul>
<li>TMM relies on the assumption that most genes are not differentially expressed. See the [https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25 paper]. DESeq2 does not rely on this assumption. </li>
<li>TMM will not work well for samples where the library size is so small that most of the counts become zero. A library size of 1 million is on the small side, but is probably ok. [https://support.bioconductor.org/p/9149967/ Is there any reason to think that normalization (e.g. TMM) doesn't work well with samples that that have very different raw counts?]
<li>Many RNA-Seq normalization methods perform poorly on samples with '''extreme composition bias'''. For instance, in one sample a large number of reads comes from rRNAs while in another they have been removed more efficiently. Most scaling-based methods, including RPKM and CPM, will underestimate the expression of weakly expressed genes in the presence of extremely abundant mRNAs (less sequencing real estate is available for them). The TMM method tries to correct this bias.
</li>


=== [http://deweylab.github.io/RSEM/ RSEM] ===
<li>[https://support.bioconductor.org/p/9156103/ Does EdgeR trimmed mean of M values (TMM) account for gene length?] '''No'''. ''In general, edgeR does not need to adjust for gene length in DE analyses because gene length cancels out of DE comparisons.''
<li>[https://www.reneshbedre.com/blog/expression_units.html Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq]


== Simulate DNA-Seq ==
<li>Q: Does the TMM method require count data?
* Software list - https://popmodels.cancercontrol.cancer.gov/gsr/packages/
* A: Yes, the TMM method requires RNA-seq count data as input.
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/ A comparison of tools for the simulation of genomic next-generation sequencing data] Merly Escalona 2016
* The TMM method uses these count data to calculate the scaling factors that adjust for differences in library size <strike>and gene length</strike>, as well as the effects of highly expressed genes.
* Before applying the TMM method, it is important to ensure that the count data has been properly preprocessed and filtered to remove low-quality reads, adapter sequences, and other artifacts.
</li>
<li>Q: Can the TMM method be applied to non-integer data?
* A: It is possible to apply TMM to non-integer data, such as normalized expression values or FPKM (fragments per kilobase of transcript per million mapped reads) values, by rounding the values to the nearest integer.
* In practice, the TMM method can be applied to non-integer data by first converting the data to counts, for example, by multiplying the expression values by a scaling factor that represents the average library size, and then rounding the resulting values to the nearest integer. The TMM method can then be applied to the rounded count data as usual.
</li>
<li>[https://youtu.be/Wdt6jdi-NQo StatQuest: edgeR, part 1, Library Normalization]. Good explanation about reference sample selection. </li>
<li>[https://www.tutorialspoint.com/statistics/trimmed_mean.htm Trimmed Mean] </li>
<li>[https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html RNA Sequence Analysis in R: edgeR]
</li>
<li>[https://davetang.org/muse/2011/01/24/normalisation-methods-for-dge-data/ Normalisation methods implemented in edgeR]. TMM, RLE, Upper-quartile. </li>
<li>Using [https://stats.stackexchange.com/a/421903 edgeR] package
<pre>
library(magrittr)
library(edgeR)
set.seed(1)
M <- matrix(rnbinom(10000,mu=5,size=2), ncol=4)


=== wgsim ===
out <- DGEList(M) %>% calcNormFactors() %>% cpm()
https://github.com/lh3/wgsim
</pre> </li>
<li>Using [https://www.bioconductor.org/packages/release/bioc/vignettes/NOISeq/inst/doc/NOISeq.pdf#page=15 NOISeq] package
<pre>
library(NOISeq)
out2 <- tmm(M, long = 1000, lc = 0, k = 0)
out[1:3, 1:3] / out2[1:3, 1:3]
#    Sample1  Sample2  Sample3
# 1 80.81611 81.32609      NaN
# 2 80.81611 81.32609 80.81611
# 3 80.81611      NaN 80.81611
</pre> </li>
<li>
</ul>
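The idea behind TMM can be sketched in a few lines of Python. This is a simplified illustration, not edgeR's implementation: the function name tmm_factor and the toy counts are made up here, and it uses an unweighted trimmed mean of the M values, whereas edgeR's calcNormFactors() applies precision weights and rescales the factors across samples. The trimming fractions (30% on the log ratios M, 5% on the mean log abundances A) are assumed to match the defaults in the Robinson and Oshlack paper.

```python
import math


def tmm_factor(obs, ref, m_trim=0.30, a_trim=0.05):
    """Simplified TMM scaling factor of sample `obs` relative to `ref`.

    Unweighted sketch: edgeR additionally weights each gene by its
    estimated precision, so real factors will differ slightly.
    """
    n_obs, n_ref = sum(obs), sum(ref)
    m_vals, a_vals = [], []
    for o, r in zip(obs, ref):
        if o > 0 and r > 0:  # log ratios are undefined for zero counts
            p_obs, p_ref = o / n_obs, r / n_ref
            m_vals.append(math.log2(p_obs / p_ref))        # log fold change (M)
            a_vals.append(0.5 * math.log2(p_obs * p_ref))  # mean log abundance (A)
    n = len(m_vals)

    def kept(values, trim):
        # indices that survive trimming a fraction `trim` of genes from each tail
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept(m_vals, m_trim) & kept(a_vals, a_trim)
    mean_m = sum(m_vals[i] for i in keep) / len(keep)
    return 2 ** mean_m


# Toy example: 20 genes, identical in both samples except that
# gene 0 is 100-fold higher in `obs` (composition bias).
ref = [10 * (i + 1) for i in range(20)]
obs = [1000] + ref[1:]
factor = tmm_factor(obs, ref)
effective_lib_size = sum(obs) * factor
```

In this toy case the factor is about 0.68, so the effective library size (library size times factor) of the biased sample equals the library size of the reference, and the unchanged genes get the same normalized values in both samples.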


* Used by [https://www.biorxiv.org/content/biorxiv/early/2017/12/20/237107.full.pdf#page=3 Cleaning clinical genomic data: Simple identification and removal of recurrently miscalled variants in single genomes] bioRxiv 2017
== Sample size ==
* [https://gatkforums.broadinstitute.org/gatk/discussion/7859/how-to-simulate-reads-using-a-reference-genome-alt-contig (How to) Simulate reads using a reference genome ALT contig]
[[Power#RNA-seq|Power-> RNA-seq]]
* [http://research.cs.wisc.edu/wham/comparison-using-wgsim/ Comparing WHAM with BWA using wgsim]
* http://biobits.org/samtools_primer.html
 
=== NEAT ===
* https://github.com/zstephens/neat-genreads
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5125660/ Simulating next-generation sequencing datasets from empirical mutation and sequencing models] Zachary Stephens, 2016
* If I set 10 as the coverage rate and read length 101, the generated fq file is about 34GB (3.3GB * 10) for each one of the pairs.
 
=== DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads ===
http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html


== Coverage ==
* [https://genohub.com/recommended-sequencing-coverage-by-application/ Recommended Coverage and Read Depth for NGS Applications] from @Genohub.
* [http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html bedtools]. The bedtools is now hosted on [https://github.com/arq5x/bedtools2 github]
* https://github.com/alyssafrazee/polyester
<pre>
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
</pre>
* Page 18 of this [http://www.chem.agilent.com/Library/eseminars/Public/RNA%20Sequencing%20101.pdf RNA-Seq 101] from Agilent or [http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf Estimating Sequencing Coverage] from Illumina.
<pre>
C = L N / G
</pre>
where L = read length, N = number of reads and G = haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
* coverage() function in IRanges package.
* [https://github.com/fbreitwieser/bamcov bamcov] - Quickly calculate and visualize sequence coverage in alignment files
* [https://www.biostars.org/p/6571/ Coverage and read depth]
* [https://www.biostars.org/p/638/ What Is The Sequencing 'Depth' ?] Coverage = (total number of bases generated) / (size of genome sequenced). So 30x coverage means that, on average, each base has been read by 30 sequences.
* [http://www.nature.com/nrg/journal/v15/n2/full/nrg3642.html Sequencing depth and coverage: key considerations in genomic analyses]
* [http://qualimap.bioinfo.cipf.es/doc_html/index.html Qualimap]
* https://www.biostars.org/p/5165/
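The coverage formula C = L N / G and the reads-per-transcript rule quoted in this section can be written as a small Python sketch. The function names are made up for illustration; the first example reuses the numbers from the text, and the other two are hypothetical inputs.

```python
def coverage(read_length, n_reads, genome_length):
    """Expected average coverage C = L * N / G."""
    return read_length * n_reads / genome_length


def reads_needed(target_coverage, read_length, target_length):
    """Number of reads N needed for a target coverage: N = C * G / L."""
    return target_coverage * target_length / read_length


# One lane of 100 bp single reads, 189 million reads, 3 Gb haploid genome
c = coverage(100, 189e6, 3e9)           # about 6.3
# Reads needed for 30x coverage of the same genome with 100 bp reads
n30 = reads_needed(30, 100, 3e9)        # 9e8 reads
# Reads for ~20x coverage of a hypothetical 2000 bp transcript, 100 bp reads
per_tx = reads_needed(20, 100, 2000)    # 400 reads
```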
<syntaxhighlight lang='bash'>
<syntaxhighlight lang='bash'>
$ head simDNA_100bp_16del.fasta
# Assume the bam file is sorted by chromosome location
>Pt-0-100
# took 40 min on 5.8G bam file. samtools depth has no threads option:(
TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT
# it is not right since it only account for regions that were covered with reads
>Pt-10-110
samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'    # maybe 42
GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC
>Pt-20-120
GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA
>Pt-30-130
TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA
>Pt-40-140
ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT
</syntaxhighlight>


=== Simulate Whole genome ===
# The following is the right way! The result matches with Qualimap program.
* [https://github.com/nh13/dwgsim DWGSIM] mentioned by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3785481/ Variant Callers for Next-Generation Sequencing Data: A Comparison Study]. For its usage, see http://davetang.org/wiki/tiki-index.php?page=DWGSIM
samtools depth -a *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  # maybe 8
# OR
LEN=`samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'`  # 3095693981
SUM=`samtools depth bamfile | awk '{sum+=$3} END {print sum}'`  # 24473867730
echo "$SUM $LEN" | awk '{print $1/$2}'  # total depth / genome length, about 7.9
</syntaxhighlight>
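The difference between the two samtools depth averages can be illustrated with a short Python sketch. It assumes the three-column text output of samtools depth (chromosome, position, depth); the function name and the toy input are made up for illustration. Without the -a option, positions with zero depth are not reported, so dividing by the number of reported lines gives the mean over covered positions only, which overestimates the genome-wide coverage.

```python
def mean_depth(depth_lines, genome_length=None):
    """Mean depth from `samtools depth`-style lines (chrom, pos, depth).

    If genome_length is given, divide by it (zero-depth positions count);
    otherwise divide by the number of reported positions.
    """
    total = 0
    n_lines = 0
    for line in depth_lines:
        if not line.strip():
            continue
        total += int(line.split("\t")[2])
        n_lines += 1
    denom = genome_length if genome_length is not None else n_lines
    return total / denom


# Toy contig of length 10 where only 4 positions are covered
lines = ["chr1\t3\t10", "chr1\t4\t12", "chr1\t5\t12", "chr1\t6\t6"]
covered_only = mean_depth(lines)                    # 40 / 4  = 10.0
genome_wide = mean_depth(lines, genome_length=10)   # 40 / 10 = 4.0
```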
 
== 5 common genomics file formats ==
[https://youtu.be/MrVpn0vpIYU 5 genomics file formats you must know] (video)
* fasta,
* fastq,
* bam,
* vcf,
* [https://m.ensembl.org/info/website/upload/bed.html bed] (genomic intervals regions)
 
== SAM/Sequence Alignment Format and BAM format specification ==
* https://samtools.github.io/hts-specs/SAMv1.pdf and [http://samtools.sourceforge.net/ samtools] webpage.
* http://genome.sph.umich.edu/wiki/SAM


=== Simulate whole exome ===
== Single-end, pair-end, fragment, insert size ==
* https://www.biostars.org/p/66714/ (no final answer)
* [http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html Paired-end read confusion - library, fragment or insert size?]
* [https://academic.oup.com/bioinformatics/article/29/8/1076/225073 Wessim: a whole-exome sequencing simulator based on in silico exome capture] Sangwoo Kim 2013 & [http://sak042.github.io/Wessim/ software]
* https://www.biostars.org/p/95803/


== Variant simulator ==
== Germline vs Somatic mutation ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2611-1 sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs]
* Germline: inherited from parents. See the [https://en.wikipedia.org/wiki/Germline Wikipedia] page.
* Somatic SNVs are mutations that occur in the cells of a tumor. These mutations can be found in '''multiple copies of the same gene''', while germline SNVs are mutations that are found in '''a single copy of the gene''', usually the original copy.
* [https://my.clevelandclinic.org/health/body/23067-somatic--germline-mutations Somatic & Germline Mutations]


== Convert FASTA to FASTQ ==
== Pathogenic mutation ==
It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.
* A [https://www.thinkgenetic.com/reference/genetic-testing/genetic-testing/358 pathogenic mutation] is a change in the genetic sequence that causes a specific genetic disease. To determine if a change found in the gene is something that causes disease, a laboratory looks at many different factors. For example, they look at the type of change found. Some changes, like nonsense mutations or frameshift mutations, almost always result in a major problem with the protein produced, so they are often labeled as pathogenic mutations. Laboratories will also check the scientific literature and databases to see if the particular change has been reported in other individuals with the genetic disease. Lastly, they look to see if the change is in an area of the gene that is conserved across species, meaning that the area where the change is located is the same in lots of animals, thus may be an important area for the function of the protein.
* pathogenic variant. See [https://my.clevelandclinic.org/health/diseases/21751-genetic-disorders Genetic Disorders]
* [https://www.collinsdictionary.com/dictionary/english/pathogenic-mutations Pathogenic] means ''able to cause or produce disease''.
* What are some examples of genetic diseases caused by pathogenic mutations? Cystic fibrosis, Duchenne muscular dystrophy, Familial hypercholesterolemia, Hemochromatosis, Sickle cell disease, Tay-Sachs disease ...


If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.
== Driver vs passenger mutation ==
https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer


<syntaxhighlight lang='bash'>
== Nonsynonymous mutation ==
$ cat test.fasta
It is related to the [http://www.chemguide.co.uk/organicprops/aminoacids/dna4.html genetic code], [https://en.wikipedia.org/wiki/Genetic_code Wikipedia]. There are only 20 amino acids but 64 codons, so several codons encode the same amino acid.
>Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
>Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
>Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
>Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
>Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
</syntaxhighlight>


Alternatively we can use just one line of code by [https://www.reddit.com/r/bioinformatics/comments/32pu00/fasta_to_fastq_converter/ awk]
See
<syntaxhighlight>
* http://evolution.about.com/od/Overview/a/Synonymous-Vs-Nonsynonymous-Mutations.htm
$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
* https://en.wikipedia.org/wiki/Nonsynonymous_substitution
  test.fasta > test.fq
* [http://thegenomefactory.blogspot.com/2013/10/understanding-snps-and-indels-in.html Understanding SNPs and INDELs in microbial genomes]
$ cat test.fq
* An example from https://en.wikipedia.org/wiki/Silent_mutation
@Pt-0-50
** nonsynonymous: ATG to GTG mutation (AUG = Met, GUG = Val)
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
** synonymous: CAT to CAC mutation (CAU = His, CAC = His)
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
</syntaxhighlight>
Change the 'H' to the quality score character that you need (depending on which Phred score scale you are using).
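The same conversion can be sketched in Python. This is an illustrative version, not the Perl script or the awk one-liner: the function names are made up, every base gets the same placeholder quality, and Phred+33 (Sanger) encoding is assumed. Unlike the awk one-liner, it also joins sequences that are wrapped over several lines.

```python
def phred33_char(q):
    """ASCII character for Phred quality q in Phred+33 (Sanger) encoding."""
    return chr(q + 33)


def fasta_to_fastq(fasta_text, quality=40):
    """Convert FASTA text to FASTQ text with a constant quality score."""
    records = []
    name, seq = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(seq)))
            name, seq = line[1:], []
        else:
            seq.append(line)  # sequence lines of one record are concatenated
    if name is not None:
        records.append((name, "".join(seq)))
    q = phred33_char(quality)
    return "".join(f"@{n}\n{s}\n+\n{q * len(s)}\n" for n, s in records)


fasta = ">Pt-0-50\nTGGCGAACGACGGG\nAATACC\n>Pt-10-60\nGACGGAATTG\n"
print(fasta_to_fastq(fasta), end="")
```

With the default quality of 40 the output uses 'I', as in the Perl script output; passing quality=39 gives 'H', as in the awk one-liner.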


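The synonymous and nonsynonymous examples from the Nonsynonymous mutation notes (ATG to GTG, CAT to CAC) can be checked with a short Python sketch that builds the standard genetic code. The function name is made up for illustration, and only single-codon comparisons under the standard code are handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon, alt_codon):
    """Return 'synonymous' if both codons encode the same amino acid."""
    same = CODON_TABLE[ref_codon.upper()] == CODON_TABLE[alt_codon.upper()]
    return "synonymous" if same else "nonsynonymous"


print(classify_substitution("ATG", "GTG"))  # Met -> Val
print(classify_substitution("CAT", "CAC"))  # His -> His
```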
== Simulate genetic data ==
== isma: analysis of mutations detected by multiple pipelines ==
[https://onunicornsandgenes.blog/2019/06/16/simulating-genetic-data-with-r-an-example-with-deleterious-variants-and-a-pun/ ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’]
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2701-0 isma]: an R package for the integrative analysis of mutations detected by multiple pipelines
 
== mutSignatures: analysis of cancer mutational signatures ==
* [https://www.nature.com/articles/s41598-020-75062-0#data-availability MutSignatures]: an R package for extraction and analysis of cancer mutational signatures
* https://github.com/dami82/mutSignatures
 
== Rediscover: identify mutually exclusive mutations ==
[https://academic.oup.com/bioinformatics/article-abstract/38/3/844/6401995 Rediscover: an R package to identify mutually exclusive mutations]


= PDX/Xenograft =
== Tumor mutational burden ==
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-1172 Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts?] by Tso et al, BMC Genomics, 2014.
* https://en.wikipedia.org/wiki/Tumor_mutational_burden
* [http://www.arrayserver.com/wiki/index.php?title=Align_Ion_Torrent_reads Map xenograft reads] by [http://www.omicsoft.com/array-studio/ Array Suite]
** '''Treatment Response''': An analysis of a large cohort of patients receiving ''ICI (immune checkpoint inhibitors)'' therapy revealed that higher TMB levels (≥ 20 mutations/Mb) corresponded to a 58% response rate to ICIs while lower TMB levels (<20 mutations/Mb) reduced response to 20%.  
* [http://www.pdxfinder.org/ PDXFinder] includes PDMR as one of 6 providers. The website source code https://github.com/PDXFinder/pdxfinder.
** '''Cut-offs (High and low TMB status)''': Different studies have assigned different cut-offs to delineate between high and low TMB status.  
** [http://www.pdxfinder.org/data/pdx/IRCC/CRC0120LM#variation A Colorectal Carcinoma example] contains 'genomic data'. Each row represents one seq position. No raw FASTQ files available.
* Tumor Mutation Burden (TMB) is typically calculated '''per sample''', not per gene. It is defined as the number of non-synonymous somatic mutations (single nucleotide variants and small insertions/deletions) per megabase in coding regions.
* NCI [https://pdmr.cancer.gov/models/database.htm Patient-Derived Models Repository (PDMR)].
** The data range of TMB can vary widely depending on the '''type of cancer''' and the individual patient’s tumor. For example, one study showed that TMB can range from 0.03 to 14.13 mutations per megabase in a prostate cancer cohort, while this range is from 0.04-99.68 mutations per megabase in a bladder cancer cohort.
** ftp://dctdftp.nci.nih.gov/pub/pdm/
** '''Different sources''' may categorize TMB levels differently. For instance, one source suggests that a TMB score lower than 10 is considered low, between 10 and 15 is intermediate, and larger than 15 is high. Another source suggests that low TMB is less than or equal to 5 mutations/Mb, intermediate TMB is greater than 5 and less than 20, high TMB is greater than or equal to 20 and less than 50, and very high TMB is greater than or equal to 50.
** The ftp link can be obtained by clicking 'PDMR Models' -> 'PDMR database' -> Click here to access the PDMR Database -> 'Genomic Analysis'.
* [https://bmcimmunol.biomedcentral.com/articles/10.1186/s12865-018-0285-5 Correlate tumor mutation burden with immune signatures in human cancers] Wang 2019
** There will be three tabs: NCI Cancer Genome Panel (4246 records), Whole Exome Sequence (785 records) and RNASeq (807 records).
** Definition of cutoff: higher-TMB samples (the samples with TMB scores of ''upper quartile'') and lower-TMB samples (the samples with TMB scores of ''lower quartile'')
** For Whole Exome Sequence, '''VCF''' was provided. For RNASeq, '''TPM''' files per genes or isoforms are available.  
** '''Higher TMB was associated with better survival prognosis''' in numerous cancer types, while it was associated with worse prognosis in a few cancer types.
** The '''RNASeq Transcriptome Data Analysis Pipeline and Specifications''' and '''Whole Exome Sequencing Data Analysis Pipeline and Specifications''' are available under SOPs.
** Our data implicate that '''higher-TMB patients could gain a more favorable prognosis in diverse cancer types if treated with immunotherapy''', otherwise would have a poorer prognosis compared to lower-TMB patients.
* [https://www.rna-seqblog.com/reproducible-bioinformatics-project/ Reproducible Bioinformatics Project]
** '''Higher nonsynonymous mutation burden in tumors''' tends to produce more neoantigens, which makes tumors more immunogenic and thus results in improved clinical response to '''immunotherapy'''.
* [https://academic.oup.com/bioinformatics/article/28/12/i172/269972 Xenome] a tool for classifying reads from xenograft samples, Thomas Conway et al 2012.
** Address several questions:
** [https://en.wikipedia.org/wiki/K-mer K-mer]
*** Is the '''immune activity''' (expression levels of immune-related genes) of the higher-TMB subtype different from that of the lower-TMB subtype of cancers?  Wilcoxon rank-sum test. Fisher’s exact test & OR.
** The program is bundled in '''[https://github.com/data61/gossamer/blob/master/docs/xenome.md Gossamer]''' (Github)
*** Are there any immune-related genes or gene-sets which are differentially expressed between the lower-TMB subtype and the higher-TMB subtype of cancers and whose expression is associated with clinical outcomes in cancer? Wilcoxon rank-sum test. Fisher’s exact test & OR.
** [https://hpc.nih.gov/apps/gossamer.html Biowulf] It is noted that Gossamer runs in a Singularity container
*** Is the TMB itself associated with clinical outcomes in cancer? Log-rank tests.
** Indexing took 13 hours when I set 16 threads and 24GB memory (25.4GB was used). A set of 23 files with prefix 'idx' will be generated.
* [https://cancerci.biomedcentral.com/articles/10.1186/s12935-020-01472-9 Significance of tumor mutation burden combined with immune infiltrates in the progression and prognosis of ovarian cancer] Bi 2020
: <syntaxhighlight lang='bash'>
* [https://pubmed.ncbi.nlm.nih.gov/33125859/ The Challenges of Tumor Mutational Burden as an Immunotherapy Biomarker] 2020
#!/bin/bash
* [https://aacrjournals.org/cancerdiscovery/article/10/12/1808/2595/Tumor-Mutational-Burden-as-a-Predictive-Biomarker Tumor Mutational Burden as a Predictive Biomarker in Solid Tumors] 2020
module load gossamer
* [https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html maftools] Summarize, Analyze and Visualize MAF Files
xenome index -M 24 -T 16 -P idx \
* [https://www.biomedcentral.com/search?query=TMB&searchType=publisherSearch 2449 result(s) for 'TMB'] from BMC
  -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
* [https://www.biostars.org/p/431067/ How to calculate TMB for somatic data TCGA]
  -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
* [https://www.biostars.org/p/299549/ Question: TMB Tumor Mutation Burden]
</syntaxhighlight>
* [https://pubmed.ncbi.nlm.nih.gov/32239176/ Mining TCGA database for tumor mutation burden and their clinical significance in bladder cancer]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0074432 Next-Generation Sequence Analysis of Cancer Xenograft Models] by Fernando J. Rossello et al 2013.
* [https://jitc.bmj.com/content/8/1/e000147.long Establishing guidelines to harmonize tumor mutational burden (TMB): in silico assessment of variation in TMB quantification across diagnostic platforms] 2020
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4991491/ Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers] Bradford et al 2016
* [https://github.com/bioinfo-pf-curie/TMB TMB] (Python) - This tool was designed to calculate a Tumor Mutational Burden (TMB) score from a VCF file.
* [https://f1000research.com/articles/5-2741/v1 An open-source application for disambiguating two species in next generation sequencing data from grafted samples] by Ahdesmäki MJ et al 2016. [https://github.com/AstraZeneca-NGS/disambiguate Disambiguation]
* [https://acc-bioinfo.github.io/TMBleR/index.html TMBleR] R package, docker, shiny.
* [http://mcr.aacrjournals.org/content/molcanres/15/8/1012.full.pdf Next-Generation Sequencing Analysis and Algorithms for PDX and CDX Models] by Garima Khandelwal et al 2017. [https://github.com/CRUKMI-ComputationalBiology/bamcmp bamcmp] software.
* [https://github.com/jasonwong-lab/TMB TMB prediction App] R, shiny
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4414-y Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts] by Maurizio Callari et al 2018. Both RNA-Seq and DNA-Seq are considered. Software [https://github.com/cclab-brca/ICRG ICRG].
* [https://www.r-bloggers.com/2023/01/an-r-function-to-compute-tumor-mutational-burden-tmb/ An R function to compute Tumor Mutational Burden (TMB)]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2353-5 XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data] Kluin et al 2018. Software in [https://github.com/PeeperLab/XenofilteR github].
* [https://www.sciencedirect.com/science/article/pii/S0923753421044951 Aligning tumor mutational burden (TMB) quantification across diagnostic platforms: phase II of the Friends of Cancer Research TMB Harmonization Project] 2021.  
* [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006596 Whole genomes define concordance of matched primary, xenograft, and organoid models of pancreas cancer] Gendoo et al 2019
** TMB can be determined by '''selected targeted panels''' in most cases, and '''whole-exome sequencing (WES)''' is the gold standard for quantifying TMB. TMB derived from panels was consistently and significantly lower than that derived from a whole exome. See this paper [https://jitc.bmj.com/content/8/1/e000613 Comparison of commonly used solid tumor targeted gene sequencing panels for estimating tumor mutation burden shows analytical and prognostic concordance within the cancer genome atlas cohort] 2020.
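As a rough illustration of the per-sample definition, TMB is a mutation count divided by the size of the sequenced coding region in megabases. The following Python sketch uses made-up function names and a hypothetical sample; the category boundaries are the second set of cut-offs quoted in this section (at most 5, 5 to 20, 20 to 50, at least 50 mutations/Mb), which is only one of several conventions.

```python
def tmb(n_nonsyn_mutations, coding_region_bp):
    """Tumor mutational burden in mutations per megabase."""
    return n_nonsyn_mutations / (coding_region_bp / 1e6)


def tmb_category(score):
    """Categorize a TMB score using one of the quoted cut-off schemes."""
    if score <= 5:
        return "low"
    if score < 20:
        return "intermediate"
    if score < 50:
        return "high"
    return "very high"


# Hypothetical sample: 450 nonsynonymous somatic mutations
# called over 30 Mb of coding sequence
score = tmb(450, 30e6)          # 15 mutations/Mb
label = tmb_category(score)     # "intermediate"
```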


== RNA-Seq ==
== Types of mutations ==
* Platform [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16791 GPL16791] Illumina HiSeq 2500
* [https://www.technologynetworks.com/genomics/articles/missense-nonsense-and-frameshift-mutations-a-genetic-guide-329274 Missense, Nonsense and Frameshift Mutations: A Genetic Guide]
** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1702792
* [https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Principles_of_Biology/02%3A_Chapter_2/14%3A_Mutations/14.05%3A_Types_of_Mutations 4.5: Types of Mutations] by LibreTexts biology.
** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1887215
** While these mutation types are distinct, they can overlap in the sense that a single event (like an insertion or deletion) can result in a frameshift mutation.
* [https://youtu.be/j4qpJ8sVjT0 Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic Data Analysis]. This is posted on [https://www.rna-seqblog.com/mastering-rna-seq-data-analysis-a-critical-approach-to-transcriptomic-data-analysis/ rna-seqblog]. See the link [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4991491/ Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers].


== DNA-Seq ==
== Cytogenetic alterations ==
* [http://www.sciencedirect.com/science/article/pii/S2211124713004634 Endocrine-Therapy-Resistant ESR1 Variants Revealed by Genomic Characterization of Breast-Cancer-Derived Xenografts]
* [https://pubmed.ncbi.nlm.nih.gov/25574665/ Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes] Gerstung 2015
** [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/molecular.cgi?study_id=phs000611.v1.p1&phv=195714&phd=&pha=&pht=3502&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 dbGaP]
* [https://academic.oup.com/bioinformatics/article/37/23/4589/6380545 RCytoGPS: an R package for reading and visualizing cytogenetics data]
** dbGaP accession [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=phs000611 phs000611].
* https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&f=study&term=xenograft&go=Go
* [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=ERP021871 ERP021871]


= DNA Seq Data =
== Alternative and differential splicing ==
== NIH ==
* [http://www.rna-seqblog.com/best-practices-and-appropriate-workflows-to-analyse-alternative-and-differential-splicing/ Best practices and appropriate workflows to analyse alternative and differential splicing]
* Go to [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj SRA/Sequence Read Archive] and type the keywords 'Whole Genome Sequencing human'. An example of the procedure for searching whole genome sequencing data from human samples:
* [https://www.rna-seqblog.com/as-cmc-a-pan-cancer-database-of-alternative-splicing-for-molecular-classification-of-cancer/ AS-CMC – a pan-cancer database of alternative splicing for molecular classification of cancer]
*# Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
*# The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
*# The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
*# I pick the first one Accession:SRP066837 from this page. The page shows the '''Study type''' is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
*# <span style="color: red">(Important trick)</span> Click the number next to '''Run'''. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
*# Download the raw data from any one of them (eg SRR2968056). For whole genome, the '''Strategy''' is ''WGS''. For whole exome, the '''Strategy''' is called ''WXS''.
* Search the keywords 'nonsynonymous' and 'human' in [http://www.ncbi.nlm.nih.gov/pmc/?term=nonsynonymous+human PMC]


=== Use [http://www.ncbi.nlm.nih.gov/books/NBK158900/ SRAToolKit] instead of wget to download ===
== Allele vs Gene ==
Don't use the ''wget'' command, since it requires specifying the exact http address.
http://www.diffen.com/difference/Allele_vs_Gene


[http://www.ncbi.nlm.nih.gov/books/NBK158899/ Downloading SRA data using command line utilities]
* A gene is a stretch of DNA or RNA that determines a certain trait.  
* Genes mutate and can take two or more alternative forms; an '''allele''' is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
* An allele is found at a fixed spot on a chromosome?
* Chromosomes occur in pairs so organisms have two alleles for each gene — one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be the same (homozygous) or different (heterozygous).


[https://github.com/NCBI-Hackathons/SRA2R SRA2R] - a package to import SRA data directly into R.
== Locus ==
https://en.wikipedia.org/wiki/Locus_(genetics)


(Method 1) Use the '''[http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump fastq-dump]''' command. For example, the following command (modified from the [http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using document]) will download the first 5 reads and save them to a file called <SRR390728.fastq> ('''NOT''' sra format) in the current directory.
== [https://en.wikipedia.org/wiki/Haplotype Haplotypes] ==
<syntaxhighlight lang='bash'>
* http://www.brown.edu/Research/Istrail_Lab/proj_cmsh.php
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
* http://www.nature.com/nri/journal/v5/n1/fig_tab/nri1532_F2.html
# OR
* https://www.sciencenews.org/article/seeking-genetic-fate
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar
* http://www.medscape.com/viewarticle/553400_3
</syntaxhighlight>
This will download the files in FASTQ format.


(Method 2) If we need to download by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):
== Base quality, Mapping quality, Variant quality ==
<syntaxhighlight lang='bash'>
* Fastq base quality: https://en.wikipedia.org/wiki/FASTQ_format
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
* Mapping quality: http://genome.cshlp.org/content/18/11/1851.long
</syntaxhighlight>
* Variant quality: http://www.ncbi.nlm.nih.gov/pubmed/21903627
It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.


(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list
== VarSAP ==
<syntaxhighlight lang='bash'>
[https://bioinformatics.georgetown.edu/internships/enrichment-of-variant-information-for-the-variant-standardization-and-annotation-pipeline/ Enrichment of Variant Information for the Variant Standardization and Annotation Pipeline]
brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz
aspera-connect-3.6.2.117442-linux-64.sh
brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh


Installing Aspera Connect
== The Clinical Knowledgebase (CKB) ==
https://ckb.jax.org/gene/show?geneId=7157  (TP53)


Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
== Mapping quality (MAPQ) vs Alignment score (AS) ==
Restart firefox manually to load the Aspera Connect plug-in
http://seqanswers.com/forums/showthread.php?t=66634 & [https://samtools.github.io/hts-specs/SAMv1.pdf SAM format specification]


Install complete.
* MAPQ (5th column): MAPping Quality. It equals '''−10 log10 Pr{mapping position is wrong}''' (defined by SAM documentation), rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. MAPQ is a metric that tells you how confident you can be that the read comes from the reported position. For example, among 1000 read alignments with mapping quality 30, one is expected to be wrong on average (10^(30/-10)=.001). As another example, if MAPQ=70, the probability that the mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Because aligners compute MAPQ differently, users should check the documentation of the alignment program for the meaning of the values it reports.
* AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)


brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
Note:
  -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
# '''MAPQ scores produced by the aligners typically involve the alignment score and other information.'''
  anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
# You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but the reported position is still much more probable than any other.
SRR590795.sra                                                                          100%  239MB  535Mb/s    00:06
# You probably want to filter for MAPQ, but "good" alignment may refer to AS if what you care about is similarity between read and reference.
Completed: 245535K bytes transferred in 7 seconds
# [https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ MAPQ values are really useful but their implementation is a mess] by Simon Andrews
(272848K bits/sec), in 1 file.
brb@T3600 ~/Downloads $
</syntaxhighlight>
''Aspera is typically 10 times faster than FTP'' according to the website. For this case, wget takes 12s while ascp uses 7s.


Note that the URL given on the website is wrong. I got the correct URL by emailing NCBI help. Google: ascp "anonftp@ftp-private.ncbi.nlm.nih.gov"
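The MAPQ definition in the mapping quality notes above (MAPQ = −10 log10 of the probability that the mapping position is wrong) can be checked with a short calculation. This is only a sketch using awk; the two function names are made up here.

<syntaxhighlight lang='bash'>
# Convert a MAPQ value to the probability that the mapping position is wrong.
mapq_to_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability back to MAPQ, rounded to the nearest integer.
error_prob_to_mapq() {
  awk -v p="$1" 'BEGIN { printf "%d\n", int(-10 * log(p) / log(10) + 0.5) }'
}

mapq_to_error_prob 30      # prints 0.001
mapq_to_error_prob 70      # prints 1e-07
error_prob_to_mapq 0.001   # prints 30
</syntaxhighlight>

These numbers describe the definition only; how an aligner arrives at its MAPQ values is tool-specific.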
== gene's isoform ==
* https://en.wikipedia.org/wiki/RNA-Seq#Alternative_splicing
* [https://www.longdom.org/open-access/bioinformatics-tools-for-rnaseq-gene-and-isoform-quantification-2469-9853-1000140.pdf Bioinformatics Tools for RNA-seq Gene and Isoform Quantification] 2016. Figure 1 illustrates how isoforms can affect the computation of RPKM.
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4002-1 Evaluation and comparison of computational tools for RNA-seq isoform quantification] BMC Genomics 2017.
* [https://www.gettinggeneticsdone.com/2012/12/differential-isoform-expression-cuffdiff2.html Differential Isoform Expression With RNA-Seq: Are We Really There Yet?]
* An example where transcript-level estimates cannot be identified while gene-level estimation is still possible: Figure 1C in [https://f1000research.com/articles/4-1521/v2 Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences]
* [https://academic.oup.com/bioinformatics/article/38/15/3844/6617821?login=false ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2] 2022


=== SRAdb package ===
== FFPE Tissue vs Frozen Tissue ==
https://bioconductor.org/packages/release/bioc/html/SRAdb.html
* [https://www.biochain.com/general/what-is-ffpe-tissue/ What Is FFPE Tissue And What Are Its Uses]
* [https://rna-seqblog.com/a-new-model-to-predict-survival-in-colorectal-cancer-from-rna-seq-data/ A new model to predict survival in colorectal cancer from DNA / RNA-Seq data]


First we install the system libraries required by the XML and RCurl packages.
== Wild type vs mutant ==
<syntaxhighlight lang='bash'>
* [https://pediaa.com/difference-between-wild-type-and-mutant/ Difference Between Wild Type and Mutant]
sudo apt-get update
* https://en.wikipedia.org/wiki/Wild_type
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev
</syntaxhighlight>
and then
<syntaxhighlight lang='rsplus'>
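# biocLite() has been retired by Bioconductor; BiocManager is its replacement
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SRAdb")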
</syntaxhighlight>


== SRA ==
== ns ==
Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers
Not significant


=== SRA Explorer ===
== PARP inhibitor ==
* https://ewels.github.io/sra-explorer/
* [https://youtu.be/vwZ07CX7ZKU?t=70 What is a PARP Inhibitor?] Dana-Farber Cancer Institute
* Source code https://github.com/ewels/sra-explorer
* '''PARP''' is an '''enzyme'''/a family of '''proteins''' that help '''repair damaged DNA''' in cells. When DNA is damaged, PARP detects the damage and signals other '''enzymes''' to come and fix it. This helps maintain the stability of the cell’s genetic material and prevent cell death.
* Is PARP good or bad? PARP is neither inherently good nor bad. It is a protein that plays an important role in maintaining the stability of the cell’s genetic material by helping to repair damaged DNA.
** In normal cells, PARP helps prevent cell death and maintain genomic stability.
** In cancer cells, PARP can help the cancer cells survive and continue to grow by repairing their DNA. This is why PARP inhibitors (PARPi) are used in cancer treatment to block the function of PARP and prevent cancer cells from repairing their DNA.


=== SRP056969 ===
* PARP inhibitors are a type of '''targeted cancer therapy''', '''not a traditional chemotherapy'''.
* [https://www.nature.com/articles/s41467-017-00867-z Inference of RNA decay rate from transcriptional profiling highlights the regulatory programs of Alzheimer’s disease]
 
* [http://www.rna-seqblog.com/rna-seq-reveals-mrna-stability-a-marker-in-alzheimers-patients/ RNA-Seq reveals mRNA stability a marker in Alzheimer’s patients]
* '''PARPi''' therapy is a cancer treatment that blocks the PARP enzyme, which helps repair DNA damage in cancer cells
* REMBRANDTS: REMoving Bias from Rna-seq ANalysis of Differential Transcript Stability
** [https://pubmed.ncbi.nlm.nih.gov/33015058/ PARP Inhibitors: Clinical Relevance, Mechanisms of Action and Tumor Resistance]
** [https://www.drugs.com/drug-class/parp-inhibitors.html List of PARP inhibitors]: olaparib, niraparib, rucaparib, and talazoparib.
** [https://en.wikipedia.org/wiki/Olaparib Olaparib] is a medication for the maintenance treatment of BRCA-mutated advanced ovarian cancer in adults. It is a PARP inhibitor, inhibiting poly ADP ribose polymerase (PARP), an enzyme involved in DNA repair. Other drugs used as maintenance therapy (which are not PARP inhibitors) include Letrozole and Avastin.
** '''[https://www.cancer.net/navigating-cancer-care/how-cancer-treated/what-maintenance-therapy Maintenance therapy]''' is called so because it is the ongoing treatment of cancer with medication after the cancer has responded to the first recommended treatment. The main goals of maintenance therapy are
*** To prevent the cancer’s return
*** To delay the growth of advanced cancer after the initial treatment
* '''PARP inhibitors''' are a class of '''drugs''' that inhibit the activity of PARP enzymes. By blocking PARP’s ability to help repair DNA damage, these drugs can make it more difficult for cancer cells to survive DNA damage caused by other treatments, such as chemotherapy or radiation therapy. This can make these treatments more effective against certain types of cancer.
* PARP inhibitors are drugs that block the action of the PARP enzymes, which are involved in DNA repair. There are several PARP inhibitors available, including olaparib (Lynparza), niraparib (Zejula), rucaparib (Rubraca), and talazoparib (Talzenna). These drugs are approved for some types of cancer, such as ovarian and prostate cancer, depending on the presence of certain genetic mutations.
** [https://www.medicalnewstoday.com/articles/parp-inhibitor What is a PARP inhibitor? Uses, how they work, and options].
** PARP inhibitors - [https://www.drugs.com/drug-class/parp-inhibitors.html Drugs.com].


=== SRP066363 - lung cancer ===
== Inhibitor genes and activator genes/enhancer genes ==
* Platform: GPL11154 Illumina HiSeq 2000 (Homo sapiens)
* Inhibitor genes: Inhibitor genes are genes that code for proteins that can regulate or inhibit the activity of other genes or proteins in a cell. These inhibitor proteins can interact with other proteins to prevent them from functioning or alter their activity.
* Overall design: RNAseq and DNA copy number analysis of H1975 cells
** '''TP53''' gene, which codes for the p53 protein, a tumor suppressor protein that plays a critical role in regulating cell division and preventing the formation of cancerous tumors. Mutations in the '''TP53''' gene can result in loss of p53 function and an increased risk of cancer.  
* Strategy: 6 RNA-Seq and 3 Whole exome. Paired. [https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=302630 9 samples]
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066363
* http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74866


=== SRP015769 or SRP062882 - prostate cancer ===
* Activator genes: Activator genes play crucial roles in a variety of biological processes, including embryonic development, immune system responses, and the regulation of gene expression. For example, the NF-kB gene codes for a transcription factor that activates genes involved in immune system responses.
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP015769 5 are from normal and 5 are from tumor. Whole Exome Seq.  
** The '''MYC''' gene is an oncogene that codes for a transcription factor that promotes cell growth and proliferation. Dysregulation or overexpression of MYC can contribute to the development of many types of cancer.
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062882 6 normal and the rest are tumor.  
** Overexpression of the '''HER2''' (human epidermal growth factor receptor 2) gene is commonly observed in certain types of cancer, particularly breast cancer. Approximately 20-25% of breast cancer cases overexpress HER2, which is associated with a more aggressive form of the disease.
** Overexpression of the epidermal growth factor receptor ('''EGFR''') gene is a common genetic alteration observed in glioblastoma multiforme (GBM) patients.  


=== SRP053134 - breast cancer ===
* It's important to note that the distinction between inhibitor and activator genes is not always clear-cut, as many genes can have both inhibitory and activating effects depending on the context and the specific proteins they interact with.
* http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1785051


Look at the MBases value column. It determines the coverage for each run.
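As a rough illustration of how MBases relates to coverage: average depth is approximately the total sequenced bases divided by the size of the genome or target region. The sketch below assumes a human genome of about 3,100 Mb as the default; the function name and that default are assumptions, and for exome data the target size would be much smaller.

<syntaxhighlight lang='bash'>
# Approximate mean coverage = sequenced megabases / target size in megabases.
# $1 = MBases of the run
# $2 = target size in Mb (default 3100, an assumed round figure for human)
approx_coverage() {
  awk -v mb="$1" -v size="${2:-3100}" 'BEGIN { printf "%.1f\n", mb / size }'
}

approx_coverage 126100   # a run with 126.1G bases, prints 40.7
</syntaxhighlight>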
* '''Normal/cancer cells and PARP Inhibition''':
** '''Normal cells''' can tolerate DNA damage caused by PARP inhibition due to their efficient '''homologous recombination (HR) mechanism'''.
** In contrast, '''cancer cells''' with a '''deficient HR''' struggle to manage the DNA double-strand breaks (DSBs) and are '''especially sensitive''' to the effects of PARP inhibitors (PARPi)
** '''PARP''' has been found to be '''overexpressed''' in various types of cancers, including breast, ovarian, and oral cancers, compared to their corresponding normal healthy tissues.
** This overexpression makes '''inhibition of PARP activity''' an attractive strategy for cancer therapeutics. By disrupting PARP functions, it impairs DNA damage repair (DDR) pathways in cancer cells.
** After cancer patients receive PARP inhibitor '''(PARPi) drugs''', their expression of PARP genes tends to be '''lower''' than that of healthy individuals.


=== [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64016 SRP050992] single cell RNA-Seq ===
== Undifferentiated cancer ==
Used in [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0927-y Design and computational analysis of single-cell RNA-sequencing experiments]
* [https://www.medicinenet.com/undifferentiated_cancer/definition.htm Medical Definition of Undifferentiated cancer] (a bad kind of tumor)
* What are undifferentiated cells
** Undifferentiated cells are cells that have not yet developed into a specific type of cell and do not possess the characteristics of a fully differentiated cell. They are also known as '''stem cells''' or '''progenitor cells'''.
** In developmental biology, undifferentiated cells are the cells that have not yet undergone differentiation, the process by which a less specialized cell becomes a more specialized cell, with a specific function and characteristics. These cells have the potential to divide and differentiate into multiple cell types, either through normal development or in response to injury or disease.
** In the context of cancer, '''undifferentiated''' cells refer to cells that have not yet developed into a specific type of cancer cell. These cells are sometimes called '''cancer stem cells''', and they are thought to be the cells that give rise to the various types of cells within a tumor. '''They are believed to be responsible for the maintenance and growth of the tumor''', and for its ability to spread to other parts of the body. They can be found within many types of cancer, and are considered to be an important target for cancer therapy, as ''they are thought to be more resistant to traditional treatments such as chemotherapy and radiation''.
* [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164174 GSE164174]


=== Single cell RNA-Seq ===
== NCI Information Technology for Cancer Research program /ITCR ==
[http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0964-6 Exploiting single-cell expression to characterize co-expression replicability]
https://itcr.cancer.gov/. [https://itcr.cancer.gov/videos Videos]. It sponsors several programs such as Bioconductor, GenePattern, UCSC Xena, IGV, PDX Finder, WebMeV, and others.


=== SRP040626 or SRP040540 - Colon and rectal cancer ===
= Other software =
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040626
== Partek ==
* http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP040540
* [https://www.partek.com/application-page/single-cell-gene-expression/ Single Cell RNA-Seq]
* Partek Flow Software http://youtu.be/-6aeQPOYuHY
* [https://youtu.be/tZ4BhhVCJfU?t=893 RNA Seq Analysis with Partek Flow]
* [https://www.youtube.com/watch?v=cj9M--9zzgI&list=PLLFT1pfBxZZj8xOCjZTFjt3EX7jqvUgDm A playlist of Single Cell Analysis with Partek Flow Bioinformatics Software] (WTH the videos are 720p only)
* [https://youtu.be/iT63UZGXzu0 Understanding RNA-Seq Data Analysis: A Back-to-basics Overview]
* https://partekflow.cit.nih.gov/


=== OmicIDX ===
== [http://www.hsph.harvard.edu/cli/complab/dchip/ dCHIP] ==
[https://seandavi.github.io/2019/06/omicidx-on-bigquery/ OmicIDX on BigQuery]


== Tutorials ==
== [http://www.tm4.org/mev/ MeV] ==
See the [[#BWA|BWA]] section.


== Whole Exome Seq ==
MeV v4.8 (11/18/2011) allows annotation from Bioconductor
* [http://www.1000genomes.org/category/exome 1000genomes]. 1000genomes and tcga are two places to get vcf files too.
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4179624/ Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing] (Bao 2014)
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/ A framework for variation discovery and genotyping using next-generation DNA sequencing data]. See the table 1 there.
* Some data from SRA repository.
** http://sra.dnanexus.com/?q=cancer+exome&result_type=Study
** http://www.ncbi.nlm.nih.gov/sra/?term=WXS
** [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP008740 SRP008740] See [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3956068/ A survey of tools for variant analysis of next-generation sequencing data] (Pabinger 2014)


== Whole Genome Seq ==
== IPA from Ingenuity ==
* [https://www.ncbi.nlm.nih.gov/bioproject/browse/ BioProject] and
Login:
** Search: filter by 'Homo sapiens wgs'
There are a web-started version at https://analysis.ingenuity.com/pa and a Java applet version at https://analysis.ingenuity.com/pa/login/choice.jsp. We can also double-click the file <IpaApplication.jnlp> saved in the download folder on my machine.
** Project data type: Genome sequencing
** Click 'Date' to sort by it
** 20 hits as of 1/11/2017 (many of them do not have data)
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA352450 PRJNA352450] 18 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX2341045 SRX2341045] 20.5M spots, 4.1G bases, 1.8Gb
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA343545 PRJNA343545] 48 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX2187657 SRX2187657] 417.4M spots, 126.1G bases, 55.1Gb
* [https://www.ncbi.nlm.nih.gov/bioproject/309109 309109] 5 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX1538498 SRX1538498] 1.7M spots, 346.1M bases, 99.7Mb
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA289286 PRJNA289286] 5 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX1100298 SRX1100298] 504M spots, 101.8G bases, 45.3Gb
* [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA260389 PRJNA260389] 27 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX699196 SRX699196] 34,099,675 spots, 6.8G bases, 4.3Gb size
* [https://www.ncbi.nlm.nih.gov/bioproject/248553 248553] 3 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX1026041 SRX1026041] 1.2G spots, 250.9G bases, 114.2Gb
* [https://www.ncbi.nlm.nih.gov/bioproject/210123 210123] 26 experiments
** [https://www.ncbi.nlm.nih.gov/sra/SRX318496 SRX318496] 173.7M spots, 34.7G bases, 23Gb
* [https://www.ncbi.nlm.nih.gov/bioproject/43433 43433] 3 experiments (ABI SOLiD System 3.0)
** [https://www.ncbi.nlm.nih.gov/sra/SRX017230 SRX017230]  851.1M spots, 85.1G bases, 68.8Gb


== SraRunTable.txt ==
Features:
# http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
* easily search the scientific literature/integrate diverse biological information.
# http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click ''SRP004077''
* build dynamic pathway models
# http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click '''Runs''' from the RHS
* quickly analyze experimental data/Functional discovery: assign function to genes
# http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click '''RunInfoTable'''
* share research and collaborate. On the other hand, IPA is web-based, so analyses take time to run. Once a submitted analysis is done, an email is sent to the user.


Note (for this study, the table has 2377 rows):
Start Here
* Column A (AssemblyName_s) eg GRCh37
<pre>
* Column I (library_name_s) eg
Expression data -> New core analysis -> Functions/Diseases -> Network analysis
* column N (header=Run_s) shows all SRR or ERR accession numbers.
                                        Canonical pathways        |
* Column P (Sample_Name)
                                              |                  |
* Column Y (header=Assay_Type_s) shows '''WGS'''.
Simple or advanced search --------------------+                  |
* Column AB (LibraryLayout_s): PAIRED
                                              |                  |
                                              v                  |
                                        My pathways, Lists <------+
                                              ^
                                              |
Creating a custom pathway --------------------+
</pre>
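To pull one of the SraRunTable.txt columns listed above out of a downloaded file, a lookup by header name avoids counting spreadsheet letters. This is a minimal sketch that assumes the file is tab-delimited with a single header row; runtable_column is a name invented here.

<syntaxhighlight lang='bash'>
# Print the values of the column whose header equals $1 from the
# tab-delimited file $2 (assumed to have exactly one header row).
# Prints nothing if no header matches.
runtable_column() {
  awk -F '\t' -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col }
  ' "$2"
}

# Usage: list all run accessions
#   runtable_column Run_s SraRunTable.txt
</syntaxhighlight>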


= Public Data =
Resource:
== ISB Cancer Genomics Cloud (ISB-CGC) ==
* http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf
https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis
* http://libguides.mit.edu/content.php?pid=14149&sid=843471
* http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/
* IPA 5.5 manual http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/ipa_help_manual_5.5_v1.pdf
* [http://ingenuity.force.com/ipa Help and supports]
* [http://ingenuity.force.com/ipa/articles/Tutorial/Tutorials Tutorials] which includes
** Search for genes
** Analysis results
** Upload and analyze example data
** Upload and analyze your own expression data
** Visualize connections among genes
** Learn more special features
** Human isoform view
** Transcription factor analysis
** Downstream effects analysis


The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.
Notes:
* The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
* The data to be '''uploaded''' (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term '''Single/Multiple Observation'''. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (e.g. wild-type). A Multiple Observation dataset contains more than one experimental condition (e.g. a time course experiment or a dose response experiment) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
* The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
* In this [http://ingenuity.force.com/ipa/articles/Tutorial/upload-analyze-example-data-tutorial prostate example data tutorial], the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
* The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.


[https://github.com/isb-cgc/ISB-CGC-Webapp ISB-CGC Web Application]
Screenshots:


== CCLE ==
[[:File:IngenuityAnalysisOutput.png]]
[https://www.nature.com/articles/s41586-019-1186-3 Next-generation characterization of the Cancer Cell Line Encyclopedia] 2019


It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).
== [http://david.abcc.ncifcrf.gov/ DAVID Bioinformatics Resource] ==
It offers an integrated annotation combining gene ontology, pathways and protein annotations.


== NCI's Genomic Data Commons (GDC)/TCGA ==
It can be used to identify the pathways associated with a set of genes; e.g. [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-412#Sec7 this paper].
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).


* [https://portal.gdc.cancer.gov/ NCI's GDC] - Genomic Data Commons Data Portal. Researchers can access over 3 PB of bigData from projects like CPTAC, TARGET and of course TCGA.
== GOTrapper ==
* [https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#tcgaanalyze_dea__tcgaanalyze_leveltab:_differential_expression_analysis_(dea) Working with TCGAbiolinks package]
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2581-8 GOTrapper: a tool to navigate through branches of gene ontology hierarchy]
* [https://bioconductor.org/packages/release/data/experiment/html/GSE62944.html GEO accession data GSE62944 as a SummarizedExperiment] and [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944 GEO] website.
* [https://www.ncbi.nlm.nih.gov/pubmed/25819073 BioXpress: an integrated RNA-seq-derived gene expression database for pan-cancer analysis.]
* https://github.com/srp33/TCGA_RNASeq_Clinical
* [https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/ GenomicDataCommons] Example: UUID to TCGA and TARGET Barcode Translation
* [https://www.biorxiv.org/content/early/2018/12/21/046904 MOGSA: integrative single sample gene-set analysis of multiple omics data] 2019. The data was obtained from TCGA and NCI60.


== GTEx ==
== [http://cran.r-project.org/web/packages/qpcR/index.html qpcR] ==
* [https://www.gtexportal.org/home/ Genotype-Tissue Expression (GTEx) project]
Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).
* [https://master.bioconductor.org/packages/release/workflows/html/recountWorkflow.html recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor]
* [https://www.biorxiv.org/content/10.1101/602367v1 Basal Contamination of Bulk Sequencing: Lessons from the GTEx dataset]


== Sharing data ==
== GSEA ==
* [https://datascience.cancer.gov/data-sharing NCI Data Sharing]
* http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp
* [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006472 Ten quick tips for sharing open genomic data] Brown et al, PLOS 2018
* http://www.hmwu.idv.tw/web/CourseSMDA/MADA/Hank_MicroarrayDataAnalysis-GSEA-20110616.pdf
 
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2716-6#MOESM20 MGSEA]– a multivariate Gene set enrichment analysis
= Gene set analysis =
* [https://www.biostars.org/p/88926/ Over Representation Vs Enrichment Analysis]
 
== Hypergeometric test ==
* [http://mygoblet.org/training-portal/courses/pathway-and-network-analysis-omics-data-2014 A course from bioinformatics.ca] and [http://mygoblet.org/sites/default/files/materrials/Pathways_2014_Module2.pdf over-represented pathway].
* [http://blog.nextgenetics.net/?e=94 How informative are enrichment analyses really?]
 
== Next-generation sequencing data ==
* [http://bioinformatics.oxfordjournals.org/content/32/17/i611.full Gene-set association tests for next-generation sequencing data]
 
= Misc =
== Advice ==
[https://github.com/nih-byob/presentations/tree/master/2019/01_bioinformatics_tips Bioinformatics advice I wish I learned 10 years ago] from NIH
 
== High Performance ==
* https://www.youtube.com/watch?v=M3RVfv6lUtc NYCMC
 
== Cloud Computing ==
* [https://github.com/VCCRI/Falco/ '''Falco''': A quick and flexible single-cell RNA-seq processing framework on the cloud]
* [https://youtu.be/cP5rvWoJDOQ Getting started with Bioconductor in the cloud]
* [https://www.rna-seqblog.com/micloud-a-bioinformatics-cloud-for-seamless-execution-of-complex-ngs-data-analysis-pipelines/ miCloud]: a bioinformatics cloud for seamless execution of complex NGS data analysis pipelines
 
== Merge different datasets (different genechips) ==
* https://support.bioconductor.org/p/65506/
 
== Normalization ==
* [http://nar.oxfordjournals.org/content/early/2015/07/21/nar.gkv736.long How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets] 2015
* [http://www.biomedcentral.com/1471-2105/16/347 Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data] 2015
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2382-0 Expression analysis of RNA sequencing data from human neural and glial cell lines depends on technical replication and normalization methods] 2018
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2745-1 A statistical normalization method and differential expression analysis for RNA-seq data between different species] Zhou 2019
 
== Ensembl ==
* http://useast.ensembl.org/index.html
* [http://training.ensembl.org/exercises Training]
 
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
 
== How to use [http://genome.ucsc.edu/cgi-bin/hgTables UCSC Table Browser] ==
* An instruction from [http://bitseq.github.io/howto/index BitSeq] software
* [https://www.biostars.org/p/93011/ How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols]
* [https://www.biostars.org/p/156637/ Where to download refseq gene coding regions data?]
** http://genome.ucsc.edu/cgi-bin/hgTables OR
** Download '''refGene.txt.gz''' file from UCSC directly using http links
* [https://www.biostars.org/p/94823/ Where To Download Genome Annotation Including Exon, Intron, Utr, Intergenic Information?]
 
[[File:Tablebrowser.png|330px]] [[File:Tablebrowser2.png|300px]]
 
Note
# By default the UCSC Table Browser returns the output in the browser window. Users need to save the page from the browser under a file name of their choice.
# the output does not have a header
# The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1
 
If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
$ wc -l hg38Tables.bed
75893 hg38Tables.bed
$ head -2 hg38Tables.bed
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed
chr22_KI270734v1_random 131493 137393 NM_005675 0 + 131645 136994 0 5 262,161,101,141,549, 0,342,3949,4665,5351,
chr22_KI270734v1_random 138078 161852 NM_016335 0 - 138479 161586 0 15 589,89,99,176,147,93,82,80,117,65,150,35,209,313,164, 0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,
 
$ wc -l hg38CodingExon.bed
577387 hg38CodingExon.bed
$ head -2 hg38CodingExon.bed
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
$ tail -2 hg38CodingExon.bed
chr22_KI270734v1_random 156288 156497 NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r 0 -
chr22_KI270734v1_random 161313 161586 NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r 0 -
 
# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
$ grep NM_001276352 hg38Tables.bed
chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 1429,70,145,68,113,158,92,86,42, 0,4076,11062,19401,23176,33576,34990,38966,42754,
$ grep NM_001276352 hg38CodingExon.bed
chr1 67093579 67093604 NM_001276352_cds_0_0_chr1_67093580_r 0 -
chr1 67096251 67096321 NM_001276352_cds_1_0_chr1_67096252_r 0 -
chr1 67103237 67103382 NM_001276352_cds_2_0_chr1_67103238_r 0 -
chr1 67111576 67111644 NM_001276352_cds_3_0_chr1_67111577_r 0 -
chr1 67115351 67115464 NM_001276352_cds_4_0_chr1_67115352_r 0 -
chr1 67125751 67125909 NM_001276352_cds_5_0_chr1_67125752_r 0 -
chr1 67127165 67127240 NM_001276352_cds_6_0_chr1_67127166_r 0 -
</pre>
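
The relation between the two files can be checked by hand: each row of hg38CodingExon.bed is one BED12 block of the transcript, clipped to the coding interval (thickStart to thickEnd). Below is a minimal Python sketch of that clipping, assuming the standard BED12 column order; the function name coding_exons is made up for illustration and is not part of any UCSC tool.

```python
def coding_exons(bed12_line):
    """Return (start, end) pairs for the coding part of each exon in a BED12 row."""
    f = bed12_line.split()
    chrom_start = int(f[1])
    thick_start, thick_end = int(f[6]), int(f[7])      # bounds of the coding region
    sizes = [int(x) for x in f[10].split(",") if x]    # blockSizes
    starts = [int(x) for x in f[11].split(",") if x]   # blockStarts, relative to chrom_start
    exons = []
    for size, rel in zip(sizes, starts):
        start = max(chrom_start + rel, thick_start)
        end = min(chrom_start + rel + size, thick_end)
        if start < end:                                # drop exons that are entirely UTR
            exons.append((start, end))
    return exons


line = ("chr1 67092175 67134971 NM_001276352 0 - 67093579 67127240 0 9 "
        "1429,70,145,68,113,158,92,86,42, "
        "0,4076,11062,19401,23176,33576,34990,38966,42754,")
for start, end in coding_exons(line):
    print("chr1", start, end)
```

The seven intervals printed correspond to the seven NM_001276352_cds_* rows shown above.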
 
This can be compared with what appears to be the same table, '''refGene''', downloaded directly over HTTP (the row count, 75893, matches hg38Tables.bed):
<pre style="white-space: pre-wrap; /* CSS 3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* IE 5.5+ */ " >
$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’
 
hg38.refGene.txt.gz                100%[===============================================================>]  7.11M  901KB/s    in 10s
 
2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]
 
$ zcat hg38.refGene.txt.gz | wc -l
75893
$ zcat hg38.refGene.txt.gz | head -2
1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 63865227,63869295,63873667,63875815,63882718,63889189,63889849, 63865384,63869441,63873816,63875875,63882820,63889238,63891545, 0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0,
1815 NR_110164 chr2 + 161244738 161249050 161249050 161249050 2 161244738,161246874, 161244895,161249050, 0 LINC01806 unk unk -1,-1,
 
$ zcat hg38.refGene.txt.gz | tail -2
1006 NM_130467 chrX + 55220345 55224108 55220599 55224003 5 55220345,55221374,55221766,55222620,55223986, 55220651,55221463,55221875,55222746,55224108, 0 PAGE5 cmpl cmpl 0,1,0,1,1,
637 NM_001364814 chrY - 6865917 6874027 6866072 6872608 7 6865917,6868036,6868731,6868867,6870005,6872554,6873971, 6866078,6868462,6868776,6868909,6870053,6872620,6874027, 0 AMELY cmpl cmpl 0,0,0,0,0,0,-1,
</pre>
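
The refGene table stores absolute exon starts and ends, whereas BED12 stores block sizes and block starts relative to the transcript start. The following Python sketch shows the conversion. It assumes the UCSC refGene column order (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...); refgene_to_bed12 is an illustrative name, not a UCSC utility.

```python
def refgene_to_bed12(row):
    """Convert one refGene row (whitespace-separated) to a list of BED12 fields."""
    f = row.split()
    name, chrom, strand = f[1], f[2], f[3]
    tx_start, tx_end = int(f[4]), int(f[5])
    cds_start, cds_end = int(f[6]), int(f[7])
    exon_starts = [int(x) for x in f[9].split(",") if x]
    exon_ends = [int(x) for x in f[10].split(",") if x]
    sizes = [end - start for start, end in zip(exon_starts, exon_ends)]
    rel_starts = [start - tx_start for start in exon_starts]
    return [chrom, tx_start, tx_end, name, 0, strand, cds_start, cds_end, 0,
            len(sizes),
            ",".join(map(str, sizes)) + ",",
            ",".join(map(str, rel_starts)) + ","]


row = ("1072 NM_003288 chr20 + 63865227 63891545 63865365 63889945 7 "
       "63865227,63869295,63873667,63875815,63882718,63889189,63889849, "
       "63865384,63869441,63873816,63875875,63882820,63889238,63891545, "
       "0 TPD52L2 cmpl cmpl 0,1,0,2,2,2,0,")
bed = refgene_to_bed12(row)
print(" ".join(str(x) for x in bed))
```

The output has the same layout as the hg38Tables.bed rows, so the two files can be compared transcript by transcript.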
 
== Where to download reference genome ==
* [http://hgdownload.cse.ucsc.edu/downloads.html UCSC]. Use [https://genome.ucsc.edu/goldenpath/help/twoBit.html twoBitToFa] to convert .2bit to FASTA; see [https://www.biostars.org/p/9700/ this Biostars post].
* [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ hg19] from UCSC (chromosome-wise).
 
== Which human reference genome to use? ==
http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)
 
== [https://en.wikipedia.org/wiki/RefSeq#RefSeq_categories RefSeq categories] ==
See Table 1 of [https://www.ncbi.nlm.nih.gov/books/NBK21091/ Chapter 18: The Reference Sequence (RefSeq) Database].
 
{| class="wikitable centered" style="text-align:center"
|+
|- class="hintergrundfarbe6"
! Category
! Description
|-
| NC
| Complete genomic molecules
|-
| NG
| Incomplete genomic region
|-
| NM
| [https://en.wikipedia.org/wiki/MRNA mRNA]
|-
| NR
| [https://en.wikipedia.org/wiki/Non-coding_RNA ncRNA]
|-
| NP
| [https://en.wikipedia.org/wiki/Protein Protein]
|-
| XM
| predicted [[mRNA]] model
|-
| XR
| predicted [[ncRNA]] model
|-
| XP
| predicted [[Protein]] model (eukaryotic sequences)
|-
| WP
| predicted [[Protein]] model (prokaryotic sequences)
|}
 
== UCSC version & NCBI release corresponding ==
* http://genome.ucsc.edu/FAQ/FAQreleases.html
 
== Gene Annotation ==
* [http://www.genecards.org/ GeneCards]
* [http://ghr.nlm.nih.gov/GenesBySymbol Genetics Home Reference] from National Library of Medicine
* [http://www.mycancergenome.org/ My Cancer Genome]
* [http://cancer.sanger.ac.uk/cosmic Cosmic]
* [http://www.gettinggeneticsdone.com/2015/11/annotables-convert-gene-ids.html annotables] R package.
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz031/5301311?rss=1 ensembldb] R package
 
== How many [https://en.wikipedia.org/wiki/DNA DNA] strands are there in humans? ==
* http://www.numberof.net/number-of-dna-strands/
* http://www.answers.com/Q/How_many_DNA_strands_are_there_in_humans
 
== How many base pairs in human ==
* 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
* Chromosome 21 is the smallest chromosome (~47 million bp in GRCh38); chromosome 22 is slightly larger (~51 million bp).
* Chromosome 1 is the largest (~249 million bp).
* The Illumina iGenomes file '''Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa''' is 3.0 GB, as are other human genome.fa files (roughly one byte per base).
 
== Gene, Transcript, Coding/Non-coding exon ==
* https://hslnews.wordpress.com/2015/07/02/bioinformatics-bite-how-to-find-the-transcription-start-site-of-a-gene/
* According to https://en.wikipedia.org/wiki/Exon, in the human genome only
** 1.1% of the genome is spanned by exons,
** 24% is in introns,
** 75% is intergenic DNA.
 
== SNP ==
[https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism Types of SNPs and number of SNPs in each chromosome]
 
== NGS technology ==
* [https://en.wikipedia.org/wiki/Illumina_(company) Illumina - Solexa]
* [https://en.wikipedia.org/wiki/ABI_Solid_Sequencing ABI - SOLiD]
* [https://en.wikipedia.org/wiki/454_Life_Sciences Roche 454]
 
== DNA methylation ==
* [https://en.wikipedia.org/wiki/GC-content GC content]  = (G+C)/(A+T+G+C) x 100%
* How many CpGs (C followed by G)?
* [http://genomicsclass.github.io/book/pages/methylation.html Analyzing DNA methylation data] (part of the book [http://genomicsclass.github.io/book/ Biomedical Data Science]) and the edX course [https://www.class-central.com/mooc/1615/edx-ph525x-data-analysis-for-genomics PH525x: Data Analysis for Genomics]. The GitHub repository is https://github.com/genomicsclass/labs. The source code may not be correct. See also http://www.biostat.jhsph.edu/~iruczins/teaching/kogo/html/ml/week8/methylation.Rmd. The tutorial refers to the paper [http://ije.oxfordjournals.org/content/41/1/200.long Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies].
<source lang="rsplus">
devtools::install_github("genomicsclass/coloncancermeth") # current "user/repo" form
library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
dim(meth)
dim(pd)
length(gr)
colnames(pd)
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerIndex <- which(pd$Status=="cancer")

# density of methylation values, one curve per sample; normals in black
i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
  lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples (red)
for(i in cancerIndex){
  lines(density(meth[,i],from=0,to=1),col=2)
}
 
# finding regions of the genome that are different between cancer and normal samples
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- ebayes(fit)
 
# plot of the region surrounding the top hit
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)
 
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453
 
# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")
 
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
#consider two example regions#
...
</source>
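
The GC-content formula and the CpG question at the top of this section can be sketched in a few lines of Python. This is only an illustration on a made-up sequence; the function names gc_content and count_cpg are hypothetical and unrelated to the coloncancermeth data above.

```python
def gc_content(seq):
    """GC content in percent: (G+C)/(A+T+G+C) x 100."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    return 100.0 * (counts["G"] + counts["C"]) / total if total else 0.0


def count_cpg(seq):
    """Number of CpG dinucleotides (a C immediately followed by a G)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


example = "ACGCGTTACGGA"  # made-up sequence
print(gc_content(example))
print(count_cpg(example))
```

Bases other than A, C, G and T (for example N) are ignored in the denominator, following the formula as written.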
 
== Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing ==
* http://www.rna-seqblog.com/exome-sequencing-vs-rna-seq-to-identify-coding-region-variants/
* http://www.rna-seqblog.com/combined-use-of-exome-and-transcriptome-sequencing/
* [http://www.genomebiology.com/2010/11/5/R57 A comparison of whole genome and whole transcriptome sequencing]
 
== Sequence + Expression ==
* [http://www.ncbi.nlm.nih.gov/pubmed/26177635 Integrated sequence and expression analysis of ovarian cancer structural variants underscores the importance of gene fusion regulation]
 
== Integrate RNA-Seq and DNA-Seq ==
* [https://www.jci.org/articles/view/96153 Integrated RNA and DNA sequencing reveals early drivers of metastatic breast cancer] by Perou. An R code is provided.
 
== Integrate/combine Omics ==
* [https://cran.r-project.org/web/packages/OmicsPLS/index.html OmicsPLS] & [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2371-3 the paper] in BMC Bioinformatics 2018
* [https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html MultiAssayExperiment] & [http://cancerres.aacrjournals.org/content/77/21/e39 the paper] in AACR 2017
* [https://cran.r-project.org/web/packages/mixOmics/index.html mixOmics], http://mixomics.org/, [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752 the paper] on PLOS 2017
* [https://academic.oup.com/biostatistics/advance-article/doi/10.1093/biostatistics/kxy044/5092386 The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models] Biostatistics 2018
 
== Gene expression ==
The expression level of a gene is the amount of RNA in the cell that was transcribed from that gene. [https://speakerdeck.com/alyssafrazee/high-resolution-gene-expression-analysis Slides] from Alyssa Frazee.
 
== Quantile normalization ==
* [https://www.biorxiv.org/content/biorxiv/early/2014/12/04/012203.full.pdf When to use Quantile Normalization?] and its R package [http://www.bioconductor.org/packages/release/bioc/html/quantro.html quantro]
* normalize.quantiles() from preprocessCore package. Note for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667. <syntaxhighlight lang='rsplus'>
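# biocLite() has been retired; BiocManager is the current installer
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('preprocessCore')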
#load package
library(preprocessCore)
#the function expects a matrix
#create a matrix using the same example
mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
            ncol=3)
mat
#    [,1] [,2] [,3]
#[1,]    5    4    3
#[2,]    2    1    4
#[3,]    3    4    6
#[4,]    4    2    8
#quantile normalisation
normalize.quantiles(mat)
#        [,1]    [,2]    [,3]
#[1,] 5.666667 5.166667 2.000000
#[2,] 2.000000 2.000000 3.000000
#[3,] 3.000000 5.166667 4.666667
#[4,] 4.666667 3.000000 5.666667
</syntaxhighlight>
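
To make the tie handling concrete, here is a small pure-Python sketch of quantile normalization. It averages the reference values over all ranks occupied by a tied value, which reproduces the output above for this matrix; this is an assumption for illustration, and the exact tie rule inside normalize.quantiles() may differ in other cases (for example three-way ties).

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (lists of numbers)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across columns.
    reference = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    result = []
    for col, sc in zip(columns, sorted_cols):
        new_col = []
        for value in col:
            # All sorted positions (ranks) occupied by this value, ties included.
            ranks = [k for k, v in enumerate(sc) if v == value]
            new_col.append(sum(reference[k] for k in ranks) / len(ranks))
        result.append(new_col)
    return result


# Same matrix as in the R example, stored column by column.
mat_columns = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(mat_columns)
for col in normalized:
    print([round(x, 6) for x in col])
```

In the second column the value 4 occurs twice and occupies ranks 3 and 4, so both entries receive (4.666667 + 5.666667) / 2 = 5.166667.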
 
== Merging two gene expression studies ==
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6 Alternative empirical Bayes models for adjusting for batch effects in genomic studies] Zhang et al. BMC Bioinformatics 2018. The R package is [http://www.bioconductor.org/packages/release/bioc/html/BatchQC.html BatchQC] from Bioconductor.
* [https://www.rdocumentation.org/packages/sva/versions/3.20.0/topics/ComBat Combat()] function in [http://www.bioconductor.org/packages/release/bioc/html/sva.html sva] package from Bioconductor.
** It can remove both known batch effects and other potential latent sources of variation.
** The tutorial includes information on (1) how to estimate the number of latent sources of variation, (2) how to apply the sva package to estimate latent variables such as batch effects, (3) how to directly remove known batch effects using the ComBat function, (4) how to perform differential expression analysis using surrogate variables either directly or with the limma package, and (5) how to apply “frozen” sva to improve prediction and clustering.
** [https://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf#page=7 Tutorial example] to remove the batch effect
:<syntaxhighlight lang='rsplus'>
library(sva)
library(bladderbatch)
data(bladderdata)
pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat,
                      par.prior=TRUE, prior.plots=FALSE)
# This returns an expression matrix, with the same dimensions
# as your original dataset.
# By default, it performs parametric empirical Bayesian adjustments.
# If you would like to use nonparametric empirical Bayesian adjustments,
# use the par.prior=FALSE option (this will take longer).
</syntaxhighlight>
* [https://academic.oup.com/bioinformatics/article/24/9/1154/206630 Merging two gene-expression studies via cross-platform normalization] by Shabalin et al, Bioinformatics 2008. This method (called '''Cross-Platform Normalization/XPN''') was used by Ternès, Biometrical Journal 2017.
* [https://academic.oup.com/bib/article/14/4/469/191565 Batch effect removal methods for microarray gene expression data integration: a survey] by Lazar et al, Briefings in Bioinformatics 2012. The R package is '''[http://bioconductor.org/packages/3.3/bioc/html/inSilicoMerging.html inSilicoMerging]''', which has been removed from Bioconductor as of release 3.4.
* [https://support.bioconductor.org/p/25840/ Question: Combine hgu133a&b and hgu133plus2]. [https://academic.oup.com/biostatistics/article/8/1/118/252073 Adjusting batch effects in microarray expression data using empirical Bayes methods]
* [https://rdrr.io/bioc/limma/man/removeBatchEffect.html removeBatchEffect()] from limma package
* [https://biodatascience.github.io/compbio/dist/batch.html Batch effects and GC content] of NGS by Michael Love
 
== Fusion gene ==
https://en.wikipedia.org/wiki/Fusion_gene
 
== Structural variation ==
* https://en.wikipedia.org/wiki/Structural_variation
* https://www.ncbi.nlm.nih.gov/dbvar/content/overview/
* [https://www.biorxiv.org/content/biorxiv/early/2018/02/01/200170.full.pdf Detection of complex structural variation from paired-end sequencing data] Joseph G. Arthur, 2018
 
[https://github.com/arq5x/lumpy-sv LUMPY], [https://github.com/dellytools/delly DELLY], [https://sites.google.com/site/sebatlab/software-data ForestSV], [http://gmt.genome.wustl.edu/packages/pindel/ Pindel], [http://breakdancer.sourceforge.net/ breakdancer] , [http://svdetect.sourceforge.net/Site/Home.html SVDetect].
 
== RNASeq + ChipSeq ==
* [http://www.nature.com/jhg/journal/vaop/ncurrent/full/jhg201584a.html Elucidating the mechanisms of transcription regulation during heart development by next-generation sequencing]
 
== Labs ==
* [http://salzberg-lab.org/courses/ Steven Salzberg]
 
== Biowulf2 at NIH ==
* Main site: http://hpc.nih.gov
* User guide: https://hpc.nih.gov/docs/user_guides.html
* Unlock account (60 days inactive) https://hpc.nih.gov/dashboard/
* Transitioning from PBS to Slurm: https://hpc.nih.gov/docs/pbs2slurm.html
* Job Submission 'cheat sheet': https://hpc.nih.gov/docs/biowulf2-handout.pdf
* STAR: https://hpc.nih.gov/apps/STAR.html
 
== [https://github.com/DecodeGenetics/BamHash BamHash] ==
Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.
 
== Selected Papers ==
* [http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0961-5 Testing for association between RNA-Seq and high-dimensional data] and the Bioconductor package globalSeq.
* [http://link.springer.com/article/10.1208%2Fs12248-016-9917-y The FDA’s Experience with Emerging Genomics Technologies—Past, Present, and Future]
* [http://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html Differential analysis of gene regulation at transcript resolution with RNA-seq] Trapnell et al, Nature Biotechnology 31, 46–53 (2013)
* [http://cancerres.aacrjournals.org/content/early/2016/12/01/0008-5472.CAN-16-1624.long A Study of TP53 RNA Splicing Illustrates Pitfalls of RNA-seq Methodology]
* [http://www.rna-seqblog.com/top-rna-seq-articles-2016/ Top RNA-Seq Articles 2016] from RNA-Seq blog
* [http://onlinelibrary.wiley.com/doi/10.1111/biom.12745/full Multivariate association analysis with somatic mutation data] by He 2017 Biometrics.
* [http://www.biorxiv.org/content/early/2017/07/19/165191 SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses] by Claire Rioualen et al, 2017.
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0157989 A Survey of Bioinformatics Database and Software Usage through Mining the Literature]
 
== Pictures ==
https://www.flickr.com/photos/genomegov
 
== Using DNA for identity forensics ==
[https://www.worldjournal.com/5120826/article-《社會傳真》用dna做身分鑑識/ Using DNA for identity forensics] (World Journal, in Chinese)
 
== How to teach yourself bioinformatics from scratch ==
https://zhuanlan.zhihu.com/p/32065916
 
== Staying current ==
[http://www.gettinggeneticsdone.com/2017/02/staying-current-in-bioinformatics-genomics-2017.html Staying Current in Bioinformatics & Genomics: 2017 Edition]
 
== Papers ==
* [http://www.nature.com/nature/journal/vaop/ncurrent/full/nature24286.html DNA sequencing at 40: past, present and future]
 
== Precision Medicine courses ==
* [https://gmi.ucsf.edu/cme-outreach/ UCSF]
* [http://openonlinecourses.com/ehr/PrecisionAndPredictiveMedicine.asp Precision & Predictive Medicine]
 
== Personalized medicine ==
* [https://www.nytimes.com/2017/08/30/health/gene-therapy-cancer.html F.D.A. Approves First Gene-Altering Leukemia Treatment]
* [http://time.com/4989537/blood-cancer-gene-therapy/ The FDA Just Approved a New Way of Fighting (lymphoma) Cancer Using Personalized Gene Therapy]
* [https://ghr.nlm.nih.gov/primer/precisionmedicine/precisionvspersonalized What is the difference between precision medicine and personalized medicine? What about pharmacogenomics?] "personalized medicine" is an older term. [https://ghr.nlm.nih.gov/primer Help Me Understand Genetics]
 
== Cancer and gene markers ==
* '''Colorectal cancer''' patients without '''KRAS mutations''' have far better outcomes with '''anti-EGFR treatment''' than those with KRAS mutations.
** Two '''EGFR inhibitors''', cetuximab and panitumumab, are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codons 12 and 13.
* '''Breast cancer'''. 
** [https://en.wikipedia.org/wiki/Trastuzumab Trastuzumab]
** [https://en.wikipedia.org/wiki/Tamoxifen Tamoxifen]
 
== The shocking truth about space travel ==
[https://www.morningticker.com/2018/03/the-shocking-truth-about-space-travel/ 7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station]
 
== bioSyntax: syntax highlighting for computational biology ==
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y
 
== Deep learning ==
[https://www.nature.com/articles/s41576-019-0122-6 Deep learning: new computational modelling techniques for genomics]
 
= Terms =
== Gene structure ==
https://zhuanlan.zhihu.com/p/49601643
 
== RNA sequencing 101 ==
Web
* [http://ctehr.tamu.edu/media/864063/jason-seq-pres.pdf Introduction to RNA-Seq] including biology overview (DNA, alternative splicing, mRNA structure, human genome) and sequencing technology.
* [http://www.chem.agilent.com/Library/eseminars/Public/RNA%20Sequencing%20101.pdf RNA Sequencing 101] by Agilent Technologies. Includes the definition of sequencing depth (number of reads per sample) and coverage (number of reads/locus).
* Where do we get reads (A, C, T, G) from sample RNA? See page 12 of this [https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf pdf] from Colin Dewey at the University of Wisconsin.
* Quantification of RNA-Seq data (see the above pdf)
* Convert read counts into expression: RPKM (see the above pdf)
* RPKM and FPKM ([https://docs.google.com/file/d/0B23nQZpa5ce0ak9jNEdlMEVqemc/edit Data analysis of RNA-seq from new generation sequencing] by 張庭毓): RNA-Seq data analysis workshop and hands-on course / RNA-Seq sequencing / next-generation sequencing (NGS) / high-throughput gene sequencing analysis.
* [http://yourgene.pixnet.net/blog/post/66237799 First vs Second] generation sequence.
* [http://sfg.stanford.edu/SFG.pdf The Simple Fool’s Guide to Population Genomics via RNA-Seq]: An Introduction to High-Throughput Sequencing Data Analysis. This covers QC, de novo assembly, BLAST, mapping reads to reference sequences, gene expression analysis and variant (SNP) detection.
* [http://nihlibrary.nih.gov/Services/Bioinformatics/Documents/Lipsett100928talk_web.pdf An Introduction to Bioinformatics Resources and their Practical Applications] from [http://nihlibrary.nih.gov/Services/Bioinformatics/Pages/default.aspx NIH library Bioinformatics Support Program].
* [http://www.rnaseqforthenextgeneration.org/resources/index.html Teaching material] from rnaseqforthenextgeneration.org which includes Designing RNA-Seq experiments, Processing RNA-Seq data, and Downstream analyses with RNA-Seq data.
 
== Books ==
* [http://www.amazon.com/RNA-seq-Data-Analysis-Mathematical-Computational/dp/1466595000 RNA-seq Data Analysis: A Practical Approach]. The pdf version is available on slideshare.net.
* [http://www.amazon.com/Statistical-Generation-Sequencing-Frontiers-Probability/dp/3319072110/ref=pd_bxgy_b_img_y Statistical Analysis of Next Generation Sequencing Data]
* [https://www.amazon.com/Modern-Statistics-Biology-Susan-Holmes/dp/1108705294 Modern Statistics for Modern Biology] (free, see [[Statistics#Books_2|Statistics books]]).
 
== strand-specific vs non-strand specific experiment ==
* http://seqanswers.com/forums/showthread.php?t=28025. According to this message and the [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126545 article] (under the paragraph of '''Read counting''') from PLOS, ''most of the RNA-seq protocols that are used nowadays are not strand-specific''.
* http://biology.stackexchange.com/questions/1958/difference-between-strand-specific-and-not-strand-specific-rna-seq-data
* https://www.biostars.org/p/61625/ how to find if this public rnaseq data are prepared by strand-specific assay?
* https://www.biostars.org/p/62747/ Discussion of using IGV to view strand-specific coverage. See also the similar posts on the right hand side.
* https://www.biostars.org/p/44319/ How To Find Stranded Rna-Seq Experiments Data. The text ''dUTP 2nd strand marking'' includes a link to stranded rna-seq data.
* forward (+) / reverse (-) strand in GAlignments objects ([http://www.bioconductor.org/packages/release/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf p68 of the pdf manual] and [https://samtools.github.io/hts-specs/SAMv1.pdf page 7 of the SAM format specification]).

Understanding this information is necessary when using the summarizeOverlaps() function (GenomicAlignments) or the htseq-count Python program to obtain count data.

[https://www.biostars.org/p/98756/ This post] suggests using the [http://rseqc.sourceforge.net/ infer_experiment.py script] to check whether an RNA-seq run is stranded or not.

The RNA-seq experiment used in [http://www.bioconductor.org/help/workflows/rnaseqGene/ this tutorial] is not strand-specific.
 
== FASTQ ==
* [http://en.wikipedia.org/wiki/FASTQ_format FASTQ=FASTA + Qual]. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
=== Phred quality score ===
q = -10 log<sub>10</sub>(p), where p is the <span style="color: red">error</span> probability for the base.
{| class="wikitable"
! q
! <span style="color: red">error</span> probability
! base call accuracy
|-
| 10
| 0.1
| 90%
|-
| 13
| 0.05
| 95%
|-
| 20
| 0.01
| 99%
|-
| 30
| 0.001
| 99.9%
|-
| 40
| 0.0001
| 99.99%
|-
| 50
| 0.00001
| 99.999%
|}
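
The table can be reproduced directly from the formula. The Python sketch below also decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding (Sanger / Illumina 1.8+); older Illumina files used an offset of 64, so the offset is a parameter. The function names and the example quality string are made up for illustration.

```python
import math


def phred_to_error_prob(q):
    """Error probability p for a Phred score q, from q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score q for an error probability p."""
    return -10 * math.log10(p)


def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


for q in (10, 13, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, p, f"{100 * (1 - p):.3f}%")

print(decode_quality("II5+"))  # made-up quality string
```

Note that q = 13 corresponds to p = 0.0501, which the table rounds to 0.05.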
 
== FASTA ==
FASTA (.fasta/.fa) files can be used as the reference genome in IGV, but they cannot be loaded as ordinary data files in order to view them.
 
=== Download sequence files ===
* ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/ Assembled genome sequence and annotation data for RefSeq genome assemblies
* ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
 
=== Compute the sequence length of a FASTA file ===
https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file
<syntaxhighlight lang='bash'>
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
 
head -2 file.fa | \
    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}'  | \
    tail -1
</syntaxhighlight>
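
For comparison, the same per-record length computation can be sketched in Python. The function takes any iterable of lines (for example an open file handle); the record names and sequences in the example are made up.

```python
def fasta_lengths(lines):
    """Return {record name: sequence length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header line starts a new record
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = [">seq1 test", "ACGTACGT", "ACG", ">seq2", "TTTT"]
print(fasta_lengths(example))
```

For a file on disk one would pass the file object, e.g. the result of open("file.fa"), instead of the list.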
 
== FASTA <=> FASTQ conversion ==
According to [https://www.quora.com/Bioinformatics-What-is-the-difference-between-fasta-fastq-and-sam-files this post],
 
* FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
* FastQ files are like  fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)
 
=== Convert FASTA to FASTQ without quality scores ===
 
[https://www.biostars.org/p/99886/ Biostars]. For example, the [https://github.com/lh3/bioawk bioawk] by lh3 (Heng Li) worked.
 
=== Convert FASTA to FASTQ with quality score file ===
See the links on the above post.
 
=== Convert FASTQ to FASTA using Seqtk ===
Use the [https://github.com/lh3/seqtk Seqtk] program; see [https://www.biostars.org/p/85929/ this post].
 
The '''Seqtk''' program by lh3 can be used to sample reads from a fastq file including paired-end; see [https://www.biostars.org/p/69348/ this post].
 
== RPKM (Mortazavi et al. 2008) ==
Reads per Kilobase of Exon per Million of Mapped reads.
 
* rpkm function in [https://support.bioconductor.org/p/59317/ edgeR] package.
* RPKM function in [https://support.bioconductor.org/p/50413/ easyRNASeq] package.
 
Idea
* The more we sequence, the more reads we expect from each gene. '''This is the most relevant correction of this method.'''
* Longer transcripts are expected to generate more reads. '''This correction is only relevant for comparisons among different genes, which we rarely perform!''' As such, DESeq2 only estimates a size factor for each library and normalizes the counts by dividing them by that (scalar) size factor. Note that H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2, where c is the gene length.
 
Calculation
# Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
# Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
# Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
 
Formula
<pre>
RPKM = (10^9 * C)/(N * L), with
 
C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
</pre>
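
The three calculation steps and the closed-form formula give the same number. Here is a minimal Python check; the inputs (6 reads, library size 1001, gene length 2000 bp) are those of gene 2 in sample 1 of the edgeR example that follows, and the function names are illustrative only.

```python
def rpkm_stepwise(count, total_mapped, length_bp):
    """RPKM via the three steps: per-million factor, RPM, then per kilobase."""
    per_million = total_mapped / 1_000_000   # step 1: "per million" scaling factor
    rpm = count / per_million                # step 2: reads per million
    return rpm / (length_bp / 1_000)         # step 3: divide by gene length in kb


def rpkm_formula(count, total_mapped, length_bp):
    """RPKM = (10^9 * C) / (N * L)."""
    return 1e9 * count / (total_mapped * length_bp)


print(rpkm_stepwise(6, 1001, 2000))
print(rpkm_formula(6, 1001, 2000))
```

Both calls give approximately 2997.003, the value reported by rpkm(d) for that gene and sample.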
 
<syntaxhighlight lang="rsplus">
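# biocLite() has been retired; BiocManager is the current installer
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")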
library(edgeR)
 
set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
    [,1] [,2] [,3] [,4]
[1,]    0    0    5    0
[2,]    6    2    7    3
[3,]    5  13    7    2
[4,]    3    3    9  11
[5,]    1    2    1  15
 
d <- DGEList(counts=y, lib.size=1001:1004)
# Note that lib.size is optional
# By default, lib.size = colSums(counts)
cpm(d) # counts per million
  Sample1  Sample2  Sample3  Sample4
1    0.000    0.000 4985.045    0.000
2 5994.006  1996.008 6979.063  2988.048
3 4995.005 12974.052 6979.063  1992.032
4 2997.003  2994.012 8973.081 10956.175
5  999.001  1996.008  997.009 14940.239
> cpm(d,log=TRUE)
    Sample1  Sample2  Sample3  Sample4
1  7.961463  7.961463 12.35309  7.961463
2 12.607393 11.132027 12.81875 11.659911
3 12.355838 13.690089 12.81875 11.129470
4 11.663897 11.662567 13.17022 13.451207
5 10.285119 11.132027 10.28282 13.890078
 
d$genes$Length <- c(1000,2000,500,1500,3000)
rpkm(d)
    Sample1  Sample2    Sample3  Sample4
1    0.0000    0.000  4985.0449    0.000
2 2997.0030  998.004  3489.5314 1494.024
3 9990.0100 25948.104 13958.1256 3984.064
4 1998.0020  1996.008  5982.0538 7304.117
5  333.0003  665.336  332.3363 4980.080
 
> cpm
function (x, ...)
UseMethod("cpm")
<environment: namespace:edgeR>
> showMethods("cpm")
 
Function "cpm":
<not an S4 generic function>
> cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    x <- as.matrix(x)
    if (is.null(lib.size))
        lib.size <- colSums(x)
    if (log) {
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
        lib.size <- lib.size + 2 * prior.count.scaled
    }
    lib.size <- 1e-06 * lib.size
    if (log)
        log2(t((t(x) + prior.count.scaled)/lib.size))
    else t(t(x)/lib.size)
}
<environment: namespace:edgeR>
> rpkm.default
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
    gene.length.kb <- gene.length/1000
    if (log)
        y - log2(gene.length.kb)
    else y/gene.length.kb
}
<environment: namespace:edgeR>
</syntaxhighlight>
 
For example, the RPKM value for the 2nd gene in the 1st sample is calculated as
<syntaxhighlight lang="rsplus">
# step 1:
6/(1.0e-6 *1001) = 5994.006    # cpm, compute column-wise
# step 2:
5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise
 
# Another way
# step 1 (RPK)
6/ (2000/1.0e3) = 3
# step 2 (RPKM)
3/ (1.0e-6 * 1001) = 2997.003
</syntaxhighlight>
 
=== Criticisms ===
* [http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf RPKM/FPKM is not suitable for statistical testing] (p11):
 
''Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!''
 
* [http://blog.nextgenetics.net/?e=51 RPKM measure is inconsistent among samples]
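
The quoted example can be made numerical. In the Python sketch below, the RPKM values are identical in the shallow and deep scenarios, while an exact binomial test on the raw counts separates them clearly. The binomial test is only a stand-in chosen for illustration (it is not what edgeR or DESeq2 do), and the 1000 bp gene length is an assumed value.

```python
from math import comb


def rpkm(count, total_mapped, length_bp):
    return 1e9 * count / (total_mapped * length_bp)


def binomial_two_sided_p(a, b):
    """Exact two-sided test that counts a and b come from equal rates (p = 0.5)."""
    n = a + b
    observed = comb(n, a)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return extreme / 2 ** n


GENE_LENGTH = 1000  # bp, assumed for illustration

# Shallow: 1 million reads per library; deep: 100 million reads per library.
shallow = {"A": 10, "B": 5, "depth": 1_000_000}
deep = {"A": 1000, "B": 500, "depth": 100_000_000}

for label, s in (("shallow", shallow), ("deep", deep)):
    print(label,
          rpkm(s["A"], s["depth"], GENE_LENGTH),
          rpkm(s["B"], s["depth"], GENE_LENGTH),
          binomial_two_sided_p(s["A"], s["B"]))
```

The p-value is about 0.30 for 10 versus 5 reads but vanishingly small for 1000 versus 500, even though the RPKM values (10 and 5) are the same in both scenarios.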
 
=== (another criticism) Union Exon Based Approach ===
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910
 
In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.
 
It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.
 
== FPKM (Trapnell et al. 2010) ==
 
Fragments per Kilobase of exon per Million mapped fragments (Cufflinks).
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
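
The counting difference can be illustrated with a toy example in Python. The alignment records below are made up, and the tuple layout (fragment id, mate number, mapped flag) is an assumption for this sketch rather than any real file format.

```python
def count_reads_and_fragments(alignments):
    """alignments: iterable of (fragment_id, mate, is_mapped) tuples."""
    mapped_reads = 0
    mapped_fragments = set()
    for fragment_id, mate, is_mapped in alignments:
        if is_mapped:
            mapped_reads += 1
            mapped_fragments.add(fragment_id)  # a fragment counts once, however many mates map
    return mapped_reads, len(mapped_fragments)


alignments = [
    ("frag1", 1, True), ("frag1", 2, True),    # both mates map: 2 reads, 1 fragment
    ("frag2", 1, True), ("frag2", 2, False),   # one mate maps: 1 read, 1 fragment
    ("frag3", 1, False), ("frag3", 2, False),  # nothing maps
]
reads, fragments = count_reads_and_fragments(alignments)
print(reads, fragments)
```

An RPKM-style count would use the 3 mapped reads, whereas an FPKM-style count uses the 2 mapped fragments.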
 
== [http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ RPKM, FPKM, TPM and DESeq] ==
* The YouTube video is [https://www.youtube.com/watch?t=119&v=TTUrtCY2k-w here]. TPM is preferable to RPKM/FPKM because the TPM values sum to the same total in every sample.
* [https://arxiv.org/pdf/1804.06050.pdf#page=5 Differences]
** The main difference between RPKM and FPKM is that the former is a unit based on single-end reads, while the latter is based on paired-end reads and counts the two reads from the same RNA fragment as one instead of two.
** The difference between RPKM/FPKM and TPM is that the former calculates sample scaling factors before dividing read counts by gene lengths, while the latter divides read counts by gene lengths first and calculates sample scaling factors based on the length-normalized read counts.
** If researchers would like to interpret gene expression levels as the proportions of RNA molecules from different genes in a sample, TPM is the more suitable unit.
* [https://zhuanlan.zhihu.com/p/55988984 为什么说FPKM和RPKM都错了?]
* [http://diytranscriptomics.com/Reading/files/wagnerTPM.pdf Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples] (TPM method) by Wagner 2012.
* Between samples normalization.
** [https://groups.google.com/forum/#!topic/sailfish-users/jBf9SGiH1AM How to Normalize Salmon TPM output?]
** [https://www.biostars.org/p/287296/ How do I normalize for my RNA-seq data across different samples in different conditions]. Using DESeq2 ([https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-014-0550-8 paper], [https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/estimateSizeFactors estimateSizeFactors()]). The [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218662/pdf/gb-2010-11-10-r106.pdf DESeq] paper by Anders 2010 is where the NB (negative binomial) model and size factors were first used.
: <syntaxhighlight lang='rsplus'>
> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      0      1      1      6
gene2      2      0      0      3
gene3      18      9      19      12
gene4      12      25      13      13
gene5      22      26      10      8
gene6      6      5      8      6
> dds <- estimateSizeFactors(dds)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      0      1      1      6
gene2      2      0      0      3
gene3      18      9      19      12
gene4      12      25      13      13
gene5      22      26      10      8
gene6      6      5      8      6
> head(counts(dds, normalized=TRUE))
      sample1    sample2    sample3  sample4
gene1  0.00000  0.9654796  0.9858756  5.732657
gene2  1.96066  0.0000000  0.0000000  2.866328
gene3 17.64594  8.6893164 18.7316365 11.465314
gene4 11.76396 24.1369899 12.8163829 12.420756
gene5 21.56726 25.1024695  9.8587560  7.643542
gene6  5.88198  4.8273980  7.8870048  5.732657
</syntaxhighlight>
* TPM has been suggested as a better unit than RPKM/FPKM, but it cannot be used on its own to compare between samples.
** [https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ What the FPKM? A review of RNA-Seq expression units]. R code for computing effective counts, TPM, and FPKM.
** [https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/ In RNA-Seq, 2 != 2: Between-sample normalization]. Among the most popular and well-accepted BSN (between-sample normalization) methods are TMM and DESeq normalization.
: <syntaxhighlight lang='bash'>
P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)
</syntaxhighlight>
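The size factors used in the DESeq2 session above come from the median-of-ratios idea introduced in the DESeq paper. The following is a minimal Python sketch of that idea, not DESeq2's actual code; the function name <code>size_factors</code> and the toy count table are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count table.

    counts: list of rows, one row per gene, one column per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # one factor per sample: the median ratio to the "reference" sample
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 2 is sample 1 sequenced twice as deep.
toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(toy)
normalized = [[k / s for k, s in zip(row, factors)] for row in toy]
print(factors)
print(normalized)
```

Dividing each column by its factor puts the two toy samples on the same scale, which is what <code>counts(dds, normalized=TRUE)</code> reports.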
 
== TMM (Robinson and Oshlack, 2010) ==
Trimmed Mean of M-values (edgeR).
 
== Sample size ==
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2445-2 Empirical assessment of the impact of sample number and read depth on RNA-Seq analysis workflow performance]
 
== Coverage ==
* [https://genohub.com/recommended-sequencing-coverage-by-application/ Recommended Coverage and Read Depth for NGS Applications] from @Genohub.
* [http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html bedtools]. The bedtools is now hosted on [https://github.com/arq5x/bedtools2 github]
* https://github.com/alyssafrazee/polyester
<pre>
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
</pre>
* Page 18 of this [http://www.chem.agilent.com/Library/eseminars/Public/RNA%20Sequencing%20101.pdf RNA-Seq 101] from Agilent or [http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf Estimating Sequencing Coverage] from Illumina.
<pre>
C = L N / G
</pre>
where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
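The two coverage formulas above can be checked with a few lines of Python. This is only a sketch; the function names and the 2000 bp transcript length are assumptions for illustration.

```python
def coverage(read_length, n_reads, genome_length):
    """Average coverage C = L * N / G."""
    return read_length * n_reads / genome_length


def reads_for_coverage(target_coverage, region_length, read_length):
    """Reads needed to cover a region (e.g. a transcript) at a target depth."""
    return target_coverage * region_length / read_length


# One lane from the text: 100 bp reads, 189 million reads, 3 Gb genome.
lane_coverage = coverage(100, 189e6, 3e9)
print(round(lane_coverage, 1))

# ~20x coverage of a hypothetical 2000 bp transcript with 100 bp reads.
n_reads = reads_for_coverage(20, 2000, 100)
print(n_reads)
```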
* coverage() function in IRanges package.
* [https://github.com/fbreitwieser/bamcov bamcov] - Quickly calculate and visualize sequence coverage in alignment files
* [https://www.biostars.org/p/6571/ Coverage and read depth]
* [https://www.biostars.org/p/638/ What Is The Sequencing 'Depth' ?] Coverage = (total number of bases generated) / (size of genome sequenced). So a 30x coverage means, on an average each base has been read by 30 sequences.
* [http://www.nature.com/nrg/journal/v15/n2/full/nrg3642.html Sequencing depth and coverage: key considerations in genomic analyses]
* [http://qualimap.bioinfo.cipf.es/doc_html/index.html Qualimap]
* https://www.biostars.org/p/5165/
<syntaxhighlight lang='bash'>
# Assume the bam file is sorted by chromosome location
# took 40 min on 5.8G bam file. samtools depth has no threads option:(
# This is not right since it only accounts for positions that are covered by reads
samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'    # maybe 42
 
# The following is the right way! The result matches with Qualimap program.
# -a also reports positions with zero depth
samtools depth -a *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  # maybe 8
# OR: total depth divided by the reference length taken from the BAM header
LEN=$(samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {printf "%.0f\n", sum}')  # 3095693981
SUM=$(samtools depth bamfile | awk '{sum+=$3} END {printf "%.0f\n", sum}')  # 24473867730
echo $(( SUM / LEN ))  # integer division: prints 7 for the numbers above
</syntaxhighlight>
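The gap between the two averages comes down to the denominator: covered positions only, or every position in the reference. Below is a small Python sketch of that difference, assuming tab-separated <code>samtools depth</code> lines (chromosome, position, depth); the function name and toy lines are hypothetical.

```python
def mean_depth(depth_lines, reference_length=None):
    """Mean depth from 'chrom<TAB>pos<TAB>depth' lines.

    Without reference_length the mean is taken over the listed positions
    only (what plain 'samtools depth' gives); with it, uncovered positions
    count as zero depth.
    """
    total = 0
    n_positions = 0
    for line in depth_lines:
        _chrom, _pos, depth = line.rstrip("\n").split("\t")
        total += int(depth)
        n_positions += 1
    denominator = reference_length if reference_length is not None else n_positions
    return total / denominator


# Hypothetical output: only 3 of 10 reference positions are covered.
toy_lines = ["chr1\t1\t10", "chr1\t2\t20", "chr1\t5\t30"]
covered_only = mean_depth(toy_lines)
whole_reference = mean_depth(toy_lines, reference_length=10)
print(covered_only, whole_reference)
```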
 
== SAM (Sequence Alignment/Map) and BAM format specification ==
* https://samtools.github.io/hts-specs/SAMv1.pdf and [http://samtools.sourceforge.net/ samtools] webpage.
* http://genome.sph.umich.edu/wiki/SAM
 
== Single-end, paired-end, fragment, insert size ==
* [http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html Paired-end read confusion - library, fragment or insert size?]
* https://www.biostars.org/p/95803/
 
== Germline vs Somatic mutation ==
Germline: inherited from parents. See the [https://en.wikipedia.org/wiki/Germline Wikipedia] page.
 
== Driver vs passenger mutation ==
https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer
 
== Nonsynonymous mutation ==
It is related to the [http://www.chemguide.co.uk/organicprops/aminoacids/dna4.html genetic code], [https://en.wikipedia.org/wiki/Genetic_code Wikipedia]. There are 20 amino acids though there are 64 codons.
 
See
* http://evolution.about.com/od/Overview/a/Synonymous-Vs-Nonsynonymous-Mutations.htm
* https://en.wikipedia.org/wiki/Nonsynonymous_substitution
* [http://thegenomefactory.blogspot.com/2013/10/understanding-snps-and-indels-in.html Understanding SNPs and INDELs in microbial genomes]
* An example from https://en.wikipedia.org/wiki/Silent_mutation
** nonsynonymous: ATG to GTG mutation (AUG = Met, GUG = Val)
** synonymous: CAT to CAC mutation (CAU = His, CAC = His)
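The two examples above can be reproduced by looking up the amino acid before and after the change. This is a sketch only: the table holds just the four DNA codons mentioned above (a small subset of the standard genetic code), and the function name <code>classify</code> is an assumption.

```python
# Subset of the standard genetic code, keyed by coding-strand DNA codon.
CODON_TABLE = {
    "ATG": "Met",
    "GTG": "Val",
    "CAT": "His",
    "CAC": "His",
}


def classify(ref_codon, alt_codon):
    """Return 'synonymous' if both codons encode the same amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


print(classify("ATG", "GTG"))  # Met -> Val
print(classify("CAT", "CAC"))  # His -> His
```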
 
== isma: analysis of mutations detected by multiple pipelines ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2701-0 isma]: an R package for the integrative analysis of mutations detected by multiple pipelines
 
== Missense variants ==
Amino acid-changing variants.
 
== Alternative and differential splicing ==
[http://www.rna-seqblog.com/best-practices-and-appropriate-workflows-to-analyse-alternative-and-differential-splicing/ Best practices and appropriate workflows to analyse alternative and differential splicing]
 
== Allele vs Gene ==
http://www.diffen.com/difference/Allele_vs_Gene
 
* A gene is a stretch of DNA or RNA that determines a certain trait.
* Genes mutate and can take two or more alternative forms; an '''allele''' is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
* An allele is found at a fixed spot on a chromosome?
* Chromosomes occur in pairs so organisms have two alleles for each gene, one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be the same (homozygous) or different (heterozygous).
 
== Locus ==
https://en.wikipedia.org/wiki/Locus_(genetics)
 
== [https://en.wikipedia.org/wiki/Haplotype Haplotypes] ==
* http://www.brown.edu/Research/Istrail_Lab/proj_cmsh.php
* http://www.nature.com/nri/journal/v5/n1/fig_tab/nri1532_F2.html
* https://www.sciencenews.org/article/seeking-genetic-fate
* http://www.medscape.com/viewarticle/553400_3
 
== Base quality, Mapping quality, Variant quality ==
* Fastq base quality: https://en.wikipedia.org/wiki/FASTQ_format
* Mapping quality: http://genome.cshlp.org/content/18/11/1851.long
* Variant quality: http://www.ncbi.nlm.nih.gov/pubmed/21903627
 
== Mapping quality (MAPQ) vs Alignment score (AS) ==
http://seqanswers.com/forums/showthread.php?t=66634 & [https://samtools.github.io/hts-specs/SAMv1.pdf SAM format specification]
 
* MAPQ (5th column): MAPping Quality. It equals '''−10 log10 Pr{mapping position is wrong}''' (defined by SAM documentation), rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available. MAPQ tells you how confident you can be that the read comes from the reported position. For example, among 1000 read alignments with mapping quality 30, one will be wrong on average (10^(30/-10)=.001). As another example, if MAPQ=70, the probability that the mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Users should refer to the alignment program's documentation for how it assigns 'MAPQ' values.
* AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)
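The MAPQ arithmetic above is a Phred-style conversion, sketched below in Python. The function names are assumptions, and real aligners may cap or otherwise adjust the values they report.

```python
import math


def mapq_to_error_prob(mapq):
    """Probability that the mapping position is wrong, given MAPQ."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong):
    """MAPQ = -10 * log10(Pr{mapping position is wrong}), rounded."""
    return round(-10 * math.log10(p_wrong))


p30 = mapq_to_error_prob(30)
p70 = mapq_to_error_prob(70)
expected_wrong = 1000 * p30  # expected wrong alignments among 1000 reads
print(p30, p70, expected_wrong)
print(error_prob_to_mapq(0.001))
```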
 
Note:
# '''MAPQ scores produced by aligners typically involve the alignment score and other information.'''
# You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but the reported position is still much more probable than any other.
# You probably want to filter on MAPQ, but a "good" alignment may refer to AS if what you care about is the similarity between read and reference.
# [https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/ MAPQ values are really useful but their implementation is a mess] by Simon Andrews
 
= Other software =
== Partek ==
* Partek Flow Software http://youtu.be/-6aeQPOYuHY
 
== [http://www.hsph.harvard.edu/cli/complab/dchip/ dCHIP] ==
 
== [http://www.tm4.org/mev/ MeV] ==
 
MeV v4.8 (11/18/2011) allows annotation from Bioconductor
 
== IPA from Ingenuity ==
Login:
There is a web-start version https://analysis.ingenuity.com/pa and a Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double-click the file <IpaApplication.jnlp> in my machine's download folder.
 
Features:
* easily search the scientific literature/integrate diverse biological information.
* build dynamic pathway models
* quickly analyze experimental data/Functional discovery: assign function to genes
* share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.
 
Start Here
<pre>
Expression data -> New core analysis -> Functions/Diseases -> Network analysis
                                        Canonical pathways        |
                                              |                  |
Simple or advanced search --------------------+                  |
                                              |                  |
                                              v                  |
                                        My pathways, Lists <------+
                                              ^
                                              |
Creating a custom pathway --------------------+
</pre>
 
Resource:
* http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf
* http://libguides.mit.edu/content.php?pid=14149&sid=843471
* http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/
* IPA 5.5 manual http://people.mbi.ohio-state.edu/baguda/PathwayAnalysis/ipa_help_manual_5.5_v1.pdf
* [http://ingenuity.force.com/ipa Help and supports]
* [http://ingenuity.force.com/ipa/articles/Tutorial/Tutorials Tutorials] which includes
** Search for genes
** Analysis results
** Upload and analyze example data
** Upload and analyze your own expression data
** Visualize connections among genes
** Learn more special features
** Human isoform view
** Transcription factor analysis
** Downstream effects analysis


Notes:
* The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
* The data to be '''uploaded''' (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term '''Single/Multiple Observation'''. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
* The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
* In this [http://ingenuity.force.com/ipa/articles/Tutorial/upload-analyze-example-data-tutorial prostate example data tutorial], the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
* The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.
 
Screenshots:
 
[[File:IngenuityAnalysisOutput.png|100px]]
 
== sandbox.bio: Interactive bioinformatics tutorials ==
https://sandbox.bio/. An interactive playground for learning bioinformatics command-line tools like bedtools, bowtie2, and samtools.
 
== [http://david.abcc.ncifcrf.gov/ DAVID Bioinformatics Resource] ==
It offers an integrated annotation combining gene ontology, pathways and protein annotations.
 
It can be used to identify the pathways associated with a set of genes; e.g. [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-412#Sec7 this paper].
 
== GOTrapper ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2581-8 GOTrapper: a tool to navigate through branches of gene ontology hierarchy]
 
== [http://cran.r-project.org/web/packages/qpcR/index.html qpcR] ==
Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).
 
== GSEA ==
* http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp
* http://www.hmwu.idv.tw/web/CourseSMDA/MADA/Hank_MicroarrayDataAnalysis-GSEA-20110616.pdf
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2716-6#MOESM20 MGSEA]– a multivariate Gene set enrichment analysis


= GWAS =
[https://poissonisfish.wordpress.com/2017/10/09/genome-wide-association-studies-in-r/ Genome-wide association studies in R]

Revision as of 14:35, 13 May 2024

Visualization

Ten simple rules

Ten simple rules for developing visualization tools in genomics

IGV

nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh

Simulated DNA-Seq

The following shows three simulated DNA-Seq data sets; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 SNPs per read.

File:Igv dna simul.png

Whole genome

PRJEB1486

File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png

RNA-Seq

  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there is no coverage on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: whether it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: the focus is on expression, so the coverage changes a lot. The bases are still reported as A,C,G,T (not A,C,G,U).

ChromoMap

ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

RNA-seq DRaMA

https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest

Gviz

GIVE: Genomic Interactive Visualization Engine

Build your own genome browser

ChromHeatMap

Heat map plotting by genome coordinate.

ggbio

Wondering how to look at the reads of a gene in samples to check if it was knocked out?

NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.

rtracklayer

R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.

ssviz

A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).

Sushi

See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal, TCGA, PanCanAtlas

See TCGA.

TCPA

Download. Level 4.

Qualimap

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

SeqMonk

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

dittoSeq

dittoSeq – universal user-friendly single-cell and bulk RNA sequencing visualization toolkit, bioinformatics

SeqCVIBE

SeqCVIBE – interactive analysis, exploration, and visualization of RNA-Seq data

ggcoverage

ggcoverage: an R package to visualize and annotate genome coverage for various NGS data 2023

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

modSaRa2

An accurate and powerful method for copy number variation detection

Visualization

reconCNV: interactive visualization of copy number data from high-throughput sequencing 2021

NGS

File:CentralDogmaMolecular.png

See NGS.

mNGS

A new weapon for identifying pathogens: what is metagenomic next-generation sequencing? (in Chinese)

R and Bioconductor packages

Resources

library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)

Docker

Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non-strand-specific, paired-end RNA-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

rnaseqGene

rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression

tximport

CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicAlignments::summarizeOverlaps or GenomicRanges::countOverlaps (count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages

Object type   Example package name                 Contents
OrgDb         org.Hs.eg.db                         gene based information for Homo sapiens
TxDb          TxDb.Hsapiens.UCSC.hg19.knownGene    transcriptome ranges for Homo sapiens
OrganismDb    Homo.sapiens                         composite information for Homo sapiens
BSgenome      BSgenome.Hsapiens.UCSC.hg19          genome sequence for Homo sapiens

refGenome

RNA-Seq Data Analysis using R/Bioconductor

recount2

recount3

  • Intro RNA-seq LCG-UNAM 2022 (Spanish)
  • Recount3: PCR duplicates.
    • PCR duplication can refer to two different things. It can mean the process of making many copies of a specific DNA region using a technique called Polymerase Chain Reaction (PCR). PCR relies on a thermostable DNA polymerase and requires DNA primers designed specifically for the DNA region of interest.
    • On the other hand, PCR duplication can also refer to a problem that occurs when the same DNA fragment is amplified and sequenced multiple times, resulting in identical reads that can bias many types of high-throughput-sequencing experiments. These identical reads are called PCR duplicates and can be eliminated using various methods such as removing all but one read of identical sequences or using unique molecular identifiers (UMIs) to enable accurate counting and tracking of molecules.
    • UMI stands for Unique Molecular Identifier. It is a complex index added to sequencing libraries before any PCR amplification steps, enabling the accurate bioinformatic identification of PCR duplicates. UMIs are also known as Molecular Barcodes or Random Barcodes. UMIs are valuable tools for both quantitative sequencing applications and also for genomic variant detection, especially the detection of rare mutations. UMI sequence information in conjunction with alignment coordinates enables grouping of sequencing data into read families representing individual sample DNA or RNA fragments.
    • dedup - Deduplicate reads using UMI and mapping coordinates
    • UMIs can be extracted from a fastq file using awk. For example awk 'NR % 4 == 1 {split($0,a,":"); print a[6]}' input.fastq > umis.txt . Here we assume the read header is @SEQ_ID:LANE:TILE:X:Y:UMI, then the UMI sequence is in the 6th field, following the 5th colon.
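The awk one-liner and the idea of grouping reads by UMI plus alignment coordinates can be sketched in Python as follows. It assumes the same header layout as above (@SEQ_ID:LANE:TILE:X:Y:UMI); the function names and toy records are hypothetical, and real deduplication tools may also correct sequencing errors in the UMIs, which is not attempted here.

```python
def umi_from_header(header):
    """Return the UMI, assumed to be the 6th colon-separated field."""
    return header.rstrip("\n").split(":")[5]


def extract_umis(fastq_lines):
    """Collect UMIs from the header line of every 4-line FASTQ record."""
    return [umi_from_header(line)
            for i, line in enumerate(fastq_lines) if i % 4 == 0]


def count_unique_molecules(alignments):
    """Count distinct (chromosome, position, UMI) combinations.

    Reads sharing all three values are treated as PCR duplicates.
    """
    return len(set(alignments))


# Hypothetical two-record FASTQ file.
toy_fastq = [
    "@SEQ1:1:1101:1000:2000:ACGT", "TTTT", "+", "IIII",
    "@SEQ2:1:1101:1001:2001:GGCC", "AAAA", "+", "IIII",
]
umis = extract_umis(toy_fastq)

# Hypothetical alignments: the first two are duplicates of one molecule.
toy_alignments = [
    ("chr1", 100, "ACGT"),
    ("chr1", 100, "ACGT"),
    ("chr1", 100, "GGCC"),
    ("chr1", 250, "ACGT"),
]
n_molecules = count_unique_molecules(toy_alignments)
print(umis, n_molecules)
```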

dbGap

dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

eQTL

Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN

GenomicDataCommons package

Note:

  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has four categories of data, and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.

Visualization

GenVisR

ComplexHeatmap

Read counts

Read, fragment

  • Meaning of the "reads" keyword in terms of RNA-seq or next generation sequencing. A read refers to the sequence of a cluster that is obtained after the end of the sequencing process which is ultimately the sequence of a section of a unique fragment.
  • What is the difference between a Read and a Fragment in RNA-seq?. Diagram, paired-end, single-end.
  • In the context of RNA-seq, "read" and "fragment" may refer to slightly different things, but they are related concepts.
    • A read is a short sequence of nucleotides that has been generated by a sequencing machine. These reads are typically around 100-150 bases long. RNA-seq experiments generate millions or billions of reads, and these reads are aligned to a reference genome or transcriptome to determine which reads came from which genes or transcripts. This information is used to quantify gene and transcript expression levels.
    • A fragment is a piece of RNA (in practice, its cDNA copy) produced during library preparation; reads are sequenced from fragments. In RNA-seq, the first step is to convert the RNA into a library of fragments. To do this, the RNA is typically broken up into smaller pieces using a process called fragmentation. Then, adapters are added to the ends of the fragments to allow them to be sequenced. The sequencer then reads one end (single-end) or both ends (paired-end) of each fragment.
    • In summary, a read is a short sequence of nucleotides that has been generated by a sequencing machine, whereas a fragment is the piece of RNA/cDNA in the library from which a read (or a read pair) is sequenced.
  • Does one fragment contain 1 read or multiple reads?
    • It depends on the sequencing technology and library preparation protocol. With Illumina sequencing, a fragment yields one read in a single-end run and two reads (one from each end) in a paired-end run.
    • In the process of library preparation, RNA is first fragmented into smaller pieces, then adapters are ligated to the ends of the fragments. The fragments are then amplified using PCR, generating multiple copies of the original fragment. If several copies of the same fragment are sequenced, the resulting identical reads are PCR duplicates rather than independent observations.
    • In Illumina sequencing, each adapter-ligated fragment that binds to the flow cell is clonally amplified by bridge amplification into a cluster of identical copies. The copies only strengthen the signal: each cluster is read out as a single read (or a single read pair).
    • In other technologies like PacBio or Nanopore, each molecule generates one long read, so a transcript can often be sequenced without fragmenting it first.
    • In summary, a fragment normally corresponds to one read (single-end) or one read pair (paired-end); additional reads from the same original fragment arise only as PCR duplicates.
  • How many reads per fragment on average in Illumina sequencing?
    • Typically one (single-end) or two (paired-end). The library preparation protocol and the complexity of the sample affect how many PCR duplicates are present, not how many reads a cluster produces.
    • When sequencing is performed on the Illumina platform, library preparation includes fragmenting the RNA into smaller pieces and ligating adapters to the ends of the fragments. Bridge amplification then creates clusters of identical copies of each fragment on the flow cell, and each cluster is sequenced once per end.
    • Sequencing depth refers to the total number of reads generated by the sequencing machine. A higher sequencing depth means more fragments are sampled from each transcript, and therefore more reads per gene or transcript; it does not increase the number of reads per fragment.
    • In summary, one fragment gives one read or one read pair in Illumina sequencing. Sequencing depth, library preparation protocol, and sample complexity influence the number of reads per gene and the duplicate rate.

Rsubread

RSEM

  • RSEM, rsem-calculate-expression
  • RSEM on Biowulf
    $ mkdir SeqTestdata/RNASeqFibroblast/output
    $ sinteractive --cpus-per-task=2 --mem=10g
    $ module load rsem bowtie STAR
    $ rsem-calculate-expression -p 2 --paired-end --star \
    				../test.SRR493366_1.fastq ../test.SRR493366_2.fastq \
    				/fdb/rsem/ref_from_genome/hg19 Sample1 # 12 seconds
    				
    $ ls -lthog
    total 5.8M
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.genes.results
    -rw-r----- 1 2.5M Nov 24 13:39 Sample1.isoforms.results
    -rw-r----- 1 1.6M Nov 24 13:39 Sample1.transcript.bam
    drwxr-x--- 2 4.0K Nov 24 13:39 Sample1.stat
    
    $ wc -l Sample1.genes.results
    26335 Sample1.genes.results
    $ wc -l Sample1.isoforms.results
    51399 Sample1.isoforms.results
    
    $ head -2 Sample1.genes.results
    gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
    A1BG	NM_130786	1766.00	1589.99	0.00	0.00	0.00
    $ head -2 Sample1.isoforms.results
    transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
    NM_130786	A1BG	1766	1589.99	0.00	0.00	0.00	0.00
    $ head -1 /fdb/rsem/ref_from_genome/hg19.transcripts.fa
    >NM_130786
    $ grep NM_130786 /fdb/igenomes/Homo_sapiens/UCSC/hg19/transcriptInfo.tab
    NM_130786	2721192635	2721199328	2721188175	2	8	431185
  • RSEM gene level result file (see here for an example) contains 5 essential columns, excluding transcript_id. The arrows below give the name of the element in which the tximport() function saves each column.
    • Effective length → length. It differs across samples and can be much shorter than length (e.g. 105 vs 1).
    • Expected count → counts. This is the sum, over all reads, of the posterior probability that the read comes from this transcript.
    • TPM → abundance. The sum of all transcripts' TPM is 1 million.
    • FPKM (not kept?). FPKM_i = 10^3 / l_bar * TPM_i for gene i, where l_bar is the TPM-weighted mean effective length in the sample. So within each sample FPKM is a constant multiple of TPM.
    R> dfpkm[1:5, 1:3] / txi.rsem$abundance[1:5, 1:3]
              144126_210-T_JKQFX5 144126_210-T_JKQFX6 144126_210-T_JKQFX8
    5S_rRNA             0.7563603            1.118008           0.8485292
    5_8S_rRNA                 NaN                 NaN                 NaN
    6M1-18                    NaN                 NaN                 NaN
    7M1-2                     NaN                 NaN                 NaN
    7SK                 0.7563751            1.118029           0.8485281
    
  • An example using tximport::tximport() and DESeq2::DESeqDataSetFromTximport. Note it directly uses round(expected_count) to get the integer-value counts. See the source of DESeqDataSetFromTximport() here. The tximport vignette has discussed two suggested ways of importing estimates for use with differential gene expression (DGE) methods in the section of "Downstream DGE in Bioconductor". The vignette does not say anything about "expected_count" from RSEM output.
    txi.rsem <- tximport(files, type = "rsem", txIn = F, txOut = F)
    txi.rsem$length[txi.rsem$length == 0] <- 1
    names(txi.rsem) # a list, 
                    # length = effective_length (matrix)
                    # counts = expected counts column (matrix), non-integer
                    # abundance = TPM (matrix)
                    # countsFromAbundance = "no"
    # [1] "abundance"           "counts"              "length"             
    # [4] "countsFromAbundance"
    
    sampleTable <- pheno[, c("EXPID", "PatientID")]
    rownames(sampleTable) <- colnames(txi.rsem$counts)
    
    dds <- DESeq2::DESeqDataSetFromTximport(txi.rsem, sampleTable, ~ PatientID)
    # using counts and average transcript lengths from tximport
    # 
    # The DESeqDataSet class enforces non-negative integer values in the "counts" 
    #     matrix stored as the first element in the assay list.
    dds@assays@data@listData$counts[1:5, 1:3] # integer values. How to compute?
                                   # https://support.bioconductor.org/p/9134840/
    dds@assays@data@listData$avgTxLength[1:5, 1:3] # effective_length
    
    plot(txi.rsem$counts[,1], dds@assays@data@listData$counts[,1])
    abline(0, 1, col = 'red')      # compare expected counts vs integer-value counts
                                   # a straight line
    
    ddsColl1 <- DESeq2::estimateSizeFactors(dds)
    # using 'avgTxLength' from assays(dds), correcting for library size
    # Question: how does the function correct for library size?
    
    ddsColl2 <- DESeq2::estimateDispersions(ddsColl1)
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # final dispersion estimates
    # Note: it seems estimateDispersions is not required 
    #       if we only want to get the normalized count (still need estimateSizeFactors())
    # See ArrayTools/R/FilterAndNormalize.R
    
    cnts2 <- DESeq2::counts(ddsColl2, normalized = FALSE)
    all(dds@assays@data@listData$counts == cnts2)
    # [1] TRUE
    
    all(round(txi.rsem$counts) == cnts2 )
    # [1] TRUE.     So in this case round(expected values) = integer-value counts
    
  • RSEM example on Odyssey
  • A Short Tutorial for RSEM
  • Hands-on Training in RNA-Seq Data Analysis, which includes Quantification using RSEM and Perform DE analysis. Note that the expected count column was used in edgeR.
  • Understanding RSEM: raw read counts vs expected counts. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
  • RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Abundance estimates are given in terms of two measures. The output file (XXX_RNASeq.RSEM.genes.results) contains 7 columns: gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM.
    • Expected counts: This is an estimate of the number of fragments that are derived from a given isoform or gene. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from an isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR or DESeq.
    • TPM: This is the estimated fraction of transcripts made up by a given isoform or gene. The transcript fraction measure is preferred over the popular RPKM/FPKM measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species.
  • The length and effective_length values differ slightly between samples for the same gene.
  • A scatter plot and the correlations show that expected_count and TPM are different quantities.
    x <- read.delim("144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    colnames(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM" 
    plot(x[, "TPM"], x[, "expected_count"])
    cor(x[, "TPM"], x[, "expected_count"])
    # [1] 0.4902708
    cor(x[, "TPM"], x[, "expected_count"], method = 'spearman')
    # [1] 0.9886384
    x[1:5, "length"]
    # [1] 105.01 161.00 473.00  68.00 304.47
    
    x2 <- read.delim("144126_210-T_JKQFX6_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    x2[1:5, "length"]
    # [1] 105.00 161.00 473.00  68.00 305.27
    x[1:5, "effective_length"]
    # [1]   1.03  16.65 293.88   0.00 129.00
    x2[1:5, "effective_length"]
    # [1]   1.58  17.82 293.05   0.00 128.86
    
  • A benchmark for RNA-seq quantification pipelines (2016). They compare the STAR, TopHat2, and Bowtie2 mapping methods and the Cufflinks, eXpress, Flux Capacitor, kallisto, RSEM, Sailfish, and Salmon quantification methods, with RSEM slightly outperforming the rest.
  • Downsample reads from Evaluation of Cell Type Annotation R Packages on Single Cell RNA-seq Data 2020.
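The FPKM-to-TPM relation quoted in the RSEM column list above can be checked numerically. The following is a minimal base-R sketch; the counts and effective lengths are hypothetical values chosen for illustration, not taken from the RSEM files. For comparison, 5S_rRNA in the first example file has FPKM / TPM = 23788.06 / 31450.7 ≈ 0.756, which agrees with the first column of the ratio table printed earlier.

```r
# Hypothetical expected counts and effective lengths for four genes in one sample
count   <- c(1500, 0, 320, 45)
eff_len <- c(120, 300, 950, 410)

rate <- count / eff_len                        # reads per base of effective length
tpm  <- 1e6 * rate / sum(rate)                 # TPM sums to one million
fpkm <- 1e9 * count / (eff_len * sum(count))   # fragments per kilobase per million

# l_bar: TPM-weighted mean effective length
l_bar <- sum(tpm / 1e6 * eff_len)

# FPKM_i / TPM_i is the same constant, 10^3 / l_bar, for every expressed gene
ratio <- fpkm[count > 0] / tpm[count > 0]
print(ratio)
print(1e3 / l_bar)
```

Genes with zero counts give 0 / 0, which is why the ratio table above shows NaN for unexpressed genes.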

Expected_count

Number of reads mapping to that transcript

  • Understanding RSEM: raw read counts vs expected counts In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
    R> x <- read.delim("41samples/165739~295-R~AM1I30~RNASEQ.genes.results")
    R> summary(x$expected_count)     # larger than TPM on average, but the statement above compares expected counts with raw read counts, not with TPM
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          0       0      10    1346     556  533634
    R> summary(x$TPM)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        0.00     0.00     0.16    35.58     7.50 70091.93
    R> x[1:5, c("expected_count", "TPM")]
      expected_count      TPM
    1        6190.00 70091.93
    2           0.00     0.00
    3           0.00     0.00
    4           0.00     0.00
    5         795.01   171.67
    
  • Alignment-based transcript quantification with RSEM (in Chinese): the expected count is the sum, over all reads, of the posterior probability that each read comes from this transcript
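The difference between an expected count and a raw count of mapped reads can be illustrated with a toy posterior matrix. This is only a sketch: the probabilities below are made up, whereas RSEM obtains them from its EM fit.

```r
# Hypothetical posterior probabilities: rows = reads, columns = transcripts.
# Each row sums to 1; reads 3 and 4 are multireads split between transcripts.
post <- rbind(
  c(1.0, 0.0, 0.0),
  c(1.0, 0.0, 0.0),
  c(0.7, 0.3, 0.0),
  c(0.0, 0.4, 0.6),
  c(0.0, 0.0, 1.0)
)
colnames(post) <- c("txA", "txB", "txC")

expected_count <- colSums(post)      # non-integer in general
mapped_reads   <- colSums(post > 0)  # a multiread is counted for every transcript it hits

print(expected_count)  # txA 2.7, txB 0.7, txC 1.6
print(mapped_reads)    # txA 3,   txB 2,   txC 2
```

The expected counts add up to the number of reads (5), while the per-transcript mapped-read counts add up to 7 because the two multireads are counted twice.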

Expected counts from RSEM in DESeq2? Yes, RSEM expected counts can be used with DESeq2.

# adding
txi$length[txi$length <= 0] <- 1
# before
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

Examples

$ wc -l 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results
   28110 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results

$ head -n 4 144126_210-T_JKQFX5_v2.0.1.4.0_RNASeq.RSEM.genes.results | cut -f1,3,4,5,6,7

gene_id	  length  effective_length  expected_count TPM	   FPKM
5S_rRNA	  105.01  1.03	            1513.66	   31450.7 23788.06
5_8S_rRNA 161	  16.65	            0	           0	   0
6M1-18	  473	  293.88	    0	           0	   0

Second example

$ wc -l Sample_HS-578T_CB6CRANXX.genes.results
28110
$ head -1 Sample_HS-578T_CB6CRANXX.genes.results
gene_id	transcript_id.s.	length	effective_length	expected_count	TPM	FPKM
$ tail -4 Sample_HS-578T_CB6CRANXX.genes.results
septin 9/TNRC6C fusion	uc010wto.1	41	0	0	0	0
svRNAa	uc022bxg.1	22	0	0	0	0
tRNA Pro	uc022bqx.1	65	0	0	0	0
unknown	uc002afm.3	1117	922.85	0	0	0

$ wc -l Sample_HS-578T_CB6CRANXX.isoforms.results
78376
$ head -1 Sample_HS-578T_CB6CRANXX.isoforms.results
transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
$ tail -4 Sample_HS-578T_CB6CRANXX.isoforms.results
uc010wto.1	septin 9/TNRC6C fusion	41	0	0	0	0	0
uc022bxg.1	svRNAa	22	0	0	0	0	0
uc022bqx.1	tRNA Pro	65	0	0	0	0	0
uc002afm.3	unknown	1117	922.85	0	0	0	0

limma

  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages, Rsubread and limma, to perform a complete RNA-seq analysis, including Subread read mapping, featureCounts read summarization, voom normalization and limma differential expression analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.

RSEM

Within-subject correlation

  • Does this RNAseq experiment require a repeated measures approach?
    • Solution 1: 9.7 Multi-level Experiments of LIMMA user guide. duplicateCorrelation(), lmFit(), makeContrasts(), contrasts.fit() and eBayes()
    • Solution 2: Section 3.5 "Comparisons both between and within subjects" in edgeR. model.matrix(), glmQLFit(), glmQLFTest(), topTags().
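For the simplest within-subject case (every subject measured under both conditions), the design matrix can be sketched in base R as follows. The sample layout is hypothetical. In the duplicateCorrelation() route the subject would instead be passed as the blocking variable and left out of the design.

```r
# Hypothetical layout: 3 subjects, each measured before and after treatment
subject   <- factor(rep(c("S1", "S2", "S3"), each = 2))
treatment <- factor(rep(c("before", "after"), times = 3),
                    levels = c("before", "after"))

# Subject enters as a blocking factor; the last column is the within-subject effect
design <- model.matrix(~ subject + treatment)
print(colnames(design))
```

The coefficient "treatmentafter" is then the paired treatment effect, adjusted for baseline differences between subjects.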

Time Course Experiments

  • See Limma's vignette on Section 9.6
    • Few time points (ANOVA, contrast)
      • Which genes respond at either the 6 hour or 24 hour times in the wild-type?
      • Which genes respond (i.e., change over time) in the mutant?
      • Which genes respond differently over time in the mutant relative to the wild-type?
    • Many time points (regression such as cubic spline, moderated F-test)
      • Detect genes with different time trends for treatment vs control.
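For the "many time points" case, a spline-based design matrix might look like the sketch below. The time points, group labels and the choice of 3 degrees of freedom are assumptions for illustration. The resulting matrix would then be handed to a linear-model fit, and the interaction columns tested jointly with a moderated F-test.

```r
library(splines)  # ships with base R

# Hypothetical layout: 2 groups x 6 time points x 2 replicates
time  <- rep(rep(c(0, 2, 4, 8, 16, 24), each = 2), times = 2)
group <- factor(rep(c("control", "treated"), each = 12))

basis  <- ns(time, df = 3)               # natural cubic spline basis for time
design <- model.matrix(~ group * basis)  # group-specific smooth trends

# Columns holding the group:time interaction; testing them jointly asks
# whether the time trend differs between treated and control
interaction_cols <- grep(":", colnames(design))
print(colnames(design)[interaction_cols])
```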

easyRNASeq

Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.

ShortRead

Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.

Rsamtools

The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.

IRanges

IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector giving the bin number of each range; the bin number is used as the range's row on the y-axis.
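The idea behind the bin assignment can be sketched in base R: ranges are visited in order of their start position and each one is placed in the lowest bin where it does not overlap anything already there. assign_bins() is a hypothetical helper written for illustration; it is not the IRanges implementation and may not reproduce disjointBins() output in every case.

```r
# Greedy bin assignment for closed integer intervals [start, end]
# (a sketch of the idea, not the IRanges implementation)
assign_bins <- function(start, end) {
  ord     <- order(start, end)
  bin     <- integer(length(start))
  bin_end <- numeric(0)               # last occupied position in each bin
  for (i in ord) {
    free <- which(bin_end < start[i])
    if (length(free) == 0) {
      bin_end <- c(bin_end, end[i])   # open a new bin
      bin[i]  <- length(bin_end)
    } else {
      bin[i]           <- free[1]     # reuse the lowest free bin
      bin_end[free[1]] <- end[i]
    }
  }
  bin
}

start <- c(1, 3, 5, 8, 2)
end   <- c(4, 6, 9, 10, 3)
bins  <- assign_bins(start, end)
print(bins)  # 1 3 1 2 2
```

With these bin numbers, each range can be drawn as a rectangle from start to end at height bins, so that overlapping ranges never share a row.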

flank

The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [2, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.
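The arithmetic behind the flank() results above can be reproduced without IRanges. flank_sketch() below is a hypothetical base-R helper written for illustration; its argument from_start plays the role of flank()'s start argument. It is a sketch that handles only the cases shown above.

```r
# Base-R re-implementation of the flank() arithmetic for closed intervals;
# flank_sketch is a hypothetical helper, not part of IRanges
flank_sketch <- function(start, end, width, from_start = TRUE, both = FALSE) {
  n          <- length(start)
  width      <- rep_len(width, n)
  from_start <- rep_len(from_start, n)
  w          <- abs(width)
  anchor     <- ifelse(from_start, start, end + 1)  # first position right of the chosen edge
  if (both) {
    new_start <- anchor - w                         # w positions on either side of the edge
    new_end   <- anchor + w - 1
  } else {
    left_side <- (width >= 0) == from_start         # TRUE: region lies left of the edge
    new_start <- ifelse(left_side, anchor - w, anchor)
    new_end   <- ifelse(left_side, anchor - 1, anchor + w - 1)
  }
  data.frame(start = new_start, end = new_end, width = new_end - new_start + 1)
}

ir_start <- c(2, 5, 1)
ir_end   <- c(3, 7, 3)
print(flank_sketch(ir_start, ir_end, 2))
print(flank_sketch(ir_start, ir_end, c(2, -2, 2)))
print(flank_sketch(ir_start, ir_end, 2, from_start = FALSE, both = TRUE))
```

A positive width flanks outside the range and a negative width flanks inside it, and both = TRUE straddles the chosen edge.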

Flanking region is also a common term in high-throughput sequencing. The IGV user guide also has some options related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.

Biostrings

GenomicRanges

GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatures ------ GenomicRanges -+- IRanges -- BiocGenerics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - exptData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), exptData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction of the GenomicRanges vignette mentions that a GAlignments object created from a 'BAM' file discards some information, such as the SEQ, QNAME, QUAL and MAPQ fields and any other information that the class does not need. This means that multi-reads don't receive any special treatment. Also, paired-end reads are treated as single-end reads and the pairing information is lost. This might change in the future.

GenomicAlignments

Counting reads with summarizeOverlaps vignette

library(GenomicAlignments)
library(DESeq)
library(edgeR)

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap <- summarizeOverlaps(features, fls)  # count reads overlapping each feature
olap
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

assays(olap)$counts
#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that it is not clear where to find the BAM files. According to the message, we can download the SAM files first and then convert them to BAM files with samtools (not verified yet). The -b option requests BAM output.

samtools view -b -o outputFile.bam inputFile.sam

A modified R code that works is

###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
             "gtf/drosophila_melanogaster/",
             "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts

###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              pubMedIds="20921232",
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              replicate=c(1,1),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
geneCDS <- newCountDataSet(
                  countData=assay(genehits),
                  conditions=design)

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some toy examples.

tidybulk

Bisque

Bisque: An R toolkit for accurate and efficient estimation of cell composition ('decomposition') from bulk expression data with single-cell information.

chromoMap

Inference

tximport

  • http://bioconductor.org/packages/release/bioc/html/tximport.html
    $ head -5 quant.sf 
    Name	Length	EffectiveLength	TPM	NumReads
    ENST00000456328.2	1657	1410.79	0.083908	2.46885
    ENST00000450305.2	632	410.165	0	0
    ENST00000488147.1	1351	1035.94	10.4174	225.073
    ENST00000619216.1	68	24	0	0
    ENST00000473358.1	712	453.766	0	0
    
  • Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
    $ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results 
    gene_id	transcript_id.s. length	effective_length  expected_count  TPM	  FPKM
    5S_rRNA	uc021ofx.1	 105.04	2.23	          2039.99	  49629.87 35353.97
    
    $ R
    x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
    dim(x)
    # [1] 28109     7
    names(x)
    # [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"            
    x[1:3, -2]
    #     gene_id length effective_length expected_count      TPM     FPKM
    # 1   5S_rRNA 105.04             2.23        2039.99 49629.87 35353.97
    # 2 5_8S_rRNA 161.00            21.19           0.00     0.00     0.00
    # 3    6M1-18 473.00           302.74           0.00     0.00     0.00
    
    y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
    dim(y)
    # [1] 78375     8
    names(y)
    # [1] "transcript_id"    "gene_id"          "length"           "effective_length"
    # [5] "expected_count"   "TPM"              "FPKM"             "IsoPct" 
    y[1:3, -1]
    #   gene_id length effective_length expected_count TPM FPKM IsoPct
    # 1 5S_rRNA    110             3.06              0   0    0      0
    # 2 5S_rRNA    133             9.08              0   0    0      0
    # 3 5S_rRNA     92             0.00              0   0    0      0
    

File:RSEM PDX.png

DESeq2 or edgeR

Shrinkage estimators

  • DESeq2 uses a negative binomial statistical model to fit the count data, and can account for differences in sequencing depth across samples.
    • Shrinkage is a technique used to regularize the estimates of the parameters of the negative binomial model.
    • The idea behind shrinkage is to pull the estimated values of the parameters towards a prior distribution, which can help to reduce the variability of the estimates and improve the stability of the results.
    • DESeq2 applies shrinkage in two places. For dispersions, the gene-wise estimates are shrunk toward a trend fitted to the mean-dispersion relationship across genes (the "final dispersion estimates" reported by estimateDispersions()). For log-fold changes, lfcShrink() shrinks the estimates toward zero.
    • The dispersion prior pulls each gene's estimate toward the value expected for genes of similar mean expression rather than toward a single common value. This stabilises the estimates when there are few replicates, which in turn makes the inference on the log-fold changes more reliable.
  • More details:
    • The negative binomial model describes the count, y, as a function of the mean, [math]\displaystyle{ \mu }[/math], and the dispersion, [math]\displaystyle{ \alpha }[/math], so that [math]\displaystyle{ Var(y) = \mu + \alpha \mu^2 }[/math]. The probability mass function of the negative binomial is given by:
    [math]\displaystyle{ \begin{align} P(y | \mu, \alpha) = \frac{\Gamma(y + 1/\alpha)}{y! \, \Gamma(1/\alpha)} \left( \frac{\mu}{\mu + 1/\alpha} \right)^y \left( \frac{1/\alpha}{\mu + 1/\alpha} \right)^{1/\alpha} \end{align} }[/math]
    • For one gene with samples indexed by i, the likelihood of the data is given by:
    [math]\displaystyle{ \begin{align} L(\mu, \alpha | y) = \prod_i [ P(y_i | \mu_i, \alpha) ] \end{align} }[/math]
    • The log-likelihood is:
    [math]\displaystyle{ \begin{align} \log L(\mu, \alpha | y) = \sum_i [ \log(P(y_i | \mu_i, \alpha)) ] \end{align} }[/math]
    In this model, [math]\displaystyle{ \mu_i }[/math] is the mean for sample i and it is modeled through a log link as a linear function of the design matrix (DESeq2 additionally multiplies the mean by a sample-specific size factor).
    [math]\displaystyle{ \begin{align} \mu_i = exp(X_i \beta) \end{align} }[/math]
    [math]\displaystyle{ \alpha }[/math] is the dispersion parameter. It is shared by all samples of a gene, but DESeq2 estimates a separate value for each gene.
    • The shrinkage prior is placed on [math]\displaystyle{ \alpha }[/math]. DESeq2 assumes a log-normal prior centred on the fitted trend [math]\displaystyle{ \alpha_{tr} }[/math] evaluated at the gene's mean normalized count [math]\displaystyle{ \bar{\mu} }[/math], with prior variance [math]\displaystyle{ \sigma_d^2 }[/math]
    [math]\displaystyle{ \begin{align} \log \alpha \sim N(\log \alpha_{tr}(\bar{\mu}), \sigma_d^2) \end{align} }[/math]
    This prior shrinks the gene-wise estimate of [math]\displaystyle{ \alpha }[/math] toward the trend value, which stabilises the dispersion estimates and hence the inference on the log-fold changes.
    • The coefficients [math]\displaystyle{ \beta }[/math] are obtained by maximum likelihood estimation (MLE), while the final dispersion is a maximum a posteriori (MAP) estimate that combines the likelihood with the prior.
  • Dispersion parameter.
    • In the context of the negative binomial model used in DESeq2, the dispersion parameter, alpha, measures the overdispersion of the data, i.e. the variability beyond what a Poisson model would give: the variance is mu + alpha * mu^2. Any alpha greater than 0 means the data are more variable than a Poisson distribution with the same mean, and the model reduces to the Poisson as alpha approaches 0. The Poisson distribution has a single parameter, the mean, which also equals its variance. In contrast, the negative binomial distribution has two parameters, the mean and the dispersion, which allows for more flexibility in fitting the data.
    • The shrinkage method in DESeq2 shrinks the gene-wise dispersion estimates toward the fitted mean-dispersion trend, which stabilises the estimates and the resulting tests on the log-fold changes.
  • What is the formula of the fold change estimator given by DESeq2?
    • The fold change estimator given by the DESeq2 package is the ratio of the fitted mean expression levels in two conditions; the log2 of this ratio is the log2 fold change. The means are fitted with a negative binomial GLM, which accounts for both the mean and the overdispersion of the data.
    • The estimated mean expression level for a gene i in condition j is given by:
    [math]\displaystyle{ \begin{align} \log(\mu_{i,j}) = \beta_{i,j} + X \gamma_i \end{align} }[/math]
    where [math]\displaystyle{ \beta_{i,j} }[/math] is the level of gene [math]\displaystyle{ i }[/math] in condition [math]\displaystyle{ j }[/math], [math]\displaystyle{ X }[/math] holds any other covariates in the design and [math]\displaystyle{ \gamma_i }[/math] is the vector of gene-specific coefficients for them.
    • The log2 fold change is calculated as:
    [math]\displaystyle{ \begin{align} \log2(\mu_{i,j} / \mu_{i,k}) \end{align} }[/math]
    where j and k are the two conditions being compared. With the other covariates held fixed, this equals [math]\displaystyle{ (\beta_{i,j} - \beta_{i,k}) / \log(2) }[/math].
    • So the fold change estimator depends on the design matrix [math]\displaystyle{ X }[/math] and the parameters of the model, [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \gamma }[/math]. DESeq2 also estimates the variance-covariance matrix of the coefficients, which gives the standard error of the log fold change. The standard error is used to assess the significance of the fold change between conditions.
    • Hypothesis testing H0: log(FC)=0 using Wald test
  • What is the variance of the estimated log-fold changes before and after applying DESeq2 method?
    • In RNA-seq data analysis, the log-fold change is a measure of the relative difference in expression between two conditions. The log-fold change is calculated as the log2 of the ratio of the mean expression in one condition to the mean expression in another condition.
    • Without using any methods like DESeq2, the variance of the estimated log-fold changes can be high, particularly for genes with low expression levels, which can lead to unreliable results. This high variance is due to the over-dispersion present in RNA-seq data, which results in a large variability in the estimated expression levels even for genes with similar means.
    • When using the DESeq2 package, shrinkage is applied to the dispersion parameter alpha: a prior centred on the fitted mean-dispersion trend pulls each gene-wise estimate toward the trend, which reduces the variability of the dispersion estimates. Shrinkage of the log-fold changes themselves is available through lfcShrink(). Together these give more stable and reliable estimates of the log-fold changes, which can improve the accuracy and robustness of the differential expression analysis.
    • Additionally, the DESeq2 package also accounts for differences in sequencing depth across samples, which can also help to reduce the variability of the estimated log-fold changes.
  • how DESeq2 package accounts for differences in sequencing depth across samples
    • The DESeq2 package accounts for differences in sequencing depth across samples by estimating a size factor for each sample from the raw count data. This normalization is necessary because sequencing depth can vary widely between samples, leading to differences in the overall number of reads and the apparent expression levels of the genes.
    • Size factors are estimated with the median-of-ratios method, which is more robust than dividing by the total library size because a few highly expressed genes cannot dominate it.
    • The method starts by computing, for each gene, the geometric mean of its counts across all samples, which is used as a reference. Next, for each sample, the ratio of every gene's count to this reference is computed, and the size factor is the median of these ratios. Dividing the counts by the size factor makes the samples comparable.
    • For visualization and clustering, the normalized counts are then transformed. The simplest choice is a shifted logarithm:
    [math]\displaystyle{ \begin{align} \log_2(counts/sizeFactor + c) \end{align} }[/math]
    where c is a small positive constant (pseudocount) added to stabilize the values near zero, and sizeFactor is the sample's size factor from the previous step.
    • The variance stabilizing transformation (vst) and the regularized-logarithm (rlog) transformation refine this idea so that the variance is roughly independent of the mean. The transformed data are meant for exploratory analyses such as PCA or sample clustering, not for testing.
    • For differential expression, DESeq2 does not transform the data. The size factors enter the negative binomial model as offsets, the dispersion accounts for over-dispersion, and shrinkage of the dispersion estimates improves the stability of the results.
  • type='apeglm' shrinkage only for use with 'coef'
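The median-of-ratios calculation of size factors can be sketched in base R. The count matrix below is hypothetical, and the code is an illustration of the method rather than DESeq2's own implementation (in practice estimateSizeFactors() does this, and it also takes avgTxLength into account when it is present).

```r
# Hypothetical raw counts: rows = genes, columns = samples.
# Sample s2 is s1 sequenced twice as deep; s3 is roughly 1.5x as deep.
counts <- cbind(s1 = c(10, 200, 35, 80),
                s2 = c(20, 400, 70, 160),
                s3 = c(15, 300, 52, 120))

# The per-gene geometric mean across samples acts as a pseudo-reference sample
log_geo_mean <- rowMeans(log(counts))

# Size factor = median over genes of count / reference (genes with a zero are skipped)
size_factor <- apply(counts, 2, function(k) {
  keep <- is.finite(log_geo_mean) & k > 0
  exp(median(log(k[keep]) - log_geo_mean[keep]))
})

normalized <- sweep(counts, 2, size_factor, "/")
print(round(size_factor, 3))
print(round(normalized, 1))
```

Because the median is used, the one gene that deviates from the exact depth ratio in s3 does not change the size factors, which come out in the ratio 1 : 2 : 1.5.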

Time course experiment

  • Time course trend analysis from the edgeR's vignette. glmQLFTest()
    • Finds genes that respond to the treatment at either 1 hour or 2 hours versus the 0 hour baseline. This is analogous to an ANOVA F-test for a normal linear model.
    • Assuming gene expression changes smoothly over time, we can use a polynomial or a cubic spline curve with a certain number of degrees of freedom to model gene expression along time.
    • We are looking for genes that change expression level over time. We test for a trend by conducting F-tests for each gene. The topTags function lists the top set of genes with most significant time effects.
    • The total number of genes with significant (5% FDR) changes at different time points can be examined with decideTests.
  • RNA-seq data collected at different time points. Identify differentially expressed genes associated with seasonal changes
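The spline-based trend model mentioned above can be set up as follows. This is a sketch under assumptions: the time points and replicate layout are hypothetical, and the edgeR calls are shown only as comments because they need a DGEList (called y here) with estimated dispersions.

```r
library(splines)  # ships with R

# Hypothetical design: 5 time points (hours), 2 replicates each.
hours <- rep(c(0, 1, 2, 4, 8), each = 2)

# Natural cubic spline basis with 3 degrees of freedom.
X <- ns(hours, df = 3)
design <- model.matrix(~ X)
dim(design)

# With edgeR, a trend test would then look roughly like this
# (y is an assumed DGEList with estimated dispersions):
#   fit <- glmQLFit(y, design)
#   res <- glmQLFTest(fit, coef = 2:4)   # F-test on the 3 spline coefficients
#   topTags(res)
```

Testing the three spline coefficients jointly asks whether expression changes over time in any smooth way, rather than comparing individual time points.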

DESeq2 experimental design and interpretation

DESeq2 experimental design and interpretation

Controlling for batch differences

The variable we are interested in ("condition") is placed after the batch variable.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition)
dds <- DESeq(dds)

OR

dds <- DESeq(dds, test="LRT", reduced=~batch)
res <- results(dds)
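To see what the two designs above imply, the model matrices can be inspected in base R. A minimal sketch with a hypothetical sample table (the batch and condition labels are made up): the likelihood ratio test compares the full design against the reduced one, so it tests exactly the columns that the reduced design lacks.

```r
# Hypothetical sample table: 2 batches, 2 conditions, 8 samples.
coldata <- data.frame(
  batch     = factor(rep(c("b1", "b2"), each = 4)),
  condition = factor(rep(c("ctrl", "trt"), times = 4))
)

full    <- model.matrix(~ batch + condition, data = coldata)
reduced <- model.matrix(~ batch, data = coldata)

colnames(full)
# Columns tested by the LRT (present in full, absent in reduced):
setdiff(colnames(full), colnames(reduced))
```

Because condition comes last in the formula, its coefficient is the last column of the full design, which is also the comparison that results() reports by default.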

DESeq2 diagnostic plot, MA plot

vst over rlog transformation

Expected counts
     | round()
     v                              /-- vst transformation  ---\
Raw counts --> normalized counts  --                            -- Other analyses such as PCA, Hclust (sample distances).
                                    \-- rlog transformation ---/

Simulate negative binomial distribution data

  • rnegbin() in sim.counts() from the ssizeRNA package
    rnegbin(10000 * 10, lambda, 1 / disp) 
    # 10000 genes, 20 samples
    # lambda: mean counts from control group, a matrix. 
    # disp: dispersion parameter, a matrix.
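The same kind of simulation can be sketched with base R's rnbinom() instead of rnegbin(). The gene and sample numbers, the mean and the dispersion below are assumptions chosen for illustration, not values used by ssizeRNA.

```r
set.seed(1)
n_genes   <- 1000
n_samples <- 6
lambda <- matrix(100, n_genes, n_samples)  # assumed mean count per gene/sample
disp   <- matrix(0.1, n_genes, n_samples)  # assumed dispersion

# rnbinom() with mu/size has variance mu + mu^2 / size, so size = 1 / dispersion.
sim <- matrix(rnbinom(n_genes * n_samples, mu = lambda, size = 1 / disp),
              nrow = n_genes, ncol = n_samples)

mean(sim)             # empirical mean, to compare with 100
var(as.vector(sim))   # empirical variance, to compare with the value below
100 + 0.1 * 100^2     # theoretical variance mu + disp * mu^2
```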
    

Reducing false positives in differential analyses of large RNA sequencing data sets

edgeR vs DESeq2 vs limma

  • edgeR
    library(edgeR)
    
    # create DGEList object from count data
    counts <- matrix(c(20,30,25,50,45,55,15,20,10,5,10,8,100,120,110,80,90,95), nrow=3, ncol=6, byrow=TRUE)
    rownames(counts) <- c("G1", "G2", "G3")
    colnames(counts) <- c("A1", "A2", "A3", "B1", "B2", "B3")
    counts
    #     A1  A2  A3 B1 B2 B3
    # G1  20  30  25 50 45 55
    # G2  15  20  10  5 10  8
    # G3 100 120 110 80 90 95
    d <- DGEList(counts)
    
    # perform normalization and differential expression analysis
    d <- calcNormFactors(d)
    design <- model.matrix(~0+factor(c(rep("A",3), rep("B",3))))
    d <- estimateDisp(d, design)
    fit <- glmQLFit(d, design)
    res <- glmQLFTest(fit, contrast=c(-1, 1))
    
    # summarize the results and identify significant genes
    summary(res)
    res2 <- topTags(res); res2
    # Coefficient:  -1*factor(c(rep("A", 3), rep("B", 3)))A 1*factor(c(rep("A", 3), rep("B", 3)))B 
    #         logFC   logCPM          F       PValue          FDR
    # G1  1.1683093 17.98614 146.579840 4.380158e-08 1.314048e-07
    # G2 -0.7865080 16.41917   3.056504 1.059256e-01 1.588884e-01
    # G3 -0.1437956 19.34279  40.852893 2.256279e-01 2.256279e-01
    # topTags() returns a TopTags object; the statistics are in its $table data frame
    tab <- res2$table
    de_genes <- rownames(tab)[tab$FDR < 0.05 & abs(tab$logFC) > 1]
  • DESeq2. The count data above will result in an error. The error occurs when the gene-wise dispersion estimates are too similar to one another for a mean-dispersion trend to be fitted. This can happen when there are very few genes (only 3 here), when the biological samples are very homogeneous, or when the sequencing depth is very low. In such cases, it may be difficult to reliably identify differentially expressed genes using DESeq2.
    library(DESeq2)
    
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(3, 3))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = col_data, design = ~ condition)
    
    # differential expression analysis
    dds <- DESeq(dds)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) : 
    #   all gene-wise dispersion estimates are within 2 orders of magnitude
    #   from the minimum value, and so the standard curve fitting techniques will not work.
    #   One can instead use the gene-wise estimates as final estimates:
    #   dds <- estimateDispersionsGeneEst(dds)
    #   dispersions(dds) <- mcols(dds)$dispGeneEst
    #   ...then continue with testing using nbinomWaldTest or nbinomLRT
    

    Try another data.

    # 5 genes (rows) x 4 samples (columns), filled row by row
    count_data <- matrix(c(100,  500,  200, 1000,
                           300,  200,  400,  150,
                           500,  300,  300,  300,
                           100, 1500,  300,  400,
                           200,   50, 2000,  300), nrow = 5, byrow = TRUE)
    
    colnames(count_data) <- paste0("sample", 1:4)
    rownames(count_data) <- paste0("gene", 1:5)
    col_data <- data.frame(condition = factor(rep(c("treated", "untreated"), c(2, 2))))
    
    # create a DESeq2 dataset object
    dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)
    
    # differential expression analysis
    dds <- DESeq(dds)
    # estimating size factors
    # estimating dispersions
    # gene-wise dispersion estimates
    # mean-dispersion relationship
    # -- note: fitType='parametric', but the dispersion trend was not well captured by the
    #    function: y = a/x + b, and a local regression fit was automatically substituted.
    #    specify fitType='local' or 'mean' to avoid this message next time.
    # final dispersion estimates
    # fitting model and testing
    # Warning message:
    # In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
    #   Estimated rdf < 1.0; not estimating variance
    
    # extract results
    res <- results(dds)
    res
    # log2 fold change (MLE): condition untreated vs treated 
    # Wald test p-value: condition untreated vs treated 
    # DataFrame with 5 rows and 6 columns
    #        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
    #       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
    # gene1   465.558       0.707313  1.243608  0.568758 0.5695201  0.711900
    # gene2   309.591      -0.119330  0.885551 -0.134752 0.8928081  0.892808
    # gene3   413.959      -0.742921  0.513376 -1.447130 0.1478604  0.369651
    # gene4   638.860      -1.372724  1.428977 -0.960634 0.3367360  0.561227
    # gene5   721.470       2.928705  1.536174  1.906493 0.0565863  0.282931
    
    # extract DE genes with adjusted p-value < 0.05 and |log2 fold change| > 1
    DESeq2_DE_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
    
    # print the number of DE genes identified by DESeq2
    cat("DESeq2 identified", nrow(DESeq2_DE_genes), "DE genes.\n")
  • limma-voom. voom is a function in the limma package that modifies RNA-Seq data for use with limma. Differential Expression with Limma-Voom.
    library(limma)
    
    # Create a design matrix with the sample groups
    design_matrix <- model.matrix(~ condition, data = col_data)
    
    # filter out low-expressed genes (disabled for this toy data)
    if (FALSE) {
      keep <- rowSums(count_data) >= 10
      count_data <- count_data[keep,]
    }
    
    # normalization using voom
    v <- voom(count_data, design_matrix)
    
    # linear model fitting
    fit <- lmFit(v, design_matrix)
    
    # Calculate the empirical Bayes statistics
    fit <- eBayes(fit)
    
    top.table <- topTable(fit, sort.by = "P", n = Inf)
    top.table
    #            logFC  AveExpr          t    P.Value adj.P.Val         B
    # gene3 -2.3242262 15.67787 -1.8229203 0.08330054 0.3153894 -4.249298
    # gene5  2.0865122 16.34012  1.5960616 0.12615577 0.3153894 -4.376405
    # gene1  0.5610761 16.69722  0.5079444 0.61704872 0.7551329 -4.834183
    # gene2  0.2477959 18.69843  0.3829295 0.70581164 0.7551329 -5.150958
    # gene4  0.2869870 17.11809  0.3161922 0.75513288 0.7551329 -4.963932
    
    # Perform hypothesis testing to identify DE genes
    results <- decideTests(fit)
    
    summary(results)
    #        (Intercept) conditionuntreated
    # Down             0                  0
    # NotSig           0                  5
    # Up               5                  0
    
    # Extract the DE genes (adjusted p-value < 0.05 for the condition coefficient)
    de_genes <- rownames(top.table)[top.table$adj.P.Val < 0.05]
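The adj.P.Val column in the topTable() output above is the Benjamini-Hochberg adjustment of the P.Value column (topTable's default adjustment method). A short check in base R, using the p-values copied from that output:

```r
# Raw p-values copied from the topTable() output above.
p <- c(gene3 = 0.08330054, gene5 = 0.12615577, gene1 = 0.61704872,
       gene2 = 0.70581164, gene4 = 0.75513288)

# Benjamini-Hochberg adjustment.
adj <- p.adjust(p, method = "BH")
round(adj, 7)

# Genes passing a 5% FDR cutoff (none for this toy data).
names(adj)[adj < 0.05]
```

The rounded values agree with the adj.P.Val column (0.3153894 for the first two genes, 0.7551329 for the rest), which is why decideTests() reports no significant gene for the condition coefficient.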

DESeq2 vs edgeR

D vs E?

  • One major difference is in how the dispersion parameter is estimated. DESeq2 computes a gene-wise estimate, fits a mean-dispersion trend across genes (parametric by default, with local regression as a fallback), and then shrinks each gene-wise estimate toward the trend. edgeR uses a Cox-Reid profile-adjusted likelihood and can estimate a common dispersion for all genes, a trended dispersion that depends on mean expression, and gene-specific (tagwise) dispersions that are moderated toward the trend.
  • Another difference is in the approach to normalization. DESeq2 estimates size factors with the median-of-ratios method to account for differences in library size and composition (its variance-stabilizing transformation is meant for visualization and clustering, not for testing), whereas edgeR uses the trimmed mean of M-values (TMM) method, which scales each sample's library size to an effective library size.
  • The statistical tests also differ. Both packages fit negative binomial generalized linear models, so batch effects and other covariates can be included in the design of either. DESeq2 uses a Wald test by default (a likelihood ratio test is also available) and offers shrinkage estimators for fold changes. edgeR offers likelihood ratio and quasi-likelihood F-tests and uses an empirical Bayes method to moderate gene-specific dispersions, borrowing information across genes.

When should I choose DESeq2 and when should I choose edgeR?

  • The choice between DESeq2 and edgeR for differential gene expression analysis depends on several factors, including the experimental design, sample size, and the nature of the biological question being investigated. Here are some general guidelines to help you choose between these two algorithms:
  • Choose DESeq2 when:
    • The experimental design includes multiple batches or covariates that may affect the gene expression levels
    • The sample size is small, typically fewer than 12 samples per group
    • The gene expression levels are highly variable across replicates, and the goal is to identify differentially expressed genes with a low false discovery rate (FDR)
    • The focus is on the fold change rather than the statistical significance of differential expression
  • Choose edgeR when:
    • The experimental design includes several factors, such as treatment, time, and biological replicate, and the goal is to identify the main effects and interaction effects of these factors on gene expression
    • The sample size is moderate to large, typically more than 12 samples per group
    • The gene expression levels are less variable across replicates, and the goal is to achieve high statistical power to detect differentially expressed genes
    • The focus is on both the fold change and the statistical significance of differential expression, and the researcher is interested in performing downstream analyses such as gene set enrichment analysis or pathway analysis.

DESeq2 in Python

PyDESeq2, nice work

Generalized Linear Models and Plots with edgeR

Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis

EBSeq

An R package for gene and isoform differential expression analysis of RNA-seq data

http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/

prebs

Probe region expression estimation for RNA-seq data for improved microarray comparability

DEXSeq

Inference of differential exon usage in RNA-Seq

rSeqNP

A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

http://www.biosoft.hacettepe.edu.tr/voomDDA/

Pathway analysis

About the KEGG pathways

  • MSigDB database or msigdbr package - seems to be old. It only has 186 KEGG pathways for human.
  • msigdb from Bioconductor
  • The KEGGREST package pulls the data directly from kegg.jp. I got 337 KEGG pathways for human ('hsa').
    BiocManager::install("KEGGREST")
    library(KEGGREST)
    res <- keggList("pathway", "hsa") 
    length(res) # 337
    

GSOAP

GSOAP: a tool for visualization of gene set over-representation analysis

clusterProfiler

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis

hypeR

GSEPD

GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display

SeqGSEA

http://www.bioconductor.org/packages/release/bioc/html/SeqGSEA.html

BAGSE

BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020

GeneSetCluster

GeneSetCluster: a tool for summarizing and integrating gene-set analysis results

Pipeline

SPEAQeasy

SPEAQeasy – a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses

Nextflow

GeneTEFlow

GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data

pipeComp

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

SARTools

http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/

SEQprocess

SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

GEMmaker

GEMmaker, Paper

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).

BitSeq

Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.

ReportingTools

The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.

Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.

It is suggested by e.g. EnrichmentBrowser.

sequences

More or less an educational package. It has 2 C/C++ source files. It is used in Advanced R programming and package development.

QuasR

Bioinformatics paper

CRAN/Bioconductor packages

ssizeRNA

RNASeqPower

RNASeqPower Sample size for RNAseq studies

RnaSeqSampleSize

Shiny app

rbamtools

Provides an interface to functions of the 'SAMtools' C-Library by Heng Li

refGenome

The package contains functionality for importing and managing downloaded genome annotation data from the Ensembl genome browser (European Bioinformatics Institute) and from the UCSC genome browser (University of California, Santa Cruz), and annotation routines for genomic positions and splice site positions.

WhopGenome

Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO, and can read, modify and write PLINK .PED-format pedigree files.

TCGA2STAT

Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.

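# biocLite() has been retired; use BiocManager instead
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CNTools")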

install.packages("TCGA2STAT")

The getTCGA() function can download various kinds of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")

TCGAbiolinks

  • An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data, the arguments project, data.category, platform and/or file.extension should be used.
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    query <- GDCquery(project = "TCGA-ACC",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               legacy = TRUE)
    
    gdcdir <- file.path("Waldron_PublicData", "GDCdata")
    GDCdownload(query, method = "api", files.per.chunk = 10,
                directory = gdcdir)  # 79 files
    ACCse <- GDCprepare(query, directory = gdcdir)
    ACCse
    class(ACCse)
    dim(assay(ACCse))  # 19947 x 79
    assay(ACCse)[1:3, 1:2] # symbol id
    length(unique(rownames(assay(ACCse))))   #  19672
    rowData(ACCse)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #          gene_id entrezgene ensembl_gene_id
    #      <character>  <integer>     <character>
    # A1BG        A1BG          1 ENSG00000121410
    # A2M          A2M          2 ENSG00000175899
    
  • HTSeq counts data example. DeMixT. Error when running GDC_prepare.
    query2 <- GDCquery(project = "TCGA-ACC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type="HTSeq - Counts") # or "STAR - Counts"
    gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
    GDCdownload(query2, method = "api", files.per.chunk = 10,
                directory = gdcdir2)  # 79 files
    ACCse2 <- GDCprepare(query2, directory = gdcdir2)
    ACCse2
    dim(assay(ACCse2))  # 56457 x 79
    assay(ACCse2)[1:3, 1:2]  # ensembl id
    rowData(ACCse2)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #                 ensembl_gene_id external_gene_name original_ensembl_gene_id
    #                     <character>        <character>              <character>
    # ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
    # ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
    
  • Clinical data
    acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
    dim(acc_clin)
    # [1] 92 71
    
  • TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
    dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
    # filter the normalized data (not the raw dataBRCA) so the normalization step is actually used
    dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm, method = "quantile", qnt.cut =  0.25)
    samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
    samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
    dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                          dataFilt[,samplesTP],"Normal", "Tumor")
    # 2nd example
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                               Cond1type = "LGG", Cond2type = "GBM",
                               fdr.cut = 0.01,  logFC.cut = 1,
                               method = "glmLRT")
    
  • Enrichment analysis
    ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", 
                                    RegulonList = rownames(dataDEGs))
    
    TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                            GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                            GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                            nRGTab = rownames(dataDEGs),
                            nBar = 20)
    
  • mRNA Analysis Pipeline from GDC documentation.
  • RangedSummarizedExperiment class
    • assay()
    • colData()
    • rowData()
    • assayNames()
    • metadata()
    • > dim(colData(ACCse))
      [1] 79 72
      > dim(rowData(ACCse))
      [1] 19947     3
      > dim(assay(ACCse))
      [1] 19947    79
      > assayNames(ACCse)
      [1] "normalized_count"
      > assayNames(ACCse2)
      [1] "HTSeq - Counts"
      > metadata(ACCse)
      $data_release
      [1] "Data Release 25.0 - July 22, 2020"
      
  • TCGAbiolinks to DESeq2. My verified version (R 4.3.2 & Bioc ‘3.17’) is available on GitHub.

curatedTCGAData

  • Public data resources and Bioconductor from Bioc2020
    library(curatedTCGAData)
    library(MultiAssayExperiment)
    curatedTCGAData(diseaseCode = "*", assays = "*")
    curatedTCGAData(diseaseCode = "ACC")
    
    ACCmae <- curatedTCGAData("ACC", c("RPPAArray", "RNASeq2GeneNorm"), 
                              dry.run=FALSE)
    ACCmae
    dim(colData(ACCmae)) # 79 (samples) x 822 (features)
    
    head(metadata(colData(ACCmae))[["subtypes"]])
    
  • Caveats for working with TCGA data
    • Not all TCGA samples are cancer samples; each of the 33 cancer types contains a mix of sample types.
    • Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data.
    • There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates.
    • Primary tumors should be selected using TCGAutils::TCGAsampleSelect and used as input to the subsetting mechanisms.
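The idea behind selecting primary tumors can be illustrated on the barcode strings themselves. This is only a sketch: the barcodes below are hypothetical (standard TCGA format, not taken from a real query), and in practice the TCGAutils helpers named above operate on the MultiAssayExperiment object directly.

```r
# Hypothetical barcodes in the standard TCGA format.
barcodes <- c("TCGA-AA-0001-01A-11R-A000-07",
              "TCGA-AA-0001-11A-11R-A000-07",
              "TCGA-AA-0002-01A-11R-A000-07")

# Characters 14-15 hold the sample type code:
# "01" = primary solid tumor, "11" = solid tissue normal.
sample_code <- substr(barcodes, 14, 15)
table(sample_code)

# Keep primary tumors only.
primary <- barcodes[sample_code == "01"]
primary
```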

caOmicsV

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA was used.

Visualize multi-dimensional cancer genomics data, including patient information, gene expression, DNA methylation, DNA copy number variations, and SNPs/mutations, in a matrix layout or network layout.

Map2NCBI

The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).

> library(Map2NCBI)
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
  # choose [2], [n], and [1] to filter the build and feature information.
  # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
> str(GeneList)
'data.frame':	52157 obs. of  15 variables:
 $ tax_id       : chr  "9606" "9606" "9606" "9606" ...
 $ chromosome   : chr  "1" "1" "1" "1" ...
 $ chr_start    : num  11874 14362 17369 30366 34611 ...
 $ chr_stop     : num  14409 29370 17436 30503 36081 ...
 $ chr_orient   : chr  "+" "-" "-" "+" ...
 $ contig       : chr  "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
 $ ctg_start    : num  1874 4362 7369 20366 24611 ...
 $ ctg_stop     : num  4409 19370 7436 20503 26081 ...
 $ ctg_orient   : chr  "+" "-" "-" "+" ...
 $ feature_name : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
 $ feature_id   : chr  "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
 $ feature_type : chr  "GENE" "GENE" "GENE" "GENE" ...
 $ group_label  : chr  "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
 $ transcript   : chr  "Assembly" "Assembly" "Assembly" "Assembly" ...
 $ evidence_code: chr  "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]

TCseq: Time course sequencing data analysis

http://bioconductor.org/packages/devel/bioc/html/TCseq.html

UCSC Xena

RTCGA

https://www.bioconductor.org/packages/release/bioc/html/RTCGA.html

genefu

Computation of Gene Expression-Based Signatures in Breast Cancer

GEO

See the internal link at R-GEO.

GREIN: An interactive web platform for re-analyzing GEO RNA-seq data

GEO2RNAseq

GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

Network-based

Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma 2022

Proteomics

OlinkAnalyze

OlinkRPackage

Mass spectrometry (MS)-based proteomics

$ head -5 data_phosphoprotein_quantification.txt | cut -f 1-5
ENTITY_STABLE_ID	GENE_SYMBOL	PHOSPHOSITE	01CO005	01CO006
AAAS_pS495	AAAS	pS495	NA	-0.365
AAAS_pS525	AAAS	pS525	NA	NA
AAAS_pS541	AAAS	pS541	-0.24	NA
AAED1_pS12	AAED1	pS12	-0.46	-0.424

# R
> summary(x[, 4], na.rm = T)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -5.776  -1.095  -0.608  -0.690  -0.213   2.383   20378 
> summary(as.vector(as.matrix(x[, 4:5])), na.rm = T)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -5.78   -0.81   -0.36   -0.43    0.00    3.13   36304 

Cbioportal cptac.png

Metabolomics Analysis

Guide to Metabolomics Analysis: A Bioinformatics Workflow

Protein-protein interaction/PPI

Drug-Drug Interactions

Understanding Drug-Drug Interactions Using R Shiny

Journals

Biometrical Journal

Biostatistics

Bioinformatics

Genome Analysis section

BMC Bioinformatics

BioRxiv

PLOS

MDPI

https://en.wikipedia.org/wiki/MDPI, All MDPI, Frontiers & Hindawi journals planned to be erased (level 0) from Finnish academic assessment by end of 2024.

Software

BRB-SeqTools

https://brb.nci.nih.gov/seqtools/

WebMeV

GeneSpring

RNA-Seq

CCBR Exome Pipeliner

https://ccbr.github.io/Pipeliner/

Tibanna

Tibanna helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.

MOFA: Multi-Omics Factor Analysis

WGCNA

Benchmarking

Essential guidelines for computational method benchmarking

Simulation

Simulate RNA-Seq

Maq

Used by TopHat: discovering splice junctions with RNA-Seq

BEERS/Grant G.R. 2011

http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called BEERS and it was used in the STAR software paper.

For the command line options of <reads_simulator.pl> and more details about the config files that are needed/prepared by BEERS, see this gist.

This can generate paired end data but they are in one FASTA file.
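Since both mates end up in one FASTA file, they usually have to be separated before alignment. A minimal base R sketch, assuming the header convention visible in the BEERS output (>seq.1a for the first mate, >seq.1b for the second); split_beers_fasta is a hypothetical helper, not part of BEERS, and the demo input is a tiny made-up file.

```r
# Split an interleaved FASTA into two files by the trailing a/b in the header.
split_beers_fasta <- function(fa, out1, out2) {
  lines <- readLines(fa)
  hdr <- grep("^>", lines)                    # header line numbers
  end <- c(hdr[-1] - 1L, length(lines))       # last line of each record
  is_a <- grepl("a$", lines[hdr])             # TRUE for first mates
  pick <- function(keep) {
    as.character(unlist(Map(function(s, e) lines[s:e], hdr[keep], end[keep])))
  }
  writeLines(pick(is_a), out1)
  writeLines(pick(!is_a), out2)
  invisible(c(sum(is_a), sum(!is_a)))         # number of reads per file
}

# Demo on a made-up two-pair file.
fa <- tempfile(fileext = ".fa")
writeLines(c(">seq.1a", "ACGT", ">seq.1b", "TTGA",
             ">seq.2a", "GGCC", ">seq.2b", "CATG"), fa)
r1 <- tempfile(fileext = ".fa")
r2 <- tempfile(fileext = ".fa")
n <- split_beers_fasta(fa, r1, r2)
readLines(r1)
readLines(r2)
```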

$ sudo apt-get install cpanminus
$ sudo cpanm Math::Random
$ wget http://cbil.upenn.edu/BEERS/beers.tar
$
$ tar -xvf beers.tar      # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
$
$ cd ~/Downloads/
$ mkdir beers_output  
$ mkdir beers_simulator_refseq && cd "$_"
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
$ tar xzvf simulator_config_refseq.tar.gz
$ ls -lth 
total 1.4G
-rw-r--r-- 1 brb brb  44M Sep 16  2010 simulator_config_featurequantifications_refseq
-rw-r--r-- 1 brb brb 7.7M Sep 15  2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15  2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15  2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
   -configstem refseq \
   -customcfgdir ~/Downloads/beers_simulator_refseq \
   -outdir ~/Downloads/beers_output

$ ls -lh beers_output
total 3.9M
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
-rw-r--r-- 1 brb brb  31K Mar 16 15:25 simulated_reads_testbeers.cig
-rw-r--r-- 1 brb brb  22K Mar 16 15:25 simulated_reads_testbeers.fa
-rw-r--r-- 1 brb brb  584 Mar 16 15:25 simulated_reads_testbeers.log

$ wc -l simulated_reads2genes_testbeers.txt
102 simulated_reads2genes_testbeers.txt
$ head -4 simulated_reads2genes_testbeers.txt
seq.1	GENE.5600
seq.2	GENE.35506
seq.3	GENE.506
seq.4	GENE.34922
$ tail -4 simulated_reads2genes_testbeers.txt
seq.97	GENE.4197
seq.98	GENE.8763
seq.99	GENE.19573
seq.100	GENE.18830
$ wc -l simulated_reads_indels_testbeers.txt
36131 simulated_reads_indels_testbeers.txt
$ head -2 simulated_reads_indels_testbeers.txt
chr1:6052304-6052531	25	1	G
chr2:73899436-73899622	141	3	ATA
$ tail -2 simulated_reads_indels_testbeers.txt
chr4:68619532-68621804	1298	-2	AA
chr21:32554738-32554962	174	1	T
$ wc -l simulated_reads_substitutions_testbeers.txt 
71678  simulated_reads_substitutions_testbeers.txt
$ head -2 simulated_reads_substitutions_testbeers.txt 
chr22:50902963-50903167	50903077	G->A
chr1:6052304-6052531	6052330	G->C
$ wc -l simulated_reads_junctions-crossed_testbeers.txt 
49   simulated_reads_junctions-crossed_testbeers.txt
$ head -2 simulated_reads_junctions-crossed_testbeers.txt 
seq.1a	chrX:49084601-49084713
seq.1b	chrX:49084909-49086682

$ cat beers_output/simulated_reads_testbeers.log
Simulator run: 'testbeers'
started: Thu Mar 16 15:25:39 EDT 2017
num reads: 100
readlength: 100
substitution frequency: 0.001
indel frequency: 0.0005
base error: 0.005
low quality tail length: 10
percent of tails that are low quality: 0
quality of low qulaity tails: 0.8
percent of alt splice forms: 0.2
number of alt splice forms per gene: 2
stem: refseq
sum of gene counts: 3,886,863,063
sum of intron counts = 1,304,815,198
sum of intron counts = 2,365,472,596
intron frequency: 0.355507598105262
padded intron frequency: 0.52453796437909
finished at Thu Mar 16 15:25:58 EDT 2017

$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ head simulated_reads_testbeers.fa
>seq.1a
CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
>seq.1b
GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
>seq.2a
GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
>seq.2b
ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
>seq.3a
CCCCAGAGGAGCGCCACCTGTCCAAGATGCAGCAGAACGGCTACGAAAATCCAACCTACAAGTTCTTTGAGCAGATGCAGAACTAGACCCCCGCCACAGC

# Take a look at the true coordinates
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
chrX	49084529	49084601	+
chrX	49084713	49084739	+
chrX	49084863	49084909	+
chrX	49086682	49086734	+
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
seq.1a	chrX	49084529	73M111N27M	49084529-49084601, 49084713-49084739	+	CGAAGAAGGACCCAAAGATGACAAGGCTCACAAAGTACACCCAGGGCAGTTCATACCCCATGGCATCTTGCATCCAGTAGAGCACATCGGTCCAGCCTTC
seq.1b	chrX	49084863	47M1772N53M	49084863-49084909, 49086682-49086734	-	GCTCGAGCTGTTCCTTGGACGAATGCACAAGACGTGCTACTTCCTGGGATCCGACATGGAAGCGGAGGAGGACCCATCGCCCTGTGCATCTTCGGGATCA
seq.2a	chr1	183516256	100M	183516256-183516355	-	GCCCCAGCAGAGCCGGGTAAAGATCAGGAGGGTTAGAAAAAATCAGCGCTTCCTCTTCCTCCAAGGCAGCCAGACTCTTTAACAGGTCCGGAGGAAGCAG
seq.2b	chr1	183515275	100M	183515275-183515374	+	ATGAAGCCTTTTCCCATGGAGCCATATAACCATAATCCCTCAGAAGTCAAGGTCCCAGAATTCTACTGGGATTCTTCCTACAGCATGGCTGATAACAGAT
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ wc -l simulated_reads_testbeers.bed
247 simulated_reads_testbeers.bed
$ wc -l simulated_reads_testbeers.cig
200 simulated_reads_testbeers.cig
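The relationship between the CIGAR string and the span list in the .cig file can be checked with a few lines of base R. A sketch, assuming one-based inclusive coordinates (as noted for the .bed file) and handling only M (match) and N/D (reference skip) operations; cigar_to_spans is a hypothetical helper, not part of BEERS.

```r
# Convert a start position and a CIGAR string into reference spans.
cigar_to_spans <- function(start, cigar) {
  ops <- regmatches(cigar, gregexpr("[0-9]+[A-Z]", cigar))[[1]]
  len <- as.integer(sub("[A-Z]$", "", ops))   # operation lengths
  op  <- sub("^[0-9]+", "", ops)              # operation letters
  pos <- as.integer(start)
  spans <- character(0)
  for (i in seq_along(ops)) {
    if (op[i] == "M") {
      # aligned block: covers pos .. pos + len - 1
      spans <- c(spans, paste0(pos, "-", pos + len[i] - 1L))
      pos <- pos + len[i]
    } else if (op[i] %in% c("N", "D")) {
      # skipped on the reference (intron or deletion)
      pos <- pos + len[i]
    }
    # other operations (e.g. I, S) do not advance the reference position
  }
  spans
}

cigar_to_spans(49084529L, "73M111N27M")    # seq.1a
cigar_to_spans(49084863L, "47M1772N53M")   # seq.1b
```

For seq.1a this gives 49084529-49084601 and 49084713-49084739, matching the fifth column of the .cig file and the first two lines of the .bed file.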

Flux Sammeth 2010

SimNGS

SimSeq

Bioinformatics

A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.

empiricalFDR.DESeq2

http://biorxiv.org/content/early/2014/12/05/012211

The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.

Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.

polyester

http://biorxiv.org/content/early/2014/12/05/012211

Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.

Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.

Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.

Too many dependencies; I got an error during installation. It also does not seem to account for splice junctions.

seqgendiff

Data-based RNA-seq simulations by binomial thinning

Simulate DNA-Seq

wgsim

https://github.com/lh3/wgsim

dwgsim

NEAT

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

http://genomespot.blogspot.com/2014/11/dna-aligner-accuracy-bwa-bowtie-soap.html

$ head simDNA_100bp_16del.fasta
>Pt-0-100
TGGCGAACGCGGGAATTGACCGCGATGGTGATTCACATCACTCCTAATCCACTTGCTAATCGCCCTACGCTACTATCATTCTTT
>Pt-10-110
GCGGGATTGAACCCGATTGAATTCCAATCACTGCTTAATCCACTTGCTACATCGCCCTACGTACTATCTATTTTTTTGTATTTC
>Pt-20-120
GAACCCGCGATGAATTCAATCCACTGCTACCATTGGCTACATCCGCCCCTACGCTACTCTTCTTTTTTGTATGTCTAAAAAAAA
>Pt-30-130
TGGTGAATCACAATCACTGCCTAACCATTGGCTACATCCGCCCCTACGCTACACTATTTTTTGTATTGCTAAAAAAAAAAATAA
>Pt-40-140
ACAACACTGCCTAATCCACTTGGCTACTCCGCCCCTAGCTACTATCTTTTTTTGTATTTCTAAAAAAAAAAAATCAATTTCAAT

Simulate Whole genome

Simulate whole exome

SCSIM

SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data

Variant simulator

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

Mutation-Simulator

Mutation-Simulator

SigProfilerSimulator

Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator

Convert FASTA to FASTQ

It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.

If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality character 'I' encodes a Phred score of 40 in the Sanger (Phred+33) encoding, the top of the [0,40] range considered here. See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia page also mentions FASTQ read simulation tools and a comparison of these tools.

$ cat test.fasta
>Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
>Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
>Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
>Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
>Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Alternatively, an awk one-liner does the conversion (it assumes each sequence sits on a single line):

$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
   test.fasta > test.fq
$ cat test.fq
@Pt-0-50
TGGCGAACGACGGGAATACCCGGAGGTGAATTCAAATCCACT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-10-60
GACGGAATTGAACCCGATGGGATACAATCCACTGCCTTATCC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-20-70
GAACCCGCGATGGTGTCACAATCCACTCTTAACCATTGCTAC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-30-80
GGTGAATTCACAATCCACTGCCTTACCACTTGGCTACCCCCT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@Pt-40-90
AATCCACTGCCTTATCCACTGGCTACATCCCTACGCTACTAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

Change the 'H' to the quality character that you need (depending on which Phred scale you are using).
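
To pick the right character, it helps to convert between Phred scores and quality characters. A small sketch, assuming the Sanger (Phred+33) encoding; the function names are made up for illustration.

```shell
# Phred+33 (Sanger) encoding: quality character = ASCII code (Q + 33).
phred_to_char() {
  awk -v q="$1" 'BEGIN { printf "%c\n", q + 33 }'
}

# Reverse direction: ASCII code of the character minus 33.
char_to_phred() {
  echo $(( $(printf '%d' "'$1") - 33 ))
}

phred_to_char 40   # I
phred_to_char 39   # H
char_to_phred I    # 40
```

So the 'H' used in the awk one-liner corresponds to Phred 39 under this encoding, one below the 'I' written by fasta_to_fastq.pl.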

Simulate genetic data

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

PDX/Xenograft

#!/bin/bash
module load gossamer
xenome index -M 24 -T 16 -P idx \
  -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
  -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

RNA-Seq

DNA-Seq

MAF (TCGA, GDC)

DNA Seq Data

NIH

  • Go to SRA/Sequence Read Archive and type the keywords 'Whole Genome Sequencing human'. An example of the procedure to search for whole genome sequencing data from human samples:
    1. Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
    2. The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
    3. The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
    4. I pick the first one Accession:SRP066837 from this page. The page shows the Study type is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
    5. (Important trick) Click the number next to Run. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
    6. Download the raw data from any one of them (eg SRR2968056). For whole genome, the Strategy is WGS. For whole exome, the Strategy is called WXS.
  • Search the keywords 'nonsynonymous' and 'human' in PMC

Use SRAToolKit instead of wget to download

Don't use the wget command, since it requires specifying the exact address of each file.

Downloading SRA data using command line utilities

SRA2R - a package to import SRA data directly into R.

(Method 1) Use the fastq-dump command. For example, the following command (modified from the document) will download the first 5 reads and save them to a file called <SRR390728.fastq> (NOT sra format) in the current directory.

/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
# OR 
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar

This will download the files in FASTQ format.

(Method 2) If we need to download by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra

It will download the file in SRA format. In the case of SRR590795, the .sra file is 240 MB and the two FASTQ files are 615 MB each.
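
The path in the wget example follows a fixed pattern built from the run accession. A sketch of that pattern, assuming the directory layout shown in the URL above (first 3 characters, then first 6 characters, then the full accession); whether the server still serves this layout is not checked here, and sra_ftp_url is a made-up helper name.

```shell
# Build the FTP path for a run accession using the layout shown above:
#   .../ByRun/sra/<first 3 chars>/<first 6 chars>/<accession>/<accession>.sra
sra_ftp_url() {
  acc="$1"
  p3=$(printf '%s' "$acc" | cut -c1-3)
  p6=$(printf '%s' "$acc" | cut -c1-6)
  printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$p3" "$p6" "$acc" "$acc"
}

sra_ftp_url SRR304976
sra_ftp_url SRR590795
```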

(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list

brb@T3600 ~/Downloads $ tar xzvf aspera-connect-3.6.2.117442-linux-64.tar.gz
aspera-connect-3.6.2.117442-linux-64.sh
brb@T3600 ~/Downloads $ ./aspera-connect-3.6.2.117442-linux-64.sh

Installing Aspera Connect

Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
Restart firefox manually to load the Aspera Connect plug-in

Install complete.

brb@T3600 ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
  -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
  [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
SRR590795.sra                                                                           100%  239MB  535Mb/s    00:06
Completed: 245535K bytes transferred in 7 seconds
 (272848K bits/sec), in 1 file.
brb@T3600 ~/Downloads $

Aspera is typically 10 times faster than FTP according to the website. For this case, wget takes 12s while ascp uses 7s.

Note that the URL on the website is wrong. I got the correct URL by emailing NCBI help. Google: ascp "[email protected]"

SRAdb package

https://bioconductor.org/packages/release/bioc/html/SRAdb.html

First we install the system packages required by XML and RCurl.

sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev

and then

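# biocLite() has been replaced by BiocManager in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("SRAdb")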

SRA

The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud

Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers

SRA Explorer

SRP056969

SRP066363 - lung cancer

SRP015769 or SRP062882 - prostate cancer

SRP053134 - breast cancer

Look at the MBases value column. It determines the coverage for each run.
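
A rough way to turn MBases into depth of coverage is to divide the sequenced bases by the genome size. A sketch under the assumption of a 3 billion bp human genome; the run size below is hypothetical, and for whole exome data the second argument should be the target region size rather than the genome size.

```shell
# Approximate depth of coverage = sequenced bases / genome (or target) size.
# $1: MBases value of a run; $2: assumed size in bp (default 3e9, human genome).
coverage_from_mbases() {
  awk -v mbases="$1" -v genome_bp="${2:-3000000000}" \
    'BEGIN { printf "%.1f\n", mbases * 1e6 / genome_bp }'
}

coverage_from_mbases 90000    # hypothetical run with 90,000 MBases: 30.0
```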

SRP050992 single cell RNA-Seq

Used in Design and computational analysis of single-cell RNA-sequencing experiments

Single cell RNA-Seq

SRP040626 or SRP040540 - Colon and rectal cancer

OmicIDX

OmicIDX on BigQuery

Tutorials

See the BWA section.

Whole Exome Seq

Whole Genome Seq

SraRunTable.txt

  1. http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
  2. http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click SRP004077
  3. http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click Runs from the RHS
  4. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click RunInfoTable

Notes (for this study, the table has 2377 rows):

  • Column A (AssemblyName_s) eg GRCh37
  • Column I (library_name_s) eg
  • Column N (header=Run_s) shows all SRR or ERR accession numbers.
  • Column P (Sample_Name)
  • Column Y (header=Assay_Type_s) shows WGS.
  • Column AB (LibraryLayout_s): PAIRED
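
Column letters shift between studies, so it is safer to select columns by header name. A sketch, assuming a tab-delimited table that uses the header names listed above (Run_s, Assay_Type_s, LibraryLayout_s); the rows below are made up and are not real accessions.

```shell
# Print run accession and library layout for all WGS rows, looking columns up
# by header name instead of by position.
select_wgs_runs() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $col["Assay_Type_s"] == "WGS" { print $col["Run_s"] "\t" $col["LibraryLayout_s"] }
  ' "$1"
}

# Tiny made-up table for illustration only.
tmp=$(mktemp)
printf 'AssemblyName_s\tRun_s\tAssay_Type_s\tLibraryLayout_s\n' >  "$tmp"
printf 'GRCh37\tSRR0000001\tWGS\tPAIRED\n'                      >> "$tmp"
printf 'GRCh37\tSRR0000002\tWXS\tPAIRED\n'                      >> "$tmp"
select_wgs_runs "$tmp"
rm -f "$tmp"
```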

Public Data

Ten Resources for easy access public genomic data 6/7/2023. UCSCXenaTools (TCGA, ICGC, GDC), PharmacoGx, rDGidb, OmicsDI, AnnotationHub, TCGAbiolinks, GenomicDataCommons, cbioportal.

Public data resources and Bioconductor from Bioc2020.

Package name Object class Downloads (Distinct IPs, Jul 2020)
GEOquery SummarizedExperiment 5754
GenomicDataCommons GDCQuery 1154
TCGAbiolinks RangedSummarizedExperiment 2752
curatedTCGAData MultiAssayExperimentObjects 275
recount RangedSummarizedExperiment 418
curatedMetagenomicData ExperimentHub 224

ISB Cancer Genomics Cloud (ISB-CGC)

https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis

The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.

ISB-CGC Web Application

CCLE, DepMap

  • Next-generation characterization of the Cancer Cell Line Encyclopedia 2019
  • It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).
  • Assembling Clinical Information for the CCLE Data
  • Data download Depmap
  • The Dependency Map (DepMap) portal aims to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependencies and to analytical and visualization tools. CCLE
    • sample_info.csv also available from the download page
    • CCLE_RNAseq_reads.csv: read counts from RSEM. 1406 x (54358 - 1). Use readr::read_csv(). range(x[, -1]) = 0 13018000. Note: log2(13018000) = 23.634.
    • CCLE_expression_full.csv: log2(TPM + 1). 1406 x (53971 - 1). range(x[, -1]) = 0.00000 17.78354
    • CCLE_expression.csv: log2(TPM+1). 1406 x 19221 genes. protein coding genes. 33 diseases.
    • CCLE_expression_proteincoding_genes_expected_count.csv: 1406 x (19222 - 1). read count (non-integers) data from RSEM for just protein coding genes. range(x[, -1]) = 0 13018000.
    • CCLE_expression_transcripts_expected_count.csv: read count data from RSEM. 1406 x (228138-1). Non-integers. range(x[, -1]) = 0 11664000.
  • depmap package: Cancer Dependency Map Data Package. The depmap package currently contains eight kinds of datasets available through ExperimentHub.
    • RNA interference (RNAi) knockout data
    • CRISPR-Cas9 knockout data
    • WES copy number data
    • CCLE Reverse Phase Protein Array data
    • CCLE RNAseq gene expression data
    • Cancer cell lines
    • Mutation calls
    • Drug Sensitivity
    R> eh <- ExperimentHub()
    R> class(eh)
    [1] "ExperimentHub"
    attr(,"package")
    [1] "ExperimentHub"
    
    R> rnai <- eh[["EH2260"]]
    R> class(rnai)
    [1] "tbl_df"     "tbl"        "data.frame"
    

NCI's Genomic Data Commons (GDC)/TCGA

The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).

GenomicDataCommons package

NCI60

Molecular Characterization of the NCI-60. NCI-ADR-RES and OVCAR-8 are derived from one another, and SNB-19 and U251 are derived from the same patient.

Case studies

Expression of the SARS-CoV-2 cell receptor gene ACE2 in a wide variety of human tissues

NCI Proteomic Data Commons

https://pdc.cancer.gov/pdc/ vs https://gdc.cancer.gov/

GTEx

NIH LINCS

Sharing data

Gene set analysis

Hypergeometric test

Next-generation sequencing data

Forums

Batch effect

Batch effect

Misc

Advice

High Performance

Cloud Computing

Merge different datasets (different genechips)

Genomic data vs transcriptomic data

  • The main difference between genomic data and transcriptomic data is that genomic data provides information on the complete DNA sequence of an organism, while transcriptomic data provides information on the expression levels of genes.
  • Genomic data:
    • Genomic data refers to the complete DNA sequence of an organism, which includes all of its genes, regulatory regions, and non-coding regions. This type of data provides information on the genetic makeup of an organism, including its potential to develop certain diseases, its evolutionary history, and its overall genetic diversity.
    • Examples of genomic data: Whole genome sequencing (WGS), Genome-wide association studies (GWAS), Copy number variation (CNV) analysis, Comparative genomics, Metagenomics.
  • Transcriptomic data
    • Transcriptomic data, on the other hand, refers to the collection of all RNA transcripts produced by the genes of an organism. RNA transcripts are produced when genes are transcribed into RNA molecules, which are then used as templates to synthesize proteins. Transcriptomic data provides information on the expression levels of genes, which can help researchers understand how genes are regulated and how they contribute to biological processes.
    • Examples of transcriptomic data: RNA-Seq, Microarray, scRNA-Seq, qPCR, Ribosome profiling

Low read count and filtering

  • DESeq2 pre-filtering:
    • While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. DESeq2 vignette.
    • One can also omit this step entirely and just rely on the independent filtering procedures available in results(), either IHW or genefilter::filtered_p().
smallestGroupSize <- 3 # smallest group size
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep, ]
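
The same pre-filtering rule can be applied to a plain count table before it is loaded into R. A sketch, assuming a tab-delimited file with a header row, gene IDs in column 1 and raw counts in the remaining columns; the file contents and the function name prefilter_counts are made up for illustration.

```shell
# Keep genes (rows) with a count >= min_count in at least min_samples samples.
# $1: count table; $2: min_count (default 10); $3: min_samples (default 3).
prefilter_counts() {
  awk -F '\t' -v min_count="${2:-10}" -v min_samples="${3:-3}" '
    NR == 1 { print; next }                     # pass the header through
    {
      n = 0
      for (i = 2; i <= NF; i++) if ($i + 0 >= min_count) n++
      if (n >= min_samples) print
    }
  ' "$1"
}

tmp=$(mktemp)
printf 'gene\ts1\ts2\ts3\ts4\n'  >  "$tmp"
printf 'geneA\t12\t15\t30\t0\n'  >> "$tmp"   # 3 samples >= 10: kept
printf 'geneB\t0\t3\t50\t2\n'    >> "$tmp"   # 1 sample  >= 10: dropped
prefilter_counts "$tmp" 10 3
rm -f "$tmp"
```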

Independent Filtering

edgeR::filterByExpr

Normalization

log2 transformation

No matter whether we use TPM, TMM, FPKM, or DESeq2 normalized counts, we usually still need to take a log2(x+1) transformation before downstream analyses such as clustering or visualization.
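
A quick check of what log2(x+1) does at the two ends of the scale. awk only provides the natural logarithm, so the sketch divides by log(2); the function name log2p1 is made up for illustration.

```shell
# log2(x + 1): the +1 keeps zero counts at zero instead of -infinity.
log2p1() {
  awk -v x="$1" 'BEGIN { printf "%.3f\n", log(x + 1) / log(2) }'
}

log2p1 0           # 0.000
log2p1 13018000    # 23.634, the largest read count in the CCLE table
```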

Quantile normalization

  • normalize.quantiles() from preprocessCore package. How to Perform Quantile Normalization in R
    • for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
    • I ran into an error when using the function in an RStudio Docker container, but the solution here (BiocManager::install("preprocessCore", configure.args="--disable-threading")) works.
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("preprocessCore")
    #load package
    library(preprocessCore)
     
    #the function expects a matrix
    #create a matrix using the same example
    mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
                 ncol=3)
    mat
    #     [,1] [,2] [,3]
    #[1,]    5    4    3
    #[2,]    2    1    4
    #[3,]    3    4    6
    #[4,]    4    2    8
     
    #quantile normalisation
    normalize.quantiles(mat)
    #         [,1]     [,2]     [,3]
    #[1,] 5.666667 5.166667 2.000000
    #[2,] 2.000000 2.000000 3.000000
    #[3,] 3.000000 5.166667 4.666667
    #[4,] 4.666667 3.000000 5.666667

Distribution, density plot

Density plot showing the distribution of RNA-seq read counts (FPKM) log10(FPKM)

Negative binomial distribution

RNA-seq and Negative binomial distribution

Z-score transformation

Ensembl to gene symbol

How to use UCSC Table Browser

File:Tablebrowser.png File:Tablebrowser2.png

Note

  1. By default the UCSC browser displays the output in the browser window. Users need to save it from the browser under a file name of their choosing.
  2. the output does not have a header
  3. The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1

If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.

$ wc -l hg38Tables.bed 
75893 hg38Tables.bed
$ head -2 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1	201283451	201332993	NM_000299	0	+	201283702	201328836	0	15	453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920,	0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed 
chr22_KI270734v1_random	131493	137393	NM_005675	0	+	131645	136994	0	5	262,161,101,141,549,	0,342,3949,4665,5351,
chr22_KI270734v1_random	138078	161852	NM_016335	0	-	138479	161586	0	15	589,89,99,176,147,93,82,80,117,65,150,35,209,313,164,	0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,

$ wc -l hg38CodingExon.bed 
577387 hg38CodingExon.bed
$ head -2 hg38CodingExon.bed 
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
$ tail -2 hg38CodingExon.bed 
chr22_KI270734v1_random	156288	156497	NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r	0	-
chr22_KI270734v1_random	161313	161586	NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r	0	-

# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
$ grep NM_001276352 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
$ grep NM_001276352 hg38CodingExon.bed
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
chr1	67103237	67103382	NM_001276352_cds_2_0_chr1_67103238_r	0	-
chr1	67111576	67111644	NM_001276352_cds_3_0_chr1_67111577_r	0	-
chr1	67115351	67115464	NM_001276352_cds_4_0_chr1_67115352_r	0	-
chr1	67125751	67125909	NM_001276352_cds_5_0_chr1_67125752_r	0	-
chr1	67127165	67127240	NM_001276352_cds_6_0_chr1_67127166_r	0	-
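
The two files are related through the last two columns of the 12-column BED record: blockSizes (column 11) and blockStarts (column 12, relative to chromStart). A sketch that expands one record into its exons; bed12_to_exons is a made-up helper name, and the input line is the NM_001276352 record shown above.

```shell
# Expand BED12 records into their blocks (exons).
# Block i starts at chromStart + blockStarts[i] and ends blockSizes[i] later
# (zero-based, half-open, like the BED file itself).
bed12_to_exons() {
  awk -F '\t' '{
    n = split($11, size, ",")
    split($12, rel, ",")
    for (i = 1; i <= n; i++) {
      if (size[i] == "") continue     # the trailing comma yields an empty field
      s = $2 + rel[i]
      print $1 "\t" s "\t" (s + size[i]) "\t" $4 "\t" $6
    }
  }'
}

printf 'chr1\t67092175\t67134971\tNM_001276352\t0\t-\t67093579\t67127240\t0\t9\t1429,70,145,68,113,158,92,86,42,\t0,4076,11062,19401,23176,33576,34990,38966,42754,\n' |
  bed12_to_exons
```

The second to sixth blocks match the cds_1 to cds_5 rows of hg38CodingExon.bed exactly; the first and seventh blocks are longer than cds_0 and cds_6 because the coding-exon file trims them at thickStart (67093579) and thickEnd (67127240).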

This can be compared to refGene(?) directly downloaded via http

$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’

hg38.refGene.txt.gz                100%[===============================================================>]   7.11M   901KB/s    in 10s

2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]

$ zcat hg38.refGene.txt.gz | wc -l
75893
$ zcat hg38.refGene.txt.gz | head -2
1072	NM_003288	chr20	+	63865227	63891545	63865365	63889945	7	63865227,63869295,63873667,63875815,63882718,63889189,63889849,	63865384,63869441,63873816,63875875,63882820,63889238,63891545,	0	TPD52L2	cmpl	cmpl	0,1,0,2,2,2,0,
1815	NR_110164	chr2	+	161244738	161249050	161249050	161249050	2	161244738,161246874,	161244895,161249050,	0	LINC01806	unk	unk	-1,-1,

$ zcat hg38.refGene.txt.gz | tail -2
1006	NM_130467	chrX	+	55220345	55224108	55220599	55224003	5	55220345,55221374,55221766,55222620,55223986,	55220651,55221463,55221875,55222746,55224108,	0	PAGE5	cmpl	cmpl	0,1,0,1,1,
637	NM_001364814	chrY	-	6865917	6874027	6866072	6872608	7	6865917,6868036,6868731,6868867,6870005,6872554,6873971,	6866078,6868462,6868776,6868909,6870053,6872620,6874027,	0	AMELY	cmpl	cmpl	0,0,0,0,0,0,-1,

Where to download reference genome

Which human reference genome to use?

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)

UHR, HBR

In RNA sequencing (RNA-seq), Universal Human Reference (UHR) and Human Brain Reference (HBR) are two commercially available RNA samples that are often used as controls to assess the performance and accuracy of RNA-seq assays. See this (github) and Lesson 13: Aligning raw sequences to reference genome.

GENCODE transcript database

RefSeq categories

See Table 1 of Chapter 18, The Reference Sequence (RefSeq) Database.

Category Description
NC Complete genomic molecules
NG Incomplete genomic region
NM mRNA
NR ncRNA
NP Protein
XM predicted mRNA model
XR predicted ncRNA model
XP predicted Protein model (eukaryotic sequences)
WP predicted Protein model (prokaryotic sequences)
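
The table can be used as a simple prefix lookup on accession numbers. A sketch that covers only the prefixes listed above; the function name refseq_category is made up for illustration.

```shell
# Map a RefSeq accession to its category using the two-letter prefix.
refseq_category() {
  case "$1" in
    NC_*) echo "Complete genomic molecule" ;;
    NG_*) echo "Incomplete genomic region" ;;
    NM_*) echo "mRNA" ;;
    NR_*) echo "ncRNA" ;;
    NP_*) echo "Protein" ;;
    XM_*) echo "Predicted mRNA model" ;;
    XR_*) echo "Predicted ncRNA model" ;;
    XP_*) echo "Predicted protein model (eukaryotic sequences)" ;;
    WP_*) echo "Predicted protein model (prokaryotic sequences)" ;;
    *)    echo "Unknown prefix" ;;
  esac
}

refseq_category NM_001276352   # mRNA
refseq_category NR_110164      # ncRNA
```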

UCSC version & NCBI release corresponding

Gene Annotation

library(rtracklayer)
library(dplyr) # provides %>% and filter() used below
genes <- readGFF("gencode.v27.annotation.gff3.gz")
genes[1:2, 1:5]
# DataFrame with 2 rows and 5 columns
#      seqid   source       type     start       end
#   <factor> <factor>   <factor> <integer> <integer>
# 1     chr1   HAVANA gene           11869     14409
# 2     chr1   HAVANA transcript     11869     14409
genes[1:100, ] %>% filter(type == "gene") %>% dim()
# Error in UseMethod("filter") :
#   no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'RectangularData', 'SimpleList', 'DataFrame_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"

library(ape)
genes2 <- read.gff("gencode.v27.annotation.gff3.gz")
genes2[1:2, 1:5]
#   seqid source       type start   end
# 1  chr1 HAVANA       gene 11869 14409
# 2  chr1 HAVANA transcript 11869 14409
genes2[1:100,]  %>% filter(type == "gene") %>% dim()
# [1] 11  9

Genecards

  • https://www.genecards.org/
  • Q: What are genes with gene symbols starting with LINC?; eg LINC00491
    A: Genes with gene symbols starting with "LINC" are long intergenic non-coding RNA (lncRNA) genes. lncRNAs are RNA molecules that are transcribed from the genome but do not encode proteins. Unlike protein-coding genes, lncRNAs do not have a well-defined coding sequence, but they do play important regulatory roles in cellular processes such as gene expression, chromatin structure, and genome stability. Some lncRNAs are specifically expressed in cancer cells and have been implicated in tumor development and progression, making them of interest for cancer research.

How many DNA strands are there in humans?

How many base pairs in human

  • 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
  • chromosome 21 has the smallest number of bps (~47 million); chromosome 22 is slightly larger (~51 million).
  • chromosome 1 has the largest number of bps (~249 million base pairs).
  • Illumina iGenome Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa file is 3.0GB (as are the other human genome.fa files), i.e. roughly one byte per base.

Gene, Transcript, Coding/Non-coding exon

SNP

Types of SNPs and number of SNPs in each chromosomes

NGS technology

DNA methylation, Epigenetics

  • Relation of methylation genes and gene expression
    • Methylation of genes can have different effects on gene expression, depending on where the methylation occurs in the gene and the specific context of the gene and the cellular environment. Generally, methylation of promoter regions of genes is associated with reduced gene expression, whereas methylation of gene body regions is less clearly associated with gene expression changes.
    • When DNA is methylated at the promoter region of a gene, it can prevent the binding of transcription factors and RNA polymerase, which are necessary for transcription initiation. Methylation at the promoter region can also recruit proteins that block transcription or promote histone modifications that lead to chromatin compaction, further limiting access to the gene for transcriptional machinery.
    • However, methylation in other regions of the gene, such as the gene body, can have more complex effects on gene expression. In some cases, gene body methylation can be associated with increased expression, while in other cases it may have no effect or even lead to decreased expression. It is thought that gene body methylation may be involved in regulating alternative splicing or RNA stability, among other possible mechanisms.
    • Therefore, the effect of methylation on gene expression is not always straightforward and depends on various factors, including the specific gene, the location of methylation, and the cellular context.
devtools::install_github("genomicsclass/coloncancermeth") # current "user/repo" form
library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
dim(meth)
dim(pd)
length(gr)
colnames(pd)
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")

i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
  lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
  lines(density(meth[,i],from=0,to=1),col=2)
}

# finding regions of the genome that are different between cancer and normal samples
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- ebayes(fit)

# plot of the region surrounding the top hit
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)

plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453

# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")

library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
#consider two example regions#
...

Integrate DNA methylation and gene expression

Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing

Sequence + Expression

Integrate RNA-Seq and DNA-Seq

Immunohistochemistry/IHC

https://en.wikipedia.org/wiki/Immunohistochemistry. Protein expression by IHC.

Deconvolve bulk tumor tissue

Performance of computational algorithms to deconvolve heterogeneous bulk tumor tissue depends on experimental factors. Twitter.

Tumor purity

  • Tumor Purity in Preclinical Mouse Tumor Models 2022
  • Systematic Assessment of Tumor Purity and Its Clinical Implications Haider 2020.
    • With the exception of naive miRNA profiles, purity estimates were inversely correlated with molecular profiles regardless of the underlying purity estimation profile.
    • These data suggest that the presence of genomic and transcriptomic correlates of tumor purity are likely to confound biologic and clinical interpretations.
  • Estimators:
    • DNA: ABSOLUTE, ASCAT, CLONET, INTEGER, OncoSNP
    • RNA: DeMix, ISOpure-R (matlab/R), ESTIMATE (Yoshihara, R)
    • miRNA/microRNA: ISOpure-I
  • TCGA purity estimate by Aran 2015 Systematic pan-cancer analysis of tumour purity - Supplementary Data 1 (xlsx file with columns: Sample ID,Cancer type,ESTIMATE,ABSOLUTE,LUMP,IHC & CPE).
  • Tumor.purity: TCGA samples with their Tumor Purity measures (a data frame with 9364 rows and 7 variables) from TCGAbiolinks package
  • Prediction of tumor purity from gene expression data using machine learning 2021.
    • We selected the CPE as the target variable, which is the median purity value after normalizing values from the other four purity estimates (ESTIMATE, ABSOLUTE, LUMP and IHC).
    • our data set consisted of 8405 tumor samples.
  • How is VAF related to tumor purity?
    • Variant allele fraction (VAF) is related to tumor purity because it reflects the proportion of sequenced alleles in a sample that carry a specific genetic variant. In the context of cancer, the VAF of a somatic variant can be used as a surrogate marker for tumor purity, as the fraction of reads carrying the variant depends on the proportion of cancer cells relative to normal cells.
    • The VAF of a specific genetic variant in a cancer sample can be calculated as the ratio of the number of reads supporting the variant to the total number of reads covering that locus. In a sample that is purely composed of cancer cells, the VAF of a clonal somatic variant should approach 0.5 if it is heterozygous in a diploid region, and 1 only if every copy of the locus carries it (for example after loss of heterozygosity). In a sample that is mixed with normal cells, the VAF will be lower, roughly in proportion to the fraction of cancer cells in the sample.
    • Therefore, by measuring the VAF of one or more genetic variants, it is possible to estimate the tumor purity, which is the proportion of cancer cells in the sample relative to normal cells. This information is important for a variety of downstream analyses, including variant calling, gene expression analysis, and the estimation of the mutational burden, as it can affect the interpretation of the results and the accuracy of the analysis.
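
The VAF calculation above is a simple ratio of read counts. A sketch with hypothetical read counts; the purity step is an assumption that goes beyond the text: it holds only for a clonal, heterozygous somatic variant in a copy-neutral diploid region, where purity is about twice the VAF.

```shell
# VAF = reads supporting the variant / total reads covering the locus.
vaf() {
  awk -v alt="$1" -v total="$2" 'BEGIN { printf "%.3f\n", alt / total }'
}

# Purity under the stated assumption (clonal, heterozygous, diploid locus):
# purity is roughly 2 * VAF, capped at 1.
purity_from_vaf() {
  awk -v v="$1" 'BEGIN { p = 2 * v; if (p > 1) p = 1; printf "%.2f\n", p }'
}

vaf 30 100             # hypothetical: 30 of 100 reads carry the variant, 0.300
purity_from_vaf 0.3    # 0.60
```

Dedicated estimators such as ABSOLUTE or ASCAT additionally model copy number and subclonality, which this sketch ignores.
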
  • How can gene expression be used to estimate tumor purity?
    • Gene expression analysis can be used to estimate tumor purity by comparing the expression levels of genes known to be specific to either normal or cancer cells. In a sample that is mixed with normal and cancer cells, the expression levels of these genes will reflect the proportion of normal and cancer cells present in the sample.
    • For example, genes that are characteristic of non-tumor cells, such as stromal or immune signature genes (the approach taken by ESTIMATE), can be used as a reference to estimate the proportion of normal cells in the sample. Similarly, genes that are highly expressed in cancer cells, such as oncogenes, can be used to estimate the proportion of cancer cells in the sample.
    • The relative expression levels of these genes can then be used to estimate the tumor purity, either by comparing the expression levels to a reference sample of known purity, or by using mathematical models to estimate the proportion of normal and cancer cells in the sample.
    • It is important to note that this method is not without limitations, as the expression levels of specific genes can be influenced by various factors, such as the presence of cell-to-cell heterogeneity, gene amplification, and epigenetic modifications, among others. Therefore, gene expression analysis should be used in combination with other methods, such as copy number analysis and variant allelic fraction analysis, to obtain a more accurate estimate of tumor purity.
  • Some papers
  • ISOpureR (Quon 2013): intensity or count data. Error term e_n: multinomial distribution.
    library(ISOpureR)
    
    # For reproducible results, set the random seed
    set.seed(123);
    
    # Run ISOpureR Step 1 - Cancer Profile Estimation
    system.time(ISOpureS1model <- ISOpure.step1.CPE(
      tumor.expression.data,  # intensity or count data
      normal.expression.data  # intensity or count data 
    ))
    ISOpureS1model$alphapurities  # tumor purity estimates
    
  • ESTIMATE (Yoshihara 2013): normalized data.
    • Since it is based on ssGSEA, only ranks are used, so it does not matter whether log-transformed or count data are used.
    • ssGSEA is based on two gene signatures: Stromal signature (141 genes) and immune signature (141 genes)
    • The formula for calculating ESTIMATE tumor purity was developed in TCGA Affymetrix data (n=1001) including both the ESTIMATE score and ABSOLUTE-based tumor purity.
    • An evolutionary algorithm was used for the mathematical model.
    • Nonlinear least squares method was used to determine the final model estimate.
    • Tumor purity = cos(0.6 + 0.000146 * ESTIMATE score)
    library(estimate)
    OvarianCancerExpr <- system.file("extdata", "sample_input.txt", package="estimate")
    filterCommonGenes(input.f=OvarianCancerExpr, output.f="OV_10412genes.gct", id="GeneSymbol")
    estimateScore("OV_10412genes.gct", "OV_estimate_score.gct", platform="affymetrix")
    
    plotPurity(scores="OV_estimate_score.gct", samples="s516", platform="affymetrix")
    
    scan("OV_estimate_score.gct", "", skip=6)[-c(1:2)] |> as.numeric() # tumor purity estimates
    
  • DeMixT (Wang 2018): count data. See Estimation of tumor cell total mRNA expression in 15 cancer types predicts disease progression (Cao 2022) for the profile likelihood method, and its supplementary information for more about the benchmarking between DeMixT_DE and DeMixT_GS.
    library(DeMixT)
    source("DeMixT_preprocessing.R")
    
    count.mat <- cbind(normal.expression.data, tumor.expression.data)
    colnames(count.mat) <- paste0("sample", 1:ncol(count.mat))
    
    label = factor(c(rep('Normal', ncol(normal.expression.data)),
                     rep('Tumor', ncol(tumor.expression.data))))
    set.seed(1234) # not sure if this is needed
    preprocessed_data = DeMixT_preprocessing(count.mat, label)
    PRAD_filter = preprocessed_data$count.matrix
    
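    set.seed(1234)
    n1 <- ncol(normal.expression.data)  # number of normal samples (first columns of count.mat)
    n2 <- ncol(tumor.expression.data)   # number of tumor samples (remaining columns)
    Normal.id <- paste0("sample", 1:n1)
    Tumor.id <- paste0("sample", (n1+1):(n1+n2))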
    data.Y = SummarizedExperiment(assays = list(counts = PRAD_filter[, Tumor.id]))
    data.N1 <- SummarizedExperiment(assays = list(counts = PRAD_filter[, Normal.id]))
    res = DeMixT(data.Y = data.Y,
                 data.N1 = data.N1,
                 nthread = 64,
                 gene.selection.method = "DE") # default is "GS"
    res$pi[2, ] # tumor purity estimates
    
  • CIBERSORTx (Newman 2019). Web only. Determining cell type abundance and expression from bulk tissues with digital cytometry
  • PUREE (Revkov 2023). Python-based. Only API, no source code.
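The two closed-form calculations mentioned in this list (the VAF ratio and the ESTIMATE purity formula) can be sketched in a few lines of Python, used here instead of R only to keep the example dependency-free. All function names are hypothetical. `purity_from_vaf` rests on an assumption that is not in the notes above: a clonal, heterozygous somatic variant in a copy-neutral diploid region, so that the expected VAF is half the purity. `estimate_purity` uses the rounded coefficients quoted above.

```python
import math

def vaf(alt_reads, total_reads):
    """Variant allele fraction: reads supporting the variant / reads covering the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads

def purity_from_vaf(v):
    """Rough purity estimate from one VAF.

    ASSUMPTION (not from the notes): clonal heterozygous variant in a
    diploid, copy-neutral region, so expected VAF = purity / 2.
    """
    return min(1.0, 2.0 * v)

def estimate_purity(estimate_score):
    """Tumor purity = cos(0.6 + 0.000146 * ESTIMATE score), rounded coefficients."""
    return math.cos(0.6 + 0.000146 * estimate_score)

v = vaf(alt_reads=30, total_reads=100)   # 0.3
print(v, purity_from_vaf(v))
print(estimate_purity(0.0))              # purity at an ESTIMATE score of 0
```

Copy number changes and subclonal variants break the VAF shortcut, which is one reason dedicated tools model purity jointly with ploidy.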

Integrate/combine Omics

Gene expression

Expression level is the amount of RNA in a cell that was transcribed from that gene. Slides from Alyssa Frazee.

Fusion gene

Structural variation

LUMPY, DELLY, ForestSV, Pindel, breakdancer, SVDetect.

Covid-19

Bulk RNA sequencing for analysis of post COVID-19 condition 2024. 13 differentially expressed genes associated with PCC (long Covid) were found. Enriched pathways were related to interferon-signalling and anti-viral immune processes.

RNASeq + ChipSeq

Labs

Biowulf2 at NIH

BamHash

Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.

Reproducibility

Selected Papers

Pictures

https://www.flickr.com/photos/genomegov

FISH/Fluorescence In Situ Hybridization

Using DNA for identity verification

Using DNA for identity verification

How to teach yourself bioinformatics as a beginner

CRISPR

What is the principle behind gene editing? Understanding the CRISPR gene scissors in one read

Staying current

Staying Current in Bioinformatics & Genomics: 2017 Edition

Papers

Common issues in algorithmic bioinformatics papers

What are some common issues I find when reviewing algorithmic bioinformatics conference papers?

Precision Medicine courses

Personalized medicine

Cancer and gene markers

  • Colorectal cancer patients without KRAS mutations have far better outcomes with EGFR treatment than those with KRAS mutations.
    • Two EGFR inhibitors, cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
  • Breast cancer.

The shocking truth about space travel

7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station

bioSyntax: syntax highlighting for computational biology

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2315-y

Deep learning

Deep learning: new computational modelling techniques for genomics

HRD/homologous recombination deficiency

  • https://en.wikipedia.org/wiki/Homologous_recombination
    • Homologous recombination proficient (HRP) cancer cells can repair DNA damage caused by chemotherapy, making them difficult to treat.
    • Drugs have been developed to target homologous recombination via c-Abl inhibition and to exploit (take advantage of) deficiencies in homologous recombination in cancer cells with BRCA mutations.
    • One such drug is Olaparib, a PARP1 inhibitor that targets cancer cells by inhibiting base-excision repair (BER) in HR-deficient cells. However, cancer cells can become resistant to PARP1 inhibitors if they undergo deletions of mutations in BRCA2, restoring their ability to repair DNA by HR.
  • https://www.genome.gov/genetics-glossary/homologous-recombination (graphical illustration). Homologous recombination is a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA.
  • BRCA1 and BRCA2 are genes that produce proteins responsible for repairing damaged DNA within cells. Mutations in these genes can lead to errors in the DNA repair process, resulting in an accumulation of mutations that can cause cancer. This condition is known as Homologous Recombination Deficiency (HRD). PARP inhibitors are a type of targeted therapy that blocks the enzyme Poly (ADP-ribose) polymerase (PARP), which helps repair DNA damage in (cancer) cells. By inhibiting PARP, these drugs prevent cancer cells from repairing their DNA, leading to cell death.
  • The inability to repair DNA damage is referred to as homologous recombination deficiency (HRD). DNA damage response from Qiagen.
  • openai
    • Poly(ADP-ribose) polymerase (PARP) inhibitors are a class of drugs that are designed to inhibit the activity of PARP enzymes. PARP enzymes are proteins that are involved in DNA repair pathways. They help to repair DNA damage and maintain the stability of the genome by adding a chemical group called poly(ADP-ribose) to other proteins.
    • PARP inhibitors work by blocking the activity of PARP enzymes, which can interfere with the ability of cells to repair damaged DNA. This can be especially useful in cancer cells, which often rely on PARP enzymes to repair DNA damage and maintain genomic stability. By inhibiting PARP enzymes, PARP inhibitors can sensitize cancer cells to chemotherapy and other treatments, making them more vulnerable to cell death.
    • Is there a drug targeting HRP cancer patients? One approach is to target homologous recombination via c-Abl inhibition. For example, Niraparib is a PARP inhibitor that has been shown to be effective in treating advanced ovarian cancer in both homologous recombination deficient (HRD) and homologous recombination proficient (HRP) patients.
    • Are there any drugs that target HRD cancer patients? Yes, there are several drugs that have been developed to target HRD cancer cells. One approach is to use PARP inhibitors, which exploit deficiencies in homologous recombination in cancer cells with BRCA mutations. For example, Olaparib is a PARP1 inhibitor that has been shown to be effective in shrinking or stopping the growth of tumors from breast, ovarian and prostate cancers caused by mutations in the BRCA1 or BRCA2 genes. By inhibiting base-excision repair (BER) in HR-deficient cells, Olaparib applies the concept of synthetic lethality to specifically target cancer cells. PS: Examples of PARP inhibitors include niraparib (Zejula), olaparib (Lynparza), talazoparib (Talzenna), and rucaparib (Rubraca).
    • PARP1 is a member of the PARP family of proteins. PARP stands for Poly (ADP-ribose) polymerase. The PARP family comprises 17 members.

Computational Pathology

Bi-allelic, monoallelic

Bi-allelic and monoallelic expression. In most cases, both alleles (the two chromosomal copies) are transcribed; this is known as bi-allelic expression (left). However, a minority of genes show monoallelic expression (right), in which only one allele of the gene is expressed.

SOMAscan assay (proteomic)

Scandal

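Key Alzheimer's disease paper exposed as suspected fraud; medical experts worldwide may have been fooled for 16 years & Key Alzheimer's disease paper suspected of fabrication, misleading the field for 16 years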

Terms

RNA vs DNA

Gene structure

https://zhuanlan.zhihu.com/p/49601643

Pseudogene

https://www.genome.gov/genetics-glossary/Pseudogene. An example: OR7E47P with alias bpl 41-16 or bpl41-16.

PCR

What is PCR? The polymerase chain reaction? (Uncle Gene)

Epidemiology

Epidemiology for the uninitiated

Cell lines

in vivo, in vitro, and in situ

In silico: computer simulation (in silicon, s=simulation)

  • https://en.wikipedia.org/wiki/In_silico An in silico experiment is one performed on computer or via computer simulation.
  • The main difference between in silico gene expression analysis and experimental gene expression analysis is the method used to study the patterns and levels of gene expression.
    • In silico gene expression analysis involves the use of computational tools and algorithms to analyze large datasets of gene expression data obtained from techniques such as microarrays, RNA sequencing, or single-cell RNA sequencing. This analysis can include identifying differentially expressed genes between samples, clustering genes with similar expression patterns, and predicting the functional roles of genes based on their expression profiles.
    • On the other hand, experimental gene expression analysis involves directly measuring the levels and patterns of gene expression using laboratory techniques. These techniques can include real-time polymerase chain reaction (PCR), northern blotting, western blotting, and immunohistochemistry, among others. These experimental techniques allow researchers to directly measure the levels of specific RNA or protein molecules in biological samples.
    • While in silico gene expression analysis is a rapid and cost-effective way to analyze large datasets of gene expression data, it relies on the accuracy and completeness of the data being analyzed. Experimental gene expression analysis provides a more direct and accurate view of gene expression but can be more time-consuming and expensive. In practice, both in silico and experimental gene expression analysis are valuable tools that can be used to complement each other in the study of gene expression and its role in various biological processes and diseases.
  • In silico gene expression analysis--an overview Murray 2007
  • A simple in silico approach to generate gene-expression profiles from subsets of cancer genomics data 2019

In situ: in the original place (between in vivo and in vitro)

  • https://en.wikipedia.org/wiki/In_situ The meaning lies roughly between in vivo and in vitro.
  • Something that’s performed in situ means that it’s observed in its natural context, but outside of a living organism. In vivo is Latin for “within the living.” It refers to work that’s performed in a whole, living organism.
  • A good example of this is a technique called in situ hybridization (ISH). ISH can be used to look for a specific nucleic acid (DNA or RNA) within something like a tissue sample. Specialized probes are used to bind to a specific nucleic acid sequence that the researcher is looking to find. These probes are tagged with things like radioactivity or fluorescence. This allows the researcher to see where the nucleic acid is located within the tissue sample. ISH allows the researcher to observe where a nucleic acid is located within its natural context, yet outside of a living organism. Examples are microarray experiments.

in vivo: within the living body

Syngeneic

Syngeneic tumor models are experimental models used in cancer research that use genetically identical animals to study the growth and spread of cancer cells. In these models, a malignant tumor is induced in one animal and then transplanted into another animal of the same genetic background. This allows researchers to study the interactions between the host immune system and the cancer cells, as well as the response of the tumor to various treatments.

Syngeneic tumor models are often used in combination with other experimental models, such as xenograft models (where the cancer cells are transplanted into a genetically different animal) or cell line models (where the cancer cells are grown in a laboratory). By using a combination of these models, researchers can gain a more complete understanding of the biology of cancer and develop new treatments for cancer patients.

The syngeneic model is an important tool for studying the role of the immune system in cancer, as the genetically identical animals allow researchers to control for genetic differences that might impact the immune response. Additionally, because the host immune system in these models is functional and can mount a response against the transplanted cancer cells, the syngeneic model provides a more realistic representation of the host-tumor interaction than other models that rely on immunodeficient animals.

in vitro: in a test tube / outside the body

RUO: research use only

RUO stands for "Research Use Only". In the context of clinical trials and laboratory research, it refers to in vitro diagnostic products (IVDs) that are intended to be used in non-clinical studies, including to gather data for submission as required by regulatory authorities. These products are not intended for use in diagnostic procedures. They are often used by medical laboratories and other institutions for research purposes. However, if these products are used for purposes other than research, it could have legal implications. It's important to note that RUO products are not subject to the same regulatory controls as in-vitro diagnostic medical devices (CE-IVDs) that must comply with the applicable legal requirements.

RNA sequencing 101

Web

Books

strand-specific vs non-strand specific experiment

Understanding this information is necessary when we want to use the summarizeOverlaps() function (GenomicAlignments) or the htseq-count Python program to get count data.

This post suggests using the infer_experiment.py script to check whether the RNA-seq run is stranded or not.

The RNA-seq experiment used in this tutorial is not strand-specific.

FASTQ

  • FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Phred quality score

q = -10 * log10(p) where p = error probability for the base.

q     error probability    base call accuracy
10    0.1                  90%
13    0.05                 95%
20    0.01                 99%
30    0.001                99.9%
40    0.0001               99.99%
50    0.00001              99.999%
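
The conversion in the table can be checked with a short Python sketch. The function names are hypothetical, and the ASCII offset of 33 (the Sanger / Illumina 1.8+ encoding) is an assumption not stated above; older Illumina files used an offset of 64.

```python
import math

def phred_to_error_prob(q):
    """p = 10^(-q/10)"""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """q = -10 * log10(p)"""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into Phred scores.

    ASSUMPTION: Phred+33 encoding; pass offset=64 for old Illumina data.
    """
    return [ord(ch) - offset for ch in qual]

for q in (10, 13, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, round(p, 5), f"{100 * (1 - p):.3f}%")

print(decode_quality_string("I5+"))  # → [40, 20, 10]
```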

FASTA

FASTA (fasta/fa) files can be used as the reference genome in IGV, but they cannot be loaded as data files for viewing.

Download sequence files

Compute the sequence length of a FASTA file

https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

head -2 file.fa | \
    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}'  | \
    tail -1
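
For comparison, the same per-sequence length calculation as the awk one-liner can be sketched in Python. `fasta_lengths` is a hypothetical helper; it takes any iterable of lines, so an open file handle works as well as the in-memory example used here.

```python
def fasta_lengths(lines):
    """Return {sequence header: total sequence length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:]               # header line starts a new record
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)    # sequence may be wrapped over many lines
    return lengths

example = [">seq1 demo", "ACGTAC", "GT", ">seq2", "AAAA"]
print(fasta_lengths(example))  # → {'seq1 demo': 8, 'seq2': 4}
```

With a real file this would be called as `fasta_lengths(open("file.fa"))`.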

FASTA <=> FASTQ conversion

According to this post,

  • FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
  • FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)

Convert FASTA to FASTQ without quality scores

Biostars. For example, the bioawk by lh3 (Heng Li) worked.

Convert FASTA to FASTQ with quality score file

See the links on the above post.

Convert FASTQ to FASTA using Seqtk

Use the Seqtk program; see this post.

The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.
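
To make the relationship between the two formats concrete, here is a minimal Python sketch of a FASTQ to FASTA conversion. It is not how Seqtk is implemented; `fastq_to_fasta` is a hypothetical helper and it assumes strict 4-line FASTQ records (no wrapped sequences).

```python
def fastq_to_fasta(lines):
    """Convert 4-line FASTQ records to a list of FASTA lines.

    ASSUMPTION: each record is exactly header, sequence, '+', quality.
    The quality line is simply dropped.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    fasta = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, _qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        fasta.append(">" + header[1:])
        fasta.append(seq)
    return fasta

reads = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "5555"]
print("\n".join(fastq_to_fasta(reads)))
```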

RPKM (Mortazavi et al. 2008) and cpm (counts per million)

Reads per Kilobase of Exon per Million of Mapped reads.

  • RPKMs can only be calculated for those genes for which the gene length and GC content information is available; see the vignette of GSVA
  • rpkm function in edgeR package.
  • RPKM function in easyRNASeq package.
  • TMM > cpm > log2 transformation on the paper Gene expression profiling of 1200 pancreatic ductal adenocarcinoma reveals novel subtypes
  • Gene expression quantification from RNA-Seq wikipedia page
    • Sequencing depth/coverage: the total number of reads generated in a single experiment is typically normalized by converting counts to fragments, reads, or counts per million mapped reads (FPM, RPM, or CPM).
    • Gene length: FPKM, TPM. Longer genes will have more fragments/reads/counts than shorter genes if transcript expression is the same. This is adjusted by dividing the FPM by the length of a gene, resulting in the metric fragments per kilobase of transcript per million mapped reads (FPKM). When looking at groups of genes across samples, FPKM is converted to transcripts per million (TPM) by dividing each FPKM by the sum of FPKMs within a sample.
  • Difference between CPM and TPM and which one for downstream analysis?. CPM is basically depth-normalized counts whereas TPM is length normalized (and then normalized by the length-normalized values of the other genes).
  • The RNA-seq abundance zoo. The counts per million (CPM) metric takes the raw (or estimated) counts, and performs the first type of normalization I mention in the previous section. That is, it normalized the count by the library size, and then multiplies it by a million (to avoid scary small numbers).
  • See also the log(CPM) implemented in Seurat::NormalizeData() for scRNA-seq data.

Idea

  • The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
  • Longer transcripts are expected to generate more reads. This is only relevant for comparisons among different genes, which we rarely perform! As such, DESeq2 only creates a size factor for each library and normalizes the counts by dividing them by that size factor (a scalar). Note that H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2, where c is the gene length.

Calculation

  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

Formula

RPKM = (10^9 * C)/(N * L), with 

C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
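# biocLite() is defunct; Bioconductor packages are now installed with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("edgeR")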
library(edgeR)

set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
     [,1] [,2] [,3] [,4]
[1,]    0    0    5    0
[2,]    6    2    7    3
[3,]    5   13    7    2
[4,]    3    3    9   11
[5,]    1    2    1   15

d <- DGEList(counts=y, lib.size=1001:1004)
# Note that lib.size is optional
# By default, lib.size = colSums(counts)
cpm(d) # counts per million
   Sample1   Sample2  Sample3   Sample4
1    0.000     0.000 4985.045     0.000
2 5994.006  1996.008 6979.063  2988.048
3 4995.005 12974.052 6979.063  1992.032
4 2997.003  2994.012 8973.081 10956.175
5  999.001  1996.008  997.009 14940.239
> cpm(d,log=TRUE)
    Sample1   Sample2  Sample3   Sample4
1  7.961463  7.961463 12.35309  7.961463
2 12.607393 11.132027 12.81875 11.659911
3 12.355838 13.690089 12.81875 11.129470
4 11.663897 11.662567 13.17022 13.451207
5 10.285119 11.132027 10.28282 13.890078

d$genes$Length <- c(1000,2000,500,1500,3000)
rpkm(d)
    Sample1   Sample2    Sample3  Sample4
1    0.0000     0.000  4985.0449    0.000
2 2997.0030   998.004  3489.5314 1494.024
3 9990.0100 25948.104 13958.1256 3984.064
4 1998.0020  1996.008  5982.0538 7304.117
5  333.0003   665.336   332.3363 4980.080

> cpm
function (x, ...)
UseMethod("cpm")
<environment: namespace:edgeR>
> showMethods("cpm")

Function "cpm":
 <not an S4 generic function>
> cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    x <- as.matrix(x)
    if (is.null(lib.size))
        lib.size <- colSums(x)
    if (log) {
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
        lib.size <- lib.size + 2 * prior.count.scaled
    }
    lib.size <- 1e-06 * lib.size
    if (log)
        log2(t((t(x) + prior.count.scaled)/lib.size))
    else t(t(x)/lib.size)
}
<environment: namespace:edgeR>
> rpkm.default
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
    ...)
{
    y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
    gene.length.kb <- gene.length/1000
    if (log)
        y - log2(gene.length.kb)
    else y/gene.length.kb
}
<environment: namespace:edgeR>

For example, the RPKM value of the 2nd gene in the 1st sample is calculated as

# step 1:
6/(1.0e-6 *1001) = 5994.006    # cpm, compute column-wise
# step 2:
5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise

# Another way
# step 1 (RPK) 
6/ (2000/1.0e3) = 3
# step 2 (RPKM)
3/ (1.0e-6 * 1001) = 2997.003
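
The same arithmetic can be reproduced outside edgeR. The following Python sketch (hypothetical function names, plain numbers rather than a DGEList) checks that the two-step route and the closed-form formula RPKM = (10^9 * C)/(N * L) agree for the 2nd gene in the 1st sample.

```python
def cpm(count, lib_size):
    """Counts per million: count / (library size in millions)."""
    return count / (lib_size * 1e-6)

def rpkm(count, lib_size, gene_length_bp):
    """RPKM = 10^9 * C / (N * L), with L in base pairs."""
    return 1e9 * count / (lib_size * gene_length_bp)

# Values from the worked example: C = 6 reads, N = 1001, L = 2000 bp
two_step = cpm(6, 1001) / (2000 / 1e3)   # CPM first, then divide by length in kb
closed_form = rpkm(6, 1001, 2000)
print(two_step, closed_form)             # both about 2997.003
```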

Another example. source code of calc_cpm().

library(edgeR)
set.seed(1234)
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
cpm(y)
#          [,1]   [,2]      [,3]      [,4]
#[1,]      0.00      0 172413.79      0.00
#[2,] 400000.00 100000 241379.31  96774.19
#[3,] 333333.33 650000 241379.31  64516.13
#[4,] 200000.00 150000 310344.83 354838.71
#[5,]  66666.67 100000  34482.76 483870.97

calc_cpm <- function (expr_mat) {
    norm_factor <- colSums(expr_mat)
    return(t(t(expr_mat)/norm_factor) * 10^6)
    # Fix a bug in the original code
    # Not affect silhouette()
}

calc_cpm(y)
#          [,1]   [,2]      [,3]      [,4]
#[1,]      0.00      0 172413.79      0.00
#[2,] 400000.00 100000 241379.31  96774.19
#[3,] 333333.33 650000 241379.31  64516.13
#[4,] 200000.00 150000 310344.83 354838.71
#[5,]  66666.67 100000  34482.76 483870.97

Criticism

Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 million reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!

CPM vs TPM

Both have the property that the values within a sample sum to 1 million (10^6). But TPM includes gene length normalization: it accounts for variation in gene length (done first) and sequencing depth (done second). So if we want to find DE genes between samples, it is common to use the TPM normalization method.
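
A small Python sketch of the two definitions makes the order of operations explicit. The function names are hypothetical; the counts are the first column of the matrix `y` from the edgeR example above and the gene lengths are the ones assigned to `d$genes$Length` there. The library size is taken as the column sum, as in `cpm(y)`.

```python
def cpm(counts):
    """Depth normalization only: scale counts so they sum to one million."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def tpm(counts, lengths_bp):
    """Length normalization first (reads per kilobase), then scale to one million."""
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    return [r / total * 1e6 for r in rpk]

counts = [0, 6, 5, 3, 1]
lengths = [1000, 2000, 500, 1500, 3000]

print([round(x, 2) for x in cpm(counts)])
print([round(x, 2) for x in tpm(counts, lengths)])
print(sum(cpm(counts)), sum(tpm(counts, lengths)))  # both sum to about 1e6
```

The short third gene (500 bp) gains weight under TPM relative to CPM, while the long fifth gene loses weight.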

(another criticism) Union Exon Based Approach

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910

In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.

It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.

FPKM (Trapnell et al. 2010)

  • Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks).
  • FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
  • Differential expression analysis with only FPKM matrix available from total newbie in R

RPKM, FPKM, TPM and DESeq

> library(DESeq2)
> set.seed(1)
> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14       1       1       4
gene2       5      17      13      14
gene3       0      12       8       6
gene4     152      62     149     110
gene5      23      36      33      94
gene6       0       1       1       4
> dds <- estimateSizeFactors(dds)
> sizeFactors(dds)
 sample1  sample2  sample3  sample4
1.068930 1.014687 1.010392 1.033559
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      14       1       1       4
gene2       5      17      13      14
gene3       0      12       8       6
gene4     152      62     149     110
gene5      23      36      33      94
gene6       0       1       1       4
> head(counts(dds, normalized=TRUE))
         sample1    sample2     sample3    sample4
gene1  13.097206  0.9855256   0.9897147   3.870122
gene2   4.677574 16.7539354  12.8662916  13.545427
gene3   0.000000 11.8263073   7.9177179   5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6   0.000000  0.9855256   0.9897147   3.870122

# normalized counts are calculated as follows
> head(scale(counts(dds, normalized=FALSE), center=FALSE, scale=sizeFactors(dds)))
         sample1    sample2     sample3    sample4
gene1  13.097206  0.9855256   0.9897147   3.870122
gene2   4.677574 16.7539354  12.8662916  13.545427
gene3   0.000000 11.8263073   7.9177179   5.805183
gene4 142.198237 61.1025878 147.4674957 106.428358
gene5  21.516838 35.4789219  32.6605863  90.947869
gene6   0.000000  0.9855256   0.9897147   3.870122

# The situation of DESeqDataSet object created using 'tximport()' is different. See the next item.

P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)
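
The size factors in the DESeq2 session above come from the median-of-ratios method, which is the default of estimateSizeFactors(). The following is a simplified Python sketch of that idea, not DESeq2's actual code; the function names and the toy count matrix are made up for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample counts.
    Genes with a zero in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = []
    for row in counts:
        if all(c > 0 for c in row):
            logs = [math.log(c) for c in row]
            log_geo_mean = sum(logs) / n_samples
            log_ratios.append([x - log_geo_mean for x in logs])
    if not log_ratios:
        raise ValueError("every gene has a zero count in some sample")
    return [math.exp(median(r[j] for r in log_ratios)) for j in range(n_samples)]

def normalize(counts, factors):
    """Divide each sample (column) by its size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1
toy = [[10, 20], [100, 200], [0, 5], [30, 60]]
sf = size_factors(toy)
print(sf)
print(normalize(toy, sf))
```

After normalization the two samples have equal values for every gene that is not zero, which is what a pure depth difference should give.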

TMM (Robinson and Oshlack, 2010)

Trimmed Means of M values (edgeR).

  • TMM relies on the assumption that most genes are not differentially expressed. See the paper. DESeq2's median-of-ratios normalization relies on a similar assumption.
  • TMM will not work well for samples where the library size is so small that most of the counts become zero. A library size of 1 million is on the small side, but is probably ok. Is there any reason to think that normalization (e.g. TMM) doesn't work well with samples that have very different raw counts?
  • Many RNA-Seq normalization methods perform poorly on samples with extreme composition bias. For instance, in one sample a large number of reads comes from rRNAs while in another they have been removed more efficiently. Most scaling-based methods, including RPKM and CPM, will underestimate the expression of weakly expressed genes in the presence of extremely abundant mRNAs (less sequencing real estate available for them). The TMM method tries to correct this bias.
  • Does EdgeR trimmed mean of M values (TMM) account for gene length? No. In general, edgeR does not need to adjust for gene length in DE analyses because gene length cancels out of DE comparisons.
  • Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq
  • Q: Does TMM method require count data?
    • A: Yes, the TMM method requires RNA-seq count data as input.
    • The TMM method uses these count data to calculate scaling factors that adjust for differences in library size and RNA composition, in particular the effect of highly expressed genes. It does not adjust for gene length.
    • Before applying the TMM method, it is important to ensure that the count data has been properly preprocessed and filtered to remove low-quality reads, adapter sequences, and other artifacts.
  • Q: can TMM method be applied to non-integer data?
    • A: It is possible to apply TMM to non-integer data, such as normalized expression values or FPKM (fragments per kilobase of transcript per million mapped reads) values, by rounding the values to the nearest integer.
    • In practice, the TMM method can be applied to non-integer data by first converting the data to counts, for example, by multiplying the expression values by a scaling factor that represents the average library size, and then rounding the resulting values to the nearest integer. The TMM method can then be applied to the rounded count data as usual.
  • StatQuest: edgeR, part 1, Library Normalization. Good explanation about reference sample selection.
  • Trimmed Mean
  • RNA Sequence Analysis in R: edgeR
  • Normalisation methods implemented in edgeR. TMM, RLE, Upper-quartile.
  • Using edgeR package
    library(magrittr)
    library(edgeR)
    set.seed(1)
    M <- matrix(rnbinom(10000,mu=5,size=2), ncol=4)
    
    out <- DGEList(M) %>% calcNormFactors() %>% cpm()
    
  • Using NOISeq package
    library(NOISeq)
    out2 <- tmm(M, long = 1000, lc = 0, k = 0)
    out[1:3, 1:3] / out2[1:3, 1:3]
    #    Sample1  Sample2  Sample3
    # 1 80.81611 81.32609      NaN
    # 2 80.81611 81.32609 80.81611
    # 3 80.81611      NaN 80.81611
    

Sample size

Power-> RNA-seq

Coverage

~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
C = L N / G

where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.
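
The coverage formula C = L N / G is easy to check numerically. The Python sketch below uses hypothetical function names and reproduces the 6.3x figure from the example above; `reads_needed` simply inverts the formula.

```python
import math

def coverage(read_length, n_reads, genome_length):
    """Average depth C = L * N / G."""
    return read_length * n_reads / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Invert C = L * N / G to get the number of reads N for a target depth."""
    return math.ceil(target_coverage * genome_length / read_length)

c = coverage(read_length=100, n_reads=189_000_000, genome_length=3_000_000_000)
print(round(c, 2))                                # → 6.3
print(reads_needed(20, 100, 3_000_000_000))       # → 600000000
```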

# Assume the BAM file is sorted by chromosome location.
# Took 40 min on a 5.8G BAM file. samtools depth has no threads option :(
# This is not right, since it only accounts for positions that were covered by reads.
samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'    # maybe 42

# The following is the right way! The result matches the Qualimap program.
samtools depth -a *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  # maybe 8
# OR: total depth divided by total reference length
LEN=$(samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {printf "%.0f\n", sum}')   # 3095693981
SUM=$(samtools depth bamfile | awk '{sum+=$3} END {printf "%.0f\n", sum}')   # 24473867730
awk -v s="$SUM" -v l="$LEN" 'BEGIN {print s/l}'   # about 7.9

5 common genomics file formats

5 genomics file formats you must know (video)

  • fasta,
  • fastq,
  • bam,
  • vcf,
  • bed (genomic intervals regions)

SAM/Sequence Alignment Format and BAM format specification

Single-end, pair-end, fragment, insert size

Germline vs Somatic mutation

  • Germline: inherited from parents. See the Wikipedia page.
  • Somatic SNVs are mutations acquired after conception in non-germline cells, for example in the cells of a tumor. They are present only in the affected cell and its descendants and are not inherited, while germline SNVs are inherited from a parent and are found in every cell of the body.
  • Somatic & Germline Mutations

Pathogenic mutation

  • A pathogenic mutation is a change in the genetic sequence that causes a specific genetic disease. To determine if a change found in the gene is something that causes disease, a laboratory looks at many different factors. For example, they look at the type of change found. Some changes, like nonsense mutations or frameshift mutations, almost always result in a major problem with the protein produced, so they are often labeled as pathogenic mutations. Laboratories will also check the scientific literature and databases to see if the particular change has been reported in other individuals with the genetic disease. Lastly, they look to see if the change is in an area of the gene that is conserved across species, meaning that the area where the change is located is the same in lots of animals, thus may be an important area for the function of the protein.
  • pathogenic variant. See Genetic Disorders
  • Pathogenic means able to cause or produce disease.
  • What are some examples of genetic diseases caused by pathogenic mutations? Cystic fibrosis, Duchenne muscular dystrophy, Familial hypercholesterolemia, Hemochromatosis, Sickle cell disease, Tay-Sachs disease ...

Driver vs passenger mutation

https://en.wikipedia.org/wiki/Somatic_evolution_in_cancer

Nonsynonymous mutation

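It is related to the genetic code (Wikipedia). There are 64 codons but only 20 amino acids, so some nucleotide substitutions leave the amino acid unchanged (synonymous) while others change it (nonsynonymous).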

See

isma: analysis of mutations detected by multiple pipelines

isma: an R package for the integrative analysis of mutations detected by multiple pipelines

mutSignatures: analysis of cancer mutational signatures

Rediscover: identify mutually exclusive mutations

Rediscover: an R package to identify mutually exclusive mutations

Tumor mutational burden

Types of mutations

Cytogenetic alterations

Alternative and differential splicing

Allele vs Gene

http://www.diffen.com/difference/Allele_vs_Gene

  • A gene is a stretch of DNA or RNA that determines a certain trait.
  • Genes mutate and can take two or more alternative forms; an allele is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
  • An allele is found at a fixed spot on a chromosome (its locus).
  • Chromosomes occur in pairs, so organisms have two alleles for each gene: one allele on each chromosome of the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from the parents may be the same (homozygous) or different (heterozygous).
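
The homozygous/heterozygous distinction is what the genotype (GT) field of a VCF file records for a diploid sample. A minimal sketch, assuming GT strings such as `0/1` or `1|1`, where the numbers index alleles (0 is the reference allele) and `.` marks a missing call. `zygosity` is a hypothetical helper name, not part of any tool.

```shell
# zygosity: classify a diploid GT string as homozygous or heterozygous.
zygosity() {
  # Split on either separator: "/" (unphased) or "|" (phased).
  a=${1%%[/|]*}   # allele index before the first separator
  b=${1##*[/|]}   # allele index after the last separator
  if [ "$a" = "." ] || [ "$b" = "." ]; then
    echo "missing"
  elif [ "$a" = "$b" ]; then
    echo "homozygous"
  else
    echo "heterozygous"
  fi
}

zygosity "0/1"
zygosity "1|1"
```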

Locus

https://en.wikipedia.org/wiki/Locus_(genetics)

Haplotypes

Base quality, Mapping quality, Variant quality

VarSAP

Enrichment of Variant Information for the Variant Standardization and Annotation Pipeline

The Clinical Knowledgebase (CKB)

https://ckb.jax.org/gene/show?geneId=7157 (TP53)

Mapping quality (MAPQ) vs Alignment score (AS)

http://seqanswers.com/forums/showthread.php?t=66634 & SAM format specification

  • MAPQ (5th column): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong} (as defined in the SAM documentation), rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available. MAPQ tells you how confident you can be that the read comes from the reported position. For example, among 1000 read alignments with mapping quality 30, one will be wrong on average (10^(30/-10)=.001). As another example, if MAPQ=70, the probability that the mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ of at least 30. Aligners differ in the range and meaning of the MAPQ values they report, so check the documentation of the alignment program used.
  • AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use).

Note:

  1. MAPQ scores produced by the aligners typically involve the alignment score and other information.
  2. You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but the reported position is still much more probable than any other.
  3. You probably want to filter on MAPQ, but a "good" alignment may refer to AS if what you care about is the similarity between read and reference.
  4. MAPQ values are really useful but their implementation is a mess, by Simon Andrews
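
The MAPQ formula above can be checked with a few lines of awk. This is a sketch of the arithmetic only. The helper names `mapq_to_prob` and `prob_to_mapq` are made up for illustration, and real aligners may cap or rescale the values they report.

```shell
# MAPQ -> probability that the mapping position is wrong: 10^(-MAPQ/10)
mapq_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Probability -> MAPQ, rounded to the nearest integer: -10 * log10(p)
# (awk only has the natural log, hence the division by log(10))
prob_to_mapq() {
  awk -v p="$1" 'BEGIN { printf "%d\n", int(-10 * log(p) / log(10) + 0.5) }'
}

mapq_to_prob 30      # 0.001
mapq_to_prob 70      # 1e-07
prob_to_mapq 0.001   # 30
```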

gene's isoform

FFPE Tissue vs Frozen Tissue

Wild type vs mutant

ns

Not significant

PARP inhibitor

  • What is a PARP Inhibitor? Dana-Farber Cancer Institute
  • PARP is an enzyme/a family of proteins that help repair damaged DNA in cells. When DNA is damaged, PARP detects the damage and signals other enzymes to come and fix it. This helps maintain the stability of the cell’s genetic material and prevent cell death.
  • Is PARP good or bad? PARP is neither inherently good nor bad. It is a protein that plays an important role in maintaining the stability of the cell’s genetic material by helping to repair damaged DNA.
    • In normal cells, PARP helps prevent cell death and maintain genomic stability.
    • In cancer cells, PARP can help the cancer cells survive and continue to grow by repairing their DNA. This is why PARP inhibitors (PARPi) are used in cancer treatment to block the function of PARP and prevent cancer cells from repairing their DNA.
  • PARP inhibitors are a type of targeted cancer therapy, not a traditional chemotherapy.
  • PARPi therapy is a cancer treatment that blocks the PARP enzyme, which helps repair DNA damage in cancer cells
    • PARP Inhibitors: Clinical Relevance, Mechanisms of Action and Tumor Resistance
    • List of PARP inhibitors: olaparib, niraparib, rucaparib, and talazoparib.
    • Olaparib is a medication for the maintenance treatment of BRCA-mutated advanced ovarian cancer in adults. It is a PARP inhibitor, inhibiting poly ADP ribose polymerase (PARP), an enzyme involved in DNA repair. Other maintenance-therapy drugs include letrozole and Avastin (bevacizumab), neither of which is a PARP inhibitor.
    • Maintenance therapy is called so because it is the ongoing treatment of cancer with medication after the cancer has responded to the first recommended treatment. The main goals of maintenance therapy are
      • To prevent the cancer’s return
      • To delay the growth of advanced cancer after the initial treatment
  • PARP inhibitors are a class of drugs that inhibit the activity of PARP enzymes. By blocking PARP’s ability to help repair DNA damage, these drugs can make it more difficult for cancer cells to survive DNA damage caused by other treatments, such as chemotherapy or radiation therapy. This can make these treatments more effective against certain types of cancer.
  • PARP inhibitors are drugs that block the action of the PARP enzymes, which are involved in DNA repair. There are several PARP inhibitors available, including olaparib (Lynparza), niraparib (Zejula), rucaparib (Rubraca), and talazoparib (Talzenna). These drugs are approved for some types of cancer, such as ovarian and prostate cancer, depending on the presence of certain genetic mutations.

Inhibitor genes and activator genes/enhancer genes

  • Inhibitor genes: Inhibitor genes are genes that code for proteins that can regulate or inhibit the activity of other genes or proteins in a cell. These inhibitor proteins can interact with other proteins to prevent them from functioning or alter their activity.
    • TP53 gene, which codes for the p53 protein, a tumor suppressor protein that plays a critical role in regulating cell division and preventing the formation of cancerous tumors. Mutations in the TP53 gene can result in loss of p53 function and an increased risk of cancer.
  • Activator genes: Activator genes play crucial roles in a variety of biological processes, including embryonic development, immune system responses, and the regulation of gene expression. For example, the NF-kB gene codes for a transcription factor that activates genes involved in immune system responses.
    • The MYC gene is an oncogene that codes for a transcription factor that promotes cell growth and proliferation. Dysregulation or overexpression of MYC can contribute to the development of many types of cancer.
    • Overexpression of the HER2 (human epidermal growth factor receptor 2) gene is commonly observed in certain types of cancer, particularly breast cancer. Approximately 20-25% of breast cancer cases overexpress HER2, which is associated with a more aggressive form of the disease.
    • Overexpression of the epidermal growth factor receptor (EGFR) gene is a common genetic alteration observed in glioblastoma multiforme (GBM) patients.
  • It's important to note that the distinction between inhibitor and activator genes is not always clear-cut, as many genes can have both inhibitory and activating effects depending on the context and the specific proteins they interact with.
  • Normal/cancer cells and PARP Inhibition:
    • Normal cells can tolerate DNA damage caused by PARP inhibition due to their efficient homologous recombination (HR) mechanism.
    • In contrast, cancer cells with a deficient HR struggle to manage the DNA double-strand breaks (DSBs) and are especially sensitive to the effects of PARP inhibitors (PARPi)
    • PARP has been found to be overexpressed in various types of cancers, including breast, ovarian, and oral cancers, compared to their corresponding normal healthy tissues.
    • This overexpression makes inhibition of PARP activity an attractive strategy for cancer therapeutics. By disrupting PARP functions, it impairs DNA damage repair (DDR) pathways in cancer cells.
    • After cancer patients receive PARP inhibitor (PARPi) drugs, the expression of PARP genes in cancer patients tends to be lower compared to normal patients.

Undifferentiated cancer

  • Medical Definition of Undifferentiated cancer (a bad kind of tumor)
  • What are undifferentiated cells
    • Undifferentiated cells are cells that have not yet developed into a specific type of cell and do not possess the characteristics of a fully differentiated cell. They are also known as stem cells or progenitor cells.
    • In developmental biology, undifferentiated cells are the cells that have not yet undergone differentiation, the process by which a less specialized cell becomes a more specialized cell, with a specific function and characteristics. These cells have the potential to divide and differentiate into multiple cell types, either through normal development or in response to injury or disease.
    • In the context of cancer, undifferentiated cells refer to cells that have not yet developed into a specific type of cancer cell. These cells are sometimes called cancer stem cells, and they are thought to be the cells that give rise to the various types of cells within a tumor. They are believed to be responsible for the maintenance and growth of the tumor, and for its ability to spread to other parts of the body. They can be found within many types of cancer, and are considered to be an important target for cancer therapy, as they are thought to be more resistant to traditional treatments such as chemotherapy and radiation.
  • GSE164174

NCI Information Technology for Cancer Research program /ITCR

https://itcr.cancer.gov/. Videos. It sponsors several programs such as Bioconductor, GenePattern, UCSC Xena, IGV, PDX Finder, WebMeV, etc.

Other software

Partek

dCHIP

MeV

MeV v4.8 (11/18/2011) allows annotation from Bioconductor

IPA from Ingenuity

Login: There is a web-start version at https://analysis.ingenuity.com/pa and a Java applet version at https://analysis.ingenuity.com/pa/login/choice.jsp. We can double-click the file <IpaApplication.jnlp> in the download folder.

Features:

  • easily search the scientific literature/integrate diverse biological information.
  • build dynamic pathway models
  • quickly analyze experimental data/Functional discovery: assign function to genes
  • share research and collaborate. On the other hand, because IPA is web based, running an analysis takes time. Once a submitted analysis is done, an email is sent to the user.

Start Here

Expression data -> New core analysis -> Functions/Diseases -> Network analysis
                                        Canonical pathways        |
                                              |                   |
Simple or advanced search --------------------+                   |
                                              |                   |
                                              v                   |
                                        My pathways, Lists <------+
                                              ^
                                              |
Creating a custom pathway --------------------+

Resource:

Notes:

  • The input data file can be an Excel file with at least one gene ID column and the expression values in the last columns (just what the BRB-ArrayTools general format importer requires).
  • The data to be uploaded can be in different forms (because IPA is web-based, projects and analyses are not saved locally). See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the terms Single and Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (e.g. wild-type). A Multiple Observation dataset contains more than one experimental condition (e.g. a time course experiment or a dose response experiment) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations may be uploaded into IPA in a single file.
  • The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
  • In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
  • The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.

Screenshots:

File:IngenuityAnalysisOutput.png

DAVID Bioinformatics Resource

It offers an integrated annotation combining gene ontology, pathways and protein annotations.

It can be used to identify the pathways associated with a set of genes; e.g. this paper.

GOTrapper

GOTrapper: a tool to navigate through branches of gene ontology hierarchy

qpcR

Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).

GSEA

sandbox.bio: Interactive bioinformatics tutorials

https://sandbox.bio/. An interactive playground for learning bioinformatics command-line tools like bedtools, bowtie2, and samtools.

GWAS

Genome-wide association studies in R