ScRNA: Difference between revisions

From 太極
Jump to navigation Jump to search
 
(320 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Resource =
= Resource =
* https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis
* https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis. Search "public".
* [http://journal.frontiersin.org/article/10.3389/fgene.2017.00062/full Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods] by Dal Molin et al 2017.
* [http://journal.frontiersin.org/article/10.3389/fgene.2017.00062/full Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods] by Dal Molin et al 2017.
* [http://bioinformatics.oxfordjournals.org/content/31/13/2225.short?rss=1 Normalization and noise reduction for single cell RNA-seq experiments] by Bo Ding et al 2015.
* [http://bioinformatics.oxfordjournals.org/content/31/13/2225.short?rss=1 Normalization and noise reduction for single cell RNA-seq experiments] by Bo Ding et al 2015.
Line 6: Line 6:
* [https://f1000research.com/articles/6-595/v1 Gene length and detection bias in single cell RNA sequencing protocols] and the script & data are available online. Belinda Phipson1 et al 2017.
* [https://f1000research.com/articles/6-595/v1 Gene length and detection bias in single cell RNA sequencing protocols] and the script & data are available online. Belinda Phipson1 et al 2017.
* [https://f1000research.com/articles/6-1158/v1 Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference] in ISCB (International Society for Computational Biology Community Journal). '''It has open peer reviews too'''.
* [https://f1000research.com/articles/6-1158/v1 Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference] in ISCB (International Society for Computational Biology Community Journal). '''It has open peer reviews too'''.
* [https://romanhaa.github.io/projects/scrnaseq_workflow/ scRNA-seq analysis workflow] by Roman Hillje
* [https://joss.theoj.org/papers/10.21105/joss.01686 Welcome to the Tidyverse] from the Journal of Open Source Software. '''It has open peer reviews too'''.
* [https://joss.theoj.org/papers/10.21105/joss.01686 Welcome to the Tidyverse] from the Journal of Open Source Software. '''It has open peer reviews too'''.
* [https://academic.oup.com/biostatistics/article-abstract/4599254 Missing data and technical variability in single-cell RNA-sequencing experiments]
* [https://academic.oup.com/biostatistics/article-abstract/4599254 Missing data and technical variability in single-cell RNA-sequencing experiments]
* [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0467-4 A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications] 2017
* [https://www.rna-seqblog.com/clinical-perspectives-of-single-cell-rna-sequencing/ Clinical perspectives of single-cell RNA sequencing] 2021
* [https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby007/4831233 How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives] by Alessandra Dal Molin 2018
* [https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby007/4831233 How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives] by Alessandra Dal Molin 2018
* [https://www.rna-seqblog.com/crafting-a-blueprint-for-single-cell-rna-sequencing/ Crafting a blueprint for single-cell RNA sequencing] 2021
* [https://www.embopress.org/doi/10.15252/msb.20188746 Current best practices in single‐cell RNA‐seq analysis: a tutorial] 2019
* [https://www.embopress.org/doi/10.15252/msb.20188746 Current best practices in single‐cell RNA‐seq analysis: a tutorial] 2019
* [https://rna-seqblog.com/dsave-detection-of-misclassified-cells-in-single-cell-rna-seq-data/ DSAVE]: Detection of misclassified cells in single-cell RNA-Seq data.
* [https://rna-seqblog.com/dsave-detection-of-misclassified-cells-in-single-cell-rna-seq-data/ DSAVE]: Detection of misclassified cells in single-cell RNA-Seq data.
* [https://rna-seqblog.com/top-benefits-of-using-the-technique-of-single-cell-rna-seq/ Top Benefits of Using the Technique of Single Cell RNA-Seq]
* [https://rna-seqblog.com/top-benefits-of-using-the-technique-of-single-cell-rna-seq/ Top Benefits of Using the Technique of Single Cell RNA-Seq]
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07358-4 Systematic comparison of high-throughput single-cell RNA-seq methods for immune cell profiling] 2021
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07358-4 Systematic comparison of high-throughput single-cell RNA-seq methods for immune cell profiling] 2021
* [https://pubmed.ncbi.nlm.nih.gov/34155396/ The triumphs and limitations of computational methods for scRNA-seq ] Kharchenko 2021
* [https://hbctraining.github.io/scRNA-seq_online/ Single-cell RNA-seq data analysis workshop] from [https://hbctraining.github.io/main/ Bioinformatics Training at the Harvard Chan Bioinformatics Core]
* [https://hbctraining.github.io/scRNA-seq_online/ Single-cell RNA-seq data analysis workshop] from [https://hbctraining.github.io/main/ Bioinformatics Training at the Harvard Chan Bioinformatics Core]
* https://www.plob.org/article/21723.html 单细胞RNA测序 中文網站
* https://www.plob.org/article/21723.html 单细胞RNA测序 中文網站
* [https://github.com/Novartis/scRNAseq_workflow_benchmark Novartis/scRNAseq_workflow_benchmark]
* [https://www.rna-seqblog.com/top-benefits-of-using-the-technique-of-single-cell-rna-seq/ Top Benefits of Using the Technique of Single Cell RNA-Seq].  
* [https://www.rna-seqblog.com/top-benefits-of-using-the-technique-of-single-cell-rna-seq/ Top Benefits of Using the Technique of Single Cell RNA-Seq].  
** Gain Better Understanding cell types
** Gain Better Understanding cell types
Line 29: Line 33:
** [https://liulab-dfci.github.io/bioinfo-combio/ Introduction to Bioinformatics and Computational Biology] Xiaole Shirley Liu (ebook format)
** [https://liulab-dfci.github.io/bioinfo-combio/ Introduction to Bioinformatics and Computational Biology] Xiaole Shirley Liu (ebook format)
* ebooks
* ebooks
** [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html Analysis of single cell RNA-seq data] Ruqian Lyu, PuXue Qiao, and Davis J. McCarthy
** [https://scrnaseq-course.cog.sanger.ac.uk/website/index.html Analysis of single cell RNA-seq data] (ebook, Kiselev et al 2019, University of Cambridge Bioinformatics training unit, SingleCellExperiment & Seurat) and the paper [https://www.nature.com/articles/s41596-020-00409-w Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data] Andrews 2020. A docker image is available.  
** [https://scrnaseq-course.cog.sanger.ac.uk/website/index.html Analysis of single cell RNA-seq data] (ebook, Kiselev et al 2019, University of Cambridge Bioinformatics training unit, SingleCellExperiment & Seurat) and the paper [https://www.nature.com/articles/s41596-020-00409-w Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data] Andrews 2020. A docker image is available.  
** [https://broadinstitute.github.io/2019_scWorkshop/ ANALYSIS OF SINGLE CELL RNA-SEQ DATA] (ebook) Ashenberg et al 2019 from Broad institute. Seurat package was used.
** [https://broadinstitute.github.io/2019_scWorkshop/ ANALYSIS OF SINGLE CELL RNA-SEQ DATA] (ebook) Ashenberg et al 2019 from Broad institute. Seurat package was used.
** [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html Analysis of single cell RNA-seq data] (ebook) Lyu et al. 2019. Seurate and SingleCellExperiment.
** [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html Analysis of single cell RNA-seq data] (ebook) Lyu et al. 2019. Seurate and SingleCellExperiment.
* [https://www.rna-seqblog.com/cellheap-a-workflow-for-optimizing-covid-19-single-cell-rna-seq-data-processing/ CellHeap – a workflow for optimizing COVID-19 single-cell RNA-Seq data processing]
* [https://www.rna-seqblog.com/analysis-of-single-cell-rna-sequencing-data-a-step-by-step-guide/ Analysis of single cell RNA sequencing data: a step-by-step guide]
== Workshops ==
* http://dors.weizmann.ac.il/course/workshop2021/scRNA/
== Library preparation ==
* Smart-Seq2
* Drop-Seq
* Fluidigm-C1
== Guideline ==
[https://www.rna-seqblog.com/data-analysis-guidelines-for-single-cell-rna-seq-in-biomedical-studies-and-clinical-applications/ Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications]
== Pipeline ==
* [https://www.biorxiv.org/content/10.1101/2021.08.16.456499v2 scFlow]: A Scalable and Reproducible Analysis Pipeline for Single-Cell RNA Sequencing Data. https://github.com/combiz/scFlow
* [https://www.rna-seqblog.com/bollito-a-flexible-pipeline-for-comprehensive-single-cell-rna-seq-analyses/ bollito – a flexible pipeline for comprehensive single-cell RNA-Seq analyses]
* [https://www.rna-seqblog.com/scampi-a-versatile-pipeline-for-single-cell-rna-seq-analysis-from-basics-to-clinics/ scAmpi] – A versatile pipeline for single-cell RNA-seq analysis from basics to clinics
== GEO ==
* [https://singlecell.biolab.si/widget-catalog/geo_data_sets/ '''scOrange'''] which provides access to data sets from GEO.
* [https://bioinformatics.stackexchange.com/a/13208 How do I pull singe cell RNA sequencing data from GEO database?]
* [https://bioinfo.uth.edu/scrnaseqdb/ scRNASeqDB] not maintained
== Applications ==
* [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0467-4 A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications] 2017
* [https://www.rna-seqblog.com/single-cell-rna-sequencing-technologies-and-applications/ Single-cell RNA sequencing technologies and applications] 2022
* [https://www.rna-seqblog.com/single-cell-rna-seq-reveals-path-for-better-treatment-of-autoimmune-diseases/ Single-cell RNA-Seq reveals path for better treatment of autoimmune diseases] 2022
* [https://www.rna-seqblog.com/single-cell-rna-sequencing-identifies-an-immunotherapy-target-to-combat-glioblastomas-2/ Single cell RNA sequencing identifies an immunotherapy target to combat glioblastomas]
* [https://thepathologist.com/diagnostics/waiting-in-the-wings Waiting in the Wings – Why isn’t single-cell sequencing ready for clinical application?] 2022
* [https://www.rna-seqblog.com/the-evolution-of-single-cell-rna-sequencing-technology-and-application-progress-and-perspectives/ The evolution of single-cell RNA sequencing technology and application: progress and perspectives] 2023


= Public data =
= Public data =
Line 39: Line 75:
* https://cells.ucsc.edu/  
* https://cells.ucsc.edu/  
* https://www.humancellatlas.org/
* https://www.humancellatlas.org/
* [https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html scRNAseq] package from Bioconductor
* [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/quality-control-and-data-visualisation.html Tung dataset], [https://github.com/jdblischak/singleCellSeq data].
* [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/quality-control-and-data-visualisation.html Tung dataset], [https://github.com/jdblischak/singleCellSeq data].
* SRA/GSE
* SRA/GSE
** [https://www.biostars.org/p/427428/ cellranger count help]
** [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/tutorials/gex-analysis-nature-publication#example SRR7611046 and SRR7611048] from Sequence Read Archive (SRA) pages
** [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/tutorials/gex-analysis-nature-publication#example SRR7611046 and SRR7611048] from Sequence Read Archive (SRA) pages
** [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP073767 SRP073767] PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See [https://www.frontiersin.org/articles/10.3389/fgene.2020.00041/full Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey] Lytal 2020
** [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP073767 SRP073767] PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See [https://www.frontiersin.org/articles/10.3389/fgene.2020.00041/full Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey] Lytal 2020
Line 47: Line 83:
** [https://bioconductor.org/packages/3.12/workflows/vignettes/BP4RNAseq/inst/doc/vignette.html BP4RNAseq] package. SRR11402955, SRR11402974. Note several dependent tools need to be installed manually.
** [https://bioconductor.org/packages/3.12/workflows/vignettes/BP4RNAseq/inst/doc/vignette.html BP4RNAseq] package. SRR11402955, SRR11402974. Note several dependent tools need to be installed manually.
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75790 GSE75790]. See [https://www.sciencedirect.com/science/article/pii/S1097276517300497 Comparative Analysis of Single-Cell RNA Sequencing Methods] Ziegenhain 2017
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75790 GSE75790]. See [https://www.sciencedirect.com/science/article/pii/S1097276517300497 Comparative Analysis of Single-Cell RNA Sequencing Methods] Ziegenhain 2017
** [https://www.ncbi.nlm.nih.gov//geo/query/acc.cgi?acc=GSE137804 GSE137804]/[https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP222837 SRP222837]. 22 samples (GSM or SRX numbers), 42 runs (SRR numbers), 126 fastq.
** [https://www.ncbi.nlm.nih.gov//geo/query/acc.cgi?acc=GSE137804 GSE137804]/[https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP222837 SRP222837]. 22 samples (GSM or SRX numbers), 42 runs (SRR numbers). The fastq.gz files are also directly available in [https://www.ebi.ac.uk/ena/browser/view/PRJNA573097 European Nucleotide Archive] where we can see 14 cells have only Read 2 data without Read 1 data.
** GSE29087, GSE94820, [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835 GSE67835], GSE65528, GSE70850 from [https://academic.oup.com/bioinformatics/article/36/19/4860/5866544 POWSC] - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.
** GSE29087, GSE94820, [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835 GSE67835], GSE65528, GSE70850 from [https://academic.oup.com/bioinformatics/article/36/19/4860/5866544 POWSC] - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81076 GSE81076], [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86473 GSE86473], [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241 GSE85241], [https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/ E-MTAB-5061] from the mutual nearest neighbors (MNNs) method [https://www.nature.com/articles/nbt.4091#Sec23 Haghverdi] 2018
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81076 GSE81076], [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86473 GSE86473], [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241 GSE85241], [https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/ E-MTAB-5061] from the mutual nearest neighbors (MNNs) method [https://www.nature.com/articles/nbt.4091#Sec23 Haghverdi] 2018
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131309 GSE131309] used by [https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-022-03248-3 Subtyping of sarcomas based on pathway enrichment scores in bulk and single cell transcriptomes] 2022
* [https://academic.oup.com/database/article/doi/10.1093/database/baaa073/6008692 A curated database reveals trends in single-cell transcriptomics] Svensson 2020. And [https://www.nxn.se/single-cell-studies/gui a spreadsheeet] of data.
* [https://www.ebi.ac.uk/humancellatlas/project-catalogue/ A comprehensive list of cellular resolution datasets for the Human Cell Atlas]
== Curated Cancer Cell Atlas ==
[https://www.rna-seqblog.com/the-curated-cancer-cell-atlas-a-comprehensive-pan-cancer-single-cell-rna-sequencing-dataset/ The Curated Cancer Cell Atlas – a comprehensive pan-cancer single-cell RNA-sequencing dataset], [https://www.weizmann.ac.il/sites/3CA/ website]
== scRNAseq package ==
[https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html scRNAseq] package from Bioconductor.
[https://www.bioconductor.org/packages/release/bioc/html/ExperimentHub.html ExperimentHub] package used by scRNAseq, DuoClustering2018 and other packages.


== SRAToolkit ==
== SRAToolkit ==
* https://hpc.nih.gov/apps/sratoolkit.html. [https://github.com/ncbi/sra-tools source code], [https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/ Download SRA sequences from Entrez search results]
* https://hpc.nih.gov/apps/sratoolkit.html. [https://github.com/ncbi/sra-tools source code], [https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/ Download SRA sequences from Entrez search results]
* [https://www.biostars.org/p/376434/ Downloading paired end fastq from SRA], [https://github.com/ncbi/sra-tools/issues/399 fasterq-dump only returns one file for a paired-end sample] for scRNA-seq data. The solution '''fasterq-dump + '--include-technical -S' '''option according to https://github.com/ncbi/sra-tools/issues/399 (SRR9169172 is scRNA-Seq case) works. It yields the same number of reads when I check using the '''wc -l''' command though the md5sum are different.
* [https://rdrr.io/cran/geomedb/man/fasterqDump.html fasterqDump()] from the [https://cran.r-project.org/web/packages/geomedb/index.html geomedb] package.
* [https://rdrr.io/cran/geomedb/man/fasterqDump.html fasterqDump()] from the [https://cran.r-project.org/web/packages/geomedb/index.html geomedb] package.
* [https://www.biostars.org/p/460774/ SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?]
* [https://www.biostars.org/p/460774/ SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?]
Line 58: Line 106:
* [https://www.ncbi.nlm.nih.gov/sra/?term=SRR12148213 SRR12148213] (from [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP222837 SRP222837]). Note the corresponding sample has 8 runs.
* [https://www.ncbi.nlm.nih.gov/sra/?term=SRR12148213 SRR12148213] (from [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP222837 SRP222837]). Note the corresponding sample has 8 runs.
<pre>
<pre>
sinteractive  --gres=lscratch:20 --cpus-per-task=6
sinteractive  --gres=lscratch:800 --cpus-per-task=6 --mem=40g
  # the required memory can be checked by using the '''jobload''' command
  # the required local disk space can be obtained by 'ssh cnXXX' and run 'du -sh /lscratch/JobID'
  #    where the JobID is obtained through the jobload command
 
module load sratoolkit
module load sratoolkit


fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 # not working
fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 # 1 min, 3.3G fastq file
fastq-dump --split-3 SRR2048331                  # 2 min 30 sec
pigz -p6 SRR12148211*.fastq  # vs gzip. 6 threads are used here
                            # do not need to split threads,
                            # do not need to run separately
 
fastq-dump --split-3 --gzip SRR2048331  # 658M SRR2048331.fastq.gz (vs 626.9Mb in sra format)
fastq-dump --split-3 --gzip SRR2048331  # 658M SRR2048331.fastq.gz (vs 626.9Mb in sra format)


fastq-dump --split-3 --gzip SRR12148213  # 862M + 2.5G for fastq.gz files (vs 2Gb in sra format)
fastq-dump --split-3 --gzip SRR12148213  # 862M + 2.5G for fastq.gz files (vs 2Gb in sra format)
                                         # 6.5G + 15G for fastq files
                                         # 6.5G + 15G for fastq files
                                        # fastq/fastq.gz = 15/2.5 = 6
                                        # fastq.gz/sr = 3.4/2 = 1.7
# This is equivalent to  
# This is equivalent to  
# prefetch SRR12148213; cd SRR12148213
# prefetch SRR12148213; cd SRR12148213
Line 76: Line 135:
$ zcat SRR12148214_2.fastq.gz | wc -l
$ zcat SRR12148214_2.fastq.gz | wc -l
150635300
150635300
fasterq-dump -t /lscratch/$SLURM_JOB_ID \
  --split-files \
  -O /lscratch/$SLURM_JOBID SRR13526458; \
  pigz -p6 /lscratch/$SLURM_JOBID/SRR13526458*.fastq; \
  cp /lscratch/$SLURM_JOBID/SRR13526458*.fastq.gz ~/PDACPDX
</pre>
<pre>
sbatch --gres=lscratch:800 --mem=40g --cpus-per-task=6 --time=12:00:00 myscript
</pre>
</pre>
And the fasterq-dump's [https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump wiki] and the help page
And the fasterq-dump's [https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump wiki] and the help page
Line 143: Line 211:
"fasterq-dump" version 2.10.9
"fasterq-dump" version 2.10.9
</pre>
</pre>
Note that fastq.gz vs fastq file size is about 1:8 ratio.
<pre>
$ ls -lhogt | sed 's/^[^ ][^ ]*  *[^ ][^ ]* //'
23G May  7 11:35 SRR10156301_2.fastq
18G May  7 11:35 SRR10156301_1.fastq
3.1G May  7 11:02 SRR10156301_2.fastq.gz
1.4G May  7 11:02 SRR10156301_1.fastq.gz
</pre>
= Simulation =
* [https://www.biorxiv.org/content/10.1101/2021.06.01.446157v1 A benchmark study of simulation methods for single-cell RNA sequencing data], [https://www.rna-seqblog.com/a-benchmark-study-of-simulation-methods-for-single-cell-rna-sequencing-data/ Blog]
* [https://www.biorxiv.org/content/10.1101/2021.11.15.468676v1?rss=1 Built on sand: the shaky foundations of simulating single-cell RNA sequencing data] Crowell 2021
== Splatter: Simulation Of Single-Cell RNA Sequencing Data ==
[http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1 biorxiv]. [https://pubmed.ncbi.nlm.nih.gov/28899397/ Genome Biol. 2017], [http://www.bioconductor.org/packages/release/bioc/html/splatter.html splatter] package. We can simulate data using 1) estimated parameters from an existing SingleCellExperiment object, or 2) user's specified parameters.
<pre>
suppressPackageStartupMessages({
  library(splatter)
  library(scater)
})
library(scRNAseq)
sce2 <- MacoskoRetinaData()
sce2
ncells <- 1000
params <- splatEstimate(as.matrix(assays(sce2)$counts[, 1:ncells]))
sim <- splatSimulate(params)
comparison <- compareSCEs(list(MacoskoRetina = sce2[, 1:ncells], Splat = sim))
ggplot(comparison$ColData, aes(x = sum, y = detected, colour = Dataset)) +
    geom_point()
</pre>
== splatPop ==
[https://www.rna-seqblog.com/splatpop-simulating-population-scale-single-cell-rna-sequencing-data/ splatPop – simulating population scale single-cell RNA sequencing data]


= Preprocess =
= Preprocess =
* [https://twitter.com/mcgovernmit/status/1379514277191639042?s=20 What does single-cell RNA sequencing look like?] Ideal droplet containing both a cell and RNA-capture reagents.
* [https://www.rna-seqblog.com/preprocessing-method-was-found-to-be-less-important-than-other-steps-in-the-scrna-seq-analysis-process/ Preprocessing method was found to be less important than other steps in the scRNA-seq analysis process] and the paper Benchmarking UMI-based single-cell RNA-seq preprocessing workflows 2021
== fastq file format, UMI ==
* [https://hbctraining.github.io/scRNA-seq_online/lessons/02_SC_generation_of_count_matrix.html Single-cell RNA-seq data - raw data to count matrix]
** Reads with '''different UMI''' are biological duplicates - each read should be counted.
** Reads with the '''same UMI''' are are technical duplicates - the UMIs should be collapsed to be counted as a single read.
* Read1 contains '''barcode''' (represents cells) and '''UMI''' (represents reads), Read2 contains '''transcript sequence'''.
* One case from [https://bioconductor.org/packages/release/bioc/vignettes/scruff/inst/doc/scruff.html scruff] vignette.
<ul>
<li>Extract UMI from fastq using the [https://umi-tools.readthedocs.io/en/latest/reference/extract.html UMI-tools] and [https://www.biostars.org/p/269091/ Extracting UMI sequences from paired-end reads].
<ul>
<li>[https://www.bioinformatics.babraham.ac.uk/training/10XRNASeq/10X_Introduction.pdf#page=6 *How 10X RNA-Seq Works]. '''Cell level barcode (16bp)''' – same for all RNAs in a cell. UMI (10bp) – unique for one RNA in one cell. </li>
<li>[https://elixir-iib-training.github.io/2019-05-07-pozzuoli-singlecell/pres/Panariello_Theory_refresher_and_cellranger_overview.pdf#page=6 Theory Refresher and Software Overview: Cell Ranger] by ELIXIR-IIB Training Platform. ''More details on Reads 1 & 2.'' </li>
<li>[https://www.jianshu.com/p/c6eac2e1f02a 提取barcode和过滤reads], [https://www.jianshu.com/p/d4d7d0fab004 umi_tools for dealing with UMIs and cell barcodes] (簡書). 10X is different from Drop-seq.
<pre>
umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin hgmm_100_R1.fastq.gz \
                  --stdout hgmm_100_R1_extracted.fastq.gz \
                  --read2-in hgmm_100_R2.fastq.gz \
                  --read2-out=hgmm_100_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --whitelist=whitelist.txt
</pre>
</li>
<li>Barcode extraction for Drop-seq. For Drop-seq, the current protocol includes a 12bp CB, followed by an 8bp UMI.
<pre>
--extract-method=string
--bc-pattern=CCCCCCCCCCCCNNNNNNNN
</pre>
</li>
<li>[https://training.galaxyproject.org/topics/transcriptomics/tutorials/scrna-preprocessing-tenx/tutorial.html#10x-chemistries *Pre-processing of 10X Single-Cell RNA Datasets]. 10x '''Chromiumv2''' and '''Chromiumv3''' '''Chemistries/reagent kits''' have the same 16bp of CB but different bp of UMI.
<li>[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4654672 GSM4654672] also confirms Read 1 & Read 2 contents.</li>
</ul>
</li>
</ul>
* UMI ('''Unique molecular identifier''')
** [https://en.wikipedia.org/wiki/Unique_molecular_identifier Wikipedia]
** [https://umi-tools.readthedocs.io/en/latest/Single_cell_tutorial.html Single cell tutorial] from the documentation of '''UMI-tools''' (Tools for dealing with Unique Molecular Identifiers).
** [https://dnatech.genomecenter.ucdavis.edu/faqs/what-are-umis-and-why-are-they-used-in-high-throughput-sequencing/ What are UMIs and why are they used in high-throughput sequencing?] ''UMI sequence information in conjunction with alignment coordinates enables grouping of sequencing data into read families representing individual sample DNA or RNA fragments.''  UMIs are especially helpful for very low input or high depth sequencing.
** [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1438-9 UMI-count modeling and differential expression analysis for single-cell RNA sequencing]. read-count VS UMI-count. Data & code are available.
** [https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/ Unique Molecular Identifiers – the problem, the solution and the proof]
* CellRanger count
** [https://kb.10xgenomics.com/hc/en-us/articles/115003710383-Which-reads-are-considered-for-UMI-counting-by-Cell-Ranger- Which reads are considered for UMI counting by Cell Ranger?]
** [https://kb.10xgenomics.com/hc/en-us/articles/360002364512-How-can-I-get-read-counts-for-the-observed-UMIs- How can I get read counts for the observed UMIs?]
** [https://kb.10xgenomics.com/hc/en-us/articles/360007068611-How-do-I-get-the-read-counts-for-each-barcode- How do I get the read counts for each barcode?]
** [https://kb.10xgenomics.com/hc/en-us/articles/360004350911-How-can-the-percentage-of-reads-aligned-to-Exonic-Regions-be-greater-than-those-aligned-to-the-Transcriptome- How can the percentage of reads aligned to "Exonic Regions" be greater than those aligned to the "Transcriptome"?]
== Cell ranger ==
== Cell ranger ==
<ul>
<ul>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_in 10X Genomics] </li>
<li>[https://www.10xgenomics.com/product-list/#single-cell Chromium Single Cell Gene Expression]
<ul>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview Gene Expression Algorithms Overview] </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build  Prebuild reference genomes/packages] </li>
<li>[https://www.10xgenomics.com/instruments/chromium-controller You can process between 1–8 samples in one run, loading up to 10,000 cells per sample] and [https://www.biotech.wisc.edu/services/gec/services/10X-Genomics-single-cell-services other statistics] </li>
<li>[https://youtu.be/k9VFNLLQP8c?t=702 You can sequence up to 25,000 cells in each lanes of the chip and this chip has 8 lanes on there. So potentially you could process 200,000 cells.] </li>
<li>
[https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled- Note that for the Single Cell Gene Expression assay, it is possible to target up to 30,000 cells per '''library''' if using the 3' CellPlex Kit for Cell Multiplexing. Cell '''multiplets''' can be bioinformatically identified and filtered out when using the 3' CellPlex Kit.] </li>
<li>
[https://kb.10xgenomics.com/hc/en-us/articles/115002022743-What-is-the-recommended-sequencing-depth-for-Single-Cell-3-and-5-Gene-Expression-Libraries- What is the recommended sequencing '''depth''' for Single Cell 3' and 5' Gene Expression libraries?] </li>
<li>[https://medicine.uiowa.edu/humangenetics/genomics-sequencing-division/genome-sequencing/single-cell-expression-analysis-scrna-seq 10X Genomics Chromium Single-Cell System] </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_in Installing cell ranger] - 10X Genomics </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/overview/system-requirements System Requirements]. How CPU and Memory Affect Runtime. </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct Running cellranger count],</li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview Understanding Output] & [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/summary cellranger count Web Summary] </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/ui The Cell Ranger User Interface]. By default, this UI is exposed at an operating-system assigned port, with a randomly-generated authentication token to restrict access. </li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input *Specifying Input FASTQ Files for 10x Pipelines],</li>
<li>[https://www.jianshu.com/p/d146d78f6f95 SRA上那些奇怪的10x Genomics数据] 如何解決有些 RUN 只有一个 fastq文件 </li>
<li>[https://kb.10xgenomics.com/hc/en-us/articles/115003802691-How-do-I-prepare-Sequence-Read-Archive-SRA-data-from-NCBI-for-Cell-Ranger- How do I prepare Sequence Read Archive (SRA) data from NCBI for Cell Ranger?]
</li>
<li>[https://www.biostars.org/p/186741/ What is the difference between --split-files and --split-3 when using fastq-dump] They are the same when I use '''diff''' or '''cmp''' or '''md5sum''' to compare SRR12148212_1.fastq file (7.6G). The files are downloaded by ''fasterq-dump'' without or with the ''--split-files'' option. Note using md5sum to compare the compressed (gz) files may shows differences though the file sizes are the same. So we should be careful when we use md5sum to compare binary files.</li>
</ul>
</li>
<li>PDX/Multiple Species data. [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references Creating a Reference Package with cellranger mkref]
<li>[https://satijalab.org/seurat/articles/pbmc3k_tutorial.html pbmc3k]. The data is saved in ''filtered_gene_bc_matrices/hg19'' directory with 3 files, barcodes.tsv (45K), genes.tsv (798K) and matrix.mtx (27M).</li>
<li>[https://satijalab.org/seurat/articles/pbmc3k_tutorial.html pbmc3k]. The data is saved in ''filtered_gene_bc_matrices/hg19'' directory with 3 files, barcodes.tsv (45K), genes.tsv (798K) and matrix.mtx (27M).</li>
<li>[https://davetang.org/muse/2018/08/09/getting-started-with-cell-ranger/ Getting started with Cell Ranger] by Dave Tang</li>
<li>[https://ucdavis-bioinformatics-training.github.io/2020-Advanced_Single_Cell_RNA_Seq/data_reduction/scMapping Workshop] from UC Davis
<li>[https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/Single_Cell_RNAseq/Chromium_Cell_Ranger.html#gsc.tab=0 *10x genomics single-cell RNAseq analysis from SRA data using Cell Ranger and Seurat], bioinformaticsworkbook.org (useful) </li>
<li>[https://www.biostars.org/p/425118/ how to get/make/find fastq file from cellranger?]
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices filtered_feature_bc_matrix vs raw_feature_bc_matrix]. "Filtered feature-barcode matrix" contains only '''detected''' cellular barcodes. <raw_feature_bc_matrix> contains mainly empty barcodes. That is, the difference is the number of cells.
<ul>
<li>3 files are created under 'filtered_feature_bc_matrix' directory: barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz</li>
<li>[https://www.biostars.org/p/482228/ Using filtered_feature_bc_matrix for scRNAseq samples] </li>
</ul>
</li>
<li>https://hpc.nih.gov/apps/cellranger.html
<li>https://hpc.nih.gov/apps/cellranger.html
<ul>
<ul>
<li>Reference genome
{{Pre}}
module load cellranger/5.0.0
echo $CELLRANGER_REF/refdata-gex-GRCh38-2020-A
# /fdb/cellranger/refdata-gex-GRCh38-2020-A
ls /fdb/cellranger/refdata-gex-GRCh38-2020-A
# fasta  genes  pickle  reference.json  star
ls /fdb/cellranger/
# refdata-cellranger-1.1.0        refdata-cellranger-GRCh38-1.2.0          refdata-cellranger-vdj-GRCh38-alts-ensembl-2.0.0
# refdata-cellranger-1.2.0        refdata-cellranger-GRCh38-3.0.0          refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0
# refdata-cellranger-2.0.0        refdata-cellranger-GRCh38-and-mm10-3.1.0  refdata-cellranger-vdj-GRCh38-alts-ensembl-4.0.0
# refdata-cellranger-2020-A        refdata-cellranger-hg19-1.2.0            refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0
# refdata-cellranger-2.1.0        refdata-cellranger-hg19-3.0.0            refdata-cellranger-vdj-GRCm38-alts-ensembl-2.2.0
# refdata-cellranger-2.2.0        refdata-cellranger-hg19_and_mm10-1.2.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-3.1.0
# refdata-cellranger-3.0.0        refdata-cellranger-hg19_and_mm10-2.1.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-4.0.0
# refdata-cellranger-3.1.0        refdata-cellranger-hg19-and-mm10-3.0.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0
# refdata-cellranger-4.0.0        refdata-cellranger-mm10-1.2.0            refdata-gex-GRCh38-2020-A
# refdata-cellranger-5.0.0        refdata-cellranger-mm10-2.1.0            refdata-gex-GRCh38_and_mm10-2020-A
# refdata-cellranger-ercc92-1.2.0  refdata-cellranger-mm10-3.0.0            refdata-gex-mm10-2020-A
ls /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes
# genes.gtf
wc -l /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
# 2765974 /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
# One column contains "gene_name"
head -n 6 /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
##provider: GENCODE
##format: gtf
##date: 2019-09-05
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2; hgnc_id "HGNC:52482"; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
</pre>
</li>
<li>'''cellranger mkfastq''' wraps Illumina's bcl2fastq to correctly demultiplex Chromium-prepared sequencing samples and to convert barcode and read data to FASTQ files. </li>
<li>'''cellranger mkfastq''' wraps Illumina's bcl2fastq to correctly demultiplex Chromium-prepared sequencing samples and to convert barcode and read data to FASTQ files. </li>
<li>'''cellranger count''' takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-cell matrices and perform clustering and gene expression analysis.</li>
<li>[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count cellranger count] takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-cell matrices and perform clustering and gene expression analysis.</li>
<li>'''cellranger aggr''' aggregates results from cellranger count. </li>
<li>'''cellranger aggr''' aggregates results from cellranger count. [https://kb.10xgenomics.com/hc/en-us/articles/115004217543-How-does-cellranger-aggr-normalize-for-sequencing-depth-among-multiple-libraries- How does cellranger aggr normalize for sequencing depth among multiple libraries?] </li>
<li>'''cellranger reanalyze''' takes feature-barcode matrices produced by cellranger count or aggr and re-runs the dimensionality reduction, clustering, and gene expression algorithms. </li>
<li>'''cellranger reanalyze''' takes feature-barcode matrices produced by cellranger count or aggr and re-runs the dimensionality reduction, clustering, and gene expression algorithms. </li>
<li>The example code will generate an output directory with
<li>The example code will generate an output directory with
Line 219: Line 442:
│   ├── barcodes.tsv.gz
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
│   └── matrix.mtx.gz   #  36601 x 25 after dim(Read10X(data.dir ="filtered_feature_bc_matrix"))
├── filtered_feature_bc_matrix.h5
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── metrics_summary.csv
Line 228: Line 451:
│   ├── barcodes.tsv.gz
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
│   └── matrix.mtx.gz # 36601 x 737280
├── raw_feature_bc_matrix.h5
├── raw_feature_bc_matrix.h5
└── web_summary.html
└── web_summary.html


31 directories, 41 files
31 directories, 41 files
</pre>
<pre>
# The 2nd column contains gene symbol
$ zcat s1/outs/filtered_feature_bc_matrix/features.tsv.gz | head -5
ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 AL627309.3 Gene Expression
# R
> x <- Read10X("s1/outs/filtered_feature_bc_matrix")
> dim(x)
[1] 36601    25
> x[1:5, 1:3]
5 x 3 sparse Matrix of class "dgCMatrix"
            AACGTTGGTTAAAGTG-1 AGTAGTCAGAGCTATA-1 ATCTGCCCATACTCTT-1
MIR1302-2HG                  .                  .                  .
FAM138A                      .                  .                  .
OR4F5                        .                  .                  .
AL627309.1                  .                  .                  .
AL627309.3                  .                  .                  .
> length(unique(rownames(x)))
[1] 36601
</pre>
</pre>
</li>
</li>
</ul>
</ul>
</li>
</li>
<li>[http://dors.weizmann.ac.il/course/course2018/scRNA-Seq.pdf#oage=31 Why do we have Barcodes with Counts that are not Called Cells? Ambient RNA], [http://dors.weizmann.ac.il/course/course2018/scRNA-Seq.pdf#page=30 A barcode is called a cell if Total UMI count > M/10 is called a cell], M = Cell with maximum total UMI counts (99th percentile of the top N) </li>
<li>[http://dors.weizmann.ac.il/course/course2018/scRNA-Seq.pdf#page=14 Reads are considered duplicated, if they map to the same gene and have the same UMI]. Instead of counting reads we will count number of unique UMIs per gene.</li>
</ul>
</ul>
=== 10x Genomics fragment schematic ===
[http://dors.weizmann.ac.il/course/workshop2021/scRNA/Gil_single_cell_sequencing_Workshop_Dec2020.pdf#page=34 10x Genomics fragment schematic]
[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/molecule_info 10x Molecule Info]. This is referenced by the [https://rdrr.io/github/MarioniLab/DropletUtils/man/read10xMolInfo.html DropletUtils::read10xMolInfo()] function.
=== plot ===
This is an example (GSM4088780) of part of "web_summary.html" output from running cellranger count 2.1.1.
[[File:Cellrangersum.png|300px]]
We can create a similar plot using the [https://rdrr.io/bioc/DropletUtils/man/barcodeRanks.html DropletUtils::barcodeRanks()] function. Here is the plot from running the example.
[[File:BarcodeRanks.png|300px]]
* 11,100 '''barcoded droplets'''
* totals =colSums(mat). It has 877 unique values. Range is [0, 2344].
* lower=100. Only cells satisfies the cutoff will be used in fitting the curve and find left.edge, right.edge, knee, inflection.
* run.totals  877 values. Range [0, 2344]
* run.rank  877 values. Range [1, 10900.5]
* out = 11,100 x 3. Columns include rank, total, fitted
* left.edge=68, rigth.edge=653. Associate with the x-axis. Relative to x, y.
* knee=657, inflection=321 (inflection <- 10^(y[right.edge])). Used in the y-axis
* In cellranger count, y-axis = UMI counts, x-axis = Barcodes.
* In cellranger count, green points = '''cells''', gray points = '''background'''.
=== emptyDrops() and DropletQC ===
[https://rdrr.io/bioc/DropletUtils/man/emptyDrops.html emptyDrops()]
* Identify likely cell-containing droplets.
* It returns FDR for each barcoded droplets. Users define a cutoff to subset the matrix to the cell-containing droplets.
* [https://pubmed.ncbi.nlm.nih.gov/30902100/ EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data], [https://github.com/MarioniLab/EmptyDrops2017 source code]
<pre>
set.seed(0)
my.counts <- DropletUtils:::simCounts()
# Identify likely cell-containing droplets.
out <- emptyDrops(my.counts)
out
# DataFrame with 11100 rows and 5 columns
#          Total  LogProb    PValue  Limited        FDR
#      <integer> <numeric> <numeric> <logical>  <numeric>
# 1            2        NA        NA        NA        NA
# 2            9        NA        NA        NA        NA
# ...        ...      ...      ...      ...        ...
# 11099      191  -228.763 9.999e-05      TRUE 0.00013799
# 11100      198  -233.043 9.999e-05      TRUE 0.00013799
is.cell <- out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)  # 943
</pre>
[https://www.rna-seqblog.com/dropletqc-improved-identification-of-empty-droplets-and-damaged-cells-in-single-cell-rna-seq-data/ DropletQC – improved identification of empty droplets and damaged cells in single-cell RNA-seq data]
=== estimateAmbience() ===
[https://rdrr.io/bioc/DropletUtils/man/estimateAmbience.html estimateAmbience()].
This function obtains an estimate of the composition of the ambient pool of RNA based on the barcodes with total UMI counts less than or equal to "''lower''". The default approach is to assume that all barcodes with total counts less than or equal to "''lower''" are empty.
It returns a numeric vector of length equal to nrow(m), containing the '''estimated the transcript proportions''' in the ambient solution.
<pre>
set.seed(0)
my.counts <- DropletUtils:::simCounts()
dim(my.counts) # [1]  100 11100
ambience <- estimateAmbience(my.counts)
head(ambience)
#        GENE1        GENE2        GENE3        GENE4        GENE5        GENE6
# 0.0002213677 0.0003984648 0.0006331183 0.0008146427 0.0010935705 0.0012042561
length(ambience)  #  [1] 100
</pre>
=== removeAmbience() ===
[https://rdrr.io/bioc/DropletUtils/man/removeAmbience.html removeAmbience()]
Estimate and remove the ambient profile from a count matrix, given pre-existing groupings of similar cells. This function is largely intended for plot beautification rather than real analysis.
It returns a numeric matrix-like object of the same dimensions as y, containing the counts after removing the ambient contamination.
<pre>
ngenes <- 1000
set.seed(1)
ambient <- runif(ngenes, 0, 0.1)  # length = 1000
cells <- c(runif(100) * 10, integer(900))  # length = 1000
y <- matrix(rpois(ngenes * 100, cells + ambient), nrow=ngenes)
# Pretending that all cells are in one group, in this example.
removed <- removeAmbience(y, ambient, groups=rep(1, ncol(y)))
dim(y)      # [1] 1000  100
dim(removed) # [1] 1000  100
summary(as.vector(y))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  0.0000  0.0000  0.0000  0.5423  0.0000 20.0000
summary(as.vector(removed))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  0.000  0.000  0.000  0.515  0.000  20.000
</pre>
== Use tools other than cellranger ==
See [https://scrnaseq-course.cog.sanger.ac.uk/website/processing-raw-scrna-seq-data.html Chapter 3 Processing Raw scRNA-seq Data] and [https://scrnaseq-course.cog.sanger.ac.uk/website/construction-of-expression-matrix.html Chapter 4 Construction of expression matrix].
(non-UMI) [https://www.kallistobus.tools/ Kallisto-Bustools] or Alevin, featureCounts, RSEM. Note Kallisto can do both alignment and gene counting.
(UMI-based) UMI-tools & zUMIs, [https://www.rna-seqblog.com/zumis-a-fast-and-flexible-pipeline-to-process-rna-sequencing-data-with-umis/ zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs]
== Imputation ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05417-7 Evaluating imputation methods for single-cell RNA-seq data]
== PDX ==
* https://www.biostars.org/p/449211/
== WASP – a web-accessible single cell RNA-Seq processing ==
[https://www.rna-seqblog.com/wasp-a-versatile-web-accessible-single-cell-rna-seq-processing-platform/ WASP – a versatile, web-accessible single cell RNA-Seq processing platform]


= Seurat* =
= Seurat* =
* http://satijalab.org/seurat/. It has several vignettes.
* http://satijalab.org/seurat/. It has several vignettes.
** [https://satijalab.org/scgd21/ Workshops]
** [https://github.com/satijalab/seurat/wiki/Source-Code-File-Structure-and-Organization Source Code File Structure and Organization]
* https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
* https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
* [https://github.com/dbdimitrov/BingleSeq/ BingleSeq] - A user-friendly R package for Bulk and Single-cell RNA-Seq data analyses.
* [https://github.com/dbdimitrov/BingleSeq/ BingleSeq] - A user-friendly R package for Bulk and Single-cell RNA-Seq data analyses.
Line 310: Line 673:
</li>
</li>
</ul>
</ul>
== seurat-wrappers ==
https://github.com/satijalab/seurat-wrappers


== Seurat object ==
== Seurat object ==
Line 315: Line 681:
* [https://stackoverflow.com/a/59672771 Seurat DimPlot - Highlight specific groups of cells in different colours]. seuratObject$meta.data, integrated@[email protected], slotNames(seuratObject), slot(seuratObject, "meta.data").
* [https://stackoverflow.com/a/59672771 Seurat DimPlot - Highlight specific groups of cells in different colours]. seuratObject$meta.data, integrated@[email protected], slotNames(seuratObject), slot(seuratObject, "meta.data").
* Add a new column of meta data. ?AddMetaData
* Add a new column of meta data. ?AddMetaData
=== subset genes ===
<Method 1>: [https://github.com/satijalab/seurat/issues/321 How to subset on genes and preserve idents?]
<pre>
seurat.object # original seurat object
set.seed(1); genes.use <- sample(1:nrow(seurat.object), 1000)
subset.matrix <- seurat.object[['RNA']]@data[genes.use, ]
seurat.object_sub <- CreateSeuratObject(subset.matrix)
seurat.object_sub <-  SetIdent(seurat.object_sub,
                          value=as.factor([email protected]$manual_cluster_tree_addmodulescore))
</pre>
<Method 2>: subset() function. See [https://satijalab.org/seurat/articles/essential_commands.html Seurat Command List].
<Method 3>: [https://www.jianshu.com/p/50e2b6ff10d8 Seurat Weekly NO.2 || 我该如何取子集?(Downsampling)]
== Seurat methods ==
* [https://www.bioinformatics.babraham.ac.uk/training/10XRNASeq/R%20packages%20for%20SCRNA.pdf#page=15 Seurat Methods]
* [https://satijalab.org/seurat/articles/essential_commands.html Seurat Command List]
* [https://satijalab.org/seurat/reference/index.html Reference]
* Most important
** <nowiki>pbmc_small[['RNA']]@counts, pbmc_small[['RNA']]@data, pbmc_small[['RNA']]@scale.data </nowiki>
** <nowiki>pbmc[["pca"]], pbmc[["umap"]] </nowiki>


== QC ==
== QC ==
Line 383: Line 773:
* [https://satijalab.org/seurat/articles/atacseq_integration_vignette.html Integrating scRNA-seq and scATAC-seq data]
* [https://satijalab.org/seurat/articles/atacseq_integration_vignette.html Integrating scRNA-seq and scATAC-seq data]
* [https://satijalab.org/seurat/reference/findintegrationanchors FindIntegrationAnchors()]
* [https://satijalab.org/seurat/reference/findintegrationanchors FindIntegrationAnchors()]
* [https://www.nxn.se/valent/2020/11/28/s9jjv32ogiplagwx8xrkjk532p7k28 Cell type abundance analysis in scRNA-seq]
== DE analysis ==
* 'samples' was used to mean 'cells' according to the vignette [https://satijalab.org/seurat/articles/pbmc3k_tutorial.html Seurat - Guided Clustering Tutorial]
* '''Idents(seuratObject)'''
* seuratObject <- '''SetIdent'''(seuratObject, value=as.factor())
* seuratObject <- '''RenameIdents'''(seuratObject, new.cluster.ids)
* By default, it ('''FindAllMarkers()''') identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.
** '''FindMarkers()'''
** [https://github.com/satijalab/seurat/issues/4689 FindMarkers: Limma for large datasets, sparse vs dense matrix as an input to the Wilcoxon test]. 2 * min(limma::rankSumTestWithCorrelation(j, data.use[1,])) [https://github.com/satijalab/seurat/blob/13b615c27eeeac85e5c928aa752197ac224339b9/R/differential_expression.R#L2019 is used] to compute the p-value for the 1st gene when the number of cells is not too large. If the number of cells is too large, wilcox.test(data.use[x, ] ~ group.info[, "group"])$p.value [https://github.com/satijalab/seurat/blob/13b615c27eeeac85e5c928aa752197ac224339b9/R/differential_expression.R#L2019 is computed].
** [https://github.com/satijalab/seurat/issues/2950#issuecomment-632769038 NAs produced by integer overflow #2950]
* '''Groups'''
* The vignettes only give examples about the comparisons between clusters.
* For comparison between samples, see
** [https://github.com/satijalab/seurat/issues/325 How to find the difference gene between samples #325].
** [https://www.biostars.org/p/415850/ Calculating differential expression genes between two sample sequenced using 10x] where the vignette [https://satijalab.org/seurat/articles/integration_introduction.html Introduction to scRNA-seq integration] was recommended.
** [https://bioinformatics.stackexchange.com/a/5460 Using Seurat to compare mutant vs.wt]
** [https://hbctraining.github.io/scRNA-seq_online/lessons/pseudobulk_DESeq2_scrnaseq.html Pseudobulk differential expression analysis] via Bioconductor.
* [https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/dechapter.html 12 Differential Expression (DE) analysis] from the ebook 'Analysis of single cell RNA-seq data' Davis J. McCarthy
* [https://www.nature.com/articles/nmeth.4612 Bias, robustness and scalability in single-cell differential expression analysis] Soneson 2018
* Run in parallel for FindAllMarkers() on a large data (33694 features, 114331 + 41956 cells) will result in different errors I use
** parLapply: Error in unserialize(node$con) : error reading from connection. See [https://community.rstudio.com/t/going-parallel-only-intermittently-works-with-the-same-code/13382/10 Going parallel only intermittently works with the same code]. The same error message can be reproduced here: [http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/#Memory_load How-to go parallel in R – basics + tips]. '''Using makeCluster(no_cores, type="FORK") is an important tool for handling memory ceilings. As they link to the original variable address the fork will not require any time for exporting variables or take up any additional space when using these.'''
** mclapply: scheduled cores 4, 6, 8 did not deliver results, all values of the jobs will be affected
** foreach: Error in unserialize(<nowiki>socklist[[n]]</nowiki>) : error reading from connection
** future: No error but one cluster returns no genes. Running FindAllMarkers() with 2 clusters does give an error. See [https://stackoverflow.com/a/40568183 how to adjust future.globals.maxSize in R?]
=== Cluster-free DE analysis ===
https://twitter.com/satijalab/status/1635641897560559617 LEMUR and miloDE
== Visualization ==
* DoHeatmap() ([https://github.com/satijalab/seurat/blob/4e868fcde49dc0a3df47f94f5fb54a421bfdf7bc/R/visualization.R source code]) and a modified version called [https://www.jianshu.com/p/4f81b65abd8b DoMultiBarHeatmap()] to include multiple color bars for samples/cells.


== Azimuth ==
== Azimuth ==
[https://azimuth.hubmapconsortium.org/ Azimuth] - App for reference-based single-cell analysis
[https://azimuth.hubmapconsortium.org/ Azimuth] - App for reference-based single-cell analysis
== Public data examples ==
* [https://youtu.be/IjJOTJsd4Mg How to analyze 10X Single Cell RNA-seq data with R| Seurat Package Tutorial] (youtube)
== GUI/web applications ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04472-2 Asc-Seurat – Analytical single-cell Seurat-based web application] 2021
== tidyseurat ==
[https://cran.r-project.org/web/packages/tidyseurat/ tidyseurat], paper in [https://www.biorxiv.org/content/10.1101/2021.03.26.437294v1 biorxiv] & in [https://academic.oup.com/bioinformatics/article/37/22/4100/6283576 Bioinformatics]
== Georges Seurat ==
[https://www.google.com/doodles/georges-seurats-162nd-birthday Georges Seurat’s 162nd Birthday] by Google


= Bioconductor packages =
= Bioconductor packages =
Line 415: Line 849:
| tximeta, SingleCellExperiment, fishpond, scran, Seurat
| tximeta, SingleCellExperiment, fishpond, scran, Seurat
|}
|}
== scBubbletree ==
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05927-y scBubbletree: computational approach for visualization of single cell RNA-seq data] 2024


= GUI =
= GUI =
== NIDAP ==
== NIDAP ==
https://nidap.nih.gov/ which uses [https://nidap.nih.gov/workspace/documentation/product/code-workbook/overview.html Code workbook] (Code Workbook is an application that allows users to analyze and transform data using an intuitive graphical interface) for the interaction. The R/python analysis code can be accessed just like GEO2R from GEO. The background material is available on [https://btep.ccr.cancer.gov/nidap-single-cell-rna-seq-tutorials/ here].
https://nidap.nih.gov/ which uses [https://nidap.nih.gov/workspace/documentation/product/code-workbook/overview.html Code workbook] (Code Workbook is an application that allows users to analyze and transform data using an intuitive graphical interface) for the interaction. The R/python analysis code can be accessed just like GEO2R from GEO. The background material is available on [https://btep.ccr.cancer.gov/nidap-single-cell-rna-seq-tutorials/ here].
== SCTK2 ==
[https://www.rna-seqblog.com/interactive-analysis-of-single-cell-data-using-flexible-workflows-with-the-single-cell-toolkit-2/ Interactive analysis of single-cell data using flexible workflows with the Single-Cell Toolkit 2]


== Partek ==
== Partek ==
[https://www.partek.com/past-webinars/ Past Webinars]


== BingleSeq ==
== BingleSeq ==
[https://peerj.com/articles/10469/ BingleSeq]: a user-friendly R package for bulk and single-cell RNA-Seq data analysis
[https://peerj.com/articles/10469/ BingleSeq]: a user-friendly R package for bulk and single-cell RNA-Seq data analysis
== Cellar ==
[https://www.nature.com/articles/s41467-022-29744-0 Interactive single-cell data analysis using Cellar] 2022
== ICARUS ==
[https://www.rna-seqblog.com/icarus-an-interactive-web-server-for-single-cell-rna-seq-analysis/ ICARUS – an interactive web server for single cell RNA-seq analysis]


= How many cells are in the human body? =
= How many cells are in the human body? =
Line 431: Line 878:
<ul>
<ul>
<li>
<li>
[https://academic.oup.com/bioinformatics/article/33/21/3486/3952669 powsimR]: power analysis for bulk and single cell RNA-seq experiments. It is time-consuming and error-prone to install lots of [https://github.com/bvieth/powsimR required packages]. Better to have a Docker image.
[https://academic.oup.com/bioinformatics/article/33/21/3486/3952669 powsimR]: power analysis for bulk and single cell RNA-seq experiments. It is time-consuming and error-prone to install lots of [https://github.com/bvieth/powsimR required packages]. Better to have a Docker image. See my note [https://github.com/arraytools/powsimR/blob/main/README.md here].
</li>
</li>
<li>[https://pubmed.ncbi.nlm.nih.gov/31718533/ SCOPIT: sample size calculations for single-cell sequencing experiments]
<li>[https://pubmed.ncbi.nlm.nih.gov/31718533/ SCOPIT: sample size calculations for single-cell sequencing experiments]
Line 474: Line 921:


= Spike-in =
= Spike-in =
* [https://en.wikipedia.org/wiki/RNA_spike-in RNA spike-in]. An RNA spike-in is an RNA transcript of known sequence and quantity used to '''calibrate''' measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
* [https://en.wikipedia.org/wiki/RNA_spike-in RNA spike-in].  
** In the context of RNA-seq, “spike-in” data refers to the '''addition of a known quantity of synthetic RNA to a sample''' before sequencing. This synthetic RNA, known as an RNA spike-in, has a known sequence and is used to calibrate measurements and normalize the data. By comparing the observed signal from the spike-in RNA to its known quantity, researchers can correct for technical variability and improve the accuracy of their measurements.
** An RNA spike-in is an RNA transcript of known sequence and quantity used to '''calibrate''' measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
* Spike-in genes are genes that are artificially added to a biological sample, typically at a known concentration, in order to serve as a reference or control for subsequent gene expression analysis.
** Spike-in genes can be used to help normalize gene expression data and account for variations in sample preparation, handling, and sequencing. By adding known quantities of a spike-in gene to a sample, researchers can determine if the sample preparation and sequencing were performed correctly and if there were any systematic biases or variations in the data.
** Spike-in genes are typically added to a sample at a constant concentration or a range of concentrations, depending on the specific experimental design. The spike-in genes can be used to monitor technical variability across samples and experiments, and to help ensure that any observed differences in gene expression are due to biological differences and not technical variations.
** There are different types of spike-in genes that can be used depending on the experimental design, such as synthetic RNA molecules, bacterial genes, or exogenous control genes. Spike-in genes are commonly used in gene expression analysis techniques such as microarrays, RNA sequencing, and quantitative polymerase chain reaction (qPCR).
* How genes can be added to a biological sample
** Transfection
** Microinjection
** Electroporation
** Viral delivery
<ul>
<li>Numerical example
* Let's say a researcher wants to compare the gene expression levels of Sample A and Sample B. The researcher prepares both samples for RNA sequencing, and adds a known amount of a spike-in RNA molecule to both samples. The spike-in RNA molecule is designed to have a known sequence and a concentration of 50,000 copies per microliter.
* The researcher then performs RNA sequencing on both samples and quantifies the expression levels of the spike-in RNA molecule and the endogenous genes. The following table shows the '''measured counts''' of the spike-in RNA molecule and a selected endogenous gene, GENE X, in both Sample A and Sample B:
:{| class="wikitable"
|-
! Sample
! Spike-in RNA
! GENE X
|-
| A
| 100,000
| 500
|-
| B
| 80,000
| 700
|}
* The formula for normalizing the counts of gene X to the counts of the spike-in RNA molecule is as follows:
<div style="margin-left: 1.5em;">
  <source>
Normalized count = (Raw count of gene X / Raw count of spike-in RNA molecule) x
                  (Average counts of spike-in RNA molecule across samples)
  </source>
</div>
* To account for this technical variation, the researcher can use the counts of the spike-in RNA molecule to normalize the counts of the endogenous genes. For example, the normalized counts of GENE X in Sample A and Sample B would be:
:{| class="wikitable"
|-
! Sample
! Normalized GENE X
|-
| A
| 500/100,000*90,000
|-
| B
| 700/80,000*90,000
|}
</li>
</ul>
* [https://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html#3_Normalizing_cell-specific_biases scran] vignette.
* [https://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html#3_Normalizing_cell-specific_biases scran] vignette.
** We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
** We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
Line 484: Line 981:
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6 Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model], [https://cran.r-project.org/web/packages/glmpca/index.html glmpca] package. All codes are available in github. [https://btep.ccr.cancer.gov/seminar A talk by Hicks 2021].
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6 Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model], [https://cran.r-project.org/web/packages/glmpca/index.html glmpca] package. All codes are available in github. [https://btep.ccr.cancer.gov/seminar A talk by Hicks 2021].


= [https://bioconductor.org/packages/scone scone] package: normalization =
= Transformation =
[https://www.biorxiv.org/content/biorxiv/early/2017/12/16/235382.full.pdf Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq].
[https://www.nature.com/articles/s41592-023-01814-1 Comparison of transformations for single-cell RNA-seq data]
 
= AI =
* [https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1.full.pdf scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI]
** [https://twitter.com/simocristea/status/1654581096498229250?s=20 Simona Cristea]
 
= Quality control =
* [https://broadinstitute.github.io/2019_scWorkshop/index.html#session-content Quality control of cells and genes] (doublets, ambient, empty drops)
* [https://github.com/YOU-k/scRNA-seq-workshop/blob/master/vignettes/sc_analysis.Rmd scRNA-seq-workshop]. 由于10x 技术会使得外部RNA,也就是不包含微滴的里的细胞被同时测序。在分析之前,我们需要确保得到的barcode数据都相对应活细胞。这里,我们使用`emptyDrops`来筛选“真”细胞。
* [https://scrnaseq-course.cog.sanger.ac.uk/website/processing-raw-scrna-seq-data.html Identifying cell-containing droplets/microwells] about emptyDrops.
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1950-6 Decontamination of ambient RNA in single-cell RNA-seq with DecontX] Yang 2020
* [https://github.com/crazyhottommy/scRNAseq-analysis-notes scRNAseq-analysis-notes] by Ming Tang
* [http://dors.weizmann.ac.il/course/course2018/scRNA-Seq.pdf#page=31 Why do we have Barcodes with Counts that are not Called Cells? Ambient RNA]
* [https://bioconductor.org/books/release/OSCA/multi-sample-comparisons.html#ambient-problems Avoiding problems with ambient RNA] from OSCA
* [https://blog.dolomite-bio.com/quality-control-scrna-seq-data/ A guide to quality controlling your scRNA-Seq data]
 
== Doublet ==
* [https://github.com/chris-mcginnis-ucsf/DoubletFinder DoubletFinder]
* scDblFinder
* scrublet
* scds
 
= Downsampling =
* 3 options [https://bioinformatics.stackexchange.com/a/13058 How to downsample some of the samples in RNA-seq data?]
* [https://rdrr.io/bioc/scRecover/man/countsSampling.html scRecover::countsSampling()]
<ul>
<li>[https://rdrr.io/github/LTLA/scuttle/man/downsampleMatrix.html downsampleMatrix()] from the scuttle package
* downsampleMatrix: Downsample a '''count''' matrix to a desired proportion
* [https://github.com/LTLA/scuttle/blob/master/R/downsampleMatrix.R Source code] of downsampleMatrix
<pre>
R> library(scuttle)
R> sce <- mockSCE()
R> summary(colSums(counts(sce)))
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
344769  366648  374586  373876  382028  396251
 
R> downsampled2 <- downsampleMatrix(counts(sce),
                                    prop = 10000/colSums(counts(sce)), bycol=TRUE)
R> colSums(downsampled2[, 1:5])
Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
  10000    10000    10000    10000    10000
</pre>
</li>
<li>[https://rdrr.io/bioc/DropletUtils/man/downsampleReads.html downsampleReads()] from the DropletUtils package which imports the '''scuttle''' package.
* downsampleReads: Downsamples the reads to produce a matrix of '''UMI counts''', given a HDF5 file containing molecule information
* downsampleReads downsamples '''reads''' rather than '''observed counts'''.
* (from the help) Subsampling the reads with downsampleReads recapitulates the effect of differences in sequencing depth per cell. This provides an alternative to downsampling with the [https://software.10xgenomics.com/single-cell-vdj/software/pipelines/latest/using/aggr CellRanger aggr] function or subsampling with the [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/rkit 10X Genomics R kit].
* From 10x, [https://kb.10xgenomics.com/hc/en-us/articles/115004217543-How-does-cellranger-aggr-normalize-for-sequencing-depth-among-multiple-libraries- How does cellranger aggr normalize for sequencing depth among multiple libraries?]
* From 10x, [https://kb.10xgenomics.com/hc/en-us/articles/115002474263-How-much-sequencing-saturation-should-I-aim-for- How much sequencing saturation should I aim for?] If '''sequencing saturation''' is at 50%, it means that every 2 new reads will result in 1 new UMI count (unique transcript) detected. In contrast, 90% sequencing saturation means that 10 new reads are necessary to obtain one new UMI count. '''If the sequencing saturation is high, additional sequencing would not recover much new information for the library.'''
* From 10x, [https://kb.10xgenomics.com/hc/en-us/articles/115003646912-How-is-sequencing-saturation-calculated- How is sequencing saturation calculated?]
* [https://github.com/veghp/Sequencing Sequencing depth for scRNA-seq].
* [https://github.com/MarioniLab/DropletUtils/blob/master/R/downsampleReads.R Source code] of downsampleReads
* [https://www.nature.com/articles/s41591-020-0844-1/ A single-cell and single-nucleus RNA-Seq toolbox for fresh and frozen human tumors]. ''After downsampling by total reads, we used write10xCounts from DropletUtils and a custom Python script to generate an HDF5 file for input into our analysis pipelines, as described in the sections that follow and in the ‘Code availability’ section.''
* [https://www.jianshu.com/p/9d1d35589d31 R包DropletUtils使用]. 比較 downsampling the UMI count matrix 與 downsampling the raw reads. '''假如测序饱和度很高的话,最后的total count 并不会下降太多. ''' It also discussed the use of barcode rank plot, empty droplet and demultiplexing hashed libraries.
* [https://cloud.tencent.com/developer/article/1806470 对10X单细胞reads进行随机抽样]. 使用 [https://github.com/lh3/seqtk seqtk] 对原始fastq文件进行随机抽样.
<pre>
R> library(DropletUtils)
# Mocking up some 10X HDF5-formatted data.
R> out <- DropletUtils:::simBasicMolInfo(tempfile())
R> a <- read10xMolInfo(out)
R> names(a)
[1] "data"  "genes"
R> dim(a$data)
[1] 9508    5
R> str(a$genes)
chr [1:20] "ENSG1" "ENSG2" "ENSG3" "ENSG4" "ENSG5" "ENSG6" "ENSG7" "ENSG8" ...
R> a$data[1:3, ]
DataFrame with 3 rows and 5 columns
        cell      umi gem_group      gene    reads
  <character> <integer> <integer> <integer> <integer>
1        GACT    781899        1        9        10
2        CAGC    747670        1        9        9
3        GTAC    614190        1        6        10
R> unique(a$data$gene)
[1]  9  6 15  7 12 13  1 20  4  8  3 14 17 19 18 16  2  5 10 11
R> length(unique(a$data$gene))  # 20 genes
[1] 20
R> length(unique(a$data$cell))  # 256 cells
[1] 256
 
# Downsampling by the reads.
R> b<- downsampleReads(out, barcode.length=4, prop=0.5)
R> dim(b)
[1]  20 256
R> b[1:3, 1:5]
3 x 5 sparse Matrix of class "dgCMatrix"
      AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1
ENSG1      .      1      .      1      5
ENSG2      1      1      1      4      .
ENSG3      1      1      .      1      4
R> class(b)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
R> 4^4  # ATCG for each character of the barcode
[1] 256
R> 20*256  # So the combinations of (cell, gene) are not unique
[1] 5120
R> length(unique(a$data$umi)) # So UMIs are unique
[1] 9508
R> length(unique(a$data$gem_group))
[1] 1
</pre>
Change '''prop''' parameter. '''Notice the result of downsampling reads does not change much'''.
<pre>
R> b1 <- downsampleReads(out, barcode.length=4, prop=.9999)
R> colSums(b1)[1:5]
AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1
    41    37    28    36    35
R> b2 <- downsampleReads(out, barcode.length=4, prop=0.2)
R> colSums(b2)[1:5]
AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1
    37    31    23    31    28
R> b5 <- downsampleReads(out, barcode.length=4, prop=0.5)
R> colSums(b5)[1:5]
AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1
    41    37    28    36    35
R> sum(b1)
[1] 9508
R> sum(b2)
[1] 8245
R> sum(b5)
[1] 9448
</pre>
</li>
</ul>
* [http://bioconductor.org/books/release/OSCA/normalization.html#applying-the-size-factors 7.5.2 Downsampling and log-transforming] from OSCA
* [https://www.stephaniehicks.com/2018-bioinfosummer-scrnaseq/cleaning-the-expression-matrix.html Down_Sample_Matrix()]
* [https://pubmed.ncbi.nlm.nih.gov/29846586/ zUMIs]
* [https://www.embopress.org/doi/full/10.15252/msb.20188746 Current best practices in single-cell RNA-seq analysis: a tutorial]. ''While downsampling throws away data, it also increases technical dropout rates which CPM and other global scaling normalization methods do not. Thus, downsampling can deliver a more realistic representation of what cellular expression profiles would look like at similar count depths.''
* Huang 2020 [https://www.sciencedirect.com/science/article/pii/S1672022920301443?via%3Dihub Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data]
** downsample features (genes). [https://github.com/qianhuiSenn/scRNA_cell_deconv_benchmark/blob/master/Scripts/Downsample/downsample_gene_pancreas_code.R#L43 down_sample_gene()]. The idea is ''sample(gene_list, size = size, replace = FALSE, prob = prob_list) '' where gene_list is the list of detected genes and size is 5,000/10,000/15,000 and prob_list is the proportion of gene counts (defined by using rowSums()). ''We randomly downsampled the features from Fluidigm C1 data into 15,000, 10,000 and 5000 input genes, based on the original log count distribution''.
** downsample the reads  into 25%, 50%, 75% of the original read depth (with two repetitions) on BAM files, and then realigned files following the method provided by the original manuscript.
 
= Normalization =
* [https://hbctraining.github.io/scRNA-seq_online/lessons/05_normalization_and_PCA.html Introduction to single-cell RNA-seq]. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
* Bulk normalization methods
** DESeq2: '''median of ratios (MoR)''' approach
** edgeR: '''trimmed mean of M values'''
* Count models for normalization of single-cell RNA-seq data. See [https://satijalab.org/scgd21/ Single Cell Genomics Day 2021]
 
== [https://bioconductor.org/packages/scone scone] package: normalization ==
* [https://www.biorxiv.org/content/biorxiv/early/2017/12/16/235382.full.pdf Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq].
* [https://academic.oup.com/bioinformatics/article/38/1/164/6367764 PsiNorm: a scalable normalization for single-cell RNA-seq data]
 
== Dino ==
[https://academic.oup.com/bioinformatics/article/37/22/4123/6306403 Normalization by distributional resampling of high throughput single-cell RNA-sequencing data] Brown 2021


= Batch correction =
= Batch correction =
Line 492: Line 1,135:
* Over-correction can happen.
* Over-correction can happen.
** [https://web.microsoftstream.com/video/fb1f2db5-3256-4fd0-afd1-8b5e7c321b89 Single-cell RNA-seq Analysis on NIDAP - Tutorial Part 2] 40:00 (nih network is required). Groups differ by a single experimental modification (eg a knockoff gene or cells of similar lineage used to determine differentiation pattern). However batch correction is necessary in cases such as spike-in cells where we expect these cells should be clustered together.
** [https://web.microsoftstream.com/video/fb1f2db5-3256-4fd0-afd1-8b5e7c321b89 Single-cell RNA-seq Analysis on NIDAP - Tutorial Part 2] 40:00 (nih network is required). Groups differ by a single experimental modification (eg a knockoff gene or cells of similar lineage used to determine differentiation pattern). However batch correction is necessary in cases such as spike-in cells where we expect these cells should be clustered together.
** '''Mutual nearest neighbors/MNN'''. [https://www.nature.com/articles/nbt.4091 Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors] Haghverdi 2018...  
* '''Mutual nearest neighbors/MNN'''. [https://www.nature.com/articles/nbt.4091 Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors] Haghverdi 2018...  
*** One assumption is the '''batch effect''' is almost '''orthogonal''' to the '''biological subspace'''. It also assumes the batch effect variation in much smaller than the biological effect variation between different cell types.
** One assumption is the '''batch effect''' is almost '''orthogonal''' to the '''biological subspace'''. It also assumes the batch effect variation in much smaller than the biological effect variation between different cell types.
*** ''It is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4...  We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression.''  
** ''It is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4...  We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression.''  
** [https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa097/5855265?login=true SMNN] can eliminate the overcorrection between different cell types. [https://github.com/yycunc/SMNN github source code].
** Sadly the [https://github.com/MarioniLab/MNN2017 R code] of the paper does not have the file ('MAP2.csv') and the code does not work anymore (e.g. mnnCorrect() does not have $angles component). Check out the [https://github.com/chriscainx/mnnpy '''mnnpy'''] module which still has an option to return angles.
<ul>
<li>[https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa097/5855265?login=true SMNN] can eliminate the overcorrection between different cell types & allow that batch effects are non-orthogonal to the biological differences. [https://github.com/yycunc/SMNN github source code]. Note that the package is only available on github and it depends on the python '''mnnpy''' module.
{{Pre}}
library(reticulate)
use_python("/home/rstudio/.conda/envs/sc_env/bin/python")
py_config()  # confirm
 
library("SMNN")
data("data_SMNN")
dim(data_SMNN$batch1.mat)  # [1] 3000  400
data_SMNN$batch1.mat[1:5, 1:5]
dim(data_SMNN$batch2.mat)  # [1] 3000  500
# Maker genes
markers <- c("Col1a1", "Pdgfra", "Ptprc", "Pecam1")
# Corresponding cell type labels for each marker gene
cluster.info <- c(1, 1, 2, 3)
matched_clusters <- unifiedClusterLabelling(data_SMNN$batch1.mat,
                                            data_SMNN$batch2.mat,
                                            features.use = markers,
                                            cluster.labels = cluster.info,
                                            min.perc = 0.3)
 
corrected.results <- SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat),
                                batch.cluster.labels = matched_clusters,
                                k=20,
                                sigma=1,
                                cos.norm.in=TRUE,
                                cos.norm.out=TRUE,
                                subset.genes=rownames(data_SMNN$batch2.mat))
# NOTE: subset.genes parameter was added to fix a bug when I'm running the function
# [1] "Data preparation ..."
# Error: C stack usage  555609078676 is too close to the limit
 
> traceback()
3: py_call_impl(callable, dots$args, dots$keywords)
2: mnnpy$utils$transform_input_data(datas = batches.t, cos_norm_in = cos.norm.in,
      cos_norm_out = cos.norm.out, var_index = as.character(c(0:ncol(batches.t[[1]]))),
      var_subset = subset.index, n_jobs = n.jobs)
1: SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat),
      batch.cluster.labels = matched_clusters, k = 20, sigma = 1,
      cos.norm.in = TRUE, cos.norm.out = TRUE, subset.genes = rownames(data_SMNN$batch2.mat)[1:100])
</pre>
</li>
</ul>
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 A benchmark of batch-effect correction methods for single-cell RNA sequencing data] Tran 2020
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 A benchmark of batch-effect correction methods for single-cell RNA sequencing data] Tran 2020
* [https://www.nature.com/articles/s41597-021-00809-x A multi-center cross-platform single-cell RNA sequencing reference dataset] with the source code on [https://github.com/oxwang/SciData_scRNAseq github]
* [https://www.nature.com/articles/s41597-021-00809-x A multi-center cross-platform single-cell RNA sequencing reference dataset] 2021 with the source code on [https://github.com/oxwang/SciData_scRNAseq github].
* [https://www.nature.com/articles/s41587-020-00748-9 A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples] Chen 2020. [https://github.com/oxwang/fda_scRNA-seq github] and [https://codeocean.com/capsule/9934440/tree/v1 codeocean] (most complete). The reference genome and transcriptome were downloaded from the 10x website as ‘refdata-cellranger-GRCh38-1.2.0.tar.gz’, which corresponds to the GRCh38 genome and the Ensembl version 84 transcriptome.
** [https://www.biorxiv.org/content/10.1101/2020.09.20.305474v1 Companion paper], [https://github.com/oxwang/SciData_scRNAseq source code] in github. [https://figshare.com/s/7eaff863913678a4166d gene count data] (gene_counts_all.rdata).
** [https://datascience.cancer.gov/news-events/events/multi-center-study-benchmarking-single-cell-rna-sequencing-technologies-using Talk (video)].
<ul>
<ul>
<li>https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html  
<li>https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html  
Line 541: Line 1,231:
</ul>
</ul>
* [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/when-should-we-batch-correct-scrna-seq-data-when-should-we-avoid-it/ When should we batch correct scRNA-Seq data? When should we avoid it?]
* [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/when-should-we-batch-correct-scrna-seq-data-when-should-we-avoid-it/ When should we batch correct scRNA-Seq data? When should we avoid it?]
* BBKNN
** [https://nbviewer.jupyter.org/github/Teichlab/bbknn/blob/master/examples/demo.ipynb Jupyter example]
** [https://www.jianshu.com/p/ea667c99f968 BBKNN:python单细胞样本整合和批次效应处理算法]
* [https://www.rna-seqblog.com/cider-an-interpretable-meta-clustering-framework-for-single-cell-rna-seq-data-integration-and-evaluation/ CIDER – an interpretable meta-clustering framework for single-cell RNA-seq data integration and evaluation] 2021
= Clustering =
* [https://www.biorxiv.org/content/10.1101/2021.03.24.436859v1 minicore]: Fast scRNA-seq clustering with various distance measures
* [https://www.jianshu.com/p/17b12d79b332 R语言学习笔记之聚类分析]
* [https://pubmed.ncbi.nlm.nih.gov/30271584/ A systematic performance evaluation of clustering methods for single-cell RNA-seq data] Duò 2018
** Quantify performance using the '''adjusted Rand Index (ARI)''', comparing the obtained clustering to the true clusters.
** Evaluates the stability of the clustering results from each method, with respect to random starts. Each method was run five times on each data set (for each k), and we quantify the stability by comparing each pair of such runs using the adjusted Rand Index.
** [https://htmlpreview.github.io/?https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison/blob/master/output/countsimQC/Kumar_countsimQC.html Comparison of characteristic features across count data sets] contains a report comparing the main characteristics of the simulated data sets to those of the underlying Kumor data set.
** The paper contains information about data processing (scater & scran packages) and gene filtering (Seurat and M3Drop packages).
{{Pre}}
# after I upgrade R from 4.0.x to 4.1.x
R> library(DuoClustering2018)
snapshotDate(): 2021-05-18
Warning message:
DEPRECATION: As of ExperimentHub (>1.17.2), default caching location has changed.
  Problematic cache: /Users/XXX/Library/Caches/ExperimentHub
  See https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update
</pre>


= Normalization =
== NMF ==
* [https://hbctraining.github.io/scRNA-seq_online/lessons/05_normalization_and_PCA.html Introduction to single-cell RNA-seq]. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
* https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
* Bulk normalization methods
* [https://www.jianshu.com/p/05cf16b47513 NMF rank度量图该怎么看]
** DESeq2: '''median of ratios (MoR)''' approach
* [https://www.quora.com/What-are-the-pros-and-cons-of-different-matrix-factorization-techniques-Low-rank-SVD-and-NMF-What-are-some-practical-examples-of-each What are the pros and cons of different matrix factorization techniques: Low rank, SVD, and NMF? What are some practical examples of each?] In NMF you force the coefficients to be positive.
** edgeR: '''trimmed mean of M values'''
* [https://www.quora.com/What-is-the-difference-between-non-negative-matrix-factorization-and-singular-value-decomposition What is the difference between non-negative matrix factorization and singular value decomposition?] SVD factors contains both positive and negative entries while NMF factors are strictly positive.
 
== cNMF ==
* https://github.com/dylkot/cNMF/ (python)
* It takes a count matrix (N cells X G genes) as input and produces a (K x G) matrix of '''gene expression programs (GEPs)''' and a (N x K) matrix specifying the '''usage of each program''' for each cell in the data.
* [https://en.wikipedia.org/wiki/Gene_expression_programming Gene expression programming]
* [https://en.wikipedia.org/wiki/Non-negative_matrix_factorization NMF]
 
== CancerSubtypes ==
* [http://bioconductor.org/packages/release/bioc/html/CancerSubtypes.html CancerSubtypes]. ExecuteCNMF() which calls the [https://www.rdocumentation.org/packages/NMF/versions/0.23.0/topics/nmf NMF::nmf()] function. Looking at the [https://github.com/xtsvm/CancerSubtypes/blob/master/R/ClusteringMethod.R#L511 source code], it seems the 'consensus' is not used by nmf() so it does not affect the predicted groups. It only affect the final [https://www.rdocumentation.org/packages/NMF/versions/0.23.0/topics/connectivity 'consensus' slot (matrix)].
* [https://academic.oup.com/bioinformatics/article/33/19/3131/3866880 CancerSubtypes: an R/Bioconductor package for molecular cancer subtype identification, validation and visualization] Xu, 2017.
<pre>
library(CancerSubtypes)
data(GeneExp)
dim(GeneExp) # 1500 x 100
result=ExecuteCNMF(GeneExp,clusterNum=3,nrun=30)
dim(result$distanceMatrix)  # consensus/Similarity returned from nmf()
# [1] 100 100
round(unique(result$distanceMatrix), 3)
# [1] 1.000 0.000 0.167 0.067 0.933 0.100 0.433 0.900 0.833 0.567
#[11] 0.233 0.733 0.267 0.133 0.967 0.633 0.500 0.667 0.200 0.533
sil=silhouette_SimilarityMatrix(result$group, result$distanceMatrix)
plot(sil, col = c("red", "blue", "green"))
# Note that the average silhouette width is computed without considering groups.
# mean(c(.96, .98, 1)) # .98
# mean(sil[, 3]) #  0.9714631  match with the plot
# mean(tapply(sil[, 3], sil[, 1], mean)) # 0.9794998
 
pca <- prcomp(t(GeneExp))
plot(pca$x[,1], pca$x[, 2], col=result$group, pch=19) # black, red, green
</pre>
 
== seurat::BuildClusterTree ==
* https://satijalab.org/seurat/reference/buildclustertree and PlotClusterTree()
* [https://romanhaa.github.io/projects/scrnaseq_workflow/#cluster-tree ggtree() was used for plotting]. See, scRNA-seq analysis workflow by Roman Hillje - Data Visualization & Bioinformatics


== Cell type deconvolution ==
== Significance Analysis for Clustering ==
* [https://en.wikipedia.org/wiki/List_of_distinct_cell_types_in_the_adult_human_body List of distinct cell types in the adult human body]
[https://www.biorxiv.org/content/10.1101/2022.08.01.502383v1 Significance Analysis for Clustering with Single-Cell RNA-Sequencing Data] 2022
* https://twitter.com/stephaniehicks/status/1358776597264797703?s=20
* [http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp#C8 single-cell signatures] from MSigDB


= Model =
= Model =
* [https://en.wikipedia.org/wiki/Zero-inflated_model Zero-inflated model], [https://en.wikipedia.org/wiki/Compound_Poisson_distribution Compound Poisson distribution], [https://en.wikipedia.org/wiki/Negative_binomial_distribution Negative binomial distribution/Gamma-Poisson distribution]
* [https://en.wikipedia.org/wiki/Zero-inflated_model Zero-inflated model], [https://en.wikipedia.org/wiki/Compound_Poisson_distribution Compound Poisson distribution], [https://en.wikipedia.org/wiki/Negative_binomial_distribution Negative binomial distribution/Gamma-Poisson distribution]
* [https://academic.oup.com/bioinformatics/article/34/19/3340/4984507?login=true Two-phase models] for scRNA-Seq
* [https://academic.oup.com/bioinformatics/article/34/19/3340/4984507?login=true Two-phase models] for scRNA-Seq
* [https://www.biorxiv.org/content/10.1101/2021.07.07.451498v1 Comparison and evaluation of statistical error models for scRNA-seq] Choudhary & Satija 2021
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02601-5 Statistics or biology: the zero-inflation controversy about scRNA-seq data] Jiang 2022
= Feature selection =
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02544-3 Feature selection revisited in the single-cell era] Yang 2021


= DE analysis =
= DE analysis =
*[https://www.biorxiv.org/content/10.1101/2021.05.10.443350v1.full.pdf Individual Level Differential Expression Analysis for Single Cell RNA-seq data (IDEAS)] 2021. [https://www.rna-seqblog.com/ideas-individual-level-differential-expression-analysis-for-single-cell-rna-seq-data/ Genome biology] 2022
* [https://www.rna-seqblog.com/differential-expression-analysis-of-single-cell-data-must-account-for-biological-replicates/ Differential expression analysis of single-cell data must account for biological replicates]
* [https://www.rna-seqblog.com/statistical-approaches-for-differential-expression-analysis-in-single-cell-rna-sequencing-studies/ Statistical approaches for differential expression analysis in single-cell RNA sequencing studies]


== GSEA ==
== GSEA ==
[https://github.com/edvanburen/twosigma twosigma] package and the paper [https://www.biorxiv.org/content/10.1101/2021.01.24.427979v1 TWO-SIGMA-G: A New Competitive Gene Set Testing Framework for scRNA-seq Data Accounting for Inter-Gene and Cell-Cell Correlation] Burden et ap.
* [https://github.com/edvanburen/twosigma twosigma] package and the paper [https://www.biorxiv.org/content/10.1101/2021.01.24.427979v1 TWO-SIGMA-G: A New Competitive Gene Set Testing Framework for scRNA-seq Data Accounting for Inter-Gene and Cell-Cell Correlation] Burden et ap.
* [https://github.com/dpcook/scrna_seq_workshop_2020/blob/master/notebooks/02_downstream_analysis.Rmd#L237 fgsea] by dpcook
* [https://www.nature.com/articles/s41467-020-15298-6 Integrative differential expression and gene set enrichment analysis using summary statistics for scRNA-seq studies] Ma 2020. [https://xzhoulab.github.io/iDEA/ iDEA] package.
* [https://www.jianshu.com/p/3c1eb0e78ab2 10X单细胞(10X空间转录组)'''基因集'''打分方法汇总]
* [https://bioconductor.org/packages/release/bioc/html/escape.html escape] package - Easy single cell analysis platform for enrichment.
 
== AddModuleScore ==
* [https://www.jianshu.com/p/827143ce66fa Seurat包的打分函数AddModuleScore], [https://www.jianshu.com/p/cef5663888ff Another]
** 我们感兴趣的基因,抽出来,每一个细胞算一个这些基因表达的平均值, 背景基因的平均值在于找每个基因的所在的bin,在该bin内随机抽取相应的ctrl个基因作为背景,最后所有的目标基因算一个平均值,所有的背景基因算一个平均值,两者相减就是该gene set 的score值。
** Calculate the average expression levels of each program (cluster) on single cell level, subtracted by the aggregated expression of '''control feature''' sets. All analyzed features are binned based on averaged expression, and the '''control features''' are randomly selected from each bin.
* [https://www.jianshu.com/p/65cdeaa4f4ab 10X空间转录组和10X单细胞数据联合分析方法汇总]
* [https://broadinstitute.github.io/KrumlovSingleCellWorkshop2020/data-wrangling-scrnaseq-1.html Plot the correlation between number of genes and S score]
* [https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00776-9#Fig4 Figure 4] in Single-cell transcriptome analysis of tumor and stromal compartments of pancreatic ductal adenocarcinoma primary tumors and metastatic lesions.
* [https://github.com/WalterMuskovic/AddModuleScore Exploring the inner workings of the AddModuleScore function from the Seurat R package.]
* [https://gitmemory.com/issue/satijalab/seurat/522/474661143 AddModuleScore() meaning]
* [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/how-can-i-use-a-large-data-set-like-tabula-muris-for-identifying-cell-types-in-my-data/ Identifying cell types in my data?] The highest scoring cell type (per cell) is assigned as the cell type. However, according to [https://github.com/satijalab/seurat/issues/4365 AddModuleScore to score the cells and assign each cell to the gene set #4365] the scores  are not directly comparable but need a normalization step.
* https://github.com/satijalab/seurat/search?q=addmodulescore&type=issues
** [https://github.com/satijalab/seurat/issues/1342 how to visualize ModuleScore on a featureplot]
** [https://github.com/satijalab/seurat/issues/3366 DoHeatmap for module scores #3366]
* [https://rdrr.io/cran/Seurat/man/AddModuleScore.html ?AddModuleScore]. The end result is a new vector ('CD_Features1' in this case) which contains scores for all cells.
* By default 24 bins are created and 100  control features selected from the same bin per analyzed feature.
* It seems AddModuleScore can be used for cell type annotation/identification; see [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/can-you-give-some-suggestions-cell-identity-annoation-and-recommend-packages-good-at-cell-identity-annotation-what-is-the-most-used-practice-manually-or-auto/ this post] and [https://bioinformatics.stackexchange.com/a/8883 this one]. If we have a few cell types, we run AddModuleScore on each cell type (a list of genes for each cell type). For each cell, the cell type with the highest score will be used to represent the cell type.
* It seems AddModuleScore is similar to [[GSEA#ssGSEA|ssGSEA]]; for example, they can return a score for each sample/cell and each gene set.
<pre>
library(Seurat)
data("pbmc_small")
cd_features <- list(c('CD79B', 'CD79A', 'CD19', 'CD180', 'CD200', 'CD3D',
  'CD2', 'CD3E', 'CD7', 'CD8A', 'CD14', 'CD1C', 'CD68', 'CD9', 'CD247'))
pbmc_small <- AddModuleScore(
  object = pbmc_small,
  features = cd_features,
  ctrl = 5,
  name = 'CD_Features'
)
head(pbmc_small, 3)
 
                  orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
ATGCCAGAACGACT SeuratProject        70          47              0
CATGGCCTGTGCAT SeuratProject        85          52              0
GAACCTGATGAACC SeuratProject        87          50              1
              letter.idents groups RNA_snn_res.1 CD_Features1
ATGCCAGAACGACT            A    g2            0    0.4420928
CATGGCCTGTGCAT            A    g1            0    0.4285719
GAACCTGATGAACC            B    g2            0    0.9217770
</pre>
 
== Tweedieverse ==
* [https://cran.r-project.org/web/packages/tweedie/ Tweedieverse], [https://github.com/himelmallick/Tweedieverse/ Github]
 
== DE analysis with multiple samples ==
[https://support.bioconductor.org/p/9150118/ DE analysis with multiple samples]. [https://rdrr.io/bioc/scuttle/man/aggregateAcrossCells.html scuttle::aggregateAcrossCells()]
 
= Cell type deconvolution =
* [https://en.wikipedia.org/wiki/List_of_distinct_cell_types_in_the_adult_human_body List of distinct cell types in the adult human body]
* https://twitter.com/stephaniehicks/status/1358776597264797703?s=20
* [http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp#C8 single-cell signatures] from MSigDB
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04186-5 CDSeqR/CDSeq]: fast complete deconvolution for gene expression data from bulk tissues
* [http://bioconductor.org/packages/release/bioc/html/granulator.html granulator]: Rapid benchmarking of methods for *in silico* deconvolution of bulk RNA-seq data
* [https://www.tandfonline.com/doi/full/10.1080/01621459.2024.2382435 Statistical Inference of Cell-Type Proportions Estimated from Bulk Expression Data] JASA 2024


= Cell type annotation =
= Cell type annotation =
* [https://www.sciencedirect.com/science/article/pii/S1672022920301443 Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data]
* [https://www.healthline.com/health/number-of-cells-in-body#types-of-cells How many cell types exist in the human body]: 200
* [http://bioconductor.org/books/release/SingleRBook/ singleR] (Bioconductor). This method assigns labels to cells based on the '''reference''' samples (e.g. Immgen reference dataset for mouse) with the highest Spearman rank correlations, using only the marker genes between pairs of labels to focus on the relevant differences between cell types. See [http://bioconductor.org/books/release/OSCA/cell-type-annotation.html Chapter 12: Cell type annotation] of OSCA.
* [https://www.sciencedirect.com/science/article/pii/S1672022920301443 Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data] Huang. [https://github.com/qianhuiSenn/scRNA_cell_deconv_benchmark Source code].
* Seurat vignette [https://satijalab.org/seurat/articles/integration_mapping.html Mapping and annotating '''query''' datasets].  FindTransferAnchors() + TransferData() + AddMetaData()
* [http://bioconductor.org/books/release/SingleRBook/ SingleR] (Bioconductor). This method assigns labels to cells based on the '''reference''' samples (e.g. Immgen reference dataset for mouse) with the '''highest Spearman rank correlations''', using only the marker genes between pairs of labels to focus on the relevant differences between cell types. See [http://bioconductor.org/books/release/OSCA/cell-type-annotation.html Chapter 12: Cell type annotation] of OSCA.
** The vignette uses reference data from the [https://bioconductor.org/packages/3.13/data/experiment/html/celldex.html celldex] (Reference Index for Cell Types) package.
** SingleR identifies marker genes from the reference and uses them to compute '''assignment scores''' (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest score is the assigned to the test cell, possibly with further fine-tuning to resolve closely related labels.
** Use rownames(obj) to make sure the gene names are comparable from the test and reference datasets.
** [https://www.bioconductor.org/packages/release/bioc/manuals/SingleR/man/SingleR.pdf#page=30 SingleR()] has several arguments to control the choice of marker genes used for annotation; see trainSingleR().
** Old [https://github.com/dviraran/SingleR source code] in github
* [https://www.rna-seqblog.com/single-cell-mapper-scmappr-using-scrna-seq-to-infer-the-cell-type-specificities-of-differentially-expressed-genes/ Single-cell mapper (scMappR)] – using scRNA-seq to infer the cell-type specificities of differentially expressed genes
* [https://www.rna-seqblog.com/single-cell-mapper-scmappr-using-scrna-seq-to-infer-the-cell-type-specificities-of-differentially-expressed-genes/ Single-cell mapper (scMappR)] – using scRNA-seq to infer the cell-type specificities of differentially expressed genes
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02281-7 scSorter]: assigning cells to known cell types according to marker genes
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02281-7 scSorter]: assigning cells to known cell types according to marker genes
* [https://www.frontiersin.org/articles/10.3389/fgene.2020.00490/full SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data] Cao 2020
* [https://www.sciencedirect.com/science/article/pii/S2001037021000192 Automated methods for cell type annotation on scRNA-seq data]  Pasquini 2021
* [https://pubmed.ncbi.nlm.nih.gov/34252936/ CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data] Wei 2021
* [https://www.biorxiv.org/content/10.1101/2021.11.07.467660v1 Sincast: a computational framework to predict cell identities in single cell transcriptomes using bulk atlases as references]
* [https://academic.oup.com/bioinformatics/article/37/22/4115/6287613 scDetect: a rank-based ensemble learning algorithm for cell type identification of single-cell RNA sequencing in cancer] Shen 2021
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04574-5 scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data]
* [https://www.rna-seqblog.com/scannotate-an-automated-cell-type-annotation-tool-for-single-cell-rna-sequencing-data/ scAnnotate – an automated cell type annotation tool for single-cell RNA-sequencing data]
== GPT-4 ==
[https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1 Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis]
== Cell-level metadata ==
[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001077 Cell-level metadata are indispensable for documenting single-cell sequencing datasets] Puntambekar 2021. [https://raysinensis.shinyapps.io/clustifyr-web-app/?tab=someta Shiny app]. [https://github.com/rnabioco/someta Source code].
[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE137710 GSE137710] is a good example.


= [http://bioconductor.org/packages/release/bioc/html/monocle.html monocle] package =
== Cell subtypes ==
[https://www.rna-seqblog.com/single-cell-rna-sequencing-reveals-specific-cell-subtypes-that-influence-survival-and-determine-molecular-subtype-classification-of-ovarian-cancers/ Single-cell RNA sequencing reveals specific cell subtypes that influence survival and determine molecular subtype classification of ovarian cancers]
 
== Some cell types ==
* [https://en.wikipedia.org/wiki/B_cell B cell], https://askabiologist.asu.edu/b-cell. B淋巴球(B lymphocyte), 屬於後天免疫系統的體液免疫,作用為分泌抗體.
* [https://en.wikipedia.org/wiki/T_cell T cell], https://askabiologist.asu.edu/t-cell. T細胞 是淋巴細胞的一種,在免疫反應中扮演著重要的角色
* [https://en.wikipedia.org/wiki/Schwann_cell Schwann cell] 神經膜細胞
* [https://en.wikipedia.org/wiki/Myelocyte Myelocyte /Myeloid cell] 骨髓細胞 bone-marrow cell
* [https://en.wikipedia.org/wiki/Neuroendocrine_cell Neuroendocrine cell] 神經分泌細胞
* [https://en.wikipedia.org/wiki/Fibroblast Fibroblast] 纖維母細胞
* [https://en.wikipedia.org/wiki/Endothelium Endothelium] 內皮
 
== Cell-cell interactions ==
* [https://www.nature.com/articles/s41576-020-00292-x#Sec22 Deciphering cell–cell interactions and communication from gene expression] Armingol 2021
* [https://academic.oup.com/bioinformatics/article/37/20/3501/6273571 scConnect: a method for exploratory analysis of cell–cell communication based on single-cell RNA-sequencing data] 2021
<ul>
<li>CellChat
<ul>
<li>Compute the communication probability
<syntaxhighlight lang='rsplus'>
# load data
load("data_humanSkin_CellChat.rda")
data.input = data_humanSkin$data # normalized data matrix
meta = data_humanSkin$meta # a dataframe with rownames containing cell mata data
cell.use = rownames(meta)[meta$condition == "LS"] # extract the cell names from disease data
data.input = data.input[, cell.use]
meta = meta[cell.use, ]
unique(meta$labels)
 
# Create a CellChat object
cellchat <- createCellChat(object = data.input, meta = meta, group.by = "labels")
 
# Add cell information
cellchat <- addMeta(cellchat, meta = meta)
cellchat <- setIdent(cellchat, ident.use = "labels") # set "labels" as default cell identity
levels(cellchat@idents) # show factor levels of the cell labels
#  [1] "APOE+ FIB"    "FBN1+ FIB"    "COL11A1+ FIB" "Inflam. FIB"  "cDC1"       
#  [6] "cDC2"        "LC"          "Inflam. DC"  "TC"          "Inflam. TC" 
# [11] "CD40LG+ TC"  "NKT"
groupSize <- as.numeric(table(cellchat@idents))
 
# Set the ligand-receptor interaction database
CellChatDB <- CellChatDB.human
 
# Preprocessing the expression data
CellChatDB.use <- subsetDB(CellChatDB, search = "Secreted Signaling") # use Secreted Signaling
cellchat@DB <- CellChatDB.use
cellchat <- subsetData(cellchat)
cellchat <- identifyOverExpressedGenes(cellchat)
cellchat <- identifyOverExpressedInteractions(cellchat)
cellchat <- projectData(cellchat, PPI.human)
 
# Compute the communication probability
cellchat <- computeCommunProb(cellchat)
cellchat <- computeCommunProbPathway(cellchat)
str(cellchat@net)
# List of 2
#  $ prob: num [1:12, 1:12, 1:656] 0 0 0 0 0 0 0 0 0 0 ...
#  ..- attr(*, "dimnames")=List of 3
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:656] "TGFB1_TGFBR1_TGFBR2" "TGFB2_TGFBR1_TGFBR2" "TGFB3_TGFBR1_TGFBR2"  "TGFB1_ACVR1B_TGFBR2" ...
#  $ pval: num [1:12, 1:12, 1:656] 1 1 1 1 1 1 1 1 1 1 ...
#  ..- attr(*, "dimnames")=List of 3
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:656] "TGFB1_TGFBR1_TGFBR2" "TGFB2_TGFBR1_TGFBR2" "TGFB3_TGFBR1_TGFBR2" "TGFB1_ACVR1B_TGFBR2" ...
str(cellchat@netP)
# List of 2
#  $ pathways: chr [1:13] "MIF" "GALECTIN" "CXCL" "COMPLEMENT" ...
#  $ prob    : num [1:12, 1:12, 1:13] 0 0 0 0 0 0 0 0 0 0 ...
#  ..- attr(*, "dimnames")=List of 3
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
#  .. ..$ : chr [1:13] "MIF" "GALECTIN" "CXCL" "COMPLEMENT" ...
sum(cellchat@net$prob)
# [1] 18.65038
</syntaxhighlight>
</li>
<li>The error I saw (R 4.1.1, CellChat 1.1.3)
{{Pre}}
cellchat <- netEmbedding(cellchat, type = "functional")
Manifold learning of the signaling networks for a single dataset
Error in runUMAP(Similarity, min_dist = min_dist, n_neighbors = n_neighbors,  :
  Cannot find UMAP, please install through pip (e.g. pip install umap-learn or reticulate::py_install(packages = 'umap-learn')).
</pre>
</li>
<li>[http://yangl.net/2021/09/09/cellchat-netembedding-error/ cellchat netEmbedding 运行出错]. It provides another solution (no need to install anything using pip) for calculating umap.
<pre>
library(uwot)
cellchat <- netEmbedding(cellchat, umap.method = 'uwot',type = "functional")
</pre>
</li>
<li>[https://github.com/sqjin/CellChat/issues/158 Error in runUMAP(Similarity, min.dist = 0.3, n.neighbors = k)]
</li>
<li>[https://github.com/sqjin/CellChat/issues/167 function netEmbedding is unable to work #167]
</li>
<li>[https://www.jianshu.com/p/da145cff3d41 CellChat:细胞间相互作用分析利器] which contains two youtube links to the talks by the author
</li>
</ul>
 
</li>
</ul>
 
= Integrating datasets =
* [http://www.bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html MultiAssayExperiment] package.
* [https://www.nature.com/articles/s41587-021-00867-x?s=09 Iterative single-cell multi-omic integration using online learning] Gao 2021. Online integrative non-negative matrix factorization (iNMF).
* [https://www.cell.com/cell/fulltext/S0092-8674(21)00583-3?s=09 Integrated analysis of multimodal single-cell data] Hao 2021
* [https://www.biorxiv.org/content/10.1101/2021.06.01.445670v1 Muon]: multimodal omics analysis framework
* [https://www.biorxiv.org/content/10.1101/2021.08.04.453579v1 A comparison of data integration methods for single-cell RNA sequencing of cancer samples] Richards 2021
 
= Benchmark =
[https://www.biorxiv.org/content/10.1101/2022.09.22.508982v1 Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability]
 
= scanpy python package =
* [https://scanpy.readthedocs.io/en/stable/ scanpy]
** [https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html Preprocessing and clustering 3k PBMCs] which save the data as the ''h5ad' (hdf5) format.
** [https://cellregeneration.springeropen.com/articles/10.1186/s13619-020-00041-9 Comparison of Scanpy-based algorithms to remove the batch effect from single-cell RNA-seq data]
* [https://github.com/chriscainx/mnnpy mnnpy]
 
= Trajectory/pseudotime analysis =
[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1663-x PAGA]: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells
 
== simulation ==
[http://www.bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splat_params.html#281_pathfrom_-_Path_origin splatSimulatePaths()] from splatter
 
== [http://bioconductor.org/packages/release/bioc/html/monocle.html monocle] package for pseudotime analysis ==
* [http://cole-trapnell-lab.github.io/monocle-release/docs/#what-is-pseudotime- What is a pseudotime?] '''Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.'''
* According to a table on [http://cole-trapnell-lab.github.io/monocle-release/docs/#filtering-low-quality-cells-recommended here], pseudotime is a continuous variable.
* This is used by the paper [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab450/6306403#supplementary-data Normalization by distributional resampling of high throughput single-cell RNA-sequencing data] Brown 2021 to identify genes with significantly variable expression over pseudo-time.
 
== slingshot ==
[https://bioconductor.org/packages/release/bioc/html/slingshot.html slingshot] Tools for ordering single-cell sequencing
 
== tradeSeq ==
https://bioconductor.org/packages/release/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html. tradeSeq is an R package that allows analysis of gene expression along trajectories.
 
= Spatial Transcriptomics Data Analysis =
* [https://www.bioconductor.org/packages/release/bioc/html/Spaniel.html Spaniel]
* [http://www.bioconductor.org/packages/release/bioc/html/SpatialCPie.html SpatialCPie]


= [http://bioconductor.org/packages/release/bioc/html/sincell.html sincell] package =
= [http://bioconductor.org/packages/release/bioc/html/sincell.html sincell] package =
Line 580: Line 1,551:
= NMFEM =
= NMFEM =
[http://www.rna-seqblog.com/nmfem-detecting-heterogeneity-in-single-cell-rna-seq-data-by-non-negative-matrix-factorization/ Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization]
[http://www.rna-seqblog.com/nmfem-detecting-heterogeneity-in-single-cell-rna-seq-data-by-non-negative-matrix-factorization/ Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization]
= Splatter: Simulation Of Single-Cell RNA Sequencing Data =
http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1


= Scater =
= Scater =
Line 595: Line 1,563:
= totalVI =
= totalVI =
[https://www.nature.com/articles/s41592-020-01050-x?s=09 Joint probabilistic modeling of single-cell multi-omic data with totalVI]
[https://www.nature.com/articles/s41592-020-01050-x?s=09 Joint probabilistic modeling of single-cell multi-omic data with totalVI]
= Copy number =
== Copy number by infercnv ==
* https://github.com/broadinstitute/inferCNV/wiki
* [http://www.bioconductor.org/packages/release/bioc/html/infercnv.html infercnv] - Infer Copy Number Variation from Single-Cell RNA-Seq Data
* We need to install [https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/Mac%20OS%20X/ jags] software before we install the required package '''rjags'''. It seems OK to install this old version 4.3.0 of software on macOS Catalina 10.15.7. To install jags, right click the file 'JAGS-4.3.0.mpkg' and choose 'open'. Then follow the instruction of the installer to complete the installation.
* https://www.jianshu.com/c/4c982303e19f?order_by=top
** It seems the goal is to use the CNV data to cluster cells... All cells that clustered together with spiked in controls were labeled "nontumor" whereas the remaining clusters were labels as 'tumor'.
** [https://mp.weixin.qq.com/s?__biz=MzI1Njk4ODE0MQ==&mid=2247485693&idx=1&sn=57ffc4d2cf4a6a0b7ddd24d308baa1c2&scene=21#wechat_redirect 使用inferCNV分析单细胞转录组中拷贝数变异]. Or follow the help of plot_cnv(), infercnv::run()... inferCNV会输出一个 "map_metadata_from_infercnv.txt" (I don't see it) 文件用于记录每个细胞的元信息,所有信息都可以从该文件中进行提取。或者使用infercnv::add_to_seurat将信息直接增加到原来的seurat对象中。
** [https://www.jianshu.com/p/2ab27d11be23 CNS图表复现13—使用inferCNV来区分肿瘤细胞的恶性与否]. 因为10X技术出来的单个细胞的reads数量太少,检测到的基因数量太少,但是inferCNV更新后有一个参数是cutoff ,就很清楚的指出来了:cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics 说明它应用于10X单细胞转录组数据是没有问题的。
** [https://mp.weixin.qq.com/s?__biz=MzI1Njk4ODE0MQ==&mid=2247487157&idx=1&sn=e2b59fd14c9a25d4331465cf6fdb1d3c&scene=21 是否是免疫细胞很容易区分那是否是肿瘤细胞呢?]
* Question: where to download the gene position file? The infercnv_genes_example object has 10338 genes already
Simpler plot
<pre>
data(infercnv_data_example) # data.frame 8252 x 20, genes x samples
data(infercnv_annots_example) # data.frame 20 x 1, 10 normal and 10 tumor
data(infercnv_genes_example) # data.frame 10338 x 3
infercnv_genes_example[1:3, ]
#            V2    V3    V4
# WASH7P    chr1  14363  29806
# LINC00115 chr1 761586 762902
# NOC2L    chr1 879584 894689
infercnv_object_example <- infercnv::CreateInfercnvObject(raw_counts_matrix=infercnv_data_example,
                                                          gene_order_file=infercnv_genes_example,
                                                          annotations_file=infercnv_annots_example,
                                                          ref_group_names=c("normal"))
infercnv_object_example <- infercnv::run(infercnv_object_example,
                                        cutoff=1,
                                        out_dir=tempfile(),
                                        cluster_by_groups=TRUE,
                                        denoise=TRUE,
                                        HMM=FALSE,
                                        num_threads=2)
system("ls /var/folders/2q/slryb0rx4tj97t66v7l6pwvr_z6g3s/T//RtmpEuGgVy/file182d24b626381")
# 01_incoming_data.infercnv_obj                    infercnv.21_denoised.observations_dendrogram.txt
# 02_reduced_by_cutoff.infercnv_obj                infercnv.21_denoised.png
# 03_normalized_by_depth.infercnv_obj              infercnv.21_denoised.references.txt
# 04_logtransformed.infercnv_obj                    infercnv.heatmap_thresholds.txt
# 08_remove_ref_avg_from_obs_logFC.infercnv_obj    infercnv.observation_groupings.txt
# 09_apply_max_centered_expr_threshold.infercnv_obj infercnv.observations.txt
# 10_smoothed_by_chr.infercnv_obj                  infercnv.observations_dendrogram.txt
# 11_recentered_cells_by_chr.infercnv_obj          infercnv.png
# 12_remove_ref_avg_from_obs_adjust.infercnv_obj    infercnv.preliminary.heatmap_thresholds.txt
# 14_invert_log_transform.infercnv_obj              infercnv.preliminary.observation_groupings.txt
# 15_no_subclustering.infercnv_obj                  infercnv.preliminary.observations.txt
# 21_denoise.NF_NA.SD_1.5.NL_FALSE.infercnv_obj    infercnv.preliminary.observations_dendrogram.txt
# expr.infercnv.21_denoised.dat                    infercnv.preliminary.png
# expr.infercnv.dat                                infercnv.preliminary.references.txt
# expr.infercnv.preliminary.dat                    infercnv.references.txt
# infercnv.21_denoised.heatmap_thresholds.txt      preliminary.infercnv_obj
# infercnv.21_denoised.observation_groupings.txt    run.final.infercnv_obj
# infercnv.21_denoised.observations.txt
# Optional, what's the purpose of this plot function
#          when the last command already generated the heatmap?
plot_cnv(infercnv_object_example,
        out_dir=tempfile(),
        obs_title="Observations (Cells)",
        ref_title="References (Cells)",
        cluster_by_groups=TRUE,
        x.center=1,
        x.range="auto",
        hclust_method='ward.D',
        color_safe_pal=FALSE,
        output_filename="infercnv",
        output_format="png",
        png_res=300,
        dynamic_resize=0)
# ...
list.files("/var/folders/2q/slryb0rx4tj97t66v7l6pwvr_z6g3s/T//RtmpEuGgVy/file182d25df0c4ed")
# [1] "infercnv.heatmap_thresholds.txt"      "infercnv.observation_groupings.txt" 
# [3] "infercnv.observations_dendrogram.txt" "infercnv.observations.txt"         
# [5] "infercnv.png"                        "infercnv.references.txt"
</pre>
More complicated plot
<pre>
infercnv_obj = CreateInfercnvObject(
  raw_counts_matrix=system.file("extdata", "oligodendroglioma_expression_downsampled.counts.matrix.gz", package = "infercnv"),
  annotations_file=system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"),
  delim="\t", gene_order_file=system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv"),
  ref_group_names=c("Microglia/Macrophage","Oligodendrocytes (non-malignant)"))
infercnv_obj = infercnv::run(infercnv_obj,
                            cutoff=1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                            out_dir=tempfile(),
                            cluster_by_groups=TRUE,
                            denoise=TRUE,
                            HMM=TRUE) # HMM=TRUE will increase lots of computation
# STEP 1: incoming data
# STEP 02: Removing lowly expressed genes
# ...
# STEP 22: Denoising
# There were 50 or more warnings (use warnings() to see the first 50)
x <- read.delim(system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"), header=F)
dim(x)  #  [1] 184  2;  184 cells only
unique(x[,2])  # 6 groups. Two of them represent references cells and the other 4 are the obs cells.
# [1] "Microglia/Macrophage"            "Oligodendrocytes (non-malignant)"
# [3] "malignant_MGH36"                  "malignant_MGH53"
# [5] "malignant_97"                    "malignant_93"
x[1:3, ]
#            V1                  V2
# 1 MGH54_P2_C12 Microglia/Macrophage
# 2 MGH36_P6_F03 Microglia/Macrophage
# 3 MGH53_P4_H08 Microglia/Macrophage
xx <- read.delim(system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv"), header=F)
dim(xx)  #  [1] 10338    4
xx[1:2, ]
#          V1  V2    V3    V4
# 1    WASH7P chr1  14363  29806
# 2 LINC00115 chr1 761586 762902
</pre>
[[File:Infercnv.png|300px]]
= Cell-cell interactions and communication =
[https://www.nature.com/articles/s41576-020-00292-x Deciphering cell–cell interactions and communication from gene expression] Armingol 2021
= Chromatin/scATAC-seq data =
== Signac ==
[https://satijalab.org/signac/ Signac] is an extension of Seurat for the analysis, interpretation, and exploration of single-cell chromatin datasets.


= GPU =
= GPU =
[https://blogs.nvidia.com/blog/2021/03/08/atacworks-genomics-ai-research/ NVIDIA Researchers Use AI to Spot Active Areas in Cell DNA]
[https://blogs.nvidia.com/blog/2021/03/08/atacworks-genomics-ai-research/ NVIDIA Researchers Use AI to Spot Active Areas in Cell DNA]
= Single Cells and Single Nuclei =
* https://en.wikipedia.org/wiki/SnRNA-seq
* [https://www.rna-seqblog.com/systematic-comparison-of-single-cell-and-single-nucleus-rna-sequencing-methods/ Systematic comparison of single-cell and single-nucleus RNA-sequencing methods]
* [https://www.biocompare.com/Editorial-Articles/566664-RNA-Sequencing-from-Single-Cells-and-Single-Nuclei/ RNA Sequencing from Single Cells and Single Nuclei]
* [https://www.rna-seqblog.com/advantages-of-single-nucleus-over-single-cell-rna-sequencing/ Advantages of single-nucleus over single-cell RNA sequencing]
= Webinars =
* [https://www.sitcancer.org/education/computational SITC-NCI Computational Immuno-oncology Webinar Series] 2021
* [https://www.sitcancer.org/education/webinars-online/computational SITC-NCI Computational Immuno-oncology Webinar Series] 2022
= Brain cells =
* [https://en.wikipedia.org/wiki/Brain_cell Brain cell]
* [https://www.rna-seqblog.com/neuroscientists-roll-out-first-comprehensive-atlas-of-brain-cells/ Neuroscientists roll out first comprehensive atlas of brain cells]
= PDAC =
[https://www.rna-seqblog.com/single-cell-analysis-support-the-use-of-pdac-organoids-as-a-clinically-relevant-model-for-in-vitro-studies-of-tumor-heterogeneity/ Single-cell analysis support the use of PDAC organoids as a clinically relevant model for in vitro studies of tumor heterogeneity] 2021

Latest revision as of 08:33, 21 September 2024

Resource

Workshops

Library preparation

  • Smart-Seq2
  • Drop-Seq
  • Fluidigm-C1

Guideline

Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications

Pipeline

GEO

Applications

Public data

Curated Cancer Cell Atlas

The Curated Cancer Cell Atlas – a comprehensive pan-cancer single-cell RNA-sequencing dataset, website

scRNAseq package

scRNAseq package from Bioconductor.

ExperimentHub package used by scRNAseq, DuoClustering2018 and other packages.

SRAToolkit

sinteractive  --gres=lscratch:800  --cpus-per-task=6 --mem=40g
  # the required memory can be checked by using the '''jobload''' command
  # the required local disk space can be obtained by 'ssh cnXXX' and run 'du -sh /lscratch/JobID'
  #    where the JobID is obtained through the jobload command

module load sratoolkit

fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 # 1 min, 3.3G fastq file
fastq-dump --split-3 SRR2048331                   # 2 min 30 sec
pigz -p6 SRR12148211*.fastq  # vs gzip. 6 threads are used here 
                             # do not need to split threads,
                             # do not need to run separately

fastq-dump --split-3 --gzip SRR2048331   # 658M SRR2048331.fastq.gz (vs 626.9Mb in sra format)

fastq-dump --split-3 --gzip SRR12148213  # 862M + 2.5G for fastq.gz files (vs 2Gb in sra format)
                                         # 6.5G + 15G for fastq files
                                         # fastq/fastq.gz = 15/2.5 = 6
                                         # fastq.gz/sr = 3.4/2 = 1.7
# This is equivalent to 
# prefetch SRR12148213; cd SRR12148213
# fastq-dump --split-3 --gzip SRR12148213.sra # 45 min
# The 'prefetch' will create a new folder SRR12148213 and download SRR12148213.sra into it.
# Note the file sizes from _1.fastq and _2.fastq are quite different. 
# Another run. Check number of reads.
$ zcat SRR12148214_1.fastq.gz | wc -l
150635300
$ zcat SRR12148214_2.fastq.gz | wc -l
150635300

fasterq-dump -t /lscratch/$SLURM_JOB_ID \
  --split-files \
  -O /lscratch/$SLURM_JOBID SRR13526458; \
  pigz -p6 /lscratch/$SLURM_JOBID/SRR13526458*.fastq; \
  cp /lscratch/$SLURM_JOBID/SRR13526458*.fastq.gz ~/PDACPDX
sbatch --gres=lscratch:800 --mem=40g --cpus-per-task=6 --time=12:00:00 myscript

And the fasterq-dump's wiki and the help page

$ fasterq-dump --help

Usage: fasterq-dump [ options ] [ accessions(s)... ]

Parameters:

  accessions(s)                    list of accessions to process


Options:

  -o|--outfile <path>              full path of outputfile (overrides usage
                                     of current directory and given accession)
  -O|--outdir <path>               path for outputfile (overrides usage of
                                     current directory, but uses given
                                     accession)
  -b|--bufsize <size>              size of file-buffer (dflt=1MB, takes
                                     number or number and unit)
  -c|--curcache <size>             size of cursor-cache (dflt=10MB, takes
                                     number or number and unit)
  -m|--mem <size>                  memory limit for sorting (dflt=100MB,
                                     takes number or number and unit)
  -t|--temp <path>                 path to directory for temp. files
                                     (dflt=current dir.)
  -e|--threads <count>             how many threads to use (dflt=6)
  -p|--progress                    show progress (not possible if stdout used)
  -x|--details                     print details of all options selected
  -s|--split-spot                  split spots into reads
  -S|--split-files                 write reads into different files
  -3|--split-3                     writes single reads into special file
     --concatenate-reads           writes whole spots into one file
  -Z|--stdout                      print output to stdout
  -f|--force                       force overwrite of existing file(s)
  -N|--rowid-as-name               use rowid as name (avoids using the name
                                     column)
     --skip-technical              skip technical reads
     --include-technical           explicitly include technical reads
  -P|--print-read-nr               include read-number in defline
  -M|--min-read-len <count>        filter by sequence-lenght
     --table <name>                which seq-table to use in case of pacbio
     --strict                      terminate on invalid read
  -B|--bases <bases>               filter output by matching against given
                                     bases
  -A|--append                      append to output-file, instead of
                                     overwriting it
     --ngc <path>                  <path> to ngc file
     --perm <path>                 <path> to permission file
     --location <location>         location in cloud
     --cart <path>                 <path> to cart file
     --disable-multithreading      disable multithreading
  -V|--version                     Display the version of the program
  -v|--verbose                     Increase the verbosity of the program
                                     status messages. Use multiple times for
                                     more verbosity.
  -L|--log-level <level>           Logging level as number or enum string.
                                     One of
                                     (fatal|sys|int|err|warn|info|debug) or
                                     (0-6) Current/default is warn
     --option-file file            Read more options and parameters from the
                                     file.
  -h|--help                        print this message

"fasterq-dump" version 2.10.9

Note that fastq.gz vs fastq file size is about 1:8 ratio.

$ ls -lhogt | sed 's/^[^ ][^ ]*  *[^ ][^ ]* //'
 23G May  7 11:35 SRR10156301_2.fastq
 18G May  7 11:35 SRR10156301_1.fastq
3.1G May  7 11:02 SRR10156301_2.fastq.gz
1.4G May  7 11:02 SRR10156301_1.fastq.gz

Simulation

Splatter: Simulation Of Single-Cell RNA Sequencing Data

biorxiv. Genome Biol. 2017, splatter package. We can simulate data using 1) estimated parameters from an existing SingleCellExperiment object, or 2) user's specified parameters.

suppressPackageStartupMessages({
  library(splatter)
  library(scater)
})

library(scRNAseq)
sce2 <- MacoskoRetinaData()
sce2

ncells <- 1000
params <- splatEstimate(as.matrix(assays(sce2)$counts[, 1:ncells]))
sim <- splatSimulate(params)

comparison <- compareSCEs(list(MacoskoRetina = sce2[, 1:ncells], Splat = sim))
ggplot(comparison$ColData, aes(x = sum, y = detected, colour = Dataset)) +
    geom_point()

splatPop

splatPop – simulating population scale single-cell RNA sequencing data

Preprocess

fastq file format, UMI

  • Single-cell RNA-seq data - raw data to count matrix
    • Reads with different UMI are biological duplicates - each read should be counted.
    • Reads with the same UMI are are technical duplicates - the UMIs should be collapsed to be counted as a single read.
  • Read1 contains barcode (represents cells) and UMI (represents reads), Read2 contains transcript sequence.
  • One case from scruff vignette.

Cell ranger

  • Chromium Single Cell Gene Expression
  • PDX/Multiple Species data. Creating a Reference Package with cellranger mkref
  • pbmc3k. The data is saved in filtered_gene_bc_matrices/hg19 directory with 3 files, barcodes.tsv (45K), genes.tsv (798K) and matrix.mtx (27M).
  • Getting started with Cell Ranger by Dave Tang
  • Workshop from UC Davis
  • *10x genomics single-cell RNAseq analysis from SRA data using Cell Ranger and Seurat, bioinformaticsworkbook.org (useful)
  • how to get/make/find fastq file from cellranger?
  • filtered_feature_bc_matrix vs raw_feature_bc_matrix. "Filtered feature-barcode matrix" contains only detected cellular barcodes. <raw_feature_bc_matrix> contains mainly empty barcodes. That is, the difference is the number of cells.
  • https://hpc.nih.gov/apps/cellranger.html
    • Reference genome
      module load cellranger/5.0.0
      echo $CELLRANGER_REF/refdata-gex-GRCh38-2020-A
      # /fdb/cellranger/refdata-gex-GRCh38-2020-A
      ls /fdb/cellranger/refdata-gex-GRCh38-2020-A
      # fasta  genes  pickle  reference.json  star
      
      ls /fdb/cellranger/
      # refdata-cellranger-1.1.0         refdata-cellranger-GRCh38-1.2.0           refdata-cellranger-vdj-GRCh38-alts-ensembl-2.0.0
      # refdata-cellranger-1.2.0         refdata-cellranger-GRCh38-3.0.0           refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0
      # refdata-cellranger-2.0.0         refdata-cellranger-GRCh38-and-mm10-3.1.0  refdata-cellranger-vdj-GRCh38-alts-ensembl-4.0.0
      # refdata-cellranger-2020-A        refdata-cellranger-hg19-1.2.0             refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0
      # refdata-cellranger-2.1.0         refdata-cellranger-hg19-3.0.0             refdata-cellranger-vdj-GRCm38-alts-ensembl-2.2.0
      # refdata-cellranger-2.2.0         refdata-cellranger-hg19_and_mm10-1.2.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-3.1.0
      # refdata-cellranger-3.0.0         refdata-cellranger-hg19_and_mm10-2.1.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-4.0.0
      # refdata-cellranger-3.1.0         refdata-cellranger-hg19-and-mm10-3.0.0    refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0
      # refdata-cellranger-4.0.0         refdata-cellranger-mm10-1.2.0             refdata-gex-GRCh38-2020-A
      # refdata-cellranger-5.0.0         refdata-cellranger-mm10-2.1.0             refdata-gex-GRCh38_and_mm10-2020-A
      # refdata-cellranger-ercc92-1.2.0  refdata-cellranger-mm10-3.0.0             refdata-gex-mm10-2020-A
      
      ls /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes
      # genes.gtf 
      wc -l /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
      # 2765974 /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
      
      # One column contains "gene_name"
      head -n 6 /fdb/cellranger/refdata-gex-GRCh38-2020-A/genes/genes.gtf
      ##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
      ##provider: GENCODE
      ##contact: [email protected]
      ##format: gtf
      ##date: 2019-09-05
      chr1	HAVANA	gene	29554	31109	.	+	.	gene_id "ENSG00000243485"; gene_version "5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2; hgnc_id "HGNC:52482"; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
      
    • cellranger mkfastq wraps Illumina's bcl2fastq to correctly demultiplex Chromium-prepared sequencing samples and to convert barcode and read data to FASTQ files.
    • cellranger count takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-cell matrices and perform clustering and gene expression analysis.
    • cellranger aggr aggregates results from cellranger count. How does cellranger aggr normalize for sequencing depth among multiple libraries?
    • cellranger reanalyze takes feature-barcode matrices produced by cellranger count or aggr and re-runs the dimensionality reduction, clustering, and gene expression algorithms.
    • The example code will generate an output directory with
      $ tree s1/outs
      
      ├── analysis
      │   ├── clustering
      │   │   ├── graphclust
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_10_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_2_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_3_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_4_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_5_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_6_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_7_clusters
      │   │   │   └── clusters.csv
      │   │   ├── kmeans_8_clusters
      │   │   │   └── clusters.csv
      │   │   └── kmeans_9_clusters
      │   │       └── clusters.csv
      │   ├── diffexp
      │   │   ├── graphclust
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_10_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_2_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_3_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_4_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_5_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_6_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_7_clusters
      │   │   │   └── differential_expression.csv
      │   │   ├── kmeans_8_clusters
      │   │   │   └── differential_expression.csv
      │   │   └── kmeans_9_clusters
      │   │       └── differential_expression.csv
      │   ├── pca
      │   │   └── 10_components
      │   │       ├── components.csv
      │   │       ├── dispersion.csv
      │   │       ├── features_selected.csv
      │   │       ├── projection.csv
      │   │       └── variance.csv
      │   ├── tsne
      │   │   └── 2_components
      │   │       └── projection.csv
      │   └── umap
      │       └── 2_components
      │           └── projection.csv
      ├── cloupe.cloupe
      ├── filtered_feature_bc_matrix
      │   ├── barcodes.tsv.gz
      │   ├── features.tsv.gz
      │   └── matrix.mtx.gz   #  36601 x 25 after dim(Read10X(data.dir ="filtered_feature_bc_matrix"))
      ├── filtered_feature_bc_matrix.h5
      ├── metrics_summary.csv
      ├── molecule_info.h5
      ├── possorted_genome_bam.bam
      ├── possorted_genome_bam.bam.bai
      ├── raw_feature_bc_matrix
      │   ├── barcodes.tsv.gz
      │   ├── features.tsv.gz
      │   └── matrix.mtx.gz  # 36601 x 737280
      ├── raw_feature_bc_matrix.h5
      └── web_summary.html
      
      31 directories, 41 files
      
      # The 2nd column contains gene symbol
      $ zcat s1/outs/filtered_feature_bc_matrix/features.tsv.gz | head -5
      ENSG00000243485	MIR1302-2HG	Gene Expression
      ENSG00000237613	FAM138A	Gene Expression
      ENSG00000186092	OR4F5	Gene Expression
      ENSG00000238009	AL627309.1	Gene Expression
      ENSG00000239945	AL627309.3	Gene Expression
      
      # R
      > x <- Read10X("s1/outs/filtered_feature_bc_matrix")
      > dim(x)
      [1] 36601    25
      > x[1:5, 1:3]
      5 x 3 sparse Matrix of class "dgCMatrix"
                  AACGTTGGTTAAAGTG-1 AGTAGTCAGAGCTATA-1 ATCTGCCCATACTCTT-1
      MIR1302-2HG                  .                  .                  .
      FAM138A                      .                  .                  .
      OR4F5                        .                  .                  .
      AL627309.1                   .                  .                  .
      AL627309.3                   .                  .                  .
      > length(unique(rownames(x)))
      [1] 36601
      
  • Why do we have Barcodes with Counts that are not Called Cells? Ambient RNA, A barcode is called a cell if Total UMI count > M/10 is called a cell, M = Cell with maximum total UMI counts (99th percentile of the top N)
  • Reads are considered duplicated, if they map to the same gene and have the same UMI. Instead of counting reads we will count number of unique UMIs per gene.

10x Genomics fragment schematic

10x Genomics fragment schematic

10x Molecule Info. This is referenced by the DropletUtils::read10xMolInfo() function.

plot

This is an example (GSM4088780) of part of "web_summary.html" output from running cellranger count 2.1.1.

Cellrangersum.png

We can create a similar plot using the DropletUtils::barcodeRanks() function. Here is the plot from running the example.

BarcodeRanks.png

  • 11,100 barcoded droplets
  • totals =colSums(mat). It has 877 unique values. Range is [0, 2344].
  • lower=100. Only cells satisfies the cutoff will be used in fitting the curve and find left.edge, right.edge, knee, inflection.
  • run.totals 877 values. Range [0, 2344]
  • run.rank 877 values. Range [1, 10900.5]
  • out = 11,100 x 3. Columns include rank, total, fitted
  • left.edge=68, rigth.edge=653. Associate with the x-axis. Relative to x, y.
  • knee=657, inflection=321 (inflection <- 10^(y[right.edge])). Used in the y-axis
  • In cellranger count, y-axis = UMI counts, x-axis = Barcodes.
  • In cellranger count, green points = cells, gray points = background.

emptyDrops() and DropletQC

emptyDrops()

set.seed(0)
my.counts <- DropletUtils:::simCounts()

# Identify likely cell-containing droplets.
out <- emptyDrops(my.counts)
out
# DataFrame with 11100 rows and 5 columns
#           Total   LogProb    PValue   Limited        FDR
#       <integer> <numeric> <numeric> <logical>  <numeric>
# 1             2        NA        NA        NA         NA
# 2             9        NA        NA        NA         NA
# ...         ...       ...       ...       ...        ...
# 11099       191  -228.763 9.999e-05      TRUE 0.00013799
# 11100       198  -233.043 9.999e-05      TRUE 0.00013799

is.cell <- out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)  # 943

DropletQC – improved identification of empty droplets and damaged cells in single-cell RNA-seq data

estimateAmbience()

estimateAmbience().

This function obtains an estimate of the composition of the ambient pool of RNA based on the barcodes with total UMI counts less than or equal to "lower". The default approach is to assume that all barcodes with total counts less than or equal to "lower" are empty.

It returns a numeric vector of length equal to nrow(m), containing the estimated the transcript proportions in the ambient solution.

set.seed(0)
my.counts <- DropletUtils:::simCounts()
dim(my.counts) # [1]   100 11100

ambience <- estimateAmbience(my.counts)
head(ambience)
#        GENE1        GENE2        GENE3        GENE4        GENE5        GENE6 
# 0.0002213677 0.0003984648 0.0006331183 0.0008146427 0.0010935705 0.0012042561
length(ambience)  #  [1] 100

removeAmbience()

removeAmbience()

Estimate and remove the ambient profile from a count matrix, given pre-existing groupings of similar cells. This function is largely intended for plot beautification rather than real analysis.

It returns a numeric matrix-like object of the same dimensions as y, containing the counts after removing the ambient contamination.

ngenes <- 1000
set.seed(1)
ambient <- runif(ngenes, 0, 0.1)  # length = 1000
cells <- c(runif(100) * 10, integer(900))  # length = 1000
y <- matrix(rpois(ngenes * 100, cells + ambient), nrow=ngenes)

# Pretending that all cells are in one group, in this example.
removed <- removeAmbience(y, ambient, groups=rep(1, ncol(y)))
dim(y)       # [1] 1000  100
dim(removed) # [1] 1000  100
summary(as.vector(y))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.0000  0.0000  0.0000  0.5423  0.0000 20.0000 
summary(as.vector(removed))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   0.000   0.000   0.000   0.515   0.000  20.000 

Use tools other than cellranger

See Chapter 3 Processing Raw scRNA-seq Data and Chapter 4 Construction of expression matrix.

(non-UMI) Kallisto-Bustools or Alevin, featureCounts, RSEM. Note Kallisto can do both alignment and gene counting.

(UMI-based) UMI-tools & zUMIs, zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs

Imputation

Evaluating imputation methods for single-cell RNA-seq data

PDX

WASP – a web-accessible single cell RNA-Seq processing

WASP – a versatile, web-accessible single cell RNA-Seq processing platform

Seurat*

  • Introduction vignette
    Purpose Code Comment
    Setup the Seurat Object pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
    pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
    pbmc[["RNA"]]@counts
    Note the @ operator
    pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
    dense.size <- object.size(as.matrix(pbmc.data))
    QC pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
    VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
    plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
    plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    plot1 + plot2
    pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
    select cells with number of features in the range of (200,2500)
    Normalizing the data pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
    pbmc[["RNA"]]@data
    Note the @ operator
    feature selection pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
    top10 <- head(VariableFeatures(pbmc), 10)
    plot1 <- VariableFeaturePlot(pbmc)
    plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
    plot1 + plot2
    Scaling the data all.genes <- rownames(pbmc)
    pbmc <- ScaleData(pbmc, features = all.genes)
    pbmc[["RNA"]]@scale.data
    Note the @ operator
    Perform linear dimensional reduction pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
    print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
    VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
    DimPlot(pbmc, reduction = "pca")
    DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
    DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    JackStrawPlot(pbmc, dims = 1:15)
    ElbowPlot(pbmc)
    Cluster the cells pbmc <- FindNeighbors(pbmc, dims = 1:10)
    pbmc <- FindClusters(pbmc, resolution = 0.5)
    head(Idents(pbmc), 5)
    Non-linear transformation: UMAP/tSNE pbmc <- RunUMAP(pbmc, dims = 1:10)
    DimPlot(pbmc, reduction = "umap")
    Finding differentially expressed features cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
    head(cluster1.markers, n = 5)
    cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
    head(cluster5.markers, n = 5)
    pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
    cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
    VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
    VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
    FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
    top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
    DoHeatmap(pbmc, features = top10$gene) + NoLegend()
    Assigning cell type identity to clusters new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
    names(new.cluster.ids) <- levels(pbmc)
    pbmc <- RenameIdents(pbmc, new.cluster.ids)
    DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

seurat-wrappers

https://github.com/satijalab/seurat-wrappers

Seurat object

subset genes

<Method 1>: How to subset on genes and preserve idents?

seurat.object # original seurat object
set.seed(1); genes.use <- sample(1:nrow(seurat.object), 1000)
subset.matrix <- seurat.object[['RNA']]@data[genes.use, ]
seurat.object_sub <- CreateSeuratObject(subset.matrix) 
seurat.object_sub <-  SetIdent(seurat.object_sub, 
                          value=as.factor([email protected]$manual_cluster_tree_addmodulescore))

<Method 2>: subset() function. See Seurat Command List.

<Method 3>: Seurat Weekly NO.2 || 我该如何取子集?(Downsampling)

Seurat methods

QC

QC part explanation:

> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc2 <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)
> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> pbmc2
An object of class Seurat 
13714 features across 2695 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> ind <- which(!colnames(pbmc[['RNA']]) %in% colnames(pbmc2[['RNA']]))
> ind
[1]  427 1060 1762 1878 2568
> tmp <- apply(pbmc[['RNA']]@data, 2, function(x) sum(x > 0)) # detected genes per cell
> range(tmp)
[1]  212 3400
> hist(tmp) # kind of normal
> which(tmp > 2500)
AGAGGTCTACAGCT-1 CCAGTCTGCGGAGA-1 GCGAAGGAGAGCTT-1 
             427             1060             1762 
GGCACGTGTGAGAA-1 TTACTCGAACGTTG-1 
            1878             2568 
> tmp2 <- colSums(pbmc[['RNA']]@data) # total counts per cell
> hist(tmp2)  # kind of normal

Normalization

  • NormalizeData(). log1p(value/colSums[cell-idx] *scale_factor). The scale_factor is 10,000 by default and this gives a good range on the normalized values (at least on the pbmc data). The cpp source code of LogNorm() in preprocessing.R.
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 10000))
    [1] 0.000000 5.753142
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 1000))
    [1] 0.000000 3.478712
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 100000))
    [1] 0.000000 8.052868
    

    And below is a comparison of the raw counts and normalized values:

    R> pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                              scale.factor = 10000)
    R> cbind(pbmc[['RNA']]@data[i, 1], pbmc2[['RNA']]@data[i, 1])
                [,1]     [,2]
    MRPL20         1 1.635873
    RPL22          1 1.635873
    TNFRSF25       2 2.226555
    EFHD2          1 1.635873
    NECAP2         1 1.635873
    AKR7A2         1 1.635873
    CAPZB          1 1.635873
    RPL11         41 5.138686
    RP5-886K2.3    1 1.635873
    PITHD1         1 1.635873
    

Integration datasets

DE analysis

Cluster-free DE analysis

https://twitter.com/satijalab/status/1635641897560559617 LEMUR and miloDE

Visualization

Azimuth

Azimuth - App for reference-based single-cell analysis

Public data examples

GUI/web applications

Asc-Seurat – Analytical single-cell Seurat-based web application 2021

tidyseurat

tidyseurat, paper in biorxiv & in Bioinformatics

Georges Seurat

Georges Seurat’s 162nd Birthday by Google

Bioconductor packages

Description Packages
Orchestrating Single-Cell Analysis with Bioconductor (OSCA)
simpleSingleCell is an earlier version
See the list here
single cell RNA-Seq workflow (included in OSCA now) simpleSingleCell
visualization tools for single-cell and Bulk RNA Sequencing dittoSeq
An Introduction to Single-Cell RNA-sequencing Analysis in Bioconductor (中文, based on OSCA ebook) DropletUtils, scran, scater, SingleR, BiocFileCache
An Opinionated Computational Workflow for Single-Cell RNA-seq and ATAC-seq SingleCellExperiment, DropletUtils, scater, iSEE(shiny), scran, limma, edgeR
SplatPop: Simulating Population-Scale Single-Cell RNA-sequencing Data splatter
Importing alevin scRNA-seq counts into R/Bioconductor (check 'Get started') tximeta, SingleCellExperiment, fishpond, scran, Seurat

scBubbletree

scBubbletree: computational approach for visualization of single cell RNA-seq data 2024

GUI

NIDAP

https://nidap.nih.gov/ which uses Code workbook (Code Workbook is an application that allows users to analyze and transform data using an intuitive graphical interface) for the interaction. The R/python analysis code can be accessed just like GEO2R from GEO. The background material is available on here.

SCTK2

Interactive analysis of single-cell data using flexible workflows with the Single-Cell Toolkit 2

Partek

Past Webinars

BingleSeq

BingleSeq: a user-friendly R package for bulk and single-cell RNA-Seq data analysis

Cellar

Interactive single-cell data analysis using Cellar 2022

ICARUS

ICARUS – an interactive web server for single cell RNA-seq analysis

How many cells are in the human body?

30-40 trillion (1012) cells

Power analysis

  • powsimR: power analysis for bulk and single cell RNA-seq experiments. It is time-consuming and error-prone to install lots of required packages. Better to have a Docker image. See my note here.
  • SCOPIT: sample size calculations for single-cell sequencing experiments
  • POWSC - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.
    library(devtools)
    install_github("haowulab/SC2P", build_vignettes=TRUE)
    install_github("suke18/POWSC", build_vignettes = T, dependencies = T)
    
    library(POWSC)
    data("es_mef_sce")
    sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"]
    est_Paras = Est2Phase(sce)
    
    sim_size = c(100, 400, 1000) # A numeric vector
    pow_rslt = runPOWSC(sim_size = sim_size, 
                        est_Paras = est_Paras,
                        per_DE=0.05, 
                        DE_Method = "MAST", 
                        Cell_Type = "PW") 
    # Note, using our previous developed tool SC2P is faster.
    
    packageVersion("POWSC")  # [1] '0.1.0'
    help(package="POWSC")
    plot.POWSC(pow_rslt, Form="II", Cell_Type = "PW") 
    summary.POWSC(pow_rslt, Form="II", Cell_Type = "PW")
    #      (0,10] (10,20] (20,40] (40,80] (80,160] (160,Inf]
    # 100  0.0171  0.1270  0.2877  0.2740   0.4909    0.6765
    # 400  0.3200  0.5373  0.6486  0.7436   0.7959    0.8429
    # 1000 0.5534  0.7424  0.8095  0.9067   0.9815    0.9077
    

Mitochondrion

Spike-in

  • RNA spike-in.
    • In the context of RNA-seq, “spike-in” data refers to the addition of a known quantity of synthetic RNA to a sample before sequencing. This synthetic RNA, known as an RNA spike-in, has a known sequence and is used to calibrate measurements and normalize the data. By comparing the observed signal from the spike-in RNA to its known quantity, researchers can correct for technical variability and improve the accuracy of their measurements.
    • An RNA spike-in is an RNA transcript of known sequence and quantity used to calibrate measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
  • Spike-in genes are genes that are artificially added to a biological sample, typically at a known concentration, in order to serve as a reference or control for subsequent gene expression analysis.
    • Spike-in genes can be used to help normalize gene expression data and account for variations in sample preparation, handling, and sequencing. By adding known quantities of a spike-in gene to a sample, researchers can determine if the sample preparation and sequencing were performed correctly and if there were any systematic biases or variations in the data.
    • Spike-in genes are typically added to a sample at a constant concentration or a range of concentrations, depending on the specific experimental design. The spike-in genes can be used to monitor technical variability across samples and experiments, and to help ensure that any observed differences in gene expression are due to biological differences and not technical variations.
    • There are different types of spike-in genes that can be used depending on the experimental design, such as synthetic RNA molecules, bacterial genes, or exogenous control genes. Spike-in genes are commonly used in gene expression analysis techniques such as microarrays, RNA sequencing, and quantitative polymerase chain reaction (qPCR).
  • How genes can be added to a biological sample
    • Transfection
    • Microinjection
    • Electroporation
    • Viral delivery
  • Numerical example
    • Let's say a researcher wants to compare the gene expression levels of Sample A and Sample B. The researcher prepares both samples for RNA sequencing, and adds a known amount of a spike-in RNA molecule to both samples. The spike-in RNA molecule is designed to have a known sequence and a concentration of 50,000 copies per microliter.
    • The researcher then performs RNA sequencing on both samples and quantifies the expression levels of the spike-in RNA molecule and the endogenous genes. The following table shows the measured counts of the spike-in RNA molecule and a selected endogenous gene, GENE X, in both Sample A and Sample B:
    Sample Spike-in RNA GENE X
    A 100,000 500
    B 80,000 700
    • The formula for normalizing the counts of gene X to the counts of the spike-in RNA molecule is as follows:
    Normalized count = (Raw count of gene X / Raw count of spike-in RNA molecule) x 
                       (Average counts of spike-in RNA molecule across samples)
    • To account for this technical variation, the researcher can use the counts of the spike-in RNA molecule to normalize the counts of the endogenous genes. For example, the normalized counts of GENE X in Sample A and Sample B would be:
    Sample Normalized GENE X
    A 500/100,000*90,000
    B 700/80,000*90,000
  • scran vignette.
    • We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
    • An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
    • If we have spike-ins, we can use them to fit the trend (variance-mean relationship) instead.
  • Spike-in gene concentrations are known, normalization model existing technical variation by utilizing the difference between these known values and the values observed after processing. "Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey". Lytal 2020

CPM and glmpca package

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, glmpca package. All codes are available in github. A talk by Hicks 2021.

Transformation

Comparison of transformations for single-cell RNA-seq data

AI

Quality control

Doublet

Downsampling

  • downsampleMatrix() from the scuttle package
    • downsampleMatrix: Downsample a count matrix to a desired proportion
    • Source code of downsampleMatrix
    R> library(scuttle)
    R> sce <- mockSCE()
    R> summary(colSums(counts(sce)))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     344769  366648  374586  373876  382028  396251 
    
    R> downsampled2 <- downsampleMatrix(counts(sce), 
                                        prop = 10000/colSums(counts(sce)), bycol=TRUE)
    R> colSums(downsampled2[, 1:5])
    Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 
       10000    10000    10000    10000    10000 
    
  • downsampleReads() from the DropletUtils package which imports the scuttle package.
    R> library(DropletUtils)
    # Mocking up some 10X HDF5-formatted data.
    R> out <- DropletUtils:::simBasicMolInfo(tempfile())
    R> a <- read10xMolInfo(out)
    R> names(a)
    [1] "data"  "genes"
    R> dim(a$data)
    [1] 9508    5
    R> str(a$genes)
     chr [1:20] "ENSG1" "ENSG2" "ENSG3" "ENSG4" "ENSG5" "ENSG6" "ENSG7" "ENSG8" ...
    R> a$data[1:3, ]
    DataFrame with 3 rows and 5 columns
             cell       umi gem_group      gene     reads
      <character> <integer> <integer> <integer> <integer>
    1        GACT    781899         1         9        10
    2        CAGC    747670         1         9         9
    3        GTAC    614190         1         6        10
    R> unique(a$data$gene)
     [1]  9  6 15  7 12 13  1 20  4  8  3 14 17 19 18 16  2  5 10 11
    R> length(unique(a$data$gene))   # 20 genes
    [1] 20
    R> length(unique(a$data$cell))   # 256 cells
    [1] 256
    
    # Downsampling by the reads.
    R> b<- downsampleReads(out, barcode.length=4, prop=0.5)
    R> dim(b)
    [1]  20 256
    R> b[1:3, 1:5]
    3 x 5 sparse Matrix of class "dgCMatrix"
          AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1
    ENSG1      .      1      .      1      5
    ENSG2      1      1      1      4      .
    ENSG3      1      1      .      1      4
    R> class(b)
    [1] "dgCMatrix"
    attr(,"package")
    [1] "Matrix"
    R> 4^4   # ATCG for each character of the barcode
    [1] 256
    R> 20*256  # So the combinations of (cell, gene) are not unique
    [1] 5120
    R> length(unique(a$data$umi)) # So UMIs are unique
    [1] 9508
    R> length(unique(a$data$gem_group))
    [1] 1
    

    Change prop parameter. Notice the result of downsampling reads does not change much.

    R> b1 <- downsampleReads(out, barcode.length=4, prop=.9999)
    R> colSums(b1)[1:5]
    AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1 
        41     37     28     36     35 
    R> b2 <- downsampleReads(out, barcode.length=4, prop=0.2)
    R> colSums(b2)[1:5]
    AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1 
        37     31     23     31     28 
    R> b5 <- downsampleReads(out, barcode.length=4, prop=0.5)
    R> colSums(b5)[1:5]
    AAAA-1 AAAC-1 AAAG-1 AAAT-1 AACA-1 
        41     37     28     36     35 
    R> sum(b1)
    [1] 9508
    R> sum(b2)
    [1] 8245
    R> sum(b5)
    [1] 9448
    
  • 7.5.2 Downsampling and log-transforming from OSCA
  • Down_Sample_Matrix()
  • zUMIs
  • Current best practices in single-cell RNA-seq analysis: a tutorial. While downsampling throws away data, it also increases technical dropout rates which CPM and other global scaling normalization methods do not. Thus, downsampling can deliver a more realistic representation of what cellular expression profiles would look like at similar count depths.
  • Huang 2020 Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data
    • downsample features (genes). down_sample_gene(). The idea is sample(gene_list, size = size, replace = FALSE, prob = prob_list) where gene_list is the list of detected genes and size is 5,000/10,000/15,000 and prob_list is the proportion of gene counts (defined by using rowSums()). We randomly downsampled the features from Fluidigm C1 data into 15,000, 10,000 and 5000 input genes, based on the original log count distribution.
    • downsample the reads into 25%, 50%, 75% of the original read depth (with two repetitions) on BAM files, and then realigned files following the method provided by the original manuscript.

Normalization

  • Introduction to single-cell RNA-seq. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
  • Bulk normalization methods
    • DESeq2: median of ratios (MoR) approach
    • edgeR: trimmed mean of M values
  • Count models for normalization of single-cell RNA-seq data. See Single Cell Genomics Day 2021

scone package: normalization

Dino

Normalization by distributional resampling of high throughput single-cell RNA-sequencing data Brown 2021

Batch correction

  • Question: scRNA-seq and the batch effects. The biggest problem I found is, since you have no idea what's the real data suppose to be.
  • https://www.biostars.org/p/316204/
  • Over-correction can happen.
    • Single-cell RNA-seq Analysis on NIDAP - Tutorial Part 2 40:00 (nih network is required). Groups differ by a single experimental modification (eg a knockoff gene or cells of similar lineage used to determine differentiation pattern). However batch correction is necessary in cases such as spike-in cells where we expect these cells should be clustered together.
  • Mutual nearest neighbors/MNN. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors Haghverdi 2018...
    • One assumption is the batch effect is almost orthogonal to the biological subspace. It also assumes the batch effect variation in much smaller than the biological effect variation between different cell types.
    • It is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4... We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression.
    • Sadly the R code of the paper does not have the file ('MAP2.csv') and the code does not work anymore (e.g. mnnCorrect() does not have $angles component). Check out the mnnpy module which still has an option to return angles.
  • SMNN can eliminate the overcorrection between different cell types & allow that batch effects are non-orthogonal to the biological differences. github source code. Note that the package is only available on github and it depends on the python mnnpy module.
    library(reticulate)
    use_python("/home/rstudio/.conda/envs/sc_env/bin/python")
    py_config()  # confirm
    
    library("SMNN")
    data("data_SMNN")
    dim(data_SMNN$batch1.mat)  # [1] 3000  400
    data_SMNN$batch1.mat[1:5, 1:5]
    dim(data_SMNN$batch2.mat)  # [1] 3000  500
    # Maker genes
    markers <- c("Col1a1", "Pdgfra", "Ptprc", "Pecam1")
    # Corresponding cell type labels for each marker gene
    cluster.info <- c(1, 1, 2, 3)
    matched_clusters <- unifiedClusterLabelling(data_SMNN$batch1.mat, 
                                                data_SMNN$batch2.mat, 
                                                features.use = markers, 
                                                cluster.labels = cluster.info, 
                                                min.perc = 0.3)
    
    corrected.results <- SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat), 
                                     batch.cluster.labels = matched_clusters, 
                                     k=20, 
                                     sigma=1, 
                                     cos.norm.in=TRUE, 
                                     cos.norm.out=TRUE, 
                                     subset.genes=rownames(data_SMNN$batch2.mat))
    # NOTE: subset.genes parameter was added to fix a bug when I'm running the function
    # [1] "Data preparation ..."
    # Error: C stack usage  555609078676 is too close to the limit
    
    > traceback()
    3: py_call_impl(callable, dots$args, dots$keywords)
    2: mnnpy$utils$transform_input_data(datas = batches.t, cos_norm_in = cos.norm.in, 
           cos_norm_out = cos.norm.out, var_index = as.character(c(0:ncol(batches.t1))), 
           var_subset = subset.index, n_jobs = n.jobs)
    1: SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat), 
           batch.cluster.labels = matched_clusters, k = 20, sigma = 1, 
           cos.norm.in = TRUE, cos.norm.out = TRUE, subset.genes = rownames(data_SMNN$batch2.mat)[1:100])
    
  • https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html
    • Seurat (CCA): RunMultiCCA() seems not available any more. Check out Integration and Label Transfer in Seurat 3 and Introduction to scRNA-seq integration FindIntegrationAnchors() & IntegrateData().
      # Create 2D UMAP DR plot for data before integration
      # It can be seen there is a need to do batch effect correction
      ifnb2 <- ifnb
      ifnb2 <- NormalizeData(ifnb2)
      ifnb2 <- FindVariableFeatures(ifnb2, selection.method = "vst", nfeatures = 2000)
      ifnb2 <- ScaleData(ifnb2, verbose = FALSE)
      ifnb2 <- RunPCA(ifnb2, npcs = 30, verbose = FALSE)
      ifnb2 <- RunUMAP(ifnb2, reduction = "pca", dims = 1:30)
      ifnb2 <- FindNeighbors(ifnb2, reduction = "pca", dims = 1:30)
      ifnb2 <- FindClusters(ifnb2, resolution = 0.5)
      
      # Compared to the 2D plot before and after integration
      DimPlot(ifnb2, reduction = "umap", group.by = "stim") # not good, ideally two groups
                           # should be mixed together.
                           # It seems the blue color is on top of the salmon color.
      DimPlot(immune.combined, reduction = "umap", group.by = "stim")
                           # Two groups are mixed. 
      
      # Compare the cluster plots before and after integration
      DimPlot(ifnb2, reduction = "umap", split.by = "stim") # not good, ideally the cluster
                           # pattern should be similar in both groups 
      DimPlot(immune.combined, reduction = "umap", split.by = "stim") #good, the cluster
                           # patterns are similar in both groups
      

      FindConservedMarkers() and [email protected], ?FindConservedMarkers. ident.1 & ident.2 are used to define cells from the @active.ident component. For example, we can it to find genes that are conserved markers irrespective of stimulation/group condition in a certain cluster. Together with FeaturePlot() we can see the gene expression in 2D space (one plot per feature). The DotPlot() function with the split.by parameter can be useful for viewing conserved cell type markers across conditions

      table([email protected]$stim, [email protected])
      
      table(pbmc_small$groups, [email protected])
      
    • liger LIGER (NMF): liger:: optimizeALS(). Note that in order to use the R package, it is necessary to install hdf5 or libhdf5-dev library. Vignettes are available on github source but not on CRAN package.
    • Harmony: RunHarmony()

Clustering

# after I upgrade R from 4.0.x to 4.1.x
R> library(DuoClustering2018)
snapshotDate(): 2021-05-18
Warning message:
DEPRECATION: As of ExperimentHub (>1.17.2), default caching location has changed.
  Problematic cache: /Users/XXX/Library/Caches/ExperimentHub
  See https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update

NMF

cNMF

CancerSubtypes

library(CancerSubtypes)
data(GeneExp)
dim(GeneExp) # 1500 x 100
result=ExecuteCNMF(GeneExp,clusterNum=3,nrun=30)
dim(result$distanceMatrix)  # consensus/Similarity returned from nmf()
# [1] 100 100
round(unique(result$distanceMatrix), 3)
# [1] 1.000 0.000 0.167 0.067 0.933 0.100 0.433 0.900 0.833 0.567
#[11] 0.233 0.733 0.267 0.133 0.967 0.633 0.500 0.667 0.200 0.533
sil=silhouette_SimilarityMatrix(result$group, result$distanceMatrix)
plot(sil, col = c("red", "blue", "green"))
# Note that the average silhouette width is computed without considering groups.
# mean(c(.96, .98, 1)) # .98
# mean(sil[, 3]) #  0.9714631  match with the plot
# mean(tapply(sil[, 3], sil[, 1], mean)) # 0.9794998

pca <- prcomp(t(GeneExp))
plot(pca$x[,1], pca$x[, 2], col=result$group, pch=19) # black, red, green

seurat::BuildClusterTree

Significance Analysis for Clustering

Significance Analysis for Clustering with Single-Cell RNA-Sequencing Data 2022

Model

Feature selection

Feature selection revisited in the single-cell era Yang 2021

DE analysis

GSEA

AddModuleScore

library(Seurat)
data("pbmc_small")
cd_features <- list(c('CD79B', 'CD79A', 'CD19', 'CD180', 'CD200', 'CD3D',
  'CD2', 'CD3E', 'CD7', 'CD8A', 'CD14', 'CD1C', 'CD68', 'CD9', 'CD247'))
pbmc_small <- AddModuleScore(
  object = pbmc_small,
  features = cd_features,
  ctrl = 5,
  name = 'CD_Features'
)
head(pbmc_small, 3)

                  orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
ATGCCAGAACGACT SeuratProject         70           47               0
CATGGCCTGTGCAT SeuratProject         85           52               0
GAACCTGATGAACC SeuratProject         87           50               1
               letter.idents groups RNA_snn_res.1 CD_Features1
ATGCCAGAACGACT             A     g2             0    0.4420928
CATGGCCTGTGCAT             A     g1             0    0.4285719
GAACCTGATGAACC             B     g2             0    0.9217770

Tweedieverse

DE analysis with multiple samples

DE analysis with multiple samples. scuttle::aggregateAcrossCells()

Cell type deconvolution

Cell type annotation

GPT-4

Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis

Cell-level metadata

Cell-level metadata are indispensable for documenting single-cell sequencing datasets Puntambekar 2021. Shiny app. Source code.

GSE137710 is a good example.

Cell subtypes

Single-cell RNA sequencing reveals specific cell subtypes that influence survival and determine molecular subtype classification of ovarian cancers

Some cell types

Cell-cell interactions

  • CellChat
    • Compute the communication probability
      # load data
      load("data_humanSkin_CellChat.rda")
      data.input = data_humanSkin$data # normalized data matrix
      meta = data_humanSkin$meta # a dataframe with rownames containing cell mata data
      cell.use = rownames(meta)[meta$condition == "LS"] # extract the cell names from disease data
      data.input = data.input[, cell.use]
      meta = meta[cell.use, ]
      unique(meta$labels) 
      
      # Create a CellChat object
      cellchat <- createCellChat(object = data.input, meta = meta, group.by = "labels")
      
      # Add cell information
      cellchat <- addMeta(cellchat, meta = meta)
      cellchat <- setIdent(cellchat, ident.use = "labels") # set "labels" as default cell identity
      levels(cellchat@idents) # show factor levels of the cell labels
      #  [1] "APOE+ FIB"    "FBN1+ FIB"    "COL11A1+ FIB" "Inflam. FIB"  "cDC1"        
      #  [6] "cDC2"         "LC"           "Inflam. DC"   "TC"           "Inflam. TC"  
      # [11] "CD40LG+ TC"   "NKT" 
      groupSize <- as.numeric(table(cellchat@idents))
      
      # Set the ligand-receptor interaction database
      CellChatDB <- CellChatDB.human 
      
      # Preprocessing the expression data
      CellChatDB.use <- subsetDB(CellChatDB, search = "Secreted Signaling") # use Secreted Signaling
      cellchat@DB <- CellChatDB.use
      cellchat <- subsetData(cellchat)
      cellchat <- identifyOverExpressedGenes(cellchat)
      cellchat <- identifyOverExpressedInteractions(cellchat)
      cellchat <- projectData(cellchat, PPI.human)
      
      # Compute the communication probability
      cellchat <- computeCommunProb(cellchat)
      cellchat <- computeCommunProbPathway(cellchat)
      str(cellchat@net)
      # List of 2
      #  $ prob: num [1:12, 1:12, 1:656] 0 0 0 0 0 0 0 0 0 0 ...
      #   ..- attr(*, "dimnames")=List of 3
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:656] "TGFB1_TGFBR1_TGFBR2" "TGFB2_TGFBR1_TGFBR2" "TGFB3_TGFBR1_TGFBR2"  "TGFB1_ACVR1B_TGFBR2" ...
      #  $ pval: num [1:12, 1:12, 1:656] 1 1 1 1 1 1 1 1 1 1 ...
      #   ..- attr(*, "dimnames")=List of 3
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:656] "TGFB1_TGFBR1_TGFBR2" "TGFB2_TGFBR1_TGFBR2" "TGFB3_TGFBR1_TGFBR2" "TGFB1_ACVR1B_TGFBR2" ...
      str(cellchat@netP)
      # List of 2
      #  $ pathways: chr [1:13] "MIF" "GALECTIN" "CXCL" "COMPLEMENT" ...
      #  $ prob    : num [1:12, 1:12, 1:13] 0 0 0 0 0 0 0 0 0 0 ...
      #   ..- attr(*, "dimnames")=List of 3
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:12] "APOE+ FIB" "FBN1+ FIB" "COL11A1+ FIB" "Inflam. FIB" ...
      #   .. ..$ : chr [1:13] "MIF" "GALECTIN" "CXCL" "COMPLEMENT" ...
      sum(cellchat@net$prob)
      # [1] 18.65038
    • The error I saw (R 4.1.1, CellChat 1.1.3)
      cellchat <- netEmbedding(cellchat, type = "functional")
      Manifold learning of the signaling networks for a single dataset 
      Error in runUMAP(Similarity, min_dist = min_dist, n_neighbors = n_neighbors,  : 
        Cannot find UMAP, please install through pip (e.g. pip install umap-learn or reticulate::py_install(packages = 'umap-learn')).
      
    • cellchat netEmbedding 运行出错. It provides another solution (no need to install anything using pip) for calculating umap.
      library(uwot)
      cellchat <- netEmbedding(cellchat, umap.method = 'uwot',type = "functional")
      
    • Error in runUMAP(Similarity, min.dist = 0.3, n.neighbors = k)
    • function netEmbedding is unable to work #167
    • CellChat:细胞间相互作用分析利器 which contains two youtube links to the talks by the author

Integrating datasets

Benchmark

Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability

scanpy python package

Trajectory/pseudotime analysis

PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells

simulation

splatSimulatePaths() from splatter

monocle package for pseudotime analysis

slingshot

slingshot Tools for ordering single-cell sequencing

tradeSeq

https://bioconductor.org/packages/release/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html. tradeSeq is an R package that allows analysis of gene expression along trajectories.

Spatial Transcriptomics Data Analysis

sincell package

SCell

SCell – integrated analysis of single-cell RNA-seq data

GEVM

Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).

NMFEM

Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Scater

Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R

NDRindex

NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

DEsingle

DEsingle – detecting three types of differential expression in single-cell RNA-seq data

totalVI

Joint probabilistic modeling of single-cell multi-omic data with totalVI

Copy number

Copy number by infercnv

  • https://github.com/broadinstitute/inferCNV/wiki
  • infercnv - Infer Copy Number Variation from Single-Cell RNA-Seq Data
  • We need to install jags software before we install the required package rjags. It seems OK to install this old version 4.3.0 of software on macOS Catalina 10.15.7. To install jags, right click the file 'JAGS-4.3.0.mpkg' and choose 'open'. Then follow the instruction of the installer to complete the installation.
  • https://www.jianshu.com/c/4c982303e19f?order_by=top
    • It seems the goal is to use the CNV data to cluster cells... All cells that clustered together with spiked in controls were labeled "nontumor" whereas the remaining clusters were labels as 'tumor'.
    • 使用inferCNV分析单细胞转录组中拷贝数变异. Or follow the help of plot_cnv(), infercnv::run()... inferCNV会输出一个 "map_metadata_from_infercnv.txt" (I don't see it) 文件用于记录每个细胞的元信息,所有信息都可以从该文件中进行提取。或者使用infercnv::add_to_seurat将信息直接增加到原来的seurat对象中。
    • CNS图表复现13—使用inferCNV来区分肿瘤细胞的恶性与否. 因为10X技术出来的单个细胞的reads数量太少,检测到的基因数量太少,但是inferCNV更新后有一个参数是cutoff ,就很清楚的指出来了:cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics 说明它应用于10X单细胞转录组数据是没有问题的。
    • 是否是免疫细胞很容易区分那是否是肿瘤细胞呢?
  • Question: where to download the gene position file? The infercnv_genes_example object has 10338 genes already

Simpler plot

data(infercnv_data_example) # data.frame 8252 x 20, genes x samples
data(infercnv_annots_example) # data.frame 20 x 1, 10 normal and 10 tumor
data(infercnv_genes_example) # data.frame 10338 x 3
infercnv_genes_example[1:3, ]
#             V2     V3     V4
# WASH7P    chr1  14363  29806
# LINC00115 chr1 761586 762902
# NOC2L     chr1 879584 894689
infercnv_object_example <- infercnv::CreateInfercnvObject(raw_counts_matrix=infercnv_data_example, 
                                                          gene_order_file=infercnv_genes_example,
                                                          annotations_file=infercnv_annots_example,
                                                          ref_group_names=c("normal"))

infercnv_object_example <- infercnv::run(infercnv_object_example,
                                         cutoff=1,
                                         out_dir=tempfile(), 
                                         cluster_by_groups=TRUE, 
                                         denoise=TRUE,
                                         HMM=FALSE,
                                         num_threads=2)
system("ls /var/folders/2q/slryb0rx4tj97t66v7l6pwvr_z6g3s/T//RtmpEuGgVy/file182d24b626381")
# 01_incoming_data.infercnv_obj                     infercnv.21_denoised.observations_dendrogram.txt
# 02_reduced_by_cutoff.infercnv_obj                 infercnv.21_denoised.png
# 03_normalized_by_depth.infercnv_obj               infercnv.21_denoised.references.txt
# 04_logtransformed.infercnv_obj                    infercnv.heatmap_thresholds.txt
# 08_remove_ref_avg_from_obs_logFC.infercnv_obj     infercnv.observation_groupings.txt
# 09_apply_max_centered_expr_threshold.infercnv_obj infercnv.observations.txt
# 10_smoothed_by_chr.infercnv_obj                   infercnv.observations_dendrogram.txt
# 11_recentered_cells_by_chr.infercnv_obj           infercnv.png
# 12_remove_ref_avg_from_obs_adjust.infercnv_obj    infercnv.preliminary.heatmap_thresholds.txt
# 14_invert_log_transform.infercnv_obj              infercnv.preliminary.observation_groupings.txt
# 15_no_subclustering.infercnv_obj                  infercnv.preliminary.observations.txt
# 21_denoise.NF_NA.SD_1.5.NL_FALSE.infercnv_obj     infercnv.preliminary.observations_dendrogram.txt
# expr.infercnv.21_denoised.dat                     infercnv.preliminary.png
# expr.infercnv.dat                                 infercnv.preliminary.references.txt
# expr.infercnv.preliminary.dat                     infercnv.references.txt
# infercnv.21_denoised.heatmap_thresholds.txt       preliminary.infercnv_obj
# infercnv.21_denoised.observation_groupings.txt    run.final.infercnv_obj
# infercnv.21_denoised.observations.txt

# Optional, what's the purpose of this plot function
#           when the last command already generated the heatmap?
plot_cnv(infercnv_object_example,
         out_dir=tempfile(),
         obs_title="Observations (Cells)",
         ref_title="References (Cells)",
         cluster_by_groups=TRUE,
         x.center=1,
         x.range="auto",
         hclust_method='ward.D',
         color_safe_pal=FALSE,
         output_filename="infercnv",
         output_format="png",
         png_res=300,
         dynamic_resize=0)
# ...
list.files("/var/folders/2q/slryb0rx4tj97t66v7l6pwvr_z6g3s/T//RtmpEuGgVy/file182d25df0c4ed")
# [1] "infercnv.heatmap_thresholds.txt"      "infercnv.observation_groupings.txt"  
# [3] "infercnv.observations_dendrogram.txt" "infercnv.observations.txt"           
# [5] "infercnv.png"                         "infercnv.references.txt" 

More complicated plot

infercnv_obj = CreateInfercnvObject(
  raw_counts_matrix=system.file("extdata", "oligodendroglioma_expression_downsampled.counts.matrix.gz", package = "infercnv"),
  annotations_file=system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"),
  delim="\t", gene_order_file=system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv"),
  ref_group_names=c("Microglia/Macrophage","Oligodendrocytes (non-malignant)")) 

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                             out_dir=tempfile(), 
                             cluster_by_groups=TRUE, 
                             denoise=TRUE,
                             HMM=TRUE) # HMM=TRUE will increase lots of computation
# 	STEP 1: incoming data
# 	STEP 02: Removing lowly expressed genes
# ...
# STEP 22: Denoising
# There were 50 or more warnings (use warnings() to see the first 50)

x <- read.delim(system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"), header=F)
dim(x)  #  [1] 184   2;   184 cells only
unique(x[,2])  # 6 groups. Two of them represent references cells and the other 4 are the obs cells.
# [1] "Microglia/Macrophage"             "Oligodendrocytes (non-malignant)"
# [3] "malignant_MGH36"                  "malignant_MGH53"
# [5] "malignant_97"                     "malignant_93"
x[1:3, ]
#             V1                   V2
# 1 MGH54_P2_C12 Microglia/Macrophage
# 2 MGH36_P6_F03 Microglia/Macrophage
# 3 MGH53_P4_H08 Microglia/Macrophage

xx <- read.delim(system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv"), header=F)
dim(xx)  #  [1] 10338     4
xx[1:2, ]
#          V1   V2     V3     V4
# 1    WASH7P chr1  14363  29806
# 2 LINC00115 chr1 761586 762902

Infercnv.png

Cell-cell interactions and communication

Deciphering cell–cell interactions and communication from gene expression Armingol 2021

Chromatin/scATAC-seq data

Signac

Signac is an extension of Seurat for the analysis, interpretation, and exploration of single-cell chromatin datasets.

GPU

NVIDIA Researchers Use AI to Spot Active Areas in Cell DNA

Single Cells and Single Nuclei

Webinars

Brain cells

PDAC

Single-cell analysis support the use of PDAC organoids as a clinically relevant model for in vitro studies of tumor heterogeneity 2021