Resource

https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis
Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods by Dal Molin et al 2017.
Normalization and noise reduction for single cell RNA-seq experiments by Bo Ding et al 2015.
A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor by Lun 2016.
Gene length and detection bias in single cell RNA sequencing protocols and the script & data are available online. Belinda Phipson1 et al 2017.
Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference in ISCB (International Society for Computational Biology Community Journal). It has open peer reviews too.
Welcome to the Tidyverse from the Journal of Open Source Software. It has open peer reviews too.
Missing data and technical variability in single-cell RNA-sequencing experiments
A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications 2017
How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives by Alessandra Dal Molin 2018
Current best practices in single‐cell RNA‐seq analysis: a tutorial 2019
DSAVE: Detection of misclassified cells in single-cell RNA-Seq data.
Top Benefits of Using the Technique of Single Cell RNA-Seq
Systematic comparison of high-throughput single-cell RNA-seq methods for immune cell profiling 2021
Single-cell RNA-seq data analysis workshop from Bioinformatics Training at the Harvard Chan Bioinformatics Core
https://www.plob.org/article/21723.html 单细胞RNA测序中文網站
Top Benefits of Using the Technique of Single Cell RNA-Seq.
- Gain Better Understanding cell types
- Analysis of Rare Cell Types
- Reveals Hidden Differences
- Detection of Cancer Stem Cells
- Enables exploring the Complex Systems
- Describes the Regulatory Networks and Developmental Trajectories
Videos
- Analysis of single cell RNA-seq data - Lecture 1 by BioinformaticsTraining
- Single-cell isolation by a modular single-cell pipette for RNA-sequencing by labonachipVideos
- Single Cell RNA Sequencing Data Analysis (youtube)
- Introduction to Bioinformatics and Computational Biology Xiaole Shirley Liu (ebook format)
ebooks
- Analysis of single cell RNA-seq data (ebook, Kiselev et al 2019, University of Cambridge Bioinformatics training unit, SingleCellExperiment & Seurat) and the paper Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data Andrews 2020. A docker image is available.
- ANALYSIS OF SINGLE CELL RNA-SEQ DATA (ebook) Ashenberg et al 2019 from Broad institute. Seurat package was used.
- Analysis of single cell RNA-seq data (ebook) Lyu et al. 2019. Seurate and SingleCellExperiment.

Public data

https://www.biostars.org/p/208973/
Collection of Public Single-Cell RNA-Seq Datasets
Single cell portal https://singlecell.broadinstitute.org/single_cell
https://cells.ucsc.edu/
https://www.humancellatlas.org/
scRNAseq package from Bioconductor
Tung dataset, data.
SRA/GSE
- SRR7611046 and SRR7611048 from Sequence Read Archive (SRA) pages
- SRP073767 PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey Lytal 2020
- GSE92332 from Scripts for "Current best-practices in single-cell RNA-seq: a tutorial"
- BP4RNAseq package. SRR11402955, SRR11402974. Note several dependent tools need to be installed manually.
- GSE75790. See Comparative Analysis of Single-Cell RNA Sequencing Methods Ziegenhain 2017
- GSE137804/SRP222837. 22 samples (GSM or SRX numbers), 42 runs (SRR numbers). The fastq.gz files are also directly available in European Nucleotide Archive where we can see 14 cells have only Read 2 data without Read 1 data.
- GSE29087, GSE94820, GSE67835, GSE65528, GSE70850 from POWSC - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.
- GSE81076, GSE86473, GSE85241, E-MTAB-5061 from the mutual nearest neighbors (MNNs) method Haghverdi 2018

SRAToolkit

https://hpc.nih.gov/apps/sratoolkit.html. source code, Download SRA sequences from Entrez search results
Downloading paired end fastq from SRA, fasterq-dump only returns one file for a paired-end sample
fasterqDump() from the geomedb package.
SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?
SRR2048331.
SRR12148213 (from SRP222837). Note the corresponding sample has 8 runs.

sinteractive  --gres=lscratch:800  --cpus-per-task=6 --mem=40g
  # the required memory can be checked by using the '''jobload''' command
  # the required local disk space can be obtained by 'ssh cnXXX' and run 'du -sh /lscratch/JobID'
  #    where the JobID is obtained through the jobload command

module load sratoolkit

fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 # 1 min, 3.3G fastq file
fastq-dump --split-3 SRR2048331                   # 2 min 30 sec
pigz -p6 SRR12148211*.fastq  # vs gzip. 6 threads are used here 
                             # do not need to split threads,
                             # do not need to run separately

fastq-dump --split-3 --gzip SRR2048331   # 658M SRR2048331.fastq.gz (vs 626.9Mb in sra format)

fastq-dump --split-3 --gzip SRR12148213  # 862M + 2.5G for fastq.gz files (vs 2Gb in sra format)
                                         # 6.5G + 15G for fastq files
                                         # fastq/fastq.gz = 15/2.5 = 6
                                         # fastq.gz/sr = 3.4/2 = 1.7
# This is equivalent to 
# prefetch SRR12148213; cd SRR12148213
# fastq-dump --split-3 --gzip SRR12148213.sra # 45 min
# The 'prefetch' will create a new folder SRR12148213 and download SRR12148213.sra into it.
# Note the file sizes from _1.fastq and _2.fastq are quite different. 
# Another run. Check number of reads.
$ zcat SRR12148214_1.fastq.gz | wc -l
150635300
$ zcat SRR12148214_2.fastq.gz | wc -l
150635300

sbatch --gres=lscratch:800 --mem=40g --cpus-per-task=6 --time=12:00:00 myscript

And the fasterq-dump's wiki and the help page

$ fasterq-dump --help

Usage: fasterq-dump [ options ] [ accessions(s)... ]

Parameters:

  accessions(s)                    list of accessions to process


Options:

  -o|--outfile <path>              full path of outputfile (overrides usage
                                     of current directory and given accession)
  -O|--outdir <path>               path for outputfile (overrides usage of
                                     current directory, but uses given
                                     accession)
  -b|--bufsize <size>              size of file-buffer (dflt=1MB, takes
                                     number or number and unit)
  -c|--curcache <size>             size of cursor-cache (dflt=10MB, takes
                                     number or number and unit)
  -m|--mem <size>                  memory limit for sorting (dflt=100MB,
                                     takes number or number and unit)
  -t|--temp <path>                 path to directory for temp. files
                                     (dflt=current dir.)
  -e|--threads <count>             how many threads to use (dflt=6)
  -p|--progress                    show progress (not possible if stdout used)
  -x|--details                     print details of all options selected
  -s|--split-spot                  split spots into reads
  -S|--split-files                 write reads into different files
  -3|--split-3                     writes single reads into special file
     --concatenate-reads           writes whole spots into one file
  -Z|--stdout                      print output to stdout
  -f|--force                       force overwrite of existing file(s)
  -N|--rowid-as-name               use rowid as name (avoids using the name
                                     column)
     --skip-technical              skip technical reads
     --include-technical           explicitly include technical reads
  -P|--print-read-nr               include read-number in defline
  -M|--min-read-len <count>        filter by sequence-lenght
     --table <name>                which seq-table to use in case of pacbio
     --strict                      terminate on invalid read
  -B|--bases <bases>               filter output by matching against given
                                     bases
  -A|--append                      append to output-file, instead of
                                     overwriting it
     --ngc <path>                  <path> to ngc file
     --perm <path>                 <path> to permission file
     --location <location>         location in cloud
     --cart <path>                 <path> to cart file
     --disable-multithreading      disable multithreading
  -V|--version                     Display the version of the program
  -v|--verbose                     Increase the verbosity of the program
                                     status messages. Use multiple times for
                                     more verbosity.
  -L|--log-level <level>           Logging level as number or enum string.
                                     One of
                                     (fatal|sys|int|err|warn|info|debug) or
                                     (0-6) Current/default is warn
     --option-file file            Read more options and parameters from the
                                     file.
  -h|--help                        print this message

"fasterq-dump" version 2.10.9

Preprocess

Cell ranger

10X Genomics
pbmc3k. The data is saved in filtered_gene_bc_matrices/hg19 directory with 3 files, barcodes.tsv (45K), genes.tsv (798K) and matrix.mtx (27M).

https://hpc.nih.gov/apps/cellranger.html

cellranger mkfastq wraps Illumina's bcl2fastq to correctly demultiplex Chromium-prepared sequencing samples and to convert barcode and read data to FASTQ files.
cellranger count takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-cell matrices and perform clustering and gene expression analysis.
cellranger aggr aggregates results from cellranger count.
cellranger reanalyze takes feature-barcode matrices produced by cellranger count or aggr and re-runs the dimensionality reduction, clustering, and gene expression algorithms.

The example code will generate an output directory with

$ tree s1/outs

├── analysis
│   ├── clustering
│   │   ├── graphclust
│   │   │   └── clusters.csv
│   │   ├── kmeans_10_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_2_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_3_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_4_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_5_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_6_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_7_clusters
│   │   │   └── clusters.csv
│   │   ├── kmeans_8_clusters
│   │   │   └── clusters.csv
│   │   └── kmeans_9_clusters
│   │       └── clusters.csv
│   ├── diffexp
│   │   ├── graphclust
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_10_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_2_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_3_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_4_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_5_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_6_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_7_clusters
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_8_clusters
│   │   │   └── differential_expression.csv
│   │   └── kmeans_9_clusters
│   │       └── differential_expression.csv
│   ├── pca
│   │   └── 10_components
│   │       ├── components.csv
│   │       ├── dispersion.csv
│   │       ├── features_selected.csv
│   │       ├── projection.csv
│   │       └── variance.csv
│   ├── tsne
│   │   └── 2_components
│   │       └── projection.csv
│   └── umap
│       └── 2_components
│           └── projection.csv
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
└── web_summary.html

31 directories, 41 files

Seurat*

http://satijalab.org/seurat/. It has several vignettes.
https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
BingleSeq - A user-friendly R package for Bulk and Single-cell RNA-Seq data analyses.
Question: Is it worth it to explore outside of Seurat?

Introduction vignette

Purpose	Code	Comment
Setup the Seurat Object	pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/") pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc[["RNA"]]@counts	Note the @ operator
	pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
	dense.size <- object.size(as.matrix(pbmc.data))
QC	pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 + plot2 pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)	select cells with number of features in the range of (200,2500)
Normalizing the data	pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) pbmc[["RNA"]]@data	Note the @ operator
feature selection	pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) top10 <- head(VariableFeatures(pbmc), 10) plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) plot1 + plot2
Scaling the data	all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes) pbmc[["RNA"]]@scale.data	Note the @ operator
Perform linear dimensional reduction	pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) VizDimLoadings(pbmc, dims = 1:2, reduction = "pca") DimPlot(pbmc, reduction = "pca") DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
Determine the ‘dimensionality’ of the dataset	pbmc <- JackStraw(pbmc, num.replicate = 100) pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
Determine the ‘dimensionality’ of the dataset	pbmc <- JackStraw(pbmc, num.replicate = 100) pbmc <- ScoreJackStraw(pbmc, dims = 1:20) JackStrawPlot(pbmc, dims = 1:15) ElbowPlot(pbmc)
Cluster the cells	pbmc <- FindNeighbors(pbmc, dims = 1:10) pbmc <- FindClusters(pbmc, resolution = 0.5) head(Idents(pbmc), 5)
Non-linear transformation: UMAP/tSNE	pbmc <- RunUMAP(pbmc, dims = 1:10) DimPlot(pbmc, reduction = "umap")
Finding differentially expressed features	cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25) head(cluster1.markers, n = 5) cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) head(cluster5.markers, n = 5) pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC) cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE) VlnPlot(pbmc, features = c("MS4A1", "CD79A")) VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE) FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")) top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) DoHeatmap(pbmc, features = top10$gene) + NoLegend()
Assigning cell type identity to clusters	new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet") names(new.cluster.ids) <- levels(pbmc) pbmc <- RenameIdents(pbmc, new.cluster.ids) DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Seurat object

CreateSeuratObject()
Seurat DimPlot - Highlight specific groups of cells in different colours. seuratObject$meta.data, integrated@[email protected], slotNames(seuratObject), slot(seuratObject, "meta.data").
Add a new column of meta data. ?AddMetaData

QC

QC part explanation:

> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc2 <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)
> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> pbmc2
An object of class Seurat 
13714 features across 2695 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> ind <- which(!colnames(pbmc[['RNA']]) %in% colnames(pbmc2[['RNA']]))
> ind
[1]  427 1060 1762 1878 2568
> tmp <- apply(pbmc[['RNA']]@data, 2, function(x) sum(x > 0)) # detected genes per cell
> range(tmp)
[1]  212 3400
> hist(tmp) # kind of normal
> which(tmp > 2500)
AGAGGTCTACAGCT-1 CCAGTCTGCGGAGA-1 GCGAAGGAGAGCTT-1 
             427             1060             1762 
GGCACGTGTGAGAA-1 TTACTCGAACGTTG-1 
            1878             2568 
> tmp2 <- colSums(pbmc[['RNA']]@data) # total counts per cell
> hist(tmp2)  # kind of normal

Normalization

NormalizeData(). log1p(value/colSums[cell-idx] *scale_factor). The scale_factor is 10,000 by default and this gives a good range on the normalized values (at least on the pbmc data). The cpp source code of LogNorm() in preprocessing.R.

R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 10000))
[1] 0.000000 5.753142

R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 1000))
[1] 0.000000 3.478712

R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 100000))
[1] 0.000000 8.052868

And below is a comparison of the raw counts and normalized values:

R> pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                          scale.factor = 10000)
R> cbind(pbmc[['RNA']]@data[i, 1], pbmc2[['RNA']]@data[i, 1])
            [,1]     [,2]
MRPL20         1 1.635873
RPL22          1 1.635873
TNFRSF25       2 2.226555
EFHD2          1 1.635873
NECAP2         1 1.635873
AKR7A2         1 1.635873
CAPZB          1 1.635873
RPL11         41 5.138686
RP5-886K2.3    1 1.635873
PITHD1         1 1.635873

Integration datasets

Azimuth

Azimuth - App for reference-based single-cell analysis

Bioconductor packages

Description	Packages
Orchestrating Single-Cell Analysis with Bioconductor (OSCA) simpleSingleCell is an earlier version	See the list here
single cell RNA-Seq workflow (included in OSCA now)	simpleSingleCell
visualization tools for single-cell and Bulk RNA Sequencing	dittoSeq
An Introduction to Single-Cell RNA-sequencing Analysis in Bioconductor (中文, based on OSCA ebook）	DropletUtils, scran, scater, SingleR, BiocFileCache
An Opinionated Computational Workflow for Single-Cell RNA-seq and ATAC-seq	SingleCellExperiment, DropletUtils, scater, iSEE(shiny), scran, limma, edgeR
SplatPop: Simulating Population-Scale Single-Cell RNA-sequencing Data	splatter
Importing alevin scRNA-seq counts into R/Bioconductor (check 'Get started')	tximeta, SingleCellExperiment, fishpond, scran, Seurat

GUI

NIDAP

https://nidap.nih.gov/ which uses Code workbook (Code Workbook is an application that allows users to analyze and transform data using an intuitive graphical interface) for the interaction. The R/python analysis code can be accessed just like GEO2R from GEO. The background material is available on here.

Partek

BingleSeq

BingleSeq: a user-friendly R package for bulk and single-cell RNA-Seq data analysis

How many cells are in the human body?

30-40 trillion (10¹²) cells

Power analysis

powsimR: power analysis for bulk and single cell RNA-seq experiments. It is time-consuming and error-prone to install lots of required packages. Better to have a Docker image.
SCOPIT: sample size calculations for single-cell sequencing experiments

POWSC - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.

SC2P and the paper Two-phase differential expression analysis for single cell RNA-seq Wu 2018.

library(devtools)
install_github("haowulab/SC2P", build_vignettes=TRUE)
install_github("suke18/POWSC", build_vignettes = T, dependencies = T)

library(POWSC)
data("es_mef_sce")
sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"]
est_Paras = Est2Phase(sce)

sim_size = c(100, 400, 1000) # A numeric vector
pow_rslt = runPOWSC(sim_size = sim_size, 
                    est_Paras = est_Paras,
                    per_DE=0.05, 
                    DE_Method = "MAST", 
                    Cell_Type = "PW") 
# Note, using our previous developed tool SC2P is faster.

packageVersion("POWSC")  # [1] '0.1.0'
help(package="POWSC")
plot.POWSC(pow_rslt, Form="II", Cell_Type = "PW") 
summary.POWSC(pow_rslt, Form="II", Cell_Type = "PW")
#      (0,10] (10,20] (20,40] (40,80] (80,160] (160,Inf]
# 100  0.0171  0.1270  0.2877  0.2740   0.4909    0.6765
# 400  0.3200  0.5373  0.6486  0.7436   0.7959    0.8429
# 1000 0.5534  0.7424  0.8095  0.9067   0.9815    0.9077

Mitochondrion

https://en.wikipedia.org/wiki/Mitochondrion
WHAT IS A CELL SIMILAR TOO? In real life cells, the mitochondria is much like a power generator, giving power and energy to a cell, so that it may carry out its other functions.
Question: Mitochondrial Gene percentage threshold in single cell RNA-Seq. For example, 30% of total mRNA in the heart is mitochondrial due to high energy needs of cardiomyocytes, compared with 5% or less in tissues with low energy demands. For instance, 30% mitochondrial mRNA is representative of a healthy heart muscle cell, but would represent a stressed lymphocyte.
Seurat::PercentageFeatureSet()

Spike-in

RNA spike-in. An RNA spike-in is an RNA transcript of known sequence and quantity used to calibrate measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
scran vignette.
- We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
- An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
- If we have spike-ins, we can use them to fit the trend (variance-mean relationship) instead.
Spike-in gene concentrations are known, normalization model existing technical variation by utilizing the difference between these known values and the values observed after processing. "Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey". Lytal 2020

CPM and glmpca package

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, glmpca package. All codes are available in github. A talk by Hicks 2021.

scone package: normalization

Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.

Batch correction

Question: scRNA-seq and the batch effects. The biggest problem I found is, since you have no idea what's the real data suppose to be.
https://www.biostars.org/p/316204/
Over-correction can happen.
- Single-cell RNA-seq Analysis on NIDAP - Tutorial Part 2 40:00 (nih network is required). Groups differ by a single experimental modification (eg a knockoff gene or cells of similar lineage used to determine differentiation pattern). However batch correction is necessary in cases such as spike-in cells where we expect these cells should be clustered together.
Mutual nearest neighbors/MNN. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors Haghverdi 2018...
- One assumption is the batch effect is almost orthogonal to the biological subspace. It also assumes the batch effect variation in much smaller than the biological effect variation between different cell types.
- It is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4... We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a t-SNE plot by per-cell expression.
- Sadly the R code of the paper does not have the file ('MAP2.csv') and the code does not work anymore (e.g. mnnCorrect() does not have $angles component). Check out the mnnpy module which still has an option to return angles.

SMNN can eliminate the overcorrection between different cell types & allow that batch effects are non-orthogonal to the biological differences. github source code. Note that the package is only available on github and it depends on the python mnnpy module.

library(reticulate)
use_python("/home/rstudio/.conda/envs/sc_env/bin/python")
py_config()  # confirm

library("SMNN")
data("data_SMNN")
dim(data_SMNN$batch1.mat)  # [1] 3000  400
data_SMNN$batch1.mat[1:5, 1:5]
dim(data_SMNN$batch2.mat)  # [1] 3000  500
# Maker genes
markers <- c("Col1a1", "Pdgfra", "Ptprc", "Pecam1")
# Corresponding cell type labels for each marker gene
cluster.info <- c(1, 1, 2, 3)
matched_clusters <- unifiedClusterLabelling(data_SMNN$batch1.mat, 
                                            data_SMNN$batch2.mat, 
                                            features.use = markers, 
                                            cluster.labels = cluster.info, 
                                            min.perc = 0.3)

corrected.results <- SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat), 
                                 batch.cluster.labels = matched_clusters, 
                                 k=20, 
                                 sigma=1, 
                                 cos.norm.in=TRUE, 
                                 cos.norm.out=TRUE, 
                                 subset.genes=rownames(data_SMNN$batch2.mat))
# NOTE: subset.genes parameter was added to fix a bug when I'm running the function
# [1] "Data preparation ..."
# Error: C stack usage  555609078676 is too close to the limit

> traceback()
3: py_call_impl(callable, dots$args, dots$keywords)
2: mnnpy$utils$transform_input_data(datas = batches.t, cos_norm_in = cos.norm.in, 
       cos_norm_out = cos.norm.out, var_index = as.character(c(0:ncol(batches.t1))), 
       var_subset = subset.index, n_jobs = n.jobs)
1: SMNNcorrect(batches = list(data_SMNN$batch1.mat, data_SMNN$batch2.mat), 
       batch.cluster.labels = matched_clusters, k = 20, sigma = 1, 
       cos.norm.in = TRUE, cos.norm.out = TRUE, subset.genes = rownames(data_SMNN$batch2.mat)[1:100])

A benchmark of batch-effect correction methods for single-cell RNA sequencing data Tran 2020
A multi-center cross-platform single-cell RNA sequencing reference dataset with the source code on github

https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html

Seurat (CCA): RunMultiCCA() seems not available any more. Check out Integration and Label Transfer in Seurat 3 and Introduction to scRNA-seq integration FindIntegrationAnchors() & IntegrateData().

# Create 2D UMAP DR plot for data before integration
# It can be seen there is a need to do batch effect correction
ifnb2 <- ifnb
ifnb2 <- NormalizeData(ifnb2)
ifnb2 <- FindVariableFeatures(ifnb2, selection.method = "vst", nfeatures = 2000)
ifnb2 <- ScaleData(ifnb2, verbose = FALSE)
ifnb2 <- RunPCA(ifnb2, npcs = 30, verbose = FALSE)
ifnb2 <- RunUMAP(ifnb2, reduction = "pca", dims = 1:30)
ifnb2 <- FindNeighbors(ifnb2, reduction = "pca", dims = 1:30)
ifnb2 <- FindClusters(ifnb2, resolution = 0.5)

# Compared to the 2D plot before and after integration
DimPlot(ifnb2, reduction = "umap", group.by = "stim") # not good, ideally two groups
                     # should be mixed together.
                     # It seems the blue color is on top of the salmon color.
DimPlot(immune.combined, reduction = "umap", group.by = "stim")
                     # Two groups are mixed. 

# Compare the cluster plots before and after integration
DimPlot(ifnb2, reduction = "umap", split.by = "stim") # not good, ideally the cluster
                     # pattern should be similar in both groups 
DimPlot(immune.combined, reduction = "umap", split.by = "stim") #good, the cluster
                     # patterns are similar in both groups

FindConservedMarkers() and [email protected], ?FindConservedMarkers. ident.1 & ident.2 are used to define cells from the @active.ident component. For example, we can it to find genes that are conserved markers irrespective of stimulation/group condition in a certain cluster. Together with FeaturePlot() we can see the gene expression in 2D space (one plot per feature). The DotPlot() function with the split.by parameter can be useful for viewing conserved cell type markers across conditions

table([email protected]$stim, [email protected])

table(pbmc_small$groups, [email protected])

liger LIGER (NMF): liger:: optimizeALS(). Note that in order to use the R package, it is necessary to install hdf5 or libhdf5-dev library. Vignettes are available on github source but not on CRAN package.
Harmony: RunHarmony()

When should we batch correct scRNA-Seq data? When should we avoid it?

Normalization

Introduction to single-cell RNA-seq. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
Bulk normalization methods
- DESeq2: median of ratios (MoR) approach
- edgeR: trimmed mean of M values

Cell type deconvolution

Model

DE analysis

GSEA

twosigma package and the paper TWO-SIGMA-G: A New Competitive Gene Set Testing Framework for scRNA-seq Data Accounting for Inter-Gene and Cell-Cell Correlation Burden et ap.

Cell type annotation

Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data
singleR (Bioconductor). This method assigns labels to cells based on the reference samples (e.g. Immgen reference dataset for mouse) with the highest Spearman rank correlations, using only the marker genes between pairs of labels to focus on the relevant differences between cell types. See Chapter 12: Cell type annotation of OSCA.
Single-cell mapper (scMappR) – using scRNA-seq to infer the cell-type specificities of differentially expressed genes
scSorter: assigning cells to known cell types according to marker genes

scanpy python package

scanpy
- Preprocessing and clustering 3k PBMCs which save the data as the h5ad' (hdf5) format.
- Comparison of Scanpy-based algorithms to remove the batch effect from single-cell RNA-seq data
mnnpy

monocle package

sincell package

SCell

SCell – integrated analysis of single-cell RNA-seq data

GEVM

Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).