ScRNA: Difference between revisions
Line 240: | Line 240: | ||
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 A benchmark of batch-effect correction methods for single-cell RNA sequencing data] Tran 2020 | * [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 A benchmark of batch-effect correction methods for single-cell RNA sequencing data] Tran 2020 | ||
* [https://www.nature.com/articles/s41597-021-00809-x A multi-center cross-platform single-cell RNA sequencing reference dataset] with the source code on [https://github.com/oxwang/SciData_scRNAseq github] | * [https://www.nature.com/articles/s41597-021-00809-x A multi-center cross-platform single-cell RNA sequencing reference dataset] with the source code on [https://github.com/oxwang/SciData_scRNAseq github] | ||
<ul> | |||
<li>https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html | |||
<ul> | |||
<li>'''Seurat (CCA)''': RunMultiCCA() [https://github.com/satijalab/seurat/issues/1145 seems not available any more]. Check out [https://satijalab.org/seurat/archive/v3.0/integration.html Integration and Label Transfer] in Seurat 3 and [https://satijalab.org/seurat/articles/integration_introduction.html Introduction to scRNA-seq integration] FindIntegrationAnchors() & IntegrateData(). | |||
<pre> | |||
# Create 2D UMAP DR plot for data before integration | |||
# It can be seen there is a need to do batch effect correction | |||
ifnb2 <- ifnb | |||
ifnb2 <- NormalizeData(ifnb2) | |||
ifnb2 <- FindVariableFeatures(ifnb2, selection.method = "vst", nfeatures = 2000) | |||
ifnb2 <- ScaleData(ifnb2, verbose = FALSE) | |||
ifnb2 <- RunPCA(ifnb2, npcs = 30, verbose = FALSE) | |||
ifnb2 <- RunUMAP(ifnb2, reduction = "pca", dims = 1:30) | |||
ifnb2 <- FindNeighbors(ifnb2, reduction = "pca", dims = 1:30) | |||
ifnb2 <- FindClusters(ifnb2, resolution = 0.5) | |||
# Compared to the 2D plot before and after integration | |||
DimPlot(ifnb2, reduction = "umap", group.by = "stim") # not good, ideally two groups | |||
# should be mixed together. | |||
# It seems the blue color is on top of the salmon color. | |||
DimPlot(immune.combined, reduction = "umap", group.by = "stim") | |||
# Two groups are mixed. | |||
# Compare the cluster plots before and after integration | |||
DimPlot(ifnb2, reduction = "umap", split.by = "stim") # not good, ideally the cluster | |||
# pattern should be similar in both groups | |||
DimPlot(immune.combined, reduction = "umap", split.by = "stim") #good, the cluster | |||
# patterns are similar in both groups | |||
</pre> | |||
</li> | |||
<li>[https://cran.r-project.org/web/packages/rliger/index.html liger LIGER (NMF)]: liger:: optimizeALS(). Note that in order to use the R package, it is necessary to install '''hdf5''' or '''libhdf5-dev''' library. Vignettes are available on github source but not on CRAN package. | |||
</li> | |||
<li>[https://github.com/immunogenomics/harmony Harmony]: RunHarmony() | |||
</li> | |||
</ul> | |||
</ul> | |||
* [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/when-should-we-batch-correct-scrna-seq-data-when-should-we-avoid-it/ When should we batch correct scRNA-Seq data? When should we avoid it?] | * [https://btep.ccr.cancer.gov/question/single_cell_rna_seq/when-should-we-batch-correct-scrna-seq-data-when-should-we-avoid-it/ When should we batch correct scRNA-Seq data? When should we avoid it?] | ||
Revision as of 15:28, 1 March 2021
Resource
- https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis
- Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods by Dal Molin et al 2017.
- Normalization and noise reduction for single cell RNA-seq experiments by Bo Ding et al 2015.
- A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor by Lun 2016.
- Gene length and detection bias in single cell RNA sequencing protocols and the script & data are available online. Belinda Phipson1 et al 2017.
- Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference in ISCB (International Society for Computational Biology Community Journal). It has open peer reviews too.
- Welcome to the Tidyverse from the Journal of Open Source Software. It has open peer reviews too.
- Missing data and technical variability in single-cell RNA-sequencing experiments
- A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications 2017
- How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives by Alessandra Dal Molin 2018
- Current best practices in single‐cell RNA‐seq analysis: a tutorial 2019
- DSAVE: Detection of misclassified cells in single-cell RNA-Seq data.
- Top Benefits of Using the Technique of Single Cell RNA-Seq
- Systematic comparison of high-throughput single-cell RNA-seq methods for immune cell profiling 2021
- Videos
- Analysis of single cell RNA-seq data - Lecture 1 by BioinformaticsTraining
- Single-cell isolation by a modular single-cell pipette for RNA-sequencing by labonachipVideos
- Single Cell RNA Sequencing Data Analysis (youtube)
- Introduction to Bioinformatics and Computational Biology Xiaole Shirley Liu (ebook format)
- ebooks
- Analysis of single cell RNA-seq data (ebook, Kiselev et al 2019, University of Cambridge Bioinformatics training unit, SingleCellExperiment & Seurat) and the paper Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data Andrews 2020. A docker image is available.
- ANALYSIS OF SINGLE CELL RNA-SEQ DATA (ebook) Ashenberg et al 2019 from Broad institute. Seurat package was used.
- Analysis of single cell RNA-seq data (ebook) Lyu et al. 2019. Seurate and SingleCellExperiment.
Public data
- https://www.biostars.org/p/208973/
- Collection of Public Single-Cell RNA-Seq Datasets
- Single cell portal https://singlecell.broadinstitute.org/single_cell
- https://cells.ucsc.edu/
- https://www.humancellatlas.org/
- scRNAseq package from Bioconductor
- Tung dataset, data.
- SRA/GSE
- SRR7611046 and SRR7611048 from Sequence Read Archive (SRA) pages
- SRP073767 PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey Lytal 2020
- GSE92332 from Scripts for "Current best-practices in single-cell RNA-seq: a tutorial"
- BP4RNAseq package. SRR11402955, SRR11402974.
Seurat*
- http://satijalab.org/seurat/. It has several vignettes.
- https://videocast.nih.gov/summary.asp?Live=21733&bhcp=1
- BingleSeq - A user-friendly R package for Bulk and Single-cell RNA-Seq data analyses.
- Question: Is it worth it to explore outside of Seurat?
- Introduction vignette
Purpose Code Comment Setup the Seurat Object pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["RNA"]]@countsNote the @ operator pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30] dense.size <- object.size(as.matrix(pbmc.data)) QC pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)select cells with number of features in the range of (200,2500) Normalizing the data pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc[["RNA"]]@dataNote the @ operator feature selection pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2Scaling the data all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc[["RNA"]]@scale.dataNote the @ operator Perform linear dimensional reduction pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
ElbowPlot(pbmc)Cluster the cells pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
head(Idents(pbmc), 5)Non-linear transformation: UMAP/tSNE pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")Finding differentially expressed features cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster1.markers, n = 5)
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()Assigning cell type identity to clusters new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
Seurat object
- CreateSeuratObject()
- Seurat DimPlot - Highlight specific groups of cells in different colours. seuratObject$meta.data, integrated@[email protected], slotNames(seuratObject), slot(seuratObject, "meta.data").
- Add a new column of meta data. ?AddMetaData
QC
QC part explanation:
> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) > pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") > pbmc2 <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500) > pbmc An object of class Seurat 13714 features across 2700 samples within 1 assay Active assay: RNA (13714 features, 0 variable features) > pbmc2 An object of class Seurat 13714 features across 2695 samples within 1 assay Active assay: RNA (13714 features, 0 variable features) > ind <- which(!colnames(pbmc[['RNA']]) %in% colnames(pbmc2[['RNA']])) > ind [1] 427 1060 1762 1878 2568 > tmp <- apply(pbmc[['RNA']]@data, 2, function(x) sum(x > 0)) # detected genes per cell > range(tmp) [1] 212 3400 > hist(tmp) # kind of normal > which(tmp > 2500) AGAGGTCTACAGCT-1 CCAGTCTGCGGAGA-1 GCGAAGGAGAGCTT-1 427 1060 1762 GGCACGTGTGAGAA-1 TTACTCGAACGTTG-1 1878 2568 > tmp2 <- colSums(pbmc[['RNA']]@data) # total counts per cell > hist(tmp2) # kind of normal
Normalization
- NormalizeData(). log1p(value/colSums[cell-idx] *scale_factor). The scale_factor is 10,000 by default and this gives a good range on the normalized values (at least on the pbmc data). The cpp source code of LogNorm() in preprocessing.R.
R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 10000)) [1] 0.000000 5.753142 R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 1000)) [1] 0.000000 3.478712 R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 100000)) [1] 0.000000 8.052868
And below is a comparison of the raw counts and normalized values:
R> pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) R> cbind(pbmc[['RNA']]@data[i, 1], pbmc2[['RNA']]@data[i, 1]) [,1] [,2] MRPL20 1 1.635873 RPL22 1 1.635873 TNFRSF25 2 2.226555 EFHD2 1 1.635873 NECAP2 1 1.635873 AKR7A2 1 1.635873 CAPZB 1 1.635873 RPL11 41 5.138686 RP5-886K2.3 1 1.635873 PITHD1 1 1.635873
Integration datasets
- Using Seurat with multimodal data
- Integrating scRNA-seq and scATAC-seq data
- FindIntegrationAnchors()
Azimuth
Azimuth - App for reference-based single-cell analysis
Bioconductor packages
Description | Packages |
---|---|
Orchestrating Single-Cell Analysis with Bioconductor (OSCA) simpleSingleCell is an earlier version |
See the list here |
single cell RNA-Seq workflow (included in OSCA now) | simpleSingleCell |
visualization tools for single-cell and Bulk RNA Sequencing | dittoSeq |
An Introduction to Single-Cell RNA-sequencing Analysis in Bioconductor (中文, based on OSCA ebook) | DropletUtils, scran, scater, SingleR, BiocFileCache |
An Opinionated Computational Workflow for Single-Cell RNA-seq and ATAC-seq | SingleCellExperiment, DropletUtils, scater, iSEE(shiny), scran, limma, edgeR |
SplatPop: Simulating Population-Scale Single-Cell RNA-sequencing Data | splatter |
Importing alevin scRNA-seq counts into R/Bioconductor (check 'Get started') | tximeta, SingleCellExperiment, fishpond, scran, Seurat |
How many cells are in the human body?
Mitochondrion
- https://en.wikipedia.org/wiki/Mitochondrion
- WHAT IS A CELL SIMILAR TOO? In real life cells, the mitochondria is much like a power generator, giving power and energy to a cell, so that it may carry out its other functions.
- Question: Mitochondrial Gene percentage threshold in single cell RNA-Seq. For example, 30% of total mRNA in the heart is mitochondrial due to high energy needs of cardiomyocytes, compared with 5% or less in tissues with low energy demands. For instance, 30% mitochondrial mRNA is representative of a healthy heart muscle cell, but would represent a stressed lymphocyte.
- Seurat::PercentageFeatureSet()
Spike-in
- scran vignette.
- We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
- An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
- If we have spike-ins, we can use them to fit the trend (variance-mean relationship) instead.
- Spike-in gene concentrations are known, normalization model existing technical variation by utilizing the difference between these known values and the values observed after processing. "Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey". Lytal 2020
CPM and glmpca package
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, glmpca package. All codes are available in github. A talk by Hicks 2021.
scone package: normalization
Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.
Batch correction
- A benchmark of batch-effect correction methods for single-cell RNA sequencing data Tran 2020
- A multi-center cross-platform single-cell RNA sequencing reference dataset with the source code on github
- https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html
- Seurat (CCA): RunMultiCCA() seems not available any more. Check out Integration and Label Transfer in Seurat 3 and Introduction to scRNA-seq integration FindIntegrationAnchors() & IntegrateData().
# Create 2D UMAP DR plot for data before integration # It can be seen there is a need to do batch effect correction ifnb2 <- ifnb ifnb2 <- NormalizeData(ifnb2) ifnb2 <- FindVariableFeatures(ifnb2, selection.method = "vst", nfeatures = 2000) ifnb2 <- ScaleData(ifnb2, verbose = FALSE) ifnb2 <- RunPCA(ifnb2, npcs = 30, verbose = FALSE) ifnb2 <- RunUMAP(ifnb2, reduction = "pca", dims = 1:30) ifnb2 <- FindNeighbors(ifnb2, reduction = "pca", dims = 1:30) ifnb2 <- FindClusters(ifnb2, resolution = 0.5) # Compared to the 2D plot before and after integration DimPlot(ifnb2, reduction = "umap", group.by = "stim") # not good, ideally two groups # should be mixed together. # It seems the blue color is on top of the salmon color. DimPlot(immune.combined, reduction = "umap", group.by = "stim") # Two groups are mixed. # Compare the cluster plots before and after integration DimPlot(ifnb2, reduction = "umap", split.by = "stim") # not good, ideally the cluster # pattern should be similar in both groups DimPlot(immune.combined, reduction = "umap", split.by = "stim") #good, the cluster # patterns are similar in both groups
- liger LIGER (NMF): liger:: optimizeALS(). Note that in order to use the R package, it is necessary to install hdf5 or libhdf5-dev library. Vignettes are available on github source but not on CRAN package.
- Harmony: RunHarmony()
- Seurat (CCA): RunMultiCCA() seems not available any more. Check out Integration and Label Transfer in Seurat 3 and Introduction to scRNA-seq integration FindIntegrationAnchors() & IntegrateData().
Normalization
- Introduction to single-cell RNA-seq. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
- Bulk normalization methods
- DESeq2: median of ratios (MoR) approach
- edgeR: trimmed mean of M values
Cell type deconvolution
- List of distinct cell types in the adult human body
- https://twitter.com/stephaniehicks/status/1358776597264797703?s=20
- single-cell signatures from MSigDB
Cell type annotation
- Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data
- Single-cell mapper (scMappR) – using scRNA-seq to infer the cell-type specificities of differentially expressed genes
- scSorter: assigning cells to known cell types according to marker genes
monocle package
sincell package
SCell
SCell – integrated analysis of single-cell RNA-seq data
GEVM
Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).
NMFEM
Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization
Splatter: Simulation Of Single-Cell RNA Sequencing Data
http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1
Scater
Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R
NDRindex
NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data
DEsingle
DEsingle – detecting three types of differential expression in single-cell RNA-seq data
totalVI
Joint probabilistic modeling of single-cell multi-omic data with totalVI