ScRNA

From 太極
Revision as of 14:57, 1 March 2021 by Brb (talk | contribs) (Created page with "= Resource = * https://github.com/seandavi/awesome-single-cell#tutorials-and-workflows List of software packages for single-cell data analysis collected by Sean Davis * [http:...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Resource

Public data

Seurat*

  • Introduction vignette
    Purpose Code Comment
    Setup the Seurat Object pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
    pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
    pbmc[["RNA"]]@counts
    Note the @ operator
    pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
    dense.size <- object.size(as.matrix(pbmc.data))
    QC pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
    VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
    plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
    plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    plot1 + plot2
    pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
    select cells with number of features in the range of (200,2500)
    Normalizing the data pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
    pbmc[["RNA"]]@data
    Note the @ operator
    feature selection pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
    top10 <- head(VariableFeatures(pbmc), 10)
    plot1 <- VariableFeaturePlot(pbmc)
    plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
    plot1 + plot2
    Scaling the data all.genes <- rownames(pbmc)
    pbmc <- ScaleData(pbmc, features = all.genes)
    pbmc[["RNA"]]@scale.data
    Note the @ operator
    Perform linear dimensional reduction pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
    print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
    VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
    DimPlot(pbmc, reduction = "pca")
    DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
    DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    JackStrawPlot(pbmc, dims = 1:15)
    ElbowPlot(pbmc)
    Cluster the cells pbmc <- FindNeighbors(pbmc, dims = 1:10)
    pbmc <- FindClusters(pbmc, resolution = 0.5)
    head(Idents(pbmc), 5)
    Non-linear transformation: UMAP/tSNE pbmc <- RunUMAP(pbmc, dims = 1:10)
    DimPlot(pbmc, reduction = "umap")
    Finding differentially expressed features cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
    head(cluster1.markers, n = 5)
    cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
    head(cluster5.markers, n = 5)
    pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
    cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
    VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
    VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
    FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
    top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
    DoHeatmap(pbmc, features = top10$gene) + NoLegend()
    Assigning cell type identity to clusters new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
    names(new.cluster.ids) <- levels(pbmc)
    pbmc <- RenameIdents(pbmc, new.cluster.ids)
    DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Seurat object

QC

QC part explanation:

> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc2 <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)
> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> pbmc2
An object of class Seurat 
13714 features across 2695 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> ind <- which(!colnames(pbmc[['RNA']]) %in% colnames(pbmc2[['RNA']]))
> ind
[1]  427 1060 1762 1878 2568
> tmp <- apply(pbmc[['RNA']]@data, 2, function(x) sum(x > 0)) # detected genes per cell
> range(tmp)
[1]  212 3400
> hist(tmp) # kind of normal
> which(tmp > 2500)
AGAGGTCTACAGCT-1 CCAGTCTGCGGAGA-1 GCGAAGGAGAGCTT-1 
             427             1060             1762 
GGCACGTGTGAGAA-1 TTACTCGAACGTTG-1 
            1878             2568 
> tmp2 <- colSums(pbmc[['RNA']]@data) # total counts per cell
> hist(tmp2)  # kind of normal

Normalization

  • NormalizeData(). log1p(value/colSums[cell-idx] *scale_factor). The scale_factor is 10,000 by default and this gives a good range on the normalized values (at least on the pbmc data). The cpp source code of LogNorm() in preprocessing.R.
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 10000))
    [1] 0.000000 5.753142
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 1000))
    [1] 0.000000 3.478712
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 100000))
    [1] 0.000000 8.052868
    

    And below is a comparison of the raw counts and normalized values:

    R> pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                              scale.factor = 10000)
    R> cbind(pbmc[['RNA']]@data[i, 1], pbmc2[['RNA']]@data[i, 1])
                [,1]     [,2]
    MRPL20         1 1.635873
    RPL22          1 1.635873
    TNFRSF25       2 2.226555
    EFHD2          1 1.635873
    NECAP2         1 1.635873
    AKR7A2         1 1.635873
    CAPZB          1 1.635873
    RPL11         41 5.138686
    RP5-886K2.3    1 1.635873
    PITHD1         1 1.635873
    

Integration datasets

Azimuth

Azimuth - App for reference-based single-cell analysis

Bioconductor packages

Description Packages
Orchestrating Single-Cell Analysis with Bioconductor (OSCA)
simpleSingleCell is an earlier version
See the list here
single cell RNA-Seq workflow (included in OSCA now) simpleSingleCell
visualization tools for single-cell and Bulk RNA Sequencing dittoSeq
An Introduction to Single-Cell RNA-sequencing Analysis in Bioconductor (中文, based on OSCA ebook) DropletUtils, scran, scater, SingleR, BiocFileCache
An Opinionated Computational Workflow for Single-Cell RNA-seq and ATAC-seq SingleCellExperiment, DropletUtils, scater, iSEE(shiny), scran, limma, edgeR
SplatPop: Simulating Population-Scale Single-Cell RNA-sequencing Data splatter
Importing alevin scRNA-seq counts into R/Bioconductor (check 'Get started') tximeta, SingleCellExperiment, fishpond, scran, Seurat

How many cells are in the human body?

30-40 trillion (1012) cells

Mitochondrion

Spike-in

  • scran vignette.
    • We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
    • An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
    • If we have spike-ins, we can use them to fit the trend (variance-mean relationship) instead.
  • Spike-in gene concentrations are known, normalization model existing technical variation by utilizing the difference between these known values and the values observed after processing. "Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey". Lytal 2020

CPM and glmpca package

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, glmpca package. All codes are available in github. A talk by Hicks 2021.

scone package: normalization

Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.

Batch correction

Normalization

  • Introduction to single-cell RNA-seq. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
  • Bulk normalization methods
    • DESeq2: median of ratios (MoR) approach
    • edgeR: trimmed mean of M values

Cell type deconvolution

Cell type annotation

monocle package

sincell package

SCell

SCell – integrated analysis of single-cell RNA-seq data

GEVM

Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).

NMFEM

Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Splatter: Simulation Of Single-Cell RNA Sequencing Data

http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1

Scater

Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R

NDRindex

NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

DEsingle

DEsingle – detecting three types of differential expression in single-cell RNA-seq data

totalVI

Joint probabilistic modeling of single-cell multi-omic data with totalVI