ScRNA: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 44: Line 44:
** [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP073767 SRP073767] PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See [https://www.frontiersin.org/articles/10.3389/fgene.2020.00041/full Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey] Lytal 2020
** [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP073767 SRP073767] PBMC data. GSE29087 Mouse Embryonic Data. GSE64016 Human Embryonic Data. See [https://www.frontiersin.org/articles/10.3389/fgene.2020.00041/full Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey] Lytal 2020
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92332 GSE92332] from [https://github.com/theislab/single-cell-tutorial Scripts for "Current best-practices in single-cell RNA-seq: a tutorial"]
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92332 GSE92332] from [https://github.com/theislab/single-cell-tutorial Scripts for "Current best-practices in single-cell RNA-seq: a tutorial"]
** [https://bioconductor.org/packages/3.12/workflows/vignettes/BP4RNAseq/inst/doc/vignette.html BP4RNAseq] package. SRR11402955, SRR11402974.
** [https://bioconductor.org/packages/3.12/workflows/vignettes/BP4RNAseq/inst/doc/vignette.html BP4RNAseq] package. SRR11402955, SRR11402974. Note several dependent tools need to be installed manually.
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75790 GSE75790]. See [https://www.sciencedirect.com/science/article/pii/S1097276517300497 Comparative Analysis of Single-Cell RNA Sequencing Methods] Ziegenhain 2017
** [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75790 GSE75790]. See [https://www.sciencedirect.com/science/article/pii/S1097276517300497 Comparative Analysis of Single-Cell RNA Sequencing Methods] Ziegenhain 2017



Revision as of 15:46, 8 March 2021

Resource

Public data

Seurat*

  • Introduction vignette
    Purpose Code Comment
    Setup the Seurat Object pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
    pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
    pbmc[["RNA"]]@counts
    Note the @ operator
    pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
    dense.size <- object.size(as.matrix(pbmc.data))
    QC pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
    VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
    plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
    plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    plot1 + plot2
    pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
    select cells with number of features in the range of (200,2500)
    Normalizing the data pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
    pbmc[["RNA"]]@data
    Note the @ operator
    feature selection pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
    top10 <- head(VariableFeatures(pbmc), 10)
    plot1 <- VariableFeaturePlot(pbmc)
    plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
    plot1 + plot2
    Scaling the data all.genes <- rownames(pbmc)
    pbmc <- ScaleData(pbmc, features = all.genes)
    pbmc[["RNA"]]@scale.data
    Note the @ operator
    Perform linear dimensional reduction pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
    print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
    VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
    DimPlot(pbmc, reduction = "pca")
    DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
    DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    Determine the ‘dimensionality’ of the dataset pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    JackStrawPlot(pbmc, dims = 1:15)
    ElbowPlot(pbmc)
    Cluster the cells pbmc <- FindNeighbors(pbmc, dims = 1:10)
    pbmc <- FindClusters(pbmc, resolution = 0.5)
    head(Idents(pbmc), 5)
    Non-linear transformation: UMAP/tSNE pbmc <- RunUMAP(pbmc, dims = 1:10)
    DimPlot(pbmc, reduction = "umap")
    Finding differentially expressed features cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
    head(cluster1.markers, n = 5)
    cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
    head(cluster5.markers, n = 5)
    pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
    cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
    VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
    VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
    FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
    top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
    DoHeatmap(pbmc, features = top10$gene) + NoLegend()
    Assigning cell type identity to clusters new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
    names(new.cluster.ids) <- levels(pbmc)
    pbmc <- RenameIdents(pbmc, new.cluster.ids)
    DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Seurat object

QC

QC part explanation:

> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc2 <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)
> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> pbmc2
An object of class Seurat 
13714 features across 2695 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> ind <- which(!colnames(pbmc[['RNA']]) %in% colnames(pbmc2[['RNA']]))
> ind
[1]  427 1060 1762 1878 2568
> tmp <- apply(pbmc[['RNA']]@data, 2, function(x) sum(x > 0)) # detected genes per cell
> range(tmp)
[1]  212 3400
> hist(tmp) # kind of normal
> which(tmp > 2500)
AGAGGTCTACAGCT-1 CCAGTCTGCGGAGA-1 GCGAAGGAGAGCTT-1 
             427             1060             1762 
GGCACGTGTGAGAA-1 TTACTCGAACGTTG-1 
            1878             2568 
> tmp2 <- colSums(pbmc[['RNA']]@data) # total counts per cell
> hist(tmp2)  # kind of normal

Normalization

  • NormalizeData(). log1p(value/colSums[cell-idx] *scale_factor). The scale_factor is 10,000 by default and this gives a good range on the normalized values (at least on the pbmc data). The cpp source code of LogNorm() in preprocessing.R.
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 10000))
    [1] 0.000000 5.753142
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 1000))
    [1] 0.000000 3.478712
    
    R> range(log1p(pbmc[['RNA']]@data[,1]/sum(pbmc[['RNA']]@data[,1]) * 100000))
    [1] 0.000000 8.052868
    

    And below is a comparison of the raw counts and normalized values:

    R> pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                              scale.factor = 10000)
    R> cbind(pbmc[['RNA']]@data[i, 1], pbmc2[['RNA']]@data[i, 1])
                [,1]     [,2]
    MRPL20         1 1.635873
    RPL22          1 1.635873
    TNFRSF25       2 2.226555
    EFHD2          1 1.635873
    NECAP2         1 1.635873
    AKR7A2         1 1.635873
    CAPZB          1 1.635873
    RPL11         41 5.138686
    RP5-886K2.3    1 1.635873
    PITHD1         1 1.635873
    

Integration datasets

Azimuth

Azimuth - App for reference-based single-cell analysis

Bioconductor packages

Description Packages
Orchestrating Single-Cell Analysis with Bioconductor (OSCA)
simpleSingleCell is an earlier version
See the list here
single cell RNA-Seq workflow (included in OSCA now) simpleSingleCell
visualization tools for single-cell and Bulk RNA Sequencing dittoSeq
An Introduction to Single-Cell RNA-sequencing Analysis in Bioconductor (中文, based on OSCA ebook) DropletUtils, scran, scater, SingleR, BiocFileCache
An Opinionated Computational Workflow for Single-Cell RNA-seq and ATAC-seq SingleCellExperiment, DropletUtils, scater, iSEE(shiny), scran, limma, edgeR
SplatPop: Simulating Population-Scale Single-Cell RNA-sequencing Data splatter
Importing alevin scRNA-seq counts into R/Bioconductor (check 'Get started') tximeta, SingleCellExperiment, fishpond, scran, Seurat

GUI

NIDAP

https://nidap.nih.gov/ which uses Code workbook (Code Workbook is an application that allows users to analyze and transform data using an intuitive graphical interface) for the interaction. The R/python analysis code can be accessed just like GEO2R from GEO. The background material is available on here.

Partek

BingleSeq

BingleSeq: a user-friendly R package for bulk and single-cell RNA-Seq data analysis

How many cells are in the human body?

30-40 trillion (1012) cells

Power analysis

  • powsimR: power analysis for bulk and single cell RNA-seq experiments. It is time-consuming and error-prone to install lots of required packages. Better to have a Docker image.
  • SCOPIT: sample size calculations for single-cell sequencing experiments
  • POWSC - Simulation, power evaluation and sample size recommendation for single-cell RNA-seq, Su 2020.
    library(devtools)
    install_github("haowulab/SC2P", build_vignettes=TRUE)
    install_github("suke18/POWSC", build_vignettes = T, dependencies = T)
    
    library(POWSC)
    data("es_mef_sce")
    sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"]
    est_Paras = Est2Phase(sce)
    
    sim_size = c(100, 400, 1000) # A numeric vector
    pow_rslt = runPOWSC(sim_size = sim_size, 
                        est_Paras = est_Paras,
                        per_DE=0.05, 
                        DE_Method = "MAST", 
                        Cell_Type = "PW") 
    # Note, using our previous developed tool SC2P is faster.
    
    packageVersion("POWSC")  # [1] '0.1.0'
    help(package="POWSC")
    plot.POWSC(pow_rslt, Form="II", Cell_Type = "PW") 
    summary.POWSC(pow_rslt, Form="II", Cell_Type = "PW")
    #      (0,10] (10,20] (20,40] (40,80] (80,160] (160,Inf]
    # 100  0.0171  0.1270  0.2877  0.2740   0.4909    0.6765
    # 400  0.3200  0.5373  0.6486  0.7436   0.7959    0.8429
    # 1000 0.5534  0.7424  0.8095  0.9067   0.9815    0.9077
    

Mitochondrion

Spike-in

  • RNA spike-in. An RNA spike-in is an RNA transcript of known sequence and quantity used to calibrate measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.
  • scran vignette.
    • We perform some cursory quality control to remove cells with low total counts or high spike-in percentages.
    • An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
    • If we have spike-ins, we can use them to fit the trend (variance-mean relationship) instead.
  • Spike-in gene concentrations are known, normalization model existing technical variation by utilizing the difference between these known values and the values observed after processing. "Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey". Lytal 2020

CPM and glmpca package

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, glmpca package. All codes are available in github. A talk by Hicks 2021.

scone package: normalization

Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.

Batch correction

  • https://broadinstitute.github.io/2019_scWorkshop/correcting-batch-effects.html
    • Seurat (CCA): RunMultiCCA() seems not available any more. Check out Integration and Label Transfer in Seurat 3 and Introduction to scRNA-seq integration FindIntegrationAnchors() & IntegrateData().
      # Create 2D UMAP DR plot for data before integration
      # It can be seen there is a need to do batch effect correction
      ifnb2 <- ifnb
      ifnb2 <- NormalizeData(ifnb2)
      ifnb2 <- FindVariableFeatures(ifnb2, selection.method = "vst", nfeatures = 2000)
      ifnb2 <- ScaleData(ifnb2, verbose = FALSE)
      ifnb2 <- RunPCA(ifnb2, npcs = 30, verbose = FALSE)
      ifnb2 <- RunUMAP(ifnb2, reduction = "pca", dims = 1:30)
      ifnb2 <- FindNeighbors(ifnb2, reduction = "pca", dims = 1:30)
      ifnb2 <- FindClusters(ifnb2, resolution = 0.5)
      
      # Compared to the 2D plot before and after integration
      DimPlot(ifnb2, reduction = "umap", group.by = "stim") # not good, ideally two groups
                           # should be mixed together.
                           # It seems the blue color is on top of the salmon color.
      DimPlot(immune.combined, reduction = "umap", group.by = "stim")
                           # Two groups are mixed. 
      
      # Compare the cluster plots before and after integration
      DimPlot(ifnb2, reduction = "umap", split.by = "stim") # not good, ideally the cluster
                           # pattern should be similar in both groups 
      DimPlot(immune.combined, reduction = "umap", split.by = "stim") #good, the cluster
                           # patterns are similar in both groups
      

      FindConservedMarkers() and [email protected], ?FindConservedMarkers. ident.1 & ident.2 are used to define cells from the @active.ident component. For example, we can it to find genes that are conserved markers irrespective of stimulation/group condition in a certain cluster. Together with FeaturePlot() we can see the gene expression in 2D space (one plot per feature). The DotPlot() function with the split.by parameter can be useful for viewing conserved cell type markers across conditions

      table([email protected]$stim, [email protected])
      
      table(pbmc_small$groups, [email protected])
      
    • liger LIGER (NMF): liger:: optimizeALS(). Note that in order to use the R package, it is necessary to install hdf5 or libhdf5-dev library. Vignettes are available on github source but not on CRAN package.
    • Harmony: RunHarmony()

Normalization

  • Introduction to single-cell RNA-seq. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.
  • Bulk normalization methods
    • DESeq2: median of ratios (MoR) approach
    • edgeR: trimmed mean of M values

Cell type deconvolution

Model

Cell type annotation

monocle package

sincell package

SCell

SCell – integrated analysis of single-cell RNA-seq data

GEVM

Detection of high variability in gene expression from single-cell RNA-seq profiling. Two mouse scRNA-seq data sets were obtained from Gene Expression Omnibus (GSE65525 and GSE60361).

NMFEM

Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Splatter: Simulation Of Single-Cell RNA Sequencing Data

http://www.biorxiv.org/content/early/2017/07/24/133173?rss=1

Scater

Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R

NDRindex

NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

DEsingle

DEsingle – detecting three types of differential expression in single-cell RNA-seq data

totalVI

Joint probabilistic modeling of single-cell multi-omic data with totalVI