Principal component analysis
* https://en.wikipedia.org/wiki/Principal_component_analysis
* [https://www.youtube.com/watch?v=FgakZw6K1QQ StatQuest: Principal Component Analysis (PCA), Step-by-Step] (video)
** [https://youtu.be/FgakZw6K1QQ?t=743 Singular vector/eigenvector & Loading scores/coefficients for PC1 (12:23)],
** [https://youtu.be/FgakZw6K1QQ?t=761 Eigenvalue (12:41)],
** [https://youtu.be/FgakZw6K1QQ?t=904 Variation and eigenvalues (15:04)]
* [http://little-book-of-r-for-multivariate-analysis.readthedocs.io/en/latest/src/multivariateanalysis.html Using R for Multivariate Analysis]
* [https://towardsdatascience.com/how-exactly-does-pca-work-5c342c3077fe How exactly does PCA work?] PCA (via SVD) finds the best-fit line through the data points: the line that minimizes the distance between the points and their projections onto it.

What is PCA

R source code

> stats:::prcomp.default
function (x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, 
    ...) 
{
    x <- as.matrix(x)
    x <- scale(x, center = center, scale = scale.)
    cen <- attr(x, "scaled:center")
    sc <- attr(x, "scaled:scale")
    if (any(sc == 0)) 
        stop("cannot rescale a constant/zero column to unit variance")
    s <- svd(x, nu = 0)
    s$d <- s$d/sqrt(max(1, nrow(x) - 1))
    if (!is.null(tol)) {
        rank <- sum(s$d > (s$d[1L] * tol))
        if (rank < ncol(x)) {
            s$v <- s$v[, 1L:rank, drop = FALSE]
            s$d <- s$d[1L:rank]
        }
    }
    dimnames(s$v) <- list(colnames(x), paste0("PC", seq_len(ncol(s$v))))
    r <- list(sdev = s$d, rotation = s$v, center = if (is.null(cen)) FALSE else cen, 
        scale = if (is.null(sc)) FALSE else sc)
    if (retx) 
        r$x <- x %*% s$v
    class(r) <- "prcomp"
    r
}
<bytecode: 0x000000003296c7d8>
<environment: namespace:stats>
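
To see what this source computes, here is a minimal re-derivation (a sketch on a small random matrix): the SVD of the centered data reproduces prcomp()'s sdev, rotation and scores.

# Reproduce prcomp() by hand, following the source above
set.seed(1)
X  <- matrix(rnorm(20), nrow = 5)             # toy 5 x 4 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
s  <- svd(Xc, nu = 0)

sdev     <- s$d / sqrt(nrow(X) - 1)  # singular values -> standard deviations
rotation <- s$v                      # right singular vectors -> loadings
scores   <- Xc %*% s$v               # principal component scores

p <- prcomp(X)
all.equal(p$sdev, sdev)                                    # TRUE
all.equal(p$rotation, rotation, check.attributes = FALSE)  # TRUE
all.equal(p$x, scores, check.attributes = FALSE)           # TRUE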

R example

R built-in plot

http://genomicsclass.github.io/book/pages/pca_svd.html

pc <- prcomp(x)                  # x: data matrix with samples in rows (see the linked page)
group <- as.numeric(tab$Tissue)  # tab: sample annotation with a Tissue factor
plot(pc$x[, 1], pc$x[, 2], col = group, main = "PCA", xlab = "PC1", ylab = "PC2")

The meaning of the color numbers can be looked up with palette():

  1. black
  2. red
  3. green3
  4. blue
  5. cyan
  6. magenta
  7. yellow
  8. gray

Theory with an example

Principal Component Analysis in R: prcomp vs princomp

PCA and SVD

Using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since the formation of X X^T can cause loss of precision.

http://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca
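
A quick numerical check of the relationship, assuming observations in rows: the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by n - 1.

set.seed(1)
X <- scale(matrix(rnorm(200), nrow = 20), center = TRUE, scale = FALSE)
n <- nrow(X)

ev <- eigen(crossprod(X) / (n - 1))  # eigendecomposition of the covariance matrix
s  <- svd(X)                         # SVD of the centered data (the prcomp route)

all.equal(ev$values, s$d^2 / (n - 1))  # TRUE: identical eigenvalues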

AIC/BIC in estimating the number of components

Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis

Related to Factor Analysis

In short (see the code sketch after this list),

  1. In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
  2. In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.
  3. Use Principal Components Analysis to reduce the data into a smaller number of components. Use Factor Analysis to understand what constructs underlie the data.
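
A small contrast in code, using factanal() (base R's maximum-likelihood factor analysis) and the built-in USArrests data purely for illustration:

pc <- prcomp(USArrests, scale. = TRUE)  # components are combinations of the variables
fa <- factanal(USArrests, factors = 1)  # variables are modeled via a latent factor

pc$rotation[, 1]  # PC1 loadings: chosen to explain total variance
fa$loadings       # factor loadings: chosen to explain the correlations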

Calculated by Hand

http://strata.uga.edu/software/pdf/pcaTutorial.pdf
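
In the same spirit as the tutorial, the hand calculation can be spelled out in R (a sketch; USArrests stands in for the tutorial's data):

X  <- as.matrix(USArrests)
Xc <- scale(X, center = TRUE, scale = FALSE)  # 1. center each variable
C  <- cov(Xc)                                 # 2. covariance matrix
e  <- eigen(C)                                # 3. eigenvalues/eigenvectors
scores <- Xc %*% e$vectors                    # 4. project onto the eigenvectors

sqrt(e$values)  # matches prcomp(USArrests)$sdev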

Do not scale your matrix

https://privefl.github.io/blog/(Linear-Algebra)-Do-not-scale-your-matrix/

Visualization

Interactive Principal Component Analysis

Interactive Principal Component Analysis in R

What does it do if we choose center=FALSE in prcomp()?

In the USArrests data, using center=FALSE gives a better-looking scatter plot of the first two principal components. Without centering, the SVD is applied to the raw matrix, so the first component largely tracks the overall mean of the data rather than the direction of greatest variance.

x1 <- prcomp(USArrests)
x2 <- prcomp(USArrests, center = FALSE)
plot(x1$x[, 1], x1$x[, 2])  # looks random
windows()                   # new graphics device on Windows; use dev.new() elsewhere
plot(x2$x[, 1], x2$x[, 2])  # looks good in some sense

"scale. = TRUE"

By default, prcomp() centers each variable to have mean zero. With scale. = TRUE, it also scales the variables to have standard deviation one.
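
Equivalently, scale. = TRUE amounts to running PCA on the standardized data:

p1 <- prcomp(USArrests, scale. = TRUE)
p2 <- prcomp(scale(USArrests))  # standardize first, then PCA
all.equal(p1$sdev, p2$sdev)     # TRUE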

Removing Zero Variance Columns

Efficiently Removing Zero Variance Columns (An Introduction to Benchmarking)

# Data-frame/list version: sapply() iterates over the columns of a data
# frame or the elements of a list (a matrix will not work this way)
removeZeroVarDF <- function(df){
  df[, !sapply(df, function(x) min(x) == max(x))]
}
# Matrix version (renamed from removeZeroVar3 so the two can coexist)
removeZeroVarMat <- function(m){
  m[, !apply(m, 2, function(x) min(x) == max(x))]
}

# benchmark (tpmlog: a log-scale expression matrix, genes in rows)
dim(t(tpmlog))  # [1]    58 28109
system.time({a <- t(tpmlog); a <- a[, apply(a, 2, sd) != 0]}) # 0.54
system.time({a <- t(tpmlog); a <- removeZeroVarMat(a)})       # 0.18

prcomp vs princomp

prcomp vs princomp from sthda. prcomp() is preferred over princomp(): it is based on the SVD of the (centered) data matrix, which is numerically more accurate than the eigendecomposition of the covariance matrix that princomp() uses.
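
The two also differ in the variance divisor: princomp() divides by n, prcomp() by n - 1. A quick check:

n  <- nrow(USArrests)
p1 <- prcomp(USArrests)
p2 <- princomp(USArrests)
all.equal(p1$sdev, unname(p2$sdev) * sqrt(n / (n - 1)))  # TRUE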

Relation to Multidimensional scaling/MDS

With no missing data, classical MDS (Euclidean distance metric) is the same as PCA.

Comparisons are here.

Differences are asked and answered on stackexchange.com; the post also answers when the two methods coincide.

isoMDS() from the MASS package (non-metric MDS)

cmdscale() in base R (metric/classical MDS)
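
A check of the equivalence claimed above (assuming no missing data; the coordinates agree up to per-column sign flips):

X   <- scale(USArrests, center = TRUE, scale = FALSE)
pca <- prcomp(X)$x[, 1:2]
mds <- cmdscale(dist(X), k = 2)  # classical MDS on Euclidean distances
all.equal(abs(pca), abs(mds), check.attributes = FALSE)  # TRUE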

Matrix factorization methods

http://joelcadwell.blogspot.com/2015/08/matrix-factorization-comes-in-many.html Review of principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA).

Number of components

Obtaining the number of components from cross validation of principal components regression

Outlier samples

Detecting outlier samples in PCA

Reconstructing images

Reconstructing images using PCA
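
The linked post applies this to images; the same idea in a generic sketch, rebuilding a data matrix from its first k principal components (USArrests is used here only for illustration):

p    <- prcomp(USArrests)
k    <- 2
Xhat <- p$x[, 1:k] %*% t(p$rotation[, 1:k])  # back-project the first k scores
Xhat <- sweep(Xhat, 2, p$center, "+")        # undo the centering
mean((as.matrix(USArrests) - Xhat)^2)        # reconstruction error shrinks as k grows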

Misc

Data reduction prior to inference: Are there consequences of comparing groups using a t‐test based on principal component scores? Bedrick 2019

glmpca: Generalized PCA for dimension reduction of sparse counts

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model