PCA: Difference between revisions
Tags: mobile edit mobile web edit |
|||
Line 1: | Line 1: | ||
Principal component analysis | Principal component analysis | ||
= What is PCA = | |||
[http://little-book-of-r-for-multivariate-analysis.readthedocs.io/en/latest/src/multivariateanalysis.html Using R for Multivariate Analysis] | |||
= R source code = | = R source code = | ||
<pre> | <pre> |
Revision as of 21:06, 8 December 2020
Principal component analysis
What is PCA
Using R for Multivariate Analysis
R source code
> stats:::prcomp.default function (x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...) { x <- as.matrix(x) x <- scale(x, center = center, scale = scale.) cen <- attr(x, "scaled:center") sc <- attr(x, "scaled:scale") if (any(sc == 0)) stop("cannot rescale a constant/zero column to unit variance") s <- svd(x, nu = 0) s$d <- s$d/sqrt(max(1, nrow(x) - 1)) if (!is.null(tol)) { rank <- sum(s$d > (s$d[1L] * tol)) if (rank < ncol(x)) { s$v <- s$v[, 1L:rank, drop = FALSE] s$d <- s$d[1L:rank] } } dimnames(s$v) <- list(colnames(x), paste0("PC", seq_len(ncol(s$v)))) r <- list(sdev = s$d, rotation = s$v, center = if (is.null(cen)) FALSE else cen, scale = if (is.null(sc)) FALSE else sc) if (retx) r$x <- x %*% s$v class(r) <- "prcomp" r } <bytecode: 0x000000003296c7d8> <environment: namespace:stats>
R example
R built-in plot
http://genomicsclass.github.io/book/pages/pca_svd.html
pc <- prcomp(x) group <- as.numeric(tab$Tissue) plot(pc$x[, 1], pc$x[, 2], col = group, main = "PCA", xlab = "PC1", ylab = "PC2")
The meaning of colors can be found by palette().
- black
- red
- green3
- blue
- cyan
- magenta
- yellow
- gray
Theory with an example
Principal Component Analysis in R: prcomp vs princomp
PCA and SVD
Using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since the formation of [math]\displaystyle{ X X^T }[/math] can cause loss of precision.
http://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca
AIC/BIC in estimating the number of components
Related to Factor Analysis
- http://www.aaronschlegel.com/factor-analysis-introduction-principal-component-method-r/.
- http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/differences-between-pca-and-factor-analysis/
In short,
- In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
- In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.
- Use Principal Components Analysis to reduce the data into a smaller number of components. Use Factor Analysis to understand what constructs underlie the data.
Calculated by Hand
http://strata.uga.edu/software/pdf/pcaTutorial.pdf
Do not scale your matrix
https://privefl.github.io/blog/(Linear-Algebra)-Do-not-scale-your-matrix/
Visualization
- PCA and Visualization
- Scree plots from the FactoMineR package (based on ggplot2)
What does it do if we choose center=FALSE in prcomp()?
In USArrests data, use center=FALSE gives a better scatter plot of the first 2 PCA components.
x1 = prcomp(USArrests) x2 = prcomp(USArrests, center=F) plot(x1$x[,1], x1$x[,2]) # looks random windows(); plot(x2$x[,1], x2$x[,2]) # looks good in some sense
"scale. = TRUE"
- Practical Guide to Principal Component Analysis (PCA) in R & Python
- https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp
By default, it centers the variable to have mean equals to zero. With parameter scale. = T, we normalize the variables to have standard deviation equals to 1.
Removing Zero Variance Columns
Efficiently Removing Zero Variance Columns (An Introduction to Benchmarking)
# Assume df is a data frame or a list (matrix is not enough!) removeZeroVar3 <- function(df){ df[, !sapply(df, function(x) min(x) == max(x))] } # Assume df is a matrix removeZeroVar3 <- function(df){ df[, !apply(df, 2, function(x) min(x) == max(x))] } # benchmark dim(t(tpmlog)) # [1] 58 28109 system.time({a <- t(tpmlog); a <- a[, apply(a, 2, sd) !=0]}) # 0.54 system.time({a <- t(tpmlog); a <- removeZeroVar3(a)}) # 0.18
prcomp vs princomp
prcomp vs princomp from sthda. prcomp() is preferred compared to princomp().
Relation to Multidimensional scaling/MDS
With no missing data, classical MDS (Euclidean distance metric) is the same as PCA.
Comparisons are here.
Differences are asked/answered on stackexchange.com. The post also answered the question when these two are the same.
isoMDS (Non-metric)
cmdscale (Metric)
Matrix factorization methods
http://joelcadwell.blogspot.com/2015/08/matrix-factorization-comes-in-many.html Review of principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA).
Number of components
Obtaining the number of components from cross validation of principal components regression
Outlier samples
Detecting outlier samples in PCA
Reconstructing images
Reconstructing images using PCA
Misc
Generalized PCA for dimension reduction of sparse counts
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model