# Clustering

## R

• Overview of clustering methods in R
• Partitional Clustering in R: The Essential.
• K-means,
• K-medoids clustering or PAM (Partitioning Around Medoids),
• CLARA (Clustering Large Applications), which is an extension to PAM adapted for large data sets. According to Wikibooks: since CLARA adopts a sampling approach, the quality of its clustering results depends greatly on the size of the sample. When the sample size is small, CLARA’s efficiency in clustering large data sets comes at the cost of clustering quality.

## k-means clustering

### Figures of merit (FOM) plot

maanova::fom(), Validating clustering for gene expression data by Yeung 2001.

## Hierarchical clustering

For the kth cluster, define the Error Sum of Squares as $\displaystyle{ ESS_m = }$ sum of squared deviations (squared Euclidean distance) from the cluster centroid. $\displaystyle{ ESS_m = \sum_{l=1}^{n_m}\sum_{k=1}^p (x_{ml,k} - \bar{x}_{m,k})^2 }$ in which $\displaystyle{ \bar{x}_{m,k} = (1/n_m) \sum_{l=1}^{n_m} x_{ml,k} }$ the mean of the mth cluster for the k-th variable, $\displaystyle{ x_{ml,k} }$ being the score on the kth variable $\displaystyle{ (k=1,\dots,p) }$ for the l-th object $\displaystyle{ (l=1,\dots,n_m) }$ in the mth cluster $\displaystyle{ (m=1,\dots,g) }$.

If there are C clusters, define the Total Error Sum of Squares as Sum of Squares as $\displaystyle{ ESS = \sum_m ESS_m, m=1,\dots,C }$

Consider the union of every possible pair of clusters.

Combine the 2 clusters whose combination combination results in the smallest increase in ESS.

1. The default linkage is "complete" in R.
2. Ward's method tends to join clusters with a small number of observations, and it is strongly biased toward producing clusters with the same shape and with roughly the same number of observations.
3. It is also very sensitive to outliers. See Milligan (1980).

Take pomeroy data (7129 x 90) for an example:

library(gplots)

lr = as.matrix(lr)
method = "average" # method <- "complete"; method <- "ward.D2"; method <- "single"
hclust1 <- function(x) hclust(x, method= method)
heatmap.2(lr, col=bluered(75), hclustfun = hclust1, distfun = dist,
density.info="density", scale = "none",
key=FALSE, symkey=FALSE, trace="none",
main = method)


It seems average method will create a waterfall like dendrogram. Ward method will produce a tight clusters. Complete linkage produces a more 中庸 result.

### K-Means vs hierarchical clustering

K-Means vs hierarchical clustering. Hierarchical clustering is usually preferable, as it is both more flexible and has fewer hidden assumptions about the distribution of the underlying data.

### Correlation distance

# Pairwise correlation between samples (columns)
cols.cor <- cor(mydata, use = "pairwise.complete.obs", method = "pearson")
# Pairwise correlation between rows (genes)
rows.cor <- cor(t(mydata), use = "pairwise.complete.obs", method = "pearson")

## Row- and column-wise clustering using correlation
hclust.col <- hclust(as.dist(1-cols.cor))
hclust.row <- hclust(as.dist(1-rows.cor))

# Plot the heatmap
library("gplots")
heatmap.2(mydata, scale = "row", col = bluered(100),
trace = "none", density.info = "none",
Colv = as.dendrogram(hclust.col),
Rowv = as.dendrogram(hclust.row)
)


### Get the ordering

set.seed(123)
dat <- matrix(rnorm(20), ncol=2)

# perform hierarchical clustering
hc <- hclust(dist(dat))

# plot dendrogram
plot(hc)

# get ordering of leaves
ord <- order.dendrogram(as.dendrogram(hc))
# OR an easy way
attr(n, "nodePar") <- c(a$nodePar, lab.col=labCol) } n } clusDendro <- dendrapply(as.dendrogram(hc), colLab) # I change cex because there are lots of samples op <- par(mar=c(5,3,1,.5)+.1) plot(clusDendro,...) par(op) } genedata <- matrix(rnorm(100*20), nc=20) colnames(genedata) <- paste0("S", 1:20) pheno <- rep(c(1,2), each =10) plotHclustColors(genedata, pheno, cex=.8)  ### Dendrogram with covariates ## Density based clustering ## Biclustering ## Optimal number of clusters ### Silhouette score/width ### Scree/elbow plot • Cf scree plot for PCA analysis • K-Means Clustering in R: Step-by-Step Example. ?factoextra::fviz_nbclust (good integration with different clustering methods and evaluation statistic) • datacamp # Use map_dbl to run many models with varying value of k (centers) tot_withinss <- map_dbl(1:10, function(k){ model <- kmeans(x = lineup, centers = k) model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)


### kBET: k-nearest neighbour batch effect test

• Buttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
• https://github.com/theislab/kBET
• quantify mixability; how well cells of the same type from different batches were grouped together

### Alignment score

• Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
• quantify mixability; how well cells of the same type from different batches were grouped together

### dynamicTreeCut package

dynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. cutreeDynamicTree(). Found in here.

## Power

Statistical power for cluster analysis 2022. It includes several take-home message.

# Evaluate the effect of centering & scaling

## 1-correlation distance

Effect of centering and scaling on clustering of genes and samples in terms of distance. 'Yes' means the distance was changed compared to the baseline where no centering or scaling was applied.

clustering genes clustering samples
centering on each genes No Yes*
scaling on each genes No Yes*
• Note: Cor(X, Y) = Cor(X + constant scalar, Y). If the constant is not a scalar, the equation won't hold. Or think about plotting data in a 2 dimension space. If the X data has a constant shift for all observations/genes, then the linear correlation won't be changed.

## Euclidean distance

clustering genes clustering samples
centering on each genes Yes No1
scaling on each genes Yes Yes2

Note

1. $\displaystyle{ \sum(X_i - Y_i)^2 = \sum(X_i-c_i - (Y_i-c_i))^2 }$
2. $\displaystyle{ \sum(X_i - Y_i)^2 \neq \sum(X_i/c_i - Y_i/c_i)^2 }$

## Supporting R code

1. 1-Corr distance

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma"); biocLite("ALL")

library(limma); library(ALL)
data(ALL)

eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")] f <- factor(as.character(eset$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset,design))
selected  <- p.adjust(fit$p.value[, 2]) < 0.05 esetSel <- eset [selected, ] # 165 x 47 heatmap(exprs(esetSel)) esetSel2 <- esetSel[sample(1:nrow(esetSel), 20), sample(1:ncol(esetSel), 10)] # 20 x 10 dist.no <- 1-cor(t(as.matrix(esetSel2))) dist.mean <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2))))) # gene variance has not changed! dist.median <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median)))) range(dist.no - dist.mean) # [1] -1.110223e-16 0.000000e+00 range(dist.no - dist.median) # [1] -1.110223e-16 0.000000e+00 range(dist.mean - dist.median) # [1] 0 0  So centering (no matter which measure: mean, median, ...) genes won't affect 1-corr distance of genes. dist.mean <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/"))) dist.median <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/"))) range(dist.no - dist.mean) # [1] -8.881784e-16 6.661338e-16 range(dist.no - dist.median) # [1] -6.661338e-16 6.661338e-16 range(dist.mean - dist.median) # [1] -1.110223e-15 1.554312e-15  So scaling after centering (no matter what measures: mean, median,...) won't affect 1-corr distance of genes. How about centering / scaling genes on array clustering? dist.no <- 1-cor(as.matrix(esetSel2)) dist.mean <- 1-cor(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)))) # array variance has changed! dist.median <- 1-cor(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median))) range(dist.no - dist.mean) # [1] -1.547086 0.000000 range(dist.no - dist.median) # [1] -1.483427 0.000000 range(dist.mean - dist.median) # [1] -0.5283601 0.6164602 dist.mean <- 1-cor(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/")) dist.median <- 1-cor(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/")) range(dist.no - dist.mean) # [1] -1.477407 0.000000 range(dist.no - dist.median) # [1] -1.349419 0.000000 range(dist.mean - dist.median) # [1] -0.5419534 0.6269875  2. Euclidean distance dist.no <- dist(as.matrix(esetSel2)) dist.mean <- dist(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)))) dist.median <- dist(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median))) range(dist.no - dist.mean) # [1] 7.198864e-05 2.193487e+01 range(dist.no - dist.median) # [1] -0.3715231 21.9320846 range(dist.mean - dist.median) # [1] -0.923717629 -0.000088385  Centering does affect the Euclidean distance. dist.mean <- dist(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/")) dist.median <- dist(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/")) range(dist.no - dist.mean) # [1] 0.7005071 24.0698991 range(dist.no - dist.median) # [1] 0.636749 24.068920 range(dist.mean - dist.median) # [1] -0.22122869 0.02906131  And scaling affects Euclidean distance too. How about centering / scaling genes on array clustering? dist.no <- dist(t(as.matrix(esetSel2))) dist.mean <- dist(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2))))) dist.median <- dist(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median)))) range(dist.no - dist.mean) # 0 0 range(dist.no - dist.median) # 0 0 range(dist.mean - dist.median) # 0 0 dist.mean <- dist(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/"))) dist.median <- dist(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/"))) range(dist.no - dist.mean) # [1] 1.698960 9.383789 range(dist.no - dist.median) # [1] 1.683028 9.311603 range(dist.mean - dist.median) # [1] -0.09139173 0.02546394  ## Euclidean distance and Pearson correlation relationship $\displaystyle{ r(X, Y) = 1 - \frac{d^2(X, Y)}{2n} }$ where $\displaystyle{ r(X, Y) }$ is the Pearson correlation of variables X and Y and $\displaystyle{ d^2(X, Y) }$ is the squared Euclidean distance of X and Y. # Simple image plot using image() function image(t(x)) is similar to stats::heatmap(x, Rowv = NA, Colv = NA, scale = "none") except heatmap() can show column/row names while image() won't. The default colors are the same too though not pretty. ### Create Matrix plot using colors to fill grid # Create matrix. Using random values for this example. rand <- rnorm(286, 0.8, 0.3) mat <- matrix(sort(rand), nr=26) dim(mat) # Check dimensions # Create x and y labels yLabels <- seq(1, 26, 1) xLabels <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"); # Set min and max values of rand min <- min(rand, na.rm=T) max <- max(rand, na.rm=T) # Red and green range from 0 to 1 while Blue ranges from 1 to 0 ColorRamp <- rgb(seq(0.95,0.99,length=50), # Red seq(0.95,0.05,length=50), # Green seq(0.95,0.05,length=50)) # Blue ColorLevels <- seq(min, max, length=length(ColorRamp)) # Set layout. We are going to include a colorbar next to plot. layout(matrix(data=c(1,2), nrow=1, ncol=2), widths=c(4,1), heights=c(1,1)) #plotting margins. These seem to work well for me. par(mar = c(5,5,2.5,1), font = 2) # Plot it up! image(1:ncol(mat), 1:nrow(mat), t(mat), col=ColorRamp, xlab="Variable", ylab="Time", axes=FALSE, zlim=c(min,max), main= NA) # Now annotate the plot box() axis(side = 1, at=seq(1,length(xLabels),1), labels=xLabels, cex.axis=1.0) axis(side = 2, at=seq(1,length(yLabels),1), labels=yLabels, las= 1, cex.axis=1) # Add colorbar to second plot region par(mar = c(3,2.5,2.5,2)) image(1, ColorLevels, matrix(data=ColorLevels, ncol=length(ColorLevels),nrow=1), col=ColorRamp,xlab="",ylab="",xaxt="n", las = 1)  If we define ColorRamp variable using the following way, we will get the 2nd plot. require(RColorBrewer) # get brewer.pal() ColorRamp <- colorRampPalette( rev(brewer.pal(9, "RdBu")) )(25)  Note that • colorRampPalette() is an R's built-in function. It interpolate a set of given colors to create new color palettes. The return is a function that takes an integer argument (the required number of colors) and returns a character vector of colors (see rgb()) interpolating the given sequence (similar to heat.colors() or terrain.colors()). # An example of showing 50 shades of grey in R greys <- grep("^grey", colours(), value = TRUE) length(greys) # [1] 102 shadesOfGrey <- colorRampPalette(c("grey0", "grey100")) shadesOfGrey(2) # [1] "#000000" "#FFFFFF" # And 50 shades of grey? fiftyGreys <- shadesOfGrey(50) mat <- matrix(rep(1:50, each = 50)) image(mat, axes = FALSE, col = fiftyGreys) box()  • (For dual channel data) brewer.pal(9, "RdBu") creates a diverging palette based on "RdBu" with 9 colors. See help(brewer.pal, package="RColorBrewer") for a list of palette name. The meaning of the palette name can be found on colorbrew2.org website. In genomics, we will add rev() such as rev(brewer.pal(9, "RdBu")). • (For single channel data) brewer.pal(9, "Blues") is good. See an example. # stats::heatmap() • ?heatmap. It includes parameters for settings • margins (margins ) • font size (cexRow, cexCol), • row/column orders (Rowv, Colv) • scale = c("row", "column", "none"). • Source code of heatmap() • Hierarchical Clustering in R: The Essentials. Note stats::heatmap() can add color side bars too. • If we run the heatmap() function line-by-line, we see the side bars were drawn by using par(mar) & image(, axes = FALSE). • Default par()$mar is (5,4,4,1)+.5
• layout(lmat, widths = lwid, heights = lhei, respect = TRUE)
> lmat
[,1] [,2] [,3]
[1,]    0    0    5
[2,]    0    0    2
[3,]    4    1    3
# 1 = RowSideColors
# 2 = ColSideColors
# 3 = heatmap
# 4 = Row dendrogram
# 5 = Column dendrogram
> lwid # lhei is the same
[1] 1.0 0.2 4.0

• When it is drawing RowSideColors, par()$mar is changed to (5, 0, 0, .5) • When it is drawing ColSideColors, par()$mar is changed to (.5, 0, 0, 5)
• When it is drawing the heatmap, par()$mar is changed to (5, 0, 0, 5) • image() was called 3 times if RowSideColors and ColSideColors are TRUE. • Bottom & right texts on x-axis & y-axis are drawn by axis() • When it is drawing the row dendrogram, par()$mar is changed to (5, 0, 0, 0)
• When it is drawing the column dendrogram, par()$mar is changed to (0, 0, 0, 5) ## Rowv, Colv: reorder of rows and columns • ?heatmap, ?heatmap.2. • Rowv/Colv. Either a dendrogram or a vector of values used to reorder the row dendrogram or NA to suppress any row dendrogram (and reordering) or by default, NULL. • If either is a vector (of ‘weights’) then the appropriate dendrogram is reordered according to the supplied values subject to the constraints imposed by the dendrogram, by reorder(dd, Rowv), in the row case. If either is missing, as by default, then the ordering of the corresponding dendrogram is by the mean value of the rows/columns, i.e., in the case of rows, Rowv <- rowMeans(x, na.rm = na.rm). • ?hclust The algorithm used in hclust is to order the subtree so that the tighter cluster is on the left (the last, i.e., most recent, merge of the left subtree is at a lower value than the last merge of the right subtree). Single observations are the tightest clusters possible, and merges involving two observations place them in order by their observation sequence number. (Not clear about the ordering of two single observations?) • ?reorder.dendrogram. At each node, the branches are ordered in increasing weights where the weight of a branch is defined as f(wj) where f is agglo.FUN and wj is the weight of the j-th sub branch. reorder(x, wts, agglo.FUN = sum, …)  • Order Rows & Columns of Heatmap in R (2 Examples), How does R heatmap order rows by default? set.seed(3255434) # Set seed for reproducibility my_mat <- matrix(rnorm(25, 0, 10), nrow = 5) # Create example matrix colnames(my_mat) <- paste0("col", 1:5) # Specify column names rownames(my_mat) <- paste0("row", 1:5) # Specify row names my_mat apply(my_mat, 1, mean) |> round(2) # row1 row2 row3 row4 row5 # 1.24 0.37 5.77 -3.70 -2.74 apply(my_mat, 2, mean) |> round(2) # col1 col2 col3 col4 col5 # -2.64 2.98 -1.21 5.64 -3.83 heatmap(my_mat) # col order is col1 col3 col2 col5 col4 # +-----------+ # | | # +------+ | # | | +----+ # | +---+ | | # | | | | | # 1 3 2 5 4 # -2.6 -1.2 2.9 -3.8 5.6 # heatmap() has applied reorder() by default internally # To obtain the same ordering of hclust(): hclust_rows <- as.dendrogram(hclust(dist(my_mat))) # Calculate hclust dendrograms hclust_cols <- as.dendrogram(hclust(dist(t(my_mat)))) heatmap(my_mat, # Draw heatmap with hclust dendrograms Rowv = hclust_rows, Colv = hclust_cols)$colInd
# 4 5 1 2 3

plot(hclust(dist(t(my_mat))))
# col order is col4 col5 col1 col2 col3
#     +---------+
#     |         |
#     |      +-----+
#  +----+    |     |
#  |    |    |   +---+
#  |    |    |   |   |
#  4    5    1   2   3
#  5.6 -3.8 -2.6 2.9 -1.2
# order by the tightness
#
# To obtain the same dendrogram of heatmap():
Colv <- colMeans(my_mat, na.rm = T)
plot(reorder(hclust_cols, Colv))

• Heatmap in R: Static and Interactive Visualization

## scale parameter

The scale parameter in heatmap() or heatmap.2() only affects the coloring. It does not affect the clustering. In stats::heatmap(, scale="row") by default, but in gplots::heatmap.2(, scale = "none") by default.

When we check the heatmap.2() source code, we see it runs hclust() before calling sweep() if scale = "row". The scaled x was then used to display the carpet by using the image() function.

It looks like many people misunderstand the meaning; see this post Row scaling from ComplexHeatmap. The scale parameter in tidyHeatmap also did the scaling before clustering. However, we can still do that by following this post Can we scale data and trim data for better presentation by specifying our own clustering results in cluster_rows and cluster_columns parameters.

library(gplots)
nr <- 5; nc <- 20
set.seed(1)
x <- matrix(rnorm(nr*nc), nr=nr)
x[1,] <- x[1,]-min(x[1,]) # in order to see the effect of 'scale'

# the following 2 lines prove the scale parameter does not affect clustering
o1 <- heatmap.2(x, scale = "row", main = 'row', trace ='none', col=bluered(75)) # colors are balanced per row, but not column
o2 <- heatmap.2(x, scale = "none", main = 'none', trace ='none', col=bluered(75)) # colors are imbalanced
identical(o1$colInd, o2$colInd) # [1] TRUE
identical(o1$rowInd, o2$rowInd) # [1] TRUE

# the following line proves we'll get a different result if we input a z-score matrix
o3 <- heatmap.2(t(o1$carpet), scale = "none", main = 'o1$carpet', trace ='none', col=bluered(75)) # totally different


## Is it important to scale data before clustering

Is it important to scale data before clustering?. So if we are using the correlation as the distance, we don't need to use z-score transformation.

# gplots package and heatmap.2()

The following example is extracted from DESeq2 package.

## ----loadDESeq2, echo=FALSE----------------------------------------------
# in order to print version number below
library("DESeq2")

library("parathyroidSE")
library("GenomicFeatures")
data(exonsByGene)

## ----locateFiles, echo=FALSE---------------------------------------------
bamDir <- system.file("extdata",package="parathyroidSE",mustWork=TRUE)
fls <- list.files(bamDir, pattern="bam$",full=TRUE) ## ----bamfilepaired------------------------------------------------------- library( "Rsamtools" ) bamLst <- BamFileList( fls, yieldSize=100000 ) ## ----sumOver------------------------------------------------------------- library( "GenomicAlignments" ) se <- summarizeOverlaps( exonsByGene, bamLst, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) ## ----libraries----------------------------------------------------------- library( "DESeq2" ) library( "parathyroidSE" ) ## ----loadEcs------------------------------------------------------------- data( "parathyroidGenesSE" ) se <- parathyroidGenesSE colnames(se) <- se$run

## ----fromSE--------------------------------------------------------------
ddsFull <- DESeqDataSet( se, design = ~ patient + treatment )

## ----collapse------------------------------------------------------------
ddsCollapsed <- collapseReplicates( ddsFull,
groupby = ddsFull$sample, run = ddsFull$run )

## ----subsetCols----------------------------------------------------------
col = unique(as.numeric(dat$GO)), lty= 1, lwd = 5, cex=.7) # In practice par(xpd = FALSE) # default heatmap.2(, ColSideColors=cc) # add sample dendrogram par(xpd = NA) legend(0, .5, ...) # legend is on the LHS # the coordinate is device dependent  Another example from video which makes use of an archived package heatmap.plus. legend(0.8,1, legend=paste(treatment_times,"weeks"), fill=treatment_color_options, cex=0.5) legend(0.8,0.9, legend=c("Control","Treatment"), fill=c('#808080','#FFC0CB'), cex=0.5)  ### heatmap.plus() How to Make an R Heatmap with Annotations and Legend. ColSideColors can be a matrix (n x 2). So it is possible to draw two side colors on the heatmap. Unfortunately the package was removed from CRAN in 2021-04. The package was used by TCGAbiolinks but now this package uses ComplexHeatmap instead. devtools::install_version("heatmap.plus", "1.3")  ## Output from heatmap.2 examples # ggplot2 package ## ggplot2::geom_tile() # Suppose dat=[x, y1, y2, y3] is a wide matrix # and we want to make a long matrix like dat=[x, y, val] library(tidyr) dat <- dat %>% pivot_longer(!x, names_to = 'y', values_to='val') ggplot(dat, aes(x, y)) + geom_tile(aes(fill = val), colour = "white") + scale_fill_gradient2(low = "blue", mid = "white", high = "red") + labs(y="Cell Line", fill= "Log GI50") # white is the border color # grey = NA by default # labs(fill) is to change the title # labs(y) is to change the y-axis label  # NMF package aheatmap() function. # ComplexHeatmap Pros ## Simple examples with code ## More examples ## Clustering • Whether to cluster rows or not Heatmap(mat, cluster_rows = F)  • Whether to show the dendrogram or not Heatmap(mat, show_column_dend = F)  • Change the default distance method Heatmap(mat, clustering_distance_rows = function(m) dist(m)) Heatmap(mat, clustering_distance_rows = function(x, y) 1-cor(x, y))  • Change the default agglomeration/linkage method Heatmap(mat, clustering_method_rows = "complete")  • Change the clustering method in rows or columns Heatmap(mat, cluster_rows = diana(mat), cluster_columns = agnes(t(mat))) # 小心 # ** if cluster_columns is set as a function, you don't need to transpose the matrix ** Heatmap(mat, cluster_rows = diana, cluster_columns = agnes) # the above is the same as the following # Note, when cluster_rows is set as a function, the argument m is the input mat itself, # while for cluster_columns, m is the transpose of mat. Heatmap(mat, cluster_rows = function(m) as.dendrogram(diana(m)), cluster_columns = function(m) as.dendrogram(agnes(m))) fh = function(x) fastcluster::hclust(dist(x)) Heatmap(mat, cluster_rows = fh, cluster_columns = fh)  • Run clustering in each of subgroup # you might already have a subgroup classification for the matrix rows or columns, # and you only want to perform clustering for the features in the same subgroup. group = kmeans(t(mat), centers = 3)$cluster
Heatmap(mat, cluster_columns = cluster_within_group(mat, group))


### Render dendrograms

We can add colors to branches of the dendrogram after we cut the tree. See 2.3.3 Render dendrograms

### Reorder/rotate branches in dendrograms

• In the Heatmap() function, dendrograms are reordered to make features with larger difference more separated from each others (see reorder.dendrogram()).
• See an interesting example which makes use of the dendsort package. Not really useful.
• 2.3.4 Reorder dendrograms
• Inconsistent clustering with ComplexHeatmap?. Good explanation!
• row_dend_reorder/column_dend_reorder with default value TRUE.
• If we set "row_dend_reorder/column_dend_reorder" to be FALSE, then the orders obtained from hclust() & Heatmap() will be the same. More specifically, the order will be the same for columns and the order will be reversed for rows.
• By default, Heatmap() will create a different order than hclust(). If we like to get the same order as hclust(), we can do:
Heatmap(my_mat, column_dend_reorder = F, row_dend_reorder = F)
# OR
hclust_rows <- as.dendrogram(hclust(dist(my_mat)))
hclust_cols <- as.dendrogram(hclust(dist(t(my_mat)))
Heatmap(my_mat, cluster_columns = hclust_cols,
column_dend_reorder = F,
cluster_rows = hclust_rows,
row_dend_reorder = F, name = 'my_mat')

• By default, Heatmap() can create the same order as heatmap()/heatmap.2() function for columns but the row orders are reversed (but when I try another data, the statement does not hold).
Heatmap(my_mat)
# OR
Colv <- colMeans(my_mat, na.rm = T)
hclust_cols2 <- reorder(hclust_cols, Colv)
Rowv <- rowMeans(my_mat, na.rm = T)
hclust_rows2 <- reorder(hclust_rows, Rowv)
Heatmap(my_mat, cluster_columns = hclust_cols2,
column_dend_reorder = F,
cluster_rows = hclust_rows2,
row_dend_reorder = F, name = 'my_mat2')
# PS. columns order is the same as heatmap(),
#    but row order is the "reverse" of the order of heatmap()

• The order of rows and columns in a heatmap produced by the heatmap function can be different from the order produced by the hclust function because the heatmap function uses additional steps to reorder the dendrogram based on row/column means (Order of rows in heatmap?). This is done through the reorderfun parameter, which takes a function that reorders the dendrogram as much as possible based on row/column means. If you want to use the same order produced by the hclust function in your heatmap, you can extract the dendrogram from the hclust object and pass it to the Rowv or Colv arguments of the heatmap function. You can also set the reorderfun parameter to a function that does not reorder the dendrogram.
• Use dendextend package (see the next section). The 1st plot shows the original heatmap. The 2nd plot shows how to use the result of hclust() in the Heatmap() function. The 3rd plot shows how to rotate branches using the dendextend package.

### dendextend package

• See clustering section.
• Examples. See the plots given in the last section for how to use rotate() function to rotate branches. For rows, if we want to use numerical numbers instead of labels in order parameter, we need to count from top to bottom. For columns, we can count from left to right.
# create a dendrogram
hc <- hclust(dist(USArrests), "ave")
dend <- as.dendrogram(hc)

# manipulate the dendrogram using the dendextend package
dend2 <- color_branches(dend, k = 3)

# create a heatmap using the ComplexHeatmap package
Heatmap(USArrests, name = "USArrests", cluster_rows = dend2)


## Get the rows/columns order

Use row_order()/column_order(). See 4.12 Get orders and dendrograms

set.seed(123)
dat <- matrix(rnorm(20), ncol=2)
hc <- hclust(dist(dat))
plot(hc)
# get ordering of leaves
ord <- order.dendrogram(as.dendrogram(hc))
ord
# [1]  8  3  6  5 10  1  9  7  2  4

rownames(dat) <- 1:10
Heatmap(dat)
row_order(draw(Heatmap(dat)) )
# [1]  6  3  7  4  2  1  9  5 10  8
# Same order if I read the labels from top to down
# Differ from hclust() b/c reordering


## Set the rows/columns order manually

Heatmap(mat, name = "mat",
row_order = order(as.numeric(gsub("row", "", rownames(mat)))),
column_order = order(as.numeric(gsub("column", "", colnames(mat)))),
column_title = "reorder matrix")


## Rotate labels

Heatmap(mat, name = "mat", column_names_rot = 45)


## Heatmap split

• See 2.7 Heatmap split. One advantage of using this approach instead of the "+" operator is we have only 1 color annotation instead of 2 color annotations separately for each category/group.
• Split by k-means clustering
• Split by categorical variables. Below is an example where we want to sort genes within each level of some row class variable (eg. epi and mes). Then we will sort samples within each level of some column class variable (eg tumortype: carcinoma vs sarcoma).
• Split by dendrogram
• Furthermore we can also specify

## Multiple heatmaps in a plot

library(cowplot)

h1 <- Heatmap()
h2 <- Heatmap()
h3 <- Heatmap()
plot_grid(grid.grabExpr(draw(h1)),
grid.grabExpr(draw(h2)),
grid.grabExpr(draw(h3)), ncol=2)


## Colors and legend

• How to make continuous legend symmetric? #82, 2020 To exactly control the break values on the legend, you can set heatmap_legend_param argument in Heatmap() function.
• Use circlize::colorRamp2() to change the color limit including the color specification. PS: NO need to use library()/require() to load the circlize package.
• ComplexHeatmap break values appear different in the plots #361, 2019. pretty(range(x), n=3)
Heatmap( xm, col = colorRamp2(c(min(xm), 0, max(xm)), c("#0404B4", "white", "#B18904")),
show_row_names = F, km = 2, column_names_gp = gpar(fontsize = 7), name="Tumors",
heatmap_legend_param = list(at = c(min(xm), 0, max(xm))))

pretty(seq(-3, 3, length = 3),n=4)
# [1] -4 -2  0  2  4
pretty(seq(-3, 3, length = 3),n=5) # default n=5
# [1] -3 -2 -1  0  1  2  3

• One legend for a list of heatmaps #391, 2019
col_fun = colorRamp2(...)
Heatmap(mat1, col = col_fun, ...) +
Heatmap(mat2, col = col_fun, show_heatmap_legend = FALSE, ...) +
Heatmap(mat3, col = col_fun, show_heatmap_legend = FALSE, ...) +

• Breaks in Color Scales are Wrong #659, 2020. col = colorRamp2(seq(-3, 3, length = 3), c("blue", "#EEEEEE", "red")) does not mean -3, 0, 3 should be the breaks on the legend (although you can manually control it). The color mapping function only defines the colors, while the default break values on the legends are calculated from the input matrix with 3 to 5 break values. In your code, you see 4 and -4 are the border of the legend, actually, all values between 3~4 are mapped to red and all the values between -3~-4 are mapped to blue. In other words, if I use colorRamp2(c(-3, 1, 3), c('blue', 'white', 'red')), it will uniformly distribute data in (-3,1) to c('blue', 'white') and (1,3) to c('white', 'red').
• Hex code #EEEEEE represents bright gray
• Setting a default color schema #834, 2021
• Changing the default background color #698, 2021

## Customize the heatmap body

We can add numbers to each/certain cells. See 2.9 Customize the heatmap body

## Save images to files

See

png(file="newfile.png", width=8, height=6, units="in", res=300)
ht <- Heatmap(...)
draw(ht)
dev.off()


### svg and pdf

For some reason, when I save the image to a file in svg or pdf format I will see borders of each cell. When I try use_raster = TRUE option, it seems to fix the problem on the body heatmap but the column annotation part still has borders.

See Section 2.12

## Text alignment

See 3.14 Text annotation and search the keyword just which is useful in rowAnnotation(). Some examples are: just=c("left", "bottom"), just="right", just="center".

## Heatmap annotation

3 Heatmap Annotations. Using operators + and %v%' is easier so we can simplify the call to Heatmap().

Heatmap(...) + rowAnnotation() + ...   # add to right
Heatmap(...) %v% HeatmapAnnotation(...) %v% ... # add to bottom

ha = HeatmapAnnotation(...)
Heatmap(..., top_annotation = ha)

ha = rowAnnotation(...)
Heatmap(..., right_annotation = ha)


Example 1: top/bottom annotation and HeatmapAnnotation()

library(ComplexHeatmap); library(circlize)
set.seed(123)
mat = matrix(rnorm(80, 2), 8, 10)
rownames(mat) = paste0("R", 1:8)
colnames(mat) = paste0("C", 1:10)
col_anno = HeatmapAnnotation(
df = data.frame(anno1 = 1:10,
anno2 = sample(letters[1:3], 10, replace = TRUE)),
col = list(anno2 = c("a" = "red", "b" = "blue", "c" = "green")))
Heatmap(mat,
col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),
top_annotation = col_anno,
name = "mat",              # legend for the color of the main heatmap
column_title = "Heatmap")  # top of the whole plot, default is ''


Example 2: left/right annotation and rowAnnotation()

row_anno_df <- data.frame(anno1 = 1:8, anno2 = sample(letters[1:3], 8, replace = TRUE))
row_anno_col <- list(anno2 = c("a" = "red", "b" = "blue", "c" = "green"))
row_anno <- rowAnnotation(
df = row_anno_df,
col = row_anno_col)
Heatmap(mat,
col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),                          ,
right_annotation = row_anno,
name = "mat",
row_title = "Heatmap")
Heatmap(mat,
col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),                          ,
name = "mat",
row_title = "Heatmap") + row_anno  # row labels disappear?


Example 3: use colorRamp2() to control colors on continuous variables in annotations

# Same definition of row_anno_df
row_anno_col <- list(anno1 = colorRamp2(c(min(row_anno_df$anno1), max(row_anno_df$anno1)),
c("blue", "red")),
anno2 = c("a" = "red", "b" = "blue", "c" = "green"))
row_anno = rowAnnotation(df = row_anno_df,
col = row_anno_col)
Heatmap(mat, col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),
right_annotation = row_anno,
name = "mat",
row_title = "Heatmap")


### Hide annotation legend for some variables

HeatmapAnnotation("BRCA1/2"=BRCA,
show_legend = c("BRCA1/2" = FALSE),
col=list("BRCA1/2"=BRCA.colors))


### Adjust the height of column annotation

If you find the height of the column annotation too large, you can adjust it using the annotation_height parameter in the HeatmapAnnotation function or the re_size function in the ComplexHeatmap R package. Search height and simple_anno_size_adjust in Heatmap Annotations.

# Remember to set the 'simple_anno_size_adjust' parameter
# Default seems to be 1 cm.
column_ha = HeatmapAnnotation(tgi = tgi[ord],
bin = bin[ord],
col = list(tgi = col_fun,
bin = c("resistant" = "violet",
"sensitive" = "green")),
height = unit(.5, "cm"), simple_anno_size_adjust = TRUE)

# Assuming ha is your HeatmapAnnotation object
ha = re_size(ha, annotation_height = unit(2, "cm"))


## Correlation matrix

The following code will guarantee the heatmap has a diagonal in the direction of top-left and bottom-right. The row dendrogram will be flipped automatically. There is no need to use cluster_rows = rev(mydend).

mcor <- cor(t(lograt))
colnames(mcor) <- rownames(mcor) <- 1:ncol(mcor)
mydend <- as.dendrogram(hclust(as.dist(1-mcor)))
Heatmap( mcor, cluster_columns = mydend, cluster_rows = mydend,
row_dend_reorder = FALSE, column_dend_reorder = FALSE,
row_names_gp = gpar(fontsize = 6),
column_names_gp = gpar(fontsize = 6),
column_title = "", name = "value")


## tidyHeatmap

tidyHeatmap. This is a tidy implementation for heatmap. At the moment it is based on the (great) package 'ComplexHeatmap'.

Note: that ComplexHeatmap is on Bioconductor but tidyHeatmap is on CRAN.

By default, .scale = "row". See ?heatmap.

add_tile() to add a column or row (depending on the data) annotation.

cluster_rows=FALSE if we don't want to cluster rows.

BiocManager::install('tidyHeatmap')
library(tidyHeatmap)

library(tidyr)
mtcars_tidy <-
mtcars |>
as_tibble(rownames="Car name") |>
# Scale
mutate_at(vars(-Car name, -hp, -vs), scale) |>
# tidyfy
pivot_longer(cols = -c(Car name, hp, vs),
names_to = "Property",
values_to = "Value")

# create another variable which will be added next to 'hp'
mtcars_tidy <- mtcars_tidy%>%
mutate(type = Car name)
mtcars_tidy$type <- substr(mtcars_tidy$type, 1, 1)
mtcars_tidy

# NA case. Consider the cell on the top-right corner
mtcars_tidy %>% filter(Car name == 'Volvo 142E' & Property == 'am')
mtcars_tidy <- mtcars_tidy %>%                               # Replacing values
mutate(Value = replace(Value,
Car name == 'Volvo 142E' & Property == 'am',
NA))
mtcars_tidy %>% filter(Car name == 'Volvo 142E' & Property == 'am')
# Re-draw data with missing value
mtcars_tidy |>
heatmap(Car name, Property, Value,
palette_value = circlize::colorRamp2(
seq(-2, 2, length.out = 11),
rev(RColorBrewer::brewer.pal(11, "RdBu")))) |>

# two tiles on rows
mtcars_heatmap <-
mtcars_tidy |>
heatmap(Car name, Property, Value,
palette_value = circlize::colorRamp2(
seq(-2, 2, length.out = 11),
rev(RColorBrewer::brewer.pal(11, "RdBu")))) |>
mtcars_heatmap
# Other useful parameters
# heatmap(, cluster_rows = FALSE)
# heatmap(, .scale = F)

# Note add_tile(var) can decide whether the 'var' should go to
# columns or rows - interesting!
# one tile goes to columns and one tile goes to rows.
tidyHeatmap::pasilla |>
# group_by(location, type) |>
heatmap(
.column = sample,
.row = symbol,
.value = count normalised adjusted
) |>


# Correlation heatmap

## corrplot

This package is used for visualization of correlation matrix. See its vignette and Visualize correlation matrix using correlogram.

# pheatmap

What is special?

1. the color scheme is grey-blue to orange-red. pheatmap() default options, colorRampPalette in R
2. able to include class labels in samples and/or genes
3. the color key is thin and placed on the RHS (prettier for publications though it could miss information)
4. borders for each cell are included (not necessary)

(archived)

# heatmap3

CRAN. An improved heatmap package. Completely compatible with the original R function 'heatmap', and provides more powerful and convenient features.

# funkyheatmap

funkyheatmap - Generating Funky Heatmaps for Data Frames

# rasterImage

library(jpeg)

# need to create a plot first and rasterImage() will overlay it with another image
# bottom-left = (100, 300), top-right = (250, 450)
plot(c(100, 250), c(300, 450), type = "n", xlab = "", ylab = "")

args(rasterImage)
# function (image, xleft, ybottom, xright, ytop, angle = 0, interpolate = TRUE,  ...)
rasterImage(img, 100, 300, 250, 450)

text(100, 350, "ABCD", cex=2, col="yellow",adj=0) # left-bottom align


# Interactive heatmaps

## d3heatmap

This package has been removed from CRAN and archived on 2020-05-24.

A package generates interactive heatmaps using d3.js and htmlwidgets.

The package let you

• Highlight rows/columns by clicking axis labels
• Click and drag over colormap to zoom in (click on colormap to zoom out)
• Optional clustering and dendrograms, courtesy of base::heatmap

The following screenshots shows 3 features.

• Shows the row/column/value under the mouse cursor
• Zoom in a region (click on the zoom-in image will bring back the original heatmap)
• Highlight a row or a column (click the label of another row will highlight another row. Click the same label again will bring back the original image)

## heatmaply

This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice.

Installing this package requires to compile some dependent package.

The return object is heatmaply is 'plotly' and 'htmlwidget'. It does not return the ordering of rows/columns. It can not control whether to do clustering (d3heatmap package is better at this).

## shinyHeatmaply

shinyHeatmaply: Deploy 'heatmaply' using 'shiny'

# Colors

## Fold change

data <- pms / pms[, "12.50"] # # fold change vs "12.50" sample
data <- ifelse(data>1, data, -1/data)
heatmap.2(data, ...)


## Generate sequential colors & grDevices::colorRampPalette()

?colorRampPalette. See an example Building heatmap with R.

mat <- matrix(1:100) # 100 x 1
image(t(mat), axes = FALSE, col = colorRampPalette( c("blue", "white", "red") )(100))


require(RColorBrewer)  # brewer.pal
ColorRamp <- colorRampPalette( brewer.pal(9, "Blues") )(25)
# The above will generate 25 colors by interpolating the colors
#   defined by brewer.pal(9, "Blues")
heatmap.2(..., col = ColorRamp)


## RColorBrewer::brewer.pal() function

• RColorBrewer can be used in both base plots and ggplot2.
• barplot(c(2,5,7), col = brewer.pal(n = 3, name = "RdBu"))
• ggplot() + geom_XXX + scale_fill_brewer(palette = "Dark2"). ggplot() + geom_point + + scale_color_brewer(palette = "Dark2")
• ColorBrewer contains 3 types of color palettes (The above page contains a list of all names):
• sequential,
• diverging,
• qualitative.
• display.brewer.pal() can display the colors on a graphical device.
• brewer.pal() will return colors (in hex) for a certain palette,
library(RColorBrewer)

display.brewer.all()                          # visualize colors
display.brewer.all(colorblindFriendly = TRUE) # draw a plot

display.brewer.pal(11, "BrBG") # Alternative colors used in correlation matrix

display.brewer.pal(9, "Set1") # Up to 9 classes are allowed

brewer.pal(n, name)
brewer.pal(n = 8, name = "Dark2")  # Qualitative palettes
## [1] "#1B9E77" "#D95F02" "#7570B3" "#E7298A" "#66A61E" "#E6AB02" "#A6761D"
## [8] "#666666"
brewer.pal(n = 8, name = "Greens") # Sequential palettes

brewer.pal(n = 8, name = "RdYlBu") # Diverging palettes


### RdYlBu

It means Red-Yellow-Blue. See a full list of names and colors for RColorBrewer palettes Colors in R.

See some examples (R base heatmap, d3heatmap) Heatmap in R: Static and Interactive Visualization using RdYlBu color.

library(RColorBrewer)
display.brewer.pal(5, "RdYlBu")
brewer.pal(n=5, name = "RdYlBu")
# [1] "#D7191C" "#FDAE61" "#FFFFBF" "#ABD9E9" "#2C7BB6"

# If we set n=3, the colors won't be right
display.brewer.pal(3, "RdYlBu")
brewer.pal(n=3, name = "RdYlBu")
# [1] "#FC8D59" "#FFFFBF" "#91BFDB"


pheatmap seems to use a palette very close to RdYlBu. To specify the palette explicitly, see R语言: 从pheatmap无缝迁移至ComplexHeatmap.

Note ComplexHeatmap requires the color to be a function instead of color palette.

library(ComplexHeatmap)
df <- scale(mtcars)
range(df)
# [1] -1.874010  3.211677  # NOT symmetric

col_fun <- circlize::colorRamp2(quantile(df, c(0, .25, .5, .75, 1)),
rev(RColorBrewer::brewer.pal(n=5, name = "RdYlBu")))

# Treat the data as symmetric
col_fun <- circlize::colorRamp2(c(-2, qnorm(c(.25, .5, .75)), 2),
rev(RColorBrewer::brewer.pal(n=5, name = "RdYlBu")))
Heatmap(df,
col = col_fun,
name = "mtcars", #title of legend
column_title = "Variables", row_title = "Samples",
row_names_gp = gpar(fontsize = 7) # Text size for row names
)


## viridis: Colorblind-Friendly Color

• https://cran.r-project.org/web/packages/viridis/index.html.
• The vignette contains a comparison of color palettes from base-R, ggplot2, colorbrewer
• The package contains 8 color scales
• For base plots, use the viridis() function to generate a palette
• or ggplot, use scale_color_viridis() and scale_fill_viridis()
viridis(3)
# [1] "#440154FF" "#21908CFF" "#FDE725FF"
#  (purple, green, yellow)


• ggplot2 heatmap

## scales package

• https://scales.r-lib.org/. Scales colour palettes are used to power the scales in ggplot2, but you can use them in any plotting system include base R plots.
show_col(hue_pal()(4))      # visualize colors
show_col(viridis_pal()(16)) # visualize colors

viridis_pal()(4)
#> [1] "#440154FF" "#31688EFF" "#35B779FF" "#FDE725FF"

# use in combination with baseR palette() to set new defaults
palette(brewer_pal(palette = "Set2")(4))

• Emulate ggplot2 default color palette

## Missing data

In the dynamic heatmap tool of BRB-ArrayTools, the missing data is represented by the gray color.

plot(c(100, 200), c(300, 450), type= "n", xlab = "", ylab = "")
rect(110, 300, 175, 350, col = "navy")
rect(110, 360, 175, 400, col = "powderblue")
rect(110, 410, 175, 450, col = "#4B9CD3")