# Clustering

## Hierarchical clustering

For the kth cluster, define the Error Sum of Squares as $\displaystyle{ ESS_m = }$ sum of squared deviations (squared Euclidean distance) from the cluster centroid. $\displaystyle{ ESS_m = \sum_{l=1}^{n_m}\sum_{k=1}^p (x_{ml,k} - \bar{x}_{m,k})^2 }$ in which $\displaystyle{ \bar{x}_{m,k} = (1/n_m) \sum_{l=1}^{n_m} x_{ml,k} }$ the mean of the mth cluster for the k-th variable, $\displaystyle{ x_{ml,k} }$ being the score on the kth variable $\displaystyle{ (k=1,\dots,p) }$ for the l-th object $\displaystyle{ (l=1,\dots,n_m) }$ in the mth cluster $\displaystyle{ (m=1,\dots,g) }$.

If there are C clusters, define the Total Error Sum of Squares as Sum of Squares as $\displaystyle{ ESS = \sum_m ESS_m, m=1,\dots,C }$

Consider the union of every possible pair of clusters.

Combine the 2 clusters whose combination combination results in the smallest increase in ESS.

1. The default linkage is "complete" in R.
2. Ward's method tends to join clusters with a small number of observations, and it is strongly biased toward producing clusters with the same shape and with roughly the same number of observations.
3. It is also very sensitive to outliers. See Milligan (1980).

Take pomeroy data (7129 x 90) for an example:

library(gplots)

lr = as.matrix(lr)
method = "average" # method <- "complete"; method <- "ward.D"; method <- "ward.D2"
hclust1 <- function(x) hclust(x, method= method)
heatmap.2(lr, col=bluered(75), hclustfun = hclust1, distfun = dist,
density.info="density", scale = "none",
key=FALSE, symkey=FALSE, trace="none",
main = method)


It seems average method will create a waterfall like dendrogram. Ward method will produce a tight clusters. Complete linkage produces a more 中庸 result.

## Optimal number of clusters

### Silhouette score/width ### Scree/elbow plot

• Cf scree plot for PCA analysis
• K-Means Clustering in R: Step-by-Step Example. ?factoextra::fviz_nbclust (good integration with different clustering methods and evaluation statistic)
• datacamp
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
model <- kmeans(x = lineup, centers = k)
model$tot.withinss }) # Generate a data frame containing both k and tot_withinss elbow_df <- data.frame( k = 1:10, tot_withinss = tot_withinss ) # Plot the elbow plot ggplot(elbow_df, aes(x = k, y = tot_withinss)) + geom_line() + scale_x_continuous(breaks = 1:10)  ### kBET: k-nearest neighbour batch effect test • Buttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019). • https://github.com/theislab/kBET • quantify mixability; how well cells of the same type from different batches were grouped together ### Alignment score • Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018). • quantify mixability; how well cells of the same type from different batches were grouped together ### dynamicTreeCut package dynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. cutreeDynamicTree(). Found in here. ## Compare 2 clustering methods, ARI ## Benchmark clustering algorithms ## Power Statistical power for cluster analysis 2022. It includes several take-home message. ## Louvain algorithm: graph-based method ## Mahalanobis distance ## Large dendrograms # You probably don't understand heatmaps # Evaluate the effect of centering & scaling ## Different distance measures ## 1-correlation distance Effect of centering and scaling on clustering of genes and samples in terms of distance. 'Yes' means the distance was changed compared to the baseline where no centering or scaling was applied. clustering genes clustering samples centering on each genes No Yes* scaling on each genes No Yes* • Note: Cor(X, Y) = Cor(X + constant scalar, Y). If the constant is not a scalar, the equation won't hold. Or think about plotting data in a 2 dimension space. If the X data has a constant shift for all observations/genes, then the linear correlation won't be changed. ## Euclidean distance clustering genes clustering samples centering on each genes Yes No1 scaling on each genes Yes Yes2 Note 1. $\displaystyle{ \sum(X_i - Y_i)^2 = \sum(X_i-c_i - (Y_i-c_i))^2 }$ 2. $\displaystyle{ \sum(X_i - Y_i)^2 \neq \sum(X_i/c_i - Y_i/c_i)^2 }$ ## Supporting R code 1. 1-Corr distance source("http://www.bioconductor.org/biocLite.R") biocLite("limma"); biocLite("ALL") library(limma); library(ALL) data(ALL) eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]

f <- factor(as.character(eset$mol.biol)) design <- model.matrix(~f) fit <- eBayes(lmFit(eset,design)) selected <- p.adjust(fit$p.value[, 2]) < 0.05
esetSel <- eset [selected, ]   # 165 x 47
heatmap(exprs(esetSel))

esetSel2 <- esetSel[sample(1:nrow(esetSel), 20), sample(1:ncol(esetSel), 10)] # 20 x 10

dist.no <- 1-cor(t(as.matrix(esetSel2)))
dist.mean <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2))))) # gene variance has not changed!
dist.median <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median))))

range(dist.no - dist.mean) #  -1.110223e-16  0.000000e+00
range(dist.no - dist.median) #  -1.110223e-16  0.000000e+00
range(dist.mean - dist.median) #  0 0


So centering (no matter which measure: mean, median, ...) genes won't affect 1-corr distance of genes.

dist.mean <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/")))
dist.median <- 1-cor(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/")))
range(dist.no - dist.mean) #  -8.881784e-16  6.661338e-16
range(dist.no - dist.median) #  -6.661338e-16  6.661338e-16
range(dist.mean - dist.median) #  -1.110223e-15  1.554312e-15


So scaling after centering (no matter what measures: mean, median,...) won't affect 1-corr distance of genes.

How about centering / scaling genes on array clustering?

dist.no <- 1-cor(as.matrix(esetSel2))
dist.mean <- 1-cor(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)))) # array variance has changed!
dist.median <- 1-cor(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median)))

range(dist.no - dist.mean) #  -1.547086  0.000000
range(dist.no - dist.median) #  -1.483427  0.000000
range(dist.mean - dist.median) #  -0.5283601  0.6164602

dist.mean <- 1-cor(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/"))
dist.median <- 1-cor(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/"))
range(dist.no - dist.mean) #  -1.477407  0.000000
range(dist.no - dist.median) #  -1.349419  0.000000
range(dist.mean - dist.median) #  -0.5419534  0.6269875


2. Euclidean distance

dist.no <- dist(as.matrix(esetSel2))
dist.mean <- dist(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2))))
dist.median <- dist(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median)))

range(dist.no - dist.mean) #  7.198864e-05 2.193487e+01
range(dist.no - dist.median) #  -0.3715231 21.9320846
range(dist.mean - dist.median) #  -0.923717629 -0.000088385


Centering does affect the Euclidean distance.

dist.mean <- dist(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/"))
dist.median <- dist(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/"))

range(dist.no - dist.mean) #   0.7005071 24.0698991
range(dist.no - dist.median) #   0.636749 24.068920
range(dist.mean - dist.median) #  -0.22122869  0.02906131


And scaling affects Euclidean distance too.

How about centering / scaling genes on array clustering?

dist.no <- dist(t(as.matrix(esetSel2)))
dist.mean <- dist(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)))))
dist.median <- dist(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median))))

range(dist.no - dist.mean) # 0 0
range(dist.no - dist.median)  # 0 0
range(dist.mean - dist.median) # 0 0

dist.mean <- dist(t(sweep(as.matrix(esetSel2), 1L, rowMeans(as.matrix(esetSel2)), "/")))
dist.median <- dist(t(sweep(as.matrix(esetSel2), 1L, apply(esetSel2, 1, median), "/")))

range(dist.no - dist.mean) #  1.698960 9.383789
range(dist.no - dist.median)  #  1.683028 9.311603
range(dist.mean - dist.median) #  -0.09139173  0.02546394


## Euclidean distance and Pearson correlation relationship

$\displaystyle{ r(X, Y) = 1 - \frac{d^2(X, Y)}{2n} }$

where $\displaystyle{ r(X, Y) }$ is the Pearson correlation of variables X and Y and $\displaystyle{ d^2(X, Y) }$ is the squared Euclidean distance of X and Y.

# Simple image plot using image() function

### Create Matrix plot using colors to fill grid
# Create matrix.   Using random values for this example.
rand <- rnorm(286, 0.8, 0.3)
mat <- matrix(sort(rand), nr=26)
dim(mat)  # Check dimensions

# Create x and y labels
yLabels <- seq(1, 26, 1)
xLabels <- c("a", "b", "c", "d", "e", "f", "g", "h", "i",
"j", "k");

# Set min and max values of rand
min <- min(rand, na.rm=T)
max <- max(rand, na.rm=T)

# Red and green range from 0 to 1 while Blue ranges from 1 to 0
ColorRamp <- rgb(seq(0.95,0.99,length=50),  # Red
seq(0.95,0.05,length=50),  # Green
seq(0.95,0.05,length=50))  # Blue
ColorLevels <- seq(min, max, length=length(ColorRamp))

# Set layout.  We are going to include a colorbar next to plot.
layout(matrix(data=c(1,2), nrow=1, ncol=2), widths=c(4,1),
heights=c(1,1))
#plotting margins.  These seem to work well for me.
par(mar = c(5,5,2.5,1), font = 2)

# Plot it up!
image(1:ncol(mat), 1:nrow(mat), t(mat),
col=ColorRamp, xlab="Variable", ylab="Time",
axes=FALSE, zlim=c(min,max),
main= NA)

# Now annotate the plot
box()
axis(side = 1, at=seq(1,length(xLabels),1), labels=xLabels,
cex.axis=1.0)
axis(side = 2, at=seq(1,length(yLabels),1), labels=yLabels, las= 1,
cex.axis=1)

# Add colorbar to second plot region
par(mar = c(3,2.5,2.5,2))
image(1, ColorLevels,
matrix(data=ColorLevels, ncol=length(ColorLevels),nrow=1),
col=ColorRamp,xlab="",ylab="",xaxt="n", las = 1)


If we define ColorRamp variable using the following way, we will get the 2nd plot.

require(RColorBrewer) # get brewer.pal()
ColorRamp <- colorRampPalette( rev(brewer.pal(9, "RdBu")) )(25)


Note that

• colorRampPalette() is an R's built-in function. It interpolate a set of given colors to create new color palettes. The return is a function that takes an integer argument (the required number of colors) and returns a character vector of colors (see rgb()) interpolating the given sequence (similar to heat.colors() or terrain.colors()).
# An example of showing 50 shades of grey in R
greys <- grep("^grey", colours(), value = TRUE)
length(greys)
#  102
shadesOfGrey <- colorRampPalette(c("grey0", "grey100"))
#  "#000000" "#FFFFFF"
# And 50 shades of grey?
mat <- matrix(rep(1:50, each = 50))
image(mat, axes = FALSE, col = fiftyGreys)
box()

• (For dual channel data) brewer.pal(9, "RdBu") creates a diverging palette based on "RdBu" with 9 colors. See help(brewer.pal, package="RColorBrewer") for a list of palette name. The meaning of the palette name can be found on colorbrew2.org website. In genomics, we will add rev() such as rev(brewer.pal(9, "RdBu")).
• (For single channel data) brewer.pal(9, "Blues") is good. See an example.

# gplots package and heatmap.2()

The following example is extracted from DESeq2 package.

## ----loadDESeq2, echo=FALSE----------------------------------------------
# in order to print version number below
library("DESeq2")

library("parathyroidSE")
library("GenomicFeatures")
data(exonsByGene)

## ----locateFiles, echo=FALSE---------------------------------------------
bamDir <- system.file("extdata",package="parathyroidSE",mustWork=TRUE)
fls <- list.files(bamDir, pattern="bam$",full=TRUE) ## ----bamfilepaired------------------------------------------------------- library( "Rsamtools" ) bamLst <- BamFileList( fls, yieldSize=100000 ) ## ----sumOver------------------------------------------------------------- library( "GenomicAlignments" ) se <- summarizeOverlaps( exonsByGene, bamLst, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) ## ----libraries----------------------------------------------------------- library( "DESeq2" ) library( "parathyroidSE" ) ## ----loadEcs------------------------------------------------------------- data( "parathyroidGenesSE" ) se <- parathyroidGenesSE colnames(se) <- se$run

## ----fromSE--------------------------------------------------------------
ddsFull <- DESeqDataSet( se, design = ~ patient + treatment )

## ----collapse------------------------------------------------------------
ddsCollapsed <- collapseReplicates( ddsFull,
groupby = ddsFull$sample, run = ddsFull$run )

## ----subsetCols----------------------------------------------------------
col = unique(as.numeric(dat$GO)), lty= 1, lwd = 5, cex=.7)  Another example from video which makes use of an archived package heatmap.plus. legend(0.8,1, legend=paste(treatment_times,"weeks"), fill=treatment_color_options, cex=0.5) legend(0.8,0.9, legend=c("Control","Treatment"), fill=c('#808080','#FFC0CB'), cex=0.5)  ### heatmap.plus() How to Make an R Heatmap with Annotations and Legend. ColSideColors can be a matrix (n x 2). So it is possible to draw two side colors on the heatmap. Unfortunately the package was removed from CRAN in 2021-04. The package was used by TCGAbiolinks but now this package uses ComplexHeatmap instead. devtools::install_version("heatmap.plus", "1.3")  ## Output from heatmap.2 examples # ggplot2 package ## ggplot2::geom_tile() # Suppose dat=[x, y1, y2, y3] is a wide matrix # and we want to make a long matrix like dat=[x, y, val] library(tidyr) dat <- dat %>% pivot_longer(!x, names_to = 'y', values_to='val') ggplot(dat, aes(x, y)) + geom_tile(aes(fill = val), colour = "white") + scale_fill_gradient2(low = "blue", mid = "white", high = "red") + labs(y="Cell Line", fill= "Log GI50") # white is the border color # grey = NA by default # labs(fill) is to change the title # labs(y) is to change the y-axis label  # NMF package aheatmap() function. # ComplexHeatmap Pros ## Simple examples with code ## Reorder dendrograms ## Save images to files png(file="newfile.png") ht <- Heatmap(...) draw(ht) dev.off()  ## Colors and legend • How to make continuous legend symmetric? #82, 2020 To exactly control the break values on the legend, you can set heatmap_legend_param argument in Heatmap() function. • Use circlize::colorRamp2() to change the color limit including the color specification. PS: NO need to use library()/require() to load the circlize package. • ComplexHeatmap break values appear different in the plots #361, 2019. pretty(range(x), n=3) Heatmap( xm, col = colorRamp2(c(min(xm), 0, max(xm)), c("#0404B4", "white", "#B18904")), show_row_names = F, km = 2, column_names_gp = gpar(fontsize = 7), name="Tumors", heatmap_legend_param = list(at = c(min(xm), 0, max(xm)))) pretty(seq(-3, 3, length = 3),n=4) #  -4 -2 0 2 4 pretty(seq(-3, 3, length = 3),n=5) # default n=5 #  -3 -2 -1 0 1 2 3  • One legend for a list of heatmaps #391, 2019 col_fun = colorRamp2(...) Heatmap(mat1, col = col_fun, ...) + Heatmap(mat2, col = col_fun, show_heatmap_legend = FALSE, ...) + Heatmap(mat3, col = col_fun, show_heatmap_legend = FALSE, ...) +  • Breaks in Color Scales are Wrong #659, 2020. col = colorRamp2(seq(-3, 3, length = 3), c("blue", "#EEEEEE", "red")) does not mean -3, 0, 3 should be the breaks on the legend (although you can manually control it). The color mapping function only defines the colors, while the default break values on the legends are calculated from the input matrix with 3 to 5 break values. In your code, you see 4 and -4 are the border of the legend, actually, all values between 3~4 are mapped to red and all the values between -3~-4 are mapped to blue. In other words, if I use colorRamp2(c(-3, 1, 3), c('blue', 'white', 'red')), it will uniformly distribute data in (-3,1) to c('blue', 'white') and (1,3) to c('white', 'red'). • Hex code #EEEEEE represents bright gray • Setting a default color schema #834, 2021 • Changing the default background color #698, 2021 ## Correlation matrix The following code will guarantee the heatmap has a diagonal in the direction of top-left and bottom-right. The row dendrogram will be flipped automatically. There is no need to use cluster_rows = rev(mydend). mcor <- cor(t(lograt)) colnames(mcor) <- rownames(mcor) <- 1:ncol(mcor) mydend <- as.dendrogram(hclust(as.dist(1-mcor))) Heatmap( mcor, cluster_columns = mydend, cluster_rows = mydend, row_dend_reorder = FALSE, column_dend_reorder = FALSE, row_names_gp = gpar(fontsize = 6), column_names_gp = gpar(fontsize = 6), column_title = "", name = "value")  ## OncoPrint ## InteractiveComplexHeatmap ## tidyHeatmap tidyHeatmap. This is a tidy implementation for heatmap. At the moment it is based on the (great) package 'ComplexHeatmap'. Note: that ComplexHeatmap is on Bioconductor but tidyHeatmap is on CRAN. By default, .scale = "row". See ?heatmap. add_tile() to add a column or row (depending on the data) annotation. cluster_rows=FALSE if we don't want to cluster rows. BiocManager::install('tidyHeatmap') library(tidyHeatmap) library(tidyr) mtcars_tidy <- mtcars |> as_tibble(rownames="Car name") |> # Scale mutate_at(vars(-Car name, -hp, -vs), scale) |> # tidyfy pivot_longer(cols = -c(Car name, hp, vs), names_to = "Property", values_to = "Value") # create another variable which will be added next to 'hp' mtcars_tidy <- mtcars_tidy%>% mutate(type = Car name) mtcars_tidy$type <- substr(mtcars_tidytype, 1, 1) mtcars_tidy # NA case. Consider the cell on the top-right corner mtcars_tidy %>% filter(Car name == 'Volvo 142E' & Property == 'am') mtcars_tidy <- mtcars_tidy %>% # Replacing values mutate(Value = replace(Value, Car name == 'Volvo 142E' & Property == 'am', NA)) mtcars_tidy %>% filter(Car name == 'Volvo 142E' & Property == 'am') # Re-draw data with missing value mtcars_tidy |> heatmap(Car name, Property, Value, palette_value = circlize::colorRamp2( seq(-2, 2, length.out = 11), rev(RColorBrewer::brewer.pal(11, "RdBu")))) |> add_tile(hp) |> add_tile(type) # two tiles on rows mtcars_heatmap <- mtcars_tidy |> heatmap(Car name, Property, Value, palette_value = circlize::colorRamp2( seq(-2, 2, length.out = 11), rev(RColorBrewer::brewer.pal(11, "RdBu")))) |> add_tile(hp) |> add_tile(type) mtcars_heatmap # Other useful parameters # heatmap(, cluster_rows = FALSE) # heatmap(, .scale = F) # Note add_tile(var) can decide whether the 'var' should go to # columns or rows - interesting! # one tile goes to columns and one tile goes to rows. tidyHeatmap::pasilla |> # group_by(location, type) |> heatmap( .column = sample, .row = symbol, .value = count normalised adjusted ) |> add_tile(condition) |> add_tile(activation)  # Correlation heatmap ## corrplot This package is used for visualization of correlation matrix. See its vignette and Visualize correlation matrix using correlogram. ## ggcorrplot ## ztable # SubtypeDrug: Prioritization of Candidate Cancer Subtype Specific Drugs # pheatmap What is special? 1. the color scheme is grey-blue to orange-red. pheatmap() default options, colorRampPalette in R 2. able to include class labels in samples and/or genes 3. the color key is thin and placed on the RHS (prettier for publications though it could miss information) 4. borders for each cell are included (not necessary) # fheatmap (archived) # heatmap3 CRAN. An improved heatmap package. Completely compatible with the original R function 'heatmap', and provides more powerful and convenient features. # Seurat for scRNA-seq ## DoMultiBarHeatmap ## Dot heatmap # dittoSeq for scRNA-seq # Office/Excel # Heatmap in the terminal # rasterImage library(jpeg) img<-readJPEG("~/Downloads/IMG_20220620_153515023.jpg") # need to create a plot first and rasterImage() will overlay it with another image # bottom-left = (100, 300), top-right = (250, 450) plot(c(100, 250), c(300, 450), type = "n", xlab = "", ylab = "") args(rasterImage) # function (image, xleft, ybottom, xright, ytop, angle = 0, interpolate = TRUE, ...) rasterImage(img, 100, 300, 250, 450) text(100, 350, "ABCD", cex=2, col="yellow",adj=0) # left-bottom align  # Interactive heatmaps ## d3heatmap This package has been removed from CRAN and archived on 2020-05-24. A package generates interactive heatmaps using d3.js and htmlwidgets. The package let you • Highlight rows/columns by clicking axis labels • Click and drag over colormap to zoom in (click on colormap to zoom out) • Optional clustering and dendrograms, courtesy of base::heatmap The following screenshots shows 3 features. • Shows the row/column/value under the mouse cursor • Zoom in a region (click on the zoom-in image will bring back the original heatmap) • Highlight a row or a column (click the label of another row will highlight another row. Click the same label again will bring back the original image) ## heatmaply This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice. Installing this package requires to compile some dependent package. The return object is heatmaply is 'plotly' and 'htmlwidget'. It does not return the ordering of rows/columns. It can not control whether to do clustering (d3heatmap package is better at this). ## shinyHeatmaply shinyHeatmaply: Deploy 'heatmaply' using 'shiny' ## canvasXpress ## InteractiveComplexHeatmap # Calendar heatmap # Dendrogram ## Flip dendrogram ## Color labels library(RColorBrewer) # matrix contains genomics-style data where columns are samples # (if otherwise remove the transposition below) # labels is a factor variable going along the columns of matrix plotHclustColors <- function(matrix,labels,...) { colnames(matrix) <- labels d <- dist(t(matrix)) hc <- hclust(d, method = "ward.D2") labelColors <- brewer.pal(nlevels(labels),"Set1") colLab <- function(n) { if (is.leaf(n)) { a <- attributes(n) labCol <- labelColors[which(levels(labels) == alabel)]
attr(n, "nodePar") <- c(a\$nodePar, lab.col=labCol)
}
n
}
clusDendro <- dendrapply(as.dendrogram(hc), colLab)
# I change cex because there are lots of samples
op <- par(mar=c(5,3,1,.5)+.1, cex=.3)
plot(clusDendro,...)
par(op)
}

plotHclustColors(genedata, pheno)


## dendextend* package

• dendextend::plot(, horiz=TRUE) allows to rotate a dendrogram with tips on RHS.
• plot_horiz.dendrogram() allows to rotate a dendrogram with tips on LHS.
• The package has a function tanglegram() to compare two trees of hierarchical clusterings. See this post and its vignette.
• Add colored bars

Simplified from dendextend's vignette or Label and color leaf dendrogram.

library(dendextend)

set.seed(1234)
iris <- datasets::iris[sample(150, 30), ] # subset for better view
iris2 <- iris[, -5] # data
species_labels <- iris[, 5] # group for coloring

hc_iris <- hclust(dist(iris2), method = "complete")
iris_species <- levels(species_labels)

dend <- as.dendrogram(hc_iris)
colorCodes <- c("red", "green", "blue")
labels_colors(dend) <- colorCodes[as.numeric(species_labels)][order.dendrogram(dend)]

labels(dend) <- paste0(as.character(species_labels)[order.dendrogram(dend)],
"(", labels(dend), ")")
# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height=0.1)
dend <- set(dend, "labels_cex", 1.0)

png("~/Downloads/iris_dextend.png", width = 1200, height = 600)
par(mfrow=c(1,2), mar = c(3,3,1,7))
plot(dend, main = "", horiz =  TRUE)
legend("topleft", legend = iris_species, fill = colorCodes)

par(mar=c(3,1,1,5))
plot(as.dendrogram(hc_iris),horiz=TRUE)
dev.off()


# Colors

## Generate sequential colors & grDevices::colorRampPalette()

?colorRampPalette. See an example Building heatmap with R.

mat <- matrix(1:100) # 100 x 1
image(t(mat), axes = FALSE, col = colorRampPalette( c("blue", "white", "red") )(100))

require(RColorBrewer)  # brewer.pal
ColorRamp <- colorRampPalette( brewer.pal(9, "Blues") )(25)
# The above will generate 25 colors by interpolating the colors
#   defined by brewer.pal(9, "Blues")
heatmap.2(..., col = ColorRamp)


## RColorBrewer::brewer.pal() function

• RColorBrewer can be used in both base plots and ggplot2.
• barplot(c(2,5,7), col = brewer.pal(n = 3, name = "RdBu"))
• ggplot() + geom_XXX + scale_fill_brewer(palette = "Dark2"). ggplot() + geom_point + + scale_color_brewer(palette = "Dark2")
• ColorBrewer contains 3 types of color palettes (The above page contains a list of all names):
• sequential,
• diverging,
• qualitative.
• display.brewer.pal() can display the colors on a graphical device.
• brewer.pal() will return colors (in hex) for a certain palette,
library(RColorBrewer)

display.brewer.all()                          # visualize colors
display.brewer.all(colorblindFriendly = TRUE) # draw a plot

display.brewer.pal(11, "BrBG") # Alternative colors used in correlation matrix

display.brewer.pal(9, "Set1") # Up to 9 classes are allowed

brewer.pal(n, name)
brewer.pal(n = 8, name = "Dark2")  # Qualitative palettes
##  "#1B9E77" "#D95F02" "#7570B3" "#E7298A" "#66A61E" "#E6AB02" "#A6761D"
##  "#666666"
brewer.pal(n = 8, name = "Greens") # Sequential palettes

brewer.pal(n = 8, name = "RdYlBu") # Diverging palettes


### RdYlBu

See some examples (R base heatmap, d3heatmap) Heatmap in R: Static and Interactive Visualization using RdYlBu color.

library(RColorBrewer)
display.brewer.pal(5, "RdYlBu")
brewer.pal(n=5, name = "RdYlBu")
#  "#D7191C" "#FDAE61" "#FFFFBF" "#ABD9E9" "#2C7BB6"

# If we set n=3, the colors won't be right
display.brewer.pal(3, "RdYlBu")
brewer.pal(n=3, name = "RdYlBu")
#  "#FC8D59" "#FFFFBF" "#91BFDB"


pheatmap seems to use a palette very close to RdYlBu. To specify the palette explicitly, see R语言: 从pheatmap无缝迁移至ComplexHeatmap.

Note ComplexHeatmap requires the color to be a function instead of color palette.

library(ComplexHeatmap)
df <- scale(mtcars)
range(df)
#  -1.874010  3.211677  # NOT symmetric

col_fun <- circlize::colorRamp2(quantile(df, c(0, .25, .5, .75, 1)),
rev(RColorBrewer::brewer.pal(n=5, name = "RdYlBu")))

# Treat the data as symmetric
col_fun <- circlize::colorRamp2(c(-2, qnorm(c(.25, .5, .75)), 2),
rev(RColorBrewer::brewer.pal(n=5, name = "RdYlBu")))
Heatmap(df,
col = col_fun,
name = "mtcars", #title of legend
column_title = "Variables", row_title = "Samples",
row_names_gp = gpar(fontsize = 7) # Text size for row names
)


## scales package

• https://scales.r-lib.org/. Scales colour palettes are used to power the scales in ggplot2, but you can use them in any plotting system include base R plots.
show_col(hue_pal()(4))      # visualize colors
show_col(viridis_pal()(16)) # visualize colors

viridis_pal()(4)
#>  "#440154FF" "#31688EFF" "#35B779FF" "#FDE725FF"

# use in combination with baseR palette() to set new defaults
palette(brewer_pal(palette = "Set2")(4))

• Emulate ggplot2 default color palette

## Missing data

In the dynamic heatmap tool of BRB-ArrayTools, the missing data is represented by the gray color.

# Papers

## Double dipping 