# Confusion matrix, Sensitivity/Specificity/Accuracy

Wikipedia. Note in wikipedia, it uses positive/negative to represent two classes.

 Predict 1 0 True 1 TP FN Sens=TP/(TP+FN)=Recall=TPR FNR=FN/(TP+FN) 0 FP TN Spec=TN/(FP+TN), 1-Spec=FPR PPV=TP/(TP+FP)=Precision FDR=FP/(TP+FP) =1-PPV NPV=TN/(FN+TN) N = TP + FP + FN + TN
• Sensitivity 敏感度 = TP / (TP + FN) = Recall
• Specificity 特異度 = TN / (TN + FP)
• Accuracy = (TP + TN) / N
• False discovery rate FDR = FP / (TP + FP)
• False negative rate FNR = FN / (TP + FN)
• False positive rate FPR = FP / (FP + TN) = 1 - Spec
• True positive rate = TP / (TP + FN) = Sensitivity
• Positive predictive value (PPV) = TP / # positive calls = TP / (TP + FP) = 1 - FDR = Precision
• Negative predictive value (NPV) = TN / # negative calls = TN / (FN + TN)
• Prevalence 盛行率 = (TP + FN) / N.
• Note that PPV & NPV can also be computed from sensitivity, specificity, and prevalence:
$\displaystyle{ \text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence}+(1-\text{specificity}) \times (1-\text{prevalence})} }$
$\displaystyle{ \text{NPV} = \frac{\text{specificity} \times (1-\text{prevalence})}{(1-\text{sensitivity}) \times \text{prevalence}+\text{specificity} \times (1-\text{prevalence})} }$

## PPA, NPA

• Cf PPV/positive predictive value & NPV/negative predictive value
• Positive percent agreement (PPA) and negative percent agreement (NPA) - Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests - Guidance for Industry and FDA Staff
• The percentage of subjects (or samples) that test positive by one test (TA) that are found positive by a second test (TB); the two tests are indicated with the ordered subscripts, as $\displaystyle{ PPA_{TA:TB} }$. The test listed first is treated as the “reference” test (denominator a+b in the formula given below).
• Diagnostic accuracy (sensitivity/specificity) versus agreement (PPA/NPA) statistics
• Positive agreement is the proportion of comparative/reference method positive results in which the test method result is positive.
• Negative agreement is the proportion of comparative/reference method negative results in which the test method result is negative.
• The focus on PPA or NPA depends on the context and the specific objectives of the test evaluation.
• PPA is particularly important when the consequence of a false negative is high. For example, in a disease screening context, a false negative (a sick person being incorrectly identified as healthy) could lead to delayed treatment and worse health outcomes. Therefore, a high PPA (few false negatives) is desirable.
• NPA is crucial when the cost of a false positive is significant. For instance, in a scenario where a positive test result could lead to an invasive follow-up procedure, a false positive (a healthy person being incorrectly identified as sick) could lead to unnecessary risk and anxiety for the patient. In this case, a high NPA (few false positives) is important.
• Paper examples.
• Other evaluation of diagnostic tests: Sensitivity/Specificity, Positive Predictive Value (PPV)/Negative Predictive Value (NPV), Overall Test Accuray, Diagnostic Odds Ratio (DOR), Youden Index (YI), Positive Likelihood Ratio (LR+). See
• Assume rows are new test and columns are reference.  reference + reference - test + TP (a) FP (c) test - FN (b) TN (d) x.1 x.2
PPA = TP/x.1 = TP/(TP+FN), NPA = TN/x.2 = TN/(FP+TN)
• APA (Average positive agreement) and ANA (Average negative agreement). See EVALUATION OF AUTOMATIC CLASS III DESIGNATION FOR Cell-Free DNA BCT and Agreement measures for binary and semi-quantitative data. Note, APA and ANA are not standard terms used in diagnostic test evaluation, and their definitions may vary depending on the context.
• Suppose $\displaystyle{ PPA = \frac{a}{a+b} }$, then $\displaystyle{ APA = \frac{a}{((a+b)+(a+c))/2} = \frac{2x_{11}}{x_{1.}+x_{.1}} = \frac{\mbox{n. of concordant positives}}{\mbox{n. of concordant positives} + \frac{\mbox{n. of disconcordant calls}}{2}} }$.
• Suppose $\displaystyle{ NPA = \frac{d}{c+d} }$, then $\displaystyle{ ANA = \frac{d}{((d+b)+(d+c))/2} = \frac{2x_{22}}{x_{2.}+x_{.2}} = \frac{\mbox{n. of concordant negatives}}{\mbox{n. of concordant negatives} + \frac{\mbox{n. of disconcordant calls}}{2}} }$.
• The overall proportion of agreement is the sum of the diagonal entries divided by the total.
• The above two APA and ANA statistics are in the same form as the Harrell's C-index. See Harrell's C-index formula at What is Harrell’s C-index?

## Random classifier

• Sensitivity and specificity should be close to 0.5 for a random classifier assuming a balanced class distribution in a binary classification problem.
• However, PPV or NPV is not close to 0.5 for a random classifier. PPV for a random classifier will depend on the class distribution in the dataset and the randomness of the classifier's guesses. It would vary randomly depending on the classifier's random guesses and the class distribution in the dataset.

# ROC curve

• Binary case:
• Y = true positive rate = sensitivity,
• X = false positive rate = 1-specificity = 假陽性率
• Area under the curve AUC from the wikipedia: the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
$\displaystyle{ A = \int_{\infty}^{-\infty} \mbox{TPR}(T) \mbox{FPR}'(T) \, dT = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(T'\gt T)f_1(T') f_0(T) \, dT' \, dT = P(X_1 \gt X_0) }$
where $\displaystyle{ X_1 }$ is the score for a positive instance and $\displaystyle{ X_0 }$ is the score for a negative instance, and $\displaystyle{ f_0 }$ and $\displaystyle{ f_1 }$ are probability densities as defined in previous section.
• Interpretation of the AUC. A small toy example (n=12=4+8) was used to calculate the exact probability $\displaystyle{ P(X_1 \gt X_0) }$ (4*8=32 all combinations).
• It is a discrimination measure which tells us how well we can classify patients in two groups: those with and those without the outcome of interest.
• Since the measure is based on ranks, it is not sensitive to systematic errors in the calibration of the quantitative tests.
• The AUC can be defined as The probability that a randomly selected case will have a higher test result than a randomly selected control.
• Plot of sensitivity/specificity (y-axis) vs cutoff points of the biomarker
• The Mann-Whitney U test statistic (or Wilcoxon or Kruskall-Wallis test statistic) is equivalent to the AUC (Mason, 2002). Verify in R.
library(pROC)
set.seed(123)
group1 <- rnorm(50)
group2 <- rnorm(50, mean=1)

roc_obj <- roc(rep(c(0,1), each=50), c(group1, group2))
# Setting levels: control = 0, case = 1
# Setting direction: controls < cases
auc(roc_obj)
# Area under the curve: 0.8036

wilcox.test(group2, group1)$statistic/(50*50) # notice the group1,group2 order # W # 0.8036  • The p-value of the Mann-Whitney U test can thus safely be used to test whether the AUC differs significantly from 0.5 (AUC of an uninformative test). • Calculate AUC by hand. AUC is equal to the probability that a true positive is scored greater than a true negative. • See the uROC() function in <functions.R> from the supplementary of the paper (need access right) Bivariate Marker Measurements and ROC Analysis Wang 2012. Let $\displaystyle{ n_1 }$ be the number of obs from X1 and $\displaystyle{ n_0 }$ be the number of obs from X0. X1 and X0 are the predict values for data from group 1 and 0. $\displaystyle{ TP_i=Prob(X_1\gt X_{0i})=\sum_j (X_{1j} \gt X_{0i})/n_1, ~ FP_i=Prob(X_0\gt X_{0i}) = \sum_j (X_{0j} \gt X_{0i}) / n_0 }$. We can draw a scatter plot or smooth.spline() of TP(y-axis) vs FP(x-axis) for the ROC curve. uROC <- function(marker, status) ### ROC function for univariate marker ### { x <- marker bad <- is.na(status) | is.na(x) status <- status[!bad] x <- x[!bad] if (sum(bad) > 0) cat(paste("\n", sum(bad), "records with missing values dropped. \n")) no_case <- sum(status==1) no_control <- sum(status==0) TP <- rep(0, no_control) FP <- rep(0, no_control) for (i in 1: no_control){ TP[i] <- sum(x[status==1]>x[status==0][i])/no_case FP[i] <- sum(x[status==0]>x[status==0][i])/no_control } list(TP = TP, FP = FP) }  ## partial AUC ## summary ROC ## Weighted ROC ## Adjusted AUC ## Difficult to compute for some models ## Optimal threshold, Youden's index/Youden's J statistic ## ROC Curve AUC for Hypothesis Testing ## Challenges, issues Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review 2022. class imbalance, data pre-processing, and hyperparameter tuning. twitter. # Survival data 'Survival Model Predictive Accuracy and ROC Curves' by Heagerty & Zheng 2005 • Recall Sensitivity= $\displaystyle{ P(\hat{p_i} \gt c | Y_i=1) }$, Specificity= $\displaystyle{ P(\hat{p}_i \le c | Y_i=0 }$), $\displaystyle{ Y_i }$ is binary outcomes, $\displaystyle{ \hat{p}_i }$ is a prediction, $\displaystyle{ c }$ is a criterion for classifying the prediction as positive ($\displaystyle{ \hat{p}_i \gt c }$) or negative ($\displaystyle{ \hat{p}_i \le c }$). • For survival data, we need to use a fixed time/horizon (t) to classify the data as either a case or a control. Following Heagerty and Zheng's definition in Survival Model Predictive Accuracy and ROC Curves (Incident/dynamic) 2005, Sensitivity(c, t)= $\displaystyle{ P(M_i \gt c | T_i = t) }$, Specificity= $\displaystyle{ P(M_i \le c | T_i \gt t }$) where M is a marker value or $\displaystyle{ Z^T \beta }$. Here sensitivity measures the expected fraction of subjects with a marker greater than c among the subpopulation of individuals who die at time t, while specificity measures the fraction of subjects with a marker less than or equal to c among those who survive beyond time t. • The AUC measures the probability that the marker value for a randomly selected case exceeds the marker value for a randomly selected control • ROC curves are useful for comparing the discriminatory capacity of different potential biomarkers. # Precision recall curve ## F1-score • F-score/F1-score. • F1 = 2 * Precision * Recall / (Precision + Recall) = 2 / (1/Precision + 1/Recall) Harmonic mean of precision and recall. • This means that comparison of the F-score across different problems with differing class ratios is problematic. • Recall = Sensitivity = true positive rate. Precision = PPV. They are from the two directions of the "1" class in the confusion matrix. • The F1 score • How to interpret F1 score (simply explained). It is a popular metric to use for classification models as it provides accurate results for both balanced and imbalanced datasets, and takes into account both the precision and recall ability of the model. F1 score tells you the model’s balanced ability to both capture positive cases (recall) and be accurate with the cases it does capture (precision). • What is Considered a “Good” F1 Score? It gives an example of F1-score where all the predictions are on one class (baseline model) for comparison purpose. • An example. confusionMatrix() assumes the rows are predicted and columns are true; that's why I put a transpose on my table object. The 'positive' class seems to be always the 1st row of the 2x2 table. If the levels are not specified, it will assign 'A' to the first row and 'B' to the 2nd row. For this case, we can see accuracy (0.615) is between sensitivity (0.429) and specificity (0.684) and F1-score (0.375) is between precision (0.333) and recall (0.429) . Also F1-score is smaller than sensitivity and specificity.  predicted sensitive resistant ------------------------------- true sensitive 3 4 resistant 6 13 xtab <- matrix(c(3,6,4,13),nr=2) confusionMatrix(t(xtab))$byClass
#         Sensitivity          Specificity       Pos Pred Value       Neg Pred Value
#           0.4285714            0.6842105            0.3333333            0.7647059
#           Precision               Recall                   F1           Prevalence
#           0.3333333            0.4285714            0.3750000            0.2692308
#      Detection Rate Detection Prevalence    Balanced Accuracy
#           0.1153846            0.3461538            0.5563910

confusionMatrix(t(xtab))$overall # Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull # 0.6153846 0.1034483 0.4057075 0.7977398 0.7307692 # AccuracyPValue McnemarPValue # 0.9348215 0.7518296 confusionMatrix(t(xtab))$positive
#  "A"

• Using F1 score metric in KNN through caret package
library(caret)
library(mlbench)
data(Sonar)

f1Summary <- function(data, lev = NULL, model = NULL) {
# Calculate precision and recall
precision <- posPredValue(data$pred, data$obs, positive = lev)
recall <- sensitivity(data$pred, data$obs, positive = lev)

# Calculate F1 score
f1 <- (2 * precision * recall) / (precision + recall)

# Return F1 score
c(F1 = f1)
}

# Create a trainControl object that uses the custom summary function
trainControl <- trainControl(summaryFunction = f1Summary)

# Use the trainControl object when calling the train() function
model <- train(Class ~., data = Sonar, method = "rf",
trControl = trainControl, metric = "F1")
> model
Random Forest

208 samples
60 predictor
2 classes: 'M', 'R'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
Resampling results across tuning parameters:

mtry  F1
2    0.8416808
31    0.7986326
60    0.7820982

F1 was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

• Examples in papers

## PPV/Precision

The relationship between PPV and sensitivity depends on the prevalence of the condition being tested for in the population. If the condition is rare, a test with high sensitivity may still have a low PPV because there will be many false positive results. On the other hand, if the condition is common, a test with high sensitivity is more likely to also have a high PPV.

Here we have a diagnostic test for a rare disease that affects 1 in 1000 people and the test has a sensitivity of 100% and a false positive rate of 5%:

                Test| positive  negative
-----------------------------------------
Disease present     |  1          0

Disease absent      |  50        949


The prevalence of the disease is 1/1000.

# genefilter package and rowpAUCs function

• rowpAUCs function in genefilter package. The aim is to find potential biomarkers whose expression level is able to distinguish between two groups.
# source("http://www.bioconductor.org/biocLite.R")
# biocLite("genefilter")
library(Biobase) # sample.ExpressionSet data
data(sample.ExpressionSet)

library(genefilter)
r2 = rowpAUCs(sample.ExpressionSet, "sex", p=0.1)
plot(r2) # first gene, asking specificity = .9

r2 = rowpAUCs(sample.ExpressionSet, "sex", p=1.0)
plot(r2) # it won't show pAUC

r2 = rowpAUCs(sample.ExpressionSet, "sex", p=.999)
plot(r2) # pAUC is very close to AUC now


# Some R packages

## pROC

• https://cran.r-project.org/web/packages/pROC/index.html
• response: a factor, numeric or character vector of responses (true class), typically encoded with 0 (controls) and 1 (cases).
• Pay attention to the message printed on screen after calling roc(). It will show which one is control and which one is case.
• pROC: display and analyze ROC curves in R and S+
• ROC Curve and AUC in Machine learning and R pROC Package. Pay attention to the legacy.axes parameter.
library(pROC)
data(aSAH)
str(aSAH[,c('outcome', "s100b")])
# 'data.frame':	113 obs. of  2 variables:
#  $outcome: Factor w/ 2 levels "Good","Poor": 1 1 1 1 2 2 1 2 1 1 ... #$ s100b  : num  0.13 0.14 0.1 0.04 0.13 0.1 0.47 0.16 0.18 0.1 ...
roc.s100b <- roc(aSAH$outcome, aSAH$s100b)

roc(aSAH$outcome, aSAH$s100b,
plot=TRUE,
col="green",
lwd =4,
legacy.axes=TRUE,
main="ROC Curves")
# Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor). # Area under the curve: 0.7314 auc(roc.s100b) # Area under the curve: 0.7314 auc(aSAH$outcome, aSAH$s100b) # Area under the curve: 0.7314  Note: in pROC::roc() or auc(), the response is on the 1st argument while in caret::confusionMatrix(), truth/reference is on the 2nd argument. If we flipped the outcomes, it won't affect AUC. aSAH$outcome2 <- factor(ifelse(aSAH$outcome == "Good", "Poor", "Good"), levels=c("Good", "Poor")) roc(aSAH$outcome2, aSAH$s100b) # Data: aSAH$s100b in 41 controls (aSAH$outcome2 Good) > 72 cases (aSAH$outcome2 Poor).
# Area under the curve: 0.7314

• Plot multiple ROC curves in one plot.
roc1 <- roc()
roc2 <- roc()
roc3 <- roc()
plot(roc1)
lines(roc2, col="red")
lines(roc3, col="blue")


Another solution is to use pROC::ggroc(). The annotation was taken care automatically.

rocobj <- roc(aSAH$outcome, aSAH$s100b)
rocobj2 <- roc(aSAH$outcome, aSAH$wfns)
g2 <- ggroc(list(s100b=rocobj,
wfns=rocobj2,
ndka=roc(aSAH$outcome, aSAH$ndka))) # 3 curves

• Plot a panel of ROC plots by using facet_wrap().

# Confidence interval of AUC

How to get an AUC confidence interval. pROC package was used.

# AUC can be a misleading measure of performance

AUC is high but precision is low (i.e. FDR is high). https://twitter.com/michaelhoffman/status/1398380674206285830?s=09.

# Lessons

• Unbalanced data: kNN or nearest centroid is better than the traditional methods
• Small sample size and large number of predictors: t-test can select predictors while lasso cannot