ROC: Difference between revisions
= Limitation in clinical data =
* [https://www.sciencedirect.com/science/article/abs/pii/S0895435618310047 ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models] Verbakel 2020
= Sample size =
* [https://twitter.com/GSCollins/status/1563916550121426951 Sample size MATTERS - don't ignore it].
** [https://onlinelibrary.wiley.com/doi/10.1002/sim.7992 Minimum sample size for '''developing''' a multivariable prediction model: PART II - binary and time-to-event outcomes],
** [https://onlinelibrary.wiley.com/doi/10.1002/sim.9025 Minimum sample size for '''external validation''' of a clinical prediction model with a binary outcome]
= Picking a threshold based on model performance/utility =
Revision as of 17:13, 28 August 2022
ROC curve
- Binary case:
- Y = true positive rate = sensitivity,
- X = false positive rate = 1-specificity
- Area under the curve AUC from the wikipedia: the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
- [math]\displaystyle{ A = \int_{\infty}^{-\infty} \mbox{TPR}(T) \mbox{FPR}'(T) \, dT = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(T'\gt T)f_1(T') f_0(T) \, dT' \, dT = P(X_1 \gt X_0) }[/math]
- Interpretation of the AUC. A small toy example (n=12=4+8) was used to calculate the exact probability [math]\displaystyle{ P(X_1 \gt X_0) }[/math] (4*8=32 all combinations).
- It is a discrimination measure which tells us how well we can classify patients in two groups: those with and those without the outcome of interest.
- Since the measure is based on ranks, it is not sensitive to systematic errors in the calibration of the quantitative tests.
- The AUC can be defined as the probability that a randomly selected case will have a higher test result than a randomly selected control.
- Plot of sensitivity/specificity (y-axis) vs cutoff points of the biomarker
- The Mann-Whitney U test statistic (or Wilcoxon or Kruskal-Wallis test statistic) is equivalent to the AUC (Mason, 2002): dividing U by the number of case-control pairs gives the AUC.
- The p-value of the Mann-Whitney U test can thus safely be used to test whether the AUC differs significantly from 0.5 (AUC of an uninformative test).
- Calculate AUC by hand. AUC is equal to the probability that a true positive is scored greater than a true negative.
- See the uROC() function in <functions.R> from the supplement of the paper (access rights required) Bivariate Marker Measurements and ROC Analysis Wang 2012. Let [math]\displaystyle{ n_1 }[/math] be the number of observations from X1 and [math]\displaystyle{ n_0 }[/math] be the number of observations from X0. X1 and X0 are the predicted values for data from groups 1 and 0. [math]\displaystyle{ TP_i=Prob(X_1\gt X_{0i})=\sum_j I(X_{1j} \gt X_{0i})/n_1, ~ FP_i=Prob(X_0\gt X_{0i}) = \sum_j I(X_{0j} \gt X_{0i}) / n_0 }[/math]. We can draw a scatter plot or a smooth.spline() fit of TP (y-axis) vs FP (x-axis) for the ROC curve.
uROC <- function(marker, status)   ### ROC function for univariate marker ###
{
    x <- marker
    bad <- is.na(status) | is.na(x)
    status <- status[!bad]
    x <- x[!bad]
    if (sum(bad) > 0)
        cat(paste("\n", sum(bad), "records with missing values dropped. \n"))
    no_case <- sum(status==1)
    no_control <- sum(status==0)
    TP <- rep(0, no_control)
    FP <- rep(0, no_control)
    for (i in 1: no_control){
        TP[i] <- sum(x[status==1]>x[status==0][i])/no_case
        FP[i] <- sum(x[status==0]>x[status==0][i])/no_control
    }
    list(TP = TP, FP = FP)
}
- How to calculate Area Under the Curve (AUC), or the c-statistic, by hand or by R
- Introduction to the ROCR package. Add threshold labels
- http://freakonometrics.hypotheses.org/9066, http://freakonometrics.hypotheses.org/20002
- Illustrated Guide to ROC and AUC
- ROC Curves in Two Lines of R Code
- Learning Data Science: Understanding ROC Curves
- Gini and AUC. Gini = 2*AUC-1.
- As a rough rule of thumb, an AUC above 0.7 is often taken to indicate a model that distinguishes the two outcomes reasonably well, although what counts as adequate depends on the application. An AUC of 0.5 means the model does no better than random guessing at separating the two outcomes.
- ROC Day at BARUG
- ROC and AUC, Clearly Explained! StatQuest
- Optimal threshold
- Precision/PPV (the proportion of positive calls that are truly positive) can be used in place of the false positive rate, giving a precision-recall curve. Useful for unbalanced data.
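The probabilistic reading of the AUC given above, P(X_1 > X_0), and its link to the Mann-Whitney U statistic can be checked numerically. The following is a minimal base-R sketch on made-up data (4 cases and 8 controls; the values are hypothetical and are not those of the linked toy example), with ties counted as 1/2.

```r
# Hypothetical marker values, chosen for illustration only
x1 <- c(0.9, 0.8, 0.6, 0.35)                       # cases
x0 <- c(0.7, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05)   # controls

# Compare all 4 * 8 = 32 case-control pairs; a tie counts as 1/2
cmp <- outer(x1, x0, function(a, b) (a > b) + 0.5 * (a == b))
auc_pairs <- mean(cmp)

# wilcox.test() reports the Mann-Whitney U statistic as W;
# exact = FALSE avoids a warning caused by the tied control values
U <- unname(wilcox.test(x1, x0, exact = FALSE)$statistic)
auc_mw <- U / (length(x1) * length(x0))

gini <- 2 * auc_pairs - 1
c(pairs = auc_pairs, mann_whitney = auc_mw, gini = gini)
```

In this toy data 28 of the 32 pairs are concordant, so both estimates equal 0.875 and the Gini coefficient is 0.75.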
partial AUC
- https://onlinelibrary.wiley.com/doi/10.1111/j.1541-0420.2012.01783.x
- pROC: an open-source package for R and S+ to analyze and compare ROC curves
- Partial AUC Estimation and Regression Dodd 2003. [math]\displaystyle{ AUC(t_0,t_1) = \int_{t_0}^{t_1} ROC(t) dt }[/math] where the interval [math]\displaystyle{ (t_0, t_1) }[/math] denotes the false-positive rates of interest.
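As an illustration of the integral above, the partial AUC of the empirical (staircase) ROC curve can be computed directly. This is only a sketch under stated assumptions: the data are hypothetical, no case score ties with a control score, and `pauc()` is a helper name made up here, not a function from pROC or from Dodd 2003.

```r
# Hypothetical marker values
x1 <- c(0.9, 0.8, 0.6, 0.35)                       # cases
x0 <- c(0.7, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05)   # controls

# Area under the empirical ROC curve for false-positive rates in [t0, t1]
pauc <- function(x1, x0, t0 = 0, t1 = 1) {
  n0 <- length(x0)
  x0_desc <- sort(x0, decreasing = TRUE)
  # Height of the ROC step associated with the k-th largest control value
  tp <- sapply(x0_desc, function(c0) mean(x1 > c0))
  # That step spans FPR from (k-1)/n0 to k/n0; keep the part inside [t0, t1]
  k <- seq_len(n0)
  width <- pmax(0, pmin(t1, k / n0) - pmax(t0, (k - 1) / n0))
  sum(width * tp)
}

pauc(x1, x0)              # full AUC
pauc(x1, x0, 0, 0.25)     # partial AUC for FPR between 0 and 0.25
```

With t0 = 0 and t1 = 1 the function returns the ordinary AUC (0.875 for these data); restricting to FPR in (0, 0.25) gives 0.15625, out of a maximum possible value of 0.25.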
summary ROC
- The partial area under the summary ROC curve Walter 2005
- On summary ROC curve for dichotomous diagnostic studies: an application to meta-analysis of COVID-19 2022
Weighted ROC
- What is the difference between area under roc and weighted area under roc? Weighted ROC curves are used when you are interested in performance in a certain region of ROC space (e.g. high recall). They were proposed as an improvement over partial AUC (which does exactly this but has some issues).
Adjusted AUC
- 'auc.adjust': R function for optimism-adjusted AUC (internal validation)
- GmAMisc::aucadj(data, fit, B = 200)
Difficult to compute for some models
- Plot ROC curve for Nearest Centroid. For NearestCentroid it is not possible to compute a score. This is simply a limitation of the model.
- k-NN model. class::knn() can output prediction probability.
- predict.randomForest() can output class probabilities. See ROC curve for classification from randomForest
Optimal threshold
- Max of “sensitivity + specificity”. See Epi::ROC() function.
- On optimal biomarker cutoffs accounting for misclassification costs in diagnostic trilemmas with applications to pancreatic cancer Bantis, 2022
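The "max of sensitivity + specificity" rule (equivalently, the maximum of Youden's J = sensitivity + specificity - 1) can be sketched in base R as follows. The scores and labels are hypothetical, and a subject is called positive when its score is at or above the cutoff; this is an illustration of the rule, not the implementation used by Epi::ROC().

```r
# Hypothetical scores and labels (1 = case, 0 = control)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.45, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05)
status <- c(1,   1,   0,   1,   1,    0,   0,   0,   0,   0,   0,   0)

# Candidate cutoffs: every observed score
cutoffs <- sort(unique(score))

# Positive call when score >= cutoff
sens <- sapply(cutoffs, function(c0) mean(score[status == 1] >= c0))
spec <- sapply(cutoffs, function(c0) mean(score[status == 0] <  c0))
youden <- sens + spec - 1

best <- which.max(youden)   # first cutoff attaining the maximum
data.frame(cutoff = cutoffs[best], sens = sens[best],
           spec = spec[best], J = youden[best])
```

For these data the selected cutoff is 0.45, with sensitivity 1, specificity 0.75 and J = 0.75. This rule weights false positives and false negatives equally; when misclassification costs differ, a cost-weighted criterion may be more appropriate.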
ROC Curve AUC for Hypothesis Testing
- Interpreting AUROC in Hypothesis Testing
- Receiver Operating Characteristic Curve in Diagnostic Test Assessment 2010
Challenges, issues
Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review 2022. class imbalance, data pre-processing, and hyperparameter tuning. twitter.
Survival data
'Survival Model Predictive Accuracy and ROC Curves' by Heagerty & Zheng 2005
- Recall that Sensitivity= [math]\displaystyle{ P(\hat{p}_i \gt c | Y_i=1) }[/math], Specificity= [math]\displaystyle{ P(\hat{p}_i \le c | Y_i=0) }[/math], where [math]\displaystyle{ Y_i }[/math] is the binary outcome, [math]\displaystyle{ \hat{p}_i }[/math] is a prediction, and [math]\displaystyle{ c }[/math] is a criterion for classifying the prediction as positive ([math]\displaystyle{ \hat{p}_i \gt c }[/math]) or negative ([math]\displaystyle{ \hat{p}_i \le c }[/math]).
- For survival data, we need to use a fixed time/horizon (t) to classify the data as either a case or a control. Following Heagerty and Zheng's definition in Survival Model Predictive Accuracy and ROC Curves (Incident/dynamic) 2005, Sensitivity(c, t)= [math]\displaystyle{ P(M_i \gt c | T_i = t) }[/math], Specificity(c, t)= [math]\displaystyle{ P(M_i \le c | T_i \gt t) }[/math], where M is a marker value or [math]\displaystyle{ Z^T \beta }[/math]. Here sensitivity measures the expected fraction of subjects with a marker greater than c among the subpopulation of individuals who die at time t, while specificity measures the fraction of subjects with a marker less than or equal to c among those who survive beyond time t.
- The AUC measures the probability that the marker value for a randomly selected case exceeds the marker value for a randomly selected control
- ROC curves are useful for comparing the discriminatory capacity of different potential biomarkers.
Confusion matrix, Sensitivity/Specificity/Accuracy
|        | Predict 1 | Predict 0 | |
| True 1 | TP | FN | Sens=TP/(TP+FN)=Recall=TPR, FNR=FN/(TP+FN) |
| True 0 | FP | TN | Spec=TN/(FP+TN), 1-Spec=FPR |
|        | PPV=TP/(TP+FP), FDR=FP/(TP+FP)=1-PPV | NPV=TN/(FN+TN) | N = TP + FP + FN + TN |
- Sensitivity = TP / (TP + FN) = Recall
- Specificity = TN / (TN + FP)
- Accuracy = (TP + TN) / N
- False discovery rate FDR = FP / (TP + FP)
- False negative rate FNR = FN / (TP + FN)
- False positive rate FPR = FP / (FP + TN) = 1 - Spec
- True positive rate = TP / (TP + FN) = Sensitivity
- Positive predictive value (PPV) = TP / # positive calls = TP / (TP + FP) = 1 - FDR = Precision
- Negative predictive value (NPV) = TN / # negative calls = TN / (FN + TN)
- Prevalence = (TP + FN) / N.
- Note that PPV & NPV can also be computed from sensitivity, specificity, and prevalence:
- PPV increases with the prevalence of the disease or condition (for fixed sensitivity and specificity), although the relationship is not strictly proportional.
- For example, in the extreme case if the prevalence =1, then PPV is always 1.
- [math]\displaystyle{ \text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence}+(1-\text{specificity}) \times (1-\text{prevalence})} }[/math]
- [math]\displaystyle{ \text{NPV} = \frac{\text{specificity} \times (1-\text{prevalence})}{(1-\text{sensitivity}) \times \text{prevalence}+\text{specificity} \times (1-\text{prevalence})} }[/math]
- Prediction of heart disease and classifiers’ sensitivity analysis Almustafa, 2020
- Positive percent agreement (PPA) and negative percent agreement (NPA)
- ConfusionTableR has made it to CRAN 21/07/2021. ConfusionTableR
- Precision, Recall, Specificity, Prevalence, Kappa, F1-score check with R by using the caret::confusionMatrix() function. If there are only two factor levels, the first level will be used as the "positive" result.
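The definitions and the two prevalence formulas above can be checked against each other with a small base-R sketch. The counts are hypothetical, and `ppv_at()` is a helper name introduced here only for illustration.

```r
# Hypothetical confusion-matrix counts
TP <- 80; FN <- 20; FP <- 90; TN <- 810
N  <- TP + FN + FP + TN

sens <- TP / (TP + FN)
spec <- TN / (TN + FP)
prev <- (TP + FN) / N

# PPV and NPV directly from the counts
ppv_counts <- TP / (TP + FP)
npv_counts <- TN / (TN + FN)

# PPV and NPV from sensitivity, specificity and prevalence
ppv_bayes <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv_bayes <- spec * (1 - prev) /
  ((1 - sens) * prev + spec * (1 - prev))

# PPV of the same test at other prevalences
ppv_at <- function(p) sens * p / (sens * p + (1 - spec) * (1 - p))
round(c(ppv_counts = ppv_counts, ppv_bayes = ppv_bayes,
        ppv_prev_1pct = ppv_at(0.01), ppv_prev_100pct = ppv_at(1)), 4)
```

Both routes give the same PPV (80/170) and NPV (810/830). With sensitivity 0.8 and specificity 0.9 held fixed, the PPV falls as prevalence drops from 10% to 1% and equals 1 when prevalence is 1.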
False positive rates vs false discovery rates
FPR (false positive rate) vs FDR (false discovery rate)
Precision recall curve
- Precision and recall from wikipedia
- Y-axis: Precision = tp/(tp + fp) = PPV. How accurately the model predicted the positive classes. large is better
- X-axis: Recall = tp/(tp + fn) = Sensitivity, large is better
- The Relationship Between Precision-Recall and ROC Curves. Remember ROC is defined as
- Y-axis: Sensitivity = tp/(tp + fn) = Recall
- X-axis: 1-Specificity = fp/(fp + tn)
- The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets
Incidence, Prevalence
https://www.health.ny.gov/diseases/chronic/basicstat.htm
Jaccard index
- https://en.wikipedia.org/wiki/Jaccard_index
- Clear Example of Jaccard Similarity // Visual Explanation of What is the Jaccard Index? (video). Jaccard similarity = TP / (TP + FP + FN)
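A short base-R sketch of the Jaccard formula above, using hypothetical counts. It also shows the relation to the F1-score mentioned elsewhere on this page, F1 = 2J / (1 + J), which follows from F1 = 2TP / (2TP + FP + FN).

```r
# Hypothetical confusion-matrix counts
TP <- 80; FN <- 20; FP <- 90; TN <- 810

# Jaccard similarity between predicted positives and true positives;
# note that TN does not enter the formula
jaccard <- TP / (TP + FP + FN)

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)

c(jaccard = jaccard, f1 = f1, f1_from_jaccard = 2 * jaccard / (1 + jaccard))
```

Because true negatives are ignored, the Jaccard index (like precision and F1) is unaffected by a large number of correctly rejected negatives.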
Calculate area under curve by hand (using trapezoid), relation to concordance measure and the Wilcoxon–Mann–Whitney test
- https://stats.stackexchange.com/a/146174
- The meaning and use of the area under a receiver operating characteristic (ROC) curve J A Hanley, B J McNeil 1982
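The trapezoid calculation and its agreement with the concordance (Wilcoxon-Mann-Whitney) measure can be sketched in base R. The data are hypothetical; the ROC points are computed at each distinct score so that tied scores are handled as a single diagonal segment.

```r
# Hypothetical scores and labels (1 = case, 0 = control)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.45, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05)
status <- c(1,   1,   0,   1,   1,    0,   0,   0,   0,   0,   0,   0)
x1 <- score[status == 1]
x0 <- score[status == 0]

# ROC points, sweeping the cutoff from the highest score to the lowest;
# the leading 0 is the point (FPR, TPR) = (0, 0)
cutoffs <- sort(unique(score), decreasing = TRUE)
tpr <- c(0, sapply(cutoffs, function(c0) mean(x1 >= c0)))
fpr <- c(0, sapply(cutoffs, function(c0) mean(x0 >= c0)))

# Trapezoid rule: width times average height of adjacent ROC points
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Concordance: fraction of case-control pairs ranked correctly (ties = 1/2)
auc_conc <- mean(outer(x1, x0, ">") + 0.5 * outer(x1, x0, "=="))

c(trapezoid = auc_trap, concordance = auc_conc)
```

Both quantities equal 29/32 = 0.90625 for these data.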
genefilter package and rowpAUCs function
- rowpAUCs function in genefilter package. The aim is to find potential biomarkers whose expression level is able to distinguish between two groups.
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("genefilter")   # replaces the retired biocLite() installer
library(Biobase) # sample.ExpressionSet data
data(sample.ExpressionSet)
library(genefilter)
r2 = rowpAUCs(sample.ExpressionSet, "sex", p=0.1)
plot(r2[1]) # first gene, asking specificity = .9
r2 = rowpAUCs(sample.ExpressionSet, "sex", p=1.0)
plot(r2[1]) # it won't show pAUC
r2 = rowpAUCs(sample.ExpressionSet, "sex", p=.999)
plot(r2[1]) # pAUC is very close to AUC now
Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction
- http://circ.ahajournals.org/content/115/7/928
- Calibration: the Achilles heel of predictive analytics 2019
Performance evaluation
- Testing for improvement in prediction model performance by Pepe et al 2013.
Youden's index/Youden's J statistic
Some R packages
- Some R Packages for ROC Curves
- ROCR 2005
- pROC 2010. get AUC and plot multiple ROC curves together at the same time
- PRROC 2014
- plotROC 2014
- precrec 2015
- ROCit 2019
- ROC from Bioconductor
- caret
- ROC animation
pROC
- https://cran.r-project.org/web/packages/pROC/index.html
- pROC: display and analyze ROC curves in R and S+
- ROC Curve and AUC in Machine learning and R pROC Package
library(pROC)
data(aSAH)
str(aSAH[,c('outcome', "s100b")])
# 'data.frame': 113 obs. of 2 variables:
# $ outcome: Factor w/ 2 levels "Good","Poor": 1 1 1 1 2 2 1 2 1 1 ...
# $ s100b : num 0.13 0.14 0.1 0.04 0.13 0.1 0.47 0.16 0.18 0.1 ...
roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc(aSAH$outcome, aSAH$s100b,
    plot=TRUE,
    auc=TRUE, # already the default
    col="green", lwd =4, legacy.axes=TRUE, main="ROC Curves")
# Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
# Area under the curve: 0.7314
auc(roc.s100b)
# Area under the curve: 0.7314
auc(aSAH$outcome, aSAH$s100b)
# Area under the curve: 0.7314
Note: in pROC::roc() or auc(), the response is on the 1st argument while in caret::confusionMatrix(), truth/reference is on the 2nd argument.
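If we flip the outcome labels, the AUC reported by roc() does not change. With the default direction="auto", roc() chooses the comparison direction from the group medians (note the ">" instead of "<" in the output), so the reported AUC stays at 0.7314 rather than becoming 1 - 0.7314.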
aSAH$outcome2 <- factor(ifelse(aSAH$outcome == "Good", "Poor", "Good"),
                        levels=c("Good", "Poor"))
roc(aSAH$outcome2, aSAH$s100b)
# Data: aSAH$s100b in 41 controls (aSAH$outcome2 Good) > 72 cases (aSAH$outcome2 Poor).
# Area under the curve: 0.7314
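For contrast, when the comparison direction is held fixed (cases are always expected to score higher), swapping the labels turns the AUC into 1 - AUC. The base-R sketch below illustrates this on hypothetical data; `conc()` is a helper name introduced here and is not part of pROC.

```r
# Hypothetical marker values
x1 <- c(0.9, 0.8, 0.6, 0.35)                       # originally labelled cases
x0 <- c(0.7, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.05)   # originally labelled controls

# P(first group scores higher than second group), ties counted as 1/2
conc <- function(a, b) mean(outer(a, b, ">") + 0.5 * outer(a, b, "=="))

auc_original <- conc(x1, x0)   # cases vs controls
auc_flipped  <- conc(x0, x1)   # labels swapped, direction unchanged

c(original = auc_original, flipped = auc_flipped)
```

Here the two values are 0.875 and 0.125, which sum to 1. A tool that picks the direction automatically would report 0.875 in both cases.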
Cross-validation ROC
- ROC cross-validation caret::train(, metric = "ROC")
- Feature selection + cross-validation, but how to make ROC-curves in R. Small samples within the cross-validation may lead to underestimated AUC as the ROC curve with all data will tend to be smoother and less underestimated by the trapezoidal rule.
- How to easily make a ROC curve in R
- Appropriate way to get Cross Validated AUC
- cvAUC package as linked from Some R Packages for ROC Curves
mean ROC curve
ROC with cross-validation for linear regression in R
Comparison of two AUCs
- Statistical Assessments of AUC. This is using the pROC::roc.test function.
- prioritylasso. It is using roc(), auc(), roc.test(), plot.roc() from the pROC package. The calculation based on the training data is biased so we need to report the one based on test data.
Assess risk of bias
PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration 2019. http://www.probast.org/
Confidence interval of AUC
How to get an AUC confidence interval. pROC package was used.
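One generic way to obtain such an interval, independent of any package, is a percentile bootstrap that resamples cases and controls separately. The sketch below uses simulated (hypothetical) data and a made-up helper `auc_hat()`; it is not necessarily the method used in the linked post.

```r
set.seed(1)
x1 <- rnorm(40, mean = 1)   # simulated marker values for cases
x0 <- rnorm(60, mean = 0)   # simulated marker values for controls

# Concordance estimate of the AUC (ties counted as 1/2)
auc_hat <- function(a, b) {
  mean(outer(a, b, function(u, v) (u > v) + 0.5 * (u == v)))
}

# Stratified bootstrap: resample within each group, recompute the AUC
B <- 2000
boot <- replicate(B, auc_hat(sample(x1, replace = TRUE),
                             sample(x0, replace = TRUE)))

est <- auc_hat(x1, x0)
ci  <- quantile(boot, c(0.025, 0.975))   # 95% percentile interval
est
ci
```

Resampling within each group keeps the numbers of cases and controls fixed in every bootstrap replicate, which matters when one class is small.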
DeLong test for comparing two ROC curves
- Comparing AUCs of Machine Learning Models with DeLong’s Test
- Misuse of DeLong test to compare AUCs for nested models
- What is the DeLong test for comparing AUCs?
- R language, ROC curve, DeLong test. pROC::roc.test() was used.
- Daim::deLong.test()
AUC can be a misleading measure of performance
AUC is high but precision is low (i.e. FDR is high). https://twitter.com/michaelhoffman/status/1398380674206285830?s=09.
Caveats and pitfalls of ROC analysis in clinical microarray research
Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them) Berrar 2011
Limitation in clinical data
Sample size
Picking a threshold based on model performance/utility
Squeezing the Most Utility from Your Models
Why does my ROC curve look like a V
https://stackoverflow.com/a/42917384
Unbalanced classes
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- How to Fix k-Fold Cross-Validation for Imbalanced Classification. It teaches you how to split samples in CV by using stratified k-fold cross-validation.
- ROC is especially useful for unbalanced data where the 0.5 threshold may not be appropriate.
- Use Precision/PPV in place of the false positive rate (FPR), i.e. a precision-recall curve
- Practical Guide to deal with Imbalanced Classification Problems in R
- The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets
- ROC animation
- Undersampling By Groups In R. See the ROSE package & its paper in 2014.
- imbalance package
- Chapter 11 Subsampling For Class Imbalances from the caret package documentation
- SMOTE
- Classification Trees for Imbalanced Data: Surface-to-Volume Regularization Zhu, JASA 2021
- The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression 2022
- How to handle Imbalanced Data?
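The stratified k-fold splitting mentioned above can be sketched in a few lines of base R: fold labels are assigned separately within each class so that every fold keeps roughly the same class ratio. The outcome vector is hypothetical (10% minority class), and this is only an illustration, not the code from the linked tutorial.

```r
set.seed(1)
# Hypothetical outcome with a 10% minority class
y <- factor(c(rep("case", 10), rep("control", 90)))

k <- 5
fold <- integer(length(y))
for (lev in levels(y)) {
  idx <- which(y == lev)
  # Build balanced fold labels 1..k for this class, then shuffle them
  fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

tab <- table(fold, y)
tab
```

Each of the 5 folds ends up with 2 cases and 18 controls, so no fold is left without minority-class samples when the AUC is computed on it.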
Metric
- Tour of Evaluation Metrics for Imbalanced Classification. More strategies are available.
- F-score
- tidymodels::f_meas(), Modelling with Tidymodels and Parsnip, Modelling Binary Logistic Regression using Tidymodels Library in R
- caret::train(,metric) from Caret vs. tidymodels — create reusable machine learning workflows
- MLmetrics: Machine Learning Evaluation Metrics
- Classification/evaluation metrics for highly imbalanced data
- What metrics should be used for evaluating a model on an imbalanced data set? (precision + recall or ROC=TPR+FPR)
Class comparison problem
- compcodeR: RNAseq data simulation, differential expression analysis and performance comparison of differential expression methods
- Polyester: simulating RNA-seq datasets with differential transcript expression, github, HTML
Reporting
- Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement 2015
- Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review Yang 2022
Applications
Lessons
- Unbalanced data: kNN or nearest centroid is better than the traditional methods
- Small sample size and large number of predictors: t-test can select predictors while lasso cannot