Common misconceptions and pitfalls of using ROC

 

Receiver operating characteristic (ROC) curve

An ROC curve is a commonly used technique to visualise, organise, and select or compare classifiers based on their performance, where a classifier is usually binary, i.e., has two possible outcomes. Historically, ROC analysis comes from WW-II, where it was used to assess the performance of signal detection, specifically to decide whether a signal from radar was a true positive or a false positive. The person who operated the radar receiver was known as the ‘receiver operator’, and hence ROC got its name :) Sometime in the 1970s ROC started making its appearance in the field of medicine, where it was used to evaluate and compare diagnostic tests. Now, I will elaborate on the different components of an ROC, what they mean, and how they are used.


Let's take an example where an observed instance is mapped to one of two class labels, say COVID positive or negative. The test can be based on a variety of factors, a.k.a. features, and the ‘response’, ‘output’, or ‘target’ variable is whether a patient has COVID or not. Let's plot the value of the test score on the x-axis and pick a cutoff above which we call the test positive and below which we call it negative. Here, blue and green represent patients who are truly negative and truly positive, respectively (Fig 1).


Fig 1: Predicted probabilities vs count of observations

 

The part of the green (positive) distribution that falls above the cutoff is counted as true positives (TP), while positive patients who fall below the cutoff, in the region we call negative, are false negatives (FN). Similarly, the part of the blue (negative) distribution below the cutoff is counted as true negatives (TN), and negative patients who fall above the cutoff, in the region we call positive, are false positives (FP).

Here, the proportion of truly positive patients that the test correctly calls positive is the sensitivity (SN); in other words, SN tells us how often the test catches the positives. Likewise, the proportion of truly negative patients correctly called negative is the specificity (SP) of the test. As you can see, the SN and SP values clearly depend on the cutoff we choose, and this threshold choice is ultimately a business decision (Fig 2).
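
To make the cutoff dependence concrete, here is a minimal Python sketch (all scores, class sizes, and cutoffs are made up for illustration) that draws two overlapping score distributions, applies a few cutoffs, and reports the resulting SN and SP:

```python
import numpy as np

# Hypothetical test scores: green = positives, blue = negatives (illustrative only)
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.35, 0.15, 500),   # negatives (blue)
                         rng.normal(0.65, 0.15, 500)])  # positives (green)
labels = np.concatenate([np.zeros(500), np.ones(500)])

def sn_sp(scores, labels, cutoff):
    pred_pos = scores >= cutoff
    tp = np.sum(pred_pos & (labels == 1))
    fn = np.sum(~pred_pos & (labels == 1))
    tn = np.sum(~pred_pos & (labels == 0))
    fp = np.sum(pred_pos & (labels == 0))
    sensitivity = tp / (tp + fn)   # proportion of positives called positive
    specificity = tn / (tn + fp)   # proportion of negatives called negative
    return sensitivity, specificity

for cutoff in (0.3, 0.5, 0.7):
    sn, sp = sn_sp(scores, labels, cutoff)
    print(f"cutoff={cutoff:.1f}  SN={sn:.2f}  SP={sp:.2f}")
# Raising the cutoff lowers SN and raises SP, and vice versa.
```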

Fig 2: Predicted probabilities vs count of observations

Let me also introduce some more terminology here to explain the various associated statistics:

So we have “true” class labels (P, N) and the labels “predicted” by our test (Y, N). A classifier, or classification model, is a mapping from observed instances to predicted classes. Some classification models produce a continuous output (e.g., an estimate of an instance's class membership probability), to which different thresholds may be applied to predict class membership. Others produce a discrete outcome, i.e., directly a class label.

As explained above, given a classifier and an instance, there are four possible outcomes: TP, FP, TN, and FN. The decisions made by a classifier on a set of instances (the test set) can be represented in a two-by-two confusion matrix (a.k.a. contingency table). This table summarises the classifier's performance at a particular threshold, which can then be plotted as a point in a 2D graph.

Table 1: Confusion matrix


                    True class: Positive      True class: Negative
Predicted: Yes      True Positive (TP)        False Positive (FP)
Predicted: No       False Negative (FN)       True Negative (TN)


Common performance metrics:


  • TPR (true positive rate), a.k.a. recall (Rc) or SN: how often the model correctly predicts positive = TP/P = TP/(TP+FN)

  • FPR (false positive rate): how often the model incorrectly predicts positive = FP/N = FP/(FP+TN)

  • SP (specificity) = TN/(FP+TN) = 1 - FPR

  • Pr (precision), a.k.a. positive predictive value = TP/(TP+FP)

  • Accuracy = (TP+TN)/(P+N)

  • Balanced accuracy = ½ * (TP/(TP+FN) + TN/(TN+FP))

  • F-measure = 2*Pr*Rc/(Pr+Rc) = 2*TP/(2*TP+FP+FN)
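
As a quick illustration, the metrics above can be computed directly from the four cells of Table 1. The counts in this minimal sketch are hypothetical:

```python
# Hypothetical confusion-matrix counts (TP, FP, FN, TN are illustrative only)
TP, FP, FN, TN = 80, 30, 20, 170
P, N = TP + FN, FP + TN            # total true positives and true negatives

tpr = TP / P                       # sensitivity / recall
fpr = FP / N                       # false positive rate
specificity = TN / (FP + TN)
precision = TP / (TP + FP)
accuracy = (TP + TN) / (P + N)
balanced_accuracy = 0.5 * (TP / (TP + FN) + TN / (TN + FP))
f1 = 2 * TP / (2 * TP + FP + FN)   # equivalent to 2*Pr*Rc/(Pr+Rc)

print(f"TPR={tpr:.2f} FPR={fpr:.2f} SP={specificity:.2f} "
      f"Pr={precision:.2f} Acc={accuracy:.2f} "
      f"BalAcc={balanced_accuracy:.2f} F1={f1:.2f}")
```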


Let's also take a moment to understand the relationship between SP and SN. There is an inverse relationship here: when one goes up, the other goes down. You can see this by changing the cutoff in Fig 2 above.


Fig 3: Sensitivity and Specificity have an inverse relationship


But traditionally this is not how we graph it. We plot SN vs (1-SP); since SP and SN move in opposite directions, when SN goes up, (1-SP) goes up too, so the two now have a positive relationship, and this curve is known as an ROC graph (Fig 4a). By using (1-SP), which is also called the false positive rate (FPR) (or commission error), we keep the focus on the target of interest, which is predicting the positive class label.

Fig 4: a) Sensitivity and 1-Specificity have a positive relationship. b) (1-Specificity) is also known as the false positive rate


The ROC graph is plotted as TPR (y-axis) vs FPR (x-axis) for all possible classification thresholds. Thus, an ROC graph depicts the relative tradeoff between benefits (true positives) and costs (false positives). Plotting all thresholds at once also sidesteps the supposed subjectivity of picking a single threshold.
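
Here is a minimal sketch of how those (FPR, TPR) pairs are generated by sweeping the threshold over every observed score. The scores and labels are hypothetical; in practice a library routine such as scikit-learn's roc_curve does the same bookkeeping:

```python
import numpy as np

def roc_points(scores, labels):
    """Return FPR and TPR at every successive score threshold (labels: 1=pos, 0=neg)."""
    order = np.argsort(-scores)              # sort instances by decreasing score
    labels = labels[order]
    tps = np.cumsum(labels == 1)             # cumulative TP count as threshold lowers
    fps = np.cumsum(labels == 0)             # cumulative FP count as threshold lowers
    tpr = tps / max(tps[-1], 1)
    fpr = fps / max(fps[-1], 1)
    return np.concatenate([[0], fpr]), np.concatenate([[0], tpr])

# Hypothetical scores and true labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])
fpr, tpr = roc_points(scores, labels)
print(np.round(fpr, 2))
print(np.round(tpr, 2))
# Plotting tpr against fpr gives the ROC curve; (0,0) and (1,1) are its endpoints.
```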


Fig 5: Different regions of ROC space have different significance. Points on the diagonal line carry no information about the classification. Points towards the lower left are more conservative than points towards the upper right (1,1).


The lower left point (0,0) on an ROC represents a classifier that makes no false positive errors but also gains no true positives. At the opposite end, the upper right point (1,1) represents a classifier that unconditionally predicts positive, capturing all true positives but at the cost of a 100% false positive rate. The point (0,1) represents perfect classification.

The left-hand side of an ROC graph, marked as “conservative”, represents classifiers that make a positive prediction only with strong evidence. Opposite to it is the “liberal” region of the ROC, which includes classifiers that make positive predictions even on weak evidence and hence end up with a high false positive rate. Many real-world problems have far more negative instances than positive ones, so the conservative region is often the one of interest. Classifiers falling on the diagonal line have no discriminating ability, whereas the ones below the diagonal do have useful information but are applying it incorrectly (inverting their predictions would lift them above the diagonal).

Area under the curve (AUC)

The area under the ROC curve (AUC) provides a single-number measure of discrimination across all possible thresholds of a classifier. The AUC is commonly used as a measure of how well a test distinguishes the two classes. Since the ROC lives inside a unit square, AUC values lie between 0 and 1. Random guessing corresponds to the diagonal line from (0,0) to (1,1), with an AUC of 0.5, so any realistic classifier should have an AUC greater than 0.5.

Fig 6: Area under the ROC curve can be computed using a trapezoid rule. It represents the accuracy of distribution models 


Statistically speaking, the AUC of a classifier is the probability that it ranks a randomly chosen positive instance higher than a randomly chosen negative one. Numerically, the AUC can be computed with the trapezoidal rule.
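
The two views of AUC can be checked against each other in a few lines. This sketch (reusing the hypothetical scores from the previous snippet) computes the area with the trapezoidal rule and as the fraction of positive-negative pairs ranked correctly; the two numbers agree:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])

# 1) Trapezoidal rule over the ROC points
order = np.argsort(-scores)
tpr = np.concatenate([[0], np.cumsum(labels[order] == 1) / np.sum(labels == 1)])
fpr = np.concatenate([[0], np.cumsum(labels[order] == 0) / np.sum(labels == 0)])
auc_trapezoid = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

# 2) Probability that a random positive is scored above a random negative
pos, neg = scores[labels == 1], scores[labels == 0]
diff = pos[:, None] - neg[None, :]
auc_rank = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(auc_trapezoid, auc_rank)   # both give the same value (0.8 here)
```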


Despite the popularity of AUC as a standard measure of accuracy for distribution models, the reliance on AUC as a sufficient test of model success is being questioned and re-examined.


First, let's look at the properties of ROC-AUC that make it an attractive metric to use:


  1. ROC-AUC measures the ability of a classifier to produce good relative instance scores (i.e., to rank the positive instances above the negative instances) and is insensitive to whether the predicted probabilities are actually calibrated to represent class membership. As long as the ordering of the observations by predicted probability is the same, two models can be compared; see the sketch after this list. What matters is how well the classifier ranks, not what the raw scores are.

  2. ROC-AUC is useful under varying operating conditions, such as skewed class distributions or unequal classification error costs. If the proportion of positive to negative instances in a test set changes (while the class-conditional score distributions stay the same), the ROC curve will not change.

  3. ROC analysis can be extended to more than two classes.
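
Here is the sketch referred to in point 1: applying a rank-preserving transformation to the scores (here, raising them to the fourth power, which makes them badly mis-calibrated) leaves the AUC untouched, while a calibration-sensitive measure such as log loss changes. All data and the transformation are made up for illustration:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def log_loss(scores, labels):
    """Calibration-sensitive measure: mean negative log-likelihood."""
    p = np.clip(scores, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 1000)
# Hypothetical probability-like scores, then a rank-preserving but mis-calibrated squash
scores = np.clip(labels * 0.3 + rng.normal(0.35, 0.2, 1000), 0.01, 0.99)
squashed = scores ** 4                        # monotone transform: ordering unchanged

print(auc(scores, labels), auc(squashed, labels))            # identical AUCs
print(log_loss(scores, labels), log_loss(squashed, labels))  # calibration differs
```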


Of course, the above characteristics are very attractive and make ROC-AUC a favourite performance measure and useful tool. Next, we will look into its drawbacks:


  1. AUC measures the discrimination ability of a classifier, i.e., how reliably the “positive” outcome receives a higher predicted value than the “negative” outcome. This ability depends only on the ranks of the predictions and is insensitive to any rank-preserving transformation of the probabilities. Therefore, AUC disregards the goodness-of-fit of the predictions: a well-fitted model can have poor discrimination, and a poorly fitted model can systematically over- or underestimate the predicted probabilities while still discriminating between “pos” and “neg” reasonably well.

  2. ROC summarises test performance over all of the operating conditions a model could be used in. However, researchers are rarely interested in all of these conditions; usually only a few of them matter. For example, one may want to maximise detection of the “pos” class regardless of the false positive cost, or only a conservative model may be of interest. It is also possible for a high-AUC classifier to perform worse in the specific region of ROC space one cares about than a second, lower-AUC classifier.

  3. AUC assumes equal misclassification costs, i.e., it weighs omission (FN) and commission (FP) errors equally. A “negative” prediction could simply be due to the rarity of an instance or to it lying in a non-sampled area. Ignoring cost ratios can therefore lead to flawed decisions, as they are an important part of decision making.

  4. Being a compound, single-number discrimination measure, AUC provides no information about the spatial distribution of model errors. This follows from the fact that AUC is calculated from other single-number metrics, such as SN and SP, derived from a contingency table. Yet it may be crucial to distinguish randomly distributed errors from locally aggregated ones when the relative weight of these errors matters.

  5. AUC is oblivious to the extent of the sampled territory. For example, if the data contain many more negatives than positives, a model that over-predicts positives will still have a low false positive rate. With such a model, inflated SN, SP, and AUC can be obtained simply by increasing the sampling extent to include more irrelevant, trivially negative instances; see the sketch after this list.
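
And the sketch referred to in point 5: starting from a “hard” region where positives and negatives overlap, we enlarge the sampling extent with trivially negative instances. The model has not improved in the region of interest, yet its AUC is inflated. All numbers are made up for illustration:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

rng = np.random.default_rng(2)
# Hypothetical "hard" region: positive and negative scores overlap substantially
pos = rng.normal(0.60, 0.15, 300)
neg_hard = rng.normal(0.50, 0.15, 300)
scores = np.concatenate([pos, neg_hard])
labels = np.concatenate([np.ones(300), np.zeros(300)])
print("AUC on the hard region:", round(auc(scores, labels), 3))

# Enlarge the sampling extent with trivially negative instances (very low scores)
neg_easy = rng.normal(0.05, 0.03, 3000)
scores_big = np.concatenate([scores, neg_easy])
labels_big = np.concatenate([labels, np.zeros(3000)])
print("AUC after adding easy negatives:", round(auc(scores_big, labels_big), 3))
# AUC jumps even though discrimination in the region of interest is unchanged.
```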


To conclude, the widespread adoption of AUC by presence-absence modellers for measuring the discrimination ability of a model was driven by its good performance and ease of implementation, and by the fact that it summarises behaviour over a range of modelling conditions. AUC is a very useful compound metric compared with scalar measures such as accuracy, error rate, or error cost when investigating learning with skewed class distributions or cost-sensitive learning. However, in distribution modelling AUC has the serious drawbacks discussed above and, on its own, does not tell us how good a model really is.


The aim of this post was to dig deeper into the ROC as an evaluation metric and to understand its characteristics and limitations. In the next post, I will discuss some alternatives to ROC-AUC. Stay tuned!
