Binary classification metrics: ROC vs. PR vs. PR-Gain

Measuring binary classification performance on imbalanced data is notoriously tricky. With a minority positive class, optimising plain accuracy can reward a model that always predicts 0.
Balanced accuracy and the F1-score attempt to counter this with weighted combinations and ratios of the counts of positive and negative predictions and examples. But both rely on a predetermined classification cutoff, which itself requires tuning.
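To make this concrete, here is a minimal sketch using scikit-learn on synthetic data (the 2% positive rate and the always-zero model are illustrative assumptions, not from any real system):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

rng = np.random.default_rng(0)
# Synthetic labels: roughly 2% positive class
y_true = (rng.random(1000) < 0.02).astype(int)
# A degenerate "model" that always predicts the majority class
y_always_zero = np.zeros_like(y_true)

print(accuracy_score(y_true, y_always_zero))           # ~0.98: looks great
print(balanced_accuracy_score(y_true, y_always_zero))  # 0.5: random-level
print(f1_score(y_true, y_always_zero, zero_division=0.0))  # 0.0: useless
```

The always-zero model scores near-perfect accuracy, while balanced accuracy and F1 expose it, at the cost of depending on the 0/1 cutoff already baked into the predictions.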

A metric that doesn’t need this cutoff is the Receiver Operating Characteristic (ROC) curve. It treats the model under a varying cutoff as a family of classifiers, and for each cutoff plots the true-positive rate (true positives / positive examples) against the false-positive rate (false positives / negative examples) over the test dataset.
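As a sketch, scikit-learn’s `roc_curve` computes exactly these two ratios at every candidate cutoff; the labels and scores below are synthetic stand-ins for a real model:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
# Hypothetical model scores: informative but noisy
scores = y_true * 0.5 + rng.random(200)

# fpr = false positives / negative examples,
# tpr = true positives / positive examples,
# one (fpr, tpr) point per candidate cutoff in `thresholds`
fpr, tpr, thresholds = roc_curve(y_true, scores)
```

Both rates increase monotonically from (0, 0) to (1, 1) as the cutoff is lowered, tracing out the curve.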

ROC can be misleading on imbalanced data: with few positive examples, the true-positive rate is estimated from a small sample and is easily distorted, making ROC unreliable at estimating model performance. ROC nonetheless has great properties: easy comparison against the major diagonal, which represents a universal baseline B that predicts randomly with probabilities equal to the class balance. Its optimal points lie on the curve’s Pareto front, so it’s easy to find cutoffs that optimise particular trade-offs. Finally, the Area Under the Curve (AUC) of ROC estimates the probability that, given a random pair of one positive and one negative example, the model scores the positive example higher than the negative. This allows easy comparison of models fitted to different data.
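The AUC interpretation can be checked numerically: below, a hypothetical scorer’s AUC on synthetic data is compared against the fraction of positive–negative pairs it orders correctly (the Mann–Whitney U statistic):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=300)
scores = y_true * 0.7 + rng.normal(size=300)

auc = roc_auc_score(y_true, scores)

# Probability that a random positive outscores a random negative,
# counting ties as half a success
pos, neg = scores[y_true == 1], scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

assert np.isclose(auc, pairwise)  # the two quantities coincide
```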

An alternative metric, also built from a varying cutoff, is the Precision-Recall (PR) curve. PR plots precision (true positives / positive predictions) against recall (true positives / positive examples). It correctly separates poor from good performance on a minority positive class, since precision and recall measure two complementary aspects of performance on that class. On imbalanced data it can therefore be preferable to ROC.
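A minimal sketch with scikit-learn’s `precision_recall_curve`, again on synthetic imbalanced data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(3)
# Imbalanced labels: roughly 5% positive class
y_true = (rng.random(2000) < 0.05).astype(int)
scores = y_true * 0.8 + rng.normal(size=2000)

# precision = TP / positive predictions, recall = TP / positive examples,
# one point per candidate cutoff
precision, recall, thresholds = precision_recall_curve(y_true, scores)
```

The curve runs from recall 1 at the loosest cutoff to the conventional endpoint (precision 1, recall 0) at the strictest.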

PR suffers several drawbacks: there is no simple baseline to compare against, since the performance level of B forms a hyperbola in PR space, its position determined by the dataset imbalance. The optimal PR points are not easy to identify, so finding optimal cutoffs isn’t obvious. Finally, AUC-PR has no known probabilistic interpretation, and a lower-right-hand region of the PR plot is unachievable (its size determined by the dataset imbalance), so models trained on different data can’t be compared using this quantity.
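The unachievable region can be made precise: at a fixed recall r, the number of true positives is fixed while false positives are at most the total number of negatives, which bounds precision from below. The sketch below checks the area of this region numerically against the closed form derived by Boyd et al. (2012); the 10% prior is an illustrative assumption:

```python
import numpy as np

def min_precision(recall, pi):
    # Lowest achievable precision at fixed recall: true positives are
    # recall * pi * n, while false positives are at most (1 - pi) * n
    return pi * recall / (pi * recall + 1 - pi)

pi = 0.1  # assumed 10% positive class
r = np.linspace(0.0, 1.0, 100001)
y = min_precision(r, pi)

# Trapezoidal area under the boundary = area of the unachievable region
numeric_area = np.sum((y[:-1] + y[1:]) / 2) * (r[1] - r[0])
closed_form = 1 + (1 - pi) * np.log(1 - pi) / pi  # Boyd et al. (2012)
```

The bound rises with recall, which is why the forbidden region sits in the lower-right of the PR plot and grows as the imbalance worsens.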

There is an alternative that retains the utility of PR analysis while fixing the above drawbacks. Applying a harmonic transform, parameterised by the dataset imbalance, to the precision and recall values maps them into the so-called “Gain” space, where the baseline B occupies the minor diagonal. The unachievable region is transformed away, and the optimal points now lie on the Pareto front of the curve, so finding optimal cutoffs is easy. Finally, the AUC of PR-Gain has an interpretation as an estimator of a simple transform of the optimal model’s F1-score. This means different models can be compared directly, even when fitted to different data. A drawback of PR-Gain is the increased complexity of the metric: individual points on the curve become harder to interpret.
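A sketch of the transform, assuming the precision-gain / recall-gain formulation of Flach & Kull (2015), where π denotes the positive-class prior:

```python
import numpy as np

def prg_transform(precision, recall, pi):
    """Map precision/recall into Gain space (Flach & Kull, 2015).

    Values at the baseline pi map to gain 0, perfect scores map to 1,
    and sub-baseline values map to negative gains.
    """
    prec_gain = (precision - pi) / ((1 - pi) * precision)
    rec_gain = (recall - pi) / ((1 - pi) * recall)
    return prec_gain, rec_gain

pi = 0.2  # assumed 20% positive class
# Baseline precision equals the prior -> zero precision gain;
# a perfect classifier maps to (1, 1)
pg, rg = prg_transform(np.array([pi, 1.0]), np.array([0.5, 1.0]), pi)
```

Note the transform is undefined at zero precision or recall, so curves are usually plotted for cutoffs where both are positive.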

In summary:

  • ROC is easily interpretable and comparable between models, but sensitive to class imbalance.
  • PR is similarly easy to interpret and is robust to a minority positive class, but comparing models by its AUC is mathematically unsound.
  • PR-Gain is harder to interpret, but has sound reasons allowing models to be compared by it, and it is robust to a minority positive class.

In Henesis, we primarily use these three metrics when validating binary classifiers, choosing between them by balancing interpretability against the cost of false negatives and factoring in any imbalance in the data. As with everything in Machine Learning, there is no single tool that handles every job. Choosing appropriately, we extract more performant models and more meaningful explanations.