Introduction
Every classification report is a forest of percentages. Below we spell out what each metric means in one plain sentence, then walk through six toy cases to see how the numbers shift with class balance.
Metric Definitions (Plain English)
| Metric | Formula | What it really says |
|---|---|---|
| Prevalence | P / (P + N) | How common the positive class is. |
| Recall (TPR) | TP / (TP + FN) | Out of all actual positives, how many we caught. |
| FNR | FN / (TP + FN) | Miss-rate: positives we failed to flag. |
| Precision | TP / (TP + FP) | When we shout “positive”, how often we are right. |
| FDR | FP / (TP + FP) | False-alarm rate among calls we made. |
| F1 | 2 · (Prec · Rec) / (Prec + Rec) | High only if both precision and recall are high. |
| Specificity (TNR) | TN / (TN + FP) | Out of all actual negatives, how many we left alone. |
| FPR | FP / (TN + FP) | How many innocents we falsely flag. |
| NPV | TN / (TN + FN) | When we say “negative”, chance we’re correct. |
| Accuracy | (TP + TN) / total | Overall hit-rate, may be misleading if classes are skewed. |
| AUC | P(score_pos > score_neg) | How well the model ranks positives above negatives. |
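The count-based rows of this table collapse into a few lines of code. Below is a minimal sketch (the function name `classification_metrics` is ours, not from any library) that derives each metric from the four confusion-matrix cells. AUC is omitted because it needs scores rather than counts, and the divisions are left unguarded, so an empty denominator (e.g. zero predicted positives) raises an error:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the count-based metrics defined above from confusion-matrix cells."""
    p, n = tp + fn, tn + fp          # actual positives / actual negatives
    total = p + n
    return {
        "prevalence":  p / total,
        "recall":      tp / (tp + fn),
        "fnr":         fn / (tp + fn),
        "precision":   tp / (tp + fp),
        "fdr":         fp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),  # equivalent to 2PR/(P+R)
        "specificity": tn / (tn + fp),
        "fpr":         fp / (tn + fp),
        "npv":         tn / (tn + fn),
        "accuracy":    (tp + tn) / total,
    }

# A balanced example: prevalence 50%, recall 95%, TNR 95%
m = classification_metrics(tp=950, fp=50, fn=50, tn=950)
print(round(m["precision"], 3), round(m["f1"], 3))  # 0.95 0.95
```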
Worked Table: Six Hypothetical Cases
Each row fixes Recall = 95% and FNR = 5% (except Case 6). Cases 1–4 vary class prevalence to show how the other metrics react; Case 5 keeps prevalence at 90% but drops specificity to 10%; Case 6 is a degenerate model that never predicts positive.
| Case | Prev (%) | Recall (%) | FNR (%) | Precision (%) | FDR (%) | F1 (%) | TNR (%) | FPR (%) | NPV (%) | Accuracy (%) | AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 1.0 | 95.0 | 5.0 | 16.1 | 83.9 | 27.5 | 95.0 | 5.0 | 99.9 | 95.0 | 95.0 |
| Case 2 | 10.0 | 95.0 | 5.0 | 67.9 | 32.1 | 79.2 | 95.0 | 5.0 | 99.4 | 95.0 | 95.0 |
| Case 3 | 50.0 | 95.0 | 5.0 | 95.0 | 5.0 | 95.0 | 95.0 | 5.0 | 95.0 | 95.0 | 95.0 |
| Case 4 | 90.0 | 95.0 | 5.0 | 99.4 | 0.6 | 97.2 | 95.0 | 5.0 | 67.9 | 95.0 | 95.0 |
| Case 5 | 90.0 | 95.0 | 5.0 | 90.5 | 9.5 | 92.7 | 10.0 | 90.0 | 18.2 | 86.5 | 52.5 |
| Case 6 | 1.0 | 0.0 | 100.0 | 0.0 | 100.0 | 0.0 | 100.0 | 0.0 | 99.0 | 99.0 | 50.0 |
Notice how F1 climbs with prevalence (Cases 1 → 4) even though recall and TNR stay identical: the gain comes entirely from precision.
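Every row above is fully determined by three inputs: prevalence, recall, and TNR. The sketch below (the helper name `row` is ours, chosen for this note) reconstructs the derived columns from those three numbers, which makes it easy to check any case or try new ones:

```python
def row(prev, recall, tnr, total=100_000):
    """Derive precision, F1, NPV and accuracy from prevalence, recall and TNR."""
    p = prev * total                  # actual positives
    n = total - p                     # actual negatives
    tp, fn = recall * p, (1 - recall) * p
    tn, fp = tnr * n, (1 - tnr) * n
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    npv = tn / (tn + fn)
    acc = (tp + tn) / total
    return precision, f1, npv, acc

# Case 1: prevalence 1%, recall 95%, TNR 95%
precision, f1, npv, acc = row(0.01, 0.95, 0.95)
print(f"{precision:.1%} {f1:.1%} {npv:.1%} {acc:.1%}")  # 16.1% 27.5% 99.9% 95.0%
```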
Metric Cheat-Sheet
This quick table shows when to use each metric.
| Metric | Formula | What it measures | Strength | Weakness |
|---|---|---|---|---|
| AUC | P(score_pos > score_neg) | Ranking quality across all thresholds | Threshold-independent | Doesn't say which threshold is best |
| F1 | 2PR / (P+R) | Balance of Precision & Recall | Great for trade-offs | Hides individual contributions |
| Precision | TP / (TP+FP) | When you say positive, how often you’re right | Minimises false alarms | Ignores false negatives; can look good even when most positives are missed |
| Recall | TP / (TP+FN) | Out of all real positives, how many you catch | Critical in health/safety | Optimising it alone invites false positives |
| Specificity | TN / (TN+FP) | Out of all actual negatives, how many are correctly identified as negative | Minimises false alarms | Can be high even if true positives are rarely caught |
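The precision/recall trade-off behind this cheat-sheet is easy to see by sweeping a decision threshold. Below is a pure-Python sketch on toy scores of our own invention (no real model behind them): raising the threshold trades recall away for precision, and F1 peaks somewhere in between:

```python
y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]                     # toy labels
y_score = [0.95, 0.85, 0.60, 0.40, 0.70, 0.55, 0.30, 0.20, 0.10, 0.05]

for t in (0.3, 0.5, 0.7):
    y_pred = [int(s >= t) for s in y_score]
    tp = sum(p and a for p, a in zip(y_pred, y_true))         # true positives
    fp = sum(p and not a for p, a in zip(y_pred, y_true))     # false positives
    fn = sum(not p and a for p, a in zip(y_pred, y_true))     # false negatives
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    print(f"threshold {t}: precision {prec:.2f}, recall {rec:.2f}, F1 {f1:.2f}")
```

On these scores, moving the threshold from 0.3 to 0.7 lifts precision from 0.57 to 0.67 while recall falls from 1.00 to 0.50.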
When to Prioritise Which Metric
There is no one-size-fits-all metric. The right choice depends entirely on the problem. Some require catching every positive. Others demand near-certainty before acting. Below are real-world examples where different metrics matter most.
TPR (Recall): For When Missing a True Case is Unacceptable
In disease screening, recall is critical. Missing a true case (false negative) is dangerous. Doctors are willing to tolerate false positives if it means not missing real cases of disease.
Precision: For When You Only Want to Act if You're Sure
In finance, it’s better to make fewer but surer calls. If you're wrong, you lose money. Precision ensures that when you act, you're probably right.
TNR (Specificity): For Avoiding False Alarms
In fraud detection, you don't want to flag legitimate transactions. High specificity ensures most normal cases aren't wrongly flagged.
F1 Score: When Both Sides Matter
Spam filters need balance. Too strict? You lose real emails. Too lenient? You get spammed. F1 balances both.
AUC: For Comparing Models Overall
When you don't yet know the best cut-off, AUC shows how well your model separates classes overall. It’s good for ranking models before deployment.
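The definition AUC = P(score_pos > score_neg) can be computed literally by comparing every positive score against every negative score (ties count as half, matching the standard ROC-AUC convention). The function name below is ours, a sketch rather than a library API:

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n)        # 1 for a win, 0.5 for a tie
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A model that ranks every positive above every negative gets AUC = 1.0;
# random scoring hovers around 0.5.
print(auc_by_ranking([0.9, 0.8, 0.7], [0.6, 0.5, 0.4]))  # 1.0
print(auc_by_ranking([0.9, 0.4], [0.8, 0.3]))            # 0.75
```

This pairwise form is O(len(pos) · len(neg)), fine for intuition; production code sorts by score instead.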
Bottom line: Choose based on what mistake matters more — a missed case, or a wrong prediction.