Introduction
Every classification report is a forest of percentages. Below we spell out what each metric means in one plain sentence, then walk through six toy cases to see how the numbers shift with class balance.
Metric Definitions (Plain English)
| Metric | Formula | What it really says |
|---|---|---|
| Prevalence | P / (P + N) | How common the positive class is. |
| Recall (TPR) | TP / (TP + FN) | Out of all actual positives, how many we caught. |
| FNR | FN / (TP + FN) | Miss-rate: positives we failed to flag. |
| Precision | TP / (TP + FP) | When we shout “positive”, how often we are right. |
| FDR | FP / (TP + FP) | False-alarm rate among calls we made. |
| F1 | 2 · (Prec · Rec) / (Prec + Rec) | High only if both precision and recall are high. |
| Specificity (TNR) | TN / (TN + FP) | Out of all actual negatives, how many we left alone. |
| FPR | FP / (TN + FP) | How many innocents we falsely flag. |
| NPV | TN / (TN + FN) | When we say “negative”, chance we’re correct. |
| Accuracy | (TP + TN) / total | Overall hit-rate, may be misleading if classes are skewed. |
| AUC | P(score_pos > score_neg) | How well the model ranks positives above negatives. |
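The count-based rows of this table collapse into a few lines of code. Below is a minimal sketch (the function name `classification_metrics` is ours, not from any library) that derives each metric from the four confusion-matrix cells. AUC is omitted because it needs scores rather than counts, and the divisions are left unguarded, so an empty denominator (e.g. zero predicted positives) raises an error:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the count-based metrics defined above from confusion-matrix cells."""
    p, n = tp + fn, tn + fp          # actual positives / actual negatives
    total = p + n
    return {
        "prevalence":  p / total,
        "recall":      tp / (tp + fn),
        "fnr":         fn / (tp + fn),
        "precision":   tp / (tp + fp),
        "fdr":         fp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),  # equivalent to 2PR/(P+R)
        "specificity": tn / (tn + fp),
        "fpr":         fp / (tn + fp),
        "npv":         tn / (tn + fn),
        "accuracy":    (tp + tn) / total,
    }

# A balanced example: prevalence 50%, recall 95%, TNR 95%
m = classification_metrics(tp=950, fp=50, fn=50, tn=950)
print(round(m["precision"], 3), round(m["f1"], 3))  # 0.95 0.95
```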
Worked Table: Six Hypothetical Cases
Each row fixes Recall = 95% and FNR = 5% (except Case 6). Cases 1–4 vary class prevalence to show how the other metrics react; Case 5 keeps prevalence at 90% but drops specificity to 10%; Case 6 is a degenerate model that never predicts positive.
| Case | Prev (%) | Recall (%) | FNR (%) | Precision (%) | FDR (%) | F1 (%) | TNR (%) | FPR (%) | NPV (%) | Accuracy (%) | AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 1.0 | 95.0 | 5.0 | 16.1 | 83.9 | 27.5 | 95.0 | 5.0 | 99.9 | 95.0 | 95.0 |
| Case 2 | 10.0 | 95.0 | 5.0 | 67.9 | 32.1 | 79.2 | 95.0 | 5.0 | 99.4 | 95.0 | 95.0 |
| Case 3 | 50.0 | 95.0 | 5.0 | 95.0 | 5.0 | 95.0 | 95.0 | 5.0 | 95.0 | 95.0 | 95.0 |
| Case 4 | 90.0 | 95.0 | 5.0 | 99.4 | 0.6 | 97.2 | 95.0 | 5.0 | 67.9 | 95.0 | 95.0 |
| Case 5 | 90.0 | 95.0 | 5.0 | 90.5 | 9.5 | 92.7 | 10.0 | 90.0 | 18.2 | 86.5 | 52.5 |
| Case 6 | 1.0 | 0.0 | 100.0 | 0.0 | 100.0 | 0.0 | 100.0 | 0.0 | 99.0 | 99.0 | 50.0 |
Notice how F1 climbs with prevalence (Cases 1 → 4) even though recall and TNR stay identical: the gain comes entirely from precision.
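Every row above is fully determined by three inputs: prevalence, recall, and TNR. The sketch below (the helper name `row` is ours, chosen for this note) reconstructs the derived columns from those three numbers, which makes it easy to check any case or try new ones:

```python
def row(prev, recall, tnr, total=100_000):
    """Derive precision, F1, NPV and accuracy from prevalence, recall and TNR."""
    p = prev * total                  # actual positives
    n = total - p                     # actual negatives
    tp, fn = recall * p, (1 - recall) * p
    tn, fp = tnr * n, (1 - tnr) * n
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    npv = tn / (tn + fn)
    acc = (tp + tn) / total
    return precision, f1, npv, acc

# Case 1: prevalence 1%, recall 95%, TNR 95%
precision, f1, npv, acc = row(0.01, 0.95, 0.95)
print(f"{precision:.1%} {f1:.1%} {npv:.1%} {acc:.1%}")  # 16.1% 27.5% 99.9% 95.0%
```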
Metric Cheat-Sheet
This quick table shows when to use each metric.
| Metric | Formula | What it measures | Strength | Weakness |
|---|---|---|---|---|
| AUC | P(score_pos > score_neg) | Ranking quality across all thresholds | Threshold-independent | Doesn't say which threshold is best |
| F1 | 2PR / (P+R) | Balance of Precision & Recall | Great for trade-offs | Hides individual contributions |
| Precision | TP / (TP+FP) | When you say positive, how often you’re right | Minimises false alarms | Ignores false negatives; can look good even when most positives are missed |
| Recall | TP / (TP+FN) | Out of all real positives, how many you catch | Critical in health/safety | Optimising it alone invites false positives |
| Specificity | TN / (TN+FP) | Out of all actual negatives, how many are correctly identified as negative | Minimises false alarms | Can be high even if true positives are rarely caught |
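The precision/recall trade-off behind this cheat-sheet is easy to see by sweeping a decision threshold. Below is a pure-Python sketch on toy scores of our own invention (no real model behind them): raising the threshold trades recall away for precision, and F1 peaks somewhere in between:

```python
y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]                     # toy labels
y_score = [0.95, 0.85, 0.60, 0.40, 0.70, 0.55, 0.30, 0.20, 0.10, 0.05]

for t in (0.3, 0.5, 0.7):
    y_pred = [int(s >= t) for s in y_score]
    tp = sum(p and a for p, a in zip(y_pred, y_true))         # true positives
    fp = sum(p and not a for p, a in zip(y_pred, y_true))     # false positives
    fn = sum(not p and a for p, a in zip(y_pred, y_true))     # false negatives
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    print(f"threshold {t}: precision {prec:.2f}, recall {rec:.2f}, F1 {f1:.2f}")
```

On these scores, moving the threshold from 0.3 to 0.7 lifts precision from 0.57 to 0.67 while recall falls from 1.00 to 0.50.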
When to Prioritise Which Metric
There is no one-size-fits-all metric. The right choice depends entirely on the problem. Some require catching every positive. Others demand near-certainty before acting. Below are real-world examples where different metrics matter most.
TPR (Recall): For When Missing a True Case is Unacceptable
In disease screening, recall is critical. Missing a true case (false negative) is dangerous. Doctors are willing to tolerate false positives if it means not missing real cases of disease.
Precision: For When You Only Want to Act if You're Sure
In finance, it’s better to make fewer but surer calls. If you're wrong, you lose money. Precision ensures that when you act, you're probably right.
TNR (Specificity): For Avoiding False Alarms
In fraud detection, you don't want to flag legitimate transactions. High specificity ensures most normal cases aren't wrongly flagged.
F1 Score: When Both Sides Matter
Spam filters need balance. Too strict? You lose real emails. Too lenient? You get spammed. F1 balances both.
AUC: For Comparing Models Overall
When you don't yet know the best cut-off, AUC shows how well your model separates classes overall. It’s good for ranking models before deployment.
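The definition AUC = P(score_pos > score_neg) can be computed literally by comparing every positive score against every negative score (ties count as half, matching the standard ROC-AUC convention). The function name below is ours, a sketch rather than a library API:

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n)        # 1 for a win, 0.5 for a tie
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A model that ranks every positive above every negative gets AUC = 1.0;
# random scoring hovers around 0.5.
print(auc_by_ranking([0.9, 0.8, 0.7], [0.6, 0.5, 0.4]))  # 1.0
print(auc_by_ranking([0.9, 0.4], [0.8, 0.3]))            # 0.75
```

This pairwise form is O(len(pos) · len(neg)), fine for intuition; production code sorts by score instead.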
Bottom line: Choose based on what mistake matters more — a missed case, or a wrong prediction.