Chapter 11

Monitoring with Metrics

Training in Practice

Learning Objectives

By the end of this section, you will be able to:

  1. Construct and interpret a confusion matrix and explain why it contains more information than accuracy alone
  2. Compute precision, recall, and F1 score from scratch and explain which metric matters most for a given application
  3. Explain why the decision threshold is a critical hyperparameter and how to tune it for maximum F1
  4. Read and plot learning curves to diagnose underfitting, overfitting, and data hunger
  5. Use scikit-learn's classification_report and PyTorch-native tensor operations for production metric computation

Where We Left Off

In Section 3, we built the train/validation/test framework: we split data, tracked validation loss, and implemented early stopping. But we monitored only one number — the loss. While loss is essential for gradient-based optimization, it does not directly answer the questions practitioners actually care about: “How many positives did we miss?” “How many of our positive predictions are wrong?” “Should I trust this model in production?”

This section introduces the metrics toolkit that answers those questions. Loss tells the optimizer where to go. Metrics tell you whether the model is useful.


Beyond Loss: Why We Need More Metrics

Suppose you are building a model to detect fraudulent credit card transactions. In your dataset, 99.5% of transactions are legitimate and 0.5% are fraudulent. A model that always predicts “legitimate” has:

  • Accuracy: 99.5% — looks fantastic
  • Recall: 0% — it catches zero fraud
  • Business value: zero — the whole point was to catch fraud

This is the accuracy paradox: on imbalanced datasets, accuracy rewards predicting the majority class and ignores the minority class entirely. The model has high accuracy but is completely useless for its intended purpose.

The Core Insight: Accuracy tells you how often the model is right. But on imbalanced data, being right about the majority class is trivial and uninteresting. What you really need to know is: when the model says “positive,” can you trust it? And how many real positives is it missing?
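To make the accuracy paradox concrete, here is a minimal sketch that scores an always-"legitimate" classifier. Only the 0.5% fraud rate comes from the example above; the 10,000-transaction dataset and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical dataset: 10,000 transactions, 0.5% fraudulent (label 1).
n_total = 10_000
n_fraud = 50
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A "model" that always predicts the majority class (legitimate = 0).
y_pred = np.zeros(n_total, dtype=int)

accuracy = float((y_pred == y_true).mean())
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}")  # 0.995
print(f"recall   = {recall:.3f}")    # 0.000
```

The model never flags a single fraudulent transaction, yet its accuracy equals the share of the majority class.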

The Confusion Matrix

The confusion matrix is a 2×2 table that breaks down every prediction into one of four categories. It is the foundation from which all classification metrics are derived.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP): Correctly detected | False Negative (FN): Missed |
| Actually Negative | False Positive (FP): False alarm | True Negative (TN): Correctly rejected |

Every sample in the test set falls into exactly one cell. The four counts are exhaustive and mutually exclusive: $\text{TP} + \text{FP} + \text{TN} + \text{FN} = N$.

The names use a simple convention: the first word (“True” / “False”) says whether the model was correct. The second word (“Positive” / “Negative”) says what the model predicted.

  • TP: Model said positive. It was positive. Correct ✓
  • FP: Model said positive. It was negative. Wrong ✗ (false alarm)
  • TN: Model said negative. It was negative. Correct ✓
  • FN: Model said negative. It was positive. Wrong ✗ (missed detection)
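The four cells can be counted directly with boolean masks. This is a small sketch on ten made-up samples; the labels and predictions are hypothetical and chosen only so that every cell is non-empty.

```python
import numpy as np

# Hypothetical labels and hard predictions for 10 samples (1 = positive).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # said positive, was positive
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # said positive, was negative
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # said negative, was negative
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # said negative, was positive

# Every sample lands in exactly one cell.
assert tp + fp + tn + fn == len(y_true)

# Same layout as the table: rows = actual (positive, negative),
# columns = predicted (positive, negative).
confusion = np.array([[tp, fn],
                      [fp, tn]])
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=5 FN=1
```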

Precision, Recall, and F1

Accuracy

The fraction of all predictions that are correct:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}$$

Good as a sanity check on balanced datasets. Misleading on imbalanced datasets.

Precision: “When I say positive, am I right?”

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

High precision means few false alarms. When the model says “positive,” you can trust it. Precision matters most when false positives are expensive: a spam filter that blocks real emails (FP) annoys users, so you want high precision.

Recall: “Did I catch all the positives?”

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

High recall means few missed detections. Recall matters most when false negatives are dangerous: a cancer screening tool that misses a tumor (FN) is potentially fatal, so you want high recall even at the cost of some false positives.

F1 Score: Balancing Precision and Recall

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}$$

The F1 score is the harmonic mean of precision and recall. Unlike the arithmetic mean, the harmonic mean penalizes extreme imbalance: if either precision or recall is near zero, F1 is near zero. Both must be high for F1 to be high.

| Precision | Recall | Arithmetic Mean | F1 (Harmonic Mean) |
|---|---|---|---|
| 90% | 90% | 90% | 90% |
| 90% | 10% | 50% | 18% |
| 100% | 1% | 50.5% | 2% |
| 50% | 50% | 50% | 50% |

The second row is the key insight: a model with 90% precision but only 10% recall seems “okay” by arithmetic mean (50%) but is terrible by F1 (18%). The harmonic mean refuses to paper over a failing metric.

When to use which metric: Use accuracy for balanced multi-class problems (e.g., CIFAR-10 with 10 equal classes). Use F1 for imbalanced binary problems. Use precision when false positives are costly. Use recall when false negatives are dangerous.
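The formulas in this section translate almost line for line into code. The sketch below uses hypothetical counts and two illustrative helper names (`precision_recall_f1`, `harmonic_mean`) to check that both forms of the F1 formula agree, then reproduces the rows of the arithmetic-mean versus F1 table.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


def harmonic_mean(p: float, r: float) -> float:
    """F1 written as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 3 hits, 1 false alarm, 1 miss.
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=1)
print(precision, recall, f1)  # 0.75 0.75 0.75

# Both forms of the F1 formula agree.
assert abs(f1 - harmonic_mean(precision, recall)) < 1e-12

# Rows of the table: (precision, recall) -> arithmetic mean vs. F1.
for p, r in [(0.90, 0.90), (0.90, 0.10), (1.00, 0.01), (0.50, 0.50)]:
    print(f"P={p:.2f} R={r:.2f}  mean={(p + r) / 2:.3f}  F1={harmonic_mean(p, r):.3f}")
```

Returning 0.0 for an undefined ratio is one common convention, not the only one; some libraries let you choose the fallback value.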

The Threshold Decision

A classifier typically outputs a probability $p \in [0, 1]$, not a hard class label. To get a binary prediction, we choose a threshold $\tau$:

$$\hat{y} = \begin{cases} 1 & \text{if } p \geq \tau \\ 0 & \text{if } p < \tau \end{cases}$$

The default $\tau = 0.5$ is often not optimal. Changing the threshold creates a precision-recall tradeoff:

  • Lower threshold ($\tau = 0.3$): More samples are predicted positive. Recall can only stay the same or increase (catch more positives), while precision usually decreases (more false alarms).
  • Higher threshold ($\tau = 0.7$): Fewer samples are predicted positive. Precision usually increases (fewer false alarms), while recall can only stay the same or decrease (miss more positives).

The optimal threshold depends on the application. To find the threshold that maximizes F1, sweep candidate thresholds from 0 to 1 on the validation set and pick the one with the highest F1 score.

Threshold tuning is a hyperparameter decision. It must be done on the validation set, not the test set. If you tune the threshold on the test set, you are leaking test information into your model selection, and your reported metrics become biased.
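A threshold sweep for maximum F1 might look like the sketch below. The validation scores, labels, and the helper `f1_at` are hypothetical. Instead of a fixed grid from 0 to 1, this variant tries only the observed scores, since the hard predictions cannot change anywhere else.

```python
import numpy as np


def f1_at(scores: np.ndarray, labels: np.ndarray, tau: float) -> float:
    """F1 of the hard predictions `scores >= tau` (0.0 if undefined)."""
    pred = scores >= tau
    tp = int((pred & (labels == 1)).sum())
    fp = int((pred & (labels == 0)).sum())
    fn = int((~pred & (labels == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical VALIDATION-set scores and labels (never the test set).
val_scores = np.array([0.92, 0.85, 0.71, 0.64, 0.55, 0.48, 0.35, 0.33, 0.20, 0.05])
val_labels = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

# Predictions only change at observed scores, so those are the candidates.
candidates = np.unique(val_scores)
f1_values = np.array([f1_at(val_scores, val_labels, t) for t in candidates])

best_tau = float(candidates[np.argmax(f1_values)])
best_f1 = float(f1_values.max())
default_f1 = f1_at(val_scores, val_labels, 0.5)

print(f"default tau=0.50 -> F1={default_f1:.3f}")         # F1=0.800
print(f"best    tau={best_tau:.2f} -> F1={best_f1:.3f}")  # tau=0.35, F1=0.833
```

On this toy data the best threshold sits below 0.5: accepting two false positives is worth it to recover the last positive. The chosen `best_tau` would then be frozen before touching the test set.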

ROC and PR Curves

A single threshold gives us one operating point at a time. A ROC curve and a PR curve sweep all thresholds at once and plot the tradeoff as a single line. Instead of picking one threshold and reporting precision and recall for it, we report the area under the curve: a summary of how well the model ranks positives above negatives, independent of any particular threshold.

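| Curve | x-axis | y-axis | Ideal shape | Summary statistic |
|---|---|---|---|---|
| ROC | FPR = FP / N | TPR = TP / P = Recall | up-and-left, hugs (0, 1) | ROC-AUC ∈ [0, 1], random = 0.5 |
| PR | Recall | Precision | flat at the top, hugs precision = 1 | PR-AUC (Average Precision), random ≈ positive-class base rate |

Here P and N are the numbers of actual positives and actual negatives in the evaluation set.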

ROC-AUC has a clean probabilistic interpretation: if you pick one random positive and one random negative, ROC-AUC is the probability that the model assigns the positive the higher score (Fawcett 2006). An AUC of 0.8125 means the ranking is correct about 81% of the time.
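The pairwise interpretation can be checked by brute force. The sketch below compares every positive with every negative on the same eight toy scores used in the NumPy example later in this section; counting ties as half a win is the usual convention, although none occur here.

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])

pos = scores[y_true == 1]   # scores of the 4 actual positives
neg = scores[y_true == 0]   # scores of the 4 actual negatives

# Compare every positive with every negative: 4 x 4 = 16 pairs.
wins = int((pos[:, None] > neg[None, :]).sum())
ties = int((pos[:, None] == neg[None, :]).sum())
n_pairs = len(pos) * len(neg)

pairwise_auc = (wins + 0.5 * ties) / n_pairs
print(f"{wins} of {n_pairs} pairs ranked correctly -> AUC = {pairwise_auc:.4f}")
# 13 of 16 pairs ranked correctly -> AUC = 0.8125
```

This is quadratic in the number of samples, so it is only a way to build intuition; the threshold sweep computes the same number far more cheaply.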

PR-AUC reports a different thing: how precise the model is at every recall level. When the positive class is rare, ROC-AUC can stay deceptively high because the denominator of FPR (the huge pool of negatives) makes even hundreds of false positives look like a tiny FPR. PR-AUC is dominated by precision, so it collapses as soon as false positives start accumulating. Saito & Rehmsmeier (2015) showed this formally; the short version: on heavily imbalanced data, prefer PR-AUC over ROC-AUC.

ROC sweep in NumPy — every threshold, every confusion count
🐍roc_pr_from_scratch.py
import numpy as np

NumPy for array math. The sweep over thresholds is vectorized: scores >= tau is a boolean mask applied to all samples at once.

# 8 predictions: model score and true label

A tiny balanced test set (4 positives + 4 negatives) so every (TP, FP) count is easy to trace by hand.

scores = np.array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])

The model's output probability for class 1 on each of the 8 test samples. Higher score = more confident it is positive.

EXECUTION STATE
scores = array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])

Ground truth. The top 2 scores (0.95, 0.80) are actually positive — those are easy for the model. The 3rd score (0.75) is a false positive. Mixing happens in the middle of the ranking — that is what the curves will reveal.

EXECUTION STATE
y_true = array([1, 1, 0, 1, 0, 1, 0, 0])
P = int((y_true == 1).sum())

Count the positives. Needed to normalize TPR = TP / P (= recall).

EXECUTION STATE
P = 4 (number of actual positives in the dataset)
N = int((y_true == 0).sum())

Count the negatives. Needed for FPR = FP / N.

EXECUTION STATE
N = 4 (number of actual negatives in the dataset)
thresholds = np.sort(np.concatenate([[1.01], scores, [-0.01]]))[::-1]

Build the list of thresholds to sweep. We pad with 1.01 (above every score) and -0.01 (below every score) so the curve starts at (0, 0) and ends at (1, 1). Then we reverse so tau starts high and decreases — this makes fpr and tpr increase monotonically, ready for np.trapz.

EXECUTION STATE
📚 np.concatenate = Glue arrays end-to-end into one 1-D array.
📚 np.sort(...)[::-1] = Sort ascending, then reverse — equivalent to sort descending.
thresholds = [1.01, 0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10, -0.01]
tpr, fpr, precision, recall = [], [], [], []

Four parallel lists, one entry per threshold we sweep. tpr and fpr go into the ROC plot; precision and recall go into the PR plot.

for tau in thresholds:

Sweep every threshold, computing the confusion counts and the four rates at each step. Ten iterations total.

LOOP TRACE · 10 iterations
tau=1.01
(tp, fp) = (0, 0) — nothing above 1.01
(tpr, fpr) = (0.0, 0.0)
(precision, recall) = (1.0, 0.0) — by convention, 0/0 = 1
tau=0.95
(tp, fp) = (1, 0) — 0.95 caught
(tpr, fpr) = (0.25, 0.0)
(precision, recall) = (1.0, 0.25)
tau=0.80
(tp, fp) = (2, 0) — 0.80 caught too
(tpr, fpr) = (0.5, 0.0)
(precision, recall) = (1.0, 0.5)
tau=0.75
(tp, fp) = (2, 1) — first false positive
(tpr, fpr) = (0.5, 0.25)
(precision, recall) = (0.667, 0.5)
tau=0.60
(tp, fp) = (3, 1)
(tpr, fpr) = (0.75, 0.25)
(precision, recall) = (0.75, 0.75)
tau=0.50
(tp, fp) = (3, 2)
(tpr, fpr) = (0.75, 0.5)
(precision, recall) = (0.6, 0.75)
tau=0.40
(tp, fp) = (4, 2) — all positives found
(tpr, fpr) = (1.0, 0.5)
(precision, recall) = (0.667, 1.0)
tau=0.30
(tp, fp) = (4, 3)
(tpr, fpr) = (1.0, 0.75)
(precision, recall) = (0.571, 1.0)
tau=0.10
(tp, fp) = (4, 4)
(tpr, fpr) = (1.0, 1.0)
(precision, recall) = (0.5, 1.0)
tau=-0.01
(tp, fp) = (4, 4) — everything predicted positive
(tpr, fpr) = (1.0, 1.0)
(precision, recall) = (0.5, 1.0)
    pred = scores >= tau

Vectorized boolean mask: one True/False per sample. Equivalent to 'predict 1 iff confidence at least tau'.

    tp = int((pred & (y_true == 1)).sum())

True positives: predicted 1 AND actually 1. Boolean AND, then count the Trues.

    fp = int((pred & (y_true == 0)).sum())

False positives: predicted 1 BUT actually 0. These are the 'false alarms'.

    tpr.append(tp / P)

TPR = TP / P = Recall. Fraction of actual positives we caught at this threshold.

    fpr.append(fp / N)

FPR = FP / N. Fraction of actual negatives we wrongly flagged. ROC plots TPR vs FPR.

    precision.append(tp / (tp + fp) if (tp + fp) > 0 else 1.0)

Precision = TP / (TP + FP). When nothing is predicted positive, (TP + FP) = 0 and precision is undefined; we use 1.0 by convention so the PR curve starts cleanly at (recall=0, precision=1).

    recall.append(tp / P)

Recall = TPR. Kept as a separate list so we can pair it with precision for the PR plot without relying on ROC coordinates.

roc_auc = np.trapz(tpr, fpr)

ROC-AUC = area under (FPR, TPR). np.trapz is the composite trapezoidal rule: for each adjacent pair of points it adds width * (y1 + y2) / 2. Because we iterated thresholds from high to low, fpr and tpr are both non-decreasing, so np.trapz integrates left-to-right correctly. Note that NumPy 2.0 deprecated np.trapz in favor of the equivalent np.trapezoid.

EXECUTION STATE
📚 np.trapz(y, x) = Numerical integration by the trapezoidal rule. x is the abscissa (FPR here), y is the ordinate (TPR). Works with non-uniform spacing.
→ by hand = Non-zero-width trapezoids: 0.25*0.5 + 0.25*0.75 + 0.25*1 + 0.25*1 = 0.125 + 0.1875 + 0.25 + 0.25 = 0.8125
roc_auc = 0.8125
print(f"ROC-AUC = {roc_auc:.4f}")

A perfect ranker would put all positives above all negatives → AUC = 1.0. A coin flip → 0.5. Our model scores 0.8125, which means: pick a random positive and a random negative, the positive gets the higher score about 81% of the time. This probabilistic interpretation is the main reason ROC-AUC is so popular.

EXECUTION STATE
stdout = ROC-AUC = 0.8125
import numpy as np

# 8 predictions: model score (probability of class 1) and true label
scores = np.array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
y_true = np.array([   1,    1,    0,    1,    0,    1,    0,    0])

P = int((y_true == 1).sum())   # positives  = 4
N = int((y_true == 0).sum())   # negatives  = 4

# Sweep thresholds from just above the max score down to just below the min.
# As tau drops, more samples are predicted positive -> both TPR and FPR rise.
thresholds = np.sort(np.concatenate([[1.01], scores, [-0.01]]))[::-1]

tpr, fpr, precision, recall = [], [], [], []
for tau in thresholds:
    pred = scores >= tau
    tp = int((pred & (y_true == 1)).sum())
    fp = int((pred & (y_true == 0)).sum())
    tpr.append(tp / P)
    fpr.append(fp / N)
    precision.append(tp / (tp + fp) if (tp + fp) > 0 else 1.0)
    recall.append(tp / P)

# ROC-AUC: area under (FPR, TPR). Trapezoidal rule on the swept curve.
# On NumPy >= 2.0, np.trapezoid is the non-deprecated name for np.trapz.
roc_auc = np.trapz(tpr, fpr)
print(f"ROC-AUC = {roc_auc:.4f}")   # 0.8125

scikit-learn gives you the same result in three function calls:

ROC & PR via sklearn — the production workhorses
🐍roc_pr_sklearn.py
import numpy as np

Used here for the input arrays. NumPy is also a hard dependency of scikit-learn, so it is always available wherever sklearn is installed.

from sklearn.metrics import (roc_curve, auc, precision_recall_curve, average_precision_score)

The four functions you need to go from (y_true, y_score) to the two curves and their areas.

EXECUTION STATE
📚 roc_curve(y_true, y_score) = Returns (fpr, tpr, thresholds). Works out the exact set of distinct thresholds where the ranking changes — usually many fewer than scores.size + 2.
📚 auc(x, y) = Generic trapezoidal-rule area under a curve. Used for both ROC-AUC and (occasionally) PR-AUC.
📚 precision_recall_curve = Returns (precision, recall, thresholds). Last point is always (precision=1, recall=0) so the curve closes on the left.
📚 average_precision_score = Sum over ordered recall levels: AP = Σ_n (R_n - R_{n-1}) · P_n. Recommended over auc(recall, precision) because AP is less sensitive to sparse threshold grids.
scores = ... ; y_true = ...

Same 8-sample dataset as the NumPy cell so the outputs are directly comparable.

fpr, tpr, roc_th = roc_curve(y_true, scores)

scikit-learn sweeps thresholds efficiently: it sorts scores once and walks the ranking. Returns three arrays of equal length. roc_th[i] is the threshold that produced fpr[i], tpr[i]. The first point is always (0, 0) (no positives predicted) and the last is (1, 1) (everything predicted positive).

EXECUTION STATE
fpr (example, drop_intermediate=False) = array([0.0, 0.0, 0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 1.0])
tpr (example, drop_intermediate=False) = array([0.0, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0, 1.0])
roc_th (example, drop_intermediate=False) = array([inf, 0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
→ nuance = With the default drop_intermediate=True, sklearn may omit points that do not change the shape of the curve, so the returned arrays can be shorter. The first threshold is a sentinel above every score: np.inf in recent releases, max(scores) + 1 = 1.95 in older ones.
roc_auc = auc(fpr, tpr)

Trapezoidal area. Same value as the NumPy cell's np.trapz call, because the underlying integral is the same.

EXECUTION STATE
roc_auc = 0.8125
print(f"ROC-AUC = {roc_auc:.4f}")

Same as NumPy version: 0.8125. Good sanity check — our from-scratch implementation matches the library.

prec, rec, pr_th = precision_recall_curve(y_true, scores)

The returned arrays are ordered by increasing threshold, so recall falls monotonically from 1 to 0 along the array, while precision oscillates depending on whether each dropped prediction was a true or a false positive. Note the length of pr_th is len(prec) - 1: sklearn appends a final (precision=1, recall=0) point with no associated threshold, which is why the returned threshold array is one shorter.

EXECUTION STATE
prec (example) = array([0.5, 0.571, 0.667, 0.6, 0.75, 0.667, 1.0, 1.0, 1.0])
rec (example) = array([1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 0.5, 0.25, 0.0])
→ nuance = The exact returned arrays can vary with the sklearn version (older releases may drop the leading points once full recall is reached); values should match the NumPy trace at the shared threshold points.
ap = average_precision_score(y_true, scores)

Average Precision is sklearn's preferred PR-AUC summary. Definition: iterate through ordered recall points and add (ΔR) · P at each. Hand check: the unique recall increments happen at 0.25 (P=1), 0.5 (P=1), 0.75 (P=0.75), 1.0 (P=0.667). AP = 0.25·1 + 0.25·1 + 0.25·0.75 + 0.25·0.667 = 0.854.

EXECUTION STATE
ap = 0.8542
→ why AP, not auc(rec, prec) = trapezoidal interpolation on a zig-zagging PR curve overestimates the area. AP's step-function rule is what 'PR-AUC' means in modern ML literature.
print(f"PR-AUC (Avg. Prec) = {ap:.4f}")

0.8542 here vs 0.8125 for ROC-AUC. They disagree because they weight different regions of the curve: ROC treats correctly-rejected negatives as progress; PR only counts correctly-identified positives against the total positive predictions. On imbalanced data the disagreement is dramatic — PR drops sharply while ROC stays flattering (Saito & Rehmsmeier 2015).

EXECUTION STATE
stdout = PR-AUC (Avg. Prec) = 0.8542
import numpy as np
from sklearn.metrics import (
    roc_curve, auc,
    precision_recall_curve, average_precision_score,
)

scores = np.array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
y_true = np.array([   1,    1,    0,    1,    0,    1,    0,    0])

# --- ROC ---
fpr, tpr, roc_th = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC            = {roc_auc:.4f}")   # 0.8125

# --- PR ---
prec, rec, pr_th = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"PR-AUC (Avg. Prec) = {ap:.4f}")        # 0.8542
Quick decision rule. For balanced binary problems, report ROC-AUC and accuracy at the best threshold. For imbalanced problems (e.g., fraud, rare-disease, spam), always report PR-AUC / Average Precision alongside ROC-AUC — the gap between the two is itself a diagnostic.
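The from-scratch NumPy sweep stops at ROC-AUC, so here is a minimal sketch of the matching Average Precision computation, using the step-sum definition AP = Σ_n (R_n − R_{n−1}) · P_n. It assumes there are no tied scores (true for this toy data); ties would need to be grouped before applying the rule.

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.10])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Rank samples by descending score (assumes no tied scores).
order = np.argsort(-scores)
labels = y_true[order]

tp_cum = np.cumsum(labels)                 # true positives among the top-k
k = np.arange(1, len(labels) + 1)          # number of predicted positives
precision_at_k = tp_cum / k
recall_at_k = tp_cum / labels.sum()

# AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0.
delta_recall = np.diff(np.concatenate([[0.0], recall_at_k]))
ap = float((delta_recall * precision_at_k).sum())
print(f"AP = {ap:.4f}")  # AP = 0.8542
```

Only ranks where recall actually increases contribute, which is why the sum reduces to the four terms in the hand check: (1 + 1 + 0.75 + 0.667) / 4.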

Multiclass Metrics

Everything so far assumed two classes. For a $K$-class problem the confusion matrix grows to $K \times K$, with rows indexing the true class and columns indexing the predicted class. The diagonal counts correct predictions; every off-diagonal cell $C_{i,j}$ counts the samples of true class $i$ that were misclassified as class $j$.

Per-class precision and recall come from a one-vs-rest reduction: treat class $k$ as “positive” and every other class as “negative”. Then $\text{precision}_k = C_{k,k} / \sum_i C_{i,k}$ (diagonal over column sum) and $\text{recall}_k = C_{k,k} / \sum_j C_{k,j}$ (diagonal over row sum).
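In vectorized form, the two one-vs-rest formulas are a diagonal divided by column sums and by row sums. The sketch below uses an illustrative 3-class confusion matrix and assumes every class is predicted at least once and has at least one true sample; otherwise the divisions would need a zero guard.

```python
import numpy as np

# Illustrative confusion matrix: rows = true class, columns = predicted class.
C = np.array([[3, 1, 1],
              [1, 2, 0],
              [1, 1, 2]])

diag = np.diag(C)                     # C[k, k]: correct predictions per class
precision = diag / C.sum(axis=0)      # diagonal over column sums
recall = diag / C.sum(axis=1)         # diagonal over row sums

for k in range(C.shape[0]):
    print(f"class {k}: precision={precision[k]:.3f}  recall={recall[k]:.3f}")
# class 0: precision=0.600  recall=0.600
# class 1: precision=0.500  recall=0.667
# class 2: precision=0.667  recall=0.500
```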

The interesting question is how to collapse $K$ per-class scores into a single headline number. There are three standard choices (Manning, Raghavan & Schütze 2008):

| Averaging | Formula | When to use |
|---|---|---|
| Macro | mean(F1_k): unweighted mean of per-class F1 | Every class matters equally. Your model should not be allowed to coast on the majority class. |
| Weighted | Σ_k support_k · F1_k / Σ_k support_k | You want one number that reflects the real class distribution in the data (e.g., for a business report). |
| Micro | pool TP / FP / FN across classes, then F1 on the pool | For multiclass single-label problems this EQUALS accuracy. Useful for multilabel (see below), rarely for plain multiclass. |
Multiclass F1 three ways — hand-computed on a 12-sample, 3-class demo
🐍multiclass_metrics_from_scratch.py
import numpy as np

Arrays and vectorized ops. Everything below is linear in the number of samples.

# 3-class problem, 12 samples

Small enough to verify every confusion-matrix entry by inspection. Real multiclass problems can have K=1000+ classes (ImageNet), but the math is identical.

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

Supports are not quite balanced: 5 class-0, 3 class-1, 4 class-2. Slight imbalance will make macro vs. weighted averaging differ slightly.

EXECUTION STATE
support[0] = 5 — class 0 appears 5 times
support[1] = 3 — class 1 appears 3 times
support[2] = 4 — class 2 appears 4 times
y_pred = np.array([0, 0, 0, 1, 2, 1, 1, 0, 2, 2, 1, 0])

The model's argmax predictions. Some are right, some are wrong — the confusion matrix tells the full story.

K = 3

Number of classes. The confusion matrix is K×K.

conf = np.zeros((K, K), dtype=int)

Allocate the K×K integer matrix. conf[i, j] will hold the number of samples whose TRUE class is i and PREDICTED class is j.

for t, p in zip(y_true, y_pred):

Walk every sample once, incrementing the appropriate cell.

LOOP TRACE · 12 iterations
(t=0, p=0)
increment = conf[0,0] += 1 (diagonal hit)
(t=0, p=0)
increment = conf[0,0] += 1 (diagonal hit)
(t=0, p=0)
increment = conf[0,0] += 1 (3rd diagonal hit)
(t=0, p=1)
increment = conf[0,1] += 1 (mispredict class 0 as 1)
(t=0, p=2)
increment = conf[0,2] += 1 (mispredict class 0 as 2)
(t=1, p=1)
increment = conf[1,1] += 1
(t=1, p=1)
increment = conf[1,1] += 1
(t=1, p=0)
increment = conf[1,0] += 1
(t=2, p=2)
increment = conf[2,2] += 1
(t=2, p=2)
increment = conf[2,2] += 1
(t=2, p=1)
increment = conf[2,1] += 1
(t=2, p=0)
increment = conf[2,0] += 1
    conf[t, p] += 1

The only operation inside the loop. For each sample, add 1 to the cell indexed by (true class, predicted class). Diagonal cells count correct predictions; off-diagonals count the specific ways the model got confused.

print("Confusion (rows = true, cols = pred):")

Human-readable label so the student knows which axis is which.

EXECUTION STATE
stdout = Confusion (rows = true, cols = pred):
print(conf)

Final confusion matrix after all 12 samples are counted.

EXECUTION STATE
conf =
[[3 1 1]
 [1 2 0]
 [1 1 2]]
→ row sums = 5, 3, 4 — match the support of each true class ✓
→ diagonal = [3, 2, 2] — total 7 correct out of 12 samples
per_class = {}

Dict that will hold (precision, recall, F1, support) for each of the three classes.

for k in range(K):

Compute precision / recall / F1 by treating class k as the 'positive' class and everything else as 'negative'. Classic one-vs-rest reduction.

LOOP TRACE · 3 iterations
k=0
tp = conf[0,0] = 3
fp (col 0 − diag) = (3+1+1) − 3 = 2
fn (row 0 − diag) = (3+1+1) − 3 = 2
prec, rec, F1 = 3/5 = 0.600, 3/5 = 0.600, 0.600
k=1
tp = conf[1,1] = 2
fp (col 1 − diag) = (1+2+1) − 2 = 2
fn (row 1 − diag) = (1+2+0) − 2 = 1
prec, rec, F1 = 2/4 = 0.500, 2/3 = 0.667, 2·0.5·0.667/(0.5+0.667) = 0.571
k=2
tp = conf[2,2] = 2
fp (col 2 − diag) = (1+0+2) − 2 = 1
fn (row 2 − diag) = (1+1+2) − 2 = 2
prec, rec, F1 = 2/3 = 0.667, 2/4 = 0.500, 0.571
    tp = conf[k, k]

True positives for class k = the diagonal cell: samples that are truly k AND predicted k.

    fp = conf[:, k].sum() - tp

False positives for class k = column k total minus diagonal = samples predicted k but actually something else.

    fn = conf[k, :].sum() - tp

False negatives for class k = row k total minus diagonal = samples that were truly k but predicted as something else.

    prec = tp / (tp + fp) if (tp + fp) else 0.0

Precision for class k. Guard against divide-by-zero (a class we never predicted).

    rec = tp / (tp + fn) if (tp + fn) else 0.0

Recall for class k. Guard against divide-by-zero (a class with zero support, which is pathological).

    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

Harmonic mean of precision and recall. Penalizes imbalance: if either is 0, F1 is 0.

    support = int(conf[k, :].sum())

How many samples of class k exist in the test set. Used for weighted averaging.

    print(f"class {k}: P={prec:.3f}  R={rec:.3f}  F1={f1:.3f}  support={support}")

Per-class summary row. These are the same per-class numbers that sklearn's classification_report prints.

EXECUTION STATE
stdout =
class 0: P=0.600  R=0.600  F1=0.600  support=5
class 1: P=0.500  R=0.667  F1=0.571  support=3
class 2: P=0.667  R=0.500  F1=0.571  support=4
macro_f1 = np.mean([v[2] for v in per_class.values()])

Macro average: unweighted mean of per-class F1. Every class counts equally, regardless of support. Use this when you care about every class EQUALLY — e.g., rare-disease detection where one missed class is as bad as another.

EXECUTION STATE
macro_f1 = (0.600 + 0.571 + 0.571) / 3 = 0.581
total_support = sum(v[3] for v in per_class.values())

Total sample count = 12. Denominator for weighted averaging.

weighted_f1 = sum(v[2] * v[3] for v in per_class.values()) / total_support

Weighted average: per-class F1 weighted by its support. Lets common classes dominate. Use this when you want a single headline number that reflects the class distribution in your data.

EXECUTION STATE
weighted_f1 = (0.600·5 + 0.571·3 + 0.571·4) / 12 = (3.0 + 1.713 + 2.284) / 12 = 0.583
tp_tot = int(np.trace(conf))

Sum of diagonal = total correct predictions across all classes.

EXECUTION STATE
tp_tot = 3 + 2 + 2 = 7
fp_tot = int(conf.sum() - tp_tot)

Every off-diagonal entry is a false positive for exactly one class.

EXECUTION STATE
fp_tot = 12 − 7 = 5
fn_tot = fp_tot

In multiclass, every misclassified sample is simultaneously a false positive for the wrongly-predicted class and a false negative for the true class. So FP_tot = FN_tot.

micro_prec = tp_tot / (tp_tot + fp_tot)

Pooled precision.

EXECUTION STATE
micro_prec = 7 / (7 + 5) = 7/12 ≈ 0.583
micro_rec = tp_tot / (tp_tot + fn_tot)

Pooled recall. Because fp_tot = fn_tot, micro_prec = micro_rec.

EXECUTION STATE
micro_rec = 7 / 12 ≈ 0.583
micro_f1 = 2 * micro_prec * micro_rec / (micro_prec + micro_rec)

Harmonic mean of equal inputs = the input itself. So micro_f1 = 0.583.

EXECUTION STATE
micro_f1 = 0.583
accuracy = tp_tot / conf.sum()

Accuracy = correct / total. Since tp_tot = 7 and total = 12, accuracy ≈ 0.583. Identical to micro_f1. This is not a coincidence: for multiclass single-label classification, micro-averaged precision = micro recall = micro F1 = accuracy. That is why sklearn's classification_report only lists one of them.

EXECUTION STATE
accuracy = 7/12 ≈ 0.583
print(f"macro   F1 = {macro_f1:.3f}")

Final summary.

EXECUTION STATE
stdout =
macro   F1 = 0.581
weighted F1 = 0.583
micro   F1 = 0.583  (= accuracy = 0.583)
→ when they diverge = With severe imbalance (say 90/5/5) and a model that predicts only the majority class, the two minority classes have F1 = 0, so macro_f1 falls to about 0.32 (one third of the majority-class F1 of about 0.95), while accuracy stays at 0.90 and weighted_f1 at about 0.85. Macro averaging exposes the failure that the other two numbers hide.
import numpy as np

# 3-class problem, 12 samples. y_true is the truth, y_pred is argmax(model output).
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 1, 1, 0, 2, 2, 1, 0])

K = 3
conf = np.zeros((K, K), dtype=int)
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1
print("Confusion (rows = true, cols = pred):")
print(conf)

# Per-class precision / recall / F1
per_class = {}
for k in range(K):
    tp = conf[k, k]
    fp = conf[:, k].sum() - tp
    fn = conf[k, :].sum() - tp
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec  = tp / (tp + fn) if (tp + fn) else 0.0
    f1   = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    support = int(conf[k, :].sum())
    per_class[k] = (prec, rec, f1, support)
    print(f"class {k}: P={prec:.3f}  R={rec:.3f}  F1={f1:.3f}  support={support}")

# Macro: unweighted mean of per-class F1
macro_f1 = np.mean([v[2] for v in per_class.values()])

# Weighted: per-class F1 weighted by support
total_support = sum(v[3] for v in per_class.values())
weighted_f1 = sum(v[2] * v[3] for v in per_class.values()) / total_support

# Micro: pool TP / FP / FN across classes, then compute F1.
# For multiclass this equals accuracy.
tp_tot = int(np.trace(conf))
fp_tot = int(conf.sum() - tp_tot)   # every non-diagonal is FP for some class
fn_tot = fp_tot                       # and simultaneously FN for another
micro_prec = tp_tot / (tp_tot + fp_tot)
micro_rec  = tp_tot / (tp_tot + fn_tot)
micro_f1   = 2 * micro_prec * micro_rec / (micro_prec + micro_rec)
accuracy   = tp_tot / conf.sum()

print(f"macro   F1 = {macro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
print(f"micro   F1 = {micro_f1:.3f}  (= accuracy = {accuracy:.3f})")

scikit-learn's classification_report prints the full per-class table and all three averages in a single call:

classification_report — the one-liner that gives you everything
🐍multiclass_metrics_sklearn.py
import numpy as np

For array inputs; sklearn accepts any array-like.

from sklearn.metrics import confusion_matrix, classification_report, f1_score

Three functions that cover almost every multiclass evaluation need.

EXECUTION STATE
📚 confusion_matrix = Returns the K×K matrix with rows = true class, columns = predicted class. Same convention we used in the NumPy cell.
📚 classification_report = Pretty-prints per-class precision / recall / F1 / support + macro and weighted averages + accuracy. Excellent for a one-line eval dump.
📚 f1_score(..., average=...) = Scalar F1 under a chosen averaging scheme. average='macro' | 'weighted' | 'micro' | 'samples' | None (returns per-class array).
y_true, y_pred = np.array(...), np.array(...)

Same 12-sample dataset as the NumPy cell.

print(confusion_matrix(y_true, y_pred))

Prints the same matrix we built by hand.

EXECUTION STATE
stdout =
[[3 1 1]
 [1 2 0]
 [1 1 2]]
print(classification_report(y_true, y_pred, digits=3))

Produces the full per-class report + macro + weighted + accuracy. digits=3 controls decimals. The 'support' column is the number of true instances of each class — it is what weighted averaging multiplies by.

EXECUTION STATE
stdout =
              precision    recall  f1-score   support

           0      0.600     0.600     0.600         5
           1      0.500     0.667     0.571         3
           2      0.667     0.500     0.571         4

    accuracy                          0.583        12
   macro avg      0.589     0.589     0.581        12
weighted avg      0.597     0.583     0.583        12
print("f1_score macro   :", f1_score(y_true, y_pred, average="macro"))

Matches our from-scratch macro_f1 (0.581). average='macro' treats every class equally.

EXECUTION STATE
stdout = f1_score macro : 0.5809523809523809
print("f1_score weighted:", f1_score(y_true, y_pred, average="weighted"))

Matches our weighted_f1 (≈ 0.583). average='weighted' weights each class's F1 by support.

EXECUTION STATE
stdout = f1_score weighted: 0.5833333333333334
print("f1_score micro   :", f1_score(y_true, y_pred, average="micro"))

Matches our micro_f1 and the accuracy (≈ 0.583). For multiclass single-label problems this is always true.

EXECUTION STATE
stdout = f1_score micro : 0.5833333333333334
→ practical rule = If someone reports 'F1 score' on a multiclass problem without specifying the averaging, default to asking 'macro or weighted?' — micro is almost never what anyone means because it collapses to accuracy.
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    f1_score,
)

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 1, 1, 0, 2, 2, 1, 0])

print(confusion_matrix(y_true, y_pred))
print()
print(classification_report(y_true, y_pred, digits=3))

print("f1_score macro   :", f1_score(y_true, y_pred, average="macro"))
print("f1_score weighted:", f1_score(y_true, y_pred, average="weighted"))
print("f1_score micro   :", f1_score(y_true, y_pred, average="micro"))
Warning on imbalance. Swap the supports to something like 90/5/5 and train a model that only predicts class 0. Accuracy is 90%. Weighted F1 is ≈ 0.85. Macro F1 drops to about 0.32, just under 1/3, because classes 1 and 2 have F1 = 0. If you only reported accuracy you would never notice the failure. This is why macro averaging is a common default for imbalanced multiclass problems.
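The imbalance warning can be checked numerically. The sketch below assumes a hypothetical 90/5/5 split and a degenerate model that always outputs class 0; the arrays are made up for illustration and are not part of the section's dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 90/5/5 split: 90 samples of class 0, 5 each of classes 1 and 2.
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the undefined precision of never-predicted classes
# as 0 and suppresses the warning sklearn would otherwise emit.
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy    : {acc:.3f}")
print(f"weighted F1 : {weighted:.3f}")
print(f"macro F1    : {macro:.3f}")
```

Accuracy and weighted F1 both stay high because class 0 carries 90% of the weight, while macro F1 averages one good class with two classes that score zero.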

Interactive: Metrics Dashboard

The dashboard below shows 20 binary classification predictions. Drag the threshold slider and watch how precision, recall, F1, and the confusion matrix change in real-time. The precision-recall curve shows the tradeoff as a curve — the optimal operating point (best F1) is marked in green.


Key observations as you experiment:

  • Threshold 0.05–0.20: Nearly all samples predicted positive. Recall is 100% (all positives caught) but precision drops because many negatives are incorrectly labeled positive
  • Threshold 0.30: The sweet spot for this dataset. F1 peaks at 92.3% — all 12 positives are caught (recall=100%) with only 2 false positives (precision=85.7%)
  • Threshold 0.50 (default): 3 positives are missed (recall drops to 75%). F1 = 78.3%. The default is not optimal
  • Threshold 0.70+: Zero false positives (precision=100%). At 0.70 recall is still 75% (F1 = 85.7%), but as the threshold rises further the model misses more and more positives, and recall and F1 drop sharply
  • Threshold 0.90+: Only the most confident predictions survive. The model is extremely conservative — perfect precision but catches only 25% of positives
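The observations above can be reproduced without the dashboard. This is a minimal sketch using the same 20 samples as the rest of the section; the helper name prf_at is an assumption introduced here for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
probs = np.array([0.92, 0.15, 0.88, 0.42, 0.22, 0.78, 0.61, 0.85, 0.95, 0.18,
                  0.38, 0.73, 0.12, 0.81, 0.55, 0.90, 0.35, 0.08, 0.87, 0.20])


def prf_at(threshold):
    """Return (precision, recall, f1) for the positive class at one threshold."""
    y_pred = probs >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The same thresholds discussed in the bullet list.
for th in (0.10, 0.30, 0.50, 0.70, 0.90):
    p, r, f = prf_at(th)
    print(f"th={th:.2f}  precision={p:.3f}  recall={r:.3f}  F1={f:.3f}")
```

Reading the printed rows from top to bottom shows the tradeoff directly: precision rises with the threshold while recall falls.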

Computing Metrics from Scratch

The code below computes all four confusion matrix entries, derives precision/recall/F1, and sweeps thresholds to find the optimal F1 — all in NumPy. Understanding this from-scratch implementation makes the formulas concrete.

NumPy — Confusion Matrix, Precision, Recall, F1, and Threshold Tuning
🐍classification_metrics.py
1import numpy as np

NumPy provides vectorized array operations. We use boolean indexing and element-wise comparisons to compute confusion matrix entries without loops.

EXECUTION STATE
📚 numpy = Numerical computing library. Key features used here: boolean arrays, element-wise &, np.sum on booleans, np.arange for threshold sweep.
3# 20 test samples: ground truth and model probabilities

We have 20 samples from a binary classifier. y_true contains the correct labels (0 or 1). probs contains the model’s predicted probability of class 1 for each sample. These are the outputs of a model that was already trained — we are now evaluating its quality.

4y_true = np.array([1,0,1,1,...]) — Ground truth labels

The true class for each of the 20 test samples. 1 = positive (e.g., ‘has disease’), 0 = negative (‘healthy’). There are 12 positives and 8 negatives.

EXECUTION STATE
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
→ class balance = 12 positives (60%), 8 negatives (40%). Moderately imbalanced — accuracy alone may be misleading.
5probs = np.array([0.92,0.15,...]) — Model confidence scores

The model’s predicted probability that each sample belongs to class 1. Higher = more confident it is positive. These come from a softmax or sigmoid output layer.

EXECUTION STATE
probs = [0.92, 0.15, 0.88, 0.42, 0.22, 0.78, 0.61, 0.85, 0.95, 0.18, 0.38, 0.73, 0.12, 0.81, 0.55, 0.90, 0.35, 0.08, 0.87, 0.20]
→ note sample 3 = prob=0.42 but true=1. The model is unsure about this positive sample — it will become a False Negative at threshold 0.5.
→ note sample 6 = prob=0.61 but true=0. The model is fairly confident this is positive, but it is actually negative — this will be a False Positive.
8# Step 1: Apply decision threshold

The model outputs probabilities, not class labels. To get a binary prediction, we choose a threshold: if prob ≥ threshold, predict class 1; otherwise predict class 0. The choice of threshold is a critical decision.

9threshold = 0.5 — The default decision boundary

0.5 is the standard default: predict class 1 if the model thinks it is more likely than not. But this is not always optimal — we will explore other thresholds in Step 4.

EXECUTION STATE
threshold = 0.5 = Predict 1 if prob ≥ 0.5, else 0. Standard default but NOT necessarily the best for every task.
10y_pred = (probs >= threshold).astype(int) — Binary predictions

Compare each probability to the threshold. probs >= 0.5 returns a boolean array [True, False, True, ...]. astype(int) converts to [1, 0, 1, ...] for comparison with y_true.

EXECUTION STATE
📚 (probs >= 0.5) = Element-wise comparison. Returns boolean array: [True, False, True, False, False, True, True, True, True, False, False, True, False, True, True, True, False, False, True, False]
📚 .astype(int) = Convert booleans to integers: True→1, False→0.
⬆ y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]
11print(f"Predictions: ...")

Output: Predictions: [1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]. Compare with y_true to spot errors.

EXECUTION STATE
output = Predictions: [1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]
13# Step 2: Build confusion matrix

The confusion matrix is a 2×2 table that counts every possible outcome: the model can be right in two ways (TP, TN) and wrong in two ways (FP, FN). For a fixed threshold, every metric in this section is computed from these four numbers.

14TP = np.sum((y_true == 1) & (y_pred == 1)) — True Positives

Count samples where the model correctly predicted positive. (y_true == 1) selects the 12 actual positives. (y_pred == 1) selects the 11 predicted positives. The & (AND) keeps only samples in both sets.

EXECUTION STATE
📚 (y_true == 1) = Boolean mask: [True, False, True, True, False, True, False, True, True, False, True, True, False, True, False, True, True, False, True, False]. 12 True entries.
📚 & = Element-wise AND. Both conditions must be True. This is the intersection of actual positives and predicted positives.
⬆ TP = 9 = 9 samples are correctly identified as positive. These are: samples 0, 2, 5, 7, 8, 11, 13, 15, 18.
15FP = np.sum((y_true == 0) & (y_pred == 1)) — False Positives

Count samples where the model incorrectly predicted positive (false alarm). The model said ‘positive’ but the truth is negative.

EXECUTION STATE
⬆ FP = 2 = 2 false alarms: sample 6 (prob=0.61, true=0) and sample 14 (prob=0.55, true=0). The model was moderately confident but wrong.
16TN = np.sum((y_true == 0) & (y_pred == 0)) — True Negatives

Count samples where the model correctly predicted negative.

EXECUTION STATE
⬆ TN = 6 = 6 correct negative predictions: samples 1, 4, 9, 12, 17, 19.
17FN = np.sum((y_true == 1) & (y_pred == 0)) — False Negatives

Count samples where the model missed a positive (missed detection). The model said ‘negative’ but the truth is positive.

EXECUTION STATE
⬆ FN = 3 = 3 missed positives: sample 3 (prob=0.42), sample 10 (prob=0.38), sample 16 (prob=0.35). All had probabilities just below the 0.5 threshold.
→ sanity check = TP + FP + TN + FN = 9 + 2 + 6 + 3 = 20 = total samples ✓
18print(f"TP={TP}, FP={FP}, TN={TN}, FN={FN}")

Output: TP=9, FP=2, TN=6, FN=3.

EXECUTION STATE
confusion matrix =
              Predicted
            Pos    Neg
Actual Pos   9(TP)  3(FN)
Actual Neg   2(FP)  6(TN)
20# Step 3: Compute metrics

Each metric answers a different question. Accuracy asks ‘what fraction is correct overall?’ Precision asks ‘when the model says positive, how often is it right?’ Recall asks ‘of all actual positives, how many did we catch?’ F1 balances precision and recall.

21accuracy = (TP + TN) / len(y_true) — Overall correctness

Fraction of all predictions that are correct. Simple but can be misleading on imbalanced datasets: a model that always predicts the majority class gets high accuracy for free.

EXECUTION STATE
formula = accuracy = (TP + TN) / N = (9 + 6) / 20 = 15 / 20
⬆ accuracy = 0.7500 (75%) = 15 out of 20 predictions are correct. But this masks that we missed 3 positives (FN) and had 2 false alarms (FP).
22precision = TP / (TP + FP) — How trustworthy are positive predictions?

Of all samples the model labeled as positive, what fraction are actually positive? High precision means few false alarms. Critical when false positives are expensive (e.g., spam filter blocking real emails).

EXECUTION STATE
formula = precision = TP / (TP + FP) = 9 / (9 + 2) = 9 / 11
⬆ precision = 0.8182 (81.8%) = When the model says ‘positive’, it is correct 81.8% of the time. The 2 false positives bring it below 100%.
23recall = TP / (TP + FN) — How many positives did we find?

Of all actual positives, what fraction did the model detect? High recall means few missed positives. Critical when false negatives are dangerous (e.g., cancer screening missing a tumor).

EXECUTION STATE
formula = recall = TP / (TP + FN) = 9 / (9 + 3) = 9 / 12
⬆ recall = 0.7500 (75%) = The model catches 9 out of 12 actual positives. It misses 3 (the ones with probabilities 0.42, 0.38, 0.35 — all just below the threshold).
24f1 = 2 * precision * recall / (precision + recall) — Harmonic mean

F1 balances precision and recall. It is the harmonic mean: it penalizes imbalance more than the arithmetic mean. If either precision or recall is 0, F1 is 0. If both are 1.0, F1 is 1.0.

EXECUTION STATE
formula = F1 = 2 × 0.8182 × 0.7500 / (0.8182 + 0.7500) = 1.2273 / 1.5682
⬆ f1 = 0.7826 (78.3%) = Harmonic mean of precision (81.8%) and recall (75.0%). Closer to the smaller value — the harmonic mean ‘drags down’ toward the weaker metric.
→ why harmonic? = Arithmetic mean of 81.8% and 75.0% = 78.4%. Harmonic mean = 78.3%. The difference grows when precision and recall are far apart: arith(90%, 10%) = 50%, harmonic = 18%. The harmonic mean says ‘both must be good.’
25print(f"Accuracy: ...")

Output: Accuracy: 0.7500

EXECUTION STATE
output = Accuracy: 0.7500
26print(f"Precision: ...")

Output: Precision: 0.8182

EXECUTION STATE
output = Precision: 0.8182
27print(f"Recall: ...")

Output: Recall: 0.7500

EXECUTION STATE
output = Recall: 0.7500
28print(f"F1 Score: ...")

Output: F1 Score: 0.7826

EXECUTION STATE
output = F1 Score: 0.7826
30# Step 4: Threshold sweep — find best F1

The default threshold of 0.5 may not give the best F1 score. By sweeping thresholds from 0.05 to 0.95, we find the one that maximizes F1. This is called threshold tuning and should ONLY be done on the validation set, never the test set.

31best_f1 = 0

Track the highest F1 score seen so far during the sweep.

32best_th = 0

Track which threshold achieved the best F1.

33for th in np.arange(0.05, 1.0, 0.05): — Sweep 19 thresholds

Try thresholds 0.05, 0.10, 0.15, ..., 0.95. At each threshold, recompute predictions and metrics.

LOOP TRACE · 5 iterations
th=0.10
pred_pos=19, TP=12, FP=7 = Almost everything predicted positive. Recall=1.000 (perfect), Precision=0.632 (many false alarms).
F1=0.774 = High recall drags F1 up, but low precision drags it down.
th=0.30
pred_pos=14, TP=12, FP=2 = All 12 positives caught, only 2 false positives.
F1=0.923 ★ BEST = Precision=0.857, Recall=1.000. Best balance! Lowering threshold from 0.5 to 0.3 recovers all 3 missed positives. th=0.25 yields the same counts and the same F1, so the two thresholds tie.
th=0.50
pred_pos=11, TP=9, FP=2 = Default threshold. Misses 3 positives (FN).
F1=0.783 = Precision=0.818, Recall=0.750. Good but not optimal.
th=0.70
pred_pos=9, TP=9, FP=0 = Zero false positives! But still misses 3 positives.
F1=0.857 = Precision=1.000 (perfect), Recall=0.750.
th=0.90
pred_pos=3, TP=3, FP=0 = Only the most confident predictions pass. Misses 9 of 12 positives.
F1=0.400 = Precision=1.000 but Recall=0.250. Way too conservative.
34yp = (probs >= th).astype(int) — New predictions at this threshold

Recompute binary predictions using the current threshold. Same logic as the Step 1 prediction, but with a different threshold value.

35tp = np.sum((y_true == 1) & (yp == 1))

Count true positives at this threshold.

36fp = np.sum((y_true == 0) & (yp == 1))

Count false positives.

37fn = np.sum((y_true == 1) & (yp == 0))

Count false negatives.

38pr = tp / (tp + fp) if (tp + fp) > 0 else 0

Precision at this threshold. The guard ‘if (tp + fp) > 0’ prevents division by zero when no samples are predicted positive (at very high thresholds).

EXECUTION STATE
→ division by zero = If threshold is so high that y_pred is all zeros: tp=0, fp=0, tp+fp=0. Without the guard: 0/0 = NaN. With guard: 0.
39re = tp / (tp + fn) if (tp + fn) > 0 else 0

Recall at this threshold. tp + fn = total actual positives = 12 (constant across thresholds).

40f = 2*pr*re/(pr+re) if (pr + re) > 0 else 0

F1 at this threshold.

41if f > best_f1: — Is this the best F1 so far?

Track the maximum F1 and corresponding threshold.

42best_f1 = f

Update best F1.

43best_th = th

Record which threshold achieved it.

45print(f"Best F1: {best_f1:.4f} at threshold {best_th:.2f}")

Output: Best F1: 0.9231 at threshold 0.25. Thresholds 0.25 and 0.30 produce identical predictions (no probability lies between 0.22 and 0.35), so they tie at F1 = 0.923; because the update uses a strict >, the loop keeps the first one it meets. Lowering the threshold from the default 0.5 improves F1 from 0.783 to 0.923 by changing a single number.

EXECUTION STATE
output = Best F1: 0.9231 at threshold 0.25
→ improvement = Default threshold 0.5: F1 = 0.783. Threshold 0.25 or 0.30: F1 = 0.923. That’s +14 percentage points from just moving the decision boundary.
import numpy as np

# ── 20 test samples: ground truth and model probabilities ──
y_true = np.array([1,0,1,1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1,0])
probs  = np.array([0.92,0.15,0.88,0.42,0.22,0.78,0.61,
                   0.85,0.95,0.18,0.38,0.73,0.12,0.81,
                   0.55,0.90,0.35,0.08,0.87,0.20])

# ── Step 1: Apply decision threshold ──
threshold = 0.5
y_pred = (probs >= threshold).astype(int)
print(f"Predictions: {y_pred.tolist()}")

# ── Step 2: Build confusion matrix ──
TP = np.sum((y_true == 1) & (y_pred == 1))
FP = np.sum((y_true == 0) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
FN = np.sum((y_true == 1) & (y_pred == 0))
print(f"TP={TP}, FP={FP}, TN={TN}, FN={FN}")

# ── Step 3: Compute metrics ──
accuracy = (TP + TN) / len(y_true)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# ── Step 4: Threshold sweep (find best F1) ──
best_f1 = 0
best_th = 0
for th in np.arange(0.05, 1.0, 0.05):
    yp = (probs >= th).astype(int)
    tp = np.sum((y_true == 1) & (yp == 1))
    fp = np.sum((y_true == 0) & (yp == 1))
    fn = np.sum((y_true == 1) & (yp == 0))
    pr = tp / (tp + fp) if (tp + fp) > 0 else 0
    re = tp / (tp + fn) if (tp + fn) > 0 else 0
    f = 2*pr*re/(pr+re) if (pr + re) > 0 else 0
    if f > best_f1:
        best_f1 = f
        best_th = th

print(f"\nBest F1: {best_f1:.4f} at threshold {best_th:.2f}")

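The threshold sweep reveals that the default threshold of 0.5 gives F1 = 0.783, but lowering it to 0.30 pushes F1 to 0.923, a dramatic improvement from changing a single number. The three missed positives (samples 3, 10, 16) all had probabilities between 0.35 and 0.42, just below 0.5 but above 0.3, so lowering the threshold rescues them. Any threshold above 0.22 and no higher than 0.35 yields exactly the same predictions; the script prints 0.25 because that is the first grid value to reach the maximum and the comparison is a strict >.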
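Because several thresholds can share the best F1, it can help to report all of them rather than only the first. The sketch below is one way to do that on the same 20 samples; rounding the grid to two decimals is a choice made here to avoid values such as 0.15000000000000002 that a float step can produce, and the variable names are illustrative.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
probs = np.array([0.92, 0.15, 0.88, 0.42, 0.22, 0.78, 0.61, 0.85, 0.95, 0.18,
                  0.38, 0.73, 0.12, 0.81, 0.55, 0.90, 0.35, 0.08, 0.87, 0.20])

# 19 grid points from 0.05 to 0.95, rounded to clean two-decimal values.
thresholds = np.round(np.linspace(0.05, 0.95, 19), 2)

f1_by_threshold = {}
for th in thresholds:
    y_pred = probs >= th
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    # F1 written in terms of counts: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1_by_threshold[float(th)] = 2 * tp / denom if denom else 0.0

best_f1 = max(f1_by_threshold.values())
# Collect every grid threshold that reaches the maximum, not just the first.
tied = [th for th, f in f1_by_threshold.items() if np.isclose(f, best_f1)]
print(f"best F1 = {best_f1:.4f} at thresholds {tied}")
```

When the list has more than one entry, the choice among them can be made on other grounds, for example picking the middle of the plateau so that small shifts in the scores are less likely to change the predictions.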

Practical Lesson: Never accept the default threshold of 0.5 without checking. A 5-line threshold sweep on the validation set can improve your F1 by 10+ percentage points. This is one of the highest-ROI optimizations in applied machine learning.
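An alternative to a fixed grid is to let scikit-learn enumerate the candidate thresholds. The sketch below uses precision_recall_curve on the same 20 samples; in real use the inputs would be validation-set labels and scores, and the F1 computation on top of the curve is added here for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
probs = np.array([0.92, 0.15, 0.88, 0.42, 0.22, 0.78, 0.61, 0.85, 0.95, 0.18,
                  0.38, 0.73, 0.12, 0.81, 0.55, 0.90, 0.35, 0.08, 0.87, 0.20])

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so it is dropped.
precision, recall, thresholds = precision_recall_curve(y_true, probs)
precision, recall = precision[:-1], recall[:-1]

# Guarded F1: entries where precision + recall == 0 are left at 0.
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom,
               out=np.zeros_like(denom), where=denom > 0)

best = int(np.argmax(f1))
best_threshold = float(thresholds[best])
best_f1 = float(f1[best])
print(f"best threshold = {best_threshold:.2f}, F1 = {best_f1:.4f}")
```

The candidate thresholds here are the observed scores themselves, so the reported value is a score from the data (0.35 for these samples) rather than a grid point; it produces the same predictions as 0.25 or 0.30.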

PyTorch and scikit-learn

In production, you use scikit-learn for metrics computation. It handles edge cases (zero division, multiclass averaging, weighted metrics) that our manual code does not. The PyTorch code below shows both the sklearn approach and a pure-PyTorch fallback for environments where sklearn is not available.

PyTorch + scikit-learn — Production Metric Computation
🐍pytorch_metrics.py
1import torch

PyTorch core. We use tensors and boolean operations for metric computation.

2import torch.nn as nn

Neural network module. Not directly used here but shown for completeness — in a real pipeline, nn defines the model whose outputs we are evaluating.

3from sklearn.metrics import ...

scikit-learn provides battle-tested implementations of all standard classification metrics. In production, you use these instead of computing metrics by hand. They handle edge cases (zero division, multiclass, averaging) that our NumPy code does not.

EXECUTION STATE
📚 accuracy_score(y_true, y_pred) = Returns fraction of correct predictions. Same as our (TP+TN)/N.
📚 precision_score(y_true, y_pred) = Returns TP/(TP+FP). When nothing is predicted positive (TP+FP = 0), the default zero_division="warn" returns 0.0 and emits an UndefinedMetricWarning.
📚 recall_score(y_true, y_pred) = Returns TP/(TP+FN).
📚 f1_score(y_true, y_pred) = Returns harmonic mean of precision and recall.
📚 confusion_matrix(y_true, y_pred) = Returns the 2×2 matrix [[TN, FP], [FN, TP]]. Note: sklearn convention puts TN in position [0,0].
📚 classification_report(y_true, y_pred) = Pretty-printed summary of precision, recall, F1 for each class plus averages.
8# Same data as NumPy version

Identical test data for direct comparison between NumPy and sklearn results.

9y_true = torch.tensor([1,0,1,1,...]) — Ground truth

Same labels as NumPy, now as PyTorch tensor (dtype defaults to int64).

10probs = torch.tensor([0.92,0.15,...]) — Model probabilities

Same probabilities, now as float32 tensor.

14# Step 1: Predictions at threshold 0.5

Apply the decision threshold to get binary predictions.

15y_pred = (probs >= 0.5).long() — Binary predictions

.long() converts boolean tensor to int64 (LongTensor). Same result as our NumPy .astype(int).

EXECUTION STATE
📚 .long() = Converts tensor to int64. True→1, False→0. Equivalent to .to(torch.int64).
⬆ y_pred = tensor([1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0])
17# Step 2: Confusion matrix (sklearn)

sklearn expects NumPy arrays, so we convert with .numpy().

18cm = confusion_matrix(y_true.numpy(), y_pred.numpy())

Returns [[TN, FP], [FN, TP]]. Note: sklearn’s convention puts NEGATIVES first (row 0 = actual negative, row 1 = actual positive).

EXECUTION STATE
📚 confusion_matrix() = Returns ndarray of shape (n_classes, n_classes). Element [i,j] = count of samples with true label i and predicted label j.
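⬇ y_true.numpy() = Converts a CPU PyTorch tensor to a NumPy array (the two share memory). Passing explicit NumPy arrays to sklearn is the safe choice; a tensor on the GPU or one that requires grad must go through .detach().cpu() first.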
⬆ cm =
[[6, 2],    ← actual=0: TN=6, FP=2
 [3, 9]]    ← actual=1: FN=3, TP=9
19print(f"Confusion matrix: ...")

Displays the 2×2 matrix.

21# Step 3: All metrics in one call

classification_report generates a formatted summary of all per-class and averaged metrics.

22report = classification_report(...) — Full report

Produces a pretty-printed table with precision, recall, F1 for each class (Negative and Positive), plus macro and weighted averages.

EXECUTION STATE
⬇ target_names=["Negative", "Positive"] = Human-readable labels for class 0 and class 1. Without this, the report shows ‘0’ and ‘1’.
⬇ digits=4 = Show 4 decimal places (default is 2). More precision for comparing models.
27print(report)

Output: a formatted table showing per-class precision, recall, F1, and support (sample count) for each class, plus macro and weighted averages.

29# Step 4: Individual metrics

Extract individual metric values for programmatic use (logging, comparison, plotting).

30acc = accuracy_score(y_true.numpy(), y_pred.numpy())

Returns 0.7500. Same as our manual (TP+TN)/N.

EXECUTION STATE
⬆ acc = 0.7500 = Matches our NumPy computation exactly.
31prec = precision_score(y_true.numpy(), y_pred.numpy())

Returns 0.8182. By default, computes precision for the positive class (label=1).

EXECUTION STATE
⬆ prec = 0.8182 = Matches our manual TP/(TP+FP) = 9/11.
32rec = recall_score(y_true.numpy(), y_pred.numpy())

Returns 0.7500.

EXECUTION STATE
⬆ rec = 0.7500 = Matches TP/(TP+FN) = 9/12.
33f1 = f1_score(y_true.numpy(), y_pred.numpy())

Returns 0.7826.

EXECUTION STATE
⬆ f1 = 0.7826 = Matches our manual harmonic mean computation.
34print(f"Accuracy: {acc:.4f}")

Output: Accuracy: 0.7500

35print(f"Precision: {prec:.4f}")

Output: Precision: 0.8182

36print(f"Recall: {rec:.4f}")

Output: Recall: 0.7500

37print(f"F1: {f1:.4f}")

Output: F1: 0.7826

39# Step 5: PyTorch-native computation (no sklearn)

Sometimes you cannot install sklearn (e.g., in a minimal Docker image or on GPU-only environments). Here we compute the same metrics using only PyTorch tensor operations. This matches our NumPy code exactly.

40tp = ((y_true == 1) & (y_pred == 1)).sum().item()

Count true positives using tensor boolean operations. .sum() counts True values, .item() converts the scalar tensor to a Python int.

EXECUTION STATE
📚 .sum() = Sums all elements. For boolean tensor: counts True values. tensor([True, False, True]).sum() = 2.
📚 .item() = Converts a scalar tensor to Python number. tensor(9).item() = 9. Required for division (Python / operator).
⬆ tp = 9 = Same as NumPy TP.
41fp = ((y_true == 0) & (y_pred == 1)).sum().item()

False positives: 2.

EXECUTION STATE
⬆ fp = 2 = Same as NumPy FP.
42fn = ((y_true == 1) & (y_pred == 0)).sum().item()

False negatives: 3.

EXECUTION STATE
⬆ fn = 3 = Same as NumPy FN.
43precision_pt = tp / (tp + fp)

Precision computed in pure Python: 9 / 11 = 0.8182.

44recall_pt = tp / (tp + fn)

Recall: 9 / 12 = 0.7500.

45f1_pt = 2 * precision_pt * recall_pt / (precision_pt + recall_pt)

F1: 0.7826. All three methods (NumPy manual, sklearn, PyTorch-native) give identical results.

EXECUTION STATE
⬆ f1_pt = 0.7826 = Identical to sklearn and NumPy. Three implementations, one truth.
46print(f"PyTorch-native F1: {f1_pt:.4f}")

Output: PyTorch-native F1: 0.7826

EXECUTION STATE
output = PyTorch-native F1: 0.7826
import torch
import torch.nn as nn
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

# ── Same data as NumPy version ──
y_true = torch.tensor([1,0,1,1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1,0])
probs  = torch.tensor([0.92,0.15,0.88,0.42,0.22,0.78,0.61,
                       0.85,0.95,0.18,0.38,0.73,0.12,0.81,
                       0.55,0.90,0.35,0.08,0.87,0.20])

# ── Step 1: Predictions at threshold 0.5 ──
y_pred = (probs >= 0.5).long()

# ── Step 2: Confusion matrix (sklearn) ──
cm = confusion_matrix(y_true.numpy(), y_pred.numpy())
print(f"Confusion matrix:\n{cm}")

# ── Step 3: All metrics in one call ──
report = classification_report(
    y_true.numpy(), y_pred.numpy(),
    target_names=["Negative", "Positive"],
    digits=4
)
print(report)

# ── Step 4: Individual metrics ──
acc = accuracy_score(y_true.numpy(), y_pred.numpy())
prec = precision_score(y_true.numpy(), y_pred.numpy())
rec = recall_score(y_true.numpy(), y_pred.numpy())
f1 = f1_score(y_true.numpy(), y_pred.numpy())
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1:        {f1:.4f}")

# ── Step 5: PyTorch-native computation (no sklearn) ──
tp = ((y_true == 1) & (y_pred == 1)).sum().item()
fp = ((y_true == 0) & (y_pred == 1)).sum().item()
fn = ((y_true == 1) & (y_pred == 0)).sum().item()
precision_pt = tp / (tp + fp)
recall_pt = tp / (tp + fn)
f1_pt = 2 * precision_pt * recall_pt / (precision_pt + recall_pt)
print(f"\nPyTorch-native F1: {f1_pt:.4f}")

All three implementations — NumPy manual, sklearn, and PyTorch-native — produce identical results. The sklearn version is the most concise and handles edge cases. The PyTorch-native version is useful when you need metrics computed on GPU tensors without converting to NumPy.
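The PyTorch-native lines above divide without guards, which is fine for this data but fails when nothing is predicted positive. A minimal sketch of a guarded helper follows; the function name binary_prf and the choice to return 0.0 for undefined metrics are assumptions made here, mirroring the guards in the NumPy threshold sweep.

```python
import torch


def binary_prf(y_true: torch.Tensor, y_pred: torch.Tensor) -> dict:
    """Precision, recall and F1 for the positive class, with zero-division guards."""
    y_true = y_true.bool()
    y_pred = y_pred.bool()
    # .item() moves each count to a Python int, so this also works for GPU tensors.
    tp = (y_true & y_pred).sum().item()
    fp = (~y_true & y_pred).sum().item()
    fn = (y_true & ~y_pred).sum().item()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


y_true = torch.tensor([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
probs = torch.tensor([0.92, 0.15, 0.88, 0.42, 0.22, 0.78, 0.61, 0.85, 0.95, 0.18,
                      0.38, 0.73, 0.12, 0.81, 0.55, 0.90, 0.35, 0.08, 0.87, 0.20])

at_default = binary_prf(y_true, probs >= 0.5)
at_extreme = binary_prf(y_true, probs >= 0.99)  # nothing is predicted positive
print(at_default)
print(at_extreme)
```

The second call exercises the guard: with no predicted positives, all three values come back as 0.0 instead of raising ZeroDivisionError.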


Learning Curves

A learning curve plots a metric (loss or accuracy) against the training epoch (or number of training samples). Comparing the training curve to the validation curve reveals three diagnostic patterns:

| Pattern | Train Curve | Val Curve | Diagnosis | Fix |
| --- | --- | --- | --- | --- |
| Underfitting | High loss, plateaus early | High loss, close to train | Model too simple | More layers/neurons, train longer |
| Good fit | Low loss | Low loss, close to train | Model is learning well | Collect more data, try for marginal gains |
| Overfitting | Very low loss | Gap grows over time | Model memorizes noise | Early stopping, regularization, more data |
| Data hunger | Low loss | Gap closes slowly | Not enough data | Collect more data, data augmentation |

How to Read Learning Curves

The gap between the training and validation curves is the generalization gap from Section 3. If the gap is small throughout training, the model generalizes well. If it starts small and grows, the model is beginning to overfit. If it is large from the start, the model is memorizing from the very first epoch (usually means the model is too complex for the available data).

A healthy training run shows both curves decreasing, with the validation curve slightly above the training curve. When the validation curve starts to rise while the training curve continues to fall, it is time to stop (this is exactly what early stopping detects automatically).
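The reading rules above can be turned into a small numeric check. The sketch below uses synthetic loss values invented for illustration, and the overfitting test (validation loss rising after its minimum while training loss keeps falling) is a simple heuristic assumed here, not a standard definition.

```python
import numpy as np

# Synthetic per-epoch losses, invented for illustration only.
train_loss = np.array([0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23, 0.20, 0.17])
val_loss = np.array([0.95, 0.78, 0.65, 0.57, 0.52, 0.50, 0.51, 0.54, 0.58, 0.63])

# Generalization gap at every epoch (validation minus training loss).
gap = val_loss - train_loss

# Epoch (1-indexed) with the lowest validation loss: where early stopping
# with enough patience would restore the model.
best_epoch = int(np.argmin(val_loss)) + 1

# Heuristic: validation loss rises at every epoch after its minimum
# while training loss is still lower at the end than at the best epoch.
after_best = val_loss[best_epoch - 1:]
val_rising = len(after_best) > 1 and bool(np.all(np.diff(after_best) > 0))
train_still_falling = bool(train_loss[-1] < train_loss[best_epoch - 1])
overfitting = val_rising and train_still_falling

print(f"best epoch: {best_epoch}")
print(f"gap at first / last epoch: {gap[0]:.2f} / {gap[-1]:.2f}")
print(f"overfitting after best epoch: {overfitting}")
```

On real runs the curves are noisy, so a smoothed validation curve or a patience window would usually replace the strict "rises at every epoch" test.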

Always plot learning curves. They are the single most informative diagnostic for neural network training. A 30-second glance at the learning curves tells you whether to increase model complexity, train longer, collect more data, or add regularization. No other single diagnostic provides this much information.

Connection to Modern Systems

LLM Evaluation Metrics

Large language models use specialized metrics beyond classification: perplexity $\text{PPL} = \exp\left(-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})\right)$ measures how well the model predicts the next token. BLEU and ROUGE evaluate text generation quality against references. Human evaluation (Elo ratings, preference rankings) captures qualities that automated metrics miss. The LMSYS Chatbot Arena uses pairwise human preferences to rank models, producing more reliable comparisons than any single automated metric.
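The perplexity formula is short enough to compute directly. The sketch below assumes the per-token probabilities that a model assigned to the correct next tokens are already available; the probability values are invented for illustration.

```python
import numpy as np


def perplexity(token_probs):
    """PPL = exp(-(1/T) * sum_t log P(x_t | x_<t)) for one sequence of T tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))


# A model that guesses uniformly over a 50-token vocabulary.
uniform = perplexity([1 / 50] * 10)

# A model that assigns high probability to each correct token.
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform model  : {uniform:.2f}")
print(f"confident model: {confident:.2f}")
```

The uniform case gives a useful intuition: a perplexity of 50 means the model is as uncertain as if it were choosing uniformly among 50 tokens at every step.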

Metric Aggregation at Scale

Production systems can aggregate metrics over very large volumes of predictions. Key practices: macro averaging (compute the metric per class, then take the unweighted mean) gives equal weight to rare classes. Micro averaging (pool the TP, FP, and FN counts across classes, then compute) is dominated by frequent classes. Weighted averaging (per-class metrics averaged with class support as weights) is a compromise. For imbalanced datasets, macro averaging is often preferred because it prevents the majority class from dominating the score.
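The three averaging schemes differ only in how per-class counts are combined. The sketch below works from the three-class confusion matrix used in the scikit-learn example earlier in this section; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the three-class example
# (rows = true class, columns = predicted class).
cm = np.array([[3, 1, 1],
               [1, 2, 0],
               [1, 1, 2]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp   # predicted as class k, actually another class
fn = cm.sum(axis=1) - tp   # actually class k, predicted as another class
support = cm.sum(axis=1)   # number of true instances per class

# Per-class F1 from counts; assumes 2*tp + fp + fn > 0 for every class.
f1_per_class = 2 * tp / (2 * tp + fp + fn)

macro_f1 = float(f1_per_class.mean())                           # every class counts equally
weighted_f1 = float(np.average(f1_per_class, weights=support))  # weighted by support
micro_f1 = float(2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum()))  # pooled counts

print(f"macro    : {macro_f1:.4f}")
print(f"weighted : {weighted_f1:.4f}")
print(f"micro    : {micro_f1:.4f}")
```

For single-label multiclass data the pooled FP and FN totals are equal, which is why micro F1 coincides with accuracy.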

Monitoring in Production

After deployment, models need continuous monitoring. Data drift occurs when the input distribution changes (e.g., new types of fraud). Concept drift occurs when the relationship between inputs and outputs changes. Production monitoring systems track metrics over time, set alerts when performance degrades, and trigger retraining automatically. Tools like Weights & Biases, MLflow, and TensorBoard provide dashboards for tracking metrics across experiments and deployments.
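A rolling-window metric with an alert threshold is one simple way to track performance over time. The sketch below is a hypothetical helper, not the API of any monitoring tool; the class name, window size, and alert level are all assumptions chosen for illustration, and it presumes that ground-truth labels eventually arrive for each prediction.

```python
from collections import deque


class RollingRecallMonitor:
    """Tracks recall over the most recent labeled predictions (hypothetical helper)."""

    def __init__(self, window_size=500, alert_below=0.80):
        # Old pairs fall out automatically once the window is full.
        self.window = deque(maxlen=window_size)
        self.alert_below = alert_below

    def update(self, y_true, y_pred):
        self.window.append((y_true, y_pred))

    def recall(self):
        tp = sum(1 for t, p in self.window if t == 1 and p == 1)
        fn = sum(1 for t, p in self.window if t == 1 and p == 0)
        # None signals that the window holds no positives yet.
        return tp / (tp + fn) if tp + fn else None

    def should_alert(self):
        r = self.recall()
        return r is not None and r < self.alert_below


monitor = RollingRecallMonitor(window_size=4, alert_below=0.75)

# Early traffic: every positive is caught.
for t, p in [(1, 1), (1, 1), (0, 0), (1, 1)]:
    monitor.update(t, p)
healthy = monitor.should_alert()

# Later traffic: two positives are missed and push older pairs out of the window.
for t, p in [(1, 0), (1, 0)]:
    monitor.update(t, p)
degraded = monitor.should_alert()

print(f"alert while healthy : {healthy}")
print(f"alert after drift   : {degraded}")
```

A real deployment would typically log the windowed value to a tracking tool and alert from there; the windowing logic is the part this sketch is meant to show.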

Beyond Binary Classification

For multiclass problems with $K$ classes, the confusion matrix becomes $K \times K$, and precision/recall/F1 are computed per-class then averaged. For regression problems, metrics include MSE, MAE, and $R^2$. For ranking problems, NDCG and MAP measure the quality of the ordering. Every domain has its own metrics, but the principle is the same: measure what matters for your application, not just what is easy to compute.
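For the regression case, the common metrics are a few lines of NumPy each. The sketch below computes MSE and MAE, plus the coefficient of determination R² (included here as a common companion metric); the target and prediction values are invented for illustration.

```python
import numpy as np

# Made-up regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

mse = float(np.mean(residuals ** 2))      # mean squared error
mae = float(np.mean(np.abs(residuals)))   # mean absolute error

# R^2 = 1 - SS_res / SS_tot: fraction of variance explained relative to
# always predicting the mean of y_true.
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(f"MSE = {mse:.4f}")
print(f"MAE = {mae:.4f}")
print(f"R^2 = {r2:.4f}")
```

MSE penalizes the single large miss (the last sample) more heavily than MAE does, which is the usual basis for choosing between them.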


Summary

In this final section of Chapter 11, we built the metrics toolkit for evaluating neural networks:

  1. Accuracy is not enough. On imbalanced datasets, a model that always predicts the majority class has high accuracy but zero usefulness. Always check precision, recall, and F1
  2. Confusion matrix breaks predictions into TP, FP, TN, FN. Every classification metric is derived from these four counts: $\text{Precision} = \text{TP}/(\text{TP}+\text{FP})$, $\text{Recall} = \text{TP}/(\text{TP}+\text{FN})$
  3. F1 = harmonic mean of precision and recall. It penalizes imbalance: both precision and recall must be high. Use F1 as the default metric for imbalanced binary problems
  4. Threshold tuning is a critical optimization. The default 0.5 is often not the best. Sweep thresholds on the validation set and pick the one that maximizes F1. This can improve F1 by 10+ points
  5. Learning curves (train vs val metric over epochs) are the most informative single diagnostic. A growing gap signals overfitting. A high plateau signals underfitting

With this, Chapter 11 is complete. You now have the full training toolkit: data loading and batching (Section 1), the training loop with gradient clipping and LR scheduling (Section 2), validation and early stopping (Section 3), and comprehensive metrics (Section 4). In Chapter 12, we will dive deeper into regularization — techniques like dropout and weight decay that directly combat the overfitting we learned to detect in this chapter.

References

  • van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.), Butterworths. The book that introduced the F-measure (originally called the E-measure).
  • Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval, Cambridge University Press. Chapter 8 is the standard reference for macro, micro, and weighted averaging.
  • Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies 2(1), 37–63 (also arXiv:2010.16061). Rigorous derivation of every binary metric.
  • Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8), 861–874. Canonical ROC reference; defines AUC and its probabilistic interpretation.
  • Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006.
  • Saito, T. & Rehmsmeier, M. (2015). The Precision-Recall Plot is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 10(3), e0118432. Shows that ROC-AUC stays flattering on imbalanced data while PR-AUC collapses.
  • Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 / arXiv:1706.04599. Motivates using log-loss (negative log-likelihood) as a reported metric.
  • Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830. The sklearn.metrics module used throughout this section.