Chapter 20

Information Theory in ML Loss Functions

Information Theoretic Foundations

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Derive cross-entropy loss from information theory principles
  • Explain the connection between cross-entropy and KL divergence
  • Prove that minimizing cross-entropy equals maximum likelihood
  • Understand why cross-entropy dominates MSE for classification

🔧 Practical Skills

  • Implement cross-entropy loss in NumPy and PyTorch
  • Choose the right loss function for different ML tasks
  • Debug gradient flow issues using information theory
  • Apply numerical stability techniques in production code

🧠 Deep Learning Mastery

  • Classification networks - Why PyTorch's CrossEntropyLoss combines LogSoftmax and NLL
  • Language models - How next-token prediction uses cross-entropy at massive scale
  • Knowledge distillation - Temperature-scaled softmax for transferring "dark knowledge"
  • Variational methods - ELBO and the KL term in VAEs and diffusion models
Why This Matters: Cross-entropy loss is the backbone of virtually every classification system in deep learning—from image classifiers to GPT. Understanding its information-theoretic foundation is essential for any ML practitioner.

The Big Picture

Every time you train a neural network for classification, you're implicitly using information theory. The loss function that guides learning—cross-entropy—is not an arbitrary choice but a mathematically optimal one derived from Shannon's foundational work on communication.

The Central Insight

Training a classifier minimizes the number of bits needed to communicate the true labels using the model's predicted distribution.

📡

Encoding: Model predictions define a "code" for labels

📊

Cross-Entropy: Expected code length using model's distribution

🎯

Minimum: Achieved when model matches true distribution

Historical Context

The connection between information theory and machine learning was recognized early, but it took decades for cross-entropy to become the dominant classification loss.

📜
1948: Shannon's Foundation

Claude Shannon defined entropy in "A Mathematical Theory of Communication." His source coding results underlie the interpretation of cross-entropy as the expected code length when a code built for one distribution is used to encode samples from another.

🔬
1986: Backpropagation Era

Rumelhart, Hinton, and Williams popularized backpropagation. Initially, MSE was the dominant loss even for classification, leading to slow training due to vanishing gradients.

🚀
2012+: Deep Learning Revolution

Cross-entropy + softmax became the standard for classification. The beautiful gradient property (p̂ - y) was recognized as crucial for training deep networks, avoiding gradient saturation issues that plagued MSE.


Cross-Entropy Loss

Mathematical Definition

Cross-entropy measures the expected number of bits needed to identify an event drawn from a distribution p when using an optimal code designed for distribution q.

Cross-Entropy Definition

$$H(p, q) = -\sum_{x} p(x) \log q(x) = \mathbb{E}_{x \sim p}\left[-\log q(x)\right]$$

where p is the true distribution and q is the model's predicted distribution

| Symbol | Meaning | In ML Context |
|---|---|---|
| p(x) | True probability of class x | One-hot encoded label (or soft label) |
| q(x) | Predicted probability of class x | Softmax output of neural network |
| H(p, q) | Cross-entropy | Classification loss function |
| -log q(x) | Surprisal of prediction | Penalty for assigning low probability to truth |
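To make the definition concrete, here is a minimal sketch in plain Python (no dependencies). The function name `cross_entropy` and the example distributions are illustrative choices, and the natural logarithm is assumed, so the result is in nats rather than bits.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), using the natural log (nats)."""
    # Terms with p(x) == 0 contribute nothing, so skip them to avoid 0 * log(0).
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [1.0, 0.0, 0.0]   # one-hot true label
q = [0.7, 0.2, 0.1]   # model's predicted probabilities
loss = cross_entropy(p, q)
print(round(loss, 4))  # 0.3567

# The same function also accepts soft labels.
soft_p = [0.8, 0.1, 0.1]
soft_loss = cross_entropy(soft_p, q)
```

With a one-hot `p`, only the term for the true class survives, which is why the loss reduces to the negative log of a single probability.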

For one-hot encoded labels (standard classification), cross-entropy simplifies dramatically:

Cross-Entropy with One-Hot Labels

$$L = -\log q(y_{true})$$

Just the negative log of the predicted probability for the correct class!

Intuition: Cross-entropy only cares about the probability assigned to the true class. If the model predicts 0.99 for the correct class, the loss is tiny (-log(0.99) ≈ 0.01). If it predicts 0.01, the loss is large (-log(0.01) ≈ 4.6). These values use the natural logarithm, so they are in nats; dividing by ln 2 converts them to bits.

Interactive: Cross-Entropy Explorer

The interactive explorer shows how cross-entropy loss responds to different predictions for a fixed true class. A static snapshot of its state:

True label (one-hot target): y = [1, 0, 0] (Cat)
Predicted probabilities (Cat, Dog, Bird): p̂ = [0.700, 0.200, 0.100]
Cross-entropy loss: L = -log(p̂_0) = -log(0.7000) = 0.3567

Plot: loss versus predicted probability for the true class (x-axis: predicted probability p from 0 to 1; y-axis: loss from 0 to 4), with the current point marked at (0.70, 0.36).

The Formula

H(y, p̂) = -∑_c y_c log(p̂_c)
For one-hot targets: L = -log(p̂_true class)

💡 Key Insight

Cross-entropy loss only cares about the probability assigned to the true class. When p̂ = 1.0, loss = 0 (perfect). When p̂ → 0, loss → ∞ (catastrophic). This asymmetric penalty makes neural networks learn to be confident about correct predictions.


Binary Cross-Entropy (BCE)

For binary classification (two classes), cross-entropy takes a specialized form known as binary cross-entropy or log loss.

BCE with Sigmoid

Binary Cross-Entropy

$$L = -\left[y \log(\hat{p}) + (1-y) \log(1-\hat{p})\right]$$

where $\hat{p} = \sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid of the raw logit

The magical property of BCE + sigmoid is its gradient:

The Beautiful Gradient

$$\frac{\partial L}{\partial z} = \hat{p} - y = \sigma(z) - y$$

Prediction minus target. That's it! No sigmoid derivative, no saturation.

Why This Matters: The sigmoid function has very small derivatives at extreme values (near 0 or 1). With MSE loss, this causes vanishing gradients. But BCE's logarithm perfectly cancels the sigmoid derivative, giving a clean (p̂ - y) gradient that never saturates!
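The cancellation follows from the chain rule: ∂L/∂p̂ = (p̂ - y) / (p̂(1 - p̂)) and dp̂/dz = p̂(1 - p̂), so their product is p̂ - y. The sketch below checks this numerically with a central finite difference. The helper names and the test point (z = 1.5, y = 1) are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy as a function of the raw logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y = 1.5, 1.0
h = 1e-6

# Central finite difference approximation of dL/dz.
numeric_grad = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

# Closed-form gradient: prediction minus target.
analytic_grad = sigmoid(z) - y

print(round(analytic_grad, 4))  # -0.1824
```

The two values agree to within finite-difference error, which is what the closed form predicts.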

Interactive: BCE Visualizer

The interactive visualizer shows how BCE loss behaves as the model's prediction changes. The gradient with respect to the logit is always equal to the error p̂ - y, so it does not vanish for confidently wrong predictions. A static snapshot of its state:

True label: y = 1
Raw model output (logit): z = 1.50 (slider range -5 to +5)
Sigmoid transformation: σ(z) = 1 / (1 + e^(-z)), with σ(0) = 0.5
Predicted probability: p̂ = σ(z) = 0.8176
BCE loss: 0.2014
Gradient ∂L/∂z: -0.1824 (negative, so gradient descent pushes the logit up)

Plot: loss landscape for y = 1 (x-axis: logit z).

BCE(y, p̂) = -[y log(p̂) + (1-y) log(1-p̂)]
= -1 × log(0.8176) - 0 × log(0.1824)
= 0.2014

💡 Why BCE + Sigmoid Works So Well

Beautiful gradient: ∂L/∂z = p̂ - y (prediction minus target). This elegant formula means gradients are never saturated!

Compare to MSE: With MSE, sigmoid's flat regions cause vanishing gradients. BCE cancels the sigmoid derivative, maintaining strong learning signals.


Categorical Cross-Entropy

For multi-class classification with K classes, we use categorical cross-entropy combined with the softmax function.

Softmax + Cross-Entropy

Softmax Function

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Converts K raw logits into a proper probability distribution

The gradient of softmax + cross-entropy has the same beautiful form:

$$\frac{\partial L}{\partial z_i} = \hat{p}_i - y_i$$

For each class i: gradient equals (predicted probability - target probability)
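As a sanity check on this gradient, the sketch below compares p̂ - y against a finite-difference estimate for four logits. It is written in plain Python; the function names and the logit values are illustrative assumptions rather than part of any library.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, true_idx):
    return -math.log(softmax(logits)[true_idx])

logits = [2.0, 0.5, -1.0, 0.0]
true_idx = 0

# Analytic gradient: predicted probability minus one-hot target.
probs = softmax(logits)
analytic = [p - (1.0 if i == true_idx else 0.0) for i, p in enumerate(probs)]

# Numerical gradient: perturb one logit at a time.
h = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += h
    down[i] -= h
    numeric.append((ce_loss(up, true_idx) - ce_loss(down, true_idx)) / (2 * h))

print([round(g, 3) for g in analytic])  # [-0.29, 0.158, 0.035, 0.096]
```

The gradient entries sum to zero because both the probabilities and the one-hot target sum to one.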

Interactive: Softmax + Cross-Entropy

The interactive visualizer shows how softmax normalizes logits into probabilities and how temperature affects the sharpness of the distribution, which is crucial for understanding language model sampling. A static snapshot of its state:

True label: Cat
Temperature: T = 1.0 (slider range 0.1, sharp/confident, to 5.0, soft/uncertain)

| Class | Raw logit z_i | Softmax probability p_i | Gradient ∂L/∂z_i = p_i - y_i | Effect on logit |
|---|---|---|---|---|
| Cat (true) | 2.0 | 0.7101 | -0.290 | increase |
| Dog | 0.5 | 0.1584 | +0.158 | decrease |
| Bird | -1.0 | 0.0354 | +0.035 | decrease |
| Fish | 0.0 | 0.0961 | +0.096 | decrease |

Cross-entropy loss: 0.3423
Distribution entropy: 0.8783

Softmax: p_i = exp(z_i/T) / ∑_j exp(z_j/T)
Loss: L = -log(p_target) = -log(0.7101)

🌡️ Temperature in Deep Learning

Low temperature (T < 1): Sharpens the distribution, making the model more confident. Used in inference for deterministic outputs.

High temperature (T > 1): Flattens the distribution, increasing randomness. Used in generation (GPT, etc.) for creative text.

Knowledge distillation: Teacher models use high T to expose "dark knowledge" about relationships between classes.
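The effect of temperature can be sketched in a few lines of plain Python. The function name `softmax_with_temperature` and the particular temperatures are illustrative assumptions; the formula is the temperature-scaled softmax p_i = exp(z_i/T) / ∑_j exp(z_j/T).

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.0]

sharp = softmax_with_temperature(logits, T=0.5)  # T < 1: more peaked
base = softmax_with_temperature(logits, T=1.0)   # T = 1: ordinary softmax
soft = softmax_with_temperature(logits, T=5.0)   # T > 1: flatter

for name, dist in [("T=0.5", sharp), ("T=1.0", base), ("T=5.0", soft)]:
    print(name, [round(p, 3) for p in dist])
```

The ranking of classes never changes with T; only how much probability mass the top class takes.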


Connection to KL Divergence

Cross-entropy is deeply connected to KL divergence, and understanding this connection reveals why cross-entropy is the optimal loss for classification.

The Fundamental Decomposition

$$H(p, q) = H(p) + D_{KL}(p \| q)$$

  • Cross-entropy H(p, q): what we minimize
  • Entropy H(p): fixed by the data
  • KL divergence D_KL(p ‖ q): what actually changes

Since the entropy H(p) is determined by the data and doesn't change during training, minimizing cross-entropy is equivalent to minimizing KL divergence!
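The decomposition is easy to verify numerically. The sketch below uses base-2 logarithms so the results are in bits; the two three-class distributions are example values chosen for illustration.

```python
import math

p = [0.6, 0.3, 0.1]      # true distribution
q = [0.33, 0.33, 0.34]   # model distribution

entropy = -sum(pi * math.log2(pi) for pi in p)
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

print(round(entropy, 4), round(kl, 4), round(cross_entropy, 4))
# 1.2955 0.2997 1.5952
```

Individual per-class KL terms can be negative (wherever q exceeds p), but their sum is never negative.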

Interactive: KL Decomposition

The interactive demo shows how cross-entropy decomposes into entropy plus KL divergence, with KL → 0 as the model distribution approaches the true distribution. A static snapshot of its state (all values in bits):

H(p, q) = H(p) + D_KL(p || q)

| Class | p(x), true | q(x), model | Entropy term | KL term | Cross-entropy term |
|---|---|---|---|---|---|
| A | 0.600 | 0.330 | 0.4422 | 0.5175 | 0.9597 |
| B | 0.300 | 0.330 | 0.5211 | -0.0413 | 0.4798 |
| C | 0.100 | 0.340 | 0.3322 | -0.1766 | 0.1556 |
| Total | | | 1.2955 | 0.2997 | 1.5952 |

H(p) = 1.2955 bits, D_KL(p || q) = 0.2997 bits, H(p, q) = 1.5952 bits

💡 Why This Matters for ML

Training minimizes cross-entropy H(p, q). Since H(p) is fixed by the data, minimizing cross-entropy is equivalent to minimizing KL divergence DKL(p || q).

KL = 0 only when q = p. The model learns to match the true distribution exactly. This is why cross-entropy is the theoretically optimal loss for classification!


Connection to Maximum Likelihood

Perhaps the most profound insight: minimizing cross-entropy is equivalent to maximum likelihood estimation. This connection explains why cross-entropy is theoretically optimal.

Step 1: Define Likelihood

Given data points with true classes, the likelihood is the product of predicted probabilities:

$$\mathcal{L}(\theta) = \prod_{i=1}^{N} q_\theta(y_i | x_i)$$
Step 2: Take Log Likelihood

Logs convert products to sums (numerical stability + easier optimization):

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log q_\theta(y_i | x_i)$$
Step 3: Recognize Cross-Entropy!

The average negative log-likelihood is exactly the cross-entropy between the empirical data distribution and the model:

$$-\frac{1}{N}\log \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log q_\theta(y_i | x_i) = H(p_{data}, q_\theta)$$
The Punchline: When you minimize cross-entropy loss, you are doing maximum likelihood estimation. Cross-entropy therefore inherits the standard properties of MLE, which hold under the usual regularity conditions: consistency, asymptotic efficiency, and equivariance under reparametrization.
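A toy illustration of the equivalence, assuming the simplest possible model: a single Bernoulli parameter theta with no inputs. The labels and the grid search are hypothetical and chosen only to keep the sketch dependency-free.

```python
import math

# Ten hypothetical binary labels: seven ones, three zeros.
labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def mean_nll(theta, ys):
    """Average negative log-likelihood of the labels under Bernoulli(theta)."""
    return -sum(math.log(theta if y == 1 else 1 - theta) for y in ys) / len(ys)

def cross_entropy_empirical(theta, ys):
    """Cross-entropy between the empirical label distribution and Bernoulli(theta)."""
    p1 = sum(ys) / len(ys)
    return -(p1 * math.log(theta) + (1 - p1) * math.log(1 - theta))

# Minimize the average NLL over a grid of candidate parameters.
grid = [i / 100 for i in range(1, 100)]
best_theta = min(grid, key=lambda t: mean_nll(t, labels))
print(best_theta)  # 0.7
```

The minimizer is the empirical frequency of ones, which is the maximum likelihood estimate, and `mean_nll` equals `cross_entropy_empirical` for every theta.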

MSE vs Cross-Entropy for Classification

Why not just use Mean Squared Error for classification? After all, it worked fine for regression. Let's see why cross-entropy is fundamentally better.

| Aspect | MSE Loss | Cross-Entropy Loss |
|---|---|---|
| Formula | (y - p̂)² | -y log(p̂) - (1-y) log(1-p̂) |
| Gradient magnitude | Bounded by 2 | Unbounded (approaches ∞ for wrong predictions) |
| Saturated predictions | Tiny gradient (learning stalls) | Strong gradient (correction continues) |
| Probabilistic meaning | None | Negative log-likelihood |
| Optimal for | Regression | Classification |

Interactive: Loss Comparison

The interactive demo compares MSE and cross-entropy side by side. Pay special attention to the gradient magnitude when the model is confidently wrong (e.g., predicts 0.05 when target is 1). A static snapshot of its state:

Target: y = 1
Prediction: p̂ = 0.70

| Loss | Formula | Loss value | Gradient ∂L/∂p̂ |
|---|---|---|---|
| Mean Squared Error (MSE) | (y - p̂)² | 0.0900 | -0.6000 |
| Binary Cross-Entropy (BCE) | -[y log(p̂) + (1-y) log(1-p̂)] | 0.3567 | -1.4286 |

Plots: loss versus prediction p̂ and gradient versus prediction p̂, both for y = 1, with one curve each for MSE and BCE.

MSE Characteristics

  • Gradient bounded: |∂L/∂p| ≤ 2
  • Symmetric around target
  • Small gradients for confident wrong predictions
  • Not derived from probabilistic principles

BCE Characteristics

  • Gradient unbounded: |∂L/∂p| → ∞ near 0 or 1
  • Asymmetric - punishes confident mistakes heavily
  • Strong gradients for wrong predictions
  • Equals negative log-likelihood (MLE)

💡 The Critical Difference

Try setting the prediction to 0.05 with target = 1 (confident but wrong). The MSE gradient magnitude is only 1.9, but the BCE gradient magnitude is 1/0.05 = 20. Cross-entropy provides a much stronger learning signal when the model is confidently wrong, which is exactly when you need to correct it most.
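The comparison can be reproduced with the closed-form derivatives. The sketch below computes gradients both with respect to the prediction p̂ and with respect to the logit z behind a sigmoid output (using dp̂/dz = p̂(1 - p̂)). The function names are illustrative.

```python
def mse_grad_wrt_p(p, y):
    return 2 * (p - y)

def bce_grad_wrt_p(p, y):
    return -y / p + (1 - y) / (1 - p)

def mse_grad_wrt_logit(p, y):
    # Chain rule through the sigmoid: dp/dz = p * (1 - p).
    return mse_grad_wrt_p(p, y) * p * (1 - p)

def bce_grad_wrt_logit(p, y):
    # The sigmoid derivative cancels, leaving prediction minus target.
    return p - y

p_hat, y = 0.05, 1.0  # confidently wrong

print(round(mse_grad_wrt_p(p_hat, y), 4))      # -1.9
print(round(bce_grad_wrt_p(p_hat, y), 4))      # -20.0
print(round(mse_grad_wrt_logit(p_hat, y), 4))  # -0.0902
print(round(bce_grad_wrt_logit(p_hat, y), 4))  # -0.95
```

At the logit level the gap is what matters for training: the MSE signal shrinks by the factor p̂(1 - p̂), while the BCE signal stays close to -1.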


Gradient Flow in Neural Networks

Understanding how gradients propagate through neural networks is crucial for diagnosing training issues. The choice of loss function profoundly affects gradient flow.

Interactive: Gradient Visualization

The interactive demo backpropagates gradients through a simple neural network and lets you compare how cross-entropy and MSE affect learning, especially in earlier layers. A static snapshot of its state:

Weights: w1 = 0.50, w2 = 0.30, w3 = -0.20

| Node | Value | Gradient shown at node |
|---|---|---|
| Input (x) | 1.00 | +0.001 |
| Hidden 1 | 0.65 | +0.002 |
| Hidden 2 | 0.57 | +0.025 |
| Output (p̂) | 0.50 | -0.504 |

Loss: 0.700
Layer 1 gradient: 0.001692 (flagged as vanishing)
Layer 2 gradient: 0.024647
Output gradient: -0.503645

💡 Why Cross-Entropy Helps

With sigmoid activation and BCE loss, the sigmoid derivative cancels in the output layer gradient, giving ∂L/∂z = p̂ - y. This prevents gradient saturation at the output layer and provides consistent learning signals throughout training.


Practical Considerations

Numerical Stability

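Computing cross-entropy naively can lead to numerical issues: exp() of a large logit overflows, and log() of a probability that has underflowed to 0 returns -infinity. The key techniques used in production implementations are the log-sum-exp trick, clipping probabilities away from 0 and 1, and computing the loss directly from logits with framework-provided functions such as CrossEntropyLoss.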
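A minimal sketch of the log-sum-exp trick, assuming plain Python and deliberately extreme logits: subtracting the maximum logit before exponentiating keeps every exponent at or below zero, so nothing overflows. The function names here are illustrative, not a library API.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: log p_i = z_i - logsumexp(z)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(logits, true_idx):
    """Cross-entropy computed directly from logits, never forming exp(z_i)."""
    return -log_softmax(logits)[true_idx]

big_logits = [1000.0, 0.0, -1000.0]

# The naive route fails: exp(1000) does not fit in a float.
try:
    math.exp(big_logits[0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_loss = cross_entropy_from_logits(big_logits, true_idx=1)
print(naive_overflowed, stable_loss)  # True 1000.0
```

Working in log space also avoids taking log of a probability that has underflowed to zero.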


Python Implementation

Let's implement cross-entropy loss from scratch and compare with PyTorch's optimized version.

Cross-Entropy Implementation
🐍 cross_entropy.py

Notes on the code:

Import Libraries
NumPy handles the numerical operations. This from-scratch version does not need PyTorch.

Numerical Stability Epsilon
We use eps=1e-15 to prevent log(0), which would give -infinity. This small value is negligible for the result but crucial for numerical stability.

Categorical Cross-Entropy Function
Takes one-hot encoded targets and predicted probabilities. Returns the mean cross-entropy loss across all samples.

Clipping Predictions
np.clip keeps predictions in the range [eps, 1-eps], so neither log(p) nor log(1-p) is ever evaluated at 0.

Core Cross-Entropy Formula
The sum is over classes: -sum(y * log(p)). Since y is one-hot, only the true class contributes. axis=1 sums across classes for each sample.
EXAMPLE: For y=[0,1,0] and p=[0.2,0.7,0.1]: loss = -1*log(0.7) = 0.357

Return Mean Loss
We return the mean loss across all samples. This is standard practice because it keeps the loss scale independent of batch size.

Binary Cross-Entropy Function
Specialized version for binary classification. Takes scalar targets (0 or 1) and predicted probabilities.

BCE Formula
The full BCE formula: -[y*log(p) + (1-y)*log(1-p)]. When y=1, the second term vanishes; when y=0, the first term vanishes.
EXAMPLE: For y=1, p=0.9: loss = -log(0.9) = 0.105

import numpy as np

# Numerical stability constant
eps = 1e-15

# Categorical Cross-Entropy (from scratch)
def categorical_cross_entropy(y_true, y_pred):
    """
    Compute categorical cross-entropy loss.

    Args:
        y_true: One-hot encoded true labels (N, K)
        y_pred: Predicted probabilities (N, K)
    """
    # Clip predictions for numerical stability
    y_pred = np.clip(y_pred, eps, 1 - eps)

    # Cross-entropy: -sum(y * log(p)) per sample
    ce = -np.sum(y_true * np.log(y_pred), axis=1)

    # Return mean over batch
    return np.mean(ce)

# Binary Cross-Entropy
def binary_cross_entropy(y_true, y_pred):
    """
    Compute binary cross-entropy loss.

    Args:
        y_true: Binary labels (N,) - values 0 or 1
        y_pred: Predicted probabilities (N,)
    """
    # Clip so that log(p) and log(1 - p) are both finite
    y_pred = np.clip(y_pred, eps, 1 - eps)

    # BCE formula
    bce = -(y_true * np.log(y_pred) +
            (1 - y_true) * np.log(1 - y_pred))

    # Return mean over batch
    return np.mean(bce)

Now let's see how to use PyTorch's optimized implementations:

PyTorch Loss Functions
🐍 pytorch_losses.py

Notes on the code:

Import PyTorch
PyTorch provides optimized, GPU-accelerated loss functions used in production deep learning.

Create Sample Logits
Logits are raw model outputs before softmax. Neural networks typically output logits, not probabilities.

Create Target Labels
For CrossEntropyLoss, PyTorch expects class indices (not one-hot). Here sample 0 belongs to class 0 and sample 1 belongs to class 2.

Instantiate Loss Function
CrossEntropyLoss combines LogSoftmax and NLLLoss. It expects raw logits, not softmax outputs!

Compute Loss
Pass logits and targets directly. The function handles softmax internally for numerical stability.

BCE with Logits
BCEWithLogitsLoss combines sigmoid and BCE. It is more numerically stable than applying sigmoid and then BCE separately.

Label Smoothing
Label smoothing (0.1) discourages overconfidence by mixing the one-hot targets with a uniform distribution. It is a regularization technique that often improves generalization.
EXAMPLE: With smoothing=0.1 and 3 classes: [1,0,0] becomes [0.933, 0.033, 0.033]

import torch
import torch.nn as nn

# Create sample logits (raw model outputs)
logits = torch.randn(2, 3)  # 2 samples, 3 classes

# Create target labels (class indices, not one-hot!)
targets = torch.tensor([0, 2])  # Sample 0 is class 0, sample 1 is class 2

# Standard cross-entropy loss
ce_loss = nn.CrossEntropyLoss()

# Compute loss (pass logits, not softmax!)
loss = ce_loss(logits, targets)

# -----------------------------------
# Binary Cross-Entropy with Logits
bce_logits_loss = nn.BCEWithLogitsLoss()
binary_logits = torch.randn(5)
binary_targets = torch.tensor([1., 0., 1., 1., 0.])
bce_loss = bce_logits_loss(binary_logits, binary_targets)

# -----------------------------------
# With label smoothing (regularization)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
smooth_loss = ce_smooth(logits, targets)
Common Mistake: PyTorch's CrossEntropyLoss expects raw logits, not softmax outputs! It combines LogSoftmax and NLLLoss internally for numerical stability. Applying softmax first then passing to CrossEntropyLoss will give wrong results.
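To see why the double softmax matters, here is a plain-Python stand-in for the computation (it does not call PyTorch). It assumes a loss that applies softmax internally, and compares feeding it raw logits against feeding it probabilities that were already softmaxed. The logit values are illustrative.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def loss_with_internal_softmax(inputs, true_idx):
    """Stand-in for a loss that applies softmax to its inputs itself."""
    return -math.log(softmax(inputs)[true_idx])

logits = [2.0, 0.5, -1.0, 0.0]
true_idx = 0

# Correct usage: pass raw logits.
correct_loss = loss_with_internal_softmax(logits, true_idx)

# Mistake: softmax applied first, then again inside the loss.
wrong_loss = loss_with_internal_softmax(softmax(logits), true_idx)

print(round(correct_loss, 4))  # 0.3423
print(correct_loss < wrong_loss)  # True
```

Because probabilities all lie in [0, 1], the second softmax squeezes them toward uniform, so the reported loss is inflated and the gradients are flattened. No error is raised, which is what makes the mistake easy to miss.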

Applications in Deep Learning

💬 Language Models

GPT, LLaMA, and all autoregressive language models use cross-entropy loss. The objective is to minimize perplexity = exp(cross-entropy), the geometric mean of inverse prediction probabilities.
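The relationship between cross-entropy and perplexity can be checked on a toy sequence. The per-token probabilities below are hypothetical values a model might assign to the correct next token at four positions.

```python
import math

# Hypothetical probabilities assigned to the correct token at each position.
token_probs = [0.5, 0.25, 0.125, 0.25]

# Average cross-entropy per token, in nats.
ce = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the average cross-entropy.
perplexity = math.exp(ce)

# Equivalently, the geometric mean of the inverse probabilities.
geo_mean_inverse = math.prod(1 / p for p in token_probs) ** (1 / len(token_probs))

print(round(perplexity, 6), round(geo_mean_inverse, 6))  # 4.0 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.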

🖼 Image Classification

ResNet, EfficientNet, ViT—all image classifiers use softmax cross-entropy. The loss directly measures how well the model's probability mass aligns with the true class distribution.

🎓 Knowledge Distillation

Student models learn from soft targets (teacher's temperature-scaled softmax). Cross-entropy with soft labels captures "dark knowledge" about class similarities that hard labels miss.

🔄 VAEs and Diffusion

The ELBO objective includes a KL divergence term between approximate and prior distributions. Understanding cross-entropy's connection to KL is essential for these generative models.
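As one concrete instance of that KL term, the sketch below assumes the common VAE setup of a one-dimensional Gaussian approximate posterior N(mu, sigma²) and a standard normal prior N(0, 1), for which the KL divergence has the closed form 0.5 (mu² + sigma² - 1 - log sigma²). The function name is illustrative, and other choices of posterior or prior would need a different formula.

```python
import math

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a one-dimensional Gaussian, in nats."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

# Zero when the posterior equals the prior.
kl_same = kl_gaussian_to_standard_normal(0.0, 1.0)

# Grows as the mean moves away from 0 or the variance away from 1.
kl_shifted = kl_gaussian_to_standard_normal(1.0, 1.0)
kl_narrow = kl_gaussian_to_standard_normal(0.0, 0.5)

print(kl_same, kl_shifted)  # 0.0 0.5
```

For a diagonal Gaussian with several latent dimensions, the per-dimension values are summed.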


Knowledge Check

Test your understanding of information theory in ML loss functions.


For a classification problem with one-hot encoded labels, what does cross-entropy loss reduce to?


Summary

Key Takeaways

  1. Cross-entropy measures information cost: It quantifies the expected bits needed to communicate labels using the model's probability distribution.
  2. H(p,q) = H(p) + DKL(p||q): Minimizing cross-entropy is equivalent to minimizing KL divergence, forcing the model to match the true distribution.
  3. Cross-entropy = negative log-likelihood: Training with cross-entropy is maximum likelihood estimation, inheriting MLE's optimal statistical properties.
  4. Beautiful gradients: Softmax + cross-entropy gives gradient = p̂ - y, avoiding sigmoid saturation that plagues MSE.
  5. MSE fails for classification: It provides weak gradients for confident wrong predictions, making learning slow and unreliable.
  6. Numerical stability matters: Always use log-sum-exp, clipped probabilities, and framework-provided implementations like CrossEntropyLoss.
Looking Ahead: In the next section, we'll explore the Maximum Entropy Principle, which provides a principled way to choose probability distributions when you have partial information—a technique with deep applications in physics and machine learning.