Chapter 20

Cross-Entropy

Information Theoretic Foundations

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Define cross-entropy and explain its information-theoretic meaning
  • Derive the relationship H(P,Q) = H(P) + D_KL(P||Q)
  • Explain why cross-entropy is preferred over MSE for classification
  • Understand binary and categorical cross-entropy loss functions

🔧 Practical Skills

  • Implement cross-entropy loss from scratch
  • Apply label smoothing to improve model calibration
  • Use the log-softmax trick for numerical stability
  • Choose appropriate loss functions for different tasks

🧠 Deep Learning Connections

  • Classification Loss: Cross-entropy is THE standard loss for neural network classifiers - from simple logistic regression to GPT-4
  • Knowledge Distillation: Soft targets from teacher models use cross-entropy to transfer knowledge to smaller student models
  • Variational Inference: The ELBO objective in VAEs uses cross-entropy to measure reconstruction quality
  • Language Modeling: Next-token prediction loss in transformers is categorical cross-entropy over the vocabulary
Where You'll Apply This: Training any classifier (images, text, audio), language model pre-training, recommendation systems, anomaly detection, and anywhere you need to compare probability distributions.

The Big Picture: Measuring Coding Inefficiency

In the previous section on Shannon entropy, we learned that $H(P)$ measures the minimum average number of bits needed to encode messages from distribution $P$. But what if we don't know $P$ and instead design our code based on a different distribution $Q$?

Cross-entropy answers this question: it measures the average number of bits needed to encode events from the true distribution $P$ when using a coding scheme optimized for $Q$. If $Q$ doesn't match $P$, we waste bits!

The Cross-Entropy Question

If reality follows $P$, but we encode using $Q$, how many bits do we use on average?

$$H(P, Q) = -\sum_x P(x) \log Q(x)$$

Historical Context

Cross-entropy emerges naturally from Claude Shannon's information theory framework (1948). While Shannon focused on optimal coding (entropy), the cross-entropy concept quantifies the cost of being wrong about the underlying distribution.

  • 1948 - Shannon: Established that entropy H(P) is the theoretical minimum for lossless compression
  • 1951 - Kullback & Leibler: Formalized the KL divergence, showing the gap between cross-entropy and entropy
  • 1960s-70s: Cross-entropy adopted in pattern recognition and decision theory
  • 1990s-Present: Became the dominant loss function for neural network classifiers

Mathematical Definition

Discrete Cross-Entropy

For discrete probability distributions $P$ and $Q$ over the same sample space:

Discrete Cross-Entropy (using natural log for nats, log₂ for bits):

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_{x \sim P}[-\log Q(x)]$$

What each term means:

  • $P(x)$: the true probability of event $x$ occurring
  • $Q(x)$: our model's predicted probability
  • $-\log Q(x)$: the "surprise" or code length for $x$ under $Q$
  • $H(P, Q)$: the expected surprise when events follow $P$ but we use $Q$'s code
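
The code-length reading of these terms can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names and the two toy distributions are assumptions chosen so the arithmetic comes out in whole bits, and every entry of Q is assumed to be strictly positive.

```python
import math

def code_lengths_bits(q):
    """Ideal code length -log2 Q(x), in bits, for each outcome under model Q."""
    return [-math.log2(qx) for qx in q]

def cross_entropy_bits(p, q):
    """Expected code length when outcomes follow P but the code is built for Q."""
    lengths = code_lengths_bits(q)
    return sum(px * length for px, length in zip(p, lengths) if px > 0)

p = [0.5, 0.25, 0.25]   # true distribution (illustrative values)
q = [0.25, 0.25, 0.5]   # mismatched model (illustrative values)

lengths = code_lengths_bits(q)     # [2.0, 2.0, 1.0] bits per outcome
h_pq = cross_entropy_bits(p, q)    # 0.5*2 + 0.25*2 + 0.25*1 = 1.75 bits
h_p = cross_entropy_bits(p, p)     # code matched to P: entropy H(P) = 1.5 bits
print(lengths, h_pq, h_p)
```

The most likely outcome under P receives a 2-bit codeword because Q considers it unlikely, which is where the extra 0.25 bits per event come from.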

Continuous Cross-Entropy

For continuous distributions with densities $p(x)$ and $q(x)$:

$$H(P, Q) = -\int_{-\infty}^{\infty} p(x) \log q(x) \, dx$$
Units matter: Using natural log (ln) gives cross-entropy in nats. Using log₂ gives bits. Deep learning frameworks typically use natural log, so cross-entropy loss is measured in nats.
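
As a sanity check on the integral form, the sketch below compares a numerical estimate with the closed-form cross-entropy between two univariate Gaussians, which is ½·ln(2πσ_q²) + (σ_p² + (μ_p − μ_q)²) / (2σ_q²) in nats. The function names, the midpoint-rule integration, and the example parameters are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gaussian_log_pdf(x, mu, sigma):
    """Log-density computed directly, so far tails never hit log(0)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def cross_entropy_numeric(mu_p, sigma_p, mu_q, sigma_q, n=20000, width=12.0):
    """Midpoint-rule estimate of -integral p(x) ln q(x) dx, in nats."""
    lo = mu_p - width * sigma_p
    hi = mu_p + width * sigma_p
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total -= gaussian_pdf(x, mu_p, sigma_p) * gaussian_log_pdf(x, mu_q, sigma_q) * dx
    return total

def cross_entropy_closed_form(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form H(P, Q) for two univariate Gaussians, in nats."""
    return 0.5 * math.log(2 * math.pi * sigma_q ** 2) + (
        sigma_p ** 2 + (mu_p - mu_q) ** 2
    ) / (2 * sigma_q ** 2)

numeric = cross_entropy_numeric(0.0, 1.0, 1.0, 2.0)
exact = cross_entropy_closed_form(0.0, 1.0, 1.0, 2.0)
print(numeric, exact)
```

Dividing either value by ln 2 converts it from nats to bits.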

Interactive: Cross-Entropy Explorer

Explore how cross-entropy changes as you adjust the true distribution P and the model distribution Q. Watch the relationship H(P,Q) = H(P) + D_KL(P||Q) in action.

Cross-Entropy Between Two Distributions

Explore how cross-entropy measures the expected surprise when using Q to encode events from P

True Distribution P (Reality): P(A) = 0.600, P(B) = 0.300, P(C) = 0.100

Model Distribution Q (Belief): Q(A) = 0.500, Q(B) = 0.300, Q(C) = 0.200

[Figure: bar chart comparing P (True) and Q (Model) for outcomes x = A, B, C; vertical axis: Probability, 0.00 to 1.00]

  • Entropy H(P): 1.2955 (minimum achievable)
  • Cross-Entropy H(P,Q): 1.3533 (using Q to encode P)
  • KL Divergence: 0.0578 (extra bits wasted)

Term-by-Term Breakdown (log base 2, values in bits)

| Outcome | P(x) | Q(x) | -P log P | -P log Q | P log(P/Q) |
|---|---|---|---|---|---|
| x = A | 0.600 | 0.500 | 0.4422 | 0.6000 | 0.1578 |
| x = B | 0.300 | 0.300 | 0.5211 | 0.5211 | 0.0000 |
| x = C | 0.100 | 0.200 | 0.3322 | 0.2322 | -0.1000 |
| Total | 1.000 | 1.000 | 1.2955 | 1.3533 | 0.0578 |

The Fundamental Relationship:
H(P,Q) = H(P) + D_KL(P || Q)
1.3533 = 1.2955 + 0.0578
Cross-entropy is always at least as large as entropy. They're equal only when P = Q.

The Fundamental Relationship

Cross-entropy decomposes into two parts: the inherent uncertainty of the data (entropy) plus the extra cost of using the wrong distribution (KL divergence):

The Fundamental Decomposition

$$H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$$

  • $H(P, Q)$: total bits used
  • $H(P)$: minimum possible
  • $D_{\text{KL}}(P \| Q)$: bits wasted
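
The decomposition follows directly from the definitions by adding and subtracting $\log P(x)$ inside the sum. A short derivation for the discrete case, using the same symbols as above:

```latex
\begin{aligned}
H(P, Q) &= -\sum_x P(x) \log Q(x) \\
        &= -\sum_x P(x) \log P(x) + \sum_x P(x) \log \frac{P(x)}{Q(x)} \\
        &= H(P) + D_{\text{KL}}(P \,\|\, Q)
\end{aligned}
```

Because $D_{\text{KL}}(P \| Q) \ge 0$ (Gibbs' inequality), this also gives $H(P, Q) \ge H(P)$.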

Why this matters for machine learning:

  • $H(P)$ is fixed by the data: we can't change it
  • $D_{\text{KL}}(P \| Q)$ depends on our model: we can minimize it!
  • Therefore, minimizing cross-entropy = minimizing KL divergence
  • The best we can do is $H(P, Q) = H(P)$ when $Q = P$
Key Insight: When training a classifier, minimizing cross-entropy loss is equivalent to finding the model distribution Q that is closest to the true data distribution P (in the KL divergence sense).
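
The identity is easy to check numerically. The sketch below uses the explorer's example values P = (0.6, 0.3, 0.1) and Q = (0.5, 0.3, 0.2) with base-2 logarithms; the helper names are illustrative, and all probabilities are assumed to be strictly positive.

```python
import math

def entropy(p, base=2.0):
    """H(P) = -sum P(x) log P(x)."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

def cross_entropy(p, q, base=2.0):
    """H(P, Q) = -sum P(x) log Q(x)."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) = sum P(x) log(P(x) / Q(x)), computed independently of the other two."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

p = [0.6, 0.3, 0.1]
q = [0.5, 0.3, 0.2]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

# Rounded to four decimals: 1.2955, 1.3533, 0.0578 (bits)
print(round(h_p, 4), round(h_pq, 4), round(kl, 4))
print(abs(h_pq - (h_p + kl)) < 1e-12)
```

Since H(P) does not depend on Q, any change to Q that lowers H(P,Q) lowers the KL term by exactly the same amount.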

Building Intuition

The cross-entropy loss $-\log Q(x)$ for a single prediction has a beautiful asymmetric property that makes it ideal for training classifiers:

When Prediction is Correct (high Q)

If $Q = 0.9$: Loss = $-\ln(0.9) \approx 0.105$

Small loss: the model is confident and correct. Gradients are small, allowing the model to focus elsewhere.

When Prediction is Wrong (low Q)

If $Q = 0.01$: Loss = $-\ln(0.01) \approx 4.6$

Huge loss: the model is confident but wrong! Large gradients force the model to urgently correct this mistake.

Interactive: The -log(p) Loss Curve

Explore the cross-entropy loss curve and see how the gradient changes at different probability levels. Notice how the curve is very steep near zero but flattens near one.

Cross-Entropy Loss Curve: -log(p)

Explore how the loss function penalizes wrong predictions asymmetrically

[Figure: loss curve; x-axis: Predicted Probability p (for correct class), 0 to 1; y-axis: Loss: -log(p), 0 to 5; marked point p = 0.5, Loss = 0.69]

Selected point (rated "Good"):

  • Probability: 70.0%
  • Loss (-log p): 0.3567
  • Gradient: -1.43

Key Loss Values

  • p = 90%: 0.105
  • p = 50%: 0.693
  • p = 10%: 2.303
  • p = 1%: 4.605

Gradient Insight

The gradient of the loss with respect to p is -1/p. At the selected point p = 0.7 it is about -1.43, and it becomes much steeper at low probabilities (-10 at p = 0.1, -100 at p = 0.01), pushing the model strongly to fix wrong predictions.

💡
Why Cross-Entropy Works So Well for Training

Asymmetric penalties: The -log function penalizes confident wrong predictions much more than it rewards confident correct ones. Going from 1% to 10% reduces loss by 2.3, while going from 80% to 90% only reduces it by 0.12. This creates strong gradients that force the model to fix its mistakes.
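
The numbers quoted above can be reproduced with a few lines. This is only a sketch of the per-example loss −ln(p) and its derivative −1/p with respect to p; the function names are illustrative.

```python
import math

def nll(p):
    """Cross-entropy loss -ln(p) for the probability assigned to the true class."""
    return -math.log(p)

def nll_grad(p):
    """Derivative of -ln(p) with respect to p."""
    return -1.0 / p

for p in (0.9, 0.7, 0.5, 0.1, 0.01):
    print(f"p={p:<5} loss={nll(p):.4f}  dloss/dp={nll_grad(p):.2f}")

gain_low = nll(0.01) - nll(0.10)    # improvement from 1% to 10%: ln(10), about 2.30 nats
gain_high = nll(0.80) - nll(0.90)   # improvement from 80% to 90%: ln(9/8), about 0.12 nats
print(gain_low, gain_high)
```

The same 9-to-10 percentage-point improvement is worth roughly twenty times more loss reduction at the low end of the curve than at the high end.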


Cross-Entropy as a Loss Function

In machine learning, we use cross-entropy as a loss function by treating:

  • $P$ = the true label distribution (one-hot encoded for classification)
  • $Q$ = the model's predicted probability distribution (softmax output)

Binary Cross-Entropy

For binary classification with label $y \in \{0, 1\}$ and predicted probability $p$:

Binary Cross-Entropy (BCE):

$$\text{BCE} = -\left[ y \log(p) + (1-y) \log(1-p) \right]$$

When y=1: BCE = -log(p). When y=0: BCE = -log(1-p).

Categorical Cross-Entropy

For multi-class classification with one-hot encoded labels $\mathbf{y} = [y_1, ..., y_K]$ and predicted probabilities $\mathbf{p} = [p_1, ..., p_K]$:

Categorical Cross-Entropy (CCE):

$$\text{CCE} = -\sum_{k=1}^{K} y_k \log(p_k) = -\log(p_c)$$

Since y is one-hot, only the true class c contributes: CCE = -log(p_c)
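
The two formulas are closely related: binary cross-entropy is the K = 2 case of categorical cross-entropy with probabilities [p, 1 − p]. The sketch below checks both facts for single examples; the function names are illustrative, and the class ordering [positive, negative] is an assumption made for the comparison.

```python
import math

def categorical_ce(y, p):
    """-sum_k y_k ln(p_k); terms with y_k = 0 are skipped."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p) if yk > 0)

def binary_ce(y, p):
    """-[y ln(p) + (1 - y) ln(1 - p)] for a single label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One-hot target: only the true class contributes, so the loss is -ln(p_c).
probs = [0.90, 0.05, 0.03, 0.02]
one_hot = [1, 0, 0, 0]
loss = categorical_ce(one_hot, probs)
print(round(loss, 4))   # -ln(0.90), about 0.1054

# Binary cross-entropy as the two-class special case.
p = 0.8
bce_pos = binary_ce(1, p)
bce_neg = binary_ce(0, p)
cce_pos = categorical_ce([1, 0], [p, 1 - p])
cce_neg = categorical_ce([0, 1], [p, 1 - p])
print(bce_pos, cce_pos, bce_neg, cce_neg)
```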

Interactive: Classification Loss Explorer

See how cross-entropy loss changes with different model predictions. Try various scenarios to understand when the loss is low (good) versus high (bad).

Interactive Cross-Entropy Loss

See how the loss penalizes wrong predictions

Model's Probability Distribution

| Class | Predicted probability | True label (one-hot) |
|---|---|---|
| cat | 90.0% | 1 |
| dog | 5.0% | 0 |
| bird | 3.0% | 0 |
| fish | 2.0% | 0 |

Cross-Entropy Loss: 0.1054
✓ Excellent! Low loss = confident & correct

Calculation Breakdown

  • Correct answer: cat
  • P(correct): 90.00%
  • log(P): -0.1054
  • -log(P): 0.1054

[Loss gauge: 0 = Low Loss (Good), 5+ = High Loss (Bad)]
💡
Key Insight: Why -log(p)?

The negative log function penalizes confident wrong predictions severely. If the model says 1% for the correct answer, loss = -log(0.01) = 4.6. But if it says 90%, loss = -log(0.9) = 0.1. This gradient pushes the model to be both correct and confident.

Cross-Entropy Loss Formula:
L = -Σ y_i · log(p_i) = -log(p_correct)

Since y is one-hot encoded, only the probability of the correct class matters


Why Cross-Entropy, Not MSE?

Why do neural network classifiers use cross-entropy instead of Mean Squared Error (MSE)? The answer lies in the gradient behavior.

| Aspect | Cross-Entropy Loss | MSE Loss |
|---|---|---|
| Formula | -log(p) | ½(y - p)² |
| Gradient w.r.t. logits | p - y (simple!) | (p - y) · p · (1-p) (vanishing!) |
| When p=0.01, y=1 | Gradient ≈ -0.99 (strong) | Gradient ≈ -0.0098 (weak) |
| Confident wrong | Huge penalty, strong correction | Small gradient, slow learning |
| Probability range | Designed for sigmoid/softmax outputs in [0,1] | On raw (unsquashed) outputs, can produce invalid probabilities |
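
The "p = 0.01, y = 1" comparison can be checked with finite differences. The sketch below assumes a single sigmoid output and the ½(y − p)² convention for MSE, under which the gradient has no factor of 2. All function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_loss(z, y):
    """Binary cross-entropy as a function of the logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def half_mse_loss(z, y):
    """Squared error 0.5 * (y - p)^2 as a function of the logit z (assumed convention)."""
    p = sigmoid(z)
    return 0.5 * (y - p) ** 2

def numeric_grad(loss_fn, z, y, h=1e-6):
    """Central-difference estimate of d(loss)/dz."""
    return (loss_fn(z + h, y) - loss_fn(z - h, y)) / (2 * h)

y = 1
p_target = 0.01
z = math.log(p_target / (1 - p_target))        # logit for which sigmoid(z) = 0.01

ce_grad = numeric_grad(ce_loss, z, y)          # close to p - y = -0.99
mse_grad = numeric_grad(half_mse_loss, z, y)   # close to (p - y) * p * (1 - p), about -0.0098
print(ce_grad, mse_grad, ce_grad / mse_grad)
```

At this point the cross-entropy gradient is about 100 times larger in magnitude, which is the factor 1 / (p(1 − p)) that MSE loses.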

The Vanishing Gradient Problem with MSE

When using sigmoid/softmax with MSE, the gradient includes a $p(1-p)$ term. When the model is confidently wrong (p near 0 or 1), this term vanishes, causing extremely slow learning. Cross-entropy's gradient is $(p - y)$, which stays strong regardless of confidence level.

The Gradient Difference: For logistic regression with cross-entropy, the gradient is simply $\nabla_{\text{logits}} = p - y$. This elegant form means the update is proportional to the error, with no diminishing factor. MSE's gradient has an extra $p(1-p)$ multiplicative factor that kills gradients at the extremes.
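
Both gradients come from the chain rule. A short derivation for a single sigmoid output $p = \sigma(z)$, where $\partial p / \partial z = p(1-p)$, and assuming the $\tfrac{1}{2}(y - p)^2$ convention for MSE:

```latex
\begin{aligned}
\mathcal{L}_{\text{CE}} &= -\left[ y \log p + (1 - y) \log(1 - p) \right],
&\frac{\partial \mathcal{L}_{\text{CE}}}{\partial p} &= \frac{p - y}{p(1 - p)},
&\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z} &= \frac{p - y}{p(1 - p)} \cdot p(1 - p) = p - y \\
\mathcal{L}_{\text{MSE}} &= \tfrac{1}{2}(y - p)^2,
&\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial p} &= p - y,
&\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial z} &= (p - y) \cdot p(1 - p)
\end{aligned}
```

The $1 / (p(1-p))$ factor in the cross-entropy derivative cancels the sigmoid's own derivative exactly, which is why no vanishing factor remains.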

Label Smoothing and Soft Targets

Standard one-hot labels (like [1, 0, 0, 0]) encourage the model to be maximally confident. But overconfidence often hurts generalization! Label smoothing addresses this by using "soft" targets:

Label Smoothing Formula:

$$y_{\text{smooth}} = (1 - \varepsilon) \cdot y_{\text{hard}} + \frac{\varepsilon}{K}$$

With ε=0.1 and K=4 classes: [1,0,0,0] → [0.925, 0.025, 0.025, 0.025]

Benefits of label smoothing:

  1. Prevents overconfidence: The model can't achieve zero loss by outputting [1,0,0,0] — there's always a small penalty for being too certain
  2. Better calibration: Predicted probabilities more accurately reflect true likelihood
  3. Regularization effect: Acts similarly to L2 regularization on the logits
  4. Improved generalization: Reduces overfitting, especially on small datasets
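
The first benefit can be made concrete: with a smoothed target, the lowest achievable cross-entropy is the entropy of the target itself, reached when the prediction equals the smoothed label, and predictions that are more confident than that score worse. A minimal sketch follows, in which the function names and the "overconfident" prediction are illustrative assumptions.

```python
import math

def smooth_labels(one_hot, eps):
    """y_smooth = (1 - eps) * y_hard + eps / K."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

def cross_entropy(target, pred):
    """-sum_k target_k ln(pred_k), in nats."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)

target = smooth_labels([1, 0, 0, 0], eps=0.1)   # approximately [0.925, 0.025, 0.025, 0.025]

matched = cross_entropy(target, target)         # best case: prediction equals the smoothed target
overconfident = cross_entropy(target, [0.997, 0.001, 0.001, 0.001])

print([round(t, 3) for t in target])
print(matched, overconfident)
```

With hard labels the second prediction would be nearly perfect; under smoothing it pays for assigning almost no mass to the other classes.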

Interactive: Soft Labels Demo

Label Smoothing and Soft Targets

See how soft labels reduce overconfidence and improve generalization

Demo state: smoothing parameter ε = 0.00 (slider range: hard labels to maximum smoothing), K = 4 classes, model confidence slider for the correct class: 85%

Label Distribution Comparison

| Class | Hard label | Soft label |
|---|---|---|
| cat | 1.00 | 1.000 |
| dog | 0.00 | 0.000 |
| bird | 0.00 | 0.000 |
| fish | 0.00 | 0.000 |

Hard Label Loss: 0.4249
Soft Label Loss: 0.4249

Gradient Comparison (∂L/∂logits)

| Class | Model p(x) | Hard Target | Soft Target | Hard Gradient | Soft Gradient |
|---|---|---|---|---|---|
| cat | 0.654 | 1.000 | 1.000 | -0.346 | -0.346 |
| dog | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |
| bird | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |
| fish | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |

Notice: At ε = 0.00 the soft and hard targets coincide, so the two gradient columns match. With ε > 0, each gradient is the predicted probability minus the smoothed target, so a "wrong" class gets a negative gradient once its probability falls below ε/K, gently pushing its probability back up. This prevents the model from being overconfident.
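
To see the effect with a nonzero smoothing parameter, the sketch below recomputes the gradient table for ε = 0.1 using the same rounded model probabilities. It relies on the softmax cross-entropy gradient being p − target, which holds whenever the target sums to 1. The function names and the "confident" prediction are illustrative assumptions.

```python
def smooth_labels(one_hot, eps):
    """y_smooth = (1 - eps) * y_hard + eps / K."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

def softmax_ce_grad(probs, target):
    """Gradient of cross-entropy w.r.t. the logits for softmax outputs: p - target."""
    return [p - t for p, t in zip(probs, target)]

probs = [0.654, 0.115, 0.115, 0.115]   # rounded values from the table above
hard = [1, 0, 0, 0]
soft = smooth_labels(hard, eps=0.1)    # approximately [0.925, 0.025, 0.025, 0.025]

hard_grad = softmax_ce_grad(probs, hard)   # about [-0.346, 0.115, 0.115, 0.115]
soft_grad = softmax_ce_grad(probs, soft)   # about [-0.271, 0.090, 0.090, 0.090]

# A prediction more confident than the smoothed target flips the signs.
confident = [0.97, 0.01, 0.01, 0.01]
confident_grad = softmax_ce_grad(confident, soft)   # about [0.045, -0.015, -0.015, -0.015]

print([round(g, 3) for g in hard_grad])
print([round(g, 3) for g in soft_grad])
print([round(g, 3) for g in confident_grad])
```

In the last case gradient descent lowers the logit of the correct class and raises the others, pulling the prediction back toward the smoothed target.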


Numerical Stability

Computing cross-entropy naively can cause numerical issues:

  • Underflow: For very negative logits, $e^{z}$ rounds to 0, so softmax can output exactly 0 for a class
  • Overflow: Softmax involves $e^{z}$, which explodes for large logits
  • Log of zero: If softmax outputs exactly 0, log(0) = -∞

The solution is the log-softmax trick:

Step 1: Subtract max: $z_i' = z_i - \max(z)$
Step 2: Compute: $\log(\text{softmax}(z)_i) = z_i' - \log\left(\sum_j e^{z_j'}\right)$
Step 3: Cross-entropy: $-\sum_k y_k \cdot \log(\text{softmax}(z)_k)$
Framework Best Practice: Always use the combined softmax + cross-entropy functions: nn.CrossEntropyLoss in PyTorch or tf.nn.softmax_cross_entropy_with_logits in TensorFlow. These handle numerical stability internally and are also more computationally efficient.
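
To see why the shift matters, the sketch below runs a naive log-softmax and the shifted version on the same large logits. It uses only the standard library, where math.exp raises OverflowError on overflow (NumPy would instead return inf with a warning). The function names are illustrative.

```python
import math

def naive_log_softmax(logits):
    """Direct translation of log(exp(z_i) / sum_j exp(z_j)); unsafe for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    """Steps 1 and 2: shift by the max, then subtract the log-sum-exp of the shifted logits."""
    m = max(logits)
    shifted = [z - m for z in logits]
    log_sum_exp = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum_exp for s in shifted]

logits = [1000.0, 1000.1, 999.9]

try:
    naive_log_softmax(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True      # math.exp(1000.0) exceeds the double-precision range

log_probs = stable_log_softmax(logits)
loss = -log_probs[1]         # Step 3 with a one-hot target on index 1

print(naive_failed)
print(log_probs, loss)
```

After the shift the largest exponent is exp(0) = 1, so the sum is always at least 1 and its logarithm is well defined.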

Python Implementation

Here's a comprehensive implementation of cross-entropy, including the information-theoretic perspective, classification losses, label smoothing, and numerical stability:

Cross-Entropy: From Theory to Practice
🐍cross_entropy.py
Cross-Entropy Formula

H(P,Q) = -Σ p(x) log q(x). This measures the expected bits needed to encode events from P using a code optimized for Q. The formula weighs log-probabilities by their true occurrence rate.

EXAMPLE
If P=[0.7, 0.3] and Q=[0.5, 0.5], H(P,Q) = ln 2 ≈ 0.693 nats (exactly 1 bit)

Numerical Safety

We clip q to avoid log(0) = -∞. The value 1e-15 is small enough to not affect results but large enough to prevent numerical issues.

KL Divergence Identity

D_KL(P||Q) = H(P,Q) - H(P). This is the extra bits wasted by using Q instead of P. It's always non-negative and zero only when P=Q.

Binary Cross-Entropy

For binary classification: BCE = -[y log(p) + (1-y) log(1-p)]. When y=1, only -log(p) contributes. When y=0, only -log(1-p) contributes.

EXAMPLE
If y=1 and p=0.9: BCE = -log(0.9) ≈ 0.105

Categorical Cross-Entropy

For multi-class: CCE = -Σ y_k log(p_k). Since y is one-hot encoded, only the true class contributes: -log(p_correct).

Sparse CE

More memory-efficient version that takes integer labels instead of one-hot vectors. Functionally identical but doesn't require converting labels to one-hot.

Label Smoothing Formula

y_smooth = (1-ε)y_hard + ε/K. This converts [1,0,0,0] to [0.925, 0.025, 0.025, 0.025] with ε=0.1 and K=4 classes.

Stable Softmax

Subtracting max(logits) before exp() prevents overflow. exp(1000) overflows, but exp(1000-1000.1) = exp(-0.1) is safe. The result is mathematically identical.

Log-Softmax Trick

log(softmax(x)_i) = x_i - log(Σexp(x_j)). Computing this directly avoids taking log of very small softmax outputs, which could cause -∞.

Combined CE + Softmax

The most stable approach: apply log_softmax then CE. This is what PyTorch's nn.CrossEntropyLoss and TensorFlow's from_logits=True do internally.

import numpy as np
from typing import Optional

# =====================================================
# Cross-Entropy: Mathematical Definition
# =====================================================

def cross_entropy(p: np.ndarray, q: np.ndarray,
                  base: float = np.e) -> float:
    """
    Compute cross-entropy H(P, Q) between two distributions.

    H(P, Q) = -Σ p(x) * log_b(q(x))

    This measures the expected number of bits/nats needed to encode
    events drawn from P using a code optimized for Q.

    Args:
        p: True distribution (must sum to 1)
        q: Model distribution (must sum to 1)
        base: Logarithm base (e for nats, 2 for bits)

    Returns:
        Cross-entropy value (non-negative float)
    """
    # Input validation
    assert len(p) == len(q), "Distributions must have same length"
    assert np.isclose(p.sum(), 1.0), "P must sum to 1"
    assert np.isclose(q.sum(), 1.0), "Q must sum to 1"

    # Avoid log(0) by clipping small values
    q_safe = np.clip(q, 1e-15, 1.0)

    # Cross-entropy formula (change of base: log_b(x) = ln(x) / ln(b))
    log_q = np.log(q_safe) / np.log(base)
    ce = -np.sum(p * log_q)

    return ce

def entropy(p: np.ndarray, base: float = np.e) -> float:
    """
    Compute Shannon entropy H(P).

    H(P) = -Σ p(x) * log_b(p(x))
    """
    # Clipping keeps log finite; terms with p(x) = 0 still contribute 0
    p_safe = np.clip(p, 1e-15, 1.0)
    log_p = np.log(p_safe) / np.log(base)
    return -np.sum(p * log_p)

def kl_divergence(p: np.ndarray, q: np.ndarray,
                  base: float = np.e) -> float:
    """
    Compute KL divergence D_KL(P || Q).

    D_KL(P || Q) = Σ p(x) * log(p(x)/q(x))
                 = H(P, Q) - H(P)
    """
    return cross_entropy(p, q, base) - entropy(p, base)


# Example: Information theory perspective
print("=" * 60)
print("Cross-Entropy: Information Theory Perspective")
print("=" * 60)

# True distribution of weather
p_weather = np.array([0.7, 0.2, 0.1])  # [sunny, cloudy, rainy]
labels = ["sunny", "cloudy", "rainy"]

# Different model predictions
q_accurate = np.array([0.65, 0.25, 0.10])  # Good model
q_uniform = np.array([0.33, 0.34, 0.33])   # Uncertain model
q_wrong = np.array([0.1, 0.2, 0.7])        # Wrong model

print(f"\nTrue distribution P: {dict(zip(labels, p_weather))}")
print(f"Entropy H(P) = {entropy(p_weather):.4f} nats")

for name, q in [("Accurate", q_accurate),
                ("Uniform", q_uniform),
                ("Wrong", q_wrong)]:
    ce = cross_entropy(p_weather, q)
    kl = kl_divergence(p_weather, q)
    print(f"\n{name} model Q:")
    print(f"  Cross-entropy H(P,Q) = {ce:.4f} nats")
    print(f"  KL divergence D_KL = {kl:.4f} nats (extra nats wasted)")


# =====================================================
# Binary Cross-Entropy Loss
# =====================================================

def binary_cross_entropy(y_true: np.ndarray,
                         y_pred: np.ndarray,
                         epsilon: float = 1e-15) -> float:
    """
    Binary cross-entropy (log loss) for classification.

    BCE = -1/n * Σ [y*log(p) + (1-y)*log(1-p)]

    Args:
        y_true: Ground truth labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        epsilon: Small value to avoid log(0)

    Returns:
        Average binary cross-entropy loss
    """
    # Keep predictions strictly inside (0, 1) so both logs are finite
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    loss = -(y_true * np.log(y_pred) +
             (1 - y_true) * np.log(1 - y_pred))

    return np.mean(loss)

# Example: Binary classification
print("\n" + "=" * 60)
print("Binary Cross-Entropy Loss")
print("=" * 60)

y_true = np.array([1, 0, 1, 1, 0])
scenarios = [
    ("Confident & Correct", np.array([0.9, 0.1, 0.8, 0.95, 0.05])),
    ("Uncertain", np.array([0.6, 0.4, 0.55, 0.65, 0.35])),
    ("Confident & Wrong", np.array([0.1, 0.9, 0.2, 0.15, 0.85])),
]

for name, y_pred in scenarios:
    loss = binary_cross_entropy(y_true, y_pred)
    print(f"\n{name}:")
    print(f"  y_pred = {y_pred}")
    print(f"  BCE Loss = {loss:.4f}")


# =====================================================
# Categorical Cross-Entropy Loss
# =====================================================

def categorical_cross_entropy(y_true: np.ndarray,
                              y_pred: np.ndarray,
                              epsilon: float = 1e-15) -> float:
    """
    Categorical cross-entropy for multi-class classification.

    CCE = -1/n * Σ Σ y_true[i,k] * log(y_pred[i,k])

    Args:
        y_true: One-hot encoded labels (n_samples, n_classes)
        y_pred: Predicted probabilities (n_samples, n_classes)
        epsilon: Small value to avoid log(0)

    Returns:
        Average categorical cross-entropy loss
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Sum over classes, average over samples
    loss = -np.sum(y_true * np.log(y_pred), axis=-1)
    return np.mean(loss)

def sparse_categorical_cross_entropy(y_true: np.ndarray,
                                     y_pred: np.ndarray,
                                     epsilon: float = 1e-15) -> float:
    """
    Sparse categorical CE (labels as integers, not one-hot).
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    n_samples = len(y_true)

    # Pick the probability of the true class for each sample
    loss = -np.log(y_pred[np.arange(n_samples), y_true])
    return np.mean(loss)


# =====================================================
# Label Smoothing
# =====================================================

def label_smoothing(y_true: np.ndarray,
                    epsilon: float = 0.1,
                    n_classes: Optional[int] = None) -> np.ndarray:
    """
    Apply label smoothing to one-hot encoded labels.

    y_smooth = (1 - epsilon) * y_hard + epsilon / K

    Args:
        y_true: One-hot encoded labels
        epsilon: Smoothing parameter (typically 0.1)
        n_classes: Number of classes (inferred if not provided)

    Returns:
        Smoothed labels
    """
    if n_classes is None:
        n_classes = y_true.shape[-1]

    y_smooth = (1 - epsilon) * y_true + epsilon / n_classes
    return y_smooth

print("\n" + "=" * 60)
print("Label Smoothing Example")
print("=" * 60)

# One-hot encoded label for class 0
y_hard = np.array([[1, 0, 0, 0]])
y_soft = label_smoothing(y_hard, epsilon=0.1, n_classes=4)

print(f"Hard label:     {y_hard[0]}")
print(f"Soft label:     {y_soft[0]}")
print(f"Sum still = 1:  {y_soft.sum():.4f}")


# =====================================================
# Numerical Stability: Log-Softmax Trick
# =====================================================

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    """
    Numerically stable softmax using the shift trick.
    """
    # Subtract max for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

def stable_log_softmax(logits: np.ndarray) -> np.ndarray:
    """
    Numerically stable log-softmax.

    log_softmax(x)_i = x_i - log(Σ exp(x_j))
                     = x_i - max(x) - log(Σ exp(x_j - max(x)))
    """
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return shifted - log_sum_exp

def cross_entropy_with_logits(logits: np.ndarray,
                              y_true: np.ndarray) -> float:
    """
    Cross-entropy directly from logits (most stable).

    This combines softmax and cross-entropy for numerical stability.
    """
    log_probs = stable_log_softmax(logits)
    loss = -np.sum(y_true * log_probs, axis=-1)
    return np.mean(loss)

print("\n" + "=" * 60)
print("Numerical Stability: Log-Softmax")
print("=" * 60)

# Large logits that would cause overflow with naive softmax
logits = np.array([[1000.0, 1000.1, 999.9]])
y_true = np.array([[0, 1, 0]])

print(f"Large logits: {logits[0]}")
print(f"Stable softmax: {stable_softmax(logits)[0]}")
print(f"Stable CE loss: {cross_entropy_with_logits(logits, y_true):.6f}")

Real-World Examples


AI/ML Applications

🔤 NLP & Transformers

Transformer language models such as BERT, GPT, and T5 are trained with cross-entropy. Masked language modeling, next-token prediction, and sequence classification all rely on it.

👁️ Computer Vision

Image classification, object detection (class prediction), semantic segmentation (per-pixel classification) — all use cross-entropy for the classification component.

🎯 Reinforcement Learning

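Policy gradient methods weight the negative log-probability of each chosen action by its return or advantage, which has the form of a reward-weighted cross-entropy that shifts the policy toward high-reward actions. The "cross-entropy method" for optimization iteratively fits distributions to elite samples.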

🧬 Generative Models

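VAEs use cross-entropy for the reconstruction loss when the output is categorical or Bernoulli. The original GAN formulation trains the discriminator with binary cross-entropy. Diffusion models are the exception here: noise prediction is usually trained with a mean squared error objective rather than cross-entropy.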

The Universal Loss: Cross-entropy is arguably the most important loss function in deep learning. Understanding it deeply — from information theory to numerical implementation — is essential for any ML practitioner.

Knowledge Check

Test your understanding of cross-entropy with this comprehensive quiz.


What does cross-entropy H(P, Q) measure?


Summary

Key Takeaways

  1. Cross-entropy measures coding inefficiency: H(P,Q) is the expected bits to encode events from P using a code optimized for Q. It's always ≥ H(P), with equality only when Q = P.
  2. The fundamental decomposition: H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL divergence from the model to the true distribution.
  3. Asymmetric loss is a feature: The -log(p) loss penalizes confident wrong predictions severely while giving mild rewards for correct ones. This creates strong gradients that efficiently correct mistakes.
  4. Better than MSE for classification: Cross-entropy provides non-vanishing gradients even when the model is confidently wrong, enabling efficient learning.
  5. Label smoothing prevents overconfidence: Replacing hard [1,0,0] labels with soft [0.9, 0.05, 0.05] improves calibration and generalization, and acts as regularization.
  6. Numerical stability matters: Always use log-softmax + cross-entropy combined functions in frameworks to avoid overflow/underflow issues.
Looking Ahead: In the next section, we'll explore KL Divergence in depth — the quantity that measures how different two distributions are. You'll see how KL divergence connects to cross-entropy, maximum likelihood estimation, and variational inference.