Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define cross-entropy and explain its information-theoretic meaning
- Derive the relationship H(P,Q) = H(P) + D_KL(P||Q)
- Explain why cross-entropy is preferred over MSE for classification
- Understand binary and categorical cross-entropy loss functions
🔧 Practical Skills
- Implement cross-entropy loss from scratch
- Apply label smoothing to improve model calibration
- Use the log-softmax trick for numerical stability
- Choose appropriate loss functions for different tasks
🧠 Deep Learning Connections
- Classification Loss: Cross-entropy is the standard loss for neural network classifiers, from simple logistic regression to GPT-4
- Knowledge Distillation: Soft targets from teacher models use cross-entropy to transfer knowledge to smaller student models
- Variational Inference: The ELBO objective in VAEs uses cross-entropy to measure reconstruction quality when the decoder models Bernoulli or categorical outputs
- Language Modeling: Next-token prediction loss in transformers is categorical cross-entropy over the vocabulary
Where You'll Apply This: Training any classifier (images, text, audio), language model pre-training, recommendation systems, anomaly detection, and anywhere you need to compare probability distributions.
The Big Picture: Measuring Coding Inefficiency
In the previous section on Shannon entropy, we learned that H(P) measures the minimum average number of bits needed to encode messages from distribution P. But what if we don't know P and instead design our code based on a different distribution Q?
Cross-entropy answers this question: it measures the average number of bits needed to encode events from the true distribution P when using a coding scheme optimized for Q. If Q doesn't match P, we waste bits!
The Cross-Entropy Question
If reality follows P, but we encode using Q, how many bits do we use on average?
Historical Context
Cross-entropy emerges naturally from Claude Shannon's information theory framework (1948). While Shannon focused on optimal coding (entropy), the cross-entropy concept quantifies the cost of being wrong about the underlying distribution.
- 1948 - Shannon: Established that entropy H(P) is the theoretical minimum for lossless compression
- 1951 - Kullback & Leibler: Formalized the KL divergence, showing the gap between cross-entropy and entropy
- 1960s-70s: Cross-entropy adopted in pattern recognition and decision theory
- 1990s-Present: Became the dominant loss function for neural network classifiers
Mathematical Definition
Discrete Cross-Entropy
For discrete probability distributions P and Q over the same sample space:
Discrete Cross-Entropy (using natural log for nats, log₂ for bits):
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
What each term means:
- P(x): The true probability of event x occurring
- Q(x): Our model's predicted probability for x
- -log Q(x): The "surprise" or code length for x under Q
- H(P, Q): The expected surprise when events follow P but we use Q's code
Continuous Cross-Entropy
For continuous distributions with densities p(x) and q(x):
$$H(p, q) = -\int p(x) \log q(x) \, dx$$
Interactive: Cross-Entropy Explorer
Explore how cross-entropy changes as you adjust the true distribution P (reality) and the model distribution Q (belief). Watch the relationship H(P,Q) = H(P) + D_KL(P||Q) in action.
Term-by-Term Breakdown
| Outcome | P(x) | Q(x) | -P log₂ P (bits) | -P log₂ Q (bits) | P log₂(P/Q) (bits) |
|---|---|---|---|---|---|
| x = A | 0.600 | 0.500 | 0.4422 | 0.6000 | 0.1578 |
| x = B | 0.300 | 0.300 | 0.5211 | 0.5211 | 0.0000 |
| x = C | 0.100 | 0.200 | 0.3322 | 0.2322 | -0.1000 |
| Total | 1.000 | 1.000 | 1.2955 | 1.3533 | 0.0578 |
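The totals in the breakdown table can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function names `entropy`, `cross_entropy`, and `kl_divergence` are illustrative choices, and base-2 logarithms are assumed so the results are in bits.

```python
import math

def entropy(p, base=2.0):
    """H(P) = -sum_x P(x) log P(x); outcomes with P(x) = 0 contribute nothing."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

def cross_entropy(p, q, base=2.0):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

# Distributions from the table: outcomes A, B, C.
P = [0.6, 0.3, 0.1]
Q = [0.5, 0.3, 0.2]

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
d_kl = kl_divergence(P, Q)

print(f"H(P)       = {h_p:.4f} bits")
print(f"H(P, Q)    = {h_pq:.4f} bits")
print(f"D_KL(P||Q) = {d_kl:.4f} bits")
```

The three printed values match the "Total" row, and the last one is the difference of the first two.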
The Fundamental Relationship
Cross-entropy decomposes into two parts: the inherent uncertainty of the data (entropy) plus the extra cost of using the wrong distribution (KL divergence):
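The Fundamental Decomposition
$$\underbrace{H(P, Q)}_{\text{total bits used}} = \underbrace{H(P)}_{\text{minimum possible}} + \underbrace{D_{KL}(P \,\|\, Q)}_{\text{bits wasted}}$$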
Why this matters for machine learning:
- H(P) is fixed by the data: we can't change it
- D_KL(P||Q) depends on our model: we can minimize it!
- Therefore, minimizing cross-entropy = minimizing KL divergence
- The best we can do is H(P,Q) = H(P), reached when Q = P
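The decomposition follows in one step from the definitions, by writing log Q(x) = log P(x) - log(P(x)/Q(x)). A short derivation for the discrete case, assuming Q(x) > 0 wherever P(x) > 0:

```latex
\begin{aligned}
H(P, Q) &= -\sum_{x} P(x) \log Q(x) \\
        &= -\sum_{x} P(x) \left[ \log P(x) - \log \frac{P(x)}{Q(x)} \right] \\
        &= \underbrace{-\sum_{x} P(x) \log P(x)}_{H(P)}
           + \underbrace{\sum_{x} P(x) \log \frac{P(x)}{Q(x)}}_{D_{KL}(P \,\|\, Q)}
\end{aligned}
```

Because D_KL(P||Q) is never negative, this also shows that H(P,Q) can never fall below H(P).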
Building Intuition
The cross-entropy loss for a single prediction has a beautiful asymmetric property that makes it ideal for training classifiers:
When Prediction is Correct (high Q)
If Q(correct) is close to 1, for example 0.9: Loss = -log(0.9) ≈ 0.105
Small loss — model is confident and correct. Gradients are small, allowing the model to focus elsewhere.
When Prediction is Wrong (low Q)
If Q(correct) is close to 0, for example 0.01: Loss = -log(0.01) ≈ 4.6
Huge loss — model is confident but wrong! Large gradients force the model to urgently correct this mistake.
Interactive: The -log(p) Loss Curve
Explore the cross-entropy loss curve and see how the gradient changes at different probability levels. Notice how the curve is very steep near zero but flattens near one.
Gradient Insight
The gradient of -log(p) with respect to p is -1/p. At low probabilities the slope is very steep (for example, -100 at p = 0.01), pushing the model strongly to fix wrong predictions. Near p = 1 it flattens to about -1.
Asymmetric penalties: The -log function penalizes confident wrong predictions much more than it rewards confident correct ones. Going from 1% to 10% reduces loss by 2.3 nats, while going from 80% to 90% only reduces it by 0.12 nats. This creates strong gradients that force the model to fix its mistakes.
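These figures can be checked directly. The sketch below uses natural logarithms (nats); the helper names `nll` and `nll_grad` are illustrative, not part of any library.

```python
import math

def nll(p):
    """Cross-entropy loss for one example: -log of the probability given to the true class."""
    return -math.log(p)

def nll_grad(p):
    """Derivative of -log(p) with respect to p."""
    return -1.0 / p

# Loss and slope at a few probability levels.
for p in (0.01, 0.10, 0.50, 0.80, 0.90, 0.99):
    print(f"p = {p:4.2f}   loss = {nll(p):6.3f}   dL/dp = {nll_grad(p):8.2f}")

# Same-sized moves in probability buy very different loss reductions.
gain_low = nll(0.01) - nll(0.10)    # fixing a confident mistake
gain_high = nll(0.80) - nll(0.90)   # polishing an already good prediction
print(f"1% -> 10%:  loss drops by {gain_low:.2f}")
print(f"80% -> 90%: loss drops by {gain_high:.2f}")
```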
Cross-Entropy as a Loss Function
In machine learning, we use cross-entropy as a loss function by treating:
- P = the true label distribution (one-hot encoded for classification)
- Q = the model's predicted probability distribution (softmax output)
Binary Cross-Entropy
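For binary classification with label y ∈ {0, 1} and predicted probability p = P(y = 1 | x):
Binary Cross-Entropy (BCE):
$$\text{BCE}(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$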
When y=1: BCE = -log(p). When y=0: BCE = -log(1-p).
Categorical Cross-Entropy
For multi-class classification with one-hot encoded labels y = (y_1, ..., y_K) and predicted probabilities p = (p_1, ..., p_K):
Categorical Cross-Entropy (CCE):
$$\text{CCE}(y, p) = -\sum_{k=1}^{K} y_k \log p_k$$
Since y is one-hot, only the true class c contributes: CCE = -log(p_c)
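Both losses are short enough to write from scratch. The following is a minimal sketch rather than a production implementation: the function names are illustrative, and the clipping constant `EPS` is an assumption added here to avoid log(0).

```python
import math

EPS = 1e-12  # assumed clipping constant so that log never receives exactly 0

def binary_cross_entropy(y, p):
    """BCE for one example: y is 0 or 1, p is the predicted probability of class 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def categorical_cross_entropy(y, p):
    """CCE for one example: y is a target distribution (one-hot here), p the predicted one."""
    return -sum(yk * math.log(max(pk, EPS)) for yk, pk in zip(y, p) if yk > 0)

# y = 1 reduces BCE to -log(p); y = 0 reduces it to -log(1 - p).
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(0, 0.9))

# With a one-hot target only the true class (index 1 here) contributes.
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

Because `categorical_cross_entropy` sums over every class with a nonzero target, the same function also accepts soft (non one-hot) targets.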
Interactive: Classification Loss Explorer
See how cross-entropy loss changes with different model predictions. Try various scenarios to understand when the loss is low (good) versus high (bad).
The negative log function penalizes confident wrong predictions severely. If the model says 1% for the correct answer, loss = -log(0.01) = 4.6. But if it says 90%, loss = -log(0.9) = 0.1. This gradient pushes the model to be both correct and confident.
Why Cross-Entropy, Not MSE?
Why do neural network classifiers use cross-entropy instead of Mean Squared Error (MSE)? The answer lies in the gradient behavior.
| Aspect | Cross-Entropy Loss | MSE Loss |
|---|---|---|
| Formula | -log(p) | (y - p)² |
| Gradient w.r.t. logits | p - y (simple!) | 2(p - y) · p · (1-p) (vanishing!) |
| When p=0.01, y=1 | Gradient ≈ -0.99 (strong) | Gradient ≈ -0.02 (weak) |
| Confident wrong | Huge penalty, strong correction | Small gradient, slow learning |
| Probability range | Designed for sigmoid/softmax outputs in [0,1] | Applied to raw outputs, can produce invalid probabilities |
The Vanishing Gradient Problem with MSE
When using sigmoid/softmax with MSE, the gradient with respect to the logits includes a p(1 - p) term. When the model is confidently wrong (p near 0 or 1), this term vanishes, causing extremely slow learning. Cross-entropy's gradient is p - y, which is proportional to the error itself and so is largest exactly when the model is confidently wrong.
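The difference is easy to see numerically for a single sigmoid output. This sketch assumes the unscaled loss (y - p)², for which the chain rule gives 2(p - y)·p·(1 - p); defining MSE with a factor of ½ would drop the 2. The function names are illustrative, and a finite-difference check is included to confirm both analytic gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_ce(z, y):
    """d/dz of binary cross-entropy applied to sigmoid(z): simplifies to p - y."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d/dz of (y - sigmoid(z))**2: carries the extra factor p * (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def numeric_grad(loss, z, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (loss(z + h) - loss(z - h)) / (2.0 * h)

y = 1.0
z = math.log(0.01 / 0.99)  # logit at which sigmoid(z) = 0.01: confidently wrong

ce_analytic = grad_ce(z, y)
mse_analytic = grad_mse(z, y)
ce_numeric = numeric_grad(lambda t: -math.log(sigmoid(t)), z)
mse_numeric = numeric_grad(lambda t: (y - sigmoid(t)) ** 2, z)

print(f"cross-entropy gradient: {ce_analytic:.4f}")
print(f"MSE gradient:           {mse_analytic:.4f}")
```

At the same confidently wrong prediction, the cross-entropy gradient is roughly fifty times larger in magnitude than the MSE gradient.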
Label Smoothing and Soft Targets
Standard one-hot labels (like [1, 0, 0, 0]) encourage the model to be maximally confident. But overconfidence often hurts generalization! Label smoothing addresses this by using "soft" targets:
Label Smoothing Formula:
$$y_k^{\text{smooth}} = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K}$$
With ε=0.1 and K=4 classes: [1,0,0,0] → [0.925, 0.025, 0.025, 0.025]
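The worked example above can be reproduced with a one-line transformation. This is a minimal sketch assuming uniform smoothing over all K classes, including the true one; `smooth_labels` is an illustrative name.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution: (1 - eps) * y + eps / K."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

soft = smooth_labels([1, 0, 0, 0], epsilon=0.1)
print([round(v, 3) for v in soft])  # [0.925, 0.025, 0.025, 0.025]
print(round(sum(soft), 6))          # the result is still a probability distribution
```

Setting `epsilon=0` returns the original hard labels, so smoothing can be toggled without changing the loss code.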
Benefits of label smoothing:
- Prevents overconfidence: The model can't achieve zero loss by outputting [1,0,0,0] — there's always a small penalty for being too certain
- Better calibration: Predicted probabilities more accurately reflect true likelihood
- Regularization effect: Acts similarly to L2 regularization on the logits
- Improved generalization: Reduces overfitting, especially on small datasets
Interactive: Soft Labels Demo
See how soft labels reduce overconfidence and improve generalization. Controls: true label, label smoothing (ε), and model confidence for the correct class.
Gradient Comparison (∂L/∂logits)
| Class | Model p(x) | Hard Target | Soft Target | Hard Gradient | Soft Gradient |
|---|---|---|---|---|---|
| cat ✓ | 0.654 | 1.000 | 1.000 | -0.346 | -0.346 |
| dog | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |
| bird | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |
| fish | 0.115 | 0.000 | 0.000 | 0.115 | 0.115 |
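Notice: With soft labels, the gradient for each class is the predicted probability minus the soft target. The correct class is pulled up less strongly, and a "wrong" class gets a negative gradient (pushing its probability up) only once its probability drops below ε/K. This prevents the model from being overconfident.
The table above is a snapshot with K = 4 classes and ε = 0.00, so the soft and hard columns coincide.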
Numerical Stability
Computing cross-entropy naively can cause numerical issues:
- Underflow: When p is very small, log(p) becomes a huge negative number
- Overflow: Softmax involves exp(z), which explodes for large logits
- Log of zero: If softmax outputs exactly 0, log(0) = -∞
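The solution is the log-softmax trick, which subtracts the largest logit before exponentiating and never forms the probabilities explicitly:
$$\log \text{softmax}(z)_i = z_i - m - \log \sum_{j} e^{z_j - m}, \qquad m = \max_j z_j$$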
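A small comparison makes the failure mode concrete. The sketch below uses only the standard library, where `math.exp` raises `OverflowError` for large inputs; array libraries typically return `inf` or `nan` instead. Both function names are illustrative.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: shift by the max logit so every exponent is <= 0."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_sum for z in logits]

def naive_log_softmax(logits):
    """Direct translation of log(exp(z_i) / sum_j exp(z_j)); breaks on large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

logits = [1000.0, 999.0, 0.0]

stable = log_softmax(logits)
print("stable:", [round(v, 4) for v in stable])

try:
    naive_log_softmax(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True
print("naive version overflowed:", naive_failed)
```

Shifting by the maximum does not change the result mathematically, because the shift cancels between the numerator and the normalizer.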
In practice, use built-in combined functions such as nn.CrossEntropyLoss in PyTorch or tf.nn.softmax_cross_entropy_with_logits in TensorFlow. These take raw logits, handle numerical stability internally, and are also more computationally efficient.
Python Implementation
Here's a comprehensive implementation of cross-entropy, including the information-theoretic perspective, classification losses, label smoothing, and numerical stability:
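A minimal sketch along those lines, under stated assumptions: it uses only the standard library, works on one example at a time, and the name `cross_entropy_from_logits` is illustrative rather than a framework API. It combines the stable log-softmax with optional label smoothing so the loss is computed straight from logits.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_sum for z in logits]

def cross_entropy_from_logits(logits, target_index, epsilon=0.0):
    """Cross-entropy between a (possibly smoothed) one-hot target and softmax(logits).

    epsilon = 0 gives the usual hard-label loss -log p_c;
    epsilon > 0 uses targets (1 - eps) * y_k + eps / K.
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        hard = 1.0 if i == target_index else 0.0
        target = (1.0 - epsilon) * hard + epsilon / k
        loss -= target * log_p
    return loss

logits = [2.0, 0.5, 0.5, 0.5]
hard_loss = cross_entropy_from_logits(logits, target_index=0)
soft_loss = cross_entropy_from_logits(logits, target_index=0, epsilon=0.1)
print(f"hard labels:     {hard_loss:.4f}")
print(f"smoothed labels: {soft_loss:.4f}")

# Extreme logits stay finite; with smoothing, extreme confidence is penalized.
extreme = [1000.0, 0.0, 0.0, 0.0]
extreme_hard = cross_entropy_from_logits(extreme, target_index=0)
extreme_soft = cross_entropy_from_logits(extreme, target_index=0, epsilon=0.1)
print(f"extreme logits, hard:     {extreme_hard:.4f}")
print(f"extreme logits, smoothed: {extreme_soft:.4f}")
```

With hard labels the extreme logits drive the loss to zero, while the smoothed target keeps a large penalty on them, which is the overconfidence effect described in the label smoothing section.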
Real-World Examples
AI/ML Applications
🔤 NLP & Transformers
Every transformer-based model (BERT, GPT, T5) uses cross-entropy for training. Masked language modeling, next-token prediction, and sequence classification all rely on it.
👁️ Computer Vision
Image classification, object detection (class prediction), semantic segmentation (per-pixel classification) — all use cross-entropy for the classification component.
🎯 Reinforcement Learning
Policy gradient methods use cross-entropy to match the policy distribution to high-reward actions. The "cross-entropy method" for optimization iteratively fits distributions to elite samples.
🧬 Generative Models
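VAEs use cross-entropy for reconstruction loss (when output is categorical). GANs use binary cross-entropy in the discriminator. Standard diffusion models are an exception: their noise-prediction loss is typically mean squared error, derived from a KL-based variational bound.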
Knowledge Check
Test your understanding of cross-entropy with this comprehensive quiz.
What does cross-entropy H(P, Q) measure?
Summary
Key Takeaways
- Cross-entropy measures coding inefficiency: H(P,Q) is the expected bits to encode events from P using a code optimized for Q. It's always ≥ H(P), with equality only when Q = P.
- The fundamental decomposition: H(P,Q) = H(P) + D_KL(P||Q). Because H(P) does not depend on the model, minimizing cross-entropy over Q is equivalent to minimizing D_KL(P||Q), the divergence of the model Q from the true distribution P.
- Asymmetric loss is a feature: The -log(p) loss penalizes confident wrong predictions severely while giving mild rewards for correct ones. This creates strong gradients that efficiently correct mistakes.
- Better than MSE for classification: Cross-entropy provides non-vanishing gradients even when the model is confidently wrong, enabling efficient learning.
- Label smoothing prevents overconfidence: Replacing hard [1,0,0] labels with soft [0.9, 0.05, 0.05] improves calibration, generalization, and acts as regularization.
- Numerical stability matters: Always use log-softmax + cross-entropy combined functions in frameworks to avoid overflow/underflow issues.
Looking Ahead: In the next section, we'll explore KL Divergence in depth — the quantity that measures how different two distributions are. You'll see how KL divergence connects to cross-entropy, maximum likelihood estimation, and variational inference.