Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define entropy as a measure of uncertainty/information content
- Explain why entropy uses logarithms and what "bits" represent
- Calculate entropy for discrete probability distributions
- Understand the maximum entropy principle and its implications
🔧 Practical Skills
- Compute entropy in Python using NumPy and SciPy
- Compare uncertainty across different distributions
- Apply entropy concepts to data compression and coding
- Use entropy for feature selection in machine learning
🧠 Deep Learning Connections
- Cross-entropy loss: The foundation of classification training, directly built on Shannon entropy
- Decision tree splitting: Information gain uses entropy to choose optimal splits
- Softmax temperature: Controls entropy of output distributions in language models
- VAEs and information bottleneck: Entropy constraints shape latent representations
Where You'll Apply This: Data compression algorithms, decision tree learning, neural network loss functions, text generation (temperature sampling), anomaly detection, feature selection, and understanding model uncertainty.
The Big Picture
How do you measure information? This seemingly philosophical question has a precise mathematical answer that revolutionized communication, computing, and now machine learning. Shannon entropy is the fundamental measure of uncertainty, information content, and surprise in a probability distribution.
The Core Insight
Information is the resolution of uncertainty. When you learn something, you've reduced your uncertainty about the world. The more uncertain you were before, the more information you gain when uncertainty is resolved.
Fair coin flip: 1 bit of information
Certain outcome: 0 bits of information
More uncertainty: More potential information
Historical Context
The story of entropy begins with one of the most influential scientific papers ever written.
Claude Shannon (1948)
In 1948, Shannon published "A Mathematical Theory of Communication", arguably the birth certificate of the information age. He was working at Bell Labs on the problem of efficiently transmitting messages over noisy communication channels, and he needed to quantify "how much information" a message contains.
The Naming Story
When Shannon showed his formula to John von Neumann, asking what to call it, von Neumann reportedly said: "Call it entropy. In the first place, your uncertainty function has been used in statistical mechanics under that name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."
Building Block: Surprisal (Self-Information)
Before defining entropy, we need to understand its building block: surprisal (also called self-information). Surprisal measures how "surprising" a single event is.
Surprisal Definition
$$I(x) = -\log_2 p(x) = \log_2 \frac{1}{p(x)}$$
where I(x) is the surprisal (in bits) of event x with probability p(x)
Why this formula? It follows from three intuitive requirements:
- Rare events are more surprising: An event with low probability should have high surprisal. If $p(x) \to 0$, then $I(x) \to \infty$.
- Certain events carry no information: If $p(x) = 1$, then $I(x) = 0$. Learning something you already knew is not informative.
- Independent events add: The surprisal of two independent events occurring should be the sum of their individual surprisals. This requires a logarithm! $I(x, y) = I(x) + I(y)$ when x and y are independent.
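To see why the additivity requirement forces a logarithm, here is a brief sketch. It assumes surprisal is a continuous, decreasing function $f$ of the probability alone ($f$ and the constant $k$ are notation introduced here for illustration). For independent events the joint probability is the product $pq$, so additivity reads:

$$
f(pq) = f(p) + f(q) \quad\Longrightarrow\quad f(p) = -k \log p, \qquad k > 0
$$

Choosing the constant so that a fair coin flip carries exactly one unit, $f(1/2) = 1$, fixes the base of the logarithm at 2 and gives $f(p) = -\log_2 p$.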
Key Insight
Surprisal is inversely related to probability. The rarer an event, the more information we gain when it occurs. A fair coin flip (p=0.5) gives 1 bit. A single outcome of a fair die roll (p=1/6) gives ~2.58 bits.
Real Example
If the weather forecast says "90% chance of rain" and it rains, that's only 0.15 bits of information (not surprising). But if it says "10% chance" and it rains, that's 3.32 bits - very informative!
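The numbers above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `surprisal_bits` and the event labels are chosen here for illustration.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprisal I(x) = -log2 p(x), in bits, for an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# The events mentioned in the text
events = {
    "fair coin flip (p=0.5)": 0.5,
    "one face of a fair die (p=1/6)": 1 / 6,
    "rain after a 90% forecast": 0.9,
    "rain after a 10% forecast": 0.1,
}
for name, p in events.items():
    print(f"{name}: {surprisal_bits(p):.2f} bits")

# Independent events add: two fair coin flips have joint probability 0.25,
# and their surprisal is the sum of the individual surprisals (1 + 1 bits).
two_flips = surprisal_bits(0.5 * 0.5)
print(f"two fair coin flips: {two_flips:.2f} bits")
```

The function rejects p = 0 because the surprisal of an impossible event is unbounded.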
Shannon Entropy Definition
Shannon entropy is the expected surprisal - the average amount of information gained when observing a random variable. It measures the uncertainty inherent in a probability distribution.
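Shannon Entropy
$$H(X) = -\sum_{x} p(x) \log_2 p(x) = \mathbb{E}[I(X)]$$
Entropy is the expected value of surprisal, measured in bits. Outcomes with p(x) = 0 contribute nothing, by the convention 0 log₂ 0 = 0.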
| Symbol | Meaning |
|---|---|
| H(X) | Entropy of random variable X |
| p(x) | Probability of outcome x |
| log₂ | Logarithm base 2 (gives units of bits) |
| -p(x) log₂ p(x) | Contribution of outcome x to total entropy |
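As a small worked example (the distribution is picked here purely for illustration), take a variable with three outcomes of probability 1/2, 1/4, and 1/4. Each outcome contributes its probability times its surprisal:

$$
H(X) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits}
$$

This is below the $\log_2 3 \approx 1.585$ bits of a uniform distribution over three outcomes, because one outcome is more predictable than the others.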
Binary Entropy
The simplest case is a binary random variable (like a coin flip). The binary entropy function H(p) gives the entropy as a function of the probability p of one outcome:
Binary Entropy Function
$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
Key Properties
- Max at p=0.5: H(0.5) = 1 bit
- Min at p=0 or p=1: H = 0 bits
- Symmetric: H(p) = H(1-p)
- Concave: Always curves downward
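These properties can be checked numerically. The sketch below evaluates the binary entropy function on a grid of 101 points; the function name `binary_entropy` and the grid size are choices made here, not part of the original material.

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 * log2(0) taken as 0."""
    p = np.asarray(p, dtype=np.float64)
    q = 1.0 - p
    # np.where evaluates both branches, so silence the log2(0) warnings
    with np.errstate(divide="ignore", invalid="ignore"):
        term_p = np.where(p > 0, p * np.log2(p), 0.0)
        term_q = np.where(q > 0, q * np.log2(q), 0.0)
    # Adding 0.0 turns a possible -0.0 into 0.0
    return -(term_p + term_q) + 0.0

grid = np.linspace(0.0, 1.0, 101)
h = binary_entropy(grid)

print("maximum at p =", grid[np.argmax(h)])
print("H(0) =", h[0], " H(1) =", h[-1])
print("symmetric:", np.allclose(h, h[::-1]))
# Concavity: second differences on a uniform grid are negative
print("concave:", bool(np.all(np.diff(h, n=2) < 0)))
```

A grid check like this is evidence rather than proof, but it makes the shape of the curve easy to confirm.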
Discrete Entropy
For a discrete distribution over n outcomes, entropy measures total uncertainty. Each outcome x contributes -p(x) log₂ p(x) bits to the total, so entropy responds directly to how the probability mass is spread: the more uniform the distribution, the more uncertain the outcome and the higher the entropy. When every outcome is equally likely, uncertainty and entropy are both at their maximum.
Properties of Entropy
Shannon entropy has several important mathematical properties that make it the unique measure of information:
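Non-negativity
$$H(X) \ge 0$$
Entropy is always non-negative. You can't have negative uncertainty!
Maximum Entropy
$$H(X) \le \log_2 n$$
For n outcomes, entropy is maximized by the uniform distribution.
Concavity
H(X) is a concave function of the distribution p. The entropy of a mixture is at least the weighted average of the entropies of its components:
$$H(\lambda p + (1 - \lambda) q) \ge \lambda H(p) + (1 - \lambda) H(q), \quad 0 \le \lambda \le 1$$
Additivity (Independent)
For independent random variables:
$$H(X, Y) = H(X) + H(Y)$$
Total uncertainty is the sum of individual uncertainties.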
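The four properties can be spot-checked on concrete numbers. In this sketch the distributions `p` and `q` and the mixing weight `lam` are arbitrary illustrative values, and `entropy_bits` is a helper defined here.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=np.float64).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p = np.array([0.7, 0.2, 0.1])   # an illustrative skewed distribution
q = np.array([0.1, 0.3, 0.6])   # another illustrative distribution
lam = 0.4                       # mixing weight, chosen arbitrarily

# Non-negativity and the log2(n) upper bound
print(f"0 <= H(p) = {entropy_bits(p):.4f} <= log2(3) = {np.log2(3):.4f}")

# Concavity: the mixture is at least as uncertain as the average of its parts
mixture = lam * p + (1 - lam) * q
h_mixture = entropy_bits(mixture)
h_average = lam * entropy_bits(p) + (1 - lam) * entropy_bits(q)
print(f"H(mixture) = {h_mixture:.4f} >= {h_average:.4f}")

# Additivity: for independent X and Y the joint is the outer product of marginals
joint = np.outer(p, q)
h_joint = entropy_bits(joint)
h_sum = entropy_bits(p) + entropy_bits(q)
print(f"H(X, Y) = {h_joint:.4f}, H(X) + H(Y) = {h_sum:.4f}")
```

Additivity holds only under independence; for dependent variables the joint entropy is smaller than the sum.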
Maximum Entropy Principle
A fundamental result: among all distributions over n outcomes, the uniform distribution maximizes entropy. This is because uniform represents maximum uncertainty - we have no reason to prefer any outcome over another.
Maximum Entropy Theorem
For a discrete random variable X with n possible outcomes:
$$H(X) \le \log_2 n$$
with equality if and only if X is uniformly distributed: $p(x) = 1/n$ for all x.
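One way to see why the bound holds is to write the gap between $\log_2 n$ and $H(X)$ as a KL divergence from the uniform distribution $u(x) = 1/n$ (the symbol $u$ is introduced here for the sketch):

$$
\log_2 n - H(X) = \sum_{x} p(x) \log_2 \frac{p(x)}{1/n} = D_{KL}(p \,\|\, u) \ge 0
$$

The KL divergence is never negative and is zero only when $p = u$, which gives both the inequality and the equality condition.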
The Maximum Entropy Principle
Theorem: For a discrete random variable with n possible outcomes, entropy is maximized when the distribution is uniform: p(x) = 1/n for all x. In this case, H = log₂(n) bits.
Why? Uniform distributions represent maximum uncertainty - we have no information suggesting any outcome is more likely than another. This principle is foundational in physics (statistical mechanics), information theory, and machine learning (MaxEnt classifiers, softmax temperature).
Entropy Comparison Chart
To build intuition, let's compare entropy values across different common distributions. This helps you develop a sense for what different entropy values "feel like."
Key Observations
- Deterministic: 0 bits - no uncertainty at all
- Fair coin: Exactly 1 bit - the fundamental unit
- n-sided die: H = log₂(n) bits when uniform
- Natural text: Less than max due to letter frequencies
Practical Implications
- Compression: Under a single-letter frequency model, English text needs ~4.1 bits/letter (not the 8 bits of a one-byte character)
- Passwords: More entropy = harder to guess
- ML: High entropy = diverse predictions (uncertainty)
- Coding: Can't losslessly compress below the entropy limit on average!
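The password point can be made concrete with the uniform-distribution formula. The sketch below assumes an idealized password whose characters are drawn independently and uniformly at random, so additivity gives length × log₂(alphabet size) bits; the function name and the two example policies are made up for illustration.

```python
import math

def uniform_password_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a password whose characters are drawn independently
    and uniformly from an alphabet (an idealized assumption)."""
    return length * math.log2(alphabet_size)

lower_8 = uniform_password_bits(26, 8)     # 8 random lowercase letters
mixed_12 = uniform_password_bits(62, 12)   # 12 random letters or digits
print(f"8 lowercase letters: {lower_8:.1f} bits")
print(f"12 letters/digits:   {mixed_12:.1f} bits")
```

Passwords chosen by people are far from uniform, so this figure is an upper bound on their entropy rather than an estimate of it.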
AI/ML Connections
Shannon entropy is everywhere in modern machine learning. Understanding it deeply will make you a better ML practitioner.
📉 Cross-Entropy Loss
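The standard loss for classification:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
where p is the true label distribution and q is the model's predicted distribution. This measures how well predicted probabilities match true labels. Minimizing cross-entropy is equivalent to maximizing likelihood!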
🌡️ Softmax Temperature
In language models, temperature controls output entropy:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where $z_i$ are the logits. Low T → low entropy (confident), High T → high entropy (creative/random).
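The effect of temperature on entropy is easy to observe numerically. This is a minimal sketch with made-up logits; `softmax` and `entropy_bits` are helpers written here, not a particular library's API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T > 0."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    """Shannon entropy in bits (zero entries are skipped)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up logits for illustration
temps = [0.25, 1.0, 4.0]
entropies = [entropy_bits(softmax(logits, t)) for t in temps]

max_bits = np.log2(len(logits))
for t, h in zip(temps, entropies):
    print(f"T={t}: H = {h:.3f} bits (maximum is {max_bits:.3f})")
```

As T grows the distribution approaches uniform and the entropy approaches log₂(n); as T shrinks the distribution concentrates on the largest logit and the entropy approaches zero.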
🎯 Mutual Information
Feature selection, representation learning, and InfoGAN all use mutual information:
$$I(X; Y) = H(X) - H(X \mid Y)$$
It measures how much knowing Y reduces uncertainty about X.
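For a small joint probability table, mutual information can be computed directly from entropies. The sketch below uses the identity H(X|Y) = H(X, Y) - H(Y); the two joint tables are illustrative extremes (independent variables and an exact copy), and the function names are chosen here.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=np.float64).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X|Y), using H(X|Y) = H(X, Y) - H(Y).

    `joint[i, j]` is p(X = i, Y = j); rows index X and columns index Y.
    """
    joint = np.asarray(joint, dtype=np.float64)
    h_x = entropy_bits(joint.sum(axis=1))   # marginal of X
    h_y = entropy_bits(joint.sum(axis=0))   # marginal of Y
    h_x_given_y = entropy_bits(joint) - h_y
    return h_x - h_x_given_y

# Illustrative joint tables for two binary variables
independent = np.outer([0.5, 0.5], [0.5, 0.5])   # Y tells us nothing about X
copy = np.array([[0.5, 0.0], [0.0, 0.5]])        # Y always equals X

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(f"independent: {mi_independent:.4f} bits")
print(f"exact copy:  {mi_copy:.4f} bits")
```

In the independent case knowing Y removes none of the uncertainty about X; in the copy case it removes all of it, which for a fair binary variable is one full bit.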
🔄 VAE Loss (ELBO)
Variational Autoencoders optimize an entropy-related objective:
$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$$
The KL term involves entropy of the approximate posterior.
- H(p) - uncertainty in distribution p itself
- H(p,q) - expected code length using q to encode data from p
- D_KL(p||q) = H(p,q) - H(p) - the "extra" bits needed when using wrong distribution
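The relationship between these three quantities can be verified on a small example. This sketch works in bits to match the rest of the section (deep learning libraries typically use natural logarithms); the distributions and helper names are illustrative, and `cross_entropy_bits` assumes q(x) > 0 wherever p(x) > 0.

```python
import numpy as np

def entropy_bits(p):
    """H(p) in bits (zero entries are skipped)."""
    p = np.asarray(p, dtype=np.float64)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x); assumes q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_divergence_bits(p, q):
    """D_KL(p || q) = H(p, q) - H(p): the extra bits paid for coding with q."""
    return cross_entropy_bits(p, q) - entropy_bits(p)

p = np.array([0.5, 0.25, 0.25])    # illustrative "true" distribution
q = np.array([1 / 3, 1 / 3, 1 / 3])  # illustrative model distribution

h_p = entropy_bits(p)
h_pq = cross_entropy_bits(p, q)
kl = kl_divergence_bits(p, q)
print(f"H(p) = {h_p:.4f}, H(p, q) = {h_pq:.4f}, D_KL(p||q) = {kl:.4f} bits")

# With a one-hot label, cross-entropy reduces to the surprisal of the true class
one_hot = np.array([0.0, 1.0, 0.0])
predicted = np.array([0.1, 0.7, 0.2])
loss = cross_entropy_bits(one_hot, predicted)
print(f"one-hot loss = {loss:.4f} bits")
```

Because H(p) does not depend on the model, minimizing cross-entropy over q is the same as minimizing the KL divergence from p to q.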
Python Implementation
Let's implement entropy calculation in Python. This code demonstrates both the basic formula and practical considerations.
Here's a complete example showing entropy calculations for various distributions:
```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

def shannon_entropy(probabilities):
    """Calculate Shannon entropy in bits."""
    p = np.array(probabilities, dtype=np.float64)
    p = p / p.sum()          # normalize so the values sum to 1
    p_nonzero = p[p > 0]     # skip zeros: 0 * log2(0) is taken as 0
    return -np.sum(p_nonzero * np.log2(p_nonzero))

# ============================================
# Example 1: Binary distributions
# ============================================
print("=== Binary Entropy ===")
for p in [0.5, 0.7, 0.9, 0.99]:
    h = shannon_entropy([p, 1-p])
    print(f"p={p:.2f}: H = {h:.4f} bits")

# Output:
# p=0.50: H = 1.0000 bits
# p=0.70: H = 0.8813 bits
# p=0.90: H = 0.4690 bits
# p=0.99: H = 0.0808 bits

# ============================================
# Example 2: Compare with SciPy
# ============================================
print("\n=== SciPy Comparison ===")
fair_die = [1/6] * 6
print(f"Our implementation: {shannon_entropy(fair_die):.4f} bits")
print(f"SciPy (base 2): {scipy_entropy(fair_die, base=2):.4f} bits")

# ============================================
# Example 3: Information gain for decision trees
# ============================================
def information_gain(parent_labels, child_groups):
    """
    Calculate information gain for a split.

    Parameters:
        parent_labels: Array of class labels before split
        child_groups: List of arrays of class labels for each child
    """
    # Parent entropy
    _, counts = np.unique(parent_labels, return_counts=True)
    h_parent = shannon_entropy(counts / len(parent_labels))

    # Weighted child entropy
    n_total = len(parent_labels)
    h_children = 0
    for child in child_groups:
        if len(child) > 0:
            _, counts = np.unique(child, return_counts=True)
            h_child = shannon_entropy(counts / len(child))
            weight = len(child) / n_total
            h_children += weight * h_child

    return h_parent - h_children

# Example: Splitting data by a feature
labels = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']  # 50/50 split
left = ['A', 'A', 'A', 'B']    # 75/25
right = ['A', 'B', 'B', 'B']   # 25/75

ig = information_gain(labels, [left, right])
print(f"\n=== Information Gain Example ===")
print(f"Parent entropy: {shannon_entropy([0.5, 0.5]):.4f} bits")
print(f"After split: {ig:.4f} bits of information gained")

# ============================================
# Example 4: Text entropy estimation
# ============================================
def text_entropy(text):
    """Estimate entropy of text based on character frequencies."""
    text = text.lower()
    chars = [c for c in text if c.isalpha()]
    _, counts = np.unique(chars, return_counts=True)
    return shannon_entropy(counts / sum(counts))

sample_text = "the quick brown fox jumps over the lazy dog"
h = text_entropy(sample_text)
max_h = np.log2(26)  # Uniform over 26 letters
print(f"\n=== Text Entropy ===")
print(f"Sample text entropy: {h:.4f} bits/char")
print(f"Max possible: {max_h:.4f} bits/char")
print(f"Compression ratio: {h/max_h:.1%}")
```
Knowledge Check
Test your understanding of Shannon entropy with this interactive quiz.
What is the entropy of a fair coin flip in bits?
Summary
Key Takeaways
- Entropy measures uncertainty: Shannon entropy H(X) quantifies the average "surprise" or information content in a probability distribution.
- Surprisal is the building block: I(x) = -log₂p(x) measures how surprising a single event is. Entropy is expected surprisal.
- Units are bits: Using log base 2 connects entropy to binary information - the number of yes/no questions needed to identify an outcome.
- Uniform maximizes entropy: Among all distributions over n outcomes, the uniform distribution has maximum entropy H = log₂(n).
- Entropy bounds compression: Shannon proved you cannot losslessly compress data below its entropy - it's the fundamental limit.
- ML is built on entropy: Cross-entropy loss, information gain, mutual information, and the VAE objective all stem from Shannon entropy.
Looking Ahead: In the next section, we'll explore cross-entropy, which measures the "distance" between two distributions. This is the foundation of the loss functions used to train most neural network classifiers!