Chapter 20
25 min read
Section 126 of 175

Shannon Entropy

Information Theoretic Foundations

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • • Define entropy as a measure of uncertainty/information content
  • • Explain why entropy uses logarithms and what "bits" represent
  • • Calculate entropy for discrete probability distributions
  • • Understand the maximum entropy principle and its implications

🔧 Practical Skills

  • • Compute entropy in Python using NumPy and SciPy
  • • Compare uncertainty across different distributions
  • • Apply entropy concepts to data compression and coding
  • • Use entropy for feature selection in machine learning

🧠 Deep Learning Connections

  • Cross-entropy loss - The foundation of classification training, directly built on Shannon entropy
  • Decision tree splitting - Information gain uses entropy to choose optimal splits
  • Softmax temperature - Controls entropy of output distributions in language models
  • VAEs and information bottleneck - Entropy constraints shape latent representations
Where You'll Apply This: Data compression algorithms, decision tree learning, neural network loss functions, text generation (temperature sampling), anomaly detection, feature selection, and understanding model uncertainty.

The Big Picture

How do you measure information? This seemingly philosophical question has a precise mathematical answer that revolutionized communication, computing, and now machine learning. Shannon entropy is the fundamental measure of uncertainty, information content, and surprise in a probability distribution.

The Core Insight

Information is the resolution of uncertainty. When you learn something, you've reduced your uncertainty about the world. The more uncertain you were before, the more information you gain when uncertainty is resolved.

🎲

Fair coin flip: 1 bit of information

🎯

Certain outcome: 0 bits of information

More uncertainty: More potential information

Historical Context

The story of entropy begins with one of the most influential scientific papers ever written.

📜
Claude Shannon (1948)

Published "A Mathematical Theory of Communication" - arguably the birth certificate of the information age. Shannon was working at Bell Labs on the problem of efficiently transmitting messages over noisy communication channels. He needed to quantify "how much information" a message contains.

🤔
The Naming Story

When Shannon showed his formula to John von Neumann, asking what to call it, von Neumann reportedly said: "Call it entropy. In the first place, your uncertainty function has been used in statistical mechanics under that name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."


Building Block: Surprisal (Self-Information)

Before defining entropy, we need to understand its building block: surprisal(also called self-information). Surprisal measures how "surprising" a single event is.

Surprisal Definition

I(x)=log2p(x)=log21p(x)I(x) = -\log_2 p(x) = \log_2 \frac{1}{p(x)}

where I(x) is the surprisal (in bits) of event x with probability p(x)

Why this formula? Shannon derived it from three intuitive requirements:

  1. Rare events are more surprising: An event with low probability should have high surprisal. If p(x)0p(x) \to 0, then I(x)I(x) \to \infty.
  2. Certain events carry no information: If p(x)=1p(x) = 1, then I(x)=0I(x) = 0. Learning something you already knew is not informative.
  3. Independent events add: The surprisal of two independent events occurring should be the sum of their individual surprisals. This requires a logarithm!I(x,y)=I(x)+I(y)I(x, y) = I(x) + I(y) when x and y are independent.

Interactive: Surprisal Explorer

Explore how surprisal changes with probability. Notice how rare events carry much more information than common events.

Surprisal (Self-Information) Explorer

Surprisal measures how "surprising" an event is. Rare events carry more information!

Probability p(x)Surprisal I(x) = -log₂p(x) bits00.250.50.75102461.00 bits

Current Values

Probability p(x):0.500
Surprisal I(x):1.000 bits
Formula:I(x) = -log₂(0.50) = 1.00

Interpretation

This is a likely event. With 50% chance, it only provides 1.0 bits of information - not very surprising.

Key Insight

Surprisal is inversely related to probability. The rarer an event, the more information we gain when it occurs. A coin flip (p=0.5) gives 1 bit. A dice roll outcome (p=1/6) gives ~2.58 bits.

Real Example

If the weather forecast says "90% chance of rain" and it rains, that's only 0.15 bits of information (not surprising). But if it says "10% chance" and it rains, that's 3.32 bits - very informative!


Shannon Entropy Definition

Shannon entropy is the expected surprisal - the average amount of information gained when observing a random variable. It measures the uncertainty inherent in a probability distribution.

Shannon Entropy

H(X)=E[I(X)]=xXp(x)log2p(x)H(X) = \mathbb{E}[I(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

Entropy is the expected value of surprisal, measured in bits

SymbolMeaning
H(X)Entropy of random variable X
p(x)Probability of outcome x
log₂Logarithm base 2 (gives units of bits)
-p(x) log₂ p(x)Contribution of outcome x to total entropy
Convention: We define 0 × log(0) = 0, sincelimp0+plogp=0\lim_{p \to 0^+} p \log p = 0. This means outcomes with zero probability contribute nothing to entropy.

Binary Entropy

The simplest case is a binary random variable (like a coin flip). The binary entropy function H(p) gives the entropy as a function of the probability p of one outcome:

Binary Entropy Function

H(p)=plog2p(1p)log2(1p)H(p) = -p \log_2 p - (1-p) \log_2(1-p)

Interactive: Binary Entropy

Drag the slider to see how entropy changes with bias. Notice that entropy is maximized at p = 0.5 (fair coin) and approaches zero as the coin becomes more biased.

Binary Entropy Function

The entropy of a binary random variable (like a biased coin) - the simplest case of Shannon entropy.

Probability p (of heads/success)Entropy H(p) in bits00.250.50.7510.000.250.500.751.00Max: 1 bitH(p) = 1.000H(p) = Total entropy-p log₂(p)-(1-p) log₂(1-p)

Binary Entropy Formula

H(p) = -p log₂p - (1-p) log₂(1-p)

Component Breakdown

-p log₂(p):0.5000
-(1-p) log₂(1-p):0.5000
Total H(p):1.0000

Key Properties

  • Max at p=0.5: H(0.5) = 1 bit
  • Min at p=0 or p=1: H = 0 bits
  • Symmetric: H(p) = H(1-p)
  • Concave: Always curves downward

What This Means

The outcome is highly uncertain! Near-maximum entropy - hardest to predict.

Discrete Entropy

For a discrete distribution over n outcomes, entropy measures total uncertainty. You can manipulate the probabilities directly and see how entropy responds.

Interactive: Discrete Entropy Explorer

Discrete Entropy Explorer

Drag the sliders to change probabilities and see how entropy responds. Entropy measures uncertainty - more uniform = more uncertain = higher entropy!

0%25%50%75%100%25.0%A25.0%B25.0%C25.0%DProbability
A25%
B25%
C25%
D25%
Shannon Entropy
2.000
bits
0 (certain)2.00 (max uncertainty)
100.0% of maximum entropy

Entropy Contribution by Outcome

A:
0.500 bits
B:
0.500 bits
C:
0.500 bits
D:
0.500 bits
Total:2.000 bits

Formula

H(X) = -Σ p(x) log₂ p(x)

Each outcome contributes -p log₂(p) bits to total entropy

Key Insight

Distribution is nearly uniform - maximum uncertainty means maximum entropy. Each outcome is equally likely, so every observation is maximally informative.


Properties of Entropy

Shannon entropy has several important mathematical properties that make it the unique measure of information:

Non-negativity

H(X)0H(X) \geq 0

Entropy is always non-negative. You can't have negative uncertainty!

Maximum Entropy

H(X)log2nH(X) \leq \log_2 n

For n outcomes, entropy is maximized by the uniform distribution.

Concavity

H(X) is a concave function of the distribution p. Mixing distributions increases entropy: H(λp1+(1λ)p2)λH(p1)+(1λ)H(p2)H(\lambda p_1 + (1-\lambda) p_2) \geq \lambda H(p_1) + (1-\lambda) H(p_2)

Additivity (Independent)

For independent random variables:H(X,Y)=H(X)+H(Y)H(X, Y) = H(X) + H(Y)Total uncertainty is the sum of individual uncertainties.

Maximum Entropy Principle

A fundamental result: among all distributions over n outcomes, the uniform distribution maximizes entropy. This is because uniform represents maximum uncertainty - we have no reason to prefer any outcome over another.

Maximum Entropy Theorem

For a discrete random variable X with n possible outcomes:

H(X)log2(n)H(X) \leq \log_2(n)

with equality if and only if X is uniformly distributed:p(x)=1np(x) = \frac{1}{n} for all x.

Interactive: Maximum Entropy Demonstration

Watch entropy increase as a skewed distribution transforms into a uniform distribution. This demonstrates why uniform = maximum entropy.

Maximum Entropy Demonstration

Watch entropy increase as a skewed distribution transforms into a uniform distribution. The uniform distribution always maximizes entropy!

Probability Distribution

25%70%A15%B10%C5%D

Entropy Over Time

1.21.41.61.82.0Max Entropy = 2.00 bitsAnimation ProgressH(X) bits
Current Entropy
1.319
bits
Maximum Entropy
2.000
bits (uniform)
% of Maximum
66.0%
capacity used

The Maximum Entropy Principle

Theorem: For a discrete random variable with n possible outcomes, entropy is maximized when the distribution is uniform: p(x) = 1/n for all x. In this case, H = log₂(n) bits.

Why? Uniform distributions represent maximum uncertainty - we have no information suggesting any outcome is more likely than another. This principle is foundational in physics (statistical mechanics), information theory, and machine learning (MaxEnt classifiers, softmax temperature).


Entropy Comparison Chart

To build intuition, let's compare entropy values across different common distributions. This helps you develop a sense for what different entropy values "feel like."

Entropy Comparison Across Distributions

Compare entropy values across different probability distributions to build intuition.

02468Deterministic0.00 bitsFair Coin1.00 bitsBiased Coin (90%)0.47 bitsFair 4-sided Die2.00 bitsFair 6-sided Die2.58 bitsEnglish Letters4.11 bitsUniform 26 Letters4.70 bitsByte (256 values)8.00 bitsEntropy (bits)
Deterministic
Completely certain - no uncertainty
H = 0.000 bits
Fair Coin
1 bit of uncertainty per flip
H = 1.000 bits
Biased Coin (90%)
Less uncertain than fair coin
H = 0.469 bits
Fair 4-sided Die
2 bits = log₂(4)
H = 2.000 bits

Key Observations

  • Deterministic: 0 bits - no uncertainty at all
  • Fair coin: Exactly 1 bit - the fundamental unit
  • n-sided die: H = log₂(n) bits when uniform
  • Natural text: Less than max due to letter frequencies

Practical Implications

  • Compression: English text needs ~4.1 bits/letter (not 8)
  • Passwords: More entropy = harder to guess
  • ML: High entropy = diverse predictions (uncertainty)
  • Coding: Can't compress below entropy limit!

Real-World Examples


AI/ML Connections

Shannon entropy is everywhere in modern machine learning. Understanding it deeply will make you a better ML practitioner.

📉 Cross-Entropy Loss

The standard loss for classification:L=cyclog(y^c)L = -\sum_c y_c \log(\hat{y}_c)This measures how well predicted probabilities match true labels. Minimizing cross-entropy is equivalent to maximizing likelihood!

🌡️ Softmax Temperature

In language models, temperature controls output entropy:pi=ezi/Tjezj/Tp_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}Low T → low entropy (confident), High T → high entropy (creative/random).

🎯 Mutual Information

Feature selection, representation learning, and InfoGAN all use mutual information:I(X;Y)=H(X)H(XY)I(X;Y) = H(X) - H(X|Y)Measures how much knowing Y reduces uncertainty about X.

🔄 VAE Loss (ELBO)

Variational Autoencoders optimize an entropy-related objective:L=E[logp(xz)]DKL(q(zx)p(z))\mathcal{L} = \mathbb{E}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))The KL term involves entropy of the approximate posterior.

Common Pitfall: Don't confuse entropy (H) with cross-entropy (H(p,q)) or KL divergence (D_KL). They're related but measure different things:
  • H(p) - uncertainty in distribution p itself
  • H(p,q) - expected code length using q to encode data from p
  • D_KL(p||q) = H(p,q) - H(p) - the "extra" bits needed when using wrong distribution

Python Implementation

Let's implement entropy calculation in Python. This code demonstrates both the basic formula and practical considerations.

Shannon Entropy Implementation
🐍entropy.py
1Import NumPy

NumPy provides efficient array operations and mathematical functions like log2 that we need for entropy calculations.

4Function Definition

Define a function that takes a probability distribution (as a list or array) and returns the Shannon entropy in bits.

14Convert to Array

Convert the input to a NumPy array for vectorized operations. This ensures efficient computation even for large distributions.

17Normalize Distribution

Ensure probabilities sum to 1. This handles cases where the input might not be perfectly normalized.

20Filter Zero Probabilities

Critical step! We filter out zero probabilities because 0 * log(0) is undefined, but by convention equals 0 in entropy calculations. Using p > 0 avoids log(0) errors.

23Calculate Entropy

The core entropy formula: H = -Σ p(x) log₂ p(x). The negative sign converts the negative log values to positive entropy. np.sum is used for efficient vectorized summation.

EXAMPLE
For p = [0.5, 0.5]: -0.5*log₂(0.5) - 0.5*log₂(0.5) = 1 bit
18 lines without explanation
1import numpy as np
2
3# Basic entropy calculation
4def shannon_entropy(probabilities):
5    """
6    Calculate Shannon entropy of a discrete distribution.
7
8    Parameters:
9        probabilities: Array-like of probabilities (must sum to 1)
10
11    Returns:
12        Entropy in bits (log base 2)
13    """
14    # Convert to numpy array
15    p = np.array(probabilities, dtype=np.float64)
16
17    # Normalize to ensure sum = 1
18    p = p / p.sum()
19
20    # Filter out zero probabilities (0 * log(0) = 0 by convention)
21    p_nonzero = p[p > 0]
22
23    # Calculate entropy: H = -sum(p * log2(p))
24    return -np.sum(p_nonzero * np.log2(p_nonzero))

Here's a complete example showing entropy calculations for various distributions:

🐍python
1import numpy as np
2from scipy.stats import entropy as scipy_entropy
3
4def shannon_entropy(probabilities):
5    """Calculate Shannon entropy in bits."""
6    p = np.array(probabilities, dtype=np.float64)
7    p = p / p.sum()
8    p_nonzero = p[p > 0]
9    return -np.sum(p_nonzero * np.log2(p_nonzero))
10
11# ============================================
12# Example 1: Binary distributions
13# ============================================
14print("=== Binary Entropy ===")
15for p in [0.5, 0.7, 0.9, 0.99]:
16    h = shannon_entropy([p, 1-p])
17    print(f"p={p:.2f}: H = {h:.4f} bits")
18
19# Output:
20# p=0.50: H = 1.0000 bits
21# p=0.70: H = 0.8813 bits
22# p=0.90: H = 0.4690 bits
23# p=0.99: H = 0.0808 bits
24
25# ============================================
26# Example 2: Compare with SciPy
27# ============================================
28print("\n=== SciPy Comparison ===")
29fair_die = [1/6] * 6
30print(f"Our implementation: {shannon_entropy(fair_die):.4f} bits")
31print(f"SciPy (base 2):     {scipy_entropy(fair_die, base=2):.4f} bits")
32
33# ============================================
34# Example 3: Information gain for decision trees
35# ============================================
36def information_gain(parent_labels, child_groups):
37    """
38    Calculate information gain for a split.
39
40    Parameters:
41        parent_labels: Array of class labels before split
42        child_groups: List of arrays of class labels for each child
43    """
44    # Parent entropy
45    _, counts = np.unique(parent_labels, return_counts=True)
46    h_parent = shannon_entropy(counts / len(parent_labels))
47
48    # Weighted child entropy
49    n_total = len(parent_labels)
50    h_children = 0
51    for child in child_groups:
52        if len(child) > 0:
53            _, counts = np.unique(child, return_counts=True)
54            h_child = shannon_entropy(counts / len(child))
55            weight = len(child) / n_total
56            h_children += weight * h_child
57
58    return h_parent - h_children
59
60# Example: Splitting data by a feature
61labels = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']  # 50/50 split
62left = ['A', 'A', 'A', 'B']  # 75/25
63right = ['A', 'B', 'B', 'B']  # 25/75
64
65ig = information_gain(labels, [left, right])
66print(f"\n=== Information Gain Example ===")
67print(f"Parent entropy: {shannon_entropy([0.5, 0.5]):.4f} bits")
68print(f"After split:    {ig:.4f} bits of information gained")
69
70# ============================================
71# Example 4: Text entropy estimation
72# ============================================
73def text_entropy(text):
74    """Estimate entropy of text based on character frequencies."""
75    text = text.lower()
76    chars = [c for c in text if c.isalpha()]
77    _, counts = np.unique(chars, return_counts=True)
78    return shannon_entropy(counts / sum(counts))
79
80sample_text = "the quick brown fox jumps over the lazy dog"
81h = text_entropy(sample_text)
82max_h = np.log2(26)  # Uniform over 26 letters
83print(f"\n=== Text Entropy ===")
84print(f"Sample text entropy: {h:.4f} bits/char")
85print(f"Max possible:        {max_h:.4f} bits/char")
86print(f"Compression ratio:   {h/max_h:.1%}")

Knowledge Check

Test your understanding of Shannon entropy with this interactive quiz.

Shannon Entropy Knowledge CheckScore: 0/8
Question 1 of 8

What is the entropy of a fair coin flip in bits?


Summary

Key Takeaways

  1. Entropy measures uncertainty: Shannon entropy H(X) quantifies the average "surprise" or information content in a probability distribution.
  2. Surprisal is the building block: I(x) = -log₂p(x) measures how surprising a single event is. Entropy is expected surprisal.
  3. Units are bits: Using log base 2 connects entropy to binary information - the number of yes/no questions needed to identify an outcome.
  4. Uniform maximizes entropy: Among all distributions over n outcomes, the uniform distribution has maximum entropy H = log₂(n).
  5. Entropy bounds compression: Shannon proved you cannot losslessly compress data below its entropy - it's the fundamental limit.
  6. ML is built on entropy: Cross-entropy loss, information gain, mutual information, and the VAE objective all stem from Shannon entropy.
Looking Ahead: In the next section, we'll explore cross-entropy, which measures the "distance" between two distributions. This is the foundation of the classification loss functions used in every neural network!
Loading comments...