Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define mutual information as shared information between variables
- Express MI using entropy: I(X;Y) = H(X) - H(X|Y)
- Understand the relationship to KL divergence
- Prove and apply the Data Processing Inequality
🔧 Practical Skills
- Compute MI from joint distributions in Python
- Estimate MI from samples using various methods
- Apply MI for feature selection (mRMR)
- Compare MI to correlation for dependency detection
🧠 Deep Learning Connections
- Information Bottleneck - Representations compress input while preserving task-relevant information
- InfoGAN - Maximizes MI between latent codes and generated outputs for disentangled representations
- Contrastive Learning - Minimizing the InfoNCE loss maximizes a lower bound on mutual information
- Feature Selection - MI identifies the most informative features for prediction
Where You'll Apply This: Feature selection in ML pipelines, representation learning, contrastive learning (SimCLR, CLIP), variational autoencoders, information bottleneck analysis, and understanding what neural networks learn.
The Big Picture
Entropy measures uncertainty in a single random variable. But what about the relationship between two variables? How much does knowing one tell us about the other? Mutual Information answers this question—it quantifies the shared information between two random variables.
The Core Insight
Mutual Information is the reduction in uncertainty about X when we learn Y (or equivalently, about Y when we learn X). It measures how much information the two variables share.
Dependent variables: MI > 0 (larger for stronger dependence)
Independent variables: MI = 0
Deterministic relation: MI = min(H(X), H(Y))
Historical Context
Mutual information emerged from Shannon's foundational work on information theory, but its applications have expanded far beyond communication systems.
Communication Theory (1948)
Shannon introduced MI to characterize channel capacity, the maximum rate at which information can be reliably transmitted. The capacity of a noisy channel is the maximum MI between input and output, taken over all input distributions p(x): C = max_{p(x)} I(X;Y).
Modern Machine Learning
Today, MI is used for feature selection, understanding neural network representations (Information Bottleneck), training generative models (InfoGAN), and contrastive learning objectives. It captures dependencies that correlation misses.
Building Intuition: The Venn Diagram View
One of the most helpful ways to understand mutual information is through an entropy Venn diagram. Imagine two overlapping circles representing the entropy of X and Y. The overlap is the mutual information—the information they share.
Interactive: Entropy Venn Diagram
[Interactive figure: two overlapping circles for H(X) and H(Y), with sliders for H(X), H(Y), and I(X;Y). The overlap is the mutual information I(X;Y), the information shared between X and Y. The non-overlapping parts are the conditional entropies H(X|Y) and H(Y|X), and the union is the joint entropy H(X,Y).]
Formal Definition
Mutual Information between random variables X and Y measures how much knowing one variable reduces uncertainty about the other.
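Mutual Information Definition
For discrete random variables:
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
For continuous random variables:
$$I(X;Y) = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy$$
With base-2 logarithms, MI is measured in bits. Notice what this formula measures: it compares the joint distribution p(x,y) to what it would be if X and Y were independent, p(x)p(y). The log ratio is zero wherever the two agree. Individual terms can be negative, but the average over p(x,y) is never negative, and it is zero only when p(x,y) = p(x)p(y) everywhere, that is, under independence.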
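The discrete definition translates almost directly into code. The sketch below is a minimal illustration that uses only the standard library; the function name `mutual_information` and the four example tables are chosen here for illustration and are not part of the original section.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # terms with p(x,y) = 0 contribute 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Illustrative 2 x 2 joint distributions (assumed values).
independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[0.5, 0.0], [0.0, 0.5]]
anti_correlated = [[0.0, 0.5], [0.5, 0.0]]
noisy = [[0.4, 0.1], [0.1, 0.4]]

for name, joint in [("independent", independent), ("correlated", correlated),
                    ("anti-correlated", anti_correlated), ("noisy", noisy)]:
    print(f"{name:16s} I(X;Y) = {mutual_information(joint):.4f} bits")
```

The correlated and anti-correlated tables both give 1 bit, because MI depends only on how far the joint distribution is from the product of its marginals, not on the direction of the relationship. In the noisy table the off-diagonal cells contribute negative terms, yet the total stays positive.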
Equivalent Formulas
Mutual information can be expressed in many equivalent ways, each providing different insights:
| Formula | Interpretation |
|---|---|
| I(X;Y) = H(X) - H(X|Y) | Reduction in uncertainty about X when Y is known |
| I(X;Y) = H(Y) - H(Y|X) | Reduction in uncertainty about Y when X is known |
| I(X;Y) = H(X) + H(Y) - H(X,Y) | Sum of marginals minus joint (overlap formula) |
| I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)) | Divergence from independence |
Derivation: Why These Are Equivalent
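One way to see the equivalence, sketched here for the discrete case, is to start from the definition (which is exactly the KL divergence between the joint and the product of marginals) and factor the joint as p(x,y) = p(x|y) p(y):

```latex
\begin{align*}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
        = D_{\mathrm{KL}}\big(P(X,Y) \,\|\, P(X)P(Y)\big) \\
       &= \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)}
        && \text{since } p(x,y) = p(x \mid y)\,p(y) \\
       &= -\sum_{x} p(x) \log p(x) + \sum_{x,y} p(x,y) \log p(x \mid y) \\
       &= H(X) - H(X \mid Y)
\end{align*}
```

Exchanging the roles of X and Y gives I(X;Y) = H(Y) - H(Y|X). Substituting H(X|Y) = H(X,Y) - H(Y) into the last line gives the overlap formula I(X;Y) = H(X) + H(Y) - H(X,Y).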
Interactive: Mutual Information Explorer
[Interactive figure: an adjustable joint distribution P(X, Y) with the resulting entropy values in bits. Higher MI indicates stronger dependence between X and Y. Perfectly correlated and perfectly anti-correlated distributions both have high MI.]
Properties of Mutual Information
Mutual information has several important mathematical properties that make it a powerful tool for measuring dependence:
Non-negativity
$$I(X;Y) \geq 0$$
On average, knowing Y can never increase uncertainty about X. MI is zero if and only if X and Y are independent.
Symmetry
$$I(X;Y) = I(Y;X)$$
Information is symmetric: X tells us as much about Y as Y tells us about X.
Upper Bound
$$I(X;Y) \leq \min\big(H(X), H(Y)\big)$$
For discrete variables, MI cannot exceed the total information in either variable. Equality holds when the lower-entropy variable is a deterministic function of the other.
Self-Information
$$I(X;X) = H(X)$$
A variable has maximum MI with itself: knowing X completely determines X.
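These properties can be checked numerically on any discrete joint distribution. The following sketch builds a hypothetical 3 x 4 joint table from random weights (an assumption made only for illustration) and computes MI through the overlap formula H(X) + H(Y) - H(X,Y).

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Return (p(x), p(y)) for a joint table joint[x][y]."""
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    px, py = marginals(joint)
    return entropy(px) + entropy(py) - entropy([p for row in joint for p in row])

# Hypothetical 3 x 4 joint distribution built from random weights.
random.seed(0)
weights = [[random.random() for _ in range(4)] for _ in range(3)]
total = sum(sum(row) for row in weights)
joint = [[w / total for w in row] for row in weights]

px, py = marginals(joint)
mi = mutual_information(joint)
# Transposing the table exchanges the roles of X and Y.
mi_swapped = mutual_information([list(col) for col in zip(*joint)])
# Joint distribution of the pair (X, X): all mass sits on the diagonal.
self_joint = [[px[i] if i == j else 0.0 for j in range(len(px))]
              for i in range(len(px))]

print(f"non-negativity:   I(X;Y) = {mi:.4f}")
print(f"symmetry:         I(Y;X) = {mi_swapped:.4f}")
print(f"upper bound:      min(H(X), H(Y)) = {min(entropy(px), entropy(py)):.4f}")
print(f"self-information: I(X;X) = {mutual_information(self_joint):.4f}, "
      f"H(X) = {entropy(px):.4f}")
```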
Conditional Mutual Information
We can also measure mutual information conditioned on a third variable Z:
Conditional Mutual Information
$$I(X;Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$
How much Y tells us about X, given that we already know Z
Conditional MI is crucial for understanding indirect relationships. For example, if X and Y are conditionally independent given Z, then I(X;Y|Z) = 0 even though I(X;Y) might be positive.
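A common-cause structure makes this concrete. In the sketch below, Z is a fair bit and X and Y are independent noisy copies of Z; the flip probabilities 0.1 and 0.2 are assumed values chosen for illustration. X and Y are dependent on their own, but conditioning on Z removes all of the dependence.

```python
import math
from itertools import product

def flip(a, b, eps):
    """P(a | b) for a bit a that copies b but flips with probability eps."""
    return 1 - eps if a == b else eps

# p(x, y, z) = p(z) p(x|z) p(y|z): X and Y are conditionally independent given Z.
p = {(x, y, z): 0.5 * flip(x, z, 0.1) * flip(y, z, 0.2)
     for x, y, z in product((0, 1), repeat=3)}

def marginal(dist, keep):
    """Sum out every coordinate whose index is not listed in keep."""
    out = {}
    for outcome, prob in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

pxy, pxz, pyz = marginal(p, (0, 1)), marginal(p, (0, 2)), marginal(p, (1, 2))
px, py, pz = marginal(p, (0,)), marginal(p, (1,)), marginal(p, (2,))

# I(X;Y) = sum p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
mi = sum(q * math.log2(q / (px[(x,)] * py[(y,)]))
         for (x, y), q in pxy.items() if q > 0)

# I(X;Y|Z) = sum p(x,y,z) log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
cmi = sum(q * math.log2(pz[(z,)] * q / (pxz[(x, z)] * pyz[(y, z)]))
          for (x, y, z), q in p.items() if q > 0)

print(f"I(X;Y)   = {mi:.4f} bits")
print(f"I(X;Y|Z) = {cmi:.4f} bits")
```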
Chain Rule for Mutual Information
Just like entropy, mutual information obeys a chain rule:
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$$
This decomposes the total MI into contributions from each variable, accounting for what previous variables already revealed.
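For two variables, the chain rule for MI follows from the chain rule for entropy, H(X_1, X_2) = H(X_1) + H(X_2 | X_1), applied once unconditionally and once conditioned on Y. A short derivation sketch:

```latex
\begin{align*}
I(X_1, X_2; Y) &= H(X_1, X_2) - H(X_1, X_2 \mid Y) \\
  &= \big[H(X_1) + H(X_2 \mid X_1)\big] - \big[H(X_1 \mid Y) + H(X_2 \mid X_1, Y)\big] \\
  &= \big[H(X_1) - H(X_1 \mid Y)\big] + \big[H(X_2 \mid X_1) - H(X_2 \mid X_1, Y)\big] \\
  &= I(X_1; Y) + I(X_2; Y \mid X_1)
\end{align*}
```

The second term counts only what X_2 adds about Y beyond what X_1 already revealed.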
The Data Processing Inequality
One of the most profound results in information theory is the Data Processing Inequality (DPI). It states that processing data can only lose information, never create it.
Data Processing Inequality
If X → Y → Z forms a Markov chain, then:
$$I(X;Z) \leq I(X;Y)$$
No processing of Y can recover information about X that wasn't already in Y
Why is this profound? It means there's no magic processing step that can improve the quality of information beyond what's in the data. If your input features don't contain relevant signal, no amount of deep learning can conjure it.
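A small numerical illustration, under assumed parameters: a fair bit X is sent through a binary symmetric channel that flips it with probability 0.1 to give Y, and Y is then sent through a second such channel with flip probability 0.2 to give Z. For a uniform input, the MI across a binary symmetric channel is 1 minus the binary entropy of its flip probability, so both quantities have closed forms.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(eps):
    """I(input; output) in bits for a uniform bit sent through a
    binary symmetric channel with flip probability eps."""
    return 1.0 - h2(eps)

def cascade(eps1, eps2):
    """Effective flip probability of two binary symmetric channels in series:
    the bit ends up flipped when exactly one of the two stages flips it."""
    return eps1 * (1 - eps2) + (1 - eps1) * eps2

eps_xy, eps_yz = 0.1, 0.2          # assumed noise levels for X -> Y and Y -> Z
i_xy = bsc_mi(eps_xy)
i_xz = bsc_mi(cascade(eps_xy, eps_yz))

print(f"I(X;Y) = {i_xy:.4f} bits")
print(f"I(X;Z) = {i_xz:.4f} bits")
```

The second stage can only add noise, so I(X;Z) comes out smaller than I(X;Y), as the inequality requires.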
Interactive: Data Processing Inequality
[Interactive figure: information flowing through a processing chain, showing that processing data can only lose information, never create it, and that lost information cannot be recovered.]
Why DPI Cannot Be Violated
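A standard proof sketch expands I(X; Y, Z) with the chain rule in two different orders and then uses the Markov property, which says that X and Z are conditionally independent given Y:

```latex
\begin{align*}
I(X; Y, Z) &= I(X; Z) + I(X; Y \mid Z) \\
I(X; Y, Z) &= I(X; Y) + I(X; Z \mid Y) \\
I(X; Z \mid Y) &= 0 && \text{(Markov chain: } X \perp Z \mid Y\text{)} \\
\Rightarrow\; I(X; Y) &= I(X; Z) + I(X; Y \mid Z) \;\geq\; I(X; Z)
\end{align*}
```

The final inequality holds because conditional mutual information is non-negative.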
Key Insight for Deep Learning
In a neural network, each layer processes information from the previous layer. The Data Processing Inequality tells us that no layer can increase the mutual information with the target beyond what was present in earlier layers. This is why the quality of input features fundamentally limits model performance: you cannot recover information that was lost upstream.
Mutual Information vs Correlation
Why use mutual information when we have correlation? The answer is that MI captures all dependencies, while correlation only measures linear relationships.
Pearson Correlation
- Only measures linear relationships
- ρ = 0 doesn't mean independence
- Fast to compute: O(n)
- Easy to interpret: -1 to +1 scale
Mutual Information
- Captures any statistical dependency
- I(X;Y) = 0 ↔ true independence
- Harder to estimate: requires density estimation
- Scale depends on entropy (bits)
Example: When Correlation Fails
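Consider Y = X² where X ~ Uniform(-1, 1). Despite the strong dependence (Y is a deterministic function of X), the Pearson correlation is exactly zero, because Cov(X, X²) = E[X³] - E[X]E[X²] = 0 by symmetry. The mutual information, in contrast, is strictly positive.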
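The Y = X² example can be simulated with the standard library alone. This is a rough sketch: the sample size, the seed, the helper names `pearson` and `binned_mi`, and the choice of 10 equal-width bins are all assumptions made for illustration, and the binned MI value depends on that bin count.

```python
import math
import random
from collections import Counter

random.seed(0)
n = 20000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [x * x for x in xs]                     # Y is fully determined by X

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def binned_mi(a, b, bins=10):
    """Plug-in MI estimate in bits after equal-width binning of both variables."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]
    ia, ib = to_bins(a), to_bins(b)
    count = len(a)
    joint, ca, cb = Counter(zip(ia, ib)), Counter(ia), Counter(ib)
    # (c/n) * log2[ (c/n) / ((ca/n) * (cb/n)) ] summed over occupied cells
    return sum(c / count * math.log2(c * count / (ca[i] * cb[j]))
               for (i, j), c in joint.items())

r = pearson(xs, ys)
mi = binned_mi(xs, ys)
print(f"Pearson correlation: {r:+.4f}")
print(f"Binned MI estimate:  {mi:.4f} bits")
```

The correlation lands near zero while the binned MI is well above zero, which is the gap between linear association and general dependence that the example is meant to show.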
Estimating Mutual Information
In practice, we rarely know the true distributions—we must estimate MI from data. This is challenging, especially for continuous variables.
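One reason estimation is hard is that the simple plug-in (histogram) estimator is biased upward when there are few samples per bin. The sketch below, with an assumed sample size of 1000 and hypothetical helper name `histogram_mi`, draws two independent uniform variables, for which the true MI is exactly 0, and shows the estimate growing as the bins get finer.

```python
import math
import random
from collections import Counter

def histogram_mi(xs, ys, bins):
    """Plug-in MI estimate in bits for samples in [0, 1), using equal-width bins."""
    n = len(xs)
    ix = [min(int(x * bins), bins - 1) for x in xs]
    iy = [min(int(y * bins), bins - 1) for y in ys]
    joint, cx, cy = Counter(zip(ix, iy)), Counter(ix), Counter(iy)
    return sum(c / n * math.log2(c * n / (cx[i] * cy[j]))
               for (i, j), c in joint.items())

random.seed(0)
n = 1000
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]     # independent of xs: true MI is 0

estimates = {bins: histogram_mi(xs, ys, bins) for bins in (4, 8, 16, 32)}
for bins, est in estimates.items():
    print(f"{bins:2d} bins: estimated MI = {est:.3f} bits (true value 0)")
```

With 32 x 32 cells and only 1000 samples, most cells hold zero or one point, so the empirical joint looks far more structured than it is. This is the motivation for bias corrections and for nearest-neighbor estimators such as KSG on continuous data.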
AI/ML Connections
Mutual information has become central to modern machine learning, from classical feature selection to cutting-edge representation learning.
📊 Feature Selection
The mRMR (minimum Redundancy Maximum Relevance) algorithm selects features with high MI with the target while minimizing MI between selected features.
🔄 Information Bottleneck
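In the Information Bottleneck view, a network learns a representation Z that compresses the input (minimize I(X;Z)) while preserving task-relevant information (maximize I(Z;Y)):
$$\min_{p(z \mid x)} \; I(X;Z) - \beta\, I(Z;Y)$$
Here β > 0 sets the trade-off between compression and prediction. This view has been proposed as an explanation for the "fitting" and "forgetting" (compression) phases observed during training.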
🎭 InfoGAN
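InfoGAN learns disentangled representations by maximizing MI between latent codes c and generated outputs G(z,c):
$$\min_{G} \max_{D} \; V(D,G) - \lambda\, I\big(c;\, G(z,c)\big)$$
Here V(D,G) is the standard GAN objective and λ weights the MI term. This encourages the latent factors to control meaningful variations.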
🔗 Contrastive Learning
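Minimizing the InfoNCE loss used in SimCLR, CLIP, and others maximizes a lower bound on MI:
$$I(X;Y) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$
Here N is the number of candidates scored for each positive pair (one positive plus N - 1 negatives). Contrastive learning maximizes agreement between augmented views.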
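As a rough sketch of how the loss and the bound relate, the code below computes InfoNCE for a single batch from a matrix of similarity scores, where row i holds the scores of sample i against every candidate and the diagonal entry is the positive pair. The function name `info_nce` and the score values are hypothetical; in practice the scores would come from a learned encoder, and the bound holds in expectation rather than for any single batch.

```python
import math

def info_nce(scores):
    """InfoNCE loss in nats for an N x N score matrix with positives on the diagonal.

    For each row, this is the cross-entropy of picking the positive
    among the N candidates under a softmax over the scores.
    """
    n = len(scores)
    loss = 0.0
    for i, row in enumerate(scores):
        m = max(row)                                   # stabilize the log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        loss -= row[i] - log_norm                      # -log softmax(row)[i]
    return loss / n

# Hypothetical similarity scores for a batch of N = 4 pairs.
scores = [
    [4.0, 0.1, -0.3, 0.2],
    [0.0, 3.5, 0.4, -0.1],
    [-0.2, 0.3, 4.2, 0.1],
    [0.1, -0.4, 0.2, 3.8],
]
uninformative = [[0.0] * 4 for _ in range(4)]          # scores carry no signal

loss = info_nce(scores)
bound = math.log(len(scores)) - loss                   # log N - L, in nats
print(f"InfoNCE loss:          {loss:.4f} nats")
print(f"Batch bound estimate:  {bound:.4f} nats (cannot exceed log N = {math.log(4):.4f})")
print(f"Uninformative scores:  loss = {info_nce(uninformative):.4f} nats")
```

Because the loss is never negative, the estimate log N - L can never exceed log N, which is why large batches are needed to certify large amounts of mutual information.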
Interactive: Feature Selection with MI
[Interactive figure: feature selection under two criteria. Pure relevance picks the features with the highest MI with the target. mRMR balances relevance against redundancy (overlap with other features), so it selects informative yet non-redundant features.]
Python Implementation
Let's implement mutual information calculation in Python, covering both discrete and continuous cases.
Here's a complete example demonstrating MI-based feature selection:
```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Create dataset with informative and noisy features
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_redundant=5,
    n_classes=2,
    random_state=42,
)

# Compute MI scores for each feature (scikit-learn reports MI in nats, not bits)
mi_scores = mutual_info_classif(X, y, random_state=42)

# Print feature rankings
print("Feature MI Scores (sorted):")
for idx in np.argsort(mi_scores)[::-1]:
    print(f"  Feature {idx:2d}: {mi_scores[idx]:.4f} nats")

# Select top 5 features; fix the estimator's seed so the selection is reproducible
selector = SelectKBest(partial(mutual_info_classif, random_state=42), k=5)
X_selected = selector.fit_transform(X, y)
print(f"\nSelected features: {selector.get_support(indices=True)}")

# Compare with correlation-based selection
correlations = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
print("\nTop-5 features by MI vs. absolute correlation:")
print(f"  MI top 5:   {np.argsort(mi_scores)[::-1][:5]}")
print(f"  Corr top 5: {np.argsort(correlations)[::-1][:5]}")
```
Knowledge Check
Test your understanding of mutual information with this interactive quiz.
What does mutual information I(X;Y) = 0 imply about random variables X and Y?
Summary
Key Takeaways
- MI measures shared information: I(X;Y) quantifies how much knowing one variable tells us about the other.
- Equivalent formulas: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
- MI detects all dependencies: Unlike correlation, MI captures nonlinear and complex relationships.
- Data Processing Inequality: Information can only be lost through processing: X → Y → Z implies I(X;Z) ≤ I(X;Y).
- Estimation is challenging: Use plug-in counts for discrete data, histogram binning or the KSG estimator for continuous data, and neural estimators for high dimensions.
- Central to modern ML: MI drives feature selection (mRMR), representation learning (Information Bottleneck), GANs (InfoGAN), and contrastive learning (InfoNCE).
Looking Ahead: In the next section, we'll see how information theory concepts directly connect to machine learning loss functions—exploring why cross-entropy loss works and its relationship to KL divergence.