Chapter 20
25 min read
Section 129 of 175

Mutual Information

Information Theoretic Foundations

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • • Define mutual information as shared information between variables
  • • Express MI using entropy: I(X;Y) = H(X) - H(X|Y)
  • • Understand the relationship to KL divergence
  • • Prove and apply the Data Processing Inequality

🔧 Practical Skills

  • • Compute MI from joint distributions in Python
  • • Estimate MI from samples using various methods
  • • Apply MI for feature selection (mRMR)
  • • Compare MI to correlation for dependency detection

🧠 Deep Learning Connections

  • Information Bottleneck - Representations compress input while preserving task-relevant information
  • InfoGAN - Maximizes MI between latent codes and generated outputs for disentangled representations
  • Contrastive Learning - InfoNCE loss is a lower bound on mutual information
  • Feature Selection - MI identifies the most informative features for prediction
Where You'll Apply This: Feature selection in ML pipelines, representation learning, contrastive learning (SimCLR, CLIP), variational autoencoders, information bottleneck analysis, and understanding what neural networks learn.

The Big Picture

Entropy measures uncertainty in a single random variable. But what about the relationship between two variables? How much does knowing one tell us about the other? Mutual Information answers this question—it quantifies the shared information between two random variables.

The Core Insight

Mutual Information is the reduction in uncertainty about X when we learn Y (or equivalently, about Y when we learn X). It measures how much information the two variables share.

🔗

Dependent variables: High MI

🎲

Independent variables: MI = 0

🎯

Deterministic relation: MI = min(H(X), H(Y))

Historical Context

Mutual information emerged from Shannon's foundational work on information theory, but its applications have expanded far beyond communication systems.

📡
Communication Theory (1948)

Shannon introduced MI to characterize channel capacity—the maximum rate at which information can be reliably transmitted. The capacity of a noisy channel is the maximum MI between input and output: C = max I(X;Y).

🧬
Modern Machine Learning

Today, MI is used for feature selection, understanding neural network representations (Information Bottleneck), training generative models (InfoGAN), and contrastive learning objectives. It captures dependencies that correlation misses.


Building Intuition: The Venn Diagram View

One of the most helpful ways to understand mutual information is through an entropy Venn diagram. Imagine two overlapping circles representing the entropy of X and Y. The overlap is the mutual information—the information they share.

Entropy Relationships

H(X) = Total uncertainty in X
H(Y) = Total uncertainty in Y
I(X;Y) = Shared information (overlap)
H(X|Y) = Uncertainty in X after knowing Y
H(Y|X) = Uncertainty in Y after knowing X
H(X,Y) = Total joint uncertainty

Interactive: Entropy Venn Diagram

Explore how the entropy relationships work. Adjust H(X), H(Y), and I(X;Y) to see how the conditional entropies and joint entropy change.

Entropy Venn Diagram

H(X|Y)1.80I(X;Y)1.20H(Y|X)1.30XYH(X,Y) = 4.30 bits
H(X)3.00 bits
H(Y)2.50 bits
I(X;Y)1.20 bits(max: 2.50)
Key Relationships
H(X|Y) =
1.800
H(Y|X) =
1.300
H(X,Y) =
4.300
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

The overlap represents mutual information - the shared information between X and Y. Adjust the sliders to see how changing entropies affects the relationships.

Key Observation: As mutual information increases, the circles overlap more. When I(X;Y) = min(H(X), H(Y)), one variable completely determines the other. When I(X;Y) = 0, the variables are independent with no overlap.

Formal Definition

Mutual Information between random variables X and Y measures how much knowing one variable reduces uncertainty about the other.

Mutual Information Definition

I(X;Y)=x,yp(x,y)logp(x,y)p(x)p(y)I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

For discrete random variables

I(X;Y)=p(x,y)logp(x,y)p(x)p(y)dxdyI(X; Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy

For continuous random variables

Notice what this formula measures: it compares the joint distribution p(x,y) to what it would be if X and Y were independent: p(x)p(y). The log ratio is zero when they're equal (independence) and positive otherwise.

MI as KL Divergence: Mutual information is the KL divergence between the joint distribution and the product of marginals:
I(X;Y)=DKL(PX,YPXPY)I(X;Y) = D_{KL}(P_{X,Y} \| P_X \otimes P_Y)
This measures how much the joint distribution "diverges" from independence.

Equivalent Formulas

Mutual information can be expressed in many equivalent ways, each providing different insights:

FormulaInterpretation
I(X;Y) = H(X) - H(X|Y)Reduction in uncertainty about X when Y is known
I(X;Y) = H(Y) - H(Y|X)Reduction in uncertainty about Y when X is known
I(X;Y) = H(X) + H(Y) - H(X,Y)Sum of marginals minus joint (overlap formula)
I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))Divergence from independence

Derivation: Why These Are Equivalent

I(X;Y) = H(X) - H(X|Y)
= H(X) - [H(X,Y) - H(Y)]   
= H(X) + H(Y) - H(X,Y) ✓

Interactive: Mutual Information Explorer

Explore how mutual information changes with the joint distribution. Adjust the probabilities and see how MI responds. Notice that perfectly correlated and perfectly anti-correlated distributions both have high MI!

Mutual Information Explorer

Joint Distribution P(X, Y)
YXY=0Y=1X=0X=10.4000.1000.1000.4000.5000.5000.5000.500
P(X=0,Y=0)0.400
P(X=0,Y=1)0.100
P(X=1,Y=0)0.100
P(X=1,Y=1)0.400
Mutual Information
0.2781
bits
Normalized MI:27.8%
Entropy Values (bits)
H(X)
1.0000
H(Y)
1.0000
H(X,Y)
1.7219
H(X|Y)0.7219
H(Y|X)0.7219
Formula Verification:
I(X;Y) = H(X) - H(X|Y)
= 1.0000 - 0.7219
= 0.2781

Adjust the joint probabilities to see how mutual information changes. Higher MI indicates stronger dependence between X and Y.


Properties of Mutual Information

Mutual information has several important mathematical properties that make it a powerful tool for measuring dependence:

Non-negativity

I(X;Y)0I(X;Y) \geq 0

Knowing Y can never increase uncertainty about X. MI is zero if and only if X and Y are independent.

Symmetry

I(X;Y)=I(Y;X)I(X;Y) = I(Y;X)

Information is symmetric: X tells us as much about Y as Y tells us about X.

Upper Bound

I(X;Y)min(H(X),H(Y))I(X;Y) \leq \min(H(X), H(Y))

MI cannot exceed the total information in either variable. Equality holds when one variable determines the other.

Self-Information

I(X;X)=H(X)I(X;X) = H(X)

A variable has maximum MI with itself—knowing X completely determines X.

Conditional Mutual Information

We can also measure mutual information conditioned on a third variable Z:

Conditional Mutual Information

I(X;YZ)=H(XZ)H(XY,Z)I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

How much Y tells us about X, given that we already know Z

Conditional MI is crucial for understanding indirect relationships. For example, if X and Y are conditionally independent given Z, then I(X;Y|Z) = 0 even though I(X;Y) might be positive.

Chain Rule for Mutual Information

Just like entropy, mutual information obeys a chain rule:

I(X1,X2,,Xn;Y)=i=1nI(Xi;YX1,,Xi1)I(X_1, X_2, \ldots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y | X_1, \ldots, X_{i-1})

This decomposes the total MI into contributions from each variable, accounting for what previous variables already revealed.


The Data Processing Inequality

One of the most profound results in information theory is the Data Processing Inequality (DPI). It states that processing data can only lose information, never create it.

Data Processing Inequality

If X → Y → Z forms a Markov chain, then:

I(X;Z)I(X;Y)I(X;Z) \leq I(X;Y)

No processing of Y can recover information about X that wasn't already in Y

Why is this profound? It means there's no magic processing step that can improve the quality of information beyond what's in the data. If your input features don't contain relevant signal, no amount of deep learning can conjure it.

Interactive: Data Processing Inequality

Explore the DPI interactively. See how information flows through a processing chain and why you cannot recover lost information.

Data Processing Inequality

X → Y → Z implies I(X; Z) ≤ I(X; Y)

Processing data can only lose information, never create it

XSourceYIntermediateZProcessedchannelprocessingI(X; Y) = 1.50 bitsI(X; Z) = 1.05 bitsLost: 0.45 bits (30%)
I(X; Y)1.50 bits
Channel Noise30%
Why DPI Cannot Be Violated
Try to "boost" I(X;Z)0.00
Try to increase MI beyond the bound to see why it's impossible.
Key Insight for Deep Learning

In a neural network, each layer processes information from the previous layer. The Data Processing Inequality tells us that no layer can increase the mutual information with the target beyond what was present in earlier layers. This is why the quality of input features fundamentally limits model performance— you cannot recover information that was lost upstream.

Implication for ML: The DPI explains why data quality and feature engineering matter so much. A neural network is just a processing chain—it cannot recover information that was lost in preprocessing or never present in the raw data.

Mutual Information vs Correlation

Why use mutual information when we have correlation? The answer is that MI captures all dependencies, while correlation only measures linear relationships.

Pearson Correlation

  • • Only measures linear relationships
  • • ρ = 0 doesn't mean independence
  • • Fast to compute: O(n)
  • • Easy to interpret: -1 to +1 scale

Mutual Information

  • • Captures any statistical dependency
  • • I(X;Y) = 0 ↔ true independence
  • • Harder to estimate: requires density estimation
  • • Scale depends on entropy (bits)

Example: When Correlation Fails

Consider Y = X² where X ~ Uniform(-1, 1). Despite strong dependence:

ρ = 0
Correlation sees nothing!
I(X;Y) > 0
MI detects the dependence
Common Mistake: Don't assume zero correlation means independence! The classic parabola example Y = X² has zero correlation but strong dependence. MI is the proper test for independence.

Estimating Mutual Information

In practice, we rarely know the true distributions—we must estimate MI from data. This is challenging, especially for continuous variables.


AI/ML Connections

Mutual information has become central to modern machine learning, from classical feature selection to cutting-edge representation learning.

📊 Feature Selection

The mRMR (minimum Redundancy Maximum Relevance) algorithm selects features with high MI with the target while minimizing MI between selected features.max1SiSI(xi;y)1S2i,jSI(xi;xj)\max \frac{1}{|S|} \sum_{i \in S} I(x_i; y) - \frac{1}{|S|^2} \sum_{i,j \in S} I(x_i; x_j)

🔄 Information Bottleneck

Neural networks find representations that compress input (minimize I(X;Z)) while preserving task-relevant info (maximize I(Z;Y)):minI(X;Z)βI(Z;Y)\min I(X;Z) - \beta \cdot I(Z;Y)This explains the "forgetting" and "fitting" phases during training.

🎭 InfoGAN

InfoGAN learns disentangled representations by maximizing MI between latent codes c and generated outputs G(z,c):maxI(c;G(z,c))\max I(c; G(z,c))This ensures latent factors control meaningful variations.

🔗 Contrastive Learning

The InfoNCE loss used in SimCLR, CLIP, and others is a lower bound on MI:LInfoNCEI(X;Y)\mathcal{L}_{\text{InfoNCE}} \leq I(X; Y)Contrastive learning maximizes agreement between augmented views.

Interactive: Feature Selection with MI

Try selecting features using MI-based criteria. Compare pure relevance (highest MI with target) versus mRMR (balancing relevance and redundancy).

Feature Selection with Mutual Information

Feature Candidates
credit_score
I(X;Y)
0.92
Redund.
0.25
age
I(X;Y)
0.85
Redund.
0.10
spending_total
I(X;Y)
0.78
Redund.
0.70
spending_avg
I(X;Y)
0.75
Redund.
0.75
income
I(X;Y)
0.72
Redund.
0.35
education
I(X;Y)
0.65
Redund.
0.40
occupation
I(X;Y)
0.58
Redund.
0.55
location
I(X;Y)
0.45
Redund.
0.15
Selection Criterion
Features to select:
I(X;Y): Mutual information with target (relevance)
Redundancy: Overlap with other features
mRMR: Selects informative yet non-redundant features

MI-based feature selection identifies the most informative features while avoiding redundancy.


Python Implementation

Let's implement mutual information calculation in Python, covering both discrete and continuous cases.

Mutual Information Implementation
🐍mutual_information.py
1Import Libraries

NumPy for numerical operations, sklearn for MI estimation, scipy for statistical calculations.

7Discrete MI Function

For discrete variables, we compute MI directly from the joint and marginal probability distributions.

19Joint Distribution

Using np.histogram2d to estimate the joint distribution P(X,Y) from sample data. The bins parameter controls the resolution.

EXAMPLE
More bins = finer estimation, but needs more samples
23Marginal Distributions

Sum the joint distribution along rows and columns to get P(X) and P(Y). These marginals tell us the distribution of each variable independently.

27MI Calculation

The core formula: sum over all (x,y) pairs of p(x,y) * log(p(x,y) / (p(x)*p(y))). We skip zero-probability cells to avoid log(0).

EXAMPLE
If p(x,y) = p(x)*p(y), the log term is 0 (independence)
39KSG Estimator

For continuous variables, the Kraskov-Stögbauer-Grassberger estimator uses k-nearest neighbors, avoiding explicit density estimation.

50Sklearn Implementation

sklearn provides mutual_info_classif and mutual_info_regression for feature selection, handling continuous variables with the KNN approach.

108 lines without explanation
1import numpy as np
2from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
3from scipy.special import digamma
4from scipy.spatial import KDTree
5
6# ==================================
7# Discrete Mutual Information
8# ==================================
9def discrete_mi(x, y, bins=10):
10    """
11    Estimate MI for discrete/discretized variables.
12
13    Parameters:
14        x, y: Arrays of samples
15        bins: Number of histogram bins
16
17    Returns:
18        Mutual information in bits
19    """
20    # Compute joint histogram (2D)
21    joint_hist, _, _ = np.histogram2d(x, y, bins=bins)
22    joint_prob = joint_hist / joint_hist.sum()
23
24    # Marginal distributions (sum along axes)
25    px = joint_prob.sum(axis=1)
26    py = joint_prob.sum(axis=0)
27
28    # Compute MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
29    mi = 0.0
30    for i in range(bins):
31        for j in range(bins):
32            if joint_prob[i,j] > 0 and px[i] > 0 and py[j] > 0:
33                mi += joint_prob[i,j] * np.log2(
34                    joint_prob[i,j] / (px[i] * py[j])
35                )
36    return mi
37
38# ==================================
39# KNN-based MI (KSG Estimator)
40# ==================================
41def ksg_mi(x, y, k=3):
42    """
43    KSG estimator for continuous MI using k-NN.
44
45    Based on Kraskov et al., 2004.
46    """
47    n = len(x)
48    x = np.array(x).reshape(-1, 1)
49    y = np.array(y).reshape(-1, 1)
50    xy = np.hstack([x, y])
51
52    # Build KD trees
53    tree_xy = KDTree(xy)
54    tree_x = KDTree(x)
55    tree_y = KDTree(y)
56
57    # Find k-th nearest neighbor distances
58    mi_sum = 0.0
59    for i in range(n):
60        # Distance to k-th neighbor in joint space
61        dists, _ = tree_xy.query(xy[i], k=k+1)
62        eps = dists[-1]
63
64        # Count neighbors within eps in marginal spaces
65        nx = len(tree_x.query_ball_point(x[i], eps)) - 1
66        ny = len(tree_y.query_ball_point(y[i], eps)) - 1
67
68        mi_sum += digamma(nx + 1) + digamma(ny + 1)
69
70    mi = digamma(k) + digamma(n) - mi_sum / n
71    return max(0, mi / np.log(2))  # Convert to bits
72
73# ==================================
74# Using sklearn for Feature Selection
75# ==================================
76def mi_feature_selection(X, y, n_features=5, task='classification'):
77    """
78    Select top features by mutual information.
79
80    Parameters:
81        X: Feature matrix (n_samples, n_features)
82        y: Target variable
83        n_features: Number of features to select
84        task: 'classification' or 'regression'
85    """
86    if task == 'classification':
87        mi_scores = mutual_info_classif(X, y, random_state=42)
88    else:
89        mi_scores = mutual_info_regression(X, y, random_state=42)
90
91    # Get indices of top features
92    top_indices = np.argsort(mi_scores)[::-1][:n_features]
93
94    return top_indices, mi_scores[top_indices]
95
96# ==================================
97# Example Usage
98# ==================================
99if __name__ == "__main__":
100    np.random.seed(42)
101
102    # Generate correlated data
103    n = 1000
104    x = np.random.randn(n)
105    y = 0.8 * x + 0.6 * np.random.randn(n)  # Linear dependence
106
107    print("=== MI Estimation ===")
108    print(f"Discrete (binned): {discrete_mi(x, y, bins=20):.4f} bits")
109    print(f"KSG estimator:     {ksg_mi(x, y, k=5):.4f} bits")
110
111    # Nonlinear relationship
112    z = x**2 + 0.1 * np.random.randn(n)
113    print(f"\nNonlinear (Y=X²):")
114    print(f"  Correlation: {np.corrcoef(x, z)[0,1]:.4f}")
115    print(f"  MI:          {ksg_mi(x, z):.4f} bits")

Here's a complete example demonstrating MI-based feature selection:

🐍python
1from sklearn.datasets import make_classification
2from sklearn.feature_selection import mutual_info_classif, SelectKBest
3import numpy as np
4
5# Create dataset with informative and noisy features
6X, y = make_classification(
7    n_samples=1000,
8    n_features=20,
9    n_informative=5,
10    n_redundant=5,
11    n_classes=2,
12    random_state=42
13)
14
15# Compute MI scores for each feature
16mi_scores = mutual_info_classif(X, y, random_state=42)
17
18# Print feature rankings
19print("Feature MI Scores (sorted):")
20for idx in np.argsort(mi_scores)[::-1]:
21    print(f"  Feature {idx:2d}: {mi_scores[idx]:.4f} bits")
22
23# Select top 5 features
24selector = SelectKBest(mutual_info_classif, k=5)
25X_selected = selector.fit_transform(X, y)
26print(f"\nSelected features: {selector.get_support(indices=True)}")
27
28# Compare with correlation-based selection
29correlations = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
30print(f"\nCorrelation ranking differs from MI:")
31print(f"  MI top 5:   {np.argsort(mi_scores)[::-1][:5]}")
32print(f"  Corr top 5: {np.argsort(correlations)[::-1][:5]}")

Knowledge Check

Test your understanding of mutual information with this interactive quiz.

Knowledge Check

Question 1 of 8

What does mutual information I(X;Y) = 0 imply about random variables X and Y?

Score: 0 / 8

Summary

Key Takeaways

  1. MI measures shared information: I(X;Y) quantifies how much knowing one variable tells us about the other.
  2. Equivalent formulas: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
  3. MI detects all dependencies: Unlike correlation, MI captures nonlinear and complex relationships.
  4. Data Processing Inequality: Information can only be lost through processing: X → Y → Z implies I(X;Z) ≤ I(X;Y).
  5. Estimation is challenging: Use histogram binning for discrete data, KSG estimator for continuous, or neural estimators for high dimensions.
  6. Central to modern ML: MI drives feature selection (mRMR), representation learning (Information Bottleneck), GANs (InfoGAN), and contrastive learning (InfoNCE).
Looking Ahead: In the next section, we'll see how information theory concepts directly connect to machine learning loss functions—exploring why cross-entropy loss works and its relationship to KL divergence.
Loading comments...