Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Draw the complete hierarchy of convergence modes and explain which implications hold
Prove the key implications: a.s. ⇒ P, L² ⇒ P, P ⇒ d, and a.s. ⇒ L² (under conditions)
Construct and explain counterexamples showing which implications do NOT hold
Identify the special case where convergence in distribution implies convergence in probability
Apply the correct convergence mode to analyze deep learning training dynamics
Choose the appropriate convergence concept for different statistical and ML applications

Why This Matters: Understanding the relationships between convergence modes is essential for correctly interpreting theoretical guarantees. When a paper claims “the estimator converges,” knowing which type of convergence determines what you can actually conclude about performance, reliability, and generalization.

The Story: A Unified Theory of Convergence

In the early 20th century, mathematicians developing probability theory faced a fundamental question: what does it mean for a sequence of random quantities to “converge”? Unlike deterministic sequences where convergence has a single definition, random sequences can approach a limit in multiple distinct ways.

This led to the development of four major modes of convergence, each capturing a different aspect of “getting close.” But immediately, researchers asked: How do these concepts relate to each other? If we know one type of convergence holds, what else can we conclude?

The answers form a beautiful hierarchy—a map of implications, conditionals, and striking counterexamples. This hierarchy isn't just mathematical elegance; it's essential for:

Theoretical guarantees: Knowing what your convergence theorem actually promises
Algorithm analysis: Understanding when optimization “succeeds”
Practical diagnostics: Interpreting training curves and convergence metrics
Research claims: Evaluating what “convergence” really means in a paper

The Central Insight: Each convergence mode captures a different question: Does almost every realization converge? (a.s.) Does the average squared error vanish? (L²) Does the probability of being far away vanish? (P) Do the distributions become indistinguishable? (d)

The Convergence Hierarchy

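The four modes of convergence form a hierarchy rather than a single chain. Almost sure and L² convergence are the two strong modes, and neither implies the other in general; each implies convergence in probability, which in turn implies convergence in distribution, the weakest mode. Here is the complete picture: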

The Convergence Hierarchy (Strongest to Weakest)

Almost Sure (a.s.): $P(\lim_{n\to\infty} X_n = X) = 1$
Mean Square (L²): $E[(X_n - X)^2] \to 0$ (follows from a.s. convergence only if the sequence is dominated)
In Probability (P): for every $\varepsilon > 0$, $P(|X_n - X| > \varepsilon) \to 0$
In Distribution (d): $F_n(x) \to F(x)$ at every continuity point $x$ of $F$

The Implications: What Implies What

Let's rigorously establish each implication in the hierarchy, providing both the proof and the intuition behind why it holds.

Almost Sure ⇒ In Probability

Theorem: a.s. ⇒ P (Always Holds)

If $X_n \xrightarrow{a.s.} X$, then $X_n \xrightarrow{P} X$.

Proof:

Let $\varepsilon > 0$. Define the event $A_n = \{|X_n - X| > \varepsilon\}$.

Almost sure convergence means $P(\lim_{n\to\infty} X_n = X) = 1$.

This implies that for almost all ω, there exists N(ω) such that for all n ≥ N(ω), |X_n(ω) - X(ω)| < ε. In other words, almost every ω belongs to only finitely many of the events A_n.

Let $B_n = \bigcup_{k \geq n} A_k$. The events B_n decrease to $\bigcap_n B_n$, the set of ω lying in infinitely many A_k, which has probability zero. By continuity of probability along decreasing events, $P(B_n) \to 0$. Since $A_n \subseteq B_n$:

$$P(A_n) = P(|X_n - X| > \varepsilon) \leq P(B_n) \to 0 \text{ as } n \to \infty$$

This is exactly the definition of convergence in probability. ∎

Intuition: If the sequence converges for almost every sample path, then the probability of being far from the limit must eventually become tiny. The “bad” paths where |X_n - X| > ε form a set whose measure shrinks to zero.
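
The set inclusion behind this proof can be checked on simulated paths. The sketch below is an illustration only: it assumes i.i.d. N(0, 1) draws and their running means (a setup chosen here, not taken from the text), and the helper name `tail_vs_pointwise` is hypothetical. On a finite horizon it compares the frequency of the event {|X̄_n| > ε} with the frequency of the tail event "the deviation exceeds ε at some index k ≥ n".

```python
import numpy as np

def tail_vs_pointwise(n_paths=2000, horizon=500, eps=0.2, seed=0):
    """Compare P(A_n) with P(union of A_k for k >= n) on simulated running means.

    A_k = {|mean of the first k draws| > eps}. The horizon is finite, so the
    tail union only runs up to `horizon` (an approximation of the true tail).
    """
    rng = np.random.default_rng(seed)
    draws = rng.standard_normal((n_paths, horizon))
    running_mean = np.cumsum(draws, axis=1) / np.arange(1, horizon + 1)
    exceed = np.abs(running_mean) > eps   # exceed[i, k-1]: path i lies in A_k
    # Reverse cumulative OR: tail[i, k-1] is True if path i lies in A_j for some j >= k.
    tail = np.logical_or.accumulate(exceed[:, ::-1], axis=1)[:, ::-1]
    return exceed.mean(axis=0), tail.mean(axis=0)

p_pointwise, p_tail = tail_vs_pointwise()
for n in (10, 50, 200, 500):
    print(f"n={n:4d}  P(A_n)~{p_pointwise[n - 1]:.4f}  P(tail from n)~{p_tail[n - 1]:.4f}")
```

The pointwise frequency can never exceed the tail frequency, and the tail frequency is non-increasing in n; almost sure convergence is the statement that the tail probability goes to zero.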

L² ⇒ In Probability

Theorem: L² ⇒ P (via Chebyshev/Markov)

If $X_n \xrightarrow{L^2} X$, then $X_n \xrightarrow{P} X$.

Proof (using Chebyshev's Inequality):

By Chebyshev's inequality (Markov's inequality applied to (X_n - X)²), for any ε > 0:

$$P(|X_n - X| > \varepsilon) \leq \frac{E[(X_n - X)^2]}{\varepsilon^2}$$

Since $X_n \xrightarrow{L^2} X$, we have $E[(X_n - X)^2] \to 0$.

Therefore, for any fixed ε > 0:

$$P(|X_n - X| > \varepsilon) \leq \frac{E[(X_n - X)^2]}{\varepsilon^2} \to 0$$

This proves convergence in probability. ∎

Intuition: If the average squared error vanishes, then the probability of a large error must also vanish. Large deviations contribute disproportionately to the mean square, so they must become rare.

In Probability ⇒ In Distribution

Theorem: P ⇒ d (Always Holds)

If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{d} X$.

Proof Sketch:

Let x be a continuity point of F(x) = P(X ≤ x). We need to show F_n(x) → F(x).

For any ε > 0:

$$P(X_n \leq x) \leq P(X \leq x + \varepsilon) + P(|X_n - X| > \varepsilon)$$

Similarly:

$$P(X_n \leq x) \geq P(X \leq x - \varepsilon) - P(|X_n - X| > \varepsilon)$$

Taking n → ∞ and using P(|X_n - X| > ε) → 0:

$$F(x - \varepsilon) \leq \liminf F_n(x) \leq \limsup F_n(x) \leq F(x + \varepsilon)$$

Since x is a continuity point, letting ε → 0 gives F_n(x) → F(x). ∎

Intuition: If X_n is “close” to X with high probability, then the probability of X_n ≤ x should be close to the probability of X ≤ x. The distributions must become similar.
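
The two bounds in the proof sketch are set inclusions, so they also hold exactly for empirical frequencies. The following sketch checks them numerically under an assumed setup that is not from the text: X ~ N(0, 1) and X_n = X + W/√n with W an independent N(0, 1) variable. The function name `sandwich_check` is hypothetical.

```python
import numpy as np

def sandwich_check(n, x=0.5, eps=0.1, n_samples=100_000, seed=0):
    """Empirical versions of the two bounds used in the proof sketch.

    Assumed setup (illustrative): X ~ N(0, 1) and X_n = X + W / sqrt(n)
    with W ~ N(0, 1) independent of X, so X_n -> X in probability.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(n_samples)
    X_n = X + rng.standard_normal(n_samples) / np.sqrt(n)
    far = np.mean(np.abs(X_n - X) > eps)   # estimates P(|X_n - X| > eps)
    lower = np.mean(X <= x - eps) - far    # F(x - eps) - P(|X_n - X| > eps)
    upper = np.mean(X <= x + eps) + far    # F(x + eps) + P(|X_n - X| > eps)
    return lower, np.mean(X_n <= x), upper

for n in (1, 10, 100, 1000, 10000):
    lo, f_n, hi = sandwich_check(n)
    print(f"n={n:6d}  {lo:.4f} <= F_n(x)={f_n:.4f} <= {hi:.4f}")
```

As n grows the deviation probability vanishes and the sandwich narrows to the interval [F(x - ε), F(x + ε)], which is the step before letting ε → 0.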

Almost Sure ⇒ L² (Conditional)

Theorem: a.s. ⇒ L² (Under Dominated Convergence)

If $X_n \xrightarrow{a.s.} X$ and there exists Y with E[Y²] < ∞ such that |X_n| ≤ Y for all n, then $X_n \xrightarrow{L^2} X$.

Proof (via Dominated Convergence Theorem):

We have (X_n - X)² → 0 almost surely.

Since |X_n| ≤ Y for every n and X_n → X almost surely, the limit also satisfies |X| ≤ Y almost surely. By the triangle inequality: (X_n - X)² ≤ (|X_n| + |X|)² ≤ 4Y²

This bound has finite expectation because E[Y²] < ∞, so the Dominated Convergence Theorem gives:

$$E[(X_n - X)^2] \to E[\lim (X_n - X)^2] = 0$$

This proves L² convergence. ∎

Critical Condition: Without the dominating function, this implication FAILS! For example, on [0, 1] with the uniform measure, X_n = n · 1_{(0, 1/n)} converges to 0 for every ω but has E[X_n²] = n → ∞. (The typewriter sequence shows the reverse failure: L² convergence without a.s. convergence.)
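
The role of the dominating function can be seen by comparing two sequences on [0, 1] with the uniform measure. This is a small sketch with hypothetical helper names, using a standard textbook pair chosen here for illustration: a spike of height n on (0, 1/n), which has no square-integrable dominating function, and a spike of height 1 on the same interval, which is dominated by the constant Y = 1. Both converge to 0 at every sample point u > 0, but only the dominated one converges in L².

```python
def x_undominated(n, u):
    """X_n(u) = n on (0, 1/n), else 0; no square-integrable dominating Y exists."""
    return float(n) if 0.0 < u < 1.0 / n else 0.0

def x_dominated(n, u):
    """X_n(u) = 1 on (0, 1/n), else 0; dominated by the constant Y = 1."""
    return 1.0 if 0.0 < u < 1.0 / n else 0.0

def second_moment(height, n):
    """Exact E[X_n^2] = height^2 * P(0 < U < 1/n) for U uniform on [0, 1]."""
    return height ** 2 / n

u = 0.037  # one fixed sample point; any u > 0 is eventually outside (0, 1/n)
for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  undominated: X_n(u)={x_undominated(n, u):6.1f}, "
          f"E[X_n^2]={second_moment(n, n):7.1f}   "
          f"dominated: X_n(u)={x_dominated(n, u):.1f}, "
          f"E[X_n^2]={second_moment(1, n):.4f}")
```

Along the fixed path both sequences are eventually 0, yet the second moment of the undominated spike grows like n while the dominated one shrinks like 1/n.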

Critical Counterexamples

Just as important as knowing what implies what is knowing what does NOT imply what. These counterexamples are fundamental to understanding the theory and avoiding incorrect reasoning.

The Typewriter Sequence: L² ⇏ a.s.

This is the most famous counterexample in convergence theory. It shows that even if the average squared error vanishes, individual sample paths can fail to converge.

Typewriter Sequence: L² ⇏ Almost Sure

What's happening: X_n(ω) = 1 on the current interval and 0 elsewhere (for n = 1 the interval is [0, 1], so the interval width and the L² norm both equal 1). The interval “sweeps” across [0,1], getting smaller each pass. For every ω ∈ [0,1], X_n(ω) = 1 for infinitely many n, so X_n(ω) does NOT converge to 0 for any ω. Yet E[X_n²] = interval width → 0, proving L² convergence to 0!

Mathematical Analysis

Define X_n on [0, 1] (with the uniform measure) as the indicator of the interval I_n = [k/2^m, (k+1)/2^m], where n = 2^m + k for 0 ≤ k < 2^m.

L² Convergence:

$$E[X_n^2] = \text{length}(I_n) = \frac{1}{2^m} \to 0$$

No Almost Sure Convergence: For any ω ∈ [0, 1], the interval I_n covers ω for infinitely many n (each “sweep” across [0,1] hits every point) and misses ω for infinitely many n as well. Therefore X_n(ω) takes both values 0 and 1 infinitely often, and lim X_n(ω) does not exist.

Key Insight: L² convergence only requires the average to behave well. The sequence can keep misbehaving on sets of small probability, as long as their contribution to the expectation shrinks to zero.

The Sliding Bump: P ⇏ L²

This counterexample shows that even if large deviations become unlikely, they can still dominate the mean square error if they're large enough.

Sliding Bump: Probability ⇏ L²

Key Insight: X_n = n with probability 1/n, else 0. For any fixed ε > 0 and n > ε, P(X_n > ε) = 1/n → 0, so X_n → 0 in probability. But E[X_n²] = n² × (1/n) = n → ∞, so L² convergence fails. Rare but large spikes ruin the mean square error even as they become increasingly unlikely.

Mathematical Analysis

Define X_n = n with probability 1/n, and X_n = 0 otherwise.

Probability Convergence: For any ε > 0 and all n > ε:

$$P(|X_n| > \varepsilon) = P(X_n = n) = \frac{1}{n} \to 0$$

No L² Convergence:

$$E[X_n^2] = n^2 \cdot \frac{1}{n} + 0 \cdot \frac{n-1}{n} = n \to \infty$$

Key Insight: Rare events with extreme values can blow up the MSE even as they become arbitrarily improbable. L² convergence requires controlling both the probability AND the magnitude of deviations.
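
A quick Monte Carlo run makes the two trends visible side by side. This is a minimal sketch using only the standard library; the function name `simulate_spike`, the sample count, and the seed are arbitrary choices made here. It assumes ε < n, so that every spike counts as a large deviation.

```python
import random

def simulate_spike(n, n_samples=200_000, seed=0):
    """Monte Carlo estimates of P(|X_n| > eps) and E[X_n^2] for X_n = n w.p. 1/n.

    Assumes eps < n, so the large-deviation event is exactly {X_n = n}.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.random() < 1.0 / n)
    prob_large = hits / n_samples              # fraction of draws with X_n = n
    second_moment = n ** 2 * hits / n_samples  # each hit contributes n^2
    return prob_large, second_moment

for n in (10, 100, 1000):
    p, m2 = simulate_spike(n)
    print(f"n={n:5d}  P(X_n = n)~{p:.4f} (exact {1 / n:.4f})  "
          f"E[X_n^2]~{m2:.1f} (exact {n})")
```

The estimated probability falls toward zero while the estimated second moment keeps growing, matching the two displayed formulas.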

Convergence in Distribution ⇏ In Probability

Counterexample: Independent Copies

Let X ~ N(0, 1) and let X_n be independent copies of X (i.e., X_n ~ N(0, 1), independent of X and of each other).

Distribution Convergence: X_n → X in distribution trivially, since all random variables have the same distribution N(0, 1).

No Probability Convergence:

$$P(|X_n - X| > \varepsilon) = P(|Z| > \varepsilon)$$

where Z = X_n - X ~ N(0, 2) (a difference of independent normals, so the variances add). This probability is the same positive constant for every n, so it does not converge to zero!

Key Insight: Convergence in distribution only says the marginal distributions match. It says nothing about the joint behavior of X_n and X. Two random variables can have the same distribution while being far apart pointwise.
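
The constant probability can be computed in closed form and confirmed by simulation. The sketch below is illustrative, with hypothetical function names; it uses the fact that for Z ~ N(0, 2), P(|Z| > ε) = erfc(ε / 2), and it draws X and X_n independently, as in the counterexample.

```python
import math
import numpy as np

def exact_gap_probability(eps):
    """P(|X_n - X| > eps) when X_n - X ~ N(0, 2); equals erfc(eps / 2)."""
    return math.erfc(eps / 2.0)

def simulated_gap_probability(eps, n_samples=200_000, seed=0):
    """Estimate the same probability from independent N(0, 1) draws."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(n_samples)
    X_n = rng.standard_normal(n_samples)  # same law as X, but independent of it
    return float(np.mean(np.abs(X_n - X) > eps))

for eps in (0.1, 0.5, 1.0):
    print(f"eps={eps}:  exact={exact_gap_probability(eps):.4f}  "
          f"simulated={simulated_gap_probability(eps):.4f}")
```

Nothing in either function depends on n: the gap probability is the same for every index, which is why convergence in probability fails.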

The Exception: When X is a constant c, convergence in distribution to c DOES imply convergence in probability to c. This is because P(X = c) = 1 means F(x) has a jump at c, forcing X_n to concentrate near c.

Summary: The Complete Implication Picture

From	To	Implies?	Condition/Counterexample
Almost Sure	In Probability	✓ Yes	Always holds
Almost Sure	L²	✓ Conditional	Requires domination: |X_n| ≤ Y with E[Y²] < ∞
L²	In Probability	✓ Yes	Via Chebyshev inequality
L²	Almost Sure	✗ No	Typewriter sequence
In Probability	Almost Sure	✗ No	Typewriter sequence
In Probability	L²	✗ No	Sliding bump (tall rare spikes)
In Probability	In Distribution	✓ Yes	Always holds
In Distribution	In Probability	✗ No	Independent copies of same distribution
In Distribution (to c)	In Probability	✓ Yes	Special case: constant limit

The Special Case: Convergence to a Constant

There is one remarkable exception to the general rule that weaker convergence doesn't imply stronger: when the limit X is a constant c, convergence in distribution implies convergence in probability!

Theorem: d to Constant ⇒ P

If $X_n \xrightarrow{d} c$ where c is a constant, then $X_n \xrightarrow{P} c$.

Proof:

The CDF of constant c is: F(x) = 0 for x < c, F(x) = 1 for x ≥ c.

Convergence in distribution means F_n(x) → F(x) at continuity points, which is every x ≠ c.

For any ε > 0, the event {|X_n - c| > ε} is contained in the complement of {c - ε < X_n ≤ c + ε}, so:

$$P(|X_n - c| > \varepsilon) \leq 1 - P(c - \varepsilon < X_n \leq c + \varepsilon)$$

$$= 1 - [F_n(c + \varepsilon) - F_n(c - \varepsilon)]$$

$$\to 1 - [F(c + \varepsilon) - F(c - \varepsilon)] = 1 - [1 - 0] = 0$$

This proves convergence in probability to c. ∎

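Why This Matters: Some arguments naturally produce convergence in distribution to a constant. For example, the characteristic-function proof of the weak Law of Large Numbers first shows that the sample mean converges in distribution to μ. This theorem tells us we automatically get the stronger convergence in probability as well.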
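
The quantity 1 - [F_n(c + ε) - F_n(c - ε)] from the proof can be evaluated directly for a concrete family. The sketch below assumes X_n ~ N(c, 1/n), an example chosen here for illustration, and uses hypothetical helper names; the normal CDF is written with `math.erf`.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def deviation_bound(n, c=2.0, eps=0.1):
    """1 - [F_n(c + eps) - F_n(c - eps)] for X_n ~ N(c, 1/n) (assumed example)."""
    std = 1.0 / math.sqrt(n)
    return 1.0 - (normal_cdf(c + eps, c, std) - normal_cdf(c - eps, c, std))

for n in (1, 10, 100, 1000, 10000):
    print(f"n={n:6d}  bound on P(|X_n - c| > eps) = {deviation_bound(n):.6f}")
```

Because c ± ε are continuity points of the limiting step function, F_n(c + ε) → 1 and F_n(c - ε) → 0, and the bound is driven to zero.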

Deep Learning Applications

The convergence hierarchy is not just abstract mathematics—it directly impacts how we understand and analyze neural network training.

SGD Convergence Analysis

When analyzing stochastic gradient descent, different convergence modes give different guarantees:

SGD Convergence: Different Modes in Action (interactive demo tracking the L² error E[(θ_t - θ*)²], the deviation probability P(|error| > 0.1), and a pathwise a.s. indicator across training epochs)

Deep Learning Insight: During SGD training, we often achieve convergence in probability (loss stabilizes) before achieving strict L² convergence (MSE to optimal). Almost sure convergence is typically too strong to guarantee for stochastic optimization without extra conditions on the learning rate. Understanding these distinctions helps interpret training dynamics and convergence diagnostics.

Convergence Mode	What It Guarantees for SGD	Practical Meaning
In Probability	θₜ → θ* in probability	With high probability, weights are near optimum eventually
L²	E[‖θₜ - θ*‖²] → 0	Average squared distance to optimum vanishes
Almost Sure	θₜ(ω) → θ* for almost all training runs	Every training run converges (measure-zero exceptions)
In Distribution	Distribution of θₜ stabilizes	Weights fluctuate around a limiting distribution

Typical SGD Results: Most theoretical guarantees for SGD establish convergence in probability or L². Almost sure convergence requires stronger conditions on learning rate decay (e.g., the Robbins-Monro conditions Σα_t = ∞, Σα_t² < ∞).
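
The effect of the step-size conditions shows up even on a one-dimensional toy problem. The sketch below is an illustration under stated assumptions, not an analysis of real training: the objective is f(θ) = θ²/2, the stochastic gradient is θ plus N(0, 1) noise, and the function name `sgd_final_mse` is hypothetical. It compares a constant learning rate with the decaying schedule α_t = 1/t, which satisfies both summability conditions.

```python
import numpy as np

def sgd_final_mse(step_size, n_steps=2000, n_runs=2000, seed=0):
    """Mean squared distance to the optimum after SGD on f(theta) = theta^2 / 2.

    The stochastic gradient is theta + N(0, 1) noise (an assumed toy model).
    `step_size` maps the step index t = 1, 2, ... to the learning rate alpha_t.
    """
    rng = np.random.default_rng(seed)
    theta = np.ones(n_runs)                    # every run starts at theta_0 = 1
    for t in range(1, n_steps + 1):
        grad = theta + rng.standard_normal(n_runs)
        theta = theta - step_size(t) * grad
    return float(np.mean(theta ** 2))          # the optimum is theta* = 0

mse_constant = sgd_final_mse(lambda t: 0.1)      # sum of alpha_t^2 diverges
mse_decaying = sgd_final_mse(lambda t: 1.0 / t)  # sum alpha_t = inf, sum alpha_t^2 < inf
print(f"constant alpha = 0.1 : MSE ~ {mse_constant:.5f}")
print(f"decaying alpha = 1/t : MSE ~ {mse_decaying:.5f}")
```

With the constant rate the iterates keep fluctuating around the optimum, so the mean squared error levels off at a positive value; with the decaying rate it continues to shrink as the number of steps grows.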

Batch Normalization Statistics

Batch normalization computes running estimates of mean and variance:

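$$\mu_t = (1-\alpha)\mu_{t-1} + \alpha \cdot \mu_{batch}$$

L² Convergence: E[(μ_t - μ)²] → 0 requires unbiased batch statistics and a momentum α that decays to 0; with a fixed α the mean squared error levels off at a floor proportional to α times the variance of the batch mean
In Probability: μ_t concentrates around the true μ, up to that same noise floor when α is fixed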
Key Issue: During training, the true distribution shifts, complicating convergence analysis
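
For the running-mean update, the mean squared error obeys an exact recursion, so the behavior can be checked without simulation. The sketch below assumes unbiased, independent batch means with a common variance and a stationary true mean (no distribution shift); the function name `ema_mse` and the parameter values are hypothetical choices made here.

```python
def ema_mse(alphas, batch_var=1.0, initial_mse=1.0):
    """Exact MSE of mu_t = (1 - a_t) * mu_{t-1} + a_t * mu_batch.

    Assumes unbiased, independent batch means with variance `batch_var`
    and a stationary true mean, so the cross term has zero expectation:
    MSE_t = (1 - a_t)^2 * MSE_{t-1} + a_t^2 * batch_var.
    """
    mse = initial_mse
    for a in alphas:
        mse = (1.0 - a) ** 2 * mse + a ** 2 * batch_var
    return mse

T = 5000
fixed = ema_mse([0.1] * T)                               # constant momentum
decaying = ema_mse([1.0 / t for t in range(1, T + 1)])   # plain running average
print(f"fixed alpha = 0.1 : MSE = {fixed:.6f}  (floor alpha/(2 - alpha) = {0.1 / 1.9:.6f})")
print(f"alpha_t = 1/t     : MSE = {decaying:.6f}  (1/T = {1 / T:.6f})")
```

Under these assumptions a fixed momentum leaves a variance floor of α·batch_var/(2 - α), while α_t = 1/t reproduces the ordinary sample mean, whose error decays like 1/t.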

Model Ensembling and Convergence

When training an ensemble of models:

Individual models: Each converges (hopefully) in L² to low loss
Ensemble average: By LLN, the ensemble prediction converges in probability to expected prediction
Variance reduction: For independent members, ensemble variance decreases as 1/M (L² convergence rate); correlated members reduce variance more slowly

Practical Insight: Understanding convergence modes helps diagnose training issues. If your loss is noisy but trending down, you likely have probability convergence but not a.s. If the loss occasionally spikes, you may lack L² convergence.
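
The 1/M rate can be illustrated with an idealized ensemble. In the sketch below each member's prediction is modelled as the truth plus independent N(0, 1) noise, which is an assumption made for illustration (members of a real ensemble are usually correlated); the function name `ensemble_variance` is hypothetical.

```python
import numpy as np

def ensemble_variance(n_members, n_trials=20_000, seed=0):
    """Variance of the average of `n_members` independent predictions.

    Each prediction is modelled as truth + N(0, 1) noise, an idealisation:
    real ensemble members are usually correlated.
    """
    rng = np.random.default_rng(seed)
    preds = rng.standard_normal((n_trials, n_members))
    return float(np.var(preds.mean(axis=1)))

for M in (1, 4, 16, 64):
    print(f"M={M:3d}  variance of ensemble mean ~ {ensemble_variance(M):.4f}  (1/M = {1 / M:.4f})")
```

The estimated variance tracks 1/M closely, which is the L² convergence rate of the ensemble average toward the expected prediction in this idealized setting.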

Practical Guide: Choosing the Right Mode

Different applications require different convergence guarantees. Here's a guide:

Application	Recommended Mode	Why
Statistical estimation	L² (MSE)	Natural for comparing estimator quality
Hypothesis testing	In Distribution	Test statistics need limiting distribution
Monte Carlo integration	Almost Sure	Need guarantee that specific run converges
Neural network training	L² or In Probability	Practical convergence without pathwise guarantees
Bootstrap methods	In Distribution	Bootstrap distribution should match asymptotically
Concentration bounds	In Probability	Directly gives deviation probabilities

Rule of Thumb: If you need guarantees about individual sample paths, use a.s. convergence. If you care about average performance, use L². If you care about tail probabilities, use probability convergence. If you only need distributional behavior, use convergence in distribution.

Python Implementation

Here's a complete implementation demonstrating the relationships between convergence modes:

Demonstrating Convergence Relationships

convergence_relationships.py

Imports: NumPy for numerical computation, matplotlib for visualization, scipy.stats for statistical tests like Kolmogorov-Smirnov.
Main Demo Function: This function demonstrates all four convergence modes simultaneously, showing how they relate as sample size increases.
Key Relationships: The docstring summarizes the implication hierarchy. Note that a.s. => L² requires an additional dominance condition.
Metrics Dictionary: We track four quantities: L² error (MSE), probability of exceeding ε, tail supremum of the deviation (for a.s.), and KS distance (for distribution convergence).
Sample Mean Generation: We generate n_samples independent realizations of X̄ₙ, computing sample means across the second axis.
L² Error Computation: The mean squared error E[(X̄ₙ - μ)²]. For N(0,1) samples, this equals 1/n theoretically.
Probability Metric: The fraction of sample means exceeding ε from the true mean. This estimates P(|X̄ₙ - μ| > ε).
KS Distance: The Kolmogorov-Smirnov statistic measures how close the empirical CDF of the normalized sample means is to the N(0, 1) CDF. Convergence in distribution means KS → 0; here the samples are themselves normal, so the statistic reflects only Monte Carlo noise.
a.s. Metric: For almost sure convergence, we track the supremum of the running mean's deviation over the second half of each sample path (k ≥ n/2). Almost sure convergence means this tail supremum shrinks as n grows.
Chebyshev Verification: This function explicitly verifies the L² ⇒ Probability implication using Chebyshev's inequality.
Chebyshev Inequality: P(|X - μ| > ε) ≤ E[(X - μ)²]/ε². If L² error → 0, the bound → 0, proving probability convergence.
Visual Verification: The plot shows empirical probability always stays below the Chebyshev bound, confirming the theoretical implication.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def demonstrate_convergence_relationships(n_max=1000, n_samples=10000):
    """
    Demonstrate relationships between convergence modes.

    Key relationships:
    - a.s. => P (always)
    - L² => P (via Chebyshev)
    - P => d (always)
    - a.s. => L² (if dominated)
    """
    sample_sizes = np.arange(10, n_max + 1, 10)

    # Store metrics for each convergence mode
    metrics = {
        'l2_error': [],      # E[(X̄_n - μ)²]
        'prob_exceed': [],   # P(|X̄_n - μ| > ε)
        'max_deviation': [], # sup over k >= n/2 of |X̄_k - μ| (for a.s.)
        'ks_distance': [],   # KS distance for d convergence
    }

    true_mean = 0
    epsilon = 0.1

    for n in sample_sizes:
        # Generate samples: X̄_n for many trials
        samples = np.random.normal(true_mean, 1, (n_samples, n))
        sample_means = samples.mean(axis=1)

        # L² metric: MSE
        l2_error = np.mean((sample_means - true_mean) ** 2)
        metrics['l2_error'].append(l2_error)

        # Probability metric: P(|X̄_n - μ| > ε)
        prob_exceed = np.mean(np.abs(sample_means - true_mean) > epsilon)
        metrics['prob_exceed'].append(prob_exceed)

        # Distribution metric: KS distance to N(0, 1/n)
        theoretical_std = 1 / np.sqrt(n)
        normalized = (sample_means - true_mean) / theoretical_std
        ks_stat, _ = stats.kstest(normalized, 'norm')
        metrics['ks_distance'].append(ks_stat)

        # a.s. metric: tail supremum of the running mean along 100 paths.
        # Using only k >= n/2 avoids the early, noisy part of each path,
        # which would otherwise dominate the maximum for every n.
        running_mean = np.cumsum(samples[:100], axis=1) / np.arange(1, n + 1)
        tail = np.abs(running_mean[:, n // 2:] - true_mean)
        max_dev = tail.max(axis=1).mean()
        metrics['max_deviation'].append(max_dev)

    return sample_sizes, metrics

def verify_implication_l2_to_prob(n_samples=5000):
    """
    Verify: L² convergence => Convergence in Probability
    via Chebyshev's inequality: P(|X-μ|>ε) ≤ E[(X-μ)²]/ε²
    """
    n_values = np.arange(10, 500, 10)
    epsilon = 0.2

    l2_errors = []
    prob_exceeds = []
    chebyshev_bounds = []

    for n in n_values:
        samples = np.random.normal(0, 1, (n_samples, n))
        means = samples.mean(axis=1)

        # Empirical L² error
        l2 = np.mean(means ** 2)
        l2_errors.append(l2)

        # Empirical probability
        prob = np.mean(np.abs(means) > epsilon)
        prob_exceeds.append(prob)

        # Chebyshev bound (capped at 1, since it bounds a probability)
        bound = l2 / (epsilon ** 2)
        chebyshev_bounds.append(min(1, bound))

    # Plot
    plt.figure(figsize=(10, 5))
    plt.semilogy(n_values, prob_exceeds, 'b-', label='P(|X̄ₙ|>ε)', linewidth=2)
    plt.semilogy(n_values, chebyshev_bounds, 'r--',
                 label='Chebyshev bound: E[X̄ₙ²]/ε²', linewidth=2)
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Probability / Bound')
    plt.title('L² → Probability via Chebyshev')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    return n_values, l2_errors, prob_exceeds, chebyshev_bounds

And here's the implementation of the typewriter counterexample:

Typewriter Sequence: L² ⇏ a.s.

typewriter_counterexample.py

Typewriter Sequence: This classic counterexample shows L² convergence does NOT imply almost sure convergence. The interval 'sweeps' across [0,1] repeatedly.
Interval Definition: For n = 2^m + k, the bump is on [k/2^m, (k+1)/2^m]. Each 'sweep' uses smaller intervals, covering [0,1] completely each time.
L² Convergence: E[Xₙ²] = (interval length) × 1² = 1/2^m → 0. This proves L² convergence to 0.
a.s. Failure: Every ω ∈ [0,1] is covered by infinitely many intervals. So Xₙ(ω) = 1 infinitely often, meaning Xₙ(ω) does not converge to 0.
L² Norm Calculation: The L² norm is √(E[Xₙ²]) = 1/√(2^m). This decreases roughly as 1/√n, showing convergence.
Verification: Plotting E[Xₙ²] on a log scale shows a staircase that decays like 1/n, confirming L² convergence.
a.s. Counterexample: For any fixed ω, we count how many times Xₙ(ω) = 1. This count grows without bound as max_n increases, illustrating that no pointwise limit exists.
Key Insight: The stem plot shows Xₙ(ω) bouncing between 0 and 1 forever. L² convergence only means average behavior, not pointwise behavior.

import numpy as np
import matplotlib.pyplot as plt

def typewriter_sequence(n):
    """
    The 'typewriter' counterexample.

    X_n(ω) = 1 on interval [k/2^m, (k+1)/2^m] where n = 2^m + k
    X_n(ω) = 0 elsewhere

    This sequence:
    - DOES converge in L² to 0 (E[X_n²] = 1/2^m → 0)
    - Does NOT converge almost surely (every ω hit infinitely often)
    """
    if n == 0:
        return (0, 1)

    # Decompose n = 2^m + k with 0 <= k < 2^m
    m = int(np.floor(np.log2(n)))
    power_2m = 2 ** m
    k = n - power_2m

    interval_size = 1 / power_2m
    start = k * interval_size
    end = (k + 1) * interval_size

    return (start, end)

def compute_l2_norm(n):
    """
    L² norm of X_n: sqrt(E[X_n²]) = sqrt(interval_length) = 1/sqrt(2^m)
    """
    if n == 0:
        return 1.0
    m = int(np.floor(np.log2(n)))
    return 1 / np.sqrt(2 ** m)

def verify_l2_convergence(max_n=100):
    """
    Verify L² convergence: E[X_n²] → 0
    """
    ns = range(1, max_n + 1)
    l2_norms = [compute_l2_norm(n) for n in ns]
    l2_squared = [x**2 for x in l2_norms]

    plt.figure(figsize=(10, 5))
    plt.semilogy(ns, l2_squared, 'b-', linewidth=2)
    plt.xlabel('n')
    plt.ylabel('E[Xₙ²]')
    plt.title('L² Convergence: E[Xₙ²] → 0')
    plt.grid(True, alpha=0.3)
    plt.show()

    return ns, l2_norms

def show_no_as_convergence(omega=0.3, max_n=50):
    """
    Show that for ANY ω, X_n(ω) = 1 for infinitely many n.
    Thus X_n does NOT converge to 0 almost surely!
    """
    values = []
    for n in range(1, max_n + 1):
        start, end = typewriter_sequence(n)
        # Half-open interval so that each sweep hits ω exactly once
        values.append(1 if start <= omega < end else 0)

    plt.figure(figsize=(12, 4))
    plt.stem(range(1, max_n + 1), values)
    plt.xlabel('n')
    plt.ylabel(f'Xₙ(ω={omega})')
    plt.title(f'Xₙ(ω) at ω={omega}: Hits 1 infinitely often → No a.s. convergence')
    plt.grid(True, alpha=0.3)
    plt.show()

    hits = sum(values)
    print(f"X_n(ω={omega}) = 1 for {hits} values of n in [1, {max_n}]")
    return values

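Common Mistakes to Avoid

Mistake 1: Assuming convergence in distribution implies anything about joint behavior
Mistake 2: Thinking “converges quickly” means stronger convergence
Mistake 3: Forgetting the special case for constant limits
Mistake 4: Confusing “with probability 1” with almost sure convergence

Practice Problems

Problem 1: Proving an Implication
Problem 2: Constructing a Counterexample
Problem 3: Deep Learning Application
Problem 4: Central Limit Theorem Interpretation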

Summary

Key Takeaways: Convergence Relationships

The Hierarchy: a.s. ⇒ L² (conditional) ⇒ P ⇒ d, with a.s. ⇒ P always.
Key Implications: L² ⇒ P via Chebyshev; a.s. ⇒ L² requires dominated convergence.
Critical Counterexamples: Typewriter (L² ⇏ a.s.), Sliding bump (P ⇏ L²), Independent copies (d ⇏ P).
Special Case: When X = c constant, d ⇒ P (useful in proofs of the weak LLN).
Deep Learning: SGD typically achieves P or L² convergence; a.s. convergence requires special conditions.
Practical Choice: Use the mode that matches your application: a.s. for pathwise, L² for MSE, P for bounds, d for asymptotics.

Final Thought: The convergence hierarchy is a map for navigating probabilistic guarantees. Knowing which implications hold—and which don't—is essential for correctly interpreting theoretical results and designing reliable algorithms. The counterexamples are not just curiosities; they protect us from false reasoning.
