Chapter 9

Relationships Between Convergence Modes

Convergence Concepts

Learning Objectives

By the end of this section, you will be able to:

  1. Draw the complete hierarchy of convergence modes and explain which implications hold
  2. Prove the key implications: a.s. ⇒ P, L² ⇒ P, P ⇒ d, and a.s. ⇒ L² (under conditions)
  3. Construct and explain counterexamples showing which implications do NOT hold
  4. Identify the special case where convergence in distribution implies convergence in probability
  5. Apply the correct convergence mode to analyze deep learning training dynamics
  6. Choose the appropriate convergence concept for different statistical and ML applications
Why This Matters: Understanding the relationships between convergence modes is essential for correctly interpreting theoretical guarantees. When a paper claims “the estimator converges,” knowing which type of convergence determines what you can actually conclude about performance, reliability, and generalization.

The Story: A Unified Theory of Convergence

In the early 20th century, mathematicians developing probability theory faced a fundamental question: what does it mean for a sequence of random quantities to “converge”? Unlike deterministic sequences where convergence has a single definition, random sequences can approach a limit in multiple distinct ways.

This led to the development of four major modes of convergence, each capturing a different aspect of “getting close.” But immediately, researchers asked: How do these concepts relate to each other? If we know one type of convergence holds, what else can we conclude?

The answers form a beautiful hierarchy—a map of implications, conditionals, and striking counterexamples. This hierarchy isn't just mathematical elegance; it's essential for:

  • Theoretical guarantees: Knowing what your convergence theorem actually promises
  • Algorithm analysis: Understanding when optimization “succeeds”
  • Practical diagnostics: Interpreting training curves and convergence metrics
  • Research claims: Evaluating what “convergence” really means in a paper
The Central Insight: Each convergence mode captures a different question:

  • Does (almost) every realization converge? (a.s.)
  • Does the average squared error vanish? (L²)
  • Does the probability of being far away vanish? (P)
  • Do the distributions become indistinguishable? (d)

The Convergence Hierarchy

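The four modes of convergence form a hierarchy rather than a single chain. Almost sure and L² convergence are the two strongest modes, and neither implies the other in general. Each of them implies convergence in probability, which in turn implies convergence in distribution, the weakest mode. Here is the complete picture: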

The Convergence Hierarchy (Strongest to Weakest)

Almost Sure (a.s.): P(lim Xn = X) = 1

Mean Square (L²): E[(Xn - X)²] → 0 (follows from a.s. convergence only if the sequence is dominated)

In Probability (P): ∀ε > 0: P(|Xn - X| > ε) → 0

In Distribution (d): Fn(x) → F(x) at continuity points of F

[Interactive figure: Convergence Hierarchy Explorer. Arrow legend: a.s. → P (always holds); L² → P (always holds, via Chebyshev); P → d (always holds); a.s. → L² (conditional); d → P (does NOT hold in general, only if X = c).]

The Implications: What Implies What

Let's rigorously establish each implication in the hierarchy, providing both the proof and the intuition behind why it holds.

Almost Sure ⇒ In Probability

Theorem: a.s. ⇒ P (Always Holds)

If $X_n \xrightarrow{a.s.} X$, then $X_n \xrightarrow{P} X$.

Proof:

Let $\varepsilon > 0$. Define the event $A_n = \{|X_n - X| > \varepsilon\}$.

Almost sure convergence means $P(\lim_{n\to\infty} X_n = X) = 1$.

This implies that for almost all ω, there exists N(ω) such that for all n ≥ N(ω), |Xn(ω) - X(ω)| < ε. In other words, almost every ω belongs to only finitely many of the events $A_n$, so $P(\limsup_n A_n) = 0$.

The tail unions $B_n = \bigcup_{k \ge n} A_k$ decrease to $\limsup_n A_n$, so $P(B_n) \to 0$ by continuity of probability from above. Since $A_n \subseteq B_n$, for any fixed ε:

$$P(A_n) = P(|X_n - X| > \varepsilon) \le P(B_n) \to 0 \text{ as } n \to \infty$$

This is exactly the definition of convergence in probability. ∎

Intuition: If the sequence converges for almost every sample path, then the probability of being far from the limit must eventually become tiny. The “bad” paths where |Xn - X| > ε form a set whose measure shrinks to zero.
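As a rough numerical illustration (a sketch, not part of the proof), the snippet below simulates running means of i.i.d. N(0, 1) draws, which converge to 0 almost surely by the strong law of large numbers. It estimates the probability that index n is far from the limit, and the larger probability that some index k ≥ n (up to a finite horizon) is far from the limit. The function name, horizon, path count, and seed are illustrative assumptions.

```python
import numpy as np

def tail_vs_pointwise(eps=0.2, horizon=2000, n_paths=2000, seed=0):
    """Estimate P(|mean_n| > eps) and P(max over n <= k <= horizon of |mean_k| > eps)."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, 1.0, size=(n_paths, horizon))
    running_mean = np.cumsum(draws, axis=1) / np.arange(1, horizon + 1)
    # exceed[i, k-1] is True when path i deviates by more than eps at index k
    exceed = np.abs(running_mean) > eps
    # Reverse cumulative "or": tail[i, n-1] is True if any index k >= n deviates
    tail = np.logical_or.accumulate(exceed[:, ::-1], axis=1)[:, ::-1]
    return exceed.mean(axis=0), tail.mean(axis=0)

p_point, p_tail = tail_vs_pointwise()
for n in (10, 100, 1000):
    print(f"n={n:5d}  P(far at n) ~ {p_point[n - 1]:.4f}  P(far at some k >= n) ~ {p_tail[n - 1]:.4f}")
```

The pointwise probability never exceeds the tail probability, and the tail probability can only decrease in n, which mirrors the sandwich used in the argument. The finite horizon is a simulation shortcut; the true tail event ranges over all k ≥ n.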

L² ⇒ In Probability

Theorem: L² ⇒ P (via Chebyshev/Markov)

If $X_n \xrightarrow{L^2} X$, then $X_n \xrightarrow{P} X$.

Proof (using Chebyshev's Inequality):

By Chebyshev's inequality (Markov's inequality applied to $(X_n - X)^2$), for any ε > 0:

$$P(|X_n - X| > \varepsilon) \leq \frac{E[(X_n - X)^2]}{\varepsilon^2}$$

Since $X_n \xrightarrow{L^2} X$, we have $E[(X_n - X)^2] \to 0$.

Therefore, for any fixed ε > 0:

$$P(|X_n - X| > \varepsilon) \leq \frac{E[(X_n - X)^2]}{\varepsilon^2} \to 0$$

This proves convergence in probability. ∎

Intuition: If the average squared error vanishes, then the probability of a large error must also vanish. Large deviations contribute disproportionately to the mean square, so they must become rare.

In Probability ⇒ In Distribution

Theorem: P ⇒ d (Always Holds)

If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{d} X$.

Proof Sketch:

Let x be a continuity point of F(x) = P(X ≤ x). We need to show Fn(x) → F(x).

For any ε > 0:

$$P(X_n \leq x) \leq P(X \leq x + \varepsilon) + P(|X_n - X| > \varepsilon)$$

Similarly:

$$P(X_n \leq x) \geq P(X \leq x - \varepsilon) - P(|X_n - X| > \varepsilon)$$

Taking n → ∞ and using P(|Xn - X| > ε) → 0:

$$F(x - \varepsilon) \leq \liminf F_n(x) \leq \limsup F_n(x) \leq F(x + \varepsilon)$$

Since x is a continuity point, letting ε → 0 gives Fn(x) → F(x). ∎

Intuition: If Xn is “close” to X with high probability, then the probability of Xn ≤ x should be close to the probability of X ≤ x. The distributions must become similar.
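The two-sided bound in the proof sketch can be checked on simulated data. The sketch below assumes an illustrative construction that is not from the text: X ~ N(0, 1) and Xn = X + W/n with W an independent N(0, 1) perturbation, so Xn → X in probability. The names `sandwich_check`, `x`, and `eps` are hypothetical.

```python
import numpy as np

def sandwich_check(n, x=0.5, eps=0.1, n_draws=100_000, seed=0):
    """Return (lower bound, F_n(x), upper bound) from the P => d sandwich, using empirical frequencies."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n_draws)
    Xn = X + rng.normal(size=n_draws) / n      # assumed perturbation model
    far = np.mean(np.abs(Xn - X) > eps)        # estimate of P(|Xn - X| > eps)
    lower = np.mean(X <= x - eps) - far        # F(x - eps) - P(|Xn - X| > eps)
    upper = np.mean(X <= x + eps) + far        # F(x + eps) + P(|Xn - X| > eps)
    Fn = np.mean(Xn <= x)                      # F_n(x)
    return lower, Fn, upper

for n in (1, 10, 100):
    lower, Fn, upper = sandwich_check(n)
    print(f"n={n:4d}  {lower:.4f} <= {Fn:.4f} <= {upper:.4f}")
```

Because the bound comes from an inclusion of events, it holds for the empirical frequencies as well, not only in the limit. As n grows, the `far` term drops out and Fn(x) is squeezed between F(x - ε) and F(x + ε).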

Almost Sure ⇒ L² (Conditional)

Theorem: a.s. ⇒ L² (Under Dominated Convergence)

If $X_n \xrightarrow{a.s.} X$ and there exists Y with E[Y²] < ∞ such that |Xn| ≤ Y for all n, then $X_n \xrightarrow{L^2} X$.

Proof (via Dominated Convergence Theorem):

We have (Xn - X)² → 0 almost surely.

Because |Xn| ≤ Y for all n and Xn → X almost surely, the limit also satisfies |X| ≤ Y almost surely.

By the triangle inequality: (Xn - X)² ≤ (|Xn| + |X|)² ≤ 4Y²

This bound has finite expectation because E[Y²] < ∞, so the Dominated Convergence Theorem gives:

$$E[(X_n - X)^2] \to E[\lim (X_n - X)^2] = 0$$

This proves L² convergence. ∎

Critical Condition: Without the dominating function, this implication FAILS! For example, on [0, 1] with the uniform measure, Xn = √n on [0, 1/n] and 0 elsewhere converges to 0 almost surely, yet E[Xn²] = 1 for every n. (The typewriter sequence in the next section shows the reverse failure: L² convergence without a.s. convergence.)
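To see concretely how the implication can fail without domination, here is a small exact computation for one standard example, chosen for illustration: on [0, 1] with the uniform measure, let Xn(ω) = √n when ω ≤ 1/n and 0 otherwise. The helper names `x_n` and `second_moment` are hypothetical.

```python
from fractions import Fraction
import math

def x_n(n, omega):
    """X_n(omega) = sqrt(n) if omega <= 1/n, else 0, for omega in [0, 1]."""
    return math.sqrt(n) if omega <= Fraction(1, n) else 0.0

def second_moment(n):
    """E[X_n^2] = (sqrt(n))^2 * P(omega <= 1/n) = n * (1/n), computed exactly."""
    return n * Fraction(1, n)

omega = Fraction(3, 100)   # any fixed omega > 0 behaves the same way
path = [x_n(n, omega) for n in range(1, 201)]
last_nonzero = max(n for n, v in zip(range(1, 201), path) if v != 0.0)

print("last n with X_n(omega) != 0:", last_nonzero)
print("E[X_n^2] for n = 1, 10, 1000:", [second_moment(n) for n in (1, 10, 1000)])
```

For every fixed ω > 0 the path is eventually 0, so Xn → 0 almost surely, while the second moment stays at 1. No square-integrable dominating variable exists here: the pathwise supremum of Xn² behaves like 1/ω, which is not integrable on [0, 1].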

Critical Counterexamples

Just as important as knowing what implies what is knowing what does NOT imply what. These counterexamples are fundamental to understanding the theory and avoiding incorrect reasoning.

The Typewriter Sequence: L² ⇏ a.s.

This is one of the best-known counterexamples in convergence theory. It shows that even if the average squared error vanishes, individual sample paths can fail to converge.

[Interactive figure: Typewriter Sequence (L² ⇏ Almost Sure). Plot of Xn(ω) against ω ∈ [0, 1], with readouts for the current interval, the interval width, the L² norm, and whether the sequence converges a.s. (no).]

What's happening: Xn(ω) = 1 on the current interval In, 0 elsewhere. The interval “sweeps” across [0,1], getting smaller each pass. For every ω ∈ [0,1], Xn(ω) = 1 for infinitely many n, so Xn(ω) does NOT converge to 0 for any ω. Yet E[Xn²] = interval width → 0, proving L² convergence to 0!

Mathematical Analysis

Define Xn on [0, 1] as the indicator of the interval $I_n = [k/2^m, (k+1)/2^m]$, where $n = 2^m + k$ for $0 \le k < 2^m$.

L² Convergence:

$$E[X_n^2] = \text{length}(I_n) = \frac{1}{2^m} \to 0$$

No Almost Sure Convergence: For any ω ∈ [0, 1], the interval In covers ω for infinitely many n (each “sweep” across [0,1] hits every point). Therefore Xn(ω) = 1 infinitely often. Between hits Xn(ω) = 0, also infinitely often, so the path keeps oscillating and lim Xn(ω) does not exist.

Key Insight: L² convergence only requires the average to behave well. The sequence can keep misbehaving on sets of shrinking probability, as long as those sets contribute less and less to the expectation.

The Sliding Bump: P ⇏ L²

This counterexample shows that even if large deviations become unlikely, they can still dominate the mean square error if they're large enough.

[Interactive figure: Sliding Bump (Probability ⇏ L²). Readouts for P(Xn = n) = 1/n, the spike height n, and E[Xn²] = n² × (1/n) = n → ∞; L² convergence fails.]

Key Insight: Xn = n with probability 1/n, else 0. For any fixed ε > 0 and all n > ε, P(Xn > ε) = 1/n → 0, so Xn → 0 in probability. But E[Xn²] = n² × (1/n) = n → ∞, so L² convergence fails. Rare but large spikes ruin the mean square error even as they become increasingly unlikely.

Mathematical Analysis

Define Xn = n with probability 1/n, and Xn = 0 otherwise.

Probability Convergence: For any ε > 0 and all n > ε:

$$P(|X_n| > \varepsilon) = P(X_n = n) = \frac{1}{n} \to 0$$

No L² Convergence:

$$E[X_n^2] = n^2 \cdot \frac{1}{n} + 0 \cdot \frac{n-1}{n} = n \to \infty$$

Key Insight: Rare events with extreme values can blow up the MSE even as they become arbitrarily improbable. L² convergence requires controlling both the probability AND the magnitude of deviations.
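The two quantities in this analysis can be tabulated exactly. The sketch below uses rational arithmetic so nothing is lost to rounding; `bump_stats` and its default ε = 1/2 are illustrative choices.

```python
from fractions import Fraction

def bump_stats(n, eps=Fraction(1, 2)):
    """Exact P(|X_n| > eps) and E[X_n^2] for X_n = n w.p. 1/n, else 0."""
    p_spike = Fraction(1, n)
    # The spike value n exceeds eps as soon as n > eps; the value 0 never does
    prob_far = p_spike if n > eps else Fraction(0)
    second_moment = n**2 * p_spike + 0 * (1 - p_spike)
    return prob_far, second_moment

for n in (1, 10, 100, 1000):
    prob_far, m2 = bump_stats(n)
    print(f"n={n:5d}  P(|X_n| > 1/2) = {prob_far}  E[X_n^2] = {m2}")
```

The first column of results shrinks like 1/n while the second grows like n, which is the whole counterexample in two numbers.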

Convergence in Distribution ⇏ In Probability

Counterexample: Independent Copies

Let X ~ N(0, 1) and let Xn be independent copies of X (i.e., Xn ~ N(0, 1), independent of X and of each other).

Distribution Convergence: Xn → X in distribution trivially, since all random variables have the same distribution N(0, 1).

No Probability Convergence:

$$P(|X_n - X| > \varepsilon) = P(|Z| > \varepsilon)$$

where Z = Xn - X ~ N(0, 2) (the difference of two independent standard normals has variance 1 + 1 = 2). This probability is the same positive constant for every n, so it does not converge to zero!

Key Insight: Convergence in distribution only says the marginal distributions match. It says nothing about the joint behavior of Xn and X. Two random variables can have the same distribution while being far apart pointwise.
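A quick simulation makes the gap between the two modes visible. This is a sketch under the setup above (independent N(0, 1) draws); the function name, grid, and sample size are illustrative assumptions. The exact value uses P(|Z| > ε) = 1 - erf(ε/2) for Z ~ N(0, 2).

```python
import math
import numpy as np

def independent_copy_check(n_draws=50_000, eps=0.5, seed=0):
    """Compare the distributions of X and X_n, and estimate P(|X_n - X| > eps)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n_draws)
    Xn = rng.normal(size=n_draws)        # independent of X, same N(0, 1) law
    # Largest gap between the two empirical CDFs on a grid of points
    grid = np.linspace(-3.0, 3.0, 61)
    cdf_X = np.searchsorted(np.sort(X), grid, side="right") / n_draws
    cdf_Xn = np.searchsorted(np.sort(Xn), grid, side="right") / n_draws
    cdf_gap = np.max(np.abs(cdf_X - cdf_Xn))
    prob_far = np.mean(np.abs(Xn - X) > eps)
    exact = 1.0 - math.erf(eps / 2.0)    # P(|Z| > eps) with Z ~ N(0, 2)
    return cdf_gap, prob_far, exact

cdf_gap, prob_far, exact = independent_copy_check()
print(f"max CDF gap: {cdf_gap:.4f}")
print(f"P(|Xn - X| > 0.5): simulated {prob_far:.4f}, exact {exact:.4f}")
```

The CDF gap is only sampling noise, while the probability of being far apart stays large no matter which n the copy came from.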

The Exception: When X is a constant c, convergence in distribution to c DOES imply convergence in probability to c. The limiting CDF is a single jump from 0 to 1 at c, so Fn(x) → 0 for x < c and Fn(x) → 1 for x > c, which forces Xn to concentrate near c.

Summary: The Complete Implication Picture

| From | To | Implies? | Condition/Counterexample |
|---|---|---|---|
| Almost Sure | In Probability | ✓ Yes | Always holds |
| Almost Sure | L² | ✓ Conditional | Requires a square-integrable dominating variable |
| L² | In Probability | ✓ Yes | Via Chebyshev inequality |
| L² | Almost Sure | ✗ No | Typewriter sequence |
| In Probability | Almost Sure | ✗ No | Typewriter sequence |
| In Probability | L² | ✗ No | Sliding bump (tall rare spikes) |
| In Probability | In Distribution | ✓ Yes | Always holds |
| In Distribution | In Probability | ✗ No | Independent copies of same distribution |
| In Distribution (to c) | In Probability | ✓ Yes | Special case: constant limit |

The Special Case: Convergence to a Constant

There is one remarkable exception to the general rule that weaker convergence doesn't imply stronger: when the limit X is a constant c, convergence in distribution implies convergence in probability!

Theorem: d to Constant ⇒ P

If $X_n \xrightarrow{d} c$ where c is a constant, then $X_n \xrightarrow{P} c$.

Proof:

The CDF of constant c is: F(x) = 0 for x < c, F(x) = 1 for x ≥ c.

Convergence in distribution means Fn(x) → F(x) at continuity points, which is every x ≠ c. In particular, c - ε and c + ε are continuity points.

For any ε > 0:

$$P(|X_n - c| > \varepsilon) \le 1 - P(c - \varepsilon < X_n \leq c + \varepsilon)$$
$$= 1 - [F_n(c + \varepsilon) - F_n(c - \varepsilon)]$$
$$\to 1 - [F(c + \varepsilon) - F(c - \varepsilon)] = 1 - [1 - 0] = 0$$

This proves convergence in probability to c. ∎

Why This Matters: Some arguments establish convergence in distribution to a constant first, for example characteristic-function proofs of the weak Law of Large Numbers. This theorem tells us we automatically get the stronger convergence in probability as well.
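The CDF expression from the proof can be evaluated directly for a concrete sequence. The sketch below assumes Xn ~ N(c, 1/n), an illustrative choice that converges in distribution to the constant c; `normal_cdf` and `prob_far_from_c` are hypothetical helper names.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of N(mean, std^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def prob_far_from_c(n, c=2.0, eps=0.1):
    """1 - [F_n(c + eps) - F_n(c - eps)] for X_n ~ N(c, 1/n)."""
    std = 1.0 / math.sqrt(n)
    Fn_hi = normal_cdf(c + eps, c, std)
    Fn_lo = normal_cdf(c - eps, c, std)
    return 1.0 - (Fn_hi - Fn_lo)

for n in (1, 10, 100, 1000, 10000):
    print(f"n={n:6d}  P(|Xn - c| > 0.1) = {prob_far_from_c(n):.6f}")
```

As n grows, Fn(c + ε) approaches 1 and Fn(c - ε) approaches 0, so the bracket approaches 1 and the deviation probability approaches 0, exactly as in the last line of the proof.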

Deep Learning Applications

The convergence hierarchy is not just abstract mathematics—it directly impacts how we understand and analyze neural network training.

SGD Convergence Analysis

When analyzing stochastic gradient descent, different convergence modes give different guarantees:

[Interactive figure: SGD Convergence: Different Modes in Action. Curves over epochs for the L² error E[(θt - θ*)²], the large-deviation probability P(|error| > 0.1), and a pathwise (a.s.) convergence indicator.]

Deep Learning Insight: During SGD training, the probability of a large deviation can become small (the loss stabilizes) while the mean squared distance to the optimum is still inflated by rare large excursions, so convergence in probability can be visible before L² convergence is. Almost sure convergence is the hardest of the three to guarantee for stochastic optimization and needs extra conditions on the step sizes. Understanding these distinctions helps interpret training dynamics and convergence diagnostics.

| Convergence Mode | What It Guarantees for SGD | Practical Meaning |
|---|---|---|
| In Probability | θₜ → θ* in probability | With high probability, weights are eventually near the optimum |
| L² | E[‖θₜ - θ*‖²] → 0 | Average squared distance to optimum vanishes |
| Almost Sure | θₜ(ω) → θ* for almost all training runs | Almost every training run converges (exceptions have probability zero) |
| In Distribution | Distribution of θₜ stabilizes | Weights fluctuate around a limiting distribution |

Typical SGD Results: Most theoretical guarantees for SGD establish convergence in probability or L². Almost sure convergence requires stronger conditions on learning rate decay (e.g., Σαt = ∞, Σαt² < ∞).
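To make the role of the step-size conditions concrete, here is a toy simulation, offered as a sketch rather than a model of real training. It assumes a one-dimensional objective f(θ) = θ²/2 with optimum θ* = 0 and additive Gaussian gradient noise, and compares a decaying schedule αt = 1/t (which satisfies Σαt = ∞, Σαt² < ∞) with a constant step. The function `sgd_mse` and all parameter values are illustrative.

```python
import numpy as np

def sgd_mse(step_fn, n_steps=2000, n_runs=2000, noise_std=1.0, seed=0):
    """Estimate E[(theta_T - theta*)^2] for noisy SGD on f(theta) = theta^2 / 2, theta* = 0."""
    rng = np.random.default_rng(seed)
    theta = np.ones(n_runs)                    # every run starts at theta_0 = 1
    for t in range(1, n_steps + 1):
        # Noisy gradient: true gradient theta plus independent Gaussian noise
        grad = theta + noise_std * rng.normal(size=n_runs)
        theta = theta - step_fn(t) * grad
    return np.mean(theta ** 2)

mse_decay = sgd_mse(lambda t: 1.0 / t)   # decaying steps
mse_const = sgd_mse(lambda t: 0.1)       # constant step
print(f"MSE with alpha_t = 1/t : {mse_decay:.5f}")
print(f"MSE with alpha = 0.1   : {mse_const:.5f}")
```

In this toy problem the decaying schedule drives the mean squared error toward 0 (roughly like 1/T), while the constant step leaves a noise floor. The simulation only measures the L² error across runs; it does not by itself demonstrate almost sure convergence of any single run.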

Batch Normalization Statistics

Batch normalization computes running estimates of mean and variance:

$$\mu_t = (1-\alpha)\mu_{t-1} + \alpha \cdot \mu_{batch}$$
  • L² Convergence: E[(μt - μ)²] → 0 requires unbiased batch statistics and a momentum α that decays to 0; with a fixed α the error levels off at a floor of order α instead of vanishing
  • In Probability: μt concentrates around the true μ, with a spread that shrinks as α decreases
  • Key Issue: During training, the true distribution shifts, complicating convergence analysis
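The variance of the running mean follows directly from the update rule when batch means are assumed i.i.d. and unbiased with variance s²: Var(μt) = (1 - α)² Var(μt-1) + α² s². The sketch below iterates that recursion exactly; it tracks only the variance part of the error and ignores initialization bias and distribution shift. `ema_variance` and the schedules are illustrative assumptions.

```python
def ema_variance(alpha_fn, batch_var=1.0, n_steps=5000):
    """Iterate Var(mu_t) = (1 - a_t)^2 Var(mu_{t-1}) + a_t^2 * batch_var."""
    var = 0.0                                  # mu_0 is treated as a fixed starting value
    for t in range(1, n_steps + 1):
        a = alpha_fn(t)
        var = (1.0 - a) ** 2 * var + a ** 2 * batch_var
    return var

fixed = ema_variance(lambda t: 0.1)        # constant momentum
decayed = ema_variance(lambda t: 1.0 / t)  # decaying momentum (plain running average)

print(f"fixed alpha = 0.1 : variance {fixed:.6f}  (fixed point alpha/(2 - alpha) = {0.1 / 1.9:.6f})")
print(f"alpha_t = 1/t     : variance {decayed:.6f}  (1/T = {1 / 5000:.6f})")
```

With a constant α the variance settles at α s² / (2 - α), so the running estimate keeps fluctuating; with αt = 1/t it becomes the ordinary sample average and the variance decays like 1/T.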

Model Ensembling and Convergence

When training an ensemble of models:

  • Individual models: Each converges (hopefully) in L² to low loss
  • Ensemble average: By LLN, the ensemble prediction converges in probability to expected prediction
  • Variance reduction: For M independently trained members, ensemble variance decreases as 1/M (L² convergence rate)
Practical Insight: Understanding convergence modes helps diagnose training issues. A loss that is noisy but trending down is consistent with convergence in probability without a.s. convergence. A loss that occasionally spikes suggests that L² convergence may be lacking.
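The 1/M rate for the ensemble average is easy to check by simulation under the idealized assumption that member predictions are independent with equal variance. Real ensemble members are usually correlated, so the reduction in practice is smaller. The function name and parameters below are illustrative.

```python
import numpy as np

def ensemble_variance(M, n_trials=20_000, member_std=1.0, seed=0):
    """Variance of the average of M independent member predictions (idealized)."""
    rng = np.random.default_rng(seed)
    # Each row is one trial; each column is one member's prediction error
    preds = rng.normal(0.0, member_std, size=(n_trials, M))
    return preds.mean(axis=1).var()

for M in (1, 4, 16, 64):
    print(f"M={M:3d}  simulated variance {ensemble_variance(M):.4f}  vs 1/M = {1.0 / M:.4f}")
```

Since the mean squared deviation of the average from the expected prediction equals this variance, a 1/M variance is an L² statement, and convergence in probability follows from it.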

Practical Guide: Choosing the Right Mode

Different applications require different convergence guarantees. Here's a guide:

| Application | Recommended Mode | Why |
|---|---|---|
| Statistical estimation | L² (MSE) | Natural for comparing estimator quality |
| Hypothesis testing | In Distribution | Test statistics need limiting distribution |
| Monte Carlo integration | Almost Sure | Need guarantee that specific run converges |
| Neural network training | L² or In Probability | Practical convergence without pathwise guarantees |
| Bootstrap methods | In Distribution | Bootstrap distribution should match asymptotically |
| Concentration bounds | In Probability | Directly gives deviation probabilities |
Rule of Thumb: If you need guarantees about individual sample paths, use a.s. convergence. If you care about average performance, use L². If you care about tail probabilities, use probability convergence. If you only need distributional behavior, use convergence in distribution.

Python Implementation

Here's a complete implementation demonstrating the relationships between convergence modes:

Demonstrating Convergence Relationships
convergence_relationships.py

Notes on the code:

  • Imports: NumPy for numerical computation, matplotlib for visualization, scipy.stats for statistical tests like Kolmogorov-Smirnov.
  • Main demo function: Tracks one metric for each of the four convergence modes, showing how they relate as sample size increases.
  • Key relationships: The docstring summarizes the implication hierarchy. Note that a.s. => L² requires an additional dominance condition.
  • Metrics dictionary: We track four quantities: L² error (MSE), probability of exceeding ε, a supremum of the deviation along sample paths (for a.s.), and KS distance (for distribution convergence).
  • Sample mean generation: We generate n_samples independent realizations of X̄ₙ, computing sample means across the second axis.
  • L² error computation: The mean squared error E[(X̄ₙ - μ)²]. For N(0,1) samples, this equals 1/n theoretically.
  • Probability metric: The fraction of sample means exceeding ε from the true mean. This estimates P(|X̄ₙ - μ| > ε).
  • KS distance: The Kolmogorov-Smirnov statistic measures how close the empirical CDF of the standardized sample means is to the N(0, 1) CDF. A small KS distance indicates matching distributions.
  • a.s. metric: For almost sure convergence, the code tracks a supremum of the running-mean deviation along sample paths. Under a.s. convergence the tail supremum of |X̄ₖ - μ| over k ≥ m shrinks to 0 as m grows.
  • Chebyshev verification: The second function explicitly checks the L² ⇒ Probability implication using Chebyshev's inequality.
  • Chebyshev inequality: P(|X - μ| > ε) ≤ E[(X - μ)²]/ε². If L² error → 0, the bound → 0, proving probability convergence.
  • Visual verification: The plot shows empirical probability always stays below the Chebyshev bound, confirming the theoretical implication.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def demonstrate_convergence_relationships(n_max=1000, n_samples=10000, seed=None):
    """
    Demonstrate relationships between convergence modes.

    Key relationships:
    - a.s. => P (always)
    - L² => P (via Chebyshev)
    - P => d (always)
    - a.s. => L² (if dominated)
    """
    rng = np.random.default_rng(seed)
    sample_sizes = np.arange(10, n_max + 1, 10)

    # Store metrics for each convergence mode
    metrics = {
        'l2_error': [],      # E[(X̄_n - μ)²]
        'prob_exceed': [],   # P(|X̄_n - μ| > ε)
        'max_deviation': [], # sup over k >= n/2 of |X̄_k - μ| (for a.s.)
        'ks_distance': [],   # KS distance for d convergence
    }

    true_mean = 0
    epsilon = 0.1

    for n in sample_sizes:
        # Generate n_samples independent trials of n draws each
        samples = rng.normal(true_mean, 1, (n_samples, n))
        sample_means = samples.mean(axis=1)

        # L² metric: MSE
        l2_error = np.mean((sample_means - true_mean) ** 2)
        metrics['l2_error'].append(l2_error)

        # Probability metric: P(|X̄_n - μ| > ε)
        prob_exceed = np.mean(np.abs(sample_means - true_mean) > epsilon)
        metrics['prob_exceed'].append(prob_exceed)

        # Distribution metric: KS distance of the standardized means to N(0, 1).
        # For Gaussian data the standardized mean is exactly N(0, 1),
        # so this value reflects sampling noise only.
        theoretical_std = 1 / np.sqrt(n)
        normalized = (sample_means - true_mean) / theoretical_std
        ks_stat, _ = stats.kstest(normalized, 'norm')
        metrics['ks_distance'].append(ks_stat)

        # a.s. metric: tail supremum of the running mean on 100 paths.
        # The supremum over the whole path is dominated by the first few
        # terms and would not shrink, so only the second half is used.
        running_mean = np.cumsum(samples[:100], axis=1) / np.arange(1, n + 1)
        tail = running_mean[:, n // 2:]
        max_dev = np.max(np.abs(tail - true_mean), axis=1).mean()
        metrics['max_deviation'].append(max_dev)

    return sample_sizes, metrics


def verify_implication_l2_to_prob(n_samples=5000, seed=None):
    """
    Verify: L² convergence => Convergence in Probability
    via Chebyshev's inequality: P(|X-μ|>ε) ≤ E[(X-μ)²]/ε²
    """
    rng = np.random.default_rng(seed)
    n_values = np.arange(10, 500, 10)
    epsilon = 0.2

    l2_errors = []
    prob_exceeds = []
    chebyshev_bounds = []

    for n in n_values:
        samples = rng.normal(0, 1, (n_samples, n))
        means = samples.mean(axis=1)

        # Empirical L² error
        l2 = np.mean(means ** 2)
        l2_errors.append(l2)

        # Empirical probability
        prob = np.mean(np.abs(means) > epsilon)
        prob_exceeds.append(prob)

        # Chebyshev bound, capped at 1 because it bounds a probability
        bound = l2 / (epsilon ** 2)
        chebyshev_bounds.append(min(1, bound))

    # Plot (zero probabilities cannot be drawn on the log axis)
    plt.figure(figsize=(10, 5))
    plt.semilogy(n_values, prob_exceeds, 'b-', label='P(|X̄ₙ|>ε)', linewidth=2)
    plt.semilogy(n_values, chebyshev_bounds, 'r--',
                 label='Chebyshev bound: E[X̄ₙ²]/ε²', linewidth=2)
    plt.xlabel('Sample Size (n)')
    plt.ylabel('Probability / Bound')
    plt.title('L² → Probability via Chebyshev')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    return n_values, l2_errors, prob_exceeds, chebyshev_bounds

And here's the implementation of the typewriter counterexample:

Typewriter Sequence: L² ⇏ a.s.
typewriter_counterexample.py

Notes on the code:

  • Typewriter sequence: This classic counterexample shows L² convergence does NOT imply almost sure convergence. The interval 'sweeps' across [0,1] repeatedly.
  • Interval definition: For n = 2^m + k, the bump is on [k/2^m, (k+1)/2^m]. Each 'sweep' uses smaller intervals, covering [0,1] completely each time.
  • L² convergence: E[Xₙ²] = (interval length) × 1² = 1/2^m → 0. This proves L² convergence to 0.
  • a.s. failure: Every ω ∈ [0,1] is covered by infinitely many intervals. So Xₙ(ω) = 1 infinitely often, meaning Xₙ(ω) does not converge to 0.
  • L² norm calculation: The L² norm is √(E[Xₙ²]) = 1/√(2^m). This decreases roughly as 1/√n, showing convergence.
  • Verification: Plotting E[Xₙ²] on a log scale shows a staircase that decays roughly like 1/n, confirming L² convergence.
  • a.s. counterexample: For any fixed ω, we count how many times Xₙ(ω) = 1. The count grows by about one per sweep, so it grows without bound as max_n increases and no pointwise limit exists.
  • Key insight: The stem plot shows Xₙ(ω) bouncing between 0 and 1 forever. L² convergence only means average behavior, not pointwise behavior.
import numpy as np
import matplotlib.pyplot as plt


def typewriter_sequence(n):
    """
    The 'typewriter' counterexample.

    X_n(ω) = 1 on interval [k/2^m, (k+1)/2^m] where n = 2^m + k
    X_n(ω) = 0 elsewhere

    This sequence:
    - DOES converge in L² to 0 (E[X_n²] = 1/2^m → 0)
    - Does NOT converge almost surely (every ω hit infinitely often)

    Returns the (start, end) endpoints of the interval for index n.
    """
    if n == 0:
        return (0, 1)

    # floor(log2(n)) computed on integers, avoiding floating-point error
    m = int(n).bit_length() - 1
    power_2m = 2 ** m
    k = n - power_2m

    interval_size = 1 / power_2m
    start = k * interval_size
    end = (k + 1) * interval_size

    return (start, end)


def compute_l2_norm(n):
    """
    L² norm of X_n: sqrt(E[X_n²]) = sqrt(interval_length) = 1/sqrt(2^m)
    """
    if n == 0:
        return 1.0
    m = int(n).bit_length() - 1
    return 1 / np.sqrt(2 ** m)


def verify_l2_convergence(max_n=100):
    """
    Verify L² convergence: E[X_n²] → 0
    """
    ns = range(1, max_n + 1)
    l2_norms = [compute_l2_norm(n) for n in ns]
    l2_squared = [x**2 for x in l2_norms]

    plt.figure(figsize=(10, 5))
    plt.semilogy(ns, l2_squared, 'b-', linewidth=2)
    plt.xlabel('n')
    plt.ylabel('E[Xₙ²]')
    plt.title('L² Convergence: E[Xₙ²] → 0')
    plt.grid(True, alpha=0.3)
    plt.show()

    return ns, l2_norms


def show_no_as_convergence(omega=0.3, max_n=50):
    """
    Show that for ANY ω, X_n(ω) = 1 for infinitely many n.
    Thus X_n does NOT converge to 0 almost surely!
    """
    values = []
    for n in range(1, max_n + 1):
        start, end = typewriter_sequence(n)
        # Half-open check so each sweep hits omega exactly once
        values.append(1 if start <= omega < end else 0)

    plt.figure(figsize=(12, 4))
    plt.stem(range(1, max_n + 1), values)
    plt.xlabel('n')
    plt.ylabel(f'Xₙ(ω={omega})')
    plt.title(f'Xₙ(ω) at ω={omega}: Hits 1 infinitely often → No a.s. convergence')
    plt.grid(True, alpha=0.3)
    plt.show()

    hits = sum(values)
    print(f"X_n(ω={omega}) = 1 for {hits} values of n in [1, {max_n}]")
    return values

Common Mistakes to Avoid


Practice Problems


Summary

Key Takeaways: Convergence Relationships

  1. The Hierarchy: a.s. ⇒ L² (conditional) ⇒ P ⇒ d, with a.s. ⇒ P always.
  2. Key Implications: L² ⇒ P via Chebyshev; a.s. ⇒ L² requires dominated convergence.
  3. Critical Counterexamples: Typewriter (L² ⇏ a.s.), Sliding bump (P ⇏ L²), Independent copies (d ⇏ P).
  4. Special Case: When X = c is a constant, d ⇒ P (crucial for LLN).
  5. Deep Learning: SGD typically achieves P or L² convergence; a.s. convergence requires special conditions.
  6. Practical Choice: Use the mode that matches your application: a.s. for pathwise, L² for MSE, P for bounds, d for asymptotics.
Final Thought: The convergence hierarchy is a map for navigating probabilistic guarantees. Knowing which implications hold—and which don't—is essential for correctly interpreting theoretical results and designing reliable algorithms. The counterexamples are not just curiosities; they protect us from false reasoning.