Chapter 2: Random Variables
Section 6 of 6

Mixed Random Variables
Learning Objectives

By the end of this section, you will:

  • Define what makes a random variable "mixed" — having both discrete (jump) and continuous (smooth) probability components
  • Identify real-world phenomena with mixed behavior: insurance claims, censored data, zero-inflated counts
  • Decompose a mixed RV: F(x) = \sum_i p_i \cdot \mathbf{1}(x \geq x_i) + \int_{-\infty}^{x} g(t)\,dt
  • Calculate probabilities and expected values for mixed distributions
  • Recognize why the CDF is the ONLY universal tool that works for discrete, continuous, AND mixed RVs
  • Apply to AI/ML: zero-inflated models, censored regression, mixture density networks
Where You'll Apply This Knowledge:
• Zero-Inflated Models (CTR prediction)
• Censored/Truncated Data (survival analysis)
• Insurance Claims Prediction
• Recommendation Systems (not-rated vs ratings)
• Mixture Density Networks
• Dropout & Stochastic Depth in DNNs

Historical Context

The Problem of "Weird" Distributions

In the early 20th century, statisticians encountered distributions that defied classification. Most notably, the insurance industry faced a puzzle: most policyholders claim $0, but actual claims can be any positive amount. This is neither purely discrete nor purely continuous!

Key Historical Developments:

  • Émile Borel (1909): First studied probability measures that aren't purely discrete or continuous
  • Andrey Kolmogorov (1933): Made CDF the fundamental object precisely because it handles ALL cases
  • William Feller (1950s): Popularized mixed distributions in his influential textbook
  • Modern Era (1980s-present): Zero-inflated models became standard in econometrics and biostatistics
Kolmogorov's Insight: Instead of defining probability through PMF (discrete-only) or PDF (continuous-only), define it through a single object—the CDF—that works for all random variables: discrete, continuous, and mixed!

The Puzzle: Neither Discrete Nor Continuous

Consider this real-world scenario that neither PMF nor PDF can fully describe:

Insurance Claims Example

Let X = insurance claim amount for a randomly selected policyholder.

Discrete Part (Atom)
P(X = 0) = 0.70

70% of policyholders file no claim

Continuous Part
X \mid X > 0 \sim \text{Exp}(\lambda = 0.01)

30% file claims with exponentially distributed amounts

⚠️ This is neither purely discrete nor purely continuous!

Why Standard Tools Fail

❌ PMF Cannot Work

PMF requires \sum_x p(x) = 1 over countable values.

But X > 0 can be any positive real number—uncountably many values!

❌ PDF Cannot Work

PDF requires P(X = x) = 0 for all x.

But P(X = 0) = 0.70 \neq 0! There's a discrete "atom" at zero.

✓ CDF Works Perfectly

The CDF gracefully handles both the discrete jump at 0 and the continuous accumulation for positive values:

F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 0.70 + 0.30(1 - e^{-0.01x}) & \text{if } x \geq 0 \end{cases}
Key Insight: This is why Kolmogorov chose the CDF as the fundamental object in probability theory. The CDF is universal—it works for discrete, continuous, AND mixed random variables!
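The piecewise CDF above is easy to check numerically. A minimal sketch (the helper name `claims_cdf` and its default parameters are ours, not from the chapter's code):

```python
import numpy as np

def claims_cdf(x, p_zero=0.7, lam=0.01):
    """Piecewise CDF: 0 for x < 0, else p_zero + (1 - p_zero)(1 - e^{-lam*x})."""
    x = np.asarray(x, dtype=float)
    smooth = p_zero + (1 - p_zero) * (1 - np.exp(-lam * np.maximum(x, 0.0)))
    return np.where(x < 0, 0.0, smooth)

print(claims_cdf(-10))   # 0.0 -- no probability below zero
print(claims_cdf(0))     # 0.7 -- the jump (atom) at zero
print(claims_cdf(1e6))   # ~1.0 -- probability accumulates to 1
```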

Interactive: Mixed RV Visualizer

Explore the insurance claims example interactively. See how the CDF combines a discrete jump at zero with a smooth exponential curve for positive values.



Formal Definition

Definition: Mixed Random Variable

A random variable X is mixed if its CDF F(x) has:

  1. Jump discontinuities (discrete "atoms") at some points x_1, x_2, \ldots
  2. Continuous, differentiable portions elsewhere

The CDF can be decomposed as:

F(x) = \underbrace{\sum_{x_i \leq x} P(X = x_i)}_{\text{discrete jumps}} + \underbrace{\int_{-\infty}^{x} g(t)\,dt}_{\text{continuous part}}

Symbol Reference

Symbol | Name | Intuitive Meaning
F(x) | CDF of mixed RV | Total probability ≤ x (jumps + smooth part)
pᵢ = P(X = xᵢ) | Probability atom | Discrete probability mass at point xᵢ
g(t) | Continuous component | The "density" between atoms (may not integrate to 1)
1(x ≥ xᵢ) | Indicator function | 1 if x ≥ xᵢ, else 0
F(x) - F(x⁻) | Jump size | Discrete probability at x

The Three Types of Random Variables

With mixed random variables, we complete the picture. There are exactly three types:

Property | Discrete | Continuous | Mixed
CDF shape | Pure staircase | Smooth curve | Staircase + curve
P(X = x) | > 0 for some x | = 0 for all x | > 0 at atoms, = 0 elsewhere
Described by | PMF p(x) | PDF f(x) | CDF F(x) only!
Jump points | Every possible value | None | Selected atoms
Examples | Dice, counts | Heights, times | Claims, ratings
The CDF is Universal: While PMF only works for discrete RVs and PDF only works for continuous RVs, the CDF works for all random variables. This is why it's the foundation of probability theory!

Interactive: Three Types Comparison

See the three types of random variables side by side. Notice how the CDF handles each case seamlessly!



Common Patterns of Mixed Distributions

Mixed distributions appear frequently in real-world data. Here are the most common patterns:

Pattern A: "Atom at Zero" (Zero-Inflated)

X = \begin{cases} 0 & \text{with probability } p \\ Y \sim \text{Continuous} & \text{with probability } 1-p \end{cases}

Examples: Insurance claims (no claim vs positive), rainfall (no rain vs amount), customer spending (non-buyer vs spending amount), click-through (no click vs engagement time)

Pattern B: "Atoms at Boundaries" (Censored/Truncated)

X = \begin{cases} L & \text{if true value} < L \\ \text{true value} & \text{if } L \leq \text{true value} \leq U \\ U & \text{if true value} > U \end{cases}

Examples: Sensor readings at detection limits, credit scores (300-850), survey responses with ceiling/floor, grades bounded by 0-100
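Boundary atoms like these can be simulated in a few lines. A sketch under assumed parameters (a hypothetical sensor whose true values follow Normal(5, 3), with readings clipped to [0, 10]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor: true values ~ Normal(5, 3), readings clipped to [0, 10]
true_values = rng.normal(loc=5.0, scale=3.0, size=100_000)
readings = np.clip(true_values, 0.0, 10.0)

# Clipping turns each tail into an atom at the boundary
p_at_low = (readings == 0.0).mean()    # ≈ P(true value < 0) ≈ 0.048
p_at_high = (readings == 10.0).mean()  # ≈ P(true value > 10) ≈ 0.048
print(p_at_low, p_at_high)
```

The clipped readings are a mixed RV: atoms at 0 and 10, continuous in between.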

Pattern C: "Discrete + Continuous Mixture"

X = \begin{cases} \text{Category 1 value} & \text{with prob } p_1 \\ \text{Category 2 value} & \text{with prob } p_2 \\ Z \sim \text{Continuous} & \text{with prob } 1-p_1-p_2 \end{cases}

Examples: Product ratings (not-rated vs stars vs quality score), customer lifetime value (churned vs active with spend), medical diagnosis


Interactive: Atom Pattern Explorer

Explore each pattern interactively. Adjust parameters to see how the CDF changes for zero-inflated, censored, and mixture distributions.



CDF for Mixed Random Variables

The CDF of a mixed random variable combines discrete jumps with continuous accumulation:

General CDF Formula for Mixed RVs

F(x) = \sum_{x_i \leq x} P(X = x_i) + \int_{-\infty}^{x} g(t)\,dt
Σ discrete jumps up to x
+ continuous area up to x

Properties Still Hold!

F(-\infty) = 0 and F(+\infty) = 1
F is non-decreasing
F is right-continuous
Jumps at atoms: F(x_i) - F(x_i^-) = P(X = x_i)
Between atoms: F is smooth and differentiable

Computing Probabilities

The same CDF formulas work for mixed random variables:

At Most x
P(Xleqx)=F(x)P(X leq x) = F(x)
Greater Than x
P(X>x)=1F(x)P(X > x) = 1 - F(x)
In an Interval
P(a<Xleqb)=F(b)F(a)P(a < X leq b) = F(b) - F(a)
Exactly x (at an atom)
P(X=xi)=F(xi)F(xi)=extjumpsizeP(X = x_i) = F(x_i) - F(x_i^-) = ext{jump size}

Worked Example: Insurance Claims

Let X = claim amount where P(X = 0) = 0.7 and for x > 0: continuous Exp(0.01) scaled by 0.3.

F(x) = \begin{cases} 0 & x < 0 \\ 0.7 + 0.3(1 - e^{-0.01x}) & x \geq 0 \end{cases}

Compute P(X = 0): F(0) - F(0⁻) = 0.7 - 0 = 0.7

Compute P(X ≤ 100): 0.7 + 0.3(1 - e⁻¹) ≈ 0.7 + 0.190 = 0.890

Compute P(X > 200): 1 - F(200) = 1 - [0.7 + 0.3(1 - e⁻²)] ≈ 0.041

Compute P(0 < X ≤ 100): F(100) - F(0) = 0.890 - 0.7 = 0.190
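All four computations can be verified directly. A throwaway helper `F` (our name), mirroring the piecewise CDF above:

```python
import numpy as np

p_zero, lam = 0.7, 0.01

def F(x):
    """CDF of the claims example: atom of 0.7 at zero plus 0.3 * Exp(0.01)."""
    return 0.0 if x < 0 else p_zero + (1 - p_zero) * (1 - np.exp(-lam * x))

print("P(X = 0)        = %.4f" % F(0))             # 0.7000 (the atom)
print("P(X <= 100)     = %.4f" % F(100))           # 0.8896
print("P(X > 200)      = %.4f" % (1 - F(200)))     # 0.0406
print("P(0 < X <= 100) = %.4f" % (F(100) - F(0)))  # 0.1896
```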


Interactive: Probability Calculator

Practice computing probabilities for mixed distributions. Drag the interval bounds and see how the CDF handles discrete jumps.



Expected Value for Mixed RVs

The expected value of a mixed random variable has contributions from both parts:

Expected Value Formula

E[X] = \underbrace{\sum_i x_i \cdot P(X = x_i)}_{\text{discrete contribution}} + \underbrace{\int x \cdot g(x)\,dx}_{\text{continuous contribution}}

Worked Example

For insurance claims X with P(X = 0) = 0.7 and X|X>0 ~ Exp(0.01):

E[X] = 0 \cdot 0.7 + \int_0^{\infty} x \cdot 0.3 \cdot 0.01 \, e^{-0.01x}\,dx
= 0 + 0.3 \cdot E[\text{Exp}(0.01)] = 0.3 \cdot 100 = \$30

Interpretation: The average claim is $30, even though 70% of claims are $0! The few large claims pull up the average significantly.

Intuition: E[X] is a weighted average where atoms contribute their value × probability, and the continuous part contributes its weighted integral.
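The $30 result can also be checked by Monte Carlo: draw a claim indicator with probability 0.3, then an Exp(0.01) amount (mean 100) for claimants. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

is_claim = rng.random(n) < 0.3                  # 30% file a claim
amounts = rng.exponential(scale=100.0, size=n)  # Exp(0.01) has mean 100
samples = np.where(is_claim, amounts, 0.0)      # non-claimants get exactly 0

print(samples.mean())  # close to 0.3 * 100 = 30
```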

Interactive: Expected Value Demo

See how the expected value is computed for mixed distributions. Watch the discrete and continuous contributions combine.



Real-World Examples

🏥 Medical Costs

Pattern: Many patients have $0 cost, others have continuous distribution

Atom: P(X = 0) = healthy population

ML Use: Healthcare cost prediction, fraud detection

🌧️ Daily Rainfall

Pattern: Many days have 0mm, rainy days have continuous amount

Atom: P(X = 0) = probability of dry day

ML Use: Weather forecasting, agriculture planning

🛒 Customer Spending

Pattern: Non-buyers at $0, buyers with continuous spend

Atom: P(X = 0) = non-conversion rate

ML Use: LTV prediction, recommendation systems

🔋 Sensor Readings

Pattern: Atoms at min/max limits, continuous in between

Atoms: P(X = L), P(X = U) at sensor limits

ML Use: Anomaly detection, sensor fusion


AI/ML Applications

Mixed distributions are fundamental to modern machine learning. Here are key applications:

1. Zero-Inflated Neural Networks

Problem: Predicting quantities where many values are exactly zero

Solution: Two-headed model: classifier (is it zero?) + regressor (if not, how much?)

Applications: Click prediction, demand forecasting, medical diagnosis
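One way to sketch the two-headed idea: combine a classifier head giving P(Y = 0 | x) with a regressor head giving E[Y | Y > 0, x] into one expected-value prediction. The heads below are made-up stand-ins (a logistic and a linear function in a single feature), not a trained model:

```python
import numpy as np

def p_zero_head(x):
    """Hypothetical classifier head: P(Y = 0 | x)."""
    return 1.0 / (1.0 + np.exp(x - 2.0))

def mu_head(x):
    """Hypothetical regressor head: E[Y | Y > 0, x]."""
    return 20.0 * x + 5.0

def predict_expected_value(x):
    """Combine the heads: E[Y | x] = (1 - P(Y = 0 | x)) * E[Y | Y > 0, x]."""
    return (1.0 - p_zero_head(x)) * mu_head(x)

print(predict_expected_value(2.0))  # 0.5 * 45.0 = 22.5
```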

2. Censored Regression (Tobit Models)

Problem: True values are censored at bounds (sensor limits, survey scales)

Solution: Model combines censoring probability with truncated distribution

Applications: Survival analysis, credit risk, time-to-event prediction

3. Mixture Density Networks (MDN)

Problem: Output can be discrete category + continuous value

Solution: Network outputs mixture weights for discrete and Gaussian parameters

Applications: Multi-modal prediction, inverse problems, generative models

4. Dropout as Mixed Distribution

Insight: Dropout can be viewed as a mixed distribution!

X = \begin{cases} 0 & \text{with probability } p \text{ (dropped)} \\ a/(1-p) & \text{with probability } 1-p \text{ (scaled activation)} \end{cases}

This is a zero-inflated scaled activation—a mixed random variable!
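Under the inverted-dropout convention (drop with probability p, scale survivors by 1/(1-p)), the mean activation is preserved: E[X] = (1-p) · a/(1-p) = a. A quick simulation sketch with arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(7)
a, p = 1.5, 0.4   # activation value, drop probability
n = 1_000_000

kept = rng.random(n) >= p             # survive with probability 1 - p
x = np.where(kept, a / (1 - p), 0.0)  # inverted-dropout scaling

print(x.mean())  # close to a = 1.5: the scaling preserves E[X]
```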

5. VAE with Discrete + Continuous Latents

Architecture: Discrete latent (cluster) + Continuous latent (variation)

Training: Gumbel-Softmax for discrete, reparameterization for continuous

Applications: Controllable generation, disentangled representations


Interactive: Zero-Inflated ML Demo

See zero-inflated regression in action. Watch how the model learns to separate the zero-probability from the continuous distribution.



Python Implementation

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# ============================================
# EXAMPLE 1: Zero-Inflated Distribution
# ============================================

class ZeroInflatedExponential:
    """Zero-inflated exponential distribution for insurance claims."""

    def __init__(self, p_zero, lambda_):
        self.p_zero = p_zero
        self.lambda_ = lambda_
        self.exp = stats.expon(scale=1 / lambda_)

    def cdf(self, x):
        """CDF: F(x) = p_zero * 1(x >= 0) + (1 - p_zero) * F_exp(x)"""
        x = np.asarray(x, dtype=float)
        # np.where handles scalar and array inputs alike
        return np.where(
            x >= 0,
            self.p_zero + (1 - self.p_zero) * self.exp.cdf(np.maximum(x, 0)),
            0.0,
        )

    def pmf_at_zero(self):
        """P(X = 0) = jump at zero"""
        return self.p_zero

    def expected_value(self):
        """E[X] = 0 * p_zero + (1 - p_zero) * E[Exp(lambda)]"""
        return (1 - self.p_zero) * (1 / self.lambda_)

    def prob_greater(self, x):
        """P(X > x) = 1 - F(x)"""
        return 1 - self.cdf(x)

# Create and analyze
claims = ZeroInflatedExponential(p_zero=0.7, lambda_=0.01)

print("=== Zero-Inflated Exponential (Insurance Claims) ===")
print("P(X = 0) = %.3f" % claims.pmf_at_zero())
print("P(X <= 100) = %.4f" % claims.cdf(100))
print("P(X > 200) = %.4f" % claims.prob_greater(200))
print("E[X] = $%.2f" % claims.expected_value())

# ============================================
# EXAMPLE 2: Compute probabilities
# ============================================

# P(0 < X <= 100) = F(100) - F(0)
p_interval = claims.cdf(100) - claims.cdf(0)
print("\nP(0 < X <= 100) = %.4f" % p_interval)

# P(X = 0) = F(0) - F(0-) = jump at 0
p_zero = claims.cdf(0) - 0  # F(0-) = 0 for this distribution
print("P(X = 0) = %.4f" % p_zero)

# ============================================
# EXAMPLE 3: PyTorch Zero-Inflated Loss
# ============================================

import torch
import torch.nn as nn

class ZeroInflatedLoss(nn.Module):
    """Loss function for zero-inflated regression."""

    def forward(self, p_zero, mu, sigma, target):
        eps = 1e-8
        is_zero = (target == 0).float()

        # Log-likelihood for zeros
        ll_zero = is_zero * torch.log(p_zero + eps)

        # Log-likelihood for positive values
        dist = torch.distributions.Normal(mu, sigma)
        ll_pos = (1 - is_zero) * (
            torch.log(1 - p_zero + eps) + dist.log_prob(target)
        )

        # Negative log-likelihood
        nll = -(ll_zero + ll_pos)
        return nll.mean()

# ============================================
# EXAMPLE 4: Sampling from Mixed Distribution
# ============================================

def sample_mixed_rv(n_samples, p_zero, continuous_dist):
    """Sample from zero-inflated distribution."""
    # Decide: zero or positive?
    is_positive = np.random.binomial(1, 1 - p_zero, n_samples)

    # Generate continuous values for positive cases
    continuous_samples = continuous_dist.rvs(n_samples)

    # Combine: zero when dropped, continuous otherwise
    return is_positive * continuous_samples

# Generate samples
samples = sample_mixed_rv(
    n_samples=10000,
    p_zero=0.7,
    continuous_dist=stats.expon(scale=100)
)

print("\n=== Sampling Results ===")
print("Sample mean: %.2f" % samples.mean())
print("Empirical P(X=0): %.3f" % (samples == 0).mean())
print("Theoretical P(X=0): 0.700")
print("Theoretical E[X]: %.2f" % (0.3 * 100))  # (1 - 0.7) * 100

# ============================================
# EXAMPLE 5: Inverse Transform Sampling
# ============================================

# For zero-inflated exponential with CDF:
# F(x) = 0.7 + 0.3 * (1 - exp(-0.01*x)) for x >= 0

def inverse_cdf_mixed(u, p_zero, lambda_):
    """Inverse CDF for zero-inflated exponential."""
    if u <= p_zero:
        return 0  # Atom at zero
    # Solve: u = p_zero + (1 - p_zero) * (1 - exp(-lambda*x))
    # => exp(-lambda*x) = 1 - (u - p_zero) / (1 - p_zero)
    # => x = -ln(1 - (u - p_zero) / (1 - p_zero)) / lambda
    v = (u - p_zero) / (1 - p_zero)
    return -np.log(1 - v) / lambda_

# Generate using inverse transform
u_samples = np.random.uniform(0, 1, 10000)
x_samples = np.array([inverse_cdf_mixed(u, 0.7, 0.01) for u in u_samples])
print("\n=== Inverse Transform Sampling ===")
print("Mean from inverse transform: %.2f" % x_samples.mean())
```

Common Pitfalls

Pitfall 1: Trying to Write a PDF for Mixed RVs

The PDF doesn't exist at atoms! You'd need infinite density at jump points (a Dirac delta function). Use the CDF or decompose into discrete + continuous parts.

Pitfall 2: Forgetting the Discrete Contribution to E[X]

E[X] \neq \int x \cdot f(x)\,dx for mixed RVs! You must add \sum_i x_i \cdot P(X = x_i) for the atoms.

Pitfall 3: Endpoint Confusion for Intervals

P(a \leq X \leq b) \neq P(a < X \leq b) when a is an atom! For mixed RVs, be precise about whether endpoints are included.

Pitfall 4: Assuming Atoms Are Obvious from Data

Rounded measurements can look discrete but aren't true atoms. True atoms have exact repeated values with positive probability.
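One heuristic for telling true atoms from rounding artifacts: look at the largest fraction of the sample taken by any single exact value. A sketch with simulated data (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

rounded = np.round(rng.normal(50.0, 10.0, n), 1)     # continuous data, rounded
zero_inflated = np.where(rng.random(n) < 0.3, 0.0,
                         rng.exponential(100.0, n))  # true atom at 0

def largest_repeat_fraction(x):
    """Fraction of the sample held by the single most repeated exact value."""
    _, counts = np.unique(x, return_counts=True)
    return counts.max() / len(x)

# Rounding spreads repeats thinly; a true atom concentrates stable mass
print(largest_repeat_fraction(rounded))        # well under 1%
print(largest_repeat_fraction(zero_inflated))  # about 0.3
```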

Pitfall 5: Incorrectly Normalizing the Continuous Part

The "density" g(x) for mixed RVs doesn't integrate to 1. It integrates to 1 - \sum_i P(X = x_i).




Key Takeaways

  1. Mixed = Discrete + Continuous: Real-world data often has both jump (discrete atoms) and smooth (continuous) probability components.
  2. CDF is Universal: Only the CDF works for ALL random variables—discrete, continuous, AND mixed. This is why Kolmogorov chose it as the foundation.
  3. Common Pattern: "Atom at zero" appears everywhere: no purchase, no claim, no click, no rain—all are zeros with positive probability.
  4. Decomposition: F(x) = \sum \text{jumps} + \int \text{continuous}
  5. Expected Value: E[X] = \sum_i x_i \cdot P(X = x_i) + \int x \cdot g(x)\,dx—both parts contribute!
  6. AI/ML Essential: Zero-inflated models, censored regression, mixture density networks, and even dropout are all applications of mixed distributions.
  7. Practical Reality: Pure discrete or continuous models are idealized. Mixed models capture the true complexity of real-world data.
Chapter Complete! You now understand all three types of random variables: discrete (PMF), continuous (PDF), and mixed (CDF-only). The CDF unifies them all—and that's why it's the foundation of probability theory. In the next chapter, we'll explore Expected Value and how to compute the "center" of any distribution.