Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Distinguish between confidence intervals and prediction intervals
• Explain why prediction intervals are always wider than confidence intervals
• Derive prediction intervals for normal data
• Understand tolerance intervals and their use cases

🔧 Practical Skills

• Compute prediction intervals in Python
• Apply prediction intervals to regression problems
• Choose appropriate intervals for different use cases
• Implement uncertainty quantification for ML models

🧠 Deep Learning Connections

• Prediction uncertainty: Distinguishing epistemic vs. aleatoric uncertainty
• Conformal prediction: Distribution-free prediction intervals
• Quantile regression: Direct interval estimation via neural networks
• Probabilistic forecasting: Time series prediction intervals

Where You'll Apply This: Sales forecasting, demand prediction, medical prognosis, quality control, weather forecasting, and any ML application where predicting individual outcomes (not just averages) is important.

Prediction vs. Confidence Intervals

One of the most common errors in applied statistics is confusing confidence intervals (for parameters) with prediction intervals (for future observations). They answer fundamentally different questions.

The Key Distinction

📊 Confidence Interval

Question: Where is the population mean μ likely to be?

Target: A fixed but unknown parameter

Uncertainty: Only from sampling variability of X̄

\bar{X} \pm t^* \cdot \frac{s}{\sqrt{n}}

🎯 Prediction Interval

Question: Where will the next observation Xₙ₊₁ fall?

Target: A future random variable

Uncertainty: Sampling variability + inherent randomness of Xₙ₊₁

\bar{X} \pm t^* \cdot s\sqrt{1 + \frac{1}{n}}

Key Insight: A confidence interval shrinks as n → ∞ (we learn μ exactly). A prediction interval does NOT shrink to zero — even with infinite data about μ, the next observation still has inherent randomness σ²!

Variance Components

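The prediction interval is wider because it accounts for two sources of uncertainty. Since the new observation Xₙ₊₁ is independent of the sample (and hence of X̄), their variances add: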

Variance Decomposition

\text{Var}(\bar{X} - X_{n+1}) = \text{Var}(\bar{X}) + \text{Var}(X_{n+1}) = \frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1 + \frac{1}{n}\right)

σ²/n: Estimation uncertainty (shrinks with more data)

σ²: Inherent variability (irreducible)
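The decomposition can be checked numerically. The following is a minimal Monte Carlo sketch, assuming normal data with illustrative values for μ, σ, and n (these numbers, the seed, and the variable names are assumptions chosen for the example, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 100.0, 15.0, 30, 100_000  # illustrative values

# Draw many samples of size n plus one independent "next" observation each
samples = rng.normal(mu, sigma, size=(reps, n))
x_bar = samples.mean(axis=1)
x_next = rng.normal(mu, sigma, size=reps)

# Empirical variance of the prediction error vs. sigma^2 * (1 + 1/n)
empirical_var = np.var(x_bar - x_next)
theoretical_var = sigma**2 * (1 + 1 / n)

print(f"empirical:   {empirical_var:.1f}")
print(f"theoretical: {theoretical_var:.1f}")
```

The two numbers should agree up to simulation noise, and the σ² part dominates: at n = 30 the estimation term σ²/n contributes only about 3% of the total.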

Aspect	Confidence Interval	Prediction Interval
Target	Population parameter μ	Next observation Xₙ₊₁
Variance factor	1/n	1 + 1/n
As n → ∞	Width → 0	Width → 2z*σ (irreducible)
Interpretation	95% of such intervals contain μ	95% of such intervals contain Xₙ₊₁
Use case	Estimating average effect	Predicting individual outcomes
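To see what goes wrong when the two intervals are confused, the sketch below simulates repeated sampling and records how often each interval covers its target. It assumes normal data and illustrative parameter values; all names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 100.0, 15.0, 30, 20_000  # illustrative values
t_crit = stats.t.ppf(0.975, df=n - 1)

samples = rng.normal(mu, sigma, size=(reps, n))
x_next = rng.normal(mu, sigma, size=reps)      # one future observation per sample
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)

ci_half = t_crit * s / np.sqrt(n)              # half-width of the CI for mu
pi_half = t_crit * s * np.sqrt(1 + 1 / n)      # half-width of the PI

ci_covers_mu = np.mean(np.abs(x_bar - mu) <= ci_half)
pi_covers_next = np.mean(np.abs(x_bar - x_next) <= pi_half)
ci_covers_next = np.mean(np.abs(x_bar - x_next) <= ci_half)  # the common mistake

print(f"CI covers mu:        {ci_covers_mu:.3f}")
print(f"PI covers X_(n+1):   {pi_covers_next:.3f}")
print(f"CI covers X_(n+1):   {ci_covers_next:.3f}")
```

The first two rates should sit near the nominal 95%, while using the confidence interval as if it were a prediction interval covers the next observation far less often.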

Prediction Intervals for Normal Data

Known Variance Case

If X₁, ..., Xₙ ~ N(μ, σ²) with σ² known, and we want to predict Xₙ₊₁:

Prediction Interval (σ known)

\bar{X} \pm z^* \cdot \sigma\sqrt{1 + \frac{1}{n}}

where z* is the appropriate standard normal quantile (e.g., 1.96 for 95%)

Derivation: The prediction error X̄ - Xₙ₊₁ is normally distributed:

\bar{X} - X_{n+1} \sim N\left(0, \sigma^2\left(1 + \frac{1}{n}\right)\right)

Standardizing gives a standard normal variable, leading to the interval.
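Written out, the standardizing step looks like this (a sketch using only the quantities defined above, with 1 − α the desired coverage):

```latex
Z = \frac{\bar{X} - X_{n+1}}{\sigma\sqrt{1 + \frac{1}{n}}} \sim N(0, 1)
\quad\Longrightarrow\quad
P\left(-z^* \le Z \le z^*\right) = 1 - \alpha

-z^* \sigma\sqrt{1 + \tfrac{1}{n}} \;\le\; \bar{X} - X_{n+1} \;\le\; z^* \sigma\sqrt{1 + \tfrac{1}{n}}
\quad\Longleftrightarrow\quad
\bar{X} - z^* \sigma\sqrt{1 + \tfrac{1}{n}} \;\le\; X_{n+1} \;\le\; \bar{X} + z^* \sigma\sqrt{1 + \tfrac{1}{n}}
```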

Unknown Variance Case

When σ² is unknown and estimated by s², we use the t-distribution:

Prediction Interval (σ unknown)

\bar{X} \pm t^*_{n-1} \cdot s\sqrt{1 + \frac{1}{n}}

where t*ₙ₋₁ is the appropriate t-quantile with n-1 degrees of freedom

Why t-distribution? When we estimate σ by s, the standardized prediction error follows a t-distribution, not a normal distribution. This accounts for the additional uncertainty from estimating the variance.
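The t-distribution arises from the usual ratio construction. For normal data, s² is independent of X̄, and it is also independent of the new observation Xₙ₊₁, so the following sketch applies:

```latex
\frac{\bar{X} - X_{n+1}}{\sigma\sqrt{1 + \frac{1}{n}}} \sim N(0, 1),
\qquad
\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}
\quad \text{(independent of each other)}

T = \frac{\bar{X} - X_{n+1}}{s\sqrt{1 + \frac{1}{n}}}
  = \frac{\left(\bar{X} - X_{n+1}\right) \big/ \left(\sigma\sqrt{1 + \frac{1}{n}}\right)}
         {\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}}
  \sim t_{n-1}
```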

Prediction Intervals in Regression

In regression, prediction intervals are especially important. We want to predict not just E[Y|X=x] (the conditional mean), but where an individual observation Y might fall.

Simple Linear Regression

For the model Y = β₀ + β₁X + ε with ε ~ N(0, σ²):

Prediction Interval for Y at X = x₀

\hat{y}_0 \pm t^*_{n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x_i - \bar{x})^2}}

Where:

ŷ₀ = β̂₀ + β̂₁x₀ is the predicted value
sₑ is the residual standard error
The square root term is the standard error of prediction

Notice the interval has three variance components:

1 (irreducible): Inherent noise in Y
1/n: Uncertainty in estimating the overall level of the line (the mean response at x = x̄)
(x₀ - x̄)² / Σ(xᵢ - x̄)²: Uncertainty from the slope, larger for x₀ far from x̄

Prediction intervals widen as x₀ moves away from x̄! Predicting at extreme values of X (extrapolation) gives much wider intervals than predicting near the center of the data.
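The widening can be made concrete by evaluating the square-root factor at several values of x₀. This is a small sketch on a hypothetical design (50 equally spaced points on [0, 10]); the function name is an assumption for illustration.

```python
import numpy as np

x = np.linspace(0, 10, 50)          # hypothetical observed predictor values
n = len(x)
x_bar = x.mean()
ss_x = np.sum((x - x_bar) ** 2)

def prediction_se_factor(x0: float) -> float:
    """Multiplier of s_e in the prediction interval at x0."""
    return float(np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x))

for x0 in [5.0, 8.0, 12.0, 20.0]:   # 12 and 20 lie outside the observed range
    print(f"x0 = {x0:5.1f}: factor = {prediction_se_factor(x0):.3f}")
```

The factor is smallest at x₀ = x̄ and grows with the squared distance from x̄, which is why extrapolated predictions carry wider intervals.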

Multiple Regression

For the model Y = Xβ + ε in matrix notation:

Prediction Interval (Matrix Form)

\hat{y}_0 \pm t^*_{n-p} \cdot s_e \sqrt{1 + x_0^T(X^TX)^{-1}x_0}

where p is the number of parameters (including intercept)
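The matrix formula translates almost line for line into NumPy. The sketch below is illustrative rather than production code: the function name and the simulated data are assumptions, and it inverts XᵀX directly for readability (a numerically safer route would use a least-squares solver).

```python
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x0, confidence=0.95):
    """Prediction interval at x0 for OLS.

    X  : (n, p) design matrix, including a column of ones for the intercept
    x0 : (p,) feature vector for the new point (same column layout as X)
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    s_e = np.sqrt(residuals @ residuals / (n - p))   # residual standard error
    y0_hat = float(x0 @ beta_hat)
    se_pred = s_e * np.sqrt(1 + x0 @ XtX_inv @ x0)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - p)
    return y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred

# Simulated example with one predictor (p = 2 including the intercept)
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 2, size=50)
X = np.column_stack([np.ones_like(x), x])

lower, upper = ols_prediction_interval(X, y, np.array([1.0, 12.0]))
print(f"95% PI at x = 12: [{lower:.2f}, {upper:.2f}]")
```

With a single predictor, x₀ᵀ(XᵀX)⁻¹x₀ reduces to 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)², so this agrees with the simple linear regression formula.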

Tolerance Intervals

A third type of interval, often confused with both confidence and prediction intervals, is the tolerance interval.

Tolerance Interval Definition

A (1-α, β) tolerance interval is an interval that, with confidence 1-α, contains at least proportion β of the population.

\bar{X} \pm k \cdot s

where k depends on n, α, and β (from tolerance tables)

Interval Type	What It Covers	Example Interpretation
Confidence	Parameter (μ)	95% confident μ is in [L, U]
Prediction	Next observation	95% confident the next observation falls in [L, U]
Tolerance	Proportion of population	95% confident that 90% of population is in [L, U]

Use case: Manufacturing quality control. "We are 95% confident that 99% of all products have measurements in this range."
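The "confidence about a proportion" reading can be verified by simulation. The sketch below assumes normal data and uses Howe's closed-form approximation for k in place of tolerance tables (an assumption made for this example; exact factors differ slightly). Parameter values and names are illustrative.

```python
import numpy as np
from scipy import stats

def howe_k(n: int, coverage: float, confidence: float) -> float:
    """Approximate two-sided normal tolerance factor (Howe's approximation)."""
    z_p = stats.norm.ppf((1 + coverage) / 2)
    chi2_low = stats.chi2.ppf(1 - confidence, df=n - 1)  # lower-tail quantile
    return float(z_p * np.sqrt((n - 1) * (1 + 1 / n) / chi2_low))

rng = np.random.default_rng(3)
mu, sigma, n, reps = 100.0, 15.0, 30, 20_000  # illustrative values
k = howe_k(n, coverage=0.90, confidence=0.95)

samples = rng.normal(mu, sigma, size=(reps, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)

# True proportion of the population inside each simulated interval
content = (stats.norm.cdf(x_bar + k * s, mu, sigma)
           - stats.norm.cdf(x_bar - k * s, mu, sigma))
frac_ok = np.mean(content >= 0.90)

print(f"k = {k:.3f}")
print(f"share of intervals containing at least 90% of the population: {frac_ok:.3f}")
```

The share should land close to the stated confidence of 95%. Note that k is noticeably larger than the 1.645 a known-parameter interval would use, because it must absorb the uncertainty in both X̄ and s.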

AI/ML Applications

🎯 Conformal Prediction

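Distribution-free prediction intervals for any ML model! Uses held-out calibration data to construct intervals with guaranteed marginal coverage in finite samples, assuming only that the calibration data and new data are exchangeable (for example, i.i.d.), with no normality or other parametric assumption.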
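A quick way to build intuition is to apply split conformal prediction to a model that is deliberately wrong, on noise that is not normal, and check the coverage empirically. Everything in this sketch is hypothetical: the "fitted" model, the data-generating process, and the sample sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def model(x):
    # Hypothetical, deliberately imperfect "fitted" model
    return 2.0 + 2.5 * x

def sample(n):
    # True process: different slope, heavy-tailed (Student-t) noise
    x = rng.uniform(0, 10, size=n)
    y = 2.0 + 3.0 * x + rng.standard_t(df=3, size=n)
    return x, y

confidence = 0.90

# Calibration: absolute residuals, sorted ascending
x_cal, y_cal = sample(500)
scores = np.sort(np.abs(y_cal - model(x_cal)))

# Use the k-th smallest score with k = ceil((n_cal + 1) * confidence)
k = int(np.ceil((len(scores) + 1) * confidence))   # needs k <= n_cal
q_hat = scores[k - 1]

# Fresh test data: interval is model(x) +/- q_hat
x_test, y_test = sample(20_000)
coverage = np.mean(np.abs(y_test - model(x_test)) <= q_hat)

print(f"q_hat = {q_hat:.2f}, empirical coverage = {coverage:.3f}")
```

Coverage should come out close to the 90% target even though the model is biased. The price of a poor model is a wider interval (larger q_hat), not lost coverage.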

📊 Quantile Regression

Train neural networks to predict quantiles (e.g., 5th and 95th percentiles) directly. Produces prediction intervals without distributional assumptions. Used in demand forecasting and risk estimation.
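Quantile regression works by minimizing the pinball (quantile) loss in place of squared error. The sketch below leaves the neural network out and only illustrates the loss itself: minimizing it over a constant prediction recovers the empirical quantile. The data and names are illustrative assumptions.

```python
import numpy as np

def pinball_loss(y: np.ndarray, q_pred: float, tau: float) -> float:
    """Mean pinball loss for quantile level tau in (0, 1)."""
    diff = y - q_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))

rng = np.random.default_rng(5)
y = rng.exponential(scale=10.0, size=5_000)   # skewed, non-normal data

grid = np.arange(0.0, 60.0, 0.05)             # candidate constant predictions
best_constant = {}
for tau in (0.05, 0.95):
    losses = [pinball_loss(y, c, tau) for c in grid]
    best_constant[tau] = float(grid[int(np.argmin(losses))])
    print(f"tau = {tau}: loss minimizer = {best_constant[tau]:.2f}, "
          f"empirical quantile = {np.quantile(y, tau):.2f}")
```

Training one model output with τ = 0.05 and another with τ = 0.95 therefore yields the endpoints of a nominal 90% prediction interval, which can be asymmetric when the data are skewed.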

🧠 Bayesian Neural Networks

Posterior predictive distribution naturally provides prediction intervals that account for both epistemic (model) and aleatoric (data) uncertainty.

📈 Time Series Forecasting

Prediction intervals are essential for forecasts: they communicate uncertainty to decision-makers. ARIMA and Prophet provide forecast intervals, and deep learning forecasters can produce them as well (for example, through quantile or distributional outputs).

Epistemic vs. Aleatoric Uncertainty

Epistemic (Model) Uncertainty

Uncertainty from not knowing the true model or parameters. Reducible with more data. Analogous to the σ²/n term.

Aleatoric (Data) Uncertainty

Inherent randomness in the data-generating process. Irreducible even with infinite data. Analogous to the σ² term.

Python Implementation

```python
import numpy as np
from scipy import stats
from typing import Tuple

# ============================================
# Prediction Intervals for Normal Data
# ============================================

def prediction_interval_normal(
    data: np.ndarray,
    confidence: float = 0.95
) -> Tuple[float, float, float]:
    """
    Compute prediction interval for next observation from normal data.

    Parameters
    ----------
    data : array
        Observed data (assumed normal)
    confidence : float
        Confidence level (e.g., 0.95 for 95%)

    Returns
    -------
    lower, upper, width : Prediction interval bounds and width
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)

    # t-quantile
    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha/2, df=n-1)

    # Standard error of prediction
    se_pred = s * np.sqrt(1 + 1/n)

    # Margin of error
    margin = t_crit * se_pred

    return x_bar - margin, x_bar + margin, 2 * margin


def confidence_interval_mean(
    data: np.ndarray,
    confidence: float = 0.95
) -> Tuple[float, float, float]:
    """
    Compute confidence interval for population mean.

    Parameters
    ----------
    data : array
        Observed data
    confidence : float
        Confidence level

    Returns
    -------
    lower, upper, width : CI bounds and width
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)

    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha/2, df=n-1)

    # Standard error of mean
    se_mean = s / np.sqrt(n)

    margin = t_crit * se_mean

    return x_bar - margin, x_bar + margin, 2 * margin


def compare_intervals(data: np.ndarray, confidence: float = 0.95) -> dict:
    """Compare CI for mean vs. prediction interval."""
    ci_lower, ci_upper, ci_width = confidence_interval_mean(data, confidence)
    pi_lower, pi_upper, pi_width = prediction_interval_normal(data, confidence)

    return {
        'n': len(data),
        'mean': np.mean(data),
        'std': np.std(data, ddof=1),
        'confidence_interval': {
            'lower': ci_lower,
            'upper': ci_upper,
            'width': ci_width
        },
        'prediction_interval': {
            'lower': pi_lower,
            'upper': pi_upper,
            'width': pi_width
        },
        'width_ratio': pi_width / ci_width
    }


# ============================================
# Prediction Intervals for Regression
# ============================================

def regression_prediction_interval(
    X: np.ndarray,
    y: np.ndarray,
    x_new: np.ndarray,
    confidence: float = 0.95
) -> dict:
    """
    Compute prediction interval for simple linear regression.

    Parameters
    ----------
    X : array of shape (n,)
        Predictor values
    y : array of shape (n,)
        Response values
    x_new : array
        New x values for prediction
    confidence : float
        Confidence level

    Returns
    -------
    dict : Predictions, confidence intervals, and prediction intervals
    """
    n = len(X)
    x_bar = np.mean(X)
    ss_x = np.sum((X - x_bar)**2)

    # Fit regression
    beta_1 = np.sum((X - x_bar) * (y - np.mean(y))) / ss_x
    beta_0 = np.mean(y) - beta_1 * x_bar

    # Residuals and residual standard error
    y_hat = beta_0 + beta_1 * X
    residuals = y - y_hat
    s_e = np.sqrt(np.sum(residuals**2) / (n - 2))

    # t-quantile
    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha/2, df=n-2)

    # Predictions
    y_new = beta_0 + beta_1 * x_new

    # Standard errors: mean response (CI) vs. individual response (PI)
    se_mean = s_e * np.sqrt(1/n + (x_new - x_bar)**2 / ss_x)
    se_pred = s_e * np.sqrt(1 + 1/n + (x_new - x_bar)**2 / ss_x)

    return {
        'x_new': x_new,
        'y_pred': y_new,
        'coefficients': {'intercept': beta_0, 'slope': beta_1},
        'residual_std_error': s_e,
        'confidence_interval': {
            'lower': y_new - t_crit * se_mean,
            'upper': y_new + t_crit * se_mean
        },
        'prediction_interval': {
            'lower': y_new - t_crit * se_pred,
            'upper': y_new + t_crit * se_pred
        }
    }


# ============================================
# Tolerance Interval
# ============================================

def tolerance_interval_normal(
    data: np.ndarray,
    coverage: float = 0.90,
    confidence: float = 0.95
) -> Tuple[float, float]:
    """
    Compute two-sided tolerance interval for normal data.

    A (confidence, coverage) tolerance interval contains at least
    'coverage' proportion of the population with 'confidence' confidence.

    Parameters
    ----------
    data : array
        Observed data (assumed normal)
    coverage : float
        Proportion of population to cover (e.g., 0.90)
    confidence : float
        Confidence level (e.g., 0.95)

    Returns
    -------
    lower, upper : Tolerance interval bounds
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)

    # Approximate k factor (two-sided, Howe's approximation)
    # This is a simplified approximation; exact values require special tables
    z_p = stats.norm.ppf((1 + coverage) / 2)

    # Lower-tail chi-square quantile: guards against s underestimating sigma,
    # so that k exceeds z_p and the interval is wide enough
    chi2_val = stats.chi2.ppf(1 - confidence, df=n-1)

    # Approximate k factor
    k = z_p * np.sqrt((n - 1) * (1 + 1/n) / chi2_val)

    return x_bar - k * s, x_bar + k * s


# ============================================
# Conformal Prediction (Simple Version)
# ============================================

def conformal_prediction_interval(
    cal_predictions: np.ndarray,
    cal_true: np.ndarray,
    new_prediction: float,
    confidence: float = 0.90
) -> Tuple[float, float]:
    """
    Compute distribution-free prediction interval using conformal prediction.

    Parameters
    ----------
    cal_predictions : array
        Model predictions on calibration set
    cal_true : array
        True values on calibration set
    new_prediction : float
        Model prediction for new point
    confidence : float
        Desired coverage level

    Returns
    -------
    lower, upper : Prediction interval bounds
    """
    # Compute nonconformity scores (absolute residuals), sorted ascending
    scores = np.sort(np.abs(cal_predictions - cal_true))

    # Rank of the order statistic needed for finite-sample coverage
    n_cal = len(scores)
    k = int(np.ceil((n_cal + 1) * confidence))

    # If k > n_cal the calibration set is too small for this level;
    # fall back to the largest score (the guarantee then no longer holds)
    k = min(k, n_cal)

    # k-th smallest score (no interpolation below the required rank)
    q_hat = scores[k - 1]

    return new_prediction - q_hat, new_prediction + q_hat


# ============================================
# Demonstration
# ============================================

if __name__ == "__main__":
    np.random.seed(42)

    print("=" * 60)
    print("PREDICTION INTERVALS vs CONFIDENCE INTERVALS")
    print("=" * 60)

    # Generate normal data
    mu_true = 100
    sigma_true = 15
    n = 30
    data = np.random.normal(mu_true, sigma_true, n)

    # Compare intervals
    comparison = compare_intervals(data, confidence=0.95)

    print(f"\nSample: n={comparison['n']}, mean={comparison['mean']:.2f}, std={comparison['std']:.2f}")
    print(f"True parameters: μ={mu_true}, σ={sigma_true}")

    print("\n--- Confidence Interval for Mean ---")
    ci = comparison['confidence_interval']
    print(f"  [{ci['lower']:.2f}, {ci['upper']:.2f}]")
    print(f"  Width: {ci['width']:.2f}")

    print("\n--- Prediction Interval for Next Observation ---")
    pi = comparison['prediction_interval']
    print(f"  [{pi['lower']:.2f}, {pi['upper']:.2f}]")
    print(f"  Width: {pi['width']:.2f}")

    print(f"\nPrediction interval is {comparison['width_ratio']:.1f}x wider!")

    # Show convergence behavior
    print("\n--- Width Ratio as n Increases ---")
    for n_test in [10, 30, 100, 500, 1000]:
        data_test = np.random.normal(mu_true, sigma_true, n_test)
        comp = compare_intervals(data_test)
        print(f"  n={n_test:4d}: CI width={comp['confidence_interval']['width']:.2f}, "
              f"PI width={comp['prediction_interval']['width']:.2f}, "
              f"ratio={comp['width_ratio']:.2f}")

    print("\n--- Regression Prediction Interval ---")
    # Generate regression data
    X = np.linspace(0, 10, 50)
    y = 2 + 3 * X + np.random.normal(0, 2, 50)

    # Predict at new points
    x_new = np.array([2, 5, 8, 12])  # Note: 12 is extrapolation!
    result = regression_prediction_interval(X, y, x_new)

    print(f"Regression: y = {result['coefficients']['intercept']:.2f} + "
          f"{result['coefficients']['slope']:.2f}x")
    print(f"Residual SE: {result['residual_std_error']:.2f}")
    print("\nPredictions with intervals:")
    for i, x in enumerate(x_new):
        pi = result['prediction_interval']
        ci = result['confidence_interval']
        print(f"  x={x:2d}: ŷ={result['y_pred'][i]:.1f}, "
              f"CI=[{ci['lower'][i]:.1f}, {ci['upper'][i]:.1f}], "
              f"PI=[{pi['lower'][i]:.1f}, {pi['upper'][i]:.1f}]")

    print("\n--- Tolerance Interval ---")
    tol_lower, tol_upper = tolerance_interval_normal(data, coverage=0.90, confidence=0.95)
    print(f"(0.95, 0.90) Tolerance Interval: [{tol_lower:.2f}, {tol_upper:.2f}]")
    print("Interpretation: 95% confident that at least 90% of population is in this interval")
```

Common Pitfalls

Summary

Key Takeaways

Prediction intervals quantify uncertainty about future observations, while confidence intervals quantify uncertainty about population parameters.
Prediction intervals are always wider because they account for both estimation uncertainty (reducible) and inherent variability (irreducible).
As n → ∞: Confidence intervals shrink to zero width, but prediction intervals converge to a minimum width of 2z*σ.
In regression: Prediction intervals widen for extrapolation (predicting at X values far from the data center).
Tolerance intervals answer a third question: what interval contains a specified proportion of the population with given confidence?
Modern ML: Conformal prediction provides distribution-free prediction intervals with guaranteed marginal coverage for any model, provided the calibration and test data are exchangeable.

Looking Ahead: In the next section, we'll explore Simultaneous Confidence Intervals — how to maintain correct coverage when making multiple inferences at once, addressing the multiple testing problem through Bonferroni and related corrections.