Chapter 2
Section 2 of 6

Probability Mass Functions

Random Variables

Learning Objectives

By the end of this section, you will:

  • Understand the PMF definition: p(x) = P(X = x), the probability that X takes value x
  • Verify the two essential properties: non-negativity and normalization
  • Calculate probabilities using the PMF: P(X ∈ A) = Σ p(x) for x ∈ A
  • Visualize distributions as bar charts and interpret their shapes
  • Recognize common PMFs: Bernoulli, Binomial, Poisson, Geometric
  • Apply PMFs to AI/ML: softmax outputs, language models, classification

Historical Context

From Counting Outcomes to Describing Distributions

In the early days of probability, mathematicians like Blaise Pascal and Pierre de Fermat (1654) solved gambling problems by counting favorable outcomes. But they faced a challenge: how do you systematically describe "how likely is each possible value?"

Abraham de Moivre (1667-1754) made a breakthrough in his 1718 work "The Doctrine of Chances" by systematically assigning probabilities to each outcome. This evolved into what we now call the Probability Mass Function.

Andrey Kolmogorov (1933) later formalized this in his axiomatic foundation of probability, establishing the PMF as the fundamental way to describe discrete distributions.

🎲
The Question
"How likely is each value?"
→
📊
The Answer
PMF: p(x) = P(X = x)

Why Do We Need the PMF?

In the previous section, we learned that a random variable X maps outcomes to numbers. But knowing what values X can take isn't enough; we need to know how likely each value is.

Consider rolling a fair die. We know X ∈ {1, 2, 3, 4, 5, 6}, but:

  • What is P(X = 3)? → We need to assign probabilities!
  • What is P(X ≤ 2)? → We need to combine probabilities!
  • What is the "average" value? → We need a complete probability description!
The Core Insight: The PMF is the complete probability description of a discrete random variable. It tells us exactly how probability is "distributed" (hence "distribution") across all possible values.

Think of the PMF as a probability allocation:

We have 1 unit of total probability to distribute across all possible values. The PMF tells us exactly how much each value gets.


Formal Definition

Definition: Probability Mass Function (PMF)

p_X(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}

For a discrete random variable X, the probability mass function p(x) gives the probability that X equals exactly x.

Symbol Reference

| Symbol | Name | Meaning |
| --- | --- | --- |
| p(x) or p_X(x) | PMF at x | Probability that X takes value x |
| P(X = x) | Event probability | Same as p(x), explicit notation |
| x | Value | A specific number in the range of X |
| R_X | Range/Support | Set of values where p(x) > 0 |

Why "Mass" Function?

The term "mass" comes from physics. Just as physical objects have mass concentrated at specific points, a discrete random variable has probability "mass" concentrated at specific values. Later, we'll contrast this with continuous distributions where probability is spread as "density."

πŸ“ Discrete: Probability Mass

Probability concentrated at specific points. p(x) gives actual probability at each x.

🌊 Continuous: Probability Density

Probability spread continuously. f(x) gives density, not probability. (Next section!)


Two Essential Properties

A valid PMF must satisfy exactly two properties. These ensure the probabilities make sense and account for all possibilities:

1. Non-negativity

p(x) \geq 0 \quad \text{for all } x

Intuition: You can't have negative probability. A -10% chance of something makes no sense!

2. Normalization

\sum_{x \in R_X} p(x) = 1

Intuition: All probabilities must sum to 1 (100%). We account for every possibility!
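Both checks reduce to one small function in code; here is a minimal sketch (the name `is_valid_pmf` is our own, not standard library API):

```python
import math

def is_valid_pmf(probabilities, tol=1e-9):
    """Check the two PMF properties: non-negativity and normalization."""
    non_negative = all(p >= 0 for p in probabilities)
    normalized = math.isclose(sum(probabilities), 1.0, abs_tol=tol)
    return non_negative and normalized

print(is_valid_pmf([1/6] * 6))    # True: a fair die is a valid PMF
print(is_valid_pmf([0.5, 0.6]))   # False: sums to 1.1, violates normalization
print(is_valid_pmf([-0.2, 1.2]))  # False: negative mass, violates non-negativity
```

Note the tolerance: floating-point sums rarely hit 1.0 exactly, so an exact equality check would reject perfectly good PMFs.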

Try the interactive demo below to see what happens when these properties are violated:


Visualizing PMFs

The most common way to visualize a PMF is with a bar chart (or stem plot). Each bar's height represents the probability of that value. Explore different distributions below:

Reading a PMF Chart

  • X-axis: Possible values of the random variable
  • Y-axis: Probability p(x) = P(X = x)
  • Bar height: How likely that value is
  • Total area: Sum of all bar heights = 1
Shape Matters: The shape of a PMF tells you about the distribution. Symmetric? Skewed left/right? Multiple modes? These visual features encode important statistical properties.
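To reproduce this kind of chart yourself, a minimal matplotlib sketch of the fair-die PMF looks like this (the Agg backend and the output filename are incidental choices):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

values = [1, 2, 3, 4, 5, 6]  # possible values on the x-axis
probs = [1/6] * 6            # p(x) on the y-axis

fig, ax = plt.subplots()
ax.bar(values, probs, width=0.6)
ax.set_xlabel("x (value of X)")
ax.set_ylabel("p(x) = P(X = x)")
ax.set_title("PMF of a fair die")
fig.savefig("die_pmf.png")
```

Every bar has the same height (1/6), the visual signature of a uniform distribution.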

Computing Probabilities

The power of the PMF is that once you have it, you can compute any probability involving X. The key formula is:

Probability of an Event

P(X \in A) = \sum_{x \in A} p(x)

To find the probability that X is in set A, sum the PMF values for all x in A.

Common Probability Calculations

| Query | Formula | Example (fair die) |
| --- | --- | --- |
| P(X = k) | p(k) | P(X = 3) = 1/6 |
| P(X ≤ k) | Σ p(x) for x ≤ k | P(X ≤ 2) = 1/6 + 1/6 = 1/3 |
| P(X ≥ k) | Σ p(x) for x ≥ k | P(X ≥ 5) = 1/6 + 1/6 = 1/3 |
| P(a ≤ X ≤ b) | Σ p(x) for a ≤ x ≤ b | P(2 ≤ X ≤ 4) = 3/6 = 1/2 |
| P(X ∈ {a, b, c}) | p(a) + p(b) + p(c) | P(X ∈ {1, 6}) = 2/6 = 1/3 |
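Every row of the table is the same operation: sum PMF values over the values that satisfy a condition. A small sketch (the helper name `prob` is illustrative):

```python
# PMF of a fair die as a dictionary: value -> probability
die_pmf = {x: 1/6 for x in range(1, 7)}

def prob(pmf, condition):
    """P(X in A): sum the PMF over all values x satisfying the condition."""
    return sum(p for x, p in pmf.items() if condition(x))

print(prob(die_pmf, lambda x: x == 3))       # P(X = 3)       = 1/6
print(prob(die_pmf, lambda x: x <= 2))       # P(X <= 2)      = 1/3
print(prob(die_pmf, lambda x: 2 <= x <= 4))  # P(2 <= X <= 4) = 1/2
print(prob(die_pmf, lambda x: x in {1, 6}))  # P(X in {1, 6}) = 1/3
```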

Try the calculator below to compute probabilities interactively:


Common PMFs Gallery

Certain PMFs appear so frequently that they have names. Each models a specific type of discrete random phenomenon. Explore them below:

Quick Reference: When to Use Each PMF

| Distribution | Use When... | Example |
| --- | --- | --- |
| Bernoulli(p) | Single trial with two outcomes | Coin flip, pass/fail |
| Binomial(n, p) | Count of successes in n trials | # heads in 10 flips |
| Poisson(λ) | Count of events in fixed time/space | # emails per hour |
| Geometric(p) | # trials until first success | # flips until first head |
| Uniform{1, ..., n} | All values equally likely | Fair die roll |
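All five named distributions are available in scipy.stats; a quick sketch of the constructors (the parameter values are arbitrary examples):

```python
from scipy import stats

# One constructor per row of the table above
bern  = stats.bernoulli(p=0.5)        # single trial with two outcomes
binom = stats.binom(n=10, p=0.5)      # successes in n trials
pois  = stats.poisson(mu=3)           # events in a fixed window
geom  = stats.geom(p=0.5)             # trials until first success (support 1, 2, ...)
unif  = stats.randint(low=1, high=7)  # uniform on {1, ..., 6}; high is exclusive

print(bern.pmf(1))   # P(success) = 0.5
print(binom.pmf(5))  # P(exactly 5 heads in 10 fair flips) = 252/1024
print(pois.pmf(0))   # P(zero events when the rate is 3)
print(geom.pmf(1))   # P(first success on trial 1) = 0.5
print(unif.pmf(3))   # P(die shows 3) = 1/6
```

Each frozen distribution also exposes `.cdf`, `.mean()`, and `.rvs()` for cumulative probabilities, moments, and sampling.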

Real-World Applications

🎰 Games & Gambling

Craps dice roll: PMF of sum of two dice

Casino games use PMFs to calculate house edge. The PMF of the sum of two dice is triangular with P(7) = 6/36 being the mode.
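The triangular shape is easy to verify by enumerating all 36 equally likely outcomes; a minimal sketch using exact fractions:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely (die1, die2) outcomes and tally each sum
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
sum_pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(sum_pmf[7])                  # 1/6: the mode, from 6 of the 36 outcomes
print(sum_pmf[2], sum_pmf[12])     # 1/36 at each tail (triangular shape)
print(sum(sum_pmf.values()) == 1)  # True: the PMF normalizes
```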

🏭 Quality Control

Defects per batch: Poisson or Binomial PMF

Manufacturing uses PMFs to set acceptance criteria. If X ~ Poisson(2), then P(X ≤ 3) ≈ 0.857 gives the acceptance probability.
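The quoted acceptance probability is a single scipy call; a quick check:

```python
from scipy import stats

# X ~ Poisson(2): defects per batch; accept the batch if X <= 3
defects = stats.poisson(mu=2)
p_accept = defects.cdf(3)   # P(X <= 3) = sum of the PMF at 0, 1, 2, 3
print(round(p_accept, 3))   # 0.857
```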

📱 Customer Service

Calls per minute: Poisson PMF

Call centers model arrival rates with Poisson distributions to staff appropriately and minimize wait times.

🧬 Genetics

Allele counts: Binomial PMF

Hardy-Weinberg equilibrium uses binomial distributions to predict genotype frequencies in populations.


AI/ML Applications

PMFs are everywhere in machine learning. If you work with classification, language models, or reinforcement learning, you're working with PMFs constantly.

1. Classification and Softmax

For K-class classification:

p(y = k \mid \mathbf{x}) = \text{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}

The softmax output IS a PMF over the K classes! It's non-negative and sums to 1.

2. Language Models

Next token prediction:

p(\text{next token} = t \mid \text{context}) = \text{PMF over vocabulary } V

GPT-4, Claude, and all language models output a PMF over the vocabulary (often 50,000+ tokens) at each position. Sampling strategies (greedy, top-k, nucleus) operate on this PMF.
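Those sampling strategies are just transformations of the output PMF. Here is a simplified sketch over a toy 5-token vocabulary (real implementations operate on logits and batched tensors, and handle ties):

```python
import numpy as np

def top_k_filter(pmf, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    out = np.zeros_like(pmf)
    top = np.argsort(pmf)[-k:]          # indices of the k largest probabilities
    out[top] = pmf[top]
    return out / out.sum()

def nucleus_filter(pmf, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(pmf)[::-1]       # tokens sorted by descending probability
    cum = np.cumsum(pmf[order])
    keep = np.searchsorted(cum, p) + 1  # number of tokens needed to cover mass p
    out = np.zeros_like(pmf)
    out[order[:keep]] = pmf[order[:keep]]
    return out / out.sum()

vocab_pmf = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_filter(vocab_pmf, k=2))      # mass only on the 2 most likely tokens
print(nucleus_filter(vocab_pmf, p=0.8))  # keeps tokens until 80% mass is covered
```

Both filters return a valid PMF again, which is what makes them composable with ordinary sampling.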

3. Reinforcement Learning

Policy as a PMF:

\pi(a \mid s) = P(A = a \mid \text{state} = s)

In discrete action spaces, the policy π is a PMF over actions for each state. Policy gradient methods optimize this PMF directly.
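Acting under such a policy just means sampling from its PMF; a minimal sketch with an illustrative 3-action policy (the probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy for one state: a PMF over 3 discrete actions
policy = np.array([0.7, 0.2, 0.1])  # pi(a | s) for a in {0, 1, 2}

# Acting under the policy = sampling actions from this PMF
actions = rng.choice(len(policy), size=10_000, p=policy)
freq = np.bincount(actions, minlength=3) / len(actions)
print(freq)  # empirical frequencies approach the policy PMF
```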

4. Attention Mechanism

Attention weights:

\alpha_{ij} = \text{softmax}\left( \frac{Q_i K_j^T}{\sqrt{d}} \right)_j

For each query, attention weights form a PMF over all keys. The model "attends" by distributing probability mass across positions.
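The PMF property of attention weights is easy to check numerically; a minimal single-query sketch (dimensions and random vectors are arbitrary):

```python
import numpy as np

def softmax(z):
    """Convert scores to a valid PMF (max-subtraction for stability)."""
    e = np.exp(z - z.max())
    return e / e.sum()

d = 4
rng = np.random.default_rng(1)
query = rng.normal(size=d)          # one query vector
keys = rng.normal(size=(5, d))      # five key vectors

scores = keys @ query / np.sqrt(d)  # scaled dot-product scores
weights = softmax(scores)           # attention weights over the 5 positions

print(weights)        # all non-negative
print(weights.sum())  # 1.0: a valid PMF over positions
```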


Python Implementation

🐍 python
import numpy as np
from scipy import stats

# ===== Define a PMF from scratch =====
def create_pmf(values, probabilities):
    """Create and validate a PMF as a dictionary."""
    assert all(p >= 0 for p in probabilities), "Probabilities must be non-negative"
    assert np.isclose(sum(probabilities), 1.0), "Probabilities must sum to 1"
    return dict(zip(values, probabilities))

# Fair die example
die_pmf = create_pmf(
    values=[1, 2, 3, 4, 5, 6],
    probabilities=[1/6] * 6
)

print("P(X = 3):", die_pmf[3])  # 0.1666...

# ===== Compute probabilities from the PMF =====
def prob_leq(pmf, k):
    """Compute P(X <= k)."""
    return sum(p for x, p in pmf.items() if x <= k)

def prob_in_set(pmf, values):
    """Compute P(X in values)."""
    return sum(pmf.get(x, 0) for x in values)

print("P(X <= 2):", prob_leq(die_pmf, 2))  # 0.3333...
print("P(X in {1,6}):", prob_in_set(die_pmf, {1, 6}))  # 0.3333...

# ===== Using scipy for standard distributions =====
# Binomial(n=10, p=0.3)
binomial = stats.binom(n=10, p=0.3)
print("\nBinomial(10, 0.3):")
print("P(X = 3):", binomial.pmf(3))
print("P(X <= 3):", binomial.cdf(3))

# Poisson(λ=4)
poisson = stats.poisson(mu=4)
print("\nPoisson(4):")
print("P(X = 5):", poisson.pmf(5))
print("P(X >= 3):", 1 - poisson.cdf(2))

# ===== Softmax: Creating a PMF in ML =====
def softmax(logits):
    """Convert logits to a valid PMF."""
    exp_logits = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp_logits / exp_logits.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("\nSoftmax PMF:", probs)
print("Sum:", probs.sum())  # 1.0

Common Pitfalls

❌ Confusing PMF with PDF

A PMF gives actual probabilities, so each value lies between 0 and 1. A PDF gives density, which can exceed 1 and is not itself a probability. They're fundamentally different!

❌ Forgetting to check normalization

Always verify Σ p(x) = 1. In code, use assert np.isclose(sum(probs), 1.0). Numerical errors can accumulate!

❌ Using PMF for continuous variables

PMF only applies to discrete random variables. For continuous RVs like height or temperature, use the probability density function (PDF) instead.

❌ Ignoring the support

Remember p(x) = 0 outside the support. For a die, p(7) = 0 and p(0.5) = 0. The PMF is only non-zero on valid values.


Test Your Understanding


Key Takeaways

  1. PMF = Complete Description: p(x) = P(X = x) tells us exactly how probability is distributed across all values of a discrete random variable.
  2. Two Properties: Valid PMFs have p(x) ≥ 0 (non-negativity) and Σ p(x) = 1 (normalization).
  3. Computing Probabilities: To find P(X ∈ A), sum the PMF values: P(X ∈ A) = Σ p(x) for x ∈ A.
  4. Visualization: Bar charts show PMFs clearly: height = probability, total area = 1.
  5. Common PMFs: Bernoulli (binary), Binomial (counts), Poisson (rare events), Geometric (waiting time).
  6. AI/ML Connection: Softmax outputs, language model predictions, RL policies, and attention weights are all PMFs!
Next Up: In the next section, we'll explore Continuous Random Variables, where probability is spread continuously rather than concentrated at discrete points. Instead of the PMF, we'll use the Probability Density Function (PDF).