Learning Objectives
By the end of this section, you will be able to:
- Connect the Categorical (one trial, k outcomes) to the Multinomial (n trials, k outcomes) and recover Binomial as the k = 2 special case.
- Derive the PMF with its multinomial coefficient and visualize probability vectors on the simplex.
- Compute means, variances, and the negative covariances that enforce the constraint that counts must sum to n.
- Explain why minimizing cross-entropy is equivalent to maximizing Multinomial log-likelihood — the foundation of softmax classifiers.
- Apply Dirichlet-Multinomial smoothing to prevent zero probabilities and improve calibration in language models and classifiers.
- Use the Chi-Square goodness-of-fit test to check if observed counts match a hypothesized distribution.
- Know when Poisson or Normal approximations apply (rare events vs. large n), critical for NLP and A/B/n testing.
Big Picture: From Coins to Categories
Bernoulli and Binomial handle binary outcomes (yes/no, heads/tails). But the real world has richer choices: six die faces, survey options with A/B/C/D, word vocabularies with 50,000 tokens, A/B/C test variants. The Multinomial distribution is the counting workhorse when each trial can land in one of k buckets.
The Multinomial Analogy: The Multinomial is to the Categorical what the Binomial is to the Bernoulli — repeat the experiment n times and count how often each outcome appears.
Think of rolling a die 60 times and counting how many 1s, 2s, ..., 6s appear. Or surveying 1000 people about their favorite color and tallying the responses. Or counting word frequencies in a document. All of these are Multinomial scenarios.
Historical Context: From Genetics to Google
The Multinomial distribution emerged from generalizing Bernoulli's work to multiple outcomes. Key historical milestones:
| Year | Contributor | Contribution |
|---|---|---|
| 1713 | Jacob Bernoulli | Ars Conjectandi — Binomial foundation |
| 1866 | Gregor Mendel | Pea experiments — 9:3:3:1 phenotype ratios |
| 1900 | Karl Pearson | Chi-square test for multinomial goodness-of-fit |
| 1920s | R.A. Fisher | Maximum likelihood estimation for genetic ratios |
| 2003 | Blei, Ng, Jordan | LDA topic modeling — Dirichlet-Multinomial |
| 2010s | Deep Learning | Softmax + Cross-entropy as standard classifier loss |
Mendel's Peas: Gregor Mendel predicted that crossing two hybrid pea plants would yield phenotypes in a 9:3:3:1 ratio. Testing whether observed counts match this Multinomial expectation was one of the first applications of the Chi-square test.
Categorical Primer (1 Trial, k Outcomes)
Before the Multinomial, we need the Categorical distribution — a single draw with k possible outcomes:
Symbol Table: Categorical Distribution
| Symbol | Name | Meaning | Constraints |
|---|---|---|---|
| Y | Outcome | Which category was selected (1, 2, ..., k) | Y ∈ {1, 2, ..., k} |
| k | Number of categories | How many possible outcomes | k ≥ 2 |
| pᵢ | Category probability | Probability of outcome i | 0 ≤ pᵢ ≤ 1 |
| p | Probability vector | (p₁, p₂, ..., pₖ) | Σpᵢ = 1 |
Intuition: One spin of a k-sided wheel. One roll of a die. One forward pass of a softmax classifier producing class probabilities.
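A single Categorical draw is just a weighted random choice. A minimal NumPy sketch (the probability vector p is illustrative): with many draws, the empirical frequencies converge to p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # probability vector on the simplex

# One Categorical draw: pick category 0, 1, or 2 with probabilities p
y = rng.choice(len(p), p=p)

# Many independent draws -> empirical frequencies approach p
draws = rng.choice(len(p), size=100_000, p=p)
freqs = np.bincount(draws, minlength=len(p)) / len(draws)
print(freqs)  # close to [0.5, 0.3, 0.2]
```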
Multinomial Definition (n Trials, k Outcomes)
Repeat the Categorical experiment n times independently. Let Xᵢ be the count of trials that land in category i. The count vector X = (X₁, ..., Xₖ) follows the Multinomial distribution.
Probability Mass Function (PMF)
P(X₁ = x₁, ..., Xₖ = xₖ) = n!/(x₁!·...·xₖ!) · p₁^x₁ · ... · pₖ^xₖ, where each xᵢ ≥ 0 and x₁ + ... + xₖ = n
Symbol Table: Multinomial PMF
| Symbol | Name | Meaning | Constraints |
|---|---|---|---|
| n | Number of trials | Total trials performed | n ≥ 1 |
| k | Number of categories | How many possible outcomes per trial | k ≥ 2 |
| xᵢ | Count for category i | How many trials landed in category i | xᵢ ≥ 0 |
| pᵢ | Probability of category i | Per-trial probability of outcome i | 0 < pᵢ < 1 |
| n!/(x₁!...xₖ!) | Multinomial coefficient | Number of orderings giving these counts | Integer |
Special Case: Binomial (k = 2)
When k = 2, we have only "success" and "failure". Setting p₁ = p, p₂ = 1 − p, and X₁ = X (so X₂ = n − X), the Multinomial becomes Binomial(n, p):
P(X = x) = n!/(x!·(n−x)!) · pˣ·(1−p)ⁿ⁻ˣ
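This reduction is easy to verify numerically with SciPy: for k = 2, the Multinomial PMF over count vectors (x, n − x) matches the Binomial PMF for every x (the n and p values below are arbitrary examples).

```python
import numpy as np
from scipy.stats import multinomial, binom

n, p = 10, 0.3
# For k = 2, the count vector is (x, n - x) with probabilities (p, 1 - p)
for x in range(n + 1):
    m = multinomial.pmf([x, n - x], n=n, p=[p, 1 - p])
    b = binom.pmf(x, n, p)
    assert np.isclose(m, b)
print("Multinomial(n, (p, 1-p)) matches Binomial(n, p) for all x")
```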
Combinatorial Intuition: Why n!/(x₁!...xₖ!)?
The multinomial coefficient counts how many orderings of n trials produce the same count vector. Let's build intuition with a concrete example.
The Dice Analogy
Suppose you roll a die n = 5 times and want exactly: x₁ = 2 ones, x₂ = 2 twos, x₃ = 1 three, and x₄ = x₅ = x₆ = 0 for the other faces.
How many ways can this happen? We need to choose:
- Which 2 of the 5 rolls show "1": C(5, 2) = 10 ways
- Which 2 of the remaining 3 rolls show "2": C(3, 2) = 3 ways
- The last roll must be "3": C(1, 1) = 1 way
Total: 10 × 3 × 1 = 30. This equals the multinomial coefficient:
5!/(2!·2!·1!·0!·0!·0!) = 120/4 = 30
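The sequential-choice argument and the factorial formula can be checked against each other in a few lines of Python:

```python
from math import factorial, comb

# Counts for 5 rolls: two 1s, two 2s, one 3 (faces 4-6 unused)
x = [2, 2, 1, 0, 0, 0]
n = sum(x)

# Sequential choices: C(5,2) * C(3,2) * C(1,1)
sequential = comb(5, 2) * comb(3, 2) * comb(1, 1)

# Direct multinomial coefficient: n! / (x1! * ... * xk!)
denom = 1
for xi in x:
    denom *= factorial(xi)
coeff = factorial(n) // denom

print(sequential, coeff)  # both equal 30
```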
Simplex Geometry
Where do valid probability vectors live? Since pᵢ ≥ 0 and Σpᵢ = 1, the vector p is constrained to a (k−1)-dimensional simplex:
- k = 2: A line segment from (1, 0) to (0, 1)
- k = 3: A triangle with vertices (1,0,0), (0,1,0), (0,0,1)
- k = 4: A tetrahedron in 3D space
Moving toward one vertex increases that category's probability while decreasing others. This geometric constraint is why Multinomial counts have negative covariance — they can't all be large simultaneously.
Interactive: Simplex Explorer
Drag the point inside the triangle or use the sliders. Notice how changing one probability forces the others to adjust — this is the simplex constraint in action.
Moments and Dependence Structure
The Multinomial has a rich covariance structure that reflects the constraint X₁ + ... + Xₖ = n.
Mean, Variance, and Covariance
Symbol Table: Moments
| Formula | Name | Interpretation |
|---|---|---|
| E[Xᵢ] = n·pᵢ | Expected count | On average, n·pᵢ trials land in category i |
| Var(Xᵢ) = n·pᵢ·(1-pᵢ) | Variance | Same as Binomial(n, pᵢ) — each component is marginal Binomial |
| Cov(Xᵢ, Xⱼ) = -n·pᵢ·pⱼ | Covariance | Always negative! When one count goes up, others must decrease |
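These moment formulas are easy to sanity-check by simulation: the empirical covariance of many sampled count vectors should match n·pᵢ·(1−pᵢ) on the diagonal and −n·pᵢ·pⱼ off it (the n, p, and sample size below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, np.array([0.5, 0.3, 0.2])

# Draw many Multinomial count vectors and estimate their covariance matrix
samples = rng.multinomial(n, p, size=200_000)
emp_cov = np.cov(samples, rowvar=False)

# Theoretical covariance: diagonal n*p_i*(1-p_i), off-diagonal -n*p_i*p_j
theory = n * (np.diag(p) - np.outer(p, p))
print(np.round(emp_cov, 2))
print(theory)
```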
Interactive: Covariance Matrix Visualizer
Adjust the probabilities and number of trials. Watch how the covariance matrix changes — all off-diagonal entries remain negative.
Interactive: Multinomial Sampling Simulation
Sample from a Multinomial distribution and watch the empirical statistics converge to theoretical values. Pay attention to how the negative covariance manifests in the samples.
Likelihood and Cross-Entropy
The log-likelihood of observing counts x under probabilities p is:
log L(p) = log n! − Σᵢ log(xᵢ!) + Σᵢ xᵢ·log pᵢ
The first two terms are constants (they don't depend on p). Maximizing the likelihood means maximizing:
Σᵢ xᵢ·log pᵢ = n·Σᵢ qᵢ·log pᵢ
where qᵢ = xᵢ/n is the empirical proportion.
Connection to Cross-Entropy
The cross-entropy between empirical distribution q and model distribution p is:
H(q, p) = −Σᵢ qᵢ·log pᵢ
Setting qᵢ = xᵢ/n, we see that minimizing cross-entropy is equivalent to maximizing Multinomial log-likelihood. This is why deep learning uses cross-entropy loss for classification!
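A quick numerical check of this identity: the Multinomial log-likelihood equals the log multinomial coefficient (a constant in p) minus n·H(q, p), so the p-dependent parts agree exactly (the counts and candidate model below are illustrative).

```python
import numpy as np
from math import lgamma
from scipy.stats import multinomial

x = np.array([45, 30, 25])     # observed counts (illustrative)
n = x.sum()
q = x / n                      # empirical proportions
p = np.array([0.5, 0.3, 0.2])  # candidate model distribution

# Multinomial log-likelihood and cross-entropy H(q, p)
loglik = multinomial.logpmf(x, n=n, p=p)
cross_entropy = -np.sum(q * np.log(p))

# Identity: log L(p) = log(multinomial coefficient) - n * H(q, p)
log_coeff = lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)
print(loglik, log_coeff - n * cross_entropy)  # equal
```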
Interactive: Softmax Playground
Softmax converts raw logits into a probability vector on the simplex. Temperature scaling controls how "peaked" or "flat" the distribution is. Cross-entropy measures how much probability mass lands on the correct class.
Dirichlet Smoothing and Bayesian Update
The Dirichlet distribution is the conjugate prior for Multinomial probabilities. Starting with prior parameters α = (α₁, ..., αₖ), after observing counts x, the posterior is:
p | x ~ Dirichlet(α₁ + x₁, ..., αₖ + xₖ)
Posterior Mean (Smoothed Estimate)
p̂ᵢ = (αᵢ + xᵢ)/(Σⱼ αⱼ + n)
Common Prior Choices
| Prior | αᵢ values | Effect | Use case |
|---|---|---|---|
| Laplace | αᵢ = 1 | Add 1 pseudo-count per category | Simple, conservative |
| Jeffreys | αᵢ = 0.5 | Milder smoothing, often better calibrated | Bayesian default |
| Uniform | αᵢ = 1/k | Minimal information | Very sparse data |
| Empirical | αᵢ = prior estimate × n₀ | Informative prior from past data | Transfer learning |
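The posterior-mean formula p̂ᵢ = (αᵢ + xᵢ)/(Σαⱼ + n) makes it easy to compare these priors on the same data (the counts below are illustrative):

```python
import numpy as np

observed = np.array([45, 30, 25])  # illustrative counts
n = observed.sum()
k = len(observed)

# Posterior mean under each prior: (alpha_i + x_i) / (sum(alpha) + n)
for name, a0 in [("Laplace", 1.0), ("Jeffreys", 0.5), ("Uniform", 1.0 / k)]:
    alpha = np.full(k, a0)
    p_post = (alpha + observed) / (alpha.sum() + n)
    print(f"{name:8s} {np.round(p_post, 4)}")
```

Larger pseudo-counts pull the estimate further toward uniform; with counts this large, all three priors give nearly the same answer.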
Label Smoothing in Deep Learning: Setting soft targets like (0.9, 0.05, 0.05) instead of hard one-hot (1, 0, 0) is equivalent to adding Dirichlet pseudo-counts. This improves calibration and reduces overconfidence.
Interactive: Multinomial PMF Explorer
Adjust probabilities and counts. See exact probabilities, expected values, and how negative covariances emerge from the simplex constraint.
Chi-Square Goodness-of-Fit Test
How do we test whether observed counts match a hypothesized Multinomial distribution? The Chi-Square goodness-of-fit test compares observed counts Oᵢ to expected counts Eᵢ = n·pᵢ:
χ² = Σᵢ (Oᵢ − Eᵢ)²/Eᵢ
Under the null hypothesis (the observed data come from the specified Multinomial), this statistic approximately follows a χ² distribution with k − 1 degrees of freedom.
Example: Testing a Fair Die
Roll a die 60 times and observe counts (8, 12, 10, 11, 9, 10) for faces 1-6. Under a fair die, expected counts are (10, 10, 10, 10, 10, 10).
χ² = (4 + 4 + 0 + 1 + 1 + 0)/10 = 1.0
With df = 5 and α = 0.05, the critical value is 11.07. Since 1.0 < 11.07, we fail to reject the null — no evidence the die is unfair.
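The same test takes two lines with scipy.stats.chisquare, which returns the statistic and p-value directly:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([8, 12, 10, 11, 9, 10])  # 60 rolls of a die
expected = np.full(6, 10.0)                  # fair-die expectation

stat, p_value = chisquare(observed, expected)
print(stat, p_value)  # stat = 1.0; large p-value, no evidence of unfairness
```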
Approximations and Limits
Poisson Approximation (Rare Events)
When n is large, all pᵢ are small, and n·pᵢ = λᵢ remains moderate, each count Xᵢ is approximately Poisson(λᵢ) and the counts become nearly independent.
When to use: Word counts in NLP (vocabulary of 50,000 words, each rare), mutation counts in genomics, rare event monitoring.
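A quick check of the approximation: each marginal count Xᵢ is Binomial(n, pᵢ), and for small pᵢ its PMF is nearly indistinguishable from Poisson(n·pᵢ). The n and pᵢ below are illustrative.

```python
import numpy as np
from scipy.stats import binom, poisson

# Rare-event regime: large n, small p_i, moderate lambda = n * p_i
n, p_i = 10_000, 0.0003
lam = n * p_i  # 3.0

ks = np.arange(0, 12)
exact = binom.pmf(ks, n, p_i)    # exact marginal PMF
approx = poisson.pmf(ks, lam)    # Poisson approximation
print(np.max(np.abs(exact - approx)))  # tiny discrepancy
```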
Normal Approximation (Large n)
For large n with moderate pᵢ, the count vector is approximately multivariate normal:
X ≈ N(μ, Σ)
where μᵢ = n·pᵢ and the covariance matrix Σ has diagonal entries Σᵢᵢ = n·pᵢ·(1−pᵢ) and off-diagonal entries Σᵢⱼ = −n·pᵢ·pⱼ.
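A simulation sketch of the CLT at work: after standardizing each count by its mean n·pᵢ and standard deviation √(n·pᵢ·(1−pᵢ)), the samples should look like standard normals (the n, p, and sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, np.array([0.5, 0.3, 0.2])

# Standardize each component: (X_i - n p_i) / sqrt(n p_i (1 - p_i))
samples = rng.multinomial(n, p, size=100_000)
z = (samples - n * p) / np.sqrt(n * p * (1 - p))

# Each standardized component should have mean ~ 0 and variance ~ 1
print(np.round(z.mean(axis=0), 3))
print(np.round(z.var(axis=0), 3))
```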
Real-World Examples
Example 1: Mendel's Pea Experiment
Setup: Mendel crossed hybrid pea plants and predicted offspring phenotypes in ratio 9:3:3:1 (round-yellow : round-green : wrinkled-yellow : wrinkled-green).
Observed: From 556 plants: (315, 108, 101, 32)
Expected (if theory correct): 556 × (9/16, 3/16, 3/16, 1/16) = (312.75, 104.25, 104.25, 34.75)
Chi-square test: χ² = 0.47, df = 3, p-value = 0.925. Theory fits excellently!
Example 2: A/B/C Test for Website Variants
Setup: 1000 visitors randomly assigned to 3 landing page variants. We want equal traffic split but observe (312, 358, 330).
Expected (equal split): (333.3, 333.3, 333.3)
Chi-square test: χ² = 3.22, df = 2, critical value (α = 0.05) = 5.99. Not significant — the randomization appears to be working.
Example 3: Document Classification
Setup: Classify 500 news articles into 4 categories: Sports (180), Politics (150), Tech (120), Entertainment (50). Estimate category probabilities.
MLE: p̂ = (0.36, 0.30, 0.24, 0.10)
With Laplace smoothing (α=1): p̂ = (181/504, 151/504, 121/504, 51/504) = (0.359, 0.300, 0.240, 0.101). Barely different because counts are large.
AI/ML Applications
1. Softmax Classifiers
The softmax function converts logits z to probabilities: pᵢ = exp(zᵢ)/Σⱼ exp(zⱼ). Training minimizes cross-entropy, which is the Multinomial negative log-likelihood. The model learns to put probability mass on the correct class.
2. Label Smoothing
Instead of hard targets (1, 0, 0, ...), use soft targets like (0.9, 0.033, 0.033, ...). This is a Dirichlet prior that prevents overconfident predictions and improves calibration. Standard in modern vision models (ImageNet training).
3. Language Models
Word counts in a document follow Multinomial over the vocabulary. N-gram models estimate p(word | context) using smoothed Multinomial MLE. Even neural language models output a Multinomial (via softmax) at each position.
4. Topic Models (LDA)
Latent Dirichlet Allocation models documents as mixtures of topics, where each topic is a Dirichlet-Multinomial over words. The conjugacy makes inference tractable via variational methods or Gibbs sampling.
5. Class Imbalance Handling
When training data has imbalanced classes (90% negative, 10% positive), the Multinomial perspective suggests: (1) reweight the loss by 1/pᵢ, (2) add informative Dirichlet priors, or (3) use focal loss (which down-weights easy, confidently classified examples).
```python
import torch
import torch.nn.functional as F

# Standard cross-entropy (Multinomial negative log-likelihood)
logits = model(x)  # shape: (batch, num_classes)
loss = F.cross_entropy(logits, labels)

# With label smoothing (Dirichlet-style soft targets)
loss = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Class weights for imbalanced data
weights = torch.tensor([1.0, 10.0, 1.0])  # upweight the rare class
loss = F.cross_entropy(logits, labels, weight=weights)

# Temperature scaling for calibration
temperature = 1.5
calibrated_probs = F.softmax(logits / temperature, dim=-1)
```
Python Implementation
```python
import numpy as np
from scipy.stats import multinomial, chi2

# ========== Basic Multinomial Operations ==========
n = 20
p = np.array([0.5, 0.3, 0.2])
counts = np.array([10, 6, 4])

# PMF: probability of exact count vector
pmf = multinomial.pmf(counts, n=n, p=p)
print(f"P(X = {counts}) = {pmf:.6f}")

# Sample from Multinomial
samples = multinomial.rvs(n=n, p=p, size=5)
print(f"Samples:\n{samples}")

# ========== Maximum Likelihood Estimation ==========
observed = np.array([45, 30, 25])
n_obs = observed.sum()
p_mle = observed / n_obs
print(f"MLE: {p_mle}")

# ========== Dirichlet Smoothing ==========
alpha = np.array([1.0, 1.0, 1.0])  # Laplace prior
p_smooth = (alpha + observed) / (alpha.sum() + n_obs)
print(f"Smoothed estimate: {p_smooth}")

# ========== Chi-Square Goodness-of-Fit ==========
# Test if observed counts match expected probabilities
p_null = np.array([0.5, 0.3, 0.2])
expected = n_obs * p_null
chi2_stat = np.sum((observed - expected)**2 / expected)
df = len(observed) - 1
p_value = 1 - chi2.cdf(chi2_stat, df)
print(f"Chi-square: {chi2_stat:.3f}, df={df}, p-value={p_value:.4f}")

# ========== Log-likelihood and Cross-Entropy ==========
def multinomial_log_likelihood(counts, probs):
    """Compute log-likelihood (up to constant)"""
    return np.sum(counts * np.log(probs))

def cross_entropy(q, p):
    """Cross-entropy H(q, p)"""
    return -np.sum(q * np.log(p))

q_empirical = observed / n_obs
ce = cross_entropy(q_empirical, p_mle)
print(f"Cross-entropy with MLE: {ce:.4f}")

# ========== Covariance Matrix ==========
def multinomial_covariance(n, p):
    """Compute the full covariance matrix"""
    k = len(p)
    cov = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                cov[i, j] = n * p[i] * (1 - p[i])
            else:
                cov[i, j] = -n * p[i] * p[j]
    return cov

cov_matrix = multinomial_covariance(n=100, p=[0.5, 0.3, 0.2])
print(f"Covariance matrix:\n{cov_matrix}")
```
Common Pitfalls
1. Forgetting Constraints
Probabilities must sum to 1; counts must sum to n. Always normalize predictions and validate inputs. In PyTorch, softmax guarantees valid probabilities.
2. Treating Categories as Independent
Multinomial counts are negatively correlated. You cannot compute separate confidence intervals for each category and combine them. Use multivariate methods or bootstrap for joint inference.
3. Zero Counts → Log(0) Explosion
Never allow p = 0 in the likelihood. Always add smoothing: Laplace (add 1), Jeffreys (add 0.5), or use a proper Dirichlet prior.
4. Overconfident Softmax
Neural networks often produce overconfident predictions. Use temperature scaling, label smoothing, or Bayesian methods to calibrate probabilities.
5. Using Multinomial When n is Random
If the total count n is itself random (e.g., number of words in a document varies), consider Poisson or Dirichlet-Multinomial models instead.
6. Ignoring Chi-Square Assumptions
The Chi-square test requires expected counts ≥ 5 in each cell. For sparse data, combine categories or use exact tests.
Test Your Understanding
Summary
The Multinomial distribution generalizes the Binomial to k categories. Its key properties:
- Counts live on a simplex: Probability vectors (p₁, ..., pₖ) with pᵢ ≥ 0 and Σpᵢ = 1 form a (k-1)-dimensional simplex.
- Negative covariance: Counts are negatively correlated because they must sum to n. This is the defining dependence structure.
- Cross-entropy = negative log-likelihood: Softmax + cross-entropy loss is maximum likelihood estimation for Categorical/Multinomial.
- Dirichlet is the conjugate prior: Adding pseudo-counts smooths estimates and prevents log(0). This underlies label smoothing.
- Chi-square tests goodness-of-fit: Compare observed to expected counts using χ² = Σ(O-E)²/E.
Whether you're training a neural classifier, building a language model, running an A/B/n test, or testing Mendel's genetic predictions, the Multinomial provides the probabilistic foundation for reasoning about categorical outcomes.