Chapter 4

Multinomial Distribution

Discrete Distributions

Learning Objectives

By the end of this section, you will be able to:

  1. Connect the Categorical (one trial, k outcomes) to the Multinomial (n trials, k outcomes) and recover Binomial as the k = 2 special case.
  2. Derive the PMF with its multinomial coefficient and visualize probability vectors on the simplex.
  3. Compute means, variances, and the negative covariances that enforce the constraint that counts must sum to n.
  4. Explain why minimizing cross-entropy is equivalent to maximizing Multinomial log-likelihood — the foundation of softmax classifiers.
  5. Apply Dirichlet-Multinomial smoothing to prevent zero probabilities and improve calibration in language models and classifiers.
  6. Use the Chi-Square goodness-of-fit test to check if observed counts match a hypothesized distribution.
  7. Know when Poisson or Normal approximations apply (rare events vs. large n), critical for NLP and A/B/n testing.

Big Picture: From Coins to Categories

Bernoulli and Binomial handle binary outcomes (yes/no, heads/tails). But the real world has richer choices: six die faces, survey options with A/B/C/D, word vocabularies with 50,000 tokens, A/B/C test variants. The Multinomial distribution is the counting workhorse when each trial can land in one of k buckets.

The Multinomial Analogy: The Multinomial is to the Categorical what the Binomial is to the Bernoulli — repeat the experiment n times and count how often each outcome appears.

Think of rolling a die 60 times and counting how many 1s, 2s, ..., 6s appear. Or surveying 1000 people about their favorite color and tallying the responses. Or counting word frequencies in a document. All of these are Multinomial scenarios.


Historical Context: From Genetics to Google

The Multinomial distribution emerged from generalizing Bernoulli's work to multiple outcomes. Key historical milestones:

| Year | Contributor | Contribution |
|---|---|---|
| 1713 | Jacob Bernoulli | Ars Conjectandi — Binomial foundation |
| 1860s | Gregor Mendel | Pea experiments — 9:3:3:1 phenotype ratios |
| 1900 | Karl Pearson | Chi-square test for multinomial goodness-of-fit |
| 1920s | R.A. Fisher | Maximum likelihood estimation for genetic ratios |
| 2003 | Blei, Ng, Jordan | LDA topic modeling — Dirichlet-Multinomial |
| 2010s | Deep learning community | Softmax + cross-entropy as standard classifier loss |

Mendel's Peas: Gregor Mendel predicted that crossing two hybrid pea plants would yield phenotypes in a 9:3:3:1 ratio. Testing whether observed counts match this Multinomial expectation was one of the first applications of the Chi-square test.

The connection between Multinomial likelihood and cross-entropy loss became widely appreciated in the deep learning era, when practitioners recognized that softmax + cross-entropy is just maximum likelihood estimation for Categorical/Multinomial models.

Categorical Primer (1 Trial, k Outcomes)

Before the Multinomial, we need the Categorical distribution — a single draw with k possible outcomes:

Y \sim \text{Categorical}(p_1, p_2, \ldots, p_k), \quad P(Y = i) = p_i, \quad \sum_{i=1}^k p_i = 1

Symbol Table: Categorical Distribution

| Symbol | Name | Meaning | Constraints |
|---|---|---|---|
| Y | Outcome | Which category was selected (1, 2, ..., k) | Y ∈ {1, 2, ..., k} |
| k | Number of categories | How many possible outcomes | k ≥ 2 |
| pᵢ | Category probability | Probability of outcome i | 0 ≤ pᵢ ≤ 1 |
| p | Probability vector | (p₁, p₂, ..., pₖ) | Σpᵢ = 1 |

Intuition: One spin of a k-sided wheel. One roll of a die. One forward pass of a softmax classifier producing class probabilities.
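A single Categorical draw in code is just a weighted choice (a sketch using NumPy; the probabilities and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p = np.array([0.5, 0.3, 0.2])

# One Categorical trial: returns a category index with probabilities p
y = rng.choice(len(p), p=p)

# Many trials: empirical frequencies should approach p
draws = rng.choice(len(p), p=p, size=10_000)
freq = np.bincount(draws, minlength=len(p)) / len(draws)
print(freq)  # close to [0.5, 0.3, 0.2]
```

Counting how often each category appears across many such draws is exactly the Multinomial setup introduced next.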


Multinomial Definition (n Trials, k Outcomes)

Repeat the Categorical experiment n times independently. Let X_i be the count of trials that land in category i. The count vector follows the Multinomial distribution:

\mathbf{X} = (X_1, X_2, \ldots, X_k) \sim \text{Multinomial}(n, \mathbf{p})

Probability Mass Function (PMF)

P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdot x_2! \cdots x_k!} \prod_{i=1}^k p_i^{x_i}

Symbol Table: Multinomial PMF

| Symbol | Name | Meaning | Constraints |
|---|---|---|---|
| n | Number of trials | Total trials performed | n ≥ 1 |
| k | Number of categories | How many possible outcomes per trial | k ≥ 2 |
| xᵢ | Count for category i | How many trials landed in category i | xᵢ ≥ 0 |
| pᵢ | Probability of category i | Per-trial probability of outcome i | 0 ≤ pᵢ ≤ 1 |
| n!/(x₁!···xₖ!) | Multinomial coefficient | Number of orderings giving these counts | Positive integer |

Key Constraints: The counts must sum to n (\sum_i x_i = n) and the probabilities must sum to 1 (\sum_i p_i = 1). Any count vector with \sum_i x_i \neq n has probability zero, and a vector p with \sum_i p_i \neq 1 is not a valid parameter.

Special Case: Binomial (k = 2)

When k = 2, we have only "success" and "failure". Setting p_1 = p and p_2 = 1 - p, the Multinomial becomes Binomial(n, p):

P(X_1 = x) = \binom{n}{x} p^x (1-p)^{n-x}
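This reduction is easy to verify numerically (a sketch; the values of n and p are arbitrary):

```python
import numpy as np
from scipy.stats import multinomial, binom

# With k = 2, the Multinomial PMF should match Binomial(n, p) exactly
n, p = 10, 0.3
for x in range(n + 1):
    pm = multinomial.pmf([x, n - x], n=n, p=[p, 1 - p])
    pb = binom.pmf(x, n, p)
    assert np.isclose(pm, pb)
print("Multinomial with k = 2 matches Binomial for every x")
```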

Combinatorial Intuition: Why n!/(x₁!...xₖ!)?

The multinomial coefficient counts how many orderings of n trials produce the same count vector. Let's build intuition with a concrete example.

The Dice Analogy

Suppose you roll a die n = 5 times and want exactly: x₁ = 2 ones, x₂ = 2 twos, x₃ = 1 three, and x₄ = x₅ = x₆ = 0 for the other faces.

How many ways can this happen? We need to choose:

  1. Which 2 of the 5 rolls show "1": \binom{5}{2} = 10 ways
  2. Which 2 of the remaining 3 rolls show "2": \binom{3}{2} = 3 ways
  3. The last roll must be "3": \binom{1}{1} = 1 way

Total: 10 \times 3 \times 1 = 30. This equals the multinomial coefficient:

\frac{5!}{2! \cdot 2! \cdot 1! \cdot 0! \cdot 0! \cdot 0!} = \frac{120}{2 \cdot 2 \cdot 1 \cdot 1 \cdot 1 \cdot 1} = 30

The multinomial coefficient generalizes "n choose k" to multiple groups. It counts the number of ways to partition n labeled items into k groups of sizes x₁, x₂, ..., xₖ.
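The sequential-choice argument can be checked in a few lines (a sketch using the standard library's `math.comb` and `math.factorial`):

```python
from math import comb, factorial

counts = [2, 2, 1, 0, 0, 0]   # 5 rolls: two 1s, two 2s, one 3
n = sum(counts)

# Product of sequential binomial choices
ways, remaining = 1, n
for x in counts:
    ways *= comb(remaining, x)
    remaining -= x

# Direct multinomial coefficient n! / (x1! ... xk!)
coef = factorial(n)
for x in counts:
    coef //= factorial(x)

print(ways, coef)  # 30 30
```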

Simplex Geometry

Where do valid probability vectors live? Since p_i \geq 0 and \sum_i p_i = 1, the vector \mathbf{p} is constrained to a (k-1)-dimensional simplex:

  • k = 2: A line segment from (1, 0) to (0, 1)
  • k = 3: A triangle with vertices (1,0,0), (0,1,0), (0,0,1)
  • k = 4: A tetrahedron in 3D space

Moving toward one vertex increases that category's probability while decreasing others. This geometric constraint is why Multinomial counts have negative covariance — they can't all be large simultaneously.

Every softmax output is a point on this simplex. Gradient descent on cross-entropy loss slides the point toward the correct vertex (the true label).
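A minimal softmax sketch, assuming nothing beyond NumPy, makes this concrete: whatever the logits, the output lands on the simplex.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
p = softmax(logits)
print(p, p.sum())            # nonnegative entries summing to 1
```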

Interactive: Simplex Explorer

Drag the point inside the triangle or use the sliders. Notice how changing one probability forces the others to adjust — this is the simplex constraint in action.



Moments and Dependence Structure

The Multinomial has a rich covariance structure that reflects the constraint \sum_i X_i = n.

Mean, Variance, and Covariance

E[X_i] = n p_i
\text{Var}(X_i) = n p_i (1 - p_i)
\text{Cov}(X_i, X_j) = -n p_i p_j \quad (i \neq j)

Symbol Table: Moments

| Formula | Name | Interpretation |
|---|---|---|
| E[Xᵢ] = n·pᵢ | Expected count | On average, n·pᵢ trials land in category i |
| Var(Xᵢ) = n·pᵢ·(1−pᵢ) | Variance | Same as Binomial(n, pᵢ) — each component is marginally Binomial |
| Cov(Xᵢ, Xⱼ) = −n·pᵢ·pⱼ | Covariance | Always negative: when one count goes up, others must come down |

Critical Insight: The covariances are always negative. This means you cannot treat Multinomial components as independent Binomials unless you're in the rare-event (Poisson) regime. Ignoring this dependence leads to incorrect confidence intervals and hypothesis tests.
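The moment formulas can be verified by simulation (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.5, 0.3, 0.2])
samples = rng.multinomial(n, p, size=200_000)

# Sample covariance vs. the theoretical Sigma = n * (diag(p) - p p^T)
emp_cov = np.cov(samples.T)
theory = n * (np.diag(p) - np.outer(p, p))
print(np.round(emp_cov, 2))   # off-diagonal entries are all negative
```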

Interactive: Covariance Matrix Visualizer

Adjust the probabilities and number of trials. Watch how the covariance matrix changes — all off-diagonal entries remain negative.


Interactive: Multinomial Sampling Simulation

Sample from a Multinomial distribution and watch the empirical statistics converge to theoretical values. Pay attention to how the negative covariance manifests in the samples.



Likelihood and Cross-Entropy

The log-likelihood of observing counts x under probabilities p is:

\log L(\mathbf{p} \mid \mathbf{x}) = \log(n!) - \sum_i \log(x_i!) + \sum_i x_i \log p_i

The first two terms are constants (they don't depend on p). Maximizing the likelihood means maximizing:

\sum_i x_i \log p_i = n \sum_i \hat{p}_i \log p_i

where \hat{p}_i = x_i / n is the empirical proportion.

Connection to Cross-Entropy

The cross-entropy between empirical distribution q and model distribution p is:

H(q, p) = -\sum_i q_i \log p_i

Setting q_i = x_i / n, we see that minimizing cross-entropy is equivalent to maximizing the Multinomial log-likelihood. This is why deep learning uses cross-entropy loss for classification!
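A quick numerical sketch of the equivalence: among candidate probability vectors, cross-entropy against the empirical distribution is smallest at the empirical distribution itself, which is exactly the MLE.

```python
import numpy as np

q = np.array([0.45, 0.30, 0.25])   # empirical proportions x_i / n

def cross_entropy(q, p):
    return -np.sum(q * np.log(p))

# Compare H(q, q) against cross-entropy at many random simplex points
rng = np.random.default_rng(1)
candidates = rng.dirichlet(np.ones(3), size=10_000)
ce_candidates = np.array([cross_entropy(q, p) for p in candidates])

# H(q, p) = H(q) + KL(q || p) >= H(q), with equality only at p = q
print(cross_entropy(q, q), ce_candidates.min())
```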

MLE Solution: The maximum likelihood estimate is simply \hat{p}_i = x_i / n — set each probability equal to the observed proportion.

Setting \hat{p}_i = 0 for an unseen category (x_i = 0) is dangerous: if that category ever appears in later data, the model assigns it \log(0) = -\infty log-likelihood. This is catastrophic for training! Smoothing prevents this disaster.

Interactive: Softmax Playground

Softmax converts raw logits into a probability vector on the simplex. Temperature scaling controls how "peaked" or "flat" the distribution is. Cross-entropy measures how much probability mass lands on the correct class.



Dirichlet Smoothing and Bayesian Update

The Dirichlet distribution is the conjugate prior for Multinomial probabilities. Starting with prior parameters α = (α₁, ..., αₖ), after observing counts x, the posterior is:

\mathbf{p} \mid \mathbf{x} \sim \text{Dirichlet}(\alpha_1 + x_1, \ldots, \alpha_k + x_k)

Posterior Mean (Smoothed Estimate)

\hat{p}_i^{\text{smooth}} = \frac{\alpha_i + x_i}{\sum_j \alpha_j + n}

Common Prior Choices

| Prior | αᵢ values | Effect | Use case |
|---|---|---|---|
| Laplace | αᵢ = 1 | Add 1 pseudo-count per category | Simple, conservative |
| Jeffreys | αᵢ = 0.5 | Milder smoothing, often better calibrated | Bayesian default |
| Uniform | αᵢ = 1/k | Minimal information | Very sparse data |
| Empirical | αᵢ = prior estimate × n₀ | Informative prior from past data | Transfer learning |

Label Smoothing in Deep Learning: Setting soft targets like (0.9, 0.05, 0.05) instead of hard one-hot (1, 0, 0) is equivalent to adding Dirichlet pseudo-counts. This improves calibration and reduces overconfidence.
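A sketch of label smoothing as target construction, using the convention (also used by PyTorch's `label_smoothing` argument) that the smoothing mass ε is spread uniformly over all k classes:

```python
import numpy as np

def smooth_targets(label, k, eps=0.1):
    t = np.full(k, eps / k)      # spread eps uniformly over all classes
    t[label] += 1.0 - eps        # remaining mass on the true class
    return t

t = smooth_targets(label=0, k=3, eps=0.1)
print(t)  # roughly [0.933, 0.033, 0.033]
```

Other conventions spread ε only over the incorrect classes, which yields targets like the (0.9, 0.05, 0.05) above; the pseudo-count interpretation is the same.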


Interactive: Multinomial PMF Explorer

Adjust probabilities and counts. See exact probabilities, expected values, and how negative covariances emerge from the simplex constraint.



Chi-Square Goodness-of-Fit Test

How do we test whether observed counts match a hypothesized Multinomial distribution? The Chi-Square goodness-of-fit test compares observed counts O_i to expected counts E_i = n p_i:

\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}

Under the null hypothesis (the observed data comes from the specified Multinomial), this statistic approximately follows a \chi^2_{k-1} distribution with k - 1 degrees of freedom.

Example: Testing a Fair Die

Roll a die 60 times and observe: (8, 12, 10, 11, 9, 10) for faces 1-6. Under a fair die, expected counts are (10, 10, 10, 10, 10, 10).

\chi^2 = \frac{(8-10)^2}{10} + \frac{(12-10)^2}{10} + \ldots + \frac{(10-10)^2}{10} = \frac{10}{10} = 1.0

With df = 5 and α = 0.05, the critical value is 11.07. Since 1.0 < 11.07, we fail to reject the null — no evidence the die is unfair.
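The same test in SciPy (a sketch; `chisquare` defaults to a uniform expected split when `f_exp` is omitted):

```python
from scipy.stats import chisquare

observed = [8, 12, 10, 11, 9, 10]
stat, p_value = chisquare(observed)   # expected = 10 per face
print(stat, p_value)
```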

The Chi-square test requires all expected counts to be at least 5. For rare categories, combine cells or use exact multinomial tests.

Approximations and Limits

Poisson Approximation (Rare Events)

When n is large, all pᵢ are small, and n·pᵢ = λᵢ remains moderate, each count Xᵢ is approximately Poisson(λᵢ) and the counts become nearly independent.

X_i \stackrel{d}{\approx} \text{Poisson}(n p_i), \quad \text{counts approximately independent}

When to use: Word counts in NLP (vocabulary of 50,000 words, each rare), mutation counts in genomics, rare event monitoring.
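Since each X_i is marginally Binomial(n, pᵢ), the approximation can be checked directly on one marginal (a sketch; n and p_i are illustrative):

```python
import numpy as np
from scipy.stats import binom, poisson

# Rare-event regime: large n, small per-trial probability
n, p_i = 50_000, 1e-4
lam = n * p_i                 # = 5.0

xs = np.arange(0, 20)
max_gap = np.max(np.abs(binom.pmf(xs, n, p_i) - poisson.pmf(xs, lam)))
print(max_gap)                # tiny: the two PMFs nearly coincide
```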

Normal Approximation (Large n)

For large n with moderate pᵢ, the count vector is approximately multivariate normal:

\mathbf{X} \stackrel{d}{\approx} \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})

where \mu_i = n p_i, and the covariance matrix has \Sigma_{ii} = n p_i (1 - p_i) and \Sigma_{ij} = -n p_i p_j.

The rule of thumb: use Normal approximation when n·pᵢ ≥ 5 for all categories. Use Poisson when dealing with many rare categories.
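A simulation sketch of the Normal regime, standardizing one component of the count vector (parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, np.array([0.5, 0.3, 0.2])
x1 = rng.multinomial(n, p, size=100_000)[:, 0]

# Standardize using the marginal Binomial mean and variance
z = (x1 - n * p[0]) / np.sqrt(n * p[0] * (1 - p[0]))
print(round(z.mean(), 2), round(z.std(), 2))   # close to 0 and 1
```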

Real-World Examples

Example 1: Mendel's Pea Experiment

Setup: Mendel crossed hybrid pea plants and predicted offspring phenotypes in ratio 9:3:3:1 (round-yellow : round-green : wrinkled-yellow : wrinkled-green).

Observed: From 556 plants: (315, 108, 101, 32)

Expected (if theory correct): 556 × (9/16, 3/16, 3/16, 1/16) = (312.75, 104.25, 104.25, 34.75)

Chi-square test: χ² = 0.47, df = 3, p-value = 0.925. Theory fits excellently!
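The same computation in SciPy (a sketch reproducing the numbers above):

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([315, 108, 101, 32])
expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

stat, p_value = chisquare(observed, f_exp=expected)
print(round(stat, 2), round(p_value, 3))
```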

Example 2: A/B/C Test for Website Variants

Setup: 1000 visitors randomly assigned to 3 landing page variants. We want equal traffic split but observe (312, 358, 330).

Expected (equal split): (333.3, 333.3, 333.3)

Chi-square test: χ² = 3.22, df = 2, critical value (α = 0.05) = 5.99. Not significant — the randomization appears to be working.

Example 3: Document Classification

Setup: Classify 500 news articles into 4 categories: Sports (180), Politics (150), Tech (120), Entertainment (50). Estimate category probabilities.

MLE: p̂ = (0.36, 0.30, 0.24, 0.10)

With Laplace smoothing (α=1): p̂ = (181/504, 151/504, 121/504, 51/504) = (0.359, 0.300, 0.240, 0.101). Barely different because counts are large.
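These numbers are easy to reproduce (a sketch):

```python
import numpy as np

counts = np.array([180, 150, 120, 50])
n = counts.sum()                            # 500

p_mle = counts / n
p_laplace = (counts + 1) / (n + len(counts))   # add 1 pseudo-count per class
print(p_mle)
print(np.round(p_laplace, 3))
```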


AI/ML Applications

1. Softmax Classifiers

The softmax function converts logits z to probabilities: p_i = e^{z_i} / \sum_j e^{z_j}. Training minimizes cross-entropy, which is the Multinomial negative log-likelihood. The model learns to put probability mass on the correct class.

2. Label Smoothing

Instead of hard targets (1, 0, 0, ...), use soft targets like (0.9, 0.033, 0.033, ...). This is a Dirichlet prior that prevents overconfident predictions and improves calibration. Standard in modern vision models (ImageNet training).

3. Language Models

Word counts in a document follow Multinomial over the vocabulary. N-gram models estimate p(word | context) using smoothed Multinomial MLE. Even neural language models output a Multinomial (via softmax) at each position.

4. Topic Models (LDA)

Latent Dirichlet Allocation models documents as mixtures of topics, where each topic is a Dirichlet-Multinomial over words. The conjugacy makes inference tractable via variational methods or Gibbs sampling.

5. Class Imbalance Handling

When training data has imbalanced classes (90% negative, 10% positive), the Multinomial perspective suggests: (1) reweight the loss by 1/pᵢ, (2) add informative Dirichlet priors, or (3) use focal loss (a soft version of importance weighting).

```python
import torch
import torch.nn.functional as F

# Standard cross-entropy (Multinomial negative log-likelihood)
logits = model(x)  # shape: (batch, num_classes)
loss = F.cross_entropy(logits, labels)

# With label smoothing (Dirichlet-style pseudo-counts)
loss = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Class weights for imbalanced data
weights = torch.tensor([1.0, 10.0, 1.0])  # upweight the rare class
loss = F.cross_entropy(logits, labels, weight=weights)

# Temperature scaling for calibration
temperature = 1.5
calibrated_probs = F.softmax(logits / temperature, dim=-1)
```

Python Implementation

```python
import numpy as np
from scipy.stats import multinomial, chi2

# ========== Basic Multinomial Operations ==========
n = 20
p = np.array([0.5, 0.3, 0.2])
counts = np.array([10, 6, 4])

# PMF: probability of this exact count vector
pmf = multinomial.pmf(counts, n=n, p=p)
print(f"P(X = {counts}) = {pmf:.6f}")

# Sample from the Multinomial
samples = multinomial.rvs(n=n, p=p, size=5)
print(f"Samples:\n{samples}")

# ========== Maximum Likelihood Estimation ==========
observed = np.array([45, 30, 25])
n_obs = observed.sum()
p_mle = observed / n_obs
print(f"MLE: {p_mle}")

# ========== Dirichlet Smoothing ==========
alpha = np.array([1.0, 1.0, 1.0])  # Laplace prior
p_smooth = (alpha + observed) / (alpha.sum() + n_obs)
print(f"Smoothed estimate: {p_smooth}")

# ========== Chi-Square Goodness-of-Fit ==========
# Test whether observed counts match hypothesized probabilities
p_null = np.array([0.5, 0.3, 0.2])
expected = n_obs * p_null
chi2_stat = np.sum((observed - expected)**2 / expected)
df = len(observed) - 1
p_value = 1 - chi2.cdf(chi2_stat, df)
print(f"Chi-square: {chi2_stat:.3f}, df={df}, p-value={p_value:.4f}")

# ========== Log-likelihood and Cross-Entropy ==========
def multinomial_log_likelihood(counts, probs):
    """Compute the log-likelihood (up to an additive constant)."""
    return np.sum(counts * np.log(probs))

def cross_entropy(q, p):
    """Cross-entropy H(q, p)."""
    return -np.sum(q * np.log(p))

q_empirical = observed / n_obs
ce = cross_entropy(q_empirical, p_mle)
print(f"Cross-entropy with MLE: {ce:.4f}")

# ========== Covariance Matrix ==========
def multinomial_covariance(n, p):
    """Compute the full k x k covariance matrix."""
    k = len(p)
    cov = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                cov[i, j] = n * p[i] * (1 - p[i])
            else:
                cov[i, j] = -n * p[i] * p[j]
    return cov

cov_matrix = multinomial_covariance(n=100, p=[0.5, 0.3, 0.2])
print(f"Covariance matrix:\n{cov_matrix}")
```

Common Pitfalls

1. Forgetting Constraints

Probabilities must sum to 1; counts must sum to n. Always normalize predictions and validate inputs. In PyTorch, softmax guarantees valid probabilities.

2. Treating Categories as Independent

Multinomial counts are negatively correlated. You cannot compute separate confidence intervals for each category and combine them. Use multivariate methods or bootstrap for joint inference.

3. Zero Counts → Log(0) Explosion

Never allow p = 0 in the likelihood. Always add smoothing: Laplace (add 1), Jeffreys (add 0.5), or use a proper Dirichlet prior.

4. Overconfident Softmax

Neural networks often produce overconfident predictions. Use temperature scaling, label smoothing, or Bayesian methods to calibrate probabilities.

5. Using Multinomial When n is Random

If the total count n is itself random (e.g., number of words in a document varies), consider Poisson or Dirichlet-Multinomial models instead.

6. Ignoring Chi-Square Assumptions

The Chi-square test requires expected counts ≥ 5 in each cell. For sparse data, combine categories or use exact tests.


Test Your Understanding



Summary

The Multinomial distribution generalizes the Binomial to k categories. Its key properties:

  • Counts live on a simplex: Probability vectors (p₁, ..., pₖ) with pᵢ ≥ 0 and Σpᵢ = 1 form a (k-1)-dimensional simplex.
  • Negative covariance: Counts are negatively correlated because they must sum to n. This is the defining dependence structure.
  • Cross-entropy = negative log-likelihood: Softmax + cross-entropy loss is maximum likelihood estimation for Categorical/Multinomial.
  • Dirichlet is the conjugate prior: Adding pseudo-counts smooths estimates and prevents log(0). This underlies label smoothing.
  • Chi-square tests goodness-of-fit: Compare observed to expected counts using χ² = Σ(O-E)²/E.

Whether you're training a neural classifier, building a language model, running an A/B/n test, or testing Mendel's genetic predictions, the Multinomial provides the probabilistic foundation for reasoning about categorical outcomes.