Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain what makes a Markov model "hidden"
• Define the five components of an HMM mathematically
• Describe the three fundamental HMM problems
• Understand why dynamic programming is essential

🔧 Practical Skills

• Implement the Forward algorithm for sequence evaluation
• Apply the Viterbi algorithm for optimal decoding
• Understand the Baum-Welch learning algorithm
• Recognize when to use HMMs vs other sequence models

🧠 Deep Learning Connections

• RNNs and LSTMs — Neural sequence models that generalize HMMs with continuous hidden states
• CRFs — Conditional Random Fields combine HMM structure with discriminative training
• Attention mechanisms — Transformers replace Markov assumptions with global attention
• Variational autoencoders — VAEs extend HMM latent variable ideas to deep generative models

Where You'll Apply This: Speech recognition, natural language processing (POS tagging, named entity recognition), bioinformatics (gene finding, protein structure), time series analysis, music generation, and robotics (localization).

The Big Picture

Hidden Markov Models (HMMs) are one of the most successful statistical models for sequential data. They address a fundamental challenge: inferring hidden structure from noisy observations.

The Core Idea

Imagine you're in a room with no windows. You can't see the weather outside, but you can observe what your friend does each day: going for a walk, shopping, or cleaning. From these observations, you want to infer the hidden weather states. An HMM provides the mathematical framework to do exactly this.

🌤️ → ❓

Hidden States: The true underlying process we can't see

🚶 🛒 🧹

Observations: What we actually measure or detect

📊

Inference: Determine hidden states from observations

The Story Behind HMMs

📜

Leonard Baum & Lloyd Welch (1960s)

HMMs were developed at the Institute for Defense Analyses for speech recognition research. Baum and colleagues developed the mathematical foundations, including the famous Baum-Welch algorithm for learning HMM parameters from data.

🎤

Speech Recognition Revolution (1980s-2000s)

HMMs became the dominant approach for speech recognition at IBM, CMU, and Bell Labs. Systems like Dragon NaturallySpeaking and early Siri used HMMs. The states represented phonemes, and observations were acoustic features extracted from audio.

🧬

Bioinformatics Applications (1990s-present)

HMMs proved invaluable for analyzing DNA and protein sequences. Profile HMMs can identify protein families, predict gene structures, and align biological sequences. The hidden states represent functional regions (exons, introns, regulatory elements).

Where HMMs Are Used

Domain	Hidden States	Observations
Speech Recognition	Phonemes/sub-word units	Acoustic features (MFCCs)
POS Tagging	Part-of-speech tags	Words in sentences
Gene Finding	Coding/non-coding regions	DNA nucleotides (A,T,G,C)
Handwriting Recognition	Character strokes	Pen coordinates
Music Generation	Musical modes/keys	Notes and rhythms
Robot Localization	True position	Noisy sensor readings

Mathematical Foundation

An HMM is defined by two stochastic processes: a hidden Markov chain (the states we can't observe) and an observation process (what we actually see). Let's formalize this.

The Five Components

An HMM is specified by the tuple $\lambda = (S, V, A, B, \pi)$ :

States: $S = \{s_1, s_2, \ldots, s_N\}$

The set of $N$ hidden states. These are the unobservable conditions we want to infer.

Observations: $V = \{v_1, v_2, \ldots, v_M\}$

The set of $M$ possible observation symbols. This is what we actually measure.

Transition Matrix: $A = [a_{ij}]$ where $a_{ij} = P(q_{t+1} = s_j | q_t = s_i)$

$N \times N$ matrix of state transition probabilities. Row $i$ sums to 1.

Emission Matrix: $B = [b_j(k)]$ where $b_j(k) = P(o_t = v_k | q_t = s_j)$

$N \times M$ matrix of observation probabilities given each state. Row $j$ sums to 1.

Initial Distribution: $\pi = [\pi_i]$ where $\pi_i = P(q_1 = s_i)$

Vector of length $N$ giving initial state probabilities. Sums to 1.

Formal Definition

The key assumptions that define an HMM are:

HMM Assumptions

1. Markov Property (First-Order)

P(q_t | q_1, q_2, \ldots, q_{t-1}) = P(q_t | q_{t-1})

The next state depends only on the current state, not the entire history.

2. Output Independence

P(o_t | q_1, \ldots, q_T, o_1, \ldots, o_{t-1}, o_{t+1}, \ldots) = P(o_t | q_t)

Each observation depends only on the current hidden state.

3. Time Homogeneity

P(q_{t+1} = s_j | q_t = s_i) = a_{ij} \quad \text{for all } t

Transition and emission probabilities don't change over time.

Given a sequence of $T$ observations $O = (o_1, o_2, \ldots, o_T)$ and a hidden state sequence $Q = (q_1, q_2, \ldots, q_T)$ , the joint probability is:

P(O, Q | \lambda) = \pi_{q_1} \cdot b_{q_1}(o_1) \cdot \prod_{t=2}^{T} a_{q_{t-1}, q_t} \cdot b_{q_t}(o_t)

Interactive: HMM Explorer

Explore how HMMs work with the classic weather-activity example. Adjust transition and emission probabilities to see how they affect the generated sequences. Notice how the hidden states (weather) produce the observations (activities) probabilistically.

Hidden Markov Model Visualizer

The classic weather-activity HMM: hidden weather states (Sunny/Rainy) generate observable activities (Walk/Shop/Clean).

Transition Probabilities

P(Sunny → Sunny): 0.70

P(Rainy → Rainy): 0.60

Emission Probabilities

Sunny:

Walk: 0.60

Shop: 0.30

Rainy:

Walk: 0.10

Shop: 0.40

The Three Fundamental Problems

All HMM applications reduce to solving three fundamental problems, each addressed by a different algorithm:

Problem 1

Evaluation

Given $O$ and $\lambda$ , compute $P(O | \lambda)$

Algorithm: Forward

Use: Model selection, likelihood scoring

Problem 2

Decoding

Given $O$ and $\lambda$ , find best $Q^*$

Algorithm: Viterbi

Use: POS tagging, speech recognition

Problem 3

Learning

Given $O$ , find best $\lambda^*$

Algorithm: Baum-Welch (EM)

Use: Training models from data

Problem 1: Evaluation (Forward Algorithm)

Goal: Compute $P(O | \lambda)$ —the probability of observing a particular sequence given our model.

The naive approach would enumerate all possible state sequences: $P(O | \lambda) = \sum_{Q} P(O, Q | \lambda)$ . But with $N$ states and $T$ time steps, there are $N^T$ possible sequences—exponential!

The Forward Algorithm uses dynamic programming to compute this in $O(T \times N^2)$ time.

Forward Algorithm

Define: $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = s_i | \lambda)$

The probability of seeing observations up to time $t$ AND being in state $i$ at time $t$ .

1. Initialization:

\alpha_1(i) = \pi_i \cdot b_i(o_1)

2. Recursion:

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \cdot a_{ij} \right] \cdot b_j(o_{t+1})

3. Termination:

P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i)

Intuition: At each time step, we sum over all ways to reach each state (from any previous state), then multiply by the emission probability. The dynamic programming table stores these partial sums, avoiding redundant computation.

Interactive: Forward Algorithm

Watch the Forward algorithm compute the sequence probability step by step. Observe how the alpha values propagate through the trellis, summing contributions from all incoming paths.

Forward Algorithm Step-by-Step

Compute P(O|λ) - the probability of observing a sequence given the model

Observation Sequence:

O1:

O2:

O3:

Current Step Explanation

Initialization (t=0): Compute initial forward probabilities.

α₁(Sunny) = π(Sunny) × b_Sunny(Walk) = 0.6 × 0.6 = 0.3600

α₁(Rainy) = π(Rainy) × b_Rainy(Walk) = 0.4 × 0.1 = 0.0400

Forward Algorithm Formula

Initialization: α₁(i) = π(i) × b_i(o₁)

Recursion: α_t+1(j) = [Σ_i α_t(i) × a_ij] × b_j(o_t+1)

Termination: P(O|λ) = Σ_i α_T(i)

Problem 2: Decoding (Viterbi Algorithm)

Goal: Find the most likely state sequence given observations: $Q^* = \arg\max_Q P(Q | O, \lambda)$

The Viterbi Algorithm is nearly identical to Forward, but replaces SUM with MAX. Instead of summing over all paths, we keep only the best path to each state.

Viterbi Algorithm

Define: $\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = s_i, o_1, \ldots, o_t | \lambda)$

The probability of the best path ending in state $i$ at time $t$ .

1. Initialization:

\delta_1(i) = \pi_i \cdot b_i(o_1)

2. Recursion:

\delta_{t+1}(j) = \max_{i} \left[ \delta_t(i) \cdot a_{ij} \right] \cdot b_j(o_{t+1})

\psi_{t+1}(j) = \arg\max_{i} \left[ \delta_t(i) \cdot a_{ij} \right]

3. Termination & Backtracking:

q_T^* = \arg\max_{i} \delta_T(i)

q_t^* = \psi_{t+1}(q_{t+1}^*) \quad \text{for } t = T-1, \ldots, 1

Why backtracking? The delta values tell us the probability of the best path, but not which path it is. The psi array stores backpointers—at each step, we remember which previous state led to the current maximum. After reaching the end, we trace back through these pointers to recover the optimal sequence.

Interactive: Viterbi Algorithm

Step through the Viterbi algorithm to see how it finds the most likely hidden state sequence. Notice the backtracking phase that recovers the optimal path.

Viterbi Algorithm - Finding the Most Likely State Sequence

Decoding: Given observations, find argmax_Q P(Q|O, λ)

Observation Sequence:

O1:

O2:

O3:

Current Step

Initialization (t=0): Compute initial Viterbi scores.

δ₁(Sunny) = π(Sunny) × b_Sunny(Walk) = 0.6 × 0.6 = 0.3600

δ₁(Rainy) = π(Rainy) × b_Rainy(Walk) = 0.4 × 0.1 = 0.0400

Viterbi vs Forward Algorithm

Forward (Evaluation):
α_t(j) = SUM over all paths
Result: P(O | λ)

Viterbi (Decoding):
δ_t(j) = MAX over all paths
Result: Best state sequence Q*

Problem 3: Learning (Baum-Welch Algorithm)

Goal: Given observation sequences, estimate the model parameters $\lambda^* = (A^*, B^*, \pi^*)$ that maximize $P(O | \lambda)$ .

The Baum-Welch Algorithm is an Expectation-Maximization (EM) algorithm tailored for HMMs. It iterates between:

E-Step: Compute Expected Counts

Using current parameters, compute expected state occupancies and transition counts:

• $\gamma_t(i) = P(q_t = s_i | O, \lambda)$ — probability of being in state $i$ at time $t$
• $\xi_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda)$ — probability of transition $i \to j$ at time $t$

M-Step: Update Parameters

Re-estimate parameters using expected counts:

• $\hat{a}_{ij} = \frac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}$ — expected transitions $i \to j$ / visits to $i$
• $\hat{b}_j(k) = \frac{\sum_{t: o_t = v_k} \gamma_t(j)}{\sum_t \gamma_t(j)}$ — expected emissions of $k$ from state $j$
• $\hat{\pi}_i = \gamma_1(i)$ — expected initial state

Convergence: Baum-Welch is guaranteed to converge to a local optimum of the likelihood, but not necessarily the global optimum. In practice, multiple random restarts are used to find better solutions.

Applications in AI and Deep Learning

HMMs laid the foundation for sequence modeling. Understanding them illuminates modern deep learning approaches:

🎤 Speech Recognition: From HMMs to End-to-End

For decades, HMMs dominated speech recognition. Phonemes were hidden states; acoustic features were observations. Modern systems (Whisper, Wave2Vec) use transformers for end-to-end recognition, but the core problem—mapping audio to text—remains the same. CTC (Connectionist Temporal Classification) loss is inspired by HMM forward-backward computations.

📝 NLP: HMM POS Taggers to Neural Taggers

HMM part-of-speech taggers model word sequences with POS tags as hidden states. Neural sequence labelers (BiLSTM-CRF, BERT fine-tuning) achieve higher accuracy, but CRFs maintain the structured prediction insight from HMMs—modeling tag sequences rather than independent predictions.

🔄 Latent Variable Models: From HMMs to VAEs

HMMs are latent variable models: hidden states are discrete latent variables generating observations. Variational Autoencoders (VAEs) extend this to continuous latent spaces and nonlinear generation via neural networks. The ELBO objective in VAEs is analogous to HMM likelihood bounds.

⚡ Sequential Inference: Kalman Filters and Transformers

HMMs with continuous states and Gaussian emissions become Kalman Filters (used in robotics, tracking). The forward algorithm becomes Kalman filtering; smoothing becomes Kalman smoothing. Transformers replace the Markov assumption with attention, allowing every position to attend to all others—but sacrifice the efficient online inference HMMs provide.

HMM Concept	Modern Deep Learning Analog
Hidden states	Latent representations (VAE latent space, RNN hidden states)
Transition matrix A	Recurrent weights in RNN/LSTM
Emission matrix B	Decoder/output layer
Forward algorithm	Forward pass through sequence model
Viterbi decoding	Beam search, CTC decoding, argmax
Baum-Welch (EM)	Variational inference, ELBO optimization
Markov assumption	Attention allows breaking this constraint

Python Implementation

Let's implement the core HMM algorithms from scratch. This implementation covers all three fundamental problems.

Hidden Markov Model Implementation

🐍hmm_from_scratch.py

Explanation(20)

Code(73)

1Import NumPy

NumPy provides efficient matrix operations essential for HMM computations - all our probabilities are stored as matrices.

4Class Definition

We implement HMM as a class storing the model parameters (A, B, pi) and providing methods for the three fundamental problems.

5Constructor

n_states is the number of hidden states, n_observations is the number of possible observation symbols.

8Transition Matrix A

A[i,j] = P(state_j at t+1 | state_i at t). Each row sums to 1. Shape: (n_states, n_states). Initialized uniformly.

EXAMPLE

A[0,1] = 0.3 means 30% chance of transitioning from state 0 to state 1

11Emission Matrix B

B[i,k] = P(observation_k | state_i). Each row sums to 1. Shape: (n_states, n_observations). Encodes what each state 'emits'.

EXAMPLE

B[1,2] = 0.5 means state 1 has 50% chance of emitting observation symbol 2

14Initial Distribution pi

pi[i] = P(state_i at t=0). Sums to 1. Determines the probability of starting in each hidden state.

17Forward Algorithm

Computes P(O|lambda) - the probability of an observation sequence given the model. Uses dynamic programming to avoid exponential complexity.

20Initialize Alpha

alpha[t,i] = P(O_1,...,O_t, q_t=i | lambda). Start with alpha[0,i] = pi[i] * B[i, O[0]].

24Forward Recursion

For each time step, alpha[t,j] = SUM_i(alpha[t-1,i] * A[i,j]) * B[j, O[t]]. This is the key DP step.

30Termination

P(O|lambda) = SUM_i(alpha[T-1,i]). Sum of all paths that end at any state gives total sequence probability.

33Viterbi Algorithm

Finds the most likely state sequence Q* = argmax P(Q|O,lambda). Similar to forward but uses MAX instead of SUM.

36Initialize Delta

delta[t,i] = max probability of reaching state i at time t. psi stores backpointers for traceback.

42Viterbi Recursion

delta[t,j] = MAX_i(delta[t-1,i] * A[i,j]) * B[j, O[t]]. Take best incoming path, not sum.

47Store Backpointer

psi[t,j] records which previous state maximized the path to state j. Needed for backtracking.

51Backtracking

Starting from the best final state, trace back through psi pointers to recover the optimal path Q*.

58Baum-Welch Algorithm

EM algorithm for learning HMM parameters from observations. E-step computes expected counts, M-step updates parameters.

60EM Iterations

Iterate between E-step (compute gamma, xi using forward-backward) and M-step (update A, B, pi). Converges to local optimum.

63Compute Xi

xi[t,i,j] = P(q_t=i, q_{t+1}=j | O, lambda). Joint probability of consecutive state pairs given observations.

66Compute Gamma

gamma[t,i] = P(q_t=i | O, lambda). Probability of being in state i at time t, marginalized from xi.

69M-Step Updates

Update parameters using expected counts: A[i,j] = expected transitions i->j / expected visits to i. Similar for B and pi.

53 lines without explanation

1import numpy as np
2
3# Hidden Markov Model Implementation
4class HMM:
5    def __init__(self, n_states, n_observations):
6        self.n_states = n_states
7        self.n_observations = n_observations
8        # Transition matrix A[i,j] = P(state_j | state_i)
9        self.A = np.ones((n_states, n_states)) / n_states
10
11        # Emission matrix B[i,k] = P(obs_k | state_i)
12        self.B = np.ones((n_states, n_observations)) / n_observations
13
14        # Initial distribution pi[i] = P(start in state_i)
15        self.pi = np.ones(n_states) / n_states
16
17    def forward(self, observations):
18        """Compute P(O|lambda) using Forward algorithm."""
19        T = len(observations)
20        alpha = np.zeros((T, self.n_states))
21
22        # Initialization
23        alpha[0] = self.pi * self.B[:, observations[0]]
24
25        # Recursion
26        for t in range(1, T):
27            for j in range(self.n_states):
28                alpha[t, j] = np.sum(alpha[t-1] * self.A[:, j])
29                alpha[t, j] *= self.B[j, observations[t]]
30
31        # Termination: P(O|lambda) = sum of alpha at final step
32        return np.sum(alpha[-1]), alpha
33
34    def viterbi(self, observations):
35        """Find most likely state sequence using Viterbi."""
36        T = len(observations)
37        delta = np.zeros((T, self.n_states))
38        psi = np.zeros((T, self.n_states), dtype=int)
39
40        # Initialization
41        delta[0] = self.pi * self.B[:, observations[0]]
42
43        # Recursion
44        for t in range(1, T):
45            for j in range(self.n_states):
46                probs = delta[t-1] * self.A[:, j]
47                psi[t, j] = np.argmax(probs)  # Best previous state
48                delta[t, j] = np.max(probs) * self.B[j, observations[t]]
49
50        # Backtracking
51        path = np.zeros(T, dtype=int)
52        path[-1] = np.argmax(delta[-1])
53        for t in range(T-2, -1, -1):
54            path[t] = psi[t+1, path[t+1]]
55
56        return path, np.max(delta[-1])
57
58    def baum_welch(self, observations, n_iter=100):
59        """Learn parameters using Baum-Welch (EM) algorithm."""
60        for iteration in range(n_iter):
61            # E-Step: Compute forward/backward, then xi and gamma
62            prob, alpha = self.forward(observations)
63            beta = self._backward(observations)
64
65            xi = self._compute_xi(observations, alpha, beta, prob)
66            gamma = np.sum(xi, axis=2)  # Marginalize over next state
67
68            # M-Step: Update parameters
69            self.pi = gamma[0] / np.sum(gamma[0])
70            self.A = np.sum(xi, axis=0) / np.sum(gamma[:-1], axis=0)[:, None]
71            for k in range(self.n_observations):
72                mask = (observations == k)
73                self.B[:, k] = np.sum(gamma[mask], axis=0) / np.sum(gamma, axis=0)

Using libraries: For production use, consider hmmlearn (scikit-learn style API) or pomegranate (more flexible). PyTorch and TensorFlow also have HMM implementations with GPU acceleration.

Numerical Stability: In practice, alpha/delta values become very small for long sequences (products of many probabilities). Use log-space computations: log(ab) = log(a) + log(b), and log-sum-exp for stable summation.

Knowledge Check

Test your understanding of Hidden Markov Models with these questions:

Hidden Markov Models - Knowledge Check

1. In an HMM, what do we directly observe?

2. What does the Forward algorithm compute?

3. How does the Viterbi algorithm differ from the Forward algorithm?

4. What is the time complexity of the Forward algorithm for T observations and N states?

5. Which HMM problem does the Baum-Welch (EM) algorithm solve?

6. What does the 'Markov' in Hidden Markov Model refer to?

7. In speech recognition, what might the hidden states represent?

8. What is the emission probability b_j(o_t) in an HMM?

Summary

Key Takeaways

✅HMMs model sequences where hidden states generate observations probabilistically.

✅Five components: states S, observations V, transitions A, emissions B, initial π.

✅The Forward algorithm computes P(O|λ) in O(T × N²) via dynamic programming.

✅The Viterbi algorithm finds the best state sequence using MAX instead of SUM.

✅Baum-Welch (EM) learns parameters from observations without labeled states.

✅The Markov property makes computation tractable: state depends only on previous state.

✅HMMs influenced modern sequence models: RNNs, CRFs, VAEs, and transformers.

✅Applications span speech, NLP, bioinformatics, music, and robotics.

Looking Ahead

In the next section, we'll explore Gaussian Processes—another powerful probabilistic model for sequences and functions. While HMMs use discrete hidden states, GPs define distributions over functions using infinite-dimensional Gaussian distributions. They're particularly useful for regression with uncertainty quantification and have deep connections to neural networks through the neural tangent kernel.