Learning Objectives
By the end of this section, you will be able to:
π Core Knowledge
- β’ Explain what makes a Markov model "hidden"
- β’ Define the five components of an HMM mathematically
- β’ Describe the three fundamental HMM problems
- β’ Understand why dynamic programming is essential
π§ Practical Skills
- β’ Implement the Forward algorithm for sequence evaluation
- β’ Apply the Viterbi algorithm for optimal decoding
- β’ Understand the Baum-Welch learning algorithm
- β’ Recognize when to use HMMs vs other sequence models
π§ Deep Learning Connections
- β’ RNNs and LSTMs β Neural sequence models that generalize HMMs with continuous hidden states
- β’ CRFs β Conditional Random Fields combine HMM structure with discriminative training
- β’ Attention mechanisms β Transformers replace Markov assumptions with global attention
- β’ Variational autoencoders β VAEs extend HMM latent variable ideas to deep generative models
Where You'll Apply This: Speech recognition, natural language processing (POS tagging, named entity recognition), bioinformatics (gene finding, protein structure), time series analysis, music generation, and robotics (localization).
The Big Picture
Hidden Markov Models (HMMs) are one of the most successful statistical models for sequential data. They address a fundamental challenge: inferring hidden structure from noisy observations.
The Core Idea
Imagine you're in a room with no windows. You can't see the weather outside, but you can observe what your friend does each day: going for a walk, shopping, or cleaning. From these observations, you want to infer the hidden weather states. An HMM provides the mathematical framework to do exactly this.
Hidden States: The true underlying process we can't see
Observations: What we actually measure or detect
Inference: Determine hidden states from observations
The Story Behind HMMs
Leonard Baum & Lloyd Welch (1960s)
HMMs were developed at the Institute for Defense Analyses for speech recognition research. Baum and colleagues developed the mathematical foundations, including the famous Baum-Welch algorithm for learning HMM parameters from data.
Speech Recognition Revolution (1980s-2000s)
HMMs became the dominant approach for speech recognition at IBM, CMU, and Bell Labs. Systems like Dragon NaturallySpeaking and early Siri used HMMs. The states represented phonemes, and observations were acoustic features extracted from audio.
Bioinformatics Applications (1990s-present)
HMMs proved invaluable for analyzing DNA and protein sequences. Profile HMMs can identify protein families, predict gene structures, and align biological sequences. The hidden states represent functional regions (exons, introns, regulatory elements).
Where HMMs Are Used
| Domain | Hidden States | Observations |
|---|---|---|
| Speech Recognition | Phonemes/sub-word units | Acoustic features (MFCCs) |
| POS Tagging | Part-of-speech tags | Words in sentences |
| Gene Finding | Coding/non-coding regions | DNA nucleotides (A,T,G,C) |
| Handwriting Recognition | Character strokes | Pen coordinates |
| Music Generation | Musical modes/keys | Notes and rhythms |
| Robot Localization | True position | Noisy sensor readings |
Mathematical Foundation
An HMM is defined by two stochastic processes: a hidden Markov chain (the states we can't observe) and an observation process (what we actually see). Let's formalize this.
The Five Components
An HMM is specified by the tuple :
States:
The set of hidden states. These are the unobservable conditions we want to infer.
Observations:
The set of possible observation symbols. This is what we actually measure.
Transition Matrix: where
matrix of state transition probabilities. Row sums to 1.
Emission Matrix: where
matrix of observation probabilities given each state. Row sums to 1.
Initial Distribution: where
Vector of length giving initial state probabilities. Sums to 1.
Formal Definition
The key assumptions that define an HMM are:
HMM Assumptions
1. Markov Property (First-Order)
The next state depends only on the current state, not the entire history.
2. Output Independence
Each observation depends only on the current hidden state.
3. Time Homogeneity
Transition and emission probabilities don't change over time.
Given a sequence of observations and a hidden state sequence , the joint probability is:
Interactive: HMM Explorer
Explore how HMMs work with the classic weather-activity example. Adjust transition and emission probabilities to see how they affect the generated sequences. Notice how the hidden states (weather) produce the observations (activities) probabilistically.
Hidden Markov Model Visualizer
The classic weather-activity HMM: hidden weather states (Sunny/Rainy) generate observable activities (Walk/Shop/Clean).
Transition Probabilities
Emission Probabilities
The Three Fundamental Problems
All HMM applications reduce to solving three fundamental problems, each addressed by a different algorithm:
Evaluation
Given and , compute
Algorithm: Forward
Use: Model selection, likelihood scoring
Decoding
Given and , find best
Algorithm: Viterbi
Use: POS tagging, speech recognition
Learning
Given , find best
Algorithm: Baum-Welch (EM)
Use: Training models from data
Problem 1: Evaluation (Forward Algorithm)
Goal: Compute βthe probability of observing a particular sequence given our model.
The naive approach would enumerate all possible state sequences: . But with states and time steps, there are possible sequencesβexponential!
The Forward Algorithm uses dynamic programming to compute this in time.
Forward Algorithm
Define:
The probability of seeing observations up to time AND being in state at time .
1. Initialization:
2. Recursion:
3. Termination:
Interactive: Forward Algorithm
Watch the Forward algorithm compute the sequence probability step by step. Observe how the alpha values propagate through the trellis, summing contributions from all incoming paths.
Forward Algorithm Step-by-Step
Compute P(O|Ξ») - the probability of observing a sequence given the model
Current Step Explanation
Initialization (t=0): Compute initial forward probabilities.
Forward Algorithm Formula
Problem 2: Decoding (Viterbi Algorithm)
Goal: Find the most likely state sequence given observations:
The Viterbi Algorithm is nearly identical to Forward, but replaces SUM with MAX. Instead of summing over all paths, we keep only the best path to each state.
Viterbi Algorithm
Define:
The probability of the best path ending in state at time .
1. Initialization:
2. Recursion:
3. Termination & Backtracking:
Interactive: Viterbi Algorithm
Step through the Viterbi algorithm to see how it finds the most likely hidden state sequence. Notice the backtracking phase that recovers the optimal path.
Viterbi Algorithm - Finding the Most Likely State Sequence
Decoding: Given observations, find argmaxQ P(Q|O, Ξ»)
Current Step
Initialization (t=0): Compute initial Viterbi scores.
Viterbi vs Forward Algorithm
Ξ±t(j) = SUM over all paths
Result: P(O | Ξ»)
Ξ΄t(j) = MAX over all paths
Result: Best state sequence Q*
Problem 3: Learning (Baum-Welch Algorithm)
Goal: Given observation sequences, estimate the model parameters that maximize .
The Baum-Welch Algorithm is an Expectation-Maximization (EM) algorithm tailored for HMMs. It iterates between:
E-Step: Compute Expected Counts
Using current parameters, compute expected state occupancies and transition counts:
- β’ β probability of being in state at time
- β’ β probability of transition at time
M-Step: Update Parameters
Re-estimate parameters using expected counts:
- β’ β expected transitions / visits to
- β’ β expected emissions of from state
- β’ β expected initial state
Applications in AI and Deep Learning
HMMs laid the foundation for sequence modeling. Understanding them illuminates modern deep learning approaches:
π€ Speech Recognition: From HMMs to End-to-End
For decades, HMMs dominated speech recognition. Phonemes were hidden states; acoustic features were observations. Modern systems (Whisper, Wave2Vec) use transformers for end-to-end recognition, but the core problemβmapping audio to textβremains the same. CTC (Connectionist Temporal Classification) loss is inspired by HMM forward-backward computations.
π NLP: HMM POS Taggers to Neural Taggers
HMM part-of-speech taggers model word sequences with POS tags as hidden states. Neural sequence labelers (BiLSTM-CRF, BERT fine-tuning) achieve higher accuracy, but CRFs maintain the structured prediction insight from HMMsβmodeling tag sequences rather than independent predictions.
π Latent Variable Models: From HMMs to VAEs
HMMs are latent variable models: hidden states are discrete latent variables generating observations. Variational Autoencoders (VAEs) extend this to continuous latent spaces and nonlinear generation via neural networks. The ELBO objective in VAEs is analogous to HMM likelihood bounds.
β‘ Sequential Inference: Kalman Filters and Transformers
HMMs with continuous states and Gaussian emissions become Kalman Filters (used in robotics, tracking). The forward algorithm becomes Kalman filtering; smoothing becomes Kalman smoothing. Transformers replace the Markov assumption with attention, allowing every position to attend to all othersβbut sacrifice the efficient online inference HMMs provide.
| HMM Concept | Modern Deep Learning Analog |
|---|---|
| Hidden states | Latent representations (VAE latent space, RNN hidden states) |
| Transition matrix A | Recurrent weights in RNN/LSTM |
| Emission matrix B | Decoder/output layer |
| Forward algorithm | Forward pass through sequence model |
| Viterbi decoding | Beam search, CTC decoding, argmax |
| Baum-Welch (EM) | Variational inference, ELBO optimization |
| Markov assumption | Attention allows breaking this constraint |
Python Implementation
Let's implement the core HMM algorithms from scratch. This implementation covers all three fundamental problems.
hmmlearn (scikit-learn style API) or pomegranate (more flexible). PyTorch and TensorFlow also have HMM implementations with GPU acceleration.Knowledge Check
Test your understanding of Hidden Markov Models with these questions:
Hidden Markov Models - Knowledge Check
1. In an HMM, what do we directly observe?
2. What does the Forward algorithm compute?
3. How does the Viterbi algorithm differ from the Forward algorithm?
4. What is the time complexity of the Forward algorithm for T observations and N states?
5. Which HMM problem does the Baum-Welch (EM) algorithm solve?
6. What does the 'Markov' in Hidden Markov Model refer to?
7. In speech recognition, what might the hidden states represent?
8. What is the emission probability b_j(o_t) in an HMM?
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Gaussian Processesβanother powerful probabilistic model for sequences and functions. While HMMs use discrete hidden states, GPs define distributions over functions using infinite-dimensional Gaussian distributions. They're particularly useful for regression with uncertainty quantification and have deep connections to neural networks through the neural tangent kernel.