Learning Objectives
By the end of this section, you will be able to:
- Understand why initialization matters: Grasp how poor initialization causes vanishing/exploding gradients that prevent deep networks from learning
- Derive Xavier initialization: Mathematically derive the variance-preserving initialization for linear and tanh activations
- Derive He initialization: Understand why ReLU networks need different initialization and derive the formula
- Implement initialization in PyTorch: Apply proper initialization to your networks using PyTorch's built-in functions
- Choose the right strategy: Select appropriate initialization based on your network architecture and activation functions
Why This Matters: Before Xavier and He initialization were discovered, training deep networks was extremely difficult. These insights about variance preservation enabled the training of networks with hundreds of layers. Understanding initialization is essential for training modern deep learning models.
The Big Picture
The Training Disaster
Imagine spending weeks designing a sophisticated 50-layer neural network, only to find that training fails completely—the loss doesn't decrease, gradients are either zero or infinity, and no learning occurs. This was a common experience in the early days of deep learning, and the culprit was often something as simple as how the weights were initialized.
Weight initialization might seem like a minor implementation detail, but it's actually one of the most critical factors determining whether a deep network can be trained at all. The values we assign to weights before training begins set the stage for everything that follows.
The Core Insight
The fundamental insight is this: as signals (activations during forward pass, gradients during backward pass) flow through a deep network, their magnitudes must remain approximately constant. If signals shrink at each layer, they vanish by the time they reach deep layers. If they grow, they explode into numerical overflow.
The key breakthrough was realizing that we can control this variance flow by carefully choosing the variance of the initial weights.
Historical Context
Two seminal papers solved this problem:
| Year | Authors | Method | Key Insight |
|---|---|---|---|
| 2010 | Xavier Glorot & Yoshua Bengio | Xavier/Glorot Init | Preserve variance for linear/tanh activations |
| 2015 | Kaiming He et al. | He/Kaiming Init | Account for ReLU halving the variance |
These discoveries, along with batch normalization and residual connections, enabled the training of very deep networks that power modern AI.
The Symmetry Breaking Problem
Why Not Initialize All Weights to the Same Value?
Consider initializing all weights to zero (or any constant). What happens?
In the forward pass, every neuron in a layer receives the same weighted sum of inputs (since all weights are identical). They all produce the same output. During backpropagation, they all receive the same gradient. And after the weight update, they all have the same new weight value.
This symmetry persists throughout training. The network effectively collapses to having just one neuron per layer, wasting all the extra capacity. No matter how long you train, the neurons never learn different features.
Never Initialize All Weights to the Same Value
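The collapse is easy to verify numerically. Below is a minimal sketch (a hypothetical toy network, not from the text): with every weight set to the same constant, both hidden units compute identical activations and receive identical gradient rows, so a gradient step leaves them identical again.

```python
import numpy as np

# Toy net: 3 inputs -> 2 hidden units (tanh) -> 1 output, with every
# weight set to the same constant 0.5 (hypothetical values).
rng = np.random.default_rng(0)
x = rng.standard_normal(3)

W1 = np.full((2, 3), 0.5)   # constant init: both rows identical
w2 = np.full(2, 0.5)

h = np.tanh(W1 @ x)         # forward pass: h[0] == h[1]
y = w2 @ h

# Backward pass for the loss L = y
dh = w2 * (1.0 - h ** 2)    # gradient at hidden pre-activations
dW1 = np.outer(dh, x)       # gradient w.r.t. W1: both rows identical
```

Since the gradient rows match exactly, the two neurons stay clones of each other forever, no matter the learning rate or number of steps.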
Random Initialization: The First Attempt
The obvious solution is random initialization. But what distribution should we sample from? A naive approach is to use a standard normal: W ~ N(0, 1).
This breaks symmetry, but creates a new problem. Consider a fully-connected layer with n_in inputs:

y = sum_{i=1}^{n_in} w_i x_i

If Var(w_i) = 1 and Var(x_i) = 1 (with everything independent and zero-mean):

Var(y) = sum_{i=1}^{n_in} Var(w_i) Var(x_i) = n_in

The variance grows by a factor of n_in at each layer! In a network with 256 neurons per layer, the variance explodes to astronomical values within just a few layers.
Quick Check
If you initialize weights from N(0, 1) and have 512 neurons per layer, what happens to the variance after 10 layers?
Variance Flow Through Networks
The Mathematical Framework
Let's carefully analyze how variance propagates through a single layer. Consider a fully-connected layer with:
- n_in input neurons (fan-in)
- n_out output neurons (fan-out)
- Weights W_ji drawn i.i.d. with mean 0 and variance Var(W)
- Inputs x_i with mean 0 and variance Var(x)
For a single output neuron (before activation):

y_j = sum_{i=1}^{n_in} W_ji x_i

Using the property that the variance of a sum of independent terms is the sum of variances:

Var(y_j) = n_in * Var(W) * Var(x)
The Variance Preservation Condition
For stable forward propagation, we want the output variance to equal the input variance: Var(y) = Var(x).
This gives us the condition:

Var(W) = 1 / n_in

This is the key insight: the variance of the weights should be inversely proportional to the number of input connections.
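The condition is easy to confirm empirically. A sketch with hypothetical sizes (one output neuron, n_in = 512 unit-variance inputs): with Var(W) = 1/n_in the output variance stays near 1, while Var(W) = 1 inflates it to roughly n_in.

```python
import numpy as np

# Empirical check of Var(y) = n_in * Var(W) * Var(x) for one output
# neuron, estimated over many independent input samples.
rng = np.random.default_rng(0)
n_in, samples = 512, 100_000

x = rng.standard_normal((samples, n_in))         # Var(x) = 1
W = rng.normal(0.0, np.sqrt(1.0 / n_in), n_in)   # Var(W) = 1/n_in
scaled_var = (x @ W).var()                       # stays near 1

W_naive = rng.standard_normal(n_in)              # Var(W) = 1
naive_var = (x @ W_naive).var()                  # blows up to ~n_in
```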
Backward Pass: Gradient Variance
The same analysis applies to gradients flowing backward. During backpropagation, the gradient with respect to an input is:

dL/dx_i = sum_{j=1}^{n_out} W_ji * dL/dy_j

For gradients to maintain variance:

Var(W) = 1 / n_out

We have a conflict! The forward pass wants Var(W) = 1/n_in, but the backward pass wants Var(W) = 1/n_out.
Interactive: Variance Propagation
Experiment with different initialization strategies and see how variance changes as signals propagate through a deep network. Observe how naive initialization causes variance to explode or vanish, while proper initialization maintains stability:
Naive Initialization
Variance can vanish or explode exponentially with depth. Unusable for deep networks.
Xavier Initialization
Maintains variance for linear/tanh activations. Slightly suboptimal for ReLU.
He Initialization
Accounts for ReLU killing half the neurons. Best choice for ReLU networks.
Key Observations
- With naive initialization (Var=1), variance grows exponentially—leading to numerical overflow
- Xavier initialization maintains variance for linear activations, but slightly underperforms for ReLU
- He initialization is optimal for ReLU, keeping variance stable across all layers
- Increasing network depth makes proper initialization even more critical
Xavier (Glorot) Initialization
Resolving the Conflict
Xavier Glorot and Yoshua Bengio proposed a compromise: use the average of the forward and backward requirements:

Var(W) = 2 / (n_in + n_out)

This ensures that variance is approximately preserved in both directions. The name "Xavier" comes from Xavier Glorot's first name.
Xavier Normal Initialization
Sample weights from a normal distribution with the computed variance:

W ~ N(0, 2 / (n_in + n_out))
Standard Deviation vs Variance
PyTorch's normal initializers take a std argument, not a variance. The standard deviation is the square root of the variance: std = sqrt(2 / (n_in + n_out)).
Xavier Uniform Initialization
Alternatively, sample from a uniform distribution with the same variance:

W ~ U(-a, a), where a = sqrt(6 / (n_in + n_out))

The limit a is derived from the uniform distribution's variance formula: Var(U(-a, a)) = a^2 / 3. Setting this equal to 2 / (n_in + n_out) gives a = sqrt(6 / (n_in + n_out)).
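A quick numerical sanity check of that derivation (hypothetical sizes n_in = 1024, n_out = 512): sampling W ~ U(-a, a) with a = sqrt(6/(n_in+n_out)) should give a sample variance matching the Xavier target 2/(n_in+n_out).

```python
import numpy as np

# Sample a full weight matrix from the Xavier uniform distribution and
# compare its empirical variance against the Xavier target variance.
rng = np.random.default_rng(0)
n_in, n_out = 1024, 512

a = np.sqrt(6.0 / (n_in + n_out))          # uniform bound, = 0.0625 here
W = rng.uniform(-a, a, size=(n_out, n_in))

target_var = 2.0 / (n_in + n_out)          # Xavier variance, ~0.0013
sample_var = W.var()                       # should be close to target_var
```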
When to Use Xavier
Xavier initialization was derived assuming linear activations or activations symmetric around zero like tanh. It works well for:
- Networks with tanh or sigmoid activations
- Linear layers without activation (like the final output layer)
- Attention mechanisms and transformers (which often use linear projections)
Not Optimal for ReLU
Xavier assumes the activation roughly preserves variance, which holds for tanh near zero but not for ReLU: ReLU zeroes half its inputs, so activations shrink layer by layer. ReLU networks need the He initialization derived below.
He (Kaiming) Initialization
The ReLU Problem
Consider the ReLU activation: f(z) = max(0, z). For a symmetric input distribution centered at zero, ReLU outputs zero for half the inputs. This halves the variance:

E[ReLU(z)^2] = Var(z) / 2

This derivation assumes z is symmetric around zero. For the forward pass through a ReLU layer:

Var(y) = (1/2) * n_in * Var(W) * Var(x)

To preserve variance (Var(y) = Var(x)):

Var(W) = 2 / n_in
This is exactly twice the variance of Xavier initialization (considering only fan-in). Kaiming He and colleagues derived this in their 2015 paper on training very deep networks.
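Both halves of the argument can be checked numerically. A sketch with hypothetical widths: (1) ReLU keeps exactly half the second moment of a zero-mean symmetric signal; (2) stacking ReLU layers with He-initialized weights (Var(W) = 2/n_in) keeps the signal magnitude roughly stable with depth.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) ReLU halves the second moment of a standard normal signal
z = rng.standard_normal(1_000_000)
relu_second_moment = (np.maximum(z, 0.0) ** 2).mean()   # ~ Var(z)/2 = 0.5

# (2) 10 ReLU layers of width 512 with He normal initialization
n, depth = 512, 10
x = rng.standard_normal((2000, n))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))       # He: Var(W) = 2/n
    x = np.maximum(x @ W, 0.0)                          # ReLU

deep_second_moment = (x ** 2).mean()                    # stays O(1)
```

With Xavier variance (half of He) in the same loop, the second moment would instead shrink by roughly a factor of 2 per layer.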
He Normal Initialization

W ~ N(0, 2 / n_in)

He Uniform Initialization

W ~ U(-sqrt(6 / n_in), sqrt(6 / n_in))
Fan-in vs Fan-out Mode
PyTorch allows you to choose whether to use n_in (fan-in) or n_out (fan-out) in the variance formula:
| Mode | Formula | Best For |
|---|---|---|
| fan_in | Var = 2/n_in | Preserving forward pass variance (default) |
| fan_out | Var = 2/n_out | Preserving backward pass variance |
Rule of Thumb
Use fan_in mode (the default) for most cases. Use fan_out only if you have specific reasons to prioritize gradient stability over activation stability.
Quick Check
For a layer with 512 input neurons and ReLU activation, what should be the weight variance using He initialization?
Interactive: Weight Distributions
Explore how different initialization strategies create different weight distributions. Adjust the fan-in and fan-out values to see how they affect the distribution's spread:
Observations
- Larger layers = smaller weights: As fan-in increases, the weights become more concentrated around zero
- Xavier vs He: He initialization has slightly wider spread than Xavier to compensate for ReLU
- Uniform vs Normal: Both achieve the same variance but with different distribution shapes
Interactive: Signal Propagation
Visualize how activations and gradients flow through a deep network with different initialization strategies. Watch how poor initialization causes signals to vanish or explode:
An example readout, with the signal vanishing after the first hidden layer:

| Layer | Mean | Std Dev | Max Abs | Status |
|---|---|---|---|---|
| Input | 5.23e-1 | 5.64e-1 | 1.84e+0 | Healthy |
| Hidden 1 | 5.31e-1 | 6.58e-1 | 3.05e+0 | Healthy |
| Hidden 2 | 0.00e+0 | 0.00e+0 | 0.00e+0 | Vanishing |
| Hidden 3 | 0.00e+0 | 0.00e+0 | 0.00e+0 | Vanishing |
| Hidden 4 | 0.00e+0 | 0.00e+0 | 0.00e+0 | Vanishing |
| ... | ... | ... | ... | ... |
He/Kaiming Initialization
Optimal for ReLU: Compensates for the fact that ReLU sets half of activations to zero. Maintains healthy signal flow in deep networks.
What to Look For
- All Zeros: All neurons output identical values—no learning possible
- Too Small: Activations shrink toward zero in deep layers (vanishing)
- Too Large: Activations grow exponentially (exploding)
- Xavier/He: Healthy signal flow with consistent magnitudes across layers
PyTorch Implementation
Built-in Initialization Functions
PyTorch provides initialization functions in torch.nn.init:
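For example, applied to a throwaway nn.Linear layer (the layer sizes here are hypothetical; the trailing underscore on each function marks an in-place operation on the tensor):

```python
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(1024, 512)

init.xavier_normal_(layer.weight)    # Var = 2/(n_in + n_out)
init.xavier_uniform_(layer.weight)   # U(-a, a), a = sqrt(6/(n_in + n_out))
init.kaiming_normal_(layer.weight, nonlinearity='relu')    # Var = 2/n_in
init.kaiming_uniform_(layer.weight, nonlinearity='relu')   # same variance, uniform
init.zeros_(layer.bias)              # biases to zero
```

Each call overwrites the previous one; in practice you pick a single initializer per parameter.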
Custom Initialization for a Model
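One way to do this, sketched as a hypothetical 3-layer MLP (names and sizes are illustrative): apply He initialization to the ReLU hidden layers, Xavier to the linear output layer, and zero all biases, directly in `__init__`.

```python
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init

class MLP(nn.Module):
    def __init__(self, in_dim=784, hidden=256, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out_dim)

        # He init for the ReLU layers, Xavier for the linear output
        for layer in (self.fc1, self.fc2):
            init.kaiming_normal_(layer.weight, nonlinearity='relu')
            init.zeros_(layer.bias)
        init.xavier_normal_(self.out.weight)   # no ReLU after the output
        init.zeros_(self.out.bias)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)

model = MLP()
```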
Initialization with apply()
PyTorch also provides apply() for recursively applying a function to all modules:
```python
def init_weights(m):
    """Initialize weights for different layer types."""
    if isinstance(m, nn.Linear):
        init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            init.zeros_(m.bias)
    elif isinstance(m, nn.BatchNorm2d):
        init.ones_(m.weight)
        init.zeros_(m.bias)

# Apply to model
model = MyNetwork()
model.apply(init_weights)
```
PyTorch Default Initialization
PyTorch's default initialization for nn.Linear is already pretty good: it uses a variant of Kaiming uniform initialization. But explicitly setting initialization gives you control and makes your code clearer.
Practical Guidelines
Choosing the Right Initialization
| Layer Type | Activation | Recommended Initialization |
|---|---|---|
| Linear/Dense | ReLU, LeakyReLU, PReLU | He (Kaiming) |
| Linear/Dense | Tanh, Sigmoid, Linear | Xavier (Glorot) |
| Conv2d | ReLU, LeakyReLU | He (Kaiming) |
| BatchNorm | - | weight=1, bias=0 |
| LayerNorm | - | weight=1, bias=0 |
| Embedding | - | Normal(0, 1) or Xavier |
| LSTM/GRU | Various | Orthogonal or Xavier |
| Output layer | Softmax/Linear | Xavier or smaller |
Special Cases
Residual Networks
For residual networks with skip connections, it's common to initialize the last layer of each residual block to zero. This makes the network initially behave like a shallower network:
```python
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),  # Initialize this to zero
        )
        # Zero-initialize the final layer
        init.zeros_(self.layers[-1].weight)
        init.zeros_(self.layers[-1].bias)

    def forward(self, x):
        return x + self.layers(x)  # Skip connection
```
Transformers
Modern transformers often scale initialization by layer depth:
```python
# GPT-style initialization
# Scale residual layers by 1/sqrt(num_layers)
scale = 1 / math.sqrt(num_layers)
init.normal_(layer.weight, mean=0.0, std=0.02 * scale)
```
Output Layers
For classification tasks, initializing the output layer with smaller weights often helps:
```python
# Smaller initialization for output layer
init.normal_(output_layer.weight, mean=0.0, std=0.01)
init.zeros_(output_layer.bias)
```
Related Topics
- Chapter 8 Section 6: Gradient Flow Analysis - Deep dive into vanishing/exploding gradients and why initialization matters for gradient flow
- Section 5: Normalization Layers - Batch and Layer Normalization also help stabilize activations and gradients
Summary
Weight initialization is crucial for training deep neural networks. Here are the key takeaways:
| Concept | Key Point |
|---|---|
| Symmetry Breaking | Random initialization is essential; identical weights lead to identical neurons |
| Variance Preservation | Signals should maintain consistent magnitude across layers |
| Xavier/Glorot | Var(W) = 2/(n_in + n_out), optimal for linear/tanh activations |
| He/Kaiming | Var(W) = 2/n_in, optimal for ReLU activations (doubles Xavier variance) |
| Fan-in vs Fan-out | fan_in preserves forward variance (default), fan_out preserves backward |
| Biases | Usually initialized to zero (no symmetry issue) |
Quick Reference: Initialization Formulas
```
Xavier Normal:  W ~ N(0, var),  var = 2 / (n_in + n_out)
Xavier Uniform: W ~ U(-a, a),   a = sqrt(6 / (n_in + n_out))

He Normal:      W ~ N(0, var),  var = 2 / n_in
He Uniform:     W ~ U(-a, a),   a = sqrt(6 / n_in)
```
Knowledge Check
Test your understanding of weight initialization concepts:
Why can't we initialize all weights to zero?
Exercises
Conceptual Questions
- Explain why initializing all weights to small constants (e.g., 0.01) is better than zeros but still problematic for deep networks.
- A layer has 1024 input neurons and 512 output neurons. Calculate the variance and uniform initialization bounds for both Xavier and He initialization.
- Why might you choose fan_out mode for He initialization in a generative model?
- How does batch normalization reduce the dependence on initialization?
Solution Hints
- Q1: Small constants break symmetry but can still cause vanishing gradients in deep networks since they don't scale with layer size.
- Q2: Xavier: Var = 2/(1024+512) ≈ 0.0013, bounds = ±sqrt(6/1536) ≈ ±0.0625. He: Var = 2/1024 ≈ 0.00195, bounds = ±sqrt(6/1024) ≈ ±0.0765.
- Q3: In generative models, gradient flow from loss to input is critical. fan_out optimizes for stable backward propagation.
- Q4: BatchNorm normalizes activations to zero mean and unit variance, partially correcting for poor initialization.
Coding Exercises
- Implement from scratch: Write your own xavier_normal and kaiming_normal functions without using PyTorch's built-in versions. Verify they produce the same variance.
- Variance tracking: Create a neural network that records the variance of activations at each layer during a forward pass. Compare different initializations.
- Training comparison: Train the same model architecture on MNIST with zeros, random N(0,1), Xavier, and He initialization. Plot loss curves and final accuracy for each.
- Gradient analysis: Extend exercise 2 to also track gradient variance during backward pass. Verify that He initialization preserves gradient magnitudes better than Xavier for ReLU networks.
Exercise Code Template
```python
class VarianceTracker(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.activation_vars = []

    def forward(self, x):
        self.activation_vars = [x.var().item()]
        for layer in self.layers:
            x = layer(x)
            self.activation_vars.append(x.var().item())
        return x
```
In the next section, we'll explore regularization techniques—methods like dropout, weight decay, and data augmentation that prevent overfitting and improve generalization.