Chapter 6

Shallow vs Deep Networks

From Perceptrons to Deep Networks

Introduction

In the previous section, we established a remarkable theoretical result: the Universal Approximation Theorem. It tells us that a single hidden layer network with enough neurons can approximate any continuous function to arbitrary precision. This raises a profound question: Why do we need deep networks at all?

The Central Paradox: If shallow networks can approximate anything, why does modern deep learning use networks with 100+ layers? The answer reveals one of the most important insights in machine learning: theoretical possibility and practical efficiency are vastly different things.

Consider an analogy: you could describe any image using a single polynomial equation with billions of terms. But this would be absurd—no one does this. Instead, we decompose images into hierarchies of edges, textures, parts, and objects. Deep networks discover similar compositional structures automatically.

This section explores the fundamental trade-off between width (neurons per layer) and depth (number of layers). We'll see why depth provides exponential efficiency gains for certain function classes, and why modern architectures like ResNet, GPT, and BERT all embrace depth as a core design principle.


Learning Objectives

After completing this section, you will be able to:

  1. Define shallow and deep networks precisely: Understand what constitutes depth, how to count layers, and the standard conventions in the field
  2. Explain the efficiency of depth: Articulate why deep networks can represent certain functions exponentially more efficiently than shallow networks
  3. Understand function composition: See how deep networks build complex functions by composing simpler functions layer by layer
  4. Apply the circuit complexity perspective: Connect neural network depth to results from computational complexity theory
  5. Make architectural decisions: Know when to choose deeper vs wider networks for different problems
  6. Implement and compare architectures: Build shallow and deep networks in PyTorch and measure their differences empirically

Where This Knowledge Applies

  • Architecture Design: Choosing between wide-shallow and narrow-deep networks
  • Transfer Learning: Understanding why depth enables hierarchical feature learning
  • Model Compression: Knowing which layers can be pruned vs which are essential
  • Debugging: Diagnosing when depth is hurting (vanishing gradients) vs helping

Defining Network Depth

Before comparing shallow and deep networks, we need precise definitions. The depth of a network is typically defined as the number of layers with learnable parameters, not counting the input layer.

Counting Layers: The Standard Convention

| Architecture | Hidden Layers | Total Layers | Depth Classification |
|---|---|---|---|
| Single perceptron | 0 | 1 (output only) | No hidden layers |
| 1 hidden layer MLP | 1 | 2 | Shallow |
| 2 hidden layer MLP | 2 | 3 | Shallow to moderate |
| 5 hidden layer MLP | 5 | 6 | Deep |
| ResNet-50 | 49 | 50 | Deep |
| GPT-3 (175B) | 96 | 96 | Very deep |

Convention: What Counts as a Layer?

  • Fully connected (Linear) layers: Count as 1 layer each
  • Convolutional layers: Count as 1 layer each
  • Activation functions: Usually NOT counted (they're considered part of the previous layer)
  • Batch normalization: Usually NOT counted (grouped with adjacent layer)
  • Pooling layers: Sometimes counted, sometimes not (no learnable parameters)
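The convention above can be applied mechanically. Here is a minimal sketch in PyTorch; the helper `count_depth` is a hypothetical illustration, not a PyTorch API. It counts only `Linear` and `Conv` modules, skipping activations, normalization, and pooling. (Note that BatchNorm does carry learnable affine parameters, but by the convention above it is grouped with its adjacent layer.)

```python
import torch.nn as nn

def count_depth(model: nn.Module) -> int:
    """Depth by the standard convention: count Linear/Conv layers only.
    Activations, BatchNorm, and pooling are grouped with adjacent layers."""
    countable = (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)
    return sum(1 for m in model.modules() if isinstance(m, countable))

mlp = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 1),
)
print(count_depth(mlp))  # 3 (ReLU and BatchNorm are not counted)
```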

Mathematical Definition

A feedforward neural network with $L$ layers computes:

$$f(x) = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(2)} \circ f^{(1)}(x)$$

Where each layer $f^{(l)}$ typically applies:

$$f^{(l)}(h) = \sigma(W^{(l)} h + b^{(l)})$$

Here $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias vector, and $\sigma$ is the activation function.

| Symbol | Meaning | Typical Values |
|---|---|---|
| $L$ | Total number of layers | 2-100+ |
| $f^{(l)}$ | Function computed by layer $l$ | Linear + activation |
| $W^{(l)}$ | Weight matrix of layer $l$ | Shape: (n_out, n_in) |
| $b^{(l)}$ | Bias vector of layer $l$ | Shape: (n_out,) |
| $\sigma$ | Activation function | ReLU, tanh, sigmoid |
| $\circ$ | Function composition | f ∘ g means f(g(x)) |
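The layer formula above can be executed directly. A minimal NumPy sketch, with hypothetical layer sizes and random weights, applying $\sigma = \text{ReLU}$ at every layer exactly as in the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical sizes: input dim 4, two hidden layers of 8, output dim 2
sizes = [4, 8, 8, 2]
params = [
    (rng.normal(size=(n_out, n_in)) * 0.1, np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x, params):
    """f(x) = f^(L) ∘ ... ∘ f^(1)(x), each layer h -> relu(W h + b)."""
    h = x
    for W, b in params:
        h = relu(W @ h + b)
    return h

x = rng.normal(size=4)
print(forward(x, params).shape)  # (2,)
```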

Shallow vs Deep: Where's the Line?

There's no universal consensus, but common conventions include:

  • Shallow: 1-2 hidden layers
  • Moderately deep: 3-10 hidden layers
  • Deep: 10+ hidden layers
  • Very deep: 50+ hidden layers (ResNet, Transformers)

Historical Note: Before 2012, networks with more than 2-3 hidden layers were considered "deep" and notoriously difficult to train. The AlexNet breakthrough (2012) used 8 layers, considered very deep at the time. Today's state-of-the-art models routinely use 100+ layers.

The Power of Shallow Networks

The Universal Approximation Theorem tells us that shallow networks are theoretically sufficient. Let's understand exactly what this means and where its limits lie.

What Shallow Networks Can Do

A single hidden layer network with sufficient width can:

  1. Approximate any continuous function on a compact domain to arbitrary precision
  2. Learn any decision boundary (for classification)
  3. Represent any input-output mapping given enough neurons

The Width Requirement

Here's the catch: "sufficient width" can mean exponentially many neurons. For a function of $d$ input variables, the required width may scale as:

$$\text{Width} = O(2^d) \text{ or worse}$$

This exponential scaling makes shallow networks impractical for high-dimensional problems like images (224×224×3 = 150,528 dimensions).
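To feel the scale, a quick back-of-envelope calculation for the image case: $2^{150528}$ is far too large to write out, but its number of decimal digits follows from $\log_{10}$:

```python
import math

d = 224 * 224 * 3  # input dimension of a 224x224 RGB image
print(d)           # 150528

# Decimal digits of 2**d, computed without building the huge integer
digits = math.floor(d * math.log10(2)) + 1
print(digits)      # a neuron count with tens of thousands of digits
```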

Interactive Comparison

This visualization shows how shallow and deep networks approximate the same target function. Observe how the shallow network requires many more neurons to achieve the same accuracy:

Shallow vs Deep Network Approximation

[Interactive demo: a shallow network (single hidden layer, 8 neurons) and a deep network (4 layers, 4 neurons each) are trained epoch by epoch against the same target function, with live MSE readouts for both architectures.]

Key Observations

  • The deep network often converges faster with fewer total parameters
  • The shallow network needs many more neurons to approximate complex patterns
  • Adjust parameters and watch how each architecture adapts

When Shallow Networks Excel

Despite their limitations, shallow networks work well for:

  • Low-dimensional problems: Input dimension < 20
  • Smooth functions: Functions without sharp transitions or hierarchical structure
  • Tabular data: Traditional ML problems with feature-engineered inputs
  • Theoretical analysis: Easier to analyze mathematically

Practical Rule of Thumb

For tabular data with < 100 features, a 2-3 layer MLP often works as well as deeper alternatives. For images, text, or audio, depth is almost always beneficial.

The Efficiency of Depth

The key insight about depth is not that it enables new functions—shallow networks can already approximate anything—but that it provides exponential efficiency for many function classes.

The Exponential Gap

There exist functions that a deep network with $O(k)$ neurons can compute, but a shallow network would require $O(2^k)$ neurons to approximate. This is known as the depth-width trade-off.

$$\text{Deep: } O(k) \text{ parameters} \quad \text{vs} \quad \text{Shallow: } O(2^k) \text{ parameters}$$

Classic Example: Parity Function

Consider the parity function on $n$ bits: output 1 if an odd number of input bits are 1, output 0 otherwise.

🐍parity_example.py

```python
# Parity function: XOR of all input bits
def parity(bits):
    result = 0
    for bit in bits:
        result ^= bit  # XOR accumulates
    return result

# Examples
print(parity([1, 0, 1, 0]))  # 0 (two 1s = even)
print(parity([1, 1, 1, 0]))  # 1 (three 1s = odd)
print(parity([1, 1, 1, 1]))  # 0 (four 1s = even)
```

Complexity bounds:

| Architecture | Required Size | Why |
|---|---|---|
| Deep network (n layers) | O(n) neurons | Each layer handles one XOR |
| Shallow network (2 layers) | O(2^n) neurons | Must enumerate all even/odd patterns |
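The deep-network row can be made concrete. One possible construction (a sketch, not the only one): an exact two-input XOR built from two ReLU units, chained n−1 times so that each "layer" handles one XOR with constant width:

```python
def xor_relu(a, b):
    """Exact XOR of two {0,1} inputs using two ReLU units:
    relu(a+b) - 2*relu(a+b-1) yields 0, 1, 0 for a+b = 0, 1, 2."""
    relu = lambda z: max(0.0, z)
    return relu(a + b) - 2.0 * relu(a + b - 1.0)

def deep_parity(bits):
    """Chain n-1 constant-width XOR gadgets: O(n) neurons in total."""
    acc = 0.0
    for bit in bits:
        acc = xor_relu(acc, bit)
    return int(acc)

print(deep_parity([1, 0, 1, 0]))  # 0
print(deep_parity([1, 1, 1, 0]))  # 1
```

A shallow network, by contrast, has no accumulator to reuse: it must dedicate separate hidden units to the exponentially many even/odd input patterns.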

Visualizing the Efficiency Gap

This interactive demonstration shows how the parameter count grows as we increase the problem complexity. Notice the exponential explosion for shallow networks:

The Exponential Efficiency of Depth

[Interactive chart: neurons required (log scale) vs. problem size n for the parity function f(x₁, x₂, ..., xₙ) = x₁ ⊕ x₂ ⊕ ... ⊕ xₙ. Deep networks need O(n) neurons; shallow networks need O(2ⁿ). At n = 6, for example, the shallow network needs 64 neurons while the deep network needs 12 across 6 layers: already a 5.3× gap, and it widens exponentially with n.]

Key insight: As problem size n increases, the shallow network's requirements grow exponentially while the deep network grows only polynomially. This is the fundamental reason why depth matters.

Intuition: Why Does Depth Help?

The key insight is re-use of intermediate computations. A deep network can:

  1. Compute features once, use many times: Early layers detect basic patterns (edges, strokes); later layers combine these patterns in multiple ways
  2. Build hierarchical representations: Layer 1 detects edges, Layer 2 combines edges into textures, Layer 3 combines textures into parts, Layer 4 combines parts into objects
  3. Factor the function: Instead of learning $f(x)$ directly, learn $f = g \circ h \circ k$ where each component is simpler

The Deep Learning Hypothesis: Many real-world functions (vision, language, physics) have compositional structure that deep networks can exploit. This structure reflects the hierarchical nature of the physical world.

Function Composition: The Key Insight

The power of deep networks comes from function composition—building complex functions by chaining simpler functions together. Each layer transforms its input into a form that's easier for the next layer to process.

Mathematical Formulation

A deep network computes:

$$f(x) = f_L(f_{L-1}(\cdots f_2(f_1(x))\cdots))$$

Each $f_i$ is a relatively simple function (affine transformation + nonlinearity), but their composition can represent extremely complex mappings.
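The chain $f_L(\cdots f_1(x)\cdots)$ is just repeated application. A minimal sketch in plain Python, using a linear → sin → relu chain as an example (the weight 2.0 and bias 0.5 in `linear` are arbitrary illustrative choices):

```python
import math

def relu(z):
    return max(0.0, z)

def linear(z):
    # Hypothetical affine map for illustration: w = 2.0, b = 0.5
    return 2.0 * z + 0.5

def compose(*fs):
    """compose(f3, f2, f1)(x) = f3(f2(f1(x))): rightmost applied first."""
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

f = compose(relu, math.sin, linear)  # f(x) = relu(sin(2x + 0.5))
print(f(0.0))                        # relu(sin(0.5)) ≈ 0.4794
```

Each individual function is trivial, yet the composition is already non-monotonic and piecewise: exactly the mechanism a deep network scales up with learned layers.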

Analogy: Building with LEGO

Think of each layer as a LEGO brick:

  • Individual bricks: Simple, limited shapes (like individual neurons)
  • Small assemblies: First layer outputs, basic patterns
  • Medium structures: Middle layer representations, meaningful parts
  • Final creation: Network output, complex decision

You can't make a castle with one giant brick, but you can with many small ones arranged hierarchically.

Interactive Function Composition

Explore how simple functions compose into complex ones. Each layer applies a transformation, and their composition creates increasingly complex mappings:

Function Composition: Building Complexity Layer by Layer
[Interactive demo: an input x is passed through up to 5 composable layers (e.g., linear → sin → relu), with each layer's intermediate output and the final composed value f(x) displayed live as you add or remove layers.]

Real-World Example: Image Classification

In a CNN for image classification, function composition manifests as:

| Layer | Input | Output | Detects |
|---|---|---|---|
| 1 | Raw pixels | Edge maps | Horizontal, vertical, diagonal edges |
| 2 | Edge maps | Texture patterns | Corners, curves, gratings |
| 3 | Textures | Part detectors | Eyes, wheels, windows |
| 4 | Parts | Object detectors | Faces, cars, buildings |
| 5 | Objects | Scene understanding | Indoor, outdoor, activity |

Each layer's output becomes the next layer's input, progressively abstracting from raw pixels to semantic concepts.

Key Insight: Learned Hierarchies

We don't tell the network what features to detect at each level. Given only the final task (classify images), backpropagation discovers that edge detectors at layer 1, texture detectors at layer 2, etc., minimize the loss. The hierarchy emerges from the data and task.

Mathematical Framework

Let's formalize the concepts we've discussed with precise mathematical statements.

Expressive Power and Depth

Define the representational capacity of a network architecture as the set of functions it can compute. For a network with $L$ layers and $n$ neurons per layer:

$$\mathcal{F}_{L,n} = \{f : \mathbb{R}^d \to \mathbb{R}^k \mid f \text{ is representable by } (L, n) \text{ architecture}\}$$

Depth Separation Theorems

Several theoretical results prove the power of depth:

  1. Telgarsky (2016): There exist functions computable by a network of depth $k^3$ and polynomial size that require exponential size to approximate with depth $k$.
  2. Eldan & Shamir (2016): There exist functions that a 3-layer network with polynomial width can represent, but a 2-layer network requires exponential width.
  3. Cohen et al. (2016): Deep networks can achieve exponentially lower tensor rank than shallow networks for certain function classes.

The Depth-Width Trade-off Equation

For many function classes, the number of parameters needed scales as:

$$\text{Parameters}_{\text{shallow}} = \Omega\left(\exp\left(\frac{\text{Parameters}_{\text{deep}}}{\text{depth}}\right)\right)$$

This means adding depth can reduce the parameter count exponentially: a linear increase in depth can buy an exponential reduction in the width a shallow network would otherwise need.

Compositionality and Factorization

Many real-world functions have compositional structure:

$$f(x) = g_K \circ g_{K-1} \circ \cdots \circ g_1(x)$$

If each $g_i$ can be efficiently represented by a single layer, then $f$ requires only $K$ layers. A shallow network trying to represent $f$ directly must "flatten" this composition, losing the factorization benefit.

Connection to Matrix Factorization

This is analogous to matrix factorization. Representing an $m \times n$ matrix directly requires $mn$ parameters. If the matrix has rank $r$, we can factor it as $UV^T$ with only $(m+n)r$ parameters—potentially much smaller.
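The analogy can be checked numerically. A minimal NumPy sketch (the dimensions m, n and rank r are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 80, 5

# Build a rank-r matrix explicitly from its factors
U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))
M = U @ V.T

dense_params = m * n           # 8000 numbers to store M directly
factored_params = (m + n) * r  # 900 numbers for the factorization
print(dense_params, factored_params)

# Recover a rank-r factorization via SVD and check it reproduces M
Uf, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat = (Uf[:, :r] * s[:r]) @ Vt[:r]
print(np.allclose(M, M_hat))  # True: 900 numbers suffice, not 8000
```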

Circuit Complexity Perspective

The power of depth has deep connections to circuit complexity—a branch of theoretical computer science studying the resources needed to compute functions with Boolean circuits.

Neural Networks as Circuits

A feedforward neural network can be viewed as a circuit where:

  • Input gates: Correspond to input neurons
  • Computation gates: Correspond to hidden neurons (weighted sum + activation)
  • Output gates: Correspond to output neurons
  • Wires: Correspond to weighted connections
  • Depth: Corresponds to the longest path from input to output

Depth Lower Bounds from Circuit Complexity

Classic results in circuit complexity show that certain functions require logarithmic depth:

| Function | Depth Lower Bound | Implication for Neural Networks |
|---|---|---|
| Parity of n bits | Ω(log n) | Shallow networks need exponential width |
| Majority function | Ω(log n) | Cannot be computed in constant depth |
| Iterated multiplication | Ω(log n) | Sequential operations need depth |

The NC Hierarchy

Complexity classes based on circuit depth reveal fundamental limits:

  • NC⁰: Constant depth, bounded fan-in (very limited)
  • NC¹: O(log n) depth, polynomial size
  • NC²: O(log² n) depth, polynomial size

Many useful functions (like matrix multiplication) are in NC¹ but not NC⁰—they inherently require logarithmic depth.

The Deep Learning Connection: If the function you're trying to learn has inherent depth complexity, no amount of width in a shallow network can compensate. You need the depth.

Interactive Architecture Explorer

Now let's build intuition by experimenting with different architectures. This explorer lets you configure shallow and deep networks and observe their behavior:

[Interactive demo: a shallow network (1 hidden layer, 129 parameters) and a deep network (4 hidden layers, 217 parameters) are trained side by side on the XOR problem for 100 epochs, with live accuracy readouts for both.]

Observation Tips

  • XOR requires learning non-linear decision boundaries
  • A single layer struggles; depth helps separate the quadrants

Key observations to make while exploring:

  1. Parameter count: Compare how many parameters each architecture uses
  2. Representation capacity: See what shapes/patterns each can learn
  3. Training dynamics: Notice how differently deep and shallow networks train
  4. Decision boundaries: Observe the complexity of learned boundaries

PyTorch Implementation

Let's implement both shallow and deep networks in PyTorch and compare them empirically.

Defining the Architectures

Comparing Shallow and Deep Architectures
🐍shallow_vs_deep.py
Notes on the code:

  • Shallow network: a single hidden layer; all capacity comes from width (e.g., hidden_dim=256 means 256 neurons in one layer).
  • nn.Sequential: chains layers together so data flows Linear → ReLU → Linear; this is the simplest way to build feedforward networks.
  • Deep network: multiple hidden layers; capacity comes from depth, with each layer narrower (e.g., 5 layers of 32 neurons each).
  • Dynamic construction: layers are built in a loop, allowing arbitrary depth; each hidden layer has the same width (hidden_dim).
  • Parameter counting: helps compare architectures fairly; both networks have roughly the same number of parameters, distributed differently.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class ShallowNetwork(nn.Module):
    """Wide, shallow network with a single hidden layer."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.model(x)

    def count_parameters(self):
        return sum(p.numel() for p in self.parameters())


class DeepNetwork(nn.Module):
    """Narrow, deep network with multiple hidden layers."""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=5):
        super().__init__()

        layers = []
        # First layer
        layers.extend([nn.Linear(input_dim, hidden_dim), nn.ReLU()])

        # Hidden layers
        for _ in range(num_layers - 2):
            layers.extend([nn.Linear(hidden_dim, hidden_dim), nn.ReLU()])

        # Output layer
        layers.append(nn.Linear(hidden_dim, output_dim))

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

    def count_parameters(self):
        return sum(p.numel() for p in self.parameters())


# Compare architectures with similar parameter counts
input_dim = 10
output_dim = 1

# Shallow: 1 hidden layer, 256 neurons
# Parameters: 10*256 + 256 + 256*1 + 1 = 3,073
shallow_net = ShallowNetwork(input_dim, 256, output_dim)

# Deep: 5 layers of 32 neurons (4 hidden + output)
# Parameters: 10*32 + 32 + 3*(32*32 + 32) + 32*1 + 1 = 3,553
deep_net = DeepNetwork(input_dim, 32, output_dim, num_layers=5)

print(f"Shallow network parameters: {shallow_net.count_parameters():,}")
print(f"Deep network parameters: {deep_net.count_parameters():,}")
```

Training Comparison

Training Shallow vs Deep Networks
🐍training_comparison.py
Notes on the code:

  • Compositional data generation: the target function is built in stages (pairwise products → tanh → addition → sin), mimicking real-world hierarchical structure. Deep networks can learn this decomposition; shallow networks must approximate it directly.
  • Training loop: standard PyTorch training with the Adam optimizer; we track both training and test losses to monitor overfitting.
  • Fair comparison: both networks have similar parameter counts. The deep network typically achieves lower loss on compositional data despite having fewer neurons per layer.
```python
def generate_compositional_data(n_samples=1000, input_dim=10):
    """Generate data that has compositional structure."""
    X = torch.randn(n_samples, input_dim)

    # Target: hierarchical XOR-like function
    # First, compute pairwise products
    pairs = X[:, 0] * X[:, 1] + X[:, 2] * X[:, 3] + X[:, 4] * X[:, 5]
    # Then, apply nonlinearity and combine more
    intermediate = torch.tanh(pairs) + torch.relu(X[:, 6:].sum(dim=1))
    # Final transformation
    y = torch.sin(intermediate).unsqueeze(1)

    return X, y


def train_and_evaluate(model, X_train, y_train, X_test, y_test, epochs=500):
    """Train a model and return train/test losses."""
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    train_losses = []
    test_losses = []

    for epoch in range(epochs):
        # Training
        model.train()
        optimizer.zero_grad()
        pred = model(X_train)
        loss = criterion(pred, y_train)
        loss.backward()
        optimizer.step()

        # Evaluation
        model.eval()
        with torch.no_grad():
            train_loss = criterion(model(X_train), y_train).item()
            test_loss = criterion(model(X_test), y_test).item()

        train_losses.append(train_loss)
        test_losses.append(test_loss)

    return train_losses, test_losses


# Generate data
X, y = generate_compositional_data(2000, 10)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Train both networks
shallow_train, shallow_test = train_and_evaluate(
    ShallowNetwork(10, 256, 1), X_train, y_train, X_test, y_test
)
deep_train, deep_test = train_and_evaluate(
    DeepNetwork(10, 32, 1, 5), X_train, y_train, X_test, y_test
)

print(f"Shallow - Final train loss: {shallow_train[-1]:.4f}")
print(f"Shallow - Final test loss: {shallow_test[-1]:.4f}")
print(f"Deep - Final train loss: {deep_train[-1]:.4f}")
print(f"Deep - Final test loss: {deep_test[-1]:.4f}")
```

Results Depend on the Function

The deep network outperforms the shallow network when the target function has compositional structure. For simple, smooth functions without hierarchy, the shallow network may perform equally well or better.

Practical Implications

Now that we understand the theory, let's translate it into practical guidelines for architecture design.

When to Use Deep Networks

| Domain | Why Depth Helps | Typical Depth |
|---|---|---|
| Computer Vision | Images have hierarchical structure (edges → textures → parts → objects) | 50-200 layers |
| Natural Language | Sentences have syntactic and semantic hierarchies | 12-96 layers |
| Speech Recognition | Phonemes → words → phrases → semantics | 10-50 layers |
| Game Playing | Tactics → strategy → meta-strategy | 20-80 layers |
| Scientific Simulations | Multi-scale physics phenomena | Variable |

When Shallow Networks May Suffice

  • Tabular data: Features are already meaningful; hierarchy less important
  • Low-dimensional problems: Not enough structure to exploit
  • Latency-critical applications: Shallower is faster at inference
  • Limited training data: Deep networks need more data to generalize

The Trade-offs to Consider

| Aspect | Shallow Networks | Deep Networks |
|---|---|---|
| Training speed | Faster per epoch | Slower per epoch |
| Optimization difficulty | Easier (fewer local minima) | Harder (vanishing gradients, etc.) |
| Parameter efficiency | Less efficient for structured problems | More efficient for compositional functions |
| Generalization | May need more regularization | Implicit regularization from depth |
| Interpretability | Easier to analyze | Harder to understand intermediate representations |
| Hardware | Less memory | More memory for activations |

Modern Best Practices

  1. Start with proven architectures: ResNet, Transformer, EfficientNet have solved many depth-related problems
  2. Use residual connections: Enable training of very deep networks (100+ layers)
  3. Apply normalization: BatchNorm, LayerNorm stabilize deep network training
  4. Choose depth based on data: More data → can support more depth
  5. Consider inference constraints: Depth increases latency
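Items 2 and 3 of the list above can be combined in a few lines. A minimal sketch of a pre-norm residual block in PyTorch; the dimensions (64, 256) and block count (20) are arbitrary illustrative choices, not taken from any particular architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-norm residual block: x + F(LayerNorm(x)).
    The identity path lets gradients flow unchanged through deep stacks."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))

# A 20-block stack is trainable because each block starts near the identity
model = nn.Sequential(*[ResidualBlock(64, 256) for _ in range(20)])
x = torch.randn(8, 64)
print(model(x).shape)  # torch.Size([8, 64])
```

Because each block adds a perturbation to an identity path rather than replacing its input, stacking many of them does not compound the signal degradation that plagues plain deep MLPs.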

Rule of Thumb

For a new problem, start with a proven architecture. Only after establishing a baseline should you experiment with making it shallower (for efficiency) or deeper (for capacity).

Summary

You've now developed a deep understanding of why depth matters in neural networks:

Key Takeaways

  1. Shallow networks are universal: The Universal Approximation Theorem guarantees they can approximate any function—but may need exponentially many neurons
  2. Depth provides exponential efficiency: For functions with compositional structure, deep networks use exponentially fewer parameters than shallow networks
  3. Function composition is the key: Deep networks build complex functions by composing simple layers, enabling feature re-use and hierarchical representations
  4. Real-world data is compositional: Images, language, and audio all have hierarchical structure that deep networks can exploit
  5. Depth has costs: Training difficulty, optimization challenges, and computational requirements all increase with depth

The Big Picture

Depth vs Width: Both increase capacity, but they do so differently. Width adds parallel processing; depth adds sequential composition. For functions with inherent hierarchy, depth wins exponentially.
| Concept | Key Formula | Insight |
|---|---|---|
| Function composition | f = g_L ∘ ... ∘ g_1 | Deep networks factor complex functions |
| Depth separation | Shallow: O(2^k), Deep: O(k) | Exponential efficiency gap exists |
| Parameter trade-off | Params ∝ depth × width² | Depth is more parameter-efficient |

Exercises

Conceptual Questions

  1. Explain why the Universal Approximation Theorem doesn't imply that shallow networks are always preferable to deep networks.
  2. A colleague argues: "We should always use deep networks since they're more powerful." Provide three counterarguments.
  3. For the parity function on 8 bits, estimate the number of neurons needed for (a) a 2-layer network and (b) an 8-layer network.
  4. Why might a deep network for image classification learn edge detectors in early layers, even though we never explicitly told it to?


Coding Exercises

  1. Parity experiment: Implement both shallow (2-layer) and deep (n-layer) networks for the n-bit parity function. Measure how the required width scales with n for each architecture.
  2. Visualization: Train shallow and deep networks on 2D classification problems (circles, spirals, checkerboard). Visualize the decision boundaries learned by each.
  3. MNIST comparison: Compare a wide-shallow network (1 hidden layer, 2048 neurons) with a narrow-deep network (5 hidden layers, 128 neurons each) on MNIST. Which achieves better accuracy with similar parameter count?

Challenge: Circuit Simulation

Implement a neural network that simulates a Boolean circuit. Show that the network depth must match the circuit depth for efficient simulation. Demonstrate this with a simple circuit (e.g., a 4-bit adder).

Exercise Hints

  • Q1: Think about the distinction between "possible" and "practical."
  • Q3: For parity on n bits: 2-layer needs O(2^n) neurons, n-layer needs O(n) neurons.
  • MNIST: Use similar total parameter counts for fair comparison.

In the next section, we'll dive deeper into Why Depth Matters—exploring additional theoretical results, the role of depth in feature learning, and how modern architectures harness depth effectively.