Introduction
In the previous section, we established a remarkable theoretical result: the Universal Approximation Theorem. It tells us that a single hidden layer network with enough neurons can approximate any continuous function to arbitrary precision. This raises a profound question: Why do we need deep networks at all?
The Central Paradox: If shallow networks can approximate anything, why does modern deep learning use networks with 100+ layers? The answer reveals one of the most important insights in machine learning: theoretical possibility and practical efficiency are vastly different things.
Consider an analogy: you could describe any image using a single polynomial equation with billions of terms. But this would be absurd—no one does this. Instead, we decompose images into hierarchies of edges, textures, parts, and objects. Deep networks discover similar compositional structures automatically.
This section explores the fundamental trade-off between width (neurons per layer) and depth (number of layers). We'll see why depth provides exponential efficiency gains for certain function classes, and why modern architectures like ResNet, GPT, and BERT all embrace depth as a core design principle.
Learning Objectives
After completing this section, you will be able to:
- Define shallow and deep networks precisely: Understand what constitutes depth, how to count layers, and the standard conventions in the field
- Explain the efficiency of depth: Articulate why deep networks can represent certain functions exponentially more efficiently than shallow networks
- Understand function composition: See how deep networks build complex functions by composing simpler functions layer by layer
- Apply the circuit complexity perspective: Connect neural network depth to results from computational complexity theory
- Make architectural decisions: Know when to choose deeper vs wider networks for different problems
- Implement and compare architectures: Build shallow and deep networks in PyTorch and measure their differences empirically
Where This Knowledge Applies
- Architecture Design: Choosing between wide-shallow and narrow-deep networks
- Transfer Learning: Understanding why depth enables hierarchical feature learning
- Model Compression: Knowing which layers can be pruned vs which are essential
- Debugging: Diagnosing when depth is hurting (vanishing gradients) vs helping
Defining Network Depth
Before comparing shallow and deep networks, we need precise definitions. The depth of a network is typically defined as the number of layers with learnable parameters, not counting the input layer.
Counting Layers: The Standard Convention
| Architecture | Hidden Layers | Total Layers | Depth Classification |
|---|---|---|---|
| Single perceptron | 0 | 1 (output only) | No hidden layers |
| 1 hidden layer MLP | 1 | 2 | Shallow |
| 2 hidden layer MLP | 2 | 3 | Shallow to moderate |
| 5 hidden layer MLP | 5 | 6 | Deep |
| ResNet-50 | 49 | 50 | Deep |
| GPT-3 (175B) | 96 | 96 | Very deep |
Convention: What Counts as a Layer?
- Fully connected (Linear) layers: Count as 1 layer each
- Convolutional layers: Count as 1 layer each
- Activation functions: Usually NOT counted (they're considered part of the previous layer)
- Batch normalization: Usually NOT counted (grouped with adjacent layer)
- Pooling layers: Sometimes counted, sometimes not (no learnable parameters)
Mathematical Definition
A feedforward neural network with L layers computes:

f(x) = (f^{(L)} ∘ f^{(L-1)} ∘ ... ∘ f^{(1)})(x)

Where each layer typically applies:

f^{(l)}(x) = σ(W^{(l)} x + b^{(l)})

Here W^{(l)} is the weight matrix, b^{(l)} is the bias vector, and σ is the activation function (applied element-wise).
| Symbol | Meaning | Typical Values |
|---|---|---|
| L | Total number of layers | 2-100+ |
| f^{(l)} | Function computed by layer l | Linear + activation |
| W^{(l)} | Weight matrix of layer l | Shape: (n_out, n_in) |
| b^{(l)} | Bias vector of layer l | Shape: (n_out,) |
| σ | Activation function | ReLU, tanh, sigmoid |
| ∘ | Function composition | f ∘ g means f(g(x)) |
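To make the notation concrete, here is a minimal sketch (assuming NumPy; the layer sizes and helper names are illustrative, not a standard API) of the composition f^{(2)} ∘ f^{(1)}:

```python
import numpy as np

def relu(x):
    """Element-wise activation σ."""
    return np.maximum(0.0, x)

def layer(W, b, sigma):
    """Build f^{(l)}(x) = σ(W @ x + b)."""
    return lambda x: sigma(W @ x + b)

# Hypothetical network: 3 inputs -> 4 hidden -> 2 outputs
rng = np.random.default_rng(0)
f1 = layer(rng.standard_normal((4, 3)), np.zeros(4), relu)
f2 = layer(rng.standard_normal((2, 4)), np.zeros(2), relu)

x = np.array([1.0, -0.5, 2.0])
y = f2(f1(x))      # function composition: f2 ∘ f1
print(y.shape)     # output has shape (2,)
```

Each call applies one affine map plus nonlinearity; nesting the calls is exactly the ∘ in the formula above.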
Shallow vs Deep: Where's the Line?
There's no universal consensus, but common conventions include:
- Shallow: 1-2 hidden layers
- Moderately deep: 3-10 hidden layers
- Deep: 10+ hidden layers
- Very deep: 50+ hidden layers (ResNet, Transformers)
Historical Note: Before 2012, networks with more than 2-3 hidden layers were considered "deep" and notoriously difficult to train. The AlexNet breakthrough (2012) used 8 layers, considered very deep at the time. Today's state-of-the-art models routinely use 100+ layers.
The Power of Shallow Networks
The Universal Approximation Theorem tells us that shallow networks are theoretically sufficient. Let's understand exactly what this means and where its limits lie.
What Shallow Networks Can Do
A single hidden layer network with sufficient width can:
- Approximate any continuous function on a compact domain to arbitrary precision
- Learn any decision boundary (for classification)
- Represent any input-output mapping given enough neurons
The Width Requirement
Here's the catch: "sufficient width" can mean exponentially many neurons. For a function of n input variables approximated to accuracy ε, the required width may scale as O(ε^{-n})—exponential in the input dimension. This exponential scaling makes shallow networks impractical for high-dimensional problems like images (224×224×3 = 150,528 dimensions).
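A quick back-of-the-envelope calculation (a sketch; exact constants depend on the function class) shows how fast exponential width requirements blow up:

```python
# Exponential width requirement: ~2^n neurons for an n-dimensional parity-like target
for n in [4, 8, 16, 32]:
    print(f"n = {n:2d}: shallow width ~ 2^{n} = {2 ** n:,} neurons")

# Input dimensionality of a modest RGB image
image_dims = 224 * 224 * 3
print(image_dims)  # 150,528 dimensions
```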
Interactive Comparison
This visualization shows how shallow and deep networks approximate the same target function. Observe how the shallow network requires many more neurons to achieve the same accuracy:
(Interactive demo: a shallow network with a single hidden layer of 8 neurons and a deep network with 4 layers of 4 neurons each are trained on the same target, with each network's MSE displayed live.)
Key Observations
- • The deep network often converges faster with fewer total parameters
- • The shallow network needs many more neurons to approximate complex patterns
- • Adjust parameters and watch how each architecture adapts
When Shallow Networks Excel
Despite their limitations, shallow networks work well for:
- Low-dimensional problems: Input dimension < 20
- Smooth functions: Functions without sharp transitions or hierarchical structure
- Tabular data: Traditional ML problems with feature-engineered inputs
- Theoretical analysis: Easier to analyze mathematically
Practical Rule of Thumb
If your inputs are low-dimensional, pre-engineered features and the target is smooth, start with one or two hidden layers; reach for depth only when the data has hierarchical or compositional structure.
The Efficiency of Depth
The key insight about depth is not that it enables new functions—shallow networks can already approximate anything—but that it provides exponential efficiency for many function classes.
The Exponential Gap
There exist functions that a deep network with O(n) neurons can compute, but that a shallow network would require O(2^n) neurons to approximate. This is known as the depth-width trade-off.
Classic Example: Parity Function
Consider the parity function on bits: output 1 if an odd number of input bits are 1, output 0 otherwise.
```python
# Parity function: XOR of all input bits
def parity(bits):
    result = 0
    for bit in bits:
        result ^= bit  # XOR accumulates
    return result

# Examples
print(parity([1, 0, 1, 0]))  # 0 (two 1s = even)
print(parity([1, 1, 1, 0]))  # 1 (three 1s = odd)
print(parity([1, 1, 1, 1]))  # 0 (four 1s = even)
```

Complexity bounds:
| Architecture | Required Size | Why |
|---|---|---|
| Deep network (n layers) | O(n) neurons | Each layer handles one XOR |
| Shallow network (2 layers) | O(2^n) neurons | Must enumerate all even/odd patterns |
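To see the O(n) deep construction concretely, note that XOR of two bits has an exact two-ReLU implementation: with s = a + b, xor(a, b) = relu(s) − 2·relu(s − 1). Chaining this gadget n − 1 times yields an exact depth-O(n) parity network with only two ReLU units per layer (a sketch; the function names are our own):

```python
def relu(x):
    return max(0.0, x)

def xor_gadget(a, b):
    """One hidden layer with two ReLU units computing a XOR b for a, b in {0, 1}."""
    s = a + b
    return relu(s) - 2.0 * relu(s - 1.0)  # sum 0 -> 0, sum 1 -> 1, sum 2 -> 0

def deep_parity(bits):
    """Chain n-1 gadgets: depth O(n), ~2 neurons per layer."""
    acc = bits[0]
    for bit in bits[1:]:
        acc = xor_gadget(acc, bit)
    return acc

print(deep_parity([1, 0, 1, 0]))  # 0.0 (even number of 1s)
print(deep_parity([1, 1, 1, 0]))  # 1.0 (odd number of 1s)
```

A shallow network has no such chaining available: it must carve out every odd-parity region of the hypercube in a single hidden layer, which is where the 2^n blowup comes from.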
Visualizing the Efficiency Gap
This interactive demonstration shows how the parameter count grows as we increase the problem complexity. Notice the exponential explosion for shallow networks:
Parity Function (XOR)
Compute the XOR of n input bits: f(x₁, x₂, ..., xₙ) = x₁ ⊕ x₂ ⊕ ... ⊕ xₙ. Deep networks need O(n) neurons; shallow networks need O(2ⁿ).
(Interactive demo: at n = 6, the shallow network needs roughly 64 neurons while a 6-layer deep network needs about 12—a 5.3× gap that widens exponentially as n grows.)
Key insight: As problem size n increases, the shallow network's requirements grow exponentially while the deep network grows only polynomially. This is the fundamental reason why depth matters.
Intuition: Why Does Depth Help?
The key insight is re-use of intermediate computations. A deep network can:
- Compute features once, use many times: Early layers detect basic patterns (edges, strokes); later layers combine these patterns in multiple ways
- Build hierarchical representations: Layer 1 detects edges, Layer 2 combines edges into textures, Layer 3 combines textures into parts, Layer 4 combines parts into objects
- Factor the function: Instead of learning f directly, learn f = g_L ∘ ... ∘ g_1, where each component g_i is simpler
The Deep Learning Hypothesis: Many real-world functions (vision, language, physics) have compositional structure that deep networks can exploit. This structure reflects the hierarchical nature of the physical world.
Function Composition: The Key Insight
The power of deep networks comes from function composition—building complex functions by chaining simpler functions together. Each layer transforms its input into a form that's easier for the next layer to process.
Mathematical Formulation
A deep network computes:

f(x) = (g_L ∘ g_{L-1} ∘ ... ∘ g_1)(x)

Each g_i is a relatively simple function (affine transformation + nonlinearity), but their composition can represent extremely complex mappings.
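The chained structure g_L ∘ ... ∘ g_1 can be sketched directly in Python (the particular component functions chosen here are illustrative):

```python
from functools import reduce
import math

def compose(*fns):
    """compose(f, g, h)(x) == f(g(h(x))) -- rightmost function applies first."""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

# Three simple "layers": affine, nonlinearity, affine
g1 = lambda x: 2.0 * x + 1.0
g2 = math.tanh
g3 = lambda x: 3.0 * x - 0.5

deep = compose(g3, g2, g1)  # g3 ∘ g2 ∘ g1
print(deep(0.0))            # equals g3(tanh(g1(0.0))) = 3*tanh(1.0) - 0.5
```

Each g_i on its own is trivial; the interesting behavior lives entirely in the composition, exactly as in a deep network.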
Analogy: Building with LEGO
Think of each layer as a LEGO brick:
- Individual bricks: Simple, limited shapes (like individual neurons)
- Small assemblies: First layer outputs, basic patterns
- Medium structures: Middle layer representations, meaningful parts
- Final creation: Network output, complex decision
You can't make a castle with one giant brick, but you can with many small ones arranged hierarchically.
Interactive Function Composition
Explore how simple functions compose into complex ones. Each layer applies a transformation, and their composition creates increasingly complex mappings:
(Interactive demo: add layers one at a time—e.g. linear → sin → relu—and watch the composed output f(x) change as each new transformation is appended.)
Real-World Example: Image Classification
In a CNN for image classification, function composition manifests as:
| Layer | Input | Output | Detects |
|---|---|---|---|
| 1 | Raw pixels | Edge maps | Horizontal, vertical, diagonal edges |
| 2 | Edge maps | Texture patterns | Corners, curves, gratings |
| 3 | Textures | Part detectors | Eyes, wheels, windows |
| 4 | Parts | Object detectors | Faces, cars, buildings |
| 5 | Objects | Scene understanding | Indoor, outdoor, activity |
Each layer's output becomes the next layer's input, progressively abstracting from raw pixels to semantic concepts.
Key Insight: Learned Hierarchies
Crucially, nobody designs this hierarchy by hand. The edge detectors, texture filters, and part detectors in the table above emerge automatically from gradient descent on the classification objective—depth gives the network the room to discover them.
Mathematical Framework
Let's formalize the concepts we've discussed with precise mathematical statements.
Expressive Power and Depth
Define the representational capacity of a network architecture as the set of functions it can compute. For a network with L layers and w neurons per layer, capacity grows only polynomially as w increases but can grow exponentially as L increases—for example, the number of linear regions a ReLU network can carve input space into scales exponentially with depth but only polynomially with width.
Depth Separation Theorems
Several theoretical results prove the power of depth:
- Telgarsky (2016): There exist functions computable by a network of depth Θ(k³) and polynomial size that require exponential size to approximate with depth O(k).
- Eldan & Shamir (2016): There exist functions that a 3-layer network with polynomial width can represent, but a 2-layer network requires exponential width.
- Cohen et al. (2016): Deep networks can achieve exponentially lower tensor rank than shallow networks for certain function classes.
The Depth-Width Trade-off Equation
For many compositional function classes of input dimension n, the number of parameters needed scales roughly as:

Shallow: Params = O(2^n)    Deep: Params = O(L · w²)

This means adding depth can reduce parameters exponentially—a linear increase in depth can yield an exponential reduction in width requirements.
Compositionality and Factorization
Many real-world functions have compositional structure:

f = g_k ∘ g_{k-1} ∘ ... ∘ g_1

If each g_i can be efficiently represented by a single layer, then f requires only k layers. A shallow network trying to represent f directly must "flatten" this composition, losing the factorization benefit.
Connection to Matrix Factorization
Circuit Complexity Perspective
The power of depth has deep connections to circuit complexity—a branch of theoretical computer science studying the resources needed to compute functions with Boolean circuits.
Neural Networks as Circuits
A feedforward neural network can be viewed as a circuit where:
- Input gates: Correspond to input neurons
- Computation gates: Correspond to hidden neurons (weighted sum + activation)
- Output gates: Correspond to output neurons
- Wires: Correspond to weighted connections
- Depth: Corresponds to the longest path from input to output
Depth Lower Bounds from Circuit Complexity
Classic results in circuit complexity show that certain functions require logarithmic depth:
| Function | Depth Lower Bound | Implication for Neural Networks |
|---|---|---|
| Parity of n bits | Ω(log n) | Shallow networks need exponential width |
| Majority function | Ω(log n) | Cannot be computed in constant depth |
| Iterated multiplication | Ω(log n) | Sequential operations need depth |
The NC Hierarchy
Complexity classes based on circuit depth reveal fundamental limits:
- NC⁰: Constant depth, bounded fan-in (very limited)
- NC¹: O(log n) depth, polynomial size
- NC²: O(log² n) depth, polynomial size
Many useful functions (like matrix multiplication) are in NC¹ but not NC⁰—they inherently require logarithmic depth.
The Deep Learning Connection: If the function you're trying to learn has inherent depth complexity, no amount of width in a shallow network can compensate. You need the depth.
Interactive Architecture Explorer
Now let's build intuition by experimenting with different architectures. This explorer lets you configure shallow and deep networks and observe their behavior:
(Interactive explorer: a shallow network with 1 hidden layer (~129 parameters) and a deep network with 4 hidden layers (~217 parameters) train side by side on an XOR-style dataset, with live accuracy readouts.)
Observation Tips
- • XOR requires learning non-linear decision boundaries
- • A single layer struggles; depth helps separate the quadrants
Key observations to make while exploring:
- Parameter count: Compare how many parameters each architecture uses
- Representation capacity: See what shapes/patterns each can learn
- Training dynamics: Notice how differently deep and shallow networks train
- Decision boundaries: Observe the complexity of learned boundaries
PyTorch Implementation
Let's implement both shallow and deep networks in PyTorch and compare them empirically.
Defining the Architectures
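A minimal sketch of the two architectures (assuming PyTorch; the class names and layer sizes are illustrative choices, picked to keep the comparison simple):

```python
import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    """One wide hidden layer."""
    def __init__(self, in_dim=2, hidden=64, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class DeepNet(nn.Module):
    """Several narrow hidden layers."""
    def __init__(self, in_dim=2, hidden=16, depth=4, out_dim=1):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def count_params(model):
    return sum(p.numel() for p in model.parameters())

shallow, deep = ShallowNet(), DeepNet()
print(count_params(shallow), count_params(deep))
```

Despite having four hidden layers, the deep network's narrow width keeps its parameter count in the same ballpark, which makes the comparison fair.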
Training Comparison
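A sketch of a training-loop comparison on a toy compositional target (assuming PyTorch; the target function, hyperparameters, and helper names are our own choices, and the models are redefined inline so the snippet stands alone):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy compositional target: f(x) = sin(3 * (x1 + x2))
X = torch.rand(512, 2) * 2 - 1
y = torch.sin(3 * X.sum(dim=1, keepdim=True))

def make_shallow():
    return nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def make_deep():
    return nn.Sequential(
        nn.Linear(2, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, 1),
    )

def train(model, epochs=200, lr=1e-2):
    """Full-batch regression; returns the final MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

shallow_loss = train(make_shallow())
deep_loss = train(make_deep())
print(f"shallow MSE: {shallow_loss:.4f}  deep MSE: {deep_loss:.4f}")
```

Which network wins on any single run depends on the seed and hyperparameters; the instructive experiment is to vary the target's compositional depth and watch how the gap shifts.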
Results Depend on the Function
Neither architecture wins universally: on targets with compositional structure, the deep network typically reaches lower error with a similar parameter budget, while on smooth, low-dimensional targets, the shallow network can match or beat it. Always compare on your actual data.
Practical Implications
Now that we understand the theory, let's translate it into practical guidelines for architecture design.
When to Use Deep Networks
| Domain | Why Depth Helps | Typical Depth |
|---|---|---|
| Computer Vision | Images have hierarchical structure (edges → textures → parts → objects) | 50-200 layers |
| Natural Language | Sentences have syntactic and semantic hierarchies | 12-96 layers |
| Speech Recognition | Phonemes → words → phrases → semantics | 10-50 layers |
| Game Playing | Tactics → strategy → meta-strategy | 20-80 layers |
| Scientific Simulations | Multi-scale physics phenomena | Variable |
When Shallow Networks May Suffice
- Tabular data: Features are already meaningful; hierarchy less important
- Low-dimensional problems: Not enough structure to exploit
- Latency-critical applications: Shallower is faster at inference
- Limited training data: Deep networks need more data to generalize
The Trade-offs to Consider
| Aspect | Shallow Networks | Deep Networks |
|---|---|---|
| Training speed | Faster per epoch | Slower per epoch |
| Optimization difficulty | Easier (gradients flow through few layers) | Harder (vanishing gradients, etc.) |
| Parameter efficiency | Less efficient for structured problems | More efficient for compositional functions |
| Generalization | May need more regularization | Implicit regularization from depth |
| Interpretability | Easier to analyze | Harder to understand intermediate representations |
| Hardware | Less memory | More memory for activations |
Modern Best Practices
- Start with proven architectures: ResNet, Transformer, EfficientNet have solved many depth-related problems
- Use residual connections: Enable training of very deep networks (100+ layers)
- Apply normalization: BatchNorm, LayerNorm stabilize deep network training
- Choose depth based on data: More data → can support more depth
- Consider inference constraints: Depth increases latency
Rule of Thumb
Default to a proven architecture at moderate depth; increase depth as your dataset grows, and only as long as validation performance keeps improving.
Summary
You've now developed a deep understanding of why depth matters in neural networks:
Key Takeaways
- Shallow networks are universal: The Universal Approximation Theorem guarantees they can approximate any function—but may need exponentially many neurons
- Depth provides exponential efficiency: For functions with compositional structure, deep networks use exponentially fewer parameters than shallow networks
- Function composition is the key: Deep networks build complex functions by composing simple layers, enabling feature re-use and hierarchical representations
- Real-world data is compositional: Images, language, and audio all have hierarchical structure that deep networks can exploit
- Depth has costs: Training difficulty, optimization challenges, and computational requirements all increase with depth
The Big Picture
Depth vs Width: Both increase capacity, but they do so differently. Width adds parallel processing; depth adds sequential composition. For functions with inherent hierarchy, depth wins exponentially.
| Concept | Key Formula | Insight |
|---|---|---|
| Function composition | f = g_L ∘ ... ∘ g_1 | Deep networks factor complex functions |
| Depth separation | Shallow: O(2^k), Deep: O(k) | Exponential efficiency gap exists |
| Parameter trade-off | Params ∝ depth × width² | Depth is more parameter-efficient |
Exercises
Conceptual Questions
- Explain why the Universal Approximation Theorem doesn't imply that shallow networks are always preferable to deep networks.
- A colleague argues: "We should always use deep networks since they're more powerful." Provide three counterarguments.
- For the parity function on 8 bits, estimate the number of neurons needed for (a) a 2-layer network and (b) an 8-layer network.
- Why might a deep network for image classification learn edge detectors in early layers, even though we never explicitly told it to?
Coding Exercises
- Parity experiment: Implement both shallow (2-layer) and deep (n-layer) networks for the n-bit parity function. Measure how the required width scales with n for each architecture.
- Visualization: Train shallow and deep networks on 2D classification problems (circles, spirals, checkerboard). Visualize the decision boundaries learned by each.
- MNIST comparison: Compare a wide-shallow network (1 hidden layer, 2048 neurons) with a narrow-deep network (5 hidden layers, 128 neurons each) on MNIST. Which achieves better accuracy with similar parameter count?
Challenge: Circuit Simulation
Implement a neural network that simulates a Boolean circuit. Show that the network depth must match the circuit depth for efficient simulation. Demonstrate this with a simple circuit (e.g., a 4-bit adder).
Exercise Hints
- Q1: Think about the distinction between "possible" and "practical."
- Q3: For parity on n bits: 2-layer needs O(2^n) neurons, n-layer needs O(n) neurons.
- MNIST: Use similar total parameter counts for fair comparison.
In the next section, we'll dive deeper into Why Depth Matters—exploring additional theoretical results, the role of depth in feature learning, and how modern architectures harness depth effectively.