Chapter 9: Training Neural Networks

Optimizers

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the optimizer landscape: Know the key optimizers (SGD, Momentum, RMSprop, Adam) and the problems each one solves
  2. Derive optimizer update rules: Understand the mathematical formulation behind each optimizer and what each term contributes
  3. Apply physical intuition: Connect momentum and adaptive learning rates to physics concepts like velocity, friction, and acceleration
  4. Implement optimizers in PyTorch: Use PyTorch's optimizer API and understand when to choose each optimizer
  5. Diagnose optimization problems: Recognize when training issues stem from optimizer choice and how to fix them
Why This Matters: The optimizer determines how your network learns. A well-chosen optimizer can mean the difference between a model that converges in hours versus days—or one that never converges at all. Understanding optimizers is essential for training deep networks effectively.

The Big Picture

The Problem We're Solving

In the previous section, we established the training loop: forward pass, compute loss, backward pass, update weights. The optimizer is the algorithm that performs that final step—deciding exactly how to adjust each weight based on its gradient.

The simplest approach is vanilla gradient descent: move each weight in the direction opposite to its gradient, scaled by a learning rate. But this simple approach has serious problems:

  • Oscillation: In ravines (long, narrow valleys in the loss landscape), gradients point steeply across the valley but only gently along it. SGD oscillates back and forth across the ravine instead of moving efficiently toward the minimum.
  • Uniform learning rates: All parameters get the same learning rate, but some features are sparse (appear rarely) while others are dense. Sparse features need larger updates when they appear.
  • Local minima and saddle points: With no momentum, the optimizer can get stuck in suboptimal regions.

A Brief History

The evolution of optimizers is a story of progressively solving these problems:

| Year | Optimizer | Key Innovation | Introduced By |
|------|-----------|----------------|---------------|
| 1847 | Gradient Descent | Follow the negative gradient | Augustin-Louis Cauchy |
| 1964 | Momentum | Accumulate velocity to smooth updates | Boris Polyak |
| 2011 | AdaGrad | Adapt learning rate per-parameter | John Duchi et al. |
| 2012 | RMSprop | Use moving average instead of sum | Geoffrey Hinton (unpublished) |
| 2014 | Adam | Combine momentum + adaptive LR | Diederik Kingma & Jimmy Ba |
| 2017 | AdamW | Fix weight decay in Adam | Ilya Loshchilov & Frank Hutter |

Today, Adam is the default choice for most practitioners, though SGD with momentum remains competitive for some tasks (especially vision models). Let's understand why by building up from the basics.


Vanilla Stochastic Gradient Descent

The simplest optimizer updates each parameter by stepping in the direction opposite to the gradient:

$$\theta_{t+1} = \theta_t - \eta \cdot g_t$$

Where:

  • $\theta_t$ is the parameter value at step $t$
  • $\eta$ (eta) is the learning rate
  • $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss
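To make the update rule concrete, here is a minimal sketch in plain Python (no PyTorch); the toy loss $L(\theta) = (\theta - 3)^2$ and the name `sgd_step` are illustrative choices, not library API:

```python
# Vanilla SGD on the toy 1-D loss L(theta) = (theta - 3)^2,
# whose gradient is g = 2 * (theta - 3).
def sgd_step(theta, grad, lr):
    """theta_{t+1} = theta_t - lr * g_t"""
    return theta - lr * grad

theta, lr = 0.0, 0.1
for _ in range(50):
    g = 2 * (theta - 3.0)         # gradient of the loss at theta
    theta = sgd_step(theta, g, lr)

print(round(theta, 4))            # converges toward the minimum at theta = 3
```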

Why "Stochastic"?

In true gradient descent, we would compute the gradient over the entire dataset. Stochastic gradient descent (SGD) instead computes gradients on small batches, introducing noise into the gradient estimate. This noise:

  • Makes training faster (we update after each batch, not each epoch)
  • Acts as regularization (noise helps escape local minima)
  • Reduces memory requirements (we only need one batch in memory)

The Problem: Ravines

Consider a loss function shaped like a long, narrow valley. The gradient perpendicular to the valley (across it) is large, while the gradient along the valley (toward the minimum) is small.

What happens with vanilla SGD? The optimizer takes large steps across the valley and small steps along it. Result: it oscillates back and forth across the valley, making slow progress toward the minimum.

$$\text{Progress along valley} \ll \text{Oscillation across valley}$$

The Hessian Perspective

Mathematically, ravines occur when the Hessian (the matrix of second derivatives) has eigenvalues of very different magnitudes. The condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ measures this: higher condition numbers mean more severe ravines and slower convergence.
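A quick numerical illustration (a sketch using a diagonal Hessian, so the eigenvalues can be read off directly):

```python
# A 2-D quadratic ravine L(x, y) = 0.5 * (100 * x**2 + 1 * y**2)
# has Hessian diag(100, 1), so the eigenvalues are 100 and 1.
lam_max, lam_min = 100.0, 1.0
kappa = lam_max / lam_min
print(kappa)            # 100.0 -- a severe ravine

# Gradient descent on a quadratic is only stable when lr < 2 / lam_max,
# yet progress along the shallow direction scales with lr * lam_min,
# so a large kappa forces small steps and slow progress along the valley.
max_stable_lr = 2 / lam_max
print(max_stable_lr)    # 0.02
```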

Quick Check

What causes oscillation in vanilla SGD?


Momentum: Learning from Physics

Momentum solves the oscillation problem by borrowing an idea from physics: a ball rolling down a hill accumulates velocity. If the hill curves back and forth, the ball's inertia smooths out the oscillations.

The Physical Analogy

Imagine a ball rolling on the loss surface. Without friction, the ball would accelerate indefinitely. With friction, it reaches a terminal velocity where the gravitational force (gradient) balances the friction.

Momentum introduces a velocity term that accumulates across updates:

$$
\begin{aligned}
v_t &= \mu \cdot v_{t-1} + g_t \\
\theta_{t+1} &= \theta_t - \eta \cdot v_t
\end{aligned}
$$

Where:

  • $v_t$ is the velocity at step $t$
  • $\mu$ (mu) is the momentum coefficient, typically 0.9
  • $g_t$ is the current gradient
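A plain-Python sketch shows how the velocity behaves in the two directions of a ravine; the alternating and constant gradient streams below are illustrative stand-ins for the across-valley and along-valley components:

```python
# Momentum update for one parameter (plain Python).
def momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    v = mu * v + grad            # v_t = mu * v_{t-1} + g_t
    return theta - lr * v, v     # theta_{t+1} = theta_t - lr * v_t

# Alternating gradients (the across-valley direction) largely cancel:
v_osc = 0.0
for t in range(100):
    g = 1.0 if t % 2 == 0 else -1.0
    v_osc = 0.9 * v_osc + g      # settles into a small +/-0.53 cycle

# Consistent gradients (the along-valley direction) accumulate:
v_acc = 0.0
for _ in range(100):
    v_acc = 0.9 * v_acc + 1.0    # approaches 1 / (1 - 0.9) = 10

print(abs(v_osc) < 1.0, round(v_acc, 2))   # True 10.0
```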

Why Momentum Works

In a ravine:

  • Across the valley: Gradients alternate direction (left, right, left, right). The momentum term averages these out, canceling the oscillation.
  • Along the valley: Gradients consistently point toward the minimum. Momentum accumulates, accelerating progress.

$$v_t = g_t + \mu g_{t-1} + \mu^2 g_{t-2} + \cdots = \sum_{i=0}^{t} \mu^{t-i} g_i$$

The velocity is an exponentially weighted moving average of past gradients. Recent gradients have more influence (weight $\mu^0 = 1$) than older ones (weight $\mu^t \to 0$ as $t \to \infty$).

Nesterov Accelerated Gradient (NAG)

A clever variant of momentum is Nesterov momentum, which "looks ahead" before computing the gradient:

$$
\begin{aligned}
v_t &= \mu \cdot v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t - \eta \mu v_{t-1}) \\
\theta_{t+1} &= \theta_t - \eta \cdot v_t
\end{aligned}
$$

Instead of computing the gradient at the current position, NAG computes it at the position we would reach if we followed momentum alone. This provides a correction: if momentum would overshoot, we notice sooner and slow down.
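A sketch of the look-ahead step in plain Python; `nesterov_step`, `grad_fn`, and the toy quadratic loss are illustrative, not library API:

```python
# Nesterov momentum: evaluate the gradient at the look-ahead point
# theta - lr * mu * v instead of at theta itself.
def nesterov_step(theta, v, grad_fn, lr=0.1, mu=0.9):
    lookahead = theta - lr * mu * v
    v = mu * v + grad_fn(lookahead)
    return theta - lr * v, v

# On the toy loss L(theta) = (theta - 3)^2:
grad = lambda th: 2 * (th - 3.0)
theta, v = 0.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad)
print(round(theta, 3))   # settles near the minimum at 3
```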

Typical Values

Momentum coefficient: 0.9 is the standard choice. Higher values (0.99) give more smoothing but slower response to changes. Lower values (0.8) respond faster but smooth less.

Quick Check

In momentum SGD with μ = 0.9, how much weight does a gradient from 10 steps ago have compared to the current gradient?


Adaptive Learning Rates

Momentum solves oscillation, but what about the uniform learning rate problem? Not all parameters should be updated at the same rate:

  • Sparse features: Some features (like rare words in NLP) appear infrequently. When they do appear, we want to learn from them quickly.
  • Different scales: Weights in different layers may have gradients of different magnitudes. A learning rate that works for one layer may be too large or too small for another.

AdaGrad: Accumulate Squared Gradients

AdaGrad (Adaptive Gradient) adapts the learning rate for each parameter based on how much that parameter has been updated:

$$
\begin{aligned}
r_t &= r_{t-1} + g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t
\end{aligned}
$$

Where:

  • $r_t$ is the accumulated squared gradient (one value per parameter)
  • $\odot$ denotes element-wise multiplication
  • $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero

Intuition: Parameters with large accumulated gradients get smaller learning rates; parameters with small accumulated gradients get larger learning rates. This naturally handles sparse features.
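A one-parameter sketch in plain Python (`adagrad_step` is an illustrative name) makes the shrinking step size visible:

```python
# AdaGrad step for a single parameter (element-wise for vectors).
def adagrad_step(theta, r, grad, lr=0.1, eps=1e-8):
    r = r + grad * grad                          # accumulate squared gradient
    theta = theta - lr / (r ** 0.5 + eps) * grad
    return theta, r

# With a constant gradient, each step is smaller than the last:
theta, r = 0.0, 0.0
step_sizes = []
for _ in range(3):
    new_theta, r = adagrad_step(theta, r, 1.0)
    step_sizes.append(round(theta - new_theta, 4))
    theta = new_theta
print(step_sizes)   # roughly [0.1, 0.0707, 0.0577]: lr / sqrt(t)
```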

AdaGrad's Problem

The accumulated squared gradient $r_t$ only grows. Over time, the effective learning rate $\eta / \sqrt{r_t}$ shrinks to zero, and learning stops completely. This makes AdaGrad unsuitable for long training runs.

RMSprop: Exponential Moving Average

RMSprop (Root Mean Square Propagation) fixes AdaGrad's problem by using an exponentially weighted moving average instead of a sum:

$$
\begin{aligned}
v_t &= \beta \cdot v_{t-1} + (1 - \beta) \cdot g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t
\end{aligned}
$$

Where $\beta$ is typically 0.9 or 0.99.

Key difference: Unlike AdaGrad, RMSprop "forgets" old gradients. If a parameter stops receiving updates, its accumulated gradient decays, and its learning rate can recover.
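The "forgetting" is easy to see in a plain-Python sketch of the moving average alone:

```python
# RMSprop's moving average decays: if a parameter stops receiving
# gradients, its squared-gradient estimate v shrinks geometrically.
beta = 0.9
v = 1.0                                    # some accumulated squared gradient
for _ in range(50):
    v = beta * v + (1 - beta) * 0.0 ** 2   # fifty steps of zero gradient
print(v < 0.01)   # True: the parameter's effective learning rate recovers
```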

Origin of RMSprop

Geoffrey Hinton introduced RMSprop in Lecture 6e of his Coursera course "Neural Networks for Machine Learning" (2012). It was never formally published in a paper, yet it became one of the most widely used optimizers!

Quick Check

What problem does RMSprop solve that AdaGrad has?


Adam: The Best of Both Worlds

Adam (Adaptive Moment Estimation) combines the best ideas from momentum and RMSprop: it maintains both a first moment (momentum-like) and a second moment (RMSprop-like) of the gradients.

The Algorithm

$$
\begin{aligned}
m_t &= \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t && \text{(first moment estimate)} \\
v_t &= \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t \odot g_t && \text{(second moment estimate)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} && \text{(bias correction)} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} && \text{(bias correction)} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\end{aligned}
$$

Default hyperparameters (from the original paper):

  • $\beta_1 = 0.9$ (momentum decay)
  • $\beta_2 = 0.999$ (RMSprop decay)
  • $\epsilon = 10^{-8}$
  • $\eta = 0.001$

Understanding Each Component

First Moment (m): Momentum

The first moment $m_t$ is an exponentially weighted average of past gradients, exactly like momentum. It smooths out oscillations and accumulates velocity in consistent directions.

Second Moment (v): Adaptive Learning Rate

The second moment $v_t$ is an exponentially weighted average of squared gradients, exactly like RMSprop. It provides per-parameter adaptive learning rates.

Bias Correction

Why the correction terms $\hat{m}_t$ and $\hat{v}_t$? Both $m_t$ and $v_t$ are initialized to zero. Early in training, they're biased toward zero.

Consider $m_t$ after one step:

$$m_1 = (1 - \beta_1) \cdot g_1 = 0.1 \cdot g_1$$

This is much smaller than $g_1$! The bias correction divides by $1 - \beta_1^1 = 0.1$, giving $\hat{m}_1 = g_1$. As $t \to \infty$, $\beta_1^t \to 0$, and the correction vanishes.
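The correction can be checked numerically in plain Python (assuming a constant gradient of 1 for illustration):

```python
beta1 = 0.9
m = 0.0
g = 1.0                              # pretend the true gradient is always 1
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# Raw m starts badly biased (0.1, 0.19, 0.271), but m_hat is 1.0 at
# every step, exactly matching the true gradient.
```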

Adam is Robust

Adam's great advantage is that it works well "out of the box" for a wide range of problems. The default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.001$) rarely need tuning, making Adam the go-to choice for rapid prototyping.

Interactive: Optimizer Comparison

Watch different optimizers navigate the same loss landscape. The Rosenbrock function creates a curved valley—a challenging test for optimizers. Observe how:

  • SGD oscillates dramatically, struggling in the curved valley
  • Momentum smooths the path but may overshoot
  • RMSprop adapts to the landscape's shape
  • Adam combines smooth movement with adaptation
[Interactive visualization: Optimizer Comparison on Loss Landscape. SGD, Momentum, and Adam trajectories start at (-1.5, 2.0); the learning rate is adjustable.]

The yellow dot marks the global minimum at (1, 1). Watch how different optimizers navigate the curved valley of the Rosenbrock function. SGD oscillates, while momentum and Adam find smoother paths.

Experiment

Try different learning rates! Notice how SGD is sensitive to the learning rate choice, while Adam is more robust. Also try selecting different combinations of optimizers to compare them directly.

Quick Check

Based on the visualization, which optimizer typically reaches the minimum first on the Rosenbrock function?


Other Notable Optimizers

AdamW: Adam with Decoupled Weight Decay

The original Adam paper applied L2 regularization by adding the weight penalty to the loss, which means it gets scaled by the adaptive learning rate. AdamW decouples weight decay, applying it directly to the weights:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

This seemingly small change improves generalization significantly. AdamW is now the default for training transformers and many other architectures.
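A sketch of the decoupled update for a single weight (plain Python; `adamw_update` is an illustrative name, and the moments are frozen at zero to isolate the decay term):

```python
def adamw_update(theta, m_hat, v_hat, lr=0.001, eps=1e-8, weight_decay=0.01):
    # The weight_decay * theta term bypasses the adaptive scaling, so
    # every weight decays at the same relative rate lr * weight_decay.
    return theta - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * theta)

# With zero gradient moments, a weight still shrinks multiplicatively:
theta = 1.0
for _ in range(1000):
    theta = adamw_update(theta, m_hat=0.0, v_hat=0.0)
print(round(theta, 4))   # about 0.99, i.e. (1 - 0.001 * 0.01) ** 1000
```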

NAdam: Nesterov + Adam

NAdam incorporates Nesterov momentum into Adam, computing the gradient at a "look-ahead" position. This can provide faster convergence in some cases.

Lion: Evolved Sign Momentum

Lion (2023) was discovered via automated search over optimizer algorithms. It uses only the sign of the momentum, not the magnitude:

$$\theta_{t+1} = \theta_t - \eta \cdot \mathrm{sign}(m_t)$$

Lion uses less memory than Adam (no second moment) and has shown strong performance on large language models.
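A simplified sketch of the sign update in plain Python (the full Lion algorithm also interpolates the current gradient into the update direction before taking the sign, which this sketch omits):

```python
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, m, grad, lr=1e-4, beta=0.9):
    m = beta * m + (1 - beta) * grad   # momentum, as usual
    return theta - lr * sign(m), m     # but only the SIGN sets the step size

# Every parameter moves by exactly +/- lr, regardless of gradient scale:
theta, m = 0.0, 0.0
theta, m = lion_step(theta, m, grad=1000.0)
print(theta)   # -0.0001
```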

Summary Table

| Optimizer | Memory | Momentum | Adaptive LR | Best For |
|-----------|--------|----------|-------------|----------|
| SGD | O(n) | No | No | Baseline, simple problems |
| SGD + Momentum | O(2n) | Yes | No | Vision, when tuned well |
| RMSprop | O(2n) | No | Yes | RNNs, quick prototyping |
| Adam | O(3n) | Yes | Yes | General default |
| AdamW | O(3n) | Yes | Yes | Transformers, with regularization |
| Lion | O(2n) | Yes (sign) | No | Large models, memory-constrained |

PyTorch Implementation

PyTorch provides all major optimizers in torch.optim. Here's how to use them:

Basic Usage

# optimizers.py
import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# SGD with momentum: add nesterov=True for Nesterov accelerated gradient,
# which often converges faster. Good for computer vision tasks when
# carefully tuned. Note the learning rate: SGD typically needs a higher
# LR (0.01-0.1) than Adam (0.001) and is more sensitive to this choice.
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Use Nesterov momentum
)

# Adam: the default choice for most deep learning tasks; it combines
# momentum and adaptive learning rates. beta1 controls momentum (first
# moment decay), beta2 controls the adaptive learning rate (second
# moment decay); the defaults work well in most cases.
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # (β₁, β₂)
    eps=1e-8
)

# AdamW: Adam with decoupled weight decay, preferred for transformers.
# Typical weight_decay values: 0.01-0.1 for transformers, 0.0001-0.001
# for other models; higher values mean stronger regularization.
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # L2 regularization strength
)

Per-Parameter Learning Rates

Sometimes you want different learning rates for different parts of the model. PyTorch supports this with parameter groups:

# per_param_lr.py
# Different learning rates for different layers. Pretrained layers already
# have good weights, so a small learning rate makes fine adjustments
# without destroying learned features; newly initialized layers (like a
# classification head) need larger updates to learn from scratch.
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained: small LR
    {'params': model.head.parameters(), 'lr': 1e-3}       # New layers: larger LR
], lr=1e-4)  # Default LR for any params not in groups

# Fine-tuning a pretrained model. Regularization can also differ per
# group: a small classifier head may not need weight decay.
optimizer = optim.AdamW([
    {'params': model.encoder.parameters(), 'lr': 1e-5, 'weight_decay': 0.01},
    {'params': model.decoder.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': model.classifier.parameters(), 'lr': 1e-3, 'weight_decay': 0.0}
])

Implementing Adam from Scratch

Understanding optimizers deeply means being able to implement them yourself. Here's Adam from scratch:

# adam_scratch.py
class AdamFromScratch:
    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0

        # Moment estimates, initialized to zeros:
        # m stores the exponentially weighted average of gradients (momentum);
        # v stores the average of squared gradients (adaptive learning rates).
        self.m = [torch.zeros_like(p) for p in self.params]  # First moment
        self.v = [torch.zeros_like(p) for p in self.params]  # Second moment

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        self.t += 1

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue

            g = p.grad.data

            # Update biased first moment estimate: blend previous momentum
            # with the current gradient (beta1=0.9 means 90% old, 10% new).
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Update biased second moment estimate: beta2=0.999 gives a
            # very smooth estimate of the squared-gradient scale.
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g

            # Bias correction: divide by (1 - beta^t) to undo the zero
            # initialization. Essential early on; vanishes as t grows.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)

            # Parameter update: step along the corrected momentum, scaled
            # by the inverse of the corrected RMS gradient.
            p.data -= self.lr * m_hat / (torch.sqrt(v_hat) + self.eps)
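To sanity-check the implementation, one Adam step can be traced by hand in plain Python, using the same formulas as step() above (the numbers are toy values chosen for illustration):

```python
# One Adam step, traced by hand.
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
theta, m, v, t = 0.5, 0.0, 0.0, 1
g = 0.2                                # current gradient

m = b1 * m + (1 - b1) * g              # first moment:  0.02
v = b2 * v + (1 - b2) * g * g          # second moment: 0.00004
m_hat = m / (1 - b1 ** t)              # bias-corrected: 0.2 (= g)
v_hat = v / (1 - b2 ** t)              # bias-corrected: 0.04 (= g * g)
theta -= lr * m_hat / (v_hat ** 0.5 + eps)

# The very first step has magnitude ~lr, because m_hat / sqrt(v_hat)
# equals g / |g| = 1 after bias correction.
print(round(theta, 6))                 # 0.499
```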

Choosing the Right Optimizer

With so many optimizers available, how do you choose? Here are practical guidelines:

Decision Tree

  1. Starting a new project? Use Adam with default hyperparameters. It's robust and works well across domains.
  2. Training a transformer or LLM? Use AdamW. The decoupled weight decay is important for these architectures.
  3. Training a CNN for image classification? Consider SGD with momentum and a learning rate schedule. It often gives better final performance than Adam, though it requires more tuning.
  4. Memory constrained? Consider Lion or SGD with momentum. They use less memory than Adam.
  5. Training an RNN? RMSprop or Adam work well. Gradient clipping is often needed too.

Common Mistakes

| Mistake | Symptom | Fix |
|---------|---------|-----|
| Using Adam's default LR for SGD | Training doesn't progress | SGD needs higher LR (0.01-0.1) |
| Forgetting weight decay | Model overfits | Add weight_decay (0.01 for AdamW) |
| Not using bias correction | Slow start to training | PyTorch optimizers handle this automatically |
| Epsilon too large | Updates too small | Use ε = 1e-8 (the default) |
| Same LR for pretrained and new layers | Pretrained features get destroyed | Use smaller LR for pretrained layers |

When in Doubt, Try Both

For important projects, it's worth comparing Adam and SGD+momentum. SGD often achieves slightly better final performance if you have time to tune the learning rate schedule, but Adam is faster to get working.

Related Topics

  • Chapter 8 Section 2: Gradient Descent Variants - Mathematical foundations of batch, stochastic, and mini-batch gradient descent
  • Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and finding optimal learning rates

Summary

Optimizers determine how neural networks learn from gradients. Let's review the key concepts:

| Optimizer | Key Idea | Update Rule Essence |
|-----------|----------|---------------------|
| SGD | Follow negative gradient | θ ← θ − η·g |
| Momentum | Accumulate velocity | v ← μv + g; θ ← θ − η·v |
| RMSprop | Adapt LR per-parameter | θ ← θ − η·g/(√v + ε) |
| Adam | Momentum + adaptive LR | θ ← θ − η·m̂/(√v̂ + ε) |
| AdamW | Adam with fixed weight decay | Adam + λθ decay applied directly |

Key Takeaways

  • Momentum smooths oscillations by accumulating velocity across updates
  • Adaptive learning rates handle parameters of different scales and sparsities
  • Adam combines both ideas and works well out-of-the-box
  • AdamW is preferred for transformers due to proper weight decay handling
  • SGD + momentum can achieve better final performance with careful tuning

# quick_reference.py
# Quick reference: PyTorch optimizers
import torch.optim as optim

# For quick prototyping (most projects)
opt = optim.Adam(model.parameters(), lr=1e-3)

# For transformers and LLMs
opt = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# For CNNs when tuning for best performance
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# For RNNs
opt = optim.RMSprop(model.parameters(), lr=1e-3)

Exercises

Conceptual Questions

  1. Explain why momentum helps with ravines in the loss landscape. What would happen if you used momentum coefficient $\mu = 0.99$ instead of $0.9$?
  2. Why does Adam need bias correction but vanilla momentum doesn't? What would happen to Adam's first steps without bias correction?
  3. You're training a model and notice that loss oscillates wildly. Which optimizer hyperparameters would you adjust, and how?
  4. Explain why AdamW is preferred over Adam for transformers. What goes wrong with standard Adam + weight decay?

Solution Hints

  1. Q1: Higher momentum means more smoothing but slower response to gradient changes. μ=0.99 would be very sluggish and might overshoot more.
  2. Q2: Momentum starts from zero but quickly gets close to the gradient average. Adam's moments are exponentially weighted averages initialized to zero, which biases them toward zero early on.
  3. Q3: Reduce learning rate first. If using SGD, add or increase momentum. Consider switching to Adam if not already using it.
  4. Q4: In Adam, weight decay gets scaled by the adaptive learning rate, so parameters with small gradients (often important features) get less regularization. AdamW applies weight decay uniformly.

Coding Exercises

  1. Implement SGD with momentum from scratch: Create a custom optimizer class that implements SGD with momentum (with optional Nesterov). Verify it produces the same results as PyTorch's optim.SGD.
  2. Optimizer comparison experiment: Train a small CNN on MNIST with SGD, SGD+momentum, RMSprop, and Adam. Plot training and validation loss curves. Which converges fastest? Which achieves the best final accuracy?
  3. Learning rate sensitivity analysis: Train the same model with Adam using learning rates [0.0001, 0.001, 0.01, 0.1]. Repeat with SGD. Compare how sensitive each optimizer is to learning rate choice.
  4. Visualize optimizer trajectories: Create a 2D loss surface (like the Rosenbrock function) and animate the trajectories of different optimizers. Highlight where each optimizer struggles or excels.

Coding Exercise Hints

  • Exercise 1: Store velocity as a list of tensors, one per parameter. For Nesterov, evaluate gradient at θ - η·μ·v.
  • Exercise 2: Use a simple CNN (2-3 conv layers + FC). Train for 10 epochs, plot losses with matplotlib.
  • Exercise 3: Create a loop over learning rates and store results in a dictionary. Use early stopping if loss explodes.
  • Exercise 4: Use matplotlib.animation. Store optimizer state history and plot paths on a contour plot.

In the next section, we'll explore learning rate strategies—techniques for changing the learning rate during training. You'll learn about warmup, cosine annealing, and other schedules that can significantly improve training.