Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Define exponential functions $f(x) = a^x$ and explain the role of the base.
Explain why Euler's number $e \approx 2.71828$ is special in calculus.
Distinguish between exponential growth (base > 1) and decay (0 < base < 1).
Derive $e$ from the compound-interest limit and confirm it numerically by hand.
Recognize the unique self-derivative property $\tfrac{d}{dx} e^x = e^x$ .
Apply exponentials to model growth, decay, cooling, and softmax / cross-entropy in ML.

The Big Picture: Why Exponential Functions Matter

"The greatest shortcoming of the human race is our inability to understand the exponential function." — Albert Bartlett, physicist

Linear functions describe additive change: every step adds the same amount. Exponential functions describe multiplicative change: every step multiplies by the same factor. That single switch — from plus to times — is responsible for everything from compound interest to viral spread to the softmax layer in a transformer.

The Core Insight

Linear: "I add 10 every hour" → 10, 20, 30, 40, … (arithmetic sequence).

Exponential: "I double every hour" → 1, 2, 4, 8, 16, 32, … (geometric sequence).

After 10 hours the linear rule gives 100; the exponential rule gives 1,024. After 30 hours the linear rule gives 300; the exponential rule gives 1,073,741,824. Same elapsed time, yet the exponential total is more than three million times larger. That gap is the exponential function.
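The two rules above can be compared in a few lines. This is a minimal sketch that assumes the linear rule is 10·t and the doubling rule is 2^t; the function names `linear` and `doubling` are illustrative.

```python
def linear(t):
    """Add 10 every hour: 10, 20, 30, ..."""
    return 10 * t

def doubling(t):
    """Double every hour, starting from 1: 2, 4, 8, ..."""
    return 2 ** t

for t in (10, 30):
    print(t, linear(t), doubling(t))
# 10 100 1024
# 30 300 1073741824
```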

The Intuition: a Bank Account That Refuses to Be Boring

Imagine putting $1 in an account that grows continuously — every instant, the account adds a little interest, and that new interest immediately starts earning its own interest. The account is not just growing; it is growing because it is growing. That self-referential loop is the soul of $e^x$ .

Where Exponential Functions Appear

🦠 Biology

Bacterial colony growth
Viral spread before immunity kicks in
Radioactive decay of carbon-14
Drug concentration after a single dose

💰 Finance

Continuous compound interest
Reinvested investment returns
Inflation over decades
Loan amortization curves

⚛️ Physics

Radioactive half-life
RC capacitor discharge
Newton's law of cooling
Atmospheric pressure vs altitude

🤖 Machine Learning

Softmax over logits
Cross-entropy loss
Exponential learning-rate decay
Attention weights in transformers

Historical Origins: From Logs to Limits

The story of exponential functions weaves three discoveries together.

1. Napier's Logarithms (1614)

John Napier invented logarithms to speed up astronomical multiplication. He observed:

\log(a \times b) = \log(a) + \log(b)

Multiplication of large numbers reduced to addition of small ones. Logarithm tables exploded across Europe. But every logarithm implies an inverse — and that inverse is an exponential function.

2. Bernoulli's Banking Question (1683)

Jacob Bernoulli asked a question that looks like idle arithmetic but turned out to be cosmic:

If I invest $1 at 100% annual interest, and the bank compounds the interest more and more frequently, what is the most I can have at year's end?

Compounding once gives $2.00. Twice gives $2.25. Monthly gives $2.6130. Daily gives $2.7146. Hourly gives $2.7181. Continuous compounding gives a number Bernoulli could not put a name to:

\displaystyle\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e \approx 2.71828\ldots

That limit defines Euler's number. The interactive table further down lets you watch the convergence happen one row at a time.

3. Euler's Unification (1748)

Leonhard Euler recognized that $e^x$ has the magical property of being its own derivative, and he united exponentials with trigonometry through:

e^{i\theta} = \cos\theta + i\sin\theta

This connects the geometry of circles to the algebra of growth — one of the most beautiful equations in mathematics, and we will meet it again in Chapter 5.

Mathematical Definition

An exponential function with base $a$ (where $a > 0$ and $a \neq 1$ ) is defined as:

f(x) = a^x \qquad \text{for every real } x.

What Each Symbol Means

Symbol	Name	Meaning
a	Base	A positive constant we keep fixed
x	Exponent	The variable input — can be any real number
a^x	Output	For a positive integer x, a multiplied by itself x times; extended to all real x by limits

Why the Restrictions?

$a > 0$ : A negative base breaks for fractional exponents — for example, $(-1)^{1/2}$ is not real.
$a \neq 1$ : If $a=1$ , then $1^x = 1$ for every x — a flat horizontal line, not an exponential.
x can be any real number: Continuity lets us define $a^x$ for irrationals like $2^{\pi}$ via limits of rational exponents.
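The three restrictions above can be probed directly in Python. This is a rough sketch rather than a proof; the rational approximations to π (3, 22/7, 355/113) are chosen here purely for illustration.

```python
import math

# Negative base with a fractional exponent: Python returns a complex number,
# because no real value exists.
z = (-1) ** 0.5
print(type(z).__name__)   # complex

# Base 1 is flat: 1**x equals 1 for every x tried.
print(all(1 ** x == 1 for x in (-2, 0.5, 3)))   # True

# An irrational exponent as a limit of rational exponents p/q approaching pi.
for p, q in [(3, 1), (22, 7), (355, 113)]:
    print(f"2^({p}/{q}) = {2 ** (p / q):.6f}")
print(f"2^pi       = {2 ** math.pi:.6f}")
```

The rational-exponent values creep toward `2 ** math.pi` as p/q gets closer to π, which is the continuity argument in miniature.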

The Natural Exponential Function

The most important exponential uses Euler's number e:

f(x) = e^x \qquad \text{where } e \approx 2.71828\ldots

It is often written $\exp(x)$ in code and papers.

Exploring Exponential Functions

Drag the base slider below. Watch how the curve's shape transforms as you cross $a = 1$ — the boundary between growth and decay. Hover anywhere on the plot to read off the exact value of $a^x$ at that x.

[Interactive demo: Exponential Function Explorer. A slider sets the base a from 0.1 (decay) through e ≈ 2.718 up to 4 (rapid growth). Sample readout at a = 2.00: y-intercept (0, 1), because a^0 = 1 for all a > 0; behavior: exponential growth, increasing without bound; derivative: 2.00^x · ln(2.00) ≈ 0.693 × 2.00^x.]

What to Notice as You Play

All curves cross at (0, 1). Because $a^0 = 1$ for every positive a — this is the family's universal anchor.
Base > 1 → growth. The curve rises ever faster as x grows.
0 < Base < 1 → decay. The curve falls toward zero but never touches it.
The x-axis is an asymptote on one side. Growth curves hug y=0 as x→-∞; decay curves hug it as x→+∞.
The base e ≈ 2.718 sits between 2 and 3. It is geometrically unremarkable — but calculus elevates it to king status because its derivative equals itself.

Euler's Number e: The Most Important Constant in Calculus

Alongside $\pi$ , Euler's number $e$ is one of the most important constants in mathematics. It appears in any system where the rate of change is proportional to the current size, a pattern found throughout nature.

Four Equivalent Definitions of e

1. As a limit (compound interest).

e = \lim_{n \to \infty} \left(1 + \tfrac{1}{n}\right)^n

2. As an infinite series.

e = \sum_{n=0}^{\infty} \frac{1}{n!} = 1 + 1 + \tfrac{1}{2} + \tfrac{1}{6} + \tfrac{1}{24} + \cdots

3. As the base for which f' = f.

\frac{d}{dx} e^x = e^x

4. As an area under 1/t.

e \text{ is the unique number with } \int_1^e \tfrac{1}{t}\,dt = 1.

Each definition emphasizes a different face of $e$ : financial, algebraic, differential, and integral. All four pinpoint the exact same constant.
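As a numerical cross-check, each of the four definitions can be approximated with the standard library alone. The cut-offs below (n = 1,000,000, 18 series terms, h = 1e-5, 100,000 midpoint slices) are arbitrary choices for this sketch, not values from the text.

```python
import math

# 1. Limit: (1 + 1/n)^n for a large n.
n = 1_000_000
e_limit = (1 + 1 / n) ** n

# 2. Series: sum of 1/k! for k = 0..17 (later terms are below 1e-15).
e_series = sum(1 / math.factorial(k) for k in range(18))

# 3. Self-derivative: the central-difference slope of exp at x = 0
#    should equal exp(0) = 1.
h = 1e-5
slope_at_0 = (math.exp(h) - math.exp(-h)) / (2 * h)

# 4. Area: midpoint-rule integral of 1/t from 1 to e should equal 1.
steps = 100_000
width = (math.e - 1) / steps
area = sum(width / (1 + (i + 0.5) * width) for i in range(steps))

print(e_limit, e_series, slope_at_0, area)
```

The series converges far faster than the limit: 18 terms already reach double precision, while a million compounding periods give only about five correct decimals.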

Its Numerical Value

e = 2.71828\,18284\,59045\,23536\ldots

Like $\pi$ , $e$ is irrational (it cannot be written as a ratio of two integers) and transcendental (it is not a root of any nonzero polynomial with integer coefficients).

Compound Interest & The Discovery of e

Play with the demo below. Push the compounding slider to the right — watch the final balance climb toward a ceiling. That ceiling is $Pe^{rt}$ .

[Interactive demo: Compound Interest & The Discovery of e. Sample readout for principal $1,000, annual rate 10%, time 10 years: simple interest A = P(1 + rt) = $2,000; compounding 12 times per year A = P(1 + r/n)^(nt) = $2,707.04; continuous compounding A = Pe^(rt) = $2,718.28.]

💡The Birth of e: What happens as we compound more frequently?

With $1 at 100% interest for 1 year, watch what happens to (1 + 1/n)^n as n increases:

n (compounds/year)	(1 + 1/n)^n
1 (annual)	2.0000000000
2	2.2500000000
4	2.4414062500
12 (monthly)	2.6130352902
52	2.6925969544
365 (daily)	2.7145674820
8760	2.7181266916
100000	2.7182682372
n → ∞ (limit)	e = 2.7182818284...

This limit is exactly how Euler's number e was discovered!

The Compound Interest Formula

A = P \left(1 + \frac{r}{n}\right)^{nt}

Symbol	Meaning
A	Final amount after t years
P	Principal (initial deposit)
r	Annual interest rate (decimal, e.g. 0.05 for 5%)
n	Number of compounding periods per year
t	Time in years

The Limit as Compounding Becomes Continuous

Let's push $n \to \infty$ (compounding every instant). Substitute $m = n/r$ so that $n = mr$ :

A = P \left(1 + \tfrac{r}{n}\right)^{nt} = P \left(1 + \tfrac{1}{m}\right)^{mrt} = P \left[\left(1 + \tfrac{1}{m}\right)^{m}\right]^{rt}.

As $n \to \infty$ we also have $m \to \infty$ , and the inner bracket converges to $e$ . So:

\boxed{\,A = P e^{rt}\,}

The Practical Insight

Continuous compounding gives $A = P e^{rt}$ . In practice the gap between daily and continuous compounding is tiny, but the algebraic simplicity of $e^{rt}$ makes it the standard form for growth and decay models in finance and physics.
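To see how small that gap is, the two formulas can be evaluated side by side. This sketch reuses the demo's sample values (P = 1000, r = 0.10, t = 10); the function names `compound` and `continuous` are illustrative.

```python
import math

def compound(P, r, t, n):
    """A = P * (1 + r/n)^(n*t): interest compounded n times per year."""
    return P * (1 + r / n) ** (n * t)

def continuous(P, r, t):
    """A = P * e^(r*t): the n -> infinity limit of compound()."""
    return P * math.exp(r * t)

P, r, t = 1000, 0.10, 10
for n in (1, 12, 365):
    print(f"n = {n:>3}: {compound(P, r, t, n):.2f}")
print(f"continuous: {continuous(P, r, t):.2f}")
```

With these inputs r·t = 1, so the continuous balance is simply 1000·e.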

Worked Example: Bernoulli's Dollar by Hand

Before reading the code, do this with a pencil. Set $P = 1$ , $r = 1$ (100% annual interest), $t = 1$ year. The formula collapses to $A(n) = (1 + 1/n)^n$ . Compute six rows by hand and watch the convergence.


Step 1. Yearly compounding (n = 1).

A(1) = \left(1 + \tfrac{1}{1}\right)^1 = 2^1 = 2.00000000

Step 2. Semi-annual (n = 2). Inner term: $1 + 1/2 = 1.5$ .

A(2) = 1.5^2 = 1.5 \times 1.5 = 2.25000000

Step 3. Quarterly (n = 4). Inner term: $1.25$ .

A(4) = 1.25^4 = (1.25^2)^2 = 1.5625^2 = 2.44140625

Step 4. Monthly (n = 12). Inner term: $1 + 1/12 \approx 1.08333$ .

A(12) = 1.08333\ldots^{12} \approx 2.61303529

Step 5. Daily (n = 365). Inner term: $1 + 1/365 \approx 1.0027397$ .

A(365) = 1.0027397\ldots^{365} \approx 2.71456748

Step 6. Hourly (n = 8760).

A(8760) \approx 2.71812669

Pattern. Stack the answers side by side:

n	A(n)	Gap to e ≈ 2.71828182
1	2.00000000	≈ 7.18 × 10⁻¹
2	2.25000000	≈ 4.68 × 10⁻¹
4	2.44140625	≈ 2.77 × 10⁻¹
12	2.61303529	≈ 1.05 × 10⁻¹
365	2.71456748	≈ 3.71 × 10⁻³
8760	2.71812669	≈ 1.55 × 10⁻⁴
∞	2.71828182…	0

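Each time $n$ grows by roughly 10x, the gap shrinks by roughly 10x: the gap behaves like $e/(2n)$ , so it is inversely proportional to $n$ . That is slow (sublinear) convergence; each extra correct digit of $e$ costs about ten times more compounding periods.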
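One way to check this scaling is to multiply the gap by n: if the gap is proportional to 1/n, the product should settle to a constant. The sketch below suggests that constant is close to e/2 ≈ 1.359; the sample values of n are arbitrary.

```python
import math

for n in (10, 100, 1_000, 10_000):
    gap = math.e - (1 + 1 / n) ** n
    # If gap is proportional to 1/n, then n * gap should level off.
    print(f"n = {n:>6}   gap = {gap:.3e}   n * gap = {n * gap:.5f}")

print(f"e / 2 = {math.e / 2:.5f}")
```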

The intuition. For huge $n$ , every instant we add $1/n$ of our balance, and that tiny addition immediately starts earning its own interest. The infinite tower of "interest on interest on interest …" converges because, in the limit, the $k$-th layer contributes $1/k!$ , which is $1/k$ times the layer before it, and $\sum_{k \ge 0} 1/k!$ is finite. The sum is exactly $e$ .

Connection to the series definition. Bernoulli's limit and Euler's series are the same number: expanding $(1 + 1/n)^n$ by the binomial theorem gives terms that approach $1/k!$ one by one as $n \to \infty$ , so the limit equals $\sum_{k \ge 0} 1/k!$ .
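The binomial expansion behind this connection can be written out as follows. It is a sketch: the interchange of limit and sum in the last step is taken for granted here rather than justified.

```latex
\left(1 + \frac{1}{n}\right)^{n}
= \sum_{k=0}^{n} \binom{n}{k} \frac{1}{n^{k}}
= \sum_{k=0}^{n} \frac{1}{k!} \prod_{j=0}^{k-1} \left(1 - \frac{j}{n}\right)
\;\xrightarrow[\,n \to \infty\,]{}\;
\sum_{k=0}^{\infty} \frac{1}{k!} = e .
```

For fixed k, every factor (1 - j/n) tends to 1, so the k-th term tends to 1/k!.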

Key Properties of Exponential Functions

The Laws of Exponents

Property	Formula	Example
Product Rule	a^m · a^n = a^(m+n)	2³ · 2² = 2⁵ = 32
Quotient Rule	a^m / a^n = a^(m-n)	3⁵ / 3² = 3³ = 27
Power Rule	(a^m)^n = a^(m·n)	(2²)³ = 2⁶ = 64
Zero Exponent	a^0 = 1	5⁰ = 1
Negative Exponent	a^(-n) = 1 / a^n	2⁻³ = 1/8
Fractional Exponent	a^(1/n) = ⁿ√a	8^(1/3) = 2
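The laws in the table can be spot-checked numerically. The sketch below uses `math.isclose` instead of `==` because floating-point powers are not always exact; the sample values a = 2, m = 3, n = 2 are arbitrary.

```python
import math

a, m, n = 2.0, 3, 2

checks = {
    "product":    math.isclose(a**m * a**n, a**(m + n)),
    "quotient":   math.isclose(a**m / a**n, a**(m - n)),
    "power":      math.isclose((a**m)**n, a**(m * n)),
    "zero":       a**0 == 1,
    "negative":   math.isclose(a**(-n), 1 / a**n),
    "fractional": math.isclose(8 ** (1 / 3), 2.0),
}

for name, ok in checks.items():
    print(f"{name:<10} {ok}")
```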

Function-Level Properties

Domain: all reals $(-\infty, \infty)$ .
Range: positive reals $(0, \infty)$ .
Y-intercept: always $(0, 1)$ .
Horizontal asymptote: the line $y = 0$ .
No x-intercept: $a^x > 0$ for every x.
Continuous and smooth: no breaks, no corners — derivatives of every order exist.
One-to-one: distinct x give distinct y — so the inverse (the logarithm) exists.

Growth vs Decay: The Role of the Base

The base alone decides whether $a^x$ grows or shrinks as x increases.

📈 Exponential Growth (a > 1)

f(x) = a^x \text{ with } a > 1

Function increases as x increases
Accelerates — the slope itself grows
As $x \to +\infty$ , $f(x) \to +\infty$
As $x \to -\infty$ , $f(x) \to 0^+$

Examples: populations, compound interest, viral spread.

📉 Exponential Decay (0 < a < 1)

f(x) = a^x \text{ with } 0 < a < 1

Function decreases as x increases
Decelerates: the slope is negative and its magnitude shrinks
As $x \to +\infty$ , $f(x) \to 0^+$
As $x \to -\infty$ , $f(x) \to +\infty$

Examples: radioactive decay, cooling, drug clearance.

Converting Between Growth and Decay

A decay $(1/2)^x$ can be rewritten as growth with a flipped exponent: $(1/2)^x = 2^{-x}$ . Likewise $e^{-x}$ is the decay version of $e^x$ — same family, mirrored across the y-axis.

Preview: The Derivative of e^x

Here is the property that earns $e$ its place at the heart of calculus:

\frac{d}{dx} e^x = e^x.

The slope of $e^x$ at any point equals the height of $e^x$ at that point. The function tells its own derivative what to be. In the demo below, slide the tangent point along the curve — the slope of the tangent line always equals the height of the function. Shrink the secant step $h$ toward zero and watch the secant collapse onto the tangent.

[Interactive demo: The Derivative of e^x, Visualized. The secant line approaches the tangent as h → 0. Sample readout at x = 1.00, h = 0.5000: secant slope [f(x+h) - f(x)] / h = [e^(1.50) - e^(1.00)] / 0.500 = 3.526814; tangent slope f'(x) = e^(1.00) = 2.718282; error |secant - tangent| = 8.0853e-1. As h → 0 the error → 0, and at every point f(x) and f'(x) take the same value, here 2.7183.]
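The secant-to-tangent collapse can also be reproduced without the demo. This sketch uses the forward difference quotient at x = 1; the list of step sizes is an arbitrary choice.

```python
import math

def secant_slope(x, h):
    """Forward difference quotient [f(x+h) - f(x)] / h for f = exp."""
    return (math.exp(x + h) - math.exp(x)) / h

x = 1.0
tangent = math.exp(x)   # true derivative of e^x at x
for h in (0.5, 0.1, 0.01, 0.001):
    s = secant_slope(x, h)
    print(f"h = {h:<6} secant = {s:.6f}   error = {abs(s - tangent):.2e}")
```

The error shrinks roughly in proportion to h, as expected for a one-sided difference.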

What Makes This Unique to e?

For any other base, the derivative carries an extra factor:

\frac{d}{dx} a^x = a^x \cdot \ln(a).

The factor $\ln(a)$ is the "tax" that every base except $e$ pays for not being $e$ . Only when $a = e$ do we get $\ln(e) = 1$ , and the derivative collapses back to the function itself.
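The extra factor comes from rewriting the base in terms of e and applying the chain rule. A short derivation, assuming only that the derivative of e^u with respect to u is e^u:

```latex
a^x = e^{\ln(a^x)} = e^{x \ln a}
\quad\Longrightarrow\quad
\frac{d}{dx}\, a^x
= e^{x \ln a} \cdot \frac{d}{dx}\,(x \ln a)
= a^x \ln a .
```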

Why This Matters for Calculus

The self-derivative property makes $e^x$ the easiest function to differentiate and integrate:

Derivative: $\dfrac{d}{dx} e^x = e^x$
Integral: $\displaystyle\int e^x \, dx = e^x + C$

This is why $e$ appears in the solutions of linear differential equations with constant coefficients, in radioactive-decay formulas, and in the softmax used throughout neural networks.

Transformations of Exponential Functions

Like every function, exponentials can be shifted, stretched, and reflected. The general form is:

f(x) = A \cdot a^{\,B(x - h)} + k.

Parameter	Effect	Example
A	Vertical stretch / compression; reflect if A<0	3 · 2^x is 3x taller than 2^x
a	Base — sets the growth/decay rate	e^x grows faster than 2^x
B	Horizontal stretch / compression	2^(2x) compresses x-axis by 2
h	Horizontal shift (right if h>0)	e^(x-2) shifts right by 2
k	Vertical shift (up if k>0)	e^x + 3 shifts up by 3

Common Transformations

$e^{-x}$ : reflection through the y-axis (decay version of $e^x$ ).
$-e^x$ : reflection through the x-axis (always negative).
$e^{2x}$ : faster growth — horizontal compression by 2.
$e^{x/2}$ : slower growth — horizontal stretch by 2.
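The general form can be wrapped in one small function whose defaults mean "no transformation". The name `transformed` and its default arguments are illustrative, not part of the original text.

```python
import math

def transformed(x, A=1.0, a=math.e, B=1.0, h=0.0, k=0.0):
    """f(x) = A * a**(B * (x - h)) + k."""
    return A * a ** (B * (x - h)) + k

print(transformed(0))           # plain e^x at 0
print(transformed(2, h=2))      # shifted right by 2: anchor moves to x = 2
print(transformed(0, k=3))      # shifted up by 3
print(transformed(1, B=2))      # e^(2x) at x = 1, same as e^2
print(transformed(1, A=-1))     # reflected through the x-axis
```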

Real-World Applications

1. Population Growth

Under unlimited resources, populations grow exponentially:

P(t) = P_0 \, e^{rt}.

Here $P_0$ is the initial population, $r$ is the growth rate, and $t$ is time.

Example: Bacteria that double every 20 minutes have $r = \ln(2)/20 \approx 0.0347/\text{min}$ . Starting with 1000 cells, after 2 hours: $P = 1000 \cdot e^{0.0347 \times 120} \approx 64{,}000$ .
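The bacteria example can be reproduced in a few lines. The function name `population` and its keyword defaults are assumptions made for this sketch.

```python
import math

def population(t_min, P0=1000, doubling_time=20.0):
    """P(t) = P0 * e^(r t), with r chosen so the population doubles every doubling_time minutes."""
    r = math.log(2) / doubling_time
    return P0 * math.exp(r * t_min)

# Two hours is 120 minutes, i.e. six doublings: 1000 * 2**6.
print(round(population(120)))   # 64000
```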

2. Radioactive Decay

Unstable atoms decay independently and exponentially:

N(t) = N_0 \, e^{-\lambda t} = N_0 \left(\tfrac{1}{2}\right)^{t/t_{1/2}}.

$\lambda$ is the decay constant and $t_{1/2}$ is the half-life; the two are related by $\lambda = \ln(2) / t_{1/2}$ .

Example: Carbon-14 has a half-life of 5730 years — the foundation of carbon dating.
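A minimal sketch of how the decay law is used in both directions, assuming the 5730-year half-life quoted above. The helper names `fraction_remaining` and `age_from_fraction` are hypothetical.

```python
import math

HALF_LIFE_C14 = 5730.0   # years

def fraction_remaining(t_years, half_life=HALF_LIFE_C14):
    """N(t) / N0 = e^(-lambda t) with lambda = ln(2) / half_life."""
    lam = math.log(2) / half_life
    return math.exp(-lam * t_years)

def age_from_fraction(fraction, half_life=HALF_LIFE_C14):
    """Invert the decay law: t = -ln(fraction) / lambda."""
    return -half_life * math.log(fraction) / math.log(2)

print(fraction_remaining(5730))    # about 0.5 after one half-life
print(age_from_fraction(0.25))     # about 11460 years (two half-lives)
```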

3. Newton's Law of Cooling

An object's temperature relaxes exponentially toward its surroundings:

T(t) = T_{\text{env}} + (T_0 - T_{\text{env}}) \, e^{-kt}.

$T_{\text{env}}$ is room temperature; $T_0$ is the starting temperature.

Example: A 90°C coffee in a 20°C room cools toward 20°C, with the gap shrinking exponentially.
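The coffee example can be sketched numerically. The text gives no cooling constant, so k = 0.1 per minute below is purely an assumed value for illustration.

```python
import math

def temperature(t_min, T0=90.0, T_env=20.0, k=0.1):
    """T(t) = T_env + (T0 - T_env) * e^(-k t); k = 0.1/min is an assumed value."""
    return T_env + (T0 - T_env) * math.exp(-k * t_min)

for t in (0, 5, 10, 30, 60):
    print(f"t = {t:>2} min   T = {temperature(t):.2f} °C")

# The gap to room temperature halves every ln(2)/k minutes.
print(temperature(math.log(2) / 0.1) - 20.0)   # about 35, half of the initial 70
```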

Machine Learning Applications

Exponentials appear at the very heart of modern ML. The reason is always the same: we need to map arbitrary real-valued scores into positive numbers (probabilities, rates, weights), and $e^x$ is the smooth, differentiable way to do it.

1. Softmax

Converts a vector of logits into a probability distribution:

\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.

Every logit is exponentiated (forcing it positive), then normalized so the outputs sum to 1. Softmax is the standard choice for multi-class classification heads and for the weights inside attention layers.
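A minimal NumPy sketch of the formula. Subtracting the largest logit before exponentiating is a common stabilization trick added here; it cancels in the ratio, so the result is unchanged.

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = e^{z_i} / sum_j e^{z_j}, computed in a numerically stable way."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # largest exponent becomes 0, so exp cannot overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs)          # roughly [0.659, 0.242, 0.099]
print(probs.sum())

# Large logits would overflow a naive exp; the shifted version stays finite.
print(softmax([1000.0, 1000.0]))
```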

2. Cross-Entropy Loss

\mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i).

The logarithm is the inverse of the exponential in softmax — the two cancel out beautifully inside the gradient, producing the famously clean update $\partial \mathcal{L} / \partial z_i = \hat{y}_i - y_i$ .
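The clean gradient can be checked against a finite-difference estimate. This sketch assumes a one-hot target and reuses the sample logits [2.0, 1.0, 0.1]; both are illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """L = -sum_i y_i * log(softmax(z)_i)."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])   # one-hot target (illustrative)

analytic = softmax(z) - y        # the claimed gradient dL/dz

# Central-difference estimate of each partial derivative.
h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * h)

print(analytic)
print(numeric)
```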

3. Exponential Learning-Rate Decay

\eta_t = \eta_0 \, e^{-k t}.

We start with a big learning rate (to make fast progress), then exponentially shrink it (to fine-tune as we converge). Same shape as radioactive decay — same math.
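A tiny sketch of the schedule. The starting rate 0.1 and decay constant 0.01 are made-up values, and `learning_rate` is a hypothetical helper, not a library function.

```python
import math

def learning_rate(step, eta0=0.1, k=0.01):
    """eta_t = eta0 * e^(-k t); eta0 and k are illustrative values."""
    return eta0 * math.exp(-k * step)

half_life = math.log(2) / 0.01   # steps needed to halve the rate (about 69)
for step in (0, 100, 500):
    print(step, round(learning_rate(step), 5))
```

Just as with radioactive decay, the rate halves every ln(2)/k steps.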

4. Attention Mechanisms

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right) V.

The softmax inside attention is the only nonlinearity in a transformer's mixing operation. The exponential makes attention weights sharp around the most-relevant key, while staying differentiable.
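A single-head, unbatched NumPy sketch of the formula, with no masking. The shapes (2 queries, 3 keys, d_k = 4, value width 5) and the random inputs are arbitrary choices for illustration.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q, K, V."""
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

out, weights = attention(Q, K, V)
print(weights.sum(axis=1))   # each query's weights sum to 1
print(out.shape)             # (2, 5)
```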

Python Implementation

We start with plain Python + NumPy + Matplotlib. First we plot the family; then we use the formula $A(n) = (1 + 1/n)^n$ to watch e emerge numerically; then we verify the self-derivative property with a central-difference quotient.

Plotting the Exponential Family

Visualizing four exponential functions on one axes

🐍exponential_plot.py


Line 1: Import NumPy

NumPy gives us vectorized math. np.power and np.exp accept arrays element-wise, so we can compute base^x for 200 points at once.

EXAMPLE

np.power(2.0, np.array([0, 1, 2])) -> array([1., 2., 4.])

Line 2: Import matplotlib

pyplot is the plotting interface we'll use to draw the four curves on a single axes.

Line 4: Define exponential(x, base)

A thin wrapper around np.power so the call site reads like the math: exponential(x, base) computes base raised to x.

EXAMPLE

exponential(np.array([0, 1, 2]), 3.0) -> array([1., 3., 9.])

Line 5: Docstring states the math

The function implements f(x) = base ** x. We keep the docstring short because the formula is the whole story.

Line 6: Return base ** x

np.power broadcasts: scalar base, array x -> array of base^x. This is the single line where the entire visualization is computed.

EXAMPLE

Inputs base=2, x=[0,1,2,3] -> Output [1, 2, 4, 8]

Line 8: Create the x-axis grid

200 evenly spaced points from -3 to 3. Density matters: too few points and the curve looks like polylines, too many wastes work. 200 is a good default for a smooth curve over a small interval.

EXAMPLE

x[0] = -3.0, x[1] ≈ -2.9698, ..., x[199] = 3.0

Line 10: Four bases to compare

We pick bases that span the qualitative behaviors: 0.5 (decay), e ≈ 2.7183 (the natural base), 2.0 (a popular discrete base), and 3.0 (faster growth than e).

EXAMPLE

At x=1: 0.5^1=0.5, e^1≈2.7183, 2^1=2.0, 3^1=3.0. Notice e lives between 2 and 3.

Line 11: Color palette

Distinct hex colors so each curve stays visually separable when they overlap near x=0.

Line 12: Legend labels

We tag 0.5^x as decay so the reader doesn't have to compute that 0.5 < 1 implies a falling curve.

Line 14: Create the figure

figsize=(10, 6) gives a wide rectangle — better for showing that x = 3 is a *short* x-distance but a *huge* y-distance for exponential growth.

Line 15: Loop over the four bases

zip pairs (base, color, label) so each iteration draws one curve in its assigned color with its assigned legend entry.

EXAMPLE

Iter 1: base=0.5, color='#ef4444', label='0.5^x (decay)'
Iter 2: base=e, color='#22c55e', label='e^x'
Iter 3: base=2.0, color='#3b82f6', label='2^x'
Iter 4: base=3.0, color='#8b5cf6', label='3^x'

Line 16: Compute y = base^x

This is the line that produces the actual data. For base=2.0 and x=[-3,...,3] the endpoints are 2^-3=0.125 and 2^3=8.0 — a 64x range from a small change in x.

Line 17: Plot the curve

linewidth=2 is thick enough that all four curves remain readable when they cross near (0, 1).

Line 20: Mark (0, 1), the universal anchor

Every exponential a^x passes through (0, 1) because a^0 = 1 for any positive a. This is the only point all four curves share.

EXAMPLE

0.5^0 = 1, e^0 = 1, 2^0 = 1, 3^0 = 1

Line 21: Annotate the anchor

A small label so the reader's eye lands on (0, 1) immediately — the geometric heart of the family.

Line 23: Horizontal reference line y=0

The x-axis is the horizontal asymptote: a^x → 0 as x → -∞ for a > 1, and as x → +∞ for 0 < a < 1. The curves approach this line but never touch it.

Line 24: Vertical reference line x=0

Helps the eye align (0, 1) with the y-axis.

Line 25: Axis labels

We label the axes x and f(x). Notice we are NOT labeling f(x) as y — these are different concepts in calculus: f(x) is a value of the function.

Lines 26-27: Title and legend

Title states what the figure shows; legend distinguishes the four curves by color and label.

Line 28: Clip the y-range to [-1, 10]

Without this, 3^3 = 27 would dominate the plot and squash the interesting region near (0, 1). Clipping is a pedagogical choice, not a mathematical one.

EXAMPLE

Unclipped y-range would be roughly [0, 27]; clipped to [-1, 10] lets us still see 0.5^x decay clearly.

Line 29: Render

plt.show() pumps the figure to the screen. In a notebook this is implicit; in a script it's required.


import numpy as np
import matplotlib.pyplot as plt

def exponential(x, base):
    """General exponential function f(x) = base ** x."""
    return np.power(base, x)

x = np.linspace(-3, 3, 200)

bases  = [0.5, np.e, 2.0, 3.0]
colors = ['#ef4444', '#22c55e', '#3b82f6', '#8b5cf6']
labels = ['0.5^x (decay)', 'e^x', '2^x', '3^x']

plt.figure(figsize=(10, 6))
for base, color, label in zip(bases, colors, labels):
    y = exponential(x, base)
    plt.plot(x, y, color=color, linewidth=2, label=label)

# The universal anchor point (0, 1) — true for every base.
plt.scatter([0], [1], color='red', s=80, zorder=5)
plt.annotate('(0, 1)', (0.1, 1.3), fontsize=12)

plt.axhline(0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('x'); plt.ylabel('f(x)')
plt.title('Exponential functions with different bases')
plt.legend(); plt.grid(True, alpha=0.3)
plt.ylim(-1, 10)
plt.show()

Building e from Bernoulli's Limit

This is the worked-example table, but produced by the computer so we can push $n$ past 31 million (one compounding per second) and watch the gap to $e$ fall to about $4 \times 10^{-8}$ .

Numerically discovering e from (1 + 1/n)^n

🐍euler_from_limit.py


Line 1: Import math

We only need math.e (the double-precision value of e, 2.718281828459045) for the comparison column. Everything else is pure arithmetic.

EXAMPLE

math.e = 2.718281828459045

Line 9: Choose compounding frequencies

We sweep n through the natural physical cadences: yearly (1), semi-annual (2), quarterly (4), monthly (12), daily (365), hourly (8760), every minute (525,600), every second (31,536,000).

EXAMPLE

8_760 = 24 * 365 (hours in a year)
525_600 = 60 * 8_760 (minutes in a year)

Line 11: Header row

Three columns: the compounding frequency n, the resulting balance A(n), and how far it is from e. We use right-justified widths so the decimal points line up.

Line 12: Separator

A line of dashes that visually delimits the header from the data — a small thing that makes the output skim-readable.

Line 14: Loop over n

Each iteration computes one row of the convergence table. The loop body is intentionally short: one formula, one gap, one print.

EXAMPLE

Iter 1: n=1     -> A(1)     = 2.0000000000      gap ≈ 7.18e-01
Iter 2: n=2     -> A(2)     = 2.2500000000      gap ≈ 4.68e-01
Iter 3: n=4     -> A(4)     ≈ 2.4414062500      gap ≈ 2.77e-01
Iter 4: n=12    -> A(12)    ≈ 2.6130352902      gap ≈ 1.05e-01
Iter 5: n=365   -> A(365)   ≈ 2.7145674820      gap ≈ 3.71e-03
Iter 6: n=8760  -> A(8760)  ≈ 2.7181266916      gap ≈ 1.55e-04

Line 15: Compute A(n) = (1 + 1/n)^n

This single line is the entire Bernoulli experiment. Notice it has two competing forces: (1 + 1/n) → 1 (which would push the whole expression to 1), and the exponent n → ∞ (which would push toward ∞). The limit is finite: e.

EXAMPLE

n=1:   (1 + 1)^1     = 2^1     = 2.0
n=2:   (1 + 0.5)^2   = 1.5^2   = 2.25
n=4:   (1.25)^4               ≈ 2.44140625
n=12:  (1.08333...)^12         ≈ 2.61303529

Line 16: Compute the gap to e

math.e - a_n is positive and shrinks toward zero, numerical evidence that A(n) → e from below. The gap shrinks by about the same factor that n grows, which is the rate-of-convergence signature of this limit (gap ≈ e / (2n)).

EXAMPLE

n=12:    gap ≈ 1.053e-01
n=365:   gap ≈ 3.715e-03   (about 28x smaller; n grew 30x)
n=8760:  gap ≈ 1.550e-04   (about 24x smaller; n grew 24x)

Line 17: Print a formatted row

f-string format specs: '>12' = right-justified width 12, '.10f' = 10 decimal places, '.2e' = scientific notation with 2 decimals. The alignment makes the convergence visually obvious.

Line 19: Print the true value of e

Showing math.e at the bottom lets the reader confirm by eye that the value on row n=31,536,000 agrees with e to within about 4 × 10⁻⁸. 'Continuous compounding' is just 'compounding so often that the limit is reached to any practical accuracy'.

EXAMPLE

Final row: n=31_536_000 -> A(n) ≈ 2.7182817853  vs  e ≈ 2.7182818285


import math

# Bernoulli's question (1683):
#   $1 at 100% annual interest, compounded n times per year.
#   What is the maximum we can earn in one year as n -> infinity?
#
# A(n) = (1 + 1/n) ** n

ns = [1, 2, 4, 12, 365, 8_760, 525_600, 31_536_000]

print(f"{'n':>12}  {'A(n) = (1 + 1/n)^n':>22}  {'gap to e':>12}")
print("-" * 52)

for n in ns:
    a_n = (1 + 1 / n) ** n
    gap = math.e - a_n
    print(f"{n:>12}  {a_n:>22.10f}  {gap:>12.2e}")

print(f"\n{'e':>12}  {math.e:>22.10f}")

Verifying That e^x Is Its Own Derivative

d/dx e^x = e^x, verified analytically and numerically

🐍exponential_derivative.py


Line 1: Import NumPy

We use np.exp (the natural exponential), np.power (general exponential), and np.log (natural logarithm).

Line 3: Define derivative_exponential

Returns the analytic derivative of base^x. The formula d/dx[a^x] = a^x · ln(a) follows from rewriting a^x = e^(x ln a) and applying the chain rule.

Lines 4-7: Docstring captures the magic

The docstring states the rule and immediately notes the special case base = e where ln(e) = 1 collapses the formula to f'(x) = f(x).

Line 8: Return a^x · ln(a)

Vectorized: np.power(base, x) gives a^x, np.log(base) gives ln(a). Their product is the derivative.

EXAMPLE

derivative_exponential(1.5, 2.0)
  np.power(2.0, 1.5) = 2.8284271247
  np.log(2.0)        = 0.6931471806
  product            = 1.9605162869

Line 11: Choose a test point

x = 1.5 is a generic non-integer point. We avoid x=0 and x=1 because they are degenerate (a^0=1, a^1=a) and don't expose the multiplicative structure.

Line 14: Compute f(x) = e^x at x = 1.5

np.exp(1.5) ≈ 4.4816890703. This is the value of the natural exponential at x = 1.5.

EXAMPLE

f_e = e^1.5 ≈ 4.4816890703

Line 15: Compute the analytic derivative

For base e: f'(x) = e^x — the very same value. So f_e_prime is the same number as f_e. This is the defining miracle of e.

EXAMPLE

f_e_prime = e^1.5 ≈ 4.4816890703   (identical to f_e)

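Line 18: Pick a tiny step h

h = 1e-6 keeps the truncation error of the central-difference quotient tiny (about h^2 / 6 · f''' ≈ 7e-13), while floating-point cancellation contributes an error of roughly 1e-10 to 1e-9. The estimate is therefore good to about 9 decimal places; a much smaller h would make the cancellation worse.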

Line 19: Central-difference quotient

Numerical derivative: (f(x+h) - f(x-h)) / (2h). This independent estimate confirms the analytic answer — useful because students often distrust the 'it equals itself' claim.

EXAMPLE

x = 1.5, h = 1e-6
  e^(1.500001)  ≈ 4.481693552
  e^(1.499999)  ≈ 4.481684589
  difference    ≈ 0.000008963
  / (2*h)       ≈ 4.48168907     <-- matches analytic to ~9 digits

Lines 21-24: Print the e^x summary

Three lines show f(x), the analytic f'(x), and the numeric f'(x). The first two are identical, and the numeric estimate agrees with them to about 9 decimal places.

Line 25: The killer line, ratio f'(x)/f(x) = 1

Among all positive bases, only base e gives this exact ratio of 1. Every other base produces a ratio equal to ln(base).

Line 28: Now switch to base 2

We repeat the same experiment with base = 2.0 to show what happens for a 'non-magic' base.

Lines 29-30: Compute 2^x and its derivative

Same structure: np.power for the value, derivative_exponential for the analytic derivative.

EXAMPLE

f_2 = 2^1.5 ≈ 2.8284271247
f_2_prime = 2^1.5 · ln(2) ≈ 1.9605162869

Line 31: Compute the ratio for 2^x

ratio = f'(x) / f(x) = ln(2) ≈ 0.6931, NOT 1. The function 2^x grows, but its growth rate is only 69.3% of its current value.

Lines 33-36: Print the 2^x summary

The final line shows ratio ≈ 0.6931 alongside ln(2) ≈ 0.6931 — confirming d/dx[2^x] = 2^x · ln(2).

EXAMPLE

Expected stdout:
  2^x at x=1.5:
    f(x)             = 2.8284271247
    f'(x)            = 1.9605162869
    ratio f'(x)/f(x) = 0.6931471806   <-- equals ln(2) = 0.6931471806


import numpy as np

def derivative_exponential(x, base):
    """
    Analytic derivative of f(x) = base^x is f'(x) = base^x * ln(base).
    Special case: when base = e, ln(e) = 1, so f'(x) = e^x = f(x).
    """
    return np.power(base, x) * np.log(base)

# Pick a test point.
x = 1.5

# Case 1: base = e (the magic base)
f_e        = np.exp(x)                       # value of e^x at x=1.5
f_e_prime  = np.exp(x)                       # analytic derivative

# Numerical sanity check: central difference quotient.
h = 1e-6
numeric_e_prime = (np.exp(x + h) - np.exp(x - h)) / (2 * h)

print(f"e^x at x={x}:")
print(f"  f(x)             = {f_e:.10f}")
print(f"  f'(x) analytic   = {f_e_prime:.10f}")
print(f"  f'(x) numeric    = {numeric_e_prime:.10f}")
print(f"  ratio f'(x)/f(x) = {f_e_prime / f_e:.10f}   <-- exactly 1")

# Case 2: base = 2 (typical exponential, not magic)
base = 2.0
f_2          = np.power(base, x)
f_2_prime    = derivative_exponential(x, base)
ratio        = f_2_prime / f_2

print(f"\n2^x at x={x}:")
print(f"  f(x)             = {f_2:.10f}")
print(f"  f'(x)            = {f_2_prime:.10f}")
print(f"  ratio f'(x)/f(x) = {ratio:.10f}   <-- equals ln(2) = {np.log(2):.10f}")

PyTorch Implementation

Now in PyTorch. We'll use the very same $e^x$ in two places: building softmax probabilities (the ML-flavored use of exponentials) and confirming the self-derivative property via autograd (the calculus-flavored use).

Softmax and autograd: exponentials in PyTorch

🐍exponential_pytorch.py


Line 1: Import torch

torch gives us tensors, autograd, and torch.exp — the PyTorch analog of np.exp that also tracks gradients.

Line 2: Import functional API

torch.nn.functional contains the production-quality softmax, which subtracts the maximum logit before exponentiating to avoid overflow on large logits.

Line 8: Create a tensor of logits

A vector of three raw scores. They are not probabilities yet — they can be any real numbers, including negatives. The softmax will turn them into a probability distribution.

EXAMPLE

logits = tensor([2.0, 1.0, 0.1])

Line 11: Element-wise e^z

torch.exp applies e^x to every element. This is where exponentials enter ML: it maps the entire real line into the positive reals, which is exactly what we need to interpret values as 'likelihoods'.

EXAMPLE

e^2.0 ≈ 7.3891
e^1.0 ≈ 2.7183
e^0.1 ≈ 1.1052
=> exps = tensor([7.3891, 2.7183, 1.1052])

Line 12: Normalize

Divide each exp by the sum so the components add to 1. Now probs is a valid probability distribution over the three classes.

EXAMPLE

sum = 7.3891 + 2.7183 + 1.1052 = 11.2126
probs = [7.3891/11.2126, 2.7183/11.2126, 1.1052/11.2126]
      ≈ [0.6590, 0.2424, 0.0986]

Line 14: Print logits

Sanity-check: we have not modified the input tensor — operations on logits are read-only.

Line 15: Print the raw exponentials

All three are positive. exps is what makes softmax differ from a simple 'argmax with ties': the relative magnitudes of e^z_i are preserved smoothly.

Line 16: Print the probabilities

The largest logit (2.0) gets the largest probability (≈0.659). The smallest logit (0.1) gets the smallest probability (≈0.099). The middle one falls between.

Line 17: Confirm probabilities sum to 1

.item() unwraps a 0-d tensor to a Python float. The print should display 1.0 (or 0.9999999... due to floating-point) — proof that probs is a valid distribution.

Line 20: PyTorch's built-in softmax

F.softmax(logits, dim=0) is mathematically identical to our manual version. It subtracts max(logits) before exponentiating to avoid overflow when logits are large (e.g. a logit of 100 gives e^100 ≈ 2.7e43, which exceeds float32's maximum of about 3.4e38 and overflows to inf).

EXAMPLE

Numerically-stable form:
  z' = z - max(z) = [2.0, 1.0, 0.1] - 2.0 = [0.0, -1.0, -1.9]
  softmax(z') = softmax(z)    (proved algebraically)
The answer is the same: tensor([0.6590, 0.2424, 0.0986]).
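
The algebra behind that invariance is short. A sketch, writing $m = \max_j z_j$ (the symbol $m$ is introduced here only for convenience):

```latex
\[
\operatorname{softmax}(z - m)_i
  = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}
  = \frac{e^{-m}\, e^{z_i}}{e^{-m} \sum_j e^{z_j}}
  = \frac{e^{z_i}}{\sum_j e^{z_j}}
  = \operatorname{softmax}(z)_i
\]
```

After the shift the largest exponent is exactly 0, so every term $e^{z_i - m}$ lies in $(0, 1]$ and cannot overflow.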

Line 25: Tensor with requires_grad=True

requires_grad=True tells autograd to record every operation on x so we can later call .backward() and have x.grad populated with the derivative.

EXAMPLE

x.requires_grad = True   <-- autograd starts tracking

Line 26: Compute y = e^x

torch.exp is differentiable. Internally PyTorch builds a tiny computation graph: x -> ExpBackward -> y. The graph remembers the value of y because the derivative of e^x is y itself.

EXAMPLE

y = e^1.5 ≈ 4.48169   (tensor with grad_fn=<ExpBackward0>)

Line 28: Trigger backpropagation

y.backward() walks the graph backward: it computes dy/dx and writes the result into x.grad. For y = e^x the rule says dy/dx = e^x — which is the very value of y we already have.

Line 30: Print y

y ≈ 4.48169. The tensor is float32, so it matches np.exp(1.5) from the previous code block to about 7 significant digits: PyTorch and NumPy agree up to that precision.

Line 31: Print dy/dx

x.grad ≈ 4.48169, identical to y. This is autograd confirming the analytic claim d/dx e^x = e^x.

Line 32: Check that they are equal

torch.allclose returns True if every element of y matches x.grad within a small tolerance. Here the two values coincide, because the backward pass reuses the forward value y as the gradient.

EXAMPLE

Expected stdout (float32, so only about 7 significant digits are meaningful; trailing digits may vary):
  y     ≈ 4.48169
  dy/dx ≈ 4.48169
  equal : True

import torch
import torch.nn.functional as F

# ----------------------------------------------------------------------
# Part 1: softmax. Exponentials turn a vector of logits into probabilities.
# softmax(z_i) = exp(z_i) / sum_j exp(z_j)
# ----------------------------------------------------------------------
logits = torch.tensor([2.0, 1.0, 0.1])

# Manual softmax with raw exponentials.
exps   = torch.exp(logits)              # element-wise e^z_i
probs  = exps / exps.sum()              # normalize to a probability vector

print("logits :", logits)
print("exps   :", exps)
print("probs  :", probs)
print("sum    :", probs.sum().item())   # must be 1.0 (up to rounding)

# PyTorch's built-in (numerically stable) softmax gives the same answer.
print("F.softmax:", F.softmax(logits, dim=0))

# ----------------------------------------------------------------------
# Part 2: autograd confirms d/dx e^x = e^x exactly.
# ----------------------------------------------------------------------
x = torch.tensor(1.5, requires_grad=True)
y = torch.exp(x)

y.backward()                            # populates x.grad with dy/dx

print("\ny     =", y.item())            # e^1.5
print("dy/dx =", x.grad.item())          # also e^1.5; they are equal
print("equal :", torch.allclose(y.detach(), x.grad))

Why autograd nails the self-derivative property exactly

When PyTorch computes the derivative of $y = e^x$ , it does not use a numerical approximation. Internally, ExpBackward caches $y$ (since the derivative happens to equal the forward value) and reuses it in the backward pass: the gradient is the incoming gradient (1.0 here) times the cached $y$ . The check torch.allclose(y.detach(), x.grad) therefore passes because the two values coincide, not merely because they fall within allclose's tolerance.
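
A toy sketch of that mechanism in plain Python may make it concrete. The ToyExp class and its method names are hypothetical and are not PyTorch's actual implementation; the sketch only mirrors the idea of caching the forward value and reusing it in the backward pass.

```python
import math


class ToyExp:
    """Hypothetical stand-in for an exp node in a reverse-mode autodiff graph."""

    def forward(self, x: float) -> float:
        # Cache the output: for exp, the derivative equals the forward value.
        self.y = math.exp(x)
        return self.y

    def backward(self, grad_output: float = 1.0) -> float:
        # Chain rule: dL/dx = dL/dy * dy/dx, and dy/dx = e^x = cached y.
        return grad_output * self.y


node = ToyExp()
y = node.forward(1.5)
dy_dx = node.backward()          # incoming gradient defaults to 1.0

print("y     =", y)
print("dy/dx =", dy_dx)
print("bitwise equal:", y == dy_dx)
```

Because the backward pass multiplies the cached value by 1.0, no finite-difference step and no rounding is involved, which is why the two numbers match bit for bit in this sketch.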

Common Pitfalls

Confusing Exponential with Power Functions

$2^x$ (exponential) is NOT the same as $x^2$ (power function):

$2^x$ : variable exponent, fixed base — exponential growth.
$x^2$ : fixed exponent, variable base — polynomial growth.

For large x, exponentials always dominate: $2^{10} = 1024$ while $10^2 = 100$ .
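
"For large x" matters here: at small x the power function can tie or even lead. A quick check in plain Python (the sample points are chosen arbitrarily for illustration):

```python
# Compare the exponential 2^x with the power function x^2 at a few integers.
comparison = {}
for x in [1, 2, 3, 4, 5, 10, 20]:
    exp_val = 2 ** x      # variable exponent, fixed base
    pow_val = x ** 2      # fixed exponent, variable base
    if exp_val > pow_val:
        relation = ">"
    elif exp_val < pow_val:
        relation = "<"
    else:
        relation = "="
    comparison[x] = relation
    print(f"x={x:>2}  2^x={exp_val:>8}  {relation}  x^2={pow_val:>4}")
```

The two functions tie at x = 2 and x = 4, the power function is ahead at x = 3, and from x = 5 onward the exponential leads and the gap keeps widening.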

Negative Bases Are Not Allowed

$(-2)^x$ is not a valid exponential function:

$(-2)^{1/2} = \sqrt{-2}$ is not real.
Many real x produce complex or undefined results.

That is why we require $a > 0$ .
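
Python shows the problem directly. A small sketch using only the standard library; the variable names are illustrative:

```python
import math

# Python's ** operator falls back to complex numbers when a negative base
# is raised to a fractional power.
z = (-2) ** 0.5
print(type(z).__name__, z)

# math.pow stays in the real numbers and raises an error instead.
try:
    math.pow(-2, 0.5)
    real_result_exists = True
except ValueError:
    real_result_exists = False

print("real result exists:", real_result_exists)
```

Either way, there is no real number to plot at x = 1/2, so $(-2)^x$ cannot be a real-valued function on the whole real line.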

`e` vs `exp()` in code

In Python, e**x only works if e is already defined, for example via e = math.e or from math import e. Otherwise e is undefined and you'll get a NameError. Prefer the explicit functions:

math.exp(x) or numpy.exp(x) in Python
torch.exp(x) in PyTorch
Math.exp(x) in JavaScript
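
A short Python sketch of the difference. The eval call is only a device to evaluate the expression in a namespace that is guaranteed not to contain e; it is not something you would write in real code.

```python
import math

x = 2.0

# A bare `e` is just an undefined name unless you define or import it.
try:
    eval("e ** x", {"x": x})     # namespace contains x but no e
    bare_e_works = True
except NameError:
    bare_e_works = False

via_constant = math.e ** x        # works, but goes through the general power operator
via_exp = math.exp(x)             # dedicated exponential function

print("bare e works:", bare_e_works)
print("math.e ** x :", via_constant)
print("math.exp(x) :", via_exp)
print("close       :", math.isclose(via_constant, via_exp))
```

Both spellings agree to floating-point precision; the explicit exp call simply states the intent and does not depend on a name called e being in scope.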

Softmax Overflow

A naive softmax computes $e^{z_i}$ directly. With a logit of 100, this is $e^{100} \approx 2.7 \times 10^{43}$ — far beyond float32's ~3.4e38 limit. Always subtract $\max(z)$ first (this is what F.softmax does). The result is mathematically identical but numerically safe.
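
The same failure can be reproduced without PyTorch. The sketch below uses plain Python floats, which are float64, so overflow starts near an exponent of 709 rather than float32's 88; the logit of 1000 is chosen to exceed that. The function names naive_softmax and stable_softmax are illustrative, not library functions.

```python
import math


def naive_softmax(z):
    # Exponentiate directly; math.exp raises OverflowError for large inputs.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [t / total for t in exps]


def stable_softmax(z):
    # Shift by the max so the largest exponent is exactly 0.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [t / total for t in exps]


small = [2.0, 1.0, 0.1]
large = [1000.0, 999.0, 998.1]     # same gaps as `small`, shifted up by 998

print("naive  (small):", [round(p, 4) for p in naive_softmax(small)])
print("stable (small):", [round(p, 4) for p in stable_softmax(small)])

try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive overflowed on large logits:", naive_overflowed)
print("stable (large):", [round(p, 4) for p in stable_softmax(large)])
```

Since softmax depends only on the gaps between logits, the stable version returns approximately 0.659, 0.2424 and 0.0986 for both inputs, while the naive version fails on the shifted one.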

Test Your Understanding

What is the value of e^0?

Summary

Exponential functions describe multiplicative change — the universal pattern whenever the rate of change is proportional to the current quantity.

Key Formulas

Formula	Description
f(x) = a^x	General exponential (a > 0, a ≠ 1)
f(x) = e^x	Natural exponential (e ≈ 2.718)
e = lim_{n→∞} (1 + 1/n)^n	Bernoulli's definition of e
e = Σ_{n=0}^{∞} 1/n!	Series definition of e
d/dx e^x = e^x	Self-derivative property
d/dx a^x = a^x · ln(a)	Derivative for any base
A = P e^(rt)	Continuous compounding / growth
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)	ML probability normalization

Key Takeaways

Exponentials model multiplicative change: each step multiplies by the same factor.
Every exponential passes through $(0, 1)$ because $a^0 = 1$ .
Base > 1 gives growth; 0 < base < 1 gives decay; the x-axis is the asymptote on one side.
Euler's number $e \approx 2.718$ emerges as the limit of $(1 + 1/n)^n$ — the natural ceiling of continuous compounding.
$e^x$ is the unique exponential that equals its own derivative — confirmed analytically, numerically, and via PyTorch autograd.
Exponentials are everywhere: population, decay, cooling, finance, softmax, attention, learning-rate schedules.

The Essence of Exponentials

"Among all bases, only e gives an exponential that is its own derivative — the natural language of every system whose rate of change is proportional to its size."

Coming Next: in the next section we invert the exponential to get the logarithm. You will see why $\log$ turns multiplication into addition, why the natural log $\ln$ is the inverse of $e^x$ , and how all of this powers the cross-entropy loss in deep learning.