Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

State and prove the rule $\dfrac{d}{dx}\,e^x = e^x$
Explain the magical limit $\displaystyle\lim_{h \to 0} \dfrac{e^h - 1}{h} = 1$ and why it defines the number $e$
Compute the slope of the tangent line to $y = e^x$ at any point without using a calculator beyond evaluating $e^x$
Recognize why $a = e$ is the unique base for which $a^x$ is its own derivative
Apply the chain-rule version $\dfrac{d}{dx}\, e^{kx} = k\, e^{kx}$ to growth and decay problems
Verify the derivative numerically in plain Python and via PyTorch autograd

The Big Picture: A Function Equal to Its Own Slope

“Of all functions known to mathematics, only one — up to a constant multiplier — is exactly equal to its own derivative. That function is $e^x$ .”

Stop and let that land. Take any other function you can think of — a polynomial, a sine, a logarithm — and its derivative is some different function. The derivative of $x^2$ is $2x$ , a different shape entirely. The derivative of $\sin x$ is $\cos x$ , a shifted wave. But the derivative of $e^x$ is itself:

\frac{d}{dx}\, e^x \;=\; e^x

What this rule actually says

At every point on the curve $y = e^x$ , the height of the curve equals the slope of its tangent line.

If the curve is 2.7 units above the x-axis, the tangent at that point rises 2.7 units for every 1 unit you move right. If the curve is 20 units up, the tangent has slope 20. The steepness and the value are the same number.

This single fact is the reason $e$ shows up everywhere: compound interest, radioactive decay, population growth, RC-circuit charging, Bayesian priors, neural-network softmax outputs, the Schrödinger equation, the normal distribution. Every time a quantity grows at a rate proportional to itself, the answer is dressed in $e$ .

Intuition: Money and Bacteria

The bank account that compounds continuously

Imagine a savings account that pays interest continuously at rate $100\%$ per year. After time $t$ years your balance is $B(t) = e^t$ (starting from $B(0) = 1$ ).

The rate at which money is added to the account at instant $t$ is $B'(t)$ — interest dollars per year. Common sense says:

\text{interest per year} \;=\; (\text{rate}) \times (\text{principal}) \;=\; 1 \cdot B(t) \;=\; B(t)

So $B'(t) = B(t)$. The bigger the balance, the faster it grows. The function and its derivative are the same. That is the differential equation $y' = y$, and with the starting value $y(0) = 1$ its solution is $y = e^t$.
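One way to get a feel for "continuous" compounding is to compound a finite number of times per year and let that number grow. The sketch below assumes the 100% rate and unit starting balance used above; the helper name `balance_after_one_year` is illustrative and not part of the original lesson.

```python
import math

def balance_after_one_year(n):
    """Balance after one year at 100% interest, compounded n times, starting from 1."""
    return (1 + 1 / n) ** n

# More compounding periods per year push the balance toward e.
for n in [1, 12, 365, 10_000, 1_000_000]:
    balance = balance_after_one_year(n)
    print(f"n = {n:>9,}  balance = {balance:.8f}  gap to e = {math.e - balance:.2e}")
```

The gap shrinks roughly in proportion to 1/n, which is why the continuous limit is usually written directly as $e^t$.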

A colony of bacteria

Now replace dollars with bacteria. Each bacterium splits at a constant per-capita rate. If there are $N(t)$ bacteria, the population produces new bacteria at a rate proportional to $N(t)$ itself: twice as many parents, twice as many babies per minute. Measure time in units where that per-capita rate is 1 and start from $N(0) = 1$, and you get the same equation, $N'(t) = N(t)$, with the same solution $N(t) = e^t$.

The slogan to remember

Whenever you see “the rate of growth is proportional to the current amount,” the answer is an exponential. Whenever the constant of proportionality is exactly 1, the answer is $e^x$ .

Numerical Discovery

Let's set the theory aside for a moment and measure the slope of $y = e^x$ by hand at four different x values. We will use the secant-line slope with a small step $h = 0.001$ as a stand-in for the tangent slope:

\text{slope at } x \;\approx\; \frac{e^{x+h} - e^x}{h}, \qquad h = 0.001

x	Height e^x	Numerical slope (h=0.001)	Ratio slope / height
0	1.0000000000	1.0005001667	1.0005
1	2.7182818285	2.7196414225	1.0005
2	7.3890560989	7.3927518588	1.0005
3	20.0855369232	20.0955830401	1.0005

Look at the last column. The ratio is the same number at every $x$ ! And as $h$ shrinks toward zero that ratio approaches 1. The slope and the height aren't just proportional; they're equal.

That is the entire empirical content of this section. The rest is just turning the observation into a proof.
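The measurement is easy to repeat. This is a minimal sketch using the same step $h = 0.001$ and the same four x values as the table; the function name `secant_slope` is an illustrative choice.

```python
import math

def secant_slope(x, h):
    """Slope of the secant line through (x, e^x) and (x+h, e^(x+h))."""
    return (math.exp(x + h) - math.exp(x)) / h

h = 0.001  # same step as in the table
for x in [0, 1, 2, 3]:
    height = math.exp(x)
    slope = secant_slope(x, h)
    print(f"x = {x}  height = {height:.10f}  slope = {slope:.10f}  ratio = {slope / height:.4f}")
```

Changing `h` to a smaller value moves the ratio column closer to 1 at every x simultaneously.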

Derivation from First Principles

Apply the limit definition of the derivative to $f(x) = e^x$ :

f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \;=\; \lim_{h \to 0} \frac{e^{x+h} - e^{x}}{h}

Use the most useful property of the exponential — $e^{a+b} = e^a \cdot e^b$ — to split $e^{x+h}$ as $e^x \cdot e^h$ :

f'(x) \;=\; \lim_{h \to 0} \frac{e^x \cdot e^h - e^x}{h} \;=\; \lim_{h \to 0} \frac{e^x \,(e^h - 1)}{h}

The factor $e^x$ does not depend on $h$ , so it survives the limit as a constant. Pull it outside:

f'(x) \;=\; e^x \cdot \lim_{h \to 0} \frac{e^h - 1}{h}

Now the entire question reduces to a single number: what is $\displaystyle\lim_{h \to 0} \dfrac{e^h - 1}{h}$ ?

A clean test

If this limit equals 1, then $f'(x) = e^x \cdot 1 = e^x$ — done. If it equals any other number $k$ , the derivative would be $k \, e^x$ instead. So the value of this one little limit is everything.

The Magical Limit (e^h − 1)/h → 1

The quotient $(e^h - 1)/h$ has a hole at $h = 0$ (we'd divide by zero), but as $h$ shrinks its value settles unambiguously on 1. The whole derivative of $e^x$ rests on this one limit.

[Interactive figure: “The Magical Limit: (e^h − 1) / h → 1”. A slider sets $h$; an orange dot on the curve $y = (e^h - 1)/h$ approaches the green dashed line $y = 1$. Example reading: $h = 0.5$ gives $(e^h - 1)/h = 1.29744254$, a distance of 2.97e-1 from 1.]

h	(e^h − 1) / h	|value − 1|
1	1.7182818285	7.18e-1
0.5	1.2974425414	2.97e-1
0.25	1.1361016668	1.36e-1
0.1	1.0517091808	5.17e-2
0.05	1.0254219275	2.54e-2
0.01	1.0050167084	5.02e-3
0.005	1.0025041719	2.50e-3
0.001	1.0005001667	5.00e-4
0.0005	1.0002500417	2.50e-4
0.0001	1.0000500017	5.00e-5

As $h$ shrinks, the quotient closes in on 1; the last column falls roughly like $h/2$.

That single fact is the seed from which d/dx (e^x) = e^x grows.

Why is the limit 1 and not some other number?

The number $e$ can be defined as the unique positive real number for which this limit equals 1. From chapter 1 we also have the equivalent definitions:

(1) $\displaystyle e = \lim_{n \to \infty} \left(1 + \tfrac{1}{n}\right)^n$: compound interest as the compounding period vanishes.
(2) $\displaystyle e = \sum_{k=0}^{\infty} \tfrac{1}{k!} = 1 + 1 + \tfrac{1}{2} + \tfrac{1}{6} + \tfrac{1}{24} + \cdots$: Euler's series.
(3) $e^x = \displaystyle\sum_{k=0}^{\infty} \tfrac{x^k}{k!}$: the Taylor series for the exponential.

Take definition (3), set $x = h$, and subtract 1:

e^h - 1 \;=\; h + \frac{h^2}{2!} + \frac{h^3}{3!} + \cdots

Divide by $h$ :

\frac{e^h - 1}{h} \;=\; 1 + \frac{h}{2!} + \frac{h^2}{3!} + \cdots

Now let $h \to 0$ . Every term except the leading $1$ vanishes. The limit is exactly 1.

The fingerprint of e

That leading $1$ in the series isn't a coincidence — it is exactly why $e$ is the “natural” base for the exponential. Any other base $a$ gives a different leading coefficient, namely $\ln(a)$ . We'll see this next.

Why e Is the Unique Base

For any base $a$, the curve $a^x$ and its derivative are always proportional, and the proportionality constant is exactly $\ln(a)$. Only at $a = e \approx 2.71828$ do the two curves coincide.

[Interactive figure: “Why is e the special base?” A slider sets the base $a$; the solid blue curve is $a^x$ and the red dashed curve is its derivative.]

Example reading at $a = 2$, where $\ln(a) = 0.693147$:

x	f(x) = 2^x	f'(x)	Ratio f'/f
0	1.0000	0.6931	0.6931
1	2.0000	1.3863	0.6931

The ratio $f'/f$ is 0.693147 at every $x$, a distance of 0.3069 from 1. The derivative is the function scaled by $\ln(a)$.

Where ln(a) comes from

Run the same derivation we just did but with a general base $a$ :

\frac{d}{dx}\, a^x \;=\; \lim_{h \to 0} \frac{a^{x+h} - a^x}{h} \;=\; a^x \cdot \lim_{h \to 0} \frac{a^h - 1}{h}

The remaining limit defines a number that depends only on $a$ . Call it $k(a)$ :

k(a) \;=\; \lim_{h \to 0} \frac{a^h - 1}{h}

Using $a^h = e^{h\ln a}$ and the special limit we just proved:

k(a) \;=\; \lim_{h \to 0} \frac{e^{h \ln a} - 1}{h} \;=\; \ln(a) \cdot \lim_{u \to 0} \frac{e^u - 1}{u} \;=\; \ln(a) \cdot 1 \;=\; \ln(a)

(We substituted $u = h \ln a$ , which also tends to zero.) Therefore:

\boxed{\;\frac{d}{dx}\, a^x \;=\; \ln(a)\,\cdot\, a^x\;}

And the “function is its own derivative” property holds precisely when $\ln(a) = 1$, i.e. when $a = e$. That is what makes $e^x$ the natural exponential.

Base a	ln(a)	d/dx a^x	Behaviour
2	0.6931	0.6931 · 2^x	Derivative is shorter than function
e ≈ 2.71828	1.0000	1 · e^x = e^x	Derivative equals function ✓
3	1.0986	1.0986 · 3^x	Derivative is taller than function
10	2.3026	2.3026 · 10^x	Derivative is much taller
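The $\ln(a)$ column can be checked against the defining limit. The sketch below assumes a fixed small step `h = 1e-6` as a stand-in for the limit; `k_estimate` is a hypothetical helper name.

```python
import math

def k_estimate(a, h=1e-6):
    """Finite-h estimate of lim (a^h - 1)/h; h is an illustrative choice."""
    return (a ** h - 1) / h

for a in [2, math.e, 3, 10]:
    print(f"a = {a:<8.5f} k(a) estimate = {k_estimate(a):.6f}   ln(a) = {math.log(a):.6f}")
```

Only the row for $a = e$ lands on 1, which is the uniqueness claim in numerical form.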

Geometric Meaning: Slope = Height

Pick any point $(a, e^a)$ on the curve. The tangent line at that point has equation:

y - e^a \;=\; e^a \cdot (x - a)

Two consequences worth absorbing:

The tangent at $x = a$ always has slope $e^a$ , which is the very height of the point of tangency.
That tangent line crosses the x-axis at $x = a - 1$, exactly one unit to the left of the point of tangency, no matter where on the curve you are. (Try it: substitute $y = 0$ in the tangent equation and solve.) Among the curves $y = a^x$, this unit offset occurs only for base $e$.
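The substitution suggested in the second item can be written out in full. Dividing by $e^a$ is safe because $e^a > 0$ for every real $a$:

```latex
0 - e^a \;=\; e^a \,(x - a)
\quad\Longrightarrow\quad
-1 \;=\; x - a
\quad\Longrightarrow\quad
x \;=\; a - 1
```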

Geometric self-similarity

If you stand on the curve at any point and follow the tangent down to the x-axis, you land exactly one unit to the left of where you started. Move along the curve to a new point and the same thing happens again. Every exponential $a^x$ has a constant offset of this kind, equal to $1/\ln(a)$; base $e$ is the one for which the offset is exactly 1.

Worked Example: Compute the Tangent by Hand

Find the tangent line to $y = e^x$ at $x = 2$ . We will compute the slope, write the tangent equation, and check the “one unit to the left” property above.


Step 1. Locate the point.

y(2) = e^{2} \approx 7.3890561

So the point of tangency is $(2,\; 7.3890561)$ .

Step 2. Compute the slope using the rule $\dfrac{d}{dx} e^x = e^x$ .

m = e^{2} \approx 7.3890561

The slope equals the height of the point — exactly the property we proved.

Step 3. Write the tangent line in point-slope form.

y - 7.3890561 \;=\; 7.3890561 \cdot (x - 2)

Expand:

y \;=\; 7.3890561 \, x \;-\; 7.3890561

So the tangent line is $y = e^{2} \, x - e^{2}$ , or equivalently $y = e^{2}(x - 1)$ .

Step 4. Verify the “one unit to the left” property by setting $y = 0$ :

0 \;=\; e^{2}(x - 1) \;\Longrightarrow\; x = 1

The tangent crosses the x-axis at $x = 1$, exactly one unit to the left of the point of tangency at $x = 2$. ✓

Step 5. Sanity-check numerically. Move $\Delta x = 0.001$ to the right along the tangent line. The tangent predicts:

\Delta y_{\text{tangent}} = m \cdot \Delta x = 7.3890561 \times 0.001 = 0.0073890561

The actual curve gives:

\Delta y_{\text{curve}} = e^{2.001} - e^{2} = 0.0073927519

Difference $\approx 3.7 \times 10^{-6}$ — the tangent is indistinguishable from the curve over a step that small. That is what “the derivative is the slope of the curve” means in practice.

Notice we never reached for a calculator beyond evaluating $e^2$ . The slope follows for free because the derivative is the function. That ease is the practical reason scientists overwhelmingly use base $e$ instead of base 2 or base 10.
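The five steps can be replayed in a few lines. This is a sketch of the same hand calculation, not a general tangent-line routine; the name `tangent_line` is illustrative.

```python
import math

def tangent_line(a):
    """Return (slope, intercept) of the tangent to y = e^x at x = a."""
    slope = math.exp(a)                 # the derivative of e^x is e^x
    intercept = math.exp(a) * (1 - a)   # from y = e^a + e^a * (x - a)
    return slope, intercept

m, b = tangent_line(2.0)
x_cross = -b / m                        # where the tangent meets the x-axis
print(f"tangent: y = {m:.7f} x + ({b:.7f})")
print(f"x-intercept = {x_cross:.6f}")

# Step 5: compare tangent and curve over a step of 0.001.
dx = 0.001
print(f"tangent rise = {m * dx:.10f}")
print(f"curve rise   = {math.exp(2.0 + dx) - math.exp(2.0):.10f}")
```

Calling `tangent_line` with any other `a` and recomputing `-b / m` gives `a - 1` again.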

Tangent Explorer: See It All in One Picture

[Interactive figure: “The Derivative of e^x: Visualized”. A draggable purple point on $y = e^x$ with its green tangent line, and an orange secant line through $x$ and $x + h$ that approaches the tangent as $h \to 0$.]

The reported tangent slope is always equal to the height of the curve, the central property of $e^x$. As the step $h$ shrinks, the secant collapses onto the tangent. Example reading at $x = 1.00$, $h = 0.5000$:

Secant slope (difference quotient): $[e^{1.5} - e^{1}] / 0.5 = 3.526814$
Tangent slope (true derivative): $f'(1) = e^{1} = 2.718282$
Error |secant − tangent|: 8.0853e-1

As $h \to 0$ the error tends to 0. At this point $f(x) = e^x = 2.7183$ and $f'(x) = e^x = 2.7183$: the function equals its own derivative.

Chain Rule Preview: e^(kx) and Half-Life

The most common exponential you'll meet in physics, biology, and finance isn't $e^x$ with growth rate 1 per unit time — it's $e^{kx}$ with some non-unit rate $k$ . By the chain rule (proved in detail in section 4.7):

\frac{d}{dx}\, e^{kx} \;=\; k \, e^{kx}

So the derivative of $e^{kx}$ is $k$ times the function itself. The constant $k$ is precisely the per-unit-time growth rate.

Function	Derivative	What it models
e^x	e^x	Unit growth rate
e^(2x)	2 e^(2x)	Doubling every ln(2)/2 ≈ 0.347 units
e^(-x)	-e^(-x)	Decay at rate 1
e^(-0.693 t)	-0.693 e^(-0.693 t)	Radioactive half-life t½ = 1
e^(rt)	r e^(rt)	Continuously compounded return (as in Black–Scholes)
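The rows of the table can be spot-checked numerically. The sketch below uses a symmetric difference quotient, which is a substitution for the one-sided quotient used earlier in this section; the evaluation point `x = 0.7` and step `h = 1e-5` are arbitrary illustrative choices.

```python
import math

def central_difference(f, x, h=1e-5):
    """Symmetric difference quotient; h is an illustrative choice."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
for k in [1.0, 2.0, -1.0, -0.693]:
    numeric = central_difference(lambda t: math.exp(k * t), x)
    exact = k * math.exp(k * x)
    print(f"k = {k:>6}  numeric = {numeric:.8f}  k*e^(kx) = {exact:.8f}")

doubling_time = math.log(2) / 2   # e^(2x) doubles over this interval
print(f"doubling time for e^(2x): {doubling_time:.3f}")
```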

The differential equation y' = ky

Every “rate of change is proportional to amount” problem reduces to the equation $y' = k\,y$ , and the solution is $y(t) = y_0 \, e^{kt}$ . We will solve it formally in chapter 11; here, just notice that plugging in confirms it: $y'(t) = k \, y_0 e^{kt} = k \, y(t)$ . ✓
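Besides plugging in, the equation can be stepped forward numerically and compared with $y_0 e^{kt}$. This is a rough sketch using forward Euler, a method the text has not introduced; the values of `k`, `y0`, and `t_end` are made up for illustration.

```python
import math

def euler_solve(k, y0, t_end, steps):
    """Forward-Euler integration of y' = k*y from t = 0 to t_end."""
    dt = t_end / steps
    y = y0
    for _ in range(steps):
        y += k * y * dt   # rate of change is k times the current amount
    return y

k, y0, t_end = -0.5, 3.0, 2.0   # illustrative values, not from the text
exact = y0 * math.exp(k * t_end)
for steps in [10, 100, 10_000]:
    approx = euler_solve(k, y0, t_end, steps)
    print(f"steps = {steps:>6}  Euler = {approx:.6f}  y0*e^(kt) = {exact:.6f}")
```

As the step count grows, the stepped solution closes in on the exponential formula.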

Real-World Applications

🏦 Continuously compounded interest

Balance $B(t) = P\, e^{rt}$ . Instantaneous earning rate is $B'(t) = r B(t)$ — interest per unit time equals rate × current principal.

☢ Radioactive decay

$N(t) = N_0\, e^{-\lambda t}$ . Number of decays per second is $-N'(t) = \lambda N(t)$ — proportional to atoms still present.
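A short sketch of the decay law, assuming a hypothetical half-life of 5 time units and 1000 starting atoms (both numbers are invented for illustration). The decay constant follows from requiring $N(t_{1/2}) = N_0/2$.

```python
import math

def remaining(n0, lam, t):
    """Atoms left after time t under N(t) = N0 * exp(-lam * t)."""
    return n0 * math.exp(-lam * t)

half_life = 5.0                   # hypothetical half-life, arbitrary time units
lam = math.log(2) / half_life     # decay constant from N(half_life) = N0 / 2
n0 = 1000.0
for t in [0.0, 5.0, 10.0, 15.0]:
    n = remaining(n0, lam, t)
    print(f"t = {t:>4}  N = {n:8.2f}  decays per unit time = {lam * n:7.3f}")
```

Each half-life halves both the atom count and the decay rate, since the rate is $\lambda N(t)$.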

⚡ RC circuit charging

Voltage across capacitor: $V(t) = V_0\,(1 - e^{-t/RC})$ . Charging current $i(t) \propto e^{-t/RC}$ decays as the cap fills.

🤖 Softmax in neural networks

$\sigma(z_i) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$. The gradients are $\dfrac{\partial \sigma_i}{\partial z_j} = \sigma_i\,(\delta_{ij} - \sigma_j)$: because $e^x$ is its own derivative, every entry is built from softmax outputs already computed in the forward pass, which keeps backprop simple.
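The standard closed form for the softmax Jacobian, $\partial \sigma_i / \partial z_j = \sigma_i(\delta_{ij} - \sigma_j)$, can be compared against finite differences in plain Python. This is a sketch only: the logits `z` are arbitrary, and subtracting the maximum before exponentiating is a common stability trick rather than something the lesson prescribes.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(z)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)] for i in range(n)]

z = [1.0, 2.0, 0.5]   # illustrative logits
J = softmax_jacobian(z)
h = 1e-6
max_gap = 0.0
for j in range(len(z)):
    bumped = list(z)
    bumped[j] += h
    s_plus, s_base = softmax(bumped), softmax(z)
    for i in range(len(z)):
        numeric = (s_plus[i] - s_base[i]) / h
        max_gap = max(max_gap, abs(numeric - J[i][j]))
print(f"largest |numeric - analytic| entry: {max_gap:.2e}")
```

Note that the Jacobian uses only the outputs `s`; no extra exponentials are evaluated.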

🌡 Newton's law of cooling

Temperature difference $\Delta T(t) = \Delta T_0\, e^{-kt}$ . The rate of cooling is proportional to the current temperature gap.

🎯 Normal distribution (statistics)

$f(x) = \dfrac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Since $f'(x) = -\dfrac{x-\mu}{\sigma^2}\, f(x)$ and $f(x) > 0$, finding the maximum reduces to setting the derivative of the exponent to zero, which gives $x = \mu$; the exponential factor never needs to be evaluated.

Plain Python Implementation

Let's convert everything we proved into code. The script does three things:

Defines a numerical derivative routine using the limit definition.
Compares the numerical derivative of $e^x$ against $e^x$ itself at six points.
Watches $(e^h - 1)/h$ march toward 1 as $h$ shrinks.

Numerical verification: derivative of e^x equals e^x

🐍derivative_of_exp.py


Line 1: Import math

We need math.exp(x), the standard library's implementation of e^x. It wraps the platform C library's exp and is accurate to roughly the last bit of a double, so we can treat it as the reference value for our experiment.

EXECUTION STATE

math.e = 2.718281828459045

math.exp(0) = 1.0

math.exp(1) = 2.718281828459045

Line 3: Step 1 of the experiment

The comment lays out the exact algebra we're about to verify numerically. We start from the limit definition of the derivative, factor out e^x, and recognise that the remaining limit equals 1 — the famous special limit. The conclusion is that e^x is its own derivative.

Line 4: Limit definition

(e^(x+h) − e^x) / h is the slope of the secant line through (x, e^x) and (x+h, e^(x+h)). As h shrinks toward 0, that slope becomes the slope of the tangent line — the derivative.

Line 5: Factor e^x out

Because e^(x+h) = e^x · e^h, the numerator becomes e^x · (e^h − 1). The e^x factor doesn't depend on h, so it can leave the limit. Everything difficult is now packaged inside the lone factor (e^h − 1)/h.

Line 6: Apply the special limit

The very special fact lim h→0 (e^h − 1)/h = 1 is the defining property of e. We will verify it numerically in step 3 below.

Line 7: Conclusion

Multiplying e^x by 1 gives e^x back. So the derivative of e^x is e^x. This is the only nonzero function (up to a scalar) with that property.

Line 9: Define the numerical derivative

A general-purpose helper. Given x and a small step h, it returns the difference-quotient approximation to f'(x) for f(x) = e^x. Default h = 1e-6 is small enough to be accurate yet large enough to avoid catastrophic floating-point cancellation.

Line 10: The difference quotient itself

Implements (e^(x+h) − e^x) / h literally. Try plugging x = 0: we get (e^h − 1)/h, which is exactly the magical limit we will study.

EXAMPLE

At x = 0, h = 1e-6: (e^(1e-6) − 1)/1e-6 ≈ 1.0000005 (that is, 1 + h/2 to the digits shown)

Line 12: Step 2 of the experiment

Time to confront theory with measurement. We will compute e^x at several values of x, then compute the numerical derivative at the same x, then compare. If our claim 'derivative equals function' is right, the two columns should match to within about 1e-7 to 1e-5. That gap is the truncation error of the forward difference, roughly (h/2)·e^x, not a flaw in our math.

Line 13: Print the column headers

The format spec >6 means 'right-align in a 6-character field'. Just cosmetics so the output forms a clean table the reader can scan.

Line 14: Iterate over test points

Six values spanning negative, zero, fractional, and positive integers. The point is to show that the function-equals-derivative property holds everywhere, not just at one lucky x.

LOOP TRACE · 6 iterations

x = -1.0

true_val = e^x = 0.3678794412

approx = 0.3678796253

error = 1.84e-07

x = 0.0

true_val = e^x = 1.0000000000

approx = 1.0000005000

error = 5.00e-07

x = 0.5

true_val = e^x = 1.6487212707

approx = 1.6487220950

error = 8.24e-07

x = 1.0

true_val = e^x = 2.7182818285

approx = 2.7182831876

error = 1.36e-06

x = 2.0

true_val = e^x = 7.3890560989

approx = 7.3890597932

error = 3.69e-06

x = 3.0

true_val = e^x = 20.0855369232

approx = 20.0855469670

error = 1.00e-05

Line 15: Compute the true value

Use Python's library e^x as ground truth. Inside this loop body the value true_val is the analytic derivative according to our theory.

Line 16: Compute the numerical approximation

Calls our helper, which evaluates the secant slope with h = 1e-6. If the theory is right, this should differ from true_val by only about (h/2)·e^x.

Line 17: Compute the absolute error

abs() because we don't care about the sign of the error, only its size. Looking at the iteration table: errors run from about 1e-7 to 1e-5, matching the forward-difference truncation error (h/2)·e^x with h = 1e-6, not any flaw in the theory.

Line 18: Print one row

Format spec :>6.2f reads 'right-align in 6 chars, 2 decimal places'. :>12.2e is scientific notation. The result is a neat table the reader can compare column by column.

Line 20: Step 3 of the experiment

Now we strip away the e^x factor and study just the inner limit. If lim h→0 (e^h − 1)/h is really 1, the column we print should converge to 1 as h shrinks.

Line 21: What to watch for

Comment foreshadowing the punchline. The numbers will start near 1.72 (for h = 1) and march toward 1.000000 as h shrinks.

Line 22: Blank line

Just visual separation between the two tables. print() with no argument emits a newline.

Line 23: Print the inner-table headers

Field width 14 fits the small h values cleanly; width 20 leaves room for ten decimal places in the result column.

Line 24: Iterate over shrinking h

After the first two rows, each successive h is 10× smaller. This makes the convergence to 1 visually unmistakable: each 10× reduction adds one more zero after the decimal point.

LOOP TRACE · 8 iterations

h = 1

math.exp(1) = 2.718281828

value = (e^1 - 1)/1 = 1.7182818285

h = 0.5

math.exp(0.5) = 1.6487212707

value = (e^0.5 - 1)/0.5 = 1.2974425414

h = 0.1

math.exp(0.1) = 1.1051709181

value = 1.0517091808

h = 0.01

math.exp(0.01) = 1.0100501671

value = 1.0050167084

h = 0.001

math.exp(0.001) = 1.0010005002

value = 1.0005001667

h = 0.0001

math.exp(0.0001) = 1.0001000050

value = 1.0000500017

h = 1e-5

value = 1.0000050000

h = 1e-6

value = 1.0000005000

Line 25: Evaluate the special limit

For each h we compute (math.exp(h) − 1) / h. Algebraically this is exactly the inner factor we pulled out of the derivative. The iteration trace above shows the value snapping toward 1 — confirming numerically that the limit is 1.

Line 26: Print one row

Ten decimal places makes the convergence pattern obvious: 1.71… → 1.05… → 1.005… → 1.0005…; each 10× smaller h adds one more zero after the decimal point (the value is 1 + h/2 + h²/6 + …, so the leading error is h/2).


import math

# 1. Numerical derivative of e^x via the limit definition.
#    d/dx e^x = lim h->0  (e^(x+h) - e^x) / h
#             = e^x * lim h->0  (e^h - 1) / h
#             = e^x * 1
#             = e^x   <-- the function is its own derivative

def numerical_derivative_exp(x, h=1e-6):
    return (math.exp(x + h) - math.exp(x)) / h

# 2. Verify by comparing to e^x itself across a few points.
print(f"{'x':>6}  {'e^x':>16}  {'approx d/dx e^x':>20}  {'|error|':>12}")
for x in [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0]:
    true_val = math.exp(x)
    approx   = numerical_derivative_exp(x)
    error    = abs(approx - true_val)
    print(f"{x:>6.2f}  {true_val:>16.10f}  {approx:>20.10f}  {error:>12.2e}")

# 3. Inspect the magical limit (e^h - 1) / h directly.
#    Watch the column converge to 1.
print()
print(f"{'h':>14}  {'(e^h - 1)/h':>20}")
for h in [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 1e-5, 1e-6]:
    value = (math.exp(h) - 1) / h
    print(f"{h:>14.6f}  {value:>20.10f}")

What the run produces

The first table shows $e^x$ and the numerical derivative agreeing to within about $10^{-5}$ at every test point; the gap is the forward-difference truncation error, roughly $(h/2)\,e^x$. The second table shows $(e^h - 1)/h$ converging to 1: at $h = 10^{-6}$ it prints as $1.0000005000$, matching $1 + h/2$ as predicted by the Taylor series.
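A natural follow-up is to ask what happens if $h$ keeps shrinking. The sketch below sweeps $h$ at the arbitrary point $x = 1$: the error first falls with $h$ (truncation) and then grows again once floating-point cancellation in the subtraction dominates. It also shows `math.expm1`, a standard-library function that evaluates $e^h - 1$ without that subtraction.

```python
import math

def forward_error(x, h):
    """Absolute error of the forward difference quotient for e^x."""
    approx = (math.exp(x + h) - math.exp(x)) / h
    return abs(approx - math.exp(x))

x = 1.0
for exponent in range(1, 14):
    h = 10.0 ** -exponent
    print(f"h = 1e-{exponent:02d}  error = {forward_error(x, h):.2e}")

# math.expm1(h) computes e^h - 1 directly, so the quotient stays
# accurate even when h is tiny.
tiny = 1e-12
print("naive quotient :", (math.exp(tiny) - 1) / tiny)
print("expm1 quotient :", math.expm1(tiny) / tiny)
```

This is why a default such as `h = 1e-6` is a compromise rather than "the smaller the better".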

PyTorch Verification

Plain Python gave us numerical confirmation. Let's now ask PyTorch's autograd engine, designed to handle the messiest neural-network gradients, to compute $d/dx\; e^x$ for us. We expect the gradient tensor to match $e^x$ to floating-point precision.

PyTorch autograd reproduces d/dx e^x = e^x

🐍derivative_of_exp_pytorch.py


Line 1: Import PyTorch

PyTorch's autograd engine lets us compute exact derivatives by recording every operation in a computational graph. We will use it as a second, independent witness that d/dx e^x = e^x.

Line 3: Plan of the experiment

Build a small tensor of test points, run them through exp, ask PyTorch for the gradient, and compare the gradient to e^x itself. If they match exactly, the derivative claim is verified by software written by people who do not know we are testing it.

Line 4: Create x with gradient tracking

torch.tensor wraps a plain list of floats into a PyTorch tensor. requires_grad=True tells autograd 'watch every operation that touches this tensor — I will eventually want d/dx of something with respect to it.'

EXECUTION STATE

x = tensor([-1.0, 0.0, 0.5, 1.0, 2.0, 3.0])

x.requires_grad = True

x.grad = None (not yet populated)

Line 6: Compute y = e^x

torch.exp applies e^x element-wise. Because x requires grad, y also carries grad-tracking and remembers that it was produced by torch.exp.

Line 7: Element-wise exponential

Each y[i] equals e^{x[i]}. PyTorch records the backward rule needed for backprop: dy_i/dx_i = e^{x_i}, which is exactly y_i itself.

EXECUTION STATE

y = tensor([0.3679, 1.0000, 1.6487, 2.7183, 7.3891, 20.0855])

y.grad_fn = <ExpBackward0>

Line 9: Need a scalar to backprop from

PyTorch's .backward() expects a scalar. We use .sum() because the derivative of a sum is the sum of the derivatives, and each term contributes exactly its own gradient to its own x[i] — no cross-talk between coordinates.

Line 10: Why .sum() is safe here

For y_sum = y_0 + y_1 + ... + y_n, partial derivative ∂y_sum/∂x_i equals ∂y_i/∂x_i (the others don't depend on x_i). So x.grad[i] will end up holding exactly d/dx_i e^{x_i} = e^{x_i}.

Line 11: Form the scalar

y_sum is now a 0-dimensional tensor (a single number). It still carries grad history — PyTorch remembers it came from summing six exp(...) calls.

EXECUTION STATE

y_sum = tensor(33.2095, grad_fn=<SumBackward0>)

Line 13: Trigger backpropagation

y_sum.backward() walks the computation graph in reverse, applying the chain rule. When it reaches x it deposits the accumulated gradient into x.grad. After this single call, every x[i] has its derivative computed.

Line 14: .backward() vs torch.autograd.grad()

.backward() writes gradients into the .grad attribute of every leaf tensor that participated. torch.autograd.grad(y_sum, x) would instead return them as a tuple without mutating .grad. Both are valid; we use .backward() because we only have one input tensor and the side-effect style is cleaner here.

EXECUTION STATE

x.grad = tensor([0.3679, 1.0000, 1.6487, 2.7183, 7.3891, 20.0855])

Line 16: Compare to the analytic answer

Now we print three lines side by side: the inputs x, the exponentials e^x, and PyTorch's computed gradient. The claim of the section is x.grad should equal e^x exactly. Read the next three lines as the experimental test of that claim.

Line 17: Print x

.detach() returns a tensor that shares the same data but is cut off from the autograd graph; .tolist() converts it to a plain Python list for readable printing. Output: [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0].

Line 18: Print e^x

These are the function values themselves. We will compare them to the gradient line below.

EXAMPLE

[0.3679, 1.0, 1.6487, 2.7183, 7.3891, 20.0855] (shown rounded to four decimals; .tolist() returns the full float values)

Line 19: Print x.grad

These are PyTorch's gradient values. They should be identical to the e^x line above, and they are: [0.3679, 1.0, 1.6487, 2.7183, 7.3891, 20.0855] (again rounded to four decimals). This confirms that PyTorch's autograd computes d/dx e^x = e^x.

Line 20: Final assertion

torch.allclose returns True if two tensors are element-wise equal within a small tolerance. We print 'matches? : True'. Two independent computations — one hand-derived (e^x), one from a general-purpose autograd engine — agree.

EXAMPLE

matches? : True


import torch

# Build a tensor x with several test points and ask PyTorch to track gradients.
x = torch.tensor([-1.0, 0.0, 0.5, 1.0, 2.0, 3.0], requires_grad=True)

# Compute y = e^x element-wise.
y = torch.exp(x)

# Sum so we have a scalar to backpropagate through.
# For a sum, d(y_sum)/dx_i = d(e^{x_i})/dx_i = e^{x_i}.
y_sum = y.sum()

# Trigger autograd. After this call, x.grad[i] holds d(y_sum)/dx_i.
y_sum.backward()

# Compare PyTorch's autograd to the analytic answer e^x.
print("x        :", x.detach().tolist())
print("e^x      :", y.detach().tolist())
print("x.grad   :", x.grad.tolist())
print("matches? :", torch.allclose(x.grad, torch.exp(x.detach())))

The final line prints $\text{matches?} \,:\, \text{True}$. Two computations, a hand derivation following the limit definition and a general-purpose automatic-differentiation engine, agree to within the default tolerance of torch.allclose.

Why this matters for deep learning

Inside every neural network, the chain rule has to propagate gradients through hundreds or thousands of operations. Every time an exponential appears (softmax, sigmoid via $\sigma(x) = 1/(1 + e^{-x})$ , attention weights), the framework needs $d/dx\, e^x = e^x$ as a primitive. The identity $\text{forward output} = \text{backward gradient}$ makes those exponentials computationally cheap — you reuse the cached forward value instead of recomputing.
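The "reuse the cached forward value" idea can be illustrated without any framework. The class below is a toy forward/backward pair written for this explanation; it is not how PyTorch implements `exp` internally, and the name `ExpNode` is hypothetical.

```python
import math

class ExpNode:
    """Toy forward/backward pair for y = exp(x); not PyTorch's actual implementation."""

    def forward(self, x):
        self.y = math.exp(x)   # cache the forward output
        return self.y

    def backward(self, upstream_grad):
        # d/dx e^x = e^x, so the local gradient is the cached output itself.
        return upstream_grad * self.y

node = ExpNode()
for x in [-1.0, 0.0, 2.0]:
    y = node.forward(x)
    grad = node.backward(1.0)
    print(f"x = {x:>4}  forward = {y:.6f}  backward = {grad:.6f}")
```

The backward pass never calls `math.exp` again; one multiplication by the stored output is all it needs.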

Common Mistakes

Mistake 1: Applying the Power Rule to e^x

Wrong: $\dfrac{d}{dx}\, e^x = x \, e^{x-1}$

Correct: $\dfrac{d}{dx}\, e^x = e^x$

The Power Rule $\dfrac{d}{dx}\, x^n = n\, x^{n-1}$ applies when the variable is in the base. Here, the variable is in the exponent, which calls for an entirely different rule.

Mistake 2: Forgetting the chain rule for e^(kx)

Wrong: $\dfrac{d}{dx}\, e^{5x} = e^{5x}$

Correct: $\dfrac{d}{dx}\, e^{5x} = 5\, e^{5x}$

The factor of $5$ comes from the inner derivative $\dfrac{d}{dx}(5x) = 5$ . Only when the exponent is exactly $x$ does the chain-rule factor equal 1.

Mistake 3: Confusing e^x with general a^x

Wrong: $\dfrac{d}{dx}\, 2^x = 2^x$

Correct: $\dfrac{d}{dx}\, 2^x = \ln(2) \cdot 2^x \approx 0.693 \cdot 2^x$

The function-equals-derivative property is unique to base $e$ . Every other base picks up a factor of $\ln(a)$ .

Mistake 4: Treating e as a variable

Wrong: $\dfrac{d}{de}\, e^x = x\, e^{x-1}$

Correct: $e$ is the constant 2.71828…, not a variable. The expression $\dfrac{d}{de}$ is meaningless. Only $x$ varies.
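A quick numerical check is a reliable way to catch the first three mistakes. The sketch below measures each derivative with a symmetric difference quotient at the arbitrary point $x = 1.5$ and prints it next to the wrong and the correct formula; the helper name and step size are illustrative.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central difference estimate of f'(x); h is an illustrative choice."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
checks = [
    # (name, function, wrong formula, correct formula)
    ("e^x",    lambda t: math.exp(t),     x * math.exp(x - 1), math.exp(x)),
    ("e^(5x)", lambda t: math.exp(5 * t), math.exp(5 * x),     5 * math.exp(5 * x)),
    ("2^x",    lambda t: 2.0 ** t,        2.0 ** x,            math.log(2) * 2.0 ** x),
]
for name, f, wrong, right in checks:
    measured = numeric_derivative(f, x)
    print(f"{name:<7} measured = {measured:12.5f}  wrong = {wrong:12.5f}  correct = {right:12.5f}")
```

In every row the measured value sits on the "correct" column and visibly away from the "wrong" one.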

Summary

Concept	Statement
The headline rule	d/dx e^x = e^x
The special limit	lim h→0 (e^h − 1) / h = 1
General base a	d/dx a^x = ln(a) · a^x
Chain rule version	d/dx e^(kx) = k · e^(kx)
Geometric reading	At every point on y = e^x, slope of tangent = height of curve
Tangent x-intercept	Tangent at (a, e^a) crosses x-axis at x = a − 1 (always 1 unit left)
Differential-equation form	y' = y has solution y = C · e^x

One sentence to take away

$e$ is the unique base for which the exponential function $e^x$ coincides with its own derivative — a property born from the magical limit $(e^h - 1)/h \to 1$ , which in turn is baked into the very definition of $e$ .

In the next section we generalize: what is the derivative of a general exponential $a^x$ , and how does $\ln(a)$ enter the formula? Spoiler — we already derived it above. We'll just put it to work.