Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this short section, you will be able to:

State the rule $\frac{d}{dx}\,a^{x} = (\ln a)\,a^{x}$ from memory.
Derive it two ways: rewriting through $e$ with the chain rule, and directly from the limit definition.
See visually why $\ln a$ is the only constant that can possibly appear there.
Use the rule to find tangent lines and instantaneous growth rates for any positive base.
Verify the rule both in plain Python and with PyTorch's automatic differentiation.

The Question We Are Forced to Ask

In the previous section we discovered a small miracle: the derivative of $e^{x}$ is $e^{x}$ itself. That is unusually clean — it almost feels like cheating. But the world does not only use base $e$ .

Bacteria double: the natural base of the model is $2$ , not $e$ .
Earthquakes and pH live on a base- $10$ scale.
Computer memory grows by halves and doubles — bases $\tfrac{1}{2}$ and $2$ .
Radioactive decay is naturally modeled with bases like $(1/2)$ (the half-life view).

So the honest question is: how does the slope of $a^{x}$ behave for a general positive base $a > 0,\ a \neq 1$ ? We are about to discover that the answer is almost as clean as the case $a = e$ , just with a single extra multiplicative fingerprint of the base: the number $\ln a$ .

The headline result

\frac{d}{dx}\bigl(a^{x}\bigr) \;=\; (\ln a)\,\cdot\,a^{x}

Read it out loud: “the derivative of $a^{x}$ is $\ln a$ times $a^{x}$ .” The whole section exists to make that line feel inevitable.

Intuition: Every a^x Is a Stretched e^x

Here is the picture you should carry in your head before any algebra: every exponential function $a^{x}$ is secretly the function $e^{x}$ that has been horizontally stretched. If $a > e$ the curve is squeezed tighter (it grows faster); if $1 < a < e$ it is stretched out (it grows slower); if $0 < a < 1$ it is also mirrored left to right, so it decays. The stretch factor is precisely the number $\ln a$.

Why should a horizontal stretch matter for the slope? Because the chain rule says: when you compress the input axis by a factor $k$, every slope on the graph gets multiplied by $k$. Here $k = \ln a$, and the slope of $e^{u}$ at the corresponding point $u = (\ln a)\,x$ is $e^{(\ln a)\,x} = a^{x}$. Multiply the two and you get $(\ln a)\cdot a^{x}$. That is the whole argument in one sentence; everything below just makes it rigorous.

Mental model: the function $a^{x}$ is “$e^{x}$ with the clock running at speed $\ln a$.” The slope picks up that clock speed exactly, and nothing else.

Rewriting a^x Through e

Any positive number $a$ can be written as a power of $e$ , because the natural log is defined to undo $e^{(\cdot)}$ :

a = e^{\ln a} \quad\Longrightarrow\quad a^{x} = \bigl(e^{\ln a}\bigr)^{x} = e^{(\ln a)\,x}.

That single identity is the bridge from the previous section to this one. It says: every exponential lives inside the exponential function. The base $a$ hides inside the exponent as the constant multiplier $\ln a$ .

Why this rewrite is legal

We are using two facts: (1) the definition of the natural log, $e^{\ln a} = a$ for $a > 0$; and (2) the power-of-a-power rule $(b^{p})^{q} = b^{pq}$. Both hold for every positive base and all real exponents, so the rewrite is valid for every $a > 0$ and every real $x$.
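The rewrite can also be sanity-checked numerically. The sketch below is only an illustration: the helper name `via_e` and the sample bases and exponents are our own choices, not part of the original text. It compares `a ** x` with `exp(ln(a) * x)` for bases on both sides of 1 and of $e$.

```python
import math

def via_e(a: float, x: float) -> float:
    """Evaluate a**x by rewriting the base through e: a**x = exp(ln(a) * x)."""
    return math.exp(math.log(a) * x)

# Sample bases on both sides of 1 and e; exponents negative, zero, fractional, positive.
for a in (0.5, 2.0, math.e, 10.0):
    for x in (-1.5, 0.0, 0.3, 3.0):
        direct = a ** x
        rewritten = via_e(a, x)
        # The two routes should agree up to floating-point rounding.
        assert math.isclose(direct, rewritten, rel_tol=1e-12)
        print(f"a={a:<7.4f} x={x:>5.2f}  a**x={direct:.10f}  exp(ln(a)*x)={rewritten:.10f}")
```

Agreement here is evidence for the identity, not a proof of it; the proof is the two facts listed above.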

Derivation by the Chain Rule

Now use the chain rule on $a^{x} = e^{(\ln a)\,x}$ . Let $u(x) = (\ln a)\,x$ , so that $a^{x} = e^{u(x)}$ .

\frac{d}{dx}\,a^{x} = \frac{d}{dx}\,e^{u(x)} = e^{u(x)}\cdot u'(x)

= e^{(\ln a)\,x}\cdot (\ln a)

= (\ln a)\,\cdot\, a^{x}.

∎

Three short lines, and we have the rule for every base at once. The outer derivative gave us back the function (because that is what $e^{x}$ does), and the inner derivative pulled the constant $\ln a$ down. The constant is the fingerprint of the base — everything else is the function itself.

Why this is the clean way: we did not have to re-prove anything. We borrowed the previous section's identity

(e^{x})' = e^{x}

and the chain rule from chapter 4. New facts in this section: zero. New consequences: all exponentials.

Derivation from the Limit Definition

For readers who want to see this without invoking the chain rule, here is the same conclusion straight from the difference quotient. At any point $x$ ,

\frac{d}{dx}\,a^{x} = \lim_{h\to 0}\frac{a^{x+h}-a^{x}}{h} = a^{x}\,\lim_{h\to 0}\frac{a^{h}-1}{h}.

All the $x$ -dependence factored out into the $a^{x}$ in front. Whatever the rule's multiplier turns out to be, it must be a constant that depends only on $a$ . Call it $L(a)$ :

L(a) \;=\; \lim_{h\to 0}\frac{a^{h}-1}{h}.

We claim $L(a) = \ln a$ . The fastest proof uses the rewrite we just did:

a^{h} - 1 = e^{(\ln a)\,h} - 1 = (\ln a)\,h + \tfrac{1}{2}\,(\ln a)^{2}\,h^{2} + \cdots \quad \text{(Taylor series for } e^{z}\text{)}

\frac{a^{h}-1}{h} = \ln a + \tfrac{1}{2}\,(\ln a)^{2}\,h + \cdots \;\xrightarrow[h\to 0]{}\; \ln a.

So $L(a) = \ln a$ and the rule reappears. Notice what the limit is telling us geometrically: the constant $\ln a$ is just the slope of $a^{x}$ at $x = 0$. Multiply that slope by $a^{x}$ and you get the slope at every other point.
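To watch the limit $L(a)$ settle onto $\ln a$, here is a minimal sketch that evaluates the difference quotient for shrinking $h$ and compares the error with the leading Taylor term $\tfrac{1}{2}(\ln a)^{2}h$. The function name and the list of step sizes are illustrative assumptions.

```python
import math

def slope_at_zero_estimate(a: float, h: float) -> float:
    """Difference quotient (a**h - 1) / h, the finite-h stand-in for L(a)."""
    return (a ** h - 1.0) / h

a = 2.0
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    estimate = slope_at_zero_estimate(a, h)
    error = estimate - math.log(a)
    # Leading term of the error according to the Taylor expansion above.
    predicted = 0.5 * math.log(a) ** 2 * h
    print(f"h={h:.0e}  estimate={estimate:.6f}  error={error:.3e}  leading term={predicted:.3e}")
```

The error shrinks roughly in proportion to $h$, which is what the $\tfrac{1}{2}(\ln a)^{2}h$ term predicts for a one-sided quotient.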

Interactive Visualization

Drag the base $a$ . The blue curve is $f(x) = a^{x}$ . The dashed red curve is $f'(x) = (\ln a)\,a^{x}$ . Watch the red curve as you slide $a$ through $e \approx 2.718$ : it passes through the blue curve. That moment is the only moment where $\ln a = 1$ , and it is exactly why $e$ is the “natural” base.

[Interactive explorer: d/dx (a^x) = (ln a) · a^x. Sliders set the base a and the point x; the readouts show ln a (the stretch factor), f(x) = a^x, and f'(x) = (ln a) · a^x. Sample readout at a = 2, x = 1: ln a = 0.6931, f(x) = 2.0000, f'(x) = 1.3863. Because 0 < ln a < 1 here, the derivative is a shrunken copy of the function.]

Three things to verify with the sliders:

For $a > e$ the dashed red derivative sits above the blue function — the slope is bigger than the value.
For $1 < a < e$ the red curve sits below the blue curve — the slope is a shrunken copy of the function.
For $0 < a < 1$ the constant $\ln a$ becomes negative; the red curve flips below the x-axis and the function decays.
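The three regimes above come down to the ratio $f'(x)/f(x)$, which the rule says is the constant $\ln a$. A small sketch (the helper name and the sample bases are our own assumptions) prints that ratio at several points for each base:

```python
import math

def slope_to_value_ratio(a: float, x: float) -> float:
    """Return f'(x) / f(x) for f(x) = a**x, using f'(x) = ln(a) * a**x."""
    value = a ** x
    slope = math.log(a) * value
    return slope / value

for a in (0.5, 2.0, math.e, 10.0):
    # The ratio should be the same at every x: it is just ln(a).
    ratios = [slope_to_value_ratio(a, x) for x in (-2.0, 0.0, 1.0, 3.0)]
    if ratios[0] < 0:
        regime = "derivative below the x-axis (decay)"
    elif ratios[0] < 1:
        regime = "derivative below the function"
    elif math.isclose(ratios[0], 1.0):
        regime = "derivative coincides with the function"
    else:
        regime = "derivative above the function"
    print(f"a={a:.4f}  ratios={[round(r, 6) for r in ratios]}  -> {regime}")
```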

Why the Multiplier Is ln a (Not Something Else)

It is worth pausing on the question that confuses every student the first time: why log? Why not square root, why not the base itself? Here is the cleanest answer the limit gives us.

We just saw $L(a)=\lim_{h\to 0}(a^{h}-1)/h$ is the slope of $a^{x}$ at the origin. Two properties pin it down completely:

It turns products into sums. The exponent rule $(ab)^{h} = a^{h}b^{h}$ gives, after a short calculation, $L(ab) = L(a) + L(b)$. A continuous function that turns products into sums is a constant multiple of a logarithm.
It equals $1$ at $a = e$ . Because $(e^{x})' = e^{x}$ , the slope of $e^{x}$ at the origin is $1$ , so $L(e) = 1$ .

There is exactly one continuous function on the positive reals satisfying both: the natural logarithm $\ln$ . So the multiplier was never a choice — it was forced.
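For completeness, here is one way to carry out the "short calculation" behind the product-to-sum property. It uses only the definition of $L$ and the fact that $a^{h} \to 1$ as $h \to 0$:

```latex
\begin{aligned}
(ab)^{h} - 1 &= a^{h}b^{h} - 1 = a^{h}\bigl(b^{h} - 1\bigr) + \bigl(a^{h} - 1\bigr), \\
\frac{(ab)^{h} - 1}{h} &= a^{h}\cdot\frac{b^{h} - 1}{h} + \frac{a^{h} - 1}{h}
\;\xrightarrow[h\to 0]{}\; 1\cdot L(b) + L(a).
\end{aligned}
```

Hence $L(ab) = L(a) + L(b)$ for all positive $a$ and $b$.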

Why ln a? The Hidden Limit Inside Every a^x

[Interactive figure: the slope of a^x at x = 0 is the limit of (a^h − 1) / h. The dashed orange curve plots the estimate (a^h − 1) / h against the base a; the solid blue curve plots ln a. Shrinking the step h folds the orange curve onto the blue one. Sample readout at a = 2, h = 0.1: true slope ln a = 0.693147, estimate (a^h − 1) / h = 0.717735, |error| = 2.459e-2.]

Slide $a$ until the dashed orange curve and the solid blue curve meet at $y = 1$ . The crossing point on the x-axis is, by construction, $a = e$ .

Worked Example: Doubling Time and Tangents

Let us do one full numerical walk-through that ties intuition, formula, and computation together.

Problem. A population grows according to $P(t) = 2^{t}$ , where $t$ is measured in days. (a) How fast is the population growing at $t = 3$ ? (b) Write the tangent line to $P$ at $t = 3$ . (c) Use that tangent line to estimate $P(3.1)$ , and compare to the true value.


Step 1 — Apply the rule. With $a = 2$ ,

P'(t) = (\ln 2)\,\cdot\,2^{t}.

Step 2 — Evaluate at $t = 3$ .

P'(3) = (\ln 2)\cdot 2^{3} = (\ln 2)\cdot 8.

Numerically $\ln 2 \approx 0.693147$ , so $P'(3) \approx 5.545177$ — the population is growing by roughly 5.5 individuals per day at $t = 3$ .

Step 3 — Tangent line. The point on the curve is $\bigl(3,\;P(3)\bigr) = (3,\,8)$ . So the tangent has the point-slope equation

L(t) = 8 + (\ln 2)\cdot 8\,(t - 3).

Step 4 — Linear estimate at $t = 3.1$ .

L(3.1) = 8 + (\ln 2)\cdot 8\cdot 0.1 \approx 8 + 0.554518 \approx 8.554518.

Step 5 — Compare. The true value is $P(3.1) = 2^{3.1} \approx 8.574188$. The tangent under-estimates by about $0.0197$, an error of roughly 0.23%. For such a fast-growing function over a tenth of a day, the local linear model is excellent. This is why engineers love tangent lines: they replace a hard exponential with a one-line multiplication and still land within a quarter of a percent of the true value.

Quantity	Symbolic	Numeric
Function value	P(3) = 2^3	8.000000
Rate of change	P'(3) = (ln 2)·8	5.545177
Tangent at 3.1	L(3.1)	8.554518
True P(3.1)	2^3.1	8.574188
Linear error	|P(3.1) − L(3.1)|	≈ 0.019670
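The hand computation can be reproduced in a few lines. This is a sketch with our own helper names (`P`, `P_prime`, `tangent_at`); it recomputes each row of the table.

```python
import math

def P(t: float) -> float:
    """Population model from the worked example: P(t) = 2**t."""
    return 2.0 ** t

def P_prime(t: float) -> float:
    """Rule of this section with a = 2: P'(t) = ln(2) * 2**t."""
    return math.log(2.0) * 2.0 ** t

def tangent_at(t0: float, t: float) -> float:
    """Point-slope tangent line through (t0, P(t0)), evaluated at t."""
    return P(t0) + P_prime(t0) * (t - t0)

t0, t1 = 3.0, 3.1
estimate = tangent_at(t0, t1)
true_value = P(t1)
error = true_value - estimate

print(f"P(3)      = {P(t0):.6f}")
print(f"P'(3)     = {P_prime(t0):.6f}")
print(f"L(3.1)    = {estimate:.6f}")
print(f"P(3.1)    = {true_value:.6f}")
print(f"error     = {error:.6f}  ({100 * error / true_value:.2f}% of the true value)")
```

The error is positive because $2^{t}$ is convex, so its tangent line lies below the curve.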

Shortcut Rules and Common Cases

Once you have $(a^{x})' = (\ln a)\,a^{x}$ , every related rule falls out by the chain rule. The ones worth memorising:

Function	Derivative	Why
a^x	(ln a) · a^x	this section
a^(kx)	k (ln a) · a^(kx)	chain rule with inner u = kx
a^(g(x))	(ln a) · a^(g(x)) · g'(x)	chain rule with inner g
e^x	e^x	special case ln e = 1
2^x	(ln 2) · 2^x ≈ 0.6931 · 2^x	a = 2
10^x	(ln 10) · 10^x ≈ 2.3026 · 10^x	a = 10
(1/2)^x	(ln 0.5) · (1/2)^x ≈ -0.6931 · (1/2)^x	decay base
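The chain-rule rows of the table can be cross-checked with a centered difference. In the sketch below, the base $a = 3$, the constant $k = 0.5$, the point $x_0 = 1.2$, and the inner function $g(x) = x^{2}$ are all illustrative choices of ours, not values from the text.

```python
import math

def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Centered difference quotient, an independent estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

a, k, x0 = 3.0, 0.5, 1.2   # illustrative values, not from the text

def f_linear(x: float) -> float:
    """Row a^(kx)."""
    return a ** (k * x)

def f_square(x: float) -> float:
    """Row a^(g(x)) with the hypothetical inner function g(x) = x**2."""
    return a ** (x * x)

# Table row: d/dx a^(kx) = k * ln(a) * a^(kx)
rule_linear = k * math.log(a) * a ** (k * x0)
# Table row: d/dx a^(g(x)) = ln(a) * a^(g(x)) * g'(x), with g'(x) = 2x
rule_square = math.log(a) * a ** (x0 * x0) * (2.0 * x0)

numeric_linear = central_difference(f_linear, x0)
numeric_square = central_difference(f_square, x0)

print(f"a^(kx):   rule={rule_linear:.8f}  numeric={numeric_linear:.8f}")
print(f"a^(x^2):  rule={rule_square:.8f}  numeric={numeric_square:.8f}")
```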

Do not confuse $a^{x}$ (exponential, variable in the exponent) with $x^{a}$ (power function, variable in the base). They are completely different rules:

$\dfrac{d}{dx}\,a^{x} = (\ln a)\,a^{x}$ (the exponential rule).
$\dfrac{d}{dx}\,x^{a} = a\,x^{a-1}$ (the power rule from chapter 4).

At $a = 2,\ x = 3$ the first gives $(\ln 2)\cdot 2^{3} \approx 5.545$ and the second gives $2\cdot 3^{1} = 6$. Same letters, different answers: the moving variable matters.

Python: Verifying the Rule Numerically

We will build intuition with plain Python before reaching for any framework. The goal is not to compute $(\ln a) \cdot a^{x}$: that is one line. The goal is to cross-check the formula against a brute-force numerical slope so we trust it empirically, not just symbolically.

Cross-checking d/dx (a^x) numerically

🐍general_exponential_derivative.py


Line 1: Why we import math

We need math.log for the natural logarithm ln. In Python, math.log(x) with a single argument is the *natural* log, not log base 10. That is exactly the function the derivative rule asks for.

EXAMPLE

math.log(math.e) == 1.0  # by definition

Line 3: The function signature has three knobs

a is the base (any positive number except 1), x is the point where we want the slope, and h is the tiny step we will use to *cross-check* the rule numerically. Defaulting h to 1e-6 gives at least six-digit agreement with the analytic answer for well-behaved x.

Line 14: f(x), the function value at the point

This is just a**x evaluated at the chosen x. We compute it once because the closed-form rule is built from it. For a=2, x=3 this is 8.

EXAMPLE

2 ** 3  ==  8

Line 15: f'(x), the closed-form derivative we are teaching

This is the single line that encodes the rule of this whole section: d/dx (a^x) = (ln a) · a^x. Notice it is literally the function value multiplied by the constant ln a. For a=2, x=3 that is ln(2) · 8 = 0.6931… · 8 ≈ 5.5452.

EXAMPLE

math.log(2) * 2**3  →  5.5451774444...

Line 16: Symmetric numerical slope, the witness

This is a centered difference quotient. For tiny h, [a^(x+h) − a^(x−h)] / (2h) approximates f'(x) with truncation error proportional to h². If the rule is correct, this number must match line 15 to many digits. It is our independent witness.

EXAMPLE

Centered ≈ true slope + O(h²)

Line 18: Return a labelled dictionary

We deliberately return all of a, x, f(x), ln a, the rule's answer, and the numerical answer, plus their absolute error. That lets the reader *see* the rule working, not just trust the formula.

Line 29: Test #1, base 2 at x = 3

Hand-check: f(3) = 2^3 = 8. f'(3) = ln(2) · 8 ≈ 0.693147 · 8 = 5.545177. The 'numerical' line should print 5.545177… with an abs error far below 1e-6. At h = 1e-6 the floor is set by floating-point round-off in the tiny difference, not by the h² truncation term.

Line 34: Test #2, base e (the magical case)

Here ln a = ln e = 1 *exactly*. The rule collapses to f'(x) = f(x). At x = 1 both f(x) and f'(x) print as 2.71828…: the function literally equals its own derivative. This is *why* e is the natural base.

Line 39: Test #3, decay base 1/2

ln(0.5) = −ln(2) ≈ −0.6931 is negative, so f'(x) = (ln 0.5) · (0.5)^x is negative everywhere. The function is decreasing, and the rule encodes that sign automatically. We did not have to special-case it.

EXAMPLE

f'(2) = ln(0.5) * 0.25  ≈  -0.1733


import math

def derivative_a_to_x(a: float, x: float, h: float = 1e-6) -> dict:
    """
    Compare three things at the same point:

      1. The closed-form derivative:   f'(x) = (ln a) * a**x
      2. A symmetric numerical slope:  [a**(x+h) - a**(x-h)] / (2h)
      3. The function value itself:    f(x) = a**x

    The rule says (1) = (2) up to floating-point noise, and (1) equals (3)
    only when ln a == 1, i.e. exactly when a == e.
    """
    f_x        = a ** x                                  # the function value
    f_prime    = math.log(a) * f_x                       # closed-form rule
    f_numeric  = (a ** (x + h) - a ** (x - h)) / (2 * h) # symmetric estimate

    return {
        "a":          a,
        "x":          x,
        "f(x)":       f_x,
        "ln(a)":      math.log(a),
        "f'(x) rule": f_prime,
        "numerical":  f_numeric,
        "abs error":  abs(f_prime - f_numeric),
    }


# 1) f(x) = 2^x at x = 3.  f(3) = 8, f'(3) = ln(2) * 8 ≈ 5.5452
print("--- 2^x at x = 3 ---")
for k, v in derivative_a_to_x(2.0, 3.0).items():
    print(f"  {k:>10}: {v}")

# 2) f(x) = e^x at x = 1.  Here ln(a) = 1, so f'(x) = f(x).
print("\n--- e^x at x = 1 ---")
for k, v in derivative_a_to_x(math.e, 1.0).items():
    print(f"  {k:>10}: {v}")

# 3) f(x) = (1/2)^x at x = 2.  Decay: ln(a) = -ln(2) < 0, so slope is negative.
print("\n--- (1/2)^x at x = 2 ---")
for k, v in derivative_a_to_x(0.5, 2.0).items():
    print(f"  {k:>10}: {v}")

What you should see when you run this

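For $a=2,\ x=3$: f'(x) rule ≈ 5.545177, numerical ≈ 5.545177, and an abs error far below 1e-6 (round-off in the tiny difference, not the $h^{2}$ truncation term, sets the floor at this step size).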
For $a=e,\ x=1$ : f(x), f'(x), and the numerical estimate all print 2.71828… — the function is its derivative.
For $a=1/2,\ x=2$ : the derivative is negative, around $-0.1733$ . The sign comes for free from $\ln(1/2) < 0$ .
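How small the absolute error gets depends on the step $h$. The sketch below sweeps $h$ for $a = 2,\ x = 3$; the helper name and the list of step sizes are our own choices. The truncation error of the centered quotient shrinks like $h^{2}$, while floating-point round-off grows roughly like $1/h$, so the error first falls and then rises again. Exact figures vary by platform, so none are quoted here.

```python
import math

def centered_slope(a: float, x: float, h: float) -> float:
    """Centered difference quotient for f(x) = a**x."""
    return (a ** (x + h) - a ** (x - h)) / (2.0 * h)

a, x = 2.0, 3.0
exact = math.log(a) * a ** x   # the rule: (ln a) * a**x

errors = {}
for h in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-8, 1e-10):
    errors[h] = abs(centered_slope(a, x, h) - exact)
    print(f"h={h:.0e}  abs error={errors[h]:.3e}")
```

The default `h = 1e-6` in the listing above is therefore a reasonable compromise rather than a magic number.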

PyTorch: Autograd Knows the Same Rule

Now let us check the same identity with the tool that powers modern deep learning. PyTorch's autograd does not know any rule by name — it composes elementary derivatives at runtime. So if $(a^{x})' = (\ln a)\,a^{x}$ were wrong, every model that ever used a learnable exponent would be wrong too. Spoiler: it isn't.

PyTorch confirms d/dx (a^x) = (ln a) a^x

🐍autograd_check.py


Line 1: Why PyTorch for a calculus rule?

Autograd is a derivative engine. Every neural-network gradient you have ever heard of is the chain rule applied automatically. If we wire up f = a^x and call .backward(), PyTorch will hand us the number our rule predicts, up to floating-point precision. That makes it a clean cross-check.

Lines 3-4: Pick a base that is NOT e

We use a = 2 so the multiplier ln a ≈ 0.693 is clearly different from 1. If we used a = e the rule would collapse to f'(x) = f(x) and the multiplier would be invisible: interesting but less convincing as a test.

Line 5: requires_grad=True on x

Autograd only tracks tensors that ask to be tracked. x is the variable we are differentiating with respect to, so it is the one that needs requires_grad=True. a is a constant here.

Line 7: f = a ** x, the forward pass

PyTorch computes 2^3 = 8 and silently records the operation in a graph. The graph remembers 'this output came from a power with base a and exponent x' so it can replay the derivative later.

Line 8: f.backward(), the rule fires

This is the moment of truth. Backward applies the same identity we proved on paper: d/dx (a^x) = (ln a) · a^x. The result is deposited into x.grad. We never typed the formula ourselves; autograd supplied it from the derivative it associates with the power operation.

Lines 10-11: Compute the analytic answer by hand

torch.log is the natural log. We detach x because we only want the *value* 2^3, not a node in the graph. The result is (ln 2) · 8 ≈ 5.5452, the very number the rule predicts.

EXAMPLE

(ln 2) * 2**3 ≈ 0.6931 * 8 ≈ 5.5452

Lines 13-16: Print to compare

Both x.grad and analytic should match to roughly 1e-7 in relative terms (float32 precision). If they didn't, *either* our rule is wrong *or* PyTorch's autograd is, and after a million users, autograd is not wrong.

Lines 18-22: Same operation, different variable

Now we make a the leaf with requires_grad=True and treat x as a constant. The expression a^x is now viewed as a *power* function in a, so its derivative follows the power rule from chapter 4: x · a^(x-1). Same Python code, completely different rule. The role of the variable matters.

EXAMPLE

d/da (a^3) at a=2 = 3 * 2^2 = 12

Lines 23-25: The numeric check

AD gives 12.0, and x · a^(x-1) also gives 12.0. This drives the deepest point home: 'derivative of a^x' is *ambiguous* until you say which letter is varying. Section 5.2 fixes that letter to be x.


import torch

# We deliberately pick a base a != e to make the multiplier (ln a) visible.
a = torch.tensor(2.0)
x = torch.tensor(3.0, requires_grad=True)   # x must be a leaf with grad

f = a ** x                                  # f = 2^x at x = 3 -> 8.0
f.backward()                                # ask autograd for df/dx

# Closed-form answer for comparison
analytic = torch.log(a) * a ** x.detach()   # (ln 2) * 2^3 = 5.5451...

print(f"f(x)         = {f.item():.6f}")     # 8.000000
print(f"x.grad (AD)  = {x.grad.item():.6f}")  # 5.545177
print(f"analytic     = {analytic.item():.6f}")
print(f"abs error    = {abs(x.grad.item() - analytic.item()):.2e}")

# A second example: gradient w.r.t. the BASE a, not x.
a2 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0)
g  = a2 ** x2                               # g = a^x as a function of a
g.backward()
# Power rule (in a, treating x as constant): dg/da = x * a^(x-1)
print(f"\ndg/da (AD)   = {a2.grad.item():.6f}")  # 3 * 2^2 = 12.0
print(f"x * a^(x-1)  = {(x2 * a2.detach() ** (x2 - 1)).item():.6f}")

Two derivatives, one expression: the same Python line a ** x has two completely different derivatives depending on which leaf carries requires_grad=True. With respect to $x$ we get our new exponential rule; with respect to $a$ we get the power rule. That is the cleanest mental check that “exponential” and “power” really are distinct families of functions.
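To make "composing elementary derivatives" concrete without any framework, here is a toy forward-mode sketch using dual numbers. It is a teaching device of our own (the `Dual` class and helper names are hypothetical) and is not how PyTorch is implemented, since PyTorch uses reverse-mode differentiation. The engine only knows the rules for scaling by a constant and for $e^{u}$; the factor $\ln a$ emerges from composing them through $a^{x} = e^{(\ln a)\,x}$.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    value: float
    deriv: float

def scale(c: float, u: Dual) -> Dual:
    """Constant multiple: (c * u)' = c * u'."""
    return Dual(c * u.value, c * u.deriv)

def exp(u: Dual) -> Dual:
    """Exponential: (e^u)' = e^u * u'."""
    e = math.exp(u.value)
    return Dual(e, e * u.deriv)

def a_to_the_x(a: float, x: Dual) -> Dual:
    """Build a**x as exp(ln(a) * x); no a**x rule is stored anywhere."""
    return exp(scale(math.log(a), x))

x = Dual(3.0, 1.0)            # seed: dx/dx = 1
result = a_to_the_x(2.0, x)

print(f"value      = {result.value:.6f}")
print(f"derivative = {result.deriv:.6f}")
print(f"rule       = {math.log(2.0) * 2.0 ** 3.0:.6f}")
```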

Where This Rule Lives in the Real World

1. Doubling and halving processes

Anywhere a quantity doubles in a fixed window — cell division, Moore's law, viral spread — the model is $N(t) = N_{0}\,2^{t/T}$ for some doubling time $T$ . Its instantaneous growth rate is

N'(t) = \frac{\ln 2}{T}\,N(t).

The constant $(\ln 2)/T$ is what epidemiologists call the growth rate. It is the section's rule with the base 2 and the chain rule applied to $t/T$ .

2. Half-life of radioactive isotopes

Carbon-14 decays as $N(t) = N_{0}\,(1/2)^{t/T_{1/2}}$ . Then

N'(t) = -\frac{\ln 2}{T_{1/2}}\,N(t).

The negative sign comes straight out of $\ln(1/2) = -\ln 2$ . The rule encoded decay and growth with the same formula — the sign of the slope is the sign of $\ln a$ .
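Both the doubling and the half-life models reduce to one rate constant, $r = (\ln a)/\text{period}$. The sketch below uses our own function names, a hypothetical doubling time of 3 days, and a carbon-14 half-life taken to be about 5730 years (an approximate, commonly quoted figure). It also cross-checks the growth case against a centered difference.

```python
import math

def rate_constant(base: float, period: float) -> float:
    """r such that N'(t) = r * N(t) when N(t) = N0 * base**(t / period)."""
    return math.log(base) / period

def quantity(t: float, n0: float, base: float, period: float) -> float:
    """N(t) = N0 * base**(t / period)."""
    return n0 * base ** (t / period)

# Hypothetical doubling process with doubling time T = 3 days.
r_growth = rate_constant(2.0, 3.0)       # (ln 2) / T, per day

# Carbon-14, assuming a half-life of about 5730 years.
r_decay = rate_constant(0.5, 5730.0)     # -(ln 2) / T_half, per year

# Cross-check the growth case at t = 6 days with a centered difference.
t, h, n0 = 6.0, 1e-6, 100.0
numeric = (quantity(t + h, n0, 2.0, 3.0) - quantity(t - h, n0, 2.0, 3.0)) / (2 * h)
analytic = r_growth * quantity(t, n0, 2.0, 3.0)

print(f"growth rate constant = {r_growth:.6f} per day")
print(f"decay rate constant  = {r_decay:.6e} per year")
print(f"N'(6): analytic={analytic:.6f}  numeric={numeric:.6f}")
```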

3. Earthquakes, decibels, pH — base-10 scales

The seismic moment magnitude scale is base-10. A model of energy release like $E(M) = E_{0}\cdot 10^{1.5 M}$ has

\frac{dE}{dM} = 1.5\,(\ln 10)\,E_{0}\cdot 10^{1.5 M} \approx 3.4539\,E(M).

Going up one magnitude multiplies the energy by $10^{1.5} \approx 31.6$ ; the derivative tells us the local sensitivity at every magnitude.
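The two numbers in this example, the per-magnitude energy ratio and the local sensitivity constant, are quick to confirm. In this sketch $E_{0} = 1$ is a placeholder (the text does not give a value), and the function names are our own.

```python
import math

# Local sensitivity: (dE/dM) / E(M) = 1.5 * ln(10)
SENSITIVITY = 1.5 * math.log(10.0)

def energy(M: float, E0: float = 1.0) -> float:
    """Model from the text: E(M) = E0 * 10**(1.5 * M). E0 = 1 is a placeholder."""
    return E0 * 10.0 ** (1.5 * M)

def energy_slope(M: float, E0: float = 1.0) -> float:
    """Rule with the chain rule on 1.5 * M: dE/dM = 1.5 * ln(10) * E(M)."""
    return SENSITIVITY * energy(M, E0)

# One full magnitude step multiplies the energy by 10**1.5.
step_ratio = energy(6.0) / energy(5.0)

print(f"1.5 * ln(10)     = {SENSITIVITY:.4f}")
print(f"E(6) / E(5)      = {step_ratio:.4f}")
print(f"dE/dM over E(M)  = {energy_slope(5.0) / energy(5.0):.4f}")
```

Note the difference between the two: about 31.6 is the factor over a whole unit step, while about 3.45 is the instantaneous relative rate per unit of magnitude.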

4. Machine learning: learnable bases and temperatures

Temperature-scaled softmax can be written as $p_{i} \propto a^{z_{i}}$ with base $a = e^{1/T}$, so the base is a tunable knob and $\ln a = 1/T$ is the inverse temperature. Backpropagating to the logits $z_{i}$ relies on exactly this rule: each factor $a^{z_{i}}$ contributes $(\ln a)\,a^{z_{i}}$. Likewise, any layer that uses a base other than $e$ (rare but possible in custom architectures) needs $(\ln a)\,a^{x}$ to propagate gradients correctly.
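One way to read $p_{i} \propto a^{z_{i}}$ is as an ordinary softmax with temperature $T = 1/\ln a$, because $a^{z} = e^{z \ln a}$. That correspondence is our own reading and assumes $a > 1$, so that $T$ is positive. The sketch below checks it in plain Python; the function names, logits, and base are illustrative.

```python
import math

def softmax_with_base(z, a):
    """Normalize the weights a**z_i so they sum to 1 (requires a > 0)."""
    weights = [a ** zi for zi in z]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_with_temperature(z, T):
    """Standard base-e softmax applied to z / T."""
    weights = [math.exp(zi / T) for zi in z]
    total = sum(weights)
    return [w / total for w in weights]

logits = [1.0, 2.0, 0.5]     # illustrative values
a = 2.0                      # illustrative base, a > 1
T = 1.0 / math.log(a)        # matching temperature, since a**z = exp(z * ln a)

p_base = softmax_with_base(logits, a)
p_temp = softmax_with_temperature(logits, T)

for pb, pt in zip(p_base, p_temp):
    print(f"base-a: {pb:.6f}   temperature: {pt:.6f}")
```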

Common Mistakes

Dropping the $\ln a$ . The most frequent error is writing $(2^{x})' = 2^{x}$ . That is true only for $a = e$ . For every other base the fingerprint $\ln a$ must be there.
Using $\log_{10}$ instead of $\ln$ . The rule is $(a^{x})' = (\ln a)\,a^{x}$ , with the natural log. Using base 10 will make every answer off by a factor of $\ln 10 \approx 2.3026$ .
Confusing exponential and power rules. When the variable is in the exponent, use the exponential rule; when the variable is in the base, use the power rule. The previous PyTorch example showed both rules applied to the same code — the only difference was which tensor required grad.
Forgetting the chain rule on $a^{g(x)}$ . The derivative is $(\ln a)\,a^{g(x)}\,g'(x)$ , not just $(\ln a)\,a^{g(x)}$ . People often nail the first factor and then drop the inner derivative.
Treating $0 < a < 1$ as a special case. It is not. The same formula handles decay automatically because $\ln a$ is negative.
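The second mistake in the list is easy to demonstrate. This short sketch (the values $a = 2,\ x = 3$ are reused from the worked examples) computes the slope with `math.log` and with `math.log10` and shows that the two differ by the factor $\ln 10$.

```python
import math

a, x = 2.0, 3.0
correct = math.log(a) * a ** x       # natural log: the rule of this section
mistaken = math.log10(a) * a ** x    # base-10 log: the common slip

ratio = correct / mistaken           # should equal ln(10)

print(f"with ln     : {correct:.6f}")
print(f"with log10  : {mistaken:.6f}")
print(f"ratio       : {ratio:.6f}   ln(10) = {math.log(10.0):.6f}")
```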

Summary

The Derivative of a^x in one line

\frac{d}{dx}\,a^{x} \;=\; (\ln a)\,\cdot\,a^{x},\qquad a > 0,\ a \neq 1.

Key takeaways

Every exponential rewrites as $a^{x} = e^{(\ln a)\,x}$ . That single identity reduces this entire section to the chain rule.
The constant $\ln a$ is the slope of $a^{x}$ at the origin and the fingerprint of the base.
The base $e$ is “natural” precisely because $\ln e = 1$ makes the multiplier disappear — nothing more, nothing less.
Sign of growth, rate of decay, and instantaneous sensitivity to the input are all encoded in that one constant $\ln a$ .
The rule is consistent with both elementary numerical slopes (plain Python) and full automatic differentiation (PyTorch).

Coming next: we invert the picture and ask, what is the derivative of $\ln x$? The answer is going to be startlingly simple, and the “$\ln a$” we just discovered is no accident.