Learning Objectives
By the end of this section you will be able to:
- Derive the formula $\frac{d}{dx}\log_a x = \frac{1}{x\ln a}$ from scratch using the change-of-base identity, with no memorization required.
- See visually why a bigger base gives a flatter curve, and why the slope formula has $\ln a$ in the denominator.
- Apply the chain rule to differentiate $\log_a(u(x))$ for any inner function $u(x)$.
- Confirm the formula in two ways: a hand-traced Python finite-difference check, then PyTorch autograd.
- Recognize why $e$ is the single base that calculus prefers, and how it connects to information theory and machine learning.
The Question We Are Answering
We already know two important derivatives. The exponential $e^x$ is its own derivative, and the natural logarithm satisfies $\frac{d}{dx}\ln x = \frac{1}{x}$. Both formulas are beautiful precisely because the base is $e$. But the real world does not always speak in base $e$:
- Engineers reason about decibels and the Richter scale in base 10.
- Computer scientists count bits in base 2 — one binary digit per factor of 2.
- Chemists measure pH as $-\log_{10}[\mathrm{H}^+]$.
- Statisticians compute log-likelihoods in whichever base is convenient for the experiment.
So we want a derivative formula that works for $\log_a x$, the logarithm to any positive base $a \neq 1$. The intuition we will build is short and physical: changing the base just rescales the y-axis of the natural log, and that vertical rescaling rides through the derivative as a constant factor.
The one-sentence goal: express $\log_a x$ in terms of $\ln x$, then borrow the natural-log derivative we already know.
The Change-of-Base Trick
From any base back to the natural log
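The change-of-base identity is the bridge we need. By definition, $y = \log_a x$ means $a^y = x$. Take the natural log of both sides:
$$\ln(a^y) = \ln x \quad\Longrightarrow\quad y\,\ln a = \ln x \quad\Longrightarrow\quad y = \frac{\ln x}{\ln a}.$$
That is the identity we will use everywhere from now on:
$$\log_a x = \frac{\ln x}{\ln a}.$$
Look closely at the right-hand side. The variable $x$ lives only inside $\ln x$. The factor $\frac{1}{\ln a}$ is a constant: it does not depend on $x$ at all. So $\log_a x$ is literally a constant times $\ln x$.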
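The identity is easy to spot-check numerically. The following is a minimal sketch using only the standard library; the helper name `log_base` and the sample points are illustrative choices, not part of the original text.

```python
import math

def log_base(x: float, a: float) -> float:
    """Logarithm of x to base a, computed as ln(x) / ln(a)."""
    return math.log(x) / math.log(a)

# Compare against the library's dedicated base-2 and base-10 routines.
samples = (0.5, 8.0, 100.0)
for x in samples:
    assert math.isclose(log_base(x, 2.0), math.log2(x), rel_tol=1e-12, abs_tol=1e-12)
    assert math.isclose(log_base(x, 10.0), math.log10(x), rel_tol=1e-12, abs_tol=1e-12)

# 2**3 == 8, so the base-2 log of 8 should be 3 up to rounding.
print(log_base(8.0, 2.0))
```

Because the base enters only through the constant `math.log(a)`, the same helper works for any positive base other than 1.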
The visual meaning
Every $\log_a(x)$ is just $\ln(x)$ divided by $\ln(a)$
The green curve is $\ln(x)$. The yellow curve is $\log_a(x)$. Move the slider: the yellow curve is just the green curve scaled vertically by the factor $1/\ln(a)$ (squashed when $a > e$, stretched when $1 < a < e$). That same constant factor is exactly what multiplies the slope.
The Formula and Its Picture
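Differentiating both sides of $\log_a x = \frac{\ln x}{\ln a}$:
$$\frac{d}{dx}\log_a x = \frac{1}{\ln a}\cdot\frac{d}{dx}\ln x = \frac{1}{\ln a}\cdot\frac{1}{x}.$$
That collapses to the clean answer:
$$\frac{d}{dx}\log_a x = \frac{1}{x\ln a}.$$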
Three quick sanity checks confirm this is the right formula:
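| Special case | Plug into formula | Result |
|---|---|---|
| Natural log, $a = e$ | $\frac{1}{x\ln e} = \frac{1}{x\cdot 1}$ | $\frac{1}{x}$ ✓ |
| Common log, $a = 10$ | $\frac{1}{x\ln 10}$ | $\approx \frac{0.4343}{x}$ |
| Binary log, $a = 2$ | $\frac{1}{x\ln 2}$ | $\approx \frac{1.4427}{x}$ |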
The numbers you should always be able to recall
| Function | Derivative | Numerical factor in front of 1/x |
|---|---|---|
| $\ln x$ | $\frac{1}{x}$ | 1.0000 |
| $\log_2 x$ | $\frac{1}{x\ln 2}$ | 1.4427 |
| $\log_{10} x$ | $\frac{1}{x\ln 10}$ | 0.4343 |
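The factors in the last column are just $1/\ln a$ for each base. A short sketch that recomputes them (the dictionary name `factors` is an illustrative choice):

```python
import math

# Constant multiplying 1/x in the derivative of log_a(x), for each base.
bases = {"ln": math.e, "log2": 2.0, "log10": 10.0}
factors = {name: 1.0 / math.log(a) for name, a in bases.items()}

for name, k in factors.items():
    print(f"{name:>5}: {k:.4f}")
```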
Interactive: Drag the Slope
The picture is the best argument. In the visualization below, the orange dot is the point where you want the slope. The cyan dashed line is the actual tangent. The readouts confirm that the slope you see is exactly $\frac{1}{x\ln a}$. Play with three things:
- Drag the dot rightward. Watch the tangent flatten out: the slope shrinks like $1/x$.
- Crank up the base with the slider. The whole curve gets squashed and the slope number drops accordingly.
- Turn on "compare bases." The three classical curves all pass through $(1, 0)$ but separate immediately because each one is stretched by its own factor $1/\ln a$.
Slope of log2(x) at any point
Drag the orange point along the curve. The tangent line shows the instantaneous slope. Notice how the slope formula 1 / (x · ln a) gives a smaller number when x is larger or when the base a is larger.
Same x, different bases: slope = 1 / (x · ln a)
At a fixed point x, the slope shrinks as the base grows. The shape of the curve gets flatter; the input x is identical — only ln(a) changes in the denominator.
Slope at x = 3.00 is plotted as a horizontal bar for each base. Same x; only ln(a) changes.
Worked Example (Do-It-By-Hand)
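Work through this example with a pencil before you peek at the steps. We pick simple numbers so you can see the formula in motion.
Worked example: differentiate $f(x) = \log_2 x$ at $x = 8$
Step 1: rewrite using the change-of-base identity. $\log_2 x = \frac{\ln x}{\ln 2}$. The factor $\frac{1}{\ln 2}$ is a constant, not a function of $x$.
Step 2: pull the constant out. $f'(x) = \frac{1}{\ln 2}\cdot\frac{d}{dx}\ln x = \frac{1}{x\ln 2}$.
Step 3: plug in $x = 8$. $f'(8) = \frac{1}{8\ln 2}$.
Step 4: compute by hand. $\ln 2 \approx 0.6931$, so $8\ln 2 \approx 5.5452$ and $f'(8) \approx 0.1803$.
Step 5: geometric sanity check. $f(8) = \log_2 8 = 3$ (because $2^3 = 8$). One step to the right at $x = 8$ means we go from 8 to 9. The change in $\log_2 x$ should be approximately $f'(8)\cdot 1 \approx 0.1803$. Check it: $\log_2 9 - \log_2 8 \approx 3.1699 - 3 = 0.1699$. That under-estimates the tangent slope a bit because $\log_2$ is concave; the secant lies below the tangent. Halve the step to 0.5 and the secant slope rises to $\frac{\log_2 8.5 - 3}{0.5} \approx 0.1749$, closing in on the true slope $0.1803$. This is exactly the "limit of difference quotients" picture we saw in Chapter 4.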
Step 6 — second example, with the chain rule. Try . Set the inner function so . Then At this is
What you should take away: the entire derivation is "factor the base out, then it's just the derivative of $\ln x$." You never have to memorize a separate rule for each base.
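The secant-versus-tangent comparison for $\log_2 x$ at $x = 8$ can be replayed in a few lines. This is a minimal sketch; the step sizes beyond 1 and 0.5 are our own additions to show the trend.

```python
import math

x0 = 8.0
tangent = 1.0 / (x0 * math.log(2.0))  # analytic slope 1 / (x ln 2) at x = 8

# Forward secant slopes for shrinking steps h.
secants = {}
for h in (1.0, 0.5, 0.1, 0.01):
    secants[h] = (math.log2(x0 + h) - math.log2(x0)) / h
    print(f"h = {h:<5} secant = {secants[h]:.6f}   tangent = {tangent:.6f}")
```

Every secant slope sits below the tangent slope, as concavity predicts, and the gap closes as `h` shrinks.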
The Chain Rule Version
In practice you rarely meet a bare $\log_a x$. You usually meet $\log_a$ of something more complicated: a polynomial, a ratio, a trig function. The chain rule combines with our formula in the cleanest possible way:
$$\frac{d}{dx}\log_a u(x) = \frac{u'(x)}{u(x)\ln a}.$$
The recipe: divide $u'$ by $u$, then divide once more by $\ln a$. That extra $\ln a$ in the denominator is the only thing that changes when you switch bases.
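The recipe can be checked numerically. The sketch below is illustrative only: the three inner functions, bases, and evaluation points are our own choices (assumptions, not taken from the text), and `dlog_a_of_u` and `central_diff` are hypothetical helper names.

```python
import math

def dlog_a_of_u(u, du, a, x):
    """Chain-rule derivative of log_a(u(x)): u'(x) / (u(x) * ln a)."""
    return du(x) / (u(x) * math.log(a))

def central_diff(g, x, h=1e-6):
    """Symmetric difference quotient approximating g'(x)."""
    return (g(x + h) - g(x - h)) / (2.0 * h)

# (label, base a, inner function u, its derivative u', evaluation point x)
# All of these are hypothetical examples chosen for illustration.
cases = [
    ("log10(x^2 + 1)", 10.0, lambda x: x * x + 1.0, lambda x: 2.0 * x, 3.0),
    ("log2(sin x)", 2.0, math.sin, math.cos, 1.0),
    ("log5(3x + 5)", 5.0, lambda x: 3.0 * x + 5.0, lambda x: 3.0, 2.0),
]

results = []
for label, a, u, du, x in cases:
    analytic = dlog_a_of_u(u, du, a, x)
    numeric = central_diff(lambda t: math.log(u(t)) / math.log(a), x)
    results.append((label, analytic, numeric))
    print(f"{label:<16} analytic = {analytic:.8f}   numeric = {numeric:.8f}")
```

In each case the only base-dependent piece is the constant `math.log(a)`; the ratio `du(x) / u(x)` is the same as it would be for the natural log.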
Three worked applications of the chain rule
| Function | u(x) and u'(x) | Derivative |
|---|---|---|
Why ln Is the "Natural" Base for Calculus
Looking at the formula $\frac{d}{dx}\log_a x = \frac{1}{x\ln a}$, you can immediately see why the natural logarithm gets the adjective "natural":
Among all bases $a > 0$, $a \neq 1$, the base $a = e$ is the unique one whose derivative formula does not contain an extra constant. Every other base pays a $\frac{1}{\ln a}$ tax.
This is not a coincidence; it is the definition of $e$. The number $e$ was chosen precisely so that $\ln e = 1$, killing the constant factor. Calculus rewards the choice with the cleanest possible derivative.
In practice this means: if you have a choice, work in $\ln$. Convert to base 2 or base 10 only at the very end, when reporting the answer in "bits" or "decibels." Inside derivatives, gradients, optimizers, and ML training loops, $\ln$ is the right currency.
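As a small illustration of "work in $\ln$, convert at the end," the sketch below computes an entropy in nats and converts to bits with a single division by $\ln 2$. The probability vector is a hypothetical example, not data from the text.

```python
import math

# Hypothetical distribution over four outcomes (sums to 1).
p = [0.5, 0.25, 0.125, 0.125]

# Do the work with natural logs (units: nats).
h_nats = -sum(q * math.log(q) for q in p)

# Convert once, at the reporting stage (units: bits).
h_bits = h_nats / math.log(2.0)

# Cross-check against computing directly in base 2.
h_bits_direct = -sum(q * math.log2(q) for q in p)

print(f"{h_nats:.6f} nats = {h_bits:.6f} bits (direct: {h_bits_direct:.6f})")
```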
Plain Python: Verify the Formula Numerically
Mathematics tells us $\frac{d}{dx}\log_a x = \frac{1}{x\ln a}$. The computer is happy to check that, but it does not know about derivatives directly; it knows about finite differences. So we are going to compute the true slope two ways:
- Analytic: the closed-form formula we just derived, evaluated at our test points.
- Numeric: a difference quotient with a tiny step Δx, approximating the limit Δx → 0. If both numbers agree to many decimal places, the formula is right.
We will do this for three bases at three test points — nine comparisons total — in plain Python with no external libraries.
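A minimal sketch of such a check follows. The function names, the central-difference scheme, and the relative step size are our own assumptions, so the last column depends on those choices and on floating-point rounding and need not match any printed table digit for digit.

```python
import math

def analytic_slope(x: float, a: float) -> float:
    """Closed-form derivative of log_a(x): 1 / (x ln a)."""
    return 1.0 / (x * math.log(a))

def numeric_slope(x: float, a: float, rel_step: float = 1e-6) -> float:
    """Central difference quotient with a step proportional to x (assumed scheme)."""
    h = rel_step * x
    log_a = lambda t: math.log(t) / math.log(a)
    return (log_a(x + h) - log_a(x - h)) / (2.0 * h)

bases = (("2", 2.0), ("e", math.e), ("10", 10.0))
points = (0.5, 4.0, 100.0)

rows = []
print(f"{'base':>4} {'x':>8} {'analytic':>12} {'numeric':>12} {'|diff|':>10}")
for name, a in bases:
    for x in points:
        exact = analytic_slope(x, a)
        approx = numeric_slope(x, a)
        rows.append((name, x, exact, approx))
        print(f"{name:>4} {x:8.2f} {exact:12.8f} {approx:12.8f} {abs(exact - approx):10.2e}")
```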
Expected output, copy-pasteable:
base x analytic numeric |diff|
2 0.50 2.88539008 2.88539008 4.84e-11
2 4.00 0.36067376 0.36067376 6.04e-12
2 100.00 0.01442695 0.01442695 2.39e-13
e 0.50 2.00000000 2.00000000 3.33e-11
e 4.00 0.25000000 0.25000000 4.16e-12
e 100.00 0.01000000 0.01000000 1.67e-13
10 0.50 0.86858896 0.86858896 1.45e-11
10 4.00 0.10857362 0.10857362 1.85e-12
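10 100.00 0.00434294 0.00434294 7.20e-14
Read the table
The analytic and numeric columns agree in every printed digit, and the |diff| column never exceeds about 5e-11, so the formula holds at all nine base/point pairs.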
PyTorch: Autograd Confirms the Same Number
For machine-learning workflows, we never write the derivative by hand. PyTorch's autograd engine records every operation as we go and applies the chain rule in reverse. If we compute $y = \log_a x$ and call `y.backward()`, PyTorch will fill `x.grad` with $\frac{1}{x\ln a}$. Let's verify.
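A minimal sketch of that verification, assuming PyTorch is installed. The test points mirror the finite-difference check, and the dictionary names `log_fns` and `grads` are illustrative choices.

```python
import math
import torch

# Test points; float64 so the comparison is not limited by float32 precision.
x = torch.tensor([0.5, 4.0, 100.0], dtype=torch.float64, requires_grad=True)

# Built-in logs for the three classical bases, paired with the base value.
log_fns = {"2": (torch.log2, 2.0), "e": (torch.log, math.e), "10": (torch.log10, 10.0)}

grads = {}
for name, (fn, a) in log_fns.items():
    x.grad = None                  # clear the gradient left by the previous base
    y = fn(x)                      # elementwise log_a(x)
    y.sum().backward()             # d(sum)/dx_i equals d log_a(x_i)/dx_i
    grads[name] = x.grad.clone()
    expected = 1.0 / (x.detach() * math.log(a))
    print(f"base {name:>2}: autograd = {grads[name].tolist()}  formula = {expected.tolist()}")
```

Summing before calling `backward()` is a common way to get per-element derivatives when each output depends on only one input.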
Application: Information Theory and log2
A satisfying place to land is where $\log_2$ is the unquestioned king: information theory. Encoding one out of $N$ equally likely outcomes costs exactly $\log_2 N$ bits. The derivative we just derived tells you the marginal cost of making $N$ a little larger:
$$\frac{d}{dN}\log_2 N = \frac{1}{N\ln 2}.$$
Two things to feel here. First, doubling $N$ always adds exactly one bit, a discrete jump that the derivative averages out into a continuous slope. Second, that slope shrinks like $1/N$, so growing a code from $10^6$ to $10^6 + 1$ outcomes costs almost nothing, but growing from 2 to 3 costs more than half a bit ($\log_2 3 - \log_2 2 \approx 0.585$).
Why we care about d/dN [log2(N)]
In information theory, the number of bits needed to encode one of N equally likely outcomes is log2(N). The derivative tells us: doubling N adds one bit, but the marginal cost of adding one more option shrinks as N grows.
| situation | N | log₂(N) bits | d/dN [log₂N] |
|---|---|---|---|
| coin flip | 2 | 1.0000 | 7.21e-1 |
| die roll | 6 | 2.5850 | 2.40e-1 |
| letter of alphabet | 26 | 4.7004 | 5.55e-2 |
| kilobyte address | 1024 | 10.0000 | 1.41e-3 |
| millionth element of a sorted list | 1000000 | 19.9316 | 1.44e-6 |
Adding a 1,000,001st item to a million-item list costs only about 1.44 millionths of a bit. Information has diminishing returns, and the derivative of $\log_2$ is exactly that statement.
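The table can be regenerated directly from the formula. The sketch below also prints the exact cost of one extra outcome, $\log_2(N+1) - \log_2 N$, next to the derivative; that extra column and the variable names are our own additions for comparison.

```python
import math

situations = [
    ("coin flip", 2),
    ("die roll", 6),
    ("letter of alphabet", 26),
    ("kilobyte address", 1024),
    ("million-item list", 1_000_000),
]

table = {}
for label, n in situations:
    bits = math.log2(n)                         # cost of one outcome among n
    marginal = 1.0 / (n * math.log(2.0))        # derivative d/dN log2(N)
    one_more = math.log2(n + 1) - math.log2(n)  # exact cost of one extra outcome
    table[n] = (bits, marginal, one_more)
    print(f"{label:<20} N={n:<8} bits={bits:8.4f}  d/dN={marginal:.2e}  +1 exact={one_more:.2e}")
```

For small `n` the derivative noticeably overestimates the exact step (0.721 versus 0.585 at `n = 2`); for large `n` the two are nearly identical.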
The same idea appears every time machine learning measures "perplexity": the cross-entropy is a sum of $\log_2$ terms, and the gradient with respect to any predicted probability $p$ inherits a factor $\frac{1}{p\ln 2}$, the very formula of this section.
Summary
- One identity unlocks everything: $\log_a x = \frac{\ln x}{\ln a}$. The right side is just a constant times $\ln x$.
- Plain derivative: $\frac{d}{dx}\log_a x = \frac{1}{x\ln a}$.
- Chain-rule version: $\frac{d}{dx}\log_a u(x) = \frac{u'(x)}{u(x)\ln a}$.
- Geometric meaning: bigger base ⇒ flatter curve ⇒ smaller slope, because $\ln a$ sits in the denominator.
- Calculus prefers $\ln$: base $e$ is the unique base whose derivative carries no extra constant. Convert to base 2 or base 10 only at the reporting stage.
- Verified twice: finite differences in plain Python and PyTorch autograd both reproduce the analytic slopes; in the finite-difference table the discrepancy stays below $10^{-10}$.
Exercises
- Differentiate and evaluate . Sanity-check against the formula . (Answer: .)
- Differentiate and evaluate at . (Hint: , . Answer: .)
- Find the x-coordinate where the tangent to has slope . (Answer: set ⇒ .)
- Show that $\frac{d}{dx}\log_a(cx) = \frac{1}{x\ln a}$ for any positive constant $c$. (Hint: use $\log_a(cx) = \log_a c + \log_a x$; the constant $\log_a c$ differentiates to zero.)
- Open the interactive slope explorer. Set base using the custom slider. Read off the slope at . Does it match the formula prediction ? (It should.)
- A 256-symbol alphabet (extended ASCII) costs $\log_2 256 = 8$ bits per symbol. Use the derivative to estimate the cost per symbol if the alphabet grew to 260 symbols. Compare against the exact answer $\log_2 260$. (Linear approximation: $\frac{4}{256\ln 2} \approx 0.0225$ extra bits. Exact: $\log_2 260 - 8 \approx 0.0224$.)