Chapter 5

Derivatives of General Logarithms

Derivatives of Transcendental Functions

Learning Objectives

By the end of this section you will be able to:

  1. Derive the formula $\dfrac{d}{dx}\bigl[\log_a x\bigr] = \dfrac{1}{x\,\ln a}$ from scratch using the change-of-base identity, with no memorization required.
  2. See visually why a bigger base gives a flatter curve, and why the slope formula has $\ln a$ in the denominator.
  3. Apply the chain rule to differentiate $\log_a(u(x))$ for any differentiable inner function $u(x) > 0$.
  4. Confirm the formula in two ways: a hand-traced Python finite-difference check, then PyTorch autograd.
  5. Recognize why $e$ is the single base that calculus prefers, and how $\ln x$ connects to information theory and machine learning.

The Question We Are Answering

We already know two important derivatives. The exponential $e^x$ is its own derivative, and the natural logarithm satisfies $\dfrac{d}{dx}\ln x = \dfrac{1}{x}$. Both formulas are beautiful precisely because the base is $e$. But the real world does not always speak in base $e$:

  • Engineers reason about decibels and the Richter scale in base 10.
  • Computer scientists count bits in base 2 — one binary digit per factor of 2.
  • Chemists measure pH as $-\log_{10}[\text{H}^+]$.
  • Statisticians compute log-likelihoods in whichever base is convenient for the experiment.

So we want a derivative formula that works for $\log_a x$, the logarithm to any positive base $a \ne 1$. The intuition we will build is short and physical: changing the base just rescales the y-axis of the natural log, and that vertical rescaling rides through the derivative as a constant factor.

The one-sentence goal: express $\log_a x$ in terms of $\ln x$, then borrow the natural-log derivative we already know.

The Change-of-Base Trick

From any base back to the natural log

The change-of-base identity is the bridge we need. Starting from the definition $y = \log_a x$, which means $a^y = x$, take the natural log of both sides:

$$\ln\bigl(a^{y}\bigr) = \ln x \;\Longrightarrow\; y \ln a = \ln x \;\Longrightarrow\; y = \frac{\ln x}{\ln a}.$$

That is the identity we will use everywhere from now on:

$$\boxed{\;\log_a x \;=\; \frac{\ln x}{\ln a}\;}$$

Look closely at the right-hand side. The variable $x$ lives only inside $\ln x$. The factor $1/\ln a$ is a constant; it does not depend on $x$ at all. So $\log_a x$ is literally a constant times $\ln x$.
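The claim that the ratio between the two logarithms never changes with x can be checked in a few lines. This is a minimal sketch using only the standard library; the helper name `log_a` and the sample points are chosen here for illustration.

```python
import math

def log_a(x, a):
    """Logarithm of x to base a, via the change-of-base identity ln(x) / ln(a)."""
    return math.log(x) / math.log(a)

a = 10
for x in (0.5, 8.0, 1000.0):
    # If log_a(x) is a constant times ln(x), this ratio is the same for every x.
    ratio = log_a(x, a) / math.log(x)
    print(f"x={x:>7}: log_10(x)={log_a(x, a):>9.6f}  ratio to ln(x)={ratio:.6f}")

print(f"1/ln(10) = {1 / math.log(a):.6f}")
```

Every printed ratio should match the final line, which is the constant factor the rest of the section carries through the derivative.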

The visual meaning

Geometrically, multiplying a function by a constant $c$ just stretches its graph vertically by a factor of $c$. So $\log_2 x$ is a taller version of $\ln x$ (stretched by $1/\ln 2 \approx 1.443$), and $\log_{10} x$ is a shorter version (squashed by $1/\ln 10 \approx 0.434$). The visualization below makes this stretching tangible.

Every loga(x) is just ln(x) divided by ln(a)

The green curve is ln(x). The yellow curve is loga(x). Move the slider: the yellow curve is just the green curve squashed vertically by the factor 1 / ln(a). That same constant factor is exactly what divides the slope.

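[Interactive figure: ln(x) and log_a(x) with a base slider. At a = 10.00 the stretch factor is 1/ln(a) = 0.4343; the marked points are (e, 1) on ln(x) and (e, 0.434) on log_10(x).]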
Why this matters for derivatives: if loga(x) = (1/ln a) · ln(x), and the derivative of ln(x) is 1/x, then the constant factor 1/ln(a) just rides along through the derivative — giving 1/(x · ln a). The formula is not memorized; it is visible in this picture.

The Formula and Its Picture

Differentiating both sides of $\log_a x = \dfrac{1}{\ln a}\,\ln x$:

$$\frac{d}{dx}\bigl[\log_a x\bigr] = \frac{1}{\ln a}\cdot\frac{d}{dx}\bigl[\ln x\bigr] = \frac{1}{\ln a}\cdot\frac{1}{x}.$$

That collapses to the clean answer:

$$\boxed{\;\dfrac{d}{dx}\bigl[\log_a x\bigr] \;=\; \dfrac{1}{x\,\ln a}\;}\quad (x > 0,\; a > 0,\; a \ne 1)$$

Three quick sanity checks confirm this is the right formula:

| Special case | Plug into formula | Result |
| --- | --- | --- |
| Natural log, a = e | $\dfrac{1}{x\,\ln e} = \dfrac{1}{x\cdot 1}$ | $\dfrac{1}{x}$ ✓ |
| Common log, a = 10 | $\dfrac{1}{x\,\ln 10}$ | $\approx \dfrac{0.4343}{x}$ |
| Binary log, a = 2 | $\dfrac{1}{x\,\ln 2}$ | $\approx \dfrac{1.4427}{x}$ |

The number $\log_2 e = 1/\ln 2 \approx 1.4427$ shows up so often in information theory that it is worth knowing on sight: it is the number of bits in one nat. A common practice is to compute in nats (natural log) and convert to bits only at the end.

The four numbers you should always be able to recall

| Function | Derivative | Numerical factor in front of 1/x |
| --- | --- | --- |
| $\ln x$ | $1/x$ | 1.0000 |
| $\log_2 x$ | $1/(x\ln 2)$ | 1.4427 |
| $\log_{10} x$ | $1/(x\ln 10)$ | 0.4343 |
| $\log_a x$ | $1/(x\ln a)$ | $1/\ln a$ |
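The numerical factors in the table can be reproduced directly from the formula. The sketch below assumes a small helper, `dlog_a`, which is a name introduced here and not part of any library; evaluating it at x = 1 isolates the factor 1/ln(a).

```python
import math

def dlog_a(x, a):
    """Derivative of log_a(x): 1 / (x * ln a), valid for x > 0, a > 0, a != 1."""
    return 1.0 / (x * math.log(a))

x = 1.0  # at x = 1 the slope equals the factor 1/ln(a) itself
for name, a in (("ln", math.e), ("log_2", 2), ("log_10", 10)):
    print(f"{name:>6}: factor = {dlog_a(x, a):.4f}")
# prints factors 1.0000, 1.4427 and 0.4343
```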

Interactive: Drag the Slope

The picture is the best argument. In the visualization below, the orange dot is the point where you want the slope. The cyan dashed line is the actual tangent. The readouts confirm that the slope you see is exactly $1/(x\,\ln a)$. Play with three things:

  • Drag the dot rightward. Watch the tangent flatten out: the slope shrinks like $1/x$.
  • Crank up the base with the slider. The whole curve gets squashed and the slope number drops accordingly.
  • Turn on "compare bases." The three classical curves all pass through $(1, 0)$ but separate immediately because each one is stretched by its own factor $1/\ln a$.

Slope of log2(x) at any point

Drag the orange point along the curve. The tangent line shows the instantaneous slope. Notice how the slope formula 1 / (x · ln a) gives a smaller number when x is larger or when the base a is larger.

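[Interactive figure: y = log_2(x) (active), plotted with ln(x) = log_e(x) and log_10(x) for comparison; x-axis from 1 to 10, y-axis from -3 to 4. Sample readout at base 2: point x = 4.000, y = log2(x) = 2.0000, x · ln(a) = 2.7726, slope = 1 / (x · ln a) = 0.3607.]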
Notice: as the base a grows, the curve gets flatter and the slope shrinks, because ln(a) sits in the denominator. The cleanest base is a = e, where ln(e) = 1, the ln(a) factor disappears, and the slope is exactly 1/x. Bases between 1 and e, such as 2, give a slope steeper than 1/x.

Same x, different bases: slope = 1 / (x · ln a)

At a fixed point x, the slope shrinks as the base grows. The shape of the curve gets flatter; the input x is identical — only ln(a) changes in the denominator.

x = 3.00

| Base | ln(a) | Slope at x = 3.00 |
| --- | --- | --- |
| log₂ | 0.693 | 0.48090 |
| ln | 1.000 | 0.33333 |
| log₁₀ | 2.303 | 0.14476 |
| log₁₀₀ | 4.605 | 0.07238 |

Slope at x = 3.00 is plotted as a horizontal bar for each base. Same x; only ln(a) changes.
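The bar values for a fixed x can be recomputed from the formula alone. This is a rough sketch: the text bars are only a stand-in for the interactive chart, and the scale of one `#` per 0.01 of slope is an arbitrary choice made here.

```python
import math

x = 3.0
slopes = {}
for label, a in (("log_2", 2), ("ln", math.e), ("log_10", 10), ("log_100", 100)):
    slopes[label] = 1.0 / (x * math.log(a))
    bar = "#" * round(slopes[label] * 100)  # crude text bar: one '#' per 0.01 of slope
    print(f"{label:>8}  ln(a)={math.log(a):.3f}  slope={slopes[label]:.5f}  {bar}")
```

The input x never changes inside the loop; only ln(a) does, which is why the bars shrink as the base grows.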


Worked Example (Do-It-By-Hand)

Work through this example with a pencil before you peek at the steps. The numbers are chosen to be simple so you can see the formula in motion.

Worked example: differentiate $f(x) = \log_2 x$ at $x = 8$

Step 1: rewrite using the change-of-base identity. $f(x) = \log_2 x = \dfrac{\ln x}{\ln 2}$. The factor $1/\ln 2$ is a constant, not a function of $x$.

Step 2: pull the constant out. $f'(x) = \dfrac{1}{\ln 2}\cdot\dfrac{d}{dx}\ln x = \dfrac{1}{\ln 2}\cdot\dfrac{1}{x} = \dfrac{1}{x\,\ln 2}.$

Step 3: plug in $x = 8$. $f'(8) = \dfrac{1}{8 \cdot \ln 2} = \dfrac{1}{8 \cdot 0.69315\ldots} = \dfrac{1}{5.5452\ldots}.$

Step 4: compute by hand. $f'(8) \approx 0.18034.$

Step 5: geometric sanity check. $\log_2 8 = 3$ (because $2^3 = 8$). One step to the right at $x = 8$ means we go from 8 to 9. The change in $\log_2$ should be approximately $0.180$. Check it: $\log_2 9 - \log_2 8 \approx 3.1699 - 3 = 0.1699$. That under-estimates the tangent slope a bit because $\log_2$ is concave, so the secant lies below the tangent. Halve the step to 0.5 and the secant slope rises to $(\log_2 8.5 - \log_2 8)/0.5 \approx 0.1749$, closing in on the true slope $0.18034$. This is exactly the "limit of difference quotients" picture we saw in Chapter 4.

Step 6: second example, with the chain rule. Try $g(x) = \log_{10}(3x^{2} + 5)$. Set the inner function $u = 3x^2 + 5$ so $u' = 6x$. Then $g'(x) = \dfrac{u'}{u\,\ln 10} = \dfrac{6x}{(3x^2 + 5)\,\ln 10}.$ At $x = 1$ this is $\dfrac{6}{8\cdot\ln 10} = \dfrac{6}{18.4207} \approx 0.3257.$

What you should take away: the entire derivation is "factor the base out, then it's just $\ln$." You never have to memorize a separate rule for each base.
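The sanity check in Step 5 can be pushed further by shrinking the step a few more times. The sketch below uses forward difference quotients at x = 8; the particular step sizes are an arbitrary choice for illustration.

```python
import math

def f(x):
    return math.log2(x)

x0 = 8.0
tangent = 1.0 / (x0 * math.log(2))  # formula value, about 0.18034

secants = []
for step in (1.0, 0.5, 0.1, 0.01):
    secant = (f(x0 + step) - f(x0)) / step  # forward difference quotient
    secants.append(secant)
    print(f"step={step:<5} secant={secant:.5f}  gap to tangent={tangent - secant:.5f}")

print(f"tangent slope = {tangent:.5f}")
```

Because log_2 is concave, every secant slope sits below the tangent slope, and the gap closes as the step shrinks.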


The Chain Rule Version

In practice you rarely meet a bare $\log_a x$. You usually meet $\log_a$ of something more complicated: a polynomial, a ratio, a trig function. The chain rule combines with our formula in the cleanest possible way:

$$\boxed{\;\dfrac{d}{dx}\bigl[\log_a u(x)\bigr] \;=\; \dfrac{u'(x)}{u(x)\,\ln a}\;}$$

The recipe: divide $u'$ by $u$, then divide once more by $\ln a$. That extra denominator $\ln a$ is the only thing that changes when you switch bases.

Three worked applications of the chain rule

| Function | u(x) and u'(x) | Derivative |
| --- | --- | --- |
| $\log_2(x^3 + 1)$ | $u = x^3+1,\; u' = 3x^2$ | $\dfrac{3x^2}{(x^3+1)\,\ln 2}$ |
| $\log_{10}(\sin x)$ | $u = \sin x,\; u' = \cos x$ | $\dfrac{\cos x}{\sin x\,\ln 10} = \dfrac{\cot x}{\ln 10}$ |
| $\log_5\!\left(\dfrac{1+x}{1-x}\right)$ | $u = \dfrac{1+x}{1-x},\; u' = \dfrac{2}{(1-x)^2}$ | $\dfrac{2}{(1-x^2)\,\ln 5}$ |

Why is the formula so "clean"? Because the only piece that depends on x is $\ln u(x)$. The base change just attaches a constant $1/\ln a$. Differentiating a constant times a function differentiates the function and keeps the constant. That is the whole story.
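The three chain-rule results can be spot-checked against central differences. This is a sketch in plain Python; the helpers `log_a` and `central_diff` and the test points (each picked to keep u(x) positive) are assumptions made for the illustration.

```python
import math

def log_a(x, a):
    return math.log(x) / math.log(a)

def central_diff(f, x, h=1e-6):
    """Symmetric difference quotient (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, claimed derivative u'/(u ln a), test point inside the domain)
cases = [
    ("log_2(x^3 + 1)",
     lambda x: log_a(x**3 + 1, 2),
     lambda x: 3 * x**2 / ((x**3 + 1) * math.log(2)),
     1.5),
    ("log_10(sin x)",
     lambda x: log_a(math.sin(x), 10),
     lambda x: (math.cos(x) / math.sin(x)) / math.log(10),
     1.0),
    ("log_5((1+x)/(1-x))",
     lambda x: log_a((1 + x) / (1 - x), 5),
     lambda x: 2 / ((1 - x**2) * math.log(5)),
     0.3),
]

for label, f, df, x in cases:
    print(f"{label:<20} formula={df(x):.6f}  numeric={central_diff(f, x):.6f}")
```

Agreement at one point per row is evidence, not proof, but a wrong sign or a forgotten ln(a) would show up immediately.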

Why ln Is the "Natural" Base for Calculus

Looking at the formula $\dfrac{d}{dx}\log_a x = \dfrac{1}{x\,\ln a}$, you can immediately see why the natural logarithm gets the adjective "natural":

Among all bases $a > 0$, $a \ne 1$, the base $a = e$ is the unique one whose derivative formula does not contain an extra constant. Every other base pays a $1/\ln a$ tax.

This is not a coincidence; it is the definition of $e$. The number $e$ was chosen precisely so that $\ln e = 1$, killing the constant factor. Calculus rewards the choice with the cleanest possible derivative.

In practice this means: if you have a choice, work in $\ln$. Convert to base 2 or base 10 only at the very end, when reporting the answer in "bits" or "decibels." Inside derivatives, gradients, optimizers, and ML training loops, $\ln$ is the right currency.


Plain Python: Verify the Formula Numerically

Mathematics tells us $f'(x) = 1/(x\,\ln a)$. The computer is happy to check that, but it does not know about derivatives directly; it knows about finite differences. So we are going to compute the true slope two ways:

  1. Analytic: the closed-form formula we just derived, evaluated at our test points.
  2. Numeric: a tiny Δx → 0 approximation of the difference quotient. If both numbers agree to many decimal places, the formula is right.

We will do this for three bases at three test points — nine comparisons total — in plain Python with no external libraries.

Verifying d/dx [log_a(x)] = 1/(x ln a) with finite differences
File: derivative_of_log_a.py
Import math

We need just two functions from the standard library: math.log (which is natural log, i.e. ln) and math.e (the constant e ≈ 2.718281828). No NumPy, no PyTorch — every operation is visible.

log_a(x, a): change-of-base in code

Python's math.log is base-e. To get log base a we apply the identity we proved: log_a(x) = ln(x) / ln(a). One line, no magic. This is the very same formula we differentiated by hand.

EXAMPLE
log_a(8, 2) → 3.0   because 2^3 = 8
analytic_derivative(x, a): the formula under test

Returns 1 / (x · ln a). This is the answer we are claiming is the derivative. If we are right, the numeric derivative below should agree to many decimal places.

EXECUTION STATE
analytic_derivative(4, 2) = 1/(4·0.6931...) ≈ 0.36067
numeric_derivative(x, a, h): central difference

We do not have to trust the formula. We can ask the function 'what is your slope at x?' by sampling two nearby points and taking the secant slope. The central difference (f(x+h) − f(x−h)) / (2h) is much more accurate than the one-sided forward difference, with truncation error proportional to h² instead of h.

EXAMPLE
h = 1e-6 → truncation error on the order of h² = 10⁻¹², far below the printout precision; floating-point round-off adds a little more on top.
bases and points to test

Three bases (2, e, 10) cover the three logarithms anyone actually uses. Three test points (0.5, 4, 100) cover small, medium, and large x — so we can see the slope shrink like 1/x.

EXECUTION STATE
bases = [2, e, 10]
points = [0.5, 4.0, 100.0]
Print the header row

f-string with field widths so the columns line up. We will print analytic, numeric, and absolute difference side by side. If the formula is correct, |diff| will be tiny, many orders of magnitude below the slopes themselves.

Loop: every (base, x) pair, full trace

Nested loop: for each base, walk through each test point. Nine iterations total. Each iteration computes the analytic slope, the numeric slope, and their absolute difference. The agreement between the two is the evidence that the formula is right.

LOOP TRACE · 9 iterations
a=2, x=0.5
analytic = 1/(0.5·ln 2) = 2.88539008
numeric = 2.88539008
|diff| = ≈ 4.8e-11
a=2, x=4
analytic = 1/(4·ln 2) = 0.36067376
numeric = 0.36067376
|diff| = ≈ 6.0e-12
a=2, x=100
analytic = 1/(100·ln 2) = 0.01442695
numeric = 0.01442695
|diff| = ≈ 2.4e-13
a=e, x=0.5
analytic = 1/(0.5·1) = 2.00000000
numeric = 2.00000000
|diff| = ≈ 3.3e-11
a=e, x=4
analytic = 1/(4·1) = 0.25000000
numeric = 0.25000000
|diff| = ≈ 4.2e-12
a=e, x=100
analytic = 1/(100·1) = 0.01000000
numeric = 0.01000000
|diff| = ≈ 1.7e-13
a=10, x=0.5
analytic = 1/(0.5·ln 10) = 0.86858896
numeric = 0.86858896
|diff| = ≈ 1.4e-11
a=10, x=4
analytic = 1/(4·ln 10) = 0.10857362
numeric = 0.10857362
|diff| = ≈ 1.8e-12
a=10, x=100
analytic = 1/(100·ln 10) = 0.00434294
numeric = 0.00434294
|diff| = ≈ 7.2e-14
ana = analytic_derivative(x, a)

Compute the formula's answer. Cheap: one multiplication, one division, and one log of the base.

num = numeric_derivative(x, a)

Compute the secant slope across an interval of width 2h around x. This is the operational definition of a derivative from Chapter 4: f'(x) = lim_{h→0} (f(x+h) − f(x−h)) / (2h). We are taking h = 10⁻⁶, which is small enough to be very accurate but large enough that floating-point cancellation is not yet dominant.

diff = |ana − num|

If the formula is right, this is essentially zero, limited only by how accurate a finite difference can be in float64. The expected magnitude is roughly 10⁻¹¹ to 10⁻¹⁰, which is the round-off noise floor of central differences at h=10⁻⁶ in double precision.

Detect base e

math.e is a float, not exactly equal to the symbolic e. We compare with a tiny tolerance to label that base 'e' in the output instead of '2.71828...'. Just a cosmetic touch.

EXAMPLE
a == math.e is dangerous — use abs(a - math.e) < 1e-9.
Aligned printout

Each column is right-justified with a fixed width using f-string format specifiers. The result is a tidy 5-column table that shows the formula matching the numeric estimate to the 8 printed decimal places.

import math

def log_a(x, a):
    """log base a, expressed via the change-of-base identity."""
    return math.log(x) / math.log(a)

def analytic_derivative(x, a):
    """The formula we derived: 1 / (x * ln a)."""
    return 1.0 / (x * math.log(a))

def numeric_derivative(x, a, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h)."""
    return (log_a(x + h, a) - log_a(x - h, a)) / (2 * h)

# Three bases x three test points = nine checks.
bases  = [2, math.e, 10]
points = [0.5, 4.0, 100.0]

print(f"{'base':>6} {'x':>8} {'analytic':>14} {'numeric':>14} {'|diff|':>10}")
for a in bases:
    for x in points:
        ana = analytic_derivative(x, a)
        num = numeric_derivative(x, a)
        diff = abs(ana - num)
        a_str = "e" if abs(a - math.e) < 1e-9 else str(int(a))
        print(f"{a_str:>6} {x:>8.2f} {ana:>14.8f} {num:>14.8f} {diff:>10.2e}")

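Representative output. The analytic and numeric columns should agree to the 8 printed decimals; the exact digits in the |diff| column come from floating-point round-off and may differ on your machine: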

  base        x       analytic        numeric     |diff|
     2     0.50     2.88539008     2.88539008   4.84e-11
     2     4.00     0.36067376     0.36067376   6.04e-12
     2   100.00     0.01442695     0.01442695   2.39e-13
     e     0.50     2.00000000     2.00000000   3.33e-11
     e     4.00     0.25000000     0.25000000   4.16e-12
     e   100.00     0.01000000     0.01000000   1.67e-13
    10     0.50     0.86858896     0.86858896   1.45e-11
    10     4.00     0.10857362     0.10857362   1.85e-12
    10   100.00     0.00434294     0.00434294   7.20e-14

Read the table

Every entry in the |diff| column is far below the 8 decimals printed for the slopes themselves. That is the difference between "mathematical equality" and "numerical equality." The formula and the finite difference are reporting the same number, to within the accuracy that a central difference with h = 10⁻⁶ can deliver in float64.
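The choice h = 10⁻⁶ is a compromise, and a sweep over step sizes makes that visible. The sketch below is an illustration under assumed settings (base 2, x = 4, steps from 10⁻¹ down to 10⁻¹²); the exact error digits depend on floating-point rounding, so none are quoted here.

```python
import math

def log_a(x, a):
    return math.log(x) / math.log(a)

def central_diff(x, a, h):
    return (log_a(x + h, a) - log_a(x - h, a)) / (2 * h)

x, a = 4.0, 2
exact = 1.0 / (x * math.log(a))

errors = []  # errors[k-1] holds the error for h = 10**(-k)
for k in range(1, 13):
    h = 10.0 ** (-k)
    errors.append(abs(central_diff(x, a, h) - exact))
    print(f"h=1e-{k:02d}  |error|={errors[-1]:.2e}")
```

For large h the error falls roughly like h², as the truncation term predicts. For very small h it stops improving and typically grows again, because subtracting two nearly equal logarithms loses digits.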

PyTorch: Autograd Confirms the Same Number

For machine-learning workflows, we never write the derivative by hand. PyTorch's autograd engine records every operation as we go and applies the chain rule in reverse. If we compute $y = \log_2(x)$ and call y.backward(), PyTorch will fill x.grad with $1/(x\,\ln 2)$. Let's verify.

PyTorch autograd: derivative of log_a(x) and log_a(u(x))
File: autograd_log_a.py
Import torch and math

torch gives us tensors with autograd; math gives us the natural log of the base — itself a plain Python float, not a tensor. We never need to put the constant ln(a) inside the autograd graph.

xs = tensor([0.5, 4, 100], requires_grad=True)

Three test points packed into one tensor. Setting requires_grad=True tells PyTorch to track every operation that touches xs so it can later answer 'what is the derivative with respect to each entry?'

EXECUTION STATE
xs = tensor([0.5, 4.0, 100.0])
xs.requires_grad = True
xs.grad = None (no backward call yet)
y2 = torch.log(xs) / math.log(2)

Change-of-base again: torch.log is natural log (base e), so dividing by math.log(2) gives log base 2. The division by a Python float is broadcast across the tensor. y2 is a 3-element tensor of log_2 values.

EXAMPLE
y2 ≈ tensor([-1.0000,  2.0000,  6.6439])
y2.sum().backward()

Called with no arguments, backward() requires a scalar output. By summing y2 first, we are asking for the gradient of the sum. Because each y_j depends only on x_j, d/dx_i [Σ_j y_j] = dy_i/dx_i for each i, so this is exactly how you ask for 'the derivative at each entry' all at once. It's the standard trick for vectorized derivative tests.

EXECUTION STATE
y2.sum() = tensor(7.6439) (the scalar we ask grad of)
xs.grad (after) = tensor([2.8854, 0.3607, 0.0144])
Print autograd result

What autograd returns. Each entry should equal 1/(x_i · ln 2) up to float32 rounding. These are the slopes of log_2 at x = 0.5, 4, and 100.

Print the formula prediction

Compute the formula directly on the raw float values (we call .detach() so this side computation doesn't get tangled in the autograd graph). The two printouts should be identical to the displayed precision — they are.

EXECUTION STATE
1/(xs · ln 2) = tensor([2.8854, 0.3607, 0.0144])
xs.grad.zero_()

Autograd ACCUMULATES gradients into .grad — it does not overwrite. If we ran a second backward without zeroing, the new gradients would be ADDED to the old ones, giving wrong numbers. This is the single most common autograd pitfall; every PyTorch training loop calls zero_grad() before each step for exactly this reason.

EXAMPLE
Without zero_(), xs.grad would become 2× the correct slopes on the next call.
y10 = log_10(xs)

Same construction, base 10 this time. The only difference from the log_2 construction is dividing by math.log(10) ≈ 2.3026 instead of math.log(2) ≈ 0.6931.

y10.sum().backward()

Same scalar-sum trick. After this call, xs.grad holds the three slopes of log_10 at our test points. They should equal 1/(x · ln 10) ≈ [0.8686, 0.1086, 0.00434].

EXECUTION STATE
xs.grad = tensor([0.8686, 0.1086, 0.0043])
Part B: chain rule on u(x) = 3x² + 5

A scalar test of the chain-rule version: g(x) = log_10(3x² + 5). The whole computation graph is just five operations (square, multiply, add, log, divide), but autograd will apply the chain rule across all of them automatically.

EXECUTION STATE
x = tensor(1.0, requires_grad=True)
u = 3 * x**2 + 5

Build the inner function. At x = 1, u(1) = 3·1 + 5 = 8. The graph records: input x → x² → 3·x² → 3·x² + 5. The derivative u'(x) = 6x = 6 at x = 1.

EXECUTION STATE
u(1) = tensor(8., grad_fn=<AddBackward0>)
u'(1) = 6.0 (PyTorch will derive this implicitly)
y = log_10(u)

Apply the change of base to the inner function. At x = 1, y(1) = log_10(8) ≈ 0.9031. The full chain is now: x → u(x) → log(u) → log(u)/ln(10).

y.backward()

Reverse-mode differentiation walks the graph backward, applying the chain rule. The result that lands in x.grad is u'(x) / (u(x) · ln 10) = 6 / (8 · 2.3026) ≈ 0.3257.

EXECUTION STATE
x.grad = tensor(0.3257)
Compare to the hand-derived formula

Direct computation of u'/(u · ln a) with raw Python floats. Should print approximately 0.32572, matching x.grad to about 7 significant digits (x.grad is stored in float32). The chain rule formula and PyTorch agree.

EXAMPLE
6 / (8 * 2.302585) ≈ 0.32572
import torch
import math

# --- Part A: derivative of plain log_a(x) at three bases ------------
xs = torch.tensor([0.5, 4.0, 100.0], requires_grad=True)

# log_2(x) = ln(x) / ln(2)
y2 = torch.log(xs) / math.log(2)
# We need the gradient of the SUM so PyTorch returns a vector of slopes.
y2.sum().backward()
print("d/dx log_2 at [0.5, 4, 100]:", xs.grad)
print("formula 1/(x*ln 2):          ", 1.0 / (xs.detach() * math.log(2)))

# Reset gradients before the next test
xs.grad.zero_()

# log_10(x) = ln(x) / ln(10)
y10 = torch.log(xs) / math.log(10)
y10.sum().backward()
print("\nd/dx log_10 at [0.5, 4, 100]:", xs.grad)
print("formula 1/(x*ln 10):           ", 1.0 / (xs.detach() * math.log(10)))

# --- Part B: chain rule on u(x) = 3x^2 + 5, base 10 -----------------
x = torch.tensor(1.0, requires_grad=True)
u = 3 * x**2 + 5            # u(1) = 8
y = torch.log(u) / math.log(10)
y.backward()
print("\nd/dx log_10(3x^2 + 5) at x=1:")
print("  autograd:", x.grad.item())
print("  formula :", (6 * 1.0) / (8.0 * math.log(10)))
Why the agreement is not a coincidence. PyTorch autograd is not running finite differences in the background. It recorded the exact chain of operations (torch.log → divide by constant → sum) and applied the analytic derivative of each elementary operation. The fact that it lands on the same formula we derived by hand is direct confirmation that our derivation is the true rule, not an approximation.
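To see what "applying the analytic derivative of each elementary operation" means without any library, here is a toy sketch using dual numbers. Two caveats: this is forward-mode differentiation, not the reverse mode PyTorch uses, and the `Dual` class and `log_dual` helper are hypothetical names written for this illustration only.

```python
import math

class Dual:
    """A value paired with its derivative with respect to the input x."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def _lift(other):
        # Plain numbers are constants: derivative 0.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.val + o.val, self.der + o.der)        # sum rule
    __radd__ = __add__

    def __mul__(self, other):
        o = Dual._lift(other)
        return Dual(self.val * o.val,
                    self.der * o.val + self.val * o.der)       # product rule
    __rmul__ = __mul__

    def __truediv__(self, c):
        return Dual(self.val / c, self.der / c)                # plain constant divisor only

def log_dual(u):
    """Natural log of a Dual: d/dx ln(u) = u'/u."""
    return Dual(math.log(u.val), u.der / u.val)

x = Dual(1.0, 1.0)               # seed: dx/dx = 1
u = 3 * x * x + 5                # u(1) = 8, u'(1) = 6
y = log_dual(u) / math.log(10)   # log_10(u)
print(f"{y.val:.4f} {y.der:.4f}")  # prints 0.9031 0.3257
```

Each operation carries its own derivative rule, and composing them reproduces u'/(u · ln 10) with no difference quotient anywhere.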

Application: Information Theory and log2

A satisfying place to land is where $\log_2$ is the unquestioned king: information theory. Encoding one out of $N$ equally likely outcomes costs exactly $\log_2 N$ bits. The derivative we just derived tells you the marginal cost of making $N$ a little larger:

$$\dfrac{d}{dN}\bigl[\log_2 N\bigr] \;=\; \dfrac{1}{N\,\ln 2}\;\;\text{bits per outcome.}$$

Two things to feel here. First, doubling $N$ always adds exactly one bit, a discrete jump that the derivative averages out into a continuous slope. Second, that slope shrinks like $1/N$, so growing a code from $N = 1\,000$ to $N = 1\,001$ costs almost nothing (about 0.0014 bits), but growing from 2 to 3 costs more than half a bit ($\log_2 3 - 1 \approx 0.585$).
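Both observations can be checked numerically. This is a small sketch; `marginal_bits` is a helper name introduced here, and the sample values of N are arbitrary.

```python
import math

def marginal_bits(n):
    """Derivative of log2(N) at N = n: 1 / (n * ln 2)."""
    return 1.0 / (n * math.log(2))

for n in (2, 32, 1000):
    exact = math.log2(n + 1) - math.log2(n)        # true cost of one extra outcome
    doubling = math.log2(2 * n) - math.log2(n)     # log2(2) = 1, whatever n is
    print(f"N={n:>5}: one more outcome costs {exact:.5f} bits "
          f"(derivative estimate {marginal_bits(n):.5f}); doubling adds {doubling:.3f} bit")
```

The derivative estimate is rough at N = 2, where one extra outcome is a large relative change, and tightens quickly as N grows.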

Why we care about d/dN [log2(N)]

In information theory, the number of bits needed to encode one of N equally likely outcomes is log2(N). The derivative tells us: doubling N adds one bit, but the marginal cost of adding one more option shrinks as N grows.

[Interactive readout at N = 32: bits needed, log₂(32) = 5.000; marginal bits per outcome, 1 / (N · ln 2) = 4.508e-2; bits to double N, log₂(2N) − log₂(N) = +1.000.]
Quick reference table

| Situation | N | log₂(N) bits | d/dN [log₂ N] |
| --- | --- | --- | --- |
| coin flip | 2 | 1.0000 | 7.21e-1 |
| die roll | 6 | 2.5850 | 2.40e-1 |
| letter of alphabet | 26 | 4.7004 | 5.55e-2 |
| kilobyte address | 1024 | 10.0000 | 1.41e-3 |
| millionth element of a sorted list | 1000000 | 19.9316 | 1.44e-6 |

Encoding the 1,000,001st item from a million-item list costs only about 1.44 millionths of a bit. Information has diminishing returns, and the derivative of log2 is exactly that statement.

The same idea appears every time machine learning measures "perplexity" $= 2^{\text{cross-entropy}}$: when the cross-entropy is measured in bits it is a sum of $\log_2$ terms, and the gradient with respect to any predicted probability $p$ inherits a $1/(p\,\ln 2)$ factor, the very formula of this section.
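For a single term of such a sum, the gradient can be checked directly. The sketch below looks at one surprisal term, −log₂ p, on its own; the function names and probabilities are illustrative, and a full cross-entropy would weight and sum many such terms.

```python
import math

def loss_bits(p):
    """Surprisal of an outcome predicted with probability p, in bits."""
    return -math.log2(p)

def grad_bits(p):
    """Analytic derivative of -log2(p): -1 / (p * ln 2)."""
    return -1.0 / (p * math.log(2))

h = 1e-6
checks = []
for p in (0.1, 0.5, 0.9):
    numeric = (loss_bits(p + h) - loss_bits(p - h)) / (2 * h)  # central difference
    nats_grad = -1.0 / p                                       # same gradient in nats
    checks.append((grad_bits(p), numeric))
    print(f"p={p}: analytic={grad_bits(p):.6f}  numeric={numeric:.6f}  "
          f"bits/nats ratio={grad_bits(p) / nats_grad:.4f}")
```

The last column is always 1/ln 2 ≈ 1.4427: switching the loss from nats to bits rescales every gradient by the same constant and changes nothing else.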


Summary

  • One identity unlocks everything: $\log_a x = \dfrac{\ln x}{\ln a}$. The right side is just a constant times $\ln x$.
  • Plain derivative: $\dfrac{d}{dx}\log_a x = \dfrac{1}{x\,\ln a}$.
  • Chain-rule version: $\dfrac{d}{dx}\log_a u(x) = \dfrac{u'(x)}{u(x)\,\ln a}$.
  • Geometric meaning: bigger base $a$ ⇒ flatter curve ⇒ smaller slope, because $\ln a$ sits in the denominator.
  • Calculus prefers $\ln$: base $e$ is the unique base whose derivative carries no extra constant. Convert to base 2 or base 10 only at the reporting stage.
  • Verified twice: finite differences in plain Python agree with the formula to the 8 printed decimals, and PyTorch autograd reproduces the same slopes to float32 display precision.

Exercises

  1. Differentiate $f(x) = \log_7(x)$ and evaluate $f'(49)$. Sanity-check against the formula $1/(x\,\ln a)$. (Answer: $1/(49\,\ln 7) \approx 0.01049$.)
  2. Differentiate $g(x) = \log_2(x^2 + 1)$ and evaluate at $x = 1$. (Hint: $u = x^2 + 1$, $u' = 2x$. Answer: $2/(2\,\ln 2) = 1/\ln 2 \approx 1.4427$.)
  3. Find the x-coordinate where the tangent to $y = \log_{10} x$ has slope $1/10$. (Answer: set $1/(x\,\ln 10) = 1/10$, which gives $x = 10/\ln 10 \approx 4.343$.)
  4. Show that $\dfrac{d}{dx}\log_a(kx) = \dfrac{d}{dx}\log_a(x)$ for any positive constant $k$. (Hint: use $\log_a(kx) = \log_a k + \log_a x$; the constant differentiates to zero.)
  5. Open the interactive slope explorer. Set base $a = e^2$ using the custom slider. Read off the slope at $x = 5$. Does it match the formula prediction $1/(5 \cdot 2) = 0.1$? (It should.)
  6. A 256-symbol alphabet (ASCII) costs $\log_2 256 = 8$ bits per symbol. Use the derivative $1/(N \ln 2)$ to estimate the extra cost per symbol if the alphabet grew to 260 symbols. Compare against the exact answer $\log_2 260 - 8$. (Linear approximation: $4/(256 \ln 2) \approx 0.0225$ extra bits. Exact: $\approx 0.0224$.)
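After attempting Exercise 6 by hand, the two numbers can be compared with a short check. This is only a sketch of the linear-approximation idea; the variable names are chosen here for illustration.

```python
import math

N, extra = 256, 4
linear = extra / (N * math.log(2))            # derivative at N times the change in N
exact = math.log2(N + extra) - math.log2(N)   # true increase in bits per symbol

print(f"linear estimate: {linear:.4f} bits")  # prints 0.0225
print(f"exact increase : {exact:.4f} bits")   # prints 0.0224
```

The linear estimate slightly overshoots because log₂ is concave: the tangent line lies above the curve.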