Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

Derive the formula $\frac{d}{dx}\!\bigl[\log_a x\bigr] = \dfrac{1}{x\,\ln a}$ from scratch using the change-of-base identity — no memorization required.
See visually why a bigger base gives a flatter curve, and why the slope formula has $\ln a$ in the denominator.
Apply the chain rule to differentiate $\log_a(u(x))$ for any inner function $u(x)$ .
Confirm the formula in two ways: a hand-traced Python finite-difference, then PyTorch autograd.
Recognize why $\ln x$ is the single base that calculus prefers, and how it connects to information theory and machine learning.

The Question We Are Answering

We already know two important derivatives. The exponential $e^x$ is its own derivative, and the natural logarithm satisfies $\dfrac{d}{dx}\ln x = \dfrac{1}{x}$ . Both formulas are beautiful precisely because the base is $e$ . But the real world does not always speak in base $e$ :

Engineers reason about decibels and the Richter scale in base 10.
Computer scientists count bits in base 2 — one binary digit per factor of 2.
Chemists measure pH as $-\log_{10}[\text{H}^+]$ .
Statisticians compute log-likelihoods in whichever base is convenient for the experiment.

So we want a derivative formula that works for $\log_a x$ — the logarithm to any positive base $a \ne 1$ . The intuition we will build is short and physical: changing the base just rescales the y-axis of the natural log, and that vertical rescaling rides through the derivative as a constant factor.

The one-sentence goal: express $\log_a x$ in terms of $\ln x$ , then borrow the natural-log derivative we already know.

The Change-of-Base Trick

From any base back to the natural log

The change-of-base identity is the bridge we need. Starting from the definition $y = \log_a x$ meaning $a^y = x$ , take the natural log of both sides:

\ln\!\bigl(a^{y}\bigr) = \ln x \;\Longrightarrow\; y \ln a = \ln x \;\Longrightarrow\; y = \frac{\ln x}{\ln a}.

That is the identity we will use everywhere from now on:

\boxed{\;\log_a x \;=\; \frac{\ln x}{\ln a}\;}

Look closely at the right-hand side. The variable $x$ lives only inside $\ln x$ . The factor $1/\ln a$ is a constant — it does not depend on $x$ at all. So $\log_a x$ is literally a constant times $\ln x$ .
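The identity is easy to spot-check numerically. The sketch below assumes only the Python standard library; the helper name log_base and the sample points are illustrative choices, not part of the text.

```python
import math

def log_base(x, a):
    """log_a(x) via the change-of-base identity ln(x) / ln(a)."""
    return math.log(x) / math.log(a)

sample_points = (0.5, 8.0, 1000.0)  # arbitrary test inputs (an assumption)

# Compare against the standard library's dedicated base-2 and base-10 logs.
for x in sample_points:
    print(f"x={x:>7}: log_2 {log_base(x, 2):.6f} vs {math.log2(x):.6f}, "
          f"log_10 {log_base(x, 10):.6f} vs {math.log10(x):.6f}")

# The ratio log_2(x) / ln(x) is the same constant, 1/ln(2), at every x.
ratios = [log_base(x, 2) / math.log(x) for x in sample_points]
print(ratios, 1.0 / math.log(2))
```

If the identity holds, every entry of ratios should agree with 1/ln(2) up to floating-point rounding, which is the "constant times ln x" statement in numbers.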

The visual meaning

Geometrically, multiplying a function by a constant $c$ just stretches its graph vertically by a factor of $c$. So $\log_2 x$ is a taller version of $\ln x$ (stretched by $1/\ln 2 \approx 1.443$), and $\log_{10} x$ is a shorter version (squashed by $1/\ln 10 \approx 0.434$). The visualization below makes this stretching tangible.

Every log_a(x) is just ln(x) divided by ln(a)

The green curve is ln(x). The yellow curve is log_a(x). Move the slider: the yellow curve is just the green curve squashed vertically by the factor 1 / ln(a). That same constant factor is exactly what divides the slope.

Sample readout: base a = 10.00, stretch = 1/ln(a) = 0.4343

Why this matters for derivatives: if log_a(x) = (1/ln a) · ln(x), and the derivative of ln(x) is 1/x, then the constant factor 1/ln(a) just rides along through the derivative — giving 1/(x · ln a). The formula is not memorized; it is visible in this picture.

The Formula and Its Picture

Differentiating both sides of $\log_a x = \dfrac{1}{\ln a}\,\ln x$ :

\frac{d}{dx}\bigl[\log_a x\bigr] = \frac{1}{\ln a}\cdot\frac{d}{dx}\bigl[\ln x\bigr] = \frac{1}{\ln a}\cdot\frac{1}{x}.

That collapses to the clean answer:

\boxed{\;\dfrac{d}{dx}\bigl[\log_a x\bigr] \;=\; \dfrac{1}{x\,\ln a}\;}\quad (x > 0,\; a > 0,\; a \ne 1)

Three quick sanity checks confirm this is the right formula:

Special case	Plug into formula	Result
Natural log, a = e	$\dfrac{1}{x\,\ln e} = \dfrac{1}{x\cdot 1}$	$\dfrac{1}{x}$ ✓
Common log, a = 10	$\dfrac{1}{x\,\ln 10}$	$\approx \dfrac{0.4343}{x}$
Binary log, a = 2	$\dfrac{1}{x\,\ln 2}$	$\approx \dfrac{1.4427}{x}$

The number $\log_2 e = 1/\ln 2 \approx 1.4427$ shows up so often in information theory because it is the conversion factor from nats to bits: one nat equals about 1.4427 bits. A common practice is to compute in nats (natural log) and convert to bits once at the end, which avoids carrying the factor $1/\ln 2$ through every intermediate step.

The four numbers you should always be able to recall

Function	Derivative	Numerical factor in front of 1/x
$\ln x$	$1/x$	1.0000
$\log_2 x$	$1/(x\ln 2)$	1.4427
$\log_{10} x$	$1/(x\ln 10)$	0.4343
$\log_a x$	$1/(x\ln a)$	$1/\ln a$
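The numerical factors in the table can be regenerated in a few lines. This is a small sketch using only the standard library; the dictionary names are illustrative.

```python
import math

# Base for each named logarithm in the table.
bases = {"ln": math.e, "log2": 2.0, "log10": 10.0}

# The factor in front of 1/x is 1 / ln(a).
factors = {name: 1.0 / math.log(a) for name, a in bases.items()}

for name, c in factors.items():
    print(f"{name:>6}: d/dx = {c:.4f} / x")
```

Rounded to four decimals, the three factors should reproduce the table's last column.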

Interactive: Drag the Slope

The picture is the best argument. In the visualization below, the orange dot is the point where you want the slope. The cyan dashed line is the actual tangent. The readouts confirm that the slope you see is exactly $1/(x\,\ln a)$ . Play with three things:

Drag the dot rightward. Watch the tangent flatten out: the slope shrinks like $1/x$ .
Crank up the base with the slider. The whole curve gets squashed and the slope number drops accordingly.
Turn on "compare bases." The three classical curves all pass through $(1, 0)$ but separate immediately because each one is stretched by its own factor $1/\ln a$ .

Slope of log₂(x) at any point

Drag the orange point along the curve. The tangent line shows the instantaneous slope. Notice how the slope formula 1 / (x · ln a) gives a smaller number when x is larger or when the base a is larger.

Sample readout (base a = 2.00, point x = 4.000): y = log₂(x) = 2.0000, x · ln(a) = 2.7726, slope = 1 / (x · ln a) = 0.3607.

Notice: for bases a > 1, as the base grows the curve gets flatter and the slope shrinks, because ln(a) sits in the denominator. The simplest base is a = e, where ln(e) = 1, the ln(a) factor disappears, and the slope is exactly 1/x. Bases between 1 and e (such as a = 2) have ln(a) < 1 and therefore slopes steeper than 1/x.

Same x, different bases: slope = 1 / (x · ln a)

At a fixed point x, the slope shrinks as the base grows. The shape of the curve gets flatter; the input x is identical — only ln(a) changes in the denominator.

At x = 3.00:

Function	ln(a)	Slope = 1 / (x · ln a)
log₂	0.693	0.48090
ln	1.000	0.33333
log₁₀	2.303	0.14476
log₁₀₀	4.605	0.07238

In the chart, the slope at x = 3.00 is drawn as a horizontal bar for each base. Same x; only ln(a) changes.
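The fixed-x comparison can be reproduced directly from the formula. The sketch below is illustrative: the function name slope is hypothetical, and the base 1.5 is an extra value added here (it is not in the chart) to show a base below e.

```python
import math

def slope(x, a):
    """Slope of log_a at x, from the formula 1 / (x * ln a)."""
    return 1.0 / (x * math.log(a))

x = 3.0
bases = [1.5, 2.0, math.e, 10.0, 100.0]  # 1.5 is an added example base
slopes = [slope(x, a) for a in bases]

for a, s in zip(bases, slopes):
    relation = "steeper than" if s > 1.0 / x else "no steeper than"
    print(f"a={a:8.4f}  ln(a)={math.log(a):6.3f}  slope={s:.5f}  ({relation} 1/x)")
```

Bases below e come out steeper than 1/x and bases above e flatter, so a = e is the crossover where the factor is exactly 1 rather than the steepest case.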

Worked Example (Do-It-By-Hand)

Work through this example with a pencil before you peek at the steps. The numbers are simple so you can see the formula in motion.

Worked example: differentiate $f(x) = \log_2 x$ at $x = 8$

Step 1 — rewrite using the change-of-base identity. $f(x) = \log_2 x = \dfrac{\ln x}{\ln 2}$ . The factor $1/\ln 2$ is a constant, not a function of $x$ .

Step 2 — pull the constant out. $f'(x) = \dfrac{1}{\ln 2}\cdot\dfrac{d}{dx}\ln x = \dfrac{1}{\ln 2}\cdot\dfrac{1}{x} = \dfrac{1}{x\,\ln 2}.$

Step 3 — plug in $x = 8$ . $f'(8) = \dfrac{1}{8 \cdot \ln 2} = \dfrac{1}{8 \cdot 0.69315\ldots} = \dfrac{1}{5.5452\ldots}.$

Step 4 — compute by hand. $f'(8) \approx 0.18034.$

Step 5: geometric sanity check. $\log_2 8 = 3$ (because $2^3 = 8$). One step to the right at $x = 8$ means we go from 8 to 9. The change in $\log_2$ should be approximately $0.180$. Check it: $\log_2 9 - \log_2 8 \approx 3.1699 - 3 = 0.1699$. That under-estimates the tangent slope a bit because $\log_2$ is concave; the secant lies below the tangent. Halve the step to 0.5 and the secant slope rises to $(\log_2 8.5 - \log_2 8)/0.5 \approx 0.1749$, closing in on the true slope $0.18034$. This is exactly the "limit of difference quotients" picture we saw in Chapter 4.

Step 6: second example, with the chain rule. Try $g(x) = \log_{10}(3x^{2} + 5)$. Set the inner function $u = 3x^2 + 5$ so $u' = 6x$. Then $g'(x) = \dfrac{u'}{u\,\ln 10} = \dfrac{6x}{(3x^2 + 5)\,\ln 10}.$ At $x = 1$ this is $\dfrac{6}{8\cdot\ln 10} = \dfrac{6}{18.4207} \approx 0.3257.$

What you should take away: the entire derivation is "factor the base out, then it's just $\ln$ ." You never have to memorize a separate rule for each base.
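The secant-to-tangent picture from Step 5 can be traced numerically. This is a minimal sketch under the assumption that forward secants with a few hand-picked step sizes are enough to show the trend; the step list is arbitrary.

```python
import math

def f(x):
    return math.log2(x)

x0 = 8.0
true_slope = 1.0 / (x0 * math.log(2))  # formula value, about 0.18034

steps = [1.0, 0.5, 0.1, 0.01, 0.001]   # arbitrary shrinking step sizes
secants = [(f(x0 + step) - f(x0)) / step for step in steps]

for step, secant in zip(steps, secants):
    print(f"step={step:<6} secant={secant:.5f} gap={true_slope - secant:.5f}")
```

Because log₂ is concave, each forward secant stays below the tangent slope, and the gap shrinks as the step does.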

The Chain Rule Version

In practice you rarely meet a bare $\log_a x$ . You usually meet $\log_a$ of something more complicated — a polynomial, a ratio, a trig function. The chain rule combines with our formula in the cleanest possible way:

\boxed{\;\dfrac{d}{dx}\bigl[\log_a u(x)\bigr] \;=\; \dfrac{u'(x)}{u(x)\,\ln a}\;}

The recipe: divide $u'$ by $u$ , then divide once more by $\ln a$ . That extra denominator $\ln a$ is the only thing that changes when you switch bases.

Three worked applications of the chain rule

Function	u(x) and u'(x)	Derivative
$\log_2(x^3 + 1)$	$u = x^3+1,\; u' = 3x^2$	$\dfrac{3x^2}{(x^3+1)\,\ln 2}$
$\log_{10}(\sin x)$	$u = \sin x,\; u' = \cos x$	$\dfrac{\cos x}{\sin x\,\ln 10} = \dfrac{\cot x}{\ln 10}$
$\log_5\!\left(\dfrac{1+x}{1-x}\right)$	$u = \dfrac{1+x}{1-x},\; u' = \dfrac{2}{(1-x)^2}$	$\dfrac{2}{(1-x^2)\,\ln 5}$

Why is the formula so "clean"? Because the only piece that depends on x is $\ln u(x)$. The base change just attaches a constant $1/\ln a$. Differentiating a constant times a function differentiates the function and keeps the constant. That is the whole story.
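The three table entries above can be spot-checked against a central difference. The sketch below is illustrative only: the helper names and the test points (1.5, 1.0, 0.3) are assumptions chosen so that each inner function stays positive.

```python
import math

def central_diff(f, x, h=1e-6):
    """Symmetric difference quotient (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def log_base(u, a):
    return math.log(u) / math.log(a)

# (label, function, hand-derived derivative u'/(u ln a), test point)
cases = [
    ("log2(x^3+1)", lambda x: log_base(x**3 + 1, 2),
     lambda x: 3 * x**2 / ((x**3 + 1) * math.log(2)), 1.5),
    ("log10(sin x)", lambda x: log_base(math.sin(x), 10),
     lambda x: 1.0 / (math.tan(x) * math.log(10)), 1.0),
    ("log5((1+x)/(1-x))", lambda x: log_base((1 + x) / (1 - x), 5),
     lambda x: 2.0 / ((1 - x**2) * math.log(5)), 0.3),
]

results = []
for label, f, df, x in cases:
    numeric, analytic = central_diff(f, x), df(x)
    results.append((label, numeric, analytic))
    print(f"{label:>20} at x={x}: numeric={numeric:.8f} analytic={analytic:.8f}")
```

Agreement to many decimals at one point per row is evidence, not proof, that the simplified forms in the table are correct.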

Why ln Is the "Natural" Base for Calculus

Looking at the formula $\dfrac{d}{dx}\log_a x = \dfrac{1}{x\,\ln a}$ , you can immediately see why the natural logarithm gets the adjective "natural":

Among all bases $a > 0, a \ne 1$ , the base $a = e$ is the unique one whose derivative formula does not contain an extra constant. Every other base pays a $1/\ln a$ tax.

This is not a coincidence; it is the definition of $e$ . The number $e$ was chosen precisely so that $\ln e = 1$ , killing the constant factor. Calculus rewards the choice with the cleanest possible derivative.

In practice this means: if you have a choice, work in $\ln$ . Convert to base 2 or base 10 only at the very end, when reporting the answer in "bits" or "decibels." Inside derivatives, gradients, optimizers, and ML training loops, $\ln$ is the right currency.
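As one illustration of "work in ln, convert at the end", the sketch below computes the entropy of a small distribution in nats and converts to bits with a single division by ln 2. The distribution and function name are made up for the example.

```python
import math

def entropy_nats(probs):
    """Shannon entropy using the natural log (result in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]   # hypothetical distribution

h_nats = entropy_nats(probs)
h_bits = h_nats / math.log(2)       # one conversion at the reporting stage

# Same quantity computed in base 2 throughout, for comparison.
h_bits_direct = -sum(p * math.log2(p) for p in probs if p > 0)

print(h_nats, h_bits, h_bits_direct)
```

Both routes give the same number of bits; the first keeps every intermediate step in the base where derivatives carry no extra constant.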

Plain Python: Verify the Formula Numerically

Mathematics tells us $f'(x) = 1/(x\,\ln a)$. The computer is happy to check that, but it does not know about derivatives directly; it knows about finite differences. So we are going to compute the slope two ways:

Analytic: the closed-form formula we just derived, evaluated at our test points.
Numeric: a tiny Δx → 0 approximation of the difference quotient. If both numbers agree to many decimal places, the formula is right.

We will do this for three bases at three test points — nine comparisons total — in plain Python with no external libraries.

Verifying d/dx [log_a(x)] = 1/(x ln a) with finite differences

derivative_of_log_a.py (the full listing follows these step-by-step notes)

Import math

We need just two names from the standard library: math.log (which is natural log, i.e. ln) and math.e (the constant e ≈ 2.718281828). No NumPy, no PyTorch; every operation is visible.

log_a(x, a): change-of-base in code

Python's math.log is base-e. To get log base a we apply the identity we proved: log_a(x) = ln(x) / ln(a). One line, no magic. This is the very same formula we differentiated by hand.

EXAMPLE

log_a(8, 2) → 3.0   because 2^3 = 8

analytic_derivative(x, a): the formula under test

Returns 1 / (x · ln a). This is the answer we are claiming is the derivative. If we are right, the numeric derivative below should agree to many decimal places.

EXECUTION STATE

analytic_derivative(4, 2) = 1/(4·0.6931...) ≈ 0.36067

numeric_derivative(x, a, h): central difference

We do not have to trust the formula. We can ask the function 'what is your slope at x?' by sampling two nearby points and taking the secant slope. The central difference (f(x+h) − f(x−h)) / (2h) is much more accurate than the one-sided forward difference, with error proportional to h² instead of h.

EXAMPLE

h = 1e-6 → truncation error on the order of h² ≈ 10⁻¹²; floating-point round-off adds a somewhat larger error, but both are far below the printout precision.

Bases and points to test

Three bases (2, e, 10) cover the three logarithms anyone actually uses. Three test points (0.5, 4, 100) cover small, medium, and large x — so we can see the slope shrink like 1/x.

EXECUTION STATE

bases = [2, e, 10]

points = [0.5, 4.0, 100.0]

Print the header row

f-string with field widths so the columns line up. We will print analytic, numeric, and absolute difference side by side. If the formula is correct, |diff| will be tiny — on the order of 10⁻¹¹ or 10⁻¹².

Loop: every (base, x) pair, full trace

Nested loop: for each base, walk through each test point. Nine iterations total. Each iteration computes the analytic slope, the numeric slope, and their absolute difference. The agreement between the two is the evidence that the formula is right.

LOOP TRACE · 9 iterations

a	x	analytic	numeric	|diff|
2	0.5	1/(0.5·ln 2) = 2.88539008	2.88539008	≈ 4.8e-11
2	4	1/(4·ln 2) = 0.36067376	0.36067376	≈ 6.0e-12
2	100	1/(100·ln 2) = 0.01442695	0.01442695	≈ 2.4e-13
e	0.5	1/(0.5·1) = 2.00000000	2.00000000	≈ 3.3e-11
e	4	1/(4·1) = 0.25000000	0.25000000	≈ 4.2e-12
e	100	1/(100·1) = 0.01000000	0.01000000	≈ 1.7e-13
10	0.5	1/(0.5·ln 10) = 0.86858896	0.86858896	≈ 1.4e-11
10	4	1/(4·ln 10) = 0.10857362	0.10857362	≈ 1.8e-12
10	100	1/(100·ln 10) = 0.00434294	0.00434294	≈ 7.2e-14

ana = analytic_derivative(x, a)

Compute the formula's answer. Cheap: one multiplication, one division, and one log of the base.

num = numeric_derivative(x, a)

Compute the secant slope across an interval of width 2h around x. This is the operational definition of a derivative from Chapter 4: f'(x) = lim_{h→0} (f(x+h) − f(x−h)) / (2h). We are taking h = 10⁻⁶, which is small enough to be very accurate but large enough that floating-point cancellation is not yet dominant.

diff = |ana − num|

If the formula is right, this is essentially zero — limited only by float64 precision. The expected magnitude is ~10⁻¹¹, which is the noise floor of central differences at h=10⁻⁶ in double precision.

Detect base e

math.e is a float, not exactly equal to the symbolic e. We compare with a tiny tolerance to label that base 'e' in the output instead of '2.71828...'. Just a cosmetic touch.

EXAMPLE

a == math.e is dangerous — use abs(a - math.e) < 1e-9.

Aligned printout

Each column is right-justified with a fixed width using f-string format specifiers. The result is a tidy 5-column table that visually proves the formula matches the numeric truth to ~10 decimal places.


import math

def log_a(x, a):
    """log base a, expressed via the change-of-base identity."""
    return math.log(x) / math.log(a)

def analytic_derivative(x, a):
    """The formula we derived: 1 / (x * ln a)."""
    return 1.0 / (x * math.log(a))

def numeric_derivative(x, a, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h)."""
    return (log_a(x + h, a) - log_a(x - h, a)) / (2 * h)

# Three bases x three test points = nine checks.
bases  = [2, math.e, 10]
points = [0.5, 4.0, 100.0]

print(f"{'base':>6} {'x':>8} {'analytic':>14} {'numeric':>14} {'|diff|':>10}")
for a in bases:
    for x in points:
        ana = analytic_derivative(x, a)
        num = numeric_derivative(x, a)
        diff = abs(ana - num)
        a_str = "e" if abs(a - math.e) < 1e-9 else str(int(a))
        print(f"{a_str:>6} {x:>8.2f} {ana:>14.8f} {num:>14.8f} {diff:>10.2e}")

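Expected output (the analytic and numeric columns should match as shown; the exact digits in the |diff| column can vary with platform and floating-point rounding):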

  base        x       analytic        numeric     |diff|
     2     0.50     2.88539008     2.88539008   4.84e-11
     2     4.00     0.36067376     0.36067376   6.04e-12
     2   100.00     0.01442695     0.01442695   2.39e-13
     e     0.50     2.00000000     2.00000000   3.33e-11
     e     4.00     0.25000000     0.25000000   4.16e-12
     e   100.00     0.01000000     0.01000000   1.67e-13
    10     0.50     0.86858896     0.86858896   1.45e-11
    10     4.00     0.10857362     0.10857362   1.85e-12
    10   100.00     0.00434294     0.00434294   7.20e-14

Read the table

Every |diff| column entry is at most 10⁻¹⁰. That is the difference between "mathematical equality" and "numerical equality." The formula and the finite difference are reporting the same number, to the limit of float64 precision.
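The choice h = 10⁻⁶ is a compromise, and a quick sweep over step sizes shows why. The sketch below is an illustration under the assumption that log₂ at x = 4 is representative; no specific error values are claimed because they depend on floating-point rounding.

```python
import math

def f(x):
    return math.log2(x)

x0 = 4.0
exact = 1.0 / (x0 * math.log(2))

errors = []  # errors[k-1] is the error for h = 10**(-k)
for k in range(1, 13):
    h = 10.0 ** (-k)
    numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
    errors.append(abs(numeric - exact))
    print(f"h=1e-{k:02d}  |error|={errors[-1]:.2e}")
```

For large h the error is dominated by truncation and falls roughly like h²; for very small h, cancellation in f(x+h) − f(x−h) typically makes the error grow again, so the best step sits somewhere in between.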

PyTorch: Autograd Confirms the Same Number

For machine-learning workflows, we never write the derivative by hand. PyTorch's autograd engine records every operation as we go and applies the chain rule in reverse. If we compute $y = \log_2(x)$ and call y.backward(), PyTorch will fill x.grad with $1/(x\,\ln 2)$. Let's verify.

PyTorch autograd: derivative of log_a(x) and log_a(u(x))

autograd_log_a.py (the full listing follows these step-by-step notes)

Import torch and math

torch gives us tensors with autograd; math gives us the natural log of the base — itself a plain Python float, not a tensor. We never need to put the constant ln(a) inside the autograd graph.

xs = tensor([0.5, 4, 100], requires_grad=True)

Three test points packed into one tensor. Setting requires_grad=True tells PyTorch to track every operation that touches xs so it can later answer 'what is the derivative with respect to each entry?'

EXECUTION STATE

xs = tensor([0.5, 4.0, 100.0])

xs.requires_grad = True

xs.grad = None (no backward call yet)

y2 = torch.log(xs) / math.log(2)

Change-of-base again: torch.log is natural log (base e), so dividing by math.log(2) gives log base 2. The division by a Python float is broadcast across the tensor. y2 is a 3-element tensor of log_2 values.

EXAMPLE

y2 ≈ tensor([-1.0000,  2.0000,  6.6439])

y2.sum().backward()

Called without arguments, backward() requires a scalar tensor. By summing y2 first, we are asking for the gradient of the sum. Because each y_j depends only on x_j, d/dx_i [Σ_j y_j] = dy_i/dx_i for each i, so this is exactly how you ask for 'the derivative at each entry' all at once. It's the standard trick for vectorized derivative tests.

EXECUTION STATE

y2.sum() = tensor(7.6439) (the scalar we ask grad of)

xs.grad (after) = tensor([2.8854, 0.3607, 0.0144])

Print autograd result

What autograd returns. Each entry should equal 1/(x_i · ln 2) up to float32 rounding. These are the slopes of log_2 at x = 0.5, 4, and 100.

Print the formula prediction

Compute the formula directly on the raw float values (we call .detach() so this side computation doesn't get tangled in the autograd graph). The two printouts should be identical to the displayed precision — they are.

EXECUTION STATE

1/(xs · ln 2) = tensor([2.8854, 0.3607, 0.0144])

xs.grad.zero_()

Autograd ACCUMULATES gradients into .grad — it does not overwrite. If we ran a second backward without zeroing, the new gradients would be ADDED to the old ones, giving wrong numbers. This is the single most common autograd pitfall; every PyTorch training loop calls zero_grad() before each step for exactly this reason.

EXAMPLE

Without zero_(), xs.grad would become 2× the correct slopes on the next call.

y10 = log_10(xs)

Same construction, base 10 this time. The only difference from the log_2 computation is dividing by math.log(10) ≈ 2.3026 instead of math.log(2) ≈ 0.6931.

y10.sum().backward()

Same scalar-sum trick. After this call, xs.grad holds the three slopes of log_10 at our test points. They should equal 1/(x · ln 10) ≈ [0.8686, 0.1086, 0.00434].

EXECUTION STATE

xs.grad = tensor([0.8686, 0.1086, 0.0043])

Part B: chain rule on u(x) = 3x² + 5

A scalar test of the chain-rule version: g(x) = log_10(3x² + 5). The whole computation graph is just five operations (square, multiply, add, log, divide), but autograd will apply the chain rule across all of them automatically.

EXECUTION STATE

x = tensor(1.0, requires_grad=True)

u = 3 * x**2 + 5

Build the inner function. At x = 1, u(1) = 3·1 + 5 = 8. The graph records: input x → x² → 3·x² → 3·x² + 5. The derivative u'(x) = 6x = 6 at x = 1.

EXECUTION STATE

u(1) = tensor(8., grad_fn=<AddBackward0>)

u'(1) = 6.0 (PyTorch will derive this implicitly)

y = log_10(u)

Apply the change of base to the inner function. At x = 1, y(1) = log_10(8) ≈ 0.9031. The full chain is now: x → u(x) → log(u) → log(u)/ln(10).

y.backward()

Reverse-mode differentiation walks the graph backward, applying the chain rule. The result that lands in x.grad is u'(x) / (u(x) · ln 10) = 6 / (8 · 2.3026) ≈ 0.3257.

EXECUTION STATE

x.grad = tensor(0.3257)

Compare to the hand-derived formula

Direct computation of u'/(u · ln a) with raw Python floats. Should print 0.32572..., matching x.grad to about seven significant digits (x.grad is a float32 value). The chain rule formula and PyTorch agree.

EXAMPLE

6 / (8 * 2.302585) ≈ 0.32572


import torch
import math

# --- Part A: derivative of plain log_a(x) at three bases ------------
xs = torch.tensor([0.5, 4.0, 100.0], requires_grad=True)

# log_2(x) = ln(x) / ln(2)
y2 = torch.log(xs) / math.log(2)
# We need the gradient of the SUM so PyTorch returns a vector of slopes.
y2.sum().backward()
print("d/dx log_2 at [0.5, 4, 100]:", xs.grad)
print("formula 1/(x*ln 2):          ", 1.0 / (xs.detach() * math.log(2)))

# Reset gradients before the next test
xs.grad.zero_()

# log_10(x) = ln(x) / ln(10)
y10 = torch.log(xs) / math.log(10)
y10.sum().backward()
print("\nd/dx log_10 at [0.5, 4, 100]:", xs.grad)
print("formula 1/(x*ln 10):           ", 1.0 / (xs.detach() * math.log(10)))

# --- Part B: chain rule on u(x) = 3x^2 + 5, base 10 -----------------
x = torch.tensor(1.0, requires_grad=True)
u = 3 * x**2 + 5            # u(1) = 8
y = torch.log(u) / math.log(10)
y.backward()
print("\nd/dx log_10(3x^2 + 5) at x=1:")
print("  autograd:", x.grad.item())
print("  formula :", (6 * 1.0) / (8.0 * math.log(10)))

Why the agreement is not a coincidence. PyTorch autograd is not running finite differences in the background. It recorded the chain of operations (torch.log → divide by constant → sum) and applied the analytic derivative of each elementary operation in turn. The fact that it lands on the same numbers as the formula we derived by hand confirms that our derivation is the true rule, not an approximation.
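The idea of "apply the analytic derivative of each elementary operation" can be imitated in plain Python with dual numbers. Note the swap: this toy uses forward-mode differentiation, whereas PyTorch uses reverse mode, and it is not how PyTorch is implemented. The class Dual and the helpers _lift and log_base are hypothetical names for this sketch.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """A value together with its derivative with respect to x."""
    val: float
    dot: float

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def _lift(v):
    """Treat plain numbers as constants (derivative 0)."""
    return v if isinstance(v, Dual) else Dual(float(v), 0.0)

def log_base(u, a):
    """log_a(u) with its derivative u' / (u * ln a)."""
    c = 1.0 / math.log(a)
    return Dual(c * math.log(u.val), c * u.dot / u.val)

x = Dual(1.0, 1.0)        # seed: dx/dx = 1
u = 3 * x * x + 5         # u(1) = 8, u'(1) = 6
g = log_base(u, 10)
print(g.val, g.dot)       # value and slope of log_10(3x^2 + 5) at x = 1
```

No step size h appears anywhere, which is the point: the slope comes from exact per-operation rules, so the only error is ordinary floating-point rounding.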

Application: Information Theory and log₂

A satisfying place to land is in the place where $\log_2$ is the unquestioned king: information theory. Encoding one out of $N$ equally likely outcomes costs exactly $\log_2 N$ bits. The derivative we just derived tells you the marginal cost of making N a little larger:

\dfrac{d}{dN}\bigl[\log_2 N\bigr] \;=\; \dfrac{1}{N\,\ln 2}\;\;\text{bits per outcome.}

Two things to feel here. First, doubling $N$ always adds exactly one bit, a discrete jump that the derivative averages out into a continuous slope. Second, that slope shrinks like $1/N$, so growing a code from $N = 1\,000$ to $N = 1\,001$ costs almost nothing (about 0.0014 bits), but growing from 2 to 3 costs more than half a bit ($\log_2 3 - 1 \approx 0.585$).

Why we care about d/dN [log₂(N)]

In information theory, the number of bits needed to encode one of N equally likely outcomes is log₂(N). The derivative tells us: doubling N adds one bit, but the marginal cost of adding one more option shrinks as N grows.

Sample readout (N = 32): bits needed log₂(32) = 5.000; marginal bits per outcome 1 / (N · ln 2) = 4.508e-2; bits to double N, log₂(2N) − log₂(N) = +1.000.

Quick reference table

situation	N	log₂(N) bits	d/dN [log₂N]
coin flip	2	1.0000	7.21e-1
die roll	6	2.5850	2.40e-1
letter of alphabet	26	4.7004	5.55e-2
kilobyte address	1024	10.0000	1.41e-3
millionth element of a sorted list	1000000	19.9316	1.44e-6

Encoding the 1,000,001st item from a million-item list costs only about 1.4 millionths of a bit. Information has diminishing returns, and the derivative of log₂ is exactly that statement.
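The reference table can be regenerated, and the derivative compared with the exact cost of adding one more outcome. This is a sketch with illustrative function names; the N values are the ones from the table.

```python
import math

def bits(n):
    """Bits needed for one of n equally likely outcomes."""
    return math.log2(n)

def marginal_bits(n):
    """Derivative d/dN log2(N) = 1 / (N ln 2)."""
    return 1.0 / (n * math.log(2))

rows = []
for n in (2, 6, 26, 1024, 1_000_000):
    exact_step = bits(n + 1) - bits(n)   # true cost of one extra outcome
    rows.append((n, bits(n), marginal_bits(n), exact_step))
    print(f"N={n:>8}  bits={bits(n):8.4f}  "
          f"d/dN={marginal_bits(n):.2e}  exact step={exact_step:.2e}")
```

The exact step is always a little smaller than the derivative because log₂ is concave, and the two converge as N grows.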

The same idea appears every time machine learning measures "perplexity" $= 2^{\text{cross-entropy}}$ : the cross-entropy is a sum of log₂ terms, and the gradient with respect to any predicted probability inherits a $1/(p\,\ln 2)$ factor — the very formula of this section.
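The $1/(p\,\ln 2)$ factor can be checked on a single cross-entropy term. The sketch below assumes the loss for one example is $-\log_2 p$, where p is the probability assigned to the true outcome; the function names and the value p = 0.25 are illustrative.

```python
import math

def loss_bits(p):
    """Cross-entropy contribution, in bits, of a true outcome with predicted probability p."""
    return -math.log2(p)

def grad_loss_bits(p):
    """Hand-derived gradient: d/dp [-log2(p)] = -1 / (p ln 2)."""
    return -1.0 / (p * math.log(2))

p = 0.25      # hypothetical predicted probability
h = 1e-6
numeric = (loss_bits(p + h) - loss_bits(p - h)) / (2 * h)
analytic = grad_loss_bits(p)
print(numeric, analytic)
```

The gradient is negative (raising p lowers the loss) and its magnitude grows as p shrinks, so confident mistakes receive large corrections.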

Summary

One identity unlocks everything: $\log_a x = \dfrac{\ln x}{\ln a}$ . The right side is just a constant times $\ln x$ .
Plain derivative: $\dfrac{d}{dx}\log_a x = \dfrac{1}{x\,\ln a}$ .
Chain-rule version: $\dfrac{d}{dx}\log_a u(x) = \dfrac{u'(x)}{u(x)\,\ln a}$ .
Geometric meaning: bigger base $a$ ⇒ flatter curve ⇒ smaller slope, because $\ln a$ sits in the denominator.
Calculus prefers $\ln$ : base $e$ is the unique base whose derivative carries no extra constant. Convert to base 2 or base 10 only at the reporting stage.
Verified twice: finite differences in plain Python agree with the formula to roughly $10^{-10}$ or better, and PyTorch autograd reproduces the same numbers to float32 precision.

Exercises

Differentiate $f(x) = \log_7(x)$ and evaluate $f'(49)$ . Sanity-check against the formula $1/(x\,\ln a)$ . (Answer: $1/(49\,\ln 7) \approx 0.01049$ .)
Differentiate $g(x) = \log_2(x^2 + 1)$ and evaluate at $x = 1$ . (Hint: $u = x^2 + 1$ , $u' = 2x$ . Answer: $2/(2\,\ln 2) = 1/\ln 2 \approx 1.4427$ .)
Find the x-coordinate where the tangent to $y = \log_{10} x$ has slope $1/10$ . (Answer: set $1/(x\,\ln 10) = 1/10$ ⇒ $x = 10/\ln 10 \approx 4.343$ .)
Show that $\dfrac{d}{dx}\log_a(kx) = \dfrac{d}{dx}\log_a(x)$ for any positive constant $k$ . (Hint: use $\log_a(kx) = \log_a k + \log_a x$ ; the constant differentiates to zero.)
Open the interactive slope explorer. Set base $a = e^2$ using the custom slider. Read off the slope at $x = 5$ . Does it match the formula prediction $1/(5 \cdot 2) = 0.1$ ? (It should.)
A 256-symbol alphabet (ASCII) costs $\log_2 256 = 8$ bits per symbol. Use the derivative $1/(N \ln 2)$ to estimate the cost per symbol if the alphabet grew to 260 symbols. Compare against the exact answer $\log_2 260 - 8$ . (Linear approximation: $\approx 4/(256 \ln 2) \approx 0.0225$ extra bits. Exact: $\approx 0.0224$ .)
