Chapter 5

Logarithmic Differentiation

Derivatives of Exponential, Logarithmic, and Trigonometric Functions

Learning Objectives

By the end of this section you will be able to:

  1. Recognize the three function shapes that demand logarithmic differentiation: variable base with variable exponent (like $x^x$), messy products and quotients, and towers of exponents.
  2. Apply the 4-step recipe: take $\ln$ of both sides, differentiate implicitly, simplify, multiply back by $y$.
  3. Derive the master formula $f'(x) = f(x)\,\dfrac{d}{dx}\bigl[\ln f(x)\bigr]$ from $(\ln y)' = y'/y$.
  4. Use log-diff to find critical points of functions the basic power and exponential rules cannot touch.
  5. Verify any log-diff answer in Python with SymPy and in PyTorch with autograd.
  6. Connect the trick to log-likelihoods in machine learning — the reason every ML paper differentiates sums of logs and never products of probabilities.

Why We Need a New Trick

So far in this chapter you have learned three derivative machines, and each one solves exactly one job:

| Function shape | Rule that works | Example |
| --- | --- | --- |
| Power: x^n (n is a fixed number) | Power rule | d/dx(x^3) = 3x^2 |
| Exponential: a^x (a is a fixed number) | Exponential rule | d/dx(2^x) = 2^x · ln 2 |
| Logarithm: log_a(x) | Log rule | d/dx(ln x) = 1/x |

Now look at $f(x) = x^x$. The base is $x$ (changing). The exponent is $x$ (also changing). The power rule wants the exponent to be constant. The exponential rule wants the base to be constant. Both refuse.

The pain point: tools that assume one slot is fixed cannot differentiate a function where both slots move at once.

Logarithmic differentiation is the universal escape hatch. The recipe is short, the algebra is mostly bookkeeping, and once you see it you will recognize whole new families of functions you can now differentiate by hand.

Big idea in one sentence: taking the natural log of both sides turns multiplication into addition, exponentiation into multiplication, and a stack of operators into a simple sum of friendly terms. Once the equation is a sum, every basic rule applies.

Intuition: ln() Linearizes Multiplication

Imagine you are an accountant and someone hands you the equation

$$y \;=\; x^2 \cdot (x+1)^3 \cdot e^{x}$$

Differentiating this directly is a triple-nested product rule and chain rule. You will fill half a page with parentheses and you will probably drop a sign somewhere. Painful.

But the same equation, viewed through $\ln$, becomes

$$\ln y \;=\; 2\ln x \;+\; 3\ln(x+1) \;+\; x$$

That is a sum. Three simple terms. The right side is now a job for the sum rule plus three one-line applications of the log derivative.
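A quick numerical check can make the claim concrete. The sketch below (function names are illustrative, not from the original text) evaluates the product and its log-form at a few points, then compares the derivative obtained from the three-term sum with a plain central difference on the original product.

```python
import math

def y(x: float) -> float:
    """The original product: x^2 * (x+1)^3 * e^x."""
    return x**2 * (x + 1)**3 * math.exp(x)

def log_y(x: float) -> float:
    """The same function after taking ln: a plain sum of three terms."""
    return 2 * math.log(x) + 3 * math.log(x + 1) + x

def dy_via_log(x: float) -> float:
    """y' = y * (ln y)', where (ln y)' is differentiated term by term."""
    return y(x) * (2 / x + 3 / (x + 1) + 1)

def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Symmetric difference quotient, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.5, 1.0, 2.0):
    print(f"x={x}  ln(y)={math.log(y(x)):.6f}  sum={log_y(x):.6f}  "
          f"via_log={dy_via_log(x):.6f}  numeric={central_difference(y, x):.6f}")
```

The two log columns should coincide, and so should the two derivative columns, up to floating-point noise.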

Analogy. Think of $\ln$ as a decoder ring that turns multiplicative structure into additive structure. We use it any time the original function is glued together by multiplication or exponentiation and we wish it were glued together by addition instead.

This is the same reason ML engineers always work with log-likelihoods instead of likelihoods: probabilities multiply, log-probabilities add, and additive things are easier to optimize, store, and differentiate.


The 4-Step Recipe

Given an equation $y = f(x)$ where $f$ is a tangle of products, quotients, or variable-exponent towers (and $f(x) > 0$ near the point of interest):

  1. Take ln of both sides. The left becomes $\ln y$. Use the log laws to break the right side into a sum: $\ln(ab) = \ln a + \ln b$, $\ln(a/b) = \ln a - \ln b$, $\ln(a^c) = c\ln a$.
  2. Differentiate both sides with respect to x. The left side gives $\dfrac{1}{y}\cdot\dfrac{dy}{dx}$ by the chain rule (because $y$ is itself a function of $x$). The right side is a sum, so differentiate term by term.
  3. Simplify the right side. Plain algebra. No calculus left.
  4. Multiply both sides by y to solve for $dy/dx$, then substitute the original formula for $y$ so the final answer is in pure $x$.

Step 2 is where almost every beginner mistake lives. The derivative of the left side is not simply $1/y$. That would be true if $y$ were the variable of differentiation. But $y$ is a function of $x$, so the chain rule fires and produces an extra factor of $dy/dx$.
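As a further illustration of the four steps, here is the recipe applied to $y = (x^2+1)^x$. This function is chosen only as an example and is not used elsewhere in the section; its base is positive for every real $x$, so the logarithm is always defined.

```latex
\begin{aligned}
y &= (x^2+1)^{x} \\
\text{Step 1:}\quad \ln y &= x\,\ln(x^2+1) \\
\text{Step 2:}\quad \frac{1}{y}\frac{dy}{dx} &= \ln(x^2+1) + x\cdot\frac{2x}{x^2+1} \\
\text{Step 3:}\quad \frac{1}{y}\frac{dy}{dx} &= \ln(x^2+1) + \frac{2x^2}{x^2+1} \\
\text{Step 4:}\quad \frac{dy}{dx} &= (x^2+1)^{x}\left[\ln(x^2+1) + \frac{2x^2}{x^2+1}\right]
\end{aligned}
```

Note that Step 2 needed the product rule on the right and the chain rule inside $\ln(x^2+1)$, but nothing beyond the basic rules.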

Interactive Walkthrough on x^x

[Interactive walkthrough: steps through the recipe on $y = x^x$. Line 1 contains an exponent; line 4 contains only sums and products.]

The Master Formula

Strip away the names and the entire procedure is one identity. Starting from $g(x) = \ln f(x)$ and applying the chain rule,

$$g'(x) \;=\; \frac{d}{dx}\,\ln f(x) \;=\; \frac{f'(x)}{f(x)}$$

Solving for $f'(x)$:

$$\boxed{\,f'(x) \;=\; f(x)\,\cdot\,\dfrac{d}{dx}\bigl[\ln f(x)\bigr]\,}$$

Read it out loud: “the derivative of f is f times the derivative of its log.” Every example in the rest of this section is a deliberate exercise of this one identity.

The expression $\dfrac{f'(x)}{f(x)}$ is called the logarithmic derivative of $f$. It measures relative rate of change: what fraction of itself $f$ changes by per unit $x$. In finance this is the instantaneous return; in biology, the instantaneous growth rate; in ML, the log-likelihood gradient.
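To see the "relative rate" reading in action, the sketch below estimates the logarithmic derivative of a hypothetical exponential growth model $f(t) = p_0 e^{rt}$ (the model and the helper names are assumptions made for illustration). The starting size $p_0$ drops out entirely, leaving only the rate $r$.

```python
import math

def log_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of (ln f)'(x) = f'(x) / f(x); needs f > 0 near x."""
    return (math.log(f(x + h)) - math.log(f(x - h))) / (2 * h)

def make_growth(p0: float, r: float):
    """Hypothetical growth model f(t) = p0 * exp(r * t)."""
    return lambda t: p0 * math.exp(r * t)

small = make_growth(p0=10.0, r=0.05)          # starts at 10
large = make_growth(p0=1_000_000.0, r=0.05)   # starts at one million

for t in (0.0, 5.0, 20.0):
    print(f"t={t:5.1f}  small: {log_derivative(small, t):.6f}  "
          f"large: {log_derivative(large, t):.6f}")
```

Both columns should sit at the growth rate 0.05 for every t, even though the two functions differ in absolute size by a factor of 100,000.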

Worked Example: $f(x) = x^x$ at $x = 2$

We will walk the full 4-step recipe and end with a single number. Try the steps on paper before reading the walkthrough.

Step 1: Take ln of both sides.
$$\ln y \;=\; \ln\!\bigl(x^x\bigr) \;=\; x\,\ln x$$
The log law $\ln(a^b) = b\ln a$ just brought the exponent down out of the tower.
Step 2: Differentiate both sides. On the left, chain rule:
$$\frac{1}{y}\,\frac{dy}{dx} \;=\; \frac{d}{dx}\bigl[x\,\ln x\bigr]$$
On the right, product rule with $u = x,\; v = \ln x$:
$$(uv)' \;=\; u'v + uv' \;=\; (1)(\ln x) + (x)\!\left(\tfrac{1}{x}\right) \;=\; \ln x + 1$$
Step 3: Combine.
$$\frac{1}{y}\,\frac{dy}{dx} \;=\; \ln x + 1$$
Pure algebra from here.
Step 4: Solve for dy/dx. Multiply both sides by $y$ and substitute $y = x^x$:
$$\frac{dy}{dx} \;=\; y\,(\ln x + 1) \;=\; x^{x}\,(\ln x + 1)$$
Plug in x = 2.
$$f'(2) \;=\; 2^{2}\,(\ln 2 + 1) \;=\; 4\,(0.6931 + 1) \;=\; 4 \cdot 1.6931 \;\approx\; 6.7726$$
Sanity check by tiny step. A forward difference with $h = 0.001$:
f(2.001) − f(2) ≈ 4.006779 − 4.000000 = 0.006779
divide by h: 0.006779 / 0.001 ≈ 6.779

This agrees with our analytic $6.7726$ to within about $0.007$, which is the error we expect from a first-order difference: roughly $\tfrac{h}{2}\,f''(2) \approx 0.0005 \times 13.5$.
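The sanity check is easy to reproduce. The sketch below computes a forward difference and, for comparison, a central difference at $x = 2$ with $h = 0.001$, and reports how far each lands from the analytic value $x^x(\ln x + 1)$. The variable names are illustrative.

```python
import math

def f(x: float) -> float:
    return x ** x

def f_prime_exact(x: float) -> float:
    """The hand-derived result: x^x * (ln x + 1)."""
    return x ** x * (math.log(x) + 1)

x0, h = 2.0, 1e-3
forward = (f(x0 + h) - f(x0)) / h             # first-order accurate
central = (f(x0 + h) - f(x0 - h)) / (2 * h)   # second-order accurate
exact = f_prime_exact(x0)

print(f"exact   = {exact:.6f}")
print(f"forward = {forward:.6f}  error = {abs(forward - exact):.2e}")
print(f"central = {central:.6f}  error = {abs(central - exact):.2e}")
```

The forward difference should be off in the third decimal place, while the central difference, using the same step, should land several orders of magnitude closer.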

Interactive Log-Diff Explorer (4 Functions)

[Interactive explorer: for each of four functions, a probe slider reads off $f(x)$, $\ln f(x)$, and $f'(x)$ together. The log curve stays tame and smooth even where the original function explodes.]

What to look for. For $x^x$, the derivative $f'(x)$ crosses zero at $x = 1/e \approx 0.37$. For $x^{1/x}$, it crosses zero at $x = e \approx 2.718$. Both critical points fall straight out of the log-diff identity, which none of the basic rules from earlier in the chapter could produce.

Payoff: Finding the Minimum of $x^x$

Setting $f'(x) = x^x(\ln x + 1) = 0$ and using the fact that $x^x > 0$ for every $x > 0$, the only way the product can be zero is $\ln x + 1 = 0$, i.e. $x = e^{-1} = 1/e \approx 0.3679$. At that point,

$$f(1/e) \;=\; \left(\tfrac{1}{e}\right)^{1/e} \;\approx\; 0.6922$$

Because $\ln x + 1$ is negative for $x < 1/e$ and positive for $x > 1/e$, the function falls and then rises, so this is the global minimum of $x^x$ on $(0, \infty)$.

Why this is impressive. Neither the power rule nor the exponential rule even applies to $x^x$, let alone locates its critical point. At this level of math, the minimum at $x = 1/e$ only becomes visible once you take the logarithm.
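The critical point can also be located numerically. The sketch below runs a plain bisection on $(\ln f)'(x) = \ln x + 1$, which has the same sign as $f'(x)$ because $x^x > 0$. The bracket $[0.1, 1.0]$ and the helper names are choices made for this illustration.

```python
import math

def f(x: float) -> float:
    return x ** x

def log_slope(x: float) -> float:
    """(ln f)'(x) = ln x + 1; f'(x) has the same sign because x^x > 0."""
    return math.log(x) + 1

def bisect_root(g, lo: float, hi: float, tol: float = 1e-12) -> float:
    """Plain bisection; assumes g(lo) and g(hi) have opposite signs."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid      # sign change is in the left half
        else:
            lo = mid      # sign change is in the right half
    return 0.5 * (lo + hi)

x_star = bisect_root(log_slope, 0.1, 1.0)
print(f"critical point = {x_star:.10f}   1/e = {1 / math.e:.10f}")
print(f"minimum value  = {f(x_star):.6f}")
print(f"neighbours     = {f(0.2):.6f}, {f(0.6):.6f}")
```

The located point should agree with $1/e$, and the function values on either side should both be larger than the value at the critical point.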

Pattern 2: Messy Products and Quotients

Even when the exponents are constant, log-diff is a labor-saver whenever you have many factors. Consider

$$y \;=\; \dfrac{(x+1)^2\,\sqrt{x-1}}{(x+3)^4}$$

Direct differentiation requires the quotient rule wrapping a product rule wrapping the chain rule. Through $\ln$ the same thing becomes

$$\ln y \;=\; 2\ln(x+1) \;+\; \tfrac{1}{2}\ln(x-1) \;-\; 4\ln(x+3)$$

Differentiate each term in isolation:

$$\frac{1}{y}\frac{dy}{dx} \;=\; \frac{2}{x+1} \;+\; \frac{1}{2(x-1)} \;-\; \frac{4}{x+3}$$

Multiply back by $y$ and we are done.

Pattern. Whenever a function has $n$ factors connected by $\times$ or $\div$, log-diff converts its derivative into a single sum of $n$ simple rational terms. That is a linear-cost-in-$n$ procedure where the direct method is closer to quadratic.
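The "one term per factor" pattern translates directly into code. In the sketch below, each factor is stored as a triple (base, derivative of base, exponent), an encoding invented for this illustration, and the derivative is assembled as the function value times a sum with one entry per factor.

```python
import math

# Each factor is (u, du, c): the factor u(x)^c together with the derivative du of its base.
# A negative exponent c encodes a factor in the denominator.
FACTORS = [
    (lambda x: x + 1, lambda x: 1.0, 2.0),    # (x+1)^2
    (lambda x: x - 1, lambda x: 1.0, 0.5),    # sqrt(x-1)
    (lambda x: x + 3, lambda x: 1.0, -4.0),   # 1 / (x+3)^4
]

def value(factors, x: float) -> float:
    """y(x) as the product of all factors."""
    return math.prod(u(x) ** c for u, _, c in factors)

def log_derivative(factors, x: float) -> float:
    """(ln y)'(x): one term c * u'(x) / u(x) per factor."""
    return sum(c * du(x) / u(x) for u, du, c in factors)

def derivative(factors, x: float) -> float:
    """y'(x) = y(x) * (ln y)'(x)."""
    return value(factors, x) * log_derivative(factors, x)

x0, h = 2.0, 1e-6
numeric = (value(FACTORS, x0 + h) - value(FACTORS, x0 - h)) / (2 * h)
print(f"log-diff = {derivative(FACTORS, x0):.10f}")
print(f"numeric  = {numeric:.10f}")
```

Adding another factor means appending one more triple; nothing else in the code changes, which mirrors the linear growth in work described above.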

Pattern 3: Variable Base, Variable Exponent

This is the family that the basic power and exponential rules cannot handle. Examples:

| Function | Log-form | Derivative |
| --- | --- | --- |
| y = x^x | ln y = x ln x | y' = x^x (ln x + 1) |
| y = x^{sin x} | ln y = sin(x) · ln x | y' = x^{sin x} ( cos x · ln x + sin x / x ) |
| y = (sin x)^x (with sin x > 0) | ln y = x · ln(sin x) | y' = (sin x)^x ( ln(sin x) + x·cot x ) |
| y = x^{1/x} | ln y = (1/x) · ln x | y' = x^{1/x} · (1 − ln x) / x^2 |

Every row above came out of the same 4-step recipe. Once the log step linearizes things, you are doing standard sum-rule and product-rule moves — no special memorization required.
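Each row of the table can be spot-checked numerically. The sketch below pairs every function with its claimed derivative and compares the claim against a central difference at the probe point $x = 1.3$, chosen here because $\sin(1.3) > 0$ so all four rows are defined.

```python
import math

# (name, function, claimed derivative) for each row of the table
ROWS = [
    ("x^x",
     lambda x: x ** x,
     lambda x: x ** x * (math.log(x) + 1)),
    ("x^sin x",
     lambda x: x ** math.sin(x),
     lambda x: x ** math.sin(x) * (math.cos(x) * math.log(x) + math.sin(x) / x)),
    ("(sin x)^x",
     lambda x: math.sin(x) ** x,
     lambda x: math.sin(x) ** x * (math.log(math.sin(x)) + x / math.tan(x))),
    ("x^(1/x)",
     lambda x: x ** (1 / x),
     lambda x: x ** (1 / x) * (1 - math.log(x)) / x ** 2),
]

def central_difference(f, x: float, h: float = 1e-6) -> float:
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
for name, f, claimed in ROWS:
    print(f"{name:10s} claimed={claimed(x0):.8f}  numeric={central_difference(f, x0):.8f}")
```

A mismatch in any row would point to an algebra slip in the corresponding derivative.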


Common Pitfalls (Read This!)

Pitfall 1: forgetting the chain rule on the left side. After taking $\ln$ of both sides, the left is $\ln y$, and its derivative with respect to $x$ is $(1/y) \cdot (dy/dx)$, not just $1/y$. Skipping the $dy/dx$ is the #1 mistake.
Pitfall 2: applying it where $f$ is not strictly positive. $\ln$ is undefined on $(-\infty, 0]$. If $f$ changes sign, work with $\ln|f(x)|$ (the identity $(\ln|f|)' = f'/f$ still holds wherever $f \neq 0$) and be extra careful at the zeros of $f$ themselves.
Pitfall 3: confusing $x^x$ with $x^n$ or $a^x$. $\dfrac{d}{dx}\,x^x \neq x\,x^{x-1}$ (that would be the power rule) and $\neq x^x \ln x$ (that would be the exponential rule, which assumes a constant base). The correct derivative is $x^x(\ln x + 1)$, and the extra $+1$ exists because the base is changing too.
Pitfall 4: forgetting to substitute $y$ back at the end. Your answer should be in pure $x$, not in $y$. Step 4 is finished only after you replace $y$ with the original formula.
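Pitfall 2 mentions the $\ln|f|$ variant for functions that change sign. A minimal sketch of that variant follows, using a hypothetical test function $f(x) = (x-1)(x+2)$ that is negative on $(-2, 1)$; the function and helper names are assumptions for illustration.

```python
import math

def log_abs_diff(f, x: float, h: float = 1e-6) -> float:
    """Estimate f'(x) as f(x) * d/dx ln|f(x)|; valid only where f(x) != 0."""
    fx = f(x)
    if fx == 0:
        raise ValueError("ln|f| is undefined at a zero of f")
    g_prime = (math.log(abs(f(x + h))) - math.log(abs(f(x - h)))) / (2 * h)
    return fx * g_prime   # multiply by the signed f(x), not |f(x)|

def f(x: float) -> float:
    """Hypothetical sign-changing test function: negative on (-2, 1)."""
    return (x - 1) * (x + 2)

def f_prime(x: float) -> float:
    """Exact derivative of x^2 + x - 2, used for comparison."""
    return 2 * x + 1

for x in (-3.0, 0.0, 2.0):
    print(f"x={x:5.1f}  f={f(x):6.2f}  via ln|f|={log_abs_diff(f, x):.8f}  exact={f_prime(x):.8f}")
```

At $x = 0$ the function is negative, yet the estimate still matches the exact derivative, because the sign is restored when multiplying by $f(x)$ itself.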

Python: Symbolic + Numerical Verification

Plain Python first. We will implement the log-diff identity $f'(x) = f(x)\cdot(\ln f)'(x)$ as a numerical estimator, then compare it against the analytic answer we derived by hand. If our algebra was right, the two columns should agree to roughly nine significant digits.

log_diff.py: pen-and-paper log-diff, in numerical form

Line 1: Imports, just the standard library

We only need math.log and a type hint for the callable. No NumPy. The whole point of this section is to make logarithmic differentiation feel mechanical, so we stay close to pen-and-paper math.

EXECUTION STATE
math.log = natural logarithm, base e
Callable = type alias for a function-valued argument
Line 4: Function signature: log_diff(f, x, h)

This is the central object of the file: a numerical estimator of f'(x) that goes through ln f first. Inputs are the function f, the point x, and the step size h used for the central difference on ln f.

EXAMPLE
log_diff(x_to_the_x, 2.0) → 6.7726
EXECUTION STATE
f = Callable[[float], float] — the function we want to differentiate
x = the point where we want f'(x)
h = 1e-6 (step size for the symmetric difference)
Line 14: Evaluate f(x) once

We call f(x) exactly once because (a) we need it later as a multiplier and (b) repeated calls would waste work. fx is the y-value of the original function at the probe point.

EXECUTION STATE
fx = f(x), e.g. 2^2 = 4.0 when x = 2
Line 15: Guard: log-diff only works where f > 0

ln is only defined for positive arguments, so log-diff requires f(x) > 0 in a neighborhood. For functions like x^x with x > 0, this is always fine. For signed functions, you would first take the absolute value (the identity d/dx ln|f| = f'/f still works).

EXECUTION STATE
ValueError = raised if f(x) ≤ 0
Line 19: Sample ln f at x + h

We push the input forward by h and immediately apply ln. ln f(x + h) is the value of the auxiliary, linearized function g(x) = ln f(x), evaluated slightly to the right of x.

EXAMPLE
f(2 + 1e-6) = 2.000001^2.000001 ≈ 4.0000068
ln(...) ≈ 1.3862960543
EXECUTION STATE
g_plus = ln f(x + h), e.g. 1.3863 for x = 2
Line 20: Sample ln f at x − h

Same idea, but a step to the left. We need both sides for the symmetric (central) difference, which has O(h²) accuracy instead of the O(h) accuracy of a one-sided difference.

EXECUTION STATE
g_minus = ln f(x − h)
Line 21: Central difference quotient

This estimates the derivative of g(x) = ln f(x) at the probe point. The (g_plus − g_minus) / (2h) form cancels the leading-order error term of the Taylor expansion, leaving an O(h²) approximation.

EXAMPLE
For f = x^x at x = 2:
  g_prime ≈ (1.3862960543 − 1.3862926680) / 2e-6
  g_prime ≈ 1.6931
EXECUTION STATE
g_prime = (g_plus − g_minus) / (2h), e.g. 1.6931 at x = 2
Line 24: Multiply back by f(x), the log-diff identity

Here is the whole point: d/dx ln f(x) = f'(x) / f(x), so f'(x) = f(x) · d/dx ln f(x). We just plug in the numerical estimate of d/dx ln f(x) and multiply by the original f(x). One identity does the entire job.

EXAMPLE
At x = 2:
  f'(2) ≈ 4.0 * 1.6931 = 6.7726
  (matches the analytic answer)
EXECUTION STATE
return value = f(x) · g_prime — our estimate of f'(x)
Line 27: Concrete test function: x ↦ x^x

We pick the canonical "hard" function for log-diff. The basic power rule says d/dx x^n = n x^(n−1) but n is constant. The basic exponential rule says d/dx a^x = a^x ln a but a is constant. x^x violates *both*. Log-diff is the way out.

Line 33: Analytical answer (so we can grade the numerics)

From the algebra ln(x^x) = x ln x → (ln f)' = ln x + 1 → f' = f · (ln x + 1) = x^x (ln x + 1). We include this so the test prints both the numerical estimate and the exact value side-by-side.

EXAMPLE
At x = 2:
  exact = 4 · (ln 2 + 1) = 4 · 1.6931 = 6.7726
Lines 37-38: Driver loop, five sample points

We sweep x through 0.5, 1.0, 1.5, 2.0, 2.5 so the reader can see log-diff agree with the analytic formula at several points. All five lie in the increasing region (x > 1/e ≈ 0.368); add a point such as x = 0.2 to probe the decreasing region as well.

LOOP TRACE · 5 iterations
x = 0.5
f(x) = 0.5^0.5 ≈ 0.7071
numerical = 0.7071 · (ln 0.5 + 1) ≈ 0.2170
analytic = 0.2170
x = 1.0
f(x) = 1.0^1.0 = 1.0
numerical = 1.0 · (ln 1 + 1) = 1.0000
analytic = 1.0000
x = 1.5
f(x) = 1.5^1.5 ≈ 1.8371
numerical = 1.8371 · (ln 1.5 + 1) ≈ 2.5820
analytic = 2.5820
x = 2.0
f(x) = 2^2 = 4.0
numerical = 4.0 · (ln 2 + 1) ≈ 6.7726
analytic = 6.7726
x = 2.5
f(x) = 2.5^2.5 ≈ 9.8821
numerical = 9.8821 · (ln 2.5 + 1) ≈ 18.937
analytic = 18.937
Lines 41-43: Print row, a readable diagnostic table

Each row shows x, f(x), the numerical log-diff estimate, the analytic exact value, and the absolute error. With h = 1e-6 the error column should be tiny, on the order of 1e-9 or smaller. At this step size the O(h²) truncation error is far smaller than that, so what you see is mostly floating-point round-off in the difference quotient.

```python
import math
from typing import Callable

def log_diff(f: Callable[[float], float], x: float, h: float = 1e-6) -> float:
    """
    Numerically compute f'(x) using the *logarithmic-differentiation identity*

        f'(x) = f(x) * d/dx [ ln f(x) ]

    instead of differencing f directly. For functions like f(x) = x^x this
    is dramatically more stable because ln(x^x) = x ln(x) is a smooth,
    well-behaved sum, while the original f explodes super-exponentially.
    """
    fx = f(x)
    if fx <= 0:
        raise ValueError("log-diff requires f(x) > 0 in a neighborhood of x")

    # central difference on g(x) = ln f(x)
    g_plus  = math.log(f(x + h))
    g_minus = math.log(f(x - h))
    g_prime = (g_plus - g_minus) / (2 * h)

    # multiply back by f(x) to recover f'(x)
    return fx * g_prime


def x_to_the_x(x: float) -> float:
    return x ** x


# Analytical answer derived in the section:
# f(x) = x^x  =>  f'(x) = x^x * (ln x + 1)
def x_to_the_x_prime_analytic(x: float) -> float:
    return (x ** x) * (math.log(x) + 1)


if __name__ == "__main__":
    for x in [0.5, 1.0, 1.5, 2.0, 2.5]:
        numerical = log_diff(x_to_the_x, x)
        analytic  = x_to_the_x_prime_analytic(x)
        print(f"x={x:>4}  f(x)={x_to_the_x(x):10.4f}  "
              f"num={numerical:10.4f}  exact={analytic:10.4f}  "
              f"err={abs(numerical - analytic):.2e}")
```
Run the script and you will see error column entries on the order of 1e-9 or smaller for every row. At h = 1e-6 the O(h²) truncation error of the central difference is far below that, so the residual is mostly floating-point round-off. Either way, the agreement confirms that our hand-derived formula $f'(x) = x^x(\ln x + 1)$ is correct.
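The section heading also promises a symbolic check. A minimal SymPy sketch is given below, assuming SymPy is installed; declaring `x` positive is what allows the log law $\ln(x^x) = x\ln x$ to be applied. It follows the recipe symbolically, compares the result with SymPy's own derivative, and solves for the critical point.

```python
import sympy as sp

x = sp.symbols("x", positive=True)   # positivity lets SymPy apply ln(x^x) = x ln x
f = x ** x

# Steps 1-3 of the recipe: take ln, split with the log laws, differentiate.
log_f = sp.expand_log(sp.log(f), force=True)
g_prime = sp.diff(log_f, x)

# Step 4: multiply back by f, then compare with differentiating f directly.
via_log = f * g_prime
direct = sp.diff(f, x)
residual = sp.simplify(via_log - direct)

critical_points = sp.solve(sp.Eq(g_prime, 0), x)

print("ln f      =", log_f)
print("(ln f)'   =", g_prime)
print("residual  =", residual)
print("critical  =", critical_points)
```

A residual of zero means the recipe and the direct derivative agree as expressions, not just at sampled points.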

PyTorch: Autograd Confirms the Recipe

PyTorch will happily differentiate $x^x$ directly via torch.autograd.grad: the backward formula for a tensor raised to a tensor power is mathematically equivalent to the one we just derived. That gives us a cross-check: we will compute the answer two ways (through $\ln$ first, vs. directly) and confirm they agree.

log_diff_torch.py: autograd reproduces the log-diff identity

Line 1: Import PyTorch

Just torch. Everything we need — tensors, autograd, log — lives in the top-level namespace.

Line 4: The forward function f(x) = x^x

We define f as a normal PyTorch operation. Because x is a tensor with requires_grad=True (set in the driver loop), every operation we do here is recorded by autograd onto a computation graph.

EXAMPLE
f(tensor(2.0, requires_grad=True))
  → tensor(4.0, grad_fn=<PowBackward>)
Line 8: Log-diff path, the derivative *via* ln f(x)

This function is the PyTorch translation of the pen-and-paper recipe. Step 1: build g = ln f. Step 2: ask autograd for g'. Step 3: multiply by f(x). Three lines, exact same logic as the math.

Line 15: g = torch.log(f(x))

This is the linearization step. Inside autograd's view, the computation graph is now: x → x^x → ln(x^x) = x ln x. We never explicitly write `x ln x`; PyTorch tracks the equivalent graph automatically.

EXECUTION STATE
g = tensor(1.3863, grad_fn=<LogBackward>) for x = 2
Line 18: torch.autograd.grad(outputs=g, inputs=x)

This call backpropagates from g to x and returns dg/dx as a one-element tuple. It is the autograd equivalent of `g.backward()` followed by `x.grad`, but without mutating x's .grad attribute — important when you want to call grad() multiple times in the same script.

EXAMPLE
g_prime = tensor(1.6931) for x = 2
  (matches ln(2) + 1 = 1.6931 to 4+ decimals)
Line 19: outputs=g

We are differentiating g (the ln of f), not f itself. This is the entire trick — pushing the differentiation through ln converts an exponential tangle into a sum.

Line 20: inputs=x

The variable with respect to which we want the gradient. x must have requires_grad=True or autograd refuses to track it.

Line 21: create_graph=False

We do NOT need second-order derivatives here, so we keep create_graph off to avoid building a graph-over-the-graph. If we wanted f''(x), we would set create_graph=True and call grad() a second time.

Line 25: return f(x) * g_prime

This is f(x) · d/dx ln f(x) = f'(x). We call f(x) again because the first result was passed straight into torch.log and never stored; recomputing is cheap.

EXAMPLE
x = 2:
  f(x) = 4.0
  g_prime = 1.6931
  f'(x) = 4.0 · 1.6931 = 6.7726
Line 28: f_prime_direct, a sanity check

We ask autograd to differentiate the original messy f directly. PyTorch's chain rule already handles x^x because internally Pow with two tensor arguments has a known gradient formula. We compare the two results to confirm log-diff matches.

Lines 35-37: Driver loop

Same five probe points as the plain-Python file. For each, we build a fresh tensor (autograd consumes graphs once unless retain_graph=True), call both methods, and print whether they agree to within floating-point tolerance.

LOOP TRACE · 5 iterations (float32, so the last printed digit may vary)
x = 0.5
log-diff ≈ 0.216978
direct ≈ 0.216978
match = True
x = 1.0
log-diff ≈ 1.000000
direct ≈ 1.000000
match = True
x = 1.5
log-diff ≈ 2.582004
direct ≈ 2.582004
match = True
x = 2.0
log-diff ≈ 6.772589
direct ≈ 6.772589
match = True
x = 2.5
log-diff ≈ 18.937011
direct ≈ 18.937011
match = True
Line 48: torch.allclose comparison

We use allclose instead of `==` because the two paths add floating-point operations in different orders. Up to ~1e-6 relative error is normal; allclose handles this gracefully.

```python
import torch

# Define the function symbolically inside PyTorch
def f(x: torch.Tensor) -> torch.Tensor:
    return x ** x  # x must be positive for this to be real-valued


def f_prime_via_logdiff(x: torch.Tensor) -> torch.Tensor:
    """
    Implements the textbook log-diff trick using autograd:
        let  g(x) = ln f(x)
        then f'(x) = f(x) * g'(x)
    """
    # 1. Apply ln to convert multiplication/exponentiation into addition
    g = torch.log(f(x))                  # ln f(x) = ln(x^x) = x ln x

    # 2. Let autograd compute g'(x) for us
    (g_prime,) = torch.autograd.grad(
        outputs=g,
        inputs=x,
        create_graph=False,
    )

    # 3. Multiply back by f(x) to recover f'(x)
    return f(x) * g_prime


def f_prime_direct(x: torch.Tensor) -> torch.Tensor:
    """Cross-check: ask autograd to differentiate the messy original directly."""
    y = f(x)
    (y_prime,) = torch.autograd.grad(outputs=y, inputs=x)
    return y_prime


if __name__ == "__main__":
    torch.set_printoptions(precision=6)
    for xv in [0.5, 1.0, 1.5, 2.0, 2.5]:
        x = torch.tensor(xv, requires_grad=True)

        via_log = f_prime_via_logdiff(x)

        # autograd needs a fresh graph for the second call, so rebuild x
        x2 = torch.tensor(xv, requires_grad=True)
        direct = f_prime_direct(x2)

        print(f"x={xv}  log-diff={via_log.item():.6f}  "
              f"direct={direct.item():.6f}  "
              f"match={torch.allclose(via_log.detach(), direct.detach())}")
```
What this tells you: the chain-rule machinery inside autograd reproduces the log-diff identity in code. For an expression of the form $a(x)^{b(x)}$, the gradient autograd computes is the same one you get by writing it as $\exp\!\bigl(b(x)\ln a(x)\bigr)$ and differentiating that. So when you use log-diff on paper, you are doing by hand what modern autodiff systems do automatically.
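To make the connection tangible without any framework, here is a toy forward-mode differentiator in plain Python. It is an illustrative sketch only, not PyTorch's implementation (PyTorch uses reverse mode, and the `Dual` class and `variable` helper are invented here). Its power rule is exactly the derivative of $\exp(b \ln a)$.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """A value together with its derivative with respect to x."""
    val: float
    dot: float

    def __pow__(self, other: "Dual") -> "Dual":
        # a^b = exp(b * ln a), so (a^b)' = a^b * (b' * ln a + b * a' / a).
        # Requires a > 0, the same restriction as log-diff on paper.
        out = self.val ** other.val
        slope = other.dot * math.log(self.val) + other.val * self.dot / self.val
        return Dual(out, out * slope)

def variable(x: float) -> Dual:
    """Wrap x as the independent variable by seeding dx/dx = 1."""
    return Dual(x, 1.0)

x = variable(2.0)
y = x ** x   # base and exponent both carry derivative 1

print(f"f(2)  = {y.val}")
print(f"f'(2) = {y.dot:.6f}")
print(f"x^x (ln x + 1) at 2 = {4 * (math.log(2) + 1):.6f}")
```

With both derivatives seeded to 1, the rule inside `__pow__` collapses to $x^x(\ln x + 1)$, the formula derived by hand.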

Why ML Engineers Care

In statistics and machine learning we constantly minimize negative log-likelihoods. Given i.i.d. data, the likelihood is a product of densities:

$$L(\theta) \;=\; \prod_{i=1}^{N} p(x_i \mid \theta)$$

Differentiating that product directly would be a nightmare for even modest $N$. But taking $\ln$ gives the log-likelihood:

$$\ell(\theta) \;=\; \ln L(\theta) \;=\; \sum_{i=1}^{N} \ln p(x_i \mid \theta)$$

Now the derivative is a sum of $N$ clean terms, exactly what gradient descent needs. This is the log-diff trick applied at industrial scale.
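A small experiment shows both benefits at once. The sketch below uses synthetic Gaussian data (generated here purely for illustration, with unit variance assumed) and compares the raw product of densities with the sum of log-densities and its gradient with respect to the mean.

```python
import math
import random

def gaussian_pdf(x: float, mu: float, sigma: float = 1.0) -> float:
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu: float) -> float:
    """Product of densities: underflows to 0.0 once N is large."""
    return math.prod(gaussian_pdf(x, mu) for x in data)

def log_likelihood(data, mu: float, sigma: float = 1.0) -> float:
    """Sum of log-densities: stays finite."""
    const = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((x - mu) / sigma) ** 2 - const for x in data)

def grad_log_likelihood(data, mu: float, sigma: float = 1.0) -> float:
    """d/dmu of the log-likelihood: one simple term per data point."""
    return sum((x - mu) / sigma ** 2 for x in data)

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(2000)]   # synthetic data, true mean 3.0
mu_hat = sum(data) / len(data)                          # sample mean

print("likelihood     =", likelihood(data, mu_hat))
print("log-likelihood =", log_likelihood(data, mu_hat))
print("gradient at sample mean =", grad_log_likelihood(data, mu_hat))
```

The product collapses to zero in floating point, while the log-likelihood remains an ordinary finite number whose gradient vanishes at the sample mean, the maximum-likelihood estimate for a Gaussian mean.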

| Setting | Product form (don't differentiate) | Log form (do differentiate) |
| --- | --- | --- |
| MLE for Gaussian mean | Π exp(−(xᵢ−μ)² / 2σ²) | −½σ⁻² Σ (xᵢ−μ)² |
| Softmax cross-entropy | −ln(Π probs_correctᵢ) | −Σ ln(probs_correctᵢ) |
| Variational free energy (ELBO) | log-ratio of densities | expectation of differences of logs |
| Diffusion model loss | score = ∇ log p_t | directly the logarithmic derivative |

Every time you write F.cross_entropy(logits, targets) in PyTorch, the framework is doing log-diff for you — operating on log-probabilities for numerical stability and so that the gradient is a clean sum.

Summary

  1. Log-diff is for three shapes: variable base with variable exponent (e.g. $x^x$), messy products and quotients with many factors, and stacked exponents (towers).
  2. The recipe is 4 steps: (1) take $\ln$ of both sides, (2) differentiate (the left side gets $\tfrac{1}{y} \cdot \tfrac{dy}{dx}$ from the chain rule), (3) simplify, (4) multiply by $y$ and substitute the original formula back.
  3. The master identity is $f'(x) = f(x)\cdot \dfrac{d}{dx}[\ln f(x)]$. Everything in this section flows from it.
  4. Critical insight: $f'/f$ is the logarithmic derivative, a measure of relative rate of change. It is the natural quantity for finance, biology, and ML loss surfaces.
  5. Verification is cheap. Plain Python with math.log and a central difference, or PyTorch autograd.grad on a log-then-differentiate computation, will reproduce any log-diff result you derive by hand.
  6. Connection to ML. Every log-likelihood you ever differentiate is an industrial-scale application of the exact same trick. Products of densities become sums of log densities, and those are what optimizers need.