Chapter 1

Logarithmic Functions: Inverting Exponentials

Mathematical Functions - The Building Blocks

Learning Objectives

After completing this section, you will be able to:

  1. Define logarithms as inverse functions of exponentials and translate between logarithmic and exponential forms
  2. Distinguish between common logarithms (base 10), natural logarithms (base e), and binary logarithms (base 2)
  3. Apply logarithm properties (product, quotient, and power rules) to simplify expressions and solve equations
  4. Use the change of base formula to convert between logarithms of different bases
  5. Graph logarithmic functions and identify key features: domain, range, asymptotes, and intercepts
  6. Recognize real-world applications including the Richter scale, decibels, pH, and information theory
  7. Understand why logarithms are essential in machine learning for numerical stability and gradient computation

The Story of Logarithms

A Revolution in Calculation

In the early 17th century, astronomers, navigators, and scientists faced an enormous computational challenge. Calculating planetary orbits, navigating ships across oceans, and conducting scientific experiments required multiplying and dividing very large numbers—a process that was tedious, error-prone, and could take hours or even days.

In 1614, Scottish mathematician John Napier published a revolutionary discovery that would transform calculation: logarithms. His key insight was profound:

Napier's Insight: Multiplication can be converted to addition by working with exponents. If you want to multiply two numbers, add their logarithms instead—then convert back.

This single idea reduced multiplication to addition, division to subtraction, and exponentiation to simple multiplication. Before electronic calculators, logarithm tables and slide rules (which are essentially analog logarithm computers) were indispensable tools used by scientists and engineers for over 300 years.
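Napier's idea can be sketched in a few lines of Python. This is only an illustration: the helper name `multiply_via_logs` and the sample numbers are made up here, and `math.log10` stands in for looking a value up in a printed table.

```python
import math

def multiply_via_logs(a: float, b: float) -> float:
    """Multiply two positive numbers by adding their base-10 logarithms."""
    log_sum = math.log10(a) + math.log10(b)  # two "table lookups", one addition
    return 10 ** log_sum                      # the antilog converts back

product = multiply_via_logs(4821.0, 367.0)
print(product)  # close to 4821 * 367 = 1769307, up to floating-point rounding
```

The multiplication itself never happens; the only arithmetic on the inputs is one addition of their logarithms.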

The Name "Logarithm"

Napier coined the term from Greek: logos (ratio, proportion) and arithmos (number). A logarithm is literally a "ratio number"—it captures the ratio or proportion of growth in exponential processes.

| Year | Development | Impact |
|---|---|---|
| 1614 | Napier publishes logarithm tables | Reduces calculation time by orders of magnitude |
| 1617 | Henry Briggs creates common (base 10) logarithms | Easier to use with decimal system |
| 1620s | Slide rules invented | Portable logarithm calculators used for 350 years |
| 1668 | Natural logarithm (base e) formalized | Essential for calculus and continuous growth |
| 1948 | Shannon's information theory | Logarithms measure information in bits |
| Today | Machine learning | Log-likelihood, cross-entropy, softmax stabilization |

Why This History Matters

Understanding logarithms as "computation tools for handling exponentials" explains why they appear everywhere: they convert multiplicative processes into additive ones. This is exactly why machine learning uses log-probabilities instead of raw probabilities—products of many small numbers become sums of manageable log-values.

Definition: The Inverse of Exponentials

The Fundamental Relationship

A logarithm answers a simple question: "To what power must I raise the base to get this number?"

Formally, for any base $b > 0$ with $b \neq 1$:

$$\log_b(x) = y \quad \Longleftrightarrow \quad b^y = x$$

In words: "the logarithm base $b$ of $x$" equals $y$ if and only if "$b$ to the $y$ power equals $x$."

The Inverse Function Relationship

The logarithm function $f(x) = \log_b(x)$ and the exponential function $g(x) = b^x$ are inverse functions. This means they "undo" each other:

$$\log_b(b^x) = x \quad \text{for all real } x$$

$$b^{\log_b(x)} = x \quad \text{for all } x > 0$$

Domain Restriction

The logarithm $\log_b(x)$ is only defined for $x > 0$. You cannot take the logarithm of zero or a negative number (in the real numbers). This is because no real power of a positive base can produce zero or a negative result.
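A minimal sketch of the two inverse identities and the domain restriction, using only the standard library. The base 2, the sample inputs, and the helper name `log_or_none` are illustrative choices, not part of the original text.

```python
import math

base = 2.0

# log_b(b**x) should give back x for any real x.
exponents = [-3.0, 0.0, 0.5, 10.0]
recovered_exponents = [math.log(base ** x, base) for x in exponents]

# b**log_b(x) should give back x for any x > 0.
positives = [0.04, 1.0, 8.0, 1000.0]
recovered_positives = [base ** math.log(x, base) for x in positives]

def log_or_none(x: float, b: float):
    """Return log_b(x), or None when x is outside the domain x > 0."""
    try:
        return math.log(x, b)
    except ValueError:  # math.log raises for zero and negative inputs
        return None

print(recovered_exponents)   # matches exponents up to rounding
print(recovered_positives)   # matches positives up to rounding
print(log_or_none(0.0, base), log_or_none(-8.0, base))  # None None
```

The round trips agree only up to floating-point rounding, so compare with a tolerance rather than `==`.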

Intuition: The "What Power?" Question

Forget the symbols for a moment. Every logarithm is just a question. When you write $\log_2(8)$ you are literally asking: "starting from 1, how many times must I multiply by 2 to reach 8?" The answer is 3, because $2 \cdot 2 \cdot 2 = 8$. That count of multiplications is the logarithm.

Analogy: think of an exponential function $b^y$ as a recipe ("start at 1 and multiply by $b$, $y$ times, to get a size $x$"). The logarithm is the recipe reader: given the finished cake of size $x$, it tells you how many times you must have doubled (or tripled, or multiplied by $b$) to bake it. Exponentials grow; logs count growth steps. They are two sides of the same coin.

📝 Worked example by hand: convert 5 exponential statements to logarithms

Translate each exponential statement into a logarithm by reading it out loud: "base, to what power, gives result?"

Statement 1: $2^4 = 16$
Ask: 2 to what power gives 16?
Answer: 4. Therefore $\log_2(16) = 4$.
Statement 2: $10^3 = 1000$
Ask: 10 to what power gives 1000?
Answer: 3. Therefore $\log_{10}(1000) = 3$.
(For a 1 followed by zeros, the common log is the count of zeros.)
Statement 3: $e^0 = 1$
Ask: e to what power gives 1?
Answer: 0. Therefore $\ln(1) = 0$.
(Every base to the zero power is 1, so $\log_b(1) = 0$ always.)
Statement 4: $5^{-2} = 1/25 = 0.04$
Ask: 5 to what power gives 0.04?
Answer: −2. Therefore $\log_5(0.04) = -2$.
(For a base above 1, logs of numbers between 0 and 1 are negative: we divided instead of multiplied.)
Statement 5: $9^{1/2} = 3$ (the square root)
Ask: 9 to what power gives 3?
Answer: 1/2. Therefore $\log_9(3) = 0.5$.
(Roots are fractional exponents, so logs of roots are fractional.)

Round trip check. If $\log_2(16) = 4$, then plugging back gives $2^4 = 16$. The two operations exactly undo each other. That round trip is the entire content of the inverse-function property.
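The five translations can be checked mechanically. The sketch below assumes nothing beyond Python's standard `math.log(x, base)`; the list name `statements` is just a label chosen here.

```python
import math

# (base, exponent, result) triples taken from the five statements above.
statements = [
    (2, 4, 16),
    (10, 3, 1000),
    (math.e, 0, 1),
    (5, -2, 0.04),
    (9, 0.5, 3),
]

for base, exponent, result in statements:
    log_value = math.log(result, base)  # "base to what power gives result?"
    print(f"log base {base:.4g} of {result:.4g} = {log_value:.6f} (expected {exponent})")
```

Printed values may differ from the expected exponent in the last digits, because `math.log(x, base)` is computed as a ratio of two natural logs.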


Common and Natural Logarithms

The Three Most Important Bases

| Base | Name | Notation | Primary Use |
|---|---|---|---|
| b = 10 | Common logarithm | log(x) or log₁₀(x) | Scientific notation, orders of magnitude, decibels |
| b = e ≈ 2.718... | Natural logarithm | ln(x) or logₑ(x) | Calculus, continuous growth, ML/statistics |
| b = 2 | Binary logarithm | lg(x) or log₂(x) | Computer science, information theory, algorithms |

The Natural Logarithm: Why Base e?

The natural logarithm (base $e \approx 2.71828\ldots$) might seem like an arbitrary choice, but it's actually the most natural base for calculus. The number $e$ is special because:

$$\frac{d}{dx}\left[e^x\right] = e^x \quad \text{and} \quad \frac{d}{dx}\left[\ln(x)\right] = \frac{1}{x}$$

No other base produces such clean derivatives. This is why natural logarithms dominate in calculus, differential equations, and any field dealing with continuous change.
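The derivative claims can be sanity-checked numerically. The sketch below uses a symmetric finite difference; the helper name `central_difference`, the step size `h`, and the evaluation point `x0 = 2` are assumptions made for illustration. It also shows the extra factor that appears for base 10: the slope of log₁₀(x) is 1/(x · ln 10).

```python
import math

def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 2.0
slope_ln = central_difference(math.log, x0)       # expect about 1/x0
slope_log10 = central_difference(math.log10, x0)  # expect about 1/(x0 * ln 10)
slope_exp = central_difference(math.exp, x0)      # expect about exp(x0)

print(slope_ln, 1 / x0)
print(slope_log10, 1 / (x0 * math.log(10)))
print(slope_exp, math.exp(x0))
```

Each printed pair should agree to several digits; the agreement is approximate because a finite difference is not an exact derivative.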

The Natural Choice: When working with rates of change, growth, or decay, the natural logarithm is almost always the right choice. It simplifies derivatives, integrals, and the mathematics of continuous processes.

The Binary Logarithm: Counting Bits

In computer science, the binary logarithm tells you "how many bits do I need?" For a positive integer $n$:

  • $\lceil \log_2(n) \rceil$ bits are needed to represent $n$ distinct values
  • Binary search on $n$ elements takes $O(\log_2 n)$ comparisons
  • A balanced binary tree with $n$ nodes has height about $\log_2(n)$
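The first two bullets can be checked directly. This is a small sketch, not a library API: `bits_needed` and `binary_search_steps` are names invented here, and the search is the textbook iterative version with inclusive bounds.

```python
import math

def bits_needed(n: int) -> int:
    """Bits required to give each of n distinct values its own label (n >= 1)."""
    return math.ceil(math.log2(n))

def binary_search_steps(sorted_items, target):
    """Return (index or -1, number of loop iterations used)."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1024))
worst_case = max(binary_search_steps(data, t)[1] for t in data)

print(bits_needed(256), bits_needed(1000))  # 8 10
print(worst_case)                            # 11, i.e. floor(log2(1024)) + 1
```

For very large integers, `math.log2` works in floating point and can be off by one near exact powers of two; `(n - 1).bit_length()` is an exact integer alternative.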

Notation Warning

In pure mathematics, "log" without a subscript usually means natural log (ln). In engineering and applied sciences, it often means log₁₀. In computer science, it typically means log₂. Always check the context or explicitly write the base to avoid confusion!

Properties of Logarithms

The Three Fundamental Laws

The properties of logarithms follow directly from the properties of exponents. Since $\log_b(x)$ is the inverse of $b^x$, the laws of exponents transform into laws of logarithms.

The three laws, checked with base-10 logarithms for a = 2, b = 3, n = 2:

Product Rule: log(ab) = log(a) + log(b)
log(6) ≈ 0.7782 and log(2) + log(3) ≈ 0.3010 + 0.4771 ≈ 0.7782 ✓
Quotient Rule: log(a/b) = log(a) - log(b)
log(2/3) ≈ -0.1761 and 0.3010 - 0.4771 = -0.1761 ✓
Power Rule: log(a^n) = n · log(a)
log(4) ≈ 0.6021 and 2 × 0.3010 ≈ 0.6021 ✓

These properties hold for any positive a and b.
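The product, quotient, and power rules can also be confirmed in code. The values a = 2, b = 3, n = 2 below are sample inputs; any positive a and b should behave the same way.

```python
import math

a, b, n = 2.0, 3.0, 2.0  # any positive a and b work; n can be any real number

product_rule = (math.log10(a * b), math.log10(a) + math.log10(b))
quotient_rule = (math.log10(a / b), math.log10(a) - math.log10(b))
power_rule = (math.log10(a ** n), n * math.log10(a))

for name, (lhs, rhs) in [("product", product_rule),
                         ("quotient", quotient_rule),
                         ("power", power_rule)]:
    print(f"{name:8s} lhs={lhs:.4f} rhs={rhs:.4f}")
```

Both sides of each rule print the same four decimals; at full precision they can differ in the last bit, which is ordinary floating-point rounding.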

Key Values to Memorize

| Property | Formula | Why It's True |
|---|---|---|
| Log of 1 | logᵦ(1) = 0 | b⁰ = 1 for any base |
| Log of the base | logᵦ(b) = 1 | b¹ = b |
| Log of a power of base | logᵦ(bⁿ) = n | By definition of logarithm |
| Inverse composition | b^(logᵦ(x)) = x | Inverse functions cancel |

Change of Base Formula

What if you need $\log_2(100)$ but your calculator only has ln and log₁₀? The change of base formula converts between any bases:

$$\log_a(x) = \frac{\log_b(x)}{\log_b(a)} = \frac{\ln(x)}{\ln(a)} = \frac{\log(x)}{\log(a)}$$

Worked example (changing from base 2 to base 10):
log₂(100) = log₁₀(100) / log₁₀(2) = 2.0000 / 0.3010 ≈ 6.6439

The change of base formula lets you convert between any logarithm bases using only one type of logarithm.

Computational Tip

In programming, virtually every language provides log (natural log), and many also provide log10 and log2. For any other base, use the change of base formula: log_b(x) = log(x) / log(b). For example, log2(x) = log(x) / log(2).
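A minimal sketch of the change of base formula as a reusable function. The name `log_base` and its input checks are choices made here for illustration; Python's built-in `math.log(x, base)` already does the same job.

```python
import math

def log_base(x: float, base: float) -> float:
    """Compute log_base(x) from natural logs via the change of base formula."""
    if x <= 0:
        raise ValueError("x must be positive")
    if base <= 0 or base == 1:
        raise ValueError("base must be positive and different from 1")
    return math.log(x) / math.log(base)

print(log_base(100, 2))    # about 6.6439
print(log_base(100, 10))   # about 2
print(log_base(8, 0.5))    # about -3, since 0.5**-3 == 8
```

Because the result is a ratio of two rounded values, exact-looking answers such as 2 or −3 may come back with a tiny error in the last digit.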

Graphing Logarithmic Functions

The graph of $y = \log_b(x)$ is the reflection of $y = b^x$ across the line $y = x$. This reflection relationship between inverse functions is the heart of logarithmic graphs: swap x and y, and you swap the two curves.

[Interactive figure: "The Mirror: log and exp are reflections across y = x". Each point $(x, \log_b x)$ on the log curve is paired with its mirror point $(\log_b x, x)$ on the exponential curve; the dashed line y = x is the mirror.]

Sample reading at x = 2.50 with base e:
Log says: ln(2.50) = 0.9163
Exp confirms (round trip): e^0.9163 = 2.5000

Swap the coordinates of any point on the log curve and you land on the exponential curve. That coordinate swap is reflection across y = x.

[Interactive figure: "Interactive Logarithm Graph: y = logₑ(x)", drawn with b = e ≈ 2.72 alongside y = eˣ. Marked points: (1, 0) and (e, 1); vertical asymptote at x = 0. Domain: (0, +∞). Range: (-∞, +∞). Key property: logₑ(e) = 1.]

Specific values such as $\log_2(8) = 3$ or $\ln(e) = 1$ can be read directly off such a graph. The logarithm and exponential are reflections across y = x.

Key Features of Logarithmic Graphs

| Feature | Value | Explanation |
|---|---|---|
| Domain | (0, +∞) | Only positive inputs allowed |
| Range | (-∞, +∞) | Output can be any real number |
| Vertical Asymptote | x = 0 | Graph approaches but never touches y-axis |
| x-intercept | (1, 0) | logᵦ(1) = 0 for any base |
| Key Point | (b, 1) | logᵦ(b) = 1 |
| Behavior as x → 0⁺ | y → -∞ | Logarithm of small positive numbers is very negative |
| Behavior as x → +∞ | y → +∞ | Grows without bound, but very slowly |

How the Base Affects the Graph

  • Base > 1: The function is increasing. Larger x gives larger y.
  • Larger base: The curve rises more slowly. Compare log₁₀(100) = 2 vs log₂(100) ≈ 6.64.
  • Base between 0 and 1: The function is decreasing (rarely used in practice).

Transformations of Logarithmic Functions

Standard function transformations apply to logarithms:

| Transformation | Effect on Graph | Example |
|---|---|---|
| y = logᵦ(x) + k | Vertical shift up by k | ln(x) + 2 |
| y = logᵦ(x - h) | Horizontal shift right by h | ln(x - 3), asymptote moves to x = 3 |
| y = a · logᵦ(x) | Vertical stretch by factor a | 2 ln(x) |
| y = logᵦ(cx) | Horizontal compression by factor c | ln(2x) |
| y = -logᵦ(x) | Reflection over x-axis | -ln(x) |
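Two of these transformations have consequences worth seeing numerically. By the product rule, the horizontal compression ln(2x) is the same curve as the vertical shift ln(x) + ln(2); and the horizontal shift ln(x − 3) moves the domain boundary to x = 3. The sketch below is illustrative only, and the helper name `shifted_right` is invented here.

```python
import math

xs = [0.5, 1.0, 2.0, 10.0]

# Horizontal compression y = ln(2x) ...
compressed = [math.log(2 * x) for x in xs]
# ... equals the vertical shift y = ln(x) + ln(2), by the product rule.
shifted_up = [math.log(x) + math.log(2) for x in xs]

def shifted_right(x: float, h: float = 3.0) -> float:
    """y = ln(x - h): defined only for x > h, so the asymptote is at x = h."""
    if x <= h:
        raise ValueError("outside the domain x > h")
    return math.log(x - h)

print(compressed)
print(shifted_up)
print(shifted_right(4.0))       # 0.0, the x-intercept moved from 1 to 4
print(shifted_right(3.000001))  # about -13.8, close to the asymptote
```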

Real-World Logarithmic Scales

Many natural phenomena span enormous ranges of values. Logarithmic scales compress these ranges into manageable numbers, making patterns visible that would be hidden on a linear scale.

Logarithmic Scales in the Real World

Earthquake Magnitudes (Richter Scale)

Each unit increase = 10x more ground motion, ~31.6x more energy

| Description | Magnitude | Relative amplitude (10^M) |
|---|---|---|
| Barely felt | 2 | 1.0e+2 |
| Minor damage | 4 | 1.0e+4 |
| Moderate | 5 | 1.0e+5 |
| Strong | 6 | 1.0e+6 |
| Major | 7 | 1.0e+7 |
| Great (1906 SF) | 7.9 | 7.9e+7 |
| Massive (2011 Japan) | 9.1 | 1.3e+9 |

Why use logarithmic scales? They compress huge ranges of values into manageable numbers. The difference between a magnitude 5 and magnitude 9 earthquake is about 10,000x in ground motion, but only 4 units on the Richter scale.

Why Logarithmic Scales?

  1. Compress huge ranges: The visible light spectrum covers wavelengths from 400nm to 700nm, but the full electromagnetic spectrum spans from 10⁻¹⁵m (gamma rays) to 10⁸m (radio waves)—a factor of 10²³!
  2. Match human perception: Our ears perceive loudness logarithmically. A sound 10x more intense sounds about twice as loud.
  3. Reveal multiplicative patterns: Exponential growth appears as a straight line on a log scale, making it easy to identify.
  4. Compare relative changes: A doubling looks the same size whether it's from 10 to 20 or from 10,000 to 20,000.

The Richter Scale: A Deep Dive

The Richter magnitude $M$ of an earthquake is:

$$M = \log_{10}\left(\frac{A}{A_0}\right)$$

where $A$ is the measured amplitude and $A_0$ is a reference amplitude. Each unit increase in magnitude means:

  • 10× more ground motion (amplitude)
  • ~31.6× more energy released (radiated energy scales roughly as amplitude^1.5, and 10^1.5 ≈ 31.6)
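The two ratios can be computed from a magnitude difference. This sketch assumes the common rule of thumb that energy grows as 10^(1.5 · M), which is what produces the 31.6 figure; the function names are made up for this example.

```python
def amplitude_ratio(m_small: float, m_large: float) -> float:
    """How many times larger the ground motion is at m_large than at m_small."""
    return 10 ** (m_large - m_small)

def energy_ratio(m_small: float, m_large: float) -> float:
    """Energy ratio, assuming energy grows as 10**(1.5 * magnitude)."""
    return 10 ** (1.5 * (m_large - m_small))

print(amplitude_ratio(5.0, 9.0))  # about 10,000
print(energy_ratio(4.0, 5.0))     # about 31.6
print(energy_ratio(5.0, 9.0))     # about 1,000,000
```

A four-unit gap on the scale therefore hides a factor of ten thousand in shaking and roughly a million in energy.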

Applications in Machine Learning

Logarithms are ubiquitous in machine learning, not as a historical curiosity, but as an essential tool for numerical stability and theoretical elegance.

Why ML Uses Log-Probabilities

Consider computing the likelihood of observing data given a model. If you have $n$ independent observations with probabilities $p_1, p_2, \ldots, p_n$:

$$\text{Likelihood } L = \prod_{i=1}^{n} p_i$$

The Problem: If each $p_i \approx 0.99$ and $n = 1000$:

$$L = 0.99^{1000} \approx 4.3 \times 10^{-5}$$

For smaller probabilities or more samples, $L$ underflows to exactly 0 in floating-point arithmetic, which is catastrophic for training.

The Solution: Work with log-likelihood instead:

$$\log L = \sum_{i=1}^{n} \log(p_i)$$

Products become sums, tiny numbers become manageable negative numbers, and gradients become stable.
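To see the underflow actually happen, here is a small sketch with deliberately harsh numbers: 200 events with probability 0.01 each. The sample size and probability are assumptions picked so that the true likelihood (10⁻⁴⁰⁰) falls below what float64 can store.

```python
import math

probabilities = [0.01] * 200   # 200 independent events, each with p = 0.01

likelihood = 1.0
for p in probabilities:
    likelihood *= p            # true value is 1e-400, below the float64 range

log_likelihood = sum(math.log(p) for p in probabilities)

print(likelihood)      # 0.0
print(log_likelihood)  # about -921.03
```

The product has silently become zero, while the sum of logs is an ordinary number that still distinguishes this model from a better or worse one.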

Logarithms in Machine Learning: Log-Likelihood

Why ML uses logarithms: products become sums, tiny probabilities become manageable numbers.

Worked example: n = 10 observations, 8 with probability p and 2 with probability 1 − p. The values shown correspond to p = 0.7.

Raw Likelihood (product of probabilities)
L = p⁸ × (1 − p)² = 5.1883e-3
Underflows to 0 with many samples!

Log-Likelihood (sum of logs)
log(L) = 8 × log(p) + 2 × log(1 − p) = -5.2613
Stays representable as samples are added.

Connection to Cross-Entropy Loss
Cross-Entropy = -log(L)/n = 0.5261

Cross-entropy loss is the negative average log-likelihood. Minimizing cross-entropy = maximizing likelihood!

Plain Python first — multiply, then add logs

Before we touch any framework, let's do log-likelihood by hand. The point of this snippet is not to be efficient — it is to make you feel the difference between a product and a sum of logs.

Likelihood vs Log-Likelihood: Plain Python Loop

Line-by-line notes (line numbers refer to the code listing that follows these notes)

Line 1: import math

We only need math.log (= ln) and math.exp. Using the standard library makes the algorithm visible — no NumPy magic.

Line 5: observations = the data

A list of 8 binary outcomes. Six 1s and two 0s. This list is the entire dataset we will score the model against. In ML language, these are the labels for 8 training examples.

EXECUTION STATE
observations = [1, 1, 0, 1, 1, 0, 1, 1]
len(observations) = 8
sum(observations) = 6 (heads)
Line 6: p = 0.6, the model's belief

The model says 'I think heads is 60% likely'. Our job is to score how good that belief is given the actual observations. Higher likelihood = better fit.

EXECUTION STATE
p = 0.6
Line 9: likelihood = 1.0, the running product

We initialise to 1.0 because the multiplicative identity is 1, just like a running sum starts at 0. Every observation will multiply this number by something between 0 and 1, so the product can only shrink.

EXECUTION STATE
likelihood = 1.0
Line 10: Loop over observations (iteration trace)

We walk through all 8 outcomes. For each y=1 we multiply by p; for each y=0 we multiply by (1−p). Watch the running product collapse toward zero.

LOOP TRACE · 8 iterations
i=0, y=1
factor = p = 0.6
likelihood = 0.600000
i=1, y=1
factor = p = 0.6
likelihood = 0.360000
i=2, y=0
factor = 1 - p = 0.4
likelihood = 0.144000
i=3, y=1
factor = p = 0.6
likelihood = 0.086400
i=4, y=1
factor = p = 0.6
likelihood = 0.051840
i=5, y=0
factor = 1 - p = 0.4
likelihood = 0.020736
i=6, y=1
factor = p = 0.6
likelihood = 0.012442
i=7, y=1
factor = p = 0.6
likelihood = 0.007465
Line 11: likelihood *= …, the multiplication

Python's conditional expression: pick p if heads, 1−p if tails. After 8 iterations the product is 0.6⁶ × 0.4² = 0.046656 × 0.16 = 0.00746496. With the same mix of heads and tails, the product falls below the smallest positive float64 (about 5 × 10^-324) after roughly 1,200 observations and is then stored as exactly 0.0.

Line 12: Print the likelihood

Expected stdout: approximately 'likelihood = 0.00746496' (the last printed digits may differ by floating-point rounding). Small but still representable. Now imagine 10,000 observations with the same mix: the product is roughly 10^-2659, far below the float64 range, and the answer collapses to 0.

Line 15: log_likelihood = 0.0, the running sum

Switch to log space. Multiplication of probabilities becomes addition of log-probabilities (product rule of logs). We start the running sum at 0 because log(1) = 0.

EXECUTION STATE
log_likelihood = 0.0
Line 16: Loop again, same logic, in log space

Same eight iterations, but every multiply becomes an add and every probability becomes its log. log(0.6) ≈ −0.510826, log(0.4) ≈ −0.916291.

LOOP TRACE · 8 iterations
i=0, y=1
log_likelihood = -0.510826
i=1, y=1
log_likelihood = -1.021651
i=2, y=0
log_likelihood = -1.937942
i=3, y=1
log_likelihood = -2.448768
i=4, y=1
log_likelihood = -2.959593
i=5, y=0
log_likelihood = -3.875884
i=6, y=1
log_likelihood = -4.386710
i=7, y=1
log_likelihood = -4.897535
Line 17: log_likelihood += math.log(…)

Conditional add: log(p) if heads, log(1−p) if tails. After 8 iterations the sum is 6·log(0.6) + 2·log(0.4) ≈ −4.897535. Notice: the number is comfortably representable, and adding more samples just makes it more negative; it does not underflow.

Line 18: Print log_likelihood

Expected stdout: approximately 'log_likelihood = -4.897535' (Python prints more digits). Imagine 10,000 observations with the same mix: the log-likelihood would be around −6,100, still a totally normal float. That is why ML libraries typically store and optimise log-likelihoods.

Line 21: exp(log_likelihood), round-trip check

exp(−4.897535) ≈ 0.00746496, matching the direct product printed on line 12 up to floating-point rounding. This is the mathematical guarantee: multiplying probabilities and summing their logs are the same operation, viewed through the exp/log mirror.

EXECUTION STATE
math.exp(log_likelihood) = 0.00746496
```python
import math

# Imagine 8 independent coin flips. Our model predicts p = 0.6 for "heads".
# Observed sequence: 1=heads, 0=tails.
observations = [1, 1, 0, 1, 1, 0, 1, 1]
p = 0.6

# Approach A: multiply raw probabilities (the dangerous way)
likelihood = 1.0
for y in observations:
    likelihood *= p if y == 1 else (1.0 - p)
print("likelihood        =", likelihood)

# Approach B: sum log-probabilities (the safe way)
log_likelihood = 0.0
for y in observations:
    log_likelihood += math.log(p) if y == 1 else math.log(1.0 - p)
print("log_likelihood    =", log_likelihood)

# Sanity: exp of the log-likelihood should recover the raw likelihood.
print("exp(log_lik)      =", math.exp(log_likelihood))
```

The one-line moral

Products of many small probabilities underflow; sums of their logs do not. This is a major reason why standard ML losses (cross-entropy, KL divergence, negative log-likelihood) are computed in log space.

Cross-Entropy Loss

The cross-entropy loss for classification is the negative log-likelihood:

$$H(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$$

where $y$ is the true label (one-hot) and $\hat{y}$ is the predicted probability distribution.
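A minimal sketch of this formula for a single example, in plain Python. The function name `cross_entropy`, the safety floor `eps`, and the sample probability vectors are assumptions made for illustration; real frameworks usually take logits rather than probabilities.

```python
import math

def cross_entropy(true_one_hot, predicted_probs, eps: float = 1e-12) -> float:
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i), using the natural log.

    eps is an assumed safety floor so that log(0) is never evaluated.
    """
    return -sum(y * math.log(max(q, eps))
                for y, q in zip(true_one_hot, predicted_probs))

y_true = [0.0, 1.0, 0.0]          # the correct class is index 1
confident = [0.05, 0.90, 0.05]    # most mass on the right class
unsure = [0.40, 0.30, 0.30]       # less mass on the right class

print(cross_entropy(y_true, confident))  # about 0.105, i.e. -ln(0.90)
print(cross_entropy(y_true, unsure))     # about 1.204, i.e. -ln(0.30)
```

With a one-hot label, only the predicted probability of the true class contributes, so the loss is simply the negative log of that one number.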

Log-Softmax for Numerical Stability

The softmax function converts logits to probabilities:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The problem: For large logits, $e^{z_i}$ overflows. The solution: Compute log-softmax directly:

$$\log(\text{softmax}(z_i)) = z_i - \log\left(\sum_j e^{z_j}\right)$$

Using the "log-sum-exp trick" with a shift $c = \max(z)$:

$$\log\left(\sum_j e^{z_j}\right) = c + \log\left(\sum_j e^{z_j - c}\right)$$

We will now build the log-softmax twice. First in NumPy, line by line, with every value hand-traced, so you can see the math. Then we will hand it to PyTorch and watch the same numbers come out in one call. By hand first, framework second.

Step 1 — Pure NumPy: trace every line

Numerically Stable Log-Softmax (NumPy, from scratch)

Line-by-line notes (line numbers refer to the code listing that follows these notes)

Line 1: import NumPy

We need vectorised math: np.exp, np.log, np.max, np.sum. Everything operates element-wise on the array of logits.

Line 3: Define the naive (broken) baseline

Naive log-softmax follows the textbook formula directly: exponentiate, normalize, take log. It is mathematically correct but numerically fragile. We keep it so you can see the failure mode side-by-side with the fix.

Line 5: exp_logits = np.exp(logits), the overflow

For logits = [1000, 1001, 1002], np.exp(1000) is roughly 2 × 10^434, which is far beyond the largest float64 value (~1.8 × 10^308). NumPy returns inf for every entry. The naive path is already dead at this line — everything downstream is nan.

EXECUTION STATE
logits = array([1000., 1001., 1002.])
exp_logits = array([inf, inf, inf])
Line 6: softmax = exp_logits / sum(exp_logits)

inf divided by inf is the indeterminate form. NumPy reports nan for every entry and raises a runtime warning. Note that mathematically the answer should be approximately [0.0900, 0.2447, 0.6652] — we have completely lost the information.

EXECUTION STATE
np.sum(exp_logits) = inf
softmax = array([nan, nan, nan])
Line 7: return np.log(softmax), final blowup

log of nan is nan, so the function returns an array of nans. In a real training loop this propagates into the loss, then into gradients, then into weights, and your network is unrecoverable.

EXECUTION STATE
return = array([nan, nan, nan])
Line 9: Define the stable version

We now repair the algorithm using the log-sum-exp identity: log Σ e^{z_i} = c + log Σ e^{z_i − c} for any constant c. Choosing c = max(z) keeps every shifted exponent ≤ 0, which keeps exp safely between 0 and 1.

Line 11: c = np.max(logits), pick the shift

c is just a number. It cancels mathematically (we add it back later), but operationally it makes the exponents harmless. Choosing the maximum guarantees every shifted entry is ≤ 0, so every exp call returns a finite number in (0, 1].

EXECUTION STATE
c = 1002.0
Line 12: shifted = logits − c, bring everything to ≤ 0

We subtract c from every entry. The largest logit becomes 0, every other entry becomes a small negative number. Concretely [1000, 1001, 1002] − 1002 = [−2, −1, 0]. These are perfectly safe inputs to exp.

EXECUTION STATE
shifted = array([-2., -1., 0.])
Line 13: sum_exp = np.sum(np.exp(shifted))

Now exp behaves: exp([-2, -1, 0]) = [e^-2, e^-1, e^0] ≈ [0.1353, 0.3679, 1.0000]. Their sum ≈ 1.5032. No overflow is possible, and because the largest term is exactly 1, the sum keeps full precision.

EXECUTION STATE
np.exp(shifted) = array([0.1353, 0.3679, 1.0000])
sum_exp = 1.5032147244
Line 14: log_sum_exp = c + np.log(sum_exp), add c back

Here is where the cancellation happens. log Σ e^{z_i} = log Σ e^{z_i − c + c} = c + log Σ e^{z_i − c}. We added c inside the exp, so we add it back outside the log. The result is mathematically identical to the direct formula but computed with finite numbers.

EXECUTION STATE
np.log(sum_exp) = 0.4076059644
log_sum_exp = 1002.4076059644
Line 15: return logits − log_sum_exp

log-softmax of z_i is z_i − log Σ e^{z_j}. We subtract the single scalar log_sum_exp from every logit. The output is a vector of log probabilities — all ≤ 0, and exp of them sums to 1.

EXECUTION STATE
result = array([-2.4076, -1.4076, -0.4076])
np.exp(result).sum() = 1.0000
Line 18: logits = the stress test

Three values, each large enough to overflow exp individually. Typical logits are far smaller, but during training a single bad batch can spike them. exp overflows float32 for inputs above about 88.7 and float64 for inputs above about 709.8, so logits in the hundreds are already enough to break the naive path. Using 1000 makes the failure dramatic and unmissable.

Line 19: print naive, should be all nans

Expected stdout: naive : [nan nan nan] (NumPy also emits runtime warnings). This is the canonical sign of a numerical instability in a deep learning pipeline.

Line 20: print stable, finite and correct

Expected stdout: stable: [-2.40760596 -1.40760596 -0.40760596]. Exponentiating those gives [0.0900, 0.2447, 0.6652], which sum to 1.0 — a proper probability distribution.

```python
import numpy as np

def log_softmax_naive(logits):
    """Naive implementation -- WILL OVERFLOW for large logits."""
    exp_logits = np.exp(logits)              # e^1000 overflows to inf
    softmax = exp_logits / np.sum(exp_logits)  # inf / inf = nan
    return np.log(softmax)                   # log(nan) = nan

def log_softmax_stable(logits):
    """Stable log-softmax using the log-sum-exp trick."""
    c = np.max(logits)                       # shift constant
    shifted = logits - c                     # all entries <= 0
    sum_exp = np.sum(np.exp(shifted))        # finite, well-behaved
    log_sum_exp = c + np.log(sum_exp)        # add c back
    return logits - log_sum_exp              # log of softmax

# Extreme logits that crash the naive path
logits = np.array([1000.0, 1001.0, 1002.0])
print("naive :", log_softmax_naive(logits))
print("stable:", log_softmax_stable(logits))
```

Step 2 — The same idea in PyTorch (one call)

Once you understand the trick by hand, PyTorch hides the boilerplate. torch.nn.functional.log_softmax already uses the log-sum-exp trick internally, with extra speed-ups for GPU and autograd.

Same Computation in PyTorch: Library-Level

Line-by-line notes (line numbers refer to the code listing that follows these notes)

Line 1: import torch

torch is the core tensor library. Tensors are NumPy-like arrays but also track gradients and can live on GPU.

Line 2: Import functional API

torch.nn.functional (aliased F) holds stateless operations that work on tensors directly — log_softmax, cross_entropy, relu, and so on.

Line 5: Create the logits tensor

Same numbers as the NumPy example. torch.tensor(...) copies the data into a float32 tensor by default. dtype matters: float32 has a machine epsilon of about 1.2e-7 and its exp overflows for inputs above about 88.7, so the stability arguments are even more important than in NumPy's float64.

EXECUTION STATE
logits = tensor([1000., 1001., 1002.])
logits.dtype = torch.float32
logits.requires_grad = False
Line 8: F.log_softmax, one line replaces our 5 lines

Conceptually, PyTorch computes max(logits) along dim, subtracts it, exponentiates, sums, takes log, and subtracts. It is the same algorithm we just wrote, behind an optimised kernel. dim=-1 means 'the last axis': for a 1-D vector that is the only axis; for a batch tensor of shape (N, C) it would mean reduce across the class axis.

EXECUTION STATE
log_probs = tensor([-2.4076, -1.4076, -0.4076])
Line 11: probs = log_probs.exp()

We exponentiate the log-probabilities to recover the actual softmax. exp(-2.4076) ≈ 0.0900, exp(-1.4076) ≈ 0.2447, exp(-0.4076) ≈ 0.6652. Notice the largest logit gets the largest probability — that is exactly the 'soft argmax' interpretation of softmax.

EXECUTION STATE
probs = tensor([0.0900, 0.2447, 0.6652])
Line 12: Inspect log_probs

Expected stdout: tensor([-2.4076, -1.4076, -0.4076]) — identical to our hand-built NumPy stable version (within float32 precision). The framework call and the hand-derived call agree.

Line 13: Inspect probs

Expected: tensor([0.0900, 0.2447, 0.6652]). The probabilities are all positive, monotonic with the logits, and (by construction) sum to 1.

Line 14: Round-trip sanity check

Expected: 0.9999999... ≈ 1.0. This is the unit test for any softmax-like operation: probabilities must sum to one. If you ever see 0.97 or 1.03 here, your implementation is broken.

EXECUTION STATE
probs.sum() = tensor(1.0000)
```python
import torch
import torch.nn.functional as F

# Same extreme logits, now as a PyTorch tensor on the default device.
logits = torch.tensor([1000.0, 1001.0, 1002.0])

# log_softmax internally does max-shift + log-sum-exp, with autograd support.
log_probs = F.log_softmax(logits, dim=-1)

# Round-trip check: exp(log_probs) must sum to 1.
probs = log_probs.exp()
print("log_probs:", log_probs)
print("probs    :", probs)
print("sum(probs):", probs.sum().item())
```

Information Theory: Bits and Entropy

Claude Shannon's information theory uses the binary logarithm to measure information:

$$I(x) = -\log_2(p(x)) \text{ bits}$$

The entropy (average information content) of a distribution is:

$$H(X) = -\sum_x p(x) \log_2(p(x))$$

| Event Probability | Information Content | Intuition |
|---|---|---|
| p = 1 (certain) | -log₂(1) = 0 bits | No surprise, no information |
| p = 0.5 (coin flip) | -log₂(0.5) = 1 bit | One binary question answered |
| p = 0.25 | -log₂(0.25) = 2 bits | Two binary questions answered |
| p = 0.001 (rare) | -log₂(0.001) ≈ 10 bits | Very surprising, high information |
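Both formulas translate almost word for word into Python. In this sketch the function names `information_bits` and `entropy_bits` are invented here, and zero-probability outcomes are skipped under the usual convention that they contribute nothing to the entropy.

```python
import math

def information_bits(p: float) -> float:
    """Self-information -log2(p) of an event with probability p in (0, 1]."""
    return -math.log2(p)

def entropy_bits(distribution) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

print(information_bits(0.5), information_bits(0.25))  # 1.0 2.0
print(information_bits(0.001))                        # about 9.97
print(entropy_bits([0.5, 0.5]))                       # 1.0, a fair coin
print(entropy_bits([0.9, 0.1]))                       # about 0.469, a biased coin
```

The biased coin carries less than one bit per flip because its outcome is easier to predict.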

Numerical Computing Considerations

Common Pitfalls

Numerical Hazards with Logarithms

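  • log(0): Undefined (NumPy returns -inf with a warning; Python's math.log raises ValueError)
  • log(negative): Undefined in reals (NumPy returns NaN with a warning; Python's math.log raises ValueError)
  • log(1 + x) for small x: Use log1p(x) for accuracy
  • Subtracting large logs: Can cause precision loss; when the goal is the log of a sum of exponentials, use the log-sum-exp trick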
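The first two hazards are easy to reproduce with the standard library, where `math.log` raises `ValueError` instead of returning a special value. The sketch below also shows one common defensive pattern, clamping the input to a small floor; the helper names and the floor `eps = 1e-12` are assumptions, and the right floor depends on the application.

```python
import math

def try_log(x: float):
    """Return math.log(x), or the name of the error it raises."""
    try:
        return math.log(x)
    except ValueError as err:
        return type(err).__name__

def safe_log(x: float, eps: float = 1e-12) -> float:
    """Clamp the input to an assumed floor eps before taking the log."""
    return math.log(max(x, eps))

print(try_log(0.0))    # ValueError
print(try_log(-1.0))   # ValueError
print(safe_log(0.0))   # about -27.63, i.e. log(1e-12)
print(safe_log(0.5))   # unchanged: about -0.693
```

Clamping trades a crash for a bounded, slightly biased value, so it suits loss computations better than exact calculations.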

The log1p Function

When computing $\log(1 + x)$ for small $x$, direct computation loses precision because $1 + x \approx 1$ in floating point. The function log1p(x) computes $\log(1 + x)$ accurately.

For tiny $x$, the Taylor series of $\ln(1+x)$ starts with $x - \tfrac{x^2}{2} + \tfrac{x^3}{3} - \dots$, so to first order $\ln(1+x) \approx x$. The naive computation np.log(1 + x) first adds 1 to x, and that addition is where the precision is destroyed, before the logarithm ever runs. log1p sidesteps the bad addition by never relying on the rounded sum 1 + x; for tiny x its result matches the series.

Why log1p Matters for Small Values

Line-by-line notes (line numbers refer to the code listing that follows these notes)

Line 1: import NumPy

We use NumPy to access np.log (= ln) and np.log1p. Both are vectorised; here we use them on scalars for clarity.

Line 3: x = 1e-17, chosen to be below the ulp at 1.0

The 'unit in the last place' (ulp) of a float64 near 1.0 is 2^-52 ≈ 2.22 × 10^-16. Any number smaller than half a ulp is invisible when added to 1.0. We picked x = 1e-17 specifically to fall below that threshold so the failure is unmistakable.

EXECUTION STATE
x = 1e-17
ulp(1.0) = ~2.22e-16
Line 6: ones_plus_x = 1.0 + x, the silent failure

Floating-point addition has to round the exact result (1.00000000000000001) to the nearest representable double. The nearest representable double is exactly 1.0. So the entire information content of x is thrown away here, before any log runs.

EXECUTION STATE
ones_plus_x = 1.0
ones_plus_x == 1.0 = True
Line 7: Print the stored value

Expected stdout: '1 + x stored as: 1.0'. Note repr — Python's full-precision printer — still shows 1.0, confirming x has vanished.

Line 10: naive = np.log(1.0 + x), log of the wrong number

Since (1.0 + x) was stored as 1.0, np.log returns log(1.0) = 0.0 exactly. The naive routine reports zero when the true answer is approximately 1e-17. This is absorption rather than cancellation: the small addend was rounded away in the addition, giving 100% relative error.

EXECUTION STATE
naive = 0.0
Line 11: Print naive, confirms it is 0

Expected stdout: 'naive np.log(1 + x): 0.0'. A silent, devastating bug.

Line 14: accurate = np.log1p(x), does the right thing

log1p is designed to compute log(1+x) accurately without relying on the rounded intermediate 1+x. For tiny x its result agrees with the Taylor series x - x²/2 + x³/3 - …, so all of x's bits survive.

EXECUTION STATE
accurate = 1e-17
Line 15: Print accurate, recovers x correctly

Expected stdout: 'np.log1p(x) : 1e-17'. The function returns the correct answer to full precision.

Line 18: true_value via Taylor, independent cross-check

Computing x − x²/2 by hand bypasses np.log entirely. For x = 1e-17, the x² term is 1e-34 — way below precision — so true_value ≈ x = 1e-17. This confirms log1p's answer is right.

EXECUTION STATE
true_value = 1e-17
Line 19: Print the truth

Expected stdout: 'true ≈ x - x^2/2 : 1e-17'. Three numbers were printed: naive=0.0, log1p=1e-17, truth=1e-17. log1p and truth agree; naive is wrong.

```python
import numpy as np

x = 1e-17  # tiny: well below the float64 ulp at 1.0 (~2.22e-16)

# Step 1: the addition that ruins precision
ones_plus_x = 1.0 + x
print("1 + x stored as:", repr(ones_plus_x))

# Step 2: naive log loses every bit of x
naive = np.log(1.0 + x)
print("naive np.log(1 + x):", naive)

# Step 3: log1p avoids the rounded sum 1 + x, so no precision is lost
accurate = np.log1p(x)
print("np.log1p(x)        :", accurate)

# Step 4: true value to compare against
true_value = x - 0.5 * x**2  # first two Taylor terms
print("true ≈ x - x^2/2   :", true_value)
```

Summary

| Concept | Key Formula | Application |
|---|---|---|
| Definition | logᵦ(x) = y ⟺ bʸ = x | Inverse of exponentials |
| Product Rule | log(xy) = log(x) + log(y) | Multiply → Add |
| Quotient Rule | log(x/y) = log(x) - log(y) | Divide → Subtract |
| Power Rule | log(xⁿ) = n·log(x) | Exponent → Multiply |
| Change of Base | logₐ(x) = log(x)/log(a) | Convert between bases |
| Natural Log | ln(x) = logₑ(x) | Calculus, ML, continuous growth |
| Log-Likelihood | log L = Σ log(pᵢ) | ML training stability |

Key Takeaways

  1. Logarithms are inverses of exponentials—they answer "what power?"
  2. Three important bases: e (calculus), 10 (scientific), 2 (computing)
  3. Properties convert operations: multiplication → addition, powers → multiplication
  4. Real-world scales (Richter, decibels, pH) compress huge ranges
  5. ML relies on logarithms for numerical stability and theoretical elegance
  6. Always check for edge cases: log(0), log(negative), precision for small arguments

Exercises

Conceptual Questions

  1. Explain in your own words why $\log_b(1) = 0$ for any valid base $b$.
  2. Why can't we take the logarithm of a negative number using real numbers?
  3. If $\log_2(n) = 10$, how many elements $n$ are in the set? How many comparisons would binary search take in the worst case?
  4. A sound at 60 dB is how many times more intense than a sound at 40 dB?

Computational Problems

  1. Simplify: $\log_3(81) + \log_3\left(\frac{1}{27}\right)$
  2. Solve for x: $\log_2(x-3) + \log_2(x+3) = 4$
  3. Express $\log_5(100)$ in terms of common logarithms (base 10).
  4. If an earthquake releases 1000 times more energy than a magnitude 4 earthquake, what is its magnitude? (Hint: energy ratio ≈ $31.6^{\Delta M}$)

Programming Challenges

  1. Implement a function that computes $\log_b(x)$ for any base using only the natural log.
  2. Write a numerically stable function to compute cross-entropy loss given predicted probabilities and true labels.
  3. Create a visualization comparing linear and logarithmic scales for the electromagnetic spectrum (wavelengths from 10⁻¹² to 10⁴ meters).

Exploration

  1. Research Benford's Law: why do leading digits in many datasets follow a logarithmic distribution?
  2. Investigate how logarithms appear in the analysis of algorithm complexity (e.g., why is merge sort O(n log n)?).
  3. Explore the connection between logarithms and music: why is the frequency ratio between octaves 2:1, and how do logarithms relate to the perception of pitch?

In the next section, we'll explore trigonometric functions—the mathematics of circular motion and periodic phenomena. These functions are essential for understanding waves, oscillations, rotations, and countless applications in physics, engineering, and signal processing.
