Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

After completing this section, you will be able to:

Define logarithms as inverse functions of exponentials and translate between logarithmic and exponential forms
Distinguish between common logarithms (base 10), natural logarithms (base e), and binary logarithms (base 2)
Apply logarithm properties (product, quotient, and power rules) to simplify expressions and solve equations
Use the change of base formula to convert between logarithms of different bases
Graph logarithmic functions and identify key features: domain, range, asymptotes, and intercepts
Recognize real-world applications including the Richter scale, decibels, pH, and information theory
Understand why logarithms are essential in machine learning for numerical stability and gradient computation

The Story of Logarithms

A Revolution in Calculation

In the early 17th century, astronomers, navigators, and scientists faced an enormous computational challenge. Calculating planetary orbits, navigating ships across oceans, and conducting scientific experiments required multiplying and dividing very large numbers—a process that was tedious, error-prone, and could take hours or even days.

In 1614, Scottish mathematician John Napier published a revolutionary discovery that would transform calculation: logarithms. His key insight was profound:

Napier's Insight: Multiplication can be converted to addition by working with exponents. If you want to multiply two numbers, add their logarithms instead—then convert back.

This single idea reduced multiplication to addition, division to subtraction, and exponentiation to simple multiplication. Before electronic calculators, logarithm tables and slide rules (which are essentially analog logarithm computers) were indispensable tools used by scientists and engineers for over 300 years.

The Name "Logarithm"

Napier coined the term from Greek: logos (ratio, proportion) and arithmos (number). A logarithm is literally a "ratio number"—it captures the ratio or proportion of growth in exponential processes.

Year	Development	Impact
1614	Napier publishes logarithm tables	Reduces calculation time by orders of magnitude
1617	Henry Briggs creates common (base 10) logarithms	Easier to use with decimal system
1620s	Slide rules invented	Portable logarithm calculators used for 350 years
1668	Natural logarithm (base e) formalized	Essential for calculus and continuous growth
1948	Shannon's information theory	Logarithms measure information in bits
Today	Machine learning	Log-likelihood, cross-entropy, softmax stabilization

Why This History Matters

Understanding logarithms as "computation tools for handling exponentials" explains why they appear everywhere: they convert multiplicative processes into additive ones. This is exactly why machine learning uses log-probabilities instead of raw probabilities—products of many small numbers become sums of manageable log-values.

Definition: The Inverse of Exponentials

The Fundamental Relationship

A logarithm answers a simple question: "To what power must I raise the base to get this number?"

Formally, for any base $b > 0$ with $b \neq 1$ :

\log_b(x) = y \quad \Longleftrightarrow \quad b^y = x

In words: "the logarithm base $b$ of $x$ " equals $y$ if and only if " $b$ to the $y$ power equals $x$ ."

The Inverse Function Relationship

The logarithm function $f(x) = \log_b(x)$ and the exponential function $g(x) = b^x$ are inverse functions. This means they "undo" each other:

\log_b(b^x) = x \quad \text{for all real } x

b^{\log_b(x)} = x \quad \text{for all } x > 0

Domain Restriction

The logarithm $\log_b(x)$ is only defined for $x > 0$ . You cannot take the logarithm of zero or a negative number (in the real numbers). This is because no real power of a positive base can produce zero or a negative result.

Intuition: The "What Power?" Question

Forget the symbols for a moment. Every logarithm is just a question. When you write $\log_2(8)$ you are literally asking: “starting from 1, how many times must I multiply by 2 to reach 8?” The answer is 3, because $2 \cdot 2 \cdot 2 = 8$ . That count of multiplications is the logarithm.

Analogy: think of an exponential function $b^y$ as a recipe (“start from 1 and multiply by $b$ a total of $y$ times to get a size $x$ ”). The logarithm is the recipe reader: given the finished cake of size $x$ , it tells you how many times you must have doubled (or tripled, or multiplied by $b$ ) to bake it. Exponentials grow; logs count growth steps. They are two sides of the same coin.

📝 Worked example by hand: convert 5 exponential statements to logarithms

Translate each exponential statement into a logarithm by reading it out loud: “base, to what power, gives result?”

Statement 1: 2⁴ = 16

Ask: 2 to what power gives 16?

Answer: 4. Therefore log₂(16) = 4.

Statement 2: 10³ = 1000

Ask: 10 to what power gives 1000?

Answer: 3. Therefore log₁₀(1000) = 3.

(The common log of a positive integer power of 10, such as 1000, is its count of trailing zeros.)

Statement 3: e⁰ = 1

Ask: e to what power gives 1?

Answer: 0. Therefore ln(1) = 0.

(Every base to the zero power is 1, so log_b(1)=0 always.)

Statement 4: 5⁻² = 1/25 = 0.04

Ask: 5 to what power gives 0.04?

Answer: −2. Therefore log₅(0.04) = −2.

(With a base greater than 1, logs of numbers between 0 and 1 are negative: we “un-multiplied.”)

Statement 5: 9^(1/2) = 3 (the square root)

Ask: 9 to what power gives 3?

Answer: 1/2. Therefore log₉(3) = 0.5.

(Roots are fractional exponents, so logs of roots are fractional logs.)

Round trip check. If $\log_2(16) = 4$ , then plugging back gives $2^4 = 16$ . The two operations exactly undo each other. That round trip is the entire content of the inverse-function property.
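The five hand conversions above can also be checked mechanically. The sketch below is only an illustration: the list name `statements` is made up for this example, and it relies on Python's standard `math.log(x, base)`. Because floating-point logarithms are not always exact (for instance, the base-10 log of 1000 may come back a hair below 3), the comparison uses a tolerance rather than `==`.

```python
import math

# Each tuple is (base, exponent, result), taken from the five statements above.
statements = [
    (2, 4, 16),
    (10, 3, 1000),
    (math.e, 0, 1),
    (5, -2, 0.04),
    (9, 0.5, 3),
]

for base, exponent, result in statements:
    # math.log(x, base) returns the logarithm of x in the given base.
    recovered = math.log(result, base)
    # Floating-point results may differ from the exact answer in the last
    # few digits, so compare with a tolerance instead of ==.
    assert math.isclose(recovered, exponent, abs_tol=1e-12)
    print(f"log base {base:.4g} of {result} = {recovered:.6f}")
```

Each printed logarithm should match the exponent in the original statement, which is the round trip in action.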

Common and Natural Logarithms

The Three Most Important Bases

Base	Name	Notation	Primary Use
b = 10	Common logarithm	log(x) or log₁₀(x)	Scientific notation, orders of magnitude, decibels
b = e ≈ 2.718...	Natural logarithm	ln(x) or logₑ(x)	Calculus, continuous growth, ML/statistics
b = 2	Binary logarithm	lg(x) or log₂(x)	Computer science, information theory, algorithms

The Natural Logarithm: Why Base e?

The natural logarithm (base $e \approx 2.71828\ldots$ ) might seem like an arbitrary choice, but it's actually the most natural base for calculus. The number $e$ is special because:

\frac{d}{dx}\left[e^x\right] = e^x \quad \text{and} \quad \frac{d}{dx}\left[\ln(x)\right] = \frac{1}{x}

No other base produces such clean derivatives. This is why natural logarithms dominate in calculus, differential equations, and any field dealing with continuous change.

The Natural Choice: When working with rates of change, growth, or decay, the natural logarithm is almost always the right choice. It simplifies derivatives, integrals, and the mathematics of continuous processes.
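One way to see why base e is singled out is to estimate slopes numerically. The sketch below assumes the standard calculus fact that the derivative of $b^x$ at $x = 0$ equals $\ln(b)$; the helper name `slope_at_zero` and the step size `h` are illustrative choices, not part of the lesson's own code.

```python
import math

def slope_at_zero(base, h=1e-6):
    """Central-difference estimate of the derivative of base**x at x = 0."""
    return (base ** h - base ** (-h)) / (2 * h)

for base in (2.0, math.e, 10.0):
    estimate = slope_at_zero(base)
    # For b**x the exact slope at 0 is ln(b); only base e gives a slope of 1.
    print(f"base {base:.5f}: slope ~ {estimate:.6f}, ln(base) = {math.log(base):.6f}")
```

For bases 2 and 10 an extra constant factor ($\ln 2$ or $\ln 10$) appears in the derivative; for base e that factor is 1, which is what "clean derivatives" means here.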

The Binary Logarithm: Counting Bits

In computer science, the binary logarithm tells you "how many bits do I need?" For a number $n$ :

$\lceil \log_2(n) \rceil$ bits are needed to represent $n$ distinct values
Binary search on $n$ elements takes $O(\log_2 n)$ comparisons
A balanced binary tree with $n$ nodes has a height of about $\log_2(n)$
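The "how many bits?" question can be sketched in a few lines. This example assumes the bit count should be rounded up to a whole number, i.e. $\lceil \log_2(n) \rceil$; the function name `bits_needed` is hypothetical. It uses Python's integer method `bit_length()` so the answer is exact even for very large `n`, and prints the floating-point formula alongside for comparison.

```python
import math

def bits_needed(n):
    """Smallest number of bits that can label n distinct values (n >= 1)."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using exact integer arithmetic.
    return (n - 1).bit_length()

for n in (2, 8, 9, 256, 1000, 1_000_000):
    print(n, bits_needed(n), math.ceil(math.log2(n)))
```

Note how 8 values fit in 3 bits but 9 values already need 4: the bit count only changes when `n` crosses a power of two.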

Notation Warning

In pure mathematics, "log" without a subscript usually means natural log (ln). In engineering and applied sciences, it often means log₁₀. In computer science, it typically means log₂. Always check the context or explicitly write the base to avoid confusion!

Properties of Logarithms

The Three Fundamental Laws

The properties of logarithms follow directly from the properties of exponents. Since $\log_b(x)$ is the inverse of $b^x$ , the laws of exponents transform into laws of logarithms.

The worked check below verifies each property numerically, using a = 2, b = 3, n = 2 and base 10 (values rounded to four decimals):

Product Rule: log(ab) = log(a) + log(b)

log(6.00) = 0.7782 and log(2) + log(3) = 0.3010 + 0.4771 ≈ 0.7782 ✓

Quotient Rule: log(a/b) = log(a) - log(b)

log(0.67) = -0.1761 and log(2) - log(3) = 0.3010 - 0.4771 = -0.1761 ✓

Power Rule: log(a^n) = n · log(a)

log(4.00) = 0.6021 and 2 × log(2) = 2 × 0.3010 ≈ 0.6021 ✓

These properties hold for any positive values of a and b.
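The same three checks can be reproduced in code. This is a minimal sketch using the illustrative values a = 2, b = 3, n = 2 from the example above and Python's `math.log10`; the variable names are chosen for this example only.

```python
import math

a, b, n = 2.0, 3.0, 2.0  # same illustrative values as the example above

product_lhs = math.log10(a * b)
product_rhs = math.log10(a) + math.log10(b)

quotient_lhs = math.log10(a / b)
quotient_rhs = math.log10(a) - math.log10(b)

power_lhs = math.log10(a ** n)
power_rhs = n * math.log10(a)

for name, lhs, rhs in [
    ("product", product_lhs, product_rhs),
    ("quotient", quotient_lhs, quotient_rhs),
    ("power", power_lhs, power_rhs),
]:
    # The two sides agree up to floating-point rounding.
    print(f"{name:8s} {lhs:.4f} {rhs:.4f} {math.isclose(lhs, rhs)}")
```

Changing `a`, `b`, or `n` to other positive values should leave every comparison true, since the rules are identities rather than coincidences of these particular numbers.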

Key Values to Memorize

Property	Formula	Why It's True
Log of 1	logᵦ(1) = 0	b⁰ = 1 for any base
Log of the base	logᵦ(b) = 1	b¹ = b
Log of a power of base	logᵦ(bⁿ) = n	By definition of logarithm
Inverse composition	b^(logᵦ(x)) = x	Inverse functions cancel

Change of Base Formula

What if you need $\log_2(100)$ but your calculator only has ln and log₁₀? The change of base formula converts between any bases:

\log_a(x) = \frac{\log_b(x)}{\log_b(a)} = \frac{\ln(x)}{\ln(a)} = \frac{\log(x)}{\log(a)}

Worked example: log₂(100) using base-10 logarithms

log₂(100) = log₁₀(100) / log₁₀(2) = 2.0000 / 0.3010 ≈ 6.6439

The change of base formula lets you convert between any logarithm bases using only one type of logarithm.

Computational Tip

In programming, most languages provide log (natural log) and sometimes log10. For any other base, use the change of base formula: log2(x) = log(x) / log(2)
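As a concrete version of the tip above, here is a small sketch of a general-base logarithm built only from the natural log. The function name `log_base` and its input checks are assumptions made for illustration.

```python
import math

def log_base(x, base):
    """Logarithm of x in an arbitrary base via the change of base formula."""
    if x <= 0:
        raise ValueError("x must be positive")
    if base <= 0 or base == 1:
        raise ValueError("base must be positive and not equal to 1")
    return math.log(x) / math.log(base)

print(round(log_base(100, 2), 4))   # 6.6439, matching the worked example above
print(round(log_base(100, 10), 4))  # 2.0
```

In Python specifically, `math.log2` and `math.log10` also exist and are usually preferable for those two bases; the formula matters most for bases the library does not provide directly.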

Graphing Logarithmic Functions

The graph of $y = \log_b(x)$ is the reflection of $y = b^x$ across the line $y = x$ . This reflection relationship between inverse functions is the heart of logarithmic graphs: swap x and y, and you swap the two curves.

[Interactive figure: "The Mirror: log and exp are reflections across y = x". The logarithm curve, the exponential curve, and the dashed mirror line y = x.]

For any point $(x, \log_b x)$ on the logarithm curve, the mirror point $(\log_b x, x)$ lies on the exponential curve. For example, with base e and x = 2.50, the log says ln(2.50) = 0.9163, and the exponential confirms the round trip: e^0.9163 = 2.5000. Swap the coordinates of any point on one curve and you land on the other. That coordinate swap is reflection across y = x.

[Interactive figure: graph of y = log_e(x) with adjustable base, optional exponential (inverse) curve and grid. Domain (0, +∞), range (-∞, +∞), key property log_e(e) = 1. It can be used to read off specific values like $\log_2(8) = 3$ or $\ln(e) = 1$ .]

Key Features of Logarithmic Graphs

Feature	Value	Explanation
Domain	(0, +∞)	Only positive inputs allowed
Range	(-∞, +∞)	Output can be any real number
Vertical Asymptote	x = 0	Graph approaches but never touches y-axis
x-intercept	(1, 0)	logᵦ(1) = 0 for any base
Key Point	(b, 1)	logᵦ(b) = 1
Behavior as x → 0⁺	y → -∞	Logarithm of small positive numbers is very negative
Behavior as x → +∞	y → +∞	Grows without bound, but very slowly

How the Base Affects the Graph

Base > 1: The function is increasing. Larger x gives larger y.
Larger base: The curve rises more slowly. Compare log₁₀(100) = 2 vs log₂(100) ≈ 6.64.
Base between 0 and 1: The function is decreasing (rarely used in practice).

Transformations of Logarithmic Functions

Standard function transformations apply to logarithms:

Transformation	Effect on Graph	Example
y = logᵦ(x) + k	Vertical shift up by k	ln(x) + 2
y = logᵦ(x - h)	Horizontal shift right by h	ln(x - 3), asymptote moves to x = 3
y = a · logᵦ(x)	Vertical stretch by factor a	2 ln(x)
y = logᵦ(cx)	Horizontal compression by factor c	ln(2x)
y = -logᵦ(x)	Reflection over x-axis	-ln(x)

Real-World Logarithmic Scales

Many natural phenomena span enormous ranges of values. Logarithmic scales compress these ranges into manageable numbers, making patterns visible that would be hidden on a linear scale.

Logarithmic Scales in the Real World

Earthquake Magnitudes (Richter Scale)

Each unit increase = 10x more ground motion, ~31.6x more energy

Description	Magnitude	Relative ground motion (A/A₀ = 10^M)
Barely felt	2	1.0e+2
Minor damage	4	1.0e+4
Moderate	5	1.0e+5
Strong	6	1.0e+6
Major	7	1.0e+7
Great (1906 SF)	7.9	7.9e+7
Massive (2011 Japan)	9.1	1.3e+9

Why use logarithmic scales? They compress huge ranges of values into manageable numbers. The difference between a magnitude 5 and magnitude 9 earthquake is about 10,000x in ground motion, but only 4 units on the Richter scale.

Why Logarithmic Scales?

Compress huge ranges: The visible light spectrum covers wavelengths from 400nm to 700nm, but the full electromagnetic spectrum spans from 10⁻¹⁵m (gamma rays) to 10⁸m (radio waves)—a factor of 10²³!
Match human perception: Our ears perceive loudness logarithmically. A sound 10x more intense sounds about twice as loud.
Reveal multiplicative patterns: Exponential growth appears as a straight line on a log scale, making it easy to identify.
Compare relative changes: A doubling looks the same size whether it's from 10 to 20 or from 10,000 to 20,000.

The Richter Scale: A Deep Dive

The Richter magnitude $M$ of an earthquake is:

M = \log_{10}\left(\frac{A}{A_0}\right)

where $A$ is the measured amplitude and $A_0$ is a reference amplitude. Each unit increase in magnitude means:

10× more ground motion (amplitude)
~31.6× more energy released (since energy ∝ amplitude^1.5, and 10^1.5 ≈ 31.6)
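The two ratios above can be turned into a tiny calculator. This is a sketch under stated assumptions: the amplitude ratio follows directly from the magnitude formula, while the energy ratio assumes energy scales as $10^{1.5 M}$, which is what the ~31.6× per-unit figure corresponds to. Both function names are hypothetical.

```python
def amplitude_ratio(m1, m2):
    """How many times larger the ground motion of magnitude m2 is than m1."""
    return 10 ** (m2 - m1)

def energy_ratio(m1, m2):
    """Approximate energy ratio, assuming energy scales as 10**(1.5 * M)."""
    return 10 ** (1.5 * (m2 - m1))

print(amplitude_ratio(5.0, 9.0))         # 10000.0
print(round(energy_ratio(6.0, 7.0), 1))  # 31.6
```

Because the scale is logarithmic, only the difference in magnitude matters: going from 5 to 9 gives the same amplitude ratio as going from 2 to 6.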

Applications in Machine Learning

Logarithms are ubiquitous in machine learning, not as a historical curiosity, but as an essential tool for numerical stability and theoretical elegance.

Why ML Uses Log-Probabilities

Consider computing the likelihood of observing data given a model. If you have $n$ independent observations with probabilities $p_1, p_2, \ldots, p_n$ :

\text{Likelihood } L = \prod_{i=1}^{n} p_i

The Problem: If each $p_i \approx 0.99$ and $n = 1000$ :

L = 0.99^{1000} \approx 4.3 \times 10^{-5}

For smaller probabilities or more samples, $L$ underflows to exactly 0 in floating-point arithmetic—catastrophic for training!

The Solution: Work with log-likelihood instead:

\log L = \sum_{i=1}^{n} \log(p_i)

Products become sums, tiny numbers become manageable negative numbers, and gradients become stable.
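To see the underflow actually happen, the sketch below uses hypothetical values (p = 0.01 for each of n = 200 observations) chosen so that the exact product, $10^{-400}$, is smaller than anything float64 can represent. The variable names are illustrative.

```python
import math

p = 0.01   # hypothetical per-observation probability
n = 200    # hypothetical number of independent observations

# Direct product: 0.01**200 = 1e-400, far below the smallest float64 (~5e-324).
likelihood = 1.0
for _ in range(n):
    likelihood *= p

# Sum of logs: 200 * ln(0.01), an ordinary negative number.
log_likelihood = sum(math.log(p) for _ in range(n))

print(likelihood)                # 0.0
print(round(log_likelihood, 2))  # -921.03
```

The product has silently become exactly zero, so any later step that divides by it or takes its log would fail, while the log-space value is still an ordinary float.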

Logarithms in Machine Learning: Log-Likelihood

Why ML uses logarithms: products become sums, tiny probabilities become manageable numbers.

Example: predicted probability p = 0.70, with 8 of 10 predictions correct.

Raw likelihood (product of probabilities): L = p⁸ × (1-p)² = 5.1883e-3. With many samples this product underflows to 0.

Log-likelihood (sum of logs): log(L) = 8 × log(p) + 2 × log(1-p) = -5.2613. This stays numerically stable as the sample count grows.

Connection to Cross-Entropy Loss

Cross-Entropy = -log(L)/n = 5.2613/10 = 0.5261

Cross-entropy loss is the negative average log-likelihood. Minimizing cross-entropy = maximizing likelihood!

Plain Python first — multiply, then add logs

Before we touch any framework, let's do log-likelihood by hand. The point of this snippet is not to be efficient — it is to make you feel the difference between a product and a sum of logs.

Likelihood vs Log-Likelihood — Plain Python Loop

Explanation (keyed to the line numbers of the code listing that follows)

Line 1: import math

We only need math.log (= ln) and math.exp. Using the standard library keeps the algorithm visible, with no NumPy magic.

Line 5: observations = the data

A list of 8 binary outcomes: six 1s and two 0s. This list is the entire dataset we will score the model against. In ML language, these are the labels for 8 training examples.

Execution state: observations = [1, 1, 0, 1, 1, 0, 1, 1]; len(observations) = 8; sum(observations) = 6 (heads)

Line 6: p = 0.6, the model's belief

The model says 'I think heads is 60% likely'. Our job is to score how good that belief is given the actual observations. Higher likelihood = better fit.

Execution state: p = 0.6

Line 9: likelihood = 1.0, the running product

We initialise to 1.0 because the multiplicative identity is 1, just like a running sum starts at 0. Every observation will multiply this number by something between 0 and 1, so the product can only shrink.

Execution state: likelihood = 1.0

Line 10: loop over observations (iteration trace)

We walk through all 8 outcomes. For each y=1 we multiply by p; for each y=0 we multiply by (1−p). Watch the running product collapse toward zero.

i	y	factor	likelihood
0	1	p = 0.6	0.600000
1	1	p = 0.6	0.360000
2	0	1 - p = 0.4	0.144000
3	1	p = 0.6	0.086400
4	1	p = 0.6	0.051840
5	0	1 - p = 0.4	0.020736
6	1	p = 0.6	0.012442
7	1	p = 0.6	0.007465

Line 11: likelihood *= …, the multiplication

Python's conditional expression: pick p if heads, 1−p if tails. After 8 iterations the product is 0.6⁶ × 0.4² = 0.046656 × 0.16 = 0.0074649600. With the same mix of heads and tails, the product shrinks to about 10^-266 after 1000 observations and underflows to exactly 0.0 in float64 after roughly 1,200.

Line 12: print the likelihood

Expected stdout: approximately 'likelihood = 0.00746496' (the last printed digits may differ by floating-point rounding). Small but still representable. Now imagine 10000 observations: the product becomes roughly 10^-2659, far below the float64 minimum (~10^-308 for normal numbers), and the answer collapses to 0.

Line 15: log_likelihood = 0.0, the running sum

Switch to log space. Multiplication of probabilities becomes addition of log-probabilities (product rule of logs). We start the running sum at 0 because log(1) = 0.

Execution state: log_likelihood = 0.0

Line 16: loop again, same logic in log space

Same eight iterations, but every multiply becomes an add and every probability becomes its log. log(0.6) ≈ −0.510826, log(0.4) ≈ −0.916291.

i	y	log_likelihood
0	1	-0.510826
1	1	-1.021651
2	0	-1.937942
3	1	-2.448768
4	1	-2.959593
5	0	-3.875884
6	1	-4.386710
7	1	-4.897535

Line 17: log_likelihood += math.log(…)

Conditional add: log(p) if heads, log(1−p) if tails. After 8 iterations the sum is 6·log(0.6) + 2·log(0.4) ≈ −4.897535. Notice: the number is comfortably representable, and adding more samples just makes it more negative. It never underflows.

Line 18: print log_likelihood

Expected stdout: approximately 'log_likelihood = -4.897535'. Imagine 10000 observations: the log-likelihood would be around −6100, still a totally normal float. That is why ML libraries typically store and optimise log-likelihoods.

Line 21: exp(log_likelihood), the round-trip check

exp(−4.897535) ≈ 0.00746496, matching the direct product printed on line 12 up to floating-point rounding. This is the mathematical guarantee: multiplying probabilities and summing their logs are the same operation, viewed through the exp/log mirror.

Execution state: math.exp(log_likelihood) = 0.00746496

import math

# Imagine 8 independent coin flips. Our model predicts p = 0.6 for "heads".
# Observed sequence: 1=heads, 0=tails.
observations = [1, 1, 0, 1, 1, 0, 1, 1]
p = 0.6

# Approach A: multiply raw probabilities (the dangerous way)
likelihood = 1.0
for y in observations:
    likelihood *= p if y == 1 else (1.0 - p)
print("likelihood        =", likelihood)

# Approach B: sum log-probabilities (the safe way)
log_likelihood = 0.0
for y in observations:
    log_likelihood += math.log(p) if y == 1 else math.log(1.0 - p)
print("log_likelihood    =", log_likelihood)

# Sanity: exp of the log-likelihood should recover the raw likelihood.
print("exp(log_lik)      =", math.exp(log_likelihood))

The one-line moral

Products of many small probabilities underflow; sums of their logs do not. Probability-based ML loss functions (cross-entropy, KL divergence, negative log-likelihood) are typically computed in log space largely for this reason.

Cross-Entropy Loss

The cross-entropy loss for classification is the negative log-likelihood:

H(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)

where $y$ is the true label (one-hot) and $\hat{y}$ is the predicted probability distribution.
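The formula above can be evaluated directly for a single example. The sketch below is illustrative only: the function name `cross_entropy`, the clipping floor `eps`, and the two prediction vectors are all assumptions made for this example rather than anything a particular library prescribes.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy H(y, y_hat) = -sum_i y_i * log(y_hat_i) for one example."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Clip predictions away from 0 so log never receives exactly zero.
    # The eps value is an illustrative choice, not a standard constant.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))

y = [0, 1, 0]                   # one-hot label: the true class is index 1
confident = [0.05, 0.90, 0.05]  # model puts 90% on the true class
unsure = [0.40, 0.30, 0.30]     # model puts only 30% on the true class

print(round(cross_entropy(y, confident), 4))  # 0.1054
print(round(cross_entropy(y, unsure), 4))     # 1.204
```

With a one-hot label the sum collapses to a single term, the negative log of the probability assigned to the true class, so a confident correct prediction is penalised far less than an unsure one.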

Log-Softmax for Numerical Stability

The softmax function converts logits to probabilities:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

The problem: For large logits, $e^{z_i}$ overflows. The solution: Compute log-softmax directly:

\log(\text{softmax}(z_i)) = z_i - \log\left(\sum_j e^{z_j}\right)

Using the "log-sum-exp trick" with a shift $c = \max(z)$ :

\log\left(\sum_j e^{z_j}\right) = c + \log\left(\sum_j e^{z_j - c}\right)

We will now build the log-softmax twice. First in plain NumPy — line by line, with every value hand-traced — so you can see the math. Then we will hand it to PyTorch and watch the same numbers come out in one call. Plain Python first, framework second.

Step 1 — Pure NumPy: trace every line

Numerically Stable Log-Softmax (NumPy, from scratch)

Explanation (keyed to the line numbers of the code listing that follows)

Line 1: import NumPy

We need vectorised math: np.exp, np.log, np.max, np.sum. Everything operates element-wise on the array of logits.

Line 3: define the naive (broken) baseline

Naive log-softmax follows the textbook formula directly: exponentiate, normalize, take log. It is mathematically correct but numerically fragile. We keep it so you can see the failure mode side-by-side with the fix.

Line 5: exp_logits = np.exp(logits), the overflow

For logits = [1000, 1001, 1002], np.exp(1000) is roughly 2 × 10^434, which is far beyond the largest float64 value (~1.8 × 10^308). NumPy returns inf for every entry. The naive path is already dead at this line; everything downstream is nan.

Execution state: logits = array([1000., 1001., 1002.]); exp_logits = array([inf, inf, inf])

Line 6: softmax = exp_logits / np.sum(exp_logits)

inf divided by inf is an indeterminate form. NumPy reports nan for every entry and raises a runtime warning. Note that mathematically the answer should be approximately [0.0900, 0.2447, 0.6652]; we have completely lost the information.

Execution state: np.sum(exp_logits) = inf; softmax = array([nan, nan, nan])

Line 7: return np.log(softmax), the final blowup

log of nan is nan, so the function returns an array of nans. In a real training loop this propagates into the loss, then into gradients, then into weights, and your network is unrecoverable.

Execution state: return = array([nan, nan, nan])

Line 9: define the stable version

We now repair the algorithm using the log-sum-exp identity: log Σ e^{z_i} = c + log Σ e^{z_i − c} for any constant c. Choosing c = max(z) keeps every shifted exponent ≤ 0, which keeps exp safely between 0 and 1.

Line 11: c = np.max(logits), pick the shift

c is just a number. It cancels mathematically (we add it back later), but operationally it makes the exponents harmless. Choosing the maximum guarantees every shifted entry is ≤ 0, so every exp call returns a finite number in [0, 1].

Execution state: c = 1002.0

Line 12: shifted = logits − c, bring everything to ≤ 0

We subtract c from every entry. The largest logit becomes 0 and every other entry becomes negative. Concretely [1000, 1001, 1002] − 1002 = [−2, −1, 0]. These are perfectly safe inputs to exp.

Execution state: shifted = array([-2., -1., 0.])

Line 13: sum_exp = np.sum(np.exp(shifted))

Now exp behaves: exp([-2, -1, 0]) = [e^-2, e^-1, e^0] ≈ [0.1353, 0.3679, 1.0000]. Their sum ≈ 1.5032. No overflow, and because the largest term is exactly 1 the sum can never underflow to zero.

Execution state: np.exp(shifted) = array([0.1353, 0.3679, 1.0000]); sum_exp = 1.5032147244

Line 14: log_sum_exp = c + np.log(sum_exp), add c back

Here is where the cancellation happens. log Σ e^{z_i} = log Σ e^{z_i − c + c} = c + log Σ e^{z_i − c}. We subtracted c inside the exp, so we add it back outside the log. The result is mathematically identical to the direct formula but computed with finite numbers.

Execution state: np.log(sum_exp) = 0.4076059644; log_sum_exp = 1002.4076059644

Line 15: return logits − log_sum_exp

log-softmax of z_i is z_i − log Σ e^{z_j}. We subtract the single scalar log_sum_exp from every logit. The output is a vector of log probabilities: all ≤ 0, and exp of them sums to 1.

Execution state: result = array([-2.4076, -1.4076, -0.4076]); np.exp(result).sum() = 1.0000

Line 18: logits = the stress test

Three values, each large enough to overflow exp individually. Real transformer logits typically sit in the range ±20, but during training a single bad batch can spike them past 100, which already overflows exp in float32 (the threshold is about 88.7; float64 holds out until about 709.8). Using 1000 makes the failure dramatic and unmissable.

Line 19: print naive, should be all nans

Expected stdout: naive : [nan nan nan]. This is the canonical sign of a numerical instability in a deep learning pipeline.

Line 20: print stable, finite and correct

Expected stdout: stable: [-2.40760596 -1.40760596 -0.40760596]. Exponentiating those gives [0.0900, 0.2447, 0.6652], which sum to 1.0, a proper probability distribution.

import numpy as np

def log_softmax_naive(logits):
    """Naive implementation -- WILL OVERFLOW for large logits."""
    exp_logits = np.exp(logits)              # e^1000 = inf
    softmax = exp_logits / np.sum(exp_logits)  # inf / inf = nan
    return np.log(softmax)                   # log(nan) = nan

def log_softmax_stable(logits):
    """Stable log-softmax using the log-sum-exp trick."""
    c = np.max(logits)                       # shift constant
    shifted = logits - c                     # all entries <= 0
    sum_exp = np.sum(np.exp(shifted))        # finite, well-behaved
    log_sum_exp = c + np.log(sum_exp)        # add c back
    return logits - log_sum_exp              # log of softmax

# Extreme logits that crash the naive path
logits = np.array([1000.0, 1001.0, 1002.0])
print("naive :", log_softmax_naive(logits))
print("stable:", log_softmax_stable(logits))
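In practice logits usually arrive as a batch of shape (N, C) rather than a single vector. The sketch below extends the same max-shift idea to that case; the function name `log_softmax_batched` and the sample batch are illustrative assumptions, and the only new ingredient is reducing along the last axis with `keepdims=True` so each row gets its own shift.

```python
import numpy as np

def log_softmax_batched(logits):
    """Stable log-softmax applied independently to each row of a 2-D array."""
    logits = np.asarray(logits, dtype=float)
    # keepdims=True keeps shape (N, 1) so the subtraction broadcasts per row.
    c = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = c + np.log(np.sum(np.exp(logits - c), axis=-1, keepdims=True))
    return logits - log_sum_exp

batch = np.array([
    [1000.0, 1001.0, 1002.0],   # the extreme row from the example above
    [0.0, 0.0, 0.0],            # uniform logits: each log-prob is -ln(3)
])
out = log_softmax_batched(batch)
print(np.round(out, 4))
print(np.exp(out).sum(axis=-1))  # each row sums to 1 (up to rounding)
```

Using a per-row maximum rather than one global maximum matters: a single huge row would otherwise push the exponentials of every other row toward zero.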

Step 2 — The same idea in PyTorch (one call)

Once you understand the trick by hand, PyTorch hides the boilerplate. torch.nn.functional.log_softmax already uses the log-sum-exp trick internally, with extra speed-ups for GPU and autograd.

Same Computation in PyTorch — Library-Level

Explanation (keyed to the line numbers of the code listing that follows)

Line 1: import torch

torch is the core tensor library. Tensors are NumPy-like arrays but also track gradients and can live on GPU.

Line 2: import the functional API

torch.nn.functional (aliased F) holds stateless operations that work on tensors directly: log_softmax, cross_entropy, relu, and so on.

Line 5: create the logits tensor

Same numbers as the NumPy example. torch.tensor(...) copies the data into a float32 tensor by default. dtype matters: the float32 spacing near 1 is ~1.2e-7 (versus ~2.2e-16 for float64), so the stability arguments are even more important than in NumPy's float64.

Execution state: logits = tensor([1000., 1001., 1002.]); logits.dtype = torch.float32; logits.requires_grad = False

Line 8: F.log_softmax, one line replaces our 5 lines

Conceptually PyTorch computes max(logits) along dim, subtracts it, exponentiates, sums, takes log, and subtracts. It is the same algorithm we just wrote, behind a compiled kernel. dim=-1 means 'the last axis': for a 1-D vector that is the only axis; for a batch tensor of shape (N, C) it would mean reduce across the class axis.

Execution state: log_probs = tensor([-2.4076, -1.4076, -0.4076])

Line 11: probs = log_probs.exp()

We exponentiate the log-probabilities to recover the actual softmax. exp(-2.4076) ≈ 0.0900, exp(-1.4076) ≈ 0.2447, exp(-0.4076) ≈ 0.6652. Notice the largest logit gets the largest probability; that is exactly the 'soft argmax' interpretation of softmax.

Execution state: probs = tensor([0.0900, 0.2447, 0.6652])

Line 12: inspect log_probs

Expected stdout: tensor([-2.4076, -1.4076, -0.4076]), identical to our hand-built NumPy stable version (within float32 precision). The framework call and the hand-derived call agree.

Line 13: inspect probs

Expected: tensor([0.0900, 0.2447, 0.6652]). The probabilities are all positive, monotonic with the logits, and (by construction) sum to 1.

Line 14: round-trip sanity check

Expected: a value within float32 rounding of 1.0 (for example 0.9999999 or 1.0). This is the unit test for any softmax-like operation: probabilities must sum to one. If you ever see 0.97 or 1.03 here, your implementation is broken.

Execution state: probs.sum() = tensor(1.0000)

import torch
import torch.nn.functional as F

# Same extreme logits, now as a PyTorch tensor on the default device.
logits = torch.tensor([1000.0, 1001.0, 1002.0])

# log_softmax internally does max-shift + log-sum-exp, with autograd support.
log_probs = F.log_softmax(logits, dim=-1)

# Round-trip check: exp(log_probs) must sum to 1.
probs = log_probs.exp()
print("log_probs:", log_probs)
print("probs    :", probs)
print("sum(probs):", probs.sum().item())

Information Theory: Bits and Entropy

Claude Shannon's information theory uses the binary logarithm to measure information:

I(x) = -\log_2(p(x)) \text{ bits}

The entropy (average information content) of a distribution is:

H(X) = -\sum_x p(x) \log_2(p(x))

Event Probability	Information Content	Intuition
p = 1 (certain)	-log₂(1) = 0 bits	No surprise, no information
p = 0.5 (coin flip)	-log₂(0.5) = 1 bit	One binary question answered
p = 0.25	-log₂(0.25) = 2 bits	Two binary questions answered
p = 0.001 (rare)	-log₂(0.001) ≈ 10 bits	Very surprising, high information
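The table rows and the entropy formula can be reproduced with two short helpers. This is a minimal sketch: the function names `information_bits` and `entropy_bits` are hypothetical, and skipping zero-probability terms follows the usual convention that $0 \cdot \log 0$ counts as 0.

```python
import math

def information_bits(p):
    """Self-information -log2(p) of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(information_bits(0.5))               # 1.0
print(information_bits(0.25))              # 2.0
print(round(information_bits(0.001), 2))   # 9.97
print(entropy_bits([0.5, 0.5]))            # 1.0
print(round(entropy_bits([0.9, 0.1]), 3))  # 0.469
```

A fair coin carries a full bit per flip, while a 90/10 coin carries less than half a bit on average, because most of its outcomes are unsurprising.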

Numerical Computing Considerations

Common Pitfalls

Numerical Hazards with Logarithms

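log(0): Undefined; the limit is -∞ (NumPy returns -inf with a warning, while Python's math.log raises ValueError)
log(negative): Undefined in the reals (NumPy returns NaN with a warning; math.log raises ValueError)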
log(1 + x) for small x: Use log1p(x) for accuracy
Subtracting large logs: Can cause precision loss; use log-sum-exp tricks
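The first two hazards are easy to trigger on purpose, which is a good way to recognise them later. The sketch below shows how NumPy and Python's `math` module each respond, followed by one common guard. The clipping floor of `1e-12` and the array `probs` are illustrative assumptions, not recommended constants.

```python
import math
import numpy as np

# NumPy follows IEEE 754 conventions and emits RuntimeWarnings;
# errstate silences them here so the output stays readable.
with np.errstate(divide="ignore", invalid="ignore"):
    log_zero = np.log(0.0)        # -inf
    log_negative = np.log(-1.0)   # nan

print(log_zero, log_negative)

# Python's math.log raises instead of returning a special value.
try:
    math.log(0.0)
except ValueError as err:
    print("math.log(0.0) raised ValueError:", err)

# One common guard (an illustrative choice): clip inputs to a small floor
# before taking the log. The floor value 1e-12 is an assumption.
probs = np.array([0.0, 0.2, 0.8])
safe_logs = np.log(np.clip(probs, 1e-12, None))
print(safe_logs)
```

Clipping trades a tiny bias for finite values; whether that trade is acceptable depends on the application, so treat it as one option rather than a universal fix.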

The log1p Function

When computing $\log(1 + x)$ for small $x$ , direct computation loses precision because $1 + x \approx 1$ in floating point. The function log1p(x) computes $\log(1 + x)$ accurately.

For tiny $x$ , the Taylor series of $\ln(1+x)$ starts with $x - \tfrac{x^2}{2} + \tfrac{x^3}{3} - \dots$ , so to first order $\ln(1+x) \approx x$ . The naive computation np.log(1 + x) first adds 1 to x, and that addition is where the precision is destroyed, before the logarithm ever runs. log1p is designed to avoid that loss; for tiny x its result matches the series above.

Why log1p Matters for Small Values

🐍python

Explanation(10)

Code(19)

1Import NumPy

We use NumPy to access np.log (= ln) and np.log1p. Both are vectorised; here we use them on scalars for clarity.

3x = 1e-17 — chosen to be below the ulp at 1.0

The 'unit in the last place' (ulp) of a float64 near 1.0 is 2^-52 ≈ 2.22 × 10^-16. Any number smaller than half a ulp is invisible when added to 1.0. We picked x = 1e-17 specifically to fall below that threshold so the failure is unmistakable.

EXECUTION STATE

x = 1e-17

ulp(1.0) = ~2.22e-16

6ones_plus_x = 1.0 + x — the silent failure

Floating-point addition has to round the exact result (1.00000000000000001) to the nearest representable double. The nearest representable double is exactly 1.0. So the entire information content of x is thrown away here, before any log runs.

EXECUTION STATE

ones_plus_x = 1.0

ones_plus_x == 1.0 = True

7Print the stored value

Expected stdout: '1 + x stored as: 1.0'. Note repr — Python's full-precision printer — still shows 1.0, confirming x has vanished.

10naive = np.log(1.0 + x) — log of the wrong number

Since (1.0 + x) was stored as 1.0, np.log returns log(1.0) = 0.0 exactly. The naive routine reports the answer is zero, when in reality it should be approximately 1e-17. Catastrophic cancellation: 100% relative error.

EXECUTION STATE

naive = 0.0

11Print naive — confirms it is 0

Expected stdout: 'naive np.log(1 + x): 0.0'. A silent, devastating bug.

14accurate = np.log1p(x) — does the right thing

log1p is implemented to compute log(1+x) without ever forming the intermediate 1+x. For tiny x it uses the Taylor series x - x²/2 + x³/3 - … directly. The bad addition never happens, so all of x's bits survive.

EXECUTION STATE

accurate = 1e-17

15Print accurate — recovers x correctly

Expected stdout: 'np.log1p(x) : 1e-17'. The function returns the correct answer to full precision.

18true_value via Taylor — independent cross-check

Computing x − x²/2 by hand bypasses np.log entirely. For x = 1e-17, the x² term is 1e-34 — way below precision — so true_value ≈ x = 1e-17. This confirms log1p's answer is right.

EXECUTION STATE

true_value = 1e-17
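
As a short worked check of why two Taylor terms suffice here, the sketch below compares the relative size of the second term with the float64 rounding threshold. It assumes round-to-nearest arithmetic, under which half the gap between adjacent doubles near x is at least 2^-54 times x.

```latex
\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots, \qquad |x| < 1

\text{For } x = 10^{-17}: \quad
\frac{x^2/2}{x} = \frac{x}{2} = 5 \times 10^{-18}
\;<\; 2^{-54} \approx 5.55 \times 10^{-17}
```

The correction x²/2 is therefore smaller than half the spacing of doubles near x, and x − x²/2 rounds back to x itself.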

Line 19: print the truth

Expected stdout: 'true ≈ x - x^2/2 : 1e-17'. After the stored sum, three candidate answers were printed: naive=0.0, log1p=1e-17, truth=1e-17. log1p and truth agree; naive is wrong.


import numpy as np

x = 1e-17  # tiny: well below the float64 ulp at 1.0 (~2.22e-16)

# Step 1: the addition that ruins precision
ones_plus_x = 1.0 + x
print("1 + x stored as:", repr(ones_plus_x))

# Step 2: naive log loses every bit of x
naive = np.log(1.0 + x)
print("naive np.log(1 + x):", naive)

# Step 3: log1p evaluates the series, no precision loss
accurate = np.log1p(x)
print("np.log1p(x)        :", accurate)

# Step 4: true value to compare against
true_value = x - 0.5 * x**2  # first two Taylor terms
print("true ≈ x - x^2/2   :", true_value)
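
To see how the naive form degrades as x shrinks, here is a minimal sketch that sweeps several magnitudes and measures relative error against a 50-digit reference computed with Python's decimal module. The helper names reference_log1p and relative_error, and the chosen x values, are illustrative assumptions rather than part of the original listing.

```python
from decimal import Decimal, getcontext

import numpy as np

getcontext().prec = 50  # 50 significant digits for the reference value


def reference_log1p(x: float) -> Decimal:
    """High-precision ln(1 + x); Decimal(x) converts the float exactly."""
    return (Decimal(1) + Decimal(x)).ln()


def relative_error(approx: float, exact: Decimal) -> float:
    """|approx - exact| / |exact|, evaluated in Decimal arithmetic."""
    return float(abs(Decimal(float(approx)) - exact) / abs(exact))


results = {}
for x in [1e-3, 1e-8, 1e-12, 1e-15, 1e-17]:
    exact = reference_log1p(x)
    err_naive = relative_error(np.log(1.0 + x), exact)
    err_log1p = relative_error(np.log1p(x), exact)
    results[x] = (err_naive, err_log1p)
    print(f"x={x:.0e}  naive rel.err={err_naive:.2e}  log1p rel.err={err_log1p:.2e}")
```

The naive error should grow steadily as x approaches the ulp and reach 100% at x = 1e-17, while log1p stays near machine precision throughout.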

Summary

Concept	Key Formula	Application
Definition	log_b(x) = y ⟺ bʸ = x	Inverse of exponentials
Product Rule	log(xy) = log(x) + log(y)	Multiply → Add
Quotient Rule	log(x/y) = log(x) - log(y)	Divide → Subtract
Power Rule	log(xⁿ) = n·log(x)	Exponent → Multiply
Change of Base	logₐ(x) = log(x)/log(a)	Convert between bases
Natural Log	ln(x) = logₑ(x)	Calculus, ML, continuous growth
Log-Likelihood	log L = Σ log(pᵢ)	ML training stability
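
The identities in the table can be spot-checked numerically. This is a minimal sketch using only Python's standard math module (math.prod needs Python 3.8+); the test values x, y, n, a and the list probs are arbitrary choices for illustration.

```python
import math

x, y, n, a = 8.0, 32.0, 3, 2.0  # arbitrary positive test values

checks = {
    "product": (math.log(x * y), math.log(x) + math.log(y)),
    "quotient": (math.log(x / y), math.log(x) - math.log(y)),
    "power": (math.log(x ** n), n * math.log(x)),
    # Change of base: log_a(x) = ln(x) / ln(a), compared with math.log2.
    "change of base": (math.log2(x), math.log(x) / math.log(a)),
}

for name, (lhs, rhs) in checks.items():
    print(f"{name:15s} lhs={lhs:.12f} rhs={rhs:.12f} match={math.isclose(lhs, rhs)}")

# Log-likelihood: a sum of logs stays finite where the raw product underflows.
probs = [1e-5] * 100
product = math.prod(probs)                    # true value 1e-500 underflows
log_sum = sum(math.log(p) for p in probs)     # about 100 * ln(1e-5)
print("product of probs:", product)
print("sum of log probs:", log_sum)
```

The last two lines illustrate the table's final row: the product collapses to zero in float64, whereas the log-likelihood remains an ordinary finite number.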

Key Takeaways

Logarithms are inverses of exponentials—they answer "what power?"
Three important bases: e (calculus), 10 (scientific), 2 (computing)
Properties convert operations: multiplication → addition, powers → multiplication
Real-world scales (Richter, decibels, pH) compress huge ranges
ML relies on logarithms for numerical stability and theoretical elegance
Always check for edge cases: log(0), log(negative), precision for small arguments
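
The edge cases in the last takeaway behave differently in the standard library and in NumPy. The sketch below shows both, plus one common guard; the floor TINY and the sample array probs are hypothetical values chosen for illustration, and clipping is only one of several possible strategies.

```python
import math

import numpy as np

# math.log raises on arguments outside the domain (0, inf).
try:
    math.log(0.0)
except ValueError as exc:
    print("math.log(0.0) raised ValueError:", exc)

# NumPy returns -inf / nan instead and emits a RuntimeWarning;
# errstate silences the warnings for this demonstration only.
with np.errstate(divide="ignore", invalid="ignore"):
    log_zero = np.log(0.0)       # -inf
    log_negative = np.log(-1.0)  # nan
print("np.log(0.0) :", log_zero)
print("np.log(-1.0):", log_negative)

# One common guard (an assumption, not the only option): clip
# probabilities away from 0 before taking the log.
TINY = 1e-12  # hypothetical floor; choose to suit the application
probs = np.array([0.0, 0.25, 1.0])
safe_logs = np.log(np.clip(probs, TINY, 1.0))
print("clipped logs:", safe_logs)
```

Clipping trades a small bias for finite values, which is often acceptable in a loss function but should be a deliberate choice.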

Exercises

Conceptual Questions

Explain in your own words why $\log_b(1) = 0$ for any valid base $b$.
Why can't we take the logarithm of a negative number using real numbers?
If $\log_2(n) = 10$, how many elements $n$ are in the set? About how many comparisons would binary search take?
A sound at 60 dB is how many times more intense than a sound at 40 dB?

Computational Problems

Simplify: $\log_3(81) + \log_3\left(\frac{1}{27}\right)$
Solve for x: $\log_2(x-3) + \log_2(x+3) = 4$
Express $\log_5(100)$ in terms of common logarithms (base 10).
If an earthquake releases 1000 times more energy than a magnitude 4 earthquake, what is its magnitude? (Hint: energy ratio ≈ $31.6^{\Delta M}$)

Programming Challenges

Implement a function that computes $\log_b(x)$ for any base using only the natural log.
Write a numerically stable function to compute cross-entropy loss given predicted probabilities and true labels.
Create a visualization comparing linear and logarithmic scales for the electromagnetic spectrum (wavelengths from 10⁻¹² to 10⁴ meters).

Exploration

Research Benford's Law: why do leading digits in many datasets follow a logarithmic distribution?
Investigate how logarithms appear in the analysis of algorithm complexity (e.g., why is merge sort O(n log n)?).
Explore the connection between logarithms and music: why is the frequency ratio between octaves 2:1, and how do logarithms relate to the perception of pitch?

In the next section, we'll explore trigonometric functions—the mathematics of circular motion and periodic phenomena. These functions are essential for understanding waves, oscillations, rotations, and countless applications in physics, engineering, and signal processing.
