Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Read and write piecewise function notation fluently — translating between a list of rules and the graph they produce.
Decide whether a piecewise function is continuous by checking left and right limits at every breakpoint.
Re-express $|x|$ and its variants as explicit two-piece linear functions, and solve equations like $|2x - 3| = 5$ by case analysis.
Visualize the V-shape transformations $f(x) = a|x - h| + k$ and read off the vertex, slopes, and reflections at a glance.
Recognize the Heaviside step, sign function, ReLU, and tax brackets as piecewise functions sharing one structural idea.
Anticipate the consequences of a corner — non-differentiability — and connect this to the dying-ReLU phenomenon in deep learning.

The Big Picture: One Function, Many Rules

"Reality is piecewise. Math gets neat when we admit it."

Most functions in school appear as a single formula: $f(x) = x^2$ , $f(x) = \sin x$ , $f(x) = e^x$ . But many natural rules are not like that. A tax authority applies one rate up to one threshold, a different rate above it. A thermostat is OFF below a setpoint and ON above. A neuron in a deep network is silent for negative inputs and linear for positive ones. A car's brakes do nothing until you press them.

These rules all share one shape: different formulas on different parts of the input domain. The function that captures this idea is called a piecewise function. The simplest and most famous instance is the absolute value, $|x|$ , which is $-x$ on one side and $+x$ on the other.

The core insight

A piecewise function is not a new kind of object. It is an old kind of object — a function — that we describe with a switchboard instead of a single algebraic expression. The switchboard says: "If x lives in region 1, apply rule 1; if x lives in region 2, apply rule 2; …" Every input still produces exactly one output.

An everyday analogy

Think about your phone bill. The first 100 minutes are free; the next 500 minutes cost $\$0.05$ each; everything beyond that is $\$0.10$ each. The cost as a function of minutes is not one formula — it's three formulas stitched together. That stitch is what piecewise notation makes precise.
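As a rough sketch, the phone-bill rule can be written as a three-branch Python function. The name `phone_bill` is hypothetical, and the breakpoints at 100 and 600 minutes assume that "the next 500 minutes" means minutes 101 through 600.

```python
def phone_bill(minutes):
    """Cost in dollars as a function of minutes used (three-tier plan)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if minutes <= 100:
        # Tier 1: the first 100 minutes are free.
        return 0.0
    if minutes <= 600:
        # Tier 2: each minute beyond 100 costs 0.05 dollars.
        return 0.05 * (minutes - 100)
    # Tier 3: tier 2 is fully used (500 minutes), the rest costs 0.10 each.
    return 0.05 * 500 + 0.10 * (minutes - 600)


for m in (50, 100, 350, 600, 700):
    print(f"{m:>4} min -> {phone_bill(m):6.2f} dollars")
```

The third branch starts from the amount already accrued in the second tier, which is what keeps the bill from jumping at 600 minutes.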

Mathematical Definition

A piecewise function is a function whose rule changes depending on which subset of the domain its input lies in. The general form with $n$ pieces is

f(x) = \begin{cases} f_1(x) & x \in D_1 \\ f_2(x) & x \in D_2 \\ \;\;\vdots & \;\;\vdots \\ f_n(x) & x \in D_n \end{cases}

with two non-negotiable rules:

Coverage. The domains $D_1, D_2, \ldots, D_n$ together cover every input the function is supposed to accept.
No conflict. The domains do not overlap. (Or, if they do overlap, the rules agree on the overlap so there is no ambiguity.)
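The two rules can be spot-checked mechanically. The sketch below is only a sampling test, not a proof: the helper name `check_partition` is hypothetical, and it reports every sample point that is claimed by zero conditions (a coverage gap) or by more than one (a potential conflict).

```python
def check_partition(conditions, samples):
    """Return (x, hits) for every sample not matched by exactly one condition."""
    problems = []
    for x in samples:
        hits = sum(1 for cond in conditions if cond(x))
        if hits != 1:
            problems.append((x, hits))
    return problems


samples = [i / 10 for i in range(-30, 31)]  # -3.0, -2.9, ..., 3.0

good = [lambda x: x < 0, lambda x: x >= 0]
gap = [lambda x: x < 0, lambda x: x > 0]        # nobody owns x = 0
overlap = [lambda x: x <= 0, lambda x: x >= 0]  # both pieces own x = 0

print(check_partition(good, samples))     # []
print(check_partition(gap, samples))      # [(0.0, 0)]
print(check_partition(overlap, samples))  # [(0.0, 2)]
```

An overlap flagged this way is harmless only if both rules give the same value at the shared point, which has to be checked separately.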

The breakpoints

The boundary $x$ -values where one piece ends and the next begins are called breakpoints (or knots). Whether the breakpoint belongs to the left piece, the right piece, or both, is part of the function's definition — and we mark it visually with a closed dot (included) or an open dot (approached but not taken).

The two questions to ask at every breakpoint

(1) Which side's rule wins at the breakpoint itself? (2) Do the two side-rules produce the same y-value there? If the answer to (2) is "yes," the function is continuous at the breakpoint. If "no," there is a jump discontinuity.

Interactive Piecewise Playground

Switch between the presets in the dropdown, then try Custom 2-piece and drag the sliders. Watch the green/red banner at the bottom: it tells you when the two pieces agree at the breakpoint (continuous) and when they disagree (jump). Pay special attention to filled vs hollow dots — they are how we visually encode which side "owns" the breakpoint value.

[Interactive figure: piecewise function playground. Default preset: $f(x) = |x|$, two rays meeting at the origin with slopes $-1$ and $+1$ and a corner at $x = 0$. The banner reads "Continuous" because both pieces give the same value at the breakpoint, so the curve can be drawn without lifting the pen.]

Tip: filled dot = value included at that x; hollow dot = value approached but not taken. Continuity fails exactly when the filled-dot and hollow-dot y-values disagree.

What to notice as you play

Continuity is a meeting condition, not a smoothness condition. Two pieces can meet at the breakpoint (continuous) and still arrive with completely different slopes (corner).
|x| and ReLU are cousins. Both are two-piece linear functions whose pieces meet at the origin. ReLU clips the negative side to zero; |x| reflects it upward. We will exploit this kinship in the ML section.
Heaviside and sign have jumps on purpose. They model switches — events that have to flip discontinuously.
Tax brackets are designed to be continuous. The piecewise intercepts are chosen specifically so that crossing into the next bracket never produces a sudden jump in tax owed. Without continuity, taxpayers would have a perverse incentive to earn slightly less.

Absolute Value as a Piecewise Function

The absolute value of a real number x, written $|x|$ , is its distance from zero on the number line. Distance is always non-negative, so $|x| \geq 0$ for every x — equality holds only at $x = 0$ .

The two-piece definition

The formal definition spells out the two cases:

|x| = \begin{cases} -x & \text{if } x < 0 \\ \;\;x & \text{if } x \geq 0 \end{cases}

Read it like English: "if x is negative, flip its sign to make it positive; if x is non-negative, leave it alone." Both cases produce a non-negative output, which is why the absolute-value graph lies entirely on or above the x-axis.

Geometric meaning: a V at the origin

The graph of $|x|$ is a perfect V. The left ray has slope $-1$ ; the right ray has slope $+1$ . They meet at the origin in a corner. The corner is the geometric signature of every absolute-value function — it is the price you pay for forcing the output to be non-negative.

A second face: distance between two numbers

Replacing x with $x - a$ gives $|x - a|$ , which is the distance from x to a. This is the most useful reading of absolute value in the rest of calculus and statistics:

$|x - 3| < \varepsilon$ means "x is within $\varepsilon$ of 3."
$|f(x) - L| < \varepsilon$ means "the function value is within $\varepsilon$ of L." This is the language of limits, which we will meet in Chapter 2.
The median of a data set minimizes $\sum_i |x_i - m|$ . (Compare the mean, which minimizes $\sum_i (x_i - m)^2$ .)
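The last bullet can be checked numerically. This is a minimal sketch on a made-up five-point sample with one outlier; it uses a brute-force grid search rather than any particular optimization method.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up sample with one outlier


def l1_cost(m):
    """Sum of absolute deviations from m."""
    return sum(abs(x - m) for x in data)


def l2_cost(m):
    """Sum of squared deviations from m."""
    return sum((x - m) ** 2 for x in data)


grid = [i / 10 for i in range(0, 1001)]  # candidates 0.0, 0.1, ..., 100.0
best_l1 = min(grid, key=l1_cost)
best_l2 = min(grid, key=l2_cost)

print(best_l1, statistics.median(data))  # 3.0 3.0
print(best_l2, statistics.mean(data))    # 22.0 22.0
```

The outlier drags the squared-error minimizer to 22 while the absolute-error minimizer stays at 3, which is why the median is described as robust.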

Two readings, one symbol

When you see $|x - a|$ , it always means the distance from $x$ to $a$ . When you see $|x|$ alone, it means the distance from $x$ to 0. They are the same function shifted.

Transformations of |x|: Building Vs Anywhere

Just as the parabola family $y = a(x - h)^2 + k$ places a U anywhere on the plane, the absolute-value family

f(x) = a\,|x - h| + k

places a V anywhere on the plane. The three parameters do exactly what you think:

Parameter	Geometric effect
h	Horizontal shift of the vertex. Positive h moves the V right.
k	Vertical shift of the vertex. Positive k moves the V up.
a	Steepness and orientation. $|a| > 1$ makes the V steeper; $0 < |a| < 1$ makes it shallower; $a < 0$ flips the V upside down.

Play with the sliders below. The dashed grey curve is the reference $|x|$ ; the amber curve is your transformed V. The amber dot marks the vertex, which sits at $(h, k)$ .

[Interactive figure: absolute-value transformations, with sliders for vertical scale $a$, horizontal shift $h$, and vertical shift $k$. Default setting: $a = 1$, $h = 0$, $k = 0$, i.e. $f(x) = |x|$ unchanged.]

Reading slopes off the formula

For $f(x) = a|x - h| + k$ :

To the right of the vertex ( $x > h$ ): slope is $+a$ .
To the left of the vertex ( $x < h$ ): slope is $-a$ .
At the vertex itself: no single slope exists — the function has a corner.
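To make the slope rules concrete, here is a small sketch that writes $a|x - h| + k$ as an explicit two-piece rule and measures the slope on each side. The function names and the parameter values ($a = -2$, $h = 1$, $k = 3$) are arbitrary choices for illustration.

```python
def v_shape(x, a, h, k):
    """f(x) = a*|x - h| + k written as an explicit two-piece linear rule."""
    if x < h:
        return -a * (x - h) + k  # left ray, slope -a
    return a * (x - h) + k       # right ray (includes the vertex), slope +a


def slope(f, x0, x1):
    """Average rate of change of f between x0 and x1."""
    return (f(x1) - f(x0)) / (x1 - x0)


a, h, k = -2.0, 1.0, 3.0


def f(x):
    return v_shape(x, a, h, k)


print(f(h))                 # 3.0  (the vertex sits at height k)
print(slope(f, -1.0, 0.0))  # 2.0  (left of the vertex: -a)
print(slope(f, 2.0, 3.0))   # -2.0 (right of the vertex: +a)
```

Because $a$ is negative here, the V opens downward: the left ray rises and the right ray falls.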

Worked Example by Hand

We will work two examples in detail. The first solves an absolute-value equation by case analysis. The second graphs a sum of two absolute-value functions and discovers a surprising flat region. Try each step yourself before peeking.

▶ Example 1 — Solve |2x − 3| = 5

Step 1. Write the defining rule of absolute value:

|u| = c \;\; (\text{with } c \geq 0) \quad \Longleftrightarrow \quad u = c \;\; \text{or} \;\; u = -c.

Step 2. Identify $u = 2x - 3$ and $c = 5$ . Substituting gives two linear equations.

2x - 3 = 5 \qquad \text{or} \qquad 2x - 3 = -5.

Step 3. Solve each.

2x = 8 \;\Rightarrow\; x = 4, \qquad 2x = -2 \;\Rightarrow\; x = -1.

Step 4. Verify both by substitution:

Candidate x	2x − 3	|2x − 3|	Equals 5?
x = 4	5	5	✓
x = −1	−5	5	✓

Answer. $x = -1$ or $x = 4$ . The equation has two solutions, symmetric about $x = 3/2$ — which is the value where the inside $2x - 3$ equals zero, i.e. the corner of $|2x - 3|$ .

Why exactly two solutions, geometrically

Graphing $y = |2x - 3|$ gives a V with vertex at $(3/2, 0)$ . The horizontal line $y = 5$ intersects this V at two points, equidistant from the vertex. Their horizontal distance from $3/2$ is $5/2$ (because the V has slopes $\pm 2$ , so we travel $5/2$ in x to rise 5 in y). Thus the solutions are $3/2 \pm 5/2$ = $-1$ and $4$ .

▶ Example 2 — Graph g(x) = |x − 1| + |x + 2| and find its minimum

Step 1. Locate the corners. Each $|\cdot|$ has a corner where its inside is zero, so:

$|x - 1|$ has a corner at $x = 1$ .
$|x + 2|$ has a corner at $x = -2$ .

The corners split the real line into three regions: $x < -2$ , $-2 \leq x \leq 1$ , and $x > 1$ .

Step 2. Drop the bars in each region by choosing the right sign for the inside.

Region	|x − 1| becomes	|x + 2| becomes	g(x)
x < −2	1 − x	−x − 2	−2x − 1
−2 ≤ x ≤ 1	1 − x	x + 2	3
x > 1	x − 1	x + 2	2x + 1

Step 3. The middle row is the punchline. On the entire interval $[-2, 1]$ , the function is identically 3 — the x-terms cancel. Outside the interval, the function ramps up linearly with slope $\pm 2$ .

Step 4. Spot-check four values in the flat region, including both endpoints.

x	|x − 1|	|x + 2|	g(x)
−2	3	0	3
0	1	2	3
0.7	0.3	2.7	3
1	0	3	3

Step 5. The minimum value of g is $3$ , achieved on the entire interval $[-2, 1]$ . This is the distance between the two corners — a general fact: for $g(x) = |x - a| + |x - b|$ with $a < b$ , the minimum value is $b - a$ , achieved on the entire interval $[a, b]$ .
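One way to see why the general fact holds without case analysis is the triangle inequality $|u| + |v| \geq |u + v|$, applied with $u = x - a$ and $v = b - x$. This is a sketch of the argument, not a replacement for the region-by-region table above:

|x - a| + |x - b| \;=\; |x - a| + |b - x| \;\geq\; |(x - a) + (b - x)| \;=\; |b - a| \;=\; b - a.

Equality holds exactly when $x - a$ and $b - x$ are both non-negative, that is, when $a \leq x \leq b$. For $a = -2$ and $b = 1$ this gives the minimum value 3 on $[-2, 1]$, matching the table.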

Why this matters for ML

The flat region is a by-product of the corners: at each corner the slope jumps, so it can pass through zero without the curve ever being smooth there. The same mechanism explains why L1-regularized regression produces sparse coefficients. The penalty $|w|$ has a corner at $w = 0$, so the optimizer can land exactly on zero and stay there instead of merely getting close. We'll return to this when we discuss the lasso in the optimization chapter.

The Sign Function and the Heaviside Step

Two more piecewise functions deserve names because they appear everywhere from control engineering to deep learning to distribution theory.

The sign function

\mathrm{sgn}(x) = \begin{cases} -1 & x < 0 \\ \;\;0 & x = 0 \\ +1 & x > 0 \end{cases}

The sign function reports which side of zero a number lives on, ignoring its magnitude. A useful identity ties it to absolute value:

|x| = x \cdot \mathrm{sgn}(x) \qquad \text{and equivalently} \qquad \mathrm{sgn}(x) = \frac{x}{|x|} \;\; (x \neq 0).

That second form says: "to find the sign of x, divide x by its magnitude." The result is $\pm 1$ .
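Both identities are easy to confirm on a handful of values. A minimal sketch in plain Python, where the function name `sgn` simply mirrors the notation above:

```python
def sgn(x):
    """Sign function: -1 for negative x, 0 at zero, +1 for positive x."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 0


for x in (-2.5, -1, 0, 3, 7.25):
    # First identity holds for every x, including 0.
    assert abs(x) == x * sgn(x)
    # Second identity needs x != 0 to avoid dividing by zero.
    if x != 0:
        assert sgn(x) == x / abs(x)
    print(x, sgn(x))
```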

The Heaviside step

H(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases}

Oliver Heaviside introduced this step around 1880 to write down equations for electrical circuits that switch on at $t = 0$ . It is the cleanest possible discontinuous function: zero on one side, one on the other, with a single jump of size 1.

The Heaviside-step trick

Many piecewise functions can be expressed as linear combinations of Heaviside steps. For instance, $f(x) = 3 H(x) - 2 H(x - 1)$ describes a function that jumps up by 3 at $x = 0$ , then down by 2 at $x = 1$ . This representation is used throughout signal processing.
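The example $3H(x) - 2H(x - 1)$ can be compared against its explicit three-piece form. The sketch below uses the convention $H(0) = 1$ from the definition above; the helper names are hypothetical.

```python
def H(x):
    """Heaviside step with the convention H(0) = 1."""
    return 1 if x >= 0 else 0


def f_steps(x):
    """The same function written as a combination of shifted steps."""
    return 3 * H(x) - 2 * H(x - 1)


def f_cases(x):
    """Explicit piecewise form: 0, then 3, then 1."""
    if x < 0:
        return 0
    if x < 1:
        return 3
    return 1


for x in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print(x, f_steps(x), f_cases(x))
```

Each shifted step contributes one jump, so the coefficients 3 and -2 can be read directly as the jump sizes at $x = 0$ and $x = 1$.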

How they relate to ReLU and |x|

All four functions are first cousins on the same family tree:

Function	Pieces	Continuous?	Slope on the left	Slope on the right
sgn(x)	−1, 0, +1	No (jump of 2)	0	0
H(x)	0, 1	No (jump of 1)	0	0
|x|	−x, x	Yes (corner)	−1	+1
ReLU(x)	0, x	Yes (corner)	0	+1

Differentiation links the bottom two rows to the top two: $\frac{d}{dx} |x| = \mathrm{sgn}(x)$ and $\frac{d}{dx} \mathrm{ReLU}(x) = H(x)$ for $x \neq 0$ . Each of the first two rows is the derivative of the row two places below it. The corners become jumps; the jumps become "deltas" (which we'll meet in distribution theory, beyond this chapter).

Corners, Continuity, and Derivatives

Piecewise functions are the first place a student meets the distinction between continuity and differentiability. They are not the same thing.

Continuity at a breakpoint

Let $c$ be a breakpoint between two pieces. The function is continuous at $c$ if and only if

\lim_{x \to c^-} f(x) \;=\; \lim_{x \to c^+} f(x) \;=\; f(c).

All three numbers must exist and be equal. We will make these limit symbols precise in Chapter 2; for now, "left limit equals right limit equals value" is the working definition.
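The working definition translates into a small check. This sketch assumes each piece is itself continuous at $c$, so that the one-sided limit can be obtained by evaluating the piece's formula at $c$; the function name `classify_breakpoint` is hypothetical.

```python
def classify_breakpoint(left_rule, right_rule, value_at_c, c, tol=1e-9):
    """Compare left limit, right limit, and f(c) at a breakpoint c."""
    left = left_rule(c)    # value the left piece approaches as x -> c-
    right = right_rule(c)  # value the right piece approaches as x -> c+
    if abs(left - right) > tol:
        return f"jump of size {abs(right - left):g}"
    if abs(left - value_at_c) > tol:
        return "one-sided limits agree, but f(c) differs"
    return "continuous"


# |x| at 0: pieces -x and x, with f(0) = 0.
print(classify_breakpoint(lambda x: -x, lambda x: x, 0.0, 0.0))
# Heaviside at 0: pieces 0 and 1, with H(0) = 1.
print(classify_breakpoint(lambda x: 0.0, lambda x: 1.0, 1.0, 0.0))
# x^2 glued to 2x - 1 at x = 1, with f(1) = 1.
print(classify_breakpoint(lambda x: x ** 2, lambda x: 2 * x - 1, 1.0, 1.0))
```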

Differentiability at a breakpoint

Differentiability is a stricter requirement than continuity. Even when the pieces meet (no jump), if they arrive with different slopes, the function has a corner, and the derivative at that point does not exist. $|x|$ at $x = 0$ is the canonical example:

\lim_{h \to 0^-} \frac{|h| - |0|}{h} = -1, \qquad \lim_{h \to 0^+} \frac{|h| - |0|}{h} = +1.

Two different one-sided slopes ⟹ no single derivative. We will derive this carefully in Chapter 4, but the geometry already tells the story: the V has two different tangent directions at the vertex.
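The two one-sided limits can be approximated with difference quotients at a small step. This is a numerical sketch only: the step size `h = 1e-6` and the tolerance in `has_corner` are arbitrary choices, and the helper names are hypothetical.

```python
def one_sided_slopes(f, c, h=1e-6):
    """Approximate the left and right difference quotients of f at c."""
    left = (f(c - h) - f(c)) / (-h)
    right = (f(c + h) - f(c)) / h
    return left, right


def has_corner(f, c, tol=1e-3):
    """True when the two one-sided slopes differ by more than tol."""
    left, right = one_sided_slopes(f, c)
    return abs(left - right) > tol


def relu(x):
    return max(0.0, x)


def square(x):
    return x * x


print(one_sided_slopes(abs, 0.0))   # slopes close to -1 and +1
print(one_sided_slopes(relu, 0.0))  # slopes close to 0 and +1
print(has_corner(abs, 0.0), has_corner(relu, 0.0), has_corner(square, 0.0))
```

For the parabola both quotients shrink toward 0 as the step shrinks, so no corner is reported; for $|x|$ and ReLU the gap between them stays fixed no matter how small the step.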

Continuity ⇏ Differentiability

A piecewise function can be perfectly continuous and still fail to be differentiable. Corners (|x|, ReLU) and cusps ( $x^{2/3}$ at 0) are continuous everywhere yet non-differentiable at one point. The reverse implication holds: differentiable always implies continuous.

The hierarchy of smoothness

We can stack functions by how much regularity they enjoy at their breakpoints:

Discontinuous (Heaviside, sign): a jump in the function value.
Continuous but not differentiable (|x|, ReLU): the pieces meet, but slopes differ.
Differentiable but not twice differentiable: slopes match, but curvatures differ. The piecewise $x^2$ / $2x - 1$ example from the Python section sits here — same value AND same slope at $x = 1$ (because $\frac{d}{dx}x^2|_{x=1} = 2$ ), but second derivative jumps from 2 to 0.
Smooth: every derivative exists everywhere. Single-formula functions like $e^x$ , $\sin x$ , polynomials.

ReLU: Piecewise in Modern Machine Learning

The Rectified Linear Unit,

\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} 0 & x < 0 \\ x & x \geq 0 \end{cases},

is the most-used non-linear function in modern neural networks. Every transformer layer, every CNN feature map, every MLP hidden layer typically passes its values through a ReLU (or a close relative). Understanding piecewise functions is therefore not optional for ML — it is the central activation pattern.

Why ReLU works so well in practice

Cheap. One comparison, one selection. No exponentials, no divisions.
Gradient survives. On the active side, the derivative is exactly 1, so the activation itself never shrinks the gradient, however deep the stack. Sigmoid's derivative is at most 0.25, so ten stacked sigmoids scale the gradient by at most $0.25^{10} = 4^{-10} \approx 10^{-6}$ from the activations alone, and by even less when units saturate. This is the "vanishing gradient" problem.
Sparse activations. About half the inputs are negative for a typical random initialization, so about half of every hidden layer is exactly zero. Sparsity is cheap to compute and often improves generalization.
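The arithmetic behind the gradient comparison fits in a few lines. This sketch looks only at the activation derivatives and ignores the weight matrices, which also scale gradients in a real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU on either side of 0 (0 is returned at x = 0)."""
    return 1.0 if x > 0 else 0.0


depth = 10
print(sigmoid_grad(0.0))           # 0.25
print(sigmoid_grad(0.0) ** depth)  # 9.5367431640625e-07
print(relu_grad(2.0) ** depth)     # 1.0
```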

The dark side: the dying-ReLU problem

The very piecewise structure that makes ReLU efficient also creates a failure mode. If a neuron's pre-activation is always negative (across the entire training set), then its output is always zero, its gradient is always zero, and the optimizer cannot move its weights. The neuron is dead, frozen at initialization, contributing nothing to the network forever.

We will demonstrate this with PyTorch in the code block below: initialize a neuron with a very negative bias, run SGD, and watch the weights refuse to move. In practice, three remedies help:

Smart initialization (He initialization, also known as Kaiming initialization) so pre-activations span both sides of zero from the start.
Leaky ReLU: $\max(\alpha x, x)$ for small $\alpha$ (e.g. 0.01). The flat side becomes a gentle slope, so gradients survive.
Skip connections (ResNets): provide an alternate pathway for gradient even when a unit is dead.
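The contrast between a dead ReLU and a Leaky ReLU can be sketched without any framework. Everything here is a hand-rolled illustration under stated assumptions: a single neuron $y = \mathrm{act}(wx + b)$, a made-up five-point data set, mean squared error, full-batch gradient descent, and a deliberately bad bias of $-10$. It is not the PyTorch demo this section refers to.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha


def train(act, act_grad, w=1.0, b=-10.0, lr=0.1, steps=200):
    """Gradient descent on one neuron; returns the final (w, b)."""
    xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
    ys = [max(0.0, x) for x in xs]  # targets: the neuron should learn ReLU(x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            z = w * x + b  # pre-activation, always negative at the start
            dloss_dz = 2.0 * (act(z) - y) * act_grad(z)  # chain rule
            grad_w += dloss_dz * x
            grad_b += dloss_dz
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b


print(train(relu, relu_grad))              # (1.0, -10.0): nothing moved
print(train(leaky_relu, leaky_relu_grad))  # bias has started to climb
```

With plain ReLU every gradient is multiplied by zero, so the parameters come back unchanged. With the leaky variant the gradient is small but non-zero, so the bias drifts upward, slowly, toward the region where the neuron becomes active.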

The takeaway

The most important non-linearity in modern AI is a piecewise function with two linear pieces. Its strengths and its failure modes are direct consequences of its piecewise structure. Every other thing we'll do with derivatives, gradients, and optimization is downstream of understanding this single shape.

Python Implementation

Two NumPy idioms cover almost every piecewise function you'll ever write: np.piecewise for the general n-piece case, and np.where for the common 2-piece case. The code below uses both side by side and plots a piecewise polynomial, a ReLU, and an absolute value on a single figure.

Plotting piecewise, ReLU, and |x| in plain Python

Listing: piecewise_plot.py (41 lines). The 27 annotations below are keyed to line numbers in the listing that follows them.

Line 1: Import NumPy

NumPy gives us vectorized predicates (x < 1) and the two main tools we need: np.piecewise (general, n pieces) and np.where (fast, 2 pieces).

EXAMPLE

x = np.array([-1, 0, 1, 2]) -> x < 1 = array([ True, True, False, False])

Line 2: Import matplotlib

pyplot will draw the three subplots side by side so we can see piecewise behaviour, ReLU, and |x| in one figure.

Line 4: Comment: Strategy A is np.piecewise

np.piecewise is the general tool: pass an array x, a list of boolean conditions, and a list of functions (one per condition). Each x is routed to the matching function.

Line 5: Define f_piecewise(x)

Our showcase piecewise function: x^2 on the left, 2x - 1 on the right, glued at x = 1. We chose 2x - 1 because at x = 1 it equals 1 — same as 1^2. The two pieces meet, so f is continuous.

Line 6: Docstring states the rule

Spelling out the breakpoint and which rule wins on which side is the most error-prone part of piecewise code — putting it in the docstring catches mistakes when you re-read your code months later.

Line 7: Call np.piecewise

Three arguments: (1) the input array x, (2) a list of boolean masks, (3) a parallel list of functions. Each element of x is routed to the function whose mask is True at that position.

EXAMPLE

np.piecewise(np.array([0, 1, 2]), [x < 1, x >= 1], [lambda t: t**2, lambda t: 2*t - 1])
  routes 0 -> 0**2 = 0
  routes 1 -> 2*1 - 1 = 1
  routes 2 -> 2*2 - 1 = 3

Line 8: Input array

x is the array of evaluation points. np.piecewise broadcasts cleanly across it, so the call works for one scalar or a million points without changes.

Line 9: Condition list [x < 1, x >= 1]

Each entry is a boolean array the same shape as x. Important: the conditions should cover every x exactly once. If they overlap, do not rely on the result: np.piecewise applies the functions in list order, so a later piece overwrites an earlier one. If they leave a gap, np.piecewise fills with 0.

EXAMPLE

For x = np.array([-1, 0, 1, 2]):
  x < 1   -> [ True,  True, False, False]
  x >= 1  -> [False, False,  True,  True]
  Coverage check: every column has exactly one True ✓

Line 10: Function list

Parallel to the condition list: lambda t: t**2 handles the True positions of x < 1, lambda t: 2*t - 1 handles the True positions of x >= 1. The dummy variable t is just a name for the elements routed to that lambda.

EXAMPLE

For x = np.array([-1, 0, 1, 2]):
  Pieces routed to lambda t: t**2     are [-1, 0]   -> [1, 0]
  Pieces routed to lambda t: 2*t - 1 are [ 1, 2]   -> [1, 3]
  Stitched result: array([1, 0, 1, 3])

Line 13: Comment: Strategy B is np.where

For 2-piece functions np.where is faster to type and faster to run. The pattern is: np.where(condition, value_if_true, value_if_false).

Line 14: Define relu(x)

ReLU is the cleanest possible np.where: pass x where x >= 0, otherwise 0. This single line is exactly the activation function used in essentially every modern neural network.

Line 15: np.where(x >= 0, x, 0.0)

For each element of x: if x >= 0 take x itself, else take 0.0. Using 0.0 (not 0) guarantees a floating-point result even when x is an integer array.

EXAMPLE

relu(np.array([-2, -0.5, 0, 0.5, 2])) -> array([0., 0., 0., 0.5, 2.])

Line 17: Define abs_value(x)

|x| is the canonical two-piece function — x on one side, -x on the other. Notice we ARE NOT using np.abs because we want to expose the piecewise structure for teaching.

Line 18: np.where(x >= 0, x, -x)

Same pattern: x stays itself when non-negative, becomes -x when negative. The two halves meet exactly at x = 0 (since 0 and -0 are equal), so the function is continuous there.

EXAMPLE

abs_value(np.array([-3, -1, 0, 1, 3])) -> array([3, 1, 0, 1, 3])

Line 20: Build the x-axis grid

600 points on [-3, 3]. Density matters near corners — at x = 0 (for |x| and ReLU) and x = 1 (for f_piecewise) the slope changes abruptly and we want enough samples to see the corner crisply.

Line 22: Create 3 subplots side by side

subplots(1, 3) returns a Figure plus an array of three Axes. We will plot one function per Axes so the reader can compare shapes directly.

Line 24: Plot the quadratic-then-linear piecewise

axes[0] gets the most interesting curve: a parabola section glued to a ray. Visually it looks smooth at x = 1 because both the values and the slopes match there (the slope is 2 on each side), but the curvature still jumps, so the second derivative is discontinuous (we'll meet this distinction in Chapter 4).

Line 25: Mark the breakpoint

A red dot at (1, 1) makes the gluing visible. It is the single point where two different rules produce the SAME y-value — the definition of continuity at a breakpoint.

Line 26: Title each subplot

Each title states the rule. Compact titles are critical when comparing three small panels.

Line 28: Plot ReLU on axes[1]

The flat-then-rising shape. The corner at (0, 0) is where the slope jumps from 0 to 1 — the canonical 'kink' that breaks differentiability while preserving continuity.

Line 31: Plot |x| on axes[2]

A perfect V. The two rays have slopes -1 and +1; their slopes differ by 2 at the corner. ReLU and |x| are first cousins: |x| = 2·ReLU(x) − x.

EXAMPLE

Cross-check at x = -3: ReLU(-3) = 0, so 2*ReLU(-3) - (-3) = 0 + 3 = 3 = |-3| ✓
              at x =  3: ReLU( 3) = 3, so 2*ReLU( 3) - ( 3) = 6 - 3 = 3 = | 3| ✓

Line 34: Common axis styling

Loop over the three axes and apply the same y=0 and x=0 reference lines so the three panels share a visual baseline. Without this, each subplot draws its own axes and the comparison is harder.

Line 35: Horizontal y=0 line

Helps the eye locate where each curve crosses the x-axis. For ReLU and |x|, that crossing is x = 0 — the breakpoint.

Line 36: Vertical x=0 line

Marks the y-axis. The fact that ReLU and |x| both hit zero AT the y-axis but with different left-side behaviour is the visual signature of the piecewise rule.

Lines 37-38: Grid + xlabel

Light grid + an x label on every subplot. The student should be able to read off f(2), f(-1), etc. by eye.

Line 40: Tight layout

plt.tight_layout() removes overlapping titles when subplots are packed together. A small thing, but it makes the figure publication-ready.

Line 41: Render

plt.show() pumps the figure to the screen. In a notebook this is implicit; in a script it's required.

import numpy as np
import matplotlib.pyplot as plt

# Strategy A: np.piecewise - list the conditions and the rules.
def f_piecewise(x):
    """f(x) = x^2 for x < 1, then 2x - 1 for x >= 1. Continuous at x = 1."""
    return np.piecewise(
        x,
        [x < 1, x >= 1],
        [lambda t: t ** 2, lambda t: 2 * t - 1],
    )

# Strategy B: np.where - short and idiomatic for 2-piece functions.
def relu(x):
    return np.where(x >= 0, x, 0.0)

def abs_value(x):
    return np.where(x >= 0, x, -x)

x = np.linspace(-3, 3, 600)

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

axes[0].plot(x, f_piecewise(x), color='#3b82f6', linewidth=2)
axes[0].scatter([1], [1], color='red', zorder=5)
axes[0].set_title('Quadratic + linear, continuous at x=1')

axes[1].plot(x, relu(x), color='#22c55e', linewidth=2)
axes[1].set_title('ReLU = max(0, x)')

axes[2].plot(x, abs_value(x), color='#f59e0b', linewidth=2)
axes[2].set_title('|x|')

for ax in axes:
    ax.axhline(0, color='gray', linestyle='--', alpha=0.4)
    ax.axvline(0, color='gray', linestyle='--', alpha=0.4)
    ax.grid(True, alpha=0.3)
    ax.set_xlabel('x')

plt.tight_layout()
plt.show()

From hand-solved to computer-verified

Below we revisit the two worked examples — solving $|2x - 3| = 5$ and analyzing $g(x) = |x-1| + |x+2|$ — but now with NumPy doing the heavy lifting. Notice the flat-bottom plateau in the plot: it is the set of minimizers, not a single point.

Solving |2x − 3| = 5 and minimizing |x−1| + |x+2|

Listing: absolute_value_examples.py (46 lines). The 35 annotations below are keyed to line numbers in the listing that follows them.

Line 1: Import NumPy

We use np.abs for the vectorized absolute value and np.linspace for the plotting grid. Built-in abs would also work on scalars, but np.abs broadcasts.

Line 2: Import matplotlib

pyplot draws the flat-bottom valley so we can SEE that g attains its minimum on a full interval, not at a single point.

Line 4: Comment header for Part 1

The whole solution turns on one identity: |u| = c (with c >= 0) is equivalent to u = c OR u = -c. We just plug u = 2x - 3 and c = 5 into both cases.

Line 5: State the rule

Writing the rule on its own line keeps the next two lines parsable as 'one application each' rather than abstract algebra.

Line 6: Plug in u = 2x - 3, c = 5

The two algebraic cases produce two linear equations. Solving each gives one candidate solution.

Line 7: Case 1 produces x = 4

From 2x - 3 = 5: add 3 to both sides -> 2x = 8 -> divide by 2 -> x = 4. Inside the original |·|, this gives 2(4) - 3 = 5 ✓.

Line 8: Case 2 produces x = -1

From 2x - 3 = -5: add 3 -> 2x = -2 -> x = -1. Inside the original |·|, this gives 2(-1) - 3 = -5, whose absolute value is 5 ✓.

Line 9: solutions = []

An accumulator list. We will append each numerical solution as we walk through the two cases.

Line 10: Loop over the two cases

The tuple (5, -5) holds the two possible values of the inner expression 2x - 3. One iteration per case.

EXAMPLE

Iteration 1: case_value =  5  ->  x = (5 + 3)/2  =  4
Iteration 2: case_value = -5  ->  x = (-5 + 3)/2 = -1

Line 11: Solve 2x - 3 = case_value

Algebraically: 2x = case_value + 3, then x = (case_value + 3) / 2. One arithmetic line stands in for the algebra step.

Line 12: Append the candidate to solutions

We keep both candidates in a list so we can sort, print, and assert in one place.

Line 13: Verify by substitution

Always check candidates: |2x - 3| should equal 5 at both solutions. abs(...) is Python's built-in absolute value — works on a plain float.

EXAMPLE

x = 4   ->  abs(2*4  - 3) = abs(5)  = 5 ✓
x = -1  ->  abs(2*-1 - 3) = abs(-5) = 5 ✓

Line 14: Print the verified row

Formatted output makes the verification step skim-readable: each row shows the candidate x and the recomputed |2x - 3|.

Line 16: Assert correctness

Asserting against the known answer is a guard against typos. If we someday change the constants, the assert fires immediately.

Line 17: Print the final answer

After verification we have the answer: x = 4 or x = -1. Two solutions because |·| is two-to-one almost everywhere.

Line 19: Comment header for Part 2

Now the more interesting object: a SUM of absolute values. Each |·| contributes one corner; their sum has two corners total.

Lines 20-22: Find the corners

Corner locations come from the inside of each |·|: x - 1 = 0 gives x = 1, and x + 2 = 0 gives x = -2. These two x-values partition the real line into three regions.

Line 24: Region 1: x < -2

Both insides are negative: x - 1 < 0 and x + 2 < 0. Drop the bars by negating each: g(x) = -(x-1) + -(x+2) = -2x - 1. This is a line with slope -2.

EXAMPLE

At x = -3: g(-3) = |-3-1| + |-3+2| = 4 + 1 = 5
            and  -2(-3) - 1     = 6 - 1 = 5 ✓

Line 25: Region 2: -2 ≤ x ≤ 1 (the flat region)

Now x - 1 ≤ 0 (so |x-1| = 1 - x) but x + 2 ≥ 0 (so |x+2| = x + 2). Adding: (1 - x) + (x + 2) = 3. The x cancels — the function is CONSTANT on the whole interval [-2, 1]. This is the punchline of the example.

EXAMPLE

g(0) = |0-1| + |0+2| = 1 + 2 = 3
g(-1) = |-1-1| + |-1+2| = 2 + 1 = 3
g(0.7) = |0.7-1| + |0.7+2| = 0.3 + 2.7 = 3   (all the same!)

Line 26: Region 3: x > 1

Both insides positive: g(x) = (x-1) + (x+2) = 2x + 1. A line with slope +2, mirror image of Region 1.

Line 28: Conclusion: minimum 3 on [-2, 1]

The constant-region phenomenon is general: g(x) = |x - a| + |x - b| (with a < b) is minimized on the ENTIRE interval [a, b], and the minimum value is the distance |b - a|. Try a = -2, b = 1: minimum is |1 - (-2)| = 3 ✓.

Line 30: Define g(x)

Now we let NumPy handle all three regions automatically by writing g as a single expression. np.abs broadcasts, so g works on arrays.

Line 31: Return |x - 1| + |x + 2|

One line stands in for the case analysis above. The key insight is that we DON'T have to write three cases when np.abs already encodes the piecewise rule.

EXAMPLE

g(np.array([-3, -2, 0, 1, 3])) -> array([5, 3, 3, 3, 7])
Notice the three 3s in the middle: confirms the flat region.

Line 33: Build the plotting grid

500 points on [-5, 5] is enough resolution to see both corners crisply without aliasing.

Line 34: Evaluate g on the grid

gs is now a 500-element array of g-values. The minimum should be 3.0 (up to floating-point rounding) and should occur on a flat plateau.

Line 35: Numerical check of the minimum

gs.min() should equal 3.0 (up to floating-point). If we had a typo in g, this print would expose it.

EXAMPLE

min g = 3.0000

Line 36: Sanity-check three points in the flat region

g(-2), g(0), g(1) should all equal 3: the two endpoints and one interior point of the flat region. The fact that three different x-values produce the same g is the flat-region phenomenon in action.

Line 38: Create the figure

figsize=(9, 5) is wide enough to display the entire flat region without squashing it.

Line 39: Plot g(x)

A purple curve. Visually it should look like an asymmetric trapezoid: two slopes outside and a flat bottom inside [-2, 1].

Line 40: Horizontal line at y = 3

Marks the minimum value. The curve should touch this line on the entire interval [-2, 1] and stay above it everywhere else.

Line 41: Shade the minimizing region

fill_between with a where-mask shades only the x-values that achieve the minimum. The shaded interval is the argmin SET, not a single point — a key feature of L1 / absolute-value objectives.

Line 42: Mark the corner points

Red dots at (-2, 3) and (1, 3). These are the two corners where the slope changes — one going from -2 to 0, the other from 0 to +2.

Line 43: Axis labels

x and g(x). Notice we label the y-axis g(x), not y — the function name matters in calculus.

Lines 44-45: Title + legend

The title 'flat-bottom valley' names the shape so the reader has a concrete picture in mind. The legend distinguishes the curve from the minimum-marker line.

Line 46: Render

plt.show() displays the figure. The student should see two ramps joined by a flat bottom, with a clearly shaded plateau on [-2, 1].

import numpy as np
import matplotlib.pyplot as plt

# ---------- Part 1: solve |2x - 3| = 5 ----------------------------------------
# Rule: |u| = c  (c > 0)  iff  u = c  OR  u = -c.
# Here u = 2x - 3 and c = 5, so:
#     Case 1: 2x - 3 =  5  ->  x =  4
#     Case 2: 2x - 3 = -5  ->  x = -1
solutions = []
for case_value in (5, -5):
    x = (case_value + 3) / 2          # solve 2x - 3 = case_value for x
    solutions.append(x)
    check = abs(2 * x - 3)            # verify by substitution
    print(f"x = {x:>5.2f}  ->  |2x - 3| = {check}")

assert sorted(solutions) == [-1.0, 4.0]
print(f"Solutions: x = {solutions[1]} or x = {solutions[0]}")

# ---------- Part 2: analyze g(x) = |x - 1| + |x + 2| --------------------------
# Two corners: x = 1 (from |x-1|) and x = -2 (from |x+2|). They split the
# real line into three regions. In each region we drop the bars by choosing
# the correct sign.
#
#   x < -2:    g(x) = (1 - x) + (-x - 2) = -2x - 1
#   -2 <= x <= 1:  g(x) = (1 - x) + (x + 2) =  3       <-- CONSTANT!
#   x > 1:     g(x) = (x - 1) + (x + 2) =  2x + 1
#
# So g attains its minimum value 3 on the ENTIRE interval [-2, 1].

def g(x):
    return np.abs(x - 1) + np.abs(x + 2)

xs = np.linspace(-5, 5, 500)
gs = g(xs)
print(f"min g = {gs.min():.4f} (expected 3.0)")
print(f"g(-2) = {g(-2.0)}, g(0) = {g(0.0)}, g(1) = {g(1.0)}  (all should be 3)")

plt.figure(figsize=(9, 5))
plt.plot(xs, gs, color='#8b5cf6', linewidth=2.5, label='g(x) = |x-1| + |x+2|')
plt.axhline(3, color='#22c55e', linestyle='--', alpha=0.7, label='min = 3 on [-2, 1]')
plt.fill_between(xs, 0, 3, where=(xs >= -2) & (xs <= 1), color='#22c55e', alpha=0.15)
plt.scatter([-2, 1], [3, 3], color='red', zorder=5)
plt.xlabel('x'); plt.ylabel('g(x)')
plt.title('Sum of two |·| functions: a flat-bottom valley')
plt.legend(); plt.grid(True, alpha=0.3)
plt.show()
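The three-region formula in the comments of the listing can be checked directly against the abs-based definition. The sketch below uses plain Python (no NumPy) so it runs anywhere; the helper names `g_abs`, `g_piecewise`, and `secant_slope` are illustrative and not part of the listing above.

```python
def g_abs(x):
    """g(x) = |x - 1| + |x + 2|, written with abs()."""
    return abs(x - 1) + abs(x + 2)

def g_piecewise(x):
    """The same function with the bars dropped region by region."""
    if x < -2:
        return -2 * x - 1      # both (x - 1) and (x + 2) are negative
    if x <= 1:
        return 3.0             # (x - 1) <= 0 <= (x + 2): the x terms cancel
    return 2 * x + 1           # both (x - 1) and (x + 2) are positive

# Multiples of 0.25 are exact in binary floating point, so == is safe here.
grid = [-5 + 0.25 * i for i in range(41)]          # -5.0, -4.75, ..., 5.0
assert all(g_abs(x) == g_piecewise(x) for x in grid)

# The argmin is a set of points, not a single point.
g_min = min(g_abs(x) for x in grid)
argmin = [x for x in grid if g_abs(x) == g_min]
print(f"min g = {g_min} on [{argmin[0]}, {argmin[-1]}]")

def secant_slope(x0, x1):
    """Slope of the straight line through (x0, g(x0)) and (x1, g(x1))."""
    return (g_abs(x1) - g_abs(x0)) / (x1 - x0)

# One secant inside each region recovers the three slopes.
slopes = [secant_slope(-4.0, -3.0), secant_slope(-1.0, 0.0), secant_slope(2.0, 3.0)]
print(f"slopes left / middle / right: {slopes}")
```

On this grid the minimum 3.0 is attained at every point from -2.0 to 1.0, and the three secant slopes come out as -2.0, 0.0, and 2.0, matching the slope changes at the two red corner points.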

PyTorch Implementation

Now we shift from descriptive plotting to autograd. We ask PyTorch for the derivative of ReLU at five sample points and watch it return the Heaviside step exactly. Then we reproduce the dying-ReLU pathology by initializing a neuron with a fatally negative bias and showing that SGD cannot rescue it.

ReLU's derivative, dying ReLU, and why piecewise matters in deep learning

relu_autograd.py

Import PyTorch

torch gives us tensors and autograd; we'll need both — autograd to verify the derivative of ReLU automatically, and tensors as the carriers of values + gradients.

Import torch.nn.functional as F

F.relu is the functional form: a plain function call, not a layer. Internally it's the same kernel as nn.ReLU but cleaner for one-off demos.

Comment: ReLU's derivative is the step

This is the headline. ReLU is piecewise linear, so its derivative is piecewise constant — exactly the Heaviside step from earlier in the section. The derivative jumps from 0 to 1 at x = 0.

f'(x) = 0 for x < 0

On the flat region, the slope is zero. This means small input changes leave the output untouched — the source of the 'dead neuron' problem we'll see below.

f'(x) = 1 for x > 0

On the identity region, the slope is one. Gradients pass through unchanged — this is why ReLU is so well-behaved for training compared to sigmoid (whose slope is at most 0.25).

f'(0) is undefined (subgradient anywhere in [0, 1])

At the corner, the left-slope is 0 and the right-slope is 1, so no single number is the slope. Any value in [0, 1] is a valid 'subgradient.' This is the same kink that |x| has at zero.

PyTorch chooses f'(0) = 0

An engineering choice: PyTorch picks 0. TensorFlow also picks 0. Picking 0 ties cleanly to the Heaviside convention H(0) = 0 used earlier in this section.

Create x tensor with requires_grad=True

requires_grad=True tells autograd to track every operation on x. After backward(), x.grad will hold dL/dx for whatever scalar loss L we built.

EXAMPLE

x = tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

y = F.relu(x).sum()

F.relu applies the piecewise rule element-wise. We sum into a scalar because autograd's backward() requires a scalar root. Summing has the side-effect of making dy/dx_i = d(relu(x_i))/dx_i, which is exactly what we want to inspect.

y.backward()

Propagates derivatives backward through the computation graph. After this line, x.grad is populated with one entry per element of x — each entry is the derivative of relu at that x.

Print x

Five test points spanning both sides of the corner: two negative, the corner itself, two positive. .detach() removes the autograd tracking before printing.

EXAMPLE

x = [-2.0, -0.5, 0.0, 0.5, 2.0]

Print relu(x)

Confirms the forward pass: negatives become 0, non-negatives pass through. The corner value relu(0) = 0 is correctly included on the right branch.

EXAMPLE

relu(x) = [0.0, 0.0, 0.0, 0.5, 2.0]

Print dy/dx

This is the headline output. The gradient is 0 for every negative input, 1 for every positive input, and PyTorch reports 0 at the corner. The pattern [0, 0, 0, 1, 1] is the Heaviside step sampled at our five points.

EXAMPLE

dy/dx = [0.0, 0.0, 0.0, 1.0, 1.0]

Comment header: dying ReLU

Now the practical consequence of the flat region. When a neuron's pre-activation is always negative, gradient flow through that neuron is exactly zero, and it never learns. We'll cause this on purpose with a bad initial bias.

Why a negative-only pre-activation is fatal

If wx + b < 0 for every training input x, then F.relu always returns 0. The gradient with respect to w and b is also 0 (chain rule kills it). The optimizer has no signal — the neuron is dead, frozen at its initial values.

Seed RNG for reproducibility

torch.manual_seed(0) fixes PyTorch's random number generator. Nothing in this demo actually draws random numbers (w, b, and X are all set explicitly), so the call is a reproducibility habit rather than a requirement: the output is the same with or without it.

w = 1.0, leaf tensor

A single learnable weight. We start it at 1.0 — the value we WANT the optimizer to find — so we can prove that the optimizer fails even with the right starting weight.

b = -10.0, very negative bias

The killer. b = -10 means wx + b ≤ 1 + (-10) = -9 for every x in [-1, 1]. All pre-activations are deeply in the flat region. The neuron is dead at initialization.

Build the X training inputs

32 evenly spaced points in [-1, 1] reshaped to a column vector. .unsqueeze(1) adds the singleton last dim that downstream broadcasting expects.

EXAMPLE

X.shape = torch.Size([32, 1])
X[:5].squeeze() = tensor([-1.0000, -0.9355, -0.8710, -0.8065, -0.7419])

Target = relu(X)

The supervised target is just relu applied to the inputs — so the optimal (w, b) is exactly (1, 0). If the optimizer could see a signal, it would head straight there.

Plain SGD optimizer

torch.optim.SGD on (w, b) with learning rate 0.05. Because every step uses all 32 points, this is plain full-batch gradient descent. No momentum, no adaptive tricks: we want to make the failure mode visible, not paper over it.

Loop 20 SGD steps

20 steps is enough to settle into a stable state. With a healthy gradient signal, the loss would drop quickly; with a dead neuron, it goes nowhere.

optimizer.zero_grad()

Reset accumulated gradients from the previous step. Forgetting this is the #1 bug in PyTorch training loops — gradients ADD up across .backward() calls.

Forward: pred = relu(wX + b)

wX + b lies in [-11, -9] for every X (because w = 1, X is in [-1, 1], and b = -10). After relu, pred is identically 0. The neuron outputs zero for every training point.

EXAMPLE

wX + b at X = -1:  1.0 * -1.0 + -10.0 = -11.0
wX + b at X =  1:  1.0 *  1.0 + -10.0 =  -9.0
relu of every entry: 0.0

MSE loss

The loss is ((pred - target) ** 2).mean(). Since pred = 0, the loss is exactly the mean of target^2. It is NOT zero, but the gradient signal through relu is zero because we are sitting on the flat side.

Backward pass

Computes dL/dw and dL/db. Chain rule: dL/dw = dL/dpred * d(relu)/d(wX + b) * d(wX + b)/dw. The middle factor is zero because every pre-activation is negative. Result: dL/dw = 0, dL/db = 0.

SGD step

optimizer.step() updates w and b by -lr * grad. Since both grads are 0, the update is the zero vector. w and b don't move. We will repeat this 20 times with no progress.

Report w and b

After 20 steps, w stayed at 1.0 (the initial value), and b stayed at -10.0. The optimizer never moved them — proof that the gradient was zero throughout.

EXAMPLE

After 20 SGD steps with bias starting at -10:
  w = 1.0000   (target ~ 1)
  b = -10.0000 (target ~ 0 — but it didn't move!)
  w.grad = 0.000000e+00   <-- exactly zero or near-zero

Report w.grad

The final w.grad is exactly 0 in this demo, because every pre-activation is negative and the ReLU factor in the chain rule is 0 for all 32 inputs. This is the dying-ReLU signature. The cure in practice is one of: smarter initialization (so pre-activations span both sides of the corner), leaky ReLU (which has a small positive slope on the negative side), or skip connections.

Connection back to the section

The dead-neuron pathology is a direct consequence of the piecewise structure of ReLU. The flat piece has slope 0, so any input fully inside it produces no gradient. Understanding piecewise functions is therefore not just calculus housekeeping — it's the difference between a model that trains and a model that is frozen at initialization.


import torch
import torch.nn.functional as F

# ReLU is piecewise linear: f(x) = max(0, x). Its DERIVATIVE is
# the Heaviside step:
#     f'(x) = 0  for x < 0
#     f'(x) = 1  for x > 0
#     f'(x) undefined at x = 0  (subgradient anywhere in [0, 1])
#
# PyTorch picks the convention f'(0) = 0.

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

y = F.relu(x).sum()        # sum so we can call backward() on a scalar
y.backward()

print(f"x        = {x.detach().tolist()}")
print(f"relu(x)  = {F.relu(x).detach().tolist()}")
print(f"dy/dx    = {x.grad.tolist()}")

# -------- The "dying ReLU" problem --------------------------------------------
# If a neuron's input is always negative, its gradient is always 0, so the
# neuron never updates. Demonstrate by training a tiny linear+ReLU on a
# regression task with a bad initial bias.
torch.manual_seed(0)
w = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([-10.0], requires_grad=True)    # very negative bias

X = torch.linspace(-1, 1, 32).unsqueeze(1)
target = torch.relu(X)

optimizer = torch.optim.SGD([w, b], lr=0.05)
for step in range(20):
    optimizer.zero_grad()
    pred = F.relu(w * X + b)             # all pre-activations are way negative
    loss = ((pred - target) ** 2).mean()
    loss.backward()
    optimizer.step()

print(f"\nAfter 20 SGD steps with bias starting at -10:")
print(f"  w = {w.item():.4f}   (target ~ 1)")
print(f"  b = {b.item():.4f}   (target ~ 0 — but it didn't move!)")
print(f"  w.grad = {w.grad.item():.6e}   <-- exactly zero or near-zero")
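The zero-gradient argument can also be replayed by hand, without autograd, by writing out the chain rule for the one-neuron model. The sketch below is plain Python rather than PyTorch. The helper names and the leaky slope `ALPHA = 0.01` are illustrative assumptions, not values taken from the listing above.

```python
ALPHA = 0.01  # assumed leaky slope, chosen only for illustration

def relu(z):
    return z if z > 0 else 0.0

def relu_slope(z):
    return 1.0 if z > 0 else 0.0        # convention f'(0) = 0

def leaky_relu(z):
    return z if z > 0 else ALPHA * z

def leaky_slope(z):
    return 1.0 if z > 0 else ALPHA

# 32 evenly spaced inputs in [-1, 1]; the target is relu(x), as in the demo.
XS = [-1 + 2 * i / 31 for i in range(32)]

def grads(w, b, act, act_slope):
    """(dL/dw, dL/db) for L = mean((act(w*x + b) - relu(x))^2)."""
    dw = db = 0.0
    for x in XS:
        z = w * x + b                                  # pre-activation
        dl_dpred = 2.0 * (act(z) - relu(x)) / len(XS)  # dL/dpred
        dpred_dz = act_slope(z)                        # 0 on ReLU's flat side
        dw += dl_dpred * dpred_dz * x                  # dz/dw = x
        db += dl_dpred * dpred_dz                      # dz/db = 1
    return dw, db

def train(act, act_slope, w=1.0, b=-10.0, lr=0.05, steps=20):
    """Full-batch gradient descent from the same bad initialization."""
    for _ in range(steps):
        dw, db = grads(w, b, act, act_slope)
        w, b = w - lr * dw, b - lr * db
    return w, b

w_relu, b_relu = train(relu, relu_slope)
w_leaky, b_leaky = train(leaky_relu, leaky_slope)
print(f"ReLU:       w = {w_relu:.4f}, b = {b_relu:.4f}")
print(f"Leaky ReLU: w = {w_leaky:.4f}, b = {b_leaky:.4f}")
```

With ReLU, the middle chain-rule factor is 0 for every input, so w and b stay at 1.0 and -10.0. With the leaky variant, the same factor is `ALPHA` instead of 0, so b creeps upward on each step: slowly with these settings, but the neuron is no longer frozen.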

What autograd is really doing at the corner

PyTorch does not compute the derivative of ReLU symbolically. It dispatches to a hand-written backward kernel that passes the incoming gradient through wherever the forward input was positive and returns 0 everywhere else, including at exactly zero. The Heaviside step is therefore literally what autograd uses, not just a mathematical analogy.
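The rule described above fits in a few lines. What follows is a plain-Python sketch of such a forward/backward pair, not PyTorch's actual kernel source; `relu_forward` and `relu_backward` are hypothetical names.

```python
def relu_forward(xs):
    """Element-wise max(0, x)."""
    return [x if x > 0 else 0.0 for x in xs]

def relu_backward(xs, upstream):
    """Pass each upstream gradient through only where the input was positive."""
    return [g if x > 0 else 0.0 for x, g in zip(xs, upstream)]

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
upstream = [1.0] * len(xs)     # y = sum(relu(x)) gives dy/d(relu(x_i)) = 1
grads = relu_backward(xs, upstream)
print(grads)                   # [0.0, 0.0, 0.0, 1.0, 1.0]

# The one-sided difference quotients at the corner disagree, which is why
# a convention is needed at x = 0.
h = 1e-6
left_slope = (relu_forward([0.0])[0] - relu_forward([-h])[0]) / h
right_slope = (relu_forward([h])[0] - relu_forward([0.0])[0]) / h
print(left_slope, right_slope)  # 0.0 1.0
```

The strict comparison `x > 0` is what encodes the convention f'(0) = 0; writing `x >= 0` instead would encode f'(0) = 1, which is an equally valid subgradient.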

Common Pitfalls

| Pitfall | Why it bites | Fix |
| --- | --- | --- |
| Overlapping piece domains | If two pieces both include a breakpoint, the function is ambiguous there. | Use strict and non-strict inequalities consistently. One workable convention: each piece includes its left endpoint (≤) and excludes its right endpoint (<). |
| Gap in the piece domains | Some inputs hit no piece at all, so the function is undefined there. | Check that the union of all domains covers the intended input set. |
| Confusing continuity with differentiability | Drawing a continuous V and concluding it's differentiable at the vertex. | Check slopes from each side. If they disagree, no derivative exists at that point, even though the function is continuous. |
| Writing \|x\|² = x (false) | Squaring removes the bars but keeps the square: the result is x², not x. | Memorize \|x\|² = x² for all real x. The bars vanish under squaring; the variable stays squared. |
| Solving \|u\| = c without considering c < 0 | \|u\| can never be negative, so \|u\| = −1 has no solution. | Always check the sign of c first. If c < 0, the equation has no real solutions. |
| Forgetting one root of \|2x − 3\| = 5 | Two cases (2x − 3 = 5 and 2x − 3 = −5) produce two roots. Students often write only one. | Whenever you 'drop' the absolute value bars, write the two cases on a fresh line before doing any algebra. |
| Initializing a ReLU layer with a very negative bias | Pre-activations sit on the flat side forever; the gradient is zero. | Use He / Kaiming initialization, or switch to Leaky ReLU / GELU. |
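The two equation-solving pitfalls (ignoring the sign of c, and dropping one of the two cases) can be guarded against mechanically. Below is a minimal sketch for equations of the form |a·x + b| = c; the function name `solve_abs_linear` and its argument order are illustrative choices, not something defined earlier in this section.

```python
def solve_abs_linear(a, b, c):
    """Return the sorted real solutions of |a*x + b| = c, assuming a != 0."""
    if a == 0:
        raise ValueError("a must be nonzero for a linear expression in x")
    if c < 0:
        return []                      # |u| is never negative: no solutions
    # Two cases: a*x + b = c and a*x + b = -c. A set merges them when c == 0.
    roots = {(c - b) / a, (-c - b) / a}
    return sorted(roots)

for a, b, c in [(2, -3, 5), (2, -3, 0), (1, 0, -1)]:
    roots = solve_abs_linear(a, b, c)
    # Verify every root by substitution before reporting it.
    assert all(abs(a * x + b) == c for x in roots)
    print(f"|{a}x + ({b})| = {c}  ->  {roots}")
```

The three test cases cover the three possible outcomes: two roots for |2x − 3| = 5 (x = −1 and x = 4), a single root for c = 0 (x = 1.5), and an empty list for c < 0.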

Summary

A piecewise function is a function described by different rules on different parts of its domain. The breakpoints between regions are where the action lives: they are the only places the function can misbehave — by jumping (discontinuity), by bending sharply (corner), or by changing curvature (non-smooth-but-differentiable). Reading a piecewise function is the same skill in every form it takes: tax brackets, thermostats, ReLU, Heaviside, sign, |x|.

Piecewise notation says "in region $D_i$ , use rule $f_i$ ." The domains must cover the input set without conflict.
Continuity at a breakpoint = left limit, right limit, and value all agree. Otherwise there is a jump.
|x| is two-piece linear: −x for negatives, x for non-negatives. The V-shape has slopes $\pm 1$ meeting in a corner at the origin.
|x − a| is the distance from x to a — the most useful reading of absolute value in calculus and statistics.
Equations $|u| = c$ split into two linear cases $u = c$ and $u = -c$ . Always check both, always verify by substitution.
Sums of absolute values can be minimized on an entire interval. This is the geometric root of L1-induced sparsity in machine learning.
Sign, Heaviside, |x|, ReLU are one family. Away from the corner at 0, sign is the derivative of |x| and the Heaviside step is the derivative of ReLU.
Corners ⇒ no derivative. Continuity does not imply differentiability. This distinction is the foundation of everything in Chapter 4.
ReLU is the most important piecewise function alive. Its strengths (cheap, non-vanishing gradient, sparse) and its weaknesses (dying neurons) are both direct consequences of its two-piece structure.
NumPy: np.piecewise and np.where. PyTorch: F.relu + autograd reproduces the Heaviside step exactly.

What's next. In Section 1.12 we will study transformations of arbitrary functions — shift, scale, reflect — using exactly the same shifting language $f(x) \mapsto a f(b(x - h)) + k$ that we previewed here for $|x|$ . In Chapter 2 we will give precise meaning to the left/right limits that defined continuity at a breakpoint.