Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

Derive every inverse hyperbolic derivative from one tool — the inverse function theorem $(f^{-1})'(y) \;=\; 1 / f'(f^{-1}(y))$ — applied to $\sinh$ , $\cosh$ , and $\tanh$ .
Recognize why each formula has the shape it does: a square-root in the denominator for $\operatorname{arsinh}, \operatorname{arcosh}$ (they come from $\cosh^2 - \sinh^2 = 1$ ), a quadratic for $\operatorname{artanh}$ (it comes from $1 - \tanh^2$ ).
Cross-check the formulas by differentiating the logarithmic forms $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$ directly with the chain rule.
Apply the chain rule to compose these derivatives with other functions, for example $\frac{d}{dx}\operatorname{artanh}(g(x)) = \frac{g'(x)}{1 - g(x)^2}$ .
Verify any answer numerically (Python central difference) and by automatic differentiation (PyTorch autograd).
Connect $\operatorname{artanh}$ to the logit transform used in classification heads, and $\operatorname{arsinh}$ to the asinh-scaled loss used in regression on heavy-tailed data.

The Question Behind the Formulas

In the last section you learned that $\sinh, \cosh, \tanh$ have unusually tidy derivatives. $\sinh' = \cosh$ . $\cosh' = \sinh$ . $\tanh' = 1 - \tanh^2$ . The pair $(\sinh, \cosh)$ behaves like a hyperbolic version of $(\sin, \cos)$ — same algebraic rhythm, no minus sign.

But just like the trig functions, hyperbolic functions are sometimes given to us backwards. A hanging cable has a shape $y = a \cosh(x/a)$ . If you measure the height $y$ and want to recover the horizontal position $x$ , you need $x = a\,\operatorname{arcosh}(y/a)$ . A neural network outputs a number $u \in (-1, 1)$ through a $\tanh$ head. If you want the pre-activation that produced it you need $\operatorname{artanh}(u)$ . Special relativity gives a rapidity $\phi = \operatorname{artanh}(v/c)$ . Statistics gives Fisher's $z = \operatorname{artanh}(r)$ transform on correlation coefficients.

The single question of this section: given that $(\sinh, \cosh, \tanh)$ have such clean derivatives, what are the derivatives of their inverses? And why do those derivatives have square-roots and quadratics glued to them?

The answer is one mechanical idea — the inverse function theorem — applied three times. Once you see the picture (curves reflecting across $y = x$ , slopes reciprocating), the formulas are inevitable.

Intuition: Inverses Mirror, Slopes Reciprocate

Inverting a function geometrically means reflecting its graph across the line $y = x$ . Every point $(a, b)$ on $f$ becomes the point $(b, a)$ on $f^{-1}$ . The two graphs are mirror images.

Now follow what happens to a tangent. Pick a point on $f$ where the tangent rises with slope $m = f'(a)$ — “up $m$ for every right 1”. Reflect that little arrow across $y = x$ . Up and right swap. The reflected arrow is “right $m$ for every up 1” — which is exactly the same as “up $1/m$ for every right 1”.

One sentence to keep: reflection across $y = x$ swaps rise and run, which is exactly the operation that inverts a slope. So the derivative of the inverse at a point is the reciprocal of the original derivative at the matching point.

Algebraically, if $y = f(x)$ and $x = f^{-1}(y)$ , differentiate the first equation with respect to $y$ , viewing $x$ as a function of $y$ . The chain rule gives $1 = f'(x)\,\frac{dx}{dy}$ , so

$$\frac{dx}{dy} \;=\; \frac{1}{\dfrac{dy}{dx}} \;=\; \frac{1}{f'(x)} \;=\; \frac{1}{f'(f^{-1}(y))}.$$

That is the entire toolkit. The rest of this section is just plugging in three different $f$ s.

Why a section just for this: the formula $(f^{-1})'(y) = 1/f'(f^{-1}(y))$ looks simple, but every time you apply it you have to express $f'(f^{-1}(y))$ back in terms of $y$ . For hyperbolic functions that re-expression always produces a square root or a quadratic, and that is where the famous shapes $1/\sqrt{1+x^2}$ and $1/(1-x^2)$ come from.

Mirror Across $y = x$ — Live Viewer

Drag the slider. Two things to watch:

The blue point on $f$ and the purple point on $f^{-1}$ are exact mirror images across the dashed line — they have swapped coordinates.
The green tangent (on $f$ ) and the orange tangent (on $f^{-1}$ ) are mirror images too. Their slopes multiply to $1$ : that is the inverse function theorem, drawn.

[Interactive figure: Inverse Function Reflection: Slopes Are Reciprocals. The two curves are mirror images across the dashed line $y = x$ , and the slope of the inverse at $y$ equals $1/f'(x)$ . Sample readout for $f = \sinh$ and $f^{-1} = \operatorname{arsinh}$ at $x = 1.000$ : $f(x) = 1.1752$ , $f'(x) = 1.5431$ , $(f^{-1})'(f(x)) = 0.6481$ , product $= 1.0000$ .]

Reciprocal check: $f'(x)\cdot(f^{-1})'(f(x))$ should always be $1$ . That is the inverse function theorem, and it is the single fact that turns every hyperbolic derivative into an inverse-hyperbolic derivative. arsinh undoes sinh on all of ℝ.
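The readout above can be reproduced in a few lines. This is a minimal sketch, assuming the viewer's pair is $f = \sinh$ and $f^{-1} = \operatorname{arsinh}$ (which matches the sample values); the helper name `slope` is hypothetical and uses a plain central difference rather than any formula from this section.

```python
import math

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x) (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 1.0
y = math.sinh(x)               # the mirrored point on the inverse is (y, x)
m_f = slope(math.sinh, x)      # slope of sinh at x
m_inv = slope(math.asinh, y)   # slope of arsinh at y = sinh(x)
product = m_f * m_inv          # inverse function theorem: should be close to 1

print(f"f(x)={y:.4f}  f'(x)={m_f:.4f}  (f^-1)'(f(x))={m_inv:.4f}  product={product:.4f}")
```

Changing `x` moves both points together; the product should stay at 1 up to finite-difference noise.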

The Machine: Inverse Function Theorem

Concretely, here is the three-step recipe we will run three times.

Set $y = f(x)$ , so $x = f^{-1}(y)$ .
Differentiate $y = f(x)$ with respect to $y$ on both sides: $1 = f'(x)\,\frac{dx}{dy}$ , so $\frac{dx}{dy} = \frac{1}{f'(x)}$ .
Re-express $f'(x)$ in terms of $y$ using the relation $y = f(x)$ . This is the step that produces the square root.

Mnemonic: for inverses of hyperbolics, step 3 always uses one of two identities: $\cosh^2 - \sinh^2 = 1$ (for $\operatorname{arsinh}, \operatorname{arcosh}$ ) or $1 - \tanh^2 = \operatorname{sech}^2$ (for $\operatorname{artanh}$ ). Burn those two identities into memory and the three formulas fall out by themselves.

Derivation 1: $\dfrac{d}{dx}\operatorname{arsinh}(x)$

Let $y = \operatorname{arsinh}(x)$ , so $x = \sinh(y)$ . We want $dy/dx$ .

Step 1. Differentiate the relation $x = \sinh(y)$ with respect to $x$ , treating $y$ as a function of $x$ :

$$1 \;=\; \cosh(y)\cdot\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{\cosh(y)}.$$

Step 2. Re-express $\cosh(y)$ in terms of $x$ . Use the fundamental identity $\cosh^2(y) - \sinh^2(y) = 1$ :

$$\cosh^2(y) \;=\; 1 + \sinh^2(y) \;=\; 1 + x^2 \quad\Longrightarrow\quad \cosh(y) \;=\; \sqrt{1+x^2}.$$

We take the positive square root because $\cosh \ge 1 > 0$ everywhere.

Step 3. Plug back in:

$$\boxed{\;\frac{d}{dx}\operatorname{arsinh}(x) \;=\; \frac{1}{\sqrt{1+x^2}}\;}$$

The square root in the denominator is the geometric signature of the identity $\cosh^2 - \sinh^2 = 1$ . Anytime you see $1/\sqrt{1+x^2}$ in a calculation, you are looking at an arsinh in disguise.

Sanity check at $x = 0$ : the formula gives $1/\sqrt{1+0} = 1$ . And the original function: near zero $\sinh(y) \approx y$ , so $\operatorname{arsinh}(x) \approx x$ , which has slope 1. Two independent routes, same answer.
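Step 2 of the arsinh derivation rests on $\cosh(\operatorname{arsinh}(x)) = \sqrt{1+x^2}$ . A quick numerical sketch of that re-expression, using only the standard library (the test points are arbitrary choices, not values from the text):

```python
import math

max_gap = 0.0
for x in (-3.0, -0.5, 0.0, 1.0, 10.0):
    y = math.asinh(x)                        # y = arsinh(x), so sinh(y) = x
    via_identity = math.sqrt(1.0 + x * x)    # cosh(y) predicted by cosh^2 - sinh^2 = 1
    direct = math.cosh(y)                    # cosh(y) evaluated directly
    max_gap = max(max_gap, abs(direct - via_identity))
    print(f"x={x:6.2f}  cosh(arsinh x)={direct:.12f}  sqrt(1+x^2)={via_identity:.12f}")

print(f"largest disagreement: {max_gap:.1e}")
```

The two columns should agree to rounding error for negative and positive inputs alike, which is the positive-root choice in action.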

Derivation 2: $\dfrac{d}{dx}\operatorname{arcosh}(x)$

Let $y = \operatorname{arcosh}(x)$ , with $x \ge 1$ and $y \ge 0$ (the principal branch — see the pitfall section).

Step 1. From $x = \cosh(y)$ , differentiate:

$$1 \;=\; \sinh(y)\,\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{\sinh(y)}.$$

Step 2. Re-express $\sinh(y)$ in terms of $x$ :

$$\sinh^2(y) \;=\; \cosh^2(y) - 1 \;=\; x^2 - 1 \quad\Longrightarrow\quad \sinh(y) \;=\; \sqrt{x^2-1}.$$

The positive sign comes from $y \ge 0 \Rightarrow \sinh(y) \ge 0$ .

Step 3. Plug back in:

$$\boxed{\;\frac{d}{dx}\operatorname{arcosh}(x) \;=\; \frac{1}{\sqrt{x^2-1}}, \quad x > 1.\;}$$

As $x \to 1^+$ , the slope blows up. Geometrically this is the moment $\cosh$ 's tangent at the bottom of its U-shape passes through horizontal (slope 0). Inverting it means the tangent of $\operatorname{arcosh}$ becomes vertical there. Reciprocal of zero is infinity.
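To see how fast the slope blows up, write $x = 1 + \varepsilon$ , so that $x^2 - 1 = 2\varepsilon + \varepsilon^2$ and the slope behaves like $1/\sqrt{2\varepsilon}$ for small $\varepsilon$ . The sketch below compares the boxed formula with that leading-order estimate; the names `d_arcosh`, `leading`, and `ratios` are illustrative only.

```python
import math

def d_arcosh(x: float) -> float:
    """Boxed formula: 1 / sqrt(x^2 - 1), valid for x > 1."""
    return 1.0 / math.sqrt(x * x - 1.0)

ratios = []
for eps in (1e-1, 1e-2, 1e-4, 1e-6):
    x = 1.0 + eps
    exact = d_arcosh(x)
    leading = 1.0 / math.sqrt(2.0 * eps)   # leading-order estimate near x = 1
    ratios.append(exact / leading)
    print(f"eps={eps:.0e}  slope={exact:12.4f}  1/sqrt(2*eps)={leading:12.4f}")
```

Each hundredfold reduction in the distance to $x = 1$ multiplies the slope by roughly ten, the square-root signature of the vertical tangent.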

Derivation 3: $\dfrac{d}{dx}\operatorname{artanh}(x)$

Let $y = \operatorname{artanh}(x)$ , with $-1 < x < 1$ .

Step 1. From $x = \tanh(y)$ , differentiate using $\tanh'(y) = 1 - \tanh^2(y)$ (the derivative you proved in the previous section):

$$1 \;=\; \bigl(1 - \tanh^2(y)\bigr)\,\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{1 - \tanh^2(y)}.$$

Step 2. Substitute $\tanh(y) = x$ directly — no square roots this time, because the identity $1 - \tanh^2 = \operatorname{sech}^2$ is already a clean quadratic in $\tanh$ :

$$\boxed{\;\frac{d}{dx}\operatorname{artanh}(x) \;=\; \frac{1}{1-x^2}, \quad |x| < 1.\;}$$

At the endpoints $x = \pm 1$ the formula produces $1/0$ ; the slope is genuinely infinite there because $\operatorname{artanh}$ has vertical asymptotes at $\pm 1$ . The infinite slope is not a bug — it is the price of squeezing all of $\mathbb R$ into the open interval $(-1, 1)$ .

One quirk worth knowing. The formula for $\operatorname{arcoth}$ , the inverse of $\coth$ , defined on the domain $|x| > 1$ , is exactly the same algebraic expression $1/(1-x^2)$ . The two functions $\operatorname{artanh}$ and $\operatorname{arcoth}$ simply live on different pieces of the real line, but they share a derivative formula. A surprising and useful symmetry.
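The shared formula can be checked numerically. Python's `math` module has no inverse hyperbolic cotangent, so this sketch defines a hypothetical `arcoth` through the identity $\operatorname{arcoth}(x) = \operatorname{artanh}(1/x)$ (from $\coth y = x \iff \tanh y = 1/x$ ) and compares central-difference slopes against $1/(1-x^2)$ on both domains.

```python
import math

def arcoth(x: float) -> float:
    """Inverse of coth on |x| > 1, written as artanh(1/x) (illustrative helper)."""
    if abs(x) <= 1.0:
        raise ValueError("arcoth requires |x| > 1")
    return math.atanh(1.0 / x)

def shared_formula(x: float) -> float:
    """The common derivative expression 1 / (1 - x^2)."""
    return 1.0 / (1.0 - x * x)

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

results = []
for name, f, x in (("artanh", math.atanh, 0.5),
                   ("arcoth", arcoth, 2.0),
                   ("arcoth", arcoth, -3.0)):
    numeric = slope(f, x)
    results.append((name, x, numeric, shared_formula(x)))
    print(f"{name}  x={x:5.1f}  numeric={numeric:10.6f}  1/(1-x^2)={shared_formula(x):10.6f}")
```

Note the sign: the same expression is positive inside $(-1, 1)$ and negative outside it, so artanh is increasing while arcoth is decreasing on each of its branches.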

The Remaining Three: $\operatorname{arcsch}, \operatorname{arsech}, \operatorname{arcoth}$

The same machine handles the reciprocal trio. The derivations are short. For example, since $\operatorname{csch} = 1/\sinh$ , we have $\operatorname{arcsch}(x) = \operatorname{arsinh}(1/x)$ , and the chain rule plus our arsinh result gives $\frac{d}{dx}\operatorname{arcsch}(x) = \frac{1}{\sqrt{1+1/x^2}}\cdot\left(-\frac{1}{x^2}\right) = -\frac{1}{|x|\sqrt{1+x^2}}$ . The answers are:

Function	Domain	Derivative
arcsch(x)	x ≠ 0	− 1 / ( |x| · √(1 + x²) )
arsech(x)	0 < x ≤ 1	− 1 / ( x · √(1 − x²) )
arcoth(x)	|x| > 1	1 / (1 − x²)

Where the minus signs come from: $\operatorname{csch}$ and $\operatorname{sech}$ are decreasing on their principal branches (each is the reciprocal of a positive growing function). Inverting a decreasing function gives a decreasing function, and an inverse function's slope inherits the sign of the original. So $\operatorname{arcsch}'$ and $\operatorname{arsech}'$ both come out negative.
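A numerical sketch of the two negative rows, assuming the standard rewrites $\operatorname{arcsch}(x) = \operatorname{arsinh}(1/x)$ and $\operatorname{arsech}(x) = \operatorname{arcosh}(1/x)$ . The function names below are illustrative, and the test points are arbitrary interior points of each domain.

```python
import math

def arcsch(x: float) -> float:
    return math.asinh(1.0 / x)     # csch(y) = x  <=>  sinh(y) = 1/x

def arsech(x: float) -> float:
    return math.acosh(1.0 / x)     # sech(y) = x  <=>  cosh(y) = 1/x, with y >= 0

def d_arcsch(x: float) -> float:
    return -1.0 / (abs(x) * math.sqrt(1.0 + x * x))

def d_arsech(x: float) -> float:
    return -1.0 / (x * math.sqrt(1.0 - x * x))

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

cases = [("arcsch", arcsch, d_arcsch, -2.0),
         ("arcsch", arcsch, d_arcsch, 0.5),
         ("arsech", arsech, d_arsech, 0.5)]
errors = []
for name, f, fprime, x in cases:
    numeric, analytic = slope(f, x), fprime(x)
    errors.append(abs(numeric - analytic))
    print(f"{name}  x={x:5.1f}  analytic={analytic:10.6f}  numeric={numeric:10.6f}")
```

The negative test point for arcsch is there to exercise the $|x|$ in the denominator: without the absolute value the formula would flip sign for $x < 0$ .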

Master Reference Table

Print this. Tape it to your wall. Every inverse-hyperbolic problem reduces to one of these rows, perhaps composed with the chain rule.

f(x)	f'(x)	Domain of f'	Notes
arsinh(x)	1 / √(1 + x²)	all of ℝ	always positive; even function of x
arcosh(x)	1 / √(x² − 1)	x > 1	vertical tangent at x = 1
artanh(x)	1 / (1 − x²)	|x| < 1	blows up at x = ±1
arcsch(x)	− 1 / ( |x| · √(1 + x²) )	x ≠ 0	negative; the |x| keeps the sign correct on both sides of zero
arsech(x)	− 1 / ( x · √(1 − x²) )	0 < x < 1	negative; matches arcosh's mirror
arcoth(x)	1 / (1 − x²)	|x| > 1	same expression as artanh' but different domain

Chain rule extension. For any of these $f$ , the derivative of $f(g(x))$ is $f'(g(x))\cdot g'(x)$ . So $\frac{d}{dx}\operatorname{arsinh}(2x^2)$ is just $\dfrac{4x}{\sqrt{1 + 4x^4}}$ . No new rules, ever.
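The chain-rule example can be checked the same way as the base formulas. A short sketch, with `composite` and `composite_prime` as illustrative names for $\operatorname{arsinh}(2x^2)$ and its claimed derivative:

```python
import math

def composite(x: float) -> float:
    return math.asinh(2.0 * x * x)                    # f(g(x)) with g(x) = 2x^2

def composite_prime(x: float) -> float:
    return 4.0 * x / math.sqrt(1.0 + 4.0 * x ** 4)    # f'(g(x)) * g'(x)

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

errors = []
for x in (-1.5, 0.7, 2.0):
    analytic, numeric = composite_prime(x), slope(composite, x)
    errors.append(abs(analytic - numeric))
    print(f"x={x:5.2f}  analytic={analytic:10.6f}  numeric={numeric:10.6f}")
```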

Sanity Check: The Logarithm Route

Each inverse hyperbolic function also has a closed-form logarithmic expression. Differentiating those expressions directly with the chain rule gives a second route to the same answers — a perfect cross-check for the inverse-function-theorem route.

Where does $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$ come from? Solve $x = \sinh(y) = (e^y - e^{-y})/2$ for $y$ . Multiply by $2e^y$ : $e^{2y} - 2xe^y - 1 = 0$ . This is a quadratic in $u = e^y$ with positive root $u = x + \sqrt{x^2+1}$ . Take $\ln$ of both sides and you get the formula.

Now differentiate directly. Let $u(x) = x + \sqrt{x^2+1}$ . Then:

$$u'(x) \;=\; 1 + \frac{x}{\sqrt{x^2+1}} \;=\; \frac{\sqrt{x^2+1} + x}{\sqrt{x^2+1}} \;=\; \frac{u(x)}{\sqrt{x^2+1}}.$$

By the chain rule,

$$\frac{d}{dx}\operatorname{arsinh}(x) \;=\; \frac{u'(x)}{u(x)} \;=\; \frac{u(x)/\sqrt{x^2+1}}{u(x)} \;=\; \frac{1}{\sqrt{x^2+1}}.$$

Same answer. The little cancellation $u(x)/u(x) = 1$ is where the algebra rewards us. The same trick works for $\operatorname{arcosh}$ and $\operatorname{artanh}$ ; try it as an exercise.

Why two routes? One route uses the structure of inverse functions abstractly (slopes reciprocate). The other uses the explicit closed form. Both must give the same answer, because there is only one derivative. When you build intuition along multiple paths, the formulas stop being memorized — they become deducible from first principles.
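The quadratic step behind the logarithmic form is easy to test. The sketch below checks that $u = x + \sqrt{x^2+1}$ really solves $u^2 - 2xu - 1 = 0$ and that $\ln u$ agrees with the standard library's `math.asinh`; the sample points are arbitrary.

```python
import math

max_residual = 0.0
max_gap = 0.0
for x in (-2.0, 0.0, 0.3, 5.0):
    u = x + math.sqrt(x * x + 1.0)            # positive root of u^2 - 2xu - 1 = 0
    residual = u * u - 2.0 * x * u - 1.0      # should vanish up to rounding
    gap = abs(math.log(u) - math.asinh(x))    # ln(u) versus the built-in arsinh
    max_residual = max(max_residual, abs(residual))
    max_gap = max(max_gap, gap)
    print(f"x={x:5.2f}  u={u:.6f}  residual={residual:.1e}  |ln u - asinh x|={gap:.1e}")
```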

Worked Example (Collapsible): Mass on a Falling Cable

A heavy chain hangs in the shape $y(x) = a\cosh(x/a)$ (the catenary; see ch07/s01 if you want the physics derivation). We measure the height of a bead on the chain and want to know how its horizontal coordinate moves when the height changes. That is $dx/dy$ , and the answer is an application of the arcosh derivative.

Try the problem yourself first, then check against the worked solution below.

Solve for $x$ in terms of $y$ . From $y = a\cosh(x/a)$ we get $x = a\,\operatorname{arcosh}(y/a)$ , choosing the positive branch.

Differentiate with respect to $y$ . Using the master table row for $\operatorname{arcosh}$ and the chain rule with inner function $g(y) = y/a$ :

$$\frac{dx}{dy} \;=\; a \cdot \frac{1}{\sqrt{(y/a)^2 - 1}} \cdot \frac{1}{a} \;=\; \frac{1}{\sqrt{(y/a)^2 - 1}}.$$

Multiply numerator and denominator by $a$ to clean up:

$$\frac{dx}{dy} \;=\; \frac{a}{\sqrt{y^2 - a^2}}.$$

Numerical sanity. Let $a = 1\,\text{m}$ . At $y = 2\,\text{m}$ : $dx/dy = 1/\sqrt{4-1} = 1/\sqrt{3} \approx 0.5774$ . Meaning: when the bead rises 1 cm, it slides sideways $\approx 0.5774$ cm. Notice how the answer blows up as $y \to 1^+$ — that is the bottom of the chain (where the tangent is horizontal), so a tiny vertical move corresponds to a huge horizontal move. Exactly the geometric story behind the singular slope of $\operatorname{arcosh}$ at $x = 1$ .

Cross-check. Compute $dy/dx$ directly from $y = \cosh x$ : $dy/dx = \sinh x$ . At $y = 2$ we have $x = \operatorname{arcosh}(2) \approx 1.317$ and $\sinh(1.317) \approx 1.732 = \sqrt{3}$ . Reciprocate: $1/\sqrt{3} \approx 0.5774$ . Same answer. The inverse function theorem is doing exactly what the picture promised.
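The whole worked example fits in a short script. This sketch uses the values from the text ( $a = 1$ , $y = 2$ ); the function names `x_of_y` and `dx_dy` are illustrative.

```python
import math

def x_of_y(y: float, a: float) -> float:
    """Horizontal position on the positive branch: x = a * arcosh(y / a)."""
    return a * math.acosh(y / a)

def dx_dy(y: float, a: float) -> float:
    """Closed form from the worked solution: a / sqrt(y^2 - a^2)."""
    return a / math.sqrt(y * y - a * a)

a, y, h = 1.0, 2.0, 1e-6
analytic = dx_dy(y, a)
numeric = (x_of_y(y + h, a) - x_of_y(y - h, a)) / (2.0 * h)   # central difference
reciprocal = 1.0 / math.sinh(x_of_y(y, a) / a)                # 1 / (dy/dx), with dy/dx = sinh(x/a)

print(f"analytic={analytic:.6f}  numeric={numeric:.6f}  1/sinh(x/a)={reciprocal:.6f}")
```

All three numbers should agree with $1/\sqrt{3} \approx 0.5774$ . Trying other values of `a` and `y` (with `y > a`) exercises the general formula rather than only the $a = 1$ case.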

Interactive Derivative Explorer

Pick one of the six inverse hyperbolic functions, drag the slider, and watch:

the green tangent line rotate on the blue curve;
the orange dashed curve show $f'(x)$ ; the dashed curve's height at $x$ equals the slope of the blue curve at the same $x$ ;
the readouts — value, slope, and tangent angle in degrees — update live.

[Interactive figure: Inverse Hyperbolic Derivative Explorer. The blue curve is the selected function, the green line is its tangent, and the dashed orange curve is the derivative, whose height at $x$ equals the slope of the blue curve at the same $x$ . Sample readout for $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$ with $f'(x) = 1/\sqrt{1+x^2}$ at $x = 1.000$ : $f(x) = 0.8814$ , slope $f'(x) = 0.7071$ , tangent angle $35.3°$ .]

Slide $\operatorname{artanh}$ close to $x = 0.95$ and watch the slope explode past $10$ . That is the geometric signature of the pole at $x = 1$ , and it is also why neural networks trained with $\tanh$ outputs sometimes have wild gradients near saturation: the inverse map is amplifying everything by $1/(1 - x^2)$ .

Python: Numerical Verification by Central Difference

Three formulas, eight test points, one diagnostic table. We compare each hand-derived derivative against a central-difference estimate. If the algebra is right, the error column should be $\sim 10^{-10}$ on smooth regions.

inverse_hyperbolic.py — closed forms vs. central differences


Imports: pure standard library

We use math.log and math.sqrt, both ordinary scalar functions. Callable and Tuple are just type hints. No NumPy, no SciPy. The goal of this file is to make every inverse-hyperbolic formula feel like a one-line arithmetic recipe, so we keep the toolbox small.

EXECUTION STATE
math.log = natural logarithm, ln
math.sqrt = square root

arsinh(x): defined everywhere on ℝ

The closed-form identity arsinh(x) = ln(x + √(x²+1)) comes from solving sinh(y) = x for y, as shown in the logarithm-route section. Notice there is no domain check: the argument x + √(x²+1) is always positive for any real x, so the log never complains.

EXAMPLE
arsinh(0)   = ln(0 + 1)   = 0
arsinh(1)   = ln(1 + √2)  ≈ 0.88137
arsinh(2.5) = ln(2.5 + √7.25) ≈ 1.6472

EXECUTION STATE
x + sqrt(x²+1) = always > 0, safe for log

arcosh(x): domain x ≥ 1

cosh is only one-to-one for x ≥ 0 and its image starts at cosh(0) = 1. So arcosh(x), the inverse on the principal branch, accepts only x ≥ 1. We raise an error instead of silently returning NaN, because a downstream loss curve that quietly emits NaN is one of the most expensive bugs in ML.

EXAMPLE
arcosh(1)   = ln(1 + 0) = 0
arcosh(1.5) = ln(1.5 + √1.25) ≈ 0.9624
arcosh(3)   = ln(3 + √8)  ≈ 1.7627

EXECUTION STATE
ValueError = raised when x < 1

artanh(x): domain (−1, 1), open interval

tanh squashes ℝ into (−1, 1), so artanh only accepts inputs strictly inside the open interval. At the endpoints ±1 the value blows up to ±∞, which matches the calculus: lim x→1⁻ artanh(x) = +∞. This function appears all over ML as the *logit* transform: artanh(x) = ½ · logit(½(1+x)).

EXAMPLE
artanh(0)   = 0
artanh(0.5) = ½ ln(1.5/0.5) = ½ ln 3 ≈ 0.5493
artanh(0.9) = ½ ln(1.9/0.1) ≈ 1.4722

EXECUTION STATE
(1+x)/(1−x) = must be > 0; squeezing toward ±1 blows up

d_arsinh(x) = 1 / √(1 + x²)

This is the entire payoff of the inverse function theorem applied to sinh: cosh(arsinh(x)) = √(1 + x²), as derived in Derivation 1. The derivative is positive everywhere and decays like 1/|x| for large |x|, consistent with arsinh(x) growing like ln(2|x|) in magnitude.

EXAMPLE
d_arsinh(0)   = 1
d_arsinh(1)   = 1/√2 ≈ 0.7071
d_arsinh(2.5) ≈ 0.3714

d_arcosh(x) = 1 / √(x² − 1)

Sister formula. Note the minus sign under the square root: this expression is real only for |x| > 1, and on the principal branch we use it for x > 1. As x → 1⁺ the slope explodes to +∞, which is what makes arcosh's tangent vertical at x = 1.

EXAMPLE
d_arcosh(1.5) = 1/√1.25 ≈ 0.8944
d_arcosh(3)   = 1/√8     ≈ 0.3536

d_artanh(x) = 1 / (1 − x²)

Same expression as d/dx of ½ ln((1+x)/(1−x)) (the exercise from the logarithm-route section). The slope is positive on (−1, 1), explodes at the endpoints, and is exactly 1 at x = 0. That last fact matters: it tells you artanh, and therefore tanh, is approximately the identity in the neighborhood of zero, which is why deep nets initialized near zero look like linear nets for the first few iterations.

EXAMPLE
d_artanh(0)   = 1
d_artanh(0.5) = 1/0.75 ≈ 1.3333
d_artanh(0.9) = 1/0.19 ≈ 5.2632

Central-difference helper

We will grade every analytic derivative by comparing it to a numerical estimate. (f(x+h) − f(x−h)) / (2h) is O(h²) accurate on a smooth function: the truncation error is about h²·|f‴(x)|/6, so with h = 1e-5 it is on the order of 1e-10 wherever f‴ is of order one, and rounding adds roughly machine epsilon divided by h, about 1e-11.

EXECUTION STATE
h = default 1e-5; smaller h adds rounding noise

check(...): the diagnostic row

For each test point we print analytic value, numerical estimate, and absolute error. If our hand-derived formulas were wrong, the err column would jump to ~1e-1 or worse. Anything in the 1e-9 .. 1e-11 range is essentially zero in IEEE-754 land.

EXAMPLE
Sample output line:
arsinh  x=  1.0  analytic=  0.707107  numeric=  0.707107  err=1.27e-11

EXECUTION STATE
analytic = value of fprime(x) we coded
numeric = central-difference estimate
err = |analytic − numeric|

Driver: eight test points across three domains

We pick points that exercise the boundaries of each function: x = 0 for arsinh (the symmetry centre), x = 1.5 for arcosh (close to the singular boundary x = 1), x = 0.9 for artanh (close to the +1 blow-up). Agreement at these points is strong evidence that the formulas hold throughout their domains.

LOOP TRACE · 8 iterations

Function	x	analytic	numeric	err
arsinh	0	1.000000	1.000000	~ 1e-11
arsinh	1	0.707107	0.707107	~ 1e-11
arsinh	2.5	0.371391	0.371391	~ 1e-11
arcosh	1.5	0.894427	0.894427	~ 1e-9 (steeper region)
arcosh	3	0.353553	0.353553	~ 1e-11
artanh	-0.5	1.333333	1.333333	~ 1e-11
artanh	0.5	1.333333	1.333333	~ 1e-11
artanh	0.9	5.263158	5.263158	~ 1e-8 (near pole)

```python
import math
from typing import Callable, Tuple

# ---- three inverse hyperbolic functions, in closed (logarithmic) form -----

def arsinh(x: float) -> float:
    # defined for all real x
    return math.log(x + math.sqrt(x * x + 1.0))

def arcosh(x: float) -> float:
    # only defined for x >= 1
    if x < 1.0:
        raise ValueError("arcosh requires x >= 1")
    return math.log(x + math.sqrt(x * x - 1.0))

def artanh(x: float) -> float:
    # only defined for -1 < x < 1
    if not -1.0 < x < 1.0:
        raise ValueError("artanh requires -1 < x < 1")
    return 0.5 * math.log((1.0 + x) / (1.0 - x))


# ---- their derivatives, in closed form (from the inverse function theorem) -

def d_arsinh(x: float) -> float:
    return 1.0 / math.sqrt(1.0 + x * x)

def d_arcosh(x: float) -> float:
    return 1.0 / math.sqrt(x * x - 1.0)

def d_artanh(x: float) -> float:
    return 1.0 / (1.0 - x * x)


# ---- numerical derivative via symmetric difference -----------------------

def numerical_derivative(f: Callable[[float], float], x: float,
                          h: float = 1e-5) -> float:
    """Central difference: O(h^2) accurate for smooth f."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def check(name: str, x: float, f: Callable[[float], float],
          fprime: Callable[[float], float]) -> Tuple[float, float, float]:
    # Print one diagnostic row and return (analytic, numeric, abs error).
    analytic = fprime(x)
    numeric = numerical_derivative(f, x)
    err = abs(analytic - numeric)
    print(f"{name}  x={x:>5}  analytic={analytic:10.6f}  "
          f"numeric={numeric:10.6f}  err={err:.2e}")
    return analytic, numeric, err


if __name__ == "__main__":
    # Pick points safely inside each function's domain
    # (x - h and x + h must also stay inside it).
    check("arsinh", 0.0, arsinh, d_arsinh)
    check("arsinh", 1.0, arsinh, d_arsinh)
    check("arsinh", 2.5, arsinh, d_arsinh)

    check("arcosh", 1.5, arcosh, d_arcosh)
    check("arcosh", 3.0, arcosh, d_arcosh)

    check("artanh", -0.5, artanh, d_artanh)
    check("artanh",  0.5, artanh, d_artanh)
    check("artanh",  0.9, artanh, d_artanh)
```

Run the file. Every entry in the err column falls in the 1e-8 to 1e-12 range. That is the signature of correct algebra plus an O(h²) numerical scheme. Where the err climbs a little (arcosh at 1.5, artanh at 0.9), the cause is not bad algebra: the scheme is still O(h²), but its error constant is proportional to the third derivative of f, which grows rapidly as the curve steepens toward a singularity.
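The trade-off behind the choice of h can be made visible with a step-size sweep. This is a sketch using the built-in `math.atanh` at $x = 0.9$ ; the list of step sizes is an arbitrary choice. For a central difference, truncation error shrinks like h² while rounding error grows roughly like machine epsilon divided by h, so the total error falls and then rises again.

```python
import math

x = 0.9
exact = 1.0 / (1.0 - x * x)   # artanh'(0.9) from the boxed formula

errors = {}
for h in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-8):
    numeric = (math.atanh(x + h) - math.atanh(x - h)) / (2.0 * h)
    errors[h] = abs(numeric - exact)
    print(f"h={h:.0e}  numeric={numeric:.10f}  err={errors[h]:.2e}")
```

In the truncation-dominated rows, shrinking h by a factor of 10 should shrink the error by about a factor of 100. Once h gets very small the error stops improving, which is the rounding-noise regime mentioned in the helper's notes.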

PyTorch: Autograd Recovers Every Formula

Now the third independent route. We compute the same derivatives with $\texttt{torch.autograd.grad}$ and compare against the analytic formulas. Internally, PyTorch's C++ backward kernels for $\texttt{asinh}, \texttt{acosh}, \texttt{atanh}$ implement exactly the closed forms we proved. If our derivation is correct the two columns must match to $\sim 10^{-15}$ on float64.

inverse_hyperbolic_torch.py — autograd cross-check


Imports: math + torch

math gives us the closed-form formulas; torch gives us tensors and autograd. We will compute every derivative two ways and check they agree to within floating-point tolerance.

torch already has these as primitives

torch.asinh, torch.acosh, torch.atanh are built-ins. Each has a registered backward function, meaning PyTorch's autograd already knows the derivative formulas we derived above. The exercise here is *checking that what we proved by hand matches the framework's gradient table*, not teaching PyTorch new tricks.

EXAMPLE
torch.asinh(torch.tensor(1.0))   → tensor(0.8814)
torch.atanh(torch.tensor(0.5))   → tensor(0.5493)

analytic_derivative: the formulas we proved

These are exactly the three formulas derived in this section: 1/√(1+x²), 1/√(x²−1), 1/(1−x²). Putting them in one function lets us compare side-by-side against autograd later.

EXECUTION STATE
1/√(1+x²) = arsinh', finite everywhere
1/√(x²-1) = arcosh', singular at x = 1
1/(1-x²) = artanh', singular at x = ±1

autograd_derivative: single-point gradient

We wrap a single x in a scalar float64 tensor with requires_grad=True. float64 is intentional here: when we approach the singular boundaries (arcosh at 1+, artanh at ±1−) float32 noise becomes visible at the 6th decimal.

EXECUTION STATE
xt = tensor with requires_grad=True
dtype = float64 → 15 reliable decimals

Forward pass: y = torch.asinh / acosh / atanh

This builds the (very short) computation graph: xt → y. There is exactly one op, so the backward pass will be a single application of the corresponding gradient formula stored inside PyTorch.

EXAMPLE
name='arsinh', x=1.0
  y = torch.asinh(xt)
  y.item() ≈ 0.8813735870195429

torch.autograd.grad(outputs=y, inputs=xt)

We ask autograd to return dy/dxt as a tuple. This is the *clean* form (does not mutate xt.grad), preferred when you want to compute a gradient mid-program without touching state.

outputs=y

We differentiate the inverse-hyperbolic value y, not anything downstream. y is a scalar tensor; outputs can be a scalar or an arbitrary tensor with an explicit grad_outputs.

inputs=xt

The leaf tensor we want a gradient with respect to. xt must have requires_grad=True or autograd refuses to track it.

create_graph=False

Second-order derivatives are not needed, so we skip building a graph-of-the-graph. If we wanted d²/dx² arsinh(x), we would flip this flag to True and call autograd.grad a second time on grad.

return grad.item()

Pull the Python float out of the 0-dim tensor. .item() also detaches; the returned float carries no grad history.

Driver: eight points across all three functions

Same eight points as the pure-Python file, except that arsinh's 2.5 is replaced by −2.5 to confirm symmetry. arsinh and artanh are odd functions (f(−x) = −f(x)), so their derivatives are even functions (f'(−x) = f'(x)). The negative test points let you see this directly in the output.

LOOP TRACE · 8 iterations

Function	x	analytic	autograd	match
arsinh	0	1.000000	1.000000	True
arsinh	1	0.707107	0.707107	True
arsinh	-2.5	0.371391 (same as +2.5: even slope)	0.371391	True
arcosh	1.5	0.894427	0.894427	True
arcosh	3	0.353553	0.353553	True
artanh	-0.5	1.333333	1.333333	True
artanh	0.5	1.333333	1.333333	True
artanh	0.9	5.263158	5.263158	True

math.isclose with rel_tol = 1e-9

We use isclose instead of `==` because float arithmetic is order-dependent: autograd composes ops in a slightly different order from our by-hand closed form. 1e-9 relative tolerance is generous on float64 and tight on float32. Every test passes.

```python
import math
import torch

# ----- three inverse hyperbolic functions via torch --------------------------
# torch.asinh / acosh / atanh exist natively. We use them so autograd can do
# its job, then cross-check against our hand-derived formulas.

def f_arsinh(x: torch.Tensor) -> torch.Tensor:
    return torch.asinh(x)

def f_arcosh(x: torch.Tensor) -> torch.Tensor:
    return torch.acosh(x)

def f_artanh(x: torch.Tensor) -> torch.Tensor:
    return torch.atanh(x)


def analytic_derivative(name: str, x: float) -> float:
    # The three boxed formulas from the derivations.
    if name == "arsinh":
        return 1.0 / math.sqrt(1.0 + x * x)
    if name == "arcosh":
        return 1.0 / math.sqrt(x * x - 1.0)
    if name == "artanh":
        return 1.0 / (1.0 - x * x)
    raise ValueError(name)


def autograd_derivative(name: str, x: float) -> float:
    """Ask PyTorch for f'(x) at a single point using torch.autograd.grad."""
    xt = torch.tensor(x, dtype=torch.float64, requires_grad=True)
    if name == "arsinh":
        y = f_arsinh(xt)
    elif name == "arcosh":
        y = f_arcosh(xt)
    elif name == "artanh":
        y = f_artanh(xt)
    else:
        raise ValueError(name)

    # Returns a tuple with one gradient per input; xt.grad is left untouched.
    (grad,) = torch.autograd.grad(
        outputs=y,
        inputs=xt,
        create_graph=False,
    )
    return grad.item()


if __name__ == "__main__":
    tests = [
        ("arsinh",  0.0),
        ("arsinh",  1.0),
        ("arsinh", -2.5),
        ("arcosh",  1.5),
        ("arcosh",  3.0),
        ("artanh", -0.5),
        ("artanh",  0.5),
        ("artanh",  0.9),
    ]
    for name, x in tests:
        ana = analytic_derivative(name, x)
        ag  = autograd_derivative(name, x)
        print(f"{name:>7}  x={x:>5}  analytic={ana:10.6f}  "
              f"autograd={ag:10.6f}  match={math.isclose(ana, ag, rel_tol=1e-9)}")
```

What this script teaches you. When autograd computes the gradient of any expression that contains $\operatorname{arsinh}, \operatorname{arcosh}, \operatorname{artanh}$ , it is mechanically substituting our three boxed formulas into the chain rule. Knowing the formula by heart is identical to knowing what autograd is going to do, and that knowledge is what lets you debug a NaN gradient in production.

Why ML Engineers Care: Logits, Tanh, and Stability

1. Logits are artanhs in disguise. A binary classifier outputs a probability $p \in (0, 1)$ via $p = \sigma(z) = 1/(1 + e^{-z})$ . The inverse $\sigma^{-1}(p) = \ln(p/(1-p))$ is the logit. A short algebraic identity says $\operatorname{artanh}(2p - 1) = \tfrac{1}{2}\sigma^{-1}(p)$ . That is why the derivative $1/(1 - x^2)$ appears in backward passes through classification heads: with $x = 2p - 1$ we have $1 - x^2 = 4p(1-p)$ , so it is, up to a constant factor, the gradient of the logit at the rescaled probability.

2. The arsinh-scaled loss. When a regression target spans many orders of magnitude (think satellite radiance, stock returns, ad-revenue, sensor counts), training on raw values lets a single outlier dominate every gradient step. Replacing the loss $(y - \hat y)^2$ with $(\operatorname{arsinh}(y) - \operatorname{arsinh}(\hat y))^2$ compresses large values logarithmically while leaving small values almost linear. The gradient with respect to the prediction picks up a $1/\sqrt{1+\hat y^2}$ factor (the formula from this section), which is why the optimizer's effective step size at huge values becomes inversely proportional to $|\hat y|$ . Outliers can't scream anymore.

3. The artanh-saturation trap. A tanh activation outputs values in $(-1, 1)$ . The backward pass through the loss often inverts the activation conceptually (think contrastive losses operating on cosine similarities). The gradient of $\operatorname{artanh}$ is $1/(1 - x^2)$ , which is $100$ at $x = 0.995$ and $10000$ at $x = 0.99995$ . If a network saturates toward $\pm 1$ you get an exploding gradient through any artanh in the pipeline. Clamping inputs to $[-0.9999, 0.9999]$ before artanh is a standard production fix.

4. Rapidity in physics-aware models. Special relativity replaces velocities with rapidities $\phi = \operatorname{artanh}(v/c)$ because rapidity adds linearly under boosts while velocity does not. Recent ML papers on relativistic dynamics use this trick to keep loss landscapes flat across speeds — every gradient ends up with our $1/(1 - x^2)$ attached.
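Item 2 can be made concrete with a small gradient comparison. This is a sketch under stated assumptions: the target `y` and prediction `y_hat` are made-up numbers chosen to mimic one outlier, and `asinh_loss`, `asinh_loss_grad`, and `mse_grad` are illustrative names, not functions from any library.

```python
import math

def asinh_loss(y: float, y_hat: float) -> float:
    return (math.asinh(y) - math.asinh(y_hat)) ** 2

def asinh_loss_grad(y: float, y_hat: float) -> float:
    # Chain rule: d/dy_hat arsinh(y_hat) = 1 / sqrt(1 + y_hat^2)
    return -2.0 * (math.asinh(y) - math.asinh(y_hat)) / math.sqrt(1.0 + y_hat * y_hat)

def mse_grad(y: float, y_hat: float) -> float:
    return -2.0 * (y - y_hat)

y, y_hat = 1.0e6, 1.0e3    # hypothetical outlier target and current prediction
h = 1e-3
numeric = (asinh_loss(y, y_hat + h) - asinh_loss(y, y_hat - h)) / (2.0 * h)

print(f"asinh-loss gradient: analytic={asinh_loss_grad(y, y_hat):.6e}  numeric={numeric:.6e}")
print(f"squared-error gradient: {mse_grad(y, y_hat):.6e}")
```

The squared-error gradient is on the order of the raw residual, while the arsinh-scaled gradient is many orders of magnitude smaller, which is the compression the paragraph describes.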

Practical takeaway: whenever an inverse hyperbolic appears inside a loss, look at the singular point of its derivative first. Catastrophic training runs usually trace back to inputs drifting toward those singularities.
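Items 1 and 3 are both statements about artanh, so one sketch covers them: it checks the identity $\operatorname{artanh}(2p-1) = \tfrac12\operatorname{logit}(p)$ and then shows how clamping caps the slope. The helper `clamped_artanh` is hypothetical, and its default limit of 0.9999 is simply the value quoted in item 3.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def d_artanh(x: float) -> float:
    return 1.0 / (1.0 - x * x)

def clamped_artanh(u: float, limit: float = 0.9999) -> float:
    """artanh after clamping u into [-limit, limit] (illustrative helper)."""
    return math.atanh(max(-limit, min(limit, u)))

# Item 1: artanh(2p - 1) equals half the logit of p.
identity_gap = max(abs(math.atanh(2.0 * p - 1.0) - 0.5 * logit(p))
                   for p in (0.01, 0.25, 0.5, 0.9, 0.999))
print(f"largest |artanh(2p-1) - logit(p)/2| = {identity_gap:.1e}")

# Item 3: slope growth near the pole, and the cap that clamping imposes.
for x in (0.9, 0.995, 0.99995):
    print(f"x={x}  artanh'(x)={d_artanh(x):.2f}")

slope_cap = d_artanh(0.9999)      # largest slope reachable after clamping
saturated = clamped_artanh(1.0)   # finite, whereas artanh(1) itself is undefined
print(f"slope cap={slope_cap:.2f}  clamped_artanh(1.0)={saturated:.4f}")
```

A tighter limit lowers the slope cap but distorts more of the input range, so the limit is a tuning choice rather than a universal constant.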

Common Pitfalls

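Forgetting the domain. $\operatorname{arcosh}(x)$ requires $x \ge 1$ ; $\operatorname{artanh}(x)$ requires $|x| < 1$ . The derivative formulas are valid only on the interiors of those domains ( $x > 1$ and $|x| < 1$ ). Outside them the function is undefined over the reals (NaN in torch, a ValueError from Python's math module), even though the expression $1/(1-x^2)$ still evaluates to a finite number.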
Sign of the square root. When you compute $\cosh(y)$ from $\cosh^2(y) = 1 + x^2$ you must pick the positive root, because $\cosh \ge 1$ . The same is true for the principal-branch choices in $\operatorname{arcosh}$ .
Confusing artanh with arcoth. Their derivatives are identical expressions $1/(1-x^2)$ , but their domains are disjoint: $|x|<1$ for artanh, $|x|>1$ for arcoth. Pick the wrong one and your answer is in the wrong region of the real line.
Float32 near singularities. Both $\operatorname{artanh}'$ and $\operatorname{arcosh}'$ have gradient blow-ups at the edge of their domain. In float32 you will see noisy or infinite gradients well before you reach the mathematical boundary. Use float64 for diagnostic work and clamp inputs before the function call in production.
Notation drift. Some books write $\sinh^{-1}$ , $\cosh^{-1}$ ; we prefer the $\operatorname{arsinh}$ , $\operatorname{arcosh}$ form to avoid confusion with the reciprocal $1/\sinh$ . Both mean the same thing.
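The first and third pitfalls combine into one trap: the expression $1/(1-x^2)$ happily returns a number outside $(-1, 1)$ , where artanh does not exist. A minimal sketch of a guarded derivative follows; `d_artanh_checked` is a hypothetical helper, not a library function.

```python
import math

def d_artanh_checked(x: float) -> float:
    """1 / (1 - x^2), but only where artanh is actually differentiable."""
    if not -1.0 < x < 1.0:
        raise ValueError("artanh' is only valid for |x| < 1")
    return 1.0 / (1.0 - x * x)

x = 2.0
unchecked = 1.0 / (1.0 - x * x)   # finite (-1/3): this is arcoth'(2), not artanh'(2)

try:
    math.atanh(x)
    function_defined = True
except ValueError:                # math.atanh rejects inputs outside (-1, 1)
    function_defined = False

try:
    d_artanh_checked(x)
    guard_raised = False
except ValueError:
    guard_raised = True

print(f"unchecked formula at x=2: {unchecked:.4f}")
print(f"artanh(2) defined: {function_defined}, guard raised: {guard_raised}")
```

Raising early mirrors the design of the arcosh and artanh functions in the verification script, which prefer an error to a silently wrong value.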

The single biggest mistake students make is memorizing the formulas and forgetting the derivation. When you see $1/\sqrt{1+x^2}$ in a problem, you want your brain to flash “that is the slope of arsinh, because $\cosh = \sqrt{1+\sinh^2}$ .” Two-hop memory beats one-hop memorization every time.

Summary

Every inverse hyperbolic derivative is one step away from the identity it lives on:

Identity	Inverse derivative it produces
cosh² − sinh² = 1	(arsinh)' = 1/√(1+x²), (arcosh)' = 1/√(x²−1)
1 − tanh² = sech²	(artanh)' = 1/(1−x²), (arcoth)' = 1/(1−x²)
csch = 1/sinh, sech = 1/cosh	(arcsch)' = −1/(|x|√(1+x²)), (arsech)' = −1/(x√(1−x²))

Three identities. Six formulas. One mechanical procedure (the inverse function theorem) that converts one to the other. Once the picture — graphs reflected across $y = x$ , slopes reciprocating — is in your head, the formulas stop being formulas and become consequences of geometry.

One sentence: the derivative of $f^{-1}$ at $y$ is the reciprocal of the derivative of $f$ at the matching $x$ , and for hyperbolic functions the re-expression in terms of $y$ always produces a square root (via $\cosh^2 - \sinh^2 = 1$ ) or a quadratic (via $1 - \tanh^2 = \operatorname{sech}^2$ ).
