Chapter 5

Derivatives of Inverse Hyperbolic Functions

Derivatives of Transcendental Functions

Learning Objectives

By the end of this section you will be able to:

  1. Derive every inverse hyperbolic derivative from one tool, the inverse function theorem $(f^{-1})'(y) = 1 / f'(f^{-1}(y))$, applied to $\sinh$, $\cosh$, and $\tanh$.
  2. Recognize why each formula has the shape it does: a square root in the denominator for $\operatorname{arsinh}$ and $\operatorname{arcosh}$ (they come from $\cosh^2 - \sinh^2 = 1$), a quadratic for $\operatorname{artanh}$ (it comes from $1 - \tanh^2$).
  3. Cross-check the formulas by differentiating the logarithmic forms, such as $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$, directly with the chain rule.
  4. Apply the chain rule to compose these derivatives with other functions, for example $\frac{d}{dx}\operatorname{artanh}(g(x)) = \frac{g'(x)}{1 - g(x)^2}$.
  5. Verify any answer numerically (Python central difference) and by automatic differentiation (PyTorch autograd).
  6. Connect $\operatorname{artanh}$ to the logit transform used in classification heads, and $\operatorname{arsinh}$ to the asinh-scaled loss used in regression on heavy-tailed data.

The Question Behind the Formulas

In the last section you learned that $\sinh$, $\cosh$, $\tanh$ have unusually tidy derivatives: $\sinh' = \cosh$, $\cosh' = \sinh$, $\tanh' = 1 - \tanh^2$. The pair $(\sinh, \cosh)$ behaves like a hyperbolic version of $(\sin, \cos)$: same algebraic rhythm, no minus sign.

But just like the trig functions, hyperbolic functions are sometimes given to us backwards. A hanging cable has the shape $y = a \cosh(x/a)$. If you measure the height $y$ and want to recover the horizontal position $x$, you need $x = a\,\operatorname{arcosh}(y/a)$. A neural network outputs a number $u \in (-1, 1)$ through a $\tanh$ head. If you want the pre-activation that produced it, you need $\operatorname{artanh}(u)$. Special relativity gives a rapidity $\phi = \operatorname{artanh}(v/c)$. Statistics gives Fisher's $z = \operatorname{artanh}(r)$ transform on correlation coefficients.

The single question of this section: given that $\sinh$, $\cosh$, $\tanh$ have such clean derivatives, what are the derivatives of their inverses? And why do those derivatives have square roots and quadratics glued to them?

The answer is one mechanical idea, the inverse function theorem, applied three times. Once you see the picture (curves reflecting across $y = x$, slopes reciprocating), the formulas are inevitable.


Intuition: Inverses Mirror, Slopes Reciprocate

Inverting a function geometrically means reflecting its graph across the line $y = x$. Every point $(a, b)$ on $f$ becomes the point $(b, a)$ on $f^{-1}$. The two graphs are mirror images.

Now follow what happens to a tangent. Pick a point on $f$ where the tangent rises with slope $m = f'(a)$: “up $m$ for every right 1”. Reflect that little arrow across $y = x$. Up and right swap. The reflected arrow is “right $m$ for every up 1”, which is exactly the same as “up $1/m$ for every right 1”.

One sentence to keep: reflection across $y = x$ swaps rise and run, which is exactly the operation that inverts a slope. So the derivative of the inverse at a point is the reciprocal of the original derivative at the matching point.

Algebraically, if $y = f(x)$ and $x = f^{-1}(y)$, then differentiating the first equation with respect to $y$, viewing $x$ as a function of $y$ and applying the chain rule, gives $1 = f'(x)\,\frac{dx}{dy}$. Solving for $\frac{dx}{dy}$ (wherever $f'(x) \ne 0$):

$$\frac{dx}{dy} \;=\; \frac{1}{\dfrac{dy}{dx}} \;=\; \frac{1}{f'(x)} \;=\; \frac{1}{f'(f^{-1}(y))}.$$

That is the entire toolkit. The rest of this section is just plugging in three different $f$s.

Why a section just for this: the formula $(f^{-1})'(y) = 1/f'(f^{-1}(y))$ looks simple, but every time you apply it you have to express $f'(f^{-1}(y))$ back in terms of $y$. For hyperbolic functions that re-expression always produces a square root or a quadratic, and that is where the famous shapes $1/\sqrt{1+x^2}$ and $1/(1-x^2)$ come from.

Mirror Across $y = x$: Live Viewer

Drag the slider. Two things to watch:

  • The blue point on $f$ and the purple point on $f^{-1}$ are exact mirror images across the dashed line: they have swapped coordinates.
  • The green tangent (on $f$) and the orange tangent (on $f^{-1}$) are mirror images too. Their slopes multiply to $1$: that is the inverse function theorem, drawn.

[Interactive figure: Inverse Function Reflection: Slopes Are Reciprocals. The curves $f$ and $f^{-1}$ are mirror images across the dashed line $y = x$, with matching points $(x, f(x))$ and $(f(x), x)$. Sample readout for $f = \sinh$ at $x = 1.000$: $f(x) = 1.1752$, $f'(x) = 1.5431$, $(f^{-1})'(f(x)) = 0.6481$, product $= 1.0000$.]

Reciprocal check: $f'(x) \cdot (f^{-1})'(f(x))$ should always be 1. That is the inverse function theorem, and it is the single fact that turns every hyperbolic derivative into an inverse-hyperbolic derivative. $\operatorname{arsinh}$ undoes $\sinh$ on all of $\mathbb{R}$.
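
The reciprocal check can also be reproduced without the viewer. The following is a minimal sketch, assuming only Python's standard `math` module; the helper name `central_diff` is our own illustration, not part of the lesson's files. It estimates both slopes by finite differences, so it does not rely on any derivative formula.

```python
import math

def central_diff(f, x, h=1e-5):
    """Symmetric difference quotient (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 1.0
y = math.sinh(x)                         # matching point on the mirrored curve
slope_f = central_diff(math.sinh, x)     # slope of sinh at x
slope_inv = central_diff(math.asinh, y)  # slope of arsinh at y = sinh(x)
product = slope_f * slope_inv

print(f"f(x)          = {y:.4f}")
print(f"f'(x)         = {slope_f:.4f}")
print(f"(f^-1)'(f(x)) = {slope_inv:.4f}")
print(f"product       = {product:.4f}")
```

At $x = 1$ the printed values should agree with the viewer readout above (1.1752, 1.5431, 0.6481, 1.0000) to the displayed precision.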

The Machine: Inverse Function Theorem

Concretely, here is the three-step recipe we will run three times.

  1. Set $y = f(x)$, so $x = f^{-1}(y)$.
  2. Differentiate $y = f(x)$ with respect to $y$ on both sides: $1 = f'(x)\,\frac{dx}{dy}$, so $\frac{dx}{dy} = \frac{1}{f'(x)}$.
  3. Re-express $f'(x)$ in terms of $y$ using the relation $y = f(x)$. This is the step that produces the square root.
Mnemonic: for inverses of hyperbolics, step 3 always uses one of two identities: $\cosh^2 - \sinh^2 = 1$ (for $\operatorname{arsinh}$ and $\operatorname{arcosh}$) or $1 - \tanh^2 = \operatorname{sech}^2$ (for $\operatorname{artanh}$). Burn those two identities into memory and the three formulas fall out by themselves.
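
The recipe can be run numerically before any algebra is done. This is a sketch under stated assumptions: `inverse_derivative` and `sech2` are hypothetical helper names, and the inverses come from Python's `math` module. Steps 1 and 2 are carried out by the computer; step 3 (the closed form in terms of $y$) is only used as a comparison column.

```python
import math

def inverse_derivative(f_prime, f_inverse, y):
    """(f^{-1})'(y) = 1 / f'(f^{-1}(y)), evaluated numerically."""
    x = f_inverse(y)         # step 1: find the matching x
    return 1.0 / f_prime(x)  # step 2: reciprocate the slope of f at x

def sech2(t):
    """tanh'(t) = sech^2(t) = 1 - tanh^2(t)."""
    return 1.0 / math.cosh(t) ** 2

# (name, f', f^{-1}, test point y, closed form expected from step 3)
cases = [
    ("arsinh", math.cosh, math.asinh, 2.0, lambda v: 1.0 / math.sqrt(1.0 + v * v)),
    ("arcosh", math.sinh, math.acosh, 2.0, lambda v: 1.0 / math.sqrt(v * v - 1.0)),
    ("artanh", sech2, math.atanh, 0.5, lambda v: 1.0 / (1.0 - v * v)),
]

results = {}
for name, f_prime, f_inverse, y, closed_form in cases:
    recipe = inverse_derivative(f_prime, f_inverse, y)
    results[name] = (recipe, closed_form(y))
    print(f"{name}  y={y}  recipe={recipe:.12f}  closed form={closed_form(y):.12f}")
```

The two columns should agree to floating-point precision, which is the numerical counterpart of step 3.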

Derivation 1: $\dfrac{d}{dx}\operatorname{arsinh}(x)$

Let $y = \operatorname{arsinh}(x)$, so $x = \sinh(y)$. We want $dy/dx$.

Step 1. Differentiate the relation $x = \sinh(y)$ with respect to $x$, treating $y$ as a function of $x$:

$$1 \;=\; \cosh(y)\cdot\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{\cosh(y)}.$$

Step 2. Re-express $\cosh(y)$ in terms of $x$. Use the fundamental identity $\cosh^2(y) - \sinh^2(y) = 1$:

$$\cosh^2(y) \;=\; 1 + \sinh^2(y) \;=\; 1 + x^2 \quad\Longrightarrow\quad \cosh(y) \;=\; \sqrt{1+x^2}.$$

We take the positive square root because $\cosh \ge 1 > 0$ everywhere.

Step 3. Plug back in:

$$\boxed{\;\frac{d}{dx}\operatorname{arsinh}(x) \;=\; \frac{1}{\sqrt{1+x^2}}\;}$$

The square root in the denominator is the geometric signature of the identity $\cosh^2 - \sinh^2 = 1$. Anytime you see $1/\sqrt{1+x^2}$ in a calculation, you are looking at an arsinh in disguise.

Sanity check at $x = 0$: the formula gives $1/\sqrt{1+0} = 1$. And the original function: near zero $\sinh(y) \approx y$, so $\operatorname{arsinh}(x) \approx x$, which has slope 1. Two independent routes, same answer.
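
Step 2 is the only non-mechanical step, so it is worth a quick numerical look. The sketch below (standard library only; the variable names are ours) checks that $\cosh(\operatorname{arsinh}(x))$ equals $\sqrt{1+x^2}$ for negative as well as positive $x$, which is the reason the positive root is the right choice on all of $\mathbb{R}$.

```python
import math

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_gap = 0.0
for x in xs:
    y = math.asinh(x)               # y = arsinh(x), so sinh(y) = x
    lhs = math.cosh(y)              # cosh(y), computed directly
    rhs = math.sqrt(1.0 + x * x)    # positive root of 1 + x^2
    max_gap = max(max_gap, abs(lhs - rhs))
    print(f"x={x:5.1f}  cosh(arsinh(x))={lhs:.12f}  sqrt(1+x^2)={rhs:.12f}")

print(f"largest gap: {max_gap:.2e}")
```

The largest gap should be at the level of floating-point rounding.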

Derivation 2: $\dfrac{d}{dx}\operatorname{arcosh}(x)$

Let $y = \operatorname{arcosh}(x)$, with $x \ge 1$ and $y \ge 0$ (the principal branch; see the pitfall section).

Step 1. From $x = \cosh(y)$, differentiate:

$$1 \;=\; \sinh(y)\,\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{\sinh(y)}.$$

Step 2. Re-express $\sinh(y)$ in terms of $x$:

$$\sinh^2(y) \;=\; \cosh^2(y) - 1 \;=\; x^2 - 1 \quad\Longrightarrow\quad \sinh(y) \;=\; \sqrt{x^2-1}.$$

The positive sign comes from $y \ge 0 \Rightarrow \sinh(y) \ge 0$.

Step 3. Plug back in:

$$\boxed{\;\frac{d}{dx}\operatorname{arcosh}(x) \;=\; \frac{1}{\sqrt{x^2-1}}, \quad x > 1.\;}$$

As $x \to 1^+$, the slope blows up. Geometrically this is the moment $\cosh$'s tangent at the bottom of its U-shape passes through horizontal (slope 0). Inverting it means the tangent of $\operatorname{arcosh}$ becomes vertical there. Reciprocal of zero is infinity.
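
To watch the blow-up happen, the following sketch (standard library only; `d_arcosh` is the boxed formula, the other names are ours) evaluates the slope at points approaching $1$ from the right and compares it with the reciprocal $1/\sinh(\operatorname{arcosh}(x))$ from step 1.

```python
import math

def d_arcosh(x: float) -> float:
    """Boxed formula: 1 / sqrt(x^2 - 1), valid for x > 1."""
    return 1.0 / math.sqrt(x * x - 1.0)

rows = []
for k in range(1, 7):
    x = 1.0 + 10.0 ** -k                         # approach 1 from the right
    closed_form = d_arcosh(x)
    reciprocal = 1.0 / math.sinh(math.acosh(x))  # 1 / f'(f^{-1}(x)) with f = cosh
    rows.append((x, closed_form, reciprocal))
    print(f"x = 1 + 1e-{k}   formula = {closed_form:12.4f}   1/sinh(arcosh x) = {reciprocal:12.4f}")
```

Each step closer to $1$ should multiply the slope by roughly $\sqrt{10}$, since $x^2 - 1 \approx 2(x - 1)$ near the boundary.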


Derivation 3: $\dfrac{d}{dx}\operatorname{artanh}(x)$

Let $y = \operatorname{artanh}(x)$, with $-1 < x < 1$.

Step 1. From $x = \tanh(y)$, differentiate using $\tanh'(y) = 1 - \tanh^2(y)$ (the derivative you proved in the previous section):

$$1 \;=\; \bigl(1 - \tanh^2(y)\bigr)\,\frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} \;=\; \frac{1}{1 - \tanh^2(y)}.$$

Step 2. Substitute $\tanh(y) = x$ directly. No square roots this time, because the identity $1 - \tanh^2 = \operatorname{sech}^2$ is already a clean quadratic in $\tanh$:

$$\boxed{\;\frac{d}{dx}\operatorname{artanh}(x) \;=\; \frac{1}{1-x^2}, \quad |x| < 1.\;}$$

At the endpoints $x = \pm 1$ the formula produces $1/0$; the slope is genuinely infinite there because $\operatorname{artanh}$ has vertical asymptotes at $\pm 1$. The infinite slope is not a bug: it is the price of squeezing all of $\mathbb{R}$ into the open interval $(-1, 1)$.

One quirk worth knowing. The formula for $\operatorname{arcoth}$, the inverse of $\coth$ on the domain $|x| > 1$, is exactly the same algebraic expression $1/(1-x^2)$. The two functions $\operatorname{artanh}$ and $\operatorname{arcoth}$ simply live on different pieces of the real line, but they share a derivative formula. A surprising and useful symmetry.
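
The shared formula can be checked on both pieces of the real line. This sketch assumes the standard logarithmic form $\operatorname{arcoth}(x) = \tfrac12\ln\frac{x+1}{x-1}$ for $|x| > 1$ (Python's `math` module has no `acoth`, so the helper `arcoth` is our own), and compares finite-difference slopes with $1/(1-x^2)$.

```python
import math

def arcoth(x: float) -> float:
    """Assumed closed form: 0.5 * ln((x + 1) / (x - 1)), valid for |x| > 1."""
    if abs(x) <= 1.0:
        raise ValueError("arcoth requires |x| > 1")
    return 0.5 * math.log((x + 1.0) / (x - 1.0))

def shared_formula(x: float) -> float:
    """1 / (1 - x^2): artanh' on |x| < 1 and arcoth' on |x| > 1."""
    return 1.0 / (1.0 - x * x)

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

probes = [("artanh", math.atanh, 0.5), ("arcoth", arcoth, 2.0), ("arcoth", arcoth, -3.0)]
gaps = []
for name, f, x in probes:
    numeric = central_diff(f, x)
    gaps.append(abs(numeric - shared_formula(x)))
    print(f"{name}  x={x:5.1f}  numeric={numeric:10.6f}  1/(1-x^2)={shared_formula(x):10.6f}")
```

Note the sign: on $|x| > 1$ the expression $1/(1-x^2)$ is negative, so $\operatorname{arcoth}$ is decreasing on each piece of its domain even though the formula looks identical.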

The Remaining Three: $\operatorname{arcsch}$, $\operatorname{arsech}$, $\operatorname{arcoth}$

The same machine handles the reciprocal trio. The derivations are short. For example, since $\operatorname{csch} = 1/\sinh$, we have $\operatorname{arcsch}(x) = \operatorname{arsinh}(1/x)$, and the chain rule plus our arsinh result gives $\frac{d}{dx}\operatorname{arcsch}(x) = -\frac{1}{|x|\sqrt{1+x^2}}$. The answers are:

| Function | Domain | Derivative |
| --- | --- | --- |
| $\operatorname{arcsch}(x)$ | $x \ne 0$ | $-\dfrac{1}{\lvert x\rvert\sqrt{1+x^2}}$ |
| $\operatorname{arsech}(x)$ | $0 < x \le 1$ | $-\dfrac{1}{x\sqrt{1-x^2}}$ |
| $\operatorname{arcoth}(x)$ | $\lvert x\rvert > 1$ | $\dfrac{1}{1-x^2}$ |

Where the minus signs come from: $\operatorname{csch}$ and $\operatorname{sech}$ are decreasing on their principal branches (for $x > 0$, each is the reciprocal of a positive increasing function). Inverting a decreasing function gives a decreasing function, because an inverse function's slope inherits the sign of the original. So $\operatorname{arcsch}'$ and $\operatorname{arsech}'$ both come out negative.
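
A quick numerical check of the two signed formulas follows. It is a sketch that assumes the standard identities $\operatorname{arcsch}(x) = \operatorname{arsinh}(1/x)$ and $\operatorname{arsech}(x) = \operatorname{arcosh}(1/x)$; all function names are our own, since the `math` module does not provide these inverses directly.

```python
import math

def arcsch(x: float) -> float:
    return math.asinh(1.0 / x)      # valid for x != 0

def arsech(x: float) -> float:
    return math.acosh(1.0 / x)      # valid for 0 < x <= 1

def d_arcsch(x: float) -> float:
    return -1.0 / (abs(x) * math.sqrt(1.0 + x * x))

def d_arsech(x: float) -> float:
    return -1.0 / (x * math.sqrt(1.0 - x * x))

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

checks = [
    ("arcsch", arcsch, d_arcsch, -2.0),   # negative side: |x| keeps the slope negative
    ("arcsch", arcsch, d_arcsch, 0.5),
    ("arsech", arsech, d_arsech, 0.5),
]
gaps = []
for name, f, fprime, x in checks:
    analytic, numeric = fprime(x), central_diff(f, x)
    gaps.append(abs(analytic - numeric))
    print(f"{name}  x={x:5.1f}  analytic={analytic:10.6f}  numeric={numeric:10.6f}")
```

Both slopes should print as negative at every test point, including $x = -2$, where dropping the absolute value would flip the sign.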

Master Reference Table

Print this. Tape it to your wall. Every inverse-hyperbolic problem reduces to one of these rows, perhaps composed with the chain rule.

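| $f(x)$ | $f'(x)$ | Domain of $f'$ | Notes |
| --- | --- | --- | --- |
| $\operatorname{arsinh}(x)$ | $\dfrac{1}{\sqrt{1+x^2}}$ | all of $\mathbb{R}$ | always positive; the derivative is an even function of $x$ |
| $\operatorname{arcosh}(x)$ | $\dfrac{1}{\sqrt{x^2-1}}$ | $x > 1$ | vertical tangent at $x = 1$ |
| $\operatorname{artanh}(x)$ | $\dfrac{1}{1-x^2}$ | $\lvert x\rvert < 1$ | blows up at $x = \pm 1$ |
| $\operatorname{arcsch}(x)$ | $-\dfrac{1}{\lvert x\rvert\sqrt{1+x^2}}$ | $x \ne 0$ | negative; the $\lvert x\rvert$ keeps the sign correct on both sides of zero |
| $\operatorname{arsech}(x)$ | $-\dfrac{1}{x\sqrt{1-x^2}}$ | $0 < x < 1$ | negative; follows from $\operatorname{arsech}(x) = \operatorname{arcosh}(1/x)$ |
| $\operatorname{arcoth}(x)$ | $\dfrac{1}{1-x^2}$ | $\lvert x\rvert > 1$ | same expression as $\operatorname{artanh}'$ but different domain |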

Chain rule extension. For any of these $f$, the derivative of $f(g(x))$ is $f'(g(x))\cdot g'(x)$. So $\frac{d}{dx}\operatorname{arsinh}(2x^2)$ is just $\dfrac{4x}{\sqrt{1 + 4x^4}}$. No new rules, ever.
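
The chain-rule example can be confirmed numerically. The sketch below uses only the standard library; `composite` and `composite_prime` are illustrative names for $\operatorname{arsinh}(2x^2)$ and the derivative quoted above.

```python
import math

def composite(x: float) -> float:
    return math.asinh(2.0 * x * x)                     # arsinh(g(x)) with g(x) = 2x^2

def composite_prime(x: float) -> float:
    return 4.0 * x / math.sqrt(1.0 + 4.0 * x ** 4)     # g'(x) / sqrt(1 + g(x)^2)

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

gaps = []
for x in (-1.3, 0.0, 0.7, 2.0):
    analytic, numeric = composite_prime(x), central_diff(composite, x)
    gaps.append(abs(analytic - numeric))
    print(f"x={x:5.1f}  analytic={analytic:10.6f}  numeric={numeric:10.6f}")
```

The slope is zero at $x = 0$ and changes sign with $x$, as expected for an even function such as $\operatorname{arsinh}(2x^2)$.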

Sanity Check: The Logarithm Route

Each inverse hyperbolic function also has a closed-form logarithmic expression. Differentiating those expressions directly with the chain rule gives a second route to the same answers — a perfect cross-check for the inverse-function-theorem route.

Where does $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$ come from? Solve $x = \sinh(y) = (e^y - e^{-y})/2$ for $y$. Multiply by $2e^y$: $e^{2y} - 2xe^y - 1 = 0$. This is a quadratic in $u = e^y$ with positive root $u = x + \sqrt{x^2+1}$. Take $\ln$ of both sides and you get the formula.

Now differentiate directly. Let $u(x) = x + \sqrt{x^2+1}$. Then:

$$u'(x) \;=\; 1 + \frac{x}{\sqrt{x^2+1}} \;=\; \frac{\sqrt{x^2+1} + x}{\sqrt{x^2+1}} \;=\; \frac{u(x)}{\sqrt{x^2+1}}.$$

By the chain rule,

$$\frac{d}{dx}\operatorname{arsinh}(x) \;=\; \frac{u'(x)}{u(x)} \;=\; \frac{u(x)/\sqrt{x^2+1}}{u(x)} \;=\; \frac{1}{\sqrt{x^2+1}}.$$

Same answer. The little cancellation $u(x)/u(x) = 1$ is where the algebra rewards us. The same trick works for $\operatorname{arcosh}$ and $\operatorname{artanh}$; try it as an exercise.
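
If you want to check your work on that exercise, here is one way the two computations can go. It is a sketch that starts from the logarithmic forms $\operatorname{artanh}(x) = \tfrac12\ln\frac{1+x}{1-x}$ and $\operatorname{arcosh}(x) = \ln(x + \sqrt{x^2-1})$, the same forms used in this section's Python file; the helper symbol $v$ is our own notation.

```latex
\begin{aligned}
\frac{d}{dx}\operatorname{artanh}(x)
  &= \frac{d}{dx}\,\tfrac12\bigl[\ln(1+x) - \ln(1-x)\bigr]
   = \tfrac12\left[\frac{1}{1+x} + \frac{1}{1-x}\right]
   = \frac{1}{1-x^2}, \qquad |x| < 1, \\[6pt]
v(x) &= x + \sqrt{x^2-1}, \qquad
v'(x) = 1 + \frac{x}{\sqrt{x^2-1}} = \frac{v(x)}{\sqrt{x^2-1}}, \\[6pt]
\frac{d}{dx}\operatorname{arcosh}(x)
  &= \frac{v'(x)}{v(x)} = \frac{1}{\sqrt{x^2-1}}, \qquad x > 1.
\end{aligned}
```

The arcosh computation repeats the $u'/u$ cancellation from the arsinh case; the artanh computation needs no cancellation at all, only a common denominator.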

Why two routes? One route uses the structure of inverse functions abstractly (slopes reciprocate). The other uses the explicit closed form. Both must give the same answer, because there is only one derivative. When you build intuition along multiple paths, the formulas stop being memorized — they become deducible from first principles.

Worked Example: Bead on a Hanging Chain

A heavy chain hangs in the shape $y(x) = a\cosh(x/a)$ (the catenary; see ch07/s01 if you want the physics derivation). We measure the height of a bead on the chain and want to know how its horizontal coordinate moves when the height changes. That is $dx/dy$, and the answer is an application of the arcosh derivative.

Try the problem yourself first. Then expand the panel below to check.

Worked solution

Solve for $x$ in terms of $y$. From $y = a\cosh(x/a)$ we get $x = a\,\operatorname{arcosh}(y/a)$, choosing the positive branch.

Differentiate with respect to $y$. Using the master table row for $\operatorname{arcosh}$ and the chain rule with inner function $g(y) = y/a$:

$$\frac{dx}{dy} \;=\; a \cdot \frac{1}{\sqrt{(y/a)^2 - 1}} \cdot \frac{1}{a} \;=\; \frac{1}{\sqrt{(y/a)^2 - 1}}.$$

Multiply numerator and denominator by $a$ to clean up:

$$\frac{dx}{dy} \;=\; \frac{a}{\sqrt{y^2 - a^2}}.$$

Numerical sanity. Let $a = 1\,\text{m}$. At $y = 2\,\text{m}$: $dx/dy = 1/\sqrt{4-1} = 1/\sqrt{3} \approx 0.5774$. Meaning: when the bead rises 1 cm, it slides sideways $\approx 0.5774$ cm. Notice how the answer blows up as $y \to 1^+$. That is the bottom of the chain (where the tangent is horizontal), so a tiny vertical move corresponds to a huge horizontal move. Exactly the geometric story behind the singular slope of $\operatorname{arcosh}$ at $x = 1$.

Cross-check. Compute $dy/dx$ directly from $y = \cosh x$: $dy/dx = \sinh x$. At $y = 2$ we have $x = \operatorname{arcosh}(2) \approx 1.317$ and $\sinh(1.317) \approx 1.732 = \sqrt{3}$. Reciprocate: $1/\sqrt{3} \approx 0.5774$. Same answer. The inverse function theorem is doing exactly what the picture promised.
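
The same three-way agreement can be sketched in code. The function names below are our own, and only the standard library is assumed; the script compares the closed form $a/\sqrt{y^2-a^2}$, a finite-difference slope of $x(y) = a\,\operatorname{arcosh}(y/a)$, and the reciprocal of $dy/dx = \sinh(x/a)$.

```python
import math

def x_of_y(y: float, a: float) -> float:
    """Horizontal position on the positive branch of the catenary."""
    return a * math.acosh(y / a)

def dx_dy(y: float, a: float) -> float:
    """Closed form from the worked solution: a / sqrt(y^2 - a^2)."""
    return a / math.sqrt(y * y - a * a)

a, y, h = 1.0, 2.0, 1e-5
formula = dx_dy(y, a)
numeric = (x_of_y(y + h, a) - x_of_y(y - h, a)) / (2.0 * h)
reciprocal = 1.0 / math.sinh(x_of_y(y, a) / a)   # 1 / (dy/dx)

print(f"closed form          : {formula:.4f}")
print(f"central difference   : {numeric:.4f}")
print(f"1 / sinh(x/a)        : {reciprocal:.4f}")
```

All three lines should print 0.5774 for $a = 1$, $y = 2$; changing `a` and `y` (with `y > a`) exercises the general formula.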


Interactive Derivative Explorer

Pick one of the six inverse hyperbolic functions, drag the slider, and watch:

  • the green tangent line rotate on the blue curve;
  • the orange dashed curve show $f'(x)$; the dashed curve's height at $x$ equals the slope of the blue curve at the same $x$;
  • the readouts — value, slope, and tangent angle in degrees — update live.

[Interactive figure: Inverse Hyperbolic Derivative Explorer. Horizontal axis $x$ from $-4$ to $4$, vertical axis from $-2.5$ to $2.5$; legend: $f(x)$ (blue), $f'(x)$ (orange dashed), tangent at $x$ (green). Sample readout for $\operatorname{arsinh}(x) = \ln(x + \sqrt{x^2+1})$, with derivative $f'(x) = 1/\sqrt{1+x^2}$, at $x = 1.000$: $f(x) = 0.8814$, $f'(x) = 0.7071$, tangent angle $35.3^\circ$.]

Slide $\operatorname{artanh}$ close to $x = 0.95$ and watch the slope explode past $10$. That is the geometric signature of the pole at $x = 1$, and it is also why neural networks trained with $\tanh$ outputs sometimes have wild gradients near saturation: the inverse map is amplifying everything by $1/(1 - x^2)$.

Python: Numerical Verification by Central Difference

Three formulas, eight test points, one diagnostic table. We compare each hand-derived derivative against a central-difference estimate. If the algebra is right, the error column should be $\sim 10^{-10}$ on smooth regions.

inverse_hyperbolic.py: closed forms vs. central differences

Imports: pure standard library

We use math.log and math.sqrt — both ordinary scalar functions. Callable and Tuple are just type hints. No NumPy, no SciPy. The goal of this file is to make every inverse-hyperbolic formula feel like a one-line arithmetic recipe, so we keep the toolbox small.

EXECUTION STATE
math.log = natural logarithm, ln
math.sqrt = square root
arsinh(x): defined everywhere on ℝ

The closed-form identity arsinh(x) = ln(x + √(x²+1)) comes from solving sinh(y) = x for y, as derived in the logarithm-route section. Notice there is no domain check: mathematically the argument x + √(x²+1) is positive for every real x, so the log never complains. (In floating point, very large negative x can make the sum cancel to 0; math.asinh is the safer choice there.)

EXAMPLE
arsinh(0)   = ln(0 + 1)   = 0
arsinh(1)   = ln(1 + √2)  ≈ 0.88137
arsinh(2.5) = ln(2.5 + √7.25) ≈ 1.6472
EXECUTION STATE
x + sqrt(x²+1) = always > 0 — safe for log
arcosh(x): domain x ≥ 1

cosh is only one-to-one for x ≥ 0 and its image starts at cosh(0) = 1. So arcosh(x) — the inverse on the principal branch — accepts only x ≥ 1. We raise an error instead of silently returning NaN, because a downstream loss curve that quietly emits NaN is one of the most expensive bugs in ML.

EXAMPLE
arcosh(1)   = ln(1 + 0) = 0
arcosh(1.5) = ln(1.5 + √1.25) ≈ 0.9624
arcosh(3)   = ln(3 + √8)  ≈ 1.7627
EXECUTION STATE
ValueError = raised when x < 1
artanh(x): domain (−1, 1), open interval

tanh squashes ℝ into (−1, 1), so artanh only accepts inputs strictly inside the open interval. At the endpoints ±1 the value blows up to ±∞, which matches the calculus: lim x→1⁻ artanh(x) = +∞. This function appears all over ML as a rescaled logit transform: artanh(x) = ½ · logit(½(1+x)).

EXAMPLE
artanh(0)   = 0
artanh(0.5) = ½ ln(1.5/0.5) = ½ ln 3 ≈ 0.5493
artanh(0.9) = ½ ln(1.9/0.1) ≈ 1.4722
EXECUTION STATE
(1+x)/(1−x) = must be > 0; squeezing toward ±1 blows up
d_arsinh(x) = 1 / √(1 + x²)

This is the entire payoff of the inverse function theorem applied to sinh, using cosh(arsinh(x)) = √(1 + x²) from Derivation 1. The derivative is positive everywhere and decays like 1/|x| for large |x|: arsinh(x) grows like ln(2x) for large positive x, so its slope falls off like 1/x.

EXAMPLE
d_arsinh(0)   = 1
d_arsinh(1)   = 1/√2 ≈ 0.7071
d_arsinh(2.5) ≈ 0.3714
d_arcosh(x) = 1 / √(x² − 1)

Sister formula. Note the minus sign under the square root: the expression is real only for |x| > 1, and since arcosh itself is defined only for x ≥ 1, the derivative lives on x > 1. As x → 1⁺ the slope explodes to +∞, which is what makes arcosh's tangent vertical at x = 1.

EXAMPLE
d_arcosh(1.5) = 1/√1.25 ≈ 0.8944
d_arcosh(3)   = 1/√8     ≈ 0.3536
d_artanh(x) = 1 / (1 − x²)

Same expression as d/dx of ½ ln((1+x)/(1−x)) (the logarithm-route exercise asks you to check this). The slope is positive on (−1, 1), explodes at the endpoints, and is exactly 1 at x = 0. That last fact matters: it tells you artanh, and therefore tanh, is approximately the identity in the neighborhood of zero, which is why deep nets initialized near zero look like linear nets for the first few iterations.

EXAMPLE
d_artanh(0)   = 1
d_artanh(0.5) = 1/0.75 ≈ 1.3333
d_artanh(0.9) = 1/0.19 ≈ 5.2632
Central-difference helper

We will grade every analytic derivative by comparing it to a numerical estimate. (f(x+h) − f(x−h)) / (2h) is O(h²) accurate on a smooth function: its truncation error is about h²·f'''(x)/6. With h = 1e-5 that is on the order of 1e-10 or smaller for these formulas away from their singularities, comparable to the floating-point rounding noise of the difference quotient (roughly machine epsilon divided by h).

EXECUTION STATE
h = default 1e-5; smaller h adds rounding noise
check(...): the diagnostic row

For each test point we print analytic value, numerical estimate, and absolute error. If our hand-derived formulas were wrong, the err column would jump to ~1e-1 or worse. Anything in the 1e-9 .. 1e-11 range is essentially zero in IEEE-754 land.

EXAMPLE
Sample output line:
arsinh  x=  1.0  analytic=  0.707107  numeric=  0.707107  err=1.27e-11
EXECUTION STATE
analytic = value of fprime(x) we coded
numeric = central-difference estimate
err = |analytic − numeric|
Driver: eight test points across three domains

We pick points that exercise the interesting regions of each function: x = 0 for arsinh (the symmetry centre), x = 1.5 for arcosh (close to the singular boundary x = 1), x = 0.9 for artanh (close to the +1 blow-up). If the formulas survive these tests, that is strong evidence (though not a proof) that they hold throughout each domain.

LOOP TRACE · 8 iterations

| Function | x | analytic | numeric | err |
| --- | --- | --- | --- | --- |
| arsinh | 0 | 1.000000 | 1.000000 | ~ 1e-11 |
| arsinh | 1 | 0.707107 | 0.707107 | ~ 1e-11 |
| arsinh | 2.5 | 0.371391 | 0.371391 | ~ 1e-11 |
| arcosh | 1.5 | 0.894427 | 0.894427 | ~ 1e-9 (steeper region) |
| arcosh | 3 | 0.353553 | 0.353553 | ~ 1e-11 |
| artanh | -0.5 | 1.333333 | 1.333333 | ~ 1e-11 |
| artanh | 0.5 | 1.333333 | 1.333333 | ~ 1e-11 |
| artanh | 0.9 | 5.263158 | 5.263158 | ~ 1e-8 (near pole) |

```python
import math
from typing import Callable, Tuple

# ---- three inverse hyperbolic functions, in closed (logarithmic) form ------

def arsinh(x: float) -> float:
    # defined for all real x
    return math.log(x + math.sqrt(x * x + 1.0))

def arcosh(x: float) -> float:
    # only defined for x >= 1
    if x < 1.0:
        raise ValueError("arcosh requires x >= 1")
    return math.log(x + math.sqrt(x * x - 1.0))

def artanh(x: float) -> float:
    # only defined for -1 < x < 1
    if not -1.0 < x < 1.0:
        raise ValueError("artanh requires -1 < x < 1")
    return 0.5 * math.log((1.0 + x) / (1.0 - x))


# ---- their derivatives, in closed form (from the inverse function theorem) -

def d_arsinh(x: float) -> float:
    return 1.0 / math.sqrt(1.0 + x * x)

def d_arcosh(x: float) -> float:
    # valid for x > 1 only
    return 1.0 / math.sqrt(x * x - 1.0)

def d_artanh(x: float) -> float:
    # valid for -1 < x < 1 only
    return 1.0 / (1.0 - x * x)


# ---- numerical derivative via symmetric difference -----------------------

def numerical_derivative(f: Callable[[float], float], x: float,
                         h: float = 1e-5) -> float:
    """Central difference: O(h^2) accurate for smooth f."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def check(name: str, x: float, f: Callable[[float], float],
          fprime: Callable[[float], float]) -> Tuple[float, float, float]:
    """Print one diagnostic row and return (analytic, numeric, abs error)."""
    analytic = fprime(x)
    numeric = numerical_derivative(f, x)
    err = abs(analytic - numeric)
    print(f"{name}  x={x:>5}  analytic={analytic:10.6f}  "
          f"numeric={numeric:10.6f}  err={err:.2e}")
    return analytic, numeric, err


if __name__ == "__main__":
    # Pick points safely inside each function's domain.
    check("arsinh", 0.0, arsinh, d_arsinh)
    check("arsinh", 1.0, arsinh, d_arsinh)
    check("arsinh", 2.5, arsinh, d_arsinh)

    check("arcosh", 1.5, arcosh, d_arcosh)
    check("arcosh", 3.0, arcosh, d_arcosh)

    check("artanh", -0.5, artanh, d_artanh)
    check("artanh",  0.5, artanh, d_artanh)
    check("artanh",  0.9, artanh, d_artanh)
```

Run the file. Every entry in the err column should land between roughly 1e-8 and 1e-12. That is the signature of correct algebra plus an O(h²) numerical scheme. Where the err climbs a little (arcosh near 1, artanh near 0.9), the cause is not bad algebra: central difference is still O(h²) there, but its error constant is proportional to the third derivative of f, which grows quickly near a singularity.
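
To see how the error depends on the step size, here is a small sweep over $h$ at the hardest test point, $\operatorname{artanh}$ at $x = 0.9$. It is a sketch using only the standard library (`math.atanh` stands in for the hand-written `artanh`, and the variable names are ours). The sweep starts at $h = 10^{-2}$ because $h = 10^{-1}$ would step onto the boundary $x = 1$.

```python
import math

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.9
exact = 1.0 / (1.0 - x * x)            # boxed formula for artanh'

errors = {}                             # exponent k -> absolute error with h = 10^-k
for k in range(2, 11):
    h = 10.0 ** -k
    errors[k] = abs(central_diff(math.atanh, x, h) - exact)
    print(f"h = 1e-{k:<2d}  err = {errors[k]:.2e}")
```

For the larger steps the error should drop by about a factor of 100 each time $h$ shrinks by 10, which is what $O(h^2)$ means. For very small $h$ the trend typically reverses, because rounding noise in the difference quotient grows like machine epsilon divided by $h$.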

PyTorch: Autograd Recovers Every Formula

Now the third independent route. We compute the same derivatives with `torch.autograd.grad` and compare against the analytic formulas. Internally, PyTorch's C++ backward kernels for `asinh`, `acosh`, `atanh` implement exactly the closed forms we proved. If our derivation is correct the two columns must match to $\sim 10^{-15}$ on float64.

inverse_hyperbolic_torch.py: autograd cross-check

Imports: math + torch

math gives us the closed-form formulas; torch gives us tensors and autograd. We will compute every derivative two ways and check they agree to within floating-point tolerance.

torch already has these as primitives

torch.asinh, torch.acosh, torch.atanh are built-ins. Each has a registered backward function, meaning PyTorch's autograd already knows the derivative formulas we derived above. The exercise here is *checking that what we proved by hand matches the framework's gradient table*, not teaching PyTorch new tricks.

EXAMPLE
torch.asinh(torch.tensor(1.0))   → tensor(0.8814)
torch.atanh(torch.tensor(0.5))   → tensor(0.5493)
analytic_derivative: the formulas we proved

These are exactly the three formulas this section derived: 1/√(1+x²), 1/√(x²−1), 1/(1−x²). Putting them in one function lets us compare side-by-side against autograd later.

EXECUTION STATE
1/√(1+x²) = arsinh' — finite everywhere
1/√(x²-1) = arcosh' — singular at x = 1
1/(1-x²) = artanh' — singular at x = ±1
autograd_derivative: single-point gradient

We wrap a single x in a scalar float64 tensor with requires_grad=True. float64 is intentional here: when we approach the singular boundaries (arcosh as x → 1⁺, artanh as |x| → 1⁻), float32 noise becomes visible at the 6th decimal.

EXECUTION STATE
xt = tensor with requires_grad=True
dtype = float64 → 15 reliable decimals
Forward pass: y = torch.asinh / acosh / atanh

This builds the (very short) computation graph: xt → y. There is exactly one op, so the backward pass will be a single application of the corresponding gradient formula stored inside PyTorch.

EXAMPLE
name='arsinh', x=1.0
  y = torch.asinh(xt)
  y.item() ≈ 0.8813735870195429
torch.autograd.grad(outputs=y, inputs=xt)

We ask autograd to return dy/dxt as a tuple. This is the *clean* form (does not mutate xt.grad), preferred when you want to compute a gradient mid-program without touching state.

outputs=y

We differentiate the inverse-hyperbolic value y, not anything downstream. y is a scalar tensor; outputs can be a scalar or an arbitrary tensor with an explicit grad_outputs.

inputs=xt

The leaf tensor we want a gradient with respect to. xt must have requires_grad=True or autograd refuses to track it.

create_graph=False

Second-order derivatives are not needed, so we skip building a graph-of-the-graph. If we wanted d²/dx² arsinh(x), we would flip this flag to True and call autograd.grad a second time on grad.

return grad.item()

Pull the Python float out of the 0-dim tensor. .item() also detaches; the returned float carries no grad history.

Driver: eight points across all three functions

Same points as the pure-Python file, except that arsinh is tested at −2.5 instead of 2.5 to confirm symmetry. arsinh and artanh are odd functions (f(−x) = −f(x)), so their derivatives are even functions (f'(−x) = f'(x)). The negative test points let you see this directly in the output.

LOOP TRACE · 8 iterations

| Function | x | analytic | autograd | match |
| --- | --- | --- | --- | --- |
| arsinh | 0 | 1.000000 | 1.000000 | True |
| arsinh | 1 | 0.707107 | 0.707107 | True |
| arsinh | -2.5 | 0.371391 (same as +2.5: even slope) | 0.371391 | True |
| arcosh | 1.5 | 0.894427 | 0.894427 | True |
| arcosh | 3 | 0.353553 | 0.353553 | True |
| artanh | -0.5 | 1.333333 | 1.333333 | True |
| artanh | 0.5 | 1.333333 | 1.333333 | True |
| artanh | 0.9 | 5.263158 | 5.263158 | True |

math.isclose with rel_tol = 1e-9

We use isclose instead of `==` because float arithmetic is order-dependent: autograd composes ops in a slightly different order from our by-hand closed form. 1e-9 relative tolerance is generous on float64 and tight on float32. Every test passes.

```python
import math
import torch

# ----- three inverse hyperbolic functions via torch --------------------------
# torch.asinh / acosh / atanh exist natively. We use them so autograd can do
# its job, then cross-check against our hand-derived formulas.

def f_arsinh(x: torch.Tensor) -> torch.Tensor:
    return torch.asinh(x)

def f_arcosh(x: torch.Tensor) -> torch.Tensor:
    return torch.acosh(x)

def f_artanh(x: torch.Tensor) -> torch.Tensor:
    return torch.atanh(x)


def analytic_derivative(name: str, x: float) -> float:
    """Hand-derived closed forms from this section."""
    if name == "arsinh":
        return 1.0 / math.sqrt(1.0 + x * x)
    if name == "arcosh":
        return 1.0 / math.sqrt(x * x - 1.0)
    if name == "artanh":
        return 1.0 / (1.0 - x * x)
    raise ValueError(name)


def autograd_derivative(name: str, x: float) -> float:
    """Ask PyTorch for f'(x) at a single point using torch.autograd.grad."""
    xt = torch.tensor(x, dtype=torch.float64, requires_grad=True)
    if name == "arsinh":
        y = f_arsinh(xt)
    elif name == "arcosh":
        y = f_arcosh(xt)
    elif name == "artanh":
        y = f_artanh(xt)
    else:
        raise ValueError(name)

    (grad,) = torch.autograd.grad(
        outputs=y,
        inputs=xt,
        create_graph=False,
    )
    return grad.item()


if __name__ == "__main__":
    tests = [
        ("arsinh",  0.0),
        ("arsinh",  1.0),
        ("arsinh", -2.5),
        ("arcosh",  1.5),
        ("arcosh",  3.0),
        ("artanh", -0.5),
        ("artanh",  0.5),
        ("artanh",  0.9),
    ]
    for name, x in tests:
        ana = analytic_derivative(name, x)
        ag  = autograd_derivative(name, x)
        print(f"{name:>7}  x={x:>5}  analytic={ana:10.6f}  "
              f"autograd={ag:10.6f}  match={math.isclose(ana, ag, rel_tol=1e-9)}")
```

What this script teaches you. When autograd computes the gradient of any expression that contains $\operatorname{arsinh}$, $\operatorname{arcosh}$, or $\operatorname{artanh}$, it is mechanically substituting our three boxed formulas into the chain rule. Knowing the formula by heart is identical to knowing what autograd is going to do, and that makes a NaN gradient in production far easier to debug.

Why ML Engineers Care: Logits, Tanh, and Stability

1. Logits are artanhs in disguise. A binary classifier outputs a probability $p \in (0, 1)$ via $p = \sigma(z) = 1/(1 + e^{-z})$. The inverse $\sigma^{-1}(p) = \ln(p/(1-p))$ is the logit. A short algebraic identity says $\operatorname{artanh}(2p - 1) = \tfrac{1}{2}\sigma^{-1}(p)$. That is why the derivative $1/(1 - x^2)$ appears in backward passes through classification heads: with $x = 2p - 1$, the gradient of the logit is $\frac{d}{dp}\sigma^{-1}(p) = \frac{1}{p(1-p)} = \frac{4}{1 - x^2}$, a constant multiple of the artanh derivative at the rescaled probability.

2. The arsinh-scaled loss. When a regression target spans many orders of magnitude (think satellite radiance, stock returns, ad-revenue, sensor counts), training on raw values lets a single outlier dominate every gradient step. Replacing the loss $(y - \hat y)^2$ with $(\operatorname{arsinh}(y) - \operatorname{arsinh}(\hat y))^2$ compresses large values logarithmically while leaving small values almost linear. The gradient with respect to the prediction picks up a $1/\sqrt{1+\hat y^2}$ factor, the formula from this section, which is why the optimizer's effective step size on huge predictions becomes inversely proportional to $|\hat y|$. Outliers can't scream anymore.

3. The artanh-saturation trap. A tanh activation outputs values in $(-1, 1)$. The backward pass through the loss often inverts the activation conceptually (think contrastive losses operating on cosine similarities). The gradient of $\operatorname{artanh}$ is $1/(1 - x^2)$, which is about $100$ at $x = 0.995$ and about $10000$ at $x = 0.99995$. If a network saturates toward $\pm 1$ you get an exploding gradient through any artanh in the pipeline. Clamping inputs to $[-0.9999, 0.9999]$ before artanh is a standard production fix.

4. Rapidity in physics-aware models. Special relativity replaces velocities with rapidities $\phi = \operatorname{artanh}(v/c)$ because rapidity adds linearly under boosts while velocity does not. Recent ML papers on relativistic dynamics use this trick to keep loss landscapes flat across speeds, and every gradient ends up with our $1/(1 - x^2)$ attached.
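
Two of the claims above are easy to test in a few lines: the identity $\operatorname{artanh}(2p-1) = \tfrac12\sigma^{-1}(p)$ from item 1 and the damped gradient of the arsinh-scaled loss from item 2. This is a sketch with hypothetical helper names and made-up numbers (a target of $10^6$ and a prediction of $10$), using only the standard library.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

# Item 1: artanh(2p - 1) should equal half the logit of p.
identity_gaps = [abs(math.atanh(2.0 * p - 1.0) - 0.5 * logit(p)) for p in (0.1, 0.5, 0.9)]
print("largest identity gap:", max(identity_gaps))

# Item 2: gradients with respect to the prediction y_hat.
def asinh_loss(y: float, y_hat: float) -> float:
    return (math.asinh(y) - math.asinh(y_hat)) ** 2

def asinh_loss_grad(y: float, y_hat: float) -> float:
    # chain rule: d/dy_hat asinh(y_hat) = 1 / sqrt(1 + y_hat^2)
    return -2.0 * (math.asinh(y) - math.asinh(y_hat)) / math.sqrt(1.0 + y_hat * y_hat)

def squared_loss_grad(y: float, y_hat: float) -> float:
    return -2.0 * (y - y_hat)

y, y_hat, h = 1.0e6, 10.0, 1e-5        # illustrative outlier target and prediction
analytic = asinh_loss_grad(y, y_hat)
numeric = (asinh_loss(y, y_hat + h) - asinh_loss(y, y_hat - h)) / (2.0 * h)
print(f"squared-error gradient : {squared_loss_grad(y, y_hat):.1f}")
print(f"asinh-loss gradient    : {analytic:.4f}  (finite difference {numeric:.4f})")
```

On this example the squared-error gradient is in the millions while the arsinh-scaled gradient is of order one, which is the compression the text describes.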

Practical takeaway: whenever an inverse hyperbolic appears inside a loss, look at the singular point of its derivative first. Catastrophic training runs usually trace back to inputs drifting toward those singularities.
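
As an illustration of looking at the singular point first, the sketch below tabulates the $\operatorname{artanh}$ slope near $x = 1$ and shows a clamped variant. The helper `safe_artanh` and its `limit=0.9999` default are our own illustrative choices, mirroring the clamp range mentioned in item 3 rather than any particular library's API.

```python
import math

def artanh_slope(u: float) -> float:
    """1 / (1 - u^2), the derivative of artanh on |u| < 1."""
    return 1.0 / (1.0 - u * u)

def safe_artanh(u: float, limit: float = 0.9999) -> float:
    """Clamp u into [-limit, limit] before calling atanh (hypothetical helper)."""
    clamped = max(-limit, min(limit, u))
    return math.atanh(clamped)

for u in (0.9, 0.995, 0.99995):
    print(f"u = {u:<8}  slope = {artanh_slope(u):10.2f}")

# A fully saturated input: the raw call fails, the clamped call stays finite.
try:
    math.atanh(1.0)
    raw_outcome = "finite"
except ValueError:
    raw_outcome = "ValueError"
print("math.atanh(1.0)  ->", raw_outcome)
print("safe_artanh(1.0) ->", round(safe_artanh(1.0), 4))
print("largest slope inside the clamp:", round(artanh_slope(0.9999), 2))
```

The clamp trades a small bias near saturation for a bounded slope (about 5000 at the clamp edge with this limit); whether that trade is acceptable depends on the model, so treat the limit as a tunable assumption.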

Common Pitfalls

  1. Forgetting the domain. $\operatorname{arcosh}(x)$ requires $x \ge 1$; $\operatorname{artanh}(x)$ requires $|x| < 1$. The derivatives live on the open versions of those domains ($x > 1$ and $|x| < 1$); outside them a library call returns NaN or raises a domain error.
  2. Sign of the square root. When you compute $\cosh(y)$ from $\cosh^2(y) = 1 + x^2$ you must pick the positive root, because $\cosh \ge 1$. The same is true for the principal-branch choices in $\operatorname{arcosh}$.
  3. Confusing artanh with arcoth. Their derivatives are identical expressions $1/(1-x^2)$, but their domains are disjoint: $|x| < 1$ for artanh, $|x| > 1$ for arcoth. Pick the wrong one and your answer is in the wrong region of the real line.
  4. Float32 near singularities. Both $\operatorname{artanh}'$ and $\operatorname{arcosh}'$ have gradient blow-ups at the edge of their domain. In float32 you will see noisy or infinite gradients well before you reach the mathematical boundary. Use float64 for diagnostic work and clamp inputs before the function call in production.
  5. Notation drift. Some books write $\sinh^{-1}$, $\cosh^{-1}$; we prefer the $\operatorname{arsinh}$, $\operatorname{arcosh}$ form to avoid confusion with the reciprocal $1/\sinh$. Both mean the same thing.
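
Pitfall 1 looks different depending on the library. As a small sketch of what happens in plain Python (standard `math` module only; the helper `outcome` is ours), out-of-domain calls raise exceptions rather than quietly returning NaN, whereas array libraries often return NaN or infinity instead.

```python
import math

def d_arcosh(x: float) -> float:
    return 1.0 / math.sqrt(x * x - 1.0)

def outcome(fn, x):
    """Return the value as text, or the name of the exception that was raised."""
    try:
        return repr(fn(x))
    except (ValueError, ZeroDivisionError) as exc:
        return type(exc).__name__

probes = [
    ("arcosh(0.5)", math.acosh, 0.5),   # below the domain x >= 1
    ("artanh(1.0)", math.atanh, 1.0),   # on the excluded endpoint
    ("arcosh'(1.0)", d_arcosh, 1.0),    # derivative at the boundary: 1 / 0
    ("arcosh'(0.5)", d_arcosh, 0.5),    # square root of a negative number
    ("arcosh(1.0)", math.acosh, 1.0),   # boundary value itself is fine
]
for label, fn, x in probes:
    print(f"{label:<13} -> {outcome(fn, x)}")
```

Note that $\operatorname{arcosh}(1) = 0$ is perfectly valid while its derivative at $1$ is not, which is the difference between the closed and open domains in pitfall 1.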

The single biggest mistake students make is memorizing the formulas and forgetting the derivation. When you see $1/\sqrt{1+x^2}$ in a problem, you want your brain to flash “that is the slope of arsinh, because $\cosh = \sqrt{1+\sinh^2}$.” Two-hop memory beats one-hop memorization every time.

Summary

Every inverse hyperbolic derivative is one step away from the identity it lives on:

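| Identity | Inverse derivative it produces |
| --- | --- |
| $\cosh^2 - \sinh^2 = 1$ | $(\operatorname{arsinh})' = \dfrac{1}{\sqrt{1+x^2}}$, $(\operatorname{arcosh})' = \dfrac{1}{\sqrt{x^2-1}}$ |
| $1 - \tanh^2 = \operatorname{sech}^2$ | $(\operatorname{artanh})' = \dfrac{1}{1-x^2}$, $(\operatorname{arcoth})' = \dfrac{1}{1-x^2}$ |
| $\operatorname{csch} = 1/\sinh$, $\operatorname{sech} = 1/\cosh$ | $(\operatorname{arcsch})' = -\dfrac{1}{\lvert x\rvert\sqrt{1+x^2}}$, $(\operatorname{arsech})' = -\dfrac{1}{x\sqrt{1-x^2}}$ |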

Three identities. Six formulas. One mechanical procedure (the inverse function theorem) that converts one to the other. Once the picture (graphs reflected across $y = x$, slopes reciprocating) is in your head, the formulas stop being formulas and become consequences of geometry.

One sentence: the derivative of $f^{-1}$ at $y$ is the reciprocal of the derivative of $f$ at the matching $x$, and for hyperbolic functions the re-expression in terms of $y$ always produces a square root (via $\cosh^2 - \sinh^2 = 1$) or a quadratic (via $1 - \tanh^2 = \operatorname{sech}^2$).