Chapter 1

What is Predictive Maintenance?

Predictive Maintenance & RUL

The Smoke Alarm, the Annual Checkup, the Fitness Watch

A smoke alarm screams when the kitchen is already on fire. An annual physical catches whatever happens to be wrong on the day you walk into the doctor's office. A modern cardiac-monitor patch streams ECG twenty-four hours a day, building a model of your heart and warning hours in advance that a rhythm event is brewing. Three different devices — three completely different ways of relating to failure. The smoke alarm is reactive: it tells you the disaster is already underway. The physical is preventive: it runs on a fixed schedule whether you need it or not. The patch is predictive: it learns from data and warns you while there is still time to act.

Predictive maintenance is exactly the third device, applied to engines, motors, bearings, transformers, batteries, wind turbines, and an ever-growing list of capital equipment that the world's economies run on. The job of this book is to show you, end to end, how to build that “cardiac patch” for a jet engine — including the part nobody tells you, which is how to balance accuracy against safety when the model is wrong.

Quick mental model. Reactive = wait and pay the worst case. Preventive = pay a small bill on a fixed schedule and accept some waste. Predictive = pay an even smaller bill on a smart schedule, but you have to trust your model.

Three Maintenance Strategies, One Cost Equation

Strip away the language and every maintenance program reduces to one cost equation per engine, summed over the fleet:

\[ C_{\text{total}} = C_{\text{intervention}} + C_{\text{life}} \cdot \Delta_{\text{wasted}} + C_{\text{failure}} \cdot \mathbf{1}[\text{intervention too late}] \]

The three knobs are the cost of acting (\(C_{\text{intervention}}\)), the cost of throwing away remaining useful life (\(C_{\text{life}} \cdot \Delta_{\text{wasted}}\)), and the cost of a catastrophic miss (\(C_{\text{failure}}\)). In the table, \(C_{\text{sched}}\) is the value \(C_{\text{intervention}}\) takes for a planned visit. The three strategies trade the knobs off differently:

| Strategy | When you act | What you pay per engine |
|---|---|---|
| Reactive | After failure | Always \(C_{\text{failure}}\) |
| Preventive | Fixed schedule (cycle = \(T\)) | \(C_{\text{failure}}\) if the engine broke before \(T\), else \(C_{\text{sched}} + C_{\text{life}} \cdot (t_{\text{fail}} - T)\) |
| Predictive | When the ML model says \(\text{RUL} \approx t_{\text{lead}}\) | \(C_{\text{failure}}\) if late, else \(C_{\text{sched}} + C_{\text{life}} \cdot \Delta_{\text{wasted}}\) |

The interesting line is the third. Predictive can match preventive on the scheduled-cost term and beat it on the wasted-life term — but only if the model is accurate enough that “late predictions” are rare. The whole rest of this book is about making them rare.
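The table can be read as a single per-engine function. The sketch below is one possible encoding, not the book's reference code: the name `engine_cost`, the default constants (borrowed from the toy script later in this section), and the convention that a tie between intervention and failure counts as late are all assumptions made for illustration.

```python
def engine_cost(t_intervene, t_fail,
                c_sched=5_000.0, c_life=100.0, c_failure=100_000.0):
    """Per-engine cost of intervening at cycle t_intervene.

    t_intervene=None models the reactive strategy (never intervene early).
    All names and default values here are illustrative assumptions.
    """
    # Reactive, or any intervention that arrives at/after the failure cycle.
    if t_intervene is None or t_intervene >= t_fail:
        return c_failure
    # Otherwise we pay the planned visit plus the life we threw away.
    wasted = t_fail - t_intervene
    return c_sched + c_life * wasted


t_fail = 245  # true failure cycle of one hypothetical engine
print(engine_cost(None, t_fail))  # reactive
print(engine_cost(150, t_fail))   # preventive with T = 150
print(engine_cost(235, t_fail))   # predictive, arriving 10 cycles early
```

The three strategies differ only in how `t_intervene` is chosen: never, from a calendar, or from a model.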

Play With the Tradeoff

Below is a 100-engine fleet with normally-distributed failure times. The left chart picks one representative engine and shows where each strategy would intervene. The right chart shows the fleet-wide total dollar cost. Drag any slider and the simulation re-runs instantly.

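[Interactive simulator: the left chart shows where each strategy intervenes on one representative engine; the right chart shows fleet-wide total dollar cost.]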

Two observations worth your time. First, push ML prediction error std from 5 cycles to 30 — the green bar climbs sharply because the “late prediction” tail of the Gaussian explodes. Second, push the preventive interval down to 100 — the orange bar collapses to almost zero failures, but the wasted-life penalty makes total cost rise. There is no silver bullet — the cost function will fight any choice you make.
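Both observations can be reproduced offline with a small sweep. This is a hedged sketch rather than the simulator's actual code: the function names, the 10,000-engine fleet (chosen only to smooth out sampling noise), and the use of `np.random.default_rng` are assumptions, while the cost constants mirror the toy script later in this section.

```python
import numpy as np

C_FAIL, C_SCHED, C_LIFE = 100_000, 5_000, 100  # same constants as the toy script


def preventive_total(t_fail, interval):
    """Fleet cost when every engine is pulled at a fixed cycle `interval`."""
    failed = t_fail < interval
    wasted = np.clip(t_fail - interval, 0, None)
    return float(np.where(failed, C_FAIL, C_SCHED + C_LIFE * wasted).sum())


def predictive_total(t_fail, error_std, lead_time, rng):
    """Fleet cost when a noisy failure prediction triggers the visit."""
    noise = rng.normal(0.0, error_std, size=t_fail.shape)
    intervention = t_fail + noise - lead_time
    late = intervention >= t_fail
    wasted = np.clip(t_fail - intervention, 0, None)
    return float(np.where(late, C_FAIL, C_SCHED + C_LIFE * wasted).sum())


rng = np.random.default_rng(0)
t_fail = rng.normal(200, 30, size=10_000)  # assumed larger fleet

for std in (5, 10, 20, 30):
    total = predictive_total(t_fail, std, lead_time=10, rng=rng)
    print(f"error_std={std:>2}: ${total / t_fail.size:>9,.0f} per engine")

for interval in (100, 150, 200):
    total = preventive_total(t_fail, interval)
    print(f"interval={interval}: ${total / t_fail.size:>9,.0f} per engine")
```

Reporting cost per engine rather than the fleet total keeps the numbers comparable if you change the fleet size.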

Hands-On: Cost of 100 Engines in Pure Python

Before we get anywhere near a neural network, the entire problem fits in a forty-line Python script. The function below is exactly the math behind the simulator above, with deterministic seeding so the numbers are reproducible.

Three strategies, three lines, one cost equation
maintenance_costs.py
import numpy as np

NumPy is the numerical-computing library that PyTorch tensors interoperate with (for example via torch.from_numpy). We use it here for a normally-distributed random sample (np.random.normal) and a deterministic seed. Aliasing it as np is the near-universal Python convention.

EXECUTION STATE
numpy = Provides ndarray, broadcasting, and vectorised math at C speed
as np = Standard alias used by every NumPy / SciPy / PyTorch tutorial
Comment - A 100-engine fleet with normally-distributed failure times

Real fleets have a distribution of failure cycles: nominally identical engines fail at slightly different times because of manufacturing variation, duty cycle, and operating condition. We use a Normal(200, 30) distribution to capture that spread without cluttering the example with real fleet data.

np.random.seed(42)

Sets the global NumPy random state. Any subsequent call to np.random.* is now deterministic - re-running this script will produce the exact same fleet. Reproducibility is non-negotiable in any production maintenance system.

EXECUTION STATE
np.random.seed() = Sets the seed of the global Mersenne-Twister state. After this call, every np.random.normal / .uniform / .randint draws from the same fixed pseudo-random stream.
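Global seeding works for a single script, but any other code that touches np.random shifts the stream. A common alternative is to pass an explicit generator around. The sketch below assumes a hypothetical helper `make_fleet`; it is not part of maintenance_costs.py, and because `np.random.default_rng` uses a different bit generator than the legacy global state, it will not reproduce the fleet values quoted in this walkthrough.

```python
import numpy as np


def make_fleet(seed, fleet_size=100, mean=200.0, std=30.0):
    """Draw integer failure cycles from a private, explicitly seeded generator.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    cycles = rng.normal(loc=mean, scale=std, size=fleet_size).astype(int)
    return cycles, rng


fleet_a, _ = make_fleet(seed=42)
fleet_b, _ = make_fleet(seed=42)
print(np.array_equal(fleet_a, fleet_b))  # True: same seed, same fleet
```

Returning the generator alongside the fleet lets later steps (such as simulated prediction noise) keep drawing from the same reproducible stream.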
arg: 42 = An arbitrary integer. Any int produces a deterministic stream; 42 is convention from Hitchhiker's Guide to the Galaxy.
fleet_size = 100

How many engines we will simulate. With 100 engines we can read the cost-difference between strategies clearly without the simulation taking any noticeable time.

EXECUTION STATE
fleet_size = 100 - chosen so percentage-level effects (~5 unscheduled failures) are visible
true_failure_cycles = np.random.normal(...).astype(int)

Draws 100 samples from a Normal distribution with mean 200 and standard deviation 30, then truncates each to an integer cycle. These are the 'ground-truth' failure times that our strategies will be evaluated against. In real life we never know these - that's exactly what the rest of the book is about predicting.

EXECUTION STATE
np.random.normal(loc, scale, size) = Gaussian sample. loc = mean, scale = std-dev, size = how many samples. Returns an ndarray.
arg: loc=200 = Mean failure cycle - the average engine in this fleet runs ~200 cycles before catastrophic failure.
arg: scale=30 = Standard deviation - fleet variability. About 68% of engines fail between cycle 170 and 230.
arg: size=fleet_size = 100 - produces a 1-D array of length 100.
.astype(int) = Truncates each float to an integer (towards zero). Cycles are inherently discrete.
result: true_failure_cycles[:8] = [214, 195, 219, 245, 192, 192, 247, 223] - the first 8 engines
fleet stats = min = 121, mean = 196.4, max = 255 cycles
COST_FAILURE = 100_000

Dollar cost of an unscheduled failure. For a commercial turbofan this can easily reach mid-six figures once you include lost flight, AOG (aircraft on ground) penalties, and emergency airlift of replacement parts. We park it at a conservative $100k.

EXECUTION STATE
COST_FAILURE = $100,000 - UPPER_SNAKE_CASE marks it as a constant by convention
100_000 = Python lets you put underscores in numeric literals for readability - same as 100000
COST_SCHEDULED = 5_000

Planned shop-visit cost. Roughly 5% of an unscheduled failure: parts are pre-stocked, the engine isn't airborne, and the maintenance crew is on a normal shift.

EXECUTION STATE
COST_SCHEDULED = $5,000 per scheduled repair
COST_LIFE_PER_CYCLE = 100

The value we lose per cycle of remaining-useful-life that we throw away. If we replace an engine 20 cycles before it would have failed, we paid $5,000 + 20*$100 = $7,000 - the second term is the wasted-life penalty.

EXECUTION STATE
COST_LIFE_PER_CYCLE = $100 - converts wasted RUL cycles into dollars
why penalise wasted life? = If you ignore this term, the trivial optimum is to replace every engine at cycle 1: zero failures on paper, but nearly the whole life of every engine thrown away. The wasted-life term is what keeps an ultra-short preventive schedule from looking optimal.
def reactive_cost(failure_cycle: int) -> float:

Strategy 1: do nothing until the engine breaks. The cost is COST_FAILURE every single time, regardless of when it happens.

EXECUTION STATE
input: failure_cycle (int) = When the engine actually fails. The function ignores this argument - reactive doesn't care.
returns float = Always $100,000.
return COST_FAILURE

Constant return - every engine in a reactive fleet costs the same.

EXECUTION STATE
return: 100_000 = Dollar cost of one unscheduled failure
def preventive_cost(failure_cycle, schedule_interval=150) -> float:

Strategy 2: replace every engine on a fixed cycle schedule, regardless of its actual condition. The variable schedule_interval controls how aggressive we are - small = wasteful, large = misses failures.

EXECUTION STATE
input: failure_cycle = When the engine actually breaks (not visible to the strategy).
input: schedule_interval = 150 = Default fixed-schedule cycle. Aggressive choice: well below the 196.4-cycle mean.
returns float = Either $100k (if it broke first) or $5k + wasted_life * $100 (if scheduled first).
Docstring - Strategy 2 - repair every schedule_interval cycles, regardless

Captures the defining feature of preventive: the schedule is fixed in advance, with no information about the actual engine.

if failure_cycle < schedule_interval:

Did the engine break before our scheduled visit? With schedule_interval = 150 and Normal(200, 30), about 5% of engines do.

EXECUTION STATE
Probability = P(failure < 150) ~ Phi((150-200)/30) ~ Phi(-1.67) ~ 4.7% -> ~5 of 100 engines (we measured exactly 6)
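The quoted probability is easy to check with the standard library alone. The helper name `normal_cdf` below is an assumption for illustration; it expresses the Gaussian CDF through `math.erf`.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """Gaussian CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


# Probability that a Normal(200, 30) failure time falls before cycle 150.
p_early_failure = normal_cdf(150, mean=200, std=30)
print(f"P(failure < 150) = {p_early_failure:.3f}")
print(f"Expected failures in a 100-engine fleet: {100 * p_early_failure:.1f}")
```

Integer truncation does not change this probability: a truncated cycle is below 150 exactly when the raw draw is below 150.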
return COST_FAILURE (if it broke first)

When the engine fails before our scheduled visit, preventive degrades to reactive - full $100k.

EXECUTION STATE
return: 100_000 = Catastrophic-failure path
wasted_cycles = failure_cycle - schedule_interval

If we got there in time, we pulled an engine that still had wasted_cycles of useful life left.

EXECUTION STATE
wasted_cycles = If failure_cycle = 245 and schedule_interval = 150: wasted_cycles = 95
return COST_SCHEDULED + wasted_cycles * COST_LIFE_PER_CYCLE

Total bill for a scheduled repair: the visit itself plus wasted-life penalty.

EXECUTION STATE
Example return = $5,000 + 95 * $100 = $14,500 for engine #3 (true failure cycle 245)
def predictive_cost(failure_cycle, lead_time=10, error_std=5.0) -> float:

Strategy 3: an ML model predicts each engine's failure cycle. We intervene lead_time cycles before the prediction. The prediction itself is noisy - error_std controls how good the model is.

EXECUTION STATE
input: failure_cycle = True failure cycle (only the simulator knows this - the strategy uses the noisy prediction).
input: lead_time = 10 = Safety margin. Bigger lead_time -> safer but wastes more life.
input: error_std = 5.0 = Std-dev of the ML model's RUL error. 5 cycles is realistic for a well-trained model.
returns float = Either $100k (late prediction) or $5k + wasted_life * $100.
predicted = failure_cycle + np.random.normal(0, error_std)

Simulates the ML model's output as a noisy version of the truth. In a real system this is whatever your CNN-BiLSTM-Attention model spits out - for now we mock it as ground-truth + Gaussian noise.

EXECUTION STATE
np.random.normal(loc=0, scale=error_std) = Single Gaussian sample with mean 0 and std error_std (5.0). Returns a Python float.
Example = If failure_cycle = 245 and the noise sample is +1.7 -> predicted = 246.7
intervention = predicted - lead_time

Intervene lead_time cycles before the predicted failure. If the prediction is exactly right and lead_time = 10, we replace the engine 10 cycles before it would have failed.

EXECUTION STATE
intervention = 246.7 - 10 = 236.7 for engine #3
if intervention >= failure_cycle:

Did the engine actually break before we got there? This happens when the noise pushed the prediction past the true failure by more than lead_time.

EXECUTION STATE
late-prediction probability = P(noise > lead_time) = P(N(0, 5) > 10) ~ 2.3% -> measured 5/100 in our run
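The same tail probability shows how the safety margin trades against model quality. This is a small sketch under the walkthrough's zero-mean Gaussian noise assumption; the function name `late_probability` is hypothetical.

```python
import math


def late_probability(lead_time, error_std):
    """P(noise >= lead_time) for zero-mean Gaussian prediction noise."""
    z = lead_time / error_std
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for lead_time in (0, 5, 10, 15, 20):
    p = late_probability(lead_time, error_std=5.0)
    print(f"lead_time={lead_time:>2}: P(late) = {p:.4f}")
```

With only 100 engines the observed late count fluctuates around its expectation of about 2.3 with a binomial standard deviation of roughly 1.5, so a single run can land noticeably above or below the analytic value.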
return COST_FAILURE (intervened too late)

Late predictions cost the full $100k. This is the failure mode that drove every safety regulation in commercial aviation.

EXECUTION STATE
return: 100_000 = Late-prediction path
wasted_cycles = max(0, failure_cycle - intervention)

If we replaced early, we wasted (failure_cycle - intervention) cycles. The max(0, ...) guard is defensive - by this branch we already know intervention < failure_cycle, so the difference is positive, but it costs nothing to be careful.

EXECUTION STATE
Example = 245 - 236.7 = 8.3 cycles wasted for engine #3
return COST_SCHEDULED + wasted_cycles * COST_LIFE_PER_CYCLE

Final bill - same formula as preventive, but with much smaller wasted_cycles.

EXECUTION STATE
Example return = $5,000 + 8.3 * $100 = $5,830 for engine #3 - vs $14,500 under preventive.
reactive = [reactive_cost(c) for c in true_failure_cycles]

Python list comprehension - applies reactive_cost to every engine in the fleet, building a 100-element list of per-engine costs.

EXECUTION STATE
reactive[:5] = [100000, 100000, 100000, 100000, 100000]
len(reactive) = 100 (one entry per engine)
preventive = [preventive_cost(c) for c in true_failure_cycles]

Same pattern; this time each entry depends on whether the engine survived to cycle 150.

EXECUTION STATE
preventive[:5] = [ 11400, 9500, 11900, 14500, 9200] (all survived past 150)
preventive[34] = 100000 - engine #34 broke at cycle 137, before the scheduled visit
predictive = [predictive_cost(c) for c in true_failure_cycles]

Most expensive line in the script: each call draws a fresh Gaussian noise sample. Because the global seed fixes the random stream and the comprehension visits the engines in a fixed order, re-running gives identical totals.

EXECUTION STATE
predictive[:5] = [ 6173, 6094, 5891, 5832, 6244]
late count = 5 of 100 engines were late predictions ($100k each); the other 95 averaged $5,857
print(...) - Reactive total

Sum of all 100 entries, formatted with thousand-separators.

EXECUTION STATE
Output = Reactive total: $10,000,000
print(...) - Preventive total

6 unscheduled failures * $100k = $600k, plus 94 scheduled repairs averaging ~$10k each.

EXECUTION STATE
Output = Preventive total: $ 1,539,700
print(...) - Predictive total

Predictive saves 91% over reactive and 43% over preventive. The five late predictions cost $500k of the total - every percentage point of model accuracy you lose, this number grows.

EXECUTION STATE
Output = Predictive total: $ 884,572
vs reactive = -91.2% - almost order-of-magnitude reduction
vs preventive = -42.5%
import numpy as np

# A 100-engine fleet with normally-distributed failure times.
np.random.seed(42)
fleet_size = 100
true_failure_cycles = np.random.normal(loc=200, scale=30,
                                       size=fleet_size).astype(int)

# Cost parameters (US dollars, per engine).
COST_FAILURE        = 100_000   # unscheduled, catastrophic
COST_SCHEDULED      = 5_000     # planned shop visit
COST_LIFE_PER_CYCLE = 100       # value of one wasted operating cycle


def reactive_cost(failure_cycle: int) -> float:
    """Strategy 1 - wait until it breaks, then pay full price."""
    return COST_FAILURE


def preventive_cost(failure_cycle: int, schedule_interval: int = 150) -> float:
    """Strategy 2 - repair every schedule_interval cycles, regardless."""
    if failure_cycle < schedule_interval:
        return COST_FAILURE                 # broke before the visit
    wasted_cycles = failure_cycle - schedule_interval
    return COST_SCHEDULED + wasted_cycles * COST_LIFE_PER_CYCLE


def predictive_cost(failure_cycle: int, lead_time: int = 10,
                    error_std: float = 5.0) -> float:
    """Strategy 3 - ML predicts RUL, we intervene lead_time cycles early."""
    predicted    = failure_cycle + np.random.normal(0, error_std)
    intervention = predicted - lead_time
    if intervention >= failure_cycle:
        return COST_FAILURE                 # we got there too late
    wasted_cycles = max(0, failure_cycle - intervention)
    return COST_SCHEDULED + wasted_cycles * COST_LIFE_PER_CYCLE


reactive   = [reactive_cost(c)   for c in true_failure_cycles]
preventive = [preventive_cost(c) for c in true_failure_cycles]
predictive = [predictive_cost(c) for c in true_failure_cycles]

print(f"Reactive   total: ${sum(reactive):>12,.0f}")
print(f"Preventive total: ${sum(preventive):>12,.0f}")
print(f"Predictive total: ${sum(predictive):>12,.0f}")
# Reactive   total: $  10,000,000
# Preventive total: $   1,539,700
# Predictive total: $     884,572

You will recognise this pattern again. Wasted life appears as the regression error \(|\hat{y} - y|\) later in the book; late prediction appears as the asymmetric NASA score in Chapter 13. The cost function only changes name, not shape.

Where Predictive Maintenance Pays Off Beyond Aerospace

It is easy to read a paper on turbofan RUL prediction and conclude it is an aerospace-only sport. It is not. The same cost equation governs decisions across nearly every capital-intensive industry — only the constants change.

| Industry | What is degrading | Cost of unscheduled failure | Where the data comes from |
|---|---|---|---|
| Commercial aviation | Turbofan engine, hydraulic actuator | $50k-$1M+ per AOG event | FADEC sensor stream (this book) |
| Healthcare | MRI scanner, ventilator, infusion pump | Patient harm + ~$250k device replacement | Self-test logs, vibration, current draw |
| Electric grid | Power transformer, switchgear, cable | $1M-$10M plus regional outage | Dissolved-gas-in-oil, partial-discharge |
| EV / grid storage | Lithium-ion cell health (state-of-health) | Range loss, thermal runaway risk | Voltage / current / temperature curves |
| Manufacturing | CNC bearing, robot arm, hydraulic press | Line stoppage at $10k-$100k per hour | Vibration, acoustic emission, current |
| Autonomous vehicles | LiDAR, IMU, brake actuator | Safety-critical: disengagement or crash | Cross-sensor consistency, drift telemetry |
| Wind & solar | Gearbox, pitch bearing, inverter | Crane + helicopter access fees alone $100k+ | SCADA + accelerometer |

The mathematical core in this book — multi-task learning, gradient-aware balancing, the asymmetric safety score — transfers to every row of the table. Chapter 29 returns to this question explicitly when we discuss extending the AMNL/GABA/GRACE framework to bearings and batteries.

The Single Number Behind All of It: RUL

Every predictive-maintenance method ultimately boils down to estimating one scalar per machine, per moment in time: Remaining Useful Life, abbreviated RUL.

Given the multivariate sensor history \(\mathbf{X}_{1:t} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t\}\), a window of past sensor readings up to the current cycle \(t\), RUL prediction is the conditional expectation

\[ \widehat{\text{RUL}}_t = \mathbb{E}\bigl[\,t_{\text{fail}} - t \mid \mathbf{X}_{1:t}\,\bigr] \]

That is it. The model, however large, only needs to learn one mapping: from a window of sensor data to a single cycle-count to failure. Everything else in this book — multi-task learning, attention, gradient balancing — is in service of making that single number more accurate and more safely conservative.
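On the training side, the target inside that expectation is simple to construct when an engine has been run all the way to failure. The sketch below assumes a run-to-failure trajectory whose last recorded cycle is the failure cycle; the function name `rul_labels` is hypothetical.

```python
import numpy as np


def rul_labels(num_cycles):
    """Ground-truth RUL for cycles 1..num_cycles of one run-to-failure engine.

    Assumes the last recorded cycle is the failure cycle, so RUL_t = t_fail - t.
    """
    cycles = np.arange(1, num_cycles + 1)
    return num_cycles - cycles


labels = rul_labels(5)
print(labels)  # [4 3 2 1 0]
```

Each sensor window ending at cycle t is then paired with `labels[t - 1]` as its regression target.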

Why a single number works. RUL is a sufficient statistic for the maintenance decision: given an accurate RUL plus the costs we wrote above, the optimal intervention time is fully determined. You do not need to predict the entire future sensor stream, only the time until something goes wrong.

The Subtle Pitfall: Late Predictions Are Asymmetrically Expensive

Re-read the cost equation one more time. The first term is a fixed fee and the second grows linearly with the cycles you waste; both are continuous, polite, the kind of thing a regression loss like mean-squared-error rewards. The third term is discontinuous: a one-cycle-late prediction costs essentially the same as a hundred-cycles-late prediction. Both are \(C_{\text{failure}}\).

The trap nearly every paper falls into. A model that achieves the lowest RMSE on a benchmark is not necessarily the model you want flying passengers. The direction of the error matters as much as its size. Predicting RUL = 50 when the truth is 60 (early, safe) is almost free. Predicting RUL = 50 when the truth is 40 (late, dangerous: the engine fails ten cycles before you expected) is six figures of damage.

The community has had a way to formalise this asymmetry for over a decade — the NASA scoring function, which exponentially penalises late predictions far more than early ones. The whole story of this book is what happens when we take that asymmetry seriously: it is what motivates the failure-biased loss in Chapter 14, the gradient-aware balancing in Chapter 17, and the GRACE objective in Chapter 21.
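For orientation, here is a minimal sketch of the scoring function in its commonly cited form, with error d = predicted RUL minus true RUL, so that d > 0 means the model overestimated the remaining life (a late prediction). The constants 13 and 10 are the values usually quoted for the PHM08 data challenge; treat them, and the function name `nasa_score`, as assumptions here, since the book's own definition appears in later chapters.

```python
import numpy as np


def nasa_score(rul_pred, rul_true, a_early=13.0, a_late=10.0):
    """Asymmetric exponential score; lower is better.

    d < 0 (early prediction) is penalised with exp(-d / a_early) - 1,
    d >= 0 (late prediction) with exp(d / a_late) - 1.
    """
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    penalty = np.where(d < 0,
                       np.exp(-d / a_early) - 1.0,
                       np.exp(d / a_late) - 1.0)
    return float(penalty.sum())


early = nasa_score([50], [60])  # underestimates remaining life by 10 cycles
late = nasa_score([50], [40])   # overestimates remaining life by 10 cycles
print(f"early: {early:.3f}, late: {late:.3f}")
```

Both errors have the same absolute size and therefore the same RMSE contribution, yet the late one scores worse, and the gap widens exponentially as the error grows.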

The book in one sentence. Predictive maintenance is the art of being early enough, often enough, that the model's late mistakes are too rare to dominate the cost.

Takeaway

  • Three strategies, one cost equation. Reactive, preventive, and predictive maintenance differ only in when they intervene; the dollar cost has the same three-term structure for all of them.
  • Predictive wins when the model is accurate. In the toy simulation predictive saves 91% over reactive and 43% over preventive at a 5-cycle prediction error. Push that error to 30 cycles and the advantage vanishes.
  • The job of the book is one number. Estimate RUL accurately and conservatively, given a window of multivariate sensor data. Every chapter either improves the estimate or quantifies the cost of getting it wrong.
  • Late predictions are not just “a bit worse than” early ones. They are categorically worse. Section 1.3 will make that asymmetry quantitative.