Learning Objectives
By the end of this section, you will:
- Understand AMNL's inference throughput of 31K samples/second
- Analyze latency characteristics for real-time deployment
- Identify computational bottlenecks in the inference pipeline
- Learn optimization strategies for production deployment
- Compare with industrial requirements for predictive maintenance
Core Insight: AMNL processes 31,625 samples per second on a single GPU—equivalent to monitoring over 31,000 engines simultaneously with 1-second update intervals. This throughput far exceeds typical industrial requirements, enabling real-time monitoring at scale.
Inference Throughput
Inference throughput measures how many predictions the model can make per unit time. This is critical for industrial deployment where thousands of assets may need simultaneous monitoring.
Benchmark Results
| Metric | Value | Context |
|---|---|---|
| Throughput | 31,625 samples/sec | Batch size 256 |
| Per-sample latency | 31.6 μs | Average |
| Batch latency | 8.1 ms | 256 samples |
| Hardware | NVIDIA RTX 5000 | 16GB VRAM |
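The per-sample and batch latencies in the table follow directly from the throughput figure; a quick back-of-the-envelope check in plain Python:

```python
throughput = 31625   # measured samples/sec at batch size 256
batch_size = 256

per_sample_us = 1e6 / throughput                   # average latency per sample, microseconds
batch_latency_ms = batch_size / throughput * 1e3   # wall time for one batch, milliseconds

print(round(per_sample_us, 1))     # 31.6
print(round(batch_latency_ms, 1))  # 8.1
```

Note that the 31.6 μs "per-sample latency" is an amortized figure: individual samples share one 8.1 ms batch pass, so a single sample submitted alone would see the full batch latency.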
Throughput by Batch Size
Throughput varies significantly with batch size because small batches leave most of the GPU idle:
| Batch Size | Throughput (samples/sec) | GPU Utilization |
|---|---|---|
| 1 | ~2,500 | ~15% |
| 16 | ~12,000 | ~45% |
| 64 | ~22,000 | ~75% |
| 128 | ~28,000 | ~88% |
| 256 | ~31,625 | ~95% |
| 512 | ~32,100 | ~97% |
Optimal Batch Size
Batch size 256 provides the best balance between throughput and memory usage. Larger batches (512+) offer diminishing returns while consuming significantly more GPU memory.
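One way to operationalize this trade-off is to pick the smallest batch size that reaches a target fraction of peak measured throughput. A sketch using the numbers from the sweep above (the 95% threshold is an illustrative choice, not part of the benchmark):

```python
# Approximate throughput sweep from the table (samples/sec)
sweep = {1: 2500, 16: 12000, 64: 22000, 128: 28000, 256: 31625, 512: 32100}

def pick_batch_size(sweep, fraction=0.95):
    """Smallest batch size achieving `fraction` of peak throughput."""
    peak = max(sweep.values())
    return min(b for b, t in sweep.items() if t >= fraction * peak)

print(pick_batch_size(sweep))  # 256
```

With a 95% threshold this selects batch size 256: batch 512 adds only ~1.5% more throughput while roughly doubling activation memory.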
What Does 31K Samples/Second Mean?
At a 1-second update interval per asset, 31,625 samples/second is enough capacity to monitor over 31,000 engines simultaneously on a single GPU.
Latency Analysis
While throughput measures overall capacity, latency measures how quickly an individual prediction is returned—critical for real-time alerting systems.
Latency Breakdown
| Stage | Time (μs) | Percentage |
|---|---|---|
| Data preprocessing | 5.2 | 16.5% |
| CNN feature extraction | 3.8 | 12.0% |
| BiLSTM encoding | 12.4 | 39.2% |
| Multi-head attention | 6.1 | 19.3% |
| Task heads | 1.2 | 3.8% |
| Post-processing | 2.9 | 9.2% |
| Total | 31.6 | 100% |
Latency Distribution
Latency varies across samples due to GPU scheduling and memory access patterns:
| Percentile | Latency (μs) | Use Case |
|---|---|---|
| p50 (median) | 28.4 | Typical case |
| p90 | 35.2 | Most samples |
| p99 | 48.7 | Edge cases |
| p99.9 | 72.1 | Rare outliers |
| Max observed | 124.3 | Cold start |
Real-Time Guarantee
For real-time systems, the p99 latency (48.7 μs) is more relevant than the average. Even at p99, AMNL returns predictions in under 50 microseconds—well within the millisecond-scale requirements of most industrial control systems.
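Tail percentiles like p99 are typically computed with the nearest-rank method over a window of recorded latencies. A minimal sketch (the sample data here is synthetic, for illustration only):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))  # 1-based rank
    return s[rank - 1]

latencies = list(range(1, 101))   # synthetic latency samples
print(percentile(latencies, 50))  # 50
print(percentile(latencies, 99))  # 99
```

In production, the same function would run over a sliding window of recent per-request latencies to track p99 drift over time.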
Computational Bottlenecks
Understanding where computation time is spent helps identify optimization opportunities.
BiLSTM Dominates
The BiLSTM encoder accounts for 39.2% of inference time, making it the primary bottleneck:
- Sequential nature: LSTMs process timesteps sequentially, limiting parallelization
- 3-layer depth: Each layer adds latency
- Bidirectional processing: Effectively 6 LSTM passes
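The cost of this sequentiality can be made concrete: every timestep of every pass must be evaluated in order, so the critical path grows with depth × directions × window length. (The window length of 50 timesteps is taken from the model's input shape of [batch, 50, 17].)

```python
seq_len = 50     # timesteps per window (model input is [batch, 50, 17])
layers = 3       # BiLSTM depth
directions = 2   # bidirectional

passes = layers * directions          # 6 LSTM passes
sequential_steps = seq_len * passes   # cell evaluations on the critical path

print(passes, sequential_steps)  # 6 300
```

By contrast, the CNN and attention stages process all 50 timesteps in parallel, which is why a stage with fewer FLOPs (the BiLSTM) still dominates wall-clock time.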
Attention Overhead
Multi-head attention contributes 19.3% of latency with 12 attention heads. Its cost scales as

O(n² · d)

where n is the sequence length, d is the embedding dimension, and the d dimensions are split across h = 12 heads.
Memory Bandwidth
| Operation | Memory Access Pattern | Bandwidth Bound? |
|---|---|---|
| CNN | Strided access | No (compute bound) |
| BiLSTM | Sequential access | Partially |
| Attention | Random access (softmax) | Yes |
| FC layers | Dense access | No (compute bound) |
Optimization Strategies
Several strategies can further improve inference speed for production deployment.
1. Mixed Precision Inference
```python
import torch

model.eval()

# Autocast runs eligible ops in FP16 and keeps precision-sensitive ops in FP32;
# no manual model.half() or input casting is needed
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    predictions = model(input_batch)

# Speedup: ~1.5-2x on modern GPUs
```
| Precision | Throughput | Accuracy Impact |
|---|---|---|
| FP32 (baseline) | 31,625 samples/sec | Reference |
| FP16 (mixed) | ~48,000 samples/sec | Negligible (<0.1% RMSE) |
| INT8 (quantized) | ~62,000 samples/sec | Small (~0.3% RMSE) |
2. ONNX Runtime Optimization
```python
import onnxruntime as ort
import torch

# Export to ONNX; dummy_input matches the model's input shape [batch, 50, 17]
dummy_input = torch.randn(1, 50, 17)
torch.onnx.export(model, dummy_input, "amnl.onnx")

# Load with graph optimizations enabled
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession("amnl.onnx", sess_options)

# Speedup: ~1.3x over PyTorch
```
3. TensorRT Compilation
```python
import torch
import torch_tensorrt

# Compile for TensorRT with FP16 precision enabled
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((256, 50, 17))],
    enabled_precisions={torch.float16},
)

# Speedup: ~2-3x over PyTorch FP32
```
Optimization Summary
| Optimization | Throughput | Speedup | Complexity |
|---|---|---|---|
| Baseline PyTorch FP32 | 31,625 | 1.0x | None |
| PyTorch FP16 | ~48,000 | 1.5x | Low |
| ONNX Runtime | ~41,000 | 1.3x | Low |
| TensorRT FP16 | ~78,000 | 2.5x | Medium |
| TensorRT INT8 | ~95,000 | 3.0x | High |
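The speedup column is simply each variant's throughput divided by the FP32 baseline; the table can be sanity-checked in a few lines:

```python
baseline = 31625  # PyTorch FP32 samples/sec
variants = {
    "PyTorch FP16": 48000,
    "ONNX Runtime": 41000,
    "TensorRT FP16": 78000,
    "TensorRT INT8": 95000,
}
speedups = {name: round(tput / baseline, 1) for name, tput in variants.items()}
print(speedups)
# {'PyTorch FP16': 1.5, 'ONNX Runtime': 1.3, 'TensorRT FP16': 2.5, 'TensorRT INT8': 3.0}
```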
Production Recommendation
For most industrial deployments, PyTorch FP16 provides the best balance of speedup (1.5×) and simplicity. TensorRT is recommended only when maximum throughput is required and deployment complexity is acceptable.
Summary
Inference Speed Analysis - Summary:
- Base throughput: 31,625 samples/second on RTX 5000
- Per-sample latency: 31.6 μs average, 48.7 μs at p99
- Primary bottleneck: BiLSTM encoder (39.2% of time)
- Easy optimization: FP16 provides 1.5× speedup with no code changes
- Maximum throughput: ~95K samples/sec with TensorRT INT8
| Industrial Scenario | Required Throughput | AMNL Headroom |
|---|---|---|
| Single factory (100 machines) | 100/sec | 316× |
| Regional fleet (1,000 engines) | 1,000/sec | 31.6× |
| Global fleet (10,000 engines) | 10,000/sec | 3.2× |
| High-frequency (10 Hz × 1,000) | 10,000/sec | 3.2× |
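The headroom column divides the measured baseline throughput by each scenario's required prediction rate:

```python
throughput = 31625  # measured samples/sec (FP32 baseline)

requirements = {
    "Single factory (100 machines)": 100,
    "Regional fleet (1,000 engines)": 1_000,
    "Global fleet (10,000 engines)": 10_000,
}
headroom = {name: throughput / req for name, req in requirements.items()}
for name, h in headroom.items():
    print(f"{name}: {h:.1f}x headroom")
```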
Key Insight: AMNL's inference speed of 31K samples/second exceeds industrial requirements by an order of magnitude for typical deployments. This headroom enables real-time monitoring at scale, high-frequency update rates, and room for future growth—all on a single GPU without specialized optimization. For extreme-scale deployments, TensorRT optimization can push throughput to nearly 100K samples/second.