Chapter 19

Inference Speed: 31K samples/second

Computational Efficiency

Learning Objectives

By the end of this section, you will:

  1. Understand AMNL's inference throughput of 31K samples/second
  2. Analyze latency characteristics for real-time deployment
  3. Identify computational bottlenecks in the inference pipeline
  4. Learn optimization strategies for production deployment
  5. Compare with industrial requirements for predictive maintenance
Core Insight: AMNL processes 31,625 samples per second on a single GPU—equivalent to monitoring over 31,000 engines simultaneously with 1-second update intervals. This throughput far exceeds typical industrial requirements, enabling real-time monitoring at scale.

Inference Throughput

Inference throughput measures how many predictions the model can make per unit time. This is critical for industrial deployment where thousands of assets may need simultaneous monitoring.

Benchmark Results

| Metric | Value | Context |
|---|---|---|
| Throughput | 31,625 samples/sec | Batch size 256 |
| Per-sample latency | 31.6 μs | Average |
| Batch latency | 8.1 ms | 256 samples |
| Hardware | NVIDIA RTX 5000 | 16 GB VRAM |

Throughput by Batch Size

Throughput varies significantly with batch size due to GPU utilization efficiency:

| Batch Size | Throughput (samples/sec) | GPU Utilization |
|---|---|---|
| 1 | ~2,500 | ~15% |
| 16 | ~12,000 | ~45% |
| 64 | ~22,000 | ~75% |
| 128 | ~28,000 | ~88% |
| 256 | ~31,625 | ~95% |
| 512 | ~32,100 | ~97% |

Optimal Batch Size

Batch size 256 provides the best balance between throughput and memory usage. Larger batches (512+) offer diminishing returns while consuming significantly more GPU memory.
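The relationship between batch latency and throughput is simple arithmetic, and the benchmark numbers can be cross-checked from it. A minimal sketch using the figures reported in the table above:

```python
def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Samples per second implied by one batch's end-to-end latency."""
    return batch_size / batch_latency_s

# Benchmark figures from the table above: 256 samples in 8.1 ms
print(round(throughput(256, 8.1e-3)))  # → 31605, consistent with ~31,625
```

The small gap between 31,605 and the reported 31,625 is expected: the table rounds batch latency to one decimal place.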

Latency Analysis

While throughput measures overall capacity, latency measures how quickly an individual prediction is returned—critical for real-time alerting systems.

Latency Breakdown

| Stage | Time (μs) | Percentage |
|---|---|---|
| Data preprocessing | 5.2 | 16.5% |
| CNN feature extraction | 3.8 | 12.0% |
| BiLSTM encoding | 12.4 | 39.2% |
| Multi-head attention | 6.1 | 19.3% |
| Task heads | 1.2 | 3.8% |
| Post-processing | 2.9 | 9.2% |
| **Total** | **31.6** | **100%** |

Latency Distribution

Latency varies across samples due to GPU scheduling and memory access patterns:

| Percentile | Latency (μs) | Use Case |
|---|---|---|
| p50 (median) | 28.4 | Typical case |
| p90 | 35.2 | Most samples |
| p99 | 48.7 | Edge cases |
| p99.9 | 72.1 | Rare outliers |
| Max observed | 124.3 | Cold start |

Real-Time Guarantee

For real-time systems, the p99 latency (48.7 μs) is more relevant than the average. Even at p99, AMNL returns predictions in under 50 microseconds—well within the millisecond-scale requirements of most industrial control systems.
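Percentile figures like those above are straightforward to compute from raw timing data. A minimal nearest-rank sketch using synthetic latencies as stand-ins for real measurements (the distribution parameters here are illustrative, not the benchmark data):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Synthetic per-sample latencies in microseconds (assumed distribution)
random.seed(0)
latencies = [random.gauss(30.0, 5.0) for _ in range(10_000)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
assert p99 > p50  # the tail always sits above the median
```

For production monitoring, a streaming estimator (e.g. a t-digest) avoids storing every sample; the principle is the same.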


Computational Bottlenecks

Understanding where computation time is spent helps identify optimization opportunities.

BiLSTM Dominates

The BiLSTM encoder accounts for 39.2% of inference time, making it the primary bottleneck:

  • Sequential nature: LSTMs process timesteps sequentially, limiting parallelization
  • 3-layer depth: Each layer adds latency
  • Bidirectional processing: Effectively 6 LSTM passes
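The sequential cost above is easy to quantify: each direction of each layer must visit every timestep in order. A sketch assuming the 3-layer bidirectional encoder and a 50-step input window (the window length is taken from the `(256, 50, 17)` input shape used in the TensorRT example later in this section):

```python
def sequential_steps(num_layers: int, seq_len: int, bidirectional: bool = True) -> int:
    """Timestep iterations that cannot be parallelized across time."""
    passes = num_layers * (2 if bidirectional else 1)  # 3 layers x 2 directions = 6 passes
    return passes * seq_len

# 3-layer BiLSTM over a 50-step window
print(sequential_steps(3, 50))  # → 300 inherently sequential steps
```

By contrast, the CNN and attention stages are parallel across all 50 timesteps, which is why the BiLSTM dominates latency despite not dominating FLOPs.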

Attention Overhead

Multi-head attention contributes 19.3% of latency with 12 attention heads:

$$\text{Attention FLOPs} = 4n^2 d + 2n^2 h$$

where $n$ is the sequence length, $d$ is the embedding dimension, and $h$ is the number of heads.
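Plugging plausible values into this formula gives a per-sample cost estimate. A sketch assuming $n = 50$ (the window length from the TensorRT input shape) and $h = 12$ heads; the embedding dimension of 256 is an assumed illustration, not a figure reported in this section:

```python
def attention_flops(n: int, d: int, h: int) -> int:
    """Per-block attention cost: 4*n^2*d + 2*n^2*h."""
    return 4 * n * n * d + 2 * n * n * h

# n=50 timesteps, d=256 (assumed), h=12 heads
print(attention_flops(50, 256, 12))  # → 2620000 FLOPs per sample
```

Note the quadratic dependence on $n$: doubling the window length quadruples attention cost, while the BiLSTM's cost grows only linearly.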

Memory Bandwidth

| Operation | Memory Access Pattern | Bandwidth Bound? |
|---|---|---|
| CNN | Strided access | No (compute bound) |
| BiLSTM | Sequential access | Partially |
| Attention | Random access (softmax) | Yes |
| FC layers | Dense access | No (compute bound) |

Optimization Strategies

Several strategies can further improve inference speed for production deployment.

1. Mixed Precision Inference

```python
import torch

model.eval()  # inference mode: disable dropout, etc.

# Automatic mixed precision: run matmuls/convs in FP16 while keeping
# numerically sensitive ops (softmax, reductions) in FP32
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    predictions = model(input_batch)

# Speedup: ~1.5-2x on modern GPUs
```

| Precision | Throughput | Accuracy Impact |
|---|---|---|
| FP32 (baseline) | 31,625 samples/sec | Reference |
| FP16 (mixed) | ~48,000 samples/sec | Negligible (<0.1% RMSE) |
| INT8 (quantized) | ~62,000 samples/sec | Small (~0.3% RMSE) |

2. ONNX Runtime Optimization

```python
import torch
import onnxruntime as ort

# Export the trained model to ONNX
torch.onnx.export(model, dummy_input, "amnl.onnx")

# Load with graph-level optimizations enabled
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession("amnl.onnx", sess_options)

# Speedup: ~1.3x over PyTorch
```

3. TensorRT Compilation

```python
import torch
import torch_tensorrt

# Compile the model with Torch-TensorRT for FP16 inference
# Input shape: (batch=256, timesteps=50, features=17)
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((256, 50, 17))],
    enabled_precisions={torch.float16},
)

# Speedup: ~2-3x over PyTorch FP32
```

Optimization Summary

| Optimization | Throughput | Speedup | Complexity |
|---|---|---|---|
| Baseline PyTorch FP32 | 31,625 | 1.0× | None |
| PyTorch FP16 | ~48,000 | 1.5× | Low |
| ONNX Runtime | ~41,000 | 1.3× | Low |
| TensorRT FP16 | ~78,000 | 2.5× | Medium |
| TensorRT INT8 | ~95,000 | 3.0× | High |

Production Recommendation

For most industrial deployments, PyTorch FP16 provides the best balance of speedup (1.5×) and simplicity. TensorRT is recommended only when maximum throughput is required and deployment complexity is acceptable.


Summary

Inference Speed Analysis - Summary:

  1. Base throughput: 31,625 samples/second on RTX 5000
  2. Per-sample latency: 31.6 μs average, 48.7 μs at p99
  3. Primary bottleneck: BiLSTM encoder (39.2% of time)
  4. Easy optimization: FP16 autocast provides a ~1.5× speedup with minimal code changes
  5. Maximum throughput: ~95K samples/sec with TensorRT INT8

| Industrial Scenario | Required Throughput | AMNL Headroom |
|---|---|---|
| Single factory (100 machines) | 100/sec | 316× |
| Regional fleet (1,000 engines) | 1,000/sec | 31.6× |
| Global fleet (10,000 engines) | 10,000/sec | 3.2× |
| High-frequency (10 Hz × 1,000) | 10,000/sec | 3.2× |
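The headroom figures follow directly from dividing the measured baseline throughput by each scenario's required rate:

```python
BASELINE_THROUGHPUT = 31_625  # samples/sec, FP32 baseline from this section

def headroom(required_per_sec: float) -> float:
    """How many times over AMNL covers a given monitoring load."""
    return BASELINE_THROUGHPUT / required_per_sec

print(round(headroom(100)))      # → 316 (single factory)
print(round(headroom(10_000), 2))  # 10,000-engine fleet, ≈3.16×
```

Any scenario with headroom above 1.0× fits on a single GPU without optimization; the TensorRT figures above roughly triple each entry.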
Key Insight: AMNL's inference speed of 31K samples/second exceeds industrial requirements by an order of magnitude for typical deployments. This headroom enables real-time monitoring at scale, high-frequency update rates, and room for future growth—all on a single GPU without specialized optimization. For extreme-scale deployments, TensorRT optimization can push throughput to nearly 100K samples/second.