Learning Objectives
By the end of this section, you will:
- Understand AMNL's inference throughput of 31K samples/second
- Analyze latency characteristics for real-time deployment
- Identify computational bottlenecks in the inference pipeline
- Learn optimization strategies for production deployment
- Compare with industrial requirements for predictive maintenance
Core Insight: AMNL processes 31,625 samples per second on a single GPU—equivalent to monitoring over 31,000 engines simultaneously with 1-second update intervals. This throughput far exceeds typical industrial requirements, enabling real-time monitoring at scale.
Inference Throughput
Inference throughput measures how many predictions the model can make per unit time. This is critical for industrial deployment where thousands of assets may need simultaneous monitoring.
Benchmark Results
| Metric | Value | Context |
|---|---|---|
| Throughput | 31,625 samples/sec | Batch size 256 |
| Per-sample latency | 31.6 μs | Average |
| Batch latency | 8.1 ms | 256 samples |
| Hardware | NVIDIA RTX 5000 | 16GB VRAM |
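The per-sample and batch latencies in the table follow directly from the throughput figure; a quick back-of-the-envelope check in plain Python:

```python
throughput = 31625   # measured samples/sec at batch size 256
batch_size = 256

per_sample_us = 1e6 / throughput                   # average latency per sample, microseconds
batch_latency_ms = batch_size / throughput * 1e3   # wall time for one batch, milliseconds

print(round(per_sample_us, 1))     # 31.6
print(round(batch_latency_ms, 1))  # 8.1
```

Note that the 31.6 μs "per-sample latency" is an amortized figure: individual samples share one 8.1 ms batch pass, so a single sample submitted alone would see the full batch latency.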
Throughput by Batch Size
Throughput varies significantly with batch size because small batches leave most of the GPU idle:
| Batch Size | Throughput (samples/sec) | GPU Utilization |
|---|---|---|
| 1 | ~2,500 | ~15% |
| 16 | ~12,000 | ~45% |
| 64 | ~22,000 | ~75% |
| 128 | ~28,000 | ~88% |
| 256 | ~31,625 | ~95% |
| 512 | ~32,100 | ~97% |
Optimal Batch Size
Batch size 256 provides the best balance between throughput and memory usage. Larger batches (512+) offer diminishing returns while consuming significantly more GPU memory.
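One way to operationalize this trade-off is to pick the smallest batch size that reaches a target fraction of peak measured throughput. A sketch using the numbers from the sweep above (the 95% threshold is an illustrative choice, not part of the benchmark):

```python
# Approximate throughput sweep from the table (samples/sec)
sweep = {1: 2500, 16: 12000, 64: 22000, 128: 28000, 256: 31625, 512: 32100}

def pick_batch_size(sweep, fraction=0.95):
    """Smallest batch size achieving `fraction` of peak throughput."""
    peak = max(sweep.values())
    return min(b for b, t in sweep.items() if t >= fraction * peak)

print(pick_batch_size(sweep))  # 256
```

With a 95% threshold this selects batch size 256: batch 512 adds only ~1.5% more throughput while roughly doubling activation memory.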
What Does 31K Samples/Second Mean?
At a 1-second update interval per asset, 31,625 samples/second is enough capacity to monitor over 31,000 engines simultaneously on a single GPU.
Latency Analysis
While throughput measures overall capacity, latency measures how quickly an individual prediction is returned—critical for real-time alerting systems.
Latency Breakdown
| Stage | Time (μs) | Percentage |
|---|---|---|
| Data preprocessing | 5.2 | 16.5% |
| CNN feature extraction | 3.8 | 12.0% |
| BiLSTM encoding | 12.4 | 39.2% |
| Multi-head attention | 6.1 | 19.3% |
| Task heads | 1.2 | 3.8% |
| Post-processing | 2.9 | 9.2% |
| Total | 31.6 | 100% |
Latency Distribution
Latency varies across samples due to GPU scheduling and memory access patterns:
| Percentile | Latency (μs) | Use Case |
|---|---|---|
| p50 (median) | 28.4 | Typical case |
| p90 | 35.2 | Most samples |
| p99 | 48.7 | Edge cases |
| p99.9 | 72.1 | Rare outliers |
| Max observed | 124.3 | Cold start |
Real-Time Guarantee
For real-time systems, the p99 latency (48.7 μs) is more relevant than the average. Even at p99, AMNL returns predictions in under 50 microseconds—well within the millisecond-scale requirements of most industrial control systems.
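Tail percentiles like p99 are typically computed with the nearest-rank method over a window of recorded latencies. A minimal sketch (the sample data here is synthetic, for illustration only):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))  # 1-based rank
    return s[rank - 1]

latencies = list(range(1, 101))   # synthetic latency samples
print(percentile(latencies, 50))  # 50
print(percentile(latencies, 99))  # 99
```

In production, the same function would run over a sliding window of recent per-request latencies to track p99 drift over time.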
Computational Bottlenecks
Understanding where computation time is spent helps identify optimization opportunities.
BiLSTM Dominates
The BiLSTM encoder accounts for 39.2% of inference time, making it the primary bottleneck:
- Sequential nature: LSTMs process timesteps sequentially, limiting parallelization
- 3-layer depth: Each layer adds latency
- Bidirectional processing: Effectively 6 LSTM passes
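The cost of this sequentiality can be made concrete: every timestep of every pass must be evaluated in order, so the critical path grows with depth × directions × window length. (The window length of 50 timesteps is taken from the model's input shape of [batch, 50, 17].)

```python
seq_len = 50     # timesteps per window (model input is [batch, 50, 17])
layers = 3       # BiLSTM depth
directions = 2   # bidirectional

passes = layers * directions          # 6 LSTM passes
sequential_steps = seq_len * passes   # cell evaluations on the critical path

print(passes, sequential_steps)  # 6 300
```

By contrast, the CNN and attention stages process all 50 timesteps in parallel, which is why a stage with fewer FLOPs (the BiLSTM) still dominates wall-clock time.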
Attention Overhead
Multi-head attention contributes 19.3% of latency with 12 attention heads. Its cost scales as

O(n² · d)

where n is the sequence length, d is the embedding dimension, and the d dimensions are split across h = 12 heads.
Memory Bandwidth
| Operation | Memory Access Pattern | Bandwidth Bound? |
|---|---|---|
| CNN | Strided access | No (compute bound) |
| BiLSTM | Sequential access | Partially |
| Attention | Random access (softmax) | Yes |
| FC layers | Dense access | No (compute bound) |
Optimization Strategies
Several strategies can further improve inference speed for production deployment.
1. Mixed Precision Inference
```python
import torch

model.eval()

# Autocast runs eligible ops in FP16 and keeps precision-sensitive ops in FP32;
# no manual model.half() or input casting is needed
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    predictions = model(input_batch)

# Speedup: ~1.5-2x on modern GPUs
```
| Precision | Throughput | Accuracy Impact |
|---|---|---|
| FP32 (baseline) | 31,625 samples/sec | Reference |
| FP16 (mixed) | ~48,000 samples/sec | Negligible (<0.1% RMSE) |
| INT8 (quantized) | ~62,000 samples/sec | Small (~0.3% RMSE) |
2. ONNX Runtime Optimization
```python
import onnxruntime as ort
import torch

# Export to ONNX; dummy_input matches the model's input shape [batch, 50, 17]
dummy_input = torch.randn(1, 50, 17)
torch.onnx.export(model, dummy_input, "amnl.onnx")

# Load with graph optimizations enabled
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession("amnl.onnx", sess_options)

# Speedup: ~1.3x over PyTorch
```
3. TensorRT Compilation
```python
import torch
import torch_tensorrt

# Compile for TensorRT with FP16 precision enabled
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((256, 50, 17))],
    enabled_precisions={torch.float16},
)

# Speedup: ~2-3x over PyTorch FP32
```
Optimization Summary
| Optimization | Throughput | Speedup | Complexity |
|---|---|---|---|
| Baseline PyTorch FP32 | 31,625 | 1.0x | None |
| PyTorch FP16 | ~48,000 | 1.5x | Low |
| ONNX Runtime | ~41,000 | 1.3x | Low |
| TensorRT FP16 | ~78,000 | 2.5x | Medium |
| TensorRT INT8 | ~95,000 | 3.0x | High |
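The speedup column is simply each variant's throughput divided by the FP32 baseline; the table can be sanity-checked in a few lines:

```python
baseline = 31625  # PyTorch FP32 samples/sec
variants = {
    "PyTorch FP16": 48000,
    "ONNX Runtime": 41000,
    "TensorRT FP16": 78000,
    "TensorRT INT8": 95000,
}
speedups = {name: round(tput / baseline, 1) for name, tput in variants.items()}
print(speedups)
# {'PyTorch FP16': 1.5, 'ONNX Runtime': 1.3, 'TensorRT FP16': 2.5, 'TensorRT INT8': 3.0}
```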
Production Recommendation
For most industrial deployments, PyTorch FP16 provides the best balance of speedup (1.5×) and simplicity. TensorRT is recommended only when maximum throughput is required and deployment complexity is acceptable.
Summary
Inference Speed Analysis - Summary:
- Base throughput: 31,625 samples/second on RTX 5000
- Per-sample latency: 31.6 μs average, 48.7 μs at p99
- Primary bottleneck: BiLSTM encoder (39.2% of time)
- Easy optimization: FP16 provides 1.5× speedup with no code changes
- Maximum throughput: ~95K samples/sec with TensorRT INT8
| Industrial Scenario | Required Throughput | AMNL Headroom |
|---|---|---|
| Single factory (100 machines) | 100/sec | 316× |
| Regional fleet (1,000 engines) | 1,000/sec | 31.6× |
| Global fleet (10,000 engines) | 10,000/sec | 3.2× |
| High-frequency (10 Hz × 1,000) | 10,000/sec | 3.2× |
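The headroom column divides the measured baseline throughput by each scenario's required prediction rate:

```python
throughput = 31625  # measured samples/sec (FP32 baseline)

requirements = {
    "Single factory (100 machines)": 100,
    "Regional fleet (1,000 engines)": 1_000,
    "Global fleet (10,000 engines)": 10_000,
}
headroom = {name: throughput / req for name, req in requirements.items()}
for name, h in headroom.items():
    print(f"{name}: {h:.1f}x headroom")
```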
Key Insight: AMNL's inference speed of 31K samples/second exceeds industrial requirements by an order of magnitude for typical deployments. This headroom enables real-time monitoring at scale, high-frequency update rates, and room for future growth—all on a single GPU without specialized optimization. For extreme-scale deployments, TensorRT optimization can push throughput to nearly 100K samples/second.