Learning Objectives
By the end of this section, you will:
- Understand different model export formats and their use cases
- Convert AMNL to ONNX format for cross-platform deployment
- Optimize with TensorRT for maximum inference speed
- Validate exported models to ensure accuracy preservation
- Choose the right format for your deployment target
Core Insight: Exporting AMNL to optimized formats like ONNX and TensorRT enables deployment across diverse platforms—from cloud servers to edge devices—while achieving 2-3× inference speedup compared to native PyTorch execution.
Model Export Formats
Different deployment scenarios require different model formats. Here are the main options for AMNL:
| Format | Use Case | Speedup | Platform Support |
|---|---|---|---|
| PyTorch (.pt) | Development, flexible deployment | 1.0x (baseline) | Python environments |
| TorchScript (.ts) | Production Python, C++ integration | 1.1x | PyTorch ecosystem |
| ONNX (.onnx) | Cross-platform, multiple runtimes | 1.3x | Universal |
| TensorRT (.engine) | NVIDIA GPU optimization | 2.5x | NVIDIA GPUs only |
| OpenVINO (.xml) | Intel CPU/GPU optimization | 1.8x | Intel hardware |
Format Selection Guide
As a rule of thumb: develop in PyTorch, export to ONNX once you need portability beyond Python, and add a TensorRT build only when deployment is committed to NVIDIA GPUs. OpenVINO plays the same role on Intel hardware, and TorchScript is the lightest path into C++ (see the sketch below).
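TorchScript appears in the table but is not covered in detail below; for completeness, here is a minimal tracing sketch. It assumes the AMNL forward pass has no data-dependent control flow (a requirement for tracing); traced models can be loaded from C++ via libtorch.

```python
import torch

model.eval()
# Example input on the model's device: (batch, seq_len, features)
example = torch.randn(1, 50, 17).to(next(model.parameters()).device)

# Tracing records the operations executed for the example input
traced = torch.jit.trace(model, example)
traced.save("amnl_model.ts")

# The .ts file can be reloaded from Python or from the C++ libtorch API
reloaded = torch.jit.load("amnl_model.ts")
rul, health = reloaded(example)
```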
ONNX Conversion
ONNX (Open Neural Network Exchange) is the most versatile export format, supported by nearly all inference runtimes.
Basic ONNX Export
```python
import torch
import torch.onnx

def export_amnl_to_onnx(model, save_path, sequence_length=50, num_features=17):
    """
    Export AMNL model to ONNX format.

    Args:
        model: Trained AMNL PyTorch model
        save_path: Output path for .onnx file
        sequence_length: Input sequence length (default 50)
        num_features: Number of input features (default 17)
    """
    model.eval()

    # Create a dummy input matching the expected shape
    batch_size = 1  # Will be dynamic
    dummy_input = torch.randn(batch_size, sequence_length, num_features)

    # Move to the same device as the model
    device = next(model.parameters()).device
    dummy_input = dummy_input.to(device)

    # Export with dynamic batch size
    torch.onnx.export(
        model,
        dummy_input,
        save_path,
        export_params=True,
        opset_version=14,          # Use a recent opset for best compatibility
        do_constant_folding=True,  # Fold constant expressions at export time
        input_names=['sensor_data'],
        output_names=['rul_prediction', 'health_logits'],
        dynamic_axes={
            'sensor_data': {0: 'batch_size'},
            'rul_prediction': {0: 'batch_size'},
            'health_logits': {0: 'batch_size'}
        }
    )

    print(f"Model exported to {save_path}")
    return save_path

# Usage
export_amnl_to_onnx(model, "amnl_model.onnx")
```
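Before moving on, it is worth loading the exported file back and verifying the graph. A quick sketch using the separate `onnx` package:

```python
import onnx

# Structural sanity check of the exported graph
onnx_model = onnx.load("amnl_model.onnx")
onnx.checker.check_model(onnx_model)

# Confirm the names given at export time survived
print([i.name for i in onnx_model.graph.input])   # ['sensor_data']
print([o.name for o in onnx_model.graph.output])  # ['rul_prediction', 'health_logits']
```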
Handling AMNL-Specific Layers
AMNL uses BiLSTM and Multi-Head Attention, which require special handling during export:
```python
import torch

def prepare_model_for_export(model):
    """Prepare AMNL model for a clean ONNX export."""

    # Eval mode disables dropout and makes batch-norm layers use their
    # frozen running statistics
    model.eval()

    # Compact each LSTM's weights into one contiguous block; this also
    # avoids non-contiguous-weight warnings during export
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.LSTM):
            module.flatten_parameters()

    # Explicitly confirm batch-norm layers are in eval mode; model.eval()
    # above already covers this, so this loop is only a safeguard
    for module in model.modules():
        if isinstance(module, torch.nn.BatchNorm1d):
            module.eval()

    return model

# Prepare and export
model = prepare_model_for_export(model)
export_amnl_to_onnx(model, "amnl_model.onnx")
```
ONNX Runtime Inference
```python
import onnxruntime as ort
import numpy as np

class ONNXAMNLPredictor:
    """ONNX Runtime inference for AMNL model."""

    def __init__(self, onnx_path, device='cuda'):
        # Configure session options
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

        # Select execution provider (falls back to CPU if CUDA is unavailable)
        if device == 'cuda':
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        else:
            providers = ['CPUExecutionProvider']

        self.session = ort.InferenceSession(
            onnx_path,
            sess_options,
            providers=providers
        )

        # Get input/output names
        self.input_name = self.session.get_inputs()[0].name
        self.output_names = [o.name for o in self.session.get_outputs()]

    def predict(self, sensor_data):
        """
        Run inference on sensor data.

        Args:
            sensor_data: numpy array of shape (batch, seq_len, features)

        Returns:
            Tuple of (rul_predictions, health_logits)
        """
        # Ensure float32 for ONNX
        if sensor_data.dtype != np.float32:
            sensor_data = sensor_data.astype(np.float32)

        # Run inference
        outputs = self.session.run(
            self.output_names,
            {self.input_name: sensor_data}
        )

        return outputs[0], outputs[1]  # RUL, Health

# Usage
predictor = ONNXAMNLPredictor("amnl_model.onnx", device='cuda')
rul, health = predictor.predict(sensor_batch)
```
ONNX Runtime Performance
ONNX Runtime with CUDA provider typically achieves 1.3× speedup over native PyTorch. For CPU-only deployment, expect comparable or slightly better performance than PyTorch CPU.
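Speedup figures like these depend on hardware, batch size, and sequence length, so they are worth measuring on your own setup. A minimal timing harness (a sketch; `predictor` is the ONNXAMNLPredictor above and `model` the original PyTorch model):

```python
import time
import numpy as np
import torch

def measure_latency(fn, batch, warmup=10, iters=100):
    """Average wall-clock latency of fn(batch) in milliseconds."""
    for _ in range(warmup):
        fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        fn(batch)
    return (time.perf_counter() - start) / iters * 1000

batch = np.random.randn(64, 50, 17).astype(np.float32)
device = next(model.parameters()).device

# ONNX Runtime path
onnx_ms = measure_latency(lambda b: predictor.predict(b), batch)

# PyTorch path; for CUDA models, add torch.cuda.synchronize() inside the
# lambda, since GPU kernels launch asynchronously
with torch.no_grad():
    torch_ms = measure_latency(
        lambda b: model(torch.from_numpy(b).to(device)), batch)

print(f"PyTorch: {torch_ms:.2f} ms | ONNX Runtime: {onnx_ms:.2f} ms "
      f"| speedup: {torch_ms / onnx_ms:.2f}x")
```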
TensorRT Optimization
For maximum performance on NVIDIA GPUs, TensorRT provides aggressive optimizations including layer fusion, kernel auto-tuning, and precision calibration.
ONNX to TensorRT Conversion
```python
import tensorrt as trt

def convert_onnx_to_tensorrt(onnx_path, engine_path, fp16=True, max_batch=256):
    """
    Convert ONNX model to a TensorRT engine.

    Args:
        onnx_path: Path to ONNX model
        engine_path: Output path for TensorRT engine
        fp16: Enable FP16 precision (recommended)
        max_batch: Maximum batch size to support
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX model
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise RuntimeError("ONNX parsing failed")

    # Configure the builder
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # Set dynamic shapes: (batch, seq_len, features) = (?, 50, 17)
    profile = builder.create_optimization_profile()
    input_tensor = network.get_input(0)
    profile.set_shape(
        input_tensor.name,
        min=(1, 50, 17),
        opt=(64, 50, 17),
        max=(max_batch, 50, 17)
    )
    config.add_optimization_profile(profile)

    # Build the serialized engine
    engine = builder.build_serialized_network(network, config)
    if engine is None:
        raise RuntimeError("TensorRT engine build failed")

    # Save the engine
    with open(engine_path, 'wb') as f:
        f.write(engine)

    print(f"TensorRT engine saved to {engine_path}")
    return engine_path

# Convert with FP16
convert_onnx_to_tensorrt("amnl_model.onnx", "amnl_model.engine", fp16=True)
```
TensorRT Inference
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TensorRTAMNLPredictor:
    """TensorRT inference for AMNL model (TensorRT 8.x binding API)."""

    def __init__(self, engine_path, max_batch=256):
        # Load the serialized engine
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

        self.context = engine.create_execution_context()
        self.engine = engine

        # Allocate device buffers sized for the maximum batch
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self.stream = cuda.Stream()

        for binding in engine:
            # Dynamic dimensions appear as -1, so take the absolute
            # per-sample volume and scale by the maximum batch size
            size = abs(trt.volume(engine.get_binding_shape(binding)))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            itemsize = np.dtype(dtype).itemsize

            device_mem = cuda.mem_alloc(size * itemsize * max_batch)
            self.bindings.append(int(device_mem))

            if engine.binding_is_input(binding):
                self.inputs.append({'device': device_mem, 'dtype': dtype})
            else:
                self.outputs.append({'device': device_mem, 'dtype': dtype})

    def predict(self, sensor_data):
        """Run TensorRT inference."""
        batch_size = sensor_data.shape[0]

        # Bind the actual input shape for this call
        self.context.set_binding_shape(0, sensor_data.shape)

        # Transfer input to the GPU
        cuda.memcpy_htod_async(
            self.inputs[0]['device'],
            np.ascontiguousarray(sensor_data.astype(np.float32)),
            self.stream
        )

        # Execute asynchronously on the stream
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle
        )

        # Transfer outputs back to the CPU
        rul_output = np.empty((batch_size, 1), dtype=np.float32)
        health_output = np.empty((batch_size, 3), dtype=np.float32)

        cuda.memcpy_dtoh_async(rul_output, self.outputs[0]['device'], self.stream)
        cuda.memcpy_dtoh_async(health_output, self.outputs[1]['device'], self.stream)

        self.stream.synchronize()

        return rul_output, health_output

# Usage - roughly 2.5x faster than PyTorch on the same GPU
predictor = TensorRTAMNLPredictor("amnl_model.engine")
rul, health = predictor.predict(sensor_batch)
```
TensorRT Benefits
TensorRT achieves its 2.5× speedup through: (1) layer fusion, e.g. combining Conv+BN+ReLU into one kernel, (2) kernel auto-tuning for the specific GPU, (3) FP16 tensor cores, and (4) memory optimization. The trade-off is NVIDIA-only compatibility.
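Going further than FP16, the validation table below also lists a TensorRT INT8 row; reaching it requires calibration with representative data. A minimal sketch, assuming `calib_data` is a NumPy array of real sensor windows with shape `(N, 50, 17)` (the class name and cache-file name are illustrative):

```python
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class AMNLEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative sensor windows to TensorRT's entropy calibrator."""

    def __init__(self, calib_data, batch_size=64, cache_file="amnl_int8.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data = np.ascontiguousarray(calib_data.astype(np.float32))
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        # One device buffer large enough for a full calibration batch
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # Returning None tells TensorRT the calibration data is exhausted
        if self.index + self.batch_size > len(self.data):
            return None
        batch = np.ascontiguousarray(
            self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        # Reuse a previous calibration run if the cache file exists
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

Hooking it in means adding `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = AMNLEntropyCalibrator(calib_data)` to the builder configuration in `convert_onnx_to_tensorrt`, then re-validating against the INT8 tolerances in the table below.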
Export Validation
After exporting, validate that the converted model produces identical (or near-identical) outputs to the original PyTorch model.
Validation Script
```python
import torch
import numpy as np
import onnxruntime as ort

def validate_onnx_export(pytorch_model, onnx_path, num_samples=100, tolerance=1e-4):
    """
    Validate ONNX export against the PyTorch model.

    Args:
        pytorch_model: Original PyTorch model
        onnx_path: Path to exported ONNX model
        num_samples: Number of random samples to test
        tolerance: Maximum allowed difference

    Returns:
        True if validation passes, False otherwise
    """
    pytorch_model.eval()
    device = next(pytorch_model.parameters()).device

    # Load the ONNX model
    ort_session = ort.InferenceSession(onnx_path)
    input_name = ort_session.get_inputs()[0].name

    max_diff_rul = 0.0
    max_diff_health = 0.0

    for _ in range(num_samples):
        # Generate a random input
        test_input = torch.randn(1, 50, 17)

        # PyTorch inference (run on the model's device, compare on CPU)
        with torch.no_grad():
            pt_rul, pt_health = pytorch_model(test_input.to(device))
            pt_rul = pt_rul.cpu().numpy()
            pt_health = pt_health.cpu().numpy()

        # ONNX inference
        onnx_rul, onnx_health = ort_session.run(
            None,
            {input_name: test_input.numpy()}
        )

        # Compare outputs
        max_diff_rul = max(max_diff_rul, np.abs(pt_rul - onnx_rul).max())
        max_diff_health = max(max_diff_health, np.abs(pt_health - onnx_health).max())

    print(f"Max RUL difference: {max_diff_rul:.6f}")
    print(f"Max Health difference: {max_diff_health:.6f}")

    passed = max_diff_rul < tolerance and max_diff_health < tolerance
    print(f"Validation {'PASSED' if passed else 'FAILED'}")

    return passed

# Validate export
validate_onnx_export(model, "amnl_model.onnx")
```
Expected Validation Results
| Format | Max RUL Diff | Max Health Diff | Status |
|---|---|---|---|
| ONNX (FP32) | <1e-5 | <1e-5 | Effectively exact |
| ONNX (FP16) | <1e-3 | <1e-3 | Acceptable |
| TensorRT (FP32) | <1e-4 | <1e-4 | Near-exact |
| TensorRT (FP16) | <1e-2 | <1e-2 | Acceptable |
| TensorRT (INT8) | <0.1 | <0.05 | Validate carefully |
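The script above covers ONNX only, but the same check extends to TensorRT engines by swapping in the TensorRTAMNLPredictor class from earlier. A sketch, with the 1e-2 tolerance taken from the FP16 row above:

```python
import numpy as np
import torch

def validate_tensorrt_export(pytorch_model, engine_path,
                             num_samples=100, tolerance=1e-2):
    """Compare TensorRT engine outputs against the PyTorch reference."""
    predictor = TensorRTAMNLPredictor(engine_path)
    pytorch_model.eval()
    device = next(pytorch_model.parameters()).device

    max_diff = 0.0
    for _ in range(num_samples):
        x = torch.randn(8, 50, 17)
        with torch.no_grad():
            pt_rul, pt_health = pytorch_model(x.to(device))
        trt_rul, trt_health = predictor.predict(x.numpy())
        max_diff = max(
            max_diff,
            np.abs(pt_rul.cpu().numpy() - trt_rul).max(),
            np.abs(pt_health.cpu().numpy() - trt_health).max(),
        )

    print(f"Max difference vs PyTorch: {max_diff:.6f}")
    return max_diff < tolerance
```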
FP16 Precision
FP16 exports may show small numerical differences (up to 0.01 RMSE) compared to FP32. This is expected and acceptable for production use, as the performance improvement outweighs the minimal accuracy impact.
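For the ONNX (FP16) row in the table, one common route is post-export weight conversion. A sketch assuming the onnxconverter-common package is installed:

```python
import onnx
from onnxconverter_common import float16

# Convert weights and activations to FP16, keeping FP32 inputs/outputs
# so the calling code does not need to change
model_fp32 = onnx.load("amnl_model.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "amnl_model_fp16.onnx")
```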
Summary
Model Export and ONNX Conversion - Summary:
- ONNX for compatibility: Best choice for cross-platform deployment with 1.3× speedup
- TensorRT for speed: 2.5× speedup on NVIDIA GPUs with FP16 optimization
- Dynamic batching: Export with dynamic batch size for flexible deployment
- Always validate: Confirm numerical accuracy after export
- FP16 is safe: Minimal accuracy impact with significant speed gains
| Deployment Target | Recommended Format | Expected Speedup |
|---|---|---|
| NVIDIA Cloud/Server | TensorRT FP16 | 2.5× |
| Mixed Cloud | ONNX Runtime | 1.3× |
| Edge (NVIDIA Jetson) | TensorRT FP16 | 2.0× |
| Edge (Generic) | ONNX Runtime | 1.2× |
| C++ Integration | TorchScript | 1.1× |
Key Insight: Model export is the bridge between research and production. ONNX provides the universal compatibility needed for diverse deployment scenarios, while TensorRT unlocks maximum performance on NVIDIA hardware. Always validate exported models before deployment to ensure the optimization process hasn't compromised prediction accuracy.