Chapter 20

Model Export and ONNX Conversion

Production Deployment

Learning Objectives

By the end of this section, you will:

  1. Understand different model export formats and their use cases
  2. Convert AMNL to ONNX format for cross-platform deployment
  3. Optimize with TensorRT for maximum inference speed
  4. Validate exported models to ensure accuracy preservation
  5. Choose the right format for your deployment target

Core Insight: Exporting AMNL to optimized formats such as ONNX and TensorRT enables deployment across diverse platforms—from cloud servers to edge devices—while achieving a 2-3× inference speedup over native PyTorch execution.

Model Export Formats

Different deployment scenarios require different model formats. Here are the main options for AMNL:

| Format | Use Case | Speedup | Platform Support |
|---|---|---|---|
| PyTorch (.pt) | Development, flexible deployment | 1.0× (baseline) | Python environments |
| TorchScript (.ts) | Production Python, C++ integration | 1.1× | PyTorch ecosystem |
| ONNX (.onnx) | Cross-platform, multiple runtimes | 1.3× | Universal |
| TensorRT (.engine) | NVIDIA GPU optimization | 2.5× | NVIDIA GPUs only |
| OpenVINO (.xml) | Intel CPU/GPU optimization | 1.8× | Intel hardware |
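As a rough rule of thumb, this table can be reduced to a small lookup. The helper below is an illustrative sketch — the target names are invented for this example and are not part of any AMNL API — with ONNX as the safe default whenever the target is unknown:

```python
def recommend_format(target: str) -> str:
    """Map a deployment target to a recommended export format (sketch)."""
    recommendations = {
        "nvidia_server": "tensorrt",      # 2.5x, but NVIDIA GPUs only
        "nvidia_jetson": "tensorrt",
        "intel_cpu": "openvino",          # 1.8x on Intel hardware
        "cpp_integration": "torchscript",
    }
    # ONNX runs on nearly every inference runtime, so it is the universal default
    return recommendations.get(target, "onnx")

print(recommend_format("nvidia_server"))  # tensorrt
print(recommend_format("raspberry_pi"))   # onnx
```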

ONNX Conversion

ONNX (Open Neural Network Exchange) is the most versatile export format, supported by nearly all inference runtimes.

Basic ONNX Export

```python
import torch
import torch.onnx

def export_amnl_to_onnx(model, save_path, sequence_length=50, num_features=17):
    """
    Export AMNL model to ONNX format.

    Args:
        model: Trained AMNL PyTorch model
        save_path: Output path for .onnx file
        sequence_length: Input sequence length (default 50)
        num_features: Number of input features (default 17)
    """
    model.eval()

    # Create dummy input matching the expected shape
    batch_size = 1  # Placeholder; the batch axis is exported as dynamic below
    dummy_input = torch.randn(batch_size, sequence_length, num_features)

    # Move to the same device as the model
    device = next(model.parameters()).device
    dummy_input = dummy_input.to(device)

    # Export with a dynamic batch dimension
    torch.onnx.export(
        model,
        dummy_input,
        save_path,
        export_params=True,
        opset_version=14,  # Recent opset for best operator coverage
        do_constant_folding=True,  # Pre-compute constant subgraphs
        input_names=['sensor_data'],
        output_names=['rul_prediction', 'health_logits'],
        dynamic_axes={
            'sensor_data': {0: 'batch_size'},
            'rul_prediction': {0: 'batch_size'},
            'health_logits': {0: 'batch_size'}
        }
    )

    print(f"Model exported to {save_path}")
    return save_path

# Usage
export_amnl_to_onnx(model, "amnl_model.onnx")
```

Handling AMNL-Specific Layers

AMNL uses BiLSTM and Multi-Head Attention, which require special handling during export:

```python
# LSTM parameters may need explicit handling before tracing
def prepare_model_for_export(model):
    """Prepare AMNL model for clean ONNX export."""

    # Eval mode disables dropout and makes BatchNorm layers use their
    # stored running statistics, so no manual freezing is required
    model.eval()

    # Compact LSTM weights into contiguous memory; this avoids
    # export-time warnings about non-contiguous parameters
    for module in model.modules():
        if isinstance(module, torch.nn.LSTM):
            module.flatten_parameters()

    return model

# Prepare and export
model = prepare_model_for_export(model)
export_amnl_to_onnx(model, "amnl_model.onnx")
```

ONNX Runtime Inference

```python
import onnxruntime as ort
import numpy as np

class ONNXAMNLPredictor:
    """ONNX Runtime inference for AMNL model."""

    def __init__(self, onnx_path, device='cuda'):
        # Configure session options
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

        # Select execution provider (falls back to CPU if CUDA is unavailable)
        if device == 'cuda':
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        else:
            providers = ['CPUExecutionProvider']

        self.session = ort.InferenceSession(
            onnx_path,
            sess_options,
            providers=providers
        )

        # Cache input/output names
        self.input_name = self.session.get_inputs()[0].name
        self.output_names = [o.name for o in self.session.get_outputs()]

    def predict(self, sensor_data):
        """
        Run inference on sensor data.

        Args:
            sensor_data: numpy array of shape (batch, seq_len, features)

        Returns:
            Tuple of (rul_predictions, health_logits)
        """
        # ONNX Runtime expects float32 inputs
        if sensor_data.dtype != np.float32:
            sensor_data = sensor_data.astype(np.float32)

        # Run inference
        outputs = self.session.run(
            self.output_names,
            {self.input_name: sensor_data}
        )

        return outputs[0], outputs[1]  # RUL, health

# Usage
predictor = ONNXAMNLPredictor("amnl_model.onnx", device='cuda')
rul, health = predictor.predict(sensor_batch)
```

ONNX Runtime Performance

ONNX Runtime with the CUDA execution provider typically achieves about a 1.3× speedup over native PyTorch. For CPU-only deployment, expect performance comparable to, or slightly better than, PyTorch on CPU.
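Rather than trusting published speedup figures, it is worth measuring latency on your own hardware. Below is a minimal, framework-agnostic benchmarking sketch; `dummy_fn` is a stand-in for a real predictor such as `predictor.predict`:

```python
import time
import numpy as np

def benchmark(predict_fn, batch, n_warmup=10, n_runs=100):
    """Return the median per-call latency of predict_fn in milliseconds."""
    for _ in range(n_warmup):              # Warm-up excludes one-time setup costs
        predict_fn(batch)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings))       # Median resists scheduler noise

# Stand-in predictor for illustration; swap in the real ONNX/PyTorch predictor
dummy_fn = lambda x: x.mean(axis=1)
batch = np.random.randn(64, 50, 17).astype(np.float32)
print(f"Median latency: {benchmark(dummy_fn, batch):.3f} ms")
```

Comparing the median for the PyTorch predictor against the ONNX Runtime one on the same batch gives the actual speedup for your model and hardware.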


TensorRT Optimization

For maximum performance on NVIDIA GPUs, TensorRT provides aggressive optimizations including layer fusion, kernel auto-tuning, and precision calibration.

ONNX to TensorRT Conversion

```python
import tensorrt as trt

def convert_onnx_to_tensorrt(onnx_path, engine_path, fp16=True, max_batch=256):
    """
    Convert ONNX model to a TensorRT engine.

    Args:
        onnx_path: Path to ONNX model
        engine_path: Output path for TensorRT engine
        fp16: Enable FP16 precision (recommended)
        max_batch: Maximum batch size to support
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX model
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    # Configure the builder
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # Define the dynamic-shape optimization profile:
    # (batch, seq_len, features) = (?, 50, 17)
    profile = builder.create_optimization_profile()
    input_tensor = network.get_input(0)
    profile.set_shape(
        input_tensor.name,
        min=(1, 50, 17),
        opt=(64, 50, 17),
        max=(max_batch, 50, 17)
    )
    config.add_optimization_profile(profile)

    # Build and serialize the engine
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed")

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)

    print(f"TensorRT engine saved to {engine_path}")
    return engine_path

# Convert with FP16
convert_onnx_to_tensorrt("amnl_model.onnx", "amnl_model.engine", fp16=True)
```

TensorRT Inference

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Initializes the CUDA context
import numpy as np

class TensorRTAMNLPredictor:
    """TensorRT inference for AMNL model."""

    MAX_BATCH = 256  # Must match the engine's optimization profile

    def __init__(self, engine_path):
        # Deserialize the engine
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

        self.engine = engine
        self.context = engine.create_execution_context()

        # Allocate device buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self.stream = cuda.Stream()

        for binding in engine:
            # Dynamic batch dimensions appear as -1, so take the absolute
            # value of the volume and scale by the maximum batch size
            size = abs(trt.volume(engine.get_binding_shape(binding)))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            itemsize = np.dtype(dtype).itemsize

            # Allocate device memory for the largest supported batch
            device_mem = cuda.mem_alloc(size * itemsize * self.MAX_BATCH)
            self.bindings.append(int(device_mem))

            if engine.binding_is_input(binding):
                self.inputs.append({'device': device_mem, 'dtype': dtype})
            else:
                self.outputs.append({'device': device_mem, 'dtype': dtype})

    def predict(self, sensor_data):
        """Run TensorRT inference."""
        batch_size = sensor_data.shape[0]

        # Set the dynamic input shape for this batch
        self.context.set_binding_shape(0, sensor_data.shape)

        # Transfer input to GPU
        cuda.memcpy_htod_async(
            self.inputs[0]['device'],
            np.ascontiguousarray(sensor_data.astype(np.float32)),
            self.stream
        )

        # Execute
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle
        )

        # Transfer outputs back to the CPU
        rul_output = np.empty((batch_size, 1), dtype=np.float32)
        health_output = np.empty((batch_size, 3), dtype=np.float32)

        cuda.memcpy_dtoh_async(rul_output, self.outputs[0]['device'], self.stream)
        cuda.memcpy_dtoh_async(health_output, self.outputs[1]['device'], self.stream)

        self.stream.synchronize()

        return rul_output, health_output

# Usage - roughly 2.5x faster than PyTorch on NVIDIA GPUs
predictor = TensorRTAMNLPredictor("amnl_model.engine")
rul, health = predictor.predict(sensor_batch)
```

TensorRT Benefits

TensorRT achieves its roughly 2.5× speedup through: (1) layer fusion that combines sequences such as Conv+BN+ReLU into single kernels, (2) kernel auto-tuning for the specific target GPU, (3) FP16 execution on tensor cores, and (4) memory layout optimization. The trade-off is NVIDIA-only compatibility.


Export Validation

After exporting, validate that the converted model produces identical (or near-identical) outputs to the original PyTorch model.

Validation Script

```python
import torch
import numpy as np
import onnxruntime as ort

def validate_onnx_export(pytorch_model, onnx_path, num_samples=100, tolerance=1e-4):
    """
    Validate ONNX export against the PyTorch model.

    Args:
        pytorch_model: Original PyTorch model
        onnx_path: Path to exported ONNX model
        num_samples: Number of random samples to test
        tolerance: Maximum allowed difference

    Returns:
        True if validation passes, False otherwise
    """
    pytorch_model.eval()
    device = next(pytorch_model.parameters()).device

    # Load ONNX model
    ort_session = ort.InferenceSession(onnx_path)
    input_name = ort_session.get_inputs()[0].name

    max_diff_rul = 0.0
    max_diff_health = 0.0

    for _ in range(num_samples):
        # Generate a random input
        test_input = torch.randn(1, 50, 17)

        # PyTorch inference (run on the model's device, compare on CPU)
        with torch.no_grad():
            pt_rul, pt_health = pytorch_model(test_input.to(device))
            pt_rul = pt_rul.cpu().numpy()
            pt_health = pt_health.cpu().numpy()

        # ONNX inference
        onnx_rul, onnx_health = ort_session.run(
            None,
            {input_name: test_input.numpy()}
        )

        # Track the worst-case difference across all samples
        max_diff_rul = max(max_diff_rul, np.abs(pt_rul - onnx_rul).max())
        max_diff_health = max(max_diff_health, np.abs(pt_health - onnx_health).max())

    print(f"Max RUL difference: {max_diff_rul:.6f}")
    print(f"Max Health difference: {max_diff_health:.6f}")

    passed = max_diff_rul < tolerance and max_diff_health < tolerance
    print(f"Validation {'PASSED' if passed else 'FAILED'}")

    return passed

# Validate export
validate_onnx_export(model, "amnl_model.onnx")
```

Expected Validation Results

| Format | Max RUL Diff | Max Health Diff | Status |
|---|---|---|---|
| ONNX (FP32) | <1e-5 | <1e-5 | Exact match |
| ONNX (FP16) | <1e-3 | <1e-3 | Acceptable |
| TensorRT (FP32) | <1e-4 | <1e-4 | Near-exact |
| TensorRT (FP16) | <1e-2 | <1e-2 | Acceptable |
| TensorRT (INT8) | <0.1 | <0.05 | Validate carefully |
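These thresholds can be folded directly into an automated check. The sketch below simply restates the table; the dictionary and function names are illustrative, not part of any AMNL API:

```python
import numpy as np

# (max RUL diff, max health diff) restated from the table above
TOLERANCES = {
    "onnx_fp32": (1e-5, 1e-5),
    "onnx_fp16": (1e-3, 1e-3),
    "trt_fp32":  (1e-4, 1e-4),
    "trt_fp16":  (1e-2, 1e-2),
    "trt_int8":  (0.1, 0.05),
}

def outputs_match(ref_rul, test_rul, ref_health, test_health, fmt):
    """Return True if exported outputs stay within the per-format tolerance."""
    tol_rul, tol_health = TOLERANCES[fmt]
    return (np.abs(ref_rul - test_rul).max() < tol_rul and
            np.abs(ref_health - test_health).max() < tol_health)

# Example: a 5e-3 deviation fails the FP32 budget but passes the FP16 one
ref = np.zeros((4, 1), dtype=np.float32)
noisy = ref + 5e-3
health = np.zeros((4, 3), dtype=np.float32)
print(outputs_match(ref, noisy, health, health, "onnx_fp32"))  # False
print(outputs_match(ref, noisy, health, health, "trt_fp16"))   # True
```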

FP16 Precision

FP16 exports may show small numerical differences (up to 0.01 RMSE) compared to FP32. This is expected and acceptable for production use, as the performance improvement outweighs the minimal accuracy impact.
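The size of these FP16 differences follows directly from the format itself: float16 carries an 11-bit significand, i.e. roughly three decimal digits and a relative error bounded near 5e-4 per rounding. A quick numpy illustration, independent of any inference runtime:

```python
import numpy as np

x = np.float32(123.456789)
x16 = np.float16(x)                # Round-trip through half precision
back = np.float32(x16)
print(back)                        # 123.4375 — float16 spacing near 128 is 0.0625

rel_err = abs(back - x) / abs(x)
print(rel_err < 5e-4)              # True: within the half-precision error bound
```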


Summary

Model Export and ONNX Conversion - Summary:

  1. ONNX for compatibility: Best choice for cross-platform deployment with 1.3× speedup
  2. TensorRT for speed: 2.5× speedup on NVIDIA GPUs with FP16 optimization
  3. Dynamic batching: Export with dynamic batch size for flexible deployment
  4. Always validate: Confirm numerical accuracy after export
  5. FP16 is safe: Minimal accuracy impact with significant speed gains

| Deployment Target | Recommended Format | Expected Speedup |
|---|---|---|
| NVIDIA Cloud/Server | TensorRT FP16 | 2.5× |
| Mixed Cloud | ONNX Runtime | 1.3× |
| Edge (NVIDIA Jetson) | TensorRT FP16 | 2.0× |
| Edge (Generic) | ONNX Runtime | 1.2× |
| C++ Integration | TorchScript | 1.1× |

Key Insight: Model export is the bridge between research and production. ONNX provides the universal compatibility needed for diverse deployment scenarios, while TensorRT unlocks maximum performance on NVIDIA hardware. Always validate exported models before deployment to ensure the optimization process hasn't compromised prediction accuracy.