Chapter 17

Model Export and Serving Infrastructure

Production Deployment

Introduction

This section covers exporting models to deployment formats (ONNX, TensorRT) and building the serving infrastructure around them (a FastAPI server, dynamic batching, Docker, and cloud hosting) for a production translation service.


ONNX Export

Portable Model Format

🐍python
import torch
import torch.nn as nn
from typing import Dict


def onnx_overview():
    """
    Overview of ONNX export.
    """
    print("=" * 70)
    print("ONNX: OPEN NEURAL NETWORK EXCHANGE")
    print("=" * 70)

    print("""
    WHAT IS ONNX?
    ─────────────

    ONNX (Open Neural Network Exchange) is an open format for
    representing machine learning models.

    Benefits:
    ┌─────────────────────────────────────────────────────────────────┐
    │  • Interoperability: Works across frameworks                   │
    │  • Optimization: Runtime-specific optimizations                │
    │  • Deployment: Run on various hardware                         │
    │  • Portability: Single format for all platforms                │
    └─────────────────────────────────────────────────────────────────┘

    Workflow:
    ┌─────────┐     ┌─────────┐     ┌─────────────────────────┐
    │ PyTorch │ ──► │  ONNX   │ ──► │  ONNX Runtime (CPU/GPU) │
    │  Model  │     │  Model  │     │  TensorRT               │
    └─────────┘     └─────────┘     │  OpenVINO               │
                                    │  CoreML                 │
                                    └─────────────────────────┘


    SUPPORTED PLATFORMS:
    ────────────────────

    ONNX Runtime backends:
    • CPU: Default, optimized for Intel/AMD
    • CUDA: NVIDIA GPUs
    • TensorRT: Maximum NVIDIA GPU performance
    • OpenVINO: Intel hardware (CPU, iGPU, VPU)
    • CoreML: Apple devices
    • DirectML: Windows GPUs

    Typical speedup: often 1.2-2x over native PyTorch
    (depends on model and hardware)
    """)


class ONNXExporter:
    """
    Export PyTorch models to ONNX format.
    """

    def __init__(self, model: nn.Module, device: str = "cpu"):
        """
        Initialize exporter.

        Args:
            model: PyTorch model to export
            device: Device for export
        """
        self.model = model
        self.device = device
        self.model.to(device)
        self.model.eval()

    def export_encoder_decoder(
        self,
        output_path: str,
        max_length: int = 128,
        vocab_size: int = 32000,
        opset_version: int = 14
    ) -> str:
        """
        Export encoder-decoder model to ONNX.

        For translation models, we typically export:
        1. Encoder (processes source)
        2. Decoder (with KV cache for generation)

        This method exports the whole model as a single graph instead,
        which is simpler but re-runs the encoder at every decoding step.

        Args:
            output_path: Path to save ONNX model
            max_length: Maximum sequence length (caps the dummy source length)
            vocab_size: Vocabulary size
            opset_version: ONNX opset version

        Returns:
            Path to exported model
        """
        print("Exporting encoder-decoder model to ONNX...")

        # Create dummy inputs (only their shapes and dtypes matter for tracing)
        batch_size = 1
        src_len = min(32, max_length)

        dummy_input = {
            'src_ids': torch.randint(0, vocab_size, (batch_size, src_len)),
            'src_mask': torch.ones(batch_size, src_len),
            'tgt_ids': torch.randint(0, vocab_size, (batch_size, 1)),
        }

        # Move to device
        dummy_input = {k: v.to(self.device) for k, v in dummy_input.items()}

        # Dynamic axes for variable batch size and sequence length
        dynamic_axes = {
            'src_ids': {0: 'batch_size', 1: 'src_len'},
            'src_mask': {0: 'batch_size', 1: 'src_len'},
            'tgt_ids': {0: 'batch_size', 1: 'tgt_len'},
            'logits': {0: 'batch_size', 1: 'tgt_len'},
        }

        # Export
        torch.onnx.export(
            self.model,
            (dummy_input['src_ids'], dummy_input['src_mask'], dummy_input['tgt_ids']),
            output_path,
            input_names=['src_ids', 'src_mask', 'tgt_ids'],
            output_names=['logits'],
            dynamic_axes=dynamic_axes,
            opset_version=opset_version,
            do_constant_folding=True,
        )

        print(f"Model exported to {output_path}")

        # Verify export
        self._verify_export(output_path, dummy_input)

        return output_path

    def _verify_export(
        self,
        onnx_path: str,
        dummy_input: Dict[str, torch.Tensor]
    ):
        """Verify exported model matches PyTorch output."""
        import numpy as np
        import onnx
        import onnxruntime as ort

        # Check ONNX model
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print("ONNX model validation passed!")

        # Compare outputs. Naming the provider explicitly is required by
        # GPU builds of ONNX Runtime; the CPU provider is enough for a check.
        session = ort.InferenceSession(
            onnx_path, providers=['CPUExecutionProvider']
        )

        # PyTorch output
        with torch.no_grad():
            pytorch_output = self.model(
                dummy_input['src_ids'],
                dummy_input['src_mask'],
                dummy_input['tgt_ids']
            )

        # ONNX output (.cpu() so this also works when exporting on a GPU)
        onnx_inputs = {
            name: tensor.cpu().numpy() for name, tensor in dummy_input.items()
        }
        onnx_output = session.run(None, onnx_inputs)[0]

        # Compare
        pytorch_output = pytorch_output.cpu().numpy()
        max_diff = np.abs(pytorch_output - onnx_output).max()
        print(f"Max difference between PyTorch and ONNX: {max_diff:.6f}")

        if max_diff < 1e-4:
            print("✓ Export verification passed!")
        else:
            print("⚠ Warning: Outputs differ more than expected")


def onnx_export_example():
    """Show ONNX export code example."""
    print("\nONNX Export Example")
    print("=" * 60)

    code = '''
import torch
import torch.onnx

# Assume model is your trained TransformerModel
model.eval()

# Dummy inputs for tracing
batch_size = 1
src_len = 32
vocab_size = 32000

dummy_src = torch.randint(0, vocab_size, (batch_size, src_len))
dummy_mask = torch.ones(batch_size, src_len)
dummy_tgt = torch.randint(0, vocab_size, (batch_size, 1))

# Export
torch.onnx.export(
    model,
    (dummy_src, dummy_mask, dummy_tgt),
    "translator.onnx",
    input_names=['src_ids', 'src_mask', 'tgt_ids'],
    output_names=['logits'],
    dynamic_axes={
        'src_ids': {0: 'batch', 1: 'src_len'},
        'src_mask': {0: 'batch', 1: 'src_len'},
        'tgt_ids': {0: 'batch', 1: 'tgt_len'},
        'logits': {0: 'batch', 1: 'tgt_len'},
    },
    opset_version=14,
    do_constant_folding=True,
)

print("Exported to translator.onnx")

# Verify
import onnx
model_onnx = onnx.load("translator.onnx")
onnx.checker.check_model(model_onnx)

# Inference with ONNX Runtime
import onnxruntime as ort

# Providers are tried in order; CPU is the fallback when CUDA is unavailable
session = ort.InferenceSession(
    "translator.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Run inference (reusing the dummy tensors; pass real token IDs in practice)
outputs = session.run(
    None,
    {
        'src_ids': dummy_src.numpy(),
        'src_mask': dummy_mask.numpy(),
        'tgt_ids': dummy_tgt.numpy(),
    }
)
logits = outputs[0]
'''
    print(code)


onnx_export_example()
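
The verification step above compares the two outputs against a fixed absolute threshold of 1e-4. That can be too strict for large logits or reduced-precision exports, so a combined absolute and relative tolerance is a common alternative. The following is a minimal sketch in plain Python; the helper names `max_abs_diff` and `outputs_match`, the tolerances, and the sample values are illustrative assumptions, not part of the exporter.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rtol: float = 1e-3,
    atol: float = 1e-4,
) -> bool:
    """True if every pair satisfies |r - c| <= atol + rtol * |r|."""
    if len(reference) != len(candidate):
        return False
    return all(
        abs(r - c) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Hypothetical flattened logits from PyTorch and from ONNX Runtime
pytorch_logits = [12.5031, -3.2210, 0.0004]
onnx_logits = [12.5034, -3.2211, 0.0004]

print("max abs diff:", max_abs_diff(pytorch_logits, onnx_logits))
print("within tolerance:", outputs_match(pytorch_logits, onnx_logits))
```

With these sample values the fixed 1e-4 threshold would raise a warning, while the relative check accepts the difference because it is tiny compared with the magnitude of the largest logit.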

TensorRT Optimization

Maximum GPU Performance

🐍python
def tensorrt_overview():
    """
    Overview of TensorRT optimization.
    """
    print("=" * 70)
    print("TENSORRT: NVIDIA GPU OPTIMIZATION")
    print("=" * 70)

    print("""
    WHAT IS TENSORRT?
    ─────────────────

    TensorRT is NVIDIA's SDK for high-performance deep learning
    inference. It applies several optimizations:

    ┌────────────────────────────────────────────────────────────────┐
    │  1. LAYER FUSION                                               │
    │     Combine multiple layers into single kernels                │
    │     Conv + BatchNorm + ReLU → Single fused kernel              │
    │                                                                │
    │  2. REDUCED PRECISION                                          │
    │     FP16 or INT8 precision (INT8 needs calibration data)       │
    │     Usually small accuracy loss, large speedup                 │
    │                                                                │
    │  3. KERNEL AUTO-TUNING                                         │
    │     Select best kernel for your specific GPU                   │
    │     Engines are tied to the GPU they were built on             │
    │                                                                │
    │  4. MEMORY OPTIMIZATION                                        │
    │     Minimize memory transfers                                  │
    │     Reuse memory across layers                                 │
    └────────────────────────────────────────────────────────────────┘


    ILLUSTRATIVE LATENCIES (vary with model, batch size, and GPU):
    ──────────────────────────────────────────────────────────────

    ┌─────────────────┬────────────────┬──────────────────────────┐
    │  Framework      │  Latency (ms)  │  vs TensorRT FP16        │
    ├─────────────────┼────────────────┼──────────────────────────┤
    │  PyTorch FP32   │      45        │  4.5x slower             │
    │  PyTorch FP16   │      25        │  2.5x slower             │
    │  ONNX Runtime   │      20        │  2x slower               │
    │  TensorRT FP16  │      10        │  Baseline                │
    │  TensorRT INT8  │       6        │  1.7x faster             │
    └─────────────────┴────────────────┴──────────────────────────┘


    WORKFLOW:
    ─────────

    PyTorch → ONNX → TensorRT Engine → Deploy

    Or use Torch-TensorRT for direct conversion
    """)


def tensorrt_conversion_example():
    """Show TensorRT conversion code."""
    print("\nTensorRT Conversion")
    print("=" * 60)

    code = '''
# Method 1: ONNX to TensorRT
# =========================

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, fp16=True):
    """Build a serialized TensorRT engine from ONNX (None on failure)."""
    builder = trt.Builder(TRT_LOGGER)
    # EXPLICIT_BATCH is needed on TensorRT 8.x; newer releases always use
    # explicit batch and deprecate the flag
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse ONNX
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Build config
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # The exported model has dynamic axes, so TensorRT needs an optimization
    # profile with (min, opt, max) shapes for every input. The ranges here
    # are example values; match them to your traffic.
    profile = builder.create_optimization_profile()
    for name in ('src_ids', 'src_mask', 'tgt_ids'):
        profile.set_shape(name, (1, 1), (1, 32), (8, 128))
    config.add_optimization_profile(profile)

    # Build engine (serialized bytes, or None if the build fails)
    return builder.build_serialized_network(network, config)

# Build and save
engine = build_engine("translator.onnx", fp16=True)
if engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open("translator.trt", "wb") as f:
    f.write(engine)


# Method 2: Torch-TensorRT (simpler)
# ==================================

import torch
import torch_tensorrt

model = model.cuda().eval()

# Compile with Torch-TensorRT.
# ir="ts" selects the TorchScript frontend so that the result can be saved
# with torch.jit.save. List one Input per positional argument of forward().
trt_model = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[
        torch_tensorrt.Input(
            # A dynamic sequence length needs an explicit shape range
            min_shape=[1, 1],
            opt_shape=[1, 32],
            max_shape=[1, 128],
            dtype=torch.int64,
        ),
    ],
    enabled_precisions={torch.float16},
    workspace_size=1 << 30,
)

# Use like regular PyTorch model
output = trt_model(input_tensor)

# Save
torch.jit.save(trt_model, "translator_trt.ts")


# Inference with TensorRT
# =======================
# Note: the binding-based calls below are the TensorRT 8.x API (removed in
# TensorRT 10) and assume an engine built with static input shapes.

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context on import

# Load engine
with open("translator.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Allocate buffers
inputs = []
outputs = []
bindings = []

for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))

    if engine.binding_is_input(binding):
        inputs.append({'host': host_mem, 'device': device_mem})
    else:
        outputs.append({'host': host_mem, 'device': device_mem})

# Run inference (single input, single output)
def infer(input_data):
    # Copy input to device
    np.copyto(inputs[0]['host'], input_data.ravel())
    cuda.memcpy_htod(inputs[0]['device'], inputs[0]['host'])

    # Execute
    context.execute_v2(bindings)

    # Copy output to host
    cuda.memcpy_dtoh(outputs[0]['host'], outputs[0]['device'])

    return outputs[0]['host']
'''
    print(code)


tensorrt_conversion_example()
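
The "vs TensorRT" column in the latency table of the overview is just each latency divided by the baseline latency. As a small sketch of that arithmetic, the snippet below recomputes the column from the table's own figures; the function name `relative_to_baseline` is an illustrative choice, and the same helper could be pointed at latencies measured on your own hardware.

```python
from typing import Dict

# Latencies in milliseconds, copied from the overview table
latencies_ms: Dict[str, float] = {
    "PyTorch FP32": 45,
    "PyTorch FP16": 25,
    "ONNX Runtime": 20,
    "TensorRT FP16": 10,
    "TensorRT INT8": 6,
}


def relative_to_baseline(latencies: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return latency / baseline latency for every entry (1.0 = baseline)."""
    base = latencies[baseline]
    return {name: ms / base for name, ms in latencies.items()}


ratios = relative_to_baseline(latencies_ms, "TensorRT FP16")

for name, ratio in ratios.items():
    if ratio > 1:
        print(f"{name:<14} {ratio:.1f}x slower")
    elif ratio < 1:
        print(f"{name:<14} {1 / ratio:.1f}x faster")
    else:
        print(f"{name:<14} baseline")
```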

Model Serving with FastAPI

REST API for Translation

🐍python
def fastapi_serving_example():
    """Show FastAPI serving example."""
    print("=" * 70)
    print("MODEL SERVING WITH FASTAPI")
    print("=" * 70)

    code = '''
# translation_server.py

import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Translation API", version="1.0.0")

# Request/Response models
# (source_lang/target_lang are accepted but unused by this single-pair example)
class TranslationRequest(BaseModel):
    text: str
    source_lang: str = "de"
    target_lang: str = "en"
    beam_size: int = 5
    max_length: int = 128

class TranslationResponse(BaseModel):
    translation: str
    confidence: float
    tokens: int
    latency_ms: float

class BatchTranslationRequest(BaseModel):
    texts: List[str]
    source_lang: str = "de"
    target_lang: str = "en"
    beam_size: int = 5

class BatchTranslationResponse(BaseModel):
    translations: List[str]
    total_latency_ms: float


# Model loading
class TranslationService:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.model = self._load_model(model_path)
        self.tokenizer = self._load_tokenizer()
        self.executor = ThreadPoolExecutor(max_workers=4)

    def _load_model(self, path: str):
        # Load your model here. weights_only=False unpickles a full model
        # object (recent PyTorch releases default to True), so only load
        # checkpoints you trust.
        model = torch.load(path, map_location=self.device, weights_only=False)
        model.eval()
        return model

    def _load_tokenizer(self):
        # Placeholder: return the tokenizer that matches your model
        raise NotImplementedError("Load the tokenizer used during training")

    def translate(
        self,
        text: str,
        beam_size: int = 5,
        max_length: int = 128
    ) -> dict:
        start = time.perf_counter()

        # Tokenize
        input_ids = self.tokenizer.encode(text)
        input_ids = torch.tensor([input_ids]).to(self.device)

        # Generate
        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                max_length=max_length,
                num_beams=beam_size
            )

        # Decode
        translation = self.tokenizer.decode(output_ids[0])

        latency = (time.perf_counter() - start) * 1000

        return {
            "translation": translation,
            "confidence": 0.95,  # Placeholder; would compute from model scores
            "tokens": len(output_ids[0]),
            "latency_ms": latency
        }

    async def translate_async(self, *args, **kwargs):
        # Run the blocking inference call in a worker thread
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            self.executor, lambda: self.translate(*args, **kwargs)
        )


# Initialize service (MODEL_PATH is set in docker-compose.yml)
service = TranslationService(os.environ.get("MODEL_PATH", "model.pt"))


@app.post("/translate", response_model=TranslationResponse)
async def translate(request: TranslationRequest):
    """Translate a single text."""
    try:
        result = await service.translate_async(
            request.text,
            beam_size=request.beam_size,
            max_length=request.max_length
        )
        return TranslationResponse(**result)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/translate/batch", response_model=BatchTranslationResponse)
async def translate_batch(request: BatchTranslationRequest):
    """Translate multiple texts (one at a time; not a batched forward pass)."""
    start = time.perf_counter()

    translations = []
    for text in request.texts:
        result = await service.translate_async(
            text, beam_size=request.beam_size
        )
        translations.append(result["translation"])

    return BatchTranslationResponse(
        translations=translations,
        total_latency_ms=(time.perf_counter() - start) * 1000
    )


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": service.model is not None}


# Run with: uvicorn translation_server:app --host 0.0.0.0 --port 8000
'''
    print(code)

    print("""

    USAGE:
    ──────

    # Start server
    uvicorn translation_server:app --host 0.0.0.0 --port 8000

    # Single translation
    curl -X POST "http://localhost:8000/translate" \\
         -H "Content-Type: application/json" \\
         -d '{"text": "Der Hund läuft im Park.", "source_lang": "de"}'

    # Batch translation
    curl -X POST "http://localhost:8000/translate/batch" \\
         -H "Content-Type: application/json" \\
         -d '{"texts": ["Hallo", "Wie geht es dir?"]}'

    # Health check
    curl "http://localhost:8000/health"


    PRODUCTION CONSIDERATIONS:
    ──────────────────────────

    1. Use Gunicorn with multiple workers (each worker loads its own
       copy of the model, so size -w to the available GPU memory):
       gunicorn -w 4 -k uvicorn.workers.UvicornWorker translation_server:app

    2. Add rate limiting:
       from slowapi import Limiter
       from slowapi.util import get_remote_address
       limiter = Limiter(key_func=get_remote_address)

    3. Add request logging

    4. Add metrics (Prometheus)

    5. Add authentication if needed

    6. Keep blocking inference off the event loop (thread pool, as above)

    7. Consider batching requests together
    """)


fastapi_serving_example()
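
The rate-limiting item in the production considerations can be made concrete with a token bucket, one common way to cap request rates while still allowing short bursts. The sketch below is a standalone, standard-library illustration of the idea, not how slowapi is implemented; the class name `TokenBucket` and the injectable `clock` parameter are assumptions made so the behaviour is easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill proportionally to elapsed time, never above capacity
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demo with a fake clock so the outcome does not depend on real time
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]  # 3 tokens available, 4th is rejected
fake_now[0] = 0.5                           # half a second later: one token refilled
after_wait = bucket.allow()

print("burst:", burst)
print("after 0.5 s:", after_wait)
```

In a server, one bucket would typically be kept per client key (for example per IP address or API key), and a rejected request would be answered with HTTP 429.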

Efficient Batching and Queuing

High-Throughput Serving

🐍python
def batching_strategies():
    """Explain efficient batching strategies."""
    print("=" * 70)
    print("EFFICIENT BATCHING FOR HIGH THROUGHPUT")
    print("=" * 70)

    print("""
    WHY BATCHING MATTERS (illustrative figures):
    ────────────────────────────────────────────

    Single request:
    - GPU utilization: ~20%
    - Latency: 50ms
    - Throughput: 20 req/s

    Batched (8 requests):
    - GPU utilization: ~80%
    - Latency: 80ms (per request)
    - Throughput: 100 req/s

    5x throughput improvement!


    DYNAMIC BATCHING:
    ─────────────────

    Queue incoming requests and batch them together:

    ┌─────────────────────────────────────────────────────────────────┐
    │  Request Queue                                                  │
    │  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐            │
    │  │ R1  │ R2  │ R3  │ R4  │ R5  │ R6  │ R7  │ R8  │            │
    │  └──┬──┴──┬──┴──┬──┴──┬──┴──┬──┴──┬──┴──┬──┴──┬──┘            │
    │     │     │     │     │     │     │     │     │                │
    │     └─────┴─────┴─────┴─────┴─────┴─────┴─────┘                │
    │                        │                                        │
    │                        ▼                                        │
    │  ┌───────────────────────────────────────────┐                 │
    │  │            Batch (8 requests)             │                 │
    │  │                    │                      │                 │
    │  │                    ▼                      │                 │
    │  │               [MODEL]                     │                 │
    │  │                    │                      │                 │
    │  │                    ▼                      │                 │
    │  │         [8 translations]                  │                 │
    │  └───────────────────────────────────────────┘                 │
    │                        │                                        │
    │     ┌─────┬─────┬─────┴─────┬─────┬─────┬─────┬─────┐         │
    │     ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼               │
    │    T1    T2    T3    T4    T5    T6    T7    T8                │
    └─────────────────────────────────────────────────────────────────┘


    BATCHING PARAMETERS:
    ────────────────────

    max_batch_size: Maximum requests per batch (e.g., 32)
    max_wait_time: Maximum time to wait for batch (e.g., 50ms)

    Trade-off:
    - Larger batch → Higher throughput, higher latency
    - Smaller batch → Lower latency, lower throughput
    """)


def dynamic_batcher_implementation():
    """Show dynamic batcher implementation."""
    print("\nDynamic Batcher Implementation")
    print("=" * 60)

    code = '''
import asyncio
import contextlib
from typing import List
from dataclasses import dataclass


@dataclass
class BatchRequest:
    """Single request in batch."""
    text: str
    future: asyncio.Future


class DynamicBatcher:
    """
    Dynamic batching for high-throughput inference.
    """

    def __init__(
        self,
        model,
        max_batch_size: int = 32,
        max_wait_time: float = 0.05  # 50ms
    ):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time

        self.queue: List[BatchRequest] = []
        self.lock = asyncio.Lock()
        self._batch_task = None

    async def start(self):
        """Start the batching loop."""
        self._batch_task = asyncio.create_task(self._batch_loop())

    async def stop(self):
        """Stop the batching loop and wait for it to exit."""
        if self._batch_task:
            self._batch_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._batch_task

    async def submit(self, text: str) -> str:
        """
        Submit a request and wait for result.

        Args:
            text: Text to translate

        Returns:
            Translation result
        """
        # Create the future on the running loop that will resolve it
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(text=text, future=future)

        async with self.lock:
            self.queue.append(request)

        # Wait for result
        return await future

    async def _batch_loop(self):
        """Main loop: every max_wait_time, flush up to max_batch_size requests."""
        while True:
            await asyncio.sleep(self.max_wait_time)

            async with self.lock:
                # Take up to max_batch_size requests off the queue
                batch = self.queue[:self.max_batch_size]
                self.queue = self.queue[self.max_batch_size:]

            if batch:
                # Process batch
                await self._process_batch(batch)

    async def _process_batch(self, batch: List[BatchRequest]):
        """Process a batch of requests."""
        texts = [req.text for req in batch]

        # Run inference in a worker thread so the event loop stays responsive
        loop = asyncio.get_running_loop()
        try:
            translations = await loop.run_in_executor(
                None,
                self._translate_batch,
                texts
            )
        except Exception as exc:
            # Fail the waiting callers instead of killing the batching loop
            for req in batch:
                if not req.future.done():
                    req.future.set_exception(exc)
            return

        # Return results (skip callers that were cancelled in the meantime)
        for req, translation in zip(batch, translations):
            if not req.future.done():
                req.future.set_result(translation)

    def _translate_batch(self, texts: List[str]) -> List[str]:
        """Translate batch (synchronous)."""
        import torch

        # Tokenize all texts
        inputs = self.model.tokenizer(
            texts,
            padding=True,
            return_tensors="pt"
        )
        # Move inputs to the model's device (assumes the model exposes
        # `.device`, as Hugging Face models do)
        inputs = inputs.to(self.model.device)

        # Generate
        with torch.no_grad():
            outputs = self.model.generate(**inputs)

        # Decode
        translations = self.model.tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )

        return translations


# Usage in FastAPI
batcher = DynamicBatcher(model, max_batch_size=32)

@app.on_event("startup")
async def startup():
    await batcher.start()

@app.on_event("shutdown")
async def shutdown():
    await batcher.stop()

@app.post("/translate")
async def translate(request: TranslationRequest):
    translation = await batcher.submit(request.text)
    return {"translation": translation}
'''
    print(code)


dynamic_batcher_implementation()
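
The batcher above wakes on a fixed timer, so even a full batch waits for the next tick. A variant that flushes as soon as the batch is full or the wait deadline passes, whichever comes first, can be sketched with `asyncio.Queue`. This is a minimal standard-library sketch under stated assumptions: `QueueBatcher` and `fake_translate_batch` are hypothetical names, and the stub function stands in for the real batched model call.

```python
import asyncio
import contextlib
from typing import Callable, List


def fake_translate_batch(texts: List[str]) -> List[str]:
    """Stand-in for batched model inference (assumption: upper-casing)."""
    return [text.upper() for text in texts]


class QueueBatcher:
    """Flush a batch when it is full or when max_wait_time has elapsed."""

    def __init__(
        self,
        translate_batch: Callable[[List[str]], List[str]],
        max_batch_size: int = 8,
        max_wait_time: float = 0.05,
    ):
        self.translate_batch = translate_batch
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    async def submit(self, text: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then start the deadline
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_time
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            texts = [text for text, _ in batch]
            try:
                results = await loop.run_in_executor(None, self.translate_batch, texts)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def demo():
    batcher = QueueBatcher(fake_translate_batch, max_batch_size=4, max_wait_time=0.05)
    worker = asyncio.create_task(batcher.run())
    texts = ["hallo", "guten morgen", "danke", "bitte", "bis bald"]
    results = await asyncio.gather(*(batcher.submit(t) for t in texts))
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return results, batcher.batch_sizes


results, sizes = asyncio.run(demo())
print(results)
print("batch sizes:", sizes)
```

Five requests with `max_batch_size=4` are served in more than one batch: the first flushes as soon as it is full, and the remainder flushes when the deadline expires.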

Docker Deployment

Containerization

🐳dockerfile
# Dockerfile for Translation Service

# Base image with CUDA support (CUDA image tags include the patch version)
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set working directory
WORKDIR /app

# Install Python, plus curl for the health check below
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY models/ ./models/
COPY translation_server.py .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
    CMD curl -f http://localhost:8000/health || exit 1

# Run server
CMD ["uvicorn", "translation_server:app", "--host", "0.0.0.0", "--port", "8000"]

requirements.txt:

📝text
torch>=2.0.0
transformers>=4.30.0
fastapi>=0.100.0
uvicorn>=0.23.0
pydantic>=2.0.0
onnxruntime-gpu>=1.15.0

docker-compose.yml:

📄yaml
version: '3.8'

services:
  translation:
    build: .
    # Reached through nginx; no fixed host port, so the service can be scaled
    expose:
      - "8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/app/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - MODEL_PATH=/app/models/translator.pt
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - translation
    restart: unless-stopped

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

Deployment commands:

bash
# Build and run
docker-compose up --build -d

# View logs
docker-compose logs -f translation

# Scale (multiple instances); the translation service must not bind a
# fixed host port, or the replicas will collide on it
docker-compose up -d --scale translation=4

# Stop
docker-compose down
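
After `docker-compose up`, the service needs some time to load the model before it can answer requests, which is why the Dockerfile health check has a start period. A deployment script can wait for readiness by polling the `/health` endpoint. The helper below is a minimal standard-library sketch; the function name `wait_until_healthy` and its parameters are illustrative assumptions, and it relies only on the `{"status": "healthy"}` response shown in the FastAPI example.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_healthy(
    url: str,
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll `url` until it reports status "healthy" or `timeout_s` elapses."""
    # Bypass any configured HTTP proxy, since the target is a local service
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=request_timeout_s) as response:
                body = json.loads(response.read().decode("utf-8"))
                if response.status == 200 and body.get("status") == "healthy":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Not reachable yet, or the response was not valid JSON: retry
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_until_healthy("http://localhost:8000/health", timeout_s=120)
    print("service ready" if ready else "service did not become healthy in time")
```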

Cloud Deployment

AWS/GCP/Azure Options

AWS Options:

  • EC2 with GPU: p3.2xlarge (1× V100 ~$3/hour), p4d.24xlarge (8× A100 ~$32/hour) - Good for development, small scale
  • ECS/EKS with GPU: Container orchestration, auto-scaling - Good for production workloads
  • SageMaker: Managed ML platform, built-in inference endpoints - Good for MLOps integration
  • Lambda (CPU only): Serverless, pay-per-use - Good for low traffic, cost optimization

GCP Options:

  • Compute Engine with GPU: N1 machine types with an attached T4/V100, or A2 machine types for A100 - Similar to EC2
  • GKE with GPU: Kubernetes management, auto-scaling
  • Vertex AI: Managed ML platform, model serving endpoints
  • Cloud Run (CPU): Serverless containers - Good for CPU inference

Azure Options:

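  • Azure VMs with GPU: NC-series (for example NCv3 with V100 or NCasT4_v3 with T4); the NVv4 series uses AMD GPUs without CUDA support and targets visualization workloads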
  • AKS with GPU: Kubernetes on Azure
  • Azure ML: Managed ML platform

Cost Comparison (approximate)

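| Option | $/hour | $/million requests |
|---|---|---|
| EC2 p3.2xlarge (V100) | ~$3 | ~$15 (200 req/s) |
| EC2 g4dn.xlarge (T4) | ~$0.5 | ~$5 (100 req/s) |
| SageMaker Serverless | Varies | ~$4 (pay per inference) |
| Lambda + CPU | ~$0.2 | ~$50 (slow) |
| Self-hosted (owned) | ~$0.5 | ~$3 (efficient) |

The per-million figures are rough. They sit above what the listed throughput alone would imply (for example, $3/hour at a sustained 200 req/s is about $4 per million requests), so they presumably allow for instances that are only partly utilized.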
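
The relationship between the hourly price, the throughput, and the cost per million requests can be written out explicitly. The following sketch uses the table's approximate prices and throughputs; the function name `cost_per_million` and the `utilization` parameter are illustrative assumptions, and the 28% utilization value is only an inference that happens to reproduce the table's figures.

```python
def cost_per_million(
    hourly_usd: float,
    requests_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Cost in USD of serving one million requests on one instance.

    utilization is the fraction of the instance's peak throughput that is
    actually used on average (1.0 = saturated around the clock).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    requests_per_hour = requests_per_second * 3600 * utilization
    return hourly_usd / requests_per_hour * 1_000_000


# Fully utilized instances
print(f"V100 at 100%: ${cost_per_million(3.0, 200):.2f} per million")
print(f"T4   at 100%: ${cost_per_million(0.5, 100):.2f} per million")

# Partly utilized instances (assumed 28% average utilization)
print(f"V100 at 28%:  ${cost_per_million(3.0, 200, 0.28):.2f} per million")
print(f"T4   at 28%:  ${cost_per_million(0.5, 100, 0.28):.2f} per million")
```

The same function makes it easy to see how quickly the effective cost rises when a GPU instance sits idle for most of the day.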

Recommendation:

  • Low traffic (<1000 req/day): Lambda/Cloud Run + CPU - Cheapest, no GPU costs
  • Medium traffic (1k-100k req/day): Single GPU instance (T4 or A10) - ~$500-1500/month
  • High traffic (>100k req/day): Multiple GPU instances, Kubernetes for orchestration, consider SageMaker/Vertex for management
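
The traffic tiers above are stated in requests per day, while throughput is usually measured in requests per second, so it helps to convert between the two. The sketch below encodes the tiers as a small function; the names `average_rps` and `recommend` are illustrative, and the thresholds are simply the ones from the recommendation list.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def recommend(requests_per_day: int) -> str:
    """Map a daily request volume to the tier from the recommendation list."""
    if requests_per_day < 1_000:
        return "Lambda/Cloud Run + CPU"
    if requests_per_day <= 100_000:
        return "Single GPU instance (T4 or A10)"
    return "Multiple GPU instances with Kubernetes"


for volume in (500, 50_000, 1_000_000):
    print(f"{volume:>9} req/day = {average_rps(volume):7.3f} req/s avg -> {recommend(volume)}")
```

Note that 100k requests per day is only a little over one request per second on average, so it is the peak rate and the latency target, rather than the daily average, that decide when a second GPU is needed.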

Summary

Deployment Checklist

| Step | Status | Notes |
|---|---|---|
| Export to ONNX | | Verify outputs match |
| Optimize with TensorRT | | If NVIDIA GPU |
| Build FastAPI server | | With batching |
| Containerize with Docker | | Include CUDA |
| Deploy to cloud | | Choose based on traffic |
| Set up monitoring | | Prometheus + Grafana |

Performance Targets

Latency Targets:

  • P50: < 50ms
  • P95: < 100ms
  • P99: < 200ms
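
Checking these targets requires computing percentiles from recorded request latencies. The sketch below uses the nearest-rank definition of a percentile and standard-library code only; the sample latencies are made up for illustration, and `percentile` and `TARGETS_MS` are hypothetical names.

```python
import math
from typing import Dict, Sequence

# Latency targets from the list above, in milliseconds
TARGETS_MS: Dict[int, float] = {50: 50, 95: 100, 99: 200}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds
latencies_ms = [32, 35, 38, 40, 41, 43, 44, 45, 46, 47,
                48, 49, 52, 55, 60, 68, 75, 88, 96, 180]

for pct, target in TARGETS_MS.items():
    value = percentile(latencies_ms, pct)
    status = "ok" if value < target else "MISSED"
    print(f"P{pct}: {value} ms (target < {target} ms) {status}")
```

In production these values would normally come from a metrics system such as Prometheus histograms rather than from raw samples held in memory.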

Throughput Targets:

  • Single GPU: 100-500 req/s (depending on length)
  • Cluster: Scales roughly linearly with the number of GPUs when requests are load-balanced across instances
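
Under the linear-scaling assumption, the number of GPUs for a given peak load follows directly from the per-GPU throughput. The sketch below adds a headroom factor so that no GPU is planned to run at its measured maximum; the function name `gpus_needed`, the 70% headroom, and the 1200 req/s peak are illustrative assumptions.

```python
import math


def gpus_needed(peak_rps: float, per_gpu_rps: float, headroom: float = 0.7) -> int:
    """GPUs required so each runs at no more than `headroom` of its measured throughput.

    Assumes throughput scales linearly with the number of GPUs.
    """
    if per_gpu_rps <= 0 or not 0 < headroom <= 1:
        raise ValueError("per_gpu_rps must be > 0 and headroom in (0, 1]")
    return math.ceil(peak_rps / (per_gpu_rps * headroom))


# Both ends of the 100-500 req/s single-GPU range, for a 1200 req/s peak
print("at 100 req/s per GPU:", gpus_needed(1200, 100))
print("at 500 req/s per GPU:", gpus_needed(1200, 500))
```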

Availability:

  • Uptime: 99.9%
  • Health checks every 30s
  • Auto-restart on failure
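
An uptime target translates into a concrete downtime budget, which is useful when deciding how fast failure detection and restarts need to be. The short sketch below does that conversion; the function name `downtime_budget_minutes` and the 30-day month are illustrative assumptions.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime in minutes over `days` days for a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.1f} minutes of downtime per 30 days")
```

At 99.9% the budget is about 43 minutes per 30 days, so with health checks every 30 seconds and a model that takes a minute or so to load, only a limited number of restarts fit into a month.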

Course Conclusion

Congratulations on completing this comprehensive course on Transformer implementation!

You've learned:

  • How attention mechanisms work from scratch
  • Complete transformer encoder-decoder architecture
  • Training and evaluation pipelines
  • Pre-trained model fine-tuning
  • Advanced architectures (Flash Attention, MoE, RoPE)
  • Production deployment techniques

Next steps:

  • Read the original "Attention Is All You Need" paper
  • Explore Hugging Face Transformers library
  • Try building a decoder-only model (GPT-style)
  • Experiment with different tasks (summarization, QA)

Happy building!
