Introduction
This section covers exporting models to deployment formats (ONNX, TensorRT) and building serving infrastructure for production translation services.
ONNX Export
Portable Model Format
🐍python
import torch
import torch.nn as nn
from typing import Dict, List, Optional, Tuple
import os


def onnx_overview():
    """
    Overview of ONNX export.
    """
    print("=" * 70)
    print("ONNX: OPEN NEURAL NETWORK EXCHANGE")
    print("=" * 70)

    print("""
    WHAT IS ONNX?
    ─────────────

    ONNX (Open Neural Network Exchange) is an open format for
    representing machine learning models.

    Benefits:
    ┌────────────────────────────────────────────────────┐
    │ • Interoperability: Works across frameworks        │
    │ • Optimization: Runtime-specific optimizations     │
    │ • Deployment: Run on various hardware              │
    │ • Portability: Single format for all platforms     │
    └────────────────────────────────────────────────────┘

    Workflow:
    ┌─────────┐     ┌─────────┐     ┌─────────────────────────┐
    │ PyTorch │ ──► │  ONNX   │ ──► │ ONNX Runtime (CPU/GPU)  │
    │  Model  │     │  Model  │     │ TensorRT                │
    └─────────┘     └─────────┘     │ OpenVINO                │
                                    │ CoreML                  │
                                    └─────────────────────────┘


    SUPPORTED PLATFORMS:
    ────────────────────

    ONNX Runtime backends:
    • CPU: Default, optimized for Intel/AMD
    • CUDA: NVIDIA GPUs
    • TensorRT: Maximum NVIDIA GPU performance
    • OpenVINO: Intel hardware (CPU, iGPU, VPU)
    • CoreML: Apple Silicon
    • DirectML: Windows GPUs

    Typical speedup: 1.2-2x over native PyTorch
    """)


class ONNXExporter:
    """
    Export PyTorch models to ONNX format.
    """

    def __init__(self, model: nn.Module, device: str = "cpu"):
        """
        Initialize exporter.

        Args:
            model: PyTorch model to export
            device: Device for export
        """
        self.model = model
        self.device = device
        self.model.to(device)
        self.model.eval()

    def export_encoder_decoder(
        self,
        output_path: str,
        max_length: int = 128,
        vocab_size: int = 32000,
        opset_version: int = 14
    ) -> str:
        """
        Export encoder-decoder model to ONNX.

        For translation models, we typically export:
        1. Encoder (processes source)
        2. Decoder (with KV cache for generation)

        This method exports the whole model as a single graph instead,
        which is simpler but re-runs the encoder at every decoding step.

        Args:
            output_path: Path to save ONNX model
            max_length: Maximum sequence length (unused here; the
                sequence axes are exported as dynamic)
            vocab_size: Vocabulary size
            opset_version: ONNX opset version

        Returns:
            Path to exported model
        """
        print("Exporting encoder-decoder model to ONNX...")

        # Create dummy inputs
        batch_size = 1
        src_len = 32

        dummy_input = {
            'src_ids': torch.randint(0, vocab_size, (batch_size, src_len)),
            'src_mask': torch.ones(batch_size, src_len),
            'tgt_ids': torch.randint(0, vocab_size, (batch_size, 1)),
        }

        # Move to device
        dummy_input = {k: v.to(self.device) for k, v in dummy_input.items()}

        # Dynamic axes for variable batch size and sequence length
        dynamic_axes = {
            'src_ids': {0: 'batch_size', 1: 'src_len'},
            'src_mask': {0: 'batch_size', 1: 'src_len'},
            'tgt_ids': {0: 'batch_size', 1: 'tgt_len'},
            'logits': {0: 'batch_size', 1: 'tgt_len'},
        }

        # Export
        torch.onnx.export(
            self.model,
            (dummy_input['src_ids'], dummy_input['src_mask'], dummy_input['tgt_ids']),
            output_path,
            input_names=['src_ids', 'src_mask', 'tgt_ids'],
            output_names=['logits'],
            dynamic_axes=dynamic_axes,
            opset_version=opset_version,
            do_constant_folding=True,
        )

        print(f"Model exported to {output_path}")

        # Verify export
        self._verify_export(output_path, dummy_input)

        return output_path

    def _verify_export(
        self,
        onnx_path: str,
        dummy_input: Dict[str, torch.Tensor]
    ):
        """Verify exported model matches PyTorch output."""
        import onnx
        import onnxruntime as ort

        # Check ONNX model
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print("ONNX model validation passed!")

        # Compare outputs (CPU provider keeps the check device-independent)
        session = ort.InferenceSession(
            onnx_path, providers=['CPUExecutionProvider']
        )

        # PyTorch output
        with torch.no_grad():
            pytorch_output = self.model(
                dummy_input['src_ids'],
                dummy_input['src_mask'],
                dummy_input['tgt_ids']
            )

        # ONNX output (.cpu() is required when the model was exported on GPU)
        onnx_inputs = {
            'src_ids': dummy_input['src_ids'].cpu().numpy(),
            'src_mask': dummy_input['src_mask'].cpu().numpy(),
            'tgt_ids': dummy_input['tgt_ids'].cpu().numpy(),
        }
        onnx_output = session.run(None, onnx_inputs)[0]

        # Compare
        pytorch_output = pytorch_output.cpu().numpy()
        max_diff = abs(pytorch_output - onnx_output).max()
        print(f"Max difference between PyTorch and ONNX: {max_diff:.6f}")

        if max_diff < 1e-4:
            print("✓ Export verification passed!")
        else:
            print("⚠ Warning: Outputs differ more than expected")


def onnx_export_example():
    """Show ONNX export code example."""
    print("\nONNX Export Example")
    print("=" * 60)

    code = '''
import torch
import torch.onnx

# Assume model is your trained TransformerModel
model.eval()

# Dummy inputs for tracing
batch_size = 1
src_len = 32
vocab_size = 32000

dummy_src = torch.randint(0, vocab_size, (batch_size, src_len))
dummy_mask = torch.ones(batch_size, src_len)
dummy_tgt = torch.randint(0, vocab_size, (batch_size, 1))

# Export
torch.onnx.export(
    model,
    (dummy_src, dummy_mask, dummy_tgt),
    "translator.onnx",
    input_names=['src_ids', 'src_mask', 'tgt_ids'],
    output_names=['logits'],
    dynamic_axes={
        'src_ids': {0: 'batch', 1: 'src_len'},
        'src_mask': {0: 'batch', 1: 'src_len'},
        'tgt_ids': {0: 'batch', 1: 'tgt_len'},
        'logits': {0: 'batch', 1: 'tgt_len'},
    },
    opset_version=14,
    do_constant_folding=True,
)

print("Exported to translator.onnx")

# Verify
import onnx
model_onnx = onnx.load("translator.onnx")
onnx.checker.check_model(model_onnx)

# Inference with ONNX Runtime
import onnxruntime as ort

# Providers are tried in order; CPU is the fallback when CUDA is unavailable
session = ort.InferenceSession(
    "translator.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Run inference (the dummy inputs are reused here; pass real token IDs
# and masks with the same dtypes in practice)
outputs = session.run(
    None,
    {
        'src_ids': dummy_src.numpy(),
        'src_mask': dummy_mask.numpy(),
        'tgt_ids': dummy_tgt.numpy(),
    }
)
'''
    print(code)


onnx_export_example()
TensorRT Optimization
Maximum GPU Performance
🐍python
def tensorrt_overview():
    """
    Overview of TensorRT optimization.
    """
    print("=" * 70)
    print("TENSORRT: NVIDIA GPU OPTIMIZATION")
    print("=" * 70)

    print("""
    WHAT IS TENSORRT?
    ─────────────────

    TensorRT is NVIDIA's SDK for high-performance deep learning
    inference. It applies several optimizations:

    ┌────────────────────────────────────────────────────┐
    │ 1. LAYER FUSION                                    │
    │    Combine multiple layers into single kernels     │
    │    Conv + BatchNorm + ReLU → Single fused kernel   │
    │                                                    │
    │ 2. PRECISION CALIBRATION                           │
    │    Automatic FP16/INT8 quantization                │
    │    Minimal accuracy loss, maximum speed            │
    │                                                    │
    │ 3. KERNEL AUTO-TUNING                              │
    │    Select best kernel for your specific GPU        │
    │    Optimized for your hardware                     │
    │                                                    │
    │ 4. MEMORY OPTIMIZATION                             │
    │    Minimize memory transfers                       │
    │    Reuse memory across layers                      │
    └────────────────────────────────────────────────────┘


    TYPICAL SPEEDUP:
    ────────────────

    ┌─────────────────┬────────────────┬──────────────────────────┐
    │ Framework       │ Latency (ms)   │ vs TensorRT              │
    ├─────────────────┼────────────────┼──────────────────────────┤
    │ PyTorch FP32    │ 45             │ 4.5x slower              │
    │ PyTorch FP16    │ 25             │ 2.5x slower              │
    │ ONNX Runtime    │ 20             │ 2x slower                │
    │ TensorRT FP16   │ 10             │ Baseline                 │
    │ TensorRT INT8   │ 6              │ 1.7x faster              │
    └─────────────────┴────────────────┴──────────────────────────┘


    WORKFLOW:
    ─────────

    PyTorch → ONNX → TensorRT Engine → Deploy

    Or use Torch-TensorRT for direct conversion
    """)


def tensorrt_conversion_example():
    """Show TensorRT conversion code."""
    print("\nTensorRT Conversion")
    print("=" * 60)

    code = '''
# Method 1: ONNX to TensorRT
# =========================

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, fp16=True):
    """Build TensorRT engine from ONNX."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse ONNX
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Build config
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # The ONNX model was exported with dynamic axes, so TensorRT needs an
    # optimization profile with (min, opt, max) shapes for every input.
    # The ranges below are example values; adjust them to your traffic.
    profile = builder.create_optimization_profile()
    profile.set_shape("src_ids", (1, 1), (1, 32), (8, 128))
    profile.set_shape("src_mask", (1, 1), (1, 32), (8, 128))
    profile.set_shape("tgt_ids", (1, 1), (1, 16), (8, 128))
    config.add_optimization_profile(profile)

    # Build engine (returns the serialized engine, or None on failure)
    engine = builder.build_serialized_network(network, config)

    return engine

# Build and save
engine = build_engine("translator.onnx", fp16=True)
if engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open("translator.trt", "wb") as f:
    f.write(engine)


# Method 2: Torch-TensorRT (simpler)
# ==================================

import torch_tensorrt

model = model.cuda().eval()

# Compile with Torch-TensorRT (TorchScript frontend, so the result can be
# saved with torch.jit.save below)
trt_model = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[
        # One Input per forward() argument. A dynamic sequence length
        # needs a (min, opt, max) range; these values are examples.
        torch_tensorrt.Input(
            min_shape=[1, 1],
            opt_shape=[1, 32],
            max_shape=[1, 128],
            dtype=torch.int64,
        ),
    ],
    enabled_precisions={torch.float16},
    workspace_size=1 << 30,
)

# Use like regular PyTorch model
output = trt_model(input_tensor)

# Save
torch.jit.save(trt_model, "translator_trt.ts")


# Inference with TensorRT
# =======================
# Note: this uses the binding-index API of TensorRT 8.x, which TensorRT 10
# replaces with name-based tensor calls. It also assumes an engine built
# with static shapes; dynamic shapes must be set on the context first.

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

# Load engine
with open("translator.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Allocate buffers
inputs = []
outputs = []
bindings = []

for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))

    if engine.binding_is_input(binding):
        inputs.append({'host': host_mem, 'device': device_mem})
    else:
        outputs.append({'host': host_mem, 'device': device_mem})

# Run inference (single-input case; copy every input for multi-input models)
def infer(input_data):
    # Copy input to device
    np.copyto(inputs[0]['host'], input_data.ravel())
    cuda.memcpy_htod(inputs[0]['device'], inputs[0]['host'])

    # Execute
    context.execute_v2(bindings)

    # Copy output to host
    cuda.memcpy_dtoh(outputs[0]['host'], outputs[0]['device'])

    return outputs[0]['host']
'''
    print(code)


tensorrt_conversion_example()
Model Serving with FastAPI
REST API for Translation
🐍python
def fastapi_serving_example():
    """Show FastAPI serving example."""
    print("=" * 70)
    print("MODEL SERVING WITH FASTAPI")
    print("=" * 70)

    code = '''
# translation_server.py

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import torch
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

app = FastAPI(title="Translation API", version="1.0.0")

# Request/Response models
class TranslationRequest(BaseModel):
    text: str
    source_lang: str = "de"
    target_lang: str = "en"
    beam_size: int = 5
    max_length: int = 128

class TranslationResponse(BaseModel):
    translation: str
    confidence: float
    tokens: int
    latency_ms: float

class BatchTranslationRequest(BaseModel):
    texts: List[str]
    source_lang: str = "de"
    target_lang: str = "en"
    beam_size: int = 5

class BatchTranslationResponse(BaseModel):
    translations: List[str]
    total_latency_ms: float


# Model loading
class TranslationService:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.model = self._load_model(model_path)
        self.tokenizer = self._load_tokenizer()
        self.executor = ThreadPoolExecutor(max_workers=4)

    def _load_model(self, path: str):
        # Load your model here. weights_only=False unpickles a full module,
        # so only load checkpoint files you trust.
        model = torch.load(path, map_location=self.device, weights_only=False)
        model.eval()
        return model

    def _load_tokenizer(self):
        # Placeholder: return the tokenizer that matches your model
        raise NotImplementedError("Load your tokenizer here")

    def translate(
        self,
        text: str,
        beam_size: int = 5,
        max_length: int = 128
    ) -> dict:
        start = time.perf_counter()

        # Tokenize
        input_ids = self.tokenizer.encode(text)
        input_ids = torch.tensor([input_ids]).to(self.device)

        # Generate
        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                max_length=max_length,
                num_beams=beam_size
            )

        # Decode
        translation = self.tokenizer.decode(output_ids[0])

        latency = (time.perf_counter() - start) * 1000

        return {
            "translation": translation,
            "confidence": 0.95,  # Placeholder; would compute from model scores
            "tokens": len(output_ids[0]),
            "latency_ms": latency
        }

    async def translate_async(self, *args, **kwargs):
        # Run the blocking model call in a worker thread so the event
        # loop stays free to accept other requests
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            self.executor, lambda: self.translate(*args, **kwargs)
        )


# Initialize service
service = TranslationService("model.pt")


@app.post("/translate", response_model=TranslationResponse)
async def translate(request: TranslationRequest):
    """Translate a single text."""
    try:
        result = await service.translate_async(
            request.text,
            beam_size=request.beam_size,
            max_length=request.max_length
        )
        return TranslationResponse(**result)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/translate/batch", response_model=BatchTranslationResponse)
async def translate_batch(request: BatchTranslationRequest):
    """Translate multiple texts (one after another, not as a model batch)."""
    start = time.perf_counter()

    translations = []
    for text in request.texts:
        result = await service.translate_async(
            text, beam_size=request.beam_size
        )
        translations.append(result["translation"])

    return BatchTranslationResponse(
        translations=translations,
        total_latency_ms=(time.perf_counter() - start) * 1000
    )


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": service.model is not None}


# Run with: uvicorn translation_server:app --host 0.0.0.0 --port 8000
'''
    print(code)

    print("""

    USAGE:
    ──────

    # Start server
    uvicorn translation_server:app --host 0.0.0.0 --port 8000

    # Single translation
    curl -X POST "http://localhost:8000/translate" \\
        -H "Content-Type: application/json" \\
        -d '{"text": "Der Hund läuft im Park.", "source_lang": "de"}'

    # Batch translation
    curl -X POST "http://localhost:8000/translate/batch" \\
        -H "Content-Type: application/json" \\
        -d '{"texts": ["Hallo", "Wie geht es dir?"]}'

    # Health check
    curl "http://localhost:8000/health"


    PRODUCTION CONSIDERATIONS:
    ──────────────────────────

    1. Use Gunicorn with multiple workers
       (each worker loads its own copy of the model):
       gunicorn -w 4 -k uvicorn.workers.UvicornWorker translation_server:app

    2. Add rate limiting:
       from slowapi import Limiter
       from slowapi.util import get_remote_address
       limiter = Limiter(key_func=get_remote_address)

    3. Add request logging

    4. Add metrics (Prometheus)

    5. Add authentication if needed

    6. Use async properly for I/O

    7. Consider batching requests together
    """)


fastapi_serving_example()
Efficient Batching and Queuing
High-Throughput Serving
🐍python
def batching_strategies():
    """Explain efficient batching strategies."""
    print("=" * 70)
    print("EFFICIENT BATCHING FOR HIGH THROUGHPUT")
    print("=" * 70)

    print("""
    WHY BATCHING MATTERS:
    ─────────────────────

    Single request:
    - GPU utilization: ~20%
    - Latency: 50ms
    - Throughput: 20 req/s

    Batched (8 requests):
    - GPU utilization: ~80%
    - Latency: 80ms (per request)
    - Throughput: 100 req/s

    5x throughput improvement!


    DYNAMIC BATCHING:
    ─────────────────

    Queue incoming requests and batch them together:

    Request queue:  [R1][R2][R3][R4][R5][R6][R7][R8]
                                    │
                                    ▼
                           Batch (8 requests)
                                    │
                                    ▼
                                 [MODEL]
                                    │
                                    ▼
                            [8 translations]
                                    │
                                    ▼
    Results:        [T1][T2][T3][T4][T5][T6][T7][T8]


    BATCHING PARAMETERS:
    ────────────────────

    max_batch_size: Maximum requests per batch (e.g., 32)
    max_wait_time: Maximum time to wait for batch (e.g., 50ms)

    Trade-off:
    - Larger batch → Higher throughput, higher latency
    - Smaller batch → Lower latency, lower throughput
    """)


def dynamic_batcher_implementation():
    """Show dynamic batcher implementation."""
    print("\nDynamic Batcher Implementation")
    print("=" * 60)

    code = '''
import asyncio
from typing import List, Dict, Any
from dataclasses import dataclass
import time


@dataclass
class BatchRequest:
    """Single request in batch."""
    text: str
    future: asyncio.Future


class DynamicBatcher:
    """
    Dynamic batching for high-throughput inference.
    """

    def __init__(
        self,
        model,
        max_batch_size: int = 32,
        max_wait_time: float = 0.05  # 50ms
    ):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time

        self.queue: List[BatchRequest] = []
        self.lock = asyncio.Lock()
        self._batch_task = None

    async def start(self):
        """Start the batching loop."""
        self._batch_task = asyncio.create_task(self._batch_loop())

    async def stop(self):
        """Stop the batching loop."""
        if self._batch_task:
            self._batch_task.cancel()

    async def submit(self, text: str) -> str:
        """
        Submit a request and wait for result.

        Args:
            text: Text to translate

        Returns:
            Translation result
        """
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(text=text, future=future)

        async with self.lock:
            self.queue.append(request)

        # Wait for result
        return await future

    async def _batch_loop(self):
        """Main batching loop."""
        while True:
            await asyncio.sleep(self.max_wait_time)

            async with self.lock:
                if not self.queue:
                    continue

                # Get batch
                batch = self.queue[:self.max_batch_size]
                self.queue = self.queue[self.max_batch_size:]

            if batch:
                # Process batch
                await self._process_batch(batch)

    async def _process_batch(self, batch: List[BatchRequest]):
        """Process a batch of requests."""
        texts = [req.text for req in batch]

        # Run inference in a worker thread. On failure, pass the error to
        # every waiting caller; otherwise their futures would never
        # resolve and the batching loop would die.
        loop = asyncio.get_running_loop()
        try:
            translations = await loop.run_in_executor(
                None,
                self._translate_batch,
                texts
            )
        except Exception as exc:
            for req in batch:
                if not req.future.done():
                    req.future.set_exception(exc)
            return

        # Return results
        for req, translation in zip(batch, translations):
            if not req.future.done():
                req.future.set_result(translation)

    def _translate_batch(self, texts: List[str]) -> List[str]:
        """Translate batch (synchronous)."""
        import torch

        # Tokenize all texts (assumes the model object exposes a
        # Hugging Face-style tokenizer as model.tokenizer)
        inputs = self.model.tokenizer(
            texts,
            padding=True,
            return_tensors="pt"
        )

        # Generate
        with torch.no_grad():
            outputs = self.model.generate(**inputs)

        # Decode
        translations = self.model.tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )

        return translations


# Usage in FastAPI
batcher = DynamicBatcher(model, max_batch_size=32)

# Note: recent FastAPI releases deprecate on_event in favor of lifespan handlers
@app.on_event("startup")
async def startup():
    await batcher.start()

@app.on_event("shutdown")
async def shutdown():
    await batcher.stop()

@app.post("/translate")
async def translate(request: TranslationRequest):
    translation = await batcher.submit(request.text)
    return {"translation": translation}
'''
    print(code)


dynamic_batcher_implementation()
Docker Deployment
Containerization
🐳dockerfile
# Dockerfile for Translation Service

# Base image with CUDA support (NVIDIA tags use the full version number)
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set working directory
WORKDIR /app

# Install Python, plus curl for the health check below
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY models/ ./models/
COPY translation_server.py .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run server
CMD ["uvicorn", "translation_server:app", "--host", "0.0.0.0", "--port", "8000"]
requirements.txt:
📝text
torch>=2.0.0
transformers>=4.30.0
fastapi>=0.100.0
uvicorn>=0.23.0
pydantic>=2.0.0
onnxruntime-gpu>=1.15.0
docker-compose.yml:
📄yaml
version: '3.8'

services:
  translation:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/app/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - MODEL_PATH=/app/models/translator.pt
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - translation
    restart: unless-stopped

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
Deployment commands:
⚡bash
# Build and run
docker-compose up --build -d

# View logs
docker-compose logs -f translation

# Scale (multiple instances); remove the fixed "8000:8000" host port
# mapping first, otherwise the replicas collide on port 8000
docker-compose up --scale translation=4

# Stop
docker-compose down
Cloud Deployment
AWS/GCP/Azure Options
AWS Options:
- EC2 with GPU: p3.2xlarge (1× V100 ~$3/hour), p4d.24xlarge (8× A100 ~$32/hour) - Good for development, small scale
- ECS/EKS with GPU: Container orchestration, auto-scaling - Good for production workloads
- SageMaker: Managed ML platform, built-in inference endpoints - Good for MLOps integration
- Lambda (CPU only): Serverless, pay-per-use - Good for low traffic, cost optimization
GCP Options:
- Compute Engine with GPU: N1 machine types with T4/V100, A2 machine types for A100 - Similar to EC2
- GKE with GPU: Kubernetes management, auto-scaling
- Vertex AI: Managed ML platform, model serving endpoints
- Cloud Run (CPU): Serverless containers - Good for CPU inference
Azure Options:
- Azure VMs with GPU: NCv3 (V100) or NCasT4_v3 (T4) series
- AKS with GPU: Kubernetes on Azure
- Azure ML: Managed ML platform
Cost Comparison (approximate)
| Option | $/hour | $/million requests |
|---|---|---|
| EC2 p3.2xlarge (V100) | ~$3 | ~$15 (200 req/s) |
| EC2 g4dn.xlarge (T4) | ~$0.5 | ~$5 (100 req/s) |
| SageMaker Serverless | Varies | ~$4 (pay per inference) |
| Lambda + CPU | ~$0.2 | ~$50 (slow) |
| Self-hosted (owned) | ~$0.5 | ~$3 (efficient) |
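The per-request column comes down to simple arithmetic: the hourly price divided by the number of requests actually served in an hour. The sketch below shows that calculation; the function name and the `utilization` parameter are illustrative assumptions, not part of any cloud SDK. At 100% utilization it gives a lower bound, and the table's figures sit above that bound, which is consistent with instances that are not saturated around the clock.

```python
def cost_per_million_requests(hourly_usd: float, requests_per_second: float,
                              utilization: float = 1.0) -> float:
    """Cost in USD to serve one million requests.

    utilization is the assumed fraction of time the instance serves
    traffic at the given rate; 1.0 means fully saturated.
    """
    if requests_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    requests_per_hour = requests_per_second * utilization * 3600
    return hourly_usd / requests_per_hour * 1_000_000


# Hourly prices and throughputs taken from the table above.
v100_full = cost_per_million_requests(3.0, 200)          # saturated V100
t4_full = cost_per_million_requests(0.5, 100)            # saturated T4
t4_quarter = cost_per_million_requests(0.5, 100, 0.25)   # T4 busy 25% of the time

print(f"V100 @ 100%: ${v100_full:.2f} per million requests")
print(f"T4   @ 100%: ${t4_full:.2f} per million requests")
print(f"T4   @  25%: ${t4_quarter:.2f} per million requests")
```

Running the same function with your own measured throughput and expected utilization gives a more realistic figure than any generic table.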
Recommendation:
- Low traffic (<1000 req/day): Lambda/Cloud Run + CPU - Cheapest, no GPU costs
- Medium traffic (1k-100k req/day): Single GPU instance (T4 or A10) - ~$500-1500/month
- High traffic (>100k req/day): Multiple GPU instances, Kubernetes for orchestration, consider SageMaker/Vertex for management
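The tiers above can be written down as a small lookup, which also makes the daily volumes easier to relate to per-second load. This is a minimal sketch: the function names are made up for illustration, the thresholds are the ones from the list, and the average assumes traffic spread evenly over the day, which real traffic rarely is.

```python
def recommend_deployment(requests_per_day: int) -> str:
    """Map daily request volume to the traffic tiers listed above."""
    if requests_per_day < 0:
        raise ValueError("requests_per_day must be non-negative")
    if requests_per_day < 1_000:
        return "serverless CPU (Lambda / Cloud Run)"
    if requests_per_day <= 100_000:
        return "single GPU instance (T4 or A10)"
    return "multiple GPU instances behind Kubernetes"


def average_rps(requests_per_day: int) -> float:
    """Average requests per second, assuming evenly spread traffic."""
    return requests_per_day / 86_400


for volume in (500, 50_000, 2_000_000):
    print(f"{volume:>9} req/day = {average_rps(volume):6.2f} req/s avg -> "
          f"{recommend_deployment(volume)}")
```

Note that 100k requests per day is only a little over one request per second on average, so it is the peak rate, not the daily total, that decides whether one GPU is enough.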
Summary
Deployment Checklist
| Step | Status | Notes |
|---|---|---|
| Export to ONNX | □ | Verify outputs match |
| Optimize with TensorRT | □ | If NVIDIA GPU |
| Build FastAPI server | □ | With batching |
| Containerize with Docker | □ | Include CUDA |
| Deploy to cloud | □ | Choose based on traffic |
| Set up monitoring | □ | Prometheus + Grafana |
Performance Targets
Latency Targets:
- P50: < 50ms
- P95: < 100ms
- P99: < 200ms
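Checking these targets requires computing percentiles over recorded request latencies. The sketch below uses the nearest-rank method on an in-memory list; the helper names and the synthetic samples are assumptions for illustration, and a production setup would more likely read percentiles from its metrics system.

```python
from typing import Dict, Sequence, Tuple

# Targets from the list above: percentile -> upper bound in milliseconds.
LATENCY_TARGETS_MS = {50: 50.0, 95: 100.0, 99: 200.0}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def check_latency(samples_ms: Sequence[float]) -> Dict[int, Tuple[float, bool]]:
    """Return {percentile: (observed_ms, meets_target)}."""
    report = {}
    for pct, limit in LATENCY_TARGETS_MS.items():
        observed = percentile(samples_ms, pct)
        report[pct] = (observed, observed < limit)
    return report


# Synthetic example: 100 requests, mostly fast, with a slow tail.
samples = [20.0] * 90 + [80.0] * 8 + [250.0, 400.0]
report = check_latency(samples)
for pct, (observed, ok) in report.items():
    print(f"P{pct}: {observed:.0f} ms -> {'ok' if ok else 'MISSED'}")
```

In this synthetic data the median and P95 are fine while P99 misses its target, which is the typical shape of a tail-latency problem: averages look healthy and only the high percentiles reveal it.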
Throughput Targets:
- Single GPU: 100-500 req/s (depending on length)
- Cluster: Scale linearly with GPUs
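Under the linear-scaling assumption stated above, cluster sizing is a single division. A rough sketch follows; `gpus_needed` and its `headroom` parameter are hypothetical names introduced here, and real clusters usually fall somewhat short of perfectly linear scaling.

```python
import math


def gpus_needed(peak_rps: float, per_gpu_rps: float, headroom: float = 1.0) -> int:
    """GPUs required for peak_rps, assuming throughput scales linearly.

    headroom is the fraction of each GPU's measured throughput you are
    willing to plan on (e.g. 0.7 keeps 30% spare for bursts).
    """
    if peak_rps < 0 or per_gpu_rps <= 0 or not 0 < headroom <= 1:
        raise ValueError("invalid sizing inputs")
    return max(1, math.ceil(peak_rps / (per_gpu_rps * headroom)))


# 1200 req/s at peak, using both ends of the 100-500 req/s per-GPU range.
long_inputs = gpus_needed(1200, 100)
short_inputs = gpus_needed(1200, 500)
short_with_headroom = gpus_needed(1200, 500, headroom=0.7)

print(long_inputs, short_inputs, short_with_headroom)
```

The spread between the two ends of the range is why per-GPU throughput should be measured on your own sequence lengths before sizing a cluster.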
Availability:
- Uptime: 99.9%
- Health checks every 30s
- Auto-restart on failure
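An uptime percentage is easier to reason about as a downtime budget. The short calculation below converts the 99.9% target into minutes; the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Allowed downtime in minutes for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return unavailable_fraction * period_days * 24 * 60


per_month = downtime_budget_minutes(99.9, 30)
per_year = downtime_budget_minutes(99.9, 365)

print(f"99.9% over 30 days:  {per_month:.1f} minutes")
print(f"99.9% over 365 days: {per_year / 60:.2f} hours")
```

With health checks every 30s, a failure can go unnoticed for up to roughly one check interval before a restart even begins, so the monthly budget covers only a limited number of incidents.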
Course Conclusion
Congratulations on completing this comprehensive course on Transformer implementation!
You've learned:
- How attention mechanisms work from scratch
- Complete transformer encoder-decoder architecture
- Training and evaluation pipelines
- Pre-trained model fine-tuning
- Advanced architectures (Flash Attention, MoE, RoPE)
- Production deployment techniques
Next steps:
- Read the original "Attention Is All You Need" paper
- Explore Hugging Face Transformers library
- Try building a decoder-only model (GPT-style)
- Experiment with different tasks (summarization, QA)
Happy building!