Learning Objectives
By the end of this section, you will:
- Understand AMNL's memory footprint of under 500 MB
- Analyze memory allocation across model components
- Calculate memory scaling with batch size
- Learn memory optimization techniques for constrained environments
- Evaluate deployment options based on available memory
Core Insight: AMNL requires less than 500 MB of GPU memory for inference, enabling deployment on entry-level GPUs, edge devices, and even CPU-only systems. This low memory footprint makes predictive maintenance accessible to facilities without expensive hardware infrastructure.
Memory Footprint Analysis
GPU memory usage is a critical constraint for industrial deployment, where dedicated AI hardware may not be available.
Inference Memory Requirements
| Configuration | Memory Usage | Notes |
|---|---|---|
| Model weights only | ~14 MB | FP32 parameters |
| Batch size 1 | ~180 MB | Minimal activations |
| Batch size 32 | ~285 MB | Typical deployment |
| Batch size 128 | ~390 MB | High throughput |
| Batch size 256 | ~480 MB | Maximum recommended |
Training vs Inference
Training requires significantly more memory (2-4 GB) due to gradient storage, optimizer states, and activation checkpoints. The 500 MB figure applies to inference only.
Memory Composition
| Component | Size (batch=256) | Percentage |
|---|---|---|
| Model weights | 14 MB | 2.9% |
| Activations | 420 MB | 87.5% |
| CUDA workspace | 46 MB | 9.6% |
| Total | 480 MB | 100% |
Activations Dominate
Unlike many deep learning models where weights dominate memory usage, AMNL's memory is primarily consumed by activations. This is due to the sequence length (50 timesteps) and multiple intermediate representations in the BiLSTM and attention layers.
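Why activations dwarf the weights can be seen with a back-of-envelope estimate. The sketch below uses the document's figures (50 timesteps, 3.5M FP32 parameters); the layer widths and tensor counts are illustrative assumptions, not the published architecture:

```python
# Back-of-envelope activation estimate for one sample (FP32 = 4 bytes).
SEQ_LEN = 50
BYTES_FP32 = 4

def activation_mb(seq_len, width, copies=1):
    """MB of FP32 activations for `copies` tensors of shape (seq_len, width)."""
    return copies * seq_len * width * BYTES_FP32 / 1e6

# Assumed: a BiLSTM with hidden size 256 keeps forward + backward outputs
# (512 wide) plus intermediate gate tensors -- several (50, 512) tensors
# per sample, and every sample in the batch gets its own copy:
bilstm_per_sample = activation_mb(SEQ_LEN, 512, copies=4)

# Weights are shared across the whole batch and stay tiny by comparison:
# 3.5M FP32 parameters = 14 MB total.
weights_mb = 3.5e6 * BYTES_FP32 / 1e6

print(f"BiLSTM activations/sample: {bilstm_per_sample:.2f} MB")
print(f"Model weights (shared):    {weights_mb:.1f} MB")
```

At batch 256 even ~0.4 MB of per-sample BiLSTM activations multiplies to roughly the ~118 MB shown in the breakdown table, while the 14 MB of weights is paid only once.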
Memory Breakdown by Component
Understanding where memory is allocated helps identify optimization opportunities.
Memory by Layer Type
| Layer Type | Weights | Activations (B=256) | Total |
|---|---|---|---|
| CNN layers | 1.6 MB | 39.3 MB | 40.9 MB |
| BiLSTM layers | 9.4 MB | 118 MB | 127.4 MB |
| Attention | 2.4 MB | 188 MB | 190.4 MB |
| FC layers | 0.6 MB | 74 MB | 74.6 MB |
| Total | 14 MB | 420 MB | 434 MB |

The 434 MB total excludes the ~46 MB CUDA workspace, which brings overall usage at batch 256 to 480 MB.
Memory Scaling with Batch Size
Memory usage scales approximately linearly with batch size due to activation storage.
Scaling Formula
Memory (MB) ≈ 14 + 1.8 × B

where B is the batch size. The 14 MB base is model weights, and 1.8 MB per sample is activation memory. Measured usage (table below) runs higher at small batch sizes because of fixed CUDA context and workspace overhead.
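The linear estimate can be checked directly against the measurements, a minimal sketch using only the constants stated above:

```python
def estimate_memory_mb(batch_size):
    """Linear memory estimate: 14 MB weights + 1.8 MB activations per sample."""
    return 14 + 1.8 * batch_size

# estimate_memory_mb(256) ≈ 474.8 MB, close to the measured 480 MB.
# At small batches the estimate undershoots the measurements, since
# fixed CUDA context overhead is not part of the formula.
```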
Memory vs Batch Size
| Batch Size | Memory (MB) | Samples/MB | Efficiency |
|---|---|---|---|
| 1 | 180 | 0.006 | Low |
| 8 | 200 | 0.040 | Low |
| 32 | 285 | 0.112 | Medium |
| 64 | 330 | 0.194 | Good |
| 128 | 390 | 0.328 | Good |
| 256 | 480 | 0.533 | Best |
| 512 | 650 | 0.788 | Diminishing |
Memory-Throughput Tradeoff
Larger batches improve throughput but consume more memory. For memory-constrained deployments, batch size 64-128 provides a good balance, using ~350 MB while achieving 80% of maximum throughput.
Memory Optimization Techniques
Several techniques can reduce memory usage for constrained environments.
1. FP16 Inference
Half-precision inference halves both weight and activation memory:
```python
# Convert model to FP16
model = model.half()

# Memory reduction
# FP32: 480 MB → FP16: 240 MB (batch=256)
# Speedup bonus: ~1.5x faster
```

| Precision | Weights | Activations | Total |
|---|---|---|---|
| FP32 | 14 MB | 420 MB | 480 MB |
| FP16 | 7 MB | 210 MB | 240 MB |
| INT8 | 3.5 MB | 105 MB | ~130 MB |
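The INT8 row can be approached with PyTorch's dynamic quantization, sketched here on a stand-in module (the real AMNL class is not reproduced). Note that dynamic quantization converts weights only; reaching the table's INT8 activation savings would require static quantization with calibration:

```python
import torch
import torch.nn as nn

def quantize_for_cpu(model: nn.Module) -> nn.Module:
    """Quantize LSTM and Linear weights to INT8 (weights-only, CPU inference)."""
    model.eval()
    return torch.quantization.quantize_dynamic(
        model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
    )

# Demo on a stand-in network, not the actual AMNL architecture:
demo = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4))
quantized = quantize_for_cpu(demo)
out = quantized(torch.randn(2, 128))
```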
2. Gradient Checkpointing (Training)
For training, gradient checkpointing trades compute for memory:
```python
from torch.utils.checkpoint import checkpoint

class MemoryEfficientAMNL(nn.Module):
    def forward(self, x):
        # Checkpoint BiLSTM and attention to save memory
        x = checkpoint(self.bilstm, x)
        x = checkpoint(self.attention, x)
        return self.heads(x)

# Memory reduction: ~40% during training
```

3. Dynamic Batching
Adjust batch size based on available memory at runtime:
```python
import numpy as np

def get_optimal_batch_size(available_memory_mb):
    """Calculate optimal batch size given available GPU memory."""
    base_memory = 14  # Model weights (MB)
    per_sample = 1.8  # Activation memory per sample (MB)

    max_batch = int((available_memory_mb - base_memory) / per_sample)
    if max_batch < 1:
        return 1  # Not enough headroom for the estimate; fall back to batch 1

    # Round down to power of 2 for efficiency
    optimal = 2 ** int(np.log2(max_batch))

    return min(optimal, 256)  # Cap at 256

# Example: 2GB GPU
# get_optimal_batch_size(2000) → 256 (hits the cap)
```

4. CPU Fallback
For edge devices without GPU, AMNL can run entirely on CPU with acceptable latency:
| Device | Memory | Latency (batch=1) | Throughput |
|---|---|---|---|
| GPU (RTX 5000) | 480 MB VRAM | 0.03 ms | 31K/sec |
| CPU (8-core) | ~500 MB RAM | 15 ms | ~65/sec |
| Edge CPU (4-core) | ~500 MB RAM | 45 ms | ~22/sec |
| Raspberry Pi 4 | ~500 MB RAM | 200 ms | ~5/sec |
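CPU-only serving needs no special code path in PyTorch. A minimal sketch follows; the stand-in model and the 8-channel input shape are placeholders, assuming only the document's 50-timestep windows:

```python
import torch
import torch.nn as nn

torch.set_num_threads(4)  # match the core count of the edge CPU

# Stand-in for the AMNL network (placeholder, not the real architecture):
model = nn.Sequential(nn.Flatten(), nn.Linear(50 * 8, 4)).eval()

window = torch.randn(1, 50, 8)  # one 50-timestep window, 8 sensor channels
with torch.inference_mode():    # disables autograd bookkeeping, saving memory
    pred = model(window)
print(pred.shape)  # → torch.Size([1, 4])
```

Pinning the thread count to the physical cores and wrapping inference in `torch.inference_mode()` keeps both latency and RAM predictable on small devices.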
Edge Deployment
AMNL can run on edge devices like Raspberry Pi 4, enabling on-premise predictive maintenance without cloud connectivity. While throughput is limited, 5 predictions/second is sufficient for single-asset monitoring with update intervals of 1 second or more.
Summary
GPU Memory Usage - Summary:
- Peak inference memory: 480 MB at batch size 256 (FP32)
- Activations dominate: 87.5% of memory is activations, not weights
- Linear scaling: ~1.8 MB per sample in batch
- FP16 halves memory: 240 MB at batch size 256
- Edge deployment: Possible on devices with 500+ MB RAM
| Deployment Target | Available Memory | Max Batch Size | Throughput |
|---|---|---|---|
| High-end GPU (16GB) | 16,000 MB | 256+ | 31K+/sec |
| Mid-range GPU (4GB) | 4,000 MB | 256 | 31K/sec |
| Entry GPU (2GB) | 2,000 MB | 128 | 28K/sec |
| Integrated GPU | 500 MB | 32 | 12K/sec |
| Edge CPU | 500 MB | 1-8 | 22-65/sec |
Key Insight: AMNL's memory footprint of under 500 MB enables deployment across a wide range of hardware—from high-end data center GPUs to edge devices. This democratizes predictive maintenance, allowing facilities of all sizes to implement AI-powered monitoring without significant infrastructure investment. The key to this efficiency is the compact 3.5M parameter design and the option for FP16 inference.