Chapter 5
15 min read
Section 23 of 261

Memory Layout and Contiguous Storage

Arrays

Arrays are not a clever data structure invented by computer scientists. They are the data structure that matches the shape of physical memory. To understand why arrays are fast, why they are everywhere, and why every other data structure is, in some sense, defined relative to them, we have to descend one level below the abstraction and look at how a CPU actually sees memory. Once you internalize that view, the rest of this chapter — static vs dynamic, multidimensional layouts, two pointers, sliding windows — reduces to a small set of consequences.

Memory Is a 1D Array of Bytes

From the CPU's point of view, main memory (RAM) is a single, gigantic, one-dimensional array of bytes. Each byte has an integer address, and each byte stores an unsigned 8-bit value. Mathematically, memory is a function

M:{0,1,2,,2641}{0,1,,255}.M : \{0, 1, 2, \dots, 2^{64}-1\} \to \{0, 1, \dots, 255\}.

Every variable, every object, every Python list, every PyTorch tensor — in the end — lives somewhere inside this enormous byte-addressed array. Higher-level structures are interpretations imposed on contiguous (or scattered) ranges of bytes.

Virtual vs physical addresses

Modern operating systems give each process its own virtual address space, which the MMU maps to physical RAM in 4 KB pages. For data-structure reasoning we treat virtual addresses as if they were the real thing — the mapping is consistent within a process and transparent to your program.

Element Size, Word Size, and Alignment

The CPU can address individual bytes, but it does not like to. It prefers to read and write words — 4 bytes on a 32-bit machine, 8 bytes on a 64-bit machine — in a single cycle. A read of a 4-byte integer that begins at an address divisible by 4 (and a 8-byte float at an address divisible by 8) is called aligned; the CPU pulls the value out in one shot. A misaligned read straddles two words and either costs an extra cycle or, on stricter architectures, raises a hardware exception.

TypeSize (typical 64-bit)Required alignmentWhy
bool / char / int81 B1 Bany byte address works
int16 / short2 B2 Btwo-byte halfword load
int32 / float4 B4 Bsingle 32-bit load
int64 / double / pointer8 B8 Bsingle 64-bit load
struct {int32 x, y; char name[16]}24 B4 Bpadded to alignment of widest field

Languages and compilers respect this. numpy.dtype('int32').itemsize is 4; sizeof(double) in C is 8; torch.float32 is 4 bytes per element. Whenever this section uses the symbol ss, it means “size in bytes of one element”.

Why structs grow

struct {char a; int32 b; char c;} is not 6 bytes — the compiler pads it to 12 to keep b on a 4-byte boundary. Reordering fields by decreasing size eliminates the padding. This matters when arrays of structs become arrays of millions of structs.

The Address Formula and Why Indexing Is O(1)

An array is a contiguous block of nn elements, each of size ss bytes, starting at some base\text{base} address. The ii-th element lives at

addr(A[i])  =  base+is.\text{addr}(A[i]) \;=\; \text{base} + i \cdot s.

That is one multiply and one add — constant work, independent of nn and independent of ii. This is the entire reason array indexing is O(1)O(1). The CPU is not searching; it is computing the address and issuing a single load.

Why O(1) requires contiguity

If elements were scattered around memory (a linked list), there would be no formula to apply. To reach the ii-th element you would have to start at the head and follow ii pointers, paying O(i)O(i) in the process. Contiguity is what turns “walk” into “arithmetic”.

A worked example

Suppose AA is an array of int32 starting at 0x1000. Then

  • A[0] is at 0x1000 + 0×4 = 0x1000
  • A[1] is at 0x1000 + 1×4 = 0x1004
  • A[7] is at 0x1000 + 7×4 = 0x101C
  • A[1000] is at 0x1000 + 1000×4 = 0x1FA0 — same one-multiply-one-add as A[1]

The Memory Hierarchy and Cache Lines

Constant-time addressing only tells half the story. The other half — the half that actually decides whether your program runs in 0.4 s or 4 s — is the cache hierarchy. RAM is roughly 100× slower than the CPU. To hide that gap, modern processors keep small, very fast SRAM caches near the cores.

LevelTypical sizeLatencyRelative to L1
L1 cache32–64 KB / core~1 ns (4 cycles)1×
L2 cache256 KB–1 MB / core~3 ns (12 cycles)~3×
L3 cache4–32 MB shared~10 ns (40 cycles)~10×
DRAM (RAM)16–128 GB~80–100 ns~100×
NVMe SSD1 TB~100 µs~100,000×
Spinning disk10 TB~10 ms~10,000,000×
Network round trip—~10 ms (LAN: ~0.5 ms)~10,000,000×

When the CPU loads a single byte that is not in cache, it does not fetch one byte. It fetches an entire cache line — almost universally 64 bytes on x86 and ARM — from RAM into L1. Subsequent accesses to addresses inside that 64-byte window cost roughly nothing.

For an int32 array, one cache line holds 64/4=1664 / 4 = 16 consecutive elements. A sequential walk of one million int32 values triggers about 106/1662,50010^6 / 16 \approx 62{,}500 cache misses — one per line, then 15 free hits. The same million values stored as a randomly-laid-out linked list trigger up to 10610^6 misses. That is the difference between a few milliseconds and a few seconds on the same hardware.

Locality is the real reason arrays dominate. The O(1)O(1) address formula makes them fast in theory; cache lines plus hardware prefetching make them fast in practice.

Spatial vs temporal locality

  • Spatial locality: if you just touched address aa, you will probably touch a+4a + 4 next. Arrays maximize this.
  • Temporal locality: if you just touched aa, you will probably touch aa again soon. Loops over arrays also exploit this.

The hardware prefetcher watches your access pattern. As soon as it detects a stride-1 sweep, it starts pulling future cache lines into L1 before you ask for them. On a contiguous array, the CPU is essentially running ahead of you.

Interactive: Memory Inspector

The widget below shows a 4×5 array as both its logical 2D grid and its actual 1D memory layout. Drag the i and j sliders to pick a cell; switch the layout to see how row-major and column-major place the same logical element at completely different physical addresses; switch the dtype to see how element size scales the strip.

Logical 2D view (4 × 5)
j →i[0,0][0,1][0,2][0,3][0,4][1,0][1,1][1,2][1,3][1,4][2,0][2,1][2,2][2,3][2,4][3,0][3,1][3,2][3,3][3,4]
Physical 1D memory (row-major traversal order, each cell = 4 bytes)
addr →[0,0]0x1000[0,1][0,2]0x1008[0,3][0,4]0x1010[1,0][1,1]0x1018[1,2][1,3]0x1020[1,4][2,0]0x1028[2,1][2,2]0x1030[2,3][2,4]0x1038[3,0][3,1]0x1040[3,2][3,3]0x1048[3,4]
Address calculation
addr=base+(incols+j)s\text{addr} = \text{base} + (i \cdot n_{\text{cols}} + j) \cdot s
base = 0x1000    s = 4 B    n_rows = 4    n_cols = 5
linear index = 2 * 5 + 3 = 13
addr = 0x1000 + 13 × 4 = 0x1034

Two observations worth pausing on. First: in row-major mode, walking i = 0, 1, 2, 3 with j fixed jumps the address by 5s5s each step — you skip across cache lines. Walking j = 0, 1, 2, 3, 4 with i fixed jumps by exactly ss — perfect stride-1 traversal. Column-major reverses this. Second: the element does not move; only the order in which elements are laid out changes. The 2D index [2, 3] still refers to the same value; it just lives at a different byte offset.

Quick Check

An array A of float64 (s = 8 bytes) starts at base address 0x1000 with 5 columns, stored row-major. What is the byte address of A[3][4]?

Multi-Dimensional Arrays in Linear Memory

Memory is one-dimensional, but our problems often are not. To store a 2D array AA with mm rows and nn columns, the language must pick a flattening. Two conventions dominate.

Row-major (C, Python, NumPy default, PyTorch default)

Rows are stored end-to-end. A[0][0], A[0][1], …, A[0][n-1], A[1][0], …. The address formula is

addr(A[i][j])  =  base+(in+j)s.\text{addr}(A[i][j]) \;=\; \text{base} + (i \cdot n + j) \cdot s.

Column-major (Fortran, MATLAB, R, Julia, BLAS “F-order”)

Columns are stored end-to-end. A[0][0], A[1][0], …, A[m-1][0], A[0][1], ….

addr(A[i][j])  =  base+(jm+i)s.\text{addr}(A[i][j]) \;=\; \text{base} + (j \cdot m + i) \cdot s.

The choice has real performance consequences because of cache lines. A nested loop should iterate the innermost dimension along memory:

🐍python
1# Row-major (NumPy default): inner loop varies the LAST index.
2for i in range(m):
3    for j in range(n):
4        process(A[i, j])     # stride-1, one cache miss per 16 elements
5
6# Cache-hostile: inner loop varies the FIRST index on row-major data.
7for j in range(n):
8    for i in range(m):
9        process(A[i, j])     # stride-n, one cache miss per element

On a 10000×10000 float64 matrix, the second loop is commonly 5–15× slower than the first — same algorithm, same FLOPs, same arithmetic complexity. The only difference is access order versus memory layout.

Strides: The Universal Language of Tensors

NumPy, PyTorch, JAX, TensorFlow, and BLAS all generalize the row-major / column-major dichotomy with a single concept: strides. A stride sds_d for axis dd is “the number of bytes you must add to the address to advance the index along axis dd by 1”. The general formula is

addr(A[i0,i1,,ik1])  =  base+d=0k1idsd.\text{addr}(A[i_0, i_1, \dots, i_{k-1}]) \;=\; \text{base} + \sum_{d=0}^{k-1} i_d \cdot s_d.

For a row-major (m, n) array of float64 elements, the strides are (n*8, 8): advancing one row jumps 8n8n bytes; advancing one column jumps 88 bytes. For column-major they are (8, m*8). Strides unify both layouts inside one formula.

The killer feature: strides let many operations skip copying. Transposing a matrix swaps the two strides. Slicing every other row (A[::2]) doubles the row stride. Reversing an axis negates a stride. NumPy and PyTorch routinely return views — new metadata over the same buffer — instead of copying gigabytes.

The stride trick in one line

x.T in PyTorch is O(1)O(1). It does not move a single byte; it returns a tensor whose strides are (stride[1], stride[0]). The data buffer is shared. The resulting tensor is no longer contiguous, which is why you sometimes see .contiguous() calls before reshape.

Code: Computing Addresses by Hand (Python / NumPy)

NumPy exposes everything we have discussed: .itemsize (the ss), .strides (the sds_d per axis), and .ctypes.data (the base address). We can reproduce the address formula and verify it against the buffer NumPy actually allocated.

Reproducing NumPy's address arithmetic
🐍numpy_addresses.py
3Allocate a contiguous int32 buffer

np.arange(20) produces [0, 1, 2, ..., 19] as int32. .reshape(4, 5) does NOT copy; it just attaches new shape (4, 5) and strides (20, 4) to the same 80-byte buffer.

EXAMPLE
shape = (4, 5), strides = (20, 4), buffer length = 80 bytes
4Read the base address

A.ctypes.data is the integer address of the first byte of the buffer in this Python process. On my machine it printed 0x7f9a3c043010 — the exact value differs every run.

5Element size in bytes (s)

For dtype int32 this is 4. This is the s in our address formula, the size of one element.

EXAMPLE
s = 4
6Per-axis strides in bytes

NumPy stores strides in bytes, not elements. For our (4, 5) int32 array: sr = 5 cols × 4 bytes = 20, sc = 4. Walking down a row jumps 20 bytes; walking across a column jumps 4 bytes. That is exactly row-major.

EXAMPLE
sr = 20, sc = 4
8Pick an arbitrary element

We will compute the address of A[2, 3] two ways and check they agree.

9Apply the address formula

addr = base + i*sr + j*sc = base + 2*20 + 3*4 = base + 52. The 53rd byte of the buffer (index 52) is where A[2, 3] starts.

EXAMPLE
base + 2*20 + 3*4 = base + 52
10Ask NumPy for the actual address

A[i, j] is a 0-d array view; its __array_interface__['data'] is (address, read_only_flag). We pull out the address NumPy itself thinks the element lives at.

17Verify

match prints True. Our hand-computed offset of 52 bytes lands on exactly the same byte NumPy lands on. The stride formula is not an analogy — it is what NumPy literally computes on every indexing operation.

EXAMPLE
match = True
9 lines without explanation
1import numpy as np
2
3A = np.arange(20, dtype=np.int32).reshape(4, 5)   # 4 rows, 5 cols, int32
4base = A.ctypes.data
5s    = A.itemsize
6sr, sc = A.strides
7
8i, j = 2, 3
9addr_formula = base + i * sr + j * sc
10addr_actual  = A[i, j].__array_interface__["data"][0]
11
12print(f"shape   = {A.shape}")
13print(f"strides = {A.strides}  (bytes per axis step)")
14print(f"itemsize= {s}")
15print(f"addr formula = {hex(addr_formula)}")
16print(f"addr actual  = {hex(addr_actual)}")
17print(f"match        = {addr_formula == addr_actual}")

Code: Verifying Contiguity in C

In C, an int array is guaranteed to be contiguous. We can ask the compiler for the address of each element and inspect the differences directly.

Watching consecutive int addresses on real hardware
📄contiguous.c
5Stack-allocate a contiguous array of 5 ints

On a 64-bit machine sizeof(int) is 4, so A occupies exactly 20 contiguous bytes on the stack. The compiler picks a base address, e.g. 0x7ffee2c4a8b0.

EXAMPLE
base = 0x7ffee2c4a8b0 (varies per run)
7Loop over every element and print its address

The %p format prints the address of A[i]; the offset is computed by casting both pointers to char* (so subtraction is in bytes, not ints).

8Address arithmetic in action

Output on a typical x86-64 run: &A[0] = 0x...8b0 (offset 0), &A[1] = 0x...8b4 (offset 4), &A[2] = 0x...8b8 (offset 8), &A[3] = 0x...8bc (offset 12), &A[4] = 0x...8c0 (offset 16). Every address is exactly 4 bytes after the previous one — this IS the address formula, observed empirically.

EXAMPLE
offsets: 0, 4, 8, 12, 16
13Total size of the array

sizeof(A) is 20 — five int elements at 4 bytes each. Notice this is bytes, not element count. (For element count: sizeof(A) / sizeof(A[0]) = 5.)

EXAMPLE
sizeof(int) = 4, sizeof(A) = 20
11 lines without explanation
1#include <stdio.h>
2#include <stddef.h>
3
4int main(void) {
5    int A[5] = {10, 20, 30, 40, 50};
6
7    for (int i = 0; i < 5; i++) {
8        printf("&A[%d] = %p   value = %d   offset = %ld bytes\n",
9               i, (void*)&A[i], A[i], (char*)&A[i] - (char*)&A[0]);
10    }
11
12    printf("sizeof(int) = %zu\n", sizeof(int));
13    printf("sizeof(A)   = %zu\n", sizeof(A));
14    return 0;
15}

Pointer-to-pointer is NOT a 2D array in C

int **A is an array of pointers to (separately allocated) int arrays. The rows are not contiguous with each other. A[i][j] requires two memory loads (one for the row pointer, one for the element) and destroys spatial locality across rows. A real contiguous 2D array in C is int A[m][n] or a single flat int *A = malloc(m*n*sizeof(int)) with manual indexing A[i*n + j].

Quick Check

You have a 10000×10000 float64 matrix stored row-major. Which loop order has the best cache behavior?

Where This Shows Up in the Real World

DomainManifestation
Deep learningPyTorch / NumPy strides power transpose, slicing, broadcasting without copies. .contiguous() forces a copy when an op needs row-major.
Image processingAn RGB image is a (H, W, 3) row-major uint8 buffer. GPUs read pixels in stride-1 sweeps; a 90° rotation that does not respect layout can be 10× slower than one that does.
Databases (OLTP)PostgreSQL, MySQL store data row-major (one row = one tuple, fields contiguous). Optimal for SELECT * WHERE id = 7 — fetch one row, all fields ride along.
Databases (OLAP)Parquet, ClickHouse, DuckDB, Snowflake store column-major. Optimal for SELECT AVG(price) over a billion rows — read only the price column; the rest never enters cache.
Game engines / ECSStructure-of-arrays (positions[], velocities[]) beats array-of-structures (entities[i].pos, .vel) because update systems touch one component at a time, and SoA gives perfect locality.
Numerical libraries (BLAS, LAPACK)Take a 'leading dimension' parameter precisely because they need the stride between rows. Whole ecosystem is built on the address formula.
File formatsCSV is row-major (one record per line); Parquet/ORC are column-major. Compression ratios on Parquet are 5–10× better because adjacent values in a column are similar.
NetworkingMulti-byte integers travel in 'network byte order' (big-endian). htonl / ntohl exist to bridge endianness when reading raw buffers off the wire.
Operating systemsPage tables, file descriptor tables, the process scheduler's runqueue — all are arrays. O(1) indexed lookup is the hot path.
BioinformaticsDNA sequence alignment fills an (m+1)×(n+1) DP matrix. Choosing row-major vs column-major changes which dependency is stride-1 in the inner loop.

Pitfalls and Gotchas

  1. Wrong loop order on multidimensional data. The single most common cache-induced slowdown. On row-major data, the rightmost index varies in the innermost loop. On column-major data, the leftmost index does.
  2. “2D arrays” that are arrays of pointers. Java's int[][], C's int **A, Python's list-of-lists — these are not contiguous. Each row lives wherever the allocator placed it. NumPy's ndarray and C's int A[m][n] are the contiguous versions.
  3. Object arrays in managed languages. In Java, int[] stores 4-byte ints contiguously. Integer[] stores 8-byte pointers to boxed integers scattered across the heap — a different beast. Same in Python: a list of int is a list of pointers to PyObjects; numpy.int32 arrays are dense.
  4. Misalignment. Reading a 4-byte int from an odd address is undefined behavior in C, and several cycles slower on x86 even when it works. SIMD instructions (SSE / AVX / NEON) often require 16- or 32-byte alignment.
  5. Endianness when reading raw bytes. A uint32 value 0x01020304 is stored as bytes 04 03 02 01 on x86 (little-endian) and 01 02 03 04 on most network protocols (big-endian). Reading binary file formats and network packets without honoring this produces silently corrupt numbers.
  6. Strided views that you forgot were strided. A NumPy slice x[::2] shares storage with x. Writing to it mutates the original. .copy() when you need independence.
  7. Padding inside structs. An array of one million struct {char a; double b;} uses 16 MB, not 9 MB — the compiler pads each struct to 16 bytes for alignment. Reorder fields by descending size to recover the wasted space.

Everything in this chapter follows from this section. Static vs dynamic arrays is a question about who owns the contiguous block and how it grows. Multi-dimensional arrays are a flattening choice on top of the same address formula. Two pointers and sliding windows are algorithms that exploit stride-1 access for cache-friendly linear sweeps. Implementing dynamic arrays is, in the end, a story about resizing the contiguous block and amortizing the cost of moving bytes. Hold on to one image: a flat strip of bytes, an integer base, a fixed element size, and an index that turns into an address by a single multiply-and-add. That is what an array is.

Loading comments...