Arrays are not a clever data structure invented by computer scientists. They are the data structure that matches the shape of physical memory. To understand why arrays are fast, why they are everywhere, and why every other data structure is, in some sense, defined relative to them, we have to descend one level below the abstraction and look at how a CPU actually sees memory. Once you internalize that view, the rest of this chapter — static vs dynamic, multidimensional layouts, two pointers, sliding windows — reduces to a small set of consequences.
Memory Is a 1D Array of Bytes
From the CPU's point of view, main memory (RAM) is a single, gigantic, one-dimensional array of bytes. Each byte has an integer address, and each byte stores an unsigned 8-bit value. Mathematically, memory is a function
Every variable, every object, every Python list, every PyTorch tensor — in the end — lives somewhere inside this enormous byte-addressed array. Higher-level structures are interpretations imposed on contiguous (or scattered) ranges of bytes.
Virtual vs physical addresses
Element Size, Word Size, and Alignment
The CPU can address individual bytes, but it does not like to. It prefers to read and write words — 4 bytes on a 32-bit machine, 8 bytes on a 64-bit machine — in a single cycle. A read of a 4-byte integer that begins at an address divisible by 4 (and a 8-byte float at an address divisible by 8) is called aligned; the CPU pulls the value out in one shot. A misaligned read straddles two words and either costs an extra cycle or, on stricter architectures, raises a hardware exception.
| Type | Size (typical 64-bit) | Required alignment | Why |
|---|---|---|---|
| bool / char / int8 | 1 B | 1 B | any byte address works |
| int16 / short | 2 B | 2 B | two-byte halfword load |
| int32 / float | 4 B | 4 B | single 32-bit load |
| int64 / double / pointer | 8 B | 8 B | single 64-bit load |
| struct {int32 x, y; char name[16]} | 24 B | 4 B | padded to alignment of widest field |
Languages and compilers respect this. numpy.dtype('int32').itemsize is 4; sizeof(double) in C is 8; torch.float32 is 4 bytes per element. Whenever this section uses the symbol , it means “size in bytes of one element”.
Why structs grow
struct {char a; int32 b; char c;} is not 6 bytes — the compiler pads it to 12 to keep b on a 4-byte boundary. Reordering fields by decreasing size eliminates the padding. This matters when arrays of structs become arrays of millions of structs.The Address Formula and Why Indexing Is O(1)
An array is a contiguous block of elements, each of size bytes, starting at some address. The -th element lives at
That is one multiply and one add — constant work, independent of and independent of . This is the entire reason array indexing is . The CPU is not searching; it is computing the address and issuing a single load.
Why O(1) requires contiguity
A worked example
Suppose is an array of int32 starting at 0x1000. Then
A[0]is at0x1000 + 0×4 = 0x1000A[1]is at0x1000 + 1×4 = 0x1004A[7]is at0x1000 + 7×4 = 0x101CA[1000]is at0x1000 + 1000×4 = 0x1FA0— same one-multiply-one-add asA[1]
The Memory Hierarchy and Cache Lines
Constant-time addressing only tells half the story. The other half — the half that actually decides whether your program runs in 0.4 s or 4 s — is the cache hierarchy. RAM is roughly 100× slower than the CPU. To hide that gap, modern processors keep small, very fast SRAM caches near the cores.
| Level | Typical size | Latency | Relative to L1 |
|---|---|---|---|
| L1 cache | 32–64 KB / core | ~1 ns (4 cycles) | 1× |
| L2 cache | 256 KB–1 MB / core | ~3 ns (12 cycles) | ~3× |
| L3 cache | 4–32 MB shared | ~10 ns (40 cycles) | ~10× |
| DRAM (RAM) | 16–128 GB | ~80–100 ns | ~100× |
| NVMe SSD | 1 TB | ~100 µs | ~100,000× |
| Spinning disk | 10 TB | ~10 ms | ~10,000,000× |
| Network round trip | — | ~10 ms (LAN: ~0.5 ms) | ~10,000,000× |
When the CPU loads a single byte that is not in cache, it does not fetch one byte. It fetches an entire cache line — almost universally 64 bytes on x86 and ARM — from RAM into L1. Subsequent accesses to addresses inside that 64-byte window cost roughly nothing.
For an int32 array, one cache line holds consecutive elements. A sequential walk of one million int32 values triggers about cache misses — one per line, then 15 free hits. The same million values stored as a randomly-laid-out linked list trigger up to misses. That is the difference between a few milliseconds and a few seconds on the same hardware.
Locality is the real reason arrays dominate. The address formula makes them fast in theory; cache lines plus hardware prefetching make them fast in practice.
Spatial vs temporal locality
- Spatial locality: if you just touched address , you will probably touch next. Arrays maximize this.
- Temporal locality: if you just touched , you will probably touch again soon. Loops over arrays also exploit this.
The hardware prefetcher watches your access pattern. As soon as it detects a stride-1 sweep, it starts pulling future cache lines into L1 before you ask for them. On a contiguous array, the CPU is essentially running ahead of you.
Interactive: Memory Inspector
The widget below shows a 4×5 array as both its logical 2D grid and its actual 1D memory layout. Drag the i and j sliders to pick a cell; switch the layout to see how row-major and column-major place the same logical element at completely different physical addresses; switch the dtype to see how element size scales the strip.
linear index = 2 * 5 + 3 = 13
addr = 0x1000 + 13 × 4 = 0x1034
Two observations worth pausing on. First: in row-major mode, walking i = 0, 1, 2, 3 with j fixed jumps the address by each step — you skip across cache lines. Walking j = 0, 1, 2, 3, 4 with i fixed jumps by exactly — perfect stride-1 traversal. Column-major reverses this. Second: the element does not move; only the order in which elements are laid out changes. The 2D index [2, 3] still refers to the same value; it just lives at a different byte offset.
Quick Check
An array A of float64 (s = 8 bytes) starts at base address 0x1000 with 5 columns, stored row-major. What is the byte address of A[3][4]?
Multi-Dimensional Arrays in Linear Memory
Memory is one-dimensional, but our problems often are not. To store a 2D array with rows and columns, the language must pick a flattening. Two conventions dominate.
Row-major (C, Python, NumPy default, PyTorch default)
Rows are stored end-to-end. A[0][0], A[0][1], …, A[0][n-1], A[1][0], …. The address formula is
Column-major (Fortran, MATLAB, R, Julia, BLAS “F-order”)
Columns are stored end-to-end. A[0][0], A[1][0], …, A[m-1][0], A[0][1], ….
The choice has real performance consequences because of cache lines. A nested loop should iterate the innermost dimension along memory:
1# Row-major (NumPy default): inner loop varies the LAST index.
2for i in range(m):
3 for j in range(n):
4 process(A[i, j]) # stride-1, one cache miss per 16 elements
5
6# Cache-hostile: inner loop varies the FIRST index on row-major data.
7for j in range(n):
8 for i in range(m):
9 process(A[i, j]) # stride-n, one cache miss per elementOn a 10000×10000 float64 matrix, the second loop is commonly 5–15× slower than the first — same algorithm, same FLOPs, same arithmetic complexity. The only difference is access order versus memory layout.
Strides: The Universal Language of Tensors
NumPy, PyTorch, JAX, TensorFlow, and BLAS all generalize the row-major / column-major dichotomy with a single concept: strides. A stride for axis is “the number of bytes you must add to the address to advance the index along axis by 1”. The general formula is
For a row-major (m, n) array of float64 elements, the strides are (n*8, 8): advancing one row jumps bytes; advancing one column jumps bytes. For column-major they are (8, m*8). Strides unify both layouts inside one formula.
The killer feature: strides let many operations skip copying. Transposing a matrix swaps the two strides. Slicing every other row (A[::2]) doubles the row stride. Reversing an axis negates a stride. NumPy and PyTorch routinely return views — new metadata over the same buffer — instead of copying gigabytes.
The stride trick in one line
x.T in PyTorch is . It does not move a single byte; it returns a tensor whose strides are (stride[1], stride[0]). The data buffer is shared. The resulting tensor is no longer contiguous, which is why you sometimes see .contiguous() calls before reshape.Code: Computing Addresses by Hand (Python / NumPy)
NumPy exposes everything we have discussed: .itemsize (the ), .strides (the per axis), and .ctypes.data (the base address). We can reproduce the address formula and verify it against the buffer NumPy actually allocated.
Code: Verifying Contiguity in C
In C, an int array is guaranteed to be contiguous. We can ask the compiler for the address of each element and inspect the differences directly.
Pointer-to-pointer is NOT a 2D array in C
int **A is an array of pointers to (separately allocated) int arrays. The rows are not contiguous with each other. A[i][j] requires two memory loads (one for the row pointer, one for the element) and destroys spatial locality across rows. A real contiguous 2D array in C is int A[m][n] or a single flat int *A = malloc(m*n*sizeof(int)) with manual indexing A[i*n + j].Quick Check
You have a 10000×10000 float64 matrix stored row-major. Which loop order has the best cache behavior?
Where This Shows Up in the Real World
| Domain | Manifestation |
|---|---|
| Deep learning | PyTorch / NumPy strides power transpose, slicing, broadcasting without copies. .contiguous() forces a copy when an op needs row-major. |
| Image processing | An RGB image is a (H, W, 3) row-major uint8 buffer. GPUs read pixels in stride-1 sweeps; a 90° rotation that does not respect layout can be 10× slower than one that does. |
| Databases (OLTP) | PostgreSQL, MySQL store data row-major (one row = one tuple, fields contiguous). Optimal for SELECT * WHERE id = 7 — fetch one row, all fields ride along. |
| Databases (OLAP) | Parquet, ClickHouse, DuckDB, Snowflake store column-major. Optimal for SELECT AVG(price) over a billion rows — read only the price column; the rest never enters cache. |
| Game engines / ECS | Structure-of-arrays (positions[], velocities[]) beats array-of-structures (entities[i].pos, .vel) because update systems touch one component at a time, and SoA gives perfect locality. |
| Numerical libraries (BLAS, LAPACK) | Take a 'leading dimension' parameter precisely because they need the stride between rows. Whole ecosystem is built on the address formula. |
| File formats | CSV is row-major (one record per line); Parquet/ORC are column-major. Compression ratios on Parquet are 5–10× better because adjacent values in a column are similar. |
| Networking | Multi-byte integers travel in 'network byte order' (big-endian). htonl / ntohl exist to bridge endianness when reading raw buffers off the wire. |
| Operating systems | Page tables, file descriptor tables, the process scheduler's runqueue — all are arrays. O(1) indexed lookup is the hot path. |
| Bioinformatics | DNA sequence alignment fills an (m+1)×(n+1) DP matrix. Choosing row-major vs column-major changes which dependency is stride-1 in the inner loop. |
Pitfalls and Gotchas
- Wrong loop order on multidimensional data. The single most common cache-induced slowdown. On row-major data, the rightmost index varies in the innermost loop. On column-major data, the leftmost index does.
- “2D arrays” that are arrays of pointers. Java's
int[][], C'sint **A, Python's list-of-lists — these are not contiguous. Each row lives wherever the allocator placed it. NumPy'sndarrayand C'sint A[m][n]are the contiguous versions. - Object arrays in managed languages. In Java,
int[]stores 4-byte ints contiguously.Integer[]stores 8-byte pointers to boxed integers scattered across the heap — a different beast. Same in Python: a list ofintis a list of pointers toPyObjects;numpy.int32arrays are dense. - Misalignment. Reading a 4-byte int from an odd address is undefined behavior in C, and several cycles slower on x86 even when it works. SIMD instructions (SSE / AVX / NEON) often require 16- or 32-byte alignment.
- Endianness when reading raw bytes. A
uint32value0x01020304is stored as bytes04 03 02 01on x86 (little-endian) and01 02 03 04on most network protocols (big-endian). Reading binary file formats and network packets without honoring this produces silently corrupt numbers. - Strided views that you forgot were strided. A NumPy slice
x[::2]shares storage withx. Writing to it mutates the original..copy()when you need independence. - Padding inside structs. An array of one million
struct {char a; double b;}uses 16 MB, not 9 MB — the compiler pads each struct to 16 bytes for alignment. Reorder fields by descending size to recover the wasted space.
Everything in this chapter follows from this section. Static vs dynamic arrays is a question about who owns the contiguous block and how it grows. Multi-dimensional arrays are a flattening choice on top of the same address formula. Two pointers and sliding windows are algorithms that exploit stride-1 access for cache-friendly linear sweeps. Implementing dynamic arrays is, in the end, a story about resizing the contiguous block and amortizing the cost of moving bytes. Hold on to one image: a flat strip of bytes, an integer base, a fixed element size, and an index that turns into an address by a single multiply-and-add. That is what an array is.