Chapter 5
15 min read
Section 25 of 261

Multi-dimensional Arrays

Arrays

Memory is one-dimensional. Problems are not. The whole story of multi-dimensional arrays — the entire reason NumPy, PyTorch, BLAS, OpenCV, every game engine, and every relational database had to make hard, opinionated decisions about layout — is the friction between those two facts. Section 1 showed that an array is a contiguous run of bytes plus an address formula. This section generalizes that formula to any number of dimensions, exposes the single concept (strides) that unifies all layouts, and shows why the same algorithm can be 10× faster or slower depending on which axis you put on the inside of the loop.

What a Multi-Dimensional Array Really Is

Mathematically, an NN-dimensional array (a tensor) of shape (n1,n2,,nN)(n_1, n_2, \dots, n_N) is a function from a Cartesian product of index ranges into an element type:

A:d=1N{0,1,,nd1}T.A : \prod_{d=1}^{N} \{0, 1, \dots, n_d - 1\} \to T.

For N=1N=1 this is a vector; for N=2N=2 a matrix; for N=3N=3 a cube of values; for N=4N=4 a stack of cubes (mini-batches in deep learning); and so on. The total element count is d=1Nnd\prod_{d=1}^{N} n_d.

That is the logical picture. Physically, the elements have to be ordered into a 1D byte strip, because that is the only thing memory understands. A multi-dimensional array is therefore two things glued together:

  1. A contiguous (or near-contiguous) buffer of dnd\prod_d n_d elements in memory.
  2. A layout function that maps an index tuple (i1,,iN)(i_1, \dots, i_N) to a flat offset into that buffer.

Different libraries pick different layout functions. Picking one is the entire content of the next two sections.

Two Ways to Lay Out 2D in 1D Memory

For a 2D array AA of shape (nrows,ncols)(n_{\text{rows}}, n_{\text{cols}}), there are two canonical orderings.

Row-major (C, Python lists, NumPy default, PyTorch default, OpenCV, Pillow)

Lay out row 0 entirely, then row 1, then row 2, and so on. Within a row, columns are contiguous. The flat offset (in elements) is

off(i,j)=incols+j.\text{off}(i, j) = i \cdot n_{\text{cols}} + j.

Column-major (Fortran, MATLAB, R, Julia, BLAS, LAPACK, Eigen)

Lay out column 0 entirely, then column 1, and so on. Within a column, rows are contiguous. The flat offset is

off(i,j)=jnrows+i.\text{off}(i, j) = j \cdot n_{\text{rows}} + i.

Same logical matrix, same element values, completely different physical addresses. The choice is not arbitrary: it leaks into the API of every numerical library you will ever use, and it determines whether a given loop is fast or slow.

Why does Fortran prefer column-major?

Fortran was designed for dense linear algebra, and the canonical operation in linear algebra — matrix-vector multiply y=Axy = Ax — takes inner products of columns of AA with xx. Column-major makes those inner products stride-1. Row-major C, by contrast, descends from systems programming, where 2D arrays were rare and the natural reading order “left-to-right, top-to-bottom” (rows first) won.

The General N-D Formula and Strides

Both layouts are special cases of a single formula. Define the stride sds_d of axis dd as “how many elements to step in the buffer when index idi_d increases by 1, with all other indices held fixed”. Then

off(i1,i2,,iN)=d=1Nidsd.\text{off}(i_1, i_2, \dots, i_N) = \sum_{d=1}^{N} i_d \cdot s_d.

For row-major (C-order) on a tensor of shape (n1,,nN)(n_1, \dots, n_N), the strides are determined by the recursion: the last axis varies fastest, so its stride is 1; each earlier axis strides over a full slab of all later axes. Concretely,

sN=1,sd=sd+1nd+1=k=d+1Nnk.s_N = 1, \quad s_d = s_{d+1} \cdot n_{d+1} = \prod_{k=d+1}^{N} n_k.

For a (n0,n1,n2)(n_0, n_1, n_2) 3D tensor in row-major order, the strides are (n1n2,  n2,  1)(n_1 n_2, \; n_2, \; 1). Walking i2i_2 by 1 advances by 1 element. Walking i1i_1 by 1 advances by n2n_2 elements (a full row). Walking i0i_0 by 1 advances by n1n2n_1 n_2 elements (a full slab). For column-major (F-order), the formula mirrors with the first axis varying fastest:

s1=1,sd=k=1d1nk.s_1 = 1, \quad s_d = \prod_{k=1}^{d-1} n_k.

Strides are the universal language

Once you accept the stride formulation, row-major and column-major stop being two separate worlds. They are just two specific stride vectors. Permuting axes, slicing every other element, reversing an axis, transposing a matrix — all of these are just recomputing strides. The data buffer never moves. NumPy, PyTorch, JAX, and TensorFlow all expose this directly via the .strides attribute.

True 2D Arrays vs Jagged Arrays of Pointers

Not everything advertised as a “2D array” is laid out as one contiguous block. There is a critical distinction between a true 2D array — where all mnm \cdot n elements live in one contiguous run — and a jagged array (or array of arrays), which is a 1D array of pointers, each pointing to an independently allocated row.

ConstructLayoutLocalityIndexing cost
C: int A[M][N]Single contiguous M×N×4-byte blockExcellent1 multiply + 1 add + 1 load
C: int *A = malloc(M*N*sizeof(int))Single contiguous block, manual A[i*N + j]Excellent1 multiply + 1 add + 1 load
C: int **A (M mallocs)1 array of M pointers + M separate row buffersPoor across rows2 dependent loads
Java: int[][] A1 array of M references to int[N] arraysPoor across rows2 dependent loads + bounds check
Python: list of lists1 list of M PyObject* pointers to row lists of PyObject*CatastrophicMultiple pointer chases per element
NumPy: np.zeros((m, n))Single contiguous m×n×dtype-byte blockExcellent1 multiply + 1 add + 1 load
PyTorch: torch.zeros(m, n)Single contiguous block (unless explicitly strided)Excellent1 multiply + 1 add + 1 load

The two-load form is the killer. When A[i][j] requires loading A[i] first to discover the row pointer and then dereferencing it, the second load cannot start until the first finishes (a data-dependent load), the prefetcher has nothing to predict (the row pointers can point anywhere), and each row may live in a different page of the heap. For a 1024×1024 matrix, the difference between a contiguous float A[1024][1024] and a float **A with 1024 separate mallocs is routinely 3–10× on a tight numerical loop.

Java's int[][] is jagged by design

In Java, int[][] is genuinely an array of array references. Each row may have a different length (jagged arrays are first-class), and rows live wherever the GC placed them. For numeric work, packing into a flat int[] with manual i*n + j indexing — or using a library like ND4J or JBLAS — is dramatically faster.

The Loop-Order Rule and Cache Reality

The address formula tells you indexing is O(1)O(1). The cache hierarchy tells you that the order in which you ask for those O(1)O(1) elements determines whether you run at L1 speed or DRAM speed — a roughly 100× gap. The rule is simple and exception-free:

Iterate the dimension whose stride is 1 in the innermost loop. On row-major data, that is the last index. On column-major data, the first.

Why? Because a stride-1 access pattern reads consecutive bytes. A 64-byte cache line then carries 16 float32s or 8 float64s of useful work for the price of one DRAM round trip, and the hardware prefetcher streams the next line into L1 before you ask for it. A stride-ncolsn_{\text{cols}} access pattern, by contrast, jumps a full row each iteration, touches a fresh cache line every time, and gives the prefetcher nothing to grab.

🐍python
1# Row-major data (NumPy default). Inner loop varies LAST index → stride-1 → fast.
2for i in range(m):
3    for j in range(n):
4        out += A[i, j]                # consecutive 8-byte reads
5
6# Same logical work, but inner loop varies FIRST index → stride-n → cache miss every step.
7for j in range(n):
8    for i in range(m):
9        out += A[i, j]                # 8*n-byte jumps; prefetcher gives up

On a 4096×4096 float64 matrix (128 MB — well outside any L2 or L3) the second loop is typically 5–15× slower than the first on the same CPU, with the same FLOP count, the same arithmetic operations, and the same correctness. Nothing changed except the order in which we asked memory for our own data.

Interactive: ND-Tensor Stride Inspector

The widget below makes the abstract formulas concrete. Pick a 2D or 3D shape, drag the index sliders, and watch three things at once: (1) the logical grid view, (2) the physical 1D strip in memory order, and (3) the offset formula expanded with your specific numbers. In 2D mode, toggle row-major / column-major — the data does not move, but the strip's ordering does. In 3D mode, flip the transpose switch to swap the strides of axis 0 and axis 2; the same buffer now appears in a different visual order, which is exactly how a zero-copy transpose works in PyTorch.

Logical 4×5 view
j →i[0,0][0,1][0,2][0,3][0,4][1,0][1,1][1,2][1,3][1,4][2,0][2,1][2,2][2,3][2,4][3,0][3,1][3,2][3,3][3,4]
Physical 1D memory (row-major traversal)
offset →[0,0]0[0,1][0,2]2[0,3][0,4]4[1,0][1,1]6[1,2][1,3]8[1,4][2,0]10[2,1][2,2]12[2,3][2,4]14[3,0][3,1]16[3,2][3,3]18[3,4]
Offset calculation
off(i,j)=incols+j\text{off}(i, j) = i \cdot n_{\text{cols}} + j
shape = (4, 5)    strides = (5, 1)    index = (2, 3)
offset = 2·5 + 3·1 = 13

Two patterns to look for. First, when you change the layout in 2D mode, observe that the same logical cell [i, j] lands at completely different positions in the strip. Second, the transpose toggle in 3D mode flips strides without ever rewriting the buffer; that is why x.transpose(0, 2), x.permute(...), and x[::-1] all run in O(1)O(1) regardless of how big the tensor is.

Quick Check

A row-major 3D tensor has shape (5, 4, 6) and dtype int32 (4 bytes). What is the byte offset of A[1, 2, 3] from the base address?

Case Study: Matrix Multiplication and Loop Order

Matrix multiplication C=ABC = AB with ARm×kA \in \mathbb{R}^{m \times k}, BRk×nB \in \mathbb{R}^{k \times n}, CRm×nC \in \mathbb{R}^{m \times n} is the textbook playground for loop-order effects. The naive triple loop has six possible orderings of the three indices (i, j, k); they all produce identical answers but exhibit dramatically different cache behavior on row-major storage.

OrderC[i,j] accessA[i,k] accessB[k,j] accessVerdict
i j k (naive)stride 1 in inner-but-onestride 1 (good)stride n (bad)Mediocre — B kills it
i k jstride 1 (write-streaming)loop-invariant in innerstride 1 (good)Best naive order
j i kstride 1 across rows of C (bad on row-major)stride 1 in middlestride n in inner (bad)Cache-hostile
k i jstride 1 (good)stride 1 in middlestride 1 (good)Excellent — but more memory traffic

On a 1024×1024 float32 multiply, the i k j order routinely beats i j k by 3–5× in plain C with optimization on, simply because the innermost loop now streams contiguously through both BB and CC. This is also why production BLAS (OpenBLAS, MKL, BLIS) does not use any of these orderings directly. They use tiled or blocked matrix multiply: chop AA, BB, and CC into small panels that fit in L1 / L2, multiply panel-by-panel, and reuse each panel many times before evicting it. The result is the difference between 2 GFLOPS (naive) and 50–200 GFLOPS (BLAS) on the same CPU.

Higher Dimensions: Images, Video, Mini-Batches

Three- and four-dimensional arrays are the workhorses of perception, graphics, and deep learning. The shape conventions are not historical accidents — they are deliberate choices about which axis is stride-1, because that determines which kernels run fast.

DomainShapeConventionStride-1 axisWhy
Grayscale image(H, W)Row-majorW (columns)Image filters scan left-to-right; a kernel reads consecutive pixels in a row.
RGB image (NumPy / OpenCV / Pillow)(H, W, 3)HWC, row-majorChannelCo-locating R/G/B for one pixel speeds up per-pixel ops (color conversion, blending).
Mini-batch image (PyTorch default, cuDNN)(B, C, H, W)NCHW, row-majorW (last)Convolutions slide a kernel across H, W; stride-1 along W matches GPU coalesced loads.
Mini-batch image (TensorFlow CPU default)(B, H, W, C)NHWC, row-majorC (channel)Pointwise ops over channels become stride-1; on CPUs that vectorize over channel.
Video clip(F, H, W, C)Frame-majorC or W depending on backendF is the slowest-varying axis; one frame at a time fits in L2.
Volumetric scan (CT / MRI)(D, H, W)Slice-majorWDICOM stores depth slices contiguously; per-slice 2D ops are cache-friendly.
Attention activations (Transformer)(B, H, S, D_h)Batch-Head-Seq-DimD_hPer-head attention does (S×D_h) · (D_h×S); stride-1 D_h gives stride-1 inner products.

NCHW vs NHWC is a real engineering tradeoff

cuDNN's convolution implementations are tuned for both layouts but historically prefer NCHW because the spatial axes (H, W) are contiguous in memory, which suits the im2col + GEMM strategy. On modern GPUs with tensor cores, NHWC often wins for FP16/INT8 because the channel axis carries the SIMD lane. Mixing the two without converting will silently transpose every tensor that crosses the boundary.

Code: A Hand-Built Array2D in Python

To make the formulas concrete, here is a 2D array implemented from scratch on top of a flat Python list. Notice how every read and write reduces to the row-major offset formula.

A 2D array as a 1D list plus an offset formula
🐍array2d.py
1The class wraps two pieces of state

Logical shape (n_rows, n_cols) and the flat data buffer. The buffer is a single Python list of length n_rows * n_cols — exactly the contiguous block a real 2D array would occupy.

2Constructor signature

We accept the two dimensions and an optional fill value (default 0). There is no nested list anywhere in this class.

5Allocate the flat buffer

[fill] * (n_rows * n_cols) creates one Python list of size n_rows·n_cols. For Array2D(4, 5), this is a list of 20 zeros. Memory-wise this is a 1D structure; we will impose 2D-ness in software.

EXAMPLE
Array2D(4, 5) → self.data is a list of 20 zeros
7Helper that turns (i, j) into a flat index

This is the address formula in three lines: bounds-check first, then return i * n_cols + j. Every other method routes through this — there is exactly ONE place where 2D becomes 1D, which is the whole point of separating layout from logical access.

8Bounds checking is mandatory in Python

Unlike C, where overshooting an array silently reads adjacent memory (and is undefined behavior), we explicitly raise IndexError. NumPy does the same — it bounds-checks in Python and skips the check inside compiled C kernels.

10Apply the row-major offset formula

i * n_cols + j is precisely the expression from the address formula. For our (4, 5) example, A[2, 3] becomes 2 * 5 + 3 = 13. Element [2, 3] therefore lives at self.data[13]. This is a single multiply and a single add — O(1), independent of n_rows and n_cols.

EXAMPLE
A[2, 3] → 2 * 5 + 3 = 13
12__getitem__ enables A[i, j] syntax

Python passes the tuple (i, j) as key. We unpack, compute the flat offset, and read from data. The user sees a 2D access; the hardware sees a 1D access.

13Tuple unpacking

key is the tuple (i, j) that Python builds when you write A[2, 3]. We unpack it into the two integer indices.

14One indirection: flat index, then list lookup

self.data[self._flat(i, j)] computes i*n_cols+j and indexes the underlying list. For A[2, 3] on a (4, 5) array: self.data[13].

16__setitem__ for assignment

Same pattern, just storing instead of loading. A[i, j] = v becomes self.data[i * n_cols + j] = v. One offset computation, one write.

17Unpack the index tuple again

Identical to the getter — we extract (i, j) from the key.

18Write through the flat index

Compute the flat offset, then assign. Setting A[2, 3] = 7 mutates self.data[13] from 0 to 7. Nothing else in the buffer changes.

21Construct a 4x5 array of zeros

20 zeros laid out flat. Logically a 4-row by 5-column grid; physically a list of 20 elements.

EXAMPLE
self.data = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
22Write A[2, 3] = 7

Routes through __setitem__ → _flat(2, 3) → returns 13 → self.data[13] = 7. The 14th element of the buffer (zero-indexed position 13) becomes 7. All others remain 0.

EXAMPLE
After this line, position 13 of the flat buffer holds 7.
23Inspect the flat buffer

Printing data shows the 7 sitting at position 13 — exactly where the row-major formula said it would. The 2D structure is entirely in our software (the _flat method); the physical buffer knows only positions 0..19.

EXAMPLE
[0,0,0,0,0, 0,0,0,0,0, 0,0,0,7,0, 0,0,0,0,0]
24Confirm the offset value

_flat(2, 3) returns 13. Same formula NumPy applies internally on every indexing operation, just written explicitly in Python instead of compiled C.

EXAMPLE
13
8 lines without explanation
1class Array2D:
2    def __init__(self, n_rows, n_cols, fill=0):
3        self.n_rows = n_rows
4        self.n_cols = n_cols
5        self.data = [fill] * (n_rows * n_cols)   # one flat 1-D buffer
6
7    def _flat(self, i, j):
8        if not (0 <= i < self.n_rows and 0 <= j < self.n_cols):
9            raise IndexError(f"({i}, {j}) out of bounds for ({self.n_rows}, {self.n_cols})")
10        return i * self.n_cols + j               # row-major offset
11
12    def __getitem__(self, key):
13        i, j = key
14        return self.data[self._flat(i, j)]
15
16    def __setitem__(self, key, value):
17        i, j = key
18        self.data[self._flat(i, j)] = value
19
20
21A = Array2D(4, 5)
22A[2, 3] = 7
23print(A.data)            # [0,0,0,0,0, 0,0,0,0,0, 0,0,0,7,0, 0,0,0,0,0]
24print(A._flat(2, 3))     # 13

Swapping the offset formula to j * self.n_rows + i would convert this class to column-major. The data buffer would not change shape; only the mapping function would. Layout is software, not data.

Code: Strides, Views, and Transpose-Without-Copy (NumPy / PyTorch)

Real libraries do not write a fresh class for every shape. They store one buffer plus a shape tuple and a strides tuple, and let the strides tuple do all the work. Here is how to read it out and watch a transpose happen with zero copy.

Watching a zero-copy transpose
🐍strides_demo.py
1Import NumPy

We need .arange, .reshape, .transpose, .strides, .ctypes.data, and np.shares_memory. All standard NumPy.

3Allocate a 2x3x4 int32 tensor

np.arange(24) produces 0..23. Reshape attaches shape (2, 3, 4) to the same 96-byte buffer (24 elements × 4 bytes). Default order is row-major (C-order).

EXAMPLE
shape = (2, 3, 4), 96 bytes total, values 0..23
4Read the shape

Just confirms (2, 3, 4). This is the logical view.

EXAMPLE
A.shape = (2, 3, 4)
5Read the strides — in BYTES, not elements

For row-major (2, 3, 4) int32: stride along axis 2 (last) is 4 bytes, axis 1 is 4·4 = 16 bytes, axis 0 is 3·4·4 = 48 bytes. So A.strides = (48, 16, 4). Walking i_0 by 1 jumps 48 bytes (a 3×4 slab); walking i_1 by 1 jumps 16 bytes (a row of 4 ints); walking i_2 by 1 jumps 4 bytes (one int).

EXAMPLE
A.strides = (48, 16, 4)
6Read the base address

A.ctypes.data is the integer address of the first byte of the buffer. We will use it to verify that A and B point to the same memory and to hand-compute element addresses.

8Transpose all three axes — A.transpose(2, 1, 0)

This is the headline trick. transpose does NOT copy any data. It returns a new ndarray object whose .data points to the same buffer, but with shape and strides permuted: shape (4, 3, 2) and strides (4, 16, 48). The 96-byte buffer is untouched.

9Confirm the new shape

B.shape is (4, 3, 2) — the original (2, 3, 4) with axes in reversed order, exactly as requested.

EXAMPLE
B.shape = (4, 3, 2)
10And the corresponding new strides

B.strides = (4, 16, 48). The first axis of B (originally A's last axis) now has stride 4 bytes — meaning consecutive B[k, ...] entries are adjacent in memory. The last axis of B (originally A's first axis) has stride 48 bytes. Walking B's last axis is now expensive; walking B's first axis is cheap. Same data, reinterpreted.

EXAMPLE
B.strides = (4, 16, 48)
11Verify they share the buffer

np.shares_memory(A, B) returns True. There is one 96-byte buffer in RAM; A and B are two metadata wrappers around it. This is the entire reason PyTorch operations like .transpose() and .permute() are advertised as O(1) — they manipulate strides, not bytes.

EXAMPLE
share memory? True
14Pick one logical element to verify

We will read A[1, 2, 3] (the last element of A, value 23) and B[3, 2, 1] (which should be the same element, by definition of transpose) and confirm both compute the same byte address.

EXAMPLE
i=1, j=2, k=3
15Compute A[1, 2, 3]'s address by hand

addr = base + 1·48 + 2·16 + 3·4 = base + 48 + 32 + 12 = base + 92. The 93rd byte of the buffer (offset 92) holds the int32 value 23.

EXAMPLE
addr_A = base + 92
16Compute B[3, 2, 1]'s address by hand

B has strides (4, 16, 48). addr = base + 3·4 + 2·16 + 1·48 = base + 12 + 32 + 48 = base + 92. Same address. The transposed access path lands on the exact same byte — that is what 'same data, different stride' means in practice.

EXAMPLE
addr_B = base + 92  (matches addr_A)
17Print A's address

Prints something like 0x7f9a3c043010 + 92 — the exact base differs every run, but the +92 offset is deterministic.

18Print B's address — identical

Same hex address as A's. Two different index tuples, same byte in memory.

19Read both values and compare

Prints '23 == 23'. The transpose did its job: B[3, 2, 1] has the same numeric value as A[1, 2, 3] because they are literally the same int32 in memory.

EXAMPLE
values: 23 == 23
4 lines without explanation
1import numpy as np
2
3A = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
4print("A.shape  =", A.shape)
5print("A.strides=", A.strides)        # bytes per axis step
6print("A.data   =", hex(A.ctypes.data))
7
8B = A.transpose(2, 1, 0)              # zero-copy: just permutes shape and strides
9print("B.shape  =", B.shape)
10print("B.strides=", B.strides)
11print("share memory?", np.shares_memory(A, B))
12
13# Same element, two index tuples, same byte address
14i, j, k = 1, 2, 3
15addr_A = A.ctypes.data + i*A.strides[0] + j*A.strides[1] + k*A.strides[2]
16addr_B = B.ctypes.data + k*B.strides[0] + j*B.strides[1] + i*B.strides[2]
17print("A[1,2,3] addr:", hex(addr_A))
18print("B[3,2,1] addr:", hex(addr_B))
19print("values:", A[i, j, k], "==", B[k, j, i])

Identical idea in PyTorch: x.permute(2, 1, 0), x.transpose(0, 2), and x.t() all return non-contiguous views over the same storage. When a downstream op (like view or reshape) needs a contiguous buffer, you must call .contiguous() — which is the moment the bytes actually get copied. Until then, all the rearranging happens in stride metadata.

The same idea in C++ / Eigen

📄cpp
1#include <Eigen/Dense>
2using Eigen::MatrixXd;
3
4MatrixXd A(1000, 1000);
5A.setRandom();
6
7// Eigen is column-major by default. A.transpose() returns an expression
8// template that swaps strides — no allocation, no memcpy.
9auto At = A.transpose();              // O(1)
10
11double s = (A * At).trace();          // multiply uses BOTH layouts efficiently

Eigen, BLAS, and LAPACK all expose a “leading dimension” (lda) parameter on their matrix kernels. That parameter is the stride between consecutive columns (or rows). It exists precisely so that submatrix views — which have a different leading dimension than the parent matrix — can be passed in without copying.

Where This Shows Up in the Real World

DomainManifestation
Deep learning frameworksTensors are (shape, strides, dtype, storage_offset, device) plus a buffer. permute / transpose / unsqueeze / slice all return strided views. .contiguous() forces a copy.
GPU programming (CUDA, ROCm)Coalesced memory access requires consecutive threads to read consecutive addresses. The fast axis must be the threadIdx.x axis. NCHW vs NHWC determines whether your kernel hits 90% or 9% of memory bandwidth.
BLAS / LAPACK / cuBLASColumn-major across the entire ecosystem (a Fortran inheritance). Calling BLAS from C/NumPy means either passing row-major data with a transposed flag, or paying for a copy.
Image processing (OpenCV, Pillow, ImageMagick)Row-major (H, W, C). A 90° rotation that respects the layout reorders rows; one that reads down a column on rotated data is one cache miss per pixel.
Medical imaging (DICOM, NIfTI)(D, H, W) volumetric scans. Per-slice ops iterate (H, W) with stride-1 inner; whole-volume ops must pick an axis order that matches the slowest (D) axis.
OLAP / columnar databases (Parquet, ClickHouse, DuckDB)Column-major on disk. SELECT AVG(price) reads only the price column; analytics over a billion rows finish in seconds because they never touch the other 50 fields.
Game enginesTilemaps are 2D arrays of tile IDs (row-major). Voxel worlds (Minecraft chunks) are 3D arrays of block IDs — typically 16×256×16 or 32×32×32, packed contiguously. Navmeshes index 2D grid cells via the same (i, j) → flat formula.
Climate / CFD simulations(time, level, lat, lon) 4D grids. Stencil kernels iterate space with time on the outside; the inner loop is stride-1 in lon. NetCDF and HDF5 store the data in chunked, often compressed multi-D layouts.
Finite element analysisStiffness matrices are sparse, but local element matrices are dense small 2D arrays (often 24×24 or 60×60). Cache layout decides the FLOP rate of the inner assembly loop.
Ray-tracing acceleration gridsUniform grids subdivide space into a 3D array of voxel cells, each holding a list of triangles. Stride-1 traversal of the grid maps directly to coherent ray packets in modern path tracers.
Computer vision (im2col, convolution)Convolution is implemented as a reshape + matrix multiply (im2col). The reshape exists precisely to put the channel/spatial axes in the layout BLAS wants.
Spreadsheet engines (Excel, Google Sheets)Sparse storage of dense regions is multi-dimensional under the hood. Range operations like SUM(A1:A1000) exploit stride-1 column traversal of a column-major store.

Quick Check

You have a row-major 2D matrix A of shape (10000, 10000). Which loop computes column sums fastest on this layout?

Pitfalls and Gotchas

  1. Mixing row-major and column-major libraries. Passing a NumPy row-major matrix into a Fortran-style BLAS routine without setting the “transposed” flag silently solves the transposed problem. Symptoms: numerically plausible but wrong results, or shape mismatches that only appear for non-square matrices.
  2. “2D arrays” that are arrays of pointers. Java's int[][], C's int **A, Python's list-of-lists. The rows are not contiguous with each other and may not even be the same length. For numerical work, flatten to a single buffer.
  3. Confusing logical shape with memory layout. A tensor of shape (2,3,4)(2, 3, 4) can be stored in 6 different stride orderings (3! axis permutations), all containing the same elements. Two tensors with the same shape can have wildly different cache behavior if their strides differ.
  4. Wrong loop order on multidimensional data. The single most common cache-induced slowdown. On row-major data, the rightmost index varies in the innermost loop. On column-major data, the leftmost index does. Get it backwards and you pay 5–15× for the same computation.
  5. Forgetting .contiguous() before view / reshape. A non-contiguous tensor (e.g., the output of transpose) cannot be reshaped without copying. PyTorch will throw “view size is not compatible with input tensor's size and stride”. Calling .contiguous() forces the copy explicitly so you know it is happening.
  6. Strided views aliasing the original. x[::2] shares storage with x. Writing to it mutates x. If you need independence, call .copy() in NumPy or .clone() in PyTorch.
  7. NCHW vs NHWC mismatches in deep learning pipelines. A model trained in PyTorch (NCHW) will silently transpose every input if the data loader hands it NHWC. The correctness is preserved by the transpose, but throughput collapses because every batch now allocates a new contiguous buffer.
  8. Padding / row stride differences in image libraries. OpenCV often pads each row to a multiple of 4 or 16 bytes for SIMD alignment. The image's step (row stride) is therefore not always width * channels. Iterating (width * channels) instead of step reads garbage from the padding.

Hold on to one image. A multi-dimensional array is a single contiguous buffer plus a stride vector. Indexing computes didsd\sum_d i_d \cdot s_d, one multiply-add per axis, and that is it. Row-major and column-major are just two specific stride choices. Transpose, slice, permute, broadcast, and reshape are stride manipulations — the data does not move. The cache hierarchy turns this picture into a performance contract: iterate the stride-1 axis on the inside, or pay 10×. Every fast numerical library — NumPy, PyTorch, BLAS, cuDNN, OpenCV, Parquet readers, game engines — is, at its core, a careful negotiation between this contract and the shape of the problem.

Loading comments...