Memory is one-dimensional. Problems are not. The whole story of multi-dimensional arrays — the entire reason NumPy, PyTorch, BLAS, OpenCV, every game engine, and every relational database had to make hard, opinionated decisions about layout — is the friction between those two facts. Section 1 showed that an array is a contiguous run of bytes plus an address formula. This section generalizes that formula to any number of dimensions, exposes the single concept (strides) that unifies all layouts, and shows why the same algorithm can be 10× faster or slower depending on which axis you put on the inside of the loop.
What a Multi-Dimensional Array Really Is
Mathematically, an -dimensional array (a tensor) of shape is a function from a Cartesian product of index ranges into an element type:
For this is a vector; for a matrix; for a cube of values; for a stack of cubes (mini-batches in deep learning); and so on. The total element count is .
That is the logical picture. Physically, the elements have to be ordered into a 1D byte strip, because that is the only thing memory understands. A multi-dimensional array is therefore two things glued together:
- A contiguous (or near-contiguous) buffer of elements in memory.
- A layout function that maps an index tuple to a flat offset into that buffer.
Different libraries pick different layout functions. Picking one is the entire content of the next two sections.
Two Ways to Lay Out 2D in 1D Memory
For a 2D array of shape , there are two canonical orderings.
Row-major (C, Python lists, NumPy default, PyTorch default, OpenCV, Pillow)
Lay out row 0 entirely, then row 1, then row 2, and so on. Within a row, columns are contiguous. The flat offset (in elements) is
Column-major (Fortran, MATLAB, R, Julia, BLAS, LAPACK, Eigen)
Lay out column 0 entirely, then column 1, and so on. Within a column, rows are contiguous. The flat offset is
Same logical matrix, same element values, completely different physical addresses. The choice is not arbitrary: it leaks into the API of every numerical library you will ever use, and it determines whether a given loop is fast or slow.
Why does Fortran prefer column-major?
The General N-D Formula and Strides
Both layouts are special cases of a single formula. Define the stride of axis as “how many elements to step in the buffer when index increases by 1, with all other indices held fixed”. Then
For row-major (C-order) on a tensor of shape , the strides are determined by the recursion: the last axis varies fastest, so its stride is 1; each earlier axis strides over a full slab of all later axes. Concretely,
For a 3D tensor in row-major order, the strides are . Walking by 1 advances by 1 element. Walking by 1 advances by elements (a full row). Walking by 1 advances by elements (a full slab). For column-major (F-order), the formula mirrors with the first axis varying fastest:
Strides are the universal language
.strides attribute.True 2D Arrays vs Jagged Arrays of Pointers
Not everything advertised as a “2D array” is laid out as one contiguous block. There is a critical distinction between a true 2D array — where all elements live in one contiguous run — and a jagged array (or array of arrays), which is a 1D array of pointers, each pointing to an independently allocated row.
| Construct | Layout | Locality | Indexing cost |
|---|---|---|---|
| C: int A[M][N] | Single contiguous M×N×4-byte block | Excellent | 1 multiply + 1 add + 1 load |
| C: int *A = malloc(M*N*sizeof(int)) | Single contiguous block, manual A[i*N + j] | Excellent | 1 multiply + 1 add + 1 load |
| C: int **A (M mallocs) | 1 array of M pointers + M separate row buffers | Poor across rows | 2 dependent loads |
| Java: int[][] A | 1 array of M references to int[N] arrays | Poor across rows | 2 dependent loads + bounds check |
| Python: list of lists | 1 list of M PyObject* pointers to row lists of PyObject* | Catastrophic | Multiple pointer chases per element |
| NumPy: np.zeros((m, n)) | Single contiguous m×n×dtype-byte block | Excellent | 1 multiply + 1 add + 1 load |
| PyTorch: torch.zeros(m, n) | Single contiguous block (unless explicitly strided) | Excellent | 1 multiply + 1 add + 1 load |
The two-load form is the killer. When A[i][j] requires loading A[i] first to discover the row pointer and then dereferencing it, the second load cannot start until the first finishes (a data-dependent load), the prefetcher has nothing to predict (the row pointers can point anywhere), and each row may live in a different page of the heap. For a 1024×1024 matrix, the difference between a contiguous float A[1024][1024] and a float **A with 1024 separate mallocs is routinely 3–10× on a tight numerical loop.
Java's int[][] is jagged by design
int[][] is genuinely an array of array references. Each row may have a different length (jagged arrays are first-class), and rows live wherever the GC placed them. For numeric work, packing into a flat int[] with manual i*n + j indexing — or using a library like ND4J or JBLAS — is dramatically faster.The Loop-Order Rule and Cache Reality
The address formula tells you indexing is . The cache hierarchy tells you that the order in which you ask for those elements determines whether you run at L1 speed or DRAM speed — a roughly 100× gap. The rule is simple and exception-free:
Iterate the dimension whose stride is 1 in the innermost loop. On row-major data, that is the last index. On column-major data, the first.
Why? Because a stride-1 access pattern reads consecutive bytes. A 64-byte cache line then carries 16 float32s or 8 float64s of useful work for the price of one DRAM round trip, and the hardware prefetcher streams the next line into L1 before you ask for it. A stride- access pattern, by contrast, jumps a full row each iteration, touches a fresh cache line every time, and gives the prefetcher nothing to grab.
1# Row-major data (NumPy default). Inner loop varies LAST index → stride-1 → fast.
2for i in range(m):
3 for j in range(n):
4 out += A[i, j] # consecutive 8-byte reads
5
6# Same logical work, but inner loop varies FIRST index → stride-n → cache miss every step.
7for j in range(n):
8 for i in range(m):
9 out += A[i, j] # 8*n-byte jumps; prefetcher gives upOn a 4096×4096 float64 matrix (128 MB — well outside any L2 or L3) the second loop is typically 5–15× slower than the first on the same CPU, with the same FLOP count, the same arithmetic operations, and the same correctness. Nothing changed except the order in which we asked memory for our own data.
Interactive: ND-Tensor Stride Inspector
The widget below makes the abstract formulas concrete. Pick a 2D or 3D shape, drag the index sliders, and watch three things at once: (1) the logical grid view, (2) the physical 1D strip in memory order, and (3) the offset formula expanded with your specific numbers. In 2D mode, toggle row-major / column-major — the data does not move, but the strip's ordering does. In 3D mode, flip the transpose switch to swap the strides of axis 0 and axis 2; the same buffer now appears in a different visual order, which is exactly how a zero-copy transpose works in PyTorch.
offset = 2·5 + 3·1 = 13
Two patterns to look for. First, when you change the layout in 2D mode, observe that the same logical cell [i, j] lands at completely different positions in the strip. Second, the transpose toggle in 3D mode flips strides without ever rewriting the buffer; that is why x.transpose(0, 2), x.permute(...), and x[::-1] all run in regardless of how big the tensor is.
Quick Check
A row-major 3D tensor has shape (5, 4, 6) and dtype int32 (4 bytes). What is the byte offset of A[1, 2, 3] from the base address?
Case Study: Matrix Multiplication and Loop Order
Matrix multiplication with , , is the textbook playground for loop-order effects. The naive triple loop has six possible orderings of the three indices (i, j, k); they all produce identical answers but exhibit dramatically different cache behavior on row-major storage.
| Order | C[i,j] access | A[i,k] access | B[k,j] access | Verdict |
|---|---|---|---|---|
| i j k (naive) | stride 1 in inner-but-one | stride 1 (good) | stride n (bad) | Mediocre — B kills it |
| i k j | stride 1 (write-streaming) | loop-invariant in inner | stride 1 (good) | Best naive order |
| j i k | stride 1 across rows of C (bad on row-major) | stride 1 in middle | stride n in inner (bad) | Cache-hostile |
| k i j | stride 1 (good) | stride 1 in middle | stride 1 (good) | Excellent — but more memory traffic |
On a 1024×1024 float32 multiply, the i k j order routinely beats i j k by 3–5× in plain C with optimization on, simply because the innermost loop now streams contiguously through both and . This is also why production BLAS (OpenBLAS, MKL, BLIS) does not use any of these orderings directly. They use tiled or blocked matrix multiply: chop , , and into small panels that fit in L1 / L2, multiply panel-by-panel, and reuse each panel many times before evicting it. The result is the difference between 2 GFLOPS (naive) and 50–200 GFLOPS (BLAS) on the same CPU.
Higher Dimensions: Images, Video, Mini-Batches
Three- and four-dimensional arrays are the workhorses of perception, graphics, and deep learning. The shape conventions are not historical accidents — they are deliberate choices about which axis is stride-1, because that determines which kernels run fast.
| Domain | Shape | Convention | Stride-1 axis | Why |
|---|---|---|---|---|
| Grayscale image | (H, W) | Row-major | W (columns) | Image filters scan left-to-right; a kernel reads consecutive pixels in a row. |
| RGB image (NumPy / OpenCV / Pillow) | (H, W, 3) | HWC, row-major | Channel | Co-locating R/G/B for one pixel speeds up per-pixel ops (color conversion, blending). |
| Mini-batch image (PyTorch default, cuDNN) | (B, C, H, W) | NCHW, row-major | W (last) | Convolutions slide a kernel across H, W; stride-1 along W matches GPU coalesced loads. |
| Mini-batch image (TensorFlow CPU default) | (B, H, W, C) | NHWC, row-major | C (channel) | Pointwise ops over channels become stride-1; on CPUs that vectorize over channel. |
| Video clip | (F, H, W, C) | Frame-major | C or W depending on backend | F is the slowest-varying axis; one frame at a time fits in L2. |
| Volumetric scan (CT / MRI) | (D, H, W) | Slice-major | W | DICOM stores depth slices contiguously; per-slice 2D ops are cache-friendly. |
| Attention activations (Transformer) | (B, H, S, D_h) | Batch-Head-Seq-Dim | D_h | Per-head attention does (S×D_h) · (D_h×S); stride-1 D_h gives stride-1 inner products. |
NCHW vs NHWC is a real engineering tradeoff
Code: A Hand-Built Array2D in Python
To make the formulas concrete, here is a 2D array implemented from scratch on top of a flat Python list. Notice how every read and write reduces to the row-major offset formula.
Swapping the offset formula to j * self.n_rows + i would convert this class to column-major. The data buffer would not change shape; only the mapping function would. Layout is software, not data.
Code: Strides, Views, and Transpose-Without-Copy (NumPy / PyTorch)
Real libraries do not write a fresh class for every shape. They store one buffer plus a shape tuple and a strides tuple, and let the strides tuple do all the work. Here is how to read it out and watch a transpose happen with zero copy.
Identical idea in PyTorch: x.permute(2, 1, 0), x.transpose(0, 2), and x.t() all return non-contiguous views over the same storage. When a downstream op (like view or reshape) needs a contiguous buffer, you must call .contiguous() — which is the moment the bytes actually get copied. Until then, all the rearranging happens in stride metadata.
The same idea in C++ / Eigen
1#include <Eigen/Dense>
2using Eigen::MatrixXd;
3
4MatrixXd A(1000, 1000);
5A.setRandom();
6
7// Eigen is column-major by default. A.transpose() returns an expression
8// template that swaps strides — no allocation, no memcpy.
9auto At = A.transpose(); // O(1)
10
11double s = (A * At).trace(); // multiply uses BOTH layouts efficientlyEigen, BLAS, and LAPACK all expose a “leading dimension” (lda) parameter on their matrix kernels. That parameter is the stride between consecutive columns (or rows). It exists precisely so that submatrix views — which have a different leading dimension than the parent matrix — can be passed in without copying.
Where This Shows Up in the Real World
| Domain | Manifestation |
|---|---|
| Deep learning frameworks | Tensors are (shape, strides, dtype, storage_offset, device) plus a buffer. permute / transpose / unsqueeze / slice all return strided views. .contiguous() forces a copy. |
| GPU programming (CUDA, ROCm) | Coalesced memory access requires consecutive threads to read consecutive addresses. The fast axis must be the threadIdx.x axis. NCHW vs NHWC determines whether your kernel hits 90% or 9% of memory bandwidth. |
| BLAS / LAPACK / cuBLAS | Column-major across the entire ecosystem (a Fortran inheritance). Calling BLAS from C/NumPy means either passing row-major data with a transposed flag, or paying for a copy. |
| Image processing (OpenCV, Pillow, ImageMagick) | Row-major (H, W, C). A 90° rotation that respects the layout reorders rows; one that reads down a column on rotated data is one cache miss per pixel. |
| Medical imaging (DICOM, NIfTI) | (D, H, W) volumetric scans. Per-slice ops iterate (H, W) with stride-1 inner; whole-volume ops must pick an axis order that matches the slowest (D) axis. |
| OLAP / columnar databases (Parquet, ClickHouse, DuckDB) | Column-major on disk. SELECT AVG(price) reads only the price column; analytics over a billion rows finish in seconds because they never touch the other 50 fields. |
| Game engines | Tilemaps are 2D arrays of tile IDs (row-major). Voxel worlds (Minecraft chunks) are 3D arrays of block IDs — typically 16×256×16 or 32×32×32, packed contiguously. Navmeshes index 2D grid cells via the same (i, j) → flat formula. |
| Climate / CFD simulations | (time, level, lat, lon) 4D grids. Stencil kernels iterate space with time on the outside; the inner loop is stride-1 in lon. NetCDF and HDF5 store the data in chunked, often compressed multi-D layouts. |
| Finite element analysis | Stiffness matrices are sparse, but local element matrices are dense small 2D arrays (often 24×24 or 60×60). Cache layout decides the FLOP rate of the inner assembly loop. |
| Ray-tracing acceleration grids | Uniform grids subdivide space into a 3D array of voxel cells, each holding a list of triangles. Stride-1 traversal of the grid maps directly to coherent ray packets in modern path tracers. |
| Computer vision (im2col, convolution) | Convolution is implemented as a reshape + matrix multiply (im2col). The reshape exists precisely to put the channel/spatial axes in the layout BLAS wants. |
| Spreadsheet engines (Excel, Google Sheets) | Sparse storage of dense regions is multi-dimensional under the hood. Range operations like SUM(A1:A1000) exploit stride-1 column traversal of a column-major store. |
Quick Check
You have a row-major 2D matrix A of shape (10000, 10000). Which loop computes column sums fastest on this layout?
Pitfalls and Gotchas
- Mixing row-major and column-major libraries. Passing a NumPy row-major matrix into a Fortran-style BLAS routine without setting the “transposed” flag silently solves the transposed problem. Symptoms: numerically plausible but wrong results, or shape mismatches that only appear for non-square matrices.
- “2D arrays” that are arrays of pointers. Java's
int[][], C'sint **A, Python's list-of-lists. The rows are not contiguous with each other and may not even be the same length. For numerical work, flatten to a single buffer. - Confusing logical shape with memory layout. A tensor of shape can be stored in 6 different stride orderings (3! axis permutations), all containing the same elements. Two tensors with the same
shapecan have wildly different cache behavior if theirstridesdiffer. - Wrong loop order on multidimensional data. The single most common cache-induced slowdown. On row-major data, the rightmost index varies in the innermost loop. On column-major data, the leftmost index does. Get it backwards and you pay 5–15× for the same computation.
- Forgetting
.contiguous()beforeview/reshape. A non-contiguous tensor (e.g., the output oftranspose) cannot be reshaped without copying. PyTorch will throw “view size is not compatible with input tensor's size and stride”. Calling.contiguous()forces the copy explicitly so you know it is happening. - Strided views aliasing the original.
x[::2]shares storage withx. Writing to it mutatesx. If you need independence, call.copy()in NumPy or.clone()in PyTorch. - NCHW vs NHWC mismatches in deep learning pipelines. A model trained in PyTorch (NCHW) will silently transpose every input if the data loader hands it NHWC. The correctness is preserved by the transpose, but throughput collapses because every batch now allocates a new contiguous buffer.
- Padding / row stride differences in image libraries. OpenCV often pads each row to a multiple of 4 or 16 bytes for SIMD alignment. The image's
step(row stride) is therefore not alwayswidth * channels. Iterating(width * channels)instead ofstepreads garbage from the padding.
Hold on to one image. A multi-dimensional array is a single contiguous buffer plus a stride vector. Indexing computes , one multiply-add per axis, and that is it. Row-major and column-major are just two specific stride choices. Transpose, slice, permute, broadcast, and reshape are stride manipulations — the data does not move. The cache hierarchy turns this picture into a performance contract: iterate the stride-1 axis on the inside, or pay 10×. Every fast numerical library — NumPy, PyTorch, BLAS, cuDNN, OpenCV, Parquet readers, game engines — is, at its core, a careful negotiation between this contract and the shape of the problem.