The previous section described the Stack ADT as a black box: push, pop, peek, size, empty, all running in . That contract says nothing about how memory is laid out. Two implementations dominate every stack you will ever touch: a dynamic array and a singly linked list. Both deliver per operation. They are not interchangeable.
This section dissects both, traces the cost of every line of code, and shows when each one is the right call. By the end you should be able to read , , the JVM operand stack, and the lock-free Treiber stack in — and explain why each one chose the implementation it did.
One ADT, Two Skeletons
The Stack ADT has only one privileged end: the top. Everything we need from an implementation is the ability to add and remove at that one end in constant time. Two memory layouts give us that:
| Implementation | Where is the top? | push | pop |
|---|---|---|---|
| Dynamic array | Last element of the array | append (write to slot size) | pop_back (clear slot size−1) |
| Singly linked list | The head node | prepend (allocate, point new node at head) | advance head (free old head) |
Both choices avoid the costly end of their underlying structure: the array never shifts elements, the linked list never walks to the tail. That is the whole trick — everything else is engineering.
Why not push at index 0 of an array?
The Array-Backed Stack
A dynamic array stack stores elements in a contiguous buffer of capacity , with a counter tracking the current size. The top of the stack lives at index .
Memory Layout
1capacity = 8
2 ┌───┬───┬───┬───┬───┬───┬───┬───┐
3data: │ 3 │ 7 │ 1 │ 4 │ 9 │ · │ · │ · │
4 └───┴───┴───┴───┴───┴───┴───┴───┘
5 0 1 2 3 4 5 6 7
6 ↑
7 top = 4 (size = 5)Push writes to slot and increments . Pop decrements and returns the slot (optionally clearing it for the GC). When we cannot push without first resizing the buffer.
Capacity Doubling and Amortized
The standard growth policy doubles capacity on overflow. A single resize is expensive — it allocates a new buffer of size and copies all existing elements — but it is rare. Over a sequence of pushes starting from capacity 1, the total copy cost is
,
so the average cost per push is bounded by a constant. This is the classic amortized argument we will formalize in Chapter 5. Worst-case push, however, is — the one push that triggers a resize copies the whole buffer. That distinction matters in real-time systems, where you cannot afford an unbounded latency spike.
Amortized vs worst-case
Python Implementation
Python's built-in list is itself a dynamic array, so a Pythonic stack is one line: stack = []; push is stack.append(x); pop is stack.pop(). That is exactly the right thing in production. To understand what append does, here is the same machinery written explicitly:
Pointer invalidation in C/C++
std::vector resize invalidates pointers, iterators, and references into the buffer. If you held &stack.top() across a push, your pointer now dangles. This is the single most common bug when migrating from a linked-list stack to an array stack — we will see how std::stack sidesteps it below.The Linked-List Stack
A linked-list stack stores each element in its own heap-allocated node. Each node holds a value and a pointer to the next node. The stack itself is a single pointer, head, pointing at the top node.
Memory Layout
1head ──► [9 | ●] ──► [4 | ●] ──► [1 | ●] ──► [7 | ●] ──► [3 | ∅]
2 top bottom
3
4 push(v): new node N with N.next = head; head = N
5 pop(): h = head; head = head.next; free h; return h.valueBoth push and pop touch only the head pointer and at most one node. There is no “buffer full” case — growth is one allocation per push. There is also no resize spike: every push pays exactly the same constant cost, so worst-case push is , not just amortized.
Python Implementation
Why a dataclass is fine here
next pointer suffices. If you find yourself needing a doubly linked list inside a stack implementation, stop — you almost certainly want a deque (Chapter 8) instead.Interactive: Two-Stacks Race
The two implementations look almost identical from the outside. The difference is what they cost per push. Drive both at once below: every click pushes (or pops) the same value on both stacks. The chart records the per-operation cost in arbitrary work units — one unit per slot write or pointer assignment, plus a copy-cost spike whenever the array stack resizes.
- Push 4 elements. The array fills its initial buffer; cost is flat (1 unit each). The linked list pays 3 units per push (allocate + 2 writes).
- Push the 5th element. The array stack resizes: its bar shoots up to units (copy 4 + write 1). Linked list bar stays flat at 3.
- Push the 9th, 17th, 33rd elements: each triggers another resize, with bars . The spikes get bigger but rarer — this is amortized in pictures.
- Watch the average cost. Even with the spikes, the array stack's average creeps toward a small constant (around 2 units). Linked stack average is locked at 3.
What the chart is and is not
Side-by-Side Comparison
| Property | Array stack | Linked-list stack |
|---|---|---|
| push (average) | O(1) amortized | O(1) worst-case |
| push (worst-case) | O(n) on resize | O(1) |
| pop | O(1) | O(1) |
| peek | O(1) | O(1) |
| Memory per element | 1 word (the value itself) | ≥ 2 words (value + next), plus per-allocation overhead |
| Allocator pressure | Rare, large allocations on resize | One allocation per push |
| Cache locality | Excellent (contiguous) | Poor (pointer-chased nodes) |
| Pointer to top stable across push? | No (resize invalidates) | Yes |
| Trivial to bound size | Yes (cap capacity) | Same, but less natural |
| Lock-free implementation | Hard (must coordinate index + buffer) | Standard pattern (Treiber stack) |
| Works with arena allocator | Awkward (resize fights the arena) | Natural |
Why Cache Locality Dominates in Practice
Modern CPUs run roughly two orders of magnitude faster than main memory. An L1 hit takes ~1 ns; a main-memory miss takes ~100 ns. The array stack's top is almost always in L1 because we just touched the adjacent slots. The linked-list stack's top is a freshly allocated node, often on a cache line that has never been touched — every push misses.
For typical short-lived stacks of small values, an array-backed stack is between 3× and 10× faster than a linked-list stack on commodity hardware, despite both being “”. This is why , Java's , Python's list, and the JVM operand stack are all array-backed.
Quick Check
You push 9 elements onto an array stack with initial capacity 4 and growth factor 2 (so capacities go 4, 8, 16, …). Counting 1 unit per slot write and 1 per element copied during resize, the total work is closest to:
The Bounded Stack Variant
Sometimes you want the array stack without the resize. A bounded stack has a fixed capacity decided at construction; pushing past capacity is an error rather than a trigger to grow.
1class BoundedStack:
2 def __init__(self, capacity: int) -> None:
3 if capacity <= 0:
4 raise ValueError("capacity must be positive")
5 self._data: list[object] = [None] * capacity
6 self._size = 0
7 self._capacity = capacity
8
9 def push(self, value: object) -> None:
10 if self._size == self._capacity:
11 raise OverflowError("stack overflow")
12 self._data[self._size] = value
13 self._size += 1
14
15 def pop(self) -> object:
16 if self._size == 0:
17 raise IndexError("pop from empty stack")
18 self._size -= 1
19 value, self._data[self._size] = self._data[self._size], None
20 return valueBounded stacks dominate in:
- Embedded firmware — no malloc, every byte budgeted at compile time.
- Hardware call stacks — an x86 thread stack is a fixed 1–8 MB; overflow is a hard fault.
- Fixed-depth recursion guards — explicit stacks for DFS where a depth limit is part of the spec.
- Interpreters with predictable working set — the JVM operand stack has a per-method
max_stackattribute computed at compile time so the interpreter pre-allocates exactly the right amount.
Always-bounded contract
Concurrency: The Treiber Stack
Both implementations above are single-threaded. If two threads call push simultaneously on the same stack, you can lose updates, corrupt the head pointer, or skip a slot. The standard fix is a mutex around every operation. The elegant fix — available only on the linked-list version — is the Treiber stack (Treiber 1986), a lock-free stack that uses a single atomic compare-and-swap (CAS) on the head pointer.
1# Pseudocode — real Treiber stack needs language-level atomics.
2# In Python you would use threading.Lock; in Java, AtomicReference;
3# in C++, std::atomic<Node*> with memory ordering.
4
5def push(stack, value):
6 new_node = Node(value=value)
7 while True:
8 old_head = stack.head # atomic load
9 new_node.next = old_head
10 if cas(stack.head, old_head, new_node): # atomic compare-and-swap
11 return # success; otherwise retry
12
13def pop(stack):
14 while True:
15 old_head = stack.head # atomic load
16 if old_head is None:
17 return None # or raise
18 new_head = old_head.next
19 if cas(stack.head, old_head, new_head):
20 return old_head.valueThe CAS atomically does this: if stack.head still equals old_head, replace it with the new value and return success; otherwise return failure and leave it alone. The loop retries on contention. This is lock-free — some thread always makes progress — but not wait-free: a single thread could in principle spin forever if it is unlucky.
The ABA problem
head = A and is about to CAS. Meanwhile thread T2 pops A, pops B, and pushes A again (perhaps reusing the same memory). When T1 wakes up, the CAS sees head == A and succeeds — but A's next pointer now points at deallocated memory. The fix is a tagged pointer (head + 64-bit version counter, atomically incremented on every change) or hazard pointers. In Java this is hidden by the GC: A is not reclaimed while T1 still holds a reference. In C++ you must implement it yourself.An array-backed lock-free stack is much harder because you must atomically coordinate two things: the size index and the buffer pointer (which can move on resize). This is why every lock-free stack you will encounter — internals, the GC's mark stack, kernel work-stealing queues — uses a linked-list layout.
Quick Check
Why is the Treiber stack called 'lock-free' but not 'wait-free'?
C++ Implementations
C++ exposes the array-vs-linked tension in the standard library itself. std::stack is a container adapter: it wraps an underlying container and forwards push/pop to it. The default underlying container is std::deque, not std::vector, for one specific reason — we will see why.
Array-Backed Stack in C++
The pointer-stability footgun
top() returns T&. After a push that triggers a vector reallocation, that reference is dangling. Code like1auto& x = stack.top();
2stack.push(other); // may reallocate
3x = something; // UNDEFINED BEHAVIOR — x danglesstd::stack defaults to std::deque: deque uses a chunked layout so element addresses are stable across push (though not insert/erase). For most users the deque default is fine; for the lowest possible overhead specify std::stack<T, std::vector<T>> and never hold a reference across a push.Linked-List Stack in C++
1#include <memory>
2#include <stdexcept>
3#include <utility>
4
5template <typename T>
6class LinkedStack {
7 struct Node {
8 T value;
9 std::unique_ptr<Node> next;
10 Node(T v, std::unique_ptr<Node> n)
11 : value(std::move(v)), next(std::move(n)) {}
12 };
13
14 std::unique_ptr<Node> head_;
15 std::size_t size_ = 0;
16
17public:
18 void push(T value) {
19 head_ = std::make_unique<Node>(std::move(value), std::move(head_));
20 ++size_;
21 }
22
23 T pop() {
24 if (!head_) throw std::out_of_range("pop from empty stack");
25 T value = std::move(head_->value);
26 head_ = std::move(head_->next); // old head freed here
27 --size_;
28 return value;
29 }
30
31 T& top() {
32 if (!head_) throw std::out_of_range("top of empty stack");
33 return head_->value;
34 }
35
36 bool empty() const noexcept { return !head_; }
37 std::size_t size() const noexcept { return size_; }
38};The unique_ptr chain handles all the memory management: when head_ is reassigned in pop, the old head is destroyed, its next moves out first so the destruction does not recursively destroy the entire list (which would blow the call stack on a deep stack). top() is now reference-stable across push, the trade-off being two heap allocations per push and a near-guaranteed cache miss.
std::stack's actual default
std::stack<T> is std::stack<T, std::deque<T>>. std::deque is neither pure array nor pure list — it is an array of fixed-size chunks. Push amortized , references stable, no whole-buffer copy. It is the “safe default” that gives up some cache locality for pointer stability.Real-World Stacks
| Stack | Implementation | Why this choice |
|---|---|---|
| x86 hardware call stack | Array (memory grows downward, %rsp register) | Single allocation at thread creation; no malloc per call frame; cache-hot. |
| JVM operand stack (per method) | Array, max depth precomputed by javac | Bytecode declares max_stack at compile time → static allocation, no resize ever. |
| Python list as a stack | Array (CPython listobject.c) | Idiomatic stack idiom in Python; same dynamic-array machinery. |
| Java java.util.Stack (legacy) | Array (extends Vector) | Predates ArrayDeque; synchronized; effectively deprecated. |
| Java ArrayDeque (preferred) | Circular array | Faster than Stack and LinkedList for stack workloads; cache-friendly. |
| Treiber stack in JUC | Linked list with AtomicReference head | Lock-free push/pop via CAS; basis of ConcurrentLinkedQueue and many GC mark stacks. |
| Compiler activation records | Hardware stack (array) | Function call/return is the hot path; cache locality dominates everything. |
| PostScript interpreter operand stack | Array, fixed default depth | Predictable footprint for printer firmware; overflow is a defined error. |
| RPN calculator (HP-style) | Array, fixed (often depth 4) | Hardware-defined fixed depth; older models actually used 4 named registers. |
| Browser back-button history | Array per tab | Bounded by user navigation length; trivial to bound and prune. |
| DFS on huge graph | Explicit linked-list or array stack | Replaces system call stack to avoid stack-overflow on deep graphs. |
| GC mark stack (JVM, V8, .NET) | Linked array of fixed-size chunks (a chunked stack) | Hybrid: cache locality of array + worst-case O(1) push of linked list. |
Notice the chunked GC mark stack: it is essentially a linked list of arrays. Each chunk is contiguous (cache-friendly); chunks are linked together (no resize spike, no whole-buffer copy). When you find yourself wanting both properties, this hybrid is the canonical answer — it is what std::deque is, what the V8 GC's mark stack is, and what every production JVM uses internally for marking.
Pitfalls
1. Conflating amortized with worst-case in a real-time loop
A push that usually takes 5 ns and occasionally takes 5 ms because of a resize will be invisible in average-case microbenchmarks and catastrophic in an audio callback. If you are inside a ms latency budget, prefer a linked-list stack or pre-reserve the array's capacity (vec.reserve(N)).
2. Holding a reference into the array stack across a push
Already shown in C++ above. Same trap exists in Rust's Vec (the borrow checker catches it at compile time, mercifully) and in Go slices that share a backing array (the runtime does not catch it). Whenever the underlying buffer can move, internal pointers can dangle.
3. Per-allocation overhead in tight loops
A linked-list stack with pushes calls malloc ten million times. On glibc that is ~30 ns per call — ns = 0.3 s of pure allocator overhead. The same workload on an array stack pays resize allocations total. Use a node pool / arena allocator if you genuinely need a linked stack in a tight loop.
4. ABA in lock-free linked stacks
Already shown above. The fix is tagged pointers (std::atomic<TaggedPointer> with a 64-bit version counter) or hazard pointers, or a managed runtime with GC, or epoch-based reclamation. Never roll your own production lock-free stack without addressing this.
5. Forgetting to clear popped slots in array stack
If T owns resources (file handles, GPU buffers, big strings), leaving the popped slot pointing at the old object means the resource is held until the slot is overwritten or the stack itself is destroyed. The self._data[i] = None line in our Python implementation and the implicit destruction in C++'s vector::pop_back exist precisely to avoid this leak.
Choosing an Implementation
A short decision tree:
- Default: array stack. If you do not have a specific reason to choose otherwise, an array-backed stack (
listin Python,std::stackoverstd::vectorin C++,ArrayDequein Java) wins on cache locality, allocator overhead, and code simplicity. - Real-time / bounded latency. Use a bounded array stack with capacity reserved up front, or a linked-list stack with a node pool. Never let the resize spike happen inside the latency budget.
- Concurrent push/pop from many threads. Linked list with CAS on the head (Treiber stack). Wrap with hazard pointers or rely on a GC.
- Need stable references to elements (you keep a pointer to across mutations). Linked list, or
std::dequeas a chunked compromise. - Strict memory budget (embedded, kernel). Bounded array stack. Allocation is statically determined.
- Both cache locality and worst-case push. Chunked stack (linked list of fixed-size arrays). This is what GC mark stacks and
std::dequeare.
Engineering takeaway: “ per operation” is necessary but rarely sufficient. The constants — cache lines, allocator calls, pointer stability, lock-freedom — are what determine which implementation actually ships. Knowing both implementations lets you pick the right one and explain the choice.
The next section pulls back from data-structure mechanics to look at the most consequential stack in computing: the system call stack, the array stack that every running program uses for every function call.