Why classification CNNs can't segment
Every CNN we built in §§11.1–11.7 was designed to throw spatial information away. Pooling layers halve resolution; the final global-average-pool flattens whatever survives into a single channel-wise vector; the classifier head produces one number per class. That is the right design when the answer is one label for the whole image — "cat", "1000-class ImageNet index 281", or "pneumonia: yes/no".
Segmentation needs the opposite. The output is the same size as the input, with one label per pixel. A 572×572 cell-microscopy image must come back as a 572×572 mask of cell or not-cell. We need a network whose bottom layers throw away resolution to gain receptive field, then whose top layers recover that resolution while preserving the high-level reasoning.
The resolution-recovery problem. Classification CNNs are downsamplers. Segmentation networks must be down-then-up samplers, with a way for the upsampling path to see the high-resolution detail that the downsampling path threw away. That last clause — how the up-path sees high-res detail — is what U-Net solves.
Three failure modes a pure encoder-decoder (no skip connections) hits in practice, all of which U-Net's skip pattern repairs:
- Boundary blur. By the time information reaches the bottleneck (28×28 in the original U-Net), every cell membrane has been smeared across multiple feature-map pixels. The decoder cannot recover edges it does not have.
- Small-object loss. A neuron at the bottleneck has a receptive field of ~140 pixels; objects smaller than that are aliased into the closest large-scale feature.
- Localization vs context tradeoff. The decoder needs both: what is this (semantic, comes from the bottleneck) and exactly where (spatial, comes from the encoder's shallow layers). Without skips, only the first survives.
Architecture: symmetric encoder-decoder
U-Net is built from three pieces that reappear unchanged across the entire segmentation lineage we cover in §§09–11:
- Contracting path (encoder). Four stages of (3×3 conv → ReLU) ×2 → 2×2 max-pool. Channels double at each stage (64 → 128 → 256 → 512); spatial dimensions roughly halve at each pooling step.
- Bottleneck. Two more 3×3 convs at 1024 channels. This is the lowest spatial resolution but the deepest semantic representation.
- Expansive path (decoder). Four mirrored stages of up-conv (2×2 transposed conv) → concatenate with the matching encoder feature map → (3×3 conv → ReLU) ×2. Channels halve, spatial dimensions roughly double, at each stage.
The decoder is symmetric to the encoder, and at every level the encoder's output is copied across to the decoder via a skip connection. We unpack what "copied across" means in the skip-connections section below.
Shapes the original U-Net produces, level by level:
| Stage | Channels | H × W | Notes |
|---|---|---|---|
| Input | 1 | 572 × 572 | Single-channel cell-microscopy image |
| Encoder block 1 | 64 | 568 × 568 | (3×3 conv unpadded) × 2 |
| After max-pool 1 | 64 | 284 × 284 | Halve spatial |
| Encoder block 2 | 128 | 280 × 280 | (3×3 conv unpadded) × 2 |
| After max-pool 2 | 128 | 140 × 140 | |
| Encoder block 3 | 256 | 136 × 136 | |
| After max-pool 3 | 256 | 68 × 68 | |
| Encoder block 4 | 512 | 64 × 64 | |
| After max-pool 4 | 512 | 32 × 32 | |
| Bottleneck | 1024 | 28 × 28 | Lowest resolution, deepest semantics |
| Up-conv → concat → conv | 512 | 52 × 52 | Up-conv gives 56 × 56; skip from encoder block 4 cropped 64→56; two convs give 52 |
| Up-conv → concat → conv | 256 | 100 × 100 | Up-conv gives 104 × 104; skip from encoder block 3 cropped 136→104; two convs give 100 |
| Up-conv → concat → conv | 128 | 196 × 196 | Up-conv gives 200 × 200; skip from encoder block 2 cropped 280→200; two convs give 196 |
| Up-conv → concat → conv | 64 | 388 × 388 | Up-conv gives 392 × 392; skip from encoder block 1 cropped 568→392; two convs give 388 |
| 1×1 conv → softmax | 2 | 388 × 388 | 2 classes: cell / background |
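Total parameters: ~31 million. Max-pooling contributes none of them. The count is dominated by the 1024-channel bottleneck (about 14 million) and the deepest decoder stage (about 9 million). The decoder as a whole (about 12 million) is smaller than the encoder plus bottleneck (about 19 million), even though each up-conv carries weights that the parameter-free max-pool it mirrors does not.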
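The ~31 million figure can be checked by hand. The sketch below counts weights and biases stage by stage, assuming the layout in the shape table (no BatchNorm, a bias on every convolution, 2×2 transposed convs for the up-step, and a 2-class 1×1 head). The helper names are illustrative, not taken from any library.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (or transposed convolution)."""
    return k * k * c_in * c_out + c_out

def double_conv(c_in, c_out):
    """Two stacked 3x3 convolutions: c_in -> c_out -> c_out."""
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3)

widths = [64, 128, 256, 512]

# Encoder: four double-conv blocks (max-pool has no parameters).
encoder, c_in = 0, 1
for c in widths:
    encoder += double_conv(c_in, c)
    c_in = c

bottleneck = double_conv(512, 1024)

# Decoder: 2x2 up-conv halves the channels, the concat doubles them back.
decoder, c_in = 0, 1024
for c in reversed(widths):
    decoder += conv_params(c_in, c, 2)
    decoder += double_conv(2 * c, c)
    c_in = c

head = conv_params(64, 2, 1)
total = encoder + bottleneck + decoder + head

print(f"encoder    {encoder:>12,}")
print(f"bottleneck {bottleneck:>12,}")
print(f"decoder    {decoder:>12,}")
print(f"total      {total:>12,}")
```

Most of the weights sit where the channel counts are largest, which is why trimming the bottleneck width is usually the first lever for a lighter U-Net.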
The diagram below renders that same shape table as a 3D scene. Encoder blocks descend on the left with channels growing (blue), the bottleneck sits at the bottom (purple), decoder blocks ascend on the right with channels shrinking (amber). The dashed amber arcs arching over the top are the four skip connections — each encoder level feeds its mirror decoder level. Drag to rotate, scroll to zoom, click a layer to inspect.
Shape flow (interactive)
The diagram below is the same U-Net you saw in the table, drawn as a graph. Hover any block to see exact channels, spatial size, and parameters. Click a dashed skip line to see the cropping arithmetic that feeds the corresponding decoder level.
Color key: encoder is blue (down), bottleneck is purple (deep), decoder is brown (up). The dashed arrows are the skip connections; their existence is what separates U-Net from a generic encoder-decoder.
Skip connections: concat vs add
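ResNet (§11.5) uses skip connections that add the skipped feature to the transformed feature: $y = F(x) + x$. U-Net does something different. It concatenates the cropped encoder feature map with the upsampled decoder feature map along the channel axis:
$$y = \mathrm{concat}\big(\mathrm{crop}(x_{\text{enc}}),\ \mathrm{up}(x_{\text{dec}})\big)$$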
Why concat instead of add? The two paths carry different kinds of information. The encoder skip carries high-resolution spatial detail (where is the boundary?). The decoder up-conv carries semantic context (is this region cell or background?). Adding them forces a single channel slot to hold both, weighted equally, with no way for the network to decide how to combine them. Concatenating preserves both as separate channels, and the next 3×3 conv learns the optimal fusion as a regular weight matrix.
The cropping step. Because U-Net's convolutions are unpadded, the encoder feature map at level $\ell$ is slightly larger than the decoder feature map at the same level. We center-crop the encoder side by $(H_{\text{enc}}^{(\ell)} - H_{\text{dec}}^{(\ell)})/2$ pixels per side before concatenating; at the deepest skip that is $(64 - 56)/2 = 4$ pixels. The center crop matters: the network only ever sees a fully valid receptive field for every output pixel.
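The crop margins follow mechanically from the unpadded-convolution arithmetic. The sketch below traces a 572-pixel input through four encoder stages and back up, assuming two unpadded 3×3 convs per block, 2×2 pooling, and 2×2 up-convs as in the shape table. The function names are illustrative.

```python
def encoder_sizes(size, stages=4):
    """Return the skip sizes (one per encoder block) and the bottleneck size."""
    skips = []
    for _ in range(stages):
        size -= 4              # two unpadded 3x3 convs remove 2 pixels each
        skips.append(size)
        size //= 2             # 2x2 max-pool
    return skips, size - 4     # the bottleneck applies two more convs

def crop_plan(input_size=572):
    """For each decoder level: (skip size, upsampled size, pixels cropped per side)."""
    skips, size = encoder_sizes(input_size)
    plan = []
    for skip in reversed(skips):
        up = size * 2                  # 2x2 up-conv doubles the size
        margin = (skip - up) // 2      # removed from each side of the skip
        plan.append((skip, up, margin))
        size = up - 4                  # two unpadded convs after the concat
    return plan, size

plan, out_size = crop_plan()
for skip, up, margin in plan:
    print(f"skip {skip} -> crop to {up} ({margin} px per side)")
print("output size:", out_size)
```

The margins grow toward the shallow levels (4, 16, 40, 88 pixels per side) because each shallower skip has to absorb all the border pixels lost by the deeper stages.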
For a quick mental contrast with ResNet: ResNet skips solve a training problem (gradient flow) by adding identity. U-Net skips solve a representation problem (high-res detail in the decoder) by concatenating learned features. The mechanisms look similar but address different bottlenecks. Cross-reference: §11.5 (ResNet add-skip derivation) vs §10.6 (transposed convolution arithmetic).
Skip-connection ablation playground
Words about "each skip carries different information" do not land until you see the network's output break in different ways. The widget below trains a small U-Net on a microscopy sample, then disables one skip at a time and re-renders the predicted mask. Toggle any single skip OFF.
What you should see. Disabling skip 1 (the highest-resolution, 64-channel skip) makes cell boundaries blurrier; the network knows where cells roughly are but smears the membrane. Disabling skip 4 (the deep, 512-channel skip) leaves boundaries sharp where cells are detected but causes the network to miss whole cells — the deep skip carried the "is there a cell here at all" signal.
Sample image: blobs.gif from the imagej.net image library (substituted for the originally targeted Spindle.tif, which 404'd on all known mirrors). The shown mask predictions are produced by a tiny U-Net trained on that single image for ~200 epochs; they are intentionally illustrative, not production quality. Regeneration script: scripts/generate_segmentation_assets.py.
Python from scratch: one decoder up-block
Before reaching for PyTorch, let's build the single mechanism that distinguishes U-Net from a generic encoder-decoder — one decoder up-block — using only NumPy. We use tiny tensors so every value can be printed and verified. The block does three things in order: center-crop the skip, concatenate along the channel axis, and convolve.
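A minimal NumPy sketch of that block follows. The tensor sizes are an assumption chosen to match the shapes quoted in the next paragraph: a 1-channel 6×6 skip, a 2-channel 4×4 decoder map, and a single all-ones 3×3 filter. The helper names `center_crop` and `conv2d_valid` are illustrative.

```python
import numpy as np

def center_crop(x, target_h, target_w):
    """Center-crop a (C, H, W) array to (C, target_h, target_w)."""
    _, h, w = x.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return x[:, top:top + target_h, left:left + target_w]

def conv2d_valid(x, weight, bias):
    """Unpadded cross-correlation. x: (C_in, H, W); weight: (C_out, C_in, k, k)."""
    c_out, _, k, _ = weight.shape
    _, h, w = x.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                window = x[:, i:i + k, j:j + k]      # spans every input channel
                out[o, i, j] = np.sum(window * weight[o]) + bias[o]
    return out

# Encoder skip: 1 channel, 6x6, values 0..35 so the crop is easy to verify.
skip = np.arange(36, dtype=float).reshape(1, 6, 6)
# Upsampled decoder feature: 2 channels, 4x4, all ones.
dec = np.ones((2, 4, 4))

# 1) Center-crop the skip to the decoder's spatial size.
cropped = center_crop(skip, 4, 4)
print("Cropped skip shape:", cropped.shape)

# 2) Concatenate along the channel axis (skip first, then decoder).
fused = np.concatenate([cropped, dec], axis=0)
print("Fused shape:", fused.shape)

# 3) One unpadded 3x3 conv that mixes all three channels into one.
weight = np.ones((1, 3, 3, 3))
bias = np.zeros(1)
out = conv2d_valid(fused, weight, bias)
print("Output shape:", out.shape)
print(out[0])
```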
Run it and you should see Cropped skip shape: (1, 4, 4), Fused shape: (3, 4, 4), and a (1, 2, 2) output. We did the crop, the concat, and the 3×3 conv ourselves, and that is the entire mechanism. The PyTorch implementation below stacks this same building block four times to get the full decoder.
PyTorch: full UNet module
The PyTorch implementation builds the same crop-concat-conv mechanism we just did in NumPy, repeated 4 times in the decoder, with two small modernizations: padded 3×3 convolutions (so input and output spatial sizes match) and bilinear upsample as the default decoder up-step (cheaper than ConvTranspose, immune to the checkerboard artifact derived in §10.6).
We split the network into four files-worth of code: DoubleConv (the atom), Down (encoder stage), Up (decoder stage), and UNet (full module). Each is self-contained and composes into the next.
1. DoubleConv — the atom
2. Down — encoder stage
3. Up — decoder stage
4. UNet — full module
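The four pieces can be sketched in one listing. This is one common way to lay them out, not necessarily the exact code this section was written around. The class names and the `n_channels` / `n_classes` arguments come from the text. The `bilinear` flag, the `mid_ch` argument, the BatchNorm placement, and the padding of the upsampled map to the skip's size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """(3x3 conv -> BatchNorm -> ReLU) x 2, padded so H and W are preserved."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Down(nn.Module):
    """Encoder stage: 2x2 max-pool, then DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))

    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Decoder stage: upsample x2, concat the skip on the channel axis, DoubleConv."""
    def __init__(self, in_ch, out_ch, bilinear=True):
        super().__init__()
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
            self.conv = DoubleConv(in_ch, out_ch, in_ch // 2)
        else:
            self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
            self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        # With padded convs the sizes differ only for odd inputs; pad x to match.
        dy = skip.size(2) - x.size(2)
        dx = skip.size(3) - x.size(3)
        x = F.pad(x, [dx // 2, dx - dx // 2, dy // 2, dy - dy // 2])
        return self.conv(torch.cat([skip, x], dim=1))

class UNet(nn.Module):
    def __init__(self, n_channels=1, n_classes=2, bilinear=True):
        super().__init__()
        f = 2 if bilinear else 1   # bilinear upsampling does not halve channels itself
        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        self.down4 = Down(512, 1024 // f)
        self.up1 = Up(1024, 512 // f, bilinear)
        self.up2 = Up(512, 256 // f, bilinear)
        self.up3 = Up(256, 128 // f, bilinear)
        self.up4 = Up(128, 64, bilinear)
        self.outc = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        y = self.up1(x5, x4)
        y = self.up2(y, x3)
        y = self.up3(y, x2)
        y = self.up4(y, x1)
        return self.outc(y)   # raw logits, one channel per class

model = UNet(n_channels=1, n_classes=2).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 1, 64, 64))
print(tuple(logits.shape))
```

Because the convolutions are padded, the logits have the same height and width as the input, so no cropping of the target mask is needed at training time.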
Loss functions for segmentation
Segmentation has a class-imbalance problem out of the gate: the foreground (cells, roads, organs) usually occupies less than 10% of the pixels. Plain pixel-wise BCE / cross-entropy is heavily biased toward predicting all-background. Three losses solve this; pick by problem.
| Loss | Definition (informal) | When to use | Cross-ref |
|---|---|---|---|
| Dice | $1 - 2\lvert p \cap g\rvert / (\lvert p\rvert + \lvert g\rvert)$ (overlap-based) | Heavy class imbalance, e.g. tumor < 1% of pixels | §5.4 (loss-functions) |
| BCE + Dice | Pixel-wise BCE plus the Dice term | Most binary segmentation tasks; pairs pixel-level signal with shape-level signal | §5.4 |
| Focal Tversky | (1 − T)^γ where Tversky T trades FP and FN | Very small structures (vessels, ducts) where missing them is much worse than over-segmenting | Salehi 2017 (arXiv:1706.05721) |
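For binary cell segmentation we'll use BCE + Dice with equal weight. The Dice term:
$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$
where $p_i$ is the predicted probability for pixel $i$, $g_i$ is the ground-truth label, and $\epsilon$ is a small constant (e.g. 1) that makes the loss well-defined when both p and g are zero everywhere.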
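A small NumPy sketch of the combined loss, assuming the usual soft-Dice form 1 − (2Σpg + ε)/(Σp + Σg + ε) and an equal-weight sum with mean pixel-wise BCE. The function names and the clipping constant `tiny` are illustrative choices.

```python
import numpy as np

def dice_loss(p, g, eps=1.0):
    """Soft Dice loss. p: predicted probabilities, g: 0/1 mask, same shape."""
    intersection = np.sum(p * g)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(g) + eps))

def bce_loss(p, g, tiny=1e-7):
    """Mean pixel-wise binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, tiny, 1.0 - tiny)
    return float(np.mean(-(g * np.log(p) + (1.0 - g) * np.log(1.0 - p))))

def bce_dice_loss(p, g, eps=1.0):
    """Equal-weight sum of BCE and Dice."""
    return bce_loss(p, g) + dice_loss(p, g, eps)

# An 8x8 mask with a 2x2 foreground square (about 6% of the pixels).
g = np.zeros((8, 8))
g[3:5, 3:5] = 1.0

all_background = np.zeros_like(g)   # the lazy "predict nothing" solution
perfect = g.copy()

print("Dice, all background:", dice_loss(all_background, g))
print("Dice, perfect:       ", dice_loss(perfect, g))
print("BCE+Dice, perfect:   ", bce_dice_loss(perfect, g))
```

The all-background prediction is wrong on only 4 of 64 pixels, yet it scores a Dice loss of 0.8. That sensitivity to the missed foreground is what the Dice term adds on top of BCE.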
Training recipe
The original U-Net's training recipe — still a good default for biomedical sets:
- Patch sampling. Train on overlapping patches (e.g. 388×388 outputs from 572×572 inputs) instead of full images. Smaller memory, more diversity per epoch.
- Heavy augmentation. Random flips, 90° rotations, and elastic deformation — small smooth random vector fields applied to both the input and the mask. Critical for biomedical sets where labeled data is scarce.
- Boundary weight maps. The original paper precomputes a per-pixel weight that assigns higher loss to pixels near cell boundaries (separating touching cells is the hardest sub-problem). Modern setups often drop this in favor of a Dice + boundary-loss combination, but it remains a clean inductive prior.
- Optimizer. Adam with , batch size 4–16 (memory-bound). Learning-rate cosine decay or step decay both work; warmup is rarely worth it at this scale.
- Metrics. Intersection-over-Union ($\mathrm{IoU} = \lvert P \cap G\rvert / \lvert P \cup G\rvert$, for predicted mask $P$ and ground-truth mask $G$) and Dice coefficient on a held-out validation set. For multi-class semantic segmentation: mean IoU (mIoU) averaged over classes, with explicit per-class breakdowns to catch class-imbalance failures.
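The two metrics in the last bullet are a few lines of NumPy on hard (thresholded) masks. This is a sketch: the convention of scoring an empty-vs-empty pair as 1.0 is an assumption, and the function names are illustrative.

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0        # both masks empty: count as perfect
    dice = 2.0 * inter / total if total else 1.0
    return float(iou), float(dice)

def mean_iou(pred_labels, true_labels, n_classes):
    """Per-class IoU on integer label maps, plus their unweighted mean."""
    per_class = [iou_and_dice(pred_labels == c, true_labels == c)[0]
                 for c in range(n_classes)]
    return float(np.mean(per_class)), per_class

pred = np.zeros((4, 4), dtype=int)
pred[0:2, 0:2] = 1            # predicted 2x2 square
target = np.zeros((4, 4), dtype=int)
target[0:2, 1:3] = 1          # true square, shifted one pixel to the right

iou, dice = iou_and_dice(pred, target)
miou, per_class = mean_iou(pred, target, n_classes=2)
print("IoU:", iou, "Dice:", dice)
print("mIoU:", miou, "per class:", per_class)
```

On a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), so they rank models identically; they diverge only once they are averaged over images or classes.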
Real-world applications
U-Net's reach is the broadest of any architecture in this chapter. The same encoder-decoder + skip pattern appears, often unchanged, across domains where the input and the desired output are both image-shaped:
| Domain | Task | Why U-Net wins here |
|---|---|---|
| Biomedical (microscopy) | Cell, nucleus, organelle segmentation in EM and fluorescence images | Original domain. Scarce labels + heavy augmentation + per-boundary weights. |
| Biomedical (radiology) | Tumor / organ delineation in MRI, CT, ultrasound; 3D U-Net for volumetric scans | 3D U-Net (Çiçek 2016) extends every 2D op to 3D. Used in nnU-Net's clinical pipelines. |
| Biomedical (ophthalmology) | Retinal-vessel segmentation in OCT and fundus images | Tiny vessels need high-res skip 1 to resolve; deep skip provides shape prior. |
| Satellite (mapping) | Building footprint, road extraction (Inria Aerial, SpaceNet) | Wide receptive field via deep skip; high-res first skip preserves building corners. |
| Agriculture (remote sensing) | Crop-field delineation in Sentinel-2 multi-spectral imagery | n_channels = 13 (one per spectral band) is just a config knob in our UNet class. |
| Autonomous driving (semantic) | Drivable-surface and lane masking when SOTA precision is not required | DeepLab (§10) typically wins here, but U-Net remains a fast baseline. |
| AR / mobile | Real-time portrait segmentation (Pixel phones, Zoom backgrounds) | Trimmed U-Net runs at 30+ FPS on phones; bilinear upsample + lightweight backbone. |
| Industrial QA | Surface-defect detection on steel, glass, fabric production lines | Pre-trained on ImageNet then fine-tuned on a few hundred labeled defects. |
| Diffusion (generative) | U-Net is the denoising backbone for Stable Diffusion and DDPM | Same encoder-decoder + skip pattern, augmented with self-attention; see Ch 23. |
3D U-Net: from pixels to voxels
Every CT or MRI scan is a volumetric image — a stack of 2D slices that, taken together, form a 3D voxel grid. A liver tumour does not live on a single slice; it spans, e.g., 12 contiguous slices in an abdominal CT, with shape, edges, and local invasion only fully visible when those slices are reasoned about together. The 2D U-Net we just built operates on each slice independently. It throws away the third dimension before it ever convolves.
The cross-slice context problem. A 2D U-Net seeing a single 512×512 axial slice of a CT cannot tell whether a bright spot at $(x, y)$ is a tumour (which would persist across the next 5 slices) or a vessel cross-section (which would shift on the next slice). The fix is structural, not training-time: replace every 2D op with its 3D counterpart so a single forward pass sees a 3D voxel neighbourhood at every position.
Three failure modes a slice-by-slice 2D U-Net hits in practice, all of which 3D U-Net fixes:
- Through-plane discontinuity. Predicted masks are sharp inside a slice but flicker between adjacent slices because the network never had a chance to enforce cross-slice consistency.
- Anisotropic context loss. A 3×3 in-plane receptive field sees ~3 mm of tissue (depending on resolution) but the kernel sees 0 mm in the through-plane direction. A small, elongated structure visible across 4 slices can be entirely missed.
- 3D shape priors discarded. Anatomical structures have characteristic 3D shapes (a kidney is a bean; a vertebra has a recognisable arch). 2D processing cannot use those priors.
The 3D U-Net was introduced by Çiçek et al. 2016 (MICCAI) for exactly this reason. It is structurally identical to the 2D U-Net you just built — same encoder-decoder, same concat skips — with every 2D primitive replaced by its 3D analogue. Two contemporary papers explored adjacent points in the design space: Milletari et al. 2016 (V-Net) added residual blocks and introduced the now-ubiquitous soft Dice loss; the fully-automated nnU-Net pipeline (Isensee et al. 2021, Nature Methods) made 3D U-Net the default winner on ~23 public benchmarks.
3D U-Net architecture (Çiçek 2016)
The recipe, taken verbatim from §2 of the paper. The encoder ("analysis path") and decoder ("synthesis path") each have 4 resolution steps; every 2D op is swapped one-for-one with its 3D analogue.
| 2D op (earlier in this section) | 3D op (Çiçek 2016, §2) | What changes |
|---|---|---|
| 3×3 conv (padded) + ReLU | 3×3×3 conv + ReLU (unpadded in the paper; padded in most reimplementations) | Kernel goes from 9 to 27 weights per input channel → 3× params per filter |
| 3×3 conv + BN + ReLU (twice per stage) | 3×3×3 conv + BN + ReLU (twice per stage) | BatchNorm3d normalises over (B,D,H,W); same idea |
| 2×2 max-pool (stride 2) | 2×2×2 max-pool (stride 2) | Halves D, H, W → 8× voxel reduction per stage |
| 2×2 transposed conv (stride 2) | 2×2×2 transposed conv (stride 2) | Doubles D, H, W exactly |
| Skip: encoder → decoder (concat axis 1) | Same — concat along channel axis | Identical mechanism, one more spatial axis |
| Final 1×1 conv → n_classes | Final 1×1×1 conv → n_classes | Same per-voxel linear projection |
One subtle but important departure from the 2D U-Net: channels double before the max-pool, not after. The paper credits this to Szegedy et al. 2015 (Rethinking Inception) to avoid representational bottlenecks. A pre-pool doubling means the conv at the higher resolution gets to use the larger channel count where it sees the most context.
Exact shapes reported in the paper for the Xenopus-kidney experiments:
| Stage | Channels | Spatial (voxels) | Notes |
|---|---|---|---|
| Input | 3 | 132 × 132 × 116 | 3-channel confocal microscopy (Tomato-Lectin / DAPI / Beta-Catenin) |
| Encoder L1 (after 2× conv) | 32 → 64 | 132 × 132 × 116 → 128 × 128 × 112 | Doubling-before-pool: ch goes 32→64 here, BEFORE max-pool |
| After max-pool 1 | 64 | halved each axis | 2×2×2 stride 2 → 8× voxel reduction |
| Encoder L2 | 64 → 128 | shrinks by 4 voxels per axis from unpadded 3³ convs | two convs |
| Encoder L3 | 128 → 256 | shrinks | two convs |
| Encoder L4 (deepest analysis) | 256 → 512 | shrinks | two convs |
| Bottleneck | 512 | smallest | deepest semantic representation |
| Decoder L4 → L1 (mirror) | 512 → 256 → 128 → 64 | doubles each axis per stage | 2×2×2 up-conv → concat skip → 2× 3³ conv |
| Output | 3 | 44 × 44 × 28 | 3 labels: inside-tubule / tubule / background. Receptive field 155×155×180 µm³ |
Total parameters: 19,069,955 (exact figure from Çiçek 2016 §2). Batch size in the paper: 1, because volumetric activations are too large for larger batches on 2016 hardware, and this remains the norm today. BatchNorm placement: before each ReLU. With batch size 1 the batch statistics are computed per-sample (effectively InstanceNorm); modern reimplementations (e.g. nnU-Net) often prefer InstanceNorm for this reason.
Volumetric flow (interactive 3D)
Every box below is a feature-map volume. The encoder column on the left shrinks the spatial cube while the channel count grows; the decoder column on the right does the reverse; dashed lines are the skip connections that copy across each level. Drag to orbit the camera; hover the level buttons to highlight a matching pair on both sides.
The shown spatial sizes (132 → 66 → 33 → 16 → 8) are didactic clean halves. The Çiçek 2016 paper uses unpadded 3³ convolutions, so the actual encoder outputs shrink by 4 voxels per axis at each stage (not exact halves) and the final output is 44×44×28 rather than 132×132×116. Modern padded reimplementations recover the clean halving you see here. The shape table in the previous section gives the paper's exact numbers.
Python from scratch: one 3D up-block
Just like the 2D version, the cleanest way to internalise a 3D U-Net is to build the single new operation, the volumetric up-block, from scratch in NumPy on tiny printable tensors. In a real network the block does three things in order: upsample the decoder feature map by 2, concatenate the skip along the channel axis, and apply a 3×3×3 convolution. Below we focus on the concat + 3³ conv, which is the part that actually changes vs 2D.
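A minimal sketch of the concat + 3³ conv follows. The input values are an assumption, picked so the printed slices match the numbers discussed in the next paragraph: a 1-channel skip whose value equals its depth index, a 1-channel decoder map filled with 2.25, and an all-ones kernel. `conv3d_valid` is an illustrative helper name.

```python
import numpy as np

def conv3d_valid(x, weight):
    """Unpadded 3D cross-correlation. x: (C_in, D, H, W); weight: (C_out, C_in, k, k, k)."""
    c_out, _, k, _, _ = weight.shape
    _, d, h, w = x.shape
    out = np.zeros((c_out, d - k + 1, h - k + 1, w - k + 1))
    for o in range(c_out):
        for z in range(d - k + 1):
            for i in range(h - k + 1):
                for j in range(w - k + 1):
                    window = x[:, z:z + k, i:i + k, j:j + k]
                    out[o, z, i, j] = np.sum(window * weight[o])
    return out

# Skip feature: every voxel holds its own depth index (0, 1, 2, 3).
depth_index = np.arange(4, dtype=float).reshape(1, 4, 1, 1)
skip = np.broadcast_to(depth_index, (1, 4, 4, 4)).copy()
# Upsampled decoder feature: constant everywhere (assumed value).
dec = np.full((1, 4, 4, 4), 2.25)

# 1) Concatenate along the channel axis: (2, 4, 4, 4).
fused = np.concatenate([skip, dec], axis=0)
print("Fused shape:", fused.shape)

# 2) One unpadded 3x3x3 conv with all-ones weights: (1, 2, 2, 2).
weight = np.ones((1, 2, 3, 3, 3))
out = conv3d_valid(fused, weight)
print("Output shape:", out.shape)
print("Depth slice 0:\n", out[0, 0])
print("Depth slice 1:\n", out[0, 1])
```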
The pay-off is the very last printed slice. Depth slice 0 is uniform 87.75; depth slice 1 is uniform 114.75. The +27.0 jump comes from the 3³ kernel summing three different input depth values along the depth axis (0+1+2 vs 1+2+3). A 2D conv applied slice-by-slice computes each output slice from one input slice only, so it can never mix neighbouring slices this way. That Δ = 27 is the literal numerical signature of cross-slice context.
PyTorch: full UNet3D module
Now in PyTorch, with the same four-file decomposition we used for the 2D version — DoubleConv3D, Down3D, Up3D, UNet3D. Every line is the 3D counterpart of a line you have already read in the 2D PyTorch implementation. Read them side-by-side: the diff is the smallest possible while still capturing volumetric context.
1. DoubleConv3D — the volumetric atom
2. Down3D — encoder stage
3. Up3D — decoder stage
4. UNet3D — full module
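A compact sketch of the four 3D pieces, written as a direct port of the 2D layout. The class names come from the text. The `base` width argument, the choice of three pooling steps (four resolution levels, as in Çiçek 2016), padded convolutions, and doubling channels after each pool rather than before it are all assumptions made to keep the listing short.

```python
import torch
import torch.nn as nn

class DoubleConv3D(nn.Module):
    """(3x3x3 conv -> BatchNorm3d -> ReLU) x 2, padded so D, H, W are preserved."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Down3D(nn.Module):
    """Encoder stage: 2x2x2 max-pool, then DoubleConv3D."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool3d(2), DoubleConv3D(in_ch, out_ch))

    def forward(self, x):
        return self.block(x)

class Up3D(nn.Module):
    """Decoder stage: 2x2x2 transposed conv, concat the skip, DoubleConv3D."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv3D(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                              # doubles D, H, W exactly
        return self.conv(torch.cat([skip, x], dim=1))

class UNet3D(nn.Module):
    """Input D, H, W must each be divisible by 8 (three pooling steps)."""
    def __init__(self, n_channels=1, n_classes=3, base=32):
        super().__init__()
        self.inc = DoubleConv3D(n_channels, base)
        self.down1 = Down3D(base, base * 2)
        self.down2 = Down3D(base * 2, base * 4)
        self.down3 = Down3D(base * 4, base * 8)
        self.up1 = Up3D(base * 8, base * 4)
        self.up2 = Up3D(base * 4, base * 2)
        self.up3 = Up3D(base * 2, base)
        self.outc = nn.Conv3d(base, n_classes, kernel_size=1)

    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        y = self.up1(x4, x3)
        y = self.up2(y, x2)
        y = self.up3(y, x1)
        return self.outc(y)   # per-voxel logits

# Tiny configuration so the forward pass runs quickly on a CPU.
model3d = UNet3D(n_channels=1, n_classes=3, base=4).eval()
with torch.no_grad():
    logits3d = model3d(torch.randn(1, 1, 16, 16, 16))
print(tuple(logits3d.shape))
```

Compared with a 2D layout, the only changes are `Conv2d` → `Conv3d`, `BatchNorm2d` → `BatchNorm3d`, `MaxPool2d` → `MaxPool3d`, and `ConvTranspose2d` → `ConvTranspose3d`; the concat still runs along `dim=1`.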
Sparse-annotation loss & 3D Dice
The Çiçek 2016 paper's key training contribution is not the 3D extension (that is mechanical) but the loss it uses to train from sparse 2D annotations on 3D volumes. Annotating every voxel of a 132×132×116 microscopy volume is impractical; annotating a few orthogonal xy/xz/yz slices is tractable. The trick: a per-voxel weighted softmax loss with the weight set to zero on unlabelled voxels:
$$\mathcal{L} = -\sum_{v \in \Omega} w(v)\,\log p_{g(v)}(v)$$
where $\Omega$ is the voxel grid, $p_k(v)$ is the softmax probability for class $k$ at voxel $v$, $g(v)$ is the ground-truth label, and $w(v)$ is the per-voxel weight. Setting $w(v) = 0$ on unlabelled voxels makes them invisible to the gradient, so the network learns only from the annotated slices but predicts on the whole volume at inference. Setting $w(v) > 1$ on rare classes (small tumours) up-weights them; this is the same idea as class re-balancing in 2D, lifted to 3D.
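The masking effect is easy to verify numerically. Below is a NumPy sketch of a weighted softmax cross-entropy, summed over voxels, with weight 0 on unlabelled voxels. The function name, the array layout (classes first), and the unnormalised sum are assumptions for illustration.

```python
import numpy as np

def weighted_softmax_loss(logits, labels, weights):
    """
    logits:  (K, D, H, W) raw class scores
    labels:  (D, H, W) integer class ids (any value where weights == 0)
    weights: (D, H, W) per-voxel weights, 0 on unlabelled voxels
    """
    z = logits - logits.max(axis=0, keepdims=True)            # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))  # log-softmax over classes
    picked = np.take_along_axis(log_p, labels[None], axis=0)[0]
    return float(-(weights * picked).sum())

K, D, H, W = 3, 2, 2, 2
logits = np.zeros((K, D, H, W))          # uniform prediction: p = 1/3 everywhere
labels = np.zeros((D, H, W), dtype=int)

# Only depth slice 0 is annotated; slice 1 gets weight 0.
weights = np.zeros((D, H, W))
weights[0] = 1.0

loss = weighted_softmax_loss(logits, labels, weights)
print("loss:", loss, "expected 4*log(3):", 4 * np.log(3))

# Changing labels on the unlabelled slice leaves the loss untouched.
labels_changed = labels.copy()
labels_changed[1] = 2
print("unchanged:", weighted_softmax_loss(logits, labels_changed, weights) == loss)
```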
For fully-annotated volumetric tasks (BraTS, KiTS) the dominant loss is the 3D soft Dice loss from Milletari et al. 2016 (V-Net):
$$\mathcal{L}_{\text{Dice}}^{\text{3D}} = 1 - \frac{2\sum_{v} p_v g_v + \epsilon}{\sum_{v} p_v + \sum_{v} g_v + \epsilon}$$
identical to the 2D Dice from the loss-functions section above except the sums run over voxels $v$ (3D) rather than pixels (2D). Production pipelines almost always use Dice + cross-entropy as a sum: cross-entropy gives a clean per-voxel signal early in training, Dice gives the shape-level signal that handles the severe foreground/background imbalance typical of 3D tumour volumes (a glioma is often < 1% of brain voxels).
3D U-Net in production (BraTS, KiTS, nnU-Net)
3D U-Net is the workhorse of clinical-grade volumetric segmentation. The two patterns below reappear across essentially every public 3D segmentation benchmark since 2018: (1) a U-Net-shaped 3D backbone with concat skips, (2) Dice + CE loss with patch-based training. Variants of nnU-Net (which auto-configures both) currently lead most public leaderboards.
| Benchmark / dataset | Task | Why 3D U-Net (and what wins) |
|---|---|---|
| BraTS (Brain Tumour Segmentation, 2012–present) | Multi-modal MRI segmentation of glioma into 3 sub-regions: whole tumour, tumour core, enhancing tumour | 4-channel input (T1, T1ce, T2, FLAIR). 3D context is critical because lesions span tens of slices. Most recent winners are nnU-Net variants. (Menze et al. 2015, IEEE TMI; Bakas et al. 2018) |
| KiTS19 / KiTS21 (Kidney Tumor) | Kidney + tumour + cyst segmentation in contrast-enhanced abdominal CT | 3D shape prior of the kidney is strong; 2D slice-by-slice loses both ends of the organ. Top KiTS19 entry was an ensembled 3D U-Net (Heller et al. 2021, Med. Image Anal.) |
| LiTS (Liver Tumor Segmentation) | Liver + lesion segmentation in abdominal CT | Tumours often span 5–20 slices and have weak 2D contrast; 3D conv recovers them. (Bilic et al. 2023, Med. Image Anal.) |
| Medical Segmentation Decathlon (MSD) | 10 different 3D tasks: brain, liver, hippocampus, lung, prostate, pancreas, hepatic vessels, spleen, colon, cardiac | The MSD was won by nnU-Net out-of-the-box on 7/10 tasks, demonstrating that a single self-configuring 3D U-Net pipeline beats task-specific custom networks. (Antonelli et al. 2022, Nat. Comm.) |
| Industrial CT (defect inspection) | Cracks, voids, inclusions in cast metal parts scanned with industrial CT | Same algorithm; medical pre-trained checkpoints transfer well. 3D context disambiguates spheres-of-revolution (real defects) from ring artefacts (acquisition flaws). |
| Cryo-electron tomography | Particle picking / membrane segmentation in cellular tomograms | Resolution is ~10× lower than confocal; 3D U-Net + sparse-annotation loss (Çiçek-style) is well-suited because dense annotation is infeasible. |
| Seismic interpretation | Salt-body / fault segmentation in 3D seismic volumes (energy industry) | The 3D structure of geological faults is fundamentally volumetric; same 3D U-Net architecture is used unchanged on seismic amplitudes. |
Two practical pointers for anyone shipping a 3D U-Net:
- Patch sampling and overlap. Train on random patches of 96³ to 128³ voxels; at inference use a sliding window with ~50% overlap and Gaussian-weighted averaging at patch borders. This is what nnU-Net does by default.
- Resampling to a canonical voxel size. Volumetric medical data has wildly varying voxel spacing across scanners. Resample every volume to a fixed mm-per-voxel (median of the training set) before training; reverse at inference. This single step often outweighs architecture tweaks.
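The sliding-window inference in the first pointer can be sketched in NumPy. This is an illustration of the idea, not nnU-Net's implementation. The function names, the `sigma_scale` value, and the stand-in `predict_fn` (any function mapping a patch to a same-shaped prediction) are hypothetical.

```python
import itertools
import numpy as np

def window_starts(size, patch, overlap=0.5):
    """Start indices along one axis so that windows cover [0, size)."""
    step = max(1, int(patch * (1.0 - overlap)))
    starts = list(range(0, max(size - patch, 0) + 1, step))
    if starts[-1] + patch < size:        # make sure the far border is covered
        starts.append(size - patch)
    return starts

def gaussian_weight(patch, sigma_scale=0.125):
    """Separable 3D Gaussian: high weight at the patch centre, low at its borders."""
    ax = np.arange(patch) - (patch - 1) / 2.0
    g = np.exp(-0.5 * (ax / (sigma_scale * patch)) ** 2)
    return g[:, None, None] * g[None, :, None] * g[None, None, :]

def sliding_window_predict(volume, predict_fn, patch=8, overlap=0.5):
    """Gaussian-weighted average of overlapping patch predictions."""
    acc = np.zeros(volume.shape, dtype=float)
    norm = np.zeros(volume.shape, dtype=float)
    w = gaussian_weight(patch)
    starts = [window_starts(s, patch, overlap) for s in volume.shape]
    for z, y, x in itertools.product(*starts):
        sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
        acc[sl] += w * predict_fn(volume[sl])
        norm[sl] += w
    return acc / norm

rng = np.random.default_rng(0)
volume = rng.random((12, 10, 9))
# With an identity "model", blending must reproduce the input exactly.
blended = sliding_window_predict(volume, predict_fn=lambda p: p, patch=8)
print(np.allclose(blended, volume))
```

The Gaussian down-weights each patch's border voxels, where a network has the least context, so seams between neighbouring windows are dominated by whichever patch sees that voxel closest to its centre.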
Other variants (Attention U-Net, U-Net++, …)
Beyond the 3D extension above, a decade of 2D U-Net variants has sharpened the network for specific failure modes — cluttered scenes, label scarcity, transformer backbones, and full clinical pipelines. None of them change the "encoder-decoder with concat skips" idea; each tweaks one piece:
| Variant | Year | What it changes | When to reach for it |
|---|---|---|---|
| Attention U-Net | 2018 | Adds attention gates on each skip connection to suppress irrelevant skip features | Cluttered backgrounds where the encoder skip carries noise the decoder must ignore |
| U-Net++ | 2018 | Replaces single skips with a nested grid of intermediate convs (the ++ in the name) | Deep supervision wanted; better gradient flow at intermediate decoder depths |
| TransUNet | 2021 | Replaces the bottleneck with a Vision Transformer (ViT) | Long-range context is critical (large organs, long roads). See Ch 18 for ViT. |
| nnU-Net | 2021 | Not a network change — a fully automated pipeline that picks U-Net hyperparams from data | Real clinical work. State-of-the-art on most medical-segmentation benchmarks out of the box. |
References
Primary paper. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI 2015. arXiv:1505.04597.
Variants.
- Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. MICCAI 2016. arXiv:1606.06650. — architecture used in the 3D-U-Net deep dive above (4 resolution steps, channel doubling before max-pool, 19,069,955 params).
- Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV 2016. arXiv:1606.04797. — soft Dice loss for 3D segmentation; close design contemporary to 3D U-Net.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567. — source of the "avoid representational bottlenecks" rule cited in Çiçek 2016 §2 (channel doubling before pooling).
- Oktay, O. et al. (2018). Attention U-Net: Learning Where to Look for the Pancreas. arXiv:1804.03999.
- Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2018). UNet++: A Nested U-Net Architecture for Medical Image Segmentation. DLMIA. arXiv:1807.10165.
- Chen, J. et al. (2021). TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv:2102.04306.
- Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18, 203–211. DOI: 10.1038/s41592-020-01008-z.
Loss functions cited above.
- Salehi, S. S. M., Erdogmus, D., & Gholipour, A. (2017). Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI. arXiv:1706.05721.
3D segmentation benchmarks cited above.
- Menze, B. H. et al. (2015). The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging 34(10), 1993–2024. DOI: 10.1109/TMI.2014.2377694.
- Bakas, S. et al. (2018). Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. arXiv:1811.02629.
- Heller, N. et al. (2021). The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Medical Image Analysis 67, 101821. DOI: 10.1016/j.media.2020.101821.
- Bilic, P. et al. (2023). The Liver Tumor Segmentation Benchmark (LiTS). Medical Image Analysis 84, 102680. DOI: 10.1016/j.media.2022.102680.
- Antonelli, M. et al. (2022). The Medical Segmentation Decathlon. Nature Communications 13, 4128. DOI: 10.1038/s41467-022-30695-9.
Cross-references inside this book. Chapter 5 §4 (loss functions including Dice). Chapter 5 §5 (BatchNorm and GroupNorm). Chapter 10 §6 (transposed convolutions and the checkerboard artifact). Chapter 11 §5 (ResNet and add-style skip connections). Chapter 23 (Diffusion models — uses U-Net as the denoising backbone).