Blog
Part 0 — Foundations
- 0.0 · Compute: the Crown Jewel of AI Infrastructure
  AI infrastructure is enormous: data, storage, networking, serving. This series zooms into the one slice where the money and the scarcity live: compute. The whole series answers one question: are you getting useful work out of the silicon you pay for?
- 0.1 · The 4-Layer Mental Model for AI Compute
  A simple model that splits a compute cluster into four layers (reservation, provisioning, scheduling, workload) so 'my pod is Pending' becomes a one-minute diagnosis instead of a three-team argument.
- 0.2 · Why Is My Pod Stuck Pending? Looking into the failure path
  The most common, and most expensive, GPU cluster question. The same 'Pending' can come from any of the four layers; here's how to find which one before you waste money on the wrong fix.
Part 1 — Anatomy of a Unit of Compute
- 1.1 · Compute: FLOPs, Precision, and the 10× Hidden in a Spec Sheet
  This post is about only one thing inside the accelerator: its ability to calculate. The same chip has a ~10× range of 'speed' depending on precision, but that number is a ceiling, and most of the time you never reach it.
- 1.2 · Memory: HBM Capacity and Bandwidth
  The second thing inside the accelerator: its memory. Two numbers matter, capacity (does it fit) and bandwidth (can you feed the compute), and which one limits you depends on the workload, not the chip. This is why the FLOPs ceiling from 1.1 is rarely your speed.
- 1.3 · Intra-Node Fabric: NVLink and NVSwitch
  Why 8 GPUs in a box aren't just 8 GPUs: the high-bandwidth links inside a node and what they mean for tightly-coupled work.
- 1.4 · Inter-Node Fabric and Topology: The Network, the Rack, the Pod
  How boxes talk: InfiniBand/RoCE/EFA and the rack/pod layout. Where a job lands across the network changes its speed as much as whether it lands.
- 1.5 · The Host Around the Accelerator: CPU, RAM, and Local NVMe
  The last piece of a unit of compute isn't on the accelerator at all: it's the CPU, system RAM, and local disk feeding it across a thin PCIe straw. When the host starves the GPU, no faster GPU helps, and the fix for that isn't in this Part. This post closes the anatomy and bridges into the rest of the series.
Part 2 — GPU Sharing
- 2.1 · One GPU, Many Jobs: The Case for Sharing
  A whole accelerator handed to a job that uses a sliver of it is the most common waste; the fork between splitting in space and sharing in time.
- 2.2 · Hard Partitions: MIG
  Splitting one GPU in space into isolated instances: strong isolation, coarse granularity, and when it fits.
- 2.3 · Soft Sharing: MPS and Time-Slicing
  Sharing one GPU concurrently (MPS) or by turns (time-slicing): higher utilization, weaker isolation.
- 2.4 · When Sharing Backfires: Isolation and Interference
  Sharing only pays off if one job can't steal another's memory or hurt its tail latency; where each sharing mode's isolation breaks.
Part 3 — Job Scheduling
- 3.1 · How a Scheduler Decides: Watch → Filter → Score → Bind
  The loop every placement decision runs through, framed as where each later problem actually occurs, not a docs tour.
- 3.2 · All-or-Nothing: Gang Scheduling for Distributed Jobs
  A multi-GPU job needs all its pieces at once; partial placement deadlocks and wastes what it grabbed.
- 3.3 · Pack vs Spread: Fragmentation and the Bin-Packing Problem
  Free GPUs you can't use because they're scattered; packing tight vs spreading out, and what each costs.
- 3.4 · Placement That Respects Topology
  Where a job's pieces land changes its speed as much as whether they land; keeping tightly-coupled work close on the fabric.
Part 4 — Cluster Sharing
- 4.1 · Fair Shares: Quotas, Borrowing, and Reclaim
  Guaranteeing each team a floor without letting the cluster sit idle behind that guarantee: lend the spare, take it back on demand.
- 4.2 · Priority and Preemption: Who Yields When the Cluster Is Full
  When there is not enough to go around: who gets evicted, how, and why evicting too eagerly costs more than it saves.
- 4.3 · Co-locating Online and Offline Work
  Serving is provisioned for peak and idle most of the day; backfill it with batch without hurting latency.
- 4.4 · Riding the Daily Tide: Lend Off-Peak, Reclaim on Demand
  Demand has a daily rhythm; ride it instead of provisioning for peak and paying for the trough.
- 4.5 · Elastic Jobs: Grow and Shrink With Available Capacity
  Jobs that expand into idle GPUs and contract when demand returns, absorbing the spare instead of leaving it idle.
Part 5 — Failure Recovery
- 5.1 · Checkpointing: How Often to Save
  Saving progress cheaply so an interruption costs minutes, not days, and the tradeoff between checkpoint overhead and lost work.
- 5.2 · Surviving Interruption: Reclaim, Preemption, and Graceful Drain
  Giving capacity back without losing work: spot reclaim, preemption, and the orderly cordon → drain → evict dance.
- 5.3 · Catching Bad Hardware Fast: Stragglers and Fail-Slow
  At scale, hardware fails and slows constantly; detecting the fail-slow node before it burns a whole run's cycles.
Part 6 — Measurement & Economics
- 6.1 · Utilization Isn't One Number: Allocation vs Occupancy vs MFU
  Three numbers that everyone calls 'utilization' but that disagree with each other, and which one tells you you're wasting money.
- 6.2 · The Cost of Idle
  A dollar figure on a single idle percentage point at fleet scale: the number that justifies the whole series.
- 6.3 · The Right Buying Mix: Reserved, On-Demand, and Spot
  Why a paid commitment must run hot to beat on-demand, and when spot wins.
- 6.4 · Forecasting Demand to Size Reservations
  How much to reserve is a prediction problem; sizing committed capacity against forecast demand instead of guessing.
Part 7 — Synthesis