Blog | Mincan Yang

Blog

Part 0 — Foundations

0.0 · Compute: the Crown Jewel of AI Infrastructure

AI infrastructure is enormous (data, storage, networking, serving). This series zooms into the one slice where the money and the scarcity live: compute. The whole series answers one question: are you getting useful work out of the silicon you pay for?

0.1 · The 4-Layer Mental Model for AI Compute

A simple model that splits a compute cluster into four layers (reservation, provisioning, scheduling, workload) so 'my pod is Pending' becomes a one-minute diagnosis instead of a three-team argument.

0.2 · Why Is My Pod Stuck Pending? Looking into the failure path

The most common — and most expensive — GPU cluster question. The same 'Pending' can come from any of the four layers; here's how to find which one, before you waste money on the wrong fix.

Part 1 — Anatomy of a Unit of Compute

Part 2 — GPU Sharing

Part 3 — Job Scheduling

Part 4 — Cluster Sharing

Part 5 — Failure Recovery

Part 6 — Measurement & Economics

Part 7 — Synthesis

7.1 · The Thesis: Capacity and Scheduling Are One Problem

The whole series in one argument — supply and scheduling aren't two problems, and the seam between them is the leverage.