This is the second post in a series on GPU capacity and scheduling. 0.1 drew the map — a GPU cluster as four stacked layers. This post puts the map to work on the single most common thing anyone ever says about a GPU cluster: “my pod is stuck Pending.”
TL;DR
Pending is one word for four different problems, one per layer:
- L1 — there’s no GPU pool at all (no reservation, quota is zero).
- L2 — a node exists but isn’t a schedulable,
Readynode yet. - L4 — the pod’s request can never fit any node (asks for 8 GPUs, or an affinity rule that matches nothing).
- L3 — the GPUs are real and
Ready, but they’re all busy right now.
The catch: the scheduler often prints the same line — Insufficient nvidia.com/gpu — for several of these, so the message alone won’t tell you which layer is broken. Each has a different fix at a different price, and one cheap follow-up question per layer tells them apart. Get it wrong and you can spend real money buying capacity you already had.
Motivation: the cheap question before the expensive fix
No incident story here — this is foundational, and Pending is just the everyday symptom the four-layer map is most useful for. The pattern is real though: “my pod is Pending” sends three people in three directions at once. The capacity owner checks the reservation, the platform engineer restarts the autoscaler, the ML engineer re-reads their YAML — and only one of them is looking at the right layer. The point of this post is to spend thirty seconds finding which layer before anyone touches anything.
Before we start
You’ll want to know roughly what a pod and a node are, and that a GPU is a requestable resource (limits: nvidia.com/gpu). Everything below runs on kind (Kubernetes inside Docker) with GPUs faked onto the nodes — no GPU hardware, no specific scheduler. By the end you should have a four-question triage you can run in about a minute.
The mental model: a triage tree
When a pod is Pending, walk down — the first “no” is your layer. Each step is one cheap, read-only question:
pod stuck Pending
│
┌────────────────────────▼───────────────────────────┐
│ Is there a GPU on ANY node? │── no ─▶ L1 no pool / no capacity
│ kubectl get nodes -o ...nvidia.com/gpu │ (cloud: InsufficientInstanceCapacity, quota 0)
└────────────────────────┬───────────────────────────┘
│ yes
┌────────────────────────▼───────────────────────────┐
│ Are the GPU nodes Ready & schedulable? │── no ─▶ L2 node NotReady / cordoned / draining
│ kubectl get nodes (Ready? SchedulingDisabled?) │
└────────────────────────┬───────────────────────────┘
│ yes
┌────────────────────────▼───────────────────────────┐
│ Could this pod EVER fit one node? │── no ─▶ L4 impossible request
│ request ≤ a node's capacity? selector matches? │ (8 GPUs on 1-GPU nodes; bad nodeSelector)
└────────────────────────┬───────────────────────────┘
│ yes
┌────────────────────────▼───────────────────────────┐
│ Are the GPUs simply busy right now? │── yes ▶ L3 contention
│ kubectl get pods -o wide (others Running on them) │
└──────────────────────────────────────────────────────┘
Why this order? Each question is cheaper and more certain than the next. Is there a GPU at all? and is the node usable? are static facts you read straight off kubectl get nodes in one shot. Could this pod ever fit? is still static — just compare the request to what a node offers. Only the last question, contention, needs you to look at live cluster state (what’s running right this second). It’s also the only cause where the cluster is actually healthy and the fix costs something — wait, preempt, or buy more. So you rule out the cheap, structural causes first and conclude “it’s just busy” only after the other three are out. Skip the order and you risk “fixing” contention by buying capacity you already had.
| Layer | Cause of Pending | The one cheap question | What “no” tells you |
|---|---|---|---|
| L1 capacity | no pool at all | get nodes: any GPU anywhere? | nothing to schedule onto — you need capacity (or quota) |
| L2 provisioning | node not Ready/schedulable | get nodes: Ready? SchedulingDisabled? | capacity exists but isn’t usable yet — fix the node, not the pool |
| L4 workload | request can’t fit any node | request vs node capacity; selector | the YAML is wrong — no amount of capacity helps |
| L3 scheduling | GPUs exist but all busy | get pods -o wide: others Running? | it’s contention — wait, preempt, or add capacity |
The four causes, walked
L1 — no pool. The cluster has no GPU to offer at all: no reservation has been turned into GPU-bearing nodes, or your quota is zero. In the cloud this is InsufficientInstanceCapacity / ZONE_RESOURCE_POOL_EXHAUSTED / quota 0. The tell is that kubectl get nodes shows no GPU on any node — there’s nothing to grow onto. Fixing this means getting capacity; restarting things won’t help.
L2 — no ready node. Capacity exists, but the node carrying it isn’t a usable Ready node right now — it’s still booting, NotReady, cordoned, or draining, or its GPU device plugin hasn’t advertised the GPUs yet. The GPU is “there” but the scheduler can’t use it. The fix is at the node, not the pool.
L4 — impossible request. The pod asks for something no single node can ever satisfy: 8 GPUs when the largest node has 1, or a nodeSelector/affinity that matches no node. This one stays Pending even on a completely idle cluster — which is exactly how you spot it. No amount of extra capacity fixes a request that can’t fit; the YAML is the bug.
L3 — contention. The honest case: GPUs exist, nodes are Ready, the request is reasonable — they’re just all busy. Other pods are Running on the GPUs and yours waits its turn. This is the only one of the four where “add capacity,” “wait,” or “preempt something” are the right moves — and it’s the one most often mistaken for L1.
One concept you need first before the demo: taints
The scheduler messages below are full of the word taint, so it’s worth thirty seconds up front.
A taint is a mark on a node that says “don’t put pods here unless they explicitly say they’re OK with it.” A pod opts in by carrying a matching toleration. No toleration → the scheduler refuses to place that pod on that node. Taints are how Kubernetes keeps certain nodes clear of ordinary work.
Two taints show up in this post, and both are set for you implicitly — which is exactly why they’re confusing the first time:
- A node’s role can imply a taint. The control-plane node is automatically tainted
node-role.kubernetes.io/control-plane:NoScheduleso that ordinary workloads stay off it. Our GPU pods don’t tolerate that taint, so the control-plane is always excluded for them. That is the1 node(s) had untolerated taint(s)clause you’ll see in every message below — that one node is the control-plane. cordonsets a taint.kubectl cordon <node>marks a node unschedulable; under the hood it adds the taintnode.kubernetes.io/unschedulable:NoScheduleand the node starts showing asReady,SchedulingDisabled. New pods won’t be placed there (pods already running stay put).kubectl uncordon <node>removes it. This is the everyday way operators take a node out of rotation without killing what’s on it.
So when the scheduler says 0/N nodes are available: ..., it is listing every reason each node was ruled out for the one pod it’s trying to place — untolerated taints included. Keep that in mind: those counts are always about a single pending pod, not a tally of how many pods failed.
Demo: break a cluster four ways
We’ll manufacture each cause on one small cluster, in the same order you’d triage them — no pool (L1), no usable node (L2), impossible request (L4), real contention (L3). Every command and output below is from a real run (kind v0.32, Kubernetes v1.36.1); GPUs are faked onto the nodes, so this needs no hardware.
Every scheduler message below starts with 1 node(s) had untolerated taint(s) — that’s the control-plane, excluded by its role taint (see the primer above). The clause that changes between causes is the rest of the line, so that’s what to read. (I’ve trimmed the trailing preemption: ... text to … for width.)
Setup — a 3-node cluster (1 control-plane + 2 workers).
cat > kind-pending.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
kind create cluster --name pending --config kind-pending.yaml
kubectl wait --for=condition=Ready nodes --all --timeout=120s
node/pending-control-plane condition met
node/pending-worker condition met
node/pending-worker2 condition met
Cause L1 — no GPU anywhere. Fresh kind nodes advertise no GPUs, so this is the “no pool” state. Check, then submit a pod that wants one GPU:
kubectl get nodes -o 'custom-columns=NODE:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: { name: l1-no-capacity }
spec:
containers:
- name: c
image: busybox
command: ["sh","-c","sleep 3600"]
resources: { limits: { nvidia.com/gpu: "1" } }
EOF
kubectl describe pod l1-no-capacity | sed -n '/Events:/,$p'
NODE GPU
pending-control-plane <none>
pending-worker <none>
pending-worker2 <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint(s), 2 Insufficient nvidia.com/gpu. …
The node table is the whole story: GPU is <none> everywhere. There is nothing to schedule onto. Remember the message — 2 Insufficient nvidia.com/gpu — because you’ll see it again for completely different reasons.
Cause L2 — node not schedulable. Now give each worker a GPU (standing in for the device plugin advertising it once the reserved instance is up), then cordon both — capacity exists, but the nodes won’t accept work:
for n in pending-worker pending-worker2; do
kubectl patch node "$n" --subresource=status --type=json \
-p '[{"op":"add","path":"/status/capacity/nvidia.com~1gpu","value":"1"}]'
kubectl wait --for=jsonpath='{.status.allocatable.nvidia\.com/gpu}'=1 node/"$n" --timeout=30s
kubectl cordon "$n"
done
kubectl get nodes
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: { name: l2-unschedulable }
spec:
containers:
- name: c
image: busybox
command: ["sh","-c","sleep 3600"]
resources: { limits: { nvidia.com/gpu: "1" } }
EOF
kubectl describe pod l2-unschedulable | sed -n '/Events:/,$p'
NAME STATUS ROLES AGE VERSION
pending-control-plane Ready control-plane 35s v1.36.1
pending-worker Ready,SchedulingDisabled <none> 21s v1.36.1
pending-worker2 Ready,SchedulingDisabled <none> 21s v1.36.1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint(s), 2 node(s) were unschedulable. …
A different message — 2 node(s) were unschedulable. That’s the cordon taint talking: kubectl get nodes shows the two workers as Ready,SchedulingDisabled, so the scheduler skips them even though their GPUs exist. The capacity is real; the nodes just aren’t in rotation. Put them back so the next two causes have somewhere to run, and clean up this pod:
kubectl uncordon pending-worker pending-worker2
kubectl delete pod l2-unschedulable
Cause L4 — a request that can’t fit. The cluster is now idle with two free GPUs. Ask for 8 GPUs when no node has more than 1:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: { name: l4-too-big }
spec:
containers:
- name: c
image: busybox
command: ["sh","-c","sleep 3600"]
resources: { limits: { nvidia.com/gpu: "8" } }
EOF
kubectl describe pod l4-too-big | sed -n '/Events:/,$p'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint(s), 2 Insufficient nvidia.com/gpu. …
2 Insufficient nvidia.com/gpu again — the same line L1 printed with no GPUs at all. But the cluster is completely idle, and this pod will stay Pending forever: nothing is busy, the request itself just can’t fit on a 1-GPU node. The “idle but still Pending” is the tell. A different flavor of impossible request — an affinity rule matching no node — announces itself more clearly:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: { name: l4-bad-affinity }
spec:
nodeSelector: { disktype: ssd-nonexistent }
containers:
- name: c
image: busybox
command: ["sh","-c","sleep 3600"]
resources: { limits: { nvidia.com/gpu: "1" } }
EOF
kubectl describe pod l4-bad-affinity | sed -n '/Events:/,$p'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint(s), 2 node(s) didn't match Pod's node affinity/selector. …
Clear these and move to the last cause:
kubectl delete pod l4-too-big l4-bad-affinity
Cause L3 — contention. Both GPUs are real, Ready, and free. Submit three one-GPU pods for two GPUs:
for p in trn-a trn-b trn-c; do
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: { name: $p }
spec:
containers:
- name: c
image: busybox
command: ["sh","-c","sleep 3600"]
resources: { limits: { nvidia.com/gpu: "1" } }
EOF
done
kubectl get pods -o wide
kubectl describe pod trn-c | sed -n '/Events:/,$p'
NAME READY STATUS RESTARTS AGE NODE
trn-a 1/1 Running 0 12s pending-worker2
trn-b 1/1 Running 0 12s pending-worker
trn-c 0/1 Pending 0 12s <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint(s), 2 Insufficient nvidia.com/gpu. …
Read the two outputs together, because this is where people get confused:
kubectl get pods -o wideis the outcome:trn-aandtrn-beach grabbed a GPU and areRunning; onlytrn-cisPending. Scheduling did not fail across the board — two of the three pods placed fine.- The
Eventsblock is fromdescribe pod trn-c— it’s the scheduler explaining why it couldn’t placetrn-cspecifically.0/3 nodes are availablemeans for this one pod, none of the three nodes works: the control-plane is off-limits (its role taint), and each of the two workers has its single GPU already taken bytrn-a/trn-b. A whole GPU is reserved per pod — stock Kubernetes doesn’t split one GPU across pods (that needs MIG / MPS / time-slicing, a later post) — so two pods fill two GPUs and the third has nowhere to go.
And notice the line again: 2 Insufficient nvidia.com/gpu — identical to L1, where there were no GPUs at all. The only thing that tells them apart is get pods: here the GPUs exist and are busy; in L1 they were absent.
kind delete cluster --name pending # clean up
What you just watched, in triage order:
| Cause | What we did | What the scheduler said |
|---|---|---|
| L1 no pool | submit 1-GPU pod, no GPUs anywhere | 2 Insufficient nvidia.com/gpu |
| L2 node not schedulable | cordon the GPU nodes | 2 node(s) were unschedulable |
| L4 impossible request | ask for 8 GPUs / bad selector (idle cluster) | 2 Insufficient nvidia.com/gpu / didn't match Pod's node affinity/selector |
| L3 contention | 3 one-GPU pods, 2 GPUs, both busy | 2 Insufficient nvidia.com/gpu |
Three of the four printed the same Insufficient nvidia.com/gpu line. The message tells you the scheduler couldn’t place the pod — it does not tell you why. The triage tree does: one read-only get nodes / get pods -o wide question per layer separates causes the message itself blurs together.
What it buys you: the meter never stops
GPU capacity bills by the hour whether or not anything runs on it. A reserved pool — or even one high-end accelerator — costs the same sitting at 0% as it does flat out; idle time is just money you’ve already committed and gotten nothing back for. So the real price of a Pending pod isn’t the stuck job. It’s the capacity sitting underneath it, earning nothing, for every minute you spend guessing why.
That’s what a structured triage actually buys: it shortens that meter-running minute, and it stops you paying twice. The expensive way to debug Pending is to guess the layer. Read L3 (busy GPUs) as L1 (no GPUs) — easy, since they print the same line — and you “fix” it by buying more capacity you already had, so now you’re paying for two idle pools instead of one. Read L1 as L3 and you spend an afternoon tuning a scheduler that was never the bottleneck while the reserved pool keeps burning. A defined four-question pass turns “something’s wrong, page everyone” into “it’s L2, the node’s cordoned” in under a minute — and that minute matters precisely because the bill doesn’t pause while you think.
This is the seam from 0.1 showing its teeth: a paid reservation can sit Pending at L1 — billed, never turned into a schedulable node — while the capacity team’s dashboard says “active, $X/hr,” the platform team’s says “N pods Pending,” and no dashboard shows the idle dollars in between. Naming the stuck layer in a minute, instead of arguing for an hour, is the cheapest money-saving move you have before you’ve changed anything at all.
Recap
Pending is four problems wearing one word — no pool (L1), no ready node (L2), an impossible request (L4), or plain contention (L3) — and the scheduler’s message often can’t tell them apart. Walk the tree in order, ask the one cheap question at each layer, and you’ll know which one you have before you spend anything fixing the wrong one. The next posts go down into each layer in turn.
This is a personal blog — opinions are mine, not my employer’s, and everything here is based on publicly documented features.