July 3, 2026 · 8 min read · kubernetes
ImagePullBackOff: The Five Real Causes and How to Fix Each Fast
Your pod won't start. kubectl get pods shows ImagePullBackOff, or its angrier sibling ErrImagePull. The container hasn't crashed. It never ran. Kubernetes can't get the image, so it's backing off and retrying with an exponential delay. The good news: unlike a crash loop, this failure tells you almost exactly what's wrong if you know where to look.
What ImagePullBackOff actually means
ErrImagePull is the first failure: the kubelet asked the container runtime to pull the image and the pull failed. ImagePullBackOff is what you see next: Kubernetes is now waiting before it tries again, doubling the delay each time up to a cap of five minutes. The "BackOff" is not the error. It's the symptom of a pull that keeps failing.
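To make the schedule concrete, here is a rough sketch of the delays between retries. It assumes a 10-second initial delay (treated here as an assumption about the kubelet's default) that doubles until it reaches the five-minute (300-second) cap. The real kubelet retries on pod sync, so actual timing is approximate. backoff_schedule is a hypothetical helper, not a Kubernetes API.

```bash
#!/usr/bin/env bash
# Approximate the image-pull back-off schedule.
# Assumption: 10s initial delay, doubling each attempt, capped at 300s.
backoff_schedule() {
  local attempts="$1" delay=10 cap=300 i
  for (( i = 0; i < attempts; i++ )); do
    printf '%s\n' "$delay"
    delay=$(( delay * 2 ))
    # Never wait longer than the cap.
    if (( delay > cap )); then
      delay=$cap
    fi
  done
}

# Usage: backoff_schedule 7
```

Under these assumptions, seven attempts give 10, 20, 40, 80, 160, 300 and 300 seconds: roughly 15 minutes of waiting, which is why waiting it out rarely helps.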
There are only a handful of root causes, and kubectl describe will hand you the real one in its Events section. Always start there:
kubectl describe pod <pod-name>
Scroll to Events at the bottom. The "Failed to pull image" line contains the actual reason. Everything below is how to read it.
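If the Events section is long, a grep saves the scrolling. pull_failures is a hypothetical filter that keeps only the image-pull lines. It assumes the usual describe wording (Failed to pull image, ErrImagePull, Back-off pulling image), which may vary between Kubernetes versions and container runtimes.

```bash
#!/usr/bin/env bash
# Keep only the image-pull failure lines from `kubectl describe pod` output.
# Assumption: events use the wording matched below.
pull_failures() {
  grep -E 'Failed to pull image|ErrImagePull|ImagePullBackOff|Back-off pulling image'
}

# Usage (not run here):
#   kubectl describe pod <pod-name> | pull_failures
```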
The five real causes
1. Wrong image name or tag
The most common cause by far: a typo in the repository name, or a tag that doesn't exist. The classic: deploying :latest (or no tag at all, which Kubernetes treats as :latest) when the image was only pushed as :v1.2.3, or a CI pipeline that referenced a tag it never built.
The Event reads:
Failed to pull image "myapp:v2": manifest for myapp:v2 not found
Fix: confirm the tag exists in the registry. docker manifest inspect <image>:<tag> will tell you in one command whether the tag is real before you blame the cluster.
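The :latest trap is easy to miss because an untagged reference silently means :latest. image_tag is a hypothetical helper that prints the tag a reference will actually request. It assumes the usual [registry[:port]/]repository[:tag][@digest] form, and it's worth running on the exact string from your manifest before reaching for docker manifest inspect.

```bash
#!/usr/bin/env bash
# Print the tag an image reference resolves to ("latest" if none is given).
# Assumption: reference form is [registry[:port]/]repo[:tag][@digest].
image_tag() {
  local ref="$1"
  # Pinned by digest: the digest is what gets pulled, not the tag.
  if [[ "$ref" == *@* ]]; then
    echo "digest:${ref#*@}"
    return
  fi
  # Only look for ':' after the last '/', so a registry port
  # (registry.example.com:5000/app) is not mistaken for a tag.
  local last="${ref##*/}"
  if [[ "$last" == *:* ]]; then
    echo "${last##*:}"
  else
    echo "latest"
  fi
}

# Usage: image_tag registry.example.com:5000/team/myapp   # prints "latest"
```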
2. Private registry, missing or wrong credentials
The image exists, but the cluster isn't allowed to pull it. You'll see:
Failed to pull image "...": pull access denied, repository does not exist or may require authorization
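Note the lie in that message: "repository does not exist" usually means "you're not authenticated." Kubernetes authenticates private pulls with an imagePullSecret. If the secret is missing, holds the wrong credentials, lives in a different namespace from the pod, or isn't referenced by the pod (directly or through its service account), every pull fails.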
Fix:
kubectl create secret docker-registry regcred \
--namespace=<namespace> \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<token>
Then reference it under spec.imagePullSecrets in the pod, or attach it to the service account so every pod running under that service account inherits it.
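For reference, here is the shape of a Pod that uses the secret, emitted from a shell heredoc so it can be piped straight into kubectl apply. pod_manifest, the container name app, and the image are placeholders for illustration; regcred matches the secret name used in the command above.

```bash
#!/usr/bin/env bash
# Print a minimal Pod manifest that pulls its image with a named pull secret.
# Hypothetical helper: names and image are placeholders.
pod_manifest() {
  local name="$1" image="$2" secret="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: app
      image: ${image}
  imagePullSecrets:
    - name: ${secret}
EOF
}

# Usage (not run here):
#   pod_manifest web <registry>/myapp:v1.2.3 regcred | kubectl apply -f -
# To attach the secret to a service account instead:
#   kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```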
3. Expired registry token
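This one is nasty because it works until it doesn't. Cloud registry tokens are short-lived: an ECR authorization token lasts 12 hours, and GCR and ACR access tokens typically expire sooner. A deployment that pulled fine yesterday starts throwing ImagePullBackOff overnight with no config change, usually the moment a pod lands on a node that doesn't already have the image cached (a reschedule, a scale-up, a new node) or restarts with imagePullPolicy: Always. Nothing in your YAML is wrong; the token baked into the pull secret simply expired.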
Fix: rotate the token, or better, use the cloud provider's credential helper (a kubelet credential provider plugin) so the kubelet obtains fresh tokens itself instead of relying on a static secret.
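A quick heuristic for the "worked yesterday" case is to compare the pull secret's age with the token lifetime. token_likely_expired is a hypothetical helper, and it assumes the secret was created right after the token was fetched. A secret updated in place keeps its original creationTimestamp, so treat a true result as a hint, not proof. The 12-hour default mirrors ECR; pass a smaller TTL for registries whose tokens expire sooner.

```bash
#!/usr/bin/env bash
# Heuristic: has a pull secret outlived the registry token stored in it?
# Assumption: secret creation time approximates token issue time.
# Returns 0 (true) when the age is at or past the TTL.
token_likely_expired() {
  local created_epoch="$1" now_epoch="$2" ttl_hours="${3:-12}"
  (( now_epoch - created_epoch >= ttl_hours * 3600 ))
}

# Usage on a live cluster (GNU date assumed, not run here):
#   created=$(date -d "$(kubectl get secret regcred -o jsonpath='{.metadata.creationTimestamp}')" +%s)
#   token_likely_expired "$created" "$(date +%s)" && echo "regcred is older than the token TTL"
```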
4. Rate limiting
If you pull public images from Docker Hub without authenticating, you can hit its pull rate limit. Anonymous pulls are counted per source IP address, so every node behind the same NAT gateway shares one budget. Under load (a node scaling event, a rollout across many pods) the cluster pulls the same image dozens of times and Docker Hub starts refusing:
toomanyrequests: You have reached your pull rate limit
Fix: authenticate to Docker Hub even for public images (the authenticated limit is far higher), or mirror critical images into your own registry.
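To see how close a node is to the limit, Docker documents a check against a special ratelimitpreview/test repository: fetch an anonymous token, send a HEAD request for its manifest, and read the ratelimit-remaining header. The sketch below wraps that in two hypothetical helpers and needs curl and jq. The endpoint and header names follow Docker's documentation at the time of writing and may change, and the headers may be absent when no limit applies. Run it from the node, since anonymous limits follow the source IP.

```bash
#!/usr/bin/env bash
# Extract the remaining-pull count from Docker Hub response headers,
# e.g. "ratelimit-remaining: 76;w=21600" -> 76.
remaining_pulls() {
  grep -i '^ratelimit-remaining:' | head -n1 |
    sed -E 's/^[^:]*:[[:space:]]*([0-9]+).*/\1/'
}

# Query Docker Hub anonymously from this machine (requires curl and jq).
check_docker_hub_limit() {
  local token
  token=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
  curl -s --head -H "Authorization: Bearer $token" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | remaining_pulls
}

# Usage (network access required, not run here):
#   check_docker_hub_limit
```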
5. Network or DNS: the node can't reach the registry
The image is fine, the credentials are fine, but the node can't resolve or reach the registry endpoint. Common on private clusters behind a proxy, or when the registry's domain doesn't resolve from the node. The Event will mention dial tcp, i/o timeout, or no such host.
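Fix: run curl -v https://<registry> and nslookup <registry> from the node itself, not only from a debug pod. The container runtime pulls over the node's network and resolver, while an ordinary pod resolves names through cluster DNS, so a pod can succeed where the node fails. If those checks fail on the node, it's networking, not Kubernetes.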
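The two manual checks can be wrapped into one function that reports which layer failed. check_registry is a hypothetical helper. It uses getent hosts (the node's NSS resolver, present on glibc systems) for DNS and curl against the registry's /v2/ API root for reachability; an HTTP 401 there still counts as reachable, because only the network path is being tested. It does not account for proxy settings configured only for the container runtime.

```bash
#!/usr/bin/env bash
# Layer-by-layer reachability check for a registry host.
# Run it on the node, where the container runtime actually pulls from.
# Prints "dns", "connect", or "ok".
check_registry() {
  local host="$1"
  # Step 1: can the node's resolver find the host at all? (requires getent)
  if ! getent hosts "$host" >/dev/null 2>&1; then
    echo "dns"
    return 1
  fi
  # Step 2: can we complete TCP + TLS to the registry API root?
  # Without -f, curl only fails on transport errors, so a 401 counts as reachable.
  if ! curl -sS --max-time 5 -o /dev/null "https://$host/v2/" 2>/dev/null; then
    echo "connect"
    return 1
  fi
  echo "ok"
}

# Usage (on the node, not run here):
#   check_registry <registry>
```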
A fast triage order
- kubectl describe pod <pod> → read the Event verbatim.
- Tag exists? docker manifest inspect <image>:<tag>.
- Private registry? Check that the imagePullSecret is present and attached to the pod or its service account.
- Worked yesterday, fails today? Suspect an expired token.
- Event says toomanyrequests? You've hit Docker Hub's rate limit.
- Event mentions dial tcp or no such host? Networking or DNS, not auth.
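The triage above is mostly string matching on the Event, so it can be sketched as a function. classify_pull_error is hypothetical and matches only the substrings quoted in this post plus a couple of common variants; real messages differ between container runtimes and registries. A missing secret and an expired token produce the same auth error, so telling those two apart still needs the "worked yesterday?" question.

```bash
#!/usr/bin/env bash
# Map a "Failed to pull image" Event message to a likely cause.
# Hypothetical helper: substring patterns are illustrative, not exhaustive.
# Order matters: more specific failures are checked first.
classify_pull_error() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *toomanyrequests*|*"rate limit"*)
      echo "rate-limit" ;;
    *"dial tcp"*|*"i/o timeout"*|*"no such host"*)
      echo "network" ;;
    *"pull access denied"*|*unauthorized*|*"may require authorization"*|*"no basic auth credentials"*)
      # Missing secret or expired token; check history to tell them apart.
      echo "auth" ;;
    *"not found"*|*"manifest unknown"*)
      echo "wrong-tag" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage on a live cluster (not run here):
#   classify_pull_error "$(kubectl get events --field-selector involvedObject.name=<pod>,reason=Failed -o jsonpath='{.items[*].message}')"
```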
Ninety percent of ImagePullBackOff incidents are cause #1 or #2. The describe Event almost always names the right one. The slow part is being awake enough at 3am to read it carefully.
Why this still takes an hour
The information is all there. The problem is that nobody describes the pod first. The instinct is to re-deploy, restart the node, or re-run CI. Anything but read the one line that already contains the answer. By the time you've ruled out the things it isn't, an hour is gone and the deploy is still red.
This is exactly the gap Tracegrid closes. When a pod enters ImagePullBackOff, Tracegrid reads the Event for you, identifies which of the five causes it is (wrong tag, missing secret, expired token, rate limit, or network) and posts the specific fix to Slack, with the command. No describe, no guessing, no hour. You get the answer before you've finished opening your laptop.
It installs in 60 seconds, watches Kubernetes, Linux, Docker, ECS, and Azure, and there's a free tier with AI explanations included. If you've lost a night to a pull that turned out to be a typo, that's the night it's built to give back.
Install with curl -sSL https://tracegrid.app/install.sh | bash, or visit tracegrid.app.
Written by Pradip, founder of Tracegrid, building AI infrastructure intelligence so small teams get senior-SRE answers at 3am.