Kubernetes monitoring
Kubernetes breaks.
Tracegrid explains.
Install the agent once via Helm. Never wonder why a pod is crashing again — every incident arrives with a plain-English root cause and the exact kubectl command to fix it.
The 10 most common K8s failures — explained
These cover the large majority of real production incidents. Tracegrid catches all of them, plus 400+ more.
CrashLoopBackOff
A pod starts, crashes, and Kubernetes restarts it on an exponential backoff — over and over.
Tracegrid reads the previous container logs and the pod events, then tells you which of the six real causes it is: missing env var, OOMKilled, bad liveness probe, wrong entrypoint, missing ConfigMap/Secret, or a failing init container.
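To make "exponential backoff" concrete: Kubernetes documents a restart delay that starts at 10 seconds, doubles after each crash, and is capped at five minutes. The sketch below reproduces that schedule. The function name `restart_delays` is hypothetical and is not part of Tracegrid.

```python
def restart_delays(restarts, base=10, cap=300):
    """Return the wait in seconds before each of the first `restarts` restarts.

    Assumes the documented kubelet behaviour: 10s, doubling, capped at 300s.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= 2                      # double after every crash
    return delays


print(restart_delays(7))
```

By the sixth crash the pod is already waiting the full five minutes between attempts, which is why a crash loop can look like a pod that is simply "down".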
OOMKilled
The container exceeds its memory limit and the Linux kernel kills it (exit code 137).
Detected even when the pod has no logs — Tracegrid reads the K8s event and the kernel signal, and recommends a memory limit based on observed usage plus headroom.
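Two pieces of arithmetic sit behind this. Exit code 137 follows the common convention of 128 plus the signal number, and signal 9 is SIGKILL. A limit recommendation can be as simple as observed peak plus headroom. The following is a minimal sketch; the 25% headroom and the 64 MiB rounding step are illustrative assumptions, not Tracegrid's actual policy.

```python
import math

SIGKILL = 9  # signal used by the kernel OOM killer


def killed_by_signal(exit_code):
    """Return the signal number if the exit code encodes one (128 + N), else None."""
    return exit_code - 128 if exit_code > 128 else None


def recommend_limit_mib(peak_mib, headroom=0.25, step_mib=64):
    """Observed peak plus a headroom fraction, rounded up to a multiple of step_mib."""
    wanted = peak_mib * (1 + headroom)
    return math.ceil(wanted / step_mib) * step_mib


print(killed_by_signal(137) == SIGKILL)
print(recommend_limit_mib(400))
```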
ImagePullBackOff
Kubernetes cannot pull the container image — wrong name/tag, expired registry token, or missing pull secret.
Tracegrid surfaces the exact pull error string and points at the specific cause (e.g. expired ECR auth token vs. typo in the tag).
Pod stuck in Pending
The scheduler cannot place the pod: insufficient CPU or memory, a taint the pod does not tolerate, or no node matching its selector or affinity rules.
Tracegrid identifies the precise scheduling constraint blocking placement and shows the node capacity vs. the pod request.
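The capacity-versus-request comparison can be sketched as follows. This is an illustration only: it handles a small subset of Kubernetes quantity syntax (`m` for CPU, `Ki`/`Mi`/`Gi` for memory), and the helper names are hypothetical.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Parse '500m' or '2' (cores) into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Parse '512Mi', '2Gi', or a plain byte count into bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def blocking_resources(request, free):
    """List the resources where the pod requests more than the node has free."""
    blocked = []
    if cpu_millicores(request["cpu"]) > cpu_millicores(free["cpu"]):
        blocked.append("cpu")
    if memory_bytes(request["memory"]) > memory_bytes(free["memory"]):
        blocked.append("memory")
    return blocked


pod_request = {"cpu": "500m", "memory": "2Gi"}
node_free = {"cpu": "2", "memory": "1536Mi"}
print(blocking_resources(pod_request, node_free))
```

Here the node has CPU to spare, so memory is the only constraint keeping the pod in Pending.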
PVC Unbound
A PersistentVolumeClaim never binds — storage class missing or no available PVs.
Tracegrid reports the PVC state and the reason the binding failed, so you fix the storage class instead of guessing.
Liveness probe killing healthy pods
An over-aggressive liveness probe restarts a pod that was actually fine, often during slow startup.
Tracegrid distinguishes a probe-induced restart from a real crash and flags probe timing as the cause.
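Probe timing is easy to reason about with a rough budget: the initial delay plus the probe period multiplied by the failure threshold. The sketch below uses the Kubernetes defaults (`initialDelaySeconds` 0, `periodSeconds` 10, `failureThreshold` 3). It is an estimate only, since it ignores `timeoutSeconds` and any `startupProbe`, and the function names are hypothetical.

```python
def liveness_budget_seconds(probe):
    """Rough time a container has to become healthy before it is restarted."""
    initial = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)
    threshold = probe.get("failureThreshold", 3)
    return initial + period * threshold


def probe_too_aggressive(probe, startup_seconds):
    """True if the app needs longer to start than the probe allows."""
    return startup_seconds > liveness_budget_seconds(probe)


print(probe_too_aggressive({}, startup_seconds=45))
print(probe_too_aggressive({"initialDelaySeconds": 30}, startup_seconds=45))
```

With default settings an app that needs 45 seconds to start is restarted before it ever becomes healthy; a 30-second initial delay is enough to break the loop.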
Readiness probe failures
The pod runs but never becomes Ready — the app or a dependency is not up.
Tracegrid shows which endpoint is failing the readiness check and whether a dependency is the blocker.
Deployment rollout stuck
A new rollout hangs because new pods never reach Ready.
Tracegrid shows why the rollout is blocked and which new replica is failing, so you can roll back or fix forward with confidence.
TLS certificate expiry
A cert-manager renewal failed or a manual cert is about to lapse.
Tracegrid predicts expiry up to 30 days ahead and alerts before traffic breaks — not after.
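The underlying check is a date comparison against the certificate's `notAfter` timestamp. A minimal sketch, assuming the timestamp has already been parsed into a timezone-aware `datetime`; the function names are hypothetical and the 30-day window mirrors the figure above.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after, now):
    """Whole days remaining before the certificate expires."""
    return (not_after - now).days


def should_alert(not_after, now, window_days=30):
    """True once the certificate is inside the alert window."""
    return not_after - now <= timedelta(days=window_days)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = datetime(2025, 1, 21, tzinfo=timezone.utc)
print(days_until_expiry(expires, now), should_alert(expires, now))
```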
Resource quota exceeded
A namespace hits its ResourceQuota and the API server rejects new pods at creation.
Tracegrid shows quota usage vs. limits and which workload pushed you over.
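A ResourceQuota object reports `hard` and `used` maps in its status, so the usage-versus-limit comparison is a per-resource subtraction. The sketch below assumes count-style resources with plain integer values (such as `pods`); CPU and memory quantities would need unit parsing first. The function names are hypothetical.

```python
def quota_headroom(status):
    """Map each resource to (used, hard, remaining) from a ResourceQuota status."""
    report = {}
    for resource, hard in status["hard"].items():
        used = int(status["used"].get(resource, "0"))
        report[resource] = (used, int(hard), int(hard) - used)
    return report


def exhausted(status):
    """Resources with nothing left, sorted by name."""
    return sorted(
        name for name, (_, _, left) in quota_headroom(status).items() if left <= 0
    )


status = {
    "hard": {"pods": "10", "services": "5"},
    "used": {"pods": "10", "services": "2"},
}
print(exhausted(status))
```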
How Tracegrid monitors Kubernetes
- Deployed as a DaemonSet via Helm — one command, no operator.
- Watches the Kubernetes events API directly, not just metrics.
- Reads pod logs for application-level errors.
- Maps owner references (Pod → ReplicaSet → Deployment) to name the real culprit.
- Calculates blast radius when a service fails, so you fix the cause not the symptom.
- Namespace filtering keeps it scoped to the workloads you care about.
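The owner-reference mapping in the list above can be sketched as a walk up `metadata.ownerReferences`. This is an illustration under stated assumptions, not Tracegrid's implementation: `lookup` is a hypothetical dict keyed by `(kind, name)` that stands in for API calls, and only the first owner of each object is followed.

```python
def root_owner(obj, lookup):
    """Follow ownerReferences upward until an object has no owner."""
    current = obj
    while True:
        refs = current["metadata"].get("ownerReferences", [])
        if not refs:
            return current["kind"], current["metadata"]["name"]
        ref = refs[0]  # assumption: follow the first owner only
        current = lookup[(ref["kind"], ref["name"])]


deployment = {"kind": "Deployment", "metadata": {"name": "api"}}
replica_set = {
    "kind": "ReplicaSet",
    "metadata": {
        "name": "api-7d9f",
        "ownerReferences": [{"kind": "Deployment", "name": "api"}],
    },
}
pod = {
    "kind": "Pod",
    "metadata": {
        "name": "api-7d9f-x2k",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "api-7d9f"}],
    },
}
lookup = {("Deployment", "api"): deployment, ("ReplicaSet", "api-7d9f"): replica_set}
print(root_owner(pod, lookup))
```

A crashing pod is reported against the Deployment that produced it, which is the object you would actually edit or roll back.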
Install on Kubernetes
helm repo add tracegrid https://charts.tracegrid.app
helm repo update
helm install tracegrid tracegrid/agent \
--namespace tracegrid --create-namespace \
--set apiKey=YOUR_API_KEY
That is the whole setup. Your first explained incident usually arrives within the hour.
Start monitoring your cluster free
1 host free forever. 15-day full trial on paid plans. No credit card.