Quick start
Get Tracegrid running in under 5 minutes.
Prerequisites
- A Slack workspace where you are an admin
- A Linux server (Ubuntu 20.04+) or Kubernetes cluster
- Your Tracegrid API key (from your account page)
Step 1 — Get your API key
After signing up, you receive an email with your API key. It looks like: gw_a1b2c3d4e5f6...
Your API key is shown once. Store it securely.
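Because the key is shown only once, it can help to sanity-check the value you stored before installing the agent. The Python sketch below assumes only what the example above shows, namely the gw_ prefix. The helper name looks_like_tracegrid_key is hypothetical, and the real key length and character set are not documented here.

```python
def looks_like_tracegrid_key(key: str) -> bool:
    """Cheap sanity check; assumes only the documented 'gw_' prefix."""
    key = key.strip()
    has_prefix = key.startswith("gw_")
    has_body = len(key) > len("gw_")
    # A pasted key should not contain spaces, tabs, or newlines.
    no_whitespace = not any(ch.isspace() for ch in key)
    return has_prefix and has_body and no_whitespace
```

A key that fails this check was most likely truncated or mangled when it was copied out of the email.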
Step 2 — Connect Slack
Create a Slack Incoming Webhook:
- 1. Go to api.slack.com/apps
- 2. Create New App → From scratch → name it "Tracegrid"
- 3. Incoming Webhooks → Activate → Add New Webhook
- 4. Select your #incidents channel
- 5. Copy the webhook URL
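Before handing the webhook URL to Tracegrid, you may want to confirm that it posts to the right channel. Slack Incoming Webhooks accept a JSON body with a text field sent via POST. The Python sketch below builds and sends such a message; the function names and the default message text are illustrative and not part of Tracegrid.

```python
import json
import urllib.request


def build_slack_test_request(webhook_url: str,
                             text: str = "Tracegrid webhook test") -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack Incoming Webhook."""
    if not webhook_url.startswith("https://hooks.slack.com/"):
        raise ValueError("expected a URL starting with https://hooks.slack.com/")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send(build_slack_test_request(url)) with your copied URL should make the test message appear in the channel you selected.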
Step 3 — Install the agent
Follow the Linux or Kubernetes installation guides.
How Tracegrid works
Tracegrid detects anomalies on your hosts and clusters, uses AI to identify the likely root cause, and delivers the explanation to Slack.
[Agent]
|
(Anomaly Data)
v
[Tracegrid Backend AI]
|
(Intelligence)
v
[Slack / Postmortems]
- 1. Agent installs: The agent deploys on your infrastructure as a systemd service or Kubernetes DaemonSet.
- 2. Agent collects: It gathers metrics, process states, and container events every 15 seconds, and ships anomalies as soon as they are detected.
- 3. Backend AI analysis: The AI correlates signals, identifies the root cause, and generates a plain-English explanation and fix steps.
- 4. Knowledge delivery: Slack receives the incident card. When the incident is resolved, a postmortem appears automatically in the thread.
Triggering your first incident
Trigger a test incident
After installing the agent, verify everything works by triggering a demo incident:
curl -X GET https://api.tracegrid.app/internal/demo-incident \
-H "X-Internal-Key: your_internal_key"
Check your Slack channel within 10 seconds.
Trigger a real incident (Linux)
Simulate high CPU:
# Install stress (Ubuntu)
sudo apt-get install stress -y
# Spike CPU for 60 seconds
stress --cpu 4 --timeout 60
Tracegrid will alert when CPU exceeds 90% for 30 seconds.
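How Tracegrid evaluates "exceeds 90% for 30 seconds" internally is not documented here. As an illustration only, a sustained-threshold check over the agent's 15-second samples could look like the Python sketch below. The function name sustained_breach and the (timestamp, value) sample format are assumptions.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float = 90.0,
                     hold_seconds: float = 30.0) -> bool:
    """Return True if the value stays above threshold for hold_seconds.

    samples: (timestamp_seconds, cpu_percent) pairs in chronological order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False
```

Under this reading, a short spike that covers only one or two samples does not alert, which is why the stress command above runs for a full 60 seconds.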
Trigger a real incident (Kubernetes)
Deploy a pod that crashes:
kubectl run crash-test --image=busybox --restart=Always \
-n default -- sh -c "exit 1"
Tracegrid detects CrashLoopBackOff after 3 restarts.
Clean up: kubectl delete pod crash-test -n default
Linux / VM / EC2 Installation
curl -sSL https://install.tracegrid.app | bash
The installer will ask for your API key, your backend URL, and a name for this host. It sets up a systemd service that restarts automatically on failure.
Kubernetes Installation
Deploy as a DaemonSet to monitor every node in your cluster.
curl -sSL https://raw.githubusercontent.com/yourrepo/main/agent/kubernetes/daemonset.yaml -o tracegrid.yaml
# Edit tracegrid.yaml: set your API key and cluster name
kubectl apply -f tracegrid.yaml
Use kubectl logs -n tracegrid -l app=tracegrid-agent to check status.
Environment variables
| Variable | Description |
|---|---|
| TRACEGRID_API_KEY | Your tenant API key (required) |
| TRACEGRID_BACKEND_URL | Backend URL: https://api.tracegrid.app |
| TRACEGRID_MODE | vm or kubernetes (default: vm) |
| TRACEGRID_HOSTNAME | Override hostname shown in alerts |
| TRACEGRID_LOG_LEVEL | debug, info, warn, error (default: info) |
| TRACEGRID_CLUSTER_NAME | Cluster name shown in K8s alerts |
| TRACEGRID_NAMESPACE | K8s namespace to watch (default: all) |
| HOST_PROC | Host /proc path in K8s (default: /host/proc) |
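To make the defaults and allowed values in the table concrete, the Python sketch below reads the variables from a mapping and applies them. It is not the agent's own code: the function name read_agent_env and the returned dictionary keys are hypothetical, and variables with no documented default are returned as None.

```python
import os
from typing import Mapping, Optional

VALID_MODES = {"vm", "kubernetes"}
VALID_LOG_LEVELS = {"debug", "info", "warn", "error"}


def read_agent_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("TRACEGRID_API_KEY")
    if not api_key:
        raise ValueError("TRACEGRID_API_KEY is required")
    mode = env.get("TRACEGRID_MODE", "vm")
    if mode not in VALID_MODES:
        raise ValueError(f"TRACEGRID_MODE must be one of {sorted(VALID_MODES)}")
    log_level = env.get("TRACEGRID_LOG_LEVEL", "info")
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"TRACEGRID_LOG_LEVEL must be one of {sorted(VALID_LOG_LEVELS)}")
    return {
        "api_key": api_key,
        "backend_url": env.get("TRACEGRID_BACKEND_URL"),
        "mode": mode,
        "hostname": env.get("TRACEGRID_HOSTNAME"),
        "log_level": log_level,
        "cluster_name": env.get("TRACEGRID_CLUSTER_NAME"),
        "namespace": env.get("TRACEGRID_NAMESPACE"),  # None means all namespaces
        "host_proc": env.get("HOST_PROC", "/host/proc"),
    }
```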
Agent config file
The agent reads from /etc/tracegrid/agent.yaml (Linux) or ./agent.yaml (local development). Environment variables override config file values.
Example config:
api_key: gw_your_key_here
backend_url: https://api.tracegrid.app
hostname: prod-web-01
collection_interval_seconds: 15
log_level: info
For Kubernetes, use the ConfigMap and Secret in daemonset.yaml instead of a config file.
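The precedence rule for the Linux config file (environment variables override file values) can be sketched in Python as follows. The mapping from variable names to config keys is inferred from the names in this document and is an assumption, as is the function name effective_config. The file values are passed in as an already-parsed dictionary.

```python
from typing import Mapping

# Assumed pairing of environment variables with agent.yaml keys.
ENV_TO_KEY = {
    "TRACEGRID_API_KEY": "api_key",
    "TRACEGRID_BACKEND_URL": "backend_url",
    "TRACEGRID_HOSTNAME": "hostname",
    "TRACEGRID_LOG_LEVEL": "log_level",
}


def effective_config(file_values: Mapping[str, object],
                     env: Mapping[str, str]) -> dict:
    """Start from the file values, then let set environment variables win."""
    merged = dict(file_values)
    for variable, key in ENV_TO_KEY.items():
        value = env.get(variable)
        if value:  # unset or empty variables leave the file value in place
            merged[key] = value
    return merged
```

With the example config above and TRACEGRID_LOG_LEVEL=debug exported, this sketch yields log_level debug while every other value still comes from the file.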
Alert thresholds
Default thresholds (Growth plan allows customization):
| Metric | Warning | Critical |
|---|---|---|
| CPU usage | > 90% | > 95% |
| Memory usage | > 85% | > 95% |
| Disk usage | > 85% | > 95% |
| K8s restarts | >= 3 | >= 5 |
| K8s pending | > 5 min | > 10 min |
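The table can be read as a simple classification rule. The Python sketch below encodes the default values, including the difference between the strict comparison (>) used for most metrics and the inclusive one (>=) used for restarts. The metric keys and the function name classify are illustrative, not Tracegrid identifiers.

```python
import operator

# metric: (comparison, warning threshold, critical threshold)
DEFAULT_THRESHOLDS = {
    "cpu_usage_percent": (operator.gt, 90, 95),
    "memory_usage_percent": (operator.gt, 85, 95),
    "disk_usage_percent": (operator.gt, 85, 95),
    "k8s_restarts": (operator.ge, 3, 5),
    "k8s_pending_minutes": (operator.gt, 5, 10),
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a single reading."""
    compare, warning, critical = DEFAULT_THRESHOLDS[metric]
    if compare(value, critical):  # check the stricter level first
        return "critical"
    if compare(value, warning):
        return "warning"
    return "ok"
```

For example, exactly 90% CPU is still ok because the rule is strictly greater than, while exactly 3 restarts already counts as a warning.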
Custom thresholds (Growth plan)
Coming in the next release. Once available, you will be able to configure thresholds via the Tracegrid dashboard or API.
Slack integration
Option 1 — Incoming Webhook (recommended)
- 1. Go to api.slack.com/apps
- 2. Create New App → From scratch
- 3. Name: Tracegrid, select your workspace
- 4. Incoming Webhooks → Activate → Add New Webhook to Workspace
- 5. Select your #incidents channel → Allow
- 6. Copy the webhook URL (starts with https://hooks.slack.com)
- 7. Provide this URL via API:
curl -X PATCH https://api.tracegrid.app/v1/tenants/YOUR_ID \
-H "X-Internal-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"slack_webhook_url": "https://hooks.slack.com/..."}'
Option 2 — Slack App (interactive buttons)
See the full Slack App setup guide in docs/slack-app-setup.md for Acknowledge, Escalate, and Dismiss button support.
AI provider setup
Tracegrid supports three AI providers. Set AI_PROVIDER in your backend environment to switch between them.
Groq (recommended — fastest, generous free tier)
- 1. Sign up at console.groq.com (free, no credit card)
- 2. Create API key → copy it
- 3. Set:
AI_PROVIDER=groq
GROQ_API_KEY=your_key
Model used: llama-3.3-70b-versatile
Google Gemini
- 1. Go to aistudio.google.com → Get API key
- 2. Set:
AI_PROVIDER=gemini
GEMINI_API_KEY=your_key
Model used: gemini-2.0-flash
Anthropic Claude
- 1. Sign up at console.anthropic.com → add credits
- 2. Set:
AI_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_key
Model used: claude-sonnet-4-20250514
Note: Anthropic requires paid credits; use Groq for free testing.
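A common mistake is setting AI_PROVIDER without the matching API key. The Python sketch below pairs each provider with its key variable and model as listed in this section, and checks an environment mapping before the backend starts. It is an illustration only; the function name check_ai_provider is hypothetical, and how the backend itself validates these variables is not documented here.

```python
from typing import Mapping, Tuple

# provider: (required key variable, model used), as listed in this section
PROVIDERS = {
    "groq": ("GROQ_API_KEY", "llama-3.3-70b-versatile"),
    "gemini": ("GEMINI_API_KEY", "gemini-2.0-flash"),
    "anthropic": ("ANTHROPIC_API_KEY", "claude-sonnet-4-20250514"),
}


def check_ai_provider(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (provider, model) or raise if the settings are inconsistent."""
    provider = env.get("AI_PROVIDER", "")
    if provider not in PROVIDERS:
        raise ValueError(f"AI_PROVIDER must be one of {sorted(PROVIDERS)}")
    key_variable, model = PROVIDERS[provider]
    if not env.get(key_variable):
        raise ValueError(f"{key_variable} must be set when AI_PROVIDER={provider}")
    return provider, model
```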