Quick start

Get Tracegrid running in under 5 minutes.

Prerequisites

A Slack workspace where you are an admin
A Linux server (Ubuntu 20.04+) or Kubernetes cluster
Your Tracegrid API key (from your account page)

Step 1 — Get your API key

After signing up, you receive an email with your API key. It looks like: gw_a1b2c3d4e5f6...

Your API key is shown once. Store it securely.
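
One way to store the key on a Linux host is a file that only your user can read. The sketch below is illustrative and not part of Tracegrid: the function name `store_api_key` and the target path are hypothetical, and the prefix check assumes every key starts with `gw_` as in the example above.

```shell
# Hypothetical helper: save an API key to a file readable only by its owner.
# Usage: store_api_key <key> <path>
store_api_key() {
  key=$1
  path=$2
  # Assumption: keys start with the gw_ prefix shown in the example above.
  case $key in
    gw_?*) ;;
    *) echo "key does not start with gw_" >&2; return 1 ;;
  esac
  # Create (or truncate) the file, restrict it to the owner, then write the key.
  ( umask 077 && : > "$path" ) || return 1
  chmod 600 "$path" || return 1
  printf '%s\n' "$key" > "$path"
}
```

For example, `store_api_key "gw_your_key_here" "$HOME/.tracegrid_api_key"` would write the key to a file with mode 600.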

Step 2 — Connect Slack

Create a Slack Incoming Webhook:

1. Go to api.slack.com/apps
2. Create New App → From scratch → name it "Tracegrid"
3. Incoming Webhooks → Activate → Add New Webhook
4. Select your #incidents channel
5. Copy the webhook URL
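
Before handing the webhook URL to Tracegrid, it may help to confirm that it works by posting a test message yourself. The sketch below uses Slack's standard Incoming Webhook request (a JSON body with a `text` field). The function names and the `SLACK_WEBHOOK_URL` variable are assumptions made for illustration.

```shell
# Build the JSON body for a plain-text Slack message.
# Only backslashes and double quotes are escaped; keep test messages on one line.
slack_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text": "%s"}' "$escaped"
}

# Post a test message to the webhook copied in step 5.
# Assumes SLACK_WEBHOOK_URL holds that URL.
send_slack_test() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

Running `send_slack_test "Tracegrid webhook test"` should post the message to the channel you selected; Slack replies with `ok` on success.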

Step 3 — Install the agent

Follow the installation guide for your platform: Linux / VM / EC2 Installation or Kubernetes Installation.

How Tracegrid works

Tracegrid collects signals from your infrastructure, uses AI to identify the root cause of incidents, and delivers the explanation to Slack.

[Infrastructure] -- (Metrics/Logs) --> [Tracegrid Agent]
                                   |
                            (Anomaly Data)
                                   v
                        [Tracegrid Backend AI]
                                   |
                            (Intelligence)
                                   v
                       [Slack / Postmortems]

1. Agent installs: The agent deploys on your infrastructure as a systemd service or Kubernetes DaemonSet.
2. Agent collects: The agent gathers metrics, process states, and container events every 15 seconds and ships anomalies as soon as it detects them.
3. Backend AI analysis: The AI correlates signals, identifies the root cause, and generates a plain-English explanation and fix steps.
4. Knowledge delivery: Slack receives the incident card. When the incident is resolved, a postmortem appears automatically in the thread.

Triggering your first incident

Trigger a test incident

After installing the agent, verify everything works by triggering a demo incident:

curl -X GET https://api.tracegrid.app/internal/demo-incident \
  -H "X-Internal-Key: your_internal_key"

The incident card should appear in your Slack channel within 10 seconds.

Trigger a real incident (Linux)

Simulate high CPU:

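# Install stress (Ubuntu)
sudo apt-get update && sudo apt-get install -y stress
# Spike every CPU core for 60 seconds (one worker per core, so usage passes the threshold on any machine size)
stress --cpu "$(nproc)" --timeout 60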

Tracegrid will alert when CPU exceeds 90% for 30 seconds.
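
A rule like "above 90% for 30 seconds" can be read as a run of consecutive samples that all exceed the threshold. The sketch below illustrates that reading only. It is not the agent's actual logic, the function name is hypothetical, and this page does not say how many samples the agent treats as 30 seconds.

```shell
# Hypothetical check: succeed if at least <count> consecutive samples
# are strictly above <threshold>. Samples are integer percentages.
# Usage: sustained_above <threshold> <count> <sample>...
sustained_above() {
  threshold=$1
  needed=$2
  shift 2
  run=0
  for sample in "$@"; do
    if [ "$sample" -gt "$threshold" ]; then
      run=$((run + 1))
      if [ "$run" -ge "$needed" ]; then
        return 0
      fi
    else
      # A sample at or below the threshold breaks the run.
      run=0
    fi
  done
  return 1
}
```

With these definitions, `sustained_above 90 2 50 95 96 40` succeeds because two samples in a row exceed 90, while `sustained_above 90 2 95 50 96 40` fails because the run is interrupted.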

Trigger a real incident (Kubernetes)

Deploy a pod that crashes:

kubectl run crash-test --image=busybox --restart=Always \
  -n default -- sh -c "exit 1"

Tracegrid detects CrashLoopBackOff after 3 restarts.

Clean up: kubectl delete pod crash-test -n default
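
To confirm that the pod has reached three restarts before looking for the alert, you can poll its restart count with `kubectl`. This is a rough sketch: the function names and the `POLL_SECONDS` variable are made up for this example, and it assumes the pod has a single container.

```shell
# Print the restart count of the first container in a pod.
# Usage: pod_restarts <pod> <namespace>
pod_restarts() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# Poll until the pod has restarted at least <min> times, or give up
# after <tries> polls. POLL_SECONDS (default 5) is the delay between polls.
# Usage: wait_for_restarts <pod> <namespace> <min> <tries>
wait_for_restarts() {
  tries=$4
  while [ "$tries" -gt 0 ]; do
    count=$(pod_restarts "$1" "$2") || count=0
    if [ "${count:-0}" -ge "$3" ]; then
      echo "restarts: $count"
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-5}"
  done
  return 1
}
```

For the test pod above, `wait_for_restarts crash-test default 3 60` polls for up to about five minutes. Run it before the clean-up command.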

Linux / VM / EC2 Installation

curl -sSL https://install.tracegrid.app | bash

The installer will ask for your API key, your backend URL, and a name for this host. It sets up a systemd service that restarts automatically on failure.

Kubernetes Installation

Deploy as a DaemonSet to monitor every node in your cluster.

curl -sSL https://raw.githubusercontent.com/yourrepo/main/agent/kubernetes/daemonset.yaml -o tracegrid.yaml
# Edit tracegrid.yaml: set your API key and cluster name
kubectl apply -f tracegrid.yaml

Use kubectl logs -n tracegrid -l app=tracegrid-agent to check status.
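
Because a DaemonSet is meant to place one agent on every node, a quick sanity check is to compare the number of running agent pods with the number of nodes. The sketch below reuses the `tracegrid` namespace and `app=tracegrid-agent` label from the command above. The function names are hypothetical, and the counts can legitimately differ if some nodes carry taints the DaemonSet does not tolerate.

```shell
# Count non-empty lines on stdin (prints 0 for empty input).
count_lines() { grep -c . || true; }

# Compare running agent pods against cluster nodes.
# Succeeds when there is at least one running agent per node.
agent_coverage() {
  nodes=$(kubectl get nodes --no-headers | count_lines)
  agents=$(kubectl get pods -n tracegrid -l app=tracegrid-agent \
    --field-selector=status.phase=Running --no-headers | count_lines)
  echo "agents running: $agents / nodes: $nodes"
  [ "$agents" -ge "$nodes" ]
}
```

Calling `agent_coverage` prints both counts and returns a non-zero status when fewer agents than nodes are running.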

Environment variables

Variable	Description
`TRACEGRID_API_KEY`	Your tenant API key (required)
`TRACEGRID_BACKEND_URL`	Backend URL — https://api.tracegrid.app
`TRACEGRID_MODE`	vm or kubernetes (default: vm)
`TRACEGRID_HOSTNAME`	Override hostname shown in alerts
`TRACEGRID_LOG_LEVEL`	debug, info, warn, error (default: info)
`TRACEGRID_CLUSTER_NAME`	Cluster name shown in K8s alerts
`TRACEGRID_NAMESPACE`	K8s namespace to watch (default: all)
`HOST_PROC`	Host /proc path in K8s (default: /host/proc)

Agent config file

The agent reads from /etc/tracegrid/agent.yaml (Linux) or ./agent.yaml (local development). Environment variables override config file values.

Example config:

api_key: gw_your_key_here
backend_url: https://api.tracegrid.app
hostname: prod-web-01
collection_interval_seconds: 15
log_level: info

For Kubernetes, use the ConfigMap and Secret in daemonset.yaml instead of a config file.
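
The precedence rule above (environment variables override config file values) can be illustrated with a small lookup function. This is a sketch of the rule, not the agent's implementation: `resolve_setting` is a hypothetical name, it only handles flat `key: value` lines, and pairing `TRACEGRID_LOG_LEVEL` with `log_level` is an assumption based on the names used on this page.

```shell
# Hypothetical lookup: print a setting, preferring an exported environment
# variable over the matching key in a flat "key: value" YAML file.
# Usage: resolve_setting <ENV_VAR_NAME> <yaml_key> <config_file>
resolve_setting() {
  env_value=$(printenv "$1") || env_value=""
  if [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
    return 0
  fi
  # Fall back to the first matching top-level key in the file.
  sed -n "s/^$2:[[:space:]]*//p" "$3" | head -n 1
}
```

For example, `resolve_setting TRACEGRID_LOG_LEVEL log_level /etc/tracegrid/agent.yaml` prints the file's value unless `TRACEGRID_LOG_LEVEL` is exported, in which case the exported value wins.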

Alert thresholds

Default thresholds (the Growth plan will allow customization):

Metric	Warning	Critical
CPU usage	> 90%	> 95%
Memory usage	> 85%	> 95%
Disk usage	> 85%	> 95%
K8s restarts	>= 3	>= 5
K8s pending	> 5 min	> 10 min
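
To make the percentage rows of the table concrete, the sketch below classifies a usage value against the default warning and critical thresholds. It is only an illustration of how to read the table: the function name `classify_usage` is hypothetical, and it assumes integer percentages and strict "greater than" comparisons as written above.

```shell
# Classify an integer usage percentage against the default thresholds.
# Usage: classify_usage <cpu|memory|disk> <percent>
classify_usage() {
  case $1 in
    cpu) warn=90 ;;
    memory|disk) warn=85 ;;
    *) echo "unknown metric: $1" >&2; return 2 ;;
  esac
  crit=95   # the critical threshold is the same for all three metrics
  if [ "$2" -gt "$crit" ]; then
    echo critical
  elif [ "$2" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}
```

Under these defaults, `classify_usage cpu 91` prints `warning`, while `classify_usage cpu 90` prints `ok` because the threshold must be exceeded, not just reached.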

Custom thresholds (Growth plan)

Custom thresholds are coming in the next release. Once available, you will configure them via the Tracegrid dashboard or API.

Slack integration

Option 1 — Incoming Webhook (recommended)

1. Go to api.slack.com/apps
2. Create New App → From scratch
3. Name: Tracegrid, select your workspace
4. Incoming Webhooks → Activate → Add New Webhook to Workspace
5. Select your #incidents channel → Allow
6. Copy the webhook URL (starts with https://hooks.slack.com)
7. Register this URL with Tracegrid through the API:

curl -X PATCH https://api.tracegrid.app/v1/tenants/YOUR_ID \
  -H "X-Internal-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"slack_webhook_url": "https://hooks.slack.com/..."}'

Option 2 — Slack App (interactive buttons)

See the full Slack App setup guide in docs/slack-app-setup.md for Acknowledge, Escalate, and Dismiss button support.

AI provider setup

Tracegrid supports three AI providers. Set AI_PROVIDER in your backend environment to switch between them.

Groq (recommended — fastest, generous free tier)

1. Sign up at console.groq.com (free, no credit card)
2. Create API key → copy it
3. Set: AI_PROVIDER=groq, GROQ_API_KEY=your_key

Model used: llama-3.3-70b-versatile

Google Gemini

1. Go to aistudio.google.com → Get API key
2. Set: AI_PROVIDER=gemini, GEMINI_API_KEY=your_key

Model used: gemini-2.0-flash

Anthropic Claude

1. Sign up at console.anthropic.com → add credits
2. Set: AI_PROVIDER=anthropic, ANTHROPIC_API_KEY=your_key

Model used: claude-sonnet-4-20250514

Note: Anthropic requires paid credits. Use Groq for free testing.
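
Since each provider needs its own key variable, a small pre-flight check can catch a mismatch before the backend starts. The sketch below is illustrative only: `check_ai_provider` is a hypothetical helper, and it relies solely on the variable names listed in this section.

```shell
# Hypothetical pre-flight check: confirm that the API key variable matching
# AI_PROVIDER is set, and print which variable was checked.
check_ai_provider() {
  case ${AI_PROVIDER:-} in
    groq) key_var=GROQ_API_KEY ;;
    gemini) key_var=GEMINI_API_KEY ;;
    anthropic) key_var=ANTHROPIC_API_KEY ;;
    *) echo "AI_PROVIDER must be groq, gemini, or anthropic" >&2; return 2 ;;
  esac
  # key_var is one of the three fixed names above, so eval is safe here.
  eval "key_value=\${$key_var:-}"
  if [ -z "$key_value" ]; then
    echo "$key_var is not set" >&2
    return 1
  fi
  echo "$key_var is set"
}
```

Run `check_ai_provider` in the same environment as the backend. A non-zero exit status means the provider name or its key is missing.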