OpenClaw Observability with the Grafana Stack
I’ve been running OpenClaw as my AI assistant for a couple months now. It handles Discord, Telegram, WhatsApp, cron jobs, sub-agents — a lot of moving parts. When something is slow or a message gets dropped, I want to know why without digging through log files.
OpenClaw has a built-in diagnostics-otel plugin that exports traces, metrics, and logs over OTLP. If you already have a Grafana stack running (like I do from my Claude Code observability setup), adding OpenClaw takes about five minutes.
What you get
The plugin emits 26 metrics and trace spans covering the full message lifecycle:
Model usage: token counts (input, output, cache read/write), cost in USD, run duration, context window usage — all labeled by channel, provider, and model.
Message flow: webhook received/processed/error counts, message queued/processed with duration histograms, outcome tracking.
Queue and session health: lane enqueue/dequeue with depth and wait time, session state transitions, stuck session detection with age histograms.
Every model call generates a trace span (`openclaw.model.usage`) with token breakdowns, session IDs, and timing. Webhook processing and message handling get their own spans too. Enough to trace a Discord message from ingestion to reply.
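Those spans are queryable directly in Tempo. As a sketch (the span name comes from the plugin; the 5-second threshold is just an example), a TraceQL search for slow model calls looks like:

```
{ name = "openclaw.model.usage" && duration > 5s }
```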
Prerequisites
You need an OTLP-compatible collector. I use Grafana Alloy, but any OpenTelemetry Collector works. My setup fans out to:
- Mimir — metrics (via Prometheus remote write)
- Tempo — traces
- Loki — logs
If you just want to see traces quickly, Jaeger works too:
```bash
docker run -d --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2 \
  --set receivers.otlp.protocols.http.endpoint=0.0.0.0:4318
```
Step 1: Enable the plugin
Add this to your ~/.openclaw/openclaw.json:
```json
{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": {
        "enabled": true
      }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://your-alloy-host:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1.0,
      "flushIntervalMs": 10000
    }
  }
}
```
Or use the CLI:
```bash
openclaw plugins enable diagnostics-otel
```
Then restart:
```bash
openclaw gateway restart
```
Step 2: Check for environment variable conflicts
This one bit me. If you have OTEL_EXPORTER_OTLP_ENDPOINT or OTEL_SERVICE_NAME set in your shell environment or systemd service file, the OpenTelemetry SDK will use those instead of the config file values.
I had these set for Claude Code telemetry in my ~/.bashrc:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://alloy:4318"
export OTEL_SERVICE_NAME="pi-coding-agent"
```
The systemd service inherited them. OpenClaw’s config said serviceName: "openclaw-gateway" but traces showed up as pi-coding-agent — or didn’t show up at all because the endpoint was wrong.
Check what your gateway process actually sees:
```bash
cat /proc/$(pgrep -f "openclaw.*gateway")/environ | tr '\0' '\n' | grep OTEL
```
If you’re running OpenClaw via systemd, set the correct values explicitly in the service file:
```ini
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://your-alloy-host:4318
Environment=OTEL_SERVICE_NAME=openclaw-gateway
```
Then `systemctl --user daemon-reload && systemctl --user restart openclaw-gateway`.
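If you'd rather not edit the unit file directly, a systemd drop-in does the same thing. A sketch, assuming the user unit is named openclaw-gateway as above:

```ini
# ~/.config/systemd/user/openclaw-gateway.service.d/otel.conf
[Service]
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://your-alloy-host:4318
Environment=OTEL_SERVICE_NAME=openclaw-gateway
```

Create the file manually and run `systemctl --user daemon-reload`, or let `systemctl --user edit openclaw-gateway` manage an override for you.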
Step 3: Verify data is flowing
Check traces in Tempo:
```bash
curl -s "http://your-tempo:3200/api/search/tag/service.name/values" \
  | jq '.tagValues'
```
You should see openclaw-gateway in the list.
Check metrics in Mimir:
```bash
curl -s "http://your-mimir:9009/prometheus/api/v1/label/__name__/values" \
  | jq '[.data[] | select(startswith("openclaw"))]'
```
You should get ~26 metric names back.
Step 4: Build a dashboard (or use mine)
I built a Grafana dashboard that covers the key panels:
Top row stats: total tokens, cost (USD), messages processed, average run duration.
Time series: token usage over time by type, cost by model, messages by channel, message processing latency (p50/p95/p99).
Performance: run duration percentiles, context window usage.
Infrastructure: queue depth, queue wait time, session state changes.
Traces: recent traces table from Tempo with links to full trace views.
The dashboard includes channel and model dropdown filters so you can zoom into Discord-only or Opus-only traffic.
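If you'd rather build your own panels, the queries behind mine reduce to standard PromQL over the metrics listed in the reference at the end of this post. A few sketches (the 5m/1h windows are arbitrary choices):

```promql
# Cost by model over the last hour
sum by (openclaw_model) (increase(openclaw_cost_usd_total[1h]))

# Token throughput by type (input, output, cache read/write)
sum by (openclaw_token) (rate(openclaw_tokens_total[5m]))

# p95 message processing latency per channel
histogram_quantile(0.95,
  sum by (le, openclaw_channel) (rate(openclaw_message_duration_ms_milliseconds_bucket[5m])))
```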
Grab the JSON and import it: openclaw-dashboard.json (Gist)
In Grafana: Dashboards → Import → paste the JSON. Map the Prometheus datasource to your Mimir and the Tempo datasource to your Tempo instance.
Alloy config (for reference)
If you’re using Grafana Alloy as your OTLP collector, here’s the relevant config that receives from OpenClaw and fans out to the Grafana stack:
```
otelcol.receiver.otlp "default" {
  http { endpoint = "0.0.0.0:4318" }
  grpc { endpoint = "0.0.0.0:4317" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 1024

  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint { url = "http://mimir:9009/api/v1/push" }
}

otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint { url = "http://loki:3100/loki/api/v1/push" }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo:4318"
    tls { insecure = true }
  }
}
```
The full picture
With this in place, I have one Grafana stack covering all my AI agent activity:
| Service | What it is |
|---|---|
| claude-code | Claude Code sessions (coding) |
| pi-coding-agent | Pi agent sessions (coding) |
| openclaw-gateway | OpenClaw itself (orchestration, messaging, cron) |
Claude Code and Pi report what happens inside coding sessions. OpenClaw reports the orchestration layer — which messages came in, how they were routed, how long the model took, what it cost.
Combined, that’s full visibility from “user sends Discord message” through “OpenClaw spawns a coding agent” to “agent completes and replies.” All in the same Tempo traces, same Mimir metrics, same Grafana dashboards.
Metrics reference
For dashboard builders, here’s the full list:
| Metric | Type | Labels |
|---|---|---|
| openclaw_tokens_total | counter | openclaw_token, openclaw_channel, openclaw_provider, openclaw_model |
| openclaw_cost_usd_total | counter | openclaw_channel, openclaw_provider, openclaw_model |
| openclaw_run_duration_ms_milliseconds | histogram | openclaw_channel, openclaw_provider, openclaw_model |
| openclaw_context_tokens | histogram | openclaw_context, openclaw_channel, openclaw_provider, openclaw_model |
| openclaw_message_processed_total | counter | openclaw_channel, openclaw_outcome |
| openclaw_message_queued_total | counter | openclaw_channel, openclaw_source |
| openclaw_message_duration_ms_milliseconds | histogram | openclaw_channel, openclaw_outcome |
| openclaw_webhook_received_total | counter | openclaw_channel, openclaw_webhook |
| openclaw_webhook_duration_ms_milliseconds | histogram | openclaw_channel, openclaw_webhook |
| openclaw_queue_depth | histogram | openclaw_lane |
| openclaw_queue_wait_ms_milliseconds | histogram | openclaw_lane |
| openclaw_session_state_total | counter | openclaw_state, openclaw_reason |
| openclaw_session_stuck_total | counter | openclaw_state |
The `_milliseconds` suffix is an artifact of Alloy's Prometheus conversion — the OTel metric names use `_ms` but the Prometheus exporter appends the unit.
What’s next
I want to add alerting — notify me when session stuck count spikes or when daily cost exceeds a threshold. Grafana Alerting can do this directly from the Mimir metrics. That’s a post for another day.
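For anyone who wants to wire that up sooner, both conditions reduce to one-line PromQL expressions over the same metrics (sketches; the thresholds are placeholders to tune):

```promql
# Any session flagged as stuck in the last 15 minutes
increase(openclaw_session_stuck_total[15m]) > 0

# Total spend over the last 24 hours above $10
sum(increase(openclaw_cost_usd_total[24h])) > 10
```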
The dashboard JSON is open and editable. If you build something better, I’d love to see it.