
OpenClaw Observability with the Grafana Stack

I’ve been running OpenClaw as my AI assistant for a couple of months now. It handles Discord, Telegram, WhatsApp, cron jobs, sub-agents — a lot of moving parts. When something is slow or a message gets dropped, I want to know why without digging through log files.

OpenClaw has a built-in diagnostics-otel plugin that exports traces, metrics, and logs over OTLP. If you already have a Grafana stack running (like I do from my Claude Code observability setup), adding OpenClaw takes about five minutes.

What you get

The plugin emits 26 metrics and trace spans covering the full message lifecycle:

Model usage: token counts (input, output, cache read/write), cost in USD, run duration, context window usage — all labeled by channel, provider, and model.

Message flow: webhook received/processed/error counts, message queued/processed with duration histograms, outcome tracking.

Queue and session health: lane enqueue/dequeue with depth and wait time, session state transitions, stuck session detection with age histograms.

Every model call generates a trace span (openclaw.model.usage) with token breakdowns, session IDs, and timing. Webhook processing and message handling get their own spans too. Enough to trace a Discord message from ingestion to reply.
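Once those spans land in Tempo, you can slice them with TraceQL in Grafana’s Explore view. A sketch, using the span name above and a made-up 10-second threshold:

```
{ name = "openclaw.model.usage" && duration > 10s }
```

Swap in a different span name to chase latency at the webhook or message-handling layer instead.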

Prerequisites

You need an OTLP-compatible collector. I use Grafana Alloy, but any OpenTelemetry Collector works. My setup fans out to:

  • Mimir — metrics (via Prometheus remote write)
  • Tempo — traces
  • Loki — logs

If you just want to see traces quickly, Jaeger works too:

docker run -d --name jaeger \
  -p 16686:16686 -p 4318:4318 \
  jaegertracing/jaeger:2 \
  --set receivers.otlp.protocols.http.endpoint=0.0.0.0:4318

Step 1: Enable the plugin

Add this to your ~/.openclaw/openclaw.json:

{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": {
        "enabled": true
      }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://your-alloy-host:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1.0,
      "flushIntervalMs": 10000
    }
  }
}

Or use the CLI:

openclaw plugins enable diagnostics-otel

Then restart:

openclaw gateway restart

Step 2: Check for environment variable conflicts

This one bit me. If you have OTEL_EXPORTER_OTLP_ENDPOINT or OTEL_SERVICE_NAME set in your shell environment or systemd service file, the OpenTelemetry SDK will use those instead of the config file values.

I had these set for Claude Code telemetry in my ~/.bashrc:

export OTEL_EXPORTER_OTLP_ENDPOINT="http://alloy:4318"
export OTEL_SERVICE_NAME="pi-coding-agent"

The systemd service inherited them. OpenClaw’s config said serviceName: "openclaw-gateway" but traces showed up as pi-coding-agent — or didn’t show up at all because the endpoint was wrong.

Check what your gateway process actually sees:

cat /proc/$(pgrep -fn "openclaw.*gateway")/environ | tr '\0' '\n' | grep OTEL

If you’re running OpenClaw via systemd, set the correct values explicitly in the service file:

Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://your-alloy-host:4318
Environment=OTEL_SERVICE_NAME=openclaw-gateway

Then systemctl --user daemon-reload && systemctl --user restart openclaw-gateway.
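Equivalently, a drop-in override keeps the main unit file untouched (a sketch; the unit name matches my setup):

```
# ~/.config/systemd/user/openclaw-gateway.service.d/override.conf
[Service]
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://your-alloy-host:4318
Environment=OTEL_SERVICE_NAME=openclaw-gateway
```

Running systemctl --user edit openclaw-gateway creates and opens that file for you.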

Step 3: Verify data is flowing

Check traces in Tempo:

curl -s "http://your-tempo:3200/api/search/tag/service.name/values" \
  | jq '.tagValues'

You should see openclaw-gateway in the list.

Check metrics in Mimir:

curl -s "http://your-mimir:9009/prometheus/api/v1/label/__name__/values" \
  | jq '[.data[] | select(startswith("openclaw"))]'

You should get ~26 metric names back.

Step 4: Build a dashboard (or use mine)

I built a Grafana dashboard that covers the key panels:

Top row stats: total tokens, cost (USD), messages processed, average run duration.

Time series: token usage over time by type, cost by model, messages by channel, message processing latency (p50/p95/p99).

Performance: run duration percentiles, context window usage.

Infrastructure: queue depth, queue wait time, session state changes.

Traces: recent traces table from Tempo with links to full trace views.

The dashboard includes channel and model dropdown filters so you can zoom into Discord-only or Opus-only traffic.

Grab the JSON and import it: openclaw-dashboard.json (Gist)

In Grafana: Dashboards → Import → paste the JSON. Map the Prometheus datasource to your Mimir and the Tempo datasource to your Tempo instance.
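If you’d rather build panels from scratch, the queries are plain PromQL. Two sketches (metric and label names per the reference at the end of this post; the 5m window is arbitrary):

```
# Token throughput by model
sum by (openclaw_model) (rate(openclaw_tokens_total[5m]))

# p95 message processing latency per channel
histogram_quantile(0.95,
  sum by (openclaw_channel, le) (rate(openclaw_message_duration_ms_milliseconds_bucket[5m])))
```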

Alloy config (for reference)

If you’re using Grafana Alloy as your OTLP collector, here’s the relevant config that receives from OpenClaw and fans out to the Grafana stack:

otelcol.receiver.otlp "default" {
  http { endpoint = "0.0.0.0:4318" }
  grpc { endpoint = "0.0.0.0:4317" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 1024

  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint { url = "http://mimir:9009/api/v1/push" }
}

otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint { url = "http://loki:3100/loki/api/v1/push" }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo:4318"
    tls { insecure = true }
  }
}

The full picture

With this in place, I have one Grafana stack covering all my AI agent activity:

Service            What it is
-----------------  ------------------------------------------------
claude-code        Claude Code sessions (coding)
pi-coding-agent    Pi agent sessions (coding)
openclaw-gateway   OpenClaw itself (orchestration, messaging, cron)

Claude Code and Pi report what happens inside coding sessions. OpenClaw reports the orchestration layer — which messages came in, how they were routed, how long the model took, what it cost.

Combined, that’s full visibility from “user sends Discord message” through “OpenClaw spawns a coding agent” to “agent completes and replies.” All in the same Tempo traces, same Mimir metrics, same Grafana dashboards.

Metrics reference

For dashboard builders, here’s the full list of metric families:

Metric                                     Type       Labels
-----------------------------------------  ---------  ----------------------------------------------------------------------
openclaw_tokens_total                      counter    openclaw_token, openclaw_channel, openclaw_provider, openclaw_model
openclaw_cost_usd_total                    counter    openclaw_channel, openclaw_provider, openclaw_model
openclaw_run_duration_ms_milliseconds      histogram  openclaw_channel, openclaw_provider, openclaw_model
openclaw_context_tokens                    histogram  openclaw_context, openclaw_channel, openclaw_provider, openclaw_model
openclaw_message_processed_total           counter    openclaw_channel, openclaw_outcome
openclaw_message_queued_total              counter    openclaw_channel, openclaw_source
openclaw_message_duration_ms_milliseconds  histogram  openclaw_channel, openclaw_outcome
openclaw_webhook_received_total            counter    openclaw_channel, openclaw_webhook
openclaw_webhook_duration_ms_milliseconds  histogram  openclaw_channel, openclaw_webhook
openclaw_queue_depth                       histogram  openclaw_lane
openclaw_queue_wait_ms_milliseconds        histogram  openclaw_lane
openclaw_session_state_total               counter    openclaw_state, openclaw_reason
openclaw_session_stuck_total               counter    openclaw_state

The _milliseconds suffix is an artifact of Alloy’s Prometheus conversion — the OTEL metric names use _ms but the Prometheus exporter appends the unit.
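Concretely, each histogram fans out into three Prometheus series, which is also why the name count in Prometheus is roughly double the number of rows in the table above:

```
openclaw_run_duration_ms                           (OTEL name)
→ openclaw_run_duration_ms_milliseconds_bucket     (Prometheus)
→ openclaw_run_duration_ms_milliseconds_sum
→ openclaw_run_duration_ms_milliseconds_count
```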

What’s next

I want to add alerting — notify me when session stuck count spikes or when daily cost exceeds a threshold. Grafana Alerting can do this directly from the Mimir metrics. That’s a post for another day.
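Both of those alerts reduce to one-line PromQL conditions over the metrics above (sketches; the thresholds are made up):

```
# Stuck sessions detected in the last 10 minutes
increase(openclaw_session_stuck_total[10m]) > 0

# Rolling 24h spend over $5
sum(increase(openclaw_cost_usd_total[24h])) > 5
```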

The dashboard JSON is open and editable. If you build something better, I’d love to see it.