How I Set Up Observability for Claude Code with the Grafana Stack

March 9, 2026 · 5 min read · claude-codeobservabilitygrafanaopentelemetrysre

AI coding assistants are becoming standard engineering tools. But here’s a question most teams aren’t asking yet: how much are they actually costing us, and are they making us faster?

I spent a morning wiring up Claude Code’s built-in OpenTelemetry support to our existing Grafana stack (Alloy, Mimir, Loki). No new infrastructure. No vendor lock-in. Just environment variables and a small collector config.

Here’s what I learned.

Claude Code Already Speaks OpenTelemetry

This surprised me. Claude Code ships with native OTEL support — no SDK wrappers, no sidecar, no code changes. You set a few environment variables and it starts emitting metrics and log events.

The metrics are genuinely useful:

Token usage per model (input, output, cache read, cache creation)
Cost in USD per session and model
Lines of code added and removed
Commits and PRs created
Tool accept/reject decisions — how often engineers trust the AI’s suggestions
Active time — actual usage vs. idle

The log events capture per-request details: latency, model used, token counts, API errors, and tool execution results.

The Architecture

If you run a Grafana stack, you already have everything you need.

Claude Code (developer laptop)
        │
        │  OTLP/HTTP
        ▼
   Grafana Alloy  (lightweight collector)
        │
        ├── metrics → Mimir
        ├── logs    → Loki
        └── traces  → Tempo

Alloy acts as the OTLP receiver and fans out each signal type to the right backend. About 300 MHz CPU and 256 MB RAM — negligible if you already run it for other workloads.

Developer Setup

On the developer’s machine, add this to ~/.claude/settings.json:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://your-alloy-host:4318",
    "OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE": "cumulative",
    "OTEL_RESOURCE_ATTRIBUTES": "team.id=your-team,user.name=your-name"
  }
}

That last line about temporality preference? I wasted 30 minutes debugging why metrics weren’t showing up in Mimir. Claude Code defaults to delta temporality. The Prometheus remote write exporter in Alloy silently drops delta sums. No errors, no warnings at info level. You only see dropped unsupported delta sum if you crank Alloy to debug logging. Set it to cumulative and everything flows.

The Alloy Pipeline

A minimal River configuration receives OTLP, enriches with a source label, and routes signals:

// OTLP Receiver
otelcol.receiver.otlp "claude_code" {
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.attributes.claude_code.input]
    logs    = [otelcol.processor.attributes.claude_code.input]
  }
}

// Enrich with source label
otelcol.processor.attributes "claude_code" {
  action {
    key    = "source"
    value  = "claude-code"
    action = "upsert"
  }
  output {
    metrics = [otelcol.processor.filter.claude_code.input]
    logs    = [otelcol.exporter.loki.claude_code.input]
  }
}

// Only forward claude_code.* metrics
otelcol.processor.filter "claude_code" {
  metrics {
    include {
      match_type = "regexp"
      metric_names = ["claude_code\\..*"]
    }
  }
  output {
    metrics = [otelcol.exporter.prometheus.claude_code.input]
  }
}

// Mimir via Prometheus remote write
otelcol.exporter.prometheus "claude_code" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

// Loki
otelcol.exporter.loki "claude_code" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

If you’re on Mimir 2.16+ or Grafana Cloud, you can skip the Prometheus conversion and use otelcol.exporter.otlphttp for metrics too.

Alternative: OTel Collector YAML

If you already run a standalone OTel Collector instead of Alloy:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  attributes/claude:
    actions:
      - key: source
        value: claude-code
        action: upsert
  filter/claude_metrics:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - "claude_code\\..*"

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/claude, filter/claude_metrics]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [attributes/claude]
      exporters: [loki]

What You Can Actually See

Once data flows, here’s what the dashboard looks like in practice.

The overview row gives you the headline numbers — total cost, active sessions, token consumption, and cost broken down by model:

Claude Code Grafana Dashboard — Overview

Drill into session activity and cache efficiency. The active time panel shows CLI vs. user think time, and you can see token usage per individual session:

Claude Code Grafana Dashboard — Sessions and Token Deep Dive

The cost analysis section is where it gets interesting for managers — cost per session, cost per model, cost efficiency ($/1K tokens), and a session table sorted by spend:

Claude Code Grafana Dashboard — Cost Analysis and Environment

With these panels in place, you get answers to questions that matter:

Cost governance: Total USD spent on Claude Code per team, per engineer, per week. Set alerts before someone accidentally runs a loop that burns $50/hour.

Productivity signal: Are the teams using AI coding tools shipping more code? Correlate claude_code_lines_of_code_count and claude_code_commit_count with your deployment frequency.

Quality signal: The tool accept/reject metric (claude_code_code_edit_tool_decision) tells you something interesting. High rejection rates might mean the tool is struggling with a particular codebase, or an engineer needs guidance on how to prompt effectively.

API reliability: Track claude_code_api_error events. If Anthropic’s API has a bad day, you’ll see it in your dashboards before your engineers start complaining in Slack.

Alerting

Once the data is in Mimir and Loki, standard Grafana alerting rules work:

groups:
  - name: claude-code
    rules:
      - alert: HighCostSpike
        expr: sum(increase(claude_code_cost_usage_total[1h])) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Claude Code cost spike: >$50/hour"

      - alert: HighTokenBurn
        expr: sum(rate(claude_code_token_usage_total[5m])) > 100000
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Sustained high token usage (>100k tokens/5min)"

The cost alert is the important one. AI API costs can spike fast when agents get stuck in loops. You want to know about it before the invoice arrives.

A Gotcha Worth Knowing

Claude Code’s -p flag (one-shot prompt mode) runs too briefly for the OTEL exporter to flush. Telemetry only works reliably in interactive sessions or longer-running processes. If you’re testing and see no data, that’s probably why.

Tuning Variables

A few env vars worth knowing:

Variable	Default	Notes
`OTEL_METRIC_EXPORT_INTERVAL`	60000	Lower to 10000 for debugging
`OTEL_LOG_USER_PROMPTS`	disabled	Keep disabled — privacy
`OTEL_LOG_TOOL_DETAILS`	disabled	Enable for debugging
`OTEL_METRICS_INCLUDE_SESSION_ID`	true	Set false in prod to reduce cardinality

The Bigger Picture

This is table stakes for any organization rolling out AI coding tools. You wouldn’t deploy a microservice without metrics and logging. Why would you deploy an AI assistant that consumes API tokens at scale without the same visibility?

The data also feeds into something I’ve been thinking about: an “Agent Readiness” score for repositories. Repos with better documentation, tests, and structure produce better AI output. Now you can actually measure it — correlate readiness scores with agent task completion rates and cost efficiency.

We already measure engineering with DORA metrics. AI agent observability is the next layer.