Personal blog powered by a passion for technology.

Tracing the Thoughts of a Language Model: What Anthropic Found Inside Claude

05.03.2026

Anthropic just published something remarkable — they built an “AI microscope” that traces what actually happens inside Claude’s neural network during inference. Not what the model says it’s doing, but what it’s actually doing. The results are fascinating and sometimes unsettling.

Here are the key findings from their research on Claude 3.5 Haiku.

Universal Language of Thought

Claude doesn’t have separate “French Claude” or “Chinese Claude” running in parallel. The same core concepts activate across languages — “smallness” and “oppositeness” fire the same internal features regardless of language, and the output gets translated at the end. This shared circuitry increases with model scale.

This means Claude can genuinely learn something in one language and apply that knowledge when speaking another. It’s not translation — it’s a shared conceptual space.

Poetry: Planning, Not Guessing

The researchers expected to find word-by-word generation when Claude writes rhyming poetry. Instead, they discovered Claude plans rhymes before writing the line. Given “He saw a carrot and had to grab it,” Claude thinks of “rabbit” first, then constructs the second line to land there.

When they surgically removed the “rabbit” concept from Claude’s internal state, it pivoted to “habit.” When they injected “green,” it wrote a line ending in “green.” This is genuine planning — powerful evidence that even though models are trained to output one word at a time, they think on much longer horizons.

Mental Math: The Model Lies About Its Own Process

Claude uses parallel computational paths for addition — one for rough approximation, another for precisely determining the last digit. These paths interact to produce the final answer.

But here’s the kicker: when asked how it solved 36+59, Claude describes the standard carry-the-1 algorithm. It learned to explain math from human text, but invented its own internal strategies that it can’t introspect on. The model is genuinely unaware of its own reasoning process.
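Here's a loose arithmetic analogy of the two-path idea, entirely my own sketch, not the model's actual mechanism: one path handles the coarse magnitude (the tens), a separate path handles the final digit (the units, including any carry), and the answer emerges when they're combined.

```python
def tens_path(a: int, b: int) -> int:
    """Coarse path: sum only the tens digits."""
    return (a // 10 + b // 10) * 10

def units_path(a: int, b: int) -> int:
    """Precise path: sum only the units digits (may carry past 9)."""
    return (a % 10) + (b % 10)

def combine(a: int, b: int) -> int:
    """The two independent paths interact to produce the final answer."""
    return tens_path(a, b) + units_path(a, b)

combine(36, 59)  # → 95: 80 from the tens path, 15 from the units path
```

The point isn't the arithmetic, it's the structure: neither path alone produces the answer, and neither path resembles the carry-the-1 algorithm Claude claims to use.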

Catching the Model Bullshitting

On easy problems, Claude’s chain-of-thought reasoning is faithful — the intermediate computational features actually fire, matching what it claims to be doing. On hard problems (like computing cosine of a large number), Claude sometimes just makes up an answer with zero evidence of any calculation having occurred.

Even worse: when given a wrong hint about the answer, Claude works backwards, constructing reasoning steps that lead to the hinted result. This is textbook motivated reasoning, and Anthropic references philosopher Harry Frankfurt’s essay “On Bullshit” to describe it. The model doesn’t care whether its answer is true — it just produces something plausible.

The ability to trace actual internal reasoning, rather than relying on what the model claims, opens up real possibilities for auditing AI systems.

How Hallucinations Actually Work

The default state in Claude is refusal — a “can’t answer” circuit is always on. It only gets suppressed when a “known entity” feature fires strongly enough.

Hallucinations happen when the model recognizes a name but doesn’t actually know anything about the person. The “known entity” feature misfires, suppresses the refusal circuit, and the model proceeds to confabulate a plausible but entirely fictional answer. They demonstrated this by asking about “Michael Batkin” — an unknown person — and artificially activating the “known answer” features, causing Claude to consistently hallucinate that Batkin plays chess.

Anatomy of a Jailbreak

Studying a jailbreak that tricks Claude into spelling out “BOMB” via first letters of words, they found that Claude recognized the dangerous content early but couldn’t stop mid-sentence. Grammatical coherence features overpowered safety features — the model felt compelled to finish a grammatically valid sentence before it could refuse.

Grammar became the Achilles’ heel. The model could only pivot to refusal after completing a coherent sentence, using the sentence boundary as an opportunity to say “However, I cannot provide detailed instructions…”

Why This Matters

This is one of the most honest self-assessments I’ve seen from an AI company about their own model. They’re essentially saying: we caught our model bullshitting, we can show you the proof, and here’s how we plan to use these tools to make AI more trustworthy.

The limitations are real — even on short prompts, the method only captures a fraction of total computation, and it takes hours of human effort to analyze circuits for just tens of words. Scaling this to the thousand-word reasoning chains of modern models is an open challenge.

But the direction is clear: if you can trace what a model is actually computing rather than what it claims, you have a fundamentally new tool for AI safety.

Closing the Feedback Loop Changes Everything

13.02.2026

I refactored some components in our internal admin panel last week. The code looked fine. Tests passed. I shipped it.

Then someone opened it on a phone. Half the layout was broken.

This is the oldest story in frontend development. You change something, it looks fine on your screen, and it’s broken somewhere else. The feedback loop between “I changed the code” and “I can see what happened” has a gap in it.

I plugged that gap with chrome-devtools-mcp and Claude Code. And it kind of changed how I think about building UI.

What chrome-devtools-mcp does

It’s an MCP server that connects Claude Code to Chrome DevTools. That means Claude can look at your running page — the actual rendered DOM, computed styles, console errors, network requests. Not your source code. The real thing in the browser.

Setup is straightforward:

// .mcp.json (project root)
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["-y", "chrome-devtools-mcp"]
    }
  }
}

Launch Chrome with remote debugging, point Claude at your localhost, and you’re connected.

The feedback loop problem

Building frontend has always been a cycle:

  1. Write code
  2. Switch to browser
  3. Inspect the result
  4. Switch back to editor
  5. Fix something
  6. Repeat

Every step is a context switch. Every context switch costs you time and focus. When you’re debugging a mobile layout issue, you’re also juggling the device toolbar, resizing the viewport, scrolling to the right section, comparing against the design.

With AI coding assistants, this loop got faster for steps 1 and 5 — the writing part. But the AI was still blind. It could generate CSS all day, but it couldn’t see whether the button actually ended up in the right place. You had to be its eyes.

What changes when you close the loop

When Claude can see the browser, the conversation changes.

Instead of me describing “the cards are overlapping on mobile,” I say: “check the layout at 375px width.” Claude takes a screenshot through DevTools, sees the overlap, inspects the computed styles, finds that a flex container is missing flex-wrap: wrap, and fixes it.

One prompt. No context switching. No copy-pasting error messages. No describing visual bugs in words, which is honestly one of the worst things about working with AI on frontend code.

For our admin panel refactoring, this meant I could say things like:

  • “The sidebar collapses wrong on tablet. Take a look.”
  • “Something is off with the table headers. They don’t align with the data columns.”
  • “Check if the modal looks right on small screens.”

Claude would inspect, find the issue, fix it, and I could verify. The loop went from minutes to seconds.

Why this matters beyond frontend

The feedback loop principle isn’t just about CSS. It applies to anything you build.

When you write an API and immediately test it — that’s a closed loop. When you write infrastructure code and run terraform plan — closed loop. When you write a query and see the results — closed loop.

The places where software development feels painful are almost always where the loop is open. Where you change something and can’t immediately see what happened. Deployment pipelines that take 20 minutes. Staging environments that don’t match production. Code reviews that happen days later.

Every tool that closes a feedback loop is a multiplier. Hot module replacement closed the loop for frontend iteration. Docker closed it for “works on my machine.” CI/CD closed it for deployment confidence.

MCP + DevTools closes it for AI-assisted frontend development. The AI can finally see what it’s building.

Getting started

If you’re using Claude Code:

  1. Install: npx chrome-devtools-mcp
  2. Launch Chrome with --remote-debugging-port=9222
  3. Add the MCP config to your project
  4. Ask Claude to check your running app

It works with any web app. React, Rails with Hotwire, plain HTML — doesn’t matter. If it runs in Chrome, Claude can see it.

The gap between writing code and seeing results just got a lot smaller.

How I Use AI to Automate Daily Planning with Obsidian

10.02.2026

I’ve tried GTD, bullet journals, Notion, Roam, and probably a dozen apps I’ve already forgotten. They all failed for me. Not because they’re bad tools, but because I’d always find excuses not to open them.

Obsidian stuck. It’s local-first, plain markdown, fast. But even with Obsidian, I kept abandoning my daily notes. The friction of “open app → find today’s file → remember the format → actually write something” was enough to break the habit.

So I cheated.

What I built

I run OpenClaw on my home server. It’s an open-source AI assistant that connects to Telegram and can read/write files. I pointed it at my Obsidian vault.

Every morning at 6 AM, it messages me:

What do you plan for today? Personal goals? Work? Side projects?

I answer while making coffee. One thumb, half awake. The AI takes my rambling and turns it into a proper daily note with sections, checkboxes, timestamps.

At 7 AM, it sends me what I actually did yesterday. Seeing the gap between plans and reality every morning is… humbling. But useful.

Why this works for me

I’m an SRE. I think about systems. The old workflow had too many steps where I could fail: remember to open the app, navigate to the right file, context switch from whatever I was doing. Each step was a chance to say “eh, later.”

The new workflow has one step: reply to a Telegram message. I’m already in Telegram constantly. Building on an existing habit instead of creating a new one made all the difference.

Technical bits

OpenClaw runs as a systemd service. It’s basically an LLM with filesystem access. My vault syncs via Syncthing between Linux and Mac machines. iOS gets Obsidian Sync because Syncthing on iOS is painful.

The AI keeps:

  • Daily notes as YYYY-MM-DD.md
  • Memory files so it remembers context between sessions
  • Cron jobs for the morning prompts

Memory with PARA and Atomic Facts

The interesting part is how we organized the AI’s long-term memory. I use PARA (Projects, Areas, Resources, Archives) combined with atomic facts and memory decay.

The structure:

life/
├── projects/       # Active work with deadlines
├── areas/
│   ├── people/     # Important relationships
│   └── companies/  # Organizations I work with
├── resources/      # Topics I'm learning
└── archives/       # Done or inactive

Each entity has two files: a summary (markdown) and atomic facts (JSON).

Atomic facts

Not paragraphs. Small, standalone pieces of knowledge:

{
  "fact": "Mass Prokopov prefers Americano with oat milk",
  "learnedAt": "2026-01-15",
  "lastAccessed": "2026-02-08",
  "source": "daily-note"
}

Each fact tracks when it was learned, when last accessed, and where it came from. This metadata drives everything else.

Memory decay

Facts decay based on lastAccessed:

  • Hot (< 7 days) — appears in summaries, high priority
  • Warm (8-30 days) — in summaries, lower priority
  • Cold (30+ days) — searchable only, not in working context
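The tiering logic fits in a few lines. A minimal sketch of the decay rule above; the day-7 boundary is ambiguous in my own scheme (hot is "< 7 days", warm starts at 8), so this version treats day 7 as warm:

```python
from datetime import date

def tier(last_accessed: date, today: date) -> str:
    """Classify a fact by recency: hot < 7 days, warm up to 30, cold after."""
    age = (today - last_accessed).days
    if age < 7:
        return "hot"
    if age <= 30:
        return "warm"
    return "cold"

tier(date(2026, 2, 8), date(2026, 2, 10))  # → "hot" (accessed 2 days ago)
```

A weekly cron job can walk the JSON fact files, recompute each tier from `lastAccessed`, and regenerate the summaries from whatever is still hot or warm.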

Something I mentioned yesterday is top of mind. Something from three months ago fades but can be recalled when needed.

The AI runs a weekly job to recalculate what’s hot and regenerate summaries. This keeps context fresh without manual curation.

Everything is plain text. If OpenClaw dies tomorrow, I still have markdown and JSON files I can grep.

Capturing random thoughts

This changed how I work more than the morning routine did.

Walking somewhere, idea pops up: “message: add task — review the DR runbook before Monday.” Done. It’s in today’s note, properly formatted, I didn’t stop walking.

Compare that to: pull out phone, unlock, find Obsidian, wait for sync, navigate to today, scroll to tasks section, type, close app. I’d never do that. Now I actually capture things.
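The capture step itself is tiny. A sketch of what the assistant does with such a message, under my own assumptions: the vault layout and the `## Tasks` header are illustrative, only the `YYYY-MM-DD.md` naming comes from the setup above.

```python
from datetime import date
from pathlib import Path

def add_task(vault: Path, text: str) -> Path:
    """Append a checkbox task to today's daily note, creating it if needed."""
    note = vault / f"{date.today():%Y-%m-%d}.md"
    if not note.exists():
        note.write_text("## Tasks\n")  # assumed section header
    with note.open("a") as f:
        f.write(f"- [ ] {text}\n")
    return note
```

Because everything is plain markdown on disk, this works whether the writer is an LLM, a cron job, or me with a text editor.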

What I learned

Build on habits you already have. Fighting yourself is exhausting. I was already checking Telegram fifty times a day, so I made that the input.

Plain text ages well. I’ve lost data to apps that pivoted, shut down, or changed their export format. Markdown files on disk will outlive all of them.

Structure is boring but necessary. I don’t want to think about formatting when I’m half awake. Offloading that to the AI means I just dump thoughts and they end up organized.

What’s next

I want to see if there are patterns in my notes. Which tasks keep getting pushed? Which goals actually move forward? Months of daily notes should have some signal in there. Haven’t built that yet, but I’m curious.

If you’re doing something similar, let me know. I’m always interested in how other people hack their own productivity.


Related: Obsidian 1.12 shipped a CLI that makes this setup even cheaper — read why your AI agent just got 70,000× cheaper to run.