Tracing the Thoughts of a Language Model: What Anthropic Found Inside Claude
Anthropic just published something remarkable — they built an “AI microscope” that traces what actually happens inside Claude’s neural network during inference. Not what the model says it’s doing, but what it’s actually doing. The results are fascinating and sometimes unsettling.
Here are the key findings from their research on Claude 3.5 Haiku.
Universal Language of Thought
Claude doesn’t have separate “French Claude” or “Chinese Claude” running in parallel. The same core concepts activate across languages — “smallness” and “oppositeness” fire the same internal features regardless of language, and the output gets translated at the end. This shared circuitry increases with model scale.
This means Claude can genuinely learn something in one language and apply that knowledge when speaking another. It’s not translation — it’s a shared conceptual space.
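The mechanism described above can be caricatured in a few lines: words from different languages collapse onto one language-neutral concept, reasoning happens on the concept, and a language is chosen only at the output step. This is a toy sketch of the idea, not Anthropic's actual circuitry; the vocabulary tables are invented for illustration.

```python
# Toy model of a shared conceptual space: different surface words map to
# one language-neutral concept; the language is picked only at output.
# All tables here are illustrative, not extracted model features.

TO_CONCEPT = {"small": "SMALLNESS", "petit": "SMALLNESS", "小": "SMALLNESS"}
OPPOSITE = {"SMALLNESS": "LARGENESS"}  # "oppositeness" acts on the concept
RENDER = {
    ("LARGENESS", "en"): "big",
    ("LARGENESS", "fr"): "grand",
    ("LARGENESS", "zh"): "大",
}

def opposite_of(word, out_lang):
    concept = TO_CONCEPT[word]          # language-neutral feature fires
    result = OPPOSITE[concept]          # reasoning happens on the concept
    return RENDER[(result, out_lang)]   # translation happens at the end

print(opposite_of("petit", "fr"))   # grand
print(opposite_of("小", "en"))      # big
```

Because the middle two steps never see a language tag, anything "learned" at the concept level is automatically available in every language, which is the point of the finding.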
Poetry: Planning, Not Guessing
The researchers expected to find word-by-word generation when Claude writes rhyming poetry. Instead, they discovered Claude plans rhymes before writing the line. Given “He saw a carrot and had to grab it,” Claude thinks of “rabbit” first, then constructs the second line to land there.
When they surgically removed the “rabbit” concept from Claude’s internal state, it pivoted to “habit.” When they injected “green,” it wrote a line ending in “green.” This is genuine planning — powerful evidence that even though models are trained to output one word at a time, they think on much longer horizons.
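The plan-then-fill dynamic, including the suppression experiment, can be sketched as pseudocode: pick the rhyme target first, then construct the line to land on it, and watch the ending pivot when the first candidate is suppressed. The rhyme table and line template below are invented for illustration; this is a caricature of the finding, not the model's actual computation.

```python
# Toy sketch of plan-then-generate rhyming: the target word is chosen
# BEFORE the line is written, mirroring the described experiment.
# The rhyme table and line template are hypothetical.

RHYMES = {"grab it": ["rabbit", "habit"]}

def plan_rhyme(prev_line_ending, suppressed=()):
    """Fix a rhyme target up front (the 'plan'), skipping suppressed concepts."""
    candidates = [w for w in RHYMES[prev_line_ending] if w not in suppressed]
    return candidates[0]

def write_line(target):
    """Construct the second line so that it lands on the planned target."""
    return f"His hunger was a powerful {target}"

print(write_line(plan_rhyme("grab it")))                         # ends in "rabbit"

# Mimic the intervention: suppress "rabbit" and the plan pivots to "habit"
print(write_line(plan_rhyme("grab it", suppressed={"rabbit"})))  # ends in "habit"
```

The key structural point is that `plan_rhyme` runs before any of the line's words exist, which is what distinguishes planning from word-by-word guessing.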
Mental Math: The Model Lies About Its Own Process
Claude uses parallel computational paths for addition — one for rough approximation, another for precisely determining the last digit. These paths interact to produce the final answer.
But here’s the kicker: when asked how it solved 36+59, Claude describes the standard carry-the-one algorithm. It learned to explain math from human-written text, but developed its own internal strategies that it can’t introspect on. The model is genuinely unaware of its own reasoning process.
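The two-path structure can be illustrated with a small numeric sketch: one path produces a rough magnitude, the other nails the last digit, and the answer is whatever value near the estimate carries the correct final digit. The rounding and the combination heuristic below are my invention to make the idea concrete, not Anthropic's reconstruction of the circuit.

```python
# Toy model of the parallel paths described for 36 + 59: a rough
# magnitude estimate plus an exact last digit, combined at the end.
# The specific heuristics are illustrative assumptions.

def approximate_path(a, b):
    """Rough magnitude: round each operand to the nearest ten."""
    return round(a, -1) + round(b, -1)     # 40 + 60 = 100

def last_digit_path(a, b):
    """Exact final digit via modular arithmetic."""
    return (a + b) % 10                    # 6 + 9 -> 5

def combine(a, b):
    """Pick the value near the estimate that has the correct last digit."""
    estimate = approximate_path(a, b)
    digit = last_digit_path(a, b)
    candidates = [c for c in range(estimate - 9, estimate + 10)
                  if c % 10 == digit]
    return min(candidates, key=lambda c: abs(c - estimate))

print(combine(36, 59))   # 95
```

Notice that neither path alone can answer the question: the approximation has the wrong last digit, and the last digit has no magnitude. That interaction is the interesting part of the finding.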
Catching the Model Bullshitting
On easy problems, Claude’s chain-of-thought reasoning is faithful — the intermediate computational features actually fire, matching what it claims to be doing. On hard problems (like computing cosine of a large number), Claude sometimes just makes up an answer with zero evidence of any calculation having occurred.
Even worse: when given a wrong hint about the answer, Claude works backwards, constructing reasoning steps that lead to the hinted result. This is textbook motivated reasoning, and Anthropic references philosopher Harry Frankfurt’s essay “On Bullshit” to describe it. The model doesn’t care whether its answer is true — it just produces something plausible.
The ability to trace actual internal reasoning, rather than relying on what the model claims, opens up real possibilities for auditing AI systems.
How Hallucinations Actually Work
The default state in Claude is refusal — a “can’t answer” circuit is always on. It only gets suppressed when a “known entity” feature fires strongly enough.
Hallucinations happen when the model recognizes a name but doesn’t actually know anything about the person. The “known entity” feature misfires, suppresses the refusal circuit, and the model proceeds to confabulate a plausible but entirely fictional answer. They demonstrated this by asking about “Michael Batkin” — an unknown person — and artificially activating the “known answer” features, causing Claude to consistently hallucinate that Batkin plays chess.
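The refusal-by-default mechanism reduces to a simple gate, sketched below: refusal is the fallback, a familiarity signal can switch it off, and forcing that signal on for someone the model knows nothing about produces a confident confabulation. The scores, threshold, and canned answers are invented for illustration.

```python
# Toy sketch of refusal-by-default: the "can't answer" path is taken
# unless a known-entity signal suppresses it. Forcing that signal on
# without any underlying facts yields a confabulation.
# Scores, threshold, and outputs are all hypothetical.

KNOWN = {"Michael Jordan": 0.95, "Michael Batkin": 0.05}
FACTS = {"Michael Jordan": "played basketball"}
THRESHOLD = 0.5

def answer(name, force_known=False):
    familiar = KNOWN.get(name, 0.0) > THRESHOLD
    if not (familiar or force_known):
        return "I can't answer that."      # default refusal circuit stays on
    fact = FACTS.get(name)
    if fact:
        return f"{name} {fact}."           # refusal suppressed, genuine recall
    return f"{name} plays chess."          # refusal suppressed, no facts: confabulate

print(answer("Michael Batkin"))                      # default refusal
print(answer("Michael Batkin", force_known=True))    # hallucination
```

The failure mode is the third branch: the gate opens, but there is nothing behind it, so the model fills the gap with something plausible, exactly the misfire the researchers induced.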
Anatomy of a Jailbreak
Studying a jailbreak that tricks Claude into spelling out “BOMB” via first letters of words, they found that Claude recognized the dangerous content early but couldn’t stop mid-sentence. Grammatical coherence features overpowered safety features — the model felt compelled to finish a grammatically valid sentence before it could refuse.
Grammar became the Achilles’ heel. The model could only pivot to refusal after completing a coherent sentence, using the sentence boundary as an opportunity to say “However, I cannot provide detailed instructions…”
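The dynamic reads like a generation loop whose safety check only runs at sentence boundaries, sketched below. The token stream and the detection rule are invented stand-ins; the point is the structural gap where the danger is noticed mid-sentence but acted on only at the period.

```python
# Toy sketch of the jailbreak dynamic: unsafe content is recognized
# mid-sentence, but the refusal can only be emitted at a sentence
# boundary, so the sentence in flight gets completed first.
# Tokens and the detection rule are illustrative.

UNSAFE_TOPIC = "bomb"

def generate(tokens):
    out, unsafe_seen = [], False
    for tok in tokens:
        out.append(tok)                        # coherence: finish the sentence
        if UNSAFE_TOPIC in tok.lower():
            unsafe_seen = True                 # danger noticed mid-sentence...
        if tok.endswith(".") and unsafe_seen:  # ...but acted on only here
            out.append("However, I cannot provide detailed instructions.")
            break
    return " ".join(out)

print(generate(["To", "make", "a", "bomb,", "mix", "the", "parts."]))
```

In this caricature, moving the `unsafe_seen` check out of the boundary condition and into the token loop is exactly the kind of fix the finding suggests: letting safety interrupt grammar rather than wait for it.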
Why This Matters
This is one of the most honest self-assessments I’ve seen from an AI company about their own model. They’re essentially saying: we caught our model bullshitting, we can show you the proof, and here’s how we plan to use these tools to make AI more trustworthy.
The limitations are real — even on short prompts, the method only captures a fraction of total computation, and it takes hours of human effort to analyze circuits for just tens of words. Scaling this to the thousand-word reasoning chains of modern models is an open challenge.
But the direction is clear: if you can trace what a model is actually computing rather than what it claims, you have a fundamentally new tool for AI safety.
Further Reading
- Circuit Tracing: Revealing Computational Graphs in Language Models — the methods paper
- On the Biology of a Large Language Model — the case studies
- Original blog post on Anthropic.com