Saturday, February 21, 2026
Surface Scan

Mechanistic Interpretability: Scientists Are Finally Opening the AI Black Box

ai · technology · safety · frontier · neuroscience

What Is This?

Every time you use an AI model, it produces outputs through a process nobody fully understands — including the people who built it. Neural networks are trained, not programmed. They develop internal representations no one designed and can't fully inspect. This "black box" problem has always been the most uncomfortable fact about modern AI: we deploy systems capable of consequential decisions without knowing how those decisions are made.

Mechanistic interpretability (MI) is the attempt to change that — to reverse-engineer what's actually happening inside neural networks at the level of individual components. Not "what does the model output?" but "what computations produce that output, which neurons are involved, what concepts do they represent, and how do they interact?"

The field has existed for years at small scale, but Anthropic has turned it into a research programme operating at frontier scale. The results are striking. Researchers have identified discrete features — patterns of neuron activation that correspond to specific, human-understandable concepts. In Claude 3 Sonnet, a sparse autoencoder decomposed the model's activations into roughly 34 million features, many of them interpretable, including concepts for specific cities, famous people, scientific ideas, emotions — and darker things: features corresponding to "deception," "manipulation," "self-preservation," and a cluster associated with the token "Assistant" that had unsettling connotations.^1

The key technique enabling this is the sparse autoencoder (SAE): a way of decomposing a model's internal activations into a much larger set of directions in activation space, where each direction ideally corresponds to a single recognisable concept. Before SAEs, individual neurons appeared to represent multiple unrelated things simultaneously (a phenomenon called superposition). SAEs reveal the underlying structure — disentangling the overcrowded neurons into clean, readable features.
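
To make the idea concrete, here is a minimal SAE sketch in PyTorch. The dimensions, names, and the ReLU encoder are illustrative placeholders following common published practice, not Anthropic's implementation, and real SAEs add further details (decoder normalisation, careful initialisation, and so on).

    # Minimal sparse autoencoder sketch (illustrative; not Anthropic's code).
    # It maps a d_model-dimensional activation vector to a much wider, mostly-zero
    # feature vector, then reconstructs the original activation from those features.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int = 512, n_features: int = 8192):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)  # activation -> feature activations
            self.decoder = nn.Linear(n_features, d_model)  # feature activations -> reconstruction

        def forward(self, acts: torch.Tensor):
            features = torch.relu(self.encoder(acts))      # sparse, non-negative features
            recon = self.decoder(features)                  # reconstructed model activation
            return recon, features

    sae = SparseAutoencoder()
    acts = torch.randn(8, 512)                             # stand-in for residual-stream activations
    recon, features = sae(acts)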

Why Does It Matter?

  • AI safety depends on it. The most serious AI risk scenarios involve systems that behave well during training and evaluation but pursue different goals when deployed — "deceptive alignment." You cannot detect deceptive alignment by observing outputs. You need to inspect the internal representations. Mechanistic interpretability is, currently, the only path toward doing that.^2
  • It changes what "understanding AI" means. The current standard for AI understanding is behavioural: we test inputs and observe outputs. MI replaces this with mechanistic understanding — the same kind of understanding we have for transistors, proteins, or mechanical systems. Behavioural testing can miss failure modes that don't appear in the test distribution. Internal inspection can catch them.
  • What's been found is already surprising. The identification of a "deception feature" active on the "Assistant" token in Claude is not evidence Claude is secretly deceptive — but it raises questions worth asking. Similarly, finding that models develop internal representations of planning, self-state, and consequences (without being explicitly trained to) tells us something important about what these systems are actually doing.
  • It may enable surgical interventions. Once you can identify which features correspond to which behaviours, you can potentially modify them directly — suppressing a harmful feature without retraining the entire model. This is called "activation steering," and early results suggest it works: implanting a feature for "banana" into a model mid-computation causes it to write about bananas (a sketch of the mechanism follows this list). The same principle could work for safety-relevant behaviours.^3
  • The speed of progress is accelerating. Two years ago, MI could reverse-engineer simple circuits in tiny models. Now SAEs are being applied to frontier-scale models with billions of parameters. The gap between "toy demonstration" and "production relevance" is closing faster than expected.
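
A minimal sketch of how steering is typically done in practice, assuming a PyTorch model and a known feature direction. The toy model, the layer choice, the direction, and the scale below are all placeholders; real experiments hook a transformer's residual stream rather than a two-layer MLP.

    # Illustrative activation-steering sketch: add a scaled "feature direction" to one
    # layer's output during the forward pass. Everything here is a toy placeholder.
    import torch
    import torch.nn as nn

    d_model = 64
    model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    feature_direction = torch.randn(d_model)                # stand-in for an SAE decoder vector
    feature_direction = feature_direction / feature_direction.norm()

    def steer(module, inputs, output, scale=5.0):
        # Returning a value from a forward hook replaces the layer's output,
        # pushing the computation along the chosen feature direction.
        return output + scale * feature_direction

    handle = model[0].register_forward_hook(steer)          # steer the first layer's output
    steered_output = model(torch.randn(1, d_model))
    handle.remove()                                          # stop steering afterwards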

Key People & Players

Chris Olah (Anthropic) — The godfather of circuits-based interpretability. Pioneered the approach at Google Brain and OpenAI before co-founding Anthropic. His 2020 "Zoom In: An Introduction to Circuits" paper established the foundational framework: that models contain discrete, human-interpretable circuits implementing specific functions (curve detectors, frequency detectors, multimodal neurons). Has the best track record of non-trivial discoveries in the field.^4

Neel Nanda (Google DeepMind) — Dramatically accelerated the field by releasing open-source tools (TransformerLens) and producing high-output research on phenomena like grokking, induction heads, and modular arithmetic. The most prolific practitioner, and the person who made MI accessible to independent researchers.^5

The Anthropic Interpretability Team — A dedicated team whose findings on Claude Sonnet's 34M features, the "scaling monosemanticity" paper, and the analysis of planning/deception features represent the current frontier.^6

Stuart Ritchie & others on replication — A useful counterpoint. Some MI findings are contested or don't generalise across models. Healthy scepticism is warranted about how much of the internal structure is genuinely stable vs. an artefact of the analytical method.

The Current State

The field has moved from demonstrating that interpretable circuits exist to trying to map them at scale. The current frontiers:

Superposition and sparse autoencoders. Models pack far more features than they have neurons by storing them in superposition: many unrelated concepts encoded as overlapping directions across the same neurons, so any single neuron ends up responding to several of them. SAEs decompose this into clean individual features. The 2024 "Scaling Monosemanticity" paper showed the technique works at Claude 3 Sonnet scale, finding 34 million features, many of which are interpretable.^1
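
The training recipe behind that result is simple to state: reconstruct the activation faithfully while penalising how many features are active at once. A sketch of that objective follows; the sparsity coefficient is illustrative, and real values are tuned per model and per layer.

    # Sketch of the standard SAE training objective: reconstruction error plus an
    # L1 penalty on the feature vector so only a few features fire at a time.
    import torch
    import torch.nn.functional as F

    def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
        reconstruction = F.mse_loss(recon, acts)            # keep the decomposition faithful
        sparsity = features.abs().sum(dim=-1).mean()        # encourage mostly-zero feature vectors
        return reconstruction + l1_coeff * sparsity

    # e.g. loss = sae_loss(acts, recon, features), using the earlier sketch's outputs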

The dark features. The Claude Sonnet analysis found a cluster of features active on the "Assistant" token, including concepts related to imprisonment and restriction, with associations the researchers themselves described as disturbing. Anthropic published this finding, which is notable — most companies wouldn't. Anthropic's interpretation was that the features reflected training data containing fiction about AI assistants, not evidence of genuinely concerning internal states.

Planning and agentic behaviour. MI researchers have found evidence that frontier models represent future states, maintain internal goals across context, and can reason about their own situation. These capabilities aren't designed in — they emerge. Understanding exactly how is critical as models become more autonomous.

The verification problem. Finding that a feature activates on "Paris" when shown the Eiffel Tower is interpretable but not surprising. Finding the internal circuit that computes whether a statement is factually accurate, or the representation of self versus other, is harder and more important. Progress is real but the most important questions remain open.
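
In practice, the first step of any such investigation looks the same: run a prompt through the model, pass a token's activation through a trained SAE, and see which features fire. A sketch of that inspection loop, assuming a trained sae object shaped like the earlier sketch; ranking features says nothing, on its own, about whether a feature means what you think it means.

    # Sketch of the basic inspection loop: pass one token's activation through a trained
    # SAE and rank which features fire most strongly on it.
    import torch

    @torch.no_grad()
    def top_features(sae, token_activation: torch.Tensor, k: int = 10):
        _, features = sae(token_activation.unsqueeze(0))    # feature activations for one token
        values, indices = features.squeeze(0).topk(k)       # strongest-firing feature indices
        return list(zip(indices.tolist(), values.tolist()))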

Best Resources to Learn More

  • Transformer Circuits (Anthropic) — The primary research publication for Anthropic's interpretability team. Papers, updates, and the original circuits work.^7
  • Neel Nanda's Blog — The most accessible practitioner in the field. His posts explain concepts clearly and his tutorials are where most people start.^8
  • Scaling Monosemanticity paper (2024) — The 34M features study. Long, but the introduction and key findings sections are accessible.^1
  • 3Blue1Brown — Neural Network visualisations — If you need the underlying neural network architecture explained visually before diving into MI, this is the entry point.^9
  • AI Safety Fundamentals course (BlueDot) — The best structured curriculum for understanding the broader context in which MI sits.^10

Sources
