Evolution Timeline
Paper added: Sparse Autoencoders Reveal Monosemantic Features
New technique for decomposing activations in superposition into interpretable directions
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
We use sparse autoencoders to decompose the internal activations of language models into interpretable features. Our approach reveals that models represent concepts in superposition, and we can disentangle these into monosemantic features that correspond to human-understandable concepts.
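A minimal sketch of the sparse autoencoder idea the abstract describes, assuming a PyTorch setting: an overcomplete linear encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty. Dimensions, the penalty coefficient, and names like `SparseAutoencoder` are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a larger, sparsely active feature basis."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations are non-negative and, after training, mostly zero.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the original activations;
    # the L1 term pushes each activation to be explained by only a few features.
    mse = F.mse_loss(reconstruction, activations)
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

Training then amounts to streaming the model's internal activations through this module and minimizing `sae_loss`; the learned decoder columns are the candidate monosemantic directions.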
Controversy: Do circuits generalize across models?
Debate over whether discovered circuits are universal or model-specific artifacts
On the Universality of Learned Circuits in Language Models
We investigate whether circuits discovered in one model architecture appear in other models. Our results suggest that circuit motifs are partially universal, but implementation details vary significantly across architectures.
Paper added: Scaling Monosemanticity
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
We scale sparse autoencoder methods to production language models, finding millions of interpretable features. We demonstrate features for abstract concepts, multilingual representations, and safety-relevant activations.
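At any scale, the usual way to argue a feature is interpretable is to inspect the inputs on which it fires most strongly. The sketch below, reusing the `SparseAutoencoder` sketch above, only illustrates that idea; `get_activations`, the feature index, and ranking by maximum activation are hypothetical choices, not the paper's pipeline.

```python
def top_activating_examples(sae, get_activations, texts, feature_idx, k=10):
    """Rank texts by how strongly one SAE feature fires on them.

    `get_activations(text)` is a hypothetical helper returning the model's
    internal activations for a text, shape (seq_len, d_model).
    """
    scores = []
    for text in texts:
        acts = get_activations(text)              # (seq_len, d_model)
        _, features = sae(acts)                   # (seq_len, d_features)
        scores.append((features[:, feature_idx].max().item(), text))
    # The highest-scoring snippets are then read by a human (or a model)
    # to decide what concept, if any, the feature tracks.
    return sorted(scores, reverse=True)[:k]
```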
Consensus: Circuits exist and can be isolated
Community agrees that neural networks implement interpretable algorithms, though methods for discovery remain debated
Mechanistic Interpretability Across Five Years of Research
We survey the mechanistic interpretability literature and identify common patterns. Despite methodological disagreements, we find strong evidence that neural networks implement discrete, interpretable computational circuits.
Paper added: Induction Heads in Transformers
In-context Learning and Induction Heads
We identify "induction heads" as a key mechanism for in-context learning in transformers. These circuits enable models to recognize and continue repeated patterns by attending to the token that followed an earlier occurrence of the current token and copying it forward.
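A common diagnostic for this behaviour is to feed the model a random token block repeated twice and measure how much attention each head places on the token that followed the previous occurrence of the current token. The sketch below assumes you already have a head's attention pattern as a tensor; the shapes and scoring convention are illustrative rather than taken from the paper.

```python
import torch

def induction_score(attn_pattern: torch.Tensor, rep_len: int) -> torch.Tensor:
    """Average attention paid to the token after the previous occurrence.

    `attn_pattern` has shape (n_heads, seq_len, seq_len) and was computed on a
    random block of length `rep_len` repeated twice, so for a query position i
    in the second copy the "induction" source is position i - rep_len + 1.
    """
    n_heads, seq_len, _ = attn_pattern.shape
    scores = torch.zeros(n_heads)
    for i in range(rep_len, seq_len):            # query positions in the second copy
        scores += attn_pattern[:, i, i - rep_len + 1]
    return scores / (seq_len - rep_len)          # high score => induction-head-like
```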
Paper added: Toy Models of Superposition
Toy Models of Superposition
We construct minimal toy models that demonstrate how neural networks represent more features than dimensions through superposition. This explains polysemantic neurons and informs interpretability methods.
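A sketch of the kind of toy model the abstract refers to, under simplified assumptions (uniform feature importance, a ReLU read-out, PyTorch): many sparse features are forced through a smaller hidden layer and reconstructed, and after training the weight matrix shows multiple features sharing directions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySuperpositionModel(nn.Module):
    """n_features sparse features squeezed through an n_hidden-dim bottleneck (n_hidden < n_features)."""

    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        hidden = x @ self.W.T                      # project into too few dimensions
        return F.relu(hidden @ self.W + self.b)    # try to recover every feature anyway

def sample_batch(batch=1024, n_features=20, sparsity=0.05):
    # Sparse synthetic data: each feature is usually zero, occasionally uniform in [0, 1].
    mask = (torch.rand(batch, n_features) < sparsity).float()
    return mask * torch.rand(batch, n_features)

model = ToySuperpositionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    x = sample_batch()
    loss = ((model(x) - x) ** 2).mean()            # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, model.W.T @ model.W (the feature Gram matrix) shows features
# sharing directions with interference: more features are represented than
# there are hidden dimensions.
```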
Paper added: A Mathematical Framework for Transformer Circuits
A Mathematical Framework for Transformer Circuits
We develop mathematical tools for reverse-engineering transformer models as computational circuits. Our framework enables precise claims about how transformers implement algorithms.
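As an example of the kind of precise claim the framework supports, a one-layer attention-only transformer can be written as a sum of end-to-end paths: a direct path from embedding to unembedding, plus one term per attention head. The notation below is a simplified paraphrase of that decomposition (layer norms and biases omitted).

```latex
% A^h \otimes M reads: mix positions with the attention pattern A^h,
% act on each position's residual vector with M.
T \;=\; \mathrm{Id} \otimes W_U W_E
  \;+\; \sum_{h \in \mathrm{heads}} A^h \otimes \bigl(W_U W_{OV}^h W_E\bigr),
\qquad
A^h = \operatorname{softmax}\!\bigl(t^{\top} W_E^{\top} W_{QK}^h W_E\, t\bigr)
```

Here t is the one-hot token matrix, W_OV^h = W_O^h W_V^h is the head's OV circuit (what it writes into the residual stream), and W_QK^h = W_Q^{h,T} W_K^h is its QK circuit (where it attends); each summand is a low-rank, separately analyzable path from input tokens to output logits.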
Paper added: Feature Visualization
Feature Visualization
How can we see what neural networks are looking for? Feature visualization uses optimization to find inputs that maximally activate neurons, revealing learned representations.
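A minimal sketch of that optimization loop, assuming PyTorch and the torchvision GoogLeNet (InceptionV1) weights; the layer, channel index, learning rate, and iteration count are arbitrary, and real feature visualization adds regularizers (transformations, frequency penalties) that this sketch omits.

```python
import torch
import torchvision.models as models

# Gradient ascent on the input image to maximize one channel's mean activation.
model = models.googlenet(weights="DEFAULT").eval()

activation = {}
def hook(_module, _inp, out):
    activation["value"] = out
model.inception4a.register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

channel = 97  # arbitrary channel to visualize
for _ in range(256):
    optimizer.zero_grad()
    model(image)
    # Maximize the chosen channel's mean activation (minimize its negative).
    loss = -activation["value"][0, channel].mean()
    loss.backward()
    optimizer.step()

# `image` now approximates an input that strongly drives the chosen channel.
```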
Programme founded
Mechanistic Interpretability programme created to track research on understanding neural network internals
Zoom In: An Introduction to Circuits
We introduce the circuits thread of interpretability research, which aims to reverse-engineer neural networks into human-understandable algorithms. We demonstrate this approach on vision models.