
Circuits in Transformer Language Models

Machine Learning

Reverse-engineering the specific algorithms that transformer language models implement, as studied in work on induction heads, indirect object identification, and greater-than comparisons. The programme focuses on finding interpretable computational circuits within neural network activations.

Core Beliefs

  • Neural networks implement discrete, interpretable algorithms rather than opaque statistical associations
  • Superposition allows models to represent more features than they have dimensions
  • Circuits can be identified through activation patching and causal interventions

Methods

  • Activation patching to isolate causal pathways (a minimal sketch follows this list)
  • Sparse autoencoders for feature disentanglement
  • Attention head analysis and ablation studies
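
The first method can be illustrated with a short sketch. The code below uses plain PyTorch forward hooks to cache an activation from a clean run and substitute it into a corrupted run; model, layer, clean_input, and corrupt_input are hypothetical placeholders, and real experiments typically rely on purpose-built tooling rather than raw hooks.

    import torch

    def get_activation(model, layer, inputs):
        """Run the model once and cache the output of one submodule."""
        cache = {}
        def hook(module, args, output):
            cache["act"] = output.detach()
        handle = layer.register_forward_hook(hook)
        with torch.no_grad():
            model(inputs)
        handle.remove()
        return cache["act"]

    def run_with_patch(model, layer, inputs, patched_act):
        """Run the model, overwriting one submodule's output with a cached activation."""
        def hook(module, args, output):
            return patched_act  # returning a value from a forward hook replaces the output
        handle = layer.register_forward_hook(hook)
        with torch.no_grad():
            logits = model(inputs)
        handle.remove()
        return logits

    # Recipe: cache the activation on a clean prompt, then insert it while running a
    # token-aligned corrupted prompt. If the output recovers, that layer's activation
    # lies on the causal pathway for the behaviour under study.
    # clean_act = get_activation(model, layer, clean_input)
    # patched_logits = run_with_patch(model, layer, corrupt_input, clean_act)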

87 Papers · 12 Contributors · 342 Subscribers · Founded Mar 2022

Evolution Timeline

9 events

Paper added: Sparse Autoencoders Reveal Monosemantic Features

New technique for decomposing superposed activations into interpretable feature directions

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Trenton Bricken, Adly Templeton, Joshua Batson (2023)

We use sparse autoencoders to decompose the internal activations of language models into interpretable features. Our approach reveals that models represent concepts in superposition, and we can disentangle these into monosemantic features that correspond to human-understandable concepts.
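
As a rough illustration of the dictionary-learning setup described here, the sketch below implements a small sparse autoencoder in PyTorch: a wide ReLU encoder over model activations, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The widths and the l1_coeff value are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=512, n_features=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)  # overcomplete dictionary
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, activations):
            features = torch.relu(self.encoder(activations))  # sparse, non-negative codes
            reconstruction = self.decoder(features)
            return reconstruction, features

    def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
        # Reconstruction error keeps the features faithful to the activations;
        # the L1 penalty drives most feature activations to zero (sparsity).
        mse = (reconstruction - activations).pow(2).mean()
        return mse + l1_coeff * features.abs().mean()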

Curated by @research_alice

Controversy: Do circuits generalize across models?

Debate over whether discovered circuits are universal or model-specific artifacts

On the Universality of Learned Circuits in Language Models

David Bau, Kevin Meng, Mor Geva (2023)

We investigate whether circuits discovered in one model architecture appear in other models. Our results suggest circuit motifs are partially universal but implementation details vary significantly across architectures.

Curated by @research_bob

Paper added: Scaling Monosemanticity

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus (2024)

We scale sparse autoencoder methods to production language models, finding millions of interpretable features. We demonstrate features for abstract concepts, multi-lingual representations, and safety-relevant activations.

Curated by @research_bob

Consensus: Circuits exist and can be isolated

Community agrees that neural networks implement interpretable algorithms, though methods for discovery remain debated

Mechanistic Interpretability Across Five Years of Research

Neel Nanda, Lawrence Chan, Tom Lieberum (2023)

We survey the mechanistic interpretability literature and identify common patterns. Despite methodological disagreements, we find strong evidence that neural networks implement discrete, interpretable computational circuits.

Curated by @research_charlie

Paper added: Induction Heads in Transformers

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda (2022)

We identify "induction heads" as a key mechanism for in-context learning in transformers. These circuits enable models to recognize and continue patterns by attending to previous examples.
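
A common diagnostic for this mechanism, sketched below under assumed shapes, is to run the model on a random token sequence repeated twice and measure how much attention each head pays to the token just after the previous occurrence of the current token; attn is a hypothetical cached attention pattern of shape [n_heads, 2*T, 2*T].

    import torch

    def induction_score(attn: torch.Tensor, T: int) -> torch.Tensor:
        """Score each head's induction behaviour on a length-2*T repeated sequence."""
        dests = torch.arange(T, 2 * T)            # query positions in the second copy
        srcs = dests - T + 1                      # token after the earlier occurrence
        # High mean attention along these (dest, src) pairs indicates the
        # match-and-copy pattern characteristic of induction heads.
        return attn[:, dests, srcs].mean(dim=-1)  # one score per head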

Curated by @research_alice

Paper added: Toy Models of Superposition

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson (2022)

We construct minimal toy models that demonstrate how neural networks represent more features than dimensions through superposition. This explains polysemantic neurons and informs interpretability methods.
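
The toy setup can be sketched in a few lines: n sparse features are compressed through a bottleneck of m < n dimensions and reconstructed with tied weights and a ReLU. The sizes and sparsity level below are illustrative assumptions; with sufficiently sparse inputs the learned directions interfere, i.e. features are stored in superposition.

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        """Reconstruct n sparse features through an m-dimensional bottleneck."""
        def __init__(self, n_features=20, m_dims=5):
            super().__init__()
            self.W = nn.Parameter(torch.randn(n_features, m_dims) * 0.1)
            self.b = nn.Parameter(torch.zeros(n_features))

        def forward(self, x):
            h = x @ self.W                            # compress: n_features -> m_dims
            return torch.relu(h @ self.W.T + self.b)  # reconstruct with tied weights

    def sparse_batch(batch=1024, n_features=20, p_active=0.05):
        # Each feature is present with small probability and a random magnitude.
        mask = (torch.rand(batch, n_features) < p_active).float()
        return mask * torch.rand(batch, n_features)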

Curated by @research_charlie

Paper added: A Mathematical Framework for Transformer Circuits

A Mathematical Framework for Transformer Circuits

Nelson Elhage, Neel Nanda, Catherine Olsson (2021)

We develop mathematical tools for reverse-engineering transformer models as computational circuits. Our framework enables precise claims about how transformers implement algorithms.
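
One central idea from the framework is that an attention head factors into a QK circuit (which positions attend to which) and an OV circuit (what information moves when they do). The sketch below computes these effective matrices for a single head; the weight shapes and the d_model/d_head sizes are assumptions for illustration.

    import torch

    d_model, d_head = 512, 64
    W_Q = torch.randn(d_model, d_head)   # residual stream -> query
    W_K = torch.randn(d_model, d_head)   # residual stream -> key
    W_V = torch.randn(d_model, d_head)   # residual stream -> value
    W_O = torch.randn(d_head, d_model)   # head output -> residual stream

    # QK circuit: bilinear form scoring a destination residual vector against a source one.
    W_QK = W_Q @ W_K.T   # [d_model, d_model]
    # OV circuit: linear map from an attended-to residual vector to the head's contribution.
    W_OV = W_V @ W_O     # [d_model, d_model]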

Curated by @research_alice

Paper added: Feature Visualization

Feature Visualization

Chris Olah, Alexander Mordvintsev, Ludwig Schubert (2017)

How can we see what neural networks are looking for? Feature visualization uses optimization to find inputs that maximally activate neurons, revealing learned representations.
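
A minimal version of this optimization, sketched below for a hypothetical PyTorch vision model and a chosen layer and channel, starts from noise and performs gradient ascent on the input; practical recipes add regularization and transformation robustness, which are omitted here.

    import torch

    def visualize(model, layer, channel, steps=256, lr=0.05, size=224):
        """Optimize an input image to maximally activate one channel of `layer`."""
        model.requires_grad_(False)
        img = torch.randn(1, 3, size, size, requires_grad=True)
        acts = {}
        def hook(module, args, out):
            acts["out"] = out
        handle = layer.register_forward_hook(hook)
        opt = torch.optim.Adam([img], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            model(img.sigmoid())                      # sigmoid keeps pixel values in [0, 1]
            loss = -acts["out"][0, channel].mean()    # ascend on the channel's mean activation
            loss.backward()
            opt.step()
        handle.remove()
        return img.sigmoid().detach()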

Curated by @research_bob

Programme founded

Programme created to track mechanistic interpretability research on understanding neural network internals

Zoom In: An Introduction to Circuits

Chris Olah, Nick Cammarata, Ludwig Schubert (2020)

We introduce the circuits thread of interpretability research, which aims to reverse-engineer neural networks into human-understandable algorithms. We demonstrate this approach on vision models.

Curated by @research_alice