Evolution Timeline
Paper added: Sparse Autoencoders Reveal Monosemantic Features
New technique for decomposing activations in superposition into interpretable directions
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
We use sparse autoencoders to decompose the internal activations of language models into interpretable features. Our approach reveals that models represent concepts in superposition, and we can disentangle these into monosemantic features that correspond to human-understandable concepts.
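A minimal sketch of the sparse autoencoder idea the abstract describes, assuming a PyTorch setting: an overcomplete linear encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty. Dimensions, the penalty coefficient, and names like `SparseAutoencoder` are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a larger, sparsely active feature basis."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations are non-negative and, after training, mostly zero.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the original activations;
    # the L1 term pushes each activation to be explained by only a few features.
    mse = F.mse_loss(reconstruction, activations)
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

Training then amounts to streaming the model's internal activations through this module and minimizing `sae_loss`; the learned decoder columns are the candidate monosemantic directions.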
Controversy: Do circuits generalize across models?
Debate over whether discovered circuits are universal or model-specific artifacts
On the Universality of Learned Circuits in Language Models
We investigate whether circuits discovered in one model architecture appear in other models. Our results suggest that circuit motifs are partially universal, but implementation details vary significantly across architectures.
Paper added: Scaling Monosemanticity
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
We scale sparse autoencoder methods to production language models, finding millions of interpretable features. We demonstrate features for abstract concepts, multilingual representations, and safety-relevant activations.
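At any scale, the usual way to argue a feature is interpretable is to inspect the inputs on which it fires most strongly. The sketch below, reusing the `SparseAutoencoder` sketch above, only illustrates that idea; `get_activations`, the feature index, and ranking by maximum activation are hypothetical choices, not the paper's pipeline.

```python
def top_activating_examples(sae, get_activations, texts, feature_idx, k=10):
    """Rank texts by how strongly one SAE feature fires on them.

    `get_activations(text)` is a hypothetical helper returning the model's
    internal activations for a text, shape (seq_len, d_model).
    """
    scores = []
    for text in texts:
        acts = get_activations(text)              # (seq_len, d_model)
        _, features = sae(acts)                   # (seq_len, d_features)
        scores.append((features[:, feature_idx].max().item(), text))
    # The highest-scoring snippets are then read by a human (or a model)
    # to decide what concept, if any, the feature tracks.
    return sorted(scores, reverse=True)[:k]
```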
Consensus: Circuits exist and can be isolated
Community agrees that neural networks implement interpretable algorithms, though methods for discovery remain debated
Mechanistic Interpretability Across Five Years of Research
We survey the mechanistic interpretability literature and identify common patterns. Despite methodological disagreements, we find strong evidence that neural networks implement discrete, interpretable computational circuits.
Paper added: Induction Heads in Transformers
In-context Learning and Induction Heads
We identify "induction heads" as a key mechanism for in-context learning in transformers. These circuits enable models to recognize and continue repeated patterns by attending to the token that followed an earlier occurrence of the current token and copying it forward.
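A common diagnostic for this behaviour is to feed the model a random token block repeated twice and measure how much attention each head places on the token that followed the previous occurrence of the current token. The sketch below assumes you already have a head's attention pattern as a tensor; the shapes and scoring convention are illustrative rather than taken from the paper.

```python
import torch

def induction_score(attn_pattern: torch.Tensor, rep_len: int) -> torch.Tensor:
    """Average attention paid to the token after the previous occurrence.

    `attn_pattern` has shape (n_heads, seq_len, seq_len) and was computed on a
    random block of length `rep_len` repeated twice, so for a query position i
    in the second copy the "induction" source is position i - rep_len + 1.
    """
    n_heads, seq_len, _ = attn_pattern.shape
    scores = torch.zeros(n_heads)
    for i in range(rep_len, seq_len):            # query positions in the second copy
        scores += attn_pattern[:, i, i - rep_len + 1]
    return scores / (seq_len - rep_len)          # high score => induction-head-like
```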
Paper added: Toy Models of Superposition
Toy Models of Superposition
We construct minimal toy models that demonstrate how neural networks represent more features than dimensions through superposition. This explains polysemantic neurons and informs interpretability methods.
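A sketch of the kind of toy model the abstract refers to, under simplified assumptions (uniform feature importance, a ReLU read-out, PyTorch): many sparse features are forced through a smaller hidden layer and reconstructed, and after training the weight matrix shows multiple features sharing directions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySuperpositionModel(nn.Module):
    """n_features sparse features squeezed through an n_hidden-dim bottleneck (n_hidden < n_features)."""

    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        hidden = x @ self.W.T                      # project into too few dimensions
        return F.relu(hidden @ self.W + self.b)    # try to recover every feature anyway

def sample_batch(batch=1024, n_features=20, sparsity=0.05):
    # Sparse synthetic data: each feature is usually zero, occasionally uniform in [0, 1].
    mask = (torch.rand(batch, n_features) < sparsity).float()
    return mask * torch.rand(batch, n_features)

model = ToySuperpositionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    x = sample_batch()
    loss = ((model(x) - x) ** 2).mean()            # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, model.W.T @ model.W (the feature Gram matrix) shows features
# sharing directions with interference: more features are represented than
# there are hidden dimensions.
```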
Paper added: A Mathematical Framework for Transformer Circuits
A Mathematical Framework for Transformer Circuits
We develop mathematical tools for reverse-engineering transformer models as computational circuits. Our framework enables precise claims about how transformers implement algorithms.
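As an example of the kind of precise claim the framework supports, a one-layer attention-only transformer can be written as a sum of end-to-end paths: a direct path from embedding to unembedding, plus one term per attention head. The notation below is a simplified paraphrase of that decomposition (layer norms and biases omitted).

```latex
% A^h \otimes M reads: mix positions with the attention pattern A^h,
% act on each position's residual vector with M.
T \;=\; \mathrm{Id} \otimes W_U W_E
  \;+\; \sum_{h \in \mathrm{heads}} A^h \otimes \bigl(W_U W_{OV}^h W_E\bigr),
\qquad
A^h = \operatorname{softmax}\!\bigl(t^{\top} W_E^{\top} W_{QK}^h W_E\, t\bigr)
```

Here t is the one-hot token matrix, W_OV^h = W_O^h W_V^h is the head's OV circuit (what it writes into the residual stream), and W_QK^h = W_Q^{h,T} W_K^h is its QK circuit (where it attends); each summand is a low-rank, separately analyzable path from input tokens to output logits.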
Paper added: Feature Visualization
Feature Visualization
How can we see what neural networks are looking for? Feature visualization uses optimization to find inputs that maximally activate neurons, revealing learned representations.
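A minimal sketch of that optimization loop, assuming PyTorch and the torchvision GoogLeNet (InceptionV1) weights; the layer, channel index, learning rate, and iteration count are arbitrary, and real feature visualization adds regularizers (transformations, frequency penalties) that this sketch omits.

```python
import torch
import torchvision.models as models

# Gradient ascent on the input image to maximize one channel's mean activation.
model = models.googlenet(weights="DEFAULT").eval()

activation = {}
def hook(_module, _inp, out):
    activation["value"] = out
model.inception4a.register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

channel = 97  # arbitrary channel to visualize
for _ in range(256):
    optimizer.zero_grad()
    model(image)
    # Maximize the chosen channel's mean activation (minimize its negative).
    loss = -activation["value"][0, channel].mean()
    loss.backward()
    optimizer.step()

# `image` now approximates an input that strongly drives the chosen channel.
```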
Programme founded
Mechanistic Interpretability programme created to track research on understanding neural network internals
Zoom In: An Introduction to Circuits
We introduce the circuits thread of interpretability research, which aims to reverse-engineer neural networks into human-understandable algorithms. We demonstrate this approach on vision models.