Mech-Interp 101
(2024.June)
What’s Mechanistic Interpretability (MI)?
Mechanistic interpretability aims to reverse-engineer neural networks into understandable algorithms. This means examining the internal mechanisms and representations these networks have learned, which is crucial for ensuring safety and alignment in AI systems.
Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.
Present-day machine learning systems are typically not very transparent or interpretable. You can use a model’s output, but the model can’t tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.
A prominent subfield of interpretability of neural networks is mechanistic interpretability, which attempts to understand the algorithms by which neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that attribute an output to some part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification “horse”.
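To make the contrast concrete, here is a minimal sketch of the attribution-style approach just described: a vanilla gradient saliency map over input pixels. The pretrained ResNet and the random stand-in image are illustrative assumptions, not tied to any particular paper.

```python
# Gradient saliency: attribute the top-class score to input pixels.
# Illustrative sketch; the model and input are placeholders.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(image)
logits[0, logits.argmax()].backward()  # gradient of the predicted class score

# Per-pixel saliency: large values mark pixels the prediction is sensitive to.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)
```

This tells you *which* pixels mattered, but not *how* the network combined them; mechanistic interpretability asks the latter question.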
Recommended mindset
- Map neural network computations
- Understand how individual neurons and layers contribute to decision-making (see the neuron-inspection sketch after this list)
- Treat neural networks like complex computational or biological mechanisms
- Create methods to verify and validate AI decision processes
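As a concrete instance of the second point, here is a minimal sketch of inspecting a single MLP neuron, assuming the TransformerLens library (`pip install transformer-lens`); the layer and neuron indices are arbitrary placeholders, not a known interesting neuron.

```python
# Inspect how one MLP neuron responds token by token.
# Sketch only: layer 5, neuron 123 are arbitrary choices.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
text = "The Eiffel Tower is located in Paris, the capital of France."
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)  # record all intermediate activations

layer, neuron = 5, 123
acts = cache[utils.get_act_name("post", layer)][0, :, neuron]  # (seq_len,)

# Print each token with the neuron's activation on it.
for tok, act in zip(model.to_str_tokens(text), acts):
    print(f"{tok!r:>16}  {act.item():+.3f}")
```

Scanning many texts for the inputs that maximally activate a neuron is a standard first step toward hypothesizing what it represents.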
Why MI is not just a theory
On GPT-2 small, Wang et al. (2022) used MI techniques to derive an interpretable algorithm that the network uses to solve an NLP task, indirect object identification. They argued that the algorithm is faulty, and showed that adversarial samples built on this insight do cause the network to produce the predicted wrong outputs. (Summary)
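For a flavor of the techniques involved, below is a minimal activation-patching sketch on the same indirect-object-identification setup, assuming the TransformerLens library. The prompts, the choice to patch the residual stream, and patching only the final position are simplifying assumptions; Wang et al. used a more fine-grained variant (path patching).

```python
# Activation patching: splice clean activations into a corrupted run
# and see how much of the correct behavior is restored, layer by layer.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean   = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
answer_id     = model.to_single_token(" Mary")  # correct for the clean prompt
distractor_id = model.to_single_token(" John")

clean_tokens   = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)  # same length as clean_tokens

def logit_diff(logits):
    # Preference for " Mary" over " John" at the final position.
    last = logits[0, -1]
    return (last[answer_id] - last[distractor_id]).item()

_, clean_cache = model.run_with_cache(clean_tokens)  # record clean activations

for layer in range(model.cfg.n_layers):
    name = utils.get_act_name("resid_pre", layer)

    def patch(resid, hook, name=name):
        # Overwrite the corrupted residual stream at the final position
        # with its value from the clean run.
        resid[:, -1, :] = clean_cache[name][:, -1, :]
        return resid

    logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(name, patch)])
    print(f"layer {layer:2d}: logit diff = {logit_diff(logits):+.3f}")
```

Layers where patching flips the logit difference back toward " Mary" are where the name information is being moved, which is the kind of evidence circuit analyses build on.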
Open Problems
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction (2022.Dec)
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (2024.May)
Tutorials
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability (Full)
- Transformers & Mechanistic Interpretability by Callum McDougall
Playground: neuronpedia.org by Neal
Reads
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- Bridging the VLM and mech interp communities for multimodal interpretability
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- A Comprehensive Mechanistic Interpretability Explainer & Glossary