Mech-Interp 101

(2024.June)

What’s Mechanistic Interpretability (MI)

Mechanistic interpretability aims to reverse-engineer neural networks into understandable algorithms. This means examining the internal mechanisms and representations these networks have learned, which is crucial for ensuring safety and alignment in AI systems.

Interpretability is the ability for the decision processes and inner workings of AI and machine learning systems to be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model’s output, but the model can’t tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of neural-network interpretability is mechanistic interpretability, which attempts to understand how neural networks perform the tasks they perform, for example by finding circuits in transformer models. This contrasts with subfields of interpretability that attribute an output to parts of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification “horse”.
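To make the contrast concrete, here is a minimal sketch of the second kind of interpretability: a vanilla gradient saliency map in PyTorch that highlights which pixels most influence a classifier’s score for a horse-like class. The model choice, the random stand-in image, and the class index are illustrative assumptions, not taken from any cited work.

```python
# Input attribution (not MI): which pixels most affect one class score?
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Stand-in for a real photo; enable gradients w.r.t. the pixels themselves.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
HORSE_CLASS = 339  # illustrative ImageNet index ("sorrel"); treat as a placeholder

logits = model(image)
logits[0, HORSE_CLASS].backward()          # gradient of the class score w.r.t. pixels
saliency = image.grad.abs().max(dim=1)[0]  # per-pixel importance, shape (1, 224, 224)
print(saliency.shape)
```

A saliency map like this says *where* in the input the decision was sensitive, but nothing about *what algorithm* the network ran internally; that latter question is what mechanistic interpretability tries to answer.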

Recommended mindset

Why MI is not just a theory

On GPT-2, Wang et al. (2022) used MI techniques to derive an interpretable algorithm that the network uses to solve an NLP task (indirect object identification). They argued that this algorithm is faulty, and showed that running adversarial samples built around its failure modes does cause the network to produce the predicted wrong results. (Summary)
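The workhorse technique behind this kind of circuit analysis is activation patching: run the model on a corrupted prompt while splicing in activations cached from a clean prompt, and measure how much of the original behaviour comes back. Below is a minimal sketch assuming the transformer_lens library; the prompts and the specific layer/head are illustrative placeholders, not the exact setup of the paper.

```python
# Activation patching sketch: patch one attention head's output from a clean
# run into a corrupted run and check how much the clean prediction recovers.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean   = "When Mary and John went to the store, John gave a drink to"
corrupt = "When Mary and John went to the store, Anna gave a drink to"

clean_tokens   = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
# Patching positions only lines up if both prompts tokenize to the same length.
assert clean_tokens.shape == corrupt_tokens.shape

# Cache every activation on the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

LAYER, HEAD = 9, 9  # placeholder choice; the paper locates specific heads

def patch_head_output(z, hook):
    # z: [batch, position, head, d_head]; overwrite one head with its clean value
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head_output)],
)
mary_token = model.to_single_token(" Mary")
print(patched_logits[0, -1, mary_token].item())
```

If patching a single head substantially restores the logit for the correct name, that head is doing real causal work for the task; sweeping this over all layers, heads, and positions is how a circuit gets mapped out.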

Open Problems

Tutorials

Playground: neuronpedia.org by Neal

Reads