What is mechanistic interpretability?

Mechanistic interpretability (MI) is a subfield of the broader field of interpretability. The goal of MI is to “reverse engineer neural networks into human-understandable algorithms”. The central tenet of the field is that this reverse engineering is possible.

The field can be understood as resting on three core hypotheses, analogous to Theodor Schwann’s cell theory[1]:

  1. Features: The building blocks of neural networks are features, which can be studied.

  2. Circuits: Features are connected by weights, forming circuits. These are subgraphs of the neural network.

  3. Universality: Similar features and circuits appear in different models and tasks.

Current MI research focuses on identifying the circuits within a model that produce particular behaviors, and on phenomena such as grokking, superposition, and phase changes.
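To make the first two hypotheses concrete, here is a minimal toy sketch in numpy (all shapes, names, and the choice of a single hidden unit are invented for illustration, not taken from any particular MI method): it treats a “feature” as a direction in hidden-activation space and a “circuit” as the path of weights that reads into and out of that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: x -> h = relu(W1 @ x) -> y = W2 @ h
W1 = rng.normal(size=(16, 8))   # input (8) -> hidden (16)
W2 = rng.normal(size=(4, 16))   # hidden (16) -> output (4)

def forward(x):
    h = np.maximum(0.0, W1 @ x)  # hidden activations
    return h, W2 @ h

# "Feature": a direction in hidden-activation space. Here we simply take
# the unit vector along hidden neuron 3; in practice features are found
# with probes, dictionary learning, and similar techniques.
feature = np.zeros(16)
feature[3] = 1.0

x = rng.normal(size=8)
h, y = forward(x)
print("feature activation:", feature @ h)

# "Circuit": the subgraph of weights connecting inputs to outputs
# *through* this feature. For a single hidden unit that subgraph is the
# rank-1 path formed by its incoming and outgoing weights.
incoming = W1[3, :]                            # how inputs write to the feature
outgoing = W2[:, 3]                            # how the feature writes to outputs
circuit_effect = np.outer(outgoing, incoming)  # input -> output map via the feature
print("circuit (path) contribution shape:", circuit_effect.shape)
```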

Further reading:


  1. The first two parts of Theodor Schwann’s theory propose that cells form the basic units of life and make up all organisms; these are analogous to features and circuits. The last part proposes that cells arise from pre-existing cells, which is analogous to universality: certain functional motifs are preserved across species. ↩︎