What is mechanistic interpretability?

Mechanistic interpretability (MI) is a subfield of the broader field of interpretability. The goal of MI is to “reverse engineer neural networks into human-understandable algorithms”. The central tenet of the field is that this reverse engineering is possible.

The field can be understood as resting on three core hypotheses, analogous to Theodor Schwann’s cell theory[1]:

  1. Features: The building blocks of neural networks are features, which can be studied.

  2. Circuits: Features are connected by weights, forming circuits. These are subgraphs of the neural network.

  3. Universality: Similar features and circuits appear in different models and tasks.

Current MI research focuses on identifying the circuits within a model that produce particular behaviors, and on phenomena such as grokking, superposition, and phase changes.
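To make the first two hypotheses concrete, here is a minimal toy sketch in numpy (all shapes, names, and the choice of a single hidden unit are invented for illustration, not taken from any particular MI method): it treats a “feature” as a direction in hidden-activation space and a “circuit” as the path of weights that reads into and out of that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: x -> h = relu(W1 @ x) -> y = W2 @ h
W1 = rng.normal(size=(16, 8))   # input (8) -> hidden (16)
W2 = rng.normal(size=(4, 16))   # hidden (16) -> output (4)

def forward(x):
    h = np.maximum(0.0, W1 @ x)  # hidden activations
    return h, W2 @ h

# "Feature": a direction in hidden-activation space. Here we simply take
# the unit vector along hidden neuron 3; in practice features are found
# with probes, dictionary learning, and similar techniques.
feature = np.zeros(16)
feature[3] = 1.0

x = rng.normal(size=8)
h, y = forward(x)
print("feature activation:", feature @ h)

# "Circuit": the subgraph of weights connecting inputs to outputs
# *through* this feature. For a single hidden unit that subgraph is the
# rank-1 path formed by its incoming and outgoing weights.
incoming = W1[3, :]                            # how inputs write to the feature
outgoing = W2[:, 3]                            # how the feature writes to outputs
circuit_effect = np.outer(outgoing, incoming)  # input -> output map via the feature
print("circuit (path) contribution shape:", circuit_effect.shape)
```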

Further reading:


  1. The first two parts of Theodor Schwann’s theory propose that cells form the basic units of life and make up all organisms; these are analogous to features and circuits. The last part proposes that cells arise from pre-existing cells, which is analogous to universality: certain functional motifs are preserved across species. ↩︎