Safety & Alignment

Mechanistic Interpretability

Reverse-engineering the internal circuits and features that produce model behavior.

Definition

Mechanistic interpretability studies a model's internals at the level of circuits and features, treating the trained network as a program to be reverse-engineered. Researchers identify circuits (groups of neurons that together perform a computation) and features (directions in activation space that correspond to human-interpretable concepts), then test those hypotheses by intervening on the model's activations and observing how its behavior changes. The goal is a concrete, scientific understanding of how models compute, which in turn supports safety and alignment work.
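To make "a feature is a direction in activation space" concrete, here is a minimal sketch in plain Python. It assumes a feature can be represented as a single vector, measures how strongly a toy activation expresses that feature by projecting onto it, and then removes the feature by subtracting that component (one simple form of ablation). The activation values, the direction, and the helper names `feature_activation` and `ablate_feature` are all hypothetical and chosen for illustration; they do not come from any real model or library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def feature_activation(activation, direction):
    """Scalar projection of an activation onto the unit feature direction."""
    return dot(activation, normalize(direction))


def ablate_feature(activation, direction):
    """Subtract the component of the activation that lies along the feature direction."""
    unit = normalize(direction)
    strength = dot(activation, unit)
    return [a - strength * u for a, u in zip(activation, unit)]


# Hypothetical 4-dimensional activation and feature direction (made-up values).
activation = [2.0, -1.0, 0.5, 3.0]
direction = [1.0, 0.0, 0.0, 1.0]

before = feature_activation(activation, direction)
ablated = ablate_feature(activation, direction)
after = feature_activation(ablated, direction)

print(f"feature strength before ablation: {before:.4f}")
print(f"feature strength after ablation:  {after:.4f}")
```

After ablation the projection onto the feature direction is zero up to floating-point error, while components orthogonal to the direction are unchanged. In practice one would apply such an edit to a real model's activations and check whether the downstream behavior changes as the hypothesis predicts.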
