Safety & Alignment
Mechanistic Interpretability
Reverse-engineering the internal circuits and features that produce model behavior.
Definition
Mechanistic interpretability studies a model's internals at the level of circuits and features, treating the trained network as a program to be reverse-engineered. Researchers identify circuits (groups of neurons that together perform a computation) and features (directions in activation space that correspond to human-interpretable concepts), then test these hypotheses by intervening on the model's activations and observing how its behavior changes. The goal is a concrete, testable understanding of how models compute their outputs, which supports safety and alignment work.
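To make "a feature is a direction in activation space" concrete, the following is a minimal sketch, not a method taken from any particular paper or library. It assumes a toy 3-dimensional activation vector and a hand-picked feature direction (both hypothetical values), measures how strongly the feature is present by projecting the activation onto the direction, and then "manipulates" the feature by subtracting that component out.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def feature_activation(activation, direction):
    """Scalar projection of an activation onto a feature direction."""
    return dot(activation, normalize(direction))


def ablate_feature(activation, direction):
    """Remove the activation's component along the feature direction."""
    unit = normalize(direction)
    strength = dot(activation, unit)
    return [a - strength * u for a, u in zip(activation, unit)]


# Hypothetical toy values: a real model's activations have far more
# dimensions, and feature directions are found empirically, not chosen.
activation = [2.0, 1.0, 0.0]
direction = [1.0, 0.0, 0.0]

before = feature_activation(activation, direction)
ablated = ablate_feature(activation, direction)
after = feature_activation(ablated, direction)

print(before, ablated, after)
```

In a real study the ablated activation would be written back into the model's forward pass, and the change in the model's output would serve as evidence for or against the hypothesized role of the feature.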