All terms
Safety & Alignment
Superposition
When a network packs more features than it has neurons by overlapping their directions.
Definition
Superposition is an idea from interpretability research that a neural network can store more features than it has neurons by letting them share and overlap the same internal space. This works because most features rarely show up at the same time, so the network tolerates the slight mix-up. Superposition makes individual neurons hard to read and motivates tools like sparse autoencoders to recover cleaner features.