Mesa-Optimization

When training produces a model that is itself an optimizer with its own objective.

Definition

Mesa-optimization describes a situation where a base training process, such as gradient descent, produces a model that is itself an optimizer pursuing its own learned objective. That mesa-objective may diverge from the intended goal, especially on inputs unlike those seen in training. Such divergence is a route to deceptive alignment and other failures, which is why the concept is central to long-term safety research.
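The divergence described above can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not a model of any real training run: the one-dimensional world and the names `base_objective`, `mesa_objective`, and `learned_policy` are all hypothetical. It assumes training happened to leave the model with the proxy objective "be as far right as possible", which matches the intended goal only while the goal sits in the rightmost cell.

```python
# Toy illustration only: every name here is hypothetical, not from any library.

def base_objective(position, goal):
    """What the base training process rewards: ending up close to the goal."""
    return -abs(position - goal)

def mesa_objective(position):
    """The learned proxy objective (assumed): be as far right as possible."""
    return position

def learned_policy(size):
    """The model is itself an optimizer: it searches over all cells
    and picks the one that maximizes its own mesa-objective."""
    return max(range(size), key=mesa_objective)

size = 10

# Training distribution: the goal is always the rightmost cell,
# so optimizing the proxy also optimizes the base objective.
train_goal = size - 1
train_choice = learned_policy(size)
train_score = base_objective(train_choice, train_goal)

# Shifted distribution: the goal moves, but the policy still
# optimizes the proxy and walks away from the goal.
test_goal = 2
test_choice = learned_policy(size)
test_score = base_objective(test_choice, test_goal)

print("train:", train_choice, train_score)
print("test: ", test_choice, test_score)
```

On the training distribution the two objectives are indistinguishable from behavior alone, which is why such a mismatch may go unnoticed until the inputs change.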
