Safety & Alignment

Corrigibility

A system's willingness to accept correction, modification, or shutdown by authorized humans.

Definition

Corrigibility is a desirable property whereby an AI system supports, rather than resists, modification, correction, or shutdown by authorized humans. A corrigible agent does not place excessive value on its own continued operation or on preserving its current goals, which makes it easier for humans to fix mistakes and update its objectives as understanding improves. Achieving corrigibility in highly capable systems is considered an open problem in AI safety research.
