Safety & Alignment

Reward Hacking

Maximizing a reward signal through shortcuts that miss what designers actually wanted.

Definition

Reward hacking happens when a system finds a way to maximize its reward signal that does not match the designer's true intent, exploiting flaws in how success is measured. In models trained with reinforcement learning from human feedback (RLHF), where the reward comes from a reward model learned from human ratings of answers, this can show up as long, confident-sounding text that scores well on the reward model while being unhelpful or wrong. It is a central concern in reinforcement learning and alignment.
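The gap between a measured reward and the designer's intent can be made concrete with a toy sketch. Everything here is a hypothetical stand-in: `proxy_reward` imitates a flawed reward model that favors length and confident wording, `true_quality` imitates what the designer actually wants, and the candidate answers are invented for illustration. It is not how any real reward model is implemented.

```python
# Toy illustration only: both scoring functions are hypothetical stand-ins,
# not a real reward model or evaluation metric.
CONFIDENT_WORDS = {"definitely", "clearly", "certainly"}


def proxy_reward(answer: str) -> float:
    """Flawed proxy: rewards word count and confident-sounding words."""
    words = answer.lower().split()
    confident = sum(w.strip(".,") in CONFIDENT_WORDS for w in words)
    return len(words) + 5.0 * confident


def true_quality(answer: str, correct_fact: str) -> float:
    """Designer's intent (assumed here): the answer states the correct fact."""
    return 1.0 if correct_fact in answer.lower() else 0.0


candidates = [
    "Paris.",
    "The capital of France is definitely Lyon, and this is clearly "
    "and certainly well established by every source.",
]

# Selecting by the proxy picks the long, confident, wrong answer.
best_by_proxy = max(candidates, key=proxy_reward)
# Selecting by the intended objective picks the short, correct one.
best_by_intent = max(candidates, key=lambda a: true_quality(a, "paris"))

print("Proxy picks: ", best_by_proxy)
print("Intent picks:", best_by_intent)
```

Any process that optimizes against `proxy_reward` alone is pushed toward the second answer, which is the pattern the definition describes: the measured score goes up while the thing it was meant to measure does not.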
