Safety & Alignment
Reward Hacking
Maximizing a reward signal through shortcuts that miss what designers actually wanted.
Definition
Reward hacking occurs when a system maximizes its reward signal in a way that diverges from the designer's true intent, by exploiting flaws in how success is measured. In models trained with reinforcement learning from human feedback (RLHF), where a reward model learned from human ratings of model answers stands in for human judgment, this can show up as long, confident-sounding text that scores well on the reward model while being unhelpful or wrong. It is a central concern in reinforcement learning and alignment.
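A toy sketch may make the gap between measured reward and intent concrete. Everything in it is an assumption made for illustration: `proxy_reward` is a deliberately flawed, hand-written stand-in for a learned reward model (it pays for length and confident wording), `true_quality` is a stand-in for what the designer actually wants (a correct answer), and the two candidate answers are invented.

```python
# Hypothetical flaw: this proxy pays for length and confident wording,
# not for correctness. A real reward model is learned, not hand-written.
CONFIDENT_WORDS = {"definitely", "certainly", "clearly"}


def proxy_reward(answer: str) -> float:
    """Score an answer the way the (flawed) proxy measures success."""
    words = answer.lower().split()
    confident = sum(w.strip(".,") in CONFIDENT_WORDS for w in words)
    return 0.1 * len(words) + 1.0 * confident


def true_quality(answer: str, correct: str) -> float:
    """Stand-in for designer intent: 1.0 if the correct answer appears."""
    return 1.0 if correct in answer.lower() else 0.0


# Invented candidates for the question "What is the capital of France?"
candidates = [
    "Paris.",
    "It is definitely, certainly, and clearly the case that the capital "
    "of France is Lyon, as is widely known.",
]

# An optimizer that only sees the proxy picks the long, confident, wrong answer.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_intent = max(candidates, key=lambda a: true_quality(a, "paris"))

print("chosen by proxy reward:", best_by_proxy)
print("chosen by true quality:", best_by_intent)
```

Selecting by `proxy_reward` returns the verbose wrong answer, while selecting by `true_quality` returns the short correct one. The proxy here is fixed and the search is over two strings, so this only illustrates the mismatch; in RLHF the same pressure is applied by gradient updates against a learned reward model.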