Training
Reward Model
A model that scores outputs by how much humans would prefer them.
Definition
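A reward model learns to predict how much humans would prefer a given output, producing a scalar score that serves as the reward signal for reinforcement learning during alignment. It is trained on preference data in which annotators compare pairs of responses to the same prompt and indicate which one they prefer. Methods like Direct Preference Optimization (DPO) skip the explicit reward model and optimize the policy on the preference pairs directly, but a reward model remains central to many RLHF (reinforcement learning from human feedback) pipelines.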
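To make the training signal concrete, here is a minimal sketch of one commonly used objective for pairwise preference data, the Bradley-Terry style loss. The definition above does not name a specific loss, so this choice is an assumption, and the function names and plain-float scores are hypothetical stand-ins for the outputs of a real neural reward model.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one annotated pair: -log(sigmoid(score_chosen - score_rejected)).

    Assumption: the Bradley-Terry model, where the probability that the
    chosen response is preferred is sigmoid of the score difference.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (score_chosen, score_rejected) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical scores for three annotated comparisons.
example_pairs = [(1.5, 0.2), (0.1, 0.4), (2.0, -1.0)]
batch_loss = mean_preference_loss(example_pairs)
```

The loss shrinks as the model scores the preferred response further above the rejected one, and grows when the ordering is wrong, which is what pushes the scores toward the annotators' preferences during training.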