Training
RLHF (Reinforcement Learning from Human Feedback)
Aligning a model to human preferences by optimizing it, through trial and error, against a reward signal learned from human feedback.
Definition
RLHF aligns a model with human preferences in three steps: collecting human rankings of model outputs, training a reward model to predict those preferences, and then optimizing the language model against that reward with reinforcement learning (often PPO, Proximal Policy Optimization). It is a key step in turning a raw pretrained model into a helpful, harmless assistant. Newer alternatives such as DPO (Direct Preference Optimization) skip the explicit reward model and train the language model directly on the preference pairs.
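
To make the two objectives mentioned above concrete, here is a minimal sketch of the per-pair losses, assuming the common Bradley-Terry formulation for the reward model and the standard DPO objective. The function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions, not part of any particular library; real training code would operate on batched tensors and summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss for one human comparison.

    r_chosen / r_rejected are the reward model's scalar scores for the
    output the annotator preferred and the one they rejected.
    The loss is small when the chosen output is scored higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair.

    Instead of a separate reward model, it compares how much the policy
    raised the log-probability of each output relative to a frozen
    reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Reward model agrees with the annotator: low loss.
    print(reward_model_loss(2.0, 0.0))
    # Reward model disagrees with the annotator: high loss.
    print(reward_model_loss(0.0, 2.0))
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
```

Both losses have the same shape, a negative log-sigmoid of a margin between the preferred and rejected output. The difference is where the margin comes from: a learned reward score in the RLHF pipeline, or a policy-versus-reference log-probability ratio in DPO.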