Training

RLHF

Aligning a model with human preferences by optimizing it against a reward signal learned from human feedback.

Definition

RLHF (Reinforcement Learning from Human Feedback) aligns a model with human preferences in three steps: collecting human rankings of model outputs, training a reward model to predict those preferences, and then optimizing the language model against that reward with reinforcement learning (often PPO). It is a key step in turning a raw pretrained model into a helpful, harmless assistant. Newer variants such as DPO skip the explicit reward model and optimize the language model directly on the preference data.
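To make the two objectives in the definition concrete, here is a minimal sketch in plain Python of the pairwise loss commonly used to train a reward model on ranked outputs, and of the DPO loss that replaces it. The function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library; a real implementation would operate on batched tensors and summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen and r_rejected are the reward model's scalar scores for the
    output the human preferred and the one they rejected (hypothetical inputs).
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss on one preference pair, with no explicit reward model.

    Each argument is the log-probability of a full response under either
    the policy being trained or a frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # The loss is lower when the preferred output is scored higher.
    print(reward_model_loss(2.0, 0.0))
    print(reward_model_loss(0.0, 2.0))
    # Made-up log-probabilities for a single preference pair.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Both losses shrink as the preferred output is scored (or made more likely) relative to the rejected one. The difference is that the reward-model loss trains a separate scorer that is later used during reinforcement learning, while the DPO loss is applied to the language model itself.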

Related terms

Fine-Tuning
LLM