All terms
Training
Direct Preference Optimization
Aligning a model directly from preferred-versus-rejected response pairs, without a reward model.
Definition
Direct Preference Optimization improves a model using pairs of responses where one is preferred over the other, adjusting the model directly toward the preferred answers. It is set up so the model itself implicitly judges quality, giving a simple training target. This avoids the separate scoring model and the trial-and-error reinforcement learning of classic RLHF (training a model from human ratings of its answers), making the process simpler and more stable.