Data

Human Feedback

Human judgments of model outputs used as a signal to align model behavior.

Definition

Human feedback is the set of signals — preferences, rankings, corrections, or ratings — that people provide about model outputs to steer a model toward helpful and harmless behavior. It is the raw material for alignment methods such as RLHF and DPO (two techniques for tuning a model to people's preferences), where a reward model (a scoring model trained to predict which answers people prefer) or direct preference signal is derived from these judgments. The quality and consistency of human feedback shape how well a model follows intentions.

Human Feedback

Definition

Related terms