All terms
Data
Human Feedback
Human judgments of model outputs used as a signal to align model behavior.
Definition
Human feedback is the set of signals — preferences, rankings, corrections, or ratings — that people provide about model outputs to steer a model toward helpful and harmless behavior. It is the raw material for alignment methods such as RLHF and DPO (two techniques for tuning a model to people's preferences), where a reward model (a scoring model trained to predict which answers people prefer) or direct preference signal is derived from these judgments. The quality and consistency of human feedback shape how well a model follows intentions.