Training

Reinforcement Learning from AI Feedback

Aligning a model using preference judgments from another model instead of humans.

Definition

Reinforcement learning from AI feedback (RLAIF) replaces or supplements human preference labels with judgments from another model, making alignment cheaper and easier to scale. A judge model evaluates candidate outputs, and those judgments are used to train a reward model whose scores serve as the reward signal for optimizing behavior. Anthropic's Constitutional AI is a well-known approach in this family.
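The loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ai_judge` is a hypothetical rule-based stand-in for a real judge model, `features` is a toy hand-written featurizer, and the reward model is a two-weight linear scorer fitted with a Bradley-Terry (pairwise logistic) loss. It is not the method of any particular system.

```python
import math

def ai_judge(prompt, a, b):
    """Hypothetical stand-in for a judge model.
    Returns 0 if response `a` is preferred, 1 if `b` is preferred."""
    rude_a = "stupid" in a.lower()
    rude_b = "stupid" in b.lower()
    if rude_a != rude_b:
        return 1 if rude_a else 0          # prefer the response that is not rude
    return 0 if len(a) >= len(b) else 1    # otherwise prefer the longer one

def features(response):
    # Toy features (assumption): a rudeness flag and a capped, scaled word count.
    return [
        1.0 if "stupid" in response.lower() else 0.0,
        min(len(response.split()), 20) / 20.0,
    ]

def reward(w, response):
    # Linear reward model: dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, features(response)))

def label_pairs(samples):
    # Step 1: the AI judge, not a human, produces (chosen, rejected) labels.
    pairs = []
    for prompt, a, b in samples:
        pairs.append((a, b) if ai_judge(prompt, a, b) == 0 else (b, a))
    return pairs

def train_reward_model(pairs, epochs=200, lr=0.5):
    # Step 2: fit the reward model so that chosen responses score higher.
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            fc, fr = features(chosen), features(rejected)
            for i in range(len(w)):
                # Gradient ascent on log p.
                w[i] += lr * (1.0 - p) * (fc[i] - fr[i])
    return w

samples = [
    ("How do I reset my password?",
     "That is a stupid question.",
     "Open Settings, choose Security, then select Reset password."),
    ("What is 2 + 2?",
     "4",
     "Stupid question, it is 4."),
    ("Can you help me write an email?",
     "Sure, tell me who it is for and what you want to say.",
     "Write it yourself, stupid."),
]

pairs = label_pairs(samples)
w = train_reward_model(pairs)

# Step 3 (simplified): use the learned reward to choose among new candidates.
candidates = ["Stupid idea.", "Sure, here is how to do it."]
best = max(candidates, key=lambda r: reward(w, r))
print(best)
```

The final best-of-n selection only stands in for the optimization step. In practice the learned reward would typically drive a reinforcement learning algorithm that updates the model's weights, and both the judge and the reward model would be neural networks rather than hand-written rules.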
