Training

Proximal Policy Optimization

A reinforcement learning method that updates a policy in small, stable steps.

Definition

Proximal Policy Optimization is a reinforcement learning algorithm (it learns from rewards by trial and error) that improves stability by limiting how much the model's behavior can change in a single update. In RLHF training, it uses a reward model to score the model's responses and nudges the model toward higher-scoring answers, while a penalty keeps it from drifting too far from its starting point. It was the standard method for RLHF before simpler approaches like DPO and GRPO gained ground.

Proximal Policy Optimization

Definition

Related terms