Skip to main content
All terms
Training

Proximal Policy Optimization

A reinforcement learning method that updates a policy in small, stable steps.

Definition

Proximal Policy Optimization is a reinforcement learning algorithm (it learns from rewards by trial and error) that improves stability by limiting how much the model's behavior can change in a single update. In RLHF training, it uses a reward model to score the model's responses and nudges the model toward higher-scoring answers, while a penalty keeps it from drifting too far from its starting point. It was the standard method for RLHF before simpler approaches like DPO and GRPO gained ground.