All terms
Training
Proximal Policy Optimization
A reinforcement learning method that updates a policy in small, stable steps.
Definition
Proximal Policy Optimization is a reinforcement learning algorithm (it learns from rewards by trial and error) that improves stability by limiting how much the model's behavior can change in a single update. In RLHF training, it uses a reward model to score the model's responses and nudges the model toward higher-scoring answers, while a penalty keeps it from drifting too far from its starting point. It was the standard method for RLHF before simpler approaches like DPO and GRPO gained ground.