PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF
Abstract
Reinforcement Learning from Human Feedback (RLHF) for Large Language Models increasingly relies on critic-free methods as a practical alternative to actor–critic training. Despite their simplicity, existing critic-free approaches propagate a trajectory-level learning signal uniformly across all tokens in the trajectory. This requires a full-trajectory update for every rollout, incurring substantial optimization cost for long reasoning traces even though the feedback signal is effectively determined early in the trajectory. We propose Prefix-Sampling Proximal Policy Optimization (PS-PPO), a compute-efficient critic-free method for RLHF that exploits this temporal redundancy. PS-PPO introduces a prompt-conditioned cutoff distribution and samples a cutoff timestep for each trajectory. Policy gradient updates are then applied only to tokens up to the sampled cutoff, while a correction mechanism keeps the resulting truncated gradient estimator unbiased with respect to the full-trajectory objective. This procedure bypasses later tokens whose contribution to the feedback signal is negligible, without distorting the underlying learning signal. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO achieves large reductions in training compute and peak GPU memory while maintaining accuracy comparable to strong critic-free baselines.
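The abstract does not specify the cutoff distribution or the exact correction mechanism, so the following is only an illustrative sketch of one way a randomly truncated per-token loss can be kept unbiased: sample a cutoff from a (here, truncated geometric) distribution and divide each retained token's loss by its inclusion probability. The function names `sample_cutoff` and `prefix_weighted_loss` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only; the cutoff distribution and correction used by
# PS-PPO are not given in the abstract. This assumes a truncated geometric
# cutoff and inverse-inclusion-probability weights, a standard way to keep
# a randomly truncated sum unbiased in expectation.
import torch


def sample_cutoff(seq_len: int, p: float = 0.1) -> tuple[int, torch.Tensor]:
    """Sample a cutoff step c in [1, seq_len] and return, for each token
    position t (0-indexed), the inclusion probability P(c >= t + 1)."""
    probs = p * (1 - p) ** torch.arange(seq_len, dtype=torch.float32)
    probs[-1] = 1.0 - probs[:-1].sum()           # fold the tail mass into the last step
    c = int(torch.multinomial(probs, 1)) + 1     # cutoff in 1..seq_len
    incl = torch.flip(torch.cumsum(torch.flip(probs, [0]), 0), [0])  # P(c >= t + 1)
    return c, incl


def prefix_weighted_loss(per_token_pg_loss: torch.Tensor) -> torch.Tensor:
    """Keep only the sampled prefix of the per-token policy-gradient loss,
    reweighted so the truncated estimator matches the full-trajectory
    objective in expectation."""
    seq_len = per_token_pg_loss.shape[0]
    c, incl = sample_cutoff(seq_len)
    return (per_token_pg_loss[:c] / incl[:c]).sum()
```

Under this construction, token t enters the loss only when the sampled cutoff covers it, which happens with probability `incl[t]`; dividing by that probability makes the expected truncated loss equal the full-trajectory loss, mirroring the unbiasedness property the abstract claims for PS-PPO's correction mechanism.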