Stabilizing PPO via Latent-Space Regularization and KDE-Driven Exploration
Abstract
Proximal Policy Optimization (PPO) is widely used for continuous-control tasks, yet its performance is often highly sensitive to training dynamics when neural networks approximate the policy and value functions. This paper introduces SPPO, a drop-in augmentation that preserves PPO’s clipped objective and network architecture while stabilizing actor-critic geometry through three mechanisms: (i) a constraint on critic representations based on centered kernel alignment (CKA), (ii) a no-flip regularizer on actor updates, and (iii) advantage shaping driven by kernel density estimation (KDE). Theoretical analysis shows that these mechanisms tighten bounds on one-step bootstrapping error, improve the expected directional alignment of action updates, and ensure non-decreasing occupancy mass over high-novelty regions. Experiments on standard continuous-control benchmarks demonstrate consistent gains over PPO and recent PPO stabilization methods, and ablation studies quantify the contribution and complementarity of each component. Additional analyses of training dynamics indicate that SPPO dampens oscillations in both actor and critic updates, improving training stability and final performance.
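To make the three mechanisms concrete, the sketch below shows one plausible way they could be combined with PPO's clipped objective. The specific forms chosen here (linear CKA between current and previous critic features, a penalty on per-action probability updates that move against the sign of the advantage, and a novelty bonus from a Gaussian KDE over the state batch), the coefficients beta_cka, beta_flip, beta_kde, and the policy.log_prob / value_fn(..., return_features=True) interfaces are all illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def linear_cka(X, Y, eps=1e-8):
    # Linear centered kernel alignment between two feature batches
    # X: (n, d1), Y: (n, d2); returns a scalar similarity in [0, 1].
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).pow(2).sum()
    norm_x = (X.T @ X).pow(2).sum().sqrt()
    norm_y = (Y.T @ Y).pow(2).sum().sqrt()
    return hsic / (norm_x * norm_y + eps)

def kde_novelty(states, bandwidth=0.5, eps=1e-8):
    # Gaussian-kernel density estimate over the state batch;
    # low estimated density -> high novelty bonus (assumed form).
    d2 = torch.cdist(states, states).pow(2)
    density = torch.exp(-d2 / (2 * bandwidth ** 2)).mean(dim=1)
    return 1.0 / (density + eps)

def sppo_loss(policy, value_fn, states, actions, advantages, returns,
              old_log_probs, prev_critic_feats,
              clip_eps=0.2, beta_cka=0.1, beta_flip=0.1, beta_kde=0.05):
    # (iii) KDE-driven advantage shaping: add a novelty bonus to the advantages.
    shaped_adv = advantages + beta_kde * kde_novelty(states)

    # Standard PPO clipped surrogate, computed on the shaped advantages.
    log_probs = policy.log_prob(states, actions)          # hypothetical interface
    ratio = torch.exp(log_probs - old_log_probs)
    surr = torch.min(ratio * shaped_adv,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * shaped_adv)
    policy_loss = -surr.mean()

    # (ii) "No-flip" regularizer (assumed form): penalize per-action log-prob
    # updates whose direction disagrees with the sign of the advantage.
    flip_penalty = F.relu(-(log_probs - old_log_probs)
                          * torch.sign(advantages)).mean()

    # Value loss plus (i) a CKA penalty that keeps the critic's representation
    # close to the previous iteration's features in representational terms.
    values, critic_feats = value_fn(states, return_features=True)  # hypothetical
    value_loss = F.mse_loss(values, returns)
    cka_penalty = 1.0 - linear_cka(critic_feats, prev_critic_feats)

    return (policy_loss + value_loss
            + beta_cka * cka_penalty + beta_flip * flip_penalty)
```

Under these assumptions, the CKA and no-flip terms act purely as added penalties, so the clipped surrogate and the actor-critic architecture are left unchanged, which is what makes the method a drop-in augmentation.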