Learning from Comparison: Constrained Projection Policy Optimization for Pareto-Front Improvement
Abstract
Constrained multi-objective reinforcement learning aims to discover a diverse set of feasible trade-offs, yet scalarization and signed, normalized group-relative advantages can be brittle under objective-scale drift, near-ties, and feasibility scarcity. We propose Constrained Projection Policy Optimization (CoPro), which alternates between an E-step moment projection and an M-step policy projection. In the E-step, we solve a Kullback-Leibler (KL)-regularized, moment-constrained projection over each sampled group to compute a nonnegative reweighting distribution q* that promotes feasible Pareto-front (PF) progress, preserves feasibility anchors, and suppresses ambiguous near-ties. This E-step admits a closed-form exponential-family solution and guarantees strictly positive probability mass on feasible anchors whenever feasible candidates appear in the group. In the M-step, we project the policy toward q* via weighted maximum likelihood with a trust-region regularizer, yielding a PF-aligned update direction learned from comparisons rather than hand-crafted reward shaping. Empirically, CoPro improves feasible PF quality and robustness on constrained multi-objective benchmarks spanning large language model tool use and analog circuit design. Code is available at https://anonymous.4open.science/r/CoPro-8A95/README.md.
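To make the two projections concrete, here is a minimal sketch of one CoPro-style iteration on a single sampled group, assuming per-candidate PF-progress scores and a feasibility mask are already available. The function names, the near-tie criterion, the anchor floor, and the trust-region surrogate are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def e_step_weights(scores, feasible, tau=1.0, tie_margin=0.05, anchor_floor=0.05):
    """E-step sketch: closed-form exponential-family reweighting for one group.

    scores:   per-candidate PF-progress scores (higher = better; assumed given)
    feasible: boolean mask of constraint-satisfying candidates
    tau, tie_margin, anchor_floor: illustrative hyperparameters, not the paper's.
    """
    scores = np.asarray(scores, dtype=float)
    feasible = np.asarray(feasible, dtype=bool)

    # Suppress ambiguous near-ties: drop candidates whose centered score lies
    # within tie_margin of the group mean (an assumed tie criterion).
    centered = scores - scores.mean()
    active = np.abs(centered) > tie_margin

    # The KL-regularized, moment-constrained projection has a softmax
    # (exponential-family) solution over the surviving candidates.
    logits = np.where(active, centered / tau, -np.inf)
    if np.isfinite(logits).any():
        q = np.exp(logits - logits.max())
    else:
        q = np.ones_like(scores)   # everything tied: fall back to uniform
    q = q / q.sum()

    # Feasibility anchors: enforce strictly positive mass on feasible
    # candidates whenever any exist (floor-and-renormalize is an assumption).
    if feasible.any():
        floor = anchor_floor / feasible.sum()
        q = np.maximum(q, np.where(feasible, floor, 0.0))
        q = q / q.sum()
    return q

def m_step_loss(logp_new, logp_old, q, beta=0.1):
    """M-step sketch: weighted maximum likelihood toward q* plus a trust region.

    logp_new: log-probs under the current policy (differentiable in a real
              implementation; plain arrays here for illustration).
    logp_old: log-probs under the behavior policy, for the KL surrogate.
    """
    weighted_nll = -np.sum(q * logp_new)
    kl_surrogate = np.sum(q * (logp_old - logp_new))  # crude reverse-KL proxy
    return weighted_nll + beta * kl_surrogate

if __name__ == "__main__":
    scores = np.array([0.9, 0.1, 0.12, -0.5])        # assumed PF-progress scores
    feasible = np.array([True, False, True, False])
    q = e_step_weights(scores, feasible)
    print(q, q[feasible].sum())  # feasible anchors keep strictly positive mass
```

The softmax form of q falls out of the exponential-family solution to the KL-regularized projection, and the post-hoc floor-and-renormalize step is one simple way to realize the strictly-positive-mass guarantee on feasible anchors; the paper's actual construction may differ.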