Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
Abstract
Safety alignment remains brittle under domain shift and noisy preference supervision. Existing robust alignment methods predominantly target uncertainty in the alignment data and are less effective at addressing failures caused by optimization-induced fragility. In this work, we revisit robustness for safety alignment from an optimization-geometry perspective, highlighting optimization-induced fragility as a complementary factor to data-space uncertainty. To address this, we propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control. We instantiate ShaPO at two levels: Token-level ShaPO stabilizes likelihood-based surrogate optimization, while Reward-level ShaPO enforces reward-consistent optimization and improves robustness under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. Our results highlight optimization geometry as a complementary and actionable axis for robust LLM safety alignment. The code is available at \url{https://anonymous.4open.science/r/ShaPO-D1B0}.