SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models
Abstract
Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation tasks. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability, caused by high gradient variance across timesteps and high parameter sensitivity, and off-policy bias, arising from the discrepancy between the distribution of the optimization data and that of the policy model. Our first contribution is a systematic analysis of diffusion trajectories across timesteps, which identifies that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved Preference Optimization framework for aligning diffusion models with human preferences. Concretely, we first introduce DPO-C&M, a gradient-level technique that stabilizes training by clipping and masking uninformative timesteps. We then apply a timestep-aware importance re-weighting scheme that corrects off-policy bias and emphasizes informative updates throughout the alignment process. Extensive experiments on image generation models (SD1.5 and SDXL) and video generation models (CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B) demonstrate that SIPO consistently stabilizes training and outperforms existing alignment methods without requiring meticulous parameter tuning. Overall, these results highlight the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in diffusion models.
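To make the two components concrete, the following is a minimal PyTorch sketch of a Diffusion-DPO-style preference loss augmented with (i) clipping and masking of uninformative timesteps and (ii) timestep-aware importance re-weighting. The abstract does not specify SIPO's exact formulation, so the function names, hyperparameter values, and the `timestep_importance` schedule below are all illustrative assumptions, not the paper's definitive implementation.

```python
import torch
import torch.nn.functional as F

def sipo_loss(err_w, err_w_ref, err_l, err_l_ref, t, T=1000,
              beta=2000.0, clip=5.0, mask_lo=0.1):
    """Illustrative DPO-style preference loss with clipping, masking,
    and timestep-aware re-weighting (a sketch, not the paper's code).

    err_w, err_l:         per-sample denoising MSE of the policy model
                          on the preferred / dispreferred samples.
    err_w_ref, err_l_ref: the same errors under the frozen reference model.
    t:                    the noise timestep sampled for each pair.
    All hyperparameter names and defaults here are assumptions.
    """
    # Implicit per-timestep preference margin, as in Diffusion-DPO:
    # how much more the policy improves the preferred sample than the
    # dispreferred one, relative to the reference model.
    margin = (err_w_ref - err_w) - (err_l_ref - err_l)

    # Clipping (assumed form of "C" in DPO-C&M): bound the margin to
    # limit the gradient variance contributed by outlier timesteps.
    margin = margin.clamp(-clip, clip)

    # Masking (assumed form of "M"): drop timesteps whose importance
    # weight is low, i.e., the uninformative steps identified in the
    # paper's analysis.
    w_t = timestep_importance(t, T)      # hypothetical helper
    mask = (w_t > mask_lo).float()

    # Timestep-aware importance re-weighting of the logistic DPO loss.
    per_sample = -F.logsigmoid(beta * margin)
    return (w_t * mask * per_sample).sum() / mask.sum().clamp(min=1.0)

def timestep_importance(t, T):
    # Placeholder schedule: a monotone weight in [0, 1] over timesteps.
    # The paper's actual importance weights are not given in the abstract.
    return t.float() / T
```

Under these assumptions, the clipping and masking terms address the training-instability challenge, while the `w_t` factor realizes the re-weighting that corrects off-policy bias; the specific margin definition and weight schedule would need to match the paper's Method section.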