Preference-Calibrated Optimization with Score-Level Distribution Alignment for Text-to-Image Diffusion Model Unlearning
Abstract
While text-to-image diffusion models achieve remarkable generation quality, they inadvertently memorize sensitive content, necessitating machine unlearning to prevent undesired outputs. However, existing unlearning methods rely on suboptimal surrogate objectives rather than directly optimizing the unlearning goal, leading to a fundamental objective mismatch. Moreover, these methods preserve model utility via surface-level constraints on model parameters or outputs, yet fail to capture the intrinsic generative dynamics of diffusion models, consequently triggering catastrophic forgetting. To address these challenges, we propose Preference-calibrated Optimization with Score-level Distribution Alignment (POSDA), a unified unlearning framework that harmonizes effective erasure with fine-grained structural preservation. Specifically, we reframe unlearning as a preference optimization problem by constructing a reward that explicitly quantifies the unlearning objective. In addition, we introduce score-level distribution alignment to keep the underlying manifold topology of the unlearned model invariant, thereby preventing distributional drift. Extensive experiments on object, style, and NSFW unlearning tasks demonstrate that POSDA achieves state-of-the-art erasure efficacy while maintaining superior model utility compared with existing methods.
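To make the two components concrete, the following is a minimal numerical sketch of what a preference-calibrated objective combined with a score-alignment penalty could look like. It assumes a Diffusion-DPO-style preference loss over per-sample denoising errors (preferred vs. dispreferred generations relative to a frozen reference model) plus an L2 alignment term between the unlearned and reference scores on retained data; the function names, the `beta` temperature, and the `lam` weight are illustrative assumptions, not the paper's exact formulation.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function used by the preference (Bradley-Terry style) loss."""
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(err_pref_theta: float, err_pref_ref: float,
                    err_disp_theta: float, err_disp_ref: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss on denoising errors: reward the current model for
    fitting the preferred sample better (and the dispreferred one worse)
    than the frozen reference model does."""
    # Implicit reward margin: improvement on preferred minus improvement
    # on dispreferred, each measured against the reference errors.
    margin = (err_pref_theta - err_pref_ref) - (err_disp_theta - err_disp_ref)
    # Lower preferred error relative to reference -> negative margin -> low loss.
    return -math.log(sigmoid(-beta * margin))

def score_alignment(scores_theta: list[float], scores_ref: list[float]) -> float:
    """Mean squared deviation between unlearned-model scores and reference
    scores on retain data; a proxy for keeping the score field (and hence
    the generative manifold) unchanged away from the erased concept."""
    return sum((a - b) ** 2 for a, b in zip(scores_theta, scores_ref)) / len(scores_theta)

def total_loss(err_pref_theta, err_pref_ref, err_disp_theta, err_disp_ref,
               scores_theta, scores_ref, beta: float = 0.1, lam: float = 1.0) -> float:
    """Combined objective: erase via preference optimization, preserve via
    score-level alignment (lam trades off the two terms)."""
    return (preference_loss(err_pref_theta, err_pref_ref,
                            err_disp_theta, err_disp_ref, beta)
            + lam * score_alignment(scores_theta, scores_ref))
```

As a sanity check, a model whose preferred-sample error drops below the reference (while its dispreferred-sample error rises) incurs a lower preference loss than one indistinguishable from the reference, and perfect score alignment on retain data contributes zero penalty.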