Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
Abstract
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties. Reinforcement-learning (RL) approaches are more flexible but often optimize sparse end-of-trajectory rewards, which yields high-variance updates and weak credit assignment. We present an RL fine-tuning recipe for diffusion unlearning that treats denoising as a sequential decision process and uses a timestep-aware critic to predict the expected terminal reward from noisy intermediate states. Concretely, we train a CLIP-based predictor on noisy intermediate states and use its per-timestep value estimates to compute advantages for PPO-style policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves strong forgetting relative to existing baselines while maintaining image quality and fidelity on benign prompts. Ablations show that (i) per-step critics and (ii) noise-conditioned value estimates are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.
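To make the credit-assignment idea concrete, the following is a minimal sketch of how per-timestep critic values could be turned into advantages and fed into a PPO-style clipped objective. It is illustrative only: the abstract does not specify the advantage estimator, so the one-step temporal-difference form used here (no discounting, no intermediate reward) is an assumption, and the function names `step_advantages` and `ppo_clip_loss` are hypothetical rather than taken from the released code.

```python
import math

def step_advantages(values, terminal_reward):
    """One-step TD advantages along a single denoising trajectory.

    values[k] is the critic's estimate of the expected terminal reward at the
    k-th noisy intermediate state (k = 0 is the initial noise sample).
    Assumption: no discounting and zero reward before the final image.
    """
    # Target for step k is the critic value of the next state; the last
    # denoising step is compared against the actual terminal reward.
    next_values = list(values[1:]) + [terminal_reward]
    return [nv - v for nv, v in zip(next_values, values)]

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (to be minimized), averaged over steps.

    logp_new / logp_old are log-probabilities of each sampled denoising
    transition under the current and the data-collecting policy.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)

# Toy trajectory with three denoising steps and a terminal reward of 1.0.
adv = step_advantages([0.2, 0.5, 0.4], terminal_reward=1.0)
loss = ppo_clip_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], adv)
```

Under these assumptions the per-step advantages telescope: their sum equals the terminal reward minus the critic's value at the initial noise, so the sparse end-of-trajectory signal is redistributed across timesteps rather than changed in total.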