Inference-Time Concept Removal Guidance for Text-to-Image Diffusion Models
Abstract
Text-to-image diffusion models remain vulnerable to adversarial prompts that elicit disallowed content, motivating reliable inference-time controls. A popular approach is negative guidance, which subtracts a negative-prompt direction with a fixed weight. However, it often forces a safety–fidelity trade-off: over-applied, it causes artifacts or prompt drift; under-applied, it fails under attack. Recent dynamic variants reweight guidance using posterior-odds signals, which can be brittle for open-vocabulary compositional prompts, while lightweight similarity-based methods do not leverage the evolving image evidence along the denoising trajectory. We introduce Concept Removal Guidance (CRG), a training-free, plug-and-play method that estimates unwanted-concept presence at each diffusion step using only the model's noise predictions, and then adaptively gates and calibrates negative guidance via a closed-form constrained update that enforces a target presence threshold while minimally perturbing the conditional trajectory. Across multiple red-teaming benchmarks, CRG significantly reduces attack success rates while improving benign fidelity, and it generalizes to additional suppression targets such as artist style and violence, all without fine-tuning or external classifiers.
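To make the mechanism in the abstract concrete, the following is a minimal, hypothetical sketch of one such gated update. It assumes a presence score given by the projection of the conditional guidance direction onto the (normalized) negative-prompt direction, and a closed-form half-space projection that subtracts only the excess above a threshold `tau` — the minimal L2 perturbation enforcing the constraint. The function name `crg_step`, the specific presence score, and `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def crg_step(eps_cond, eps_uncond, eps_neg, tau=0.0):
    """Hypothetical sketch of one adaptively-gated negative-guidance step.

    eps_cond, eps_uncond, eps_neg: the model's noise predictions for the
    conditional prompt, the unconditional (empty) prompt, and the
    negative (unwanted-concept) prompt at the current diffusion step.
    Returns the adjusted conditional prediction and the presence score.
    """
    # Unit direction associated with the unwanted concept.
    d = eps_neg - eps_uncond
    d_hat = d / (np.linalg.norm(d) + 1e-8)

    # Presence score: component of the conditional guidance along d_hat.
    p = float(np.dot((eps_cond - eps_uncond).ravel(), d_hat.ravel()))

    # Closed-form constrained update: remove only the excess above tau,
    # leaving the prediction untouched when the concept is absent (p <= tau).
    excess = max(0.0, p - tau)
    return eps_cond - excess * d_hat, p
```

Because the correction is gated by `max(0, p - tau)`, benign prompts whose guidance carries no component along the concept direction pass through unchanged, which is one way a method could suppress a concept without a fixed global negative-guidance weight.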