Responsible Text-to-Image Diffusion: Interpretable and Linearly Controllable Semantics for Fair and Safe Generation
Abstract
Text-to-image (T2I) diffusion models (DMs) have achieved remarkable generative quality but still risk producing biased and inappropriate images. A promising line of prior work aims to mitigate this issue by learning interpretable and linearly controllable concepts from semantic spaces, such as the U-Net bottleneck. However, these methods rely entirely on the U-Net bottleneck layer and therefore cannot be directly applied to modern ViT-based DMs such as FLUX and PixArt. In this work, we present a model-agnostic framework for discovering interpretable and linearly controllable semantic attributes across arbitrary T2I DM backbones. We first show that multi-modal attention heads in ViT-based DMs encode interpretable and (near-)linear semantic structures similar to those in the U-Net bottleneck. Based on this insight, we introduce a method that learns external concept vectors, which are added to the multi-modal attention heads for ViT-based DMs or to the bottleneck layer for U-Net-based DMs, while keeping the pretrained model frozen. Experiments across SDXL, SD3.5, PixArt, and FLUX demonstrate that these concept vectors provide interpretability, linearity, and substantially improved fairness while preserving visual fidelity. The code is included in the supplementary material.
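To make the core mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a single learned concept vector added to the activations of a chosen site (a multi-modal attention head or the U-Net bottleneck) while the pretrained model stays frozen. All names here (`ConceptShift`, `attach_concept_shift`, `site`, `scale`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ConceptShift(nn.Module):
    """External concept vector added to a frozen layer's activations.

    Hypothetical sketch: `dim` is the channel width of the target site
    (a multi-modal attention head in ViT-based DMs, or the U-Net
    bottleneck in U-Net-based DMs).
    """

    def __init__(self, dim: int):
        super().__init__()
        # The only trainable parameter; the pretrained model itself is frozen.
        self.concept = nn.Parameter(torch.zeros(dim))

    def forward(self, activations: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        # Linear edit: shift activations along the learned concept direction,
        # with `scale` controlling the edit strength at inference time.
        return activations + scale * self.concept


def attach_concept_shift(model: nn.Module, site: nn.Module,
                         shift: ConceptShift, scale: float = 1.0):
    """Freeze the pretrained model and hook the shift onto `site`'s output."""
    for p in model.parameters():
        p.requires_grad_(False)  # keep pretrained weights frozen

    def hook(_module, _inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return shift(output, scale=scale)

    return site.register_forward_hook(hook)
```

Under this reading, only `shift.concept` is optimized during training, and varying `scale` at inference would trace out the (near-)linear semantic edit the abstract refers to.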