Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
Abstract
Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text–image alignment and high visual quality, yet often produce overly similar samples for a fixed prompt. Existing diversity-enhancement methods can increase sample-to-sample variation, but they typically rely on extra sampling, auxiliary optimization, or careful tuning, incurring non-trivial runtime and memory overhead. We examine intermediate Transformer features and observe that the lowest-frequency (DC) component rapidly homogenizes across seeds early in generation, inducing an early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (\textbf{DAVE}), a training-free, representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline, incurs negligible overhead, and improves prompt-consistent diversity without sacrificing image quality.
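To make the intervention concrete, the sketch below illustrates one plausible reading of DC attenuation applied to intermediate Transformer features during the early sampling steps. It is a minimal illustration, not the paper's exact method: interpreting the DC component as the per-channel mean over tokens, and the names `attenuate_dc`, `maybe_attenuate`, `alpha`, and `early_fraction`, are assumptions; the paper's actual layer selection, attenuation schedule, and frequency decomposition may differ.

\begin{verbatim}
import torch

def attenuate_dc(features: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Attenuate the lowest-frequency (DC) component of token features.

    features: (batch, tokens, dim) intermediate Transformer activations.
    alpha: scale in [0, 1]; 1.0 leaves features unchanged, 0.0 removes
    the DC part entirely.
    (Illustrative only: treats the per-channel mean over tokens as the
    DC component.)
    """
    # DC component: per-channel mean over the token axis.
    dc = features.mean(dim=1, keepdim=True)
    # Scale down the DC part; keep the higher-frequency residual intact.
    return alpha * dc + (features - dc)

def maybe_attenuate(features: torch.Tensor, step: int, total_steps: int,
                    early_fraction: float = 0.3, alpha: float = 0.5) -> torch.Tensor:
    # Apply the attenuation only in the early regime of the sampling
    # trajectory (hypothetical schedule; the fraction is an assumption).
    if step < early_fraction * total_steps:
        return attenuate_dc(features, alpha)
    return features
\end{verbatim}

In use, such a hook would wrap the output of selected Transformer blocks inside the existing sampler loop, leaving the rest of the sampling pipeline untouched, which is consistent with the training-free, negligible-overhead claim above.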