Mitigating Noise-Induced Layout Priors for Object Counting in Diffusion Models
Abstract
Despite remarkable progress in text-to-image diffusion models, accurately generating the specified number of objects remains a persistent challenge. We identify the initial noise as a primary determinant of spatial layout formation, and early-stage cross-attention as the key mechanism that propagates noise-induced structures through the denoising process. We formalize this phenomenon as the \textbf{\textit{Noise-Induced Layout Prior}}. Leveraging this insight, we propose a novel training-free framework for controlling object count in diffusion models. Our approach consists of two key components: (1) a \emph{Count-Aware Noise Adjustment Strategy}, which explicitly manipulates the initial latent noise to align layout formation with the target object count, and (2) an \emph{Attention-Guided Layout Consistency Strategy}, which performs test-time optimization on early-stage cross-attention to further stabilize layout formation during denoising. Extensive experiments on both single-category and multi-category benchmarks demonstrate that our method consistently outperforms strong diffusion baselines and state-of-the-art object count control methods in both counting accuracy and image quality.