Attention Sinks in Diffusion Transformers: A Causal Analysis
Fangzheng Wu ⋅ Brian Summa
Abstract
Attention sinks (tokens that receive disproportionate attention mass) are often assumed to be functionally important in autoregressive language models, but whether such sinks are necessary in diffusion transformers remains unclear. We present a causal analysis of attention sinks in text-to-image diffusion models, dynamically identifying dominant attention recipients by their incoming attention mass. Using paired, training-free interventions along the score and value paths, we test sink necessity across layers, denoising phases, and architectures. In large-scale evaluations on 553 GenEval prompts with Stable Diffusion~3, corroborated by experiments on SDXL, we find that removing \textbf{these sinks} does not degrade text-image alignment or preference proxies under standard settings ($k{=}1$). We additionally quantify perceptual and distributional shifts relative to baseline outputs, showing that suppressing dominant recipients can alter appearance without affecting alignment or preference scores. Together, these results indicate that attention sinks are not functionally necessary for \emph{semantic alignment} in diffusion transformers, while revealing a metric-dependent boundary: preference proxies (HPS-v2) show sink-specific degradation under stronger interventions ($k \geq 10$), whereas alignment (CLIP-T) remains robust across all tested conditions.
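To make the sink-identification step concrete, the following is a minimal PyTorch sketch of selecting dominant attention recipients by incoming attention mass; the tensor shapes, function name, and top-$k$ selection here are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch (not the authors' code) of identifying dominant
# attention recipients ("sinks") by incoming attention mass.
# Assumes attention weights of shape (batch, heads, queries, keys);
# `attn`, `find_attention_sinks`, and top-k selection are hypothetical.
import torch

def find_attention_sinks(attn: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Return indices of the k tokens receiving the most attention mass.

    attn: attention weights of shape (B, H, Q, K), rows summing to 1.
    """
    # Incoming mass per key token: sum attention over batch, heads, queries.
    incoming = attn.sum(dim=(0, 1, 2))   # shape: (K,)
    # The k dominant recipients are the tokens with the largest total mass.
    return incoming.topk(k).indices

# Toy usage: random attention over 16 tokens, softmax-normalized per row.
attn = torch.softmax(torch.randn(1, 4, 16, 16), dim=-1)
print(find_attention_sinks(attn, k=1))
```

A training-free intervention would then suppress the selected tokens along the score or value path at inference time; the sketch above covers only the identification step.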