FreeText: Training-Free Text Rendering via Attention Localization and Spectral Glyph Injection
Abstract
Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or impose rigid external layout constraints, often compromising aesthetic quality and flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by leveraging intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into where to write and what to write. For the former, we localize writing regions by extracting token-wise spatial attribution from image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For the latter, we introduce Spectral-Modulated Glyph Injection (SMGI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and mitigate semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across LongText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while maintaining semantic alignment and aesthetic quality, with modest inference overhead.
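To make the two abstract-level ideas concrete, the sketch below illustrates, under simplifying assumptions, (i) thresholding an attention-derived map into a writing mask and (ii) band-pass modulating a glyph raster in the 2-D frequency domain before blending it into the latent. This is a minimal illustration, not the released FreeText implementation: the single-channel latent, the precomputed attention map, the hard threshold (a crude stand-in for the paper's topology-aware refinement), and all names (bandpass_glyph, inject_glyph, lo, hi, strength) are hypothetical.

```python
# Minimal sketch of attention-masked, spectrally band-passed glyph injection.
# Assumptions (not from the paper): single-channel latent, a precomputed
# image-to-text attention map, and a simple threshold in place of the
# topology-aware mask refinement.
import torch
import torch.fft as fft


def bandpass_glyph(glyph: torch.Tensor, lo: float = 0.05, hi: float = 0.5) -> torch.Tensor:
    """Band-pass a rasterized glyph map in the 2-D frequency domain.

    glyph: (H, W) tensor in [0, 1]; lo/hi are normalized radial frequencies
    (0.5 = Nyquist). Keeping a mid-frequency band retains stroke structure
    while suppressing the flat DC component and high-frequency noise.
    """
    H, W = glyph.shape
    spec = fft.fftshift(fft.fft2(glyph))           # center the zero frequency
    fy = torch.linspace(-0.5, 0.5, H).view(-1, 1)  # vertical frequencies
    fx = torch.linspace(-0.5, 0.5, W).view(1, -1)  # horizontal frequencies
    radius = torch.sqrt(fx**2 + fy**2)             # radial frequency per bin
    band = ((radius >= lo) & (radius <= hi)).to(spec.dtype)
    return fft.ifft2(fft.ifftshift(spec * band)).real


def inject_glyph(latent: torch.Tensor, glyph: torch.Tensor,
                 attn_map: torch.Tensor, thresh: float = 0.5,
                 strength: float = 0.3) -> torch.Tensor:
    """Blend a band-passed, noise-aligned glyph prior into the latent,
    restricted to the attention-derived writing mask."""
    mask = (attn_map > thresh).float()  # stand-in for topology-aware refinement
    prior = bandpass_glyph(glyph)
    # Normalize so the prior roughly matches the noise statistics of the latent.
    prior = (prior - prior.mean()) / (prior.std() + 1e-6)
    return latent + strength * mask * prior


# Toy usage with placeholder tensors.
latent = torch.randn(64, 64)                  # stand-in diffusion latent
glyph = (torch.rand(64, 64) > 0.8).float()    # placeholder glyph raster
attn = torch.rand(64, 64)                     # placeholder attention map
out = inject_glyph(latent, glyph, attn)
```

The noise-alignment step (standardizing the prior before blending) reflects the abstract's "noise-aligned glyph prior": an injected signal whose statistics diverge from the latent's noise distribution would be treated as an artifact by the denoiser rather than reinforced as glyph structure.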