TextAtlas5M: A Large-Scale Dataset for Long Text Image Generation
Abstract
Text-conditioned image generation has made rapid progress, yet rendering images that contain long-form text remains challenging because existing datasets predominantly focus on short, simple text. We introduce TextAtlas5M, a large-scale dataset designed to evaluate long-text rendering, where “long text” encompasses not only textual length but also layout complexity and semantic richness. TextAtlas5M contains 5 million generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models. We further curate 4,000 human-improved test cases (TextAtlasEval) spanning four domains, forming one of the most extensive benchmarks for text rendering. Evaluations show that TextAtlas5M poses substantial challenges even for state-of-the-art proprietary models (e.g., GPT-4o), and that open-source models lag considerably further behind. Training on TextAtlas5M consistently improves text rendering for both diffusion-based and autoregressive models, demonstrating its effectiveness for advancing text-rich image generation.