Poster Tue, Jul 7, 2026 • 10:30 PM – 12:15 AM PDT HALL A #1104

TextAtlas5M: A Large-Scale Dataset for Long Text Image Generation

Dongxing Mao ⋅ Alex Jinpeng Wang ⋅ Weiming Han ⋅ Jiawei Zhang ⋅ Zhuobai Dong ⋅ Linjie Li ⋅ Lin Yiqi ⋅ Zhengyuan Yang ⋅ Libo Qin ⋅ Fuwei Zhang ⋅ Lijuan Wang ⋅ Min Li

Abstract

Text-conditioned image generation has made rapid progress, yet rendering images with long-form text remains challenging due to the limitations of existing datasets, which predominantly focus on short and simple text. We introduce TextAtlas5M, a large-scale dataset designed to evaluate long-text rendering, where “long text” encompasses not only textual length but also layout complexity and semantic richness. TextAtlas5M contains 5 million generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models. We further curate 4,000 human-improved test cases (TextAtlasEval) spanning four domains, forming one of the most extensive benchmarks for text rendering. Evaluations show that TextAtlas5M poses substantial challenges even for state-of-the-art proprietary models (e.g., GPT-4o), with significantly larger gaps observed for open-source models. Training on TextAtlas5M consistently improves text rendering for both diffusion-based and autoregressive models, demonstrating its effectiveness for advancing text-rich image generation.

Lay Summary

Many image generation tools can now create realistic pictures from written instructions, but they still struggle when an image needs to contain a lot of readable text. This matters because many real-world images, such as posters, slides, documents, webpages, charts, and advertisements, include long text, complex layouts, and detailed meaning. In this work, we introduce TextAtlas5M, a large collection of 5 million images designed to study and improve long-text image generation. In our setting, “long text” does not only mean more words; it also covers more complex page layouts and richer meaning. We also build TextAtlasEval, a human-improved test set of 4,000 examples across four real-world domains, to better measure how well image generation systems handle text-rich images. Our experiments show that long-text image generation remains difficult even for advanced commercial systems such as GPT-4o, and the challenge is larger still for open-source models. We further show that training models on TextAtlas5M helps them generate clearer and more accurate text in images. Overall, TextAtlas5M provides both a challenging testbed and a useful training resource for improving image generation in text-heavy visual scenarios.