FormalImG: Evaluating Structural Compositional Generalization for T2I Models
Hong-Jie You ⋅ Jie-Jing Shao ⋅ Xiao-Wen Yang ⋅ Zhi-Fan Wu ⋅ Lin-Han Jia ⋅ Lan-Zhe Guo ⋅ Yu-Feng Li
Abstract
As natural language becomes the primary interface for image generation, evaluating semantic generalization under language instructions is increasingly important. Existing benchmarks emphasize combinations of concepts but rarely examine the internal semantic structure of language. We introduce FormalImG, a first-order-logic-based benchmark for structural compositional generalization. We formalize natural language instructions as logical formulas and define structural compositional complexity and $\varepsilon$-structural compositional generalizability to measure how model performance changes as semantic dependency increases. The benchmark includes two evaluation scenarios and 4,000 instructions across multiple complexity levels, assessed through symbolic verification and model-as-judge evaluation. Experiments show that mainstream text-to-image models suffer a clear performance decline as structural complexity grows, remaining stable mainly at low complexity levels. Further analysis indicates that large language models already handle textual structural reasoning well, while the language-to-vision transformation stage is the major bottleneck.