StructMAR: Structure-Aware Masked Autoregression for Explicit Layout Alignment in Text-to-Image Generation
Gang Cao ⋅ Junying Zhang
Abstract
Text-to-image generation is widely used, but many applications require strict instance-level layout alignment. Masked Autoregressive (MAR) models on continuous latents are efficient and high-fidelity, yet flattening 2D latents into 1D sequences weakens spatial topology and hinders precise control. We propose Structure-Aware RoPE-MAR (StructMAR) to turn layout alignment from soft correlation into explicit structural alignment. StructMAR integrates 2D Rotary Positional Embeddings with a Layout-Guided Attention Bias to mechanistically enforce token-to-instance correspondence. We further apply Group Relative Policy Optimization (GRPO) to better align training objectives with layout-centric evaluation. On COCO-Position, StructMAR achieves state-of-the-art alignment (57.2 AP, 79.4 mIoU) while maintaining image quality comparable to strong diffusion baselines. On COCO-MIG, it improves robustness in dense settings (ISR 61.9, mIoU 57.4) and achieves a 4.05$\times$ inference speedup. These results highlight the importance of explicit structural inductive biases for robust, efficient controllable autoregressive generation; code is available at https://anonymous.4open.science/r/StructMAR-FE92/.
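The abstract does not give the exact form of the Layout-Guided Attention Bias, so the following is only a minimal sketch of one way such a bias could tie latent tokens to instances. It assumes bounding boxes in normalised coordinates, a row-major latent grid, and a constant additive bonus on the attention logits from each image token to the prompt of the instance whose box contains it. The function names (`token_instance_mask`, `layout_bias`, `biased_attention`) and the `strength` parameter are hypothetical and are not taken from the StructMAR code.

```python
import math


def token_instance_mask(boxes, grid_h, grid_w):
    """Return mask[t][k] = True when the centre of latent token t lies in box k.

    Boxes are (x0, y0, x1, y1) in normalised [0, 1] image coordinates.
    Tokens are indexed in row-major order: t = row * grid_w + col.
    """
    mask = []
    for row in range(grid_h):
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w  # token centre, x
            cy = (row + 0.5) / grid_h  # token centre, y
            mask.append([x0 <= cx < x1 and y0 <= cy < y1
                         for (x0, y0, x1, y1) in boxes])
    return mask


def layout_bias(mask, strength=2.0):
    """Additive logit bias: `strength` where a token lies in the instance box, else 0.

    The constant-bonus form and the value of `strength` are assumptions.
    """
    return [[strength if inside else 0.0 for inside in row] for row in mask]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, bias):
    """Add the layout bias to raw attention logits, then normalise each row."""
    return [softmax([l + b for l, b in zip(lrow, brow)])
            for lrow, brow in zip(logits, bias)]


# Toy example: a 4x4 latent grid and two instances covering the left and right halves.
boxes = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
mask = token_instance_mask(boxes, grid_h=4, grid_w=4)
uniform_logits = [[0.0, 0.0] for _ in mask]  # no content preference before the bias
attn = biased_attention(uniform_logits, layout_bias(mask))
```

In this toy setup, token 0 (top-left) falls inside the first box, so with otherwise uniform logits it places more attention on the first instance than on the second. A soft additive bias is used here rather than a hard mask so that tokens outside every box still attend to all prompts; whether StructMAR uses a soft or hard variant is not stated in the abstract.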