XYZFlow: Scaling Multidimensional Shortcut Flows for Efficient Generative Modeling
Jinxiu Liu ⋅ Xuanming Liu ⋅ Kangfu Mei ⋅ Yandong Wen ⋅ Weiyang Liu
Abstract
The pursuit of high-fidelity image generation faces a fundamental trade-off between sampling speed and output quality. While diffusion models excel in quality, their iterative nature incurs high computational cost. Current efficient methods primarily distill pre-trained models into few-step samplers; however, this distillation is challenging and heavily dependent on the quality of the teacher model. In this paper, we introduce \textbf{XYZFlow}, a framework that rethinks this paradigm through multidimensional scaling of flow matching. Unlike MeanFlow's single-step deterministic mapping, our approach scales the expressive power of generative models by enhancing the uniqueness and learnability of probability paths through structured, multidimensional conditioning. Theoretically, we frame autoregressive modeling as an implicit flow-straightening mechanism, in which expanding contextual constraints reduces trajectory ambiguity. XYZFlow implements this via two orthogonal scaling dimensions: (1) temporal scaling through non-Markovian conditioning on the full denoising history, and (2) spatial scaling through our proposed Next Shortcut Prediction, where patches are generated sequentially using the complete denoising trajectories of preceding patches as priors. This multidimensional conditioning constructs a high-dimensional coordinate system for probability flows, enforcing mapping uniqueness. The Next Shortcut Prediction mechanism in particular enables efficient generation by exploiting the rich context carried by the full denoising processes of previously generated patches. Extensive evaluations demonstrate that XYZFlow achieves state-of-the-art performance, with a 7.2--8.5$\times$ speedup over its teachers while maintaining competitive FID.
Notably, our structured Next Shortcut Prediction design establishes a more parameter-efficient scaling dimension and achieves superior quality-latency trade-offs compared to simply enlarging models or compressing sampling steps.
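The two scaling dimensions described above can be illustrated with a minimal sampling-loop sketch. This is a hypothetical illustration only: the model interface (`model(z, t, history, context)`), the Euler integrator, and all parameter names are assumptions introduced here, not the authors' actual implementation.

```python
# Hypothetical sketch of XYZFlow's two conditioning dimensions:
#   - temporal scaling: each denoising step sees the patch's full history (non-Markovian),
#   - spatial scaling (Next Shortcut Prediction): each patch is generated sequentially,
#     conditioned on the complete denoising trajectories of all preceding patches.
import numpy as np


def denoise_patch(model, z, prior_trajectories, steps=4):
    """Run a short denoising trajectory for one patch."""
    trajectory = []  # full history of this patch's states (temporal conditioning)
    dt = -1.0 / steps  # integrate the flow from t=1 (noise) down to t=0 (data)
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        # The model predicts a velocity from the current state, the time,
        # this patch's denoising history, and all prior patches' trajectories.
        v = model(z, t, history=tuple(trajectory), context=prior_trajectories)
        z = z + v * dt  # one Euler step along the learned flow
        trajectory.append(z)
    return z, trajectory


def generate(model, num_patches, patch_dim, steps=4, seed=0):
    """Generate patches sequentially, Next-Shortcut style."""
    rng = np.random.default_rng(seed)
    patches, trajectories = [], []
    for _ in range(num_patches):
        z0 = rng.standard_normal(patch_dim)  # fresh noise for each patch
        x, traj = denoise_patch(model, z0, tuple(trajectories), steps=steps)
        patches.append(x)
        trajectories.append(traj)  # kept whole as a prior for later patches
    return patches
```

With few `steps` per patch, the total cost stays low while each patch still receives rich context, which is one way to read the abstract's claimed quality-latency trade-off.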