MIST: Moment-Aligned Invariant Stability Transform for Robust Flow Matching
Abstract
Classifier-Free Guidance (CFG) is a cornerstone of flow-matching models, significantly enhancing visual quality and prompt adherence. However, high guidance scales violate the optimal transport dynamics that these models learn, leading to visual artifacts and mode collapse. In this paper, we investigate the mechanisms of this failure through the lens of velocity moment decomposition. Our analysis reveals that the distributional shift induced by CFG decouples into two geometric components: a Linear Barycentric Drift that shifts the global distribution center, and a Quadratic Energetic Instability that injects surplus kinetic energy, distorting the transport cost and triggering variance explosion. To mitigate these issues, we introduce MIST (Moment-Aligned Invariant Stability Transform), a training-free method designed to confine the sampling trajectory to the learned data manifold. MIST comprises two hierarchical stages: (1) Invariant Alignment (IA), a global statistical rectifier that restores structural integrity by removing the linear drift and realigning the energy profile; and (2) Stability Thresholding (ST), a local dynamical regulator that enforces Lipschitz-like smoothness via temporal decay and spatial suppression. MIST enables robust, high-fidelity generation across a wide range of guidance scales and also consistently improves performance at moderate scales. Extensive experiments on diverse text-to-image and text-to-video benchmarks demonstrate that MIST outperforms both standard CFG and state-of-the-art guidance corrections, setting a new standard for robust guidance in flow-based generative models.
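To make the two moments concrete, the following is a minimal sketch, not the MIST algorithm itself. It assumes the standard CFG extrapolation v_u + w (v_c - v_u), treats a velocity field as a flat list of floats, and uses the conditional velocity as the reference whose mean and standard deviation are restored. The helper names (`cfg_velocity`, `mean_energy`, `align_moments`) and the choice of reference are illustrative assumptions; the abstract does not specify the actual IA or ST formulas.

```python
from statistics import fmean, pstdev


def cfg_velocity(v_uncond, v_cond, scale):
    """Standard CFG extrapolation, elementwise: v_u + scale * (v_c - v_u)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def mean_energy(v):
    """Second moment of the velocity (mean squared magnitude)."""
    return fmean([x * x for x in v])


def align_moments(v_guided, v_ref, eps=1e-8):
    """Hypothetical moment alignment (an assumption, not the paper's IA):
    recenter v_guided to the mean of v_ref (removes the first-moment drift)
    and rescale it to the standard deviation of v_ref (removes the
    second-moment surplus), keeping the guided spatial pattern."""
    mu_g, mu_r = fmean(v_guided), fmean(v_ref)
    gain = pstdev(v_ref) / (pstdev(v_guided) + eps)
    return [mu_r + gain * (g - mu_g) for g in v_guided]


# Toy velocities; the values are illustrative only.
v_u = [0.0, 1.0, 2.0, 3.0]
v_c = [1.0, 1.5, 3.0, 5.0]

guided = cfg_velocity(v_u, v_c, scale=7.5)
aligned = align_moments(guided, v_c)

print("mean   cond / guided / aligned:",
      fmean(v_c), fmean(guided), fmean(aligned))
print("energy cond / guided / aligned:",
      mean_energy(v_c), mean_energy(guided), mean_energy(aligned))
```

In this toy case the guided velocity has both a shifted mean and a much larger mean energy than the conditional velocity, while the aligned velocity recovers the reference mean and spread. The temporal decay and spatial suppression of the ST stage are not sketched here.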