LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment
Abstract
We formulate the learning of generalist Vision-Language-Action (VLA) models as a Gromov-Wasserstein alignment problem that aims to map semantically similar vision-language (VL) embeddings to physically similar motion primitives. Solving this problem is challenging because the two domains are mathematically heterogeneous: the semantic space of vision-language is topologically linear and isotropic, while the physical manifold of robotic action is non-Euclidean and anisotropic. Because the metric structures of these domains are disjoint, standard distance minimization is ill-posed and direct regression approaches fail. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer). LAST reconstructs the action space to establish a more consistent metric alignment between the VL and action modalities. Specifically, LAST bridges the heterogeneity in two stages: (1) Global Topological Linearization, which linearizes the action manifold through a Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation; and (2) Local Metric Discretization, which discretizes this representation hierarchically into schemas and whitened residuals, establishing a mathematical isomorphism with the isotropic Euclidean metric. By addressing the structural mismatch both globally and locally, LAST improves the convergence and generalizability of VLA models.
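To make the two stages concrete, the following is a minimal sketch and not the LAST implementation. It assumes that the Lie-algebraic mapping can be illustrated by the logarithm map on SO(3) (rotation matrix to axis-angle vector), and that the whitening step can be illustrated by Cholesky whitening of the resulting vectors. The function names `so3_exp`, `so3_log`, and `cholesky_whiten` are hypothetical, and the hierarchical schema discretization is omitted because the abstract does not specify it.

```python
import math

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector w in so(3) -> 3x3 rotation matrix."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]  # skew-symmetric hat(w)
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def so3_log(R):
    """Rotation matrix -> axis-angle vector (assumption: angle not close to pi)."""
    cos_t = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_t)
    # The antisymmetric part of R encodes 2 * sin(theta) * axis.
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    return [scale * x for x in v]

def cholesky_whiten(X):
    """Center the rows of X and transform them to identity sample covariance.

    Uses Cholesky whitening: cov = L L^T, z = L^{-1} (x - mean).
    Assumes the sample covariance is positive definite.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    cov = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
           for i in range(d)]
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    Z = []
    for r in Xc:
        z = [0.0] * d
        for i in range(d):  # forward substitution solves L z = r
            z[i] = (r[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
        Z.append(z)
    return Z

# Toy usage: synthetic rotations -> Lie-algebra vectors -> whitened vectors.
rotations = [so3_exp([0.01 * i, 0.3 * math.sin(i), 0.05 * math.cos(2 * i)])
             for i in range(50)]
tangent_vectors = [so3_log(R) for R in rotations]
whitened = cholesky_whiten(tangent_vectors)
```

In this sketch the logarithm map places rotations in a flat vector space where ordinary vector arithmetic applies, and the whitening step rescales that space so that Euclidean distance treats all directions equally. These are the properties that the two stages described above are meant to provide.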