General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling
Abstract
Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Theoretically, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM), a framework that enforces general covariance through structural disentanglement. Specifically, GAM constructs the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, where an Arc-Length Parameterizer decouples spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical “world lines” in the Lie-algebraic tangent space, distinguishing invariant topological schemas from affine modulations and ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we expand sparse training data into a continuous, valid action manifold. Empirical results demonstrate that GAM achieves superior transfer and robustness, significantly outperforming geometry-agnostic baselines.
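The temporal-invariance idea above, reparameterizing a trajectory by arc length so that two executions of the same path at different speeds map to the same geometric curve, can be sketched as follows. This is a minimal illustrative sketch, not the authors' Arc-Length Parameterizer; the function name, linear interpolation scheme, and uniform resampling grid are assumptions.

```python
import numpy as np

def arclength_reparameterize(traj, n_samples=None):
    """Resample a trajectory at uniform arc-length intervals.

    traj: (T, D) array of waypoints. Returns an (n_samples, D) array
    whose points are equally spaced along the path, discarding the
    original timing (speed profile) while preserving path geometry.
    """
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    if n_samples is None:
        n_samples = len(traj)
    s_uniform = np.linspace(0.0, s[-1], n_samples)       # uniform arc-length grid
    # Linearly interpolate each coordinate against arc length.
    return np.stack(
        [np.interp(s_uniform, s, traj[:, d]) for d in range(traj.shape[1])],
        axis=1,
    )

# Two demonstrations tracing the same L-shaped path with different
# waypoint densities (i.e., different speed profiles):
fast = np.array([[0, 0], [2, 0], [4, 0], [4, 4]], dtype=float)
slow = np.array([[0, 0], [1, 0], [2, 0], [3, 0],
                 [4, 0], [4, 2], [4, 4]], dtype=float)
# After reparameterization both collapse to the same canonical curve.
```

After this normalization, a policy trained on either demonstration sees an identical spatial representation, which is the robustness-to-velocity-variation property the abstract claims.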