The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space
Bing-Cheng Chuang ⋅ I-Hsuan Chu ⋅ Bor Jiun Lin ⋅ Yang YuanFu ⋅ Min Sun ⋅ Chun-Yi Lee
Abstract
Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the \textbf{Euclidean Fallacy}: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift that violates SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce \textbf{Lie Diffuser Actor (LDA)}, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.06$ to $3.30$ ($+7.8\%$). We further validate our method on a real robot, where it outperforms the baseline on the majority of tasks.
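The contrast between tangent-space noise and flat $\mathbb{R}^{12}$ noise can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`se3_exp`, `perturb_on_manifold`, `perturb_euclidean`, `orthogonality_error`), the isotropic Gaussian noise, and the fixed step size `sigma` are all assumptions made for illustration, and the score network and noise schedule are omitted entirely. The sketch only shows that a pose perturbed as $T \exp(\hat{\xi})$ stays on SE(3), whereas adding noise to the flattened top three rows of $T$ does not.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (v, w) in R^6 to a 4x4 SE(3) matrix."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # Truncated Taylor series near the identity to avoid dividing by ~0.
        R = np.eye(3) + W + 0.5 * W @ W
        V = np.eye(3) + 0.5 * W + W @ W / 6.0
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * W @ W   # Rodrigues' formula
        V = np.eye(3) + b * W + c * W @ W   # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def perturb_on_manifold(T, sigma, rng):
    """Assumed noising step: sample a twist in the tangent space, retract with exp."""
    xi = sigma * rng.standard_normal(6)
    return T @ se3_exp(xi)  # right-multiplication gives a left-invariant perturbation

def perturb_euclidean(T, sigma, rng):
    """Baseline for comparison: add Gaussian noise to the flattened 3x4 block (R^12)."""
    noisy = T[:3, :].reshape(-1) + sigma * rng.standard_normal(12)
    out = np.eye(4)
    out[:3, :] = noisy.reshape(3, 4)
    return out

def orthogonality_error(T):
    """Frobenius norm of R^T R - I; zero exactly when R is orthogonal."""
    R = T[:3, :3]
    return np.linalg.norm(R.T @ R - np.eye(3))

rng = np.random.default_rng(0)
T_lie, T_flat = np.eye(4), np.eye(4)
for _ in range(50):
    T_lie = perturb_on_manifold(T_lie, 0.1, rng)
    T_flat = perturb_euclidean(T_flat, 0.1, rng)
print("manifold error :", orthogonality_error(T_lie))
print("euclidean error:", orthogonality_error(T_flat))
```

In this toy setting the manifold variant keeps the rotation block orthogonal up to floating-point round-off, while the flat variant accumulates a visible constraint violation, which is the manifold drift the abstract describes.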