Poster Mon, Jul 6, 2026 • 10:00 PM – 11:45 PM PDT HALL A #506

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

Bing-Cheng Chuang ⋅ I-Hsuan Chu ⋅ Bor Jiun Lin ⋅ Yang YuanFu ⋅ Min Sun ⋅ Chun-Yi Lee

Abstract

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the **Euclidean Fallacy**: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift that violates SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce **Lie Diffuser Actor (LDA)**, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, where it outperforms the baseline on a majority of tasks.
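To make the manifold-drift and retraction ideas concrete, here is a minimal NumPy sketch restricted to the rotation part, SO(3). It is an illustration under stated assumptions, not the authors' implementation: the helper names (`hat`, `so3_exp`, `orthogonality_error`), the noise scale `sigma`, and the single-step right-multiplication update `R0 @ so3_exp(xi)` (one common discretization of left-invariant noise) are all chosen here for illustration. The translation component and the learned score network are omitted.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    x, y, z = w
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def so3_exp(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # Second-order Taylor expansion avoids division by ~0.
        return np.eye(3) + K + 0.5 * (K @ K)
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def orthogonality_error(R):
    """Frobenius norm of R^T R - I; zero for a valid rotation."""
    return np.linalg.norm(R.T @ R - np.eye(3))

rng = np.random.default_rng(0)
R0 = so3_exp(np.array([0.3, -0.2, 0.5]))  # an arbitrary valid rotation
sigma = 0.5                               # illustrative noise scale (assumption)

# Euclidean noise: perturb the nine matrix entries directly.
R_euclid = R0 + sigma * rng.standard_normal((3, 3))

# Tangent-space noise: sample in so(3), retract with the exponential map.
xi = sigma * rng.standard_normal(3)
R_lie = R0 @ so3_exp(xi)

err_euclid = orthogonality_error(R_euclid)
err_lie = orthogonality_error(R_lie)
det_lie = np.linalg.det(R_lie)

# Frame change Q applied on the left commutes with the noising step.
Q = so3_exp(np.array([-0.7, 0.1, 0.4]))
equivariant = np.allclose(Q @ R_lie, (Q @ R0) @ so3_exp(xi))

print("orthogonality error, Euclidean noise:", err_euclid)
print("orthogonality error, tangent noise:  ", err_lie)
print("det after tangent noise:", det_lie, "| equivariant:", equivariant)
```

In this sketch the Euclidean-perturbed matrix is no longer orthogonal, while the tangent-space sample stays on SO(3) up to floating-point error, and noising commutes with a change of world frame because the perturbation is applied on the right.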

Lay Summary

Many modern robots learn manipulation skills from recorded human demonstrations using a process called diffusion: noise is gradually added to the recorded motions until they become random, and a neural network learns to reverse the process. But a robot's hand pose, meaning its position and orientation in 3D space, does not live in a flat space. Valid rotations sit on a curved surface, like points on a globe. Adding ordinary noise to a rotation pushes it off this surface, producing intermediate poses that no real robot could ever take. We propose Lie Diffuser Actor, a method that respects this curved structure. Instead of adding noise directly to rotations, we add it in the flat tangent plane that touches the surface at each pose, then map the result back onto the surface through an operation that keeps every intermediate result valid. The network never sees impossible poses during either training or execution. On a simulation benchmark and on a real robot arm, Lie Diffuser Actor produces smoother and more reliable manipulation than methods that ignore the curved geometry. This brings robot learning one step closer to dependable real-world deployment.