GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
Abstract
Vision-language-action (VLA) models encode visual observations as 2D patch tokens with no intrinsic geometric structure. Augmenting them with dense monocular depth injects pixel-uniform scalar values that encode neither surface orientation nor geometric confidence, and provides no mechanism for intermediate spatial verification before action decoding. We introduce GST-VLA with two novel contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into 128 anisotropic 3D Gaussian primitives, each carrying a metric-residual mean, a log-scale covariance, and a learned opacity. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing it uniformly. Second, Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta-action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs via dual cross-attention. Trained with a composite objective comprising flow-matching, CoT, and depth losses across progressive stages, GST-VLA achieves low training and validation losses in stage 1 on the LIBERO dataset.
Simulation experiments and real-world deployment are ongoing. Planned ablations will isolate the contribution of each GST component, each DA-CoT thought, and each training stage, to test whether the gains are independent and synergistic and whether they concentrate on precision-demanding tasks.
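
To make concrete the claim that a Gaussian primitive's covariance eigenstructure encodes local surface orientation, the following is a minimal sketch and not the GST implementation. It assumes a parameterization in the style of 3D Gaussian Splatting, where the covariance is built as R diag(exp(s))^2 R^T from a per-axis log-scale s and a unit quaternion for R. The abstract specifies only the log-scale part, so the quaternion, the sigmoid opacity, and all function names here are hypothetical.

```python
import math

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (nested lists)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)  # normalize defensively
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(log_scale, quat):
    """Assumed form: Sigma = R diag(exp(s))^2 R^T, positive definite by construction."""
    R = quat_to_rot(quat)
    var = [math.exp(2.0 * s) for s in log_scale]  # per-axis variances
    return [
        [sum(R[i][k] * var[k] * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

def surface_normal(log_scale, quat):
    """Eigenvector of Sigma with the smallest eigenvalue: the thinnest axis.

    Under this parameterization the eigenvectors are the columns of R, so no
    eigendecomposition is needed.
    """
    R = quat_to_rot(quat)
    k = min(range(3), key=lambda i: log_scale[i])
    return [R[i][k] for i in range(3)]

def opacity(logit):
    """Assumed sigmoid squashing of a raw logit into a (0, 1) confidence."""
    return 1.0 / (1.0 + math.exp(-logit))

# A flat, disc-like primitive: two wide axes, one thin axis, rotated 90 degrees about x.
log_scale = [math.log(0.05), math.log(0.05), math.log(0.005)]
quat = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)

print("covariance:", covariance(log_scale, quat))
print("normal:", surface_normal(log_scale, quat))
print("opacity:", opacity(2.0))
```

For a disc-like primitive such as this one, the thin axis approximates the normal of the local surface patch, which is information a single scalar depth value per pixel cannot carry.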