GAE: Unleashing the Physical Potential of VLMs with a Generalizable Action Expert
Abstract
Vision-language models (VLMs) demonstrate strong reasoning and planning abilities, yet grounding these high-level plans in precise robot actions remains a central challenge. Existing Vision-Language-Action (VLA) methods typically entangle reasoning and action generation, leading to limited generalization and costly adaptation. We propose to learn a \textbf{G}eneralizable \textbf{A}ction \textbf{E}xpert (\textbf{GAE}), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints that represent high-level intention, and GAE maps these waypoints, together with real-time point cloud observations, to continuous action trajectories. GAE is pretrained on a large-scale point cloud–trajectory dataset comprising \textbf{150k} trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an \textbf{Action Pre-training, Pointcloud Fine-tuning (APPF)} scheme that decouples the learning of action dynamics from geometric grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Extensive experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.
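As a minimal sketch of the sparse geometric interface described above (not the paper's released implementation; the objects `vlm` and `gae` and their method names are assumptions introduced only for illustration), the two decoupled stages can be expressed as:

```python
import numpy as np

def plan_and_act(vlm, gae, rgb_image, instruction, point_cloud):
    """Map a language instruction to a dense action trajectory via sparse waypoints.

    Hypothetical interfaces: `vlm` stands in for the lightly fine-tuned VLM,
    `gae` for the frozen, pretrained Generalizable Action Expert.
    """
    # Stage 1: the VLM predicts sparse 3D waypoints encoding high-level intention.
    # Assumed shape: (K, 3) -- K waypoints in the robot/world frame.
    waypoints = vlm.predict_waypoints(rgb_image, instruction)
    assert waypoints.ndim == 2 and waypoints.shape[1] == 3

    # Stage 2: the frozen action expert maps the waypoints, together with the
    # real-time point cloud observation, to a continuous action trajectory.
    # Assumed shape: (T, action_dim) -- T dense control steps.
    trajectory = gae.generate_trajectory(waypoints, point_cloud)
    return trajectory
```

Under this reading, only Stage 1 is adapted per task, while Stage 2 is reused unchanged across tasks.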