Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
Abstract
Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic framework for low-cost VLA deployment via model--hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model--hardware pairs under \textbf{CET} (Cost, Energy, Time), showing that ``right-sized'' edge devices can be more cost- and energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using fine-grained SM tracing and Roofline analysis, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent hardware underutilization. Finally, guided by these insights, we propose \textbf{DP-Cache} and \textbf{V-AEFusion} to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to $2.9\times$ speedup on GPUs and $3.3\times$ on edge NPUs with only marginal degradation in task success rate. The code will be publicly released upon acceptance of the paper. An example leaderboard website is available at \url{https://vla-leaderboard-01.vercel.app/}.
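For reference, the compute- versus memory-bound labeling mentioned above follows the standard Roofline bound; the sketch below uses generic symbols (peak throughput $P_{\text{peak}}$, memory bandwidth $B$, arithmetic intensity $I$) rather than the paper's own notation.
% Minimal sketch of the standard Roofline bound used to classify a phase
% (generic symbols, not this paper's notation).
\[
  P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; B \cdot I\bigr),
  \qquad
  I = \frac{\text{FLOPs}}{\text{bytes moved}},
\]
\[
  \text{memory-bound if } I < \frac{P_{\text{peak}}}{B} \ \text{(the ridge point)},
  \qquad
  \text{compute-bound otherwise.}
\]
Under this rule, the VLM backbone's large matrix multiplications place it to the right of the ridge point (compute-bound), while the Action Expert's iterative, bandwidth-dominated steps fall to the left (memory-bound).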