EnsembleVLA: Ensemble Learning for Vision-Language Action Models
Abstract
Diverse vision-language-action (VLA) models have been proposed and have demonstrated remarkable capabilities in robotic manipulation. However, how to ensemble VLAs effectively to further enhance performance remains largely unexplored, as conventional ensemble techniques designed for discriminative tasks cannot be directly applied to generative action policies with high-dimensional, multimodal distributions. To address this challenge, we propose EnsembleVLA, an energy-based framework that enables principled ensembling of diverse VLA models. We establish a unified theoretical framework showing that both diffusion-based and flow-based VLA models can be formulated as energy-based models, where additive energy combination naturally induces policy composition at the distribution level. This theoretical foundation enables multiple pre-trained policies to be seamlessly aggregated into a stronger ensemble policy. Building upon this compositional framework, EnsembleVLA further incorporates learnable composition weights for dynamic policy balancing, coupled with a confidence-aware gating mechanism that adaptively modulates bounded residual corrections, collectively ensuring stable and robust task execution. Extensive experiments demonstrate that EnsembleVLA achieves competitive performance across various tasks in both simulated and real-world environments.
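The core idea of additive energy combination can be illustrated with a minimal, hypothetical sketch (not the paper's implementation): if each policy is an energy-based model, summing weighted energies E(x) = Σᵢ wᵢ Eᵢ(x) is equivalent to summing weighted scores, so the composed distribution can be sampled directly with Langevin dynamics. Here each "policy" is a toy 1D Gaussian EBM; the function names and weights are illustrative assumptions.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score (gradient of log-density) of N(mu, sigma^2),
    i.e. -dE/dx for energy E(x) = (x - mu)^2 / (2 sigma^2)."""
    return -(x - mu) / sigma**2

def ensemble_score(x, params, weights):
    """Weighted sum of per-policy scores, which equals the score of
    the additively combined energy E(x) = sum_i w_i * E_i(x)."""
    return sum(w * gaussian_score(x, mu, s) for w, (mu, s) in zip(weights, params))

def langevin_sample(score_fn, x0, steps=2000, step=0.01, seed=0):
    """Unadjusted Langevin dynamics targeting the composed distribution."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Two unit-variance Gaussian "policies" centered at -1 and +3, equal weights.
# Their energy-additive composition is the (normalized) product of densities,
# here N(1, 0.5), so Langevin samples concentrate near x = 1.
params = [(-1.0, 1.0), (3.0, 1.0)]
weights = [1.0, 1.0]
samples = langevin_sample(lambda x: ensemble_score(x, params, weights),
                          x0=np.zeros(5000))
print(samples.mean(), samples.var())
```

With multimodal diffusion or flow policies the same score-summation principle applies inside the reverse sampling loop, which is what allows pre-trained policies to be composed without retraining.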