SkillNet: Hierarchical Skill Modeling for Compositional Generalization in Vision-Language-Action Models
Abstract
Transfer across diverse task compositions and unseen behaviors remains a significant challenge for vision-language-action (VLA) models. Skills are repeatable, atomic components of diverse tasks, and the similarities shared among different skills provide evidence for transferability across behaviors. However, existing skill-centric methods suffer from two problems. First, skills are often loosely organized, lacking a hierarchy that captures both similarities and differences across skills. Second, they lack a mechanism capable of expressing transferable skill attributes in a structured parametric space. To address these problems, we propose SkillNet, which models skill attributes hierarchically and regulates a compositional model structure with transferable skill attributes. SkillNet exploits motion codes and the VerbNet framework to explicitly model the similarities of skills in terms of mechanical properties and semantic roles, and organizes skills into a hierarchy. Based on this hierarchy, SkillNet leverages the scalability of the mixture-of-experts (MoE) mechanism and introduces skill embeddings as soft constraints, enabling compositional generalization by activating similar experts for similar skills. In zero-shot and few-shot transfer experiments in simulated and real-world environments, SkillNet improves performance by 16.0% and 23.9%, respectively. Meanwhile, SkillNet achieves state-of-the-art performance in in-domain settings.
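To make the "skill embeddings as soft constraints" idea concrete, the sketch below shows one plausible realization: an MoE layer whose router logits are biased by the similarity between a task's skill embedding and learned per-expert skill prototypes, so similar skills tend to activate similar experts. This is an illustrative assumption, not the paper's actual implementation; all names (`SkillGatedMoE`, `expert_proto`, `bias_scale`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillGatedMoE(nn.Module):
    """Minimal sketch: MoE routing softly biased by a skill embedding,
    so similar skills route to similar experts. Illustrative only."""

    def __init__(self, d_model: int, d_skill: int, n_experts: int, bias_scale: float = 1.0):
        super().__init__()
        # Simple feed-forward experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # token-conditioned routing logits
        # Hypothetical per-expert skill prototypes in the skill-embedding space.
        self.expert_proto = nn.Parameter(torch.randn(n_experts, d_skill))
        self.bias_scale = bias_scale

    def forward(self, x: torch.Tensor, skill_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); skill_emb: (batch, d_skill)
        logits = self.router(x)  # (batch, seq, n_experts)
        # Soft constraint: bias routing toward experts whose prototype
        # matches the current skill embedding (cosine similarity).
        sim = F.cosine_similarity(
            skill_emb.unsqueeze(1), self.expert_proto.unsqueeze(0), dim=-1
        )  # (batch, n_experts)
        gates = F.softmax(logits + self.bias_scale * sim.unsqueeze(1), dim=-1)
        out = torch.stack([e(x) for e in self.experts], dim=-2)  # (batch, seq, n_experts, d_model)
        return (gates.unsqueeze(-1) * out).sum(dim=-2)  # gate-weighted mixture

# Usage: two tasks, ten tokens each, conditioned on their skill embeddings.
layer = SkillGatedMoE(d_model=64, d_skill=16, n_experts=8)
y = layer(torch.randn(2, 10, 64), torch.randn(2, 16))  # -> (2, 10, 64)
```

Under this reading, the soft bias (rather than a hard expert assignment) is what lets unseen skill compositions reuse overlapping expert subsets.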