Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis
Abstract
Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improvement. However, existing embodied benchmarks fail to provide actionable insights because they focus on task-level evaluation rather than identifying capability bottlenecks. To address this, we introduce BEAR, a benchmark that decomposes embodied tasks into 14 atomic skills for skill-level evaluation. BEAR comprises 4,469 interleaved image–video–text entries across 14 skills in 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and discover that (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) models fail due to unstable spatiotemporal modeling, a weakness that remains unexposed in previous benchmarks. Building on these insights, we propose BEAR-Agent, a multimodal conversable agent that augments MLLMs with visual and spatial tools. It substantially enhances MLLMs' performance across skills, yielding a 17.5% relative improvement for GPT-5 on BEAR, and outperforms baselines by a large margin in both simulation and real-robot experiments across models.