Hierarchical Procedural Meta-Reasoning for Generalizable Multimodal Agents
Abstract
While multimodal agents can achieve strong performance through fine-tuning, their ability to generalize remains limited in complex real-world settings such as mobile navigation, where diverse applications, frequent system updates, and customized workflows are common. We argue that a fundamental bottleneck lies in whether an agent possesses sufficient task-specific procedural knowledge to accomplish a given goal. Such procedural knowledge may be drawn from the general knowledge of large language models, or retrieved from external resources such as web search when necessary. Based on this view, we propose the Procedure-Aware Multimodal Agent with Meta Reasoning, a framework that explicitly represents task knowledge as natural-language procedures and trains a procedure-aware grounded agent to condition its actions on this knowledge. By learning to leverage procedural knowledge from different sources, our approach enables robust generalization across tasks, applications, interface versions, and multi-app workflows, achieving substantial improvements on challenging Android benchmarks.