Towards AI Agents In the Real World
Abstract
Recent advances in AI agents have been driven by imitation learning combined with reinforcement learning in the digital world, built on large-scale generative models. These agents perform strongly on many online tasks but remain limited in physical-world settings. I argue for a shift toward AI agents grounded in world modeling, which would let them understand the physical environment, user intentions, and social contexts, and thereby perform complex tasks autonomously in the real world. World modeling integrates multimodal perception, planning through reasoning for action and control, and memory into a comprehensive understanding of the physical world. Achieving advanced machine intelligence, I further argue, requires modeling both the physical world and the mental world, including latent variables such as intent, attention, and context. I outline key challenges toward building context-aware, interactive agents in the real world. This trajectory demands continued effort to develop robust world models and embodied agents that can truly assist humans with real tasks in the real world.