Code2World: A GUI World Model via Renderable Code Generation
Abstract
GUI agents interact with environments by perceiving interfaces and executing actions. Acting as a virtual sandbox, a GUI world model gives agents human-like foresight by enabling action-conditioned prediction of the next interface state. However, existing text-based and pixel-based approaches struggle to achieve both high visual fidelity and fine-grained structural controllability. To this end, we propose \textbf{Code2World}, a vision-language coder that simulates the next visual state via \textbf{renderable code generation}. Specifically, to address data scarcity, we construct \textbf{AndroidCode} by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of \textbf{over 80K} high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform supervised fine-tuning (SFT) as a cold start for format and layout following, and then apply \textbf{Render-Aware Reinforcement Learning}, which uses the rendered outcome as the reward signal to enforce visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, \textit{Code2World significantly enhances downstream navigation success rates in a flexible manner}, boosting Gemini-2.5-Flash by +9.5\% on AndroidWorld navigation.
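
As a rough illustration of how a reward computed from the rendered outcome could combine the two criteria named above, the following Python sketch scores one predicted screen. It is a minimal sketch under stated assumptions, not the paper's reward: the `RenderedPrediction` container, the `[0, 1]` score ranges, the equal weights, and the zero reward for code that fails to render are all hypothetical choices made for illustration, and the abstract does not specify how either score is computed.

```python
from dataclasses import dataclass


@dataclass
class RenderedPrediction:
    """Hypothetical record of one generated HTML screen after rendering."""
    rendered: bool              # whether the generated code rendered at all
    visual_score: float = 0.0   # assumed in [0, 1]: semantic match to the true next screen
    action_score: float = 0.0   # assumed in [0, 1]: whether the change fits the executed action


def _clamp01(x: float) -> float:
    """Clamp a score into [0, 1] so out-of-range judge outputs cannot dominate."""
    return max(0.0, min(1.0, x))


def render_aware_reward(pred: RenderedPrediction,
                        w_visual: float = 0.5,
                        w_action: float = 0.5) -> float:
    """Weighted sum of visual fidelity and action consistency.

    Assumption: code that does not render earns zero reward, because
    neither criterion can be judged without a rendered screen.
    """
    if not pred.rendered:
        return 0.0
    return (w_visual * _clamp01(pred.visual_score)
            + w_action * _clamp01(pred.action_score))


if __name__ == "__main__":
    ok = RenderedPrediction(rendered=True, visual_score=0.8, action_score=0.4)
    broken = RenderedPrediction(rendered=False, visual_score=1.0, action_score=1.0)
    print(render_aware_reward(ok))
    print(render_aware_reward(broken))
```

In this sketch the weights trade off how closely the rendered screen matches the ground truth against how well the predicted change reflects the action taken; any real implementation would need to define both scores concretely.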