Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng ⋅ Li'an Zhong ⋅ Yi Wang ⋅ Rui Dai ⋅ Kaikui Liu ⋅ Xiangxiang Chu ⋅ Linyuan Lü ⋅ Phil Torr ⋅ Kevin Qinghong Lin

Abstract

GUI agents interact with environments by perceiving interfaces and executing actions. Acting as a virtual sandbox, a GUI world model gives agents human-like foresight by enabling action-conditioned prediction. However, existing text-based and pixel-based approaches struggle to achieve both high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address data scarcity, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform supervised fine-tuning (SFT) as a cold start for following the output format and layout, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal to enforce visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.