ECHO: Terminal Agents Learn World Models for Free
Abstract
CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream (stdout, errors, file contents) records the consequences of its actions. We argue that this stream is a supervision signal that standard agent RL largely discards: GRPO-style training updates action tokens while ignoring environment responses, even though rollouts already contain them. We introduce ECHO (Environment Cross-entropy Hybrid Objective), an auxiliary cross-entropy loss that trains the policy to predict observation tokens and is optimized alongside the policy gradient. ECHO reuses the same forward pass, requires no additional rollouts, and turns terminal feedback into a dense signal that remains informative even on failed trajectories. This objective shapes the policy toward a world model of terminal dynamics, plausibly improving its priors over which actions are likely to succeed. On TerminalBench 2, ECHO roughly doubles pass@1: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. Starting from base Qwen3-8B, ECHO+GRPO matches expert-SFT-then-GRPO on held-out terminal tasks without the expert demonstrations required for SFT, and closes roughly half of this gap on TerminalBench 2. Even without verifier rewards, ECHO can drive self-improvement through exploration alone, suggesting that environment observations are not merely context for actions but supervision for the policy itself.
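To make the shape of the hybrid objective concrete, the following is a minimal sketch, not the paper's implementation. It assumes per-token log-probabilities from a single forward pass, binary masks that mark action and observation tokens, and one scalar advantage per trajectory. The function name `echo_hybrid_loss`, the weight `obs_weight` and its default value, and the simplified policy-gradient term (no clipping or KL penalty, unlike full GRPO) are all illustrative assumptions. NumPy only evaluates the loss value here; in training the same expression would be written in an autodiff framework.

```python
import numpy as np

def echo_hybrid_loss(token_logprobs, action_mask, obs_mask, advantages, obs_weight=0.1):
    """Hypothetical sketch: policy-gradient term plus observation cross-entropy.

    token_logprobs: (batch, seq) log-prob the policy assigns to each realized token
    action_mask:    (batch, seq) 1.0 where the token was emitted by the model
    obs_mask:       (batch, seq) 1.0 where the token came back from the terminal
    advantages:     (batch,) one scalar advantage per trajectory
    obs_weight:     assumed mixing coefficient (not taken from the paper)
    """
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    action_mask = np.asarray(action_mask, dtype=float)
    obs_mask = np.asarray(obs_mask, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Simplified policy-gradient surrogate, computed on action tokens only.
    pg_per_token = -advantages[:, None] * token_logprobs * action_mask
    pg_loss = pg_per_token.sum() / max(action_mask.sum(), 1.0)

    # Auxiliary cross-entropy (mean negative log-likelihood) on observation tokens,
    # reusing the same log-probabilities rather than a second forward pass.
    obs_nll = -(token_logprobs * obs_mask).sum() / max(obs_mask.sum(), 1.0)

    total = pg_loss + obs_weight * obs_nll
    return total, pg_loss, obs_nll


# One trajectory: two action tokens followed by two observation tokens.
logp = [[-1.0, -2.0, -0.5, -0.5]]
act = [[1, 1, 0, 0]]
obs = [[0, 0, 1, 1]]

# With zero advantage the policy-gradient term vanishes,
# but the observation term still contributes to the loss.
total_zero, pg_zero, nll_zero = echo_hybrid_loss(logp, act, obs, advantages=[0.0])
total_pos, pg_pos, nll_pos = echo_hybrid_loss(logp, act, obs, advantages=[1.0])
```

In this sketch, a trajectory with zero advantage (for example, when every rollout in a group receives the same reward) yields no policy-gradient signal, while the observation term stays nonzero. This is one way to read the claim that terminal feedback remains informative on failed trajectories.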