MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
Abstract
Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling, biasing win-rate estimates and destabilizing comparative rankings across repeated tournaments. Prompt choice exacerbates this by inducing different effective policies and interaction dynamics. We address both instability and underperformance in interactive games with MEMO (Memory-Augmented Model Context Optimization), a self-play framework that treats inference-time context as an optimizable, agentic object by coupling retention and exploration. Retention maintains a persistent memory bank that distills self-play trajectories into structured insights, consolidates them via CRUD-style updates, and injects them as priors during subsequent play. Exploration performs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit pivotal states for sample-efficient coverage. Across five text-based games, MEMO raises the mean win rate from 24.9% to 49.5% for GPT-4o-mini and from 21.7% to 44.3% for Qwen-2.5-7B-Instruct with a budget of only 2,000 self-play games per task, while reducing run-to-run dispersion of end-to-end outcomes and yielding more reliable rankings under prompt stratification. These results suggest that substantial headroom in multi-agent LLM game performance and robustness remains to be unlocked: MEMO achieves gains in negotiation games and imperfect-information settings, while RL remains more effective in perfect-information games. Anonymous project website: https://79ac811fdcc9cd5679a2258a180589ef.github.io
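
To make the two components concrete, the sketch below illustrates, under our own naming assumptions, the retention and exploration mechanisms described above: a CRUD-style memory bank that injects distilled insights as context priors, and tournament-style prompt selection with TrueSkill ratings. The class and function names (MemoryBank, select_prompt, update_ratings) and the upper-confidence selection rule are hypothetical illustrations, not the paper's implementation; prioritized replay is omitted for brevity.

```python
# Minimal sketch of MEMO-style retention and exploration (assumed interfaces).
import trueskill


class MemoryBank:
    """Persistent store of structured insights distilled from self-play."""

    def __init__(self):
        self.insights: dict[str, str] = {}

    # CRUD-style consolidation of insights
    def create(self, key: str, insight: str) -> None:
        self.insights[key] = insight

    def read_all(self) -> list[str]:
        return list(self.insights.values())

    def update(self, key: str, insight: str) -> None:
        if key in self.insights:
            self.insights[key] = insight

    def delete(self, key: str) -> None:
        self.insights.pop(key, None)

    def as_context_prior(self) -> str:
        """Inject retained insights as a prior for the next game."""
        return "Lessons from prior self-play:\n" + "\n".join(
            f"- {s}" for s in self.read_all()
        )


def select_prompt(ratings: dict[str, trueskill.Rating], beta: float = 1.0) -> str:
    """Uncertainty-aware selection: score each prompt by mu + beta * sigma,
    so uncertain candidates are still explored (an assumed selection rule)."""
    return max(ratings, key=lambda p: ratings[p].mu + beta * ratings[p].sigma)


def update_ratings(ratings: dict[str, trueskill.Rating], winner: str, loser: str) -> None:
    """Tournament-style rating update after one self-play game."""
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])
```

In this reading, each self-play game selects a prompt by rating, prepends the memory bank's context prior, and after the game both the ratings and the insight store are updated before the next iteration.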