Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents
Abstract
Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent AI, requiring agents to move beyond individual logic toward recursive Theory-of-Mind (ToM) and strategic coordination. To investigate these challenges, we conduct a large-scale evaluation of 17 state-of-the-art LLMs (4B–600B+ parameters) on the cooperative card game Hanabi across 2–5 players. To probe their limitations, we analyze the impact of context engineering and scaffold robustness, ranging from minimal prompts (Watson setting) to Bayesian-motivated scaffolding (Sherlock setting) and multi-turn working memory (Mycroft setting). Our findings reveal that (1) top-performing models can autonomously track game state via internal working memory, although not reliably, and (2) cross-play performance scales smoothly with model capability. Even the best models (scoring ≈ 15/25), however, trail specialist human experts (> 20/25). We introduce and release two novel datasets: HanabiLogs (1,520 annotated trajectories) and HanabiRewards (560 games with dense move-level utilities). By fine-tuning a 4B open-weight model (Qwen3-Instruct) on these datasets, we achieve performance gains of up to 156%, bringing the model to within 3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. Crucially, our HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, and instruction following on IFBench-800K by 1.7 Pass@10, while matching AIME 2025 mathematical-reasoning Pass@10. Code and datasets are available at {redacted for double blind}.