Playing with Fire: What Transfers When RL Trains a Language Agent?
Abstract
What does reinforcement learning (RL) teach an agentic language model: task-transferable competence or scaffold-specific tricks? We investigate this question in Hanabi, a cooperative card game that serves as a benchmark for hidden-state reasoning, by training and evaluating models across three scaffolds that vary in state exposure and reasoning burden. Our results reveal that a smaller 4B model trained with RL on information-rich scaffolds can achieve large gains on the trained scaffolds, but fails to transfer to a held-out scaffold that requires basic game knowledge; continued RL on a single scaffold may even degrade transfer to another trained scaffold. In contrast, a 120B model with stronger prior Hanabi competence transfers consistently across all held-out scaffolds under a comparable RL setup, suggesting that cross-scaffold transfer is gated by pre-RL task competence. Yet this failure of direct transfer does not render RL unhelpful for the weaker model. We construct FIREWORKS, a 10K-example benchmark of strict hidden-belief reconstruction across 2- to 5-player Hanabi games, on which the base model has near-zero Pass@1. Training on FIREWORKS produces longer rollouts and lifts held-out non-Hanabi hard-logic accuracy from Pass@8 = 0.15 to 0.37, with continued Hanabi RL raising it to 0.61, even though the state-tracking skill itself does not transfer zero-shot across scaffolds. When state-tracking and move-rating rewards are optimized jointly, cold-start training is slowed by interference, while initialization from FIREWORKS reaches the base model's reward level in roughly half the steps. Together, these results decompose RL post-training into three outcomes (scaffold-local gains, competence-gated transfer, and trainability transfer) and show that staged hard-task training improves downstream trainability even when the trained skill does not survive a scaffold change.
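The abstract reports results as Pass@1 and Pass@8 but does not say how they are computed. As a point of reference only, the sketch below shows the commonly used unbiased pass@k estimator, which takes n sampled attempts per problem, of which c are correct, and estimates the probability that at least one of k draws succeeds. Whether the paper uses this estimator is an assumption, and the names `pass_at_k` and `mean_pass_at_k` are hypothetical helpers introduced here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled attempts, c: number of correct attempts,
    k: evaluation budget. Assumed metric definition, not taken from the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts are wrong, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn attempts are wrong), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this definition, a Pass@8 score computed from exactly eight samples per problem is simply the fraction of problems with at least one correct sample among the eight.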