Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork
Abstract
In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT)—where coordination with previously unknown partners is required—remains unexplored. To evaluate this rigorously, we introduce ICRL4AHT, a large-scale benchmark built upon a high-throughput JAX implementation of Overcooked-V2. The benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides an end-to-end pipeline to generate learning histories, serialize them into reproducible datasets, and perform online multi-episode evaluation. We evaluate state-of-the-art ICRL algorithms, including Algorithm Distillation (AD) and the Decision-Pretrained Transformer (DPT), across millions of transitions. The results reveal stark limitations: contrary to their success in single-agent domains, current ICRL architectures fail to exhibit test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines on both the unseen-teammate and unseen-layout tracks, with no observable in-context improvement over long horizons. These findings highlight the fundamental challenges of strategic inference under partial observability and establish our benchmark as a critical testbed for next-generation coordination algorithms. Our repository is available at https://anonymous.4open.science/r/ICRL4AHT.