From Winning to Understanding: A Diagnostic Long-Horizon RTS Benchmark for LLMs
Abstract
Large language models (LLMs) are increasingly used as decision modules, yet existing benchmarks provide limited coverage of long-horizon, adversarial interaction in which models must also act faithfully on human instructions. We introduce a long-horizon Red Alert RTS benchmark with a hierarchical interface in which LLMs output budgeted, low-frequency macro/tactical intents that are executed deterministically for standardized comparison. The benchmark evaluates (i) robustness to ``rules-as-variable'' perturbations via rule-style shifts, (ii) competitive strength via Elo-style ratings from head-to-head matches, and (iii) human steerability via standardized language interventions. Beyond win/loss, we log economy growth and spending, combat loss ratio, and visibility coverage to diagnose long-horizon failure modes. Overall, the benchmark provides a reproducible and diagnostic testbed for robustness and controllability in long-horizon adversarial decision making.
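For reference, the Elo-style ratings are derived from head-to-head match outcomes; a minimal sketch, assuming the standard Elo update (the benchmark's exact rating scheme may differ), is
\[
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),
\]
where $R_A$ and $R_B$ are the agents' current ratings, $S_A \in \{0, \tfrac{1}{2}, 1\}$ is the observed match outcome for agent $A$, and $K$ controls the update step size.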