EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
Abstract
The rapid evolution of Large Language Models (LLMs) has shifted their role from passive information providers to active agents capable of executing complex workflows. However, the realization of a true "AI worker" is currently hindered by benchmarks that fail to capture the intricacy of professional environments, which demand long-horizon planning, complex tool usage, and adherence to strict access protocols. To bridge this gap, we introduce EnterpriseOps-Gym, a benchmark environment designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym provides: (i) 1,150 expert-curated tasks across eight interconnected domains (including HR, IT, Customer Service, and productivity tools) that require managing persistent state and are evaluated with strict outcome-based verification logic; and (ii) a high-fidelity, containerized sandbox environment hosting 164 database tables and 512 functional tools. Our evaluation reveals critical limitations in state-of-the-art models: even the top-performing Claude Sonnet~4.5 achieves only a 34.1\% success rate, struggling significantly with planning consistency, error recovery, and adherence to policy constraints. Furthermore, we observe that agents frequently fail to refuse infeasible tasks, producing unintended and potentially harmful side effects on the system. These findings indicate that current agents are not yet ready for enterprise deployment. By releasing EnterpriseOps-Gym, we provide a concrete testbed to advance the reliability of autonomous agents in professional workflows.