ThunderAgent: A Fast, Simple, and Program-Aware Agentic Inference System
Abstract
Large language models (LLMs) now power complex multi-turn agentic workflows. Existing services run agentic inference by assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although an agentic workflow comprises multiple LLM and tool requests, existing services make scheduling decisions on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV caches and tool execution environments. To address these challenges, we propose \ouralg, an inference system that is aware of the end-to-end agent workflow. We abstract agentic workflows as \textit{LLM Programs}, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. \ouralg introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalance, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that \ouralg achieves 1.5--3.6$\times$ higher throughput in serving and 1.8--3.9$\times$ in reinforcement learning (RL) rollout, with up to 4.2$\times$ disk memory savings compared to state-of-the-art inference systems.