How good is your harness?
Abstract
AI agent benchmarks conflate the underlying large language model (LLM) with the harness that wraps the model in tools, prompts, and control flow. We develop a simple statistical method to disentangle the effects of the LLM and its harness on an agent's score; that is, to attribute the variation in agents' scores to their LLMs and harnesses. We use the method to evaluate harnesses and LLMs on Terminal-Bench 2.0. Our results show that (1) the harness matters as much as the LLM: the gains from simply picking the best harness are comparable to those from picking the best LLM; and (2) harness effects can be heterogeneous: some harnesses work better with some LLMs than with others. Our results confirm and, more importantly, quantify practitioners' intuition about the importance of the harness, and they validate efforts to tailor harnesses to specific applications.
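To make the idea of attributing score variation concrete, the sketch below shows one generic way this could be done: a two-way, ANOVA-style decomposition over a balanced grid of LLM and harness pairings. This is an illustrative assumption, not necessarily the method developed in this paper. The function name `decompose`, the grid layout, and the toy scores are all hypothetical; the numbers are invented for illustration and are not Terminal-Bench 2.0 results.

```python
from statistics import mean


def decompose(scores):
    """Split score variation into LLM, harness, and interaction shares.

    scores[i][j] is the (hypothetical) score of LLM i run inside harness j.
    Assumes a balanced grid: every LLM is paired with every harness once.
    """
    n_llm, n_har = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)

    # Main effects: how far each row / column mean sits from the grand mean.
    llm_eff = [mean(row) - grand for row in scores]
    har_eff = [
        mean(scores[i][j] for i in range(n_llm)) - grand for j in range(n_har)
    ]

    # Interaction: what is left after removing both main effects.
    # Nonzero values mean a harness helps some LLMs more than others.
    inter = [
        [scores[i][j] - grand - llm_eff[i] - har_eff[j] for j in range(n_har)]
        for i in range(n_llm)
    ]

    # Sums of squares for each source of variation.
    ss_llm = n_har * sum(a * a for a in llm_eff)
    ss_har = n_llm * sum(b * b for b in har_eff)
    ss_int = sum(g * g for row in inter for g in row)
    total = ss_llm + ss_har + ss_int
    if total == 0:
        raise ValueError("all scores are identical; nothing to attribute")

    return {
        "llm_share": ss_llm / total,
        "harness_share": ss_har / total,
        "interaction_share": ss_int / total,
        "interaction": inter,
    }


# Made-up scores: rows are two LLMs, columns are two harnesses.
toy = [[0.2, 0.4],
       [0.4, 0.8]]
result = decompose(toy)
for key in ("llm_share", "harness_share", "interaction_share"):
    print(key, round(result[key], 3))
```

In this toy grid the LLM and harness shares come out equal, and the small interaction share reflects that the second harness helps the second LLM more than the first. A real analysis would also need to account for run-to-run noise and unbalanced grids, which this sketch ignores.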