Position: Agent Evaluation Should Be Agentified for Openness, Standardization, and Reproducibility
Abstract
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration effort, create a mismatch between test and production environments, and limit fair comparison across diverse agent designs. This position paper argues that the root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), in which evaluation is performed by assessor agents and all participants interact through standardized protocols: the Agent2Agent (A2A) protocol for task management and the Model Context Protocol (MCP) for tool access. This design separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and deployment; we provide recommended practices that allow both agent developers and benchmark designers to adopt AAA with minimal additional effort; and we show how this approach turns agent evaluation from ad-hoc integration work into a reusable, portable, and production-aligned process. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.