Memoria-Bench: A Comprehensive Benchmark for Evaluating Memory in Long-Horizon Autonomous Agents
Abstract
Memory is a core capability of autonomous agents, yet existing benchmarks evaluate it primarily in constrained settings such as short dialogues or synthetic tasks, which fail to reflect realistic agent deployments. We present \textbf{Memoria-Bench}, a benchmark for evaluating agent memory grounded in complete, chronologically ordered interaction trajectories that may span millions of tokens. Guided by the principles of realism, domain and agent diversity, and explicit exposure of memory-centric challenges, Memoria-Bench formulates all tasks as anti-summarization question answering, requiring fine-grained, temporally grounded memory retrieval rather than high-level abstraction. The benchmark covers deep research, coding, and science & development agents across seven domain categories and instantiates three task families: temporal aggregation, multi-hop memory reasoning, and long-range state tracking. Experiments on state-of-the-art long-context models and memory-augmented methods reveal substantial performance degradation on long, noisy trajectories, exposing a critical memory bottleneck that context-length scaling alone does not resolve.