Memoria-Bench: A Comprehensive Benchmark for Evaluating Memory in Long-Horizon Autonomous Agents
Abstract
Memory is a core capability of autonomous agents, yet existing benchmarks evaluate it primarily in constrained settings such as short dialogues or synthetic tasks, which fail to reflect realistic agent deployments. We present \textbf{Memoria-Bench}, a benchmark for evaluating agent memory grounded in complete, chronologically ordered interaction trajectories that may span millions of tokens. Guided by the principles of realism, domain and agent diversity, and explicit exposure of memory-centric challenges, Memoria-Bench formulates all tasks as anti-summarization question answering, requiring fine-grained, temporally grounded memory retrieval rather than high-level abstraction. The benchmark covers deep research, coding, and science & development agents across seven domain categories and instantiates three task families: temporal aggregation, multi-hop memory reasoning, and long-range state tracking. Experiments on state-of-the-art long-context models and memory-augmented methods reveal substantial performance degradation on long, noisy trajectories, exposing a critical memory bottleneck that context-length scaling alone does not resolve.