Attention with Routed Memory for Learnable Sparse Control
Abstract
Despite advances in long-context inference, large language models (LLMs) remain fundamentally limited by the key-value (KV) cache required for efficient autoregressive generation, whose memory footprint grows with context length. Cache-management techniques such as selective token eviction and pruning mitigate this growth, but often discard potentially useful information. In this paper, we build upon these approaches to propose Attention with Routed Memory (ARM), a novel KV caching structure that replaces the unbounded cache with a fully differentiable, fixed-size memory organized as a hierarchical routing structure: a router learns to select memory slots via Gumbel-Softmax, and sigmoid-gated updates softly combine new and stored information, avoiding hard eviction and the information loss it entails. Combined with a policy that dynamically selects how much memory to access at inference time, ARM narrows retrieval for simple contexts and expands it for inputs that require deeper reasoning, enabling scalable and effective retrieval on both short and long contexts. Experimental results on standard commonsense and long-context reasoning benchmarks demonstrate that ARM outperforms fixed KV-caching approaches while remaining scalable in both memory footprint and generation latency.
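To make the routed write operation concrete, the following is a minimal PyTorch sketch of a single ARM-style memory update under our own assumptions, not the paper's actual implementation: a learned router scores a fixed set of slots, Gumbel-Softmax produces a differentiable slot selection, and a sigmoid gate blends the incoming representation with the stored slot contents instead of evicting it. All names (`RoutedMemoryWrite`, `num_slots`, `tau`) and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedMemoryWrite(nn.Module):
    """Sketch of one differentiable write into a fixed-size memory.

    Assumed design, for illustration only: the memory is a
    (num_slots, dim) tensor; each incoming vector is routed to one
    slot via Gumbel-Softmax and merged with it through a sigmoid gate,
    so no stored information is hard-evicted.
    """

    def __init__(self, dim: int, num_slots: int, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.router = nn.Linear(dim, num_slots)  # per-slot routing logits
        self.gate = nn.Linear(2 * dim, 1)        # update-gate over [new, stored]

    def forward(self, memory: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # memory: (num_slots, dim) fixed-size store; x: (dim,) new entry.
        logits = self.router(x)                                   # (num_slots,)
        # Straight-through Gumbel-Softmax: one-hot selection in the
        # forward pass, soft gradients in the backward pass, so the
        # routing decision itself stays learnable.
        route = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        stored = route @ memory                                   # selected slot, (dim,)
        g = torch.sigmoid(self.gate(torch.cat([x, stored], dim=-1)))
        blended = g * x + (1.0 - g) * stored                      # soft combine, no eviction
        # Write the blended content back into the routed slot only;
        # all other slots are left untouched.
        return memory + route.unsqueeze(-1) * (blended - stored)


# Example usage with illustrative sizes.
writer = RoutedMemoryWrite(dim=64, num_slots=8)
memory = torch.zeros(8, 64)
memory = writer(memory, torch.randn(64))
```

The `hard=True` straight-through estimator is one plausible reading of "learns to select memory slots via Gumbel-Softmax": it keeps writes sparse (one slot per update) at inference while still letting gradients reach the router during training.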