Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Measuring the Limits of Continual Learning for LLMs

Nimit Kalra ⋅ Narutatsu Ri ⋅ Zerzar Bukhari ⋅ Ang Li ⋅ Sanae Lotfi ⋅ Liam Fowl ⋅ Micah Goldblum

Abstract

Language models are trained in stages but deployed as mostly static artifacts, leaving them poorly matched to a world that continually produces new information. This mismatch has motivated a broad class of continual learning systems that adapt models to new information through weight updates, retrieval, memory, long-context inference, or hybrid mechanisms. Yet existing evaluations do not tell us whether such systems have truly internalized new information, that is, whether they can go beyond memorizing it to update stale beliefs, resolve indirect references, compose new facts with prior knowledge, surface facts that are only implicitly relevant, and confidently recognize gaps in their own knowledge. We construct ImprintBench, a benchmark of realistic settings that expose these systematic shortcomings. ImprintBench consists of a refreshable pipeline that automatically constructs evaluations across three domains (news events, open-source API changes, and evolving personalization histories), with queries spanning six capability families: acquisition, temporal update, referential resolution, composition, implicit relevance, and boundary awareness. Across in-the-wild update scenarios, we find systematic failures common to both retrieval-based and training-based methods, showing that current systems still fall short of robustly learning from new experience.