L-CUBE: Isolating Long-Context Capacity from Knowledge with Controllable Mutual Information Scaling
Abstract
Evaluating long-context language models on natural language conflates an architecture's capacity to capture long-range dependencies with its semantic knowledge and vocabulary statistics. When models fail at long contexts, we cannot determine whether the failures stem from fundamental architectural limitations or from insufficient domain knowledge, which prevents clean diagnosis of efficient architectures before expensive training on real data. We introduce L-CUBE (Long-Context Utilization Benchmark), a synthetic benchmark that isolates dependency-capturing capacity from semantic knowledge through hierarchical Gaussian sequences with controllable bipartite mutual information scaling. The generator provides exact ground-truth conditionals and scales efficiently to arbitrarily long sequences, enabling unconfounded evaluation via conditional KL divergence rather than perplexity alone. We define long-context utilization, a metric that measures how much of the available predictive information a model extracts as its context grows. Experiments across transformers, state space models, and efficient alternatives validate the predictions of L²M capacity theory and uncover new phenomena. L-CUBE lets practitioners test whether a given design will maintain long-context capability at target sequence lengths before committing to real-data training.
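Because the generator's conditionals are Gaussian with known parameters, the per-token conditional KL divergence between the ground truth and a model's predictive distribution has a closed form. The sketch below illustrates how such an evaluation could be computed; kl_gaussian is the standard univariate-Gaussian KL, while utilization is one plausible normalization, assuming the metric is defined as the fraction of available mutual information (in nats) that the model captures. The function names and the normalization are illustrative assumptions, not necessarily the paper's exact definitions.

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL( N(mu_p, var_p) || N(mu_q, var_q) ), elementwise."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def mean_conditional_kl(true_mu, true_var, model_mu, model_var):
    """Average per-token KL between the ground-truth conditionals
    p(x_t | x_<t) and the model's predicted conditionals q(x_t | x_<t).
    All arguments are arrays of shape (T,) over sequence positions."""
    return float(np.mean(kl_gaussian(true_mu, true_var, model_mu, model_var)))

def utilization(cond_kl, available_info):
    """Hypothetical utilization score. Using the decomposition
    cross-entropy = true conditional entropy + KL, the information the
    model extracts from the context is (available_info - cond_kl), where
    available_info = I(x_t ; context) in nats. The ratio below is 1 when
    the model fully exploits the context and decreases as KL grows."""
    return 1.0 - cond_kl / available_info
```

Under this assumed definition, utilization can be tracked as a function of context length: a model that keeps utilization near 1 as the available mutual information grows is actually exploiting the longer context rather than merely tolerating it.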