Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

A Controlled Benchmark for Lag-Structured Dependency Motifs

Bowen Qi

Abstract

Long-context benchmarks often report pooled scores over heterogeneous tasks, making it difficult to identify which dependency structures a model actually recovers. We propose a controlled benchmark chart for lag-structured dependencies. Each task is specified by a normalized causal kernel and represented by a lossy but interpretable descriptor $\Phi(w)=(s,P,T_\eta,D)$, measuring support density, peakiness, tail mass, and dispersion. We instantiate 1021 tasks across eight anchor, bridge, and stress families, and compare same-order lightweight full attention, sliding-window attention, diagonal SSM, and Mamba-like selective SSM heads. The resulting chart reveals architecture-task structure hidden by pooled reporting: a pooled diagnostic summary nearly ties the two best models (0.659 vs. 0.657), while distinct families have sharply different winners. Local neighborhoods in $\Phi$ predict held-out winners with 66.7% accuracy, outperforming family-, region-, and single-model baselines; a targeted three-seed rerun preserves winners on 97.5% of mid/high-gap tasks. Finally, two query-dependent bridge probes, QueriedDecay and AddressedDecay, suggest interpretable preference migration beyond the fixed-kernel face rather than immediate collapse. These results argue for benchmark designs that report structured task neighborhoods rather than only aggregate scores.
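To make the descriptor $\Phi(w)=(s,P,T_\eta,D)$ concrete, the following is a minimal sketch of how four such statistics could be computed from a normalized kernel over lags $1,\dots,L$. The abstract does not give the exact formulas, so every definition here is an assumption chosen for illustration: support density as the fraction of lags with nonzero weight, peakiness as the largest weight, tail mass as the weight beyond lag $\eta L$, and dispersion as the weighted standard deviation of the lag divided by $L$. The function names `normalize` and `descriptor` and the default `eta=0.5` are likewise hypothetical.

```python
import math


def normalize(w):
    """Rescale nonnegative lag weights so they sum to 1."""
    total = sum(w)
    if total <= 0:
        raise ValueError("kernel must have positive total mass")
    return [x / total for x in w]


def descriptor(w, eta=0.5, eps=1e-12):
    """Return (s, P, T_eta, D) for a kernel w over lags 1..len(w).

    All four definitions are illustrative assumptions, not the
    benchmark's actual formulas.
    """
    w = normalize(w)
    L = len(w)
    lags = range(1, L + 1)

    # s: support density, fraction of lags carrying nonzero weight.
    s = sum(1 for x in w if x > eps) / L

    # P: peakiness, taken here as the largest single weight.
    P = max(w)

    # T_eta: tail mass, total weight on lags beyond eta * L.
    cutoff = eta * L
    T = sum(x for lag, x in zip(lags, w) if lag > cutoff)

    # D: dispersion, weighted std. dev. of the lag, scaled by L.
    mean = sum(lag * x for lag, x in zip(lags, w))
    var = sum(x * (lag - mean) ** 2 for lag, x in zip(lags, w))
    D = math.sqrt(max(var, 0.0)) / L

    return s, P, T, D


# Two toy kernels over 8 lags: all mass on lag 3, and a flat kernel.
one_hot = [0, 0, 1, 0, 0, 0, 0, 0]
uniform = [1] * 8

phi_one_hot = descriptor(one_hot)
phi_uniform = descriptor(uniform)
```

Under these assumed definitions the two toy kernels land at opposite corners of the descriptor space: the one-hot kernel has low support density, maximal peakiness, and zero dispersion, while the flat kernel has full support, minimal peakiness, and half its mass in the tail. The descriptor is lossy in the sense the abstract describes, since many distinct kernels share the same four numbers.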