Combining Theory and Benchmarks for Length Generalisation: Formal Certificates Meet Large-Scale Evaluation
Abstract
We present a case study in tight theory-benchmark coupling for length generalisation in RNNs. Starting from switched dynamical systems analysis, we derive a gap suppression mechanism and a margin condition that together certify generalisation from weights alone---zero false positives at H >= 4k (100/100; preconditions verified empirically on all two-phase models); certified models achieve 100% accuracy at L=1,000,000 (66,667x training length, 10/10 seeds). Inverting the certificate into a training recipe yields >=99/100 across k=3-10, 12, including a blind prediction at k=12 (20/20 vs 0/20 baseline), with extension to 42 absorbing-state regex patterns (349/350 certified, 0 FP). Equally informative are the failures: non-absorbing DFAs defeat the recipe (0% vs 80-100% baseline), gated architectures resist certification (1/20 GRU, 0/20 LSTM) even when the recipe partially works, and the certificate produces false positives when capacity is insufficient. These breakdowns delineate exactly where the theory-benchmark loop needs new theoretical input.