Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

Shubh Chapra ⋅ Dhruv Kumar ⋅ Murari Mandal ⋅ Yash Sinha

Abstract

The emergence of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of Large Language Models (LLMs), yet moving from empirical scores to quantitative performance approximations across reasoning depths remains an open challenge. We introduce the Complexity Ceiling Benchmark (CCB), a depth-parameterised evaluation framework that isolates computational depth by varying the number of required reasoning steps $N$ from 5 to 50 ($n = 40$ independent trials per depth cell) while holding all semantic parameters fixed. CCB spans three structurally distinct domains: D1 Alien Grid (grounded spatial state-tracking), D2 Symbolic Pointer Tracking (abstract alias-chain resolution), and D3 Social Logic (nested transitive relational inference). We fit a geometric accuracy decay model $P(\text{correct} \mid N) = p_d^N$, where $p_d$ is the per-step retention probability in a domain, and report parametric bootstrap 95% CIs; under the assumption of independent per-step failures, this model provides a useful empirical approximation of long-horizon reasoning capability. We also introduce the Trace First Branch Correct (TFBC) metric, which identifies the first step $k^*$ at which a reasoning trace diverges from ground truth while the final answer remains correct. Our pipeline is validated by human inter-annotator agreement ($\kappa \geq 0.938$) and an explicit parser robustness analysis. Frontier models achieve substantially higher step retention on D1 and D2 ($p_d > 0.92$), with Claude maintaining $p_d > 0.86$ on the hardest domain, D3. Verbosity ablations show that forced state-tracking offers no statistically significant benefit on structurally complex instances (McNemar $p = 1.000$, $n = 20$), providing evidence that the observed D3 difficulty cannot be reduced by prompt engineering for the evaluated models under vanilla autoregressive inference. These findings motivate new theoretical frameworks for sequential reasoning and the evaluation of process-supervised architectures.
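
To make the decay model concrete, here is a minimal sketch of how $p_d$ could be estimated by maximum likelihood and given a parametric bootstrap 95% CI. It is an illustration under stated assumptions, not the authors' pipeline: the depth grid (steps of 5 from 5 to 50), the synthetic data, the golden-section optimiser, and all function names (`fit_p`, `simulate_counts`, `bootstrap_ci`) are hypothetical choices made here. Only the model $P(\text{correct} \mid N) = p_d^N$, the range of $N$, and the 40 trials per depth come from the abstract.

```python
import math
import random


def log_likelihood(theta, depths, correct, trials):
    """Binomial log-likelihood with theta = log(p_d) < 0 (constants dropped)."""
    total = 0.0
    for depth, k in zip(depths, correct):
        log_success = depth * theta                    # log(p_d ** N)
        log_failure = math.log(-math.expm1(log_success))  # log(1 - p_d ** N)
        total += k * log_success + (trials - k) * log_failure
    return total


def fit_p(depths, correct, trials, lo=-10.0, hi=-1e-9, iters=60):
    """Golden-section search for the maximising theta; returns p_d = exp(theta).

    The log-likelihood is concave in theta, so a bracketing search suffices.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, depths, correct, trials) < log_likelihood(d, depths, correct, trials):
            a = c  # maximum lies in [c, b]
        else:
            b = d  # maximum lies in [a, d]
    return math.exp((a + b) / 2.0)


def simulate_counts(p, depths, trials, rng):
    """Draw the number of correct answers per depth under the geometric model."""
    return [sum(rng.random() < p ** depth for _ in range(trials)) for depth in depths]


def bootstrap_ci(p_hat, depths, trials, n_boot=200, seed=0):
    """Parametric bootstrap: resimulate from the fitted model, refit, take percentiles."""
    rng = random.Random(seed)
    estimates = sorted(
        fit_p(depths, simulate_counts(p_hat, depths, trials, rng), trials)
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


# Assumed depth grid; the abstract gives only the range 5..50 and n = 40 per cell.
depths = list(range(5, 55, 5))
trials = 40

# Synthetic data for illustration only (true p_d = 0.95), not benchmark results.
observed = simulate_counts(0.95, depths, trials, random.Random(1))

p_hat = fit_p(depths, observed, trials)
ci_low, ci_high = bootstrap_ci(p_hat, depths, trials)
print(f"p_d estimate: {p_hat:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

Working in $\theta = \log p_d$ keeps the likelihood concave and numerically stable for small failure probabilities. The independence assumption named in the abstract enters through the binomial likelihood: each trial at depth $N$ is treated as succeeding with probability $p_d^N$.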