Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction

Miroslav Lžičař

Abstract

We introduce CellARC, a controlled benchmark for few-shot local rule induction built from multicolor 1D cellular automata (CA). Each episode consists of five support pairs and one query, serialized in $\leq 256$ tokens, and exposes explicit knobs for alphabet size $k$, radius $r$, rule family, Langton's $\lambda$, query coverage $\mathrm{cov}$, and cell entropy $H$. The central goal is not to claim a broad measure of reasoning, but to create a small experimental system in which evidence, ambiguity, and model failure can be separated. We release 95k training episodes plus two 1k test splits, evaluate symbolic, neural, recursive, and closed-model baselines, and add oracle-calibrated identifiability diagnostics. Under the unrestricted induced local-map class, query cells whose windows do not appear in the support are information-theoretically ambiguous, while cells whose windows do appear are lookup-identifiable. We also implement structured posteriors for affine linear maps modulo $k$ and for one-step native totalistic/threshold families, exposing cells that are rule-identifiable even when lookup fails. This view explains why token accuracy alone can overstate progress and motivates exact-query, balanced, non-quiescent, and identifiability-conditioned metrics. Current results show that exact support-only structured solvers close the affine and native certifiable non-lookup strata, while seed-swept compact ICL baselines remain far below those certifying oracles. The released extrapolation split is difficulty-skewed rather than family-balanced, so we also provide generated family-balanced diagnostics; a lightweight learned family proposer plus exact verifier closes the generated affine diagnostic and partially closes the native and family-balanced diagnostics.
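
The distinction between seen and unseen windows can be illustrated with a minimal sketch. This is not the CellARC implementation: the function names (`windows`, `build_lookup`, `predict`, `coverage`), the periodic boundary handling, the single support pair, and the toy rule (XOR of the two outer neighbors, $k=2$, $r=1$) are all assumptions chosen for illustration. The sketch builds a window-to-output table from the support and marks query cells whose window never occurred in the support as undetermined.

```python
from typing import Dict, List, Optional, Tuple

Window = Tuple[int, ...]
Row = List[int]


def windows(row: Row, r: int) -> List[Window]:
    """Radius-r neighborhood of every cell (periodic boundary is an assumption)."""
    n = len(row)
    return [tuple(row[(i + d) % n] for d in range(-r, r + 1)) for i in range(n)]


def build_lookup(support: List[Tuple[Row, Row]], r: int) -> Dict[Window, int]:
    """Collect every (window -> output cell) observation from the support pairs."""
    table: Dict[Window, int] = {}
    for x, y in support:
        for w, out in zip(windows(x, r), y):
            table[w] = out
    return table


def predict(query: Row, table: Dict[Window, int], r: int) -> List[Optional[int]]:
    """Return the looked-up output per cell, or None where the window is unseen."""
    return [table.get(w) for w in windows(query, r)]


def coverage(pred: List[Optional[int]]) -> float:
    """Fraction of query cells whose window appeared in the support."""
    return sum(p is not None for p in pred) / len(pred)


def toy_rule(w: Window) -> int:
    """Hypothetical ground-truth rule for the demo: XOR of the outer neighbors."""
    return (w[0] + w[-1]) % 2


# One hypothetical support pair and one query, k = 2, r = 1.
R = 1
support_x = [0, 1, 1, 0, 1, 0]
support_y = [toy_rule(w) for w in windows(support_x, R)]
table = build_lookup([(support_x, support_y)], R)

query = [1, 1, 1, 0, 0, 0]
pred = predict(query, table, R)
print(pred, coverage(pred))
```

In this toy episode the windows `(1, 1, 1)` and `(0, 0, 0)` never occur in the support, so a pure lookup leaves those two query cells undetermined. A structured solver restricted to a narrower rule family could still determine such cells, which is the case the structured posteriors described in the abstract are meant to expose.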