PLSemanticsBench: A Formal Semantics Reasoning Benchmark for Code
Abstract
Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than on statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics define behavior through symbolic transition rules that can be systematically altered under distribution shift. We investigate whether LLMs can condition their reasoning on formal semantics through program execution and introduce PLSemanticsBench, which pairs featherweight C programs with two semantic systems, small-step operational semantics and K semantics, and probes four capabilities: composing rules to compute final states, selecting rules when state is unmutated, sustaining such conditioning over long traces, and following supplied rules under novel semantics. To decouple semantic reasoning from syntactic familiarity, we redefine familiar operators to induce symbol-meaning conflict and introduce novel symbols defined only through the supplied rules, and we stress-test models on Human-Written, LLM-Translated, and Fuzzer-Generated splits of increasing structural complexity.

Across 11 frontier LLMs, strong final-state accuracy under standard semantics (up to 90%) drops sharply, by as much as 40-60 percentage points, under semantic mutations and increasing structural complexity. Only a handful of models achieve non-zero long-horizon conditioning accuracy, and even the best systems reach just 35%. Together, these results suggest that contemporary LLMs often rely on pretrained lexical associations rather than systematically conditioning on supplied formal rules. An anonymized version of PLSemanticsBench is available at https://huggingface.co/datasets/LambdaadbmaL/PLSemanticsBench.
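As a hypothetical illustration of the symbol-meaning conflict described above (a minimal sketch of our own, not a program drawn from the benchmark), consider a featherweight C-style program whose final state changes once the meaning of a familiar operator is redefined by the supplied rules:

    #include <stdio.h>

    int main(void) {
        int x = 4;
        int y = 3;
        /* Under standard C semantics, z evaluates to 4 + 3 = 7.            */
        /* Under a mutated rule set in which the symbol `+` is redefined    */
        /* to denote subtraction, the same program yields z = 4 - 3 = 1,    */
        /* so a model that truly conditions on the supplied rules must      */
        /* report 1 rather than the familiar pretrained answer 7.           */
        int z = x + y;
        printf("z = %d\n", z);
        return 0;
    }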