Poster
in
Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

GraphStateEval: A Step-by-Step Evaluation Framework for Graph Algorithm Execution in Large Language Models via Intermediate State Tracing

Kanav Kapoor ⋅ Dhruv Kumar ⋅ Jagat S Challa ⋅ Murari Mandal ⋅ Yash Sinha

Project Page

Abstract

We introduce GraphStateEval, a step-by-step evaluation framework and benchmark for measuring how accurately large language models (LLMs) execute graph algorithms, requiring models to reproduce complete internal memory states---queue, stack, priority queue, visited set, and neighbour lists---at every loop iteration under zero-shot execution with no teacher forcing or intermediate state corrections. Ground-truth traces are generated deterministically by a programmatic engine with canonical tie-breaking rules, enabling arbitrarily complex, automatically verified benchmarks that resist memorisation and saturation. We evaluate Gemma 3 12B on BFS, DFS, and Dijkstra across five graph topologies scaled from $N{=}3$ to $N{=}12$ nodes using two zero-shot prompting strategies (natural language and source code). Our Step-Exact-Match (SEM) metric reveals a pronounced topology hierarchy: path graphs sustain near-perfect tracing throughout, while sparse-random graphs degrade substantially beyond $N{\approx}6$; conversely, dense-random and complete graphs often maintain surprising resilience up to $N{\approx}10$ due to shallower traversals. All results reflect single-seed evaluation and should be interpreted as exploratory. A persistent gap between SEM and Final-Answer-Correctness (FAC) across BFS, DFS, and Dijkstra illustrates algorithmic reward hacking---where correct final answers are produced despite internally incorrect intermediate state traces---showing that answer-level evaluation can drastically overestimate true step-level algorithmic execution fidelity.