Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification
Abstract
Large language models (LLMs) are increasingly applied to technical domains requiring structured, multi-step algorithmic reasoning. Yet systematic benchmarks that isolate intermediate reasoning fidelity — rather than merely checking final answers — remain scarce. We introduce SVAC (Step-Level Verification of Algorithm Correctness), a benchmark that evaluates LLMs on their ability to manually trace three canonical Operating System resource-management algorithms: page replacement (FIFO, LRU, OPT, Clock), process scheduling (SJF, Round Robin, MLFQ), and disk scheduling (FCFS, SSTF, SCAN, C-SCAN, LOOK). For each problem instance, models must produce a full execution trace under strict constraints: no code, no external tools, and no step abbreviation. Ground-truth traces generated by deterministic solvers enable fine-grained, step-level scoring across five complexity levels and three prompting strategies (zero-shot, few-shot, chain-of-thought). Our evaluation reveals characteristic failure modes including early error cascades, hit/miss misclassification, and algorithm-specific state corruption that vary systematically across model families, algorithm types, and input complexity. SVAC provides the first unified framework for assessing OS-algorithm reasoning in LLMs, with implications for understanding the depth and limits of procedural reasoning in current foundation models.