Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

Joint Evaluation of Compliance, Planning, and Consistency under Paraphrase: A Relational-Complexity View of Frontier LLMs

Shivansh Bibra ⋅ Dhruv Kumar ⋅ Murari Mandal ⋅ Yash Sinha

Abstract

We introduce a 971-prompt benchmark that jointly evaluates three properties of frontier large language models (LLMs) under controlled prompt variation: (i) instruction compliance as the number of formatting constraints scales, (ii) multi-step planning as reasoning depth grows, and (iii) within-model lexical consistency across paraphrased restatements of the same underlying question. We evaluate three frontier systems (GPT-4o, Gemini, and DeepSeek) over 2,913 real API responses (971 prompts per model) on a closed factual domain. Compliance is low and appears complexity-sensitive (means 0.46--0.56; exploratory Wilcoxon $p{<}0.05$ in all three models); planning accuracy is non-monotonic in step count, with DeepSeek leading at every depth (0.76--0.89); and within-model Jaccard similarity across paraphrases averages only ${\sim}0.27$, indicating substantial surface-form variability. Cross-model Jaccard suggests that the outputs of GPT-4o and Gemini are lexically closer to each other (0.512) than either is to those of DeepSeek, which may be consistent with convergent alignment behaviour among proprietary models. We interpret these findings through the lens of Relational Complexity (RC) as a conceptual framing, drawing on the REL benchmark of Fesser et al.\ (2025), and report an exploratory task-vs-form pattern in which the strongest performer on task accuracy is the least lexically stable under paraphrase. The benchmark, prompts, responses, and analysis pipeline are released for reproducibility.
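
As an illustration of the consistency metric named above, the following is a minimal sketch of a token-level Jaccard score averaged over all pairs of responses to paraphrases of one question. It is not the released analysis pipeline: the lowercase alphanumeric tokenization, the function names (`tokens`, `jaccard`, `mean_pairwise_jaccard`), and the choice to average over unordered pairs are all assumptions made for the example.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase alphanumeric word set (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # convention: two empty responses count as identical
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_jaccard(responses):
    """Average Jaccard over all unordered pairs of responses.

    `responses` holds one model's answers to paraphrases of a
    single underlying question (hypothetical input layout).
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, a low average means that a model's answers to paraphrases share few words, regardless of whether those answers are correct. The same `jaccard` function could be applied to responses from two different models to the same prompt to obtain a cross-model score.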