Exploring Relational Reasoning Capabilities in LLMs with REL
Abstract
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. While this capability is essential for scientific reasoning, most existing evaluations of relational reasoning in large language models focus on structured inputs such as tables and graphs, or on synthetic relational tasks, and do not isolate the sources of difficulty that arise from higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), defined as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Evaluating frontier LLMs, we observe a consistent, monotonic degradation in performance as RC increases, even when the total number of entities is held fixed. This failure mode persists under increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or exposure to examples. Our results identify a well-defined regime of higher-arity reasoning in which current models struggle, and they motivate revisiting reasoning benchmarks through the lens of relational complexity.