SEDRAS: Symbolically Evaluated Deep Research And Science
Abstract
As the reasoning capabilities of Large Language Models (LLMs) expand, evaluating true inductive generalization on entirely unseen data becomes increasingly challenging. To this end, we introduce a modular in-context learning evaluation framework that is scalable and extensible across its separate modules. It is built on synthetic scenarios with controllable complexity along three independent axes: \textbf{1)} the logic of the underlying data distribution (UDD), \textbf{2)} its projection into diverse representations, and \textbf{3)} the interaction dynamic determining how the model accesses and explores the data. For each scenario, the model is tasked with performing in-context scientific discovery and producing an interpretable theory in natural language that explains the observations. In a separate conversation, the model is then tasked with converting the generated theory into executable code, which can be programmatically compared against the underlying data distribution. Using this framework, we produce an initial suite of 600 diverse scenarios and use it to evaluate and analyze several state-of-the-art LLMs. Although these experiments show that Gemini 3.0 Pro achieves the best overall score, each model excels at different tasks. For example, GPT 5.2 is the clear winner on purely symbolic data, Claude Opus 4.5 is the best at working with files, Gemini 3.0 Pro is strongest on non-dynamic scenarios, and Grok 4.1 is strongest as UDD complexity scales. Furthermore, all models struggle with active exploration and are seemingly incapable of identifying informative data points, exploring less efficiently than a random baseline. This highlights how much room for improvement remains for state-of-the-art LLMs, even without scaling the benchmark's complexity further.
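To make the symbolic evaluation step concrete, the sketch below shows one plausible way a theory-as-code could be scored against a hidden UDD: execute the model-generated program on held-out probe inputs and measure agreement with the ground-truth rule. The names `udd`, `theory_program`, and `score_theory` are illustrative assumptions, not the paper's actual harness.

```python
# Minimal sketch of a SEDRAS-style symbolic evaluation (hypothetical names).
# The UDD is a hidden ground-truth function; the model's natural-language
# theory, converted to code in a separate conversation, is scored by
# comparing its predictions against the UDD on held-out probe inputs.

import random


def udd(x: int) -> int:
    """Hidden ground-truth rule the model must discover (illustrative only)."""
    return x * x + 3


def theory_program(x: int) -> int:
    """Executable code produced from the model's natural-language theory."""
    return x * x + 3  # a perfect recovery reproduces the UDD exactly


def score_theory(theory, ground_truth, n_probes: int = 100) -> float:
    """Fraction of held-out probe points where the theory matches the UDD."""
    probes = [random.randint(-1000, 1000) for _ in range(n_probes)]
    hits = sum(theory(x) == ground_truth(x) for x in probes)
    return hits / n_probes


if __name__ == "__main__":
    print(f"agreement: {score_theory(theory_program, udd):.2f}")
```

Because the comparison is programmatic, this scoring avoids LLM-as-judge subjectivity: a theory is credited only insofar as its executable form reproduces the UDD's behavior.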