Evolving Quantitative Reasoning through Self-Play in Digital Twin Markets
Abstract
Large Language Models (LLMs) exhibit strong capabilities in high-level semantic understanding and strategic planning, yet they suffer from persistent quantitative failure modes, such as imprecise computation and the illusion of quantitative coherence, which limit their reliability in high-stakes decision-making. To address these limitations, we decouple reasoning from computation: LLMs are responsible for planning, analysis, and result interpretation, while numerical computation and statistical inference are delegated to specialized external tools. These tools are not hard-coded; instead, they are constructed in a constrained, structured manner during planning as explicit intermediate reasoning artifacts, enabling adaptive, scenario-dependent quantitative reasoning. LLMs iteratively analyze tool outputs under diverse market conditions and leverage performance-based feedback to refine subsequent tool selection and construction, forming a bounded self-evolving loop. We instantiate this process through self-play in a controllable digital twin market, DecoupledMarket, where LLM agents continuously test, compare, and adapt their strategies. By coupling high-level planning with robust quantitative execution, the proposed framework improves the quantitative reliability of LLM-driven decision-making. Code will be released soon.
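To make the bounded self-evolving loop concrete, the sketch below illustrates one possible reading of the abstract: a planning step emits a tool specification as an explicit intermediate artifact, the tool performs the numerical work on a simulated market episode, and performance feedback conditions the next round of tool construction. All names here (propose_tool_spec, ToolSpec, simulate_market, evaluate, self_evolve) are hypothetical placeholders, not the released API, and the LLM and digital twin market are stubbed out.

```python
# Minimal sketch of the bounded self-evolving loop described in the abstract.
# Every identifier is a hypothetical stand-in; the LLM planner and the
# DecoupledMarket environment are replaced by simple stubs.
import random
import statistics
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToolSpec:
    """An explicit intermediate reasoning artifact: a constrained description
    of the numerical routine the planner wants, plus its executable form."""
    name: str
    description: str
    fn: Callable[[List[float]], float]


def propose_tool_spec(feedback: str) -> ToolSpec:
    """Stand-in for the LLM planning step: emit a tool spec, revised
    according to performance feedback from the previous round."""
    if "volatile" in feedback:
        # Feedback indicated a noisy regime, so switch to a robust statistic.
        return ToolSpec("median_return", "robust central tendency",
                        lambda xs: statistics.median(xs))
    return ToolSpec("mean_return", "simple average of returns",
                    lambda xs: statistics.fmean(xs))


def simulate_market(seed: int) -> List[float]:
    """Stand-in for one episode in the digital twin market."""
    rng = random.Random(seed)
    return [rng.gauss(0.01, 0.05) for _ in range(100)]


def evaluate(signal: float, returns: List[float]) -> Tuple[float, str]:
    """Score the tool output against the episode and produce textual feedback."""
    realized = sum(returns)
    score = -abs(signal * len(returns) - realized)
    note = "volatile regime" if statistics.pstdev(returns) > 0.04 else "calm regime"
    return score, note


def self_evolve(max_rounds: int = 5) -> None:
    feedback = ""
    for round_id in range(max_rounds):           # bounded loop
        spec = propose_tool_spec(feedback)       # planning -> tool construction
        returns = simulate_market(seed=round_id) # one self-play episode
        signal = spec.fn(returns)                # numerical work done by the tool
        score, feedback = evaluate(signal, returns)
        print(f"round {round_id}: tool={spec.name} score={score:.4f} ({feedback})")


if __name__ == "__main__":
    self_evolve()
```

In a full system the stubbed planner would be an LLM call that emits and refines tool code, and the simulated episode would come from the digital twin market; the sketch only fixes the control flow of the loop, not the components.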