XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition
Abstract
Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization over scientific knowledge remains under-characterized. Existing benchmarks focus primarily on restricted, single-turn scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize composition order and mixture structure to enable systematic stress-testing from single-discipline to interdisciplinary reasoning. The benchmark comprises 8,598 interactive sessions spanning 20 domains and 4 task categories, with 8 realistic trajectory patterns that cover difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures, where trajectory patterns trigger error accumulation, reasoning breakdowns, and domain confusion, ultimately leading to session collapse.
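To make the two diagnostic axes concrete, the following is a minimal sketch of one plausible way to represent them: composition order as the number of distinct domains a session composes, and mixture structure as the distribution of those domains across turns. This is an illustration under stated assumptions, not the benchmark's released code; all identifiers (Session, Turn, composition_order, mixture_structure) are hypothetical.

```python
# Hypothetical sketch of the two diagnostic axes; not the authors' code.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Turn:
    domains: tuple[str, ...]  # disciplines a single query draws on
    prompt: str

@dataclass
class Session:
    turns: list[Turn] = field(default_factory=list)

    def composition_order(self) -> int:
        # Number of distinct domains composed across the whole session
        # (1 = single-discipline, >1 = interdisciplinary).
        return len({d for t in self.turns for d in t.domains})

    def mixture_structure(self) -> Counter:
        # How often each domain appears across turns, i.e. the domain mixture.
        return Counter(d for t in self.turns for d in t.domains)

# A toy second-order session mixing two domains across turns.
session = Session(turns=[
    Turn(("chemistry",), "Predict the product of reaction X."),
    Turn(("chemistry", "materials_science"), "Relate that product to lattice Y."),
])
assert session.composition_order() == 2
print(session.mixture_structure())
# Counter({'chemistry': 2, 'materials_science': 1})
```

Under this framing, stress-testing amounts to sweeping sessions from order 1 upward while varying the mixture across turns, which is how the trajectory patterns described above could modulate difficulty and domain-mixture dynamics.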