Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs
Abstract
Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs, while end-to-end 3D multi-modal LLMs (3D MLLMs) can handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM that bridges these paradigms by distilling symbolic reasoning patterns into MLLMs via natural language chain-of-thought (CoT). Our three-stage curriculum progressively builds reasoning capability: (1) 3D perception alignment, which grounds object visual-geometric features in the LLM's textual embedding space; (2) CoT-SFT, which teaches systematic query decomposition and stepwise spatial verification from symbolic program traces; and (3) CoT-RL, which extends the learned reasoning patterns to open-set concepts and deeply nested instructions. This curriculum transfers reasoning patterns rather than concept-specific knowledge, preserving key NS3D virtues: transparent reasoning traces and modular interchangeability of planning and perception components. Extensive evaluations show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning benchmarks, establishing a unified framework that combines the systematic reasoning of symbolic methods with the flexibility of modern LLMs.