Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework
Abstract
Hybrid LLM–GNN systems are typically evaluated on task accuracy alone, which can lead to misleading conclusions about the effectiveness of fusion. In this work, we show that apparent improvements are often dominated by a single modality rather than reflecting genuine cross-modal interaction. We propose a fusion-aware evaluation framework that attributes gains relative to strong unimodal baselines and organises evaluation into three dimensions: effectiveness, stability, and responsibility. Through case studies on four benchmarks (Cora, Citeseer, WikiCS, PubMed) that vary in scale, domain, graph construction, and feature characteristics, we find that fusion does not consistently improve in-domain accuracy; its benefits emerge mainly in cross-dataset transferability and robustness under perturbations. Furthermore, responsibility metrics reveal hidden trade-offs, including fairness degradation and negligible explainability gains, although fusion adds minimal computational overhead. These results demonstrate that accuracy-centric evaluation is insufficient and may obscure critical system behaviours. Our framework provides a structured and transparent way to analyse when fusion is beneficial and when it introduces trade-offs, offering practical guidance for evaluating hybrid graph-language systems.
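To make the attribution idea concrete, the following is a minimal sketch of how a gain might be measured against the strongest unimodal baseline rather than against an arbitrary one. It is an illustration under stated assumptions, not the framework's actual implementation: the function name `attribute_fusion_gain`, the `FusionAttribution` record, the baseline labels, and the scores in the usage example are all hypothetical placeholders and are not results from this work.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FusionAttribution:
    """Hypothetical record: which unimodal baseline was strongest, and the gain over it."""
    best_unimodal: str
    best_unimodal_score: float
    fusion_gain: float


def attribute_fusion_gain(
    fused_score: float, unimodal_scores: Mapping[str, float]
) -> FusionAttribution:
    """Report the fused model's gain over the strongest unimodal baseline.

    Assumption: every score uses the same metric (e.g. accuracy) on the
    same evaluation split, and higher is better.
    """
    if not unimodal_scores:
        raise ValueError("at least one unimodal baseline is required")
    # Pick the baseline with the highest score; gain is measured against it.
    best_name = max(unimodal_scores, key=lambda name: unimodal_scores[name])
    best_score = unimodal_scores[best_name]
    return FusionAttribution(best_name, best_score, fused_score - best_score)


# Placeholder scores for illustration only (not taken from this paper).
example = attribute_fusion_gain(0.80, {"llm_only": 0.70, "gnn_only": 0.80})
print(example)
```

In this placeholder case the fused model beats the LLM-only baseline by ten points yet shows no gain over the GNN-only baseline, which is the kind of single-modality dominance that comparing against a weak baseline would hide.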