Synergistic Space-Vision Processing for Predicate Inference
Zhenhua Lei ⋅ Zefang Han ⋅ Yu Qiu
Abstract
Scene graph generation (SGG) aims to parse an image into a structured graph of objects and their predicates, enabling explicit relational reasoning for visual understanding. However, prevailing methods often over-predict geometric predicates, resulting in scene graphs that are factually correct yet semantically shallow. While recent works largely attribute this phenomenon to the long-tailed data distribution, we identify another critical factor driving such biased prediction: co-occurrence-induced representation entanglement, where geometric and non-geometric predicates that frequently co-occur are encoded into overly similar representations. To address this, we introduce the Dual-stream Synergistic Network (DS-Net), which models geometric and non-geometric predicates with two specialized streams coupled by a bidirectional cross-stream fusion mechanism. The space stream focuses on spatial and structural cues, while the vision stream captures fine-grained visual evidence and semantic priors. Extensive experiments show that DS-Net consistently improves predicate inference, achieving 1.3\% $\sim$ 6.1\% absolute gains in mR@100 on the SGGen task when integrated into existing SGG methods. These results highlight the importance of synergistic modeling of geometric and non-geometric predicates for generating semantically richer scene graphs.
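To make the dual-stream design concrete, below is a minimal PyTorch sketch of two specialized streams joined by a bidirectional cross-stream fusion step. All names (`DSNetSketch`, `BidirectionalFusion`) and the choice of cross-attention as the fusion operator are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    """Exchanges information between the two streams in both directions.

    Assumption: cross-attention is one plausible realization of the
    "bidirectional cross-stream fusion" named in the abstract; the
    paper's exact mechanism may differ.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.space_to_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_to_space = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, space_feat, vision_feat):
        # Each stream attends to the other and folds the result in residually.
        s2v, _ = self.space_to_vision(vision_feat, space_feat, space_feat)
        v2s, _ = self.vision_to_space(space_feat, vision_feat, vision_feat)
        return space_feat + v2s, vision_feat + s2v


class DSNetSketch(nn.Module):
    """Two specialized streams plus fusion, then predicate classification."""

    def __init__(self, in_dim: int, hidden: int, num_predicates: int):
        super().__init__()
        # Space stream: spatial and structural cues (e.g., box-geometry features).
        self.space_stream = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Vision stream: fine-grained visual evidence and semantic priors.
        self.vision_stream = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.fusion = BidirectionalFusion(hidden)
        self.classifier = nn.Linear(2 * hidden, num_predicates)

    def forward(self, spatial_feat, visual_feat):
        s = self.space_stream(spatial_feat)  # (B, N, hidden)
        v = self.vision_stream(visual_feat)  # (B, N, hidden)
        s, v = self.fusion(s, v)
        # Concatenated joint representation scores every predicate class.
        return self.classifier(torch.cat([s, v], dim=-1))
```

Because the classifier sees the concatenation of both streams after fusion, geometric and non-geometric evidence remain separately encoded until the final decision, which is one way such a design could reduce the representation entanglement the abstract describes.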