HSGG: Training-Free Hierarchical Scene Graph Generation with Geometry-Guided Relation Reasoning
Abstract
Scene Graph Generation (SGG) connects visual perception with structured reasoning, but it is limited by scarce annotations and the long-tailed distribution of relational predicates. Training-free methods based on vision-language models (VLMs) reduce supervision requirements, yet they often rely on flat graph structures and produce hallucinated relations. We present HSGG, a training-free framework for open-world Hierarchical Scene Graph Generation whose inference proceeds in two steps. First, bidirectional hierarchical entity perception combines top-down object expansion with bottom-up attribute reasoning to construct multi-level scene hierarchies that capture part–whole semantics. Second, geometry-guided relation reasoning infers valid relations among these structured entities in two stages. Geometry-aware relation filtering prunes spatially implausible object pairs using 2D proximity, depth cues, and object scale. Geometry-grounded contrastive relation decoding then suppresses hallucinated predicates by contrasting predictions from a visually grounded expert against those of a hallucination-prone geometric prior, so that the retained relations are both geometrically consistent and semantically coherent. Experiments show that HSGG generalizes to unseen objects and predicates without training, substantially reduces relational hallucinations, and consistently improves downstream reasoning performance.
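
The two relation-reasoning stages can be illustrated with a minimal Python sketch. It is not the HSGG implementation: the abstract does not give the filtering rule or the decoding formula, so everything below is an assumption made for illustration. The function names (`is_plausible_pair`, `contrastive_decode`), the thresholds (`max_dist`, `max_depth_gap`, `max_scale_ratio`), and the weights (`alpha`, `beta`) are hypothetical. The scoring rule is the generic contrastive-decoding form, (1 + alpha) * log p_expert - alpha * log p_prior, restricted to predicates the expert itself finds plausible. Predicate log-probabilities are passed in as plain dictionaries and would in practice come from a VLM.

```python
import math


def _center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def _area(box):
    """Area of a box given as (x1, y1, x2, y2); zero for degenerate boxes."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def is_plausible_pair(a, b, image_diag,
                      max_dist=0.5, max_depth_gap=0.3, max_scale_ratio=50.0):
    """Keep a pair only if it passes all three geometric checks.

    a, b: dicts with 'box' (x1, y1, x2, y2) and 'depth' normalized to [0, 1].
    All thresholds are hypothetical values chosen for this sketch.
    """
    (ax, ay), (bx, by) = _center(a["box"]), _center(b["box"])
    # 2D proximity: center distance as a fraction of the image diagonal.
    dist = math.hypot(ax - bx, ay - by) / image_diag
    # Depth cue: absolute difference of normalized depths.
    depth_gap = abs(a["depth"] - b["depth"])
    # Object scale: ratio of the larger area to the smaller one.
    small, large = sorted((_area(a["box"]), _area(b["box"])))
    if small <= 0.0:
        return False
    return (dist <= max_dist
            and depth_gap <= max_depth_gap
            and large / small <= max_scale_ratio)


def contrastive_decode(expert_logp, prior_logp, alpha=1.0, beta=0.2):
    """Pick the predicate favored by the expert but not by the prior.

    expert_logp / prior_logp: dicts mapping predicate -> log-probability.
    Predicates whose expert probability is below beta * (best expert
    probability) are discarded before scoring (assumed plausibility cutoff).
    Returns (best_predicate, scores_of_retained_predicates).
    """
    cutoff = math.log(beta) + max(expert_logp.values())
    scores = {
        pred: (1.0 + alpha) * lp - alpha * prior_logp[pred]
        for pred, lp in expert_logp.items()
        if lp >= cutoff
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up entities and probabilities.
person = {"box": (100, 100, 200, 300), "depth": 0.40}
horse = {"box": (120, 200, 320, 400), "depth": 0.45}
cloud = {"box": (700, 10, 790, 60), "depth": 0.95}
diag = math.hypot(800, 600)  # diagonal of an assumed 800x600 image

expert = {"riding": math.log(0.45), "next to": math.log(0.50),
          "flying over": math.log(0.05)}
prior = {"riding": math.log(0.10), "next to": math.log(0.85),
         "flying over": math.log(0.05)}

if is_plausible_pair(person, horse, diag):
    predicate, _ = contrastive_decode(expert, prior)
    print(("person", predicate, "horse"))
```

In this toy example the person and cloud pair is pruned by the proximity check before any predicate is scored. For the person and horse pair, the expert alone would slightly prefer the generic "next to", while the contrastive score selects "riding" because the geometric prior already explains most of the mass on "next to".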