Learning Context-Conditioned Predicate Semantics via Prototype Feedback
Abstract
In scene graph generation (SGG), a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or by retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect the evidence in a given image, leading to systematic confusion in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers image-conditioned predicate semantics from the set of relations within each image and feeds the adapted semantics back to recalibrate relation representations while preserving dataset-level semantic coherence. The learning objective anchors context adaptation to global semantic centers, preventing semantic drift while still allowing selective semantic reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over strong baselines, with F@100 gains of +1.4 and +2.7 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization in which prototypes selectively merge or separate predicates according to scene evidence.
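To make the feedback loop concrete, the sketch below gives one plausible reading of the mechanism summarized above: image-conditioned prototypes are inferred from pooled relation features, fed back to recalibrate relation representations, and tied to dataset-level centers by an anchoring term. The module name `PrototypeFeedback`, the attention-based adaptation, the gated residual recalibration, and the MSE anchoring loss are all illustrative assumptions; the abstract does not specify AlignG's actual equations.

```python
# Minimal sketch of a prototype-feedback module, assuming an attention-based
# adaptation and an MSE anchoring loss; NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeFeedback(nn.Module):
    def __init__(self, num_prototypes: int, dim: int, anchor_weight: float = 0.1):
        super().__init__()
        # Dataset-level ("global") predicate prototypes, shared across images.
        self.global_prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.query = nn.Linear(dim, dim)      # maps pooled scene evidence to a prototype query
        self.gate = nn.Linear(2 * dim, dim)   # fuses adapted semantics back into relation features
        self.anchor_weight = anchor_weight

    def forward(self, rel_feats: torch.Tensor):
        # rel_feats: (R, dim) features of all candidate relations in one image.

        # 1) Infer image-conditioned prototypes: pool the image's relational
        #    evidence and use it to shift the global prototypes.
        scene = rel_feats.mean(dim=0, keepdim=True)                        # (1, dim)
        attn = F.softmax(
            self.global_prototypes @ self.query(scene).t() / rel_feats.size(-1) ** 0.5,
            dim=0,
        )                                                                   # (P, 1)
        adapted = self.global_prototypes + attn * scene                     # (P, dim)

        # 2) Feed the adapted semantics back to recalibrate relation representations.
        sim = F.softmax(rel_feats @ adapted.t(), dim=-1)                    # (R, P)
        context = sim @ adapted                                             # (R, dim)
        recalibrated = rel_feats + torch.tanh(
            self.gate(torch.cat([rel_feats, context], dim=-1))
        )

        # 3) Anchor adapted prototypes to the global centers to prevent semantic drift.
        anchor_loss = self.anchor_weight * F.mse_loss(adapted, self.global_prototypes.detach())
        return recalibrated, anchor_loss
```

Under this reading, the anchoring term is what lets prototypes reorganize only when the scene's relational cues are strong enough to outweigh the pull toward the global centers.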