SegPVSG: Panoptic Video Scene Graph Generation via Temporal Focusing and Generative Augmentation
Abstract
Panoptic Video Scene Graph Generation (PVSG) aims to identify relations between pixel-level entities in a video, serving as a novel paradigm for structured video parsing. However, this task faces two key challenges. First, interactions between entities are temporally sparse and fragmented: videos are dominated by irrelevant content, with salient information confined to short segments. Second, the distribution of relations is heavily long-tailed, so models struggle on tail categories with insufficient training data. To address these issues, we propose SegPVSG, a temporal-segment-aware PVSG framework with two key components: TempFocusNet (TFN) and a Relation-centric Generative Video Augmentation (RGVA) module. TFN is a localization-then-recognition network that improves PVSG performance by explicitly localizing and attending to salient segments before relation recognition. RGVA is a novel augmentation module that generates realistic, context-consistent video segments for rare relations and coherently inserts them into the original videos. Our method outperforms prior methods by +3.53 mR@20 and +5.9 mR@50, demonstrating its effectiveness. Code will be released.