Modeling Long-Tail Relations in the Operating Room via In-Context Multimodal Learning
Abstract
Operating room (OR) scene graph generation (SGG) enables holistic modeling of the OR domain by encoding interactions among medical staff, tools, and equipment as triplet-based structured scene graphs. Although existing OR SGG methods achieve satisfactory overall performance, their accuracy on long-tail categories is substantially lower than on head categories in OR data. We introduce SGG-ICL, a novel framework that represents the first attempt to address the long-tail problem in OR SGG by leveraging in-context learning (ICL). SGG-ICL first identifies long-tail samples via an Adaptive Router module and applies ICL only to these samples. This selective routing strategy improves performance on long-tail categories without degrading head-category accuracy. SGG-ICL then constructs a candidate pool through multimodal retrieval and employs a trained multimodal large language model (MLLM) Reranker to re-rank the candidates, selecting the examples most similar to the test sample as in-context demonstrations. The reranker is supervised by IoU scores derived from annotated SGG triplets and exploits rich multimodal information to estimate pairwise sample similarity. Experimental results show that SGG-ICL improves accuracy on long-tail categories by 6.9%, while also achieving a 2.6% improvement in overall accuracy.
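The abstract's pipeline (route long-tail samples to ICL, retrieve candidates, then re-rank them) can be sketched in simplified form. All names here (`AdaptiveRouter`, `retrieve_candidates`, `rerank`) are illustrative assumptions, not the paper's actual implementation; the trained MLLM scorer is replaced by the triplet-IoU supervision target it is trained to approximate.

```python
# Illustrative sketch of the SGG-ICL pipeline; names and details are
# hypothetical and do not reflect the paper's actual code.
import math
from dataclasses import dataclass

@dataclass
class Sample:
    features: list      # stand-in for a multimodal embedding
    triplets: set       # annotated (subject, predicate, object) triplets

def triplet_iou(a, b):
    """IoU over annotated triplet sets -- the reranker's supervision signal."""
    union = len(a.triplets | b.triplets)
    return len(a.triplets & b.triplets) / union if union else 1.0

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

class AdaptiveRouter:
    """Routes a sample to ICL only when its predicted category is long-tail."""
    def __init__(self, tail_categories):
        self.tail = set(tail_categories)

    def is_long_tail(self, predicted_category):
        return predicted_category in self.tail

def retrieve_candidates(query, pool, k):
    """Build a candidate pool via embedding similarity (multimodal retrieval)."""
    return sorted(pool, key=lambda s: -cosine(query.features, s.features))[:k]

def rerank(query, candidates, m):
    """Re-rank candidates; here the IoU target stands in for the MLLM scorer."""
    return sorted(candidates, key=lambda s: -triplet_iou(query, s))[:m]
```

A head-category sample bypasses the router entirely and is handled by the base SGG model, which is what preserves head-category accuracy.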