EditCLEVR: A Paired-Scene Intervention Benchmark for Compositional Faithfulness of Object-Centric Representations
Anuraag Gadehothur Karnam ⋅ Tarunesh Sathish
Abstract
Object-centric learning makes a concrete structural prediction: when one object changes one attribute, the corresponding object code should move, the other object codes should remain stable, and the decoded scene graph should update only at that site. Existing evaluations usually report segmentation, single-image factor prediction, or downstream accuracy, so this prediction is rarely tested as an intervention claim. We introduce EditCLEVR, a paired-scene benchmark in which each example contains a before/after CLEVR-style scene pair with the same layout and exactly one known attribute change on one known object, or a no-edit re-render for drift measurement. The protocol separates representation-level diagnostics for localization and stability from semantic faithfulness metrics that check whether decoded scene changes match the intended intervention across in-distribution and compositional out-of-distribution (OOD) suites. Scene-Graph Intervention Accuracy (SGIA) is the headline semantic metric: it requires the after-scene prediction to be correct and the only predicted before-to-after semantic change to be the intended object-factor edit; $\Delta$SGIA relaxes this by checking the single-site change pattern without requiring the full after-scene graph to be correct. Across ground-truth-mask backbones, learned-slot models, SAM 2 + frozen-ViT models, and one mask-feature hybrid, EditCLEVR shows that OOD degradation persists with perfect masks, mask source explains part but not all of native performance, and locality or stability alone can overstate semantic faithfulness.
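The difference between SGIA and $\Delta$SGIA can be made concrete with a minimal per-example sketch. It is an illustration under stated assumptions, not the benchmark's reference implementation. It assumes a scene graph is a dictionary from object index to a factor-value dictionary, that predicted and ground-truth objects are already matched by index, and that a no-edit pair counts as a hit when no change is predicted. The names `changed_sites`, `delta_sgia_hit`, and `sgia_hit` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# Assumed representation: object index -> {factor name: value}.
SceneGraph = Dict[int, Dict[str, str]]
# Intended edit as (object index, factor), or None for a no-edit re-render.
Edit = Optional[Tuple[int, str]]


def changed_sites(before: SceneGraph, after: SceneGraph) -> Set[Tuple[int, str]]:
    """Return every (object, factor) site whose value differs between two graphs."""
    sites = set()
    for obj in before.keys() | after.keys():
        b, a = before.get(obj, {}), after.get(obj, {})
        for factor in b.keys() | a.keys():
            if b.get(factor) != a.get(factor):
                sites.add((obj, factor))
    return sites


def delta_sgia_hit(pred_before: SceneGraph, pred_after: SceneGraph, edit: Edit) -> bool:
    """Change-pattern check: the predicted change is exactly the intended site."""
    # Assumption: for a no-edit pair the expected set of changed sites is empty.
    expected = set() if edit is None else {edit}
    return changed_sites(pred_before, pred_after) == expected


def sgia_hit(pred_before: SceneGraph, pred_after: SceneGraph,
             true_after: SceneGraph, edit: Edit) -> bool:
    """Stricter check: the after-scene graph is also fully correct."""
    return pred_after == true_after and delta_sgia_hit(pred_before, pred_after, edit)


# Toy pair: object 0 changes color; the model consistently mislabels object 1's shape.
true_after = {0: {"color": "green", "shape": "cube"},
              1: {"color": "blue", "shape": "sphere"}}
pred_before = {0: {"color": "red", "shape": "cube"},
               1: {"color": "blue", "shape": "cylinder"}}
pred_after = {0: {"color": "green", "shape": "cube"},
              1: {"color": "blue", "shape": "cylinder"}}

delta_ok = delta_sgia_hit(pred_before, pred_after, (0, "color"))
sgia_ok = sgia_hit(pred_before, pred_after, true_after, (0, "color"))
```

In this toy pair the predicted change is confined to the intended site, so the change-pattern check passes, while the stricter check fails because the predicted after-scene graph still contains the shape error. Averaging such per-example hits over a suite would give an accuracy in the spirit of the two metrics.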