Embodied-DETR: End-to-End Temporal 3D Object Detection in Egocentric Views
Abstract
Embodied 3D object detection is a fundamental perception capability for embodied agents, whose observations are partial, heavily occluded, and sequential, requiring models to exploit temporal continuity. However, existing benchmarks and methods are primarily designed for fully reconstructed global scenes and fail to capture temporal scene context and instance evolution in first-person perception. We introduce Embodied-Det, a new benchmark for egocentric 3D object detection that evaluates detection accuracy, temporal stability, and consistency in embodied settings. Building on this benchmark, we propose Embodied-DETR, an end-to-end temporal detection framework that models scene-level context and instance-level continuity through two complementary temporal modules: Scene-aware Feature Aggregation and Instance-aware Query Embedding. Experiments on Embodied-Det show that existing methods suffer substantial performance degradation in egocentric temporal settings, while Embodied-DETR achieves superior accuracy and temporal consistency, demonstrating the effectiveness of temporal modeling for embodied 3D perception.