Keep It in Mind: User-Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams
Abstract
We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 7K+ timestamped questions for diagnosing User-centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling past interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatial-aware and long-streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code will be released.