Foreground-Aware Token Routing Vision Transformer for Real-Time Satellite Video Tracking
Abstract
Real-time satellite video tracking poses distinct challenges: high spatio-temporal resolution, dynamic backgrounds, and constrained onboard computational resources. While Discriminative Correlation Filter (DCF)-based methods offer high-speed inference, their accuracy is limited. In contrast, Vision Transformer (ViT)-based trackers achieve strong performance by unifying representation and aggregation in a single-stream design, yet their heavy computational footprint hinders practical deployment in real-time satellite scenarios. In this work, we present FATrack, a novel tracking framework that effectively balances tracking accuracy and computational efficiency. At its core is FA-ViT, a lightweight Vision Transformer backbone that introduces foreground-aware token routing, concentrating computation on target-relevant regions while suppressing redundant background tokens. To mitigate the semantic degradation caused by token sparsification, we propose the Adaptive Scatter Module (ASM), which selectively reinforces informative tokens via joint spatial-channel attention and sparse structural propagation, thereby enhancing both semantic fidelity and spatial coherence. By synergistically integrating FA-ViT and ASM, FATrack forms a unified architecture that delivers real-time performance with significantly improved tracking precision. Extensive evaluations on multiple satellite video benchmarks demonstrate that FATrack surpasses existing real-time trackers in accuracy while achieving inference efficiency comparable to DCF-based methods, highlighting its potential for practical deployment in large-scale aerial video tracking systems.
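The routing idea described in the abstract can be sketched minimally as follows. This is an illustrative sketch, not the authors' implementation: the cosine-similarity scoring, the keep ratio, and the scatter-back step are all assumptions introduced for exposition. Each token is scored against a foreground prototype, only the top-scoring tokens are routed through the expensive transformer computation, and the updated tokens are scattered back into the full sequence.

```python
import numpy as np

def foreground_aware_routing(tokens, prototype, keep_ratio=0.5):
    """Illustrative sketch of foreground-aware token routing.

    tokens:     (N, D) array of search-region patch tokens
    prototype:  (D,) foreground reference vector (e.g. pooled template tokens)
    keep_ratio: fraction of tokens routed to the heavy computation path
    """
    # Score tokens by cosine similarity to the foreground prototype
    # (hypothetical scoring rule; the paper's criterion may differ).
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    scores = t @ p

    # Keep the top-k foreground-like tokens; the rest bypass heavy layers.
    k = max(1, int(round(keep_ratio * tokens.shape[0])))
    keep_idx = np.argsort(scores)[::-1][:k]

    # Stand-in for the expensive attention/MLP block on the kept tokens.
    routed = tokens[keep_idx] * 1.1

    # Scatter the updated tokens back into the full sequence;
    # background tokens pass through unchanged.
    out = tokens.copy()
    out[keep_idx] = routed
    return out, keep_idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
proto = tokens[:4].mean(axis=0)  # pretend the first 4 tokens cover the target
out, kept = foreground_aware_routing(tokens, proto, keep_ratio=0.25)
```

With a keep ratio of 0.25, only 4 of the 16 tokens pass through the heavy path; the remaining 12 are copied through untouched, which is the source of the compute savings the abstract attributes to token sparsification.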