Learning Generalized Trackers with Elastic Token Budgets
Abstract
Visual tracking aims to estimate target states in video sequences, with applications spanning diverse computational requirements. Recent methods reduce computational cost by training trackers on manually pruned image tokens under a fixed budget. However, once trained, these trackers can only operate at that fixed computational budget, limiting their adaptability to the computational diversity of real-world deployments. To address this limitation, we present ETBTrack, the first elastic token budget training framework, which enables trackers to perform robust tracking under varying computational budgets. Our method has two key components. First, we propose a result-driven importance criterion: a policy network, optimized under the guidance of the tracker's localization precision, estimates token importance, thereby aligning the objective of importance estimation with tracking precision. Second, we develop a budget-collaborative optimization strategy that jointly optimizes the tracker across varying budgets, making a single tracker compatible with diverse budgets. The two optimization processes alternate to strengthen the tracker's capability for elastic inference. Extensive experiments on large-scale benchmarks demonstrate the effectiveness of our method. Code will be released.