Return of Frustratingly Easy Unsupervised Video Domain Adaptation
Abstract
Unsupervised video domain adaptation (UVDA) is a practical yet under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called \emph{MetaTrans}. Specifically, \emph{MetaTrans} adopts a concise learning objective that contains only two fundamental loss terms. Despite this simplicity, \emph{MetaTrans} embodies an advanced UVDA principle, namely, handling the spatial and temporal divergence of cross-domain videos separately, through a careful model architecture design. By implementing a temporal-static subtraction module, \emph{MetaTrans} effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show that \emph{MetaTrans} achieves substantial absolute adaptation performance and significantly larger relative gains than state-of-the-art UVDA baselines.
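As a rough illustration of the temporal-static subtraction idea, one plausible reading is to split per-frame features into a static component (the temporal mean over the clip) and a dynamic residual, so that spatial and temporal divergence can be treated separately. The sketch below is a hypothetical, simplified interpretation; the function name and the mean-subtraction design are assumptions, not the paper's actual module.

```python
def temporal_static_subtraction(frames):
    """Split a clip's per-frame feature vectors into a static part
    (temporal mean) and a dynamic residual per frame.

    Hypothetical sketch: the real module in the paper may differ.
    frames: list of T feature vectors, each a list of D floats.
    Returns (static, dynamics), where static is a D-vector and
    dynamics is a T x D list of residuals summing to zero over time.
    """
    T = len(frames)
    D = len(frames[0])
    # Static component: average each feature dimension over time.
    static = [sum(f[d] for f in frames) / T for d in range(D)]
    # Dynamic component: what remains after removing the static part.
    dynamics = [[f[d] - static[d] for d in range(D)] for f in frames]
    return static, dynamics
```

Under this reading, the static vector would carry the spatial (appearance) content aligned across domains, while the zero-mean residuals carry the temporal dynamics, letting each divergence be handled by its own loss term.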