Task-Aware Mechanism: Hybrid MoE Vision Tower Towards Holistic Video Understanding
Abstract
Does \emph{comprehending the main idea of a 2-hour movie} and \emph{counting the birds appearing in a 15-second clip} really warrant the same video-processing pipeline? We present the Task-Aware Mechanism (TAM), a hybrid-gated Mixture-of-Experts (MoE) vision tower that adapts frame count and resolution to the user query and video length. TAM introduces a compact 0.1B text-only \emph{Inductor}, trained on our TA-116K dataset, to infer task types, enabling task-aware visual budgeting and routing: a soft-gated MoE vision encoder for stability, and hard-gated resolution-specific projectors/pipelines for efficient specialization. Built on Qwen2-7B, TallVA-8B-A7B achieves state-of-the-art performance among models with comparable LLMs across diverse video benchmarks and remains competitive against baselines built on stronger LLMs, showing that task-aware visual budgeting makes video understanding more holistic. The code is included in the supplementary material.
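The hybrid gating described above combines a soft, differentiable mixture over vision experts with a hard, discrete choice of one resolution-specific projector. A minimal sketch of that idea follows; all names (\texttt{hybrid\_gate}, the toy experts and projectors, the gate/router weight matrices) are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_gate(token, experts, projectors, gate_w, router_w):
    """Hypothetical hybrid gating: soft over experts, hard over projectors."""
    # Soft gate: a differentiable weighted blend of all expert outputs,
    # which keeps training stable (every expert receives gradient).
    weights = softmax(gate_w @ token)                       # (n_experts,)
    mixed = sum(w * f(token) for w, f in zip(weights, experts))
    # Hard gate: argmax routing to exactly one resolution-specific
    # projector, so only that specialized pipeline runs at inference.
    idx = int(np.argmax(router_w @ mixed))
    return projectors[idx](mixed), weights, idx

# Toy instantiation: 3 vision experts, 2 resolution-specific projectors.
rng = np.random.default_rng(0)
d = 8
experts = [lambda t, W=rng.standard_normal((d, d)): W @ t for _ in range(3)]
projectors = [lambda h, W=rng.standard_normal((4, d)): W @ h for _ in range(2)]
gate_w = rng.standard_normal((3, d))
router_w = rng.standard_normal((2, d))

out, weights, idx = hybrid_gate(rng.standard_normal(d), experts, projectors,
                                gate_w, router_w)
```

The soft gate sums to one over experts, while the hard gate commits to a single projector index, which is what makes the resolution-specific branches cheap to specialize.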