Zero-Shot Text-to-Motion Evaluation using Video Language Models
Abstract
Text-to-motion (T2M) generation has emerged as a fundamental task in human motion synthesis. However, existing evaluation metrics often fail to capture the semantic alignment between textual descriptions and generated 3D motions. In this work, we propose VeMo, a novel evaluation framework that leverages the zero-shot reasoning capabilities of Video-Language Models (VLMs) for T2M assessment. Our core idea is simple: render the generated human motion as a video of a skinned character, then query a VLM to judge its alignment with the text. To mitigate the information loss inherent in 3D-to-2D projection, we introduce an entropy-based uncertainty analysis that ensures the reliability of the evaluation scores. To address the lack of rigorous evaluation standards in the field, we contribute a meta-evaluation benchmark featuring manual annotations of both coarse-grained alignment and fine-grained rationales. Extensive experiments demonstrate that VeMo aligns significantly better with human judgments than traditional metrics, offering a scalable, data-independent solution for the reliable assessment of T2M models.
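To make the entropy-based uncertainty analysis concrete, the following is a minimal illustrative sketch, assuming the VLM exposes a probability distribution over a discrete score scale (e.g., 1 to 5) derived from its output logits. All names (`score_probs`, `entropy_threshold`) are hypothetical and not taken from the paper.

```python
import math

def score_entropy(score_probs: dict[int, float]) -> float:
    """Shannon entropy (in bits) of the VLM's score distribution."""
    return -sum(p * math.log2(p) for p in score_probs.values() if p > 0)

def evaluate_with_uncertainty(score_probs: dict[int, float],
                              entropy_threshold: float = 1.5):
    """Return the expected score and whether the judgment is reliable.

    A high-entropy (near-uniform) distribution suggests the 2D rendering
    may have lost discriminative motion cues, so the score is flagged.
    """
    expected = sum(s * p for s, p in score_probs.items())
    return expected, score_entropy(score_probs) <= entropy_threshold

# Example: a confident VLM judgment vs. an ambiguous one.
confident = {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.70, 5: 0.20}
ambiguous = {1: 0.18, 2: 0.22, 3: 0.20, 4: 0.21, 5: 0.19}
print(evaluate_with_uncertainty(confident))   # high score, accepted
print(evaluate_with_uncertainty(ambiguous))   # flagged as unreliable
```

In this sketch, only low-entropy (high-confidence) VLM scores are accepted, which is one plausible way to operationalize the reliability check the abstract describes.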