CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Abstract
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind, remaining fragmented and narrowly focused. In this paper, we bridge this gap by establishing a comprehensive ecosystem for Compositional Multimodal Instruction (CMI) reward modeling, where the generated music may be conditioned on text descriptions, lyrics, and/or audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across three axes: musicality, text–music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores and their agreement with human preferences on Music Arena and the CMI-Pref test set. Additional analyses examine performance variation across factors such as annotators, annotation timing and confidence, music generation models, and audio length. Experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. Our work provides the data, benchmarks, and models needed to advance aligned music generation.
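As a point of reference for the inference-time scaling claim, the sketch below shows one common realization of top-k filtering: sample k candidates from a generator and keep the one the reward model scores highest (best-of-k reranking). The `generate` and `reward` callables are hypothetical placeholders, since the abstract specifies no API; this is a minimal illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def best_of_k(
    instruction: dict,                       # e.g. {"text": ..., "lyrics": ..., "audio_prompt": ...}
    generate: Callable[[dict], bytes],       # hypothetical music generator (assumption)
    reward: Callable[[dict, bytes], float],  # hypothetical CMI reward model scorer (assumption)
    k: int = 8,
) -> Tuple[bytes, float]:
    """Sample k candidates and return the one with the highest reward score."""
    candidates: List[bytes] = [generate(instruction) for _ in range(k)]
    scores = [reward(instruction, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Under this scheme, a stronger reward model yields larger gains as k grows, which is what makes correlation with human judgments the key property to benchmark.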