MICE-Bench: A Challenging and Comprehensive Benchmark for Multi-Reference Image Creation and Editing
Abstract
The paradigm of visual generation is rapidly shifting from single-image conditioning toward multi-image conditioning, making the ability to synthesize and edit images based on multiple visual references a critical capability. Despite this trend, existing benchmarks remain largely limited to single-reference scenarios or narrowly defined tasks, leaving model behavior under complex multi-concept composition insufficiently explored. To bridge this gap, we introduce MICE-Bench, a comprehensive benchmark for Multi-reference Image Creation and Editing. The benchmark is designed around three core principles: 1) heterogeneous concept composition across seven visual dimensions; 2) varying levels of constraint density, ranging from two-concept to seven-concept configurations; 3) concept-centric data construction and evaluation, enabling fine-grained analysis of interactions among multiple concepts. MICE-Bench consists of 3,119 high-quality test cases within a unified concept space. Using an 8-dimensional evaluation metric, we systematically evaluate 13 state-of-the-art models. Our results show that although closed-source models maintain a clear performance advantage, all models exhibit notable degradation in concept consistency and physical realism as concept complexity increases. This indicates that current models rely on superficial composition rather than genuine multi-concept synthesis, highlighting substantial room for future improvement.