Strategic Candidacy in Generative AI Arenas
Abstract
AI arenas, which rank models based on pairwise preferences elicited from users, are an industry-standard evaluation mechanism for generative models. In a recent paper, Singh et al. (2025) demonstrate that widely used mechanisms are not clone-robust: in particular, they submitted multiple copies of the same model and found that the identical copies were ranked several positions apart. In this paper, we begin by showing, both theoretically and in simulations calibrated to data from LMArena, that producers can benefit substantially from submitting clones. We then propose a new mechanism for ranking models based on pairwise comparisons, called You-Rank-We-Rank (YRWR), which uses producers' rankings over their own models to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot gain much utility by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. We validate our theory with further semi-synthetic experiments.
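As an illustrative sketch only (not the paper's calibrated simulation), the following Python snippet shows the basic statistical mechanism behind the clone advantage: when win rates are estimated from finitely many Bradley-Terry battles, the best of several identical copies enjoys a selection bias over noisy estimates that a single honest submission does not. The rival strengths, sample sizes, and trial counts here are hypothetical placeholders.

```python
import math
import random

def win_prob(s_i, s_j):
    """Bradley-Terry probability that a model of strength s_i beats one of strength s_j."""
    return math.exp(s_i) / (math.exp(s_i) + math.exp(s_j))

def empirical_winrates(strengths, n_games, rng):
    """Estimate each entry's win rate from n_games battles against random opponents."""
    rates = []
    for i, s_i in enumerate(strengths):
        others = strengths[:i] + strengths[i + 1:]
        wins = sum(rng.random() < win_prob(s_i, rng.choice(others))
                   for _ in range(n_games))
        rates.append(wins / n_games)
    return rates

def rivals_above(field, producer_entries, n_games, rng):
    """Number of rival models ranked strictly above the producer's best entry."""
    entries = field + producer_entries
    rates = empirical_winrates(entries, n_games, rng)
    best = max(rates[len(field):])          # best of the producer's (possibly cloned) entries
    return sum(r > best for r in rates[:len(field)])

rng = random.Random(0)
field = [0.3, 0.2, 0.0]        # hypothetical rival strengths
single = [0.1]                 # honest single submission
clones = [0.1, 0.1, 0.1]       # three identical copies of the same model

trials, n_games = 500, 200
avg_single = sum(rivals_above(field, single, n_games, rng) for _ in range(trials)) / trials
avg_clones = sum(rivals_above(field, clones, n_games, rng) for _ in range(trials)) / trials
print(f"rivals ranked above producer: single={avg_single:.2f}, clones={avg_clones:.2f}")
```

On average, fewer rivals rank above the producer's best entry in the clone scenario, purely because the maximum of several independent noisy win-rate estimates is biased upward.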