Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation
Abstract
High-quality labels are essential for reliable evaluation of modern machine learning (ML) and artificial intelligence (AI) systems. In practice, model evaluation pipelines increasingly rely on collaborative "gold-silver" supervision: every instance may receive multiple inexpensive, imperfect silver labels (e.g., from crowdsourcing platforms or automated AI judges), while a limited number of costly gold labels from experts are acquired selectively for difficult cases, such as those with substantial disagreement among silver labels. This setting differs from classical labeling formulations in that an instance can receive multiple silver labels while expert labeling is applied only selectively. A key challenge in this setup is determining when and how to allocate labeling effort across silver and gold labels under a fixed budget while ensuring that the collected labels support reliable model evaluation. Escalating from silver to expert labeling too late may propagate incorrect labels from imperfect annotators, whereas escalating too early wastes scarce expert resources. Moreover, because labeling decisions depend on previously observed labels, the resulting data are adaptively sampled, inducing dependencies between labels and the sampling process. This adaptivity complicates both the design of systematic labeling algorithms and the validity of downstream statistical inference used for ML/AI system evaluation. To address these challenges, we propose a cost-efficient collaborative adaptive labeling framework in which each instance may receive multiple imperfect silver labels and, when warranted, an expert-provided gold label. To support valid model evaluation from adaptively collected labels, we propose an estimator that systematically combines expert-provided gold labels and imperfect silver labels, and we establish its consistency under mild conditions.
Across multiple evaluation datasets, our method substantially improves labeling quality and the reliability of downstream statistical evaluation compared to existing baselines.