Trust Functions: Near Lossless Weak-to-Strong Generalization by Learning to Trust the Weak Teacher
Abstract
Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and decision making, trust filtering yields students that match, and sometimes surpass, students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher, producing the strongest final model. Our analyses suggest that neural trust functions improve learning through more than label error reduction. They induce an implicit easy-first curriculum, recover near-optimal alternatives where ground-truth labels are incomplete, and produce more coherent gradient updates, offering a mechanistic account of the stability and efficiency of trust-filtered weak-to-strong generalization.
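To make the core idea concrete, the following is a minimal sketch of trust-based filtering of weak supervision. The trust function shown here (using a model's probability for the weak label as a confidence proxy), the `trust_score` and `filter_weak_labels` names, and the threshold value are all illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: filter weakly labeled examples by a scalar trust score.
# The trust function and threshold below are assumptions for demonstration.

def trust_score(weak_label, model_probs):
    """Score a weak label by a model's probability for it (assumed proxy)."""
    return model_probs.get(weak_label, 0.0)

def filter_weak_labels(examples, threshold=0.7):
    """Keep only (input, weak_label) pairs whose trust score meets the threshold."""
    kept = []
    for x, weak_label, model_probs in examples:
        if trust_score(weak_label, model_probs) >= threshold:
            kept.append((x, weak_label))
    return kept

# Toy weakly labeled data: (input, weak label, model's label probabilities).
examples = [
    ("q1", "A", {"A": 0.9, "B": 0.1}),   # high trust: kept
    ("q2", "B", {"A": 0.8, "B": 0.2}),   # low trust: filtered out
    ("q3", "C", {"C": 0.75}),            # high trust: kept
]
print(filter_weak_labels(examples))  # [('q1', 'A'), ('q3', 'C')]
```

The surviving pairs would then serve as the training set for the strong student; in the iterative chain described above, the trained student would supply the next round's label probabilities.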