Random Selection Reveals Implicit Knowledge Consensus in Code Generation
Abstract
Training large language models for code generation requires selecting high-quality data from solution pools in which each problem admits multiple correct implementations. Conventional work on data selection holds that sophisticated strategies employing various optimization objectives, such as diversity maximization or difficulty ranking, should outperform naive random sampling. However, whether such sophisticated selection methods truly benefit code generation has received little systematic study. In this work, we evaluate multiple selection strategies across different representation spaces, including continuous embeddings, discrete tokens, and syntactic structures, using various base language models. We observe a counterintuitive phenomenon: sophisticated methods do not yield robust benefits. Instead, simple random sampling achieves consistently competitive performance across all models and exhibits greater stability and transferability than sophisticated methods. We attribute this to an implicit knowledge consensus among diverse correct solutions, whereby a random selection already covers the common algorithmic knowledge required for training. Our findings provide practical guidance on data selection for code generation, suggesting that practitioners can adopt simple random sampling without sacrificing performance.
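For concreteness, the random-sampling baseline discussed above amounts to drawing training examples uniformly from a pool of verified correct solutions. The following minimal sketch illustrates that baseline under assumed data structures; the names (solution_pool, random_select, k) are illustrative and do not reflect the paper's actual pipeline or API.

```python
import random

def random_select(solution_pool, k, seed=0):
    """Uniformly sample k (problem, solution) pairs from a pool of verified
    correct solutions. `solution_pool` maps a problem id to a list of correct
    implementations; this is a hypothetical structure for illustration only."""
    rng = random.Random(seed)
    pairs = [(pid, sol) for pid, sols in solution_pool.items() for sol in sols]
    return rng.sample(pairs, min(k, len(pairs)))

# Toy usage: pick 2 training examples from a small pool of correct solutions.
pool = {
    "two_sum": ["def two_sum(nums, t): ...", "def two_sum_v2(nums, t): ..."],
    "fizzbuzz": ["def fizzbuzz(n): ..."],
}
print(random_select(pool, k=2))
```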