Cure-SFT: Diagnostic-Guided Data Curation for Instruction Tuning
Abstract
Instruction data curation is central to improving the instruction-following ability of large language models. However, existing approaches often struggle to simultaneously maintain data quality, diversity, and distributional consistency, largely because they do not explicitly distinguish semantic redundancy from quality defects and rely on coarse-grained modeling of instruction data quality. To address this issue, we propose Cure-SFT, a coarse-to-fine, diagnostic-guided method for instruction data curation that explicitly disentangles these two failure modes. Specifically, Cure-SFT first removes redundant samples via stratified semantic-geometric sampling, then applies teacher models for diagnostic triage, and finally performs targeted defect remediation on fixable samples. Our experiments show that Cure-SFT surpasses full-data instruction tuning while using only 10% of the data budget. Moreover, Cure-SFT consistently outperforms strong selection-based and rewriting-based baselines across data budgets, demonstrating the effectiveness of diagnostic-guided data curation.
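The coarse-to-fine pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the stratum construction, the quality scores standing in for a teacher model, and the remediation step are all hypothetical placeholders.

```python
# Illustrative sketch of a coarse-to-fine curation pipeline in the spirit of
# Cure-SFT. All function names, thresholds, and scoring rules here are
# assumptions for demonstration, not the method's actual components.
import random


def stratified_sample(samples, embed, n_strata=4, per_stratum=2, seed=0):
    """Coarse stage: bucket samples by a scalar projection of their
    embedding and keep a few per bucket, removing semantic redundancy."""
    rng = random.Random(seed)
    strata = {}
    for s in samples:
        key = int(embed(s) * n_strata) % n_strata  # crude stratum id
        strata.setdefault(key, []).append(s)
    kept = []
    for bucket in strata.values():
        rng.shuffle(bucket)
        kept.extend(bucket[:per_stratum])
    return kept


def triage(samples, quality):
    """Fine stage: a (stand-in) teacher quality score splits samples into
    keep / fixable / discard."""
    keep, fixable = [], []
    for s in samples:
        q = quality(s)
        if q >= 0.8:
            keep.append(s)
        elif q >= 0.4:
            fixable.append(s)  # defect judged remediable
        # below 0.4: discard outright
    return keep, fixable


def remediate(sample):
    """Placeholder remediation: in practice a teacher model would rewrite
    the defective sample."""
    return sample.strip().capitalize()


def curate(samples, embed, quality):
    """Full pipeline: de-duplicate, triage, then repair fixable samples."""
    pooled = stratified_sample(samples, embed)
    keep, fixable = triage(pooled, quality)
    return keep + [remediate(s) for s in fixable]
```

The key design point the sketch mirrors is the separation of concerns: redundancy is handled geometrically before any quality judgment, so the (expensive) teacher-model triage only runs on a de-duplicated pool, and remediation is applied only where the diagnosis says a defect is fixable.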