TarGATE: Target-Aware Data Selection via Token-Attenuation Gates
Abstract
Targeted instruction tuning requires selecting pertinent samples from massive mixed candidate datasets, guided by a small reference dataset that reflects the desired capability; yet efficiently identifying high-quality data amidst noise remains challenging. To address this, we propose TarGATE (Target-aware GATEs), a simple yet effective data selection framework that leverages the model's inherent understanding of data. TarGATE computes a token-level Information Retention Ratio (IRR) to scale the output of the feed-forward network, and the instance-level average IRR serves as a quantitative metric of data quality. To align the gates' preferences with the target task, we employ a joint optimization strategy over the reference set and a subset of the candidate data, which encourages the gates to assign higher IRRs to reference-aligned data while suppressing low-quality samples. Extensive experiments across noisy and real-world scenarios demonstrate that TarGATE outperforms related baselines. Furthermore, TarGATE exhibits superior computational efficiency and strong cross-model transferability, enabling a smaller selector model to curate high-quality fine-tuning data for larger foundation models. The code is available here.
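The gating mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gate parameterization (a single learned projection with a sigmoid) and all function and variable names (`irr_gate`, `ffn_out`, `w`, `b`) are assumptions for exposition; in the actual method the gates are optimized jointly on the reference set and a candidate subset.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def irr_gate(ffn_out, w, b):
    """Hypothetical sketch of a token-attenuation gate.

    Computes a per-token Information Retention Ratio (IRR) in (0, 1),
    scales the feed-forward output by it, and averages the token-level
    IRRs into an instance-level data-quality score.

    ffn_out: (seq_len, d) feed-forward network activations
    w, b:    gate parameters (assumed here to be a learned linear probe)
    """
    irr = sigmoid(ffn_out @ w + b)        # (seq_len,) token-level IRR
    gated = ffn_out * irr[:, None]        # attenuate each token's FFN output
    instance_score = float(irr.mean())    # instance-level quality metric
    return gated, irr, instance_score
```

Under this sketch, selection reduces to ranking candidate instances by `instance_score` and keeping the top fraction; the joint optimization in the paper would push this score up for reference-aligned samples and down for noisy ones.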