Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration
Abstract
The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods that retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals, which assess samples independently, or extrinsic signals, which promote diversity via pairwise relations. While effective in their respective regimes, each captures only one aspect of sample utility and lacks robustness across pruning ratios and data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although the MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild and interpretable conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and yields practical design guidelines. Extensive experiments demonstrate that the proposed method outperforms existing pruning methods while substantially reducing training cost. On ImageNet-1k with ResNet-50, our method reduces training time by over 40\% without sacrificing accuracy.
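To make the greedy marginal-gain idea concrete, the following is a minimal sketch of such a selection loop, assuming an intrinsic score vector `node_w` and a symmetric extrinsic (diversity) matrix `edge_w`; the function name and interface are hypothetical illustrations, not the paper's exact algorithm.

```python
import numpy as np

def greedy_prune(node_w, edge_w, budget):
    """Greedily select `budget` samples by marginal gain.

    node_w : (n,) array of intrinsic sample scores.
    edge_w : (n, n) symmetric array of pairwise extrinsic scores.
    Hypothetical sketch of marginal-gain selection, not the
    paper's exact procedure.
    """
    n = len(node_w)
    selected = []
    remaining = set(range(n))
    # Marginal gain of each candidate: its intrinsic weight plus the
    # edge weights connecting it to the already-selected set.
    gain = node_w.astype(float).copy()
    for _ in range(budget):
        i = max(remaining, key=lambda j: gain[j])
        selected.append(i)
        remaining.remove(i)
        # Selecting i adds edge_w[i, j] to every remaining candidate's gain.
        for j in remaining:
            gain[j] += edge_w[i, j]
    return selected
```

Each iteration costs O(n), so pruning an n-sample dataset down to k samples runs in O(nk) time once the weights are computed.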