Unsupervised Process-Aware Coreset Selection for In-Context Learning
Abstract
We address the challenge of unsupervised coreset selection for few-shot in-context learning (ICL): selecting a small subset of examples under a fixed annotation budget that yields effective prompts for large language models. Existing geometry-based methods often produce skewed coresets, oversampling peripheral examples and incurring high local redundancy. To address these issues, we propose a process-aware framework for coreset selection that jointly optimizes the diversity and representativeness of the selected samples via a submodular objective. The framework ensures representativeness by scoring candidates according to their local density, and promotes diversity by imposing a redundancy penalty relative to the evolving selected set; it thereby balances representativeness and diversity adaptively as the selection process unfolds. Extensive experiments on seven NLP datasets demonstrate that our method consistently outperforms state-of-the-art coreset selection methods in downstream ICL performance. Further analysis confirms that our approach better balances diversity and representativeness throughout the selection process while retaining the theoretical guarantees of submodular optimization.
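To make the selection mechanics concrete, the sketch below shows one plausible greedy instantiation of such an objective: representativeness is proxied by each example's mean similarity to its k nearest neighbors in embedding space, and diversity by a penalty on the maximum similarity to anything already selected. The cosine-similarity choice and the hyperparameters `k_density` and `lam` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_coreset(X, budget, k_density=10, lam=0.5):
    """Greedy coreset selection balancing representativeness (local density)
    with diversity (redundancy penalty against the evolving selected set).

    X: (n, d) array of example embeddings. budget: number of examples to pick.
    k_density and lam are assumed hyperparameters for this sketch.
    """
    # Cosine similarity between all pairs of examples.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    n = len(X)

    # Representativeness: mean similarity to the k nearest neighbors
    # approximates local density (the top entry is self-similarity, excluded).
    knn_sim = np.sort(sim, axis=1)[:, -(k_density + 1):-1]
    density = knn_sim.mean(axis=1)

    selected = []
    for _ in range(budget):
        if selected:
            # Redundancy: highest similarity to any already-selected example.
            redundancy = sim[:, selected].max(axis=1)
        else:
            redundancy = np.zeros(n)
        gain = density - lam * redundancy   # marginal gain of each candidate
        gain[selected] = -np.inf            # never reselect an example
        selected.append(int(np.argmax(gain)))
    return selected
```

As more examples are selected, the redundancy term suppresses candidates near existing picks, so the trade-off between representativeness and diversity shifts with the state of the selected set, which is the progress-dependent behavior the abstract describes.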