ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment for Code
Abstract
Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively account for the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that miss the nuanced patterns needed for tasks such as Autoformalization or code generation. We introduce \texttt{ZIP-FIT}, a data selection framework that uses compression to directly measure the alignment between potential training data and the target task distribution. Our key insight is that compression-based similarity captures both the syntactic and structural patterns relevant to the target task, enabling more precise selection of task-relevant data such as code. In extensive evaluations on Autoformalization and Python code generation, \texttt{ZIP-FIT} significantly outperforms leading baselines such as DSIR and D4. Models trained on \texttt{ZIP-FIT}-selected data reach their lowest cross-entropy loss up to 85.1\% faster than these baselines, demonstrating that better task alignment leads to more efficient learning. \texttt{ZIP-FIT} also performs selection up to 65.8\% faster than DSIR and two orders of magnitude faster than D4, and achieves 18.86\% Pass@1 on HumanEval compared to LESS's 18.06\% while being approximately 2000 times faster. Notably, \texttt{ZIP-FIT} shows that smaller, well-aligned datasets often outperform larger but less targeted ones, demonstrating that a small amount of higher-quality data is superior to a large amount of lower-quality data.
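To make the core idea concrete, below is a minimal sketch of one standard compression-based alignment metric: normalized compression distance (NCD) computed with gzip, with alignment defined as 1 minus the distance. This is an illustrative approximation of the kind of scoring the abstract describes, not necessarily \texttt{ZIP-FIT}'s exact formulation; the function names and example strings are assumptions introduced here for clarity.

```python
import gzip


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Lower values mean the strings share more compressible structure,
    i.e. they are more similar.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def alignment(candidate: str, target: str) -> float:
    """Alignment score (higher means the candidate better matches the target)."""
    return 1.0 - ncd(candidate, target)


# Rank candidate training examples by alignment with a target-task sample
# (a hypothetical Autoformalization-style target).
target = "theorem add_comm (a b : Nat) : a + b = b + a := by omega"
candidates = [
    "lemma mul_comm' (a b : Nat) : a * b = b * a := Nat.mul_comm a b",
    "The weather in Paris is usually mild in spring.",
]
ranked = sorted(candidates, key=lambda c: alignment(c, target), reverse=True)
print(ranked[0])  # the Lean-like snippet should rank first
```

Because this scoring needs only a general-purpose compressor, it requires no embeddings or auxiliary models, which is consistent with the speed advantages over DSIR, D4, and LESS reported above.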