TuneAhead: Predicting Fine-tuning Performance Before Training Begins
Abstract
Fine-tuning large language models (LLMs) is compute-intensive and error-prone: performance depends sensitively on data quality and hyperparameter choices, and a naïve run can even degrade the model. This raises a fundamental question: Can we predict fine-tuning performance before training begins? We present TuneAhead, a lightweight framework for pre-hoc prediction of fine-tuning performance. TuneAhead encodes each fine-tuning run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short simulated run. A gradient-boosting predictor maps these features to performance predictions, while SHAP-based attributions provide interpretable diagnostics revealing which features drive the predicted outcome. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TuneAhead consistently outperforms strong baselines such as ProxyLM and Early-Stop Extrapolation. On a held-out test set of 370 runs, with ‘success’ defined as exceeding a performance threshold, TuneAhead correctly identified 89.4% of successful runs (110/123) and 91.0% of failed runs (225/247), enabling practitioners to avoid costly failed runs before training begins, for a total computational saving of 58.4%.
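To make the pipeline described above concrete, the sketch below illustrates the general shape of such a predictor. It is not the authors' implementation: it assumes scikit-learn's GradientBoostingRegressor as the gradient-boosting predictor and the shap library for attributions, and all feature names, data, and the success threshold are hypothetical placeholders standing in for the paper's meta-features.

```python
# Minimal sketch of a TuneAhead-style pipeline (illustrative only).
# Assumptions not taken from the paper: scikit-learn's
# GradientBoostingRegressor as the predictor, the shap library for
# attributions, synthetic stand-in data, and a hypothetical 0.5 cutoff.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Each row represents one fine-tuning run: static dataset descriptors
# concatenated with dynamic probe features from a short simulated run.
# Feature names here are hypothetical examples of both kinds.
feature_names = ["dataset_size", "lexical_diversity",
                 "probe_loss_slope", "probe_grad_norm"]
X = rng.normal(size=(1300, len(feature_names)))  # stand-in meta-features
y = X @ rng.normal(size=len(feature_names)) \
    + rng.normal(scale=0.1, size=1300)           # stand-in performance scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=370, random_state=0)

# Gradient-boosting predictor: meta-features -> predicted performance.
model = GradientBoostingRegressor().fit(X_train, y_train)
pred = model.predict(X_test)

# Pre-hoc go/no-go decision: only launch runs predicted to exceed
# the success threshold, skipping likely failures before training.
threshold = 0.5  # hypothetical success cutoff
launch = pred > threshold

# SHAP attributions diagnose which features drive each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
```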