ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Yizheng Huang ⋅ Wenjun Zeng ⋅ Aditi Kumaresan ⋅ Zi Wang
Abstract
Evaluating generative AI models is increasingly resource-intensive due to slow generation, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, which maps model inputs to metrics such as error likelihood or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel-set sampling, we develop strategies that proactively select or synthesize the most informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more sample-efficient than competitive baselines: it requires 10–100x fewer samples to achieve estimates within $\pm1\%$ of ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
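To make the two framings concrete, the sketch below shows one plausible reading of how a GP surrogate supports both tasks: a Monte Carlo approximation of the BQ posterior mean for performance estimation, and a superlevel-set (exceedance-probability) acquisition for failure discovery. This is an illustrative sketch, not ProEval's implementation; the synthetic data, the scikit-learn GP, the kernel, and the threshold `tau` are all assumptions, and the abstract does not specify the paper's actual pre-training or acquisition strategies.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Stand-in history: 8-dim embeddings of test inputs and per-input scores
# in [0, 1] (e.g., error likelihoods) from previously evaluated models.
X_history = rng.normal(size=(200, 8))
y_history = 1 / (1 + np.exp(-X_history[:, 0]))  # synthetic scores

# Surrogate for the score function; in ProEval this GP would be
# pre-trained across related models/benchmarks (transfer learning).
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3),
    normalize_y=True,
)
gp.fit(X_history, y_history)

# Candidate pool drawn from the benchmark's input distribution.
X_pool = rng.normal(size=(5000, 8))
mean, std = gp.predict(X_pool, return_std=True)

# (1) Performance estimation as BQ: approximate the posterior mean of
# the average score, E_x[f(x)], by averaging the GP posterior mean over
# samples from the input distribution.
perf_estimate = mean.mean()

# (2) Failure discovery as superlevel-set sampling: rank candidates by
# the posterior probability that the score exceeds a failure threshold.
tau = 0.8  # assumed threshold, for illustration only
p_failure = norm.sf(tau, loc=mean, scale=std)   # P(f(x) > tau)
next_to_test = X_pool[np.argsort(-p_failure)[:10]]  # most promising inputs

print(f"estimated mean score: {perf_estimate:.3f}")
print(f"top candidate failure probability: {p_failure.max():.3f}")
```

Under this reading, the same surrogate serves both objectives: the posterior mean drives the quadrature estimate, while the posterior uncertainty identifies inputs most likely to lie in the failure (superlevel) set, so each expensive rater call can be spent where it is most informative.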