Poster
in
Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
HyperINF: Scaling-up Accurate Approximation of Influence Functions by the Hyperpower Method
Xinyu Zhou · Simin Fan · Martin Jaggi
Influence functions provide a principled method to assess the contribution of individual training samples, but their high computational cost prohibits application to large-scale foundation models or datasets. Our empirical study shows that current approximation methods cannot accurately estimate influence functions for scaled-up models or datasets, which leads to sub-optimal results on large foundation models. To address this issue, we propose HyperINF, an accurate approximation method based on a hyperpower method, namely Schulz's iterative algorithm, which enjoys a rigorous convergence guarantee. In a synthetic convergence simulation, HyperINF showcases superior accuracy and stability in estimating the inverse Hessian matrix compared to existing baselines, especially for high-dimensional matrices and large sample sizes. We further validate the efficacy of HyperINF on extensive real-world data attribution problems, including mislabeled data detection, data selection for LLM finetuning, and multimodal instruction-tuning data selection for VLM pretraining before and after cross-modal alignment. When selecting a small portion of the dataset, HyperINF's improved approximation accuracy significantly boosts performance, whereas other baselines can cause large degradation. We provide the codebase at \url{https://anonymous.4open.science/r/HyperINF-7FA2}.
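For intuition, the core numerical tool the abstract names is Schulz's iteration, a second-order hyperpower method for approximating a matrix inverse. The sketch below is a generic illustration of that classical iteration (with the standard transpose-based initialization that guarantees convergence for any nonsingular matrix), not a reproduction of the paper's HyperINF implementation; all function and variable names are our own.

```python
import numpy as np

def schulz_inverse(A, iters=30):
    """Approximate A^{-1} via Schulz's (order-2 hyperpower) iteration:
        X_{k+1} = X_k (2I - A X_k),
    which converges quadratically once rho(I - A X_0) < 1.
    Illustrative sketch only; not the paper's HyperINF code."""
    n = A.shape[0]
    # Ben-Israel initialization: guarantees convergence for nonsingular A.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(iters):
        X = X @ (2 * I - A @ X)
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + np.eye(5)  # well-conditioned SPD test matrix
    X = schulz_inverse(A)
    print(np.allclose(X @ A, np.eye(5), atol=1e-8))
```

Because each step uses only matrix-matrix products, the iteration is GPU-friendly and avoids the explicit factorizations that make exact Hessian inversion infeasible at foundation-model scale.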