Clustered Influence Functions
Miklós Máté Badó ⋅ Kristian Fenech
Abstract
Influence functions are a standard tool for data debugging and unlearning, but they become impractical for **high-query** subset workloads such as large-$K$ cross-validation, repeated resampling, or interactive what-if analysis, since each subset query typically requires an expensive inverse-curvature solve. We introduce **Clustered Influence Functions (CiF)**, which turns subset influence into an **amortized subset oracle**. We build a compact cache once by clustering training gradients, solve a damped Generalised Gauss-Newton system only for the cluster means, and answer new subset queries by a linear recombination weighted by cluster membership counts. This yields a per-query cost of $O(Cp)$, linear in the cache size $C$ and the number of model parameters $p$. We further provide a diagnostic error bound that decomposes the approximation error into a **clustering scatter** term and a **solver residual** term, making the accuracy-compute tradeoff explicit through the cache budget and solver tolerance. Evaluations on MNIST and CIFAR-10 show that CiF matches per-query influence rankings while significantly reducing total runtime in high-query regimes, enabling influence-based workflows that are otherwise computationally prohibitive.
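The cache-then-recombine pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the cluster count `C`, the Lloyd-style k-means loop, the damping constant, and the quadratic curvature surrogate `H` are all assumptions made for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, C = 200, 10, 5                # examples, parameters, cache size (clusters)
G = rng.normal(size=(n, p))         # stand-in per-example training gradients
H = G.T @ G / n + 0.1 * np.eye(p)   # damped curvature surrogate (hypothetical)

# One-time cache build: cluster the gradients (simple Lloyd iterations).
centers = G[rng.choice(n, C, replace=False)].copy()
for _ in range(20):
    labels = ((G[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    for c in range(C):
        if (labels == c).any():
            centers[c] = G[labels == c].mean(axis=0)

# Solve the damped system once per cluster mean; V caches C influence directions.
V = np.linalg.solve(H, centers.T).T   # shape (C, p)

def subset_influence(idx):
    """Amortized subset query: recombine cached cluster solves using
    membership counts of the subset -- O(C * p) per query."""
    counts = np.bincount(labels[idx], minlength=C)
    return counts @ V

# Compare one amortized answer against the exact per-subset solve.
idx = rng.choice(n, 40, replace=False)
exact = np.linalg.solve(H, G[idx].sum(axis=0))
approx = subset_influence(idx)
```

The gap between `approx` and `exact` is exactly the clustering-scatter term from the bound: it shrinks as the cache budget `C` grows, while the solver-residual term is controlled by how accurately the `C` damped systems are solved.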