Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution
Abstract
Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients (GIG) reduces this sensitivity by preferentially updating features with small gradient magnitudes, its input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By ensuring that decoded intermediate images remain aligned with the data manifold, MA-GIG constrains gradient evaluation to statistically valid regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients at path points that stay close to the input distribution. Consequently, our method suppresses off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers.
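The core idea of integrating gradients along a decoded latent path can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the paper's implementation: `decode` and `classify` stand in for the VAE decoder and the classifier, gradients are taken by finite differences instead of autodiff, and the path integral uses a midpoint Riemann sum between a latent baseline `z0` and the input's latent code `z1`.

```python
import numpy as np

# Toy stand-ins (assumptions, not the paper's models):
# a 2-D latent space decoded into a 4-D "image", scored by a linear classifier.
W_dec = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.8], [0.9, 0.1]])
w_clf = np.array([0.4, -0.2, 0.7, 0.1])

def decode(z):
    return np.tanh(W_dec @ z)      # nonlinear decoder keeps outputs bounded

def classify(x):
    return float(w_clf @ x)        # linear score for the toy example

def grad_classify(x, eps=1e-5):
    # central finite differences; an autodiff framework would be used in practice
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (classify(x + d) - classify(x - d)) / (2 * eps)
    return g

def latent_path_ig(z_base, z_input, steps=64):
    """Path-integral attribution along a straight line in latent space.

    Gradients are evaluated only at decoded (on-manifold) points and
    accumulated against the displacement of the decoded path, so the
    attribution sums to the score difference between the endpoints.
    """
    ts = np.linspace(0.0, 1.0, steps + 1)
    xs = np.stack([decode(z_base + t * (z_input - z_base)) for t in ts])
    attr = np.zeros(xs.shape[1])
    for k in range(steps):
        mid = 0.5 * (xs[k] + xs[k + 1])          # midpoint rule
        attr += grad_classify(mid) * (xs[k + 1] - xs[k])
    return attr

z0 = np.zeros(2)                 # latent baseline (decodes to the zero image here)
z1 = np.array([1.0, -0.5])       # latent code of the input
attr = latent_path_ig(z0, z1)
# Completeness check: attributions sum to the score difference
print(attr.sum(), classify(decode(z1)) - classify(decode(z0)))
```

Because the path is a straight line only in latent space, the decoded trajectory is curved in input space; the sketch therefore accumulates gradients against the decoded displacements `xs[k+1] - xs[k]` rather than against a straight-line input interpolation, which is what keeps every gradient evaluation on (a toy version of) the data manifold.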