Optimizing Machine Learning Explanations for Properties
Abstract
Many explanation methods exist, along with work quantifying the extent to which their explanations satisfy properties such as faithfulness or robustness. For instance, SmoothGrad \cite{smilkovsmoothgrad2017} encourages robustness by averaging gradients around an input, whereas LIME \cite{ribeirowhy2016} encourages fidelity by fitting a linear approximation of a function. However, we demonstrate that these forms of encouragement do not consistently target their desired properties. In this paper, we \emph{directly optimize} explanations for desired properties. We show that, compared to SmoothGrad and LIME, we are able to: (1) produce explanations that are closer to optimal with respect to chosen properties, and (2) manage trade-offs between properties more explicitly and intuitively.
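To make the idea concrete, the sketch below shows one way an explanation could be directly optimized for a fidelity-style property with gradient descent. It is an illustrative sketch, not the paper's actual objective: the scalar-output PyTorch model, the Gaussian perturbation scheme, the infidelity-plus-L2 loss, and all hyperparameters are assumptions introduced here for exposition.

```python
# Minimal sketch (assumed setup, not the paper's exact method): directly
# optimize a feature-attribution vector e so that e . delta predicts the
# model's output change under small perturbations delta around the input.
import torch

def optimize_explanation(model, x, n_perturb=256, sigma=0.1,
                         l2_weight=1e-3, steps=200, lr=0.05):
    """Gradient-descend on an attribution vector for local fidelity."""
    x = x.detach()
    e = torch.zeros_like(x, requires_grad=True)     # attribution to optimize
    opt = torch.optim.Adam([e], lr=lr)

    with torch.no_grad():
        f_x = model(x.unsqueeze(0)).reshape(-1)     # model output at the input

    for _ in range(steps):
        # Sample Gaussian perturbations around the input.
        delta = sigma * torch.randn(n_perturb, *x.shape)
        with torch.no_grad():
            f_pert = model(x.unsqueeze(0) + delta).reshape(-1)

        # Fidelity term: the attribution should linearly explain the
        # observed change in model output, plus a small L2 penalty.
        pred_change = (delta * e).reshape(n_perturb, -1).sum(dim=1)
        true_change = f_pert - f_x
        loss = ((pred_change - true_change) ** 2).mean() + l2_weight * e.norm() ** 2

        opt.zero_grad()
        loss.backward()
        opt.step()

    return e.detach()
```

Under this framing, trade-offs between properties could in principle be managed by adding further weighted terms (e.g., a robustness penalty) to the same loss, though the specific formulation here is only an assumption for illustration.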