# Integrated Gradients
[Integrated gradients](https://arxiv.org/abs/1703.01365) is an attribution method that works by approximating the following integral, where $F : \mathbb{R}^n \rightarrow \mathbb{R}$ is the network, $x$ is the input, $x'$ is the baseline, and $i$ indexes the feature being attributed.
$$
\mathrm{IntegratedGradients}_i(x) = (x_i - x_i') \times \int_{\alpha=0}^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} d\alpha
$$
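In practice the integral is approximated numerically, usually with a Riemann sum: evaluate the gradient at several points along the straight-line path from $x'$ to $x$ and multiply the average gradient by $(x_i - x_i')$. Below is a minimal sketch in JAX; the name `integrated_gradients`, the `num_steps` parameter, and the toy quadratic network are illustrative assumptions, not from the original paper.

```python
import jax
import jax.numpy as jnp

def integrated_gradients(f, x, x_baseline, num_steps=50):
    """Riemann-sum (midpoint rule) approximation of the IG integral."""
    # Midpoints alpha_k = (k + 0.5) / num_steps partition [0, 1].
    alphas = (jnp.arange(num_steps) + 0.5) / num_steps
    grad_f = jax.grad(f)
    # Gradient of f at each point x' + alpha * (x - x') on the path.
    grads = jax.vmap(lambda a: grad_f(x_baseline + a * (x - x_baseline)))(alphas)
    # (x_i - x_i') times the average gradient along the path.
    return (x - x_baseline) * grads.mean(axis=0)

# Toy "network" (illustrative only) so attributions can be checked by hand:
# for f(x) = sum(x**2) and a zero baseline, IG_i works out to x_i**2.
f = lambda x: jnp.sum(x ** 2)
x = jnp.array([1.0, 2.0])
x_baseline = jnp.zeros(2)

ig = integrated_gradients(f, x, x_baseline)
print(ig)                               # ~ [1., 4.]
print(ig.sum(), f(x) - f(x_baseline))   # both ~ 5.0; see completeness below
```

The original paper reports that somewhere between 20 and 300 steps typically suffices in practice; for this toy quadratic the midpoint rule is exact, since the gradient along the path is linear in $\alpha$.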
[Intuitively](https://www.tensorflow.org/tutorials/interpretability/integrated_gradients), we are accumulating all of the change in output as the input moves from $x'$ to $x$ along the straight-line path $x' + \alpha (x - x')$, which runs from the baseline at $\alpha = 0$ to the input at $\alpha = 1$. For a concrete example in words, let $x_i$ be the weight of a heavy exotic frog, let $x_i'$ be the average weight over all frogs, and let $F(x)$ be the estimated jump height of a frog with feature vector $x$ (one of whose features is weight). Given this, the right-hand side of the equation above can be interpreted as
$$
\begin{aligned}
\underbrace{\text{Weight gain of exotic frog}}_{(x_i-x_i')} \times \underbrace{\text{Average rate jump height changes per unit weight gain}}_{\int_{\alpha=0}^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} d\alpha} \\
= \text{Jump height change of exotic frog}\;\textbf{due to weight gain}
\end{aligned}
$$
In other words, the integrated gradients vector assigns to each input feature the portion of the change in output **due** to the change in that feature.
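This reading is backed by the completeness property proved in the original paper: for $F$ differentiable along the path, the per-feature attributions sum to the total change in output,
$$
\sum_{i=1}^n \mathrm{IntegratedGradients}_i(x) = F(x) - F(x'),
$$
so the integrated gradients vector exactly partitions $F(x) - F(x')$ across the input features. (This is also what the `ig.sum()` check in the sketch above verifies.)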