###### tags: `one-offs` `deep models` `em` `generative models` `particles`

# Attention as Deconvolution

**Overview**: In this note, I make some observations about the 'Attention' mechanism in the Transformer Neural Network, and its connections to the classical signal-processing task of deconvolution (in a simplified setting).

## Setup

Let $\sigma^{2}>0$ be known, and let $\nu$ be an unknown probability measure. Suppose that $\left\{ x_{i} : i\in [N] \right\}$ are generated by

\begin{align}
z_{i}&\overset{\mathrm{iid}}{\sim}\nu\\
x_{i}\mid z_{i}&\sim\mathcal{N}\left(z_{i},\sigma^{2}\cdot I_{p}\right).
\end{align}

The statistical task is then to recover $\left\{ z_{i}:i\in \left[N\right] \right\}$, given $\left\{ x_{i}:i\in \left[N\right] \right\}$.

## Solution by Expectation-Maximisation

Here is one approach, more-or-less related to the classical technique of nonparametric maximum likelihood estimation. Firstly, express the putative mixing measure $\nu$ in the form

\begin{align}
\nu\approx\frac{1}{K}\sum_{k\in\left[K\right]}\delta\left(\mu_{k},\mathrm{d}z\right)
\end{align}

for some collection of centroids $\left\{ \mu_{k}:k\in\left[K\right] \right\} \subseteq\mathbf{R}^{p}$. Next, take $K=N$, i.e. parameterise the mixing measure as richly as the data, and initialise the estimates on the data itself, i.e. set $\mu_{k}^{\left(0\right)}=x_{k}$ for $k\in [N]$. Note that since $\sigma^{2}>0$, this initialisation is not necessarily the maximum likelihood estimator.

Subsequently, refine the estimates $\left\{ \mu_{k}:k\in [N] \right\}$ using the EM algorithm, a standard tool for fitting the parameters of latent variable models. The updates then take the form

\begin{align}
\text{for }t\geq1,\qquad A_{i,k}^{\left(t\right)} &=\frac{\mathcal{N}\left(x_{i};\mu_{k}^{\left(t-1\right)},\sigma^{2}\cdot I_{p}\right)}{\sum_{\ell\in[N]}\mathcal{N}\left(x_{i};\mu_{\ell}^{\left(t-1\right)},\sigma^{2}\cdot I_{p}\right)} \\
\mu_{k}^{\left(t\right)} &=\frac{\sum_{i\in[N] }A_{i,k}^{\left(t\right)}\cdot x_{i}}{\sum_{i\in[N] }A_{i,k}^{\left(t\right)}},
\end{align}

and can be iterated until a local optimum is found.

## Connection to Attention

Observe now that if all $x_{i}$ and $\mu_{k}^{\left(t\right)}$ are constrained to live on a sphere of a given radius $R$, then the 'responsibility' factors simplify to

\begin{align}
\mathcal{N}\left(x_{i};\mu_{k}^{\left(t-1\right)},\sigma^{2}\cdot I_{p}\right)\propto\exp\left(\frac{\left\langle x_{i},\mu_{k}^{\left(t-1\right)}\right\rangle }{\sigma^{2}}\right),
\end{align}

where the implied prefactor depends only on $\left(\sigma^{2},R\right)$. That is, the 'soft assignment' matrix $A$ takes the form

\begin{align}
A_{i,k}^{\left(t\right)}&=\frac{\exp\left(\frac{\left\langle x_{i},\mu_{k}^{\left(t-1\right)}\right\rangle }{\sigma^{2}}\right)}{\sum_{\ell\in[N] }\exp\left(\frac{\left\langle x_{i},\mu_{\ell}^{\left(t-1\right)}\right\rangle }{\sigma^{2}}\right)},
\end{align}

or equivalently,

\begin{align}
\text{for }i\in[N] ,\qquad A_{i,\cdot}^{\left(t\right)} = \mathsf{SoftArgMax}\left(\left\{ \left\langle x_{i},\mu_{k}^{\left(t-1\right)}\right\rangle :k\in[N] \right\} ;\beta=\frac{1}{\sigma^{2}}\right).
\end{align}

From this point of view, we can interpret the updates as reproducing a form of Attention mechanism, as popularised in the Transformer architecture of Deep Learning. Note that this comes with some simplifications which are not made in general Attention layers: the query and key matrices are related through $QK^{\top}=\frac{1}{\sigma^{2}} \cdot I_{p}$, and the value matrix is taken as $V=I_{p}$.
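To make the correspondence concrete, here is a minimal NumPy sketch (not from the note itself; the dimensions, the radius $R$, the noise level, and the helper names are illustrative choices). It checks that, for points constrained to a sphere, the Gaussian responsibilities coincide with the SoftArgMax-of-inner-products form, and then performs one EM update.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma2, R = 64, 8, 0.5, 1.0    # illustrative values

# Data constrained to the sphere of radius R; centroids initialised on the data.
x = rng.standard_normal((N, p))
x *= R / np.linalg.norm(x, axis=1, keepdims=True)
mu = x.copy()                        # mu_k^{(0)} = x_k

def em_responsibilities(x, mu, sigma2):
    """A_{i,k}: Gaussian responsibilities, normalised over centroids k."""
    sq_dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (N, N)
    log_w = -0.5 * sq_dists / sigma2
    log_w -= log_w.max(axis=1, keepdims=True)                         # stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def attention_responsibilities(x, mu, sigma2):
    """Same rows via SoftArgMax of inner products with beta = 1/sigma^2;
    agrees with the Gaussian form when all x_i, mu_k share the same norm."""
    scores = (x @ mu.T) / sigma2                                      # (N, N)
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

A = em_responsibilities(x, mu, sigma2)
assert np.allclose(A, attention_responsibilities(x, mu, sigma2))

# M-step: each centroid becomes a re-weighted average of the data,
# normalised over i (note: the average need not remain on the sphere).
mu_new = (A.T @ x) / A.sum(axis=0, keepdims=True).T
```

Working in log-space and subtracting the row-wise maximum is simply the usual trick for a numerically stable softmax; it does not change the normalised responsibilities.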
Additionally, typical Attention-based architectures use different parameters at each layer, rather than iterating a single fixed update (as one would in the EM algorithm). This is consistent with the view of Attention layers as a sort of flexible 'unrolling' of EM-type algorithms.
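As a rough illustration of this 'unrolling' point (again only a sketch, with illustrative shapes and parameter choices of my own), the tied-parameter EM iteration and a per-layer-parameterised Attention-style stack can be written with the same update rule:

```python
import numpy as np

def attention_update(x, mu, Q, K, beta):
    """One Attention-style update: SoftArgMax of query-key scores over the
    centroids, followed by the re-weighted average of the data (V = I_p).
    With Q = K = I_p, beta = 1/sigma^2, and inputs on a common sphere,
    this matches the EM update described above."""
    scores = beta * (x @ Q) @ (mu @ K).T           # (N, N)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return (A.T @ x) / A.sum(axis=0, keepdims=True).T

rng = np.random.default_rng(1)
N, p, sigma2, T = 64, 8, 0.5, 4                    # illustrative values
x = rng.standard_normal((N, p))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # place the data on a sphere

# EM: the same (tied) parameters at every iteration.
mu_em = x.copy()
for _ in range(T):
    mu_em = attention_update(x, mu_em, np.eye(p), np.eye(p), 1.0 / sigma2)

# 'Unrolled' stack: a fresh Q and K at each layer, the extra flexibility
# that a Transformer-style architecture has over the plain EM iteration.
mu_unrolled = x.copy()
for _ in range(T):
    Q, K = rng.standard_normal((p, p)), rng.standard_normal((p, p))
    mu_unrolled = attention_update(x, mu_unrolled, Q, K, 1.0 / sigma2)
```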