# Higher Order Attention Mechanisms
## TL;DR
We can think of Multihead Dot Product Attention (MHDPA) as modelling the relations between two "entities". But while we're at it: (1) why stick to two, and (2) why not nest MHDPAs to model relations between relations (between relations between ...)? Has it been done?
## Notation and Recap: MHDPA
MHDPA is a mechanism that enables the encapsulation of the relation between two "generic objects" or "entities". Let's say we have $M$ objects indexed by $\mu$, each of which we wish to relate with $N$ objects indexed by $\nu$.
The first step is to require $M$ key vectors, $\mathbf{k}^{\mu}$, $N$ query vectors $\mathbf{q}^{\nu}$ and $N$ corresponding value vectors $\mathbf{v}^{\nu}$. We will denote the $i$-th component of $\mathbf{k}^{\mu}$, $j$-th component of $\mathbf{q}^{\nu}$ and $n$-th component of $\mathbf{v}^{\nu}$ as $k^{\mu}_{i}$, $q^{\nu}_{j}$ and $v^{\nu}_{n}$ respectively.
The second step is to compute the _attention weights_, which in Einsum convention are given by: $$
\bar{W}^{\mu\nu} = k^{\mu}_{i} q^{\nu}_{j} M_{ij}
$$ where $M_{ij}$ is a bilinear operator (usually the Kronecker delta $\delta_{ij}$). Note that $\bar{W}$ is an $M \times N$ matrix. Subsequently, one may softmax-normalize $\bar{W}$ along $\nu$ to obtain what we shall call: $$
W^{\mu\bar{\nu}}
$$ -- the bar on an index means that the tensor has been softmax-normalized along that index, so that a contraction over it is a softmax-weighted average.
The third and final step is to compute the attention output as: $$
y^{\mu}_{n} = W^{\mu\bar{\nu}} v^{\bar\nu}_n
$$
Intuitively, we can think of $W^{\mu\nu}$ as the similarity between the _object representations_ of objects indexed by $\mu$ and $\nu$ (where $\mu$ is the index of the _reference object_, i.e. it's special). As such, we only model the coupling between two objects.
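To make the three steps concrete, here's a minimal single-head sketch in NumPy (shapes are made up; $M_{ij} = \delta_{ij}$, and the usual $1/\sqrt{d_k}$ scaling and the multi-head splitting are omitted for brevity):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up shapes: M reference objects, N target objects, d_k key/query dim, d_v value dim.
M, N, d_k, d_v = 4, 5, 8, 16
rng = np.random.default_rng(0)
k = rng.normal(size=(M, d_k))   # k^{mu}_i
q = rng.normal(size=(N, d_k))   # q^{nu}_j
v = rng.normal(size=(N, d_v))   # v^{nu}_n

# Step 2: attention weights with M_ij = delta_ij, then softmax along nu.
W_bar = np.einsum('mi,ni->mn', k, q)   # \bar{W}^{mu nu}
W = softmax(W_bar, axis=1)             # W^{mu \bar{nu}}

# Step 3: attention output y^{mu}_n = W^{mu \bar{nu}} v^{\bar{nu}}_n
y = np.einsum('mn,nd->md', W, v)
assert y.shape == (M, d_v)
```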
## Extensions
Now, one can imagine the following two extensions, which can be used in conjunction with the vanilla MHDPA.
### Multiple Objects
We can have the same key target two or more queries. Say we have three objects indexed by $\mu$, $\nu$ and $\rho$, associated with embedding vectors $\mathbf{k}^{\mu}$, $\mathbf{q}^{\nu}$ and $\mathbf{r}^{\rho}$, where the latter two are queries (we may set $\mathbf{q}^{\nu} = \mathbf{r}^{\nu}$ if we please) and we denote the components of $\mathbf{r}^{\rho}$ with $r^{\rho}_{\kappa}$ (that's a kappa to not collide with the $k$ for key, sorry).
The _attention tensor_ is given by:
$$
\bar W^{\mu\nu\rho} = k^{\mu}_{i} q^{\nu}_{j} r^{\rho}_{\kappa} M_{ij\kappa}
$$
where $M$ is a trilinear operator (which can be the generalized Kronecker delta, but in principle also learned as a parameter).
Now for the values, we can have a tensor $v^{\nu\rho}_{n}$, such that the attention output is given by:
$$
y^{\mu}_{n} = W^{\mu\overline{\nu\rho}} v^{\overline{\nu\rho}}_{n}
$$
where the softmax normalization is jointly over $\nu$ and $\rho$.
Intuitively, this models how the reference object indexed by $\mu$ **jointly** interacts with objects $\nu$ and $\rho$.
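A minimal NumPy sketch of this third-order variant, again with made-up shapes; $M_{ij\kappa}$ is a random stand-in for a learned trilinear operator (the generalized Kronecker delta would work too, see the comment), and the softmax is joint over $(\nu, \rho)$:

```python
import numpy as np

def softmax_joint(x, axes):
    # Softmax jointly over the given axes.
    m = x.max(axis=axes, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axes, keepdims=True)

M, N, P, d, d_v = 3, 4, 5, 8, 16
rng = np.random.default_rng(0)
k = rng.normal(size=(M, d))        # k^{mu}_i
q = rng.normal(size=(N, d))        # q^{nu}_j
r = rng.normal(size=(P, d))        # r^{rho}_kappa
M3 = rng.normal(size=(d, d, d))    # trilinear operator M_{ij kappa} (here a random placeholder)
# Alternative: generalized Kronecker delta, M3[i, j, kappa] = 1 iff i == j == kappa.
v = rng.normal(size=(N, P, d_v))   # v^{nu rho}_n

# \bar{W}^{mu nu rho} = k^{mu}_i q^{nu}_j r^{rho}_kappa M_{ij kappa}
W_bar = np.einsum('mi,nj,pk,ijk->mnp', k, q, r, M3)
W = softmax_joint(W_bar, axes=(1, 2))   # joint softmax over (nu, rho)

# y^{mu}_n = W^{mu \bar{nu rho}} v^{\bar{nu rho}}_n
y = np.einsum('mnp,npd->md', W, v)
assert y.shape == (M, d_v)
```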
### Modelling Meta-Relations (Relations between Relations)
This is a bit like nesting multiple MHDPAs: we want to model the relations between relations. But for that, we must first embed the relation itself. The procedure is the following (but I think it can be simplified).
Let's require _value tensors_ $t^{\nu}_{m}$ and $u^{\rho}_{n}$. Now consider:
$$
a^{\mu\rho}_{m} = W^{\mu\bar\nu\rho} t^{\bar\nu}_{m}
$$
and
$$
b^{\mu\nu}_{n} = W^{\mu\nu\bar\rho} u^{\bar\rho}_{n}
$$
Given this, the vector $\mathbf{a}^{\mu\rho}$ can be thought of as an embedding of the relation between objects $\mu$ and $\rho$, and the vector $\mathbf{b}^{\mu\nu}$ as that between objects $\mu$ and $\nu$.
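In code, these two contractions look like the following sketch ($\bar W^{\mu\nu\rho}$ is a random stand-in for the unnormalized attention tensor from the previous subsection, and $t$, $u$ are made-up value tensors):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

M, N, P, d_t, d_u = 3, 4, 5, 16, 16
rng = np.random.default_rng(0)
W_bar = rng.normal(size=(M, N, P))   # stand-in for \bar{W}^{mu nu rho}
t = rng.normal(size=(N, d_t))        # value tensor t^{nu}_m
u = rng.normal(size=(P, d_u))        # value tensor u^{rho}_n

# a^{mu rho}_m = W^{mu \bar{nu} rho} t^{\bar{nu}}_m   (softmax over nu only)
a = np.einsum('mnp,ni->mpi', softmax(W_bar, axis=1), t)

# b^{mu nu}_n = W^{mu nu \bar{rho}} u^{\bar{rho}}_n   (softmax over rho only)
b = np.einsum('mnp,pi->mni', softmax(W_bar, axis=2), u)

assert a.shape == (M, P, d_t) and b.shape == (M, N, d_u)
```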
Observe that the vectors $\mathbf{a}^{\mu\rho}$ and $\mathbf{b}^{\mu\nu}$ again look like keys and queries in the vanilla MHDPA (except that they are now tensor-valued in the Greek indices, whereas in vanilla MHDPA they are vector-valued, e.g. in $\mathbf{k}^{\mu}$, $\mu$ is the single Greek index).
So we try the obvious and put another layer of MHDPA on top. But there's a problem: which of $\mathbf{a}^{\mu\rho}$ and $\mathbf{b}^{\mu\nu}$ should be the key and which the query for this nested attention mechanism? I don't have a straight answer. But here's what one could do:
Run two attention mechanisms: once with $\mathbf{a}^{\mu\rho}$ as the key and $\mathbf{b}^{\mu\nu}$ as the query, and once vice versa. The two mechanisms yield the respective attended values $c^{\mu\rho}_{g}$ and $d^{\mu\nu}_{h}$, which we may now combine with linear operators $G$ and $\tilde M$:
$$
y^{\mu}_{l} = \tilde M_{lgh} G^{\rho\nu} c^{\mu\rho}_{g} d^{\mu\nu}_{h}
$$
That said, I think the last part can probably be simplified.
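For completeness, here is one possible reading of the nested step as a NumPy sketch. The text above doesn't pin down where the values of the two nested attentions come from, so this assumes they are linear projections of the query-side relation embeddings (`Vb`, `Va` below), mirroring the recap where values live on the query side; the bilinear score operators `M1`, `M2`, the coupling $G$ and the combination tensor $\tilde M$ are random placeholders:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

M, N, P = 3, 4, 5           # numbers of mu-, nu-, rho-indexed objects
d_a, d_b = 16, 16           # dims of the relation embeddings a and b
d_c, d_d, d_y = 8, 8, 32    # dims of the nested attended values and the output
rng = np.random.default_rng(0)

a = rng.normal(size=(M, P, d_a))   # a^{mu rho}_m from the previous sketch
b = rng.normal(size=(M, N, d_b))   # b^{mu nu}_n from the previous sketch

# Assumed parameters (not specified in the text):
M1 = rng.normal(size=(d_a, d_b))        # bilinear scores for (a as key, b as query)
M2 = rng.normal(size=(d_b, d_a))        # bilinear scores for (b as key, a as query)
Vb = rng.normal(size=(d_b, d_c))        # assumed value projection of b
Va = rng.normal(size=(d_a, d_d))        # assumed value projection of a
G = rng.normal(size=(P, N))             # G^{rho nu}
Mt = rng.normal(size=(d_y, d_c, d_d))   # \tilde{M}_{lgh}

# First nested attention: a is the key, b is the query; softmax over nu.
S1 = np.einsum('mpi,mnj,ij->mpn', a, b, M1)
c = np.einsum('mpn,mnj,jg->mpg', softmax(S1, axis=2), b, Vb)   # c^{mu rho}_g

# Second nested attention: b is the key, a is the query; softmax over rho.
S2 = np.einsum('mnj,mpi,ji->mnp', b, a, M2)
d = np.einsum('mnp,mpi,ih->mnh', softmax(S2, axis=2), a, Va)   # d^{mu nu}_h

# Combine: y^{mu}_l = \tilde{M}_{lgh} G^{rho nu} c^{mu rho}_g d^{mu nu}_h
y = np.einsum('lgh,pn,mpg,mnh->ml', Mt, G, c, d)
assert y.shape == (M, d_y)
```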