Michael Pearce

@fcY63KdMTLqIQhxOgSpIwQ

Joined on Jul 27, 2017

  • "Toward a Mathematical Framework for Computation in Superposition" (TMF) [link] provides a proof of concept for how superposition can allow parallel on many features in parallel. As a mathematical description, it serves as a starting place for understanding more complex types of superposition computation. Here we give a condensed version of the points made in the post. Computing ANDs in superposition enter image description here Features: Say we have a set of $m$ sparse, boolean features (eg, a set of topics that are either present or not in a text).Denote features by $f_{\alpha}$. Let $\ell \ll m$ be the number of features "on" at the same time. Neurons: In the MLP layer, $d$ neurons each take a random subset of the features as inputs. Neurons are only nonzero if at least two features are "on".
  • Bilinear layers

    Bilinear layers are an alternative to the standard MLP layer with some nonlinear activation function. For an input $x$, a bilinear MLP would be $$\begin{aligned}MLP_{bilinear}(x) &= (W x) \odot (W' x)\\ MLP_{bilinear}(x)_i &= \sum_{jk} W_{ij} W'_{ik} x_{j} x_{k}\end{aligned}$$ where $\odot$ represents element-wise multiplication and $W, W'$ are weight matrices. Unlike other nonlinear layers, the bilinear layer can be written in terms of only linear operations and a third-order tensor, $W_{ij} W'_{ik}$ (a sketch verifying this identity appears after this listing).

    Pairwise Feature Interactions

    The advantage of the bilinear layer is that it is the simplest kind of nonlinearity: it contains only pairwise interactions between inputs. This can make it simpler to understand how input features might interact to create output features. For example, if the input is composed of feature vectors $f$ with (sparse) activations $a$, such that $x_i = \sum_{\alpha} a_{\alpha}f_{\alpha i}$, then the output can be written as $$\begin{aligned}MLP_{bilinear}(x)_i &= \sum_{\alpha \beta} a_{\alpha}a_{\beta} \left(\sum_{jk}W_{ij}W'_{ik}f_{\alpha j} f_{\beta k} \right)\end{aligned}$$ after grouping activations and features together. If we expect output features to represent pairwise operations of input features (eg ANDs of input features), then it's natural to define the output activations and features as: $$\begin{aligned}a^O_{(\alpha \beta)} &\equiv a_{\alpha}a_{\beta} \\ f^O_{(\alpha \beta) i} &\equiv \sum_{jk}W_{ij}W'_{ik}\left(f_{\alpha j} f_{\beta k} + f_{\beta j} f_{\alpha k}\right) \\ MLP_{bilinear}(x)_i &= \sum_{(\alpha \beta)} a^O_{(\alpha\beta)} f^O_{(\alpha\beta)i}\end{aligned}$$ where $(\alpha \beta)$ indexes over pairs of input features, including twin pairs like (00). The features are written in a form that is symmetric over $\alpha, \beta$.
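Below is a minimal numpy sketch of the superposition-AND setup from the first note. The connection probability, the parameter values, and the ReLU-with-bias construction for "nonzero only if at least two features are on" are illustrative assumptions on my part, not the exact construction from the TMF post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the post).
m, d, ell = 1000, 200, 4     # features, neurons, features "on" at once
p_connect = 0.1              # each neuron reads a random ~10% of features (assumed)

# Random connectivity: neuron i reads feature j iff conn[i, j] == 1.
conn = (rng.random((d, m)) < p_connect).astype(float)

# A sparse boolean feature vector with exactly ell features on.
f = np.zeros(m)
f[rng.choice(m, size=ell, replace=False)] = 1.0

# One simple way to make a neuron nonzero only when at least two of its
# connected features are on: sum the connected inputs, subtract 1, then ReLU.
pre = conn @ f - 1.0
neurons = np.maximum(pre, 0.0)

# Each active neuron saw >= 2 of its connected features on, so its activation
# reflects a pairwise co-occurrence (an AND) among the sparse features.
print("active neurons:", int((neurons > 0).sum()), "of", d)
```

A sketch checking the bilinear-layer identity from the second note follows: the element-wise form $(Wx)\odot(W'x)$ agrees with the contraction against the third-order tensor $W_{ij}W'_{ik}$. The dimensions and random weights are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 4                      # arbitrary illustrative sizes

W  = rng.normal(size=(d_out, d_in))
Wp = rng.normal(size=(d_out, d_in))     # plays the role of W' in the equations
x  = rng.normal(size=d_in)

# Element-wise (Hadamard) form of the bilinear layer.
out_hadamard = (W @ x) * (Wp @ x)

# Same output via the third-order tensor B_{ijk} = W_{ij} W'_{ik}.
B = np.einsum("ij,ik->ijk", W, Wp)
out_tensor = np.einsum("ijk,j,k->i", B, x, x)

print(np.allclose(out_hadamard, out_tensor))   # expect True
```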
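The same einsum pattern extends to the feature decomposition in the second note: replacing $x$ with $\sum_\alpha a_\alpha f_\alpha$ and expanding the two sums recovers the pairwise terms $a_\alpha a_\beta \sum_{jk} W_{ij} W'_{ik} f_{\alpha j} f_{\beta k}$ that are regrouped into the output activations $a^O_{(\alpha\beta)}$ and features $f^O_{(\alpha\beta)i}$.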