
Computation in Superposition: Math Overview

"Toward a Mathematical Framework for Computation in Superposition" (TMF) [link] provides a proof of concept for how superposition can allow parallel on many features in parallel. As a mathematical description, it serves as a starting place for understanding more complex types of superposition computation.

Here we give a condensed version of the points made in the post.

Computing ANDs in superposition


  • Features: Say we have a set of $m$ sparse, boolean features (e.g., a set of topics that are either present or not in a text).
    • Denote the features by $f_\alpha$. Let $\ell$ be the number of features "on" at the same time.
  • Neurons: In the MLP layer, $d$ neurons each take a random subset of the features as inputs. Neurons are only nonzero if at least two of their input features are "on".
    • $z_i = \mathrm{ReLU}\left(\sum_\alpha W_{i\alpha} f_\alpha - 1\right)$, where $W_{i\alpha}$ is 1 with probability $p$ and 0 otherwise.
    • Each neuron's inputs define a set $S_i$ of features.
    • $z_i = 1$ if exactly two features in $S_i$ are "on". More generally, $z_i = k - 1$ for $k \ge 2$ features "on" in $S_i$, and $z_i = 0$ otherwise.
  • Read off ANDs: To linearly read off the AND of two features, we take the average over neurons that have both features as inputs (see the first code sketch after this list).
    • AND estimate: $\widehat{f_\alpha \wedge f_\beta} = \mathbb{E}_{f_\alpha, f_\beta \in S_i}[z_i] = \frac{\sum_i z_i W_{i\alpha} W_{i\beta}}{\sum_j W_{j\alpha} W_{j\beta}}$.
    • We expect the $z_i$ included in the sum to typically be $1$ when $f_\alpha$ and $f_\beta$ are "on" and $0$ otherwise, giving the correct AND behavior. Rare events where other "on" features are included in the sums add noise to these values.
  • Interference: Noisy contributions to the read-off from other features being "on" are negligible if the features are sufficiently sparse and the random subsets of feature inputs are small enough.
    • Overall, the noise contributions are small if the expected number of nonzero noise terms in a neuron's sum over features is small. Since $\ell$ features are "on" and each contributes to a given neuron's sum with probability $p$, this requires $p\ell \ll 1$.
    • Consider a neuron $z_i$ with both $f_\alpha$ and $f_\beta$ as inputs.
      • If both features are "on", then the neuron is in the linear piece of the ReLU, so $z_i = 1 + \sum_{\gamma \neq \alpha,\beta} W_{i\gamma} f_\gamma = 1 + \Delta$. There are $\ell - 2$ other features that are "on", each with probability $p$ of contributing to the sum. So $\mathbb{E}[\Delta]$ contributes $\sim p(\ell - 2)$ to $\widehat{f_\alpha \wedge f_\beta}$. This is only small if $p\ell \ll 1$, i.e. the probability of extra features being "on" and connected to the neuron is small.
      • If only one feature is "on", then a similar calculation shows the noise contribution is also $\sim p\ell$.
      • If both features are "off", then there is only noise when two other features in the sum are "on". This is more rare, so the noise contribution is smaller, $\sim (p\ell)^2$, which is negligible if $p\ell \ll 1$.
  • Universal AND: It's possible to read off all pairwise ANDs with a layer size that is a power of $\log(m)$.
    • To ensure a high probability that any pair of features can be read off, we need $p > \sqrt{2\log m / d}$.
      • A given pair of features will be inputs into $dp^2$ neurons on average. The probability of not appearing in any neuron is $\exp(-dp^2)$, from the Poisson distribution.
      • Across $\binom{m}{2}$ pairs, the probability of all pairs being in at least one neuron's inputs is $\left(1 - \exp(-dp^2)\right)^{\binom{m}{2}} \approx 1 - \binom{m}{2}\exp(-dp^2)$. For the second term to be small, we need $p > \sqrt{2\log m / d}$.
    • The post makes the particular choice $p = \log^2 m / \sqrt{d}$. Then each estimate $\widehat{f_\alpha \wedge f_\beta}$ is an average over $\sim \log^4 m$ terms.
    • If $\ell$ is also a power of $\log m$, then $d$ can be a power of $\log m$ and still be large enough for small interference terms.
  • Realistic regimes:
    • Based on values from Anthropic's "Towards Monosemanticity", it wouldn't be possible to use the U-AND construction to compute ANDs for all features (the arithmetic is checked in the second code sketch after this list).
      • $m \approx 4{,}000$ (range 500–130K); technically these are features found in the neuron activations.
      • $\ell \approx 10$–$20$.
      • $d \approx 500$.
      • Constraints on $p$:
        • $p \lesssim 1/\ell = 0.05$–$0.1$
        • $p > \sqrt{2\log m / d} = 0.18$
      • If we take $p = 0.03$, then we would only be able to extract $1 - \exp(-dp^2) = 36\%$ of all ANDs.

Extensions

  • When inputs are in superposition: The construction also works if the inputs are already in superposition. However, in this case the number of ANDs that can be computed is of order the number of parameters in the layer, $O(d_{\mathrm{in}} d)$ (see the first sketch at the end of this section).
    • Let the $m$ sparse, boolean features be an "overbasis" for a $d_{\mathrm{in}}$-dim space. So $f_1, \dots, f_m \in \mathbb{R}^{d_{\mathrm{in}}}$. The input is the sum of the $\ell$ "on" feature vectors.
    • Decompose the neurons' weight matrix into $W = \hat{W} F$, where $F_{\alpha i} = (f_\alpha)_i$ ($m \times d_{\mathrm{in}}$ dims) puts the input back into the feature basis via a dot product. $\hat{W}$ is the same as in the U-AND construction above.
    • After applying $F$, the features are either "on" with value 1 or "off" with a small random value of standard deviation $\sim 1/\sqrt{d_{\mathrm{in}}}$.
    • Unlike before, the interference terms now add up for "off" features. If neurons connect to more than $\sim d_{\mathrm{in}}$ features via $\hat{W}$, then the interference terms are order 1 and cause "misfires". This requires $p < d_{\mathrm{in}}/m$.
    • For large $m$, we cannot compute all pairwise ANDs, since $p$ would be too small for neurons to contain all pairs of features.
    • Instead, we have to choose $\sim d_{\mathrm{in}} d$ ANDs to compute. It's basically the same construction, but restricted to the features involved in the chosen ANDs.
  • Random Gaussian weights for inputs not in superposition:
    • A layer with random Gaussian weights would have neuron activations $z_i = \mathrm{ReLU}\left(\sum_\alpha W_{i\alpha} f_\alpha - b_i\right)$ for boolean feature values $f_\alpha$.
    • We need read-off weights $r_i$ that are correlated with $z_i$ when two features, $f_\alpha$ and $f_\beta$, are active, but uncorrelated otherwise (a Monte Carlo sketch of this read-off appears at the end of this section).
    • To read off $f_\alpha \wedge f_\beta$, we can use weights $r_i = \frac{W_{i\alpha} W_{i\beta}}{\mathbb{E}\left[\sum_i W_{i\alpha} W_{i\beta} z_i \,\middle|\, \alpha,\beta \text{ active}\right]}$, where the expectation is over inputs. Because of the expectation, the read-off weights would need to be learned.
      • If both features are active, then by design we'd expect $\mathbb{E}\left[\sum_i r_i z_i\right] = 1$.
      • If one or zero features are active, then $\mathbb{E}\left[\sum_i r_i z_i\right] = 0$, since either $W_{i\alpha}$ or $W_{i\beta}$ is uncorrelated with $z_i$.
      • In general, the noise in the numerator sum ($\sum_i z_i W_{i\alpha} W_{i\beta}$) should scale as $\sim \sqrt{\ell k}\,\sigma_W^3$ for $\ell$ active features and $k$ nonzero neuron activations $z_i$. The denominator goes as $k\sigma_W^3$, so the read-off noise is $\sim \sqrt{\ell/k}$.
      • If the bias is learned, then its value is likely determined by minimizing the noise term $\mathbb{E}\left[\,y - b_i \mid z_i = x + y - b_i > 0\,\right]$, where $x, y$ are random Gaussian variables with mean zero and variances $2\sigma_W^2$ and $(\ell - 2)\sigma_W^2$ that represent the sums over the two AND features and over the other active features, respectively.
    • It's possible to also read off $f_\alpha \wedge f_\beta$ using the weights $r_i = \frac{W_{i\alpha} + W_{i\beta}}{\mathbb{E}\left[\sum_i (W_{i\alpha} + W_{i\beta}) z_i \,\middle|\, \alpha,\beta \text{ active}\right]}$.
      • This approach has the advantage of a higher correlation between $r_i$ and $z_i$, but the disadvantage that the read-off is nonzero when only one feature is active, going something like $\mathbb{E}\left[\sum_i r_i z_i\right] \approx \frac{\mathbb{E}\left[\sum_i (W_{i\alpha})^2 z_i\right]}{\mathbb{E}\left[\sum_i (W_{i\alpha} + W_{i\beta})^2 z_i\right]} \approx \frac{k\sigma_W^2}{k(2\sigma_W)^2} \approx \frac{1}{4}$.
      • This approach is perhaps favorable when $d$ is not large and the noise from the first approach is large.
  • Stacking U-AND layers: It might be possible to stack multiple AND layers in a row to start computing more complex boolean functions. If interference is not carefully managed, then errors are likely to propagate.
  • ANDs of three or more features: The same type of construction can be used to compute ANDs up to order $k$ (e.g. $f_1 \wedge f_2 \wedge f_3 \wedge \dots \wedge f_k$) simultaneously (see the last sketch at the end of this section).
    • If a neuron's inputs contain three features, then its value is 2 if all three are "on" and 1 if only two are "on".
    • An AND of three features (say $f_1$, $f_2$, $f_3$) can be computed by adding and subtracting averages over neurons that have two or three of the features as inputs:
      • $\mathrm{AND}_{1,2,3} = \mathbb{E}_{1,2 \in S_i}[z_i] + \mathbb{E}_{1,3 \in S_i}[z_i] + \mathbb{E}_{2,3 \in S_i}[z_i] - \mathbb{E}_{1,2,3 \in S_i}[z_i]$.
    • Similar constructions can be made for higher-order ANDs.
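
The extensions above can also be illustrated numerically. First, the "inputs in superposition" case: a minimal sketch (sizes chosen for illustration) in which the boolean features are random unit directions in a $d_{\mathrm{in}}$-dimensional space, and the only change to the basic U-AND sketch is the extra map $F$ back into the feature basis.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_in, d, p, l = 1000, 1024, 4000, 0.05, 3   # illustrative sizes

# F[a, :] is the (random, unit-norm) direction of feature a in input space.
F = rng.standard_normal((m, d_in))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# The input x is the superposition (sum) of the l "on" feature vectors.
on = rng.choice(m, size=l, replace=False)
x = F[on].sum(axis=0)

# W = W_hat @ F: F maps x to noisy feature values (~1 if "on",
# ~1/sqrt(d_in) noise if "off"); W_hat is the usual random boolean matrix.
W_hat = (rng.random((d, m)) < p).astype(float)
f_noisy = F @ x
z = np.maximum(W_hat @ f_noisy - 1.0, 0.0)

# Read off the AND of the first two "on" features, as in the basic case.
mask = (W_hat[:, on[0]] == 1) & (W_hat[:, on[1]] == 1)
print("AND estimate (both on):", z[mask].mean())
```

Here each neuron connects to $pm \approx 50$ features, comfortably below $d_{\mathrm{in}}$, so the summed "off"-feature interference stays well below the ReLU threshold.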
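
Next, the random-Gaussian-weights read-off. The normalization in the read-off weights is an expectation over inputs, so this sketch estimates it by Monte Carlo over random sets of "other" active features (an assumption-laden illustration: a fixed bias $b$ and the $W_{i\alpha}W_{i\beta}$ read-off from the first variant above).

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, l, sigma_w, b = 200, 4000, 4, 1.0, 2.0   # illustrative sizes; fixed bias b

W = sigma_w * rng.standard_normal((d, m))

def activations(active):
    """z_i = ReLU(sum_a W_ia f_a - b) for a boolean input with `active` features on."""
    f = np.zeros(m)
    f[list(active)] = 1.0
    return np.maximum(W @ f - b, 0.0)

alpha, beta = 0, 1

# Monte Carlo estimate of E[sum_i W_ia W_ib z_i | alpha, beta active],
# averaging over random draws of the other l-2 active features.
norm = np.mean([
    np.sum(W[:, alpha] * W[:, beta]
           * activations([alpha, beta, *rng.choice(np.arange(2, m), l - 2, replace=False)]))
    for _ in range(500)
])
r = W[:, alpha] * W[:, beta] / norm   # read-off weights r_i

others = rng.choice(np.arange(2, m), size=l - 2, replace=False)
print("both active   :", r @ activations([alpha, beta, *others]))   # ~1
print("one active    :", r @ activations([alpha, *others]))         # ~0 + noise
print("neither active:", r @ activations(list(others)))             # ~0 + noise
```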
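
Finally, the third-order AND read-off via adding and subtracting averages. With only the three relevant features "on", the combination above returns roughly 1 when all three are on and roughly 0 when only two are (sizes again illustrative; $dp^3$ must be large enough that several neurons see all three features):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p = 200, 50000, 0.06   # need d*p^3 >> 1 neurons per feature triple
W = (rng.random((d, m)) < p).astype(float)

def z(f):
    return np.maximum(W @ f - 1.0, 0.0)

def avg_over(feats, zvals):
    """Average z_i over neurons whose inputs include every feature in `feats`."""
    mask = np.all(W[:, feats] == 1, axis=1)
    return zvals[mask].mean()

def and3(zvals, a, b, c):
    # E_{a,b}[z] + E_{a,c}[z] + E_{b,c}[z] - E_{a,b,c}[z]
    return (avg_over([a, b], zvals) + avg_over([a, c], zvals)
            + avg_over([b, c], zvals) - avg_over([a, b, c], zvals))

f = np.zeros(m)
f[[0, 1, 2]] = 1.0
print("all three on:", and3(z(f), 0, 1, 2))   # ~1, up to O(p) corrections

f[2] = 0.0
print("only two on :", and3(z(f), 0, 1, 2))   # ~0
```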