
Computation in Superposition: Math Overview

"Toward a Mathematical Framework for Computation in Superposition" (TMF) [link] provides a proof of concept for how superposition can allow parallel on many features in parallel. As a mathematical description, it serves as a starting place for understanding more complex types of superposition computation.

Here we give a condensed version of the points made in the post.

Computing ANDs in superposition


  • Features: Say we have a set of $m$ sparse, boolean features (e.g., a set of topics that are either present or not in a text).
    • Denote the features by $f_\alpha$. Let $\ell$ be the number of features "on" at the same time.
  • Neurons: In the MLP layer, $d$ neurons each take a random subset of the features as inputs. Neurons are only nonzero if at least two of their input features are "on".
    • $z_i = \mathrm{ReLU}\left(\sum_\alpha W_{i\alpha} f_\alpha - 1\right)$, where $W_{i\alpha}$ is 1 with probability $p$ and 0 otherwise.
    • Each neuron's inputs define a set $S_i$ of features.
    • $z_i = 1$ if exactly two features in $S_i$ are "on". More generally, $z_i = k - 1$ for $k \ge 2$ features "on" in $S_i$, and $z_i = 0$ otherwise.
  • Read off ANDs: To linearly read off the AND of two features, we take the average over neurons that have both features as inputs (see the first code sketch after this list).
    • AND estimate: $\widehat{f_\alpha \wedge f_\beta} = \mathbb{E}_{f_\alpha, f_\beta \in S_i}[z_i] = \frac{\sum_i z_i W_{i\alpha} W_{i\beta}}{\sum_j W_{j\alpha} W_{j\beta}}$.
    • We expect the $z_i$ included in the sum to typically be $1$ when $f_\alpha$ and $f_\beta$ are "on" and $0$ otherwise, giving the correct AND behavior. Rare events where other "on" features are included in the sums add noise to these values.
  • Interference: Noisy contributions to the read-off from other features being "on" are negligible if the features are sufficiently sparse and the random subsets of feature inputs are small enough.
    • Overall, the noise contributions are small if the expected number of nonzero noise terms in a neuron's sum over features is small. Since $\ell$ features are "on" and each contributes to a given neuron's sum with probability $p$, this requires $p\ell \ll 1$.
    • Consider a neuron $z_i$ with both $f_\alpha$ and $f_\beta$ as inputs.
      • If both features are "on", then the neuron is in the linear piece of the ReLU, so $z_i = 1 + \sum_{\gamma \neq \alpha,\beta} W_{i\gamma} f_\gamma = 1 + \Delta$. There are $\ell - 2$ other features that are "on", each with probability $p$ of contributing to the sum. So $\mathbb{E}[\Delta]$ contributes $\sim p(\ell - 2)$ to $\widehat{f_\alpha \wedge f_\beta}$. This is only small if $p\ell \ll 1$, i.e. the probability of extra features being "on" and connected to the neuron is small.
      • If only one feature is "on", then a similar calculation shows the noise contribution is also $\sim p\ell$.
      • If both features are "off", then there is only noise when two other features in the sum are "on". This is more rare, so the noise contribution is smaller, $\sim (p\ell)^2$, which is negligible if $p\ell \ll 1$.
  • Universal AND: It's possible to read off all pairwise ANDs with a layer size that is a power of $\log(m)$.
    • To ensure a high probability that any pair of features can be read off, we need $p > \sqrt{2\log m / d}$.
      • A given pair of features will be inputs into $dp^2$ neurons on average. The probability of not appearing in any neuron is $\exp(-dp^2)$, from the Poisson distribution.
      • Across $\binom{m}{2}$ pairs, the probability of all pairs being in at least one neuron's inputs is $\left(1 - \exp(-dp^2)\right)^{\binom{m}{2}} \approx 1 - \binom{m}{2}\exp(-dp^2)$. For the second term to be small, we need $p > \sqrt{2\log m / d}$.
    • The post makes the particular choice $p = \log^2 m / \sqrt{d}$. Then each estimate $\widehat{f_\alpha \wedge f_\beta}$ is an average over $\sim \log^4 m$ terms.
    • If $\ell$ is also a power of $\log m$, then $d$ can be a power of $\log m$ and still be large enough for small interference terms.
  • Realistic regimes:
    • Based on values from Anthropic's "Towards Monosemanticity", it wouldn't be possible to use the U-AND construction to compute ANDs for all features (the arithmetic is checked in the second code sketch after this list).
      • $m \approx 4{,}000$ (range 500–130K); technically these are features found in the neuron activations.
      • $\ell \approx 10$–$20$.
      • $d \approx 500$.
      • Constraints on $p$:
        • $p \lesssim 1/\ell = 0.05$–$0.1$
        • $p > \sqrt{2\log m / d} = 0.18$
      • If we take $p = 0.03$, then we would only be able to extract $1 - \exp(-dp^2) = 36\%$ of all ANDs.

Extensions

  • When inputs are in superposition: The construction also works if the inputs are already in superposition. However, in this case the number of ANDs that can be computed is of order the number of parameters in the layer, $O(d_{\mathrm{in}} d)$ (see the first sketch at the end of this section).
    • Let the $m$ sparse, boolean features be an "overbasis" for a $d_{\mathrm{in}}$-dim space. So $f_1, \dots, f_m \in \mathbb{R}^{d_{\mathrm{in}}}$. The input is the sum of the $\ell$ "on" feature vectors.
    • Decompose the neurons' weight matrix into $W = \hat{W} F$, where $F_{\alpha i} = (f_\alpha)_i$ ($m \times d_{\mathrm{in}}$ dims) puts the input back into the feature basis via a dot product. $\hat{W}$ is the same as in the U-AND construction above.
    • After applying $F$, the features are either "on" with value 1 or "off" with a small random value of standard deviation $\sim 1/\sqrt{d_{\mathrm{in}}}$.
    • Unlike before, the interference terms now add up for "off" features. If neurons connect to more than $\sim d_{\mathrm{in}}$ features via $\hat{W}$, then the interference terms are order 1 and cause "misfires". This requires $p < d_{\mathrm{in}}/m$.
    • For large $m$, we cannot compute all pairwise ANDs, since $p$ would be too small for neurons to contain all pairs of features.
    • Instead, we have to choose $\sim d_{\mathrm{in}} d$ ANDs to compute. It's basically the same construction, but restricted to the features involved in the chosen ANDs.
  • Random Gaussian weights for inputs not in superposition:
    • A layer with random Gaussian weights would have neuron activations $z_i = \mathrm{ReLU}\left(\sum_\alpha W_{i\alpha} f_\alpha - b_i\right)$ for boolean feature values $f_\alpha$.
    • We need read-off weights $r_i$ that are correlated with $z_i$ when two features, $f_\alpha$ and $f_\beta$, are active, but uncorrelated otherwise (a Monte Carlo sketch of this read-off appears at the end of this section).
    • To read off $f_\alpha \wedge f_\beta$, we can use weights $r_i = \frac{W_{i\alpha} W_{i\beta}}{\mathbb{E}\left[\sum_i W_{i\alpha} W_{i\beta} z_i \,\middle|\, \alpha,\beta \text{ active}\right]}$, where the expectation is over inputs. Because of the expectation, the read-off weights would need to be learned.
      • If both features are active, then by design we'd expect $\mathbb{E}\left[\sum_i r_i z_i\right] = 1$.
      • If one or zero features are active, then $\mathbb{E}\left[\sum_i r_i z_i\right] = 0$, since either $W_{i\alpha}$ or $W_{i\beta}$ is uncorrelated with $z_i$.
      • In general, the noise in the numerator sum ($\sum_i z_i W_{i\alpha} W_{i\beta}$) should scale as $\sim \sqrt{\ell k}\,\sigma_W^3$ for $\ell$ active features and $k$ nonzero neuron activations $z_i$. The denominator goes as $k\sigma_W^3$, so the read-off noise is $\sim \sqrt{\ell/k}$.
      • If the bias is learned, then its value is likely determined by minimizing the noise term $\mathbb{E}\left[\,y - b_i \mid z_i = x + y - b_i > 0\,\right]$, where $x, y$ are random Gaussian variables with mean zero and variances $2\sigma_W^2$ and $(\ell - 2)\sigma_W^2$ that represent the sums over the two AND features and over the other active features, respectively.
    • It's possible to also read off $f_\alpha \wedge f_\beta$ using the weights $r_i = \frac{W_{i\alpha} + W_{i\beta}}{\mathbb{E}\left[\sum_i (W_{i\alpha} + W_{i\beta}) z_i \,\middle|\, \alpha,\beta \text{ active}\right]}$.
      • This approach has the advantage of a higher correlation between $r_i$ and $z_i$, but the disadvantage that the read-off is nonzero when only one feature is active, going something like $\mathbb{E}\left[\sum_i r_i z_i\right] \approx \frac{\mathbb{E}\left[\sum_i (W_{i\alpha})^2 z_i\right]}{\mathbb{E}\left[\sum_i (W_{i\alpha} + W_{i\beta})^2 z_i\right]} \approx \frac{k\sigma_W^2}{k(2\sigma_W)^2} \approx \frac{1}{4}$.
      • This approach is perhaps favorable when $d$ is not large and the noise from the first approach is large.
  • Stacking U-AND layers: It might be possible to stack multiple AND layers in a row to start computing more complex boolean functions. If interference is not carefully managed, then errors are likely to propagate.
  • ANDs of three or more features: The same type of construction can be used to compute ANDs up to order $k$ (e.g. $f_1 \wedge f_2 \wedge f_3 \wedge \dots \wedge f_k$) simultaneously (see the last sketch at the end of this section).
    • If a neuron's inputs contain three features, then its value is 2 if all three are "on" and 1 if only two are "on".
    • An AND of three features (say $f_1$, $f_2$, $f_3$) can be computed by adding and subtracting averages over neurons that have two or three of the features as inputs:
      • $\mathrm{AND}_{1,2,3} = \mathbb{E}_{1,2 \in S_i}[z_i] + \mathbb{E}_{1,3 \in S_i}[z_i] + \mathbb{E}_{2,3 \in S_i}[z_i] - \mathbb{E}_{1,2,3 \in S_i}[z_i]$.
    • Similar constructions can be made for higher-order ANDs.
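
The extensions above can also be illustrated numerically. First, the "inputs in superposition" case: a minimal sketch (sizes chosen for illustration) in which the boolean features are random unit directions in a $d_{\mathrm{in}}$-dimensional space, and the only change to the basic U-AND sketch is the extra map $F$ back into the feature basis.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_in, d, p, l = 1000, 1024, 4000, 0.05, 3   # illustrative sizes

# F[a, :] is the (random, unit-norm) direction of feature a in input space.
F = rng.standard_normal((m, d_in))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# The input x is the superposition (sum) of the l "on" feature vectors.
on = rng.choice(m, size=l, replace=False)
x = F[on].sum(axis=0)

# W = W_hat @ F: F maps x to noisy feature values (~1 if "on",
# ~1/sqrt(d_in) noise if "off"); W_hat is the usual random boolean matrix.
W_hat = (rng.random((d, m)) < p).astype(float)
f_noisy = F @ x
z = np.maximum(W_hat @ f_noisy - 1.0, 0.0)

# Read off the AND of the first two "on" features, as in the basic case.
mask = (W_hat[:, on[0]] == 1) & (W_hat[:, on[1]] == 1)
print("AND estimate (both on):", z[mask].mean())
```

Here each neuron connects to $pm \approx 50$ features, comfortably below $d_{\mathrm{in}}$, so the summed "off"-feature interference stays well below the ReLU threshold.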
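
Next, the random-Gaussian-weights read-off. The normalization in the read-off weights is an expectation over inputs, so this sketch estimates it by Monte Carlo over random sets of "other" active features (an assumption-laden illustration: a fixed bias $b$ and the $W_{i\alpha}W_{i\beta}$ read-off from the first variant above).

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, l, sigma_w, b = 200, 4000, 4, 1.0, 2.0   # illustrative sizes; fixed bias b

W = sigma_w * rng.standard_normal((d, m))

def activations(active):
    """z_i = ReLU(sum_a W_ia f_a - b) for a boolean input with `active` features on."""
    f = np.zeros(m)
    f[list(active)] = 1.0
    return np.maximum(W @ f - b, 0.0)

alpha, beta = 0, 1

# Monte Carlo estimate of E[sum_i W_ia W_ib z_i | alpha, beta active],
# averaging over random draws of the other l-2 active features.
norm = np.mean([
    np.sum(W[:, alpha] * W[:, beta]
           * activations([alpha, beta, *rng.choice(np.arange(2, m), l - 2, replace=False)]))
    for _ in range(500)
])
r = W[:, alpha] * W[:, beta] / norm   # read-off weights r_i

others = rng.choice(np.arange(2, m), size=l - 2, replace=False)
print("both active   :", r @ activations([alpha, beta, *others]))   # ~1
print("one active    :", r @ activations([alpha, *others]))         # ~0 + noise
print("neither active:", r @ activations(list(others)))             # ~0 + noise
```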
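
Finally, the third-order AND read-off via adding and subtracting averages. With only the three relevant features "on", the combination above returns roughly 1 when all three are on and roughly 0 when only two are (sizes again illustrative; $dp^3$ must be large enough that several neurons see all three features):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p = 200, 50000, 0.06   # need d*p^3 >> 1 neurons per feature triple
W = (rng.random((d, m)) < p).astype(float)

def z(f):
    return np.maximum(W @ f - 1.0, 0.0)

def avg_over(feats, zvals):
    """Average z_i over neurons whose inputs include every feature in `feats`."""
    mask = np.all(W[:, feats] == 1, axis=1)
    return zvals[mask].mean()

def and3(zvals, a, b, c):
    # E_{a,b}[z] + E_{a,c}[z] + E_{b,c}[z] - E_{a,b,c}[z]
    return (avg_over([a, b], zvals) + avg_over([a, c], zvals)
            + avg_over([b, c], zvals) - avg_over([a, b, c], zvals))

f = np.zeros(m)
f[[0, 1, 2]] = 1.0
print("all three on:", and3(z(f), 0, 1, 2))   # ~1, up to O(p) corrections

f[2] = 0.0
print("only two on :", and3(z(f), 0, 1, 2))   # ~0
```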