
Computing ANDs in bilinear layers

Bilinear layers

Bilinear layers are an alternative to the standard MLP layer with some nonlinear activation function. For an input $x$, a bilinear MLP would be
$$\text{MLP}_{\text{bilinear}}(x)_i = [(Wx) \odot (W'x)]_i = \sum_{jk} W_{ij} W'_{ik}\, x_j x_k$$
where $\odot$ represents element-wise multiplication and $W, W'$ are weight matrices. Unlike other nonlinear layers, the bilinear layer can be written in terms of only linear operations and a third-order tensor, $W_{ij} W'_{ik}$.
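
As a concrete illustration, here is a minimal numpy sketch of a bilinear layer; the dimensions and the name `W_prime` (standing in for $W'$) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 256  # arbitrary sizes for illustration

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # W' in the text
x = rng.normal(size=d_in)

# Bilinear MLP: element-wise product of two linear maps.
out = (W @ x) * (W_prime @ x)

# Equivalent form using the third-order tensor W_ij W'_ik.
T = np.einsum('ij,ik->ijk', W, W_prime)
out_tensor = np.einsum('ijk,j,k->i', T, x, x)
assert np.allclose(out, out_tensor)
```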

Pairwise Feature Interactions

The advantage of the bilinear layer is that it's the simplest type of nonlinearity, only having pairwise interactions. This can make it simpler to understand how input features might interact to create output features.

For example, if the input is composed of feature vectors $f_\alpha$ with (sparse) activations $a_\alpha$, such that $x_i = \sum_\alpha a_\alpha f_{\alpha i}$, then the output can be written as
$$\text{MLP}_{\text{bilinear}}(x)_i = \sum_{\alpha\beta} a_\alpha a_\beta \left(\sum_{jk} W_{ij} W'_{ik}\, f_{\alpha j} f_{\beta k}\right)$$
after grouping activations and features together.

If we expect output features to represent pairwise operations of input features (eg ANDs of input features), then it's natural to define the output activation and features as:
$$a^O_{(\alpha\beta)} \equiv a_\alpha a_\beta \qquad f^O_{(\alpha\beta)i} \equiv \sum_{jk} W_{ij} W'_{ik}\left(f_{\alpha j} f_{\beta k} + f_{\beta j} f_{\alpha k}\right) \qquad \text{MLP}_{\text{bilinear}}(x)_i = \sum_{(\alpha\beta)} a^O_{(\alpha\beta)} f^O_{(\alpha\beta)i}$$
where $(\alpha\beta)$ indexes over pairs of input features, including twin pairs like $(00)$. The features are written in a form that is symmetric over $\alpha, \beta$.
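
Here is a small numpy sketch (random weights and features, sizes arbitrary) verifying that the bilinear output decomposes into these pairwise output features. One detail the sketch has to pick a convention for: the twin pairs $(\alpha\alpha)$ keep the unsymmetrized term so the diagonal is not double counted.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_feat = 32, 128, 6

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
f = rng.normal(size=(n_feat, d_in))                  # input feature vectors f_alpha
a = rng.integers(0, 2, size=n_feat).astype(float)    # sparse boolean activations a_alpha

x = a @ f                                            # x_i = sum_alpha a_alpha f_{alpha i}
out = (W @ x) * (W_prime @ x)

# B[alpha, beta, i] = sum_{jk} W_ij W'_ik f_{alpha j} f_{beta k}
B = np.einsum('ij,ik,aj,bk->abi', W, W_prime, f, f)

recon = np.zeros(d_out)
for alpha in range(n_feat):
    for beta in range(alpha, n_feat):
        if alpha == beta:
            f_O = B[alpha, alpha]                    # twin pair: counted once
        else:
            f_O = B[alpha, beta] + B[beta, alpha]    # symmetric over alpha, beta
        a_O = a[alpha] * a[beta]                     # output activation (AND for booleans)
        recon += a_O * f_O

assert np.allclose(out, recon)
```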

Computing ANDs

If the input activations $a_\alpha$ are sparse and boolean, then $a^O_{(\alpha\beta)} \equiv a_\alpha a_\beta$ already computes the AND for the pair of input features. The feature vector $f^O_{(\alpha\beta)}$ then represents the AND in the output space. The original input features are also represented by $f^O_{(\alpha\alpha)}$.

The main constraint is that the output features form an overcomplete basis, meaning that the dot product of any two features is sufficiently small.

It's possible that random feature vectors (coming from random $W$ and $W'$) are enough for an overcomplete basis. For a $d$-dimensional space, the dot product of two (normalized) gaussian random vectors is a random value with mean zero and standard deviation $1/\sqrt{d}$.

To read off the value of the AND of two features, we could simply dot product the MLP output with the feature vector for the AND we want to extract and normalize. This means dotting with $f^O_{(\alpha\beta)} / \|f^O_{(\alpha\beta)}\|^2$. For random gaussian feature vectors, this would give $a^O_{(\alpha\beta)}$ (equal to $a_\alpha a_\beta$) plus interference terms from the other active feature vectors. For $\ell$ input features active, the interference terms will be of order $\ell/\sqrt{d}$ after adding up $\pm 1/\sqrt{d}$ over $\ell^2$ active feature pairs. For $\ell = 5$ and $d = 1000$, the interference terms are on the order of $0.16$. Reading off XORs can be done similarly, using the fact that $a_\alpha \text{ XOR } a_\beta = a_\alpha + a_\beta - 2\, a_\alpha a_\beta$.
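
A sketch of this read-off with random gaussian weights and features (all sizes and helper names here, like `read_off`, are hypothetical choices for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_feat, n_active = 100, 1000, 40, 5     # ell = 5, d = 1000

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
f = rng.normal(size=(n_feat, d_in)) / np.sqrt(d_in)  # roughly unit-norm features

a = np.zeros(n_feat)
active = rng.choice(n_feat, size=n_active, replace=False)
a[active] = 1.0                                      # sparse boolean activations

x = a @ f
out = (W @ x) * (W_prime @ x)

def f_O(alpha, beta):
    # symmetrized output feature vector f^O_(alpha beta)
    return (W @ f[alpha]) * (W_prime @ f[beta]) + (W @ f[beta]) * (W_prime @ f[alpha])

def read_off(alpha, beta):
    v = f_O(alpha, beta)
    return out @ v / (v @ v)                         # dot with f^O / ||f^O||^2

i, j = active[0], active[1]
k = next(m for m in range(n_feat) if m not in active)
print(read_off(i, j))   # ~1 (the AND), up to interference of order ell/sqrt(d) ~ 0.16
print(read_off(i, k))   # ~0, since feature k is inactive
```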

Random $W, W'$ might be enough to represent all pairwise feature ANDs. The limitation is that no two random vectors should have high interference. The appendix shows that the probability of two random vectors having a cosine similarity of more than $\delta$ is
$$P\left[\frac{|x^T y|}{\|x\|\,\|y\|} \ge \delta\right] \approx \frac{2}{\sqrt{2\pi}\,\sqrt{d}\,\delta} \exp\left(-\frac{d\delta^2}{2}\right)$$
when $\delta \gg 1/\sqrt{d}$. For example, $d = 1000$ and $\delta = 0.3$ gives $P \approx 10^{-21}$.
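
As a quick numerical check (a sketch; `scipy.special.erfc` is the complementary error function, and the helper name is arbitrary):

```python
import numpy as np
from scipy.special import erfc

def p_high_interference(d, delta):
    """Asymptotic tail probability that two random unit vectors in d dimensions
    have |cosine similarity| >= delta (valid for delta >> 1/sqrt(d))."""
    return 2 / (np.sqrt(2 * np.pi * d) * delta) * np.exp(-d * delta**2 / 2)

d = 1000
for delta in (0.3, 0.5):
    print(delta, p_high_interference(d, delta), erfc(delta * np.sqrt(d / 2)))
# delta = 0.3: both ~2e-21;  delta = 0.5: both ~3e-56
```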

Overall, the number of ANDs that could be stored before two feature vectors become similar is exponential in $d$. Interference during read-offs is small as long as the number of active features, $\ell$, is small compared to $\sqrt{d}$.

In practice, features (and ANDs of features) will have correlations, so a model could learn to minimize the interference between features that are often active together. If feature activity is power-law distributed, there is potential to greatly reduce interference by carefully choosing the feature vectors of the most active features.

Non-boolean features

Even if the activities $a_\alpha$ are sparse but not boolean, our construction of output features still appears useful. The output activations would be $a^O_{(\alpha\beta)} = a_\alpha a_\beta$, which can serve as a basis for building up some nonlinear function of the initial features.

The feature vectors that the model learns may also play a role. For example, if three pairs of features all have similar output feature vectors, say $f^O$, then the model may be effectively computing $a_1 a_2 + a_3 a_4 + a_5 a_6$ and representing it with $f^O$.

Appendix

Probability of high interference: According to these lecture notes on high-dimensional data [link] (chapter 4), the dot product of a gaussian random vector $x$ (entries iid with $\sigma^2 = 1$) with a normalized random vector $y/\|y\|$ is also gaussian distributed with variance 1. Therefore
$$P\left[\frac{|x^T y|}{\|y\|} \ge \varepsilon\right] = \frac{2}{\sqrt{2\pi}} \int_\varepsilon^\infty e^{-t^2/2}\, dt = \text{erfc}\left(\frac{\varepsilon}{\sqrt{2}}\right).$$

The norm of $x$ concentrates around $\sqrt{d}$ with small fractional deviations of order $1/\sqrt{d}$. Neglecting these deviations, the dot product of two normalized gaussian vectors is then
$$P\left[\frac{|x^T y|}{\|x\|\,\|y\|} \ge \delta\right] \approx \text{erfc}\left(\delta\sqrt{\frac{d}{2}}\right)$$
using the equation above with $\delta \equiv \varepsilon/\sqrt{d}$. For $\delta \gg 1/\sqrt{d}$, this gives
$$P \approx \frac{2}{\sqrt{2\pi}\,\sqrt{d}\,\delta} \exp\left(-\frac{d\delta^2}{2}\right).$$

For $d = 1000$, a high interference of $\delta = 0.3$ has probability $\sim 10^{-21}$, and for $\delta = 0.5$ the probability is $\sim 10^{-56}$.
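
For moderate $\delta$, where the tail probability is large enough to sample, the erfc formula can be checked directly with a small Monte Carlo sketch (sample sizes chosen arbitrarily):

```python
import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(3)
d, n_trials, delta = 1000, 10_000, 0.05   # moderate delta so the event is observable

# |cosine similarity| of pairs of independent gaussian random vectors
x = rng.normal(size=(n_trials, d))
y = rng.normal(size=(n_trials, d))
cos = np.abs(np.sum(x * y, axis=1)) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))

print(np.mean(cos >= delta))             # empirical tail probability, ~0.11
print(erfc(delta * np.sqrt(d / 2)))      # predicted erfc(delta*sqrt(d/2)), ~0.11
```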