
Computing ANDs in bilinear layers

Bilinear layers

Bilinear layers are an alternative to the standard MLP layer with some nonlinear activation function. For an input $x$, a bilinear MLP would be
$$\text{MLP}_{\text{bilinear}}(x)_i = [(Wx) \odot (W'x)]_i = \sum_{jk} W_{ij} W'_{ik}\, x_j x_k$$
where $\odot$ represents element-wise multiplication and $W, W'$ are weight matrices. Unlike other nonlinear layers, the bilinear layer can be written in terms of only linear operations and a third-order tensor, $W_{ij} W'_{ik}$.
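
As a concrete illustration, here is a minimal numpy sketch of a bilinear layer; the dimensions and the name `W_prime` (standing in for $W'$) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 256  # arbitrary sizes for illustration

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # W' in the text
x = rng.normal(size=d_in)

# Bilinear MLP: element-wise product of two linear maps.
out = (W @ x) * (W_prime @ x)

# Equivalent form using the third-order tensor W_ij W'_ik.
T = np.einsum('ij,ik->ijk', W, W_prime)
out_tensor = np.einsum('ijk,j,k->i', T, x, x)
assert np.allclose(out, out_tensor)
```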

Pairwise Feature Interactions

The advantage of the bilinear layer is that it's the simplest type of nonlinearity, only having pairwise interactions. This can make it simpler to understand how input features might interact to create output features.

For example, if the input is composed of feature vectors $f_\alpha$ with (sparse) activations $a_\alpha$, such that $x_i = \sum_\alpha a_\alpha f_{\alpha i}$, then the output can be written as
$$\text{MLP}_{\text{bilinear}}(x)_i = \sum_{\alpha\beta} a_\alpha a_\beta \left(\sum_{jk} W_{ij} W'_{ik}\, f_{\alpha j} f_{\beta k}\right)$$
after grouping activations and features together.

If we expect output features to represent pairwise operations of input features (eg ANDs of input features), then it's natural to define the output activation and features as:
$$a^O_{(\alpha\beta)} \equiv a_\alpha a_\beta \qquad f^O_{(\alpha\beta)i} \equiv \sum_{jk} W_{ij} W'_{ik}\left(f_{\alpha j} f_{\beta k} + f_{\beta j} f_{\alpha k}\right) \qquad \text{MLP}_{\text{bilinear}}(x)_i = \sum_{(\alpha\beta)} a^O_{(\alpha\beta)} f^O_{(\alpha\beta)i}$$
where $(\alpha\beta)$ indexes over pairs of input features, including twin pairs like $(00)$. The features are written in a form that is symmetric over $\alpha, \beta$.
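
Here is a small numpy sketch (random weights and features, sizes arbitrary) verifying that the bilinear output decomposes into these pairwise output features. One detail the sketch has to pick a convention for: the twin pairs $(\alpha\alpha)$ keep the unsymmetrized term so the diagonal is not double counted.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_feat = 32, 128, 6

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
f = rng.normal(size=(n_feat, d_in))                  # input feature vectors f_alpha
a = rng.integers(0, 2, size=n_feat).astype(float)    # sparse boolean activations a_alpha

x = a @ f                                            # x_i = sum_alpha a_alpha f_{alpha i}
out = (W @ x) * (W_prime @ x)

# B[alpha, beta, i] = sum_{jk} W_ij W'_ik f_{alpha j} f_{beta k}
B = np.einsum('ij,ik,aj,bk->abi', W, W_prime, f, f)

recon = np.zeros(d_out)
for alpha in range(n_feat):
    for beta in range(alpha, n_feat):
        if alpha == beta:
            f_O = B[alpha, alpha]                    # twin pair: counted once
        else:
            f_O = B[alpha, beta] + B[beta, alpha]    # symmetric over alpha, beta
        a_O = a[alpha] * a[beta]                     # output activation (AND for booleans)
        recon += a_O * f_O

assert np.allclose(out, recon)
```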

Computing ANDs

If the input activations $a_\alpha$ are sparse and boolean, then $a^O_{(\alpha\beta)} \equiv a_\alpha a_\beta$ already computes the AND for the pair of input features. The feature vector $f^O_{(\alpha\beta)}$ then represents the AND in the output space. The original input features are also represented by $f^O_{(\alpha\alpha)}$.

The main constraint is that the output features form an overcomplete basis, meaning that the dot product of any two features is sufficiently small.

It's possible that random feature vectors (coming from random $W$ and $W'$) are enough for an overcomplete basis. For a $d$-dimensional space, the dot product of two (normalized) gaussian random vectors is a random value with mean zero and standard deviation $1/\sqrt{d}$.

To read off the value of the AND of two features, we could simply dot product the MLP output with the feature vector for the AND we want to extract and normalize. This means dotting with $f^O_{(\alpha\beta)} / \|f^O_{(\alpha\beta)}\|^2$. For random gaussian feature vectors, this would give $a^O_{(\alpha\beta)}$ (equal to $a_\alpha a_\beta$) plus interference terms from the other active feature vectors. For $\ell$ input features active, the interference terms will be of order $\ell/\sqrt{d}$ after adding up $\pm 1/\sqrt{d}$ over $\ell^2$ active feature pairs. For $\ell = 5$ and $d = 1000$, the interference terms are on the order of $0.16$. Reading off XORs can be done similarly, using the fact that $a_\alpha \text{ XOR } a_\beta = a_\alpha + a_\beta - 2\, a_\alpha a_\beta$.
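
A sketch of this read-off with random gaussian weights and features (all sizes and helper names here, like `read_off`, are hypothetical choices for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_feat, n_active = 100, 1000, 40, 5     # ell = 5, d = 1000

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
W_prime = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
f = rng.normal(size=(n_feat, d_in)) / np.sqrt(d_in)  # roughly unit-norm features

a = np.zeros(n_feat)
active = rng.choice(n_feat, size=n_active, replace=False)
a[active] = 1.0                                      # sparse boolean activations

x = a @ f
out = (W @ x) * (W_prime @ x)

def f_O(alpha, beta):
    # symmetrized output feature vector f^O_(alpha beta)
    return (W @ f[alpha]) * (W_prime @ f[beta]) + (W @ f[beta]) * (W_prime @ f[alpha])

def read_off(alpha, beta):
    v = f_O(alpha, beta)
    return out @ v / (v @ v)                         # dot with f^O / ||f^O||^2

i, j = active[0], active[1]
k = next(m for m in range(n_feat) if m not in active)
print(read_off(i, j))   # ~1 (the AND), up to interference of order ell/sqrt(d) ~ 0.16
print(read_off(i, k))   # ~0, since feature k is inactive
```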

Random $W, W'$ might be enough to represent all pairwise feature ANDs. The limitation is that no two random vectors should have high interference. The appendix shows that the probability of two random vectors having a cosine similarity of more than $\delta$ is
$$P\left[\frac{|x^T y|}{\|x\|\,\|y\|} \ge \delta\right] \approx \frac{2}{\sqrt{2\pi}\,\sqrt{d}\,\delta} \exp\left(-\frac{d\delta^2}{2}\right)$$
when $\delta \gg 1/\sqrt{d}$. For example, $d = 1000$ and $\delta = 0.3$ gives $P \approx 10^{-21}$.
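
As a quick numerical check (a sketch; `scipy.special.erfc` is the complementary error function, and the helper name is arbitrary):

```python
import numpy as np
from scipy.special import erfc

def p_high_interference(d, delta):
    """Asymptotic tail probability that two random unit vectors in d dimensions
    have |cosine similarity| >= delta (valid for delta >> 1/sqrt(d))."""
    return 2 / (np.sqrt(2 * np.pi * d) * delta) * np.exp(-d * delta**2 / 2)

d = 1000
for delta in (0.3, 0.5):
    print(delta, p_high_interference(d, delta), erfc(delta * np.sqrt(d / 2)))
# delta = 0.3: both ~2e-21;  delta = 0.5: both ~3e-56
```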

Overall, the number of ANDs that could be stored before two feature vectors become similar is exponential in $d$. Interference during read-offs is small as long as the number of active features, $\ell$, is small compared to $\sqrt{d}$.

In practice, features (and ANDs of features) will have correlations, so a model could learn to minimize the interference between features that are often active together. If feature activity is power-law distributed, there is potential to greatly reduce interference by carefully choosing the feature vectors of the most active features.

Non-boolean features

Even if the activities $a_\alpha$ are sparse but not boolean, our construction of output features still appears useful. The output activations would be $a^O_{(\alpha\beta)} = a_\alpha a_\beta$, which can serve as a basis for building up some nonlinear function of the initial features.

The feature vectors that the model learns may also play a role. For example, if three pairs of features all have similar output feature vectors, say $f^O$, then the model may be effectively computing $a_1 a_2 + a_3 a_4 + a_5 a_6$ and representing it with $f^O$.

Appendix

Probability of high interference: According to these lecture notes on high-dimensional data [link] (chapter 4), the dot product of a gaussian random vector $x$ (entries iid with $\sigma^2 = 1$) with a normalized random vector $y/\|y\|$ is also gaussian distributed with variance 1. Therefore
$$P\left[\frac{|x^T y|}{\|y\|} \ge \varepsilon\right] = \frac{2}{\sqrt{2\pi}} \int_\varepsilon^\infty e^{-t^2/2}\, dt = \text{erfc}\left(\frac{\varepsilon}{\sqrt{2}}\right).$$

The norm of $x$ concentrates around $\sqrt{d}$ with small fractional deviations of order $1/\sqrt{d}$. Neglecting these deviations, the dot product of two normalized gaussian vectors is then
$$P\left[\frac{|x^T y|}{\|x\|\,\|y\|} \ge \delta\right] \approx \text{erfc}\left(\delta\sqrt{\frac{d}{2}}\right)$$
using the equation above with $\delta \equiv \varepsilon/\sqrt{d}$. For $\delta \gg 1/\sqrt{d}$, this gives
$$P \approx \frac{2}{\sqrt{2\pi}\,\sqrt{d}\,\delta} \exp\left(-\frac{d\delta^2}{2}\right).$$

For $d = 1000$, a high interference of $\delta = 0.3$ has probability $\sim 10^{-21}$, and for $\delta = 0.5$ the probability is $\sim 10^{-56}$.
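
For moderate $\delta$, where the tail probability is large enough to sample, the erfc formula can be checked directly with a small Monte Carlo sketch (sample sizes chosen arbitrarily):

```python
import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(3)
d, n_trials, delta = 1000, 10_000, 0.05   # moderate delta so the event is observable

# |cosine similarity| of pairs of independent gaussian random vectors
x = rng.normal(size=(n_trials, d))
y = rng.normal(size=(n_trials, d))
cos = np.abs(np.sum(x * y, axis=1)) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))

print(np.mean(cos >= delta))             # empirical tail probability, ~0.11
print(erfc(delta * np.sqrt(d / 2)))      # predicted erfc(delta*sqrt(d/2)), ~0.11
```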