Computing ANDs in bilinear layers
Bilinear layers
Bilinear layers are an alternative to the standard MLP layer with some nonlinear activation function. For an input $x$, a bilinear MLP would be $(W x) \odot (V x)$, where $\odot$ represents element-wise multiplication and $W, V$ are weight matrices. Unlike other nonlinear layers, the bilinear layer can be written in terms of only linear operations and a third order tensor: the output is $\sum_{jk} B_{ijk}\, x_j x_k$ with $B_{ijk} = W_{ij} V_{ik}$.
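As a sanity check, here is a minimal numpy sketch (dimensions and variable names are illustrative) verifying that the element-wise product of two linear maps matches the contraction with the third order tensor $B_{ijk} = W_{ij} V_{ik}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Bilinear MLP: element-wise product of two linear maps.
out_bilinear = (W @ x) * (V @ x)

# Equivalent third-order tensor form: B_ijk = W_ij * V_ik,
# out_i = sum_jk B_ijk x_j x_k.
B = np.einsum("ij,ik->ijk", W, V)
out_tensor = np.einsum("ijk,j,k->i", B, x, x)

assert np.allclose(out_bilinear, out_tensor)
```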
Pairwise Feature Interactions
The advantage of the bilinear layer is that it is the simplest kind of nonlinearity, containing only pairwise interactions. This can make it easier to understand how input features might interact to create output features.
For example, if the input is composed of feature vectors $f_i$ with (sparse) activations $a_i$, such that $x = \sum_i a_i f_i$, then the output can be written as
$$(W x) \odot (V x) = \sum_{i,j} a_i a_j\, (W f_i) \odot (V f_j)$$
after grouping activations and features together.
If we expect output features to represent pairwise operations of input features (eg ANDs of input features), then it's natural to define the output activations and features as
$$a_{ij} \equiv a_i a_j, \qquad f_{ij} \equiv (W f_i) \odot (V f_j) + (W f_j) \odot (V f_i),$$
where $(ij)$ indexes over unordered pairs of input features, including twin pairs like $(00)$ (a twin pair keeps only the single term $f_{ii} = (W f_i) \odot (V f_i)$). The features are written in a form that is symmetric over $i \leftrightarrow j$, so the output is simply $\sum_{(ij)} a_{ij} f_{ij}$. A small sketch of this decomposition follows.
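The numpy sketch below, assuming random $W$, $V$, and input feature vectors (all names and dimensions are illustrative), checks that summing $a_{ij} f_{ij}$ over unordered pairs reproduces the bilinear output:

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(1)
d, n_feat = 32, 10

W = rng.normal(size=(d, d)) / np.sqrt(d)
V = rng.normal(size=(d, d)) / np.sqrt(d)
f = rng.normal(size=(n_feat, d)) / np.sqrt(d)        # input feature vectors f_i
a = rng.binomial(1, 0.3, size=n_feat).astype(float)  # sparse (here boolean) activations a_i

x = a @ f                                            # x = sum_i a_i f_i
out = (W @ x) * (V @ x)

def pair_feature(i, j):
    """f_ij = (W f_i)*(V f_j) + (W f_j)*(V f_i); twin pairs keep a single term."""
    if i == j:
        return (W @ f[i]) * (V @ f[i])
    return (W @ f[i]) * (V @ f[j]) + (W @ f[j]) * (V @ f[i])

# Output as a sum over unordered pairs (ij), with activations a_ij = a_i * a_j.
out_pairs = sum(a[i] * a[j] * pair_feature(i, j)
                for i, j in combinations_with_replacement(range(n_feat), 2))

assert np.allclose(out, out_pairs)
```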
Computing ANDs
If the input activations are sparse and boolean, then $a_{ij} = a_i a_j$ already computes the AND for the pair of input features $(i, j)$. The feature vector $f_{ij}$ then represents the AND in the output space. The original input features are also represented, by the twin-pair features $f_{ii}$ (since $a_{ii} = a_i^2 = a_i$ for boolean activations).
The main constraint is that the output features $f_{ij}$ form an overbasis, meaning that the dot product of any two distinct features is sufficiently small.
It's possible that random feature vectors (coming from random $W$ and $V$) are enough for an overbasis. For a $d$-dimensional space, the dot product of two (normalized) gaussian random vectors is a random value with mean zero and standard deviation $1/\sqrt{d}$.
To read off the value of the AND of two features, we can simply take the dot product of the MLP output with the feature vector for the AND we want to extract and normalize. This means dotting with $\hat{f}_{ij} = f_{ij}/\lVert f_{ij} \rVert$. For random gaussian feature vectors, this would give $a_{ij} \lVert f_{ij} \rVert$ (equal to $a_i a_j \lVert f_{ij} \rVert$) plus interference terms from the other active feature vectors. For $k$ input features active there are roughly $k^2/2$ active feature pairs, each contributing an interference term of relative size $\sim 1/\sqrt{d}$, so the interference will be of order $k/\sqrt{2d}$ after adding up (with random signs) over active feature pairs. For example, $k = 10$ and $d = 1000$ gives interference on the order of $0.2$. Reading off XORs can be done similarly, using the fact that $a_i \oplus a_j = a_i + a_j - 2\, a_i a_j$ for boolean activations.
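Here is a rough numpy sketch of this read-off, treating $W$, $V$, and the input features as random gaussian vectors; the dimensions, feature counts, and helper names (`pair_feature`, `read_off`) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_feat, k = 1000, 50, 5            # dims, total features, active features

W = rng.normal(size=(d, d)) / np.sqrt(d)
V = rng.normal(size=(d, d)) / np.sqrt(d)
f = rng.normal(size=(n_feat, d)) / np.sqrt(d)

a = np.zeros(n_feat)
a[rng.choice(n_feat, size=k, replace=False)] = 1.0   # sparse boolean activations

out = (W @ (a @ f)) * (V @ (a @ f))

def pair_feature(i, j):
    if i == j:
        return (W @ f[i]) * (V @ f[i])
    return (W @ f[i]) * (V @ f[j]) + (W @ f[j]) * (V @ f[i])

def read_off(i, j):
    """Estimate a_i AND a_j by projecting the output onto f_ij."""
    f_ij = pair_feature(i, j)
    return out @ f_ij / (f_ij @ f_ij)

active = np.flatnonzero(a)
inactive = np.flatnonzero(a == 0)
print("AND of two active features (target 1):", read_off(active[0], active[1]))
print("AND of active and inactive (target 0):", read_off(active[0], inactive[0]))
print("XOR of active and inactive (target 1):",
      read_off(active[0], active[0]) + read_off(inactive[0], inactive[0])
      - 2 * read_off(active[0], inactive[0]))
```

The printed values should sit within roughly $\pm k/\sqrt{2d}$ of their targets, matching the interference estimate above.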
Random $W$, $V$ might be enough to represent all pairwise feature ANDs. The limitation is that no two random feature vectors should have high interference. The appendix shows that the probability of two random vectors having a cosine similarity of more than $\epsilon$ is at most $2 e^{-d \epsilon^2 / 2}$, which is tiny when $\epsilon \sqrt{d} \gg 1$. For example, $\epsilon = 0.2$ and $d = 1000$ gives a probability of order $10^{-9}$.
Overall, the number of ANDs that can be stored before two feature vectors become too similar is exponential in $d$. Interference during read-offs is small as long as the number of active features, $k$, is small compared to $\sqrt{d}$.
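A quick Monte Carlo check of this near-orthogonality claim, using random unit vectors as stand-ins for the pair features (the counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_vecs = 1000, 2000

# Random unit vectors standing in for the pair features f_ij.
F = rng.normal(size=(n_vecs, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

cos = F @ F.T
np.fill_diagonal(cos, 0.0)
print("typical |cos|:", np.abs(cos).mean())   # ~ sqrt(2/pi)/sqrt(d) ~ 0.025
print("max |cos|    :", np.abs(cos).max())    # ~ sqrt(4*ln(n_vecs)/d) ~ 0.17
```

Even with 2000 vectors in 1000 dimensions, the worst-case overlap grows only logarithmically with the number of vectors, consistent with the exponential capacity estimate.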
In practice, features (and ANDs of features) will have correlations, so a model could learn to minimize the interference between features that are often active together. If feature activity is power-law distributed, there is potential to greatly reduce interference by carefully choosing the feature vectors of the most active features.
Non-boolean features
Even if the activations are sparse but not boolean, our construction of output features still appears useful. The output activations would be $a_{ij} = a_i a_j$, which can serve as a basis for building up some nonlinear function of the initial features.
The feature vectors that the model learns may also play a role. For example, if three pairs of features all have similar output feature vectors, say $f_{12} \approx f_{34} \approx f_{56}$, then the model may be effectively computing $a_1 a_2 + a_3 a_4 + a_5 a_6$ and representing it with the shared direction $f_{12}$.
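As a rough illustration (again with random $W$, $V$, and features, and indices chosen arbitrarily), a single read-off direction built from several pair features extracts a sum of pairwise products of non-boolean activations:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_feat = 1000, 20

W = rng.normal(size=(d, d)) / np.sqrt(d)
V = rng.normal(size=(d, d)) / np.sqrt(d)
f = rng.normal(size=(n_feat, d)) / np.sqrt(d)

def pair_feature(i, j):
    if i == j:
        return (W @ f[i]) * (V @ f[i])
    return (W @ f[i]) * (V @ f[j]) + (W @ f[j]) * (V @ f[i])

# One read-off direction summing three pair features extracts
# the quadratic function a1*a2 + a3*a4 + a5*a6 from the bilinear output.
pairs = [(1, 2), (3, 4), (5, 6)]
read_vec = sum(pair_feature(i, j) / np.linalg.norm(pair_feature(i, j)) ** 2
               for i, j in pairs)

a = np.zeros(n_feat)
a[[1, 2, 3, 4, 7]] = rng.uniform(0.5, 1.5, size=5)   # sparse, non-boolean activations
out = (W @ (a @ f)) * (V @ (a @ f))

print("target :", a[1] * a[2] + a[3] * a[4] + a[5] * a[6])
print("readoff:", out @ read_vec)
```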
Appendix
Probability of high interference: According to these lecture notes on high-dimensional data [link] (chapter 4), the dot product of a gaussian random vector $v$ (entries iid with unit variance) with a normalized random vector $\hat{w}$ is also gaussian distributed with variance 1. Therefore
$$P\left(\lvert v \cdot \hat{w} \rvert > t\right) \le 2 e^{-t^2/2}.$$
The norm of $v$ concentrates around $\sqrt{d}$ with small fractional deviations of order $1/\sqrt{d}$. Neglecting these deviations, the dot product of two normalized gaussian vectors is $\hat{v} \cdot \hat{w} \approx (v \cdot \hat{w})/\sqrt{d}$, which is gaussian with standard deviation $1/\sqrt{d}$. Using the equation above with $t = \epsilon \sqrt{d}$, for a cosine similarity threshold $\epsilon$ this gives
$$P\left(\lvert \hat{v} \cdot \hat{w} \rvert > \epsilon\right) \le 2 e^{-d \epsilon^2 / 2}.$$
For $d = 1000$, a high interference of $\epsilon = 0.2$ has probability at most $2 e^{-20} \approx 4 \times 10^{-9}$, and for $\epsilon = 0.3$ the probability is at most $2 e^{-45} \approx 6 \times 10^{-20}$.
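For reference, a short numeric check of these probabilities, printing the exact gaussian tail (via erfc) alongside the $2 e^{-d \epsilon^2 / 2}$ bound used above:

```python
import math

def interference_tail(eps, d):
    """P(|cos similarity| > eps) for two random gaussian vectors in d dims,
    modelling the cosine similarity as N(0, 1/d)."""
    t = eps * math.sqrt(d)
    exact = math.erfc(t / math.sqrt(2))   # P(|N(0,1)| > t)
    bound = 2 * math.exp(-t * t / 2)      # sub-gaussian tail bound
    return exact, bound

for eps in (0.1, 0.2, 0.3):
    exact, bound = interference_tail(eps, d=1000)
    print(f"eps={eps}: exact ~ {exact:.1e}, bound 2*exp(-d*eps^2/2) = {bound:.1e}")
```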