In previous notes I wrote about the approximation power of simple neural networks, such as multilayer perceptrons with the ReLU activation function. These networks represent piecewise linear functions, which in turn can approximate a broad class of continuous functions provided the number of linear pieces is large enough.
But piecewise linear functions feel like very clunky function approximators. Although they can approximate a smooth function, they can also, in theory, zig-zag between datapoints in a wild fashion, presumably leading to poor generalisation performance.
As neural networks are overparametrised, meaning they have more parameters than would be strictly needed to fit the training data, minimising the training loss is an underdetermined problem with multiple solutions. The different possible solutions may have wildly different properties. There is no guarantee that a minimum of the training loss will be a 'nice' solution that generalises to unseen test data.
Thankfully, even simple neural networks like ReLU MLPs have built-in inductive biases which make "nicer" solutions to learning problems more likely to be found. These biases, or tendencies towards nice solutions, are not fully understood; uncovering and characterising them is an active area of research. Different notions of 'nice' have been put forward, and researchers have produced more or less strong evidence for the presence of these kinds of inductive biases.
In this note I mention three ways to describe what a "nice" solution means.
As described, ReLU neural networks define piecewise linear functions, which are also called linear splines. In 1D learning problems, we can describe such a function in terms of the breakpoints or boundaries between linear segments. We can then study what happens to these breakpoints as the networks are confronted with training data, i.e. are trained using gradient descent.
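To make this concrete, here is a minimal NumPy sketch (with made-up weights and sizes of my choosing) of a shallow ReLU network on scalar inputs. Each hidden unit relu(w_i * x + b_i) contributes one breakpoint, located where its pre-activation crosses zero, i.e. at x = -b_i / w_i.

```python
# A minimal sketch of a shallow ReLU network on scalar inputs and its breakpoints.
# All sizes and weights here are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8

w = rng.normal(size=n_hidden)   # input-to-hidden weights
b = rng.normal(size=n_hidden)   # hidden biases
v = rng.normal(size=n_hidden)   # hidden-to-output weights

def f(x):
    """Evaluate the network on an array of scalar inputs x."""
    pre = np.outer(x, w) + b            # pre-activations, shape (n_points, n_hidden)
    return np.maximum(pre, 0.0) @ v     # piecewise linear function of x

# Each hidden unit's ReLU kinks where w_i * x + b_i = 0, i.e. at x = -b_i / w_i.
breakpoints = np.sort(-b / w)
print(breakpoints)
```

Between any two consecutive breakpoints the function is exactly linear, so the sorted list above fully describes where the linear segments meet.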
Several analyses have found that the behaviour of shallow ReLU networks depends critically on how their parameters are initialised. Williams et al. (2019) distinguish between the kernel and active regimes based on the scale of initialization at the different layers. They show that in the kernel regime wide ReLU networks tend to approximate cubic spline interpolants, while in the active regime a linear spline interpolant solution is found.
The key difference is in the training dynamics: the magnitude of the weights at the two layers of a shallow network determines how much the locations of the breakpoints move during training. Here is a nice video showing how the breakpoints change as training progresses.
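As a rough illustration of this effect (my own toy setup, not the paper's experiments), the sketch below trains the same shallow network twice with different output-layer initialisation scales and reports how far the breakpoints -b_i / w_i travel during training; the data, widths and hyperparameters are arbitrary choices.

```python
# Toy experiment: how far do the breakpoints of a shallow ReLU network move during
# training, depending on the scale of the output-layer initialisation?
# All hyperparameters below are arbitrary illustrative choices.
import torch

def breakpoint_movement(out_scale, steps=2000, lr=1e-2, n_hidden=32, seed=0):
    torch.manual_seed(seed)
    x = torch.linspace(-1, 1, 10).unsqueeze(1)   # 10 training inputs in 1D
    y = torch.sin(3 * x)                         # toy regression targets
    w = torch.randn(n_hidden, 1, requires_grad=True)              # input weights
    b = torch.randn(n_hidden, requires_grad=True)                 # hidden biases
    v = (out_scale * torch.randn(1, n_hidden)).requires_grad_()   # output weights
    initial_bp = (-b / w.squeeze()).detach().clone()
    opt = torch.optim.SGD([w, b, v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.relu(x @ w.T + b) @ v.T
        loss = ((pred - y) ** 2).mean()
        loss.backward()
        opt.step()
    final_bp = -b.detach() / w.detach().squeeze()
    return (final_bp - initial_bp).abs().median().item()

for scale in (1.0, 0.01):
    print(f"output scale {scale}: median breakpoint movement "
          f"{breakpoint_movement(scale):.3f}")
```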
Smoothness is a very meaningful and visually obvious form of 'niceness' that a function can exhibit. But when dealing with functions in higher dimensional spaces, or with functions that map between discrete objects (such as those in language modeling, where inputs and outputs might be character or token sequences), smoothness is not necessarily what we are interested in.
Consider a Boolean function with multiple binary inputs and a binary output. Can we say something about how nice such a function is? Take, for example, a function that returns True irrespective of its inputs, and another that returns True if its first argument is True and False otherwise. Intuitively, functions like these feel simple, while a function whose truth table shows no discernible pattern feels complex.
One way to capture this feeling that some functions are simple and others are complex is Kolmogorov complexity. Informally, the Kolmogorov complexity of a function is the length of the shortest program, written in a Turing-complete programming language (or run on a universal Turing machine), that implements the function in question. In most languages, the simple example functions above can be written in a single short line of code, whereas a patternless function can only be implemented by spelling out its entire truth table.
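To make this concrete with a toy Python illustration (this is only an intuition pump; actual Kolmogorov complexity is uncomputable and depends on the chosen language up to a constant):

```python
# The two simple example functions have one-line implementations, while a function
# with no apparent structure is, in effect, its own truth table. The specific table
# below is a hypothetical, arbitrarily chosen one over three inputs.
import itertools

def always_true(*bits):        # True irrespective of its inputs
    return True

def first_argument(*bits):     # True if its first argument is True, False otherwise
    return bool(bits[0])

TRUTH_TABLE = {
    (0, 0, 0): True,  (0, 0, 1): True,  (0, 1, 0): False, (0, 1, 1): True,
    (1, 0, 0): False, (1, 0, 1): False, (1, 1, 0): True,  (1, 1, 1): False,
}

def patternless(*bits):        # defined only by an explicit lookup
    return TRUTH_TABLE[bits]

for x in itertools.product([0, 1], repeat=3):
    print(x, always_true(*x), first_argument(*x), patternless(*x))
```

The description length of the last function grows with the size of its truth table (2^n entries for n inputs), while the first two stay constant no matter how many inputs we add.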
In a seminal paper, Louis et al. (2018) showed that randomly initialized neural networks tend to implement algorithmically simple Boolean functions. Furthermore, neural networks have an easier time fitting algorithmically simpler datasets, and are biased towards favouring simpler explanations of the data when the learning problem is underdetermined.
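A rough empirical sketch of this kind of experiment (my own simplified setup, not the authors' exact protocol) is given below: sample random shallow ReLU networks on binary inputs, threshold their output to obtain Boolean functions, and use the zlib-compressed length of each truth table as a crude stand-in for algorithmic complexity, comparing against uniformly random truth tables.

```python
# Crude simplicity-bias experiment: are the Boolean functions implemented by randomly
# initialised shallow ReLU networks more compressible than random Boolean functions?
# Sizes, widths and the thresholding construction are my own arbitrary choices.
import itertools
import zlib
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_hidden, n_nets = 10, 40, 100

# Enumerate all 2**n_bits binary inputs once.
inputs = np.array(list(itertools.product([0, 1], repeat=n_bits)), dtype=float)

def random_net_truth_table():
    """Truth table of a randomly initialised shallow ReLU net, thresholded at 0."""
    W1 = rng.normal(size=(n_bits, n_hidden))
    b1 = rng.normal(size=n_hidden)
    W2 = rng.normal(size=n_hidden)
    h = np.maximum(inputs @ W1 + b1, 0.0)
    return (h @ W2 > 0).astype(np.uint8)

def compressed_length(table):
    """zlib-compressed length of the bit-packed truth table, in bytes."""
    return len(zlib.compress(np.packbits(table).tobytes(), 9))

net_lengths = [compressed_length(random_net_truth_table()) for _ in range(n_nets)]
rand_lengths = [compressed_length(rng.integers(0, 2, size=2**n_bits, dtype=np.uint8))
                for _ in range(n_nets)]

print("mean compressed length, random nets:     ", np.mean(net_lengths))
print("mean compressed length, random functions:", np.mean(rand_lengths))
```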
Later, Goldblum et al. (2023) discovered a similar bias in Transformers used for language modeling: a preference for lower-complexity strings that are easier to compress.
Another interesting bias is towards low-rank representations, as illustrated by Huh et al. (2023). A low-rank representation is one where the neural network transforms the input data into an intermediate representation (the activations of a middle layer) that spans a lower-dimensional subspace than the original dimensionality of the data. As this work empirically shows, the effect is stronger in deeper networks, which may explain why deeper networks struggle to fit even linear functions when the rank of the linear map is sufficiently high. A similar bias towards low rank has been observed in the context of matrix factorization (Arora et al., 2019).
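As a small numerical illustration of the depth dependence (my own toy setup, using purely linear layers, which is closest in spirit to the matrix factorization case), the sketch below pushes Gaussian data through increasingly deep stacks of random linear layers and tracks an entropy-based effective rank of the representation.

```python
# Toy illustration: effective rank of intermediate representations as depth grows.
# The data, layer widths and the entropy-based rank measure are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 1000, 64

def effective_rank(H):
    """Exponential of the entropy of the normalised singular value spectrum."""
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

X = rng.normal(size=(n_samples, dim))
H = X.copy()
print(f"depth 0: effective rank {effective_rank(H):.1f}")
for depth in range(1, 9):
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)   # roughly variance-preserving
    H = H @ W                                        # one more random linear layer
    print(f"depth {depth}: effective rank {effective_rank(H):.1f}")
```

The dimensionality of the representation never changes, but as more layers are composed the singular value spectrum tends to concentrate in a few directions, so the effective rank drops.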
A low-rank representation can often be useful. When dealing with very high dimensional data, it used to be common practice in machine learning to map the data to a lower-dimensional representation using various dimensionality reduction techniques before feeding it into a learning algorithm. It may be that deep neural networks have a built-in dimensionality reduction tendency.