In previous notes I wrote about the approximation power of simple neural networks, such as multilayer perceptrons with the ReLU activation function. These networks can represent piecewise linear functions with a large number of linear pieces, which in turn can approximate a broad class of continuous functions, provided the number of pieces is large enough.
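To make this concrete, here is a minimal numpy sketch (my own illustration, not taken from the earlier notes) of a one-hidden-layer ReLU network with scalar input: each hidden unit contributes a single hinge, so the overall function is piecewise linear with at most one kink per unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU network with scalar input:
#   f(x) = sum_i v_i * relu(w_i * x + b_i) + c
# Each hidden unit contributes one "hinge" at x = -b_i / w_i,
# so f is piecewise linear with at most `width` kinks.
width = 8
w = rng.normal(size=width)   # input-to-hidden weights
b = rng.normal(size=width)   # hidden biases
v = rng.normal(size=width)   # hidden-to-output weights
c = rng.normal()             # output bias

def f(x):
    pre = np.outer(x, w) + b                # pre-activations, shape (len(x), width)
    return np.maximum(pre, 0.0) @ v + c     # ReLU, then linear readout

# Between kinks the finite-difference slope is constant; grid intervals that
# straddle a kink contribute a few extra intermediate values.
xs = np.linspace(-3.0, 3.0, 2001)
slopes = np.diff(f(xs)) / np.diff(xs)
print("kink locations:", np.sort(-b / w))
print("approximate number of distinct slopes:", np.unique(np.round(slopes, 4)).size)
```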
But piecewise linear functions feel like very clunky function approximators. Although they can approximate a smooth function, they can also, in theory, zig-zag wildly between datapoints, presumably leading to very poor generalisation performance.
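As a toy illustration (constructed here for this note, not taken from any trained network), the sketch below builds two piecewise linear functions through the same training points: one simply connects them, the other adds extra kinks in between and swings far away from the data while still fitting it exactly.

```python
import numpy as np

# Five training points, and two piecewise linear interpolants of them.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 0.5, 1.5, 1.0])

def mild(x):
    # Straight segments between consecutive datapoints.
    return np.interp(x, x_train, y_train)

def wild(x):
    # Extra knots halfway between consecutive datapoints, displaced by +/-3,
    # so the interpolant zig-zags yet still hits every training point.
    x_mid = (x_train[:-1] + x_train[1:]) / 2
    y_mid = (y_train[:-1] + y_train[1:]) / 2 + 3.0 * np.array([1.0, -1.0, 1.0, -1.0])
    order = np.argsort(np.concatenate([x_train, x_mid]))
    x_knots = np.concatenate([x_train, x_mid])[order]
    y_knots = np.concatenate([y_train, y_mid])[order]
    return np.interp(x, x_knots, y_knots)

# Both functions fit the training data exactly...
print(np.allclose(mild(x_train), y_train), np.allclose(wild(x_train), y_train))
# ...but disagree badly in between the datapoints.
print(mild(0.5), wild(0.5))
```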
Because neural networks are overparametrised, meaning they have more parameters than are strictly needed to fit the training data, minimising the training loss is an underdetermined problem with multiple solutions. These solutions may have wildly different properties, and there is no guarantee that a given minimum of the training loss will be a 'nice' one that generalises to unseen test data.
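To make the underdetermination concrete, here is a minimal sketch using an overparametrised linear model rather than a full neural network (a simplification for illustration): two weight vectors achieve zero training loss on the same data, yet predict differently on a new input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy underdetermined problem: 10 parameters, only 4 training points.
# Many weight vectors fit the data exactly, but they disagree elsewhere.
n_train, n_params = 4, 10
X = rng.normal(size=(n_train, n_params))
y = rng.normal(size=n_train)

# Solution 1: the minimum-norm interpolant (via the pseudo-inverse).
w_min_norm = np.linalg.pinv(X) @ y

# Solution 2: the same interpolant plus a direction from the null space of X,
# which leaves the training predictions unchanged.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]            # X @ null_direction is (numerically) zero
w_other = w_min_norm + 5.0 * null_direction

x_test = rng.normal(size=n_params)
print("max train residuals:",
      np.abs(X @ w_min_norm - y).max(), np.abs(X @ w_other - y).max())
print("test predictions:", x_test @ w_min_norm, x_test @ w_other)
```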
Thankfully, even simple neural networks like ReLU MLPs have built-in inductive biases that make 'nicer' solutions to learning problems more likely to be found. These biases, or tendencies towards nice solutions, are not fully understood; uncovering and characterising them is an active area of research. Different notions of 'nice' have been put forward, and researchers have produced more or less strong evidence for the presence of these kinds of inductive biases.
In this note I mention three ways to describe what a "nice" solution means.