A general formula for a multilayer neural network is given by the recursive formula:

$$h_\ell = \phi(W_\ell h_{\ell-1} + b_\ell), \qquad \ell = 1, \ldots, L,$$

where $h_0 = x$ is the input, $W_\ell$ and $b_\ell$ are the weight matrix and bias vector of layer $\ell$, and the output is read out linearly as $f(x) = W_{L+1} h_L + b_{L+1}$.

We can also expand the above recursive definition and write the deep neural network in the following expanded form:

$$f(x) = W_{L+1}\,\phi\!\left(W_L\,\phi\!\left(\cdots\,\phi(W_1 x + b_1)\,\cdots\right) + b_L\right) + b_{L+1}.$$

The nonlinearity $\phi$ is applied elementwise to its vector argument.
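To make the recursion concrete, here is a minimal NumPy sketch of the forward pass it describes (the layer sizes, the helper name `mlp_forward` and the choice of ReLU for $\phi$ are my own, purely for illustration):

```python
import numpy as np

def relu(z):
    # ReLU nonlinearity, applied elementwise.
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases, phi=relu):
    # Recursion h_l = phi(W_l h_{l-1} + b_l), followed by a linear readout.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(W @ h + b)
    return weights[-1] @ h + biases[-1]

# A tiny random network: 3-dimensional input, two hidden layers of width 4,
# scalar output.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

print(mlp_forward(rng.standard_normal(3), weights, biases))
```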
The most commonly used activation function is the rectified linear unit (ReLU). It is given by the following formula:

$$\phi(x) = \mathrm{ReLU}(x) = \max(0, x)$$
And it has the following basic shape:
Other choices of activation function exist, and deep learning frameworks offer a whole host of them: tanh, sigmoid, the exponential linear unit (ELU), as well as variants of ReLU such as leaky ReLU.
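For concreteness, here is a minimal NumPy sketch of the activation functions mentioned above (the slope and scale parameters are just the commonly used defaults):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but with a small slope alpha for negative inputs.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Exponential linear unit: smooth for negative inputs.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
for name, f in [("relu", relu), ("leaky_relu", leaky_relu),
                ("elu", elu), ("tanh", np.tanh), ("sigmoid", sigmoid)]:
    print(name, np.round(f(z), 3))
```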
ReLUs have many advantages, such as very fast gradient computations. The reason I focus on the ReLU in this note is that its simplicity allows us to visualise the kind of functions neural networks of different depth and width can represent.
Approximation theory deals with characterising sets of functions that can be well approximated by functions from simpler function classes. When applied to neural network models, approximation theory is concerned with questions like: What kinds of functions can we represent with a particular neural network architecture? How rich/dense is that class of functions, i.e. can it be used to approximate arbitrary functions well?
What follows is an illustration of how to think about the first question in the context of ReLU networks. It is by no means a rigorous theoretical analysis; approximation theory of course goes a lot further than what I will cover in this note.
First, let's look at what functions we can represent with a single hidden layer, i.e.

$$f(x) = W_2\,\phi(W_1 x + b_1) + b_2,$$

where the input $x$ is a scalar, $W_1, b_1$ are the weights and biases of the hidden layer, $W_2, b_2$ those of the linear output layer, and $\phi(W_1 x + b_1)$ is the vector of hidden unit activations.

Below, I illustrate each component of this vector as a function of the input $x$:
Each unit, shown in a different colour, defines a 'basis function' of the same essential shape. The differences between the units are: the location of the kink where the unit turns on, the slope of the linear part, and finally whether the unit activates for positive or negative values (the direction of the slope).
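Here is a rough NumPy sketch of how such basis functions arise, using a random single-hidden-layer network (this is not the network behind the figure; the width, seed and input grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
width = 5                        # number of hidden units
W1 = rng.standard_normal(width)  # one input weight per hidden unit
b1 = rng.standard_normal(width)  # one bias per hidden unit

x = np.linspace(-3, 3, 601)

# Each column of H is one hidden unit's activation as a function of x:
# a 'hinge' basis function with a kink where W1 * x + b1 = 0.
H = np.maximum(0.0, np.outer(x, W1) + b1)

print("kink locations:", np.round(-b1 / W1, 2))
# To reproduce a figure like the one above, plot each column of H against x.
```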
The final network output is a linear combination of these basis functions, and thus looks like this:
We can see that when we combine ReLU basis functions linearly, we end up with a piecewise linear function. The boundaries between the linear segments in the output are dictated by the locations of the kinks of the individual ReLU units. I colour-coded the kinks in this function the same way as the different ReLU activations were plotted in the previous figure. For example, the brown unit only turns on for input values above the location of its kink.
Importantly, each ReLU unit we add contributes one kink to the function. If we measure the complexity of the functions we can implement in terms of the number of linear segments, we can say that for shallow networks:

$$\text{number of kinks} \approx \text{width of the network}$$
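A quick numerical sanity check of this relationship, using random single-hidden-layer networks and counting how often the pattern of active hidden units changes as we sweep the input over a grid (widths, seed and input range are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-10.0, 10.0, 20001)

for width in [1, 2, 5, 10, 20]:
    W1 = rng.standard_normal(width)
    b1 = rng.standard_normal(width)
    # Each hidden unit is active where W1 * x + b1 > 0; the network output is
    # linear wherever this on/off pattern is constant, so every change of the
    # pattern corresponds to one kink.
    active = (np.outer(x, W1) + b1) > 0
    kinks = int(np.sum(np.any(active[1:] != active[:-1], axis=1)))
    print(f"width {width:2d}: {kinks} kinks within [-10, 10]")
```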
So adding width is not the most efficient way of making our function class richer. Can stacking multiple layers do better? It turns out the answer is yes, and I will illustrate this with an artificially constructed network, called the 'sawtooth' network (Telgarsky, 2015).
Let's consider the following recursively defined function:

$$f_0(x) = x, \qquad f_n(x) = 1 - \bigl|\,2 f_{n-1}(x) - 1\,\bigr|.$$

Although this function is expressed in terms of the absolute value, it can clearly be implemented using just two ReLU activation functions, since for $z \in [0, 1]$

$$1 - |2z - 1| = 2\,\mathrm{ReLU}(z) - 4\,\mathrm{ReLU}\!\left(z - \tfrac{1}{2}\right).$$

So $f_n$ is realisable with a deep ReLU network with just 2 units in each of its $n$ hidden layers, followed by a linear output layer.
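A minimal NumPy check of this equivalence, computing $f_n$ both via the absolute-value recursion and via the stack of 2-unit ReLU layers (the function names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sawtooth_abs(x, n):
    # Absolute-value form of the recursion: f_0(x) = x, f_n = 1 - |2 f_{n-1} - 1|.
    f = x
    for _ in range(n):
        f = 1.0 - np.abs(2.0 * f - 1.0)
    return f

def sawtooth_relu(x, n):
    # The same function, as a deep network with two ReLU units per layer.
    f = x
    for _ in range(n):
        f = 2.0 * relu(f) - 4.0 * relu(f - 0.5)
    return f

x = np.linspace(0.0, 1.0, 1001)
for n in range(1, 5):
    assert np.allclose(sawtooth_abs(x, n), sawtooth_relu(x, n))
print("the two implementations agree on [0, 1]")
```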
Let's look at what functions this deep but narrow network implements as we increase depth:
This is just the identity, $f_0(x) = x$: before we apply any layers, this is what we are starting from, shown here for completeness:
You can check visually that the number of linear segments doubles each time a new layer is added. This is because each linear segment of $f_{n-1}$ sweeps across the whole range $[0, 1]$ and crosses the value $\tfrac{1}{2}$ exactly once, so applying the next layer folds every existing segment into two.
Therefore, for deep ReLU networks one can say that
$$\text{number of kinks} \approx 2^{\text{depth of the network}}$$
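We can also check the doubling numerically, counting the kinks of the sawtooth network as depth grows (a rough sketch; the grid resolution and the sign-change counting trick are my own choices):

```python
import numpy as np

def sawtooth(x, depth):
    # Deep ReLU 'sawtooth' network with two ReLU units per layer.
    f = x
    for _ in range(depth):
        f = 2.0 * np.maximum(0.0, f) - 4.0 * np.maximum(0.0, f - 0.5)
    return f

x = np.linspace(0.0, 1.0, 100001)
for depth in range(1, 8):
    slopes = np.diff(sawtooth(x, depth)) / np.diff(x)
    # For the sawtooth the slope flips sign at every kink, so we can count
    # kinks as sign changes of the numerical slope; tiny slopes from grid
    # intervals that straddle a kink are filtered out first.
    s = slopes[np.abs(slopes) > 1e-3]
    kinks = int(np.sum(np.sign(s[1:]) != np.sign(s[:-1])))
    print(f"depth {depth}: {kinks} kinks, {kinks + 1} segments (2^depth = {2**depth})")
```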
In summary, increasing depth is a more efficient way of increasing the complexity of the functions the network can theoretically represent than adding width.
In higher dimensions, things are a little bit more complicated: each ReLU unit turns on along a hyperplane in input space, rather than at a single kink.
Hanin and Rolnick (2019) made beautiful illustrations of the piecewise linear functions over 2-dimensional inputs, which I used in my slides:
In this illustration, each coloured region corresponds to a linear region of the function. Within each of these regions the function is linear, but the slope (gradient) is different in each. Although the colouring scheme might suggest otherwise, these linear pieces fit together seamlessly, so that the network defines a continuous function.
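Although I can't reproduce their figures here, the following rough NumPy sketch illustrates the underlying idea: take a random ReLU network on 2D inputs and count how many distinct linear regions show up on a grid, by recording which units are on or off at each grid point (layer sizes, initialization scale and grid resolution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# A random ReLU network with 2D input, two hidden layers and scalar output.
sizes = [2, 16, 16, 1]
Ws = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) * 0.1 for m in sizes[1:]]

# Evaluate on a grid over [-1, 1]^2, recording which ReLU units are active at
# each point; each distinct on/off pattern corresponds to one linear region
# of input space (up to the resolution of the grid).
g = np.linspace(-1.0, 1.0, 300)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

h, patterns = X, []
for W, b in zip(Ws[:-1], bs[:-1]):
    pre = h @ W.T + b
    patterns.append(pre > 0)
    h = np.maximum(0.0, pre)

codes = np.concatenate(patterns, axis=1)
print(len(np.unique(codes, axis=0)), "linear regions visible on the grid")
```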
The conclusion from this is that deep networks, especially those that have a large number of units in each layer, can represent very complex functions, and they define a very large model class.
This has led many people to be very skeptical about using them, because classical learning theory and the theory of generalisation suggested that the more complex your model class, the more your method will overfit. By these arguments, the complexity of large deep networks is far too high, so they shouldn't work. But they do work in practice, and this apparent contradiction has led people to rethink how we approach the theory of generalization.
Hanin and Rolnick (2019) argue that although the 'worst-case' complexity of deep ReLU networks grows exponentially with depth, as illustrated by constructions like the sawtooth network, in typical randomly initialized networks the number of linear regions grows only linearly with the number of units. So while deep nets have the capacity to model very complex piecewise linear functions, they tend not to use all of this capacity in practice.
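To get a feel for this, here is a rough sketch (not their experiment) that counts the linear segments of randomly initialized 1D ReLU networks of width 2, the same width as the sawtooth network, and compares them to the exponential worst case; the initialization scale, grid and seed are arbitrary choices:

```python
import numpy as np

def count_segments_1d(x, Ws, bs):
    # Count the linear segments of a 1D ReLU net by tracking how often the
    # on/off pattern of its units changes as x sweeps across the grid.
    h, patterns = x[:, None], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre = h @ W.T + b
        patterns.append(pre > 0)
        h = np.maximum(0.0, pre)
    codes = np.concatenate(patterns, axis=1)
    return 1 + int(np.sum(np.any(codes[1:] != codes[:-1], axis=1)))

rng = np.random.default_rng(4)
x = np.linspace(-3.0, 3.0, 50001)
width = 2
for depth in [1, 2, 4, 8, 16]:
    sizes = [1] + [width] * depth + [1]
    Ws = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [rng.standard_normal(m) * 0.5 for m in sizes[1:]]
    print(f"depth {depth:2d}: {count_segments_1d(x, Ws, bs):3d} segments "
          f"(worst case 2^{depth} = {2 ** depth})")
```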
A related observation was made by Pérez et al. (2019), who argue that deep learning generalizes well despite the theoretically very complex model class because the parameter-function mapping is such that it favours simpler functions. What this means is that, although deep networks can in theory implement very complex functions, for most values of the parameters the function they actually implement tends to be rather simple. The arguments are nicely summarized in this conversation-style poster presentation.
Finally, I encourage interested students to read up on deep linear models. For example, Arora et al. (2019) consider the problem of matrix completion with deep linear models. As discussed above, deep linear models (deep nets without the nonlinearity $\phi$) can only represent linear functions, so adding depth does not make the model class any richer; what makes them interesting is how depth changes the behaviour of gradient-based learning.
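As a tiny illustration of this point (arbitrary sizes, and not the matrix-completion setup of Arora et al.), a deep linear network collapses to a single linear map:

```python
import numpy as np

rng = np.random.default_rng(5)

# A 'deep linear' network: three weight matrices with no nonlinearity between them.
W1 = rng.standard_normal((8, 5))
W2 = rng.standard_normal((8, 8))
W3 = rng.standard_normal((4, 8))

x = rng.standard_normal(5)
deep_output = W3 @ (W2 @ (W1 @ x))

# The same function is computed by the single matrix W = W3 W2 W1, so depth
# adds no expressive power here ...
W = W3 @ W2 @ W1
print(np.allclose(deep_output, W @ x))  # True
# ... although, as Arora et al. discuss, gradient descent behaves differently
# when the factors W1, W2, W3 are trained instead of W directly.
```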