---
title: DeepNN Notes on Approximation with ReLU Networks
tags: DeepNN, Teaching, Lecture Notes
description: Lecture notes on representing functions with deep feed-forward networks of rectified linear units
---

# Approximation with ReLU networks

## Background: basic multilayer perceptron

A general formula for a multilayer neural network is given by the recursive formula:

\begin{align}
f_l(x) &= \phi(W_l f_{l-1}(x) + b_l)\\
f_0(x) &= x,
\end{align}

where $l$ runs from $1$ to $L$, and $L$ is the number of layers. Here, I assume $x$ is a vector, the weights $W_l$ are matrices and the biases $b_l$ are vectors. In general, $x$ can be arranged as a tensor, in which case the weight matrices and biases become tensors, too. $\phi: \mathbb{R}\rightarrow\mathbb{R}$ is a non-linear function which is assumed to act on each component of a vector (or tensor) independently.

We can also expand the above recursive definition and write the deep neural network in the following expanded form:

$$
\small
f_L(x) = \phi\left(b_L + W_L \phi\left(b_{L-1} + W_{L-1} \phi\left( \cdots \phi\left(b_1 + W_1 x\right) \cdots \right)\right)\right)
$$

### The Rectified Linear Unit

The nonlinearity $\phi$, also called the activation function, is an important component of the neural network. Without $\phi$, or with $\phi(x) = x$, the neural network would implement just a linear function of its input. The choice of nonlinearity $\phi$, together with the number of layers and the width of each layer, will influence the set of functions that we can implement with a neural network.

The most commonly used activation function is the *rectified linear unit* (ReLU). It is given by the following formula:

$$
\phi(x) = \begin{cases}0 & \text{when } x\leq 0\\ x & \text{when } x>0\end{cases}
$$

And it has the following basic shape:

![](https://i.imgur.com/SxKdrzb.png)

Other choices of activation functions exist, and deep learning frameworks offer a whole host of them: tanh, sigmoid, the exponential linear unit (ELU), as well as variants of ReLU such as the leaky ReLU. ReLUs have many advantages, such as very fast gradient computations. The reason why I focus on the ReLU in this note is that its simplicity allows us to visualise the kind of functions neural networks of different depth and width can represent.

## What can these networks represent?

Approximation theory deals with characterising sets of functions that can be well approximated by functions from simpler function classes. When applied to neural network models, approximation theory is concerned with questions such as: What kinds of functions can we represent with a particular neural network architecture? How rich/dense is that class of functions, i.e. can it be used to approximate arbitrary functions well?

What follows here is an illustration of how to think about the first question in the context of ReLU networks. By no means is what follows a rigorous theoretical analysis; approximation theory of course goes a lot further than what I will talk about in this note.

### Shallow ReLU networks

First, let's look at what functions we can represent with a single hidden layer, i.e. $L=1$. Furthermore, we will only consider 1-dimensional inputs and 1-dimensional outputs for now, so it's easier to draw and illustrate things. The function class we would like to understand looks like this:

$$
f(x) = \mathbf{w}^T_2 \operatorname{ReLU}(\mathbf{w}_1x - \mathbf{b}_1),
$$

where $\mathbf{w}_1$ is a weight vector of size $d_1 \times 1$, where $d_1$ is the number of units in the hidden layer. Similarly, each hidden unit has its own bias term, described by the corresponding component of the bias vector $\mathbf{b}_1$.
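To make this concrete, here is a minimal NumPy sketch of this function class. The width, random seed and input grid below are arbitrary illustrative choices; they are not the values used to produce the figures that follow.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for illustration only
d1 = 5                           # width of the hidden layer

w1 = rng.normal(size=(d1, 1))    # first-layer weights, one row per hidden unit
b1 = rng.normal(size=(d1, 1))    # first-layer biases
w2 = rng.normal(size=(d1, 1))    # output weights

x = np.linspace(-3, 3, 601).reshape(1, -1)   # a grid of 1-D inputs, shape (1, n)

h = np.maximum(w1 @ x - b1, 0.0)   # hidden activations ReLU(w1 x - b1), shape (d1, n)
f = (w2.T @ h).ravel()             # network output w2^T ReLU(w1 x - b1), shape (n,)
```

Each row of `h` is the activation of one hidden unit as a function of $x$; these are the 'basis functions' discussed next, and `f` is their linear combination.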
The output of the hidden units, called the activation at the first layer, is given by the vector:

$$
\operatorname{ReLU}(\mathbf{w}_1x - \mathbf{b}_1)
$$

Below, I illustrate each component of this vector as a function of the input $x$, for randomly chosen $\mathbf{b}_1$ and $\mathbf{w}_1$:

![](https://i.imgur.com/rN5wRVJ.png)

Each unit, shown in a different colour, defines a 'basis function' of the same essential shape. The differences between the units are: the location of the kink where the unit turns on, the slope of the linear part, and finally whether the unit activates for positive or negative values (the direction of the slope).

The final network output is a linear combination of these basis functions, and thus looks like this:

![](https://i.imgur.com/kX3nuYg.png)

We can see that when we combine ReLU basis functions linearly, we end up with a piecewise linear function. The boundaries between the linear segments in the output are dictated by the locations of the kinks of the individual ReLU units. I colour-coded the *kinks* in this function the same way as the different ReLU activations were plotted in the previous figure. For example, the brown unit turns on for values above approximately $1.5$, resulting in a kink in the output at $x\approx1.5$.

Importantly, each ReLU unit we add contributes one kink to the function. If we measure the complexity of the function we implement in terms of the number of linear segments, we can say that for shallow networks:

> number of kinks $\approx O($ width of network $)$

### Deep ReLU networks

So adding width is not the most efficient way of making our function class richer. Can stacking multiple layers do better? It turns out the answer is yes, and I will illustrate this with an artificially constructed network, called the *'sawtooth'* network ([Telgarsky, 2015](https://arxiv.org/abs/1509.08101)).

#### Example: "sawtooth" network

Let's consider the following recursively defined function:

\begin{align}
f_l(x) &= 2\vert f_{l-1}(x)\vert - 2 \\
f_0(x) &= x
\end{align}

Although this function is expressed in terms of the absolute value, it can clearly be implemented using just two ReLU activation functions, since $\vert y\vert = \operatorname{ReLU}(y) + \operatorname{ReLU}(-y)$:

\begin{align}
f_l(x) &= 2 \operatorname{ReLU}(f_{l-1}(x)) + 2 \operatorname{ReLU}(-f_{l-1}(x)) - 2\\
f_0(x) &= x
\end{align}

So this is a function that is realisable with a deep ReLU network with just $2$ units in each layer, $2L$ units in total. Let's look at what functions this deep but narrow network implements as we increase depth:

#### $0$-layer network

This is just the identity; without any neural network, this is what we start from, shown here for completeness:

![](https://i.imgur.com/bRUYJD7.png)

#### $1$-layer network

![](https://i.imgur.com/bQpM1dg.png)

#### $2$-layer network

![](https://i.imgur.com/NxNj7VG.png)

#### $3$-layer network

![](https://i.imgur.com/xlcpyP7.png)

#### $4$-layer network

![](https://i.imgur.com/9GF52j7.png)

#### $5$-layer network

![](https://i.imgur.com/Ku8aamk.png)

You can check visually that the number of linear segments doubles each time a new layer is added. This is because each linear segment crosses $0$, so when we take the absolute value, each segment is split in half. The function is then shifted downwards again so that all of the newly created linear segments cross $0$ once more.
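We can also check this doubling numerically. Below is a small NumPy sketch that implements the sawtooth recursion and counts the linear segments on $[-2, 2]$ from changes in the finite-difference slope; the grid resolution and tolerance are arbitrary choices for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sawtooth(x, depth):
    """Sawtooth network: f_0(x) = x,
    f_l(x) = 2*ReLU(f_{l-1}(x)) + 2*ReLU(-f_{l-1}(x)) - 2."""
    f = x
    for _ in range(depth):
        f = 2.0 * relu(f) + 2.0 * relu(-f) - 2.0
    return f

# Dyadic grid over [-2, 2], so the kinks of the sawtooth fall on grid points.
x = np.linspace(-2.0, 2.0, 2**12 + 1)
for depth in range(6):
    y = sawtooth(x, depth)
    slopes = np.diff(y) / np.diff(x)     # piecewise constant, +/- 2**depth
    # a linear segment ends wherever the slope changes
    n_segments = 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-6))
    print(f"depth {depth}: {n_segments} linear segments")
# prints 1, 2, 4, 8, 16, 32 segments: doubling with every added layer
```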
Therefore, for deep ReLU networks one can say that

> number of kinks $\approx O(2^\text{depth})$

In summary, increasing depth is a more efficient way of increasing the complexity of the functions the network can theoretically represent than adding width.

### In higher dimensions: linear regions

In higher dimensions, things are a little bit more complicated, as each ReLU unit defines a hyperplane rather than a kink. [Hanin and Rolnick (2019)](http://proceedings.mlr.press/v97/hanin19a/hanin19a.pdf) made beautiful illustrations of the piecewise linear functions over 2-dimensional inputs, which I used in my slides:

![](https://i.imgur.com/0NVHFEN.png)

In this illustration, each coloured region corresponds to a linear region. Within each of these regions the function is linear, but the slope (gradient) is different in each. Although this colouring scheme might suggest otherwise, these linear pieces fit together seamlessly, so that the network defines a continuous function.

## Conclusions

The conclusion from this is that deep networks, especially those with a large number of units in each layer, can represent very complex functions, and they define a very large model class. This has led many people to be very skeptical about using them, because classical learning theory and the theory of generalisation suggested that the more complex your model class, the more your method will overfit. Based on these arguments, the complexity of large deep networks is way too large, so they *shouldn't work*. But they do work in practice, and this apparent contradiction has led people to rethink how we approach the theory of generalisation.

## Additional reading (not mentioned in lecture)

[Hanin and Rolnick (2019)](http://proceedings.mlr.press/v97/hanin19a/hanin19a.pdf) argue that although the 'worst-case' complexity of neural networks is exponential in depth, as illustrated by constructed examples like the sawtooth network, in practice the number of linear regions stays roughly linear in the number of units for most randomly initialized networks. So while these deep nets have the capacity to model very complex piecewise linear functions, they tend not to use all of this complexity in practice.

A related observation was made by [Pérez et al. (2019)](https://openreview.net/pdf?id=rye4g3AqFm), who argue that deep learning generalizes well despite the theoretically very complex model class, because the parameter-function mapping is such that it favours simpler functions. What this means is that, although deep networks *can* in theory implement very complex functions, for most values of the parameters, the function they *do* implement tends to be rather simple. The arguments are nicely summarized in this conversation-style [poster presentation](https://postersession.ai/poster/deep-learning-generalizes-because-the-pa/).

Finally, I encourage interested students to read up on deep linear models. For example, [Arora et al. (2019)](https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf) consider the problem of matrix completion with deep linear models. As discussed above, deep linear models (deep nets without the nonlinearity $\phi$) represent the same functions as ordinary linear models, because the composition of linear functions is still a linear function. Yet it turns out that if you parametrise linear functions in this redundant way and train them using gradient descent, you get different solutions depending on the parametrization and initialization.
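To get a feel for why the parametrization matters, here is a short calculation for the simplest case: a two-layer linear model $f(x) = W_2 W_1 x$, with end-to-end map $W = W_2 W_1$ and a loss $\mathcal{L}(W)$. Under gradient flow (the continuous-time idealisation of gradient descent) on the factors,

\begin{align}
\dot{W}_1 &= -W_2^T \nabla_W \mathcal{L},\\
\dot{W}_2 &= -\nabla_W \mathcal{L}\, W_1^T,
\end{align}

so the end-to-end map evolves as

\begin{align}
\dot{W} = \dot{W}_2 W_1 + W_2 \dot{W}_1 = -\nabla_W \mathcal{L}\, W_1^T W_1 - W_2 W_2^T \nabla_W \mathcal{L}.
\end{align}

This is not plain gradient flow on $W$: the update is preconditioned by $W_1^T W_1$ and $W_2 W_2^T$, which depend on the current factors and therefore on the initialization. This is one way to see how a redundant parametrization can change which solution gradient descent finds, even though the set of representable functions is unchanged.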
The reason why I mention this here is to highlight that reasoning about the model class that neural networks can represent tells you only a small part of the story as to why they are useful.