# Learning and Neural Networks

Agenda:

1. (Math) Conceptual foundations
2. (Demo) Solving linear regression
3. (Code) Further questions about the library
4. (Q&A) Questions, possibly answers, pointers to literature

<div style="height: 500px"></div>

## 1.1 Functions

A neural network is a function $f$ that maps training examples, represented by a matrix $X$ of **inputs**, to **predictions** $\hat{Y}$. As a diagram, such a function looks like this:

<img src="https://i.imgur.com/QJHvmZZ.png">

Example functions are $\mathsf{id}$ and $\mathsf{const}$, defined as follows:

$$
\begin{align}
\mathsf{id}(x) &= x\\
\mathsf{const}(c) &= [\lambda x. c]
\end{align}
$$

These are perfectly good functions, but not great as neural networks because they have no (interesting) parameters that can be learned from data.

<div style="height: 200px"></div>

## 1.2 Parametric functions

A useful neural network is a function $f_\theta$ that still maps input $X$ to output $\hat{Y}$, but is parametrized by a vector $\theta$ of **weights**. These parameters are part of the internal state of the network and aren't supplied by the user along with the input and output data. We'll visualize such parametrized functions as follows:

![](https://i.imgur.com/P811LU0.jpg)

The simplest such neural network maps its input to points on a line through the origin of the Euclidean plane and is defined as follows:

$$
f_\theta(X) = \theta X
$$

<div style="height: 200px"></div>

## 1.3 Learning

The learning problem for a neural network is the following:

- We have an *unknown* probability distribution $f_{u}$ from which we've sampled some training data $(X, Y)$.
- We have a neural network $f_\theta$ that maps training input $X$ to predictions $\hat{Y}$.
- To **learn** means to find a value of $\theta$ such that $f_\theta$ produces predictions $\hat{Y}$ that are in some sense "close to" the **targets** $Y$.
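As a concrete illustration, here's a minimal sketch of this setup in plain Python, assuming the one-parameter model $f_\theta(X) = \theta X$ from above (all names here are hypothetical, not from any particular library):

```python
# A minimal sketch of the learning setup: a one-parameter model
# f_theta(X) = theta * X, applied elementwise to a batch of inputs.

def f(theta, X):
    """The parametric model: maps inputs to predictions."""
    return [theta * x for x in X]

# Training data sampled from some unknown process (here the process
# happens to be Y = 2X, so the "right" value of theta is 2).
X = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, 6.0]

# A poorly chosen theta gives predictions far from the targets...
print(f(1.0, X))  # [1.0, 2.0, 3.0]
# ...while theta = 2 gives predictions equal to the targets.
print(f(2.0, X))  # [2.0, 4.0, 6.0]
```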
This notion of closeness is captured by a **loss** function $\mathcal{L}$, which takes predictions $\hat{Y}$ and targets $Y$ and produces a number that's smaller the "closer" the two are to each other. We'll represent the loss function with the following diagram:

![](https://i.imgur.com/6psXbue.png)

We usually deal with **vectorized** data, so $\hat{Y}$ will be a vector of predictions and $\mathcal{L}$ will yield a vector $L$ of numbers. We want a single number to represent the overall loss, and we get one with a **cost** function $\mathcal{J}$, which computes the expected value of $L$ either on the entire training data (the approach used in **batch** training) or on a random subset of it (the one used in **mini-batch** training). We again visualize:

![](https://i.imgur.com/3v9nqnp.png)

<div style="height: 200px"></div>

## 1.4 Optimization

The learning problem, stated in the language of mathematical optimization, boils down to the following. We have an **objective** function defined as a composition of three functions: the neural network itself ($f_\theta$), followed by the loss function ($\mathcal{L}$), and lastly by the cost function ($\mathbb{E}$). Since composition reads right to left, $f_\theta$ is applied first:

$$
\begin{align}
J &= \mathbb{E} \circ \mathcal{L} \circ f_\theta\\
J(\theta; X, Y) &= \mathbb{E}(\mathcal{L}(f_\theta(X), Y))
\end{align}
$$

Visually, this function is a little nicer to look at:

![](https://i.imgur.com/b3ql6KW.png)

The **goal** of optimization is to find $\theta$ that makes $J$ as small as possible.

<div style="height: 200px"></div>

## 1.5 Gradient descent

The most common solution to neural network optimization problems is **gradient descent**, which involves differentiating the cost $J$ with respect to the parameters $\theta$. This differentiation gives us what's called the **gradient** on $\theta$, denoted by $\nabla\theta$: a vector of the same shape as $\theta$, singled out from the space of all such vectors as the one that, if added to $\theta$, will *increase* $J$ the most.
But since we want to *minimize* $J$, we subtract the gradient from $\theta$ (hence the term gradient *descent*, not ascent). Algorithmically, gradient descent can be described as follows:

```python
while not small_enough(J):
    grad_theta = gradient(J, theta)(X, Y)  # the gradient ∇θ of J at θ
    theta = theta - grad_theta             # in practice scaled by a learning rate
```

<div style="height: 200px"></div>

## 1.6 Differentiation

Mathematically, differentiation is straightforward and is done following the rules of **calculus** (single-variable or vector, depending on the shapes of things). Computationally, however, there are at least three different ways of *implementing* calculus on a computer. We start with an overview of mathematical differentiation and then proceed to the implementation approaches.

<div style="height: 100px"></div>

### 1.6.1 Analytic differentiation

<div style="height: 20px"></div>

#### 1.6.1.1 Derivative

Mathematical or **analytic** differentiation involves paper, pencil, and manual human labor. To differentiate a function is to find its **derivative**, a concept that's most commonly defined in terms of the **limit** operator (there's a way of doing calculus without limits, called non-standard analysis, but we won't discuss it). The derivative is a **higher-order** binary operator described as follows:

$$
\frac{df}{dx} = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}
$$

It takes a function $f$ and its (single) formal parameter $x$, and returns a function that tells us the rate of change of $f$. This definition itself is not fundamental and is usually (especially in **real analysis**) reduced to a formula in first-order logic (this is the beloved $\epsilon$-$\delta$ definition that partitions students into those who stuck around and those who left math to study something less annoying). Given this (or the more boring $\epsilon$-$\delta$) definition, mathematicians have, since the times of the giants (Fermat, Descartes, Leibniz, Newton, the Bernoullis, among others), looked at all of the **elementary** functions (e.g.
$\sin, \cos, e^x$, and so on) and figured out their derivatives.

<div style="height: 100px"></div>

#### 1.6.1.2 Differentiation (single-variable)

Assuming this process of differentiation for all elementary functions has been completed, differentiation $\frac{df}{dx}$ is defined by **structural induction** (or, as computer scientists call it, structural recursion) as follows:

- (<u>base case</u>) if $f$ is one of the elementary functions, its derivative is simply looked up in a table of derivatives (e.g. the derivative of $\sin$ is $\cos$, of $e^x$ is itself).
- (<u>inductive step</u>)
    - (**addition**) if $f$ is the addition of functions $g$ and $h$ ($f = g + h$), then its derivative is:
    $$
    \frac{df}{dx} = \frac{dg}{dx} + \frac{dh}{dx}
    $$
    - (**product**) if $f$ is the product of functions $g$ and $h$ ($f = g * h$), then its derivative is:
    $$
    \frac{df}{dx} = \frac{dg}{dx}h + \frac{dh}{dx}g
    $$
    - (**power**) if $f$ raises its argument to an integral power $n$ ($f(x) = x^n$), then its derivative is:
    $$
    \frac{df}{dx}(x) = n \cdot x^{n - 1}
    $$
    - (**composition**) if $f$ is the composition of functions $g$ and $h$ ($f = g \circ h$, i.e. $f(x) = g(h(x))$), then, writing $u = h(x)$, its derivative is given by:
    $$
    \frac{df}{dx} = \frac{dg}{du} \cdot \frac{du}{dx}
    $$

This last inductive case you'll recognize as the **chain** rule. In what follows we'll be differentiating a function that's a composition of more than two functions, so let's quickly look at how the chain rule works for a composition of, say, three functions:

$$
\begin{align}
f &= k \circ h \circ g\\
u &= g(x)\\
v &= h(u)\\
f &= k(v)\\
\frac{df}{dx} &= \frac{du}{dx} \cdot \frac{dv}{du} \cdot \frac{df}{dv}
\end{align}
$$

The chain rule applies by a similar procedure to functions composed of an arbitrary, but finite, number of other functions.
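The three-function chain rule can be checked numerically. Here's a hedged sketch (the function names and the finite-difference check are illustrative choices, not from the lecture's library), using $g(x) = x^2$, $h(u) = \sin(u)$, $k(v) = e^v$:

```python
import math

# Verify the chain rule df/dx = du/dx * dv/du * df/dv numerically
# for f = k ∘ h ∘ g with g(x) = x^2, h(u) = sin(u), k(v) = exp(v).

def g(x): return x * x
def h(u): return math.sin(u)
def k(v): return math.exp(v)

def dg(x): return 2 * x          # du/dx
def dh(u): return math.cos(u)    # dv/du
def dk(v): return math.exp(v)    # df/dv

def f(x): return k(h(g(x)))

def df_chain(x):
    u = g(x)
    v = h(u)
    return dg(x) * dh(u) * dk(v)  # du/dx * dv/du * df/dv

# Crude finite-difference approximation of the same derivative.
eps = 1e-7
x = 0.7
approx = (f(x + eps) - f(x)) / eps
print(abs(df_chain(x) - approx) < 1e-4)  # True
```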
<div style="height: 100px"></div>

#### 1.6.1.3 Differentiation (multivariable)

So far we've been talking about single-variable functions, but neural networks take matrices as inputs and produce matrices as outputs, so we need a way of differentiating in higher-dimensional spaces. There are two ways of extending differentiation:

- (**vector calculus**) The first approach, which is the standard one, is to treat a multivariable function (such as $f(x_1, x_2)$) as a function of a vector variable, putting the arguments into a vector $\bar{x} = [x_1, x_2]$ before passing them to the function:
$$
f(x_1, x_2) \equiv f([x_1, x_2])
$$
Once we do this, we get into the field of vector calculus, which uses techniques from **linear algebra** and **multivariable calculus** to extend the notion of derivative to vector spaces.
- (**functional analysis**) The second approach is to **curry** the multivariable function (viz. $f(x_1, x_2)$) to get a single-variable function $f_c$ that takes $x_1$ and returns another single-variable function, which then takes $x_2$ and performs the computation that $f$ was performing:
$$
f(x_1, x_2) \equiv f_c(x_1)(x_2)
$$

For example, let's suppose $f$ computes the sum of its two arguments: $f(x_1, x_2) = x_1 + x_2$. When we curry $f$ we get a higher-order function (also called a **functional**) $f_c$ defined as follows:

$$
\begin{align}
f_c(x_1) &= [\lambda x. x_1 + x]
\end{align}
$$

We then use $f_c$ to define the original binary function:

$$
\begin{align}
f(x_1, x_2) &\equiv f_c(x_1)(x_2)\\
&\equiv [\lambda x. x_1 + x](x_2)\\
&\equiv x_1 + x_2\\
\end{align}
$$

If you find higher-order functions interesting, I'll be more than happy to impose some recommended reading in **lambda calculus**, **combinatory logic**, and **functional programming** that deal with these things.
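Currying translates directly into Python. Here's a small sketch of the worked example above ($f(x_1, x_2) = x_1 + x_2$):

```python
# Currying the two-argument sum from the text:
# f(x1, x2) = x1 + x2 becomes f_c(x1) = [λx. x1 + x].

def f(x1, x2):
    return x1 + x2

def f_c(x1):
    # Returns the single-variable function [λx. x1 + x].
    return lambda x: x1 + x

add_three = f_c(3)           # a function still waiting for its second argument
print(add_three(4))          # 7
print(f_c(3)(4) == f(3, 4))  # True
```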
<div style="height: 100px"></div>

#### 1.6.1.4 Limits of analytic differentiation

In theory, it's possible to apply the recursive mathematical procedure to any neural network optimization problem using paper, pencil, and lots of patience. In practice, this is only feasible for simple neural networks, like linear models. In Appendix 1 I've shown how to do this for an unregularized linear regression model with a squared loss function, so we won't waste time going through the manual labor here. For neural networks with vector-valued data and hundreds (often several orders of magnitude more) of nodes, analytic differentiation is not just error-prone but infeasible, so we need a better way of doing it. There are three main alternatives, the last of which is the state of the art:

- symbolically
- numerically
- automatically

Our main topic is automatic differentiation, so I'll briefly mention the first two and then spend most of the time describing how automatic differentiation works.

<div style="height: 100px"></div>

### 1.6.2 Symbolic differentiation

This approach requires that we describe the optimization objective function in a particular *domain-specific language* (e.g. Mathematica, S-expressions) and then use a symbolic differentiation engine to get an expression for the derivative of the function. The resulting expression can then be evaluated to get numerical results. In Appendix 2 I've shown how to implement a basic symbolic differentiation engine.

There are two main problems with this approach. First, the language has to be learned, even if it's not Mathematica but a subset of LISP. Second, the derivative expression can be a huge formula, and it can involve expressions that *look* different syntactically but evaluate to the same function or value. For example, the expressions "$e^{ix}$" and "$\cos(x) + i\sin(x)$" look different as sequences of symbols, but are actually computing the same function.
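To make the idea concrete, here's a toy symbolic differentiator over S-expression-like tuples. This is an illustrative sketch, not the engine from Appendix 2; it handles just the sum and product rules:

```python
# A toy symbolic differentiator over S-expression-like tuples.
# An expression is a number, the variable 'x', or a tuple
# ('+' | '*', left, right).

def diff(expr):
    if expr == 'x':                      # d/dx x = 1
        return 1
    if isinstance(expr, (int, float)):   # d/dx c = 0
        return 0
    op, g, h = expr
    if op == '+':                        # sum rule
        return ('+', diff(g), diff(h))
    if op == '*':                        # product rule
        return ('+', ('*', diff(g), h), ('*', diff(h), g))
    raise ValueError(f"unknown operator: {op}")

# d/dx (x * x) comes back as 1*x + 1*x, i.e. 2x, unsimplified.
print(diff(('*', 'x', 'x')))  # ('+', ('*', 1, 'x'), ('*', 1, 'x'))
```

Note that the result is left unsimplified, which illustrates the expression-swell problem mentioned above: syntactically different expressions compute the same function.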
<div style="height: 100px"></div>

### 1.6.3 Numerical differentiation

The insight that leads to this approach is inspired by the definition of the derivative, which we gave above as follows:

$$
\frac{df}{dx} = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}
$$

The insight is that we can *approximate* the derivative, at any point in the domain of $f$, by getting rid of the limit and treating $h$ as a very small number $\epsilon$:

$$
\frac{df}{dx} \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}
$$

For example, the derivative of the function $f(x) = x^2$, which is $2x$, can be approximated at $x = 4$ by taking $\epsilon = 0.00001$:

$$
\begin{align}
\frac{d x^2}{dx}(4) &\approx \frac{(4 + \epsilon)^2 - 4^2}{\epsilon}\\
&\approx \frac{(4.00001)^2 - 16}{0.00001}\\
&\approx \frac{16.0000800001 - 16}{0.00001}\\
&\approx \frac{0.0000800001}{0.00001}\\
&\approx 8.00001\\
\end{align}
$$

The exact value of the derivative at $4$ is $2 \cdot 4 = 8$. The same basic idea can be used to approximate derivatives of more complicated functions, but the problem is that the approximation errors add up, and for huge neural networks the method becomes useless as a way of computing the derivative of the optimization objective. Nevertheless, it finds use under the hood of neural network libraries as a way of doing **gradient checking**: making sure that the derivatives obtained via one of the other methods are close enough to the approximate ones.

<div style="height: 100px"></div>

### 1.6.4 Automatic differentiation

We come at last to the workhorse of deep learning frameworks like PyTorch and TensorFlow, namely automatic or algorithmic differentiation (aka "autodiff"), also known under the name **backpropagation** (aka "backprop"). It avoids all of the problems of analytic, symbolic, and numerical differentiation and is responsible for the success of deep learning, and for the fact that the 2018 Turing Award went to Hinton, LeCun, and Bengio.
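Before moving on, the finite-difference approximation described above is a one-liner, and its use for gradient checking is equally short. A sketch (the function names are illustrative, not a library API):

```python
# Forward-difference approximation of df/dx at a point, as used for
# gradient checking.

def numeric_derivative(f, x, eps=1e-5):
    """Approximate df/dx at x with a forward difference."""
    return (f(x + eps) - f(x)) / eps

# Reproduces the worked example: d(x^2)/dx at x = 4 is exactly 8.
approx = numeric_derivative(lambda x: x ** 2, 4.0)
print(approx)  # ≈ 8.00001

# Gradient checking: the analytic derivative 2x should agree with
# the approximation up to a small tolerance.
assert abs(approx - 2 * 4.0) < 1e-3
```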
<div style="height: 100px"></div>

#### 1.6.4.1 Optimization objective as a graph

Everything starts with the realization that a neural network optimization objective ($J$) is a **graph**, more specifically a *weighted, directed, acyclic* graph. Recall the diagram from an earlier section:

![](https://i.imgur.com/b3ql6KW.png)

- The functions and their sub-functions ($f_\theta, \mathcal{L}, \mathbb{E}$), the data $X, Y$, and the model parameters $\theta$ are the **nodes**.
- The *values* given or computed by functions are the **weights** on the edges between the nodes.
- The **edges** link function input(s) to outputs, children to parent nodes.

Once we realize this, we can start programming a neural network library (a simplified version of PyTorch or TensorFlow). To make sure I understand how the pieces fit together, I've written such a library, and as I describe how such things work, I'll point to pieces of code from it as a concrete illustration.

<div style="height: 100px"></div>

#### 1.6.4.2 Graph representation

The library keeps a graph behind the scenes. As we declare variables (like $\theta$) and constants (like $X, Y, 2$), it creates nodes and adds them to the underlying graph (invisible to the user). As we compose functions (e.g. taking the expected value of the loss), the library automatically creates new nodes and links them to the existing ones. [See code for `Graph`, `Node`, `Variable`, and `Constant`].

<div style="height: 100px"></div>

#### 1.6.4.3 Built-in derivatives

So far, the library hasn't given us anything useful. As we go about our daily modeling business, it's busy constructing and augmenting an underlying graph that represents the function we're trying to optimize. Here's where things get exciting: the nodes that represent operations come with their own derivatives! [See code for some of the `ops`].
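The following is a minimal sketch of this idea (it is illustrative, not the actual library code referenced above): each operation node records its inputs and the derivative of its output with respect to each input.

```python
# Operation nodes that carry their own local derivative rules.

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # the value carried on outgoing edges
        self.parents = parents          # child-to-parent edges in the graph
        self.local_grads = local_grads  # d(output)/d(input), one per parent

def add(a, b):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    # d(a*b)/da = b, d(a*b)/db = a
    return Node(a.value * b.value, (a, b), (b.value, a.value))

x = Node(3.0)
y = Node(4.0)
z = mul(x, y)
print(z.value)        # 12.0
print(z.local_grads)  # (4.0, 3.0), i.e. dz/dx and dz/dy
```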
The library exposes these derivatives to us as properties on `Node` objects, so if we have a node representing a function $f(x, y)$, then we can get its derivative with respect to each argument by accessing `f.grad`. These derivatives are all that we need to solve the optimization problem with gradient descent. We ask the library to give us three derivatives:

$$
\frac{d\hat{Y}}{d\theta}, \frac{dL}{d\hat{Y}}, \frac{dJ}{dL}
$$

and then we use the chain rule to find the derivative of $J$ with respect to the neural network parameter vector $\theta$:

$$
\frac{dJ}{d\theta} = \frac{d\hat{Y}}{d\theta} \cdot \frac{dL}{d\hat{Y}} \cdot \frac{dJ}{dL}
$$

<div style="height: 100px"></div>

#### 1.6.4.4 Implementing multiplication

Mathematically, our work is done, because we can just multiply these gradients and plug the result into gradient descent to solve the optimization problem. But multiplication (including matrix multiplication) is an **associative** operator, so we can actually perform the computation in one of two ways:

- (**forward** mode) $\frac{dJ}{d\theta} = \Big(\frac{d\hat{Y}}{d\theta} \cdot \frac{dL}{d\hat{Y}}\Big) \cdot \frac{dJ}{dL}$
- (**reverse** mode) $\frac{dJ}{d\theta} = \frac{d\hat{Y}}{d\theta} \cdot \Big(\frac{dL}{d\hat{Y}} \cdot \frac{dJ}{dL}\Big)$

The forward mode is so called because the derivatives are multiplied in the same order the functions are computed (during the so-called "forward pass"); the reverse mode goes in the opposite direction. Now, it turns out that in terms of **computational efficiency**, both FLOPs and memory requirements, reverse-mode automatic differentiation is much more efficient than forward mode for neural network optimization problems. The reason is this: $J$ is a scalar, while the intermediate values (e.g. $L$, $\hat{Y}$) are vectors or matrices, so the intermediate derivatives are (Jacobian) matrices. When we multiply a matrix by a vector we get a vector, and when we multiply a matrix by a matrix we get a matrix.
The forward mode multiplies matrices all the way to the end, where the final multiplication by $\frac{dJ}{dL}$ collapses the result for the scalar $J$. The reverse mode starts with that final factor, $\frac{dL}{d\hat{Y}} \cdot \frac{dJ}{dL}$, which is a vector, and keeps multiplying matrices by vectors until it reaches the beginning.

<style>
img {
  width: 50%;
  margin-left: 150px;
}
</style>
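The FLOP difference between the two orders can be seen with a back-of-the-envelope count. A sketch (the helper name and dimensions are illustrative), counting a $(p \times q)(q \times r)$ product as $p \cdot q \cdot r$ scalar multiplications:

```python
# Count scalar multiplications for a chain of matrix products
# evaluated strictly left to right.

def matmul_cost(dims):
    """FLOPs for a chain of shapes (d0 x d1)(d1 x d2)...(d_{k-1} x d_k),
    given dims = [d0, d1, ..., dk], multiplied left to right."""
    cost, rows = 0, dims[0]
    for i in range(1, len(dims) - 1):
        cost += rows * dims[i] * dims[i + 1]
    return cost

n = 1000
# Jacobian chain for a scalar objective: (n x n)(n x n)(n x 1).
forward = matmul_cost([n, n, n, 1])  # matrix-by-matrix first: ~n^3
reverse = matmul_cost([1, n, n, n])  # transposed view, vector first: ~n^2
print(forward)  # 1001000000  (n^3 + n^2)
print(reverse)  # 2000000     (n^2 + n^2)
```

Grouping so that the scalar end is multiplied first keeps every intermediate result a vector, which is exactly why reverse mode wins for scalar objectives.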