## 3.6 Chain Rule

Recall the following chain rule from single-variable calculus.

:::success
Let $f$ and $g$ be functions such that $g \circ f$ is well-defined. Suppose that $f$ is differentiable at $a$ and that $g$ is differentiable at $f(a)$. Then $g \circ f$ is differentiable at $a$ and
$$(g \circ f)'(a)=g'(f(a))f'(a).$$
:::

This generalises to higher dimensions. Before stating the most general form of the chain rule, we consider a special case in two dimensions.

:::success
**Theorem** Let $f: \mathbb R^2\to \mathbb R$, $x: (a,b) \to \mathbb R$, and $y: (a,b) \to \mathbb R$ be differentiable functions such that $z=f(x(t),y(t)),\, t\in(a,b)$ is well-defined. Then $z:(a,b) \to \mathbb R$ is differentiable on $(a,b)$ and
$$ \frac{dz}{dt}=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}\,.$$
:::

<br>

**Example 1** Starting from $t=0$, a right circular cone grows in height at $1\, cm/s$ and in radius at $0.5\, cm/s$. How fast does its volume grow at $t=3\, s$?

:::spoiler Answer
Note that the volume is given by $V=\frac{\pi}{3}hr^2$ where $h$ and $r$ are functions of $t$. Applying the chain rule,
\begin{align}
\frac{dV}{dt}&=\frac{\partial V}{\partial h}\frac{dh}{dt}+\frac{\partial V}{\partial r}\frac{dr}{dt}\\
&= \frac{\pi}{3} r^2 \cdot 1 + \frac{2\pi}{3}hr \cdot 0.5 \\
&= \frac{\pi}{3}r(r+h).
\end{align}
At $t=3$, $r=1.5\, cm$ and $h=3\, cm$, so
$$ \frac{dV}{dt} = \frac{\pi}{3}\times \frac{3}{2}\left(\frac{3}{2}+3\right) = \frac{9\pi}{4} \,cm^3/s\,.$$
:::

<br>

In general, we have the following.

:::success
**Theorem** Suppose $f$ is a differentiable function of $n$ variables $x_1,\dots, x_n$ and each $x_k$ is a differentiable function of $m$ variables $t_1,\dots,t_m$. Then for all $j=1,\dots, m$,
$$\frac{\partial f}{\partial t_j}=\sum_{k=1}^n \frac{\partial f}{\partial x_{k}} \frac{\partial x_{k}}{\partial t_{j}}.$$
:::

<br>

**Example 2** Write down the chain rule when $f=f(x,y,z)$, $x=x(u,v)$, $y=y(u,v,w)$ and $z=z(v)$.

:::spoiler Answer
```graphviz
digraph hierarchy {
  nodesep=0.5 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=circle] // all nodes use this shape and colour
  edge [color=Blue, style=none] // all edges use this style
  f->{x y z}
  x->{u v}
  y->{u v w}
  z->v
}
```
\begin{align}
\frac{\partial f}{\partial u}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial u}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial u}\\
\frac{\partial f}{\partial v}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial v}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial v}+\frac{\partial f}{\partial z} \frac{d z}{d v}\\
\frac{\partial f}{\partial w}&=\frac{\partial f}{\partial y} \frac{\partial y}{\partial w}.
\end{align}
:::

<br>
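Chain-rule computations like the ones above are easy to check with a computer algebra system. The following is a minimal `sympy` sketch (the library and variable names are our own choice, not part of the notes) that verifies Example 1 by comparing the chain-rule sum with direct differentiation of $V(t)$.

```python
import sympy as sp

t, h, r = sp.symbols('t h r', positive=True)

V = sp.pi / 3 * h * r**2   # volume of a right circular cone
h_t = t                    # height grows at 1 cm/s, starting from 0
r_t = t / 2                # radius grows at 0.5 cm/s, starting from 0

# Chain rule: dV/dt = V_h * dh/dt + V_r * dr/dt, evaluated along h(t), r(t)
chain = (sp.diff(V, h) * sp.diff(h_t, t) + sp.diff(V, r) * sp.diff(r_t, t)).subs({h: h_t, r: r_t})

# Direct differentiation after substituting h(t) and r(t)
direct = sp.diff(V.subs({h: h_t, r: r_t}), t)

print(sp.simplify(chain - direct))   # 0, the two computations agree
print(chain.subs(t, 3))              # 9*pi/4, as in Example 1
```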
**Example 3** Suppose $f(x,y)=3x^2-2xy+y^2$ where $x(u,v)=3u+2v$ and $y(u,v)=4u-v$. Find the following: $\frac{\partial f}{\partial u}, \frac{\partial f}{\partial v}, \frac{\partial^2 f}{\partial u^2}, \frac{\partial^2 f}{\partial u \partial v}, \frac{\partial^2 f}{\partial v^2}$.

:::spoiler Answer
```graphviz
digraph hierarchy {
  nodesep=0.75 // increases the separation between nodes
  node [color=Red, fontname=Courier, shape=circle] // all nodes use this shape and colour
  edge [color=Blue, style=none] // all edges use this style
  f->{x y}
  x->{u v}
  y->{u v}
}
```
So,
\begin{align}
\frac{\partial f}{\partial u}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial u}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial u} \\
&= (6x-2y)3+(-2x+2y)4=10x+2y=38u+18v.\\
\frac{\partial f}{\partial v}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial v}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial v}\\
&=(6x-2y)2+(-2x+2y)(-1)=14x-6y=18u+34v.\\
\frac{\partial^2 f}{\partial u^2}&=38.\\
\frac{\partial^2 f}{\partial u \partial v}&=18=\frac{\partial^2 f}{\partial v \partial u}.\\
\frac{\partial^2 f}{\partial v^2}&=34.
\end{align}
:::

<br>

**Example 4** Suppose $g(s,t)=f(s^2-t^2,t^2-s^2)$ where $f$ is differentiable. Show that $g$ satisfies the following partial differential equation (PDE):
$$t \frac{\partial g}{\partial s}+s\frac{\partial g}{\partial t}=0\,.$$

:::spoiler Answer
Take $x=s^2-t^2$ and $y=t^2-s^2$. Then $g(s,t)=f(x,y)$, $x_s=2s=-y_s$ and $x_t=-2t=-y_t$. Applying the chain rule, $g_s = f_x x_s + f_y y_s$ and $g_t = f_x x_t + f_y y_t$. Finally,
$$t\frac{\partial g}{\partial s}+s\frac{\partial g}{\partial t}=t(2sf_x-2sf_y)+s(-2tf_x+2tf_y)=0\,.$$
:::

<br>

### 3.6.1 Implicit differentiation

Recall the example from last week in which we found the tangent plane to the ellipsoid
$$2x^2+y^2+z^2=4$$
at the point $(1,1,1)$. In that case, we wrote $z = \sqrt{4-2x^2-y^2}$ (because the point was in the upper half space) and computed the partial derivatives of $z$. However, we could have also used **implicit differentiation**:
\begin{align}
4x+0+2z \frac{\partial z}{\partial x} &=0 \implies \frac{\partial z}{\partial x} = -\frac{2x}{z} \implies \frac{\partial z}{\partial x}\Big|_{(1,1,1)}=-2. \\
0+2y+2z \frac{\partial z}{\partial y}&=0 \implies \frac{\partial z}{\partial y} = -\frac{y}{z} \implies \frac{\partial z}{\partial y}\Big|_{(1,1,1)}=-1.
\end{align}
In general, we have the following theorem.

:::success
**Theorem** (Implicit differentiation) Suppose $F(y,x_1,\dots,x_n)=0$ defines $y$ implicitly as a function of $x_1,\dots,x_n$ and $F$ has continuous partial derivatives with $F_y \neq 0$. Then for all $j=1,\dots,n$,
$$\frac{\partial y}{\partial x_j} = - \frac{F_{x_j}}{F_{y}}\,.$$
:::

**Example 5** Find $\frac{dy}{dx}$ if $y$ is given implicitly as a function of $x$ by the equation of the ellipse
$$3x^2-2xy+y^2+4x-6y-11=0.$$
Find the equation of the tangent line to this ellipse at $(2,1)$.

:::spoiler Answer
Let $F(x,y)=3x^2-2xy+y^2+4x-6y-11$. By implicit differentiation,
$$\frac{dy}{dx}=-\frac{\partial F/\partial x}{\partial F/\partial y}=-\frac{6x-2y+4}{-2x+2y-6}=\frac{3x-y+2}{x-y+3}.$$
Note that the tangent line passes through $(2,1)$ and has slope $\frac{dy}{dx}\big|_{(2,1)}=\frac{7}{4}$. So the equation of the line is
$$y-1=\frac{7}{4}(x-2) \implies y =\frac{7}{4}x - \frac{5}{2}.$$
![tangent](https://hackmd.io/_uploads/S1P5yrQfyl.gif)
:::

<br>
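Here is a minimal `sympy` sketch (our own illustration, not taken from the references) that applies the formula $\frac{dy}{dx}=-F_x/F_y$ to the ellipse of Example 5 and recovers the slope $\frac{7}{4}$ at $(2,1)$.

```python
import sympy as sp

x, y = sp.symbols('x y')

# Ellipse of Example 5: F(x, y) = 0
F = 3*x**2 - 2*x*y + y**2 + 4*x - 6*y - 11

# Implicit differentiation: dy/dx = -F_x / F_y
dydx = -sp.diff(F, x) / sp.diff(F, y)

print(sp.simplify(dydx))          # equivalent to (3*x - y + 2)/(x - y + 3)
print(dydx.subs({x: 2, y: 1}))    # 7/4, the slope of the tangent line at (2, 1)
```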
---

**References**

1. *Chapter 14.5*: Stewart, J. (2012). Calculus (8th ed.). Boston: Cengage Learning.
2. *Chapter 13.5*: Strang, G., Calculus (3rd ed.). [https://ocw.mit.edu/](https://ocw.mit.edu/courses/res-18-001-calculus-fall-2023/pages/textbook/)

---

<br>

## 3.7 Directional Derivatives and Gradient

We saw that partial derivatives tell us the rate of change of a function in the directions of the $x,y,z, \dots$ axes in $\mathbb R^n$. What about the rate of change in other directions? *Directional derivatives* tell us exactly that.

![image](https://hackmd.io/_uploads/SkDxR-CZ1x.png)
<sub><sup>(taken from [1])</sup></sub>

<br>

:::info
**Definition** The **directional derivative of $f$ at $(a_1,\dots,a_n)$ in the direction of a unit vector $\vec{u}= (u_1,\dots,u_n)$** is
$$ D_{\vec u} f(a_1,\dots, a_n) = \lim_{h\to 0}\frac{f(a_1+hu_1, \dots, a_n+hu_n) - f(a_1,\dots, a_n)}{h}$$
if the limit exists.
:::

**Example 1** Find the directional derivative of $f(x,y)=x^2+y^2$ at $(1,0)$ in the direction given by the angle $\pi/4$ (measured anticlockwise from the positive $x$-axis).

::::spoiler Answer
We recall that
:::success
If the unit vector $\vec u$ makes an anticlockwise angle $\theta$ with the positive $x$-axis, then
$$\vec u = (\cos \theta, \sin \theta)= \cos (\theta)\, i + \sin (\theta)\, j.$$
:::
So, the unit vector in the given direction is $\vec{u}=\big( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \big)$, and
$$D_{\vec u} f(1,0) = \lim_{h \to 0} \frac{(1+h/\sqrt{2})^2+(0+h/\sqrt{2})^2 - (1^2+0^2)}{h} = \lim_{h \to 0} \frac{h^2+\sqrt{2}h}{h}= \sqrt{2}.$$
::::

<br>

This is a very cumbersome way to compute directional derivatives. The following theorem makes things easier.

:::success
**Theorem** Suppose $f$ is a differentiable function. Then
$$D_{\vec u}f = \nabla f \cdot \vec u$$
where $\nabla f := (f_{x_1},\dots,f_{x_n})$ is the **gradient of $f$**.
:::

:::spoiler Proof
Since $f$ is differentiable, $\nabla f = (f_{x_1},\dots,f_{x_n})$ exists and $g(h)=f((a_1,\dots,a_n)+h\vec u)$ is differentiable as a function of $h$. Note that, on the one hand,
$$g'(0)= \lim_{h \to 0} \frac{g(h)-g(0)}{h} = D_{\vec u}f(a_1,\dots,a_n).\hspace{40pt} (1)$$
On the other hand, because $g(h) = f(a_1+hu_1, \dots, a_n+hu_n)$, we can apply the chain rule to write
$$g'(h) = \sum_{j=1}^n f_{x_j}(a_1+hu_1, \dots, a_n+hu_n) \frac{d}{dh}(a_j+hu_j) =\sum_{j=1}^n f_{x_j}(a_1+hu_1, \dots, a_n+hu_n)\,u_j.$$
So, plugging in $h=0$, we obtain
$$g'(0) = \sum_{j=1}^n f_{x_j}(a_1,\dots,a_n)\,u_j = \nabla f (a_1,\dots, a_n) \cdot \vec u .\hspace{48pt} (2)$$
Combining $(1)$ and $(2)$, we have the required result.
:::

<br>

**Example 2** Let $f(x,y)=x^2+y^2$. Write down the gradient of $f$ at $(1,0)$. Find the directional derivative of $f$ at $(1,0)$ in the direction given by the angle $\theta=\pi/4$.

:::spoiler Answer
The gradient of $f$ at $(1,0)$ is
$$\nabla f (1,0) = (f_x,f_y)|_{(1,0)}= (2x,2y)|_{(1,0)}=(2,0).$$
The directional derivative is
$$D_{\vec u} f(1,0)= \nabla f (1,0) \cdot \vec{u} = (2,0)\cdot \Big(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\Big) = \sqrt{2}\,.$$
:::

<br>
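It is also instructive to compare the dot-product formula with the limit definition numerically. Below is a small `numpy` sketch (our own illustration) for $f(x,y)=x^2+y^2$ at $(1,0)$ in the direction $\theta=\pi/4$, matching Examples 1 and 2.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

a = np.array([1.0, 0.0])                            # base point
u = np.array([np.cos(np.pi/4), np.sin(np.pi/4)])    # unit vector at angle pi/4

# Via the gradient: grad f = (2x, 2y)
grad_f = np.array([2*a[0], 2*a[1]])
print(grad_f @ u)                                   # 1.4142... = sqrt(2)

# Via the limit definition, with a small but finite h
h = 1e-6
print((f(a + h*u) - f(a)) / h)                      # approximately sqrt(2)
```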
**Example 3** Let $g(x,y,z)= x \sin (yz)$. Write down the gradient of $g$. Find the directional derivative of $g$ at $(1,3,0)$ in the direction given by $\vec{v}=i+2j-k$.

::::spoiler Answer
We recall that
:::success
The unit vector in the direction of $\vec v$ is $\frac{\vec{v}}{\|\vec v\|}\,.$
:::
The gradient of $g$ is
$$\nabla g = (g_x,g_y, g_z)= (\sin(yz), xz\cos(yz), xy\cos(yz)).$$
The directional derivative is
$$D_{\vec v} g(1,3,0)= \nabla g (1,3,0) \cdot \frac{\vec{v}}{\|\vec v\|} = (0,0,3)\cdot \Big(\frac{1}{\sqrt{6}},\frac{2}{\sqrt{6}},\frac{-1}{\sqrt{6}} \Big) = -\sqrt{3/2}\,.$$
::::

<br>

We can also compute the directional derivative by expanding the dot product as follows:
$$D_{\vec v} g = \nabla g \cdot \frac{\vec{v}}{\|\vec v\|} = \|\nabla g\| \Big\|\frac{\vec{v}}{\|\vec v\|}\Big\|\cos \theta =\|\nabla g\|\cos \theta $$
where $\theta$ is the angle between $\nabla g$ and $\vec v$. From this we conclude that the rate of change (increase or decrease) of the function is largest in magnitude when $\theta = 0$ or $\pi$. When $\theta =0$, i.e., in the direction of the gradient, the function increases at the highest possible rate, and when $\theta =\pi$, i.e., in the direction opposite to the gradient, the function decreases at the highest possible rate.

:::success
**Theorem** (Steepest ascent/descent) Let $f$ be a differentiable function with $\nabla f \neq \vec 0$. Then
\begin{align}
\max_{\|\vec u\|=1} D_{\vec u} f &= D_{\nabla f/\|\nabla f\|}\, f =\|\nabla f\|\,,\\
\min_{\|\vec u\|=1} D_{\vec u} f &= D_{-\nabla f/\|\nabla f\|}\, f =-\|\nabla f\|\,.
\end{align}
:::

**Example 4** Let $h(x,y)=x e^y$. Find the maximal rate of increase of the function at the point $(2,0)$.

:::spoiler Answer
The maximal rate of increase is $\|\nabla h (2,0)\|=\|(e^y,xe^y)|_{(2,0)}\|=\|(1,2)\|= \sqrt{5}$.
<br>

![image](https://hackmd.io/_uploads/BkpVbEAWJe.png)
:::

<br>

---

**References**

1. *Chapter 14.6*: Stewart, J. (2012). Calculus (8th ed.). Boston: Cengage Learning.
2. *Chapter 13.4*: Strang, G., Calculus (3rd ed.). [https://ocw.mit.edu/](https://ocw.mit.edu/courses/res-18-001-calculus-fall-2023/pages/textbook/)
3. *Chapter 2.4*: Corral, M. (2021). Vector Calculus. [https://www.mecmath.net/](https://www.mecmath.net/)

---

<br>

## 3.8 Application: Gradient Descent

A practical application of this theorem is **Gradient Descent** (GD), an optimisation algorithm attributed to Cauchy (1789-1857). Modified versions of GD, e.g., RMSprop and ADAM, are used in Machine Learning.

:::success
**Optimization Problem** Let $D \subseteq \mathbb R^n$. Given a function $f :D \to \mathbb R$, the general problem of finding the value that minimises $f$ over $D$ is formulated as
$$\min_{x \in D} f(x).$$
In this context, $f$ is the **objective function** and $D$ is the **constraint set**. The set of input values at which the minimum is attained is denoted by
$${\arg \min}_{x \in D} f(x)\,.$$
:::

<br>

**Example 1** Find $\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)$. What is ${\arg \min}_{(x,y,z) \in \mathbb R^3}\, (x^2+y^2+z^2+1)\,?$

:::spoiler Answer
We note that $x^2, y^2, z^2\geq 0$, and hence $x^2+y^2+z^2 \geq 0$ with equality if and only if $x=y=z=0$. Therefore,
$$\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)=1 \quad\text{and}\quad {\arg \min}_{(x,y,z) \in \mathbb R^3}\, (x^2+y^2+z^2+1) = \{(0,0,0)\}\,.$$
:::

<br>

:::danger
**Gradient Descent** Given an initial point $x_0$, find the iterates $x_{n}$ recursively using
$$x_n = x_{n-1} - \gamma\,\nabla f (x_{n-1}), \quad n\geq 1,$$
for some $\gamma >0$, called the **learning rate**. For a suitably small $\gamma$, $f(x_n)<f(x_{n-1})$. Repeat until the maximum number of iterations is reached or an update decreases the function value by less than a *tolerance* $\varepsilon \ll 1$.
:::
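A minimal sketch of this update loop in Python is given below (the starting point and the stopping rule are our own choices for illustration; the objective is the one from Example 1).

```python
import numpy as np

def f(p):
    # objective from Example 1: f(x, y, z) = x^2 + y^2 + z^2 + 1
    return float(p @ p) + 1.0

def grad_f(p):
    return 2.0 * p   # gradient of the objective above

def gradient_descent(x0, lr=0.1, tol=1e-10, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - lr * grad_f(x)     # x_n = x_{n-1} - gamma * grad f(x_{n-1})
        if f(x) - f(x_new) < tol:      # stop once the decrease falls below the tolerance
            return x_new
        x = x_new
    return x

print(gradient_descent([1.0, -2.0, 0.5]))   # close to (0, 0, 0), the arg min
```

For this objective, each update multiplies the current point by $0.8$, which matches the analytic computation in Example 2 below.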
![image](https://hackmd.io/_uploads/HkJ-N8CZJg.png)
<sub><sup>(taken from [1])</sup></sub>

<br>

**Example 2** Find $\arg\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)$ using GD with learning rate $0.1$.

:::spoiler Answer
Since $\nabla (x^2+y^2+z^2+1)=(2x,2y,2z)$, we obtain
\begin{align}
(x_n,y_n,z_n) &= (x_{n-1},y_{n-1},z_{n-1}) - 0.1(2x_{n-1},2y_{n-1},2z_{n-1})\\
&=(0.8x_{n-1},0.8y_{n-1},0.8z_{n-1}) =\cdots=(0.8)^n(x_0,y_0,z_0)\,, \,\,\,n\geq 1.
\end{align}
Therefore, $\lim_{n\to \infty} (x_n,y_n,z_n) = (0,0,0)$, as expected.
:::

<br>

Finally, let's look at the `python` implementation of the algorithm found at [this link](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21) and the visualisation in [1] to see how the learning rate and the nature of the function affect the effectiveness of the algorithm.

---

**References**

1. Pedregosa, F., "Gradient Descent", *Keep the gradient flowing*, 10/11/2024. https://fa.bianp.net/teaching/2018/eecs227at/gradient_descent.html
2. *Chapter 7.1*: Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.

---

<br>

<!---

## 3.8 Application 2: Backpropagation

In order to minimise the loss functions when training ML algorithms, one has to compute the partial derivatives of loss functions with respect to the parameters of a neural network. **Backpropagation**, introduced by Kelley (1960) and Linnainmaa (1970), later popularised by Hinton (2024 Nobel Physics Prize) and collaborators, is a method to do this. Backpropagation is simply the chain rule, but done in a smarter way to reduce repetitions of computations, and hence save computational time (in fact, it saves an enormous amount of time).

**Example 1: Linear Classifiers.** Typically, there are many training data points $(x_j,t_j), j=1,\dots,N.$ To keep the calculations simple, we assume $N=1$. The model is given by the line $z=wx+b$ and the activation function (to keep values between 0 and 1) $y=\sigma(z)$ where $\sigma$ is the sigmoid function.

![image](https://hackmd.io/_uploads/S11Ugg-Gye.png =400x300)

We would like to minimize the squared error $\mathcal L=0.5(y-t)^2$. We will choose appropriate weights $w$ and $b$ to minimise it using GD. To this end, we have to compute $(\mathcal L)_w =\bar w$ and $(\mathcal L)_b = \bar b$ (where we have adopted the notation $(\mathcal L)_p =: \bar p$).

![image](https://hackmd.io/_uploads/ByxLz--fyx.png =450x160)
<sub><sup>(taken from [1])</sup></sub>

In fact, we can add the regulariser $\mathcal R=0.5 w^2$ to obtain the loss function $\mathcal L_{reg}= \mathcal L+ \lambda \mathcal R$ (which is better conditioned for GD). We will choose appropriate weights $w$ and $b$, and hyperparameter $\lambda$, to minimise it using GD. Given below is the **computation graph**.

![image](https://hackmd.io/_uploads/rJbNnlbGkg.png =250x100)
<sub><sup>(taken from [1])</sup></sub>

The nodes in the graph correspond to all the values that are computed, with edges to indicate which values are computed from which other values.

![image](https://hackmd.io/_uploads/Syo4Lb-f1g.png =500x290)
<sub><sup>(taken from [1])</sup></sub>

For the univariate logistic classifier example:

![image](https://hackmd.io/_uploads/SJ8n_--Gyx.png =550x270)
<sub><sup>(taken from [1])</sup></sub>

This example was adapted from [1]. If you are interested in a less mathematical and more intuitive description, visit [this link](https://www.ibm.com/think/topics/backpropagation).

---

**References**

1. Grosse, R., Backpropagation, *Intro to Neural Networks and Machine Learning*, 10/11/2024. https://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L06%20Backpropagation.pdf
2. *Chapter 5.6*: Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.

---

<br>

--->

:::danger
**Summary**: Now, we can
- Compute derivatives of composite functions of several variables using the chain rule.
- Calculate the gradient of a function.
- Use the directional derivative to determine the rate of change of a function.
- Determine the max/min rate of change of a function of several variables.
- Describe the gradient descent algorithm.
:::