## 3.6 Chain Rule
Recall the following chain rule from single variable calculus.
:::success
Let $f$ and $g$ be functions such that $g \circ f$ is well-defined. Suppose that $f$ is differentiable at $a$, and that $g$ is differentiable at $f(a)$. Then $g \circ f$ is differentiable at $a$ and $$(g \circ f)'(a)=g'(f(a))f'(a).$$
:::
This generalises to higher dimensions. Before stating the most general form of the chain rule, we consider a special case in two dimensions.
:::success
**Theorem**
Let $f: \mathbb R^2\to \mathbb R$, $x: (a,b) \to \mathbb R$, and $y: (a,b) \to \mathbb R$ be differentiable functions such that $z=f(x(t),y(t)),\, t\in(a,b)$ is well-defined. Then $z:(a,b) \to \mathbb R$ is differentiable on $(a,b)$ and $$ \frac{dz}{dt}=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}\,.$$
:::
<br>
**Example 1**
Starting from zero height and radius at $t=0$, a right circular cone grows in height at $1\, cm/s$ and in radius at $0.5\, cm/s$. How fast does its volume grow at $t=3\, s\, ?$
:::spoiler Answer
Note that the volume is given by $V=\frac{\pi}{3}hr^2$ where $h$ and $r$ are functions of $t$. Applying the chain rule,
\begin{align}
\frac{dV}{dt}&=\frac{\partial V}{\partial h}\frac{dh}{dt}+\frac{\partial V}{\partial r}\frac{dr}{dt}\,\\
&= \frac{\pi}{3} r^2 \cdot 1 + \frac{2\pi}{3}hr \cdot 0.5 \\
&= \frac{\pi}{3}r(r+h)
\end{align}
At $t=3$, $r=1.5\, cm$ and $h=3\, cm$. So, $$ \frac{dV}{dt} = \frac{\pi}{3}\times \frac{3}{2}\left(\frac{3}{2}+3\right) = \frac{9\pi}{4} \,cm^3/s\,.$$
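As a quick sanity check, here is a short symbolic computation (a sketch assuming `sympy` is available and, as above, $h(0)=r(0)=0$ so that $h=t$ and $r=t/2$):
```python
# Sanity check with sympy (assumed available), taking h(0) = r(0) = 0.
from sympy import symbols, pi, Rational, diff

t = symbols('t')
h = t                  # height grows at 1 cm/s
r = Rational(1, 2)*t   # radius grows at 0.5 cm/s
V = pi/3 * h * r**2

print(diff(V, t).subs(t, 3))  # 9*pi/4
```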
:::
<br>
In general, we have the following.
:::success
**Theorem**
Suppose $f$ is a differentiable function of $n$ variables $x_1,\dots, x_n$ and each $x_j$ is a differentiable function of $m$ variables $t_1,\dots,t_m$. Then for all $j=1,\dots, m$ we have the following. $$\frac{\partial f}{\partial t_j}=\sum_{k=1}^n \frac{\partial f}{\partial x_{k}} \frac{\partial x_{k}}{\partial t_{j}}.$$
:::
<br>
**Example 2**
Write down the chain rule when $f=f(x,y,z)$, $x=x(u,v)$, $y=y(u,v,w)$ and $z=z(v)$.
:::spoiler Answer
```graphviz
digraph hierarchy {
nodesep=0.5 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=circle] // all nodes use this shape and colour
edge [color=Blue, style=none] // all edges are drawn in this style
f->{x y z}
x->{u v}
y->{u v w}
z->v
}
```
\begin{align}
\frac{\partial f}{\partial u}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial u}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial u}\\
\frac{\partial f}{\partial v}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial v}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial v}+\frac{\partial f}{\partial z} \frac{d z}{d v}\\
\frac{\partial f}{\partial w}&=\frac{\partial f}{\partial y} \frac{\partial y}{\partial w}.
\end{align}
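As a sketch of how one might verify this structure, the following `sympy` snippet uses hypothetical concrete choices of $x$, $y$, $z$ and $f$ (all the specific formulas below are illustrative assumptions): the derivative with respect to $w$ only picks up the route through $y$.
```python
# Verifying the structure with hypothetical concrete functions (sympy assumed);
# any differentiable choices with these argument lists would do.
from sympy import symbols, sin, diff, simplify

u, v, w = symbols('u v w')
x = u + 2*v          # x = x(u, v)
y = u*v*w            # y = y(u, v, w)
z = sin(v)           # z = z(v)
f = x*y + z**2       # the composite f(x(u,v), y(u,v,w), z(v))

lhs = diff(f, w)             # derivative of the composite with respect to w
rhs = x * diff(y, w)         # f_y * y_w, since f_y = x for this choice of f
print(simplify(lhs - rhs))   # 0
```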
:::
<br>
**Example 3**
Suppose $f(x,y)=3x^2-2xy+y^2$ where $x(u,v)=3u+2v$ and $y(u,v)=4u-v$. Find the following: $\frac{\partial f}{\partial u}, \frac{\partial f}{\partial v}, \frac{\partial^2 f}{\partial u^2}, \frac{\partial^2 f}{\partial u \partial v}, \frac{\partial^2 f}{\partial v^2}$.
:::spoiler Answer
```graphviz
digraph hierarchy {
nodesep=0.75 // increases the separation between nodes
node [color=Red, fontname=Courier, shape=circle] // all nodes use this shape and colour
edge [color=Blue, style=none] // all edges are drawn in this style
f->{x y}
x->{u v}
y->{u v}
}
```
So,
\begin{align}
\frac{\partial f}{\partial u}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial u}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial u} \\&= (6x-2y)3+(-2x+2y)4=10x+2y=38u+18v.\\
\frac{\partial f}{\partial v}&=\frac{\partial f}{\partial x} \frac{\partial x}{\partial v}+\frac{\partial f}{\partial y} \frac{\partial y}{\partial v}\\&=(6x-2y)2+(-2x+2y)(-1)=14x-6y=18u+34v.\\
\frac{\partial^2 f}{\partial u^2}&=38.\\
\frac{\partial^2 f}{\partial u \partial v}&=18=\frac{\partial^2 f}{\partial v \partial u}.\\
\frac{\partial^2 f}{\partial v^2}&=34.
\end{align}
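These results can be double-checked symbolically (a sketch assuming `sympy` is available):
```python
# Symbolic double-check of the derivatives above (sympy assumed available).
from sympy import symbols, diff, expand

u, v = symbols('u v')
x = 3*u + 2*v
y = 4*u - v
f = 3*x**2 - 2*x*y + y**2   # f expressed through x(u,v) and y(u,v)

print(expand(diff(f, u)))   # 38*u + 18*v
print(expand(diff(f, v)))   # 18*u + 34*v
print(diff(f, u, u), diff(f, u, v), diff(f, v, v))  # 38 18 34
```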
:::
<br>
**Example 4**
Suppose $g(s,t)=f(s^2-t^2,t^2-s^2)$ where $f$ is differentiable. Show that $g$ satisfies the following partial differential equation (PDE):
$$t \frac{\partial g}{\partial s}+s\frac{\partial g}{\partial t}=0\,.$$
:::spoiler Answer
Take $x=s^2-t^2$ and $y=t^2-s^2$. Then, $g(s,t)=f(x,y)$, $x_s=2s=-y_s$ and $x_t=-2t=-y_t$. Applying the chain rule, $g_s = f_x x_s + f_y y_s$ and $g_t = f_x x_t + f_y y_t$. Finally,
$$t\frac{\partial g}{\partial s}+s\frac{\partial g}{\partial t}=t(2sf_x-2sf_y)+s(-2tf_x+2tf_y)=0\,.$$
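The cancellation can also be confirmed symbolically for an abstract differentiable $f$ (a sketch assuming `sympy` is available):
```python
# Confirming the PDE symbolically for an abstract f (sympy assumed available).
from sympy import symbols, Function, diff, simplify

s, t = symbols('s t')
f = Function('f')
g = f(s**2 - t**2, t**2 - s**2)

print(simplify(t*diff(g, s) + s*diff(g, t)))  # 0
```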
:::
<br>
### 3.6.1 Implicit differentiation
Recall the example from last week in which we found the tangent plane to the ellipsoid $$2x^2+y^2+z^2=4$$ at the point $(1,1,1)$. In this case, we wrote $z = \sqrt{4-2x^2-y^2}$ (because the point lies in the upper half-space) and computed the partial derivatives of $z$.
However, we could have also used **implicit differentiation**:
\begin{align}
4x+0+2z \frac{\partial z}{\partial x} &=0 \implies \frac{\partial z}{\partial x} = -\frac{2x}{z} \implies
\frac{\partial z}{\partial x}\Big|_{(1,1,1)}=-2. \\
0+2y+2z \frac{\partial z}{\partial y}&=0 \implies \frac{\partial z}{\partial y} = -\frac{y}{z} \implies
\frac{\partial z}{\partial y}\Big|_{(1,1,1)}=-1.
\end{align}
In general, we have the following theorem.
:::success
**Theorem** (Implicit differentiation)
Suppose $F(y,x_1,\dots,x_n)=0$ defines $y$ implicitly as a function of $x_1,\dots,x_n$ and $F$ has continuous partial derivatives with $F_y \neq 0$. Then for all $j=1,\dots,n$, $$\frac{\partial y}{\partial x_j} = - \frac{F_{x_j}}{F_{y}} $$
:::
**Example 5**
Find $\frac{dy}{dx}$ if $y$ is given implicitly as a function of $x$ by the equation of the ellipse $$3x^2-2xy+y^2+4x-6y-11=0.$$ Find the equation of the tangent line to this ellipse at $(2,1)$.
:::spoiler Answer
By implicit differentiation with $F(x,y)=3x^2-2xy+y^2+4x-6y-11$, $$\frac{dy}{dx}=-\frac{F_x}{F_y}=-\dfrac{6x-2y+4}{-2x+2y-6}=\dfrac{3x-y+2}{x-y+3}.$$
Note that the tangent line passes through $(2,1)$ and has slope $\frac{dy}{dx}|_{(2,1)}=\frac{7}{4}$. So the equation of the line is $$y-1=\frac{7}{4}(x-2) \implies y =\frac{7}{4}x - \frac{5}{2}.$$
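As a cross-check, `sympy` ships an implicit-differentiation helper, `idiff` (a sketch assuming `sympy` is available; the import path below is the documented one):
```python
# Cross-check with sympy's implicit differentiation helper (sympy assumed).
from sympy import symbols, simplify
from sympy.geometry.util import idiff

x, y = symbols('x y')
F = 3*x**2 - 2*x*y + y**2 + 4*x - 6*y - 11

dydx = idiff(F, y, x)               # treats y as an implicit function of x
print(simplify(dydx))               # should match (3*x - y + 2)/(x - y + 3)
print(dydx.subs({x: 2, y: 1}))      # 7/4
```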

:::
<br>
---
**References**
1. *Chapter 14.5* : Stewart, J. (2012). Calculus (8th ed.). Boston: Cengage Learning.
2. *Chapter 13.5* : Strang, G. Calculus (3rd ed.). MIT OpenCourseWare. [https://ocw.mit.edu/](https://ocw.mit.edu/courses/res-18-001-calculus-fall-2023/pages/textbook/)
---
<br>
## 3.7 Directional Derivatives and Gradient
We saw that partial derivatives tell us the rate of change of functions in the direction of the $x,y,z, \dots$ axes in $\mathbb R^n$. What about the rate of change in other directions? *Directional derivatives* tell us exactly that.

<br>
:::info
**Definition**
The **directional derivative of $f$ at $(a_1,\dots,a_n)$ in the direction of a unit vector $\vec{u}= (u_1,\dots,u_n)$** is $$ D_{\vec u} f(a_1,\dots, a_n) = \lim_{h\to 0}\frac{f(a_1+hu_1, \dots, a_n+hu_n) - f(a_1,\dots, a_n)}{h}$$ if the limit exists.
:::
**Example 1**
Find the directional derivative of $f(x,y)=x^2+y^2$ at $(1,0)$ in the direction given by the angle $\pi/4$ (measured anticlockwise from the positive $x$-axis).
::::spoiler Answer
We recall that
:::success
If the unit vector $\vec u$ makes an anticlockwise angle $\theta$ with the positive $x$-axis, then $$\vec u = (\cos \theta, \sin \theta)= \cos (\theta)\, i + \sin (\theta)\, j.$$
:::
So, the unit vector in the given direction is $\vec{u}=\big( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \big)$. Then, $$D_{\vec u} f(1,0) = \lim_{h \to 0} \frac{(1+h/\sqrt{2})^2+(0+h/\sqrt{2})^2 - (1^2+0^2)}{h} = \lim_{h \to 0} \frac{h^2+\sqrt{2}h}{h}= \sqrt{2}.$$
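A small numerical sketch of this limit (plain `python`; the step sizes below are arbitrary choices):
```python
# Approximating the limit definition numerically (illustrative sketch).
import math

def f(x, y):
    return x**2 + y**2

u = (1/math.sqrt(2), 1/math.sqrt(2))  # unit vector at angle pi/4
a = (1.0, 0.0)

for h in [1e-1, 1e-3, 1e-5]:
    quotient = (f(a[0] + h*u[0], a[1] + h*u[1]) - f(*a)) / h
    print(h, quotient)  # tends to sqrt(2) = 1.41421... as h -> 0
```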
::::
<br>
Computing directional derivatives directly from the limit definition is cumbersome. The following theorem makes things easier.
:::success
**Theorem**
Suppose $f$ is a differentiable function. Then $$D_{\vec u}f = \nabla f \cdot \vec u$$ where $\nabla f := (f_{x_1},\dots,f_{x_n})$ is the **gradient of $f$**.
:::
:::spoiler Proof
Since $f$ is differentiable, $\nabla f := (f_{x_1},\dots,f_{x_n})$ exists and $g(h)=f((a_1,\dots,a_n)+h\vec u)$ is differentiable as a function of $h$. Note that, on the one hand, $$g'(0)= \lim_{h \to 0} \frac{g(h)-g(0)}{h} = D_{\vec u}f(a_1,\dots,a_n).\hspace{40pt} (1)$$ On the other hand, because $g(h) = f(a_1+hu_1, \dots, a_n+hu_n)$, we can apply the chain rule to write
$$g'(h) = \sum_{j=1}^n f_{x_j}(a_1+hu_1,\dots,a_n+hu_n) \,\frac{d}{dh}(a_j+hu_j) =\sum_{j=1}^n f_{x_j}(a_1+hu_1,\dots,a_n+hu_n)\,u_j.$$ So, plugging in $h=0$, we obtain
$$g'(0) = \sum_{j=1}^n f_{x_j}(a_1,\dots,a_n)\,u_j = \nabla f (a_1,\dots, a_n) \cdot \vec u .\hspace{48pt} (2)$$ Combining $(1)$ and $(2)$ gives the required result.
:::
<br>
**Example 2**
Let $f(x,y)=x^2+y^2$. Write down the gradient of $f$ at $(1,0)$. Find the directional derivative of $f$ at $(1,0)$ in the direction given by the angle $\theta=\pi/4$.
:::spoiler Answer
The gradient of $f$ at $(1,0)$ is
$$\nabla f (1,0) = (f_x,f_y)|_{(1,0)}= (2x,2y)|_{(1,0)}=(2,0).$$ The directional derivative is
$$D_{\vec u} f(1,0)= \nabla f (1,0) \cdot \vec{u} = (2,0)\cdot \Big(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\Big) = \sqrt{2}\,.$$
:::
<br>
**Example 3**
Let $g(x,y,z)= x \sin (yz)$. Write down the gradient of $g$. Find the directional derivative of $g$ at $(1,3,0)$ in the direction given by $\vec{v}=i+2j-k$.
::::spoiler Answer
We recall that
:::success
The unit vector in the direction of $\vec v$ is $\frac{\vec{v}}{\|\vec v\|}\,.$
:::
The gradient of $g$ is
$$\nabla g = (g_x,g_y, g_z)= (\sin(yz), xz\cos(yz), xy\cos(yz)).$$ The directional derivative is
$$D_{\vec v} g(1,3,0)= \nabla g (1,3,0) \cdot \frac{\vec{v}}{\|\vec v\|} = (0,0,3)\cdot \Big(\frac{1}{\sqrt{6}},\frac{2}{\sqrt{6}},\frac{-1}{\sqrt{6}} \Big) = -\sqrt{3/2}\,.$$
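A numerical sanity check comparing the gradient formula with a finite-difference quotient (a sketch assuming `numpy` is available):
```python
# Comparing the formula with a finite-difference estimate (numpy assumed).
import numpy as np

def g(p):
    x, y, z = p
    return x * np.sin(y * z)

a = np.array([1.0, 3.0, 0.0])
v = np.array([1.0, 2.0, -1.0])
u = v / np.linalg.norm(v)          # unit vector in the direction of v

grad_a = np.array([0.0, 0.0, 3.0]) # grad g(1,3,0) computed above
h = 1e-6
fd = (g(a + h*u) - g(a)) / h       # finite-difference quotient

print(grad_a @ u, fd)              # both close to -sqrt(3/2) = -1.2247...
```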
::::
<br>
We can compute the directional derivative by expanding the dot product as follows.
$$D_{\vec v} g = \nabla g \cdot \frac{\vec{v}}{\|\vec v\|} = \|\nabla g\| \Big\|\frac{\vec{v}}{\|\vec v\|}\Big\|\cos \theta =\|\nabla g\|\cos \theta $$ where $\theta$ is the angle between $\nabla g$ and $\vec v$. From this we can conclude that the rate of change (increase or decrease) of the function is maximal when $\theta = 0$ or $\pi$.
When $\theta =0$, i.e., in the direction of the gradient, the function increases at the highest possible rate and when $\theta =\pi$, i.e., in the direction opposite to the gradient the function decreases at the highest possible rate.
:::success
**Theorem** (Steepest ascent/descent)
Let $f$ be a differentiable function. Then, at any point where $\nabla f \neq \vec 0$, \begin{align}
\max_{\|\vec u\|=1} D_{\vec u} f &= D_{\nabla f/\|\nabla f\|}\, f =\|\nabla f\|\,,\\
\min_{\|\vec u\|=1} D_{\vec u} f &= D_{-\nabla f/\|\nabla f\|}\, f =-\|\nabla f\|\,.
\end{align}
:::
**Example 4**
Let $h(x,y)=x e^y$. Find the maximal rate of increase of the function at the point $(2,0)$.
:::spoiler Answer
The maximal rate of increase is $\|\nabla h (2,0)\|=\|(e^y,xe^y)|_{(2,0)}\|=\|(1,2)\|= \sqrt{5}$.
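A short numerical check that no unit direction beats $\|\nabla h(2,0)\|$ (a sketch assuming `numpy` is available):
```python
# No unit direction gives a rate of increase above ||grad h(2,0)|| (numpy assumed).
import numpy as np

grad = np.array([1.0, 2.0])        # grad h(2,0) = (e^0, 2e^0) = (1, 2)
print(np.linalg.norm(grad))        # 2.23606... = sqrt(5)

# Sample many unit directions; the best sampled rate is close to sqrt(5),
# attained near the direction grad/||grad||.
angles = np.linspace(0, 2*np.pi, 361)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print((dirs @ grad).max())
```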
<br>

:::
<br>
---
**References**
1. *Chapter 14.6* : Stewart, J. (2012). Calculus (8th ed.). Boston: Cengage Learning.
2. *Chapter 13.4* : Strang, G. Calculus (3rd ed.). MIT OpenCourseWare. [https://ocw.mit.edu/](https://ocw.mit.edu/courses/res-18-001-calculus-fall-2023/pages/textbook/)
3. *Chapter 2.4* : Corral, M. (2021). Vector Calculus. [https://www.mecmath.net/](https://www.mecmath.net/)
---
<br>
## 3.8 Application: Gradient Descent
A practical application of the steepest ascent/descent theorem is **Gradient Descent** (GD), an optimisation algorithm attributed to Cauchy (1789-1857). Modified versions of GD, such as RMSProp and Adam, are used in Machine Learning.
:::success
**Optimization Problem**
Let $D \subseteq \mathbb R^n$. Given a function $f :D \to \mathbb R$, the general problem of minimising $f$ over $D$ is formulated as
$$\min_{x \in D} f(x).$$ In this context, $f$ is the **objective function** and $D$ is the **constraint set**. The input values at which the function attains its minimum are denoted by $${\arg \min}_{x \in D} f(x)\,.$$
:::
<br>
**Example 1**
Find $\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)$. What is ${\arg \min}_{(x,y,z) \in \mathbb R^3}\, (x^2+y^2+z^2+1)\,?$
:::spoiler Answer
We note that $x^2, y^2, z^2\geq 0$, and hence, $x^2+y^2+z^2 \geq 0$ with equality if and only if $x=y=z=0$. Therefore, $$\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)=1 \quad\text{and}\quad {\arg \min}_{(x,y,z) \in \mathbb R^3}\, (x^2+y^2+z^2+1) = \{(0,0,0)\}\,.$$
:::
<br>
:::danger
**Gradient Descent**
Given an initial point $x_0$, compute the iterates $x_{n}$ recursively using
$$x_n = x_{n-1} - \gamma\,\nabla f (x_{n-1}), \,\,\,n\geq 1$$ for some $\gamma >0$, called the **learning rate**. For a suitably small $\gamma$, each step satisfies $f(x_n)\leq f(x_{n-1})$, with strict decrease unless $\nabla f(x_{n-1})=\vec 0$.
Repeat until the maximum number of iterations is reached, or until a step fails to decrease the function value by more than the *tolerance* $\varepsilon \ll 1$.
:::
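Below is a minimal `python` sketch of this procedure, assuming `numpy` is available; the helper `gradient_descent`, its parameters, and the starting point are our own illustrative choices rather than a reference implementation.
```python
# A minimal sketch of the boxed procedure (numpy assumed available).
import numpy as np

def gradient_descent(f, grad_f, x0, gamma=0.1, tol=1e-8, max_iter=1000):
    """Iterate x_n = x_{n-1} - gamma * grad_f(x_{n-1}).

    Stop at max_iter, or once a step fails to decrease f by more than tol.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - gamma * grad_f(x)
        if f(x) - f(x_new) < tol:   # progress below the tolerance
            return x_new
        x = x_new
    return x

# Illustration on Example 1's objective f(x,y,z) = x^2 + y^2 + z^2 + 1
# (the starting point is an arbitrary choice):
f = lambda p: p @ p + 1
grad_f = lambda p: 2 * p
print(gradient_descent(f, grad_f, x0=[1.0, -2.0, 0.5]))  # close to (0, 0, 0)
```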
<br>
**Example 2**
Find $\arg\min_{(x,y,z) \in \mathbb R^3} \,(x^2+y^2+z^2+1)$ using GD with learning rate $0.1$.
:::spoiler Answer
Since $\nabla (x^2+y^2+z^2+1)=(2x,2y,2z)$, we obtain
\begin{align}
(x_n,y_n,z_n) &= (x_{n-1},y_{n-1},z_{n-1}) - 0.1(2x_{n-1},2y_{n-1},2z_{n-1})\\ &=(0.8x_{n-1},0.8y_{n-1},0.8z_{n-1}) =\cdots=(0.8)^n(x_0,y_0,z_0)\,, \,\,\,n\geq 1.
\end{align} Therefore, $\lim_{n\to \infty} (x_n,y_n,z_n) = (0,0,0)\,$ as expected.
:::
<br>
Finally, let's look at the `python` implementation of the algorithm found at [this link](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21) and the visualisation in [1] to see how the learning rate and the nature of the function affect the effectiveness of the algorithm.

---
**References**
1. Pedregosa, F., "Gradient Descent", *Keep the gradient flowing*, 10/11/2024, https://fa.bianp.net/teaching/2018/eecs227at/gradient_descent.html
2. *Chapter 7.1* : Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.
---
<br>
<!---
## 3.8 Application 2: Backpropagation
In order to minimise the loss functions when training ML algorithms, one has to compute the partial derivatives of loss functions with respect to the parameters of a neural network. **Backpropagation**, introduced by Kelley (1960) and Linnainmaa (1970), later popularised by Hinton (2024 Nobel Physics Prize) and collaborators, is a method to do this. Backpropagation is simply the Chain rule but done in a smarter way to reduce repititions of computations, and hence, save computational time (and in fact, it saves an enormous amount of time.)
**Example 1: Linear Classifiers.**
Typically, there are many training data points $(x_j,t_j), j=1,\dots,N.$ For the calculations to be simple, we assume $N=1$. The model is given by the line $z=wx+b$ and activation function (to keep values between 0 and 1) $y=\sigma(z)$ where $\sigma$ is the sigmoid function.

We would like to minimize squared error $\mathcal L=0.5(y-t)^2$. We will choose appropriate weights $w$ and $b$, to minimise it using GD. To this end, we have to compute $(\mathcal L)_w =\bar w$ and $(\mathcal L)_b = \bar b$ (where we have adapted the notation $(\mathcal L)_p =: \bar p$).

<sub><sup>(taken from [1])</sub></sup>
In fact, we can add the regulariser $\mathcal R=0.5 w^2$ to obtain the loss function $\mathcal L_{reg}= \mathcal L+ \lambda \mathcal R$ (which is better conditioned for GD). We will choose apprpriate weights $w$ and $b$, and hyperparameter $\lambda$ to minimise it using GD.
Given below is the **computation graph**.

<sub><sup>(taken from [1])</sub></sup>
The nodes in the graph correspond to all the values that are computed, with edges to indicate which values are computed from which other values.

<sub><sup>(taken from [1])</sub></sup>
For the univariate logistic classifier example:

<sub><sup>(taken from [1])</sub></sup>
This example was adapted from [1]. If you are interested in seeing a less mathematical and more intuitive description visit [this link](https://www.ibm.com/think/topics/backpropagation).
---
**References**
1. Grosse, R., Backpropagation, *Intro to Neural Networks and Machine Learning*, 10/11/2024, https://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L06%20Backpropagation.pdf
2. *Chapter 5.6* : Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.
---
<br>
--->
:::danger
**Summary**: Now, we can
- Compute derivatives of composite functions of several variables using the chain rule.
- Calculate the gradient of a function.
- Use the directional derivative to determine the rate of change of a function.
- Determine the max/min rate of change of a function of several variables.
- Describe the gradient descent algorithm.
:::