## Understanding Why the Chain Rule Works
> _"Mathematical intuition is the ability to see the truth without first having gone through a formal process of reasoning."_
> — Henri Poincaré
The goal of this post is to get you to think about the chain rule a bit more deeply.
The only prerequisite for reading this is understanding the power rule (i.e., $\dfrac{d}{dx}[x^n] = nx^{n-1}$). Understanding [differentiation intuitively](https://hackmd.io/VpahY4TPR8WwgHOL9fxXhA) is recommended.
The chain rule is a differentiation method for composite functions. For a function $f(u)$ where $u = g(x)$, it states that $$\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u)*\dfrac{d}{dx}(u)$$
That is: differentiate $f(u)$ with respect to $u$, treating $u$ as an independent variable; differentiate $g(x)$ with respect to $x$; then multiply the two results.
For example, to differentiate $(2x^2 + 10)^2$, we set $g(x) = 2x^2 + 10$ and $u = g(x)$, so that the whole expression is $f(g(x))$ with $f(u) = u^2$.
Now, let's use the chain rule: $\dfrac{d}{du}f(u) = 2u$, $\dfrac{d}{dx}(u) = 4x$ and $\dfrac{d}{dx}[f(u)] = 2u*4x$. Recall that $u = g(x) = 2x^2 + 10$, so we have $\dfrac{d}{dx}[f(u)] = 2(2x^2 + 10)*4x = 16x^3 + 80x$.
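
If you want to convince yourself that this result is right, one option is to compare it with a numerical estimate of the slope. The snippet below is a rough sketch in Python: the function names and the step size `h` are choices made for this illustration, and the central difference is only an approximation of the derivative.

```python
def g(x):
    # Inner function: u = g(x) = 2x^2 + 10
    return 2 * x**2 + 10


def f(u):
    # Outer function: f(u) = u^2
    return u**2


def composite(x):
    # f(g(x)) = (2x^2 + 10)^2
    return f(g(x))


def chain_rule_derivative(x):
    # f'(u) = 2u evaluated at u = g(x), multiplied by g'(x) = 4x
    return 2 * g(x) * 4 * x


def central_difference(func, x, h=1e-6):
    # Numerical estimate of the slope of func at x (an approximation)
    return (func(x + h) - func(x - h)) / (2 * h)


for x in [-2.0, 0.5, 3.0]:
    from_chain_rule = chain_rule_derivative(x)
    expanded = 16 * x**3 + 80 * x
    estimate = central_difference(composite, x)
    print(f"x={x}: chain rule={from_chain_rule}, expanded={expanded}, estimate={estimate:.4f}")
```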
---
To understand the chain rule, you must first understand how function compositions work.
If we have two functions $f(x)$ and $g(x)$, what does $f(g(x))$ mean?
It means: evaluate the function $f$ at $g(x)$ instead of at $x$. We want to study the effect of this change.
How does $f(g(x))$ change with respect to a change in $x$? Another way to put it is this: how does $f(g(x))$ change with respect to $g(x)$, and how does $g(x)$ change with respect to $x$?
Let's explore the behaviour of change in a function composition.
**Example 1**: $g(x) = 2x$ and $f(x) = x^2$ on the interval $x \in [-4, 4]$.
Let's see the graph:
<iframe src="https://www.desmos.com/calculator/yimwy1fmsz?embed" width="700" height="700" style="border: 1px solid #ccc" frameborder=0></iframe>
Let's build the intuition for this by looking at the table of values of $g(x)$, $f(x)$, $f(g(x))$ and $\dfrac{d}{dx}[f(g(x))]$ on the interval.
| $x$ | $g(x) = 2x$ | $\dfrac{d}{dx}[g(x)] = 2$ | $f(x) = x^2$ | $\dfrac{d}{dx}[f(x)] = 2x$ | $f(g(x)) = 4x^2$ | $\dfrac{d}{dx}[f(g(x))] = 8x$ |
| ---- | ----------- | ------------------------- | ------------ | -------------------------- | ---------------- | ----------------------------- |
| $-4$ | $-8$ | $2$ | $16$ | $-8$ | $64$ | $-32$ |
| $-3$ | $-6$ | $2$ | $9$ | $-6$ | $36$ | $-24$ |
| $-2$ | $-4$ | $2$ | $4$ | $-4$ | $16$ | $-16$ |
| $-1$ | $-2$ | $2$ | $1$ | $-2$ | $4$ | $-8$ |
| $0$ | $0$ | $2$ | $0$ | $0$ | $0$ | $0$ |
| $1$ | $2$ | $2$ | $1$ | $2$ | $4$ | $8$ |
| $2$ | $4$ | $2$ | $4$ | $4$ | $16$ | $16$ |
| $3$ | $6$ | $2$ | $9$ | $6$ | $36$ | $24$ |
| $4$ | $8$ | $2$ | $16$ | $8$ | $64$ | $32$ |
The graph shows that, on this interval, the maximum value of $f(x)$ is $16$ while that of $f(g(x))$ is $64$. That's a scale-up.
Can you see a pattern? Every value of $f(g(x))$ is $4$ times the value of $f(x)$.
We can check this by picking any two pairs of values of $f(x)$ and $f(g(x))$ (each pair taken at the same $x$) and finding the slope between them (i.e., $\dfrac{y_2 - y_1}{x_2 - x_1}$, with $f(x)$ as the horizontal coordinate and $f(g(x))$ as the vertical one). For example, using the pairs ($16$, $64$) and ($4$, $16$), we have $\dfrac{64 - 16}{16 - 4} = \dfrac{48}{12} = 4$. Using another two pairs, ($9$, $36$) and ($1$, $4$), we have $\dfrac{4 - 36}{1 - 9} = \dfrac{-32}{-8} = 4$.
This is easy to see because $f(g(x)) = 4x^2$ is simply $4$ times $f(x) = x^2$. Other patterns you can notice are as follows:
- The values of $\dfrac{d}{dx}[f(g(x))]$ are $4$ times the values of $\dfrac{d}{dx}[f(x)]$
- Every value of $f(g(x))$ and $\dfrac{d}{dx}[f(g(x))]$ in the table is divisible by $4$
- The difference between any two values of $f(g(x))$, or between any two values of $\dfrac{d}{dx}[f(g(x))]$, in the table is divisible by $4$
This means that for every change in the value of $f(x)$ there's a $4\times$ change in the value of $f(g(x))$. This sounds weird, right? We just represented the rate of change between two somewhat independent functions as if they were dependent on each other. If this relationship were a function, it would be $h(x) = 4x$, so that $\dfrac{d}{dx}[h(x)] = 4$. This works because $f(g(x))$ is a composition of $f(x)$ and $g(x)$, and here $f(x) = x^2$ turns the factor of $2$ in $g(x) = 2x$ into a factor of $2^2 = 4$.
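
The patterns in the table can also be checked mechanically. Here is a minimal sketch in Python that rebuilds the rows of Example 1 and computes the last column using the chain rule; the helper names (`f_prime`, `g_prime`, `rows`) are made up for this illustration.

```python
def g(x):
    return 2 * x


def g_prime(x):
    # d/dx [2x] = 2, the same at every x
    return 2


def f(x):
    return x**2


def f_prime(x):
    # d/dx [x^2] = 2x
    return 2 * x


rows = []
for x in range(-4, 5):
    composite = f(g(x))                           # f(g(x)) = 4x^2
    composite_slope = f_prime(g(x)) * g_prime(x)  # chain rule: 2(2x) * 2 = 8x
    rows.append((x, f(x), f_prime(x), composite, composite_slope))

for x, fx, dfx, fgx, dfgx in rows:
    print(f"x={x:>2}  f(x)={fx:>2}  f'(x)={dfx:>2}  f(g(x))={fgx:>2}  d/dx f(g(x))={dfgx:>3}")

# Every composite value (and slope) is 4 times its counterpart for f alone
print(all(fgx == 4 * fx and dfgx == 4 * dfx for _, fx, dfx, fgx, dfgx in rows))
```

Note that the factor of $4$ is specific to this pair of functions; the chain rule line itself (`f_prime(g(x)) * g_prime(x)`) is the part that carries over to other examples.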
Let's try this for $x$ and $g(x)$. But then, you might ask: Is $g(x)$ a composition of $x$? Let's see.
If we created a function $v(x) = x$, then $g(v(x))$ equals $g(x)$. This is true for every function of $x$. Having set this foundation, we can see that for every change in $x$, $g(x)$ changes by a factor of $2$.
**Note**: You won't always have a constant factor between $f(x)$ and $f(g(x))$ or between their derivatives. But most of the time $g(x)$ affects how $f(g(x))$ changes compared to $f(x)$. It can make the change **faster** or **slower**, **scale it up** or **scale it down**, or even do a combination of these along certain intervals.
The examples below show some of these other behaviours of change.
**Example 2**: $g(x) = -x$ and $f(x) = x^3$ on the interval $x \in [-3, 3]$.
<iframe src="https://www.desmos.com/calculator/ozvt0afldc?embed" width="700" height="700" style="border: 1px solid #ccc" frameborder=0></iframe>
| $x$ | $g(x) = -x$ | $\dfrac{d}{dx}[g(x)] = -1$ | $f(x) = x^3$ | $\dfrac{d}{dx}[f(x)] = 3x^2$ | $f(g(x)) = -x^3$ | $\dfrac{d}{dx}[f(g(x))] = -3x^2$ |
| ---- | ----------- | -------------------------- | ------------ | ---------------------------- | ---------------- | -------------------------------- |
| $-3$ | $3$ | $-1$ | $-27$ | $27$ | $27$ | $-27$ |
| $-2$ | $2$ | $-1$ | $-8$ | $12$ | $8$ | $-12$ |
| $-1$ | $1$ | $-1$ | $-1$ | $3$ | $1$ | $-3$ |
| $0$ | $0$ | $-1$ | $0$ | $0$ | $0$ | $0$ |
| $1$ | $-1$ | $-1$ | $1$ | $3$ | $-1$ | $-3$ |
| $2$ | $-2$ | $-1$ | $8$ | $12$ | $-8$ | $-12$ |
| $3$ | $-3$ | $-1$ | $27$ | $27$ | $-27$ | $-27$ |
**Example 3**: $g(x) = x + 1$ and $f(x) = x^2+x$ on the interval $x \in [-2, 2]$.
<iframe src="https://www.desmos.com/calculator/1snsjfddwm?embed" width="700" height="700" style="border: 1px solid #ccc" frameborder=0></iframe>
| $x$ | $g(x) = x + 1$ | $\dfrac{d}{dx}[g(x)] = 1$ | $f(x) = x^2 + x$ | $\dfrac{d}{dx}[f(x)] = 2x + 1$ | $f(g(x)) = x^2 + 3x + 2$ | $\dfrac{d}{dx}[f(g(x))] = 2x + 3$ |
| ---- | -------------- | ------------------------- | ---------------- | ------------------------------ | ------------------------ | --------------------------------- |
| $-2$ | $-1$ | $1$ | $2$ | $-3$ | $0$ | $-1$ |
| $-1$ | $0$ | $1$ | $0$ | $-1$ | $0$ | $1$ |
| $0$ | $1$ | $1$ | $0$ | $1$ | $2$ | $3$ |
| $1$ | $2$ | $1$ | $2$ | $3$ | $6$ | $5$ |
| $2$ | $3$ | $1$ | $6$ | $5$ | $12$ | $7$ |
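
In Examples 2 and 3 there is no constant factor between $f(x)$ and $f(g(x))$, but the last column of each table still comes out of the chain rule. The sketch below, in Python, compares $f'(g(x)) \cdot g'(x)$ with the formula in each table header; the dictionary layout and the names `examples` and `chain_rule` are assumptions made for this illustration.

```python
examples = {
    "Example 2": {
        "g": lambda x: -x,
        "g_prime": lambda x: -1,
        "f_prime": lambda u: 3 * u**2,        # derivative of f(u) = u^3
        "table_column": lambda x: -3 * x**2,  # last column of the table
        "xs": range(-3, 4),
    },
    "Example 3": {
        "g": lambda x: x + 1,
        "g_prime": lambda x: 1,
        "f_prime": lambda u: 2 * u + 1,       # derivative of f(u) = u^2 + u
        "table_column": lambda x: 2 * x + 3,  # last column of the table
        "xs": range(-2, 3),
    },
}


def chain_rule(example, x):
    # Differentiate the outer function at u = g(x), then multiply by g'(x)
    return example["f_prime"](example["g"](x)) * example["g_prime"](x)


for name, example in examples.items():
    for x in example["xs"]:
        from_chain_rule = chain_rule(example, x)
        from_table = example["table_column"](x)
        print(f"{name}: x={x:>2}  chain rule={from_chain_rule:>3}  table={from_table:>3}")
```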
---
The chain rule answers the question: *how do you find the rate of change of a function with respect to a variable when the function depends on that variable only through another function?*
### Why is $g(x)$ treated like a variable?
Why is the chain rule $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u)*\dfrac{d}{dx}(u)$ and not $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u)+\dfrac{d}{dx}(u)$?
We will look at this in two ways: the first is a bit hand-wavy and the other is more intuitive.
For the first one, the intuition lies in the definition: as $g(x)$ changes, $f(g(x))$ changes, *and* as $x$ changes, $g(x)$ changes. For clarity, let's break this statement into three parts:
- as $g(x)$ changes, $f(g(x))$ changes: here $g(x)$ is treated as a variable
- as $x$ changes, $g(x)$ changes
- *and*: from Boolean (binary) logic, we know that *and* corresponds to multiplication, just like *or* corresponds to addition.
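
To make the *and* step a little less hand-wavy, here is a rough sketch using small changes. The symbols $\Delta x$, $\Delta u$ and $\Delta f$ are not part of the notation used so far; they are introduced here to stand for a small change in $x$, the resulting change in $u = g(x)$, and the resulting change in $f(u)$. The sketch assumes $\Delta u \neq 0$:

$$\dfrac{\Delta f}{\Delta x} = \dfrac{\Delta f}{\Delta u}*\dfrac{\Delta u}{\Delta x}$$

The $\Delta u$ cancels, so the product of the two average rates of change is exactly the overall average rate of change. As $\Delta x$ shrinks towards $0$, the two ratios on the right approach $\dfrac{d}{du}f(u)$ and $\dfrac{d}{dx}(u)$, which is the chain rule. Replacing the multiplication with addition leaves nothing to cancel, so the two sides no longer match in general.
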
Let's look at the more intuitive version: currency conversions!
I have **Nigerian Naira (NGN)** and I want to convert it to **Pounds sterling (GBP)**. But there's a little challenge: there are only two exchanges available, **NGN**/**USD** and **USD**/**GBP**. We would have to convert from **NGN** to **USD** and then from **USD** to **GBP**. The rates are **1 USD = NGN1,500** and **1 GBP = 1.25 USD** respectively. How do we convert **NGN20,000** to **GBP**?
First, we would convert to **USD**. **NGN20,000** to **USD** is **20,000/1500 = 13.33 USD**. Then, **USD** to **GBP** is **13.33 / 1.25 = 10.67 GBP**. That is, **NGN20,000** equals **10.67 GBP**. To get the rate of **NGN** to **GBP**, we multiply both rates (i.e., **1/1500 x 1/1.25 = 1/1875**).
The interesting thing is we can convert these individual conversions to functions. The function for converting **NGN** to **USD** would be $g(x) = \dfrac{x}{1500}$ and the one for converting from **USD** to **GBP** would be $f(x) = \dfrac{x}{1.25}$. The most interesting part of this is that the function for converting from **NGN** to **GBP** is the composition of $f(x)$ and $g(x)$! That is, $f(g(x)) = \dfrac{\dfrac{x}{1500}}{1.25} = \dfrac{x}{1875}$.
If we apply the chain rule to the composite function $f(u)$ where $u = g(x)$, we have $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u)*\dfrac{d}{dx}(u) = \dfrac{1}{1.25}*\dfrac{1}{1500}=\dfrac{1}{1875}$. That is our expected rate!
But if we change multiplication to addition, we have $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u)+\dfrac{d}{dx}(u) = \dfrac{1}{1.25}+\dfrac{1}{1500}=\dfrac{1201}{1500}$. This is very wrong!
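
The same comparison can be run as a small Python sketch. The constant and function names are made up for this illustration, and the rates are the ones used in the example above, not real market rates.

```python
NGN_PER_USD = 1500   # 1 USD = NGN1,500 (rate from the example)
USD_PER_GBP = 1.25   # 1 GBP = 1.25 USD (rate from the example)


def ngn_to_usd(ngn):
    # g(x) = x / 1500
    return ngn / NGN_PER_USD


def usd_to_gbp(usd):
    # f(x) = x / 1.25
    return usd / USD_PER_GBP


def ngn_to_gbp(ngn):
    # The composition f(g(x)) = x / 1875
    return usd_to_gbp(ngn_to_usd(ngn))


# Multiplying the two rates (chain rule) versus adding them
product_rate = (1 / USD_PER_GBP) * (1 / NGN_PER_USD)
sum_rate = (1 / USD_PER_GBP) + (1 / NGN_PER_USD)

amount = 20_000
print(round(ngn_to_gbp(amount), 2))     # two-step conversion
print(round(amount * product_rate, 2))  # matches the two-step conversion
print(round(amount * sum_rate, 2))      # far too large to be a sensible GBP amount
```

With the product of the rates, **NGN20,000** comes out at about **10.67 GBP**, the same as converting in two steps. With the sum of the rates, it comes out at over **16,000 GBP**.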
I implore you to think of other ways, aside from currency conversions, in which this can be shown!
---
I hope you were able to think wide and far about function compositions and the chain rule!