571: Ch5 Differentiation
=====
# The derivative of a function f:ℝ→ℝ
Let's define and prove properties of derivatives so that the definitions and theorems immediately lift from $f:\mathbb{R}\to\mathbb{R}$ to $f:X \to Y$ whenever $X, Y$ are spaces like $\mathbb{R}^n$ or $\mathbb{C}^n$, and even to more general settings.
Given a function $f:E\subseteq \mathbb{R} \to \mathbb{R}$ we say $f$ is *differentiable* at $a\in E$ if there is a linear map $D_f(a):\mathbb{R}\to\mathbb{R}$ so that
$$
\lim_{h \to 0}\frac{(f(a+h) - f(a))-D_f(a)(h)}{h} = 0
$$
This is sometimes written
$$
f(a+h) - f(a) - D_f(a)(h) = o_f(h)\text{ where }\lim_{h\to 0}\frac{o_f(h)}{h}=0.
$$
So $D_f(a)$ is a **first order** linear approximation to $f(a+h)-f(a)$, in the variable $h$. Since $D_f(a)$ is a linear map, we know $D_f(a)(h) = mh$ for some scalar $m$. (In higher dimensions, this would be $D_f(a)(h)=Ah$ for some matrix $A$.) We can use the above to find $m$:
$$
\lim_{h \to 0}\frac{(f(a+h) - f(a))-mh}{h} = 0\overset{\text{iff}}{\Longleftrightarrow} \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}=m
$$
Often this $m$ is denoted
$$
m = \left.\frac{d}{dx} f\,\right|_{x=a}=f'(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}
$$
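Before moving on, here is a quick numerical sanity check (a throwaway Python sketch, not part of the development): for $f(x)=x^2$ at $a=1.5$ the difference quotient should approach $m=2a=3$.
```python
# Difference quotient (f(a+h) - f(a))/h for f(x) = x^2 at a = 1.5.
# It should approach m = 2a = 3 as h -> 0.
def f(x):
    return x * x

a = 1.5
for h in [10.0**(-k) for k in range(1, 8)]:
    quotient = (f(a + h) - f(a)) / h
    print(f"h = {h:.0e}  quotient = {quotient:.10f}  error = {abs(quotient - 2*a):.2e}")
```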
If $f$ is differentiable at $a$, then
$$
|f(a+h)-f(a)| \le |o_f(h)|+|D_f(a)(h)|= |o_f(h)|+|mh|\to 0\text{ as }h\to 0
$$
and hence $f$ is continuous at $a$.
From this definition various theorems follow easily:
\begin{align}
(f+g)(a+h)&-(f+g)(a) - (D_f(a)+D_g(a))(h) \\&=
(f(a+h)-f(a)-D_f(a)(h))+(g(a+h)-g(a)-D_g(a)(h))\\
&=o_f(h)+o_g(h)\\&=o(h)
\end{align}
Verify that $\lim_{h\to 0}\frac{o(h)}{h}=0$ and we have shown $D(f+g)(a)=D_f(a)+D_g(a)$. (**Sum Rule**)
Similarly,
\begin{align}
(f\cdot g)(a+h)&-(f\cdot g)(a)-[f(a)\cdot D_g(a)(h)+D_f(a)(h)\cdot g(a)]\\
&= f(a+h)\cdot g(a+h)-f(a)\cdot g(a+h) + f(a)\cdot g(a+h) \\
&\qquad\qquad - f(a)\cdot g(a) - f(a)\cdot D_g(a)(h) -D_f(a)(h)\cdot g(a)\\
&\qquad\qquad + D_f(a)(h)\cdot g(a+h)-D_f(a)(h)\cdot g(a+h)\\
&= [f(a+h)-f(a)-D_f(a)(h)]\cdot g(a+h) \\
&\qquad\qquad + [g(a+h)-g(a)-D_g(a)(h)]\cdot f(a)\\
&\qquad\qquad +(g(a+h)-g(a))\cdot D_f(a)(h)\\
&=o_f(h)\cdot g(a+h) + o_g(h)\cdot f(a) + (g(a+h)-g(a))D_f(a)(h)\\
&=o(h)
\end{align}
Verify that $\lim_{h\to 0}\frac{o(h)}{h}=0$ and we have $D(f\cdot g)(a)(h)=f(a)\cdot D_g(a)(h)+ D_f(a)(h)\cdot g(a)$. (**Product Rule**)
**Note:** The above works for *any* product-like operation $\star$ that satisfies certain distributivity laws:
\begin{align}
(f+g)\star h&=(f\star h)+(g\star h)\\
h\star(f+g)&=(h\star f)+(h\star g)
\end{align}
This shows that the *product* rule holds for the inner product $(\langle f,g\rangle)$, the cross product $(f\times g)$, the commutator $([f,g]=f\cdot g-g\cdot f)$, etc.
An even more telling example is:
\begin{align}
(f\circ g)(a+h)&-(f\circ g)(a) =
f(g(a)+\boxed{D_g(a)(h)+o_g(h)}) - f(g(a)) \\
&=f(g(a)) + D_f(g(a))(D_g(a)(h)+o_g(h))+o_f(k(h))-f(g(a))\\
&=D_f(g(a))(D_g(a)(h)+o_g(h))+o_f(k(h))\\
&=(D_f(g(a))\circ D_g(a))(h) +D_f(g(a))(o_g(h))+o_f(k(h))\\
&=(D_f(g(a))\circ D_g(a))(h) + o(h)
\end{align}
where $k(h)=D_g(a)(h)+o_g(h)$ and $o(h) = D_f(g(a))(o_g(h))+o_f(k(h))$. The term $o_f(k(h))$ satisfies $\lim_{h\to 0}\frac{o_f(k(h))}{h}=0$ since $\frac{k(h)}{h}$ stays bounded near $0$ while $\frac{o_f(k)}{k}\to 0$ as $k\to 0$ (this step is carried out carefully in the general chain rule proof below). For the other term, what we need to see is that
$$
\lim_{h\to 0}\frac{D_f(g(a))(o_g(h))}{h}=
D_f(g(a))\Bigl(\lim_{h\to 0}\frac{o_g(h)}{h}\Bigr)=D_f(g(a))(0)=0
$$
**Note:** We used that $D_f(g(a))$ is linear to get
$$
D_f(g(a))(D_g(a)(h)+o_g(h)) = D_f(g(a))(D_g(a)(h))+D_f(g(a))(o_g(h))
$$
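Here is the chain rule checked numerically on a concrete pair (a minimal sketch; the choices $f=\sin$ and $g(x)=x^2$ are just for illustration):
```python
import math

# Chain rule check: (f∘g)'(a) = f'(g(a)) * g'(a) for f = sin, g(x) = x^2.
a = 1.2
exact = math.cos(a * a) * (2 * a)          # D_f(g(a)) ∘ D_g(a) applied to 1

for h in [1e-2, 1e-4, 1e-6]:
    quotient = (math.sin((a + h)**2) - math.sin(a * a)) / h
    print(f"h = {h:.0e}  quotient = {quotient:.8f}  exact = {exact:.8f}")
```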
## Local Extrema
Let $a$ be a local extremum of $f$ at which $f$ is differentiable; then $D_f(a)$ is the (constant) $0$ map. WLOG assume $a$ is a local maximum. If $D_f(a)$ is not the $0$ map, there is $h$ so that $D_f(a)(h)\neq 0$. If $D_f(a)(h)<0$, then $D_f(a)(-h)>0$, so again WLOG assume $D_f(a)(h)>0$. By linearity, $D_f(a)(\gamma h)=\gamma D_f(a)(h)$ and
$$
f(a+\gamma h)-f(a)=\gamma D_f(a)(h)+o_f(\gamma h)
$$
and $\lim_{\gamma\to 0}\frac{o_f(\gamma h)}{\gamma |h|}=0$, so there is $\delta>0$ so that for $0<\gamma<\delta$, $\left|\frac{o_f(\gamma h)}{\gamma|h|}\right|<\frac{1}{|h|}\frac{D_f(a)(h)}{2}$. So for such $\gamma$, $|o_f(\gamma h)|<\gamma \frac{D_f(a)(h)}{2}$ and thus $f(a+\gamma h)-f(a)>\gamma \frac{D_f(a)(h)}{2}>0$. So $f(a+\gamma h)>f(a)$ for all $\gamma\in(0,\delta)$. This contradicts $a$ being a local maximum.
**Remark** This argument is more convoluted than necessary: for $f:\mathbb R\to\mathbb R$ we have $D_f(a)(h)=mh$ and we could just argue from $m$. But more generally, for $f:\mathbb R^m\to \mathbb R$ we have $D_f(a)(h)=Ah$ where $A$ is a $(1\times m)$-matrix. The argument provided works in this more general setting, essentially showing that if $a$ is a local extreme point, then $A=O_{1\times m}$. Later we will see that this corresponds to all partial derivatives being $0$.
A point where $D_f(a)\equiv 0$ (is the 0 linear map) is called a ***stationary point*** or ***critical point***. The above shows that all local extrema that occur at a point where $f$ is differentiable are stationary points.
## Mean Value Theorem
**Rolle's Theorem** Suppose $f(a)=f(b)$, $f$ is continuous on $[a,b]$, and differentiable on $(a,b)$. Then there is $c\in(a,b)$ so that $f'(c)=0$.
**Proof.** Since $f$ is continuous on $[a,b]$ it attains its max and min. If both occur at the endpoints, then the max and min are the same and $f$ is constant, hence $f'(c)=0$ for all $c\in(a,b)$. Otherwise, there is a local extremum in the interval $(a,b)$ and hence $f'(c)=0$ there.
❏
**Mean Value Theorem** Assume $f$ is continuous on $[a,b]$ and differentiable on $(a,b)$; then there is $c\in (a,b)$ so that $f'(c)=\frac{f(b)-f(a)}{b-a}$.
**Proof.** Apply Rolle's Theorem to $h(x) = f(x) - rx$ where $r = \frac{f(b)-f(a)}{b-a}$; note $h(a)=h(b)$ and $h'(c)=f'(c)-r$.
❏
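We can watch the MVT produce its point $c$ concretely (a sketch; the function $f(x)=x^3$ on $[0,2]$ is my choice, and I locate $c$ by bisection, which works here since $f'$ is increasing):
```python
# For f(x) = x^3 on [0, 2]: r = (f(2) - f(0))/(2 - 0) = 4, and the MVT point
# satisfies f'(c) = 3c^2 = 4, i.e., c = 2/sqrt(3).
def fprime(x):
    return 3 * x * x

a, b = 0.0, 2.0
r = (b**3 - a**3) / (b - a)

lo, hi = a, b                       # fprime(c) - r changes sign on [0, 2]
for _ in range(60):
    mid = (lo + hi) / 2
    if fprime(mid) < r:
        lo = mid
    else:
        hi = mid

print(lo, 2 / 3**0.5)               # both ≈ 1.1547005383792515
```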
**Corollary** If $f$ is continuous on $[a,b]$ and $f'(x)\ge 0$ on $(a,b)$, then $f$ is increasing on $[a,b]$. Similarly, if $f'(x)\le 0$ on $(a,b)$, then $f$ is decreasing, and if $f'(x)=0$ on $(a,b)$, then $f$ is constant.
❏
There are many variants of this with $\ge$ replaced by $>$ and "increasing" replaced by "strictly-increasing," etc.
### Cauchy Mean Value Theorem
**Theorem** Suppose $f$ and $g$ are differentiable on $(a,b)$ and continuous on $[a,b]$, then there is $c\in(a,b)$ such that $f'(c)(g(b)-g(a)) = g'(c)(f(b)-f(a))$.
Clearly this is a strengthening of the MVT, obtained by taking $g(x)=x$. The proof is essentially the same.
**Proof.** Let $h(x)=f(x)(g(b)-g(a)) - g(x)(f(b)-f(a))$; then $h(a)=h(b)$, $h$ is differentiable on $(a,b)$, and continuous on $[a,b]$. By Rolle's Theorem there is $c\in(a,b)$ with $h'(c)=0$; clearly, this $c$ is what we were looking for.
❏
This version of the MVT immediately gives L'Hospital's rule.
### L'Hospital
Suppose $f(a-)=\lim_{x\to a^-}f(x)=0=\lim_{x\to a^-}g(x)=g(a-)$. If $\lim_{x\to a^-}\frac{f'(x)}{g'(x)}=L$ and $g(x), g'(x)\neq 0$ for $x\in (h,a)$ for some $h<a$, then $\lim_{x\to a^-}\frac{f(x)}{g(x)}=L$.
**Proof.** We may assume $g(a)=f(a)=0$ (extending by continuity), that $f$ and $g$ are differentiable on $(h,a)$ and continuous on $[h,a]$, and that $g(x),g'(x)\neq 0$ on $(h,a)$. Then by the Cauchy Mean Value Theorem applied on $[h,a]$ there is $c$ so that $h<c<a$ and
$$
\frac{f'(c)}{g'(c)}=\frac{f(h)}{g(h)}
$$
Taking the limit as $h\to a^-$ we see also that $c\to a^-$ as well and so
$$
\lim_{h\to a^-}\frac{f(h)}{g(h)}=\lim_{c\to a^-}\frac{f'(c)}{g'(c)}
$$
if the latter exists.
❏
**Note.** There are many variants of L'Hospital's rule. The limits can be one sided, two sided, $a$ can be $\pm\infty$, or the limit itself might be infinite. It is better to understand the proof than to remember all the cases.
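A small numerical illustration (not a proof, of course): take $f(x)=1-\cos x$ and $g(x)=x^2$ as $x\to 0$; both ratios head to $\frac12$.
```python
import math

# f(x) = 1 - cos(x) and g(x) = x^2 both -> 0 as x -> 0, while
# f'(x)/g'(x) = sin(x)/(2x) -> 1/2; L'Hospital says f/g -> 1/2 too.
for x in [0.5, 0.1, 0.01, 0.001]:
    print(f"x = {x:<6}  f/g = {(1 - math.cos(x)) / x**2:.8f}"
          f"  f'/g' = {math.sin(x) / (2 * x):.8f}")
```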
### IVT for derivatives
**Theorem** If $f$ is differentiable on $[a,b]$, and $f'(a)<\lambda < f'(b)$, then for some $c\in(a,b)$, $f'(c)=\lambda$.
**Proof** Consider $g(x)=f(x)-\lambda x$, so $g$ is differentiable on $[a,b]$ and hence continuous on $[a,b]$. Note that $g'(a)<0<g'(b)$. This means that for some $\delta>0$ and all $h\in(0,\delta)$, $g(a+h)-g(a)<0<g(b)-g(b-h)$, so $g(x)<g(a)$ on $(a,a+\delta)$ and $g(x)<g(b)$ on $(b-\delta,b)$. We know $g$ has a min on $[a,b]$ at some $c$, and thus $c\in(a,b)$. But then $g'(c)=f'(c)-\lambda=0$, so $f'(c)=\lambda$.
❏
**Corollary** $f'$ has no "jump" discontinuities.
## Inversion
Let $f:E\subseteq\mathbb R\to \mathbb R$ be continuously differentiable in a neighborhood of $a\in E$ with $D_f(a)\neq 0$, i.e., $f'(a)\neq 0$. Then there is a neighborhood of $a$ on which $f$ is 1-1, and if $g$ is the "local inverse," there is a neighborhood of $f(a)=b$ on which $g$ is continuously differentiable with $D_g(b) = D_f(a)^{-1}$, i.e., $g'(b)=\frac{1}{f'(a)}$. This can be written:
$$
D_{f^{-1}}(f(a))=D_f(a)^{-1}\text{ or } (f^{-1})'(f(a)) = \frac{1}{f'(a)}
$$
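To see $g'(b)=\frac{1}{f'(a)}$ numerically (a sketch; I picked $f(x)=x^3+x$, which is strictly increasing, and invert it by bisection so no closed-form inverse is needed):
```python
# For f(x) = x^3 + x we have f(1) = 2 and f'(1) = 4, so (f^{-1})'(2) should be 1/4.
def f(x):
    return x**3 + x

def f_inverse(y, lo=-10.0, hi=10.0):
    # f is strictly increasing, so bisection recovers the local inverse g
    for _ in range(80):
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b, h = 2.0, 1e-6
print((f_inverse(b + h) - f_inverse(b)) / h, 1 / 4)   # both ≈ 0.25
```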
## Taylor's Theorem
Given a function represented by a power series on $(c-R, c+R)$, say $f(x)=\sum a_i(x-c)^i$, we have by a simple calculation that $a_i = \frac{f^{(i)}(c)}{i!}$. Now given a function $f(x)$ with derivatives of all orders defined on $(c-R,c+R)$, it makes sense to ask if $f(x) = \sum \frac{f^{(i)}(c)}{i!}(x - c)^i$.
**Notation:** $C^\omega(D)$ is the collection of functions analytic on $D$. So here we are asking about the relationship between $C^\infty(D)$ and $C^\omega(D)$.
In general, if $I$ is an interval, $f$ has $n$ derivatives on $I$, and $c\in I$, define the $n^{\text{th}}$ degree Taylor polynomial centered at $c$ to be
$$
P_n(c,x)=\sum_{i=0}^n\frac{f^{(i)}(c)}{i!}(x-c)^i
$$
and set
$$
R_n(c,x)=f(x)-P_n(c,x).
$$
**Notation** When $c=0$ we write just $P_n(x)$ and $R_n(x)$. When we need to make $f$ clear, $P^f_n(c,x)$ and $R_n^f(c,x)$ can be used.
Notice
$$
f(x)=\sum_{i=0}^\infty \frac{f^{(i)}(c)}{i!}(x-c)^i \overset{\text{iff}}{\Longleftrightarrow}
\lim_{n\to \infty}R_n(c,x)=0
$$
**Taylor's Remainder Theorem** If $x,c\in I$ for $I$ an interval and $f\in C^{n+1}(I)$, then there is $y$ between $x$ and $c$ (a different $y$ for each form below) so that
$$
R_n(c,x)=\frac{f^{(n+1)}(y)}{(n+1)!}(x-c)^{n+1} \tag{Lagrange form}
$$
$$
R_n(c,x)= \frac{f^{(n+1)}(y)}{n!}(x-c)(x-y)^n\tag{Cauchy Form}
$$
$$
R_n(c,x)=\int_c^x \frac{f^{(n+1)}(t)}{n!}(x-t)^n \,dt \tag{Integral form}
$$
**Proof:** (Lagrange Form). Fix $x\neq c$ and choose $M$ so that
$$
f(x) = P_n(c,x)+\frac{M(x-c)^{n+1}}{(n+1)!}
$$
define
$$
g(t) = P_n(c,t)+\frac{M(t-c)^{n+1}}{(n+1)!}-f(t)
$$
It is clear that $g^{(k)}(c)=0$ for $k\leq n$. Apply Rolle's theorem repeatedly. Since $g(x)=g(c)=0$, get $x_1$ between $x$ and $c$ so that $g'(x_1)=0$. Get $x_2$ between $x_1$ and $c$ so that $g^{(2)}(x_2)=0$, etc. In this way get $x_1$, $x_2$, ..., $x_{n+1}$ so that $g^{(k)}(x_k)=0$ for $k\leq n+1$.
Now $g^{(n+1)}(x_{n+1})=M-f^{(n+1)}(x_{n+1})=0$, so $M=f^{(n+1)}(x_{n+1})$ and we have
$$
f(x)=P_n(c,x)+\frac{f^{(n+1)}(x_{n+1})}{(n+1)!}(x-c)^{n+1}
$$
❏ (Lagrange Form)
**Proof:** (Integral Form). For $n=0$ we have $P_0(c,x)=f(c)$ and
$$
R_0(c,x)=f(x)-f(c)=\int_c^x f'(t)\,dt
$$
Assume we have the result for $n$, then let $u(t)=f^{(n+1)}(t)$ and $v'(t)=\frac{(x-t)^n}{n!}$ so that $u'(t)=f^{(n+2)}(t)$ and $v(t)=-\frac{(x-t)^{n+1}}{(n+1)!}$ and
\begin{align}
R_n(c,x)&=\int_c^x \frac{(x-t)^n}{n!}f^{(n+1)}(t)\,dt=\int_c^x v'(t) u(t)\,dt\\
&=\bigl(u(x)v(x)-u(c)v(c)\bigr)-\int_c^x v(t)u'(t)\,dt\\
&=f^{(n+1)}(x)\cdot 0 + f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}+\int_c^x\frac{(x-t)^{n+1}}{(n+1)!}f^{(n+2)}(t)\,dt\\
&= f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!} + \int_c^x\frac{f^{(n+2)}(t)}{(n+1)!}(x-t)^{n+1}\,dt\\
\end{align}
and
$$
\begin{split}
R_{n+1}(c,x)=f(x)&-P_{n+1}(c,x)=
(f(x)-P_n(c,x))-f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}\\
&=R_n(c,x)-f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}.
\end{split}
$$
so we have
$$
R_{n+1}(c,x)=\int_c^x\frac{f^{(n+2)}(t)}{(n+1)!}(x-t)^{n+1}\,dt
$$
❏ (Integral Form)
**Proof:** (Cauchy Form.) The mean value theorem for integrals provides a $y$ between $x$ and $c$ so that
$$
\frac{1}{x-c}\int_c^x \frac{f^{(n+1)}(t)}{n!}(x-t)^n\,dt= \frac{f^{(n+1)}(y)}{n!}(x-y)^n
$$
So
$$
R_n(c,x)=\frac{f^{(n+1)}(y)}{n!}(x-y)^n(x-c)
$$
❏ (Cauchy Form)
**Example** For $f(x)=e^x$ we have $f^{(n)}(0)=e^0=1$ for all $n$, and thus by the Lagrange form of the remainder there is some $y$ between $0$ and $x$ with
$$
\left|e^x-\sum_{i=0}^n\frac{x^i}{i!}\right|=\left|\frac{e^yx^{n+1}}{(n+1)!}\right|
$$
So it is clear that $|P_n(x)-e^x|\to 0$ as $n\to\infty$ for all $x$. So pointwise $e^x=\sum_{i=0}^\infty \frac{x^i}{i!}$.
[Visualize this :eyeglasses:](https://www.desmos.com/calculator/mdmfjxwknx)
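The same convergence is easy to watch in code (a small sketch; `partial` accumulates $P_n(x)$ term by term):
```python
import math

# Partial sums P_n(x) of sum_i x^i / i! versus e^x at x = 2.
x = 2.0
term, partial = 1.0, 1.0                  # the i = 0 term
for n in range(1, 15):
    term *= x / n                         # now term = x^n / n!
    partial += term
    print(f"n = {n:2d}  P_n = {partial:.10f}  |R_n| = {abs(math.exp(x) - partial):.2e}")
```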
**Example** Consider $f(x)=\begin{cases}e^{-1/x}&x>0\\0&x\le 0\end{cases}$
It is possible to show that $f^{(n)}(0)=0$ for all $n$, so the Taylor series is $P_\infty(x)=\sum_{i=0}^\infty \frac{0}{i!}x^i=0$, and it is clear that $f(x)\neq P_\infty(x)$ for $x>0$. This means that the remainder at $x=1$ (in Cauchy form, $R_n(0,1)=\frac{f^{(n+1)}(c)}{n!}(1-c)^n$ for some $c\in(0,1)$) does not tend to $0$.
:::spoiler **The Weeds**
Substitute $u=-\frac{1}{x}$, so $u\to-\infty$ as $x\to 0^+$ and $\frac{du}{dx}=\frac{1}{x^2}=u^2$. Writing $f(x)=e^u$ for $x>0$, we get
\begin{align}
f'(x)&=e^u\frac{du}{dx}=u^2e^u\\
f''(x)&=(2u+u^2)e^u\frac{du}{dx}=u^2(2u+u^2)e^u\\
f^{(n)}(x)&=p_n(u)e^u\\
f^{(n+1)}(x)&=u^2(p_n'(u)+p_n(u))e^u=p_{n+1}(u)e^u
\end{align}
So $p_n(u)$ is a polynomial of degree $2n$ and $\lim_{u\to-\infty}p_n(u)e^u=0$ by repeated L'Hospital. This shows that $f^{(n)}(0)=\lim_{x\to 0^+}\frac{f^{(n-1)}(x)}{x}=\lim_{u\to-\infty}(-u)p_{n-1}(u)e^u=0$. Thus $P_n(0,x)=0$, yet clearly, $f(x)\neq 0$ for $x>0$.
You can see how the terms $\frac{f^{(n)}(c)}{n!}(1-c)^n$ blow up, for $n=1,2,3,4,5$ and $c\in(0,1)$:
<iframe src="https://www.desmos.com/calculator/pdk0fg2cdt?embed" width="500" height="500" style="border: 1px solid #ccc" frameborder=0></iframe>
:::
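You can also see the flatness numerically (a quick sketch; the one-sided difference quotients for $f'(0)$ and $f''(0)$ are already indistinguishable from $0$ at modest $h$):
```python
import math

# f(x) = e^{-1/x} for x > 0 and 0 otherwise: all derivatives at 0 vanish.
def f(x):
    return math.exp(-1.0 / x) if x > 0 else 0.0

for h in [0.1, 0.05, 0.02, 0.01]:
    d1 = (f(h) - f(0.0)) / h                     # estimates f'(0+)
    d2 = (f(2*h) - 2*f(h) + f(0.0)) / h**2       # estimates f''(0+)
    print(f"h = {h:<5}  f'(0) ≈ {d1:.3e}  f''(0) ≈ {d2:.3e}")
```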
# The General Definition of the Derivative
Let $(X,\|\cdot\|_X)$ and $(Y,\|\cdot\|_Y)$ be normed topological vector spaces (TVS) over either $\mathbb{R}$ or $\mathbb{C}$. We will not recall what a vector space over $\mathbb{R}$ or $\mathbb{C}$ is, but do recall that $\|\cdot\|_X:X\to\mathbb{R}$ is a norm iff
- $\|x+y\|_X\leq \|x\|_X+\|y\|_X$
- $\|x\|_X\ge 0$ and $\|x\|_X=0\iff x=0$
- $\|\alpha x\|_X=|\alpha|\,\|x\|_X$
Recall also that a norm $\|\cdot\|_X$ gives rise to a metric by $d_X(x,x')=\|x-x'\|_X$.
Typical examples are the Euclidean spaces, $\mathbb{R}^n$ and $\mathbb{C}^n$ with the usual $2$-norm, $\|x\|_2^2=\sum_{i=1}^n|x_i|^2$. Slightly more exotic examples would be $\ell^2(\mathbb{R})$ or $\ell^2(\mathbb{C})$, the set of **square summable sequences**, with norm $\|x\|_2^2=\sum_{i=1}^\infty |x_i|^2$, or $\ell^1(\mathbb{C})$, the **absolutely summable sequences**, where the norm is given by $\|x\|_1=\sum_{i=1}^\infty|x_i|$. There are many more such examples, but for this discussion the Euclidean spaces suffice.
For these notes I will only consider real Euclidean spaces $\mathbb{R}^n$, but I will use notation that lifts readily to the more general setting described above.
Given $f:X\to Y$ and $x\in X$ an interior point in the domain of $f$ we can define the *best linear approximation to $f$ at $x$* to be the unique linear function $D_f(x):X\to Y$ so that
$$
f(x+h)-f(x)=D_f(x)(h)+o_f(x)(h)
$$
where
$$
\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0
$$
**Note**: The remainder function $o_f(x)(h)$ is completely determined by
$$
o_f(x)(h)=(f(x+h)-f(x))-D_f(x)(h)
$$
For reasons we will not get into here, we also need to insist that $D_f(x)$ is continuous. So long as $X$ is finite dimensional, this will always be the case.^[For normed TVS, a linear $L:X\to Y$ is continuous iff $\|L\|_{\text{op}}=\sup\{\|L(x)\|_Y\mid \|x\|_X=1\}<\infty$. Such linear maps are also termed *bounded linear maps*. If $X$ is finite dimensional, then all linear maps on $X$ are bounded and hence continuous.]
**Example 1:** Consider $x\mapsto x^2$ for $x\in\mathbb{R}$.
$$
(x+h)(x+h) = x^2+ 2xh + h^2
$$
So
$$
f(x+h)-f(x) = 2xh +h^2 = D_f(x)(h) + o_f(x)(h)
$$
Clearly, $D_f(x)(h) = 2xh$ is linear (in $h$) and $\|o(h)\|/\|h\| = |h|^2/|h| = |h| \to 0$ as $h\to 0$.
**Example 2:** This is essentially the same example. Now $x\in \mathbb{R}^n$ and $f(x)=x^Tx=\langle x,x\rangle = \|x\|^2$.
$$
\begin{align}
f(x+h)&=(x+h)^T(x+h)=(x^T+h^T)(x+h)\\
&=x^Tx + x^Th+h^Tx + h^Th \\
&= \|x\|^2 + (2x^T)(h) + \|h\|^2\\
&=f(x)+(2x^T)(h)+\|h\|^2\\
&=f(x)+D_f(x)(h)+o_f(x)(h)
\end{align}
$$
Again $D_f(x)(h) = (2x^T)(h)$ is linear (in $h$) and $o_f(x)(h)/\|h\|=\|h\|^2/\|h\| \to 0$ as $h\to 0$. So $D_f(x)(h)=(2x^T)(h)$ and if you like the "usual way" of writing this, $2x^T=[f_1(x),\ldots,f_n(x)]$ where $f_i$ is the partial derivative with respect to the $i$^th^ variable, and so $2x^T$ is the Jacobian of $f$ at $x$.
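Here is the defining limit checked numerically for this example (a sketch; the random $x$ and direction are arbitrary choices): the remainder is exactly $\|h\|^2$, so the ratio decays like $\|h\|$.
```python
import numpy as np

# For f(x) = x^T x the claim is f(x+h) - f(x) - 2 x^T h = o(h).
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
direction = rng.standard_normal(5)

f = lambda v: v @ v
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    remainder = f(x + h) - f(x) - 2 * (x @ h)
    print(f"||h|| = {np.linalg.norm(h):.1e}"
          f"  |o(h)|/||h|| = {abs(remainder) / np.linalg.norm(h):.3e}")
```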
It would not suffice to simply have the error $o_f(x)(h)$ satisfy $\lim_{h\to 0}o_f(x)(h)=0$. For we could take any $\alpha\in \mathbb{R}$ and set $D_f(x)(h)=\alpha h$. Simple continuity of $f$ at $x$ would imply that $o_f(x)(h)\to 0$ as $h\to 0$, so this clearly is not enough to guarantee the uniqueness of $D_f(x)$.
**Claim:** $D_f(x)$ is unique when it exists.
**Proof:** Suppose we had two candidate linear maps $D_f(x)$ and $D'_f(x)$; note that the error term is determined as $o_f(x)(h)=f(x+h)-f(x)-D_f(x)(h)$, and similarly for $o'_f(x)$. We have
$$
D_f(x)(h)-D'_f(x)(h)=o_f(x)(h)-o'_f(x)(h)
$$
So $L:X\to Y$ defined by $L(h)=D_f(x)(h)-D'_f(x)(h)$ is linear and has the property that $\frac{\|L(h)\|_Y}{\|h\|_X}\to 0$ as $h\to 0$ since $\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}\to 0$ and $\frac{\|o_f'(x)(h)\|_Y}{\|h\|_X}\to 0$.
Fix $h\in X$ with $h\neq 0$. Since $rh\to0$ as $r\to 0$ we have that
\begin{align}
0&=\lim_{r\to 0}\frac{\|L(rh)\|_Y}{\|rh\|_X}
=\lim_{r\to 0}\frac{\|rL(h)\|_Y}{\|rh\|_X}\\
&=\lim_{r\to 0}\frac{|r|\,\|L(h)\|_Y}{|r|\,\|h\|_X}
=\lim_{r\to 0}\frac{\|L(h)\|_Y}{\|h\|_X}=\frac{\|L(h)\|_Y}{\|h\|_X}
\end{align}
But then $\|L(h)\|_Y=0\cdot\|h\|_X=0$ and so $L(h)=0$. This was for an arbitrary $h\neq 0$ in $X$, and $L(0)=0$ by linearity, so $L=0$, the constantly $0$ map. This means that $L(h)=D_f(x)(h)-D'_f(x)(h)=0$ for all $h$ and hence $D_f(x)=D'_f(x)$ are the same linear map.
❏
**Note:** This proof really used the full condition on the error term: $\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0$.
## Computing D~f~(x) for f:ℝ^n^→ℝ
Suppose $D_f(x)$ exists, then since $D_f(x)$ is linear, there is a $1\times n$ matrix $A$ so that $D_f(x)(h)=Ah$, that is
$$
D_f(x)(h)=\bigl[\, a_1\,a_2\,\cdots\,a_n\bigr]
\begin{bmatrix}h_1\\h_2\\\vdots\\h_n\end{bmatrix}
$$
Given an $m\times n$ matrix $B$ we can find the $j$^th^ column of $B$ as $Be^n_j$ where $e^n_j$ is the $j$^th^ standard basis element:
$$
e^n_j=\begin{bmatrix}
0\\\vdots\\0\\1\\0\\\vdots\\0
\end{bmatrix}
\begin{matrix}
\phantom{0}\\\phantom{\vdots}\\\phantom{0}\\\leftarrow j\\\phantom{1}\\\phantom{\vdots}\\\phantom{0}
\end{matrix}
$$
That is $e^n_j(i)=\begin{cases}0&\text{if }i\neq j\\1&\text{if }i=j\end{cases}$.
So $B_{i,j}=Be^n_j(i)$. In our current setting, where $A$ is $1\times n$, this gives $a_i=D_f(x)(e_i^n)$. Now consider
\begin{align*}
\lim_{h\to 0}\frac{|f(x+h)-f(x)-D_f(x)(h)|}{\|h\|}=0
&\implies
\lim_{t\to 0}\frac{|f(x+te_i^n)-f(x)-D_f(x)(te_i^n)|}{|t|}=0\\
&\implies \lim_{t\to 0}\left|\frac{f(x+te_i^n)-f(x)}{t}-a_i\right|=0
\end{align*}
since $\frac{D_f(x)(te_i^n)}{t}=\frac{tD_f(x)(e_i^n)}{t}=D_f(x)(e_i^n)=a_i$. But let
$$
\hat f_x(t)=f(x+te_i^n)=f(x_1,x_2,\ldots,x_{i-1},x_i+t,x_{i+1},\ldots,x_n),
$$
then $a_i=\frac{d}{dt}\hat f_x\big|_{t=0}\stackrel{\tiny{\text{def}}}{=}\frac{\partial}{\partial x_i}f(x)$. We also denote this $\partial_i f(x)$, $\frac{\partial f}{\partial x_i}(x)$, and sometimes $f_i(x)$. So we have shown that if $f$ is differentiable at $x$, then
$$
D_f(x)=\bigl[\,\partial_1 f\,\partial_2 f\,\ldots\,\partial_n f\,\bigr]
$$
**Example** Consider $f(x)=x^Tx=\sum_{i}x_i^2$ again, so $f:\mathbb R^n\to \mathbb R$. We have $\partial_i f(x)=2x_i$ and so $D_f(x)=\bigl[\,2x_1\,2x_2\,\ldots\,2x_n\,\bigr]=2x^T$, which agrees with what we got above!
**Example** Consider $f(x,y)=x\cdot y$. Then $[D_f(x,y)]=\bigl[y\ \ x\bigr]$. So at a point $(a,b)$, the best linear approximation to $f((a,b)+(h,k))-f(a,b)$ is $D_f(a,b)(h,k)=\bigl[b\ a\bigr]\Bigl[\begin{smallmatrix}h\\k\end{smallmatrix} \Bigr]=bh+ak$. This makes sense as
$$
f((a,b)+(h,k))-f((a,b))=(a+h)(b+k)-ab=(ab+ak+bh+hk)-ab=(ak+bh)+hk
$$
So $o_f((h,k))=hk$ and $\lim_{(h,k)\to (0,0)}\frac{hk}{\sqrt{h^2+k^2}}=0$.
Note, when we get to the chain rule for multivariable functions we will have
$$
\frac{d}{dt}(f(t)\cdot g(t))=D_F(f(t),g(t))(f'(t),g'(t))=
\bigl[g(t)\ f(t)\bigr]\Bigl[\begin{smallmatrix}f'(t)\\g'(t)\end{smallmatrix} \Bigr]=g(t)f'(t)+f(t)g'(t)
$$
where $F(x,y)=x\cdot y$.
## Computing D~f~(x) for f : ℝ^n^ → ℝ^m^
Here we just note that $f:\mathbb R^n\to\mathbb R^m$ is really just given by $f(x)=(f_1(x),\ldots,f_m(x))$ where each $f_i:\mathbb R^n\to\mathbb R$. Actually, to use the correct vector/matrix notation,
$$
f(x)=\begin{bmatrix}f_1(x)\\f_2(x)\\\vdots\\f_m(x)\end{bmatrix}
$$
As above, $D_f(x)(i,j)=D_f(x)(e_j^n)(i)=D_{f_i}(x)(e_j^n)$, and thus $D_f(x)(i,j)=\partial_j f_i(x)$, that is,
$$
D_f(x)=\begin{bmatrix}
\partial_1 f_1&\partial_2 f_1&\cdots&\partial_n f_1\\
\partial_1 f_2&\partial_2 f_2&\cdots&\partial_nf_2\\
\vdots&\vdots&\ddots&\vdots\\
\partial_1 f_m&\partial_2 f_m&\cdots&\partial_n f_m
\end{bmatrix}
$$
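This matrix of partials is exactly what a finite-difference scheme computes column by column (a sketch; the helper `jacobian` and the step `eps` are my own choices, tested here on the example worked below):
```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the m x n matrix [∂_j f_i(x)] by central differences."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0
        cols.append((f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps))  # j-th column
    return np.column_stack(cols)

# f(x, y) = (x^2 - y^2, 2xy^2), whose Jacobian is [[2x, -2y], [2y^2, 4xy]].
f = lambda v: np.array([v[0]**2 - v[1]**2, 2 * v[0] * v[1]**2])
print(jacobian(f, [3.0, 4.0]))      # ≈ [[ 6, -8], [32, 48]]
```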
**Warning** Clearly if $D_f(x)$ exists, then all of the partial derivatives exist, but the partial derivatives existing is not enough to guarantee that $f$ is differentiable.
**Theorem** If the partials $\partial_j f_i$ exist and are continuous in an open nbhd of $x$ for all $i,j$, then $D_f$ exists and is continuous in an open nbhd of $x$.
**Proof** It suffices to prove this one component at a time, so assume $f:\mathbb R^n\to\mathbb R$; the candidate derivative is $D_f(x)(h)=\sum_{i=1}^n\partial_i f(x)h_i$. Let $U$ be an open nbhd of $x$ so that $\partial_i f$ is continuous in $U$ for all $i$. Let $\epsilon>0$ and choose $\delta>0$ so that $N_\delta(x)\subseteq U$ and $y\in N_\delta(x)\implies |\partial_i f(y)-\partial_i f(x)|<\epsilon/n$ for all $i$. Let $x+h\in N_\delta(x)$. Call $x=y_0$ and $x+h=y_n$ and choose $y_i$ so that $y_i-y_{i-1}=h_ie_i$; note each $y_i\in N_\delta(x)$. By the MVT there is $c_i$ between $0$ and $h_i$ so that $f(y_i)-f(y_{i-1})=\partial_i f(y'_i)h_i$ where $y'_i=y_{i-1}+c_ie_i$. We then have $f(x+h)-f(x)=f(y_n)-f(y_0)=\sum_{i=1}^n\bigl(f(y_i)-f(y_{i-1})\bigr)=\sum_{i=1}^n\partial_i f(y'_i)h_i$. So $f(x+h)-f(x)-D_f(x)(h)=\sum_{i=1}^n\partial_i f(y'_i)h_i-\sum_{i=1}^n\partial_i f(x)h_i=\sum_{i=1}^n(\partial_i f(y_i')-\partial_i f(x))h_i$. Thus
$$
|f(x+h)-f(x)-D_f(x)(h)| = \left|\sum_{i=1}^n(\partial_i f(y_i')-\partial_i f(x))h_i\right|
\le \sum_{i=1}^n(\epsilon/n) |h_i|\le\epsilon\|h\|
$$
so
$$
\frac{|f(x+h)-f(x)-D_f(x)(h)|}{\|h\|}\le\epsilon.
$$
❏
**Example**: Let $f(x,y)=(x^2-y^2,2xy^2)$, then $\partial_x f_1=2x$, $\partial_y f_1=-2y$, $\partial_x f_2=2y^2$, and $\partial_y f_2=4xy$. Clearly all of these partials are continuous on all of $\mathbb R^2$, so $f$ is differentiable on all of $\mathbb R^2$ and
$$
D_f(a,b)(h,k)=\begin{bmatrix}
2a&-2b\\
2b^2&4ab
\end{bmatrix}\begin{bmatrix}h\\k\end{bmatrix}
$$
so given that $f(3,4)=(-7, 96)$ we can approximate $f(3.12,3.84)=f(3+0.12,4-0.16)\approx f(3,4)+D_f(3,4)(0.12,-0.16)$ so
$$
f(3.12,3.84) \approx
\begin{bmatrix}-7\\96\end{bmatrix}+
\begin{bmatrix}6&-8\\32&48\end{bmatrix}
\begin{bmatrix}0.12\\-0.16\end{bmatrix}
=\begin{bmatrix}-7\\96\end{bmatrix}+
\begin{bmatrix}2\\-3.84\end{bmatrix}
=\begin{bmatrix}-5\\92.16\end{bmatrix}
$$
You can check that $f(3.12,3.84)=(-5.0112,\ 92.012544)$.
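Checking this in code (a sketch reusing the numbers above):
```python
import numpy as np

# Exact f(3.12, 3.84) versus the linear approximation f(3,4) + D_f(3,4)(0.12, -0.16).
f = lambda v: np.array([v[0]**2 - v[1]**2, 2 * v[0] * v[1]**2])

exact  = f(np.array([3.12, 3.84]))
approx = f(np.array([3.0, 4.0])) + np.array([[6.0, -8.0], [32.0, 48.0]]) @ np.array([0.12, -0.16])
print(exact)     # [-5.0112   92.012544]
print(approx)    # [-5.       92.16    ]
```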
Just to reiterate something done above: for the case that $f:\mathbb{R}\to\mathbb{R}$ it is clear that $D_f(x)(h) = ah$ for some $a\in\mathbb{R}$, and hence just the definition provides:
$$
\frac{f(x+h)-f(x)-ah}{h}=\frac{o_f(x)(h)}{h}\to 0\text{ as }h\to 0
$$
and hence
$$
\lim_{h\to 0}\left(\frac{f(x+h)-f(x)}{h}-a\right)=0
$$
which becomes the familiar
$$
\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}=a=f'(x)
$$
and hence $D_f(x)(h)=f'(x)h$, so $D_f(x):\mathbb{R}\to\mathbb{R}$ is the linear function $h\mapsto f'(x)h$.
# Properties of the derivative
## Continuity of *f* at differentiable points
Let $f:X\to Y$ be differentiable at $x\in X$, then $f$ is continuous at $x$. To see this notice
$$
\|f(x+h)-f(x)\|_Y\leq\|D_f(x)(h)\|_Y+\|o_f(x)(h)\|_Y
$$
The continuity of $D_f(x)$ implies that $D_f(x)(h)\to D_f(x)(0)=0$ as $h\to 0$ and by assumption $\|o_f(x)(h)\|_Y\to 0$, so we see $\|f(x+h)-f(x)\|_Y\to 0$ as $h\to 0$ and hence $f$ is continuous at $x$.
## Critical points
**Theorem** (Fermat's little theorem.) If $f:X\to\mathbb{R}$, a local maximum or minimum occurs at $c$, and $f$ is differentiable at $c$, then $D_f(c)=0$, i.e., $D_f(c)$ is the $0$ linear map.
**Proof** Suppose a local maximum occurs at $c$ and that $D_f(c)(x) = \gamma > 0$ for some $x$, that is, $D_f(c)$ is not the $0$-map. Then for $t>0$
$$
\frac{\gamma}{\|x\|}=\frac{D_f(c)(tx)}{\|tx\|}=\frac{f(c+tx)-f(c)-o_f(c)(tx)}{\|tx\|}
$$
Choose $\delta'>0$ so that for $0<t<\delta'$, $\frac{|o_f(c)(tx)|}{\|tx\|}<\frac{\gamma}{2\|x\|}$; then $f(c+tx)-f(c)>\frac{t\gamma}{2}>0$. So $f(c)$ is not a local max. If $D_f(c)(x)=\gamma<0$, we do the same, but with $t<0$. If $c$ is a local minimum, then we use the same argument, but invert the roles of $t>0$ and $t<0$.
❏
**Definition** $c\in X$ is a *critical point* or *stationary point* iff $D_f(c)$ is the $0$ map.
So for $f:X\to \mathbb{R}$, if $f$ has a local extremum at $c$ and $f$ is differentiable at $c$, then $c$ is a critical point, but the converse can fail.
## Algebraic properties of derivatives
### Product Rule
Suppose $f:X\to Y$ and $g:X\to Y$ are dfbl at $x\in X$ and $\star:Y\times Y\to Y$ is some sort of product satisfying^[This is a very general result; it takes care of inner products, cross-products, the usual product, etc.]:
- $\alpha(y_0\star y_1) = (\alpha y_0)\star y_1 = y_0\star(\alpha y_1)$
- $y_0\star(y_1 + y_2)=y_0\star y_1 + y_0\star y_2$
- $(y_0 + y_1)\star y_2 = (y_0 \star y_2) + (y_1 \star y_2)$
These three properties are called *bi-linearity*. (Associativity of $\star$ is not needed; indeed the cross product and the commutator are not associative.)
Then
$$
D_{f\star g}(x)(h) = D_f(x)(h) \star g(x) + f(x) \star D_g(x)(h)
$$
Be mindful of the types involved here: First the definition of $f\star g:X\to Y$ is given by $(f\star g)(x)=f(x)\star g(x)$. Second, $D_f(x), D_g(x)\in \mathcal{L}(X,Y)$, the space of linear functions from $X$ to $Y$, while $f(x), g(x)\in Y$ and $h\in X$.
**Proof** We must show that
$$
(f\star g)(x + h) - (f\star g)(x) = (D_f(x)(h) \star g(x) + f(x) \star D_g(x)(h)) + o(h)
$$
where $\|o(h)\|_Y/\|h\|_X\to 0$ as $h\to 0$. For this we simply compute using the fact that $f$ and $g$ are dfbl at $x$.
$$
\begin{align*}
f(x+h)\star& g(x+h)-f(x)\star g(x)\\
&=f(x+h)\star g(x+h) - f(x+h)\star g(x) + f(x+h)\star g(x) - f(x)\star g(x)\\
&=f(x+h)\star(g(x+h)-g(x))+(f(x+h)-f(x))\star g(x)\\
&=(f(x)+D_f(x)(h)+o_f(x)(h))\star (D_g(x)(h)+o_g(x)(h))\\
&\qquad+(D_f(x)(h)+o_f(x)(h))\star g(x)\\
&=f(x)\star D_g(x)(h)+D_f(x)(h)\star g(x) + o(x)(h)
\end{align*}
$$
where
$$
\begin{align*}
o(x)(h)&=(D_f(x)(h)+o_f(x)(h))\star (D_g(x)(h)+o_g(x)(h))\\
&\qquad+f(x)\star o_g(x)(h)+o_f(x)(h)\star g(x)\\
\end{align*}
$$
It is clear that $\lim_{h\to 0}\frac{\|o(x)(h)\|_Y}{\|h\|_X}=0$: for the last two terms use $\lim_{h\to 0}\frac{\|o_g(x)(h)\|_Y}{\|h\|_X}=0$ and $\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0$, and for the first term note that $\frac{\|D_f(x)(h)+o_f(x)(h)\|_Y}{\|h\|_X}$ stays bounded as $h\to 0$ while $\|D_g(x)(h)+o_g(x)(h)\|_Y\to 0$. (Here we also use that $\star$ is continuous, so that $\|y_0\star y_1\|_Y\le C\,\|y_0\|_Y\|y_1\|_Y$ for some constant $C$.)
❏
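For a concrete instance, take $\star$ to be the dot product on $\mathbb R^2$ and check the rule along a curve (a sketch; the curves $f$ and $g$ are arbitrary choices):
```python
import numpy as np

# Product rule for the dot product: (f·g)'(t) = f'(t)·g(t) + f(t)·g'(t).
f  = lambda t: np.array([t, t**2])
df = lambda t: np.array([1.0, 2*t])
g  = lambda t: np.array([np.sin(t), np.cos(t)])
dg = lambda t: np.array([np.cos(t), -np.sin(t)])

t, h = 0.7, 1e-6
numeric = (f(t+h) @ g(t+h) - f(t-h) @ g(t-h)) / (2*h)   # central difference
exact   = df(t) @ g(t) + f(t) @ dg(t)
print(numeric, exact)                                    # agree to ~1e-10
```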
### Chain Rule
**Theorem:** Suppose $f:X\to Y$ and $g:Y\to Z$ with $f$ dfbl at $x$ and $g$ dfbl at $f(x)$, then $g\circ f:X\to Z$ is dfbl at $x$ and $D_{g\circ f}(x)=D_g(f(x))\circ D_f(x)$.
**Proof:** The composition of linear functions is linear, so $D_g(f(x))\circ D_f(x):X\to Z$ is linear. We must show that $o:X\to Z$ defined by
$$
o(h) = (g\circ f)(x+h)-(g\circ f)(x)-(D_g(f(x))\circ D_f(x))(h)
$$
satisfies $\lim_{h\to 0}\frac{\|o(h)\|_Z}{\|h\|_X}=0$.
$$
\begin{align*}
g(f(x+h))-g(f(x)) &= D_g(f(x))(f(x+h)-f(x)) + o_g(f(x))(f(x+h)-f(x))\\
&= D_g(f(x))(D_f(x)(h)+o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))\\
&= D_g(f(x))(D_f(x)(h)) + D_g(f(x))(o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))
\end{align*}
$$
so
$$
o(h) = D_g(f(x))(o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))
$$
We must show $\frac{\|o(h)\|_Z}{\|h\|_X}\to 0$ as $h\to 0$. It would suffice to show this true for each "part" of $o(h)$, namely show
$$
\frac{\|D_g(f(x))(o_f(x)(h))\|_Z}{\|h\|_X}\to 0\quad\text{and}\quad
\frac{\|o_g(f(x))(D_f(x)(h)+o_f(x)(h))\|_Z}{\|h\|_X}\to 0
$$
as $h\to 0$.
$$
\lim_{h\to 0}\frac{\|D_g(f(x))(o_f(x)(h))\|_Z}{\|h\|_X}=\lim_{h\to 0}\left\Vert D_g(f(x))\left(\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z
$$
But as $y\mapsto \|D_g(f(x))(y)\|_Z$ is continuous we have
$$
\begin{align*}
\lim_{h\to 0}\left\Vert D_g(f(x))\left(\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z&=
\left\Vert D_g(f(x))\left(\lim_{h\to 0}\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z\\
&=\|D_g(f(x))(0)\|_Z=0
\end{align*}
$$
So the first part is done. For the second part we notice:
$$
\frac{\|o_g(f(x))(D_f(x)(h)+o_f(x)(h))\|_Z}{\|h\|_X}=\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}\frac{\|k(h)\|_Y}{\|h\|_X}
$$
where $k(h)=D_f(x)(h)+o_f(x)(h)$. (If $k(h)=0$ then $o_g(f(x))(k(h))=0$, so we may assume $k(h)\neq 0$.) But $k(h)\to 0$ as $h\to 0$ and thus
$$
\lim_{h\to 0}\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}=\lim_{k\to 0} \frac{\|o_g(f(x))(k)\|_Z}{\|k\|_Y}=0
$$
The other factor satisfies
$$
\frac{\|k(h)\|_Y}{\|h\|_X}\le\left\Vert D_f(x)\Bigl(\frac{h}{\|h\|_X}\Bigr)\right\Vert_Y+\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}
$$
and the right side is bounded for $h$ close to $0$, and so
$$
\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}\frac{\|k(h)\|_Y}{\|h\|_X}\to 0\cdot \text{Bounded}=0
$$
❏
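Finally, the chain rule itself can be spot-checked with finite-difference Jacobians (a sketch; the maps $f:\mathbb R^2\to\mathbb R^3$ and $g:\mathbb R^3\to\mathbb R^2$ are arbitrary choices):
```python
import numpy as np

def num_jacobian(F, x, eps=1e-6):
    # central-difference Jacobian, one column (partial) at a time
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = 1.0
        cols.append((F(x + eps*e) - F(x - eps*e)) / (2*eps))
    return np.column_stack(cols)

f = lambda v: np.array([v[0]*v[1], v[0] + v[1], v[0]**2])
g = lambda u: np.array([u[0] + u[1]*u[2], u[0]*u[2]])

x = np.array([1.5, -0.5])
lhs = num_jacobian(lambda v: g(f(v)), x)            # D_{g∘f}(x)
rhs = num_jacobian(g, f(x)) @ num_jacobian(f, x)    # D_g(f(x)) ∘ D_f(x)
print(np.allclose(lhs, rhs, atol=1e-6))             # True
```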
*[dfbl]: differentiable
*[diff'ble]: differentiable
<!--
# Topic 5 DQs (Ignore this - leaving for posterity)
## DQ1
>In most theorems (e.g., Theorem 5.9, p. 109) involving functions that are differentiable in an interval. These functions are not required to be differentiable at endpoints of such intervals. Why? Provide an example.
These theorems basically rely on there being a local min/max inside the interval $(a,b)$ and then using "Fermat's Little Theorem" at that point. This only requires at the point in question.
Examples would include the upper half circle: $f(x)=\sqrt{1-x^2}$, this clearly is continuous on $[-1,1]$, satisfies $f(-1)=f(1)=0$, and fails to be differentiable at $x=-1$ (from the right) or at $x=1$ (from the left).
## DQ 2
Suppose $a$ and $c$ are real, $c>0$, $f$ defined on $[-1,1]$ by
$$
f(x) =
\begin{cases}
|x|^a\sin(|x|^{-c})&\text{if }x\neq 0,\\
0&\text{if }x=0.
\end{cases}
$$
**Note:** This is modified a bit from what was written, the problem being that we have not discussed what $x^a$ means for $x<0$ and $a$ not an integer.
To simplify further concentrate on the interval $[0,1]$ and on just continuity and differentiability from the right at $0$. This way we can avoid issues with absolute values. So I consider here $f:[0,1]\to\mathbb{R}$ defined by:
$$
f(x) =
\begin{cases}
x^a\sin(x^{-c})&\text{if }x> 0,\\
0&\text{if}x=0.
\end{cases}
$$
(a) Show $f$ is continuous iff $a>0$.
> **(if)** If $a=0$, then for $x\neq 0$ the finction is $\sin(x^{-c})$ and as $\lim_{x\to 0}x^{-c}=\infty$ this would have the entire interval $[-1,1]$ on the $y$-axis as limit points for the graph of $f$ and hence $f$ can't be continuous at $0$. If $a<0$, things are worse, since then any point in the $y$-axis is a limit point of the graph of $f$. So $a\ge 0$, implies $f$ is not continuous at $0$. Hence the contrapositive holds, namely, $a>0$ if $f$ is continuous at $0$.
> **(only if)** If $a>0$, then $\lim_{x\to 0}x^a=0$ and the squeeze theorem implies that $f$ is continuous at $0$. Continuity on the rest of the interval is trivial.
(b) $f'(0)$ exists iff $a>1$.
> Here we simply take left and right limits. I'll consider the left case as $x\to 0^+$:
> $$\lim_{x\to 0^+}\frac{f(x)-f(0)}{x-0}=\lim_{x\to 0^+}\frac{x^a\sin(x^c)}{x}=\lim_{x\to0^+}x^{a-1}\sin(x^c)$$
> By the same argument as in (a), the limit exists iff $a-1>0$, i.e., $a>1$. In this case it is clear that $f'(0)=0$.
(c ) $f'$ is bounded iff $a\ge 1+c$.
> Here we have
> $$
> f'(x) = \begin{cases}
> ax^{a-1}\sin(x^{-c})+cx^{a-(c+1)}\cos(x^{-c}) & \text{if }x > 0,\\
> 0 & \text{if }x=0
> \end{cases}
> $$
> The $\sin$ part is bounded if $a\ge 1$ and the $\cos$ part is bounded if $a\ge c+1$.
(d) $f'$ is continuous iff $a>c+1$.
> Just as in (a) $f'$ will be continuous at $0$ iff $a>c+1$.
(e) $f''(0)$ exists iff $a>(c + 1) + 1=c+2$.
> This is just as in (b), the point is that when dividing by $x$ we must have $x^{a-(c+1)}/x = x^{a-(c+2)}\to 0$.
(f) and (g) the pattern continues as above.
You can play with this [here](https://sagecell.sagemath.org/?z=eJxdkMsKgzAQRff5iiCKSRttFUo3Tbd-RKkQNQ8hPoiWpn_fBFtqnc3cezkDM8MghedDDmrXMyCQxU7Yku2mtke2REmNMZBL3LRCIEGgxUBtEgJzDAqXFIaNqq0n5O2ewlEPsyduR5LdCawHPRgaG97EBL46Zmnme9vTJFtPyO1EpR88JkyPitH0tEbVFpWG8_6PTSc1PN1Gau40CsLPldEUXipzDUXsPP25rw0ipNnMLfI8JouWK-2f4OoNk89Zvw==&lang=sage&interacts=eJyLjgUAARUAuQ==), to see what is going on, I just use [0,1] to avoid the issue with using absolute value. You can change the values of ```a``` and ```b```.
-->