571: Ch5 Differentiation
=====
# The derivative of a function f:ℝ→ℝ
Let's define and prove properties of derivatives so that the definitions and theorems immediately lift from $f:\mathbb{R}\to\mathbb{R}$ to $f:X \to Y$ whenever $X, Y$ are spaces like $\mathbb{R}^n$ or $\mathbb{C}^n$, and even to more general settings.
Given a function $f:E\subseteq \mathbb{R} \to \mathbb{R}$ we say $f$ is *differentiable* at $a\in E$ if there is a linear map $D_f(a):\mathbb{R}\to\mathbb{R}$ so that
$$
\lim_{h \to 0}\frac{(f(a+h) - f(a))-D_f(a)(h)}{h} = 0
$$
This is sometimes written
$$
f(a+h) - f(a) - D_f(a)(h) = o_f(h)\text{ where }\lim_{h\to 0}\frac{o_f(h)}{h}=0.
$$
So $D_f(a)$ is a **first order** linear approximation to $f(a+h)-f(a)$, in the variable $h$. Since $D_f(a)$ is a linear map, we know $D_f(a)(h) = mh$ for some scalar $m$. (In higher dimensions, this would be $D_f(a)(h)=Ah$ for some matrix $A$.) We can use the above to find $m$:
$$
\lim_{h \to 0}\frac{(f(a+h) - f(a))-mh}{h} = 0\overset{\text{iff}}{\Longleftrightarrow} \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}=m
$$
Often this $m$ is denoted
$$
m = \left.\frac{d}{dx} f\,\right|_{x=a}=f'(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}
$$
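Before moving on, here is a quick numerical sanity check (a throwaway Python sketch, not part of the development): for $f(x)=x^2$ at $a=1.5$ the difference quotient should approach $m=2a=3$.
```python
# Difference quotient (f(a+h) - f(a))/h for f(x) = x^2 at a = 1.5.
# It should approach m = 2a = 3 as h -> 0.
def f(x):
    return x * x

a = 1.5
for h in [10.0**(-k) for k in range(1, 8)]:
    quotient = (f(a + h) - f(a)) / h
    print(f"h = {h:.0e}  quotient = {quotient:.10f}  error = {abs(quotient - 2*a):.2e}")
```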
If $f$ is differentiable at $a$, then
$$
|f(a+h)-f(a)| \le |o_f(h)|+|D_f(a)(h)|= |o_f(h)|+|mh|\to 0\text{ as }h\to 0
$$
and hence $f$ is continuous at $a$.
From this definition various theorems follow easily:
\begin{align}
(f+g)(a+h)&-(f+g)(a) - (D_f(a)+D_g(a))(h) \\&=
(f(a+h)-f(a)-D_f(a)(h))+(g(a+h)-g(a)-D_g(a)(h))\\
&=o_f(h)+o_g(h)\\&=o(h)
\end{align}
Verify that $\lim_{h\to 0}\frac{o(h)}{h}=0$ and we have shown $D(f+g)(a)=D_f(a)+D_g(a)$. (**Sum Rule**)
Similarly,
\begin{align}
(f\cdot g)(a+h)&-(f\cdot g)(a)-[f(a)\cdot D_g(a)(h)+D_f(a)(h)\cdot g(a)]\\
&= f(a+h)\cdot g(a+h)-f(a)\cdot g(a+h) + f(a)\cdot g(a+h) \\
&\qquad\qquad - f(a)\cdot g(a) - f(a)\cdot D_g(a)(h) -D_f(a)(h)\cdot g(a)\\
&\qquad\qquad + D_f(a)(h)\cdot g(a+h)-D_f(a)(h)\cdot g(a+h)\\
&= [f(a+h)-f(a)-D_f(a)(h)]\cdot g(a+h) \\
&\qquad\qquad + [g(a+h)-g(a)-D_g(a)(h)]\cdot f(a)\\
&\qquad\qquad +(g(a+h)-g(a))\cdot D_f(a)(h)\\
&=o_f(h)\cdot g(a+h) + o_g(h)\cdot f(a) + (g(a+h)-g(a))D_f(a)(h)\\
&=o(h)
\end{align}
Verify that $\lim_{h\to 0}\frac{o(h)}{h}=0$ and we have $D(f\cdot g)(a)(h)=f(a)\cdot D_g(a)(h)+ D_f(a)(h)\cdot g(a)$. (**Product Rule**)
**Note:** The above works for *any* product-like operation $\star$ that satisfies certain distributivity laws:
\begin{align}
(f+g)\star h&=(f\star h)+(g\star h)\\
h\star(f+g)&=(h\star f)+(h\star g)
\end{align}
This shows that the *product* rule holds for the inner product $(\langle f,g\rangle)$, the cross product $(f\times g)$, the commutator $([f,g]=f\cdot g-g\cdot f)$, etc.
An even more telling example is:
\begin{align}
(f\circ g)(a+h)&-(f\circ g)(a) =
f(g(a)+\boxed{D_g(a)(h)+o_g(h)}) - f(g(a)) \\
&=f(g(a)) + D_f(g(a))(D_g(a)(h)+o_g(h))+o_f(k(h))-f(g(a))\\
&=D_f(g(a))(D_g(a)(h)+o_g(h))+o_f(k(h))\\
&=(D_f(g(a))\circ D_g(a))(h) +D_f(g(a))(o_g(h))+o_f(k(h))\\
&=(D_f(g(a))\circ D_g(a))(h) + o(h)
\end{align}
where $k(h)=D_g(a)(h)+o_g(h)$ and $o(h) = D_f(g(a))(o_g(h))+o_f(k(h))$. The term $o_f(k(h))$ satisfies $\lim_{h\to 0}\frac{o_f(k(h))}{h}=0$ since $\frac{k(h)}{h}$ stays bounded near $0$ while $\frac{o_f(k)}{k}\to 0$ as $k\to 0$ (this step is carried out carefully in the general chain rule proof below). For the other term, what we need to see is that
$$
\lim_{h\to 0}\frac{D_f(g(a))(o_g(h))}{h}=
D_f(g(a))\Bigl(\lim_{h\to 0}\frac{o_g(h)}{h}\Bigr)=D_f(g(a))(0)=0
$$
**Note:** We used that $D_f(g(a))$ is linear to get
$$
D_f(g(a))(D_g(a)(h)+o_g(h)) = D_f(g(a))(D_g(a)(h))+D_f(g(a))(o_g(h))
$$
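Here is the chain rule checked numerically on a concrete pair (a minimal sketch; the choices $f=\sin$ and $g(x)=x^2$ are just for illustration):
```python
import math

# Chain rule check: (f∘g)'(a) = f'(g(a)) * g'(a) for f = sin, g(x) = x^2.
a = 1.2
exact = math.cos(a * a) * (2 * a)          # D_f(g(a)) ∘ D_g(a) applied to 1

for h in [1e-2, 1e-4, 1e-6]:
    quotient = (math.sin((a + h)**2) - math.sin(a * a)) / h
    print(f"h = {h:.0e}  quotient = {quotient:.8f}  exact = {exact:.8f}")
```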
## Local Extrema
Let $a$ be a local extremum of $f$ at which $f$ is differentiable; then $D_f(a)$ is the (constant) $0$ map. WLOG assume $a$ is a local maximum. If $D_f(a)$ is not the $0$ map, there is $h$ so that $D_f(a)(h)\neq 0$. If $D_f(a)(h)<0$, then $D_f(a)(-h)>0$, so again WLOG assume $D_f(a)(h)>0$. By linearity, $D_f(a)(\gamma h)=\gamma D_f(a)(h)$ and
$$
f(a+\gamma h)-f(a)=\gamma D_f(a)(h)+o_f(\gamma h)
$$
and $\lim_{\gamma\to 0}\frac{o_f(\gamma h)}{\gamma |h|}=0$, so there is $\delta>0$ so that for $0<\gamma<\delta$, $\left|\frac{o_f(\gamma h)}{\gamma|h|}\right|<\frac{1}{|h|}\frac{D_f(a)(h)}{2}$. So for such $\gamma$, $|o_f(\gamma h)|<\gamma \frac{D_f(a)(h)}{2}$ and thus $f(a+\gamma h)-f(a)>\gamma \frac{D_f(a)(h)}{2}>0$. So $f(a+\gamma h)>f(a)$ for all $\gamma\in(0,\delta)$. This contradicts $a$ being a local maximum.
**Remark** This argument is more convoluted than necessary: for $f:\mathbb R\to\mathbb R$ we have $D_f(a)(h)=mh$ and we could just argue from $m$. But more generally, for $f:\mathbb R^m\to \mathbb R$ we have $D_f(a)(h)=Ah$ where $A$ is a $(1\times m)$-matrix. The argument provided works in this more general setting, essentially showing that if $a$ is a local extreme point, then $A=O_{1\times m}$. Later we will see that this corresponds to all partial derivatives being $0$.
A point where $D_f(a)\equiv 0$ (is the 0 linear map) is called a ***stationary point*** or ***critical point***. The above shows that all local extrema that occur at a point where $f$ is differentiable are stationary points.
## Mean Value Theorem
**Rolle's Theorem** Suppose $f(a)=f(b)$, $f$ is continuous on $[a,b]$, and differentiable on $(a,b)$. Then there is $c\in(a,b)$ so that $f'(c)=0$.
**Proof.** Since $f$ is continuous on $[a,b]$ it attains its max and min. If both occur at the endpoints, then the max and min are the same and $f$ is constant, hence $f'(c)=0$ for all $c\in(a,b)$. Otherwise, there is a local extremum in the interval $(a,b)$ and hence $f'(c)=0$ there.
❏
**Mean Value Theorem** Assume $f$ is continuous on $[a,b]$ and differentiable on $(a,b)$; then there is $c\in (a,b)$ so that $f'(c)=\frac{f(b)-f(a)}{b-a}$.
**Proof.** Apply Rolle's Theorem to $h(x) = f(x) - rx$ where $r = \frac{f(b)-f(a)}{b-a}$; note $h(a)=h(b)$ and $h'(c)=f'(c)-r$.
❏
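We can watch the MVT produce its point $c$ concretely (a sketch; the function $f(x)=x^3$ on $[0,2]$ is my choice, and I locate $c$ by bisection, which works here since $f'$ is increasing):
```python
# For f(x) = x^3 on [0, 2]: r = (f(2) - f(0))/(2 - 0) = 4, and the MVT point
# satisfies f'(c) = 3c^2 = 4, i.e., c = 2/sqrt(3).
def fprime(x):
    return 3 * x * x

a, b = 0.0, 2.0
r = (b**3 - a**3) / (b - a)

lo, hi = a, b                       # fprime(c) - r changes sign on [0, 2]
for _ in range(60):
    mid = (lo + hi) / 2
    if fprime(mid) < r:
        lo = mid
    else:
        hi = mid

print(lo, 2 / 3**0.5)               # both ≈ 1.1547005383792515
```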
**Corollary** If $f$ is continuous on $[a,b]$ and $f'(x)\ge 0$ on $(a,b)$, then $f$ is increasing on $[a,b]$. Similarly, if $f'(x)\le 0$ on $(a,b)$, then $f$ is decreasing, and if $f'(x)=0$ on $(a,b)$, then $f$ is constant.
❏
There are many variants of this with $\ge$ replaced by $>$ and "increasing" replaced by "strictly-increasing," etc.
### Cauchy Mean Value Theorem
**Theorem** Suppose $f$ and $g$ are differentiable on $(a,b)$ and continuous on $[a,b]$, then there is $c\in(a,b)$ such that $f'(c)(g(b)-g(a)) = g'(c)(f(b)-f(a))$.
Clearly this is a strengthening of the MVT, obtained by taking $g(x)=x$. The proof is essentially the same.
**Proof.** Let $h(x)=f(x)(g(b)-g(a)) - g(x)(f(b)-f(a))$; then $h(a)=h(b)$, $h$ is differentiable on $(a,b)$, and continuous on $[a,b]$. By Rolle's Theorem there is $c\in(a,b)$ with $h'(c)=0$; clearly, this $c$ is what we were looking for.
❏
This version of the MVT immediately gives L'Hospital's rule.
### L'Hospital
Suppose $f(a-)=\lim_{x\to a^-}f(x)=0=\lim_{x\to a^-}g(x)=g(a-)$. If $\lim_{x\to a^-}\frac{f'(x)}{g'(x)}=L$ and $g(x), g'(x)\neq 0$ for $x\in (h,a)$ for some $h<a$, then $\lim_{x\to a^-}\frac{f(x)}{g(x)}=L$.
**Proof.** We may assume $g(a)=f(a)=0$ (extending by continuity), that $f$ and $g$ are differentiable on $(h,a)$ and continuous on $[h,a]$, and that $g(x),g'(x)\neq 0$ on $(h,a)$. Then by the Cauchy Mean Value Theorem applied on $[h,a]$ there is $c$ so that $h<c<a$ and
$$
\frac{f'(c)}{g'(c)}=\frac{f(h)}{g(h)}
$$
Taking the limit as $h\to a^-$ we see also that $c\to a^-$ as well and so
$$
\lim_{h\to a^-}\frac{f(h)}{g(h)}=\lim_{c\to a^-}\frac{f'(c)}{g'(c)}
$$
if the latter exists.
❏
**Note.** There are many variants of L'Hospital's rule. The limits can be one sided, two sided, $a$ can be $\pm\infty$, or the limit itself might be infinite. It is better to understand the proof than to remember all the cases.
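A small numerical illustration (not a proof, of course): take $f(x)=1-\cos x$ and $g(x)=x^2$ as $x\to 0$; both ratios head to $\frac12$.
```python
import math

# f(x) = 1 - cos(x) and g(x) = x^2 both -> 0 as x -> 0, while
# f'(x)/g'(x) = sin(x)/(2x) -> 1/2; L'Hospital says f/g -> 1/2 too.
for x in [0.5, 0.1, 0.01, 0.001]:
    print(f"x = {x:<6}  f/g = {(1 - math.cos(x)) / x**2:.8f}"
          f"  f'/g' = {math.sin(x) / (2 * x):.8f}")
```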
### IVT for derivatives
**Theorem** If $f$ is differentiable on $[a,b]$, and $f'(a)<\lambda < f'(b)$, then for some $c\in(a,b)$, $f'(c)=\lambda$.
**Proof** Consider $g(x)=f(x)-\lambda x$, so $g$ is differentiable on $[a,b]$ and hence continuous on $[a,b]$. Note that $g'(a)<0<g'(b)$. This means that for some $\delta>0$ and all $h\in(0,\delta)$, $g(a+h)-g(a)<0<g(b)-g(b-h)$, so $g(x)<g(a)$ on $(a,a+\delta)$ and $g(x)<g(b)$ on $(b-\delta,b)$. We know $g$ has a min on $[a,b]$ at some $c$, and thus $c\in(a,b)$. But then $g'(c)=f'(c)-\lambda=0$, so $f'(c)=\lambda$.
❏
**Corollary** $f'$ has no "jump" discontinuities.
## Inversion
Let $f:E\subseteq\mathbb R\to \mathbb R$ be continuously differentiable in a neighborhood of $a\in E$ with $D_f(a)\neq 0$, i.e., $f'(a)\neq 0$. Then there is a neighborhood of $a$ on which $f$ is 1-1, and if $g$ is the "local inverse," there is a neighborhood of $f(a)=b$ on which $g$ is continuously differentiable with $D_g(b) = D_f(a)^{-1}$, i.e., $g'(b)=\frac{1}{f'(a)}$. This can be written:
$$
D_{f^{-1}}(f(a))=D_f(a)^{-1}\text{ or } (f^{-1})'(f(a)) = \frac{1}{f'(a)}
$$
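To see $g'(b)=\frac{1}{f'(a)}$ numerically (a sketch; I picked $f(x)=x^3+x$, which is strictly increasing, and invert it by bisection so no closed-form inverse is needed):
```python
# For f(x) = x^3 + x we have f(1) = 2 and f'(1) = 4, so (f^{-1})'(2) should be 1/4.
def f(x):
    return x**3 + x

def f_inverse(y, lo=-10.0, hi=10.0):
    # f is strictly increasing, so bisection recovers the local inverse g
    for _ in range(80):
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b, h = 2.0, 1e-6
print((f_inverse(b + h) - f_inverse(b)) / h, 1 / 4)   # both ≈ 0.25
```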
## Taylor's Theorem
Given a function represented by a power series on $(c-R, c+R)$, say $f(x)=\sum a_i(x-c)^i$, we have by a simple calculation that $a_i = \frac{f^{(i)}(c)}{i!}$. Now given a function $f(x)$ with derivatives of all orders defined on $(c-R,c+R)$, it makes sense to ask if $f(x) = \sum \frac{f^{(i)}(c)}{i!}(x - c)^i$.
**Notation:** $C^\omega(D)$ is the collection of functions analytic on $D$. So here we are asking about the relationship between $C^\infty(D)$ and $C^\omega(D)$.
In general, if $I$ is an interval, $f$ has $n$ derivatives on $I$, and $c\in I$, define the $n^{\text{th}}$ degree Taylor polynomial centered at $c$ to be
$$
P_n(c,x)=\sum_{i=0}^n\frac{f^{(i)}(c)}{i!}(x-c)^i
$$
and set
$$
R_n(c,x)=f(x)-P_n(c,x).
$$
**Notation** When $c=0$ we write just $P_n(x)$ and $R_n(x)$. When we need to make $f$ clear, $P^f_n(c,x)$ and $R_n^f(c,x)$ can be used.
Notice
$$
f(x)=\sum_{i=0}^\infty \frac{f^{(i)}(c)}{i!}(x-c)^i \overset{\text{iff}}{\Longleftrightarrow}
\lim_{n\to \infty}R_n(c,x)=0
$$
**Taylor's Remainder Theorem** If $x,c\in I$ for $I$ an interval and $f\in C^{n+1}(I)$, then there is $y$ between $x$ and $c$ (a different $y$ for each form below) so that
$$
R_n(c,x)=\frac{f^{(n+1)}(y)}{(n+1)!}(x-c)^{n+1} \tag{Lagrange form}
$$
$$
R_n(c,x)= \frac{f^{(n+1)}(y)}{n!}(x-c)(x-y)^n\tag{Cauchy Form}
$$
$$
R_n(c,x)=\int_c^x \frac{f^{(n+1)}(t)}{n!}(x-t)^n \,dt \tag{Integral form}
$$
**Proof:** (Lagrange Form). Fix $x\neq c$ and choose $M$ so that
$$
f(x) = P_n(c,x)+\frac{M(x-c)^{n+1}}{(n+1)!}
$$
define
$$
g(t) = P_n(c,t)+\frac{M(t-c)^{n+1}}{(n+1)!}-f(t)
$$
It is clear that $g^{(k)}(c)=0$ for $k\leq n$. Apply Rolle's theorem repeatedly. Since $g(x)=g(c)=0$, get $x_1$ between $x$ and $c$ so that $g'(x_1)=0$. Get $x_2$ between $x_1$ and $c$ so that $g^{(2)}(x_2)=0$, etc. In this way get $x_1$, $x_2$, ..., $x_{n+1}$ so that $g^{(k)}(x_k)=0$ for $k\leq n+1$.
Now $g^{(n+1)}(x_{n+1})=M-f^{(n+1)}(x_{n+1})=0$, so $M=f^{(n+1)}(x_{n+1})$ and we have
$$
f(x)=P_n(c,x)+\frac{f^{(n+1)}(x_{n+1})}{(n+1)!}(x-c)^{n+1}
$$
❏ (Lagrange Form)
**Proof:** (Integral Form). For $n=0$ we have $P_0(c,x)=f(c)$ and
$$
R_0(c,x)=f(x)-f(c)=\int_c^x f'(t)\,dt
$$
Assume we have the result for $n$, then let $u(t)=f^{(n+1)}(t)$ and $v'(t)=\frac{(x-t)^n}{n!}$ so that $u'(t)=f^{(n+2)}(t)$ and $v(t)=-\frac{(x-t)^{n+1}}{(n+1)!}$ and
\begin{align}
R_n(c,x)&=\int_c^x \frac{(x-t)^n}{n!}f^{(n+1)}(t)\,dt=\int_c^x v'(t) u(t)\,dt\\
&=\bigl(u(x)v(x)-u(c)v(c)\bigr)-\int_c^x v(t)u'(t)\,dt\\
&=f^{(n+1)}(x)\cdot 0 + f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}+\int_c^x\frac{(x-t)^{n+1}}{(n+1)!}f^{(n+2)}(t)\,dt\\
&= f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!} + \int_c^x\frac{f^{(n+2)}(t)}{(n+1)!}(x-t)^{n+1}\,dt\\
\end{align}
and
$$
\begin{split}
R_{n+1}(c,x)=f(x)&-P_{n+1}(c,x)=
(f(x)-P_n(c,x))-f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}\\
&=R_n(c,x)-f^{(n+1)}(c)\frac{(x-c)^{n+1}}{(n+1)!}.
\end{split}
$$
so we have
$$
R_{n+1}(c,x)=\int_c^x\frac{f^{(n+2)}(t)}{(n+1)!}(x-t)^{n+1}\,dt
$$
❏ (Integral Form)
**Proof:** (Cauchy Form.) The mean value theorem for integrals provides a $y$ between $x$ and $c$ so that
$$
\frac{1}{x-c}\int_c^x \frac{f^{(n+1)}(t)}{n!}(x-t)^n\,dt= \frac{f^{(n+1)}(y)}{n!}(x-y)^n
$$
So
$$
R_n(c,x)=\frac{f^{(n+1)}(y)}{n!}(x-y)^n(x-c)
$$
❏ (Cauchy Form)
**Example** For $f(x)=e^x$ we have $f^{(n)}(0)=e^0=1$ for all $n$, and thus by the Lagrange form of the remainder there is some $y$ between $0$ and $x$ with
$$
\left|e^x-\sum_{i=0}^n\frac{x^i}{i!}\right|=\left|\frac{e^yx^{n+1}}{(n+1)!}\right|
$$
So it is clear that $|P_n(x)-e^x|\to 0$ as $n\to\infty$ for all $x$. So pointwise $e^x=\sum_{i=0}^\infty \frac{x^i}{i!}$.
[Visualize this :eyeglasses:](https://www.desmos.com/calculator/mdmfjxwknx)
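The same convergence is easy to watch in code (a small sketch; `partial` accumulates $P_n(x)$ term by term):
```python
import math

# Partial sums P_n(x) of sum_i x^i / i! versus e^x at x = 2.
x = 2.0
term, partial = 1.0, 1.0                  # the i = 0 term
for n in range(1, 15):
    term *= x / n                         # now term = x^n / n!
    partial += term
    print(f"n = {n:2d}  P_n = {partial:.10f}  |R_n| = {abs(math.exp(x) - partial):.2e}")
```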
**Example** Consider $f(x)=\begin{cases}e^{-1/x}&x>0\\0&x\le 0\end{cases}$
It is possible to show that $f^{(n)}(0)=0$ for all $n$, so the Taylor series is $P_\infty(x)=\sum_{i=0}^\infty \frac{0}{i!}x^i=0$, and it is clear that $f(x)\neq P_\infty(x)$ for $x>0$. This means that the remainder at $x=1$ (in Cauchy form, $R_n(0,1)=\frac{f^{(n+1)}(c)}{n!}(1-c)^n$ for some $c\in(0,1)$) does not tend to $0$.
:::spoiler **The Weeds**
Substitute $u=-\frac{1}{x}$, so $u\to-\infty$ as $x\to 0^+$ and $\frac{du}{dx}=\frac{1}{x^2}=u^2$. Writing $f(x)=e^u$ for $x>0$, we get
\begin{align}
f'(x)&=e^u\frac{du}{dx}=u^2e^u\\
f''(x)&=(2u+u^2)e^u\frac{du}{dx}=u^2(2u+u^2)e^u\\
f^{(n)}(x)&=p_n(u)e^u\\
f^{(n+1)}(x)&=u^2(p_n'(u)+p_n(u))e^u=p_{n+1}(u)e^u
\end{align}
So $p_n(u)$ is a polynomial of degree $2n$ and $\lim_{u\to-\infty}p_n(u)e^u=0$ by repeated L'Hospital. This shows that $f^{(n)}(0)=\lim_{x\to 0^+}\frac{f^{(n-1)}(x)}{x}=\lim_{u\to-\infty}(-u)p_{n-1}(u)e^u=0$. Thus $P_n(0,x)=0$, yet clearly, $f(x)\neq 0$ for $x>0$.
You can see how the terms $\frac{f^{(n)}(c)}{n!}(1-c)^n$ blow up, for $n=1,2,3,4,5$ and $c\in(0,1)$:
<iframe src="https://www.desmos.com/calculator/pdk0fg2cdt?embed" width="500" height="500" style="border: 1px solid #ccc" frameborder=0></iframe>
:::
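You can also see the flatness numerically (a quick sketch; the one-sided difference quotients for $f'(0)$ and $f''(0)$ are already indistinguishable from $0$ at modest $h$):
```python
import math

# f(x) = e^{-1/x} for x > 0 and 0 otherwise: all derivatives at 0 vanish.
def f(x):
    return math.exp(-1.0 / x) if x > 0 else 0.0

for h in [0.1, 0.05, 0.02, 0.01]:
    d1 = (f(h) - f(0.0)) / h                     # estimates f'(0+)
    d2 = (f(2*h) - 2*f(h) + f(0.0)) / h**2       # estimates f''(0+)
    print(f"h = {h:<5}  f'(0) ≈ {d1:.3e}  f''(0) ≈ {d2:.3e}")
```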
# The General Definition of the Derivative
Let $(X,\|\cdot\|_X)$ and $(Y,\|\cdot\|_Y)$ be normed topological vector spaces (TVS) over either $\mathbb{R}$ or $\mathbb{C}$. We will not recall what a vector space over $\mathbb{R}$ or $\mathbb{C}$ is, but do recall that $\|\cdot\|_X:X\to\mathbb{R}$ is a norm iff
- $\|x+y\|_X\leq \|x\|_X+\|y\|_X$
- $\|x\|_X\ge 0$ and $\|x\|_X=0\iff x=0$
- $\|\alpha x\|_X=|\alpha|\,\|x\|_X$
Recall also that a norm $\|\cdot\|_X$ gives rise to a metric by $d_X(x,x')=\|x-x'\|_X$.
Typical examples are the Euclidean spaces, $\mathbb{R}^n$ and $\mathbb{C}^n$ with the usual $2$-norm, $\|x\|_2^2=\sum_{i=1}^n|x_i|^2$. Slightly more exotic examples would be $\ell^2(\mathbb{R})$ or $\ell^2(\mathbb{C})$, the set of **square summable sequences**, with norm $\|x\|_2^2=\sum_{i=1}^\infty |x_i|^2$, or $\ell^1(\mathbb{C})$, the **absolutely summable sequences**, where the norm is given by $\|x\|_1=\sum_{i=1}^\infty|x_i|$. There are many more such examples, but for this discussion the Euclidean spaces suffice.
For these notes I will only consider real Euclidean spaces $\mathbb{R}^n$, but I will use notation that lifts readily to the more general setting described above.
Given $f:X\to Y$ and $x\in X$ an interior point in the domain of $f$ we can define the *best linear approximation to $f$ at $x$* to be the unique linear function $D_f(x):X\to Y$ so that
$$
f(x+h)-f(x)=D_f(x)(h)+o_f(x)(h)
$$
where
$$
\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0
$$
**Note**: The remainder function $o_f(x)(h)$ is completely determined by
$$
o_f(x)(h)=(f(x+h)-f(x))-D_f(x)(h)
$$
For reasons we will not get into here, we also need to insist that $D_f(x)$ is continuous. So long as $X$ is finite dimensional, this will always be the case.^[For normed TVS, a linear $L:X\to Y$ is continuous iff $\|L\|_{\text{op}}=\sup\{\|L(x)\|_Y\mid \|x\|_X=1\}<\infty$. Such linear maps are also termed *bounded linear maps*. If $X$ is finite dimensional, then all linear maps on $X$ are bounded and hence continuous.]
**Example 1:** Consider $x\mapsto x^2$ for $x\in\mathbb{R}$.
$$
(x+h)(x+h) = x^2+ 2xh + h^2
$$
So
$$
f(x+h)-f(x) = 2xh +h^2 = D_f(x)(h) + o_f(x)(h)
$$
Clearly, $D_f(x)(h) = 2xh$ is linear (in $h$) and $\|o(h)\|/\|h\| = |h|^2/|h| = |h| \to 0$ as $h\to 0$.
**Example 2:** This is essentially the same example. Now $x\in \mathbb{R}^n$ and $f(x)=x^Tx=\langle x,x\rangle = \|x\|^2$.
$$
\begin{align}
f(x+h)&=(x+h)^T(x+h)=(x^T+h^T)(x+h)\\
&=x^Tx + x^Th+h^Tx + h^Th \\
&= \|x\|^2 + (2x^T)(h) + \|h\|^2\\
&=f(x)+(2x^T)(h)+\|h\|^2\\
&=f(x)+D_f(x)(h)+o_f(x)(h)
\end{align}
$$
Again $D_f(x)(h) = (2x^T)(h)$ is linear (in $h$) and $o_f(x)(h)/\|h\|=\|h\|^2/\|h\| \to 0$ as $h\to 0$. So $D_f(x)(h)=(2x^T)(h)$ and if you like the "usual way" of writing this, $2x^T=[f_1(x),\ldots,f_n(x)]$ where $f_i$ is the partial derivative with respect to the $i$^th^ variable, and so $2x^T$ is the Jacobian of $f$ at $x$.
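Here is the defining limit checked numerically for this example (a sketch; the random $x$ and direction are arbitrary choices): the remainder is exactly $\|h\|^2$, so the ratio decays like $\|h\|$.
```python
import numpy as np

# For f(x) = x^T x the claim is f(x+h) - f(x) - 2 x^T h = o(h).
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
direction = rng.standard_normal(5)

f = lambda v: v @ v
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    remainder = f(x + h) - f(x) - 2 * (x @ h)
    print(f"||h|| = {np.linalg.norm(h):.1e}"
          f"  |o(h)|/||h|| = {abs(remainder) / np.linalg.norm(h):.3e}")
```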
It would not suffice to simply have the error $o_f(x)(h)$ satisfy $\lim_{h\to 0}o_f(x)(h)=0$. For we could take any $\alpha\in \mathbb{R}$ and set $D_f(x)(h)=\alpha h$. Simple continuity of $f$ at $x$ would imply that $o_f(x)(h)\to 0$ as $h\to 0$, so this clearly is not enough to guarantee the uniqueness of $D_f(x)$.
**Claim:** $D_f(x)$ is unique when it exists.
**Proof:** Suppose we had two candidate linear maps $D_f(x)$ and $D'_f(x)$; note that the error term is determined as $o_f(x)(h)=f(x+h)-f(x)-D_f(x)(h)$, and similarly for $o'_f(x)$. We have
$$
D_f(x)(h)-D'_f(x)(h)=o_f(x)(h)-o'_f(x)(h)
$$
So $L:X\to Y$ defined by $L(h)=D_f(x)(h)-D'_f(x)(h)$ is linear and has the property that $\frac{\|L(h)\|_Y}{\|h\|_X}\to 0$ as $h\to 0$ since $\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}\to 0$ and $\frac{\|o_f'(x)(h)\|_Y}{\|h\|_X}\to 0$.
Fix $h\in X$ with $h\neq 0$. Since $rh\to0$ as $r\to 0$ we have that
\begin{align}
0&=\lim_{r\to 0}\frac{\|L(rh)\|_Y}{\|rh\|_X}
=\lim_{r\to 0}\frac{\|rL(h)\|_Y}{\|rh\|_X}\\
&=\lim_{r\to 0}\frac{|r|\,\|L(h)\|_Y}{|r|\,\|h\|_X}
=\lim_{r\to 0}\frac{\|L(h)\|_Y}{\|h\|_X}=\frac{\|L(h)\|_Y}{\|h\|_X}
\end{align}
But then $\|L(h)\|_Y=0\cdot\|h\|_X=0$ and so $L(h)=0$. This was for an arbitrary $h\neq 0$ in $X$, and $L(0)=0$ by linearity, so $L=0$, the constantly $0$ map. This means that $L(h)=D_f(x)(h)-D'_f(x)(h)=0$ for all $h$ and hence $D_f(x)=D'_f(x)$ are the same linear map.
❏
**Note:** This proof really used the full condition on the error term: $\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0$.
## Computing D~f~(x) for f:ℝ^n^→ℝ
Suppose $D_f(x)$ exists, then since $D_f(x)$ is linear, there is a $1\times n$ matrix $A$ so that $D_f(x)(h)=Ah$, that is
$$
D_f(x)(h)=\bigl[\, a_1\,a_2\,\cdots\,a_n\bigr]
\begin{bmatrix}h_1\\h_2\\\vdots\\h_n\end{bmatrix}
$$
Given an $m\times n$ matrix $B$ we can find the $j$^th^ column of $B$ as $Be^n_j$ where $e^n_j$ is the $j$^th^ standard basis element:
$$
e^n_j=\begin{bmatrix}
0\\\vdots\\0\\1\\0\\\vdots\\0
\end{bmatrix}
\begin{matrix}
\phantom{0}\\\phantom{\vdots}\\\phantom{0}\\\leftarrow j\\\phantom{1}\\\phantom{\vdots}\\\phantom{0}
\end{matrix}
$$
That is $e^n_j(i)=\begin{cases}0&\text{if }i\neq j\\1&\text{if }i=j\end{cases}$.
So $B_{i,j}=Be^n_j(i)$. In our current setting, where $A$ is $1\times n$, this gives $a_i=D_f(x)(e_i^n)$. Now consider
\begin{align*}
\lim_{h\to 0}\frac{|f(x+h)-f(x)-D_f(x)(h)|}{\|h\|}=0
&\implies
\lim_{t\to 0}\frac{|f(x+te_i^n)-f(x)-D_f(x)(te_i^n)|}{|t|}=0\\
&\implies \lim_{t\to 0}\left|\frac{f(x+te_i^n)-f(x)}{t}-a_i\right|=0
\end{align*}
since $\frac{D_f(x)(te_i^n)}{t}=\frac{tD_f(x)(e_i^n)}{t}=D_f(x)(e_i^n)=a_i$. But let
$$
\hat f_x(t)=f(x+te_i^n)=f(x_1,x_2,\ldots,x_{i-1},x_i+t,x_{i+1},\ldots,x_n),
$$
then $a_i=\frac{d}{dt}\hat f_x\big|_{t=0}\stackrel{\tiny{\text{def}}}{=}\frac{\partial}{\partial x_i}f(x)$. We also denote this $\partial_i f(x)$, $\frac{\partial f}{\partial x_i}(x)$, and sometimes $f_i(x)$. So we have shown that if $f$ is differentiable at $x$, then
$$
D_f(x)=\bigl[\,\partial_1 f\,\partial_2 f\,\ldots\,\partial_n f\,\bigr]
$$
**Example** Consider $f(x)=x^Tx=\sum_{i}x_i^2$ again, so $f:\mathbb R^n\to \mathbb R$. We have $\partial_i f(x)=2x_i$ and so $D_f(x)=\bigl[\,2x_1\,2x_2\,\ldots\,2x_n\,\bigr]=2x^T$, which agrees with what we got above!
**Example** Consider $f(x,y)=x\cdot y$. Then $[D_f(x,y)]=\bigl[y\ \ x\bigr]$. So at a point $(a,b)$, the best linear approximation to $f((a,b)+(h,k))-f(a,b)$ is $D_f(a,b)(h,k)=\bigl[b\ a\bigr]\Bigl[\begin{smallmatrix}h\\k\end{smallmatrix} \Bigr]=bh+ak$. This makes sense as
$$
f((a,b)+(h,k))-f((a,b))=(a+h)(b+k)-ab=(ab+ak+bh+hk)-ab=(ak+bh)+hk
$$
So $o_f((h,k))=hk$ and $\lim_{(h,k)\to (0,0)}\frac{hk}{\sqrt{h^2+k^2}}=0$.
Note, when we get to the chain rule for multivariable functions we will have
$$
\frac{d}{dt}(f(t)\cdot g(t))=D_F(f(t),g(t))(f'(t),g'(t))=
\bigl[g(t)\ f(t)\bigr]\Bigl[\begin{smallmatrix}f'(t)\\g'(t)\end{smallmatrix} \Bigr]=g(t)f'(t)+f(t)g'(t)
$$
where $F(x,y)=x\cdot y$.
## Computing D~f~(x) for f : ℝ^n^ → ℝ^m^
Here we just note that $f:\mathbb R^n\to\mathbb R^m$ is really just given by $f(x)=(f_1(x),\ldots,f_m(x))$ where each $f_i:\mathbb R^n\to\mathbb R$. Actually, to use the correct vector/matrix notation,
$$
f(x)=\begin{bmatrix}f_1(x)\\f_2(x)\\\vdots\\f_m(x)\end{bmatrix}
$$
As above, $D_f(x)(i,j)=D_f(x)(e_j^n)(i)=D_{f_i}(x)(e_j^n)$, and thus $D_f(x)(i,j)=\partial_j f_i(x)$, that is,
$$
D_f(x)=\begin{bmatrix}
\partial_1 f_1&\partial_2 f_1&\cdots&\partial_n f_1\\
\partial_1 f_2&\partial_2 f_2&\cdots&\partial_nf_2\\
\vdots&\vdots&\ddots&\vdots\\
\partial_1 f_m&\partial_2 f_m&\cdots&\partial_n f_m
\end{bmatrix}
$$
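This matrix of partials is exactly what a finite-difference scheme computes column by column (a sketch; the helper `jacobian` and the step `eps` are my own choices, tested here on the example worked below):
```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the m x n matrix [∂_j f_i(x)] by central differences."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0
        cols.append((f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps))  # j-th column
    return np.column_stack(cols)

# f(x, y) = (x^2 - y^2, 2xy^2), whose Jacobian is [[2x, -2y], [2y^2, 4xy]].
f = lambda v: np.array([v[0]**2 - v[1]**2, 2 * v[0] * v[1]**2])
print(jacobian(f, [3.0, 4.0]))      # ≈ [[ 6, -8], [32, 48]]
```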
**Warning** Clearly if $D_f(x)$ exists, then all of the partial derivatives exist, but the partial derivatives existing is not enough to guarantee that $f$ is differentiable.
**Theorem** If the partials $\partial_j f_i$ exist and are continuous in an open nbhd of $x$ for all $i,j$, then $D_f$ exists and is continuous in an open nbhd of $x$.
**Proof** It suffices to prove this one component at a time, so assume $f:\mathbb R^n\to\mathbb R$; the candidate derivative is $D_f(x)(h)=\sum_{i=1}^n\partial_i f(x)h_i$. Let $U$ be an open nbhd of $x$ so that $\partial_i f$ is continuous in $U$ for all $i$. Let $\epsilon>0$ and choose $\delta>0$ so that $N_\delta(x)\subseteq U$ and $y\in N_\delta(x)\implies |\partial_i f(y)-\partial_i f(x)|<\epsilon/n$ for all $i$. Let $x+h\in N_\delta(x)$. Call $x=y_0$ and $x+h=y_n$ and choose $y_i$ so that $y_i-y_{i-1}=h_ie_i$; note each $y_i\in N_\delta(x)$. By the MVT there is $c_i$ between $0$ and $h_i$ so that $f(y_i)-f(y_{i-1})=\partial_i f(y'_i)h_i$ where $y'_i=y_{i-1}+c_ie_i$. We then have $f(x+h)-f(x)=f(y_n)-f(y_0)=\sum_{i=1}^n\bigl(f(y_i)-f(y_{i-1})\bigr)=\sum_{i=1}^n\partial_i f(y'_i)h_i$. So $f(x+h)-f(x)-D_f(x)(h)=\sum_{i=1}^n\partial_i f(y'_i)h_i-\sum_{i=1}^n\partial_i f(x)h_i=\sum_{i=1}^n(\partial_i f(y_i')-\partial_i f(x))h_i$. Thus
$$
|f(x+h)-f(x)-D_f(x)(h)| = \left|\sum_{i=1}^n(\partial_i f(y_i')-\partial_i f(x))h_i\right|
\le \sum_{i=1}^n(\epsilon/n) |h_i|\le\epsilon\|h\|
$$
so
$$
\frac{|f(x+h)-f(x)-D_f(x)(h)|}{\|h\|}\le\epsilon.
$$
❏
**Example**: Let $f(x,y)=(x^2-y^2,2xy^2)$, then $\partial_x f_1=2x$, $\partial_y f_1=-2y$, $\partial_x f_2=2y^2$, and $\partial_y f_2=4xy$. Clearly all of these partials are continuous on all of $\mathbb R^2$, so $f$ is differentiable on all of $\mathbb R^2$ and
$$
D_f(a,b)(h,k)=\begin{bmatrix}
2a&-2b\\
2b^2&4ab
\end{bmatrix}\begin{bmatrix}h\\k\end{bmatrix}
$$
so given that $f(3,4)=(-7, 96)$ we can approximate $f(3.12,3.84)=f(3+0.12,4-0.16)\approx f(3,4)+D_f(3,4)(0.12,-0.16)$ so
$$
f(3.12,3.84) \approx
\begin{bmatrix}-7\\96\end{bmatrix}+
\begin{bmatrix}6&-8\\32&48\end{bmatrix}
\begin{bmatrix}0.12\\-0.16\end{bmatrix}
=\begin{bmatrix}-7\\96\end{bmatrix}+
\begin{bmatrix}2\\-3.84\end{bmatrix}
=\begin{bmatrix}-5\\92.16\end{bmatrix}
$$
You can check that $f(3.12,3.84)=(-5.0112,\ 92.012544)$.
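Checking this in code (a sketch reusing the numbers above):
```python
import numpy as np

# Exact f(3.12, 3.84) versus the linear approximation f(3,4) + D_f(3,4)(0.12, -0.16).
f = lambda v: np.array([v[0]**2 - v[1]**2, 2 * v[0] * v[1]**2])

exact  = f(np.array([3.12, 3.84]))
approx = f(np.array([3.0, 4.0])) + np.array([[6.0, -8.0], [32.0, 48.0]]) @ np.array([0.12, -0.16])
print(exact)     # [-5.0112   92.012544]
print(approx)    # [-5.       92.16    ]
```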
Just to reiterate something done above: for the case that $f:\mathbb{R}\to\mathbb{R}$ it is clear that $D_f(x)(h) = ah$ for some $a\in\mathbb{R}$, and hence just the definition provides:
$$
\frac{f(x+h)-f(x)-ah}{h}=\frac{o_f(x)(h)}{h}\to 0\text{ as }h\to 0
$$
and hence
$$
\lim_{h\to 0}\left(\frac{f(x+h)-f(x)}{h}-a\right)=0
$$
which becomes the familiar
$$
\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}=a=f'(x)
$$
and hence $D_f(x)(h)=f'(x)h$, so $D_f(x):\mathbb{R}\to\mathbb{R}$ is the linear function $h\mapsto f'(x)h$.
# Properties of the derivative
## Continuity of *f* at differentiable points
Let $f:X\to Y$ be differentiable at $x\in X$, then $f$ is continuous at $x$. To see this notice
$$
\|f(x+h)-f(x)\|_Y\leq\|D_f(x)(h)\|_Y+\|o_f(x)(h)\|_Y
$$
The continuity of $D_f(x)$ implies that $D_f(x)(h)\to D_f(x)(0)=0$ as $h\to 0$ and by assumption $\|o_f(x)(h)\|_Y\to 0$, so we see $\|f(x+h)-f(x)\|_Y\to 0$ as $h\to 0$ and hence $f$ is continuous at $x$.
## Critical points
**Theorem** (Fermat's little theorem.) If $f:X\to\mathbb{R}$, a local maximum or minimum occurs at $c$, and $f$ is differentiable at $c$, then $D_f(c)=0$, i.e., $D_f(c)$ is the $0$ linear map.
**Proof** Suppose a local maximum occurs at $c$ and that $D_f(c)(x) = \gamma > 0$ for some $x$, that is, $D_f(c)$ is not the $0$-map. Then for $t>0$
$$
\frac{\gamma}{\|x\|}=\frac{D_f(c)(tx)}{\|tx\|}=\frac{f(c+tx)-f(c)-o_f(c)(tx)}{\|tx\|}
$$
Choose $\delta'>0$ so that for $0<t<\delta'$, $\frac{|o_f(c)(tx)|}{\|tx\|}<\frac{\gamma}{2\|x\|}$; then $f(c+tx)-f(c)>\frac{t\gamma}{2}>0$. So $f(c)$ is not a local max. If $D_f(c)(x)=\gamma<0$, we do the same, but with $t<0$. If $c$ is a local minimum, then we use the same argument, but invert the roles of $t>0$ and $t<0$.
❏
**Definition** $c\in X$ is a *critical point* or *stationary point* iff $D_f(c)$ is the $0$ map.
So for $f:X\to \mathbb{R}$, if $f$ has a local extremum at $c$ and $f$ is differentiable at $c$, then $c$ is a critical point, but the converse can fail.
## Algebraic properties of derivatives
### Product Rule
Suppose $f:X\to Y$ and $g:X\to Y$ are dfbl at $x\in X$ and $\star:Y\times Y\to Y$ is some sort of product satisfying^[This is a very general result; it takes care of inner products, cross-products, the usual product, etc.]:
- $\alpha(y_0\star y_1) = (\alpha y_0)\star y_1 = y_0\star(\alpha y_1)$
- $y_0\star(y_1 + y_2)=y_0\star y_1 + y_0\star y_2$
- $(y_0 + y_1)\star y_2 = (y_0 \star y_2) + (y_1 \star y_2)$
These three properties are called *bi-linearity*. (Associativity of $\star$ is not needed; indeed the cross product and the commutator are not associative.)
Then
$$
D_{f\star g}(x)(h) = D_f(x)(h) \star g(x) + f(x) \star D_g(x)(h)
$$
Be mindful of the types involved here: First the definition of $f\star g:X\to Y$ is given by $(f\star g)(x)=f(x)\star g(x)$. Second, $D_f(x), D_g(x)\in \mathcal{L}(X,Y)$, the space of linear functions from $X$ to $Y$, while $f(x), g(x)\in Y$ and $h\in X$.
**Proof** We must show that
$$
(f\star g)(x + h) - (f\star g)(x) = (D_f(x)(h) \star g(x) + f(x) \star D_g(x)(h)) + o(h)
$$
where $\|o(h)\|_Y/\|h\|_X\to 0$ as $h\to 0$. For this we simply compute using the fact that $f$ and $g$ are dfbl at $x$.
$$
\begin{align*}
f(x+h)\star& g(x+h)-f(x)\star g(x)\\
&=f(x+h)\star g(x+h) - f(x+h)\star g(x) + f(x+h)\star g(x) - f(x)\star g(x)\\
&=f(x+h)\star(g(x+h)-g(x))+(f(x+h)-f(x))\star g(x)\\
&=(f(x)+D_f(x)(h)+o_f(x)(h))\star (D_g(x)(h)+o_g(x)(h))\\
&\qquad+(D_f(x)(h)+o_f(x)(h))\star g(x)\\
&=f(x)\star D_g(x)(h)+D_f(x)(h)\star g(x) + o(x)(h)
\end{align*}
$$
where
$$
\begin{align*}
o(x)(h)&=(D_f(x)(h)+o_f(x)(h))\star (D_g(x)(h)+o_g(x)(h))\\
&\qquad+f(x)\star o_g(x)(h)+o_f(x)(h)\star g(x)\\
\end{align*}
$$
It is clear that $\lim_{h\to 0}\frac{\|o(x)(h)\|_Y}{\|h\|_X}=0$: for the last two terms use $\lim_{h\to 0}\frac{\|o_g(x)(h)\|_Y}{\|h\|_X}=0$ and $\lim_{h\to 0}\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}=0$, and for the first term note that $\frac{\|D_f(x)(h)+o_f(x)(h)\|_Y}{\|h\|_X}$ stays bounded as $h\to 0$ while $\|D_g(x)(h)+o_g(x)(h)\|_Y\to 0$. (Here we also use that $\star$ is continuous, so that $\|y_0\star y_1\|_Y\le C\,\|y_0\|_Y\|y_1\|_Y$ for some constant $C$.)
❏
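For a concrete instance, take $\star$ to be the dot product on $\mathbb R^2$ and check the rule along a curve (a sketch; the curves $f$ and $g$ are arbitrary choices):
```python
import numpy as np

# Product rule for the dot product: (f·g)'(t) = f'(t)·g(t) + f(t)·g'(t).
f  = lambda t: np.array([t, t**2])
df = lambda t: np.array([1.0, 2*t])
g  = lambda t: np.array([np.sin(t), np.cos(t)])
dg = lambda t: np.array([np.cos(t), -np.sin(t)])

t, h = 0.7, 1e-6
numeric = (f(t+h) @ g(t+h) - f(t-h) @ g(t-h)) / (2*h)   # central difference
exact   = df(t) @ g(t) + f(t) @ dg(t)
print(numeric, exact)                                    # agree to ~1e-10
```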
### Chain Rule
**Theorem:** Suppose $f:X\to Y$ and $g:Y\to Z$ with $f$ dfbl at $x$ and $g$ dfbl at $f(x)$, then $g\circ f:X\to Z$ is dfbl at $x$ and $D_{g\circ f}(x)=D_g(f(x))\circ D_f(x)$.
**Proof:** The composition of linear functions is linear, so $D_g(f(x))\circ D_f(x):X\to Z$ is linear. We must show that $o:X\to Z$ defined by
$$
o(h) = (g\circ f)(x+h)-(g\circ f)(x)-(D_g(f(x))\circ D_f(x))(h)
$$
satisfies $\lim_{h\to 0}\frac{\|o(h)\|_Z}{\|h\|_X}=0$.
$$
\begin{align*}
g(f(x+h))-g(f(x)) &= D_g(f(x))(f(x+h)-f(x)) + o_g(f(x))(f(x+h)-f(x))\\
&= D_g(f(x))(D_f(x)(h)+o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))\\
&= D_g(f(x))(D_f(x)(h)) + D_g(f(x))(o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))
\end{align*}
$$
so
$$
o(h) = D_g(f(x))(o_f(x)(h)) + o_g(f(x))(D_f(x)(h)+o_f(x)(h))
$$
We must show $\frac{\|o(h)\|_Z}{\|h\|_X}\to 0$ as $h\to 0$. It would suffice to show this true for each "part" of $o(h)$, namely show
$$
\frac{\|D_g(f(x))(o_f(x)(h))\|_Z}{\|h\|_X}\to 0\quad\text{and}\quad
\frac{\|o_g(f(x))(D_f(x)(h)+o_f(x)(h))\|_Z}{\|h\|_X}\to 0
$$
as $h\to 0$.
$$
\lim_{h\to 0}\frac{\|D_g(f(x))(o_f(x)(h))\|_Z}{\|h\|_X}=\lim_{h\to 0}\left\Vert D_g(f(x))\left(\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z
$$
But as $y\mapsto \|D_g(f(x))(y)\|_Z$ is continuous we have
$$
\begin{align*}
\lim_{h\to 0}\left\Vert D_g(f(x))\left(\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z&=
\left\Vert D_g(f(x))\left(\lim_{h\to 0}\frac{o_f(x)(h)}{\|h\|_X}\right)\right\Vert_Z\\
&=\|D_g(f(x))(0)\|_Z=0
\end{align*}
$$
So the first part is done. For the second part we notice:
$$
\frac{\|o_g(f(x))(D_f(x)(h)+o_f(x)(h))\|_Z}{\|h\|_X}=\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}\frac{\|k(h)\|_Y}{\|h\|_X}
$$
where $k(h)=D_f(x)(h)+o_f(x)(h)$. (If $k(h)=0$ then $o_g(f(x))(k(h))=0$, so we may assume $k(h)\neq 0$.) But $k(h)\to 0$ as $h\to 0$ and thus
$$
\lim_{h\to 0}\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}=\lim_{k\to 0} \frac{\|o_g(f(x))(k)\|_Z}{\|k\|_Y}=0
$$
The other factor satisfies
$$
\frac{\|k(h)\|_Y}{\|h\|_X}\le\left\Vert D_f(x)\Bigl(\frac{h}{\|h\|_X}\Bigr)\right\Vert_Y+\frac{\|o_f(x)(h)\|_Y}{\|h\|_X}
$$
and the right side is bounded for $h$ close to $0$, and so
$$
\frac{\|o_g(f(x))(k(h))\|_Z}{\|k(h)\|_Y}\frac{\|k(h)\|_Y}{\|h\|_X}\to 0\cdot \text{Bounded}=0
$$
❏
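Finally, the chain rule itself can be spot-checked with finite-difference Jacobians (a sketch; the maps $f:\mathbb R^2\to\mathbb R^3$ and $g:\mathbb R^3\to\mathbb R^2$ are arbitrary choices):
```python
import numpy as np

def num_jacobian(F, x, eps=1e-6):
    # central-difference Jacobian, one column (partial) at a time
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = 1.0
        cols.append((F(x + eps*e) - F(x - eps*e)) / (2*eps))
    return np.column_stack(cols)

f = lambda v: np.array([v[0]*v[1], v[0] + v[1], v[0]**2])
g = lambda u: np.array([u[0] + u[1]*u[2], u[0]*u[2]])

x = np.array([1.5, -0.5])
lhs = num_jacobian(lambda v: g(f(v)), x)            # D_{g∘f}(x)
rhs = num_jacobian(g, f(x)) @ num_jacobian(f, x)    # D_g(f(x)) ∘ D_f(x)
print(np.allclose(lhs, rhs, atol=1e-6))             # True
```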
*[dfbl]: differentiable
*[diff'ble]: differentiable
<!--
# Topic 5 DQs (Ignore this - leaving for posterity)
## DQ1
>In most theorems (e.g., Theorem 5.9, p. 109) involving functions that are differentiable in an interval. These functions are not required to be differentiable at endpoints of such intervals. Why? Provide an example.
These theorems basically rely on there being a local min/max inside the interval $(a,b)$ and then using "Fermat's Little Theorem" at that point. This only requires at the point in question.
Examples would include the upper half circle: $f(x)=\sqrt{1-x^2}$, this clearly is continuous on $[-1,1]$, satisfies $f(-1)=f(1)=0$, and fails to be differentiable at $x=-1$ (from the right) or at $x=1$ (from the left).
## DQ 2
Suppose $a$ and $c$ are real, $c>0$, $f$ defined on $[-1,1]$ by
$$
f(x) =
\begin{cases}
|x|^a\sin(|x|^{-c})&\text{if }x\neq 0,\\
0&\text{if }x=0.
\end{cases}
$$
**Note:** This is modified a bit from what was written, the problem being that we have not discussed what $x^a$ means for $x<0$ and $a$ not an integer.
To simplify further concentrate on the interval $[0,1]$ and on just continuity and differentiability from the right at $0$. This way we can avoid issues with absolute values. So I consider here $f:[0,1]\to\mathbb{R}$ defined by:
$$
f(x) =
\begin{cases}
x^a\sin(x^{-c})&\text{if }x> 0,\\
0&\text{if}x=0.
\end{cases}
$$
(a) Show $f$ is continuous iff $a>0$.
> **(if)** If $a=0$, then for $x\neq 0$ the finction is $\sin(x^{-c})$ and as $\lim_{x\to 0}x^{-c}=\infty$ this would have the entire interval $[-1,1]$ on the $y$-axis as limit points for the graph of $f$ and hence $f$ can't be continuous at $0$. If $a<0$, things are worse, since then any point in the $y$-axis is a limit point of the graph of $f$. So $a\ge 0$, implies $f$ is not continuous at $0$. Hence the contrapositive holds, namely, $a>0$ if $f$ is continuous at $0$.
> **(only if)** If $a>0$, then $\lim_{x\to 0}x^a=0$ and the squeeze theorem implies that $f$ is continuous at $0$. Continuity on the rest of the interval is trivial.
(b) $f'(0)$ exists iff $a>1$.
> Here we simply take left and right limits. I'll consider the left case as $x\to 0^+$:
> $$\lim_{x\to 0^+}\frac{f(x)-f(0)}{x-0}=\lim_{x\to 0^+}\frac{x^a\sin(x^c)}{x}=\lim_{x\to0^+}x^{a-1}\sin(x^c)$$
> By the same argument as in (a), the limit exists iff $a-1>0$, i.e., $a>1$. In this case it is clear that $f'(0)=0$.
(c ) $f'$ is bounded iff $a\ge 1+c$.
> Here we have
> $$
> f'(x) = \begin{cases}
> ax^{a-1}\sin(x^{-c})+cx^{a-(c+1)}\cos(x^{-c}) & \text{if }x > 0,\\
> 0 & \text{if }x=0
> \end{cases}
> $$
> The $\sin$ part is bounded if $a\ge 1$ and the $\cos$ part is bounded if $a\ge c+1$.
(d) $f'$ is continuous iff $a>c+1$.
> Just as in (a) $f'$ will be continuous at $0$ iff $a>c+1$.
(e) $f''(0)$ exists iff $a>(c + 1) + 1=c+2$.
> This is just as in (b), the point is that when dividing by $x$ we must have $x^{a-(c+1)}/x = x^{a-(c+2)}\to 0$.
(f) and (g) the pattern continues as above.
You can play with this [here](https://sagecell.sagemath.org/?z=eJxdkMsKgzAQRff5iiCKSRttFUo3Tbd-RKkQNQ8hPoiWpn_fBFtqnc3cezkDM8MghedDDmrXMyCQxU7Yku2mtke2REmNMZBL3LRCIEGgxUBtEgJzDAqXFIaNqq0n5O2ewlEPsyduR5LdCawHPRgaG97EBL46Zmnme9vTJFtPyO1EpR88JkyPitH0tEbVFpWG8_6PTSc1PN1Gau40CsLPldEUXipzDUXsPP25rw0ipNnMLfI8JouWK-2f4OoNk89Zvw==&lang=sage&interacts=eJyLjgUAARUAuQ==), to see what is going on, I just use [0,1] to avoid the issue with using absolute value. You can change the values of ```a``` and ```b```.
-->