# HW1 - Answer
## Mathematical Background (0.8%)
1. $\left(A A^{\top}\right)^{\top}=A A^{\top} \Rightarrow A A^{\top}$ is a symmetric matrix
Moreover, for any $x$ of matching dimension, $x^{\top} A A^{\top} x=\left(A^{\top} x\right)^{\top}\left(A^{\top} x\right)=\left\|A^{\top} x\right\|^{2} \geqslant 0$
Hence, $A A^{\top}$ is positive semi-definite
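A quick NumPy sanity check of the claim (a minimal sketch; the example matrix `A` and test vector are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))     # arbitrary example matrix
M = A @ A.T                         # M = A A^T

# Symmetry: M equals its transpose.
assert np.allclose(M, M.T)

# Positive semi-definiteness: eigenvalues are >= 0 up to round-off,
# and x^T M x = ||A^T x||^2 >= 0 for an arbitrary x.
assert np.all(np.linalg.eigvalsh(M) >= -1e-10)
x = rng.standard_normal(4)
assert np.isclose(x @ M @ x, np.linalg.norm(A.T @ x) ** 2)
```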
2. $f\left(x_1, x_2\right)=x_1 \sin \left(x_2\right) \exp \left(-x_1 x_2\right)$
$\frac{\partial f\left(x_1, x_2\right)}{\partial x_1}=\sin \left(x_2\right) \exp \left(-x_1 x_2\right)-x_1 x_2 \exp \left(-x_1 x_2\right)\sin\left(x_2\right)$
$\frac{\partial f\left(x_1, x_2\right)}{\partial x_2}=x_1 \cos \left(x_2\right) \exp \left(-x_1 x_2\right)+x_1 \sin \left(x_2\right) \exp \left(-x_1 x_2\right)\left(-x_1\right)$
$$
\Rightarrow \nabla f(x)=\left[\begin{array}{c}
\sin \left(x_2\right) \exp \left(-x_1 x_2\right)-x_1 x_2\sin \left(x_2\right) \exp \left(-x_1 x_2\right) \\
x_1 \cos \left(x_2\right) \exp \left(-x_1 x_2\right)-x_1^2 \sin \left(x_2\right) \exp \left(-x_1 x_2\right)
\end{array}\right]
$$
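The gradient above can be checked against central finite differences (a minimal sketch; the evaluation point and step size are arbitrary choices):
```python
import numpy as np

def f(x1, x2):
    return x1 * np.sin(x2) * np.exp(-x1 * x2)

def grad_f(x1, x2):
    # Gradient derived above.
    e = np.exp(-x1 * x2)
    return np.array([
        np.sin(x2) * e - x1 * x2 * np.sin(x2) * e,
        x1 * np.cos(x2) * e - x1 ** 2 * np.sin(x2) * e,
    ])

x1, x2, h = 0.7, -1.3, 1e-6
numeric = np.array([
    (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),   # d/dx1 via central difference
    (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h),   # d/dx2 via central difference
])
assert np.allclose(grad_f(x1, x2), numeric, atol=1e-6)
```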
3. Consider the likelihood function
$$
\prod_{i=1}^n f\left(x_i ; p\right)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}=p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}
$$
Since the logarithm is monotone, $\hat{p}_{m l e} \in \operatorname{argmax}_p \prod_{i=1}^n f\left(x_i ; p\right)=\operatorname{argmax}_p \log \left(\prod_{i=1}^n f\left(x_i ; p\right)\right)$, where
$$
\log \left(\prod_{i=1}^n f\left(x_i ; p\right)\right)=\sum_{i=1}^n x_i \log p+\left(n-\sum_{i=1}^n x_i\right) \log (1-p)
$$
Consider the first order condition:
$$
\begin{aligned}
& \frac{\partial}{\partial p} \log \left(\prod_{i=1}^n f\left(x_i ; p\right)\right)=\frac{1}{p} \sum_{i=1}^n x_i-\frac{1}{1-p}\left(n-\sum_{i=1}^n x_i\right)=0 \\
\Rightarrow & p n-p \sum_{i=1}^n x_i=\sum_{i=1}^n x_i-p \sum_{i=1}^n x_i \quad \Rightarrow \hat{p}_{m l e}=\frac{1}{n} \sum_{i=1}^n x_i \\
& \frac{\partial^2}{\partial p^2} \log \left(\prod_{i=1}^n f\left(x_i ; p\right)\right)=\frac{-1}{p^2} \sum_{i=1}^n x_i-\frac{1}{(1-p)^2}\left(n-\sum_{i=1}^n x_i\right) \leq 0 .
\end{aligned}
$$
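The closed-form MLE can be checked numerically by comparing the sample mean against a grid search over the log-likelihood (a minimal sketch; the sample and grid are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=1000)          # Bernoulli(0.3) sample

def log_likelihood(p):
    return np.sum(x) * np.log(p) + (len(x) - np.sum(x)) * np.log(1 - p)

p_hat = x.mean()                             # closed-form MLE derived above
grid = np.linspace(0.01, 0.99, 981)          # grid spacing 0.001
p_grid = grid[np.argmax([log_likelihood(p) for p in grid])]

# The grid maximizer agrees with the sample mean up to the grid spacing.
assert abs(p_hat - p_grid) < 1e-2
```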
## Closed-Form Linear Regression Solution (0.8%)
1. Consider
$$
\begin{aligned}
L(\boldsymbol{\theta}) := \sum_{i=1}^n w_i ( y_i-\mathbf{X}_i \boldsymbol{\theta})^2 &=(\mathbf{y}-\mathbf{X} \boldsymbol{\theta})^T \boldsymbol{\Omega}\left(\mathbf{y}-\mathbf{X} \boldsymbol{\theta}\right) \\
&=\mathbf{y}^T \boldsymbol{\Omega} \mathbf{y}-\boldsymbol{\theta}^T \mathbf{X}^T \boldsymbol{\Omega} \mathbf{y}-\mathbf{y}^T \boldsymbol{\Omega} \mathbf{X}\boldsymbol{\theta}+\boldsymbol{\theta}^{T} \mathbf{X}^{T} \boldsymbol{\Omega}\mathbf{X} \boldsymbol{\theta}
\end{aligned}
$$
where $\boldsymbol{\Omega}=\operatorname{diag}\left(w_1, \ldots, w_n\right)$. Setting the gradient to zero,
$$
\nabla L(\boldsymbol{\theta}) = -2\mathbf{X}^T\boldsymbol{\Omega}\mathbf{y} + 2 (\mathbf{X}^T\boldsymbol{\Omega} \mathbf{X}) \boldsymbol{\theta} = 0
$$
Then $\boldsymbol{\theta}^*=\left(\mathbf{X}^T \boldsymbol{\Omega} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \boldsymbol{\Omega} \mathbf{y}$. Now, verify that $\boldsymbol{\theta}^*$ is indeed the minimizer:
$$
\begin{aligned}
L(\boldsymbol{\theta})
&=\mathbf{y}^T \boldsymbol{\Omega} \mathbf{y}-\boldsymbol{\theta}^T \mathbf{X}^T \boldsymbol{\Omega} \mathbf{y}-\mathbf{y}^T \boldsymbol{\Omega} \mathbf{X}\boldsymbol{\theta}+\boldsymbol{\theta}^{T} \mathbf{X}^{T} \boldsymbol{\Omega}\mathbf{X} \boldsymbol{\theta} \\
&=\left(\boldsymbol{\theta}-\boldsymbol{\theta}^*\right)^{T}\mathbf{X}^{T} \boldsymbol{\Omega}\mathbf{X}\left(\boldsymbol{\theta}-\boldsymbol{\theta}^*\right)+\mathbf{y}^{T} \boldsymbol{\Omega} \mathbf{y}-\mathbf{y}^T \boldsymbol{\Omega} \mathbf{X}\left(\mathbf{X}^{T} \boldsymbol{\Omega} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \boldsymbol{\Omega} \mathbf{y}
\end{aligned}
$$
Since $\mathbf{X}^{T} \boldsymbol{\Omega} \mathbf{X}$ is positive semi-definite, $L(\boldsymbol{\theta})$ attains its minimum at $\boldsymbol{\theta}^*$.
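As a sanity check, the closed-form weighted solution can be compared with ordinary least squares on rows rescaled by $\sqrt{w_i}$, which minimizes the same objective (a minimal NumPy sketch with arbitrary synthetic data):
```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
w = rng.uniform(0.5, 2.0, size=n)            # positive sample weights
Omega = np.diag(w)

# Closed-form weighted least-squares solution theta* = (X^T Omega X)^{-1} X^T Omega y.
theta = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# Equivalent formulation: OLS after scaling each row of X and y by sqrt(w_i).
sw = np.sqrt(w)
theta_ols = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
assert np.allclose(theta, theta_ols)
```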
2. Consider
$$
\begin{aligned}
L(\boldsymbol{\theta}) := \sum_i (y_i - \mathbf{X}_i\boldsymbol{\theta})^2 + \lambda \sum_j w_j^2 &= (\mathbf{y}-\mathbf{X}\boldsymbol{\theta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\theta}) + \lambda\boldsymbol{\theta}^T\boldsymbol{\theta}\\
&= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y}+\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}+ \lambda\boldsymbol{\theta}^T\boldsymbol{\theta}
\end{aligned}
$$
Note that $\lambda\boldsymbol{\theta}^T\boldsymbol{\theta} = \lambda \boldsymbol{\theta}^T I_m\boldsymbol{\theta}$, where $I_m$ is the $m\times m$ identity matrix, so $\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} + \lambda \boldsymbol{\theta}^T I_m\boldsymbol{\theta} = \boldsymbol{\theta}^T(\mathbf{X}^T\mathbf{X} + \lambda I_m)\boldsymbol{\theta}$. Setting the gradient to zero,
$$
\nabla L(\boldsymbol{\theta}) = -2\mathbf{X}^T\mathbf{y} + 2 (\mathbf{X}^T \mathbf{X} + \lambda I_m) \boldsymbol{\theta} = 0
$$
Then $\boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X} + \lambda I_m)^{-1} \mathbf{X}^T\mathbf{y}$. Now, verify that $\boldsymbol{\theta}^*$ is indeed the minimizer:
$$
\begin{aligned}
&\quad \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y}+\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}+ \lambda\boldsymbol{\theta}^T\boldsymbol{\theta}\\
&= (\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T(\mathbf{X}^T\mathbf{X} + \lambda I_m)(\boldsymbol{\theta} - \boldsymbol{\theta}^*) + \mathbf{y}^T\mathbf{y} - (\boldsymbol{\theta}^*)^T(\mathbf{X}^T\mathbf{X} + \lambda I_m)(\boldsymbol{\theta}^*)
\end{aligned}
$$
Clearly, $\mathbf{X}^T\mathbf{X} + \lambda I_m$ is positive definite, since $\mathbf{X}^T\mathbf{X}$ is positive semi-definite and $\lambda$ is a positive scalar. Hence, $\boldsymbol{\theta}^*$ is the unique minimizer.
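A quick numerical check that the ridge solution is indeed the minimizer: random perturbations of $\boldsymbol{\theta}^*$ never decrease the loss (a minimal sketch with arbitrary synthetic data and $\lambda$):
```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 40, 4, 0.5
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# Closed-form ridge solution theta* = (X^T X + lam I)^{-1} X^T y.
theta_star = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def ridge_loss(theta):
    r = y - X @ theta
    return r @ r + lam * theta @ theta

# Every perturbed parameter vector gives a loss at least as large as L(theta*).
for _ in range(100):
    theta = theta_star + rng.standard_normal(m)
    assert ridge_loss(theta) >= ridge_loss(theta_star) - 1e-9
```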
(Bonus) Let $X^\prime = [X, \boldsymbol{1}]\in \mathbb{R}^{n\times (m+1)}$, where $\boldsymbol{1} = [1, 1, \cdots, 1]^T$, and write $\boldsymbol{\theta} = [\mathbf{w}^T, b]^T \in \mathbb{R}^{m+1}$. Then
$$
\begin{aligned}
L(\boldsymbol{\theta}) := \sum_i (y_i - \mathbf{X}^\prime_i\boldsymbol{\theta})^2 + \lambda \sum_j w_j^2 &= (\mathbf{y}-\mathbf{X}^\prime\boldsymbol{\theta})^T(\mathbf{y}-\mathbf{X}^\prime\boldsymbol{\theta}) + \lambda\boldsymbol{\theta}^T\boldsymbol{\theta} - \lambda b^2\\
&= (\mathbf{y}-\mathbf{X}^\prime\boldsymbol{\theta})^T(\mathbf{y}-\mathbf{X}^\prime\boldsymbol{\theta}) + \boldsymbol{\theta}^T \lambda D\boldsymbol{\theta}
\end{aligned}
$$
where $D =
\left[\begin{array}{ccccc}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
0 & 0 & \cdots & 0 & 0
\end{array}\right]\in\mathbb{R}^{(m+1)\times (m+1)}$ i.e. $\operatorname{diag}(1, 1, \cdots, 1, 0)$
Following the same argument as in part 2, we obtain
$$
\boldsymbol{\theta}^* = \left[
\begin{array}{c}
\mathbf{w}^*\\
b^*
\end{array}
\right] = (\mathbf{X}^{\prime T} \mathbf{X}^{\prime} + \lambda D)^{-1} \mathbf{X}^{\prime T}\mathbf{y}
$$
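A minimal sketch of the bonus formula (arbitrary synthetic data; the bias column is appended last, matching $D=\operatorname{diag}(1,\ldots,1,0)$), checking that the gradient vanishes at the closed-form solution:
```python
import numpy as np

rng = np.random.default_rng(4)
n, m, lam = 60, 3, 1.0
X = rng.standard_normal((n, m))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.standard_normal(n)

Xp = np.hstack([X, np.ones((n, 1))])     # X' = [X, 1]
D = np.eye(m + 1)
D[-1, -1] = 0.0                          # the bias b is not penalized

theta = np.linalg.solve(Xp.T @ Xp + lam * D, Xp.T @ y)
w_star, b_star = theta[:-1], theta[-1]

# Gradient of (y - X' theta)^T (y - X' theta) + lam theta^T D theta at theta*:
grad = -2 * Xp.T @ (y - Xp @ theta) + 2 * lam * D @ theta
assert np.allclose(grad, 0.0, atol=1e-8)
```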
## Logistic Sigmoid Function and Hyperbolic Tangent Function (0.8%)
1. Using $\sigma(a)=\frac{1}{1+e^{-a}}$, we have
$$
\begin{aligned}
2 \sigma(2 a)-1 &=\frac{2}{1+e^{-2 a}}-1 \\
&=\frac{2}{1+e^{-2 a}}-\frac{1+e^{-2 a}}{1+e^{-2 a}} \\
&=\frac{1-e^{-2 a}}{1+e^{-2 a}} \\
&=\frac{e^a-e^{-a}}{e^a+e^{-a}} \\
&=\tanh (a)
\end{aligned}
$$
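The identity is easy to confirm numerically (a minimal sketch over an arbitrary range of $a$):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a = np.linspace(-5, 5, 101)
assert np.allclose(np.tanh(a), 2 * sigmoid(2 * a) - 1)
```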
2. If we now take $a_j=\frac{\left(x-\mu_j\right)}{2s}$, we can rewrite as
$$
\begin{aligned}
y(\mathbf{x}, \mathbf{w}) &=w_0+\sum_{j=1}^M w_j \sigma\left(2 a_j\right) \\
&=w_0+\sum_{j=1}^M \frac{w_j}{2}\left(2 \sigma\left(2 a_j\right)-1+1\right) \\
&=u_0+\sum_{j=1}^M u_j \tanh \left(a_j\right)
\end{aligned}
$$
where $u_j=\frac{1}{2}w_j$, for $j=1, \ldots, M$, and $u_0=w_0+\frac{1}{2}\sum_{j=1}^M w_j$.
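A quick check that the two parameterizations produce identical outputs (a minimal sketch; $M$, $s$, $\mu_j$, and the weights are arbitrary example values):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
M, s = 4, 0.7
mu = rng.standard_normal(M)
w0, w = rng.standard_normal(), rng.standard_normal(M)
x = np.linspace(-3, 3, 50)

a = (x[:, None] - mu) / (2 * s)               # a_j = (x - mu_j) / (2s)
y_sigmoid = w0 + (w * sigmoid(2 * a)).sum(axis=1)

u0 = w0 + 0.5 * w.sum()                       # u_0 = w_0 + (1/2) sum_j w_j
u = 0.5 * w                                   # u_j = w_j / 2
y_tanh = u0 + (u * np.tanh(a)).sum(axis=1)

assert np.allclose(y_sigmoid, y_tanh)
```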
## Noise and Regularization (0.8%)
By definition,
\begin{align*}
{\tilde L}_{ss}({\bf w},b) &= {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i + {\bf \eta}_i)-y_i)^2 \right]\\
&= \frac{1}{2N}\sum_{i=1}^N \mathbb{E}\left[(\mathbf{w}^T(\mathbf{x}_i + \eta_i) + b - y_i)^2\right]\\
&= \frac{1}{2N}\sum_{i=1}^N\mathbb{E}\left[ \left\{(\mathbf{w}^T\mathbf{x}_i + b - y_i) + \mathbf{w}^T\eta_i\right\}^2\right]\\
&= \frac{1}{2N}\sum_{i=1}^N \left\{(\mathbf{w}^T\mathbf{x}_i + b - y_i)^2 + 2(\mathbf{w}^T\mathbf{x}_i + b - y_i)\,\mathbf{w}^T\mathbb{E}(\eta_i) + \mathbb{E}\left[(\mathbf{w}^T\eta_i)^2\right]\right\}\\
&= \frac{1}{2N}\sum_{i=1}^N \left\{(\mathbf{w}^T\mathbf{x}_i + b - y_i)^2 + \mathbb{E}\left[(\mathbf{w}^T\eta_i)^2\right]\right\}
\end{align*}
since $\mathbb{E}(\eta_i) = \mathbf{0}$ by assumption.
Now, calculate $\mathbb{E}\left[(\mathbf{w}^T\eta_i)^2\right]$
\begin{align*}
\sum_{i=1}^N\mathbb{E}\left[(\mathbf{w}^T\eta_i)^2\right] &= \sum_{i=1}^N\mathbb{E}\Big[\Big(\sum_{j=1}^k w_j\eta_{i, j}\Big)^2\Big]\\
&= \sum_{i=1}^N\mathbb{E}\Big(\sum_{j = 1}^k\sum_{l=1}^k w_j w_l \eta_{i, j}\eta_{i, l}\Big)\\
&= \sum_{j = 1}^k\sum_{l=1}^kw_j w_l\sum_{i=1}^N\mathbb{E}(\eta_{i, j}\eta_{i, l})\\
&= N\sigma^2 \sum_{j = 1}^k w_j^2 = N\sigma^2\Vert \mathbf{w}\Vert^2
\end{align*}
where the last equality uses $\mathbb{E}(\eta_{i, j}\eta_{i, l}) = \sigma^2\delta_{jl}$, i.e., the noise components are uncorrelated with variance $\sigma^2$. Hence,
\begin{align*}
{\tilde L}_{ss}({\bf w},b) &= \frac{1}{2N}\sum_{i=1}^N (\mathbf{w}^T\mathbf{x}_i + b - y_i)^2 + \frac{1}{2N} N\sigma^2\Vert \mathbf{w}\Vert^2\\
&= \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2
\end{align*}
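The equivalence can be checked by Monte Carlo simulation under the assumption $\eta_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$ (a minimal sketch with arbitrary synthetic data; the tolerance is loose because the left-hand side is a stochastic estimate):
```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma = 30, 3, 0.4
w = np.array([0.5, -1.0, 2.0])
b = 0.3
X = rng.standard_normal((n, k))
y = X @ w + b + 0.1 * rng.standard_normal(n)

def loss(Xmat):
    # Sum-of-squares loss (1/2N) sum_i (w^T x_i + b - y_i)^2.
    return 0.5 / n * np.sum((Xmat @ w + b - y) ** 2)

# Monte Carlo estimate of the expected loss under input noise eta_i ~ N(0, sigma^2 I).
T = 20000
noisy = np.mean([loss(X + sigma * rng.standard_normal((n, k))) for _ in range(T)])

# L2-regularized clean loss derived above.
regularized = loss(X) + 0.5 * sigma ** 2 * np.sum(w ** 2)
assert abs(noisy - regularized) < 5e-3
```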
## Logistic Regression (0.8%)
1.
$$
\begin{aligned}
& \mathbf{w}^{\top} \mathbf{x}+b=\left[\begin{array}{llll}
-1 & 2 & -1 & 5
\end{array}\right]\left[\begin{array}{l}
7 \\
0 \\
3 \\
10
\end{array}\right]+3=-7+0-3+50+3=43 \\
& P\left(c_1 \mid \mathbf{x}\right)=\sigma(43)=\frac{1}{1+e^{-43}} \approx 1 \\
\Rightarrow & P\left(c_2 \mid \mathbf{x}\right)=1-P\left(c_1 \mid \mathbf{x}\right)=1-\frac{1}{1+e^{-43}}=\frac{e^{-43}}{1+e^{-43}} \approx 0
\end{aligned}
$$
2.
$$
\begin{aligned}
&P\left(y_i \mid \mathbf{x}_i\right)=f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)^{y_i} \cdot\left(1-f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)\right)^{1-y_i}, \quad y_i \in\{0,1\} . \\
&P(\mathbf{y} \mid \mathbf{x})=\prod_i P\left(y_i \mid \mathbf{x}_i\right)=\prod_i f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)^{y_i}\left(1-f_{\mathbf{w},b}\left(\mathbf{x}_i\right)\right)^{1-y_i}
\end{aligned}
$$
Loss function: $L(\mathbf{w}, b)=-\log P(\mathbf{y} \mid \mathbf{x})=-\sum_i\left(y_i \log f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)+\left(1-y_i\right) \log \left(1-f_{\mathbf{w},b}\left(\mathbf{x}_i\right)\right)\right)$
3. Note that
\begin{align*}
\frac{d}{dx}\sigma(x) &= \frac{d}{dx}\frac{1}{1+\exp(-x)}\\
&= (-1)(1+\exp(-x))^{-2}(-\exp(-x))\\
&= \frac{\exp(-x)}{(1+\exp(-x))^2}\\
&= \frac{1}{(1+\exp(-x))}(1 - \frac{1}{(1+\exp(-x))}) = \sigma(x)(1-\sigma(x))
\end{align*}
Consider $\mathbf{z}=\mathbf{w}^T\mathbf{x} + b$. Then
* $\frac{\partial \log\sigma(\mathbf{z})}{\partial \mathbf{w}} = \frac{\partial \log\sigma(\mathbf{z})}{\partial \sigma(\mathbf{z})}\frac{\partial \sigma(\mathbf{z})}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{w}} = (1-\sigma(\mathbf{z}))\mathbf{x}$
* $\frac{\partial \log (1 - \sigma(\mathbf{z}))}{\partial \mathbf{w}} = \frac{\partial \log(1-\sigma(\mathbf{z}))}{\partial (1-\sigma(\mathbf{z}))}\frac{\partial (1-\sigma(\mathbf{z}))}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{w}} = -\sigma(\mathbf{z})\mathbf{x}$
* $\frac{\partial \log\sigma(\mathbf{z})}{\partial b} = \frac{\partial \log\sigma(\mathbf{z})}{\partial \sigma(\mathbf{z})}\frac{\partial \sigma(\mathbf{z})}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial b} = (1-\sigma(\mathbf{z}))$
* $\frac{\partial \log (1 - \sigma(\mathbf{z}))}{\partial b} = \frac{\partial \log(1-\sigma(\mathbf{z}))}{\partial (1-\sigma(\mathbf{z}))}\frac{\partial (1-\sigma(\mathbf{z}))}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial b} = -\sigma(\mathbf{z})$
By above, we get
$$
\begin{aligned}
\frac{\partial L(\mathbf{w}, b)}{\partial \mathbf{w}}&=-\sum_{i=1}^{n}\left[ y_i \left(1-f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)\right)\mathbf{x}_i+\left(1-y_{i}\right) \left(-f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)\right) \mathbf{x}_i\right] \\
&=\sum_{i=1}^{n} \mathbf{x}_i\left(f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)-y_i\right) \\
\frac{\partial L(\mathbf{w}, b)}{\partial b}&=\sum_{i=1}^{n} \left(f_{\mathbf{w}, b}\left(\mathbf{x}_i\right)-y_i\right)
\end{aligned}
$$
Hence,
$$
\begin{aligned}
\mathbf{w}^{t+1} &\leftarrow \mathbf{w}^t - \eta \sum_{i=1}^{n} \mathbf{x}_i\left(f_{\mathbf{w}^t, b^t}\left(\mathbf{x}_i\right)-y_i\right)\\
b^{t+1} &\leftarrow b^t - \eta \sum_{i=1}^{n} \left(f_{\mathbf{w}^t, b^t}\left(\mathbf{x}_i\right)-y_i\right)
\end{aligned}
$$
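A minimal sketch of the resulting batch gradient-descent training loop (the toy data, learning rate $\eta$, and iteration count are arbitrary illustrative choices):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
n, d = 200, 2
X = rng.standard_normal((n, d))
y = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)   # toy binary labels

w, b, eta = np.zeros(d), 0.0, 0.01
for _ in range(1000):
    f = sigmoid(X @ w + b)          # f_{w,b}(x_i) for all i
    grad_w = X.T @ (f - y)          # sum_i x_i (f_{w,b}(x_i) - y_i)
    grad_b = np.sum(f - y)          # sum_i (f_{w,b}(x_i) - y_i)
    w -= eta * grad_w
    b -= eta * grad_b

train_acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5))
assert train_acc > 0.95             # the linearly separable toy data should be fit well
```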