# ML Fall 2019 HW1 Sol
### 1. Closed-Form Loss Linear Regression
#### 1-(a)
Let's begin with a specific dataset
$$S = \{(x_i, y_i)\}_{i=1}^5 = \{(1, 1.2), (2, 2.4), (3, 3.5), (4, 4.1), (5, 5.6)\}$$
Please find the linear regression model $({\bf w}, b) \in {\mathbb R} \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2 \times 5}\sum_{i=1}^5 (y_i- ({\bf w}^T {\bf x}_i+b))^2$$
#### Sol:
$$\begin{aligned}
L_{ssq}(w, b) &= \frac{1}{10}\sum_{i = 1}^{5} (y_i - (wx_i + b))^2 = \frac{1}{10}\sum_{i = 1}^{5} (wx_i + b - y_i)^2 \\
&= \frac{1}{10}\big((w + b - 1.2)^2 + (2w + b - 2.4)^2 + (3w + b - 3.5)^2 \\
&\qquad + (4w + b - 4.1)^2 + (5w + b - 5.6)^2\big)
\end{aligned}$$
$$\begin{aligned}
\frac{\partial L_{ssq}}{\partial w} &= \frac{1}{10}\big(2(w + b - 1.2) \cdot 1 + 2(2w + b - 2.4) \cdot 2 + 2(3w + b - 3.5) \cdot 3 \\
&\qquad + 2(4w + b - 4.1) \cdot 4 + 2(5w + b - 5.6) \cdot 5\big) = 11w + 3b - 12.18
\end{aligned}$$
$$\begin{aligned}
\frac{\partial L_{ssq}}{\partial b} &= \frac{1}{10}\big(2(w + b - 1.2) + 2(2w + b - 2.4) + 2(3w + b - 3.5) \\
&\qquad + 2(4w + b - 4.1) + 2(5w + b - 5.6)\big) = 3w + b - 3.36
\end{aligned}$$
$$11w + 3b - 12.18 = 0 \quad \quad (1) \\ 3w + b - 3.36 = 0 \quad \quad \quad \ \ (2)$$
Solving this system of two linear equations in two unknowns gives $w = 1.05$, $b = 0.21$.
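As a quick numerical check (not part of the required derivation), here is a minimal NumPy sketch that recovers the same $(w, b)$ by least squares:

```python
import numpy as np

# Dataset from 1-(a)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.4, 3.5, 4.1, 5.6])

# Augmented design matrix [1, x_i]; least squares returns (b, w)
A = np.column_stack([np.ones_like(x), x])
b, w = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b)  # approximately 1.05 and 0.21
```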
#### 1-(b)
Please find the linear regression model $({\bf w}, b) \in {\mathbb R}^k \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T{\bf x}_i+b))^2$$
#### Sol:
Let $\ A = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & ... & x_{1,k}\\
1 & x_{2,1} & x_{2,2} & ... & x_{2,k}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{N,1} & x_{N,2} & ... & x_{N,k}\\
\end{bmatrix}$ $w = \begin{bmatrix}
b \\
w_1 \\
\vdots \\
w_k
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
Then $L_{ssq} = \frac{1}{2N}\|Aw - y\|^2$. Setting the gradient to zero, $\frac{1}{N}A^T(Aw - y) = 0$, gives the normal equations $(A^TA)w = A^Ty$, so
$$\begin{bmatrix}
b \\
\bf{w}
\end{bmatrix} = w = (A^TA)^{-1}A^Ty$$
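Below is a minimal NumPy sketch of this closed form; the function name `linreg_normal_eq` and the argument shapes are just illustrative assumptions:

```python
import numpy as np

def linreg_normal_eq(X, y):
    """Closed-form least squares: returns (b, w) via the normal equations.

    X: (N, k) feature matrix, y: (N,) targets.
    """
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])     # prepend a column of ones for the bias
    # Solve (A^T A) w = A^T y; solving the linear system avoids an explicit inverse
    w_aug = np.linalg.solve(A.T @ A, A.T @ y)
    return w_aug[0], w_aug[1:]              # b, then the weight vector w
```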
#### 1-(c)
A key motivation for regularization is to avoid overfitting. A common choice is to add an $L^2$-regularization term to the original loss function
$$L_{reg}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T {\bf x}_i+b))^2 + \frac{\lambda}{2} \|{\bf w}\|^{2}$$
where $\lambda \geq 0$ and for ${\bf w} = [w_1 ~ w_2 ~ ... ~w_k]^T$, one denotes $\|{\bf w}\|^2 = w_1^2 + ... + w_k^2$.
Please find the linear regression model $({\bf w}, b)$ that minimizes the aforementioned regularized sum of squares loss.
#### Sol:
Let $\ A = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & ... & x_{1,k}\\
1 & x_{2,1} & x_{2,2} & ... & x_{2,k}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{N,1} & x_{N,2} & ... & x_{N,k}\\
\end{bmatrix}$ $w = \begin{bmatrix}
b \\
w_1 \\
\vdots \\
w_k
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
$L = \frac{1}{2N}\|Aw-y\|^{2} + \frac{\lambda}{2}\|w\|^{2} - \frac{\lambda b^2}{2}$, where the last term cancels the penalty on the bias $b$ (the first component of the augmented vector $w$), which should not be regularized.
$\frac{\partial L}{\partial w} = \frac{1}{N}(A^TAw-A^Ty) + \lambda Iw - \lambda I\begin{bmatrix}
b \\
0 \\
\vdots \\
0
\end{bmatrix} = 0$
Rearranging, $(A^TA + \lambda N J)w = A^Ty$, where $J$ is the $(k+1)\times(k+1)$ identity matrix with its first diagonal entry set to $0$ (so the bias $b$ is not penalized). Hence
$$\begin{bmatrix}
b \\
\bf{w}
\end{bmatrix} = w = (A^TA + \lambda N J)^{-1}A^Ty$$
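A corresponding NumPy sketch (illustrative names again), which builds the selector matrix $J$ explicitly so the bias stays unpenalized:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Regularized least squares with an unpenalized bias term.

    Solves (A^T A + N * lam * J) w = A^T y, where J is the identity
    with its first diagonal entry zeroed so b is not shrunk.
    """
    N, k = X.shape
    A = np.hstack([np.ones((N, 1)), X])
    J = np.eye(k + 1)
    J[0, 0] = 0.0                           # do not regularize the bias
    w_aug = np.linalg.solve(A.T @ A + N * lam * J, A.T @ y)
    return w_aug[0], w_aug[1:]              # b, then the weight vector w
```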
### 2. Noise and regularization
Consider the linear model $f_{{\bf w},b}: {\mathbb R}^k \rightarrow {\mathbb R}$, where ${\bf w} \in {\mathbb R}^k$ and $b \in {\mathbb R}$, defined as
$$f_{{\bf w},b}({\bf x}) = {\bf w}^T {\bf x} + b$$
Given dataset $S = \{({\bf x}_i,y_i)\}_{i=1}^N$, if the inputs ${\bf x}_i \in {\mathbb R}^k$ are contaminated with input noise ${{\bf \eta}_i} \in {\mathbb R}^k$, we may consider the expected sum-of-squares loss in the presence of input noise as
$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i + {\bf \eta}_i)-y_i)^2 \right]$$
where the expectation is taken over the randomness of input noises ${\bf \eta}_1,...,{\bf \eta}_N$.
Now assume the input noises ${\bf \eta}_i = [\eta_{i,1} ~ \eta_{i,2} ~ ... \eta_{i,k}]$ are random vectors with zero mean ${\mathbb E}[\eta_{i,j}] = 0$, and the covariance between components is given by
$${\mathbb E}[\eta_{i,j}\eta_{i',j'}] = \delta_{i,i'}\delta_{j,j'} \sigma^2$$
where $\delta_{i,i'} = \left\{\begin{array}{ll}
1 & \mbox{, if} ~ i = i'\\
0 & \mbox{, otherwise.}
\end{array}\right.$ denotes the Kronecker delta.
Please show that
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
That is, minimizing the expected sum-of-squares loss in the presence of input noise is equivalent to minimizing the noise-free sum-of-squares loss with the addition of an $L^2$-regularization term on the weights.
- Hint: $\|\bf{x}\|^2 = {\bf x}^T{\bf x} = Trace({\bf x}{\bf x}^T)$.
#### Sol:
$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}({\bf w}^T({\bf x}_i + {\bf \eta}_i) + b - y_i)^2 \right] \\ = \frac{1}{2N}\sum_{i=1}^{N}\left(({\bf w}^T{\bf x}_i + b - y_i)^2 + 2({\bf w}^T{\bf x}_i + b - y_i){\mathbb E}\left[{\bf w}^T{\bf \eta}_i \right] + {\mathbb E}\left[({\bf w}^T{\bf \eta}_i)^2 \right]\right)$$
Since ${\mathbb E}[{\bf \eta}_i] = 0$, the cross term vanishes: ${\mathbb E}\left[{\bf w}^T{\bf \eta}_i \right] = 0$. For the quadratic term, the covariance assumption gives
$${\mathbb E}\left[({\bf w}^T{\bf \eta}_i)^2 \right] = {\bf w}^T{\mathbb E}\left[{\bf \eta}_i{\bf \eta}_i^T \right]{\bf w} = \sigma^2{\bf w}^T{\bf w} = \sigma^2\|{\bf w}\|^2$$
Therefore
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}\left((f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \sigma^2\|{\bf w}\|^2\right) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
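The identity can also be checked numerically. The sketch below uses made-up data, weights, and noise level (all hypothetical) and compares the Monte Carlo average of the noisy loss with the noise-free loss plus $\frac{\sigma^2}{2}\|{\bf w}\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data, weights, and noise level, purely to check the identity
N, k, sigma = 200, 3, 0.5
X = rng.normal(size=(N, k))
y = rng.normal(size=N)
w = rng.normal(size=k)
b = 0.3

def ssq(X_in):
    """Sum-of-squares loss (1 / 2N) * sum_i (w^T x_i + b - y_i)^2."""
    return 0.5 / N * np.sum((X_in @ w + b - y) ** 2)

# Average the loss over many independent draws of the input noise
noisy_loss = np.mean([ssq(X + sigma * rng.normal(size=(N, k)))
                      for _ in range(10000)])
predicted = ssq(X) + 0.5 * sigma ** 2 * np.sum(w ** 2)
print(noisy_loss, predicted)  # should agree up to Monte Carlo error
```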
### 3. Kaggle Hacker
Suppose you have trained $K+1$ models $g_0, g_1, \cdots g_K$, and in particular $g_0({\bf x}) = 0$ is the zero function.
Assume the testing dataset is $\{({\bf x}_i, y_i)\}_{i=1}^N$, where you only know $x_i$ while $y_i$ is hidden. Nevertheless, you are allowed to observe the sum of squares testing error
$$e_k = \frac{1}{N}\sum_{i=1}^N (g_k({\bf x}_i)-y_i)^2, ~~ k = 0, 1, \cdots K$$
Of course, you know $s_k = \frac{1}{N}\sum_{i=1}^N(g_k({\bf x}_i))^2$.
#### 3-(a)
Please express $\sum_{i=1}^N g_k({\bf x}_i)y_i$ in terms of $N, e_0, e_1,\cdots,e_K, s_1,\cdots,s_K$. Prove your answer.
- Hint: $e_0=\frac{1}{N}\sum_{i=1}^{N}y_i^2$
#### Sol:
$e_k = \frac{1}{N} \sum_{i=1}^N(g_k(x_i) - y_i)^2 = \frac{1}{N} \sum_{i=1}^N\left(g_k(x_i)^2 - 2g_k(x_i)y_i + y_i^2\right) = s_k - \frac{2}{N} \sum_{i=1}^Ng_k(x_i)y_i + e_0$
Therefore
$$\sum_{i=1}^Ng_k(x_i)y_i = \frac{N}{2}(s_k + e_0 - e_k)$$
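A small NumPy sketch with simulated predictions and labels (hypothetical data, only to verify the identity numerically):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated test set and model predictions, only to verify the identity
N, K = 100, 4
G = rng.normal(size=(N, K))        # column k-1 holds the predictions g_k(x_i)
y = rng.normal(size=N)

e0 = np.mean(y ** 2)                           # error of the zero model g_0
e = np.mean((G - y[:, None]) ** 2, axis=0)     # e_1, ..., e_K
s = np.mean(G ** 2, axis=0)                    # s_1, ..., s_K

recovered = N / 2 * (s + e0 - e)               # claimed value of sum_i g_k(x_i) y_i
print(np.allclose(recovered, G.T @ y))         # True
```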
#### 3-(b)
For the given $K + 1$ models in the previous problem, explain how to solve
$\min_{\alpha_1, \cdots \alpha_K} L_{test}(\sum_{k=1}^{K} \alpha_k g_k) = \min [\frac{1}{N} \sum_{i=1}^N( \sum_{k=1}^{K} \alpha_k g_k({\bf x}_i) - y_i)^2]$, and obtain the optimal weights $\alpha_1, \cdots \alpha_K$.
#### Sol:
Let $\ G = \begin{bmatrix}
g_1(x_1) & g_2(x_1) & ... & g_K(x_1)\\
g_1(x_2) & g_2(x_2) & ... & g_K(x_2)\\
\vdots & \vdots & \vdots & \vdots\\
g_1(x_N) & g_2(x_N) & ... & g_K(x_N)\\
\end{bmatrix}$ $\alpha = \begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_K
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
Writing $f = \sum_{k=1}^{K} \alpha_k g_k$,
$$L_{test}(f) = \frac{1}{N}(G\alpha - y)^T(G\alpha - y)$$
$$\frac{\partial L_{test}}{\partial \alpha} = \frac{2}{N}G^T(G\alpha - y) = 0$$
$$\alpha = (G^TG)^{-1}G^Ty$$
Here $G^TG$ is directly computable because all predictions $g_k(x_i)$ are known, and each entry of $G^Ty$ equals $\sum_{i=1}^N g_k(x_i)y_i = \frac{N}{2}(s_k + e_0 - e_k)$ from 3-(a), so the optimal weights can be obtained without ever observing the hidden labels $y_i$.
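Finally, a NumPy sketch of the whole "hack" on simulated data: the hidden labels are used only to generate the scores a leaderboard would report, and $\alpha$ is then recovered from those observable quantities alone:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated setup: the hacker knows the predictions G but not the labels y;
# y is used here only to generate the scores e_0, e_k that would be reported.
N, K = 100, 4
G = rng.normal(size=(N, K))        # G[i, k-1] = g_k(x_i); g_0 is the zero model
y = rng.normal(size=N)

# Observable quantities
e0 = np.mean(y ** 2)                           # testing error of g_0
e = np.mean((G - y[:, None]) ** 2, axis=0)     # e_1, ..., e_K
s = np.mean(G ** 2, axis=0)                    # s_1, ..., s_K

# Reconstruct G^T y from 3-(a); G^T G needs only the known predictions
Gty = N / 2 * (s + e0 - e)
alpha = np.linalg.solve(G.T @ G, Gty)

# Matches the least-squares weights that would require knowing y directly
print(np.allclose(alpha, np.linalg.lstsq(G, y, rcond=None)[0]))  # True
```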