# ML Fall 2019 HW1 Sol
### 1. Closed-Form Loss Linear Regression
#### 1-(a)
Let's begin with a specific dataset
$$S = \{(x_i, y_i)\}_{i=1}^5 = \{(1, 1.2), (2, 2.4), (3, 3.5), (4, 4.1), (5, 5.6)\}$$
Please find the linear regression model $({\bf w}, b) \in {\mathbb R} \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2 \times 5}\sum_{i=1}^5 (y_i- ({\bf w}^T {\bf x}_i+b))^2$$
#### Sol:
$$\begin{aligned}
L_{ssq}(w, b) &= \frac{1}{10}\sum_{i = 1}^{5} (y_i - (wx_i + b))^2 = \frac{1}{10}\sum_{i = 1}^{5} (wx_i + b - y_i)^2 \\
&= \frac{1}{10}\big((w + b - 1.2)^2 + (2w + b - 2.4)^2 + (3w + b - 3.5)^2 \\
&\qquad + (4w + b - 4.1)^2 + (5w + b - 5.6)^2\big)
\end{aligned}$$
$$\begin{aligned}
\frac{\partial L_{ssq}}{\partial w} &= \frac{1}{10}\big(2(w + b - 1.2) \cdot 1 + 2(2w + b - 2.4) \cdot 2 + 2(3w + b - 3.5) \cdot 3 \\
&\qquad + 2(4w + b - 4.1) \cdot 4 + 2(5w + b - 5.6) \cdot 5\big) = 11w + 3b - 12.18
\end{aligned}$$
$$\begin{aligned}
\frac{\partial L_{ssq}}{\partial b} &= \frac{1}{10}\big(2(w + b - 1.2) + 2(2w + b - 2.4) + 2(3w + b - 3.5) \\
&\qquad + 2(4w + b - 4.1) + 2(5w + b - 5.6)\big) = 3w + b - 3.36
\end{aligned}$$
$$11w + 3b - 12.18 = 0 \quad \quad (1) \\ 3w + b - 3.36 = 0 \quad \quad \quad \ \ (2)$$
Solving this system of two linear equations in two unknowns gives $w = 1.05$, $b = 0.21$.
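As a quick numerical check (not part of the required derivation), here is a minimal NumPy sketch that recovers the same $(w, b)$ by least squares:

```python
import numpy as np

# Dataset from 1-(a)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.4, 3.5, 4.1, 5.6])

# Augmented design matrix [1, x_i]; least squares returns (b, w)
A = np.column_stack([np.ones_like(x), x])
b, w = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b)  # approximately 1.05 and 0.21
```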
#### 1-(b)
Please find the linear regression model $({\bf w}, b) \in {\mathbb R}^k \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T{\bf x}_i+b))^2$$
#### Sol:
Let $\ A = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & ... & x_{1,k}\\
1 & x_{2,1} & x_{2,2} & ... & x_{2,k}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{N,1} & x_{N,2} & ... & x_{N,k}\\
\end{bmatrix}$ $w = \begin{bmatrix}
b \\
w_1 \\
\vdots \\
w_k
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
Then $L_{ssq} = \frac{1}{2N}\|Aw - y\|^2$. Setting the gradient to zero, $\frac{1}{N}A^T(Aw - y) = 0$, gives the normal equations $(A^TA)w = A^Ty$, so
$$\begin{bmatrix}
b \\
\bf{w}
\end{bmatrix} = w = (A^TA)^{-1}A^Ty$$
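Below is a minimal NumPy sketch of this closed form; the function name `linreg_normal_eq` and the argument shapes are just illustrative assumptions:

```python
import numpy as np

def linreg_normal_eq(X, y):
    """Closed-form least squares: returns (b, w) via the normal equations.

    X: (N, k) feature matrix, y: (N,) targets.
    """
    N = X.shape[0]
    A = np.hstack([np.ones((N, 1)), X])     # prepend a column of ones for the bias
    # Solve (A^T A) w = A^T y; solving the linear system avoids an explicit inverse
    w_aug = np.linalg.solve(A.T @ A, A.T @ y)
    return w_aug[0], w_aug[1:]              # b, then the weight vector w
```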
#### 1-(c)
A key motivation for regularization is to avoid overfitting. A common choice is to add an $L^2$-regularization term to the original loss function
$$L_{reg}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T {\bf x}_i+b))^2 + \frac{\lambda}{2} \|{\bf w}\|^{2}$$
where $\lambda \geq 0$ and for ${\bf w} = [w_1 ~ w_2 ~ ... ~w_k]^T$, one denotes $\|{\bf w}\|^2 = w_1^2 + ... + w_k^2$.
Please find the linear regression model $({\bf w}, b)$ that minimizes the aforementioned regularized sum of squares loss.
#### Sol:
Let $\ A = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & ... & x_{1,k}\\
1 & x_{2,1} & x_{2,2} & ... & x_{2,k}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{N,1} & x_{N,2} & ... & x_{N,k}\\
\end{bmatrix}$ $w = \begin{bmatrix}
b \\
w_1 \\
\vdots \\
w_k
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
$L = \frac{1}{2N}\|Aw-y\|^{2} + \frac{\lambda}{2}\|w\|^{2} - \frac{\lambda b^2}{2}$, where the last term cancels the penalty on the bias $b$ (the first component of the augmented vector $w$), which should not be regularized.
$\frac{\partial L}{\partial w} = \frac{1}{N}(A^TAw-A^Ty) + \lambda Iw - \lambda I\begin{bmatrix}
b \\
0 \\
\vdots \\
0
\end{bmatrix} = 0$
Rearranging, $(A^TA + \lambda N J)w = A^Ty$, where $J$ is the $(k+1)\times(k+1)$ identity matrix with its first diagonal entry set to $0$ (so the bias $b$ is not penalized). Hence
$$\begin{bmatrix}
b \\
\bf{w}
\end{bmatrix} = w = (A^TA + \lambda N J)^{-1}A^Ty$$
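A corresponding NumPy sketch (illustrative names again), which builds the selector matrix $J$ explicitly so the bias stays unpenalized:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Regularized least squares with an unpenalized bias term.

    Solves (A^T A + N * lam * J) w = A^T y, where J is the identity
    with its first diagonal entry zeroed so b is not shrunk.
    """
    N, k = X.shape
    A = np.hstack([np.ones((N, 1)), X])
    J = np.eye(k + 1)
    J[0, 0] = 0.0                           # do not regularize the bias
    w_aug = np.linalg.solve(A.T @ A + N * lam * J, A.T @ y)
    return w_aug[0], w_aug[1:]              # b, then the weight vector w
```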
### 2. Noise and regularization
Consider the linear model $f_{{\bf w},b}: {\mathbb R}^k \rightarrow {\mathbb R}$, where ${\bf w} \in {\mathbb R}^k$ and $b \in {\mathbb R}$, defined as
$$f_{{\bf w},b}({\bf x}) = {\bf w}^T {\bf x} + b$$
Given dataset $S = \{({\bf x}_i,y_i)\}_{i=1}^N$, if the inputs ${\bf x}_i \in {\mathbb R}^k$ are contaminated with input noise ${{\bf \eta}_i} \in {\mathbb R}^k$, we may consider the expected sum-of-squares loss in the presence of input noise as
$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i + {\bf \eta}_i)-y_i)^2 \right]$$
where the expectation is taken over the randomness of input noises ${\bf \eta}_1,...,{\bf \eta}_N$.
Now assume the input noises ${\bf \eta}_i = [\eta_{i,1} ~ \eta_{i,2} ~ ... \eta_{i,k}]$ are random vectors with zero mean ${\mathbb E}[\eta_{i,j}] = 0$, and the covariance between components is given by
$${\mathbb E}[\eta_{i,j}\eta_{i',j'}] = \delta_{i,i'}\delta_{j,j'} \sigma^2$$
where $\delta_{i,i'} = \left\{\begin{array}{ll}
1 & \mbox{, if} ~ i = i'\\
0 & \mbox{, otherwise.}
\end{array}\right.$ denotes the Kronecker delta.
Please show that
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
That is, minimizing the expected sum-of-squares loss in the presence of input noise is equivalent to minimizing the noise-free sum-of-squares loss with the addition of an $L^2$-regularization term on the weights.
- Hint: $\|\bf{x}\|^2 = {\bf x}^T{\bf x} = Trace({\bf x}{\bf x}^T)$.
#### Sol:
$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}({\bf w}^T({\bf x}_i + {\bf \eta}_i) + b - y_i)^2 \right] \\ = \frac{1}{2N}\sum_{i=1}^{N}\left(({\bf w}^T{\bf x}_i + b - y_i)^2 + 2({\bf w}^T{\bf x}_i + b - y_i){\mathbb E}\left[{\bf w}^T{\bf \eta}_i \right] + {\mathbb E}\left[({\bf w}^T{\bf \eta}_i)^2 \right]\right)$$
Since ${\mathbb E}[{\bf \eta}_i] = 0$, the cross term vanishes: ${\mathbb E}\left[{\bf w}^T{\bf \eta}_i \right] = 0$. For the quadratic term, the covariance assumption gives
$${\mathbb E}\left[({\bf w}^T{\bf \eta}_i)^2 \right] = {\bf w}^T{\mathbb E}\left[{\bf \eta}_i{\bf \eta}_i^T \right]{\bf w} = \sigma^2{\bf w}^T{\bf w} = \sigma^2\|{\bf w}\|^2$$
Therefore
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}\left((f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \sigma^2\|{\bf w}\|^2\right) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
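The identity can also be checked numerically. The sketch below uses made-up data, weights, and noise level (all hypothetical) and compares the Monte Carlo average of the noisy loss with the noise-free loss plus $\frac{\sigma^2}{2}\|{\bf w}\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data, weights, and noise level, purely to check the identity
N, k, sigma = 200, 3, 0.5
X = rng.normal(size=(N, k))
y = rng.normal(size=N)
w = rng.normal(size=k)
b = 0.3

def ssq(X_in):
    """Sum-of-squares loss (1 / 2N) * sum_i (w^T x_i + b - y_i)^2."""
    return 0.5 / N * np.sum((X_in @ w + b - y) ** 2)

# Average the loss over many independent draws of the input noise
noisy_loss = np.mean([ssq(X + sigma * rng.normal(size=(N, k)))
                      for _ in range(10000)])
predicted = ssq(X) + 0.5 * sigma ** 2 * np.sum(w ** 2)
print(noisy_loss, predicted)  # should agree up to Monte Carlo error
```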
### 3. Kaggle Hacker
Suppose you have trained $K+1$ models $g_0, g_1, \cdots g_K$, and in particular $g_0({\bf x}) = 0$ is the zero function.
Assume the testing dataset is $\{({\bf x}_i, y_i)\}_{i=1}^N$, where you only know $x_i$ while $y_i$ is hidden. Nevertheless, you are allowed to observe the sum of squares testing error
$$e_k = \frac{1}{N}\sum_{i=1}^N (g_k({\bf x}_i)-y_i)^2, ~~ k = 0, 1, \cdots K$$
Of course, you know $s_k = \frac{1}{N}\sum_{i=1}^N(g_k({\bf x}_i))^2$.
#### 3-(a)
Please express $\sum_{i=1}^N g_k({\bf x}_i)y_i$ in terms of $N, e_0, e_1,\cdots,e_K, s_1,\cdots,s_K$. Prove your answer.
- Hint: $e_0=\frac{1}{N}\sum_{i=1}^{N}y_i^2$
#### Sol:
$e_k = \frac{1}{N} \sum_{i=1}^N(g_k(x_i) - y_i)^2 = \frac{1}{N} \sum_{i=1}^N\left(g_k(x_i)^2 - 2g_k(x_i)y_i + y_i^2\right) = s_k - \frac{2}{N} \sum_{i=1}^Ng_k(x_i)y_i + e_0$
Therefore
$$\sum_{i=1}^Ng_k(x_i)y_i = \frac{N}{2}(s_k + e_0 - e_k)$$
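A small NumPy sketch with simulated predictions and labels (hypothetical data, only to verify the identity numerically):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated test set and model predictions, only to verify the identity
N, K = 100, 4
G = rng.normal(size=(N, K))        # column k-1 holds the predictions g_k(x_i)
y = rng.normal(size=N)

e0 = np.mean(y ** 2)                           # error of the zero model g_0
e = np.mean((G - y[:, None]) ** 2, axis=0)     # e_1, ..., e_K
s = np.mean(G ** 2, axis=0)                    # s_1, ..., s_K

recovered = N / 2 * (s + e0 - e)               # claimed value of sum_i g_k(x_i) y_i
print(np.allclose(recovered, G.T @ y))         # True
```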
#### 3-(b)
For the given $K + 1$ models in the previous problem, explain how to solve
$\min_{\alpha_1, \cdots \alpha_K} L_{test}(\sum_{k=1}^{K} \alpha_k g_k) = \min [\frac{1}{N} \sum_{i=1}^N( \sum_{k=1}^{K} \alpha_k g_k({\bf x}_i) - y_i)^2]$, and obtain the optimal weights $\alpha_1, \cdots \alpha_K$.
#### Sol:
Let $\ G = \begin{bmatrix}
g_1(x_1) & g_2(x_1) & ... & g_K(x_1)\\
g_1(x_2) & g_2(x_2) & ... & g_K(x_2)\\
\vdots & \vdots & \vdots & \vdots\\
g_1(x_N) & g_2(x_N) & ... & g_K(x_N)\\
\end{bmatrix}$ $\alpha = \begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_K
\end{bmatrix}$ $y = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix}$
Writing $f = \sum_{k=1}^{K} \alpha_k g_k$,
$$L_{test}(f) = \frac{1}{N}(G\alpha - y)^T(G\alpha - y)$$
$$\frac{\partial L_{test}}{\partial \alpha} = \frac{2}{N}G^T(G\alpha - y) = 0$$
$$\alpha = (G^TG)^{-1}G^Ty$$
Here $G^TG$ is directly computable because all predictions $g_k(x_i)$ are known, and each entry of $G^Ty$ equals $\sum_{i=1}^N g_k(x_i)y_i = \frac{N}{2}(s_k + e_0 - e_k)$ from 3-(a), so the optimal weights can be obtained without ever observing the hidden labels $y_i$.
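Finally, a NumPy sketch of the whole "hack" on simulated data: the hidden labels are used only to generate the scores a leaderboard would report, and $\alpha$ is then recovered from those observable quantities alone:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated setup: the hacker knows the predictions G but not the labels y;
# y is used here only to generate the scores e_0, e_k that would be reported.
N, K = 100, 4
G = rng.normal(size=(N, K))        # G[i, k-1] = g_k(x_i); g_0 is the zero model
y = rng.normal(size=N)

# Observable quantities
e0 = np.mean(y ** 2)                           # testing error of g_0
e = np.mean((G - y[:, None]) ** 2, axis=0)     # e_1, ..., e_K
s = np.mean(G ** 2, axis=0)                    # s_1, ..., s_K

# Reconstruct G^T y from 3-(a); G^T G needs only the known predictions
Gty = N / 2 * (s + e0 - e)
alpha = np.linalg.solve(G.T @ G, Gty)

# Matches the least-squares weights that would require knowing y directly
print(np.allclose(alpha, np.linalg.lstsq(G, y, rcond=None)[0]))  # True
```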