---
tags: ntu, mlp
---

# HW1 - Handwritten Assignment

Student ID: R08921A09　Department/Year: EE, first-year master's　Name: 郭羿昇

Implement models with the following two different feature sets and answer questions (1)–(2):

1. Use all pollutant features within the previous 9 hours as first-order terms (plus a bias).
2. Use only the PM2.5 values within the previous 9 hours as first-order features (plus a bias).

Notes:

a. Set all NR entries to 0; other non-numeric values (special characters) may be handled at your own discretion.
b. All advanced gradient descent techniques (e.g., Adam, Adagrad) are allowed.
c. Answer questions 1–2 with the two models specified above.
d. You may train your models in advance; after the Kaggle deadline, unlimited submissions are allowed.
e. In the notation from the TA session, (1) corresponds to p = 9×18+1 and (2) corresponds to p = 9×1+1.

## 1. (1%) Report the error (RMSE) of both models (based on the Kaggle public + private scores) and discuss the effect of the two feature sets.

:::success
* All pollutant features within the 9 hours as first-order terms: 5.62 / 5.66
* Only the PM2.5 values within the 9 hours as features: 6.42 / 6.24

PM2.5 alone probably does not carry enough information, since other weather conditions also affect the PM2.5 value.
:::

## 2. (1%) Explain what kind of data preprocessing can improve your training/testing accuracy, e.g., how you filter out data points you consider unsuitable. Provide numbers (RMSE) to support your claim.

Some values in the data are otherwise normal but carry a spurious leading minus sign, so every value is first replaced by its absolute value. Next, data points with PM2.5 > 100 or < 2 are removed. Finally, less relevant features such as wind direction are dropped. These steps reduce the RMSE from 6.8 to 5.6.

## 3. (3%) Refer to the math problems below.

### 1. Closed-Form Linear Regression Solution

In the lecture, we've learnt how to solve the linear regression problem via gradient descent. Here you will derive the closed-form solution for this kind of problem. In the following questions, unless otherwise specified, we denote $S = \{({\bf x}_i, y_i)\}_{i=1}^N$ as a dataset of $N$ input-output pairs, where ${\bf x}_i \in {\mathbb R}^k$ denotes the vectorial input and $y_i \in {\mathbb R}$ denotes the corresponding scalar output.

#### 1-(a)

Let's begin with a specific dataset
$$S = \{(x_i, y_i)\}_{i=1}^5 = \{(1, 1.2), (2, 2.4), (3, 3.5), (4, 4.1), (5, 5.6)\}$$
Please find the linear regression model $({\bf w}, b) \in {\mathbb R} \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2 \times 5}\sum_{i=1}^5 (y_i- ({\bf w}^T {\bf x}_i+b))^2$$

:::success
Set the gradient to zero:
$$\nabla L_{ssq}({\bf w}, b)=\begin{bmatrix} \frac{-1}{5}\sum_{i=1}^5 \left(y_i- ({\bf w}^T {\bf x}_i+b)\right) \\ \frac{-1}{5}\sum_{i=1}^5 \left(y_i- ({\bf w}^T {\bf x}_i+b)\right)x_i \end{bmatrix}= \begin{bmatrix} 0\\ 0 \end{bmatrix}$$

Rearranging the top entry:
$$b=\frac{\sum y_i}{5}-{\bf w}^T\frac{\sum x_i}{5}$$

Bottom entry:
$$\sum y_ix_i-b\sum x_i-{\bf w}^T\sum x_i^2=0 \quad (1)$$

Substituting $b$ into (1):
$${\bf w}^T=\frac{\sum x_iy_i- \frac{\sum y_i \sum x_i}{5}}{\sum x_i^2-\frac{(\sum x_i)^2}{5}}$$

Computing the four sums $\sum x_iy_i = 60.9$, $\sum x_i^2 = 55$, $\sum x_i = 15$, $\sum y_i = 16.8$, we get:
$$\begin{bmatrix} {\bf w}^T \\ b \end{bmatrix}= \begin{bmatrix} 1.05 \\ 0.21 \end{bmatrix}$$
:::

#### 1-(b)

Please find the linear regression model $({\bf w}, b) \in {\mathbb R}^k \times {\mathbb R}$ that minimizes the sum of squares loss
$$L_{ssq}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T{\bf x}_i+b))^2$$

:::success
Same as part (a), but with 5 replaced by $N$ (scalar case; for general $k$, the same stationarity conditions become matrix normal equations, as in the sketch right after this box):
$$\begin{bmatrix} {\bf w}^T \\ b \end{bmatrix}= \begin{bmatrix} \frac{\sum x_iy_i- \frac{\sum y_i \sum x_i}{N}}{\sum x_i^2-\frac{(\sum x_i)^2}{N}} \\ \frac{\sum y_i}{N}-{\bf w}^T\frac{\sum x_i}{N} \end{bmatrix}$$
:::
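As a quick numerical check of the formulas above, and to cover the general vectorial case of 1-(b), here is a minimal NumPy sketch (variable names are illustrative only): it evaluates the scalar closed form on the dataset from 1-(a) and also solves the equivalent matrix normal equations with a bias column.

```python
import numpy as np

# The five training pairs from 1-(a).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.4, 3.5, 4.1, 5.6])
N = len(x)

# Scalar closed form derived above.
w = (np.sum(x * y) - np.sum(y) * np.sum(x) / N) / (np.sum(x ** 2) - np.sum(x) ** 2 / N)
b = np.mean(y) - w * np.mean(x)
print(w, b)  # ~1.05 0.21

# General vectorial case: append a bias column and solve the normal equations
# X^T X [w; b] = X^T y, the multi-dimensional analogue of the formulas above.
X = np.column_stack([x, np.ones(N)])
print(np.linalg.solve(X.T @ X, X.T @ y))  # ~[1.05 0.21]
```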
#### 1-\(c\)

A key motivation for regularization is to avoid overfitting. A common choice is to add an $L^2$-regularization term into the original loss function
$$L_{reg}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T {\bf x}_i+b))^2 + \frac{\lambda}{2} \|{\bf w}\|^{2}$$
where $\lambda \geq 0$ and for ${\bf w} = [w_1 ~ w_2 ~ ... ~ w_k]^T$, one denotes $\|{\bf w}\|^2 = w_1^2 + ... + w_k^2$. Please find the linear regression model $({\bf w}, b)$ that minimizes the aforementioned regularized sum of squares loss.

:::success
Set the gradient to zero:
$$\nabla L_{reg}({\bf w}, b)=\begin{bmatrix} \frac{-1}{N}\sum_{i=1}^N \left(y_i- ({\bf w}^T {\bf x}_i+b)\right) \\ \frac{-1}{N}\sum_{i=1}^N \left(y_i- ({\bf w}^T {\bf x}_i+b)\right)x_i+\lambda{\bf w} \end{bmatrix}= \begin{bmatrix} 0\\ 0 \end{bmatrix}$$

Top entry: same as part (a), so $b=\frac{\sum y_i}{N}-{\bf w}^T\frac{\sum x_i}{N}$.

Bottom entry (multiplying through by $-N$):
$$\sum y_ix_i-b\sum x_i-{\bf w}^T\sum x_i^2-N\lambda {\bf w}^T=0 \quad (1)$$

Substituting $b$ into (1):
$${\bf w}^T=\frac{\sum x_iy_i- \frac{\sum y_i \sum x_i}{N}}{\sum x_i^2-\frac{(\sum x_i)^2}{N}+N\lambda}$$

(In the general vectorial case the same conditions give ${\bf w} = \big(\sum_i ({\bf x}_i-\bar{\bf x})({\bf x}_i-\bar{\bf x})^T + N\lambda I\big)^{-1}\sum_i ({\bf x}_i-\bar{\bf x})(y_i-\bar y)$ and $b = \bar y - {\bf w}^T\bar{\bf x}$.)
:::

---

### 2. Noise and Regularization

Consider the linear model $f_{{\bf w},b}: {\mathbb R}^k \rightarrow {\mathbb R}$, where ${\bf w} \in {\mathbb R}^k$ and $b \in {\mathbb R}$, defined as
$$f_{{\bf w},b}({\bf x}) = {\bf w}^T {\bf x} + b$$
Given a dataset $S = \{({\bf x}_i,y_i)\}_{i=1}^N$, if the inputs ${\bf x}_i \in {\mathbb R}^k$ are contaminated with input noise ${\bf \eta}_i \in {\mathbb R}^k$, we may consider the expected sum-of-squares loss in the presence of input noise as
$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i + {\bf \eta}_i)-y_i)^2 \right]$$
where the expectation is taken over the randomness of the input noises ${\bf \eta}_1,...,{\bf \eta}_N$. Now assume the input noises ${\bf \eta}_i = [\eta_{i,1} ~ \eta_{i,2} ~ ... ~ \eta_{i,k}]$ are random vectors with zero mean ${\mathbb E}[\eta_{i,j}] = 0$, and the covariance between components is given by
$${\mathbb E}[\eta_{i,j}\eta_{i',j'}] = \delta_{i,i'}\delta_{j,j'} \sigma^2$$
where $\delta_{i,i'} = \left\{\begin{array}{ll} 1 & \mbox{, if} ~ i = i'\\ 0 & \mbox{, otherwise} \end{array}\right.$ denotes the Kronecker delta. Please show that
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
That is, minimizing the expected sum-of-squares loss in the presence of input noise is equivalent to minimizing the noise-free sum-of-squares loss with the addition of an $L^2$-regularization term on the weights.

- Hint: $\|{\bf x}\|^2 = {\bf x}^T{\bf x} = Trace({\bf x}{\bf x}^T)$.

:::success
Expand the noisy residual around the noise-free one:
$$\left({\bf w}^T({\bf x}_i+{\bf \eta}_i)+b-y_i\right)^2 = \left(f_{{\bf w},b}({\bf x}_i)-y_i\right)^2 + 2\left(f_{{\bf w},b}({\bf x}_i)-y_i\right){\bf w}^T{\bf \eta}_i + ({\bf w}^T{\bf \eta}_i)^2$$

Take the expectation term by term. The cross term vanishes because ${\mathbb E}[{\bf \eta}_i]=0$. For the last term,
$${\mathbb E}\left[({\bf w}^T{\bf \eta}_i)^2\right] = {\bf w}^T\,{\mathbb E}\left[{\bf \eta}_i{\bf \eta}_i^T\right]{\bf w} = {\bf w}^T(\sigma^2 I){\bf w} = \sigma^2\|{\bf w}\|^2$$
using ${\mathbb E}[\eta_{i,j}\eta_{i,j'}]=\delta_{j,j'}\sigma^2$. Therefore
$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{1}{2N}\sum_{i=1}^{N}\sigma^2\|{\bf w}\|^2 = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$
which is the noise-free loss plus an $L^2$ penalty, i.e. the regularized loss of 1-(c) with $\lambda=\sigma^2$.
:::
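As a quick sanity check of this identity, here is a small Monte Carlo sketch (assuming NumPy; the data, ${\bf w}$, $b$, and $\sigma$ below are arbitrary illustrative values): it compares an empirical estimate of the noisy loss with the noise-free loss plus the $\frac{\sigma^2}{2}\|{\bf w}\|^2$ penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy data and a fixed linear model (w, b) -- illustrative values only.
N, k, sigma = 200, 3, 0.5
X = rng.normal(size=(N, k))
y = rng.normal(size=N)
w = rng.normal(size=k)
b = 0.3

def half_mse(X_in):
    """Sum-of-squares loss (1/2N) * sum_i (w^T x_i + b - y_i)^2."""
    return 0.5 * np.mean((X_in @ w + b - y) ** 2)

# Monte Carlo estimate of the expected loss under input noise eta_i ~ N(0, sigma^2 I).
noisy = np.mean([half_mse(X + sigma * rng.normal(size=X.shape))
                 for _ in range(20000)])

# Noise-free loss plus the predicted L2 penalty (sigma^2 / 2) * ||w||^2.
closed_form = half_mse(X) + 0.5 * sigma ** 2 * np.sum(w ** 2)

print(noisy, closed_form)  # the two numbers agree up to Monte Carlo error
```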
---

### 3. Kaggle Hacker

In the lecture, we've learnt the importance of validation. It is said that fine-tuning your model based on the Kaggle public leaderboard always causes a "disaster" on the private test dataset. Let's not talk about whether it leads to disastrous results or not. The fact is that most students don't even know how to "overfit" the public leaderboard other than by submitting over and over. In this problem, you'll see how to take advantage of the public leaderboard in the hw1 Kaggle competition. (In theory XD)

> ## **Warning**
> ![](https://i.imgur.com/FtmP42R.jpg)

Suppose you have trained $K+1$ models $g_0, g_1, \cdots, g_K$, and in particular $g_0({\bf x}) = 0$ is the zero function. Assume the testing dataset is $\{({\bf x}_i, y_i)\}_{i=1}^N$, where you only know ${\bf x}_i$ while $y_i$ is hidden. Nevertheless, you are allowed to observe the sum of squares testing error
$$e_k = \frac{1}{N}\sum_{i=1}^N (g_k({\bf x}_i)-y_i)^2, ~~ k = 0, 1, \cdots, K$$
Of course, you also know $s_k = \frac{1}{N}\sum_{i=1}^N(g_k({\bf x}_i))^2$.

#### 3-(a)

Please express $\sum_{i=1}^N g_k({\bf x}_i)y_i$ in terms of $N, e_0, e_1,\cdots,e_K, s_1,\cdots,s_K$. Prove your answer.

- Hint: $e_0=\frac{1}{N}\sum_{i=1}^{N}y_i^2$

:::success
Let $b_k=\sum_{i=1}^N g_k({\bf x}_i)y_i$. By the hint, $e_0=\frac{1}{N}\sum_{i=1}^{N}y_i^2$. Then

$e_k = \frac{1}{N}\sum_i(g_k({\bf x}_i)-y_i)^2=\frac1N\sum_i\left((g_k({\bf x}_i))^2-2 g_k({\bf x}_i)y_i+y_i^2\right)=s_k-\frac{2b_k}{N}+e_0$

$\implies \frac{2b_k}{N}=s_k+e_0-e_k$

$\implies b_k=\frac{N(s_k+e_0-e_k)}2$
:::

#### 3-(b)

For the given $K + 1$ models in the previous problem, explain how to solve $\min_{\alpha_1, \cdots, \alpha_K} L_{test}(\sum_{k=1}^{K} \alpha_k g_k) = \frac{1}{N} \sum_{i=1}^N\left( \sum_{k=1}^{K} \alpha_k g_k({\bf x}_i) - y_i\right)^2$, and obtain the optimal weights $\alpha_1, \cdots, \alpha_K$.

:::success
$L_{test}=\frac{1}{N} \sum_{i=1}^N\left( \sum_{k=1}^{K} \alpha_k g_k({\bf x}_i) - y_i\right)^2$

Set each partial derivative to zero:
$$\frac{\partial L_{test}}{\partial \alpha_j}=\frac2N\sum_{i=1}^N\left(\sum_{k=1}^K\alpha_kg_k({\bf x}_i)-y_i\right)g_j({\bf x}_i)=0, \quad j=1,\cdots,K$$

This gives the normal equations
$$\sum_{k=1}^K \alpha_k\underbrace{\frac1N\sum_{i=1}^N g_j({\bf x}_i)g_k({\bf x}_i)}_{G_{jk}}=\frac1N\sum_{i=1}^N g_j({\bf x}_i)y_i=\frac{s_j+e_0-e_j}2$$
where every $G_{jk}$ is computable because all $g_k({\bf x}_i)$ are known, and the right-hand side follows from 3-(a). Writing $c_j=\frac{s_j+e_0-e_j}2$, the optimal weights are
$$\boldsymbol{\alpha}=G^{-1}{\bf c}$$
(assuming $G$ is invertible).
:::
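A minimal NumPy sketch of this procedure, under the assumption that the leaderboard reports the mean squared error $e_k$ defined above; `preds`, `e0`, and `e` are hypothetical inputs holding the known test predictions and the observed scores:

```python
import numpy as np

def blend_weights(preds, e0, e):
    """Recover the optimal blending weights alpha from public-leaderboard scores.

    preds : (K, N) array of known predictions g_k(x_i) on the test inputs
    e0    : observed error of the zero model, i.e. (1/N) * sum_i y_i^2
    e     : (K,) array of observed errors e_k of the K models
    """
    preds = np.asarray(preds, dtype=float)
    e = np.asarray(e, dtype=float)
    K, N = preds.shape

    s = np.mean(preds ** 2, axis=1)      # s_k = (1/N) * sum_i g_k(x_i)^2
    c = (s + e0 - e) / 2                 # (1/N) * sum_i g_k(x_i) * y_i, from 3-(a)
    G = preds @ preds.T / N              # G_jk = (1/N) * sum_i g_j(x_i) g_k(x_i)
    return np.linalg.solve(G, c)         # normal equations G @ alpha = c
```

Since $G$ and ${\bf c}$ are both known, the blended score $L_{test}(\sum_k\alpha_k g_k)=\boldsymbol{\alpha}^TG\boldsymbol{\alpha}-2\boldsymbol{\alpha}^T{\bf c}+e_0$ can even be evaluated before submitting anything.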