---
title: 'ML2021FALL HW1'
disqus: hackmd
---

## HW1 - Handwritten Assignment

(The process of finding each answer should be shown; otherwise, no points will be given.)

### 1. Logistic Regression

#### 1-(a)

Suppose we have a logistic regression model with four features that learns the following bias and weights: $b = 1, w_1 = -1, w_2 = 2, w_3 = -1, w_4 = 5$. Suppose a given example has the following feature values:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
| -------- | -------- | -------- | -------- |
| 7 | 0 | 3 | 10 |

Function set:

$$f_{w,b}(x)=P_{w,b}(C_1|x)=\sigma\Big(\sum_i w_i x_i+b\Big)$$

Please calculate the logistic regression prediction for this particular example. (The answer should be a scalar indicating the posterior probability of class $C_1$.)

#### 1-(b)

Given training data:

| $$x^1$$ | $$x^2$$ | $$x^3$$ | ... | $$x^N$$ |
| ------------------------- | ------------------------- | ------------------------- | --- | --- |
| $$\hat{y}^1=1$$ (Class 1) | $$\hat{y}^2=1$$ (Class 1) | $$\hat{y}^3=0$$ (Class 2) | ... | $$\hat{y}^N=1$$ (Class 1) |

Assume the data is generated so that the probability of a sample $x$ belonging to $C_1$ is

$$f_{w,b}(x)=P_{w,b}(C_1|x)$$

Given a set of $w$ and $b$, the probability of generating the data is as follows (assuming the data points are generated independently):

$$L(w,b)=f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))\cdots f_{w,b}(x^N)$$

Please write down the loss function defined as the negative log likelihood, $-\ln L(w,b)$. (Hint: cross entropy)

#### 1-\(c\)

Derive the formula that describes the update rule of the parameters in logistic regression (e.g., $w_i\leftarrow w_i-...$). (Hint: gradient descent)

---

### 2. Closed-Form Linear Regression Solution

In the lecture, we've learnt how to solve the linear regression problem via gradient descent. Here you will derive the closed-form solution for this kind of problem. In the following questions, unless otherwise specified, we denote $S = \{({\bf x}_i, y_i)\}_{i=1}^N$ as a dataset of $N$ input-output pairs, where ${\bf x}_i \in {\mathbb R}^k$ denotes the vectorial input and $y_i \in {\mathbb R}$ denotes the corresponding scalar output.

#### 2-(a)

Let's begin with a specific dataset

$$S = \{(x_i, y_i)\}_{i=1}^5 = \{(1, 1.5), (2, 2.4), (3, 3.5), (4, 4.1), (5, 5.3)\}$$

Please find the linear regression model $(w, b) \in {\mathbb R} \times {\mathbb R}$ that minimizes the sum-of-squares loss

$$L_{ssq}(w, b) = \frac{1}{2 \times 5}\sum_{i=1}^5 (y_i- (w x_i+b))^2$$

#### 2-(b)

Please find the linear regression model $({\bf w}, b) \in {\mathbb R}^k \times {\mathbb R}$ that minimizes the sum-of-squares loss

$$L_{ssq}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T{\bf x}_i+b))^2$$

#### 2-\(c\)

A key motivation for regularization is to avoid overfitting. A common choice is to add an $L^2$-regularization term to the original loss function:

$$L_{reg}({\bf w}, b) = \frac{1}{2N}\sum_{i=1}^N (y_i-({\bf w}^T {\bf x}_i+b))^2 + \frac{\lambda}{2} \|{\bf w}\|^{2}$$

where $\lambda \geq 0$ and, for ${\bf w} = [w_1 ~ w_2 ~ ... ~ w_k]^T$, one denotes $\|{\bf w}\|^2 = w_1^2 + ... + w_k^2$. Please find the linear regression model $({\bf w}, b)$ that minimizes the regularized sum-of-squares loss above.
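For Problem 1-(a), a quick numerical sanity check of your hand calculation is possible in a few lines; the sketch below is not part of the required written answer and simply evaluates $\sigma(\sum_i w_i x_i + b)$ with the given numbers (the variable names are arbitrary).

```python
import numpy as np

# Problem 1-(a): plug the given bias, weights, and feature values into
# f_{w,b}(x) = sigma(sum_i w_i x_i + b).
w = np.array([-1.0, 2.0, -1.0, 5.0])    # w_1 ... w_4
x = np.array([7.0, 0.0, 3.0, 10.0])     # x_1 ... x_4
b = 1.0

z = w @ x + b                           # linear score sum_i w_i x_i + b
posterior = 1.0 / (1.0 + np.exp(-z))    # sigmoid gives P(C_1 | x)
print(z, posterior)
```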
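For Problem 2, the closed-form answers you derive can be cross-checked numerically. The sketch below (assuming NumPy; `ridge_closed_form` is a name chosen here, not anything from the assignment) solves the normal equations with an unpenalized bias, which covers 2-(a) and 2-(b) when $\lambda = 0$ and 2-(c) when $\lambda > 0$.

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.0):
    """Closed-form minimizer of (1/2N) * sum_i (y_i - (w^T x_i + b))^2 + (lam/2) * ||w||^2.
    The bias b is the last entry of the solved vector and is not penalized."""
    N, k = X.shape
    Phi = np.hstack([X, np.ones((N, 1))])      # append a column of ones for the bias
    D = np.eye(k + 1)
    D[-1, -1] = 0.0                            # do not regularize the bias term
    theta = np.linalg.solve(Phi.T @ Phi + N * lam * D, Phi.T @ y)
    return theta[:-1], theta[-1]               # (w, b)

# Problem 2-(a): the five given points; lam = 0 recovers plain least squares.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.5, 2.4, 3.5, 4.1, 5.3])
w, b = ridge_closed_form(X, y)
print(w, b)   # slope and intercept of the best-fit line
```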
---

### 3. Noise and Regularization

Consider the linear model $f_{{\bf w},b}: {\mathbb R}^k \rightarrow {\mathbb R}$, where ${\bf w} \in {\mathbb R}^k$ and $b \in {\mathbb R}$, defined as

$$f_{{\bf w},b}({\bf x}) = {\bf w}^T {\bf x} + b$$

Given a dataset $S = \{({\bf x}_i,y_i)\}_{i=1}^N$, if the inputs ${\bf x}_i \in {\mathbb R}^k$ are contaminated with input noise ${\bf \eta}_i \in {\mathbb R}^k$, we may consider the expected sum-of-squares loss in the presence of input noise

$${\tilde L}_{ssq}({\bf w},b) = {\mathbb E}\left[ \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i + {\bf \eta}_i)-y_i)^2 \right]$$

where the expectation is taken over the randomness of the input noises ${\bf \eta}_1,...,{\bf \eta}_N$. Now assume the input noises ${\bf \eta}_i = [\eta_{i,1} ~ \eta_{i,2} ~ ... ~ \eta_{i,k}]$ are random vectors with zero mean ${\mathbb E}[\eta_{i,j}] = 0$, and that the covariance between components is given by

$${\mathbb E}[\eta_{i,j}\eta_{i',j'}] = \delta_{i,i'}\delta_{j,j'} \sigma^2$$

where $\delta_{i,i'} = \left\{\begin{array}{ll} 1 & \mbox{if } i = i'\\ 0 & \mbox{otherwise} \end{array}\right.$ denotes the Kronecker delta. Please show that

$${\tilde L}_{ssq}({\bf w},b) = \frac{1}{2N}\sum_{i=1}^{N}(f_{{\bf w},b}({\bf x}_i)-y_i)^2 + \frac{\sigma^2}{2}\|{\bf w}\|^2$$

That is, minimizing the expected sum-of-squares loss in the presence of input noise is equivalent to minimizing the noise-free sum-of-squares loss with an additional $L^2$-regularization term on the weights.

- Hint: $\|{\bf x}\|^2 = {\bf x}^T{\bf x} = Trace({\bf x}{\bf x}^T)$.

---

### 4. Kaggle Hacker

In the lecture, we've learnt the importance of validation. It is said that fine-tuning your model based on the Kaggle public leaderboard always causes a "disaster" on the private test dataset. Let's not debate whether it actually leads to disastrous results. The fact is that most students don't even know how to "overfit" the public leaderboard except by submitting over and over again. In this problem, you'll see how to take advantage of the public leaderboard in the hw1 Kaggle competition. (In theory XD)

> ## **Warning**
> ![](https://i.imgur.com/FtmP42R.jpg)

Suppose you have trained $K+1$ models $g_0, g_1, \cdots, g_K$, where in particular $g_0({\bf x}) = 0$ is the zero function. Assume the testing dataset is $\{({\bf x}_i, y_i)\}_{i=1}^N$, where you only know ${\bf x}_i$ while $y_i$ is hidden. Nevertheless, you are allowed to observe the sum-of-squares testing error

$$e_k = \frac{1}{N}\sum_{i=1}^N (g_k({\bf x}_i)-y_i)^2, ~~ k = 0, 1, \cdots, K$$

Of course, you also know $s_k = \frac{1}{N}\sum_{i=1}^N(g_k({\bf x}_i))^2$.

#### 4-(a)

Please express $\sum_{i=1}^N g_k({\bf x}_i)y_i$ in terms of $N, e_0, e_1,\cdots,e_K, s_1,\cdots,s_K$. Prove your answer.

- Hint: $e_0=\frac{1}{N}\sum_{i=1}^{N}y_i^2$

#### 4-(b)

For the $K + 1$ models given in the previous problem, explain how to solve

$$\min_{\alpha_1, \cdots, \alpha_K} L_{test}\Big(\sum_{k=1}^{K} \alpha_k g_k\Big) = \min_{\alpha_1, \cdots, \alpha_K} \frac{1}{N} \sum_{i=1}^N\Big( \sum_{k=1}^{K} \alpha_k g_k({\bf x}_i) - y_i\Big)^2$$

and obtain the optimal weights $\alpha_1, \cdots, \alpha_K$.
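As an optional numerical check of the Problem 3 identity (it does not replace the proof), the sketch below uses arbitrary synthetic data, estimates the expected noisy loss by Monte-Carlo sampling, and compares it against the noise-free loss plus $\frac{\sigma^2}{2}\|{\bf w}\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary dataset, weights, and noise level chosen only for this check.
N, k, sigma = 50, 3, 0.4
X = rng.normal(size=(N, k))
y = rng.normal(size=N)
w = rng.normal(size=k)
b = 0.7

def ssq_loss(inputs):
    # (1/2N) * sum_i (f_{w,b}(x_i) - y_i)^2
    return np.mean((inputs @ w + b - y) ** 2) / 2.0

# Monte-Carlo estimate of E[ (1/2N) sum_i (f_{w,b}(x_i + eta_i) - y_i)^2 ].
trials = 50_000
noisy = np.mean([ssq_loss(X + sigma * rng.normal(size=(N, k)))
                 for _ in range(trials)])

# Noise-free loss plus the claimed L2 penalty (sigma^2 / 2) * ||w||^2.
closed_form = ssq_loss(X) + 0.5 * sigma**2 * np.sum(w ** 2)
print(noisy, closed_form)   # the two numbers should nearly coincide
```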
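The reasoning in Problem 4 can also be simulated end to end: with synthetic hidden labels, only $e_k$ and $s_k$ are treated as observable, yet the inner products $\sum_i g_k({\bf x}_i)y_i$ and the optimal blending weights can be recovered. Everything below (data, model outputs, variable names) is a synthetic placeholder, not part of the assignment, and you still need to prove the identity you use.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "test set": K columns of model predictions G and hidden labels y.
N, K = 100, 4
G = rng.normal(size=(N, K))   # g_k(x_i): known, since we know x_i and our own models
y = rng.normal(size=N)        # hidden labels, used only to simulate the leaderboard

# Leaderboard feedback: e_0 for the zero model g_0, plus e_k and s_k for each g_k.
e0 = np.mean(y ** 2)
e = np.mean((G - y[:, None]) ** 2, axis=0)
s = np.mean(G ** 2, axis=0)

# 4-(a): recover the hidden inner products sum_i g_k(x_i) * y_i from N, e_0, e_k, s_k.
Gty_recovered = N * (s + e0 - e) / 2.0
print(np.allclose(Gty_recovered, G.T @ y))   # True

# 4-(b): G^T G is known and G^T y has just been recovered, so the least-squares
# blending weights follow from the normal equations alpha = (G^T G)^{-1} G^T y.
alpha = np.linalg.solve(G.T @ G, Gty_recovered)
print(alpha)
```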