# HW1 - Handwritten Assignment

Revised by Oct 6 (Ver 1), Revised by Oct 10 (Ver 2)

---

(The process of finding the answer should be shown; otherwise, no points will be given.)

## Mathematical Background (0.8%)

(a) A symmetric matrix $M\in\mathbb{R}^{n\times n}$ is positive semi-definite if for all $x\in\mathbb{R}^n$,
$$
x^T M x \geq 0
$$
Now, given a matrix $A\in\mathbb{R}^{n\times n}$, show that $AA^T$ is a positive semi-definite matrix.

(b) If $f(x_1, x_2) = x_1\sin(x_2)\exp(-x_1x_2)$, what is the gradient $\nabla f(x)$ of $f$? Recall that
$$
\nabla f(x) = \left[\begin{array}{l} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{array}\right]
$$

(c) Given that $X_1, \cdots, X_n$ are independent and identically distributed (i.i.d.) Bernoulli random variables with parameter $p$, find the maximum likelihood estimator of $p$.
- Hint: The probability mass function of a Bernoulli $X$ can be written as $f(x;p) = p^x(1-p)^{1-x}$ for $x\in \{0, 1\}$

---

## Closed-Form Linear Regression Solution (0.8%)

Suppose the linear regression model is
$$
\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}
$$
where $\mathbf{y}\in\mathbb{R}^n, \mathbf{X}\in\mathbb{R}^{n\times d}, \boldsymbol{\theta}\in\mathbb{R}^d$ and $\boldsymbol{\epsilon}\in\mathbb{R}^n$. Note that $\mathbf{X}_i\in \mathbb{R}^{1\times d}$ is the $i$-th row of $\mathbf{X}$. Write $\boldsymbol{\theta} = [w_1, \cdots, w_m, b]^T$ and $\mathbf{X}_i = [x_{i,1}, x_{i,2}, \cdots, x_{i,m}, 1]$. If the linear model has the bias term $b$, then $d = m + 1$ (denote $x_{i, m+1} = 1$); otherwise, $d = m$. For simplicity, we assume there is no bias term.

(a)(0.4%) Find the general optimal solution $\boldsymbol{\theta}^*$ that minimizes the weighted MSE:
$$
\sum_i \omega_i\left(y_i-\mathbf{X}_i \boldsymbol{\theta}\right)^2
$$
Note that $\boldsymbol{\Omega}=\operatorname{diag}\left(\omega_1, \ldots, \omega_n\right)$, $\color{red}{\omega_i\geq 0, \forall i}$, and $\mathbf{X}^T\boldsymbol{\Omega}\mathbf{X}$ is invertible.

(b)(0.4%) To avoid overfitting, we add an $L^2$-regularization term to the original loss function (MSE):
$$
\sum_i\left(y_i-\mathbf{X}_i \boldsymbol{\theta}\right)^2+\lambda \sum_j w_j^2
$$
Write down the matrix form of the loss function and find the general optimal solution $\boldsymbol{\theta}^*$.
- Hint: $\|\mathbf{w}\|^2 = w_1^2 + \cdots + w_m^2$ for $\mathbf{w} = [w_1, \cdots, w_m]$

(Bonus)(0.2%) In (b), we assume that our linear regression model has a bias term. Please solve for $\mathbf{w} = [w_1, \cdots, w_m]$ and $b$, i.e. $\boldsymbol{\theta} = [w_1, \cdots, w_m, b]^T$.
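Once you have derived candidate closed-form solutions for (a) and (b), you can sanity-check them numerically. Below is a minimal NumPy sketch for such a check on random data; the candidate formulas inside are the standard weighted least-squares and ridge expressions, stated here only as assumptions of the sketch (deriving them is the exercise), and all sizes and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 4, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
omega = rng.uniform(0.1, 2.0, size=n)      # weights omega_i >= 0
Omega = np.diag(omega)

# Candidate closed-form solutions (assumed forms, not the graded derivation).
theta_wls = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)      # part (a)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # part (b)

# Gradients of the two losses; both should be numerically ~0 at the candidates.
grad_wls = -2 * X.T @ Omega @ (y - X @ theta_wls)
grad_ridge = -2 * X.T @ (y - X @ theta_ridge) + 2 * lam * theta_ridge
print(np.max(np.abs(grad_wls)), np.max(np.abs(grad_ridge)))
```

A vanishing gradient only confirms stationarity; convexity of the two losses is what makes the stationary point a minimizer.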
---

## Logistic Sigmoid Function and Hyperbolic Tangent Function (0.8%)

Consider the logistic sigmoid function defined by
$$
\sigma(a) = \frac{1}{1+\exp(-a)}
$$
and the hyperbolic tangent function defined by
$$
\tanh (a)=\frac{e^a-e^{-a}}{e^a+e^{-a}}
$$

(a)(0.4%) Show that these two functions are related by
$$
\tanh(a) = 2\sigma(2a) - 1
$$

(b)(0.4%) Show that a linear combination of logistic sigmoid functions of the form
$$
y(x, \mathbf{w})=w_0+\sum_{j=1}^M w_j \sigma\left(\frac{x-\mu_j}{s}\right)
$$
is equivalent to a linear combination of tanh functions of the form
$$
y(x, \mathbf{u})=u_0+\sum_{j=1}^M u_j \tanh \left(\frac{x-\mu_j}{\color{red}{2s}}\right)
$$
and find the expressions relating the new parameters $\{u_1, \cdots, u_M\}$ to the original parameters $\{w_1, \cdots, w_M\}$.

---

## Noise and Regularization (0.8%)

Consider the linear model $f_{\mathbf{w},b}: \mathbb{R}^k \rightarrow \mathbb{R}$, where $\mathbf{w} \in \mathbb{R}^k$ and $b \in \mathbb{R}$, defined as
$$
f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$
Given a dataset $S = \{(\mathbf{x}_i,y_i)\}_{i=1}^N$, if the inputs $\mathbf{x}_i \in \mathbb{R}^k$ are contaminated with input noise $\boldsymbol{\eta}_i \in \mathbb{R}^k$, we may consider the expected sum-of-squares loss in the presence of input noise,
$$
{\tilde L}_{ss}(\mathbf{w},b) = \mathbb{E}\left[ \frac{1}{2N}\sum_{i=1}^{N}\left(f_{\mathbf{w},b}(\mathbf{x}_i + \boldsymbol{\eta}_i)-y_i\right)^2 \right]
$$
where the expectation is taken over the randomness of the input noises $\boldsymbol{\eta}_1,\ldots,\boldsymbol{\eta}_N$. <font color="#f00">Additionally, the inputs ($\mathbf{x}_i$) and the input noise ($\boldsymbol{\eta}_i$) are independent.</font>

Now assume the input noises $\boldsymbol{\eta}_i = [\eta_{i,1} ~ \eta_{i,2} ~ \cdots ~ \eta_{i,k}]$ are random vectors with zero mean, $\mathbb{E}[\eta_{i,j}] = 0$, and the covariance between components is given by
$$
\mathbb{E}[\eta_{i,j}\eta_{i',j'}] = \delta_{i,i'}\delta_{j,j'} \sigma^2
$$
where $\delta_{i,i'} = \left\{\begin{array}{ll} 1 & \text{if } i = i'\\ 0 & \text{otherwise} \end{array}\right.$ denotes the Kronecker delta. Please show that
$$
{\tilde L}_{ss}(\mathbf{w},b) = \frac{1}{2N}\sum_{i=1}^{N}\left(f_{\mathbf{w},b}(\mathbf{x}_i)-y_i\right)^2 + \frac{\sigma^2}{2}\|\mathbf{w}\|^2
$$
That is, minimizing the expected sum-of-squares loss in the presence of input noise is equivalent to minimizing the noise-free sum-of-squares loss with the addition of an $L^2$-regularization term on the weights.
- Hint: $\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x} = \operatorname{Trace}(\mathbf{x}\mathbf{x}^T)$, and the square of a vector is the dot product with itself.
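The identity above can also be checked empirically. The following NumPy sketch, for intuition only, Monte Carlo-estimates the expected noisy loss on random data and compares it to the claimed right-hand side; the Gaussian choice for the noise (any zero-mean noise with the stated covariance would do), the data sizes, and all variable names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, sigma = 50, 3, 0.5
X = rng.normal(size=(N, k))          # clean inputs x_i (as rows)
y = rng.normal(size=N)
w = rng.normal(size=k)
b = 0.7

def sse_loss(inputs):
    """Sum-of-squares loss (1/2N) * sum_i (w^T x_i + b - y_i)^2."""
    resid = inputs @ w + b - y
    return 0.5 / N * np.sum(resid ** 2)

# Monte Carlo estimate of the expected loss under zero-mean input noise
# with covariance sigma^2 * I (Gaussian is one convenient choice).
draws = 20000
noisy = np.mean([sse_loss(X + sigma * rng.normal(size=(N, k))) for _ in range(draws)])

# Claimed identity: noise-free loss plus an L2 penalty on the weights.
claimed = sse_loss(X) + 0.5 * sigma ** 2 * np.dot(w, w)
print(noisy, claimed)   # the two values should agree up to Monte Carlo error
```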
---

## Logistic Regression (0.8%)

Consider the problem of **two-class** classification. For the logistic regression problem, the posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\mathbf{x} = [x_1, x_2, \cdots, x_n]^T\in \mathbb{R}^{n}$ with $\mathbf{w} = [w_1, \cdots, w_n]^T\in \mathbb{R}^{n}$, so that
$$
f_{\mathbf{w}, b}(\mathbf{x}) = p(C_1\mid \mathbf{x}) = \sigma\left(\sum_i w_i x_i + b\right) = \sigma(\mathbf{w}^T\mathbf{x} + b)
$$
with $p(C_2\mid \mathbf{x}) = 1 - p(C_1\mid \mathbf{x}) = 1-f_{\mathbf{w}, b}(\mathbf{x})$. Here $\sigma(\cdot)$ is the *logistic sigmoid* function defined by $\sigma(a) = \frac{1}{1+\exp(-a)}$ for real $a$.

(a)(0.2%) Suppose $\mathbf{w} = [-1, 2, -1, 5]^T$, $\mathbf{x}=[7, 0, 3, 10]^T$ and $b = 3$. Please calculate the logistic regression prediction for this particular example.

(b)(0.3%) Given the training data set $\{\mathbf{x}_i, y_{i}\}_{i=1}^N$, where $y_{i}\in \{0, 1\}$, suppose the $N$ observations are generated independently. Please write down the likelihood function $p(\mathbf{y}\mid \mathbf{x})$ in terms of ${y}_{i}$ and $f_{\mathbf{w},b}(\mathbf{x}_{i})$, where $\mathbf{y} = [{y}_{1}, \cdots, {y}_{N}]^T$. Moreover, write down the loss function $L(\mathbf{w}, b)$ defined as the negative of the log likelihood.
- Hint: Use the fact that $y_n$ is either 0 or 1 for all $n$ to rewrite $p(y_n\mid \mathbf{x}_n)$, and $p(\mathbf{y}\mid \mathbf{x}) = \prod_{i} p(y_i\mid\mathbf{x}_i)$

(c)(0.3%) Derive the formula that describes the update rule of the parameters in logistic regression with learning rate $\eta$ (e.g. $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \cdots$). Note that the answer should be in terms of $\mathbf{w}^{(t+1)}, \mathbf{w}^{(t)}, f_{\mathbf{w}, b}(\mathbf{x}_i), \mathbf{x}_i, y_i, \eta$.
- Hint: Gradient descent
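For context only (deriving the gradient is exactly what (c) asks for), here is a minimal NumPy sketch of batch gradient descent on the negative log-likelihood from (b). It uses the textbook cross-entropy gradient; the synthetic data, learning rate, and iteration count are made-up assumptions, and dividing the step by $N$ merely rescales the learning rate relative to the sum form asked for in (c).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_features, eta = 100, 4, 0.1
X = rng.normal(size=(N, n_features))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Synthetic labels in {0, 1} drawn from a hidden logistic model.
w_true = rng.normal(size=n_features)
y = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(n_features)
b = 0.0
for _ in range(500):
    f = sigmoid(X @ w + b)           # f_{w,b}(x_i) for every i
    # Textbook gradient of the negative log-likelihood (cross-entropy):
    #   dL/dw = sum_i (f(x_i) - y_i) x_i,   dL/db = sum_i (f(x_i) - y_i)
    grad_w = X.T @ (f - y)
    grad_b = np.sum(f - y)
    w -= eta / N * grad_w
    b -= eta / N * grad_b

loss = -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))
print(loss)   # the negative log-likelihood should have decreased during training
```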