# HW1 Conceptual: Beras Pt. 1

## Conceptual Questions

1. Consider Mean Squared Error (MSE) loss. Why do we square the error? **Please provide two reasons**. (_2-3 sentences_)

2. We covered a lemonade stand example in class (Intro to Machine Learning), which was a regression task.
    - How would you convert it into a classification task?
    - What are the benefits of keeping it a regression task? What are the benefits of converting it to a classification task?
    - Give another example each of a classification task and a regression task.

<!-- 2. Explain the difference between classification and regression tasks. Explain your answer and give an example of each! (_2-4 sentences_) -->

3. Consider the following network:

    ![](https://i.imgur.com/LFEiOcW.png)

    Let the input, weight, bias, and expected hypothesis matrix $o_{expected}$ be the following:

    $$
    x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\ \ \
    w = \begin{bmatrix} 0.4 & 0.7 & 0.1\\ 0.2 & 0.4 & 0.5 \end{bmatrix}\ \ \
    b = \begin{bmatrix} 0.2 \\ 0 \\ 0.4 \end{bmatrix}\ \ \
    o_{expected} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}
    $$

    Answer the following questions, write out equations as necessary, and **show your work for every step**.
    - Complete a manual forward pass (i.e. find the output matrix $o$, AKA $\hat y$). Assume the network uses a linear function for which the above matrices are compatible.
    - Using Mean Squared Error as your loss metric, compute the loss relative to $o_{expected}$.
    - Perform a single manual backward pass, assuming that the expected output $o_{expected}$, AKA the ground truth value $y$, is given. Calculate the partial derivative of the loss with respect to each parameter.
    - Give the updated weight and bias matrices using the result of your gradient calculations. Use the stochastic gradient descent algorithm with a learning rate of 0.1.

:::info
:::spoiler :shushing_face: **HINTS**:
- Check Lab 1 for the formula to get the partial derivative of MSE loss with respect to the weights and bias.
- To invoke the chain rule, consider computing $\frac{\partial L_{MSE}}{\partial o}$ as an upstream gradient and then composing it with the local gradients relating $o$ to the parameters.
- During the backward pass, pay close attention to what dimensions the gradients *have* to be. That might help you figure out what operations/matrix configurations you'll need to use.
- If you're doing it right, you should expect relatively simple and repetitive numbers.
:::

4. [Empirical risk](https://datascience.stackexchange.com/questions/44091/is-empirical-risk-the-same-thing-as-loss-function) $\hat{R}$ for another machine learning model called [logistic regression](https://www.ibm.com/topics/logistic-regression) (used widely for classification) is:

    $$\hat{R}_{\log}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\ln(1+\exp(-y_i\mathbf{w}^\top\mathbf{x}_i))$$

    This loss is also known as log-loss. Take the gradient of this empirical risk w.r.t. $\mathbf{w} \in \mathbb{R}^d$. Your answer may need to include parameters $\mathbf{w}$, number of examples $n$, labels $y_i\in \mathbb{R}$, and training examples $\mathbf{x}_i \in \mathbb{R}^d$. Show your work for every step!

## 2470-only Questions

The following questions are required only of students enrolled in CSCI2470. **For each question, include a 2-4 sentence argument for why this mathematical detail may be considered interesting and useful.**

1. You have two interdependent random variables $x, y \sim X, Y$ and you'd like to learn a mapping $f: X \to Y$. The best function you can get is $f(x) = \mathbb{E}[Y|X = x]$. This does not account for noise associated with random events. Let $\xi$ be the noise such that $y = f(x) + \xi$. Prove that $\mathbb{E}[\xi] = 0$.
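The claim in Question 1 can be sanity-checked numerically before attempting the proof. The sketch below is an illustration only, not a proof and not part of the assignment: it assumes a small discrete joint distribution with made-up probabilities (`joint`), and the helper names `conditional_mean` and `expected_noise` are hypothetical. It computes $f(x) = \mathbb{E}[Y|X = x]$ from the table and then evaluates $\mathbb{E}[y - f(x)]$.

```python
# Hypothetical joint distribution P(X = x, Y = y) over a few discrete values.
# The probabilities are made up for illustration; they only need to sum to 1.
joint = {
    (0, 1.0): 0.10,
    (0, 3.0): 0.30,
    (1, 2.0): 0.20,
    (1, 6.0): 0.15,
    (2, -1.0): 0.25,
}


def conditional_mean(joint):
    """Return a dict f with f[x] = E[Y | X = x], computed from the joint table."""
    p_x = {}    # marginal P(X = x)
    sum_y = {}  # sum over y of y * P(X = x, Y = y)
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        sum_y[x] = sum_y.get(x, 0.0) + y * p
    return {x: sum_y[x] / p_x[x] for x in p_x}


def expected_noise(joint):
    """Return E[xi], where xi = y - f(x) and f(x) = E[Y | X = x]."""
    f = conditional_mean(joint)
    return sum(p * (y - f[x]) for (x, y), p in joint.items())


f = conditional_mean(joint)
noise_mean = expected_noise(joint)
print(f)
# The noise mean should vanish up to floating-point rounding.
print(abs(noise_mean) < 1e-12)
```

Changing the entries of `joint` (while keeping the probabilities summing to 1) should leave the noise mean at zero up to rounding, which is the pattern the proof needs to explain in general.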
:::info
:::spoiler :shushing_face: **For Question 1, the following information might be relevant:**
- Linearity of expectation states that $\mathbb{E}[a + b] = \mathbb{E}[a] + \mathbb{E}[b]$.
- According to the law of total expectation, $\mathbb{E}[\mathbb{E}[B|A]] = \mathbb{E}[B]$.
:::

2. Let $p_{\bf w}$ conditionally parameterize a normal distribution such that $p_{\bf w}(y | {\bf x}) \sim \mathcal{N}({\bf x}^T{\bf w}, 1)$. Show that the [maximum likelihood objective](https://machinelearningmastery.com/what-is-maximum-likelihood-estimation-in-machine-learning/) can be written using $\underset{{\bf w}}{\arg\min}$ and the squared error between ${\bf x}^T{\bf w}$ and $y$.

    Intuitively, this is a probabilistic interpretation of linear regression. We are interested in how the output $y$ behaves conditional on the values of the input vector $\bf x$ and the model parameters $\bf w$. The model is represented as a Gaussian distribution with the linear function output as its mean. We assume that the variance of the model is fixed ($=1$), i.e. that it doesn't depend on $\bf x$.

:::info
:::spoiler :shushing_face: **For Questions 2 and 3, the following information might be relevant:**
- Weights ${\bf w} \in \mathbb{R}^d$ parameterize a model $p_{\bf w}(y|{\bf x})$ to predict $p(y)$ given $\bf x$.
- The maximum likelihood objective for a dataset $({\bf x}_i, y_i)^n \in (\mathbb{R}^d, \mathbb{R})^n$ involves choosing $\mathbf{w}$ to satisfy the following two equivalent objectives:

    $$\underset{{\bf w}}{\arg\max} \prod_{i=1}^n p_{\bf w}(y_i|{\bf x}_i) = \underset{{\bf w}}{\arg\min}\sum_{i=1}^n\ln\left(p_{\bf w}(y_i|{\bf x}_i)^{-1}\right) \tag{1}$$

    - (To convert the product into a sum we take the log, so we are now trying to learn weights that maximize the log-likelihood of $y|x$.)
- An [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) vector ${\bf x}$, where each value is drawn from the same distribution and does not depend on any of the other values, abides by $\mathbf{P}({\bf x}) = \prod_{i} \mathbf{P}(x_{i})$.
- Recall that the normal (Gaussian) probability density function (PDF) is:

    $$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
:::

3. Assume instead that each element $w_j \sim \mathcal{N}(0,\frac{1}{\lambda})$ and $p_{\bf w}(y\ |({\bf x}, {\bf w})) \sim \mathcal{N}({\bf w}^T {\bf x}, 1)$. Show that the following must be true ($P({\bf w})$ is the probability density of the vector ${\bf w}$; see the hints above):

    $$\underset{{\bf w}}{\arg\max}\left\{ P({\bf w})\prod_{i=1}^n p_{\bf w}(y_i|({\bf x}_i, {\bf w}))\right\} = \underset{{\bf w}}{\arg\min}\left\{\frac{1}{2}\sum_{i=1}^n ({\bf w}^T {\bf x}_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^d w_j^2\right\}$$

    This is a *regularized version* of linear regression (called [ridge regression](https://www.statology.org/ridge-regression/)) where the second term tries to push the weight values that are least influential toward 0. $\lambda$ is a hyperparameter (selected by the user).

<!-- 4.
**[Super-Secret Bonus For-Fun Question]** Assume a linear model $\mathbf{w}^\top\mathbf{x} = \sum_{i=1}^dx_iw_i$ is a data point where $\mathbf{x} \in\mathbb{R}^d$ and $\mathbf{w}\in \mathbb{R}^d$. Linear model can be enhanced (not sure about the meaning of "enhanced"? check [here](https://datascience.stackexchange.com/questions/30465/tips-to-improve-linear-regression-model) and see a link named as "polynomial terms" in the answer written by "Brian Spiering") by doing a quadratic expansion of the features, i.e., by constructing a new linear model $f_w$ with parameters $$(w_0, w_{01}, ..., w_{0d},w_{11},w_{12}, ..., w_{1d}, w_{22}, w_{23}, ..., w_{2d}, ..., w_{dd}),$$ defined by $$f_{\mathbf{w}}(x) = \mathbf{w}^\top\phi(\mathbf{x}) = w_0 + \sum_{i=1}^d w_{0i}x_i + \sum_{i\leq j}^d w_{ij}x_ix_j$$ For example, suppose $\phi(x) = (1, x, x^2)$, and $w = (a, b, c)$. Then we have $$x \mapsto w^\top \phi(x) = a + bx + cx^2$$ and we can learn quadratic polynomials. Here's a question. 