---
tags: hw1, conceptual
---
# HW1 Conceptual: Beras Pt. 1
:::info
Conceptual questions due **Friday, February 9, 2024 at 6:00 PM EST**
Programming assignment due **Monday, February 12, 2024 at 6:00 PM EST**
:::
Answer the following questions, showing your work where necessary, and please explain your reasoning.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there!
**Note:** make sure to select the template file `hw1.tex` from the sidebar on the left in Overleaf.
> #### [**Latex Template**](https://www.overleaf.com/read/qqfcmdrjjvcf#4b624a)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Theme
![](https://m.media-amazon.com/images/I/61+VBRvmv9L.jpg)
*Haters will say that this handsome gentleman is fake.*
## Conceptual Questions
1. Consider Mean Squared Error (MSE) loss. Why do we square the error? **Please provide two reasons**. (_2-3 sentences_)
2. We covered a lemonade stand example in class (Intro to Machine Learning), which was a regression task.
- How would you convert it into a classification task?
- What are the benefits of keeping it a regression task? What are the benefits of converting it to a classification task?
- Give one additional example each of a classification task and a regression task.
<!-- 2. Explain the difference between classification and regression tasks. Explain your answer and give an example of each! (_2-4 sentences_) -->
3. Consider the following network: ![](https://i.imgur.com/LFEiOcW.png)
Let the input $x$, weight matrix $w$, bias $b$, and expected output $o_{expected}$ be the following:
$$
x =
\begin{bmatrix} 1 \\ 2 \end{bmatrix}\ \ \
w =
\begin{bmatrix}
0.4 & 0.7 & 0.1\\
0.2 & 0.4 & 0.5
\end{bmatrix}\ \ \
b =
\begin{bmatrix} 0.2 \\ 0 \\ 0.4 \end{bmatrix}\ \ \
o_{expected} =
\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}
$$
Answer the following questions, write out equations as necessary, and **show your work for every step**. (A small NumPy sketch for sanity-checking your arithmetic appears after the hints below.)
- Complete a manual forward pass (i.e. find the output matrix $o$, AKA $\hat y$). Assume the network uses a linear function for which the above matrices are compatible.
- Using Mean Squared Error as your loss metric, compute the loss relative to $o_{expected}$.
- Perform a single manual backward pass, assuming that the expected output $o_{expected}$, AKA the ground truth value $y$, is given. Calculate the partial derivative of the loss with respect to each parameter.
- Give the updated weight and bias matrices using the results of your gradient calculations. Use the stochastic gradient descent algorithm with a learning rate of 0.1.
:::info
:::spoiler :shushing_face:
**HINTS**:
- Check Lab 1 for the formula to get the partial derivative of MSE loss with respect to the weights and bias.
- To invoke the chain rule, consider computing $\frac{\partial L_{MSE}}{\partial o}$ as an upstream gradient and then composing it with the local gradients relating $o$ with the parameters.
- During the backwards pass, pay close attention to what dimensions the gradients *have* to be. That might help you figure out what operations/matrix configurations you'll need to use.
- If you're doing it right, you should expect relatively simple and repetitive numbers.
:::
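If you would like to sanity-check your arithmetic for Question 3, here is a minimal NumPy sketch, assuming the column-vector convention $o = w^\top x + b$ that makes the given shapes compatible. The matrices below are placeholders (not the assignment's values), and running a script is not a substitute for showing your work.
```python
import numpy as np

def forward(x, w, b):
    """Linear forward pass: o = w^T x + b (column-vector convention)."""
    return w.T @ x + b

def mse(o, y):
    """Mean squared error, averaged over the output entries."""
    return np.mean((o - y) ** 2)

def gradients(x, w, b, y):
    """Gradients of the MSE loss with respect to w and b via the chain rule."""
    o = forward(x, w, b)
    k = o.shape[0]                   # number of output entries
    dL_do = 2.0 * (o - y) / k        # upstream gradient dL/do
    dL_dw = x @ dL_do.T              # compose with the local gradient of o w.r.t. w
    dL_db = dL_do                    # the local gradient of o w.r.t. b is the identity
    return dL_dw, dL_db

# Placeholder values only -- substitute the assignment's matrices yourself.
x = np.array([[1.0], [1.0]])         # shape (2, 1)
w = np.full((2, 3), 0.5)             # shape (2, 3)
b = np.zeros((3, 1))                 # shape (3, 1)
y = np.array([[1.0], [0.0], [1.0]])  # shape (3, 1)

print("loss:", mse(forward(x, w, b), y))
dw, db = gradients(x, w, b, y)
w_new, b_new = w - 0.1 * dw, b - 0.1 * db   # one SGD step with learning rate 0.1
```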
4. [Empirical risk](https://datascience.stackexchange.com/questions/44091/is-empirical-risk-the-same-thing-as-loss-function) $\hat{R}$ for another machine learning model called [logistic regression](https://www.ibm.com/topics/logistic-regression) (used widely for classification) is:
$$\hat{R}_{\log}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\ln(1+\exp(-y_i\mathbf{w}^\top\mathbf{x}_i))$$
This loss is also known as log-loss. Take the gradient of this empirical risk w.r.t. $\mathbf{w} \in \mathbb{R}^d$. Your answer may need to include the parameters $\mathbf{w}$, the number of examples $n$, the labels $y_i\in \mathbb{R}$, and the training examples $\mathbf{x}_i \in \mathbb{R}^d$. Show your work for every step!
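To check the gradient you derive for Question 4, one option is a finite-difference comparison. The sketch below uses made-up data and assumes labels in $\{-1, +1\}$ (the usual convention for log-loss); your analytic formula, evaluated at the same $\mathbf{w}$, should match the printed numbers.
```python
import numpy as np

def empirical_risk(w, X, y):
    """R_log(w) = (1/n) * sum_i ln(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)                        # shape (n,)
    return np.mean(np.log1p(np.exp(-margins)))

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (empirical_risk(w + step, X, y)
                   - empirical_risk(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))              # n = 5 made-up examples, d = 3 features
y = rng.choice([-1.0, 1.0], size=5)      # labels assumed to live in {-1, +1}
w = rng.normal(size=3)
print(numerical_gradient(w, X, y))       # your analytic gradient should match this
```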
## 2470-only Questions
The following questions are only required of students enrolled in CSCI2470. **For each question, include a 2-4 sentence argument for why this mathematical detail may be considered interesting and useful.**
1. You have two interdependent random variables $x, y \sim X, Y$ and you'd like to learn a mapping $f: X \to Y$. The best function you can get is $f(x) = \mathbb{E}[Y|X = x]$. This does not account for the noise associated with random events. Let $\xi$ be the noise such that $y = f(x) + \xi$. Prove that $\mathbb{E}[\xi] = 0$. (An illustrative simulation appears after the hints below.)
:::info
:::spoiler :shushing_face:
**For Question 1, the following information might be relevant:**
- Linearity of Expectation states that $\mathbb{E}[a + b] = \mathbb{E}[a] + \mathbb{E}[b]$
- According to the law of total expectation, $\mathbb{E}[\mathbb{E}[B|A]] = \mathbb{E}[B]$
:::
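The sketch below is only an illustration of Question 1's claim, not a proof: for a made-up joint distribution where $\mathbb{E}[Y \mid X = x]$ is known in closed form, the sampled residual $\xi = y - f(x)$ averages to roughly zero even though the conditional noise is skewed.
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.exponential(scale=x**2 + 1)   # by construction, E[Y | X = x] = x**2 + 1
xi = y - (x**2 + 1)                   # xi = y - f(x) with f(x) = E[Y | X = x]
print(xi.mean())                      # close to 0, consistent with E[xi] = 0
```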
2. Let $p_{\bf w}$ conditionally parameterize a normal distribution such that $p_{\bf w}(y | {\bf x}) \sim \mathcal{N}({\bf x}^T{\bf w}, 1)$. Show that the [maximum likelihood objective](https://machinelearningmastery.com/what-is-maximum-likelihood-estimation-in-machine-learning/) can be written as an $\underset{{\bf w}}{\arg\min}$ of the squared error between ${\bf x}^T{\bf w}$ and $y$. Intuitively, this is a probabilistic interpretation of linear regression: we are interested in how the output $y$ behaves conditioned on the values of the input vector $\bf x$ and the parameters of the model $\bf w$. The model is represented as a Gaussian distribution with the linear function output as its mean, and we assume that the variance of the model is fixed ($=1$), i.e. that it does not depend on $\bf x$. (A numerical check of this equivalence appears after the hints below.)
:::info
:::spoiler :shushing_face:
**For Questions 2 and 3, the following information might be relevant:**
- Weights ${\bf w} \in \mathbb{R^d}$ parameterize a model $p_{\bf w}(y|{\bf x})$ to predict $p(y)$ given $\bf x$.
- The maximum likelihood objective for a dataset $({\bf x}_i, y_i)^n \in (\mathbb{R}^d, \mathbb{R})^n$ involves choosing $\mathbf{w}$ such that the following two objectives (which are equivalent) are satisfied:
$$\underset{{\bf w}}{\arg\max} \prod_{i=1}^n p_{\bf w}(y_i|{\bf x}_i) = \underset{{\bf w}}{\arg\min}\sum_{i=1}^n\ln\left({p_{\bf w}(y_i|{\bf x}_i)}^{-1}\right) \tag{1}$$
- (To convert the multiplicative objective into an additive one we take the log, so we are now trying to learn weights such that the log-likelihood of $y|x$ is maximized.)
- An [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) vector ${\bf x}$ - where each value is pulled from the same distribution and does not depend on any of the other values - abides by $\mathbf{P}({\bf x}) = \prod_{i} \mathbf{P}(x_{i})$.
- Recall that the normal (Gaussian) probability density function (PDF) is:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
:::
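As a numerical illustration of the equivalence in Question 2 (made-up data, not a derivation): the Gaussian negative log-likelihood with unit variance differs from half the squared error only by a constant that does not depend on $\bf w$, so both objectives share the same $\arg\min$.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))           # made-up inputs x_i as rows
y = rng.normal(size=4)                # made-up targets y_i

def neg_log_likelihood(w):
    """-ln prod_i p_w(y_i | x_i) with p_w(y | x) = N(x^T w, 1)."""
    mu = X @ w
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - mu) ** 2)

def half_squared_error(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

for _ in range(3):
    w = rng.normal(size=3)
    print(neg_log_likelihood(w) - half_squared_error(w))   # same constant every time
```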
3. Assume instead that each element $w_j \sim \mathcal{N}(0,\frac{1}{\lambda})$ and $p_{\bf w}(y\ |({\bf x}, {\bf w})) \sim \mathcal{N}({\bf w}^T {\bf x}, 1)$. Show that the following must be true ($P({\bf w})$ is the probability density of the vector ${\bf w}$; see hints above):
$$\underset{{\bf w}}{\arg\max}\left\{ P({\bf w})\prod_{i=1}^n p_{\bf w}(y_i|({\bf x}_i, {\bf w}))\right\} = \underset{{\bf w}}{\arg\min}\left\{\frac{1}{2}\sum_{i=1}^n ({\bf w}^T {\bf x}_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^d w_j^2\right\}$$ This is a *regularized version* of linear regression (called [ridge regression](https://www.statology.org/ridge-regression/)), where the second term pushes the least influential weight values toward 0. $\lambda$ is a hyperparameter (selected by the user).
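Similarly, here is a small numerical check of Question 3's claim (made-up data, $\lambda$ chosen arbitrarily): the negative log of the left-hand objective differs from the right-hand ridge objective by a constant that does not depend on $\bf w$, so the $\arg\max$ and $\arg\min$ coincide.
```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                              # arbitrary choice of the hyperparameter lambda
X = rng.normal(size=(5, 3))            # made-up inputs x_i as rows
y = rng.normal(size=5)                 # made-up targets y_i

def neg_log_posterior(w):
    # -ln P(w): each w_j ~ N(0, 1/lam)
    prior = np.sum(0.5 * np.log(2 * np.pi / lam) + 0.5 * lam * w ** 2)
    # -ln prod_i p_w(y_i | x_i): each y_i ~ N(w^T x_i, 1)
    likelihood = np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (X @ w - y) ** 2)
    return prior + likelihood

def ridge_objective(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

for _ in range(3):
    w = rng.normal(size=3)
    print(neg_log_posterior(w) - ridge_objective(w))   # identical constant each time
```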
<!-- 4. **[Super-Secret Bonus For-Fun Question]** Assume a linear model $\mathbf{w}^\top\mathbf{x} = \sum_{i=1}^dx_iw_i$ is a data point where $\mathbf{x} \in\mathbb{R}^d$ and $\mathbf{w}\in \mathbb{R}^d$. Linear model can be enhanced (not sure about the meaning of "enhanced"? check [here](https://datascience.stackexchange.com/questions/30465/tips-to-improve-linear-regression-model) and see a link named as "polynomial terms" in the answer written by "Brian Spiering") by doing a quadratic expansion of the features, i.e., by constructing a new linear model $f_w$ with parameters
$$(w_0, w_{01}, ..., w_{0d},w_{11},w_{12}, ..., w_{1d}, w_{22}, w_{23}, ..., w_{2d}, ..., w_{dd}),$$
defined by
$$f_{\mathbf{w}}(x) = \mathbf{w}^\top\phi(\mathbf{x}) = w_0 + \sum_{i=1}^d w_{0i}x_i + \sum_{i\leq j}^d w_{ij}x_ix_j$$
For example, suppose $\phi(x) = (1, x, x^2)$, and $w = (a, b, c)$. Then we have
$$x \mapsto w^\top \phi(x) = a + bx + cx^2$$
and we can learn quadratic polynomials.
Here's a question. Given a 3-dimensional feature vector $\mathbf{x} = (x_1, x_2, x_3)$, completely write out the quadratic expanded feature vector $\phi(\mathbf{x})$.
:::info
**HINT**: your answer should be of form like:
$$\phi(x):=(1, \underbrace{x_1,...,x_d}_{\textit{linear terms}},\underbrace{ x_1^2,...,x_d^2}_{\textit{sqaured terms}},\underbrace{x_1x_2,...,x_1x_d,...,x_{d-1}x_d}_{\textit{cross terms}})$$
::: -->
<style>
.alert {
color: inherit
}
.markdown-body {
font-family: Inter
}
/* Some really hacky CSS to hide bullet points
* for spoilers in lists */
li:has(details) {
list-style-type: none;
margin-left: -1em
}
li > details > summary {
margin-left: 1em
}
li > details > summary::-webkit-details-marker {
margin-left: -1.05em
}
</style>
<!--## Ethical Questions
Before starting these questions, read our [SRC guidelines](https://hackmd.io/@BrownDeepLearningS23/H1-eGm2jo) to understand your expectations when engaging with SRC content and what you can expect from us.
Before answering the below questions, you should read the abstract from [this article](https://www.science.org/doi/full/10.1126/science.aax2342?casa_token=zMj9u8P9YHsAAAAA%3AvQoR6lLwt9W-noffBAyVqCmOmHiZ1YmYMmHQIwkzYw9wmZP000DMlo6qSfs5xmQc4yrRXT_zGGDX_A) from *Science* discussing racial bias in data used to predict extra care needs.
1. Read [this fact sheet](https://www.who.int/health-topics/social-determinants-of-health#tab=tab_1) put out by the World Health Organization (WHO).
> The diabetes dataset used for the linear regression task in this assignment includes the following variables (features): *age, sex, body mass index, average blood pressure, total serum cholesterol, low-density lipoproteins, high-density lipoproteins, total cholesterol, triglycerides, and blood sugar level*.
Notice that these variables are all specifically health-related. Name a few specific non-health variables that might contribute to diabetes progression. Explain how these variables may affect diabetes progression. (3-5 sentences)
It is clear that the variables we organize our data around become important to our understanding of it. Another important aspect is the collection of the data—ensuring that it faithfully represents the population.
2. Read the following sentence from [this](https://www.nature.com/articles/s41591-020-01174-9) research paper that discusses a deep learning approach to identify breast cancer from mammograms.
> Testing generalization to this dataset is particularly meaningful given the low screening rates in China and the known (and potentially unknown) biological differences found in mammograms between Western and Asian populations, including a greater proportion of women with dense breasts in Asian populations.”
Researchers should ensure that the sample data they gather is representative of the population of interest. Explain how nonrepresentative sampling can lead to inaccurate conclusions. Propose three different ways of reducing sampling bias. Feel free to refer to the sentence above in your explanation. (4-5 sentences)
Many of the deep learning models that you will be learning about this semester rely on training data to make accurate predictions and perform the desired task. The quality of the datasets you use have great implications on the accuracies and biases of the models you develop and train. As we progress through the semester, keep in mind that a lot of issues – regarding both model performance and ethical consideration – are introduced merely by the data that gets used to train our models.
-->