---
tags: hw1, conceptual
---

# HW1 Conceptual: Beras Pt. 1

:::info
Conceptual questions due **Friday, February 9, 2024 at 6:00 PM EST**

Programming assignment due **Monday, February 12, 2024 at 6:00 PM EST**
:::

Answer the following questions, showing your work where necessary. Please explain your answers and work.

:::info
We encourage the use of $\LaTeX$ to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there!

**Note:** make sure to select the template file `hw1.tex` from the sidebar on the left of Overleaf.

> #### [**Latex Template**](https://www.overleaf.com/read/qqfcmdrjjvcf#4b624a)
:::

:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::

## Theme

![](https://m.media-amazon.com/images/I/61+VBRvmv9L.jpg)

*Haters will say that this handsome gentleman is fake.*

## Conceptual Questions

1. Consider Mean Squared Error (MSE) loss. Why do we square the error? **Please provide two reasons**. (_2-3 sentences_)

2. We covered a lemonade stand example in class (Intro to Machine Learning), which was a regression task.
    - How would you convert it into a classification task?
    - What are the benefits of keeping it a regression task? What are the benefits of converting it to a classification task?
    - Give one more example each of a classification task and a regression task.

    <!-- 2. Explain the difference between classification and regression tasks. Explain your answer and give an example of each! (_2-4 sentences_) -->

3. Consider the following network:

    ![](https://i.imgur.com/LFEiOcW.png)

    Let the input, weight, bias, and expected hypothesis matrix $o_{expected}$ be the following:

    $$
    x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\ \ \ w = \begin{bmatrix} 0.4 & 0.7 & 0.1\\ 0.2 & 0.4 & 0.5 \end{bmatrix}\ \ \ b = \begin{bmatrix} 0.2 \\ 0 \\ 0.4 \end{bmatrix}\ \ \ o_{expected} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}
    $$

    Answer the following questions, write out equations as necessary, and **show your work for every step**.
    - Complete a manual forward pass (i.e. find the output matrix $o$, AKA $\hat y$). Assume the network uses a linear function for which the above matrices are compatible.
    - Using Mean Squared Error as your loss metric, compute the loss relative to $o_{expected}$.
    - Perform a single manual backward pass, assuming that the expected output $o_{expected}$, AKA the ground truth value $y$, is given. Calculate the partial derivative of the loss with respect to each parameter.
    - Give the updated weight and bias matrices using the result of your gradient calculations. Use the stochastic gradient descent algorithm with learning rate = 0.1.

    :::info
    :::spoiler :shushing_face: **HINTS**:
    - Check Lab 1 for the formula to get the partial derivative of MSE loss with respect to the weights and bias.
    - To invoke the chain rule, consider computing $\frac{\partial L_{MSE}}{\partial o}$ as an upstream gradient and then composing it with the local gradients relating $o$ to the parameters.
    - During the backward pass, pay close attention to what dimensions the gradients *have* to be. That might help you figure out what operations/matrix configurations you'll need to use.
    - If you're doing it right, you should expect relatively simple and repetitive numbers.
    :::
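    If you want to sanity-check your hand computations, here is a minimal, optional NumPy sketch (not part of the required submission). It assumes the linear layer computes $o = w^\top x + b$, that the MSE is averaged over the three output entries, and a plain SGD update; if your Lab 1 convention sums the squared errors instead of averaging, your gradients will differ by a constant factor.

    ```python
    import numpy as np

    # Values from the problem statement
    x = np.array([[1.0], [2.0]])                  # input, shape (2, 1)
    w = np.array([[0.4, 0.7, 0.1],
                  [0.2, 0.4, 0.5]])               # weights, shape (2, 3)
    b = np.array([[0.2], [0.0], [0.4]])           # bias, shape (3, 1)
    y = np.array([[1.0], [2.0], [0.0]])           # expected output o_expected, shape (3, 1)
    lr = 0.1

    # Forward pass: o = w^T x + b, shape (3, 1)
    o = w.T @ x + b

    # MSE loss, averaged over the 3 output entries (check this against your Lab 1 convention)
    loss = np.mean((o - y) ** 2)

    # Backward pass via the chain rule: upstream gradient dL/do, then local gradients
    dL_do = 2.0 * (o - y) / o.shape[0]            # shape (3, 1)
    dL_dw = x @ dL_do.T                           # shape (2, 3), matches w
    dL_db = dL_do                                 # shape (3, 1), matches b

    # SGD update with learning rate 0.1
    w_new = w - lr * dL_dw
    b_new = b - lr * dL_db

    print(o, loss, dL_dw, dL_db, w_new, b_new, sep="\n\n")
    ```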
4. [Empirical risk](https://datascience.stackexchange.com/questions/44091/is-empirical-risk-the-same-thing-as-loss-function) $\hat{R}$ for another machine learning model called [logistic regression](https://www.ibm.com/topics/logistic-regression) (used widely for classification) is:

    $$\hat{R}_{\log}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\ln(1+\exp(-y_i\mathbf{w}^\top\mathbf{x}_i))$$

    This loss is also known as log-loss. Take the gradient of this empirical risk w.r.t $\mathbf{w} \in \mathbb{R}^d$. Your answer may need to include parameters $\mathbf{w}$, number of examples $n$, labels $y_i\in \mathbb{R}$, and training examples $\mathbf{x}_i \in \mathbb{R}^d$. Show your work for every step!
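    Once you have a candidate gradient, you can check it numerically. The optional sketch below (illustrative code, not part of the assignment) evaluates the empirical risk above and a central finite-difference approximation of its gradient on toy data; labels are taken in $\{-1, +1\}$, the usual convention for this form of the log-loss. It does not spell out the analytic gradient, it only gives you numbers to compare your derived expression against.

    ```python
    import numpy as np

    def empirical_risk_log(w, X, y):
        """Log-loss empirical risk: (1/n) * sum_i ln(1 + exp(-y_i * w^T x_i))."""
        margins = y * (X @ w)                      # shape (n,)
        return np.mean(np.log1p(np.exp(-margins)))

    def finite_difference_grad(w, X, y, eps=1e-6):
        """Central-difference estimate of the gradient of the empirical risk w.r.t. w."""
        grad = np.zeros_like(w)
        for j in range(w.size):
            e = np.zeros_like(w)
            e[j] = eps
            grad[j] = (empirical_risk_log(w + e, X, y)
                       - empirical_risk_log(w - e, X, y)) / (2 * eps)
        return grad

    # Toy data: n = 5 examples, d = 3 features, labels in {-1, +1}
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
    w = rng.normal(size=3)

    print(finite_difference_grad(w, X, y))
    # Compare this against your hand-derived gradient evaluated at the same w, X, y.
    ```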
## 2470-only Questions

The following questions are required only for students enrolled in CSCI2470. **For each question, include a 2-4 sentence argument for why this mathematical detail may be considered interesting and useful.**

1. You have two interdependent random variables $x, y \sim X, Y$ and you'd like to learn a mapping $f: X \to Y$. The best function you can get is $f(x) = \mathbb{E}[Y|X = x]$. This does not account for noise associated with random events. Let $\xi$ be the noise such that $y = f(x) + \xi$. Prove that $\mathbb{E}[\xi] = 0$.

    :::info
    :::spoiler :shushing_face: **For Question 1, the following information might be relevant:**
    - Linearity of expectation states that $\mathbb{E}[a + b] = \mathbb{E}[a] + \mathbb{E}[b]$.
    - According to the law of total expectation, $\mathbb{E}[\mathbb{E}[B|A]] = \mathbb{E}[B]$.
    :::

2. Let $p_{\bf w}$ conditionally parameterize a normal distribution such that $p_{\bf w}(y | {\bf x}) \sim \mathcal{N}({\bf x}^T{\bf w}, 1)$. Show that the [maximum likelihood objective](https://machinelearningmastery.com/what-is-maximum-likelihood-estimation-in-machine-learning/) can be written using $\underset{{\bf w}}{\arg\min}$ and the squared error between ${\bf x}^T{\bf w}$ and $y$.

    Intuitively, this is a probabilistic interpretation of linear regression. We are interested in how the output $y$ behaves conditional on the values of the input vector $\bf x$, as well as on the parameters of the model $\bf w$. The model is represented as a Gaussian distribution with the linear function's output as the mean. We assume that the variance of the model is fixed at $1$ (i.e. it does not depend on $\bf x$).

    :::info
    :::spoiler :shushing_face: **For Questions 2 and 3, the following information might be relevant:**
    - Weights ${\bf w} \in \mathbb{R}^d$ parameterize a model $p_{\bf w}(y|{\bf x})$ to predict $p(y)$ given $\bf x$.
    - The maximum likelihood objective for a dataset $({\bf x}_i, y_i)^n \in (\mathbb{R}^d, \mathbb{R})^n$ involves choosing $\mathbf{w}$ such that the following two (equivalent) objectives are satisfied:
      $$\underset{{\bf w}}{\arg\max} \prod_{i=1}^n p_{\bf w}(y_i|{\bf x}_i) = \underset{{\bf w}}{\arg\min}\sum_{i=1}^n\ln\left(p_{\bf w}(y_i|{\bf x}_i)^{-1}\right) \tag{1}$$
    - (To convert the multiplicative objective into an additive one we take the log, so we are trying to learn weights such that the log-likelihood of $y|x$ is maximized.)
    - An [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) vector ${\bf x}$ - where each value is pulled from the same distribution and does not depend on any of the other values - abides by $\mathbf{P}({\bf x}) = \prod_{i} \mathbf{P}(x_{i})$.
    - Recall that the normal (Gaussian) probability density function (PDF) is:
      $$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
    :::

3. Assume instead that each element $w_j \sim \mathcal{N}(0,\frac{1}{\lambda})$ and $p_{\bf w}(y\ |\ ({\bf x}, {\bf w})) \sim \mathcal{N}({\bf w}^T {\bf x}, 1)$. Show that the following must be true ($P({\bf w})$ is the probability density of the vector ${\bf w}$; see hints above):

    $$\underset{{\bf w}}{\arg\max}\left\{ P({\bf w})\prod_{i=1}^n p_{\bf w}(y_i|({\bf x}_i, {\bf w}))\right\} = \underset{{\bf w}}{\arg\min}\left\{\frac{1}{2}\sum_{i=1}^n ({\bf w}^T {\bf x}_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^d w_j^2\right\}$$

    This is a *regularized version* of linear regression (called [ridge regression](https://www.statology.org/ridge-regression/)) in which the second term pushes the least influential weight values toward 0. $\lambda$ is a hyperparameter (selected by the user).
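    As an optional numerical illustration (this sketch is not required and is not a substitute for the derivation), the script below evaluates the negative log of the left-hand objective and the right-hand ridge objective at a couple of random weight vectors. Under the stated Gaussian assumptions the two should differ only by a constant that does not depend on ${\bf w}$, which is exactly why the $\arg\max$ and $\arg\min$ coincide (question 2 is the special case with no prior term). It assumes SciPy is available for the Gaussian log-density; otherwise you can code the density by hand from the PDF in the hints.

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, d, lam = 6, 3, 2.0                 # toy sizes and lambda (hyperparameter)
    X = rng.normal(size=(n, d))           # rows are the x_i
    y = rng.normal(size=n)

    def neg_log_posterior(w):
        """-ln[ P(w) * prod_i p_w(y_i | x_i) ] under the stated Gaussian assumptions."""
        log_likelihood = norm.logpdf(y, loc=X @ w, scale=1.0).sum()          # y_i | x_i ~ N(w^T x_i, 1)
        log_prior = norm.logpdf(w, loc=0.0, scale=np.sqrt(1.0 / lam)).sum()  # w_j ~ N(0, 1/lambda)
        return -(log_likelihood + log_prior)

    def ridge_objective(w):
        """(1/2) * sum_i (w^T x_i - y_i)^2 + (lambda/2) * sum_j w_j^2."""
        return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

    # The difference should be the same w-independent constant at any w,
    # so both objectives share the same minimizer.
    for w in (rng.normal(size=d), rng.normal(size=d)):
        print(neg_log_posterior(w) - ridge_objective(w))
    ```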
<!-- 4. **[Super-Secret Bonus For-Fun Question]**

Assume a linear model $\mathbf{w}^\top\mathbf{x} = \sum_{i=1}^d x_iw_i$, where $\mathbf{x} \in\mathbb{R}^d$ is a data point and $\mathbf{w}\in \mathbb{R}^d$ are the weights. A linear model can be enhanced (not sure about the meaning of "enhanced"? check [here](https://datascience.stackexchange.com/questions/30465/tips-to-improve-linear-regression-model) and see the link named "polynomial terms" in the answer written by Brian Spiering) by doing a quadratic expansion of the features, i.e., by constructing a new linear model $f_w$ with parameters

$$(w_0, w_{01}, ..., w_{0d},w_{11},w_{12}, ..., w_{1d}, w_{22}, w_{23}, ..., w_{2d}, ..., w_{dd}),$$

defined by

$$f_{\mathbf{w}}(x) = \mathbf{w}^\top\phi(\mathbf{x}) = w_0 + \sum_{i=1}^d w_{0i}x_i + \sum_{i\leq j}^d w_{ij}x_ix_j$$

For example, suppose $\phi(x) = (1, x, x^2)$, and $w = (a, b, c)$. Then we have

$$x \mapsto w^\top \phi(x) = a + bx + cx^2$$

and we can learn quadratic polynomials.

Here's a question. Given a 3-dimensional feature vector $\mathbf{x} = (x_1, x_2, x_3)$, completely write out the quadratic expanded feature vector $\phi(\mathbf{x})$.

:::info
**HINT**: your answer should be of a form like:
$$\phi(x):=(1, \underbrace{x_1,...,x_d}_{\textit{linear terms}},\underbrace{ x_1^2,...,x_d^2}_{\textit{squared terms}},\underbrace{x_1x_2,...,x_1x_d,...,x_{d-1}x_d}_{\textit{cross terms}})$$
:::
-->

<style>
    .alert { color: inherit }
    .markdown-body { font-family: Inter }

    /* Some really hacky CSS to hide bullet points
     * for spoilers in lists */
    li:has(details) { list-style-type: none; margin-left: -1em }
    li > details > summary { margin-left: 1em }
    li > details > summary::-webkit-details-marker { margin-left: -1.05em }
</style>

<!--## Ethical Questions

Before starting these questions, read our [SRC guidelines](https://hackmd.io/@BrownDeepLearningS23/H1-eGm2jo) to understand what is expected of you when engaging with SRC content and what you can expect from us.

Before answering the questions below, you should read the abstract of [this article](https://www.science.org/doi/full/10.1126/science.aax2342?casa_token=zMj9u8P9YHsAAAAA%3AvQoR6lLwt9W-noffBAyVqCmOmHiZ1YmYMmHQIwkzYw9wmZP000DMlo6qSfs5xmQc4yrRXT_zGGDX_A) from *Science* discussing racial bias in data used to predict extra care needs.

1. Read [this fact sheet](https://www.who.int/health-topics/social-determinants-of-health#tab=tab_1) put out by the World Health Organization (WHO).

    > The diabetes dataset used for the linear regression task in this assignment includes the following variables (features): *age, sex, body mass index, average blood pressure, total serum cholesterol, low-density lipoproteins, high-density lipoproteins, total cholesterol, triglycerides, and blood sugar level*.

    Notice that these variables are all specifically health-related. Name a few specific non-health variables that might contribute to diabetes progression. Explain how these variables may affect diabetes progression. (3-5 sentences)

It is clear that the variables we organize our data around become important to our understanding of it. Another important aspect is the collection of the data—ensuring that it faithfully represents the population.

2. Read the following sentence from [this](https://www.nature.com/articles/s41591-020-01174-9) research paper, which discusses a deep learning approach to identifying breast cancer from mammograms.

    > Testing generalization to this dataset is particularly meaningful given the low screening rates in China and the known (and potentially unknown) biological differences found in mammograms between Western and Asian populations, including a greater proportion of women with dense breasts in Asian populations.

    Researchers should ensure that the sample data they gather is representative of the population of interest. Explain how nonrepresentative sampling can lead to inaccurate conclusions. Propose three different ways of reducing sampling bias. Feel free to refer to the sentence above in your explanation. (4-5 sentences)

Many of the deep learning models that you will be learning about this semester rely on training data to make accurate predictions and perform the desired task. The quality of the datasets you use has great implications for the accuracies and biases of the models you develop and train. As we progress through the semester, keep in mind that a lot of issues – regarding both model performance and ethical consideration – are introduced merely by the data that gets used to train our models.
-->