HW1 Conceptual: Beras Pt. 1
Conceptual questions due Friday, February 9, 2024 at 6:00 PM EST
Programming assignment due Monday, February 12, 2024 at 6:00 PM EST
Answer the following questions, showing your work where necessary. Please explain your answers and work.
We encourage the use of LaTeX to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there!
Note: make sure to select the template file hw1.tex from the sidebar on the left of Overleaf.
Do NOT include your name anywhere within this submission. Points will be deducted if you do so.
Theme
Haters will say that this handsome gentleman is fake.
Conceptual Questions
- Consider Mean Squared Error (MSE) loss. Why do we square the error? Please provide two reasons. (2-3 sentences)
- We covered a lemonade stand example in the class (Intro to Machine Learning), which was a regression task.
- How would you convert it into a classification task?
- What are the benefits of keeping it a regression task? What are the benefits of converting it to a classification task?
- Give one additional example each of a classification task and a regression task.
- Consider the following network:
[Network diagram image not shown]
Let the input, weight, bias, and expected hypothesis matrix be the following:
Answer the following questions, write out equations as necessary, and show your work for every step.
- Complete a manual forward pass (i.e., find the output matrix, AKA the hypothesis $\hat{y}$). Assume the network uses a linear function for which the above matrices are compatible.
- Using Mean Squared Error as your loss metric, compute the loss relative to the expected output $y$.
- Perform a single manual backward pass, assuming that the expected output $y$, AKA the ground truth value, is given. Calculate the partial derivative of the loss with respect to each parameter.
- Give the updated weight and bias matrices using the result of your gradient calculations. Use the stochastic gradient descent algorithm with learning rate = 0.1.
HINTS:
- Check Lab 1 for the formula to get the partial derivative of MSE loss with respect to the weights and bias.
- To invoke the chain rule, consider computing $\frac{\partial L}{\partial \hat{y}}$ as an upstream gradient and then composing it with the local gradients relating $\hat{y}$ to the parameters.
- During the backward pass, pay close attention to the dimensions the gradients must have. That might help you figure out which operations/matrix configurations you'll need to use.
- If you're doing it right, you should expect relatively simple and repetitive numbers.
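
To make the shapes and chain-rule bookkeeping concrete, here is a minimal NumPy sketch of the same computation (forward pass, MSE loss, gradients, and one SGD step). The matrices below are made-up placeholders, not the ones from this question, and the gradient formulas assume the MSE convention from Lab 1; substitute the assignment's matrices and check the result against your hand-worked answer.

```python
import numpy as np

# Placeholder values -- substitute the actual input, weight, bias, and label
# matrices from the question; these are NOT the assignment's numbers.
x = np.array([[1.0, 2.0]])        # input, shape (1, 2)
W = np.array([[0.5], [0.5]])      # weights, shape (2, 1)
b = np.array([[0.0]])             # bias, shape (1, 1)
y = np.array([[2.0]])             # expected output (ground truth), shape (1, 1)
alpha = 0.1                       # learning rate

# Forward pass through a single linear layer
y_hat = x @ W + b                 # hypothesis, shape (1, 1)

# MSE loss relative to the ground truth y
loss = np.mean((y_hat - y) ** 2)

# Backward pass via the chain rule: upstream gradient dL/dy_hat,
# then local gradients of y_hat with respect to W and b
n = y.shape[0]
dL_dyhat = 2.0 * (y_hat - y) / n              # shape (1, 1)
dL_dW = x.T @ dL_dyhat                        # shape (2, 1), matches W
dL_db = dL_dyhat.sum(axis=0, keepdims=True)   # shape (1, 1), matches b

# Single stochastic gradient descent update
W -= alpha * dL_dW
b -= alpha * dL_db

print(loss, W, b)
```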
- Empirical risk for another machine learning model called logistic regression (used widely for classification) is:
  $$E(w) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log\big(\sigma(w^{\top} x_i)\big) + (1 - y_i)\log\big(1 - \sigma(w^{\top} x_i)\big)\Big],$$
  where $\sigma$ is the sigmoid function.
This loss is also known as log-loss. Take the gradient of this empirical risk w.r.t. $w$. Your answer may need to include the parameters $w$, the number of examples $n$, the labels $y_i$, and the training examples $x_i$. Show your work for every step!
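
If you want to sanity-check your analytic gradient, the sketch below assumes the standard binary log-loss with a sigmoid model $\sigma(x \cdot w)$ (make sure this matches the risk stated above before relying on it) and compares the closed-form gradient to a finite-difference estimate on toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Empirical risk (log-loss) for logistic regression."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_grad(w, X, y):
    """Analytic gradient of the log-loss w.r.t. w: (1/n) X^T (sigmoid(Xw) - y)."""
    n = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / n

# Toy data (placeholder values, not from the assignment)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)

# Central finite-difference check of the analytic gradient
eps = 1e-6
numeric = np.array([
    (log_loss(w + eps * e, X, y) - log_loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, log_loss_grad(w, X, y), atol=1e-6))  # True
```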
2470-only Questions
The following are questions that are only required by students enrolled in CSCI2470. For each question, include a 2-4 sentence argument for why this mathematical detail may be considered interesting and useful.
- You have two interdependent random variables $X$ and $Y$, and you'd like to learn a mapping $f: X \to Y$. The best function you can get is $f(x) = \mathbb{E}[Y \mid X = x]$. This does not account for noise associated with random events. Let $\epsilon$ be the noise such that $Y = f(X) + \epsilon$. Prove that $\mathbb{E}[\epsilon] = 0$.
For Question 1, the following information might be relevant:
- Linearity of Expectation states that $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.
- According to the law of total expectation, $\mathbb{E}\big[\mathbb{E}[Y \mid X]\big] = \mathbb{E}[Y]$.
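
This is not a substitute for the proof, but the following Monte Carlo sketch (with a made-up joint distribution over $X$ and $Y$) illustrates the claim: when $f(x) = \mathbb{E}[Y \mid X = x]$, the residual $\epsilon = Y - f(X)$ averages out to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up interdependent random variables: X uniform, Y depends on X plus randomness.
n = 1_000_000
X = rng.uniform(-1, 1, size=n)
Y = X ** 2 + rng.normal(0.0, 0.5, size=n) * (1 + np.abs(X))  # zero-mean, X-dependent noise

# The best predictor is f(x) = E[Y | X = x]; by construction that conditional mean is x**2.
f_X = X ** 2

# Noise is defined by Y = f(X) + eps, so eps = Y - f(X).
eps = Y - f_X

print(eps.mean())  # close to 0, consistent with E[eps] = 0
```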
- Let $f(x; w)$ conditionally parameterize a normal distribution such that $P(y \mid x) = \mathcal{N}\big(y;\, f(x; w), \sigma^2\big)$. Show that the maximum likelihood objective can be written using $\sigma$ and the squared error between $y$ and $f(x; w)$. Intuitively, this is a probabilistic interpretation of linear regression. We are interested in the probability of the output $y$ conditional on the values of the input vector $x$, as well as any parameters $w$ of the model. The model is represented as a Gaussian distribution with the linear function output as the mean. We assume that the variance in the model is fixed ($\sigma^2$), i.e. that it doesn't depend on $x$.
For Questions 2 and 3, the following information might be relevant:
- Weights $w$ parameterize a model $f(x; w)$ used to predict $y$ given $x$.
- The maximum likelihood objective for a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ involves choosing $w$ such that the following two objectives (which are equivalent) are satisfied:
  $$\max_w \prod_{i=1}^{n} P(y_i \mid x_i; w) \qquad\text{and}\qquad \max_w \sum_{i=1}^{n} \log P(y_i \mid x_i; w)$$
- (To convert the multiplicative objective to an additive one we take the log, so now we are trying to learn weights $w$ such that the log-likelihood of the data is maximized.)
- An i.i.d. vector $a = (a_1, \ldots, a_n)$ - where each value is pulled from the same distribution and does not depend on any of the other values - abides by $P(a) = \prod_{i=1}^{n} P(a_i)$.
- Recall that the normal (Gaussian) probability density function (PDF) is:
  $$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
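
As a quick numerical illustration of Question 2's claim (using an assumed fixed variance $\sigma^2$ and toy data, not the assignment's setup), the sketch below evaluates the negative Gaussian log-likelihood directly from the PDF above and checks that it equals a scaled squared error plus a constant that does not depend on the model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model and data (placeholders); sigma is the fixed noise standard deviation
n, sigma = 100, 0.7
x = rng.normal(size=n)
w = 1.5                                    # toy model parameter
f_x = w * x                                # model output, used as the Gaussian mean
y = f_x + rng.normal(0.0, sigma, size=n)

# Negative log-likelihood computed directly from the Gaussian PDF of each i.i.d. example
pdf = np.exp(-(y - f_x) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
nll = -np.sum(np.log(pdf))

# Equivalent form: scaled squared error plus a constant independent of f(x)
sse_form = np.sum((y - f_x) ** 2) / (2 * sigma ** 2) + n * np.log(np.sqrt(2 * np.pi * sigma ** 2))

# The two agree, so maximizing the likelihood is the same as minimizing squared error
print(np.isclose(nll, sse_form))  # True
```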
- Assume instead that $y_i \sim \mathcal{N}\big(f(x_i; w), \sigma^2\big)$ and $w_j \sim \mathcal{N}(0, \beta^2)$. Show that the following must be true ($P(w)$ is the probability density of the vector $w$; see the hints above):
  $$\arg\max_w \Big[\prod_{i=1}^{n} P(y_i \mid x_i; w)\Big] P(w) \;=\; \arg\min_w \sum_{i=1}^{n} \big(y_i - f(x_i; w)\big)^2 + \lambda \sum_{j} w_j^2$$
This is a regularized version of linear regression (called ridge regression) where the second term tries to push the weight values that are least influential to 0. $\lambda$ is a hyperparameter (selected by the user).
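
Along the same lines, here is a rough numerical sketch of the ridge-regression claim, assuming a Gaussian likelihood on $y$ and a zero-mean Gaussian prior on each weight (with made-up values for $\sigma$ and $\beta$): the negative log of likelihood times prior matches the squared error plus an L2 penalty with $\lambda = \sigma^2 / \beta^2$, up to a positive scale and an additive constant, so both objectives share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and weights (placeholders); sigma is the noise std dev, beta the prior std dev on w
n, d = 50, 3
sigma, beta = 0.5, 2.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(0.0, sigma, size=n)

def neg_log_posterior(w):
    """-log[ P(y | X, w) * P(w) ] with a Gaussian likelihood and a zero-mean Gaussian prior on w."""
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma ** 2) + n * np.log(np.sqrt(2 * np.pi) * sigma)
    nlp = np.sum(w ** 2) / (2 * beta ** 2) + d * np.log(np.sqrt(2 * np.pi) * beta)
    return nll + nlp

def ridge_objective(w, lam):
    """Squared error plus an L2 penalty on the weights (ridge regression)."""
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Up to a positive scale (2 * sigma**2) and an additive constant, the negative log posterior
# equals the ridge objective with lam = sigma**2 / beta**2.
lam = sigma ** 2 / beta ** 2
constant = 2 * sigma ** 2 * (n * np.log(np.sqrt(2 * np.pi) * sigma) + d * np.log(np.sqrt(2 * np.pi) * beta))
w_test = rng.normal(size=d)
print(np.isclose(2 * sigma ** 2 * neg_log_posterior(w_test),
                 ridge_objective(w_test, lam) + constant))  # True
```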