HW1 Conceptual: Beras Pt. 1

Conceptual questions due Friday, February 9, 2024 at 6:00 PM EST
Programming assignment due Monday, February 12, 2024 at 6:00 PM EST

Answer the following questions, showing your work where necessary. Please explain your answers and work.

We encourage the use of LaTeX to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there!
Note: make sure to select the template file hw1.tex from the sidebar on the left of Overleaf.

LaTeX Template

Do NOT include your name anywhere within this submission. Points will be deducted if you do so.

Theme

[Theme image]

Haters will say that this handsome gentleman is fake.

Conceptual Questions

  1. Consider Mean Squared Error (MSE) loss. Why do we square the error? Please provide two reasons. (2-3 sentences)
  2. We covered a lemonade stand example in the Intro to Machine Learning lecture, which was a regression task.
    • How would you convert it into a classification task?
    • What are the benefits of keeping it a regression task? What are the benefits of converting it to a classification task?
    • Give another example of a classification and a regression task each.
  3. Consider the following network:

    [Figure: diagram of the network used in this question]

    Let the input, weight, bias, and expected hypothesis matrix $o_{\text{expected}}$ be the following:

    $$x = \begin{bmatrix} 1 & 2 \end{bmatrix} \quad w = \begin{bmatrix} 0.4 & 0.7 & 0.1 \\ 0.2 & 0.4 & 0.5 \end{bmatrix} \quad b = \begin{bmatrix} 0.2 & 0 & 0.4 \end{bmatrix} \quad o_{\text{expected}} = \begin{bmatrix} 1 & 2 & 0 \end{bmatrix}$$

    Answer the following questions, write out equations as necessary, and show your work for every step.
    • Complete a manual forward pass (i.e. find the output matrix $o$, AKA $\hat{y}$). Assume the network uses a linear function for which the above matrices are compatible.
    • Using Mean Squared Error as your loss metric, compute the loss relative to $o_{\text{expected}}$.
    • Perform a single manual backward pass, assuming that the expected output $o_{\text{expected}}$, AKA the ground truth value $y$, is given. Calculate the partial derivative of the loss with respect to each parameter.
    • Give the updated weight and bias matrices using the result of your gradient calculations. Use the stochastic gradient descent algorithm with learning rate = 0.1.

HINTS:

  • Check Lab 1 for the formula to get the partial derivative of MSE loss with respect to the weights and bias.
  • To invoke the chain rule, consider computing $\frac{\partial L_{\text{MSE}}}{\partial o}$ as an upstream gradient and then composing it with the local gradients relating $o$ with the parameters.
  • During the backwards pass, pay close attention to what dimensions the gradients have to be. That might help you figure out what operations/matrix configurations you'll need to use.
  • If you're doing it right, you should expect relatively simple and repetitive numbers.
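
Not required for the assignment, but if you want to sanity-check your manual arithmetic, a minimal NumPy sketch along the following lines mirrors the linear forward pass, MSE loss, and SGD update described above. The values and variable names below are illustrative placeholders, not the problem's matrices; swap in the matrices from the problem statement, and note the sketch assumes MSE averages over the output units.

```python
import numpy as np

# Toy values only; substitute the matrices from the problem statement
# to check your manual work.
x = np.array([[1.0, -1.0]])                   # input, shape (1, 2)
w = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])               # weights, shape (2, 3)
b = np.array([[0.0, 0.1, 0.2]])               # bias, shape (1, 3)
o_expected = np.array([[1.0, 0.0, 1.0]])      # ground truth, shape (1, 3)

# Forward pass through the linear layer: o = x @ w + b
o = x @ w + b

# MSE loss, averaged over the output units (if your convention sums
# instead of averaging, drop the division by the output size below).
loss = np.mean((o - o_expected) ** 2)

# Backward pass: upstream gradient dL/do, then gradients for w and b.
# These follow the standard linear-layer + MSE results referenced in Lab 1;
# you still need to derive them by hand in your write-up.
d_o = 2.0 * (o - o_expected) / o.shape[1]     # shape (1, 3)
d_w = x.T @ d_o                               # shape (2, 3), matches w
d_b = d_o                                     # shape (1, 3), matches b

# SGD update with learning rate 0.1
lr = 0.1
w_new = w - lr * d_w
b_new = b - lr * d_b

print(o, loss, d_w, d_b, w_new, b_new, sep="\n")
```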
  4. Empirical risk $\hat{R}$ for another machine learning model called logistic regression (used widely for classification) is:
    $$\hat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln\left(1 + \exp(-y_i w^\top x_i)\right)$$

    This loss is also known as log-loss. Take the gradient of this empirical risk w.r.t. $w \in \mathbb{R}^d$. Your answer may need to include the parameters $w$, the number of examples $n$, the labels $y_i \in \mathbb{R}$, and the training examples $x_i \in \mathbb{R}^d$. Show your work for every step!
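
If you derive a closed-form gradient for this risk, one way to check it on your own is to compare it against a finite-difference estimate on random data. The sketch below is only a sanity check and not part of the submission; it assumes labels $y_i \in \{-1, +1\}$ and the loss exactly as written above, and all names and data in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))                # illustrative training examples x_i in R^d
y = rng.choice([-1.0, 1.0], size=n)        # illustrative labels y_i in {-1, +1}
w = rng.normal(size=d)                     # illustrative parameters w in R^d

def risk(w):
    """Empirical risk (1/n) * sum_i ln(1 + exp(-y_i * w^T x_i))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_numeric(w, eps=1e-6):
    """Central finite-difference estimate of the gradient of risk at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (risk(w + e) - risk(w - e)) / (2 * eps)
    return g

# Compare the analytic gradient from your derivation against this estimate;
# the two should agree to several decimal places.
print(grad_numeric(w))
```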

2470-only Questions

The following questions are required only for students enrolled in CSCI 2470. For each question, include a 2-4 sentence argument for why this mathematical detail may be considered interesting and useful.

  1. You have two interdependent random variables $x, y \sim X, Y$ and you'd like to learn a mapping $f : X \to Y$. The best function you can get is $f(x) = \mathbb{E}[Y \mid X = x]$. This does not account for noise associated with random events. Let $\xi$ be the noise such that $y = f(x) + \xi$. Prove that $\mathbb{E}[\xi] = 0$.

For Question 1, the following information might be relevant:

  • Linearity of Expectation states that $\mathbb{E}[a + b] = \mathbb{E}[a] + \mathbb{E}[b]$.
  • According to the law of total expectation, $\mathbb{E}[\mathbb{E}[B \mid A]] = \mathbb{E}[B]$.
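
Before writing the proof for Question 1, it can help to see the claim numerically: for a toy joint distribution of your choosing, the residual $\xi = y - \mathbb{E}[Y \mid X = x]$ averages out to roughly zero. The distribution below (X uniform on three values, Y a noisy function of X) is made up purely for illustration and is not part of the question.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up joint distribution: X uniform on {0, 1, 2}, Y = X^2 + Gaussian noise.
# Here E[Y | X = x] = x^2, so f(x) = x^2 and xi = y - f(x).
n = 1_000_000
x = rng.integers(0, 3, size=n).astype(float)
y = x**2 + rng.normal(scale=0.5, size=n)

f_x = x**2          # the conditional-expectation regressor f(x) = E[Y | X = x]
xi = y - f_x        # the noise term xi = y - f(x)

print(np.mean(xi))  # close to 0, consistent with E[xi] = 0
```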
  2. Let $p_w$ conditionally parameterize a normal distribution such that $p_w(y \mid x) \sim \mathcal{N}(x^T w, 1)$. Show that the maximum likelihood objective can be written by using $\arg\min_w$ and the squared error between $x^T w$ and $y$. Intuitively, this is a probabilistic interpretation of linear regression. We are interested in the joint probability of how the behaviour of the output $y$ is conditional on the values of the input vector $x$, as well as any parameters of the model $w$. The model is represented as a Gaussian distribution (prior) with the linear function output as the mean. We assume that the variance in the model is fixed ($= 1$), i.e. that it doesn't depend on $x$.

For Questions 2 and 3, the following information might be relevant:

  • Weights $w \in \mathbb{R}^d$ parameterize a model $p_w(y \mid x)$ to predict $p(y)$ given $x$.
  • The maximum likelihood objective for a dataset $(x_i, y_i)^n \in (\mathbb{R}^d, \mathbb{R})^n$ involves choosing $w$ such that the following two objectives (which are equivalent) are satisfied:
    $$\arg\max_w \prod_{i=1}^{n} p_w(y_i \mid x_i) = \arg\min_w \sum_{i=1}^{n} \ln\left(p_w(y_i \mid x_i)^{-1}\right) \tag{1}$$
  • (To convert the multiplicative function to an additive one we take the log, so now we are trying to learn weights such that the log-likelihood of $y \mid x$ is maximized.)
  • An i.i.d. vector $x$ - where each value is pulled from the same distribution and does not depend on any of the other values - abides by $P(x) = \prod_i P(x_i)$.
  • Recall that the normal (Gaussian) probability density function (PDF) is:
    $$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$$
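
For Question 2, a quick numerical illustration of the claim (not a substitute for the derivation you are asked to show): with $p_w(y \mid x) \sim \mathcal{N}(x^T w, 1)$ and fixed variance, the negative log-likelihood and the squared error differ only by an additive constant that does not depend on $w$, so they share the same minimizer. The data and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
X = rng.normal(size=(n, d))     # illustrative inputs x_i
y = rng.normal(size=n)          # illustrative outputs y_i

def neg_log_likelihood(w):
    """-sum_i ln N(y_i; x_i^T w, 1), using the Gaussian PDF from the hints."""
    mu = X @ w
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - mu) ** 2)

def squared_error(w):
    """(1/2) * sum_i (x_i^T w - y_i)^2."""
    return 0.5 * np.sum((X @ w - y) ** 2)

# The difference is the same constant at every w, so both objectives
# have the same argmin over w.
w1, w2 = rng.normal(size=d), rng.normal(size=d)
print(neg_log_likelihood(w1) - squared_error(w1))
print(neg_log_likelihood(w2) - squared_error(w2))  # same value as above
```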
  3. Assume instead that each element $w_j \sim \mathcal{N}(0, \frac{1}{\lambda})$ and $p_w(y \mid (x, w)) \sim \mathcal{N}(w^T x, 1)$. Show that the following must be true ($P(w)$ is the probability density of the vector $w$, see hints above):
    $$\arg\max_w \left\{ P(w) \prod_{i=1}^{n} p_w(y_i \mid (x_i, w)) \right\} = \arg\min_w \left\{ \frac{1}{2} \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} w_j^2 \right\}$$
    This is a regularized version of linear regression (called ridge regression) where the second term tries to push the weight values that are least influential to 0. $\lambda$ is a hyperparameter (selected by the user).
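
Similarly for Question 3, you can convince yourself numerically (again, not a substitute for the derivation) that the negative log of $P(w) \prod_i p_w(y_i \mid (x_i, w))$ and the ridge objective differ only by terms that do not involve $w$, so their minimizers coincide. The data and the choice of $\lambda$ below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 50, 4, 2.0          # lam is an illustrative choice of lambda
X = rng.normal(size=(n, d))     # illustrative inputs x_i
y = rng.normal(size=n)          # illustrative outputs y_i

def neg_log_posterior(w):
    """-ln[ P(w) * prod_i N(y_i; w^T x_i, 1) ] with each w_j ~ N(0, 1/lam)."""
    log_prior = np.sum(-0.5 * np.log(2 * np.pi / lam) - 0.5 * lam * w**2)
    log_lik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (X @ w - y) ** 2)
    return -(log_prior + log_lik)

def ridge_objective(w):
    """(1/2) * sum_i (w^T x_i - y_i)^2 + (lam/2) * sum_j w_j^2."""
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w**2)

# The two differ by a constant independent of w, so the argmins coincide.
w1, w2 = rng.normal(size=d), rng.normal(size=d)
print(neg_log_posterior(w1) - ridge_objective(w1))
print(neg_log_posterior(w2) - ridge_objective(w2))  # same value as above
```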