HW2 Conceptual: Beras Pt. 2

Conceptual section due Friday, February 23, 2024 at 6:00 PM EST
Programming section due Monday, February 26, 2024 at 6:00 PM EST

Answer the following questions, showing your work where necessary. Please explain your answers and work.

We encourage the use of LaTeX to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there! Be sure to download the images!

LaTeX Template

Do NOT include your name anywhere within this submission. Points will be deducted if you do so.

Theme


When it's not as deep as your neural network

Conceptual Questions

  1. The update rule for weights in a single-layer (linear unit + softmax), multi-class neural network with cross-entropy loss is defined as follows:

    Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the label). The loss is then:

    $$L = -\log(P_c)$$

    If $j = c$, then the derivative of the loss with respect to the weight is:

    $$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)\,x_i$$

    And if $j \neq c$, then:

    $$\frac{\partial L}{\partial w_{ij}} = P_j\,x_i$$

    We can then use these partial derivatives to descend our weights along the gradient:

    $$w_{ij} = w_{ij} - \frac{\partial L}{\partial w_{ij}}\,\alpha$$

    where $\alpha$ is the learning rate. (A vectorized sketch applying this update rule appears after this question list.)

    1. Derive the above rules from the original definition of cross-entropy loss.

      Hint: Consider both cases of $j$, and start by expanding out $L$.

  2. In classification problems, we assign a likelihood probability to each class and use a loss function that outputs a loss based on this probability. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (3-5 sentences)

    Hint: Think about how each loss function is shaped, the inputs that they take in, and the range of each function's inputs.

  3. The following questions will refer to gradient descent! Please feel free to refer to lecture notes when applicable.

    1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (2-4 sentences)
    2. Consider the formula for updating our weights:

      $$\Delta w = -\alpha\frac{\partial L}{\partial w}$$

      Why do we negate this quantity? What purpose does this serve? (2-4 sentences)
    3. During gradient descent, we calculate the partial derivative of the loss ($L$) with respect to each weight ($w_{ij}$). Why must we do this for every weight? Why can't we do this only for some weights? (1-3 sentences)
    4. In practice, most operations during gradient descent are vectorized. Why do we do this? Why might this make it beneficial to train the model on a GPU? (1-2 sentences)
    5. Consider the following plot of a loss function for some neural network:

      [Plot of the loss function not shown.]

      Where should gradient descent end up on the graph? How many weights does this model have? If our model starts training at $(2, 2, 0)$, will the loss function ever reach the absolute minimum? Why? Assume the loss function at this point is perfectly flat. (3-5 sentences)
  4. We have previously worked on single-layer linear regression using one linear function $x \mapsto W_1x + b_1$ mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$. For many real-world scenarios, we actually need multiple layers to model more complex relationships. (A numerical shape check for this question appears after this question list.)
    • Calculate the result after we stack another linear function $x \mapsto W_2x + b_2$, mapping from $\mathbb{R}^t$ to $\mathbb{R}^u$, right after the first one.
    • What are the shapes of $W_1, b_1$ and $W_2, b_2$? Explain your reasoning.
    • Does the composition of the two linear functions offer an improvement over a single linear function? Explain your answer. (2-4 sentences)
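
To make the update rule in question 1 concrete, and to illustrate the vectorization asked about in question 3, here is a minimal NumPy sketch of one gradient-descent step for a single linear + softmax layer with cross-entropy loss. The variable names and sizes are our own stand-ins (not Beras API), and the two cases $j = c$ and $j \neq c$ collapse into the single expression `probs - one_hot`:

```python
import numpy as np

def softmax(logits):
    # subtract the row max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# toy sizes: N examples, m input features, k classes (all hypothetical)
N, m, k = 4, 3, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(N, m))            # inputs
labels = rng.integers(0, k, size=N)    # correct class c for each example
W = rng.normal(size=(m, k))            # weights w_ij
b = np.zeros(k)
alpha = 0.1                            # learning rate

probs = softmax(x @ W + b)             # P_j for every class j, shape (N, k)
one_hot = np.eye(k)[labels]            # 1 at the correct class c, 0 elsewhere

# (P_j - 1) x_i when j == c and P_j x_i when j != c, for every weight at once,
# averaged over the batch
dW = x.T @ (probs - one_hot) / N       # shape (m, k), same as W
db = (probs - one_hot).mean(axis=0)

# descend along the gradient: w_ij <- w_ij - alpha * dL/dw_ij
W -= alpha * dW
b -= alpha * db
```

Because every weight's partial falls out of a single matrix product, the update is exactly the kind of dense linear algebra that benefits from running on a GPU.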
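
For question 4, once you have written out the stacked function by hand, a quick numerical check can confirm it. This sketch uses hypothetical dimensions $s$, $t$, $u$ and the convention that a map from $\mathbb{R}^s$ to $\mathbb{R}^t$ has a weight matrix of shape $(t, s)$; it only verifies the algebra and is not part of the assignment code:

```python
import numpy as np

# hypothetical dimensions for R^s -> R^t -> R^u
s, t, u = 4, 3, 2
rng = np.random.default_rng(1)

W1, b1 = rng.normal(size=(t, s)), rng.normal(size=t)   # first layer:  R^s -> R^t
W2, b2 = rng.normal(size=(u, t)), rng.normal(size=u)   # second layer: R^t -> R^u
x = rng.normal(size=s)

# applying the two linear functions one after the other...
stacked = W2 @ (W1 @ x + b1) + b2

# ...matches a single linear function with a combined weight and bias
W_combined = W2 @ W1          # shape (u, s)
b_combined = W2 @ b1 + b2     # shape (u,)
assert np.allclose(stacked, W_combined @ x + b_combined)
```

That the assertion passes for any choice of $x$ is the observation the last bullet of question 4 asks you to reason about.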

Bonus Questions

2470 Questions are written below!

The following questions are bonus questions and are not required. If you do not understand the compose or get gradient functions, TAs will redirect you to these questions first. Completing these questions will help you complete the assignment.

[Diagram not shown: layers f1 and f2 followed by the loss L.]

For the following questions, the answers will have to do with the following four functions: get_input_gradients(), get_weight_gradients(), compose_input_gradients(), and compose_weight_gradients(). Note that through the Diffable class, each of the layers shown above ($f_1$, $f_2$, $L$) either individually implements or inherits these functions.

  1. To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return? (Hint: Think back to HW1)

  2. At layer $f_1$, what shape are the partials that get_input_gradients() and get_weight_gradients() return? Imagine that $x$ is a vector of length $m$ and that $w_1$ is a weight matrix of size $(m, r)$. (Hint: Think back to HW1)

  3. At layer $f_1$, what shape are the partials that compose_input_gradients(J) and compose_weight_gradients(J) return? J will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?

  4. At layer $f_1$, how does compose_input_gradients(J) resolve the input dimension difference? More specifically, how does this function create a gradient with the same shape as the inputs to this layer, $(N, m)$? Be sure to reference the matrix multiplication at line 166 in core.py. Imagine that $w_2$ is a weight matrix of size $(r, 1)$. (A shape walkthrough appears after this question.)
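
For the bonus questions above, it can help to push concrete shapes through the chain rule before reading core.py. The sketch below is not Beras code and may differ from how Diffable stores its partials; it only walks the hypothetical shapes from the questions (a batch of $N$ inputs of length $m$, $w_1$ of shape $(m, r)$, $w_2$ of shape $(r, 1)$) through the two compositions:

```python
import numpy as np

N, m, r = 8, 5, 3
rng = np.random.default_rng(2)

x  = rng.normal(size=(N, m))    # batch of inputs to f1
w1 = rng.normal(size=(m, r))    # f1's weight matrix
w2 = rng.normal(size=(r, 1))    # f2's weight matrix

# gradient of the loss with respect to f2's output, one row per example
dL_dout = rng.normal(size=(N, 1))

# composing that upstream partial with f2's input gradient (w2) gives the
# upstream gradient J that f1 receives:
J = dL_dout @ w2.T              # (N, 1) @ (1, r) -> (N, r)

# composing J with f1's input gradient (w1) resolves back to the input shape:
dL_dx = J @ w1.T                # (N, r) @ (r, m) -> (N, m), same shape as x

# composing J onto f1's weights instead resolves to the weight shape:
dL_dw1 = x.T @ J                # (m, N) @ (N, r) -> (m, r), same shape as w1

print(J.shape, dL_dx.shape, dL_dw1.shape)   # (8, 3) (8, 5) (5, 3)
```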

2470-only Questions

The following questions are required only for students enrolled in CSCI2470.

  1. In your introductory calculus class, you were likely exposed to the following simple algorithm for finding the minimum of a function: take the derivative of the function, set it to zero, then solve the equation.

    Neural networks are differentiable, so why don't we use this algorithm (instead of gradient descent) to minimize the loss? (1-4 sentences)

  2. Prove that SGD with a batch size of 1 gives an unbiased estimate of the "true gradient" (the gradient calculated over the entire training set). Assume that the batch is selected uniformly at random from the full training set.

    Hints:

    1. Recall that an estimator $\hat{f}$ is an unbiased estimator of a function $f$ if $E[\hat{f}] = f$.
    2. Both expectation and differentiation are linear operators.
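
Before writing the proof for question 2, it can be reassuring to see the claim numerically. The toy setup below (a small linear model with squared-error loss, entirely our own and not part of the assignment) checks that the expectation of the size-1 gradient under uniform sampling, which is just the average of the per-example gradients, equals the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4                       # toy dataset: n examples, d features
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)             # current weights of a linear model

def grad_single(i, w):
    """Gradient of the squared error on example i alone."""
    err = X[i] @ w - y[i]
    return 2 * err * X[i]

# "true" gradient: mean squared error over the entire training set
full_grad = 2 * X.T @ (X @ w - y) / n

# expectation of the batch-size-1 SGD gradient under uniform sampling
# = (1/n) * sum_i grad_single(i)
expected_sgd_grad = np.mean([grad_single(i, w) for i in range(n)], axis=0)

assert np.allclose(full_grad, expected_sgd_grad)
```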

In the assignment, you will see that even a single-layer neural network can classify digits with pretty good accuracy. Still, there are several ways in which one could improve the accuracy of the model. The paper linked below suggests an easy way to increase performance without changing the architecture of our network: training several identical models! (A generic sketch of this kind of model averaging appears after the questions below.) Please read the paper (found here) and answer the following questions:

You may need to login with your Brown username to access the paper.

  1. What is the committee? Why would it help increase the performance of the model?

  2. What is the preprocessing they implement in this paper? How does preprocessing prevent the models' errors from being strongly correlated?
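
As background for the reading, here is a minimal, generic sketch of what combining a committee of models can look like at prediction time. The shapes, the random stand-in outputs, and the simple probability-averaging rule are our own illustration and are not necessarily the exact combination scheme used in the paper:

```python
import numpy as np

# per-model class probabilities for a batch of examples, stacked along axis 0:
# shape (num_models, N, num_classes). These random values stand in for the
# softmax outputs of several independently trained copies of the same network.
rng = np.random.default_rng(4)
logits = rng.normal(size=(5, 3, 10))
per_model_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# a simple committee: average the probabilities, then take the argmax class
committee_probs = per_model_probs.mean(axis=0)       # (N, num_classes)
committee_preds = committee_probs.argmax(axis=-1)    # (N,)

# individual models may disagree; the committee prediction smooths over
# their (ideally weakly correlated) errors
individual_preds = per_model_probs.argmax(axis=-1)   # (num_models, N)
print(individual_preds)
print(committee_preds)
```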