Conceptual section due Friday, February 23, 2024 at 6:00 PM EST
Programming section due Monday, February 26, 2024 at 6:00 PM EST
Answer the following questions, showing your work where necessary. Please explain your answers and work.
We encourage the use of LaTeX to typeset your answers. A non-editable homework template is linked below, so copy the .tex file into your own Overleaf project and go from there! Be sure to download the images as well!
LaTeX Template
Do NOT include your name anywhere within this submission. Points will be deducted if you do so.
When it's not as deep as your neural network
The update rule for weights in a single-layer (linear unit + softmax), multi-class neural network with cross-entropy loss is defined as follows:

Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the label), and let $P_j$ be the softmax probability the network assigns to class $j$. The loss is then:

$$L = -\log(P_c)$$

If $j = c$, then the derivative of the loss with respect to the weight is:

$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)\,x_i$$

And if $j \neq c$, then:

$$\frac{\partial L}{\partial w_{ij}} = P_j\,x_i$$

We can then use these partial derivatives to descend our weights along the gradient:

$$w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}$$

where $\alpha$ is the learning rate.
Derive the above rules from the original definition of cross-entropy loss.
Hint: Consider both cases of $j$ (i.e., $j = c$ and $j \neq c$), and start by expanding out $P_c$ as a softmax.
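As a concrete (optional) illustration of the update rules above, here is a minimal NumPy sketch of one step for a single example; the names softmax, sgd_step, W, x, label, and alpha are chosen for this sketch and do not come from the assignment stencil.

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def sgd_step(W, x, label, alpha):
    """One gradient-descent step for a linear + softmax classifier with
    cross-entropy loss, following the two-case derivative stated above.

    W:     (num_features, num_classes) weight matrix
    x:     (num_features,) input vector
    label: integer index of the correct class c
    alpha: learning rate
    """
    probs = softmax(x @ W)        # P_j for every class j
    # dL/dz_j = P_j - 1 if j == c, else P_j  (subtract 1 at the label)
    dz = probs.copy()
    dz[label] -= 1.0
    # dL/dw_ij = x_i * dL/dz_j, written as an outer product
    grad_W = np.outer(x, dz)
    return W - alpha * grad_W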
In classification problems, we assign a likelihood probability to each class and use a loss function that outputs a loss based on this probability. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (3-5 sentences)
Hint: Think about how each loss function is shaped, the inputs that they take in, and the range of each function's inputs.
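If it helps, the short sketch below (purely illustrative, not part of the assignment) evaluates both losses as the predicted probability of the correct class varies from nearly 0 to nearly 1; comparing the values and how quickly they grow may guide your answer.

import numpy as np

# Predicted probability assigned to the correct class, swept from
# nearly 0 (very wrong) to nearly 1 (confident and correct).
p_correct = np.linspace(0.01, 0.99, 5)

# Treat the target for the correct class as 1.0.
mse = (1.0 - p_correct) ** 2          # squared error on that probability
cross_entropy = -np.log(p_correct)    # cross-entropy loss on that probability

for p, m, ce in zip(p_correct, mse, cross_entropy):
    print(f"p={p:.2f}  MSE={m:.3f}  CE={ce:.3f}")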
The following questions will refer to gradient descent! Please feel free to refer to lecture notes when applicable.
The following questions are bonus questions and are not required. If you do not understand the compose or get gradient functions, TAs will redirect you to these questions first. Completing these questions will help you complete the assignment.
For the following questions, the answers will have to do with the following four functions: get_input_gradients(), get_weight_gradients(), compose_input_gradients(), and compose_weight_gradients(). Note that through the Diffable class, each of the layers shown above either individually implements or inherits these functions.
To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return? (Hint: Think back to HW1)
At a given layer, what shape are the partials that get_input_gradients() and get_weight_gradients() return? Imagine that $x$ is a vector of length $i$ and that $W$ is a weight matrix of size $i \times j$. (Hint: Think back to HW1)
At a given layer, what shape are the partials that compose_input_gradients(J) and compose_weight_gradients(J) return? Here, J will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?
At a given layer, how does compose_input_gradients(J) resolve the input dimension difference? More specifically, how does this function create a gradient of the same shape as the inputs $x$ to this layer? Be sure to reference the matrix multiplication at line 166 in core.py. Imagine that $W$ is a weight matrix of size $i \times j$.
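To build intuition for the shape bookkeeping these questions ask about, here is a generic chain-rule illustration in NumPy with made-up shapes; it is not the stencil's actual compose_input_gradients / compose_weight_gradients implementation.

import numpy as np

# For a dense layer y = x @ W with x of shape (1, i) and W of shape (i, j),
# the upstream gradient J (dL/dy) has shape (1, j). Composing J with the
# layer's local gradients must yield something shaped like x and like W.

i, j = 4, 3
x = np.random.randn(1, i)
W = np.random.randn(i, j)
J = np.random.randn(1, j)          # upstream gradient dL/dy

dL_dx = J @ W.T                    # (1, j) @ (j, i) -> (1, i), same shape as x
dL_dW = x.T @ J                    # (i, 1) @ (1, j) -> (i, j), same shape as W

print(dL_dx.shape, dL_dW.shape)    # (1, 4) (4, 3)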
The following questions are required only for students enrolled in CSCI2470.
In your introductory calculus class, you were likely exposed to the following simple algorithm for finding the minimum of a function: take the derivative of the function, set it to zero, then solve the equation.
Neural networks are differentiable, so why don't we use this algorithm (instead of gradient descent) to minimize the loss? (1-4 sentences)
Prove that SGD with a batch size of 1 gives an unbiased estimate of the "true gradient" (the gradient calculated over the entire training set). Assume that the batch is selected uniformly at random from the full training set.
Hints:
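One possible way to set up the proof (a sketch under the uniform-sampling assumption stated above; the symbols $N$, $\ell_i$, $\theta$, and $k$ are introduced here only for illustration and are not from the handout) is to write the true gradient and the expectation of the single-example gradient side by side:

$$\nabla L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla \ell_i(\theta), \qquad \mathbb{E}_k\big[\nabla \ell_k(\theta)\big] = \sum_{i=1}^{N} P(k = i)\,\nabla \ell_i(\theta) \quad \text{with } P(k = i) = \tfrac{1}{N},$$

and then show that the two expressions coincide.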
In the assignment, you will see that even a single-layer neural network can classify digits with pretty good accuracy. Still, there are several ways in which one could improve the accuracy of the model. The paper linked below suggests an easy way to increase performance without changing the architecture of our network: training several identical models! Please read the paper (found here) and answer the questions that follow. A minimal sketch of the committee idea appears after the questions.
You may need to login with your Brown username to access the paper.
What is the committee? Why would it help increase the performance of the model?
What is the preprocessing they implement in this paper? How does preprocessing prevent the models' errors from being strongly correlated?
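As a concrete picture of the committee idea discussed in the paper, here is a minimal sketch that averages the class probabilities of several trained models; the model objects and their predict method are hypothetical and not part of the assignment stencil.

import numpy as np

def committee_predict(models, x):
    """Average the per-class probabilities of several trained models
    and return the class with the highest averaged probability.

    models: list of objects with a .predict(x) method that returns an
            array of class probabilities (hypothetical interface)
    x:      a single input example
    """
    probs = np.stack([model.predict(x) for model in models])  # (num_models, num_classes)
    avg_probs = probs.mean(axis=0)
    return int(np.argmax(avg_probs))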