Conceptual section due Friday, February 23, 2024 at 6:00 PM EST
Programming section due Monday, February 26, 2024 at 6:00 PM EST
Answer the following questions, showing your work where necessary. Please explain your answers and work.
We encourage the use of LaTeX to typeset your answers. A non-editable homework template is linked below, so copy the .tex file into your own Overleaf project and go from there! Be sure to download the images as well!
LaTeX Template
Do NOT include your name anywhere within this submission. Points will be deducted if you do so.
When it's not as deep as your neural network
The update rule for weights in a single-layer (linear unit + softmax), multi-class neural network with cross-entropy loss is defined as follows:

Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the label), and let $P_j$ be the softmax probability the network assigns to class $j$. The loss is then:

$$L = -\log(P_c)$$

If $j = c$, then the derivative of the loss with respect to the weight is:

$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)\,x_i$$

And if $j \neq c$, then:

$$\frac{\partial L}{\partial w_{ij}} = P_j\,x_i$$

We can then use these partial derivatives to descend our weights along the gradient:

$$w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}$$

where $\alpha$ is the learning rate.
Derive the above rules from the original definition of cross-entropy loss.
Hint: Consider both cases of $j$ (i.e., $j = c$ and $j \neq c$), and start by expanding out $P_c$ as a softmax.
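As a concrete (optional) illustration of the update rules above, here is a minimal NumPy sketch of one step for a single example; the names softmax, sgd_step, W, x, label, and alpha are chosen for this sketch and do not come from the assignment stencil.

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def sgd_step(W, x, label, alpha):
    """One gradient-descent step for a linear + softmax classifier with
    cross-entropy loss, following the two-case derivative stated above.

    W:     (num_features, num_classes) weight matrix
    x:     (num_features,) input vector
    label: integer index of the correct class c
    alpha: learning rate
    """
    probs = softmax(x @ W)        # P_j for every class j
    # dL/dz_j = P_j - 1 if j == c, else P_j  (subtract 1 at the label)
    dz = probs.copy()
    dz[label] -= 1.0
    # dL/dw_ij = x_i * dL/dz_j, written as an outer product
    grad_W = np.outer(x, dz)
    return W - alpha * grad_W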
In classification problems, we assign a likelihood probability to each class and use a loss function that outputs a loss based on this probability. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (3-5 sentences)
Hint: Think about how each loss function is shaped, the inputs that they take in, and the range of each function's inputs.
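If it helps, the short sketch below (purely illustrative, not part of the assignment) evaluates both losses as the predicted probability of the correct class varies from nearly 0 to nearly 1; comparing the values and how quickly they grow may guide your answer.

import numpy as np

# Predicted probability assigned to the correct class, swept from
# nearly 0 (very wrong) to nearly 1 (confident and correct).
p_correct = np.linspace(0.01, 0.99, 5)

# Treat the target for the correct class as 1.0.
mse = (1.0 - p_correct) ** 2          # squared error on that probability
cross_entropy = -np.log(p_correct)    # cross-entropy loss on that probability

for p, m, ce in zip(p_correct, mse, cross_entropy):
    print(f"p={p:.2f}  MSE={m:.3f}  CE={ce:.3f}")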
The following questions will refer to gradient descent! Please feel free to refer to lecture notes when applicable.
The following questions are bonus questions and are not required. If you do not understand the compose or get gradient functions, TAs will redirect you to these questions first. Completing these questions will help you complete the assignment.
For the following questions, the answers will have to do with the following four functions: get_input_gradients(), get_weight_gradients(), compose_input_gradients(), and compose_weight_gradients(). Note that through the Diffable class, each of the layers shown above either individually implements or inherits these functions.
To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return? (Hint: Think back to HW1)
At a given layer, what shape are the partials that get_input_gradients() and get_weight_gradients() return? Imagine that $x$ is a vector of length $i$ and that $W$ is a weight matrix of size $i \times j$. (Hint: Think back to HW1)
At a given layer, what shape are the partials that compose_input_gradients(J) and compose_weight_gradients(J) return? Here, J will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?
At a given layer, how does compose_input_gradients(J) resolve the input dimension difference? More specifically, how does this function create a gradient of the same shape as the inputs $x$ to this layer? Be sure to reference the matrix multiplication at line 166 in core.py. Imagine that $W$ is a weight matrix of size $i \times j$.
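To build intuition for the shape bookkeeping these questions ask about, here is a generic chain-rule illustration in NumPy with made-up shapes; it is not the stencil's actual compose_input_gradients / compose_weight_gradients implementation.

import numpy as np

# For a dense layer y = x @ W with x of shape (1, i) and W of shape (i, j),
# the upstream gradient J (dL/dy) has shape (1, j). Composing J with the
# layer's local gradients must yield something shaped like x and like W.

i, j = 4, 3
x = np.random.randn(1, i)
W = np.random.randn(i, j)
J = np.random.randn(1, j)          # upstream gradient dL/dy

dL_dx = J @ W.T                    # (1, j) @ (j, i) -> (1, i), same shape as x
dL_dW = x.T @ J                    # (i, 1) @ (1, j) -> (i, j), same shape as W

print(dL_dx.shape, dL_dW.shape)    # (1, 4) (4, 3)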
The following questions are required only for students enrolled in CSCI2470.
In your introductory calculus class, you were likely exposed to the following simple algorithm for finding the minimum of a function: take the derivative of the function, set it to zero, then solve the equation.
Neural networks are differentiable, so why don't we use this algorithm (instead of gradient descent) to minimize the loss? (1-4 sentences)
Prove that SGD with a batch size of 1 gives an unbiased estimate of the "true gradient" (the gradient calculated over the entire training set). Assume that the batch is selected uniformly at random from the full training set.
Hints:
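One possible way to set up the proof (a sketch under the uniform-sampling assumption stated above; the symbols $N$, $\ell_i$, $\theta$, and $k$ are introduced here only for illustration and are not from the handout) is to write the true gradient and the expectation of the single-example gradient side by side:

$$\nabla L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla \ell_i(\theta), \qquad \mathbb{E}_k\big[\nabla \ell_k(\theta)\big] = \sum_{i=1}^{N} P(k = i)\,\nabla \ell_i(\theta) \quad \text{with } P(k = i) = \tfrac{1}{N},$$

and then show that the two expressions coincide.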
In the assignment, you will see that even a single-layer neural network can classify digits with pretty good accuracy. Still, there are several ways in which one could improve the accuracy of the model. The paper linked below suggests an easy way to increase performance without changing the architecture of our network: training several identical models! Please read the paper (found here) and answer the questions that follow. A minimal sketch of the committee idea appears after the questions.
You may need to login with your Brown username to access the paper.
What is the committee? Why would it help increase the performance of the model?
What is the preprocessing they implement in this paper? How does preprocessing prevent the models' errors from being strongly correlated?
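As a concrete picture of the committee idea discussed in the paper, here is a minimal sketch that averages the class probabilities of several trained models; the model objects and their predict method are hypothetical and not part of the assignment stencil.

import numpy as np

def committee_predict(models, x):
    """Average the per-class probabilities of several trained models
    and return the class with the highest averaged probability.

    models: list of objects with a .predict(x) method that returns an
            array of class probabilities (hypothetical interface)
    x:      a single input example
    """
    probs = np.stack([model.predict(x) for model in models])  # (num_models, num_classes)
    avg_probs = probs.mean(axis=0)
    return int(np.argmax(avg_probs))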