---
tags: hw3, conceptual
---
# HOMEWORK 3 Conceptual: BERAS
:::info
Conceptual section due **Thursday, September 25, 2025 at 10:00 PM EST**
Programming section due **Thursday, October 2, 2025 at 10:00 PM EST**
:::
Answer the following questions, showing and explaining your work where necessary.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A read-only homework template is linked below; copy the entire .tex file and image into your own Overleaf project and work from there!
> #### [**Latex Template**](https://www.overleaf.com/read/fxxpfgfjyryw#825423)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Conceptual Questions
1. The update rule for weights in our single layer, multi-class neural network (dense layer & softmax activation) with cross entropy loss is defined as follows:
Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the _label_) and $P_j$ be the predicted probability of class $j$. $P_c$ is the case where $j=c$. The loss is then:
$$L = -\log(P_c)$$
If $j = c$, then the derivative of loss with respect to the weight is:
$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)x_i$$
And if $j \neq c$, then:
$$\frac{\partial L}{\partial w_{ij}} = P_j x_i$$
We can then use these partial derivatives to descend our weights along the gradient:
$$w_{ij} = w_{ij} - \alpha\frac{\partial L}{\partial w_{ij}}$$
where $\alpha$ is the learning rate.
Our network's architecture can then be defined as:
1. **Feature Input Vector:** $\mathbf{x} = [x_1, x_2, \dots, x_n]$
2. **Linear Layer:** $z_j = \sum_{i=1}^n w_{ij}x_i + b_j$
3. **Softmax:** $P_j = \sigma(z_j)$
4. **Cross-Entropy Loss:** $L = -\sum_{j=1}^K y_j\log(P_j)$
1. **Derive the above rules from the original definition of cross-entropy loss. Make sure to include all of your work (such as the quotient rule).**
:::success
**Hint:** Start by expanding $L$ out to our full definition of cross-entropy. From there, you can see the simplification down to $L=-\log(P_c)$. Break down each part of the network from the input $x$, through the dense layer, and the softmax layer to produce our probability distribution, $P$. A small numerical sanity check of the gradient formulas appears right after this hint.
Key Tip:
$$ \sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$
:::
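If you would like to sanity-check your derivation after writing it out, here is a minimal NumPy sketch (not part of the BERAS starter code; the names `x`, `w`, `b`, and `c` are illustrative) that compares the analytic gradients stated above against central finite differences.

```python
import numpy as np

# Sanity check of the stated gradients for a dense layer + softmax + cross-entropy.
rng = np.random.default_rng(0)
n, K = 4, 3                              # n input features, K classes
x = rng.normal(size=n)
w = rng.normal(size=(n, K))
b = rng.normal(size=K)
c = 1                                    # index of the correct class

def loss(w):
    z = x @ w + b                        # linear layer: z_j = sum_i w_ij * x_i + b_j
    P = np.exp(z) / np.exp(z).sum()      # softmax
    return -np.log(P[c])                 # cross-entropy reduces to -log(P_c)

# Analytic gradient from the formulas above: dL/dw_ij = (P_j - 1{j == c}) * x_i
z = x @ w + b
P = np.exp(z) / np.exp(z).sum()
analytic = np.outer(x, P - np.eye(K)[c])

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(n):
    for j in range(K):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))     # expected: True
```

A check like this does not replace the derivation; it only confirms on one random example that the $(P_j - 1)x_i$ and $P_j x_i$ cases behave as claimed.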
2. In classification problems, we assign a likelihood probability to each class and use a loss function that outputs a loss based on this probability. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (_3-5 sentences_)
:::info
**Hint:** Think about how each loss function is shaped, the inputs it takes in, and the range of values those inputs can take. A small numerical comparison of the two losses follows this hint.
:::
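To build intuition for the hint, the following illustrative sketch (the class count and probability values are arbitrary) prints both losses on the same one-hot label as the predicted probability of the correct class shrinks.

```python
import numpy as np

# Compare MSE and cross-entropy on one example as P_c (prob. of the true class) shrinks.
y = np.array([1.0, 0.0, 0.0])                          # one-hot label, class 0 is correct
for p_c in [0.9, 0.5, 0.1, 0.01, 0.001]:
    P = np.array([p_c, (1 - p_c) / 2, (1 - p_c) / 2])  # a valid probability distribution
    mse = np.mean((P - y) ** 2)
    ce = -np.sum(y * np.log(P))                        # reduces to -log(P_c)
    print(f"P_c = {p_c:6}:  MSE = {mse:.4f}   CE = {ce:.4f}")
```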
3. The following questions refer to gradient descent! Please feel free to refer to the lecture notes when applicable; a small numerical illustration of the update rule follows these sub-questions.
1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (_2-4 sentences_)
2. Consider the formula for updating our weights:
$$\Delta w = -\alpha \frac{\partial L}{\partial w}$$
Why do we negate this quantity? What purpose does this serve? (_2-4 sentences_)
3. During gradient descent, we calculate the partial derivative of loss ($L$) with respect to each weight $(w_{ij})$. Why must we do this for every weight? Why can't we do this only for some weights? (_1-3 sentences_)
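As mentioned above, here is a minimal numerical illustration of the update rule on a toy one-dimensional loss $L(w) = (w - 3)^2$; the loss, the starting point, and the learning rate are arbitrary choices made for this sketch.

```python
# Toy gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
w, alpha = 0.0, 0.1
for step in range(25):
    grad = 2 * (w - 3)        # dL/dw
    w = w - alpha * grad      # step in the direction of the negative gradient
print(w)                      # approaches 3
```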
4. We have previously worked on single-layer linear regression using one linear function $\mathbf{x} \mapsto\mathbf{W_1}\mathbf{x}+\mathbf{b_1}$ mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$. For many real-world scenarios, we actually need multiple layers to model more complex relationships (a shape-checking sketch follows these bullets).
- Calculate the result after we stack another linear function $\mathbf{x} \mapsto\mathbf{W_2}\mathbf{x}+\mathbf{b_2}$ mapping from $\mathbb{R}^t$ to $\mathbb{R}^u$ right after the first one.
- What is the shape of $\mathbf{W_1}, \mathbf{b_1}$ and $\mathbf{W_2}, \mathbf{b_2}$? Explain your reasoning.
- Does the composition of the two linear functions offer an improvement over a single linear function? Explain your answer (*2-4 sentences*)
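If you would like to experiment with the dimensions before answering, here is a shape-checking sketch. It assumes the column-vector convention suggested by $\mathbf{W}\mathbf{x}+\mathbf{b}$ and arbitrary values of $s$, $t$, $u$; the BERAS code itself may batch inputs differently, so treat the orientation here as an assumption rather than the required answer.

```python
import numpy as np

# Stack two linear maps R^s -> R^t -> R^u and check the intermediate shapes.
s, t, u = 5, 4, 3
x = np.ones(s)                              # input in R^s
W1, b1 = np.ones((t, s)), np.ones(t)        # first layer:  R^s -> R^t
W2, b2 = np.ones((u, t)), np.ones(u)        # second layer: R^t -> R^u
h = W1 @ x + b1                             # shape (t,)
y = W2 @ h + b2                             # shape (u,)
print(h.shape, y.shape)                     # (4,) (3,)
```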
## Bonus Questions
The following questions are *bonus* questions and are **not required**. If you do not understand the `compose` or `get` gradient functions, TAs will redirect you to these questions first. Completing these questions **will help** you complete the assignment.

For the following questions, the answers will have to do with the following four functions: `get_input_gradients()`, `get_weight_gradients()`, `compose_input_gradients()`, and `compose_weight_gradients()`. Note that through the Diffable class, each of the layers shown above ($f_1$, $f_2$, $L$) either individually implement or inherit these functions.
1. To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return?
2. At layer $f_1$, what shape are the partials that `get_input_gradients()` and `get_weight_gradients()` return? Imagine that $x$ is a vector of length $m$ and that $w_1$ is a weight matrix of size $(m, r)$.
3. At layer $f_1$, what shape are the partials that `compose_input_gradients(J)` and `compose_weight_gradients(J)` return? `J` will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?
4. At layer $f_1$, how does `compose_input_gradients(J)` resolve the input dimension difference? More specifically, how does this function create a gradient the same shape as the inputs to this layer $(N,x)$? Be sure to reference the matrix multiplication at line 166 in `core.py`. Imagine that $w_2$ is a weight matrix of size $(r,1)$.
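To make the shapes in questions 2–4 concrete, here is a generic NumPy sketch of how an upstream gradient can be composed with a layer's local gradients. It is not the BERAS implementation and does not reproduce the code in `core.py`; it only mirrors the shapes described above (a batch of $N$ inputs of length $m$, $w_1$ of shape $(m, r)$ at $f_1$, and $w_2$ of shape $(r, 1)$ at $f_2$).

```python
import numpy as np

# Generic shape sketch of gradient composition (not the BERAS implementation).
N, m, r = 8, 5, 3
x  = np.ones((N, m))                 # batched inputs to f1
w1 = np.ones((m, r))                 # weights of f1
w2 = np.ones((r, 1))                 # weights of f2

dL_dout2 = np.ones((N, 1))           # gradient arriving at f2's output
J = dL_dout2 @ w2.T                  # upstream gradient arriving at f1, shape (N, r)

dL_dx  = J @ w1.T                    # composed input gradient, shape (N, m) -- matches x
dL_dw1 = x.T @ J                     # composed weight gradient, shape (m, r) -- matches w1
print(J.shape, dL_dx.shape, dL_dw1.shape)   # (8, 3) (8, 5) (5, 3)
```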