---
tags: hw3, conceptual
---
# Homework 3 Conceptual: BERAS
:::info
Conceptual section due **Thursday, February 19, 2026 at 11:59 PM EST**
Programming section due **Thursday, February 26, 2026 at 11:59 PM EST**
:::
Answer the following questions, showing your work where necessary, and explain your reasoning.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A non-editable homework template is linked below; copy the entire .tex file and image into your own Overleaf project and work from there!
> #### [**Latex Template**](https://www.overleaf.com/read/pbyqsczsswfm#dc04cf)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Problem 1
The update rule for weights in our single layer, multi-class neural network (dense layer & softmax activation) with cross entropy loss is defined as follows:
Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the _label_) and let $P_j$ be the predicted probability of class $j$, so $P_c$ is the predicted probability of the correct class (the case where $j = c$). The loss is then:
$$L = -\log(P_c)$$
If $j = c$, then the derivative of loss with respect to the weight is:
$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)x_i$$
And if $j \neq c$, then:
$$\frac{\partial L}{\partial w_{ij}} = P_j x_i$$
We can then use these partial derivatives to descend our weights along the gradient:
\begin{align}
w_{ij} = w_{ij} - \alpha\frac{\partial L}{\partial w_{ij}}
\end{align}
where $\alpha$ is the learning rate.
Our network's architecture can then be defined as:
1. **Feature Input Vector:** $\mathbf{x} = [x_1, x_2, \dots, x_n]$
2. **Linear Layer:** $z_j = \sum_{i=1}^n w_{ij}x_i + b_j$
3. **Softmax:** $P_j = \sigma(z_j)$
4. **Cross-Entropy Loss:** $L = -\sum_{j=1}^K y_j\log(P_j)$
1. **Derive the above rules from the original definition of cross-entropy loss. Make sure to include all of your work (such as the quotient rule).**
:::success
**Hint:** Start by expanding $L$ out to our full definition of cross-entropy. From there, you can see how it simplifies down to $L = -\log(P_c)$. Break down each part of the network, from the input $x$ through the dense layer and the softmax layer, to produce our probability distribution $P$.
Key Tip:
$$ \sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$
:::
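You are not required to write any code for this problem, but if you want to sanity-check the gradient rules above before (or after) deriving them, a finite-difference comparison is one option. The sketch below is only an illustrative assumption (plain NumPy, made-up sizes `n` and `K`), not part of the required derivation.

```python
import numpy as np

# Hypothetical sizes: n input features, K classes (not specified by the assignment).
n, K = 4, 3
rng = np.random.default_rng(0)
x = rng.normal(size=n)           # input feature vector
W = rng.normal(size=(n, K))      # weights w_ij
b = rng.normal(size=K)           # biases b_j
c = 1                            # index of the correct class

def loss(W):
    z = x @ W + b                        # linear layer: z_j = sum_i w_ij * x_i + b_j
    P = np.exp(z) / np.exp(z).sum()      # softmax
    return -np.log(P[c])                 # cross-entropy with a one-hot label

# Analytic gradient from the rules above: (P_j - 1) x_i if j == c, else P_j x_i
z = x @ W + b
P = np.exp(z) / np.exp(z).sum()
analytic = np.outer(x, P - np.eye(K)[c])

# Central finite differences for comparison
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(K):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-9
```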
## Problem 2
In classification problems, we assign a probability to each class and use a loss function that compares these predicted probabilities to the true label.
**(a) Can you use MSE loss for classification tasks? Why or why not?**
Consider both theoretical and practical perspectives. *(3-5 sentences)*
:::info
**Hint:** Think about the range of outputs from a softmax layer versus what MSE expects. Also consider how MSE penalizes errors compared to cross-entropy.
:::
**(b) Why is categorical cross-entropy loss most commonly used for multi-class classification?**
Explain its relationship to probability and information theory. *(3-5 sentences)*
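If it helps to build intuition before writing your answers, the toy comparison below (an assumed example, not required work) shows how MSE and cross-entropy score the same confidently wrong prediction.

```python
import numpy as np

# A confidently wrong softmax output for a 3-class problem (made-up numbers).
P = np.array([0.98, 0.01, 0.01])   # predicted probabilities
y = np.array([0.0, 1.0, 0.0])      # one-hot label: the true class is 1

mse = np.mean((P - y) ** 2)        # mean squared error on the probabilities
ce = -np.sum(y * np.log(P))        # categorical cross-entropy

print(f"MSE: {mse:.3f}")           # ~0.647, stays bounded no matter how wrong P is
print(f"CE:  {ce:.3f}")            # ~4.605, grows without bound as P_c -> 0
```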
## Problem 3
The following questions refer to gradient descent. Please feel free to consult the lecture notes where applicable.
1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (_2-4 sentences_)
2. Consider the formula for updating our weights:
$$\Delta w = -\alpha \frac{\partial L}{\partial w}$$
**Why do we negate this quantity? What purpose does this serve? (_2-4 sentences_)**
3. During gradient descent, we calculate the partial derivative of loss ($L$) with respect to each weight $(w_{ij})$. **Why must we do this for every weight? Why can't we do this only for some weights? (_1-3 sentences_)**
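If a concrete picture helps with parts 2 and 3, the toy loop below (an assumed one-dimensional loss, not part of the assignment) applies the update rule repeatedly; watching where $w$ ends up illustrates what the update is doing.

```python
# Toy 1-D illustration of the update rule Δw = -α ∂L/∂w (assumed loss, not from the assignment).
def dL_dw(w):
    return 2 * (w - 3.0)           # gradient of L(w) = (w - 3)^2, which is minimized at w = 3

w, alpha = 10.0, 0.1
for _ in range(25):
    w = w - alpha * dL_dw(w)       # subtracting α times the gradient moves w downhill
print(w)                           # approaches 3.0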
## Problem 4
Consider stacking two linear transformations: $f_1(\mathbf{x}) = \mathbf{W_1}\mathbf{x}+\mathbf{b_1}$ mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$, followed by $f_2(\mathbf{y}) = \mathbf{W_2}\mathbf{y}+\mathbf{b_2}$ mapping from $\mathbb{R}^t$ to $\mathbb{R}^u$.
**(a) Calculate the result of composing these two functions: $f_2(f_1(\mathbf{x}))$**
Simplify your answer to show it can be written as a single linear transformation.
:::success
**Hint:** Substitute $f_1(\mathbf{x})$ into $f_2$ and expand. Look for opportunities to combine terms.
:::
**(b) What are the shapes of $\mathbf{W_1}, \mathbf{b_1}$ and $\mathbf{W_2}, \mathbf{b_2}$?**
*Explain your reasoning based on the dimensions given.*
**(c) Does the composition of two linear functions offer any advantage over a single linear function? *What does this tell you about the necessity of non-linear activation functions in deep networks?*** *(3-5 sentences)*
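For part (a), once you have derived a single transformation $\mathbf{W}\mathbf{x} + \mathbf{b}$, you can check it numerically against the direct composition. The sketch below is one possible self-check with arbitrary assumed dimensions; it is not a substitute for showing the algebra.

```python
import numpy as np

# Assumed dimensions for f1: R^s -> R^t and f2: R^t -> R^u (column-vector convention).
s, t, u = 4, 5, 3
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(t, s)), rng.normal(size=t)
W2, b2 = rng.normal(size=(u, t)), rng.normal(size=u)
x = rng.normal(size=s)

composed = W2 @ (W1 @ x + b1) + b2          # f2(f1(x)) computed directly

# The single transformation you derive in (a) should reproduce it exactly:
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(composed, W @ x + b))     # True
```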
## Problem 5: Adaptive Optimizers
Modern deep learning often uses adaptive optimizers like Adam or RMSProp instead of basic SGD.
**(a) What is the key difference between basic gradient descent and adaptive methods like Adam?**
*(2-3 sentences)*
**(b) Why might adaptive learning rates help with training?**
*Consider what happens when different parameters have gradients of very different magnitudes*. *(3-4 sentences)*
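The stripped-down sketch below (not the full Adam update; no momentum or bias correction) may help you see the effect of per-parameter scaling when gradient magnitudes differ wildly. Treat the numbers as assumptions for illustration only.

```python
import numpy as np

# Two parameters whose gradients differ by five orders of magnitude (made-up numbers).
grad = np.array([100.0, 0.001])
lr = 0.01

# Plain SGD: one global learning rate for every parameter.
sgd_step = -lr * grad                              # -> [-1.0, -1e-05]

# RMSProp/Adam-style scaling (simplified): divide by a running RMS of past gradients,
# here collapsed to just the current gradient for brevity.
v = grad ** 2
adaptive_step = -lr * grad / (np.sqrt(v) + 1e-8)   # -> roughly [-0.01, -0.01]

print(sgd_step, adaptive_step)
```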
# Bonus Questions
:::warning
**Note:** The following questions are *bonus* questions and are **not required**. Understanding these will help you debug issues with gradient computation in your implementation.
:::

In your implementation, each layer has four key gradient-related methods: `get_input_gradients()`, `get_weight_gradients()`, `compose_input_gradients()`, and `compose_weight_gradients()`.
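As a mental model only, here is one rough, assumed sketch of how a dense layer *might* arrange those four methods; the actual BERAS stencil defines the real signatures and return shapes, so defer to it when answering the bonus questions.

```python
import numpy as np

class DenseSketch:
    """Hypothetical dense layer: NOT the BERAS implementation, just an assumed illustration."""

    def __init__(self, w, b):
        self.w, self.b = w, b              # w: (m, r), b: (r,)

    def forward(self, x):
        self.x = x                         # cache the batch input, shape (N, m)
        return x @ self.w + self.b         # outputs have shape (N, r)

    def get_input_gradients(self):
        # Local gradient of the layer's outputs with respect to its inputs.
        return [self.w]

    def get_weight_gradients(self):
        # Local gradients of the layer's outputs with respect to w and b.
        return [self.x, np.ones(self.x.shape[0])]

    def compose_input_gradients(self, J):
        # Chain rule: combine the upstream gradient J (shape (N, r)) with the local
        # input gradient to get the gradient passed to earlier layers, shape (N, m).
        return J @ self.w.T

    def compose_weight_gradients(self, J):
        # Chain rule: combine J with the local weight gradients to get dL/dw and dL/db.
        return [self.x.T @ J, J.sum(axis=0)]
```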
**Bonus Problem 1**
**To initialize backpropagation, which method must you call on which layer?** What does this method return, and why do we start backpropagation from the loss?
**Bonus Problem 2**
At layer $f_1$, imagine that $\mathbf{x}$ is a batch of inputs with shape $(N, m)$ and $\mathbf{w_1}$ is a weight matrix of shape $(m, r)$.
(a) What shape does `get_input_gradients()` return? Explain why it has this shape.
(b) What shape does `get_weight_gradients()` return? Explain why it has this shape.
**Bonus Problem 3**
**What is the difference between `get_input_gradients()` and `compose_input_gradients(J)`?** What does the parameter `J` represent, and why do we need both methods? *(3-5 sentences)*
**Bonus Problem 4: Chain Rule in Practice**
At layer $f_1$, the method `compose_input_gradients(J)` receives an upstream gradient $\mathbf{J}$ from layer $f_2$. If $f_2$ has weights $\mathbf{w_2}$ of shape $(r, 1)$, explain how the chain rule is applied to compute the gradient that gets passed to earlier layers. What mathematical operation combines the local gradient with the upstream gradient?