---
tags: hw3, conceptual
---
# Homework 3 Conceptual: BERAS
:::info
Conceptual section due **Thursday, February 19, 2026 at 11:59 PM EST**
Programming section due **Thursday, February 26, 2026 at 11:59 PM EST**
:::
Answer the following questions, showing your work where necessary, and explain your reasoning.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A non-editable homework template is linked below; copy the entire .tex file and image into your own Overleaf project and work from there!
> #### [**Latex Template**](https://www.overleaf.com/read/pbyqsczsswfm#dc04cf)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Problem 1
The update rule for weights in our single layer, multi-class neural network (dense layer & softmax activation) with cross entropy loss is defined as follows:
Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the _label_) and let $P_j$ be the predicted probability of class $j$, so $P_c$ is the predicted probability of the correct class (the case where $j = c$). The loss is then:
$$L = -\log(P_c)$$
If $j = c$, then the derivative of loss with respect to the weight is:
$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)x_i$$
And if $j \neq c$, then:
$$\frac{\partial L}{\partial w_{ij}} = P_j x_i$$
We can then use these partial derivatives to descend our weights along the gradient:
\begin{align}
w_{ij} = w_{ij} - \alpha\frac{\partial L}{\partial w_{ij}}
\end{align}
where $\alpha$ is the learning rate.
Our network's architecture can then be defined as:
1. **Feature Input Vector:** $\mathbf{x} = [x_1, x_2, \dots, x_n]$
2. **Linear Layer:** $z_j = \sum_{i=1}^n w_{ij}x_i + b_j$
3. **Softmax:** $P_j = \sigma(z_j)$
4. **Cross-Entropy Loss:** $L = -\sum_{j=1}^K y_j\log(P_j)$
1. **Derive the above rules from the original definition of cross-entropy loss. Make sure to include all of your work (such as the quotient rule).**
:::success
**Hint:** Start by expanding $L$ out to our full definition of cross-entropy. From there, you can see how it simplifies down to $L = -\log(P_c)$. Break down each part of the network, from the input $x$ through the dense layer and the softmax layer, to produce our probability distribution $P$.
Key Tip:
$$ \sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$
:::
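You are not required to write any code for this problem, but if you want to sanity-check the gradient rules above before (or after) deriving them, a finite-difference comparison is one option. The sketch below is only an illustrative assumption (plain NumPy, made-up sizes `n` and `K`), not part of the required derivation.

```python
import numpy as np

# Hypothetical sizes: n input features, K classes (not specified by the assignment).
n, K = 4, 3
rng = np.random.default_rng(0)
x = rng.normal(size=n)           # input feature vector
W = rng.normal(size=(n, K))      # weights w_ij
b = rng.normal(size=K)           # biases b_j
c = 1                            # index of the correct class

def loss(W):
    z = x @ W + b                        # linear layer: z_j = sum_i w_ij * x_i + b_j
    P = np.exp(z) / np.exp(z).sum()      # softmax
    return -np.log(P[c])                 # cross-entropy with a one-hot label

# Analytic gradient from the rules above: (P_j - 1) x_i if j == c, else P_j x_i
z = x @ W + b
P = np.exp(z) / np.exp(z).sum()
analytic = np.outer(x, P - np.eye(K)[c])

# Central finite differences for comparison
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(K):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-9
```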
## Problem 2
In classification problems, we assign a probability to each class and use a loss function that compares these predicted probabilities to the true label.
**(a) Can you use MSE loss for classification tasks? Why or why not?**
Consider both theoretical and practical perspectives. *(3-5 sentences)*
:::info
**Hint:** Think about the range of outputs from a softmax layer versus what MSE expects. Also consider how MSE penalizes errors compared to cross-entropy.
:::
**(b) Why is categorical cross-entropy loss most commonly used for multi-class classification?**
Explain its relationship to probability and information theory. *(3-5 sentences)*
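If it helps to build intuition before writing your answers, the toy comparison below (an assumed example, not required work) shows how MSE and cross-entropy score the same confidently wrong prediction.

```python
import numpy as np

# A confidently wrong softmax output for a 3-class problem (made-up numbers).
P = np.array([0.98, 0.01, 0.01])   # predicted probabilities
y = np.array([0.0, 1.0, 0.0])      # one-hot label: the true class is 1

mse = np.mean((P - y) ** 2)        # mean squared error on the probabilities
ce = -np.sum(y * np.log(P))        # categorical cross-entropy

print(f"MSE: {mse:.3f}")           # ~0.647, stays bounded no matter how wrong P is
print(f"CE:  {ce:.3f}")            # ~4.605, grows without bound as P_c -> 0
```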
## Problem 3
The following questions refer to gradient descent. Please feel free to consult the lecture notes where applicable.
1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (_2-4 sentences_)
2. Consider the formula for updating our weights:
$$\Delta w = -\alpha \frac{\partial L}{\partial w}$$
**Why do we negate this quantity? What purpose does this serve? (_2-4 sentences_)**
3. During gradient descent, we calculate the partial derivative of loss ($L$) with respect to each weight $(w_{ij})$. **Why must we do this for every weight? Why can't we do this only for some weights? (_1-3 sentences_)**
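If a concrete picture helps with parts 2 and 3, the toy loop below (an assumed one-dimensional loss, not part of the assignment) applies the update rule repeatedly; watching where $w$ ends up illustrates what the update is doing.

```python
# Toy 1-D illustration of the update rule Δw = -α ∂L/∂w (assumed loss, not from the assignment).
def dL_dw(w):
    return 2 * (w - 3.0)           # gradient of L(w) = (w - 3)^2, which is minimized at w = 3

w, alpha = 10.0, 0.1
for _ in range(25):
    w = w - alpha * dL_dw(w)       # subtracting α times the gradient moves w downhill
print(w)                           # approaches 3.0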
## Problem 4
Consider stacking two linear transformations: $f_1(\mathbf{x}) = \mathbf{W_1}\mathbf{x}+\mathbf{b_1}$ mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$, followed by $f_2(\mathbf{y}) = \mathbf{W_2}\mathbf{y}+\mathbf{b_2}$ mapping from $\mathbb{R}^t$ to $\mathbb{R}^u$.
**(a) Calculate the result of composing these two functions: $f_2(f_1(\mathbf{x}))$**
Simplify your answer to show it can be written as a single linear transformation.
:::success
**Hint:** Substitute $f_1(\mathbf{x})$ into $f_2$ and expand. Look for opportunities to combine terms.
:::
**(b) What are the shapes of $\mathbf{W_1}, \mathbf{b_1}$ and $\mathbf{W_2}, \mathbf{b_2}$?**
*Explain your reasoning based on the dimensions given.*
**(c) Does the composition of two linear functions offer any advantage over a single linear function? *What does this tell you about the necessity of non-linear activation functions in deep networks?*** *(3-5 sentences)*
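For part (a), once you have derived a single transformation $\mathbf{W}\mathbf{x} + \mathbf{b}$, you can check it numerically against the direct composition. The sketch below is one possible self-check with arbitrary assumed dimensions; it is not a substitute for showing the algebra.

```python
import numpy as np

# Assumed dimensions for f1: R^s -> R^t and f2: R^t -> R^u (column-vector convention).
s, t, u = 4, 5, 3
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(t, s)), rng.normal(size=t)
W2, b2 = rng.normal(size=(u, t)), rng.normal(size=u)
x = rng.normal(size=s)

composed = W2 @ (W1 @ x + b1) + b2          # f2(f1(x)) computed directly

# The single transformation you derive in (a) should reproduce it exactly:
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(composed, W @ x + b))     # True
```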
## Problem 5: Adaptive Optimizers
Modern deep learning often uses adaptive optimizers like Adam or RMSProp instead of basic SGD.
**(a) What is the key difference between basic gradient descent and adaptive methods like Adam?**
*(2-3 sentences)*
**(b) Why might adaptive learning rates help with training?**
*Consider what happens when different parameters have gradients of very different magnitudes*. *(3-4 sentences)*
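The stripped-down sketch below (not the full Adam update; no momentum or bias correction) may help you see the effect of per-parameter scaling when gradient magnitudes differ wildly. Treat the numbers as assumptions for illustration only.

```python
import numpy as np

# Two parameters whose gradients differ by five orders of magnitude (made-up numbers).
grad = np.array([100.0, 0.001])
lr = 0.01

# Plain SGD: one global learning rate for every parameter.
sgd_step = -lr * grad                              # -> [-1.0, -1e-05]

# RMSProp/Adam-style scaling (simplified): divide by a running RMS of past gradients,
# here collapsed to just the current gradient for brevity.
v = grad ** 2
adaptive_step = -lr * grad / (np.sqrt(v) + 1e-8)   # -> roughly [-0.01, -0.01]

print(sgd_step, adaptive_step)
```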
# Bonus Questions
:::warning
**Note:** The following questions are *bonus* questions and are **not required**. Understanding these will help you debug issues with gradient computation in your implementation.
:::

In your implementation, each layer has four key gradient-related methods: `get_input_gradients()`, `get_weight_gradients()`, `compose_input_gradients()`, and `compose_weight_gradients()`.
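As a mental model only, here is one rough, assumed sketch of how a dense layer *might* arrange those four methods; the actual BERAS stencil defines the real signatures and return shapes, so defer to it when answering the bonus questions.

```python
import numpy as np

class DenseSketch:
    """Hypothetical dense layer: NOT the BERAS implementation, just an assumed illustration."""

    def __init__(self, w, b):
        self.w, self.b = w, b              # w: (m, r), b: (r,)

    def forward(self, x):
        self.x = x                         # cache the batch input, shape (N, m)
        return x @ self.w + self.b         # outputs have shape (N, r)

    def get_input_gradients(self):
        # Local gradient of the layer's outputs with respect to its inputs.
        return [self.w]

    def get_weight_gradients(self):
        # Local gradients of the layer's outputs with respect to w and b.
        return [self.x, np.ones(self.x.shape[0])]

    def compose_input_gradients(self, J):
        # Chain rule: combine the upstream gradient J (shape (N, r)) with the local
        # input gradient to get the gradient passed to earlier layers, shape (N, m).
        return J @ self.w.T

    def compose_weight_gradients(self, J):
        # Chain rule: combine J with the local weight gradients to get dL/dw and dL/db.
        return [self.x.T @ J, J.sum(axis=0)]
```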
**Bonus Problem 1**
**To initialize backpropagation, which method must you call on which layer?** What does this method return, and why do we start backpropagation from the loss?
**Bonus Problem 2**
At layer $f_1$, imagine that $\mathbf{x}$ is a batch of inputs with shape $(N, m)$ and $\mathbf{w_1}$ is a weight matrix of shape $(m, r)$.
(a) What shape does `get_input_gradients()` return? Explain why it has this shape.
(b) What shape does `get_weight_gradients()` return? Explain why it has this shape.
**Bonus Problem 3**
**What is the difference between `get_input_gradients()` and `compose_input_gradients(J)`?** What does the parameter `J` represent, and why do we need both methods? *(3-5 sentences)*
**Bonus Problem 4: Chain Rule in Practice**
At layer $f_1$, the method `compose_input_gradients(J)` receives an upstream gradient $\mathbf{J}$ from layer $f_2$. If $f_2$ has weights $\mathbf{w_2}$ of shape $(r, 1)$, explain how the chain rule is applied to compute the gradient that gets passed to earlier layers. What mathematical operation combines the local gradient with the upstream gradient?