---
tags: hw3, conceptual
---
# HOMEWORK 3 Conceptual: BERAS
:::info
Conceptual section due **Thursday, September 25, 2025 at 10:00 PM EST**
Programming section due **Thursday, October 2, 2025 at 10:00 PM EST**
:::
Answer the following questions, showing and explaining your work where necessary.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A read-only homework template is linked below; copy the entire .tex file and image into your own Overleaf project and work from there!
> #### [**Latex Template**](https://www.overleaf.com/read/fxxpfgfjyryw#825423)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Conceptual Questions
1. The update rule for weights in our single layer, multi-class neural network (dense layer & softmax activation) with cross entropy loss is defined as follows:
Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the _label_) and $P_j$ be the predicted probability of class $j$. $P_c$ is the case where $j=c$. The loss is then:
$$L = -\log(P_c)$$
If $j = c$, then the derivative of loss with respect to the weight is:
$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)x_i$$
And if $j \neq c$, then:
$$\frac{\partial L}{\partial w_{ij}} = P_j x_i$$
We can then use these partial derivatives to descend our weights along the gradient:
$$w_{ij} = w_{ij} - \alpha\frac{\partial L}{\partial w_{ij}}$$
where $\alpha$ is the learning rate.
Our network's architecture can then be defined as:
1. **Feature Input Vector:** $\mathbf{x} = [x_1, x_2, \dots, x_n]$
2. **Linear Layer:** $z_j = \sum_{i=1}^n w_{ij}x_i + b_j$
3. **Softmax:** $P_j = \sigma(z_j)$
4. **Cross-Entropy Loss:** $L = -\sum_{j=1}^K y_j\log(P_j)$
1. **Derive the above rules from the original definition of cross-entropy loss. Make sure to include all of your work (such as the quotient rule).**
:::success
**Hint:** Start by expanding $L$ out to our full definition of cross-entropy. From there, you can see the simplification down to $L=-\log(P_c)$. Break down each part of the network from the input $x$, through the dense layer, and the softmax layer to produce our probability distribution, $P$. A small numerical sanity check of the gradient formulas appears right after this hint.
Key Tip:
$$ \sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$
:::
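If you would like to sanity-check your derivation after writing it out, here is a minimal NumPy sketch (not part of the BERAS starter code; the names `x`, `w`, `b`, and `c` are illustrative) that compares the analytic gradients stated above against central finite differences.

```python
import numpy as np

# Sanity check of the stated gradients for a dense layer + softmax + cross-entropy.
rng = np.random.default_rng(0)
n, K = 4, 3                              # n input features, K classes
x = rng.normal(size=n)
w = rng.normal(size=(n, K))
b = rng.normal(size=K)
c = 1                                    # index of the correct class

def loss(w):
    z = x @ w + b                        # linear layer: z_j = sum_i w_ij * x_i + b_j
    P = np.exp(z) / np.exp(z).sum()      # softmax
    return -np.log(P[c])                 # cross-entropy reduces to -log(P_c)

# Analytic gradient from the formulas above: dL/dw_ij = (P_j - 1{j == c}) * x_i
z = x @ w + b
P = np.exp(z) / np.exp(z).sum()
analytic = np.outer(x, P - np.eye(K)[c])

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(n):
    for j in range(K):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))     # expected: True
```

A check like this does not replace the derivation; it only confirms on one random example that the $(P_j - 1)x_i$ and $P_j x_i$ cases behave as claimed.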
2. In classification problems, we assign a likelihood probability to each class and use a loss function that outputs a loss based on this probability. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (_3-5 sentences_)
:::info
**Hint:** Think about how each loss function is shaped, the inputs it takes in, and the range of values those inputs can take. A small numerical comparison of the two losses follows this hint.
:::
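To build intuition for the hint, the following illustrative sketch (the class count and probability values are arbitrary) prints both losses on the same one-hot label as the predicted probability of the correct class shrinks.

```python
import numpy as np

# Compare MSE and cross-entropy on one example as P_c (prob. of the true class) shrinks.
y = np.array([1.0, 0.0, 0.0])                          # one-hot label, class 0 is correct
for p_c in [0.9, 0.5, 0.1, 0.01, 0.001]:
    P = np.array([p_c, (1 - p_c) / 2, (1 - p_c) / 2])  # a valid probability distribution
    mse = np.mean((P - y) ** 2)
    ce = -np.sum(y * np.log(P))                        # reduces to -log(P_c)
    print(f"P_c = {p_c:6}:  MSE = {mse:.4f}   CE = {ce:.4f}")
```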
3. The following questions refer to gradient descent! Please feel free to refer to the lecture notes when applicable; a small numerical illustration of the update rule follows these sub-questions.
1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (_2-4 sentences_)
2. Consider the formula for updating our weights:
$$\Delta w = -\alpha \frac{\partial L}{\partial w}$$
Why do we negate this quantity? What purpose does this serve? (_2-4 sentences_)
3. During gradient descent, we calculate the partial derivative of loss ($L$) with respect to each weight $(w_{ij})$. Why must we do this for every weight? Why can't we do this only for some weights? (_1-3 sentences_)
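As mentioned above, here is a minimal numerical illustration of the update rule on a toy one-dimensional loss $L(w) = (w - 3)^2$; the loss, the starting point, and the learning rate are arbitrary choices made for this sketch.

```python
# Toy gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
w, alpha = 0.0, 0.1
for step in range(25):
    grad = 2 * (w - 3)        # dL/dw
    w = w - alpha * grad      # step in the direction of the negative gradient
print(w)                      # approaches 3
```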
4. We have previously worked on single-layer linear regression using one linear function $\mathbf{x} \mapsto\mathbf{W_1}\mathbf{x}+\mathbf{b_1}$ mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$. For many real-world scenarios, we actually need multiple layers to model more complex relationships (a shape-checking sketch follows these bullets).
- Calculate the result after we stack another linear function $\mathbf{x} \mapsto\mathbf{W_2}\mathbf{x}+\mathbf{b_2}$ mapping from $\mathbb{R}^t$ to $\mathbb{R}^u$ right after the first one.
- What is the shape of $\mathbf{W_1}, \mathbf{b_1}$ and $\mathbf{W_2}, \mathbf{b_2}$? Explain your reasoning.
- Does the composition of the two linear functions offer an improvement over a single linear function? Explain your answer (*2-4 sentences*)
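If you would like to experiment with the dimensions before answering, here is a shape-checking sketch. It assumes the column-vector convention suggested by $\mathbf{W}\mathbf{x}+\mathbf{b}$ and arbitrary values of $s$, $t$, $u$; the BERAS code itself may batch inputs differently, so treat the orientation here as an assumption rather than the required answer.

```python
import numpy as np

# Stack two linear maps R^s -> R^t -> R^u and check the intermediate shapes.
s, t, u = 5, 4, 3
x = np.ones(s)                              # input in R^s
W1, b1 = np.ones((t, s)), np.ones(t)        # first layer:  R^s -> R^t
W2, b2 = np.ones((u, t)), np.ones(u)        # second layer: R^t -> R^u
h = W1 @ x + b1                             # shape (t,)
y = W2 @ h + b2                             # shape (u,)
print(h.shape, y.shape)                     # (4,) (3,)
```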
## Bonus Questions
The following questions are *bonus* questions and are **not required**. If you do not understand the `compose` or `get` gradient functions, TAs will redirect you to these questions first. Completing these questions **will help** you complete the assignment.

For the following questions, the answers will have to do with the following four functions: `get_input_gradients()`, `get_weight_gradients()`, `compose_input_gradients()`, and `compose_weight_gradients()`. Note that through the Diffable class, each of the layers shown above ($f_1$, $f_2$, $L$) either individually implement or inherit these functions.
1. To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return?
2. At layer $f_1$, what shape are the partials that `get_input_gradients()` and `get_weight_gradients()` return? Imagine that $x$ is a vector of length $m$ and that $w_1$ is a weight matrix of size $(m, r)$.
3. At layer $f_1$, what shape are the partials that `compose_input_gradients(J)` and `compose_weight_gradients(J)` return? `J` will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?
4. At layer $f_1$, how does `compose_input_gradients(J)` resolve the input dimension difference? More specifically, how does this function create a gradient the same shape as the inputs to this layer $(N,x)$? Be sure to reference the matrix multiplication at line 166 in `core.py`. Imagine that $w_2$ is a weight matrix of size $(r,1)$.
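To make the shapes in questions 2–4 concrete, here is a generic NumPy sketch of how an upstream gradient can be composed with a layer's local gradients. It is not the BERAS implementation and does not reproduce the code in `core.py`; it only mirrors the shapes described above (a batch of $N$ inputs of length $m$, $w_1$ of shape $(m, r)$ at $f_1$, and $w_2$ of shape $(r, 1)$ at $f_2$).

```python
import numpy as np

# Generic shape sketch of gradient composition (not the BERAS implementation).
N, m, r = 8, 5, 3
x  = np.ones((N, m))                 # batched inputs to f1
w1 = np.ones((m, r))                 # weights of f1
w2 = np.ones((r, 1))                 # weights of f2

dL_dout2 = np.ones((N, 1))           # gradient arriving at f2's output
J = dL_dout2 @ w2.T                  # upstream gradient arriving at f1, shape (N, r)

dL_dx  = J @ w1.T                    # composed input gradient, shape (N, m) -- matches x
dL_dw1 = x.T @ J                     # composed weight gradient, shape (m, r) -- matches w1
print(J.shape, dL_dx.shape, dL_dw1.shape)   # (8, 3) (8, 5) (5, 3)
```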