---
tags: hw2, conceptual
---
# HW2 Conceptual: Beras Pt. 2
:::info
Conceptual section due **Friday, February 23, 2024 at 6:00 PM EST**
Programming section due **Monday, February 26, 2024 at 6:00 PM EST**
:::
Answer the following questions, showing your work where necessary and explaining your reasoning.
:::info
We encourage the use of $\LaTeX$ to typeset your answers. A non-editable homework template is linked, so copy the .tex file into your own Overleaf project and go from there! *Be sure to download the images!*
> #### [**Latex Template**](https://drive.google.com/drive/folders/10xrzqtL0LYbK6GmtPYqdguE29WT6HBgz?usp=sharing)
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Theme
![](https://i.imgur.com/Ntnt8uP.jpg)
*When it's not as deep as your neural network*
## Conceptual Questions
1. The update rule for the weights in a single-layer (linear unit + softmax), multi-class neural network with cross-entropy loss is defined as follows:
Let $w_{ij}$ be the weight that corresponds to the $i$th input feature, $x_i$, and the $j$th class. Let $c$ be the correct class for a given input (the _label_). The loss is then:
$$L = -\log(P_c)$$
If $j = c$, then the derivative of loss with respect to the weight is:
$$\frac{\partial L}{\partial w_{ij}} = (P_j - 1)x_i$$
And if $j \neq c$, then:
$$\frac{\partial L}{\partial w_{ij}} = P_j x_i$$
We can then use these partial derivatives to update each weight by descending along the gradient:
$$w_{ij} = w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}$$
where $\alpha$ is the learning rate.
1. **Derive the above rules from the original definition of cross-entropy loss.**
:::info
**Hint:** Consider both cases of $j$, and start by expanding out $L$. A numerical sanity check for your derivation is sketched right after this question.
:::
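If you want to sanity-check your derivation, here is a minimal NumPy sketch (purely optional, not part of the submission; the helper names are our own, not Beras's) that implements the two gradient cases above and compares one entry against a finite-difference estimate.
```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_and_grad(w, x, c):
    """Cross-entropy loss L = -log(P_c) for a single linear + softmax layer,
    plus the analytic gradient dL/dw given by the two cases above."""
    p = softmax(x @ w)          # p[j] = P_j
    grad = np.outer(x, p)       # j != c case:  dL/dw_ij = P_j * x_i
    grad[:, c] -= x             # j == c case:  dL/dw_ic = (P_c - 1) * x_i
    return -np.log(p[c]), grad

# Tiny example: 3 input features, 4 classes, correct class c = 2.
rng = np.random.default_rng(0)
x, w, c = rng.normal(size=3), rng.normal(size=(3, 4)), 2
L, grad = loss_and_grad(w, x, c)

# Finite-difference check of a single entry, dL/dw_{0,1}.
eps = 1e-6
w_bumped = w.copy()
w_bumped[0, 1] += eps
L_bumped, _ = loss_and_grad(w_bumped, x, c)
print(grad[0, 1], (L_bumped - L) / eps)  # the two numbers should nearly match
```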
2. In classification problems, we assign a probability to each class and use a loss function that computes a loss from these probabilities. Can you use MSE loss for classification tasks? Why or why not? Why is cross-entropy loss most commonly used for classification? (_3-5 sentences_)
:::info
**Hint:** Think about how each loss function is shaped, the inputs it takes in, and the range of those inputs.
:::
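If it helps you reason about the hint, below is a small, purely illustrative NumPy comparison (not required; the example numbers are made up) of how the two losses react to a confidently wrong prediction.
```python
import numpy as np

# One-hot label for class 0, and a prediction that is confidently wrong.
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.01, 0.98, 0.01])  # softmax-style probabilities

mse = np.mean((y_true - y_pred) ** 2)             # bounded above (about 0.65 here)
cross_entropy = -np.sum(y_true * np.log(y_pred))  # -log(0.01) ~ 4.6, and it grows
                                                  # without bound as P_correct -> 0
print(mse, cross_entropy)
```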
<!--3. Why do we use a bias vector in our forward pass? What function does the bias serve? Can you give us a real-world example of how changing the bias could enhance a model's prediction capabilities? (_3-5 sentences_)-->
<!--4. Why do we use a normal distrbution when intializing weights? Why is the bias intialized as a zero vector? (_4-5 sentences_)-->
3. The following questions refer to gradient descent! Feel free to refer to the lecture notes when applicable.
1. In lecture, you learned about gradient descent. What is a gradient? How is it different from a partial derivative? How do they relate? (_2-4 sentences_)
2. Consider the formula for updating our weights:
$$\Delta w = -\alpha \frac{\partial L}{\partial w}$$
Why do we negate this quantity? What purpose does this serve? (_2-4 sentences_)
3. During gradient descent, we calculate the partial derivative of loss ($L$) with respect to each weight $(w_{ij})$. Why must we do this for every weight? Why can't we do this only for some weights? (_1-3 sentences_)
4. In practice, most operations during gradient descent are vectorized. Why do we do this? Why might this make it beneficial to train the model on a GPU? (_1-2 sentences_) A small illustration of vectorization is sketched at the end of this question.
5. Consider the following plot of a loss function for some neural network:
![Loss Function](https://i.imgur.com/JfHzG6E.jpg)
Where should gradient descent end up on the graph? How many weights does this model have? If our model starts training at $(-2, 2, 0)$, will gradient descent ever reach the absolute minimum of the loss? Why? Assume the loss surface at this point is _perfectly flat_. (_3-5 sentences_)
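For the vectorization question above, here is a small illustration you can run (assuming NumPy; the exact speedup is machine-dependent) of a nested Python loop versus a single vectorized matrix multiply.
```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))   # a batch of 1000 inputs
W = rng.normal(size=(512, 256))    # one layer's weight matrix

# Loop version: one dot product per (example, output unit) pair.
start = time.perf_counter()
out_loop = np.empty((1000, 256))
for n in range(1000):
    for j in range(256):
        out_loop[n, j] = X[n] @ W[:, j]
loop_time = time.perf_counter() - start

# Vectorized version: a single matrix multiply over the whole batch.
start = time.perf_counter()
out_vec = X @ W
vec_time = time.perf_counter() - start

print(np.allclose(out_loop, out_vec))           # same result...
print(f"speedup: {loop_time / vec_time:.0f}x")  # ...but much faster when vectorized
```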
<!-- 6. The following questions refer to empirical risk minimization.
1. How does empirical risk differ from true risk? Why do we minimize empirical risk? (_2-3 sentences_)
2. Consider the following two figures ($(a)$ and $(b)$, respectively):
Figure (a) | Figure (b)
:-------------------------:|:-------------------------:
![](https://i.imgur.com/7HzajNx.png) | ![](https://i.imgur.com/fDpK70u.png):
State whether each of the models represent the following: (_2-3 sentences_)
1. High or low empirical risk
2. Overfit or underfit
3. A good predictor
3. When is there low empirical risk and high true risk? -->
4. We have previously worked on single-layer linear regression using one linear function $\mathbf{x} \mapsto\mathbf{W_1}\mathbf{x}+\mathbf{b_1}$ mapping from $\mathbb{R^s}$ to $\mathbb{R^t}$. For many real-world scenarios, we actually need multiple layers to model more complex relationships.
- Calculate the result after we stack another linear function $\mathbf{x} \mapsto\mathbf{W_2}\mathbf{x}+\mathbf{b_2}$ mapping from $\mathbb{R^t}$ to $\mathbb{R^u}$ right after the first one.
- What is the shape of $\mathbf{W_1}, \mathbf{b_1}$ and $\mathbf{W_2}, \mathbf{b_2}$? Explain your reasoning.
- Does the composition of the two linear functions offer an improvement over a single linear function? Explain your answer (*2-4 sentences*)
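If you would like to check your answer empirically, here is a small optional NumPy sketch (the concrete dimensions are arbitrary, and the final verification line is left for you to fill in) that stacks the two linear functions.
```python
import numpy as np

rng = np.random.default_rng(0)
s, t, u = 5, 4, 3  # arbitrary small dimensions

# First function maps R^s -> R^t, second maps R^t -> R^u.
W1, b1 = rng.normal(size=(t, s)), rng.normal(size=t)
W2, b2 = rng.normal(size=(u, t)), rng.normal(size=u)

x = rng.normal(size=s)
stacked = W2 @ (W1 @ x + b1) + b2
print(stacked.shape)  # a vector in R^u, i.e. (3,)

# Once you have derived a single equivalent linear function x -> Wx + b,
# verify it numerically (fill in W and b in terms of W1, b1, W2, b2):
# W, b = ...
# print(np.allclose(stacked, W @ x + b))
```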
## Bonus Questions
:warning: **2470 Questions are written below!** :warning:
The following questions are *bonus* questions and are **not required**. If you do not understand the `compose` or `get` gradient functions, TAs will redirect you to these questions first. Completing these questions **will help** you complete the assignment.
![](https://i.imgur.com/HSm2nTO.png)
The answers to the following questions involve four functions: `get_input_gradients()`, `get_weight_gradients()`, `compose_input_gradients()`, and `compose_weight_gradients()`. Note that, through the `Diffable` class, each of the layers shown above ($f_1$, $f_2$, $L$) either implements these functions individually or inherits them.
1. To initialize backpropagation, which function must you call on which layer? What partial derivative(s) would this return? (Hint: Think back to HW1)
2. At layer $f_1$, what shape are the partials that `get_input_gradients()` and `get_weight_gradients()` return? Imagine that $x$ is a vector of length $m$ and that $w_1$ is a weight matrix of size $(m, r)$. (Hint: Think back to HW1)
3. At layer $f_1$, what shape are the partials that `compose_input_gradients(J)` and `compose_weight_gradients(J)` return? `J` will be the correct upstream gradient. For this to occur, how must each function resolve the input dimension? What upstream partial is necessary for these functions?
4. At layer $f_1$, how does `compose_input_gradients(J)` resolve the input dimension difference? More specifically, how does this function create a gradient of the same shape as the inputs to this layer, $(N, m)$? Be sure to reference the matrix multiplication at line 166 in `core.py`. Imagine that $w_2$ is a weight matrix of size $(r,1)$.
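The sketch below is *not* the Beras implementation (the actual method signatures and batching details live in `core.py`); it is only a generic NumPy illustration, with our own variable names, of the chain-rule pattern these four methods implement for a dense layer: local gradients are computed per layer, then composed with an upstream gradient `J` by matrix multiplication so that the results match the shapes of the layer's inputs and weights.
```python
import numpy as np

N, m, r = 8, 5, 3
rng = np.random.default_rng(0)

x  = rng.normal(size=(N, m))   # inputs to a dense layer f1
w1 = rng.normal(size=(m, r))   # f1's weights, so f1(x) = x @ w1

# Local gradients of f1's output (in the spirit of the "get" methods):
d_out_d_x = w1                 # shape (m, r)
d_out_d_w = x                  # shape (N, m)

# J plays the role of the upstream gradient dL/d(f1's output), shape (N, r),
# handed down from the layer above.
J = rng.normal(size=(N, r))

# Chain-rule composition (in the spirit of the "compose" methods):
dL_dx = J @ d_out_d_x.T        # (N, r) @ (r, m) -> (N, m), matches x's shape
dL_dw = d_out_d_w.T @ J        # (m, N) @ (N, r) -> (m, r), matches w1's shape

print(dL_dx.shape, dL_dw.shape)
```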
## 2470-only Questions
The following are questions that are only required by students enrolled in CSCI2470.
1. In your introductory calculus class, you were likely exposed to the following simple algorithm for finding the minimum of a function: take the derivative of the function, set it to zero, then solve the equation.
Neural networks are differentiable, so why don't we use this algorithm (instead of gradient descent) to minimize the loss? (_1-4 sentences_)
2. Prove that SGD with a batch size of 1 gives an _unbiased estimate_ of the "true gradient" (the gradient calculated over the entire training set). Assume that the batch is selected uniformly at random from the full training set.
:::info
**Hints:**
1. Recall that an estimator $\hat{f}$ is an unbiased estimator of a function $f$ if $\mathbb{E}[\hat{f}] = f$.
2. Both expectation and differentiation are linear operators.
:::
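Below is an optional numerical illustration of the claim (it is *not* a proof and not part of the answer), using a toy one-parameter squared-error loss; all of the names and numbers are our own.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset and a one-parameter "model": per-example loss L_i(w) = (w*x_i - y_i)^2.
x = rng.normal(size=100)
y = rng.normal(size=100)
w = 0.7

def grad_single(i, w):
    # dL_i/dw for a single training example i
    return 2 * (w * x[i] - y[i]) * x[i]

# "True" gradient: the average per-example gradient over the full training set.
true_grad = np.mean([grad_single(i, w) for i in range(len(x))])

# SGD with batch size 1: draw one example uniformly at random, many times.
samples = [grad_single(rng.integers(len(x)), w) for _ in range(200_000)]
print(true_grad, np.mean(samples))  # the sample mean should be close to true_grad
```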
In the assignment, you will see that even a single-layer neural network can classify digits with pretty good accuracy. Still, there are several ways in which one could improve the accuracy of the model. The paper linked below suggests an easy way to increase performance without changing the architecture of our network: training several identical models! Please read the paper (found [here](https://ieeexplore.ieee.org/document/6065510)) and answer the following questions:
:::info
You may need to login with your Brown username to access the paper.
:::
3. What is the committee? Why would it help increase the performance of the model?
4. What is the preprocessing they implement in this paper? How does preprocessing prevent the models' errors from being strongly correlated?
<!--## Ethical Questions
In this homework assignment, you’ve implemented a multi-layered perceptron from scratch. This architecture is the basis for many of the more complex models you will learn about and implement later on in the semester. It is here that the issues of interpretability and explainability arise. Skim [this article](https://arxiv.org/pdf/1811.10154.pdf) from Cynthia Rudin, a professor of computer science at Duke University, who advocates for the creation of inherently interpretable models. Refer back to specific sections to answer the following questions.
1. Imagine that you are having a discussion with someone who believes that trying to make models more interpretable will result in a loss of accuracy. Based on the research article linked above, explain whether there exists a tradeoff between accuracy and interpretability. Discuss at least one piece of evidence found in the paper to support your argument. (3-4 sentences)
2. Rudin suggests that many researchers assume the existence of an accuracy-interpretability tradeoff. What are some of the dangers of making such an assumption? (2-3 sentences)
Now read [this CNN article](https://www.cnn.com/2022/12/21/business/tesla-fsd-8-car-crash/index.html#:~:text=A%20driver%20told%20authorities%20that,a%20California%20Highway%20Patrol%20traffic) about a crash involving self-driving technology.
3. What advice would you give to someone working on Tesla’s AI software for creating AI models that the public trusts? Consider whether efforts should be directed towards explaining the black model that is Tesla’s autopilot software or creating a model that is inherently interpretable to begin with. (2-3 sentences)
4. What are the social and political consequences of using deep learning black-box models that are accurate but not explainable? (2-3 sentences)
-->