$$
\newcommand{\R}{\mathbb{R}}
$$
## MLP Architecture
We will implement a simple MLP without biases. The parameters are a list of matrices $W_1, \dots, W_l \in \R^{n\times n}$, and $\sigma$ denotes the ReLU nonlinearity. Let $x_0 \in \R^n$ be the input to the model; the MLP is then defined by the recursion
$$
x_i = \sigma(W_i\, x_{i-1}), \qquad i = 1, \dots, l,
$$
with $x_l$ as the output of the network.
**Task** Complete the MLP implementation
```python!
import numpy as np
n = 6 # network width
l = 3 # network depth
def create_MLP_weights():
W = [np.zeros([n,n]) for _ in range(l)]
return W
def MLP(x, W):
pass
x = np.random.normal(size=[n])
W = create_MLP_weights()
MLP(x, W)
```
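A possible completion of the forward pass (just a sketch; it keeps `create_MLP_weights` and the scaffold above as given):
```python!
# Sketch of one way to implement the forward pass from the recursion above.
def MLP(x, W):
    for W_i in W:                    # layers are applied in order W_1, ..., W_l
        x = np.maximum(0, W_i @ x)   # x_i = sigma(W_i x_{i-1}), sigma = ReLU
    return x
```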
## Fixing the Initializations
**Question** What is the issue with the previous initialization scheme?
A common way to initialize neural networks is to sample each entry of the weight matrices from a normal distribution with mean $0$ and standard deviation $\sqrt{\frac{2}{n}}$.
**Task** Using only samples from a normal distribution with mean $0$ and std $1$, which you can get via `np.random.normal(size=shape)`, implement this improved initialization scheme.
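One possible implementation (a sketch, assuming the globals `n` and `l` from the first code block): scaling a unit-normal sample by a constant scales its standard deviation by that constant, so multiplying by $\sqrt{2/n}$ gives the desired entries.
```python!
# Sketch: each entry is a unit normal rescaled to have std sqrt(2 / n).
def create_MLP_weights():
    return [np.sqrt(2 / n) * np.random.normal(size=[n, n]) for _ in range(l)]
```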
**Question** Can you prove that the entries of the matrices you sampled have the desired standard deviation?
## Classification
Let's say we want to solve a classification task with $c\in \mathbb{N}$ classes. First, we need a function that maps the $n$-dimensional output of the MLP to $c$ logits. For this we need one more weight matrix in $\R^{c\times n}$.
**Task** Complete the following code:
```python!
c = 4
def model_logits(x, W):
MLP_W, logit_W = W
pass
def create_model_weights():
pass
logits = model_logits(x, create_model_weights())
```
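A sketch of what the completed code might look like, reusing `MLP` and `create_MLP_weights` from above; applying the same $\sqrt{2/n}$ scaling to the logit matrix is just one choice among others.
```python!
# Sketch: the model weights are a pair (MLP weights, logit matrix).
def model_logits(x, W):
    MLP_W, logit_W = W
    return logit_W @ MLP(x, MLP_W)   # map the n-dimensional features to c logits

def create_model_weights():
    MLP_W = create_MLP_weights()
    logit_W = np.sqrt(2 / n) * np.random.normal(size=[c, n])  # c x n readout matrix
    return MLP_W, logit_W
```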
Once we have the model logits, we can obtain the output probabilities by applying the softmax to the logits.
**Task** Complete the following code:
```python!
def softmax(logits):
pass
def model_probs(x, W):
pass
```
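One possible completion; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result.
```python!
# Sketch: softmax over the logits and the resulting model probabilities.
def softmax(logits):
    shifted = logits - np.max(logits)   # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def model_probs(x, W):
    return softmax(model_logits(x, W))
```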
## Loss
**Task** Implement the negative log likelihood
```python!
target = 2 # Suppose this is the target class for our input x
def model_loss(x, W, target):
pass
model_loss(x, create_model_weights(), target)  # model_loss expects the full model weights (MLP weights and logit matrix)
```
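A minimal sketch, assuming `model_probs` from the previous step:
```python!
# Sketch: negative log likelihood of the target class under the model.
def model_loss(x, W, target):
    probs = model_probs(x, W)
    return -np.log(probs[target])
```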
## Gradients
Let's study the gradients of a single MLP layer. As before, $\sigma$ is the ReLU nonlinearity; here $x\in \R^n$ and $W\in \R^{m\times n}$. The MLP layer is defined as:
$$y = f(x, W) = \sigma(W x)$$
**Question** What is the derivative of $y$ wrt $x$?
**Question** What is the derivative of $y$ wrt $W$?
**Question** If we already know $\nabla_y \ell$, the gradient of the loss $\ell$ wrt $y$, what are the gradients of the loss wrt $x$ and $W$?
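For reference, one way to write the answer to the last question (a sketch, with $z = Wx$, $\sigma'(z)$ the elementwise ReLU derivative, i.e. $1$ where $z_i > 0$ and $0$ otherwise, and $\odot$ denoting elementwise multiplication):
$$
\nabla_x \ell = W^\top \big(\sigma'(z) \odot \nabla_y \ell\big),
\qquad
\nabla_W \ell = \big(\sigma'(z) \odot \nabla_y \ell\big)\, x^\top.
$$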
**Task** Complete the following code:
```python!
def f(x, W):
    z = W @ x
    y = np.maximum(0, z)  # elementwise ReLU
    return y
def f_bwd(x, W, y_grad):
pass
```
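A sketch of one possible backward pass, following the formulas above (it recomputes $z = Wx$ instead of caching it from the forward pass, purely for simplicity):
```python!
# Sketch: backward pass of the single ReLU layer y = relu(W x).
def f_bwd(x, W, y_grad):
    z = W @ x
    mask = (z > 0).astype(z.dtype)   # elementwise ReLU derivative
    z_grad = mask * y_grad           # gradient wrt the pre-activation z
    x_grad = W.T @ z_grad            # gradient of the loss wrt x
    W_grad = np.outer(z_grad, x)     # gradient of the loss wrt W
    return x_grad, W_grad
```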