$$ \newcommand{\R}{\mathbb{R}} $$

## MLP Architecture

We want to implement a simple MLP without biases. The parameters are just a list of matrices $W_1, \cdots, W_l \in \R^{n\times n}$, and we use a ReLU nonlinearity $\sigma$. Let $x_0 \in \R^n$ be the input to the model; then the MLP is defined as
$$ x_i = \sigma(W_i\; x_{i-1}) $$

**Task** Complete the MLP implementation.

```python!
import numpy as np

n = 6  # network width
l = 3  # network depth

def create_MLP_weights():
    W = [np.zeros([n, n]) for _ in range(l)]
    return W

def MLP(x, W):
    pass

x = np.random.normal(size=[n])
W = create_MLP_weights()
MLP(x, W)
```

## Fixing the Initialization

**Question** What is the issue with the previous initialization scheme?

A common way to initialize neural networks is to sample each entry of the weight matrices from a normal distribution with mean $0$ and standard deviation $\sqrt{\frac{2}{n}}$.

**Task** Using only samples from a normal distribution with mean $0$ and std $1$, which you can get via `np.random.normal(size=shape)`, implement this improved initialization scheme.

**Question** Can you prove that the matrices you sampled have the desired standard deviation?

## Classification

Let's say we want to solve a classification task with $c\in \mathbb{N}$ classes. First, we need a function that maps the $n$-dimensional network output to $c$ logits. To do this we need one more weight matrix in $\R^{c\times n}$. Complete the following code:

```python!
c = 4

def model_logits(x, W):
    MLP_W, logit_W = W
    pass

def create_model_weights():
    pass

logits = model_logits(x, create_model_weights())
```

Once we have the logits, we can obtain the output probabilities by applying the softmax to them. Complete the following code:

```python!
def softmax(logits):
    pass

def model_probs(x, W):
    pass
```

## Loss

**Task** Implement the negative log likelihood.

```python!
target = 2  # Suppose this is the target class for our input x

def model_loss(x, W, target):
    pass

model_loss(x, create_model_weights(), target)
```

## Gradients

Let's study the gradients of a single MLP layer. As before, $x\in \R^n$ and $\sigma$ is a ReLU nonlinearity, but we now allow a rectangular weight matrix $W\in \R^{m\times n}$. The MLP layer is defined as:
$$y = f(x, W) = \sigma(W x)$$

**Question** What is the derivative of $y$ with respect to $x$?

**Question** What is the derivative of $y$ with respect to $W$?

**Question** If we already know $\nabla_y \ell$, the gradient of the loss with respect to $y$, what is the gradient of the loss with respect to $x$ and $W$?

**Task** Complete the following code.

```python!
def f(x, W):
    z = W @ x
    y = np.maximum(z, 0)
    return y

def f_bwd(x, W, y_grad):
    pass
```
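
## Solution Sketches

The stubs above are meant to be filled in by the reader. The sketches below are one possible set of completions; they reuse `numpy`, `n`, `l`, and `c` as defined above, build on one another, and are not the only correct answers.

For the MLP task, the forward pass applies each weight matrix in turn, followed by the ReLU:

```python!
def MLP(x, W):
    # x_i = sigma(W_i x_{i-1}) for each layer, with sigma = ReLU
    for W_i in W:
        x = np.maximum(W_i @ x, 0)
    return x
```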
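
For the initialization section: with the all-zeros weights above, the network output is zero for every input and the gradients with respect to all weights vanish, so training cannot make progress. If $Z$ is standard normal, then $aZ$ has standard deviation $|a|$ (since $\operatorname{Var}(aZ) = a^2 \operatorname{Var}(Z)$), so scaling standard-normal samples by $\sqrt{2/n}$ gives entries with the desired standard deviation; this is the He initialization. One possible implementation:

```python!
def create_MLP_weights():
    # each entry ~ N(0, 2/n): scale a standard normal sample by sqrt(2/n)
    return [np.sqrt(2 / n) * np.random.normal(size=[n, n]) for _ in range(l)]
```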
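
For the classification task, the unpacking `MLP_W, logit_W = W` in the stub suggests storing the weights as a pair of MLP weights and readout matrix. Initializing the readout with the same He-style scaling is a choice made here, not something the text prescribes:

```python!
c = 4  # number of classes

def create_model_weights():
    MLP_W = create_MLP_weights()
    # readout matrix mapping the n-dimensional MLP output to c logits
    logit_W = np.sqrt(2 / n) * np.random.normal(size=[c, n])
    return MLP_W, logit_W

def model_logits(x, W):
    MLP_W, logit_W = W
    # note: no nonlinearity is applied to the logits themselves
    return logit_W @ MLP(x, MLP_W)
```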
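
For the probabilities, the softmax below subtracts the maximum logit before exponentiating; this leaves the result unchanged but avoids overflow for large logits:

```python!
def softmax(logits):
    z = np.exp(logits - np.max(logits))  # shift for numerical stability
    return z / np.sum(z)

def model_probs(x, W):
    return softmax(model_logits(x, W))
```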
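
For the loss task, the negative log likelihood is just the negative log of the probability the model assigns to the target class:

```python!
def model_loss(x, W, target):
    probs = model_probs(x, W)
    return -np.log(probs[target])
```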
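
For the gradient questions: write $z = W x$, so that $y = \sigma(z)$, and note that the derivative of the ReLU is the elementwise indicator $\mathbf{1}[z > 0]$ (taking it to be $0$ at $z = 0$ by convention). The chain rule then gives
$$
\frac{\partial y}{\partial x} = \operatorname{diag}\!\big(\mathbf{1}[z > 0]\big)\, W \in \R^{m\times n},
\qquad
\frac{\partial y_i}{\partial W_{jk}} = \mathbf{1}[i = j]\,\mathbf{1}[z_i > 0]\, x_k,
$$
and, given $\nabla_y \ell$,
$$
\nabla_x \ell = W^\top \big(\nabla_y \ell \odot \mathbf{1}[z > 0]\big),
\qquad
\nabla_W \ell = \big(\nabla_y \ell \odot \mathbf{1}[z > 0]\big)\, x^\top.
$$
In words: mask the incoming gradient with the ReLU pattern, then $\nabla_x \ell$ is that masked gradient pulled back through $W$, and $\nabla_W \ell$ is its outer product with the input.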
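
These formulas translate directly into a backward pass, assuming here that `f_bwd` is meant to return the pair `(x_grad, W_grad)`:

```python!
def f_bwd(x, W, y_grad):
    # gradients of the loss w.r.t. x and W, given the gradient w.r.t. y
    z = W @ x
    z_grad = y_grad * (z > 0)      # mask with the ReLU activation pattern
    x_grad = W.T @ z_grad          # shape (n,)
    W_grad = np.outer(z_grad, x)   # shape (m, n)
    return x_grad, W_grad
```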