# Multilayer Perceptron

###### tags: `Data Science`, `Notes`

## Feature Representation

Suppose we have the feature matrix of the training data, denoted as $\pmb{X}$.

\begin{align}\pmb{X}=\begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,d}\\ X_{2,1} & X_{2,2} & \cdots & X_{2,d}\\ \vdots & \vdots & \ddots & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,d}\\ \end{bmatrix}\end{align}

where $X_{i, j}$ is the $j$-th feature of the $i$-th sample. Note that this is a simplified way to describe the features. In practical problems, the features can be a multi-dimensional __tensor__ instead of a 2-dimensional matrix. Here are some conventional ways to structure the features:

| Data | Feature |
|:---- |:------- |
| Structured | $X_{i, j}$ is the $j$-th feature of the $i$-th sample. |
| Sequential | $X_{i, j, k}$ is the $k$-th feature at the $j$-th position of the sequence (i.e. timestep) of the $i$-th sample. |
| Image | $X_{i, j, k, l}$ is the $l$-th channel feature (e.g. RGB) at the $j$-th vertical and $k$-th horizontal position of the $i$-th sample. |
| Video | $X_{i, j, k, l, m}$ is the $m$-th channel feature at the $k$-th vertical and $l$-th horizontal position in the $j$-th frame of the $i$-th sample. |

## Network Representation

To simplify the presentation, we will use structured data for the introduction in this note. Let $\pmb{x_{k}}$ be the vector containing the $k$-th feature of all samples:

![](https://i.imgur.com/JLFr35Q.png)

We can take a linear combination of the feature vectors to perform a __linear feature transformation__:

\begin{align}\pmb{\hat{x}}&=\pmb{Xw^T}+b\\ &=w_1\pmb{x_1}+w_2\pmb{x_2}+\cdots+w_d\pmb{x_d}+b\\&=w_1\begin{bmatrix} X_{1,1} \\ X_{2,1} \\ \vdots \\ X_{n,1} \end{bmatrix}+w_2\begin{bmatrix} X_{1,2} \\ X_{2,2} \\ \vdots \\ X_{n,2} \end{bmatrix}+\cdots+w_d\begin{bmatrix} X_{1,d} \\ X_{2,d} \\ \vdots \\ X_{n,d} \end{bmatrix}+b\end{align}

This equation can be visualized as a weighted directed network:

![](https://i.imgur.com/9rD6ALA.png)

Now we can apply an __activation function__ to the transformed feature vector $\pmb{\hat{x}}$:

\begin{align}\sigma(w_1\pmb{x_1}+w_2\pmb{x_2}+\cdots+w_d\pmb{x_d}+b)\end{align}

Similarly, we can use a network to visualize the equation:

![](https://i.imgur.com/aM5fPnK.png)

## General Structure of Multilayer Perceptron

The general structure of a multilayer perceptron can be implemented easily using the Tensorflow/Keras framework ($n_i$ is the number of neurons in the $i$-th layer, each with a sigmoid activation function):

```python
from tensorflow.keras import models
from tensorflow.keras import layers

network = models.Sequential()
network.add(layers.Dense(n_1, activation='sigmoid', input_shape=(X.shape[-1],)))
...
network.add(layers.Dense(n_k, activation='linear'))  # identity (linear) output layer
```

The above network can be visualized as follows:

![](https://i.imgur.com/i1cT5RP.png)

## Nonlinear Transformation of Features: Activation Functions

Please refer to [Activation Functions](https://hackmd.io/WmT0WUXaRfG4KDs1VBvNrw?view).

## Evaluation of the Model: Objective Functions

Please refer to [Objective Functions](https://hackmd.io/0velDqCbQYyYYO8nR4Sjwg?view).
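Putting the network structure, activation functions, and objective function together, here is a minimal runnable sketch of such an MLP for a structured regression problem. The layer sizes (16, 8, 1), the random toy data, and the choice of mean-squared-error loss with plain SGD are illustrative assumptions, not prescribed by this note:

```python
import numpy as np
from tensorflow.keras import models, layers

# Toy structured data: n = 200 samples, d = 10 features, one regression target.
X = np.random.rand(200, 10).astype("float32")
y = np.random.rand(200, 1).astype("float32")

network = models.Sequential()
network.add(layers.Dense(16, activation='sigmoid', input_shape=(X.shape[-1],)))
network.add(layers.Dense(8, activation='sigmoid'))
network.add(layers.Dense(1, activation='linear'))   # identity output for regression

# L2 (mean squared error) loss minimized with stochastic gradient descent.
network.compile(optimizer='sgd', loss='mse')
network.fit(X, y, epochs=10, batch_size=32, verbose=0)
```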
## Training of Single Layer Perceptron: Gradient Descent

Now let's look into how the model finds the optimal $\pmb{w}$ and $b$. Suppose the task is a regression problem. Given a single-layer perceptron with a linear activation function:

![](https://i.imgur.com/SP3z2jP.png)

The task is to find the $\pmb{w^*}$ and $b^*$ that minimize the L2 loss of the model:

\begin{align}L(\pmb{w}, b)&=\sum_{i=1}^n(f(x_i) - y_i)^2\\ &=\sum_{i=1}^n(w_1X_{i,1} + w_2X_{i,2}+\cdots+w_dX_{i,d}+b-y_i)^2\end{align}

We can use many methods to optimize this function (e.g. least squares, genetic algorithms, Newton's method, etc.). Here, we introduce the method most commonly used in deep neural networks: __gradient descent__.

\begin{align}&w^{(1)}=w^{(0)}-\eta\nabla_wL|_{w^{(0)}}\\ &w^{(2)}=w^{(1)}-\eta\nabla_wL|_{w^{(1)}}\\ &w^{(3)}=w^{(2)}-\eta\nabla_wL|_{w^{(2)}}\\ &\;\;\vdots\\ &w^{(t)}=w^{(t-1)}-\eta\nabla_wL|_{w^{(t-1)}}\end{align}

where $t$ is the number of iterations, $\eta$ is the learning rate (a hyperparameter that specifies the magnitude of each update), and $\nabla_wL|_{w^{(t-1)}}$ is the gradient of the loss with respect to $w$ at iteration $t$:

\begin{align} \frac{\partial L(\pmb{w}, b)}{\partial w_1}\bigg|_{w^{(t-1)}}&=2\sum_i(f^{(t-1)}(x_i)-y_i)X_{i,1}\\ \frac{\partial L(\pmb{w}, b)}{\partial w_2}\bigg|_{w^{(t-1)}}&=2\sum_i(f^{(t-1)}(x_i)-y_i)X_{i,2}\\ &\;\;\vdots\\ \frac{\partial L(\pmb{w}, b)}{\partial w_d}\bigg|_{w^{(t-1)}}&=2\sum_i(f^{(t-1)}(x_i)-y_i)X_{i,d}\\ \frac{\partial L(\pmb{w}, b)}{\partial b}\bigg|_{w^{(t-1)}}&=2\sum_i(f^{(t-1)}(x_i)-y_i)\\ \end{align}

The objective function is a function of the parameters $\pmb{w}$ and $b$, and the gradient $\nabla_wL|_{w^{(t-1)}}$ points in the direction of steepest ascent, so each update moves the parameters in the opposite direction, towards a __local minimum__. Here is a visualization of gradient descent for features with only two dimensions ($d=2$), ignoring the bias term:

![](https://i.imgur.com/7tTBDEY.png)

One can see that the performance of gradient descent depends heavily on the parameters $\eta$ and $t$: when $\eta$ is too big, the process becomes unstable; when $\eta$ is too small, the process becomes slow and requires more iterations to reach convergence. Therefore, many optimizers have been proposed to address this issue.

Also check: [An Interactive Tutorial on Numerical Optimization](https://www.benfrederickson.com/numerical-optimization/)
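To make the update rule concrete, here is a minimal NumPy sketch of these gradient descent updates for the single-layer linear model above. The toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, eta=0.002, n_iter=20000):
    """Minimize L(w, b) = sum_i (x_i . w + b - y_i)^2 with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        residual = X @ w + b - y           # f(x_i) - y_i for every sample
        grad_w = 2.0 * X.T @ residual      # dL/dw_j = 2 * sum_i (f(x_i) - y_i) * X_{i,j}
        grad_b = 2.0 * residual.sum()      # dL/db   = 2 * sum_i (f(x_i) - y_i)
        w -= eta * grad_w                  # w^(t) = w^(t-1) - eta * gradient
        b -= eta * grad_b
    return w, b

# Toy usage: data generated from w = [2, -3], b = 0.5 should be recovered closely.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = X @ np.array([2.0, -3.0]) + 0.5
w, b = gradient_descent(X, y)
print(w, b)   # approximately [2. -3.] and 0.5
```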
## Backpropagation

![](https://i.imgur.com/mPFmSNE.png)

### Step 1

Compute the local gradient of each node with respect to its direct inputs. Note that this step is necessary only in a computational implementation (to store intermediate values that are reused repeatedly, which reduces computation time):

\begin{align}\frac{∂\pmb{z_1}}{∂w_1}&=\pmb{x_1},\;\frac{∂\pmb{z_1}}{∂w_2}=\pmb{x_2}\\ \frac{∂\pmb{z_2}}{∂w_3}&=\pmb{a}, \;\;\;\frac{∂\pmb{z_3}}{∂w_4}=\pmb{a}\\ \frac{∂\pmb{a}}{∂\pmb{z_1}}&=\begin{cases}1, &\text{if}\;z_1 > 0\\0, &\text{else}\end{cases}\\ \frac{∂\pmb{b}}{∂\pmb{z_2}}&=\begin{cases}1, &\text{if}\;z_2 > 0\\0, &\text{else}\end{cases}\\ \frac{∂\pmb{c}}{∂\pmb{z_3}}&=\begin{cases}1, &\text{if}\;z_3 > 0\\0, &\text{else}\end{cases}\end{align}

### Step 2

Compute the gradient of the loss with respect to each weight by applying the chain rule back through the network:

\begin{align} \frac{∂L}{∂w_1}&=\frac{∂L}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}\\ &=\frac{∂L}{∂\pmb{a}}\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}\\ &=\left(\frac{∂L}{∂\pmb{z_2}}\frac{∂\pmb{z_2}}{∂\pmb{a}}+\frac{∂L}{∂\pmb{z_3}}\frac{∂\pmb{z_3}}{∂\pmb{a}}\right)\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}\\ &=\left(\frac{∂L}{∂\pmb{b}}\frac{∂\pmb{b}}{∂\pmb{z_2}}\frac{∂\pmb{z_2}}{∂\pmb{a}}+\frac{∂L}{∂\pmb{c}}\frac{∂\pmb{c}}{∂\pmb{z_3}}\frac{∂\pmb{z_3}}{∂\pmb{a}}\right)\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}\\ &=\left(\frac{∂L}{∂\pmb{\hat{y}}}\frac{∂\pmb{\hat{y}}}{∂\pmb{b}}\frac{∂\pmb{b}}{∂\pmb{z_2}}\frac{∂\pmb{z_2}}{∂\pmb{a}}+\frac{∂L}{∂\pmb{\hat{y}}}\frac{∂\pmb{\hat{y}}}{∂\pmb{c}}\frac{∂\pmb{c}}{∂\pmb{z_3}}\frac{∂\pmb{z_3}}{∂\pmb{a}}\right)\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}\\ &=\frac{∂L}{∂\pmb{\hat{y}}}\frac{∂\pmb{\hat{y}}}{∂\pmb{b}}\frac{∂\pmb{b}}{∂\pmb{z_2}}\frac{∂\pmb{z_2}}{∂\pmb{a}}\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1}+\frac{∂L}{∂\pmb{\hat{y}}}\frac{∂\pmb{\hat{y}}}{∂\pmb{c}}\frac{∂\pmb{c}}{∂\pmb{z_3}}\frac{∂\pmb{z_3}}{∂\pmb{a}}\frac{∂\pmb{a}}{∂\pmb{z_1}}\frac{∂\pmb{z_1}}{∂w_1} \end{align}

The resulting expression is the sum of the derivatives along the different paths from the output layer back to the corresponding weight in the input layer:

![](https://i.imgur.com/Ky1b6qa.png)
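As a sanity check on Step 2, here is a small Python sketch that evaluates $\frac{∂L}{∂w_1}$ as the sum over the two paths and compares it against a finite-difference approximation. The scalar network, the output $\hat{y}=b+c$, and the squared-error loss are assumptions made for illustration; they mirror the figure only approximately:

```python
def relu(z):
    return max(z, 0.0)

def drelu(z):
    return 1.0 if z > 0 else 0.0          # derivative of relu(z) w.r.t. z

def forward(w, x1, x2, y):
    """Scalar forward pass; returns the intermediate values and the loss."""
    w1, w2, w3, w4 = w
    z1 = w1 * x1 + w2 * x2
    a = relu(z1)
    z2, z3 = w3 * a, w4 * a
    b, c = relu(z2), relu(z3)
    y_hat = b + c                          # assumed output node
    L = (y_hat - y) ** 2                   # assumed squared-error loss
    return z1, z2, z3, y_hat, L

def grad_w1(w, x1, x2, y):
    """dL/dw1 computed as the sum over the two paths of Step 2."""
    w1, w2, w3, w4 = w
    z1, z2, z3, y_hat, L = forward(w, x1, x2, y)
    dL_dyhat = 2.0 * (y_hat - y)
    # Path through b: dL/dy_hat * dy_hat/db * db/dz2 * dz2/da * da/dz1 * dz1/dw1
    path_b = dL_dyhat * 1.0 * drelu(z2) * w3 * drelu(z1) * x1
    # Path through c: dL/dy_hat * dy_hat/dc * dc/dz3 * dz3/da * da/dz1 * dz1/dw1
    path_c = dL_dyhat * 1.0 * drelu(z3) * w4 * drelu(z1) * x1
    return path_b + path_c

# Compare against a finite-difference approximation of dL/dw1.
w = [0.4, -0.2, 0.7, 0.3]
x1, x2, y = 1.5, 2.0, 1.0
eps = 1e-6
loss0 = forward(w, x1, x2, y)[-1]
loss1 = forward([w[0] + eps, w[1], w[2], w[3]], x1, x2, y)[-1]
print(grad_w1(w, x1, x2, y), (loss1 - loss0) / eps)   # both approximately -2.4
```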