# 02-1
---
layout: default
title: Lecture 2 - Part 1
authors: Amartya Prasad, Dongning Fang, Yuxin Tang, Sahana Upadhya
date: 3 February 2020
---

## [Backpropagation and Gradient-Based Methods](00:00:00-00:17:23)

### Parametrized Models

$$\bar{y} = G(x,w)$$

Parametrized models are simply functions that depend on inputs and trainable parameters. There is no fundamental difference between the two, except that trainable parameters are shared across training samples while the input varies from sample to sample. In most deep learning frameworks, parameters are implicit; that is, they aren't passed when the function is called. They are "saved inside the function", so to speak, at least in the object-oriented versions of models.

The parametrized model (function) takes in an input, has a parameter vector, and produces an output. In supervised learning, this output goes into the cost function $C(y,\bar{y})$, which compares the true output $y$ with the model output $\bar{y}$. The computation graph for this model is shown in Figure 1.

| <img src="https://i.imgur.com/XB2nQ0P.jpg" style="zoom: 50%"> |
|--|
| Figure 1: Computation graph representation for a parametrized model |

Examples of parametrized functions:

- Linear model - a weighted sum of the components of the input vector:
  $$\bar{y} = \sum_i w_i x_i \text{ ; } C(y,\bar{y}) = \|y - \bar{y}\|^2$$
- Nearest neighbour - there is an input $x$ and a weight matrix $W$ whose rows are indexed by $k$. The output is the $k$ corresponding to the row of $W$ closest to $x$:
  $$\bar{y} = \text{argmin}_k \|x - w_{k,\cdot}\|^2$$

Parametrized models can also involve much more complicated functions.

### Block diagram notations for computation graphs

- Variables (tensor, scalar, continuous, discrete)
    - <img src="https://i.imgur.com/d8ecASC.png" style="zoom:50%"> is an observed input to the system
    - <img src="https://i.imgur.com/ZR8qnJB.png" style="zoom:50%"> is a computed variable, produced by a deterministic function
- Deterministic functions
    <center><img src="https://i.imgur.com/YqsXYE9.png" style="zoom:50%"></center>
    - Take in multiple inputs and can give multiple outputs
    - Have an implicit parameter variable ($w$)
    - The rounded side tells us in which direction the function is easy to compute. In the diagram above, it is easier to compute $\bar{y}$ from $x$ than the other way around
- Scalar-valued function
    <center><img src="https://i.imgur.com/Eyjy0K2.png" style="zoom:50%"></center>
    - Used to represent cost functions
    - Has an implicit scalar output
    - Takes multiple inputs and outputs a single value (usually the distance between the inputs)

### Loss functions

The loss function is the function that is minimized during training. There are two types of loss:

1) Per-sample loss:
   $$L(x,y,w) = C(y, G(x,w))$$
2) Average loss: for any set of samples
   $$S = \{(x[p],y[p]) \mid p = 0, 1, ..., P-1\}$$
   the average loss over the set $S$ is given by:
   $$L(S,w) = \frac{1}{P} \sum_{(x,y)} L(x,y,w)$$

In the standard supervised learning paradigm, the (per-sample) loss is simply the output of the cost function. Machine learning is mostly about optimizing functions, usually minimizing them. It could also involve finding Nash equilibria between two functions, as with GANs. This is done using gradient-based methods, though not necessarily gradient descent.

### Gradient descent

A gradient-based method is a method/algorithm that finds the minima of a function, assuming that one can easily compute the gradient of that function.
It assumes that the function is continuous and almost-everywhere differentiable (it need not be differentiable everywhere).

*Insert picture here*

Intuitively, gradient descent is like being on a mountain in fog at night: you want to go down to the village, but you can't see anything, so you look around for the direction of steepest descent and take a step in that direction.

Algorithm (full-batch gradient descent):

$$w \leftarrow w - \eta \frac{\partial L(S,w)}{\partial w}$$

For SGD (Stochastic Gradient Descent), the algorithm becomes: pick a $p$ in $0...P-1$, then update

$$w \leftarrow w - \eta \frac{\partial L(x[p], y[p], w)}{\partial w}$$

where $w$ represents the parameter to be optimized. $\eta$ is a constant here, but in more sophisticated algorithms it could be a matrix. If it is a positive semi-definite matrix, we will still move downhill, but not necessarily in the direction of steepest descent. In fact, the direction of steepest descent may not always be the direction we want to move in.

If the function is not differentiable, i.e. it has a hole, is staircase-like, or is flat, so that the gradient gives no information, one has to resort to other methods - so-called 0-th order methods or gradient-free methods. Deep learning is all about gradient-based methods.

However, RL (Reinforcement Learning) involves gradient estimation without the explicit gradient. An example is a robot learning to ride a bike, where the robot falls every now and then. The objective function measures how long the bike stays up without falling. Unfortunately, there is no gradient for this objective function; the robot needs to try different things.

The RL cost function is not differentiable most of the time, but the network that computes the output that goes into the cost function is, and from that point on, everything is gradient-based. This is the main difference between supervised learning and reinforcement learning: with the latter, the cost function $C$ is not differentiable. In fact, it is completely unknown; it just returns an output when inputs are fed to it, like a black box. This makes RL highly inefficient, which is one of its main drawbacks - particularly when the parameter vector is high-dimensional (which implies a huge solution space to search, making it hard to know where to move).

A very popular technique in RL is Actor-Critic methods. A critic method basically consists of having a second $C$ module that is known and trainable. One trains this $C$ module, which is differentiable, to approximate the cost/reward function (the reward is a negative cost, more like a punishment). That is a way of making the cost function differentiable, or at least approximating it by a differentiable function, so that one can backpropagate.

## [Advantages of SGD, Backpropagation for Traditional Neural Nets](00:18:27-00:38:00)

### Advantages of Stochastic Gradient Descent

In practice, we use stochastic gradient descent to compute the gradient of the objective function w.r.t. the parameters. Instead of computing the full gradient of the objective function, which is the average over all samples, SGD just takes one sample, computes the loss $L$ and the gradient of the loss w.r.t. the parameters, and then takes one step in the negative gradient direction:

$$w \leftarrow w - \eta \frac{\partial L(x[p], y[p], w)}{\partial w}$$

In the formula, $w$ is updated by subtracting the step size times the gradient of the per-sample loss w.r.t. the parameters, for a given sample $(x[p], y[p])$.
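To make the update concrete, here is a minimal sketch (not from the lecture) of one per-sample SGD step using PyTorch autograd; the linear model, squared-error cost, and step size are placeholder assumptions:

```python
import torch

# Hypothetical linear model y_bar = G(x, w) = w . x with a squared-error cost,
# matching the per-sample update w <- w - eta * dL/dw from the notes.
torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)   # trainable parameter vector
eta = 0.01                               # step size (learning rate)

def sgd_step(x, y):
    y_bar = w @ x                        # G(x, w): weighted sum of the inputs
    loss = (y - y_bar) ** 2              # C(y, y_bar): squared error
    loss.backward()                      # compute dL/dw by backpropagation
    with torch.no_grad():
        w -= eta * w.grad                # step in the negative gradient direction
        w.grad.zero_()                   # reset the accumulated gradient
    return loss.item()

# One step on a single sample p:
x_p, y_p = torch.randn(5), torch.randn(())
print(sgd_step(x_p, y_p))
```

Calling `sgd_step` repeatedly on randomly chosen samples is exactly the per-sample update above.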
If we do this on a single sample, we get a very noisy trajectory, as shown in Figure 2. Instead of going directly downhill, the parameter vector moves stochastically: every sample pulls it in a different direction, and it is only the average that pulls us toward the minimum of the average loss. Although it looks inefficient, it is much faster than batch gradient descent, at least in the context of machine learning, when the samples have some redundancy between them.

![](https://i.imgur.com/Au9xWTv.png)

In practice, we use batches instead of doing stochastic gradient descent on a single sample. We compute the average of the gradient over a batch of samples, not a single sample, and then take one step. The only reason for doing this is that the hardware available to us (i.e. GPUs, multicore CPUs) is more efficient with batches: it is easier to parallelize and get more efficient computation. Batching is the simplest way to parallelize.

### Traditional Neural Networks

Traditional neural nets are basically interspersed layers of linear operations and point-wise non-linear operations. For the linear operations, conceptually it is just a matrix-vector multiplication: we take the (input) vector and multiply it by a matrix formed from the weights. The second type of operation takes all the components of the weighted-sums vector and passes them through some simple non-linearity (i.e. ReLU, tanh, ...).

![](https://i.imgur.com/KowUjtY.png)

Above is an example of a 2-layer network, because what matters are the pairs (i.e. linear + non-linear). Some people call it a 3-layer network because they count the variables, but Prof. LeCun thinks that is not fair. Note that if there were no non-linearities in the middle, we might as well have a single layer, because the composition of two linear functions is a linear function.

Below is how the linear and non-linear functional blocks stack:

![](https://i.imgur.com/eK2u920.png)

In the graph, $s[i]$ is the weighted sum of unit $i$, computed as:

$$s[i] = \sum_{j \in UP(i)} w[i,j] \cdot z[j]$$

where $UP(i)$ denotes the set of predecessors of $i$ and $z[j]$ is the $j$-th output of the previous layer. The output $z[i]$ is computed as:

$$z[i] = f(s[i])$$

where $f$ is a non-linear function.

### Backpropagation through a non-linear function

The first way to do backpropagation is to backpropagate through a non-linear function. We take a particular non-linear function $h$ from the network and leave everything else in a black box.

![](https://i.imgur.com/R3yhRUc.png)

We are going to use the chain rule to compute the gradients:

$$g(h(s))' = g'(h(s)) \cdot h'(s)$$

where $h'(s)$ is the derivative of $z$ w.r.t. $s$, i.e. $\frac{dz}{ds}$. To make the connection between the derivatives clear, we rewrite the formula as:

$$\frac{dC}{ds} = \frac{dC}{dz} \cdot \frac{dz}{ds} = \frac{dC}{dz} \cdot h'(s)$$

Hence, if we have a chain of such functions in the network, we can backpropagate by multiplying by the derivatives of all the $h$ functions, one after the other, all the way back to the bottom.

It is more intuitive to think of it in terms of perturbations. Perturbing $s$ by $ds$ will perturb $z$ by:

$$dz = ds \cdot h'(s)$$

This in turn will perturb $C$ by:

$$dC = dz \cdot \frac{dC}{dz} = ds \cdot h'(s) \cdot \frac{dC}{dz}$$

ending up with the same formula as above.

### Backpropagation through a weighted sum

For a linear module, we do backpropagation through a weighted sum. Here we view the entire network as a black box except for 3 connections going from a $z$ variable to a bunch of $s$ variables.

![](https://i.imgur.com/vme3dwp.png)

This time the perturbation is a weighted sum.
$z$ influences several variables. Perturbing $z$ by $dz$ will perturb $s[0]$, $s[1]$ and $s[2]$ by:

$$ds[0] = w[0] \cdot dz$$
$$ds[1] = w[1] \cdot dz$$
$$ds[2] = w[2] \cdot dz$$

This will perturb $C$ by:

$$dC = ds[0] \cdot \frac{dC}{ds[0]} + ds[1] \cdot \frac{dC}{ds[1]} + ds[2] \cdot \frac{dC}{ds[2]}$$

Hence $C$ is going to vary by the sum of the 3 variations:

$$\frac{dC}{dz} = \frac{dC}{ds[0]} \cdot w[0] + \frac{dC}{ds[1]} \cdot w[1] + \frac{dC}{ds[2]} \cdot w[2]$$

## [Implementation of Backpropagation](00:38:01-00:51:37)

### Block diagram of a traditional neural net

- Linear blocks: $$s_{k+1} = w_k z_k$$
- Non-linear blocks: $$z_k = h(s_k)$$

![](https://i.imgur.com/aYWO6gO.png)

Here $w_k$ is a matrix, $z_k$ is a vector, and $h$ is the application of the scalar function $h$ to every component. This is a 3-layer neural net with pairs of linear and non-linear functions, though most modern neural nets do not have such clear linear/non-linear separations and are more complex.

### PyTorch

![](https://i.imgur.com/pGYKHSC.png)

- We can implement neural nets with object-oriented classes in PyTorch. First we define a class for the neural net and initialize the linear layers in the constructor using the predefined `nn.Linear` class. Linear layers have to be separate objects because each of them contains its own parameter vector; the `nn.Linear` class also adds the bias vector implicitly. Then we define a `forward` function that computes the outputs, using the `torch.relu` function as the non-linear activation. We don't have to initialize separate ReLU functions because they have no parameters. (A sketch of such a class appears at the end of this part.)
- We do not need to compute the gradients ourselves, since PyTorch knows how to backpropagate and transform the gradients given the forward function.

### Backprop through a functional module

![](https://i.imgur.com/VxBrCQo.png)

- Using the chain rule for vector functions:
  $$z_g : [d_g \times 1]$$
  $$z_f : [d_f \times 1]$$
  $$\frac{\partial c}{\partial z_f} = \frac{\partial c}{\partial z_g} \frac{\partial z_g}{\partial z_f}$$
  $$[1 \times d_f] = [1 \times d_g] \times [d_g \times d_f]$$
  This is the basic formula for $\frac{\partial c}{\partial z_f}$ using the chain rule. Note that the gradient of a scalar function with respect to a vector is a vector of the same size as the vector with respect to which you differentiate. In order to keep the notation consistent, it is a row vector instead of a column vector.
- Jacobian matrix:
  $$\left(\frac{\partial z_g}{\partial z_f}\right)_{ij} = \frac{(\partial z_g)_i}{(\partial z_f)_j}$$
  We need $\frac{\partial z_g}{\partial z_f}$ to compute the gradient with respect to $z_f$ given the gradient with respect to $z_g$; this is the Jacobian matrix with respect to the input. Each entry $ij$ is the partial derivative of the $i$-th component of the output vector with respect to the $j$-th component of the input vector.

### Backprop through a multi-stage graph

![](https://i.imgur.com/QB7TvT3.png)

- Using the chain rule for vector functions:
  $$\frac{\partial c}{\partial z_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial z_{k+1}}{\partial z_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial f_k(z_k, w_k)}{\partial z_k}$$
  $$\frac{\partial c}{\partial w_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial z_{k+1}}{\partial w_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial f_k(z_k, w_k)}{\partial w_k}$$
  These are also obtained using the chain rule.
- Two Jacobian matrices for the module:
    - One with respect to $z[k]$
    - One with respect to $w[k]$

We need two Jacobian matrices here: one for the input states and one for the parameters.
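The screenshot in the PyTorch section above is not reproduced in this text, so the following is a rough sketch of the kind of class it describes: a 2-layer net in the lecture's sense (two linear + non-linear pairs), with placeholder layer sizes:

```python
import torch
from torch import nn

class TwoLayerNet(nn.Module):
    """Two (linear + non-linear) pairs; layer sizes are illustrative."""
    def __init__(self, d_in=784, d_hidden=100, d_out=10):
        super().__init__()
        # Each nn.Linear is a separate object because it owns its own
        # parameters (a weight matrix plus an implicit bias vector).
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        # torch.relu has no parameters, so no separate module is needed.
        z = torch.relu(self.layer1(x))   # s_1 = W_1 x + b_1 ; z_1 = h(s_1)
        return self.layer2(z)            # s_2 = W_2 z_1 + b_2

model = TwoLayerNet()
out = model(torch.randn(32, 784))        # autograd builds the backward graph
print(out.shape)                         # torch.Size([32, 10])
```

Because `forward` is an ordinary function, PyTorch traces the operations it performs and derives the gradient computation automatically.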
- [x] TBD by Yuxin Tang (yt1526)

---

# 02-2
---
layout: default
title: Lecture 2 - Part 2
authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan
date: 3 February 2020
---

## [A Concrete Example of Backpropagation and Intro to Basic Neural Network Modules](00:51:37-01:07:40)

### Example

We next consider a concrete example of backpropagation, assisted by a visual graph. The arbitrary function $G(w)$ is fed into the cost function $C$, which can be represented as a graph. Through the manipulation of multiplying by the Jacobian matrices, we can transform this graph into the graph that computes the gradients going backwards. Note that PyTorch and TensorFlow do this automatically for the user: the forward graph is automatically "reversed" to create the derivative graph that backpropagates the gradient.

In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that

$$\frac{dC(y,\bar{y})}{dw} = 1 \cdot \frac{dC(y,\bar{y})}{d\bar{y}} \cdot \frac{dG(x,w)}{dw}$$

Note that complications may arise when the architecture of the graph is not fixed but data-dependent. For example, there might be a condition in the neural net code that depends on the length of a vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount.

In terms of dimensions, $\frac{dC(y,\bar{y})}{dw}$ is a row vector of size $1 \times N$, where $N$ is the number of components of $w$; $\frac{dC(y,\bar{y})}{d\bar{y}}$ is a row vector of size $1 \times M$, where $M$ is the dimension of the output; and $\frac{d\bar{y}}{dw} = \frac{dG(x,w)}{dw}$ is a matrix of size $M \times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$.

### Basic Neural Net Modules

There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built from a combination of other, elementary modules).

- Linear: $Y = W \cdot X$
- ReLU: $y = \texttt{ReLU}(x)$
- Duplicate: $Y_1 = X$, $Y_2 = X$
    - Akin to a "splitter" where both outputs are equal to the input.
    - When backpropagating, the gradients get summed, i.e.
      $$\frac{dC}{dX} = \frac{dC}{dY_1} + \frac{dC}{dY_2}$$
- Add: $Y = X_1 + X_2$
    - With two variables being summed, when one is perturbed, the output is perturbed by the same quantity, i.e.
      $$\frac{dC}{dX_1} = \frac{dC}{dY} \cdot 1 \quad \text{and} \quad \frac{dC}{dX_2} = \frac{dC}{dY} \cdot 1$$
- Max: $Y = \texttt{max}(X_1, X_2)$
    - Since this function can also be represented as
      $$Y = \texttt{max}(X_1, X_2) = \begin{cases} X_1 & X_1 > X_2 \\ X_2 & \text{else} \end{cases}$$
      then
      $$\frac{dY}{dX_1} = \begin{cases} 1 & X_1 > X_2 \\ 0 & \text{else} \end{cases}$$
      Therefore, by the chain rule,
      $$\frac{dC}{dX_1} = \begin{cases} \frac{dC}{dY} \cdot 1 & X_1 > X_2 \\ 0 & \text{else} \end{cases}$$

## [SoftMax](01:07:41-01:23:18)

*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution, so softmax is commonly used in classification problems. The equation below gives $y_i$, the vector of probabilities over all the categories.
$$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

However, the use of softmax leaves the network susceptible to vanishing gradients. A vanishing gradient is a problem, as it prevents the weights from changing, and may completely stop the neural network from training further. The logistic sigmoid function, which is the softmax function for a single value, shows that when $s$ is large, $h(s)$ is 1, and when $s$ is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0$ and $h(s) = 1$, the gradient there is 0, which results in a vanishing gradient.

![](https://i.imgur.com/L4hklTj.png)

$$h(s) = \frac{1}{1 + e^{-s}}$$

Mathematicians came up with the idea of log-softmax in order to solve the vanishing-gradient issue created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log:

$$\log(y_i) = \log\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right) = x_i - \log\left(\sum_j e^{x_j}\right)$$

The equation below gives another way of looking at the same function. The figure below shows the $\log(1 + e^{s})$ part of the function: when $s$ is very small, its value is 0, and when $s$ is very large, its value is $s$. As a result it doesn't saturate, and the vanishing-gradient problem is avoided.

$$\log\left(\frac{e^{s}}{e^{s} + 1}\right) = s - \log(1 + e^{s})$$

![](https://i.imgur.com/RDFWVoN.png)

## [Practical Tricks for Backpropagation](01:23:19-01:45:15)

#### Use ReLU as the non-linear activation function

ReLU works best for networks with many layers, which has caused alternatives like the sigmoid and hyperbolic tangent ($\tanh$) functions to fall out of favor. The reason ReLU works best is likely its single kink, which makes it scale-equivariant.

#### Use cross-entropy loss as the objective function for classification problems

Log-softmax, discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log*-softmax as input (as opposed to normal softmax).

#### Use stochastic gradient descent on minibatches during training

As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient.

#### Shuffle the order of the training examples when using stochastic gradient descent

Order matters. If the model sees only examples from a single class during each training step, it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data were unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, and so on. Ideally, you should have samples from every class in every minibatch. However, there is ongoing debate over whether you need to change the order of the samples in every pass (epoch).

#### Normalize the inputs to have zero mean and unit variance

Before training, it is useful to normalize each input feature so that it has a mean of zero and a standard deviation of one.
When using RGB image data, for example, it is common to take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values of each individual image as

$$b_{[i,j]}' = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)}$$

where $\epsilon$ is an arbitrarily small number used to avoid division by zero. This is necessary to get a meaningful signal out of images taken in different lighting; for example, daylit pictures have a lot of red while underwater pictures have almost none.

#### Use a schedule to decrease the learning rate

The learning rate should fall as training goes on. In practice, however, most advanced models are trained with algorithms like Adam rather than simple SGD with a fixed learning rate.

#### Use L1 and/or L2 regularization for weight decay

You can add a cost for large weights to the cost function. For example, using L2 regularization with regularizer $R(w) = \|w\|^2$, we would define the loss $L$ and update the weights $w$ as follows:

$$L(S, w) = C(S, w) + \alpha \|w\|^2$$
$$\frac{\partial R}{\partial w_i} = 2w_i$$
$$w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i} = w_i - \eta \left( \frac{\partial C}{\partial w_i} + 2\alpha w_i \right)$$

To understand why this is called weight decay, note that we can rewrite the formula above to show that we multiply $w_i$ by a constant less than one during the update:

$$w_i \leftarrow (1 - 2\eta\alpha) w_i - \eta \frac{\partial C}{\partial w_i}$$

L1 regularization is similar, except that we use $\sum_i \vert w_i \vert$ instead of $\|w\|^2$.

#### Use dropout

Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all inputs rather than becoming overly reliant on a small number of them. This method was initially proposed by <a href="https://arxiv.org/abs/1207.0580">Hinton et al. (2012)</a>.

For more tricks, see <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">LeCun et al. (1998)</a>.

Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules.
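As a rough illustration (not from the lecture) of how several of these tricks might line up in a PyTorch training setup, here is a minimal sketch; the dataset, architecture, and hyperparameters are placeholder assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for a real dataset.
X = torch.randn(1000, 20)                 # assume features are already normalized
y = torch.randint(0, 10, (1000,))

# shuffle=True reorders the samples each epoch (shuffled minibatch SGD).
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),                            # ReLU as the non-linear activation
    nn.Dropout(p=0.5),                    # dropout: zero half the activations (on average)
    nn.Linear(64, 10),
)

# CrossEntropyLoss applies log-softmax internally, then the negative
# log-likelihood loss, so the network outputs raw scores.
criterion = nn.CrossEntropyLoss()
# weight_decay adds the L2 penalty to the parameter update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
# Learning-rate schedule: decay the step size as training goes on.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```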
