# 02-1
---
layout: default
title: Lecture 2 - Part 1
authors: Amartya Prasad, Dongning Fang, Yuxin Tang, Sahana Upadhya
date: 3 February 2020
---

## [Backpropagation and Gradient-Based Methods](00:00:00-00:17:23)

### Parametrized Models

$$\bar{y} = G(x,w)$$

Parametrized models are simply functions that depend on inputs and trainable parameters. There is no fundamental difference between the two, except that trainable parameters are shared across training samples while the input varies from sample to sample. In most deep learning frameworks, parameters are implicit; that is, they aren't passed when the function is called. They are "saved inside the function", so to speak, at least in the object-oriented versions of models.

The parametrized model (function) takes in an input, has a parameter vector, and produces an output. In supervised learning, this output goes into the cost function $C(y,\bar{y})$, which compares the true output $y$ with the model output $\bar{y}$. The computation graph for this model is shown in Figure 1.

| <img src="https://i.imgur.com/XB2nQ0P.jpg" style="zoom: 50%"> |
|--|
| Figure 1: Computation graph representation for a parametrized model |

Examples of parametrized functions:

- Linear model - a weighted sum of the components of the input vector:
  $$\bar{y} = \sum_i w_i x_i \text{ ; } C(y,\bar{y}) = \|y - \bar{y}\|^2$$
- Nearest neighbour - there is an input $x$ and a weight matrix $W$ whose rows are indexed by $k$. The output is the $k$ corresponding to the row of $W$ closest to $x$:
  $$\bar{y} = \text{argmin}_k \|x - w_{k,\cdot}\|^2$$

Parametrized models can also involve much more complicated functions.

### Block diagram notations for computation graphs

- Variables (tensor, scalar, continuous, discrete)
    - <img src="https://i.imgur.com/d8ecASC.png" style="zoom:50%"> is an observed input to the system
    - <img src="https://i.imgur.com/ZR8qnJB.png" style="zoom:50%"> is a computed variable, produced by a deterministic function
- Deterministic functions
    <center><img src="https://i.imgur.com/YqsXYE9.png" style="zoom:50%"></center>
    - Take in multiple inputs and can give multiple outputs
    - Have an implicit parameter variable ($w$)
    - The rounded side tells us in which direction the function is easy to compute. In the diagram above, it is easier to compute $\bar{y}$ from $x$ than the other way around
- Scalar-valued function
    <center><img src="https://i.imgur.com/Eyjy0K2.png" style="zoom:50%"></center>
    - Used to represent cost functions
    - Has an implicit scalar output
    - Takes multiple inputs and outputs a single value (usually the distance between the inputs)

### Loss functions

The loss function is the function that is minimized during training. There are two types of loss:

1) Per-sample loss:
   $$L(x,y,w) = C(y, G(x,w))$$
2) Average loss: for any set of samples
   $$S = \{(x[p],y[p]) \mid p = 0, 1, ..., P-1\}$$
   the average loss over the set $S$ is given by:
   $$L(S,w) = \frac{1}{P} \sum_{(x,y)} L(x,y,w)$$

In the standard supervised learning paradigm, the (per-sample) loss is simply the output of the cost function. Machine learning is mostly about optimizing functions, usually minimizing them. It could also involve finding Nash equilibria between two functions, as with GANs. This is done using gradient-based methods, though not necessarily gradient descent.

### Gradient descent

A gradient-based method is a method/algorithm that finds the minima of a function, assuming that one can easily compute the gradient of that function.
It assumes that the function is continuous and almost-everywhere differentiable (it need not be differentiable everywhere).

*Insert picture here*

Intuitively, gradient descent is like being on a mountain in fog at night: you want to go down to the village, but you can't see anything, so you look around for the direction of steepest descent and take a step in that direction.

Algorithm (full-batch gradient descent):

$$w \leftarrow w - \eta \frac{\partial L(S,w)}{\partial w}$$

For SGD (Stochastic Gradient Descent), the algorithm becomes: pick a $p$ in $0...P-1$, then update

$$w \leftarrow w - \eta \frac{\partial L(x[p], y[p], w)}{\partial w}$$

where $w$ represents the parameter to be optimized. $\eta$ is a constant here, but in more sophisticated algorithms it could be a matrix. If it is a positive semi-definite matrix, we will still move downhill, but not necessarily in the direction of steepest descent. In fact, the direction of steepest descent may not always be the direction we want to move in.

If the function is not differentiable, i.e. it has a hole, is staircase-like, or is flat, so that the gradient gives no information, one has to resort to other methods - so-called 0-th order methods or gradient-free methods. Deep learning is all about gradient-based methods.

However, RL (Reinforcement Learning) involves gradient estimation without the explicit gradient. An example is a robot learning to ride a bike, where the robot falls every now and then. The objective function measures how long the bike stays up without falling. Unfortunately, there is no gradient for this objective function; the robot needs to try different things.

The RL cost function is not differentiable most of the time, but the network that computes the output that goes into the cost function is, and from that point on, everything is gradient-based. This is the main difference between supervised learning and reinforcement learning: with the latter, the cost function $C$ is not differentiable. In fact, it is completely unknown; it just returns an output when inputs are fed to it, like a black box. This makes RL highly inefficient, which is one of its main drawbacks - particularly when the parameter vector is high-dimensional (which implies a huge solution space to search, making it hard to know where to move).

A very popular technique in RL is Actor-Critic methods. A critic method basically consists of having a second $C$ module that is known and trainable. One trains this $C$ module, which is differentiable, to approximate the cost/reward function (the reward is a negative cost, more like a punishment). That is a way of making the cost function differentiable, or at least approximating it by a differentiable function, so that one can backpropagate.

## [Advantages of SGD, Backpropagation for Traditional Neural Nets](00:18:27-00:38:00)

### Advantages of Stochastic Gradient Descent

In practice, we use stochastic gradient descent to compute the gradient of the objective function w.r.t. the parameters. Instead of computing the full gradient of the objective function, which is the average over all samples, SGD just takes one sample, computes the loss $L$ and the gradient of the loss w.r.t. the parameters, and then takes one step in the negative gradient direction:

$$w \leftarrow w - \eta \frac{\partial L(x[p], y[p], w)}{\partial w}$$

In the formula, $w$ is updated by subtracting the step size times the gradient of the per-sample loss w.r.t. the parameters, for a given sample $(x[p], y[p])$.
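To make the update concrete, here is a minimal sketch (not from the lecture) of one per-sample SGD step using PyTorch autograd; the linear model, squared-error cost, and step size are placeholder assumptions:

```python
import torch

# Hypothetical linear model y_bar = G(x, w) = w . x with a squared-error cost,
# matching the per-sample update w <- w - eta * dL/dw from the notes.
torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)   # trainable parameter vector
eta = 0.01                               # step size (learning rate)

def sgd_step(x, y):
    y_bar = w @ x                        # G(x, w): weighted sum of the inputs
    loss = (y - y_bar) ** 2              # C(y, y_bar): squared error
    loss.backward()                      # compute dL/dw by backpropagation
    with torch.no_grad():
        w -= eta * w.grad                # step in the negative gradient direction
        w.grad.zero_()                   # reset the accumulated gradient
    return loss.item()

# One step on a single sample p:
x_p, y_p = torch.randn(5), torch.randn(())
print(sgd_step(x_p, y_p))
```

Calling `sgd_step` repeatedly on randomly chosen samples is exactly the per-sample update above.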
If we do this on a single sample, we get a very noisy trajectory, as shown in Figure 2. Instead of going directly downhill, the parameter vector moves stochastically: every sample pulls it in a different direction, and it is only the average that pulls us toward the minimum of the average loss. Although it looks inefficient, it is much faster than batch gradient descent, at least in the context of machine learning, when the samples have some redundancy between them.

![](https://i.imgur.com/Au9xWTv.png)

In practice, we use batches instead of doing stochastic gradient descent on a single sample. We compute the average of the gradient over a batch of samples, not a single sample, and then take one step. The only reason for doing this is that the hardware available to us (i.e. GPUs, multicore CPUs) is more efficient with batches: it is easier to parallelize and get more efficient computation. Batching is the simplest way to parallelize.

### Traditional Neural Networks

Traditional neural nets are basically interspersed layers of linear operations and point-wise non-linear operations. For the linear operations, conceptually it is just a matrix-vector multiplication: we take the (input) vector and multiply it by a matrix formed from the weights. The second type of operation takes all the components of the weighted-sums vector and passes them through some simple non-linearity (i.e. ReLU, tanh, ...).

![](https://i.imgur.com/KowUjtY.png)

Above is an example of a 2-layer network, because what matters are the pairs (i.e. linear + non-linear). Some people call it a 3-layer network because they count the variables, but Prof. LeCun thinks that is not fair. Note that if there were no non-linearities in the middle, we might as well have a single layer, because the composition of two linear functions is a linear function.

Below is how the linear and non-linear functional blocks stack:

![](https://i.imgur.com/eK2u920.png)

In the graph, $s[i]$ is the weighted sum of unit $i$, computed as:

$$s[i] = \sum_{j \in UP(i)} w[i,j] \cdot z[j]$$

where $UP(i)$ denotes the set of predecessors of $i$ and $z[j]$ is the $j$-th output of the previous layer. The output $z[i]$ is computed as:

$$z[i] = f(s[i])$$

where $f$ is a non-linear function.

### Backpropagation through a non-linear function

The first way to do backpropagation is to backpropagate through a non-linear function. We take a particular non-linear function $h$ from the network and leave everything else in a black box.

![](https://i.imgur.com/R3yhRUc.png)

We are going to use the chain rule to compute the gradients:

$$g(h(s))' = g'(h(s)) \cdot h'(s)$$

where $h'(s)$ is the derivative of $z$ w.r.t. $s$, i.e. $\frac{dz}{ds}$. To make the connection between the derivatives clear, we rewrite the formula as:

$$\frac{dC}{ds} = \frac{dC}{dz} \cdot \frac{dz}{ds} = \frac{dC}{dz} \cdot h'(s)$$

Hence, if we have a chain of such functions in the network, we can backpropagate by multiplying by the derivatives of all the $h$ functions, one after the other, all the way back to the bottom.

It is more intuitive to think of it in terms of perturbations. Perturbing $s$ by $ds$ will perturb $z$ by:

$$dz = ds \cdot h'(s)$$

This in turn will perturb $C$ by:

$$dC = dz \cdot \frac{dC}{dz} = ds \cdot h'(s) \cdot \frac{dC}{dz}$$

ending up with the same formula as above.

### Backpropagation through a weighted sum

For a linear module, we do backpropagation through a weighted sum. Here we view the entire network as a black box except for 3 connections going from a $z$ variable to a bunch of $s$ variables.

![](https://i.imgur.com/vme3dwp.png)

This time the perturbation is a weighted sum.
$z$ influences several variables. Perturbing $z$ by $dz$ will perturb $s[0]$, $s[1]$ and $s[2]$ by:

$$ds[0] = w[0] \cdot dz$$
$$ds[1] = w[1] \cdot dz$$
$$ds[2] = w[2] \cdot dz$$

This will perturb $C$ by:

$$dC = ds[0] \cdot \frac{dC}{ds[0]} + ds[1] \cdot \frac{dC}{ds[1]} + ds[2] \cdot \frac{dC}{ds[2]}$$

Hence $C$ is going to vary by the sum of the 3 variations:

$$\frac{dC}{dz} = \frac{dC}{ds[0]} \cdot w[0] + \frac{dC}{ds[1]} \cdot w[1] + \frac{dC}{ds[2]} \cdot w[2]$$

## [Implementation of Backpropagation](00:38:01-00:51:37)

### Block diagram of a traditional neural net

- Linear blocks: $$s_{k+1} = w_k z_k$$
- Non-linear blocks: $$z_k = h(s_k)$$

![](https://i.imgur.com/aYWO6gO.png)

Here $w_k$ is a matrix, $z_k$ is a vector, and $h$ is the application of the scalar function $h$ to every component. This is a 3-layer neural net with pairs of linear and non-linear functions, though most modern neural nets do not have such clear linear/non-linear separations and are more complex.

### PyTorch

![](https://i.imgur.com/pGYKHSC.png)

- We can implement neural nets with object-oriented classes in PyTorch. First we define a class for the neural net and initialize the linear layers in the constructor using the predefined `nn.Linear` class. Linear layers have to be separate objects because each of them contains its own parameter vector; the `nn.Linear` class also adds the bias vector implicitly. Then we define a `forward` function that computes the outputs, using the `torch.relu` function as the non-linear activation. We don't have to initialize separate ReLU functions because they have no parameters. (A sketch of such a class appears at the end of this part.)
- We do not need to compute the gradients ourselves, since PyTorch knows how to backpropagate and transform the gradients given the forward function.

### Backprop through a functional module

![](https://i.imgur.com/VxBrCQo.png)

- Using the chain rule for vector functions:
  $$z_g : [d_g \times 1]$$
  $$z_f : [d_f \times 1]$$
  $$\frac{\partial c}{\partial z_f} = \frac{\partial c}{\partial z_g} \frac{\partial z_g}{\partial z_f}$$
  $$[1 \times d_f] = [1 \times d_g] \times [d_g \times d_f]$$
  This is the basic formula for $\frac{\partial c}{\partial z_f}$ using the chain rule. Note that the gradient of a scalar function with respect to a vector is a vector of the same size as the vector with respect to which you differentiate. In order to keep the notation consistent, it is a row vector instead of a column vector.
- Jacobian matrix:
  $$\left(\frac{\partial z_g}{\partial z_f}\right)_{ij} = \frac{(\partial z_g)_i}{(\partial z_f)_j}$$
  We need $\frac{\partial z_g}{\partial z_f}$ to compute the gradient with respect to $z_f$ given the gradient with respect to $z_g$; this is the Jacobian matrix with respect to the input. Each entry $ij$ is the partial derivative of the $i$-th component of the output vector with respect to the $j$-th component of the input vector.

### Backprop through a multi-stage graph

![](https://i.imgur.com/QB7TvT3.png)

- Using the chain rule for vector functions:
  $$\frac{\partial c}{\partial z_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial z_{k+1}}{\partial z_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial f_k(z_k, w_k)}{\partial z_k}$$
  $$\frac{\partial c}{\partial w_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial z_{k+1}}{\partial w_k} = \frac{\partial c}{\partial z_{k+1}} \frac{\partial f_k(z_k, w_k)}{\partial w_k}$$
  These are also obtained using the chain rule.
- Two Jacobian matrices for the module:
    - One with respect to $z[k]$
    - One with respect to $w[k]$

We need two Jacobian matrices here: one for the input states and one for the parameters.
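The screenshot in the PyTorch section above is not reproduced in this text, so the following is a rough sketch of the kind of class it describes: a 2-layer net in the lecture's sense (two linear + non-linear pairs), with placeholder layer sizes:

```python
import torch
from torch import nn

class TwoLayerNet(nn.Module):
    """Two (linear + non-linear) pairs; layer sizes are illustrative."""
    def __init__(self, d_in=784, d_hidden=100, d_out=10):
        super().__init__()
        # Each nn.Linear is a separate object because it owns its own
        # parameters (a weight matrix plus an implicit bias vector).
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        # torch.relu has no parameters, so no separate module is needed.
        z = torch.relu(self.layer1(x))   # s_1 = W_1 x + b_1 ; z_1 = h(s_1)
        return self.layer2(z)            # s_2 = W_2 z_1 + b_2

model = TwoLayerNet()
out = model(torch.randn(32, 784))        # autograd builds the backward graph
print(out.shape)                         # torch.Size([32, 10])
```

Because `forward` is an ordinary function, PyTorch traces the operations it performs and derives the gradient computation automatically.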
- [x] TBD by Yuxin Tang (yt1526)

---

# 02-2
---
layout: default
title: Lecture 2 - Part 2
authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan
date: 3 February 2020
---

## [A Concrete Example of Backpropagation and Intro to Basic Neural Network Modules](00:51:37-01:07:40)

### Example

We next consider a concrete example of backpropagation, assisted by a visual graph. The arbitrary function $G(w)$ is fed into the cost function $C$, which can be represented as a graph. Through the manipulation of multiplying by the Jacobian matrices, we can transform this graph into the graph that computes the gradients going backwards. Note that PyTorch and TensorFlow do this automatically for the user: the forward graph is automatically "reversed" to create the derivative graph that backpropagates the gradient.

In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that

$$\frac{dC(y,\bar{y})}{dw} = 1 \cdot \frac{dC(y,\bar{y})}{d\bar{y}} \cdot \frac{dG(x,w)}{dw}$$

Note that complications may arise when the architecture of the graph is not fixed but data-dependent. For example, there might be a condition in the neural net code that depends on the length of a vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount.

In terms of dimensions, $\frac{dC(y,\bar{y})}{dw}$ is a row vector of size $1 \times N$, where $N$ is the number of components of $w$; $\frac{dC(y,\bar{y})}{d\bar{y}}$ is a row vector of size $1 \times M$, where $M$ is the dimension of the output; and $\frac{d\bar{y}}{dw} = \frac{dG(x,w)}{dw}$ is a matrix of size $M \times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$.

### Basic Neural Net Modules

There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built from a combination of other, elementary modules).

- Linear: $Y = W \cdot X$
- ReLU: $y = \texttt{ReLU}(x)$
- Duplicate: $Y_1 = X$, $Y_2 = X$
    - Akin to a "splitter" where both outputs are equal to the input.
    - When backpropagating, the gradients get summed, i.e.
      $$\frac{dC}{dX} = \frac{dC}{dY_1} + \frac{dC}{dY_2}$$
- Add: $Y = X_1 + X_2$
    - With two variables being summed, when one is perturbed, the output is perturbed by the same quantity, i.e.
      $$\frac{dC}{dX_1} = \frac{dC}{dY} \cdot 1 \quad \text{and} \quad \frac{dC}{dX_2} = \frac{dC}{dY} \cdot 1$$
- Max: $Y = \texttt{max}(X_1, X_2)$
    - Since this function can also be represented as
      $$Y = \texttt{max}(X_1, X_2) = \begin{cases} X_1 & X_1 > X_2 \\ X_2 & \text{else} \end{cases}$$
      then
      $$\frac{dY}{dX_1} = \begin{cases} 1 & X_1 > X_2 \\ 0 & \text{else} \end{cases}$$
      Therefore, by the chain rule,
      $$\frac{dC}{dX_1} = \begin{cases} \frac{dC}{dY} \cdot 1 & X_1 > X_2 \\ 0 & \text{else} \end{cases}$$

## [SoftMax](01:07:41-01:23:18)

*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution, so softmax is commonly used in classification problems. The equation below gives $y_i$, the vector of probabilities over all the categories.
$$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

However, the use of softmax leaves the network susceptible to vanishing gradients. A vanishing gradient is a problem, as it prevents the weights from changing, and may completely stop the neural network from training further. The logistic sigmoid function, which is the softmax function for a single value, shows that when $s$ is large, $h(s)$ is 1, and when $s$ is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0$ and $h(s) = 1$, the gradient there is 0, which results in a vanishing gradient.

![](https://i.imgur.com/L4hklTj.png)

$$h(s) = \frac{1}{1 + e^{-s}}$$

Mathematicians came up with the idea of log-softmax in order to solve the vanishing-gradient issue created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log:

$$\log(y_i) = \log\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right) = x_i - \log\left(\sum_j e^{x_j}\right)$$

The equation below gives another way of looking at the same function. The figure below shows the $\log(1 + e^{s})$ part of the function: when $s$ is very small, its value is 0, and when $s$ is very large, its value is $s$. As a result it doesn't saturate, and the vanishing-gradient problem is avoided.

$$\log\left(\frac{e^{s}}{e^{s} + 1}\right) = s - \log(1 + e^{s})$$

![](https://i.imgur.com/RDFWVoN.png)

## [Practical Tricks for Backpropagation](01:23:19-01:45:15)

#### Use ReLU as the non-linear activation function

ReLU works best for networks with many layers, which has caused alternatives like the sigmoid and hyperbolic tangent ($\tanh$) functions to fall out of favor. The reason ReLU works best is likely its single kink, which makes it scale-equivariant.

#### Use cross-entropy loss as the objective function for classification problems

Log-softmax, discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log*-softmax as input (as opposed to normal softmax).

#### Use stochastic gradient descent on minibatches during training

As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient.

#### Shuffle the order of the training examples when using stochastic gradient descent

Order matters. If the model sees only examples from a single class during each training step, it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data were unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, and so on. Ideally, you should have samples from every class in every minibatch. However, there is ongoing debate over whether you need to change the order of the samples in every pass (epoch).

#### Normalize the inputs to have zero mean and unit variance

Before training, it is useful to normalize each input feature so that it has a mean of zero and a standard deviation of one.
When using RGB image data, for example, it is common to take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values of each individual image as

$$b_{[i,j]}' = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)}$$

where $\epsilon$ is an arbitrarily small number used to avoid division by zero. This is necessary to get a meaningful signal out of images taken in different lighting; for example, daylit pictures have a lot of red while underwater pictures have almost none.

#### Use a schedule to decrease the learning rate

The learning rate should fall as training goes on. In practice, however, most advanced models are trained with algorithms like Adam rather than simple SGD with a fixed learning rate.

#### Use L1 and/or L2 regularization for weight decay

You can add a cost for large weights to the cost function. For example, using L2 regularization with regularizer $R(w) = \|w\|^2$, we would define the loss $L$ and update the weights $w$ as follows:

$$L(S, w) = C(S, w) + \alpha \|w\|^2$$
$$\frac{\partial R}{\partial w_i} = 2w_i$$
$$w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i} = w_i - \eta \left( \frac{\partial C}{\partial w_i} + 2\alpha w_i \right)$$

To understand why this is called weight decay, note that we can rewrite the formula above to show that we multiply $w_i$ by a constant less than one during the update:

$$w_i \leftarrow (1 - 2\eta\alpha) w_i - \eta \frac{\partial C}{\partial w_i}$$

L1 regularization is similar, except that we use $\sum_i \vert w_i \vert$ instead of $\|w\|^2$.

#### Use dropout

Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all inputs rather than becoming overly reliant on a small number of them. This method was initially proposed by <a href="https://arxiv.org/abs/1207.0580">Hinton et al. (2012)</a>.

For more tricks, see <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">LeCun et al. (1998)</a>.

Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules.
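As a rough illustration (not from the lecture) of how several of these tricks might line up in a PyTorch training setup, here is a minimal sketch; the dataset, architecture, and hyperparameters are placeholder assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for a real dataset.
X = torch.randn(1000, 20)                 # assume features are already normalized
y = torch.randint(0, 10, (1000,))

# shuffle=True reorders the samples each epoch (shuffled minibatch SGD).
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),                            # ReLU as the non-linear activation
    nn.Dropout(p=0.5),                    # dropout: zero half the activations (on average)
    nn.Linear(64, 10),
)

# CrossEntropyLoss applies log-softmax internally, then the negative
# log-likelihood loss, so the network outputs raw scores.
criterion = nn.CrossEntropyLoss()
# weight_decay adds the L2 penalty to the parameter update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
# Learning-rate schedule: decay the step size as training goes on.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```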
