
Weight Initialization In Neural Networks


Remember: Standard Deviation = √Variance

Variance: σ²
Standard deviation: σ


What not to do

Zero Initialization

All weights are set to zero. This creates what are known as dead neurons, because the information fed into each neuron is zero no matter the input xi.

Input fed to a neuron = xi * wi
As wi = 0,
xi * wi = 0

  • Because of this, during backpropagation the gradient ∇ is also zero (see the backpropagation weight update rule), so the weights never change.
  • The network fails to learn the intricacies of the input data, or in other words, fails to map the relationship between the input and the output (a minimal demonstration follows this list).
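
To make this concrete, here is a minimal NumPy sketch. The tiny two-layer tanh network, the MSE loss, and all sizes are my own choices for illustration; the point is only that with all-zero weights both weight gradients come out exactly zero on the very first backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features (made-up data)
y = rng.normal(size=(4, 1))          # regression targets

W1 = np.zeros((3, 5))                # zero-initialized weights
W2 = np.zeros((5, 1))

h = np.tanh(x @ W1)                  # hidden activations: all zeros
y_hat = h @ W2                       # predictions: all zeros

d_out = 2 * (y_hat - y) / len(x)     # non-zero error signal (MSE gradient)
grad_W2 = h.T @ d_out                # zero, because h is all zeros
grad_W1 = x.T @ ((d_out @ W2.T) * (1 - h**2))  # zero, because W2 is all zeros

print(grad_W1.max(), grad_W2.max())  # both print 0.0 -- the network cannot learn
```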

Symmetrical/Constant Initialization

All weights are assigned the same constant value. This is also a bad idea: every neuron in a layer then computes the same output and receives the same gradient, so the neurons remain identical. Although the outputs are not zero, the neurons never learn distinct features.
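Here is a minimal sketch of this symmetry problem. The small tanh network, the constant value 0.5, and the squared-error loss are arbitrary choices of mine; the point is that all hidden units compute the same activation and receive the same gradient, so they can never become different.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

W1 = np.full((3, 5), 0.5)            # every weight equal to the same constant
W2 = np.full((5, 1), 0.5)

h = np.tanh(x @ W1)                  # all 5 hidden columns are identical
d_out = 2 * ((h @ W2) - y) / len(x)
grad_W1 = x.T @ ((d_out @ W2.T) * (1 - h**2))

print(np.allclose(h, h[:, :1]))            # True: identical activations
print(np.allclose(grad_W1, grad_W1[:, :1]))  # True: identical gradient columns
```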

Random Initialization

Randomly select values from a Gaussian distribution with mean µ and standard deviation σ:

N(µ, σ)

The question is how to choose the standard deviation σ of these random weights. If σ is very small, the weights are nearly identical and we are back to something close to symmetric initialization; if σ is large, the signal grows or saturates as it passes through the layers, and we run into the exploding and vanishing gradient problems. The sketch below illustrates both failure modes.
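
A minimal NumPy sketch of why σ matters; the depth, width, and the two σ values are arbitrary choices of mine, and the layers are purely linear to keep the scaling effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))                   # unit-variance input batch

for sigma in (0.01, 1.0):
    h = x
    for _ in range(10):                          # 10 linear layers
        W = rng.normal(0.0, sigma, size=(256, 256))
        h = h @ W
    print(f"sigma={sigma}: activation std after 10 layers = {h.std():.3e}")
# Small sigma -> activations collapse toward zero; large sigma -> they explode.
```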


Solutions

Maintain a carefully chosen variance in the weights so that the signal keeps a similar scale from layer to layer.

LeCun Initialization

Weights are drawn in such a way that the variance of a neuron's output matches the variance of its inputs.

The output of a neuron with a linear activation function is given by
y = w1x1 + w2x2 + ... + wNxN + b

var(y) = var(w1x1 + w2x2 + ... + wNxN + b)

Since the bias is a constant, it has zero variance, so we drop the bias term.

Assuming the weights and inputs are independent and zero-mean, var(wixi) = var(wi)var(xi), so

var(y) = var(w1)var(x1) + var(w2)var(x2) + ... + var(wN)var(xN)

As the weights are i.i.d. (independent and identically distributed) and the inputs share the same variance,

var(y) = N var(w) var(x)
where N is the dimension of the input vector. Since our goal is to match the variance of the output to the variance of the input,
N var(w) = 1
var(w) = 1/N

LeCun's initialization therefore suggests drawing the weights from a Gaussian with mean 0 and standard deviation equal to 1/√N.
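
A minimal sketch of this rule; the helper name lecun_init and the layer sizes are my own choices for illustration.

```python
import numpy as np

def lecun_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of LeCun initialization: W ~ N(0, 1 / fan_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Sanity check: with unit-variance inputs of dimension N = 300,
# the variance of the linear output stays close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(2048, 300))
y = x @ lecun_init(300, 100, rng)
print(round(x.var(), 3), round(y.var(), 3))   # both close to 1
```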

Xavier Glorot Initialization

For backpropagation to work well, the variance should also account for the backward pass through the network.
The weight distribution should be a Gaussian with zero mean and variance given by the following formula:

var(w) = 2 / (fan_in + fan_out), where fan_in is the number of inputs coming into the layer and fan_out is the number of outputs going to the next layer.

W ~ N(0, 2 / (fan_in + fan_out))
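
A minimal sketch of this rule; the helper name glorot_init and the example sizes are my own choices for illustration.

```python
import numpy as np

def glorot_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of Xavier/Glorot initialization: W ~ N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: a layer with 300 inputs and 100 outputs.
W = glorot_init(300, 100, np.random.default_rng(0))
print(round(W.std(), 4))   # close to sqrt(2 / 400) ≈ 0.0707
```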

He Initialization

The ReLU function is defined as f(x) = max(0, x).
It is not a zero-mean function: roughly half of the pre-activations are set to zero, which breaks the zero-mean assumption used above and halves the variance of the signal.
To account for this, we slightly modify the Xavier/Glorot method.

W ~ N(0, 2 / fan_in)
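
A minimal sketch of this rule; the helper name he_init, the depth, and the widths are my own choices for illustration.

```python
import numpy as np

def he_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of He initialization: W ~ N(0, 2 / fan_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Sanity check: ReLU discards roughly half of the signal's second moment,
# and the factor of 2 compensates, so the pre-activation variance stays
# roughly constant as we stack ReLU layers.
rng = np.random.default_rng(0)
h = rng.normal(size=(4096, 512))             # pre-activations, variance ~1
for _ in range(5):
    h = np.maximum(0.0, h) @ he_init(512, 512, rng)
    print(round(h.var(), 3))                 # stays near 1 instead of shrinking or exploding
```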


Summary

  1. Zero Initialization doesn't work, and neither does initializing to some constant. Both lead to what's called the symmetry problem.
  2. Random Initialization can be used to break the symmetry. But if the weights are too small, the activations lose variance as we go deeper into the network; if the weights are too large, the activations saturate.
  3. LeCun Initialization makes sure the activations keep a healthy variance on the forward pass, but the gradients on the backward pass can still suffer.
  4. Xavier Initialization maintains the same well-behaved distribution for both the forward pass and backpropagation.
  5. But Xavier Initialization fails for ReLU; for ReLU networks we use He Initialization instead.