
Weight Initialization In Neural Networks


Remember: Standard Deviation = √Variance

Variance: σ²
Standard deviation: σ


What not to do

Zero Initialization

All weights are set to zero. This creates what are known as dead neurons, because the information fed into each neuron is zero no matter the input xi.

Input fed to a neuron = xi * wi
As wi = 0,
xi * wi = 0

  • Because of this, during backpropagation the gradient ∇ is also zero (see the backpropagation weight update rule), so the weights never change.
  • The network fails to learn the intricacies of the input data, or in other words, fails to map the relationship between the input and the output (a minimal demonstration follows this list).
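
To make this concrete, here is a minimal NumPy sketch. The tiny two-layer tanh network, the MSE loss, and all sizes are my own choices for illustration; the point is only that with all-zero weights both weight gradients come out exactly zero on the very first backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features (made-up data)
y = rng.normal(size=(4, 1))          # regression targets

W1 = np.zeros((3, 5))                # zero-initialized weights
W2 = np.zeros((5, 1))

h = np.tanh(x @ W1)                  # hidden activations: all zeros
y_hat = h @ W2                       # predictions: all zeros

d_out = 2 * (y_hat - y) / len(x)     # non-zero error signal (MSE gradient)
grad_W2 = h.T @ d_out                # zero, because h is all zeros
grad_W1 = x.T @ ((d_out @ W2.T) * (1 - h**2))  # zero, because W2 is all zeros

print(grad_W1.max(), grad_W2.max())  # both print 0.0 -- the network cannot learn
```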

Symmetrical/Constant Initialization

All weights are assigned the same constant value. This is also a bad idea: every neuron in a layer then computes the same output and receives the same gradient, so the neurons remain identical. Although the outputs are not zero, the neurons never learn distinct features.
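Here is a minimal sketch of this symmetry problem. The small tanh network, the constant value 0.5, and the squared-error loss are arbitrary choices of mine; the point is that all hidden units compute the same activation and receive the same gradient, so they can never become different.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

W1 = np.full((3, 5), 0.5)            # every weight equal to the same constant
W2 = np.full((5, 1), 0.5)

h = np.tanh(x @ W1)                  # all 5 hidden columns are identical
d_out = 2 * ((h @ W2) - y) / len(x)
grad_W1 = x.T @ ((d_out @ W2.T) * (1 - h**2))

print(np.allclose(h, h[:, :1]))            # True: identical activations
print(np.allclose(grad_W1, grad_W1[:, :1]))  # True: identical gradient columns
```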

Random Initialization

Randomly select values from a Gaussian distribution with mean µ and standard deviation σ:

N(µ, σ)

The question is how to choose the standard deviation σ of these random weights. If σ is very small, the weights are nearly identical and we are back to something close to symmetric initialization; if σ is large, the signal grows or saturates as it passes through the layers, and we run into the exploding and vanishing gradient problems. The sketch below illustrates both failure modes.
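
A minimal NumPy sketch of why σ matters; the depth, width, and the two σ values are arbitrary choices of mine, and the layers are purely linear to keep the scaling effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))                   # unit-variance input batch

for sigma in (0.01, 1.0):
    h = x
    for _ in range(10):                          # 10 linear layers
        W = rng.normal(0.0, sigma, size=(256, 256))
        h = h @ W
    print(f"sigma={sigma}: activation std after 10 layers = {h.std():.3e}")
# Small sigma -> activations collapse toward zero; large sigma -> they explode.
```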


Solutions

Maintain a carefully chosen variance in the weights so that the signal keeps a similar scale from layer to layer.

LeCun Initialization

Weights are drawn in such a way that the variance of a neuron's output matches the variance of its inputs.

The output of a neuron with a linear activation function is given by
y = w1x1 + w2x2 + ... + wNxN + b

var(y) = var(w1x1 + w2x2 + ... + wNxN + b)

Since the bias is a constant, it has zero variance, so we drop the bias term.

Assuming the weights and inputs are independent and zero-mean, var(wixi) = var(wi)var(xi), so

var(y) = var(w1)var(x1) + var(w2)var(x2) + ... + var(wN)var(xN)

As the weights are i.i.d. (independent and identically distributed) and the inputs share the same variance,

var(y) = N var(w) var(x)
where N is the dimension of the input vector. Since our goal is to match the variance of the output to the variance of the input,
N var(w) = 1
var(w) = 1/N

LeCun's initialization therefore suggests drawing the weights from a Gaussian with mean 0 and standard deviation equal to 1/√N.
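
A minimal sketch of this rule; the helper name lecun_init and the layer sizes are my own choices for illustration.

```python
import numpy as np

def lecun_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of LeCun initialization: W ~ N(0, 1 / fan_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Sanity check: with unit-variance inputs of dimension N = 300,
# the variance of the linear output stays close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(2048, 300))
y = x @ lecun_init(300, 100, rng)
print(round(x.var(), 3), round(y.var(), 3))   # both close to 1
```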

Xavier Glorot Initialization

For backpropagation to work well, the variance should also account for the backward pass through the network.
The weight distribution should be a Gaussian with zero mean and variance given by the following formula:

var(w) = 2 / (fan_in + fan_out), where fan_in is the number of inputs coming into the layer and fan_out is the number of outputs going to the next layer.

W ~ N(0, 2 / (fan_in + fan_out))
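
A minimal sketch of this rule; the helper name glorot_init and the example sizes are my own choices for illustration.

```python
import numpy as np

def glorot_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of Xavier/Glorot initialization: W ~ N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: a layer with 300 inputs and 100 outputs.
W = glorot_init(300, 100, np.random.default_rng(0))
print(round(W.std(), 4))   # close to sqrt(2 / 400) ≈ 0.0707
```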

He Initialization

The ReLU function is defined as f(x) = max(0, x).
It is not a zero-mean function: roughly half of the pre-activations are set to zero, which breaks the zero-mean assumption used above and halves the variance of the signal.
To account for this, we slightly modify the Xavier/Glorot method.

W ~ N(0, 2 / fan_in)
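
A minimal sketch of this rule; the helper name he_init, the depth, and the widths are my own choices for illustration.

```python
import numpy as np

def he_init(fan_in: int, fan_out: int, rng=None):
    """Sketch of He initialization: W ~ N(0, 2 / fan_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Sanity check: ReLU discards roughly half of the signal's second moment,
# and the factor of 2 compensates, so the pre-activation variance stays
# roughly constant as we stack ReLU layers.
rng = np.random.default_rng(0)
h = rng.normal(size=(4096, 512))             # pre-activations, variance ~1
for _ in range(5):
    h = np.maximum(0.0, h) @ he_init(512, 512, rng)
    print(round(h.var(), 3))                 # stays near 1 instead of shrinking or exploding
```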


Summary

  1. Zero Initialization doesn't work, and neither does initializing to some constant. Both lead to what's called the symmetry problem.
  2. Random Initialization can be used to break the symmetry. But if the weights are too small, the activations lose variance as we go deeper into the network; if the weights are too large, the activations saturate.
  3. LeCun Initialization makes sure the activations keep a healthy variance on the forward pass, but the gradients on the backward pass can still suffer.
  4. Xavier Initialization maintains the same well-behaved distribution for both the forward pass and backpropagation.
  5. But Xavier Initialization fails for ReLU; for ReLU networks we use He Initialization instead.