Remember, Standard Deviation = √Variance
Variance : σ²
Standard deviation : σ
Initializing all weights to zero creates what is known as a dead neuron: the information reaching each neuron is zero, no matter the input xi.
Input fed to the neuron = xi * wi
As wi = 0,
xi * wi = 0
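A minimal NumPy sketch of this (the layer size and input values below are arbitrary, purely for illustration):

```python
import numpy as np

# Hypothetical layer: 4 inputs, 3 neurons, all weights and biases set to zero
x = np.array([0.5, -1.2, 3.0, 0.7])   # any input vector
W = np.zeros((3, 4))                  # zero-initialized weights
b = np.zeros(3)                       # zero-initialized biases

z = W @ x + b                         # pre-activation of every neuron
print(z)                              # [0. 0. 0.] -- no information from x reaches the neurons
```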
Assigning all weights the same value is also a bad idea, as all inputs are essentially treated the same. Although the neuron outputs are not zero, every neuron computes the same thing and receives the same gradient, so they don't learn anything useful.
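A small sketch of this symmetry problem (the layer shape and the constant 0.3 are arbitrary):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0, 0.7])
W = np.full((3, 4), 0.3)   # every weight set to the same constant

z = W @ x                  # every neuron computes exactly the same value
print(z)                   # [0.9 0.9 0.9]
# Since all rows of W are identical, each row also receives an identical
# gradient update, so the neurons remain copies of one another during training.
```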
Randomly select values from a Gaussian distribution having mean µ and standard deviation σ:
N(µ, σ)
We must decide what the deviation σ of these random weights should be. The issue is that if σ is very low, this becomes close to symmetric initialization; on the other hand, if we choose a high σ, we tend to move towards the exploding and vanishing gradient problems.
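A quick sketch of this trade-off (the layer size and the two σ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                     # unit-variance input

for sigma in (1e-4, 1.0):                        # too small vs. too large
    W = rng.normal(0.0, sigma, size=(512, 512))  # weights ~ N(0, sigma)
    z = W @ x
    print(f"sigma={sigma:g}  var(z)={z.var():.3e}")
# With sigma=1e-4 the pre-activations are nearly identical and close to zero
# (almost symmetric initialization); with sigma=1.0 the output variance is
# roughly 512x the input variance and keeps growing with every layer.
```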
Maintain some variance in weight allocation
Weights are distributed in such a way that the variance of a neuron's output closely follows the variance of its input.
The output of a neuron with a linear activation function is given by
y = w1x1 + w2x2 + … + wnxn + b
var(y) = var(w1x1 + w2x2 + … + wnxn + b)
As the bias parameter is a constant, it has no variance, so we neglect the bias term.
Assuming the weights and inputs are independent of each other and have zero mean, var(wixi) = var(wi)·var(xi), so
var(y) = var(w1)var(x1) + var(w2)var(x2) + … + var(wn)var(xn)
As the weights are i.i.d. (independent and identically distributed), and assuming the inputs are too,
var(y) = N · var(wi) · var(xi)
where N is the dimension of the input vector. As our goal is to match the variance of the input and the output, we need
var(wi) = 1/N
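A quick numerical check of var(y) = N · var(wi) · var(xi) (the dimension, variances, and sample count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000                       # input dimension
var_w, var_x = 0.05, 2.0       # arbitrary weight and input variances

w = rng.normal(0.0, np.sqrt(var_w), size=(2000, N))  # 2000 i.i.d. weight vectors
x = rng.normal(0.0, np.sqrt(var_x), size=(2000, N))  # 2000 zero-mean inputs

y = (w * x).sum(axis=1)             # y = w1x1 + ... + wNxN for each sample
print(y.var(), N * var_w * var_x)   # both come out around 100
```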
LeCun's initialization therefore suggests randomly drawing the weights from a Gaussian with mean 0 and standard deviation equal to 1/√N.
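A minimal sketch of this rule (the layer shape and input are arbitrary, and the function name is my own): drawing weights with standard deviation 1/√N keeps the output variance close to the input variance.

```python
import numpy as np

def lecun_init(n_in, n_out, rng):
    """Weights ~ N(0, 1/n_in): variance 1/N, standard deviation 1/sqrt(N)."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))

rng = np.random.default_rng(2)
N = 512
x = rng.standard_normal((2000, N))   # inputs with unit variance
W = lecun_init(N, N, rng)

y = x @ W.T                          # linear layer, bias omitted
print(x.var(), y.var())              # both ~ 1.0: the variance is preserved
```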
For efficient backpropagation, the variance should also account for the backward pass through the network.
The weight distribution should be a Gaussian with zero mean and variance given by the following formula:
var(wi) = 2 / (Nin + Nout)
where Nin and Nout are the number of inputs and outputs of the layer.
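A minimal sketch of this scheme (the function name and layer shape are my own; n_in and n_out are the layer's fan-in and fan-out):

```python
import numpy as np

def glorot_init(n_in, n_out, rng):
    """Weights ~ N(0, 2 / (n_in + n_out)), balancing the forward and backward passes."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(3)
W = glorot_init(784, 256, rng)
print(W.var(), 2.0 / (784 + 256))    # empirical variance matches the target ~0.0019
```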
The ReLU function is defined as f(x) = max(0, x).
It is not a zero-mean function, which breaks the zero-mean assumption on the layer inputs that we used in the derivation above.
To account for this, we slightly modify the Xavier/Glorot method.
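The usual form of that modification is He (Kaiming) initialization: since ReLU zeroes out the negative half of the pre-activations, the weight variance is doubled to 2/N so the magnitude of the signal is preserved through the layer. A minimal sketch (layer sizes and depth are arbitrary):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Weights ~ N(0, 2 / n_in): the factor 2 compensates for ReLU
    discarding (on average) half of the squared signal."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(4)
N = 512
h = rng.standard_normal((2000, N))                # unit-variance input
for _ in range(10):                               # a stack of 10 ReLU layers
    h = np.maximum(0.0, h @ he_init(N, N, rng).T)
print((h ** 2).mean())                            # stays around 1 instead of shrinking toward 0
```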