The normalization used is `nn.GroupNorm` with a single group, i.e. each sample is normalized over all channels and spatial positions at once, which is how the code gets a LayerNorm equivalent for channels-first `(B, C, H, W)` tensors:
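A quick check of that claim (a minimal sketch; the shapes here are arbitrary): `nn.GroupNorm` with `num_groups=1` matches manually normalizing each sample over `(C, H, W)`.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 8, 4, 4
x = torch.randn(B, C, H, W)

# GroupNorm with a single group: one mean/var per sample over (C, H, W).
gn = nn.GroupNorm(num_groups=1, num_channels=C, affine=False)
out = gn(x)

# Manual per-sample normalization over (C, H, W) for comparison.
mean = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + gn.eps)

print(torch.allclose(out, manual, atol=1e-6))  # True
```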
How `use_layer_scale` works (taken from https://arxiv.org/abs/2103.17239): the key idea is to have almost no contribution from the residual branch at the start of training, and then let learnable per-channel parameters decide how much contribution should come from the residual branch.
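A minimal sketch of the idea (the module and parameter names here are illustrative, not the actual code's):

```python
import torch
import torch.nn as nn

class LayerScaleResidual(nn.Module):
    """Residual connection with LayerScale (arXiv:2103.17239) for
    channels-first (B, C, H, W) features."""

    def __init__(self, dim, init_value=1e-5):
        super().__init__()
        # Per-channel scale, initialized near zero so the residual branch
        # contributes almost nothing at the start of training.
        self.scale = nn.Parameter(init_value * torch.ones(dim, 1, 1))

    def forward(self, x, branch_out):
        # branch_out: output of the residual branch (e.g. token mixer or
        # MLP), shape (B, C, H, W); (C, 1, 1) broadcasts against it.
        return x + self.scale * branch_out
```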
The confusing part is the shape mismatch between the output of `self.norm1`, which is `(B, C, H, W)`, and `self.layer_scale_1`, which is `(C, 1, 1)`. It works because broadcasting right-aligns the shapes and prepends singleton dimensions to the smaller tensor, so `torch.rand(C, 1, 1) * torch.rand(B, C, H, W)` behaves exactly like `torch.rand(1, C, 1, 1) * torch.rand(B, C, H, W)`.
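A quick demonstration (shapes are arbitrary):

```python
import torch

B, C, H, W = 2, 3, 4, 4
scale = torch.rand(C, 1, 1)
x = torch.rand(B, C, H, W)

# Broadcasting right-aligns shapes and prepends 1s to the smaller one,
# so (C, 1, 1) is treated as (1, C, 1, 1) and expands over B, H, W.
print(torch.allclose(scale * x, scale.reshape(1, C, 1, 1) * x))  # True
```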
To do:
- `ceil_mode` and `count_include_pad` in `nn.AvgPool2d` (see the sketch after this list)
- ViT
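Until then, a quick illustration of what those two `nn.AvgPool2d` flags do (the input here is just an example):

```python
import torch
import torch.nn as nn

x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

# ceil_mode: use ceil instead of floor when computing the output size,
# so partial windows at the border still produce an output element.
floor_pool = nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=False)
ceil_pool = nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=True)
print(floor_pool(x).shape)  # torch.Size([1, 1, 2, 2])
print(ceil_pool(x).shape)   # torch.Size([1, 1, 3, 3])

# count_include_pad: whether zero padding counts toward the divisor of
# the average. With False, border windows average only over real pixels.
inc = nn.AvgPool2d(kernel_size=3, stride=1, padding=1, count_include_pad=True)
exc = nn.AvgPool2d(kernel_size=3, stride=1, padding=1, count_include_pad=False)
print(inc(x)[0, 0, 0, 0].item())  # corner window sum 12 divided by 9 -> 1.333...
print(exc(x)[0, 0, 0, 0].item())  # corner window sum 12 divided by 4 -> 3.0
```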