
Patches are all you need?

  • Main claim: the patch representation itself is what leads to the improved performance, at least to a certain extent
  • Stem and block implementation
        # patch embedding stem: non-overlapping patch_size x patch_size conv
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size),
            activation(),
            nn.BatchNorm2d(dim)
        )
        # depth ConvMixer blocks
        self.blocks = nn.Sequential(
            *[nn.Sequential(
                    Residual(nn.Sequential(
                        # depthwise conv: mixes spatial locations within each channel
                        nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                        activation(),
                        nn.BatchNorm2d(dim)
                    )),
                    # pointwise 1x1 conv: mixes channels at each location
                    nn.Conv2d(dim, dim, kernel_size=1),
                    activation(),
                    nn.BatchNorm2d(dim)
            ) for i in range(depth)]
        )
  • Note how nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same") mixes spatial dimensions only while nn.Conv2d(dim, dim, kernel_size=1) mixes channels only. Same in spirit as MLP-Mixer
tags: vit