self.stem = nn.Sequential(
    # Patch embedding stem: a patch_size x patch_size conv with stride patch_size
    # splits the image into non-overlapping patches and projects them to dim channels.
    nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size),
    activation(),
    nn.BatchNorm2d(dim)
)
self.blocks = nn.Sequential(
    *[nn.Sequential(
        # Depthwise conv (groups=dim) inside a residual: mixes spatial locations per channel.
        Residual(nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            activation(),
            nn.BatchNorm2d(dim)
        )),
        # Pointwise 1x1 conv: mixes channels at each spatial location.
        nn.Conv2d(dim, dim, kernel_size=1),
        activation(),
        nn.BatchNorm2d(dim)
    ) for i in range(depth)]
)
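The Residual wrapper is not shown in the snippet; a minimal sketch of the usual skip-connection wrapper (this matches the common reference implementation, but treat the exact definition here as an assumption):

import torch.nn as nn

class Residual(nn.Module):
    # Wraps a module and adds a skip connection: forward(x) = fn(x) + x.
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x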
nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same")
mixes spatial dimensions only while nn.Conv2d(dim, dim, kernel_size=1)
mixes channels only. Same in spirit as MLP-Mixervit
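A quick sketch (with assumed toy values for dim and kernel_size) that makes this split visible through the weight shapes:

import torch
import torch.nn as nn

dim, kernel_size = 8, 5                      # assumed toy values, not from the paper
x = torch.randn(1, dim, 32, 32)

depthwise = nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same")
pointwise = nn.Conv2d(dim, dim, kernel_size=1)

# Each depthwise filter sees a single channel over a kernel_size x kernel_size window:
print(depthwise.weight.shape)         # torch.Size([8, 1, 5, 5])
# Each pointwise filter sees all channels at a single spatial position:
print(pointwise.weight.shape)         # torch.Size([8, 8, 1, 1])
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 8, 32, 32])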