`Mlp` = Linear -> Activation -> Dropout -> Linear -> Dropout style MLP layers, implemented here:

```python
from functools import partial

import torch.nn as nn
from timm.models.layers import DropPath, to_2tuple  # timm helpers used by the snippets below


class Mlp(nn.Module):
    """ MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        drop_probs = to_2tuple(drop)  # allows separate drop rates for the two Dropout layers

        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop_probs[0])
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(drop_probs[1])

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x
```
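A quick shape check (my own sketch with arbitrary sizes, not part of the timm source): `Mlp` only transforms the last (feature) dimension and keeps all leading dimensions intact.

```python
import torch

mlp = Mlp(in_features=512, hidden_features=2048)  # out_features defaults to in_features
x = torch.randn(2, 196, 512)                      # (batch, tokens, channels)
print(mlp(x).shape)                               # torch.Size([2, 196, 512])
```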
```python
class MixerBlock(nn.Module):
    """ Residual Block w/ token mixing and channel MLPs
    Based on: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601
    """
    def __init__(
            self, dim, seq_len, mlp_ratio=(0.5, 4.0), mlp_layer=Mlp,
            norm_layer=partial(nn.LayerNorm, eps=1e-6), act_layer=nn.GELU, drop=0., drop_path=0.):
        super().__init__()
        tokens_dim, channels_dim = [int(x * dim) for x in to_2tuple(mlp_ratio)]
        self.norm1 = norm_layer(dim)
        # token-mixing MLP: acts along the sequence (patch) dimension
        self.mlp_tokens = mlp_layer(seq_len, tokens_dim, act_layer=act_layer, drop=drop)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        # channel-mixing MLP: acts along the embedding (channel) dimension
        self.mlp_channels = mlp_layer(dim, channels_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        # x: (batch, seq_len, dim); transpose so the token MLP sees tokens as its feature dim
        x = x + self.drop_path(self.mlp_tokens(self.norm1(x).transpose(1, 2)).transpose(1, 2))
        x = x + self.drop_path(self.mlp_channels(self.norm2(x)))
        return x
```
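To make the two mixing directions concrete, here is a small shape check (my own sketch; the dimensions are picked arbitrarily, e.g. 196 = 14 x 14 patches):

```python
import torch

block = MixerBlock(dim=512, seq_len=196)  # hypothetical sizes for illustration
x = torch.randn(2, 196, 512)              # (batch, tokens, channels)
print(block(x).shape)                     # torch.Size([2, 196, 512])

# Inside forward: norm1(x).transpose(1, 2) has shape (2, 512, 196), so
# mlp_tokens (an Mlp with in_features=seq_len) mixes information across tokens;
# mlp_channels is applied to the last dim directly and mixes across channels.
```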
`mlp_tokens` and `mlp_channels` operate along the two dimensions of the (tabular) patch data: `mlp_tokens` mixes information across the token (sequence) dimension, while `mlp_channels` mixes across the channel (embedding) dimension. `DropPath` allows for stochastic depth, implemented here; a minimal sketch of the idea is shown below. TODO: Understand `SpatialGatingUnit` and `SpatialGatingBlock` defined here.
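A minimal sketch of what stochastic depth does (my own simplified version, not timm's exact `DropPath` code): during training, the whole residual branch is dropped per sample with probability `drop_prob`, and kept samples are rescaled so the expected activation is unchanged.

```python
import torch
import torch.nn as nn


class DropPathSketch(nn.Module):
    """Simplified stochastic depth: zero out the entire residual branch per sample."""
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1. - self.drop_prob
        # one keep/drop decision per sample, broadcast over all remaining dims
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = x.new_empty(shape).bernoulli_(keep_prob)
        return x * mask / keep_prob  # rescale kept branches by 1 / keep_prob
```

Note that `MixerBlock` defaults to `drop_path=0.`, in which case `nn.Identity()` is used and no branches are dropped.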