vinsis
Last edited by vinsis on Dec 9, 2022

1. Vision Transformers (An image is worth 16x16 words)

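Pretrained on 300+ million images (JFT-300M).
State of the art on ImageNet at the time of publication (88.55% top-1 accuracy).
A learnable CLS token is prepended to the patch tokens; its final representation is used to predict the class label.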
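
The CLS token idea can be sketched as follows. This is a minimal illustration rather than the paper's reference code: the class name `PatchEmbedWithCls`, the embedding width `dim=64`, and the use of a strided convolution for patch projection are assumptions made for the example, and position embeddings are left out.

```python
import torch
import torch.nn as nn


class PatchEmbedWithCls(nn.Module):
    """Hypothetical helper: split an image into patches and prepend a CLS token."""

    def __init__(self, in_chans=3, dim=64, patch_size=16):
        super().__init__()
        # A conv with kernel_size == stride == patch_size projects each patch linearly.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # One learnable CLS vector, shared across the batch.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):
        x = self.proj(x)                   # (B, dim, H / patch, W / patch)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)  # (B, num_patches + 1, dim)


# A 224x224 image with 16x16 patches gives 14 * 14 = 196 patch tokens plus 1 CLS token.
tokens = PatchEmbedWithCls()(torch.randn(2, 3, 224, 224))
print(tokens.shape)
```

After the transformer layers, a classification head would read only `tokens[:, 0]`, the position of the CLS token.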

tags: `vit`


How to implement a GNN

```python
import torch
from torch_scatter import scatter
```

We will use this graph:

```python
num_nodes = 5
num_edges = 6
num_edge_types = 3
edge_index = torch.LongTensor([
```
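
The edge list above is cut off, so the following is only a minimal sketch of the sum-aggregation step that `scatter` is typically used for in a GNN. The edge list here is hypothetical (chosen to match `num_nodes = 5` and `num_edges = 6`), and plain `index_add_` from PyTorch stands in for `torch_scatter.scatter` so that the example has no extra dependency.

```python
import torch

num_nodes = 5
num_edges = 6

# Hypothetical edge list (not the one from the note): row 0 = source, row 1 = target.
edge_index = torch.tensor([
    [0, 1, 2, 3, 4, 0],
    [1, 2, 3, 4, 0, 2],
])

# One scalar feature per node (node i has feature i) so the sums are easy to check by hand.
x = torch.arange(num_nodes, dtype=torch.float32).unsqueeze(1)  # shape (5, 1)

src, dst = edge_index
messages = x[src]                        # one message per edge, shape (6, 1)

# Sum all incoming messages for each target node.
aggregated = torch.zeros_like(x)
aggregated.index_add_(0, dst, messages)
print(aggregated.squeeze(1))
```

With `torch_scatter` installed, the last step should be expressible as `scatter(messages, dst, dim=0, dim_size=num_nodes, reduce="sum")`.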

Dec 20, 2022

Metaformer is Actually What You Need for Vision [CVPR 2022]

Paper: https://arxiv.org/abs/2111.11418

Key idea: abstract the general architecture (the "MetaFormer") shared by high-performing models such as Transformers and MLP-Mixers. The claim is that this general architecture, rather than the specific token mixer, is what gives good performance. To support the claim, the authors replace the attention or MLP token mixer with simple pooling.

The main thing to understand is how pooling works:

```python
class Pooling(nn.Module):
    """
    Implementation of pooling for PoolFormer
    --pool_size: pooling size
    """
```
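
A pooling token mixer of this kind can be sketched as below. This is a sketch under assumptions, not a copy of the official code: the class name `PoolingSketch` and the default `pool_size=3` are chosen for the example, and subtracting the input reflects my understanding that the surrounding block already adds a residual connection.

```python
import torch
import torch.nn as nn


class PoolingSketch(nn.Module):
    """Hypothetical pooling token mixer: average over a local window, minus the input."""

    def __init__(self, pool_size=3):
        super().__init__()
        # stride=1 and padding=pool_size // 2 keep the spatial size unchanged;
        # count_include_pad=False ignores the zero padding when averaging at borders.
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False
        )

    def forward(self, x):
        # x: (B, C, H, W). Subtract x because the enclosing block adds it back as a residual.
        return self.pool(x) - x


x = torch.ones(1, 4, 8, 8)
out = PoolingSketch()(x)
print(out.shape, out.abs().max())
```

The layer has no learnable parameters, which is the point of the experiment: any accuracy it reaches has to come from the rest of the architecture.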

Dec 9, 2022

Patches are all you need?

Main claim: patches are what lead to improved performance, at least to a certain extent.

Stem implementation:

```python
self.stem = nn.Sequential(
    nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size),
    activation(),
    nn.BatchNorm2d(dim)
)
```

Blocks implementation
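
To see what the stem does to tensor shapes, here is a small standalone sketch. The concrete values (`in_chans=3`, `dim=32`, `patch_size=7`, a 224x224 input) and the choice of `nn.GELU` for `activation` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Assumed values for illustration only.
in_chans, dim, patch_size = 3, 32, 7
activation = nn.GELU

# Same structure as the stem: a patch-sized, patch-strided conv, then activation and batch norm.
stem = nn.Sequential(
    nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size),
    activation(),
    nn.BatchNorm2d(dim),
)

# Each non-overlapping 7x7 patch becomes one dim-channel "pixel": 224 / 7 = 32 per side.
out = stem(torch.randn(2, in_chans, 224, 224))
print(out.shape)
```

Because kernel size and stride are equal, the convolution sees each patch exactly once, which makes it equivalent to a linear projection of the flattened patch.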

Dec 9, 2022

MLP-Mixer: An All-MLP Architecture for Vision

Mostly the same people behind the ViT paper. Adequate (84.15% top-1 on ImageNet with Mixer-L/16) but not SOTA. Benefits much more from scaling up.

Common part with ViT: divide an image into NxN patches, unroll each patch, and apply a linear transform.

Simple MLP layers of the form Mlp = Linear -> Activation -> Dropout -> Linear -> Dropout are implemented here:

```python
class Mlp(nn.Module):
    """ MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
```
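
The Linear -> Activation -> Dropout -> Linear -> Dropout pattern can be sketched as a complete module. The constructor signature follows the one shown above; the class name `MlpSketch`, the attribute names, and the defaulting of missing sizes to `in_features` are assumptions for this example.

```python
import torch
import torch.nn as nn


class MlpSketch(nn.Module):
    """Hypothetical completion: Linear -> Activation -> Dropout -> Linear -> Dropout."""

    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.):
        super().__init__()
        # Assumption: unspecified sizes fall back to the input width.
        hidden_features = hidden_features or in_features
        out_features = out_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        x = self.drop1(self.act(self.fc1(x)))
        return self.drop2(self.fc2(x))


# nn.Linear acts on the last dimension, so the same MLP is applied to every token.
y = MlpSketch(64, hidden_features=256)(torch.randn(2, 196, 64))
print(y.shape)
```

In a Mixer-style model the same module can mix across tokens instead of channels by transposing the last two dimensions before and after the call.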

Dec 9, 2022
