# Emerging Properties in Self-Supervised Vision Transformers (DINO🐲)
---
## Abstract
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).
Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations:
1. First, **self-supervised ViT features contain explicit information about the semantic segmentation of an image**, which does not emerge as clearly with supervised ViTs, nor with convnets.
1. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT.
Our study also underlines the **importance of momentum encoder**, **multi-crop training**, and the use of **small patches with ViTs**.
We implement our findings into a simple self-supervised method, called **DINO**, which we interpret as a form of **self-distillation with no labels**. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

---
## Introduction
### Related work
1. #### Knowledge Distillation


1. #### Self-training and knowledge distillation
The authors of [[1]](https://arxiv.org/abs/1911.04252) have shown that distillation can be used to propagate soft pseudo-labels to unlabeled data in a self-training pipeline, drawing an essential connection between self-training and knowledge distillation.
Finally, our work is also related to codistillation [[2]](https://arxiv.org/abs/1804.03235) where student and teacher have the same architecture and use distillation during training. However, the teacher in codistillation is also distilling from the student, while it is updated with an average of the student in our work.

---
## Approach


### Teacher network
Unlike knowledge distillation, we do not have a teacher $g_{\theta_t}$ given a priori; hence, we build it from past iterations of the student network.
Of particular interest, an exponential moving average (EMA) of the student weights, i.e., a momentum encoder, is particularly well suited to our framework. The update rule is $\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training.
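As a rough illustration, the sketch below updates a teacher's parameters as an EMA of the student's, with the coefficient $\lambda$ ramping from 0.996 to 1 on a cosine schedule. The `cosine_momentum` helper and the toy scalar "parameters" are hypothetical, not taken from the paper's code.

``` python=
import math

def cosine_momentum(step, total_steps, base=0.996, final=1.0):
    # Hypothetical helper: cosine schedule for the EMA coefficient lambda,
    # ramping from `base` at step 0 to `final` at the last step.
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

def ema_update(teacher_params, student_params, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s, applied parameter-wise.
    return [lam * t + (1 - lam) * s for t, s in zip(teacher_params, student_params)]

# Toy usage with scalar "parameters".
teacher, student = [0.0, 0.0], [1.0, -1.0]
for step in range(10):
    lam = cosine_momentum(step, total_steps=10)
    teacher = ema_update(teacher, student, lam)
print(teacher)  # the teacher drifts slowly toward the student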

### Avoiding collapse
Several self-supervised methods differ by the operation used to avoid collapse, either through a **contrastive loss**, **clustering constraints**, a **predictor**, or **batch normalization**. While our **framework can be stabilized with multiple normalizations** [[10]](https://proceedings.neurips.cc/paper/2020/hash/70feb62b69f16e0238f741fab228fec2-Abstract.html), it can also **work with only a centering and sharpening of the momentum teacher outputs** to avoid model collapse.
Centering prevents one dimension from dominating but encourages collapse to the uniform distribution, while sharpening has the opposite effect. Applying both operations balances their effects, which is sufficient to avoid collapse in the presence of a momentum teacher.

Choosing this method to avoid collapse trades stability for less dependence on the batch: the centering operation only depends on first-order batch statistics and can be interpreted as adding a bias term $c$ to the teacher: $g_t(x) \leftarrow g_t(x) + c$. The center $c$ is updated with an exponential moving average (EMA), which allows the approach to work well across different batch sizes:

$$
c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i),
$$

where $m > 0$ is a rate parameter and $B$ is the batch size. Output sharpening is obtained by using a low value for the temperature $\tau_t$ in the teacher softmax normalization.
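A minimal sketch of this center update, assuming the teacher outputs for a batch are stacked in a `(B, K)` array; the variable names are illustrative and the rate `m = 0.9` is chosen here only as an example.

``` python=
import numpy as np

def update_center(c, teacher_outputs, m=0.9):
    # c <- m * c + (1 - m) * mean over the batch of teacher outputs.
    batch_mean = teacher_outputs.mean(axis=0)  # shape (K,)
    return m * c + (1 - m) * batch_mean

# Toy usage: K = 5 output dimensions, batch of B = 4 teacher outputs.
c = np.zeros(5)
teacher_outputs = np.random.randn(4, 5)
c = update_center(c, teacher_outputs)
```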

``` python=
import numpy as np
import matplotlib.pyplot as plt

def softmax(x, c, t):
    # Teacher output: subtract the center c, sharpen with temperature t,
    # then apply a numerically stable softmax.
    z = (x - c) / t
    y = np.exp(z - np.max(z))
    return y / np.sum(y)

c = 0    # center
t = 0.5  # temperature
output = np.array([1.001, 1.425, 1.955, 1.203, 1.091])
probs = softmax(output, c, t)
print(probs)
plt.bar(range(5), probs)
plt.title('C : {}, T : {}'.format(c, t))
plt.show()
```
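In the snippet above, lowering `t` makes the output distribution peakier (sharpening), while subtracting a per-dimension center `c` (in practice, the EMA of past teacher outputs; a vector broadcasts fine here) removes the advantage of dimensions that are consistently large and pushes the output toward uniform, so the two operations balance each other as described above.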

---