# Notes on A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)

Original Paper Link: [arxiv Link](https://arxiv.org/abs/2002.05709)

Notes Author: [Pulkit Mathur](https://github.com/mathurpulkit)

# Abstract

This paper presents SimCLR: a simple framework for contrastive learning of visual representations.<sup><a href="" title="In Contrastive Representation Learning, the goal is to learn an embedding space in which similar inputs stay close to each other while dissimilar ones stay far apart. Contrastive learning can be applied in both supervised and unsupervised settings.">[1]</a></sup> The framework is designed to be simple and does not require a memory bank. The authors highlight three findings that have a major effect on training. First, the composition of data augmentations plays a significant role in defining effective predictive tasks. Second, introducing a learnable non-linear transformation between the learned representation and the representation used for the contrastive loss substantially improves the quality of the representations. Third, contrastive learning benefits more from larger batch sizes and more training steps than supervised learning does. Combining these findings, the authors achieve SOTA performance among self-supervised methods on the ImageNet classification task.

# Introduction

Learning effective visual representations without human supervision is a long-standing problem. Most approaches fall into one of two categories: generative or discriminative. Generative approaches learn to generate or model pixels in the input space; examples include Deep Belief Nets and Generative Adversarial Networks. Discriminative approaches typically use objective functions similar to those in supervised deep learning, training models on pretext tasks that are usually designed with heuristics, which can limit the generality of the learned representations.<sup><a href="" title="The pretext task is the self-supervised learning task solved to learn visual representations, with the aim of using the learned representations or model weights obtained in the process for the downstream task.">[2]</a></sup> Discriminative approaches have recently shown great promise, achieving SOTA performance.

In this paper, the authors introduce a simple framework for contrastive learning of visual representations, called SimCLR. It outperforms previous methods (at the time of publication), and is simpler in that it requires neither specialised architectures nor a memory bank. The authors systematically study the major components of their framework, each of which has a significant positive impact on the quality of the learned representations:

- Composition of multiple data augmentations
- A learnable non-linear transformation between the representation and the contrastive loss
- Normalised embeddings and an appropriate temperature parameter
- Larger batch sizes, longer training, and deeper and wider networks (these have more impact than in supervised learning)

Under the linear evaluation protocol, SimCLR achieves 76.5% top-1 accuracy on ImageNet. When fine-tuned with only 1% of the ImageNet labels, SimCLR achieves 85.8% top-5 accuracy. When fine-tuned on other natural image classification datasets, SimCLR performs on par with or better than a strong supervised baseline.

# Method

SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.
The framework has four major components:

- A stochastic data augmentation module that transforms any given data example randomly, resulting in two correlated views of the same example, denoted **\\(\tilde{x}_{i}\\)** and **\\(\tilde{x}_{j}\\)**, which form a positive pair. Three simple augmentations are applied sequentially: random cropping followed by a resize back to the original size, random color distortion, and random Gaussian blur.
- A neural network base encoder \\(f(\cdot)\\) that extracts representation vectors from the augmented examples. Many different architectures can be used; the authors opt for ResNet for simplicity. \\(h_{i} = f(\tilde{x}_{i}) = ResNet(\tilde{x}_{i})\\), where \\(h_{i} \in \mathbb{R}^{d}\\) is the output after the average pooling layer.
- A small projection head neural network \\(g(\cdot)\\) that maps representations to the space where the contrastive loss is applied. The authors use an MLP with one hidden layer and a ReLU activation on the hidden layer, giving \\(z_{i} = g(h_{i}) = W^{(2)}\sigma (W^{(1)}h_{i})\\), where \\(\sigma\\) is the ReLU function.
- A contrastive loss function defined for a contrastive prediction task. Batches of size \\(N\\) are randomly sampled and \\(2N\\) data points are created (by making two augmentations of each sample); the contrastive prediction task is to identify, for a given augmented example, its positive pair among the other examples. Thus, each sample has one positive pair and \\(2(N-1)\\) negative pairs.

<center><img src="https://i.imgur.com/CKE0EPU.png" alt="Data Augmentation and representation and General Architecture"></center>
<center><b>Figure 1</b></center>
<br/>

Let \\(sim(u,v) = u^{\top}v / (\lVert u\rVert \lVert v\rVert)\\) denote the cosine similarity between two vectors \\(u\\) and \\(v\\). Then, the loss for a positive pair \\((i,j)\\) is defined as:

## \\(\ell_{i,j} = -\log{\frac{\exp(sim(z_{i},z_{j})/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k\neq i]}\exp(sim(z_{i},z_{k})/\tau)}} \\)

where \\(\mathbb{1}_{[k\neq i]} \: \in \: \lbrace 0,1\rbrace\\) is an indicator function evaluating to 1 iff \\(k \neq i\\), and \\(\tau\\) denotes a temperature parameter. The final loss, called NT-Xent (normalised temperature-scaled cross-entropy loss), is computed across all positive pairs, both \\((i,j)\\) and \\((j,i)\\), in a mini-batch.

### Training with Large Batch Size

The authors train their model with batch sizes varying from 256 to 8192. As training with large batch sizes is unstable with SGD/Momentum, the authors use the LARS optimizer.<sup><a href="" title="Large Batch Training of Convolutional Networks, You et al 2017">[3]</a></sup>

### Global Batch Normalization

In distributed training with data parallelism, the BN mean and variance are typically computed locally per device. Because both elements of a positive pair are processed on the same device, the model could exploit these local statistics to solve the prediction task without improving its representations. To prevent this local information leakage, the authors aggregate the BN mean and variance across all devices.

# Data Augmentation

The authors study two types of data augmentation. Geometric augmentations involve spatial transformations such as cropping, resizing (including horizontal flipping), rotation, and cutout. Appearance augmentations transform the image's appearance, for example color distortion (including color dropping, brightness, contrast, saturation, hue), Gaussian blur, and Sobel filtering. In normal training, augmentations from both types (geometric and appearance) are applied to every sample image. Since the network expects inputs of a constant size, crop-and-resize is always applied, which makes it difficult to study the effect of other transformations in the absence of cropping.
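As a concrete illustration of the default augmentation pipeline described above (random crop with resize back to the original size, random flip, random color distortion, random Gaussian blur), below is a minimal torchvision-style sketch. The specific probabilities, jitter strength `s`, and blur kernel size are illustrative assumptions rather than the paper's exact settings.

```python
from torchvision import transforms

def simclr_augment(image_size=224, s=1.0):
    """Return a transform that produces one randomly augmented view.

    The crop/flip/color-distortion/blur composition follows the description
    above; the probabilities, jitter strength `s`, and kernel size are
    illustrative assumptions.
    """
    color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size),        # random crop + resize back
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([color_jitter], p=0.8),   # color distortion
        transforms.RandomGrayscale(p=0.2),               # color dropping
        transforms.RandomApply(
            [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
        transforms.ToTensor(),
    ])

# Applying the same stochastic transform twice to one image yields the
# positive pair (x_i, x_j).
augment = simclr_augment()
# view_i, view_j = augment(img), augment(img)   # img: a PIL image
```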
To work around this, the authors perform an **ablation study** in which both branches first crop and resize the input image; one branch is then left as-is (no further transformation), while the targeted transformation under study is applied to the other branch. This asymmetry hurts performance, but it does not substantially change the relative impact of the individual transformations, so it is suitable for the ablation study.

<center><img src="https://i.imgur.com/sz2LRTV.png" alt="Linear evaluation of single or composition of data augmentations"></center>
<br/>

The figure above shows linear evaluation results under individual transformations and compositions of transformations. The authors observe that the model identifies positive pairs more easily when only a single augmentation is used. However, the quality of the learned representations is much better with composed augmentations. The composition of random cropping and color distortion stands out in particular. The authors conjecture that this is because different patches of the same image share a very similar color distribution; in fact, color histograms alone suffice to distinguish images, as shown in the figure below. Without color distortion, the network can exploit this shortcut for prediction, which hurts the quality of the learned representations. Color distortion is therefore necessary for learning better representations.

<center><img src="https://i.imgur.com/kLVZgju.png" alt="Color histograms of image patches before and after color distortion"></center>
<br/>

The authors also evaluate the model when trained with color distortion of different strengths. While supervised models are not affected much by stronger color distortion, unsupervised models produce representations that perform better on classification when trained with stronger distortion.

<center><img src="https://i.imgur.com/lKmWAcH.png" alt="Effect of Color Distortion Strength on Classifier accuracy"></center>
<br/>

# Architecture Details

Unsupervised learning benefits more from deeper and wider models than supervised learning does.

### Non-Linear Projection Head

The authors attach a projection head on top of the representation to compute the contrastive loss. They find that a non-linear projection head performs better than a linear projection head (+3%), and much better than no head at all (>10%). The improvement is observed regardless of the output dimension of the projection head. Even when a non-linear projection is used, the layer before the projection head (i.e. \\(h\\)) is a much better representation (>10%) than the layer after (i.e. \\(z = g(h)\\)), showing that the hidden layer before the projection head remains a better representation of the input.

<center><img src="https://i.imgur.com/B8ihldQ.png" alt="Linear evaluation of representations with different projection heads"></center>
<br/>

The authors conjecture that the importance of using the representation before the projection head stems from the loss of information induced by the contrastive loss. \\(z\\) is trained to be invariant to the data transformations, so \\(g\\) can remove information that may be useful for downstream tasks, such as the color or orientation of objects. By leveraging the non-linear transformation \\(g(\cdot)\\), more information can be formed and maintained in \\(h\\). To test this hypothesis, the authors train models to predict the transformation applied during pretraining using either \\(h\\) or \\(g(h)\\). As shown in the table below, \\(g(h)\\) loses this information in most cases, while it remains present in \\(h\\).

<center><img src="https://i.imgur.com/GEG5kFj.png" alt="Accuracy of predicting the applied transformation from either h or g(h)"></center>
<br/>
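Tying the components of the Method section together before moving on to the loss ablations, here is a minimal PyTorch-style sketch of the projection head \\(g(\cdot)\\) and the NT-Xent loss defined earlier. The class/function names, default dimensions (2048-d \\(h\\) from a ResNet-50, 128-d \\(z\\)), and the default temperature are assumptions for illustration, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """g(h) = W2 * ReLU(W1 * h): an MLP with one hidden layer, as described above."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss over a batch of N positive pairs (2N embeddings in total)."""
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # l2-normalise -> cosine similarity
    sim = z @ z.t() / temperature                         # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))                     # exclude the k == i terms
    # the positive of index k is k + n (and vice versa)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    # cross-entropy over each row gives -log softmax at the positive index,
    # averaged over all (i, j) and (j, i) pairs
    return F.cross_entropy(sim, targets)
```

In the full pipeline, \\(h_{i} = f(\tilde{x}_{i})\\) comes from the base encoder, \\(z_{i} = g(h_{i})\\) from the projection head, and the loss above is minimised over mini-batches; after pretraining, the projection head is discarded and \\(h\\) is used for downstream tasks.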
<center><img src="https://i.imgur.com/GEG5kFj.png" alt="Accuracy of Prediction of transformtions using either h or g(h)"></center> # Loss Functions and Batch Size ### Normalised Cross Entropy Loss Performance Looking at the gradient, we observe: 1) \\(\ell_{2}\\) normalization along with temperature effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives. 2) Unlike cross-entropy, other objective functions do not weigh the negatives by their relative hardness. As a result, one must apply semi-hard negative mining for these loss functions, instead of computing the gradient over all loss terms, one can compute the gradient using semi-hard negative terms (i.e., those that are within the loss margin and closest in distance, but farther than positive examples). The authors find that, without \\(\ell_{2}\\) normalisation, although the contrastive task accuracy is higher, but the linear classifier's accuracy over the representation is lower. ### Effect of Larger Batch Sizes and Longer Training If the number of epochs of the dataset are lower(e.g. 100), larger batch sizes perform significantly better than the smaller batch sizes. This effect diminishes with more epochs, provided the batches are randomly sampled. In contrast to supervised learning, larger batch sizes lead to a faster convergence in contrastive learning, due to more number of negative examples per batch. Training for more epochs also increases accuracy, as more negative examples are available. # Comparison with other SotA Below table shows the comparison of SimCLR with other Methods. They use ResNet-50 with 3 different hidden layer widths(1\\(\times\\), 2\\(\times\\), and 4\\(\times\\)). All the accuracies reported for SimCLR below are for models trained for 1000 epochs. <center><img src="https://i.imgur.com/69j4Mdg.png" alt=""></center> <br/> ### Semi Supervised Learning The authors sample 1% or 10% of the ImageNet dataset randomly in a class balanced way, and fine-tune the whole base network without any regularization. The results obtained are given below: <center><img src="https://i.imgur.com/sPlSG1H.png" alt="Accuracy of various models with few labels"></center> <br/> ### Transfer Learning The authors evaluate transfer learning performance across 12 natural image datasets in both linear evaluation (fixed feature extractor) and fine-tuning settings, comparing them with supervised baselines. They perform hyperparameter tuning for each model-dataset combination and select the best hyperparameters on a validation set. The method performs better or equal to the baseline on most datasets. Below table provides all the details: <center><img src="https://i.imgur.com/oLGgzBN.png" alt=""></center> <br/> # Conclusion In this work, the authors present a simple framework for contrastive representation learning. They show that using a standard architecture(like ResNet-50), we can achieve good accuracy without using any heuristics. Using a combination of techniques such as data augmentation, using a non-linear projection head, normalised cross entropy loss with adjustable temperature and using larger batch sizes, the authors achieve SOTA performance on various tasks, even beating supervised baselines in some tasks such as transfer learning. The authors conclude that complex architectures and design choices aren't necessary for good performance in self-supervised learning, and that SOTA performance can be achieved using simpler architectures and good design choices.