# LooC++

Objective:
* Having a model that's

The Good:

The Bad:

Embedding spaces:
* Query $z^q$ represents the original image
* Positive samples: $z^{k^+}$
* Negative samples: $z^{k^-}$
* So, if we have an embedding space $Z_0$
  * Positive pairs of $z_0^q$ are mapped to a single point called: $z_0^{k_0^+}$
  * Negative pairs of $z_0^q$ are embeddings of other instances in this embedding space: $\{z_0^{k_0^-}\}$

![LooC Img](https://i.imgur.com/cNjsAOn.png)
![](https://i.imgur.com/XA7VkAE.png)

# Questions:
* “For LooC++, we include Conv5 block into the projection head h, and use the concatenated features at the last layer of Conv5, instead of the last layer of h, from each head.”
* “We adopt Momentum Contrastive Learning (MoCo) [13] as the backbone of our framework for its efficacy and efficiency, and incorporate the improved version from [4].”
  * We should read [4]
* **What is the diff between LooC and LooC++?** Are they using the same loss? How is LooC calculating the different losses?
  * **Answer**: I think the diff between the two is at evaluation time: either just use the space V (LooC) or use all the heads by concatenating the output to use for prediction/embedding (LooC++)
* “We apply random-resized cropping, horizontal flipping and Gaussian blur as augmentations without designated embedding spaces.” What does it mean?
* “Note that for both LooC and LooC++ the augmented additional keys are only fed into the key encoding network, which is not back-propagated, thus it does not much increase computation or GPU memory consumption.” What does it mean?
* "We use separate queues [13] for individual embedding space and set the queue size to 16,384" How and why are there queues?
* Experiment in Table 1 is unclear to me
  * Do they use a head here? Or is it just the blue section from figure 2?
* When they train, do they always use the heads?
* Why is there a difference of performance between table 4 and table 2?
* How many negative samples for each $k_i$?
* Does each image have the same transformation for (e.g.) $k_1$?
* Ask jasonhsiao97@gmail.com for questions

# Contextualization
<sub><sup>of the work within the literature - especially the literature that, at the point of presentation, was previously reviewed in the course.</sup></sub>
* Evolution of contrastive loss
* MoCo
* Pretext-task papers (rotation, texture randomization, etc.)?

# Details
* “We adopt Momentum Contrastive Learning (MoCo) [13] as the backbone of our framework for its efficacy and efficiency, and incorporate the improved version from [4].”
* “Implementation details. We closely follow [4] for most training hyper-parameters. We use a ResNet-50 [15] as our feature extractor. We use a two-layer MLP head with a 2048-d hidden layer and ReLU for each individual embedding space. We train the network for 500 epochs, and decrease the learning rate at 300 and 400 epochs. We use separate queues [13] for individual embedding space and set the queue size to 16,384. Linear classification evaluation details can be found in the Appendix. The batch size during training of the backbone and the linear layer is set to 256.”
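A small PyTorch sketch of what these implementation details translate to (my own reconstruction, not the authors' code). The two-layer MLP head with a 2048-d hidden layer and ReLU, and the 500-epoch schedule with decays at epochs 300 and 400, are from the quote above; the 128-d output, the SGD settings, and the 0.1 decay factor are assumptions borrowed from typical MoCo v2 setups.

```python
import torch
import torch.nn as nn

# One two-layer MLP head (2048-d hidden, ReLU) per embedding space, as quoted above.
# The 128-d output dimension is an assumption (not stated in the notes).
def make_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    return nn.Sequential(nn.Linear(in_dim, hidden_dim),
                         nn.ReLU(inplace=True),
                         nn.Linear(hidden_dim, out_dim))

n_augmentations = 2                                    # e.g. color jittering + rotation
heads = nn.ModuleList(make_head() for _ in range(n_augmentations + 1))  # Z_0 ... Z_n

# 500 epochs, learning rate decreased at epochs 300 and 400 (batch size 256).
# In the real setup the optimizer would also include the ResNet-50 backbone parameters;
# lr, momentum, weight decay and the 0.1 factor are assumed MoCo-v2-style values.
optimizer = torch.optim.SGD(heads.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[300, 400], gamma=0.1)
```

Each embedding space would additionally keep its own MoCo-style queue of 16,384 negative keys, as stated in the implementation details.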
# Related Work

### Pretext Tasks
* Local brightness, color, and texture features are combined together
  * a simple linear model can be trained to detect boundaries
* Relative patch prediction and rotation
  * used to discover the underlying **structure** of the objects
* Image colorization task
  * used to learn representations capturing **color** information

The inductive **bias introduced by each pretext task** can often be associated with a corresponding hand-crafted descriptor.

### Multi-Task Self-Supervised Learning
As shown in [17], training with two tasks can yield better performance than seven tasks together, as some tasks might conflict with each other. To solve this problem, different weights for different tasks are learned to optimize for the downstream tasks [28]. However, searching for the weights typically requires labels, is time-consuming, and does not generalize to different tasks.

**In this paper**, we also propose to learn a representation which can factorize and unify information from different augmentations. Instead of using sparse regularization, we define a different contrastive learning objective per augmentation in a multi-head architecture.

### Contrastive Learning
Instead of enumerating all the possible selections of augmentations, we propose a unified framework which captures the different factors of variation introduced by different augmentations.

# Paper conclusion
* Current contrastive learning approaches rely on specific augmentations
  * may yield suboptimal performance on downstream tasks if the wrong transformation invariances are presumed
* Proposed new model
  * Learns transformation **dependent and invariant** representations by constructing multiple embeddings
  * each of which is sensitive to a single type of transformation while invariant to the others (here lies the paper's title)

---
# Script
---

# Motivation
* “Recent self-supervised contrastive methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars).”
* Self-supervised learning, which uses raw image data and/or available pretext tasks as its own supervision, has become increasingly popular as the inability of supervised models to generalize beyond their training data has become apparent.
* Whereas pretext tasks aim to recover the transformations between different “views” of the same data, more recent contrastive learning methods [37, 32, 13, 3] instead try to learn to be invariant to these transformations, while remaining discriminative with respect to other data points.
* About contrastive learning methods: “Yet, the inductive bias introduced through such augmentations is a double-edged sword, as each augmentation encourages invariance to a transformation which can be beneficial in some cases and harmful in others: e.g., adding rotation may help with view-independent aerial image recognition, but significantly downgrade the capacity of a network to solve tasks such as detecting which way is up in a photograph for a display application.”

# Context

# Overview / Methodology?
* Figure 1: visualisation of when an augmentation could hurt performance depending on the task
  * Self-supervised contrastive learning relies on data augmentations as depicted in (a) to learn visual representations. However, current methods introduce inductive bias by encouraging neural networks to be less sensitive to information w.r.t.
the augmentation, which may help or may hurt. As illustrated in (b), rotation-invariant embeddings can help on certain flower categories, but may hurt animal recognition performance; conversely, color invariance generally seems to help coarse-grained animal classification, but can hurt many flower categories. Our method, shown in the following figure, overcomes this limitation.
* “The property negatively affects the learnt representations: 1) Generalizability and transferability are harmed if they are applied to the tasks where the discarded information is essential, e.g., color plays an important role in fine-grained classification of birds; 2) Adding an extra augmentation is complicated as the new operator may be helpful to certain classes while harmful to others, e.g., a rotated flower could be very similar to the original one, whereas it does not hold for a rotated car; 3) The hyper-parameters which control the strength of augmentations need to be carefully tuned for each augmentation to strike a delicate balance between leaving a short-cut open and completely invalidating one source of information.”
* Figure 2
  * “We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks.”
* “In this paper, we experiment with three types of augmentations: rotation, color jittering, and texture randomization” *and* “We use three types of augmentations as pretext tasks for static image data, namely color jittering (including random gray scale), random rotation (90°, 180°, or 270°), and texture randomization [10, 11] (details in the Appendix).”
* “We propose Leave-one-out Contrastive Learning (LooC), a framework for multi-augmentation contrastive learning. Our framework can selectively prevent information loss incurred by an augmentation. Rather than projecting every view into a single embedding space which is invariant to all augmentations, in our LooC method the representations of input images are projected into several embedding spaces, each of which is not invariant to a certain augmentation while remaining invariant to others, as illustrated in Figure 2. In this way, each embedding sub-space is specialized to a single augmentation, and the shared layers will contain both augmentation-varying and invariant information. We learn a shared representation jointly with the several embedding spaces; we transfer either the shared representation alone, or the concatenation of all spaces, to downstream tasks.”
* “Instead of mapping an image into a single embedding space which is invariant to all the hand-crafted augmentations, our model learns to construct separate embedding sub-spaces, each of which is sensitive to a specific augmentation while invariant to other augmentations. We achieve this by optimizing multiple augmentation-sensitive contrastive objectives using a multi-head architecture with a shared backbone. Our model aims to preserve information with regard to each augmentation in a unified representation, as well as learn invariances to them. The general representation trained with these augmentations can then be applied to different downstream tasks, where each task is free to selectively utilize different factors of variation in our representation.”
* “In general, the framework can be summarized as the following components: (i) A data augmentation module T constituting n atomic augmentation operators, such as random cropping, color jittering, and random flipping. Positive pair (q, k+) is generated by applying two randomly sampled data augmentations on the same reference image. (ii) An encoder network f which extracts the feature v of an image I by mapping it into a d-dimensional space $\mathbb{R}^d$. (iii) A projection head h which further maps extracted representations into a hyper-spherical (normalized) embedding space. This space is subsequently used for a specific pretext task, i.e., contrastive loss objective for a batch of positive/negative pairs.”
* “As a key towards learning a good feature representation [3], a strong augmentation policy prevents the network from exploiting naïve cues to match the given instances.”
* “The augmented views are encoded by a neural network encoder $f(\cdot)$ into feature vectors $v^q, v^{k_0}, \dots, v^{k_n}$ in a joint embedding space $V \in \mathbb{R}^d$. Subsequently, they are projected into $n+1$ normalized embedding spaces $Z_0, Z_1, \dots, Z_n \in \mathbb{R}^{d'}$ by projection heads $h : V \to Z$, among which $Z_0$ is invariant to all types of augmentations, whereas $Z_i$ ($\forall i \in \{1, 2, \dots, n\}$) is dependent on the $i^{th}$ type of augmentation but invariant to other types of augmentations. In other words, in $Z_0$ all features $v$ should be mapped to a single point, whereas in $Z_i$ ($\forall i \in \{1, 2, \dots, n\}$) only $v^q$ and $v^{k_i}$ should be mapped to a single point while $v^{k_j}$ ($\forall j \neq i$) should be mapped to $n-1$ separate points, as only $I_q$ and $I_{k_i}$ share the same $i^{th}$ augmentation” (see the sketch at the end of this section)
* “The network must preserve information w.r.t. all augmentations in the general embedding space V in order to optimize the combined contrastive learning objectives of all normalized embedding spaces.”
* “Learnt representations. The representation for downstream tasks can be from the general embedding space V (Figure 2, blue region), or the concatenation of all embedding sub-spaces (Figure 2, grey region). LooC method returns V; we term the implementation using the concatenation of all embedding sub-spaces as LooC++.”
* “Our representation shows consistent performance gains with increasing number of augmentations.”
* “Note that random rotation and texture randomization are not utilized in state-of-the-art contrastive learning based methods [3, 13, 4] and for good reason, as we will empirically show that naïvely taking these augmentations negatively affects the performance on some specific benchmarks.”
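To make the leave-one-out assignment above concrete, here is a minimal PyTorch sketch of the per-space contrastive objectives. This is my reconstruction from the quoted description, not the authors' code: the name `looc_losses`, the 128-d embedding size and the 0.07 temperature are assumptions; the same heads project queries and keys here, whereas MoCo would use momentum copies for the keys; and Z_0 is simplified to use only k_0 as its positive instead of mapping all keys to one point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def looc_losses(v_q, v_keys, heads, queues, tau=0.07):
    """v_q:    query features from the backbone, shape (B, d).
    v_keys: list [v_k0, ..., v_kn] of key features (from the momentum encoder, detached),
            each (B, d); only the query and k_i share the i-th augmentation's parameters.
    heads:  n+1 projection heads, one per embedding space Z_0 ... Z_n.
    queues: n+1 MoCo-style queues of past keys, each (K, d'), one per space."""
    total = 0.0
    for i, (head, queue) in enumerate(zip(heads, queues)):
        z_q = F.normalize(head(v_q), dim=1)                       # (B, d')
        z_k = [F.normalize(head(v), dim=1) for v in v_keys]       # n+1 tensors of (B, d')
        # Positive: k_i shares the i-th augmentation with the query (k_0 shares none,
        # so Z_0 stays invariant to every augmentation).
        l_pos = (z_q * z_k[i]).sum(dim=1, keepdim=True) / tau     # (B, 1)
        # Leave-one-out negatives: in Z_i (i >= 1) the other augmented keys differ from
        # the query in the i-th augmentation, so they act as extra negatives.
        extra = [] if i == 0 else [(z_q * z_k[j]).sum(dim=1, keepdim=True) / tau
                                   for j in range(1, len(z_k)) if j != i]
        # Per-space queue of keys from other images (standard MoCo negatives).
        l_queue = z_q @ queue.t() / tau                           # (B, K)
        logits = torch.cat([l_pos, *extra, l_queue], dim=1)
        labels = torch.zeros(logits.size(0), dtype=torch.long)    # the positive sits at index 0
        total = total + F.cross_entropy(logits, labels)
    return total

# Toy usage with random stand-ins (B=4, d=2048, d'=128, K=16384, n=2 augmentations):
heads = nn.ModuleList(nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128))
                      for _ in range(3))            # heads for Z_0, Z_rot, Z_color
queues = [F.normalize(torch.randn(16384, 128), dim=1) for _ in range(3)]
v_q = torch.randn(4, 2048)                          # query features from f(.)
v_keys = [torch.randn(4, 2048) for _ in range(3)]   # k_0, k_rot, k_color from the key encoder
print(looc_losses(v_q, v_keys, heads, queues))
```

The "leave-one-out" name comes from this assignment: in each $Z_i$ the key that shares the $i^{th}$ augmentation with the query is the only positive, while the remaining augmented keys of the same image serve as extra negatives. In MoCo terms, all keys come from the momentum encoder and are never back-propagated, which is why (as quoted in the Questions section) the additional keys add little compute or memory.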
# Experiments
* “We evaluate our approach across a variety of diverse tasks including large-scale classification [5], fine-grained classification [34, 33], few-shot classification [23], and classification on corrupted data [2, 16].”
* “We train our model on the 100-category ImageNet (IN-100) dataset, a subset of the ImageNet [5] dataset. […] subset contains ∼125k images”
* “After training, we adopt linear classification protocol by training a supervised linear classifier on frozen features of feature space V for LooC, or concatenated feature spaces Z for LooC++”

## Table 1
“Study on augmentation inductive biases.”

### Methodology
“We start by designing an experiment which allows us to directly measure how much an augmentation affects a downstream task which is sensitive to the augmentation. For example, consider two tasks which can be defined on IN-100: Task A is 4-category classification of rotation degrees for an input image; Task B is 100-category classification of ImageNet objects. We train a supervised linear classifier for task A with randomly rotated IN-100 images, and another classifier for task B with unrotated images. In Table 1 we compare the accuracy of the original MoCo (w/o rotation augmentation), MoCo w/ rotation augmentation, and our model w/ rotation augmentation.”

### Conclusion
“A priori, with no data labels to perform augmentation selection, we have no way to know if rotation should be utilized or not. Adding rotation into the set of augmentations for MoCo downgrades object classification accuracy on IN-100, and significantly reduces the capacity of the baseline model to distinguish the rotation of an input image. We further implement a variation enforcing the random rotating angle of query and key always being the same. Although it marginally increases rotation accuracy, IN-100 object classification accuracy further drops, which is in line with our hypothesis that the inductive bias of discarding certain type of information introduced by adopting an augmentation into contrastive learning objective is significant”

## Table 2
Fine-grained recognition

### Methodology
“To fairly evaluate this, we compare our method with original MoCo on a diverse set of downstream tasks. Table 2 lists the results on iNat-1k, CUB-200 and Flowers-102.”

### Conclusion
“A prominent application of unsupervised learning is to learn features which are transferable and generalizable to a variety of downstream tasks.”
“Our method demonstrates superior generalizability and transferability with increasing number of augmentations.”
“Although demonstrating marginally superior performance on IN-100, the original MoCo trails our LooC counterpart on all other datasets by a noticeable margin.”
“Specifically, applying LooC on random color jittering boosts the performance of the baseline which adopts the same augmentation. The comparison shows that our method can better preserve color information. Rotation augmentation also boosts the performance on iNat-1k and Flowers-102, while yielding smaller improvements on CUB-200, which supports the intuition that some categories benefit from rotation-invariant representations while some do not. The performance is further boosted by using LooC with both augmentations, demonstrating the effectiveness in simultaneously learning the information w.r.t. multiple augmentations. Interestingly, LooC++ brings back the slight performance drop on IN-100, and yields more gains on iNat-1k, which indicates the benefits of explicit feature fusion without hand-crafting what should or should not be contrastive in the training objective.”

## Table 3
Robustness learning results

### Methodology
“Table 3 compares our method with MoCo and supervised model on ON-13 and IN-C-100, two testing sets for real-world data generalization under a variety of noise conditions. The linear classifier is trained on standard IN-100, without access to the testing distribution.”

### Conclusion
Rotation augmentation is beneficial for ON-13, and texture augmentation is beneficial for IN-C-100.
“The fully supervised network is most sensitive to perturbations, albeit it has highest accuracy on the source dataset IN-100. We also see that rotation augmentation is beneficial for ON-13, but significantly downgrades the robustness to data corruptions in IN-C-100. Conversely, texture randomization increases the robustness on IN-C-100 across all corruption types, particularly significant on “Blur” and “Weather”, and on the severity level above or equal to 3, as the representations must be insensitive to local noise to learn texture-invariant features, but its improvement on ON-13 is marginal. Combining rotation and texture augmentation yields improvements on both datasets, and LooC++ further improves its performance on IN-C-100.”
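For reference, the linear evaluations above (Tables 1-3) all follow the frozen-feature protocol quoted in the Experiments section: the backbone is frozen and only a linear classifier is trained on top of V (LooC) or the concatenated Z spaces (LooC++). A minimal sketch, with a hypothetical `backbone` and `train_loader`; the notes only give the batch size (256) and defer other details to the paper's Appendix, so the epoch count and learning rate below are placeholder values in the style of MoCo's linear evaluation.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, feat_dim=2048, num_classes=100, epochs=60):
    """Train a supervised linear classifier on frozen backbone features."""
    backbone.eval()                                   # frozen features of space V
    for p in backbone.parameters():
        p.requires_grad = False
    classifier = nn.Linear(feat_dim, num_classes)
    # lr=30.0, no weight decay is the usual MoCo linear-eval recipe (an assumption here).
    optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)              # V for LooC; concat of Z spaces for LooC++
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```

Task A in Table 1 is the same protocol with a 4-way rotation classifier instead of the 100-way object classifier.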
## Figure 3
Qualitative results

### Methodology
"In Figure 3 we show nearest-neighbor retrieval results using features learnt with LooC vs. corresponding MoCo baseline."
"The figure shows the top nearest-neighbor retrieval results of LooC vs. corresponding invariant MoCo baseline with color (left) and rotation (right) augmentations on IN-100 and iNat-1k."

### Conclusion
"The top retrieval results demonstrate that our model can better preserve information which is not invariant to the transformations presented in the augmentations used in contrastive learning."
"The results show that our model can better preserve information dependent on color and rotation despite being trained with those augmentations."

## Table 4
"Ablation: MoCo w/ all augmentations vs. LooC"

### Methodology
"We compare our method and MoCo trained with all augmentations. We also add multiple Conv5 heads to MoCo, termed as MoCo++, for a fair comparison with LooC++."

### Conclusion
"Using multiple heads boosts the performance of baseline MoCo; nevertheless, our method achieves better or comparable results compared with its baseline counterparts."

## Table 5
"Comparisons of concatenating features from different embedding spaces in LooC++"

### Methodology
"Jointly trained on color, rotation and texture augmentations."

### Conclusion
"Different downstream tasks show nonidentical preferences for augmentation-dependent or invariant representations."

## Figure 4
Ablation: Augmentation-dependent embedding spaces vs. tasks

### Methodology

### Conclusion

# Summary

# Future work?

---
# Students' questions