# How relevant is representation learning for copy detection in an adversarial setting?

**Made by Pierre Bongrand (5396077) & Alessandro Duico (5621402)**

## Introduction

In this blog, we present our approach to the [Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/page/376/) introduced by Facebook AI in late 2021. The challenge belongs to the field of copy detection: the goal is to build a model that detects whether a query image is derived from one of the images in a large corpus of reference images.

In the context of this challenge, the problem has an adversarial component, which makes it challenging to solve. When content is flagged as not displayable on a platform, users may modify it and repost it, hoping that the system won't detect that it is the same content as previously posted.

Our approach is to take a pre-existing, pre-trained architecture used for representation learning and fine-tune it for the specific task of copy detection.

## Problem Definition

The Facebook AI challenge had two tracks: Matching and Descriptor. We chose the **Matching** track. The problem definition from the original challenge is as follows[^chalpaper]:

The challenge consists of determining, for each query image, whether it originated from one of the reference images (and assigning a confidence score indicating its similarity to the candidate reference image).

The image on the left is a reference image. The image on the right is a query image that has been derived from the reference image -- in this case, it has been cropped and combined with another image.

![](https://i.imgur.com/2IzNeC5.png)

Edited query images may have been modified using several techniques. The query image below had filters, brushmarks, and script applied using photo editing software.

![](https://i.imgur.com/7SljilW.png)

The editing techniques employed on this dataset include cropping, rotations, flips, padding, pixel-level transformations, different color filters, changes in brightness level, and the like.

We were given **1 million reference images** and **50'000 query images**. However, due to resource constraints, we were only able to address this problem on a subset of the given data. We therefore consider the same problem as described above, but with **20'000 reference images** and **10'000 query images**.

Note that in the original problem some queries might not have copies in the reference image set. However, to make the problem more tractable, we decided to only consider queries for which we have a reference.

**The problem is therefore not:** Is there a transformed copy of the query image among the references? **But:** Which reference image is the transformed copy of the query image?

## Related Work

### SimCLR: Representation learning in a contrastive setting

SimCLR[^SIMCLR] learns image representations through contrastive learning. **The goal of representation learning is to make it easier to extract useful information when building classifiers or other predictors.** To achieve this goal, the approach of T. Chen et al. is to train the network to learn a representation that maximizes the agreement between two differently transformed views of the same image, while minimizing the agreement between transformed views of different images. SimCLR relies on the concept of positive and negative pairs. This is illustrated in the GIF[^gif] below.
![](https://i.imgur.com/xnz1XPB.gif)

In order to both attract similar images and repel different images, SimCLR uses the following loss:

![](https://i.imgur.com/6IaAZ7s.png)

In this case, the pair $i,j$ is a pair of similar images (positive pair), and $i,k$ is a pair of different images (negative pair).

### Interesting approaches

As this is an open challenge that finished in October 2021, previous work has already been done to solve it. We will describe and investigate two very interesting solutions to the problem defined above.

#### Winning Solution of the challenge - W. Wang et al.[^winner]

The main takeaways of the winning solution are the following:

1. Unsupervised pre-training substitutes the commonly used supervised one. Bootstrap Your Own Latent (BYOL)[^BYOL] was applied, which can be seen as an upgraded version of SimCLR for contrastive learning, as it does not need negative pairs. BYOL relies on a ResNet backbone, which itself is a CNN.
2. Data augmentation is performed with their own image transformation functions. A set of basic transformations and 6 advanced transformations are exploited.
3. During testing, a global-local and local-global matching strategy is proposed. The strategy performs local verification between reference and query images.

The image below summarizes the framework and methods used.

![](https://i.imgur.com/yp0tskk.png)

#### Runner-up Solution of the challenge - SeungKee Jeon[^runnerup]

This submission is somewhat original: the paper describing the solution is very brief, but the technique used is unique and performs surprisingly well. The main takeaways of the runner-up solution are the following:

1. Data augmentation was performed similarly to the winner's solution.
2. During training, each query and reference image were concatenated, and a ViT[^ViT] was trained to predict whether the query originates from the reference.

Unlike the previous method, ViT leverages transformers. ViT is known to exhibit extraordinary performance when trained on enough data, and it can outperform CNNs with 4x fewer computational resources. Arguably, what makes ViT perform well here is the global attention that makes it possible to learn the relationship between the concatenated query and reference image. Note that the ablation study performed by the winning team shows that local verification plays a very important role in their model.

Unfortunately, the details of the concatenation implementation were not described. However, it would be interesting to investigate the code provided [here](https://github.com/seungkee/2nd-place-solution-to-Facebook-Image-Similarity-Matching-Track) to understand why this solution performs so well.

## Method

SimCLR has already been used in other computer vision problems to learn a generic representation of images on an unlabeled dataset, with a second network fine-tuning the representation on a smaller amount of labeled data. Image matching tasks can be extremely compute- and resource-heavy. Just for reference, on ImageNet[^ImageNet] BYOL takes about two weeks to pre-train for 300 epochs with a ResNet-152 backbone using 8 NVIDIA Tesla V100 32GB GPUs. Using a pre-trained SimCLR network and fine-tuning it is therefore the most reasonable solution we found. The fine-tuning is performed by a fully connected neural network.
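For concreteness, here is a minimal sketch of this setup. We assume the pre-trained `r50_2x_sk0` checkpoint is available as a TensorFlow SavedModel whose `final_avg_pool` output holds the 4096-dimensional representation (as in the SimCLRv2 release); the path, the output key, and the call signature are assumptions for this sketch, not verbatim from our code.

```
import tensorflow as tf

# Frozen pre-trained SimCLR backbone (the path is a placeholder).
backbone = tf.saved_model.load("r50_2x_sk0/saved_model")

# Trainable fully connected head on top of the frozen representation.
dense_layer = tf.keras.layers.Dense(units=4096, name="head_supervised_new")

def embed(images):
    """Map a batch of (224, 224, 3) images to 4096-d fine-tuned embeddings."""
    # We assume the SavedModel returns a dict of tensors and that
    # 'final_avg_pool' is the 4096-d backbone representation.
    features = backbone(images, trainable=False)["final_avg_pool"]
    return dense_layer(features)
```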
Moreover, unlike the winning and runner-up solutions, our main issue is not a lack of training samples, but a lack of resources to exploit the samples we already have. Therefore, while the state-of-the-art solutions perform data augmentation to train their BYOL- or ViT-based models, we downsampled the training set.

Our specs: 4 cores of Intel(R) Xeon(R) CPU @ 2.80GHz (on GCP)

### Architecture of the pre-trained network:

We use the pre-trained SimCLR model `r50_2x_sk0` from the [SimCLR Github](https://github.com/google-research/simclr/). The main properties of the model are the following:

| Depth | Width | SK | Param (M) | F-T (1%) | F-T (10%) | F-T (100%) | Linear eval | Supervised |
|------:|------:|------:|----------:|---------:|----------:|-----------:|------------:|-----------:|
| 50 | 1X | False | 24 | 57.9 | 68.4 | 76.3 | 71.7 | 76.6 |
| 50 | 1X | True | 35 | 64.5 | 72.1 | 78.7 | 74.6 | 78.5 |
| --> **50** | **2X** | **False** | **94** | **66.3** | **73.9** | **79.1** | **75.6** | **77.8** |
| 50 | 2X | True | 140 | 70.6 | 77.0 | 81.3 | 77.7 | 79.3 |
| [...] | | | | | | | | |

The row marked with `-->` (in **bold**) is the model we use. For an explanation of the headers, refer to Table 1 of the <a href="https://arxiv.org/abs/2006.10029">SimCLRv2</a> paper.

### Architecture of the fine-tuning network:

We pipe the outputs of the pre-trained model into a fully connected layer that maps the pre-trained representation (4096 features) to a new representation of the same size.

```
self.dense_layer = tf.keras.layers.Dense(units=4096, name="head_supervised_new")
```

We also experimented with a 3-layer head, as well as with no fine-tuning at all (see the Experiments section).

### Fine-tuning loss

#### Triplet loss

We initially considered using the triplet loss, which takes 3 images as input, minimizing the distance for the positive pair and maximizing the distance for the negative pair. However, this approach did not seem effective for our dataset: since each query only has one matching image and many non-matching ones, every triplet for a given query would always contain the same positive pair.

#### Fine-tuning with the temperature loss

Since contrastive learning can be applied to this problem, we wanted to use a contrastive loss for our fine-tuning network as well. We first tried fine-tuning with the original temperature-scaled loss from SimCLR[^SIMCLR].

![](https://i.imgur.com/6IaAZ7s.png)

However, preparing a TensorFlow dataset with batches compatible with this loss turned out not to be an easy task. After countless hours of debugging with TensorFlow, we chose to use another loss.

#### Fine-tuning with the original contrastive loss

In the end, we settled for a simpler loss:

![](https://i.imgur.com/K4wPyeY.png)

* $X_1$ and $X_2$ are the points for which we check the similarity.
* $Y$ specifies the similarity between the two points ($X_1$ and $X_2$):
    * $Y=0$ if similar
    * $Y=1$ if dissimilar
* $L_S$ is the loss function applied to the output in case the images are similar.
* $L_D$ is the loss applied in case the images are dissimilar.
* $D_W$ is the dissimilarity (distance) between the representations of the two points.

This loss was introduced by Yann LeCun et al.[^yann] in *Learning a Similarity Metric Discriminatively*.
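Since the loss above is only shown as an image, here is our transcription of it in the margin-based form that our implementation follows (this matches the code below; $m$ is the margin, $D_W$ is the Euclidean distance between the two representations, and $Y=0$ for a positive pair, $Y=1$ for a negative pair):

$$
L(W, Y, X_1, X_2) = \frac{1}{2}\,(1 - Y)\, D_W^2 + \frac{1}{2}\, Y \left(\max(0,\, m - D_W)\right)^2
$$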
##### Implementation

Our implementation of this loss:

```
left_output = model(left)    # shape [BATCH_SIZE, 4096]
right_output = model(right)  # shape [BATCH_SIZE, 4096]

# Squared Euclidean distance and Euclidean distance between the two representations
d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)

# label = 0 for a positive pair (pull the representations together),
# label = 1 for a negative pair (push them apart up to the margin)
loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
```

Each element of the dataset is a tuple of:

+ **left** (image): Tensor(224,224,3)
+ **right** (image): Tensor(224,224,3)
+ **label** (0 if positive, 1 if negative): float

### Dataset loader

As this was our first computer vision project, loading the images was a challenge in itself. We found an elegant solution for generating our dataset. We create the dataset using a generator, so images do not have to be downloaded beforehand. When the training loop runs, the loader function checks whether the required image exists on disk; otherwise it downloads it from [aws](https://drivendata-prod.s3.amazonaws.com/data/79/public/downloading_image_archives.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220617%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220617T194630Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=8d561a87a78c02f1deb23fafd36f07908374526ab17de5030d5b8e8a46be5734) (which is where the images are available online). This allows for testing with a subset of the dataset, downloading only what is required. Of course, downloading in advance makes training significantly faster. In our case, we downloaded the first 20k reference pictures to disk and picked negatives from those.

## Experiments

Throughout this project, different settings were experimented with. In this section, we describe some of these attempts and explain the insights we gained.

### Setup

The dataset used consists of 20'000 reference images. The training set consists of 7'000 query images and the test set consists of 3'000 query images. It is important to note that in our setup **each query image has a corresponding match within the reference images**, which is not the case in the original problem. For each training instance, we compare the query to its matching reference, and then we compare the query to 1 negative sample. During training, the negative is sampled from a pool of 20'000 images, instead of 1 million in the original problem.

### Metrics used

The metric used for measuring the performance of our model is Top-k accuracy. As our model outputs a 4096-dimensional vector, we measure the distance between the representation of the query image and the representations of every reference. The Top-k accuracy is then the ratio of queries for which the matching reference is among the k closest references. We use Top-1, Top-5, and Top-10. Note that, by definition, the accuracy becomes higher as we increase k. The distance used to compare representations is the **(squared) Euclidean distance**.

```
# squared Euclidean distance between two batches of representations
d = tf.reduce_sum(tf.square(left - right), 1)
```
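As an illustration, here is a minimal sketch of how this Top-k accuracy can be computed from the representations. The variable names (`query_embeddings`, `reference_embeddings`, `match_index`) are ours for this example and not taken from our actual code.

```
import tensorflow as tf

def top_k_accuracy(query_embeddings, reference_embeddings, match_index, k=5):
    """Fraction of queries whose matching reference is among the k closest references.

    query_embeddings:     [num_queries, 4096] representations of the query images
    reference_embeddings: [num_references, 4096] representations of the reference images
    match_index:          [num_queries] index of the true reference for each query
    """
    # Squared Euclidean distances between every query and every reference,
    # expanded as ||q||^2 + ||r||^2 - 2 q.r to avoid a huge intermediate tensor.
    q_sq = tf.reduce_sum(tf.square(query_embeddings), axis=1, keepdims=True)  # [Q, 1]
    r_sq = tf.reduce_sum(tf.square(reference_embeddings), axis=1)             # [R]
    d = q_sq + r_sq - 2.0 * tf.matmul(query_embeddings, reference_embeddings, transpose_b=True)

    # Indices of the k closest references for each query (smallest distances).
    _, top_k = tf.math.top_k(-d, k=k)

    # A query counts as correct if its true reference appears among those k indices.
    match = tf.cast(tf.expand_dims(match_index, 1), top_k.dtype)
    hits = tf.reduce_any(tf.equal(top_k, match), axis=1)
    return tf.reduce_mean(tf.cast(hits, tf.float32))
```

Top-1 and Top-10 are obtained in the same way with `k=1` and `k=10`.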
### Architecture and network parameters testing

The table below summarizes all the experiments and network parameters we tried in order to achieve better results. Note that each experiment setting was run 2 to 3 times and the reported numbers are an average; we always observed noticeable variance between the runs. An arrow ('->') indicates what was changed compared to the previous experiment. The numbers in **bold** represent the best score for each Top-k.

| ID | learning rate | weight decay | Batch size | Fine-tuning Model | Notes | Top1 | Top5 | Top10 |
| --- | ------------- | ------------ | ---------- | ------------------------ | -------------------------------------------------- | --------- | --------- | --------- |
| 0 | 0.1 | 0. | 1 | 1 Dense layer | Small batch sizes don't work with contrastive loss | 0. | 0. | 0. |
| 1 | 0.1 | 0. | ->64 | 1 Dense layer | | 0.312 | 0.412 | 0.471 |
| 2 | ->NONE | ->NONE | 64 | -> 1 Dense layer (noisy) | no fine-tuning (run by mistake) | 0.477 | 0.552 | 0.576 |
| 3 | NONE | NONE | 64 | ->NONE | no fine-tuning | **0.479** | **0.554** | **0.581** |
| 4 | ->0.1 | 0. | 64 | ->3 Dense layers | loss increases to +inf | 0. | 0. | 0. |
| 5 | ->0.01 | 0. | 64 | 3 Dense layers | loss increases to +inf | 0. | 0. | 0. |
| 6 | ->0.001 | 0. | 64 | 3 Dense layers | bad partial accuracy, interrupted | <0.2 | <0.3 | <0.3 |
| 7 | 0.001 | ->10e-4 | ->256 | 3 Dense layers | | 0.135 | 0.165 | 0.18 |
| 8 | 0.001 | 10e-4 | 256 | ->1 Dense layer | | 0.18 | 0.226 | 0.256 |

<!-- | 8a | NONE | NONE | 64 | 0.517 | 0.597 | 0.625 | no Dense layer, only 1000 test references | -->

#### Observations:

We will walk you through our thought process and the reasoning behind each parameter change.

0. We first tried a small batch size in the hope that the model would be fast to run, but we realized that the batch size must be high for this loss to work: our attempts at training with BATCH_SIZE=1 were disastrous, with the loss increasing to infinity and the Top-1, Top-5, and Top-10 scores all at 0.
1. Increasing the batch size to 64 immediately gave us strong results.
2. **We ran an experiment by accident that turned out to be our best-performing solution with a fine-tuning layer. It computed predictions with an untrained fully connected layer, i.e. with random weights.**
3. **We measured how well SimCLR alone performs. SimCLR without fine-tuning is the best-performing model overall.**
4. We assumed that 1 layer was not enough to learn any classification and used 3 dense layers. It did not work at all: during training, the loss went to infinity.
5. We tried to decrease the learning rate, reasoning that the original representation was better. It still did not work, with the loss going to infinity.
6. We decreased the learning rate again by a factor of 10. This setup worked, but performed worse than the original single dense layer.
7. After so many attempts with a batch size of 64, we evaluated a bigger batch size (256) and added some weight decay. It performed worse than 6, which was surprising.
8. Using only 1 layer with the otherwise same parameters performed better than 7.

From these experiments, we have the intuition that increasing the weight decay even more and/or reducing the learning rate, combined with longer training, would improve the model.

## Discussion

Our final results are convincing. Achieving almost 0.5 accuracy at Top-1 is an excellent result, as there are 3'000 possible candidates (the other pictures in the set).

### Minor image transformation

Below are two examples, with the query at the top, the reference we have to find in the middle, and the image our model returned at the bottom. The model used for these matchings is model 2, the best-performing setup that includes the fine-tuning layer.

![](https://i.imgur.com/hbMhiz6.png)
![](https://i.imgur.com/oPwOCzZ.png)

We observe in these examples that when only minor transformations are applied - random noise, writing on top, adding small shapes on the edges - our model is still able to identify the corresponding reference image.
### SimCLR's strong semantic representation

Moreover, we observed that the pre-trained network we use (SimCLR) has been trained to distinguish images by their semantic meaning. An example of this is shown below (from experiment ID=2):

![](https://i.imgur.com/Phrmydw.png)

We observe that the query image and the retrieved one have nothing in common when comparing the images themselves, except that they both represent a train. This implies that SimCLR detects the type of object in an image and returns an image containing a similar object; it is not analyzing the image itself. We observed this with other examples (from experiment ID=2):

![](https://i.imgur.com/cWJHLsz.png)

In this case, we see that the model understood that we wanted to match a garden with mostly grass, but it didn't match it with the correct one.

This phenomenon should be better reflected in our metrics. Using SimCLR with such a strong semantic representation makes the task easier and skews our scores: the problem should be copy detection and not object detection. Moreover, with only 3'000 images, only so many different objects can be represented. Therefore, to have a fair comparison with previous work, we should run our experiments on images of a single category. Unfortunately, we were not able to test this, but it would be an interesting experiment.

### Future experiments

Currently, we train the network with 1 negative per positive pair, but the number of negatives per positive pair is something we could vary in future experiments.

As explained in the section above, an experiment with only references of the same object type would allow us to measure how well our model can do image matching and not only object matching.

### Afterthoughts

Computing a vector representation for each reference image does not scale well with big datasets (millions of images). A better approach would be to use multiple layers that converge to a single neuron. This single output predicts the same label that we have in the dataset: 0 for positive pairs and 1 for negative ones. During validation, instead of sorting the reference images by distance to the query, it would only be necessary to sort them by the prediction of the model (lowest is the best match); a sketch of this idea is shown below. We should also train end-to-end, so that we can obtain a representation that is less focused on semantics and more on image patterns.
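As a rough illustration of this idea (a sketch only; the layer sizes and names are ours, and we did not train this variant), the pairwise scoring head could look like this, trained with a binary cross-entropy loss on the existing 0/1 labels:

```
import tensorflow as tf

# Hypothetical pairwise scoring head: several layers converging to a single
# output neuron that predicts the dataset label (0 = positive pair, 1 = negative pair).
pair_scorer = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(2 * 4096,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single neuron: lowest output = best match
])
pair_scorer.compile(optimizer="adam", loss="binary_crossentropy")

def match_scores(left_output, right_output):
    """Score a batch of (query, reference) representation pairs."""
    pairs = tf.concat([left_output, right_output], axis=1)  # [BATCH_SIZE, 2 * 4096]
    return pair_scorer(pairs)
```

At validation time, the references for a query would then be sorted by `match_scores` instead of by Euclidean distance.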
## Conclusion

We are surprised by the results: SimCLR performs surprisingly well on the copy detection (or, in our specific case, copy identification) problem. Moreover, we were not expecting SimCLR to have a stronger semantic representation than image pattern recognition. It could be that the pre-trained network we selected was built with this specific goal in mind, but it was not described as such. We wish we had had the time to investigate more and explore other network architectures. It would have been interesting to compare SimCLR in this setting against more traditional approaches with CNNs.

## Implementation

A repository containing details about the implementation is available [here](https://github.com/Duico/Image-matching-SimCLR).

[^chalpaper]: https://arxiv.org/pdf/2106.09672.pdf
[^SIMCLR]: https://arxiv.org/abs/2002.05709
[^gif]: https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html
[^winner]: https://arxiv.org/pdf/2111.07090.pdf
[^runnerup]: https://arxiv.org/pdf/2111.09113.pdf
[^ImageNet]: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5206848
[^BYOL]: https://arxiv.org/pdf/2006.07733.pdf
[^ViT]: https://arxiv.org/pdf/2010.11929.pdf
[^yann]: http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf