# Reproducibility study of Attentional Constellation Nets for Few-Shot Learning

## Authors

- Koen Bessels - 4593537
- Michiel O'Herne - 4387783
- Rick Dekker - 4682548

## Introduction

This blog post presents our efforts to reproduce the deep learning paper "Attentional Constellation Nets for Few-Shot Learning"[^4]. The project was done as part of the course "Deep Learning" in Q3 2022 at TU Delft.

"Attentional Constellation Nets for Few-Shot Learning" introduces a novel method that combines convolutional neural networks with a constellation model and attention learning for the few-shot learning problem, aptly named ConstellationNet. Our work consists of reproducing results from the paper, testing the model on a new dataset, and an ablation study, in an attempt to broaden the available details of this new method.

## Paper exposition

Below is a quick overview of ConstellationNet and the reasoning behind its implementation.

### Few-shot learning

In essence, few-shot learning is the problem of making predictions based on a very limited number of samples. A visualisation of such few-shot sample data is shown below.

![](https://i.imgur.com/ayoocbv.png)
*Figure 1: Few-shot learning (image adapted from [^3])*

With so few samples available, the focus of learning shifts from recognizing objects outright towards telling objects apart based on their features. ConstellationNet targets the 1-shot and 5-shot settings, and all our experiments were carried out on these settings as well.

### Constellation model

The constellation model family consists of models that aim to detect objects on a categorical level, classifying objects as part of a group instead of focusing on individual classification. Essential to this approach is the geometric relationship between the separate parts of an object, which gives constellation models the capacity to categorize on more than appearance alone. In essence, parts of the object are identified and then compared to each other in a spatial configuration, as shown in the image below:

![](https://i.imgur.com/HtOksYz.png)
*Figure 2: Spatial configuration of detections (image adapted from [^5])*

As illustrated in the figure above, a constellation model looks at the found features as well as the geometric relationships between them.

### Attention learning

The final important ingredient of ConstellationNet is self-attention, a computational analogue of the cognitive attention used by organisms. The idea is to map inputs to importance scores, enabling the algorithm to "focus" on inputs with heavier weights relative to less impactful ones.

### Linking the three methods to the final network

ConstellationNet can be described as an extended convolutional neural network, with several additions that give it much better performance on few-shot learning problems. The first of these is a constellation model added to the CNN framework, which clusters features while taking their geometric relationships into account. On top of this, a self-attention mechanism further models the relations between the found features, creating a more concrete model of objects.
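To make the attention ingredient concrete, below is a minimal PyTorch sketch of scaled dot-product self-attention over the spatial cells of a convolutional feature map. This is our own illustration of the general mechanism, not the paper's exact attention module; all names and shapes are ours.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the H*W cells of a feature map
    (illustrative sketch, not ConstellationNet's exact attention module)."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.key(x).flatten(2)                    # (B, C, HW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x  # residual connection keeps the original features

# Example: attend over an 8x8 feature map with 64 channels
features = torch.randn(4, 64, 8, 8)
attended = SpatialSelfAttention(64)(features)
```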
In conclusion, ConstellationNet combines the above-mentioned methods in a novel manner to arrive at a state-of-the-art convolutional neural network that can be trained on few-shot problems in timeframes roughly comparable to current models, while outperforming them nearly across the board when trained and tested on well-known few-shot benchmark datasets. The figure below highlights the working of each cycle and illustrates the pipeline of the entire model.

![](https://i.imgur.com/GVxe4va.png)
*Figure 3: ConstellationNet pipeline, taken from [^4]*

## Reproducibility approach

To reproduce this paper we chose to use the GitHub repository already created for ConstellationNet. The code is well documented and easily usable, which means other valuable experimentation can be done on this relatively new network. The original GitHub repository can be found [here](https://github.com/mlpc-ucsd/ConstellationNet).

The first step is to check whether the results from the paper can be reproduced with the authors' own repository, by reproducing the results from table 2, row 6 of the paper[^4]. For convenience, the table is shown below:

![](https://i.imgur.com/pVIWWFo.png)
*Figure 4: A copy of table 2 from the ConstellationNet paper[^4]*

After this, we train ConstellationNet on the PACS dataset[^1]. The PACS dataset consists of four different domains: photos, art, cartoons and sketches.

![](https://i.imgur.com/b2Jr4s5.jpg)
*Figure 5: A visualisation of the PACS dataset, taken from [^6]*

PACS is used to test how well a model generalizes to unseen data. For this, the model is trained on three of the domains (each split 9:1 into training and validation data) and tested on the remaining domain[^2].

Finally, we study some clustering ablations: instead of the mini-batch K-means used in the original paper we look at HDBSCAN, and we experiment with the clustering batch size of mini-batch K-means.

## Google Cloud services

To train the network, Google Cloud services were used. This service allows users to create virtual machines with high-end hardware, which was especially useful for this project since the original code requires an NVIDIA GPU to train the network. Training networks can take hours or even days of computing, depending on the size of the dataset and the complexity of the model. Although we trained on fairly small datasets that were trainable in a couple of hours, the service did help us perform multiple experiments. The hardware and settings used for our virtual machine were:

* Zone: europe-west4-a
* Operating system: Ubuntu 20.04
* Machine type: n1-standard-8
* GPU: NVIDIA T4
* Disk space: 100 GB

From there it was a matter of installing the correct drivers for the GPU and following the step-by-step installation instructions in the README of the original GitHub repository.

## Results

### GitHub repository

To be able to make changes to the original code and add extra files, we created our own GitHub repository, forked from the original. Our repository can be found [here](https://github.com/MrMeepsle/ConstellationNet).

### Checking reproducibility

The first thing that had to be done was checking whether the original code could reproduce the results stated in the paper. The paper does not state how many times the model was trained to average out its performance.
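The ± values reported below (and in the paper) presumably follow the usual few-shot convention: a 95% confidence interval of the mean accuracy over many randomly sampled test episodes. A minimal sketch of that computation, assuming an array of per-episode accuracies (our own illustration, not code from the repository):

```python
import numpy as np

def mean_with_ci(episode_accuracies, z: float = 1.96):
    """Mean accuracy and 95% confidence interval over few-shot test episodes."""
    acc = np.asarray(episode_accuracies, dtype=float)
    mean = acc.mean()
    # standard error of the mean; z = 1.96 gives a 95% interval
    ci = z * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, ci

# Example: accuracies from 2000 randomly sampled 5-way test episodes
accs = np.random.uniform(0.6, 0.8, size=2000)
mean, ci = mean_with_ci(accs)
print(f"{100 * mean:.2f} +/- {100 * ci:.2f}")
```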
For our own training we trained the network twice, which gave the following results:

| Model | Backbone | Dataset | 1-shot | 5-shot |
|---|---|---|---|---|
| ConstellationNet (paper) | Conv-4 | CIFAR-FS | 69.3 ± 0.3 | 82.7 ± 0.2 |
| ConstellationNet (reproduced) | Conv-4 | CIFAR-FS | 68.62 ± 0.3 | 82.54 ± 0.2 |

The reproduced values do not match the paper's exactly, but they are close enough that we concluded the results are reproducible. With that established, we moved on to further experiments with ConstellationNet.

### Model generalization study with PACS

As discussed before, ConstellationNet was tested on the PACS dataset. To be able to train on this dataset, a set of adaptations to the code had to be made. Multiple configuration files were created that specify, among other things, the hyperparameters used for training and evaluation. Furthermore, a script called "pacs.py", which can be found in the datasets folder, was written. This script locates the PACS dataset and transforms it in such a way that it can be used by the ConstellationNet code without further changes (a simplified sketch of the split logic is shown after the results below). From there, the regular training command line can be used to train on PACS.

| Model | Photos | Art | Cartoon | Sketch | Avg. |
|---|---|---|---|---|---|
| 1-shot ConstellationNet (Conv-4) | 53.42 ± 0.24 | 36.11 ± 0.22 | 55.18 ± 0.25 | 58.68 ± 0.24 | 50.85 ± 0.24 |
| 5-shot ConstellationNet (Conv-4) | 67.43 ± 0.17 | 49.32 ± 0.17 | **70.42 ± 0.16** | **71.78 ± 0.18** | 64.74 ± 0.17 |
| DSN [^2] | 83.25 | 61.13 | 66.54 | 58.58 | 67.38 |
| uDICA [^2] | **91.78** | 64.57 | 64.54 | 51.12 | 68.00 |
| low rank CNN [^2] | 89.50 | **62.86** | 66.97 | 57.51 | **69.21** |

*The accuracy (%) of different architectures on PACS; each architecture is trained on three domains and tested on the remaining domain. The results for the other architectures are taken from the paper introducing PACS[^2].*

For every domain, the 5-shot version of ConstellationNet performs better than the 1-shot version. Overall, the other architectures perform better on photos, but interestingly ConstellationNet's photo-classification performance is similar to its performance on the cartoon and sketch domains. On the art domain it performs worse than the other architectures and worse than on the other domains.
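As mentioned above, our "pacs.py" script prepares the data for this leave-one-domain-out protocol. The sketch below illustrates the split logic under simplified assumptions; the function and data layout are our own illustration, not the actual script.

```python
import random

DOMAINS = ["photo", "art_painting", "cartoon", "sketch"]

def leave_one_domain_out(samples_by_domain: dict, test_domain: str,
                         val_fraction: float = 0.1, seed: int = 0):
    """Train on three PACS domains (9:1 train/val split), test on the fourth.

    `samples_by_domain` maps a domain name to a list of (image_path, label)
    pairs; this is a simplified illustration of what our pacs.py script does.
    """
    rng = random.Random(seed)
    train, val = [], []
    for domain in DOMAINS:
        if domain == test_domain:
            continue
        samples = list(samples_by_domain[domain])
        rng.shuffle(samples)
        n_val = int(len(samples) * val_fraction)
        val.extend(samples[:n_val])
        train.extend(samples[n_val:])
    test = list(samples_by_domain[test_domain])
    return train, val, test

# Example with a toy dataset layout (PACS has 7 classes):
samples = {d: [(f"{d}/img_{i}.jpg", i % 7) for i in range(100)] for d in DOMAINS}
train, val, test = leave_one_domain_out(samples, test_domain="sketch")
```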
Overall, ConstellationNet generalizes reasonably well, with an average 5-shot accuracy of 64.74%. On average, however, it performs worse than the better-performing architectures on PACS. This suggests that ConstellationNet is less suited to building an understanding of 'concepts' from the data and generalizing them to an unseen domain.

### Feature clustering

Feature clustering is an important part of this network. It is used to find out which features occur most frequently and are therefore more meaningful than other features. For this, ConstellationNet uses a mini-batch soft K-means clustering algorithm: global cluster centers are randomly initialized and updated after each convolution step, so that the real centers are gradually approximated (a sketch of such an update is given at the end of this section).

![](https://i.imgur.com/6v9JUkb.png)
*Figure 6: Clustering part of the ConstellationNet pipeline*

Since clustering is a big part of this method, we wanted to look at different clustering methods that might improve the overall performance. HDBSCAN in particular looked like a good alternative at first glance. HDBSCAN is a density-based clustering method, which makes it more robust to non-circular cluster shapes than mini-batch K-means[^7]. Because HDBSCAN is widely used, good documentation is available, as are helpful Python implementations such as the scikit-learn-compatible `hdbscan` package. However, HDBSCAN has a major drawback: it does not scale well to large, high-dimensional inputs. To give an idea: the mini-batch input size after the first convolution was $131072 \times 64$, which already took HDBSCAN unacceptably long to cluster. Timing the HDBSCAN function on randomly generated datasets gave the following results:

| Dataset size | 500 × 2 | 5000 × 2 | 50000 × 2 | 500 × 64 | 5000 × 64 | 50000 × 64 |
|---|---|---|---|---|---|---|
| Time | 0.0161 s | 0.125 s | 2.531 s | 0.0385 s | 2.678 s | 288.175 s |

These timings show that the time required to cluster the data grows steeply with the size and, especially, the dimensionality of the input. Mini-batch K-means therefore seems a logical choice of clustering method for ConstellationNet.

The choice of the number of clusters is clearly explained in the paper, but the researchers do not explain why they chose a batch size of $128$ samples for the mini-batch K-means clustering algorithm. This is, of course, another hyperparameter that can be tuned. For this reason we did an ablation study to better understand what different batch sizes mean for the accuracy and training time of ConstellationNet. The default batch size used by the researchers is $128$, so to find out whether the batch size has an effect on accuracy and training time, we trained on four extra batch sizes: $32$, $64$, $256$ and $512$. Because of limited time and money (discussed further in the discussion section), each batch size was only trained twice. Since the CIFAR-FS dataset with a Conv-4 backbone was already used to check whether we could reproduce the original results, this dataset and backbone are also used for this experiment.
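To make concrete what this clustering batch size controls, here is a minimal PyTorch sketch of one mini-batch soft K-means update over cell features. It is our own simplification of the update described above, using soft assignments via a softmax over negative squared distances and a moving-average centre update; the temperature `beta` and momentum value are illustrative assumptions, not the paper's settings.

```python
import torch

def minibatch_soft_kmeans_step(features, centers, batch_size=128,
                               beta=100.0, momentum=0.999):
    """One mini-batch soft K-means update (illustrative simplification).

    features: (N, D) cell features pooled from a convolutional feature map.
    centers:  (K, D) current global cluster centres.
    """
    idx = torch.randperm(features.size(0))[:batch_size]
    batch = features[idx]                           # (B, D) sampled mini-batch
    d2 = torch.cdist(batch, centers) ** 2           # (B, K) squared distances
    assign = torch.softmax(-beta * d2, dim=1)       # soft cluster assignments
    # weighted mean of the batch per cluster
    weights = assign.sum(dim=0, keepdim=True).t()   # (K, 1)
    batch_centers = (assign.t() @ batch) / weights.clamp(min=1e-8)
    # moving average pulls the global centres towards the batch estimate
    return momentum * centers + (1.0 - momentum) * batch_centers

# Example: 131072 cell features of dimension 64, 64 clusters
feats = torch.randn(131072, 64)
centers = torch.randn(64, 64)
centers = minibatch_soft_kmeans_step(feats, centers, batch_size=128)
```

Intuitively, a larger `batch_size` gives a less noisy centre estimate per update at the cost of more computation per step, which is consistent with the accuracy and training-time trends we observe below.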
Changing the batch size can be done in the configuration scripts found in the configuration folder. Graphs 1 and 2 show the results.

![](https://i.imgur.com/Wf4y06u.png)
*Graph 1: The batch size against the average accuracy of both the 1-shot and 5-shot settings*

Graph 1 shows that both the 1-shot and 5-shot accuracy increase with increasing batch size, in a roughly linear trend. However, this comes at a cost, as graph 2 shows.

![](https://i.imgur.com/aIVAjpd.png)
*Graph 2: The batch size against the average training time needed*

The training time also increases with increasing batch size. While the increase from $32$ to $64$ is not that large, the training time becomes significantly larger when jumping from $256$ to $512$; it appears to grow much faster than linearly with the batch size. From the graphs it appears that as the batch size is increased, the accuracy eventually converges while the training time keeps increasing.

A lot of extra experimentation is needed before reliable advice can be given on which batch size is best for ConstellationNet, and the answer may also differ for datasets of different sizes. More training runs on more batch sizes would be needed to obtain significant results, and it would be interesting to see how different datasets or backbones change the outcome. For now we can only conclude that, if you have the resources available, $512$ seems to be the best batch size in terms of accuracy.

## Discussion

### Significance of results

Training the model took a few hours for every run, which meant there was no time for extensive re-running of the model. The time needed to set up a Google Cloud VM and the limited credits at our disposal did not help either. As a result, the results reported in this blog often come from only one or two runs, which does not make them very significant. Ideally we would have had at least five runs per result to make them more credible. This is also an issue with the main paper, since it is nowhere specifically explained how many times each of their models was trained.

### High accuracies for sketch and cartoon domains in the PACS dataset

The results of training on the PACS dataset show that the model significantly outperforms the other models when tested on the sketch and cartoon domains, yet underperforms on the photo and art domains, which is a very interesting result. One theory is that the attention learning can find the important features more easily because of the higher contrast in samples from the sketch and cartoon domains compared to the photo and art domains. However, a proper explanation requires more experimentation to determine whether this is a coincidence or something that can be tied to a specific aspect of ConstellationNet.

### Clustering algorithms

Since it quickly became apparent that HDBSCAN was not the right option for implementing another clustering method in ConstellationNet, and given the limited time we had for this project, our focus was directed to another experiment. More research could be done into other clustering algorithms that might improve performance.
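For anyone extending this study, the HDBSCAN timing experiment reported earlier can be reproduced with a short benchmark along these lines. This is a sketch under our own assumptions (uniformly random data, `min_cluster_size=5`), not the exact script we used:

```python
import time

import numpy as np
import hdbscan  # scikit-learn >= 1.3 offers sklearn.cluster.HDBSCAN with a similar interface

# Time HDBSCAN on random data of increasing size and dimensionality,
# mirroring the timing table reported earlier.
for n_dims in (2, 64):
    for n_samples in (500, 5000, 50000):
        data = np.random.rand(n_samples, n_dims)
        start = time.perf_counter()
        hdbscan.HDBSCAN(min_cluster_size=5).fit(data)
        elapsed = time.perf_counter() - start
        print(f"{n_samples} x {n_dims}: {elapsed:.3f} s")
```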
## Contributions

Below are the parts of the project each individual focused on:

**Koen**: Experimentation with batch sizes, training, writing clustering results

**Michiel**: Experimentation with HDBSCAN, training, writing paper exposition

**Rick**: Setting up Google Cloud services, writing the code to make the PACS dataset work with ConstellationNet, training, writing PACS results

## References

[^1]: Fratto, N. (2021). PACS Dataset. Kaggle, accessed 10/03/2022. https://www.kaggle.com/datasets/nickfratto/pacs-dataset?select=pacs_label
[^2]: Li, D., Yang, Y., Song, Y.-Z., & Hospedales, T. (2017). Deeper, Broader and Artier Domain Generalization. ICCV, 5543-5551. doi:10.1109/ICCV.2017.591
[^3]: Mohammadi, F. et al. (2019). An Introduction to Advanced Machine Learning: Meta Learning Algorithms, Applications and Promises. ResearchGate. https://www.researchgate.net/publication/335420271_An_Introduction_to_Advanced_Machine_Learning_Meta_Learning_Algorithms_Applications_and_Promises
[^4]: Xu, W., Xu, Y., Wang, H., & Tu, Z. (2021). Attentional Constellation Nets for Few-Shot Learning. ICLR.
[^5]: Simon, M., & Rodner, E. (2015). Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks. ICCV. https://www.semanticscholar.org/paper/Neural-Activation-Constellations%3A-Unsupervised-Part-Simon-Rodner/e704f7b9f0b7218715888b7b6f23960245588d02
[^6]: Wang, J., Lan, C., Liu, C., Ouyang, Y., & Qin, T. (2021). Generalizing to Unseen Domains: A Survey on Domain Generalization. https://www.researchgate.net/publication/349787277_Generalizing_to_Unseen_Domains_A_Survey_on_Domain_Generalization
[^7]: Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD.