# Reproduction of "Artist identification using Neural Networks" Authors: - Marco van Veen 5062322 - Matej Havelka 5005914 - Thomas Streefkerk 4964217 GitHub code: https://github.com/tstreefkerk98/artist-identification \ Dataset: https://www.kaggle.com/datasets/thomasstreefkerk/wikiart-artists ## Introduction Artist identification is the task of identifying which artist painted a specific painting. Historically, this has been a task performed by art historians or other experts in the field, all of which have a considerable amount of training. In this blog, we will reproduce the paper written by N. Viswanathan [[1]](#1) that uses convolutional neural networks to perform this task. Additionally, we extend the proposed method, that uses transfer learning with a model pretrained on IMAGENET1K_V1, by comparing to models pretrained on different datasets. We will show that the dataset used to train the pretrained model significantly influences the performance of the model. More specifically, we show the performance in identifying artists is higher on artists that mainly created paintings containing objects from this dataset and lower for artists that did not. ### Pretrained Models The following table shows the models used and if they are pretrained on anything, and if so, which dataset: | Model | Pretrained on: | |--------------|:-----------------------:| | Baseline CNN | - | | ResNet-18 | IMAGENET1K_V1 [[3]](#3) | | ResNet-20 | CIFAR-10 [[4]](#4) | | ResNet-20 | CIFAR-100 [[5]](#5) | As can be seen, we compare use two different types of ResNet models. Ideally, we would have liked to compare the same ResNet architecture pretrained on different datsets, however, we were unable to find them and did not have the resources to train them from scratch. To keep the effect of this mismatch to a minimum, we opted for ResNets of similar depth. ### Dataset The dataset we used is a subset of the WikiArt dataset [[2]](#2). This subset consists of paintings from 57 artists with 300 paintings per artist for a total of 17100 images. There are considerably more paintings in the full WikiArt dataset but not all of them were used. The reason being, as mentioned in [[1]](#1), that to identify an artist, the model needs to have seen enough examples of said artist. Therefore, the decision was made to only use the artists with a sufficient amount of paintings in the dataset. Additionally, we randomly select 300 paintings from these 57 artists to ensure an even amount of paintings per artist. ### Setups #### Baseline CNN The baseline CNN has the following architecture: | Input size | Layer | |---------------|:-----------------------------:| | 3x224x224 | 3x3 CONV, stride 2, padding 1 | | 32x112x112 | 2x2 Maxpool | | 32x56x56 | 3x3 CONV, stride 2, padding 1 | | 32x28x28 | 2x2 Maxpool | | 1x6272 | Fully-connected | | 1x228 | Fully-connected | In this table, the batch normalization and ReLU layers have been left out to make it more readable. They are performed after the Maxpool layers and after the second-to-last fully-connected layer. The final fully-connected layer maps to the 57 classes, so in this case the 57 artists. #### Transfer learning Networks For the transfer learning networks, we take one of the pretrained models and replace the final fully-connected layer with a new one. Similarly to the baseline CNN, this new layer maps to the 57 artists in our dataset. ## Hypothesis Pretraining helps models learn meaningful representations about the data. 
## Hypothesis

Pretraining helps models learn meaningful representations of the data. Fine-tuning such models can therefore greatly increase performance on the downstream task. Given the nature of the task we are tackling in this blog, the question we ask is: *"How impactful is the dataset the pretrained model was trained on for the final performance of the model?"* We hypothesise that this dataset is of significant importance to the performance. Specifically, we expect the performance in identifying paintings by artists who typically paint objects contained in the pretraining dataset to be better than for artists who typically do not paint these objects. For example, we expect artists who paint a lot of flowers to be more easily identifiable to the model, and thus to yield a higher accuracy, if flowers are contained in the dataset on which the pretrained model was trained. Conversely, we expect the opposite to hold if flowers are not contained in that dataset.

## Experiments

To test our hypotheses, we started by reproducing [[1]](#1). Since no code is linked to the paper, we implemented it ourselves. While the paper shares its hyperparameters more thoroughly than many others, it still leaves some grey areas where we had to guess the approach.

First, we split the WikiArt subset of 57 artists with 300 paintings each into training, validation and test sets, using the 80-10-10 ratio from the original paper. Each training image is randomly cropped to 224x224 resolution, flipped horizontally with 50% probability, and normalized. For normalization we used the parameters from a PyTorch tutorial, as our original guess of zero means and unit standard deviations did not work well. Validation and test images are not flipped, and we use a center crop rather than a random crop.

To address our hypotheses, we run four models:
- Baseline CNN (baseline)
- ResNet18 pretrained on ImageNet1k (resnet18)
- ResNet20 pretrained on CIFAR-10 (resnet20-10)
- ResNet20 pretrained on CIFAR-100 (resnet20-100)

The baseline is run only to provide a point of reference and to check that the task is not trivial. The ResNet18 is pretrained on ImageNet, meaning it has seen far more classes and images than any other model we run. To test our hypotheses about the importance of this dataset, we run two ResNet20s pretrained on the CIFAR datasets.

The paper follows a specific training procedure. It first freezes the entire pretrained model and replaces the last layer with a new linear layer. Only this last layer is trained until no further improvement is observed. We implemented this as early stopping: if we see no improvement for 5 epochs, we unfreeze the pretrained weights and fine-tune the entire model. The baseline is trained as a whole from the start, as it is not pretrained. We use the Adam optimizer with the learning rate set to 0.001. Once we unfreeze the model, we reinitialize the optimizer and set the learning rate to 0.0001, as described in the original paper. We use (0.9, 0.999) as our betas, also following the paper. Overall, training runs for at most 25 epochs, but it can stop earlier if the model converges a second time (on the first plateau it unfreezes; on the second it stops).

Additionally, we save our models and statistics during training, and the code we used can be found [in our GitHub project](https://github.com/tstreefkerk98/artist-identification).
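The sketch below condenses the data pipeline and the two-phase training schedule described above, assuming PyTorch and torchvision. The normalisation statistics are the standard ImageNet values from the tutorial we followed (an assumption on our part), and the training loops themselves are omitted for brevity.

```python
from torch import nn, optim
from torchvision import transforms, models

# Normalisation statistics taken from the PyTorch transfer learning tutorial.
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomCrop(224, pad_if_needed=True),  # random 224x224 crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
eval_tf = transforms.Compose([                       # validation/test: center crop, no flip
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

# Pretrained backbone with a new 57-way head, as in the earlier sketch.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 57)

# Phase 1: freeze the backbone and train only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3, betas=(0.9, 0.999))

# ... train until validation accuracy has not improved for 5 epochs ...

# Phase 2: unfreeze everything and fine-tune with a 10x smaller learning rate.
for p in model.parameters():
    p.requires_grad = True
optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# ... continue training, up to 25 epochs in total, stopping on a second plateau ...
```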
To further analyze the differences in performance on each artist, we plot the per-artist accuracies of each model on the test set. From these we handpicked the artists with major discrepancies between the performance of the different models. We then observe whether the artist is known for painting specific objects and whether those objects appear in the dataset used during pretraining.

## Results & Discussion

### Reproduction Results

The table below shows the training and test accuracies obtained after training and evaluating the four models. The results we obtained for the models presented in [[1]](#1) (Baseline CNN and ResNet18 - ImageNet) are close to the results originally reported, although our implementations seem to perform slightly worse.

| Model | Train Acc | Train Acc from [[1]](#1) | Test Acc | Test Acc from [[1]](#1) |
| -------- | -------- | -------- | -------- | -------- |
| Baseline CNN | 0.33 | 0.43 | 0.34 | 0.42 |
| ResNet18 - ImageNet | 0.86 | 0.91 | 0.74 | 0.77 |
| ResNet20 - CIFAR-10 | 0.26 | N/A | 0.28 | N/A |
| ResNet20 - CIFAR-100 | 0.40 | N/A | 0.41 | N/A |

We expect that these small differences in performance are due to minor differences between our implementation and the paper's, as not everything about the training procedure was explained in detail. For example, the authors mention that they decreased the learning rates and unfroze the pre-trained model after "improvement slows down significantly", but they did not specify any thresholds or other parameters for this. They also seem to have implemented early stopping, as the number of training epochs differed between models, but again it was not specified how this was done. Furthermore, they mention that they normalised the images, but not which values they used. Where exactly the batch normalisation layers were applied within the baseline CNN was also not mentioned. Finally, some differences may also be due to the random seeds used for splitting the dataset into train/validation/test sets and for the random cropping and flipping. Ideally, we would have trained the models with multiple random seeds and averaged the results to get a fairer indication of the performance, but due to time and computational constraints we were unable to do this.

The figures below show the training and validation accuracies of each model over the training epochs. As in the original paper, a clear bump in both training and validation accuracy can be seen for the ImageNet ResNet18 around epoch 12. This was the epoch where the learning rate was decreased by a factor of 10 and the base ResNet was unfrozen, allowing the base model to adapt to the new artist classification task. In the original paper, however, the unfreezing happened around epoch 5, which suggests that they used different criteria for when to unfreeze and decrease the learning rates. This was to be expected because, as mentioned, they were not clear about these criteria. For the remaining three models, the training and validation accuracies slowly go up without the significant bump seen for the ResNet18. However, both ResNet20 models show some relatively larger improvements near the end. This might indicate that there are still more significant improvements to be made for those models had they been allowed to train for longer, which we could not test due to time constraints.

![](https://hackmd.io/_uploads/ry6Sg5_D2.jpg)
![](https://hackmd.io/_uploads/BkQUe5Own.jpg)
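The per-artist accuracies and confusion matrices analysed in the following sections were computed roughly along these lines (a minimal sketch; `model`, `test_loader` and the label ordering are our own assumptions, not taken from the paper):

```python
import numpy as np
import torch


@torch.no_grad()
def confusion_and_per_artist_accuracy(model, test_loader, num_classes=57, device="cpu"):
    """Accumulate a confusion matrix over the test set and derive per-artist accuracies."""
    model.eval().to(device)
    confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for true, pred in zip(labels.tolist(), preds.tolist()):
            confusion[true, pred] += 1  # rows: true artist, columns: predicted artist
    per_artist_acc = confusion.diagonal() / confusion.sum(axis=1).clip(min=1)
    return confusion, per_artist_acc
```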
### Pre-trained Model Comparisons

The table above also shows performance differences between the two ResNet20s. The model pre-trained on CIFAR-100 achieves roughly 13 percentage points higher accuracy than the CIFAR-10 pre-trained model. It is interesting to note that while CIFAR-100 contains 100 distinct classes and CIFAR-10 only 10, both contain 60,000 images in total, spread evenly over the classes. So, it appears that seeing a larger variety of (relevant) classes during pre-training has a positive effect on artist classification, even if each class contains significantly fewer samples (600 vs. 6,000 here).

A possible explanation is that most of the CIFAR-10 classes (airplanes, trucks, cars, deer, etc.) are not relevant to the paintings in the dataset, indicating that the gap between the pre-training classification task over these 10 classes and the artist classification task might be too large for efficient transfer learning. Since the model was pre-trained on these 10 classes with 6,000 samples each, it might have had to "unlearn" some of the undesired features it learnt in order to perform the artist classification task. This may also explain why the baseline, trained from scratch, was slightly better than the CIFAR-10 model. The CIFAR-100 model has seen many more classes relevant to the paintings, with only 600 samples each, so it could more easily "unlearn" the undesired features and improve further on the objects commonly seen in the paintings. This more general pre-training task could, therefore, be the reason that the CIFAR-100 model managed to beat the baseline CNN trained from scratch.

Finally, the ResNet18 pre-trained on ImageNet clearly outperforms the other models by a significant margin. This is most likely due to the large number of different classes seen during pre-training (1,000) and the overall size of the dataset, which is roughly 1.3 million images. This model was therefore able to learn useful features for a large number of classes, allowing it to more easily generalise to other tasks, such as the artist classification considered here.

### Individual Artist Classification Results

The figure below shows the confusion matrix of the ResNet18 pre-trained on ImageNet. The artist labels are not shown, as there are too many (57). The diagonal shows the correctly identified artists. Most of the values along the diagonal are rather high, showing that many artists have been correctly classified. Furthermore, there do not seem to be any significant outliers outside of the diagonal; the off-diagonal values are mostly low. This shows that there are also no artists who were consistently misclassified as one other specific artist. Low classification results for certain artists are, therefore, due to their paintings being attributed to a number of different artists, which could happen for various reasons.

For example, Paul Gauguin is most often confused with either Camille Pissarro or Vincent van Gogh, as his brush strokes are somewhat similar to those of the other two artists, which could be the reason for the confusion. Furthermore, his test set images contain a number of sceneries similar to ones found within the training sets of the other two artists.

The original paper reported that the worst performing artist was Henri Matisse with 11 out of 30 correct, suggesting that this was due to the wide variety of styles he used. In our case, however, he was classified correctly roughly 23 times.
When looking at the test set images, it seems they contained only approximately two or three similarly styled paintings, which could explain the higher accuracy in our case. This shows that the performance for artists with varying styles may differ significantly between different random train/val/test splits of the data, which should be kept in mind when analysing results, especially when comparing between papers.

![](https://hackmd.io/_uploads/rk7d1JtP3.jpg)

The second figure below shows the confusion matrix for the ResNet20 pre-trained on CIFAR-100, which is the second best performing model. From the diagonal it can be seen that many artists are no longer classified correctly most of the time, which is also reflected by the many higher values in the off-diagonal elements all over the figure. The confusion matrices for the baseline CNN and the CIFAR-10 ResNet20 are left out, but were similar, with even more misclassifications scattered throughout.

![](https://hackmd.io/_uploads/r1_LMlFPh.jpg)

One of the most correctly classified artists is Giovanni Battista Piranesi. All of his art samples are highly detailed black and white sketches of ancient (Roman/Greek) buildings and artefacts, along with some people or statues of ancient times. Test samples from many other artists have been misclassified as him as well. The artists with the most such misclassifications are, for example, Francisco Goya, Gustave Doré, and M.C. Escher. These artists also often drew detailed black and white sketches, but of vastly different scenes and objects. So, it seems that the model simply looked at colour composition and classified these artists based on that instead of the actual contents. Random examples from each of the four artists are shown below.

For the ImageNet ResNet18 these misclassifications did not occur, which could partially be due to the model actually recognising the contents of the different drawings to some degree and using that to distinguish between them. The fact that the CIFAR-100 model outperformed the CIFAR-10 model, with the only difference being a larger variety of pre-training classes, supports this, since ImageNet contains significantly more classes and samples per class. This could have helped it more easily extract useful features to distinguish the architectural and schematic sketches of Piranesi from the rest and reduce the misclassification of the others.

![](https://hackmd.io/_uploads/SkCP8bcvn.png)
*Examples of artworks from the four artists. From top left to bottom right: Giovanni Battista Piranesi, Francisco Goya, Gustave Doré, and M.C. Escher.*

### Comparing Groups of Similar Artists

#### Artists Mainly Depicting Sceneries

The figure below shows the test accuracies for a selected group of artists whose artworks consist almost exclusively of similar sceneries, such as forests, plains, mountains, and fields with some old houses or towns in the background. Additionally, they all painted in somewhat realistic styles (realism, romanticism, impressionism), where it is easy to distinguish what is being portrayed. From the figure it can be seen that the ImageNet ResNet18 performs best every time, which was to be expected as it was trained on a larger number of classes and a much larger dataset than the others. The CIFAR-100 ResNet20 is second best in all but one case. The baseline CNN seems to generally outperform the CIFAR-10 ResNet20, which is the worst.
![](https://hackmd.io/_uploads/SyDbSbKPh.jpg)

An important thing to note is that the CIFAR-100 dataset contains some classes of objects and scenes occurring within these artworks (houses, castles, forests, mountains, seas, etc.), and ImageNet also contains similar or related classes with much larger sample sizes. CIFAR-10, however, contains no classes related to these artworks, except perhaps ships or horses, whose samples may contain similar scenes. The CIFAR-10 model also almost always performs worse than the baseline CNN. This may be due to it being pre-trained on classes that are completely "useless" for this task, which it now has to correct, and this may take longer than starting from scratch like the baseline.

The CIFAR-100 model performs equal to or better than the baseline in almost all cases. This suggests that seeing even a few similar classes during pre-training can already significantly boost performance, given that the CIFAR-10 version was worse than the baseline. A reason for this can be that the model does not have to spend much time learning how to extract features related to the depicted objects and scenes, and can instead focus on additionally learning artist-specific features, such as style or other less obvious characteristics. For example, Ivan Shishkin almost exclusively painted forests. The baseline model likely checks for a lot of green colours and vertical lines due to the many trees. The CIFAR-100 model already saw forests during pre-training, which likely let it focus more on learning Shishkin's specific style and thereby outperform the baseline.

The baseline CNN outperforms the CIFAR-100 model when it comes to Albert Bierstadt, performs equally for Ivan Aivazovsky, and both accuracies come close to the ImageNet model's. Almost all the scenes depicted by these two artists contain bright skies, with light often coming from a single point (the sun), which makes them relatively easy to classify. Typical examples of their works are shown below. This consistent, large bright area at the top of the images might be what the baseline uses to classify these artists. Furthermore, Aivazovsky almost exclusively paints naval scenes with ships, which might additionally help the baseline and other models distinguish the two artists. The CIFAR-100 and ImageNet models have seen ships and seas during pre-training, which might help them distinguish between the artists. The baseline, on the other hand, possibly only checks for the presence of blue colours due to the sea, instead of learning other, more detailed sea- and ship-related features.

![](https://hackmd.io/_uploads/BJytdW5vh.png)
*Two typical paintings from Albert Bierstadt (left) and Ivan Aivazovsky (right).*

One surprising result from the figure is the set of accuracies for Isaac Levitan, where only the ImageNet model manages to perform well, while both CIFAR models are quite bad and the baseline does not manage to correctly classify anything. When looking at Levitan's work, it can be seen that he painted a large variety of scenes, such as fields, forests, mountains, lakes, and plains with houses. There also do not seem to be any obvious shared features among the paintings, such as the bright skies for Bierstadt and Aivazovsky. So, it seems that the baseline was not able to exploit any recurring shapes or colour-related features, leading to its complete failure for this artist. Furthermore, the CIFAR-10 model also almost completely failed here, classifying only one or two samples out of 30 correctly.
The CIFAR-100 version does seem able to distinguish some of his works, but its performance is still quite poor. This is somewhat surprising, considering that it had already seen all of these scenes and objects (such as houses) as classes during pre-training. It shows that the models do not simply classify based on content, but also have to learn artist-specific features such as style. The fact that the ImageNet model was pre-trained on significantly more classes and data is likely the reason it could adapt to such artist-specific features more easily.

#### Artists Mainly Depicting Portraits or Human Scenes

Finally, we consider one more group of artists, who all mostly painted portraits or human scenes. The results are shown in the figure below. The ImageNet model again consistently outperforms the others, which was expected due to the large dataset size and the huge variety of classes and images. The CIFAR-100 dataset also contains a few human-related classes, but the corresponding model only achieves somewhat good results for three of the artists, and is beaten by the baseline for two of those. Except for these three artists, the results of the three non-ImageNet models are quite bad, especially compared to the ImageNet model.

![](https://hackmd.io/_uploads/BJK5V-cw3.jpg)

The performance of the non-ImageNet models is worst for Beksinski, Chase, Goya, and Rembrandt. The baseline was not able to correctly classify any of their images, while the CIFAR models performed very poorly. Besides normal portraits, Goya and Rembrandt also made many pencil drawings of people and scenes, which may explain why their works are difficult to classify for the other models. The CIFAR-100 model did, however, manage to improve significantly upon the CIFAR-10 model for Rembrandt. This may be because Rembrandt's work contains relatively more normal portraits, allowing the model to better utilise the human classes seen during pre-training.

Beksinski and Chase were difficult to classify for other reasons. While Chase painted many portraits, he also painted a variety of other things, such as different sceneries and household objects. This variety likely made it quite hard to classify his works correctly. The CIFAR-100 model, however, has seen similar objects and sceneries during pre-training. As mentioned before, a reason for its poor performance could be the small pre-training dataset size, which did not allow it to learn more meaningful features to build upon. Additionally, the less realistic, impressionistic style could have made it harder for the model.

The difficulty of Beksinski lies in his very surreal, dystopian art style, which often contains humanoid shapes. The CIFAR-100 model performs roughly equal to the CIFAR-10 model here, indicating that it did not manage to effectively transfer the learnt human features to this difficult style. The ImageNet model achieved quite good performance, however. It was likely able to capture the distinct, dark style of the artist thanks to its more extensive pre-training, where it learnt to extract a much wider variety of features.

![](https://hackmd.io/_uploads/HJALcbqvh.png)
*Two random examples of Beksinski's artworks.*

Finally, all models besides the CIFAR-10 one managed to achieve decent performance for Modigliani and Kirchner. The baseline even managed to slightly outperform the CIFAR-100 model in both cases. Both artists have very distinct styles which are relatively easy to detect.
This likely helped the baseline achieve such good results in these two cases, but very poor results in the others. Modigliani tends to paint simple human figures in the same few poses without much detail, often with similar colours. Almost all of Kirchner's works feature similar-looking women and a very prevalent beige-like colour throughout, which makes his work relatively easy to identify.

## Conclusion

The goal of this project was to reproduce the main results of the original paper [[1]](#1), make the code available for others, and extend the paper with an analysis of the effect of different pre-training datasets on the artist classification task.

We managed to obtain very similar results for the baseline CNN and the ResNet18 pre-trained on ImageNet, so the originally reported results are reproducible. However, both of our models performed slightly worse. The main reason is likely differences in when we unfroze the pre-trained model and decreased the learning rates, as the authors did not specify exactly how they did this. Additionally, small differences in normalisation procedures and random seeds could explain some of the remaining gap. Moreover, we only trained and tested the models with a single random seed. Ideally, we would have trained them with multiple seeds and averaged the results to get a more honest indication of the performance.

Differences in pre-training datasets do seem to lead to significant performance differences on this task, especially for more realistic art styles. The CIFAR-10 dataset contains almost no classes relevant to the artworks, while CIFAR-100 contains classes which occur relatively often within them. The performance of the CIFAR-10 pre-trained model was overall worse than that of the baseline CNN and the CIFAR-100 pre-trained model, even though both CIFAR datasets contain the same total number of samples. The model pre-trained on the much larger and more varied ImageNet1K performed best by far, even for artists with varying contents and styles. These results show that pre-training on a larger and more varied dataset can significantly increase performance on the artist classification task.

## References

<a id="1">[1]</a> Viswanathan, N. (2017). Artist Identification with Convolutional Neural Networks. <br>
<a id="2">[2]</a> Kaggle, Painter by Numbers. https://www.kaggle.com/c/painter-by-numbers. <br>
<a id="3">[3]</a> IMAGENET1K_V1. https://www.kaggle.com/datasets/vitaliykinakh/stable-imagenet1k. <br>
<a id="4">[4]</a> CIFAR-10. https://www.kaggle.com/c/cifar-10/. <br>
<a id="5">[5]</a> CIFAR-100. https://www.kaggle.com/datasets/fedesoriano/cifar100.