## Car Brand & Construction Year Classification - Blog Post Group 37
**Group members**
* Eduard Klein Onstenk (4930894, e.t.kleinonstenk@student.tudelft.nl)
* Devin Lieuw A Soe (4933028, d.l.lieuwasoe@student.tudelft.nl)
### Introduction
<!-- topic and some background info to generate interest. -->
This article considers the implementation of car brand classification using images from the CompCars dataset [7], which covers 163 distinct car brands. This is a fine-grained object recognition task, meaning the classes are inherently hard to distinguish because they are subcategories of one overarching group. For this task the ResNet architecture is used, which has also produced state-of-the-art results in other research. Compared to other fine-grained object classification tasks, the classification of cars is complicated by a number of challenges. For one, there is a large number of different car models. Furthermore, within each class the variance is high as a result of the many different viewpoints, poses, and vehicle types.
<!-- focus and scope -->
The implementation is done in Python using PyTorch. Two subtasks have been identified to narrow the scope and focus of this blog. The first problem is a 163-class classification problem, classifying 163 distinct brands using the CompCars dataset. The second problem is a two-class classification problem: determining whether a car was manufactured before or after 2012. Some city areas restrict which cars may drive there based on their production year, which is a practical example of the second task. Experimenting with both problems provides evaluations on a classification task with a relatively high number of classes as well as one with a relatively low number of classes. The results are produced by two students from TU Delft as part of the course Computer Vision. The computations were run on a single NVIDIA GTX 980 Ti, which imposed time constraints that motivate the choices described later for, among other things, smaller datasets and fewer epochs.
<!-- relevance / importance / gap -->
The ability to classify and detect cars is vital in multiple fields, such as traffic management by governments or to companies in the automotive industry. After reading this article, the reader will understand the design choices made for the car brand classification and construction year classification problems, understand their impact on the resulting accuracies, and be able to reproduce the results.
## Research questions
In this work, ResNet [1] is used as the baseline model for image classification. Experiments are conducted with two different image classification tasks, car brand classification and construction year classification. These are 163-class and 2-class problems respectively.
**Input data**
ResNet can be imported within PyTorch with the "pretrained" option that can be set to either True or False. An experiment is performed where both options are compared to answer the question:
_'What is the influence of pretraining ResNet on its performance regarding image classification?'_
Since image processing tasks tend to be computationally expensive, we consider using grayscale input images instead of RGB input images with the intention of reducing training time. Therefore, we ask the question:
_'What is the effect of training ResNet with grayscale input images instead of RGB input images?'_
**Altering average pooling**
The performance of ResNet is evaluated against a number of altered versions of ResNet. More specifically, we apply changes to the global average pooling that is included in the ResNet network before its final fully connected layer. The research question concerning this alteration is as follows:
_'What is the influence of changing the global average pooling of ResNet to local average pooling?'_
Furthermore, we evaluate a spatially weighted pooling approach that is inspired by [2] and [3]. For this, the global average pooling of ResNet is altered as well. By experimenting with this concept, we aim to answer the question:
_'To what extent can the image classification performance of ResNet be improved by spatially weighting the feature maps?'_
## Related work
The literature and datasets around fine-grained object recognition, and fine-grained car recognition more specifically, have been extended considerably in recent years [5, 6, 7]. Numerous approaches focus on classifying the model of cars. Hu et al. [3] divide related work into three different categories, namely texture feature-based approaches, 3D representation-based approaches, and DCNN-based approaches.
The first of these, the texture feature-based approach, focuses on data produced by surveillance cameras for traffic monitoring and is limited to frontal car images [8, 9]. The second, the 3D representation-based approach, extends the scope to multiple viewpoints and unconstrained poses [10]. The DCNN-based approach is motivated by its recent impressive results across other image classification tasks [4, 5].
Finally, our research questions were especially inspired by three papers related to image classification, so we cover them in a bit more detail.
Computer vision has seen multiple breakthroughs in recent years; DCNNs in particular achieve state-of-the-art results. To solve more complex tasks with higher accuracy, networks are often made deeper, but this tends to make accuracy saturate and then degrade. Using skip connections, ResNet [1] mitigates the vanishing gradient problem and makes it easy for the model to learn identity mappings.
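The skip-connection idea can be illustrated with a minimal residual block in the spirit of [1] (a simplified sketch, not ResNet-152's actual bottleneck block): the block learns a residual F(x) and outputs F(x) + x, so representing the identity mapping is trivial and gradients flow through the shortcut.

```python
# Minimal sketch of a residual block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
y = BasicResidualBlock(64)(x)
```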
Another recent breakthrough in achieving state-of-the-art results is cross-convolutional-layer pooling [2], which derives discriminative features from convolutional layers rather than from fully connected layers.
<img src="https://i.imgur.com/wlKFOhw.png" alt="chart" style="width:250px;margin:15px 0;">
_Figure 1: Demonstration of extracting local features from a convolutional layer. [2]_
Spatially weighted pooling (SWP) has been used in DCNNs for fine-grained car recognition [3], refining the effectiveness of the feature representations of the best DCNNs. Their method improved the state-of-the-art result on the CompCars dataset, which is also used in this article, from 91.2% to 97.6%. It extends cross-convolutional-layer pooling, extracting features from the convolutional feature maps (CFMs) of two consecutive convolutional layers. CFMs are pooled because they often indicate meaningful areas of images.
<img src="https://i.imgur.com/qYLqajU.png" alt="chart" style="width:500px;margin-bottom:15px 0;">
_Figure 2: An overview of the SWP approach proposed by [3]_
## Method
This section discusses the dataset that is used for the experiments, as well as how this dataset is subsampled to reduce training time. Furthermore, the approaches of altering the global average pooling of standard ResNet are described.
**CompCars**
CompCars [4] is used as the dataset for the experiments in this work. It is one of the most widely used datasets for computer vision tasks involving cars, containing 136,726 images of entire cars and 27,618 images of car parts. Each car included in CompCars is shot from a number of different angles and is labeled with its brand, model, and construction year.
**Subsampling**
Since the CompCars dataset is relatively large, the experiments in this work are performed on a subsample. The number of images per brand varies significantly; some brands contain thousands of images, while others contain fewer than 50. We aim to reduce the size of the dataset while maintaining the class distribution to a sufficient extent. For this, we have created a script that does not naively take a small percentage of the entire dataset, but takes the class sizes into account. The reduced dataset is constructed with the following rules:
- If a class/brand contains 100 images or fewer, all of these images will be included
- If a class/brand contains more than 100 images, the number of images included equals 100 + 0.2 * (*class_size* - 100)
- The distribution among the car models within a class is kept the same
With this method of subsampling, the class distributions are sufficiently maintained while small classes are not reduced to size 0. Also, no car models will be discarded within brands.
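The size rule above can be sketched as a small function (a hypothetical helper, not the authors' actual script, which additionally preserves the per-model distribution within each brand):

```python
# Sketch of the subsampling rule: classes of <= 100 images are kept whole;
# larger classes keep 100 images plus 20% of the surplus.
def subsample_size(class_size):
    if class_size <= 100:
        return class_size
    return int(100 + 0.2 * (class_size - 100))
```

For example, a brand with 50 images keeps all 50, while a brand with 600 images keeps 100 + 0.2 * 500 = 200 images.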
**Average pooling of ResNet**
The last two layers of standard ResNet are the global average pooling (GAP) layer, followed by a fully connected layer. The input of the GAP layer has a dimension of 7 × 7 × 2048, and the paper on cross-convolutional-layer pooling [2] explains how this array serves as the feature maps resulting from the convolutional layers. The GAP layer converts this feature map array into a 1 × 1 × 2048 array by using an average pooling kernel of 7 × 7, which discards all spatial information. In an attempt to improve ResNet's car brand classification performance, we change the kernel size to 3 × 3 and specify a stride of 2. The resulting local average pooling layer outputs an array of 3 × 3 × 2048. This approach is similar to ResNet-LMP described in the spatially weighted pooling paper [3]; however, ResNet-LMP uses max pooling, and its performance appears to match standard ResNet according to their results. We keep the average pooling to see whether the performance changes. We also experiment with local average pooling with a kernel size of 5 × 5 and a stride of 2, outputting a 2 × 2 × 2048 array.
**Spatially weighted pooling**
As for the SWP implementation, the GAP layer of standard ResNet is substituted with a new layer that has a weight matrix of 7 × 7. All 49 subarrays of the 7 × 7 × 2048 input are weighted by the corresponding coefficient within this weight matrix. This approach deviates slightly from [3] as they vary the number of learned masks. The batch normalization and the extra fully connected layers are added in the same way as [3].
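A minimal sketch of such a layer, assuming a single learnable 7 × 7 mask shared across all 2048 channels (a simplification of [3], which learns multiple masks; the batch normalization and extra fully connected layers are omitted here):

```python
# Sketch: spatially weighted pooling with one learnable 7x7 mask. Each of
# the 49 spatial positions of the feature maps is scaled by its coefficient
# before the spatial average is taken.
import torch
import torch.nn as nn

class SpatiallyWeightedPool(nn.Module):
    def __init__(self, spatial=7):
        super().__init__()
        # One coefficient per spatial position, broadcast over channels.
        self.mask = nn.Parameter(torch.ones(1, 1, spatial, spatial))

    def forward(self, x):                  # x: (N, 2048, 7, 7)
        weighted = x * self.mask
        return weighted.mean(dim=(2, 3))   # -> (N, 2048)

pool = SpatiallyWeightedPool()
out = pool(torch.randn(2, 2048, 7, 7))
```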
## Experiments and results
This section provides the details and results of the experiments performed in this work. All input images are resized to 224 × 224. 70% of the dataset is used for training and 30% for testing. The experiments are done with a batch size of 16 and the models are trained for 40 epochs. The learning rate is divided by 10 after every 10 training epochs, with an initial learning rate of 0.001 (see Figure 3). A weight decay of 0.0005 and a momentum of 0.9 are configured for the optimizer.
<img src="https://i.imgur.com/9Ufwp3T.png" alt="chart" style="width:400px;">
_Figure 3: Comparison of two different learning rates. ResNet152 is trained with the 163-class data of the car brand classification problem. An initial learning rate of 0.001 seems to result in a better training process, which is why this learning rate is used for the experiments._
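The training configuration stated above can be sketched with PyTorch's SGD optimizer and step scheduler (the `nn.Linear` model is only a stand-in for ResNet-152, and the training pass itself is elided):

```python
# Sketch: SGD with momentum 0.9 and weight decay 0.0005, initial learning
# rate 0.001, divided by 10 every 10 epochs over 40 epochs of training.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for ResNet-152
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):
    # ... one training pass over the data would go here ...
    scheduler.step()
```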
**Experiment 1: ResNet152 pretrained vs. non-pretrained**
Table 1 shows the resulting accuracies of both pretrained and non-pretrained ResNet152 on the 163-class car brand classification problem. The accuracies measure the performance of the models after training for 40 epochs.
<table>
<tr>
<td><strong>Model</strong>
</td>
<td><strong>Pretrained</strong>
</td>
<td><strong>163-class accuracy (%)</strong>
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>Yes
</td>
<td>66.48
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>No
</td>
<td>66.43
</td>
</tr>
</table>
_Table 1: Results of Experiment 1_
The comparison between a pretrained and a non-pretrained ResNet152 model seems to result in a difference that is not significant. The rest of the experiments are performed with pretrained ResNet152 weights.
**Experiment 2: RGB vs grayscale**
As an attempt to reduce training time, we experiment with grayscale input images in this experiment. Table 2 shows the training durations of standard pretrained ResNet152 on RGB and on grayscale input images for the 163-class car brand classification problem. The durations are measured over 40 training epochs.
<table>
<tr>
<td><strong>Model</strong>
</td>
<td><strong>Input images</strong>
</td>
<td><strong>Training time (hours)</strong>
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>RGB
</td>
<td>9.5
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>Grayscale
</td>
<td>9.4
</td>
</tr>
</table>
_Table 2: Results of Experiment 2_
The comparison between using RGB or grayscale input images seems to result in a time difference that is not significant. The rest of the experiments are therefore performed with RGB input images.
**Experiment 3: Changing global average pooling of ResNet152**
Table 3 shows the resulting accuracies of the versions of ResNet152 where the global average pooling is altered. The accuracies measure the performance of the models after training for 40 epochs.
<table>
<tr>
<td><strong>Model</strong>
</td>
<td><strong>Avg. pooling kernel size</strong>
</td>
<td><strong>Avg. pooling stride</strong>
</td>
<td><strong>163-class accuracy (%)</strong>
</td>
<td><strong>2-class accuracy (%)</strong>
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>7 × 7
</td>
<td>1
</td>
<td>66.48
</td>
<td>80.11
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>5 × 5
</td>
<td>2
</td>
<td>65.27
</td>
<td>78.54
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>3 × 3
</td>
<td>2
</td>
<td>62.22
</td>
<td>74.83
</td>
</tr>
</table>
_Table 3: Results of Experiment 3_
**Experiment 4: ResNet152-SWP**
This last experiment is inspired by ResNet-SWP from the spatially weighted pooling paper [3]. Table 4 shows the accuracy of our approach to spatially weighted pooling, described in the previous section. We only evaluate on the 163-class car brand classification task due to resource and time constraints. The resulting accuracy measures the performance of the model after training for 40 epochs.
<table>
<tr>
<td><strong>Model</strong>
</td>
<td><strong>SWP</strong>
</td>
<td><strong>163-class accuracy (%)</strong>
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>Yes
</td>
<td>68.62
</td>
</tr>
<tr>
<td>ResNet152
</td>
<td>No
</td>
<td>66.48
</td>
</tr>
</table>
_Table 4: Results of Experiment 4_
## Discussion, conclusion, and future work
As for changing the global average pooling of ResNet to local average pooling, which was tested in Experiment 3, the alteration seems to have a negative effect on performance on both the 163-class and the 2-class task (see Table 3). A cause for this might be that local averaging is not invariant to translation. This would also explain why the 3 × 3 average pooling kernel performs worse than the 5 × 5 kernel. Experimenting with max pooling instead of average pooling [3] could be more interesting, as max pooling is partially invariant to translation.
Our approach to spatially weighted pooling does show a slight increase in performance on the 163-class car brand recognition problem. The improvement is relatively small, but the variability in hyperparameter configuration, as well as the results of the SWP paper [3], suggest there is room for further improvement. Due to time and resource constraints we were not able to train the model for the 2-class problem. However, it might be interesting for future research to evaluate this approach on tasks with fewer classes.
**References**
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Microsoft Research, 2015.
[2] L. Liu, C. Shen, and A. van den Hengel, "The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4749–4757.
[3] Q. Hu, H. Wang, T. Li, and C. Shen, "Deep CNNs with spatially weighted pooling for fine-grained car recognition," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 11, pp. 3147–3156, 2017.
[4] L. Yang, P. Luo, C. C. Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3973–3981.
[5] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, "Fine-grained recognition without part annotations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5546–5555.
[6] T.-Y. Lin, A. K. Roy-Chowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1449–1457.
[7] L. Yang, P. Luo, C. C. Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3973–3981.
[8] K. V. Kadambari and V. V. Nimmalapudi, "Deep learning based traffic surveillance system for missing and suspicious car detection," 2020.
[9] H. He, Z. Shao, and J. Tan, "Recognition of car makes and models from a single traffic-camera image," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 6, pp. 3182–3192, Dec. 2015.
[10] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Dec. 2013, pp. 554–561.