# Computer Vision by Deep Learning: Getting your hands dirty with License Plates
# Intro
Automatic license plate recognition (ALPR) systems are nowadays widely used by governments, for example to automatically identify speeding cars or to scan highways for specific license plates. These systems often rely on deep neural networks: the input to such a network is an image of a car, and the output is typically a bounding box for the license plate together with its text.
<figure>
<img src="https://i.imgur.com/dUe8fON.gif" alt="Trulli" style="width:80%">
<figcaption align = "center"><b>Fig. 1 - An example scenario of license plate recognition.</b></figcaption>
</figure>
(GIF from [learnopencv.com](https://learnopencv.com/automatic-license-plate-recognition-using-deep-learning/?ck_subscriber_id=452195442#Dataset--))
ALPR can be roughly divided into three tasks: first, locate the license plate in the input image; second, locate the characters inside the license plate; and third, recognize which characters are on the plate. Current state-of-the-art approaches often use complex neural networks that perform all three tasks in a single network. These networks are difficult to train, as they require a lot of data.
For this blog post, we decided to take a step back from the state-of-the-art approaches and instead perform automatic license plate recognition by using deep learning for individual tasks, and then combining these networks to perform the recognition. Specifically, we focus on the last two tasks: character location detection and character recognition. We skipped the first task, detecting the location of the license plate in an image, as it is the hardest of the three and we did not think it would be feasible within our timeframe. The goal of this project was not to achieve great performance in recognizing license plates, but to see whether a heavily simplified approach can still reach acceptable performance without deep, complex neural networks.
# Optical Character Recognition
The first task that needs to be solved is recognizing characters, also known as Optical Character Recognition (OCR). To train a neural network for this, we first need a dataset of various characters. We chose a dataset from [kaggle](https://www.kaggle.com/aladdinss/license-plate-digits-classification-dataset). This dataset contains over 1000 images for each character, from 0 to Z: each character has around 100 unique images, and for each unique image 10 augmented versions are generated by adding rotation and/or changing the brightness.
<figure>
<img src="https://i.imgur.com/Zih6KUK.png" alt="Trulli" style="width:80%">
<figcaption align = "center"><b>Fig. 2 - Various augmented images of the ‘0’ character.</b></figcaption>
</figure>
After obtaining the dataset, we needed to train a model that predicts the character in an input image. To determine the most suitable network for this task, we created various neural network architectures. They all have one or two convolutional layers; we did not include architectures with more than two, as we expect two convolutional layers to be sufficient to learn this prediction task. Besides convolutional layers, we also tried architectures with dropout layers and normalization layers.
<figure>
<img src="https://i.imgur.com/EB8cQDE.jpg" alt="Trulli" style="width:80%">
<figcaption align = "center"><b>Fig. 3 - Example train images with their groundtruth labels.</b></figcaption>
</figure>
<figure>
<img src="https://i.imgur.com/M3nEc25.jpg" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Fig. 4 - Train and test losses + accuracies after training for 25 epochs.</b></figcaption>
</figure>
All architectures have a ReLU activation after each convolutional layer, and at the end all features are flattened and passed into a fully connected layer that outputs a value for each class. The index of the maximum output then gives the predicted character.
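As a rough illustration, a minimal PyTorch sketch of one such two-convolutional-layer variant is shown below. The input resolution (32×32 grayscale), channel counts, dropout rate and the 36-class output (0-9 and A-Z) are illustrative assumptions, not the exact values from our experiments.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, num_classes=36):  # assumed: 0-9 plus A-Z
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 16x16 -> 8x8
            nn.Dropout(0.25),                # illustrative dropout rate
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)              # flatten before the fully connected layer
        return self.classifier(x)

# The predicted character is the index of the maximum output:
# pred = model(batch).argmax(dim=1)
```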
We split the data into 80% train and 20% test images. These images were randomly selected. Most training settings were kept at their default values, except for the learning rate which was set to 0.001.
After training the networks on these images, the training graphs almost all looked like Figure 4 above. We can see that the network is able to learn quite well and even achieves a training accuracy of 100%. We observed that the train and test loss and accuracy are very close together, which caused us to wonder why this could be the case.
After some thought, we realized that our 80/20 split might not have been fair. Since the dataset contains many augmentations of each unique image, it is not fair to train on some of the augmentations and then test on the remaining augmentations of the same character image. These augmentations are too similar (see Figure 2 above), so they do not form a good test set: a test set should indicate how well the network generalizes to unseen data. To fix this, we created new train and test splits, this time making sure that all augmented images stemming from the same unique character image end up entirely in either the train set or the test set, never split across both.
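A minimal sketch of this leakage-free split, assuming each sample carries a group id identifying the unique source image it was augmented from (how that id is derived from the dataset's filenames is an assumption here):

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_source(samples, labels, groups, test_size=0.2, seed=0):
    # All samples sharing a group id (i.e. augmentations of the same unique
    # character image) end up entirely in the train set or the test set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, labels, groups=groups))
    return train_idx, test_idx
```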
Training the networks again resulted in plots that look more like Figure 5.
<figure>
<img src="https://i.imgur.com/t8D8bBi.jpg" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Fig. 5 - Train and test losses after changing data splits.</b></figcaption>
</figure>
We can see that the train and test accuracy and loss are now more spread apart, and the test accuracy now gives a better representation of how well the network generalizes to unseen data. Some of the training graphs that were used to determine the best network architecture for this task are shown below.
<figure>
<img src="https://i.imgur.com/ktOLCR8.jpg" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Fig. 6 - Train and test losses + accuracies for different network architectures.</b></figcaption>
</figure>
In the end we chose the model that achieved the highest test accuracy based on these plots. This test accuracy was around 0.95. This model is then used to predict characters from input images containing a single character.
To see how our own model compares to a typical state-of-the-art (SOTA) model, we searched online for a model commonly used for OCR. This model has more convolutional layers than ours and more parameters per layer. The training curves can be found in Figure 7. Our models perform a little better on the test set, so we are satisfied with the result.
<figure>
<img src="https://i.imgur.com/384zYn3.jpg" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Fig. 7 - Train and test losses + accuracies for the SOTA model.</b></figcaption>
</figure>
# Bounding box detection
**Standard techniques and challenges**
Object detection is an integral part of computer vision research and has a few challenges that need solving:
1. A varying number of objects in the frame requires a varying output size
2. Objects can differ in size, location and aspect ratio
A naive approach to the first point would be to propose regions of interest (ROIs) and use a CNN to predict whether each proposed region contains an object. However, this is where the second problem comes in: CNNs cannot handle varying aspect ratios, and how should the ROIs be proposed if objects can be located anywhere and have any size? Methods such as R-CNN ([https://arxiv.org/abs/1311.2524](https://arxiv.org/abs/1311.2524)), YOLO ([https://arxiv.org/pdf/1506.02640.pdf](https://arxiv.org/pdf/1506.02640.pdf)) and all their successors have been proposed to solve these issues. Although these methods are robust and perform well, they are relatively big and require extensive training data and computational power. Since we are trying to build a system from the ground up with as few tools as possible, we analysed the listed challenges and concluded that with some simplifying assumptions the second challenge disappears, leaving only the first challenge, for which the naive approach suffices.
<figure>
<img src="https://i.imgur.com/sW9CuJ8.jpg" alt="Trulli" style="width:60%">
<figcaption align = "center"><b>Fig. 8 - Difference between an aligned license plate and not aligned license plate.</b></figcaption>
</figure>
**Simplifying assumptions and ROI proposal**
The first assumption is that the input to the bounding box detector is a properly aligned and cropped license plate, like the top plate in Figure 8. This assumption is reasonable, as cameras taking images of license plates tend to be aligned horizontally and placed either directly in front of or behind the car; if the image is not well aligned, it can be reshaped to fit the requirements. The benefit of this assumption is that it lets us assume the characters are located in the vertical middle of the input image, leaving only uncertainty in horizontal location, size and aspect ratio. We remove the latter two by assuming that the characters occupy a fixed fraction of the license plate height, in our case 70% of the image height, and have a fixed aspect ratio of 2:3. This leaves just the uncertainty in horizontal location. Solving this directly does not seem possible, as no further assumptions can be made about the horizontal positions of characters: there is no regular pattern shared by all plates. The plate should therefore be scanned horizontally to propose ROIs to a CNN that predicts whether a proposed region contains a character. We do this with a sliding window, using the assumed character height and aspect ratio and taking 50 steps over the license plate, as illustrated in Figure 9.
<figure>
<img src="https://i.imgur.com/5Dk3US8.gif" alt="Trulli" style="width:60%">
<figcaption align = "center"><b>Fig. 9 - Explanation of sliding window for license plates.</b></figcaption>
</figure>
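A short sketch of generating these sliding-window ROIs under the assumptions above (characters span 70% of the plate height, a 2:3 width-to-height aspect ratio, 50 horizontal steps); the (x, y, width, height) box representation is our own convention here:

```python
import numpy as np

def sliding_windows(plate, n_steps=50, height_ratio=0.7, aspect=2 / 3):
    h, w = plate.shape[:2]
    box_h = int(h * height_ratio)            # assumed character height
    box_w = int(box_h * aspect)              # assumed 2:3 width:height ratio
    y0 = (h - box_h) // 2                    # vertically centred on the plate
    xs = np.linspace(0, w - box_w, n_steps).astype(int)
    # Each window is (x, y, width, height)
    return [(x, y0, box_w, box_h) for x in xs]
```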
**ROI data generation and labelling**
To train a CNN that can predict which of the 50 proposed bounding boxes actually contain characters, a dataset and a scoring metric are required. The dataset used is openALPR ([https://github.com/openalpr/benchmarks/tree/master/endtoend/eu](https://github.com/openalpr/benchmarks/tree/master/endtoend/eu)), randomly split into a train and test set (80/20). For each license plate, all bounding boxes of the sliding window were stored (e.g., each of the red windows in Figure 10). Each of these bounding boxes needed a score, for which intersection over union (IoU) was used. IoU is a standard metric for how well an ROI corresponds to the actual ground truth bounding box: it computes the intersection of the ground truth (GT, in green) and the ROI (dotted red) and divides it by the union of the two. The score is therefore low if there is little overlap between ROI and GT (left example) and high if there is a lot of overlap (right example).
<figure>
<img src="https://i.imgur.com/0NhJ2TW.jpg" alt="Trulli" style="width:90%">
<figcaption align = "center"><b>Fig. 10 - Explanation of Intersection over Union calculation for license plates.</b></figcaption>
</figure>
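The IoU itself is straightforward to compute for two axis-aligned boxes; a small sketch using the same (x, y, width, height) convention:

```python
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle
    x1 = max(ax, bx)
    y1 = max(ay, by)
    x2 = min(ax + aw, bx + bw)
    y2 = min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```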
Since no GT bounding boxes were labelled for the openALPR dataset and we regarded manual labelling as too labour-intensive and not interesting, a classical computer vision approach was used to generate the labels.
This classical computer vision approach uses OpenCV to apply various image processing techniques to the image of the license plate. A baseline algorithm that extracts characters from license plates was taken from [pyimagesearch.com](https://pyimagesearch.com/), using steps such as thresholding, contour detection and bounding box creation. It also applies several checks, such as on width/height and aspect ratio, to determine whether a candidate bounding box could fit a license plate character. A few settings were tweaked, which produced the results shown in Figure 11. These bounding box locations were then saved as ground truth bounding boxes.
<figure>
<img src="https://i.imgur.com/ACeKAXp.jpg" alt="Trulli" style="width:60%">
<figcaption align = "center"><b>Fig. 11 - The three steps to obtain the bounding boxes of the characters, as seen in the last row.</b></figcaption>
</figure>
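A rough sketch of this classical labelling step using OpenCV is shown below. The Otsu thresholding and the exact size and aspect-ratio checks are illustrative; the values we actually tweaked differ from these.

```python
import cv2

def character_boxes(plate_gray):
    # Binarise so characters become white blobs on a black background
    _, thresh = cv2.threshold(plate_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    h_img = plate_gray.shape[0]
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        aspect = w / float(h)
        # Keep only contours that look like characters (illustrative thresholds)
        if 0.25 <= aspect <= 1.0 and 0.4 * h_img <= h <= 0.9 * h_img:
            boxes.append((x, y, w, h))
    return sorted(boxes, key=lambda b: b[0])    # left to right
```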
Plotting the computed IoU for all sliding windows results in Figure 12, where blue represents the actual IoU. We observed some room for improvement:
1. Most peaks reached around 0.8 with an occasional outlier near 1.0
2. If no character was present (e.g., the start of a license plate) the value went to 0.0, however between characters it only decreased to ± 0.4.
So most peaks were too low and most troughs too high; normalization was needed to bring the peaks to 1.0 and the troughs to 0.0. The standard approach would be _normalized value = (value - lower bound) / (upper bound - lower bound)_, but because each license plate has a few outliers near 0.0 and 1.0 this did not yet yield the desired effect. Therefore, instead of the upper and lower bound we used the mean ± std, yielding the normalized value displayed in orange in Figure 12.
<figure>
<img src="https://i.imgur.com/vaRA9iV.png" alt="Trulli" style="width:60%">
<figcaption align = "center"><b>Fig. 12 - Comparision of actual IoU and normalized IoU.</b></figcaption>
</figure>
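A small sketch of this mean ± std normalization for the IoU values of a single plate (clipping the result to [0, 1] is an assumption on our part):

```python
import numpy as np

def normalize_iou(values):
    values = np.asarray(values, dtype=float)
    lower = values.mean() - values.std()     # replaces the lower bound
    upper = values.mean() + values.std()     # replaces the upper bound
    return np.clip((values - lower) / (upper - lower), 0.0, 1.0)
```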
**CNN IoU estimation**
The next step in the pipeline is to have a CNN estimate the IoU for each sliding window. Here we train on the generated bounding boxes and their IoU labels. Several network architectures were tried; Figure 13 shows the result of the first network on the left and the final network on the right. The ground truth IoU and predicted IoU are shown in blue and orange respectively. Although the model on the left does manage to mimic the peaks and troughs of the ground truth, there are some problems:
1. For some characters, the difference between the peaks (on the character) and the troughs (between characters) is too small (e.g., between the two 0’s)
2. The start of the license plate has a relatively high predicted IoU, which later turns out to be a problem, as the blue country identification box is suggested as a character
As the model on the right shows, optimizing the model architecture solved these issues.
To find the most suitable model architecture, we used the same approach as explained in the OCR section above, but now with more complex networks, including networks with 4 convolutional layers. After training, the network with the highest test accuracy was chosen as the most suitable architecture.
<figure>
<img src="https://i.imgur.com/V4IwSoq.png" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Fig. 13 - Compare different architectures for IoU estimation.</b></figcaption>
</figure>
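As a sketch of the training step for this IoU regression: the network outputs a single value per window crop and is fit against the (normalized) IoU labels. The MSE loss and optimizer settings here are illustrative assumptions, not necessarily what we used.

```python
import torch.nn as nn

def train_step(model, crops, iou_targets, optimizer):
    # crops: batch of window crops, iou_targets: their normalized IoU labels
    criterion = nn.MSELoss()
    optimizer.zero_grad()
    pred = model(crops).squeeze(1)        # (batch,) predicted IoU per crop
    loss = criterion(pred, iou_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```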
**Bounding box selection**
Now that the model can properly predict the IoU for an unseen license plate, the system should select the most promising bounding boxes from the sliding windows. The first naive option that comes to mind is to simply select the 7 bounding boxes with the highest predicted IoU, which results in Figure 14.a. This shows that both R and F are selected twice with different bounding boxes, leaving 1 and A without a bounding box. A standard way to deal with this is non-max suppression (NMS), where the most likely bounding box is selected and all remaining bounding boxes that overlap with it are no longer considered. In that case, after selecting the light blue box around R, neither the green box around R nor the purple box around K would be a candidate anymore. This results in the selection shown in Figure 14.b. Now no character has multiple bounding boxes assigned to it, but 1 and A are still without a bounding box. This follows logically from our NMS definition: no overlap between boxes is allowed, and a box does not fit between 0 and 9, which are selected early because of their high IoU. To solve this we relax our NMS by allowing boxes to overlap by up to 30%, resulting in Figure 14.c, which shows that the system can select proper bounding boxes (also called regions of interest).
<figure>
<img src="https://i.imgur.com/NYjD8Ij.png" alt="Trulli" style="width:60%">
<figcaption align = "center"><b>Fig. 14 - Comparison of boundingbox selection techniques.</b></figcaption>
</figure>
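A sketch of this relaxed NMS, reusing the `iou` helper from the earlier sketch; whether the 30% overlap is measured as IoU or as a fraction of box area is a detail we gloss over here (the sketch uses IoU):

```python
def select_boxes(boxes, scores, max_boxes=7, max_overlap=0.30):
    # Greedily take the window with the highest predicted IoU, then discard
    # any remaining window that overlaps a selected one by more than 30%.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == max_boxes:
            break
        if all(iou(boxes[i], boxes[j]) <= max_overlap for j in selected):
            selected.append(i)
    # Return the chosen boxes from left to right
    return [boxes[i] for i in sorted(selected, key=lambda i: boxes[i][0])]
```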
The bounding box predictor achieved an accuracy of 0.99 on the test set of license plates: it predicted 127 bounding boxes, of which 126 were correct, while the total number of bounding boxes to be found was also 126.
# Full pipeline
The bounding box predictor network is combined with the OCR network to create a pipeline that takes as input a cropped image of a license plate and outputs its characters. First the bounding boxes of all the characters are predicted, then each bounding box is cropped out of the license plate and fed into the OCR network, which outputs a character. Doing this for all predicted bounding boxes gives us all the characters on the license plate.
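Putting the pieces together, a rough end-to-end sketch of the pipeline might look as follows. It reuses the `sliding_windows` and `select_boxes` helpers sketched above; `predict_iou` is a hypothetical helper that scores one window crop with the bounding box CNN, and the character set and 32×32 OCR input size are assumptions.

```python
import cv2
import torch

CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # assumed label ordering

def read_plate(plate_gray, box_model, ocr_model):
    windows = sliding_windows(plate_gray)                 # sketch above
    # predict_iou is a hypothetical helper scoring one crop with the box CNN
    scores = [predict_iou(box_model, plate_gray, win) for win in windows]
    boxes = select_boxes(windows, scores)                 # relaxed NMS sketch above
    chars = []
    for (x, y, w, h) in boxes:
        crop = cv2.resize(plate_gray[y:y + h, x:x + w], (32, 32))
        tensor = torch.from_numpy(crop).float().unsqueeze(0).unsqueeze(0) / 255.0
        chars.append(CHARSET[ocr_model(tensor).argmax(dim=1).item()])
    return "".join(chars)
```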
The results for the full pipeline are shown below. Initially, we used the pipeline with the SOTA OCR model, and then we also tried it with the best of our own OCR models. Our own model performed a little bit better than the SOTA model, but there is still a lot of room for improvement.
We noticed that the OCR model did not generalize well to these unseen character crops, which is why we applied transfer learning using the train set of the bounding box classifier. In this way, the OCR model also gets to train on images that look like the ones extracted in the full pipeline. This additional training set was relatively small compared to the original one (~100 vs ~1000 images). After this extra training, the OCR accuracy in the pipeline reached 0.815, and our pipeline classified 8/18 license plates fully correctly!
| OCR model | OCR Accuracy | Fully correct plates |
| --- | --- | --- |
| SOTA | 0.548 | 0/18 |
| Own model | 0.595 | 2/18 |
| Own model, transfer learned | 0.815 | 8/18 |
To conclude: you do not always need the most complex model for your problem. If you can make some simplifying assumptions, as we did here, you can often get away with less complex models and still reach acceptable performance. The bounding box predictions are very accurate, provided the license plate is cropped to a frontal perspective. The OCR model initially did not generalize well to the images extracted by our bounding box predictor, but after transfer learning on those images its accuracy reached 0.815.
# Improvements / future work
Next steps could be chosen from the following:
* Use a larger license plate dataset containing plates from more countries, both to generalize better and to train the OCR models on more data generated by our own pipeline
* Make a more extensive comparison to the state of the art systems
* Add localization of license plates and transform them to be horizontally aligned and unskewed