# Mini-YOLO for single object detection

Name: Szuyu (Angela) Lin
Email: szuyul@cs.cmu.edu

# Introduction

This is a heavily simplified implementation of the YOLO detection and classification framework:

"You Only Look Once: Unified, Real-Time Object Detection", J. Redmon, S. Divvala, R. Girshick, A. Farhadi, https://arxiv.org/pdf/1506.02640.pdf

The original YOLO formulation is built on the following ideas:

## Architecture

The resized input image is split into an `S x S` grid. Each grid cell produces a set of classification scores (for `C` classes in total) and `B` bounding boxes, each carrying a confidence / objectness score. To do so, the network architecture is defined as follows: the backbone, pretrained on ImageNet, produces a feature map of size `(1024, 7, 7)`, and each grid cell is built on this feature map, therefore `S=7`. The backbone is followed by a set of linear layers, and the output tensor is reshaped to `(30, 7, 7)`, where `30 = B*5 + C` with `B=2, C=20`, and each bounding box is represented by 5 values `(x, y, w, h, confidence)`.

## Loss

Losses are computed with a series of criteria. During training, for each ground-truth object only one of the `B` boxes predicted by the corresponding grid cell, the one with the highest IoU with the ground-truth box, is "responsible" for that object; at detection time, NMS is applied to suppress overlapping boxes above an IoU threshold.

* `Localization loss`: For each grid cell (`S x S`) and each predicted box (`B`) responsible for a ground-truth object: sum of squared error between the predicted and ground-truth `(x, y)`.
* `Box dimension loss`: For each grid cell and each predicted box responsible for a ground-truth object: sum of squared error between the predicted and ground-truth `(w, h)` (the original paper uses the square roots of `w` and `h`).
* `Objectness loss`: For each grid cell and each predicted box responsible for a ground-truth object: sum of squared error between the predicted confidence and the ground-truth objectness (1). For every other box: sum of squared error between the predicted confidence and the ground-truth objectness (0), down-weighted in the original paper.
* `Classification loss`: For each grid cell containing an object: sum of squared error between the predicted class scores (of size `C`) and the binary ground truth for each class.

Since we have very limited information in this case, and we may assume there is only 1 object of 1 class to detect, the loss function is simplified so that only the `localization loss` and the `objectness loss` remain.

# Implementation

## Dataset

The dataset composes each training instance from an (image, label) pair. For convenience of indexing, the `x, y` coordinates are flipped in this implementation (i.e. `x` is the vertical direction / rows); this does not affect the required format of the final inference script. Augmentations (random horizontal and vertical flips) are applied to the data during the training phase.

## Model

### Feature backbone

In the original paper, the authors mention that their backbone is inspired by and similar to `GoogLeNet`. In this implementation the backbone is therefore a pretrained `GoogLeNet`, which produces feature maps of size `(832, 7, 7)` after its 5 final layers are removed. The backbone features are not fine-tuned.

In our case, the dimensions of the output can be greatly reduced: since we have only 1 object of 1 class to detect, `B=1`, `C=0`, and each grid cell needs only a 3-dimensional vector `(x, y, confidence)` (omitting `w, h`) instead of 30. This makes the final output tensor size `(3, 7, 7)`.

![](https://i.imgur.com/2hzhKGn.jpg)

### Classifier

The classifier consists of 2 linear layers with `512` and `3*7*7` output neurons, respectively. A 1D batch-normalization layer replaces the dropout layer (rate 0.5) of the original paper, given the small amount of data.

### Activation

The activation functions for the final layer are defined as follows:

* For coordinates: outputs are linearly scaled to between 0 and 1.
* For confidence scores: different from the original implementation (where the confidence scores do not need to sum to 1, because there can be multiple objects across the grid), a softmax is applied over the output confidence map, which makes the confidence scores of all grid cells sum up to 1.
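As an illustration of how these pieces fit together, below is a minimal sketch of such a model in PyTorch, assuming `torchvision`'s pretrained `GoogLeNet` truncated to the `(832, 7, 7)` feature map. The class name `MiniYOLO`, the hidden `LeakyReLU`, and the sigmoid used to squash the coordinates are stand-ins for illustration (the repo describes a linear rescaling to `[0, 1]`); the actual architecture is defined in `models.py` and may differ in details.

```python
import torch
import torch.nn as nn
import torchvision


class MiniYOLO(nn.Module):
    """Sketch: frozen GoogLeNet features -> 2 linear layers -> (3, 7, 7) output."""

    def __init__(self, S=7):
        super().__init__()
        self.S = S
        googlenet = torchvision.models.googlenet(pretrained=True)
        # With the default pretrained settings (no aux heads), the last 5 children are
        # inception5a/5b, the average pool, dropout, and the fc layer; dropping them keeps
        # the (832, 7, 7) feature map for 224x224 inputs. The backbone is then frozen.
        self.backbone = nn.Sequential(*list(googlenet.children())[:-5])
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(832 * S * S, 512),
            nn.BatchNorm1d(512),      # replaces the paper's dropout(0.5)
            nn.LeakyReLU(0.1),        # hidden activation (assumption)
            nn.Linear(512, 3 * S * S),
        )

    def forward(self, x):
        feats = self.backbone(x)                               # (N, 832, 7, 7)
        out = self.classifier(feats).view(-1, 3, self.S, self.S)
        xy = torch.sigmoid(out[:, :2])                         # per-cell (x, y) in [0, 1]
        conf = torch.softmax(out[:, 2].flatten(1), dim=1)      # softmax over all S*S cells
        conf = conf.view(-1, 1, self.S, self.S)
        return torch.cat([xy, conf], dim=1)                    # (N, 3, 7, 7)


# Example:
# model = MiniYOLO()
# preds = model(torch.rand(2, 3, 224, 224))   # -> (2, 3, 7, 7)
```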
## Loss

### Objectness loss

We assume only 1 object is present in each instance; in other words, we only want the grid cell with the maximum confidence score. A 2D max-pooling layer with `kernel_size=7` (with `return_indices=True`) therefore gives us the index of the targeted grid cell. The ground truth here is a `7x7` matrix of zeros, except at the index of the grid cell containing the object (which is 1). The objectness loss is computed as the squared error between the softmax probabilities and this ground-truth matrix, weighted by the scalar factors defined in the original paper.

### Localization loss

The predicted coordinates `x_hat, y_hat` are normalized coordinates within each grid cell, so they have to be rescaled to global coordinates: the raw coordinate predictions are first divided by `S`, and then the "grid offset" is added. For example, if `[1, 2]` is predicted as the grid cell containing the object, we add `(1/7, 2/7)` to the divided coordinates. The localization loss is the squared error between the ground truth `x, y` and the predicted global coordinates `x_hat, y_hat`, both of which lie in `[0, 1]`.

![](https://i.imgur.com/T7yiSYk.jpg)

## Evaluation and inference

The logic for transforming the raw predictions into final outputs is the same as in the previous section: we find the box with the maximum confidence and then add the grid offsets to the scaled coordinate predictions (see the decoding sketch at the end of this README).

# Submission

## Files

The submission contains the following files:

* `train_phone_finder.py`: training script.
* `find_phone.py`: inference script.
* `models.py`: defines the model architecture.
* `loss.py`: defines the loss and evaluation functions.
* `dataset.py`: defines the dataset.

## Environment and Requirements

* `python 3.8`
* `torch==1.8.1`
* `torchvision==0.9.1`
* `Pillow==8.2.0` (imported as `PIL`)
* `numpy==1.19.2`
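## Appendix: prediction decoding (sketch)

For reference, the decoding step described in the *Evaluation and inference* section can be sketched as below. It assumes a `(3, 7, 7)` prediction tensor (two coordinate channels plus the softmaxed confidence map); the function name and exact interface are illustrative and do not necessarily match `find_phone.py`.

```python
import torch


def decode_prediction(pred, S=7):
    """Turn one (3, S, S) prediction into normalized global coordinates.

    pred[0:2] hold the per-cell coordinates in [0, 1]; pred[2] holds the
    softmaxed confidence map. Coordinates follow the repo's convention,
    where `x` is the vertical direction (rows).
    """
    conf = pred[2]                          # (S, S) confidence map
    idx = torch.argmax(conf).item()         # flattened index of the best cell
    row, col = idx // S, idx % S            # grid cell containing the object
    # Scale the in-cell prediction by 1/S and add the grid offset (row/S, col/S).
    x = (pred[0, row, col] / S + row / S).item()
    y = (pred[1, row, col] / S + col / S).item()
    return x, y, conf[row, col].item()


# Usage with a dummy prediction:
# x, y, score = decode_prediction(torch.rand(3, 7, 7))
```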