The topic of our final project is to perform object detection on the NEU surface defect database [1]. The NEU surface defect database contains six kinds of typical surface defects, with 300 samples per defect.
We choose YOLOv3 [2] as our model, and implement our project based on the repository PyTorch-YOLOv3 [3].
In this project, we have two primary goals:
Since the NEU surface defect database only includes images & annotations, we split the original data into three datasets (training/validation/testing). We split the data randomly while ensuring that every class contributes the same number of samples to a given dataset. The ratio of training to validation to testing data is 6:2:2.
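The split described above can be sketched as follows. This is only an illustration under stated assumptions: samples are given as `(name, label)` pairs, and the helper name `stratified_split` and the fixed seed are hypothetical, not taken from the project's notebook.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (name, label) pairs into train/val/test with equal class proportions.

    Hypothetical helper for illustration; not the project's actual code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for name, label in samples:
        by_class[label].append(name)

    train, val, test = [], [], []
    for label, names in sorted(by_class.items()):
        # Shuffle within each class so every split gets the same class balance.
        rng.shuffle(names)
        n_train = round(len(names) * ratios[0])
        n_val = round(len(names) * ratios[1])
        train += [(n, label) for n in names[:n_train]]
        val += [(n, label) for n in names[n_train:n_train + n_val]]
        test += [(n, label) for n in names[n_train + n_val:]]
    return train, val, test
```

With 300 samples per class, each class contributes 180 training, 60 validation, and 60 testing samples.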
NEU surface defect database provides PASCAL VOC [4] format annotations. To apply YOLO on our data, first, we need to convert the annotations into YOLO format.
PASCAL VOC format represents the bounding box by the top-left coordinate and the bottom-right coordinate \((x_{top-left},\ y_{top-left}),\ (x_{bottom-right},\ y_{bottom-right})\) and stores the data in XML format. On the other hand, YOLO format represents the bounding box by its center, width, and height \((x_{center},\ y_{center},\ width,\ height)\).
source code: data_process/prepare_data.ipynb
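As an illustration of the conversion (independent of the notebook above), here is a minimal sketch. It assumes the usual YOLO label convention in which all four values are normalized by the image width and height, and it reads the standard PASCAL VOC tags (`size`, `object`, `bndbox`). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a PASCAL VOC corner box to a normalized YOLO center box."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def convert_annotation(xml_text, class_names):
    """Return YOLO label rows (class_id, xc, yc, w, h) for one VOC XML document."""
    root = ET.fromstring(xml_text)
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    rows = []
    for obj in root.iter("object"):
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        corners = [float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        rows.append((class_id, *voc_to_yolo(*corners, img_w, img_h)))
    return rows
```

For example, the box (50, 100)-(150, 200) in a 200x200 image (sizes chosen for illustration) becomes (0.5, 0.75, 0.5, 0.5).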
In the beginning, we simply applied several common data augmentations to our training data. The augmentations are Gaussian blur, flip (vertical & horizontal), rotation, dropout, noise, and hue/saturation adjustment. We use imgaug [5] for the implementation. The detailed settings and the implementation source code are listed as follows:
    self.aug = iaa.Sequential([
        iaa.Sometimes(0.25, iaa.GaussianBlur(sigma=(0, 3.0))),
        iaa.Sometimes(0.25, iaa.Fliplr(0.25)),
        iaa.Sometimes(0.25, iaa.Flipud(0.25)),
        iaa.Sometimes(0.25, iaa.Affine(rotate=(-15, 15), mode='symmetric')),
        iaa.Sometimes(0.25,
                      iaa.OneOf([iaa.Dropout(p=(0, 0.25)),
                                 iaa.CoarseDropout((0.0, 0.05), size_percent=(0.02, 0.25))])),
        iaa.Sometimes(0.25, iaa.SaltAndPepper(0.5)),
        iaa.Sometimes(0.25,
                      iaa.AddToHueAndSaturation(value=(-20, 20), per_channel=True))
    ])
However, we got a worse result (AP = baseline-11%) after applying the heavy augmentation. We then visualized our augmentation results:
From the above pictures, we can observe that the blur & noise damage the structure of the original data, and the defect (crazing) becomes unrecognizable. Therefore, we adjusted the augmentation settings several times and eventually achieved a better AP.
After several adjustments, our augmentation settings became:
    self.aug = iaa.Sequential([
        iaa.Sometimes(0.25, iaa.Fliplr(0.5)),
        iaa.Sometimes(0.25, iaa.Flipud(0.5)),
        iaa.Sometimes(0.1, iaa.Affine(rotate=(-15, 15), mode='symmetric')),
        iaa.Sometimes(0.1,
                      iaa.OneOf([iaa.Dropout(p=(0, 0.1)),
                                 iaa.CoarseDropout((0.0, 0.05), size_percent=(0.01, 0.1))])),
        iaa.Sometimes(0.1, iaa.SaltAndPepper(0.1))
    ])
The main differences between the lite augmentation and the heavy augmentation are: (1) the probability of applying each augmentation is lowered from 0.25 to 0.1 (except flip), and the degree of each augmentation is lowered (such as the angle of rotation); (2) blur, massive noise (SaltAndPepper), and hue/saturation adjustment are removed. We excluded blur and massive noise due to the excessive noise they introduce, and excluded hue/saturation adjustment since our datasets are monochrome.
The above picture is the visualized result of the enhanced augmentation, and the mAP of the lite augmentation is baseline+2.3%.
source code: utils/datasets.py
After applying augmentation, we did some statistical analysis and noticed that the detection ratio & AP of crazing and pitted_surface are lower than those of the other classes.
To improve the AP of these two classes, we inspected the misclassified images (red bounding box = ground truth) and tried image processing that highlights the edges and fine details of the image.
We apply sharpening as a preprocessing step, that is, the sharpening effect is applied to all training and testing images. The sharpening effect is also implemented with imgaug [5]. Sharpen implementation:
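    import numpy as np
    import PIL.Image
    import imgaug.augmenters as iaa

    class SharpenTransform:
        def __init__(self):
            # Both ranges are degenerate, so the sharpening strength is fixed.
            self.aug = iaa.Sequential([
                iaa.Sharpen(alpha=(0.75, 0.75), lightness=(1.25, 1.25))
            ])

        def __call__(self, img):
            # PIL image -> NumPy array -> sharpen -> back to an RGB PIL image.
            img = np.array(img)
            out = self.aug.augment_image(img)
            out = PIL.Image.fromarray(out.astype('uint8'), 'RGB')
            return out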
However, sharpening did not give a better result. The mAP on the evaluation set becomes baseline-1.8% (although the mAP on the validation set is baseline+3.1%). Our conjecture is that the working mechanism of ML-based models (especially CNN models) may differ from our intuition or the human visual system. Although the edges look more apparent, some subtle information may be destroyed by the sharpening operation. Besides, other techniques, such as concatenating the processed data with the original data, may give better results. We have not implemented such methods since they are more complicated to design for object detection models.
source code: utils/datasets.py
YOLO performs predictions from a pre-determined set of boxes with particular height-width ratios [6], the so-called anchor boxes. Nevertheless, the default set of anchor boxes may not fit custom training data. Hence, we redefine the sizes of the anchor boxes according to the object sizes in our dataset. We use K-means clustering to compute the new anchor boxes (the center of each cluster becomes a new anchor box size).
The above shows our clustering results. We cluster the bounding-box sizes by width & height and get 9 bounding-box sizes (the default number in YOLO).
After adjusting the anchor box sizes, the evaluation mAP increases to baseline+3.1% (validation mAP = baseline+5.5%).
source code: data_process/anchor_box.ipynb
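The clustering step can be sketched as below. This is a minimal illustration, not the notebook's code: it assumes plain Euclidean K-means on `(width, height)` pairs (YOLO implementations often use an IoU-based distance instead), and the function name `kmeans_anchors` and its parameters are hypothetical.

```python
import numpy as np


def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs with plain Euclidean k-means."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct boxes picked at random.
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to its nearest center.
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            wh[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort by area so anchors can be assigned to the three YOLO scales in order.
    return centers[np.argsort(centers.prod(axis=1))]
```

The returned array has one row per anchor, ordered from the smallest to the largest area.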
IoU (Intersection over Union) is a common metric of object detection tasks.
The figure [7] shows the definition of IoU: the area of overlap between the prediction and the ground truth divided by the area of their union.
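For a concrete instance of this definition, here is a small sketch for axis-aligned boxes. It assumes boxes are given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; the function name `box_iou` is illustrative and is not the repository's `bbox_iou()`.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.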
IoU is a straightforward and efficient metric: it is simple to compute and has the property of scale invariance (focuses on the area of the shapes, no matter their size)[8]. Nevertheless, recent research [8][9] indicates that there are several issues of using IoU as a metric. For instance:
To solve the mentioned problems, we enhance the loss function by replacing IoU with GIoU (Generalized Intersection over Union) [8]. The concept of GIoU is to add a penalty term that suppresses the area which should not be bounded. The image on the left [10] illustrates the main component of GIoU:
The loss function using GIoU can be denoted as: \(\mathcal{L}_{GIoU}=1-IoU+\frac{\mid C-B\cup B^{gt} \mid}{\mid C \mid}\)
where \(C\) is the smallest convex hull that encloses both \(B\) (prediction bounding box) and \(B^{gt}\) (ground-truth bounding box), and \(\frac{\mid C-B\cup B^{gt} \mid}{\mid C \mid}\) is the mentioned penalty term. The above gives an example: the larger \(C\) is relative to \(B\cup B^{gt}\), the higher the value of \(\mathcal{L}_{GIoU}\) becomes. Hence the right case causes a higher loss than the left case.
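To make the effect of the penalty term concrete, the sketch below evaluates the loss for axis-aligned boxes. It is an illustration rather than the project's `bbox_giou()`: it assumes `(x1, y1, x2, y2)` boxes with `x1 < x2` and `y1 < y2`, takes \(C\) to be the smallest enclosing axis-aligned box, and uses hypothetical function names.

```python
def giou(box_a, box_b):
    """GIoU of two (x1, y1, x2, y2) boxes: IoU minus the enclosing-box penalty."""
    # Intersection and union, as for plain IoU.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box C enclosing both inputs.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (area_c - union) / area_c


def giou_loss(box_a, box_b):
    """Loss from the formula above: 1 - IoU + |C - (B u B_gt)| / |C|."""
    return 1.0 - giou(box_a, box_b)
```

For two disjoint unit boxes, IoU is 0 regardless of their distance, but the GIoU loss still grows with the gap: a gap of 0.5 gives a loss of 1.2, while a gap of 2 gives 1.5.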
To apply GIoU to our project (the PyTorch-YOLOv3 repository [3]), we replace the original IoU function (bbox_iou() in utils/utils.py) with our GIoU version bbox_giou(). We implemented the GIoU function with reference to [11]. The following is the code segment (some detailed operations are removed) of our GIoU implementation.
complete source code: utils/utils.py
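    # Assumed convention: (x1, y1) is the top-left and (x2, y2) the bottom-right
    # corner in image coordinates, so x2 > x1 and y2 > y1.
    # compute C (area of the smallest box enclosing both boxes)
    area_C = ((max(x1_pred, x2_pred, x1_gt, x2_gt) - min(x1_pred, x2_pred, x1_gt, x2_gt))
              * (max(y1_pred, y2_pred, y1_gt, y2_gt) - min(y1_pred, y2_pred, y1_gt, y2_gt)))
    # compute Union & Overlap
    area_pred = (x2_pred - x1_pred) * (y2_pred - y1_pred)
    area_gt = (x2_gt - x1_gt) * (y2_gt - y1_gt)
    sum_area = area_pred + area_gt
    w1 = x2_pred - x1_pred
    w2 = x2_gt - x1_gt
    h1 = y2_pred - y1_pred
    h2 = y2_gt - y1_gt
    # overlap extent = sum of both extents minus the enclosing extent
    W = min(x1_pred, x2_pred, x1_gt, x2_gt) + w1 + w2 - max(x1_pred, x2_pred, x1_gt, x2_gt)
    H = min(y1_pred, y2_pred, y1_gt, y2_gt) + h1 + h2 - max(y1_pred, y2_pred, y1_gt, y2_gt)
    Area = W * H                  # overlap area (valid only when the boxes overlap)
    add_area = sum_area - Area    # union area
    # get GIoU (iou is computed in the omitted part)
    end_area = (area_C - add_area) / area_C
    giou = iou - end_area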
Here we summarize the results of experiments on the evaluation set:
Experiment | Augment | Anchor Box | GIoU | Result (mAP)
---|---|---|---|---
Baseline | | | | 0.665
Aug | v | | | 0.688 (+2.3%)
Anchor Box | | v | | 0.680 (+1.5%)
GIoU | | | v | 0.704 (+3.9%)
Aug + Anchor | v | v | | 0.693 (+2.8%)
Aug + GIoU | v | | v | 0.701 (+3.6%)
Anchor + GIoU | | v | v | 0.679 (+1.4%)
All | v | v | v | 0.711 (+4.6%)
(abbreviations: aug/augment=data augmentation; anchor box=adjust anchor box sizes; GIoU=use GIoU instead of IoU)
From the experiment results, our observations and brief conclusions are: each technique can improve the mAP of our model. Furthermore, the experiment that applies all the techniques leads to the best mAP.
nvcr.io/nvidia/pytorch:20.01-py3
According to the implementation of PyTorch-YOLOv3 [3], we can save the weights of our PyTorch model in Darknet format using the save_darknet_weights function of class Darknet, which is the first step of our TensorRT deployment for the YOLOv3 model.
After saving our PyTorch YOLOv3 model in Darknet format, the next step is to construct the ONNX graph from the Darknet config and weights, since ONNX is one of the model formats accepted by TensorRT. Here we modified the sample code yolov3_to_onnx.py from NVIDIA: we changed the model I/O, set the correct output dimensions, and fixed some bugs in the code.
To successfully deploy the YOLOv3 model to TensorRT, it is necessary to check each layer in the YOLOv3 model architecture and find which layers are unsupported in TensorRT. After the examination, we can observe that most operations used in YOLOv3 are Convolution, BatchNormalization, or LeakyRelu, which can be parsed and deployed to TensorRT directly. However, the detection layer (YOLO layer) in the model is not supported. To solve this issue, a simple solution is to convert only the YOLOv3 backbone to ONNX and TensorRT and implement the YOLO layer as part of the postprocessing.
We implement the same preprocessing flow as PyTorch version, which consists of the following operations:
Image.open()
After running inference with the TensorRT execution context, we get three output arrays at different scales. The first thing we have to do in postprocessing is to reproduce the detection layer of YOLOv3 and apply it to these three output arrays to get the real bounding-box coordinates. The formula below shows what a detection layer does: we have to apply a sigmoid, calculate an exponential, multiply by the anchor dimensions, and add the corresponding grid coordinates.
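These operations can be sketched in NumPy as follows. This is a hedged illustration of the standard YOLOv3 decoding, not the project's postprocessing code: the array layout `(num_anchors, grid_h, grid_w, 5 + num_classes)`, the function name `decode_yolo_output`, and its parameters are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_yolo_output(raw, anchors, stride):
    """Decode one raw YOLOv3 head output into boxes in input-image pixels.

    raw:     array of shape (num_anchors, grid_h, grid_w, 5 + num_classes)
             holding (tx, ty, tw, th, objectness, class scores...); layout assumed.
    anchors: array of shape (num_anchors, 2) with anchor (width, height) in pixels.
    stride:  input size divided by grid size for this scale.
    """
    raw = np.asarray(raw, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    num_anchors, grid_h, grid_w, _ = raw.shape
    grid_y, grid_x = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")

    out = np.empty_like(raw)
    # Center: sigmoid offset inside the cell plus the cell index, scaled to pixels.
    out[..., 0] = (sigmoid(raw[..., 0]) + grid_x) * stride
    out[..., 1] = (sigmoid(raw[..., 1]) + grid_y) * stride
    # Size: exponential of the prediction times the anchor dimensions.
    out[..., 2] = np.exp(raw[..., 2]) * anchors[:, 0].reshape(-1, 1, 1)
    out[..., 3] = np.exp(raw[..., 3]) * anchors[:, 1].reshape(-1, 1, 1)
    # Objectness and class scores: sigmoid.
    out[..., 4:] = sigmoid(raw[..., 4:])
    return out.reshape(-1, raw.shape[-1])
```

The function would be called once per scale, and the three results concatenated before filtering.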
Since a one-stage object detection model predicts bounding boxes for each grid cell of the feature maps, a huge number of objects are predicted. However, only some of the output objects have high object confidence; the others usually have a confidence below 1%. To get a suitable result, these output proposals must be filtered. Thus, we assign an object confidence threshold and keep only the objects whose object confidence is higher than this threshold.
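A minimal sketch of this filtering step is shown below. It assumes each detection row is laid out as `(x, y, w, h, objectness, class scores...)`; the function name and the default threshold of 0.5 are illustrative, not the values used in the project.

```python
import numpy as np


def filter_by_confidence(detections, conf_thres=0.5):
    """Keep rows of (x, y, w, h, objectness, class scores...) above a threshold."""
    detections = np.asarray(detections, dtype=float)
    # Column 4 holds the object confidence in the assumed layout.
    return detections[detections[:, 4] > conf_thres]
```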
This step filters out repeated bounding boxes. As shown in the figure below, sometimes two bounding boxes both have high object confidence but overlap, that is, only one object actually exists. To deal with this condition, the Non-Maximum Suppression algorithm filters objects according to their IoU and scores, and discards objects that have a high IoU with another object but a relatively lower confidence.
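The idea can be sketched with a greedy, class-agnostic NMS in NumPy. This is an illustration under stated assumptions, not the project's implementation: boxes are `(x1, y1, x2, y2)` corners, and the function name `nms` and the IoU threshold of 0.5 are hypothetical.

```python
import numpy as np


def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns kept indices, best first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU between the best box and every remaining box.
        ix1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too much; keep the others for later rounds.
        order = rest[iou <= iou_thres]
    return keep
```

In the example of two heavily overlapping boxes, only the one with the higher score survives, while a distant third box is unaffected.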
The figures show the prediction results for patches_93 in PyTorch and in our TensorRT implementation. We have validated that the average difference between the output arrays of PyTorch and TensorRT is less than 0.001.
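Such a check could look like the sketch below. It assumes the "average difference" is the mean absolute element-wise difference between two arrays of the same shape; the function name is hypothetical, and the arrays in a real check would be the PyTorch and TensorRT outputs for the same input.

```python
import numpy as np


def mean_abs_diff(a, b):
    """Mean absolute element-wise difference between two same-shaped arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("output arrays must have the same shape")
    return float(np.mean(np.abs(a - b)))
```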
The table shows the performance of our model running on PyTorch and TensorRT respectively.
First, we built the TensorRT engine with the original precision (FP32) and measured the latency and the FPS. However, we only got about a 5% performance improvement. We then built a TensorRT engine with FP16 precision, and this time we got a 45% performance improvement. The main reason for this outcome is the different optimization NVIDIA provides for FP32 and FP16. FP16 needs only half as many bits as FP32. In addition, since NVIDIA supports this kind of operation, we benefit from using FP16 and get a larger performance improvement in the end.
Framework | Average Latency | FPS
---|---|---
PyTorch | 18.55 ms | 53.91
TensorRT FP32 | 17.01 ms | 58.78
TensorRT FP16 | 10.09 ms | 99.14
Our postprocessing includes the YOLO layer and NMS. In the table below, we can see that the postprocessing in PyTorch is faster. The main reason is that the YOLO layer and NMS in the PyTorch implementation are built from many torch operations, which run on CUDA devices. In our TensorRT implementation, however, the YOLO layer and NMS are not part of the TensorRT engine, so they run on the host. To solve this issue, a custom TensorRT plugin for the YOLO layer may be a good solution.
Framework | Average Latency
---|---
PyTorch | 2.39 ms
TensorRT FP32 | 3.64 ms
TensorRT FP16 | 3.66 ms