Unlike the R-CNN series, YOLO treats object detection as a regression problem and uses a single network to perform classification and box regression directly.
The method works as follows: the image is divided into an S * S grid, and each grid cell predicts the positions (x, y, w, h) of B bounding boxes, a confidence for each box (the probability that the box contains an object multiplied by its IOU with the ground truth), and a set of class probabilities. The output dimension is S * S * (B * 5 + C), where C is the number of categories. No matter how many boxes a grid cell contains, it predicts only one set of class probabilities. At test time, the conditional class probabilities are multiplied by each box's confidence to give a class-specific confidence score for that box. This score reflects both the probability that the class appears in the box and how well the box fits the object.
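The output layout and the test-time score can be sketched in NumPy. This is only an illustration: the values S = 7, B = 2, C = 20 and the ordering of the last axis (all box values first, then the class probabilities) are assumptions, and `class_scores` is a hypothetical helper name.

```python
import numpy as np

S, B, C = 7, 2, 20  # assumed example values, giving a 7 * 7 * 30 output

def class_scores(output):
    """Turn an (S, S, B*5 + C) output into (S, S, B, C) class-specific scores."""
    boxes = output[..., :B * 5].reshape(S, S, B, 5)  # x, y, w, h, confidence per box
    class_probs = output[..., B * 5:]                # one set of C probabilities per cell
    box_conf = boxes[..., 4]                         # (S, S, B)
    # Multiply every box confidence by every class probability of its cell.
    return box_conf[..., :, None] * class_probs[..., None, :]

rng = np.random.default_rng(0)
output = rng.random((S, S, B * 5 + C))
scores = class_scores(output)
print(output.shape, scores.shape)  # (7, 7, 30) (7, 7, 2, 20)
```

Each cell shares its C class probabilities across its B boxes, which is why the class part of the output does not grow with B.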
The base network is modeled on GoogLeNet, but instead of inception modules it alternates 1 * 1 and 3 * 3 convolutional layers.
The convolutional layers extract features, and the fully connected layers predict the categories and regress the box positions. In total there are 24 convolutional layers and 2 fully connected layers.
Pretraining (classification) network: the first 20 convolutional layers + 1 global average pooling layer + 1 fully connected layer
Detection network: the first 20 convolutional layers + 4 convolutional layers + 2 fully connected layers (the last one predicts the categories and box positions)
The loss function has 4 parts: box center position (x, y) loss + box width and height (w, h) loss + confidence loss + classification loss.
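The four parts can be sketched for a single grid cell as below. This is a simplified illustration, not the full training loss: it assumes one predicted box per cell, and the weights 5 and 0.5 and the square root on w, h are taken from the YOLO paper rather than from these notes. `yolo1_cell_loss` and the dict keys are hypothetical names.

```python
import numpy as np

def yolo1_cell_loss(pred, target, has_object, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss for one grid cell with a single predicted box.

    pred / target: dicts with 'xy' (2,), 'wh' (2,), 'conf' (float), 'cls' (C,).
    """
    if not has_object:
        # Empty cell: only the confidence is pushed towards 0, with a small weight.
        return float(lambda_noobj * pred["conf"] ** 2)
    xy_loss = np.sum((pred["xy"] - target["xy"]) ** 2)
    # Square roots make the same absolute error cost more on small boxes.
    wh_loss = np.sum((np.sqrt(pred["wh"]) - np.sqrt(target["wh"])) ** 2)
    conf_loss = (pred["conf"] - target["conf"]) ** 2
    cls_loss = np.sum((pred["cls"] - target["cls"]) ** 2)
    return float(lambda_coord * (xy_loss + wh_loss) + conf_loss + cls_loss)
```

In the full loss these terms are summed over all cells and boxes, and only the box responsible for an object contributes the coordinate terms.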
By using the large ImageNet classification data set to extend the categories available for detection, YOLO2 can detect more than 9,000 object categories (YOLO1 handles only 20).
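Batch normalization has several benefits: it keeps the gradients larger and avoids vanishing gradients; it gives faster convergence and faster training; and because the statistics are computed on each mini-batch rather than on the entire data set, they are noisy, which improves the generalization ability of the model.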
The YOLO1 classification network takes 224 * 224 input images, while the detection network takes 448 * 448 images. When switching to detection, YOLO1 therefore has to learn the object detection task and adapt to higher-resolution images at the same time.
Use the k-means clustering algorithm to let the model automatically select more appropriate widths and heights for the prior boxes (hand-picked prior sizes, as in Faster R-CNN, involve a certain degree of subjectivity).
The custom distance metric of the clustering algorithm is d(box, centroid) = 1 - IOU(box, centroid), where centroid is the box selected as a cluster center during clustering and box is any other box.
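A minimal sketch of this clustering, assuming the distance d(box, centroid) = 1 - IOU(box, centroid) from the YOLO2 paper and boxes given only as (width, height) pairs aligned at a common corner. The function names, the random initialization, and the stopping rule are illustrative choices, not the original implementation.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) boxes and (K, 2) centroids, both given as (w, h)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_priors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k prior boxes using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)   # d(box, centroid) = 1 - IoU
        assign = dist.argmin(axis=1)            # nearest centroid for each box
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

With Euclidean distance, large boxes would produce larger errors than small ones; the IoU-based distance is independent of box size.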
The network predicts the offset of the box center relative to its grid cell. A logistic (sigmoid) function limits the predicted value to the range 0-1, so the center offset never exceeds one grid cell. (An RPN predicts the offset between the anchor box and the predicted bbox; that offset is unconstrained and can be large, which makes the model unstable.)
Let the distance from the upper left corner of the grid cell to the upper left corner of the image be cx, cy, and let the height and width of the prior bounding box (template box) be ph and pw.
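The predicted box coordinates are calculated as follows, where tx, ty, tw, th are the raw network outputs and σ is the logistic function: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw * e^tw, bh = ph * e^th.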
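The decoding step can be sketched as below, assuming the YOLO2 paper's formulas bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw * e^tw, bh = ph * e^th, with all quantities measured in grid-cell units. `decode_box` is a hypothetical helper name.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs (tx, ty, tw, th) to a box center and size in grid units."""
    bx = sigmoid(tx) + cx    # center stays inside the cell whose corner is (cx, cy)
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)   # the prior width and height are rescaled
    bh = ph * math.exp(th)
    return bx, by, bw, bh

print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.5, ph=1.5))  # (3.5, 4.5, 2.5, 1.5)
```

Zero outputs place the center in the middle of the cell and reproduce the prior box size unchanged.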
The 26 * 26 * 512 feature map of the earlier layer is split into four 13 * 13 * 512 feature maps, which are concatenated into one 13 * 13 * 2048 feature map. This is then concatenated with the 13 * 13 * 1024 feature map of the later layer to obtain a 13 * 13 * 3072 feature map.
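The reshaping can be sketched in NumPy as a space-to-depth operation. This is an illustration of the shapes only: the channel ordering used here is an assumption and may differ from the ordering in the original Darknet implementation, and `passthrough` is a hypothetical function name.

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (H/s, W/s, s, s, C)
    return x.reshape(h // stride, w // stride, stride * stride * c)

fine = np.zeros((26, 26, 512), dtype=np.float32)     # earlier, higher-resolution map
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # later, lower-resolution map
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
print(passthrough(fine).shape, merged.shape)  # (13, 13, 2048) (13, 13, 3072)
```

No values are discarded: every 2 * 2 spatial neighborhood is moved into the channel dimension.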
The network is fully convolutional (FCN), so the input size is not fixed.
The network is similar to VGG. At the end, global average pooling reduces each feature map to a single value, so the fully connected layer that follows needs far fewer parameters.
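A quick count shows the size of the saving. The 7 * 7 * 1024 feature map and the 1000 output classes are assumed illustrative numbers, not values stated in these notes, and biases are ignored.

```python
def fc_weights(in_features, out_features):
    """Number of weights in a fully connected layer (biases ignored)."""
    return in_features * out_features

h, w, channels, classes = 7, 7, 1024, 1000  # assumed example sizes

flattened = fc_weights(h * w * channels, classes)  # flatten the whole map, then FC
pooled = fc_weights(channels, classes)             # global average pooling, then FC
print(flattened, pooled, flattened // pooled)      # 50176000 1024000 49
```

The reduction factor equals the spatial size of the final feature map, here 7 * 7 = 49.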
Remove the last convolutional layer of the classification network (the 1000-class output) and add three 3 * 3 convolutional layers with 1024 filters each, followed by a final 1 * 1 convolutional layer. A passthrough layer is added from the last 3 * 3 * 512 layer to the second-to-last 3 * 3 * 1024 layer so that finer-grained features are available, and the final 1 * 1 layer outputs the result. The network structure diagram is omitted. (It looks like 11 new layers are added here.)
YOLO2 proposes a joint training mechanism that mixes images from detection and classification data sets. When the network sees an image labeled for detection, it back-propagates the full YOLO2 loss. When it sees a classification image, it back-propagates only the loss from the classification-specific part of the architecture.
Each box may contain objects of multiple categories, and softmax can only be used for single-label classification, so it is replaced with sigmoid, which can be used for multi-label classification.
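The difference can be seen on a small example. The logits and the idea of two overlapping labels (for instance a general and a more specific category for the same object) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits: the first two labels both apply to the box, the third does not.
logits = np.array([4.0, 4.0, -4.0])

print(softmax(logits))   # probabilities compete and must sum to 1
print(sigmoid(logits))   # each label is scored independently
```

With a 0.5 threshold, sigmoid keeps both of the first two labels, while softmax splits the probability mass between them so neither exceeds the threshold.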
The upsampled feature map of the current layer is concatenated with the feature map of an earlier layer to obtain a combined feature map, and a few more convolutional layers process the combined feature map, so that finer-grained targets can be predicted.
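A shape-level sketch of this merge, assuming nearest-neighbor 2x upsampling and channel-wise concatenation as described in the YOLO3 paper. The feature map sizes and channel counts below are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

coarse = np.ones((13, 13, 256), dtype=np.float32)  # hypothetical deeper-layer map
fine = np.ones((26, 26, 512), dtype=np.float32)    # hypothetical earlier-layer map

merged = np.concatenate([upsample2x(coarse), fine], axis=-1)
print(merged.shape)  # (26, 26, 768)
```

The merged map combines the semantic information of the deeper layer with the spatial detail of the earlier one.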
Use the k-means clustering algorithm to select 9 template boxes, 3 for each of the three prediction scales, which improves recall (YOLO2 uses 5 template boxes, and YOLO1 predicts 2 boxes per grid cell).
Class prediction uses a binary cross-entropy loss function (YOLO2 uses squared error).
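The two class losses can be compared on one box. This is a minimal sketch assuming per-class sigmoid outputs and multi-hot labels; the probabilities and labels are made-up values, and the function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Sum of independent per-class binary cross-entropy terms."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def squared_error(p, y):
    """Sum of squared differences between predictions and labels."""
    return float(np.sum((p - y) ** 2))

y = np.array([1.0, 1.0, 0.0])   # made-up multi-hot label: two classes apply
p = np.array([0.9, 0.6, 0.2])   # made-up per-class sigmoid outputs

print(binary_cross_entropy(p, y), squared_error(p, y))
```

Cross-entropy grows without bound as a prediction approaches the wrong extreme, whereas the squared error of a single class is capped at 1.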