# YOLO9000: Better, Faster, Stronger(YOLOv2)(翻譯)
###### tags: `YOLO` `CNN` `論文翻譯` `deeplearning`
>[name=Shaoe.chen] [time=Tue, Apr 21, 2020]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/1612.08242)
* [吳恩達老師_深度學習_卷積神經網路_第三週_目標偵測](https://hackmd.io/@shaoeChen/SJXmp66KG?type=view)
:::
## Abstract
:::info
We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don’t have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.
:::
:::success
我們介紹YOLO9000,一個目前最先進的即時物體偵測系統,可以偵測超過9000個物體類別。首先,我們對YOLO檢測方法提出多種改善,包含創新的想法與借鑒自先前工作的做法。改進後的模型YOLOv2,在像是PASCAL VOC與COCO的標準檢測任務上是當今最強的。使用一種創新的多尺度訓練方法,相同的YOLOv2模型可以在不同大小的輸入上執行,在速度與準確度之間提供一個簡單的折衷。在67 FPS時,YOLOv2在VOC 2007上得到76.8的mAP。在40 FPS時,YOLOv2則得到78.6的mAP,優於像是採用ResNet的Faster RCNN與SSD等最先進的方法,而且執行速度明顯更快。最後,我們提出一種聯合訓練物體偵測與分類的方法。使用這個方法,我們同時在COCO檢測資料集與ImageNet分類資料集上訓練YOLO9000。這種聯合訓練的方式,讓YOLO9000可以對沒有標記檢測資料的物件類別預測出檢測結果。我們在ImageNet detection task上驗證我們的方法。儘管200個類別中只有44個類別擁有檢測資料,YOLO9000仍然在ImageNet檢測驗證集上得到19.7的mAP。在不屬於COCO的156個類別上,YOLO9000得到16.0的mAP。但YOLO能檢測的不只這200個類別;它可以對超過9000個不同的物件類別預測出檢測結果。而且它仍然可以即時執行。
:::
:::warning
個人見解:
* 聯合訓練的好處在於可以讓YOLO9000檢測出不在COCO資料集裡面的資料。因此即使在ImageNet上有資料,而在COCO沒有,也可以順利檢測出相關物件
:::
## 1. Introduction
:::info
General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.
:::
:::success
通用的目標檢測應該要快速、準確,而且要能辨識各式各樣的物件。自從引入神經網路之後,檢測框架已經變得越來越快、越來越準。但是,多數的檢測方法仍然侷限於一小部分的物件類別。
:::
:::info
Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags \[3\] \[10\] \[2\]. Classification datasets have millions of images with tens or hundreds of thousands of categories \[20\] \[2\].
:::
:::success
相較於其它任務(如分類與標記)的資料集,目前的目標檢測資料集是有限的。最常見的檢測資料集包含數千到數十萬張影像,以及數十到數百個標籤\[3\] \[10\] \[2\]。而分類資料集則擁有數百萬張影像,類別數達數萬或數十萬個\[20\] \[2\]。
:::
:::info
We would like detection to scale to level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free). Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.
:::
:::info

Figure 1: YOLO9000. YOLO9000 can detect a wide variety of object classes in real-time.
Figure 1:YOLO9000。YOLO9000可以即時檢測多種物件類別。
:::
:::success
我們希望檢測能夠擴展到物件分類的級別。然而,標記用於檢測的影像所花費的成本遠比用於分類或標記要貴的多(標記通常是用戶免費提供)。因此,我們不大可能在近期有機會看到檢測資料集的規模可以跟分類資料集的規模一樣大。
:::
:::info
We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together.
:::
:::success
我們提出一個新的方法,利用我們已經擁有的大量分類資料,並用它來擴展當前檢測系統的範圍。我們的方法使用物件分類的階層式觀點,這讓我們可以將不同的資料集結合在一起。
:::
:::info
We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.
:::
:::success
我們還提出一種聯合訓練的演算法,讓我們可以同時利用檢測資料與分類資料來訓練目標檢測器。我們的方法利用已標記的檢測影像來學習精確地定位物件,同時使用分類影像來增加其詞彙量與魯棒性。
:::
:::info
Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO.
:::
:::success
我們用這種方法訓練出YOLO9000,一個即時的目標檢測器,可以檢測超過9000種不同的物件類別。首先,我們在YOLO檢測系統的基礎上做了改善,得到YOLOv2,一個最先進的即時檢測器。然後,我們使用資料集結合的方法與聯合訓練演算法,在來自ImageNet的9000多個類別以及來自COCO的檢測資料上訓練模型。
:::
:::info
All of our code and pre-trained models are available online at http://pjreddie.com/yolo9000/.
:::
:::success
我們所有的程式碼與預訓練模型都可以在下面網址取得:http://pjreddie.com/yolo9000/
:::
## 2. Better
:::info
YOLO suffers from a variety of shortcomings relative tostate-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy.
:::
:::success
相較於最先進的檢測系統,YOLO有著許多缺點。將YOLO與Fast R-CNN做誤差分析顯示,YOLO有著相當多的定位誤差。此外,與基於region proposal的方法相比,YOLO的召回率相對較低。因此,我們主要關注在改善召回率與定位誤差,同時維持分類的準確率。
:::
:::info
Computer vision generally trends towards larger, deeper networks \[6\] \[18\] \[17\]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.
:::
:::success
電腦視覺通常趨向於更大、更深的網路 \[6\] \[18\] \[17\]。好的效能通常取決於訓練大型網路或是將多個模型集合在一起。然而,對於YOLOv2,我們想要的是一個更準確的檢測器,而且仍然非常的快速。我們簡化網路而不擴大網路,也讓representation更容易學習。我們將過去工作中的各種想法以及我們創新的概念集合在一起,以此提高YOLO的效能。結果的總結可以在Table 2中找到。
:::
:::info

Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the anchor box style approach increased recall without changing mAP while using the new network cut computation by 33%.
Table 2:YOLO到YOLOv2的路徑。清單內的多數設計決策都會導致mAP明顯的增加。有兩個例外,切換至使用anchor boxes的全卷積網路,以及使用新的網路。切換到anchor box類型的方法可以增加召回率而不改變mAP,同時使用新的網路可以減少33%的計算。
:::
:::info
**Batch Normalization.** Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in mAP. Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.
:::
:::success
**Batch Normalization.** Batch normalization讓模型的收斂性有了明顯的改善,同時消除對其它形式的正規化的需求\[7\]。透過在YOLO的所有卷積層中增加batch normalization,我們在mAP上得到2%以上的改善。Batch normalization同時有助於模型的正規化。使用Batch normalization,我們可以移除dropout而不會過擬合。
:::
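:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面以 PyTorch 簡單示意「卷積層後接 batch normalization、並捨棄 dropout」的寫法;層數與通道數皆為假設,僅說明做法。

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    """卷積 + batch normalization + LeakyReLU;卷積不使用 bias(由 BN 的偏移取代)。"""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),          # 同時具有正規化模型的效果,因此可移除 dropout
        nn.LeakyReLU(0.1, inplace=True),
    )

# 兩層帶 BN 的卷積(通道數僅為示意)
block = nn.Sequential(conv_bn_leaky(3, 32, 3), conv_bn_leaky(32, 64, 3))
x = torch.randn(1, 3, 416, 416)
print(block(x).shape)   # torch.Size([1, 64, 416, 416])
```
:::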
:::info
**High Resolution Classifier.** All state-of-the-art detection methods use classifier pre-trained on ImageNet [16]. Starting with AlexNet most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.
:::
:::success
**High Resolution Classifier.** 所有最新的檢測方法都使用ImageNet\[16\]上預訓練的分類器。從AlexNet開始,多數的分類器處理的都是小於256x256的輸入\[8\]。原始的YOLO在224x224的解析度上訓練分類器,然後再將解析度提高到448x448來做檢測。這意味著網路必須同時切換至學習目標檢測並調整為新的輸入解析度。
:::
:::info
For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.
:::
:::success
對於YOLOv2,我們首先在ImageNet上以完整的448x448解析度微調分類網路10個epochs。這讓網路有時間調整其濾波器,以便在較高解析度的輸入上運作得更好。接著,我們再在檢測任務上微調這個網路。這種高解析度的分類網路讓我們的mAP增加幾乎4%。
:::
:::info
**Convolutional With Anchor Boxes.** YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly Faster R-CNN predicts bounding boxes using hand-picked priors \[15\]. Using only convolutional layers the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
:::
:::success
**Convolutional With Anchor Boxes.** YOLO在卷積特徵提取器之後使用全連接層直接預測邊界框的座標。Faster R-CNN並不直接預測座標,而是使用手工挑選的先驗(priors)來預測邊界框\[15\]。Faster R-CNN中的region proposal network (RPN)僅使用卷積層來預測anchor box的偏移量與置信度。因為預測層是卷積層,所以RPN會在feature map中的每一個位置預測這些偏移量。預測偏移量而不是座標,簡化了問題,也讓網路更容易學習。
:::
:::info
We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network’s convolutional layers higher resolution. We also shrink the network to operate on 416 input images instead of 448×448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image so it’s good to have a single location right at the center to predict these objects instead of four locations that are all nearby. YOLO’s convolutional layers downsample the image by a factor of 32 so by using an input image of 416 we get an output feature map of 13 × 13.
:::
:::success
我們移除了YOLO內的全連接層,並使用anchor boxes來預測邊界框。首先,我們去掉一個池化層,讓網路卷積層的輸出有更高的解析度。我們還將網路的輸入影像縮小為416x416,而不是448x448。這麼做是因為我們希望feature map內的位置數量是奇數,這樣就只會有一個中心格子。物件,尤其是大型物件,往往會佔據影像的中心,因此最好在正中心有單一個位置來預測這些物件,而不是中心附近的四個位置。YOLO的卷積層會將影像[降低取樣](http://terms.naer.edu.tw/detail/7274075/?index=1)32倍,因此使用416x416的輸入影像會得到13x13的輸出feature map。
:::
:::info
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object.
:::
:::success
當我們改用anchor boxes的時候,我們也將類別預測機制與空間位置[解耦](http://terms.naer.edu.tw/detail/7268980/),改為對每個anchor box預測類別與objectness。沿用YOLO的做法,objectness的預測仍然是預測實際框與候選框之間的IOU,而類別的預測則是在給定存在物件的情況下,預測該類別的條件機率。
:::
:::warning
個人見解:
* objectness,指anchor box內是否存在物件
:::
:::info
Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.
:::
:::success
使用anchor boxes會讓準確度略為下降。YOLO每一張影像僅預測98個框,但使用anchor boxes後,我們的模型可以預測超過一千個框。在沒有anchor boxes的情況下,我們的中間模型得到69.5的mAP,以及81%的召回率。使用anchor boxes之後,我們的模型得到69.2的mAP以及88%的召回率。即使mAP略為下降,召回率的增加意味著我們的模型有更多的改進空間。
:::
:::info
**Dimension Clusters.** We encounter two issues with anchor boxes when using them with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately but if we pick better priors for the network to start with we can make it easier for the network to learn to predict good detections.
:::
:::success
**Dimension Clusters.** 在YOLO上使用anchor boxes,我們遇到兩個問題。第一個問題是,框的維度是手工挑選的。網路可以學習適當地調整框,但是如果我們為網路挑選更好的先驗條件做為起始,那就可以讓網路更容易學會預測好的檢測。
:::
:::info
Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for our distance metric we use:
$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$
:::
:::success
不使用手工挑選的先驗,我們改在訓練集的邊界框上執行k-means clustering來自動找到好的先驗。如果我們使用標準的k-means(歐幾里德距離),那麼較大的框產生的誤差會比較小的框更多。然而,我們真正想要的是能夠得到好的IOU分數的先驗,這與框的大小無關。因此,對於距離的度量我們使用:
$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$
:::
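:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面以 NumPy 示意「以 $d = 1 - IOU$ 作為距離」的 k-means 如何對訓練集框的寬高做聚類;初始化方式、收斂條件與函式名稱都是假設的簡化寫法。

```python
import numpy as np

def wh_iou(boxes, centroids):
    """boxes: (N, 2)、centroids: (k, 2),皆為 (w, h);假設框中心對齊,只比較寬高。"""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_iou(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - wh_iou(boxes, centroids)      # 論文定義的距離 d = 1 - IOU
        assign = dist.argmin(axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# 以隨機寬高示意;實務上應使用訓練集 ground truth 框的寬高
boxes = np.abs(np.random.default_rng(1).normal(size=(1000, 2))) + 0.1
print(kmeans_iou(boxes, k=5))
```
:::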
:::info
We run k-means for various values of k and plot the average IOU with closest centroid, see Figure 2. We choose k = 5 as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes.
:::
:::success
我們用各種不同的$k$值來執行k-means,並將與最接近的[中心點](http://terms.naer.edu.tw/detail/7256049/)(centroid)的平均IOU畫出來,見Figure 2。我們選擇$k=5$做為模型複雜度與高召回率之間的折衷。聚類得到的中心點與手工挑選的anchor boxes明顯不同:短而寬的框變少,瘦而高的框變多。
:::
:::info

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.
Figure 2:在VOC與COCO上的聚類框維度。我們在邊界框的維度上執行k-means聚類,以此為模型獲得好的先驗。左邊的圖說明的是我們以各種不同的k所得的平均IOU。我們發現,當k=5的時候,可以在模型的召回率與複雜度上取得一個好的折衷。右邊的圖說明的是VOC與COCO的相對中心點。兩組的先驗都傾向更薄、更高的框,而COCO的大小變化會比VOC還要大。
:::
:::info
We compare the average IOU to closest prior of our clustering strategy and the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.
:::
:::success
我們在Table 1中比較我們的聚類策略與手工挑選的anchor boxes,兩者與最接近先驗的平均IOU。在只有5個先驗的情況下,[中心點](http://terms.naer.edu.tw/detail/7256049/)的表現與9個anchor boxes相近,平均IOU為61.0,對比60.9。如果我們使用9個中心點,則會看到更高的平均IOU。這說明使用k-means來生成邊界框先驗,可以讓模型以更好的representation起步,並且讓任務更易於學習。
:::
:::info

Table 1: Average IOU of boxes to closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior using different generation methods. Clustering gives much better results than using hand-picked priors.
Table 1:VOC2007上框與最接近先驗的平均IOU。使用不同的生成方法時,VOC2007上的物件與其最接近、未經修改的先驗之間的平均IOU。聚類得到的結果比使用手工挑選的先驗好得多。
:::
:::info
**Direct location prediction.** When using anchor boxes with YOLO we encounter a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the $(x, y)$ locations for the box. In region proposal networks the network predicts values $t_x$ and $t_y$ and the $(x, y)$ center coordinates are calculated as:
$$
\begin{align}
x = (t_x * w_a) - x_a \\
y = (t_y * h_a) - y_a \\
\end{align}
$$
:::
:::success
當在YOLO使用anchor boxes的時候,我們遇到了第二個問題:模型不穩定,尤其是在初期的迭代期間。多數的不穩定來自於預測框的$(x, y)$位置。在region proposal networks(RPN)中,網路預測$t_x$與$t_y$兩個值,中心座標$(x, y)$的計算為:
$$
\begin{align}
x = (t_x * w_a) - x_a \\
y = (t_y * h_a) - y_a \\
\end{align}
$$
:::
:::info
For example, a prediction of $t_x = 1$ would shift the box to the right by the width of the anchor box, a prediction of $t_x = −1$ would shift it to the left by the same amount.
:::
:::success
舉例來說,$t_x = 1$的預測會將框向右平移一個anchor box的寬度,而$t_x = −1$的預測則會將框向左平移相同的量。
:::
:::info
This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets.
:::
:::success
這個數學式是不受限制的,因此無論是由哪個位置預測出這個框,任何的anchor box最終都可能落在影像中的任何一點。在隨機初始化的情況下,模型需要花費很長的時間才能穩定下來,預測出合理的偏移量。
:::
:::info
Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range.
:::
:::success
我們不預測偏移量,而是依著YOLO的方法,預測相對於網格位置的位置座標。這將實際框限制在0、1之間。我們使用logistic activation來限制網路的預測會落於這個區間範圍內。
:::
:::warning
個人理解:
* 使用sigmoid讓輸出介於0、1之間
:::
:::info
The network predicts 5 bounding boxes at each cell in the output feature map. The network predicts 5 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$, and $t_o$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:
$$
\begin{align}
b_x = \sigma(t_x) + c_x \\
b_y = \sigma(t_y) + c_y \\
b_w = p_w e^{t_w} \\
b_h = p_h e^{t_h} \\
Pr(object) * IOU(b, object) = \sigma(t_o)
\end{align}
$$
:::
:::success
網路會在輸出的feature map中的每一個格子上預測出5個邊界框。網路為每個邊界框預測5個座標:$t_x$、$t_y$、$t_w$、$t_h$與$t_o$。如果格子相對於影像左上角的偏移為$(c_x, c_y)$,而邊界框先驗的寬與高為$p_w$、$p_h$,那麼預測對應為:
$$
\begin{align}
b_x = \sigma(t_x) + c_x \\
b_y = \sigma(t_y) + c_y \\
b_w = p_w e^{t_w} \\
b_h = p_h e^{t_h} \\
Pr(object) * IOU(b, object) = \sigma(t_o)
\end{align}
$$
:::
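:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面以 NumPy 示意如何把網路輸出的 $(t_x, t_y, t_w, t_h, t_o)$ 依上面的公式解碼成邊界框;函式與變數名稱為假設,座標以格子(cell)為單位。

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, prior_wh):
    """t: (tx, ty, tw, th, to);cell_xy: 格子左上角偏移 (cx, cy);prior_wh: 先驗框 (pw, ph)。"""
    tx, ty, tw, th, to = t
    cx, cy = cell_xy
    pw, ph = prior_wh
    bx = sigmoid(tx) + cx        # sigmoid 把中心點限制在該格子內
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)         # 寬高是先驗框的指數縮放
    bh = ph * np.exp(th)
    conf = sigmoid(to)           # Pr(object) * IOU 的估計值
    return bx, by, bw, bh, conf

# 例:位於 (6, 6) 這個格子、先驗框寬高為 (3.0, 2.0) 的一組預測
print(decode_box((0.2, -0.1, 0.5, 0.3, 1.2), (6, 6), (3.0, 2.0)))
```
:::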
:::info
Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.
:::
:::success
因為我們限制了位置的預測,因此參數化更容易學習,這也讓網路更為穩定。與使用anchor boxes版本相比,使用dimension clusters並直接預測邊界框中心位置,這提高幾乎5%的效能。
:::
:::info
**Fine-Grained Features.** This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.
:::
:::success
**Fine-Grained Features.** 修改後的YOLO是在13x13的feature map上預測檢測結果。雖然這對大型物件而言已經足夠,但對於定位較小的物件,更細粒度的特徵可能會有幫助。Faster R-CNN與SSD都是在網路中不同的feature maps上執行它們的proposal networks,以此得到一系列的解析度。我們採用另一種方法,只加入一個passthrough layer,把較早的層中26x26解析度的特徵帶過來。
:::
:::info
The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest 1% performance increase.
:::
:::success
passthrough layer透過將[相鄰的](http://terms.naer.edu.tw/detail/7241021/)特徵堆疊到不同的通道(而非空間位置),把高解析度特徵與低解析度特徵[串接](http://terms.naer.edu.tw/detail/7262043/)起來,這類似於ResNet中的[恆等映射](http://terms.naer.edu.tw/detail/2117540/)。這個方法將26x26x512的feature map轉換為13x13x2048的feature map,便可以與原始的特徵相[串接](http://terms.naer.edu.tw/detail/7262043/)。我們的檢測器在這個擴展後的feature map上執行,因此可以存取細粒度的特徵。這帶來了約1%的小幅效能提升。
:::
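:::warning
個人補充(示意程式碼,非論文原始實作):
* passthrough layer 的重排效果類似 space-to-depth:把 26x26x512 變成 13x13x2048 再與 13x13 的特徵串接。下面以 PyTorch 示意,通道的排列順序等細節是假設。

```python
import torch

def passthrough(x, stride=2):
    """把 (N, C, H, W) 重排成 (N, C*stride*stride, H/stride, W/stride)。"""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)       # 較早層的高解析度特徵
coarse = torch.randn(1, 1024, 13, 13)    # 13x13 的低解析度特徵
merged = torch.cat([passthrough(fine), coarse], dim=1)
print(merged.shape)                      # torch.Size([1, 3072, 13, 13]),即 2048 + 1024
```
:::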
:::info

Figure 3: Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function
Figure 3:具有維度先驗與位置預測的邊界框。我們將框的寬與高預測為相對於聚類中心點的偏移量。我們使用sigmoid function預測框的中心座標(相對於濾波器應用位置)。
:::
:::info
**Multi-Scale Training.** The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416×416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model.
:::
:::success
**Multi-Scale Training.** 原始YOLO使用的輸入解析度是448x448。隨著加入anchor boxes,我們將解析度調整為416x416。然而,由於我們的模型只使用卷積與池化層,因此可以動態地調整大小。我們希望YOLOv2對於不同大小的影像都能穩健地執行,因此我們把這一點訓練進模型裡。
:::
:::info
Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
:::
:::success
我們每隔幾次迭代就會變更輸入影像的大小,而不是固定不動。每10個批次,我們的網路就會隨機選擇新的影像大小。因為我們的模型有做32倍的[降低取樣](http://terms.naer.edu.tw/detail/7274075/?index=1),因此我們從下面32的倍數中取出:{320, 352, ..., 608}。因此,最小的選項是320x320,最大是608x608。我們將網路調整為該尺寸大小,然後繼續訓練。
:::
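:::warning
個人補充(示意程式碼,非論文原始實作):
* 多尺度訓練的重點是「每 10 個批次從 {320, 352, ..., 608} 隨機挑一個輸入尺寸」;下面的訓練迴圈與縮放方式(雙線性插值)為假設的簡化寫法。

```python
import random
import torch
import torch.nn.functional as F

SIZES = list(range(320, 609, 32))        # {320, 352, ..., 608},皆為 32 的倍數

def pick_size(batch_idx, current_size):
    """每 10 個批次隨機換一個輸入尺寸,其餘批次沿用目前尺寸。"""
    return random.choice(SIZES) if batch_idx % 10 == 0 else current_size

size = 416
for batch_idx in range(30):              # 模擬 30 個批次
    size = pick_size(batch_idx, size)
    images = torch.randn(8, 3, 448, 448) # 假設載入的批次影像為 448x448
    images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    # 實際訓練時在這裡做 model(images) 的 forward / backward
    if batch_idx % 10 == 0:
        print(batch_idx, tuple(images.shape[-2:]))
```
:::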
:::info
This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions. The network runs faster at smaller sizes so YOLOv2 offers an easy tradeoff between speed and accuracy.
:::
:::success
這樣的機制強迫網路學習在各種不同的輸入維度上都能做出好的預測。這意味著,相同的網路可以在不同的解析度下做檢測。在較小的尺寸下網路執行得更快,因此YOLOv2可以在速度與準確度之間輕鬆地權衡。
:::
:::info
At low resolutions YOLOv2 operates as a cheap, fairly accurate detector. At 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high framerate video, or multiple video streams.
:::
:::success
在低解析度情況下,YOLOv2是一個低成本但相當準確的檢測器。在解析度288x288時,它可以執行超過90 FPS,而且mAP幾乎與Fast R-CNN相當。這讓它非常適合較小的GPU、[高畫面更新率](http://terms.naer.edu.tw/detail/7286438/)的影片,或多重視訊串流。
:::
:::info
At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007. Figure 4
:::
:::success
在高解析度情況下,YOLOv2是一個最先進的檢測器,在VOC2007上達到78.6的mAP,同時仍然以即時的速度執行。YOLOv2與其它框架在VOC2007上的比較請見Table 3與Figure 4。
:::
:::info

Figure 4: Accuracy and speed on VOC 2007.
:::
:::info

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a Geforce GTX Titan X (original, not Pascal model).
Table 3:在PASCAL VOC 2007上的檢測框架。YOLOv2比先前的檢測方法更快也更準。它還可以在不同解析度下執行,輕鬆地在速度與準確度之間折衷。每一列YOLOv2的結果實際上都是同一個訓練模型(相同的權重),只是以不同的大小來評估而已。所有的時間資訊都是在Geforce GTX Titan X上測得(原始版,不是Pascal架構的型號)。
:::
:::info
**Further Experiments.** We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare to other methods in Table 5. On the VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.
:::
:::success
**Further Experiments.** 我們在VOC2012上訓練YOLOv2做檢測。Table 4說明YOLOv2與其它最先進檢測系統的效能比較。YOLOv2達到73.4的mAP,同時執行速度遠快於其它競爭方法。我們還在COCO資料集上訓練,並在Table 5與其它方法做比較。在VOC的度量上(IOU=0.5),YOLOv2得到44.0的mAP,與SSD、Faster R-CNN旗鼓相當。
:::
:::info

Table 5: Results on COCO test-dev2015. Table adapted from [11]
Table 5:在COCO test-dev2015上的結果。表格改編自[11]。
:::
:::info

Table 4: PASCAL VOC2012 test detection results. YOLOv2 performs on par with state-of-the-art detectors like Faster R-CNN with ResNet and SSD512 and is 2 − 10× faster.
Table 4:PASCAL VOC2012測試集的檢測結果。YOLOv2的效能與最先進的檢測器(如使用ResNet的Faster R-CNN、SSD512)相當,而且快了2-10倍。
:::
## 3. Faster
:::info
We want detection to be accurate but we also want it to be fast. Most applications for detection, like robotics or self-driving cars, rely on low latency predictions. In order to maximize performance we design YOLOv2 to be fast from the ground up.
:::
:::success
我們希望檢測除了準確之外還要快。多數檢測的應用,像是機器人或自駕車,都仰賴低延遲的預測。為了最大化效能,我們從頭開始就把YOLOv2設計成快速的。
:::
:::info
Most detection frameworks rely on VGG-16 as the base feature extractor \[17\]. VGG-16 is a powerful, accurate classification network but it is needlessly complex. The convolutional layers of VGG-16 require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.
:::
:::success
多數的檢測框架都以VGG-16\[17\]做為基本的特徵提取器。VGG-16是一個強大、準確的分類網路,但它過於複雜。對於一張224x224解析度的影像,VGG-16的卷積層單次前向傳遞就需要306.9億次浮點運算。
:::
:::info
The YOLO framework uses a custom network based on the Googlenet architecture \[19\]. This network is faster than VGG-16, only using 8.52 billion operations for a forward pass. However, it’s accuracy is slightly worse than VGG16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s custom model gets 88.0% ImageNet compared to 90.0% for VGG-16.
:::
:::success
YOLO框架使用基於Googlenet架構\[19\]的自定義網路。這個網路比VGG-16還要快,一次forward pass只需要85.2億次運算。但是它的準確度比VGG-16略差一點。對於單一剪裁(single-crop)、解析度224x224的top-5準確率,YOLO的自定義模型在ImageNet上得到88.0%,而VGG-16則為90.0%。
:::
:::info
**Darknet-19.** We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step \[17\]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions \[9\]. We use batch normalization to stabilize training, speed up convergence, and regularize the model \[7\].
:::
:::success
**Darknet-19.** 我們提出一個新的分類模型做為YOLOv2的基礎。我們的模型建立在先前的網路設計工作以及該領域的共同知識之上。與VGG模型類似,我們大多使用3x3的濾波器,並在每一個池化步驟之後讓通道數增加一倍\[17\]。依循Network in Network (NIN)的做法,我們使用全域平均池化來做預測,並使用1x1濾波器在3x3卷積之間壓縮feature representation\[9\]。我們使用batch normalization來穩定訓練、加速收斂並正規化模型\[7\]。
:::
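:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面用 PyTorch 示意 Darknet-19 的設計慣例:3x3 卷積、1x1 卷積做通道壓縮、池化後通道加倍、最後以全域平均池化做預測;這裡只擷取其中一個階段,通道數僅供參考,並非完整的 Table 6 網路。

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# 一個階段:3x3 -> 1x1 壓縮 -> 3x3,之後接最大池化(通道加倍發生在下一個階段)
stage = nn.Sequential(
    conv(128, 256, 3),
    conv(256, 128, 1),   # 1x1 卷積壓縮 feature representation
    conv(128, 256, 3),
    nn.MaxPool2d(2, 2),
)

# 分類頭:1x1 卷積輸出 1000 類,再接全域平均池化
head = nn.Sequential(nn.Conv2d(256, 1000, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten())

x = torch.randn(1, 128, 28, 28)
print(head(stage(x)).shape)   # torch.Size([1, 1000])
```
:::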
:::info
Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.
:::
:::success
最終的模型稱為Darknet-19,擁有19個卷積層以及5個最大池化層。完整的說明見Table 6。Darknet-19只需要55.8億次運算來處理一張影像,但在ImageNet上可以達到72.9%的top-1準確率,以及91.2%的top-5準確率。
:::
:::info

Table 6: Darknet-19.
:::
:::info
**Training for classification.** We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework \[13\]. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
:::
:::success
**Training for classification.** 我們使用Darknet神經網路框架\[13\],在標準的ImageNet 1000類分類資料集上訓練網路160個epochs,使用隨機梯度下降,起始的learning rate為0.1,搭配power為4的多項式速率衰減、0.0005的權重衰減以及0.9的momentum。訓練過程中,我們使用標準的資料增強技巧,包括隨機剪裁、旋轉,以及色相、飽和度與曝光度的偏移。
:::
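:::warning
個人補充(示意程式碼,非論文原始實作):
* 論文提到「power 為 4 的多項式速率衰減」;常見的 polynomial decay 形式如下,這個式子是否與原始實作完全一致屬於假設,僅以初始 learning rate 0.1、160 epochs 示意。

```python
def poly_lr(base_lr, epoch, max_epochs, power=4):
    """常見的多項式學習率衰減:lr = base_lr * (1 - epoch / max_epochs) ** power。"""
    return base_lr * (1.0 - epoch / max_epochs) ** power

base_lr, max_epochs = 0.1, 160
for epoch in (0, 40, 80, 120, 159):
    print(epoch, round(poly_lr(base_lr, epoch, max_epochs), 6))
```
:::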
:::info
As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of $10^{-3}$ . At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
:::
:::success
如上所述,在以224x224的影像做完初始訓練之後,我們以較大的尺寸448x448對網路做微調。在這個微調階段,我們使用上述的參數,但只訓練10個epochs,並以$10^{-3}$的learning rate開始。在這種較高解析度下,我們的網路達到76.5%的top-1準確率以及93.3%的top-5準確率。
:::
:::info
**Training for detection.** We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
:::
:::success
**Training for detection.** 我們修改這個網路讓它可以用於檢測:移除最後一個卷積層,改加上3個3x3卷積層(各有1024個filters),最後再接一個1x1的卷積層,其輸出數量為檢測所需的數量。對於VOC,我們預測5個框,每個框有5個座標以及20個類別,因此filters為125。我們還從最後一個3x3x512的層加入一個passthrough layer連到倒數第二個卷積層,讓我們的模型可以使用[細粒](http://terms.naer.edu.tw/detail/7283709/)特徵。
:::
:::warning
個人理解:
* 5個座標+20個類別=25,然後5個框,因此5x25=125
:::
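:::warning
個人補充(示意程式碼):
* 承上面的計算,檢測輸出層的 filter 數可以一般化為「先驗框數 ×(5 個座標/objectness + 類別數)」;下面的小函式只是驗證這個算式(COCO 為 80 類屬一般常識,非出自本文)。

```python
def detection_filters(num_anchors, num_classes, num_coords=5):
    """每個 anchor 預測 5 個值 (tx, ty, tw, th, to),再加上 num_classes 個類別分數。"""
    return num_anchors * (num_coords + num_classes)

print(detection_filters(5, 20))   # VOC:5 * (5 + 20) = 125
print(detection_filters(5, 80))   # COCO(80 類):5 * (5 + 80) = 425
```
:::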
:::info
We train the network for 160 epochs with a starting learning rate of $10^{-3}$ , dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.
:::
:::success
我們將網路訓練160個epochs,初始的learning rate為$10^{-3}$,並在第60與第90個epoch時將其除以10。我們使用0.0005的權重衰減以及0.9的momentum。我們使用與YOLO、SSD類似的資料增強方法:隨機剪裁、顏色偏移等。我們在COCO與VOC上使用相同的訓練策略。
:::
## 4. Stronger
:::info
We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information like bounding box coordinate prediction and objectness as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect.
:::
:::success
我們提出一個在分類與檢測資料上聯合訓練的機制。我們的方法使用標記為檢測的影像來學習檢測特有的資訊,像是邊界框座標的預測與objectness,以及如何對常見的物件分類。它則使用只有類別標籤的影像來擴展它可以檢測的類別數量。
:::
:::info
During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classification-specific parts of the architecture.
:::
:::success
訓練過程中,我們混合來自檢測與分類資料集的影像。當我們的網路看到標記為檢測的影像時,我們可以基於完整的YOLOv2 loss function來反向傳播。當它看到的是分類影像時,我們只從架構中與分類相關的部份反向傳播loss。
:::
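:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面以 PyTorch 示意聯合訓練時的 loss 分流邏輯:檢測樣本反向傳播完整 loss,分類樣本只反向傳播分類部份;各項 loss 的實際形式(cross entropy、MSE)與張量格式皆為假設。

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, target, has_detection_label):
    """檢測樣本:分類 + 邊界框 + objectness;分類樣本:只有分類 loss。"""
    cls_loss = F.cross_entropy(pred["cls"], target["cls"])
    if has_detection_label:
        box_loss = F.mse_loss(pred["box"], target["box"])
        obj_loss = F.mse_loss(pred["obj"], target["obj"])
        return cls_loss + box_loss + obj_loss    # 完整(簡化後)的檢測 loss
    return cls_loss                              # 分類影像只回傳分類 loss

pred = {"cls": torch.randn(4, 20), "box": torch.randn(4, 4), "obj": torch.rand(4)}
target = {"cls": torch.randint(0, 20, (4,)), "box": torch.randn(4, 4), "obj": torch.rand(4)}
print(joint_loss(pred, target, has_detection_label=True))
print(joint_loss(pred, target, has_detection_label=False))
```
:::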
:::info
This approach presents a few challenges. Detection datasets have only common objects and general labels, like “dog” or “boat”. Classification datasets have a much wider and deeper range of labels. ImageNet has more than a hundred breeds of dog, including “Norfolk terrier”, “Yorkshire terrier”, and “Bedlington terrier”. If we want to train on both datasets we need a coherent way to merge these labels
:::
:::success
這個方法帶來了一些挑戰。檢測資料集只有常見的物件與通用的標籤,像是"dog"或"boat"。分類資料集則擁有更廣、更深層次的標籤。ImageNet有超過一百種的狗,包含"Norfolk terrier"、"Yorkshire terrier"與"Bedlington terrier"。如果我們希望在這兩種資料集上訓練,那就需要一種連貫的方法來合併這些標籤。
:::
:::info
Most approaches to classification use a softmax layer across all the possible categories to compute the final probability distribution. Using a softmax assumes the classes are mutually exclusive. This presents problems for combining datasets, for example you would not want to combine ImageNet and COCO using this model because the classes “Norfolk terrier” and “dog” are not mutually exclusive.
:::
:::success
多數用於分類的方法都是使用softmax layer在所有可能的類別上計算最終的機率分佈。使用softmax的前提是假設類別是[互斥](http://terms.naer.edu.tw/detail/6927/)的。這給合併資料集帶來問題,舉例來說,你不會想要用這種模型來結合ImageNet與COCO,因為"Norfolk terrier"與"dog"這兩個類別並不是[互斥](http://terms.naer.edu.tw/detail/6927/)的。
:::
:::info
We could instead use a multi-label model to combine the datasets which does not assume mutual exclusion. This approach ignores all the structure we do know about the data, for example that all of the COCO classes are mutually exclusive.
:::
:::success
我們可以使用multi-label model來結合資料集,而不需要假設類別之間是[互斥](http://terms.naer.edu.tw/detail/6927/)的。但這個方法忽略了我們對資料所瞭解的結構,舉例來說,COCO的類別都是[互斥](http://terms.naer.edu.tw/detail/6927/)的。
:::
:::info
**Hierarchical classification.** ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate \[12\]. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume a flat structure to the labels however for combining datasets, structure is exactly what we need.
:::
:::success
**Hierarchical classification.** ImageNet的標籤取自WordNet,一個用來組織概念(concepts)以及概念之間關係的語言資料庫\[12\]。在WordNet裡面,"Norfolk terrier"與"Yorkshire terrier"都是"terrier"的下義詞,而"terrier"是"hunting dog"的一種,"hunting dog"是"dog"的一種,"dog"又是一種"canine"(犬),依此類推。多數用於分類的方法都假設標籤是[扁平結構](http://terms.naer.edu.tw/detail/1824577/),然而對於結合資料集來說,這種階層結構正是我們所需要的。
:::
:::info
WordNet is structured as a directed graph, not a tree, because language is complex. For example a “dog” is both a type of “canine” and a type of “domestic animal” which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet.
:::
:::success
WordNet是一個[有向圖](http://terms.naer.edu.tw/detail/7271901/)的結構,而不是tree,因為語言是非常複雜的。舉例來說,"dog"同時是"canine"也是"domestic animal",這兩個在WordNet裡面都是synsets。我們透過從ImageNet中的concepts建立hierarchical tree([階層樹](http://terms.naer.edu.tw/detail/3275240/))來簡化問題,而不使用完整的圖結構。
:::
:::info
To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case “physical object”. Many synsets only have one path through the graph so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root and one path would add three edges to our tree and the other would only add one edge, we choose the shorter path.
:::
:::success
為了建立這個tree,我們檢查ImageNet裡的visual nouns,並沿著WordNet graph找出它們到root node的路徑(此例的root為"physical object")。許多synsets在graph中只有一條路徑,因此我們先把這些路徑全部加進我們的tree。接著,我們反覆檢查剩下的concepts,並加入使tree成長幅度盡可能小的路徑。因此,如果一個concept有兩條路徑可以到root,其中一條會讓tree多出三個edges,另一條只會多一個edge,我們就會選擇較短的那條。
:::
:::info
The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree we predict conditional probabilities at every node for the probability of each hyponym of that synset given that synset. For example, at the “terrier” node we predict:
$Pr$(Norfolk terrier|terrier)
$Pr$(Yorkshire terrier|terrier)
$Pr$(Bedlington terrier|terrier)
:::
:::success
最終得到的結果就是WordTree,一個visual concepts的hierarchical model。為了用WordTree來做分類,我們在每個節點上預測條件機率,也就是在給定該synset的情況下,該synset每一個下義詞(hyponym)的機率。舉例來說,在"terrier"節點上,我們預測:
$Pr$(Norfolk terrier|terrier)
$Pr$(Yorkshire terrier|terrier)
$Pr$(Bedlington terrier|terrier)
:::
:::info
If we want to compute the absolute probability for a particular node we simply follow the path through the tree to the root node and multiply to conditional probabilities. So if we want to know if a picture is of a Norfolk terrier we compute:
$Pr$(Norfolk terrier) = $Pr$(Norfolk terrier|terrier)
\* $Pr$(terrier|hunting dog)
\* . . .
\* $Pr$(mammal|animal)
\* $Pr$(animal|physical object)
:::
:::success
如果我們要計算某個特定節點的絕對機率,我們只需要延著tree到root node的路徑,然後乘上條件機率。因此,如果我們想瞭解,圖片是否為Norfolk terrier,那我們就計算:
$Pr$(Norfolk terrier) = $Pr$(Norfolk terrier|terrier)
\* $Pr$(terrier|hunting dog)
\* . . .
\* $Pr$(mammal|animal)
\* $Pr$(animal|physical object)
:::
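:::warning
個人補充(示意程式碼,非論文原始實作):
* 下面用一個小字典示意「沿著 WordTree 的路徑把條件機率相乘」得到某個節點的絕對機率;樹的內容與機率數值都是為了說明而假設的。

```python
# 每個節點記錄 (父節點, 在父節點下的條件機率);數值為假設
word_tree = {
    "physical object": (None, 1.0),
    "animal":          ("physical object", 0.9),
    "mammal":          ("animal", 0.8),
    "dog":             ("mammal", 0.7),
    "hunting dog":     ("dog", 0.6),
    "terrier":         ("hunting dog", 0.9),
    "Norfolk terrier": ("terrier", 0.5),
}

def absolute_prob(node, tree):
    """沿著路徑往 root 相乘條件機率:Pr(node) = Π Pr(child | parent)。"""
    prob = 1.0
    while node is not None:
        parent, cond = tree[node]
        prob *= cond
        node = parent
    return prob

print(absolute_prob("Norfolk terrier", word_tree))   # 0.5 * 0.9 * 0.6 * 0.7 * 0.8 * 0.9 * 1.0
```
:::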
:::info
For classification purposes we assume that the image contains an object: $Pr$(physical object) = 1.
:::
:::success
為了進行分類,我們假設影像中包含一個物件:$Pr$(physical object) = 1。
:::
:::info
To validate this approach we train the Darknet-19 model on WordTree built using the 1000 class ImageNet. To build WordTree1k we add in all of the intermediate nodes which expands the label space from 1000 to 1369. During training we propagate ground truth labels up the tree so that if an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog” and a “mammal”, etc. To compute the conditional probabilities our model predicts a vector of 1369 values and we compute the softmax over all synsets that are hyponyms of the same concept, see Figure 5.
:::
:::success
為了驗證這個方法,我們在以1000類ImageNet建立的WordTree上訓練Darknet-19模型。為了建置WordTree1k,我們加入所有的中間節點,這讓標籤空間從1000擴展至1369。訓練過程中,我們將真實標籤沿著tree向上傳遞,因此,如果一張影像被標記為"Norfolk terrier",它同時也會被標記為"dog"與"mammal",依此類推。為了計算條件機率,我們的模型預測一個1369個值的向量,然後我們對屬於同一個concept之下義詞(hyponyms)的所有synsets計算softmax,見Figure 5。
:::
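:::warning
個人補充(示意程式碼,非論文原始實作):
* WordTree 的分類輸出不是對全部 1369 個節點做一個大 softmax,而是對每一組「同父節點的下義詞(co-hyponyms)」各自做 softmax;下面以 PyTorch 示意,分組與索引只是假設的小例子。

```python
import torch
import torch.nn.functional as F

# 假設共 7 個節點的 logits,依父節點分成三組(索引為假設)
groups = {
    "root":   [0, 1],       # 例如 animal、artifact
    "animal": [2, 3, 4],    # 例如 mammal、bird、fish
    "mammal": [5, 6],       # 例如 dog、cat
}

logits = torch.randn(7)
probs = torch.empty_like(logits)
for idx in groups.values():
    probs[idx] = F.softmax(logits[idx], dim=0)   # 每一組各自做 softmax

print(probs)   # 每一組內的機率各自加總為 1
```
:::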
:::info

Figure 5: Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution. Using WordTree we perform multiple softmax operations over co-hyponyms.
Figure 5:在ImageNet與WordTree上的預測。大多數的ImageNet模型使用一個大的softmax來預測機率分佈。使用WordTree,我們則是對co-hyponyms(同父節點的下義詞)執行多個softmax運算。
:::
:::info
Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite adding 369 additional concepts and having our network predict a tree structure our accuracy only drops marginally. Performing classification in this manner also has some benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain what type of dog it is, it will still predict “dog” with high confidence but have lower confidences spread out among the hyponyms.
:::
:::success
使用與之前相同的訓練參數,我們的hierarchical Darknet-19達到71.9%的top-1準確率與90.4%的top-5準確率。儘管增加了369個額外的concepts,並且讓網路去預測一個樹狀結構,我們的準確率只有少量的下降。用這種方式做分類還有一些好處:在新的或未知的物件類別上,效能只會平緩地下降。舉例來說,假如網路看到一張狗的圖片,但不確定牠是哪一種狗,網路仍然會以高置信度預測"dog",只是在各個下義詞(hyponyms)之間散佈較低的置信度。
:::
:::info
This formulation also works for detection. Now, instead of assuming every image has an object, we use YOLOv2’s objectness predictor to give us the value of $Pr$(physical object). The detector predicts a bounding box and the tree of probabilities. We traverse the tree down, taking the highest confidence path at every split until we reach some threshold and we predict that object class.
:::
:::success
這個公式也可以用於檢測。現在,我們不再假設每一張影像都有一個物件,而是使用YOLOv2的objectness predictor來提供$Pr$(physical object)的值。檢測器預測一個邊界框以及一棵機率樹。我們向下[遍歷](http://terms.naer.edu.tw/detail/7361243/)這棵tree,在每一個分支處都走置信度最高的路徑,直到達到某個閥值,然後預測該物件類別。
:::
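:::warning
個人補充(示意程式碼,非論文原始實作):
* 推論時從 root 往下走,每個分支都選條件機率最高的子節點並把機率沿路相乘,直到低於某個閥值為止;下面的樹、機率與閥值都是假設,僅說明遍歷邏輯。

```python
# children[node] = [(子節點, 條件機率), ...];數值為假設
children = {
    "physical object": [("animal", 0.9), ("artifact", 0.1)],
    "animal":          [("mammal", 0.8), ("bird", 0.2)],
    "mammal":          [("dog", 0.7), ("cat", 0.3)],
    "dog":             [("terrier", 0.6), ("retriever", 0.4)],
}

def predict_class(children, objectness, threshold=0.5):
    """從 root 往下,每個分支走置信度最高的路徑,直到乘積低於門檻為止。"""
    node, prob = "physical object", objectness
    while node in children:
        child, cond = max(children[node], key=lambda c: c[1])
        if prob * cond < threshold:
            break                   # 再往下信心不足,停在目前節點
        node, prob = child, prob * cond
    return node, prob

print(predict_class(children, objectness=0.95))   # 大約會停在 ("mammal", 0.684)
```
:::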
:::info
**Dataset combination with WordTree.** We can use WordTree to combine multiple datasets together in a sensible fashion. We simply map the categories in the datasets to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse so we can use this technique with most datasets.
:::
:::success
**Dataset combination with WordTree.** 我們可以使用WordTree以合理的方式將多個資料集結合在一起。我們單純的將資料集裡面的類別映射到tree的synsets。Figure 6說明了使用WordTree結合ImageNet與COCO標記的範例。WordNet真的非常多樣化,也因此我們才可以在大多數資料集上使用這個技術。
:::
:::info

Figure 6: Combining datasets using WordTree hierarchy. Using the WordNet concept graph we build a hierarchical tree of visual concepts. Then we can merge datasets together by mapping the classes in the dataset to synsets in the tree. This is a simplified view of WordTree for illustration purposes.
Figure 6:使用WordTree hierarchy結合資料集。使用WordNet concept graph,我們建立visual concepts的hierarchical tree。然後我們就可以利用映射資料集內的類別到tree裡面的synsets將資料集合併在一起。這是一個用於說明目的的WordTree的簡化視圖。
:::
:::info
**Joint classification and detection.** Now that we can combine datasets using WordTree we can train our joint model on classification and detection. We want to train an extremely large scale detector so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset so we balance the dataset by oversampling COCO so that ImageNet is only larger by a factor of 4:1.
:::
:::success
**Joint classification and detection.** 現在我們可以用WordTree來合併資料集,也就可以在分類與檢測資料上訓練我們的聯合模型。我們希望訓練一個超大規模的檢測器,因此我們用COCO檢測資料集與完整ImageNet中的前9000個類別來建立合併後的資料集。我們還需要評估我們的方法,因此我們加入了ImageNet檢測挑戰賽中尚未包含的所有類別。這個資料集對應的WordTree有9418個類別。ImageNet是大得多的資料集,因此我們透過對COCO做oversampling來平衡資料集,使ImageNet與COCO的比例僅為4:1。
:::
:::info
Using this dataset we train YOLO9000. We use the base YOLOv2 architecture but only 3 priors instead of 5 to limit the output size. When our network sees a detection image we backpropagate loss as normal. For classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is “dog” we do not assign any error to predictions further down in the tree, “German Shepherd” versus “Golden Retriever”, because we do not have that information.
:::
:::success
使用這個資料集,我們訓練YOLO9000。我們使用基本的YOLOv2架構,但priors為3而不是5,以限制輸出的大小。當我們的網路看到檢測影像時,會一如往常地反向傳播loss。對於分類的loss,我們只在標籤對應的層級或其以上反向傳播loss。舉例來說,如果標籤是"dog",我們不會把任何誤差分配給tree中更下層的預測(像是"German Shepherd"對"Golden Retriever"),因為我們並沒有這些資訊。
:::
:::info
When it sees a classification image we only backpropagate classification loss. To do this we simply find the bounding box that predicts the highest probability for that class and we compute the loss on just its predicted tree. We also assume that the predicted box overlaps what would be the ground truth label by at least .3 IOU and we backpropagate objectness loss based on this assumption.
:::
:::success
當它看到的是分類影像時,我們只反向傳播分類的loss。做法是找到對該類別預測出最高機率的邊界框,然後只在它預測的tree上計算loss。我們還假設預測框與實際框(ground truth)的重疊至少有0.3的IOU,並基於這個假設來反向傳播objectness的loss。
:::
:::info
Using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and it learns to classify a wide variety of these objects using data from ImageNet.
:::
:::success
使用這種聯合訓練方式,YOLO9000利用COCO裡的檢測資料學會在影像中找出物件,並利用ImageNet的資料學會對各式各樣的這些物件做分類。
:::
:::info
When we analyze YOLO9000’s performance on ImageNet we see it learns new species of animals well but struggles with learning categories like clothing and equipment. New animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO does not have bounding box label for any type of clothing, only for person, so YOLO9000 struggles to model categories like “sunglasses” or “swimming trunks”.
:::
:::success
當我們分析YOLO9000在ImageNet上的效能時,我們看到它能很好地學會新的動物種類,但在學習像衣服、設備這類別時卻遇到困難。新的動物比較容易學習,是因為objectness的預測可以從COCO裡的動物很好地泛化過去。相反地,COCO並沒有任何衣服類型的邊界框標記,只有人(person)的標記,因此YOLO9000很難對"sunglasses"或"swimming trunks"這類別建模。
:::
:::info

Table 7: YOLO9000 Best and Worst Classes on ImageNet. The classes with the highest and lowest AP from the 156 weakly supervised classes. YOLO9000 learns good models for a variety of animals but struggles with new classes like clothing or equipment.
Table 7:YOLO9000在ImageNet上最佳與最糟的類別。156個弱監督類別中AP最高與最低的類別。YOLO9000對各種動物學得很好,但在像衣服或設備這類新類別上卻遇到困難。
:::
## 5. Conclusion
:::info
We introduce YOLOv2 and YOLO9000, real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can be run at a variety of image sizes to provide a smooth tradeoff between speed and accuracy.
:::
:::success
我們介紹了YOLOv2與YOLO9000,兩個即時檢測系統。YOLOv2是最先進的,而且在各種檢測資料集上比其它檢測系統更快。此外,它可以在各種影像大小下執行,在速度與準確度之間提供平滑的折衷。
:::
:::info
YOLO9000 is a real-time framework for detection more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification.
:::
:::success
YOLO9000是一個即時的框架,透過聯合最佳化檢測與分類,可以檢測超過9000種物件類別。我們使用WordTree來結合多種來源的資料,並用我們的聯合最佳化技術在ImageNet與COCO上同時訓練。YOLO9000是朝著縮小檢測與分類之間資料集規模差距邁出的重要一步。
:::
:::info
Many of our techniques generalize outside of object detection. Our WordTree representation of ImageNet offers a richer, more detailed output space for image classification. Dataset combination using hierarchical classification would be useful in the classification and segmentation domains. Training techniques like multi-scale training could provide benefit across a variety of visual tasks.
:::
:::success
我們的許多技術都可以推廣到物件檢測以外的領域。ImageNet的WordTree representation為影像分類提供了更豐富、更詳細的輸出空間。使用hierarchical classification來結合資料集,在分類與分割領域都會很有用。像multi-scale training這類的訓練技術,可以為各種視覺任務帶來好處。
:::
:::info
For future work we hope to use similar techniques for weakly supervised image segmentation. We also plan to improve our detection results using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labelled data. We will continue looking for ways to bring different sources and structures of data together to make stronger models of the visual world.
:::
:::success
對於未來的工作,我們希望將類似的技術用於弱監督式的影像分割。我們還計畫使用更強大的匹配策略(在訓練過程中為分類資料分配弱標籤)來改善檢測結果。電腦視覺很幸運地擁有大量的標記資料。我們會持續尋找方法,將不同來源、不同結構的資料結合在一起,以建立更強大的視覺世界模型。
:::