# You Only Look Once: Unified, Real-Time Object Detection(YOLO)(翻譯)
###### tags: `YOLO` `CNN` `論文翻譯` `deeplearning`
>[name=Shaoe.chen] [time=Thu, Mar 6, 2020]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1506.02640.pdf)
* [吳恩達老師_深度學習_卷積神經網路_第三週_目標偵測](https://hackmd.io/@shaoeChen/SJXmp66KG?type=view)
:::
## Abstract
:::info
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
:::
:::success
我們介紹YOLO,一種新的[目標檢測](http://terms.naer.edu.tw/detail/6329028/)方法。先前關於[目標檢測](http://terms.naer.edu.tw/detail/6329028/)的工作是重新利用分類器來執行檢測。相反的,我們將目標檢測塑造為對空間上分離的邊界框及其相關類別機率的迴歸問題。單一個神經網路直接從完整的影像,只做一次評估就同時預測邊界框與類別機率。因為整個檢測的管線~(pipeline)~是單一網路,因此可以直接對檢測效能做end-to-end的最佳化。
:::
:::info
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
:::
:::success
我們的整合架構非常的快。我們的基礎YOLO模型以每秒45 frames即時處理影像。另一個較小的網路版本,Fast YOLO,每秒處理來到驚人的155 frames,同時mAP仍然達到其它即時檢測器的兩倍。與目前最好的檢測系統相比,YOLO會有較多的定位誤差,但在背景上預測出[假陽性](http://terms.naer.edu.tw/detail/6305457/)的機會較小。最後,YOLO學習到非常一般化的物件表示~(representations)~。當從自然影像泛化到其它領域(如藝術作品)的時候,它優於其它的檢測方法(包含DPM與R-CNN)。
:::
## 1. Introduction
:::info
Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
:::
:::success
人類只要看一眼影像就會立即的知道有那一些物件在影像中,它們在那裡,以及它們如何互動。人類的視覺系統是快速而且準確的,允許我們去做一些複雜的任務,像是在幾乎沒有意識的情況下駕駛。快又準確的[目標檢測](http://terms.naer.edu.tw/detail/6329028/)演算法可以讓電腦在沒有特別的傳感器情況下駕駛車輛,讓輔助設備能夠傳送即時的場景信息給人類用戶,並釋放通用,反應靈敏的機器人系統的潛能。
:::
:::info
Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image \[10\].
:::
:::success
目前的檢測系統是重新利用分類器來執行檢測。為了檢測物件,這些系統採用該物件的分類器,然後在測試影像的不同位置與比例上進行評估。像是可變形組件模型(DPM)等系統,使用滑動視窗方法,其分類器在整張影像上的均勻間隔位置執行\[10\]。
:::
:::info
More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene \[13\]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
:::
:::success
像是R-CNN等最近提出使用region proposal(候選區域)的方法,首先在影像中生成可能的邊界框,然後在這些建議的框上執行分類器。分類之後,利用[後處理](http://terms.naer.edu.tw/detail/6662024/)來改進邊界框,消除重覆的檢測,然後依據場景中的其它物件重新評分這些框\[13\]。這些複雜的pipeline非常慢且難以最佳化,因為每個個別的組件都需要被分別的訓練。
:::
:::info
We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
:::
:::success
我們將[目標檢測](http://terms.naer.edu.tw/detail/6329028/)重新構造為一個迴歸問題,直接從影像像素到邊界框坐標與類別機率。使用我們的系統,你只需要看一次(YOLO)影像就可以預測有出現的物件以及它們在那。
:::
:::info
YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
:::
:::success
YOLO是如此的清新簡單:見圖1。一個單獨的卷積網路同時預測多個邊界框與這些邊界框的類別機率。YOLO訓練在完整的影像上,而且直接地最佳化檢測效能。相較於傳統的目標檢測的方法,這種整合的模型有多個好處。
:::
:::info

**Figure 1: The YOLO Detection System.** Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.
**Figure 1: The YOLO Detection System.** 使用YOLO處理影像非常簡單明瞭。我們的系統(1)將輸入影像大小調整為448x448,(2)在影像上執行單一卷積網路,然後(3)依據模型的置信度對得到的檢測結果做閾值篩選。
:::
:::info
First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
:::
:::success
首先,YOLO非常快。因為我們將檢測塑造為迴歸問題,因此不需要複雜的pipeline。只需要在測試的時候,在一張新的影像上執行我們的神經網路就可以預測檢測結果。我們的基礎網路在Titan X GPU上以每秒45 frames執行(沒有批次處理),而快速版本則可以超過150fps。這意味著我們可以以不到25毫秒的延遲即時處理串流影片。除此之外,YOLO的mean average precision是其它即時系統的兩倍以上。關於我們系統在webcam上即時執行的展示,可以參考我們的專案網頁:http://pjreddie.com/yolo/ 。
:::
:::info
Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method \[14\], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
:::
:::success
第二,YOLO在做預測的時候會對影像做全域的推理。不同於基於滑動視窗與region proposal~(候選區域)~的技術,YOLO在訓練與測試期間看的是整張影像,因此它隱式地編碼了關於類別及其外觀的上下文信息。Fast R-CNN,一個頂尖的檢測方法\[14\],因為看不到較大的上下文,會將影像中的背景區塊誤判為物件。與Fast R-CNN相比,YOLO所產生的背景錯誤數量不到一半。
:::
:::info
Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
:::
:::success
第三,YOLO學習物件的可泛化表示~(generalizable representations)~。當在自然影像上訓練且在藝術品上測試的時候,YOLO在很大程度上優於DPM與R-CNN等頂尖檢測方法。因為YOLO有高度的泛化性,因此在應用於新領域或未預期的輸入時比較不容易失效。
:::
:::info
YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.
:::
:::success
YOLO在準確度上仍然落後最先進的檢測系統。儘管它可以快速的辨識影像中的物件,但它很難精確的定位某些物件,特別是小的物件。我們在實驗中進一步的研究這些權衡。
:::
:::info
All of our training and testing code is open source. A variety of pretrained models are also available to download.
:::
:::success
我們所有的訓練與測試程式碼都是開源的。多種預訓練模型現在也可以下載。
:::
## 2. Unified Detection
:::info
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
:::
:::success
我們將目標檢測的各個組件整合為單一神經網路。我們的網路使用整張影像的特徵來預測每一個邊界框。它還會同時預測一張影像中所有類別的所有邊界框。這意味著我們的網路會對整張影像以及影像中的所有物件做全域的推理。YOLO的設計可以實現[端到端](http://terms.naer.edu.tw/detail/6598095/)的訓練以及即時的速度,同時維持高的平均精確度。
:::
:::info
Our system divides the input image into an $S × S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
:::
:::success
我們的系統會將輸入的影像劃分為$S \times S$的網格。如果物件的中心落入網格格子內,那該網格格子就負責檢測該物件。
:::
:::warning
個人見解:
* 這邊說明的是,假設你的$S$設置為19,那影像就會被分割為19x19=361個網格,再從這361個網格裡面去預測裡面是否存在物件,當然這後面還會再加入IoU。
:::
:::info
Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $\mathsf{Pr(Object)} ∗ \mathsf{IOU^{truth}_{pred}}$ . If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
:::
:::success
每一個網格格子都預測$B$個邊界框以及這些框的置信度分數。這些置信度分數反映出模型對該框包含物件的確信程度,以及它認為該框預測的準確程度。形式上,我們將置信度定義為$\mathsf{Pr(Object)} ∗ \mathsf{IOU^{truth}_{pred}}$。如果格子內沒有物件存在,那置信度分數就應該是零。否則,我們希望置信度分數等於預測框與實際框~(ground truth)~之間的intersection over union(IOU)。
:::
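:::warning
個人補充(示意程式,非論文內容):置信度被定義為$\mathsf{Pr(Object)} ∗ \mathsf{IOU^{truth}_{pred}}$,其中IOU需要由預測框與實際框計算。下面是一個計算IOU的簡化Python示意,框格式與函式名稱皆為假設,並非論文原始實作。
```python
def iou(box_a, box_b):
    """計算兩個框的intersection over union(IOU)。
    假設框格式為 (x_min, y_min, x_max, y_max)。"""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)  # 交集面積
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                                # 聯集面積
    return inter / union if union > 0 else 0.0

# 置信度 = Pr(Object) * IOU(預測框, 實際框)
confidence = 0.9 * iou((10, 10, 50, 50), (12, 14, 48, 52))
```
:::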
:::info
Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence. The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
:::
:::success
每一個邊界框包含五個預測:$x, y, w, h$以及置信度。$(x, y)$座標代表框的中心相對於網格格子邊界的位置。寬與高則是相對於整張影像來預測。最後,置信度的預測表示預測框與任一實際框之間的IOU。
:::
:::warning
個人見解:
* $x, y$是預測框的中心點,以該框所屬網格格子的邊界為基準表示(也就是相對於格子的偏移量),而不是格子本身的中心點。
:::
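:::warning
個人補充(示意程式,非論文內容):下面用Python簡單示意如何把一個以整張影像正規化(0~1)的實際框,編碼成YOLO的目標格式:找出中心點落入的格子,$(x, y)$為相對該格子的偏移,$(w, h)$則直接相對於整張影像。函式與變數名稱皆為假設。
```python
def encode_box(cx, cy, w, h, S=7):
    """cx, cy, w, h 皆已用影像寬高正規化到0~1。
    回傳負責的格子索引 (col, row) 與目標值 (x, y, w, h)。"""
    col = min(int(cx * S), S - 1)   # 中心點落入的行(x方向)
    row = min(int(cy * S), S - 1)   # 中心點落入的列(y方向)
    x = cx * S - col                # 相對於該格子左上角的偏移,0~1
    y = cy * S - row
    return (col, row), (x, y, w, h)

# 例:中心(0.52, 0.48)、寬0.3、高0.4,S=7
# 落在格子(3, 3),偏移約(0.64, 0.36),寬高維持相對整張影像
print(encode_box(0.52, 0.48, 0.3, 0.4))
```
:::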
:::info
Each grid cell also predicts $C$ conditional class probabilities, $\mathsf{Pr(Class_i|Object)}$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.
:::
:::success
每一個網格格子還預測$C$個條件類別機率,$\mathsf{Pr(Class_i|Object)}$。這些機率是以該網格格子包含物件為條件。無論框的數量$B$是多少,我們每個網格格子都只預測一組類別機率。
:::
:::info
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
$$
Pr(Class_i \vert Object) * Pr(Object) * IOU_{pred}^{truth} = Pr(Class_i) IOU_{pred}^{truth} \qquad (1)
$$
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
:::
:::success
在測試的時候,我們會將條件類別機率與各別框的置信度預測做相乘計算,
$$
Pr(Class_i \vert Object) * Pr(Object) * IOU_{pred}^{truth} = Pr(Class_i) IOU_{pred}^{truth} \qquad (1)
$$
然後得到每一個框的特定類別置信度分數。這些分數同時編碼了該類別出現在框中的機率,以及預測框與物件的擬合程度。
:::
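:::warning
個人補充(示意程式,非論文內容):式(1)在測試時把條件類別機率乘上框的置信度,得到每個框的特定類別置信度分數。以下為numpy的簡化示意,數值皆為假設。
```python
import numpy as np

box_confidence = 0.8                        # Pr(Object) * IOU,由網路輸出
class_probs = np.array([0.05, 0.70, 0.25])  # Pr(Class_i | Object),假設C=3

# 式(1):Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
class_specific_scores = class_probs * box_confidence
print(class_specific_scores)  # 每個類別在這個框上的置信度分數
```
:::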
:::info
For evaluating YOLO on PASCAL VOC, we use $S = 7, B = 2$. PASCAL VOC has 20 labelled classes so $C = 20$. Our final prediction is a $7 × 7 × 30$ tensor.
:::
:::success
為了在PASCAL VOC上評估YOLO,我們使用$S = 7, B = 2$。PASCAL VOC有20個標記類別,因此$C = 20$。我們最終預測的張量為$7 × 7 × 30$
:::
:::info

**Figure 2: The Model.** Our system models detection as a regression problem. It divides the image into an $S × S$ grid and for each grid cell predicts $B$ bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an $S × S × (B ∗ 5 + C)$ tensor.
**Figure 2: The Model.** 我們的系統將檢測建模為迴歸問題。它將影像劃分為$S \times S$個網格,每一個網格格子都預測$B$個邊界框、這些框的置信度,以及$C$個類別機率。這些預測被編碼為$S × S × (B ∗ 5 + C)$的張量。
:::
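:::warning
個人補充(示意程式,非論文內容):輸出張量的深度為$B ∗ 5 + C$,下面用幾行Python驗證PASCAL VOC的設定($S=7, B=2, C=20$)確實得到$7 × 7 × 30$。
```python
S, B, C = 7, 2, 20          # PASCAL VOC的設定
depth = B * 5 + C           # 每個格子:B個框 × (x, y, w, h, confidence) + C個類別機率
print((S, S, depth))        # (7, 7, 30),即最終的7x7x30輸出張量
```
:::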
### 2.1. Network Design
:::info
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset \[9\]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates
:::
:::success
我們以卷積神經網路來實做這個模型,並且以PASCAL VOC檢測資料集\[9\]來評估模型。網路的初始卷積層從影像中提取特徵,而全連接層則是預測輸出機率與座標。
:::
:::info
Our network architecture is inspired by the GoogLeNet model for image classification \[34\]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al \[22\]. The full network is shown in Figure 3.
:::
:::success
我們的網路架構受到用於影像分類的GoogLeNet模型所啟發\[34\]。我們的網路有24層卷積層,後面接2個全連接層。不同於GoogLeNet所使用的inception模組,我們單純地使用1x1的reduction layers,後接3x3的卷積層,類似於Lin et al.\[22\]。完整的網路架構如圖3所示。
:::
:::info

**Figure 3: The Architecture.** Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
**Figure 3: The Architecture.** 我們的檢測網路有24層卷積層,接續的是2個全連接層。交替的1x1卷積層降低了來自上一層的特徵空間。我們用ImageNet分類任務的影像以一半的解析度(224x224)對卷積層做了預訓練,然後以兩倍的解析度做偵測。
:::
:::warning
個人見解:
* 從模型圖可以看出,其輸入維度為448x448;也就是先以224x224的影像做預訓練,偵測時再將輸入解析度提高到448x448。
:::
:::info
We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
:::
:::success
我們還訓練一個快速版本的YOLO,旨在突破快速目標檢測的界限。Fast YOLO使用卷積層較少的神經網路(9層而非24層),而且這些層中的濾波器也較少。除了網路的大小之外,所有的訓練與測試參數都與標準的YOLO一樣。
:::
:::info
The final output of our network is the 7 × 7 × 30 tensor of predictions.
:::
:::success
網路最終的輸出預測為7x7x30的張量(tensor)。
:::
### 2.2. Training
:::info
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset \[30\]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo \[24\]. We use the Darknet framework for all training and inference \[26\].
:::
:::success
我們用ImageNet 1000-class競賽資料集\[30\]對卷積層做預訓練。預訓練的部份,我們使用圖3的前20個卷積層,後接一個平均池化層~(average-pooling layer)~與一個全連接層。我們訓練這個網路大約一個禮拜,並且在ImageNet 2012驗證集上達到single crop top-5準確度88%,與Caffe's Model Zoo \[24\]內的GoogLeNet模型效能相當。所有的訓練與推理我們都使用Darknet框架\[26\]。
:::
:::info
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance \[29\]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
:::
:::success
然後我們將模型轉為執行檢測。Ren et al.說明,同時加入卷積層與全連接層到預訓練的網路可以提高效能\[29\]。按他們的範例,我們以隨機初始化權重的方式增加四層卷積層與兩層全連接層。檢測通常需要[細粒度](http://terms.naer.edu.tw/detail/2698496/)的視覺信息,因此我們將網路的輸入解析度從224x224提高到448x448。
:::
:::info
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
:::
:::success
模型的最後一層同時預測類別機率與邊界框的座標。我們利用影像的寬與高來正規化邊界框的寬與高,因此它們的值都落於0與1之間。我們將邊界框的$x$、$y$座標參數化為相對於特定網格格子位置的偏移量,因此它們也被限定在0與1之間。
:::
:::info
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
$$
\phi(x) =
\begin{cases}
x, & \text{if } x > 0\\
0.1x, & \text{otherwise}
\end{cases}
\qquad (2)
$$
:::
:::success
我們在最後一層使用線性啟動函數,而其它層則是使用leaky rectified linear activation(Leaky ReLU):
$$
\phi(x) =
\begin{cases}
x, & \text{if } x > 0\\
0.1x, & \text{otherwise}
\end{cases}
\qquad (2)
$$
:::
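:::warning
個人補充(示意程式,非論文內容):式(2)的leaky rectified linear activation可以用numpy簡單實作如下,僅為示意。
```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """x > 0 時輸出 x,否則輸出 0.1x(式(2))。"""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.5])))  # [-0.2  0.5]
```
:::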
:::info
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
:::
:::success
我們對模型輸出的[平方誤差總和](http://terms.naer.edu.tw/detail/958246/)~(sum-squared error)~做最佳化。使用[平方誤差總和](http://terms.naer.edu.tw/detail/958246/)是因為它很容易最佳化,但是它跟我們要最大化平均精確度的目標並不完全一致。它對定位誤差與分類誤差的加權是相等的,這也許並不理想。還有,在每張影像中,許多的網格格子並不包含任何的物件。這會將這些格子的置信度分數推向零,其梯度常常壓過那些確實包含物件的格子所產生的梯度。這會導致模型的不穩定,使訓練在初期就發散。
:::
:::info
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$ to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = .5$.
:::
:::success
為了補救這個問題,我們增加邊界框座標預測的loss,並降低不包含物件的框的置信度預測的loss。我們使用兩個參數$\lambda_{coord}$與$\lambda_{noobj}$來完成這件事。我們設置$\lambda_{coord} = 5$、$\lambda_{noobj} = .5$。
:::
:::info
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
:::
:::success
[平方誤差總和](http://terms.naer.edu.tw/detail/958246/)也對大型框與小型框中的誤差給予相同的權重。我們的誤差[度量](http://terms.naer.edu.tw/detail/2119672/)應該反映出:大型框中的小[偏差](http://terms.naer.edu.tw/detail/3635151/)的影響比小型框中的小偏差來得小。為了部份解決這個問題,我們預測邊界框寬與高的[平方根](http://terms.naer.edu.tw/detail/1219214/),而不是直接預測寬與高。
:::
:::info
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
:::
:::success
YOLO為每個網格格子預測多個邊界框。在訓練時,我們只希望由一個邊界框[預測器](http://terms.naer.edu.tw/detail/2122170/)負責每一個物件。我們根據哪一個預測與實際框有最高的當前IOU,指定該[預測器](http://terms.naer.edu.tw/detail/2122170/)「負責」預測這個物件。這導致了邊界框預測器之間的[特定化](http://terms.naer.edu.tw/detail/2125035/)。每一個[預測器](http://terms.naer.edu.tw/detail/2122170/)都能更好地預測某些大小、長寬比或類別的物件,從而改善整體的召回率。
:::
:::info
During training we optimize the following, multi-part loss function:
$$\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{noobj}\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_i^{obj} \sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 \qquad (3)
\end{aligned}$$
where $\mathbb{1}^{obj}_i$ denotes if object appears in cell $i$ and $\mathbb{1}^{obj}_{ij}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.
:::
:::success
訓練期間,我們最佳化下面multi-part loss function:
$$\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{noobj}\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_i^{obj} \sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 \qquad (3)
\end{aligned}$$
其中$\mathbb{1}^{obj}_i$表示物件是否出現在格子$i$中,而$\mathbb{1}^{obj}_{ij}$表示格子$i$中的第$j$個邊界框[預測器](http://terms.naer.edu.tw/detail/2122170/)「負責」該預測。
:::
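:::warning
個人補充(示意程式,非論文內容):下面用numpy示意式(3)中與框相關的前四項(座標、寬高平方根、有物件與無物件的置信度),最後的類別機率項省略;張量形狀、mask的表示方式以及對$\mathbb{1}^{noobj}_{ij}$的處理都是假設的簡化,並非原始Darknet實作。
```python
import numpy as np

def yolo_box_loss(pred, target, resp_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """pred, target: (S, S, B, 5),每個框為 (x, y, w, h, confidence)。
    resp_mask: (S, S, B),「負責」某物件(IOU最高)的框為1,其餘為0。
    這裡假設不負責的框一律視為noobj,並省略式(3)最後的類別項。"""
    noobj_mask = 1.0 - resp_mask

    # 座標誤差:只對「負責」的框計算
    xy_err = np.sum(resp_mask * np.sum((pred[..., 0:2] - target[..., 0:2]) ** 2, axis=-1))
    # 寬高誤差:取平方根,降低大框誤差的權重
    wh_err = np.sum(resp_mask * np.sum(
        (np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2, axis=-1))
    # 置信度誤差:有物件與無物件分開加權
    conf_obj = np.sum(resp_mask * (pred[..., 4] - target[..., 4]) ** 2)
    conf_noobj = np.sum(noobj_mask * (pred[..., 4] - target[..., 4]) ** 2)

    return (lambda_coord * (xy_err + wh_err)
            + conf_obj
            + lambda_noobj * conf_noobj)
```
:::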
:::info
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
:::
:::success
注意,只有當物件存在於該網格格子中時,損失函數才會懲罰分類錯誤(因此才有前面討論的條件類別機率)。同樣地,只有當該[預測器](http://terms.naer.edu.tw/detail/2122170/)「負責」真實框時(也就是它在該網格格子的所有預測器中具有最高的IOU),才會懲罰其邊界框座標誤差。
:::
:::info
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
:::
:::success
我們使用來自PASCAL VOC 2007與2012的訓練與驗證資料集訓練網路大約135個epochs。在以VOC 2012測試的時候,我們還將VOC 2007的測試資料納入訓練。整個訓練過程中,我們的batch size為64,momentum為0.9,decay為0.0005。
:::
:::info
Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{−3}$ to $10^{−2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{−2}$ for 75 epochs, then $10^{−3}$ for 30 epochs, and finally $10^{−4}$ for 30 epochs.
:::
:::success
我們的learning rate schedule如下:在最初的幾個epochs,我們慢慢地將learning rate從$10^{−3}$提高到$10^{−2}$。如果一開始就用較高的learning rate,我們的模型常常會因為不穩定的梯度而發散。接著我們以$10^{−2}$訓練75個epochs,然後以$10^{−3}$訓練30個epochs,最後以$10^{−4}$訓練30個epochs。
:::
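:::warning
個人補充(示意程式,非論文內容):論文描述的learning rate schedule可以寫成下面這個簡單的函式。暖身(由$10^{-3}$升到$10^{-2}$)的epoch數論文只寫「最初幾個epochs」,這裡假設為5,僅為示意。
```python
def learning_rate(epoch, warmup_epochs=5):
    """依論文描述回傳第epoch個epoch所用的learning rate(warmup長度為假設值)。"""
    if epoch < warmup_epochs:                          # 緩慢地由1e-3升到1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    elif epoch < warmup_epochs + 75:                   # 以1e-2訓練75個epochs
        return 1e-2
    elif epoch < warmup_epochs + 105:                  # 以1e-3訓練30個epochs
        return 1e-3
    else:                                              # 最後以1e-4訓練30個epochs
        return 1e-4
```
:::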
:::info
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers \[18\]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
:::
:::success
為了避免過擬合,我們使用dropout以及大量的資料增強。在第一個全連接層之後接上一個dropout layer(rate = 0.5),以防止層與層之間的[共同適應](http://terms.naer.edu.tw/detail/514540/)\[18\]。資料增強的部份,我們引入最多為原始影像大小20%的隨機縮放與[平移](http://terms.naer.edu.tw/detail/955614/)。我們還在HSV色彩空間中隨機調整影像的曝光度與飽和度,最多達1.5倍。
:::
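:::warning
個人補充(示意程式,非論文內容):下面示意如何隨機抽出論文所述的資料增強參數:縮放與平移最多為原始影像大小的20%,HSV空間中的曝光與飽和度最多調整1.5倍。實際套用到影像的步驟省略,參數抽樣方式為假設。
```python
import random

def sample_augmentation(img_w, img_h, jitter=0.2, hsv_factor=1.5):
    """回傳一組隨機的資料增強參數(平移量、縮放比例、曝光與飽和度倍率)。"""
    dx = random.uniform(-jitter, jitter) * img_w       # 水平平移,最多20%影像寬
    dy = random.uniform(-jitter, jitter) * img_h       # 垂直平移,最多20%影像高
    scale = random.uniform(1 - jitter, 1 + jitter)     # 隨機縮放,±20%
    exposure = random.uniform(1.0 / hsv_factor, hsv_factor)    # 曝光倍率
    saturation = random.uniform(1.0 / hsv_factor, hsv_factor)  # 飽和度倍率
    return dx, dy, scale, exposure, saturation
```
:::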
### 2.3. Inference
:::info
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
:::
:::success
就像在訓練一樣,預測測試影像的檢測只需要一次的網路評估。在PASCAL VOC上面,網路每一張影像預測98個邊界框與每一個框的類別機率。不同於基於分類器的方法,YOLO只需要一次網路評估,所以它測試的時候真的非常的快。
:::
:::info
The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
:::
:::success
網格的設計在邊界框預測中強制加入了空間上的多樣性。通常,一個物件落入哪一個網格格子是很清楚的,而且網路對每一個物件只會預測一個框。然而,一些大型物件或靠近多個格子邊界的物件,可能會被多個格子同時定位。Non-maximal suppression可以用來處理這些重覆檢測。雖然不像在R-CNN或DPM中那樣對效能至關重要,但non-maximal suppression仍讓mAP增加了2-3%。
:::
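:::warning
個人補充(示意程式,非論文內容):non-maximal suppression的基本流程可以簡化成下面的Python示意:按分數由高到低保留框,凡與已保留框的IOU超過門檻者就捨棄。需要用到前面補充的`iou()`函式;門檻值0.5為假設。
```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: [(x_min, y_min, x_max, y_max), ...];scores: 對應的置信度分數。
    回傳保留下來的框的索引。"""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # 只有與所有已保留的框重疊都不高時才保留
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```
:::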
### 2.4. Limitations of YOLO
:::info
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
:::
:::success
YOLO對邊界框的預測施加了很強的空間限制,因為每一個網格格子只預測兩個框,而且只能有一個類別。這種空間限制限縮了我們模型所能預測的鄰近物件數量。我們的模型難以處理成群出現的小物件,像是鳥群。
:::
:::info
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
:::
:::success
因為我們的模型是從資料中學習預測邊界框,因此難以泛化到長寬比或構型較新穎、不常見的物件。我們的模型也使用相對粗糙的特徵來預測邊界框,因為我們的架構從輸入影像開始有多層的downsampling。
:::
:::info
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.
:::
:::success
最後,雖然我們訓練時用的是近似檢測效能的loss function,但這個loss function對小邊界框與大邊界框的誤差是一視同仁的。大框中的小誤差通常是[良性的](http://terms.naer.edu.tw/detail/719683/),但小框中的小誤差對IOU的影響就大得多。我們錯誤的主要來源是不正確的定位。
:::
## 3. Comparison to Other Detection Systems
:::info
Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar \[25\], SIFT \[23\], HOG \[4\], convolutional features \[6\]). Then, classifiers \[36, 21, 13, 10\] or localizers \[1, 32\] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image \[35, 15, 39\]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
:::
:::success
[目標檢測](http://terms.naer.edu.tw/detail/6329028/)在電腦視覺中是一個核心的問題。檢測管線~(pipeline)~通常起始於從輸入影像提取一組穩健的特徵(Haar \[25\]、SIFT \[23\]、HOG \[4\]、卷積特徵 \[6\])。然後使用分類器\[36, 21, 13, 10\]或定位器\[1, 32\]在特徵空間中辨識物件。這些分類器或定位器以滑動視窗的方式在整張影像或影像中的部份區域子集上執行\[35, 15, 39\]。我們將YOLO檢測系統與多個頂尖的檢測框架做比較,並突顯出主要的相似與差異之處。
:::
:::info
**Deformable parts models.** Deformable parts models (DPM) use a sliding window approach to object detection \[10\]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.
:::
:::success
**Deformable parts models.** Deformable parts models (DPM)使用滑動視窗的方法來執行[目標檢測](http://terms.naer.edu.tw/detail/6329028/)\[10\]。DPM使用一個[不相交](http://terms.naer.edu.tw/detail/2114921/)的管線~(pipeline)~來提取靜態特徵、分類區域、為高分區域預測邊界框等。我們的系統以單一個卷積神經網路取代所有這些互不相關的部份。這個網路同時執行特徵提取、邊界框預測、非極大值抑制~(non-maximal suppression)~以及上下文推理。網路訓練的不是靜態特徵,而是in-line地訓練特徵並針對檢測任務將其最佳化。我們的整合架構帶來比DPM更快、更準確的模型。
:::
:::warning
個人見解:
* in-line還是on-line?
:::
:::info
**R-CNN.** R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search \[35\] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].
:::
:::success
**R-CNN.** R-CNN及其變體使用region proposals~(候選區域)~取代sliding windows~(滑動視窗)~來尋找影像中的物件。Selective Search \[35\]生成潛在的邊界框,卷積網路提取特徵,SVM對框評分,線性模型調整邊界框,最後非極大值抑制消除重覆的檢測。這個複雜pipeline~(管線)~的每一個階段都必須獨立地精確調校,因此得到的系統非常慢,在測試時每張影像要花40秒以上\[14\]。
:::
:::info
YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.
:::
:::success
YOLO與R-CNN有一些相似之處。每一個網格格子都會提出潛在的邊界框,然後使用卷積特徵對這些邊界框評分。然而,我們的系統對網格格子的proposals施加了空間上的限制,這有助於減緩同一物件被重覆檢測的問題。我們的系統提出的邊界框也少得多,每張影像只有98個,相較於Selective Search的約2000個。最後,我們的系統將這些各別的組件結合為單一個、可共同最佳化的模型。
:::
:::info
**Other Fast Detectors** Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search \[14\] \[28\]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.
:::
:::success
**Other Fast Detectors.** Fast與Faster R-CNN著重於透過共享計算,以及使用神經網路(而非Selective Search)來提出候選區域,以加速R-CNN框架\[14\] \[28\]。儘管它們相較於R-CNN在速度與準確度上有所改善,但兩者仍然達不到即時效能。
:::
:::info
Many research efforts focus on speeding up the DPM pipeline \[31\] \[38\] \[5\]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.
:::
:::success
許多研究工作集中在加速DPM的pipeline\[31\] \[38\] \[5\]。他們加速HOG的計算、使用[級聯](http://terms.naer.edu.tw/detail/6595667/),並將計算移到GPU上。然而,實際上只有30Hz的DPM\[31\]能真正即時執行。
:::
:::info
Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.
:::
:::success
YOLO並沒有試著去最佳化大型檢測管線~(pipeline)~的各個組件,而是完全拋棄了這種pipeline,其設計本身就是快速的。
:::
:::info
Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation \[37\]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.
:::
:::success
對人臉或人的單一類別檢測器可以被高度最佳化,因為它們只需要處理較少的變化\[37\]。YOLO 是一個以一般用途為目標的檢測器,可以學習同時檢測多個物件。
:::
:::info
**Deep MultiBox.** Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest \[8\] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.
:::
:::success
**Deep MultiBox.** 不同於R-CNN,Szegedy et al.訓練一個卷積神經網路來預測感興趣的區域\[8\],而不是使用Selective Search。透過以單一類別預測取代置信度預測,MultiBox也可以執行單一[目標檢測](http://terms.naer.edu.tw/detail/6329028/)。然而,MultiBox無法執行一般的[目標檢測](http://terms.naer.edu.tw/detail/6329028/),而且它仍然只是大型檢測pipeline中的一部份,需要進一步對image patch做分類。YOLO與MultiBox都使用卷積網路來預測影像中的邊界框,但YOLO是一個完整的檢測系統。
:::
:::info
**OverFeat.** Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection \[32\]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.
:::
:::success
**OverFeat.** Sermanet et al.訓練一個卷積神經網路來執行定位,並調整這個定位器來執行檢測\[32\]。OverFeat能高效地執行滑動視窗~(sliding window)~檢測,但它仍然是一個不相交的系統。OverFeat是針對定位做最佳化,而不是檢測效能。像DPM一樣,定位器在做預測的時候只能看到局部的信息。OverFeat無法對全域上下文做推理,因此需要大量的[後處理](http://terms.naer.edu.tw/detail/253446/)才能產生前後一致的檢測結果。
:::
:::info
**MultiGrasp.** Our work is similar in design to work on grasp detection by Redmon et al \[27\]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn’t have to estimate the size, location, or boundaries of the object or predict it’s class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.
:::
:::success
**MultiGrasp.** 我們的工作在設計上類似於Redmon et al.\[27\]的grasp detection~(抓取檢測)~。我們用於邊界框預測的網格方法,是基於MultiGrasp系統對抓取位置做迴歸的做法。然而,grasp detection比起[目標檢測](http://terms.naer.edu.tw/detail/6329028/)要簡單得多。MultiGrasp只需要為包含單一物件的影像預測一個可抓取的區域。它不需要估計物件的大小、位置或邊界,也不需要預測其類別,只需要找到適合抓取的區域即可。而YOLO則同時預測影像中多個類別的多個物件的邊界框與類別機率。
:::
## 4. Experiments
:::info
First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN \[14\]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
:::
:::success
首先我們在PASCAL VOC 2007上比較YOLO與其它即時檢測系統。為了瞭解YOLO與R-CNN變體之間的差異,我們探討YOLO與Fast R-CNN(R-CNN效能最高的版本之一)在VOC 2007上所犯的錯誤\[14\]。基於兩者不同的錯誤特性,我們證明YOLO可以用來對Fast R-CNN的檢測重新評分,並減少來自背景[假陽性](http://terms.naer.edu.tw/detail/6305457/)的錯誤,從而明顯提升效能。我們也展示VOC 2012的結果,並與目前最先進的方法比較mAP。最後,我們在兩個藝術品資料集上證明,YOLO比其它檢測器更能泛化到新的領域。
:::
:::info

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.
:::
### 4.1. Comparison to Other Real-Time Systems
:::info
Many research efforts in object detection focus on making standard detection pipelines fast \[5\] \[38\] \[31\] \[14\] \[17\] \[28\]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) \[31\]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.
:::
:::success
許多[目標檢測](http://terms.naer.edu.tw/detail/6329028/)的研究工作都聚焦在讓標準的檢測pipeline變快\[5\] \[38\] \[31\] \[14\] \[17\] \[28\]。然而,只有Sadeghi et al.真正做出了能即時執行的檢測系統(每秒30 frames或更快)\[31\]。我們比較了YOLO與他們以GPU實作、以30Hz或100Hz執行的DPM。雖然其它的工作還沒有達到即時的里程碑,我們仍然比較了它們相對的mAP與速度,以檢視目標檢測系統中準確度與效能之間的權衡。
:::
:::info
Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.
:::
:::success
Fast YOLO是PASCAL上最快的[目標檢測](http://terms.naer.edu.tw/detail/6329028/)方法;就我們所知,它是現存最快的目標檢測器。擁有52.7%的mAP,它的準確度是過往即時檢測工作的兩倍以上。YOLO更將mAP推進到63.4%,同時仍保持即時效能。
:::
:::info
We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.
:::
:::success
我們還使用VGG-16訓練YOLO。這個模型比YOLO更準確,但也明顯更慢。它對於與其它依賴VGG-16的檢測系統做比較很有幫助,但因為它比即時還慢,論文其餘部份便聚焦在我們較快的模型上。
:::
:::info
Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 \[38\]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.
:::
:::success
Fastest DPM在不犧牲太多mAP的情況下有效地加速了DPM,但其速度仍與即時效能相差2倍\[38\]。與神經網路方法相比,DPM的檢測準確度相對較低,這也是它的限制。
:::
:::info
R-CNN minus R replaces Selective Search with static bounding box proposals \[20\]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.
:::
:::success
R-CNN minus R以靜態的bounding box proposals取代Selective Search\[20\]。儘管它比R-CNN快得多,但仍然達不到即時,而且因為沒有好的proposals,準確度受到很大的影響。
:::
:::info
Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from realtime.
:::
:::success
Fast R-CNN加速了R-CNN的分類階段,但它仍然依賴selective search,每張影像大約需要2秒來生成候選的邊界框。因此它有很高的mAP,但以0.5 fps的速度而言,離即時仍有很大的距離。
:::
:::info
The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. \[8\]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The ZeilerFergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
:::
:::success
最近的Faster R-CNN以神經網路取代selective search來提出候選的邊界框,這類似於Szegedy et al. \[8\]。在我們的測試中,他們最準確的模型可以達到7fps,而較小、較不準確的版本則以18fps執行。Faster R-CNN的VGG-16版本比YOLO高了10 mAP,但也慢了6倍。Zeiler-Fergus版本的Faster R-CNN只比YOLO慢2.5倍,但準確度也較低。
:::
### 4.2. VOC 2007 Error Analysis
:::info
To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast RCNN since Fast R-CNN is one of the highest performing detectors on PASCAL and it’s detections are publicly available.
:::
:::success
為了進一步檢查YOLO與目前最佳的檢測器之間的差異,我們查看了VOC 2007上的詳細結果。我們比較了YOLO與Fast R-CNN,因為Fast R-CNN是PASCAL上效能最高的檢測器之一,而且它的檢測是公開的。
:::
:::info
We use the methodology and tools of Hoiem et al. \[19\] For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:
* Correct: correct class and IOU > .5
* Localization: correct class, .1 < IOU < .5
* Similar: class is similar, IOU > .1
* Other: class is wrong, IOU > .1
* Background: IOU < .1 for any object
:::
:::success
我們使用Hoiem et al. \[19\]的[方法論](http://terms.naer.edu.tw/detail/3264334/)與工具。對於測試時的每個類別,我們查看了該類別的top-N的預測。每一個預測不是正確就是依著錯誤的類型做分類:
* Correct: correct class and IOU > .5
* Localization: correct class, .1 < IOU < .5
* Similar: class is similar, IOU > .1
* Other: class is wrong, IOU > .1
* Background: IOU < .1 for any object
:::
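:::warning
個人補充(示意程式,非論文內容):上面的分類規則可以直接寫成下面的函式;`similar_classes`(與正確類別「相似」的類別集合)是假設的輸入,實際的相似類別定義依Hoiem et al.的工具而定。
```python
def error_type(pred_class, true_class, iou_value, similar_classes=()):
    """依上列規則將一個top N預測歸類。"""
    if pred_class == true_class and iou_value > 0.5:
        return "Correct"
    if pred_class == true_class and 0.1 < iou_value < 0.5:
        return "Localization"
    if pred_class in similar_classes and iou_value > 0.1:
        return "Similar"
    if iou_value > 0.1:
        return "Other"       # 類別錯誤
    return "Background"      # 與任何物件的IOU < .1
```
:::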
:::info
Figure 4 shows the breakdown of each error type averaged across all 20 classes.
:::
:::success
圖4顯示在全部20個類別上平均後,各錯誤類型所佔的比例。
:::
:::info

Figure 4: Error Analysis: **Fast R-CNN vs. YOLO** These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).
Figure 4:錯誤分析:**Fast R-CNN vs. YOLO** 這些圖表顯示各種類別中top N檢測的定位錯誤與背景錯誤百分比(N = 該類別中的物件數量)。
:::
:::info
YOLO struggles to localize objects correctly. Localization errors account for more of YOLO’s errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of it’s top detections are false positives that don’t contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.
:::
:::success
YOLO難以正確地定位物件。定位錯誤佔YOLO錯誤的比例,比其它所有來源加起來還要多。Fast R-CNN的定位錯誤少得多,但背景錯誤卻多得多。其top檢測中有13.6%是不包含任何物件的[假陽性](http://terms.naer.edu.tw/detail/6305457/)。Fast R-CNN預測出背景檢測的可能性幾乎是YOLO的3倍。
:::
### 4.3. Combining Fast R-CNN and YOLO
:::info
YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
:::
:::success
YOLO所犯的背景錯誤遠比Fast R-CNN少。透過使用YOLO來消除Fast R-CNN的背景檢測,我們得到了明顯的效能提升。對於R-CNN預測的每一個邊界框,我們都會檢查YOLO是否也預測了類似的框。如果有,我們就根據YOLO預測的機率以及兩個框之間的重疊程度,來提高該預測的分數。
:::
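:::warning
個人補充(示意程式,非論文內容):論文只說「若YOLO也預測了類似的框,就依YOLO預測的機率與兩框的重疊程度提高該預測的分數」,並未給出具體公式,下面的加權方式純屬假設,僅用來說明流程。需要前面補充的`iou()`函式。
```python
def rescore_with_yolo(rcnn_boxes, rcnn_scores, yolo_boxes, yolo_scores,
                      iou_threshold=0.5):
    """對每個Fast R-CNN的框,若有重疊夠高的YOLO框,就加上一個假設性的加分。"""
    boosted = []
    for box, score in zip(rcnn_boxes, rcnn_scores):
        bonus = 0.0
        for ybox, yscore in zip(yolo_boxes, yolo_scores):
            overlap = iou(box, ybox)
            if overlap > iou_threshold:
                bonus = max(bonus, yscore * overlap)   # 假設的加權方式
        boosted.append(score + bonus)                  # 假設:直接加到原分數上
    return boosted
```
:::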
:::info
The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.
:::
:::success
最佳的Fast R-CNN模型在VOC 2007測試集上達到71.8%的mAP。在結合YOLO之後,其mAP增加了3.2%,來到75.0%。我們還試著將最佳的Fast R-CNN模型與其它多個版本的Fast R-CNN結合。這些ensembles只讓mAP小幅增加0.3%到0.6%,細節請見Table 2。
:::
:::info

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.
Table 2:在VOC 2007上的模型結合實驗。我們研究Fast R-CNN最好的版本與各種模型結合在一起的效果。其它版本的Fast R-CNN只能讓模型效能提高一點點,而YOLO可以明顯的提高效能。
:::
:::info
The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.
:::
:::success
來自YOLO的提升不僅僅是model ensembling的副產物,因為與其它不同版本的Fast R-CNN的結合都只帶來一點點的提升。更確切的說,就是因為YOLO在測試時所犯的不同錯誤,所以對於提高Fast R-CNN的效能特別有效。
:::
:::info
Unfortunately, this combination doesn’t benefit from the speed of YOLO since we run each model seperately and then combine the results. However, since YOLO is so fast it doesn’t add any significant computational time compared to Fast R-CNN.
:::
:::success
很不幸的,這樣的結合並沒有辦法從YOLO的速度得到好處,因為我們的作法是分別執行每個模型,然後組合結果。但是,因為YOLO是這麼快的,與Fast R-CNN相比,它並不會增加任何明顯的計算時間。
:::
### 4.4. VOC 2012 Results
:::info
On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.
:::
:::success
在VOC 2012測試集上,YOLO的mAP分數為57.9%。這比目前最佳技術還要低,更接近使用VGG-16的原始R-CNN,見Table 3說明。與最接近的競爭者相比,我們的系統在小物件上面臨挑戰。在一些類別,像是瓶子、綿羊以及電視/螢幕上,YOLO的得分比R-CNN或Feature Edit還要低8-10%。然而,在其它類別,像是貓與火車,YOLO可以得到較高的效能。
:::
:::info

**Table 3: PASCAL VOC 2012 Leaderboard.** YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the forth highest scoring method, with a 2.3% boost over Fast R-CNN.
**Table 3: PASCAL VOC 2012 Leaderboard.** 截至2015年11月6日,YOLO與完整的comp4(允許使用外部資料)公開排行榜的比較。圖中顯示各種檢測方法的mean average precision與per-class average precision。YOLO是唯一的即時檢測器。Fast R-CNN + YOLO是得分第四高的方法,比Fast R-CNN高了2.3%。
:::
:::info
Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.
:::
:::success
結合Fast R-CNN與YOLO的模型是最高效能的檢測方法之一。Fast R-CNN在結合YOLO之後得到了2.3%的改善,在公開排行榜上提高了五個名次。
:::
### 4.5. Generalizability: Person Detection in Artwork
:::info
Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before \[3\]. We compare YOLO to other detection systems on the Picasso Dataset \[12\] and the People-Art Dataset \[3\], two datasets for testing person detection on artwork.
:::
:::success
用於目標檢測的學術資料集,其訓練與測試資料來自相同的分佈。在真實世界的應用中,很難預測所有可能的使用情境,而且測試資料可能與系統先前看過的資料不同\[3\]。我們在Picasso Dataset\[12\]與People-Art Dataset\[3\]上比較YOLO與其它的檢測系統,這兩個資料集用於測試藝術品上的人物檢測。
:::
:::info
Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.
:::
:::success
圖5顯示YOLO與其它檢測方法之間的效能比較。作為參考,我們提供了VOC 2007上的人物檢測AP,其中所有模型都只用VOC 2007的資料訓練。在Picasso上,模型以VOC 2012訓練;而在People-Art上,則以VOC 2010訓練。
:::
:::info

Figure 5: Generalization results on Picasso and People-Art datasets.
Figure 5:在Picasso與People-Art資料集上的泛化結果。
:::
:::info
R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.
:::
:::success
R-CNN在VOC 2007上有很高的AP。然而,R-CNN在應用到藝術品的時候效能下降得非常多。R-CNN使用Selective Search來產生候選的邊界框,而這是針對自然影像調校的。R-CNN的分類器步驟只能看到很小的區域,因此需要好的proposals。
:::
:::info
DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.
:::
:::success
DPM在應用到藝術品的時候能很好地維持它的AP。先前的研究推論DPM表現良好,是因為它對物件的形狀與佈局有強力的空間模型。儘管DPM不像R-CNN退化得那麼多,但它的起始AP本來就比較低。
:::
:::info
YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.
:::
:::success
YOLO在VOC 2007上有很好的效能,而且應用在藝術品上時,它的AP退化得比其它方法少。與DPM一樣,YOLO對物件的大小與形狀、物件之間的關係,以及物件通常出現的位置做建模。藝術品與自然影像在像素層級上非常不同,但它們在物件的大小與形狀方面是類似的,因此YOLO仍然可以預測出好的邊界框與檢測結果。
:::
## 5. Real-Time Detection In The Wild
:::info
YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.
:::
:::success
YOLO是一個又快又準的目標檢測器,非常適合電腦視覺的應用。我們將YOLO接上webcam,並驗證它能維持即時效能,其中包含從camera取得影像以及顯示檢測結果的時間。
:::
:::info
The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/ .
:::
:::success
最終得到的系統是互動式的,而且非常吸引人。雖然YOLO是各別的處理影像,但是在掛上webcam之後,它就像是追蹤系統,可以檢測物件的移動與外觀的變化。系統的展示與原始碼都可以在我們的專案網站上找到http://pjreddie.com/yolo/ 。
:::
## 6. Conclusion
:::info
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
:::
:::success
我們介紹YOLO,一個目標檢測的整合模型。我們的模型構造簡單,而且能夠以完整的影像直接訓練。不像基於分類器的方法,YOLO是以與檢測效能直接相對應的loss function訓練,而且是整個模型一起訓練。
:::
:::info
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.
:::
:::success
Fast YOLO是文獻中最快的通用目標檢測器,而YOLO則推進了即時目標檢測的最先進水準。YOLO也能很好地泛化到新的領域,使它成為依賴快速、穩健目標檢測的應用的理想選擇。
:::
:::info

**Figure 6: Qualitative Results.** YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.
**Figure 6: Qualitative Results.** YOLO執行在來自網路的藝術品與自然影像上。儘管它將一個人視為一架飛機了,但大多數情況下都是很準確的。
:::