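# YOLO V1: The Origin of the YOLO Family

###### tags: `Research`

Object detection has long been one of the most important branches of computer vision; its purpose was briefly explained in the previous post, [an introduction to R-CNN](https://hackmd.io/@jaihuayen/ByWWByoXq). This post introduces YOLO, a model commonly used for object detection in academia, industry, and data competitions alike. YOLO V3 and V4 are the most widely used versions at the moment, but to understand how YOLO works we start with [YOLO V1](https://arxiv.org/pdf/1506.02640.pdf).

---

### Chapter 1: Introduction

The authors open by listing several advantages of YOLO:

> First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline.

1. Bounding-box prediction is framed as a regression problem, which makes YOLO very fast at object detection.

> Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.

2. YOLO detects objects based on the whole image, unlike sliding-window and similar approaches, which only look at local regions.

> Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on art-work, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin.

3. YOLO learns more general features from images. The authors trained on natural images and tested on artwork, and the detection results were still better than those of DPM and R-CNN.

Speed alone already makes YOLO practical for industry use, and its accuracy is not far behind that of the earlier R-CNN family; the details are covered in the model evaluation chapter. The authors also provide a flowchart that lets readers quickly grasp how YOLO operates:

![](https://i.imgur.com/F1t1Kfz.png =400x)

In summary, there are three steps:

1. Resize the image to 448x448.
2. Run the CNN.
3. Apply non-max suppression (NMS) to obtain the object positions and their confidence scores.

### Chapter 2: YOLO Model

#### 2.1 Model Architecture

Next comes the architecture of the model, which is inspired by [GoogLeNet](https://arxiv.org/abs/1409.4842). The detailed architecture is shown below:

![](https://i.imgur.com/6gIHaZh.png)

The image must first be resized to $448\times448$ to match the input size the model expects. After a series of convolutional layers the feature map is $7\times7\times1024$, and the fully connected layers then produce an output of $7\times7\times30$. These numbers were chosen for a reason, broken down as follows:

1. $7\times7$: the image is divided into 49 grid cells. If the center of an object falls inside a cell, that cell is responsible for predicting the object. There are therefore 49 detection cells in total.
2. $30=(4+1)\times2+20$: 4 is the predicted center position plus width and height of the object, $(x,y,w,h)$; 1 is the confidence $c$ of the box; $\times2$ means each cell has two predicted boxes (this 2 is a hyperparameter and can be adjusted); 20 is the number of predicted classes.

Note: the authors explain the purpose of having multiple boxes per cell as follows:

> YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object.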
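> We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

Among the multiple boxes in a cell, only the one closest to the object (highest IOU with the ground truth) is corrected for that object; the others are not adjusted. The authors found that this gradually makes each box predictor specialize and improves recall. The paper also gives a first summary of how these grid cells behave:

> YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

This shows that each grid cell is responsible for detecting only one object. That is also its weakness: if an object lies at the edge of a cell, detection tends to be poor, and small objects are also detected poorly.

#### 2.2 Loss Function

With the architecture in place, training the model also requires a loss function that determines how objects are detected and localized. The loss function is shown below, followed by an explanation of each part:

![](https://i.imgur.com/c7wWqMR.png)

1. Line 1: the error of the object's center position. If the cell contains no object, this term is 0.
2. Line 2: the error of the object's width and height. If the cell contains no object, this term is 0.
3. Lines 3 and 4: the confidence error. For a box with no object the target confidence is 0; for a box responsible for an object the target is its IOU with the ground truth.
4. Line 5: the class probability error, computed only for cells that contain an object.

The $\lambda$ weights exist because usually only a few of the 49 cells contain an object. Without this reweighting the many empty cells would dominate the loss and mAP would suffer, so the terms for cells that contain an object are given more weight: the paper sets $\lambda_{coord}=5$ and $\lambda_{noobj}=0.5$.

### Chapter 3: Model Performance

On the PASCAL VOC 2007 dataset, the results compared with other object detection models are as follows:

![](https://i.imgur.com/Vej3gK9.png =500x)

With the same VGG-16 backbone, YOLO's mAP is about 7 points lower than that of Faster R-CNN (66.4 vs. 73.2), but YOLO is three times faster (21 vs. 7 FPS). Among the real-time detectors, Fast YOLO's mAP is about 11 points lower than YOLO's (52.7 vs. 63.4), but it is more than three times faster again (155 vs. 45 FPS). The authors then analyze the misclassified detections further:

![](https://i.imgur.com/5QCOlPY.png =600x)

The authors define each category in the pie charts as follows:

> Each prediction is either correct or it is classified based on the type of error:
> • Correct: correct class and $IOU > 0.5$
> • Localization: correct class, $0.1 < IOU < 0.5$
> • Similar: class is similar, $IOU > 0.1$
> • Other: class is wrong, $IOU > 0.1$
> • Background: $IOU < 0.1$ for any object

YOLO does worse on Localization, which suggests that directly regressing box positions does not work especially well. Its background error rate, however, is much lower, possibly because YOLO looks at the whole image and can therefore better tell background from objects.

### Chapter 4: Conclusion

Overall, compared with the R-CNN family, YOLO v1 is faster and generalizes better, while its accuracy does not drop by much.

---

### References

1. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In *Proceedings of the IEEE conference on computer vision and pattern recognition* (pp. 779-788). [Link](https://arxiv.org/pdf/1506.02640.pdf)
2. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In *Proceedings of the IEEE conference on computer vision and pattern recognition* (pp. 1-9). [Link](https://arxiv.org/abs/1409.4842)
3. 【论文解读】Yolo三部曲解读——Yolov1 [Link](https://zhuanlan.zhihu.com/p/70387154)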
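
As a supplement to Section 2.1, the sketch below illustrates three ideas from that section: which grid cell is responsible for an object, how one cell's 30 numbers split into $2$ boxes plus $20$ class scores, and how the "responsible" box is picked by highest IOU. It is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the ordering of values inside the 30-vector (boxes first, then classes) is assumed for illustration, and all boxes are assumed to be expressed in the same coordinate frame.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (from Section 2.1)


def responsible_cell(cx, cy, img_w, img_h, s=S):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * s), s - 1)  # clamp centers on the right/bottom edge
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def split_cell_vector(vec, b=B, c=C):
    """Split one cell's output into b boxes (x, y, w, h, conf) and c class scores.

    Assumed layout: [box_0 (5 values), ..., box_{b-1} (5 values), class scores].
    """
    assert len(vec) == 5 * b + c
    boxes = [tuple(vec[5 * i: 5 * i + 5]) for i in range(b)]
    class_scores = list(vec[5 * b:])
    return boxes, class_scores


def iou(box_a, box_b):
    """IOU of two boxes given as (cx, cy, w, h) in the same units."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the ground truth."""
    ious = [iou(p[:4], truth_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)
```

For example, `responsible_cell(224, 224, 448, 448)` selects the center cell of the $7\times7$ grid, and `responsible_predictor` returns the index of the box whose coordinates would be corrected during training; the other box in that cell is left alone for that object.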