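###### tags: `Paper Notes`

# YOLOv1

* Paper: You Only Look Once: Unified Real-Time Object Detection
* Affiliations: University of Washington, Allen Institute for AI, Facebook AI Research
* Year: 2016

### Model Architecture

* R-CNN first uses region proposal methods to generate candidate bounding boxes, then runs a classifier on each proposal to score the classes, and finally applies post-processing to refine the box positions and sizes.
* YOLO (You Only Look Once) avoids this multi-stage pipeline: a single convolutional network predicts bounding boxes and class probabilities at the same time.
* Each image is first divided into an $S \times S$ grid.
    * If the center of an object falls inside a grid cell, that cell is responsible for predicting the object.
* Each grid cell predicts $B$ bounding boxes and a confidence score for each box.
    * The confidence score combines how confident the model is that the box contains an object with how accurate it thinks the box is:

    $$
    \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}
    $$

    * $\text{IOU}_{\text{pred}}^{\text{truth}}$: intersection over union (IOU) between the ground truth box and the predicted box.
    * If no object exists in the cell, the confidence should be zero; otherwise it should equal the IOU.
    * During training, only the one bounding box with the highest IOU with the ground truth is assigned responsibility for predicting that object.
* Each bounding box consists of 5 values: $x$, $y$, $w$, $h$, and $confidence$.
    * $(x, y)$: the center of the bounding box, relative to the bounds of the grid cell.
    * $(w, h)$: the width and height of the bounding box, relative to the whole image (not to the grid cell).
    * $confidence$: $\Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}$
* Each grid cell predicts only one set of $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$, regardless of the value of $B$. At test time, the class-specific confidence score of each bounding box is therefore:

    $$
    \Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}} = \Pr(\text{Class}_i) \times \text{IOU}_{\text{pred}}^{\text{truth}}
    $$

* The model output is an $S \times S \times (B \times 5 + C)$ tensor.
    * An image has $S \times S$ grid cells.
    * Each grid cell predicts $B$ bounding boxes and $C$ class probabilities.
    * Each bounding box has 5 values: $x$, $y$, $w$, $h$, and $confidence$.
    * Example (quoted from the paper):

        > For evaluating YOLO on PASCAL VOC, we use $S = 7, B = 2$. PASCAL VOC has 20 labelled classes so $C = 20$. Our final prediction is a $7 \times 7 \times 30$ tensor.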
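
The output layout and the class-specific confidence formula above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`output_shape`, `split_cell`, `class_specific_scores`) are hypothetical, the ordering of values inside a cell (all boxes first, then the class probabilities) is an assumption, and the numbers in the example cell are made up.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # (x, y, w, h, confidence)


def output_shape(S: int, B: int, C: int) -> Tuple[int, int, int]:
    """Shape of the prediction tensor: S x S cells, each holding B*5 + C values."""
    return (S, S, B * 5 + C)


def split_cell(cell: List[float], B: int, C: int) -> Tuple[List[Box], List[float]]:
    """Split one cell's vector into B boxes and C class probabilities.

    ASSUMPTION: the cell stores the B boxes first, then the C class
    probabilities. The notes do not specify the memory layout.
    """
    if len(cell) != B * 5 + C:
        raise ValueError(f"expected {B * 5 + C} values, got {len(cell)}")
    boxes = [tuple(cell[5 * j: 5 * j + 5]) for j in range(B)]
    class_probs = cell[5 * B:]
    return boxes, class_probs


def class_specific_scores(cell: List[float], B: int, C: int) -> List[List[float]]:
    """For each box, multiply Pr(Class_i | Object) by the box confidence.

    Returns B lists of C scores. The C class probabilities are shared
    by all B boxes of the cell.
    """
    boxes, class_probs = split_cell(cell, B, C)
    return [[p * box[4] for p in class_probs] for box in boxes]


# PASCAL VOC settings quoted in the notes: S = 7, B = 2, C = 20.
S, B, C = 7, 2, 20
shape = output_shape(S, B, C)  # (7, 7, 30)

# A made-up cell: two boxes followed by 20 class probabilities.
cell = [0.5, 0.5, 0.2, 0.3, 0.8,   # box 0: x, y, w, h, confidence
        0.4, 0.6, 0.1, 0.1, 0.1]   # box 1: x, y, w, h, confidence
cell += [0.0] * C
cell[5 * B + 7] = 0.9              # Pr(Class_7 | Object) = 0.9

scores = class_specific_scores(cell, B, C)
print(shape)
print(scores[0][7], scores[1][7])  # roughly 0.72 and 0.09
```

Because the class probabilities are predicted per cell rather than per box, both boxes in the example share the same class distribution and differ only through their own confidence values.
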
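* As shown in Figure 3, the full YOLO network has 24 convolutional layers followed by 2 fully connected layers.
    * The architecture is inspired by GoogLeNet [34], but instead of inception modules the authors use 1 × 1 reduction layers followed by 3 × 3 convolutional layers. The 1 × 1 layers shrink the feature space coming from the preceding layers, which reduces computation.
    * The final layer uses a linear activation; all other layers use leaky ReLU:

    $$
    \phi(x) =
    \begin{cases}
    x, & \text{if } x > 0 \\
    0.1x, & \text{otherwise}
    \end{cases}
    $$

<img src="https://i.imgur.com/yYvcRWN.png" style="zoom: 67%;" />
<center>Figure 3: YOLO architecture</center>

* The YOLO loss function consists of five parts:

    $$
    \begin{aligned}
    & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
    & + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
    & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
    & + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
    & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
    \end{aligned}
    $$

    * $\mathbb{1}_{i}^{\text{obj}}$: whether an object appears in grid cell $i$.
    * $\mathbb{1}_{ij}^{\text{obj}}$: whether the $j$-th bounding box in grid cell $i$ is responsible for that prediction.
    * $\lambda_{\text{coord}}$ $(= 5)$ and $\lambda_{\text{noobj}}$ $(= 0.5)$: increase the loss from bounding box coordinate predictions in cells that contain an object, and decrease the loss from confidence predictions in cells that do not. Most grid cells contain no object, so without down-weighting them the model would be pushed toward always predicting "no object".
    * Objects vary in size, and the same deviation matters less for a large box than for a small one. The loss therefore uses the square roots of $w$ and $h$ to partially compensate.
        * Example: take a large box of width 100 and a small box of width 10, each predicted 10% too wide, so the raw errors are 10 and 1. With square roots, the errors become $\sqrt{110} - \sqrt{100} \approx 0.49$ for the large box and $\sqrt{11} - \sqrt{10} \approx 0.15$ for the small box, so the gap between them shrinks from 10× to about 3.2×.

### Experiments & Results

* The first 20 convolutional layers are pretrained on the ImageNet 1000-class competition dataset. For pretraining, an average-pooling layer and a fully connected layer are added after them.
* Pretraining uses $224 \times 224$ images, while training for detection uses $448 \times 448$ images.
* YOLO is evaluated on the PASCAL VOC dataset; the results are shown in Table 3.

<img src="https://i.imgur.com/E7YQvwN.png" style="zoom: 60%;" />
<center>Table 3: Comparison of YOLO with other models.</center>

* The base YOLO model runs detection at 45 fps on a Titan X GPU.

### References

[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.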