Object Detection - YOLO v2

謝朋諺(Adam Hsieh)
Thu, Apr 18, 2019 4:09 PM

tags: `paper`

Reference

YOLOv2 論文筆記
 論文筆記1 - (YOLOv2) YOLO9000: Better, Faster, Stronger
關於影像辨識，所有你應該知道的深度學習模型
 目標檢測網絡之 YOLOv2
YOLO v1,YOLO v2,YOLO9000算法总结与源码解析
 YOLOv2目标检测详解
 Training Object Detection (YOLOv2) from scratch using Cyclic Learning Rates

YOLO9000: Better, Faster, Stronger

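Paper link

This version focuses on improving recall and localization while maintaining classification accuracy.
Rather than scaling up the network, the paper simplifies it so that it is easier to learn.
The paper applies the ideas in the table below to improve YOLO v2's performance; each is introduced in turn in the following sections: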

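Almost every addition raises mAP. The exceptions are anchor boxes and the switch to a fully convolutional network, whose main purpose is to raise recall while lowering mAP as little as possible.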

Batch Normalization

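Batch Normalization significantly improves convergence while removing the need for other forms of regularization.
Adding Batch Normalization to all convolutional layers in YOLO improves mAP by more than 2%.
Batch Normalization also helps regularize the model: even with Dropout removed, the model does not overfit.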

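High Resolution Classifier

The first version of YOLO pre-trains its classifier on $224 * 224$ inputs and only switches to $448 * 448$ for detection. As a result, when moving from the classification model to the detection model, the network also has to adapt to the change in input resolution.
In the second version, the authors first train the classification model at $224 * 224$ for 160 epochs, then raise the resolution to $448 * 448$ and train for another 10 epochs. Both steps are carried out on ImageNet. Finally, the network is fine-tuned on the detection dataset, also at $448 * 448$.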

Convolutional With Anchor Boxes

YOLO v1 uses fully connected layers to predict the coordinates of bounding boxes directly.
YOLO v2 borrows the idea of anchors from Faster R-CNN.

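The RPN (Region Proposal Network) in Faster R-CNN slides a window over the feature map. The center of each sliding window is called an anchor point. At every anchor point, k pre-defined boxes of different sizes and aspect ratios are scored for how likely they are to contain an object, and the highest-scoring boxes are kept. These k boxes are called anchor boxes.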

Each anchor point therefore yields 2k scores and 4k coordinates (each box is described by a position plus a width and a height, hence 4 values). The Faster R-CNN paper uses 3 scales combined with 3 aspect ratios by default, so k = 3 x 3 = 9.

In Faster R-CNN, the Region Proposal Network (RPN) predicts offsets and confidences for the anchor boxes. Predicting offsets instead of coordinates makes the problem easier for the network to learn.
YOLO v2 makes the following changes:
1. The fully connected layers and the last pooling layer are removed so that the final convolutional layer outputs higher-resolution features.
2. The network input is shrunk from $448 * 448$ to $416 * 416$. This gives the feature map an odd number of locations, so there is a single center cell.
3. Objects, especially large ones, tend to occupy the center of the image, so it is better to have a single location at the center to predict them than four locations that are all near the center.
4. The network downsamples the $416 * 416$ input by a factor of 32 to produce a $13 * 13$ output feature map. (Downsampling is done by pooling: there are 5 max pooling layers with size = 2 and stride = 2, and the convolutional layers do not reduce the spatial size, so the final feature map is $416 / 2^{5} = 13$ on each side.)

With anchor boxes, the predictions change as follows:
1. YOLO v2 decouples class prediction from spatial location. Following the anchor boxes of Faster R-CNN, it predicts both a class and coordinates for every anchor box.
2. YOLO v1 divides the image into a $7 * 7$ grid with two bounding boxes per cell, for a total of $7 * 7 * 2 = 98$ bounding boxes.
3. YOLO v2 has a $13 * 13$ feature map and predicts X anchor boxes per cell, for a total of $13 * 13 * X > 1000$ boxes. The larger number of boxes improves object localization.
4. Without anchor boxes the model reaches 69.5 mAP with 81% recall; with anchor boxes it reaches 69.2 mAP with 88% recall.
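
The arithmetic above can be checked with a short sketch. The helper names are made up for illustration, and the choice of 9 boxes per cell (the Faster R-CNN default) is an assumption used only to show how the count can exceed 1000; 5 is the value selected later by dimension clustering.

```python
def feature_map_size(input_size: int, num_poolings: int = 5) -> int:
    """Side length of the feature map after `num_poolings` stride-2 poolings."""
    return input_size // (2 ** num_poolings)


def total_boxes(grid: int, boxes_per_cell: int) -> int:
    """Number of boxes predicted for one image."""
    return grid * grid * boxes_per_cell


print(feature_map_size(416))  # 13 (odd, so there is a single center cell)
print(feature_map_size(448))  # 14 (even, no single center cell)
print(total_boxes(7, 2))      # 98, YOLO v1
print(total_boxes(13, 9))     # 1521, assuming 9 anchor boxes per cell
print(total_boxes(13, 5))     # 845, with the 5 priors chosen later
```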

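Dimension Clusters

Faster R-CNN picks the sizes and aspect ratios of its anchor boxes by hand, and the network then learns to adjust them during training.
Instead of choosing box widths and heights manually, the paper runs K-means clustering on the bounding boxes of the training set to find good priors automatically.
With the standard Euclidean distance, K-means produces larger errors for large boxes than for small ones, which can bias the clustering.
To make the error independent of box size, the paper changes the distance function to:

$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$

The paper runs the clustering for different values of $k$ and plots the average IOU with the closest centroid. $k = 5$ gives a good trade-off between model complexity and recall.
The resulting clusters contain fewer short, wide boxes and more tall, thin boxes (which matches the shape of pedestrians).
Five clustered priors score almost the same as the nine hand-picked anchor boxes of Faster R-CNN, which suggests that the boxes found by K-means are more representative.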
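
A minimal NumPy sketch of K-means with the $1 - IOU$ distance is shown below. It is not the authors' code: the function names, the random synthetic boxes, and the use of the cluster mean as the centroid update are all assumptions made for illustration. Boxes are given as (width, height) and aligned at a common corner so that only shape matters.

```python
import numpy as np


def shape_iou(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IOU between (N, 2) boxes and (K, 2) centroids given as (w, h).

    All boxes are treated as sharing one corner, so position is ignored.
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)


def kmeans_iou(boxes: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """Cluster (w, h) pairs using d = 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to the centroid with the smallest 1 - IOU.
        assign = (1.0 - shape_iou(boxes, centroids)).argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign


def average_iou(boxes: np.ndarray, centroids: np.ndarray) -> float:
    """Mean IOU between each box and its closest centroid."""
    return float(shape_iou(boxes, centroids).max(axis=1).mean())


# Synthetic (w, h) pairs standing in for training-set boxes.
demo_boxes = np.random.default_rng(0).uniform(0.05, 0.9, size=(200, 2))
demo_centroids, _ = kmeans_iou(demo_boxes, k=5)
print(demo_centroids)
print(average_iou(demo_boxes, demo_centroids))
```

Running `average_iou` for several values of `k` on real annotations would reproduce the kind of curve the paper uses to pick $k = 5$.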

Direct Location Prediction

Using anchor boxes with YOLO raises two issues. The first, mentioned above, is that the anchor box dimensions are hand-picked (solved by dimension clusters). The second is that the predicted $(x, y)$ location of a box is unstable. In region proposal networks the location is computed as:

$x = (t_{x} * w_{a}) + x_{a}$

$y = (t_{y} * h_{a}) + y_{a}$

$t_{x} = 1$ shifts the box to the right by the width of the anchor box, and $t_{x} = -1$ shifts it to the left by the same amount.

The formula printed in the original paper is wrong: it uses subtraction where it should use addition. The version above is the correct one.

Instead of this kind of prediction, YOLO v2 follows the approach of YOLO v1 and predicts the offset of the bounding box center relative to the top-left corner $(c_{x}, c_{y})$ of its grid cell.

Because the anchor box formula is unconstrained, a predicted box can end up anywhere in the image, and early in training it can take a long time before the model predicts sensible offsets.

The final convolutional layer outputs a $13 * 13$ grid of cells. Each cell uses the widths and heights $(p_{w}, p_{h})$ of 5 anchor boxes to predict 5 bounding boxes, and each bounding box consists of 5 values: $t_{x}, t_{y}, t_{w}, t_{h}, t_{o}$ ($t_{o}$ plays a role similar to the confidence in YOLO v1).
To constrain the center of the bounding box to its cell, a sigmoid squashes $t_{x}, t_{y}$ into the range 0 ~ 1, which makes the model more stable.
Together, Dimension Clusters and Direct Location Prediction raise mAP by about 5% over the anchor box version.
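
In the paper, the five raw outputs are turned into a box with $b_{x} = \sigma(t_{x}) + c_{x}$, $b_{y} = \sigma(t_{y}) + c_{y}$, $b_{w} = p_{w} e^{t_{w}}$, $b_{h} = p_{h} e^{t_{h}}$, and the confidence is $\sigma(t_{o})$. The sketch below decodes one prediction under the assumption that priors are expressed in grid-cell units and that the result should be normalized to the image; `decode_box` and its arguments are illustrative names, not part of any library.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior, grid_size: int = 13):
    """Decode (t_x, t_y, t_w, t_h, t_o) into a normalized box and confidence.

    cell  : (c_x, c_y), top-left corner of the grid cell, in cell units
    prior : (p_w, p_h), anchor width and height, in cell units (assumed)
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # width and height scale the prior
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)
    return (b_x / grid_size, b_y / grid_size,
            b_w / grid_size, b_h / grid_size, confidence)


# All-zero outputs give a box centered in its cell with the prior's shape.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(2.0, 3.0)))
```

Because of the sigmoid, no value of $t_{x}$ or $t_{y}$ can move the center out of its cell, which is the constraint the RPN-style formula lacks.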

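Fine-Grained Features

A passthrough layer is added that brings the $26 * 26$ feature map of an earlier layer forward and concatenates it with the $13 * 13$ feature map of the current layer, similar to the shortcut connections in ResNet.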

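Predicting on the $13 * 13$ feature map is sufficient for large objects but not necessarily good enough for small ones. Merging in the larger, earlier feature map helps detect small objects.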

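In detail: the passthrough layer turns the $26 * 26 * 512$ feature map into a $13 * 13 * 2048$ feature map (the spatial size shrinks to 1/4 and the number of channels grows 4 times). This is concatenated with the following $13 * 13 * 1024$ feature map to form a $13 * 13 * 3072$ feature map, and the prediction convolution runs on top of it.

This gives an improvement of close to 1% mAP.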
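
The reshaping can be sketched as a space-to-depth operation in NumPy. This is an illustration of the shapes only: the function name `passthrough`, the channels-last layout, and the channel ordering are assumptions and may differ from Darknet's own reorg layer.

```python
import numpy as np


def passthrough(x: np.ndarray, stride: int = 2) -> np.ndarray:
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    h, w, c = x.shape
    # Split each spatial axis into (block index, offset inside the block).
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    # Move both in-block offsets next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.random.rand(26, 26, 512)     # earlier, higher-resolution features
coarse = np.random.rand(13, 13, 1024)  # features of the current layer
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
print(passthrough(fine).shape)  # (13, 13, 2048)
print(merged.shape)             # (13, 13, 3072)
```

No values are discarded: every 2 x 2 spatial block of the fine map is stacked into the channel axis, so small-object detail survives the change in resolution.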

Multi-Scale Training

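The original input size is $448 * 448$, which becomes $416 * 416$ once anchor boxes are added. Because the network consists only of convolutional and pooling layers, the input size can easily be changed on the fly.
Every 10 batches, the network randomly picks a new input size. Since the network downsamples by a factor of 32, the sizes are drawn from multiples of 32: {320, 352, …, 608}. The largest is 608X608.
This forces the network to learn to predict well across a range of input sizes.
- At small sizes YOLO v2 runs faster, offering a trade-off between speed and accuracy. At 288X288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN.
- At large sizes YOLO v2 is a state-of-the-art detector, reaching 78.6 mAP on VOC 2007 while still running at real-time speed.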
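
The size schedule described above can be sketched as a small generator. The starting size of 416 and the function name are assumptions for illustration; only the multiples of 32 and the 10-batch interval come from the text.

```python
import random

# Multiples of 32 from 320 to 608, matching the downsampling factor.
SCALES = list(range(320, 608 + 1, 32))


def input_size_schedule(num_batches: int, every: int = 10, seed: int = 0):
    """Yield the input size for each batch, re-drawn every `every` batches."""
    rng = random.Random(seed)
    size = 416  # assumed starting size
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = rng.choice(SCALES)
        yield size


print(SCALES)
for batch, size in enumerate(input_size_schedule(30)):
    if batch % 10 == 0:
        print(batch, size, "-> grid", size // 32)
```
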
The table below shows the detection results on VOC 2007. Timings were measured on a GeForce GTX Titan X.
The paper also evaluates on VOC 2012, where YOLO v2 reaches 73.4 mAP while running far faster than competing methods (2-10 times faster than Faster R-CNN and SSD512).

The paper also trains on COCO. On the VOC metric (IOU = 0.5), YOLO v2 reaches 44.0 mAP, comparable to SSD and Faster R-CNN.

Darknet 19

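Most detection frameworks use VGG 16 as the feature extractor, but it needs 30.69 billion floating point operations for a single pass over a 224X224 image. YOLO v1 uses a custom network based on GoogLeNet that needs only 8.52 billion operations, at slightly lower accuracy than VGG 16.
The paper proposes a new classification model as the backbone of YOLO v2.
- Most filters are 3X3.
- The number of channels is doubled after every pooling step.
- Global average pooling makes the final prediction, and 1X1 filters compress the feature representation between 3X3 convolutions.
- Batch Normalization stabilizes training, speeds up convergence, and regularizes the model.
- Darknet 19 has 19 convolutional layers and 5 max pooling layers.
- It needs only 5.58 billion operations to process an image.
- It achieves 72.9% top-1 and 91.2% top-5 accuracy on ImageNet.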

Training for Classification

Hyperparameters:
- Starting learning rate: 0.1
- Polynomial rate decay with a power of 4
- Weight decay: 0.0005
- Momentum: 0.9
Data augmentation:
- Random crops
- Rotations
- Hue shifts
- Saturation shifts
- Exposure shifts
Training:
- As described above, the network is first trained on the 1000-class ImageNet at 224X224 for 160 epochs with a starting learning rate of 0.1.
- It is then fine-tuned at 448X448 for 10 more epochs with the learning rate set to 0.001; all other settings stay the same.
- The resulting top-1 accuracy is 76.5% and top-5 accuracy is 93.3%.
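
As a rough illustration of what polynomial decay with a power of 4 looks like, the sketch below assumes the commonly used form $lr = lr_{0} (1 - step / total)^{power}$. The function name and the choice of 100 total steps are illustrative, not taken from the paper.

```python
def poly_lr(step: int, total_steps: int,
            base_lr: float = 0.1, power: float = 4.0) -> float:
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power


for step in (0, 25, 50, 75, 100):
    print(step, poly_lr(step, total_steps=100))
```

With a power of 4 the rate falls quickly at first: halfway through training it is already down to 1/16 of its starting value.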

Training for Detection

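The Darknet 19 classification model is converted into a detection model as follows.
- The last convolutional layer, the average pooling layer, and the softmax are removed.
- Three $3 * 3$ convolutional layers with 1024 filters each are appended.
- A passthrough layer carries the output of the final $3 * 3 * 512$ layer to the later layers so that the model can use fine-grained features.
- A final $1 * 1$ convolutional layer produces as many outputs as detection requires.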

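The boxed part of the figure is Darknet 19. The dataset used here is VOC with its 20 classes.

For example, on VOC every cell predicts 5 bounding boxes, and every bounding box predicts 5 values (4 coordinates and 1 confidence) plus 20 class scores, so the final layer needs 125 filters.
For YOLO v2:

$\text{filter\_num} = \text{num} * (\text{classes} + 5) = 5 * (20 + 5) = 125$

For YOLO v1:

$\text{filter\_num} = \text{classes} + \text{num} * (\text{coords} + \text{confidence}) = 20 + 2 * (4 + 1) = 30$

The difference between v1 and v2 lies in how classes are handled: v1 predicts one set of class scores per cell, while v2 predicts one per box.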
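
The two formulas can be written as small helper functions, which makes it easy to recompute the output depth for another dataset. The function names are illustrative, and the 80-class case is an added example (COCO has 80 detection classes), not a number from these notes.

```python
def yolo_v2_filters(num_boxes: int, num_classes: int) -> int:
    """Each box carries its own 4 coords, 1 confidence, and class scores."""
    return num_boxes * (num_classes + 5)


def yolo_v1_outputs(num_boxes: int, num_classes: int,
                    coords: int = 4, confidence: int = 1) -> int:
    """Class scores are shared by the cell; boxes add coords + confidence."""
    return num_classes + num_boxes * (coords + confidence)


print(yolo_v2_filters(5, 20))  # 125
print(yolo_v1_outputs(2, 20))  # 30
print(yolo_v2_filters(5, 80))  # 425 for an 80-class dataset
```
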
Training:
- The model is trained for 160 epochs with a starting learning rate of 0.001, which is divided by 10 at epochs 60 and 90.
- Weight decay is 0.0005 and momentum is 0.9.
- Data augmentation is the same as above.
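
The step schedule can be sketched as follows. Whether the drop takes effect at the start or the end of epochs 60 and 90 is not stated, so applying it from the start of the milestone epoch is an assumption, as is the function name.

```python
def step_lr(epoch: int, base_lr: float = 1e-3,
            milestones=(60, 90), gamma: float = 0.1) -> float:
    """Multiply the learning rate by `gamma` at every milestone reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops


for epoch in (0, 59, 60, 89, 90, 159):
    print(epoch, step_lr(epoch))
```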

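The paper does not explain how prior boxes are matched or how the loss is computed!

Loss

$w, h$ are the width and height of the feature map ($13 * 13$).
$A$ is the number of prior boxes (5 here).
Each $\lambda$ is the weight of one loss term; see the YOLO v1 loss for reference.
The first term is the confidence error for background. To decide which predicted boxes are responsible for background, the IOU between each predicted box and every ground truth box is computed and the maximum is taken. If that maximum IOU is below a threshold (0.6 in YOLO v2), the predicted box is marked as background and contributes to the noobj term.
The second term is the coordinate error between the prior boxes and the predicted boxes, but it is only computed during the first 12,800 iterations. Its purpose is presumably to let the predicted boxes learn the shapes of the priors quickly early in training.
The third group of terms is the loss for predicted boxes matched to a ground truth box: coordinate error, confidence error, and classification error.
When computing the box error, YOLO v1 takes square roots to reduce the influence of box size on the error. YOLO v2 computes the error directly but adjusts the weight according to the size of the ground truth box:

$\text{coord\_scale} * (2 - \text{truth.w} * \text{truth.h})$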
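
A quick numeric check shows how this weight behaves, assuming the ground truth width and height are normalized to the range 0 ~ 1 relative to the image and `coord_scale` is 1 (both are assumptions for the sketch).

```python
def coord_weight(truth_w: float, truth_h: float,
                 coord_scale: float = 1.0) -> float:
    """Coordinate-loss weight: larger for small ground truth boxes."""
    return coord_scale * (2.0 - truth_w * truth_h)


print(coord_weight(0.1, 0.1))  # small box, weight close to 2
print(coord_weight(0.5, 0.5))  # medium box
print(coord_weight(1.0, 1.0))  # box covering the image, weight 1
```

Small boxes therefore get up to twice the coordinate weight of a box that fills the image, which plays the same role as the square root in YOLO v1.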

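Prior Box Matching

For each ground truth box, first determine which cell its center falls in, then compute its IOU with each of the 5 prior boxes.
The IOU only considers shape, not position: the prior boxes and the ground truth box are first shifted so that their centers coincide (at the origin), and the prior with the largest IOU is matched to the ground truth. The predicted box corresponding to that prior is then responsible for predicting this ground truth.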
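
The matching procedure can be sketched in plain Python. The prior sizes in `PRIORS` are made-up values in grid-cell units, and the ground truth box is assumed to be given as normalized (center x, center y, width, height); neither comes from the paper.

```python
# Hypothetical prior boxes as (width, height) in grid-cell units.
PRIORS = [(1.0, 1.5), (3.0, 4.0), (5.0, 8.0), (9.0, 5.0), (11.0, 10.0)]


def shape_iou(wh_a, wh_b) -> float:
    """IOU of two boxes with a shared center, so only shape matters."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def match_prior(gt_box, priors, grid_size: int = 13):
    """Return (row, col, prior index) responsible for a ground truth box."""
    x, y, w, h = gt_box
    # The cell containing the box center.
    col = min(int(x * grid_size), grid_size - 1)
    row = min(int(y * grid_size), grid_size - 1)
    # Compare shapes in grid-cell units, ignoring position.
    gt_wh = (w * grid_size, h * grid_size)
    ious = [shape_iou(gt_wh, p) for p in priors]
    best = max(range(len(priors)), key=ious.__getitem__)
    return row, col, best


print(match_prior((0.5, 0.5, 0.25, 0.3), PRIORS))  # (6, 6, 1)
```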

YOLO 9000

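Most classification approaches use a softmax over all classes to compute the final probability distribution, which assumes the classes are mutually exclusive. But "Norfolk terrier" and "dog" are not mutually exclusive. A multi-label model can instead be used to combine datasets whose classes are not mutually exclusive.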

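Hierarchical Classification

ImageNet labels are drawn from WordNet. In WordNet, "Norfolk terrier" and "Yorkshire terrier" are both hyponyms of "terrier", which is a type of "hunting dog".
WordNet is structured as a directed graph, not a tree. For example, "dog" is both a type of "canine" and a type of "domestic animal", which are both synsets in WordNet. Instead of using the full graph structure, the paper simplifies the problem by building a hierarchical tree from the concepts in ImageNet.
To build the tree, the paths from each concept through the WordNet graph to the root node, "physical object", are examined. More abstract labels sit closer to the root. If a concept has two paths to the root, only the shorter one is kept.
The result is called WordTree and is used for classification. A conditional probability is predicted for every label node; the probability of a particular node is obtained by multiplying the conditional probabilities along the path between the root and that node.
For example, at the "terrier" node the model predicts:

$Pr(\text{Norfolk terrier} \mid \text{terrier})$

$Pr(\text{Yorkshire terrier} \mid \text{terrier})$

$Pr(\text{Bedlington terrier} \mid \text{terrier})$

To compute the absolute probability of a particular node, follow the path through the tree to the root and multiply the conditional probabilities. To find out whether an image shows a "Norfolk terrier", compute:

$Pr(\text{Norfolk terrier}) = Pr(\text{Norfolk terrier} \mid \text{terrier}) * Pr(\text{terrier} \mid \text{hunting dog}) * \ldots * Pr(\text{mammal} \mid \text{animal}) * Pr(\text{animal} \mid \text{physical object})$

For classification purposes, the image is assumed to contain an object, so at the root node:

$Pr(\text{physical object}) = 1$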
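
The chain of multiplications can be sketched with a toy tree. The tree below keeps only the nodes named above and skips the intermediate levels that the formula elides, and every probability value is invented for illustration.

```python
# Toy WordTree: child -> parent. Intermediate levels are omitted.
PARENT = {
    "animal": "physical object",
    "mammal": "animal",
    "hunting dog": "mammal",
    "terrier": "hunting dog",
    "Norfolk terrier": "terrier",
}

# Invented conditional probabilities Pr(node | parent).
COND = {
    "animal": 0.9,
    "mammal": 0.8,
    "hunting dog": 0.5,
    "terrier": 0.5,
    "Norfolk terrier": 0.4,
}


def absolute_probability(node, parent, cond, root="physical object"):
    """Multiply conditional probabilities from `node` up to the root."""
    prob = 1.0  # Pr(physical object) = 1 for classification
    while node != root:
        prob *= cond[node]
        node = parent[node]
    return prob


print(absolute_probability("Norfolk terrier", PARENT, COND))
print(absolute_probability("mammal", PARENT, COND))
```
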
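To validate this approach, the paper trains Darknet 19 on a WordTree built from the 1000-class ImageNet. Building the tree adds 369 intermediate nodes to the 1000 classes, so the predicted output is a 1369-dimensional vector, with a softmax computed over the children of every parent node.
With the same training parameters as before, this hierarchical Darknet 19 still reaches 71.9% top-1 and 90.4% top-5 accuracy.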
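
Applying a separate softmax to each group of siblings can be sketched as below. The six-element logit vector and the two sibling groups are toy assumptions; in the setting described above the vector would have 1369 entries, grouped by parent node.

```python
import numpy as np


def grouped_softmax(logits: np.ndarray, groups) -> np.ndarray:
    """Softmax computed independently within each group of sibling indices."""
    out = np.empty_like(logits, dtype=float)
    for idx in groups:
        z = logits[idx] - logits[idx].max()  # subtract max for stability
        e = np.exp(z)
        out[idx] = e / e.sum()
    return out


logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0, 3.0])
groups = [[0, 1, 2], [3, 4, 5]]  # two parents with three children each
probs = grouped_softmax(logits, groups)
print(probs)
print([probs[idx].sum() for idx in groups])
```

Each group sums to 1 on its own, so every entry is a conditional probability given its parent rather than a share of one global distribution.
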
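The same formulation works for detection. The detector predicts $Pr(\text{physical object})$, the bounding box, and the tree of conditional probabilities.
The tree is traversed from the root, following the highest-confidence path at every split until the confidence falls below some threshold; the last node reached is the predicted class.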
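
A minimal sketch of this traversal follows. The toy tree, the probability values, and the 0.5 threshold are all invented for illustration, and `objectness` stands in for the predicted $Pr(\text{physical object})$.

```python
# Toy tree: parent -> children, with invented Pr(child | parent) values.
CHILDREN = {
    "physical object": ["animal", "artifact"],
    "animal": ["dog", "cat"],
    "dog": ["terrier", "hound"],
}
COND = {
    "animal": 0.9, "artifact": 0.1,
    "dog": 0.8, "cat": 0.2,
    "terrier": 0.55, "hound": 0.45,
}


def predict_label(children, cond, threshold=0.5,
                  root="physical object", objectness=1.0):
    """Follow the most likely child until the probability drops too low."""
    node, prob = root, objectness
    while children.get(node):
        best = max(children[node], key=lambda c: cond[c])
        if prob * cond[best] < threshold:
            break  # not confident enough to be more specific
        node, prob = best, prob * cond[best]
    return node, prob


print(predict_label(CHILDREN, COND, threshold=0.5))
print(predict_label(CHILDREN, COND, threshold=0.3))
```

With the higher threshold the prediction stops at "dog" because the model is not sure which breed it is; lowering the threshold lets it commit to "terrier".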

Dataset Combination with WordTree

WordTree can be used to combine multiple datasets in a sensible way, as the figure below illustrates:

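Joint Classification and Detection

The datasets are combined using WordTree: the COCO detection dataset is merged with the top 9000 classes of ImageNet, giving 9418 classes in total.
Because ImageNet is a much larger dataset, COCO is oversampled so that ImageNet is only larger by a factor of 4:1.
YOLO 9000 has the same output structure as YOLO v2, except that it uses 3 prior boxes per cell instead of the 5 anchor boxes used by YOLO v2.
For each box, YOLO 9000 predicts 9418 conditional probabilities (one per WordTree node) plus $Pr(\text{physical object})$. Unlike YOLO v2, which simply outputs the class with the highest probability, YOLO 9000 needs a threshold to decide the final class: the probabilities of the label nodes are computed level by level until a level has at least one node whose probability reaches the threshold. If several nodes at that level reach the threshold, the one with the highest probability is chosen as the final class; if only one does, that node is the final class.
When YOLO 9000 sees an image from the detection dataset, it backpropagates the loss in the same way as YOLO v2 (including the bounding box loss). For the classification part, it only backpropagates the error at the level of the label and above.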

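Concretely, given the WordTree, find the ground truth label node and all nodes above it and compute their probabilities. The target is 1 for every node on the path from the ground truth node to the root and 0 for the other nodes; the loss is the sum of squared differences between the predictions and these targets.

When YOLO 9000 sees an image from the classification dataset, it only backpropagates the classification loss. Within each cell (YOLO 9000 has three anchor boxes per cell), the class probabilities of the bounding box with the highest confidence are used to compute the classification loss.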

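The authors also mention one more idea: the anchor box with the highest confidence is treated as the ground truth bounding box, the IOU between it and the other anchor boxes in the cell is computed, and the bounding box error is also backpropagated for those whose IOU exceeds 0.3.

With this joint training, YOLO 9000 learns to find objects in images from the COCO detection data and learns to classify a wide variety of objects from the ImageNet data.
The paper evaluates YOLO 9000 on the ImageNet detection task.
- The ImageNet detection task shares only 44 object categories with COCO, which means that for most of the test classes YOLO 9000 has only seen classification data, not detection data.
- YOLO 9000 reaches 19.7 mAP overall and 16.0 mAP on the 156 object classes for which it has never seen detection data.
- This is higher than the results of DPM, even though YOLO 9000 is trained on different datasets with only partial supervision, and it simultaneously detects more than 9000 categories in real time.
- Looking at the per-class performance on ImageNet, YOLO 9000 learns new animal species well but struggles with new categories such as clothing and equipment. The likely reason is that new animals resemble the animals in COCO, whereas COCO has no bounding box labels for any type of clothing, only for people. The table below lists the best and worst of the 156 weakly supervised classes: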
