YOLOv2 Paper Reading Notes

YOLO9000: Better, Faster, Stronger: https://arxiv.org/pdf/1612.08242.pdf

Introduction

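The paper proposes a set of improvements to the YOLO model that raise both its speed and its accuracy; on VOC 2007, YOLOv2 reaches 76.8 mAP at 67 FPS. In addition, because earlier detectors could only recognize a small number of object categories, the paper also proposes a training method that lets the detector cover more than 9000 categories.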

Better

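The error analysis in the YOLO paper shows that YOLO makes noticeably more localization errors than Fast R-CNN. This paper therefore focuses on improving recall and localization while maintaining classification accuracy, and it adds a number of new ideas to YOLO to improve its performance.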

Batch Normalization
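Batch normalization improves convergence and removes the need for other forms of regularization. Adding batch normalization to every convolutional layer improves mAP by more than 2% and allows dropout to be removed without overfitting.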
High Resolution Classifier
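Detectors before YOLOv2 typically start from a classifier pretrained on ImageNet, and since AlexNet most classifiers operate on inputs smaller than 256x256. The original YOLO trains its classifier at 224x224 and then raises the resolution to 448 for detection, so the network has to switch to the detection task and adapt to the new resolution at the same time. This shows how much input resolution matters for detection. YOLOv2 instead first fine-tunes the classification network on ImageNet at 448x448 for 10 epochs, and only then fine-tunes the resulting network on detection. This raises mAP by almost 4%.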
Convolutional With Anchor Boxes
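YOLOv2 adopts the anchor box concept from the Faster R-CNN paper. It first removes YOLO's fully connected layers and drops one pooling layer so that the output of the convolutional layers keeps a higher resolution. It also shrinks the input from 448 to 416 so that the final feature map has an odd size (13x13, since the network downsamples by a factor of 32) and therefore a single center cell. With anchor boxes, mAP drops by 0.3 (from 69.5 to 69.2), but recall rises by 7% (from 81% to 88%).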
Dimension Clusters
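The paper notes that anchor box dimensions are hand-picked hyperparameters. If the network starts from better anchor boxes, it becomes easier for it to learn to predict good boxes.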
To choose a good set of anchor boxes, the paper runs k-means clustering on the bounding boxes of the training set. The distance should be small exactly when the IOU between a box and a cluster centroid is large, independent of box size, so the paper uses the following distance metric:
$d(\text{box}, \text{centroid}) = 1 - IOU(\text{box}, \text{centroid})$
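The paper chooses k = 5. Table 1 of the paper shows that clustered anchor boxes do better than hand-picked ones: 5 clustered boxes reach an average IOU of 61.0, on par with the 60.9 of 9 hand-picked anchor boxes, and 9 clustered boxes reach 67.2.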
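The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: boxes are reduced to (width, height) pairs aligned at a common corner, the distance is 1 - IOU, and updating each centroid with the mean width and height of its members is an assumption (the paper does not spell out the update rule). The names `iou_wh`, `kmeans_iou`, and `avg_iou` are hypothetical.

```python
import random

def iou_wh(box, centroid):
    """IOU of two boxes given as (w, h), both aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_iou(boxes, k, iters=100, seed=0):
    """Cluster (w, h) tuples using d = 1 - IOU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The smallest 1 - IOU is the largest IOU.
            nearest = min(range(k), key=lambda i: 1.0 - iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_centroids.append((w, h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def avg_iou(boxes, centroids):
    """Mean IOU between each box and its closest centroid."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)

# Toy data only; real anchors would come from the training-set annotations.
toy_boxes = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.5, 4.5), (2.0, 8.0), (2.2, 7.5)]
anchors = kmeans_iou(toy_boxes, k=3)
print(anchors, avg_iou(toy_boxes, anchors))
```

The average IOU to the closest centroid is the quantity the paper uses to compare clustered and hand-picked anchor boxes.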
Direct location prediction
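Combining anchor boxes with unconstrained offset prediction makes the model unstable, especially in early iterations, because any anchor box can end up anywhere in the image. The paper therefore follows YOLO and predicts locations relative to the grid cell. For each anchor box the network predicts four offsets $(t_x, t_y, t_w, t_h)$ and a confidence term $t_o$, where $\sigma$ is the logistic sigmoid:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
$Pr(\text{object}) \cdot IOU(b, \text{object}) = \sigma(t_o)$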
Figure 3 of the paper illustrates how these values relate to the predicted box.
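Here $b_x, b_y$ are the center coordinates of the predicted box; $c_x, c_y$ are the offsets of the current grid cell's top-left corner from the top-left corner of the image; $b_w, b_h$ are the width and height of the predicted box; $p_w, p_h$ are the width and height of the anchor box; and $Pr(\text{object}) \cdot IOU(b, \text{object})$ is the confidence of the predicted box.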
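As a worked example of the decoding step, the sketch below applies the paper's equations b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h), with σ(t_o) as the confidence. All quantities are in grid-cell units; the function name `decode_box` and its argument layout are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, anchor):
    """Turn raw outputs (tx, ty, tw, th, to) into a box in grid-cell units.

    cell   -- (cx, cy): offset of the cell's top-left corner from the image's
    anchor -- (pw, ph): anchor box width and height
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx      # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # width and height rescale the anchor box
    bh = ph * math.exp(th)
    conf = sigmoid(to)         # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, conf

# All-zero outputs place the box at the center of the cell with the anchor's size.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(3.0, 2.0)))
```

Because the sigmoid bounds the center offsets to (0, 1), each prediction stays inside the cell that produced it, which is what stabilizes training.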
Fine-Grained Features
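To help the model detect small objects, the authors add a passthrough layer that turns the 26x26x512 feature map into a 13x13x2048 feature map and concatenates it with the 13x13x1024 feature map of the final layers. This is effectively a feature fusion step, and the concatenated output has 13x13x3072 dimensions. The change improves mAP by 1%.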
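The passthrough operation is a space-to-depth rearrangement: every 2x2 spatial block is stacked into the channel axis, so 26x26x512 becomes 13x13x2048. The sketch below shows the idea on nested lists indexed `[row][col][channel]`. The function names are hypothetical, and the channel ordering within each stacked block is an assumption that may differ from Darknet's own reorg layer.

```python
def passthrough(fmap, stride=2):
    """Stack each stride x stride spatial block into the channel axis."""
    h, w = len(fmap), len(fmap[0])
    assert h % stride == 0 and w % stride == 0
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            cell = []
            for di in range(stride):
                for dj in range(stride):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

def concat_channels(a, b):
    """Concatenate two feature maps of equal spatial size along channels."""
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]

# Small stand-ins for the 26x26x512 and 13x13x1024 maps.
fine = [[[r * 10 + c, -(r * 10 + c)] for c in range(4)] for r in range(4)]  # 4x4x2
coarse = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]                  # 2x2x3
fused = concat_channels(passthrough(fine), coarse)
print(len(fused), len(fused[0]), len(fused[0][0]))  # rows, cols, channels
```

With the paper's sizes the same arithmetic gives 512 x 4 = 2048 channels after the passthrough and 2048 + 1024 = 3072 after concatenation.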
Multi-Scale Training
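Because YOLOv2 uses only convolutional and pooling layers, its input size can be changed on the fly. To make the model robust to different input sizes, training picks a new input size at random from {320, 352, ..., 608} (multiples of 32, the network's downsampling factor) every 10 batches. Table 3 of the paper reports the results at different input sizes.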
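A minimal sketch of the size schedule, assuming a new size is drawn every 10 batches and held fixed in between. The generator name `input_size_schedule` and its parameters are hypothetical.

```python
import random

DOWNSAMPLE = 32
SIZES = list(range(320, 608 + 1, DOWNSAMPLE))  # 320, 352, ..., 608

def input_size_schedule(num_batches, every=10, seed=0):
    """Yield the input size for each batch, re-drawn every `every` batches."""
    rng = random.Random(seed)
    size = None
    for batch in range(num_batches):
        if batch % every == 0:
            size = rng.choice(SIZES)
        yield size

for batch, size in enumerate(input_size_schedule(30)):
    if batch % 10 == 0:
        # The output grid is size / 32 cells on each side.
        print(batch, size, size // DOWNSAMPLE)
```

Every candidate size is a multiple of 32, so the output grid ranges from 10x10 (at 320) to 19x19 (at 608).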

Faster

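Unlike YOLOv1, YOLOv2 uses a new architecture, Darknet-19, as its baseline model. It consists of 19 convolutional layers and 5 max-pooling layers, and it uses global average pooling to produce the classifier's predictions. Darknet-19 achieves 72.9% top-1 accuracy on ImageNet classification.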

YOLOv2 is first trained on classification with the following parameters:

classes: 1000
epoch: 160
image size: 224
start learning rate: 0.1, polynomial rate decay with power of 4
optimizer: SGD, weight decay: 0.0005, momentum: 0.9
finetune: img_size: 448, lr: 0.001, epoch: 10
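The polynomial decay listed above can be sketched as lr(t) = lr_0 · (1 - t/T)^4. The exact form is an assumption based on the usual definition of polynomial decay, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(step, total_steps, base_lr=0.1, power=4):
    """Polynomial decay from base_lr at step 0 to 0 at total_steps."""
    return base_lr * (1.0 - step / total_steps) ** power

total = 160  # treating one step as one epoch, for illustration only
for epoch in (0, 40, 80, 120, 160):
    print(epoch, poly_lr(epoch, total))
```

With a power of 4 the rate falls quickly at first: halfway through training it is already 1/16 of the starting value.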

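For detection, the authors remove the last convolutional layer and add three 3x3 convolutional layers with 1024 filters each, followed by a final 1x1 convolutional layer that produces the detection outputs. For VOC, each cell predicts 5 boxes, each with 5 coordinates (4 box offsets plus a confidence) and 20 class scores, so the output has size n x n x 5 x (5 + 20), i.e. n x n x 125, where n is the size of the final feature map. The training parameters are: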

epoch: 160
learning rate: 0.001, milestone: 60, 90, gamma: 0.1
optimizer: SGD, weight decay: 0.0005, momentum: 0.9
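The output size and the step schedule above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the network downsamples by a factor of 32, the learning rate is multiplied by gamma once an epoch reaches a milestone, and the helper names are hypothetical.

```python
def output_shape(input_size, num_anchors=5, num_classes=20, downsample=32):
    """Shape of the detection output: n x n x anchors * (5 + classes)."""
    n = input_size // downsample
    return (n, n, num_anchors * (5 + num_classes))

def step_lr(epoch, base_lr=0.001, milestones=(60, 90), gamma=0.1):
    """Multiply the learning rate by gamma at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

print(output_shape(416))
for epoch in (0, 59, 60, 90, 159):
    print(epoch, step_lr(epoch))
```

For a 416x416 VOC input this gives a 13x13 grid with 5 x (5 + 20) = 125 output channels per cell.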

Stronger

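During training, detection and classification datasets are mixed. When an image comes from the classification dataset, only the classification loss is backpropagated; when it comes from the detection dataset, the full loss is backpropagated. The difficulty is that classification labels usually cannot be merged directly with detection labels. For example, ImageNet splits dogs into many breeds, while COCO has only a single "dog" class, so the label sets do not line up. To solve this, the paper proposes hierarchical classification.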

Hierarchical classification

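The paper uses the is-a relations between concepts to build a WordTree rooted at "physical object", which makes it possible to merge the labels of different datasets. The probability of a class is obtained by multiplying the conditional probabilities along the path from that class up to the root. For example, the probability of Norfolk terrier is the product of the conditional probabilities of all its ancestors, and at the root $Pr(\text{physical object}) = 1$.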

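During training, softmax is not applied across all classes at once. Instead, a separate softmax is computed over each group of sibling classes at the same level of the tree. At prediction time, the tree is traversed downward from a node, taking the highest-probability child at each level, and the probabilities along the path are multiplied together, as in the Norfolk terrier example above.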
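The per-level softmax and the product along the path can be sketched on a toy tree. The tree below is hypothetical and far smaller than the real WordTree; `sibling_softmax` and `absolute_prob` are illustrative names, not functions from the paper's code.

```python
import math

# Hypothetical toy tree, child -> parent. The root is "physical object".
PARENT = {
    "animal": "physical object",
    "artifact": "physical object",
    "dog": "animal",
    "cat": "animal",
    "terrier": "dog",
    "hound": "dog",
}

def sibling_softmax(logits):
    """Softmax within each group of siblings (nodes sharing a parent)."""
    groups = {}
    for node in logits:
        groups.setdefault(PARENT[node], []).append(node)
    cond = {}
    for siblings in groups.values():
        m = max(logits[n] for n in siblings)  # subtract max for stability
        exps = {n: math.exp(logits[n] - m) for n in siblings}
        z = sum(exps.values())
        for n in siblings:
            cond[n] = exps[n] / z             # Pr(node | parent)
    return cond

def absolute_prob(node, cond):
    """Multiply conditional probabilities from node up to the root."""
    p = 1.0  # Pr(physical object) = 1
    while node in PARENT:
        p *= cond[node]
        node = PARENT[node]
    return p

logits = {name: 0.0 for name in PARENT}  # uniform scores, for illustration
cond = sibling_softmax(logits)
print(absolute_prob("terrier", cond))
```

Because each sibling group sums to 1, the absolute probabilities of the leaves also sum to 1, so the tree defines a valid distribution over the most specific classes.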

Joint classification and detection

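Having resolved the dataset-merging problem with WordTree, the paper combines the COCO detection dataset with the top 9000 classes of ImageNet and trains the YOLOv2 architecture on the mix using the method above. The resulting model, YOLO9000, can detect more than 9000 object categories and reaches 19.7 mAP on the ImageNet detection validation set.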

Conclusion

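YOLOv2 proposes a series of techniques, such as Darknet-19, anchor boxes, the passthrough layer, and joint training, to improve on YOLOv1. Calling it an improvement undersells the change, though: the whole architecture has been replaced, and only the idea of treating bounding-box prediction as a regression problem is kept. Even so, YOLOv2 does bring in many new ideas for improving that regression.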
