DSOD Notes

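Existing object detectors usually rely on models pre-trained on large-scale classification datasets such as ImageNet. This introduces a 'learning bias', because classification and detection differ in both their loss functions and their category distributions.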

Drawbacks of transfer learning

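Limited structure design space
Most pre-trained models come from the ImageNet classification task, so a detector that reuses them has little freedom to change the network structure
Learning bias
Classification and detection optimize different objectives, so the model may end up in a local minimum that is not the best solution for detection
Domain mismatch
For example, fine-tuning a network trained on ImageNet into a detector for medical images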

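Two questions
1. Is it possible to train a detection model from scratch?
2. If so, are there principles for designing a detection network that is resource-efficient while keeping SOTA accuracy?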

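The paper says DSOD is very flexible and can be "tailored" to environments with different computing resources, such as servers, desktops, mobile devices, or embedded devices.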

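Three main contributions (as claimed by the authors)

The first framework that can be trained from scratch and still reach SOTA results
Introduces a set of principles for designing "efficient object detection networks" and validates them through ablation studies
Reaches SOTA results on three benchmarks (PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO)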

DSOD

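Overall architecture

The network has two parts

Backbone: the feature extraction layers
A DenseNet variant: 1 stem block + 4 dense blocks + 2 transition layers + 2 transition w/o pooling layers
Front end: the prediction layers
Multi-scale prediction (as in SSD)

The design rules are discussed first; the overall architecture diagram comes at the end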

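Rule 1: Proposal-free

The paper investigates three types of detectors:
1. Those that use Selective Search to generate region proposals
2. Those that use an RPN to generate region proposals (relatively few proposals)
3. Those that need no region proposals

The paper observes that only the third type converges successfully without a pre-trained model.
The authors conjecture that this is because RoI pooling hinders gradients from being smoothly back-propagated from the region level to the feature maps.
They argue that in a pre-trained network the initial parameters of the layers before RoI pooling are already good enough, whereas training from scratch has no such initialization, so only proposal-free methods can be trained from scratch.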

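Rule 2: Deep Supervision

The core idea of "deeply supervised learning" is to let the loss function directly supervise the shallow layers (this is my own understanding).
This paper uses dense blocks as an implicit form of deep supervision.
The paper argues that shallow layers receive additional supervision from the objective function through skip connections.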

Some implementations of deep supervision instead attach multiple loss functions.

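Transition w/o Pooling Layer: the paper claims that this layer (a transition layer with the pooling operation removed) lets DenseNet increase the number of dense blocks without reducing the resolution of the final feature map. The original DenseNet has only 4 dense blocks, so the only way to make it deeper is to add layers inside the blocks.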
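
To make the resolution argument concrete, here is a minimal sketch that tracks the side length of the feature map through the backbone. The stage list, the 75x75 starting size (a 300x300 input after the stem), and the round-up behaviour of pooling (as in Caffe) are my assumptions for illustration, not something these notes state.

```python
import math

def track_resolution(size, stages):
    """Return (stage name, side length) after each stage.

    `stages` is a list of (name, has_pooling) pairs. A stage with pooling
    halves the side length, rounding up; any other stage leaves it unchanged.
    """
    trace = []
    for name, has_pooling in stages:
        if has_pooling:
            size = math.ceil(size / 2)
        trace.append((name, size))
    return trace

# Hypothetical stage list: only the ordinary transition layers pool.
stages = [
    ("dense block 1", False),
    ("transition", True),
    ("dense block 2", False),
    ("transition", True),
    ("dense block 3", False),
    ("transition w/o pooling", False),
    ("dense block 4", False),
    ("transition w/o pooling", False),
]

for name, side in track_resolution(75, stages):
    print(f"{name:24s} {side} x {side}")
```

Under these assumptions the last two dense blocks are added while the map stays at 19 x 19; replacing either "transition w/o pooling" entry with a pooling transition would halve it again.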

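Rule 3: Stem Block

In the paper, the stem block is a stack of three $3 \times 3$ convolution layers followed by a $2 \times 2$ max pooling layer; the first convolution has stride 2 and the other two have stride 1.

The paper reports that adding this block improved detection performance in their experiments. The original DenseNet instead starts with a $7 \times 7$ convolution (stride 2) followed by a $3 \times 3$ max pooling (stride 2).

The authors say they were inspired by Inception-v3 and v4
I am skeptical about this; as far as I know, the originator of this idea is VGG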
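
As a quick check on what the two stems do to the input size, the sketch below applies the standard convolution/pooling output-size formula. It assumes the stem block from the DSOD paper (three 3x3 convolutions with strides 2, 1, 1, then a 2x2 max pooling) and the original DenseNet stem (a 7x7 stride-2 convolution, then a 3x3 stride-2 max pooling). The padding values are my assumption, chosen as the usual "same"-style padding.

```python
def out_size(size, kernel, stride, pad):
    """Output side length of a conv or pooling layer (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1

def run_stack(size, layers):
    """Apply a list of (kernel, stride, pad) layers in order."""
    for kernel, stride, pad in layers:
        size = out_size(size, kernel, stride, pad)
    return size

# (kernel, stride, pad); the pad values are assumptions.
dsod_stem = [(3, 2, 1), (3, 1, 1), (3, 1, 1), (2, 2, 0)]
densenet_stem = [(7, 2, 3), (3, 2, 1)]

print(run_stack(300, dsod_stem))      # 75
print(run_stack(300, densenet_stem))  # 75
```

Both variants reduce a 300 x 300 input by a factor of 4, so any difference in detection performance comes from how the down-sampling is done (stacked small kernels versus one large kernel), not from the output resolution.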

Rule 4: Dense Prediction Structure

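The original SSD uses a (slightly modified) VGG16 as its backbone. For a $300 \times 300$ input image it produces outputs at six scales; the first scale corresponds to a $38 \times 38$ feature map (large feature maps detect small objects).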
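
To get a feel for how much the six scales contribute, this sketch counts default boxes per scale. The feature-map sizes and the number of boxes per location are the commonly cited SSD300 configuration, which I am assuming here; only the $38 \times 38$ first scale comes from these notes.

```python
# Assumed SSD300 configuration (not taken from these notes).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]

def count_default_boxes(sizes, per_location):
    """Total default boxes: one set of boxes at every feature-map cell."""
    return sum(side * side * boxes for side, boxes in zip(sizes, per_location))

for side, boxes in zip(feature_map_sizes, boxes_per_location):
    print(f"{side:2d} x {side:<2d} -> {side * side * boxes} boxes")

print("total:", count_default_boxes(feature_map_sizes, boxes_per_location))  # total: 8732
```

Under this configuration the first scale alone accounts for 5776 of the 8732 boxes, which fits the remark that large feature maps are responsible for small objects.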

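The dense prediction structure instead fuses information from multiple scales at each scale
Every scale is constrained to output the same number of channels for its prediction feature maps
In DSOD, each scale (except the first) concatenates in a down-sampled copy of the previous scale's feature maps, which makes up half of its channels; the other half is learned by convolution layers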

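Every down-sampling block has the same form: a $2 \times 2$ max pooling layer (stride 2) followed by a $1 \times 1$ convolution layer

The $1 \times 1$ convolution halves the number of channels
The max pooling comes first in order to reduce the computational cost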
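
The saving from pooling first can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations of the $1 \times 1$ convolution in both orders; the 38 x 38 map and the 512 input channels are hypothetical values chosen only for illustration.

```python
def conv1x1_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out

def downsample_cost(side, c_in, pool_first):
    """Cost of the 1x1 conv in a down-sampling block that halves the channels.

    If pooling runs first, the convolution sees a map with half the side
    length; otherwise it runs on the full-resolution map.
    """
    c_out = c_in // 2
    conv_side = side // 2 if pool_first else side
    return conv1x1_macs(conv_side, conv_side, c_in, c_out)

pool_then_conv = downsample_cost(38, 512, pool_first=True)
conv_then_pool = downsample_cost(38, 512, pool_first=False)
print(conv_then_pool / pool_then_conv)  # 4.0
```

Because $2 \times 2$ pooling with stride 2 leaves a quarter of the spatial positions, the convolution is about four times cheaper when it comes second.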

Learning Half and Reusing Half

Architecture diagram

Training Settings
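Framework: Caffe
GPU: TitanX

Unlike SSD, which applies L2 normalization only to the first scale, DSOD applies it to all outputs. Everything else, including data augmentation, the scales and aspect ratios of the default boxes, and the loss function, follows SSD.
The learning rate and batch size differ from SSD and are described in detail later.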
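
For reference, this is a minimal sketch of what L2 normalization does to the feature vector at one spatial location: it rescales the vector across channels to a fixed norm. The target scale of 20 follows common SSD implementations and is an assumption here; in practice the scale is often a learnable per-channel parameter, which this sketch leaves out.

```python
import math

def l2_normalize(vector, scale=20.0, eps=1e-10):
    """Rescale `vector` so that its L2 norm is (approximately) `scale`.

    `scale=20.0` is an assumed value; `eps` guards against division by zero.
    """
    norm = math.sqrt(sum(v * v for v in vector)) + eps
    return [scale * v / norm for v in vector]

features = [3.0, 4.0]            # hypothetical channel values at one location
normalized = l2_normalize(features)
print(normalized)                # approximately [12.0, 16.0]
```

Applying this to every output scale keeps the feature magnitudes of all prediction layers in a comparable range.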

Ablation Study on PASCAL VOC 2007

Models are trained on the VOC 2007 + 2012 trainval sets and tested on the VOC 2007 test set

Rows 1 and 2 compare the network with and without the Transition w/o Pooling Layer: a 1.7% mAP difference

The version with the Transition w/o Pooling Layer is better

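Rows 2 and 3 compare the compression factor $\theta$ of the transition layers: $\theta = 1$ versus $\theta = 0.5$ differ by 2.9% mAP

Not reducing the number of feature maps ($\theta = 1$) is better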
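
A small sketch of what the compression factor does: in DenseNet, a transition layer that receives $m$ feature maps outputs $\lfloor \theta m \rfloor$ of them. The input width of 256 is a hypothetical number for illustration.

```python
import math

def transition_channels(c_in, theta):
    """Number of feature maps a transition layer outputs for compression theta."""
    if not 0 < theta <= 1:
        raise ValueError("theta must be in (0, 1]")
    return math.floor(theta * c_in)

print(transition_channels(256, 1.0))  # 256
print(transition_channels(256, 0.5))  # 128
```

With $\theta = 1$ the transition layer passes every feature map on to the next dense block, which is the setting that scored higher in this comparison.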

Rows 3 and 4 compare the number of channels in the bottleneck layers: $c = 12$ versus $c = 64$ differ by 4.1% mAP

Wider bottleneck layers are better

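Rows 4 and 5 compare the number of channels in the first convolution layer: $c = 32$ versus $c = 64$ differ by 1.1% mAP

More channels in the first layer is better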

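Rows 5 and 6 compare the growth rate $k$: $k = 16$ versus $k = 48$ differ by 4.8% mAP

The growth rate is the number of channels that each layer inside a dense block adds. A higher value is better here, although the DenseNet paper notes that it does not need to be very large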
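
The effect of the growth rate on width is easy to work out: because every layer's output is concatenated onto its input, a dense block with $L$ layers grows from $c$ to $c + L \cdot k$ channels. The block input of 128 channels and the 6 layers below are hypothetical values.

```python
def dense_block_channels(c_in, num_layers, growth_rate):
    """Channel count seen at the input of the block and after each layer."""
    channels = [c_in]
    for _ in range(num_layers):
        # Each layer contributes `growth_rate` new feature maps,
        # which are concatenated onto everything produced so far.
        channels.append(channels[-1] + growth_rate)
    return channels

print(dense_block_channels(128, 6, 16)[-1])  # 224
print(dense_block_channels(128, 6, 48)[-1])  # 416
```

Going from $k = 16$ to $k = 48$ nearly doubles the block's output width in this example, so the accuracy gain comes with a noticeably larger model.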

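Rows 6 and 9 compare the network with and without the Stem Block: a 2.8% mAP difference

The paper hypothesizes that the Stem Block reduces the loss of information from the raw input image; this comparison confirms that having the Stem Block is indeed better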

Dense Prediction Structure

What if pre-training on ImageNet?

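The paper trains the DS/64-12-16-1 backbone on ImageNet, reaching 66.8% top-1 and 87.8% top-5 accuracy, then fine-tunes it on VOC (07+12 trainval) and evaluates on the VOC 07 test set, obtaining 70.3% mAP. The corresponding model trained from scratch obtains 70.8% mAP. The authors say a more detailed discussion may follow in future work.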

Some results follow

VOC2007
