FCOS: Fully Convolutional One-Stage Object Detection

tags: `paper notes` `deep learning`

Paper Link

Problems of anchor-based object detection

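Anchor-based object detection has the following problems:

Detection performance is sensitive to the sizes, aspect ratios, and number of anchors
Because anchor sizes and aspect ratios are fixed, the detector struggles with objects that have large shape variation, especially small objects
To reach a high recall rate, anchor-based methods place a very large number of anchors over every region of the image
- How many? More than 180K anchor boxes in feature pyramid networks (FPN) for an image with its shorter side being 800
- High recall means that a high proportion of the actual positives (ground-truth objects) are covered by the boxes classified as positive
Computing IoU is time-consuming and complex
- For example, the CIoU used in YOLOv4 has to compute not only the overlap area but also the center-point distance and the aspect ratio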
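To get a feel for the 180K figure, the count can be approximated with a few lines of arithmetic. This is a rough sketch under assumptions that are not in these notes: an 800×1333 input, FPN strides 8 to 128, and 9 anchors per location (a RetinaNet-style setting). The function name `count_locations` is hypothetical.

```python
import math

def count_locations(height, width, strides=(8, 16, 32, 64, 128)):
    """Number of feature-map locations summed over all pyramid levels."""
    return sum(math.ceil(height / s) * math.ceil(width / s) for s in strides)

# Assumed setting: shorter side 800, longer side 1333, 9 anchors per location.
locations = count_locations(800, 1333)
anchors = locations * 9

print(locations)  # 22300
print(anchors)    # 200700
```

Under these assumptions a detector that makes one prediction per location handles 22,300 locations, about 9x fewer candidates than the anchor-based count.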

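How positive and negative samples are distinguished

The main difference between anchor-based and anchor-free methods lies in how positive and negative samples are defined. Object detection usually distinguishes three kinds of samples:

Positive sample: a sample where an object of some class is present. It trains the object classifier and, at the same time, the regression of the bounding box offset.
Negative sample: a sample where no object is present (background). It trains the object classifier; common designs either use a background class or push the classification output of every object class to zero.
Ignore sample: a sample that does not take part in training.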

FCOS defines them as follows:

A location (x, y) is a positive sample if it falls inside any ground-truth box; its class label c* is the class label of that box
Otherwise, (x, y) is a negative sample with c* = 0 (background class)
If (x, y) falls inside more than one bounding box, it is treated as an ambiguous sample
- The ambiguous-sample problem is handled later with multi-level prediction
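The three cases can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `classify_location`, the status strings, and the choice to treat box borders as inside are all assumptions.

```python
def classify_location(x, y, boxes):
    """Classify an image location against ground-truth boxes.

    boxes: list of (x0, y0, x1, y1, c), with c >= 1 the class label.
    Returns (status, containing_boxes), where status is "negative",
    "positive", or "ambiguous".
    """
    # Assumption: a point on the border counts as inside the box.
    hits = [b for b in boxes if b[0] <= x <= b[2] and b[1] <= y <= b[3]]
    if not hits:
        return "negative", []      # background, c* = 0
    if len(hits) == 1:
        return "positive", hits    # c* is the class of the single box
    return "ambiguous", hits       # covered by several boxes

boxes = [(0, 0, 10, 10, 1), (5, 5, 20, 20, 2)]
print(classify_location(2, 2, boxes))    # inside the first box only
print(classify_location(7, 7, boxes))    # inside both boxes
print(classify_location(30, 30, boxes))  # inside no box
```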

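In practice, FCOS predicts the 4D vector (l, t, r, b) shown in the figure below: the four distances from the location (x, y) to the left, top, right, and bottom sides of the box

The regression target is the 4D vector (l*, t*, r*, b*)

For example, if (x, y) falls inside bbox \(B_i\), the training regression targets for the location (x, y) are

\[ l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y \]

The subtractions run in these directions because the ground truth is given by its left-top corner \((x_0^{(i)}, y_0^{(i)})\) and right-bottom corner \((x_1^{(i)}, y_1^{(i)})\), so all four distances are non-negative inside the box
The authors argue that an anchor-free method can use as many foreground samples as possible to train the regressor, and that this is one reason anchor-free detection performs better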
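The regression targets described above amount to four subtractions. A minimal sketch, assuming boxes are stored as (x0, y0, x1, y1, c) with the left-top and right-bottom corners; the function name `regression_targets` is hypothetical.

```python
def regression_targets(x, y, box):
    """Distances (l*, t*, r*, b*) from location (x, y) to the four sides of box."""
    x0, y0, x1, y1 = box[:4]
    return (x - x0, y - y0, x1 - x, y1 - y)

box = (10, 10, 50, 40, 1)
targets = regression_targets(28, 20, box)
print(targets)  # (18, 10, 22, 20)

# A location inside the box has only non-negative distances.
print(all(d >= 0 for d in targets))  # True
```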

Model

Architecture

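FCOS builds on FCN and adds FPN and focal loss

backbone CNN + FPN + C binary classifiers in place of one multi-class classifier
Each location outputs an 80-D vector \(p\) of classification labels (MS COCO has 80 classes) and a 4D vector (l, t, r, b)
The head is shared: the same head weights are applied to every feature level, so the three predictions (classification, center-ness, regression) at each pixel of each level come from one set of head parameters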


















# Output shapes per FPN level (B = batch size, C = number of classes)
# Centerness head
P3_ctrness: sigmoid(head(P3))   # [B, H/8, W/8, 1]
P4_ctrness: sigmoid(head(P4))   # [B, H/16, W/16, 1]
P5_ctrness: sigmoid(head(P5))   # [B, H/32, W/32, 1]
P6_ctrness: sigmoid(head(P6))   # [B, H/64, W/64, 1]
P7_ctrness: sigmoid(head(P7))   # [B, H/128, W/128, 1]
# Classification head (the product with center-ness is the score used at test time)
P3_class_prob: sigmoid(head(P3)) * P3_ctrness # [B, H/8, W/8, C]
P4_class_prob: sigmoid(head(P4)) * P4_ctrness # [B, H/16, W/16, C]
P5_class_prob: sigmoid(head(P5)) * P5_ctrness # [B, H/32, W/32, C]
P6_class_prob: sigmoid(head(P6)) * P6_ctrness # [B, H/64, W/64, C]
P7_class_prob: sigmoid(head(P7)) * P7_ctrness # [B, H/128, W/128, C]
# Regression head
P3_reg: conv2d(head(P3))   # [B, H/8, W/8, 4]
P4_reg: conv2d(head(P4))   # [B, H/16, W/16, 4]
P5_reg: conv2d(head(P5))   # [B, H/32, W/32, 4]
P6_reg: conv2d(head(P6))   # [B, H/64, W/64, 4]
P7_reg: conv2d(head(P7))   # [B, H/128, W/128, 4]

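Comparison with RetinaNet

The backbone and FPN of RetinaNet and FCOS are almost identical

FPN

FPN uses a simple top-down pathway with skip connections to address the problem that earlier detection networks could not efficiently recognize objects at multiple scales
- (a.): Rescale the image to several sizes and build feature maps of different sizes from them. Results are good, but compute and memory requirements are both high
- (b.): Use only CNN + pooling and predict from the last feature map. Most CNN models fall into this category. Memory use is low, but only the features of the last layer are used, ex: Fast RCNN & Faster RCNN
- (c.): Predict on feature maps of different sizes and fuse the predictions. Speed and results are decent, but features are not reused, which hurts small-object detection, ex: SSD
- (d.) FPN: fuse deep and shallow features for prediction. Shallow features help localize objects, deep features help recognize them
  
  FPN merges levels by addition; the 1x1 conv makes the channel counts match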
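The merge step can be illustrated on nested lists without any framework. This is a toy sketch under stated assumptions: feature maps are `[C][H][W]` lists, upsampling is nearest-neighbour 2x, and the names `conv1x1`, `upsample2x`, and `fpn_merge` are hypothetical rather than taken from any library.

```python
def conv1x1(feat, weight):
    """1x1 convolution: feat is [C_in][H][W], weight is [C_out][C_in]."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [[sum(weight[o][i] * feat[i][y][x] for i in range(c_in))
          for x in range(w)]
         for y in range(h)]
        for o in range(len(weight))
    ]

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] map."""
    return [
        [[ch[y // 2][x // 2] for x in range(2 * len(ch[0]))]
         for y in range(2 * len(ch))]
        for ch in feat
    ]

def fpn_merge(top, bottom_up, lateral_weight):
    """Add the upsampled deeper map to the channel-matched shallower map."""
    lateral = conv1x1(bottom_up, lateral_weight)  # match channel count
    up = upsample2x(top)                          # match spatial size
    return [
        [[up[c][y][x] + lateral[c][y][x] for x in range(len(up[c][0]))]
         for y in range(len(up[c]))]
        for c in range(len(up))
    ]

top = [[[1.0]]]                                      # 1 channel, 1x1
bottom_up = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]  # 2 channels, 2x2
merged = fpn_merge(top, bottom_up, [[1, 0.5]])        # 2 channels -> 1
print(merged)  # [[[7.0, 13.0], [19.0, 25.0]]]
```

The 1x1 convolution is only a per-pixel mix of channels, which is why it can align channel counts without touching spatial resolution.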

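Comparison with other anchor-free methods:

CornerNet has to pair top-left and bottom-right corners, which requires an extra distance metric and leads to complicated post-processing
DenseBox has difficulty with overlapping bboxes and has relatively low recall (FCOS addresses this with multi-level FCN prediction)
FCOS is proposal-free, does not need anchor-related hyperparameters to be designed, and can reuse existing FCN designs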

Anchor-based vs Anchor-free

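Anchor-based methods treat a location on the input image as the center of anchor boxes and regress the target bounding box with those anchor boxes as references
- FCOS instead regresses the target bounding box directly at the location, so locations rather than anchor boxes are the training samples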

Notation

Feature map of the \(i\)-th layer = \(F_i\)
ground-truth bounding boxes = \(B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)})\)
- \(x_0\) and \(y_0\) are the left-top corner of the box, \(x_1\) and \(y_1\) are the right-bottom corner of the box
- \(c\) is the class the object belongs to

Each location (x, y) on a backbone feature map \(F\) can be mapped back onto the input image as \((\lfloor\frac{s}{2}\rfloor + xs, \lfloor\frac{s}{2}\rfloor + ys)\), which lies near the center of the receptive field of (x, y). For example, with \(s = 8\), location (3, 2) maps to (4 + 24, 4 + 16) = (28, 20)

\(s\) = total stride until the layer. FCOS uses five FPN levels with stride = [8, 16, 32, 64, 128]















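# Compute the (x, y) image coordinates of every location on one feature level
import torch

def compute_locations(h, w, stride, device):
    # Top-left corner of each stride-sized cell along x and y
    shifts_x = torch.arange(
        0, w * stride, step=stride,
        dtype=torch.float32, device=device
    )
    shifts_y = torch.arange(
        0, h * stride, step=stride,
        dtype=torch.float32, device=device
    )
    # indexing="ij" keeps the (row, column) order that meshgrid used by default
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing="ij")
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)
    # Add floor(stride / 2) so each location sits near the center of its cell
    locations = torch.stack((shift_x, shift_y), dim=1) + stride // 2
    return locations  # shape [h * w, 2]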

loss function

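Focal loss: down-weights easy examples so that training focuses on hard examples

\[ FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \]

Here \(\alpha_t\) is the \(\alpha\)-balance weight, a value that directly lowers the weight of negative samples
- The authors added it because experiments showed slightly better performance with it
\(\gamma\) is the focusing parameter, must be >= 0, and controls the relative weight of hard and easy examples
When \(\alpha_t = 1\) and \(\gamma = 0\), focal loss = cross entropy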
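A small numeric check makes the down-weighting concrete. This is a minimal sketch for a single binary prediction; passing the already-selected per-sample weight as `alpha_t` is an assumed interface, and both function names are hypothetical.

```python
import math

def focal_loss(p, y, alpha_t=1.0, gamma=0.0):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class, 0 < p < 1
    y: ground-truth label, 1 for positive and 0 for negative
    alpha_t: balance weight already chosen for this sample's class
    gamma: focusing parameter, >= 0
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    """Plain binary cross entropy for one prediction."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

# With alpha_t = 1 and gamma = 0 the two losses coincide.
print(math.isclose(focal_loss(0.9, 1), cross_entropy(0.9, 1)))  # True

# With gamma = 2, an easy example (p_t = 0.9) keeps about 1% of its
# cross-entropy loss, while a hard example (p_t = 0.1) keeps about 81%.
easy_ratio = focal_loss(0.9, 1, gamma=2.0) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1, gamma=2.0) / cross_entropy(0.1, 1)
```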

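UnitBox IoU loss: computed only for pixels inside a ground-truth box, as a cross entropy with the IoU as input, i.e. \(L_{IoU} = -\ln(IoU)\)

Compared with the l2 loss, it treats the four bounding-box parameters jointly by combining them into a single IoU
The negative sign turns IoU, where larger is better, into a loss, where smaller is better; ln is only a simple mapping and could be replaced by another function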
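Because prediction and target are both expressed as distances from the same location, the IoU can be computed directly from the two 4D vectors. A minimal sketch, assuming all distances are positive; the function name `iou_loss` is hypothetical.

```python
import math

def iou_loss(pred, target):
    """-ln(IoU) for two boxes given as (l, t, r, b) distances from one location."""
    l, t, r, b = pred
    l_g, t_g, r_g, b_g = target
    area_pred = (l + r) * (t + b)
    area_gt = (l_g + r_g) * (t_g + b_g)
    # Both boxes contain the location, so the overlap on each side is
    # the smaller of the two distances.
    w_inter = min(l, l_g) + min(r, r_g)
    h_inter = min(t, t_g) + min(b, b_g)
    inter = w_inter * h_inter
    union = area_pred + area_gt - inter
    return -math.log(inter / union)

perfect = iou_loss((2, 2, 2, 2), (2, 2, 2, 2))  # IoU = 1
quarter = iou_loss((1, 1, 1, 1), (2, 2, 2, 2))  # IoU = 4 / 16
```

All four distances enter one scalar, which is the joint treatment the notes contrast with a per-coordinate l2 loss.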

inference

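Locations with p > 0.05 are taken as positive samples, and the predicted (l, t, r, b) at each of them is inverted to recover the box position (x - l, y - t, x + r, y + b)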
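Decoding is the inverse of the target computation. A minimal sketch on plain lists; the function name `decode_boxes` is hypothetical, the score cutoff is left as a parameter, and NMS is deliberately omitted.

```python
def decode_boxes(locations, distances, scores, threshold):
    """Turn per-location predictions into (x0, y0, x1, y1, score) boxes.

    locations: list of (x, y) image coordinates
    distances: list of predicted (l, t, r, b) per location
    scores: list of predicted class scores per location
    threshold: score cutoff for keeping a location
    """
    boxes = []
    for (x, y), (l, t, r, b), p in zip(locations, distances, scores):
        if p > threshold:
            boxes.append((x - l, y - t, x + r, y + b, p))
    return boxes

locations = [(28, 20), (60, 60)]
distances = [(18, 10, 22, 20), (5, 5, 5, 5)]
scores = [0.9, 0.01]
score_threshold = 0.05  # tunable cutoff
print(decode_boxes(locations, distances, scores, score_threshold))
# [(10, 10, 50, 40, 0.9)]
```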

Tricks for improvement

Multi-level Prediction with FPN for FCOS

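Motivation

FCOS can suffer from 1.) a low best possible recall (BPR) caused by the large stride of the final feature map in a CNN, and 2.) ambiguity during training caused by overlapping ground-truth bounding boxes, where it is unclear which box a location in the overlap should regress

How prediction differs from anchor-based detectors

Anchor-based detectors assign anchor boxes of different sizes to different feature levels
FCOS directly limits the range of bounding box regression for each feature level

Procedure

Compute l*, t*, r*, b* for every location on every feature level
If a location satisfies max(l*, t*, r*, b*) > \(m_i\) or max(l*, t*, r*, b*) < \(m_{i-1}\), it is set as a negative sample and no longer needs bounding box regression
- \(m_i\) = maximum distance that feature level \(i\) needs to regress
- In FCOS, m2, m3, m4, m5, m6 and m7 are set to 0, 64, 128, 256, 512, \(\infty\)
If a location is still covered by more than one ground-truth bounding box after this step, the bbox with the smallest area is chosen as its ground truth
Finally, following SSD and focal loss, they share the heads across feature levels, which makes FCOS more parameter-efficient and also improves performance
- They observed that different feature levels should not regress the same size range, since that would not be reasonable
- So they assign the size range [0:64] for P3 and [64:128] for P4, restricting each feature level to a fixed range of sizes
- To keep the head shared, they replace \(exp(x)\) with \(exp(s_ix)\), where \(s_i\) is a trainable scalar for level \(i\)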
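The level assignment and the smallest-area rule can be sketched directly from the thresholds listed above. The names `M`, `levels_for_target`, and `pick_smallest_box` are hypothetical, and the boundary handling follows the strict inequalities literally, which is an assumption about edge cases.

```python
import math

# m_2 ... m_7 as listed in the notes; level P_i keeps a location when
# m_{i-1} <= max(l*, t*, r*, b*) <= m_i.
M = {2: 0, 3: 64, 4: 128, 5: 256, 6: 512, 7: math.inf}

def levels_for_target(target):
    """Pyramid levels (3..7) on which a location with this target stays positive."""
    d = max(target)
    return [i for i in range(3, 8) if not (d > M[i] or d < M[i - 1])]

def pick_smallest_box(boxes):
    """Among boxes (x0, y0, x1, y1, c) still covering a location, keep the smallest."""
    return min(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

print(levels_for_target((10, 20, 30, 40)))  # [3]
print(levels_for_target((100, 5, 5, 5)))    # [4]
print(levels_for_target((600, 1, 1, 1)))    # [7]
print(pick_smallest_box([(0, 0, 100, 100, 1), (10, 10, 20, 20, 2)]))
# (10, 10, 20, 20, 2)
```

With the strict inequalities, a target whose maximum distance is exactly 64 stays positive on both P3 and P4; an implementation may prefer to break that tie one way.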

Center-ness for FCOS

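Problem

After multi-level prediction, FCOS still lags behind anchor-based detectors
They found the cause to be many low-quality bounding boxes predicted at locations far from the center of the target object

Solution

They add a branch, parallel to the classification branch, that predicts the center-ness of each location
- The center-ness is the normalized distance from the location to the center of the object it is responsible for, computed from the targets (l*, t*, r*, b*): \(centerness^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}\)
- The square root slows down the decay of the center-ness
- Center-ness ranges from 0 to 1, so it is trained with binary cross entropy, and this loss is added to the loss function above
At test time, final score = classification score * center-ness, so center-ness lowers the scores of these low-quality bounding boxes and lets the final NMS step filter them out
They also note that one could instead use only the central region of the ground-truth bounding box as positive samples, but that introduces an extra hyperparameter, so they chose not to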
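The center-ness target and its effect on the test-time score are easy to check numerically. A minimal sketch, assuming positive distances; the function names `centerness` and `final_score` are hypothetical.

```python
import math

def centerness(l, t, r, b):
    """Center-ness target: 1 at the box center, decaying toward the borders."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def final_score(class_score, ctr):
    """Test-time ranking score used before NMS."""
    return class_score * ctr

center = centerness(5, 5, 5, 5)   # location at the exact box center
corner = centerness(1, 1, 9, 9)   # location near the top-left corner

# The same classification score ranks far lower near the corner.
score_center = final_score(0.9, center)
score_corner = final_score(0.9, corner)
```

Without the square root the corner value would be 1/81 instead of 1/9, which shows how the root slows the decay.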

Effect of center-ness

Result

FCOS achieved state-of-the-art results at the time

Ablation Study

References

Articles

Papers

Focal Loss for Dense Object Detection (RetinaNet, Focal loss)
Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection (ATSS)
Feature Pyramid Networks for Object Detection (FPN)
