tags: `2022Q2技術研討`, `detection`

DETR (DEtection TRansformer)

Introduction

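This paper is the first time Facebook (FB) applied the transformer from NLP to object detection in CV. It treats object detection as a direct set prediction problem and removes many of the extra operations used in object detection (non-maximum suppression, anchor generation), giving a state-of-the-art detector. (State of the art as of 2020, sorry XD)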

github codes

Comparison with the traditional Faster R-CNN pipeline

The pipeline is very concise.

Model architecture

Transformer encoder / decoder architecture

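1. Backbone: run a CNN first, then reduce the channel dimension of its last feature map with a 1x1 convolution to get a d-channel feature map, and flatten it to shape (d, HW) so it can be fed into the transformer.
2. Transformer encoder: feed the output of step 1 into the transformer encoder, which applies multi-head self-attention.
   - The original transformer adds the positional encoding only once, at the input. DETR decided once was not enough and adds it in every encoder block.
3. Transformer decoder: the decoder uses N d-dimensional vectors as object query vectors.
   - The decoder also adds a positional encoding at every layer. Notably, the positional encoding here is the query vectors themselves.
4. Prediction feed-forward networks (FFNs): finally, FFNs produce N predictions. N is chosen to be larger than the number of objects in an image, so there is a special class called no object, which can be thought of as the image background.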
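
The four steps above can be followed as a sequence of tensor shapes. The sketch below is only a shape trace, not a model: it assumes a ResNet-50-style backbone (2048 output channels, stride 32), d = 256 and N = 100 as in the paper, and a hypothetical `num_classes=80`. The function name `detr_shapes` and the 800x1216 input size are made up for illustration.

```python
import math

def detr_shapes(h0, w0, num_queries=100, d_model=256, num_classes=80,
                backbone_channels=2048, stride=32):
    """Trace tensor shapes through the four stages (batch dimension omitted)."""
    # Assumption: the backbone downsamples by `stride`, rounding up.
    h, w = math.ceil(h0 / stride), math.ceil(w0 / stride)
    return {
        "image": (3, h0, w0),
        "backbone_features": (backbone_channels, h, w),  # last CNN stage
        "projected": (d_model, h, w),                    # after the 1x1 convolution
        "encoder_input": (d_model, h * w),               # flattened to (d, HW)
        "encoder_output": (d_model, h * w),              # self-attention keeps the shape
        "object_queries": (num_queries, d_model),        # N learned d-dim vectors
        "decoder_output": (num_queries, d_model),
        "class_logits": (num_queries, num_classes + 1),  # +1 for "no object"
        "boxes": (num_queries, 4),                       # (x, y, w, h) per prediction
    }

shapes = detr_shapes(800, 1216)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

Note that the number of predictions is always N, regardless of how many objects the image contains. The extra slots are expected to predict no object.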

Loss function overview

Bipartite matching

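Both sets contain N elements, so they can be paired up: every element on the left is matched to one element on the right, and no two left elements share the same right element, i.e. a one-to-one correspondence. There are N! such assignments, where N is the maximum number of objects the model can predict.
The matching problem is solved with the Hungarian algorithm.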
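
To make the N! assignments concrete, the sketch below finds the lowest-cost one-to-one matching by brute force over all permutations. This is deliberately not the Hungarian algorithm, which reaches the same optimum in polynomial time. In practice one would use a library routine such as SciPy's `scipy.optimize.linear_sum_assignment`. The cost values and the name `best_matching` are made up for illustration.

```python
from itertools import permutations

def best_matching(cost):
    """Return (sigma, total) minimizing sum(cost[i][sigma[i]]) over all N! permutations."""
    n = len(cost)
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total

# cost[i][j]: hypothetical cost of matching ground-truth i to prediction j
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
]
sigma, total = best_matching(cost)
print(sigma, round(total, 2))  # ground-truth i is matched to prediction sigma[i]
```

With N = 100 the brute-force search is hopeless (100! candidates), which is why a dedicated assignment solver is needed.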

Object detection set prediction loss

Notation:

- $y$ is the ground-truth set of objects.
- $\hat{y} = \{\hat{y}_{i}\}_{i=1}^{N}$ is the set of predictions.
- $N$ is assumed to be larger than the number of objects an image should contain (for the COCO dataset the authors chose N=100). For the subsequent matching, the ground truth must therefore be padded up to $N$ entries. The padding uses $\emptyset$ (meaning no object).
- $y_{i}$ is one ground-truth object. Each $y_{i}$ contains a class label and the four bounding-box values: $y_{i} = (c_{i}, b_{i})$ (class label and box coordinates x, y, w, h).
- For a given $y_{i}$, the matched prediction is $\hat{y}_{\sigma(i)}$ (found with the Hungarian algorithm).
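
With these symbols, the objective can be written in two steps, as in the DETR paper: first find the optimal assignment $\hat{\sigma}$, then sum the loss over the matched pairs. The symbols $L_{match}$ (the pairwise matching cost) and $\mathfrak{S}_{N}$ (the set of permutations of N elements) are the paper's notation and are not defined in the notes above.

$$
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_{N}} \sum_{i=1}^{N} L_{match}(y_{i}, \hat{y}_{\sigma(i)})
$$

$$
L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_{i}) + 1_{\{c_{i} \neq \emptyset\}} L_{box}(b_{i}, \hat{b}_{\hat{\sigma}(i)}) \right]
$$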

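$-\log \hat{p}_{\hat{\sigma}(i)}(c_{i})$ pushes the predicted classification to be as accurate as possible.
- $\hat{p}_{\hat{\sigma}(i)}(c_{i})$ is the probability that prediction $\hat{\sigma}(i)$ assigns to class $c_{i}$.
- When the classification is correct (probability = 1), this term is 0.
$1_{\{c_{i} \neq \emptyset\}} L_{box}(b_{i}, \hat{b}_{\hat{\sigma}(i)})$ measures how well the bounding boxes overlap.
- It combines an L1 loss with an IoU loss (the paper uses generalized IoU). With L1 alone, the loss depends too heavily on the size of the bounding box, so an IoU term is added.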
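
The two terms can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses plain IoU rather than generalized IoU, unit weights for both box terms, and probabilities stored in a dict. The names `box_loss`, `pad_targets`, `hungarian_loss`, and `NO_OBJECT` are hypothetical. The first demo shows why L1 alone is scale dependent: the same relative shift gives a larger L1 for the large box, while IoU stays the same.

```python
import math

NO_OBJECT = "no_object"  # stands for the empty-set class

def to_corners(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(a, b):
    """Plain IoU of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def box_loss(gt, pred, w_iou=1.0, w_l1=1.0):
    # Assumed unit weights; the IoU term is (1 - IoU).
    return w_iou * (1.0 - iou(gt, pred)) + w_l1 * l1(gt, pred)

def pad_targets(targets, n):
    """Pad the ground-truth list with (no object, None) entries up to n slots."""
    return targets + [(NO_OBJECT, None)] * (n - len(targets))

def hungarian_loss(targets, preds, sigma):
    """Sum the class and box terms over pairs (target i, prediction sigma[i])."""
    total = 0.0
    for i, (c_i, b_i) in enumerate(targets):
        probs, b_hat = preds[sigma[i]]
        total += -math.log(probs[c_i])       # classification term
        if c_i != NO_OBJECT:                 # box term only for real objects
            total += box_loss(b_i, b_hat)
    return total

# Same relative shift (20% of the width) applied to a small and a large box.
small_gt, small_pred = (0.5, 0.5, 0.1, 0.1), (0.52, 0.5, 0.1, 0.1)
large_gt, large_pred = (0.5, 0.5, 0.5, 0.5), (0.6, 0.5, 0.5, 0.5)
print(l1(small_gt, small_pred), l1(large_gt, large_pred))    # L1 differs
print(iou(small_gt, small_pred), iou(large_gt, large_pred))  # IoU matches

# One real object, padded to N = 2 slots, with a given matching sigma.
targets = pad_targets([("cat", (0.5, 0.5, 0.2, 0.2))], 2)
preds = [
    ({"cat": 0.1, NO_OBJECT: 0.9}, (0.1, 0.1, 0.1, 0.1)),
    ({"cat": 0.8, NO_OBJECT: 0.2}, (0.5, 0.5, 0.2, 0.2)),
]
loss = hungarian_loss(targets, preds, sigma=(1, 0))
print(loss)
```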

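Object queries (also called prediction slots)

Object queries are learnable embeddings that are independent of the content of the current input image (they are not computed from the current image).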

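The corresponding figure shows how the predictions for all objects in the COCO 2017 val set are distributed across the object queries. Each point is one object.

Green points are small bboxes

Red points are large horizontal bboxes

Blue points are large vertical bboxes

Object queries are randomly initialized and updated as the network trains, so they can be thought of as having learned statistics over the whole training set.
In object detection, each object query can be seen as a learnable, dynamic anchor.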

Results

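Highlights

Treats object detection as a set prediction problem and uses bipartite graph matching for label assignment. Accuracy is comparable to, and in some configurations slightly better than, a strengthened Faster R-CNN baseline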
parallel decoding
less computation (no nms)
anchor free

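Problems

Slow convergence (on the COCO benchmark, DETR needs 500 epochs to converge, which is 10 to 20 times slower than Faster R-CNN)
- At initialization, each query in the transformer gives nearly the same weight to every position, so the network needs a long training time before attention converges onto specific regions
Lower accuracy on small objects
- Only the last CNN layer is used, whose features mostly capture larger objects (the cost of computing attention weights in the transformer encoder grows with the square of the number of pixels, so a high-resolution feature map is not feasible)
When one image contains too many objects, prediction quality drops sharply (the training data contains few such cases)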
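
To put numbers on the small-object point above: the encoder's self-attention has one weight per pair of feature-map positions, so halving the stride multiplies the cost by 16. The sketch below assumes a hypothetical 800x1216 input and counts attention weights per head and per layer. `attention_entries` is a made-up helper name.

```python
import math

def attention_entries(h0, w0, stride):
    """Return (number of tokens HW, number of pairwise attention weights HW^2)."""
    tokens = math.ceil(h0 / stride) * math.ceil(w0 / stride)
    return tokens, tokens * tokens

for stride in (32, 16, 8):
    tokens, entries = attention_entries(800, 1216, stride)
    print(f"stride {stride:2d}: {tokens:6d} tokens, {entries:,} attention weights")
```

At stride 32 the map has 950 positions. At stride 8, the resolution that would help with small objects, it has 15200 positions and 256 times as many attention weights.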

References

https://www.gushiciku.cn/pl/gXi7/zh-tw
