Bottom-up Object Detection by Grouping Extreme and Center Points

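## Bottom-up Object Detection by Grouping Extreme and Center Points

<center class="font-s1"> A10715003 陳炫均<br> A10715006 張秋霞 </center>

---

## Outline

- 1.Introduction
- 2.Related Work
- 3.Preliminaries
- 4.ExtremeNet for Object detection
- 5.Experiments
- 6.Conclusion

---

## 1.Introduction

----

### 1.1 ExtremeNet

- One-stage detector category
- Bottom-up approach
- Four extreme points + center point
- COCO test-dev: 43.7% bounding-box AP

---

## 2.Related Work

----

### 2.1 Two-stage

- First extract region proposals, then refine each proposal in a second stage.
- Pros: high accuracy.
- Cons: slow.
- Examples: the R-CNN family.

----

### 2.1 One-stage

- Single-stage detectors compute detections directly from the image in one pass, so they are fast.
- ExtremeNet belongs to the one-stage family of object detectors.
- YOLO, SSD, and similar detectors place anchors in an \$O(h^2w^2)\$ space.
- ExtremeNet detects five independent parts in an \$O(hw)\$ space.

----

### 2.2 Top-down 🆚 Bottom-up

Taking pedestrian detection as an example:

- Top-down framework: first detect each person to obtain bounding boxes, then detect the body keypoints inside each box and connect them into that person's pose.
- Bottom-up framework: first detect every body keypoint across the whole image, then assemble the detected parts into each person's pose.

----

### 2.3 Implicit keypoint detection

- StarMap: uses a single heatmap to mix all keypoints during detection.
- ExtremeNet's keypoints are of the same kind as StarMap's, but with more explicit geometric properties.

----

### 2.4 Limitations of traditional object detection

- Because of the box shape (a rectangular region), predictions usually include a lot of irrelevant background.
- A large number of boxes are enumerated without any understanding of the object's actual visual content.
- A bounding box is a poor representation of the object itself, so it conveys only limited information about properties such as shape and pose.

----

### 2.5 ExtremeNet

![](https://i.imgur.com/Bplcc2o.png)

----

### 2.5 ExtremeNet

ExtremeNet's solution:

- Perform part detection bottom-up, in the spirit of the Deformable Part Model.
- Use a keypoint estimation network to predict four extreme-point heatmaps and one center-point heatmap per object class, then find valid point groups with a brute-force grouping algorithm (complexity \$O(n^4)\$).

----

### 2.6 ExtremeNet 🆚 CornerNet

Keypoint definition: corner points usually lie outside the object and often lack strong local features, whereas extreme points lie on the object boundary and are visually easier to identify.

Keypoint grouping: ExtremeNet groups extreme points purely by geometry, with no learned embedding features, and this works better.

---

## 3.Preliminaries

----

### 3.1 Extreme and center points

- ExtremeNet uses a four-point annotation: an object is represented by its extreme points in the top, left, bottom, and right directions $$(x^{(t)},y^{(t)}),(x^{(l)},y^{(l)}),(x^{(b)},y^{(b)}),(x^{(r)},y^{(r)})$$ The center point can then be computed from these four points: $$\left(\frac{x^{(l)}+x^{(r)}}{2},\frac{y^{(t)}+y^{(b)}}{2}\right)$$

----

### 3.2 Keypoint Detection

- A fully convolutional encoder-decoder network predicts a multi-channel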
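heatmap, with one channel for each type of keypoint.
- The Hourglass Network serves as the backbone, and each heatmap is trained with a **weighted pixel-wise logistic regression**, where the weighting reduces the false-positive penalty around the ground-truth location.

----

### 3.3 CornerNet

- ExtremeNet reuses CornerNet's network architecture and loss function, but drops its embedding layer.
- CornerNet performs keypoint detection with the Hourglass Network, predicting heatmaps for two opposing corners (top-left and bottom-right).

![](https://i.imgur.com/UQtHazo.png)

----

### 3.3 CornerNet

![](https://i.imgur.com/jr7HLYR.png)

----

### 3.4 Deep Extreme Cut

- Attaching Deep Extreme Cut (DEXTR) after ExtremeNet yields a more refined segmentation.

---

## 4.ExtremeNet for Object detection

----

### 4.1 ExtremeNet Detail

ExtremeNet outputs \$5\times C+4\times 2\$ channels. For each class it predicts four extreme-point heatmaps and one center-point heatmap. For each of the four extreme-point types it additionally predicts two offset maps, which are shared across classes. With the 80 COCO classes this gives \$5\times 80+4\times 2=408\$ channels.

5 (top, bottom, left, right, center), C (number of object classes), 4 (top, bottom, left, right), 2 (x/y directions)

![](https://i.imgur.com/zJNIdtP.png)

----

### 4.2 Center Grouping

**Two main steps**

- Step 1: **ExtremePeak**. Extract all peaks from each heatmap. A peak is a pixel whose value is the maximum within the 3x3 window around it and exceeds a preset threshold.
- Step 2: **Brute-force enumeration**. For every combination of four extreme points (one per direction), compute their geometric center. If the response of the center heatmap at that location exceeds a preset threshold, keep the five points as a candidate detection. The candidate's score is the average of the five point scores (top, bottom, left, right, center).

----

### 4.2 Center Grouping

![](https://i.imgur.com/DsISbZg.png =x550)

----

### 4.2 Center Grouping

![](https://i.imgur.com/3McTJat.png =x500)

----

### 4.3 Ghost box suppression

Problem: a ghost box arises when several objects of similar size are lined up side by side, so center grouping has more than one valid choice. Besides the correct boxes, it can produce a false large box that spans several objects.

![](https://i.imgur.com/Y66xzod.png)

Solution: suppress ghost boxes with a form of soft NMS (soft non-maxima suppression). If the summed score of all boxes contained inside a bounding box exceeds 3 times that box's own score, its score is halved.

----

### 4.4 Edge aggregation

Problem: extreme points are not always uniquely defined. If an object's extreme lies along a horizontal or vertical edge (for example, the roof of a car), every point on that edge can count as an extreme point, so the network produces a weak response spread along the whole edge instead of a single strong peak. This causes two problems. First, the weak responses may fall below the peak threshold, so the extreme point is missed entirely. Second, even when a response passes the threshold, its score may still be lower than that of a slightly rotated object, which produces a strong peak response.

![](https://i.imgur.com/8fFatHt.png =x80)![](https://i.imgur.com/gi1G6DI.png =x370)

----

### 4.4 Edge aggregation

Solution: for each extreme point, aggregate scores along the edge in both directions. Left and right extreme points are aggregated along the vertical direction, top and bottom extreme points along the horizontal direction. In each direction, the scores of the points within the first monotonically decreasing interval are added, with a weight, to the score of the original extreme point.

![](https://i.imgur.com/hL7r6qB.png =x400)

----

### 4.5 Extreme Instance Segmentation

- **Octagon mask**: first take the bounding rectangle defined by the four extreme points. Then extend each extreme point along its rectangle edge by 1/8 of the edge length in both directions, truncating at the rectangle corners. Finally, connect the eight endpoints to obtain the octagon estimate.
- **ExtremePoints+DEXTR**: given an image and its extreme points, DEXTR produces a class-agnostic segmentation mask.

---

## 5.Experiments

----

![](https://i.imgur.com/6y1WSsA.jpg =x500)

----

### 5.1 Extreme Instance Segmentation

**Dataset: MS COCO**

![](https://i.imgur.com/poXCAty.jpg)

----

### 5.2 Training details

**CornerNet**

- Input resolution 511x511, output resolution 128x128.
- Data augmentation consists of flipping, random scaling between 0.6 and 1.3, and random color jittering.
- Optimized with Adam at a learning rate of 2.5e-4.
- Trained on 10 GPUs for 500k iterations, equivalent to over 140 GPU days.

----

### 5.2 Training details

**ExtremeNet**

- Trained on 5 GPUs for 250k iterations with a batch size of 24.
- The learning rate is dropped 10x at 200k iterations.
- The model for the state-of-the-art comparison is trained from scratch on 5 GPUs for 500k iterations, with the learning rate dropped at 450k iterations.

----

### 5.3 Testing details

#### Details

- Keep the original image resolution.
- Use flip augmentation.
- For the main comparison, use 5x multi-scale (0.5, 0.75, 1, 1.25, 1.5) augmentation.
- Soft-NMS filters the detections from all augmentations.

#### Per-image test time

- Total test time: 322 ms (3.1 FPS)
- Network forward pass: 168 ms
- Decoding: 130 ms
- Image pre-processing and post-processing: the remaining time

----

### 5.4 Ablation studies

![](https://i.imgur.com/8Pdt6Br.png =x500)

----

### 5.5 State-of-the-art comparisons

![](https://i.imgur.com/30zJAxQ.png =x500)

----

### 5.6 Instance Segmentation

![](https://i.imgur.com/xrwvAEx.png =x500)

---

## 6.Conclusion

ExtremeNet is a novel object detection method based on bottom-up extreme-point estimation. It detects the four extreme points of each object and groups them in a purely geometric manner. Adding Deep Extreme Cut on top of its output additionally yields much more accurate instance segmentation.

---

## Thank you for listening!

---

<style>
.reveal .slides { text-align:left }
.reveal .slides >section h2 { font-size:55px; text-align:center; }
.reveal .slides >section h3 { font-size:45px; }
.reveal .slides >section p { font-size:30px; }
.reveal .slides >section ul li { font-size:30px; }
.center { text-align:center; }
.font-s1 { font-size:22px; }
</style>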
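
---

### Appendix: Center Grouping sketch

The two steps of center grouping in section 4.2 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `extract_peaks` and `center_group`, the default thresholds of 0.1, the single-class heatmap layout, and the simple validity check (top above bottom, left to the left of right) are all hypothetical choices made for this example.

```python
import numpy as np


def extract_peaks(heatmap, threshold=0.1):
    """Step 1: return (y, x, score) for every 3x3 local maximum above threshold."""
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    # Pad with -inf so border pixels are compared only against real neighbours.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # The maximum over the nine shifted views is a 3x3 sliding-window maximum.
    window_max = np.max(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    ys, xs = np.nonzero((heatmap >= window_max) & (heatmap > threshold))
    return [(int(y), int(x), float(heatmap[y, x])) for y, x in zip(ys, xs)]


def center_group(top, left, bottom, right, center,
                 peak_threshold=0.1, center_threshold=0.1):
    """Step 2: brute-force grouping; returns (x1, y1, x2, y2, score) tuples."""
    center = np.asarray(center, dtype=float)
    tops = extract_peaks(top, peak_threshold)
    lefts = extract_peaks(left, peak_threshold)
    bottoms = extract_peaks(bottom, peak_threshold)
    rights = extract_peaks(right, peak_threshold)

    detections = []
    # Four nested loops over the peak lists: the O(n^4) enumeration.
    for ty, tx, ts in tops:
        for ly, lx, ls in lefts:
            for by, bx, bs in bottoms:
                for ry, rx, rs in rights:
                    # Assumed validity check (image y grows downward):
                    # the top point is above the bottom point and the
                    # left point is to the left of the right point.
                    if ty > by or lx > rx:
                        continue
                    # Geometric center of the four extreme points.
                    cx = (lx + rx) // 2
                    cy = (ty + by) // 2
                    cs = float(center[cy, cx])
                    if cs < center_threshold:
                        continue
                    # Candidate score: average of the five point scores.
                    score = (ts + ls + bs + rs + cs) / 5.0
                    detections.append((lx, ty, rx, by, score))
    return detections
```

Calling `center_group` on the five heatmaps of one class returns candidate boxes, which would then go through ghost box suppression (4.3). The four nested loops make the \$O(n^4)\$ cost from section 2.5 explicit, which is affordable only because the number of peaks per heatmap is small.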