# Deformable ConvNets

###### tags: `paper notes` `deep learning`

[v1 paper](https://arxiv.org/abs/1703.06211) [v2 paper](https://arxiv.org/abs/1811.11168) [on ICCV17 youtube](https://www.youtube.com/watch?v=HRLMSrxw2To)

## Why Deformable Conv?

- Associated with anchor-free object detection
  - RepPoints v1 is effectively DCNv3
- SOTA backbone

![](https://i.imgur.com/W8IlqEx.png)
![](https://i.imgur.com/TuEdEhD.png)
![](https://i.imgur.com/D0wdq0n.png)

- Can be applied to any field where CNNs are used, mainly object detection
- Non-parametric method
  - No extra hyper-parameters to tune; all transformations are learned from training
- Lightweight
  - The amount of additional parameters and computation is small

## DCNv1

DCNv1 tries to solve a core problem in vision: **the same object can undergo all kinds of unknown geometric variations across different scenes and viewpoints**. At the time (2017) there were typically two kinds of solutions:

1. Use data augmentation to build images with various geometric variations, so the model learns more robust representations.
2. Use transformation-invariant algorithms such as SIFT.

Both have drawbacks:

1. Data augmentation is limited by the available samples (if the original data is too scarce, no amount of augmentation helps), and the augmentations are fixed, known transformations that cannot simulate unknown geometric changes.
2. Hand-designing complex transformations is barely feasible; at best we can design simple ones such as rotation or scaling.

DCNv1 therefore turns the CNN into a deformable CNN that can handle geometric transformations, expanding the receptive field to address the problems above.

![](https://i.imgur.com/YcE2g1b.jpg)

### Deformable Convolution

A 2D offset is added to the sampling locations of a regular convolution, and the offset values are learned by an additional convolution.

![](https://i.imgur.com/Kl8gABJ.png)

- (c) and (d) are special cases of the offsets: (c) is a scale transformation, (d) a rotation

Pipeline

- 3x3 deformable conv with dilation 1; the 2N channels are the x-axis and y-axis offsets

![](https://i.imgur.com/tZwOCIg.png)

1. Sample a grid region $R$ over the input feature map using a rectangular kernel
   - 3x3 grid $R$ = {(-1, -1), (-1, 0), ..., (0, 1), (1, 1)}

| (-1, -1) | (0, -1) | (1, -1) |
| --- | --- | --- |
| (-1, 0) | $p_0$ | (1, 0) |
| (-1, 1) | (0, 1) | (1, 1) |

2. Multiply the sampled values by the weights of the rectangular kernel $w$ and sum them across the kernel to give a single scalar value
   - Standard conv for each location $p_0$: $y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$
     - $p_0$ is a location on the input feature map
     - $p_n$ is the $n$-th point on the grid $R$

   ![](https://i.imgur.com/g8XkXKF.png)

   - Deformable conv for each location $p_0$: $y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \triangle p_n)$
     - $\triangle p_n$ is the offset value, typically fractional

   ![](https://i.imgur.com/X729dLh.png)

   - Here, $p = p_0 + p_n + \triangle p_n$ is an arbitrary location

Since the offsets are typically fractional, we need a way to compute the value at the shifted location; this paper uses bilinear interpolation, $x(p) = \sum_q G(q, p) \cdot x(q)$

- $q$ enumerates all the valid positions on the input feature map, $G$ is the bilinear interpolation kernel

![](https://i.imgur.com/uWsYNmu.png)

### Deformable RoI Pooling

![](https://i.imgur.com/mi8ARDN.png)

The RoI pooling here is the one from Fast R-CNN: candidate boxes produced by selective search are mapped onto the feature map.

![](https://i.imgur.com/elFNByR.gif)

- Vanilla RoI pooling splits each RoI into a fixed number of bins and pools each bin separately to get a fixed-size feature map
  - DCNv1 uses average pooling, whereas Fast R-CNN uses max pooling

![](https://i.imgur.com/BnqxPcR.png)

Deformable RoI pooling

- The difference is again the offsets, and bilinear interpolation is used here as well

![](https://i.imgur.com/DmYi2Q5.png)

### Implementation of Deformable ConvNets

![](https://i.imgur.com/iJ83X3y.png)

```python
import torch.nn as nn
import torch.nn.functional as F

# ConvOffset2D is the offset-learning layer from the reference implementation
# this snippet is based on (e.g. pytorch-deform-conv): a plain conv predicts
# 2N offsets and resamples the feature map with bilinear interpolation.

class DeformConvNet(nn.Module):
    def __init__(self):
        super(DeformConvNet, self).__init__()
        # conv11: regular conv stem
        self.conv11 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn11 = nn.BatchNorm2d(32)
        # conv12: deformable (offset layer + conv)
        self.offset12 = ConvOffset2D(32)
        self.conv12 = nn.Conv2d(32, 64, 3, padding=1, stride=2)
        self.bn12 = nn.BatchNorm2d(64)
        # conv21
        self.offset21 = ConvOffset2D(64)
        self.conv21 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn21 = nn.BatchNorm2d(128)
        # conv22
        self.offset22 = ConvOffset2D(128)
        self.conv22 = nn.Conv2d(128, 128, 3, padding=1, stride=2)
        self.bn22 = nn.BatchNorm2d(128)
        # out
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv11(x))
        x = self.bn11(x)
        x = self.offset12(x)
        x = F.relu(self.conv12(x))
        x = self.bn12(x)
        x = self.offset21(x)
        x = F.relu(self.conv21(x))
        x = self.bn21(x)
        x = self.offset22(x)
        x = F.relu(self.conv22(x))
        x = self.bn22(x)
        # global average pooling, then classify
        x = F.avg_pool2d(x, kernel_size=[x.size(2), x.size(3)])
        x = self.fc(x.view(x.size()[:2]))
        x = F.softmax(x, dim=1)
        return x
```
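As a quick cross-check of the 2N offset-channel convention, here is a minimal sketch using torchvision's `deform_conv2d` op (linked in the references at the end); the tensor shapes and the `offset_conv` layer are illustrative assumptions, not the paper's code:

```python
import torch
from torchvision.ops import deform_conv2d

# Illustrative shapes: batch 1, 32 channels, 56x56 feature map.
x = torch.randn(1, 32, 56, 56)

# A 3x3 kernel has N = 9 sampling points, so the offset map needs 2N = 18
# channels (one 2D offset per sampling point); in DCNv1 an extra conv
# predicts these offsets from the same input feature map.
offset_conv = torch.nn.Conv2d(32, 2 * 3 * 3, kernel_size=3, padding=1)
offset = offset_conv(x)

weight = torch.randn(64, 32, 3, 3)  # the regular conv kernel w
out = deform_conv2d(x, offset, weight, padding=1)
print(out.shape)  # torch.Size([1, 64, 56, 56])
```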
### Result of DCNv1

Receptive fields of DCN on small, medium, and large objects:

![](https://i.imgur.com/9nXk8YK.jpg)
![](https://i.imgur.com/xI39lt6.png)
![](https://i.imgur.com/K7pV6fN.png)

## DCNv2: More Deformable, Better Results

### Problems with DCNv1

- The receptive field expanded via the offsets is not confined to the object, so the features are perturbed by pixels irrelevant to the object itself
  - They add a modulation mechanism so the model can also learn how interested it is in each piece of the input feature
  - The offset feature map grows from 2N to 3N channels
- The figures below are visualizations extracted from conv5 of a Faster R-CNN + ResNet-50 baseline; from top to bottom: effective sampling locations, effective receptive field, error-bounded saliency; from left to right: small object, large object, background

![](https://i.imgur.com/k65liK1.png)
![](https://i.imgur.com/LYyJaLp.png)
![](https://i.imgur.com/rkZBsPf.png)

- (c) shows no sampling locations because they look the same as in (b)

### DCNv2's improvements beyond modulation

**Add more deformable convolution layers** (3 layers in the conv5 stage → 12 layers across the conv3, conv4, and conv5 stages of ResNet-50) → needs an effective training method → knowledge distillation with R-CNN as the teacher + **a feature mimicking loss**

![](https://i.imgur.com/VNXDHju.png)

### Modulated deformable convolution

$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \triangle p_k) \cdot \triangle m_k$

- $\triangle m_k$ is the modulation scalar, lying between 0 and 1
- $w_k$ is the same kernel weight $w$ as before, now indexed per sampling point $k$
- $K$ is the number of sampling locations; the sum is written per sampling point rather than over the whole sampling region $R$

![](https://i.imgur.com/pnm4oEk.png)
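To make the 2N → 3N bookkeeping concrete, below is a minimal sketch of a modulated deformable layer built on torchvision's `deform_conv2d` (its `mask` argument implements the modulation $\triangle m_k$). The class name, zero initialization, and channel split are assumptions for illustration, not the paper's released code:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv2d(nn.Module):
    """DCNv2-style layer: one plain conv predicts 3K channels per location
    (2K offsets + K modulation scalars, K = kh * kw sampling points)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        k = kernel_size * kernel_size
        # Zero-init so the layer starts out like a regular convolution
        # (zero offsets, sigmoid(0) = 0.5 modulation everywhere).
        self.offset_mask = nn.Conv2d(in_ch, 3 * k, kernel_size, padding=padding)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight)
        self.padding = padding

    def forward(self, x):
        om = self.offset_mask(x)
        two_k = om.shape[1] // 3 * 2           # first 2K channels: offsets
        offset, mask = om[:, :two_k], om[:, two_k:]
        mask = torch.sigmoid(mask)             # modulation in [0, 1], like delta m_k
        return deform_conv2d(x, offset, self.weight, padding=self.padding, mask=mask)

# Hypothetical usage:
layer = ModulatedDeformConv2d(32, 64)
y = layer(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)
```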
### Modulated deformable RoI Pooling

Procedure:

1. Given an input RoI, RoI pooling divides it into $K$ spatial bins (e.g. 7x7).
2. Within each bin, sampling grids of even spatial intervals are applied (e.g. 2x2).
3. The sampled values on the grids are averaged to compute the bin output.

![](https://i.imgur.com/KXASWby.png)

- $p_{kj}$ is the $j$-th grid cell in the $k$-th bin
- $n_k$ is the number of sampled grid cells

### R-CNN Feature Mimicking

- Feature mimicking in knowledge distillation: while the student model is trained against the ground truth, an extra loss also supervises the student's feature maps with the teacher's
- It is used here because the error-bounded saliency visualizations pick up many regions **outside the RoI (redundant context)**, and this problem traces back to Faster R-CNN itself (not their own finding, but from the experiments in [Revisiting RCNN: On Awakening the Classification Power of Faster RCNN, ECCV, 2018](https://arxiv.org/abs/1803.06799))
- To address this, the classification scores of R-CNN and Faster R-CNN are combined into the final classification score
- In this architecture, the R-CNN branch takes the cropped image and the proposed RoI as input, so its features better reflect the object itself, which mitigates the problem
- But such an architecture makes training very slow...
- During training, the two fc layers and the RoI pooling are shared between the branches; at inference the R-CNN branch is not used

![](https://i.imgur.com/z5lvLrH.png)

### Loss: Cosine similarity between RCNN and FRCNN

$L_{mimic} = \sum_{b \in \Omega} \left[ 1 - \cos\left(f_{RCNN}(b), f_{FRCNN}(b)\right) \right]$

- $b$: an RoI
- $\Omega$ is the RoI set, 32 positive region proposals randomly sampled from those generated by the RPN

![](https://i.imgur.com/URtOSjB.png)
![](https://i.imgur.com/cp6bAUZ.jpg)

- Final loss = feature mimic loss + R-CNN classification loss + Faster R-CNN loss
- **The loss weights of the two newly introduced loss terms are 0.1 times those of the original Faster R-CNN loss terms**

## Summary

- DCN can be applied to any CNN model; the most common combinations are ResNeXt + DCNv2 or Res2Net + DCNv2
  - Res2Net is also quite interesting; maybe a topic for another time

![](https://i.imgur.com/0viYJ8E.png)
![](https://i.imgur.com/XVEMWm5.png)

- DCN does extend a CNN's ability to handle geometric deformation, but it also makes the attended regions less precise
  - The reason so many extra regions get picked up is that DCNv2 does not supervise the offset learning; RepPoints takes over that problem
- It does not add much computation

![](https://i.imgur.com/Egrmv46.png)

- Treat R-CNN feature mimicking as a reference only; it does not transfer to other models. The authors also applied this scheme to regular convolutions, with little effect

![](https://i.imgur.com/tiHnVzf.png)
![](https://i.imgur.com/BESiC1r.png)

## References

- PyTorch provides a `deform_conv2d` API for deformable v2: [deform_conv2d - Torchvision main documentation](https://pytorch.org/vision/main/generated/torchvision.ops.deform_conv2d.html#torchvision.ops.deform_conv2d)
- Source code: [torchvision.ops.deform_conv - Torchvision 0.12 documentation](https://pytorch.org/vision/stable/_modules/torchvision/ops/deform_conv.html)
- Articles: [Deformable Convolutions Demystified](https://towardsdatascience.com/deformable-convolutions-demystified-2a77498699e8)
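As a closing sketch, the feature-mimicking loss $L_{mimic}$ from the DCNv2 section above can be written in a few lines of PyTorch; the function name and the (32, 1024) per-RoI feature shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def feature_mimic_loss(feat_frcnn: torch.Tensor, feat_rcnn: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between Faster R-CNN and R-CNN per-RoI features,
    accumulated over the sampled RoI set (Omega in the paper; the paper writes
    a sum, averaged here as a normalization choice).
    Both inputs are (num_rois, feat_dim)."""
    sim = F.cosine_similarity(feat_frcnn, feat_rcnn, dim=1)
    return (1.0 - sim).mean()

# Hypothetical usage: 32 positive RoIs with 1024-d fc features from each branch
# (the note above says the two branches share their fc weights during training).
f_frcnn = torch.randn(32, 1024, requires_grad=True)
f_rcnn = torch.randn(32, 1024, requires_grad=True)
loss = feature_mimic_loss(f_frcnn, f_rcnn)
loss.backward()
```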