# Region of Interest (RoI)
An RoI is a proposed region of the original image.
## RoI Pooling
RoI Pooling is mainly used in [Fast R-CNN](https://arxiv.org/pdf/1504.08083).

### How it works
As the excerpt below shows, Fast R-CNN implements RoI Pooling with max pooling, converting the features inside each RoI into a fixed-size feature map (7×7 here):
>The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

An example:
Suppose you have an 8×8 feature map, an RoI covering a 6×6 region, and you want a 3×3 output.
- Split the 6×6 region into a 3×3 grid of bins, each roughly 2×2.
- Take the maximum value inside each 2×2 bin to form the 3×3 output.
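The example above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the Fast R-CNN implementation: because the 6×6 RoI divides evenly into 2×2 bins here, plain max pooling suffices, whereas real RoI pooling must also handle bins that do not divide evenly.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the 8x8 example: take a 6x6 RoI out of an 8x8
# feature map and max-pool it into a 3x3 grid of 2x2 bins.
feature_map = torch.arange(64, dtype=torch.float32).reshape(8, 8)
roi = feature_map[0:6, 0:6]  # a 6x6 RoI at the top-left corner

# each of the 3x3 bins is exactly 2x2 here, so plain max pooling works
pooled = F.max_pool2d(roi[None, None], kernel_size=2).squeeze()
print(pooled.shape)  # torch.Size([3, 3])
```

Each output cell holds the maximum of its 2×2 bin, so the top-left cell is `feature_map[1, 1]`.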
Why do this?
- Fully connected (FC) layers need fixed-size inputs, but RoI regions come in all sizes.
- RoI max pooling lets every candidate region produce a fixed-size feature, which simplifies the downstream classification and localization steps.

Relation to SPPnet
- SPPnet (Spatial Pyramid Pooling) is an earlier, similar method that pools with multiple grids of different sizes.
- RoI pooling is the special case of SPPnet that uses a single pooling grid of a fixed size.
### :warning: The problem
What if $h$ and $H$ (or $w$ and $W$) are not integer multiples of each other?
reference: [目標檢測 RoI Pool 和 RoI Align 的區別](https://blog.csdn.net/wzk4869/article/details/128561590)
As the figure shows, a 665×665 RoI mapped onto a stride-32 feature map should span $665/32 = 20.78$ cells, but there is no such thing as 0.78 of a cell, so it is quantized (Quantization) into a 20×20 window. Pooling that 20×20 window into a 7×7 feature map then needs bins of $20/7 = 2.86$ cells, which are in turn quantized to 2×2.

These quantizations accumulate a lot of pixel error:
- the 7×7 stage drops 0.86 of a cell per bin, a pixel error of 36.2404
$${0.86}^{2} \times (7 \times 7) = 36.2404$$
- the 20×20 stage drops 0.78 of a stride-32 cell, a pixel error of 623.0016 in the original image
$${0.78}^2 \times (32 \times 32) = 623.0016$$
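The arithmetic can be checked directly; fractions are rounded to two decimals to match the reference's numbers:

```python
# Reproduce the quantization-error arithmetic for the 665x665 example.
roi_side, stride, out_bins = 665, 32, 7

cells = roi_side / stride                 # 20.78 feature-map cells
frac1 = round(cells - int(cells), 2)      # 0.78 of a cell is dropped
err1 = frac1 ** 2 * stride ** 2           # error in original-image pixels

per_bin = int(cells) / out_bins           # 20 / 7 = 2.86 cells per bin
frac2 = round(per_bin - int(per_bin), 2)  # 0.86 dropped per bin
err2 = frac2 ** 2 * out_bins ** 2

print(round(err1, 4), round(err2, 4))
```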
This is why Mask R-CNN later proposed RoI Align.
## RoI Align
RoI Align was introduced in the 2017 [Mask R-CNN](https://arxiv.org/pdf/1703.06870) paper by Kaiming He et al.

### How it works
reference: [Understanding Region of Interest - Part 2 (RoI Align)](https://erdem.pl/2020/02/understanding-region-of-interest-part-2-ro-i-align)
The root cause is that RoI Pooling **quantizes** floating-point coordinates into integers, which is why
>While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks
So RoI Align avoids this **quantization** step entirely
>To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries
and instead uses bilinear interpolation to compute feature values at these floating-point locations
>We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.
See the figure in [Instance Segmentation](https://tjmachinelearning.com/lectures/1718/instance/instance.pdf); the interpolation works as follows.

### Math
reference: [Understanding Region of Interest - Part 2 (RoI Align)](https://erdem.pl/2020/02/understanding-region-of-interest-part-2-ro-i-align)
If you go straight to the bilinear-interpolation formulas in the reference, it is hard to see where the **four sampling points** come from, so it is easier to read the source code directly.
The first step splits the whole region evenly into a 3×3 grid.

Compute the center point P of the yellow box, then find the centers of the four nearest pixels (the red points).

Finally, bilinearly interpolate using the relative distances between P and those four pixel centers to obtain the pixel value at P.
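A minimal numeric sketch of this weighting; the values v1..v4 and the offsets are made up:

```python
# Toy bilinear interpolation at point P, given the values at its four
# neighboring pixel centers. All numbers here are made up.
def bilinear(v1, v2, v3, v4, ly, lx):
    # ly, lx: fractional offsets of P from the top-left neighbor, in [0, 1)
    hy, hx = 1.0 - ly, 1.0 - lx
    return hy * hx * v1 + hy * lx * v2 + ly * hx * v3 + ly * lx * v4

print(bilinear(10, 20, 30, 40, 0.0, 0.0))  # P on v1 -> 10.0
print(bilinear(10, 20, 30, 40, 0.5, 0.5))  # P at the cell center -> 25.0
```

When P coincides with a neighbor, that neighbor's value is returned exactly; at the center of the 2×2 cell the result is the plain average.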

## RoI Align Source Code
[Source code for torchvision.ops.roi_align](https://docs.pytorch.org/vision/master/_modules/torchvision/ops/roi_align.html)
### Bin

### Bilinear Interpolation
In the shape annotations below:
- K: number of RoIs
- C: number of channels
- PH, PW: number of sampling bins per RoI (usually the pooled feature size)
- IY, IX: number of sampling points inside each bin (the bilinear-interpolation subdivisions)
:::spoiler
```python
# NB: all inputs are tensors
def _bilinear_interpolate(
    input,  # [N, C, H, W]
    roi_batch_ind,  # [K]
    y,  # [K, PH, IY]
    x,  # [K, PW, IX]
    ymask,  # [K, IY]
    xmask,  # [K, IX]
):
    _, channels, height, width = input.size()

    # deal with inverse element out of feature map boundary
    y = y.clamp(min=0)
    x = x.clamp(min=0)
    y_low = y.int()
    x_low = x.int()
    y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
    y_low = torch.where(y_low >= height - 1, height - 1, y_low)
    y = torch.where(y_low >= height - 1, y.to(input.dtype), y)

    x_high = torch.where(x_low >= width - 1, width - 1, x_low + 1)
    x_low = torch.where(x_low >= width - 1, width - 1, x_low)
    x = torch.where(x_low >= width - 1, x.to(input.dtype), x)

    ly = y - y_low
    lx = x - x_low
    hy = 1.0 - ly
    hx = 1.0 - lx

    # do bilinear interpolation, but respect the masking!
    # TODO: It's possible the masking here is unnecessary if y and
    # x were clamped appropriately; hard to tell
    def masked_index(
        y,  # [K, PH, IY]
        x,  # [K, PW, IX]
    ):
        if ymask is not None:
            assert xmask is not None
            y = torch.where(ymask[:, None, :], y, 0)
            x = torch.where(xmask[:, None, :], x, 0)
        return input[
            roi_batch_ind[:, None, None, None, None, None],
            torch.arange(channels, device=input.device)[None, :, None, None, None, None],
            y[:, None, :, None, :, None],  # prev [K, PH, IY]
            x[:, None, None, :, None, :],  # prev [K, PW, IX]
        ]  # [K, C, PH, PW, IY, IX]

    v1 = masked_index(y_low, x_low)
    v2 = masked_index(y_low, x_high)
    v3 = masked_index(y_high, x_low)
    v4 = masked_index(y_high, x_high)

    # all ws preemptively [K, C, PH, PW, IY, IX]
    def outer_prod(y, x):
        return y[:, None, :, None, :, None] * x[:, None, None, :, None, :]

    w1 = outer_prod(hy, hx)
    w2 = outer_prod(hy, lx)
    w3 = outer_prod(ly, hx)
    w4 = outer_prod(ly, lx)

    val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4
    return val
```
:::
Bilinear interpolation computes the pixel value at a non-integer coordinate as a weighted average of the 4 surrounding pixels:
```bash
         x_low      x_high
y_low     v1 -------- v2
          |    •P     |
          |   (x,y)   |
y_high    v3 -------- v4
```
1. Boundary handling
```python
y = y.clamp(min=0)
x = x.clamp(min=0)
```
Ensures the coordinates are never negative.
2. The four sampling points
```python
y_low = y.int()  # floor (truncate toward zero)
x_low = x.int()
# boundary case: if already at the edge, the high coordinate equals the low one
y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
y_low = torch.where(y_low >= height - 1, height - 1, y_low)
y = torch.where(y_low >= height - 1, y.to(input.dtype), y)
x_high = torch.where(x_low >= width - 1, width - 1, x_low + 1)
x_low = torch.where(x_low >= width - 1, width - 1, x_low)
x = torch.where(x_low >= width - 1, x.to(input.dtype), x)
```
- y_low, x_low: the top-left neighbor
- y_high, x_high: the bottom-right neighbor
- boundary handling keeps the indices inside [0, height-1] and [0, width-1]
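Tracing this logic for a height-8 feature map with two made-up sample coordinates: an interior coordinate gets floor and floor+1 as neighbors, while a coordinate at the last row collapses both neighbors to height-1:

```python
import torch

# Trace the boundary handling above: one interior coordinate (2.7)
# and one at the bottom edge (7.0) of a height-8 feature map.
height = 8
y = torch.tensor([2.7, 7.0])
y_low = y.int()
y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
y_low = torch.where(y_low >= height - 1, height - 1, y_low)
print(y_low.tolist(), y_high.tolist())  # [2, 7] [3, 7]
```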
3. Computing the interpolation weights
```python
ly = y - y_low  # fractional part in y, in [0, 1)
lx = x - x_low  # fractional part in x, in [0, 1)
hy = 1.0 - ly   # complementary weight in y
hx = 1.0 - lx   # complementary weight in x
```

4. Fetching the four neighbors' pixel values
```python
def masked_index(y, x):
    if ymask is not None:
        # zero out invalid coordinates (used for adaptive sampling)
        y = torch.where(ymask[:, None, :], y, 0)
        x = torch.where(xmask[:, None, :], x, 0)
    return input[
        roi_batch_ind[:, None, None, None, None, None],  # select the batch
        torch.arange(channels, device=input.device)[None, :, None, None, None, None],  # all channels
        y[:, None, :, None, :, None],  # y indices [K, PH, IY]
        x[:, None, None, :, None, :],  # x indices [K, PW, IX]
    ]  # output: [K, C, PH, PW, IY, IX]
```
|Index term |Meaning|Resulting shape|
|- |- |-|
| `roi_batch_ind[:, None, None, None, None, None]`| which batch each RoI belongs to | `[K, 1, 1, 1, 1, 1]` |
| `torch.arange(channels, device=input.device)[None, :, None, None, None, None]` | all channels| `[1, C, 1, 1, 1, 1]` |
| `y[:, None, :, None, :, None]` | y-coordinate index| `[K, 1, PH, 1, IY, 1]` |
| `x[:, None, None, :, None, :]`| x-coordinate index| `[K, 1, 1, PW, 1, IX]` |
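These broadcast shapes can be verified with small made-up sizes (K=2 RoIs, C=3 channels, PH=PW=2 bins, IY=IX=2 samples per bin):

```python
import torch

# Verify that the four index tensors broadcast to [K, C, PH, PW, IY, IX].
N, C, H, W = 1, 3, 8, 8
K, PH, PW, IY, IX = 2, 2, 2, 2, 2
inp = torch.arange(N * C * H * W, dtype=torch.float32).reshape(N, C, H, W)
roi_batch_ind = torch.zeros(K, dtype=torch.long)
y = torch.zeros(K, PH, IY, dtype=torch.long)
x = torch.zeros(K, PW, IX, dtype=torch.long)

out = inp[
    roi_batch_ind[:, None, None, None, None, None],
    torch.arange(C)[None, :, None, None, None, None],
    y[:, None, :, None, :, None],
    x[:, None, None, :, None, :],
]
print(out.shape)  # torch.Size([2, 3, 2, 2, 2, 2])
```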
5. Computing the four corner weights
```python
def outer_prod(y, x):
    # outer product of the y and x weights
    return y[:, None, :, None, :, None] * x[:, None, None, :, None, :]

w1 = outer_prod(hy, hx)  # (1-ly) × (1-lx): top-left weight
w2 = outer_prod(hy, lx)  # (1-ly) × lx: top-right weight
w3 = outer_prod(ly, hx)  # ly × (1-lx): bottom-left weight
w4 = outer_prod(ly, lx)  # ly × lx: bottom-right weight
```

6. Computing the interpolated value
```python
val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4
return val # [K, C, PH, PW, IY, IX]
```
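A quick sanity check on the weighting scheme: for any fractional offsets the four weights sum to 1, so the result is a convex combination of the four neighbor values:

```python
import torch

# The four bilinear weights always sum to 1, since
# (hy + ly) * (hx + lx) = 1 * 1. Offsets here are random made-up values.
torch.manual_seed(0)
ly, lx = torch.rand(5), torch.rand(5)
hy, hx = 1.0 - ly, 1.0 - lx
w_sum = hy * hx + hy * lx + ly * hx + ly * lx
print(torch.allclose(w_sum, torch.ones_like(w_sum)))  # True
```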

## Basic Skill
1. `torch.where(condition, A, B)`
selects element-wise:
- where condition is True → take A
- where condition is False → take B
For example:
```python
x.shape = [batch_size, channels, seq_len]
xmask.shape = [batch_size, seq_len]
```
If you index with
```python
xmask[:, None, :]
```
then `xmask`'s shape becomes
```python
[batch_size, 1, seq_len]
```
So given a tensor
```python
x = torch.tensor([[[1, 2, 3],
                   [4, 5, 6]]])  # shape [1, 2, 3]
xmask = torch.tensor([[1, 0, 1]], dtype=torch.bool) # shape [1, 3]
```
applying the mask
```python
x = torch.where(xmask[:, None, :], x, 0)
```
yields
```python
x = [[[1, 0, 3],
      [4, 0, 6]]]
```
You can see that the 2nd position (mask=0) has been zeroed out.