# Region of Interest (RoI)
An RoI is a proposed region of the original image.
## RoI Pooling
RoI Pooling is mainly used in [Fast R-CNN](https://arxiv.org/pdf/1504.08083).

### How it works
As the excerpt below shows, Fast R-CNN implements RoI Pooling with max pooling, converting the features inside each RoI into a fixed-size feature map (7×7 here):
>The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

An example:
Suppose you have an 8×8 feature map, an RoI covering a 6×6 region, and you want a 3×3 output.
- Split the 6×6 region into a 3×3 grid of bins, each roughly 2×2.
- Take the maximum value inside each 2×2 bin to form the 3×3 output.
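The example above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the Fast R-CNN implementation: because the 6×6 RoI divides evenly into 2×2 bins here, plain max pooling suffices, whereas real RoI pooling must also handle bins that do not divide evenly.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the 8x8 example: take a 6x6 RoI out of an 8x8
# feature map and max-pool it into a 3x3 grid of 2x2 bins.
feature_map = torch.arange(64, dtype=torch.float32).reshape(8, 8)
roi = feature_map[0:6, 0:6]  # a 6x6 RoI at the top-left corner

# each of the 3x3 bins is exactly 2x2 here, so plain max pooling works
pooled = F.max_pool2d(roi[None, None], kernel_size=2).squeeze()
print(pooled.shape)  # torch.Size([3, 3])
```

Each output cell holds the maximum of its 2×2 bin, so the top-left cell is `feature_map[1, 1]`.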
Why do this?
- Fully connected (FC) layers need fixed-size inputs, but RoI regions come in all sizes.
- RoI max pooling lets every candidate region produce a fixed-size feature, which simplifies the downstream classification and localization steps.

Relation to SPPnet
- SPPnet (Spatial Pyramid Pooling) is an earlier, similar method that pools with multiple grids of different sizes.
- RoI pooling is the special case of SPPnet that uses a single pooling grid of a fixed size.
### :warning: The problem
What if $h$ and $H$ (or $w$ and $W$) are not integer multiples of each other?
reference: [目標檢測 RoI Pool 和 RoI Align 的區別](https://blog.csdn.net/wzk4869/article/details/128561590)
As the figure shows, a 665×665 RoI mapped onto a stride-32 feature map should span $665/32 = 20.78$ cells, but there is no such thing as 0.78 of a cell, so it is quantized (Quantization) into a 20×20 window. Pooling that 20×20 window into a 7×7 feature map then needs bins of $20/7 = 2.86$ cells, which are in turn quantized to 2×2.

These quantizations accumulate a lot of pixel error:
- the 7×7 stage drops 0.86 of a cell per bin, a pixel error of 36.2404
$${0.86}^{2} \times (7 \times 7) = 36.2404$$
- the 20×20 stage drops 0.78 of a stride-32 cell, a pixel error of 623.0016 in the original image
$${0.78}^2 \times (32 \times 32) = 623.0016$$
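The arithmetic can be checked directly; fractions are rounded to two decimals to match the reference's numbers:

```python
# Reproduce the quantization-error arithmetic for the 665x665 example.
roi_side, stride, out_bins = 665, 32, 7

cells = roi_side / stride                 # 20.78 feature-map cells
frac1 = round(cells - int(cells), 2)      # 0.78 of a cell is dropped
err1 = frac1 ** 2 * stride ** 2           # error in original-image pixels

per_bin = int(cells) / out_bins           # 20 / 7 = 2.86 cells per bin
frac2 = round(per_bin - int(per_bin), 2)  # 0.86 dropped per bin
err2 = frac2 ** 2 * out_bins ** 2

print(round(err1, 4), round(err2, 4))
```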
This is why Mask R-CNN later proposed RoI Align.
## RoI Align
RoI Align was introduced in the 2017 [Mask R-CNN](https://arxiv.org/pdf/1703.06870) paper by Kaiming He et al.

### How it works
reference: [Understanding Region of Interest - Part 2 (RoI Align)](https://erdem.pl/2020/02/understanding-region-of-interest-part-2-ro-i-align)
The root cause is that RoI Pooling **quantizes** floating-point coordinates into integers, which is why
>While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks
So RoI Align avoids this **quantization** step entirely
>To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries
and instead uses bilinear interpolation to compute feature values at these floating-point locations
>We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.
See the figure in [Instance Segmentation](https://tjmachinelearning.com/lectures/1718/instance/instance.pdf); the interpolation works as follows.

### Math
reference: [Understanding Region of Interest - Part 2 (RoI Align)](https://erdem.pl/2020/02/understanding-region-of-interest-part-2-ro-i-align)
If you go straight to the bilinear-interpolation formulas in the reference, it is hard to see where the **four sampling points** come from, so it is easier to read the source code directly.
The first step splits the whole region evenly into a 3×3 grid.

Compute the center point P of the yellow box, then find the centers of the four nearest pixels (the red points).

Finally, bilinearly interpolate using the relative distances between P and those four pixel centers to obtain the pixel value at P.
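A minimal numeric sketch of this weighting; the values v1..v4 and the offsets are made up:

```python
# Toy bilinear interpolation at point P, given the values at its four
# neighboring pixel centers. All numbers here are made up.
def bilinear(v1, v2, v3, v4, ly, lx):
    # ly, lx: fractional offsets of P from the top-left neighbor, in [0, 1)
    hy, hx = 1.0 - ly, 1.0 - lx
    return hy * hx * v1 + hy * lx * v2 + ly * hx * v3 + ly * lx * v4

print(bilinear(10, 20, 30, 40, 0.0, 0.0))  # P on v1 -> 10.0
print(bilinear(10, 20, 30, 40, 0.5, 0.5))  # P at the cell center -> 25.0
```

When P coincides with a neighbor, that neighbor's value is returned exactly; at the center of the 2×2 cell the result is the plain average.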

## RoI Align Source Code
[Source code for torchvision.ops.roi_align](https://docs.pytorch.org/vision/master/_modules/torchvision/ops/roi_align.html)
### Bin

### Bilinear Interpolation
In the shape annotations below:
- K: number of RoIs
- C: number of channels
- PH, PW: number of sampling bins per RoI (usually the pooled feature size)
- IY, IX: number of sampling points inside each bin (the bilinear-interpolation subdivisions)
:::spoiler
```python
# NB: all inputs are tensors
def _bilinear_interpolate(
    input,  # [N, C, H, W]
    roi_batch_ind,  # [K]
    y,  # [K, PH, IY]
    x,  # [K, PW, IX]
    ymask,  # [K, IY]
    xmask,  # [K, IX]
):
    _, channels, height, width = input.size()

    # deal with inverse element out of feature map boundary
    y = y.clamp(min=0)
    x = x.clamp(min=0)
    y_low = y.int()
    x_low = x.int()
    y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
    y_low = torch.where(y_low >= height - 1, height - 1, y_low)
    y = torch.where(y_low >= height - 1, y.to(input.dtype), y)

    x_high = torch.where(x_low >= width - 1, width - 1, x_low + 1)
    x_low = torch.where(x_low >= width - 1, width - 1, x_low)
    x = torch.where(x_low >= width - 1, x.to(input.dtype), x)

    ly = y - y_low
    lx = x - x_low
    hy = 1.0 - ly
    hx = 1.0 - lx

    # do bilinear interpolation, but respect the masking!
    # TODO: It's possible the masking here is unnecessary if y and
    # x were clamped appropriately; hard to tell
    def masked_index(
        y,  # [K, PH, IY]
        x,  # [K, PW, IX]
    ):
        if ymask is not None:
            assert xmask is not None
            y = torch.where(ymask[:, None, :], y, 0)
            x = torch.where(xmask[:, None, :], x, 0)
        return input[
            roi_batch_ind[:, None, None, None, None, None],
            torch.arange(channels, device=input.device)[None, :, None, None, None, None],
            y[:, None, :, None, :, None],  # prev [K, PH, IY]
            x[:, None, None, :, None, :],  # prev [K, PW, IX]
        ]  # [K, C, PH, PW, IY, IX]

    v1 = masked_index(y_low, x_low)
    v2 = masked_index(y_low, x_high)
    v3 = masked_index(y_high, x_low)
    v4 = masked_index(y_high, x_high)

    # all ws preemptively [K, C, PH, PW, IY, IX]
    def outer_prod(y, x):
        return y[:, None, :, None, :, None] * x[:, None, None, :, None, :]

    w1 = outer_prod(hy, hx)
    w2 = outer_prod(hy, lx)
    w3 = outer_prod(ly, hx)
    w4 = outer_prod(ly, lx)

    val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4
    return val
```
:::
Bilinear interpolation computes the pixel value at a non-integer coordinate as a weighted average of the 4 surrounding pixels:
```bash
         x_low      x_high
y_low     v1 -------- v2
          |    •P     |
          |   (x,y)   |
y_high    v3 -------- v4
```
1. Boundary handling
```python
y = y.clamp(min=0)
x = x.clamp(min=0)
```
Ensures the coordinates are never negative.
2. The four sampling points
```python
y_low = y.int()  # floor (truncate toward zero)
x_low = x.int()
# boundary case: if already at the edge, the high coordinate equals the low one
y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
y_low = torch.where(y_low >= height - 1, height - 1, y_low)
y = torch.where(y_low >= height - 1, y.to(input.dtype), y)
x_high = torch.where(x_low >= width - 1, width - 1, x_low + 1)
x_low = torch.where(x_low >= width - 1, width - 1, x_low)
x = torch.where(x_low >= width - 1, x.to(input.dtype), x)
```
- y_low, x_low: the top-left neighbor
- y_high, x_high: the bottom-right neighbor
- boundary handling keeps the indices inside [0, height-1] and [0, width-1]
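Tracing this logic for a height-8 feature map with two made-up sample coordinates: an interior coordinate gets floor and floor+1 as neighbors, while a coordinate at the last row collapses both neighbors to height-1:

```python
import torch

# Trace the boundary handling above: one interior coordinate (2.7)
# and one at the bottom edge (7.0) of a height-8 feature map.
height = 8
y = torch.tensor([2.7, 7.0])
y_low = y.int()
y_high = torch.where(y_low >= height - 1, height - 1, y_low + 1)
y_low = torch.where(y_low >= height - 1, height - 1, y_low)
print(y_low.tolist(), y_high.tolist())  # [2, 7] [3, 7]
```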
3. Computing the interpolation weights
```python
ly = y - y_low  # fractional part in y, in [0, 1)
lx = x - x_low  # fractional part in x, in [0, 1)
hy = 1.0 - ly   # complementary weight in y
hx = 1.0 - lx   # complementary weight in x
```

4. Fetching the four neighbors' pixel values
```python
def masked_index(y, x):
    if ymask is not None:
        # zero out invalid coordinates (used for adaptive sampling)
        y = torch.where(ymask[:, None, :], y, 0)
        x = torch.where(xmask[:, None, :], x, 0)
    return input[
        roi_batch_ind[:, None, None, None, None, None],  # select the batch
        torch.arange(channels, device=input.device)[None, :, None, None, None, None],  # all channels
        y[:, None, :, None, :, None],  # y indices [K, PH, IY]
        x[:, None, None, :, None, :],  # x indices [K, PW, IX]
    ]  # output: [K, C, PH, PW, IY, IX]
```
|Index term |Meaning|Resulting shape|
|- |- |-|
| `roi_batch_ind[:, None, None, None, None, None]`| which batch each RoI belongs to | `[K, 1, 1, 1, 1, 1]` |
| `torch.arange(channels, device=input.device)[None, :, None, None, None, None]` | all channels| `[1, C, 1, 1, 1, 1]` |
| `y[:, None, :, None, :, None]` | y-coordinate index| `[K, 1, PH, 1, IY, 1]` |
| `x[:, None, None, :, None, :]`| x-coordinate index| `[K, 1, 1, PW, 1, IX]` |
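These broadcast shapes can be verified with small made-up sizes (K=2 RoIs, C=3 channels, PH=PW=2 bins, IY=IX=2 samples per bin):

```python
import torch

# Verify that the four index tensors broadcast to [K, C, PH, PW, IY, IX].
N, C, H, W = 1, 3, 8, 8
K, PH, PW, IY, IX = 2, 2, 2, 2, 2
inp = torch.arange(N * C * H * W, dtype=torch.float32).reshape(N, C, H, W)
roi_batch_ind = torch.zeros(K, dtype=torch.long)
y = torch.zeros(K, PH, IY, dtype=torch.long)
x = torch.zeros(K, PW, IX, dtype=torch.long)

out = inp[
    roi_batch_ind[:, None, None, None, None, None],
    torch.arange(C)[None, :, None, None, None, None],
    y[:, None, :, None, :, None],
    x[:, None, None, :, None, :],
]
print(out.shape)  # torch.Size([2, 3, 2, 2, 2, 2])
```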
5. Computing the four corner weights
```python
def outer_prod(y, x):
    # outer product of the y and x weights
    return y[:, None, :, None, :, None] * x[:, None, None, :, None, :]

w1 = outer_prod(hy, hx)  # (1-ly) × (1-lx): top-left weight
w2 = outer_prod(hy, lx)  # (1-ly) × lx: top-right weight
w3 = outer_prod(ly, hx)  # ly × (1-lx): bottom-left weight
w4 = outer_prod(ly, lx)  # ly × lx: bottom-right weight
```

6. Computing the interpolated value
```python
val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4
return val # [K, C, PH, PW, IY, IX]
```
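A quick sanity check on the weighting scheme: for any fractional offsets the four weights sum to 1, so the result is a convex combination of the four neighbor values:

```python
import torch

# The four bilinear weights always sum to 1, since
# (hy + ly) * (hx + lx) = 1 * 1. Offsets here are random made-up values.
torch.manual_seed(0)
ly, lx = torch.rand(5), torch.rand(5)
hy, hx = 1.0 - ly, 1.0 - lx
w_sum = hy * hx + hy * lx + ly * hx + ly * lx
print(torch.allclose(w_sum, torch.ones_like(w_sum)))  # True
```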

## Basic Skill
1. `torch.where(condition, A, B)`
selects element-wise:
- where condition is True → take A
- where condition is False → take B
For example:
```python
x.shape = [batch_size, channels, seq_len]
xmask.shape = [batch_size, seq_len]
```
If you index with
```python
xmask[:, None, :]
```
then `xmask`'s shape becomes
```python
[batch_size, 1, seq_len]
```
So given a tensor
```python
x = torch.tensor([[[1, 2, 3],
                   [4, 5, 6]]])  # shape [1, 2, 3]
xmask = torch.tensor([[1, 0, 1]], dtype=torch.bool) # shape [1, 3]
```
applying the mask
```python
x = torch.where(xmask[:, None, :], x, 0)
```
yields
```python
x = [[[1, 0, 3],
      [4, 0, 6]]]
```
You can see that the 2nd position (mask=0) has been zeroed out.