<style>
p {
text-align: justify;
}
.f-box {
display: flex;
justify-content: center;
}
.sm {
width: 60%;
margin-left: 2rem;
}
.ss {
width: 47%;
margin-left: 2rem;
}
.img-caption {
font-size: 1.5rem;
font-weight: 500;
letter-spacing: -1px;
padding: 10px 0 30px 0;
}
</style>
<h1 style="text-align:center">VFX Project II: Image Stitching</h1>
<p style="text-align:right">VFX (NTU, Spring 2022)</p>
Hao-Cheng Lo 駱皓正 (D08227104)
Li-Wei Fu 傅立威 (R10942078)
Due to issues with rendering mathematical expressions, please refer to the [full online report](https://hackmd.io/@VFX2022/H1mDIvlB5).
## Introduction
In this assignment, our panorama-stitching pipeline is as follows. First, in the feature detection stage, we detect feature points with Harris Corner Detection. Second, in the feature description stage, we represent each detected feature point as a vector using the SIFT descriptor. Next, in the feature matching stage, we compare the feature similarity between every pair of consecutive images, select the best 2D-to-2D correspondences, and feed them to RANSAC to solve for a robust homography. Finally, in the image blending stage, we stitch the images one by one using the estimated homographies together with backward warping, which yields the panorama. The sections below give the details and experimental results of each step, using the Parrington dataset as the running example; the photos we took ourselves appear in the last section:
1. Feature detection
2. Feature description
3. Feature matching
4. Image blending
5. Results
## Feature Detection
For feature detection we use the Harris Corner Detection algorithm covered in class. Its main steps are as follows, with notation consistent with the lecture slides:
1. Convert the image to grayscale and denoise it with a Gaussian filter.
2. Compute the first derivatives $I_x,\ I_y$ and their products $I^2_x,\ I^2_y$ (and $I_xI_y$), then smooth these products with a Gaussian filter again.
3. Construct the matrix $M$ and compute the Harris corner response $R$ from its determinant and trace, as written out after this list.
4. Set a threshold $=\mu_R +\alpha\sigma_R$, where $\alpha$ is user-defined and set to $0.5$ in this assignment; pixels whose Harris corner response falls below the threshold are filtered out, and those above it are kept.
5. Besides this global threshold, we also apply non-maximal suppression: each candidate is kept only if it is a local maximum, i.e., larger than all of its neighbors. The surviving points are the Harris corner features.
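Concretely, with $S_{x^2}$, $S_{y^2}$, and $S_{xy}$ denoting the Gaussian-smoothed gradient products from step 2, the quantities in step 3 are

$$
M=\begin{bmatrix}S_{x^2} & S_{xy}\\ S_{xy} & S_{y^2}\end{bmatrix},\qquad
R=\det(M)-k\,\operatorname{tr}(M)^2=S_{x^2}S_{y^2}-S_{xy}^2-k\,\big(S_{x^2}+S_{y^2}\big)^2,
$$

with $k = 0.04$ in our implementation (the `k_value` argument below).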
```python=
def harris_corner(I_x, I_y, kernel_size=9, sigma=3, k_value=0.04, alpha=0.5):
'''
Input:
I_x: x gradient
I_y: y gradient
Output:
points: index=1 is x coordinate, index=0 is y coordinate
'''
# https://en.wikipedia.org/wiki/Harris_corner_detector
# Step 1: Gradient
# done
# Step 2: Structure tensor setup
    # The original algorithm uses a mean (box) filter; we use a Gaussian filter instead.
S_x_2 = cv2.GaussianBlur(np.square(I_x), (kernel_size, kernel_size), sigma)
S_y_2 = cv2.GaussianBlur(np.square(I_y), (kernel_size, kernel_size), sigma)
S_xy = cv2.GaussianBlur(I_x * I_y, (kernel_size, kernel_size), sigma)
# Step 3: Harris response calculation
det_M = (S_x_2 * S_y_2) - np.square(S_xy)
trace_M = S_x_2 + S_y_2
harris_response = det_M - k_value * (trace_M ** 2)
# Step 4: Non-maximum suppression
threshold = np.mean(harris_response) + alpha * np.std(harris_response)
local_max = np.where(harris_response <= threshold, 0, 1).astype(np.uint8)
for i in np.arange(9):
if i == 4:
continue
kernel = np.zeros(9)
kernel[4] = 1
kernel[i] = -1
kernel = kernel.reshape(3, 3)
tmp_result = np.where(cv2.filter2D(
harris_response, -1, kernel) < 0, 0, 1).astype(np.uint8)
local_max &= tmp_result
# Decide feature points
points = np.where(local_max == 1)
print(f"[Harris Corner] Found {len(points[0])} local features!")
    return points, harris_response  # points: (y, x) index arrays
```
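`harris_corner` assumes the gradients `I_x` and `I_y` have already been computed. A minimal sketch of that preprocessing (steps 1 and 2 above), assuming Sobel derivatives on a Gaussian-smoothed grayscale image; kernel sizes and the file name are illustrative:

```python=
import cv2
import numpy as np

def compute_gradients(img, ksize=5, sigma=1):
    # Step 1: grayscale + Gaussian denoising
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float64)
    blurred = cv2.GaussianBlur(gray, (ksize, ksize), sigma)
    # Step 2: first derivatives (Sobel); harris_corner squares and smooths them
    I_x = cv2.Sobel(blurred, cv2.CV_64F, 1, 0, ksize=3)
    I_y = cv2.Sobel(blurred, cv2.CV_64F, 0, 1, ksize=3)
    return I_x, I_y

# Hypothetical usage:
# I_x, I_y = compute_gradients(cv2.imread("parrington/prtn00.jpg"))
# points, response = harris_corner(I_x, I_y)
```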
### Experimental Results
Below are the Harris Corner Detection results for two consecutive images from the Parrington dataset. Both the distribution of the response and the final feature points show that the high-response regions (dark red) and the selected feature points concentrate around corner-like structures in the image, which is reasonable.
<div class="f-box">
<img class="ss" src="https://i.imgur.com/5vYbEqS.jpg">
<img class="ss" src="https://i.imgur.com/VAkwgpf.jpg">
</div>
<center class="img-caption">
Original images
</center>
<div class="f-box">
<img class="sm" src="https://i.imgur.com/AQDi6zx.png">
<img class="sm" src="https://i.imgur.com/AhRtQjU.png">
</div>
<center class="img-caption">
Harris Corner Response
</center>
<div class="f-box">
<img class="ss" src="https://i.imgur.com/0IMEugb.png">
<img class="ss" src="https://i.imgur.com/LWyQte4.png">
</div>
<center class="img-caption">
Feature points
</center>
## Feature Description
Given the feature points determined above, we further need to describe them so that different feature points can be distinguished and similar feature points across images can be matched. Specifically, we use the SIFT descriptor, which focuses on local description. The steps are:
1. Compute the gradient orientation $\theta$ and magnitude $m$ at every pixel, as formulated after this list.
2. Divide 360 degrees into eight orientation bins and assign each pixel's orientation to its corresponding bin.
3. For every pixel, consider how its neighbors are distributed over the eight orientation bins (weighted by their gradient magnitudes), giving an 8-dimensional description of that pixel.
4. For each feature point, take a 16×16 window and rotate it according to the dominant orientation found above, so the descriptor is unaffected by rotation.
5. Continuing from the previous step, split this 16×16 neighborhood evenly into 4×4 regions (16 in total) and examine each region's distribution over the eight orientation bins.
6. Concatenating these gives a 16×8 = 128-dimensional vector; after normalization, this is the 128-dimensional descriptor of the feature point.
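The step-1 quantities are computed over the whole image with `np.hypot` and the helper `get_theta` in the code below; assuming `get_theta` maps angles to $[0^\circ, 360^\circ)$, they are

$$
m(x,y)=\sqrt{I_x(x,y)^2+I_y(x,y)^2},\qquad
\theta(x,y)=\operatorname{atan2}\!\big(I_y(x,y),\,I_x(x,y)\big)\ \bmod\ 360^\circ .
$$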
```python=
def sift_descriptor(points, I_x, I_y, num_bin=8, kernel_size=11, patch_size=12):
# length and theta
m = np.hypot(I_x, I_y)
theta = get_theta(I_x, I_y)
# histogram of theta
hist_bin_width = 360.0 / num_bin
theta_which_hist = (theta) // int(hist_bin_width) % num_bin
orientation_shape = (num_bin, I_x.shape[0], I_x.shape[1])
orientation_stack = np.zeros(orientation_shape)
for i in range(num_bin):
build_orientation_layer(
orientation_stack[i], theta_which_hist == i, m, kernel_size)
descriptors = []
keypoints = []
for x, y in zip(points[1], points[0]): # for each point
description = []
try:
rotated_orientation_stack = build_rotated_orientation_stack(
x, y, orientation_stack, patch_size, theta[y, x])
for yy in [4, 8, 12, 16]:
for xx in [4, 8, 12, 16]:
orientation_vec = np.array(
[np.sum(rotated_orientation_stack[i][yy:yy+4, xx:xx+4]) for i in range(num_bin)])
orientation_vec = orientation_vec / \
(orientation_vec.sum() + 1e-7)
orientation_vec = np.clip(orientation_vec, 0, 0.2)
orientation_vec = orientation_vec / \
(orientation_vec.sum() + 1e-7)
description += orientation_vec.tolist()
except:
description = -1
if np.sum(description) > 0:
keypoints.append([x, y])
descriptors.append(description)
print(
f"[SIFT Descriptor] Complete {len(keypoints)} features' descriptors!")
return np.array(keypoints).astype(np.float32), np.array(descriptors).astype(np.float32)
```
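`sift_descriptor` relies on three helpers that are not listed in this report: `get_theta`, `build_orientation_layer`, and `build_rotated_orientation_stack` (the last extracts a window of each orientation layer around a keypoint and rotates it by the keypoint's orientation, and is not sketched here). The following is only a plausible sketch of the first two, consistent with how they are called above:

```python=
import numpy as np
import cv2

def get_theta(I_x, I_y):
    # Plausible sketch: gradient orientation in degrees, mapped to [0, 360)
    return np.degrees(np.arctan2(I_y, I_x)) % 360

def build_orientation_layer(layer, bin_mask, m, kernel_size):
    # Plausible sketch: for every pixel, accumulate the gradient magnitudes of
    # neighbors whose orientation falls in this bin (kernel_size x kernel_size
    # box filter). Written in place because the caller ignores the return value.
    contribution = np.where(bin_mask, m, 0.0)
    layer[...] = cv2.boxFilter(contribution, -1, (kernel_size, kernel_size), normalize=False)
```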
### Experimental Results
The results below show that the gradient magnitude is large where edges are strong. In addition, the per-pixel orientation bins show that similar edges fall into consistent bins.
<div class="f-box">
<img class="sm" src="https://i.imgur.com/kJpo5r7.png">
<img class="sm" src="https://i.imgur.com/wi0z63n.png">
</div>
<center class="img-caption">
Gradient Value
</center>
<div class="f-box">
<img class="sm" src="https://i.imgur.com/NdYhzTP.png">
<img class="sm" src="https://i.imgur.com/xRylWj9.png">
</div>
<center class="img-caption">
Theta in 8 Bins
</center>
## Feature Matching
### Finding good correspondences
Each detected feature point is now represented by a 128-dimensional vector, so the next step is to find corresponding points across images. We measure the distance between two descriptors with either the Euclidean distance or the Manhattan distance, combined with the heuristic ratio test taught in class as a simple filter. For each feature point of the previous image, we find the closest and the second-closest descriptors in the image to be stitched; if the distance to the closest one is less than 0.8 times the distance to the second-closest one, the pair is accepted as a correspondence.
```python=
def match_keypoints(des1, des2, norm='L1'):
'''
Input:
des1: N-by-128 matrices, representing N points' descriptor
des2: M-by-128 matrices, representing M points' descriptor
norm: distance definition, there are two options
L1-norm - Manhattan distance
L2-norm - Euclidean distance
Output:
matches: list of match index
'''
# We use heuristic method:
# If the distance ratio of d(x, x1) and d(x, x2) is smaller than 0.8,
# then it is accepted as a match.
RATIO = 0.8
N = des1.shape[0]
if norm == 'L1':
dist = cdist(des1, des2, metric='minkowski', p=1)
elif norm == 'L2':
dist = cdist(des1, des2, metric='euclidean')
matches = []
sorted_index = np.argsort(dist, axis=1)
for i in range(N):
idx1, idx2 = sorted_index[i, 0], sorted_index[i, 1]
dist1, dist2 = dist[i, idx1], dist[i, idx2]
if dist1 < dist2 * RATIO:
matches.append([i, idx1])
return matches
```
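`match_keypoints` relies on `cdist` from SciPy. A brief, illustrative usage sketch (the random descriptors stand in for real `sift_descriptor` output):

```python=
import numpy as np
# Import assumed by the cdist call in match_keypoints
from scipy.spatial.distance import cdist

# Illustrative check with random descriptors (real ones come from sift_descriptor)
des_prev = np.random.rand(200, 128).astype(np.float32)
des_next = np.random.rand(180, 128).astype(np.float32)
matches = match_keypoints(des_prev, des_next, norm='L2')
print(f"{len(matches)} putative matches survived the 0.8 ratio test")
```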
### Homography
Once correspondences are found, we estimate the homography relating two consecutive images. At least four point pairs are needed; with exactly four pairs there is a unique solution, while with more pairs an exact solution may not exist, but we can find the one with the smallest error. If exactly four correspondences are given, we use the first method: solve Ah = v with a matrix inverse, assuming the last entry of the homography is 1 (h33 = 1). If more than four correspondences are given, we use the second method: solve Ah = 0 in the least-squares sense.
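Writing $H=[h_{ij}]$ and a correspondence $u_i=(x_i,y_i)\mapsto v_i=(x_i',y_i')$, each point pair contributes two constraints:

$$
x_i'=\frac{h_{11}x_i+h_{12}y_i+h_{13}}{h_{31}x_i+h_{32}y_i+h_{33}},\qquad
y_i'=\frac{h_{21}x_i+h_{22}y_i+h_{23}}{h_{31}x_i+h_{32}y_i+h_{33}} .
$$

Clearing denominators yields the two rows per point used to build $A$ in the code below. With $h_{33}=1$, four points give the $8\times 8$ system $Ah=v$; with more points we solve $Ah=0$ in the least-squares sense, whose solution is the right-singular vector of $A$ associated with the smallest singular value, i.e., the last row of $V^\top$.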
```python=
def solve_homography(u, v):
'''
Input:
u: N-by-2 source pixel location matrices
v: N-by-2 destination pixel location matrices
Output:
H: 3-by-3 homography matrix
'''
N = u.shape[0]
    if v.shape[0] != N:
print('u and v should have the same size')
return None
if N < 4:
print('At least 4 points should be given')
return None
    # Method 1: A is an 8 x 8 matrix; solve Ah = v, which has a unique solution (faster)
if N == 4:
A = np.array([
[u[0, 0], u[0, 1], 1, 0, 0, 0, -u[0, 0] * v[0, 0], -u[0, 1] * v[0, 0]],
[0, 0, 0, u[0, 0], u[0, 1], 1, -u[0, 0] * v[0, 1], -u[0, 1] * v[0, 1]],
[u[1, 0], u[1, 1], 1, 0, 0, 0, -u[1, 0] * v[1, 0], -u[1, 1] * v[1, 0]],
[0, 0, 0, u[1, 0], u[1, 1], 1, -u[1, 0] * v[1, 1], -u[1, 1] * v[1, 1]],
[u[2, 0], u[2, 1], 1, 0, 0, 0, -u[2, 0] * v[2, 0], -u[2, 1] * v[2, 0]],
[0, 0, 0, u[2, 0], u[2, 1], 1, -u[2, 0] * v[2, 1], -u[2, 1] * v[2, 1]],
[u[3, 0], u[3, 1], 1, 0, 0, 0, -u[3, 0] * v[3, 0], -u[3, 1] * v[3, 0]],
[0, 0, 0, u[3, 0], u[3, 1], 1, -u[3, 0] * v[3, 1], -u[3, 1] * v[3, 1]],
])
# In some cases, Homography cannot be calculated.
# 1) Duplicate points
# 2) More than 2 collinear points
if np.linalg.det(A) == 0:
return None
H = np.append(np.matmul(np.linalg.inv(A), v.flatten()), 1).reshape(3, 3)
return H
    # Method 2: A is a 2N x 9 matrix; solve Ah = 0 in the least-squares sense
A = []
for ui, vi in zip(u, v):
A.append([ui[0], ui[1], 1, 0, 0, 0, -ui[0] * vi[0], -ui[1] * vi[0], -vi[0]])
A.append([0, 0, 0, ui[0], ui[1], 1, -ui[0] * vi[1], -ui[1] * vi[1], -vi[1]])
U, Sigma, Vt = np.linalg.svd(A)
H = Vt[-1].reshape(3, 3)
return H
```
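A quick, illustrative sanity check of `solve_homography`: map four source points through the returned $H$ and compare with the targets. The unit-square correspondences here are hypothetical test data.

```python=
import numpy as np

# A unit square translated by (2, 3); the exact homography has h33 = 1
u = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
v = u + np.array([2, 3], dtype=np.float64)

H = solve_homography(u, v)
uh = np.hstack([u, np.ones((4, 1))]).T   # homogeneous source points, 3 x 4
vh = H @ uh
vh /= vh[2]                              # normalize by the last row
print(np.allclose(vh[:2].T, v))          # expected: True
```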
### Filtering outliers with RANSAC
Matching inevitably produces some incorrect matches, which must be filtered out so that they do not affect the result.
We use RANSAC for this. Since a homography is determined by at least four points, each iteration randomly samples four correspondences and first checks that the four points can form a quadrilateral (i.e., that they are not degenerate), so a homography can actually be computed from them; a sketch of such a check is given after the code below. We then map all feature points through this homography to obtain predicted correspondence coordinates, compare them with the actual corresponding points (the ground truth), and treat pairs with too large an error as outliers. We want as many inliers as possible; when two candidates have the same number of inliers, we prefer the one with the smaller total error, and whenever a candidate improves on the best so far we update the homography and the inlier set. These steps are repeated 10000 times, which is enough to find a good result.
```python=
def find_homography(kp1, des1, kp2, des2, norm='L1', num_iters=10000, threshold=5):
matches = match_keypoints(des1, des2, norm)
print(f"[Match] Found {len(matches)} matches!")
im1_pts = np.array([kp1[m[0]] for m in matches])
im2_pts = np.array([kp2[m[1]] for m in matches])
u = im2_pts.T
u = np.append(u, [np.ones_like(u[0])], axis=0)
v_ans = im1_pts.T
v_ans = np.append(v_ans, [np.ones_like(v_ans[0])], axis=0)
# apply RANSAC to choose best H
most_inliers = 0
best_H = np.eye(3)
least_error = np.inf
best_mask = np.full(len(matches), True)
population = range(len(matches))
for i in range(num_iters):
# randomly choose 4 corners to compute the homography
idx = random.sample(population, 4)
H = solve_homography(im2_pts[idx], im1_pts[idx])
if H is None:
continue
v = np.matmul(H, u)
v /= v[2]
dist = np.linalg.norm(v[:2] - v_ans[:2], axis=0)
inliers = (dist < threshold).sum()
if inliers > most_inliers or (inliers == most_inliers and least_error > dist[best_mask].sum()):
best_H = H
best_mask = dist < threshold
most_inliers = inliers
least_error = dist[best_mask].sum()
# # (Optional) Remove outliers and recalculate H
# best_H = solve_homography(im2_pts[best_mask], im1_pts[best_mask])
return best_H
```
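The listing above relies on `solve_homography` returning `None` when the sampled points are degenerate (duplicate or collinear points make $A$ singular); the explicit quadrilateral check mentioned earlier is not part of this listing. One possible sketch of such a pre-check, rejecting samples in which any three points are (nearly) collinear:

```python=
from itertools import combinations

def is_non_degenerate(pts, eps=1e-3):
    # pts: 4-by-2 NumPy array of (x, y) coordinates (illustrative helper,
    # not part of the listing above). Duplicate points are rejected as well,
    # since they also give a (near-)zero triangle area.
    for i, j, k in combinations(range(4), 3):
        v1 = pts[j] - pts[i]
        v2 = pts[k] - pts[i]
        area2 = v1[0] * v2[1] - v1[1] * v2[0]  # twice the signed triangle area
        if abs(area2) < eps:
            return False
    return True

# Hypothetical use inside the RANSAC loop:
# if not is_non_degenerate(im2_pts[idx]):
#     continue
```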
### Experimental Results
The figure below shows the 50 closest correspondences. The match lines are all of similar length and roughly parallel to one another, indicating that the two photos differ essentially by a translation only.

## Image Blending
Once the relationship between the images is known, we can warp them; we implemented both forward warping and backward warping. We first build coordinates with meshgrid, where u denotes source-image coordinates and v denotes destination-image coordinates.
Forward warping multiplies the estimated homography with u to obtain v, normalizes and rounds the result, uses a mask to find and discard coordinates that fall outside the destination image, and finally assigns every source pixel directly to its destination. Because this does not guarantee that every pixel of the destination image is covered, holes may appear.
Backward warping instead uses the inverse of the homography: it is multiplied with v to obtain u, which is likewise normalized; a mask removes coordinates that fall outside the source image, and bilinear interpolation on the source image then gives the value of every destination pixel before it is assigned. This method produces no holes, so we use backward warping to stitch the images.
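Writing the back-projected source coordinate $(u_x,u_y)$ as an integer part $(x,y)$ plus a fractional part $(\Delta x,\Delta y)$, and $I(x,y)$ for the source pixel at column $x$ and row $y$, the bilinear interpolation used in the code below is

$$
I(u_x,u_y)=(1-\Delta x)(1-\Delta y)\,I(x,y)+(1-\Delta x)\,\Delta y\,I(x,y+1)+\Delta x\,(1-\Delta y)\,I(x+1,y)+\Delta x\,\Delta y\,I(x+1,y+1),
$$

which corresponds to the weights `a`, `b`, `c`, and `d` in `warping`.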
```python=
def warping(src, dst, H, ymin, ymax, xmin, xmax, direction='b'):
'''
Input:
src: source image
dst: destination output image
H: homography from source to destination
ymin: lower vertical bound of the destination (source, if forward warp) pixel coordinate
ymax: upper vertical bound of the destination (source, if forward warp) pixel coordinate
xmin: lower horizontal bound of the destination (source, if forward warp) pixel coordinate
xmax: upper horizontal bound of the destination (source, if forward warp) pixel coordinate
direction: indicates backward warping or forward warping
Output:
dst: destination output image
Note:
Image shape: H, W, C
Image coordinate: X, Y
but H -> Y
W -> X
So we should swap H and W to generate the correct coordinates
src and dst shape: H W C
corresponding coordinate: v[1] v[0]
'''
h_src, w_src, ch = src.shape
h_dst, w_dst, ch = dst.shape
H_inv = np.linalg.inv(H)
# Prevent changing to the original dst
dst = dst.copy()
# meshgrid the (x, y) coordinate pairs
x, y = np.meshgrid(np.arange(xmin, xmax), np.arange(ymin, ymax))
x, y = x.flatten().astype(int), y.flatten().astype(int)
if direction == 'b':
# reshape the destination pixels as N x 3 homogeneous coordinate
v = np.stack([x, y, np.ones_like(x)])
u = np.matmul(H_inv, v)
u /= u[2]
# calculate the mask of the transformed coordinate (should not exceed the boundaries of source image)
# upper bound use ">=" to prevent index out of bounds in interpolation
mask = np.logical_or(np.logical_or(u[0] < 0, u[0] >= w_src - 1), np.logical_or(u[1] < 0, u[1] >= h_src - 1))
v_x, v_y, u_x, u_y = np.delete(v[0], mask), np.delete(v[1], mask), np.delete(u[0], mask), np.delete(u[1], mask)
# We use bilinear interpolation to deal with the problem of coordinates that are not integers
int_x, int_y = u_x.astype(int), u_y.astype(int)
frac_x, frac_y = u_x - int_x, u_y - int_y
a = (1 - frac_x) * (1 - frac_y)
b = (1 - frac_x) * frac_y
c = frac_x * (1 - frac_y)
d = frac_x * frac_y
p = np.zeros((int_x.size, 3))
p += a[..., None] * src[int_y, int_x]
p += b[..., None] * src[int_y + 1, int_x]
p += c[..., None] * src[int_y, int_x + 1]
p += d[..., None] * src[int_y + 1, int_x + 1]
# assign to destination image
dst[v_y, v_x] = p
elif direction == 'f':
# reshape the destination pixels as N x 3 homogeneous coordinate
u = np.stack([x, y, np.ones_like(x)])
v = np.matmul(H, u)
v /= v[2]
# use the rounding method to assign the destination image coordinates
v = np.rint(v).astype(int)
# calculate the mask of the transformed coordinate (should not exceed the boundaries of destination image)
mask = np.logical_or(np.logical_or(v[0] < 0, v[0] > w_dst - 1), np.logical_or(v[1] < 0, v[1] > h_dst - 1))
v_x, v_y, u_x, u_y = np.delete(v[0], mask), np.delete(v[1], mask), np.delete(u[0], mask), np.delete(u[1], mask)
# assign to destination image
dst[v_y, v_x] = src[u_y, u_x]
return dst
```
### Experimental Results
After stitching, some black regions remain where nothing was warped; we crop them away manually at the end to make the result look nicer.

## Demo code
```python=
if __name__ == "__main__":
img_paths = glob.glob("ours/*.jpg")
img_paths = sorted(img_paths, key=lambda x: int(re.findall(r"\d+", x)[0]))
imgs = [cv2.imread(path) for path in img_paths]
h_max = max([x.shape[0] for x in imgs])
w_max = sum([x.shape[1] for x in imgs])
# create the final stitched canvas
dst = np.zeros((h_max, w_max, imgs[0].shape[2]), dtype=np.uint8)
dst[:imgs[0].shape[0], :imgs[0].shape[1]] = imgs[0]
# for all images to be stitched
H_total = np.eye(3)
acc_width = 0
for idx in range(len(imgs) - 1):
im1 = imgs[idx]
im2 = imgs[idx + 1]
# feature detection (detection + description)
kp1, des1 = myDetectAndCompute(im1.copy())
kp2, des2 = myDetectAndCompute(im2.copy())
# matching (match keypoint + use RANSAC method to find homography)
H = find_homography(kp1, des1, kp2, des2)
# chain the homographies
H_total = H_total @ H
# apply warping
acc_width += im1.shape[1]
dst = warping(im2, dst, H_total, 0, h_max, 0, w_max, direction='b')
cv2.imwrite("./panorama.png", dst)
```
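The demo calls `myDetectAndCompute`, whose implementation is not listed in this report; presumably it chains gradient computation, Harris corner detection, and the SIFT descriptor. A minimal sketch under that assumption, reusing the `compute_gradients` sketch from the feature-detection section:

```python=
def myDetectAndCompute(img):
    # Assumed wrapper (not shown in the report): detection + description
    I_x, I_y = compute_gradients(img)
    points, _ = harris_corner(I_x, I_y)
    keypoints, descriptors = sift_descriptor(points, I_x, I_y)
    return keypoints, descriptors
```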
## Results
Ours:


