<style>
p { text-align: justify; }
.f-box { display: flex; justify-content: center; }
.sm { width: 60%; margin-left: 2rem; }
.ss { width: 47%; margin-left: 2rem; }
.img-caption { font-size: 1.5rem; font-weight: 500; letter-spacing: -1px; padding: 10px 0 30px 0; }
</style>

<h1 style="text-align:center">VFX Project II: Image Stitching</h1>
<p style="text-align:right">VFX (NTU, Spring 2022)</p>

Hao-Cheng Lo 駱皓正 (D08227104)
Li-Wei Fu 傅立威 (R10942078)

Because of math-rendering issues, please refer to the [full online report](https://hackmd.io/@VFX2022/H1mDIvlB5).

## Introduction

In this assignment, our panorama-stitching pipeline works as follows. First, in the feature detection stage, we detect feature points with Harris corner detection. Second, in the feature description stage, we represent each detected feature with a vector computed by a SIFT-style descriptor. Next, in the feature matching stage, we compare descriptor similarity between every pair of consecutive images, keep the best 2D-to-2D correspondences, and use them together with RANSAC to estimate a robust homography. Finally, in the image blending stage, we stitch the images one by one using the estimated homographies and backward warping, which yields the panorama. The following sections give the details and experimental results of each step, using the Parrington dataset as the running example; the photos we shot ourselves are shown in the last section:

1. Feature detection
2. Feature description
3. Feature matching
4. Image blending
5. Results

## Feature Detection

For feature detection we use the Harris corner detection algorithm introduced in class. Its main steps are listed below; the notation follows the lecture slides:

1. Convert the input image to grayscale and denoise it with a Gaussian filter.
2. Compute the first derivatives $I_x,\ I_y$, their squares $I^2_x,\ I^2_y$ and the product $I_xI_y$, and smooth them with a Gaussian filter again.
3. Build the structure tensor $M$ and compute the Harris corner response $R$ from its determinant and trace (see the formulas right after this list).
4. Set a threshold $\mu_R + \alpha\sigma_R$, where $\alpha$ is user-defined (we use $\alpha = 0.5$): pixels whose response falls below the threshold are discarded, and those above it are kept.
5. On top of this global threshold we apply non-maximal suppression: a candidate survives only if it is a local maximum, i.e. larger than all of its neighbours. The surviving pixels are our Harris corner features.
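With the Gaussian-smoothed products written as $S_{x^2}$, $S_{y^2}$ and $S_{xy}$ (as in the code below), the structure tensor and the response used in steps 3 and 4 are

$$
M = \begin{bmatrix} S_{x^2} & S_{xy} \\ S_{xy} & S_{y^2} \end{bmatrix},
\qquad
R = \det M - k\,(\operatorname{tr} M)^2 = S_{x^2}S_{y^2} - S_{xy}^2 - k\,(S_{x^2} + S_{y^2})^2,
$$

where $k$ is the empirical constant (`k_value = 0.04` in our implementation).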
```python=
import cv2
import numpy as np


def harris_corner(I_x, I_y, kernel_size=9, sigma=3, k_value=0.04, alpha=0.5):
    '''
    Input:
        I_x: x gradient
        I_y: y gradient
    Output:
        points: index=1 is x coordinate, index=0 is y coordinate
        harris_response: per-pixel Harris corner response
    '''
    # https://en.wikipedia.org/wiki/Harris_corner_detector
    # Step 1: Gradient
    # done (I_x and I_y are passed in)

    # Step 2: Structure tensor setup
    # In the original algorithm this is a mean (box) filter; we use a Gaussian window instead.
    S_x_2 = cv2.GaussianBlur(np.square(I_x), (kernel_size, kernel_size), sigma)
    S_y_2 = cv2.GaussianBlur(np.square(I_y), (kernel_size, kernel_size), sigma)
    S_xy = cv2.GaussianBlur(I_x * I_y, (kernel_size, kernel_size), sigma)

    # Step 3: Harris response calculation
    det_M = (S_x_2 * S_y_2) - np.square(S_xy)
    trace_M = S_x_2 + S_y_2
    harris_response = det_M - k_value * (trace_M ** 2)

    # Step 4: Thresholding with mean + alpha * std of the response
    threshold = np.mean(harris_response) + alpha * np.std(harris_response)
    local_max = np.where(harris_response <= threshold, 0, 1).astype(np.uint8)

    # Step 5: Non-maximal suppression
    # Compare each pixel with its 8 neighbours; keep it only if it is >= all of them.
    for i in np.arange(9):
        if i == 4:
            continue
        kernel = np.zeros(9)
        kernel[4] = 1
        kernel[i] = -1
        kernel = kernel.reshape(3, 3)
        tmp_result = np.where(cv2.filter2D(
            harris_response, -1, kernel) < 0, 0, 1).astype(np.uint8)
        local_max &= tmp_result

    # Decide feature points
    points = np.where(local_max == 1)
    print(f"[Harris Corner] Found {len(points[0])} local features!")
    return points, harris_response  # points[0]: y coordinates, points[1]: x coordinates
```

### Experimental Results

Below are the Harris corner detection results for two consecutive images of the Parrington dataset. Both the distribution of the response and the finally selected feature points show that the high-response regions (dark red) and the chosen features concentrate around corner-like structures in the images, which is what we would expect.

<div class="f-box">
    <img class="ss" src="https://i.imgur.com/5vYbEqS.jpg">
    <img class="ss" src="https://i.imgur.com/VAkwgpf.jpg">
</div>
<center class="img-caption">
    Original images
</center>

<div class="f-box">
    <img class="sm" src="https://i.imgur.com/AQDi6zx.png">
    <img class="sm" src="https://i.imgur.com/AhRtQjU.png">
</div>
<center class="img-caption">
    Harris Corner Response
</center>

<div class="f-box">
    <img class="ss" src="https://i.imgur.com/0IMEugb.png">
    <img class="ss" src="https://i.imgur.com/LWyQte4.png">
</div>
<center class="img-caption">
    Feature points
</center>

## Feature Description

Having determined the feature points, we now need to describe them so that different features can be told apart and similar features can be matched across images. We use the SIFT descriptor, which builds a local description of each feature. The steps are as follows:

1. Compute the gradient orientation $\theta$ and magnitude $m$ at every pixel (see the formulas right after this list).
2. Divide the 360 degrees into eight orientation bins and assign every pixel to its corresponding bin.
3. For every pixel, accumulate the orientations of its neighbours into the eight bins, weighted by their gradient magnitudes, which gives an 8-dimensional description per pixel.
4. For each feature point, take a 16×16 window and rotate it according to the dominant orientation found in the previous step, so that the descriptor becomes rotation-invariant.
5. Still for each feature point, split this 16×16 neighbourhood evenly into 4×4 cells (16 cells in total) and collect the 8-bin orientation distribution of each cell.
6. Concatenating the 16 cells gives a 16×8 = 128-dimensional vector; after normalisation, this is the 128-dimensional descriptor of the feature point.
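The per-pixel gradient magnitude and orientation used in step 1 are the usual

$$
m(x, y) = \sqrt{I_x^2 + I_y^2},
\qquad
\theta(x, y) = \operatorname{atan2}(I_y,\ I_x),
$$

with $\theta$ mapped into $[0^\circ, 360^\circ)$ before binning; `np.hypot` computes $m$ in the code below, and we assume the (unlisted) `get_theta` helper returns $\theta$ in this form.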
```python=
def sift_descriptor(points, I_x, I_y, num_bin=8, kernel_size=11, patch_size=12):
    '''
    Input:
        points: feature points from harris_corner (points[0]: y, points[1]: x)
        I_x, I_y: image gradients
    Output:
        keypoints: K-by-2 array of (x, y) locations
        descriptors: K-by-128 array of normalized descriptors
    '''
    # gradient magnitude and orientation
    m = np.hypot(I_x, I_y)
    theta = get_theta(I_x, I_y)

    # per-pixel orientation histogram (8 bins), weighted by the gradient magnitude
    hist_bin_width = 360.0 / num_bin
    theta_which_hist = theta // int(hist_bin_width) % num_bin
    orientation_shape = (num_bin, I_x.shape[0], I_x.shape[1])
    orientation_stack = np.zeros(orientation_shape)
    for i in range(num_bin):
        build_orientation_layer(
            orientation_stack[i], theta_which_hist == i, m, kernel_size)

    descriptors = []
    keypoints = []
    for x, y in zip(points[1], points[0]):  # for each feature point
        description = []
        try:
            # rotate the local patch to the dominant orientation (rotation invariance)
            rotated_orientation_stack = build_rotated_orientation_stack(
                x, y, orientation_stack, patch_size, theta[y, x])
            # 4x4 cells of the central 16x16 window, 8 bins each -> 128 dimensions
            for yy in [4, 8, 12, 16]:
                for xx in [4, 8, 12, 16]:
                    orientation_vec = np.array(
                        [np.sum(rotated_orientation_stack[i][yy:yy+4, xx:xx+4]) for i in range(num_bin)])
                    # normalize, clip large values, then renormalize
                    orientation_vec = orientation_vec / (orientation_vec.sum() + 1e-7)
                    orientation_vec = np.clip(orientation_vec, 0, 0.2)
                    orientation_vec = orientation_vec / (orientation_vec.sum() + 1e-7)
                    description += orientation_vec.tolist()
        except Exception:
            # skip keypoints whose patch cannot be built (e.g. too close to the image border)
            description = -1
        if np.sum(description) > 0:
            keypoints.append([x, y])
            descriptors.append(description)
    print(f"[SIFT Descriptor] Computed {len(keypoints)} feature descriptors!")
    return np.array(keypoints).astype(np.float32), np.array(descriptors).astype(np.float32)
```

### Experimental Results

The results below show that the gradient magnitude is large along strong edges. In addition, looking at the orientation bin assigned to each pixel, pixels along the same kind of edge tend to fall into the same bin.

<div class="f-box">
    <img class="sm" src="https://i.imgur.com/kJpo5r7.png">
    <img class="sm" src="https://i.imgur.com/wi0z63n.png">
</div>
<center class="img-caption">
    Gradient Value
</center>

<div class="f-box">
    <img class="sm" src="https://i.imgur.com/NdYhzTP.png">
    <img class="sm" src="https://i.imgur.com/xRylWj9.png">
</div>
<center class="img-caption">
    Theta in 8 Bins
</center>
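The stitching script in the Demo code section calls the detection and description stages above through a single `myDetectAndCompute` helper that is not listed in this report. A minimal sketch of such a wrapper, assuming Sobel-based gradients and the Gaussian pre-blur of step 1 of the detection pipeline, could look like this:

```python=
def myDetectAndCompute(img, kernel_size=9, sigma=3):
    # Sketch of the detect-and-describe wrapper used in the demo code.
    # The Sobel gradients and the blur parameters here are illustrative choices.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gray = cv2.GaussianBlur(gray, (kernel_size, kernel_size), sigma)

    # image gradients (step 2 of the Harris pipeline)
    I_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    I_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

    # Harris corner detection followed by the SIFT-style description
    points, _ = harris_corner(I_x, I_y)
    keypoints, descriptors = sift_descriptor(points, I_x, I_y)
    return keypoints, descriptors
```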
## Feature Matching

### Finding Correspondences

Every feature point found above is now represented by a 128-dimensional vector, so the next step is to find the corresponding points between different images. We measure the distance between two descriptors with either the Euclidean distance or the Manhattan distance, combined with the heuristic ratio test taught in class as a simple filter. For each feature in the image to be stitched, we find the closest and the second-closest feature points of the previous image; the closest one is accepted as a correspondence only if the ratio between the two distances is smaller than 0.8.

```python=
from scipy.spatial.distance import cdist


def match_keypoints(des1, des2, norm='L1'):
    '''
    Input:
        des1: N-by-128 matrix, representing N points' descriptors
        des2: M-by-128 matrix, representing M points' descriptors
        norm: distance definition, there are two options
            L1-norm - Manhattan distance
            L2-norm - Euclidean distance
    Output:
        matches: list of match index pairs [i, j]
    '''
    # Heuristic (ratio test):
    # if the distance ratio of d(x, x1) and d(x, x2) is smaller than 0.8,
    # then x1 is accepted as a match for x.
    RATIO = 0.8
    N = des1.shape[0]
    if norm == 'L1':
        dist = cdist(des1, des2, metric='minkowski', p=1)
    elif norm == 'L2':
        dist = cdist(des1, des2, metric='euclidean')
    matches = []
    sorted_index = np.argsort(dist, axis=1)
    for i in range(N):
        idx1, idx2 = sorted_index[i, 0], sorted_index[i, 1]
        dist1, dist2 = dist[i, idx1], dist[i, idx2]
        if dist1 < dist2 * RATIO:
            matches.append([i, idx1])
    return matches
```

### Homography

Given the correspondences, we estimate the homography that relates two consecutive images. Computing a homography requires at least four point correspondences. With exactly four points there is a unique solution; with more than four the system is over-determined, so we look for the solution with the smallest error. If exactly four correspondences are given, we use the first method: assuming the last entry of the homography is one ($h_{33} = 1$), we solve $A\mathbf{h} = \mathbf{v}$ directly with a matrix inverse. If more than four correspondences are given, we use the second method and take the least-squares solution of $A\mathbf{h} = \mathbf{0}$ via SVD (the rows of $A$ are written out after the code below).

```python=
def solve_homography(u, v):
    '''
    Input:
        u: N-by-2 source pixel location matrix
        v: N-by-2 destination pixel location matrix
    Output:
        H: 3-by-3 homography matrix
    '''
    N = u.shape[0]
    if v.shape[0] != N:
        print('u and v should have the same size')
        return None
    if N < 4:
        print('At least 4 points should be given')
        return None

    # Method 1: A is an 8 x 8 matrix, solve Ah = v for the unique solution (faster)
    if N == 4:
        A = np.array([
            [u[0, 0], u[0, 1], 1, 0, 0, 0, -u[0, 0] * v[0, 0], -u[0, 1] * v[0, 0]],
            [0, 0, 0, u[0, 0], u[0, 1], 1, -u[0, 0] * v[0, 1], -u[0, 1] * v[0, 1]],
            [u[1, 0], u[1, 1], 1, 0, 0, 0, -u[1, 0] * v[1, 0], -u[1, 1] * v[1, 0]],
            [0, 0, 0, u[1, 0], u[1, 1], 1, -u[1, 0] * v[1, 1], -u[1, 1] * v[1, 1]],
            [u[2, 0], u[2, 1], 1, 0, 0, 0, -u[2, 0] * v[2, 0], -u[2, 1] * v[2, 0]],
            [0, 0, 0, u[2, 0], u[2, 1], 1, -u[2, 0] * v[2, 1], -u[2, 1] * v[2, 1]],
            [u[3, 0], u[3, 1], 1, 0, 0, 0, -u[3, 0] * v[3, 0], -u[3, 1] * v[3, 0]],
            [0, 0, 0, u[3, 0], u[3, 1], 1, -u[3, 0] * v[3, 1], -u[3, 1] * v[3, 1]],
        ])
        # In some cases the homography cannot be computed:
        #   1) duplicate points
        #   2) three or more collinear points
        if np.linalg.det(A) == 0:
            return None
        H = np.append(np.matmul(np.linalg.inv(A), v.flatten()), 1).reshape(3, 3)
        return H

    # Method 2: A is a 2N x 9 matrix, solve Ah = 0 for the least-squares solution
    A = []
    for ui, vi in zip(u, v):
        A.append([ui[0], ui[1], 1, 0, 0, 0, -ui[0] * vi[0], -ui[1] * vi[0], -vi[0]])
        A.append([0, 0, 0, ui[0], ui[1], 1, -ui[0] * vi[1], -ui[1] * vi[1], -vi[1]])
    U, Sigma, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H
```
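Concretely, each correspondence $\mathbf{u} = (u_x, u_y) \mapsto \mathbf{v} = (v_x, v_y)$ contributes two rows to the matrix $A$ built above. In the over-determined case ($A\mathbf{h} = \mathbf{0}$, with $\mathbf{h}$ the nine entries of $H$ read row by row) these rows are

$$
\begin{bmatrix}
u_x & u_y & 1 & 0 & 0 & 0 & -u_x v_x & -u_y v_x & -v_x \\
0 & 0 & 0 & u_x & u_y & 1 & -u_x v_y & -u_y v_y & -v_y
\end{bmatrix},
$$

while the four-point method drops the last column by fixing $h_{33} = 1$ and moves $(v_x, v_y)$ to the right-hand side.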
### Filtering Outliers with RANSAC

Matching inevitably produces some wrong correspondences, and they must be filtered out so that they do not harm the result. We use RANSAC for this. Since a homography is determined by a minimum of four points, in every iteration we randomly sample four correspondences and first make sure that these four points can actually determine a homography (degenerate samples such as duplicated or collinear points are rejected). We then map all matched feature points through the candidate homography, compare the computed coordinates with the actual corresponding points (the ground truth), and treat points with too large an error as outliers. We want as many inliers as possible; when two candidates have the same number of inliers we compare their total error and prefer the smaller one, and whenever a candidate improves on the current best we update the homography and the inlier set. These steps are repeated 10000 times, after which a good homography is usually found.

```python=
import random


def find_homography(kp1, des1, kp2, des2, norm='L1', num_iters=10000, threshold=5):
    matches = match_keypoints(des1, des2, norm)
    print(f"[Match] Found {len(matches)} matches!")
    im1_pts = np.array([kp1[m[0]] for m in matches])
    im2_pts = np.array([kp2[m[1]] for m in matches])

    # homogeneous coordinates of the matched points
    u = im2_pts.T
    u = np.append(u, [np.ones_like(u[0])], axis=0)
    v_ans = im1_pts.T
    v_ans = np.append(v_ans, [np.ones_like(v_ans[0])], axis=0)

    # apply RANSAC to choose the best H
    most_inliers = 0
    best_H = np.eye(3)
    least_error = np.inf
    best_mask = np.full(len(matches), True)
    population = range(len(matches))
    for i in range(num_iters):
        # randomly choose 4 correspondences to compute the homography
        idx = random.sample(population, 4)
        H = solve_homography(im2_pts[idx], im1_pts[idx])
        if H is None:
            continue
        # reprojection error of all matches under the candidate H
        v = np.matmul(H, u)
        v /= v[2]
        dist = np.linalg.norm(v[:2] - v_ans[:2], axis=0)
        inliers = (dist < threshold).sum()
        # keep the candidate with the most inliers; break ties by the smaller total error
        if inliers > most_inliers or (inliers == most_inliers and
                                      least_error > dist[best_mask].sum()):
            best_H = H
            best_mask = dist < threshold
            most_inliers = inliers
            least_error = dist[best_mask].sum()

    # # (Optional) Remove outliers and recalculate H
    # best_H = solve_homography(im2_pts[best_mask], im1_pts[best_mask])
    return best_H
```

### Experimental Results

The figure below shows the 50 closest correspondences. The drawn match lines have almost equal length and are parallel to each other, which indicates that the two photos differ by little more than a translation.

![](https://i.imgur.com/NF27vM7.png)

## Image Blending

Once the relation between the images is known, we can warp them. We implemented both forward warping and backward warping. In both cases we first build the coordinate grid with meshgrid, where $u$ denotes source-image coordinates and $v$ denotes destination-image coordinates.

Forward warping multiplies the homography with $u$ to obtain $v$, normalises and rounds the result, uses a mask to find and remove the coordinates that fall outside the destination image, and then assigns every source pixel directly to its destination. Because this does not guarantee that every destination pixel is covered, holes may appear.

Backward warping needs the inverse of the homography: multiplying it with $v$ gives $u$, which is again normalised; a mask finds and removes the coordinates that fall outside the source image, and the value of each destination pixel is then obtained by bilinear interpolation in the source image (see the formula below) before being assigned. This method leaves no holes, so we use backward warping to stitch the images.
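For a non-integer source location $(x, y)$, writing $x_0 = \lfloor x \rfloor$, $y_0 = \lfloor y \rfloor$, $\Delta x = x - x_0$ and $\Delta y = y - y_0$, the bilinear interpolation used in the backward branch is

$$
I(x, y) \approx (1-\Delta x)(1-\Delta y)\, I(x_0, y_0)
+ (1-\Delta x)\,\Delta y\, I(x_0, y_0{+}1)
+ \Delta x\,(1-\Delta y)\, I(x_0{+}1, y_0)
+ \Delta x\,\Delta y\, I(x_0{+}1, y_0{+}1),
$$

which corresponds to the weights `a`, `b`, `c` and `d` in the code below.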
```python=
def warping(src, dst, H, ymin, ymax, xmin, xmax, direction='b'):
    '''
    Input:
        src: source image
        dst: destination output image
        H: homography from source to destination
        ymin: lower vertical bound of the destination (source, if forward warp) pixel coordinate
        ymax: upper vertical bound of the destination (source, if forward warp) pixel coordinate
        xmin: lower horizontal bound of the destination (source, if forward warp) pixel coordinate
        xmax: upper horizontal bound of the destination (source, if forward warp) pixel coordinate
        direction: indicates backward warping or forward warping
    Output:
        dst: destination output image
    Note:
        Image shape is (H, W, C) while image coordinates are (x, y),
        with H corresponding to y and W corresponding to x.
        So H and W are swapped when generating the coordinates:
        src/dst axes:              H     W    C
        corresponding coordinate: v[1]  v[0]
    '''
    h_src, w_src, ch = src.shape
    h_dst, w_dst, ch = dst.shape
    H_inv = np.linalg.inv(H)

    # prevent changing the original dst
    dst = dst.copy()

    # meshgrid of the (x, y) coordinate pairs
    x, y = np.meshgrid(np.arange(xmin, xmax), np.arange(ymin, ymax))
    x, y = x.flatten().astype(int), y.flatten().astype(int)

    if direction == 'b':
        # stack the destination pixels as a 3 x N homogeneous coordinate array
        v = np.stack([x, y, np.ones_like(x)])
        u = np.matmul(H_inv, v)
        u /= u[2]

        # mask out the transformed coordinates that exceed the boundaries of the source image;
        # the upper bound uses ">= size - 1" to prevent index-out-of-bounds in the interpolation
        mask = np.logical_or(np.logical_or(u[0] < 0, u[0] >= w_src - 1),
                             np.logical_or(u[1] < 0, u[1] >= h_src - 1))
        v_x, v_y, u_x, u_y = np.delete(v[0], mask), np.delete(v[1], mask), np.delete(u[0], mask), np.delete(u[1], mask)

        # bilinear interpolation for the non-integer source coordinates
        int_x, int_y = u_x.astype(int), u_y.astype(int)
        frac_x, frac_y = u_x - int_x, u_y - int_y
        a = (1 - frac_x) * (1 - frac_y)
        b = (1 - frac_x) * frac_y
        c = frac_x * (1 - frac_y)
        d = frac_x * frac_y
        p = np.zeros((int_x.size, 3))
        p += a[..., None] * src[int_y, int_x]
        p += b[..., None] * src[int_y + 1, int_x]
        p += c[..., None] * src[int_y, int_x + 1]
        p += d[..., None] * src[int_y + 1, int_x + 1]

        # assign to the destination image
        dst[v_y, v_x] = p

    elif direction == 'f':
        # stack the source pixels as a 3 x N homogeneous coordinate array
        u = np.stack([x, y, np.ones_like(x)])
        v = np.matmul(H, u)
        v /= v[2]

        # round to the nearest destination pixel coordinates
        v = np.rint(v).astype(int)

        # mask out the transformed coordinates that exceed the boundaries of the destination image
        mask = np.logical_or(np.logical_or(v[0] < 0, v[0] > w_dst - 1),
                             np.logical_or(v[1] < 0, v[1] > h_dst - 1))
        v_x, v_y, u_x, u_y = np.delete(v[0], mask), np.delete(v[1], mask), np.delete(u[0], mask), np.delete(u[1], mask)

        # assign to the destination image (rounding collisions are why holes may appear)
        dst[v_y, v_x] = src[u_y, u_x]

    return dst
```

### Experimental Results

After stitching, some black regions remain where nothing was warped; we crop them away manually at the end so that the result looks better.

![](https://i.imgur.com/m6NA6ZQ.png)

## Demo code

```python=
import glob
import re

if __name__ == "__main__":
    img_paths = glob.glob("ours/*.jpg")
    img_paths = sorted(img_paths, key=lambda x: int(re.findall(r"\d+", x)[0]))
    imgs = [cv2.imread(path) for path in img_paths]
    h_max = max([x.shape[0] for x in imgs])
    w_max = sum([x.shape[1] for x in imgs])

    # create the final stitched canvas and paste the first image at the origin
    dst = np.zeros((h_max, w_max, imgs[0].shape[2]), dtype=np.uint8)
    dst[:imgs[0].shape[0], :imgs[0].shape[1]] = imgs[0]

    # stitch the remaining images one by one
    H_total = np.eye(3)
    for idx in range(len(imgs) - 1):
        im1 = imgs[idx]
        im2 = imgs[idx + 1]

        # feature detection (detection + description)
        kp1, des1 = myDetectAndCompute(im1.copy())
        kp2, des2 = myDetectAndCompute(im2.copy())

        # matching (match keypoints + use RANSAC to find the homography)
        H = find_homography(kp1, des1, kp2, des2)

        # chain the homographies so that im2 is mapped into the first image's frame
        H_total = H_total @ H

        # apply warping onto the canvas
        dst = warping(im2, dst, H_total, 0, h_max, 0, w_max, direction='b')

    cv2.imwrite("./panorama.png", dst)
```

## Results

Ours:

![](https://i.imgur.com/xg8bUBI.jpg)
![](https://i.imgur.com/sqSuFag.jpg)
![](https://i.imgur.com/1JIXVHf.jpg)