# Single Shot Detector

###### tags: `CodeReview` `SSD`

[TOC]

## Model

### Multi-scale feature maps for detection

Compared with the classic straight-through architecture of YOLO, SSD uses several convolutional layers as the feature maps for predicting bounding boxes, so it can cover objects at more scales. Intuitively this is very reasonable: asking a single feature map to carry information about objects of every size is simply too hard, and predicting objects of different scales separately makes more sense.

Below is the code SSD-Tensorflow uses to define the model. After the body is defined there is a `for` loop that builds the multiple output feature maps, that is, the outputs of `ssd_multibox_layer`, and collects them to return. The layers in `feat_layers` match the architecture diagram: `['block4', 'block7', 'block8', 'block9', 'block10', 'block11']`.

```python=
def ssd_net(...):
    ...
    ...
    # Prediction and localisations layers.
    predictions = []
    logits = []
    localisations = []
    for i, layer in enumerate(feat_layers):
        with tf.variable_scope(layer + '_box'):
            p, l = ssd_multibox_layer(end_points[layer],
                                      num_classes,
                                      anchor_sizes[i],
                                      anchor_ratios[i],
                                      normalizations[i])
        predictions.append(prediction_fn(p))
        logits.append(p)
        localisations.append(l)

    return predictions, localisations, logits, end_points
```

Looking further into `ssd_multibox_layer`, each input layer is fed to two separate convolutional layers, one predicting the bounding box offsets and the other predicting the categories.

```python=
def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    ...
    ...
    return cls_pred, loc_pred
```

### Default boxes and aspect ratios

These are the anchors of Faster R-CNN. Every cell of a feature map carries bounding boxes of preset sizes, and what the network predicts is the correction of these default boxes relative to the ground-truth boxes, that is, the offsets. What is worth pointing out is that SSD not only adopts the anchor concept from Faster R-CNN but also applies it to multi-scale feature maps, so compared with YOLO it can find objects of more different scales in a single image.

___

## Training

### Choosing scales and aspect ratios for default boxes

![](https://i.imgur.com/177hqnh.png)

This figure is the classic illustration of what default boxes do: they help SSD learn objects of many different sizes. The question that follows is how default boxes of different scales should be set to get the best result. Because default boxes are set by hand, they should change with the dataset or scene:

> In practice, one can also design a distribution of default boxes to best fit a specific dataset. How to design the optimal tiling is an open question as well.

The following code is the implementation that determines the default boxes:

```python=306
def ssd_anchor_one_layer(...):
    ...
    ...

    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    # For ratio=1, give one more size
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return y, x, h, w
```

First, there are five aspect ratios, $a_r \in \lbrace 1, 2, 3, \dfrac{1}{2}, \dfrac{1}{3} \rbrace$. For $a_r=1$ an extra default box with scale $s'_k=\sqrt{s_k \times s_{k+1}}$ is added, so there are at most six default boxes per location. The width and height of each box are $w_k^a=s_k \sqrt{a_r}$ and $h_k^a=s_k / \sqrt{a_r}$, which corresponds to the `* math.sqrt(r)` and `/ math.sqrt(r)` in the loop above. The paper gives $s_k$ as follows:

![](https://i.imgur.com/j97WOZG.png)

Beyond using multiple feature maps, multiple default boxes further help SSD handle objects of more sizes:

> By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes.

### Matching strategy

At this point we have three kinds of boxes:

1. Ground-truth box
2. Default box
3. Predicted box

A predicted box is obtained by correcting a default box with the predicted offsets. During training, default boxes must be matched to ground-truth boxes so that the resulting predicted boxes can be compared with the ground-truth boxes to compute the loss. The paper stresses that one ground-truth box is matched to multiple default boxes for training. Because default boxes overlap heavily, this makes training easier.

> Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.

Now let's see how SSD-Tensorflow implements the matching strategy:

```python=
def tf_ssd_bboxes_encode_layer(...):
    ...
    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]
        bbox = bboxes[i]
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels
        feat_scores = tf.where(mask, jaccard, feat_scores)

        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]
```

The odd-looking `-0.5` on line 17 (`feat_scores > -0.5`) is covered in the Remarks. It can be ignored and does not affect the result. The matching itself is carried out by the `tf.while_loop(...)` in `tf_ssd_bboxes_encode_layer`, and the key lines are these:

```python=15
jaccard = jaccard_with_anchors(bbox)
mask = tf.greater(jaccard, feat_scores)
...
...
...
feat_labels = imask * label + (1 - imask) * feat_labels
feat_scores = tf.where(mask, jaccard, feat_scores)
feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
...
```

Line 15 of this excerpt takes one ground-truth box and computes its Jaccard overlap with every default box of the layer. As the paper mentions, Jaccard overlap is the same measure as IoU:

![](https://i.imgur.com/vasUZT4.png)

If the Jaccard overlap is higher than the threshold (0.5), the default box is matched to that ground-truth box. Lines 22 to 28 of the full listing perform the assignment. Note that in this listing the threshold comparison itself is commented out, so `body` only keeps the best-overlapping ground-truth box for each default box. The 0.5 threshold is applied later, when the loss builds its positive mask with `gscores > match_threshold`.

`feat_scores` shares its spatial shape with `x, y`. Suppose it is a `(38, 38, anchors_num)` grid map, initialized to zeros. Wherever the Jaccard overlap of the current ground-truth box is higher than the stored element, the comparison returns True, so `mask.shape = (38, 38, anchors_num)` with boolean values. In other words, whenever the ground-truth box previously assigned to a default box overlaps it less than the current one does, that default box is reassigned to the current, better-fitting ground-truth box. In the end we might get something like this (one anchor slice shown for readability):

```python=
# One anchor slice of feat_labels, shape (38, 38)
feat_labels = [[0, 0, ..., 5, 5, 5, ..., 2, 2, 0,...],
               [0, 0, ..., 5, 5, 5, ..., 2, 2, 0,...],
               ...
               [...],
               [...],
               ...
               [1, 1, 1, 1, ..., 0, 0, 9, 9, 0, ... ]]

# One anchor slice of feat_scores, shape (38, 38)
feat_scores = [[0, 0, ..., 0.92, 0.92, 0.92, ..., 0.6, 0.6, 0,...],
               [0, 0, ..., 0.92, 0.92, 0.92, ..., 0.6, 0.6, 0,...],
               ...
               [...],
               [...],
               ...
               [0.78, 0.78, 0.78, 0.78, ..., 0, 0.59, 0.59, 0, ... ]]
```

`feat_ymin, feat_xmin, feat_ymax, feat_xmax` follow the same layout. Together these six arrays describe how one convolutional layer corresponds to the ground-truth boxes, and each cell stores the information for its position. You can think of a cell as the center of a default box. I originally expected this step to walk through the default boxes one grid cell at a time, but the implementation instead feeds in one ground-truth box at a time and uses fast, vectorized, NumPy-style operations to find the default boxes that qualify. This may simply reflect how heavily this repo leans on NumPy, so it is still worth checking how the official TensorFlow Object Detection API does it, but the approach is worth borrowing.

### Training objective

Below is the loss function. As mentioned earlier, SSD, unlike YOLO, does not predict the ground-truth box directly. It predicts the offsets between the ground-truth box and the default box. So after matching we first compute the offset $\hat{g}$ between $g$ and $d$, and then take the smooth L1 loss between $\hat{g}$ and the predicted $l$. Classification uses the familiar softmax cross-entropy, which needs no further introduction.

![](https://i.imgur.com/HtjyEWA.png)

Below is the code that computes $\hat{g}$. It matches the paper exactly, except for the conspicuous `prior_scaling`, which the paper does not have. The GitHub author gives no reason for it, which makes it as arbitrary as the repo's `anchor_sizes`. One possible reading: since `prior_scaling = [0.1, 0.1, 0.2, 0.2]`, dividing by it makes the targets 10 or 5 times larger, which may help training, and it also balances the magnitudes of `x, y, width, height`. Treat this as a guess:

```python=149
# Encode features.
feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
# Use SSD ordering: x / y / w / h instead of ours.
feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
return feat_labels, feat_localizations, feat_scores
```

### Hard negative mining

The paper says:

> After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples.

So we cannot train on all the negatives. Instead we keep only about three times as many negatives as there are default boxes matched to a ground-truth box, which is why line 628 multiplies `negative_ratio=3` by `n_positives` (the code also adds `batch_size`). Further down, line 631 ranks by the negated `nvalues_flat`, so the smallest values rank highest, and takes the `n_neg` entries just computed. Tracing back up to line 622, `nvalues` (flattened into `nvalues_flat`) comes from the model's class predictions. These are not raw logits: `predictions = slim.softmax(logits)`, and `predictions[:, 0]` is the probability of class 0, the label given to unmatched boxes (`no_classes` is 0 wherever `pmask` is false). A small value therefore means the model is least sure the box contains no object, which is exactly a hard negative. Finally, lines 632 to 634 take the threshold from the sorted hard-example values and combine it with `nmask` to get the actual default box positions.

```python=611
# Compute positive matching mask...
pmask = gscores > match_threshold
fpmask = tf.cast(pmask, dtype)
n_positives = tf.reduce_sum(fpmask)

# Hard negative mining...
no_classes = tf.cast(pmask, tf.int32)
predictions = slim.softmax(logits)
nmask = tf.logical_and(tf.logical_not(pmask),
                       gscores > -0.5)
fnmask = tf.cast(nmask, dtype)
nvalues = tf.where(nmask,
                   predictions[:, 0],
                   1. - fnmask)
nvalues_flat = tf.reshape(nvalues, [-1])
# Number of negative entries to select.
max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
n_neg = tf.minimum(n_neg, max_neg_entries)

val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
max_hard_pred = -val[-1]
# Final negative mask.
nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
fnmask = tf.cast(nmask, dtype)
```

Finally, this mask is multiplied element-wise with the per-box loss, which gives the loss of the hard negative examples.

```python=644
with tf.name_scope('cross_entropy_neg'):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                          labels=no_classes)
    loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
    tf.losses.add_loss(loss)
```

___

## Remarks

1. In SSD-Tensorflow, every occurrence of `*scores > -0.5` can be ignored. It is a mechanism the author used to track ***no-class boxes***.

![](https://i.imgur.com/gGZ2kIV.png)

You can also refer to the old code left commented out at the bottom of the matching strategy listing:

```python=127
# Check no annotation label: ignore these anchors...
# interscts = intersection_with_anchors(bbox)
# mask = tf.logical_and(interscts > ignore_threshold,
#                       label == no_annotation_label)
# # Replace scores by -1.
# feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
```
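
To make the role of the `-1` sentinel concrete, here is a minimal pure-Python sketch. It is not code from SSD-Tensorflow: the helper name `split_anchors` and the sample scores are made up for illustration, and only the `0.5` match threshold and the `> -0.5` guard are taken from the listings in this note.

```python
def split_anchors(gscores, match_threshold=0.5):
    """Split anchors into positive, negative and ignored index lists.

    gscores holds one matching score per anchor. A score of -1 is the
    sentinel that the commented-out code would have written for anchors
    that should be ignored (hypothetical input for this sketch).
    """
    positives, negatives, ignored = [], [], []
    for idx, score in enumerate(gscores):
        if score > match_threshold:
            # Mirrors: pmask = gscores > match_threshold
            positives.append(idx)
        elif score > -0.5:
            # Mirrors: nmask = not pmask and gscores > -0.5
            negatives.append(idx)
        else:
            # Sentinel score: in neither the positive nor the negative mask.
            ignored.append(idx)
    return positives, negatives, ignored


# Hypothetical scores for five anchors; the last one carries the sentinel.
scores = [0.92, 0.10, 0.00, 0.60, -1.0]
pos, neg, ign = split_anchors(scores)
print("positives:", pos)
print("negatives:", neg)
print("ignored:", ign)
```

Since the lines that would write `-1` are commented out, scores start at zero and are only ever replaced by Jaccard overlaps, which are non-negative. Under that reading the guard never filters anything, which is why it can be skipped when following the code.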