# YOLOv3: An Incremental Improvement
###### tags: `YOLO` `CNN` `Paper Translation` `deeplearning`
[TOC]
## Notes
Blocks are categorized as follows: original-text blocks have a blue background and translation blocks have a green background. Some technical terms follow the translations of the National Academy for Educational Research.
:::info
Original text
:::
:::success
Translation
:::
:::warning
Personal notes; please leave a comment wherever the translation reads awkwardly
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/1804.02767)
* [Andrew Ng, Deep Learning, Convolutional Neural Networks, Week 3: Object Detection](https://hackmd.io/@shaoeChen/SJXmp66KG?type=view)
:::
## Abstract
:::info
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 $\text{AP}_{50}$ in 51 ms on a Titan X, compared to 57.5 $\text{AP}_{50}$ in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.
:::
:::success
We present some updates to YOLO! We made a bunch of small design changes to make it better. We also trained this new network, and it is pretty chubby. It's a little bigger than last time, but more accurate. Don't worry, it's still fast. At a 320×320 input, YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. If you look at the old .5 IOU mAP detection metric, YOLOv3 is quite good. It achieves 57.9 $\text{AP}_{50}$ in 51 ms on a Titan X, compared with RetinaNet's 57.5 $\text{AP}_{50}$ in 198 ms: similar performance, but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.
:::
## 1. Introduction
:::info
Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.
:::
:::success
Sometimes, you know, you just coast along and a year goes by, right? I didn't really do much research this year. I spent a lot of time on Twitter instead, and played around with GANs a little. I had a little momentum left over from last year [12], [1], so I managed to make some improvements to YOLO. Honestly though, nothing super interesting, just a bunch of small changes that make it better. I also helped out a bit with other people's research.
:::
:::info
Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!
:::
:::success
Actually, that's what brings us here today. We have a camera-ready deadline [4], and we need to cite some of the random updates we made to YOLO, but we don't have a source. So get ready for a TECH REPORT!
:::
:::info
The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.
:::
:::success
The nice thing about tech reports is that they don't need much of an introduction; you all know why we're here. So the end of this introduction will serve as a signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally, we'll contemplate what this all means.
:::
:::info

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.
Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times are from either an M40 or a Titan X; they are basically the same GPU.
:::
## 2. The Deal
:::info
So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.
:::
:::success
So here's the deal with YOLOv3: we mostly took good ideas from other people. We also trained a new classifier network that's better than the others. We'll walk you through the whole system from scratch so you can understand it all.
:::
:::info

Figure 2. **Bounding boxes with dimension priors and location prediction.** We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].
Figure 2. **Bounding boxes with dimension priors and location prediction.** We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure is blatantly self-plagiarized from [15].
:::
### 2.1. Bounding Box Prediction
:::info
Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, $t_x,t_y,t_w,t_h$. If the cell is offset from the top left corner of the image by $(c_x,c_y)$ and the bounding box prior has width and height $p_w,p_h$, then the predictions correspond to:
$$
\begin{align}
& b_x = \sigma(t_x) + c_x \\
& b_y = \sigma(t_y) + c_y \\
& b_w = p_w e^{t_w} \\
& b_h = p_h e^{t_h}
\end{align}
$$
:::
:::success
Following YOLO9000, our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts four coordinates for each bounding box, $t_x,t_y,t_w,t_h$. If the cell is offset from the top-left corner of the image by $(c_x,c_y)$ and the bounding box prior has width and height $p_w,p_h$, then the predictions correspond to:
$$
\begin{align}
& b_x = \sigma(t_x) + c_x \\
& b_y = \sigma(t_y) + c_y \\
& b_w = p_w e^{t_w} \\
& b_h = p_h e^{t_h}
\end{align}
$$
:::
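:::warning
As a personal note, here is a minimal NumPy sketch of how these four equations turn the raw outputs $t_x,t_y,t_w,t_h$ into a box. The function name `decode_box` and the example numbers are my own illustration, not code from the paper.
```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one raw prediction (tx, ty, tw, th) into a box center/size:
    b_x = sigmoid(t_x) + c_x,  b_y = sigmoid(t_y) + c_y,
    b_w = p_w * exp(t_w),      b_h = p_h * exp(t_h).
    (cx, cy) is the grid-cell offset; (pw, ph) is the prior's width/height."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) + cx          # center stays inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)           # width/height scale the prior
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Example: cell (3, 5) with a (116, 90) prior. In a real implementation the
# prior sizes and the decoded box would be kept in consistent units
# (grid cells or pixels) by multiplying/dividing by the stride.
print(decode_box(0.2, -0.1, 0.05, 0.3, cx=3, cy=5, pw=116, ph=90))
```
:::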
:::info
During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.
:::
:::success
During training we use a sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, the gradient is the ground-truth value (computed from the ground-truth box) minus our prediction: $\hat{t}_* - t_*$. This ground-truth value can easily be computed by inverting the equations above.
:::
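:::warning
The inversion mentioned above is straightforward; solving the four equations for the $t$ values gives (my own working, not written out in the paper):
$$
\begin{align}
& \hat{t}_x = \sigma^{-1}(b_x - c_x) \\
& \hat{t}_y = \sigma^{-1}(b_y - c_y) \\
& \hat{t}_w = \ln(b_w / p_w) \\
& \hat{t}_h = \ln(b_h / p_h)
\end{align}
$$
where $\sigma^{-1}(z) = \ln\frac{z}{1-z}$ is the logit function and $(b_x, b_y, b_w, b_h)$ come from the ground-truth box.
:::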
:::info
YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
:::
:::success
YOLOv3 predicts an objectness score (roughly, a score for whether the box contains an object) for each bounding box using logistic regression. The score should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior does. If a bounding box prior is not the best but does overlap a ground truth object by more than some threshold, we ignore the prediction, following [17]. We use a threshold of 0.5. Unlike [17], our system assigns only one bounding box prior to each ground truth object. If a bounding box prior is not assigned to a ground truth object, it incurs no loss for the coordinate or class predictions, only for objectness.
:::
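:::warning
A rough sketch (my own, simplified) of the assignment rule described above: ignore non-best priors whose overlap exceeds 0.5, and assign exactly one prior, the best-overlapping one, to each ground truth object. Shapes and names are assumptions; the real Darknet code also handles masking of the coordinate and class losses.
```python
import numpy as np

def assign_priors(iou, ignore_thresh=0.5):
    """iou: (num_priors, num_gt) array of IOU between each bounding box
    prior and each ground truth object. Returns objectness targets
    (1 = positive, 0 = negative) and a mask of ignored predictions."""
    num_priors, num_gt = iou.shape
    obj_target = np.zeros(num_priors)
    ignore = np.zeros(num_priors, dtype=bool)

    # Not the best match but overlapping some object by > 0.5: ignored.
    ignore[iou.max(axis=1) > ignore_thresh] = True

    # Exactly one prior (the best-overlapping one) is positive per object.
    for g in range(num_gt):
        best = int(iou[:, g].argmax())
        obj_target[best] = 1.0
        ignore[best] = False

    return obj_target, ignore
```
:::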
### 2.2. Class Prediction
:::info
Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
:::
:::success
Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax, as we found it unnecessary for good performance; instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
:::
:::info
This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
:::
:::success
This formulation helps when we move to more complex domains such as the Open Images Dataset [7]. In this dataset there are many overlapping labels (e.g. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class, which is often not the case. A multilabel approach models the data better.
:::
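:::warning
A small PyTorch sketch (mine, not the Darknet code) of what "independent logistic classifiers with binary cross-entropy" means in practice; the tensor sizes are toy values with 80 classes as in COCO.
```python
import torch
import torch.nn.functional as F

# Each class gets its own sigmoid, so overlapping labels such as
# "Woman" and "Person" can both be 1 for the same box.
num_boxes, num_classes = 4, 80
class_logits = torch.randn(num_boxes, num_classes)   # raw network outputs
class_targets = torch.zeros(num_boxes, num_classes)
class_targets[0, [0, 14]] = 1.0                      # one box, several labels

# Binary cross-entropy on the logits for the class predictions.
loss = F.binary_cross_entropy_with_logits(class_logits, class_targets)

# At inference time the per-class probabilities are independent sigmoids;
# unlike a softmax, they are not constrained to sum to 1.
class_probs = torch.sigmoid(class_logits)
```
:::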
### 2.3. Predictions Across Scales
:::info
YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
:::
:::success
YOLOv3 predicts boxes at three different scales. Our system extracts features from those scales using a concept similar to feature pyramid networks (FPN) [8]. From our base feature extractor we add several convolutional layers. The last of these (from context, the last convolutional layer) predicts a 3-d tensor encoding the bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale, so the tensor is $N \times N \times [3*(4+1+80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
:::
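:::warning
To make the $N \times N \times [3 \times (4+1+80)]$ bookkeeping concrete, a tiny calculation; the 13/26/52 grid sizes below assume a 416×416 input with strides 32/16/8, which is my assumption rather than something stated in this paragraph.
```python
# Per-scale output: at each of the N x N grid cells the network predicts
# 3 boxes, and each box needs 4 offsets + 1 objectness + 80 class scores.
num_anchors, num_offsets, num_obj, num_classes = 3, 4, 1, 80
channels = num_anchors * (num_offsets + num_obj + num_classes)   # = 255

for n in (13, 26, 52):
    print(f"{n} x {n} x {channels}")
```
:::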
:::info
Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
:::
:::success
Next we take the feature map from two layers back and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method lets us get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
:::
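:::warning
A minimal PyTorch sketch of the "upsample by 2× and concatenate with an earlier feature map" step; channel counts and spatial sizes are made-up illustrative values.
```python
import torch
import torch.nn.functional as F

coarse  = torch.randn(1, 256, 13, 13)   # from two layers back (deep, small)
earlier = torch.randn(1, 512, 26, 26)   # from earlier in the network (larger)

upsampled = F.interpolate(coarse, scale_factor=2, mode="nearest")  # 1 x 256 x 26 x 26
merged = torch.cat([upsampled, earlier], dim=1)                    # 1 x 768 x 26 x 26
# `merged` then goes through a few more conv layers before predicting the
# next detection tensor, now on a grid twice the size.
```
:::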
:::info
We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.
:::
:::success
We repeat the same design one more time to predict boxes for the final scale. Thus the predictions for the third scale benefit from all the prior computation as well as from fine-grained features from the earlier layers of the network.
:::
:::info
We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were:(10×13),(16×30),(33×23),(30×61),(62×45),(59×119),(116 × 90),(156 × 198),(373 × 326).
:::
:::success
We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divided the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
:::
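:::warning
A toy version of clustering box sizes with k-means, using the 1 − IOU distance that YOLOv2/v3 use instead of Euclidean distance; the random data and the helper name are mine, purely for illustration.
```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs with k-means under a 1 - IOU distance."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # IOU between every box and every centroid, anchoring both at the
        # same corner so only width/height matter.
        inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centroids[None, :, 1])
        union = wh.prod(axis=1)[:, None] + centroids.prod(axis=1)[None, :] - inter
        assign = (1.0 - inter / union).argmin(axis=1)
        for i in range(k):                          # update non-empty clusters
            if np.any(assign == i):
                centroids[i] = wh[assign == i].mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]   # sorted small -> large

# `wh` would be an (M, 2) array of ground-truth box widths/heights from the
# training set; the 9 rows returned play the role of the priors listed above.
wh = np.abs(np.random.default_rng(1).normal(100, 60, size=(500, 2))) + 5
print(kmeans_anchors(wh).round(1))
```
:::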
### 2.4. Feature Extractor
:::info
We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
:::
:::success
We use a new network to perform feature extraction. It is a hybrid of the Darknet-19 used in YOLOv2 and the newfangled residual network stuff. Our network uses successive 3×3 and 1×1 convolutional layers, but now also has some shortcut connections, and it is noticeably bigger. It has 53 convolutional layers, so we call it... wait for it... Darknet-53!
:::
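:::warning
A hedged PyTorch sketch of the kind of building block Darknet-53 stacks: a 1×1 bottleneck, a 3×3 convolution, and a shortcut (residual) connection. The batch norm + LeakyReLU(0.1) pattern follows common Darknet convention and is an assumption here, not a detail given in this paragraph.
```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k):
    """Conv (1x1 or 3x3) + batch norm + leaky ReLU, the basic Darknet unit."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class DarknetResidual(nn.Module):
    """1x1 bottleneck -> 3x3 conv -> add the shortcut back in."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = conv_bn_leaky(channels, channels // 2, 1)
        self.expand = conv_bn_leaky(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.expand(self.reduce(x))

x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)   # torch.Size([1, 256, 52, 52])
```
:::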
:::info
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:
:::
:::success
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Table 2. **Comparison of backbones**. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.
:::
:::info
Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.
:::
:::success
Each network is trained with identical settings and tested at 256×256, single-crop accuracy. Run times are measured on a Titan X at 256×256. Thus Darknet-53 performs on par with state-of-the-art classifiers, but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.
:::
:::warning
single crop: the photo may be 256×256 while the network input is 224×224; a single crop of the corresponding size is taken, typically from the center.
multiple crops: as above, but crops are taken from the top-left, bottom-left, top-right, bottom-right, and center.
:::
:::info
Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.
:::
:::success
Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure makes better use of the GPU, making it more efficient to evaluate and thus faster. That is mostly because ResNets have far too many layers and aren't very efficient.
:::
### 2.5. Training
:::info
We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
:::
:::success
We still train on full images, with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
:::
:::warning
hard negative: literally, a negative sample that is hard to distinguish, i.e. a negative sample that gets misclassified as a positive one.
:::
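:::warning
A rough sketch of what multi-scale training looks like in Darknet-style trainers: every few batches a new input resolution (a multiple of the 32-pixel stride) is drawn and the batch is resized to it. The size range, the every-10-batches frequency, and the `resize_batch` helper are my assumptions, not values from the paper.
```python
import random

sizes = list(range(320, 608 + 1, 32))        # 320, 352, ..., 608
current = 416

for step in range(1, 101):
    if step % 10 == 0:                       # switch resolution every 10 batches
        current = random.choice(sizes)
    # images, targets = resize_batch(next(loader), current)  # hypothetical helper
    # ...standard training step on (images, targets) at `current` resolution
```
:::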
## 3. How We Do
:::info
YOLOv3 is pretty good! See table 3. In terms of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.
:::
:::success
YOLOv3 is seriously pretty good! See Table 3. In terms of COCO's weird mean AP metric it is on par with the SSD variants but 3× faster. It is still quite a bit behind other models like RetinaNet on this metric, though.
:::
:::warning
weird: the dictionary gives [unexpected or strange](https://dictionary.cambridge.org/zht/%E8%A9%9E%E5%85%B8/%E8%8B%B1%E8%AA%9E-%E6%BC%A2%E8%AA%9E-%E7%B9%81%E9%AB%94/weird), but neither reading feels quite right here.
:::
:::info

Table 3. I’m seriously just stealing all these tables from [9] they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like 3.8× longer to process an image. YOLOv3 is much better than SSD variants and comparable to state-of-the-art models on the $\text{AP}_{50}$ metric.
Table 3. I'm seriously just stealing all these tables from [9]; they take so long to make from scratch. OK, YOLOv3 is doing alright. Keep in mind that RetinaNet takes about 3.8× longer to process an image. YOLOv3 is much better than the SSD variants and comparable to state-of-the-art models on the $\text{AP}_{50}$ metric.
:::
:::info
However, when we look at the “old” detection metric of mAP at IOU= .5 (or $\text{AP}_{50}$ in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.
:::
:::success
However, when we look at the "old" detection metric of mAP at IOU = .5 (or $\text{AP}_{50}$ in the chart), YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases, indicating that YOLOv3 struggles to get the boxes perfectly aligned with the object.
:::
:::info
In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.
:::
:::success
In the past, YOLO struggled with small objects. Now, however, that trend has reversed. With the new multi-scale predictions, YOLOv3 has relatively high $\text{AP}_S$ performance. However, it has comparatively worse performance on medium and larger sized objects. More investigation is needed to get to the bottom of this.
:::
:::info
When we plot accuracy vs speed on the $\text{AP}_{50}$ metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it's faster and better.
:::
:::success
When we plot accuracy vs. speed on the $\text{AP}_{50}$ metric (see Figure 3), we can see that YOLOv3 has significant advantages over other detection systems. In other words, it's faster and better.
:::
:::info

Figure 5. These two hypothetical detectors are perfect according to mAP over these two images. They are both perfect. Totally equal.
Figure 5. According to mAP over these two images, both of these hypothetical detectors are perfect. They are both perfect. Totally equal.
:::
## 4. Things We Tried That Didn’t Work
:::info
We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.
:::
:::success
We tried lots of things while we were working on YOLOv3. A lot of them didn't work. Here's what we can remember.
:::
:::info
**Anchor box** $x, y$ **offset predictions.** We tried using the normal anchor box prediction mechanism where you predict the $x, y$ offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.
:::
:::success
**Anchor box** $x, y$ **offset predictions.** We tried using the normal anchor box prediction mechanism, where you predict the $x, y$ offset as a multiple of the box width or height using a linear activation. We found that this formulation decreased model stability and didn't work very well.
:::
:::info
**Linear $x, y$ predictions instead of logistic.** We tried using a linear activation to directly predict the $x, y$ offset instead of the logistic activation. This led to a couple point drop in mAP.
:::
:::success
**Linear $x, y$ predictions instead of logistic.** We tried using a linear activation to directly predict the $x, y$ offset instead of the logistic activation. This led to a drop of a couple of points in mAP.
:::
:::info
**Focal loss.** We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
:::
:::success
**Focal loss.** We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve, because it has separate objectness predictions and conditional class predictions. Thus, for most examples there is no loss from the class predictions? Or something? We aren't totally sure.
:::
:::info
**Dual IOU thresholds and truth assignment.** Faster RCNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it counts as a positive example, by [.3−.7] it is ignored, and if it is less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn't get good results.
:::
:::success
**Dual IOU thresholds and truth assignment.** Faster RCNN uses two IOU thresholds during training. If a prediction's overlap with the ground truth is .7 or more, it is a positive example; if it falls in [.3, .7], it is ignored; if it is less than .3 for all ground truth objects, it is a negative example. We tried a similar strategy but couldn't get good results.
:::
:::info
We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.
:::
:::success
We quite like our current formulation; it seems to be at least at a local optimum. Some of these techniques might eventually produce good results; perhaps they just need some tuning to stabilize the training.
:::
:::info

Figure 3. Again adapted from the [9], this time displaying speed/accuracy tradeoff on the mAP at .5 IOU metric. You can tell YOLOv3 is good because it’s very high and far to the left. Can you cite your own paper? Guess who’s going to try, this guy → [16]. Oh, I forgot, we also fix a data loading bug in YOLOv2, that helped by like 2 mAP. Just sneaking this in here to not throw off layout.
Figure 3. Again adapted from [9], this time displaying the speed/accuracy tradeoff on mAP at the .5 IOU metric. You can tell YOLOv3 is good because it's very high and far to the left. Can you cite your own paper? Guess who's going to try, this guy → [16]. Oh, I forgot, we also fixed a data loading bug in YOLOv2; that helped by about 2 mAP. Just sneaking this in here so as not to throw off the layout.
:::
## 5. What This All Means
:::info
YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.
:::
:::success
YOLOv3 is a good detector. It's fast and it's accurate. It's not as great on the COCO average AP metric between .5 and .95 IOU. But it's very good on the old detection metric of .5 IOU.
:::
:::info
Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [18] If humans have a hard time telling the difference, how much does it matter?
:::
:::success
Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: "A full discussion of evaluation metrics will be added once the evaluation server is complete". Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! "Training humans to [visually inspect](http://terms.naer.edu.tw/detail/3128084/) a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult." [18] If humans have a hard time telling the difference, how much does it matter?
:::
:::warning
i.e., a full discussion of the evaluation metrics will be added once the evaluation server is complete?
:::
:::info
But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to.... wait, you’re saying that’s exactly what it will be used for?? Oh.
:::
:::success
But maybe a better question is: "Now that we have these detectors, what are we going to do with them?" A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won't be used to harvest your personal information and sell it to... wait, you're telling me that's exactly what it's being used for?? Oh, come on.
:::
:::info
Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait.....
:::
:::success
Well, the other people heavily funding vision research are the military, and they've never done anything horrible like killing lots of people with new technology, oh wait... sigh.
:::
:::info
I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.
:::
:::success
I have a lot of hope that most of the people using computer vision are just doing happy, good things with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around the house [19]. But computer vision is already being put to questionable uses, and as researchers we have a responsibility to at least consider the harm our work might be doing and to think of ways to mitigate it. We owe the world that much.
:::
:::info
In closing, do not @ me. (Because I finally quit Twitter).
:::
:::success
In closing, do not @ me. (Because I finally quit Twitter.)
:::
:::info

Figure 4. Zero-axis charts are probably more intellectually honest... and we can still screw with the variables to make ourselves look good!
:::