# YOLOv4: Optimal Speed and Accuracy of Object Detection ###### tags: `YOLO` `CNN` `論文翻譯` `deeplearning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/abs/2004.10934) * [吳恩達老師_深度學習_卷積神經網路_第三週_目標偵測](https://hackmd.io/@shaoeChen/SJXmp66KG?type=view) ::: ## Abstract :::info There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet. ::: :::success 有很多的功能宣稱可以提高卷積神經網路(CNN)的準確度(accuracy)。這需要在大型資料集上對這些特徵的組合做實際的測試,然後再對其結果做理論上的證明。有些方法是針對某些特定模型與某些特定問題的,或是只能應用在小規模的資料集;有些方法,像是batch-normalization與residual-connections,則是可以應用到多數的模型、任務與資料集。我們假設這些通用方法包括Weighted-Residual-Connections (WRC)、Cross-Stage-Partial-connections (CSP)、Cross mini-Batch Normalization (CmBN)、Self-adversarial-training (SAT)與Mish-activation。我們使用新的方法:WRC、CSP、CmBN、SAT、Mish activation、Mosaic data augmentation、CmBN、DropBlock regularization與CIoU loss,並將其中一些方法結合起來,以實現最佳結果:MS COCO資料集的43.5% AP (65.7% AP50),在Tesla V100張這顯卡上實時速度(realtime speed)為∼65 FPS。原始碼就放在Git上,去拿吧:https://github.com/AlexeyAB/darknet ::: ## 1. Introduction :::info The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require large number of GPUs for training with a large mini-batch-size. We address such problems through creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU. 
::: :::success 多數基於CNN的物體偵測器在很大程度上僅適用於推薦系統上。舉例來說,透過城市的視訊鏡頭來尋找閒置的停車位就是由slow accurate models來執行的,而汽車碰撞警示則是跟fast inaccurate models有關。提高這種實時的物體偵測器的準確度,不僅可以將之用於hint generating recommendation systems(提示生成推薦系統?),還可以用於獨立的流程管理以及人工輸入的減少。常見的[圖形處理器](http://terms.naer.edu.tw/detail/20815301/)(GPU)上的實時物體偵測器的操作讓它們可以以付擔的起的價格來大量使用。最準確的現代神經網路無法實時的操作,需要大量的GPU來對大量的資料進行訓練。我們透過建立一個在常見的GPU上實時運行的CNN來解決這類問題,而對於這個CNN模型來說,也只需要一個常見的GPU來訓練即可。 ::: :::warning * slow accurate models,個人理解是推論慢,比較准,而fast inaccurate models則是推論快,但較不準的模型。 ::: :::info The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows: 1. We develope an efficient and powerful object detection model. It makes everyone can use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector. 2. We verify the influence of state-of-the-art Bag-ofFreebies and Bag-of-Specials methods of object detection during the detector training. 3. We modify state-of-the-art methods and make them more effiecient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc. ::: :::success 這項研究的主要目標在於設計一個在生產系統中,物體偵測器快速的執行速度,以及最佳化平行計算,而非Billion floating-point operations (BFLOPS)。我們希望設計的東西可以更容易的訓練、使用。舉例來說,隨便一個人拿著一塊GPU來訓練、測試都可以得到一個實時、高品質、讓人心悅誠服的物體偵測結果,就如同Figure 1所給出的YOLOv4的結果。我們的貢獻總結如下: 1. 我們開發出一個高效且強大的物體偵測模型。這讓每個人都可以隨便拿一塊1080Ti或是2080Ti的GPU來訓練出一個超快而且準確的物體偵測器。 2. 我們在偵測器的訓練期間驗證了最先進的Bag-of Freebies與Bag-of-Specials。 3. 我們修正了最先進的方法,讓它們更有效且更適合用於單GPU的訓練,包括CBN[89]、PAN[49]、SAM[85]等等。 ::: :::info ![](https://hackmd.io/_uploads/HJd2llrNj.png) Figure 1: Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. YOLOv4 runs twice faster than EfficientDet with comparable performance. Improves YOLOv3’s AP and FPS by 10% and 12%, respectively. Figure 1:比較我們所提出的YOLOv4與其它最先進的物體偵測器。YOLOv4的執行速度比EfficientDet快兩倍,效能的部份則是相當。將YOLOv3的AP、FPS各自提高10%、12%。 ::: :::info ![](https://hackmd.io/_uploads/Hk_cPGjSj.png) Figure 2: Object detector. ::: ## 2. Related work ### 2.1. Object detection models :::info A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a twostage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors are developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. 
Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several topdown paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection. ::: :::success 現今的偵測器通常是由兩個部份所組成,一個用ImageNet預訓練的骨幹,跟一個用來預測類別與邊界框的head(模型的最上層)。對於那些在GPU平台上執行的偵測器來說,它們的骨幹可以是VGG[68]、ResNet[26]、ResNeXt[86]、或是DenseNet[30]。對於模型的頂部(head)的部份,通常也是分成兩個類別,也就是one-stage object detector與two-stage object detector。最具代表性的two-stage object detector就是R-CNN[19]系列的模型,包括fast R-CNN[18]、faster R-CNN[64]、R-FCN[9]與Libra R-CNN[58]。也是可以將two-stage object detector弄成anchor-free object detector,像是RepPoints[87]。one-stage object detector的話,最具代表性的模型就是YOLO[61, 62, 63]、SSD[50]、與RetinaNet[45]。這幾年也在發展著anchor-free one-stage object detectors。這類的偵測器的話則是有CenterNet[13]、CornerNet[37, 38]、FCOS[78]等等。近年來所開發的物體偵測器通常會在骨幹(backbone)跟頂部(head)之間加入一些層(layer),這些層通常是用來收集不同階段的特徵圖(feature maps)。我們可以將之稱為物體偵測器的頸部(neck)。通常,頸部的部份會是由幾個由下而上與幾個由上而下的路徑所組成。配置這種機制的網路包括Feature Pyramid Network (FPN)[44]、Path Aggregation Network (PAN)[49]、BiFPN[77]、與NAS-FPN[17]。除了上述模型之外,有一些研究人員會把他們的研究重心放在直接建置一個新的骨幹上(SpineNet[12]、HitDetector[20]),用以做為物體偵測。 ::: :::warning * one-stage:一次推論就把類別跟框分辨出來 * two-stage:先取得Region Proposal,再從中辨識類別 ::: :::info To sum up, an ordinary object detector is composed of several parts: * Input: Image, Patches, Image Pyramid * Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81] * Neck: * Additional blocks: SPP [25], ASPP [5], RFB[47], SAM [85] * Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98] * Heads: * Dense Prediction (one-stage): * RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based) * CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free) * Sparse Prediction (two-stage): * Faster R-CNN [64], R-FCN [9], Mask RCNN [23] (anchor based) * RepPoints [87] (anchor free) ::: :::success 我們來總結一下,一個普通的物體偵測器由幾個部位所組成: * 輸入:Image、Patches、Image Pyramid * 骨幹:VGG16[68]、ResNet-50[26]、SpineNet[12]、EfficientNet-B0/B7[75]、CSPResNeXt50[81]、CSPDarknet53[81] * 頸部: * Additional blocks:SPP[25]、ASPP[5]、RFB[47]、SAM[85] * Path-aggregation blocks:FPN[44]、PAN[49]、NAS-FPN[17]、Fully-connected FPN、BiFPN[77]、ASFF[48]、SFAM[98] * 頂部: * Dense Prediction (one-stage): * RPN[64]、SSD[50]、YOLO[61]、RetinaNet[45](anchor based) * CornerNet[37]、CenterNet[13]、MatrixNet[60]、FCOS[78](anchor free) * Sparse Prediction (two-stage): * Faster R-CNN[64]、R-FCN[9]、Mask RCNN[23](anchor based) * RepPoints[87](anchor free) ::: ### 2.2. Bag of freebies :::info Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as “bag of freebies.” What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. 
The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For examples, photometric distortions and geometric distortions are two commonly used data augmentation method and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating. ::: :::success 通常,常見的物體偵測器是離線訓練的。因此,研究人員會喜歡利用這一個優勢來開發更好的訓練方法,使其在不增加推理成本的情況下讓物體偵測器得到更好的準確度。像這種只會改變訓練策略或是單純的增加訓練成本的方法就稱為"bag of freebies"。物體偵測方法通常會用而且又符合"bag of freebies"的方法就是資料增強(data augmentation)。資料增強的目的在於增加輸入影像的變化性,這能讓設計的物體偵測模型對從不同環境中所獲得的影像有更高的魯棒性。舉例來說,photometric distortions與geometric distortions是資料增強中常見的兩種方法,而且這方法是一定有利於物體偵測任務。在處理photometric distortion的時候,我們會調整影像的亮度、對比、色調、飽和與噪點。如果是geometric distortions的話,我們會增加隨機的縮放、剪裁、翻轉與旋轉。 ::: :::info The data augmentation methods mentioned above are allpixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them to all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN. ::: :::success 剛剛上面提到的資料增強的方法全都是pixel-wise adjustments,而且調整區域內的所有原始像素信息都是被保留的。此外,一些從事資料增強强的研究人員會把重點放在模擬物體遮蔽的問題上。他們在影像分類和物體偵測方面取得了很好的成果。舉例來說,random erase[100]與CutOut[11]隨機地選擇一個矩形的區域,然後填充隨機或是互補的零值。至於hide-and-seek[69]與grid mask[6]則是隨機或是均勻地選擇影像中多個矩形區域,然後將之取代為零。如果把類似的概念應用到特徵圖的話,那就有DropOut[71]、DropConnect[80]、與DropBlock[16]等幾種方法。此外,也有一些研究人員提出把多張影像一起用來做資料增強的方法。舉例來說,MixUp[92]拿兩張影像以不同的系數比率來相乘、疊加,然後用這些疊加比例來調整標記(label)。CutMix[91]則是把剪下來的影像拿去蓋其它影像的矩陣區域,然後再根據混合區域的大小來調整標記(label)。除了上面說的方法之外,style transfer GAN[15]也可以用來做資料增強,這麼做還可以有效減少CNN學習到的texture bias(紋理偏差?)。 ::: :::warning * pixel-wise adjustments:像素級別的調整,大概就是調整是以pixel為單位在調整的意思 * random erase:字面上的看法就是隨機橡皮擦來擦擦 ::: :::info Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. 
But the example mining method is not applicable to one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network. ::: :::success 不同於上面提出的各種方法,其它bag of freebies的方法比較致力於解決資料集中語義分佈可能存在的偏差問題。在處理語義分佈偏差問題的時候,一個非常重要的問題是,在不同類別之間的資料不平衡問題,這個問題通常是利用two-stage object detector中的hard negative example mining[72]與online hard example mining[67]來解決。但是example mining的這類方法並不適用於one-stage object detector,因為這類的偵測器屬於密集預測型的架構(dense prediction architecture)。因此,Lin et al.[45]提出focal loss來處理存在各種類別之間的資料不平衡問題。另一個非常重要的問題是,你很難去用one-hot hard representation來表達不同類別之間關聯程度的關係。在執行標記(labeling)的時候通常會使用這種表示結構(representation scheme)。[73]中提出的label smoothing(標記平滑)是將hard label轉為soft label來訓練,這能讓模型更為魯棒性。為了能夠獲得更好的soft label,Islam et al.[33]引入knowledge distillation(知識蒸餾)的概念,以此設計label refinement network(LRN)。 ::: :::info The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., $\left\{x_{center}, y_{center}, w, h\right\}$, or the upper left point and the lower right point, i.e., $\left\{x_{top\_left}, y_{top\_left}, x_{bottom\_right}, y_{bottom\_right} \right\}$. As for anchor-based method, it is to estimate the corresponding offset, for example $\left\{x_{center\_offset}, y_{center\_offset}, w_{offset}, h_{offset} \right\}$ and $\left\{x_{top\_left\_offset}, y_{top\_left\_offset}, x_{bottom\_right\_offset}, y_{bottom\_right\_offset} \right\}$. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, but in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. 
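:::
:::warning
To make the IoU and GIoU losses described above concrete, below is a minimal NumPy sketch for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format; the function name, the box format, and the final "1 − metric" loss form are illustrative assumptions, not the implementation used in the cited papers.
```python
import numpy as np

def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle of the two boxes
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both: the extra term that GIoU adds on top of plain IoU
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose
    return iou, giou

pred = np.array([50.0, 50.0, 150.0, 150.0])
truth = np.array([60.0, 40.0, 170.0, 140.0])
iou, giou = iou_and_giou(pred, truth)
iou_loss, giou_loss = 1.0 - iou, 1.0 - giou   # the usual "1 - metric" loss form
```
:::
:::info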
As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem. ::: :::success 最後一個bag of freebies是Bounding Box (BBox) regression的目標函數。傳統的物體偵測器通常使用Mean Square Error (MSE)直接對BBox的中心點座標與高、寬做迴歸,也就是$\left\{x_{center}, y_{center}, w, h\right\}$,或者是左上、右下的座標點,$\left\{x_{top\_left}, y_{top\_left}, x_{bottom\_right}, y_{bottom\_right} \right\}$。對於anchor-based的方法,就是去估測相對應的偏移(offset),舉例來說,$\left\{x_{center\_offset}, y_{center\_offset}, w_{offset}, h_{offset} \right\}$與$\left\{x_{top\_left\_offset}, y_{top\_left\_offset}, x_{bottom\_right\_offset}, y_{bottom\_right\_offset} \right\}$。不過厚,要直接的去估測BBOX每個點的座標值的這個動作,就是把這些點視為independent variables(自變數、獨立變項),不過,事實上這並沒有考慮到物體本身的完整性。為了能夠更好的處理這些問題,一些研究人員提出IoU loss[90],這方法考慮了預測的BBox與實際的BBox的覆蓋範圍。IoU loss的計算過程會通過計算與實際BBox的IoU來觸發BBox四個座標點的計算,然後把生成的結果連結到整個程式碼。IoU是一種尺度不變的表示,它可以解決傳統方法在計算$\left\{x, y, w, h \right\}$的$l_1$、$l_2$的loss會隨尺度增加的問題。近來也是有一些研究人員持續不斷的改進IoU loss。舉例來說,GIoU loss[65]就是一種除了覆蓋區域之外還包括物體的形狀與方向的方法。他們提出的就是能夠同時覆蓋預測的BBox與真實的BBox的最小區域的BBox,然後用這個最小區域的BBox來做為原始IoU loss分母。DIoU loss[99]的話,它額外的考慮了物體中心點的距離,而CIoU loss[99]則是同時考慮重疊區域、中心點之間的距離,以及[長寬比](https://terms.naer.edu.tw/detail/ea6ba6a862029e21ae1dcf36f7662bf3/)。CIoU在BBox regression問題上有著較好的收斂速度與準確度。 ::: ### 2.3. Bag of specials :::info For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them “bag of specials”. Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results. ::: :::success 對於那些只需要增加一點點點的推論成本就能夠明顯提升物體偵測準確度的plugin modules與post-processing([後處理](https://terms.naer.edu.tw/detail/e186eecdb6d49ab095171b76e031f903/))的方法,我們稱之為"bag of specials"。一般來說,這些pluging modules是用來增強模型中的一些屬性,像是擴大接收域(receptive field),引入注意力機制,或是加強特徵的整合能力等,而後處理的部份則是一種篩選模型預測結果的方法。 ::: :::info Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module was originated from Spatial Pyramid Matching (SPM) [39], and SPMs original method was to split feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and use max-pooling operation instead of bag-of-word operation. Since the SPP module proposed by He et al. [25] will output one dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equals to 1. Under this design, a relatively large k × k maxpooling effectively increase the receptive field of backbone feature. After adding the improved version of SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. 
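:::
:::warning
The improved SPP module described above (stride-1 max-pooling with k = 1, 5, 9, 13, outputs concatenated along the channel axis) can be sketched in a few lines of PyTorch; this is an illustrative module under those assumptions, not the darknet implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBlock(nn.Module):
    """YOLOv3-style SPP: concatenate stride-1 max-pool outputs of several kernel sizes."""
    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.kernel_sizes = kernel_sizes

    def forward(self, x):
        # Padding k // 2 keeps the spatial size unchanged, so all outputs can be concatenated.
        pooled = [F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)
                  for k in self.kernel_sizes]
        return torch.cat(pooled, dim=1)

x = torch.randn(1, 512, 19, 19)    # a feature map from the backbone
print(SPPBlock()(x).shape)         # torch.Size([1, 2048, 19, 19]), i.e. 4 x 512 channels
```
:::
:::info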
The difference in operation between ASPP [5] module and improved SPP module is mainly from the original k×k kernel size, max-pooling of stride equals to 1 to several 3 × 3 kernel size, dilated ratio equals to k, and stride equals to 1 in dilated convolution operation. RFB module is to use several dilated convolutions of k×k kernel, dilated ratio equals to k, and stride equals to 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%. ::: :::success 常見可以用於增強receptive field的模組為SPP[25]、ASPP[5]與RFB[47]。SPP module是源自於Spatial Pyramid Matching (SPM)[39],SPMs的原始方法是將feature map切割成$d \times d$相同大小的方塊(equal blocks),其中$d$可以是$\left\{1,2,3,... \right\}$,以此形成一個空間金字塔,再來提取bag-of-word features(詞袋特徵)。SPP把SPM整合進去CNN,然後用max-pooling的操作來取代bag-of-word(詞袋操作)。由於He et al.[25]提出的SPP module會輸出一維特徵向量(one dimensional feature vector),這在Fully Convolutional Network (FCN)是不可實行的。也因此,在YOLOv3[63]的設計中,Redmon與Farhadi把SPP module改進為把max pooling的輸出以kernel size為$k \times k$的方式做[序連連接](https://terms.naer.edu.tw/detail/61086a450705ad4da31db403fe362418/),其中$k=\left\{1, 5, 9, 13 \right\}$,且stride等於1。這種設計的話,相對較大的$k \times k$的max-pooling有效地增加骨幹特徵的接收域(receptive field)。在加入改良版的SPP module之後,YOLOv3-608在MS COCO這個物體偵測任務上的AP50提升了2.7%(以增加0.5%的計算量做為代價)。ASPP[5] module與improved SPP module在運作上的差異主要是從原本的$k \times k$的kernel size、stride等於1的max-pooling變成是多個$3 \times 3$的kernel size(dilated ratio等價於$k$),過程中的stride還是等於1。RFB module是採用多個kernel為$k \times k$的dilated convolutions,dilated ratio等價於$k$,stride等於1,以取得比ASPP更全面性的空間覆蓋。RFB[47]只需要7%的額外的推論時間就可以把在MS COCO資料集上SSD的AP50提高5.7%。 ::: :::warning * dilated ratio等價於$k$的意思就是說,一個5x5的filter等價於2個3x3,7x7的話則是等價於3個3x3 ::: :::info The attention module that is often used in object detection is mainly divided into channel-wise attention and pointwise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although SE module can improve the power of ResNet50 in the ImageNet image classification task 1% top-1 accuracy at the cost of only increasing the computational effort by 2%, but on a GPU usually it will increase the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all. ::: :::success 通常應用在物體偵測的注意力模組(attention module)主要分為channel-wise attention與pointwise attention,這兩種attention models主要的代表分別為Squeeze-and-Excitation (SE)[29]與Spatial Attention Module (SAM)[85]。雖然SE module能夠提高ResNet50在ImageNet的分類任務1%的top-1準確度(只需要額外增加2%的計算力),不過這在GPU上大約會增加10%的推論時間(inference time),因此,這比較適用於移動裝置上使用。但如果是SAM的話,只需要付上0.1%的額外計算就能夠提高ResNet50-SE在ImageNet的分類任務上0.5%的top-1準確度。最重要的是,它根本不會影響到GPU上的推論速度。 ::: :::info In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate lowlevel physical feature to high-level semantic feature. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramid have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE module to execute channelwise level re-weighting on multi-scale concatenated feature maps. 
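:::
:::warning
Since SE-style channel-wise re-weighting appears both in the attention discussion above and inside SFAM, a minimal PyTorch sketch of the squeeze-and-excitation idea may help; the class name and the reduction ratio of 16 are assumptions for illustration, not a reference implementation.
```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise re-weighting: global average pool -> small MLP -> sigmoid gates."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: per-channel global average
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel gate in (0, 1)
        return x * w                      # re-weight each channel of the feature map

x = torch.randn(2, 256, 13, 13)
print(SEBlock(256)(x).shape)              # torch.Size([2, 256, 13, 13])
```
:::
:::info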
As for ASFF, it uses softmax as point-wise level reweighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections is proposed to execute scale-wise level re-weighting, and then add feature maps of different scales. ::: :::success 在[特徵整合](https://terms.naer.edu.tw/detail/2b753c431b5edb4444f34f079a9253ac/)(feature integration)的部份,早期的做法是使用skip connection[51]或是hyper-column[22]的作法將low-level的物理特徵整合到high-level的語義特徵(semantic feature)。由於像是FPN這類的多尺度(multi-scale)預測方法愈來愈受歡迎,就有愈來愈多的整合不同特徵金字塔(feature pyramid)的輕量級模組被提出。這類模組包含SFAM[98]、ASFF[48]與BiFPN[77]。SFAM主要的概念就是使用SE module在多尺度序連的特徵圖上(multi-scale concatenated feature maps)做channel-wise level的re-weighting(重加權?)。ASFF的話,它使用softmax來做為point-wise level的re-weighting,然後再加入不同尺度的特徵圖。在BiFPN中,則是提出multi-input weighted residual connections來做scale-wise level的re-weighting,然後再加入不同尺度的特徵圖。 ::: :::info In the research of deep learning, some people put their focus on searching for good activation function. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] propose ReLU to substantially solve the gradient vanish problem which is frequently encountered in traditional tanh and sigmoid activation function. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanish problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation function. ::: :::success 在深度學習的研究中,有些人會把重點放在找一個好的啟動函數(activation function)。一個好的啟動函數可以讓梯度更有效地傳播,同時也不會產生太多額外的計算成本。2010年中,Nair與Hinton[56]提出的ReLU大幅度的解決在傳統tand與sigmoid這兩個啟動函數很常遇到的梯度消失與爆炸問題。隨之而來的是LReLU[54]、PReLU[24]、ReLU6[28]、Scaled Exponential Linear Unit (SELU)[35]、Swish[59]、hard-Swish[27]與Mish[55]等,也都被提出用來解決梯度消失問題。LReLU與PReLU主要是解決在ReLU上輸出小於0的時候梯度為0的問題。ReLU6與hard-Swish是專門為quantization networks所設計的。對於神經網路的self-normalizing的話,則是有提出SELU來滿足目標。值得注意的是,Swish與Mish都是連續可微的啟動函數。 ::: :::info The post-processing method commonly used in deeplearning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added classification confidence score in R-CNN as a reference, and according to the order of confidence score, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score. The DIoU NMS [99] developers way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of above postprocessing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of an anchor-free method. 
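:::
:::warning
A minimal NumPy sketch of the greedy NMS procedure described above, with an optional DIoU-style term that also penalizes centre-point distance before thresholding; the threshold value, function name, and box format are assumptions for illustration, not a reference implementation.
```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5, use_diou=False):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # process boxes from high to low confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU between the current best box and the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        if use_diou:
            # DIoU-NMS: subtract a normalized centre-distance term before thresholding
            cx_i, cy_i = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
            cx_r, cy_r = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
            ex1, ey1 = np.minimum(boxes[i, 0], boxes[rest, 0]), np.minimum(boxes[i, 1], boxes[rest, 1])
            ex2, ey2 = np.maximum(boxes[i, 2], boxes[rest, 2]), np.maximum(boxes[i, 3], boxes[rest, 3])
            diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
            iou = iou - ((cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2) / diag2
        order = rest[iou <= iou_thr]   # drop candidates that overlap the kept box too much
    return keep
```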
::: :::success 在基於深度學習的物體偵測方法中,最常被用到的後處理方法就是NMS,這可以用來過濾那些預測到相同物體的BBoxes,只會保留有著較高響應的候選BBoxes。NMS嚐試改進的方式跟最佳化目標函數的方法是一致的。一開始由NMS所提出的方法是不包含上下文信息(context information)的,因此Girshick et al.[19]在R-CNN中加入類別置信度分數來做為參考,然後根據置信度分數的順序,以由高至低的順序來做greedy NMS。soft NMS [1]的話,它考慮到在使用IoU score的greedy NMS中,物體的遮擋可能會造成置信度分數降低的問題。DIoU NMS[99]開發人員的想法是在soft NMS的基礎上,在BBOX篩選過程中加入中心點距離的信息。值得提到的是,因為上面說的後處理方法都不是直接參考補捉到的影像特徵,所以在anchor-free method的後續開發中就不需要再做後處理。 ::: ## 3. Methodology :::info The basic aim is fast operating speed of neural network, in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks: * For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53 * For VPU - we use grouped-convolution, but we refrain from using Squeeze-and-excitement (SE) blocks - specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3 ::: :::success 基本的目標是神經網路在生產系統中快速執行速度,以及平行計算的最佳化,而不是low computation volume theoretical indicator (BFLOP)。我們提出兩種實時神經網路的選項: * 對於GPU,我們在卷積層中使用少量的群組(1-8):CSPResNeXt50 / CSPDarknet53(Group Convolution?) * 對於VPU,我們使用grouped-convolution,不過我們會避免使用Squeeze-and-excitement (SE) blocks,特別是這包含下面模型:EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3 ::: ### 3.1. Selection of architecture :::info Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter size2 * filters * channel / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNext50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNext50 in terms of detecting objects on the MS COCO dataset [46]. ::: :::success 我們的目標就是在輸入網路的解析度、卷積層的數量、參數量(filter size^2^ \* filters \* channel / groups),以及每一層輸出的數量(filters)之間找出一個最佳的平衡。舉例來說,我們的大量研究說明著,CSPResNext50在ILSVRC2012 (ImageNet)資料集[10]上的物體分類的部份比起CSPDarknet53要好的多了。不過,反過來說,CSPDarknet53在MS COCO資料集[46]上的偵測物體的部份則是比CSPResNext50要來的好。 ::: :::info The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN. ::: :::success 下一個目標為不同的檢測級別(detector levels)選擇額外的blocks來增加receptive field,以及從不同的骨幹級別(backbone levels)選擇最好的參數聚合的方法:像是FPN、PAN、ASFF、BiFPN。 ::: :::info A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following: * Higher input network size (resolution) – for detecting multiple small-sized objects * More layers – for a higher receptive field to cover the increased size of input network * More parameters – for greater capacity of a model to detect multiple objects of different sizes in a single image ::: :::success 那種用在類別應用上是最佳的參考模型,拿來用在偵測並不見得會最好的。相較於分類器,偵測器需要下面這些: * 較高的輸入網路大小(解析度),用於偵測多個小尺寸的物體 * 更多層,能夠有較高的receptive field來覆蓋增加的輸入網路的大小 * 更多的參數,讓模型有更大的能力可以在單一影像中偵測出多個不同尺度的物體 ::: :::info Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. 
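:::
:::warning
As a rough illustration of why stacking more 3 × 3 convolutions enlarges the receptive field, here is a small helper that applies the standard receptive-field recurrence to a list of (kernel, stride) layers; the example layer stack is hypothetical and is not the CSPDarknet53 configuration.
```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field on the input."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k - 1) * current jump input pixels
        jump *= s              # striding compounds how far apart the layer's samples sit
    return rf

# A hypothetical stack: alternating stride-2 downsampling and stride-1 3x3 convolutions.
print(receptive_field([(3, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]))  # 43
```
:::
:::info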
The CSPResNext50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, show that CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector. ::: :::success 假設說,我們可以選擇一個有更大的receptive field size(有很多很多的$3 \times 3$的卷積層)以及參數更多的模型來做為骨幹。Table 1給出了CSPResNeXt50、CSPDarknet53、與EfficientNet B3的信息。CSPResNext50僅包含16個$3 \times 3$的卷積層,$\mathsf{a} \space 425 \times 425$的receptive field,以及$20.6 \mathsf{M}$的參數量,而CSPDarknet53則是包含29個$3 \times 3$的卷積層,$\mathsf{a} \space 725 \times 725$的receptive field,以及$27.6 \mathsf{M}$的參數量。這個理論[成義](https://terms.naer.edu.tw/detail/679c650871981df88c96b6eed100934f/)以及我們大量的實驗說明著,CSPDarknet53神經網路是兩者作為偵測器主幹的最佳模型。 ::: :::info ![](https://hackmd.io/_uploads/SkShfZWBo.png) Table 1: Parameters of neural networks for image classification Table 1:用於影像分類的神經網路的參數 ::: :::info The influence of the receptive field with different sizes is summarized as follows: * Up to the object size - allows viewing the entire object * Up to network size - allows viewing the context around the object * Exceeding the network size - increases the number of connections between the image point and the final activation ::: :::success 不同大小的接收域的影響總結如下: * Up to the object size - allows viewing the entire object * Up to network size - allows viewing the context around the object * Exceeding the network size - increases the number of connections between the image point and the final activation ::: :::warning 這邊沒有特別翻譯,好難去說明,receptive field的大小如果大到物體大小的話,那就可以看整個物體,如果大到整個網路大小的話,那就可以看整個物體周圍的上下文,超過網路大小的話,那就會增加image point與final activation之間的連接數。 ::: :::info We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3. ::: :::success 我們在CSPDarknet53上增加SPP block,因為這明顯增加receptive field,分離出最重要的上下文特徵,而且這幾乎不會降低網路的執行速度。我們使用PANet來做為不同檢測級別(detector levels)從不同骨幹級別(backbone levels)中做參數聚合(parameter aggregation)的方法,而不是YOLOv3中使用的FPN。 ::: :::warning 這邊說明的應該就是YOLOv3中從Darknet去做三個output到yolo的那一段,這三個output就是FPN的output。 ::: :::info Finally, we choose CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4. ::: :::success 最終,我們選擇CSPDarknet53做為骨幹,SPP做為附加的模組,PANet做為path-aggregation(頸部),然後YOLOv3 (anchor based)做為頭部,這樣的組合做為YOLOv4的架構。 ::: :::info In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion. ::: :::success 未來我們計畫為dector大幅度的擴展Bag of Freebies (BoF)的內容,這理論上可以解決一些問題,並增加dector的準確度,然後以實驗的形式依序確認每個特徵的影響。 ::: :::info We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphic processor e.g. GTX 1080Ti or RTX 2080Ti. ::: :::success 我們並沒有使用Cross-GPU Batch Normalization (CGBN或SyncBN)或是其它貴森森的專用設備。這讓有興趣的人都可以隨便拿一張圖形處理器就能復現我們最好棒棒的結果,像是GTX 1080Ti或是RTX 2080Ti。 ::: ### 3.2. 
Selection of BoF and BoS :::info For improving the object detection training, a CNN usually uses the following: * Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish * Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU * Data augmentation: CutOut, MixUp, CutMix * Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock * Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89] * Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP) ::: :::success 為了能夠改善物體偵測的訓練,CNN通常使用下面這些咪啊: * Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish * Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU * Data augmentation: CutOut, MixUp, CutMix * Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock * Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89] * Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP) ::: :::info As for training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization network, we therefore remove the above activation functions from the candidate list. In the method of reqularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method has won a lot. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered. ::: :::success 對於訓練使用的activation function,因為PReLU與SELU更難訓練,ReLU6的話則是專門設計來給quantization network使用的,所以我們就把上面列的這幾個啟動函數從我們的候選清單中排除。[正則化](https://terms.naer.edu.tw/detail/d901ee250a685df8063edb4e76632f6b/)(regularization)方法的部份,發表DropBlock的人仔細的把它們的方法與其它方法做了比較,成果表明他們的方法贏了很多很多。因此,我們毫不手軟的選擇DropBlock來做為我們的[正則化](https://terms.naer.edu.tw/detail/d901ee250a685df8063edb4e76632f6b/)方法。至於正規化(normalization)方法的選擇,因為我們只專注在一塊GPU上的訓練策略上,就不考syncBN了。 ::: :::info ![](https://hackmd.io/_uploads/B1s0vGsBo.png) Figure 3: Mosaic represents a new method of data augmentation ::: ### 3.3. 
Additional improvements :::info In order to make the designed detector more suitable for training on single GPU, we made additional design and improvement as follows: * We introduce a new method of data augmentation Mosaic, and Self-Adversarial Training (SAT) * We select optimal hyper-parameters while applying genetic algorithms * We modify some exsiting methods to make our design suitble for efficient training and detection - modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN) ::: :::success 為了能夠讓我們所設計的偵測器更適合在單一GPU上訓練,我們做了下面這些額外的設計: * 我們引入新的資料增強的方法Mosaic,與Self-Adversarial Training (SAT) * 我們在應用[基因演算法](https://terms.naer.edu.tw/detail/5e32be0b9e7b66e1efba532fef70a49d/)的時候選擇最佳超參數 * 我們調整一些現有的方法,讓我們的設計更適用於高效的訓練與檢測,修正SAM、PAN與Cross mini-Batch Normalization (CmBN) ::: :::info Mosaic represents a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size. ::: :::success Mosaic表示一種新的資料增強方法,其混合四張訓練影像。因此它混合四種不同的上下文,而CutMix僅混合兩個輸入影像。這允許檢測正常上下文之外的物體。此外,batch normalization會在每個layer從四張不同的影像中計算啟動資訊(activation statistics)。這明顯減少對large mini-batch size的需求。 ::: :::info Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object on the image. In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way. ::: :::success Self-Adversarial Training (SAT)也代表一種新的資料增強技術,這在兩個前向(forward)、反向(backward)階段中執行。在第一階段中(1st stage),神經網路會改變原始的影像而不是權重。以這樣的方式來看,神經網路會對自己做一個對抗式的攻擊,改變原始影像,以產生影像上所沒有的的物體來做欺騙。在第二階段中就會訓練神經網路以正常的方式檢測這張編修過影像上的物體。 ::: :::info CmBN represents a CBN modified version, as shown in Figure 4, defined as Cross mini-Batch Normalization (CmBN). This collects statistics only between mini-batches within a single batch. ::: :::success CmBN表示CBN的修改版本,如Figure 4所示,將之定義為Cross mini-Batch Normalization (CmBN)。這單純的在單一batch內的mini-batches之間收集統計信息而以。 ::: :::info ![](https://hackmd.io/_uploads/Hk1AtGUHj.png) Figure 4: Cross mini-Batch Normalization ::: :::info We modify SAM from spatial-wise attention to pointwise attention, and replace shortcut connection of PAN to concatenation, as shown in Figure 5 and Figure 6, respectively. ::: :::success 我們把SAM從spatial-wise attention調整成point-wise attention,然後把PAN的shortcut connection調整成concatenation,如Figure 5、Figure 6所示。 ::: :::info ![](https://hackmd.io/_uploads/ryqg5MUHi.png) Figure 5: Modified SAM. ::: :::info ![](https://hackmd.io/_uploads/SJT-cfIHo.png) ![](https://hackmd.io/_uploads/SkZz5fUSs.png) ::: ### 3.4. YOLOv4 :::info In this section, we shall elaborate the details of YOLOv4. 
YOLOv4 consists of: * Backbone: CSPDarknet53 [81] * Neck: SPP [25], PAN [49] * Head: YOLOv3 [63] YOLOv4 uses: * Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing * Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multiinput weighted residual connections (MiWRC) * Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52], Optimal hyperparameters, Random training shapes * Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS ::: :::success 這一章我們就要來好好說說YOLOv4的細節了。YOLOv4包含: * Backbone: CSPDarknet53 [81] * Neck: SPP [25], PAN [49] * Head: YOLOv3 [63] YOLOv4使用: * Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing * Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multiinput weighted residual connections (MiWRC) * Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52], Optimal hyperparameters, Random training shapes * Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS ::: ## 4. Experiments :::info We test the influence of different training improvement techniques on accuracy of the classifier on ImageNet (ILSVRC 2012 val) dataset, and then on the accuracy of the detector on MS COCO (test-dev 2017) dataset. ::: :::success 我們在ImageNet (ILSVRC 2012 val)這資料集上測試不同的訓練改進技術對準確度的影響,然後在MS COCO(test-dev 2017)資料集上測試對偵測器的準確度影響。 ::: ### 4.1. Experimental setup :::info In ImageNet image classification experiments, the default hyper-parameters are as follows: the training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; the polynomial decay learning rate scheduling strategy is adopted with initial learning rate 0.1; the warm-up steps is 1000; the momentum and weight decay are respectively set as 0.9 and 0.005. All of our BoS experiments use the same hyper-parameter as the default setting, and in the BoF experiments, we add an additional 50% training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, Bluring data augmentation, and label smoothing regularization methods. In the BoS experiments, we compared the effects of LReLU, Swish, and Mish activation function. All experiments are trained with a 1080 Ti or 2080 Ti GPU. ::: :::success 在ImageNet影像分類實驗中,預設的超參數如下:training steps為8,000,000;batch size與mini-batch size分別為128、32;採用多項式型的學習效率衰減的策略,初始值為0.1;warm-up steps為1,000;momentum與weight decay分別為0.9與0.005。我們所有的BoS實驗的超參數都採預設設置,如果是BoF實驗的話,就會額外增加50%的training steps。在BoF實驗中,我們驗證了MixUp、CutMix、Mosaic、Bluring等資料增強與label smoothing regularization這幾種方法。在BoS實驗中,我們比較了LReLU、Swish、與Mish等三種啟動函數的效果。所有的實驗都是用一塊1080Ti或是2080Ti的GPU訓練的。 ::: :::warning 非常強調YOLOv4採用的設備非常親民! 
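:::
:::warning
To make the classifier schedule above concrete, here is a small sketch of warm-up followed by polynomial decay using the stated numbers (initial learning rate 0.1, 1,000 warm-up steps, 8,000,000 training steps); the decay power and the linear warm-up shape are assumptions, since the text does not specify them.
```python
def classifier_lr(step, base_lr=0.1, warmup=1000, total_steps=8_000_000, power=4.0):
    """Warm-up followed by polynomial decay of the learning rate (rough sketch)."""
    if step < warmup:
        # Linear ramp from 0 up to the base learning rate during warm-up.
        return base_lr * step / warmup
    # Polynomial decay toward zero over the remaining training steps.
    progress = (step - warmup) / (total_steps - warmup)
    return base_lr * (1.0 - progress) ** power

for s in (0, 500, 1000, 4_000_000, 8_000_000):
    print(s, classifier_lr(s))
```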
::: :::info In MS COCO object detection experiments, the default hyper-parameters are as follows: the training steps is 500,500; the step decay learning rate scheduling strategy is adopted with initial learning rate 0.01 and multiply with a factor 0.1 at the 400,000 steps and the 450,000 steps, respectively; The momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64 while mini-batch size is 8 or 4 depend on the architectures and GPU memory limitation. Except for using genetic algorithm for hyper-parameter search experiments, all other experiments use default setting. Genetic algorithm used YOLOv3-SPP to train with GIoU loss and search 300 epochs for min-val 5k sets. We adopt searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07 for genetic algorithm experiments. We have verified a large number of BoF, including grid sensitivity elimination, mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, selfadversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, Optimized Anchors, different kind of IoU losses. We also conduct experiments on various BoS, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments, we only use one GPU for training, so techniques such as syncBN that optimizes multiple GPUs are not used. ::: :::success 在MS COCO物體偵測實驗中,預設的超參數如下:training steps為500,500;學習效率初始為0.01,然後在400,000 steps與450,000 steps的時候會各自乘上0.1;momentum與weight decay分別為0.9與0.0005。所有的架構都是使用單一塊GPU來執行多尺度的訓練(batch size為64),mini-batch size是4還是8就取決於架構跟GPU的記憶體限制。除了超參數的搜尋實驗是使用基因演算法之外,其它的實驗也都是使用預設設置。基因演算法用YOLOv3-SPP,搭配GIoU loss來訓練,以300個epochs來尋找min-val 5k sets。我們採用找到的學習效率0.00261、momentum 0.949、IoU閥值為0.213、loss normalizer 0.07的參數組合來做基因演算法的實驗。我們已經驗證大量的BoF,包含grid sensitivity elimination、mosaic data augmentation、IoU threshold、genetic algorithm、class label smoothing、cross mini-batch normalization、self-adversarial training、cosine annealing scheduler、dynamic mini-batch size、DropBlock、Optimized Anchors、還有不同類型的IoU losses。我們還在各種的BoS上做實驗,包含Mish、SPP、SAM、RFB、BiFPN、與Gaussian YOLO [8]。所有的實驗都是使用一塊GPU來做訓練,所以沒有使用像是syncBN這種最佳化多塊GPUs的技術就是了。 ::: ### 4.2. Influence of different features on Classifier training :::info First, we study the influence of different features on classifier training; specifically, the influence of Class label smoothing, the influence of different data augmentation techniques, bilateral blurring, MixUp, CutMix and Mosaic, as shown in Fugure 7, and the influence of different activations, such as Leaky-ReLU (by default), Swish, and Mish. ::: :::success 首先,我們研究不同特徵對分類器訓的影響;具體來說就是Class label smoothing的影響、不同資料增強技術的影響、如Figure 7所示的bilateral blurring、MixUp、CutMix與Mosaic,以及不同的啟動函數的影響,像是Leaky-ReLU (by default)、Swish與Mish。 ::: :::warning class label smoothing:所指為在類別標籤上(1、2、3)加入一些噪點資訊 ::: :::info ![](https://hackmd.io/_uploads/SkWdmKwSi.png) Figure 7: Various method of data augmentation. ::: :::info In our experiments, as illustrated in Table 2, the classifier’s accuracy is improved by introducing the features such as: CutMix and Mosaic data augmentation, Class label smoothing, and Mish activation. As a result, our BoFbackbone (Bag of Freebies) for classifier training includes the following: CutMix and Mosaic data augmentation and Class label smoothing. In addition we use Mish activation as a complementary option, as shown in Table 2 and Table 3. 
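:::
:::warning
Two of the classifier-training features just mentioned are easy to state directly; below is a minimal NumPy sketch of the Mish activation (x · tanh(softplus(x))) and of class label smoothing, where the smoothing factor 0.1 and the way the remaining probability mass is spread over non-target classes are illustrative assumptions.
```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)), smooth and continuously differentiable."""
    return x * np.tanh(np.log1p(np.exp(x)))

def smooth_labels(one_hot, eps=0.1):
    """Turn hard one-hot labels into soft targets: 1 -> 1 - eps, 0 -> eps / (C - 1)."""
    c = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (c - 1)

print(mish(np.array([-2.0, 0.0, 2.0])))   # approx [-0.2525, 0.0, 1.944]
print(smooth_labels(np.eye(4)[1]))        # [0.0333..., 0.9, 0.0333..., 0.0333...]
```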
::: :::success 在我們的實驗中,如Table 2所說的那樣,我們透過引入像是CutMix與Mosaic data augmentation、Mish activation等方法來提高分類器的準確度。因此,我們用於分類器訓練的BoF-backbone(Bag of Freebies)就包含下面方法:CutMix與Mosaic data augmentation與Class label smoothing。因外,我們使用Mish activation來做為一個[互補](https://terms.naer.edu.tw/detail/56ddd11df487a4aedc01367a72c07aa2/)的選項,如Table 2、Table 3所示。 ::: :::info ![](https://hackmd.io/_uploads/BJtJpqDSo.png) Table 2: Influence of BoF and Mish on the CSPResNeXt-50 classifier accuracy. ::: :::info ![](https://hackmd.io/_uploads/BJDBTcvBj.png) Table 3: Influence of BoF and Mish on the CSPDarknet-53 classifier accuracy. ::: ### 4.3. Influence of different features on Detector training :::info Further study concerns the influence of different Bag-of-Freebies (BoF-detector) on the detector training accuracy, as shown in Table 4. We significantly expand the BoF list through studying different features that increase the detector accuracy without affecting FPS: * S: Eliminate grid sensitivity the equation $b_x = \sigma(t_x) + c_x, b_y=\sigma(t_y)$, where $c_x$ and $c_y$ are always whole numbers, is used in YOLOv3 for evaluating the object coordinates, therefore, extremely high $t_x$ absolute values are required for the $b_x$ value approaching the $c_x$ or $c_{x+1}$ values. We solve this problem through multiplying the sigmoid by a factor exceeding 1.0, so eliminating the effect of grid on which the object is undetectable. * M: Mosaic data augmentation - using the 4-image mosaic during training instead of single image * IT: IoU threshold - using multiple anchors for a single ground truth IoU (truth, anchor) > IoU threshold * GA: Genetic algorithms - using genetic algorithms for selecting the optimal hyperparameters during network training on the first 10% of time periods * LS: Class label smoothing - using class label smoothing for sigmoid activation * CBN: CmBN - using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch * CA: Cosine annealing scheduler - altering the learning rate during sinusoid training * DM: Dynamic mini-batch size - automatic increase of mini-batch size during small resolution training by using Random training shapes * OA: Optimized Anchors - using the optimized anchors for training with the 512x512 network resolution * GIoU, CIoU, DIoU, MSE - using different loss algorithms for bounded box regression ::: :::success 我們進一步的研究涉及不同的Bag-of-Freebies(BoF-detector)對偵測器訓練準確度的影響,如Table 4所示。我們通過研究在不影響FPS的情况下提高偵測器準確度的不同特徵,這明顯的擴展了BoF的清單: * S:Eliminate grid sensitivity,其方程式$b_x = \sigma(t_x) + c_x, b_y=\sigma(t_y)$,其中$c_x$與$c_y$總是整數,這是在YOLOv3中用來評估物體座標用的,所以說,對於$b_x$接近$c_x$、$c_{x+1}$來說,是需要非常高的$t_x$絕對值。我們透過把sigmoid乘上一個大於1.0的因子來解決這個問題,因此就消除grid上的物體無法被偵測到的影響。 * M:Mosaic data augmentation,訓練期間使用四張影像的馬賽克,而不是單一影像 * IT:IoU threshold,對單一個ground truth IoU (truth, anchor) > IoU threshold使用多個anchors來處理 * GA:Genetic algorithms,在網路訓練的前10%的時期間採用基因演算法來選擇最佳超參數 * LS:Class label smoothing,使用class label smoothing來做sigmoid activation * CBN:CmBN,使用Cross mini-Batch Normalization來收集整個批次(batch)內的統計資訊,而非單純的收集單一mini-batch內的統計資訊 * CA:Cosine annealing scheduler,在sinusoid training期間改變學習效率 * DM:Dynamic mini-batch size,透過使用Random training shapes在小的解析度訓練中自動新增mini-batch size * OA:使用優化過的anchors來做512x512網路解析度的訓練 * GIoU, CIoU, DIoU, MSE,為bounded box regression使用不同的loss algorithms ::: :::info Further study concerns the influence of different Bag-of-Specials (BoS-detector) on the detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, 
as shown in Table 5. In our experiments, the detector gets best performance when using SPP, PAN, and SAM. ::: :::success 進一步的研究涉及不同的Bag-of-Specials (BoS-detector)在偵測器訓練準確度上的影響,包含PAN、RFB、SAM、Gaussian YOLO (G)、與ASFF,如Table 5所示。在我們的實驗中,當使用SPP、PAN、與SAM的時候,其偵測器取得最佳效能。 ::: :::info ![](https://hackmd.io/_uploads/rkeAAyFHj.png) Table 5: Ablation Studies of Bag-of-Specials. (Size 512x512). ::: ### 4.4. Influence of different backbones and pretrained weightings on Detector training :::info Further on we study the influence of different backbone models on the detector accuracy, as shown in Table 6. We notice that the model characterized with the best classification accuracy is not always the best in terms of the detector accuracy. ::: :::success 擱再來,我們研究不同的主幹模型(backbone models)對偵測器準確度的影響,如Table 6所示。我們注意到,模型就算有著最佳的分類準確度也不代表能在偵測器準確度上也是最好。 ::: :::info ![](https://hackmd.io/_uploads/Bk-DdzoSj.png) Table 6: Using different classifier pre-trained weightings for detector training (all other training parameters are similar in all models) . ::: :::info First, although classification accuracy of CSPResNeXt50 models trained with different features is higher compared to CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in terms of object detection. ::: :::success 首先,儘管相較於CSPDarknet53來說,用不同特徵訓練出來的CSPResNeXt50的分類準確度是較佳的,但CSPDarknet53在物體偵測上還是有著較高的準確性。 ::: :::info Second, using BoF and Mish for the CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy. However, using BoF and Mish for the CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector which uses this classifier pre-trained weightings. The net result is that backbone CSPDarknet53 is more suitable for the detector than for CSPResNeXt50. ::: :::success 再來就是,我們把BoF與Mish用在CSPResNeXt50分類器的訓練是可以提高其分類準確度,但是當我們進一步的把這些預訓練的權重用到偵測器訓練上的時候,反而會降低其準確度。然而,把BoF跟Mish用在CSPDarknet53的話,是可以兩者都增加準確度的(使用這個模型預訓練出來的權重的情況下)。最終結果就是CSPDarknet53比CSPResNeXt50更適合用於偵測器。 ::: :::info We observe that the CSPDarknet53 model demonstrates a greater ability to increase the detector accuracy owing to various improvements. ::: :::success 我們觀察到,由於各種改善,CSPDarknet53顯示出更大的能力來提高偵測器的準確度。 ::: ### 4.5. Influence of different mini-batch size on Detector training :::info Finally, we analyze the results obtained with models trained with different mini-batch sizes, and the results are shown in Table 7. From the results shown in Table 7, we found that after adding BoF and BoS training strategies, the mini-batch size has almost no effect on the detector’s performance. This result shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use only a conventional GPU to train an excellent detector. ::: :::success 最後,我們分析了使用不同mini-batch size訓練得到的結果,如Table 7所示。從Table 7的結果來看,我們發現到,在加入BoF與BoS訓練策略之後,mini-batch size對於偵測器的效能已經幾乎沒有影響了。這個結果說明著,在引入BoF與BoS之後,訓練模型的這部份已經不再需要那種昂貴的GPU了。換句話說,隨便一個路人甲都可以用著一塊一般的GPU來訓練一個出色的偵測器。 ::: :::info ![](https://hackmd.io/_uploads/SyO_uGsHo.png) Table 7: Using different mini-batch size for detector training. ::: ## 5. Results :::info Comparison of the results obtained with other stateof-the-art object detectors are shown in Figure 8. Our YOLOv4 are located on the Pareto optimality curve and are superior to the fastest and most accurate detectors in terms of both speed and accuracy. 
::: :::success 跟其它效能出色的物體偵測器的比較如Figure 8所示。我們的YOLOv4落於Pareto optimality curve,而且在速度與準確度上都優於最快、最準的偵測器。 ::: :::info ![](https://hackmd.io/_uploads/S1VfOGoHi.png) Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta) ::: :::info Since different methods use GPUs of different architectures for inference time verification, we operate YOLOv4 on commonly adopted GPUs of Maxwell, Pascal, and Volta architectures, and compare them with other state-of-the-art methods. Table 8 lists the frame rate comparison results of using Maxwell GPU, and it can be GTX Titan X (Maxwell) or Tesla M40 GPU. Table 9 lists the frame rate comparison results of using Pascal GPU, and it can be Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. As for Table 10, it lists the frame rate comparison results of using Volta GPU, and it can be Titan Volta or Tesla V100 GPU. ::: :::success 由於不同的方法採用不同架構的GPU來做推論時間的驗證,我們在大家常用的Maxwell、Pascal與Volta架構的GPU上執行YOLOv4,然後以此跟其它最先進(state-of-the-art)的方法來做比較。Table 8列出使用Maxwell GPU的[圖幀率](https://terms.naer.edu.tw/detail/4bc36333ff152e8736c9678f06f1185f/)的比較結果,它可以是GTX Titan X (Maxwell)或Tesla M40 GPU。Table 9列出使用Pascal GPU的[圖幀率](https://terms.naer.edu.tw/detail/4bc36333ff152e8736c9678f06f1185f/)比較結果,它可以是Titan X (Pascal)、Titan Xp、GTX 1080 Ti或Tesla P100 GPU。Table 10的話則是列出使用Volta GPU的[圖幀率](https://terms.naer.edu.tw/detail/4bc36333ff152e8736c9678f06f1185f/)比較結果,它可以是Titan Volta或是Tesla V100 GPU。 ::: ## 6. Conclusions :::info We offer a state-of-the-art detector which is faster (FPS) and more accurate (MS COCO AP50...95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8-16 GB-VRAM this makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features, and selected for use such of them for improving the accuracy of both the classifier and the detector. These features can be used as best-practice for future studies and developments. ::: :::success 我們給出一個世界NO1的偵測器,比所有可用的替代偵測器都來的更快(FPS)更準(MS COCO AP50..95與AP50)。我們所描述的偵測器是可以用一般那種8-16 GB-VRAM的GPU就可以訓練,這造就無限可能。one-stage anchor-based detectors已經證明它的可行性。我們已經驗證大量的功能(方法),並且選擇使用這些功能(方法)來提高分類器與偵測器的準度度。這些功能(方法)可以用作未來研究和開發的最佳實踐。 ::: :::info ![](https://hackmd.io/_uploads/BkgsuGsHi.png) ![](https://hackmd.io/_uploads/Bye3uzsri.png) Table 8: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (testdev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.) ::: :::info ![](https://hackmd.io/_uploads/BJogKMsBs.png) ![](https://hackmd.io/_uploads/Hk0xFGoSj.png) Table 9: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.) ::: :::info ![](https://hackmd.io/_uploads/S1_EYziBo.png) ![](https://hackmd.io/_uploads/BkjNFMiSj.png) Table 10: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using tensorRT.) :::