# CSPNET: A NEW BACKBONE THAT CAN ENHANCE LEARNING CAPABILITY OF CNN ###### tags: `CSPNET` `CNN` `論文翻譯` `deeplearning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1911.11929.pdf) ::: ## ABSTRACT :::info Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at https://github.com/WongKinYiu/CrossStagePartialNetworks. ::: :::success 神經網路讓很多先進的方法都能夠在電腦視覺任務上(如物體偵測)得到難以置信的結果。不過,這樣的成功絕大部份依靠昂貴的計算資源,這讓貧窮限制人們的想像。這篇論文中,我們提出Cross Stage Partial Network (CSPNet)來降低過去那些需要從網路架構角度來做大量推理計算的問題。我們把問題歸因於網路最佳化中的重覆性梯度信息。我們所提出的網路利用整合從網路階段(network stage)開始與結束的特徵圖(feature maps)來著眼於梯度的可變性,在我們的實驗中是可以在ImageNet資料集上有著相同、甚至更高準確度情況降低20%的計算量,而且在MS COCO物體偵測資料集上的AP50方面明顯優於最好的方法。CSPNet很容易實作的,而且它的通用性足以應付基於ResNet、ResNeXt與DenseNet的架構。程式碼就在https://github.com/WongKinYiu/CrossStagePartialNetworks 。 ::: ## 1. Introduction :::info Neural networks have been shown to be especially powerful when it gets deeper [7, 39, 11] and wider [40]. However, extending the architecture of neural networks usually brings up a lot more computations, which makes computationally heavy tasks such as object detection unaffordable for most people. Light-weight computing has gradually received stronger attention since real-world applications usually require short inference time on small devices, which poses a serious challenge for computer vision algorithms. Although some approaches were designed exclusively for mobile CPU [9, 31, 8, 33, 43, 24], the depth-wise separable convolution techniques they adopted are not compatible with industrial IC design such as Application-Specific Integrated Circuit (ASIC) for edge-computing systems. In this work, we investigate the computational burden in state-of-the-art approaches such as ResNet, ResNeXt, and DenseNet. We further develop computationally efficient components that enable the mentioned networks to be deployed on both CPUs and mobile GPUs without sacrificing the performance. 
::: :::success 神經網路已經被證明,當它變的又深[7, 39, 11]又寬[40]的時候就會特別強大。不過,如果你想擴展神經網路的架構,通常就會帶來更多的計算量,這造成多數人無法負擔像是物體偵測這種計算量大的任務。輕量計算(light-weight computing)的方式逐漸地受到愈來愈多的關注,因為現在中的應用通常會需要在小型設備上有著較短的推論時間,這讓電腦視覺演算法面對了嚴峻的挑戰。雖然有些方法是專門為mobile CPU [9, 31, 8, 33, 43, 24]所設計的,不過它們所採用的depth-wise separable convolution的技術與工業IC設計是不相容的,像是用於邊緣計算系統(edge-computing systems)的Application-Specific Integrated Circuit (ASIC)。在這個研究中,我們研究了像是ResNet、ResNeXt與DenseNet等最先進方法的計算負載(computational burden)。我們進一步的開了高效計算的組件,這組件讓上面提到的幾個網路都能夠在不犧牲效能的情況下佈署在CPUs與mobile GPUs上。 ::: :::info In this study, we introduce Cross Stage Partial Network (CSPNet). The main purpose of designing CSPNet is to enable this architecture to achieve a richer gradient combination while reducing the amount of computation. This aim is achieved by partitioning feature map of the base layer into two parts and then merging them through a proposed cross-stage hierarchy. Our main concept is to make the gradient flow propagate through different network paths by splitting the gradient flow. In this way, we have confirmed that the propagated gradient information can have a large correlation difference by switching concatenation and transition steps. In addition, CSPNet can greatly reduce the amount of computation, and improve inference speed as well as accuracy, as illustrated in Fig 1. The proposed CSPNet-based object detector deals with the following three problems: 1) **Strengthening learning ability of a CNN** The accuracy of existing CNN is greatly degraded after lightweightening, so we hope to strengthen CNN’s learning ability, so that it can maintain sufficient accuracy while being lightweightening. The proposed CSPNet can be easily applied to ResNet, ResNeXt, and DenseNet. After applying CSPNet on the above mentioned networks, the computation effort can be reduced from 10% to 20%, but it outperforms ResNet [7], ResNeXt [39], DenseNet [11], HarDNet [1], Elastic [36], and Res2Net [5], in terms of accuracy, in conducting image classification task on ImageNet [2]. 2) **Removing computational bottlenecks** Too high a computational bottleneck will result in more cycles to complete the inference process, or some arithmetic units will often idle. Therefore, we hope we can evenly distribute the amount of computation at each layer in CNN so that we can effectively upgrade the utilization rate of each computation unit and thus reduce unnecessary energy consumption. It is noted that the proposed CSPNet makes the computational bottlenecks of PeleeNet [37] cut into half. Moreover, in the MS COCO [18] dataset-based object detection experiments, our proposed model can effectively reduce 80% computational bottleneck when test on YOLOv3-based models. 3) **Reducing memory costs** The wafer fabrication cost of Dynamic Random-Access Memory (DRAM) is very expensive, and it also takes up a lot of space. If one can effectively reduce the memory cost, he/she will greatly reduce the cost of ASIC. In addition, a small area wafer can be used in a variety of edge computing devices. In reducing the use of memory usage, we adopt cross-channel pooling [6] to compress the feature maps during the feature pyramid generating process. 
In this way, the proposed CSPNet with the proposed object detector can cut down 75% memory usage on PeleeNet when generating feature pyramids ::: :::success 在這個研究中,我們引入Cross Stage Partial Network (CSPNet)的觀念。設計CSPNet的主目的在於讓這個架構能夠在實現更豐富的梯度結合的同時也降低計算量。這個目標的實現是利用將基礎層(base layer)的特徵圖(feature map)分區成兩個部份,再利用我們所提出的cross-stage hierarchy來合併它們。我們主要的概念就是透過分割梯度流(gradient flow)讓gradient flow能夠在不同的網路路徑(network path)中傳播(propagate)。以這樣的方式,我們可以確認到,透過切換concatenation與transition的這兩個步驟所傳播的梯度信息可以有較大的相關[差分](https://terms.naer.edu.tw/detail/15cfd3cc62018963c64ca224c7ab86a2/)(correlation difference)。此外,CSPNet可以大大的減少計算量,並且增進其推論速度與準確度,如Fig 1所說明。我們所提出的CSPNet-based object detector主要處理下面三個問題: 1) Strengthening learning ability of a CNN(強化CNN的學習能力),現行的CNN在輕量化之後的準確度會大幅度的下降,因此我們會希望能夠加強CNN的學習能力,讓它可以在瘦身的同時保有足夠的準確度。我們所提出的CSPNet可以輕輕鬆鬆的就弄到ResNet、ResNeXt與DenseNet上去。在上述架構上採用CSPNet之後,其計算量可以附低10%到20%不等,而且在ImageNet[2]上的影像分類任務中的準確度方面還可以優於ResNet [7]、ResNeXt [39]、DenseNet [11]、HarDNet [1]、Elastic [36]、與Res2Net [5] 2) Removing computational bottlenecks(去除計算瓶頸),過高的計算瓶頸會導致更多長的週期來完成推論過程,或是一些[算術單元](https://terms.naer.edu.tw/detail/c6801ebaa222ef6758e395d68ad3f37d/)會經常的閒置。因此,我們希望我們能夠平均地分配CNN中每一層的計算量,這樣我們就能夠有效地提升每個計算單元的利用率,因而減少不必要的能源消耗。值得注意的是,我們提出的CSPNet能夠讓PeleeNet [37]的計算瓶頸減半。更甚者,在MS COCO [18] dataset-based object detection的實驗中,我們提出的模型在基於YOLOv3的模型上測試的時候,可以有效降低80%的計算瓶頸 3) Reducing memory costs(減少記憶體成本),[動態隨機存取記憶體(DRAM)](https://terms.naer.edu.tw/detail/63af43ce9294dc3c2226228fc670aad0/)的晶圓製造成本是非常貴的,而且也佔非常大的空間。如果可以有效降低記憶體成本,那他/她將可以大大地降低[特殊應用積體電路(ASIC)](https://terms.naer.edu.tw/detail/628cdec919eab2ea30df200eadd78687/)的成本。此外,小面積的晶圓可以用在各種不同的邊緣計算設備中。為了能夠減少記憶體用量,我們採用cross-channel pooling [6]在特徵金字塔生成的過程中壓縮特徵圖。採用這個方法,在PeleeNet上採用我們所提出的CSPNet(搭配我們所提出的目標檢測器)可以在生成特徵金字塔的時候降低75%的記憶體用量 ::: :::info ![](https://hackmd.io/_uploads/S1ULL_eui.png) ![](https://hackmd.io/_uploads/S1cL8uldi.png) Figure 1: Proposed CSPNet can be applied on ResNet [7], ResNeXt [39], DenseNet [11], etc. It not only reduce computation cost and memory usage of these networks, but also benefit on inference speed and accuracy. Figure 1:我們所提出的CSPNet可以用於ResNet [7]、ResNeXt [39]、DenseNet [11]等架構上。除了可以降低這些網路的計算成本與記憶體用量之外,還有利於推論速度與準確度。 ::: :::info Since CSPNet is able to promote the learning capability of a CNN, we thus use smaller models to achieve better accuracy. Our proposed model can achieve 50% COCO AP50 at 109 fps on GTX 1080ti. Since CSPNet can effectively cut down a significant amount of memory traffic, our proposed method can achieve 40% COCO AP50 at 52 fps on Intel Core i9-9900K. In addition, since CSPNet can significantly lower down the computational bottleneck and Exact Fusion Model (EFM) can effectively cut down the required memory bandwidth, our proposed method can achieve 42% COCO AP50 at 49 fps on Nvidia Jetson TX2. ::: :::success 由於CSPnet能夠提升CNN的學習能力,因此,我們可以用較小的模型來實現更好的準確性。我們提出的模型可以在GTX 1080ti顯卡上以109 fps的推論速度實現50%的COCO AP50。由於CSPNet可以有效地減少大量的[記憶體進出(memory traffic)](https://www.ithome.com.tw/news/134809)的次數,因此,我們提出的方法在Intel Core i9-9900K上可以以52 fps實現40% COCO AP50。此外,因為CSPNet可以明顯地降低計算瓶頸,並且Exact Fusion Model(EFM)可以有效地減少記憶體頻寬的需求,我們所提出的方法可以在Nvidia Jetson TX2上以49 fps的速度實現42% COCO AP50。 ::: ## 2. Related work :::info **CNN architectures design**. In ResNeXt [39], Xie et al. first demonstrate that cardinality can be more effective than the dimensions of width and depth. DenseNet [11] can significantly reduce the number of parameters and computations due to the strategy of adopting a large number of reuse features. 
And it concatenates the output features of all preceding layers as the next input, which can be considered as a way to maximize cardinality. SparseNet [46] adjusts the dense connection to an exponentially spaced connection, which can effectively improve parameter utilization and thus result in better outcomes. Wang et al. further explain why high cardinality and sparse connection can improve the learning ability of the network by the concept of gradient combination and developed the partial ResNet (PRN) [35]. For improving the inference speed of CNN, Ma et al. [24] introduce four guidelines to be followed and design ShuffleNet-v2. Chao et al. [1] proposed a low memory traffic CNN called Harmonic DenseNet (HarDNet) and a metric Convolutional Input/Output (CIO) which is an approximation of DRAM traffic proportional to the real DRAM traffic measurement.
:::
:::success
CNN architectures design(CNN架構設計)。在ResNeXt [39]中Xie等人首先證明[基數](https://terms.naer.edu.tw/detail/930b131227d9b84a55fc7fea30d9b2ad/)比寬度與深度的維度還要來的有效。DenseNet [11]因為採用大量重用特徵(reuse feature)的策略,所以它明顯地減少參數量與計算量。而且它把前面所有層(layers)的輸出特徵(output features)連接(concatenates)起來做為下一個輸入,這可以被視為是最大化基數的一種方式。SparseNet [46]把dense connection調整為指數間隔連接(exponentially spaced connection),這有效提升參數利用率,因而有較好的結果。Wang等人進一步的解釋為何高基數與稀疏連接可以透過梯度組合(gradient combination)的概念來提升網路學習能力,並以此開發partial ResNet (PRN) [35]。為了提升CNN的推論速度,Ma等人 [24]引入四個應依循的準則,並設計出ShuffleNet-v2。Chao等人 [1]提出一種稱之為Harmonic DenseNet (HarDNet)的low memory traffic CNN,以及一種稱為Convolutional Input/Output (CIO)的度量,它是與實際DRAM流量測量成比例的DRAM流量近似值。
:::
:::info
**Real-time object detector**. The two most famous real-time object detectors are YOLOv3 [29] and SSD [21]. Based on SSD, LRF [38] and RFBNet [19] can achieve state-of-the-art real-time object detection performance on GPU. Recently, anchor-free based object detectors [3, 45, 13, 14, 42] have become the main-stream object detection systems. Two object detectors of this sort are CenterNet [45] and CornerNet-Lite [14], and they both perform very well in terms of efficiency and efficacy. For real-time object detection on CPU or mobile GPU, SSD-based Pelee [37], YOLOv3-based PRN [35], and Light-Head RCNN [17]-based ThunderNet [25] all receive excellent performance on object detection.
:::
:::success
Real-time object detector(實時物體偵測器)。最著名的兩個實時物體偵測器就是YOLOv3 [29]與SSD [21]。基於SSD的LRF [38]與RFBNet [19]可以在GPU上實現最好棒棒的實時物體偵測效能。近來,基於anchor-free的物體偵測器[3, 45, 13, 14, 42]已經成為主流的物體偵測系統。這種類型的兩個物體偵測器為CenterNet [45]與CornerNet-Lite [14],而且它們在效率與效力上都表現得不錯。對於在CPU或是mobile GPU上的實時物體偵測則是有SSD-based Pelee [37]、YOLOv3-based PRN [35]與Light-Head RCNN [17]-based的ThunderNet [25],這些在物體偵測上都有著出色的效果。
:::

## 3. Method
### 3.1 Cross Stage Partial Network
:::info
**DenseNet**. Figure 2 (a) shows the detailed structure of one stage of the DenseNet proposed by Huang et al. [11]. Each stage of a DenseNet contains a dense block and a transition layer, and each dense block is composed of $k$ dense layers. The output of the $i^{th}$ dense layer will be concatenated with the input of the $i^{th}$ dense layer, and the concatenated outcome will become the input of the $(i+1)^{th}$ dense layer.
The equations showing the above-mentioned mechanism can be expressed as: $$ \begin{align} & \mathbf{x}_1 = \mathbf{w}_1 * \mathbf{x}_0 \\ & \mathbf{x}_2 = \mathbf{w}_2 * \left[\mathbf{x}_0, \mathbf{x}_1 \right] \\ & \vdots \\ & \mathbf{x}_k = \mathbf{w}_k * \left[\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{k-1} \right] \end{align} \tag{1} $$ where $*$ represents the convolution operator, and $\left[x_0, x_1, \cdots \right]$ means to concatenate $x_0, x_1, \cdots$ and $w_i$ and $x_i$ are the weights and output of the $i^{th}$ dense layer, respectively. ::: :::success **DenseNet**。Figure 2 (a)說明著由Huang [11]等人所提出的DenseNet其中一個階段的結構細節。DenseNet的每個stage(階段)都包含一個dense block與transition layer,然後每個dense block都由$k$個dense layers所組成。第$i^{th}$個dense layer的output會跟第$i^{th}$個dense layer的input連接(concatenates),然後這個連接後的結果就會是第$(i+1)^{th}$個dense layer的input。上面提到的機制我們可以用下面方程式來表示: $$ \begin{align} & \mathbf{x}_1 = \mathbf{w}_1 * \mathbf{x}_0 \\ & \mathbf{x}_2 = \mathbf{w}_2 * \left[\mathbf{x}_0, \mathbf{x}_1 \right] \\ & \vdots \\ & \mathbf{x}_k = \mathbf{w}_k * \left[\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{k-1} \right] \end{align} \tag{1} $$ 其中,$*$表示卷積的操作,$\left[x_0, x_1, \cdots \right]$則意味著連接$x_0, x_1, \cdots$,然後$w_i$與$x_i$各別為第$i^{th}$個dense layer的權重與輸出。 ::: :::info ![](https://hackmd.io/_uploads/H1o7XP4_j.png) Figure 2: Illustrations of (a) DenseNet and (b) our proposed Cross Stage Partial DenseNet (CSPDenseNet). CSPNet separates feature map of the base layer into two part, one part will go through a dense block and a transition layer; the other one part is then combined with transmitted feature map to the next stage. Figure 2:(a) DenseNet與 (b) 我們所提出的Cross Stage Partial DenseNet (CSPDenseNet)的說明。CSPNet把base layer的feature map分成兩個部份,一部份會經過dense block與transition layer;另一部份則是跟傳輸到(transmitted)的feature map結合起來到下一個階段(stage) ::: :::info If one makes use of a backpropagation algorithm to update weights, the equations of weight updating can be written as: ![](https://hackmd.io/_uploads/HkyZQwVOo.png) where $f$ is the function of weight updating, and $g_i$ represents the gradient propagated to the $i$ th dense layer. We can find that large amount of gradient information are reused for updating weights of different dense layers. This will result in different dense layers repeatedly learn copied gradient information. ::: :::success 如果我們用反向傳播演算法來更新權重值,那,權重更新的方程式就可以寫成是: ![](https://hackmd.io/_uploads/Byb-XvV_o.png) 其中$f$是權重更新的函數,$g_i$表示傳遞到第$i$個dense layer的梯度。我們可以發現到,大量的梯度信息(gradient information)被重覆用來更新不同的dense layers的權重。這會導致不同的dense layer重複地學習複製的梯度資訊。 ::: :::info **Cross Stage Partial DenseNet.** The architecture of one-stage of the proposed CSPDenseNet is shown in Figure 2 (b). A stage of CSPDenseNet is composed of a partial dense block and a partial transition layer. In a partial dense block, the feature maps of the base layer in a stage are split into two parts through channel $x_0 = \left[x^{'}_0, x^{''}_0 \right]$. Between $x^{''}_0$ and $x^{'}_0$ , the former is directly linked to the end of the stage, and the latter will go through a dense block. All steps involved in a partial transition layer are as follows: First, the output of dense layers, $\left[ x^{''}_0, x_1,\cdots, x_k \right]$, will undergo a transition layer. Second, the output of this transition layer, $x_T$ , will be concatenated with $x^{''}_0$ and undergo another transition layer, and then generate output $x_U$ . The equations of feed-forward pass and weight updating of CSPDenseNet are shown in Equations 3 and 4, respectively. 
![](https://hackmd.io/_uploads/BJzs2D4ui.png) ::: :::success **Cross Stage Partial DenseNet.** 我們所提出的CSPDenseNet的其中一個階段的架構如Figure 2(b)所示。CSPDenseNet的一個階段是由partial dense block與partial transition layer所組成。在一個partial dense block中,其base layer的feature maps在一個階段中會通過$x_0 = \left[x^{'}_0, x^{''}_0 \right]$分成兩個部份。在$x^{''}_0$與$x^{'}_0$之間,前者會直接連結到該階段的最後面,後者則是會經過dense block。partial transition layer所涉及的所有步驟如下:首先,dense layers的output,$\left[ x^{''}_0, x_1,\cdots, x_k \right]$,將會經歷一個transition layer。再來就是transition layer的output,$x_T$,它會連接$x^{''}_0$,然後經歷另一個transition layer,然後生成output $x_U$。CSPDenseNet的前向傳遞與權重更新的方程式各別為Equation 3與4所說明。 ![](https://hackmd.io/_uploads/S1Xj3wNdi.png) ::: :::info We can see that the gradients coming from the dense layers are separately integrated. On the other hand, the feature map $x^{'}_0$ that did not go through the dense layers is also separately integrated. As to the gradient information for updating weights, both sides do not contain duplicate gradient information that belongs to other sides. Overall speaking, the proposed CSPDenseNet preserves the advantages of DenseNet’s feature reuse characteristics, but at the same time prevents an excessively amount of duplicate gradient information by truncating the gradient flow. This idea is realized by designing a hierarchical feature fusion strategy and used in a partial transition layer. ::: :::success 我們可以看的到,來自dense layers的梯度是各自整合的。另一方面,沒有通過dense layers的feature map,$x^{'}_0$,也是單獨整合的。對於要用來更新權重的梯度信息,兩邊都沒有包含到屬於另一邊的那種完全一樣的梯度信息(就是沒有重複性的梯度信息)。 總體來說,我們提出來的CSPDenseNet保留了DenseNet特徵重用重性的優點,與此同時也通過截斷梯度流(grandient flow)的方式來防止過多的重複性梯度信息。這個想法是透過設計hierarchical feature fusion strategy(分層特徵融合策略?)來實現,並將之應用於transition layer。 ::: :::info **Partial Dense Block.** The purpose of designing partial dense blocks is to 1.) increase gradient path: Through the split and merge strategy, the number of gradient paths can be doubled. Because of the cross-stage strategy, one can alleviate the disadvantages caused by using explicit feature map copy for concatenation; 2.) balance computation of each layer: usually, the channel number in the base layer of a DenseNet is much larger than the growth rate. Since the base layer channels involved in the dense layer operation in a partial dense block account for only half of the original number, it can effectively solve nearly half of the computational bottleneck; and 3.) reduce memory traffic: Assume the base feature map size of a dense block in a DenseNet is $w \times h \times c$, the growth rate is $d$, and there are in total $m$ dense layers. Then, the CIO of that dense block is $(c \times m) + ((m^2 + m)\times d)/2$, and the CIO of partial dense block is $((c\times m) + (m^2 + m) \times d)/2$. While $m$ and $d$ are usually far smaller than $c$, a partial dense block is able to save at most half of the memory traffic of a network. 
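:::
:::warning
個人註解:a minimal PyTorch sketch of one CSPDenseNet stage (Figure 2 (b)) as described above; this is my own illustration, not the authors' official darknet implementation. The channel widths, the 1×1 convolutions used for the transition layers, and the BN+ReLU ordering are assumptions — only the split → dense block → transition → concatenation → transition structure follows the paper.

```python
import torch
import torch.nn as nn


def conv_bn_relu(c_in, c_out, k=1):
    # Assumed building block for the "transition" layers: conv + BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class DenseLayer(nn.Module):
    # One dense layer: produces `growth` new channels and concatenates them
    # with its own input, as in Eq. (1).
    def __init__(self, c_in, growth):
        super().__init__()
        self.conv = conv_bn_relu(c_in, growth, k=3)

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)


class CSPDenseStage(nn.Module):
    # Split the base feature map along the channel axis, run only one part through
    # k dense layers plus a transition (x_T), then concatenate with the untouched
    # part and apply a second transition (x_U):
    # transition -> concatenation -> transition.
    def __init__(self, channels, growth, num_dense_layers, split_ratio=0.5):
        super().__init__()
        self.c_bypass = int(channels * split_ratio)  # part kept on the cross-stage path
        c_dense = channels - self.c_bypass           # part fed to the dense block
        layers, c = [], c_dense
        for _ in range(num_dense_layers):
            layers.append(DenseLayer(c, growth))
            c += growth
        self.dense_block = nn.Sequential(*layers)
        self.transition1 = conv_bn_relu(c, c_dense)                          # produces x_T
        self.transition2 = conv_bn_relu(c_dense + self.c_bypass, channels)   # produces x_U

    def forward(self, x):
        x_bypass, x_dense = torch.split(
            x, [self.c_bypass, x.shape[1] - self.c_bypass], dim=1)
        x_t = self.transition1(self.dense_block(x_dense))
        return self.transition2(torch.cat([x_bypass, x_t], dim=1))


# Quick shape check.
stage = CSPDenseStage(channels=64, growth=16, num_dense_layers=4)
print(stage(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

For reference against Figure 3, the fusion-first variant would concatenate the two parts before a single transition, and the fusion-last variant would end with the concatenation and skip `transition2`.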
:::
:::success
**Partial Dense Block.** 我們設計partial dense blocks的目的在於:1.)增加梯度路徑(gradient path):透過分割、合併的策略,我們可以把梯度路徑的數量加倍。由於跨階段策略(cross-stage strategy),我們可以緩解使用顯式特徵圖(explicit feature map)複本做連接(concatenation)所帶來的缺點;2.)平衡每一層的計算量:通常,DenseNet基礎層(base layer)中的通道(channel)數量是遠大於成長率(growth rate)。由於partial dense block中密集層(dense layer)的操作所涉及的基礎層通道數(base layer channels)只會是原始數量的一半,因此這可以有效解決將近一半的計算瓶頸;3.)降低記憶體流量:假設DenseNet中一個dense block的基礎特徵圖大小為$w \times h \times c$,成長率為$d$,然後總共有$m$個dense layers。那麼,該dense block的CIO為$(c \times m) + ((m^2 + m)\times d)/2$,partial dense block的CIO為$((c\times m) + (m^2 + m) \times d)/2$。由於$m$跟$d$通常遠小於$c$,因此partial dense block最多能夠節省網路一半的記憶體流量。
:::
:::info
**Partial Transition Layer.** The purpose of designing partial transition layers is to maximize the difference of gradient combination. The partial transition layer is a hierarchical feature fusion mechanism, which uses the strategy of truncating the gradient flow to prevent distinct layers from learning duplicate gradient information. Here we design two variations of CSPDenseNet to show how this sort of gradient flow truncating affects the learning ability of a network. Figures 3 (c) and 3 (d) show two different fusion strategies. CSP (fusion first) means to concatenate the feature maps generated by the two parts, and then do the transition operation. If this strategy is adopted, a large amount of gradient information will be reused. As to the CSP (fusion last) strategy, the output from the dense block will go through the transition layer and then do concatenation with the feature map coming from part 1. If one goes with the CSP (fusion last) strategy, the gradient information will not be reused since the gradient flow is truncated. If we use the four architectures shown in Figure 3 to perform image classification, the corresponding results are shown in Figure 4. It can be seen that if one adopts the CSP (fusion last) strategy to perform image classification, the computation cost is significantly dropped, but the top-1 accuracy only drops 0.1%. On the other hand, the CSP (fusion first) strategy also drops the computation cost significantly, but the top-1 accuracy significantly drops 1.5%. By using the split and merge strategy across stages, we are able to effectively reduce the possibility of duplication during the information integration process. From the results shown in Figure 4, it is obvious that if one can effectively reduce the repeated gradient information, the learning ability of a network will be greatly improved.
:::
:::success
**Partial Transition Layer.** 設計partial transition layers的目的在於最大化梯度組合的差異。partial transition layer是一種分層特徵融合機制(hierarchical feature fusion mechanism),它使用截斷梯度流的策略來預防不同的層(layers)學習到重複的梯度信息。這邊我們設計兩種不同的CSPDenseNet變體來說明這種梯度流截斷如何影響網路的學習能力。Figure 3(c)與3(d)說明著兩種不同的融合策略。CSP (fusion first)指的是將兩部份所生成的特徵圖連接起來,然後再做轉換(transition)的操作。如果採用這種策略,那就會有大量的梯度信息被重複使用。如果是CSP (fusion last)的話,那來自dense block的output就會經過transition layer,然後跟part 1過來的特徵圖做連接。如果採用CSP (fusion last)這個策略的話,那梯度信息就不會被重複使用,因為梯度流已經被截斷了。如果我們使用Figure 3中的四種架構來做影像分類的話,相對應的結果就會如Figure 4所示。

可以看的出來,如果我們採用CSP (fusion last)來做影像分類的話,那計算成本可以明顯的降低,而且,top-1準確度只會下降0.1%。另一方面,CSP (fusion first)確實也明顯降低計算成本,不過,top-1準確度的部份也明顯下降1.5%。透過跨階段(across stages)使用分割、合併的策略,我們能夠有效地降低信息整合過程中重複的可能性。從Figure 4的結果可以看的出來,很明顯的,如果我們可以有效地降低重複的梯度信息的話,那麼,網路的學習能力就能夠很大的提升。
:::
:::info
![](https://hackmd.io/_uploads/SkNMRJPOs.png)
Figure 3: Different kinds of feature fusion strategies.
(a) single path DenseNet, (b) proposed CSPDenseNet: transition → concatenation → transition, (c) concatenation → transition, and (d) transition → concatenation. ::: :::info ![](https://hackmd.io/_uploads/SJxFRJwOj.png) Figure 4: Effect of truncating gradient flow for maximizing difference of gradient combination. ::: :::info **Apply CSPNet to Other Architectures.** CSPNet can be also easily applied to ResNet and ResNeXt, the architectures are shown in Figure 5. Since only half of the feature channels are going through Res(X)Blocks, there is no need to introduce the bottleneck layer anymore. This makes the theoretical lower bound of the Memory Access Cost (MAC) when the FLoating-point OPerations (FLOPs) is fixed. ::: :::success CSPNet可以很輕易的應用到ResNet與ResNeXt,相關架構如Figure 5所示。由於只會有一半的特徵通道(feature channels)會通過Res(X)Blocks,所以我們沒有必要需要再引入任何的bottleneck layers。當浮點計算(FLOPs)固定的時候,這使用Memory Access Cost (MAC)(記憶體存取成本?) 的理論下限值成為固定值。 ::: :::info ![](https://hackmd.io/_uploads/BJRCmlwdi.png) Figure 5: Applying CSPNet to ResNe(X)t. ::: ### 3.2 Exact Fusion Model :::info **Looking Exactly to predict perfectly.** We propose EFM that captures an appropriate Field of View (FoV) for each anchor, which enhances the accuracy of the one-stage object detector. For segmentation tasks, since pixel-level labels usually do not contain global information, it is usually more preferable to consider larger patches for better information retrieval [22]. However, for tasks like image classification and object detection, some critical information can be obscure when observed from image-level and bounding box-level labels. Li et al. [15] found that CNN can be often distracted when it learns from image-level labels and concluded that it is one of the main reasons that two-stage object detectors outperform one-stage object detectors. ::: :::success **Looking Exactly to predict perfectly.** 我們提出EFM來為每個anchor補捉適當的[視野](https://zh.wikipedia.org/zh-tw/%E8%A6%96%E9%87%8E)(FoV),這強化了one-stage object detector的準確度。對於分割任務的話,因為pixel-level labels通常不包含全域信息,因此通常可以更好的去考慮更大的patches來取得更好的信息檢索 [22]。不過,對於像是影像分類或是物體偵測的任務的話,當你從image-level與bounding box-level labels來觀察的話,一些關鍵信息就會變的比較模糊。Li等人[15]發現到,CNN在從image-level labels學習的時候通常會分心,並得到結論,也就是two-stage object detectors會比one-stage object detectors還要來的好的主要原因之一。 ::: :::info **Aggregate Feature Pyramid.** The proposed EFM is able to better aggregate the initial feature pyramid. The EFM is based on YOLOv3 [29], which assigns exactly one bounding-box prior to each ground truth object. Each ground truth bounding box corresponds to one anchor box that surpasses the threshold IoU. If the size of an anchor box is equivalent to the FoV of the grid cell, then for the grid cells of the $s^{th}$ scale, the corresponding bounding box will be lower bounded by the $(s − 1)^{th}$ scale and upper bounded by the $(s + 1)^{th}$ scale. Therefore, the EFM assembles features from the three scales. ::: :::success **Aggregate Feature Pyramid.** 我們所提出EFM能夠更好的聚合初始的特徵金字塔。EFM是基於YOLOv3 [29],它把每一個實際物件都分配一個bounding box prior。每個實際的邊界框都對應一個超過threshold IoU的anchor box。如果anchor box的大小等價於網格(grid cell)的FoV,那麼第$s$個尺度的grid cells,其相對應的邊界框就會以第$(s-1)$個尺度做為下限,以第$(s+1)$個尺度做為上限。因此,EFM從這三個尺度中組成特徵。 ::: :::info **Balance Computation.** Since the concatenated feature maps from the feature pyramid are enormous, it introduces a great amount of memory and computation cost. To alleviate the problem, we incorporate the Maxout technique to compress the feature maps. 
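:::
:::warning
個人註解:a small PyTorch sketch of how I read the EFM aggregation and the Maxout compression described above: for the $s^{th}$ scale, gather features from scales $s-1$, $s$ and $s+1$, resize them to the $s^{th}$ resolution, concatenate, then compress the very wide concatenated map with a cross-channel max pooling. The pyramid layout, channel widths, nearest-neighbour resizing and the reduction factor are all assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn.functional as F


def maxout_compress(x, reduction):
    # Cross-channel pooling: split the channels into groups of `reduction`
    # and keep only the element-wise maximum of each group.
    b, c, h, w = x.shape
    assert c % reduction == 0
    return x.view(b, c // reduction, reduction, h, w).max(dim=2).values


def efm_fuse(pyramid, s):
    # pyramid: list of per-scale feature maps (same channel count assumed here).
    # For the s-th scale, take scales s-1, s, s+1, resize to the s-th resolution,
    # concatenate, and compress back to one scale's channel width with Maxout.
    target_hw = pyramid[s].shape[-2:]
    picked = [
        F.interpolate(pyramid[i], size=target_hw, mode="nearest")
        for i in (s - 1, s, s + 1)
        if 0 <= i < len(pyramid)
    ]
    fused = torch.cat(picked, dim=1)            # very wide before compression
    return maxout_compress(fused, len(picked))  # cheap, parameter-free reduction


p3 = torch.randn(1, 128, 64, 64)  # assumed 3-level pyramid, 128 channels per scale
p4 = torch.randn(1, 128, 32, 32)
p5 = torch.randn(1, 128, 16, 16)
print(efm_fuse([p3, p4, p5], s=1).shape)  # torch.Size([1, 128, 32, 32])
```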
::: :::success **Balance Computation.** 因為來自特徵金字塔所連接的特徵圖非常的巨大,這會同時帶來大量的記憶體需求與計算成本。為了緩解這個問題,我們結合了Maxout這個技術來壓縮特徵圖。 ::: ## 4. Experiments :::info We will use ImageNet’s image classification dataset [2] used in ILSVRC 2012 to validate our proposed CSPNet. Besides, we also use the MS COCO object detection dataset [18] to verify the proposed EFM. Details of the proposed architectures will be elaborated in the appendix. ::: :::success 我們預計使用ILSVRC 2012 ImageNet的影像分類資料集[2]來驗證我們所提出的CSPNet。此外,我們還使用MS COCO物體偵測資料集[18]來驗證我們所提出的EFM。我們所提出的架構細節會在附錄中說明。 ::: ### 4.1 Implementation Details :::info **ImageNet.** In ImageNet image classification experiments, all hyper-parameters such as training steps, learning rate schedule, optimizer, data augmentation, etc., we all follow the settings defined in Redmon et al. [29]. For ResNet-based models and ResNeXt-based models, we set 8,000,000 training steps. As to DenseNet-based models, we set 1,600,000 training steps. We set the initial learning rate 0.1 and adopt the polynomial decay learning rate scheduling strategy. The momentum and weight decay are respectively set as 0.9 and 0.005. All architectures use a single GPU to train universally in the batch size of 128. Finally, we use the validation set of ILSVRC 2012 to validate our method. ::: :::success **ImageNet.** 在ImageNet影像分類實驗中,所有的超參數,像是training steps、learning rate schedule、optimizer、data augmentation等,我們都會依循著Redmon等人[29]的設置定義。對於ResNet-based models與ResNeXt-based models,我們設置8,000,000個training steps。DenseNet-based models的話,我們設置training steps為1,600,000。初始的learning rate為0.1,然後採用polynomial decay learning rate scheduling strategy。momentum與weight decay各別設置為0.9與0.005。所有的架構都是在單一塊GPU上以batch size=128來做訓練。最後,我們使用ILSVRC 2012的驗證集來驗證我們的方法。 ::: :::info **MS COCO.** In MS COCO object detection experiments, all hyper-parameters also follow the settings defined in Redmon et al. [29]. Altogether we did 500,000 training steps. We adopt the step decay learning rate scheduling strategy and multiply with a factor 0.1 at the 400,000 steps and the 450,000 steps, respectively. The momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64. Finally, the COCO test-dev set is adopted to verify our method. ::: :::success **MS COCO.** 在MS COCO物體偵測實驗中,所有參數同樣會依循著Redmon等人[29]的設置定義。我們總共執行了500,000個training steps。我們採用step decay learning rate scheduling strategy,分別在400,000 steps與450,000 steps的時候乘上0.1。momentum與weight decay各自設置為0.9與0.0005。所有的架構都使用單一塊GPU以batch size=64來執行多尺度的訓練。最後,我們採用COCO test-dev set來驗證我們的方法。 ::: ### 4.2 Ablation Experiments :::info **Ablation study of CSPNet on ImageNet.** In the ablation experiments conducted on the CSPNet, we adopt PeleeNet [37] as the baseline, and the ImageNet is used to verify the performance of the CSPNet. We use different partial ratios $\gamma$ and the different feature fusion strategies for ablation study. Table 1 shows the results of ablation study on CSPNet. In Table 1, SPeleeNet and PeleeNeXt are, respectively, the architectures that introduce sparse connection and group convolution to PeleeNet. As to CSP (fusion first) and CSP (fusion last), they are the two strategies proposed to validate the benefits of a partial transition. 
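:::
:::warning
個人註解:before the ablation results, a quick sketch of the two learning-rate schedules from Sec. 4.1 above — polynomial decay for the ImageNet runs (initial LR 0.1) and step decay for the MS COCO runs (×0.1 at the 400,000 and 450,000 step marks). The polynomial power below is an assumption (darknet-style configs commonly use 4); the paper only names the strategy.

```python
def polynomial_decay_lr(step, max_steps, base_lr=0.1, power=4.0):
    # LR shrinks from base_lr toward 0 over max_steps following (1 - t)^power.
    return base_lr * (1.0 - step / max_steps) ** power


def step_decay_lr(step, base_lr, milestones=(400_000, 450_000), gamma=0.1):
    # LR is multiplied by gamma each time a milestone has been passed.
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr


print(polynomial_decay_lr(step=800_000, max_steps=1_600_000))  # 0.00625 at mid-training
print(step_decay_lr(step=420_000, base_lr=0.001))  # 0.0001 (base LR here is only a demo value)
```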
:::
:::success
**Ablation study of CSPNet on ImageNet.** 在CSPNet上所做的[ablation experiments](https://www.zhihu.com/question/60170398)([切除實驗](https://www.digitimes.com.tw/col/article.asp?id=674)、消融實驗)中,我們採用PeleeNet [37]做為基線(baseline),然後用ImageNet來驗證CSPNet的效能。我們使用不同的partial ratios,$\gamma$,與不同的feature fusion策略來做切除研究。Table 1給出在CSPNet上所做的切除研究的結果。在Table 1中,SPeleeNet與PeleeNeXt各自是將稀疏連接(sparse connection)與群組卷積(group convolution)引入PeleeNet的架構。對於CSP (fusion first)與CSP (fusion last)而言,它們是為了驗證partial transition的好處所提出的兩種策略。
:::
:::info
Table 1: Ablation study of CSPNet on ImageNet.
![](https://hackmd.io/_uploads/Bk9H3ZwOo.png)
:::
:::info
From the experimental results, if one only uses the CSP (fusion first) strategy on the cross-stage partial dense block, the performance can be slightly better than SPeleeNet and PeleeNeXt. However, the partial transition layer designed to reduce the learning of redundant information can achieve very good performance. For example, when the computation is cut down by 21%, the accuracy only degrades by 0.1%. One thing to be noted is that when $\gamma=0.25$, the computation is cut down by 11%, but the accuracy is increased by 0.1%. Compared to the baseline PeleeNet, the proposed CSPPeleeNet achieves the best performance; it can cut down 13% computation, but at the same time upgrade the accuracy by 0.2%. If we adjust the partial ratio to $\gamma=0.25$, we are able to upgrade the accuracy by 0.8% and at the same time cut down 3% computation.
:::
:::success
從實驗結果來看,如果我們只在cross-stage partial dense block上使用CSP (fusion first)的話,效能可以比SPeleeNet跟PeleeNeXt還要好一點點點。不過啊,設計用來減少學習到冗餘信息的partial transition layer是可以得到不錯的效能。舉例來說,當計算量減少21%的時候,準確度僅僅降低0.1%。要注意的是,當$\gamma=0.25$的時候,計算量會降低11%,不過準確度是提升0.1%。對比做為基線的PeleeNet,我們所提出的CSPPeleeNet得到最佳效能,它可以降低13%的計算量,同時提升0.2%的準確度。如果我們把partial ratio調整為$\gamma=0.25$,我們就可以提升0.8%的準確度,同時降低3%的計算量。
:::
:::info
**Ablation study of EFM on MS COCO.** Next, we shall conduct an ablation study of EFM based on the MS COCO dataset. In this series of experiments, we compare three different feature fusion strategies shown in Figure 6. We choose two state-of-the-art lightweight models, PRN [35] and ThunderNet [25], to make comparisons. PRN is the feature pyramid architecture used for comparison, and the ThunderNet with Context Enhancement Module (CEM) and Spatial Attention Module (SAM) is the global fusion architecture used for comparison. We design a Global Fusion Model (GFM) to compare with the proposed EFM. Moreover, GIoU [30], SPP, and SAM are also applied to EFM to conduct an ablation study. All experiment results listed in Table 2 adopt CSPPeleeNet as the backbone.
:::
:::info
Table 2: Ablation study of EFM on MS COCO.
![](https://hackmd.io/_uploads/By32RZP_o.png)
:::
:::success
**Ablation study of EFM on MS COCO.** 接下來,我們就要來進行EFM的切除研究(基於MS COCO dataset)。在這系列的研究中,我們比較三種不同的特徵融合策略(如Figure 6所示)。我們選擇兩種目前最好的輕量模型,PRN [35]與ThunderNet [25],來做比較。PRN是用來做為特徵金字塔架構的比較,而ThunderNet with Context Enhancement Module (CEM)與Spatial Attention Module (SAM)則是用來對比全域融合(global fusion)架構。我們設計一個Global Fusion Model (GFM)來跟我們所提出的EFM做比較。此外,我們也會把GIoU [30]、SPP與SAM拿來用到EFM進行切除研究。Table 2中所列出的所有實驗結果都是採用CSPPeleeNet做為骨幹。
:::
:::info
![](https://hackmd.io/_uploads/HyfflMPds.png)
Figure 6: Different feature pyramid fusion strategies. (a) Feature Pyramid Network (FPN): fuse features from the current scale and the previous scale. (b) Global Fusion Model (GFM): fuse features of all scales. (c) Exact Fusion Model (EFM): fuse features depending on anchor size.
::: :::info As reflected in the experiment results, the proposed EFM is 2 fps slower than GFM, but its AP and AP50 are significantly upgraded by 2.1% and 2.4%, respectively. Although the introduction of GIoU can upgrade AP by 0.7%, the AP50 is, however, significantly degraded by 2.7%. However, for edge computing, what really matters is the number and locations of the objects rather than their coordinates. Therefore, we will not use GIoU training in the subsequent models. The attention mechanism used by SAM can get a better frame rate and AP compared with SPP’s increase of FoV mechanism, so we use EFM (SAM) as the final architecture. In addition, although the CSPPeleeNet with swish activation can improve AP by 1%, its operation requires a lookup table on the hardware design to accelerate, we finally also abandoned the swish activation function. ::: :::success 正如實驗結果所反應出的那般,我們所提出的EFM比GFM慢了2 fps,不過它的AP與AP50則是分別明顯提升2.1%與2.4%。雖然GIoU的引入可以提升0.7%的AP,不過AP50卻是明顯的降低了2.7%。不過,對於邊緣計算而言,真的重要的是物體的數量與位置,而不是它們的座標。所以,我們不會在後續的模型中使用GIoU來訓練。對比於SPP所增加的FoV機制,使用注意力機制的SAM可以得到較好的[取像速度](https://terms.naer.edu.tw/detail/156a6c45b83f57b5d9bb768829649edd/)與AP,因此我們會使用EFM(SAM)來做為最終的架構。此外,雖然CSPPeleeNet使用swish activation可以提升1%的AP,不過它的計算需要在硬體設計上的[查表](https://terms.naer.edu.tw/detail/6cef4f03c5b04f70a6b01fc5a88ff2db/)來加速,所以最終我們就放生它了。 ::: ### 4.3 ImageNet Image Classification :::info We apply the proposed CSPNet to ResNet-10 [7], ResNeXt-50 [39], PeleeNet [37], and DenseNet-201-Elastic [36] and compare with state-of-the-art methods. The experimental results are shown in Table 3. ::: :::success 我們把所提出的CSPNet應用於ResNet-10 [7]、ResNeXt-50 [39]、PeleeNet [37]與DenseNet-201-Elastic [36],然後跟最好的方法來比較一番。實驗結果如Table 3所示。 ::: :::info Table 3: Compare with state-of-the-art methods on ImageNet. ![](https://hackmd.io/_uploads/BJIx4MP_j.png) ::: :::info It is confirmed by experimental results that no matter it is ResNet-based models, ResNeXt-based models, or DenseNetbased models, when the concept of CSPNet is introduced, the computational load is reduced at least by 10% and the accuracy is either remain unchanged or upgraded. Introducing the concept of CSPNet is especially useful for the improvement of lightweight models. For example, compared to ResNet-10, CSPResNet-10 can improve accuracy by 1.8%. As to PeleeNet and DenseNet-201-Elastic, CSPPeleeNet and CSPDenseNet-201-Elastic can respectively cut down 13% and 19% computation, and either upgrade a little bit or maintain the accuracy. As to the case of ResNeXt-50, CSPResNeXt-50 can cut down 22% computation and upgrade top-1 accuracy to 77.9%. ::: :::success 實驗結果證實,不管是ResNet-based models、ResNeXt-based models,還是DenseNetbased models,只要引入CSPNet的概念,其計算負載秒降至少10%,準確度不是持平就是更好。引入CSPNet的概念對於輕量模型的改進特別有效。舉例來說,對比ResNet-10,CSPResNet-10可以提高準確度1.8%。相對於PeleeNet與DenseNet-201-Elastic,CSPPeleeNet與CSPDenseNet-201-Elastic各自可以降低13%與19%的計算量,然後可以準確度可以持平或是提升一咪咪。ResNeXt-50的話,CSPResNeXt-50可以降低22%的計算量,並且將top-1 accuracy提升到77.9%。 ::: :::info If compared with the state-of-the-art lightweight model – EfficientNet-B0, although it can achieve 76.8% accuracy when the batch size is 2048, when the experiment environment is the same as ours, that is, only one GPU is used, EfficientNetB0 can only reach 70.0% accuracy. In fact, the swish activation function and SE block used by EfficientNet-B0 are not efficient on the mobile GPU. A similar analysis has been conducted during the development of EfficientNet-EdgeTPU. 
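:::
:::warning
個人註解:for reference, the two EfficientNet-B0 components discussed above, since they come up again in the next block when they are added to CSPPeleeNet. Swish is $x \cdot \sigma(x)$, and a Squeeze-and-Excitation (SE) block re-weights channels with global average pooling followed by two small fully connected layers and a sigmoid gate; the reduction ratio of 4 below is an assumption for illustration.

```python
import torch
import torch.nn as nn


def swish(x):
    # Smooth, non-monotonic activation; the sigmoid is what typically needs a
    # lookup table or extra ops on edge hardware, as noted above.
    return x * torch.sigmoid(x)


class SEBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(               # excitation: per-channel gate in [0, 1]
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))        # channel-wise re-weighting


x = torch.randn(1, 32, 16, 16)
print(swish(x).shape, SEBlock(32)(x).shape)  # both torch.Size([1, 32, 16, 16])
```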
::: :::success 如果跟當今世上最好的輕量模型,EfficientNet-B0,相比的話,雖然在batch size=2048的時候可以來到76.8%的準確度,不過在實驗環境跟我們一樣,也就是僅使用一塊GPU的時候,EfficientNetB0也只能夠來到70%的準確度。事實上,EfficientNet-B0所用的swish activation function與SE block在mobile GPU上的效率並不是那麼好。這在EfficientNet-EdgeTPU開發過程中有做過類似的分析。 ::: :::info Here, for demonstrating the learning ability of CSPNet, we introduce swish and SE into CSPPeleeNet and then make a comparison with EfficientNet-B0\*. In this experiment, SECSPPeleeNet-swish cut down computation by 3% and upgrade 1.1% top-1 accuracy. ::: :::success 這邊,為了能夠證明CSPNet的學習能力,我們把swish跟SE引入CSPPeleeNet,用這個來跟EfficientNet-B0\*做個比較。在這個實驗中,SECSPPeleeNet-swish減少3%的計算量,並提高1.1%的top-1準確度。 ::: :::info Proposed CSPResNeXt-50 is compared with ResNeXt-50 [39], ResNet-152 [7], DenseNet-264 [11], and HarDNet-138s [1], regardless of parameter quantity, amount of computation, and top-1 accuracy, CSPResNeXt-50 all achieve the best result. As to the 10-crop test, CSPResNeXt-50 also outperforms Res2Net-50 [5] and Res2NeXt-50 [5]. ::: :::success 我們拿我們所提出的CSPResNeXt-50來跟ResNeXt-50 [39]、ResNet-152 [7]、DenseNet-264 [11]跟HarDNet-138s [1]做比較,不管是參數的數量、計算量還是top-1 準確度,CSPResNeXt-50始終得到最好的結果。就算是10-crop test,CSPResNeXt-50也都還是優於Res2Net-50 [5]與Res2NeXt-50 [5]。 ::: ### 4.4 MS COCO Object Detection :::info In the task of object detection, we aim at three targeted scenarios: (1) real-time on GPU: we adopt CSPResNeXt50 with PANet (SPP) [20]; (2) real-time on mobile GPU: we adopt CSPPeleeNet, CSPPeleeNet Reference, and CSPDenseNet Reference with the proposed EFM (SAM); and (3) real-time on CPU: we adopt CSPPeleeNet Reference and CSPDenseNet Reference with PRN [35]. The comparisons between the above models and the state-of-the-art methods are listed in Table 4. As to the analysis on the inference speed of CPU and mobile GPU will be detailed in the next subsection. ::: :::success 在物體偵測的任務中,我們瞄準三個目標場景:(1)在GPU上實時執行:我們採用CSPResNeXt50搭配PANet (SPP) [20];(2)在mobile GPU上實時執行:我們採用CSPPeleeNet、CSPPeleeNet Reference與CSPDenseNet Reference搭配我們所提出的EFM (SAM);(3)在CPU上實時執行:我們採用CSPPeleeNet Reference與CSPDenseNet Reference搭配PRN [35]。上述模型與目前最好的方法之間的比較列在Table 4。關於CPU與mobile GPU上的推論速度的分析會在下一小節中再詳細說明。 ::: :::info Table 4: Compare with state-of-the-art methods on MSCOCO Object Detection. ![](https://hackmd.io/_uploads/HJRp4POdi.png) ::: :::info If compared to object detectors running at 30∼100 fps, CSPResNeXt50 with PANet (SPP) achieves the best performance in AP, AP50 and AP75. They receive, respectively, 38.4%, 60.6%, and 41.6% detection rates. If compared to state-ofthe-art LRF [38] under the input image size 512×512, CSPResNeXt50 with PANet (SPP) outperforms ResNet101 with LRF by 0.7% AP, 1.5% AP50 and 1.1% AP75. If compared to object detectors running at 100∼200 fps, CSPPeleeNet with EFM (SAM) boosts 12.1% AP50 at the same speed as Pelee [37] and increases 4.1% [37] at the same speed as CenterNet [45]. ::: :::success 如果跟那種以30∼100 fps執行的物體偵測器比較的話,CSPResNeXt50搭配PANet (SPP)可以在AP、AP50、Ap75得到最佳效能。它們所得到的偵測率分別為38.4%、60.6%、41.6%。如果我們是跟輸入影像大小為512 $\times$ 512的最好的LRF [38]比的話,CSPResNeXt50搭配PANet (SPP)優於ResNet101搭配LRF各自為AP 0.7%、AP50 1.5%、AP75 1.1%。如果是跟以100∼200 fps執行的物體偵測器比較的話,CSPPeleeNet搭配EFM (SAM)用跟Pelee [37]相同速度的話,在AP50會提高12.1%,跟CenterNet [45]相同速度的話則是增加4.1%[37] ::: :::info If compared to very fast object detectors such as ThunderNet [25], YOLOv3-tiny [29], and YOLOv3-tiny-PRN [35], the proposed CSPDenseNetb Reference with PRN is the fastest. It can reach 400 fps frame rate, i.e., 133 fps faster than ThunderNet with SNet49. Besides, it gets 0.5% higher on AP50. 
If compared to ThunderNet146, CSPPeleeNet Reference with PRN (3l) increases the frame rate by 19 fps while maintaining the same level of AP50.
:::
:::success
如果是跟非常快的物體偵測器,像是ThunderNet [25]、YOLOv3-tiny [29]、YOLOv3-tiny-PRN [35]相比的話,我們所提出的CSPDenseNetb Reference with PRN是最快的。它可以來到400 fps的更新速度,也就是比ThunderNet with SNet49還要快133 fps啊!此外,它在AP50上還高了0.5%。如果跟ThunderNet146比的話,CSPPeleeNet Reference with PRN (3l)的更新速度增加19 fps,同時保持著AP50相同的水準。
:::

### 4.5 Analysis
:::info
**Computational Bottleneck.** Figure 7 shows the BLOPS of each layer of PeleeNet-YOLO, PeleeNet-PRN and the proposed CSPPeleeNet-EFM. From Figure 7, it is obvious that the computational bottleneck of PeleeNet-YOLO occurs when the head integrates the feature pyramid. The computational bottleneck of PeleeNet-PRN occurs on the transition layers of the PeleeNet backbone. As to the proposed CSPPeleeNet-EFM, it can balance the overall computational bottleneck, which reduces the computational bottleneck of the PeleeNet backbone by 44% and reduces the computational bottleneck of PeleeNet-YOLO by 80%. Therefore, we can say that the proposed CSPNet can provide hardware with a higher utilization rate.
:::
:::success
**Computational Bottleneck.** Figure 7說明著PeleeNet-YOLO、PeleeNet-PRN與我們所提出的CSPPeleeNet-EFM每一層的BLOPS。從Figure 7很明顯的看的出來,PeleeNet-YOLO的計算瓶頸是發生在頭部(head)整合特徵金字塔的時候。PeleeNet-PRN的計算瓶頸則是發生在PeleeNet backbone的transition layers。如果是我們所提出的CSPPeleeNet-EFM的話,它是可以平衡整體的計算瓶頸,它可以降低PeleeNet backbone 44%的計算瓶頸,也可以降低PeleeNet-YOLO 80%的計算瓶頸。所以啊,我們可以大聲的說,我們所提出的CSPNet是可以提供硬體更高的利用率。
:::
:::info
Figure 7: Computational bottleneck of PeleeNet-YOLO, PeleeNet-PRN and CSPPeleeNet-EFM.
![](https://hackmd.io/_uploads/SJXzPEKui.png)
:::
:::info
**Memory Traffic.** Figure 8 shows the size of each layer of ResNeXt50 and the proposed CSPResNeXt50. The CIO of the proposed CSPResNeXt (32.6M) is lower than that of the original ResNeXt50 (34.4M). In addition, our CSPResNeXt50 removes the bottleneck layers in the ResXBlock and maintains the same numbers of the input channel and the output channel, which is shown in Ma et al. [24] to have the lowest MAC and the most efficient computation when FLOPs are fixed. The low CIO and FLOPs enable our CSPResNeXt50 to outperform the vanilla ResNeXt50 by 22% in terms of computations.
:::
:::success
**Memory Traffic.** Figure 8說明著ResNeXt50跟我們所提出的CSPResNeXt50每一層的大小。我們所提出的CSPResNeXt (32.6M)的CIO是比原始的ResNeXt50 (34.4M)還要來的少。此外,我們的CSPResNeXt50移除了ResXBlock中的瓶頸層,然後維持著相同數量的輸入與輸出channel,如Ma et al. [24]所述,當FLOPs固定的時候,這將會有著最低的MAC以及最有效的計算。低的CIO與FLOPs讓我們的CSPResNeXt50在計算方面能夠優於普通的ResNeXt50達22%。
:::
:::info
**Inference Rate.** We further evaluate whether the proposed methods are able to be deployed on real-time detectors with mobile GPU or CPU. Our experiments are based on NVIDIA Jetson TX2 and Intel Core i9-9900K, and the inference rate on CPU is evaluated with the OpenCV DNN module. We do not adopt model compression or quantization for fair comparisons. The results are shown in Table 5.
:::
:::success
**Inference Rate.** 我們進一步的去評估我們所提出的方法是否能夠部署在搭載mobile GPU或是CPU的實時偵測器上。我們的實驗是基於NVIDIA Jetson TX2與Intel Core i9-9900K,然後inference rate是用OpenCV DNN module來評估。我們並沒有採用模型壓縮或是[量化](https://terms.naer.edu.tw/detail/9ff3177e3cbc35575361ee3c70290dc7/)以示公平。結果如Table 5所示。
:::
:::info
Table 5: Inference rate on mobile GPU (mGPU) and CPU real-time object detectors (in fps).
![](https://hackmd.io/_uploads/HkqrcNF_j.png) ::: :::info If we compare the inference speed executed on CPU, CSPDenseNetb Ref.-PRN receives higher AP50 than SNet49- TunderNet, YOLOv3-tiny, and YOLOv3-tiny-PRN, and it also outperforms the above three models by 55 fps, 48 fps, and 31 fps, respectively, in terms of frame rate. On the other hand, CSPPeleeNet Ref.-PRN (3l) reaches the same accuracy level as SNet146-ThunderNet but significantly upgrades the frame rate by 20 fps on CPU. ::: :::success 如果我們比較在CPU上的推論速度的話,CSPDenseNetb Ref.-PRN得到比SNet49- TunderNet、YOLOv3-tiny、YOLOv3-tiny-PRN還要高的AP50,然後在更新速度上也比上述三個模型各別優於55 fps、48 fps、31 fps。另一方面,CSPPeleeNet Ref.-PRN (3l)得到跟SNet146-ThunderNet 相同的精度水平,但在CPU上的更新速度明顯提高了20 fps。 ::: :::info If we compare the inference speed executed on mobile GPU, our proposed EFM will be a good model to use. Since our proposed EFM can greatly reduce the memory requirement when generating feature pyramids, it is definitely beneficial to function under the memory bandwidth restricted mobile environment. For example, CSPPeleeNet Ref.-EFM (SAM) can have a higher frame rate than YOLOv3-tiny, and its AP50 is 11.5% higher than YOLOv3-tiny, which is significantly upgraded. For the same CSPPeleeNet Ref. backbone, although EFM (SAM) is 62 fps slower than PRN (3l) on GTX 1080ti, it reaches 41 fps on Jetson TX2, 3 fps faster than PRN (3l), and at AP50 4.6% growth. ::: :::success 如果我們比較在mobile GPU上的推論速度的話,我們所提出的EFM會是一很好的模型。因為我們提出的EFM在生成特徵金字塔的時候可以大量的降低記憶體需求,這對於在記憶體頻寬有限的移動環境(mobile environment)下執行是絕對有好處的。舉例來說,CSPPeleeNet Ref.-EFM (SAM)可以有著比YOLOv3-tiny還要高的更新速度(frame rate),而且它的AP50比YOLOv3-tiny還要高11.5%,明顯提升一個檔次!相同的骨幹(SPPeleeNet Ref.),雖然在GTX 1080ti上,EFM (SAM)比PRN (3l)慢了62 fps,不過它在Jetson TX2上可以有著41 fps,比PRN (3l)快3 fps,AP50也提高4.6%。 ::: ## 5. Conclusion :::info We have proposed the CSPNet that enables state-of-the-art methods such as ResNet, ResNeXt, and DenseNet to be light-weighted for mobile GPUs or CPUs. One of the main contributions is that we have recognized the redundant gradient information problem that results in inefficient optimization and costly inference computations. We have proposed to utilize the cross-stage feature fusion strategy and the truncating gradient flow to enhance the variability of the learned features within different layers. In addition, we have proposed the EFM that incorporates the Maxout operation to compress the features maps generated from the feature pyramid, which largely reduces the required memory bandwidth and thus the inference is efficient enough to be compatible with edge computing devices. Experimentally, we have shown that the proposed CSPNet with the EFM significantly outperforms competitors in terms of accuracy and inference rate on mobile GPU and CPU for real-time object detection tasks. ::: :::success 我們提出CSPNet,這可以讓一些目前最好的方法,像是ResNet、ResNeXt、DenseNet,能夠輕量化用於mobile GPU或是CPU上。最主要的貢獻之一就是,我們已經認知到冗餘的梯度信息問題,這會導致效率低的最佳化以及昂貴的推論計算,我們提出利用cross-stage feature fusion strategy(跨階段特徵融合策略)跟截斷梯度流的方式增強在不同層內學習特徵的變異性。此外,我們提出結合Maxout計算的EFM來壓縮從特徵金字塔生成的特徵圖,這大大的減少需要的記憶體頻寬,也因此讓推論的效率足以跟邊緣計算設備相容。通過實驗,我們已經證明,在mobile GPU與CPU上的實時檢測任務,我們所提出的CSPNet(搭載EFM)在準確度跟推論速率上都明顯優於競爭對手。 :::