# yolo
## [1024, idea sharing]
[YOPO](https://docs.google.com/presentation/d/1RcMlPQfXiNZumdZ3R01nM35bvTl2d8xa9TkjPXMxspA/edit?usp=sharing)
- YOLO v1,
- confidence: IOU
- NMS
- LOSS
- YOLO v2,
- Anchor boxes
- 416x416 input replaces the original 448x448
- 416, 208, 104, 52, 26, 13(odd)
- 448, 224, 112, 56, 28, 14(even)
- YOLO v3. $\leftarrow$ still the most widely used in industry; works well
- Darknet-53 $\leftarrow$ this one is interesting
- FPN
- 13x13 feature map (largest receptive field) is used to detect large objects, so it uses the larger anchor priors (116x90), (156x198), (373x326)
- 26x26 feature map (medium receptive field) is used to detect medium objects, so it uses the medium anchor priors (30x61), (62x45), (59x119)
- 52x52 feature map (smallest receptive field) is used to detect small objects, so it uses the smaller anchor priors (10x13), (16x30), (33x23)
- [YOLO演進 — 1](https://medium.com/ching-i/yolo演進-1-33220ebc1d09), [[yolo.v1 paper](https://arxiv.org/pdf/1506.02640.pdf)], [[yolo.v2 paper](https://arxiv.org/pdf/1612.08242.pdf)]
- [anchor](https://zhuanlan.zhihu.com/p/32678099)
- anchor free tends toward large sizes
- box sizes vary a lot, so training is unstable
- [YOLO演進 — 2](https://medium.com/ching-i/yolo演進-2-85ee99d114a1), [[yolo.v3 paper](https://arxiv.org/pdf/1804.02767.pdf)]
- [超详细的Yolov3边框预测分析](https://zhuanlan.zhihu.com/p/49995236), including how anchors are chosen
- [YOLO演進 — 3 — YOLOv4詳細介紹](https://medium.com/ching-i/yolo演進-3-yolov4詳細介紹-5ab2490754ef)
- YOLOX [slide](https://docs.google.com/presentation/d/12eEWge1EprgjddF1msFE-czsEN5M6h5CpKkWhClV2vM/edit?usp=sharing)
- decoupled head
- augmentation:
- mosaic
- mixup
- turn off augmentation for the last 15 epochs
- [[train-from-scratch story: they also found that, with augmentation, training from scratch is better than an ImageNet pre-trained model](https://www.facebook.com/groups/774141029405112/permalink/2032707223548480/)]
- anchor free, mainly following FCOS
- multiple positives
- YOLOR [slide](https://docs.google.com/presentation/d/12eEWge1EprgjddF1msFE-czsEN5M6h5CpKkWhClV2vM/edit?usp=sharing)
- dynamic head [slide](https://docs.google.com/presentation/d/17ug_jiS8by_py4CxEbRR_kiHv1aSTZmXE07p-oKKDVA/edit?usp=sharing)
## [1017, idea sharing]
- YOPO to know how to solve a problem
- how to study YOLO?
- darknet [CH. Tseng blog](https://chtseng.wordpress.com/2018/09/01/%E5%BB%BA%E7%AB%8B%E8%87%AA%E5%B7%B1%E7%9A%84yolo%E8%BE%A8%E8%AD%98%E6%A8%A1%E5%9E%8B-%E4%BB%A5%E6%9F%91%E6%A9%98%E8%BE%A8%E8%AD%98%E7%82%BA%E4%BE%8B/)
- the YOLO papers
- keras [qqwweee github](https://github.com/qqwweee/keras-yolo3)
- problem solving [slide](https://docs.google.com/presentation/d/1RcMlPQfXiNZumdZ3R01nM35bvTl2d8xa9TkjPXMxspA/edit?usp=sharing)
- YOLO v1, compare with [YOPO](https://docs.google.com/presentation/d/1RcMlPQfXiNZumdZ3R01nM35bvTl2d8xa9TkjPXMxspA/edit?usp=sharing)
- confidence: IOU
- NMS
- LOSS
- YOLO v2,
- Anchor boxes
- 416x416 input replaces the original 448x448
- 416, 208, 104, 52, 26, 13(odd)
- 448, 224, 112, 56, 28, 14(even)
- YOLO v3. $\leftarrow$ still the most widely used in industry; works well
- Darknet-53 $\leftarrow$ this one is interesting
- FPN
- 13x13 feature map (largest receptive field) is used to detect large objects, so it uses the larger anchor priors (116x90), (156x198), (373x326)
- 26x26 feature map (medium receptive field) is used to detect medium objects, so it uses the medium anchor priors (30x61), (62x45), (59x119)
- 52x52 feature map (smallest receptive field) is used to detect small objects, so it uses the smaller anchor priors (10x13), (16x30), (33x23)
- YOLO v4
- ....
- YOLOX [slide](https://docs.google.com/presentation/d/12eEWge1EprgjddF1msFE-czsEN5M6h5CpKkWhClV2vM/edit?usp=sharing)
- decoupled head
- augmentation:
- mosaic
- mixup
- turn off augmentation for the last 15 epochs
- [[train-from-scratch story: they also found that, with augmentation, training from scratch is better than an ImageNet pre-trained model](https://www.facebook.com/groups/774141029405112/permalink/2032707223548480/)]
- anchor free, mainly following FCOS
- multiple positives
- YOLOR [slide](https://docs.google.com/presentation/d/12eEWge1EprgjddF1msFE-czsEN5M6h5CpKkWhClV2vM/edit?usp=sharing)
- dynamic head [slide](https://docs.google.com/presentation/d/17ug_jiS8by_py4CxEbRR_kiHv1aSTZmXE07p-oKKDVA/edit?usp=sharing)
- background: [[AP/mAP](/UqABJjk6SsaYLBjzvcBEng)], [深度學習系列: 什麼是AP/mAP?](https://chih-sheng-huang821.medium.com/深度學習系列-什麼是ap-map-aaf089920848)
- [YOLO演進 — 1](https://medium.com/ching-i/yolo演進-1-33220ebc1d09), [[yolo.v1 paper](https://arxiv.org/pdf/1506.02640.pdf)], [[yolo.v2 paper](https://arxiv.org/pdf/1612.08242.pdf)]
- [anchor](https://zhuanlan.zhihu.com/p/32678099)
- anchor free tends toward large sizes
- box sizes vary a lot, so training is unstable
- [YOLO演進 — 2](https://medium.com/ching-i/yolo演進-2-85ee99d114a1), [[yolo.v3 paper](https://arxiv.org/pdf/1804.02767.pdf)]
- [超详细的Yolov3边框预测分析](https://zhuanlan.zhihu.com/p/49995236), including how anchors are chosen
- [YOLO演進 — 3 — YOLOv4詳細介紹](https://medium.com/ching-i/yolo演進-3-yolov4詳細介紹-5ab2490754ef), [[yolo.v4 paper](https://arxiv.org/pdf/2004.10934.pdf)]
- **[YOLOv4 產業應用心得整理 - 張家銘](https://aiacademy.tw/yolo-v4-intro/)**
- [AP/mAP](/UqABJjk6SsaYLBjzvcBEng)
----
|#|Model|Countermeasure|Issue|Benefit|Side effect|Check|
|---|---|---|---|---|---|---|
|1.|YOLOv1|First train 20 conv layers followed by an average-pooling layer and a fully connected layer; then add 4 conv layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information, so the input resolution is increased from 224×224 to 448×448.||||ok|
|2.|YOLOv1|NMS||||OK|
|3.|YOLOv2|Anchor||||OK|
|4.|YOLOv2|Batch Normalization||+2% mAP|||
|5.|YOLOv2|High-Resolution Classifier||+4% mAP|||
|6.|YOLOv2|Anchor Boxes with Clustering Priors||recall 81% to 88%|...||
||YOLOv2|Multi-scale training|||||
||YOLOv2|Multi-head|||||
||YOLOv2|Fine-Grained Features|||||
||YOLOv3|ResNet|||||
||YOLOv3|FPN|||||
||YOLOv4|CSP|||||
||YOLOv4|SPP, PAN|||||
||YOLOv4|CBL (Conv, BN, LeakyReLU) $\Rightarrow$ CBM (Conv, BN, Mish)|||||
||YOLOX|Decoupled head|||||
||YOLOX|Augmentation: MixUp, Mosaic|||||
||YOLOX|FCOS: anchor free||||*|
||YOLOX|SimOTA: label assignment||||*|
||YOLOX|End2End|||decreases performance...|*|
||YOLOR|Manifold||||*|
||YOLOR|Modeling error||||*|
||YOLOR|Disentangle the relation between input and task||||*|
#### [[YOLOX](/7aMVrZ6xSweg2YzV1RgeNw)]
#### [You Only Learn One Representation: Unified Network for Multiple Tasks](/WyreEUW0SaCIU2I9btadEQ), [[url](https://arxiv.org/pdf/2105.04206.pdf)]
- [[知乎](https://zhuanlan.zhihu.com/p/391456531)]
- [paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/pdf/1506.01497.pdf)
- [blog: anchor free](https://medium.com/軟體之心/cv-object-detection-1-anchor-free大爆發的2019年-e3b4271cdf1a)
- [paper FPN](https://arxiv.org/pdf/1612.03144.pdf)
- [paper Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection — the paper argues that FPN and focal loss were key to the rise of anchor-free methods, and that how positive/negative samples are defined is the biggest difference between anchor-free and anchor-based.](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Bridging_the_Gap_Between_Anchor-Based_and_Anchor-Free_Detection_via_Adaptive_CVPR_2020_paper.pdf)
#### [Meta Pseudo Labels](/r0mxE8NMTgSiae7P98fgkg)[[知乎](https://zhuanlan.zhihu.com/p/125478086)]
## YOLOv1
- A quick pass through:
- input: 448 x 448
- output = 7 x 7 x (5+5+20)
- 7 x 7 x ($x_1, y_1, w_1, h_1, c_1, x_2, y_2, w_2, h_2, c_2$, 20 class probabilities) (a small slicing sketch follows)
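To make the output layout concrete, a small NumPy slicing sketch (my own illustration; the variable names are hypothetical):
```
import numpy as np

# S=7 grid, B=2 boxes per cell, 20 classes -> 7 x 7 x 30 output tensor
out = np.zeros((7, 7, 30))
cell = out[3, 4]                    # one grid cell
box1, box2 = cell[0:5], cell[5:10]  # each box: (x, y, w, h, confidence)
class_probs = cell[10:30]           # 20 conditional class probabilities
```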
- $loss \space function = \lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^B \space 1_{ij}^{obj} \space {\Large [}(x_i-\hat{x_i})^2 + (y_i - \hat{y_i})^2{\Large ]}$ **+** $\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^B \space 1_{ij}^{obj} \space {\Large [}(\sqrt{w_i}-\sqrt{\hat{w_i}})^2 + (\sqrt{h_i}-\sqrt{\hat{h_i}})^2 {\Large ]}$ **+** $\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^B \space 1_{ij}^{obj} \space (C_i-\hat{C_i})^2$ **+** $\lambda_{noobj} \sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^B \space 1_{ij}^{noobj} \space (C_i-\hat{C_i})^2$ **+** $\sum\limits_{i=0}^{S^2} \space 1_i^{obj} \space \sum\limits_{c \space \in \space classes} (p_i(c) - \hat{p}_i(c))^2$
- Note that only the last term uses $1_i$; all the others use $1_{ij}$
- where $1_i^{obj}$ denotes if object appears in cell $i$ and $1_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.
- That is, if cell $i$ contains an object, $1_i^{obj}$ is 1 (last term)
- And if cell $i$ contains an object ($1_i^{obj} = 1$), then among its bounding boxes, the $j$-th box that matches (max IOU) gets $1_{ij}^{obj} = 1$
- Unmatched bounding boxes are lightly penalized by the fourth term via $1_{ij}^{noobj}$, e.g. with $\lambda_{noobj} = 0.5$
- So we have three distinct weights; in term order, $[\lambda_{coord}, \lambda_{coord}, 1, \lambda_{noobj}, 1] = [5, 5, 1, 0.5, 1]$
- YOLO predicts multiple bounding boxes per grid cell. **At training time** we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
- NMS is at inference phase, not at training phase.
- NMS threshold: the larger the threshold, the more small overlaps are tolerated (not suppressed), so many boxes remain; the smaller the threshold, the more small overlaps are suppressed, so fewer boxes remain. A minimal sketch follows.
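To make the threshold behavior concrete, here is a minimal NumPy NMS sketch (my own illustration, not code from the paper); boxes are assumed to be in (x1, y1, x2, y2) corner format:
```
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS sketch. boxes: (N, 4) array of (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the top-scoring box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        # a larger threshold suppresses less, so more boxes survive
        order = order[1:][iou < iou_threshold]
    return keep
```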
- Training tricks:
- apparently following this paper: S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015. 3, 7
- First train 20 conv layers followed by an average-pooling layer and a fully connected layer.
- Then add 4 conv layers and two fully connected layers with randomly initialized weights.
- Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
- Questions and deeper understanding
- Follow it through: input ==> output = 7x7 x [(5 predicted values x 2) + 20]
- Then gradient descent... which means looking at the loss function, where some symbols are unfamiliar: $\lambda_{coord}, \lambda_{noobj}, 1_{i}^{obj}, 1_{ij}^{obj}, 1_{ij}^{noobj}$
- Looking at our labelImg output, our label information is x, y, w, h and class. We don't have c, a confidence score. So what is $\hat C_i$? It turns out:
- We define confidence as $P_r(Object) * IOU_{pred}^{truth}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
- i.e.
- if the cell has no object, we simply set $\hat C_i = 0$
- if the cell has an object, we simply set $\hat C_i = IOU$ (a minimal sketch follows at the end of this section)
- Supplement: which grid cell is responsible for a given ground-truth object? The rule: if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- NMS
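To make the confidence target described above concrete, a minimal sketch (my own illustration; boxes assumed in (x1, y1, x2, y2) format): $\hat C_i = 0$ for cells without objects, and $\hat C_i = IOU(pred, truth)$ otherwise.
```
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def confidence_target(pred_box, gt_box, cell_has_object):
    """0 if the cell has no object, IOU(pred, truth) otherwise."""
    return iou(pred_box, gt_box) if cell_has_object else 0.0
```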
## YOLOv2
- YOLOv1's shortcoming: lower recall.
- The more algorithm-level part of YOLOv2 is about improving **recall** and **localization** while maintaining classification accuracy.
- It still has to be fast.
- ### Adopted techniques
- Batch Normalization (+2% mAP),
- High-Resolution Classifier (+4% mAP),
- Anchor Boxes with Clustering Priors (recall 81% to 88%)
- strictly, the paper treats anchors and clustering + direct location prediction as two separate measures
- multi-scale training
- So:
- Batch Normalization
- 2% improvement in mAP over YOLOv1
- ### With batch normalization we can remove dropout from the model without overfitting. (?)
- High Resolution Classifier. 224x224(YOLO) $\rightarrow$ 448x448 $\rightarrow$ 416x416 (odd grids, to have single center cell)
- 10 epochs at ImageNet, then finetune
- This high resolution classification network gives us an increase of almost 4% mAP.
- Convolutional With **Anchor Boxes**
- At this point it is worth stepping back for the big picture
- use k-means to cluster the training-set boxes, with k = 5 (a sketch follows this group of bullets)
- We choose k = 5 as a good tradeoff between **model complexity** and **high recall**.
- input: 416x416 $\Rightarrow$ (t_x, t_y, t_w, t_h, t_o)
- or 13x13x5x(5+80); the first "5" corresponds to the anchors
- paper mentioned,
- When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box.
- ### This refers to YOLOv1's (5, 5, 20) vs. YOLOv2's 5 x (5, 80)
- Using anchor boxes we get a small decrease in accuracy, i.e. mAP drops slightly. Why?
- **Without** anchor boxes our intermediate model gets 69.5 mAP with a recall of **81%**.
- **With** anchor boxes our model gets 69.2 mAP with a recall of **88%**.
- $b_x = \sigma(t_x) + c_x$
- $b_y = \sigma(t_y) + c_y$
- box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
- $b_w = p_w e^{t_w}$
- $b_h = p_h e^{t_h}$
- **Since we constrain the location prediction the parametrization is easier to learn, making the network more stable.**
- Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.
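As referenced above, a minimal sketch of dimension clustering with the paper's distance $d(box, centroid) = 1 - IOU(box, centroid)$, computed on (w, h) pairs sharing a common center (my own illustration, not the reference implementation; it only guards empty clusters trivially):
```
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming boxes share the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    """k-means with d = 1 - IOU; boxes_wh: (N, 2) array of widths/heights."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # minimizing 1 - IOU is the same as maximizing IOU
        assign = np.argmax(iou_wh(boxes_wh, centroids), axis=1)
        new = np.array([boxes_wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```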
- Fine-Grained Features
- 26x26 $\Rightarrow$ concat(26x26, 13x13): the 26x26 features are reorganized (space-to-depth) into 13x13 and concatenated with the 13x13 features
- Multi-Scale Training.
- Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}.
- Training uses 320 up to 608, but at test time even smaller inputs such as 288x288 work; always multiples of 32. (A sketch of the size-picking rule follows.)
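A minimal sketch of the size-picking rule (a hypothetical helper of my own, not the Darknet code):
```
import random

def pick_input_size(batch_idx, current_size, step=10, low=320, high=608):
    """Every `step` batches, draw a new square input size (multiple of 32)."""
    if batch_idx % step == 0:
        return random.choice(range(low, high + 1, 32))  # 320, 352, ..., 608
    return current_size
```
The training loop would call this once per batch and resize the batch to the returned size; the step of 32 matches the network's total downsampling factor.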
# YOLOv3
## Keras YOLOv3 study below.
- box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
- we get anchors = [10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326] from 'model_data/yolo_anchors.txt', reshape to an array of shape (9, 2), then feed it into the loss
(qqwweee/model.py:)
- when the loss calls yolo_head, the anchors have already been split into the three scale groups (anchors[anchor_mask[l]]): l=0 → anchor_mask=[6,7,8], l=1 → anchor_mask=[3,4,5], l=2 → anchor_mask=[0,1,2]
```
def yolo_head(feats, anchors, num_classes, input_shape, calc_loss=False):
    """Convert final layer features to bounding box parameters."""
    num_anchors = len(anchors)
    # Reshape to batch, height, width, num_anchors, box_params.
    anchors_tensor = K.reshape(K.constant(anchors), [1, 1, 1, num_anchors, 2])
    # note: anchors_tensor shape = [1, 1, 1, 3, 2]

    grid_shape = K.shape(feats)[1:3]  # height, width
    grid_y = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]),
                    [1, grid_shape[1], 1, 1])
    grid_x = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]),
                    [grid_shape[0], 1, 1, 1])
    grid = K.concatenate([grid_x, grid_y])
    grid = K.cast(grid, K.dtype(feats))

    feats = K.reshape(
        feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])

    # Adjust predictions to each spatial grid point and anchor size.
    box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
    box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
    box_confidence = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.sigmoid(feats[..., 5:])

    if calc_loss == True:
        return grid, feats, box_xy, box_wh
    return box_xy, box_wh, box_confidence, box_class_probs
```
```
def yolo_loss(args, anchors, num_classes, ignore_thresh=.5, print_loss=False):
    '''Return yolo_loss tensor

    Parameters
    ----------
    yolo_outputs: list of tensor, the output of yolo_body or tiny_yolo_body
    y_true: list of array, the output of preprocess_true_boxes
    anchors: array, shape=(N, 2), wh
    num_classes: integer
    ignore_thresh: float, the iou threshold whether to ignore object confidence loss

    Returns
    -------
    loss: tensor, shape=(1,)

    (note: anchors are read from file as
    [10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326]
    ==> array of shape (9, 2))
    '''
    num_layers = len(anchors)//3  # default setting
    yolo_outputs = args[:num_layers]
    y_true = args[num_layers:]
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]
    input_shape = K.cast(K.shape(yolo_outputs[0])[1:3] * 32, K.dtype(y_true[0]))
    grid_shapes = [K.cast(K.shape(yolo_outputs[l])[1:3], K.dtype(y_true[0])) for l in range(num_layers)]
    loss = 0
    m = K.shape(yolo_outputs[0])[0]  # batch size, tensor
    mf = K.cast(m, K.dtype(yolo_outputs[0]))

    for l in range(num_layers):
        object_mask = y_true[l][..., 4:5]
        true_class_probs = y_true[l][..., 5:]

        grid, raw_pred, pred_xy, pred_wh = yolo_head(yolo_outputs[l],
             anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)
        pred_box = K.concatenate([pred_xy, pred_wh])

        # Darknet raw box to calculate loss.
        raw_true_xy = y_true[l][..., :2]*grid_shapes[l][::-1] - grid
        raw_true_wh = K.log(y_true[l][..., 2:4] / anchors[anchor_mask[l]] * input_shape[::-1])
        raw_true_wh = K.switch(object_mask, raw_true_wh, K.zeros_like(raw_true_wh))  # avoid log(0)=-inf
        box_loss_scale = 2 - y_true[l][...,2:3]*y_true[l][...,3:4]

        # Find ignore mask, iterate over each of batch.
        ignore_mask = tf.TensorArray(K.dtype(y_true[0]), size=1, dynamic_size=True)
        object_mask_bool = K.cast(object_mask, 'bool')
        def loop_body(b, ignore_mask):
            true_box = tf.boolean_mask(y_true[l][b,...,0:4], object_mask_bool[b,...,0])
            iou = box_iou(pred_box[b], true_box)
            best_iou = K.max(iou, axis=-1)
            ignore_mask = ignore_mask.write(b, K.cast(best_iou<ignore_thresh, K.dtype(true_box)))
            return b+1, ignore_mask
        _, ignore_mask = K.control_flow_ops.while_loop(lambda b,*args: b<m, loop_body, [0, ignore_mask])
        ignore_mask = ignore_mask.stack()
        ignore_mask = K.expand_dims(ignore_mask, -1)

        # K.binary_crossentropy is helpful to avoid exp overflow.
        xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(raw_true_xy, raw_pred[...,0:2], from_logits=True)
        wh_loss = object_mask * box_loss_scale * 0.5 * K.square(raw_true_wh-raw_pred[...,2:4])
        confidence_loss = object_mask * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True)+ \
            (1-object_mask) * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True) * ignore_mask
        class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[...,5:], from_logits=True)

        xy_loss = K.sum(xy_loss) / mf
        wh_loss = K.sum(wh_loss) / mf
        confidence_loss = K.sum(confidence_loss) / mf
        class_loss = K.sum(class_loss) / mf
        loss += xy_loss + wh_loss + confidence_loss + class_loss
        if print_loss:
            loss = tf.Print(loss, [loss, xy_loss, wh_loss, confidence_loss, class_loss, K.sum(ignore_mask)], message='loss: ')
    return loss
```
The two Keras YOLOv3 snippets above are from model.py.
# YOLOv3
- It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry.
- I think the main point is the new network architecture
- (13, 13) $\xrightarrow{\text{upsampling}}$ concat with (26, 26) $\rightarrow$ head (26, 26)
- concat((13,13)$_{upsampling}$, (26, 26)) $\xrightarrow{\text{upsampling}}$ concat with (52, 52) $\Rightarrow$ head (52, 52)
- On the COCO dataset the 9 clusters were: (10×13),(16×30),(33×23),(30×61),(62×45),(59× 119), (116 × 90), (156 × 198), (373 × 326).
- multilabel, Class Prediction
- adopts sigmoid instead of softmax.
- binary cross entropy $loss_{clf} = - \sum\limits_{i=1}^n {\Large [} y_ilog(\hat y_i) + (1-y_i)log(1-\hat y_i) {\Large ]}$ (a tiny sketch follows)
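A tiny NumPy sketch of this per-class (multi-label) binary cross-entropy:
```
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Multi-label BCE summed over classes; y_prob are sigmoid outputs."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
```
Unlike softmax, each class gets an independent sigmoid, so one box can carry several labels at once (e.g. both "person" and "woman").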
- $loss \space function = \lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} 1_{ij}^{obj}\space (2-b_{w_i}\times b_{h_i}) {\Large [} (b_{x_i}-\hat{b_{x_i}})^2 + (b_{y_i}-\hat{b_{y_i}})^2 {\Large ]}$
**+** $\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} 1_{ij}^{obj}\space (2-b_{w_i}\times b_{h_i}) {\Large [} (b_{w_i}-\hat{b_{w_i}})^2 + (b_{h_i}-\hat{b_{h_i}})^2 {\Large ]}$
**-** $\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} 1_{ij}^{obj}\space {\Large [} c_{ij}log(\hat{c_{ij}}) + (1-c_{ij})log( 1-\hat{c_{ij}}) {\Large ]}$
**-** $\lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} 1_{ij}^{noobj}\space {\Large [} c_{ij}log(\hat{c_{ij}}) + (1-c_{ij})log( 1-\hat{c_{ij}}) {\Large ]}$
**-** $\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B} 1_{ij}^{obj} \sum\limits_{c \in classes} {\Large [} p_{ij}(c) log (\hat {p_{ij}}(c)) + (1-p_{ij}(c)) log (1-\hat {p_{ij}}(c)) {\Large ]}$
- (vs. YOLOv2, whose loss function uses MSE for the confidence score and classification)

In the kernel alignment discussed in YOLOR, it is added at the FPN outputs, at channels 256, 512, 1024.
For refined prediction, it is added after the FPN branches, where all are 256 channels.

# YOLOv4
- Bag-of-Freebies refers to techniques used during training that do not affect inference time, mainly including:
- Data augmentation: Random Erase, CutOut, Hide-and-Seek, Grid Mask, GAN, MixUp, CutMix
- Regularization: DropOut, DropConnect
- Handling data imbalance: focal loss, online hard example mining, hard negative example mining
- **focal loss** didn't help in YOLOv3
- Bounding-box regression: MSE, IoU, GIoU, DIoU/CIoU
- 📌 Bag-of-Specials refers to techniques in network design or post-processing that slightly increase inference time but buy a larger accuracy gain, mainly including:
- Receptive field: SPP, ASPP, RFB
- Feature fusion: FPN, PAN
- Attention mechanisms: attention modules
- Activation functions: Swish, Mish
- NMS: Soft-NMS, DIoU NMS
- ### Why CSP:
- Study: [[CSPDarknet53 keras](https://github.com/Ma-Dan/keras-yolo4/blob/master/yolo4/model.py)] [[paper](https://arxiv.org/pdf/1911.11929.pdf)]
- increases the CNN's learning capacity: accuracy is preserved even when the model is made lightweight
- removes computationally heavy bottleneck structures (reduces computation)
- reduces memory usage
- YOLOv4 Network
- Backbone: CSPDarknet53
- Neck: SPP + PAN
- Head: YOLO head
Looks like every small trick that could help has been thrown in!!!
- ## BoF for backbone
- CutMix
- Mosaic data augmentation (a minimal sketch follows this list):
- mixes 4 images into one training image via random scaling and cropping; essentially an enhanced CutMix. It enriches the detection dataset,
- and the random scaling produces many small objects, which makes the model more robust.
- Also, the authors wanted good results even on a single GPU: with Mosaic, each sample already aggregates 4 images, so the mini-batch size does not need to be large
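A minimal sketch of the 2x2 mosaic pasting (my own simplification: box handling is omitted, and each source image is assumed to be at least out_size x out_size; real implementations randomly scale/crop each tile):
```
import numpy as np

def mosaic(images, out_size=608, fill=114, rng=None):
    """Paste 4 images into the quadrants around a random center point."""
    rng = rng or np.random.default_rng()
    canvas = np.full((out_size, out_size, 3), fill, dtype=np.uint8)
    cx = int(rng.uniform(0.25, 0.75) * out_size)  # random mosaic center
    cy = int(rng.uniform(0.25, 0.75) * out_size)
    corners = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, corners):
        h, w = y2 - y1, x2 - x1
        canvas[y1:y2, x1:x2] = img[:h, :w]  # naive top-left crop of each tile
    return canvas
```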
- DropBlock
- Figure (b) in the paper: Dropout randomly drops individual neurons, but the network can still learn the same information from adjacent activations
- Figure (c): DropBlock randomly removes whole contiguous regions, so the network must learn other discriminative features to classify correctly, which generalizes better
- Class label smoothing
- from one-hot to soft targets
- For a network trained with a label smoothing of parameter $\alpha$, we minimize instead the cross-entropy between the modified targets $y_k^{LS}$ and the network's outputs $p_k$, where $y_k^{LS} = y_k(1-\alpha) + \alpha/K$ (tiny sketch below)
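As a tiny sketch of the formula:
```
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """y_LS = y * (1 - alpha) + alpha / K, with K = number of classes."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - alpha) + alpha / K

# e.g. [0, 1, 0, 0] with alpha=0.1 -> [0.025, 0.925, 0.025, 0.025]
```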
- ## Bag of Specials (BoS) for backbone
- Mish activation
- Multi-input weighted residual connections (MiWRC)
# [[YOLOX](/7aMVrZ6xSweg2YzV1RgeNw)]
- Authors: we think the biggest differences from earlier YOLOs are the decoupled head, data augmentation, anchor-free design, and label assignment
- Decoupled head: it helps a lot for end-to-end. Going from YOLO to end-to-end YOLO, AP drops 4.2%; with a decoupled head it drops only 0.8%, and it also converges faster.
- Data Augmentation: Mosaic + MixUp
- Our group found early on in other work that pairing Mosaic with CopyPaste still gives a decent gain. The group's consensus: once model capacity is large enough, more posterior signal (data / data augmentation) matters fundamentally more than prior knowledge (various tricks, hand-crafted rules).
- Note: the reason MixUp gives such a clear gain in YOLOX is probably that, in implementation and mechanism, it is closer to CopyPaste than to the original MixUp.
- #### This also gives a new best practice for using YOLOX: **if you plan to stop before 300 epochs, set the point where augmentation is turned off to 10~15 epochs before termination.**
- Anchor-free and label assignment
- FCOS
- (3x3) center sampling to increase positive samples.
- End-to-end (NMS-free) is a tempting property: if the predictions naturally form a set, there is no post-processing or data movement to worry about. Quite a few related works came out last year (DeFCN, PSS, DETR), but achieving end-to-end on a CNN usually requires adding 2 convs to make the features sparse enough, and demands enough representational power in the detection head (the decoupled-head part described this in detail). Their table shows that making YOLOX NMS-free costs a slight AP drop and a clear FPS drop.
So the end-to-end feature was not used in the final version.
# [FCOS](/JKc4z-wUQ1yNgQXJbEdqrQ)
- introduces pixel-based, anchor-free detection
- input: (image, with class c, x0, y0, x1, y1) $\Rightarrow$ inference ()
# [OTA]
- If an anchor receives a sufficient amount of positive label from a certain gt, this anchor becomes one positive anchor for that gt. In this context, the number of positive labels each gt supplies can be interpreted as "how many positive anchors that gt needs for better convergence during the training process".
## [2020 study]
I have great material at ...
but the most difficult part is the anchor
1. pronunciation
2. best article [史上最详细的Yolov3边框预测分析](https://zhuanlan.zhihu.com/p/49995236)
0706: what I learnt from #2 is,
1. the model learns tx, ty, tw, th; these values are offsets. Why not use absolute values? Because if we wanted 2nd-box data at grid (5, 6), we would need to index [5, 6, (5+80):2*(5+80)], which is not convenient, so we adopt offsets.
2. formula:
- $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$
3. scale the input to fit 416x416 (a minimal letterbox sketch follows)
- scale = min(416/img_w, 416/img_h)
- 768x576 ==> min(416/768, 416/576) = min(0.54, 0.72) = 0.54
- 768x576 ==> 768x0.54 = 416, 576x0.54 ≈ 312 (= 576x416/768)
- so the resized image is 416x312; pad the top and bottom with gray (128, 128, 128), and we have a 416x416 input!
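A minimal letterbox sketch matching this arithmetic (my own illustration; uses PIL for resizing and a gray (128, 128, 128) fill):
```
import numpy as np
from PIL import Image

def letterbox(image, target=416, fill=128):
    """Resize keeping aspect ratio, then pad to target x target."""
    ih, iw = image.shape[:2]
    scale = min(target / iw, target / ih)      # e.g. min(416/768, 416/576) = 0.54
    nw, nh = int(iw * scale), int(ih * scale)  # 768x576 -> 416x312
    resized = np.array(Image.fromarray(image).resize((nw, nh)))
    canvas = np.full((target, target, 3), fill, dtype=np.uint8)
    top, left = (target - nh) // 2, (target - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```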