# (7/25) Computer Vision Recent Paper: YOLO
###### tags: `paper`

[toc]

---

## Before Meeting

:::success
### Author
- Joseph Redmon
    - https://scholar.google.com/citations?user=TDk_NfkAAAAJ&hl=en
    - ![](https://i.imgur.com/EGMOeZ3.png)
- Santosh Kumar Divvala
    - https://scholar.google.com/citations?user=-DYvinwAAAAJ&hl=zh-TW
    - ![](https://i.imgur.com/zfRNEj2.png)
- Ross Girshick
    - https://scholar.google.com/citations?user=W8VIEZgAAAAJ&hl=en
    - ![](https://i.imgur.com/Eq9hxgN.png)
- Ali Farhadi
    - https://scholar.google.com/citations?user=jeOFRDsAAAAJ&hl=zh-TW
    - ![](https://i.imgur.com/PgszyGR.png)
:::

[refer](https://pjreddie.com/darknet/yolo/)
[refer](https://pjreddie.com/media/files/papers/YOLOv3.pdf)
[refer](https://arxiv.org/pdf/1506.02640.pdf)

---

## Recent Paper

---

- [ ]
### You Only Look Once: Unified, Real-Time Object Detection

:::success
#### Abstract
- We present YOLO, a new approach to object detection.
- We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
- A single network predicts bounding boxes and class probabilities directly from full images in one evaluation.
- The base YOLO model runs in real time at 45 frames per second.
- ![](https://i.imgur.com/WSHXEaU.png)
:::

:::info
#### Detail
- Introduction
    - A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
    - Detection is framed as a regression problem.
    - YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.
    - YOLO learns generalizable representations of objects.
    - Since YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.
- Unified Detection
    - Network Design
        - The model is evaluated on the PASCAL VOC detection dataset.
        - The architecture is inspired by the GoogLeNet model for image classification.
        - Our network has 24 convolutional layers followed by 2 fully connected layers.
        - Instead of GoogLeNet's inception modules, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers.
        - ![](https://i.imgur.com/LBkCRUS.png)
    - Training
        - For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer.
        - We use the Darknet framework for all training and inference.
        - Adding both convolutional and connected layers to pretrained networks can improve performance.
        - Our final layer predicts both class probabilities and bounding box coordinates.
        - The final layer uses a linear activation function; all other layers use the following leaky rectified linear activation:
        - ![](https://i.imgur.com/rANrtBx.png)
        - The model is optimized for sum-squared error in its output.
        - We increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects.
        - During training we optimize the following multi-part loss function:
        - ![](https://i.imgur.com/XAkYWYA.png)
        - A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers.
    - Inference
        - YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
        - Some large objects, or objects near the border of multiple cells, can be well localized by multiple cells.
    - Limitations of YOLO
        - YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class.
        - Our model struggles with small objects that appear in groups, such as flocks of birds.
- Comparison to Other Detection Systems
- Experiments
    - First we compare YOLO with other real-time detection systems on PASCAL VOC 2007.
    - We show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
    - Comparison to Other Real-Time Systems
        - Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection.
        - YOLO pushes mAP to 63.4% while still maintaining real-time performance.
        - ![](https://i.imgur.com/LpSORfl.png)
    - VOC 2007 Error Analysis
        - For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:
            - Correct: correct class and IOU > .5
            - Localization: correct class, .1 < IOU < .5
            - Similar: class is similar, IOU > .1
        - ![](https://i.imgur.com/SZOLo3i.png)
    - Combining Fast R-CNN and YOLO
        - YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance.
        - Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.
    - VOC 2012 Results
        - On the VOC 2012 test set, YOLO scores 57.9% mAP.
        - ![](https://i.imgur.com/ZBnrFNz.png)
    - Generalizability: Person Detection in Artwork
- Real-Time Detection In The Wild
    - YOLO is a fast, accurate object detector, making it ideal for computer vision applications.
    - ![](https://i.imgur.com/fUoh94v.png)
    - ![](https://i.imgur.com/GDWbtWI.jpg)
:::

:::warning
#### Conclusion
- Our model is simple to construct and can be trained directly on full images.
- YOLO is trained on a loss function that directly corresponds to detection performance, and the entire model is trained jointly.
- YOLO also generalizes well to new domains, making it ideal for applications that rely on fast, robust object detection.
:::

[refer]()

---

:::success
#### Abstract
:::

:::info
#### Detail
:::

:::warning
#### Conclusion
:::

[refer]()

---
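
#### Code sketch for the YOLO notes

Two pieces of the YOLO notes above are easy to check in code: the leaky rectified linear activation used in every layer except the last, and the IOU thresholds used in the VOC 2007 error analysis. The following is a minimal sketch, not the authors' Darknet code. The function names, the `(x_min, y_min, x_max, y_max)` box format, and the handling of an IOU of exactly .5 are assumptions made for illustration. The slope of 0.1 and the extra "other" and "background" categories come from the paper, not from the bullets in these notes.

```python
def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation: x if x > 0, otherwise slope * x.

    The slope of 0.1 is the value given in the paper.
    """
    return x if x > 0 else slope * x


def iou(box_a, box_b):
    """Intersection over union of two boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.
    """
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_detection(overlap, same_class, similar_class=False):
    """Map one prediction to an error-analysis category.

    overlap is the IOU with the matched ground-truth box.
    An IOU of exactly .5 is not covered by the listed thresholds;
    this sketch places it under "localization" (an assumption).
    """
    if same_class and overlap > 0.5:
        return "correct"
    if same_class and overlap > 0.1:
        return "localization"
    if similar_class and overlap > 0.1:
        return "similar"
    if overlap > 0.1:
        return "other"       # wrong class, IOU > .1 (category from the paper)
    return "background"      # IOU <= .1 (category from the paper)


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 0, 3, 2)
    overlap = iou(predicted, ground_truth)
    print(overlap, classify_detection(overlap, same_class=True))
    print(leaky_relu(-2.0), leaky_relu(3.0))
```

In the usage example the two boxes share half of their area, so the right class is predicted but the box is only loosely placed, which is the case the "Localization" category is meant to capture.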