Learning Deep Features for Discriminative Localization
====

> Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
> Computer Science and Artificial Intelligence Laboratory, MIT

[Back to table of contents](https://hackmd.io/@marmot0814/ryzDejzGr)

# Abstract

1. Revisit the GAP
    - GAP (global average pooling layer)
2. Achieves 37.1% top-5 error for object localization on ILSVRC 2014, without training on bounding-box annotations
    - ILSVRC (ImageNet Large Scale Visual Recognition Challenge)

# Introduction

1. Convolutional layers can localize objects, but this ability is lost when fully-connected layers are used for classification.
    - Network in Network and GoogLeNet avoid fully-connected layers.
    - This also minimizes the number of parameters.
2. GAP (known as a kind of structural regularizer) doesn't simply act as a regularizer: it also lets the network keep its localization ability until the final layer.
3. This approach can be easily transferred to other recognition datasets for generic classification, localization and concept discovery.
    - It achieves 37.1% top-5 test error for weakly supervised localization, close to the fully supervised AlexNet.

## Related Work

# Class Activation Mapping

![](https://i.imgur.com/J1JYC0p.png)
![](https://i.imgur.com/OtFpU1K.png)

# Weakly-supervised Object Localization

## Setup

- Use AlexNet, VGGnet and GoogLeNet to build the corresponding GAP networks (AlexNet-GAP, VGGnet-GAP, GoogLeNet-GAP).
    1. Remove some layers (the fully-connected layers and the softmax).
    2. Add a convolutional layer:
        - 3 × 3, stride 1, padding 1, with 1024 units
        - GAP layer
        - softmax

## Results

1. Classification
    ![](https://i.imgur.com/XiDdTJg.png)
    - The GAP networks perform similarly to the original networks.
2. Localization
    ![](https://i.imgur.com/CBW85VX.png)
    - Weak supervision still has a long way to go.
3. Generic Localization
    - Compare the feature-extraction ability of the original network and the GAP network by training a linear SVM on their features.
        - ![](https://i.imgur.com/NX9zEQK.png)
        - The results are similar.
    - Test the localization ability obtained with weak supervision.
        - ![](https://i.imgur.com/K3g3VSb.png)
        - The network can still find the position of the object.
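
The GAP-plus-softmax setup and the class activation map from the Class Activation Mapping section can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy shapes (4 feature maps of size 3 × 3, 2 classes) and the random values are all hypothetical. It treats the CAM for class `c` as the weighted sum of the last convolutional feature maps, using the weights that connect the GAP output to class `c`.

```python
import numpy as np

def global_average_pool(features):
    """Average each feature map over space: (K, H, W) -> (K,)."""
    return features.mean(axis=(1, 2))

def class_scores(features, weights):
    """Pre-softmax class scores from GAP features; weights has shape (C, K)."""
    return weights @ global_average_pool(features)

def class_activation_map(features, weights, class_idx):
    """Weighted sum of the K feature maps for one class: returns (H, W)."""
    return np.tensordot(weights[class_idx], features, axes=([0], [0]))

# Toy inputs (hypothetical): K=4 feature maps of size 3x3, C=2 classes.
rng = np.random.default_rng(0)
features = rng.random((4, 3, 3))
weights = rng.standard_normal((2, 4))

scores = class_scores(features, weights)
cam = class_activation_map(features, weights, class_idx=0)

# Because pooling is an average, the class score equals the mean of its CAM.
print(np.allclose(cam.mean(), scores[0]))
```

The final check shows why the map is meaningful: the class score is just the spatial average of the CAM, so regions with large CAM values are the ones that contribute most to that class.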

# Deep Features for Generic Localization

## Fine-grained Recognition

- with bounding box
    - ![](https://i.imgur.com/urWn3KO.png)
    - ![](https://i.imgur.com/OGBDxGr.png)

## Pattern Discovery

- Given a set of images containing a common concept, test whether the network can find the positions of the important regions in these images.

1. How to identify the important regions in order to test the network's performance
    1. Use a GoogLeNet-GAP network trained with image-level labels. Use the SVM weights and the GAP features to construct the CAM that identifies the important regions.
2. Experiments
    1. Discovering informative objects in the scenes
        - ![](https://i.imgur.com/y98uNCE.png)
    2. Concept localization in weakly labeled images
        - ![](https://i.imgur.com/IRlKEWs.png)
    3. Weakly supervised text detector
        - ![](https://i.imgur.com/a1pwxq1.png)
        - positive set: images with text
        - negative set: images without text
    4. Interpreting visual question answering (???)
        - ![](https://i.imgur.com/NfsVB9o.png)

# Visualizing Class-Specific Units

![](https://i.imgur.com/42WpewN.png)

# Conclusion

1. CAM for CNNs with global average pooling.
    - It makes it possible to visualize a heatmap for a given image.
2. Weak supervision is enough to localize the object.
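
As a rough illustration of the second conclusion point, a heatmap can be turned into a bounding box by thresholding it. The sketch below is a simplification under stated assumptions: `cam_to_bbox` is a hypothetical helper, the 20% relative threshold is an illustrative default, and the box simply covers every above-threshold pixel rather than selecting one connected region.

```python
import numpy as np

def cam_to_bbox(cam, rel_threshold=0.2):
    """Return (row_min, col_min, row_max, col_max) around the hot region.

    The map is shifted so its minimum is 0, then pixels at or above
    rel_threshold * max are kept (rel_threshold is an assumed default).
    """
    shifted = cam - cam.min()
    mask = shifted >= rel_threshold * shifted.max()
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing a hot pixel
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing a hot pixel
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Toy heatmap (hypothetical): one strong 2x4 blob and one weak stray pixel.
cam = np.zeros((6, 6))
cam[2:4, 1:5] = 1.0
cam[0, 0] = 0.1  # below 20% of the maximum, so it is ignored

bbox = cam_to_bbox(cam)
print(bbox)  # (2, 1, 3, 4)
```

In practice the CAM would first be upsampled to the input image size so that the box coordinates refer to image pixels.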