Learning Deep Features for Discriminative Localization
====
> Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
Computer Science and Artificial Intelligence Laboratory
> MIT
[Back to table of contents](https://hackmd.io/@marmot0814/ryzDejzGr)
# Abstract
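1. Revisits the global average pooling (GAP) layer and shows that it lets a CNN trained only on image-level labels localize objects.
- GAP: global average pooling layer
2. Achieves 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotations.
- ILSVRC: ImageNet Large Scale Visual Recognition Challenge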
# Introduction
1. Convolutional layers can localize objects without any location supervision, but this ability is lost when fully-connected layers are used for classification.
- Network in Network and GoogLeNet avoid fully-connected layers.
- This minimizes the number of parameters.
2. GAP is usually regarded as a structural regularizer, but it does more than regularize: it lets the network keep its localization ability until the final layer.
3. The approach transfers easily to other recognition datasets for generic classification, localization, and concept discovery.
- Achieves 37.1% top-5 test error for weakly supervised object localization on ILSVRC, close to the 34.2% achieved by the fully supervised AlexNet.
## Related Work
# Class Activation Mapping
![](https://i.imgur.com/J1JYC0p.png)
![](https://i.imgur.com/OtFpU1K.png)
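
For reference, the class activation map named in this section's heading can be written out as below. This is a sketch in the paper's notation: $f_k(x, y)$ is the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$, $w_k^c$ is the output-layer weight connecting unit $k$ to class $c$, and the bias term is ignored. The pooled value $F_k$ is written as a sum; true averaging only adds a constant $1/(HW)$ factor.

$$
F_k = \sum_{x,y} f_k(x, y), \qquad
S_c = \sum_k w_k^c F_k = \sum_{x,y} \sum_k w_k^c f_k(x, y)
$$

$$
M_c(x, y) = \sum_k w_k^c f_k(x, y), \qquad
S_c = \sum_{x,y} M_c(x, y), \qquad
P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}
$$

So $M_c(x, y)$ indicates how much location $(x, y)$ contributes to the score $S_c$ of class $c$.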
# Weakly-supervised Object Localization
## Setup
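- Start from AlexNet, VGGnet, and GoogLeNet and build the corresponding \*-GAP networks (AlexNet-GAP, VGGnet-GAP, GoogLeNet-GAP).
1. Remove some layers (the fully-connected layers and the original softmax).
2. Append, in order:
- a convolutional layer: 3 * 3, stride 1, padding 1, with 1024 units
- a GAP layer
- a softmax layer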
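
The GAP and softmax head described above, together with the CAM it makes possible, can be sketched in plain Python. This is only an illustration under assumptions, not the authors' implementation: the function names, the toy 2 x 2 feature maps, and the weight values are all made up, and a real network would use an array library and learned weights.

```python
import math


def global_average_pool(feature_maps):
    """Average each H x W feature map over its spatial positions (K maps -> K values)."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps, using one class's output-layer weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy input (made-up values): K = 2 feature maps of size 2 x 2.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
# Hypothetical output-layer weights: one row per class, one column per feature map.
weights = [[1.0, 0.0], [0.0, 1.0]]

pooled = global_average_pool(feature_maps)
scores = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
cam = class_activation_map(feature_maps, weights[predicted])
```

Because the pooling here is a true average, the score of a class equals the spatial mean of its CAM, which is why the same output-layer weights can be reused to build the heatmap.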
## Results
1. Classification
![](https://i.imgur.com/XiDdTJg.png)
- Classification accuracy of the \*-GAP networks is similar to that of the original networks.
2. Localization
![](https://i.imgur.com/CBW85VX.png)
- Weak supervision still has a long way to go compared with fully supervised localization.
3. Generic Localization
- To compare feature quality, train a linear SVM on features from the original network and on features from the network with GAP appended.
- ![](https://i.imgur.com/NX9zEQK.png)
- The results are similar.
- To test localization ability under weak supervision:
- ![](https://i.imgur.com/K3g3VSb.png)
- The network can still find the position of the object.
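
For the localization results in this section, a heatmap has to be turned into a bounding box. The paper does this with simple thresholding: keep the region whose value is above 20% of the CAM's maximum, then take the box that covers the largest connected component. A minimal sketch of that step follows. The function name `cam_to_bbox` is hypothetical, 4-connectivity is an assumption (the connectivity is not specified here), and the toy heatmap is made up; in practice the CAM would first be upsampled to the image size.

```python
from collections import deque


def cam_to_bbox(cam, ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) covering the largest connected region
    whose values exceed ratio * max(cam), or None if no cell qualifies."""
    height, width = len(cam), len(cam[0])
    threshold = ratio * max(max(row) for row in cam)
    mask = [[value > threshold for value in row] for row in cam]
    seen = [[False] * width for _ in range(height)]
    best = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search over 4-connected neighbours (an assumption).
            component, queue = [], deque([(x0, y0)])
            seen[y0][x0] = True
            while queue:
                x, y = queue.popleft()
                component.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    xs = [x for x, _ in best]
    ys = [y for _, y in best]
    return min(xs), min(ys), max(xs), max(ys)


# Toy heatmap (made-up values): one blob of three cells and one isolated cell.
toy_cam = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.8, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]
bbox = cam_to_bbox(toy_cam)
```

Coordinates are in heatmap cells, with the box given as inclusive `(x_min, y_min, x_max, y_max)`.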
# Deep Features for Generic Localization
## Fine-grained Recognition
- Classify using the crop inside a bounding box (which can be generated from the CAM) instead of the full image.
- ![](https://i.imgur.com/urWn3KO.png)
- ![](https://i.imgur.com/OGBDxGr.png)
## Pattern Discovery
- Given a set of images containing a common concept, test whether the network can find the positions of the important regions in these images.
1. How to identify the important regions
1. Take the GoogLeNet-GAP network and train a linear SVM on its GAP-layer output using only image-level labels. Then use the SVM weights together with the feature maps before GAP to construct the CAM and identify the important regions.
2. Experiment
1. Discovering informative objects in the scenes
- ![](https://i.imgur.com/y98uNCE.png)
2. Concept localization in weakly labeled images
- ![](https://i.imgur.com/IRlKEWs.png)
3. Weakly supervised text detector
- ![](https://i.imgur.com/a1pwxq1.png)
- positive set: images with text
- negative set: images without text
4. Interpreting visual question answering (???)
- ![](https://i.imgur.com/NfsVB9o.png)
# Visualizing Class-Specific Units
![](https://i.imgur.com/42WpewN.png)
# Conclusion
1. Proposes CAM for CNNs with global average pooling.
- Makes it possible to visualize a class-specific heatmap over a given image.
2. Localizes objects using only weak (image-level) supervision.