###### tags: `Paper Notes`

# R-CNN (Rich feature hierarchies for accurate object detection and semantic segmentation)

* Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
* Institution: UC Berkeley
* Year: 2014

### Model Architecture

![](https://i.imgur.com/8zzMgRe.png)
<center>Figure 1: R-CNN overview</center>

* Architecture overview (see Figure 1):
    1. Take an input image.
    2. Extract about 2000 category-independent region proposals from the image, and use affine image warping to resize every proposal to the same fixed size.
    3. Feed the warped region proposals through a CNN to extract features.
    4. Classify each region proposal with a set of class-specific linear SVMs.
* Region proposals:
    * The paper uses selective search [39] as its region proposal method (R-CNN itself is not tied to any particular region proposal method).
    * Each region proposal is then resized to 227x227 with affine image warping.
* Feature extraction:
    * Passing a region proposal through the CNN yields one 4096-dimensional feature vector per proposal.
    * For the detailed CNN architecture, see [24] and [25]. In short, it is 5 convolutional layers followed by 2 fully connected layers.
* Detection:
    * Each feature vector is fed to an SVM, which outputs the confidence score for its class.
    * Every class has its own dedicated SVM.
    * Greedy non-maximum suppression is then applied per class: a region is rejected if its IoU (Intersection over Union) with a higher-scoring selected region exceeds a learned threshold.

### Experiments & Results

* Selective search is run in fast mode.
* CNN training:
    * The CNN is first pre-trained on the ILSVRC2012 classification dataset.
    * The final layer is then replaced with a randomly initialized (N + 1)-way classification layer, where N is the number of object classes and the extra 1 is the background. For VOC, N = 20.
    * The CNN is then fine-tuned on warped region proposals.
    * During fine-tuning, a region proposal whose IoU with a ground-truth box is at least 0.5 is treated as a positive for that box's class; every other proposal is labeled as background.
* R-CNN reaches 53.3% mAP on the Pascal VOC 2012 dataset.
* For everything else, see the [paper](https://arxiv.org/pdf/1311.2524.pdf).

### Reference

[21] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.

[24] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.

[25] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[39] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
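
### Example: IoU, non-maximum suppression, and fine-tuning labels

The notes above rely on IoU in two places: per-class non-maximum suppression at detection time, and the 0.5 overlap rule that decides whether a region proposal is a positive or background during fine-tuning. The following is a minimal pure-Python sketch of both, not the authors' implementation. The `(x1, y1, x2, y2)` box format, the function names `iou`, `nms`, and `finetune_label`, the use of label `0` for background, and the NMS threshold of `0.3` in the usage lines are all assumptions made for illustration (the paper learns its NMS threshold).

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float) -> List[int]:
    """Greedy NMS for one class: indices of kept boxes, highest score first.

    A box is rejected when its IoU with an already kept, higher-scoring
    box is larger than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def finetune_label(proposal: Box, gt_boxes: Sequence[Box],
                   gt_classes: Sequence[int], pos_iou: float = 0.5) -> int:
    """Class of the best-overlapping ground-truth box if IoU >= pos_iou,
    otherwise 0 (assumed background label; object classes are 1..N)."""
    best_iou, best_cls = 0.0, 0
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_cls = overlap, cls
    return best_cls if best_iou >= pos_iou else 0


# Usage: three scored detections for a single class.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.3)  # 0.3 is an arbitrary choice
print(kept)  # [0, 2]: box 1 overlaps box 0 with IoU 81/119, so it is dropped
print(finetune_label((1, 1, 11, 11), [(0, 0, 10, 10)], [7]))  # 7
```

Because R-CNN scores each class with its own SVM, `nms` would be called once per class on that class's scores. The quadratic loop is fine for a sketch; with roughly 2000 proposals per image a vectorized version would be the more practical choice.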