# Image Detection research survey
###### tags: `論文` `學習紀錄` `coarse-to-fine`
[toc]

---
## Before Meeting
:::success
---
## Face Detection
:::
:::success
#### Abstraction
:::
:::info
#### Detail
:::
:::warning
#### Conclusion
:::
[refer](https://zhuanlan.zhihu.com/p/36456092)
[refer](https://blog.csdn.net/hongbin_xu/article/details/78347484)
[refer](https://www.learnopencv.com/face-detection-opencv-dlib-and-deep-learning-c-python/)

---
## Recent Paper

---
### Dlib
:::success
#### Abstraction
- Dlib is a computer-vision library that in several areas outperforms OpenCV, for example in face detection and facial-landmark extraction: its detection accuracy is higher and it is more robust, at the cost of more computation time and hardware resources. A recent fatigue-detection project needed 68-point facial-landmark detection; OpenCV and Dlib were the obvious candidates, and having seen the gap between the two on earlier projects, Dlib was the clear choice. Development still ran into problems, the main one being severe video stutter during landmark detection; analysis showed that dlib's detection consumed too much compute, in the code shown in the official documentation.
- dlib is, after all, a well-known library with C++ and Python interfaces. Using dlib greatly simplifies development: tasks such as face recognition and landmark detection are easy to implement. There are also many applications and open-source projects built on dlib, for example the Python open-source face-recognition library face_recognition.
:::
:::info
#### Detail
- dlib's facial landmarks
    - The model file downloaded above estimates the (x, y) coordinates of 68 facial landmarks, laid out as in the figure below:
    - ![](https://i.imgur.com/e4blTxu.png)
- The program consists of two steps:
    - Detect the face region in the photo.
    - Within the detected face region, locate the facial features (eyes, nose, mouth, chin, eyebrows).
- Face-detection code
    - ![](https://i.imgur.com/0elBfy7.png)
    - In this function, rect is the output of dlib's face-region detector. The function converts rect into a sequence holding the boundary information of the rectangular region.
    - ![](https://i.imgur.com/A9v1j4e.png)
    - Here shape is the output of dlib's facial-landmark detector; one shape contains the 68 landmark points mentioned above. The function converts the shape into a NumPy array for easier downstream processing.
    - ![](https://i.imgur.com/CpZ4tzj.png)
    - Here image is the picture to analyse. At the end of the face-detection program the result image is displayed for verification; the resize keeps the image from exceeding the screen size.
- Main Program
    - ![](https://i.imgur.com/KWFjXpX.png)
    - The path of the image to analyse is read from sys.argv[1]; then the face-region detector and the landmark predictor are initialised. The argument to shape_predictor is the path of the file unpacked earlier.
    - ![](https://i.imgur.com/5BfjGoE.png)
    - Before locating landmarks, the face region must be detected. This code loads the image with OpenCV, resizes it to a suitable size, converts it to grayscale, and finally runs the detector on it. Because a photo may contain several faces, the result is an array rects with the information of every detected face.
    - ![](https://i.imgur.com/Ezh9yiA.png)
    - For each detected face, the facial features (nose, eyes, eyebrows, and so on) are then located. The face region is marked on the photo with a green box, and each facial landmark with a red dot.
:::
:::warning
#### Conclusion
:::
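The screenshotted helper functions and main program above can be pieced together into one runnable sketch. This is a reconstruction, not the original code: the helper names `rect_to_bb` and `shape_to_np` and the model filename `shape_predictor_68_face_landmarks.dat` are assumptions based on the description, and the heavy `cv2`/`dlib` imports are deferred into `main()` so the pure helpers work on their own.

```python
# Sketch of the dlib 68-landmark pipeline described above.
# Assumes dlib, opencv-python, and numpy are installed and the
# shape_predictor_68_face_landmarks.dat model file is available locally.
import sys
import numpy as np

def rect_to_bb(rect):
    # Convert a dlib rectangle into (x, y, w, h) bounding-box form.
    x, y = rect.left(), rect.top()
    return (x, y, rect.right() - x, rect.bottom() - y)

def shape_to_np(shape, n_points=68, dtype="int"):
    # Convert the 68-point dlib shape object into a NumPy array.
    coords = np.zeros((n_points, 2), dtype=dtype)
    for i in range(n_points):
        coords[i] = (shape.part(i).x, shape.part(i).y)
    return coords

def main(image_path, model_path="shape_predictor_68_face_landmarks.dat"):
    import cv2   # imported lazily so the helpers above stay usable
    import dlib  # without the native libraries installed
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(model_path)
    image = cv2.imread(image_path)
    # Resize so the result window does not exceed the screen.
    image = cv2.resize(image, (600, int(image.shape[0] * 600 / image.shape[1])))
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)  # a photo may contain several faces
    for rect in rects:
        x, y, w, h = rect_to_bb(rect)
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)  # green face box
        for (px, py) in shape_to_np(predictor(gray, rect)):
            cv2.circle(image, (int(px), int(py)), 2, (0, 0, 255), -1)  # red landmark dot
    cv2.imshow("Output", image)
    cv2.waitKey(0)

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run as `python detect.py photo.jpg`; the conversion helpers mirror the two screenshots that turn dlib's rectangle and shape objects into plain tuples and NumPy arrays.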
[refer](https://zhuanlan.zhihu.com/p/36456092)
[refer](https://zhuanlan.zhihu.com/p/28448206)
[refer](https://gist.github.com/ageitgey/629d75c1baac34dfa5ca2a1928a7aeaf)
[refer](http://blog.dlib.net/2016/10/easily-create-high-quality-object.html)
[refer](https://zhuanlan.zhihu.com/p/32781218)

---
### OpenCV
:::success
#### Abstraction
- OpenCV (Open Source Computer Vision Library) is a cross-platform computer-vision library. It was initiated and developed by Intel, is released under the BSD license, and is free to use in both commercial and research settings. OpenCV can be used to develop real-time image-processing, computer-vision, and pattern-recognition programs, and the library can be accelerated with Intel's IPP.
:::
:::info
#### Detail
- Programming Language
    - OpenCV is written in C++ and its primary interface is also C++, but it retains a large number of C interfaces. The library also has extensive Python, Java, and MATLAB/Octave (version 2.5) interfaces, whose API functions are documented online, and support for C#, Ch, and Ruby is now available as well.
    - All new development and algorithms use the C++ interface. A GPU interface based on CUDA has been under development since September 2010.
:::
:::warning
#### Conclusion
:::
[refer](http://monkeycoding.com/?page_id=12)

---
### Haar Cascade Face Detector in OpenCV
:::success
#### Abstraction
- The Haar-cascade-based face detector was the state of the art in face detection for many years after 2001, when it was introduced by Viola and Jones. There have been many improvements in recent years. OpenCV ships many Haar-based models, which can be found here.
:::
:::info
#### Detail
- ![](https://i.imgur.com/JxpAhQL.png)
- The above code snippet loads the Haar cascade model file and applies it to a grayscale image. The output is a list containing the detected faces. Each member of the list is again a list with 4 elements indicating the (x, y) coordinates of the top-left corner and the width and height of the detected face.
:::
:::warning
#### Conclusion
- Pros
    - Works almost in real time on CPU
    - Simple architecture
    - Detects faces at different scales
- Cons
    - The major drawback of this method is that it gives a lot of false predictions.
    - Doesn't work on non-frontal images.
    - Doesn't work under occlusion.
:::
[refer](https://github.com/opencv/opencv/tree/master/data/haarcascades)

---
### DNN Face Detector in OpenCV
:::success
#### Abstraction
- This model has been included in OpenCV since version 3.3. It is based on the Single-Shot-Multibox detector and uses a ResNet-10 architecture as its backbone. The model was trained on images available from the web, but the source is not disclosed. OpenCV provides 2 models for this face detector.
:::
:::info
#### Detail
- ![](https://i.imgur.com/MKudIz3.png)
- We load the required model using the above code. If we want to use the floating-point Caffe model, we use the caffemodel and prototxt files; otherwise we use the quantized TensorFlow model. Also note the difference in the way the networks are read for Caffe and TensorFlow.
- The output coordinates of the bounding box are normalized to [0, 1], so they must be multiplied by the height and width of the original image to get the correct bounding box on the image.
:::
:::warning
#### Conclusion
- Pros
    - Most accurate of the four methods
    - Runs in real time on CPU
    - Works for different face orientations (up, down, left, right, side face, etc.)
    - Works even under substantial occlusion
    - Detects faces across various scales (big as well as tiny faces)
- The DNN-based detector overcomes all the drawbacks of the Haar-cascade-based detector without compromising on any benefit Haar provides. We could not see any major drawback for this method, except that it is slower than the Dlib HoG-based face detector discussed next.
:::
[refer](https://arxiv.org/abs/1512.02325)
[refer](https://www.learnopencv.com/face-detection-opencv-dlib-and-deep-learning-c-python/)

---
### HoG Face Detector in Dlib
:::success
#### Abstraction
- This is a widely used face-detection model, based on HoG features and an SVM. You can read more about HoG in our post.
The model is built out of 5 HOG filters: front looking, left looking, right looking, front looking but rotated left, and front looking but rotated right. The model comes embedded in the header file itself.
- The dataset used for training consists of 2825 images obtained from the LFW dataset and manually annotated by Davis King, the author of Dlib. It can be downloaded from here.
:::
:::info
#### Detail
- ![](https://i.imgur.com/dmTZoDu.png)
- In the above code, we first load the face detector, then pass the image through it. The second argument is the number of times to upscale the image: the more you upscale, the better the chances of detecting smaller faces, but upscaling has a substantial impact on computation speed. The output is a list of faces with the (x, y) coordinates of the diagonal corners.
:::
:::warning
#### Conclusion
- Pros
    - Fastest method on CPU
    - Works very well for frontal and slightly non-frontal faces
    - Lightweight model compared to the other three
    - Works under small occlusion
    - Basically, this method works in most cases except the few discussed below.
- Cons
    - The major drawback is that it does not detect small faces, as it is trained for a minimum face size of 80×80. You therefore need to make sure the faces in your application are larger than that; you can, however, train your own face detector for smaller faces.
    - The bounding box often excludes part of the forehead and sometimes even part of the chin.
    - Does not work very well under substantial occlusion.
    - Does not work for side faces and extreme non-frontal faces, such as looking down or up.
:::

---
### CNN Face Detector in Dlib
:::success
#### Abstraction
:::
:::info
#### Detail
- This method uses a Maximum-Margin Object Detector (MMOD) with CNN-based features.
The training process for this method is very simple and you don't need a large amount of data to train a custom object detector. For more information on training, visit the website.
- The model can be downloaded from the dlib-models repository.
- It uses a dataset manually labeled by its author, Davis King, consisting of 7220 images drawn from various datasets such as ImageNet, PASCAL VOC, VGG, WIDER, and FaceScrub. The dataset can be downloaded from here.
:::
:::warning
#### Conclusion
- Pros
    - Works for different face orientations
    - Robust to occlusion
    - Works very fast on GPU
    - Very easy training process
- Cons
    - Very slow on CPU.
    - Does not detect small faces, as it is trained for a minimum face size of 80×80. You therefore need to make sure the faces in your application are larger than that; you can, however, train your own face detector for smaller faces.
    - The bounding box is even smaller than the HoG detector's.
- ![](https://i.imgur.com/WyAHzzd.png)
- ![](https://i.imgur.com/7rb3fKp.png)
:::
[refer](https://arxiv.org/pdf/1502.00046.pdf)
[refer](https://www.learnopencv.com/face-detection-opencv-dlib-and-deep-learning-c-python/)

---
### Raspberry Pi
:::success
#### Abstraction
- The Raspberry Pi is a Linux-based single-board computer developed by the Raspberry Pi Foundation in the United Kingdom, with the goal of promoting basic computer-science education in schools through low-cost hardware and free software.
:::
:::info
#### Detail
- Hardware Structure
    - ![](https://i.imgur.com/2ktmhVI.png)
    - ![](https://i.imgur.com/iguUjBe.png)
    - ![](https://i.imgur.com/a2IpVSb.png)
    - ![](https://i.imgur.com/nIRBs4J.png)
- OS software
- Feature
    - 1. Uses the Broadcom BCM2711 quad-core Cortex-A72 64-bit SoC (previously the BCM2837B0), with a per-core clock of up to 1.5 GHz, roughly three times faster than before.
    - 2. Three memory (LPDDR4 SDRAM) sizes to choose from: 1 GB, 2 GB, and 4 GB.
    - 3. Ethernet reaches true Gigabit speeds.
    - 4. Supports Bluetooth 5.0.
    - 5. Two USB 3.0 and two USB 2.0 ports.
    - 6. Supports dual-display output at resolutions up to 4K.
    - 7. Uses the VideoCore VI GPU, which supports OpenGL ES 3.x.
    - 8.
Hardware decoding of 4Kp60 HEVC video.
:::
:::warning
#### Conclusion
:::
[refer](https://blog.csdn.net/a568713197/article/details/85267764)
[refer](https://yq.aliyun.com/articles/346459)
[refer](https://blog.csdn.net/leaves_joe/article/details/67656340)
[refer](http://shumeipai.nxez.com/2018/03/09/real-time-face-recognition-an-end-to-end-project-with-raspberry-pi.html)
[refer](http://blog.itist.tw/p/how-to-study-raspberry-pi.html)
[refer](https://zh.wikipedia.org/wiki/%E6%A0%91%E8%8E%93%E6%B4%BE)

---
### OpenFace
:::success
#### Abstraction
- This research was supported by the National Science Foundation (NSF) under grant number CNS-1518865. Additional support was provided by the Intel Corporation, Google, Vodafone, NVIDIA, and the Conklin Kistler family fund. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and should not be attributed to their employers or funding sources.
- OpenFace is an open-source library that rivals the performance and accuracy of proprietary models. The project was created with mobile performance in mind, so let's look at some of the internals that make this library fast and accurate, and walk through some use cases for why you might want to use it in a project.
- FaceNet
    - ![](https://i.imgur.com/ZJcZzC3.png)
:::
:::info
#### Detail
- ![](https://i.imgur.com/P3nJvr8.png)
- From a high-level view, OpenFace uses Torch, a scientific-computing framework, to do its training offline. This means it is done only once by OpenFace; users do not have to train on hundreds of thousands of images themselves. Those images are fed through the neural network for feature extraction using Google's FaceNet model. FaceNet relies on the triplet-loss method to compute how accurately the network classifies a face, and it can cluster faces based on measurements on a unit hypersphere.
- After a new image is run through dlib's face-detection model, this trained neural network is used from the Python implementation. Once the faces have been normalised with OpenCV's affine transformation so that they all point in the same direction, they are sent through the trained network in a single forward pass. The resulting 128-dimensional face embedding (the 128 measurements per face are called an embedding) can be matched for classification, or even used in clustering algorithms for similarity detection.
- Training
    - In the training part of the OpenFace pipeline, 500k images are passed through the neural network. These images come from two public datasets: CASIA-WebFace (10,575 distinct individuals, 494,414 images in total) and FaceScrub (530 individuals, all public figures, with 106,863 images).
    - The purpose of training the network on all of these images up front is obvious: in mobile or any other real-time scenario it is impossible to train on half a million images to obtain the required face embeddings. But remember, this part of the pipeline runs only once: OpenFace uses these images to generate the 128-dimensional face embedding that characterises a generic face for the Python training pipeline. Images are then matched using the low-dimensional data rather than in a high-dimensional space, which helps build models quickly.
    - As mentioned earlier, OpenFace uses Google's FaceNet architecture for feature extraction and uses the triplet-loss method to test how accurate the network is on faces. It trains on three different images at a time: a known face image called the anchor image, a second image giving the positive embedding, and finally an image of a different person giving the negative embedding.
    - If you want to learn more about triplet loss, see Andrew Ng's convolutional-neural-network videos on Coursera.
    - An important point about triplet embeddings is that Euclidean distance on a unit hypersphere tells you which images are closer and which are further apart: clearly the negative embedding is measured as further from the anchor than the positive, while the positive sits close to the anchor. This matters because it means clustering algorithms can be used for similarity detection. You might need clustering to detect family members on a genealogy site, or to look for marketing opportunities (such as group deals) on social media.
- Faces apart from the background
    - ![](https://i.imgur.com/xvCFJCg.png)
    - Having covered how OpenFace uses Torch to train on hundreds of thousands of images from public datasets to obtain low-dimensional face embeddings, now let's look at its use of the popular face-detection library dlib, and why it is used rather than OpenCV's face detector.
    - The first step in face-recognition software is to separate the face from the background of the image. The face-detection algorithm must also handle weak and inconsistent lighting as well as different face poses, such as tilted or rotated faces. Fortunately, dlib and OpenCV together handle all of these issues: dlib is responsible for finding the facial landmark points, while OpenCV handles face normalisation.
    - It is worth noting that when using OpenFace you can use dlib's face detection, which combines HOG (histogram of oriented gradients) features with a support-vector machine, or OpenCV's Haar cascade classifier. Both are trained on positive and negative images, but they differ greatly in implementation, detection speed, and accuracy.
    - The HOG classifier has several advantages. First, it is trained using a single sliding sub-window over the image, so it needs none of the sub-sampling and parameter tuning required by OpenCV's Haar classifier. This makes dlib's HOG-and-SVM face detection easier to use and faster to train while needing less data, and HOG face detection is also more accurate than OpenCV's Haar cascade classifier. So dlib's HOG with an SVM is a very convenient choice for face detection.
- Preprocessing
    - ![](https://i.imgur.com/2GRhZ35.png)
    - ![](https://i.imgur.com/Ya6v5X8.png)
    - After the images have been separated from the background and preprocessed with dlib and OpenCV, they are passed to the trained neural network, which is the Torch part of the pipeline. In this step, a single forward pass through the network yields the 128-dimensional embedding (the facial features) used for prediction. These low-dimensional face embeddings are what the classification or clustering algorithms consume.
    - For testing, OpenFace uses a linear support-vector machine, which is commonly used in the real world to match image features. The most impressive point about OpenFace is that classifying an image takes only a few milliseconds.
- Use cases
    - After this high-level discussion of the OpenFace architecture, here are some interesting ideas for use cases. As mentioned earlier, face recognition is used as a form of access control and identification. One example, explored a few years ago, was using it to recognise people and customise their experience as they entered an office (https://blog.algorithmia.com/hey-zuck-we-built-your-facial-recognition-ai/). That was a long time ago; consider instead building a mobile app that recognises VIP guests arriving at a club or party. Bouncers would not have to memorise every face or rely on a guest list to let people in. Adding new faces to the training data is also easy, so when someone steps outside for fresh air and wants to get back into the club, the model is already trained for them. Following the same principles, face recognition could be used at parties or conferences where people need temporary access to a floor or office, with security staff or reception able to update or delete the dataset from a phone.
:::
:::warning
#### Conclusion
:::
[refer](https://blog.csdn.net/dev_csdn/article/details/79176037)
[refer](https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf)
[refer](https://www.cnblogs.com/pandaroll/p/6590339.html)
---
### FaceNet
:::success
#### Abstraction
:::
:::info
#### Detail
:::
:::warning
#### Conclusion
:::

---
### ArcFace (InsightFace)
:::success
#### Abstraction
- This paper proposes a new loss function for face recognition, the additive angular margin loss, and uses it to train the face-recognition algorithm ArcFace (the open-source code names the algorithm insightface; the two mean the same thing, and ArcFace is used below). The idea behind ArcFace's additive angular margin shares some common ground with SphereFace and with the more recent CosineFace (additive cosine margin). The key point is that ArcFace maximises the classification margin directly in angular space, while CosineFace maximises it in cosine space; that is why the paper is called ArcFace, since "arc" carries the same meaning as "angular". Beyond the loss function, the authors also cleaned the public MS-Celeb-1M dataset, stressed the impact of clean data on experimental results, and tuned the network architecture and its parameters.
:::
:::info
#### Detail
:::
:::warning
#### Conclusion
:::
[refer](https://blog.csdn.net/u014380165/article/details/80645489)
[refer](https://arxiv.org/pdf/1801.07698.pdf)

---
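The additive angular margin described in the ArcFace section above can be illustrated with a small numerical sketch. This is NumPy-only illustration, not the authors' implementation: the function name is made up, and the defaults s = 64 and m = 0.5 are the commonly cited scale and margin from the paper.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Additive angular margin: add m to the angle of the target class only."""
    # Normalise features and class weights onto the unit hypersphere,
    # so their dot products are cosines of the angles between them.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos_theta = np.clip(e @ w, -1.0, 1.0)
    theta = np.arccos(cos_theta)          # angles, in [0, pi]
    rows = np.arange(len(labels))
    theta[rows, labels] += m              # the margin lives in *angle* space (the "arc")
    return s * np.cos(theta)              # rescaled logits for a softmax cross-entropy
```

Cross-entropy over these logits forces the network to close an angular gap of m before the target class wins, which is the margin-in-angle-space idea that distinguishes ArcFace from CosineFace's margin on the cosine itself.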