# Lecture 11
# Detection and Segmentation
###### tags: `CS231n`
[Link to the lecture](https://www.youtube.com/watch?v=nDPWywWRIRo)
This lecture is about solving computer vision tasks using deep learning concepts.
The tasks can be divided into the following categories:
* Semantic Segmentation
* Classification and Localization
* Object Detection
* Instance Segmentation
## Semantic Segmentation
* Label each pixel of the image with a category.
* Differentiating between two objects of the same category isn't required.

Following are a few ideas on how to do this:
### Sliding Windows

* In this approach we extract a patch around each pixel of the image and classify the central pixel by passing the patch through a CNN.
* Repeating this for every pixel classifies the image pixel-wise.
* But this method is very inefficient, since overlapping patches mean the same features are recomputed for every pixel.
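The patch-wise idea above can be sketched as follows. This is a minimal NumPy sketch: the `classify_patch` function is a hypothetical stand-in for a trained CNN (here it just thresholds the mean intensity), and the 5x5 image is a toy example.

```python
import numpy as np

def classify_patch(patch):
    # Hypothetical stand-in for a trained CNN classifier: label the
    # patch by thresholding its mean intensity (assumption for this demo).
    return 1 if patch.mean() > 0.5 else 0

def sliding_window_segment(image, patch_size=3):
    """Classify every pixel from the patch centred on it (zero-padded borders)."""
    pad = patch_size // 2
    padded = np.pad(image, pad)
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + patch_size, j:j + patch_size]
            labels[i, j] = classify_patch(patch)  # one "CNN pass" per pixel
    return labels

# Toy image: a bright 3x3 square on a dark background.
image = np.zeros((5, 5))
image[1:4, 1:4] = 1.0
print(sliding_window_segment(image))
```

Note the nested loop: one classifier call per pixel is exactly the inefficiency the notes point out.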
### Fully Convolutional

* Here we pass the image through a stack of convolution layers that keep the spatial size of the image constant.
* The last layer outputs a volume of per-pixel classification scores; taking the argmax over the class dimension at each pixel gives the segmented image.
* The problem is the heavy computational cost: running full-resolution convolutions through the whole network makes this infeasible for high-resolution images.
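A minimal NumPy sketch of the idea: same-size convolutions produce a class-score volume, and the per-pixel argmax gives the segmentation. The single-channel `conv2d_same` helper, the random kernels, and the 3-class setup are all illustrative assumptions, not the actual lecture architecture.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 3x3 'same' convolution (cross-correlation, as in
    deep-learning usage): zero padding keeps the output at H x W."""
    pad = kernel.shape[0] // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + 3, j:j + 3] * kernel).sum()
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))

# One hypothetical 3x3 kernel per class; in a real network these are learned
# and there would be many stacked layers, not one.
num_classes = 3
score_maps = np.stack([conv2d_same(image, rng.normal(size=(3, 3)))
                       for _ in range(num_classes)])   # shape (C, H, W)

segmentation = score_maps.argmax(axis=0)               # per-pixel class label
print(segmentation.shape)
```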
### Encoder-Decoder

* In this method we first apply the usual convolution and downsampling operations (encoding), which reduce the spatial size of the input image.
* We then upsample the encoded feature map (decoding) using various techniques to get back to the original image size.

There are two types of upsampling, based on whether learnable parameters are used.
#### Non-Learnable Upsampling
There are various types of unpooling; they should be clear from the images below.
* Nearest neighbour
* Bed of nails

* Max Unpooling
In this method we remember the position of the maximum value during max pooling, and place each input value at that remembered position in the output image.
By doing this we roughly preserve boundaries and sharp transitions between categories.
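Max unpooling can be sketched in NumPy as a pool/unpool pair. This is a toy single-channel implementation with a fixed 2x2 window (real frameworks do this natively, e.g. PyTorch's `MaxPool2d(return_indices=True)` with `MaxUnpool2d`):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records where each max came from."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2))
    indices = np.zeros((H // 2, W // 2), dtype=int)  # flat index into x
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i + 2, j:j + 2]
            di, dj = np.unravel_index(window.argmax(), (2, 2))
            pooled[i // 2, j // 2] = window[di, dj]
            indices[i // 2, j // 2] = (i + di) * W + (j + dj)
    return pooled, indices

def max_unpool(pooled, indices, shape):
    """Place each value back at its remembered position; the rest stays zero."""
    out = np.zeros(shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 2., 6., 3.],
              [3., 5., 2., 1.],
              [1., 2., 2., 1.],
              [7., 3., 4., 8.]])
pooled, idx = max_pool_with_indices(x)
print(max_unpool(pooled, idx, x.shape))
```

Each maximum lands back exactly where it was taken from, which is how the spatial detail lost in pooling is partially recovered.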

#### Learnable Upsampling
Transpose convolution is the most commonly used learnable method.
My detailed notes on [Transpose Upsampling](https://hackmd.io/@Sushant240/H1xtYnW-P)
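A minimal 1D sketch of the transpose convolution idea (the input, kernel, and stride are toy values): each input element scales a copy of the kernel, which is pasted into the output at stride-spaced offsets, with overlaps summed. This is how a stride-2 transpose convolution roughly doubles the spatial size.

```python
import numpy as np

def transpose_conv1d(x, kernel, stride=2):
    """1D transpose convolution: paste a scaled copy of the kernel per
    input element at stride-spaced positions; overlapping regions add."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1., 2., 3.])
kernel = np.array([1., 2., 1.])
print(transpose_conv1d(x, kernel))  # length 2*(3-1)+3 = 7: upsampled output
```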
## Classification and Localization
* Here the main task is to identify whether there is an object in the image.
* If there is one, we need to classify it and find its location. This is done using bounding boxes.
* These networks generally calculate two types of losses:
    * **Categorical losses** for the classification output. These include softmax/cross-entropy and SVM margin losses. This is used to check whether the object is correctly identified.
    * **Regression losses** for outputs which are continuous. These include L1 and L2 losses. This is used to check whether the object is correctly localized.

> Note: There are two losses involved here, so we use a technique called multitask loss to train the network parameters.
###### Multitask Loss
Training minimizes a scalar loss by taking its derivative with respect to the network parameters. Here we have two such scalars to minimize, so we combine them into one.
* We introduce an additional hyperparameter.
* We take a weighted sum of the two losses using this hyperparameter.
* This gives a single scalar loss, which we use to train the network parameters.
* Selecting this hyperparameter carefully is important, as it directly changes the value of the loss being minimized.
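The weighted sum above can be sketched as follows. The specific loss functions, the toy scores and boxes, and the weight `lam = 0.1` are all illustrative assumptions:

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Classification loss on raw class scores for one example."""
    shifted = scores - scores.max()          # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def l2_loss(pred_box, true_box):
    """Regression loss on the 4 bounding-box coordinates."""
    return ((pred_box - true_box) ** 2).sum()

scores = np.array([2.0, 0.5, -1.0])          # toy class scores
pred_box = np.array([10., 12., 50., 40.])    # toy (x, y, w, h)
true_box = np.array([12., 10., 48., 44.])

# The hyperparameter weighting the two losses; choosing it changes the
# loss value itself, which is why it must be tuned carefully.
lam = 0.1
total = softmax_cross_entropy(scores, label=0) + lam * l2_loss(pred_box, true_box)
print(total)
```

The gradient of `total` with respect to the parameters is then a single quantity that trains both heads at once.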
## Object Detection
* Object detection involves identifying all the objects present in the given input image.
* Object detection has to give a different number of outputs depending on the input image.

Some of the methods involved are:
### Sliding window method
* This is the most naive method for carrying out the object detection task.
* It involves moving a window over the entire image and passing each crop into a CNN.
* But the objects can be of different sizes and appear in different numbers.
* It isn't practically feasible to slide windows of every size over every position in the image.

* Here the rest of the two dogs are in regions of different sizes.
* So this isn't a good approach to the given problem.
* A simpler alternative is to first select a small set of regions that are likely to contain the desired objects.

### R-CNN
* In this approach, to reduce computation, Regions of Interest (RoIs) are extracted from the image. These RoIs are not produced by learnable parameters but by a traditional computer vision method (Selective Search).
* They are then fed into a CNN for detection.
* The RoIs come in different sizes, so before being passed into the conv net they are warped to a uniform size.
* The classification is then done using SVMs.

#### Problems associated with R-CNN
* Each RoI is passed through the CNN separately, so no computation is shared and both training and inference are very slow.

### Fast R-CNN
* This method is an improved version of the previous one.
* Here the RoIs are taken from a feature map obtained after the whole input image is passed through a CNN.
* This allows the expensive convolutional computation to be shared across all RoIs.

* Here we use a softmax loss and a smooth L1 loss for backpropagation.
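The smooth L1 loss mentioned above can be sketched in NumPy (toy predictions and targets; the normalization used in the paper is omitted). It is quadratic for small errors and linear for large ones, making box regression less sensitive to outliers than a plain L2 loss:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 (Huber-style) loss, as used for box regression in Fast R-CNN:
    0.5*d^2 when |d| < 1, otherwise |d| - 0.5, summed over coordinates."""
    diff = np.abs(pred - target)
    per_coord = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_coord.sum()

pred = np.array([0.2, 1.5, -0.3, 4.0])   # toy regression errors
target = np.zeros(4)
print(smooth_l1(pred, target))
```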
#### Comparison of R-CNN and Fast R-CNN

Notice that Fast R-CNN is bottlenecked by the region proposals, i.e. the model predicts faster than the RoIs can be generated.
This is improved in Faster R-CNN.
### Faster R-CNN
* In this paper a Region Proposal Network (RPN) is introduced: an additional set of learnable parameters that finds relevant RoIs much faster.
* This network is trained with 4 losses: the RPN's object/not-object classification and box regression, plus the final classification and box regression.
> Note: the RPN doesn't learn to recognise object categories or produce the final accurate bounding boxes; it only proposes candidate regions, which the rest of the network classifies and refines.

Test-time speed comparison of the various techniques:

There are a few more object detection methods which are important from a learning point of view but aren't discussed in great detail in this lecture.
### YOLO and SSD
* YOLO stands for "You Only Look Once".
* SSD stands for "Single Shot MultiBox Detector".
* Both treat detection as a single regression problem over a grid of locations, without a separate proposal stage, which makes them very fast.

## Instance Segmentation
* This involves object detection as well as pixel-wise classification.
* Here the task is to differentiate between two objects of the same class.
* Mask R-CNN is used to carry out these tasks.
* As we can see in the image, there are two heads: one predicts the classification scores as well as the bounding box coordinates.
* The other head predicts a segmentation mask for each predicted class.

Mask R-CNN has produced very good results.

In the 1st image, notice that it is able to segment the people at the far end.
* Mask R-CNN can also be used for pose estimation if the input images are labelled with poses as well; an additional branch then predicts the joint coordinates.

The pose estimation results are also very good.
---
For corrections, suggestions and improvements mail - sushant24vnit@gmail.com