Pose Estimation

# Pose Estimation

## [OPEN POSE](https://arxiv.org/pdf/1812.08008.pdf)

#### This algorithm uses a bottom-up approach to generate full-body keypoints.

Instead of first detecting people and then estimating each person's pose, it first detects body parts across the whole image and then groups them into individual people using PAFs (Part Affinity Fields) and a greedy matching step. PAFs are 2D vector fields that encode the direction from one joint of a limb to the next. Compared with top-down approaches, the main gain is speed: the runtime is largely independent of the number of people in the picture. Body part locations are represented as confidence maps, which give, for each pixel, the probability that the pixel contains a given part.

The network pipeline consists of three main steps:

1. A pretrained CNN generates a set of feature maps (F) from the image.
2. These features are passed through a second CNN that predicts the PAFs. This step is repeated Tp times to refine the prediction; at every step t, the input is the original feature set F together with the PAF prediction from the previous step (L, t-1), where t-1 denotes the time step just before the current one.
3. Next, the confidence maps are generated, and this step is also repeated, Tc times. The input at each step is the feature set F, the final PAF prediction (L, Tp), and the confidence map output of the previous step (S, t-1).

#### PROS

* This is the second version of the algorithm, and many of the drawbacks of the first version have been fixed. For example, the number of people in a frame does not affect computation time.
* The initial CNN architecture (VGG in the original) can easily be swapped to suit hardware requirements. [This](https://github.com/ildoonet/tf-pose-estimation) link showcases that.
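
The PAF-based grouping described for OpenPose can be illustrated with a small sketch. It scores a candidate limb by sampling the field along the segment between two joint candidates and averaging the dot product with the segment direction, then pairs candidates greedily by score. This is a simplified illustration for a single limb type, not the authors' implementation: the names `paf_association_score`, `greedy_match`, and `toy_paf`, and the parameters `num_samples` and `min_score`, are hypothetical, and the PAF is modeled as a plain Python callable rather than a network output.

```python
import math


def paf_association_score(paf, joint_a, joint_b, num_samples=10):
    """Average alignment of the field with the segment joint_a -> joint_b.

    `paf` is a hypothetical callable (x, y) -> (vx, vy) that returns the
    field vector at a pixel; a real system would read it from a network output.
    """
    num_samples = max(2, num_samples)
    ax, ay = joint_a
    bx, by = joint_b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)  # evenly spaced samples from joint_a to joint_b
        vx, vy = paf(ax + t * dx, ay + t * dy)
        total += vx * ux + vy * uy
    return total / num_samples


def greedy_match(candidates_a, candidates_b, paf, min_score=0.0):
    """Greedily pair candidates, best score first, using each joint at most once."""
    scored = [
        (paf_association_score(paf, a, b), i, j)
        for i, a in enumerate(candidates_a)
        for j, b in enumerate(candidates_b)
    ]
    scored.sort(reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score <= min_score:
            break  # remaining pairs are not supported by the field
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j, score))
    return pairs


def toy_paf(x, y):
    """Toy field (assumption): points along +x near the rows y = 0 and y = 10."""
    return (1.0, 0.0) if abs(y) < 1 or abs(y - 10) < 1 else (0.0, 0.0)


elbows = [(0, 0), (0, 10)]
wrists = [(10, 0), (10, 10)]
pairs = greedy_match(elbows, wrists, toy_paf)
print(sorted((i, j) for i, j, _ in pairs))  # [(0, 0), (1, 1)]
```

In this toy field the horizontal pairs are fully aligned with the field and score 1.0, while the diagonal pairs score much lower, so each elbow is matched with the wrist on its own row.
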
#### CONS

* I have not yet found any cons for the algorithm apart from some bad predictions in crowded places.
* Other algorithms such as Wrnch show similar accuracy, so an in-depth comparison is still needed.

#### MODEL ARCHITECTURE

![](https://i.imgur.com/3aengwm.png)

## [Simple and Lightweight Human Pose Estimation](https://arxiv.org/pdf/1911.10346.pdf)

#### This is a modification of HRNet, a pose estimation network developed by Microsoft Research Asia.

It is a top-down approach. HRNet uses high-resolution and low-resolution representations in conjunction and introduces 4x lower-resolution subnetworks in the deeper parts of the network. This can be easily seen and understood from the [model architecture](https://www.researchgate.net/profile/Bin_Xiao6/publication/331343676/figure/fig1/AS:730402738155521@1551152991245/Illustrating-the-architecture-of-the-proposed-HRNet-It-consists-of-parallel-high-to-low.png).

#### PROS

* This network promises much better accuracy than OpenPose and lower computation time than the original HRNet.

#### CONS

* This network has been used and tested on the COCO dataset, which does not contain many images of crowded scenes, so its speed and accuracy on a crowded dataset (like the one provided by CrowdPose) are still unknown.

#### MODEL ARCHITECTURE

![](https://i.imgur.com/jh7WTXp.png)

## [CrowdPose](https://arxiv.org/pdf/1812.00324.pdf)

#### This is a pose estimation algorithm developed mainly with multi-person pose detection in mind.

It is a modified version of AlphaPose, which is a top-down approach. Two new features in CrowdPose make it better at pose estimation in crowded scenes:

1. It uses a new loss function which, according to the authors, improved accuracy from 61% to 67%.
2. Instead of directly using Non-Maximum Suppression, it uses a new way of penalizing multiple limb detections on the same person.
Instead of completely dismissing the other estimated limbs, the new algorithm penalizes them slightly, so the same detected limb can later be assigned by the algorithm to another person whose bounding box has a high IoU with the current person's.

The authors have developed their own custom dataset for accuracy evaluation, and each image has a Crowd Score in the range [0, 1] that measures crowd density. It is on this dataset that the network shows significant improvement over other recent architectures.

#### PROS

* Since the main goal of development was pose estimation in crowded pictures, this model may suit our use case, which has similar requirements, and may give higher accuracy than OpenPose.
* This model is also usable for real-time detection, like the other models mentioned above.

#### CONS

* It is not as widely used as OpenPose, so real-world performance might differ from its results on the custom dataset it was tested on.

#### MODEL ARCHITECTURE

![](https://i.imgur.com/QVz0Pxu.png)
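
The bounding-box overlap (IoU) mentioned in the CrowdPose description can be computed as in the minimal sketch below. It only illustrates the overlap measure itself, not the CrowdPose association step; the function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for this example.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: IoU = 1 / (4 + 4 - 1)
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Two people standing close together produce boxes with a high IoU, which is the situation in which a joint detected in one box may actually belong to the neighbouring person.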