---
title: Internship Task Review3
tags:
description: Action recognition in Kids using UNITY Pose Augmentation
---

# Internship Task Review 3
## Action recognition in Kids using UNITY Pose Augmentation

<!-- Put the link to this slide here so people can follow -->

### Tasks done last week
1. Analysed the model on the previous dataset with varied FPS, varied window size for feature extraction, and shift augmentation
2. Increased the kids dataset to see the accuracy change at 3 FPS
3. Moved to a custom object detector implementation (SSD: Single-Shot MultiBox Detector)
4. Prepared the dataset for the class "Kid reading book" and started training

---

### Task-1
#### Analyse the model on the previous dataset with varied FPS, varied window size for feature extraction, and shift augmentation

**Initial Training Data Details**

| Class | 30 FPS | 15 FPS | 6 FPS | 3 FPS |
| -------- | ------ | ------ | ----- | ----- |
| **Crawl** | 2079 | 1061 | 424 | 215 |
| **Cruising** | 2209 | 1818 | 729 | 366 |
| **Run** | 2541 | 2190 | 878 | 441 |
| **Jump** | 2193 | 1102 | 448 | 227 |

*Note: all accuracies below are measured on a custom test set built from 1 test video of each class.*

**Accuracy analysis on Custom Dataset**

| Window size | 30 FPS | 15 FPS | 6 FPS | 3 FPS |
| ----------- | ------ | ------ | ----- | ----- |
| **1 sec** | 78% | 75% | 74% | 67% |
| **2 sec** | 80% | 81% | 76% | 71% |

**3 FPS video**

| Window size | Test accuracy |
| ---------------- | -------- |
| **1 sec** (3 Frames) | 67% |
| **2 sec** (6 Frames) | 71% |
| **3 sec** (9 Frames) | 68% |

**Shift augmentation**

| Accuracy | Without shift augmentation | With shift augmentation |
| ------------------- | -------------------------- | ----------------------- |
| Validation accuracy | 97% | 100% |
| Custom test accuracy | 73% | 77% |

---

#### :bulb: Conclusion
* Accuracy dropped as the video FPS decreased; a major reason could be the reduction in training data.
* Increasing the window size used for feature extraction from 1 sec to 2 sec raises accuracy at every FPS (by 2 to 6 percentage points).
* But on 3 FPS video, increasing the window size from 2 sec to 3 sec lowers accuracy (71% to 68%).
* Shift augmentation improves accuracy further (custom test accuracy 73% to 77%).

---

### Task-2
#### Increased the Dataset of kids to see the Accuracy change
- Downloaded many Japanese YouTube videos to increase the kids dataset.
- Trimmed the required parts of the videos.
- Different videos had different dimensions, which was badly affecting the pose estimation.
- Masked the videos so that their dimensions become a factor of the OpenPose input size, 656x368.
- Masking helps to get accurate pose estimation.

---

![](https://i.imgur.com/xtDgBBy.png)

---

![](https://i.imgur.com/eutKKeY.png)

**Training Results**

**Window size 1 sec (3 Frames)**

![](https://i.imgur.com/00m4p7c.png)

**Window size 2 sec (6 Frames)**

![](https://i.imgur.com/beB0Va6.png)

---

#### :bulb: Conclusion
* Pose estimation becomes more accurate with masking.
* Accuracy increased drastically with more training data.
* A further increase in accuracy can be seen by increasing the window size from 2 sec to 3 sec.

---

### Task-3
#### Moved to custom Object Detector implementation (SSD: Single-Shot MultiBox Detector)
* **Fine-tuning the SSD300 model** trained on the COCO dataset with 80 object classes and 1 background class.
* **Sub-sampling or up-sampling** the available pretrained weights to match the required number of classes.
* **Data generator** defining the transformations for pre-processing and data augmentation.

##### For now, finalized 3 classes for fine-tuning the model
1. Kid playing with ball - **ball detection**
2. Kid reading book - **book detection**
3. Kid cutting paper with scissors - **scissor detection**

##### SSD trained on MS COCO has 80 classes, among which
* Sports ball (**ID 33**) and Scissors (**ID 77**) are already present.
* No Book class was found, so the Kite class (**ID 34**) is used instead (similar shape, and both are mostly made of paper), so its pre-trained weights may help a little.
* Background class by default (**ID 0**).

##### Other key points about SSD
* SSD512 **has better accuracy** (by 2.5%) than SSD300 but runs at 22 FPS instead of 59 (*SSD300 has input resolution 300x300 while SSD512 has 512x512*).
* SSD performs worse than Faster R-CNN **for small-scale objects**.
* The scissors and the ball a kid plays with are relatively small.
* However, the scissors class is already present in the COCO dataset.

![](https://i.imgur.com/GpX3biP.png)

---

### Task-4
#### Prepared Dataset for Class Kid reading book and Started Training
* First I downloaded images in bulk directly from Google using a downloader; however, it took a lot of time to filter out the good images.
* For now, downloaded images from iStock for the reading book class (*took less time to filter, as most of the images were good*).

##### **Some training images**

![](https://i.imgur.com/wr8XnnY.png)

#### Annotated 1000 images for the Book class, using 750 for training and 150 for validation

---

### Further work
* Accurate object detection on all classes.
* End-to-end framework combining object detection and action recognition.
* Below is the rough pipeline.

![](https://i.imgur.com/J7szwyP.png)

*Please suggest further modifications to the above pipeline.*

---
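
#### Sketch: windowing for Task-1

As a rough illustration of the Task-1 setup, the sketch below cuts a per-frame feature sequence into fixed-length windows, where window size in seconds times FPS gives the frame count (2 sec at 3 FPS is 6 frames). The review does not define shift augmentation, so the code assumes it means generating extra windows from shifted start offsets. `make_windows`, its parameters, and the 36-value feature size are hypothetical and not taken from the project code.

```python
import numpy as np

def make_windows(features, fps, window_sec, shift=None):
    """Split a (num_frames, feat_dim) array into fixed-length windows.

    shift is the step in frames between window starts. None means
    non-overlapping windows. A smaller shift yields overlapping windows,
    which is the ASSUMED meaning of "shift augmentation" here.
    """
    win = int(round(fps * window_sec))
    step = win if shift is None else shift
    if win <= 0 or step <= 0:
        raise ValueError("window length and shift must be positive")
    starts = range(0, len(features) - win + 1, step)
    windows = [features[s:s + win] for s in starts]
    if not windows:
        # Clip shorter than one window: return an empty batch.
        return np.empty((0, win) + features.shape[1:], dtype=features.dtype)
    return np.stack(windows)

# Dummy pose features: 30 frames of a 3 FPS clip, 36 values per frame
# (the feature size is an assumption for illustration only).
feats = np.zeros((30, 36))
plain = make_windows(feats, fps=3, window_sec=2)
shifted = make_windows(feats, fps=3, window_sec=2, shift=1)
print(plain.shape, shifted.shape)  # (5, 6, 36) (25, 6, 36)
```

With a shift of 1 frame the same clip yields 25 windows instead of 5, which is one way the training set could grow at low FPS.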
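
#### Sketch: masking for Task-2

The masking step in Task-2 is not spelled out in the review. One plausible reading is that each frame is padded with black borders until its aspect ratio matches the OpenPose input size of 656x368, so that resizing does not distort the body. The sketch below follows that assumption; `mask_to_aspect` is a hypothetical helper, not the code used in the project.

```python
import numpy as np

TARGET_W, TARGET_H = 656, 368  # OpenPose input size quoted in Task-2

def mask_to_aspect(frame, target_w=TARGET_W, target_h=TARGET_H):
    """Pad a (H, W, C) frame with black borders to the target aspect ratio."""
    h, w = frame.shape[:2]
    # Compare w/h with target_w/target_h using integers to avoid float error.
    if w * target_h >= h * target_w:
        # Frame is relatively too wide: keep width, grow height.
        new_w = w
        new_h = -(-w * target_h // target_w)  # ceiling division
    else:
        # Frame is relatively too tall: keep height, grow width.
        new_h = h
        new_w = -(-h * target_w // target_h)  # ceiling division
    top = (new_h - h) // 2
    left = (new_w - w) // 2
    out = np.zeros((new_h, new_w) + frame.shape[2:], dtype=frame.dtype)
    out[top:top + h, left:left + w] = frame  # original pixels stay centred
    return out

# A square 368x368 dummy frame gets black bars on the left and right.
frame = np.ones((368, 368, 3), dtype=np.uint8)
padded = mask_to_aspect(frame)
print(padded.shape)  # (368, 656, 3)
```

Padding instead of stretching keeps limb proportions unchanged, which would be consistent with the observation that masking gives more accurate pose estimation.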
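
#### Sketch: weight sub-sampling for Task-3

The weight sub-sampling mentioned in Task-3 can be pictured as slicing the class channels of each SSD confidence layer down to the four IDs listed there (0, 33, 34, 77). This is a minimal NumPy sketch with random stand-in weights. It assumes the last axis of the layer is laid out box by box with 81 consecutive class scores per box, and the layer shape (3x3 kernel, 512 input channels, 4 boxes) is only an example. `subsample_classifier` is a hypothetical name.

```python
import numpy as np

N_SOURCE = 81               # 80 COCO classes + background
KEEP_IDS = [0, 33, 34, 77]  # background, sports ball, kite (stand-in for book), scissors

def subsample_classifier(kernel, bias, n_boxes, keep_ids=KEEP_IDS, n_source=N_SOURCE):
    """Keep only the chosen class channels of one confidence layer.

    ASSUMPTION: the last axis holds n_boxes groups, each with n_source
    consecutive class scores.
    """
    k = kernel.reshape(kernel.shape[:-1] + (n_boxes, n_source))
    b = bias.reshape(n_boxes, n_source)
    k = k[..., keep_ids].reshape(kernel.shape[:-1] + (n_boxes * len(keep_ids),))
    b = b[:, keep_ids].reshape(-1)
    return k, b

# Random stand-in weights; real weights would come from the pretrained model.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 512, 4 * N_SOURCE))
bias = rng.normal(size=(4 * N_SOURCE,))
new_kernel, new_bias = subsample_classifier(kernel, bias, n_boxes=4)
print(new_kernel.shape, new_bias.shape)  # (3, 3, 512, 16) (16,)
```

After slicing, the new class indices follow the order of `KEEP_IDS`: 0 background, 1 ball, 2 book (initialised from kite), 3 scissors.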