# Project Description

## Sections

* Introduction/Background
* Objective
* Project
* Human Pose Estimation
* Data
  * Collecting Data
  * Processing Data
* Model
* Application
* Results
* Moving Forward/Conclusion/Future Research
* References

### Introduction/Background

* Introduce the problem statement.
* Modify from the slides (presentation to Jonathan).
* Current research: literature review.
  * Using a CNN (with the whole video frame as input).

### Objective

* To build a system that translates emergency Auslan signs in real time.
* To explore the robustness of human pose estimation in complex tasks.
* [kiv] To raise the public's awareness of the Auslan community.
* [kiv] To integrate our knowledge of digital signal processing and machine learning into a real-world application.

### Human Pose Estimation

* What is human pose estimation?
  * Human pose estimation is the ability of computers to track human joint keypoints in video frames.
* How did you do pose estimation?
  * We used OpenPose, an open-source library, to transform video into a sequence of human keypoints.
  * It is well cited and was developed by CMU's research team.
  * It can track:
    * Pose (skeleton)
    * Hands
    * Feet
    * Face
* How does OpenPose relate to other work in the field?
  * Paper: "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" (https://arxiv.org/pdf/1812.08008.pdf)
  * There are two broad approaches to human pose estimation:
    * Top down
      * Find the bounding box of each human in the frame.
      * Detect each keypoint directly within that bounding box.
      * The computational cost therefore grows linearly with the number of humans in the frame.
    * Bottom up (the OpenPose way)
      * First detect all candidate keypoints in the frame, then group them into individuals using Part Affinity Fields.
      * PAFs: a representation consisting of a set of flow fields that encodes unstructured pairwise relationships between body parts of a variable number of people.
      * A greedy algorithm then determines which joints are linked (the exact matching problem is NP-hard).
      * The linked joints give inferred human skeletons for the frame.
* How does OpenPose work?
  * The system takes in a colour image of size w x h.
  * It uses an explicit nonparametric representation of keypoint associations that encodes both the position and orientation of human limbs.
  * Its architecture jointly learns part detection and part association.
  * The authors demonstrate that a greedy parsing algorithm is sufficient to produce high-quality parses of body poses while preserving efficiency regardless of the number of people.
  * It produces the 2D locations of anatomical keypoints for each person in the image.

### Data

#### Collecting Data

* We recorded ourselves performing the signs, since suitable videos to train our model on were not available online.
* Synthetic data generation:
  * We wrote video-augmentation scripts to synthetically generate "more" training data.
  * E.g. frame flipping, changing video speed, frame translation/tilting.
* We passed the recorded videos through OpenPose.
* We then labelled the resulting data using scripts.

#### Processing Data

* Normalization: scaling all keypoint (x, y) values into the same range, [-1, 1].
  * Speeds up training.
  * Keeps inputs consistently scaled, which helps keep gradients stable during training.
  * Normalized using the frame width and height (see the sketch after this section).
* K-fold cross-validation
  * Splitting the dataset into training and validation sets across K different folds.
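The normalization and K-fold steps above might look roughly like the following Python sketch. The array shapes, the clip and class counts, and the helper name `normalize_keypoints` are illustrative assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import KFold

def normalize_keypoints(keypoints, frame_width, frame_height):
    """Scale OpenPose (x, y) pixel coordinates into [-1, 1] using the frame size.

    keypoints: array of shape (num_frames, num_joints, 2).
    """
    scaled = np.empty_like(keypoints, dtype=np.float32)
    scaled[..., 0] = 2.0 * keypoints[..., 0] / frame_width - 1.0   # x: [0, w] -> [-1, 1]
    scaled[..., 1] = 2.0 * keypoints[..., 1] / frame_height - 1.0  # y: [0, h] -> [-1, 1]
    return scaled

# K-fold cross-validation: every clip serves as validation data in exactly one fold.
X = np.random.rand(60, 40, 25, 2)      # 60 clips x 40 frames x 25 joints x (x, y)
y = np.random.randint(0, 6, size=60)   # 6 sign classes

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # ... train one model per fold and average the validation scores ...
```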
### Model

* We have data now, so what's next?
  * We need to perform feature extraction: deciding what "data" goes into our model and how it is formatted.
  * We did this using human pose estimation.
* What model did you use to recognise signs?
  * Once we have keypoints from OpenPose, we can formalise the problem as time-series classification: given a sequence of data over time, which category does it belong to?
  * From our research, we found that an LSTM suited this best.
  * Our model is based on an LSTM, an improved RNN specialised in processing sequential data (a minimal model sketch appears at the end of this document).
* What did you do after training the model?
  * We stored checkpoints of the trained model for deployment and re-training.
  * Hyperparameter tuning: we re-trained and re-tuned the model with different hyperparameters to get the best performance.

### Application

* What is your application?
  * A web application, hosted on MSE-IT's computers, that recognises signs in real time.
  * A web server hosts our model; when people connect to the application, they interact with the model to have their signs recognised.
* What tools/software systems did you use?
  * Video streaming is based on WebRTC, the same framework used by Zoom, WhatsApp calls and more.
  * Model deployment is done with TensorFlow.js.
  * OpenPose runs in Python to translate the WebRTC video stream into keypoints in real time.
  * Web framework: aiortc, a Python web-server framework for WebRTC (a minimal server-side sketch also appears at the end of this document).

### Results

* Model accuracy
  * On our own test set, we achieved x % accuracy recognising 6 unique Auslan signs.
* Model recognition time:
  * [Offline prediction] Takes up to x seconds to infer.
  * [Online web app] Takes up to x seconds to infer.
  * [Online desktop] Takes up to x seconds to infer.
* The server can sustain x real-time connections.
* We showed that it is possible to use pose estimation for sign language recognition.

### Moving Forward/Conclusion/Future Research

* Sequence2Sequence detection
  * Transforming a sequence of translated signs into a sequence of words that form a sentence (encoder/decoder model; see the sketch at the end of this document).
* Port the application to systems that stream from CCTVs/IP cameras to recognise emergencies.
* Increase the number of signs detected via an open-source way of collecting data, e.g. an online video stream where users record themselves to add new signs to the model.
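As referenced in the Model section, a minimal Keras sketch of an LSTM classifier over keypoint sequences, with checkpointing, could look like this. The layer sizes, sequence length, and checkpoint filename are illustrative assumptions, not the project's actual architecture.

```python
from tensorflow.keras import layers, models, callbacks

NUM_FRAMES, NUM_JOINTS = 40, 25  # assumed sequence length and joint count
NUM_CLASSES = 6                  # six emergency signs

model = models.Sequential([
    # Each frame's normalized (x, y) keypoints are flattened into one feature vector.
    layers.Input(shape=(NUM_FRAMES, NUM_JOINTS * 2)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Store checkpoints of the best model for deployment and later re-training.
checkpoint = callbacks.ModelCheckpoint("sign_lstm.keras",
                                       monitor="val_accuracy",
                                       save_best_only=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, callbacks=[checkpoint])
```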
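For the Application section, the server-side glue between WebRTC and pose estimation might look roughly like this aiortc sketch. `extract_keypoints` is a hypothetical wrapper around the OpenPose call; the `MediaStreamTrack` subclassing pattern follows aiortc's standard examples.

```python
from aiortc import MediaStreamTrack

class KeypointTrack(MediaStreamTrack):
    """Wraps an incoming WebRTC video track and runs pose estimation per frame."""
    kind = "video"

    def __init__(self, source_track, extract_keypoints):
        super().__init__()
        self.source = source_track
        self.extract_keypoints = extract_keypoints  # hypothetical OpenPose wrapper

    async def recv(self):
        frame = await self.source.recv()
        image = frame.to_ndarray(format="bgr24")    # av.VideoFrame -> numpy array
        keypoints = self.extract_keypoints(image)   # would be fed to the LSTM classifier
        # For the demo, the original frame is passed back to the browser unchanged.
        return frame
```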
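Finally, for the Sequence2Sequence idea under Moving Forward, an encoder/decoder pair could be wired up as in this Keras sketch. The vocabulary sizes and latent dimension are placeholders.

```python
from tensorflow.keras import layers, models

NUM_SIGNS = 50   # assumed vocabulary of recognised signs (encoder tokens)
NUM_WORDS = 200  # assumed vocabulary of English words (decoder tokens)
LATENT = 128

# Encoder: read the sequence of recognised signs, summarise it in the final LSTM state.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(NUM_SIGNS, LATENT)(enc_inputs)
_, state_h, state_c = layers.LSTM(LATENT, return_state=True)(enc_emb)

# Decoder: emit English words one at a time, conditioned on the encoder state.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(NUM_WORDS, LATENT)(dec_inputs)
dec_out, _, _ = layers.LSTM(LATENT, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = layers.Dense(NUM_WORDS, activation="softmax")(dec_out)

seq2seq = models.Model([enc_inputs, dec_inputs], outputs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```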