# Project Description
## Sections
* Introduction/Background
* Objective
* Project
* Human Pose Estimation
* Data
* Collecting Data
* Processing Data
* Model
* Application
* Results
* Moving Forward/Conclusion/Future Research
* References
### Introduction/Background
* Introduce the problem statement.
* Modify from the slides (Presentation to Jonathan)
* Current research - literature review
* Using CNNs (taking the whole video frame as input)
### Objective
* To build a system that translates emergency Auslan signs in real time.
* To explore the robustness of using human pose estimation in complex tasks.
* [kiv] To raise public awareness of the Auslan community.
* [kiv] To integrate our knowledge of digital signal processing and machine learning into a real-world application.
### Human Pose Estimation
* What is human pose estimation?
* Human pose estimation is the task of having a computer track human joint keypoints across video frames.
* How did you do pose estimation?
* Used OpenPose, an open-source library, to transform video into a sequence of human keypoints.
* Well cited and developed by CMU's research team.
* Able to track:
* Pose (Skeleton)
* Hands
* Foot
* Face
* How does OpenPose relate to other work in the field?
* Link to paper: https://arxiv.org/pdf/1812.08008.pdf
* "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields"
* Two approaches in the world of human pose estimation:
* Top Down
* Find the bounding box of each human in the frame
* Detect each keypoint directly within the bounding box
* The computational cost of pose estimation therefore grows linearly with the number of humans in the frame.
* Bottom Up - the OpenPose way
* Detect the keypoints of all possible humans in the frame first, using Part Affinity Fields.
* PAFs - a representation consisting of a set of flow fields that encodes unstructured pairwise relationships between body parts of a variable number of people.
* Use a greedy algorithm to determine which detected joints belong to the same person (the exact matching problem is NP-hard).
* From the linked joints, we infer the skeletons of the humans in the frame.
* How does OpenPose work?
* The system takes in a colour image of size w × h.
* It presents an explicit nonparametric representation of keypoint associations that encodes both the position and orientation of human limbs.
* It uses an architecture that jointly learns part detection and part association.
* The authors demonstrate that a greedy parsing algorithm is sufficient to produce high-quality parses of body poses, and that it preserves efficiency regardless of the number of people.
* It produces the 2D locations of anatomical keypoints for each person in the image (a minimal usage sketch follows).
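For concreteness, here is a minimal sketch of driving OpenPose from Python through its `pyopenpose` bindings to get keypoints for one frame. The `model_folder` path and parameters are assumptions, and the binding API differs slightly across OpenPose versions:

```python
import cv2
import pyopenpose as op  # OpenPose's Python bindings, built from the CMU repo

# Assumed configuration: point model_folder at your OpenPose install.
params = {"model_folder": "openpose/models/", "hand": True}

opWrapper = op.WrapperPython()
opWrapper.configure(params)
opWrapper.start()

# Run pose estimation on a single video frame.
datum = op.Datum()
datum.cvInputData = cv2.imread("frame.jpg")
# Note: older OpenPose versions take a plain list here instead of VectorDatum.
opWrapper.emplaceAndPop(op.VectorDatum([datum]))

# poseKeypoints has shape (num_people, num_joints, 3): x, y, confidence.
print(datum.poseKeypoints)
```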
### Data
#### Collecting Data
* Recorded ourselves performing the signs, since suitable videos were not available online to use as a training dataset.
* Synthetic Data Generation:
* Wrote video-augmentation scripts to synthetically generate "more" data to train our model (a sketch follows this list).
* E.g. frame flipping, changing video speed, frame translation/tilting.
* Passed these recorded videos into OpenPose (a sketch of reading its output also follows this list).
* Then labelled the data using scripts.
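A minimal sketch of the kind of augmentations mentioned above, written with OpenCV; the function names, offsets, and angles are illustrative stand-ins, not our exact scripts:

```python
import cv2
import numpy as np

def augment_frame(frame, dx=10, dy=0, angle=2.0):
    """Illustrative per-frame augmentations: horizontal flip,
    translation by (dx, dy) pixels, and a slight tilt."""
    h, w = frame.shape[:2]

    flipped = cv2.flip(frame, 1)  # mirror the signer left/right

    t = np.float32([[1, 0, dx], [0, 1, dy]])
    translated = cv2.warpAffine(frame, t, (w, h))

    r = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    tilted = cv2.warpAffine(frame, r, (w, h))

    return flipped, translated, tilted

def change_speed(frames, factor=2):
    """Crude speed-up: keep every `factor`-th frame of a clip."""
    return frames[::factor]
```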
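When OpenPose is run with its `--write_json` flag, it emits one JSON file per frame. Here is a sketch of collapsing a directory of those files into a per-video keypoint sequence, assuming a single signer per video:

```python
import json
from pathlib import Path
import numpy as np

def load_keypoint_sequence(json_dir):
    """Read OpenPose --write_json output (one JSON file per frame) into an
    array of shape (num_frames, num_features), assuming a single signer."""
    frames = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path) as f:
            data = json.load(f)
        if not data["people"]:
            continue  # no person detected in this frame
        person = data["people"][0]  # assumption: one signer per video
        # Each list is flat: [x1, y1, conf1, x2, y2, conf2, ...]
        frames.append(
            person["pose_keypoints_2d"]
            + person["hand_left_keypoints_2d"]
            + person["hand_right_keypoints_2d"]
        )
    return np.array(frames)
```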
#### Processing Data
* Normalization - scaling all keypoint (x, y) coordinates into the same range (-1, 1).
* Speeds up training.
* May help mitigate the exploding/vanishing gradient problem.
* Normalized using the frame width and height (a sketch follows this list).
* K-Fold Cross-Validation
* Splitting the dataset into K folds; each fold takes a turn as the validation set while the remaining folds are used for training.
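A minimal sketch of both processing steps, assuming flat [x, y, confidence] triplets per keypoint (as OpenPose emits) and an illustrative K = 5:

```python
import numpy as np
from sklearn.model_selection import KFold

def normalise_keypoints(seq, width, height):
    """Map keypoint coordinates into (-1, 1) using the frame size.
    `seq` has shape (num_frames, num_features), where the features are
    flat [x, y, conf, x, y, conf, ...] triplets per frame."""
    out = seq.astype(np.float32).copy()
    out[:, 0::3] = 2.0 * out[:, 0::3] / width - 1.0   # every x coordinate
    out[:, 1::3] = 2.0 * out[:, 1::3] / height - 1.0  # every y coordinate
    return out  # confidence values (every third column) are left as-is

# K-fold cross-validation: each fold takes a turn as the validation set.
X = np.random.rand(60, 40, 75)        # dummy data: 60 clips, 40 frames, 75 features
y = np.random.randint(0, 6, size=60)  # dummy labels for 6 signs
for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train one model per fold here ...
```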
### Model
* We have data now, so what's next?
* We need to perform "feature extraction":
* determining what data goes into our model and how to format it.
* We did this using human pose estimation.
* What model did you use to recognise signs?
* Once we have keypoints from OpenPose, we can formalise the problem as time-series classification:
* given a sequence of data points over time, which category does it belong to?
* From our research, we found that an LSTM suited this best.
* Our model is based on an LSTM, an improved RNN architecture specialised for processing sequential data (a sketch follows at the end of this section).
* What did you do after training the model?
* We stored checkpoints of the trained model for deployment and re-training.
* Hyperparameter tuning: we re-trained and re-tuned the model with different hyperparameters to get the best performance.
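A minimal Keras sketch of the kind of LSTM classifier described here, with checkpointing; the layer size, sequence shape, and training settings are illustrative assumptions rather than our tuned hyperparameters:

```python
import numpy as np
import tensorflow as tf

NUM_FEATURES = 75  # assumption: flattened keypoints per frame
NUM_CLASSES = 6    # six emergency Auslan signs

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, NUM_FEATURES)),  # variable-length sequences
    tf.keras.layers.LSTM(64),                           # illustrative layer size
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Checkpoint the best weights for later deployment and re-training.
ckpt = tf.keras.callbacks.ModelCheckpoint("checkpoints/best.h5",
                                          save_best_only=True)

# Dummy data so the sketch runs end to end.
X = np.random.rand(60, 40, NUM_FEATURES).astype(np.float32)
y = np.random.randint(0, NUM_CLASSES, size=60)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[ckpt])
```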
### Application
* What is your application?
* We wanted to create a web application, hosted on MSE-IT's computers, that recognises signs in real time.
* A web server hosts our model; when people connect to the application, they interact with the model to have their signs recognised.
* What tools/software systems did you use for the system?
* Video streaming is built on WebRTC, the same framework used by Zoom, WhatsApp calls, and more.
* Model deployment is done using TensorFlow.js.
* OpenPose runs in Python to translate the WebRTC video stream into keypoints in real time.
* Web framework - aiortc, a Python-based web server framework for WebRTC (a sketch follows this list).
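A minimal sketch of the aiortc side: a `MediaStreamTrack` wrapper that pulls frames off the browser's WebRTC stream so each one can be handed to the pose-estimation step. The `estimate_keypoints` hook is hypothetical:

```python
from aiortc import MediaStreamTrack

class PoseTrack(MediaStreamTrack):
    """Wraps an incoming WebRTC video track so that every received
    frame can be handed to the pose-estimation pipeline."""
    kind = "video"

    def __init__(self, track):
        super().__init__()
        self.track = track

    async def recv(self):
        frame = await self.track.recv()          # next frame from the browser
        img = frame.to_ndarray(format="bgr24")   # OpenCV-style numpy array
        # estimate_keypoints(img)  # hypothetical hook into the OpenPose step
        return frame                             # pass the video through unchanged
```

In aiortc's standard server example, a wrapper track like this is attached inside the peer connection's `"track"` event handler.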
### Results
* Model Accuracy
* Based on our own test set, we achieved x % accuracy of recognising 6 unique Auslan signs.
* Model Recognition Time:
* [Offline Prediction] Takes up to x seconds to infer.
* [Online Web App] Takes up to x seconds to infer.
* [Online Desktop] Takes up to x seconds to infer.
* Able to sustain x amount of real-time connections on server.
* Showed that it is possible to use pose estimation for sign language recognition.
### Moving Forward/Conclusion/Future Research
* Sequence-to-sequence detection
* Transforming a sequence of translated signs into a sequence of words that form a sentence (encoder/decoder model).
* Port the application to systems that stream from CCTV/IP cameras to recognise emergencies.
* Increase the number of signs detected via crowdsourced data collection (e.g. an online video stream where users record themselves to add new signs to the model).