# Poster Notes
## Design Layout (ignore for now)
* Overall Layout Suggestion

## Sections
* Preamble:
  * Introduction
  * Objective
  * Features
  * Specifications
* Process:
  * Data
  * Model Training
* Application:
  * Pose Estimation
  * Model
  * Deployment
* Outcomes:
  * Experimental Results
  * [optional] Future Work
## Contents
### Introduction
[Yick's Source]
https://ps.is.tuebingen.mpg.de/research_fields/seeing-understanding-people;
* What is computer vision?
  * Giving computers the ability to observe and perceive the world.
* What can we do with computer vision?
  * For computers to be full partners with humans, they have to see us and understand our behavior.
  * They have to recognize our facial expressions, our gestures, our movements and our actions.
  * This means that we need robust algorithms and expressive representations that can capture human pose, motion, and behavior.
* What is Human Pose Estimation?
  * [picture] Human Pose Estimation recognises both the position and orientation of humans.
  * Add a skeleton to demonstrate pose estimation (from Overleaf).
### Objectives/Specifications
* Exploring Sign Language Recognition using only Human Pose Estimation.
* Building a proof-of-concept system that recognises Auslan signs in real time.
  * Insert diagram of a laptop & webcam.
* Recognises four dynamic (moving) Auslan emergency signs:
  * Ambulance, Help, Pain, Hospital.

### Application:
* Pose Estimation
  * How are we using it? Feature extraction: getting keypoints out of an image.
  * We are using OpenPose, open-source software for human pose estimation.
  * Flow (diagram)

* Model:
  * Problem Formulation:
    * Sequential classification problem: the input is a series of frames.
    * [Optionally] Add a diagram for illustration.
  * Long Short-Term Memory (LSTM)
    * Given a continuous sequence of frames containing gestures, we chose an LSTM model that can recognise a sequence of connected gestures.
    * It is based on splitting continuous signs into sub-units and modelling them with neural networks.
    * An improved version of the RNN, well suited to sequence classification.
  * Model Structure (see the sketch below):
    [64 LSTM, 0.2 Dropout] -> [128 LSTM, 0.2 Dropout] -> [Softmax, 4]
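
A minimal Keras sketch of this structure. The window length and feature count are illustrative assumptions (one flattened (x, y) keypoint vector per frame), not the trained model's exact values:

```python
# Minimal sketch of the poster's LSTM architecture.
# TIMESTEPS and N_FEATURES are assumptions, not the trained model's values.
import tensorflow as tf

TIMESTEPS = 30    # assumed frames per sign window
N_FEATURES = 50   # assumed 25 body keypoints x (x, y)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True,
                         input_shape=(TIMESTEPS, N_FEATURES)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(4, activation="softmax"),  # Ambulance, Help, Pain, Hospital
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```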
* Deployment
  * Deployed as a web application.
  * Show the Lucidchart diagram.
  * (Principle) Flow:
    * Clients connect to the application with a webcam.
    * Video is sent to the server (MSE-IT) to compute keypoints (OpenPose).
    * Keypoints are sent back to the client.
    * The model sits in the client's browser to deduce the sign.
  * Tech Stack:
    * WebRTC - Web Real-Time Communication.
    * aiortc, aiohttp (Python) - web server frameworks.
    * TensorFlow.js - browser-side machine learning framework.
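
A hedged sketch of the server-side signaling step in the flow above, using aiohttp and aiortc. The `/offer` route and the OpenPose hand-off comment are assumptions for illustration, not our exact implementation:

```python
# Sketch of a WebRTC signaling endpoint: the client POSTs an SDP offer,
# the server answers and starts receiving the webcam track.
from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription

pcs = set()  # keep peer connections alive

async def offer(request):
    params = await request.json()
    offer = RTCSessionDescription(sdp=params["sdp"], type=params["type"])

    pc = RTCPeerConnection()
    pcs.add(pc)

    @pc.on("track")
    def on_track(track):
        # Incoming webcam frames would be handed to OpenPose here,
        # and the resulting keypoints streamed back to the client.
        pass

    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return web.json_response(
        {"sdp": pc.localDescription.sdp, "type": pc.localDescription.type}
    )

app = web.Application()
app.router.add_post("/offer", offer)

if __name__ == "__main__":
    web.run_app(app, port=8080)
```
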
## Outcomes
* Able to recognise a dynamic sign with under 1 second of delay (given a sign duration of about 1 second).
* Latency measured under a stated hardware specification: an NVIDIA card with 8 GB RAM.
* Model Accuracy Plots:


---
### Auslan/Background
* Speech is an essential part of communication, but not everyone can rely on it.
* Approximately 20,000 Australians rely on Australian Sign Language (Auslan) every day.
* The communication gap between Auslan users and non-Auslan users worsens during emergencies.
### Objective/Goals
* What can we do to provide a solution for recognising Auslan signs that is:
  * Inexpensive - saving on human resources and cost.
  * Efficient - guaranteed to work, based on statistical evidence.
  * Reliable - assured to work given a set of constraints.
* Build a proof-of-concept system showing that human pose estimation is a viable approach to sign language recognition.
* Realise a hardware-independent system that performs sign language recognition without sensors and can run on software systems (in the cloud).
### Block Diagram of System

### Project
#### Human Pose Estimation - OpenPose
* Human Pose Estimation is the ability for computers to infer human body parts from images.
* We use OpenPose - open-source software for human pose estimation developed at Carnegie Mellon University.
* We use pose estimation to convert a video stream of a human into a stream of (x, y) keypoints, as in the sketch below.
* Diagram
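
A hedged sketch of this conversion using OpenPose's Python bindings, following the official examples; the model path and webcam source are assumptions:

```python
# Sketch: one webcam frame in, one array of body keypoints out.
# Follows OpenPose's official Python tutorial; paths are assumptions.
import cv2
import pyopenpose as op  # OpenPose Python bindings

op_wrapper = op.WrapperPython()
op_wrapper.configure({"model_folder": "openpose/models/"})  # assumed path
op_wrapper.start()

capture = cv2.VideoCapture(0)  # default webcam
ok, frame = capture.read()

datum = op.Datum()
datum.cvInputData = frame
# Older OpenPose builds use emplaceAndPop([datum]) instead.
op_wrapper.emplaceAndPop(op.VectorDatum([datum]))

# Shape (people, 25, 3): x, y, confidence per BODY_25 joint.
keypoints = datum.poseKeypoints
print(keypoints.shape if keypoints is not None else "no person detected")
```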

#### Data
* We faced a lack of data, so we recorded videos of ourselves performing the signs.
* We used video and signal processing techniques to perform **Synthetic Data Generation** - a way to generate more training data for our model (illustrative sketch below).
* Image examples:
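
The exact transforms aren't listed here, so the following is only a representative sketch of keypoint-sequence augmentations of this kind, assuming sequences stored as (frames, features) numpy arrays of normalised coordinates:

```python
# Representative keypoint-sequence augmentations (illustrative, not the
# exact transforms used in the project).
import numpy as np

def jitter(seq, sigma=0.01):
    """Add small Gaussian noise to every keypoint coordinate."""
    return seq + np.random.normal(0.0, sigma, seq.shape)

def time_stretch(seq, factor=1.2):
    """Resample the frame axis to simulate faster/slower signing."""
    n = seq.shape[0]
    idx = np.clip(np.round(np.arange(0, n, 1.0 / factor)).astype(int), 0, n - 1)
    return seq[idx]

def mirror(seq, width=1.0):
    """Flip x-coordinates to simulate the opposite hand (assumes
    normalised coordinates and x at even feature indices)."""
    out = seq.copy()
    out[:, 0::2] = width - out[:, 0::2]
    return out

seq = np.random.rand(30, 50)  # stand-in for one recorded keypoint sequence
augmented = [jitter(seq), time_stretch(seq), mirror(seq)]
```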


#### Model
* We model our problem as time series classification.
* Over a time period, we have N frames converted into N keypoint vectors.
* Sequential time series classification uses an LSTM (Long Short-Term Memory) network, a special kind of RNN.
* After training, we tuned the model by performing hyperparameter tuning.
* After training and tuning, we used the model to predict signs from a series of video frames (see the sketch below).
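
A hedged sketch of how prediction over a live stream might look: keep a rolling window of recent keypoint frames and classify the window. The window length, feature count, and function name are illustrative assumptions:

```python
# Sketch: classify a rolling window of per-frame keypoints with a
# trained Keras model. Constants are assumptions, not project values.
from collections import deque
import numpy as np

SIGNS = ["Ambulance", "Help", "Pain", "Hospital"]
TIMESTEPS, N_FEATURES = 30, 50  # assumed window size / feature count

window = deque(maxlen=TIMESTEPS)

def on_new_keypoints(model, keypoints):
    """Call once per frame with that frame's flattened keypoint vector."""
    window.append(keypoints)
    if len(window) < TIMESTEPS:
        return None  # not enough temporal context yet
    batch = np.asarray(window)[np.newaxis, ...]  # (1, TIMESTEPS, N_FEATURES)
    probs = model.predict(batch, verbose=0)[0]
    return SIGNS[int(np.argmax(probs))]
```
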
#### Application
* Once we developed the model, we needed to tie it into a full application for users.
* We deployed it as a web application, hosted on a machine provided by the MSE-IT team.
* Users can log in to the website and see their signs recognised using their own webcam.
* Built on WebRTC, aiohttp, aiortc and WebSockets.
* System Layout

### Results
* We are currently able to recognise up to 4 unique Auslan signs in a row.
* Model test vs training accuracy.
* Dynamic signs take on average 2 to 3 seconds to be recognised.
### Future Work
* Increase the number of signs recognised.
* Use more powerful compute to predict signs at a faster rate.
* Optimise model parameters for faster inference.
* Perform principal component analysis to figure out which features/body joints matter most, so we can discard the unnecessary ones and improve compute time (see the sketch after this list).
* Reach out to the deaf community:
  * To develop a product that better suits their needs.
  * To collect more data.
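
A hedged sketch of that PCA idea, assuming flattened keypoint frames as rows; the data here is a random stand-in, not our recordings:

```python
# Sketch: rank keypoint features by how much variance they carry,
# as a first cut at discarding uninformative joints.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)  # stand-in: 1000 frames x 50 keypoint coords

pca = PCA(n_components=10)
pca.fit(X)
print("variance explained:", pca.explained_variance_ratio_.round(3))

# Rank original features by their total absolute loading across components.
importance = np.abs(pca.components_).sum(axis=0)
print("most informative feature indices:", np.argsort(importance)[::-1][:10])
```
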
### References
* Definition of Human Pose Estimation:
https://www.sciencedirect.com/science/article/pii/S1047320315001121
* OpenPose:
https://arxiv.org/pdf/1812.08008