# Methods section

## Split into

* Human Pose Estimation with OpenPose
* Data Collection and Processing
* Model Development
* Application Deployment

### Human Pose Estimation with OpenPose

* What is Human Pose Estimation?
  * Extracting estimated human key-points from image input of humans.
* What did we use?
  * OpenPose, an open-source human pose estimation library.
  * Capable of running multi-person pose detection in real time (footnote: given appropriate hardware specs).
  * Developed by researchers at CMU, written in C++ on the Caffe framework.
* What features does OpenPose extract?
  * OpenPose extracts face, hand, pose and foot keypoints (using separate models)
  * as numerical coordinates.
* How did we use OpenPose output for feature extraction?
  * Take the (x, y) coordinates from the following key-point map.
  * (Show image of OpenPose hand and pose output)
  * Normalize them to the range [-1, 1], where (-1, -1) = top left and (1, 1) = bottom right.

### Data Collection and Processing

* How did we collect data?
  * Struggled to find data from online sources.
  * Resorted to recording 400 raw videos of ourselves:
    * approximately 100 raw videos per sign, using a 30 FPS camera,
    * stored on an online virtual machine server.
* How did we organise data?
  * Sorted videos into folders, where the folder name is the sign label.
  * Within each directory, files are named sign_n, where sign is the sign label and n is the numerical ID for that sample.
* Formatting and processing data from OpenPose
  * We take the (x, y) coordinates from OpenPose, as described in the previous section.
  * Confidence scores are discarded.
  * X data = the (x, y) coordinates flattened into a single array, with n data points from n frames.
  * Y data = one label per such array.
* Increasing data variability
  * Goal: create more training data from our current data set.
  * Referred to paper (link) on creating new training data with transformations/augmentations.
  * Video processing
    * Variable video speeds (0.8x and 1.5x).
    * Extract frames as before and write them into text files.
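The feature-extraction steps above (drop confidence scores, normalize coordinates to [-1, 1], flatten each frame) can be sketched as follows. This is a minimal sketch, not our exact code; the function names and the assumption that OpenPose keypoints arrive as per-frame (x, y, confidence) triples in pixel coordinates are ours.

```python
import numpy as np

def normalize_keypoints(points, width, height):
    """Map pixel (x, y) coordinates into [-1, 1],
    where (-1, -1) is the top left and (1, 1) the bottom right."""
    pts = np.asarray(points, dtype=np.float32)
    xs = 2.0 * pts[:, 0] / width - 1.0
    ys = 2.0 * pts[:, 1] / height - 1.0
    return np.stack([xs, ys], axis=1)

def frames_to_sample(frames, width, height):
    """Build one X sample from a video: per-frame (x, y, confidence)
    keypoints have confidences dropped, coordinates normalized,
    and each frame flattened into a single row."""
    rows = []
    for kp in frames:  # kp: (num_keypoints, 3) -> x, y, confidence
        xy = np.asarray(kp, dtype=np.float32)[:, :2]
        rows.append(normalize_keypoints(xy, width, height).ravel())
    return np.stack(rows)  # shape: (n_frames, 2 * num_keypoints)
```

Normalizing to [-1, 1] makes the features independent of the camera resolution, so videos recorded at different sizes produce comparable inputs.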
* Data augmentation
  * First augmentation: classical image affine transformations BEFORE OpenPose.
    * Rotation, flip, shear, ...
  * Second augmentation: keypoint value changes AFTER OpenPose,
    * using additive Gaussian noise.

## Model Development

* Problem formulation
  * Time-series classification:
  * given N features, predict the class.
* Why did we choose RNNs?
  * Good at sharing state across time steps.
  * Suffer from the vanishing/exploding gradient problem.
* Why LSTMs?
  * Solve the vanishing/exploding gradient problem using gates:
    * forget gate, select gate, update gate.
  * The network learns what to forget, select and update between states, avoiding gradients that vanish through repeated multiplication by numbers < 1.
* Model architecture
  * We propose the following architecture for our system:
  * LSTM(36) -> Dropout(0.2) -> LSTM(36) -> Dropout(0.2) -> Dense(Softmax) -> 4 scores for 4 classes.
* How did we improve the model?
  * Hyperparameter optimisation
    * Run model variations drawn from a search space.
    * Keep the model with the best performance on the test set.
  * Knowledge distillation
    * Can we achieve similar results with a smaller model?
    * Teacher-student distillation.
  * Sub-frame sampling
    * We know humans sign no faster than about 0.5 Hz.
    * Tried reducing N frames to N/2 frames,
    * using N/2 frames for one training data point and the other N/2 frames for another.

## Application Deployment

* Our focus this year was to deploy our model as an application so that people can test gesture recognition from their browser.
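The LSTM stack proposed in the Model Development section above can be sketched as follows. This is a hedged sketch assuming a Keras implementation; `N_FRAMES` and `N_FEATURES` are placeholder values, not the real input dimensions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

N_FRAMES = 30      # frames per sample (placeholder)
N_FEATURES = 50    # flattened (x, y) keypoints per frame (placeholder)
N_CLASSES = 4      # four sign classes

# LSTM(36) -> Dropout(0.2) -> LSTM(36) -> Dropout(0.2) -> Dense(Softmax)
model = Sequential([
    LSTM(36, return_sequences=True, input_shape=(N_FRAMES, N_FEATURES)),
    Dropout(0.2),
    LSTM(36),
    Dropout(0.2),
    Dense(N_CLASSES, activation="softmax"),  # 4 scores for 4 classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The first LSTM returns the full sequence so the second LSTM can consume it; only the second LSTM's final state feeds the softmax layer.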
* Web application workflow
  * Users connect their camera in the browser.
  * The camera works with WebRTC to transmit frames to the back-end server for processing.
  * The server runs OpenPose and outputs keypoints.
  * Keypoints are returned to users via Socket.IO.
  * Our model sits in the user's browser and performs the relevant classification for each gesture.

## Results and Analysis

* Perform K-fold cross-validation to measure average model performance.
* Confusion matrix
  * General error: how many signs were successfully classified overall?
  * Per-class error: how many of each class's signs were successfully recognised?
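The K-fold evaluation above can be sketched as follows, using scikit-learn. This is an illustrative sketch only: the data is synthetic and a logistic-regression classifier stands in for the actual LSTM model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in data: 80 samples of flattened keypoint
# features across 4 sign classes (20 samples each).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = np.repeat(np.arange(4), 20)

scores = []
cm = np.zeros((4, 4), dtype=int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], pred))  # general error
    # Accumulate per-class errors across folds:
    cm += confusion_matrix(y[test_idx], pred, labels=[0, 1, 2, 3])

print("mean accuracy:", np.mean(scores))
print(cm)  # row i, column j: class-i samples predicted as class j
```

The confusion matrix's diagonal answers the per-class question (how many of that class's signs were recognised), while the mean fold accuracy answers the general one.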