# Research on HMM and LSTMs

Largely based on: https://pdfs.semanticscholar.org/1801/67646e8a6c910837b4df26ae8f325cdabb63.pdf

## HMM

* Based on HMM techniques in NLP.
* Why:
  * Gestures vary with time, location, and social factors
  * Body movements, like speech sounds, carry certain meanings
  * Regularities in gesture performance while speaking are similar to syntactic rules
  * We can therefore use linguistic models in gesture recognition

### Comparison Between HMM and Neural Networks

Key difference: what is hidden and what is observed.

The thing that is hidden in a hidden Markov model is the same as the thing that is hidden in a discrete mixture model, so for clarity, forget about the hidden state's dynamics and stick with a finite mixture model as an example. The 'state' in this model is the identity of the component that caused each observation. In this class of model such causes are never observed, so 'hidden cause' is translated statistically into the claim that the observed data have marginal dependencies which are removed when the source component is known. The source components are then estimated to be whatever makes this statistical relationship true.

The thing that is hidden in a feedforward multilayer neural network with sigmoid middle units is the states of those units, not the outputs, which are the target of inference. When the output of the network is a classification, i.e., a probability distribution over possible output categories, these hidden unit values define a space within which the categories are separable. The trick in learning such a model is to construct a hidden space (by adjusting the mapping out of the input units) within which the problem is linear. Consequently, non-linear decision boundaries are possible from the system as a whole.

### Description

* Given a sequence of observations {Y1,..,Yt}, we can infer the most likely state sequence {X1,..,Xn}.
* We can formulate {Y1,..,Yt} as the observed keypoints from OpenPose.
* We can say that the hidden states {X1,..,Xn} are the gestures that we want to map to.
* As such, the HMM algorithm can be used to find the most likely classified state (gesture) from a sequence of observations (OpenPose keypoints) in a temporal sense.
* Key: we use one multi-dimensional HMM per defined gesture (as seen in the diagram).

### The HMM Approach

![HMM Graph](https://i.imgur.com/YL3kWM8.jpg)

1. Define meaningful gestures
   * Meaningful gestures must be specified, for example a certain vocabulary list to use.
2. Describe each gesture in terms of an HMM
   * A multi-dimensional HMM is employed for each gesture (seen in the figure above).
   * A gesture is described by a set of N distinct hidden states and M distinct observable states in r dimensions.
   * The HMM is characterized by a transition matrix A and r discrete output distribution matrices Bij, i = 1..r.
3. Collect training data
   * For us, this is the OpenPose keypoint data.
4. Train the HMM on the training data
   * Model parameters are adjusted such that they maximize the likelihood P(O|lambda) for the given training data.
   * There is no analytical solution, but the Baum-Welch algorithm can iteratively re-estimate the model parameters to reach a local maximum.
5. Evaluate gestures with the trained model
   * The trained model can be used to classify incoming gestures.
   * Use the forward-backward algorithm or the Viterbi algorithm to classify isolated gestures.
   * The Viterbi algorithm can also be used to segment continuous gestures.

### Links

* [Real Time ASL Recognition Paper using HMM](https://www.cc.gatech.edu/~thad/p/031_10_SL/real-time-asl-recognition-from%20video-using-hmm-ISCV95.pdf)
* [Hidden Markov Model for Gesture Recognition](https://pdfs.semanticscholar.org/1801/67646e8a6c910837b4df26ae8f325cdabb63.pdf)
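The evaluation step above can be sketched in code. This is a minimal illustration, not the paper's implementation: the gesture names (`wave`, `point`), the 2-state/4-symbol model sizes, and all probability values are made-up assumptions. In a real pipeline the observation symbols would come from quantized OpenPose keypoints and the parameters (pi, A, B) from Baum-Welch training, which is not shown here. Classification picks the gesture model with the highest P(O|lambda) via the scaled forward algorithm; Viterbi recovers the most likely hidden-state path within one model.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for a discrete HMM.

    pi: (N,) initial state distribution
    A:  (N, N) transition matrix, A[i, j] = P(state j | state i)
    B:  (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()          # rescale to avoid underflow
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        s = alpha.sum()
        log_lik += np.log(s)
        alpha = alpha / s
    return log_lik

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path, computed in log space."""
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]
    backptr = []
    for t in range(1, len(obs)):
        trans = delta[:, None] + logA    # trans[i, j]: best score ending in j via i
        backptr.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for ptr in reversed(backptr):        # follow back-pointers to recover the path
        path.append(int(ptr[path[-1]]))
    return path[::-1]

def classify(obs, models):
    """Pick the gesture whose HMM assigns the highest likelihood to obs."""
    return max(models, key=lambda g: forward_log_likelihood(obs, *models[g]))

# Toy models: 2 hidden states, 4 observation symbols per gesture (illustrative values).
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B_wave  = np.array([[0.70, 0.20, 0.05, 0.05],
                    [0.20, 0.70, 0.05, 0.05]])
B_point = np.array([[0.05, 0.05, 0.70, 0.20],
                    [0.05, 0.05, 0.20, 0.70]])
models = {"wave": (pi, A, B_wave), "point": (pi, A, B_point)}

print(classify([0, 1, 0, 1, 1], models))   # symbols 0/1 dominate -> "wave"
print(viterbi([0, 0, 1, 1], pi, A, B_wave))
```

For continuous gesture streams, the same Viterbi machinery can be run over a composite model that chains the per-gesture HMMs, so the decoded path also yields the segmentation boundaries.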