# Capstone Meeting #8 (24/04/2020)
## Agenda
| Schedule | Item |
|----------|------|
| 11:00 - 11:15 | Progress report |
| 11:15 - 11:35 | Final check on Assignment 1 Part 1 |
| 11:35 - 11:50 | Discuss things to do for Part 2 |
| 11:50 - 12:05 | WBS for Auslan |
| 12:05 - 12:20 | Tasks to do this week |
| 12:20 - 12:30 | Other concerns |
## Final check on Assignment 1 Part 1
- Completed a final check against the rubric.
- Submitting the current version on the LMS.
## Discuss things to do for Part 2 (or a final check)
### Team SWOT Analysis
#### Tsz Kiu
Strengths:
- Strong interest and knowledge in software
- Familiar with music
- Keen on learning
- Interest in field of research
- Experienced with music-related research
Weaknesses:
- Not familiar with machine learning/gesture recognition
- Not familiar with Windows OS
Opportunities:
- Picking up machine learning
#### Matthew
Strengths:
- Collaborative tools
- Cloud computing
- Keen on learning
Weaknesses:
- Machine learning
- Reading research papers
Opportunities:
- Wants to pick up machine learning
#### Yick
Strengths:
- Comfortable with computing/programming
- Forward thinking
- Good at organizing the workflow for assignments
Weaknesses:
- Lack of formal training/fundamentals
Opportunities:
- Learning proper techniques in machine learning
## Progress Report
- Everyone's update from the previous week (before the Assignment 1 work)
### Matthew
- IP camera: ~40fps
- IP camera on localhost using cam2web
- cam2web issue: camera resolution
- Webcam issue: the problem may be on the OpenCV side (see the sketch after this list)
- Not much help online
- Need to read documentation
- UDP stream
- What should the output of pose estimation be?
- This depends on the gesture recognition block
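
For reference, a minimal OpenCV sketch for pulling frames from an MJPEG stream such as the one cam2web exposes. This is a sketch under assumptions: the URL path and port are placeholders, not the actual cam2web endpoint on the VM.

```python
# Minimal sketch (assumptions): reading a cam2web-style MJPEG stream with OpenCV.
# The port and the "/camera/mjpeg" path are placeholders -- check the actual
# cam2web configuration for the real endpoint.
import cv2

STREAM_URL = "http://localhost:8000/camera/mjpeg"  # hypothetical endpoint

cap = cv2.VideoCapture(STREAM_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open stream -- check cam2web is running and the URL")

while True:
    ok, frame = cap.read()              # grab one frame from the HTTP stream
    if not ok:
        continue                        # dropped frame; try again
    cv2.imshow("IP camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```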
### Yick
- Gesture recognition is not that straightforward
- Current gesture recognition technologies do not require pose estimation keypoints.
- The datasets depend on what current gesture recognition algorithms use
- There are two parts of our project:
- Research
- Figure out which algorithms to employ
- Search up on how well OpenPose works in gesture recognition
- Implementation
- Using data to train model itself
- Need large datasets with annotation
- Can be found online, just need to process
- Transfer learning
- Using other people's models to train ours
- But requires a lot of research in altering datasets
### Tsz Kiu
- Researching music-related material
- Able to link the JSON output to PureData in real time
- Reading each JSON file one by one
- Using the coordinates of body parts to drive PureData (see the sketch below)
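
A minimal sketch of the JSON-reading step: iterate over OpenPose's per-frame JSON files and pull out a body keypoint (the right wrist here). The directory name and the choice of keypoint are illustrative assumptions; forwarding the values to PureData (e.g. over OSC/UDP) is only noted in a comment.

```python
# Minimal sketch: walking OpenPose per-frame JSON files and extracting
# (x, y, confidence) for one keypoint. Directory name is a placeholder.
import json
from pathlib import Path

RWRIST = 4  # BODY_25 keypoint index for the right wrist

def wrist_positions(json_dir="output_json"):
    for f in sorted(Path(json_dir).glob("*.json")):
        data = json.loads(f.read_text())
        if not data["people"]:
            continue                                  # no person detected in this frame
        kp = data["people"][0]["pose_keypoints_2d"]   # flat list: x0, y0, c0, x1, y1, c1, ...
        x, y, conf = kp[3 * RWRIST: 3 * RWRIST + 3]
        yield x, y, conf

for x, y, conf in wrist_positions():
    print(x, y, conf)   # in practice these values would be sent on to PureData (e.g. via OSC/UDP)
```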
## WBS for Gesture Recognition
- Talk about our steps for the transition to Auslan
- Give ourselves 1.5 weeks to gather insights while developing our WBS
## Tasks for this week
- Debugging the webcam issue (Matthew)
- Start working on the Project Charter (Team)
- Finish the WBS for Gesture Recognition in two weeks (Team)
- Find and document gesture recognition research built on top of body coordinates (maybe two weeks, could be longer) (Tsz Kiu)
- Find and set up open source code on gesture recognition built on top of (ideally) OpenPose (Yick)
- Send Email to Jonathan after project charter submission (Tsz Kiu)
- GRA; due date (Yick)
- Assignment 02 (Yick)
## Tasks for the future (potentially including in WBS)
- Come up with a class of gestures from Auslan to focus on (Time variant or Time invariant)
- Liaise with the deaf community
## Other concerns (if anything)
- Drafting email to Jonathan describing our project change
# Capstone Meeting #9 -- 31/4/2020
## Agenda
* Check GRA before submitting; [Done]
* Each person update on previous week's work (briefly!)
* Add/Update our WBS for AusLan
* Go in depth on insights gained from sign language recognition research
* Discuss our engagement with the AusLan community? (Not a priority for now?)
### Research Updates
#### Yong Yick
* Researched LSTMs (long short-term memory)
* An LSTM is a specific kind of RNN designed to remember long-range dependencies in a sequence
* Introductory explanation: [Explanation](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* Open source implementations exist and can be built on top of pose estimation (see the sketch after this list)
* Will compile a list of papers so that we can refer to them.
* Look at other research topics that may be similar.
* Managed to narrow down to a few research papers that have sufficiently large dataset for AusLan.
* Need for annotated datasets.
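
As a reference point for later experiments, a minimal Keras LSTM classifier over keypoint sequences. The shapes (30 frames, 75 values per frame, 26 classes) are placeholder assumptions, not decided values.

```python
# Minimal sketch (not a tuned model): an LSTM classifier over OpenPose keypoint
# sequences. T frames per clip, F keypoint values per frame, NUM_SIGNS classes
# are all assumptions for illustration.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

T, F, NUM_SIGNS = 30, 75, 26   # e.g. 30 frames, 25 keypoints x (x, y, conf), 26 letters

model = Sequential([
    LSTM(64, input_shape=(T, F)),           # summarise the keypoint sequence
    Dense(NUM_SIGNS, activation="softmax")  # probability per sign class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then look like: model.fit(X_train, y_train, epochs=..., validation_split=0.2)
```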
#### Tsz Kiu
* [Possible Dataset](https://github.com/Signbank/Auslan-signbank/tree/master/signbank)
* Learnt the Auslan alphabet.
* Learnt that fingerspelling is quite dynamic in movements as well.
* Looking at Auslan translation based on computer vision.
* Steps that they took:
* (Segmentation) - only face; only hand; or both?;
* (Feature Extraction) -
* Features used to train the model
* Further processing such as joint angles, etc. (a small example is sketched after this list)
* (Gesture Recognition) -
* Inputting keypoints + features extracted into our model to predict sign + word.
* Segmentation to get the necessary body parts.
* We need to narrow down a list of features that we want to extract that would be useful for our case.
* Gesture Recognition research --> Hidden Markov Model.
* [Implementing Hidden Markov Model](http://htk.eng.cam.ac.uk/)
* HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis.
* Problem with segmentation/feature extraction without Pose Estimation Keypoints:
* Only able to detect the locations of joints
* Can't work with dynamic signs and movements
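
As an illustration of the "angle of joints" feature idea above, a small sketch computing an elbow angle from BODY_25 keypoints. The keypoint indices and the choice of joint are assumptions for illustration, not a chosen feature set.

```python
# Minimal sketch of one possible extracted feature: the right-elbow angle
# computed from three BODY_25 keypoints in OpenPose's flat keypoint list.
import numpy as np

RSHOULDER, RELBOW, RWRIST = 2, 3, 4  # BODY_25 indices

def joint_angle(keypoints, a=RSHOULDER, b=RELBOW, c=RWRIST):
    """Angle (radians) at joint b formed by points a-b-c.
    keypoints is the flat pose_keypoints_2d list from OpenPose."""
    p = lambda i: np.array(keypoints[3 * i: 3 * i + 2])  # (x, y) of keypoint i
    v1, v2 = p(a) - p(b), p(c) - p(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```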
### Phases in our Project
#### Phase 1 -- Static Gestures
* Static Alphabets / Numbers
* Sign Language for one word and static gesture (Use Australian Wide version)
* Set a dictionary with a fixed number of words/letters
* Collecting training datasets;
* Annotated images or videos
* Possibly re-process raw videos before feeding into models
* Possibly look into asking a UniMelb department for sign language datasets.
* Look into implementing with different models:
* CNN
* LSTM
* Can look into image classification algorithms
* etc...
* Collecting test datasets and testing.
#### Phase 2 -- Moving Gestures
* Temporal Alphabets (J and H)
* Detecting words with temporal gestures
* Workflow similar to phase 1
#### Phase 3 -- Series of Gestures
* Detecting a sequence of alphabets to form a word
* Detecting sequence of words
* (POSSIBLE WAY): Taking a series of words and using another AI/BOT to rearrange words in a grammatically correct way
* Workflow similar to phase 1
### Concerns
* Gesture recognition looks quite challenging given our time constraints.
* Grammar can be an issue when we focus only on fingerspelling.
### Moving forward
#### Administrative / Assignments
* [Yick] Submit our GRA by 1st May
* [Yick] Project brief for Auslan (not urgent) - LaTeX;
* [Yick] template for final project;
* [Matthew] Working with IP Camera Stream
* Assignment 2?
#### Capstone
* [Yick] set up and run different models (primarily LSTM);
* (team) let's set up and run different models (in terms of generalization, accuracy, ease of use);
* (Team) Implicitly, continue researching the techniques/rabbit holes (also as self-development)
* (Team) let's find out different models (not urgent);
* Liaise with AusLan communities to get datasets
(in a few weeks)
* [Matthew] - Unimelb subject/departments
* [yick] - auslan, other universities;
* Start collecting possible datasets from sources.
(in two weeks)
* Using web scraping libraries
* Manually downloading for personal testing;
#### Good-to-do
* Approach lecturers (Jingge, Erik, Jonathan, Iman Shames);
# Capstone Meeting #10 -- 7/5/2020
## Agenda
* Updates from last week's todos
* Reviewing on Phase Plans from last meeting:
* [Reference from last meeting](https://github.com/relientm96/capstone2020/blob/master/meeting-logs/GroupLogs/log-09.md)
* In-Depth Phase 1 Discussion
* To-Do's for next week
## Last Week ToDo's
* [Yick] Set up and run different models using LSTM
* [Team] Set up and run different models (in terms of generalization, accuracy and ease of use)
* [Team] Team research
* [Team] Finding out different models to use (not urgent)
* Liaise with AusLan communities
* Matthew --> Unimelb
* Yick --> Auslan / Other Unis
* Start Collecting Datasets for Phase 1
* Web scraping using libraries
* Manually downloading for personal tests
* Approach lecturers or experts for help (if needed)
## Updates from Last Week
* Leave it to Yick to liaise with the communities.
* Yick has a list of computer vision papers
## Continuing
* [Team] Run other frameworks - Colab/Keras - to implement the models;
## Phase 1 Discussion
* Defining which words to focus on
* Let's start with the alphabet and fingerspelling
* How do we collect above data?
* [Auslan Corpus](https://elar.soas.ac.uk/Collection/MPI55247)
* [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Australian+Sign+Language+signs)
* Model (Big rabbit holes)
* HMM
* Dynamic Time Warping
* CNN
* RNN + CNN
* LSTM
* attention based encoder-decoder (transformer)
* Small rabbit hole;
* Transfer learning,
* Implementing above models and testing on our datasets.
* How do we generate more data from current data?
* Adding noise to our current OpenPose datasets (see the sketch below)
* Reminder - not necessarily sign language; could be gesture/action
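
One way the noise-augmentation idea could look in practice. The noise scale and number of copies are arbitrary assumptions for illustration.

```python
# Minimal sketch of the augmentation idea above: jitter OpenPose (x, y) keypoint
# coordinates with small Gaussian noise to generate extra training samples.
import numpy as np

def augment(keypoint_seq, copies=5, sigma=2.0, rng=np.random.default_rng(0)):
    """keypoint_seq: float array of shape (frames, keypoints, 3) holding x, y, confidence."""
    out = []
    for _ in range(copies):
        noisy = keypoint_seq.copy()
        noisy[..., :2] += rng.normal(0.0, sigma, size=noisy[..., :2].shape)  # jitter x, y only
        out.append(noisy)
    return out
```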
## ToDo for Next Few Weeks
* [Tsz Kiu] Setting up data uploading protocol (Saturday)
* [team] Scrape whatever you could (alphabets + fingerspelling)
* you could record yourself!
* [rabbit hole]
* Annotating data
* [others]
* [Matt] Figure out implementation side of system for models
* [Yick] Liaising with Auslan communities (by Wednesday, 13 May 2020)
# Capstone meeting #11
## Agenda
- Last Week TODO's
- Phase 1 update
- Assignment 2
## Last Week TODO's
* [Tsz Kiu] Setting up data uploading protocol (Saturday)
* [team] Scrape whatever you could (alphabets + fingerspelling)
* you could record yourself!
* [rabbit hole]
* Annotating data
* [others]
* [Matt] Figure out implementation side of system for models
* [Yick] Liaising with Auslan communities (by Wednesday, 13 May 2020)
## Phase 1 update
### Matthew
* Successfully integrated OpenPose with a Python Wrapper!
* Created a flask based web application to expose a user interface for our program!
* Helping Tsz Kiu set up the preprocessing for training using the Python OpenPose wrapper
* Struggling to make the web application available to us because of a stubborn HTTPS/SSL issue.
### Tsz Kiu
* Finished creating a protocol using Google Drive
* Copy-paste data from Google Drive to the repository
* Try to look into automating the process to make things easier
* Set up the preprocessing pipeline
### Yick
* Found a possible repository
* https://github.com/HealthHackAu2018/auslan-party
* Look into using datasets from other sign languages
* Chinese Sign Language
* Digging into the process idea and theoretical side
## To-do, Moving Forward
* [Tsz Kiu] Setting up a protocol/infrastructure to train dataset
* [Tsz Kiu] (if have time) automate downloading from Google drive
* [Tsz Kiu] (if have time) set up the gesture recognition part (with any model)
* [Matt] Creating the web application to be used on our local laptops
* [Matt] Creating OpenPose module for modularity
* [Matt] Learning Keras to better understand
* [Yick] Normalization (one possible approach is sketched below);
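
One possible interpretation of "normalize" (an assumption, not a decided approach): centre each frame's keypoints on the neck and scale by shoulder width, so absolute position and body size don't dominate the model input.

```python
# Minimal sketch of keypoint normalization: neck-centred, shoulder-width-scaled.
# Indices follow OpenPose BODY_25; the scaling choice is an assumption.
import numpy as np

NECK, RSHOULDER, LSHOULDER = 1, 2, 5  # BODY_25 indices

def normalize(frame):
    """frame: array of shape (keypoints, 3) with x, y, confidence per keypoint."""
    xy = frame[:, :2]
    neck = xy[NECK]
    scale = np.linalg.norm(xy[RSHOULDER] - xy[LSHOULDER]) + 1e-8  # shoulder width
    out = frame.copy()
    out[:, :2] = (xy - neck) / scale
    return out
```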
## Free time
* [Yick] Image processing - filtering; increase the training data variability before/after;
* [Yick] Optimize OpenPose (lightweight)
* [Yick] Statistics;
# Capstone Update # 12 (Online)
For each person, talk about:
* What you did last week
* eg Research? Data Collection? Etc...
### Yick
* nothing much ...
* failed to charm the Auslan communities into sharing their datasets;
* underestimated the complexity of the Auslan Signbank webpage structure; working on the Python Scrapy API as a workaround;
* https://en.wikipedia.org/wiki/Scrapy; have set it up in the VM, at least in my local workspace, not sure whether it affects things globally;
* intended to use some image enhancement to improve the input feed to OpenPose
* but overestimated my ability to code real-time image processing: Gaussian filter and median filter (a simple sketch is below);
* crumbled by other commitments; :cry:
* let the team down :disappointed: (NOT TRUE!! haha you are not letting the team down)
* my own reference:
* (different python models experimented) https://github.com/jayshah19949596/DeepSign-A-Deep-Learning-Architecture-for-Sign-Language-Recognition
* https://github.com/frankibem/CS_6001
* (transfer learning) https://github.com/devbihari/Sign-Language-Translation
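
A minimal sketch of the image-enhancement idea mentioned above: applying a Gaussian and a median filter with OpenCV before feeding a frame to OpenPose. Kernel sizes and the sample filename are arbitrary assumptions.

```python
# Minimal sketch: Gaussian + median filtering of a frame before pose estimation.
import cv2

def enhance(frame):
    frame = cv2.GaussianBlur(frame, (5, 5), 0)  # smooth Gaussian noise
    frame = cv2.medianBlur(frame, 5)            # knock out salt-and-pepper noise
    return frame

if __name__ == "__main__":
    img = cv2.imread("sample_frame.jpg")        # placeholder input image
    if img is not None:
        cv2.imwrite("sample_frame_enhanced.jpg", enhance(img))
```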
### Tsz Kiu
### Matthew
#### Virtual Machine Update
* Lucas has now "fixed" (temporarily) our camera issue by increasing resolution to 640x480
* You can now try to run the application and see camera change effect
* Frames still drop out (possibly due to VPN connection issues)
* So if the VM connection is too poor, I can try to add an option where you record your sign and we return the translation for each frame (no longer real time)
#### Gesture Recognition Development
* Read tutorials and articles on LSTM and RNN.
* Trying out models on [Kaggle](https://www.kaggle.com/), an easy-to-learn machine learning environment with notebooks and easy-to-load datasets
## Assignment 2
Assign yourself to the Assignment 2 topic you would like to do:
* Ethics -->
* Project Management --> Matthew?
* Sustainability --> yick
* Financial Analysis -->
# Research on HMM and LSTMs
Largely based on:
https://pdfs.semanticscholar.org/1801/67646e8a6c910837b4df26ae8f325cdabb63.pdf
## HMM
* Based on HMM techniques in NLP.
* Because:
* Gestures vary in time, location and social factors
* Body movements, like speech sounds, carry certain meanings
* Regularities in gesture performance while speaking are similar to syntactic rules
* Can use linguistic models in gesture recognition
### Comparison Between HMM and Neural Networks
Key difference: What is hidden and what is observed
The thing that is hidden in a hidden Markov model is the same as the thing that is hidden in a discrete mixture model, so for clarity, forget about the hidden state's dynamics and stick with a finite mixture model as an example. The 'state' in this model is the identity of the component that caused each observation. In this class of model such causes are never observed, so 'hidden cause' is translated statistically into the claim that the observed data have marginal dependencies which are removed when the source component is known. And the source components are estimated to be whatever makes this statistical relationship true.
The thing that is hidden in a feedforward multilayer neural network with sigmoid middle units is the states of those units, not the outputs which are the target of inference. When the output of the network is a classification, i.e., a probability distribution over possible output categories, these hidden units values define a space within which categories are separable. The trick in learning such a model is to make a hidden space (by adjusting the mapping out of the input units) within which the problem is linear. Consequently, non-linear decision boundaries are possible from the system as a whole.
### Description
* Given a sequence of observations {Y1,...,Yt}, we can infer the most likely sequence of hidden states {X1,...,Xn}.
* We can formulate {Y1,...,Yt} as the observed keypoints from OpenPose.
* We can say that the hidden states {X1,...,Xn} are the gestures that we want to map to.
* As such, the HMM algorithm can be used to find the most likely classified state (gesture) based on a sequence of observations (OpenPose keypoints) in a temporal sense.
* Key point: we use a multi-dimensional HMM to represent the defined gestures (as seen in the diagram); a compact formulation is given below.
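
In symbols (a standard HMM formulation consistent with the description above, where A and B are the transition and output matrices from step 2 of the approach below): each gesture g gets its own model, and classification picks the gesture whose model best explains the observed keypoint sequence.

```latex
\lambda_g = (A,\; B,\; \pi), \qquad
\hat{g} = \arg\max_{g}\; P\!\left(O \mid \lambda_g\right), \qquad
O = (Y_1, \dots, Y_T)
```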
### The HMM Approach

1. Define meaningful gestures -- Meaningful gestures must be specified, for example a certain list of vocabulary to use
2. Describe each gesture in terms of an HMM
* Multi Dimensional HMM is employed to each gesture (seen from figure above)
* A gesture is described by a set of N distinct hidden states and r dimensional M distinct observable states.
* HMM Characterized by a transition matrix A and r discrete output distributions matrices Bij, i = 1..r
3. Collect training data
* For us, we have openpose data keypoints
4. Train the HMM on the training data
* Model parameters are adjusted such that they maximize the likelihood P(O|lambda) for the given training data
* There is no analytical solution, but the Baum-Welch algorithm can be applied to iteratively re-estimate model parameters and reach a local maximum
5. Evaluate gestures with the trained model (see the sketch after this list)
* The trained model can be used to classify incoming gestures
* Use the forward-backward algorithm or the Viterbi algorithm to classify isolated gestures.
* Can use Viterbi to decode continuous gestures.
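
A minimal sketch of steps 2-5 using the `hmmlearn` library. This is an assumption about tooling: it uses Gaussian emissions rather than the discrete output matrices B described above, and the state count is arbitrary. One HMM is trained per gesture; classification picks the model with the highest log-likelihood.

```python
# Minimal sketch: one HMM per gesture over keypoint feature sequences, using
# hmmlearn (Gaussian emissions). Not the paper's exact discrete formulation.
import numpy as np
from hmmlearn import hmm

def train_gesture_hmm(sequences, n_states=4):
    """sequences: list of arrays, each (frames, features), for one gesture's examples."""
    X = np.concatenate(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)          # Baum-Welch re-estimation of transition/emission params
    return model

def classify(observation, models):
    """models: dict gesture_name -> trained HMM; observation: (frames, features) array."""
    return max(models, key=lambda g: models[g].score(observation))  # highest log-likelihood
```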
### Links
* [Real Time ASL Recognition Paper using HMM](https://www.cc.gatech.edu/~thad/p/031_10_SL/real-time-asl-recognition-from%20video-using-hmm-ISCV95.pdf)
* [Hidden Markov Model for Gesture Recognition](https://pdfs.semanticscholar.org/1801/67646e8a6c910837b4df26ae8f325cdabb63.pdf)
# Capstone Update #13 - 5/6/2020
## Agenda (getting back on track!)
* Briefly describe what you did in the last two weeks
* yick, tsz kiu
* almost nothing
* read up on the research papers
* matthew
* reading;
* ==> HMM;
* Dataset Collection
* [yick]
* Scraping online for videos
* Recording yourself (if you want)
* Discuss dataset pre-processing
* research more;
* Put in a confidence threshold to keep only good images/videos (a simple filter is sketched after the links below).
* Discuss modelling;
* just set any model as long as it works!
* [Good paper on HMM in Gesture Recognition](https://pdfs.semanticscholar.org/1801/67646e8a6c910837b4df26ae8f325cdabb63.pdf)
* [Using OpenPose for Finger Spelling](https://dl.acm.org/doi/pdf/10.1145/3373477.3373491?download=true)
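
A minimal sketch of the confidence-threshold idea from the pre-processing discussion above: keep an OpenPose frame only if enough keypoints were detected with sufficient confidence. The threshold and the "fraction of keypoints" rule are assumptions, not agreed values.

```python
# Minimal sketch: accept/reject an OpenPose JSON frame based on keypoint confidence.
import json

def frame_ok(json_path, min_conf=0.4, min_fraction=0.8):
    with open(json_path) as f:
        data = json.load(f)
    if not data["people"]:
        return False                                           # nobody detected
    conf = data["people"][0]["pose_keypoints_2d"][2::3]        # every third value is a confidence
    good = sum(c >= min_conf for c in conf)
    return good / len(conf) >= min_fraction
```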
# Capstone 2020
[Trello Board Link](https://trello.com/b/NcY5ovgD/agile-sprint-board)
[Kaggle](https://www.kaggle.com/)
## Project Management Stuff
* Moving to Trello
* Working in Sprints
* Where we have Plan, Build, Test, Deploy
## Check in from last checkpoint
### What do we have so far?
* Video streaming to VM
* Keras Preprocessing, turning images to keypoints in JSON.
### What do we need?
* Figuring out which model(s) to test/implement
* Datasets (both videos and openpose keypoints)
### Due Dates
* Layman Presentation
* End oral presentation assessment
* Final Report
## Data Collection
* Video Or Images: Both
* Classes of Signs:
* Alphabets;
* Numbers;
* Matthew
* higher complexity;
* Jobs to be Done
* Yick, Matthew, Tsz Kiu: Collect sources and links.
* Yick: Working on media collecting module.
* Matthew: Set up a markdown file for url links.
* About 20 images per sign (a simple recording helper is sketched below)
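
A possible self-recording helper for the "about 20 images per sign" target. The folder layout, key bindings, and function name are arbitrary choices, not an agreed module.

```python
# Minimal sketch: press the space bar to save webcam frames into data/<sign>/.
import cv2
from pathlib import Path

def collect(sign, target=20, out_dir="data"):
    folder = Path(out_dir) / sign
    folder.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(0)                      # default webcam
    saved = 0
    while saved < target:
        ok, frame = cap.read()
        if not ok:
            continue
        cv2.imshow("collector (space = save, q = quit)", frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord(" "):                        # space bar: keep this frame
            cv2.imwrite(str(folder / f"{sign}_{saved:02d}.jpg"), frame)
            saved += 1
        elif key == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

collect("a")   # example: gather ~20 images for the letter 'a'
```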
## Moving forward
* Tsz Kiu explore more on models to use, play around with a working model.
* https://www.kaggle.com/
# Capstone Meeting -- 10/07/2020
[Trello Link](https://trello.com/b/NcY5ovgD/capstone2020-board)
## Agenda
* Go over Trello cards
* Evaluate progress and discuss steps moving forward
* Re-create sprint cards for next week
* Re-assign sprint card tasks
## Evaluating Steps Moving Forward
* Making working model
### Which Model to use?
* CNN
* Multi-Model
* LSTM
* Transformer
## Sprint Cards for Next Week
* See the Trello board;