# Machine learning

## Goals

* First: linear model, focus on the data pipeline
* Second: augmentation once the learning curves look good
* Third: further analysis to create an ensemble

## Data exploration

* The shape of the landmark sequence is (12, 125, 3): 12 frames with 125 keypoint locations, each with x, y, z coordinates.
    * **Orientation of xyz?**
    * If we look at the `visualise_pose()` method in the utils file, we can see that only the x and y coordinates are used. From this we can deduce that x and y correspond to width and height (the 2D plane, since that is where the visualization happens), and z must be depth?
    * Y = downward vertical, X = right horizontal, Z = ?
* Keypoints:
    * Body: indices 0-22
    * Face: indices 23-82
    * Left hand: indices 83-103
    * Right hand: indices 104-124
* Sometimes tracking fails and the coordinates are set to 0. An example is visible in the figure below (sample 1, keypoint index 104, 12 frames):
    * ![](https://i.imgur.com/qtR5hDK.png)
* Untracked locations are set to 0 as well, see the example below (samples 1 and 11):
    * ![](https://i.imgur.com/sSyqjry.png)
* Over all sequences, we counted how often a keypoint location was (0, 0, 0). Some body keypoints could not always be tracked. For the face and the hands it is all or nothing: either all keypoints are found or none are. See the image below for the counts:
    * ![](https://i.imgur.com/qkt2y8s.png)
* The number of frames per sequence differs as well: 2 samples have only 1 frame, and 1 sample has as many as 117 frames! A histogram of the distribution of the number of frames is shown below:
    * ![](https://i.imgur.com/gDhZXbC.png)
* Nice way to view the flow of the movement in one plot. Code and example below (sample 7, ear of hare):
    * ![](https://i.imgur.com/70y8AR7.png)
    * ![](https://i.imgur.com/fZGahD3.png)

https://bansal-pranav.medium.com/indian-sign-language-recognition-using-googles-mediapipe-framework-3425ddce6748
This author uses distances between keypoints as features.

https://www.youtube.com/watch?v=We1uB79Ci-w
This author uses the raw keypoint coordinates as features (uses MediaPipe, so the keypoints are the same as ours).

# Wim:

Plots:

![](https://i.imgur.com/A3LXHjh.png)
![](https://i.imgur.com/AK2J6qV.png)
![](https://i.imgur.com/BjumbXX.png)
![](https://i.imgur.com/IHnmuGc.png)
![](https://i.imgur.com/5ztsnNn.png)
![](https://i.imgur.com/gUqIQsa.png)

Proposal:
- First use logistic regression on features averaged over the frames (with StandardScaler); a minimal sketch of this pipeline follows the list of ideas below
- Plot learning curves, validation curves and the confusion matrix (make helper functions in a Python script)
- Fix the folds + seeds
- Use this as a baseline and try to improve on it -> see what goes wrong and what can be improved (the data pipeline is the most important part for us at the moment)

Things that might be useful:
- Frame interpolation (in case the confusion matrix shows that the model performs badly on classes with few frames; with a skewed frame distribution the model might overfit, and interpolation also means way more computation)
- Centering around a single keypoint (e.g. always center around the neck of the person)
- Deleting keypoints (there are many keypoints for the mouth region -> however, the mouth region is less useful; determine which keypoints are most useful (look at which keypoints are used to calculate the features below))
- Calculating new features based on physics: acceleration, velocity, trajectory, etc.
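A minimal sketch of the proposed baseline (frame-averaged keypoints -> StandardScaler -> SelectKBest -> logistic regression, with the 5 folds and seed 420 from the task list). The function names, the `k_best` value and the flattening choice are our own assumptions, not fixed decisions:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def frame_average(sequence):
    """Average a (n_frames, 125, 3) keypoint sequence over frames -> flat (375,) vector."""
    return np.asarray(sequence).mean(axis=0).ravel()

def make_baseline(k_best=100, seed=420):
    """StandardScaler -> SelectKBest -> logistic regression (k_best is a placeholder)."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=k_best)),
        ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
    ])

def evaluate_baseline(sequences, labels, n_folds=5, seed=420):
    """Cross-validated accuracy with fixed folds and a fixed seed, as in the proposal."""
    X = np.stack([frame_average(s) for s in sequences])
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(make_baseline(seed=seed), X, labels, cv=cv)
    return scores.mean(), scores.std()
```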
Because individual keypoints are not very informative on their own, a feature extractor could compute higher-level features such as:
- For the hands:
    - Length of the individual fingers + angles
    - Area of the hands
- For the body:
    - Angles (elbow-neck angle, wrist-neck angle)
    - Radial velocity and radial acceleration
    - Angular velocity and acceleration (i.e. the speed at which the angles change)
- For the face (see this paper for the calculations: https://ieeexplore.ieee.org/abstract/document/4813472):
    - Face height
    - Face width
    - Eye-to-eye distance
    - Mouth area
    - Mouth width
    - Mouth height
- For all keypoints over all frames:
    - Mean xyz positions
    - Velocity of the xyz positions
    - Acceleration of the xyz positions

Paper from Joni about sign language classification: https://biblio.ugent.be/publication/8660743/file/8660744.pdf

Another paper from Joni about sign language: https://users.ugent.be/~mcdcoste/assets/SLR_DeCoster2021Isolated.pdf

Notes from these papers ("we" refers to the paper authors, who use OpenPose; our keypoints were extracted with MediaPipe -> good search term for papers):

We use OpenPose as a fixed feature extractor. For every frame, OpenPose extracts 137 keypoints. 25 keypoints represent the body pose, 70 are facial keypoints, and there are 21 keypoints per hand representing the hand pose. The facial keypoints from the body model are also removed, because that information is present in the keypoints of the face model. Every keypoint is a triplet (x, y, c), where x and y are rational numbers representing the 2D coordinates of the keypoint, and c is the confidence of OpenPose in the correctness of this keypoint. We use entire triplets as input features. The lower body is not in frame and is not relevant for sign language, so we decide to drop those keypoints. As spatial pre-processing, we rotate the pose such that the shoulders are horizontal to account for seating position, and we standardize the body pose such that the length of the neck is 1. For data augmentation, we perform the following transformations. First, we introduce Gaussian noise on the keypoints, i.e., translating every keypoint by (x, y), where x and y are sampled from N(0, 0.005). Secondly, we randomly rotate both hands separately up to 20 degrees using the wrist keypoints as pivot points.

- If we look at the clips, we notice that the last few frames are often the most important (e.g. for the sign for "2")
- The number of frames differs per sequence
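A minimal sketch of those two augmentations (Gaussian keypoint noise and rotating each hand around its wrist) adapted to our (n_frames, 125, 3) MediaPipe layout. The index ranges come from the data exploration above; the assumption that the wrist is the first keypoint of each hand block, the restriction to x/y, and all function names are ours:

```python
import numpy as np

# Keypoint index ranges from the data exploration above.
LEFT_HAND = slice(83, 104)
RIGHT_HAND = slice(104, 125)

def add_gaussian_noise(seq, rng, sigma=0.005):
    """Translate the x/y coordinates of every keypoint by noise drawn from N(0, sigma)."""
    noisy = seq.copy()
    noisy[..., :2] += rng.normal(0.0, sigma, size=noisy[..., :2].shape)
    return noisy

def rotate_hand(seq, hand, rng, max_deg=20.0):
    """Rotate one hand around its wrist by a random angle of at most max_deg degrees.

    Assumes the wrist is the first keypoint of the hand block (hand.start).
    """
    angle = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    out = seq.copy()
    pivot = out[:, hand.start, :2][:, None, :]   # wrist position per frame
    xy = out[:, hand, :2] - pivot
    out[:, hand, :2] = xy @ rot.T + pivot
    return out

def augment(seq, rng=None):
    """Apply both augmentations to one (n_frames, 125, 3) sequence."""
    rng = rng if rng is not None else np.random.default_rng()
    seq = add_gaussian_noise(seq, rng)
    seq = rotate_hand(seq, LEFT_HAND, rng)
    seq = rotate_hand(seq, RIGHT_HAND, rng)
    return seq
```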
## Survey papers:

- https://link.springer.com/article/10.1007/s12652-020-02396-y
    - Skip the parts about neural nets and feature extraction, as we already have the features
    - The part about isolated sign recognition is highly relevant
    - Most relevant techniques:
        - K-nearest neighbours
        - Support vector machine (SVM)
            - SVM is good for static gestures
        - Relevance vector machine (RVM)
            - RVM is good for complex motions
        - K-means clustering
        - Dynamic time warping (look into this; a minimal sketch is included after the link lists below)
            - Used to compute an effective distance between temporal sequences
        - Self-organizing map
    - Overview table: ![](https://i.imgur.com/L0DUSnL.png)
    - "Isolated sign recognition" is a relevant search term
    - The part about signers is also relevant
    - Subunit sign recognition
        - Subunits are formed by breaking whole signs into smaller sub-units
- https://link.springer.com/content/pdf/10.1007/s13042-017-0705-5.pdf

## Machine learning:

- https://www.sciencedirect.com/science/article/pii/S0045790621003529?casa_token=WgD6pYTYfcEAAAAA:jEC4M4Tc71N6vLcTFKny2M_WyCqqVu3_TM9N50ykcNK0TsQXo3HBhoYKBY1yuaqHISyb_Kbh
- https://www.researchgate.net/profile/Abhiruchi-Bhattacharya/publication/339336794_Classification_of_Sign_Language_Gestures_using_Machine_Learning/links/5e4d631392851c7f7f45ff1b/Classification-of-Sign-Language-Gestures-using-Machine-Learning.pdf
- https://www.cv-foundation.org/openaccess/content_cvpr_workshops_2015/W15/html/Dong_American_Sign_Language_2015_CVPR_paper.html
- https://www.jmlr.org/papers/volume13/cooper12a/cooper12a.pdf
- https://link.springer.com/chapter/10.1007/978-0-85729-997-0_27
- https://www.tandfonline.com/doi/abs/10.1080/02564602.2014.961576
- https://www.researchgate.net/profile/Archana-Ghotkar-2/publication/272509785_Study_of_vision_based_hand_gesture_recognition_using_indian_sign_language/links/54e830010cf2f7aa4d4f7ce2/Study-of-vision-based-hand-gesture-recognition-using-indian-sign-language.pdf

## Feature extraction:

- https://link.springer.com/article/10.1007/s10916-017-0819-z
- https://ieeexplore.ieee.org/abstract/document/4200816/?casa_token=vZ__FQk1ag0AAAAA:pWPT7wajeIu9u5U8z_o8XNtFHwnM6eSd2TzvALiQUDhObMGdukd-IgrPymaqcSd2s8urofCP

## Feature transform:

- http://www.ijetch.org/papers/427-C074.pdf (wavelet transform) (meh)
- https://www.scirp.org/journal/paperinformation.aspx?paperid=67616 (wavelet transform)
- https://link.springer.com/content/pdf/10.1007/s40031-016-0250-8.pdf (scale-invariant feature transform)
- www.wseas.us/e-library/conferences/2015/Tenerife/MATH/MATH-35.pdf (scale-invariant feature transform + intensity)
- https://www.researchgate.net/profile/Abdolah-Chalechale/publication/272853897_Persian_Sign_Language_Recognition_Using_Radial_Distance_and_Fourier_Transform/links/5d10e32a299bf1547c7a327c/Persian-Sign-Language-Recognition-Using-Radial-Distance-and-Fourier-Transform.pdf (radial distance + Fourier transform)
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.462.4078&rep=rep1&type=pdf (wavelet transform + PCA)

Data augmentation:
- https://www.researchgate.net/publication/343462680_Data_Augmentation_for_Human_Keypoint_Estimation_Deep_Learning_Based_Sign_Language_Translation
    - Random keypoint removal
    - Finger length conversion

Popular transforms:
- wavelet
- scale-invariant feature transform (SIFT)
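The dynamic time warping mentioned in the survey notes computes a distance between two sequences of different lengths by finding the cheapest frame-to-frame alignment. A minimal sketch, assuming each frame is first flattened to a feature vector and using plain Euclidean distance per frame (both of these choices are ours, not taken from the survey):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of frame vectors.

    seq_a: (n_frames_a, d) array, seq_b: (n_frames_b, d) array.
    Returns the accumulated cost of the cheapest alignment of the two sequences.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # per-frame Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],       # frame j of seq_b matched again
                                 cost[i, j - 1],       # frame i of seq_a matched again
                                 cost[i - 1, j - 1])   # move on in both sequences
    return cost[n, m]

# Example: sequences with different numbers of frames, keypoints flattened per frame.
a = np.random.rand(12, 125 * 3)
b = np.random.rand(30, 125 * 3)
print(dtw_distance(a, b))  # usable e.g. as the metric in a k-nearest-neighbour classifier
```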
# Kevin:

- Are the labels distributed evenly or do we need to balance the training set?
    - ![](https://i.imgur.com/F2X7v3f.png)
    - I think we need to balance this? Some labels are almost 6 times rarer than others ('GOED-A': 54 vs 'AUTO-RIJDEN-A': 315): `Counter({'C: 1': 66, 'AANKOMEN-A': 144, 'AUTO-RIJDEN-A': 315, 'C: 2': 111, 'ZELFDE-A': 200, 'WAT-A': 104, 'HAAS-oor': 187, 'SCHILDPAD-Bhanden': 132, 'NAAR-A': 148, 'c.OOK': 147, 'c.AF': 245, 'HEBBEN-A': 156, 'c.ZIEN': 126, 'GOED-A': 54, 'MOETEN-A': 56})`
- https://github.com/nicknochnack/RealTimeObjectDetection and https://www.youtube.com/watch?v=pDXdlXlaCco : a full tutorial on how to detect sign language in real time, so could be useful
- https://openaccess.thecvf.com/content_CVPR_2020/papers/Camgoz_Sign_Language_Transformers_Joint_End-to-End_Sign_Language_Recognition_and_Translation_CVPR_2020_paper.pdf (more about recognizing sign language and then transforming it into words)
- https://sign-language-processing.github.io/ (lots of examples about sign language and how to detect it in many languages)
- https://link.springer.com/article/10.1007/s00521-020-05279-7 (uses a linear SVM classifier, could be useful for phase 1)
- https://www.kaggle.com/mete6944/sign-language-with-logistic-regression (uses logistic regression, could be useful for phase 1)
- https://medium.com/analytics-vidhya/sign-language-classification-64fe8ad0fc2c (recognizing sign numbers using logistic regression, decision trees, random forests, ...; compares various algorithms)

# Tasks:

- Put the baseline in the main notebook (Vince) + fix the folds (5) + fix the seed (420)
    - Baseline: StandardScaler, SelectKBest, logistic regression
- Utils for plotting learning curves, validation curves, confusion matrix: Kevin? Sieben?
- Feature extraction (physics stuff) in a Python script
    - Face: Vince (width/height of the mouth, distance between the eyes)
    - Body: Wim
    - Hands (angles, finger lengths, ...): Sieben
    - Everything else (velocity, ...): Kevin
- Outliers?
- Scaling to 1 keypoint (the most central keypoint): Wim
- Keypoint removal (not useful) (not yet)
- Number of frames per sequence: Wim
- Which frames are more important? (the middle ones?)

# Vince -- First feature extraction (face):

- The basic idea is to use information from the face:
    - Width and height of the mouth
    - Area of the mouth
    - Distance between the eyes (use it to scale the mouth features?)
- Problem with the dataset: the face points are completely scattered, so we can't distinguish the eyebrows from the mouth by just splitting the features.
- Current solution: find the uppermost and lowermost y points and divide the points using the line between them. The grey line is the division, red is the uppermost point, blue the lowermost. Eyebrow points are green, mouth points are purple.
    - ![](https://i.imgur.com/faHrrhn.png)
- From this, find the upper, lower, leftmost and rightmost points of the mouth. The width and height of the mouth can be estimated by calculating the distances between these points; an estimate of the area can also be calculated this way (a sketch follows below this list).
    - ![](https://i.imgur.com/IjY8EFS.png)
- Results over an entire sequence of frames are shown below:
    - ![](https://i.imgur.com/JDpTAdA.png)
- This method will probably give bad results if A) the face is not tracked or B) the face is heavily rotated. Maybe use more advanced techniques for this.
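A minimal sketch of the mouth measurements above, assuming `mouth` is an (n, 2) array with the x/y coordinates of the points that fell below the dividing line (y increasing downwards, as noted in the data exploration). The function name and the quadrilateral used for the area estimate are our own choices:

```python
import numpy as np

def mouth_features(mouth):
    """Estimate mouth width, height and area from the mouth keypoints of one frame.

    mouth: (n, 2) array of x/y coordinates of the points classified as 'mouth'
    (i.e. below the dividing line between the uppermost and lowermost face points).
    """
    top = mouth[mouth[:, 1].argmin()]     # uppermost point (smallest y, since y points down)
    bottom = mouth[mouth[:, 1].argmax()]  # lowermost point
    left = mouth[mouth[:, 0].argmin()]    # leftmost point
    right = mouth[mouth[:, 0].argmax()]   # rightmost point

    width = np.linalg.norm(right - left)
    height = np.linalg.norm(bottom - top)

    # Rough area estimate: shoelace formula on the quadrilateral left -> top -> right -> bottom.
    quad = np.array([left, top, right, bottom])
    x, y = quad[:, 0], quad[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    return width, height, area
```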