# AI-Automated Training System for Figure Skating
> We are developing a training system for figure skating. The system is designed to provide coaching instructions to users who have no professional equipment or 3D camera. It includes 2D video analysis, human skeleton tracking, pose detection, and instruction generation.
> This document focuses on **temporal video alignment**. Code is available [here](https://github.com/CleaLin/Video_Align).
###### tags: `self-supervised` `embeddings` `cycle-consistency` `nearest neighbors` `computer vision`
## System Structure

## Get the Embedding Space Model
To compare the learner's video with the standard motion and find their differences, we have to temporally align the two videos and automatically obtain the timestamp where the motion starts. Because labelled data are scarce, we implement a self-supervised representation learning method based on [Temporal Cycle-Consistency (TCC) Learning](https://sites.google.com/view/temporal-cycle-consistency/home)[^TCC]. The method finds temporal correspondences between video pairs and aligns two similar videos based on the resulting per-frame embeddings.
[^TCC]: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal Cycle-Consistency Learning. In *2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.
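For intuition, the snippet below is a minimal NumPy sketch of the cycle-consistency idea: a frame of one video is mapped to its (soft) nearest neighbor in the other video's embedding space and then mapped back, and training encourages the cycle to land on the original frame. This is only a conceptual sketch with random embeddings, not the differentiable cycle-back loss used for training in this repository.

```python
import numpy as np

def soft_nearest_neighbor(u_i, v):
    """Soft nearest neighbor of frame embedding u_i within sequence v of shape (M, D)."""
    logits = -np.sum((v - u_i) ** 2, axis=1)      # similarity = negative squared distance
    weights = np.exp(logits - logits.max())       # numerically stable softmax over frames of v
    weights /= weights.sum()
    return weights @ v                            # weighted average frame of v

def cycle_back_index(u, v, i):
    """Map frame i of u to its soft nearest neighbor in v, then back to u.
    The cycle is consistent when the returned index equals i."""
    v_tilde = soft_nearest_neighbor(u[i], v)                       # u[i] -> v
    return int(np.argmin(np.sum((u - v_tilde) ** 2, axis=1)))      # v_tilde -> u

# Toy example with random "embeddings": two sequences of 20 and 25 frames, 128-D each.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(20, 128)), rng.normal(size=(25, 128))
print(cycle_back_index(u, v, 5) == 5)   # True would mean frame 5 cycles back consistently
```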
### Dataset Preparation
Trim the start and end of each video so that only the target motion remains. We recommend 50+ training videos for each motion.
\
To obtain per-frame human joint locations and confidence scores, we apply [AlphaPose](https://github.com/MVIG-SJTU/AlphaPose)[^AlphaPose], an accurate multi-person pose estimator, as the preprocessing step. The original `.mp4` videos are converted into frames and keypoint data. Check the AlphaPose output format [here](https://github.com/MVIG-SJTU/AlphaPose/blob/master/docs/output.md).
[^AlphaPose]: Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, Cewu Lu. RMPE: Regional Multi-Person Pose Estimation. In *2017 IEEE International Conference on Computer Vision (ICCV)*.
\
The resulting data for each video includes:
1. A `/vis` folder which contains all frames in the video.
2. A `*.json` file which contains COCO 17-keypoint information for every person in the video.

All preprocessed data should be placed in the `/data/[motion_name]/[train/val/test]` folder.
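As a reference for this layout, here is a minimal sketch of reading an AlphaPose result file and keeping the highest-score person in each frame. It assumes the default COCO-format JSON described in the AlphaPose output docs linked above; the motion name `axel` and the file location are illustrative.

```python
import json

# Illustrative path; the default AlphaPose result file is a list of detections,
# each with "image_id", "keypoints", and "score" fields.
with open("data/axel/train/alphapose-results.json") as f:
    detections = json.load(f)

# Keep only the highest-score person per frame (the "main person" used for cropping).
main_person = {}
for det in detections:
    frame = det["image_id"]
    if frame not in main_person or det["score"] > main_person[frame]["score"]:
        main_person[frame] = det

# "keypoints" is a flat list [x1, y1, c1, ..., x17, y17, c17] of the COCO 17 keypoints.
for frame, det in sorted(main_person.items()):
    xs, ys = det["keypoints"][0::3], det["keypoints"][1::3]
    print(frame, (min(xs), min(ys), max(xs), max(ys)))   # rough bounding box of the main person
```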
### Training
1. Copy the preprocessed data into the `/train` folder.
2. Tune hyperparameters in `tcc_config.py`, including training steps, batch size, loss type, etc.
3. Run `python3 tcc_train.py --dataset [motion_name] --mode train`.
4. The model checkpoint is saved in the `/log` folder.
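The actual variable names are defined in `tcc_config.py`; purely as an illustration, the hyperparameters mentioned in step 2 could look like the placeholder sketch below (only `NUM_STEPS` is referenced elsewhere in this note, the other names and values are made up).

```python
# Placeholder values only; consult tcc_config.py for the real variable names.
NUM_STEPS = 20                     # frames sampled per video; keep it below the shortest sequence length
BATCH_SIZE = 4                     # videos per training batch (placeholder name)
TRAIN_STEPS = 5000                 # total optimization steps (placeholder name)
LOSS_TYPE = "regression_mse_var"   # alignment loss variant (placeholder name and value)
```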
:::info
💡 **NOTE**
Before getting into the training process, always make sure that:
- The main person in the video must have the highest AlphaPose score, since we define the person with the highest score as the main person. When loading the training dataset, we crop all frames around the main person's location to reduce background effects.
- All training videos should contain only the correct motion. The TCC method assumes that the first (last) frame of every training video is the start (end) of the target motion, so alignment performance can drop if the training videos contain frames that are not part of the motion.
- `NUM_STEPS` in `tcc_config.py` should be less than, and close to, the minimum sequence length (`min(seq_lens)` in `tcc_data.py`); otherwise training performance can drop. A quick way to check this is shown in the sketch after this note.
:::
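The last point can be checked by counting the frames in every training video's `/vis` folder, as sketched below under the folder layout from [Dataset Preparation](#Dataset-Preparation); the motion name `axel` is illustrative.

```python
import os
from glob import glob

# Count frames in each training video's /vis folder and report the shortest sequence;
# NUM_STEPS in tcc_config.py should be less than (and close to) this minimum.
vis_dirs = glob("data/axel/train/*/vis")        # "axel" is an example motion name
seq_lens = [len(os.listdir(d)) for d in vis_dirs]
print("min sequence length:", min(seq_lens))
```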
## Get the Aligned Video
1. Do [data preprocessing](#Dataset-Preparation) for the learner's video and the standard motion video.
2. Copy the preprocessed data into the `/test` folder. By default, the first entry in the folder is treated as the learner's video and the second as the standard motion video.
3. Run `python3 tcc_get_start.py --dataset [motion_name] --mode test`.
4. The resulting aligned video is generated in the `/result` folder. An example video is shown below, where the video on the left is the learner's video and the one on the right is the standard motion video. A conceptual sketch of the start-detection idea follows the example.
\
{%youtube JZ5oUwk-N6U %}
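The actual logic lives in `tcc_get_start.py`. Conceptually, the start of the motion can be estimated by matching the first few frames of the standard video to their nearest neighbors in the learner's per-frame embeddings; the sketch below only illustrates that idea and is not the script's implementation.

```python
import numpy as np

def estimate_motion_start(learner_emb, standard_emb, n_ref_frames=5):
    """Estimate the frame index where the standard motion begins in the learner's
    video via nearest-neighbor matching in the shared embedding space."""
    matches = []
    for ref in standard_emb[:n_ref_frames]:                  # first frames of the standard motion
        dists = np.linalg.norm(learner_emb - ref, axis=1)
        matches.append(int(np.argmin(dists)))
    return min(matches)

# Toy example: learner_emb and standard_emb would normally be (num_frames, 128)
# arrays produced by the embedding model.
rng = np.random.default_rng(0)
learner_emb, standard_emb = rng.normal(size=(120, 128)), rng.normal(size=(60, 128))
print(estimate_motion_start(learner_emb, standard_emb))
```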
## Other Functions
You should save your [embedding space model](#Get-the-Embedding-Space-Model) in the `/log` folder in order to run the functions below.
### Extract Per-frame Embeddings
1. Run `python3 tcc_get_embed.py --dataset [motion_name] --mode [train/val/test]`
2. The resulting per-frame embeddings are saved to `/log/[motion_name]_embeddings.npy`.
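The exact layout of the `.npy` file depends on `tcc_get_embed.py`; below is a minimal sketch of loading it and inspecting the per-video arrays, where the motion name `axel` is illustrative.

```python
import numpy as np

# allow_pickle=True in case the file stores a Python list of per-video arrays
# rather than a single stacked array.
embeddings = np.load("log/axel_embeddings.npy", allow_pickle=True)

for i, emb in enumerate(embeddings):
    emb = np.asarray(emb)
    print(f"video {i}: {emb.shape[0]} frames x {emb.shape[1]}-dim embeddings")
```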
### Align Videos by DTW
After extracting per-frame embeddings, we align videos frame by frame using the [DTW method](https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S2_DTWbasic.html). Check the original function [here](https://github.com/pollen-robotics/dtw/blob/6c080af4ca0ff12c0eba1fd4eb678260bb0b4f9f/dtw/dtw.py).
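For intuition, here is a standalone NumPy sketch of classic DTW over two per-frame embedding sequences; the repository uses the linked `dtw` package, so treat this only as an illustration of the warping-path computation.

```python
import numpy as np

def dtw_path(x, y):
    """Classic DTW between sequences x (N, D) and y (M, D) using Euclidean frame distance.
    Returns the accumulated cost matrix and the optimal warping path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the frame-to-frame alignment.
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return cost[1:, 1:], path[::-1]

# Toy example with random embeddings of 30 and 40 frames.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(30, 128)), rng.normal(size=(40, 128))
_, path = dtw_path(emb_a, emb_b)
print(path[:5])   # pairs of (frame index in A, frame index in B)
```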
1. Run `python3 tcc_align.py --dataset [motion_name] --mode [train/val/test]`
2. By default, we randomly select 4 videos, pick one of them as the query video, and use the rest as candidate videos. You can change the number of videos on line 50 of `tcc_align.py`; the number should be no less than 4.
3. The resulting aligned video `/output_*.mp4` is generated in the `/result` folder.
4. Meanwhile, we reduce the 128-dimensional embeddings to 2D with t-SNE[^tSNE]; the resulting 2D image `/output_*.jpg` is generated in the `/result` folder. An example is shown below, where each color represents a video and the numbers represent frame indices (a minimal t-SNE sketch follows).

[^tSNE]: Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. *Journal of Machine Learning Research 9(Nov)*:2579-2605, 2008.
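A minimal sketch of such a reduction with scikit-learn's t-SNE is shown below; the actual plotting code in `tcc_align.py` may differ, and the embeddings here are random placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder per-video embedding arrays of shape (num_frames, 128).
rng = np.random.default_rng(0)
embs = [rng.normal(size=(40, 128)), rng.normal(size=(55, 128))]

# Reduce all frames jointly to 2D, then split the points back per video.
points = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(np.vstack(embs))

start = 0
for i, emb in enumerate(embs):
    pts = points[start:start + len(emb)]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=f"video {i}")
    for j in range(0, len(pts), 10):
        plt.annotate(str(j), pts[j])   # label every 10th frame index
    start += len(emb)
plt.legend()
plt.savefig("tsne_sketch.jpg")
```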
### Get Kendall's Tau
After getting per-frame embeddings, we can get Kendall's Tau for the videos. [Kendall's Tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) is a statistical measure that can determine how well-aligned two sequences are in time.
1. Run `python3 tcc_get_kendalls_tau.py --dataset [motion_name] --mode [train/val/test]`
2. The resulting value lies between -1 and 1. A value of 1 implies the videos are perfectly aligned, while a value of -1 implies they are aligned in reverse order.
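A rough sketch of how this value can be computed from two per-frame embedding sequences with `scipy.stats.kendalltau` is shown below; it follows the nearest-neighbor formulation but is not necessarily the exact implementation in `tcc_get_kendalls_tau.py`.

```python
import numpy as np
from scipy.stats import kendalltau

def kendalls_tau(emb_a, emb_b):
    """For every frame of video A, find its nearest neighbor in video B, then measure
    how well the matched indices preserve temporal order."""
    nn_idx = [int(np.argmin(np.linalg.norm(emb_b - f, axis=1))) for f in emb_a]
    tau, _ = kendalltau(np.arange(len(emb_a)), nn_idx)
    return tau

# Toy example with random embeddings; real inputs come from the embedding model.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(30, 128)), rng.normal(size=(45, 128))
print(kendalls_tau(emb_a, emb_b))   # 1 = perfectly aligned, -1 = aligned in reverse order
```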