
AI-Automated Training System for Figure Skating

We are developing a training system for figure skating. The system is designed to provide coaching instructions to users without professional equipment or a 3D camera. The system includes 2D video analysis, human skeleton tracking, pose detection, and instruction generation.
This document focuses on temporal video alignment. Code is available here.

tags: self-supervised embeddings cycle-consistency nearest neighbors computer vision

System Structure

(System structure diagram)

Get the Embedding Space Model

To compare the learner's video with the standard motion and find the differences between them, we must temporally align the two videos and automatically obtain the timestamp at which the motion starts. Because labelled data is scarce, we implement a self-supervised representation learning method based on Temporal Cycle-Consistency (TCC) Learning[1]. The method finds temporal correspondences between video pairs and aligns similar videos using the resulting per-frame embeddings.
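For intuition, here is a minimal sketch of the cycle-consistency idea using hard nearest neighbors over hypothetical per-frame embedding arrays. The actual TCC training replaces the hard nearest neighbor with a differentiable soft version so the consistency check can be turned into a loss; the function names here are illustrative.

```python
import numpy as np

def nearest_neighbor(query, candidates):
    # Index of the candidate embedding closest to the query (L2 distance).
    return int(np.argmin(np.linalg.norm(candidates - query, axis=1)))

def cycle_consistency_rate(emb_u, emb_v):
    # Fraction of frames in video U whose nearest neighbor in video V
    # points back to the same frame of U (cycle-consistent frames).
    # emb_u, emb_v: (num_frames, embedding_dim) per-frame embeddings.
    consistent = 0
    for i in range(len(emb_u)):
        j = nearest_neighbor(emb_u[i], emb_v)   # U -> V
        k = nearest_neighbor(emb_v[j], emb_u)   # V -> U
        consistent += int(k == i)
    return consistent / len(emb_u)
```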

Dataset Preparation

Trim the start and end of each video so that only the target motion remains. We recommend 50+ training videos for each motion.

To obtain per-frame human joint locations and confidence scores, we apply AlphaPose[2], an accurate multi-person pose estimator, as the preprocessing step. The original .mp4 videos are converted into frames and keypoint data. Check the AlphaPose output format here.


The resulting data for each video includes:

  1. A /vis folder which contains all frames in the video.
  2. A *.json file which contains COCO 17-keypoint information for every person in the video.

All preprocessed data should be placed in the /data/[motion_name]/[train/val/test] folder.
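To illustrate how the preprocessed keypoints can be consumed, the sketch below groups detections in the *.json file by frame and keeps the highest-score person in each frame. It assumes AlphaPose's default result format, where every detection carries image_id, keypoints (x, y, confidence for each of the 17 COCO joints), and score; adjust the field names if your output differs.

```python
import json
from collections import defaultdict

def load_main_person_keypoints(json_path):
    # Group AlphaPose detections by frame and keep the highest-score person.
    with open(json_path) as f:
        detections = json.load(f)

    per_frame = defaultdict(list)
    for det in detections:
        per_frame[det["image_id"]].append(det)

    main_person = {}
    for image_id, people in per_frame.items():
        best = max(people, key=lambda p: p["score"])
        # Reshape the flat keypoint list into (17, 3): x, y, confidence.
        kps = best["keypoints"]
        main_person[image_id] = [kps[i:i + 3] for i in range(0, len(kps), 3)]
    return main_person
```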

Training

  1. Copy the preprocessed data into /train folder.
  2. Tune hyperparameters in tcc_config.py, including training steps, batch size, loss type, etc. (see the sketch after this list).
  3. Run python3 tcc_train.py --dataset [motion_name] --mode train.
  4. Get the model checkpoint in /log folder.
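For step 2, the snippet below sketches what the tuned entries in tcc_config.py might look like. Apart from NUM_STEPS (see the note below), the names and values are hypothetical; match them to the actual file.

```python
# tcc_config.py -- illustrative values; apart from NUM_STEPS, names are hypothetical
NUM_STEPS = 20                 # frames sampled per video; keep it below the shortest sequence
BATCH_SIZE = 4                 # video pairs per training step
TRAIN_STEPS = 5000             # total optimization steps
EMBEDDING_SIZE = 128           # per-frame embedding dimension
LOSS_TYPE = "regression_mse"   # TCC cycle-back regression variant
LEARNING_RATE = 1e-4
```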

💡 NOTE
Before getting into the training process, always make sure that:

  • The main person in the video should get the highest AlphaPose score. We treat the person with the highest AlphaPose score as the main person; when loading the training dataset, we crop all frames around this person to reduce the influence of the background.
  • All training videos should contain only the correct motion. The TCC method assumes that the first (last) frame of every training video is the start (end) of the target motion, so alignment performance can drop if training videos contain frames outside the target motion.
  • NUM_STEPS in tcc_config.py should be less than, and close to, the minimum sequence length (min(seq_lens) in tcc_data.py); otherwise training performance will drop (a quick check is sketched after this list).
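As a quick sanity check for the NUM_STEPS constraint, the sketch below counts frames per video and reports the minimum. The directory layout is an assumption based on the preprocessing section (one subfolder per video containing /vis); adapt the paths to your setup.

```python
import os

def min_sequence_length(split_dir):
    # Smallest number of frames among all videos in a data split,
    # assuming data/[motion_name]/train/<video>/vis/ holds the frames.
    lengths = []
    for video in sorted(os.listdir(split_dir)):
        vis_dir = os.path.join(split_dir, video, "vis")
        if os.path.isdir(vis_dir):
            lengths.append(len(os.listdir(vis_dir)))
    return min(lengths)

# NUM_STEPS in tcc_config.py should stay below this value.
print(min_sequence_length("data/jump/train"))   # "jump" is a placeholder motion name
```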

Get the Aligned Video

  1. Do data preprocessing for the learner's video and the standard motion video.
  2. Copy the preprocessed data into the /test folder. By default, the first entry in the folder is treated as the learner's video and the second as the standard motion video.
  3. Run python3 tcc_get_start.py --dataset [motion_name] --mode test.
  4. The resulting aligned video is generated in the /result folder. An example is shown below, where the video on the left is the learner's video and the one on the right is the standard motion video (a conceptual sketch of the start-frame search follows the example).

    (Example aligned video)
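For intuition about the start-timestamp step, here is a hypothetical sketch that matches the first few frames of the standard video to their nearest neighbors in the learner's embedding sequence and takes the median as the start frame. The function and the probe length are illustrative only; the actual logic lives in tcc_get_start.py.

```python
import numpy as np

def find_motion_start(learner_emb, standard_emb, num_probe_frames=5):
    # learner_emb: (L, d), standard_emb: (S, d) per-frame embeddings.
    probe = standard_emb[:num_probe_frames]        # first frames of the standard motion
    dists = np.linalg.norm(learner_emb[:, None, :] - probe[None, :, :], axis=2)
    matches = dists.argmin(axis=0)                 # learner frame index for each probe frame
    return int(np.median(matches))                 # rough "start frame" estimate
```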

Other Functions

You should save your embedding space model in the /log folder in order to run the functions below.

Extract Per-frame Embeddings

  1. Run python3 tcc_get_embed.py --dataset [motion_name] --mode [train/val/test]
  2. The resulting per-frame embeddings are saved to /log/[motion_name]_embeddings.npy

Align Videos by DTW

After obtaining per-frame embeddings, we align videos frame by frame using dynamic time warping (DTW). Check the original function here.
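For reference, here is a minimal DTW sketch over two per-frame embedding sequences (the repo links its own DTW function above). It fills the classic accumulated-cost matrix; the warping path can be recovered by backtracking from the bottom-right cell.

```python
import numpy as np

def dtw_cost(emb_a, emb_b):
    # emb_a: (len_a, d), emb_b: (len_b, d) per-frame embeddings.
    la, lb = len(emb_a), len(emb_b)
    # Pairwise L2 distances between every frame of A and every frame of B.
    dist = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)

    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j],      # insertion
                cost[i, j - 1],      # deletion
                cost[i - 1, j - 1],  # match
            )
    return cost[1:, 1:]   # accumulated alignment cost for each frame pair
```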

  1. Run python3 tcc_align.py --dataset [motion_name] --mode [train/val/test]
  2. By default, we randomly select 4 videos and pick one of them as the query video; the rest are the candidate videos. You can change the number of videos in tcc_align.py, line 50; it should be no less than 4.
  3. The resulting aligned video /output_*.mp4 is generated in the /result folder.
  4. We also reduce the 128-dimensional embeddings to 2D with t-SNE[3]; the resulting image /output_*.jpg is generated in the /result folder. An example is shown below, where each color represents a video and the numbers are frame indices (a t-SNE sketch follows the figure).
    (t-SNE visualization of aligned videos)
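To reproduce a similar 2D plot, the sketch below runs scikit-learn's t-SNE on the saved embeddings. It assumes the .npy file holds a single (num_frames, 128) array; adapt the loading if the file stores one array per video, and note that the file name uses a placeholder motion name.

```python
import numpy as np
from sklearn.manifold import TSNE

# Per-frame embeddings produced by tcc_get_embed.py ("jump" is a placeholder motion name).
embeddings = np.load("log/jump_embeddings.npy")

# Project the 128-dimensional embeddings to 2D for visualization.
points_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
print(points_2d.shape)   # (num_frames, 2)
```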

Get Kendall's Tau

After getting per-frame embeddings, we can compute Kendall's Tau for the videos. Kendall's Tau is a statistical measure of how well two sequences are aligned in time.

  1. Run python3 tcc_get_kendalls_tau.py --dataset [motion_name] --mode [train/val/test]
  2. The resulting value is between -1 and 1. A value of 1 implies the videos are perfectly aligned, while a value of -1 implies they are aligned in reverse order (a sketch of the computation follows).
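The sketch below shows the idea behind the score (tcc_get_kendalls_tau.py implements the repo's version): for each frame of one video, find its nearest-neighbor frame in the other video and measure how monotonically those matched indices increase.

```python
import numpy as np
from scipy.stats import kendalltau

def kendalls_tau(emb_a, emb_b):
    # emb_a: (len_a, d), emb_b: (len_b, d) per-frame embeddings.
    dist = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)
    nn_idx = dist.argmin(axis=1)        # nearest frame in B for each frame of A
    tau, _ = kendalltau(np.arange(len(emb_a)), nn_idx)
    return tau                          # 1 = same order, -1 = reverse order
```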

  1. Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal Cycle-Consistency Learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). ↩︎

  2. Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, Cewu Lu. RMPE: Regional Multi-Person Pose Estimation. In 2017 IEEE International Conference on Computer Vision (ICCV). ↩︎

  3. Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. ↩︎