We are developing a training system for figure skating. The system is designed to provide coaching instructions to users who have no professional equipment or 3D camera. It includes 2D video analysis, human skeleton tracking, pose detection, and instruction generation.
This document focuses on temporal video alignment. Code is available here.
Tags: self-supervised, embeddings, cycle-consistency, nearest neighbors, computer vision
To compare and find the differences between the learner's video and the standard motion video, we have to temporally align the two videos and automatically obtain the timestamp where the motion starts. Because of the lack of labelled data, we adopt a self-supervised representation learning method based on Temporal Cycle-Consistency (TCC) Learning[1]. The method finds temporal correspondences between video pairs and aligns two similar videos based on the resulting per-frame embeddings.
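As a rough illustration of the cycle-consistency idea (a hedged sketch, not the project's training code), the snippet below checks whether a frame embedding "cycles back" to itself: frame i of video U is mapped to its nearest neighbor in video V, and that neighbor is mapped back to U; the pair is cycle-consistent if it returns to frame i. The actual TCC loss is a differentiable (soft) version of this check.

```python
import numpy as np

def cycle_consistent(u_emb: np.ndarray, v_emb: np.ndarray, i: int) -> bool:
    """Hard cycle-consistency check between two embedding sequences.

    u_emb: (N, D) per-frame embeddings of video U
    v_emb: (M, D) per-frame embeddings of video V
    i:     frame index in U to check
    """
    # Nearest neighbor of U[i] in V
    j = int(np.argmin(np.linalg.norm(v_emb - u_emb[i], axis=1)))
    # Nearest neighbor of V[j] back in U
    k = int(np.argmin(np.linalg.norm(u_emb - v_emb[j], axis=1)))
    # Cycle-consistent if we return to the frame we started from
    return k == i
```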
Trim the start and end of each video, keeping only the target motion part. 50+ training videos are recommended for each motion.
To get per-frame human joint locations and confidence data, we apply AlphaPose[2], an accurate multi-person pose estimator, for data preprocessing. The original `.mp4` videos are transformed into frames and keypoint data. Check the AlphaPose output format here.
The resulting data for each video includes:

- a `/vis` folder, which contains all frames of the video;
- a `*.json` file, which contains COCO 17-keypoint information for every person in the video.

All preprocessed data should be placed in the `/data/[motion_name]/[train/val/test]` folder.
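As a small example of consuming this output (the field names follow the standard AlphaPose/COCO result format, and the file path is hypothetical; adjust both to your export), the snippet below reads a result JSON and reshapes each detection's keypoints into a (17, 3) array of (x, y, confidence):

```python
import json
import numpy as np

# Hypothetical path for illustration; point this at your AlphaPose result file.
with open("data/axel/train/video_01/alphapose-results.json") as f:
    detections = json.load(f)

for det in detections:
    # AlphaPose COCO-format results store keypoints as a flat
    # [x1, y1, score1, x2, y2, score2, ...] list of length 17 * 3.
    kpts = np.array(det["keypoints"], dtype=np.float32).reshape(17, 3)
    xy, conf = kpts[:, :2], kpts[:, 2]
    print(det["image_id"], xy.shape, float(conf.mean()))
```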
- Place the training data in the `/train` folder.
- Set the training configurations in `tcc_config.py`, including training steps, batch size, loss type, etc.
- Run `python3 tcc_train.py --dataset [motion_name] --mode train`
- The trained model is saved in the `./log` folder.

💡 NOTE

Before getting into the training process, always make sure that `NUM_STEPS` in `tcc_config.py` is less than and close to the minimum sequence length (`min(seq_lens)` in `tcc_data.py`); otherwise the training performance will drop.
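As a quick sanity check for the note above (a hedged sketch: the paths and the `NUM_STEPS` value are placeholders; the real values live in `tcc_config.py` and `tcc_data.py`), you can compare the configured value against the shortest training sequence, e.g. by counting extracted frames per video:

```python
import os
from glob import glob

# Hypothetical values for illustration; in practice NUM_STEPS comes from
# tcc_config.py and the sequence lengths from tcc_data.py.
NUM_STEPS = 40
train_dirs = glob("data/axel/train/*/vis")

# Sequence length = number of extracted frames per video
seq_lens = [len(os.listdir(d)) for d in train_dirs]
min_len = min(seq_lens)

# NUM_STEPS should be less than, but close to, the shortest sequence
assert NUM_STEPS < min_len, (
    f"NUM_STEPS={NUM_STEPS} must be smaller than the minimum "
    f"sequence length ({min_len})"
)
```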
- Place the test data in the `/test` folder. By default, the first data in the folder is the learner's video and the second one is the standard motion video.
- Run `python3 tcc_get_start.py --dataset [motion_name] --mode test`
- The result is saved in the `./result` folder. An example video is shown below, where the video on the left is the learner's video and the one on the right is the standard motion video.
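As an illustration of how a start timestamp could be derived from the embeddings (a hedged sketch of the idea, not the logic of `tcc_get_start.py`; the `fps` parameter is an assumption used only to convert a frame index to seconds), one can take the first frame of the trimmed standard video and find its nearest neighbor among the learner's frame embeddings:

```python
import numpy as np

def find_start_frame(learner_emb: np.ndarray, standard_emb: np.ndarray,
                     fps: float = 30.0) -> float:
    """Estimate the start time (in seconds) of the motion in the learner's video.

    learner_emb:  (N, D) per-frame embeddings of the learner's video
    standard_emb: (M, D) per-frame embeddings of the (trimmed) standard video
    """
    # The standard video is trimmed to the motion, so its first frame marks the start.
    start_ref = standard_emb[0]
    # Nearest learner frame to that reference frame in embedding space
    start_idx = int(np.argmin(np.linalg.norm(learner_emb - start_ref, axis=1)))
    return start_idx / fps
```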
You should save your embedding space model in the `/log` folder in order to run the functions below.
- Run `python3 tcc_get_embed.py --dataset [motion_name] --mode [train/val/test]`
- The per-frame embeddings are saved as `/log/[motion_name]_embeddings.npy`.
After getting per-frame embeddings, we align videos frame by frame based on the dynamic time warping (DTW) method. Check the original function here.
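As a minimal sketch of the idea (not the implementation in `tcc_align.py`, which may differ in distance metric and post-processing), the snippet below runs classic DTW over two per-frame embedding sequences and returns the matched frame pairs along the warping path:

```python
import numpy as np

def dtw_align(a: np.ndarray, b: np.ndarray):
    """Align two embedding sequences a (N, D) and b (M, D) with classic DTW.

    Returns the list of matched frame-index pairs (i, j) along the warping path.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between frames
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Accumulated cost matrix
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```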
- Run `python3 tcc_align.py --dataset [motion_name] --mode [train/val/test]`
- The number of videos used for alignment can be set in `tcc_align.py`, line 50; it should be no less than 4.
- An aligned video `/output_*.mp4` is generated in the `/result` folder.
- An alignment visualization `/output_*.jpg` is generated in the `/result` folder. An example is shown below, where each color represents a video and the numbers represent the frame indices.

After getting per-frame embeddings, we can also compute Kendall's Tau for the videos. Kendall's Tau is a statistical measure of how well two sequences are aligned in time.
Run `python3 tcc_get_kendalls_tau.py --dataset [motion_name] --mode [train/val/test]`
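For intuition (a hedged sketch, not the script itself), Kendall's Tau can be computed by mapping every frame of one video to its nearest neighbor in the other video's embedding space and measuring how close the resulting index sequence is to being monotonically increasing:

```python
import numpy as np
from scipy.stats import kendalltau

def kendalls_tau(u_emb: np.ndarray, v_emb: np.ndarray) -> float:
    """Alignment quality between two embedding sequences, in [-1, 1].

    Each frame of U is matched to its nearest neighbor in V; a perfectly
    aligned pair yields a monotonically increasing match sequence (tau = 1).
    """
    # Nearest-neighbor index in V for every frame of U
    dists = np.linalg.norm(u_emb[:, None, :] - v_emb[None, :, :], axis=-1)
    nn = dists.argmin(axis=1)
    tau, _ = kendalltau(np.arange(len(u_emb)), nn)
    return float(tau)
```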
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal Cycle-Consistency Learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). ↩︎
Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional Multi-Person Pose Estimation. In 2017 IEEE International Conference on Computer Vision (ICCV). ↩︎
Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. ↩︎