---
tags: SportTech
---

Video Alignment
===
* **Project code**: `/home/lin10/projects/tcc`
* [**GitHub repo by Nancy**](https://github.com/YilingLin10/tcc)
    * A clone of Clea's repo

### References
* [Document by Clea](https://hackmd.io/@CleaLin/SysHqZMOq#fnref3)
* [GitHub repo by Clea (TCC)](https://github.com/CleaLin/Video_Align)
    * Note that the code was modified from the [Colab notebook by Google](https://colab.research.google.com/drive/1-JYJXKoRWKcQvw5Tlacteotewpd2Bkts)
* [GitHub repo by Calvin (AlphaPose)](https://github.com/cjwku1209/alpha_pose)
* [Original paper](https://sites.google.com/view/temporal-cycle-consistency/home)

## Dataset Preparation
> Note: 50+ training videos are recommended for each motion. [color=#907bf7]

For each video:
1. Trim the start and end of the video, keeping only the target motion.
    :::info
    :mega: **Training videos should contain only the correct motion.**
    The TCC method assumes that the first (last) frame of every training video is the start (end) of the target motion. Alignment performance can drop if your training videos contain frames that are not part of the correct motion.
    :::
2. Get per-frame human joint locations and confidence scores with [AlphaPose](https://github.com/cjwku1209/alpha_pose).
    * [AlphaPose output format](https://github.com/MVIG-SJTU/AlphaPose/blob/master/docs/output.md)
    * The resulting data for each video is stored in a folder called `alpha_pose_{$video_name}`, including
        * a `/vis` folder, which contains all frames of the video in `*.jpg` format
        * an `alphapose-results.json` file, which contains the COCO 17-keypoint information for every person in the video
3. Split the data (the `alpha_pose_{$video_name}` folders) into *train/val/test* sets and place them in the `./data/[motion_name]/[train/val/test]` folders.

:::danger
:warning: **Please make sure that the main person of each video is the skater.**
:warning: **Remove videos that don't meet this criterion from the dataset.**
* Run `./draw_skeleton.py` to check the videos with the main skeletons drawn
* Or run `./data/crop_jump_test.py` to check the cropped and resized videos
:::

## Data Preprocessing
:::info
:mega: We define the person with the highest AlphaPose confidence in the first frame as the main person in the video. When loading the training dataset, we crop all frames around the main person's location to reduce the effect of the background.
> Let $\tilde{p_i}$ denote the extracted pose of the skater in the $i$-th frame. Since the output of AlphaPose includes poses of multiple people, we extract the pose with the highest confidence in the first frame as $\tilde{p_1}$, after verifying that this pose belongs to the skater. For $P_{i}=\{p_{i,1}, p_{i,2}, ...\}$, where $p_{i,j}$ denotes the $j$-th pose in the $i$-th frame, we compute the Euclidean distance between $\tilde{p_{i-1}}$ and each $p_{i,j}$, and take the pose with the minimum distance as $\tilde{p_i}$. [color=#907bf7]
:::

### Extract the 2D poses of the skater
* Please refer to `get_main_skeleton(path_to_json)` in `./utils/tcc_skeleton.py`

### Crop frames by the bounding box of the skater
* Please refer to `load_skate_data(path_to_raw_videos, dataset, mode)` in `./utils/tcc_data.py`

## Training
1. Place the preprocessed data of the training videos into the `./data/${motion_name}/train` folder.
2. Tune the hyperparameters in `./tcc_config.py`, including the number of training steps, batch size, loss type, etc. (a rough sketch of these settings is shown below).
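    The exact option names in `./tcc_config.py` are repo-specific; apart from `NUM_STEPS` (see the warning below), the names and values in this sketch are illustrative assumptions rather than the project's actual config.
    ```python
    # Sketch of typical ./tcc_config.py settings -- only NUM_STEPS is known
    # to exist in the real config; the other names and values are assumptions.

    # Presumably the number of frames sampled per video during training;
    # keep it below (and close to) the minimum sequence length printed
    # by ./utils/tcc_data.py (see the warning below).
    NUM_STEPS = 40

    BATCH_SIZE = 4                     # assumed: videos per training batch
    LEARNING_RATE = 1e-4               # assumed: optimizer learning rate
    TRAIN_ITERS = 5000                 # assumed: total training iterations
    LOSS_TYPE = "regression_mse_var"   # assumed: TCC cycle-consistency loss variant
    ```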
    :::warning
    :mega: `NUM_STEPS` in `./tcc_config.py` should be less than, and close to, the printed minimum sequence length (`min(seq_lens)` in `./utils/tcc_data.py`); otherwise training performance will drop.
    :::
3. Run `python3 tcc_train.py --dataset [motion_name] --mode train`.
4. Get the model checkpoint from the `./log` folder.

## Align 2 videos
1. Place the preprocessed data of the two testing videos into the `./data/${motion_name}/test` folder.
2. Run `python3 tcc_get_start.py --dataset ${motion_name} --mode test`.
3. The aligned video is stored in the `./result` folder.

### Temporal alignment using Dynamic Time Warping (DTW)
![](https://i.imgur.com/4dkh5VT.jpg)
:::info
:mega: TCC assumes that the videos contain only the target motion, without redundant parts; however, that is usually not the case for testing videos. Therefore, we use a standard video that contains only the correct target motion as a **pivot video** and compute its DTW distance to the two testing videos in a sliding-window manner.
> Let $V_{pivot}$ denote the pivot video, and $V_1, V_2$ denote the two testing videos, where $V_{pivot}$ has $p$ frames, $V_1$ has $m$ frames, $V_2$ has $n$ frames, and $1 \leq p \leq \min(m,n)$. For $V_i$, where $i\in\{1,2\}$, we compute the DTW distance between $V_{pivot}$ and
> * frames $0 \sim (p-1)$ of $V_i$,
> * frames $1 \sim p$ of $V_i$,
> * ...
> * frames $(\mathrm{len}(V_i)-p) \sim (\mathrm{len}(V_i)-1)$ of $V_i$.
>
> We take the start frame of the window with the minimum DTW distance as the start of the target motion in $V_i$. [color=#907bf7]
:::

### Demo
{%youtube hzUzbcf4JEU %}

## Other Functions
Please refer to [Clea's document](https://hackmd.io/@CleaLin/SysHqZMOq#Other-Functions)
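For reference, below is a minimal sketch of the main-person tracking rule described in the **Data Preprocessing** section: the highest-confidence pose in the first frame, then the nearest pose by Euclidean distance in each subsequent frame. It assumes the AlphaPose results have already been grouped per frame and reshaped into `(17, 3)` keypoint arrays; it is not the actual `get_main_skeleton()` implementation.
```python
import numpy as np

def track_main_person(frames):
    """Sketch of the main-person tracking rule from the Data Preprocessing
    section. `frames[i]` is a list of candidate poses for frame i, each an
    (17, 3) array of AlphaPose [x, y, confidence] keypoints. Returns one
    pose per frame."""
    # First frame: pick the pose with the highest total confidence
    # (assumed to be the skater -- verify this manually).
    main = max(frames[0], key=lambda pose: pose[:, 2].sum())
    tracked = [main]

    # Later frames: pick the pose closest (Euclidean distance over the
    # x, y coordinates) to the pose chosen in the previous frame.
    for poses in frames[1:]:
        prev = tracked[-1]
        main = min(poses, key=lambda pose: np.linalg.norm(pose[:, :2] - prev[:, :2]))
        tracked.append(main)
    return tracked
```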
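A minimal sketch of the sliding-window DTW start-frame search described under **Align 2 videos** is shown below as well: every length-$p$ window of a testing video is compared against the pivot video, and the window with the smallest DTW distance marks the start of the target motion. The per-frame embeddings are assumed to come from the trained TCC encoder; the DTW helper is a plain NumPy implementation, not the project's actual code.
```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two embedding sequences
    a (p, d) and b (q, d), using Euclidean per-frame costs."""
    p, q = len(a), len(b)
    acc = np.full((p + 1, q + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[p, q]

def find_start_frame(pivot_emb, test_emb):
    """Slide a window of len(pivot_emb) frames over test_emb and return
    the start index of the window with the minimum DTW distance."""
    p = len(pivot_emb)
    distances = [dtw_distance(pivot_emb, test_emb[start:start + p])
                 for start in range(len(test_emb) - p + 1)]
    return int(np.argmin(distances))
```
The same search would be run independently for each of the two testing videos before alignment.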