
# CoachMe Development Log

> contributed by <[`weihsinyeh`](https://github.com/weihsinyeh)>

[Source code](https://github.com/MotionXperts/MotionExpert)
[Concept Difference](https://hackmd.io/I_Egug6sStCAArt4122Mmw)

1. The ability to see details like a coach: understanding detailed body part interrelationships, beyond overall action recognition. (first sentence)
2. The ability to tell right from wrong like a coach: typical video captioning fails to distinguish skill levels (e.g., beginner vs. expert).

## Problems with the model

![image](https://hackmd.io/_uploads/H1okHkRv1e.png)

1. Under-rotation -> the skeleton is pinned in place -> the model cannot tell when the skater lands -> the rotation count is judged inaccurately.
2. Faults are amplified <- the current model is designed to find faults, and the coach instructions used for training are all negative. So even when the athlete jumps well, there is no mechanism for judging whether the goal has been reached; the model just nitpicks.
3. 471706304466387043_0: a correct instruction is given at the wrong point in time, e.g. 471706263479386249_0 & _1.
4. Some instructions can only be produced by someone who knows a bit about skating.
5. Paraphrasing.
6. -> Wrong interval: the error is in the first jump but the instruction talks about the second jump. -> The action classification is correct.
7. For the paraphrase ratio, I first classify the text. Text class A -> video classes B, C, D; text class B -> video class B; text class C -> video class D.

video llava. LLaVA consists of a CLIP image encoder, a multimodal projector and a Vicuna text decoder.

Text-to-text relationships: entity relationship, quantitative analysis. Coaches tend to comment on foot problems; the fact that most instructions target one body part can be reported back to the people building the dataset. The joint relationships can be passed to the people building the model. Also compare the distribution of the coach's wording with the distribution of the generated wording.

## Building the dataset

> github repository : `/MotionXperts/HybrIK`

[HybrIK paper](https://arxiv.org/abs/2011.14672)

The skeleton in each video is predicted at 30 frames per second. The skeleton is in SMPL format and uses a local coordinate system with the chest as the origin. This gives the video length at fps = 30, $sequence\ length$, and for every frame the $22 \times 3$ local joint coordinates: $x$, $y$, $z$.

### Modification

HybrIK originally outputs $24 \times 3$, the $x$, $y$, $z$ coordinates of 24 joints, but the pre-training dataset HumanML3D uses $22 \times 3$, the $x$, $y$, $z$ coordinates of 22 joints. I therefore reduced the output to 22 joints with the same joint mapping as HumanML3D. The two dropped joints are the ends of the two hands.

:::info
> [name= weihsinyeh ]
Check the scaling ratio. I wrote this before, but I am not sure whether the new version takes it into account.
:::

## Model architecture

> github repository : `/MotionXperts/MotionExpert`

The input is $3 \times\ sequence\ length \times 22$: the $x$, $y$, $z$ of the joints, where $sequence\ length$ is the length of the video.

![skeleton](https://hackmd.io/_uploads/BJU53Zd-A.png =60%x)

bonelink = $[(0, 1), (0, 2), (0, 3), (1, 4), (2,5),(3,6), (4, 7), (5, 8), (6, 9), (7, 10), (8, 11),$$(9, 12), (9, 13), (9, 14), (12, 15), (13, 16), (14, 17), (16, 18), (17, 19), (18, 20), (19, 21)]$

The $x$, $y$, $z$ of each bone is the vector obtained by subtracting the two joints it connects. All of these vectors point outward from the pelvis. From the dataset we can therefore build $6 \times\ sequence\ length \times 22$, the $x$, $y$, $z$ of the joints and of the bones. These six are called channels.

![modality](https://hackmd.io/_uploads/rJ3akQO-0.png =60%x)

`input_ids` has shape $batch\ size \times 6 (channel) \times\ sequence\ length \times 22$. Below is the start of the model's `forward` function, which calls `_get_encoder_feature`:

```python=1
def forward(self, input_ids, attention_mask, decoder_input_ids=None, labels=None, use_embeds=True,output_attentions=True):
    if use_embeds:
        batch_size, channel,seq_length, feature_dim = input_ids.shape
        input_embeds, attention_node, attention_matrix = self._get_encoder_feature(input_ids)
```

The code above goes through the `_get_encoder_feature` function:

```python
def _get_encoder_feature(self, src):
    embedding, attention_node, attention_matrix = self.STAGCN(src)
    return embedding, attention_node, attention_matrix
```

The embedding returned here is what T5 receives as input, which is why line 7 below passes `inputs_embeds=input_embeds`.

```python=
def forward(self, input_ids, attention_mask, decoder_input_ids=None, labels=None, use_embeds=True,output_attentions=True):
    if use_embeds:
        batch_size, channel,seq_length, feature_dim = input_ids.shape
        input_embeds, attention_node, attention_matrix = self._get_encoder_feature(input_ids)
        new_attentention_mask = attention_mask[:,:,::4].clone()
        attention_mask = new_attentention_mask[:,0,:]
        output = self.t5(inputs_embeds=input_embeds,
                         attention_mask=attention_mask,
                         decoder_input_ids=decoder_input_ids,
                         labels=labels,
                         output_attentions=True)
```

The overall model therefore consists of a STAGCN model in the first half and a T5 model in the second half. The first half, STAGCN, is examined next.

---

### [Spatial Temporal Attention Graph Convolution Network](https://github.com/machine-perception-robotics-group/SpatialTemporalAttentionGCN)

It has three branches: (1) Feature Extractor, (2) Attention Branch, (3) Perception Branch.

![image](https://hackmd.io/_uploads/rJ3Jy7qb0.png)

#### 0. [Spatial Temporal Graph Convolutional Networks](https://github.com/yysijie/st-gcn)

This model performs graph convolution over both the spatial and the temporal dimension. I changed the graph layout used by the convolution to SMPL; other layouts are still supported as options. SMPL is used because the pre-training dataset, HumanML3D, has a $22 \times 3$ feature for every step of the sequence.

In `/MotionExpert/net/Utils_attention/make_graph.py`, `layout == 'SMPL'` builds the initial adjacency matrix $A_0$. Joints that are adjacent in the skeleton get a relation of $1$.

##### Spatial Convolution

###### **Make Graph**

> See /MotionExpert/net/Utils_attention/make_graph.py

Next, I build the graph with `strategy == 'spatial':`. The idea is as follows.

![image](https://hackmd.io/_uploads/rJz3gXOZC.png =70%x)

${\color{Green}{Root\ node}}$ : the root node itself
${\color{DodgerBlue}{Centripetal\ group}}$ : the neighboring nodes that are closer to the gravity center of the skeleton than the root node
${\color{Gold}{Centrifugal\ group}}$ : otherwise the centrifugal group

I set `hop = 3`, meaning that from any joint the graph reaches joints at most 3 hops away. This yields 7 adjacency matrices $M_{0-6}$ (one for hop 0 plus two for each of hops 1 to 3). The adjacency matrix is used to find directly adjacent joints, `hop = 1`, stored in $A_1$; nearer joints, `hop = 2`, stored in $A_1 \times A_1 = A_2$; and farther joints, `hop = 3`, stored in $A_1 \times A_1 \times A_1= A_3$. The function below computes the hop distance between every pair of joints, which is then used to form the groups: if `hop <= 3`, the two joints belong to a group.

```python
import numpy as np

def get_hop_distance(num_node, edge, hop_size=1):
    # Symmetric adjacency matrix of the skeleton
    A = np.zeros((num_node, num_node))
    for i, j in edge:
        A[j, i] = 1
        A[i, j] = 1
    # Unreachable pairs keep a distance of infinity
    hop_dis = np.zeros((num_node, num_node)) + np.inf
    transfer_mat = [np.linalg.matrix_power(A, d) for d in range(hop_size + 1)]
    arrive_mat = (np.stack(transfer_mat) > 0)
    # Go from the largest hop down so the smallest distance is written last
    for d in range(hop_size, -1, -1):
        hop_dis[arrive_mat[d]] = d
    return hop_dis
```

Next comes the algorithm for the spatial strategy. For SMPL I set `self.center = 0`, the pelvis joint, which matches what the [STGCN paper](https://arxiv.org/pdf/1801.07455) calls the "gravity center of the skeleton". To find the groups, take two joints $joint = i$ and $joint = j$; if they are within the hop range of each other, they form a group.

* case 1 : If both are equally far from the pelvis joint $joint = 0$, each is a ${\color{Green}{Root\ node}}$ for the other in this group, because they are at the same distance from the gravity center. The entry goes into $M_{root}[j,i]$.
* case 2 : If $joint = i$ is closer to the pelvis joint $joint = 0$ than $joint = j$ is, then $joint = i$ belongs to the ${\color{DodgerBlue}{Centripetal\ group}}$ relative to $joint = j$. The entry goes into $M_{close}[j,i]$.
* case 3 : If $joint = i$ is farther from the pelvis joint $joint = 0$ than $joint = j$ is, then $joint = i$ belongs to the ${\color{Gold}{Centrifugal\ group}}$ relative to $joint = j$. The entry goes into $M_{far}[j,i]$.

Because the hop distance is symmetric, three relations follow: (1) if $M_{root}[j,i]$ is set, then $M_{root}[i,j]$ is set; (2) if $M_{close}[j,i]$ is set, then $M_{far}[i,j]$ is set; (3) if $M_{close}[i,j]$ is set, then $M_{far}[j,i]$ is set.

```python
A = []
for hop in valid_hop:
    a_root = np.zeros((self.num_node, self.num_node))
    a_close = np.zeros((self.num_node, self.num_node))
    a_further = np.zeros((self.num_node, self.num_node))
    for i in range(self.num_node):
        for j in range(self.num_node):
            if self.hop_dis[j, i] == hop:
                if self.hop_dis[j, self.center] == self.hop_dis[i, self.center]:
                    a_root[j, i] = normalize_adjacency[j, i]
                elif self.hop_dis[j, self.center] > self.hop_dis[i, self.center]:
                    a_close[j, i] = normalize_adjacency[j, i]
                else:
                    a_further[j, i] = normalize_adjacency[j, i]
    if hop == 0:
        A.append(a_root)
    else:
        A.append(a_root + a_close)
        A.append(a_further)
A = np.stack(A)
self.A = A
```

$M_0$ is the `hop = 0` case: it contains only the root entries, where each joint is paired with itself. $M_{1-6}$ are as follows:

$M_{1}$ : within `hop = 1`, ${\color{Green}{Root\ node}}$ + ${\color{DodgerBlue}{Centripetal\ group}}$ (closer joints)
$M_{2}$ : within `hop = 1`, ${\color{Gold}{Centrifugal\ group}}$ (farther joints)
$M_{3}$ : within `hop = 2`, ${\color{Green}{Root\ node}}$ + ${\color{DodgerBlue}{Centripetal\ group}}$ (closer joints)
$M_{4}$ : within `hop = 2`, ${\color{Gold}{Centrifugal\ group}}$ (farther joints)
$M_{5}$ : within `hop = 3`, ${\color{Green}{Root\ node}}$ + ${\color{DodgerBlue}{Centripetal\ group}}$ (closer joints)
$M_{6}$ : within `hop = 3`, ${\color{Gold}{Centrifugal\ group}}$ (farther joints)

These matrices are applied to the 3 input channels $x$, $y$, $z$. The convolution then produces one set of features for each of the 7 matrices.

###### **Convolution**

> See /MotionExpert/net/Utils_attention/graph_convolution.py

```python
def __init__(self, in_channels, out_channels, s_kernel_size, bias):
    super().__init__()
    self.s_kernel_size = s_kernel_size
    self.conv = nn.Conv2d(in_channels=in_channels,
                          out_channels=out_channels * s_kernel_size,
                          kernel_size=(1, 1),
                          padding=(0, 0),
                          stride=(1, 1),
                          dilation=(1, 1),
                          bias=bias)

def forward(self, x, A, att_A):
    x = self.conv(x)
    n, kc, t, v = x.size()
    x = x.view(n, self.s_kernel_size, kc//self.s_kernel_size, t, v)
    x = torch.einsum('nkctv,kvw->nctw', (x, A))
    return x.contiguous()
```

The network is randomly initialized: the convolution holds $(7\times 32) \times 3$ weights, and back propagation updates their values.

> Training from scratch: randomly initialize your neural network and train it (in a supervised manner) on your target task.
> Transfer learning: pre-train the network on a separate dataset, then fine-tune it (i.e., train it more) on the target task.

$X$ : $batch\ size \times 3(channel) \times sequence\ length \times 22$

![image](https://hackmd.io/_uploads/HkwQ8DFZ0.png =80%x)

`self.conv(x)` : $batch\ size \times (32\times 7)(channel) \times sequence\ length \times 22$

`x = x.view(n, self.s_kernel_size, kc//self.s_kernel_size, t, v)` : $batch\ size \times 7 \times 32 \times sequence\ length \times 22$

![image](https://hackmd.io/_uploads/rksMEVtbC.png =80%x)

`torch.einsum('nkctv,kvw->nctw',(a,b))`
`a: tensor(n,k,c,t,v)`, i.e. ($batch\ size$,${ \color{red}7}$,$32$,$sequence\ length$,${\color{red}{22}}$)
`b: tensor(k,v,w)`, i.e. ($7$,$22$,$22$)
Output : `tensor(n,c,t,w)`, i.e. ($batch\ size$,$32$,$sequence\ length$,$22$)

![image](https://hackmd.io/_uploads/Bk6ECG9-R.png =80%x)
![image](https://hackmd.io/_uploads/H1R0YHK-A.png =80%x)

:::info
> [name= weihsinyeh ]
TODO : make the adjacency matrix learnable
:::

The graph convolution uses [Einstein summation](https://rogerspy.github.io/2021/09/12/einsum-mhsa/) to get the combined multiply-and-add of a convolution. Written out as loops:

```
for n in range(N):                  # batch size
    for c in range(C):              # channel
        for t in range(T):          # sequence length
            for w in range(W):      # 22 joints
                # Perform the einsum operation for each index
                for k in range(K):      # 7 spatial matrices
                    for v in range(V):  # 22 joints
                        result[n, c, t, w] += x[n, k, c, t, v] * A[k, v, w]
```

Input : $batch\ size \times 3(channel) \times sequence\ length \times 22$
Output : $batch\ size\times 32(channel) \times sequence\ length \times 22$

The purpose of the GCN is to change the channel dimension while the convolution takes the graph layout into account.

##### Temporal Convolution

```python
self.tgc = nn.Sequential(
    nn.BatchNorm2d(out_channels),
    nn.ReLU(),
    nn.Conv2d(out_channels,
              out_channels,
              (t_kernel_size, 1),             # (temporal kernel size, spatial kernel size)
              (stride, 1),                    # (temporal stride, spatial stride)
              ((t_kernel_size - 1) // 2, 0),  # (temporal padding, spatial padding)
              bias=bias),
    nn.BatchNorm2d(out_channels),
    nn.Dropout(dropout),
    nn.ReLU())
```

> Parameters : dropout=0.5, t_kernel_size=9. Continuing the Spatial Convolution example above, temporal stride = 1 and out_channels = 32. Because `temporal kernel size = 9`, the temporal padding is $(temporal\ kernel\ size - 1) // 2 = 4$, so with stride = 1 the sequence length is unchanged.

![image](https://hackmd.io/_uploads/SyfJiuY-A.png =80%x)

The first block of the Feature Extractor, however, uses `dropout=0` and `residual=False`. In the other blocks `residual=True`, so after the spatial convolution and the temporal convolution the block adds a ResNet-style skip connection.

```python
sgc_out = self.sgc(x, A * self.M, att_A)
x = self.tgc(sgc_out) + self.residual(x)
return x
```

![image](https://hackmd.io/_uploads/Bk6J3utWA.png =20%x)

---

#### 1. Feature Extractor

![image](https://hackmd.io/_uploads/By2WpsYWA.png =40%x)

The Feature Extractor first processes the original $6 \times\ sequence\ length \times 22$ input. Its convolutions are the blocks from **0. STGCN**. The joint and bone channels each change as follows.

$3 \to 32$ `(stride = 1)`
$32 \to 32$ `(stride = 1)`
$32 \to 32$ `(stride = 1)`
$32 \to 64$ `(stride = 2)`
$64 \to 64$ `(stride = 1)`

The joint and bone channels are then concatenated: `feature = torch.cat([x1, x2], dim=1)`. Because one block uses `temporal stride = 2`, the time dimension becomes $(sequence\ length / 2)$. This gives the yellow embedding ![image](https://hackmd.io/_uploads/BkJzuoFWC.png =5%x) : $batch\ size\times 128(channel) \times (sequence\ length / 2) \times 22$.

---

#### 2. Attention Branch

![image](https://hackmd.io/_uploads/Bky-wYKW0.png =40%x)

The input from the Feature Extractor is $batch\ size\times 128(channel) \times (sequence\ length / 2) \times 22$. The convolutions are again the blocks from **0. STGCN**. The channels change as follows.

$128 \to 128$ `(stride = 1)`
$128 \to 128$ `(stride = 1)`
$128 \to 256$ `(stride = 2)`
$256 \to 256$ `(stride = 1)`
$256 \to 256$ `(stride = 1)`

Because one block uses `temporal stride = 2`, the time dimension $(sequence\ length / 2)$ is halved again. This gives the orange embedding ![image](https://hackmd.io/_uploads/H1U1doFZA.png =5%x) : $batch\ size\times 256(channel) \times (sequence\ length / 4) \times 22$.

##### Attention

```python
self.att_bn0 = nn.BatchNorm2d(config[-1][1])  # 256
self.att_conv = nn.Conv2d(config[-1][1],      # 256
                          num_class,
                          kernel_size=1,
                          padding=0,
                          stride=1,
                          bias=False)
...
x_att = self.att_bn0(x_last)
x_att = self.att_conv(x_att)  # inp
```

> Parameters of att_conv : input channel = 256, output channel = 1200 (my own choice)

This gives the embedding : $batch\ size\times 1200(channel) \times (sequence\ length / 4) \times 22$

##### Attention Node

```python=
self.att_node_conv = nn.Conv2d(num_class,
                               1,
                               kernel_size=1,
                               padding=0,
                               stride=1,
                               bias=False)
self.att_node_bn = nn.BatchNorm2d(1)
self.sigmoid = nn.Sigmoid()
# Attention node
x_node = self.att_node_conv(x_att)
x_node = self.att_node_bn(x_node)
x_node = F.interpolate(x_node, size=(T, V))
att_node = self.sigmoid(x_node)
```

> Parameters of att_node_conv : input channel = 1200 (my own choice), output channel = 1

The input embedding is $batch\ size\times 1200(channel) \times (sequence\ length / 4) \times 22$.

After `att_node_conv` the embedding is : $batch\ size\times 1(channel) \times (sequence\ length / 4) \times 22$

`F.interpolate` : [PyTorch's default mode is `nearest`](https://pytorch.org/docs/stable/_modules/torch/nn/functional.html#interpolate), which is nearest-neighbor rather than linear interpolation. It resizes the time dimension from $(sequence\ length / 4)$ to $(sequence\ length / 2)$, a 2x upsampling. The result is the `attention node`: $batch\ size\times 1(channel) \times (sequence\ length / 2 )\times 22$

Result : ![image](https://hackmd.io/_uploads/ByDnvsKZC.png =10%x)

##### Attention Matrix

```python=
self.att_A_conv = nn.Conv2d(num_class, num_att_A * A_size[2], kernel_size=1, padding=0, stride=1, bias=False)
self.att_A_bn = nn.BatchNorm2d(num_att_A * A_size[2])
self.tanh = nn.Tanh()
self.relu = nn.ReLU()
x_A = F.avg_pool2d(x_att, (x_att.size()[2], 1))
x_A = self.att_A_conv(x_A)
x_A = self.att_A_bn(x_A)
x_A = x_A.view(N, self.num_att_A, V, V)
x_A = self.tanh(x_A)
att_A = self.relu(x_A)
```

> Parameters of att_A_conv : input channel = 1200 (my own choice), output channel = 4 (# of attention heads) x 22

The input embedding is $batch\ size\times 1200(channel) \times (sequence\ length / 4 ) \times 22$.

> **See `torch.nn.functional.avg_pool2d`**
> input – input tensor (minibatch, in_channels, 𝑖𝐻, 𝑖𝑊)
> kernel_size – size of the pooling region. Can be a single number or a tuple (kH, kW)

The `kernel size` used here is $((sequence\ length / 4 ),1)$.

At line 5 the embedding is : $batch\ size\times 1200(channel) \times 1 \times 22$

At line 6 the embedding is convolved; this is the `attention matrix`. Line 8 reshapes it ([see STAGCN attention_branch.py](https://github.com/machine-perception-robotics-group/SpatialTemporalAttentionGCN/blob/master/Tools/Model/Utils/attention_branch.py#L46C48-L46C58)) :

$batch\ size\times 88(channel) \times 1 \times 22 \to batch\ size\times 4(channel) \times 22 \times 22$

This gives the `Attention Matrix` : ![image](https://hackmd.io/_uploads/rJ3YPsYbC.png =10%x)

:::info
> [name= weihsinyeh ]
Study how other people's GCNs do this.
:::

---

#### 3. Perception Branch

![image](https://hackmd.io/_uploads/BkXPwtKbC.png =40%x)

The Feature Extractor provides the yellow embedding : $batch\ size\times 128(channel) \times (sequence\ length /2) \times 22$. The Attention Branch provides the Attention Matrix and the Attention Node.

Before the yellow embedding `feature`, $batch\ size\times 128(channel) \times (sequence\ length / 2 ) \times 22$, is fed into ST${\color{red}{A}}$GCN, it is combined with the `Attention Node (att_node)` produced by the Attention Branch : ![image](https://hackmd.io/_uploads/ByDnvsKZC.png =10%x) $batch\ size\times 1(channel) \times (sequence\ length /2)\times 22$

The result is `att_x` : $batch\ size\times 128(channel) \times (sequence\ length /2)\times 22$

```python
# Attention Mechanism
att_x = feature * att_node
```

The new `att_x` is then the input embedding of the Perception Branch : $batch\ size\times 128(channel) \times (sequence\ length /2)\times 22$

The model is called ST${\color{red}{A}}$GCN because **0. STGCN** uses only the 7 adjacency matrices $M_{0-6}$, whereas here 4 new `Attention Matrix` entries are added to them, for a total of 11 matrices.

What the 11 matrices mean: the first 7 adjacency matrices $M_{0-6}$ encode the **spatial relations (hop <= 3)** that are **based on the skeleton**, and their relation weights, the entries of the adjacency matrices, are fixed constants (1 before normalization). In STGCN, all joint relations are therefore equally important. For every video and every STGCN convolution (first $*$, then $+$), $M_{0-6}$ are identical, because they are all built from the SMPL bone structure.

The last 4 adjacency matrices $M_{7-10}$ are produced by the Feature Extractor and the Attention Branch ($5 \times STGCN\ block$ each). They differ from $M_{0-6}$ in three ways :

1. The joint relations in $M_{7-10}$ are not based on the **spatial relations of the skeleton (hop <= 3)**. They can find joints that are not connected in the skeleton-based layout but still have a **latent relation (possibly more than 3 hops apart)**.
2. The values of $M_{7-10}$ lie between $0$ and $1$ and express how important each relation is.
3. $M_{7-10}$ differ from video to video.

```python
def forward(self, x, A, att_A):
    x = self.conv(x)
    n, kc, t, v = x.size()  # kc = (11*128)
    x = x.view(n, self.s_kernel_size, kc//self.s_kernel_size, t, v)  # (self.s_kernel_size = 7+4), (128)
    x1 = x[:, :self.s_kernel_size-self.num_att_A, :, :, :]  # 7
    x2 = x[:, -self.num_att_A:, :, :, :]  # 4
    x1 = torch.einsum('nkctv,kvw->nctw', (x1, A))
    x2 = torch.einsum('nkctv,nkvw->nctw', (x2, att_A))
    x_sum = x1 + x2
```

This shows that the attention matrices are finally combined by einsum, which multiplies and then adds with equal weights. The einsum therefore does not say how much each attention matrix contributes to the final embedding; that contribution has to be computed from the original embedding.

:::info
> [name= weihsinyeh ]
> Ideas for combining with the attention matrices above:
* Take the `PA_embedding` produced with the 4 attention matrices above (one per body part) and run gradient back propagation through it, to see which attention matrix contributes more.
* The drawback is that this cannot be related to individual text tokens, only to a whole sentence.
:::

The convolutions are again the blocks from **0. STGCN**. The channels change in the same way as in the Attention Branch.

$128 \to 128$ `(stride = 1)`
$128 \to 128$ `(stride = 1)`
$128 \to 256$ `(stride = 2)`
$256 \to 256$ `(stride = 1)`
$256 \to 256$ `(stride = 1)`

Because one block uses `temporal stride = 2`, the time dimension $(sequence\ length / 2)$ is halved again. This gives the purple embedding ![image](https://hackmd.io/_uploads/Sk5D_itZ0.png =5%x) : $batch\ size\times 256(channel) \times (sequence\ length / 4) \times 22$.

---

Finally, take

`attention_last` : $batch\ size\times 256(channel) \times (sequence\ length /4)\times 22$ ![image](https://hackmd.io/_uploads/H1U1doFZA.png =5%x)
`perception_last` : $batch\ size\times 256(channel) \times (sequence\ length /4)\times 22$ ![image](https://hackmd.io/_uploads/Sk5D_itZ0.png =5%x)

and first permute the dimensions :

`attention_last` : $batch\ size\times (sequence\ length /4) \times 22 \times 256(channel)$
`perception_last` : $batch\ size\times (sequence\ length /4) \times 22 \times 256(channel)$

After concatenation this becomes `PA_embedding` : $batch\ size\times (sequence\ length /4) \times 22 \times 512(channel)$

Result : ![image](https://hackmd.io/_uploads/SJK_IiY-R.png =3%x)

```python!
perception_last = perception_last.permute(0,2,3,1)
attention_last = attention_last.permute(0,2,3,1)
PA_embedding = torch.cat([perception_last, attention_last], dim=-1)
```

### [T5](https://huggingface.co/google-t5/t5-base)

The input is `PA_embedding` : $batch\ size\times (sequence\ length /4) \times 22 \times 512(channel)$

Before it is passed to T5, the feature dimension is reduced to the input embedding length of T5-base.

Using the video frames as the sequence-length dimension lets T5's positional encoding capture the temporal order of the sequence. ViT models likewise place the video frames on the sequence-length dimension. With this layout, T5's attention can be visualized to show which time segment has a problem: when the instruction says the left hand is wrong, it would highlight the segment in which the athlete extends the left hand. However, the sequence length is too large to visualize easily, so I swap the sequence-length (video length) dimension with the 22 joints. The 22 joints also have relations to one another that T5's positional encoding can carry.

:::info
> [name= weihsinyeh ]
Before the embedding is passed to T5, $22$ and $sequence\ length$ are swapped, so T5's tokens are per joint rather than per time step. The drawback is that T5's input (22 tokens) carries no temporal order, which does not make good use of the language model's positional encoding. Change this to a 22 x times layout.
:::

[commit 066710](https://github.com/MotionXperts/MotionExpert/commit/066710bbd94212c1d37892cec212497d1c9d25aa): one more attention matrix can be produced than before, that is, one attention matrix can now be mapped onto OpenPose successfully.

![image](https://hackmd.io/_uploads/SyLbDsFb0.png =20%x)

```python
self.embedding = nn.Sequential(
    # (22 * 512 = 11264) -> (22 * 256 = 5632)
    nn.Linear(11264,5632),
    nn.ReLU(),
    # (22 * 256 = 5632) -> (768)
    nn.Linear(5632,768)).to(PA_embedding.get_device())
# batch_size, (sequence length /4), feature_dim (11264 -> 768)
output_embedding = self.embedding(PA_embedding.view(-1,11264)).view(N, -1, 768)
return output_embedding, att_node, att_A
```

`output_embedding` is passed to T5 (line 7 of the code below), and `att_node` and `att_A` are used later for the Evaluation. Below is the model's `forward` function.

```python=1
def forward(self, input_ids, attention_mask, decoder_input_ids=None, labels=None, use_embeds=True,output_attentions=True):
    if use_embeds:
        batch_size, channel,seq_length, feature_dim = input_ids.shape
        input_embeds, attention_node, attention_matrix = self._get_encoder_feature(input_ids)
        new_attentention_mask = attention_mask[:,:,::4].clone()
        attention_mask = new_attentention_mask[:,0,:]
        output = self.t5(inputs_embeds=input_embeds,
                         attention_mask=attention_mask,
                         decoder_input_ids=decoder_input_ids,
                         labels=labels,
                         output_attentions=True)
```

#### T5 model architecture

![image](https://hackmd.io/_uploads/r1zfZfUM0.png)
[Image source](https://medium.com/analytics-vidhya/t5-a-detailed-explanation-a0ac9bc53e51)

#### T5 visualize attention

References: (1) [bertviz github](https://github.com/jessevig/bertviz) (2) [bertviz paper](https://aclanthology.org/P19-3007.pdf) (3) [bertviz web page](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1)

```python=
odict_keys(['loss', 'logits', 'past_key_values', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_attentions'])
```

`model_view()` needs decoder_attention, encoder_attention and cross_attention. According to [Hugging Face's documentation : text_generation](https://huggingface.co/docs/transformers/main_classes/text_generation ), the `generate` function needs not only `output_attentions=True` but also `return_dict_in_generate=True`. Even with the encoder, decoder and cross attention needed for visualization in hand, they could not be passed directly to `model_view()` or `head_view()`: there was a type conversion error between tuple and tensor. The discussion [Error when trying to visualize attention in T5 model](https://discuss.huggingface.co/t/error-when-trying-to-visualize-attention-in-t5-model/35350/2?fbclid=IwZXh0bgNhZW0CMTAAAR2yhEnldQ9f-2bNl7HEnwiiKN2iTuekNKgwqi05s6VmlOCYZbD_WUC4Ca0_aem_AQkFeGkQTeBS7y6oQ37F18-ympxFr3HOg4k_CQKeP5v-6U-W64N3COVsHqP2eYiBpRI3cdkdVEBeEKbRDEW3alEb) resolved it. [commit 066710](https://github.com/MotionXperts/MotionExpert/commit/066710bbd94212c1d37892cec212497d1c9d25aa)

For the meaning of the visualized attention, see [blog : illustrated-transformer](https://jalammar.github.io/illustrated-transformer/). In the experiments, however, the cross attention looks fully connected, which is odd. This may be because I apply max pooling directly while swapping sequence length and the 22 joints.

:::info
> [name= weihsinyeh ]
1. Investigate replacing the max pool operation with something else.
2. 
Switch to an Args-based option so that whoever trains the model can choose the fine-tuning style: the prompt approach or the token approach (Chain of Thought).
:::

## Evaluation

> [github : Evaluation](https://github.com/MotionXperts/Evaluation)

See [the back propagation derivation for convolution](https://chih-sheng-huang821.medium.com/卷積神經網路-convolutional-neural-network-cnn-卷積計算的倒傳遞推導與稀疏矩陣觀點來看卷積計算-e82ac16e510f). The attention weights are computed directly with back propagation.

---

### Reference

> [Convolution reference](https://zhuanlan.zhihu.com/p/77471866)
> [Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers](https://proceedings.neurips.cc/paper_files/paper/2021/file/67f7fb873eaf29526a11a9b7ac33bfac-Paper.pdf)
> [SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.](https://arxiv.org/abs/2111.13196)

#### Fine-tuning settings

| Setting | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --------------- | -------- | -------- | --------- | -------- | ----- | ---| --|
| pretrained model | cindy's | cindy's | tommy's | cindy's | tommy's | tommy's | tommy's |
| pooling type | temporal | temporal | temporal | temporal | skeleton | skeleton | skeleton |
| Encoder | STAGCN | STAGCN | STAGCN | STAGCN | STAGCN | STAGCN + RGB | STAGCN + RGB(TCC) |
| alignment | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ |
| decoder random | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Instruction result | abnormal (repetitive) | normal | normal | normal | normal | normal | normal |
| T5 attention result | attends to specific joints | attends to specific joints | unclear, fully connected | attends to specific joints | ? | ? | ? |

#### Pre-training settings

| Setting | A (cindy's) | B | C (cindy's) |
| --------------- | -------- | -------- | ----------- |
| pooling type | skeleton | temporal | skeleton |
| Encoder | STAGCN | STAGCN | STAGCN |
| alignment | ✅ | ❌ | ❌ |
| decoder random | ❌ | ✅ | ✅ |
| Description result | normal | normal | normal |
| T5 attention result | attends to specific joints | ? | attends to specific joints |

#### Experiment 1 : 2 vs 4 vs 6

(1) Is alignment better? (2) Is RGB better? (3) Is skeleton pooling better?

2: baseline using temporal pooling
4: alignment with temporal pooling
6: RGB aligned with skeleton pooling

Thinking: what does "better" mean for alignment? (1) More fluent? (2) More accurate?

#### Experiment 2 : effect of alignment on description? (automatic metric)

A vs C : is alignment needed, or can the description be just as good without it?

#### Experiment 3 : are pre-training and fine-tuning the same task? (human eval)

Difference 1 between 3 and 4 : whether alignment is done. Difference 2 : whether the pretrained model's transform module is used.
4 : pre-training and fine-tuning are the same task, only the dataset differs.
3 : pre-training and fine-tuning are different tasks, because the models differ (in whether alignment is done).

#### Experiment 4 : pooling type (automatic metric)

B vs C

#### Experiment 5 : if the generator settings alone are good enough, is there any downside? Can the current setup + generator do better? Something along these lines (human eval)

1 vs 4 can be used to show that alignment is useful -> there's no reason to eval 1.
2 vs 4 : have coaches do a human evaluation to see whether they can tell that the instructions of 2 come from the decoder alone. -> should be revised into exp1.

#### Experiment 6 : (human eval)

6 vs 7: Does the performance of the aligning module affect generation? (first do automatic eval)

---

## 7/7 Discussion

1. Introduction
2. Start from the two challenges: challenge 1, challenge 2. The current description of the challenges is not right. State from the very beginning that generating coaching instructions means doing exactly these two things.
4. Introduction: how this problem is to be solved.
5. Introduction: what the problem is; why it is hard in general; how others approach the difficulty but cannot solve it; contribution.
6. How we solve it: some of what the coach says is correct, but it refers to the wrong segment.

Related work

Presentation : start with the flow. Video goes into a black box and comes out of a black box. For each part: function, module, method.

Human Pose Perception: 22 x vector. Bones are the vectors obtained by subtraction. Input and output.

GCN: explain the abstract spatial matrix, the relations between bones. The input is the yellow feature and the output is the orange feature. The spatial matrix goes into the attention branch. Explain what the yellow feature above already shows, and what else we want: what it does, how, what it can do, which features it can highlight, what it can extract. Single images versus differences between videos.

What is a Concept : the features of two frames.
Concept Difference : the difference in motion between two frames. How do differences between videos connect to Concept Difference?
Understanding : understanding what? 1. alignment 2. quantifying the gap.
A large Difference means poor execution, farther from the standard. There are several branches.

What to show in Results: ref does not help description. What description is; the module on the left performs better. What instruction is. Lines : do not use so many colors for the hand lines. ref helps instruction.

1. Wrong action classification,
2. 
The model does not know the actual visual content of the input.

## Model architecture references

[LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)

![image](https://hackmd.io/_uploads/ryBt6c0Q1g.png)

[Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485)
[Improved Baselines with Visual Instruction Tuning](https://arxiv.org/pdf/2310.03744)
[LORA](https://arxiv.org/pdf/2106.09685)
* [Lora Github](https://github.com/microsoft/LoRA)

[Introducing Parameter-Efficient Fine-Tuning (PEFT)](https://medium.com/@jyotikhetan2/introducing-parameter-efficient-fine-tuning-peft-e1188943d7fc)
* [huggingface/peft](https://github.com/huggingface/peft)

```python
interpolated_tensor = F.interpolate(
    stagcn_embedding.permute(0, 2, 3, 1),  # (Batch, Time, 22, 512) -> (Batch, 22, 512, Time)
    size=(standard_embedding.shape[3], standard_embedding.shape[1]),  # Last two dims only: (512, Time of standard_embedding)
    mode="bilinear",
    align_corners=True
).permute(0, 3, 1, 2)  # Restore to (Batch, Time, 22, 512)
stagcn_embedding = interpolated_tensor
print(stagcn_embedding.shape)
print(standard_embedding.shape)
if (video_name[0] == '471703066060784233_1'):
    print("Visualize")
    print("stagcn_embedding",stagcn_embedding.shape)
    print("standard_embedding",standard_embedding.shape)
    embs = [stagcn_embedding[0].clone().cpu(),standard_embedding[0].clone().cpu()]
    viz_tSNE(embs,"/home/weihsin/projects/MotionExpert/results/finetune_joint_training/471703066060784233_1.jpg")
```

### Rebuttal experiments

Dataset comparison table

| Sport | Annotators | Athletes | Camera viewpoint | Action types |
| - | - | - | - | - |
| Skating | same annotator | same athlete | casually shot videos | different actions |
| Boxing | different annotators | different athletes | different viewpoints | different actions |

Different people, variation in the instructions -> analyze how the error rate changes per person V
Same person, variation across viewpoints -> goal : show that changing the viewpoint has no effect V
Find out which epoch takes very long.

---

1. Same person: concatenate the instructions from the different annotators.
2. 
Same person: keep the instructions from the different annotators separate.
   (2) Check whether the generated instructions describe a richer set of errors (diversity).
   (2) Or whether they end up looking too general (G-eval).
   (2) Whether the generated instructions conflict with one another.
   (2) Whether they lean toward one particular annotator.

The 6 basic actions: record, teach, then do it again. The prompts differ.

One video -> one instruction -> GPT -> which of the four annotator sentences is most similar. Use the most similar one.
One video x 4 -> corresponds to one video x 1 -> predict <-> find the most similar of the four (cosine similarity) and converge toward it.

## Boxing dataset :

The code of the original alignment model is in : `/home/c1l1mo/SkateAppModels/SkatingApp/engine/BoxingAlignment/app/model.py`

```python=81
def process_video(self, video_tensor: torch.Tensor) -> np.ndarray:
    """Process a single video tensor and return embeddings"""
    with torch.no_grad():
        if self._cfg.MODEL.EMBEDDER_TYPE != 'conv':
            with torch.cuda.amp.autocast():
                emb_feats = self._model(video_tensor.unsqueeze(0).cuda(), video_masks=None)
        else:
            seq_len = video_tensor.size(0)
            steps = torch.arange(0, seq_len, self._cfg.DATA.SAMPLE_ALL_STRIDE)
            context_stride = self._cfg.DATA.CONTEXT_STRIDE
            steps = steps.view(-1,1) + context_stride*torch.arange(
                -(self._cfg.DATA.NUM_CONTEXTS-1), 1
            ).view(1,-1)
            steps = torch.clamp(steps.view(-1), 0, seq_len - 1)
            input_video = video_tensor[steps.long()].unsqueeze(0).cuda()
            with torch.cuda.amp.autocast():
                emb_feats = self._model(input_video, video_masks=None)
    return emb_feats[0].cpu().numpy()

def align_videos(self, video1_path: str, video2_path: str,output_dir: str) -> str:
    """Align two videos and return the path to the output video"""
    # Create transform for video preprocessing
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Read and preprocess videos
    def load_video(path):
        video, _, _ = read_video(path)
        video = video.float() / 255.0
        video = video.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
        video = transform(video)
        return video

    video1 = load_video(video1_path)
    video2 = load_video(video2_path)

    # Get embeddings
    timer.start('process both videos')
    embs1 = self.process_video(video1)
    embs2 = self.process_video(video2)
    process_time = timer.end('process both videos')
    # print(f"Processed both videos in {process_time:.2f} seconds")

    # Prepare output path
    os.makedirs(os.path.join(output_dir,'result'), exist_ok=True)
    video_path = os.path.join(output_dir,'result',f'output.mp4')

    # Create alignment video
    timer.start('create video')
    aligned_query, aligned_key, query_indices, key_indices = create_video(  # originally, there is no return value
        embs1, video1.permute(0, 2, 3, 1),  # (T, H, W, C)
        embs2, video2.permute(0, 2, 3, 1),
        video_path, use_dtw=True, interval=200
    )
    create_time = timer.end('create video')
    # print(f"Created alignment video in {create_time:.2f} seconds")
    return video_path, aligned_query, aligned_key, query_indices, key_indices
```

The code that decodes the alignment video into frames : https://github.com/pytorch/vision/blob/main/torchvision/io/video.py#L274 . Line 112 of the file (the `read_video(path)` call) uses the function imported on line 13, `from torchvision.io import read_video`, which keeps the fps of the original video. By contrast, `https://github.com/MotionXperts/HybrIK/blob/main/scripts/demo_video.py#L39` uses

```python=33
def get_video_info(in_file):
    stream = cv2.VideoCapture(in_file)
    assert stream.isOpened(), 'Cannot capture source'
    # self.path = input_source
    datalen = int(stream.get(cv2.CAP_PROP_FRAME_COUNT))
    fourcc = int(stream.get(cv2.CAP_PROP_FOURCC))
    fps = int(stream.get(cv2.CAP_PROP_FPS))
    frameSize = (int(stream.get(cv2.CAP_PROP_FRAME_WIDTH)),
                 int(stream.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    # bitrate = int(stream.get(cv2.CAP_PROP_BITRATE))
    videoinfo = {'fourcc': fourcc, 'fps': fps, 'frameSize': frameSize}
    stream.release()
    return stream, videoinfo, datalen
```

---

## Loss Calculation

sample A -> augment to 4 ground truth

### Method A : PerGT Loss

Training : the loss is computed four times

```
sample A -> prediction A-1 <-> GT A-1 (compute loss)
sample A -> prediction A-2 <-> GT A-2 (compute loss)
sample A -> prediction A-3 <-> GT A-3 (compute loss)
sample A -> prediction A-4 <-> GT A-4 (compute loss)
```

> Skating
> skating_t5_6 epoch 135

```bash
$ ./MotionExpert_tmp/MotionExpert/results/finetune_skeleton_t5_6/jsons/results_epoch135.json
```

> Boxing : better

```bash
$ ./MotionExpert_tmp/MotionExpert/results/finetune_boxing_0304/jsons
```

Look at how the GT changes.

---

Method A still has the problem that LoRA results cannot be reproduced. Methods B and C both have that problem fixed.

### Method B : ClosestSimGT Loss

Training : the loss is computed once per sample

```
sample A -> prediction <-> find the most similar GT among (A-1,A-2,A-3,A-4) (compute loss)
```

#### Skating

```bash
$ ./MotionExpert_tmp/skating_gt_tmp/jsons/results_epoch175.json
```

| Skating ClosestSimGT : Loss | Skating ClosestSimGT : (Training) Metric |
| - | - |
| ![image](https://hackmd.io/_uploads/Bk1Wv2iCJl.png) | ![image](https://hackmd.io/_uploads/rkI-DnoRJl.png) |
| Skating ClosestSimGT : GT | ![image](https://hackmd.io/_uploads/HyjWkFJJex.png) |

#### Boxing

bad

> Look at how the GT changes : after visualization, the selected ground truth jumps back and forth during the search.

```bash
$ ./boxing_gt2/jsons
```

| Boxing ClosestSimGT : Loss | Boxing ClosestSimGT : (Training) Metric |
| - | - |
| ![image](https://hackmd.io/_uploads/SysAInjRkg.png) | ![image](https://hackmd.io/_uploads/S1TkvniRkg.png) |
| Boxing ClosestSimGT : GT | ![image](https://hackmd.io/_uploads/rJkFpFJygl.png) |

### Method C : PerGT Loss

Change 1 : fixed the problem that LoRA results could not be reproduced

| Skating PerGT : Loss | Skating PerGT : (Training) Metric |
| - | - |
| ![image](https://hackmd.io/_uploads/SywmDnsA1x.png) | ![image](https://hackmd.io/_uploads/HkCXPnjAyx.png) |
| Skating PerGT : GT | ![image](https://hackmd.io/_uploads/SkSFjYkyxe.png) |

For Boxing GT I only ran up to epoch 135, because the trend showed no room for improvement.

| Boxing PerGT : Loss | Boxing PerGT : (Training) Metric |
| - | - |
| ![image](https://hackmd.io/_uploads/SyjCXKJyll.png) | ![image](https://hackmd.io/_uploads/H1Jg4FkJll.png) |
| Boxing PerGT : GT | ![image](https://hackmd.io/_uploads/SybuNKkJxe.png) |

| Method | bleu_1 | bleu_4 | rouge | bertscore | G-eval | epoch |
| - | - | - | - | - | - | - |
| A (bad lora) | 24.7 | 2.3 | 16.9 | 26.5 | 1.73 | 135 |
| B ClosestSimGT Loss | 21.3 | 2.8 | 17.7 | 12.2 | 1.78 | 175 |
| C PerGT Loss | 24.3 | 6.8 | 19.6 | 12.8 | 1.75 | 90 |

## 0416

### Tests :

1. Four videos concatenated into one long video that corresponds to 4 GT
2. The new boxing dataset
3. 
One video mapped to 4 GTs

#### Lora Config

```python
if cfg.TASK.SPORT == "Skating" :
    stagcn_lora_config = {"bias" : "none", "r" : 32, "lora_alpha" : 64, "lora_dropout" : 0.1}
    transformation_lora_config = {"bias" : "none", "r" : 32, "lora_alpha" : 64, "lora_dropout" : 0.1}
elif cfg.TASK.SPORT == "Boxing" :
    stagcn_lora_config = {"bias" : "none", "r" : 64, "lora_alpha" : 128, "lora_dropout" : 0.5}
    transformation_lora_config = {"bias" : "none", "r" : 64, "lora_alpha" : 128, "lora_dropout" : 0.5}
    # Note: the two assignments below come last, so they override the r = 64 settings above.
    stagcn_lora_config = {"bias" : "none", "r" : 32, "lora_alpha" : 64, "lora_dropout" : 0.1}
    transformation_lora_config = {"bias" : "none", "r" : 32, "lora_alpha" : 64, "lora_dropout" : 0.1}
```

```python
import pickle

input_path = "/home/c1l1mo/datasets/scripts/skating_pipeline/Skating_GT_test/aggregate.pkl"
output_path = "./skating_test_aggregate_reformat.pkl"

with open(input_path, "rb") as f:
    data = pickle.load(f)

for item in data:
    trimmed_start = item.get('trimmed_start', 0)
    if not item['standard_longer']:
        # User clip is the longer one: shift its range by the trim offset,
        # and let the standard clip start at frame 0 with the same length.
        start_frame = item['start_frame'] + trimmed_start
        end_frame = item['end_frame'] + trimmed_start
        item['start_frame'] = start_frame
        item['end_frame'] = end_frame
        item['std_start_frame'] = 0
        item['std_end_frame'] = end_frame - start_frame
    else:
        # Standard clip is the longer one: the stored range belongs to it,
        # and the user clip starts at the trim offset with the same length.
        std_start = item['start_frame']
        std_end = item['end_frame']
        item['std_start_frame'] = std_start
        item['std_end_frame'] = std_end
        item['start_frame'] = trimmed_start
        item['end_frame'] = trimmed_start + (std_end - std_start)

with open(output_path, "wb") as f:
    pickle.dump(data, f)
```

---

```python
if cfg.TASK.SPORT == 'Skating' :
    if item['standard_longer'] :
        std_start = start_frame
        usr_start = trimmed_start
    else :
        usr_start = start_frame + trimmed_start
        std_start = 0
    # Clamp the length so neither feature sequence is read past its end.
    length = min(min(length, len(std_features[0]) - std_start), len(features[0]) - usr_start)
if cfg.TASK.SPORT == 'Boxing' :
    std_start = item["std_start_frame"]
    usr_start = item["start_frame"]
    length = min(min(item["aligned_seq_len"], len(std_features[0]) - item["std_start_frame"]),
                 len(features[0]) - item["start_frame"])
    # std_start = item["std_start_frame"]
    # usr_start = item["start_frame"]
    # length = item["aligned_seq_len"]
```

---

Remove 2 coaches.
boxing 1 2 3 4 -> (1->1, 3->3, 4->4)
video : 10 * 2 clip 10 : A B C D clip E B C D
new boxing 1 -> 5
skating video : 292 1 -> 5 clip : 4 A B C D

---

## Erroneous data

> /home/weihsin/datasets/Axel_com/471706401488502884
> /home/weihsin/datasets/Lutz/485958788181131265
> /home/weihsin/datasets/Lutz/485958785647771873
> /home/weihsin/datasets/Lutz/485958798918549798
> /home/weihsin/datasets/Axel_com/471706229102870738
> /home/weihsin/datasets/Axel_com/471706248866431172
> /home/weihsin/datasets/Axel_com/471706360988565506
> /home/weihsin/datasets/Axel_com/471706339664986566

**coach1**
2_back_10
2_back_8
2_back_9
3_back_1

**coach2**
5_front_10
5_front_9

```bash
$ source /home/c1l1mo/miniconda3/etc/profile.d/conda.sh
$ conda activate yolo
$ python crop_video.py --dataset new_BX --alphacrop 1
```

Under `/home/c1l1mo/testings/yolov7` I first added `/home/c1l1mo/testings/yolov7/bbox_queue_gen.py`, because after the bash script run.sh generates the bounding boxes for each video, the videos still have to be cropped. The cropping code is crop_video.py: `PYTHONPATH=.
python crop_video.py --dataset newBX --alphacrop 1` requires a bboxes_queue.json file.

## New boxing

| video | type | action |
| - | - | - |
| 20250318101244 | 1-1 | deleted |
| 20250318101839 | 1-2 | lead-hand straight punch (前手直拳) |
| 20250318102107 | 1-3 | |
| 20250318102438 | 1-4 | |
| 20250318102651 | 1-5 | |
| 20250318102834 | 1-6 | |
| 20250318103431 | 1-7 | |
| 20250318103922 | 1-8 | |
| 20250318104331 | 1-9 | |
| 20250318105057 | 1-10 | |
| 20250318105602 | 2-1 | rear-hand straight punch (後手直拳) |
| 20250318110730 | 2-2 | |
| 20250318111226 | 2-3 | |
| 20250318111705 | 2-4 | |
| 20250318112203 | 2-5 | |
| 20250318112356 | 2-6 | |
| 20250318112602 | 2-7 | |
| 20250318113208 | 3-1 | lead hook (前勾手) |
| 20250318113629 | 3-2 | |
| 20250318114103 | 3-3 | |
| 20250318114659 | 3-4 | |
| 20250318114848 | 3-5 | |
| 20250318115344 | 4-1 | rear hook (後勾手) |
| 20250318115912 | 4-2 | |
| 20250318120441 | 4-3 | |
| 20250318120757 | 4-4 | |
| 20250318121025 | 4-5 | |
| 20250318121600 | 5-1 | lead-hand uppercut (upper前手) |
| 20250318121952 | 5-2 | |
| 20250318122427 | 5-3 | |
| 20250318122737 | 5-4 | |
| 20250318123118 | 5-5 | |
| 20250318123506 | 6-1 | |
| 20250318123706 | 6-2 | |
| 20250318124132 | 6-3 | |
| 20250318124403 | 6-4 | |
| 20250318124603 | 6-5 | |
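
In the table above, the action name is written only once per group (the digit before the dash in `type`), and one row is marked `deleted`. The following is a minimal sketch of how such rows could be expanded into per-video labels. It assumes that `deleted` means the video is dropped and that the single action name applies to the whole group; neither is stated explicitly in these notes. The helper name `label_rows` and the `ROWS` excerpt are hypothetical and not part of the MotionExpert code.

```python
# A few rows copied from the "New boxing" table: (video id, type label, action cell).
ROWS = [
    ("20250318101244", "1-1", "deleted"),
    ("20250318101839", "1-2", "前手直拳"),
    ("20250318102107", "1-3", ""),
    ("20250318105602", "2-1", "後手直拳"),
    ("20250318110730", "2-2", ""),
    ("20250318123506", "6-1", ""),
]


def label_rows(rows):
    """Hypothetical helper: attach a group-level action to every retained row."""
    # First pass: remember the first real action name seen in each group.
    group_action = {}
    for _, type_label, cell in rows:
        group = type_label.split("-")[0]
        if cell and cell != "deleted":
            group_action.setdefault(group, cell)

    # Second pass: drop rows marked "deleted" (assumption) and fill in the action.
    labeled = []
    for video, type_label, cell in rows:
        if cell == "deleted":
            continue
        group = type_label.split("-")[0]
        # Groups with no action name in the table (e.g. group 6) stay None.
        labeled.append((video, type_label, group_action.get(group)))
    return labeled


labeled = label_rows(ROWS)
for row in labeled:
    print(row)
```

Group 6 has no action name in the table, so the sketch leaves it as `None` instead of guessing one.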