# SubProject 2 - Computer Vision: Action Recognition
- [Action Recognition](#)
- [Project Introduction](#)
- [Kinetics Skeleton Dataset](#)
- [Data Exploration](#)
- [Key Concepts and Metrics in the Project](#)
- [Model](#)
* [Spatial Graph Convolutional Neural Networks](#)
- [Program Running Instructions](#)
* [Data Exploration](#)
* [Model Experiments](#)
- [Demo](#)
## Self-Evaluation Based on SubProject 2 - Computer Vision
|Content|Status|
|-|-|
|Explore the Kinetics-Skeleton dataset| Completed |
|Data cleaning and preprocessing| Completed |
|Re-implement the Spatial Temporal Graph Convolutional Networks algorithm| Completed |
|Improve the model using more layers| Completed |
|Improve the model using RNN, LSTM, Bidirectional LSTM| Completed |
|Improve the model using attention mechanism| Completed |
|Test on other datasets such as NTU-RGB+D| Completed |
***Result:***
The group gained an overall understanding of action recognition methods using graphs in general and graph neural networks in particular. By studying related papers, the group also gained hands-on experience with practical techniques for preprocessing input data and building ***deep learning*** models.
***This project describes how our group built an action recognition system for everyday videos, as part of a computer vision course at our university.***
---
## Project Introduction
The Industry 4.0 era is developing rapidly in all fields, especially artificial intelligence, enabling us to do everything smarter and more easily. Machines can assist with tasks ranging from basic to advanced and interact with abundant resources such as images, audio, and videos.
Recognizing behaviors such as shoplifting in supermarkets or cheating during exams is a task we want to automate intelligently. Therefore, our group explored action recognition, contributing to advancing AI in general and ***video understanding*** in particular.
## Key Concepts and Metrics in the Project
### Cross Entropy Loss
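Cross entropy is the training loss for multi-class classification: given the predicted class probabilities $\hat{y}$ and the one-hot ground truth $y$ over $C$ action classes,
$$\mathcal{L}_{CE} = -\sum\limits_{i = 1}^{C} y_i \log \hat{y}_i$$
i.e., the loss is the negative log-probability that the model assigns to the correct action.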
### TOP-K
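Top-K accuracy counts a prediction as correct when the true label appears among the $K$ classes with the highest predicted probability. Top-1 is ordinary accuracy; Top-5 is more lenient and is commonly reported on Kinetics because many action classes are visually similar.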

## Model
### Spatial Graph Convolutional Neural Networks
Training objective: ***Cross Entropy Loss***. Performance is evaluated with Top-1 and Top-5 accuracy.
#### Graph-based Modeling
For the ***action recognition*** task, we chose a ***graph*** as the data structure for modeling the input information.
Since a video is essentially a sequence of images, graph construction is performed as follows:
+ For each frame in a video, we use ***OpenPose*** to extract keypoints.
+ Each keypoint is treated as a node in the graph.
+ Pairs of connected keypoints form the ***edges***. For example, the image below is extracted using ***OpenPose*** with ***18 keypoints***.

Observations:
+ ***Keypoints***, also called ***joints***, are connected pairwise to form edges: ***(0, 1), (1, 0), (2, 1), (1, 2), ... are edges***
+ Additionally, self-connections such as ***(0, 0), (1, 1) are also edges***
Based on the above, we model the data as follows:
+ $G = (V, E)$ is the spatial-temporal graph built over the whole video
+ $N$ is the ***number of nodes (keypoints)*** extracted per person
+ $T$ is the number of ***frames*** extracted from the video
Then:
+ $V = \{v_{ti} \mid t = 1, 2, \dots, T,\ i = 1, 2, \dots, N \}$ is the set of nodes for ***one video***
+ In each frame, $E_S = \{ v_{ti}v_{tj} \mid (i, j) \in H\}$ is the edge set, with $H$ being the ***set of connected joint pairs*** according to ***OpenPose***.
+ Moreover, we link ***node $i$*** in frame $t$ to the same ***node $i$*** in frame $t + 1$: $E_F = \{ v_{ti}v_{(t + 1)i} \}$ is the set of ***temporal edges***. A construction sketch follows this list.
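To make the construction concrete, here is a minimal Python/NumPy sketch of how a single-frame adjacency matrix could be built. The 18-joint edge list follows the OpenPose COCO layout used in the public ST-GCN code; treat it as an illustrative assumption rather than our exact preprocessing:

```python
import numpy as np

NUM_JOINTS = 18  # OpenPose (COCO) keypoints

# Joint pairs (i, j) assumed from the OpenPose COCO skeleton layout
NEIGHBOR_LINKS = [
    (4, 3), (3, 2), (7, 6), (6, 5), (13, 12), (12, 11),
    (10, 9), (9, 8), (11, 5), (8, 2), (5, 1), (2, 1),
    (0, 1), (15, 0), (14, 0), (17, 15), (16, 14),
]

def build_adjacency(num_joints=NUM_JOINTS, links=NEIGHBOR_LINKS):
    """Symmetric adjacency with self-loops: (i, j), (j, i), and (i, i) are edges."""
    A = np.eye(num_joints)           # self-connections (0, 0), (1, 1), ...
    for i, j in links:
        A[i, j] = A[j, i] = 1.0      # undirected skeleton edges
    return A

A = build_adjacency()
print(A.shape)  # (18, 18)
```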
#### Graph Neural Network
##### ST-GCN Algorithm
Before considering all ***frames***, consider each ***frame*** individually:
At ***frame $\tau$***, the subgraph has nodes $V_\tau = \{ v_{\tau i} \mid i = 1, \dots, N \}$ and edges $E_\tau = \{ v_{\tau i}v_{\tau j} \mid (i, j) \in H\}$.
+ For each frame, feature extraction leverages the strength of ***Graph CNN*** layers.
+ Let:
  + Kernel size: $K \times K$
  + Input feature map: $f_{in}$
  + The input feature has ***$c$ channels***
Then:
$$f_{out}(x) = \sum\limits_{h = 1}^K \sum\limits_{w = 1} ^ K f_{in}(p(x, h, w)) \cdot w(h, w) \text { (1)}$$
+ ***Sampling function $p$***: samples the neighbors of $x$
+ ***Weight function $w$***: provides a learnable weight vector for the inner product with the sampled input feature
***Note:*** the weight function does not depend on each $x$ individually, so the model generalizes even when nodes appear or disappear.
Each ***node*** in frame $\tau$ thus produces an output feature vector in $\mathbb{R}^c$.
###### Sampling Function
$p(x, h, w)$ samples the neighbors of the center node $x$.
The simplest method is distance-based sampling:
$B(v_{ti}) = \{ v_{tj} | d(v_{ti}, v_{tj}) \le D\}$, where $d(v_{ti}, v_{tj})$ is the ***shortest-path distance*** on the graph (ST-GCN uses $D = 1$, i.e., direct neighbors).
Notation: $p(v_{ti}, v_{tj}) = v_{tj} \text { (2)}$
###### Weight Function
Since graph nodes have no fixed grid order, the neighbor set $B(v_{ti})$ is partitioned into ***$K$ subsets*** via a labeling map $l_{ti}: B(v_{ti}) \rightarrow \{0, ..., K - 1 \}$, and the weights are stored as a tensor of shape $(c, K)$.
Then $w(v_{ti}, v_{tj}) = w'(l_{ti}(v_{tj})) \text { (3)}$ is a $c$-dimensional weight vector selected by the label assigned to $v_{tj}$.
From (1), (2), (3):
$$f_{out}(v_{ti}) = \sum\limits_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(p(v_{ti}, v_{tj})) \cdot w(v_{ti}, v_{tj})$$
$Z_{ti}(v_{tj}) = |\{ v_{tk} | l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \}|$ is the cardinality of the subset containing $v_{tj}$; it normalizes each subset's contribution so that no label dominates.
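In practice this sum over sampled neighbors is usually implemented with adjacency matrices: each partition subset $k$ gets its own (pre-normalized) adjacency $A_k$ and its own $1 \times 1$ convolution weights, and the per-subset results are summed. A minimal PyTorch sketch under our own naming and shape assumptions:

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """f_out = sum_k A_k @ f_in @ W_k, one term per partition subset."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # A: (K, N, N) stack of adjacency matrices, one per partition subset
        self.register_buffer("A", A)
        self.K = A.size(0)
        # Implement the K weight matrices W_k as a single 1x1 convolution
        self.conv = nn.Conv2d(in_channels, out_channels * self.K, kernel_size=1)

    def forward(self, x):
        # x: (batch, C, T, N) -- features per frame and per joint
        B, C, T, V = x.size()
        x = self.conv(x)                       # (batch, K*C_out, T, N)
        x = x.view(B, self.K, -1, T, V)        # split features per subset
        # For each subset k: aggregate neighbors with A_k, then sum over k
        return torch.einsum("bkctv,kvw->bctw", x, self.A)
```

Here the normalization term $Z_{ti}$ is assumed to have been folded into each $A_k$ beforehand (e.g., $\Lambda_k^{-1} A_k$).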
###### Across Multiple Frames
Neighbors are extended across frames within a temporal window of size $G$:
$B(v_{ti}) = \{ v_{qj} | d(v_{tj}, v_{ti}) \le D, -G \le q - t \le G \}$
The labeling map is redefined accordingly:
$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + G) \times K$
###### Learnable Edge Importance
A joint may participate in several ***body parts***, and its edges matter differently depending on the action, so each layer learns an importance mask $M$ and uses the rescaled adjacency $A \odot M$ (element-wise product).
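A sketch of how the mask could be attached to the spatial convolution above (names and shapes are our assumptions; initializing $M$ to ones recovers the plain adjacency at the start of training):

```python
import torch
import torch.nn as nn

class GraphConvWithEdgeImportance(nn.Module):
    """Spatial graph conv whose adjacency is rescaled by a learnable mask M."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", A)                 # (K, N, N) adjacency stack
        self.M = nn.Parameter(torch.ones_like(A))    # edge importance mask
        self.conv = nn.Conv2d(in_channels, out_channels * A.size(0), 1)

    def forward(self, x):                            # x: (batch, C, T, N)
        B, C, T, V = x.size()
        K = self.A.size(0)
        x = self.conv(x).view(B, K, -1, T, V)
        # A ⊙ M: element-wise rescaling of every edge before aggregation
        return torch.einsum("bkctv,kvw->bctw", x, self.A * self.M)
```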
##### Model Enhancements
###### Using LSTM
The original ST-GCN captures only ***short-range temporal context***. Incorporating an LSTM provides ***long-range memory***; its outputs are concatenated with the original feature vectors.
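A minimal sketch of the idea, assuming the ST-GCN output has already been pooled over joints into per-frame features of shape (batch, T, C) (the module name and shapes are our assumptions):

```python
import torch
import torch.nn as nn

class TemporalLSTMHead(nn.Module):
    """Run an LSTM over per-frame features and concatenate with the input."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)

    def forward(self, feats):                     # feats: (B, T, C)
        out, _ = self.lstm(feats)                 # (B, T, hidden) long-range context
        return torch.cat([feats, out], dim=-1)    # (B, T, C + hidden)
```

A bidirectional variant simply passes `bidirectional=True` to `nn.LSTM`, doubling the width of the concatenated output.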

###### Using Attention
Attention lets each frame weigh information from all other frames, capturing temporal dependencies across the whole clip.
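One way to realize this with standard building blocks is self-attention over the frame axis, sketched below with PyTorch's `nn.MultiheadAttention` (our own assumption for how the attention enhancement could be wired; the channel width must be divisible by the number of heads):

```python
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Self-attention across frames of pooled per-frame features."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, C)
        out, _ = self.attn(feats, feats, feats)    # each frame attends to all frames
        return out + feats                         # residual connection
```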

###### Ensembling with Learnable Weights
Different model variants capture different aspects of the data, so we combine their predictions in an ensemble whose combination weights are learned rather than fixed.
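A sketch of such an ensemble, where a softmax keeps the learned combination weights positive and summing to one (module and variable names are our assumptions; the member models must produce logits of the same shape):

```python
import torch
import torch.nn as nn

class LearnableEnsemble(nn.Module):
    """Combine the logits of several trained models with learned weights."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)
        self.w = nn.Parameter(torch.zeros(len(models)))  # pre-softmax weights

    def forward(self, x):
        logits = torch.stack([m(x) for m in self.models])  # (M, B, classes)
        weights = torch.softmax(self.w, dim=0)             # positive, sum to 1
        return torch.einsum("m,mbc->bc", weights, logits)
```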

## Achieved Results
|Model|Top-1|Top-5|
|-|-|-|
|Baseline ST-GCN|30.70%|52.80%|
|+ 1 ST-GCN layer|30.82%|54.93%|
|+ 2 ST-GCN layers|30.85%|55.21%|
|+ Simple RNN|30.70%|53.02%|
|+ LSTM|33.77%|58.16%|
|+ Attention|38.19%|65.26%|
|+ 2 ST-GCN layers + LSTM|34.12%|60.18%|
|+ 2 ST-GCN layers + Attention|39.43%|62.10%|
Observations:
+ Adding extra ST-GCN layers improves results only slightly.
+ A simple RNN adds little, since it mostly captures short-range dependencies between nearby frames.
+ LSTM improves performance noticeably thanks to its long-range memory.
+ Attention captures long-range dependencies across the whole clip and gives the largest boost, beyond what extra layers, RNN, or LSTM provide.