---
tags: Biometric Recognition
---
# Cross-category Video Highlight Detection via Set-based Learning
> Automatic highlight detection
## Contribution
1. Propose a novel set-based learning mechanism that identifies whether a video segment is a highlight or not under a broader context
2. Utilize a dual-learner-based scheme to transfer concepts of highlight moments across different video categories
## Set-based Learning Module

### Pipeline:
1. A set of $N$ annotated segments is randomly sampled from the same video,
$$
x = \{(s_j, y_j)\}^N_{j=1}
$$
2. A pretrained, frozen C3D model $F$ extracts the feature embedding of each segment,
$$
z = \{z_j\}^N_{j=1} = \{F(s_j)\}^N_{j=1}
$$
3. On these segment embeddings, a Transformer encoder $T$ models the interrelationships among the segments and outputs the contextualized segment embeddings,
$$
\widetilde{z} = \{\widetilde{z}_j\}^N_{j=1} = T(z)
$$
4. On the contextualized segment embeddings, a scoring model $C$ (a multi-layer perceptron) predicts the highlight score of each video segment,
$$
\hat{y} = \{\hat{y}_j\}^N_{j=1} = \{C(\widetilde{z}_j)\}^N_{j=1}
$$
5. Learning objective, where $\sigma(\cdot)$ denotes the softmax function and $D_{KL}$ the Kullback–Leibler divergence (a runnable sketch of the whole module follows this list),
$$
L_{pred} = D_{KL}(\sigma(\{\hat{y}_j\}^N_{j=1}), \sigma(\{y_j\}^N_{j=1}))
$$
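
Below is a minimal PyTorch-style sketch of this module, assuming the frozen C3D extractor $F$ runs offline so the module consumes precomputed segment embeddings. The layer sizes (`embed_dim=512`, two encoder layers, the MLP widths) are illustrative choices rather than the paper's settings, and the KL direction follows PyTorch's `F.kl_div` convention, which may transpose the $D_{KL}$ argument order written above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SetBasedHighlightModule(nn.Module):
    """Transformer encoder T + MLP scorer C over a set of segment embeddings."""

    def __init__(self, embed_dim=512, n_heads=8, depth=2):
        super().__init__()
        # T: models the interrelationships among the N segments of a set.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # C: a multi-layer perceptron mapping each contextualized embedding
        # to a scalar highlight score.
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 2),
            nn.ReLU(),
            nn.Linear(embed_dim // 2, 1),
        )

    def forward(self, z):
        # z: (batch, N, embed_dim) segment embeddings from the frozen C3D F.
        z_tilde = self.encoder(z)                # contextualized embeddings
        return self.scorer(z_tilde).squeeze(-1)  # (batch, N) highlight scores


def set_prediction_loss(y_hat, y):
    # L_pred: F.kl_div(log p, q) computes D_KL(q || p), with q the
    # softmax-normalized annotations and p the softmax-normalized predictions.
    log_p = F.log_softmax(y_hat, dim=-1)
    q = F.softmax(y.float(), dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


if __name__ == "__main__":
    model = SetBasedHighlightModule()
    z = torch.randn(4, 8, 512)       # 4 sets, each with N=8 segment embeddings
    y = torch.randint(0, 2, (4, 8))  # binary highlight annotations
    loss = set_prediction_loss(model(z), y)
    loss.backward()
    print(loss.item())
```

Note that the softmax runs over the $N$ scores of each set, so $L_{pred}$ compares relative score distributions within a video rather than absolute per-segment values, which is what lets the module judge a segment "under a broader context".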
## Dual-Learner-based Video Highlight Detection

1. For each domain, a set of $N$ segments is randomly sampled from the same video
- Source Domain
$$
x_S = \{(s_j^S, y_j^S)\}^N_{j=1}
$$
- Target Domain, which provides segments without highlight annotations,
$$
x_T = \{(s_j^T)\}^N_{j=1}
$$
- Mixed Domain, with one half randomly sampled from $x_S$ and the other half from $x_T$,
$$
x_M = \{(s_j^M, y_j^M)\}^N_{j=1}, \quad y_j^M = 1\ \text{if}\ s_j^M \in x_T\ \text{else}\ 0
$$
2. Using the C3D feature extractor and the Transformer encoder, we derive the contextualized segment embeddings for these three sets,
$$
\widetilde{z}_S = \{\widetilde{z}_j^S\}^N_{j=1},\widetilde{z}_T = \{\widetilde{z}_j^T\}^N_{j=1},\widetilde{z}_M = \{\widetilde{z}_j^M\}^N_{j=1}
$$
3. Learning objectives for each domain
- Mixed Domain, where a coarse-grained learner $C_{coarse}$ learns the basic distinction between target video segments and source ones,
$$
L_{coarse} = D_{KL}(\sigma(\{\hat{y}_j^M\}^N_{j=1}), \sigma(\{y_j^M\}^N_{j=1}))
$$
- Source Domain, where a fine-grained learner $C_{fine}$ acquires knowledge about highlight moments in the source video category,
$$
L_{fine} = D_{KL}(\sigma(\{\hat{y}_j^S\}^N_{j=1}), \sigma(\{y_j^S\}^N_{j=1}))
$$
- Target Domain
  - The coarse-grained and fine-grained learners are both utilized to predict the highlight scores of the segments in the set $x_{T}$,
$$
\hat{y}_{T, coarse} = \{\hat{y}_j^{T, coarse}\}^N_{j=1} = \{C_{coarse}(\widetilde{z}_j^T)\}^N_{j=1}
\\
\hat{y}_{T, fine} = \{\hat{y}_j^{T, fine}\}^N_{j=1} = \{C_{fine}(\widetilde{z}_j^T)\}^N_{j=1}
\\
\hat{y}_{T, avg} = \{\hat{y}_j^{T, avg}\}^N_{j=1} = \{(\hat{y}_j^{T, coarse}+\hat{y}_j^{T, fine})/2\}^N_{j=1}
$$
  - Knowledge distillation is performed between the two learners, with the distillation loss defined as
$$
L_{distill} = \frac{1}{2}\big(D_{KL}(\sigma(\{\hat{y}_j^{T,avg}\}^N_{j=1}), \sigma(\{\hat{y}_j^{T,coarse}\}^N_{j=1})) + D_{KL}(\sigma(\{\hat{y}_j^{T,avg}\}^N_{j=1}), \sigma(\{\hat{y}_j^{T,fine}\}^N_{j=1}))\big)
$$
4. Total Loss, where $\lambda$ weights the distillation term (see the training-step sketch after this list)
$$
L_{total} = L_{coarse}+L_{fine}+\lambda L_{distill}
$$
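
To make the training flow concrete, here is a minimal PyTorch-style sketch of one dual-learner training step under the same assumptions as the module sketch above (precomputed C3D embeddings, illustrative layer sizes). The helper names `kl`, `dual_learner_step`, and `lam`, the use of the first $N/2$ segments of each set when building $x_M$, and the detaching of the averaged target scores are illustrative assumptions; as before, the KL direction follows `F.kl_div`'s convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def kl(pred_scores, target_scores):
    # F.kl_div(log p, q) computes D_KL(q || p) over the N segments of a set,
    # with the second argument normalized into the target distribution.
    return F.kl_div(F.log_softmax(pred_scores, dim=-1),
                    F.softmax(target_scores, dim=-1),
                    reduction="batchmean")


def dual_learner_step(encoder, c_coarse, c_fine, z_s, y_s, z_t, lam=1.0):
    # z_s, z_t: (B, N, D) source/target segment embeddings; y_s: (B, N) labels.
    B, N, _ = z_s.shape
    half = N // 2
    # Mixed set x_M: one half from x_S (domain label 0), the other half from
    # x_T (domain label 1).
    z_m = torch.cat([z_s[:, :half], z_t[:, :half]], dim=1)
    y_m = torch.cat([torch.zeros(B, half), torch.ones(B, half)], dim=1)

    zt_s, zt_t, zt_m = encoder(z_s), encoder(z_t), encoder(z_m)

    # L_coarse: C_coarse separates target segments from source ones on x_M.
    l_coarse = kl(c_coarse(zt_m).squeeze(-1), y_m)
    # L_fine: C_fine fits the highlight annotations on the source set x_S.
    l_fine = kl(c_fine(zt_s).squeeze(-1), y_s.float())

    # L_distill: both learners score x_T and are pulled toward the average of
    # their predictions (detached and treated as a fixed target here, an
    # implementation assumption).
    s_coarse = c_coarse(zt_t).squeeze(-1)
    s_fine = c_fine(zt_t).squeeze(-1)
    s_avg = ((s_coarse + s_fine) / 2).detach()
    l_distill = 0.5 * (kl(s_coarse, s_avg) + kl(s_fine, s_avg))

    return l_coarse + l_fine + lam * l_distill


if __name__ == "__main__":
    D = 512
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    def make_mlp():
        return nn.Sequential(nn.Linear(D, D // 2), nn.ReLU(),
                             nn.Linear(D // 2, 1))

    c_coarse, c_fine = make_mlp(), make_mlp()
    z_s, z_t = torch.randn(2, 8, D), torch.randn(2, 8, D)
    y_s = torch.randint(0, 2, (2, 8))
    loss = dual_learner_step(encoder, c_coarse, c_fine, z_s, y_s, z_t)
    loss.backward()
    print(loss.item())
```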