---
tags: Biometric Recognition
---
# Cross-category Video Highlight Detection via Set-based Learning
> Automatic highlight detection
## Contribution
1. Propose a novel set-based learning mechanism that identifies whether a video segment is a highlight or not under a broader context
2. Utilize a dual-learner-based scheme to transfer concepts of highlight moments across different video categories
## Set-based Learning Module

### Pipeline:
1. A set of $N$ annotated segments is randomly sampled from the same video,
$$
x = \{(s_j, y_j)\}^N_{j=1}
$$
2. A pretrained, frozen C3D model $F$ extracts the feature embedding of each segment,
$$
z = \{z_j\}^N_{j=1} = \{F(s_j)\}^N_{j=1}
$$
3. On these segment embeddings, a Transformer encoder $T$ models the interrelationships among the segments and outputs the contextualized segment embeddings,
$$
\widetilde{z} = \{\widetilde{z}_j\}^N_{j=1} = T(z)
$$
4. On the contextualized segment embeddings, a scoring model $C$ (a multi-layer perceptron) predicts the highlight score of each video segment,
$$
\hat{y} = \{\hat{y}_j\}^N_{j=1} = \{C(\widetilde{z}_j)\}^N_{j=1}
$$
5. Learning objective, where $\sigma(\cdot)$ denotes the softmax function and $D_{KL}$ the Kullback–Leibler divergence (a runnable sketch of the whole module follows this list),
$$
L_{pred} = D_{KL}(\sigma(\{\hat{y}_j\}^N_{j=1}), \sigma(\{y_j\}^N_{j=1}))
$$
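
Below is a minimal PyTorch-style sketch of this module, assuming the frozen C3D extractor $F$ runs offline so the module consumes precomputed segment embeddings. The layer sizes (`embed_dim=512`, two encoder layers, the MLP widths) are illustrative choices rather than the paper's settings, and the KL direction follows PyTorch's `F.kl_div` convention, which may transpose the $D_{KL}$ argument order written above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SetBasedHighlightModule(nn.Module):
    """Transformer encoder T + MLP scorer C over a set of segment embeddings."""

    def __init__(self, embed_dim=512, n_heads=8, depth=2):
        super().__init__()
        # T: models the interrelationships among the N segments of a set.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # C: a multi-layer perceptron mapping each contextualized embedding
        # to a scalar highlight score.
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 2),
            nn.ReLU(),
            nn.Linear(embed_dim // 2, 1),
        )

    def forward(self, z):
        # z: (batch, N, embed_dim) segment embeddings from the frozen C3D F.
        z_tilde = self.encoder(z)                # contextualized embeddings
        return self.scorer(z_tilde).squeeze(-1)  # (batch, N) highlight scores


def set_prediction_loss(y_hat, y):
    # L_pred: F.kl_div(log p, q) computes D_KL(q || p), with q the
    # softmax-normalized annotations and p the softmax-normalized predictions.
    log_p = F.log_softmax(y_hat, dim=-1)
    q = F.softmax(y.float(), dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


if __name__ == "__main__":
    model = SetBasedHighlightModule()
    z = torch.randn(4, 8, 512)       # 4 sets, each with N=8 segment embeddings
    y = torch.randint(0, 2, (4, 8))  # binary highlight annotations
    loss = set_prediction_loss(model(z), y)
    loss.backward()
    print(loss.item())
```

Note that the softmax runs over the $N$ scores of each set, so $L_{pred}$ compares relative score distributions within a video rather than absolute per-segment values, which is what lets the module judge a segment "under a broader context".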
## Dual-Learner-based Video Highlight Detection

1. For each domain, a set of $N$ segments is randomly sampled from the same video
- Source Domain
$$
x_S = \{(s_j^S, y_j^S)\}^N_{j=1}
$$
- Target Domain, which provides segments without highlight annotations,
$$
x_T = \{(s_j^T)\}^N_{j=1}
$$
- Mixed Domain, with one half randomly sampled from $x_S$ and the other half from $x_T$,
$$
x_M = \{(s_j^M, y_j^M)\}^N_{j=1}, \quad y_j^M = 1\ \text{if}\ s_j^M \in x_T\ \text{else}\ 0
$$
2. Using the C3D feature extractor and the Transformer encoder, we derive the contextualized segment embeddings for these three sets,
$$
\widetilde{z}_S = \{\widetilde{z}_j^S\}^N_{j=1},\widetilde{z}_T = \{\widetilde{z}_j^T\}^N_{j=1},\widetilde{z}_M = \{\widetilde{z}_j^M\}^N_{j=1}
$$
3. Learning objectives for each domain
- Mixed Domain, where a coarse-grained learner $C_{coarse}$ learns the basic distinction between target video segments and source ones,
$$
L_{coarse} = D_{KL}(\sigma(\{\hat{y}_j^M\}^N_{j=1}), \sigma(\{y_j^M\}^N_{j=1}))
$$
- Source Domain, where a fine-grained learner $C_{fine}$ acquires knowledge about highlight moments in the source video category,
$$
L_{fine} = D_{KL}(\sigma(\{\hat{y}_j^S\}^N_{j=1}), \sigma(\{y_j^S\}^N_{j=1}))
$$
- Target Domain
  - The coarse-grained and fine-grained learners are both utilized to predict the highlight scores of the segments in the set $x_{T}$,
$$
\hat{y}_{T, coarse} = \{\hat{y}_j^{T, coarse}\}^N_{j=1} = \{C_{coarse}(\widetilde{z}_j^T)\}^N_{j=1}
\\
\hat{y}_{T, fine} = \{\hat{y}_j^{T, fine}\}^N_{j=1} = \{C_{fine}(\widetilde{z}_j^T)\}^N_{j=1}
\\
\hat{y}_{T, avg} = \{\hat{y}_j^{T, avg}\}^N_{j=1} = \{(\hat{y}_j^{T, coarse}+\hat{y}_j^{T, fine})/2\}^N_{j=1}
$$
  - Knowledge distillation is performed between the two learners, with the distillation loss defined as
$$
L_{distill} = \frac{1}{2}\big(D_{KL}(\sigma(\{\hat{y}_j^{T,avg}\}^N_{j=1}), \sigma(\{\hat{y}_j^{T,coarse}\}^N_{j=1})) + D_{KL}(\sigma(\{\hat{y}_j^{T,avg}\}^N_{j=1}), \sigma(\{\hat{y}_j^{T,fine}\}^N_{j=1}))\big)
$$
4. Total Loss, where $\lambda$ weights the distillation term (see the training-step sketch after this list)
$$
L_{total} = L_{coarse}+L_{fine}+\lambda L_{distill}
$$
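
To make the training flow concrete, here is a minimal PyTorch-style sketch of one dual-learner training step under the same assumptions as the module sketch above (precomputed C3D embeddings, illustrative layer sizes). The helper names `kl`, `dual_learner_step`, and `lam`, the use of the first $N/2$ segments of each set when building $x_M$, and the detaching of the averaged target scores are illustrative assumptions; as before, the KL direction follows `F.kl_div`'s convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def kl(pred_scores, target_scores):
    # F.kl_div(log p, q) computes D_KL(q || p) over the N segments of a set,
    # with the second argument normalized into the target distribution.
    return F.kl_div(F.log_softmax(pred_scores, dim=-1),
                    F.softmax(target_scores, dim=-1),
                    reduction="batchmean")


def dual_learner_step(encoder, c_coarse, c_fine, z_s, y_s, z_t, lam=1.0):
    # z_s, z_t: (B, N, D) source/target segment embeddings; y_s: (B, N) labels.
    B, N, _ = z_s.shape
    half = N // 2
    # Mixed set x_M: one half from x_S (domain label 0), the other half from
    # x_T (domain label 1).
    z_m = torch.cat([z_s[:, :half], z_t[:, :half]], dim=1)
    y_m = torch.cat([torch.zeros(B, half), torch.ones(B, half)], dim=1)

    zt_s, zt_t, zt_m = encoder(z_s), encoder(z_t), encoder(z_m)

    # L_coarse: C_coarse separates target segments from source ones on x_M.
    l_coarse = kl(c_coarse(zt_m).squeeze(-1), y_m)
    # L_fine: C_fine fits the highlight annotations on the source set x_S.
    l_fine = kl(c_fine(zt_s).squeeze(-1), y_s.float())

    # L_distill: both learners score x_T and are pulled toward the average of
    # their predictions (detached and treated as a fixed target here, an
    # implementation assumption).
    s_coarse = c_coarse(zt_t).squeeze(-1)
    s_fine = c_fine(zt_t).squeeze(-1)
    s_avg = ((s_coarse + s_fine) / 2).detach()
    l_distill = 0.5 * (kl(s_coarse, s_avg) + kl(s_fine, s_avg))

    return l_coarse + l_fine + lam * l_distill


if __name__ == "__main__":
    D = 512
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    def make_mlp():
        return nn.Sequential(nn.Linear(D, D // 2), nn.ReLU(),
                             nn.Linear(D // 2, 1))

    c_coarse, c_fine = make_mlp(), make_mlp()
    z_s, z_t = torch.randn(2, 8, D), torch.randn(2, 8, D)
    y_s = torch.randint(0, 2, (2, 8))
    loss = dual_learner_step(encoder, c_coarse, c_fine, z_s, y_s, z_t)
    loss.backward()
    print(loss.item())
```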