# Notes on "[Multi-modal Transformer for Video Retrieval](https://)" ## Idea * Tackle the tasks of caption to video and video to caption retrieval. * Previous works do not fully exploit the cross-modal cues for the tasks; ignore multi-modal signal, treat modalities separately or use a gating mechanism to modulate certain modality dimensions. Some also dicard long-term temporal information which is useful for video retrieval task. * Common approach for retireival task: Perform inner product to get the similarity score of each video with each caption. * While learning representation of text is exhaustively studied, learning accurate representations of video is still difficult. * To obtain an accurate video representation, all the modalities need to be used. ### Method * Learn a function $s$ to compte similarity between video and caption pairs. That is, given a dataset of caption-video pairs $\{(v_1, c_1), (v_2, c_2), ..., (v_n, c_n)\}$, the goal s to learn $s(v_i, c_j)$; it should be high for $i=j$ and low when equality doesn't hold. This requres accurate representations of videos and captions. ### Video Representation ![](https://i.imgur.com/RSy8ztZ.png) #### Feature Embedding * Use $N$ pretrained models for different tasks, they give a sequence $Fˆn(v)=[F_1ˆn, ..., F_Kˆn]$ of $K$ dimension. An agrregated embedding of an expert is obtained by $F_{agg}ˆn = maxpool(\{F_kˆn\})_{k=1}ˆK$. So, the input features of video encoder is $F(v)=[F_{agg}ˆ1, F_{}ˆ1, F_2ˆ1, ..., F_{agg}ˆN, F_1ˆN, ..., F_KˆN]$ 7 experts are used: 1. Moion: S3D 2. Audio: VGGish model 3. Scene: DenseNet 4. OCR: 5. Face: DDS face detector extracts bounding boxes, which are then passed through a ResNet50 trained for face classification. 6. Speech: Google cloud seech to text API, detected words are encoded with word2vec. 7. Appearance: SENet-154 #### Experts embeddings To understand, to which expert does the embedding correspond to, a sequence of expert embedding is added. ![](https://i.imgur.com/yTe1VTS.png) #### Temporal embeddings To provide temporal information about time in video. ![](https://i.imgur.com/L4Wkflb.png) ![](https://i.imgur.com/Vgg0f3L.png) ### Caption representation $\phi(c)=\{\phiˆi\}_{i=1}ˆN; \phi = g o h$ where $h$ is the $[CLS]$ output of BERT. ## Similarity estimation Weightesd sum of each expert's video-caption similarity. ![](https://i.imgur.com/wztR3jL.png) ## Training Train the model with bi-directional max-argin ranking loss ![](https://i.imgur.com/01aOzEM.png) ## Metrics Standard retrieval metrics: 1. recall at rank N (R@N, higher is better) 2. median rank (MdR, lower is better) 3. mean rank (MnR, lower is better).