# Notes on "[Multi-modal Transformer for Video Retrieval](https://)"
## Idea
* Tackle the tasks of caption to video and video to caption retrieval.
* Previous works do not fully exploit cross-modal cues for these tasks: they either ignore the multi-modal signal, treat modalities separately, or only use a gating mechanism to modulate certain modality dimensions. Some also discard long-term temporal information, which is useful for video retrieval.
* Common approach for the retrieval task: compute the inner product between each video embedding and each caption embedding to obtain a similarity score.
* While learning representations of text has been studied exhaustively, learning accurate representations of video is still difficult.
* To obtain an accurate video representation, all the modalities need to be used.
### Method
* Learn a function $s$ to compute the similarity between video-caption pairs. That is, given a dataset of caption-video pairs $\{(v_1, c_1), (v_2, c_2), ..., (v_n, c_n)\}$, the goal is to learn $s(v_i, c_j)$ such that it is high for $i=j$ and low for $i \neq j$. This requires accurate representations of videos and captions.
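A minimal sketch of this scoring scheme (illustrative tensor names and sizes, not the paper's implementation): all caption and video embeddings are compared by inner products, giving an $n \times n$ similarity matrix whose diagonal should end up large after training.

```python
import torch

# Hypothetical embeddings: n captions and n videos, both in a shared dimension d.
n, d = 4, 512
caption_emb = torch.randn(n, d)   # one row per caption
video_emb = torch.randn(n, d)     # one row per video

# Similarity matrix: entry (i, j) is the inner product s(v_j, c_i).
# Matching pairs lie on the diagonal, so high diagonal values are desired.
sim = caption_emb @ video_emb.t()   # shape (n, n)

# Retrieval: for caption i, rank videos by descending similarity.
ranks = sim.argsort(dim=1, descending=True)
```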
### Video Representation

#### Feature Embedding
* Use $N$ pretrained expert models for different tasks; expert $n$ produces a sequence $F^n(v)=[F_1^n, \dots, F_K^n]$ of $K$ features. An aggregated embedding for each expert is obtained by $F_{agg}^n = \operatorname{maxpool}(\{F_k^n\}_{k=1}^K)$. The input features of the video encoder are then $F(v)=[F_{agg}^1, F_1^1, \dots, F_K^1, \dots, F_{agg}^N, F_1^N, \dots, F_K^N]$ (see the sketch after the expert list below).
7 experts are used:
1. Motion: S3D
2. Audio: VGGish model
3. Scene: DenseNet
4. OCR:
5. Face: SSD face detector extracts bounding boxes, which are then passed through a ResNet50 trained for face classification.
6. Speech: Google Cloud Speech-to-Text API; the detected words are encoded with word2vec.
7. Appearance: SENet-154
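A rough sketch of how the video-encoder input could be assembled from the per-expert features described above. The sizes are placeholders, and the assumption that all expert features have already been projected to a common dimension $D$ is mine.

```python
import torch

N_EXPERTS, K, D = 7, 30, 512   # illustrative: 7 experts, K features each, common dim D

# Hypothetical per-expert feature sequences F^n(v) = [F_1^n, ..., F_K^n],
# assumed already projected to a common dimension D.
expert_feats = [torch.randn(K, D) for _ in range(N_EXPERTS)]

video_input = []
for F_n in expert_feats:
    # Aggregated expert embedding: max-pool over the K features.
    F_agg = F_n.max(dim=0).values                        # shape (D,)
    # Input sequence per expert: [F_agg^n, F_1^n, ..., F_K^n]
    video_input.append(torch.cat([F_agg.unsqueeze(0), F_n], dim=0))

# Full video-encoder input: shape (N * (K + 1), D)
video_input = torch.cat(video_input, dim=0)
```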
#### Expert embeddings
To indicate which expert each feature embedding corresponds to, a sequence of expert embeddings is added to the input.

#### Temporal embeddings
Temporal embeddings are added to provide information about when in the video each feature occurs.
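A minimal sketch of adding expert and temporal embeddings to the feature sequence before the transformer. Using `nn.Embedding` tables and reserving time index 0 for the aggregated token are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

N_EXPERTS, K, D = 7, 30, 512                 # illustrative sizes, as in the sketch above
seq = torch.randn(N_EXPERTS * (K + 1), D)    # video-encoder input built from the experts

# Learned embeddings identifying which expert each token comes from.
expert_emb = nn.Embedding(N_EXPERTS, D)
expert_ids = torch.arange(N_EXPERTS).repeat_interleave(K + 1)

# Learned embeddings encoding the time at which each feature was extracted;
# here index 0 is reserved for the aggregated token (an assumption).
temporal_emb = nn.Embedding(K + 1, D)
time_ids = torch.arange(K + 1).repeat(N_EXPERTS)

transformer_input = seq + expert_emb(expert_ids) + temporal_emb(time_ids)
```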


### Caption representation
$\phi(c)=\{\phi^i\}_{i=1}^N$, where $\phi = g \circ h$: $h$ is the $[CLS]$ output of BERT, and $g$ maps it to $N$ expert-specific caption embeddings.
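A hedged sketch of the caption side, assuming the HuggingFace `transformers` BERT and replacing the paper's projection $g$ with plain linear layers (one per expert), which is a simplification.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

N_EXPERTS, D = 7, 512

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# h: the [CLS] output of BERT for the caption.
tokens = tokenizer("a man is playing guitar", return_tensors="pt")
h = bert(**tokens).last_hidden_state[:, 0]           # shape (1, 768)

# g: one projection per expert (plain linear layers here; a simplification
# of the paper's projection into each expert's embedding space).
g = nn.ModuleList([nn.Linear(768, D) for _ in range(N_EXPERTS)])
caption_embs = [g_i(h) for g_i in g]                 # N expert-specific caption embeddings
```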
## Similarity estimation
Weighted sum of each expert's video-caption similarity.
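A small sketch of the weighted-sum similarity. Predicting the weights from the caption and normalizing them with a softmax is my assumption about the exact weighting scheme.

```python
import torch
import torch.nn.functional as F

N_EXPERTS, D = 7, 512

# Per-expert caption and video embeddings (illustrative tensors).
caption_embs = torch.randn(N_EXPERTS, D)
video_embs = torch.randn(N_EXPERTS, D)

# Per-expert similarities (inner products).
per_expert_sim = (caption_embs * video_embs).sum(dim=-1)   # shape (N_EXPERTS,)

# Mixture weights; here drawn at random, in practice predicted from the caption.
logits = torch.randn(N_EXPERTS)
weights = F.softmax(logits, dim=0)

similarity = (weights * per_expert_sim).sum()
```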

## Training
Train the model with a bi-directional max-margin ranking loss.
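A sketch of a bi-directional max-margin ranking loss on a batch similarity matrix; the margin value and normalization are illustrative choices, not necessarily the paper's.

```python
import torch

def bidirectional_max_margin_loss(sim, margin=0.2):
    """Bi-directional max-margin ranking loss on a (B, B) similarity matrix
    whose diagonal holds the scores of the matching caption-video pairs."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)
    # caption-to-video direction: every other video in the batch is a negative
    loss_c2v = (margin + sim - pos).clamp(min=0)
    # video-to-caption direction: every other caption in the batch is a negative
    loss_v2c = (margin + sim - pos.t()).clamp(min=0)
    # exclude the diagonal (positive pairs) from both terms
    off_diag = 1.0 - torch.eye(B, device=sim.device)
    return ((loss_c2v + loss_v2c) * off_diag).sum() / (2 * B * (B - 1))
```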

## Metrics
Standard retrieval metrics:
1. recall at rank N (R@N, higher is better)
2. median rank (MdR, lower is better)
3. mean rank (MnR, lower is better).
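For reference, these metrics can be computed from a batch similarity matrix roughly as follows (diagonal entries assumed to be the ground-truth pairs):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1, R@5, R@10, median rank and mean rank from a (B, B)
    similarity matrix whose diagonal entries are the ground-truth pairs."""
    B = sim.shape[0]
    # rank of the ground-truth item for each query (1 = best)
    order = np.argsort(-sim, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(B)])
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }
```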