# Notes on [ViLBERT](https://arxiv.org/pdf/1908.02265.pdf)

## Idea

* Propose a model for learning task-agnostic visual grounding.
* Utilize co-attentional transformer layers to communicate between the vision and language streams.
* Use the Conceptual Captions dataset (approx. 3.3 million image-caption pairs).
<!-- * Train the model on two proxy tasks:
    * Predicting the semantics of masked words and image regions given the unmasked inputs
    * Predicting whether an image and text segment correspond -->
* Similar to BERT, they train ViLBERT on 2 proxy tasks:
    * masked multi-modal modelling
    * multi-modal alignment prediction
* Perform 4 downstream tasks:
    * Visual Question Answering
    * Visual Commonsense Reasoning
    * Referring expressions
    * Caption-based image retrieval

## Method

* Directly discretizing the visual space via clustering to create tokens and treating them like text input may result in:
    * Discretization error from clustering and loss of visual detail.
    * Treating both modalities identically, ignoring the different processing each may require.
* ViLBERT consists of 2 parallel BERT-style streams, each a series of transformer and co-attentional transformer blocks.
* Co-attentional transformer layers enable information exchange between the two modalities.

#### Co-Attentional Transformer Layers

![](https://i.imgur.com/csv5LoH.png)

Multi-head attention is performed using keys and values from the other modality, so each stream's queries attend over the other stream's representations (see the sketch at the end of these notes).

Image region features are generated by extracting bounding boxes with a pre-trained object detection network. Unlike words in text, image regions lack a natural ordering, so spatial location is encoded instead: a 5-d vector is constructed from the region position (normalized top-left and bottom-right coordinates) and the fraction of image area covered. This vector is projected to match the dimension of the visual features and the two are summed.

For masked image regions, the model predicts a distribution over semantic classes for the corresponding region. The target distribution for that region is taken from the same pre-trained detection model used for feature extraction, and the model is trained to minimize the KL divergence between the two distributions. Why KL? Presumably because the supervision signal is itself a soft distribution output by the detector rather than a hard class label.
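
As a concrete reading of the co-attention figure above, here is a minimal PyTorch sketch of one co-attentional block: visual queries attend over linguistic keys/values and vice versa. The feature widths, number of heads, shared projection, and class names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attentional block: each stream's queries attend to the other
    stream's keys/values. Dimensions here are illustrative, not the paper's."""

    def __init__(self, d_vis=2048, d_txt=768, d_model=768, n_heads=8):
        super().__init__()
        # Project both streams to a common width so they can attend to each other.
        self.vis_proj = nn.Linear(d_vis, d_model)
        self.txt_proj = nn.Linear(d_txt, d_model)
        # Queries come from one modality, keys/values from the other.
        self.vis_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(d_model)
        self.txt_norm = nn.LayerNorm(d_model)

    def forward(self, vis_feats, txt_feats):
        v = self.vis_proj(vis_feats)   # (batch, num_regions, d_model)
        t = self.txt_proj(txt_feats)   # (batch, num_tokens,  d_model)
        # Visual stream: queries = visual, keys/values = linguistic.
        v_out, _ = self.vis_attends_txt(query=v, key=t, value=t)
        # Linguistic stream: queries = linguistic, keys/values = visual.
        t_out, _ = self.txt_attends_vis(query=t, key=v, value=v)
        # Residual connection + layer norm, as in a standard transformer block.
        return self.vis_norm(v + v_out), self.txt_norm(t + t_out)

# Toy usage: 36 region features per image, 20 token embeddings per caption.
block = CoAttentionBlock()
vis_out, txt_out = block(torch.randn(2, 36, 2048), torch.randn(2, 20, 768))
```

In the full model these co-attentional blocks are interleaved with ordinary transformer blocks within each stream, as noted in the Method bullets above.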
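
A small sketch of the 5-d spatial encoding described above, assuming boxes arrive as absolute (x1, y1, x2, y2) pixel coordinates; the 2048-d feature size and helper names are assumptions, not from the paper's code.

```python
import torch
import torch.nn as nn

def region_position_features(boxes, img_w, img_h):
    """5-d spatial vector per region: normalized (x1, y1, x2, y2) plus the
    fraction of image area covered. `boxes` has shape (num_regions, 4)."""
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac], dim=-1)

d_vis = 2048                                          # assumed detector feature size
loc_proj = nn.Linear(5, d_vis)                        # project 5-d vector to feature width
boxes = torch.tensor([[48.0, 32.0, 320.0, 240.0]])    # one example box
region_feats = torch.randn(1, d_vis)                  # stand-in detector features
# Summed projection + visual feature, as described in the notes above.
region_input = region_feats + loc_proj(region_position_features(boxes, img_w=640, img_h=480))
```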
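
And a sketch of the masked-region objective under the stated assumptions: the model's class logits for a masked region are matched to the detector's output distribution via KL divergence. The class count and tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

num_classes = 1601                              # placeholder detector label space
region_logits = torch.randn(4, num_classes)     # ViLBERT head outputs for 4 masked regions
detector_probs = F.softmax(torch.randn(4, num_classes), dim=-1)  # stand-in detector distribution

# F.kl_div expects log-probabilities as input and probabilities as target.
loss = F.kl_div(F.log_softmax(region_logits, dim=-1), detector_probs, reduction="batchmean")
```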