# Vision and Language
###### tags: `Deep Learning for Computer Vision`
An AI system that jointly understands both visual (image/video) and language (text/audio) modalities

## Text-to-Image Synthesis
### XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation
* Idea
    * Leverage intra-modal and inter-modal contrastive losses to improve text-to-image synthesis (a loss sketch follows below)
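
A minimal sketch of a batch-wise InfoNCE contrastive loss of the kind XMC-GAN builds on, shown for an inter-modal (generated image vs. caption) and an intra-modal (generated vs. real image) pairing; the encoders, feature shapes, and temperature are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.1):
    """InfoNCE loss: matched (a_i, b_i) pairs are positives,
    all other pairs in the batch are negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Hypothetical usage with placeholder encoders:
# fake_emb = image_encoder(generated_images)   # (B, D)
# real_emb = image_encoder(real_images)        # (B, D)
# txt_emb  = text_encoder(captions)            # (B, D)
# loss = (contrastive_loss(fake_emb, txt_emb)      # inter-modal: image <-> sentence
#         + contrastive_loss(fake_emb, real_emb))  # intra-modal: fake <-> real image
```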

## Image Captioning
### AoANet: Attention on Attention for Image Captioning
* Insight
    * Only some words in a caption require attending to meaningful visual content (e.g., subjects, verbs, objects)

* Idea
    * Extend the conventional attention mechanism to Attention on Attention (AoA)
    * Use a gating function to filter out redundant or irrelevant information (see the module sketch after this list)
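
Below is a minimal sketch of the AoA operation, assuming the concatenate-then-project formulation from the paper (an "information" vector modulated by a sigmoid gate); the dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    """AoA sketch: refine the result of a standard attention layer with a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)   # "information" vector
        self.gate = nn.Linear(2 * dim, dim)   # sigmoid attention gate

    def forward(self, query, attended):
        # query:    (B, L, D) original queries
        # attended: (B, L, D) output of a conventional attention layer
        x = torch.cat([attended, query], dim=-1)
        i = self.info(x)                      # candidate information
        g = torch.sigmoid(self.gate(x))       # how much of it to let through
        return g * i                          # filtered attention result
```

The gate lets the model suppress attended vectors that carry no useful visual content for the word currently being generated.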

## Visual Question Answering
### MCAN: Deep Modular Co-Attention Networks for Visual Question Answering
* Idea
    * Propose a deep modular co-attention network for VQA that stacks self-attention and question-guided attention units (a simplified layer is sketched below)
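
A rough sketch of one co-attention layer in the spirit of MCAN: self-attention over each modality plus question-guided attention over image regions. Residual connections, layer norm, and feed-forward sublayers are omitted, and the hyperparameters are assumptions rather than the paper's settings.

```python
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Simplified co-attention layer: self-attention (SA) per modality,
    then guided attention (GA) of image regions on question words."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.txt_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_ga = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # txt: (B, T, D) question word features; img: (B, R, D) region features
        txt, _ = self.txt_sa(txt, txt, txt)   # SA over question words
        img, _ = self.img_sa(img, img, img)   # SA over image regions
        img, _ = self.img_ga(img, txt, txt)   # GA: regions attend to question words
        return txt, img
```

Stacking several such layers in depth gives the "deep modular" co-attention encoder that the paper's name refers to.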


## Visual Reasoning

## Image Change Captioning

## Composed Query Image Retrieval

## Primary & Auxiliary Tasks

# Pre-Training on Visual & Text Data
## Unified Encoder
### UNITER: UNiversal Image-TExt Representation Learning
Proposes four pre-training tasks for a vision-language Transformer (an ITM sketch follows the list):

* MLM: Masked Language Modeling
    * Recover the masked words
* MRM: Masked Region Modeling
    * Regress the masked region features
* ITM: Image-Text Matching
    * Binary classification: 0 = mismatch, 1 = match
* WRA: Word-Region Alignment
    * Align words and image regions via Optimal Transport
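
As a concrete example of one of these objectives, here is a minimal sketch of an ITM head: a binary classifier on the fused image-text representation, trained on matched vs. randomly mismatched pairs. The module name, feature dimension, and input convention are assumptions, not UNITER's actual code.

```python
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchingHead(nn.Module):
    """ITM sketch: classify whether an (image, text) pair is aligned."""
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)   # 0: mismatch, 1: match

    def forward(self, fused_cls, labels):
        # fused_cls: (B, D) joint image-text representation from the Transformer
        # labels:    (B,)   1 for true pairs, 0 for pairs with a swapped caption
        logits = self.classifier(fused_cls)
        return F.cross_entropy(logits, labels)
```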
### Oscar/VinVL


### SimVLM: Simple Visual Language Model Pretraining with Weak Supervision



## Dual Encoder
### CLIP: Learning Transferable Visual Models From Natural Language Supervision


### ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
