# Vision and Language

###### tags: `Deep Learning for Computer Vision`

A smart AI system that has a joint understanding of both image/video and text/audio modalities.

![](https://i.imgur.com/RBLwXSQ.jpg)

## Text-to-Image Synthesis

### XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation

* Idea
    * Leverage intra-modal and inter-modal contrastive losses to improve text-to-image synthesis

![](https://i.imgur.com/vflSFUC.png)

## Image Captioning

### AoANet: Attention on Attention for Image Captioning

* Insight
    * Only some words attend to meaningful visual content (e.g., subjects, verbs, objects)

![](https://i.imgur.com/yxNMudL.png)

* Idea
    * Extend the conventional attention mechanism to Attention on Attention (AoA)
    * Use a gating function to filter out redundant information (see the code sketch at the end of this note)

![](https://i.imgur.com/oDDci0n.png)

## Visual Question Answering

### MCAN: Deep Modular Co-Attention Networks for Visual Question Answering

* Idea
    * Propose a deep modular co-attention model for VQA

![](https://i.imgur.com/b6BRlMF.jpg)
![](https://i.imgur.com/JsrDoUe.png)

## Visual Reasoning

![](https://i.imgur.com/xYwXO7x.jpg)

## Image Change Captioning

![](https://i.imgur.com/YArOTrv.jpg)

## Composed Query Image Retrieval

![](https://i.imgur.com/AXZeifD.png)

## Primary & Auxiliary Tasks

![](https://i.imgur.com/gU1RU2p.png)

# Pre-Training on Visual & Text Data

## Unified Encoder

### UNITER: UNiversal Image-TExt Representation Learning

Proposes four pre-training tasks for a vision-language Transformer:

![](https://i.imgur.com/pnRPz60.png)

* MLM: Masked Language Modeling
    * Recover the masked word
* MRM: Masked Region Modeling
    * Regress the masked region features
* ITM: Image-Text Matching
    * 0: mismatch, 1: match
* WRA: Word-Region Alignment
    * Align words and regions via Optimal Transport

### Oscar / VinVL

![](https://i.imgur.com/vGVyEPM.jpg)
![](https://i.imgur.com/7pIfgPn.png)

### SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

![](https://i.imgur.com/ce2lqCC.jpg)
![](https://i.imgur.com/IDHGWhG.png)
![](https://i.imgur.com/t1xE7Ys.png)

## Dual Encoder

### CLIP: Learning Transferable Visual Models From Natural Language Supervision

![](https://i.imgur.com/wC3rZ28.png)
![](https://i.imgur.com/QdOeC0L.png)

### ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

![](https://i.imgur.com/QIl225T.png)
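
Both CLIP and ALIGN train their dual encoders with a symmetric image-text contrastive objective: matched pairs in a batch are pulled together, all other pairings are pushed apart. Below is a minimal PyTorch sketch of that loss, assuming we already have an image encoder and a text encoder producing fixed-size embeddings; the function name and the fixed `temperature` value are illustrative (CLIP actually learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched
    image-text pairs. Both inputs are (B, D) embeddings from the two encoders."""
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) similarity matrix; entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: random embeddings stand in for encoder outputs
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(dual_encoder_contrastive_loss(imgs, txts))
```

Because the two encoders never need to attend to each other, embeddings can be pre-computed and compared with a single dot product, which is what makes dual-encoder models efficient for large-scale retrieval.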
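
Returning to the AoA module from the Image Captioning section above: the gate concatenates the attention query with the attended result, maps the pair to an "information" vector and a sigmoid gate, and multiplies them element-wise to suppress redundant attention output. The sketch below captures that idea under those assumptions; the class and layer names are illustrative, not taken from the official AoANet code.

```python
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    """Minimal sketch of an AoA-style gate applied on top of a
    conventional attention step."""

    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)   # candidate information vector
        self.gate = nn.Linear(2 * dim, dim)   # attention gate

    def forward(self, query, attended):
        # query:    (B, D) attention query
        # attended: (B, D) output of the underlying attention module
        x = torch.cat([attended, query], dim=-1)
        i = self.info(x)                   # what the attention step proposes
        g = torch.sigmoid(self.gate(x))    # how much of it to keep
        return g * i                       # gated (filtered) attention result


# Toy usage: gate a pretend attention output
if __name__ == "__main__":
    B, D = 4, 256
    q = torch.randn(B, D)
    v_hat = torch.randn(B, D)      # stand-in for an attention result
    aoa = AttentionOnAttention(D)
    print(aoa(q, v_hat).shape)     # torch.Size([4, 256])
```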