<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://proceedings.mlr.press/v202/zhao23l/zhao23l.pdf) | ICML 2023

:::success
**Thoughts**
This study presents a simple yet effective representation learning method, RLEG, guided by diffusion-based embedding generators. Diffusion models generate embeddings online to aid in learning effective vision-language representations: pretrained generators transfer embeddings between the vision and language domains, and the generated embeddings serve as augmented samples for contrastive learning.
:::

## Abstract
Vision-language models like CLIP excel at a wide range of tasks but require large datasets for training. Generative diffusion models, such as DALL-E 2, show that diverse, high-quality samples can be produced by sampling from a learned generative distribution. Leveraging this generative capability, this study proposes a novel method called **Representation Learning with diffusion-based Embedding Generation (RLEG)**, which uses diffusion models to generate feature embeddings online for effective vision-language representation learning.

## Background
Vision-language representation learning requires large datasets, but collecting high-quality image-text pairs is challenging. This study aims to learn robust representations by using generative models to create diverse training samples online, inspired by the hypothesis that real-world data resides on a low-dimensional manifold within a high-dimensional space.

![image](https://hackmd.io/_uploads/ryVEfChcR.png)

First, the input image and text are encoded by their respective encoders to obtain input embeddings for alignment. Next, diffusion-based embedding generators create image embeddings from text embeddings and text embeddings from image embeddings; sampling multiple times yields additional embeddings, enhancing data augmentation in the feature space. Finally, the input and generated embeddings are aligned with a unified contrastive learning scheme.

## Method
![image](https://hackmd.io/_uploads/rJZzMAnqA.png)

### Vision-Language Contrastive Learning
Given a set of image-text pairs $\{ \boldsymbol{x}_i, \boldsymbol{y}_i \}_{i=1}^N$, where $\boldsymbol{x}_i$ is an image and $\boldsymbol{y}_i$ is its corresponding text description, two encoders are used: one for images and one for text. The image encoder extracts image feature vectors $\{ \boldsymbol{v}_i \}_{i=1}^N$, and the text encoder extracts text feature vectors $\{ \boldsymbol{t}_i \}_{i=1}^N$. Within each training batch, the image-text embedding pairs $\{ \boldsymbol{v}_i, \boldsymbol{t}_i \}_{i=1}^B$ are aligned with the InfoNCE loss.

:::info
InfoNCE (Information Noise Contrastive Estimation) is a contrastive loss that maximizes the similarity between positive pairs while minimizing the similarity to negative pairs. For a batch of $B$ normalized embedding pairs and temperature $\tau$, the image-to-text direction is
$$\mathcal{L}_{v \rightarrow t} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\boldsymbol{v}_i^\top \boldsymbol{t}_i / \tau)}{\sum_{j=1}^{B} \exp(\boldsymbol{v}_i^\top \boldsymbol{t}_j / \tau)},$$
and the text-to-image term is defined symmetrically. This loss is widely used in unsupervised and self-supervised learning, such as contrastive learning of images and text.
:::

### Diffusion-based Embedding Generation
Diffusion-based embedding generators, pre-trained as in DALL-E 2, are used in the proposed framework to translate embeddings between the image and text domains.

### Generative Distribution Guidance
Generated embeddings from a diffusion-based generator serve as augmented samples drawn from a generative distribution. These samples can be generated without limit, expanding the finite real-world training data. A sketch of how these pieces could fit together in a single training step is given below.
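The following is a minimal PyTorch-style sketch of such a training step with generative distribution guidance. The callables `image_encoder`, `text_encoder`, `text_to_image_prior`, and `image_to_text_prior` are hypothetical stand-ins (not the authors' released code), and the equal weighting of the contrastive terms is an assumption; only the symmetric InfoNCE formulation follows the standard definition above.

```python
# Minimal sketch of a unified contrastive step with generated embeddings.
# `image_encoder`, `text_encoder`, `text_to_image_prior`, and `image_to_text_prior`
# are hypothetical stand-ins; the loss weighting is an assumption, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings.

    Matching rows are positives; every other row in the batch is a negative.
    """
    logits = a @ b.t() / temperature                     # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def rleg_step(images, texts, image_encoder, text_encoder,
              text_to_image_prior, image_to_text_prior, temperature: float = 0.07):
    # 1) Encode the real image-text pairs into embeddings v and t.
    v = F.normalize(image_encoder(images), dim=-1)
    t = F.normalize(text_encoder(texts), dim=-1)

    # 2) Sample cross-modal embeddings from the frozen diffusion-based generators;
    #    no gradients flow into the pre-trained priors.
    with torch.no_grad():
        v_gen = F.normalize(text_to_image_prior(t), dim=-1)   # image embedding generated from text
        t_gen = F.normalize(image_to_text_prior(v), dim=-1)   # text embedding generated from image

    # 3) Align real and generated embeddings: the generated embeddings act as
    #    augmented samples in the feature space (equal weights assumed here).
    loss = (info_nce(v, t, temperature)
            + info_nce(v, t_gen, temperature)
            + info_nce(v_gen, t, temperature)) / 3
    return loss
```

Sampling the priors several times per pair, as described in the Background, would simply add more generated embeddings before the contrastive terms are computed.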
## Experiment
### Model
This study uses a BERT-like 12-layer Transformer as the text encoder and a Vision Transformer (ViT-B/32) as the vision encoder. The diffusion prior model from DALL-E 2, pre-trained on LAION-400M, serves as the embedding generator.

### Dataset
The proposed model is trained on YFCC-15M, the subset of YFCC-100M used in CLIP.

### Downstream tasks
The model is evaluated on downstream tasks: image classification on ImageNet-1K and image-text retrieval on COCO and Flickr30K. The table below compares previous vision-language pretraining methods with different forms of supervision and augmentation on both vision and vision-language tasks; all models are evaluated with the same backbone, dataset, and training settings.

![image](https://hackmd.io/_uploads/r1Bly1pcC.png)
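As a concrete reference for the ImageNet evaluation above, here is a minimal sketch of the standard CLIP-style zero-shot classification protocol. The callables `image_encoder`, `text_encoder`, and `tokenize` are hypothetical stand-ins, and the single prompt template is a simplification (zero-shot evaluation typically ensembles many templates).

```python
# Minimal sketch of CLIP-style zero-shot classification, the usual protocol for the
# ImageNet evaluation above. `image_encoder`, `text_encoder`, and `tokenize` are
# hypothetical stand-ins; a single prompt template is used here for brevity.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(images, class_names, image_encoder, text_encoder, tokenize):
    # Build one text embedding per class from a natural-language prompt.
    prompts = [f"a photo of a {name}" for name in class_names]
    class_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (C, D)

    # Encode the images and score them against every class embedding.
    img_emb = F.normalize(image_encoder(images), dim=-1)               # (B, D)
    logits = img_emb @ class_emb.t()                                   # (B, C) cosine similarities
    return logits.argmax(dim=-1)                                       # predicted class index per image
```

Image-text retrieval on COCO and Flickr30K follows the same pattern: embed both modalities, compute the similarity matrix, and rank candidates by cosine similarity.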