<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2304.08485) | [Note link](https://blog.csdn.net/qq_40491305/article/details/131400432) | [Code link](https://github.com/haotian-liu/LLaVA) | NeurIPS 2023
:::success
**Thoughts**
This study demonstrated the effectiveness of visual instruction tuning.
They introduced an automatic pipeline for generating language-image instruction-following data, which was used to train LLaVA, a multimodal model capable of following human intent to complete visual tasks.
:::
## Abstract
Building on the trend of using machine-generated instruction-following data, LLaVA (Large Language and Vision Assistant) is an end-to-end multimodal model that connects a vision encoder with a large language model (LLM).
This combination enables general-purpose understanding of both visual and textual information.
## Background
Large language models (LLMs) have demonstrated that language can serve as a universal interface for a general-purpose assistant, where various task instructions are explicitly represented in text to guide the neural assistant in switching between tasks.
When LLMs are connected with multi-modal vision-and-language instructions, they extend this capability further, enabling the assistant to interpret and solve tasks that involve both visual and textual information seamlessly.
### GPT-assisted Visual Instruction Data Generation
Currently, the availability of multimodal instruction-following data is limited.
To address this, the study leverages ChatGPT/GPT-4 to collect such data, building on widely available image-text pair datasets.
Given an image $\mathbf{X}_\mathrm{v}$ and its associated caption $\mathbf{X}_\mathrm{c}$, a set of questions $\mathbf{X}_\mathrm{q}$ can be generated with the intent of instructing the assistant to describe the image content.
A straightforward method to convert an image-text pair into an instruction-following format is as follows:
**Human:** $\mathbf{X}_\mathrm{q} \ \mathbf{X}_\mathrm{v}$ \<STOP>
**Assistant:** $\mathbf{X}_\mathrm{c}$ \<STOP>
However, this approach lacks diversity and in-depth reasoning in both the instructions and responses.
To address this limitation, the study leverages language-only GPT-4 or ChatGPT as strong teachers (both accept only text as input) to generate more sophisticated instruction-following data involving visual content.
The table below provides an example of such instruction-following data.

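Since the teacher models only accept text, the image is passed to them as a symbolic representation; in the paper this consists of the captions and object bounding boxes, and the teacher is prompted to produce three response types (conversation, detailed description, complex reasoning). Below is a minimal Python sketch of how such a text-only query could be assembled; the function names and prompt wording are illustrative assumptions, not the paper's actual prompts, and the call to the teacher model itself is omitted.
```python
# Minimal sketch: turning an image's symbolic representation (captions +
# bounding boxes) into a text-only query for a teacher model (ChatGPT/GPT-4).
# Function names and prompt wording are illustrative, not the paper's prompts.

def build_symbolic_context(captions, boxes):
    """Textual stand-in for the image that a text-only LLM can read."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(
        f"{label}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return f"Captions:\n{caption_block}\n\nObjects (normalized boxes):\n{box_block}"

def build_teacher_prompt(captions, boxes, response_type="conversation"):
    """response_type is one of: conversation, detailed description,
    complex reasoning -- the three types collected in the paper."""
    context = build_symbolic_context(captions, boxes)
    return (
        "You are describing an image you cannot see, based only on the "
        f"following annotations.\n\n{context}\n\n"
        f"Generate {response_type}-style question-answer pairs as if you "
        "were looking directly at the image."
    )

# Example usage with toy annotations (the resulting string would be sent to the teacher)
prompt = build_teacher_prompt(
    captions=["A group of people standing outside a black vehicle with luggage."],
    boxes=[("person", (0.68, 0.24, 0.77, 0.69)), ("suitcase", (0.43, 0.67, 0.55, 0.91))],
)
```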
## Method
### Architecture
The primary goal is to effectively harness the strengths of both the pre-trained LLM and the visual model.
Below is the network architecture.

### Training
They use Vicuna as the LLM $f_\phi(\cdot)$ and the pre-trained CLIP visual encoder ViT-L/14 for input images $\mathbf{X}_\mathrm{v}$, generating visual features $\mathbf{Z}_\mathrm{v} = g(\mathbf{X}_\mathrm{v})$.
A linear layer with a trainable projection matrix $\mathbf{W}$ maps these features into the word embedding space, producing language embedding tokens $\mathbf{H}_\mathrm{v}$ that match the dimensionality of the LLM's word embeddings:
$$
\mathbf{H}_\mathrm{v} = \mathbf{W} \cdot \mathbf{Z}_\mathrm{v}
$$
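A minimal PyTorch sketch of this projection is given below; the feature dimension (1024 for CLIP ViT-L/14), the LLM embedding size (4096), the number of visual tokens, and the bias-free linear layer are illustrative assumptions rather than the exact configuration.
```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the trainable projection W that maps visual features
    Z_v = g(X_v) into the LLM word-embedding space, producing H_v.
    Dimensions below are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim, bias=False)  # the matrix W

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_visual_tokens, vision_dim) features from the CLIP encoder
        # returns H_v: (batch, num_visual_tokens, llm_dim) language embedding tokens
        return self.proj(z_v)

# Dummy features standing in for CLIP ViT-L/14 outputs
z_v = torch.randn(1, 256, 1024)
h_v = VisualProjector()(z_v)
print(h_v.shape)  # torch.Size([1, 256, 4096])
```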
For each image $\mathbf{X}_{\mathrm{v}}$, they generate multi-turn conversation data $\left(\mathbf{X}_{\mathrm{q}}^1, \mathbf{X}_{\mathrm{a}}^1, \dots, \mathbf{X}_{\mathrm{q}}^T, \mathbf{X}_{\mathrm{a}}^T\right)$, where $T$ is the total number of turns.
These are organized into a sequence by treating all answers as the assistant's responses.
The instruction at the $t$-th turn $\mathbf{X}_{\text{instruct}}^t$ is defined as:
$$
\mathbf{X}_{\text{instruct}}^t = \begin{cases}
\text{Randomly choose } \left[\mathbf{X}_{\mathrm{q}}^1, \mathbf{X}_{\mathrm{v}}\right] \text{ or } \left[\mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\mathrm{q}}^1\right], & \text{if } t=1 \\
\mathbf{X}_{\mathrm{q}}^t, & \text{if } t > 1
\end{cases}
$$
This results in a unified format for the multimodal instruction-following sequence.
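A small Python sketch of this turn assembly, assuming a `<image>` placeholder token for $\mathbf{X}_{\mathrm{v}}$ and omitting system prompts and \<STOP> tokens for brevity:
```python
import random

def build_instruction_sequence(image_token, turns):
    """Sketch of the unified format: for the first turn the image token is
    randomly placed before or after the question; later turns use the
    question alone. `turns` is a list of (question, answer) pairs."""
    sequence = []
    for t, (question, answer) in enumerate(turns, start=1):
        if t == 1:
            instruct = random.choice(
                [f"{question}\n{image_token}", f"{image_token}\n{question}"]
            )
        else:
            instruct = question
        sequence.append(("Human", instruct))
        sequence.append(("Assistant", answer))
    return sequence

# Toy two-turn conversation
conversation = build_instruction_sequence(
    "<image>",
    [("What is shown in the image?", "A dog chasing a ball in a park."),
     ("What color is the ball?", "It is red.")],
)
```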

They then perform instruction-tuning of the LLM on the prediction tokens using its original auto-regressive objective.
For a sequence of length $L$, the probability of the target answers $\mathbf{X}_{\mathrm{a}}$ is computed as:
$$
p\left(\mathbf{X}_{\mathrm{a}} \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct}}\right) = \prod_{i=1}^L p_{\boldsymbol{\theta}}\left(x_i \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct},<i}, \mathbf{X}_{\mathrm{a},<i}\right)
$$
where $\boldsymbol{\theta}$ represents the trainable parameters, $\mathbf{X}_{\text{instruct},<i}$ denotes the instruction tokens from all turns before the current prediction token $x_i$, and $\mathbf{X}_{\mathrm{a},<i}$ represents the answer tokens from all previous turns.
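Concretely, the loss is applied only to the answer tokens $\mathbf{X}_{\mathrm{a}}$ (and the stop tokens), while image and instruction positions are masked out. A minimal sketch using the standard PyTorch/Hugging Face convention of an ignore index of -100 is shown below; the helper and its mask argument are illustrative, not LLaVA's actual data-collation code.
```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_labels(input_ids: torch.Tensor, answer_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of target construction for the objective above: only answer-token
    positions keep their ids as labels; image/instruction positions are masked."""
    labels = input_ids.clone()
    labels[~answer_mask] = IGNORE_INDEX
    return labels

# Toy example: only positions marked True contribute to the loss
ids = torch.tensor([[10, 11, 12, 13, 14]])
mask = torch.tensor([[False, False, True, True, True]])
print(build_labels(ids, mask))  # tensor([[-100, -100,   12,   13,   14]])
```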
For LLaVA model training, they consider a two-stage instruction-tuning procedure (sketched after this list):
1. **Pre-training for Feature Alignment**: only the projection matrix $\mathbf{W}$ is updated, with both the visual encoder and the LLM kept frozen, so that the visual features are aligned with the LLM's word embedding space.
2. **Fine-tuning End-to-End**: the visual encoder stays frozen, while both the projection matrix and the LLM are updated on the instruction-following data.
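A minimal sketch of which parameters are trainable in each stage, assuming hypothetical handles to the three components:
```python
import torch.nn as nn

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    llm: nn.Module, stage: int) -> None:
    """Sketch of the two-stage schedule: stage 1 trains only the projection
    matrix W; stage 2 additionally unfreezes the LLM. The vision encoder
    stays frozen throughout. Module handles here are illustrative."""
    for p in vision_encoder.parameters():
        p.requires_grad = False            # frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True             # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)     # LLM updated only in stage 2
```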
## Experiment
### Multimodal Chatbot
Below are examples taken from the original GPT-4 paper.

They also vary the training datasets to evaluate the effectiveness of different types of instruction-following data.

### ScienceQA
The results for ScienceQA, which includes 21,000 multimodal multiple-choice questions, are shown below.
