<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2304.08485) | [Note link](https://blog.csdn.net/qq_40491305/article/details/131400432) | [Code link](https://github.com/haotian-liu/LLaVA) | NeurIPS 2023
:::success
**Thoughts**
This study demonstrated the effectiveness of visual instruction tuning.
They introduced an automatic pipeline for generating language-image instruction-following data, which was used to train LLaVA, a multimodal model capable of following human intent to complete visual tasks.
:::
## Abstract
Building on the trend of using machine-generated instruction-following data, LLaVA (Large Language and Vision Assistant) is an end-to-end multimodal model that connects a vision encoder with a large language model (LLM).
This combination enables general-purpose understanding of both visual and textual information.
## Background
Large language models (LLMs) have demonstrated that language can serve as a universal interface for a general-purpose assistant, where various task instructions are explicitly represented in text to guide the neural assistant in switching between tasks.
When LLMs are connected with multi-modal vision-and-language instructions, they extend this capability further, enabling the assistant to interpret and solve tasks that involve both visual and textual information seamlessly.
### GPT-assisted Visual Instruction Data Generation
Currently, the availability of multimodal instruction-following data is limited.
To address this, the study leverages ChatGPT/GPT-4 to collect such data, building on widely available image-text pair datasets.
Given an image $\mathbf{X}_\mathrm{v}$ and its associated caption $\mathbf{X}_\mathrm{c}$, a set of questions $\mathbf{X}_\mathrm{q}$ can be generated with the intent of instructing the assistant to describe the image content.
A straightforward method to convert an image-text pair into an instruction-following format is as follows:
**Human:** $\mathbf{X}_\mathrm{q} \ \mathbf{X}_\mathrm{v}$ \<STOP>
**Assistant:** $\mathbf{X}_\mathrm{c}$ \<STOP>
However, this approach lacks diversity and in-depth reasoning in both the instructions and responses.
To address this limitation, the study leverages language-only GPT-4 or ChatGPT as strong teachers (both accept only text as input) to generate more sophisticated instruction-following data involving visual content.
The table below provides an example of such instruction-following data.

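Since the teacher models only accept text, the image is passed to them as a symbolic representation; in the paper this consists of the captions and object bounding boxes, and the teacher is prompted to produce three response types (conversation, detailed description, complex reasoning). Below is a minimal Python sketch of how such a text-only query could be assembled; the function names and prompt wording are illustrative assumptions, not the paper's actual prompts, and the call to the teacher model itself is omitted.
```python
# Minimal sketch: turning an image's symbolic representation (captions +
# bounding boxes) into a text-only query for a teacher model (ChatGPT/GPT-4).
# Function names and prompt wording are illustrative, not the paper's prompts.

def build_symbolic_context(captions, boxes):
    """Textual stand-in for the image that a text-only LLM can read."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(
        f"{label}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return f"Captions:\n{caption_block}\n\nObjects (normalized boxes):\n{box_block}"

def build_teacher_prompt(captions, boxes, response_type="conversation"):
    """response_type is one of: conversation, detailed description,
    complex reasoning -- the three types collected in the paper."""
    context = build_symbolic_context(captions, boxes)
    return (
        "You are describing an image you cannot see, based only on the "
        f"following annotations.\n\n{context}\n\n"
        f"Generate {response_type}-style question-answer pairs as if you "
        "were looking directly at the image."
    )

# Example usage with toy annotations (the resulting string would be sent to the teacher)
prompt = build_teacher_prompt(
    captions=["A group of people standing outside a black vehicle with luggage."],
    boxes=[("person", (0.68, 0.24, 0.77, 0.69)), ("suitcase", (0.43, 0.67, 0.55, 0.91))],
)
```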
## Method
### Architecture
The primary goal is to effectively harness the strengths of both the pre-trained LLM and the visual model.
Below is the network architecture.

### Training
They use Vicuna as the LLM $f_\phi(\cdot)$ and the pre-trained CLIP visual encoder ViT-L/14 for input images $\mathbf{X}_\mathrm{v}$, generating visual features $\mathbf{Z}_\mathrm{v} = g(\mathbf{X}_\mathrm{v})$.
A linear layer with a trainable projection matrix $\mathbf{W}$ maps these features into the word embedding space, producing language embedding tokens $\mathbf{H}_\mathrm{v}$ that match the dimensionality of the LLM's word embeddings:
$$
\mathbf{H}_\mathrm{v} = \mathbf{W} \cdot \mathbf{Z}_\mathrm{v}
$$
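A minimal PyTorch sketch of this projection is given below; the feature dimension (1024 for CLIP ViT-L/14), the LLM embedding size (4096), the number of visual tokens, and the bias-free linear layer are illustrative assumptions rather than the exact configuration.
```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the trainable projection W that maps visual features
    Z_v = g(X_v) into the LLM word-embedding space, producing H_v.
    Dimensions below are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim, bias=False)  # the matrix W

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_visual_tokens, vision_dim) features from the CLIP encoder
        # returns H_v: (batch, num_visual_tokens, llm_dim) language embedding tokens
        return self.proj(z_v)

# Dummy features standing in for CLIP ViT-L/14 outputs
z_v = torch.randn(1, 256, 1024)
h_v = VisualProjector()(z_v)
print(h_v.shape)  # torch.Size([1, 256, 4096])
```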
For each image $\mathbf{X}_{\mathrm{v}}$, they generate multi-turn conversation data $\left(\mathbf{X}_{\mathrm{q}}^1, \mathbf{X}_{\mathrm{a}}^1, \dots, \mathbf{X}_{\mathrm{q}}^T, \mathbf{X}_{\mathrm{a}}^T\right)$, where $T$ is the total number of turns.
These are organized into a sequence by treating all answers as the assistant's responses.
The instruction at the $t$-th turn $\mathbf{X}_{\text{instruct}}^t$ is defined as:
$$
\mathbf{X}_{\text{instruct}}^t = \begin{cases}
\text{Randomly choose } \left[\mathbf{X}_{\mathrm{q}}^1, \mathbf{X}_{\mathrm{v}}\right] \text{ or } \left[\mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\mathrm{q}}^1\right], & \text{if } t=1 \\
\mathbf{X}_{\mathrm{q}}^t, & \text{if } t > 1
\end{cases}
$$
This results in a unified format for the multimodal instruction-following sequence.
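A small Python sketch of this turn assembly, assuming a `<image>` placeholder token for $\mathbf{X}_{\mathrm{v}}$ and omitting system prompts and \<STOP> tokens for brevity:
```python
import random

def build_instruction_sequence(image_token, turns):
    """Sketch of the unified format: for the first turn the image token is
    randomly placed before or after the question; later turns use the
    question alone. `turns` is a list of (question, answer) pairs."""
    sequence = []
    for t, (question, answer) in enumerate(turns, start=1):
        if t == 1:
            instruct = random.choice(
                [f"{question}\n{image_token}", f"{image_token}\n{question}"]
            )
        else:
            instruct = question
        sequence.append(("Human", instruct))
        sequence.append(("Assistant", answer))
    return sequence

# Toy two-turn conversation
conversation = build_instruction_sequence(
    "<image>",
    [("What is shown in the image?", "A dog chasing a ball in a park."),
     ("What color is the ball?", "It is red.")],
)
```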

They then perform instruction-tuning of the LLM on the prediction tokens using its original auto-regressive objective.
For a sequence of length $L$, the probability of the target answers $\mathbf{X}_{\mathrm{a}}$ is computed as:
$$
p\left(\mathbf{X}_{\mathrm{a}} \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct}}\right) = \prod_{i=1}^L p_{\boldsymbol{\theta}}\left(x_i \mid \mathbf{X}_{\mathrm{v}}, \mathbf{X}_{\text{instruct},<i}, \mathbf{X}_{\mathrm{a},<i}\right)
$$
where $\boldsymbol{\theta}$ represents the trainable parameters, $\mathbf{X}_{\text{instruct},<i}$ denotes the instruction tokens from all turns before the current prediction token $x_i$, and $\mathbf{X}_{\mathrm{a},<i}$ represents the answer tokens from all previous turns.
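Concretely, the loss is applied only to the answer tokens $\mathbf{X}_{\mathrm{a}}$ (and the stop tokens), while image and instruction positions are masked out. A minimal sketch using the standard PyTorch/Hugging Face convention of an ignore index of -100 is shown below; the helper and its mask argument are illustrative, not LLaVA's actual data-collation code.
```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_labels(input_ids: torch.Tensor, answer_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of target construction for the objective above: only answer-token
    positions keep their ids as labels; image/instruction positions are masked."""
    labels = input_ids.clone()
    labels[~answer_mask] = IGNORE_INDEX
    return labels

# Toy example: only positions marked True contribute to the loss
ids = torch.tensor([[10, 11, 12, 13, 14]])
mask = torch.tensor([[False, False, True, True, True]])
print(build_labels(ids, mask))  # tensor([[-100, -100,   12,   13,   14]])
```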
For LLaVA model training, they consider a two-stage instruction-tuning procedure (sketched after this list):
1. **Pre-training for Feature Alignment**: only the projection matrix $\mathbf{W}$ is updated, with both the visual encoder and the LLM kept frozen, so that the visual features are aligned with the LLM's word embedding space.
2. **Fine-tuning End-to-End**: the visual encoder stays frozen, while both the projection matrix and the LLM are updated on the instruction-following data.
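A minimal sketch of which parameters are trainable in each stage, assuming hypothetical handles to the three components:
```python
import torch.nn as nn

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    llm: nn.Module, stage: int) -> None:
    """Sketch of the two-stage schedule: stage 1 trains only the projection
    matrix W; stage 2 additionally unfreezes the LLM. The vision encoder
    stays frozen throughout. Module handles here are illustrative."""
    for p in vision_encoder.parameters():
        p.requires_grad = False            # frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True             # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)     # LLM updated only in stage 2
```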
## Experiment
### Multimodal Chatbot
Below are examples taken from the original GPT-4 paper.

They also vary the training datasets to evaluate the effectiveness of different types of instruction-following data.

### ScienceQA
The results for ScienceQA, which includes 21,000 multimodal multiple-choice questions, are shown below.
