<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2208.08241) | [Code link](https://github.com/ml-research/ILLUME) | ICML 2023
:::success
**Thoughts**
This study uses a human-in-the-loop rationalization approach for multimodal transformers.
:::
## Abstract
Vision-language models (VLMs) rarely align with a user’s rationales for specific answers.
To improve this alignment and reinforce commonsense reasoning, this study proposes a tuning paradigm based on human interactions with machine-generated data.
## Background
Recently, InstructGPT demonstrated that tuning language models with humans in the loop produces outputs that humans prefer over those of larger, conventionally trained models.
Similarly, they use minimal interactive feedback from a human critic on self-generated samples to guide the fine-tuning process.
Further, they apply their approach to multimodal applications and facilitate the transfer of capabilities between LMs and VLMs.
## Method
### Problem Statement
They define the task as open-ended text generation.
VQA tuples $(i, q, a)$ consist of an image $i$ and a respective pair of text sequences for the question $q$ and answer $a$.
This study employs the model to perform a function $f(i, q, a) = e$, where the output $e$ is a textual explanation.
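A minimal sketch of this interface in Python, assuming a hypothetical multimodal `model.generate(image, prompt)` call (not the paper's actual code); the prompt template is illustrative:
```python
from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str  # image i
    question: str    # question q
    answer: str      # answer a

def explain(model, sample: VQASample) -> str:
    """Sketch of f(i, q, a) = e: produce a textual explanation e for a VQA tuple."""
    # Illustrative prompt; the model conditions on both the image and the text.
    prompt = f"Q: {sample.question} A: {sample.answer} because"
    return model.generate(sample.image_path, prompt)
```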
### Self-talk Prompting
Their method establishes a baseline for commonsense reasoning in natural language by evaluating the self-talk prompting approach.
Based on this evaluation, fitting LM candidates are selected.
In self-talk, both the clarification and the context are prompted to the model to predict the final answer.
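A hedged sketch of the self-talk flow, assuming a hypothetical text-only `lm.generate(prompt)` interface; the prompt templates are illustrative, not the paper's exact ones:
```python
def self_talk_answer(lm, context: str, question: str) -> str:
    # 1. The LM prompts itself for a clarification question.
    clar_q = lm.generate(f"{context}\nQ: {question}\nClarification question:")
    # 2. The LM answers its own clarification question.
    clar_a = lm.generate(f"{context}\nClarification question: {clar_q}\nClarification answer:")
    # 3. Both clarification and context are prompted to predict the final answer.
    final_prompt = f"{context}\n{clar_q} {clar_a}\nQ: {question}\nA:"
    return lm.generate(final_prompt)
```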
### ILLUME: Tuning through Machine Interaction

At each iteration, they sample explanations for the training data using the tuned model from the previous iteration.
Minimal human feedback is provided by marking fitting explanations.
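A rough sketch of one such iteration; all interfaces here (`sample_explanations`, `finetune`, `feedback_fn`) are hypothetical stand-ins, not the paper's code:
```python
def illume_iteration(model, train_data, feedback_fn, num_candidates: int = 8):
    fitting = []
    for sample in train_data:
        # Sample explanation candidates with the model tuned in the previous iteration.
        candidates = model.sample_explanations(sample, n=num_candidates)
        # Minimal feedback: a human critic (or a proxy) marks fitting explanations.
        fitting.extend((sample, e) for e in candidates if feedback_fn(sample, e))
    # Fine-tune on the approved explanations; the result seeds the next iteration.
    model.finetune(fitting)
    return model
```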
> Sampling
It combines top-k and temperature sampling.
First, top-k sampling limits the generated sequence to the $k$ most probable tokens.
The remaining tokens are then re-weighted with temperature sampling: given the logit $l_i$ underlying the output probability $p_i$ assigned to a token $i$, the re-weighted probability is
$$
\hat{p}_i = \mathrm{softmax}\left(\frac{l_i}{T}\right) = \frac{e^{\frac{l_i}{T}}}{\sum_j e^{\frac{l_j}{T}}}
$$
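A small PyTorch sketch of this decoding step; the values of $k$ and $T$ below are illustrative, not the paper's settings:
```python
import torch

def sample_next_token(logits: torch.Tensor, k: int = 10, T: float = 0.8) -> int:
    """Top-k + temperature sampling for a single decoding step.

    logits: unnormalized scores l_i over the vocabulary, shape (vocab_size,).
    """
    # Restrict generation to the k most probable tokens.
    topk_logits, topk_idx = torch.topk(logits, k)
    # Re-weight the remaining logits with temperature T: softmax(l_i / T).
    probs = torch.softmax(topk_logits / T, dim=-1)
    # Draw one token from the re-weighted distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])
```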
> Human Feedback
They identify and reinforce those portions of the generated output that conform to human intent.
Human feedback can be simulated by comparing the generated candidates to existing human-generated ground truth explanations using task-specific metrics.
The ROUGE-L score is used in their approach.
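Simulated feedback can be sketched with the `rouge_score` package; the acceptance threshold below is illustrative, not taken from the paper:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_fitting(generated: str, ground_truth: str, threshold: float = 0.7) -> bool:
    # Mark a candidate explanation as "fitting" if its ROUGE-L F-measure against
    # the human-written ground-truth explanation exceeds the threshold.
    score = scorer.score(ground_truth, generated)["rougeL"].fmeasure
    return score >= threshold
```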
> Continual Learning
They train the VQA and explanation generation tasks simultaneously, with the training loss
$$
\mathcal{L}(X, \theta) = \mathcal{L}_{vqa} (X^A, X^E, \theta) + b \cdot \mathcal{L}_{exp} (X^E, \theta)
$$
where $b = \frac{n(X^A)}{n(X^E)}$ balances the disproportionate number of samples and $n(X)$ is the number of elements in set $X$.
$\mathcal{L}_{vqa}$ and $\mathcal{L}_{exp}$ are the language-modeling losses for next-token prediction on the answer and the explanation, respectively.
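A direct translation of this combined loss into code, assuming the two next-token-prediction losses are computed elsewhere:
```python
import torch

def illume_loss(loss_vqa: torch.Tensor, loss_exp: torch.Tensor,
                n_answers: int, n_explanations: int) -> torch.Tensor:
    # b = n(X^A) / n(X^E) up-weights the explanation loss to balance
    # the smaller number of explanation samples.
    b = n_answers / n_explanations
    return loss_vqa + b * loss_exp
```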
## Experiment
This study considers three VLMs:
1. MAGMA
2. BLIP
3. OFA
They use six different commonsense reasoning benchmarks:
1. CSQA
2. COPA
3. Mc-Taco
4. PIQA
5. SocialQA
6. WinoGrande
Three evaluation metrics are used: BLEU-4, ROUGE-L, and CIDEr.
Below are the LMs’ self-talk performances.

The figure below shows generated explanations on the VQA-X training set.
