<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2208.08241) | [Code link](https://github.com/ml-research/ILLUME) | ICML 2023
:::success
**Thoughts**
This study uses a human-in-the-loop rationalization approach for multimodal transformers.
:::
## Abstract
Vision-language models (VLMs) rarely align with a user’s rationales for specific answers.
To improve this alignment and reinforce commonsense reasoning, this study proposes a tuning paradigm based on human interactions with machine-generated data.
## Background
Recently, InstructGPT demonstrated that tuning language models with humans in the loop produces outputs that humans prefer over those of larger, conventionally trained models.
Similarly, they use minimal interactive feedback from a human critic on self-generated samples to guide the fine-tuning process.
Further, they apply their approach to multimodal applications and facilitate the transfer of capabilities between LMs and VLMs.
## Method
### Problem Statement
They define the task as open-ended text generation.
VQA tuples $(i, q, a)$ consist of an image $i$ and a respective pair of text sequences for the question $q$ and answer $a$.
This study employs the model to perform a function $f(i, q, a) = e$, where the output $e$ is a textual explanation.
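A minimal sketch of this interface in Python, assuming a hypothetical multimodal `model.generate(image, prompt)` call (not the paper's actual code); the prompt template is illustrative:
```python
from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str  # image i
    question: str    # question q
    answer: str      # answer a

def explain(model, sample: VQASample) -> str:
    """Sketch of f(i, q, a) = e: produce a textual explanation e for a VQA tuple."""
    # Illustrative prompt; the model conditions on both the image and the text.
    prompt = f"Q: {sample.question} A: {sample.answer} because"
    return model.generate(sample.image_path, prompt)
```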
### Self-talk Prompting
Their method establishes a baseline for commonsense reasoning in natural language by evaluating the self-talk prompting approach.
Based on this evaluation, fitting LM candidates are selected.
In self-talk, both the clarification and the context are prompted to the model to predict the final answer.
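A hedged sketch of the self-talk flow, assuming a hypothetical text-only `lm.generate(prompt)` interface; the prompt templates are illustrative, not the paper's exact ones:
```python
def self_talk_answer(lm, context: str, question: str) -> str:
    # 1. The LM prompts itself for a clarification question.
    clar_q = lm.generate(f"{context}\nQ: {question}\nClarification question:")
    # 2. The LM answers its own clarification question.
    clar_a = lm.generate(f"{context}\nClarification question: {clar_q}\nClarification answer:")
    # 3. Both clarification and context are prompted to predict the final answer.
    final_prompt = f"{context}\n{clar_q} {clar_a}\nQ: {question}\nA:"
    return lm.generate(final_prompt)
```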
### ILLUME: Tuning through Machine Interaction

At each iteration, they sample explanations for the training data using the tuned model from the previous iteration.
Minimal human feedback is provided by marking fitting explanations.
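A rough sketch of one such iteration; all interfaces here (`sample_explanations`, `finetune`, `feedback_fn`) are hypothetical stand-ins, not the paper's code:
```python
def illume_iteration(model, train_data, feedback_fn, num_candidates: int = 8):
    fitting = []
    for sample in train_data:
        # Sample explanation candidates with the model tuned in the previous iteration.
        candidates = model.sample_explanations(sample, n=num_candidates)
        # Minimal feedback: a human critic (or a proxy) marks fitting explanations.
        fitting.extend((sample, e) for e in candidates if feedback_fn(sample, e))
    # Fine-tune on the approved explanations; the result seeds the next iteration.
    model.finetune(fitting)
    return model
```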
> Sampling
It combines top-k and temperature sampling.
First, top-k sampling limits the generated sequence to the $k$ most probable tokens.
The remaining tokens are then re-weighted with temperature sampling: given the logit $l_i$ underlying the output probability $p_i$ assigned to a token $i$, the re-weighted probability is
$$
\hat{p}_i = \mathrm{softmax}\left(\frac{l_i}{T}\right) = \frac{e^{\frac{l_i}{T}}}{\sum_j e^{\frac{l_j}{T}}}
$$
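A small PyTorch sketch of this decoding step; the values of $k$ and $T$ below are illustrative, not the paper's settings:
```python
import torch

def sample_next_token(logits: torch.Tensor, k: int = 10, T: float = 0.8) -> int:
    """Top-k + temperature sampling for a single decoding step.

    logits: unnormalized scores l_i over the vocabulary, shape (vocab_size,).
    """
    # Restrict generation to the k most probable tokens.
    topk_logits, topk_idx = torch.topk(logits, k)
    # Re-weight the remaining logits with temperature T: softmax(l_i / T).
    probs = torch.softmax(topk_logits / T, dim=-1)
    # Draw one token from the re-weighted distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])
```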
> Human Feedback
They identify and reinforce those portions of the generated output that conform to human intent.
Human feedback can be simulated by comparing the generated candidates to existing human-generated ground truth explanations using task-specific metrics.
The ROUGE-L score is used in their approach.
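Simulated feedback can be sketched with the `rouge_score` package; the acceptance threshold below is illustrative, not taken from the paper:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_fitting(generated: str, ground_truth: str, threshold: float = 0.7) -> bool:
    # Mark a candidate explanation as "fitting" if its ROUGE-L F-measure against
    # the human-written ground-truth explanation exceeds the threshold.
    score = scorer.score(ground_truth, generated)["rougeL"].fmeasure
    return score >= threshold
```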
> Continual Learning
They train the VQA and explanation generation tasks simultaneously, with the training loss
$$
\mathcal{L}(X, \theta) = \mathcal{L}_{vqa} (X^A, X^E, \theta) + b \cdot \mathcal{L}_{exp} (X^E, \theta)
$$
where $b = \frac{n(X^A)}{n(X^E)}$ balances the disproportionate number of samples and $n(X)$ is the number of elements in set $X$.
$\mathcal{L}_{vqa}$ and $\mathcal{L}_{exp}$ are the language-modeling losses for next-token prediction on the answer and the explanation, respectively.
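A direct translation of this combined loss into code, assuming the two next-token-prediction losses are computed elsewhere:
```python
import torch

def illume_loss(loss_vqa: torch.Tensor, loss_exp: torch.Tensor,
                n_answers: int, n_explanations: int) -> torch.Tensor:
    # b = n(X^A) / n(X^E) up-weights the explanation loss to balance
    # the smaller number of explanation samples.
    b = n_answers / n_explanations
    return loss_vqa + b * loss_exp
```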
## Experiment
This study considers three VLMs:
1. MAGMA
2. BLIP
3. OFA
They use six different commonsense reasoning benchmarks:
1. CSQA
2. COPA
3. Mc-Taco
4. PIQA
5. SocialQA
6. WinoGrande
Three evaluation metrics are used: BLEU-4, ROUGE-L, and CIDEr.
Below are the LMs’ self-talk performances.

The figure below shows generated explanations on the VQA-X training set.
