<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
</style>

> [Paper link](https://arxiv.org/abs/2305.10355) | [Note link](https://blog.csdn.net/Mars_prime/article/details/134922913) | [Code link](https://github.com/AoiDragon/POPE) | EMNLP 2023

:::success
**Thoughts**
This study conducted evaluation experiments on several LVLMs to examine their susceptibility to object hallucination.
:::

## Abstract

Large vision-language models (LVLMs) suffer from issues such as object hallucination, where generated descriptions include objects that do not match the target images. This study introduces a polling-based query method called POPE to improve the evaluation of object hallucination.

## Background

Large vision-language models (LVLMs) keep the visual encoder of VLPMs and replace the language encoder with an LLM, allowing them to process image data and follow human instructions across a range of tasks. However, LVLMs still suffer from hallucination: the generated content may mention objects that are not present in the image.

The figure below shows cases of object hallucination in LVLMs.

![image](https://hackmd.io/_uploads/H1BETWJi0.png)

**Bold** objects are ground-truth objects, while **red** objects are hallucinated ones. The left case uses a traditional instruction-based evaluation method, and the right cases use the three variants of the POPE method.

The figure below shows how often LVLMs hallucinate objects that frequently appear or co-occur in the MSCOCO dataset.

![image](https://hackmd.io/_uploads/SJRja-yiC.png)

## Method

### Caption Hallucination Assessment with Image Relevance (CHAIR)

This metric is widely used to evaluate object hallucination in image captioning tasks. Existing work commonly uses two variants: $\mathrm{CHAIR}_I$, which measures hallucination at the object instance level, and $\mathrm{CHAIR}_S$, which measures it at the sentence level. They are formulated as follows:

$$
\mathrm{CHAIR}_I = \frac{|\{ \mathrm{hallucinated \ objects} \}|}{|\{ \mathrm{all \ mentioned \ objects} \}|}
$$

$$
\mathrm{CHAIR}_S = \frac{|\{ \mathrm{captions \ with \ hallucinated \ objects} \}|}{|\{ \mathrm{all \ captions} \}|}
$$

A drawback of CHAIR is its sensitivity to factors such as instruction design and caption length, which can skew the evaluation results.

### POPE

This study introduces Polling-based Object Probing Evaluation (POPE), a simple yet effective method for evaluating hallucination in LVLMs.

![image](https://hackmd.io/_uploads/ByZOpWksA.png)

Given an input image, POPE first extracts the ground-truth objects from human annotations or with an automatic segmentation tool such as SEEM. It then performs negative sampling of nonexistent objects under the Random, Popular, or Adversarial setting. Finally, it inserts these objects into question templates to poll the LVLMs.

Unlike CHAIR, POPE is more robust to the form of the prompt and can easily be extended to unannotated datasets. Its probing results are also highly consistent with the model's captions.
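To make the three sampling settings concrete, below is a minimal Python sketch of how POPE-style yes/no probes could be built for one image. The function name, the two statistics dictionaries, and the exact ranking logic are illustrative assumptions rather than the authors' implementation; the question template follows the "Is there a ... in the image?" polling style shown above.

```python
import random

# Polling template: POPE asks one binary question per probed object.
TEMPLATE = "Is there a {} in the image?"

def build_pope_probes(gt_objects, object_freq, cooccur,
                      setting="random", seed=0):
    """Build (question, gold answer) pairs for one image.

    gt_objects  -- objects present in the image (annotations or SEEM output)
    object_freq -- dict: object name -> frequency in the dataset
    cooccur     -- dict: object name -> objects ranked by co-occurrence with it
    setting     -- "random", "popular", or "adversarial"
    """
    rng = random.Random(seed)
    n_neg = len(gt_objects)  # one negative per positive keeps yes/no balanced
    absent = [o for o in object_freq if o not in gt_objects]

    if setting == "random":
        # Sample nonexistent objects uniformly at random.
        negatives = rng.sample(absent, n_neg)
    elif setting == "popular":
        # Pick the most frequent objects in the dataset that are absent here.
        negatives = sorted(absent, key=lambda o: -object_freq[o])[:n_neg]
    elif setting == "adversarial":
        # Pick absent objects that most often co-occur with the present ones.
        ranked, seen = [], set()
        for o in gt_objects:
            for c in cooccur.get(o, []):
                if c in absent and c not in seen:
                    seen.add(c)
                    ranked.append(c)
        negatives = ranked[:n_neg]
    else:
        raise ValueError(f"unknown setting: {setting}")

    probes = [(TEMPLATE.format(o), "yes") for o in gt_objects]
    probes += [(TEMPLATE.format(o), "no") for o in negatives]
    rng.shuffle(probes)
    return probes
```

Sampling as many negatives as positives keeps the expected "Yes" rate at 50%, so a model that over-answers "Yes" (a common symptom of hallucination) shows up immediately in the results.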
## Experiment

This study evaluates the LVLMs with POPE on the validation set of MSCOCO.

The table below shows the results of LVLMs under the three POPE evaluation settings on the MSCOCO validation set. "Yes" denotes the proportion of responses answering "Yes" to the given question. The best results in each block are highlighted in bold.

![image](https://hackmd.io/_uploads/Bku0k71i0.png)

The table below presents the evaluation results of LLaVA using POPE and CHAIR with different prompt templates.

![image](https://hackmd.io/_uploads/r12lRZki0.png)

The table below shows the SEEM-based POPE results of LVLMs on MSCOCO. The F1 Score (Truth) column reports the results obtained with ground-truth annotations. The best results in each block are highlighted in bold.

![image](https://hackmd.io/_uploads/HkR-0ZkoC.png)
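The tables above report the F1 score and the proportion of "Yes" answers. As a rough illustration, here is how such numbers could be computed from the polling answers; `parse_answer` is a simplifying assumption of mine, since real LVLM responses can be longer than a bare "yes"/"no" and may need more careful matching.

```python
def parse_answer(text):
    """Map a free-form LVLM response to a binary label.
    Assumes the response starts with "yes" or "no" (a simplification)."""
    return "yes" if text.strip().lower().startswith("yes") else "no"

def pope_metrics(responses, gold):
    """Compute accuracy, precision, recall, F1, and the "Yes" ratio
    from model responses and gold answers ("yes" is the positive class)."""
    preds = [parse_answer(r) for r in responses]
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, gold))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, gold))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, gold))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(gold),  # proportion of "Yes" answers
    }
```

Because nonexistent objects are sampled one-to-one with ground-truth objects, a well-calibrated model should have a "Yes" ratio near 50%; a much higher ratio signals a bias toward affirming objects that are not in the image.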