# 2023.9.15
[TOC]
## LVLMs / MLLMs
Large Vision-Language Models / Multi-modal Large Language Models
* Enhance vision-language pre-trained models (VLPMs) by incorporating powerful LLMs.
* Reuse the visual encoder from the VLPM to handle image data and replace its language encoder with an LLM (see the sketch below)
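A minimal PyTorch sketch of this wiring, assuming a Hugging Face-style causal LM interface; the class name, dimensions, and the single linear projector are illustrative assumptions rather than any specific model's implementation:

```python
import torch
import torch.nn as nn

class LVLMSketch(nn.Module):
    """Illustrative LVLM wiring: frozen VLPM vision tower -> trainable projector -> LLM."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder             # reused from the VLPM (e.g. a ViT), kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patch features into the LLM embedding space
        self.llm = llm                                   # replaces the VLPM's language encoder
        for p in self.visual_encoder.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patch_feats = self.visual_encoder(images)            # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)              # (B, num_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # image tokens act as a prefix
        return self.llm(inputs_embeds=inputs).logits             # assumes an HF-style `inputs_embeds` kwarg
```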
### Popular Models
* LLaVA
* InstructBLIP
* mPLUG-Owl
* MultiModal-GPT
* MiniGPT-4
* BLIP-2

*Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., & Shan, Y. (2023). SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension (arXiv:2307.16125). arXiv. http://arxiv.org/abs/2307.16125*
### Visual Instruction Tuning (Liu et al., 2023)
* Instruction Tuning: Enable LLMs to follow natural language instructions and complete tasks
* Leverage language-only GPT-4 to generate instruction-following data (see the sketch after this list)
* Images are encoded into two types of symbolic representations:
1. Captions: describe the visual scene from various perspectives
2. Bounding boxes: localize the objects in the scene
* Types of instruction-following data:
1. Conversation: a conversation between the assistant and a person asking questions about the photo
2. Detailed description: Rich and comprehensive description
3. Complex reasoning: In-depth reasoning questions
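A hedged sketch of how this symbolic context might be packed into a text-only GPT-4 prompt to elicit each data type in the LLaVA recipe; the captions, boxes, task wording, and `build_prompt` helper are illustrative, not the paper's exact prompts:

```python
# Symbolic image context stands in for the image itself, which GPT-4 never sees.
CAPTIONS = [
    "A group of people stand beside a row of parked cars.",
    "Luggage is stacked on the sidewalk next to an SUV.",
]
BOXES = [  # (category, normalized [x1, y1, x2, y2])
    ("person", [0.68, 0.24, 0.77, 0.69]),
    ("suitcase", [0.32, 0.55, 0.45, 0.83]),
    ("car", [0.11, 0.30, 0.58, 0.72]),
]

TASK_PROMPTS = {
    "conversation": "Write a dialog between a person asking about the photo and an assistant answering as if it sees the image.",
    "detailed_description": "Write a rich, comprehensive description of the scene.",
    "complex_reasoning": "Pose and answer an in-depth reasoning question about the scene.",
}

def build_prompt(task: str) -> str:
    """Serialize captions + bounding boxes, then append the task instruction."""
    context = "\n".join(CAPTIONS) + "\n" + "\n".join(f"{name}: {box}" for name, box in BOXES)
    return f"{context}\n\n{TASK_PROMPTS[task]}"

# Each prompt would be sent to GPT-4; its reply becomes one training sample.
for task in TASK_PROMPTS:
    print(build_prompt(task), end="\n---\n")
```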

### Object Hallucination
* The phenomenon where image captioning models describe objects that do not exist in the scene
* LVLMs especially suffer from object hallucination, since the underlying LLMs tend to generate text not grounded in the image
### Some Metrics
* CIDEr (Consensus-based Image Description Evaluation) (2015)
* CHAIR (Caption Hallucination Assessment with Image Relevance) (2018); see the sketch after this list
* POPE (Polling-based Object Probing Evaluation) (2023)
* MME (Multimodal LLM Evaluation) (2023)
* LVLM-eHub (LVLM Evaluation Hub) (2023)
* SEED-Bench (2023)
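A simplified sketch of CHAIR-style scoring (per Rohrbach et al., 2018): CHAIR_i is the fraction of mentioned object instances that are not in the image, and CHAIR_s is the fraction of captions containing at least one such object. Object extraction is reduced to keyword matching here; the real metric uses MSCOCO object categories with synonym lists, and the helper names are my own:

```python
def chair_scores(captions, extract_objects, gt_objects_per_image):
    """Toy CHAIR: compare objects mentioned in each caption with the image's ground-truth objects."""
    hallucinated = 0    # object mentions absent from the image (CHAIR_i numerator)
    mentions_total = 0  # all object mentions (CHAIR_i denominator)
    bad_captions = 0    # captions with at least one hallucinated object (CHAIR_s numerator)

    for caption, gt_objects in zip(captions, gt_objects_per_image):
        mentioned = extract_objects(caption)
        fake = [obj for obj in mentioned if obj not in gt_objects]
        hallucinated += len(fake)
        mentions_total += len(mentioned)
        bad_captions += bool(fake)

    chair_i = hallucinated / max(mentions_total, 1)
    chair_s = bad_captions / max(len(captions), 1)
    return chair_i, chair_s

# Toy usage: the caption hallucinates a "dog" that is not in the ground truth.
vocab = {"person", "dog", "car", "bicycle"}
extract = lambda caption: [w for w in vocab if w in caption.lower()]
print(chair_scores(["A person walks a dog past a car."], extract, [{"person", "car"}]))
# -> (0.333..., 1.0)
```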
## Our Research
[Image Captioning based Smart Navigation System for Visually Impaired](https://ieeexplore.ieee.org/document/9510102)
### Goals
* Improve the image captioning based smart navigation system for the visually impaired
* Chatbot!
* Avoid hallucinations, as navigation systems should be precise
### Methodology
### Challenges
* Dataset:
    * Street-view image-text pairs
    * Midjourney
    * Mapillary
* Instruction-following data (see the sample record below)
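A hypothetical example of what one LLaVA-style instruction-following record could look like for a street-view navigation sample; the file name, field values, and wording are illustrative only and not drawn from an actual dataset:

```python
# Hypothetical training record in the LLaVA conversation format.
navigation_sample = {
    "id": "streetview_000001",
    "image": "streetview_000001.jpg",  # e.g. a Mapillary frame or a Midjourney-generated scene
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat obstacles are directly ahead of me on the sidewalk?",
        },
        {
            "from": "gpt",
            "value": "A bicycle is parked about two meters ahead on your right; "
                     "the left side of the sidewalk is clear.",
        },
    ],
}
```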
## Papers
### Hallucinations
* Complexity-Based Prompting for Multi-Step Reasoning
* Detecting and Preventing Hallucinations in Large Vision Language Models
* Evaluating Object Hallucination in Large Vision-Language Models
* Evaluation and Analysis of Hallucination in Large Vision-Language Models
* Object Hallucination in Image Captioning
* Self-Consistency Improves Chain of Thought Reasoning in Language Models
* Simple Token-Level Confidence Improves Caption Correctness
* Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
### LVLM
* A Survey on Multimodal Large Language Models
* InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
* Large Multimodal Models: Notes on CVPR 2023 Tutorial
* MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
* Transfer Visual Prompt Generator across LLMs
* Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
* Visual Instruction Tuning
### Benchmarks
* LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
* MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
* Object Hallucination in Image Captioning
* SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
### Applications
* Artificial General Intelligence for Medical Imaging
* DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
* Embodied Task Planning with Large Language Models
* FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings
* Garbage in, garbage out: Zero-shot detection of crime using Large Language Models
* GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
* LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
* Path to Medical AGI: Unify Domain-specific Medical LLMs with the Lowest Cost
* SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model
* Stone Needle: A General Multimodal Large-scale Model Framework towards Healthcare
* XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models