# 2023.9.15
[TOC]
## LVLMs / MLLMs
Large Vision-Language Models / Multi-modal Large Language Models
* Enhance vision-language pre-trained models (VLPMs) by incorporating powerful LLMs.
* Reuse the visual encoder from the VLPM to handle image data and replace its language encoder with an LLM (see the sketch below)
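A minimal PyTorch sketch of this wiring, assuming a Hugging Face-style causal LM interface; the class name, dimensions, and the single linear projector are illustrative assumptions rather than any specific model's implementation:

```python
import torch
import torch.nn as nn

class LVLMSketch(nn.Module):
    """Illustrative LVLM wiring: frozen VLPM vision tower -> trainable projector -> LLM."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder             # reused from the VLPM (e.g. a ViT), kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patch features into the LLM embedding space
        self.llm = llm                                   # replaces the VLPM's language encoder
        for p in self.visual_encoder.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patch_feats = self.visual_encoder(images)            # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)              # (B, num_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # image tokens act as a prefix
        return self.llm(inputs_embeds=inputs).logits             # assumes an HF-style `inputs_embeds` kwarg
```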
### Popular Models
* LLaVA
* InstructBLIP
* mPLUG-Owl
* MultiModal-GPT
* MiniGPT-4
* BLIP-2

*Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., & Shan, Y. (2023). SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension (arXiv:2307.16125). arXiv. http://arxiv.org/abs/2307.16125*
### Visual Instruction Tuning (Liu et al., 2023)
* Instruction Tuning: Enable LLMs to follow natural language instructions and complete tasks
* Leverage language-only GPT-4 to generate instruction-following data (see the sketch after this list)
* Images are encoded into two types of symbolic representations:
1. Captions: describe the visual scene from various perspectives
2. Bounding boxes: localize the objects in the scene
* Types of instruction-following data:
1. Conversation: a conversation between the assistant and a person asking questions about the photo
2. Detailed description: Rich and comprehensive description
3. Complex reasoning: In-depth reasoning questions
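A hedged sketch of how this symbolic context might be packed into a text-only GPT-4 prompt to elicit each data type in the LLaVA recipe; the captions, boxes, task wording, and `build_prompt` helper are illustrative, not the paper's exact prompts:

```python
# Symbolic image context stands in for the image itself, which GPT-4 never sees.
CAPTIONS = [
    "A group of people stand beside a row of parked cars.",
    "Luggage is stacked on the sidewalk next to an SUV.",
]
BOXES = [  # (category, normalized [x1, y1, x2, y2])
    ("person", [0.68, 0.24, 0.77, 0.69]),
    ("suitcase", [0.32, 0.55, 0.45, 0.83]),
    ("car", [0.11, 0.30, 0.58, 0.72]),
]

TASK_PROMPTS = {
    "conversation": "Write a dialog between a person asking about the photo and an assistant answering as if it sees the image.",
    "detailed_description": "Write a rich, comprehensive description of the scene.",
    "complex_reasoning": "Pose and answer an in-depth reasoning question about the scene.",
}

def build_prompt(task: str) -> str:
    """Serialize captions + bounding boxes, then append the task instruction."""
    context = "\n".join(CAPTIONS) + "\n" + "\n".join(f"{name}: {box}" for name, box in BOXES)
    return f"{context}\n\n{TASK_PROMPTS[task]}"

# Each prompt would be sent to GPT-4; its reply becomes one training sample.
for task in TASK_PROMPTS:
    print(build_prompt(task), end="\n---\n")
```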

### Object Hallucination
* The phenomenon where image captioning models describe objects that do not exist in the scene
* LVLMs especially suffer from object hallucination, since the underlying LLMs tend to generate text not grounded in the image
### Some Metrics
* CIDEr (Consensus-based Image Description Evaluation) (2015)
* CHAIR (Caption Hallucination Assessment with Image Relevance) (2018); see the sketch after this list
* POPE (Polling-based Object Probing Evaluation) (2023)
* MME (Multimodal LLM Evaluation) (2023)
* LVLM-eHub (LVLM Evaluation Hub) (2023)
* SEED-Bench (2023)
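A simplified sketch of CHAIR-style scoring (per Rohrbach et al., 2018): CHAIR_i is the fraction of mentioned object instances that are not in the image, and CHAIR_s is the fraction of captions containing at least one such object. Object extraction is reduced to keyword matching here; the real metric uses MSCOCO object categories with synonym lists, and the helper names are my own:

```python
def chair_scores(captions, extract_objects, gt_objects_per_image):
    """Toy CHAIR: compare objects mentioned in each caption with the image's ground-truth objects."""
    hallucinated = 0    # object mentions absent from the image (CHAIR_i numerator)
    mentions_total = 0  # all object mentions (CHAIR_i denominator)
    bad_captions = 0    # captions with at least one hallucinated object (CHAIR_s numerator)

    for caption, gt_objects in zip(captions, gt_objects_per_image):
        mentioned = extract_objects(caption)
        fake = [obj for obj in mentioned if obj not in gt_objects]
        hallucinated += len(fake)
        mentions_total += len(mentioned)
        bad_captions += bool(fake)

    chair_i = hallucinated / max(mentions_total, 1)
    chair_s = bad_captions / max(len(captions), 1)
    return chair_i, chair_s

# Toy usage: the caption hallucinates a "dog" that is not in the ground truth.
vocab = {"person", "dog", "car", "bicycle"}
extract = lambda caption: [w for w in vocab if w in caption.lower()]
print(chair_scores(["A person walks a dog past a car."], extract, [{"person", "car"}]))
# -> (0.333..., 1.0)
```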
## Our Research
[Image Captioning based Smart Navigation System for Visually Impaired](https://ieeexplore.ieee.org/document/9510102)
### Goals
* Improve the image captioning based smart navigation system for the visually impaired
* Chatbot!
* Avoid hallucinations, as navigation systems should be precise
### Methodology
### Challenges
* Dataset:
    * Street-view image-text pairs
    * Midjourney
    * Mapillary
* Instruction-following data (see the sample record below)
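A hypothetical example of what one LLaVA-style instruction-following record could look like for a street-view navigation sample; the file name, field values, and wording are illustrative only and not drawn from an actual dataset:

```python
# Hypothetical training record in the LLaVA conversation format.
navigation_sample = {
    "id": "streetview_000001",
    "image": "streetview_000001.jpg",  # e.g. a Mapillary frame or a Midjourney-generated scene
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat obstacles are directly ahead of me on the sidewalk?",
        },
        {
            "from": "gpt",
            "value": "A bicycle is parked about two meters ahead on your right; "
                     "the left side of the sidewalk is clear.",
        },
    ],
}
```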
## Papers
### Hallucinations
* Complexity-Based Prompting for Multi-Step Reasoning
* Detecting and Preventing Hallucinations in Large Vision Language Models
* Evaluating Object Hallucination in Large Vision-Language Models
* Evaluation and Analysis of Hallucination in Large Vision-Language Models
* Object Hallucination in Image Captioning
* Self-Consistency Improves Chain of Thought Reasoning in Language Models
* Simple Token-Level Confidence Improves Caption Correctness
* Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
### LVLM
* A Survey on Multimodal Large Language Models
* InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
* Large Multimodal Models: Notes on CVPR 2023 Tutorial
* MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
* Transfer Visual Prompt Generator across LLMs
* Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
* Visual Instruction Tuning
### Benchmarks
* LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
* MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
* Object Hallucination in Image Captioning
* SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
### Applications
* Artificial General Intelligence for Medical Imaging
* DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
* Embodied Task Planning with Large Language Models
* FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings
* Garbage in, garbage out: Zero-shot detection of crime using Large Language Models
* GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
* LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
* Path to Medical AGI: Unify Domain-specific Medical LLMs with the Lowest Cost
* SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model
* Stone Needle: A General Multimodal Large-scale Model Framework towards Healthcare
* XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models