# LVLM Related Papers

## Survey

* Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). **A Survey on Multimodal Large Language Models** (arXiv:2306.13549). arXiv. http://arxiv.org/abs/2306.13549

## Models

* **LLaVA** - Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (arXiv:2304.08485). arXiv. http://arxiv.org/abs/2304.08485
* **LLaVA v1.5** - Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved Baselines with Visual Instruction Tuning (arXiv:2310.03744). arXiv. http://arxiv.org/abs/2310.03744
* **MiniGPT-4** - Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (arXiv:2304.10592). arXiv. http://arxiv.org/abs/2304.10592
* **BLIP-2** - Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (arXiv:2301.12597). arXiv. http://arxiv.org/abs/2301.12597
* **InstructBLIP** - Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (arXiv:2305.06500). arXiv. http://arxiv.org/abs/2305.06500

## Benchmarks

* **CHAIR** - Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., & Saenko, K. (2019). Object Hallucination in Image Captioning (arXiv:1809.02156). arXiv. http://arxiv.org/abs/1809.02156 (a minimal metric sketch follows this list)
* **POPE** - Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., & Wen, J.-R. (2023). Evaluating Object Hallucination in Large Vision-Language Models (arXiv:2305.10355). arXiv. http://arxiv.org/abs/2305.10355
* **MME** - Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., Li, K., Sun, X., & Ji, R. (2023). MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (arXiv:2306.13394). arXiv. http://arxiv.org/abs/2306.13394
* **LVLM-eHub** - Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., & Luo, P. (2023). LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models (arXiv:2306.09265). arXiv. http://arxiv.org/abs/2306.09265
* **SEED-Bench** - Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., & Shan, Y. (2023). SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension (arXiv:2307.16125). arXiv. http://arxiv.org/abs/2307.16125
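
The CHAIR benchmark cited above (Rohrbach et al.) scores caption hallucination with two rates: CHAIR_i, the fraction of mentioned objects not present in the image, and CHAIR_s, the fraction of captions containing at least one hallucinated object. The snippet below is a minimal sketch of these two rates only; the function name `chair_metrics` and the plain object-set inputs are illustrative assumptions, not the authors' released implementation (which additionally maps caption words to the MSCOCO object vocabulary via synonym lists).

```python
from typing import Dict, List, Set


def chair_metrics(
    caption_objects: List[Set[str]],
    gt_objects: List[Set[str]],
) -> Dict[str, float]:
    """Compute CHAIR-style hallucination rates over a batch of captions.

    caption_objects: per-caption sets of object words mentioned in the caption
                     (assumed already normalized to the annotation vocabulary).
    gt_objects:      per-image sets of ground-truth objects.
    """
    total_mentions = 0          # all object mentions across captions
    hallucinated_mentions = 0   # mentions with no matching ground-truth object
    hallucinated_captions = 0   # captions with at least one hallucinated object

    for mentioned, present in zip(caption_objects, gt_objects):
        hallucinated = mentioned - present
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1

    return {
        # CHAIR_i: hallucinated object mentions / all object mentions
        "CHAIR_i": hallucinated_mentions / max(total_mentions, 1),
        # CHAIR_s: captions with any hallucination / all captions
        "CHAIR_s": hallucinated_captions / max(len(caption_objects), 1),
    }


if __name__ == "__main__":
    captions = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
    ground_truth = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]
    print(chair_metrics(captions, ground_truth))
    # -> {'CHAIR_i': 0.2, 'CHAIR_s': 0.5}
```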