# 2023.09.22

## Project Schedule

* September
  * Survey
  * Dataset
* October
  *
* November
  * Vision Transformer Training
  *
* December
  *
* January

## Methodology

### The four mainstream MLLM approaches

![](https://hackmd.io/_uploads/rkoB8x51T.png)

### Instruction Tuning

![](https://hackmd.io/_uploads/H1h8weqka.png)

### In-Context Learning

### Chain of Thought

![](https://hackmd.io/_uploads/B16vde5k6.png)

### LLM-Aided Visual Reasoning

![](https://hackmd.io/_uploads/rJSqul51p.png)

## BLIP-2: Vision Transformer - Q-Former - LLM

![](https://hackmd.io/_uploads/HynQEg5kT.png)

A minimal inference sketch of this pipeline is included at the end of this note.

### Apply SAM (RAM)

> The object hallucination issue is widespread, which largely affects the reliability of MLLMs. This may be ascribed to insufficient alignment pretraining. Thus, a possible solution is to perform a more fine-grained alignment between visual and textual modalities. The fine granularity refers to the local features of images, which can be obtained by SAM, and the corresponding local textual descriptions.

* Apply SAM (Segment Anything Model) to provide additional, region-level features of the images (see the sketch below).

[Recognize Anything Model](https://huggingface.co/spaces/xinyu1205/recognize-anything)

![](https://hackmd.io/_uploads/HJZv3ec1a.png)
![](https://hackmd.io/_uploads/Sypc2e5k6.png)

[Grounded SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything)
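As a concrete starting point for the SAM idea above, the sketch below uses the official `segment_anything` package to extract local image regions that could later be paired with local textual descriptions (e.g. tags from RAM or Grounded SAM) for fine-grained alignment. This is a minimal sketch, not the note's finalized pipeline: the checkpoint path, model size, and input image are assumptions, and the per-region tagging/captioning step is left out.

```python
# Minimal sketch: use SAM's automatic mask generator to propose local regions
# of an image; each region could then be tagged/captioned and aligned with
# text at the local level. Checkpoint path and "vit_h" size are assumptions.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 array (HxWxC); "example.jpg" is a placeholder.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per proposed region

# Each mask dict carries a bounding box in XYWH format; crop the regions so
# they can later be described (e.g. by RAM) and used for local alignment.
local_crops = []
for m in masks:
    x, y, w, h = m["bbox"]
    local_crops.append(image[y : y + h, x : x + w])

print(f"SAM proposed {len(local_crops)} local regions for fine-grained alignment")
```

The region crops are only one possible interface; masks themselves (or mask-pooled ViT features) could equally serve as the "local features" mentioned in the quoted passage.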
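Referring back to the Vision Transformer - Q-Former - LLM pipeline above, the following is a minimal inference sketch using the Hugging Face BLIP-2 implementation. The checkpoint (`Salesforce/blip2-opt-2.7b`), prompt, and GPU/half-precision setup are assumptions for illustration, not choices fixed by this note.

```python
# Minimal sketch of ViT -> Q-Former -> LLM inference with BLIP-2.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
prompt = "Question: what objects are in the image? Answer:"

# The processor handles ViT image preprocessing; inside the model, frozen ViT
# features are compressed by the Q-Former into a small set of query tokens
# that are fed to the frozen LLM together with the text prompt.
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```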