# 2023.09.22
## Project Timeline
* September
    * Survey
    * Dataset
* October
    *
* November
    * Vision Transformer Training
    *
* December
    *
* January
## Methodology
### The four current mainstream MLLM methods

### Instruction Tuning

### In-Context Learning

### Chain of Thought

### LLM-Aided Visual Reasoning

## BLIP-2: Vision Transformer → Q-Former → LLM
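
Before adding anything on top, the ViT → Q-Former → LLM pipeline can be exercised end-to-end with the pretrained BLIP-2 checkpoints. A minimal captioning sketch, assuming the Hugging Face `transformers` library, the `Salesforce/blip2-opt-2.7b` checkpoint, and a placeholder image path:

```python
# Minimal BLIP-2 captioning sketch: frozen ViT -> Q-Former -> frozen OPT LLM.
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt").to(device, dtype)

# The Q-Former compresses frozen ViT features into a fixed set of query
# tokens that are fed to the frozen LLM as a soft visual prompt.
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```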

### Apply SAM (RAM)
> The object hallucination issue is widespread, which largely affects the reliability of MLLMs. This may be ascribed to insufficient alignment pretraining. Thus, a possible solution is to perform a more fine-grained alignment between visual and textual modalities. The fine granularity refers to the local features of images, which can be obtained by SAM, and the corresponding local textual descriptions.
* Apply SAM (Segment Anything Model) to supply fine-grained local image features (see the sketch after these links).
* [Recognize Anything Model](https://huggingface.co/spaces/xinyu1205/recognize-anything)
* [Grounded SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything)
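
To probe the fine-grained alignment idea, SAM can propose class-agnostic regions whose crops serve as the local visual features to pair with local textual descriptions. A minimal sketch using the official `segment-anything` package; the checkpoint path, the image path, and the keep-the-10-largest-regions heuristic are assumptions, not a fixed design:

```python
# Sketch: extract local image regions with SAM for fine-grained alignment.
# Assumes: pip install segment-anything opencv-python torch
# and a downloaded checkpoint, e.g. sam_vit_h_4b8939.pth (path is a placeholder).
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.imread("example.jpg")               # placeholder image path
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # SAM expects RGB uint8 (HWC)

masks = mask_generator.generate(image)          # one dict per proposed region

# Keep the largest regions and crop them; each crop would then be encoded
# (e.g. by the same ViT) and aligned with a local textual description.
masks = sorted(masks, key=lambda m: m["area"], reverse=True)[:10]
crops = []
for m in masks:
    x, y, w, h = map(int, m["bbox"])            # bbox is in XYWH format
    crops.append(image[y:y + h, x:x + w])

print(f"extracted {len(crops)} local regions")
```

RAM and Grounded SAM would slot into the same place: RAM proposes tags for the image, and Grounded SAM grounds those tags to masks, yielding region-text pairs directly instead of class-agnostic crops.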