## Vision Language Model (VLM)

* What is a Vision Language Model? A language model able to **see** things
* How do language models gain visual capabilities? Language models read words via embeddings (vector representations), and they can read images the same way
  _An encoder is necessary to transform images into embeddings_ (a runnable sketch follows the table below)

---

<!-- Other models work too; the table lists several well-known VLMs -->

### Popular VLMs

The table below lists popular open-source or partially open-source VLMs:

| Model | Params | Licence |
| --- | :---: | :---: |
| Llama 4 | 400B \| 109B | [Llama 4](https://www.llama.com/llama4/license/) |
| Llama 3.2 | 1B \| 3B \| 11B \| 90B | [Llama 3.2](https://www.llama.com/llama3_2/license/) |
| [Pixtral Large](https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411) | 124B | [MRL](https://mistral.ai/static/licenses/MRL-0.1.md) |
| [Command A Vision](https://huggingface.co/CohereLabs/command-a-vision-07-2025) (SOTA) | 112B | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0) |
| Qwen 2.5 VL Instruct | 72B \| 32B \| 8B | [Qwen](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/resolve/main/LICENSE) (72B, 8B) \| [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) (32B) |
| Gemma 3 | 1B \| 4B \| 12B \| 27B (quantised versions available) | [Gemma](https://ai.google.dev/gemma/terms) |
| Mistral Small 3.1 Instruct 2503 | 24B | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| Pixtral 2409 | 12B | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| Phi 4 Multimodal Instruct | 6B | [MIT](https://opensource.org/license/mit) |
| SmolVLM Instruct | 256M \| 500M \| 2.25B | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |

<!-- [SmolVLM research paper](https://arxiv.org/pdf/2504.05299) -->
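To make the image-to-embedding idea concrete, here is a minimal inference sketch using the Hugging Face `transformers` API, following the documented chat-template usage of SmolVLM-Instruct (one of the models in the table); the model id is real, but the image path and prompt are illustrative. Inside the model, a vision encoder maps image patches to embeddings in the same space as the text tokens, so the language model can attend over both.

```python
# Minimal VLM inference sketch with Hugging Face transformers.
# Assumes: pip install transformers torch pillow; "cat.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The processor builds a prompt whose image placeholder tokens are later
# replaced by embeddings produced by the model's vision encoder.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("cat.jpg")  # placeholder image path
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# The language model attends over text-token and image-patch embeddings alike.
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same pattern applies to the larger models in the table; only the model id, memory requirements, and chat-template details change.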