# CLIP Paper | Learning Transferable Visual Models From Natural Language Supervision

## Abstract
> SOTA computer vision systems:
> They are trained to predict a **==fixed==** set of **==predetermined==** object categories.
> Problem: This restricted form of supervision limits their generality and usability.
> Solution: Train the vision system on raw text paired with images rather than on a fixed set of predetermined object categories.

![image](https://hackmd.io/_uploads/ryew1V0P4p.png)

---
## 3.1.4. PROMPT ENGINEERING AND ENSEMBLING
> A prompt is the piece of text fed to the text encoder at inference (zero-shot) time. We can see the prompt as a template that guides how a class label is turned into text input.

### Why use Prompt Engineering and Prompt Ensembling?
1. **Word Confusion (Polysemy) Problem:** A single word can mean different things. For instance, "crane" might mean a construction crane or a bird. Using just one word as the prompt can therefore be ambiguous.

| construction crane | crane (bird) |
| -------- | -------- |
| ![image](https://hackmd.io/_uploads/B1ESpTPNT.png =300x) | ![](https://hackmd.io/_uploads/SJAw06wNp.png =300x) |

2. **Distribution Gap Problem:** During pre-training, the text paired with each image is usually a full sentence. If we feed the model a single word at prediction time, the input no longer matches the distribution seen during training, which hurts performance.

### Prompt Engineering
* **Using Meaningful Templates**
To address this, the authors use a "prompt template":
```python=
"A photo of a {label}."
```
The template reads as "This is a picture of xxx," and because the label fills a noun slot, much of the ambiguity is removed. For example, "remote" is then read as a remote control, not as something far away. ==This template alone boosted ImageNet accuracy by 1.3% over using the bare label.==

* **Customizing for Tasks**
Customizing the prompt for each task further improves zero-shot performance. For a pet image dataset (*Oxford-IIIT Pet*), the prompt becomes "A photo of a {label}, a type of pet.", which narrows down the possibilities. Similar tricks work for food, aircraft, OCR, and other datasets.

### Prompt Ensembling
The authors also tried mixing different prompts, such as "A photo of a big {label}." and "A photo of a small {label}." On ImageNet, 80 different context prompts were ensembled, which improved accuracy by a further 3.5% over the single default prompt. A code sketch of this procedure appears after the zero-shot comparison below.

[**Example Prompts:**](https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb)
![image](https://hackmd.io/_uploads/HJIJzAw4p.png)
* `"a photo of many {}"` for pictures containing many objects
* `"a photo of the hard to see {}"` for hard-to-spot targets

---
## 3.1.5. ANALYSIS OF ZERO-SHOT CLIP PERFORMANCE
To analyze Zero-Shot CLIP's performance, the paper compares it against a Linear Probe on ResNet-50 as the baseline: the features learned by ResNet-50 are frozen and a linear classifier is fitted on them for each dataset. The comparison shows how Zero-Shot CLIP performs relative to this Linear Probe ResNet-50. Green bars represent an accuracy improvement, while blue bars show an accuracy decline.

### 1. Zero-shot Performance Comparison
![image](https://hackmd.io/_uploads/rJgHR0DVa.png =400x)
On more than half of the tasks (16 out of 27), Zero-Shot CLIP outperforms the supervised ResNet-50. However, CLIP is weak on specialized, complex, or abstract tasks such as tumor detection, object counting, traffic sign recognition, and distance estimation.
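For concreteness, the zero-shot classifier evaluated in this comparison is built exactly as described in Section 3.1.4: each class name is expanded into several prompts, the prompt embeddings are averaged into one vector per class, and an image is classified by cosine similarity against those vectors. Below is a minimal sketch, assuming the `openai/CLIP` package (`pip install git+https://github.com/openai/CLIP.git`); the class names, the three templates, and the image path are illustrative placeholders, and the full 80-template ImageNet list lives in the notebook linked above.

```python
# Zero-shot classification with prompt ensembling (illustrative sketch).
# Class names, templates, and "example.jpg" are placeholders, not the paper's exact setup.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classnames = ["crane", "remote", "dog"]   # placeholder labels
templates = [                             # a small subset of context prompts
    "A photo of a {}.",
    "A photo of a big {}.",
    "A photo of a small {}.",
]

# Build one ensembled text embedding per class: encode every prompt,
# L2-normalize, average over prompts, then re-normalize.
with torch.no_grad():
    class_embeddings = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        class_embeddings.append(mean_emb / mean_emb.norm())
    zeroshot_weights = torch.stack(class_embeddings, dim=1)  # (embed_dim, n_classes)

# Classify an image by cosine similarity against the ensembled class embeddings.
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_emb @ zeroshot_weights
print("predicted:", classnames[logits.argmax(dim=-1).item()])
```

Because the ensemble is averaged in embedding space, it can be cached as a single weight matrix, so inference costs the same as using a single prompt per class.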
---
### 2. Zero-shot vs. Few-shot Comparison
![image](https://hackmd.io/_uploads/ByIYbyO46.png =400x)
For tasks where CLIP's zero-shot accuracy is relatively low, the paper explores Few-Shot learning, i.e., supervised learning from only a few labeled examples. A linear probe (a classifier head on top of frozen features) is fitted on CLIP and on other models (BiT-M, SimCLRv2, and ResNet-50) across 20 datasets. The results in the figure above show that CLIP reaches higher accuracy than the other models even with just a few (1 to 4) training samples per class.

Interestingly, for CLIP itself, a few-shot linear probe trained with only 1 or 2 samples per class is *less* accurate than Zero-Shot. About 4 samples per class are needed to match Zero-Shot accuracy, which challenges the expectation that Few-Shot should always outperform Zero-Shot.

:::success
* Zero-shot CLIP performs about as well as a 4-shot linear probe on CLIP;
* Few-shot CLIP clearly outperforms the previous SOTA models.
:::

### 3. Data Efficiency Comparison
![image](https://hackmd.io/_uploads/rJipByuNT.png =400x)
A further experiment estimates, for each task, how many training samples per class a few-shot linear probe needs to match Zero-Shot accuracy. The figure above shows that half of the datasets need fewer than 5 labeled samples per class, with a median of 5.4 and a mean of 20.8. Challenging datasets, however, require many more training samples.

### 4. Model Size and Performance Relationship
![image](https://hackmd.io/_uploads/r1vW8JdV6.png =400x)
Larger models, such as ResNet-101, RN50x4, RN50x16, and RN50x64, exhibit better Zero-Shot performance: as model capacity grows, Zero-Shot performance improves steadily.

### 5. Comparison with Fully Supervised Models
![image](https://hackmd.io/_uploads/B1fgIk_4T.png =400x)
Zero-Shot CLIP still lags a fully supervised linear classifier trained on the same features by roughly 10-25% on most datasets.

---
## 3.2. Representation Learning
> In this section, the evaluation focuses on CLIP's representation learning capabilities.

There are two common methods for evaluating representations:
1. fitting linear classifiers on frozen features (Linear Probe)
2. end-to-end fine-tuning

Despite the flexibility of fine-tuning, the paper relies primarily on linear classifiers because they are simple and give clear feedback on the quality of the pre-trained representation, which matches the goal of building a high-performing, task-agnostic pre-training model. Comparing CLIP with a broad set of existing models means evaluating 66 models on 27 datasets, and linear classifiers keep this large number of evaluations tractable compared with full fine-tuning. A short sketch of the linear-probe protocol is given at the end of this section.

> The figure below compares the linear-probe performance of CLIP models with state-of-the-art computer vision models across various architectures.

![image](https://hackmd.io/_uploads/BkolixONp.png)
* **Left:** Scores averaged over a 12-dataset suite widely used in prior work. Some models outperform CLIP here, likely because these 12 datasets are closely tied to ImageNet, on which many of those models were trained.
* **Right:** Scores averaged over the full 27-dataset suite; better models appear toward the upper left. Here CLIP-ViT and CLIP-ResNet outperform the other models.

![image](https://hackmd.io/_uploads/B14IaedEp.png =400x)
CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets.
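The linear-probe protocol behind all of these comparisons is straightforward: freeze the pre-trained encoder, extract image features once, and fit a logistic-regression classifier on top (the paper uses scikit-learn's L-BFGS implementation and sweeps the L2 regularization strength on a validation split). Below is a minimal sketch, assuming the `openai/CLIP` package and scikit-learn; the dataset choice (CIFAR-100) and the fixed value of `C` are illustrative simplifications, not the paper's full sweep.

```python
# Linear-probe evaluation sketch: frozen CLIP features + logistic regression.
# CIFAR-100 and C=0.316 are placeholders; the paper tunes C per dataset.
import numpy as np
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(dataset):
    """Run the frozen image encoder over a dataset and collect features and labels."""
    features, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256):
            feats = model.encode_image(images.to(device))
            features.append(feats.cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(features), np.concatenate(labels)

train = CIFAR100(root="./data", download=True, train=True, transform=preprocess)
test = CIFAR100(root="./data", download=True, train=False, transform=preprocess)
train_x, train_y = extract_features(train)
test_x, test_y = extract_features(test)

# Only this small linear head is trained; the encoder itself is never updated.
classifier = LogisticRegression(C=0.316, max_iter=1000)
classifier.fit(train_x, train_y)
print("linear-probe accuracy:", classifier.score(test_x, test_y))
```

Because only the linear head is trained, evaluating many encoders over many datasets stays cheap, which is exactly why the paper prefers this protocol over end-to-end fine-tuning.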
:arrow_right: Reference
[OpenAI 的 multimodal 神經網路 CLIP: Connecting Text and Images](https://blog.infuseai.io/openai-%E7%9A%84-multimodal-%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-%E4%B8%8B-clip-connecting-text-and-images-2e9962905504)