---
title: openai-clip model paper notes
tags: notes
disqus: hackmd
---

# openai-clip model paper notes

[Paper Src](https://arxiv.org/pdf/2103.00020.pdf)

**Drawbacks of how vision models used to be trained**

**The improvement**

> Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet

CLIP addresses three drawbacks of earlier approaches:

- Expensive labeling cost — to classify an image you now only need to supply candidate text captions; no task-specific labeled images are required.
- Over-dependence on a single task — a model trained for one task could only be used on that task and was hard to transfer to others.
- Good on benchmarks, poor in deployment — many models are over-optimized for benchmarks, whose data distribution differs from real-world data.

**Implementation:**

Using the open-source [OpenAI CLIP repo](https://github.com/openai/CLIP); a minimal zero-shot classification sketch appears under **Code sketches** at the end of this note.

![](https://hackmd.io/_uploads/B1OBmB1x6.png)

https://colab.research.google.com/drive/1R2hy_GtUx_cDFCqbD7cEot0qeww7BGHe?hl=zh-tw#scrollTo=32y7_DAfq-F4

https://gist.github.com/ggosiang/c8c72d9da9f5ec83a3d49bd189acba32#file-openai-clip-puipui-ipynb

---

**CLIP prefix captioning paper notes**

[Paper Src](https://arxiv.org/pdf/2111.09734.pdf)

> Semantic Understanding: The model needs to understand the objects and elements in the image. For instance, it should recognize that an object in an image is a gift. Diversity of Descriptions: There are numerous ways to describe a single image, and the training dataset typically dictates the preferable option for a given image.

*Inference: they fine-tune a language model, extract the visual prefix of an input image x using the CLIP encoder and the mapping network F. We start generating the caption conditioned on the visual prefix, and predict the next tokens one by one, guided by the language model output. For each token, the language model outputs probabilities for all vocabulary tokens, which are used to determine the next one by employing a greedy approach or beam search.* (Both decoding strategies are sketched under **Code sketches** below.)

**Implementation:**

[Using a ready-made Replicate service](https://replicate.com/rmokady/clip_prefix_caption?prediction=zsg3xprbug6ust4qodfsnmynly)

Given a picture of a dog, it outputs a description of the image:

![](https://hackmd.io/_uploads/HyMBSBJeT.png)

![](https://hackmd.io/_uploads/S111LSkl6.png)

[Colab provided by the official repo](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing)

Uploading a Windows login screen for the model to caption:

*beam search disabled*
![](https://hackmd.io/_uploads/HJk78ayxT.png)

*beam search enabled*
![](https://hackmd.io/_uploads/S1UTL6Jgp.png)

---

**Proper nouns**

[beam search](https://www.width.ai/post/what-is-beam-search)
[zero-shot](https://www.pinecone.io/learn/series/image-search/zero-shot-image-classification-clip/)
https://ithelp.ithome.com.tw/articles/10260464
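---

**Code sketches**

A minimal zero-shot classification sketch using the [OpenAI CLIP repo](https://github.com/openai/CLIP) linked above, following its published API (`clip.load`, `clip.tokenize`, calling the model on an image/text pair). The image path `dog.jpg` and the candidate captions are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a set of candidate captions, then compare them.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog",
                      "a photo of a cat",
                      "a windows login screen"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)               # image-to-text similarity
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # the "dog" caption should get the highest probability
```

CLIP never saw these captions as training labels; it simply picks the caption whose text embedding is closest to the image embedding, which is what makes zero-shot usage possible.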
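The greedy decoding loop described in the CLIP prefix captioning notes can be sketched as follows. This is not the authors' code; it assumes a GPT-2 language model from Hugging Face `transformers`, and `prefix_embeds` is a hypothetical tensor standing in for the visual prefix produced by the CLIP encoder plus the mapping network F:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def greedy_caption(prefix_embeds, max_tokens=30):
    """prefix_embeds: (1, prefix_len, hidden) tensor -- assumed output of
    the mapping network F applied to the CLIP image embedding."""
    generated, token_ids = prefix_embeds, []
    with torch.no_grad():
        for _ in range(max_tokens):
            logits = lm(inputs_embeds=generated).logits
            next_id = logits[:, -1, :].argmax(dim=-1)   # greedy: take the top token
            if next_id.item() == tokenizer.eos_token_id:
                break
            token_ids.append(next_id.item())
            # Feed the chosen token's embedding back in and continue.
            next_embed = lm.transformer.wte(next_id).unsqueeze(1)
            generated = torch.cat([generated, next_embed], dim=1)
    return tokenizer.decode(token_ids)
```

Each step feeds the embeddings generated so far back into the language model and appends the single highest-probability token, i.e. the "predict the next tokens one by one" procedure quoted above (the full sequence is re-run each step; a real implementation would use the KV cache).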
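Beam search (linked under proper nouns) replaces the single greedy pick with the top-scoring continuations of several running hypotheses. A self-contained sketch under the same assumptions as the greedy loop above — `lm`, `prefix_embeds`, and `wte` are the assumed language model, visual prefix, and token-embedding table:

```python
import torch

def beam_caption(lm, wte, prefix_embeds, beam_width=5, max_tokens=30):
    """Keep the `beam_width` highest-scoring partial captions instead of a
    single greedy one; scores are summed token log-probabilities."""
    beams = [(0.0, [], prefix_embeds)]   # (log-prob, token ids, embeddings so far)
    with torch.no_grad():
        for _ in range(max_tokens):
            candidates = []
            for score, ids, embeds in beams:
                log_probs = torch.log_softmax(
                    lm(inputs_embeds=embeds).logits[:, -1, :], dim=-1).squeeze(0)
                top = torch.topk(log_probs, beam_width)
                for lp, tok in zip(top.values, top.indices):
                    new_embeds = torch.cat([embeds, wte(tok.view(1, 1))], dim=1)
                    candidates.append((score + lp.item(), ids + [tok.item()], new_embeds))
            # Prune back to the best `beam_width` hypotheses (no EOS handling here).
            beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0][1]   # token ids of the highest-scoring caption
```

With GPT-2 this would be called as `beam_caption(lm, lm.transformer.wte, prefix_embeds)` and the returned ids decoded with the tokenizer; the wider search is why the Colab's beam-search output above differs from the greedy one.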