# DLCV HW3
###### tags: `Course`
:::success
湯濬澤 NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/BkpQxeGHi
:::

## Problem 1 - Zero-shot Image Classification with CLIP

### 1. Methods analysis

#### **Previous methods (e.g. VGG and ResNet)**
The traditional approach is to label images from a specific domain, let a CNN extract features, and train an MLP on top for classification. However, the performance of these models depends heavily on the dataset: for classes that were never labeled, or for images whose distribution differs from the training data, they tend to perform poorly, and extra steps such as transfer learning are needed to get good results.

#### **CLIP**
CLIP is different in that training only requires pairs of images and their text descriptions. Not only is this kind of data easy to collect, it also lets the model understand things more the way a human does. CLIP maps images and text into a shared embedding space and classifies by computing the similarity between them, which makes it generalize more broadly.

### 2. Prompt-text analysis

| Prompt-text | Probability |
| :--------: | :--------: |
| This is a photo of {object} | 0.6080 |
| This is a {object} image. | 0.6824 |
| No {object}, no score. | 0.5624 |

Judging from these results, plain and simple sentences work better for zero-shot prediction. This is probably because the dataset is rather homogeneous and each image is simple; if an image contained many objects, the outcome for the first prompts might well be different.

### 3. Quantitative analysis

|   |   |
| :------: | :------: |
| ![](https://i.imgur.com/DzvprbB.png =250x) | ![](https://i.imgur.com/zEfKaZK.png) |
| ![](https://i.imgur.com/r26uSto.png =250x) | ![](https://i.imgur.com/wHcmfSf.png) |
| ![](https://i.imgur.com/zX48vZk.png =250x) | ![](https://i.imgur.com/kyaBHYy.png) |
| ![](https://i.imgur.com/hDZ2H7V.png =250x) | ![](https://i.imgur.com/0N5JHt3.png) |
| ![](https://i.imgur.com/saUtXXN.png =250x) | ![](https://i.imgur.com/nqr16bW.png) |

## Problem 2 - Image Captioning with VL-model

### 1. Report your best setting and its corresponding CIDEr & CLIPScore on the validation data.

Parameters:
```python
import timm
import torch
import torch.nn as nn

import models  # project-local module that wraps the ViT backbone as an Encoder

# Decoder hyper-parameters
vocab_size = 18022
nHead = 12
hidden_size = 384
nLayers = 3
dropout_dec = 0.2
dropout_pos = 0.1

# ViT-Small backbone (384x384 input, patch size 16) as the image encoder
vit_encoder = timm.create_model('vit_small_patch16_384', pretrained=True)
encoder = models.Encoder(vit_encoder, embed_size=384)

# Loss & optimizer: token id 0 (padding) is ignored in the loss, and only the
# caption decoder (a Transformer decoder defined elsewhere in the project) is optimized
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer_dec = torch.optim.AdamW(decoder.parameters(), lr=0.0001)
```

#### Performance on validation dataset

| Metric | Score |
| -------- | -------- |
| CIDEr | 0.2025 |
| CLIPScore | 0.5217 |

### 2. Report other 3 different attempts (e.g. pretrain or not, model architecture, freezing layers, decoding strategy, etc.) and their corresponding CIDEr & CLIPScore. (7.5%, each setting for 2.5%)

| Setting | CIDEr | CLIPScore |
| -------- | -------- | -------- |
| Without freezing layers | 2.1543538363994163e-07 | 0.4760 |
| With frozen layers | 0.2307 | 0.5439 |
| CNN encoder | 0.5284 | 0.6541 |

## Problem 3 - Visualization of Attention in Image Captioning

### 1. COCO attention maps

| bike |
| -------- |
| ![](https://i.imgur.com/IGKGKi1.jpg) |

| girl |
| -------- |
| ![](https://i.imgur.com/1bS6aSq.jpg) |

| sheep |
| -------- |
| ![](https://i.imgur.com/MEDkE0N.png) |

| ski |
| -------- |
| ![](https://i.imgur.com/GSmglWw.png) |

| umbrella |
| -------- |
| ![](https://i.imgur.com/BcHyXNc.jpg) |

### 2. According to CLIPScore, you need to visualize:

#### Top-1

| CLIPScore | 0.8990 |
| -------- | -------- |
| Predicted | a person standing on a beach with colorful kite . |
| Ground Truth | a man is walking towards his kite on the ground. |
| Image | ![](https://i.imgur.com/xLXXBVH.jpg) |

#### Least-1

| CLIPScore | 0.1331 |
| -------- | -------- |
| Predicted | a man in a white shirt and tie sitting at a table with food . |
| Ground Truth | an aging rocker performs on stage in a sleeveless shirt and striped pants . |
| Image | ![](https://i.imgur.com/MtkBaZi.jpg) |
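For reference, the per-caption CLIPScore values above can be reproduced along the following lines. This is only a sketch under the assumption that the score follows the common reference-free formulation `2.5 * max(cos(image_emb, text_emb), 0)` computed with the openai/CLIP `ViT-B/32` checkpoint; the course's official evaluation script may use a different backbone or weighting, and the file path in the usage comment is a placeholder.

```python
import clip          # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Reference-free CLIPScore: 2.5 * max(cosine(image, caption), 0)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cosine, 0.0)

# Hypothetical usage with the Top-1 example above (placeholder path):
# clip_score("path/to/top1_image.jpg", "a person standing on a beach with colorful kite .")
```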
### 3. Analyze the predicted captions and the attention maps for each word according to the previous question. Is the caption reasonable? Does the attended region reflect the corresponding word in the caption?

Judging from the attention-map results in question 1, the correspondence between the heatmaps and the words seems slightly off; this may be because the features coming out of the ViT model are not laid out in the expected X-Y spatial order. Overall, though, the attended regions do roughly match the words, even if the colors mentioned in the captions are sometimes wrong. In addition, extracting the attention weights from PyTorch's Transformer is more involved than expected: the decoder architecture has to be taken apart to reach the internal attention weights of one of its layers, as sketched below.
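To make that last point concrete, below is a minimal sketch (not the exact code used for this report) of one way to expose the cross-attention weights of a `torch.nn.TransformerDecoder`: patch the `forward` of the chosen layer's `multihead_attn` so that `need_weights=True` and record the returned map. The layer sizes, the dummy `memory`/`tgt` tensors, and the CLS-token layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Recent PyTorch versions call multihead_attn with need_weights=False inside the
# decoder layer, so a plain forward hook only sees None for the weights; patching
# the call is one way around that.
decoder_layer = nn.TransformerDecoderLayer(d_model=384, nhead=12)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

attn_maps = []  # one (batch, tgt_len, src_len) tensor per decoder forward pass


def patch_attention(mha: nn.MultiheadAttention):
    """Force the module to return (and record) its attention weights."""
    original_forward = mha.forward

    def forward_with_weights(query, key, value, **kwargs):
        kwargs["need_weights"] = True  # the decoder layer would otherwise discard them
        out, weights = original_forward(query, key, value, **kwargs)
        attn_maps.append(weights.detach().cpu())
        return out, weights

    mha.forward = forward_with_weights


# Record the cross-attention of the last decoder layer only.
patch_attention(decoder.layers[-1].multihead_attn)

# Hypothetical usage: memory stands in for the ViT patch features and tgt for
# the word embeddings (sequence-first layout, batch size 1).
memory = torch.randn(577, 1, 384)  # e.g. 24*24 patches + CLS token for a 384x384 input
tgt = torch.randn(10, 1, 384)      # 10 generated tokens
_ = decoder(tgt, memory)

# attn_maps[-1][0, i, 1:] can then be reshaped to 24x24 as the heatmap for word i
# (dropping position 0, assuming the encoder keeps the CLS token there).
```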