# DLCV HW3
###### tags: `Course`
:::success
湯濬澤 NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/BkpQxeGHi
:::

## Problem 1 - Zero-shot Image Classification with CLIP

### 1. Methods analysis

#### **Previous methods (e.g. VGG and ResNet)**
The traditional approach is to label images from a specific domain, let a CNN extract features, and train an MLP on top for classification. However, the performance of these models depends heavily on the dataset: for classes that were never labeled, or for images whose distribution differs from the training data, they tend to perform poorly, and extra steps such as transfer learning are needed to get good results.

#### **CLIP**
CLIP is different in that training only requires pairs of images and their text descriptions. Not only is this kind of data easy to collect, it also lets the model understand things more the way a human does. CLIP maps images and text into a shared embedding space and classifies by computing the similarity between them, which makes it generalize more broadly.

### 2. Prompt-text analysis

| Prompt-text | Probability |
| :--------: | :--------: |
| This is a photo of {object} | 0.6080 |
| This is a {object} image. | 0.6824 |
| No {object}, no score. | 0.5624 |

Judging from these results, plain and simple sentences work better for zero-shot prediction. This is probably because the dataset is rather homogeneous and each image is simple; if an image contained many objects, the outcome for the first prompts might well be different.

### 3. Quantitative analysis

|   |   |
| :------: | :------: |
| ![](https://i.imgur.com/DzvprbB.png =250x) | ![](https://i.imgur.com/zEfKaZK.png) |
| ![](https://i.imgur.com/r26uSto.png =250x) | ![](https://i.imgur.com/wHcmfSf.png) |
| ![](https://i.imgur.com/zX48vZk.png =250x) | ![](https://i.imgur.com/kyaBHYy.png) |
| ![](https://i.imgur.com/hDZ2H7V.png =250x) | ![](https://i.imgur.com/0N5JHt3.png) |
| ![](https://i.imgur.com/saUtXXN.png =250x) | ![](https://i.imgur.com/nqr16bW.png) |

## Problem 2 - Image Captioning with VL-model

### 1. Report your best setting and its corresponding CIDEr & CLIPScore on the validation data.

Parameters:
```python
import timm
import torch
import torch.nn as nn

import models  # project-local module that wraps the ViT backbone as an Encoder

# Decoder hyper-parameters
vocab_size = 18022
nHead = 12
hidden_size = 384
nLayers = 3
dropout_dec = 0.2
dropout_pos = 0.1

# ViT-Small backbone (384x384 input, patch size 16) as the image encoder
vit_encoder = timm.create_model('vit_small_patch16_384', pretrained=True)
encoder = models.Encoder(vit_encoder, embed_size=384)

# Loss & optimizer: token id 0 (padding) is ignored in the loss, and only the
# caption decoder (a Transformer decoder defined elsewhere in the project) is optimized
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer_dec = torch.optim.AdamW(decoder.parameters(), lr=0.0001)
```

#### Performance on validation dataset

| Metric | Score |
| -------- | -------- |
| CIDEr | 0.2025 |
| CLIPScore | 0.5217 |

### 2. Report other 3 different attempts (e.g. pretrain or not, model architecture, freezing layers, decoding strategy, etc.) and their corresponding CIDEr & CLIPScore. (7.5%, each setting for 2.5%)

| Setting | CIDEr | CLIPScore |
| -------- | -------- | -------- |
| Without freezing layers | 2.1543538363994163e-07 | 0.4760 |
| With frozen layers | 0.2307 | 0.5439 |
| CNN encoder | 0.5284 | 0.6541 |

## Problem 3 - Visualization of Attention in Image Captioning

### 1. COCO attention maps

| bike |
| -------- |
| ![](https://i.imgur.com/IGKGKi1.jpg) |

| girl |
| -------- |
| ![](https://i.imgur.com/1bS6aSq.jpg) |

| sheep |
| -------- |
| ![](https://i.imgur.com/MEDkE0N.png) |

| ski |
| -------- |
| ![](https://i.imgur.com/GSmglWw.png) |

| umbrella |
| -------- |
| ![](https://i.imgur.com/BcHyXNc.jpg) |

### 2. According to CLIPScore, you need to visualize:

#### Top-1

| CLIPScore | 0.8990 |
| -------- | -------- |
| Predicted | a person standing on a beach with colorful kite . |
| Ground Truth | a man is walking towards his kite on the ground. |
| Image | ![](https://i.imgur.com/xLXXBVH.jpg) |

#### Least-1

| CLIPScore | 0.1331 |
| -------- | -------- |
| Predicted | a man in a white shirt and tie sitting at a table with food . |
| Ground Truth | an aging rocker performs on stage in a sleeveless shirt and striped pants . |
| Image | ![](https://i.imgur.com/MtkBaZi.jpg) |
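For reference, the per-caption CLIPScore values above can be reproduced along the following lines. This is only a sketch under the assumption that the score follows the common reference-free formulation `2.5 * max(cos(image_emb, text_emb), 0)` computed with the openai/CLIP `ViT-B/32` checkpoint; the course's official evaluation script may use a different backbone or weighting, and the file path in the usage comment is a placeholder.

```python
import clip          # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Reference-free CLIPScore: 2.5 * max(cosine(image, caption), 0)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cosine, 0.0)

# Hypothetical usage with the Top-1 example above (placeholder path):
# clip_score("path/to/top1_image.jpg", "a person standing on a beach with colorful kite .")
```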
### 3. Analyze the predicted captions and the attention maps for each word according to the previous question. Is the caption reasonable? Does the attended region reflect the corresponding word in the caption?

Judging from the attention-map results in question 1, the correspondence between the heatmaps and the words seems slightly off; this may be because the features coming out of the ViT model are not laid out in the expected X-Y spatial order. Overall, though, the attended regions do roughly match the words, even if the colors mentioned in the captions are sometimes wrong. In addition, extracting the attention weights from PyTorch's Transformer is more involved than expected: the decoder architecture has to be taken apart to reach the internal attention weights of one of its layers, as sketched below.
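To make that last point concrete, below is a minimal sketch (not the exact code used for this report) of one way to expose the cross-attention weights of a `torch.nn.TransformerDecoder`: patch the `forward` of the chosen layer's `multihead_attn` so that `need_weights=True` and record the returned map. The layer sizes, the dummy `memory`/`tgt` tensors, and the CLS-token layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Recent PyTorch versions call multihead_attn with need_weights=False inside the
# decoder layer, so a plain forward hook only sees None for the weights; patching
# the call is one way around that.
decoder_layer = nn.TransformerDecoderLayer(d_model=384, nhead=12)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

attn_maps = []  # one (batch, tgt_len, src_len) tensor per decoder forward pass


def patch_attention(mha: nn.MultiheadAttention):
    """Force the module to return (and record) its attention weights."""
    original_forward = mha.forward

    def forward_with_weights(query, key, value, **kwargs):
        kwargs["need_weights"] = True  # the decoder layer would otherwise discard them
        out, weights = original_forward(query, key, value, **kwargs)
        attn_maps.append(weights.detach().cpu())
        return out, weights

    mha.forward = forward_with_weights


# Record the cross-attention of the last decoder layer only.
patch_attention(decoder.layers[-1].multihead_attn)

# Hypothetical usage: memory stands in for the ViT patch features and tgt for
# the word embeddings (sequence-first layout, batch size 1).
memory = torch.randn(577, 1, 384)  # e.g. 24*24 patches + CLS token for a 384x384 input
tgt = torch.randn(10, 1, 384)      # 10 generated tokens
_ = decoder(tgt, memory)

# attn_maps[-1][0, i, 1:] can then be reshaped to 24x24 as the heatmap for word i
# (dropping position 0, assuming the encoder keeps the CLS token there).
```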