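DLCV HW3
Problem 1 - Zero-shot image Classification with CLIP
1. Methods analysis
Previous methods (e.g. VGG and ResNet)
The conventional approach is to label images from a particular domain, let a CNN extract features, and then train an MLP on those features for classification. The performance of these models depends heavily on the contents of the dataset: for classes the dataset does not label, or for images whose distribution differs from the dataset, they may perform poorly. Extra steps such as transfer learning are then needed to get good results.
CLIP
CLIP differs in that training only needs images paired with descriptive text. Such data is easy to collect, and it lets the model understand things more like a human does. CLIP maps images and text into a shared embedding space and recognizes an image by computing its similarity to text, so it generalizes more broadly.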
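The shared-embedding idea can be made concrete with a minimal sketch. It assumes the image and the candidate prompts have already been encoded by CLIP's image and text encoders. The three-dimensional embeddings, the label list, and the `zero_shot_classify` helper are made up for illustration, and the temperature of 100 is an assumption that approximates CLIP's logit scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """Pick the label whose prompt embedding is most similar to the image."""
    # Scale similarities before the softmax (assumed temperature).
    logits = [temperature * cosine(image_emb, t) for t in text_embs]
    # Numerically stable softmax over the candidate prompts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings standing in for real encoder outputs (hypothetical values).
labels = ["cat", "dog", "car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

pred, probs = zero_shot_classify(image_emb, text_embs, labels)
print(pred)
```

No classifier head is trained here: changing the set of classes only means changing the list of prompts.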
2. Prompt-text analysis
| Prompt-text | Probability |
| --- | --- |
| This is a photo of {object} | 0.6080 |
| This is a {object} image. | 0.6824 |
| No {object}, no score. | 0.5624 |
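Judging from the results, the plain sentence performs better in zero-shot prediction. This is probably because the dataset is fairly uniform and the images are simple, though. If an image contained many objects, the former prompt's predictions might turn out differently.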
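The comparison above amounts to filling each template with every class name and scoring the resulting predictions. A minimal sketch follows, assuming the reported number is the fraction of validation images classified correctly (the report labels it "Probability"). The helper names `build_prompts` and `accuracy` and the class names are hypothetical.

```python
# The three prompt templates compared in the report.
TEMPLATES = [
    "This is a photo of {object}",
    "This is a {object} image.",
    "No {object}, no score.",
]

def build_prompts(template, class_names):
    """Fill one template with every class name, one prompt per class."""
    return [template.format(object=name) for name in class_names]

def accuracy(predictions, targets):
    """Fraction of predictions that match the target labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical class names, only to show the generated prompts.
for template in TEMPLATES:
    print(build_prompts(template, ["bicycle", "sheep"]))
```

Each template would be run through the same zero-shot pipeline, so any difference in the score comes from the wording alone.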
3. Quantitative analysis
[10 figures; images not available]
Problem 2 - Image Captioning with VL-model
1. Report your best setting and its corresponding CIDEr & CLIPScore on the validation data.
Parameters:
vocab_size = 18022
nHead = 12
hidden_size = 384
nLayers = 3
dropout_dec = 0.2
dropout_pos = 0.1
vit_encoder = timm.create_model('vit_small_patch16_384', pretrained=True)
encoder = models.Encoder(vit_encoder, embed_size=384)
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer_dec = torch.optim.AdamW(decoder.parameters(), lr=0.0001)
| CIDEr | CLIPScore |
| --- | --- |
| 0.2025 | 0.5217 |
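The `ignore_index=0` argument in the loss above is easy to overlook, so here is a pure-Python sketch of what it does with mean reduction: positions whose target equals the ignored index are left out of both the sum and the average. It assumes token id 0 is the padding token, which the report does not state, and `cross_entropy_ignore` is a hypothetical helper, not PyTorch code.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index=0):
    """Mean cross-entropy over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # assumed padding position: contributes nothing
        # Stable log-sum-exp, then the negative log-probability of the target.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    # With no valid targets the mean is undefined.
    return total / count if count else float("nan")

# Three positions over a two-token vocabulary; the last two targets are padding.
logits = [[0.0, 0.0], [2.0, 1.0], [5.0, -5.0]]
targets = [1, 0, 0]
loss = cross_entropy_ignore(logits, targets)
print(loss)
```

Only the first position is counted here, so padded caption tails do not dilute the loss.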
2. Report other 3 different attempts (e.g. pretrain or not, model architecture, freezing layers, decoding strategy, etc.) and their corresponding CIDEr & CLIPScore. (7.5%, each setting for 2.5%)
| Setting | CIDEr | CLIPScore |
| --- | --- | --- |
| Without frozen layers | 2.1543538363994163e-07 | 0.4760 |
| With frozen layers | 0.2307 | 0.5439 |
| CNN encoder | 0.5284 | 0.6541 |
Problem 3 - Visualization of Attention in Image Captioning
1. COCO attention maps
bike: [attention map image not available]
girl: [attention map image not available]
sheep: [attention map image not available]
ski: [attention map image not available]
umbrella: [attention map image not available]
2. According to CLIPScore, you need to visualize:
Top-1
CLIPScore: 0.8990
Predicted: a person standing on a beach with colorful kite .
Ground Truth: a man is walking towards his kite on the ground.
Image: [not available]
Least-1
CLIPScore: 0.1331
Predicted: a man in a white shirt and tie sitting at a table with food .
Ground Truth: an aging rocker performs on stage in a sleeveless shirt and striped pants .
Image: [not available]
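For reference, a sketch of how the Top-1 and Least-1 images can be picked. It assumes the commonly used CLIPScore definition, 2.5 times the cosine similarity between the image and caption embeddings, clipped at zero; the report does not spell out the formula. The file names and the middle score are hypothetical, while 0.8990 and 0.1331 are the two scores reported above.

```python
import math

def clip_score(image_emb, text_emb, w=2.5):
    """Assumed CLIPScore: w * max(cosine similarity, 0)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / (norm_i * norm_t), 0.0)

def top_and_least(scores):
    """Return the names with the highest and the lowest score."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst

# Hypothetical file names; the 0.5 entry is a made-up filler score.
scores = {"a.jpg": 0.8990, "b.jpg": 0.1331, "c.jpg": 0.5}
best, worst = top_and_least(scores)
print(best, worst)
```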
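3. Analyze the predicted captions and the attention maps for each word according to the previous question. Is the caption reasonable? Does the attended region reflect the corresponding word in the caption?
Judging from the attention maps in question 1, the heatmaps seem somewhat misaligned with the words. I am not sure whether this is because the features from the ViT model are not output in XY (spatial grid) order. Still, the captions are broadly related to the images, although the colors are sometimes wrong. In addition, getting the attention weights out of PyTorch's Transformer is more involved than expected: the whole decoder has to be taken apart to reach the internal values of a single layer.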
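One thing worth checking for the misalignment is how the per-word attention vector is turned into a 2D map. The sketch below assumes the decoder attends over all 577 tokens of `vit_small_patch16_384` (one class token followed by 24 x 24 patch tokens in row-major order); whether the report's encoder passes the class token through is not stated. `attention_to_grid` is a hypothetical helper.

```python
def attention_to_grid(weights, image_size=384, patch_size=16, has_cls_token=True):
    """Reshape one word's attention over ViT tokens into a square patch grid."""
    side = image_size // patch_size  # 384 // 16 = 24 patches per side
    # Drop the leading class token if present; it has no spatial position.
    patch_weights = weights[1:] if has_cls_token else weights
    if len(patch_weights) != side * side:
        raise ValueError(
            f"expected {side * side} patch weights, got {len(patch_weights)}"
        )
    # Row-major layout: row r holds patches r*side .. (r+1)*side - 1.
    return [list(patch_weights[r * side:(r + 1) * side]) for r in range(side)]

# Dummy attention vector: token index used as the weight, for inspection only.
weights = list(range(577))
grid = attention_to_grid(weights)
print(len(grid), len(grid[0]))
```

If the class token is not removed before reshaping, every patch shifts by one position, which is one possible source of a heatmap that looks offset from the object. The grid would then be upsampled to the image size for overlay.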