2. Report 3 other attempts (e.g. with or without pretraining, model architecture, freezing layers, decoding strategy, etc.) and their corresponding CIDEr & CLIPScore. (7.5%, 2.5% for each setting)
Setting              CIDEr                     CLIPScore
Layers not frozen    2.1543538363994163e-07    0.4760
Layers frozen        0.2307                    0.5439
CNN encoder          0.5284                    0.6541
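For reference, the CLIPScore column can be read through the standard reference-free definition, CLIPScore = w * max(cos(image, text), 0) with w = 2.5. The sketch below is only an illustration of that formula on plain Python lists; the function name `clip_score` and the idea of passing precomputed embeddings are assumptions, not the grading script used for the numbers above.

```python
import math


def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cosine(image, text), 0).

    image_emb / text_emb are assumed to be precomputed CLIP embeddings
    given as equal-length sequences of floats (an assumption of this sketch).
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_image = math.sqrt(sum(a * a for a in image_emb))
    norm_text = math.sqrt(sum(b * b for b in text_emb))
    if norm_image == 0.0 or norm_text == 0.0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    cosine = dot / (norm_image * norm_text)
    # Negative similarities are clipped to 0 before rescaling.
    return w * max(cosine, 0.0)
```

Under this definition a score of 0.5439 would correspond to a cosine similarity of roughly 0.22 between the image and caption embeddings.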
Problem 3 - Visualization of Attention in Image Captioning
1. COCO attention maps
bike
girl
sheep
ski
umbrella
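A per-word attention map of this kind is typically produced by taking the attention weights that one generated word places on the image patches, reshaping them to the patch grid, and upsampling to the image size. The following is a minimal sketch of that step only, assuming a flat list of per-patch weights (with any class token already removed) and nearest-neighbour upsampling; the function name `attention_to_heatmap` and its parameters are hypothetical and do not come from the report's code.

```python
def attention_to_heatmap(weights, grid_h, grid_w, scale):
    """Turn a flat per-patch attention vector into an upsampled 2D heatmap.

    weights: attention of one caption word over grid_h * grid_w image patches
             (assumed layout: row-major, class token already dropped).
    scale:   integer upsampling factor, e.g. the patch size in pixels.
    Returns a list of rows with values min-max normalised to [0, 1].
    """
    if len(weights) != grid_h * grid_w:
        raise ValueError("weights must contain grid_h * grid_w values")
    lo, hi = min(weights), max(weights)
    span = (hi - lo) if hi > lo else 1.0  # avoid division by zero on flat maps
    grid = [
        [(weights[r * grid_w + c] - lo) / span for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Nearest-neighbour upsampling: every patch becomes a scale x scale block.
    return [
        [grid[y // scale][x // scale] for x in range(grid_w * scale)]
        for y in range(grid_h * scale)
    ]
```

The resulting grid can then be overlaid on the resized input image with any plotting library; smoother maps would need bilinear interpolation instead of the nearest-neighbour choice made here.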
2. According to CLIPScore, you need to visualize:
Top-1
CLIPScore: 0.8990
Predicted: a person standing on a beach with colorful kite .
Ground Truth: a man is walking towards his kite on the ground.
Image: (figure omitted)

Least-1
CLIPScore: 0.1331
Predicted: a man in a white shirt and tie sitting at a table with food .
Ground Truth: an aging rocker performs on stage in a sleeveless shirt and striped pants .
Image: (figure omitted)
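Picking the Top-1 and Least-1 pairs amounts to taking the maximum and minimum of the per-image CLIPScores. A small sketch of that selection follows, assuming the scores are held in a dictionary keyed by image id; the function name `top_and_least` and the ids in the usage example are hypothetical placeholders.

```python
def top_and_least(scores):
    """Return (best_id, worst_id) from a mapping of image id -> CLIPScore."""
    if not scores:
        raise ValueError("scores must not be empty")
    best_id = max(scores, key=scores.get)
    worst_id = min(scores, key=scores.get)
    return best_id, worst_id


# Usage with placeholder ids. The first two values are the scores reported
# above; the third is an illustrative filler, not a measured result.
example_scores = {"img_a": 0.8990, "img_b": 0.1331, "img_c": 0.5000}
best_id, worst_id = top_and_least(example_scores)
```

If several images tie, `max` and `min` return the first one encountered in the dictionary's insertion order.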
3. Analyze the predicted captions and the attention maps for each word according to the previous question. Is the caption reasonable? Does the attended region reflect the corresponding word in the caption?