# DL Project: Image Captioning

**Mentor:** Carlos Escolano

**Authors:** Joel Moriana, Oscar Redondo, Ferran Pauls, Josep M Carreras

---

## Introduction

This README contains the documentation of the final deep learning project for the AIDL postgraduate programme at UPC. The main goal is to build an image captioning model from scratch using the PyTorch framework.

![](https://i.imgur.com/UznZlY6.png)

The model takes an image as input and has to predict a caption that describes the content of the image.

---

## Index

1. [Motivation and goals](#1-motivation-and-goals)
2. Dataset
3. Ingestion pipeline
4. Model architecture
    4.1 Baseline model
    4.2 Attention model
5. Rocks on the road
6. Implementation
7. Results
8. Examples
9. Conclusions and next steps
10. References

---

## 1. Motivation and goals

**Personal motivation**

The main goal is to learn deep learning in a practical way by building an image captioning model. We chose this project because:

* The model combines image analysis (CNN) and natural language processing (RNN) networks, both studied in the course.
* The baseline model can be enhanced with several modifications.

**Applications**

In real life there are many applications for image captioning. Some of the most important are:

* Medical diagnosis
* Helping blind people
* Better image searches on the Internet

## 2. Dataset

The dataset used to build the model is [Flickr8k](http://academictorrents.com/details/9dea07ba660a722ae1008c4c8afdd303b6f6e53b). It contains 8,000 images with five captions each. Bigger datasets are available nowadays, but the intention from the beginning was to test different ideas, so a small dataset has helped us to iterate fast.

The dataset is split into three parts: the training set (6,000 images) to update the weights, the validation set to determine when the model has learned, and the test set (1,000 images) to assess the performance by computing the BLEU metric.

**Image Dataset**

![](https://i.imgur.com/apFoFKa.png)

![](https://i.imgur.com/Av6Tv1T.png)

**Vocabulary**

| Train | Test |
| -------- | -------- |
| 7,489 non-stopwords | 4,727 non-stopwords |

It is important to assess the model with the same vocabulary distribution as the one it has been trained on. We can see that the most frequent words are the same in both datasets.

![](https://i.imgur.com/nzywtkF.png)

![](https://i.imgur.com/ttbYoN2.png)

We can do a similar assessment with the distribution of the caption lengths. We can see that they look alike.

<img src="https://i.imgur.com/4MQWj9V.png">

It is also important not to expect the model to predict captions containing words it has never seen before. These are the most frequent words in the test set that do not appear in the training set.

![](https://i.imgur.com/yeF5vIf.png)
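As an illustration of this vocabulary check, the sketch below compares the non-stopword vocabularies of two splits and lists the test words never seen in training. It is a toy example, not the repository's actual code: the captions and the stopword list are made up, and the function name `non_stopword_counts` is ours.

```python
from collections import Counter

# Toy stopword list; the real analysis would use a full stopword set.
STOPWORDS = {"a", "an", "the", "is", "in", "on", "of", "with", "and"}

def non_stopword_counts(captions):
    """Count non-stopword tokens over a list of caption strings."""
    counter = Counter()
    for caption in captions:
        for token in caption.lower().split():
            if token not in STOPWORDS:
                counter[token] += 1
    return counter

# In the real pipeline the captions come from the Flickr8k annotation files.
train_counts = non_stopword_counts(["A brown dog is sprayed with water",
                                    "A child plays on the beach"])
test_counts = non_stopword_counts(["A dog runs on the grass"])

# Test words the model has never seen during training.
unseen = set(test_counts) - set(train_counts)
print(len(train_counts), len(test_counts), unseen)
```

Any word that ends up in `unseen` cannot be expected to appear in the model's predictions.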
## 3. Ingestion pipeline

This section covers data representation, data augmentation and data ingestion.

**Dimensions of the data**

Across our splits (training, validation and test) we have 8,091 images, and the most common sizes are:

| Num. Images | Height | Width |
|:-----------:|:------:|:-----:|
| 1523 | 333 | 500 |
| 1299 | 375 | 500 |
| 659 | 500 | 333 |
| 427 | 500 | 375 |

The rest of the images have slight variations of these sizes.

**Tensor representation**

As we are doing transfer learning, we have to adapt the size of the images to the one the encoders expect, since they were trained on the ImageNet dataset. ResNet101 and SeNet154 expect:

| Height | Width |
| ------ | ----- |
| 224 | 224 |

**Data augmentation to avoid overfitting (regularization)**

From the beginning, instead of feeding the data straight as it comes in Flickr8k, we applied data augmentation as a regularization technique. The images are then transformed into tensors so that they can be processed by the model, so each batch fed to the encoder has the following shape:

| Batch_size | Height | Width | Channels |
| ------------ | ------ | ----- | -------- |
| 32 (default) | 224 | 224 | 3 |

**Data Augmentation Pipeline**

![](https://i.imgur.com/EdXev5z.png)

## 4. Model architecture

The model is split into two parts: (1) the encoder and (2) the decoder. The encoder is responsible for processing the input image and extracting the feature maps, whereas the decoder is responsible for processing those feature maps and predicting the caption.

**Encoder**

The encoder is composed of a Convolutional Neural Network (CNN) and a final linear layer that connects the encoder to the decoder. Due to the small dataset we are working with, we use CNNs pretrained on the ImageNet dataset and apply transfer learning. Therefore, the weights of the CNN are frozen during training and the only trainable parameters of the encoder are the weights of the last linear layer.

**Decoder**

The decoder is composed of a Recurrent Neural Network (RNN), as we are dealing with outputs of variable length. Specifically, to avoid vanishing gradients and losing context in long sequences, we use a Long Short-Term Memory (LSTM) network.

Depending on the decoder, we differentiate two model architectures: (1) the baseline model and (2) the attention model. Both models are explained in more detail in the following sections.

### 4.1 Baseline model

This model uses a vanilla LSTM as the decoder, and the output of the encoder's last layer is fed into the first LSTM step as the context vector. The next picture summarizes the architecture of this model:

![](https://i.imgur.com/a4Hp0Lf.png)

We use two different methods depending on whether we are at training or inference time:

* In training mode, we use teacher forcing, as we know the targets (i.e., we use the ground truth from the prior time step as input to the next LSTM step).
* In inference mode, the model uses the embedding of the previously predicted word as input to the next LSTM step.

### 4.2 Attention model

This model is based on the previous one, adding an attention mechanism to the decoder. The attention mechanism is responsible for deciding which parts of the image are more important while predicting each word of the sequence. Therefore, the decoder pays attention to particular areas or objects rather than treating the whole image equally.

Specifically, we use additive attention, which is a type of soft attention. In contrast to hard attention, soft attention is differentiable and attends to the entire input space, whereas hard attention is not differentiable because it selects the region to focus on by random sampling. Therefore, hard attention is not deterministic and needs other techniques, such as reinforcement learning, to be trained. The output of the attention block is a context vector computed as a weighted sum of the feature maps produced by the encoder.
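To make the mechanism more concrete, the following is a minimal PyTorch sketch of an additive attention block. It is illustrative rather than the exact module used in this repository: the class and layer names are ours, and the sizes correspond to the `encoder-size`, `hidden-size` and `attention-size` parameters described in the Implementation section.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention over the encoder feature maps."""

    def __init__(self, encoder_size, hidden_size, attention_size):
        super().__init__()
        self.encoder_proj = nn.Linear(encoder_size, attention_size)  # project feature maps
        self.decoder_proj = nn.Linear(hidden_size, attention_size)   # project LSTM hidden state
        self.score = nn.Linear(attention_size, 1)                    # one score per image region

    def forward(self, features, hidden):
        # features: (batch, num_regions, encoder_size), hidden: (batch, hidden_size)
        scores = self.score(torch.tanh(
            self.encoder_proj(features) + self.decoder_proj(hidden).unsqueeze(1)
        ))                                           # (batch, num_regions, 1)
        alphas = torch.softmax(scores, dim=1)        # attention map, sums to 1 over regions
        context = (alphas * features).sum(dim=1)     # weighted sum of the feature maps
        return context, alphas.squeeze(-1)
```

The returned `alphas` are the attention map discussed below, and `context` is the vector that is concatenated with the word embedding at each LSTM step.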
Each time the model infers a new word of the caption, it produces an attention map (the alphas), a probability distribution over the image regions that sums to one.

The overall architecture of this model is shown in the next figure. Note that the input of each LSTM cell is the concatenation of the word embedding and the context vector computed by the attention block.

![](https://i.imgur.com/lA7D8nm.png)

Similarly to the model explained above, we use teacher forcing while training.

## 5. Rocks on the road...

Once the models were implemented, the three most important issues we found were:

**Noisy results**

The loss and accuracy curves were very noisy. Our hypothesis was that this was caused by how often we logged results (every 25 batches). Logging once per epoch instead gave much smoother curves, since the model has more time to learn between logged points. This solved the issue.

**Accuracy not well calculated in training**

In training, the model needs a fixed output length, so within a batch we use the maximum length of the inputs. The output is therefore a fixed-size tensor that contains an `<END>` token followed by padding. When we compared this full output with the original sentence, we obtained a bad BLEU score. The solution was to compare the original sentence with the output only up to the `<END>` token.

**Start/End token in training**

At inference time the model always predicted `<START>`. The cause and the fix (shifting the source and target captions) are explained in detail in the Difficulties section at the end of this document.

## 6. Implementation

**Parameters to select**

* `--data-root`: root directory of the data (default directories are provided).
* `--batch-size`: number of images over which the loss, accuracy and backpropagation step are computed each time (default: 32).
* `--learning-rate`: initial learning rate (default: 1e-3). We use the Adam optimizer, which computes individual learning rates for different parameters based on the gradients.
* `--encoder-type`: `resnet101` or `se-resnet154` (default: `resnet101`).
* `--attention-type`: only additive attention is implemented.
* `--vocab-size`: minimum frequency of a word to be added to the vocabulary (default: 1).
* `--max-seq-length`: maximum sequence length, in words (default: 25).
* `--log-interval`: logging step with TensorBoard, in batches (default: 25).
* `--save-checkpoints`: save checkpoints (stored in the `checkpoints` folder).
* `--overfitting`: use the overfitting dataset (`data/flickr8k_overfitting`) to check whether the model is learning.

Other parameters: `--session-name`, `--num-epochs`, `--num-workers`, `--encoder-size`, `--hidden-size`, `--embedding-size`, `--attention-size`, `--no-save-model`.

**Selected encoders**

The first approach to solving complex problems was simply to make neural networks deeper. However, beyond a certain point, adding layers makes the training accuracy drop: when deeper networks start converging, a degradation problem appears in which accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.

ResNet addresses the vanishing gradient issue (when the network is too deep, the first layers can no longer be updated properly). The solution is to add shortcut connections that backpropagate through an identity mapping, preserving information across layers: this prevents the gradient from falling to zero while still allowing the early layers to be updated. We drop the following layers of the default architecture: FC and avgpool.

![](https://i.imgur.com/3objt8s.png)
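The sketch below shows what dropping those layers and freezing the backbone might look like with torchvision. It is an illustration rather than the repository's exact encoder: the class name is ours, and the projection size of 64 is simply the default `encoder-size` listed above.

```python
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Frozen ResNet-101 backbone followed by a trainable linear projection."""

    def __init__(self, encoder_size=64):
        super().__init__()
        resnet = models.resnet101(pretrained=True)                 # ImageNet weights
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and FC
        for param in self.cnn.parameters():                        # transfer learning: freeze the CNN
            param.requires_grad = False
        self.fc = nn.Linear(resnet.fc.in_features, encoder_size)  # the only trainable part

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> feature maps: (batch, 2048, 7, 7)
        features = self.cnn(images)
        features = features.flatten(2).permute(0, 2, 1)            # (batch, 49, 2048): 49 image regions
        return self.fc(features)                                   # (batch, 49, encoder_size)
```

The resulting 7x7 grid of projected feature maps is what the attention block of section 4.2 weights when computing the context vector.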
Squeeze-and-Excitation Networks (SENets) are built on top of another network and introduce a tiny computational overhead while potentially improving any convolutional model, such as ResNet. Each SE block squeezes the feature maps of a convolutional block with average pooling, passes the result through FC + ReLU (non-linearity) and FC + sigmoid (gating), and uses the output to weight each feature map. As a consequence, the network can adaptively adjust the weighting of each feature map, which improves model performance: it is a re-weighting. In our case we use `se-resnet154` and drop the following layers of the default architecture: FC, dropout and avgpool.

![](https://i.imgur.com/SinLXr5.png)

**Computing accuracy (BLEU)**

The BLEU score is the most widely used string-matching metric and ranges from 0 to 1. It is computed over the entire corpus as follows: it calculates modified n-gram precisions for n-grams up to length 4 (by default), taking word order into account, combines them, and applies a brevity penalty to translations that are too short. This is the formula:

![](https://i.imgur.com/8Hc0UPl.png)

As a rule of thumb, a very high BLEU score (> 0.7) is a sign of overfitting.

## 7. Results

### 7.1 Overfitting

We use a reduced dataset in order to overfit the models explained above and check that they are able to learn. The following table summarizes the number of pictures contained in each split:

| Train | Validation | Test |
| -------- | -------- | -------- |
| 15 photos | 5 photos | 5 photos |

The next table shows the selected parameters for the models:

| Parameter | Value |
| --- | --- |
| num-epochs | 300 |
| batch-size | 15 |
| learning-rate | 1e-3 |
| encoder-type | `resnet101`, `senet154` |
| attention-type | `none`, `additive` |
| encoder-size | 64 |
| hidden-size | 256 |
| embedding-size | 128 |
| attention-size | 64 |

### 7.2 Full dataset

In this section, we use the entire dataset with the following parameters:

| Parameter | Value |
| --- | --- |
| num-epochs | 20 |
| batch-size | 32 |
| learning-rate | 1e-3 |
| encoder-type | `resnet101`, `senet154` |
| attention-type | `none`, `additive` |
| encoder-size | 64 |
| hidden-size | 256 |
| embedding-size | 128 |
| attention-size | 64 |

## 8. Examples

![](https://i.imgur.com/yiOwTRg.png)

## 9. Conclusions and next steps

**What did we learn?**

* Course concepts and AI background.
* How important the dataset is.
* The importance of continuously improving the architecture.

**What would we have liked to do?**

* Use a bigger dataset and a different data augmentation implementation.
* Keep improving the performance of the model by trying new architectures (hard attention, bidirectional decoder...).
* Apply a learning rate scheduler.
* Study how image quality affects training time.
* Build our own image encoder.
* Specialize in a specific domain (e.g. cars, planes...).
* Bidirectional LSTM.
## 10. References

[1] https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf

[2] https://pytorch.org/docs/stable/optim.html

[3] https://blogs.nvidia.com/blog/2019/02/07/what-is-transfer-learning/

[4] https://arxiv.org/pdf/1908.01878v2.pdf

[5] https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/

[6] https://lab.heuritech.com/attention-mechanism

[7] https://blog.floydhub.com/attention-mechanism/

## Difficulties

### The model always predicting `<START>`

This issue came up due to two factors:

* We used teacher forcing in validation.
* We used the same transformed caption both as the training input and for the loss computation, and as a consequence we had problems with the special tokens.

Example of an initial target caption:

['`<START>`', 'A', 'brown', 'dog', 'is', 'sprayed', 'with', 'water', '`<END>`']

As we were using the same tokenized sequence both as input and to compute the loss, the model was learning to always predict the input word. Therefore, while training, the model seemed to learn, but at inference time it predicted a sequence of `<START>` words, because the first input to the model was the `<START>` token.

This issue was solved by applying a shift and using two different captions:

* The source caption, used as input, without the `<END>` token:

    ['`<START>`', 'A', 'brown', 'dog', 'is', 'sprayed', 'with', 'water']

* The target caption, used to compute the loss, without the `<START>` token, because we do not want the model to learn to produce `<START>`:

    ['A', 'brown', 'dog', 'is', 'sprayed', 'with', 'water', '`<END>`']

### No overfitting without attention

While overfitting the model without attention, the loss was not converging to zero even though the training accuracy reached 100%:

Accuracy train | Accuracy eval
:---:|:---:
<img src="imgs/issues/Accuracy_train_ignore_index_issue.svg" width=1000 /> | <img src="imgs/issues/Accuracy_eval_ignore_index_issue.svg" width=1000 />

Loss train | Loss eval
:---: | :---:
<img src="imgs/issues/Loss_train_ignore_index_issue.svg" width=1000> | <img src="imgs/issues/Loss_eval_ignore_index_issue.svg" width=1000>

This happened because we were taking the `<PAD>` token into account when computing the loss, so that part of the loss was constant. The issue was solved by setting `ignore_index` to the `<PAD>` index in the loss function.

### Overfitting but low accuracy

The solution to the previous issues led us to another problem. The model was overfitting, since the training loss converged to zero while the validation loss increased, but the accuracy was not increasing:

Accuracy train | Accuracy eval
:---:|:---:
<img src="imgs/issues/Accuracy_train_accuracy_issue.svg" width=1000 /> | <img src="imgs/issues/Accuracy_eval_accuracy_issue.svg" width=1000 />

Loss train | Loss eval
:---: | :---:
<img src="imgs/issues/Loss_train_accuracy_issue.svg" width=1000> | <img src="imgs/issues/Loss_eval_accuracy_issue.svg" width=1000>

The issue came up because we were taking into account the words predicted after the `<END>` token, which introduced a penalty when computing the BLEU score. The fix was to cut the predicted caption at the first `<END>` token before computing the BLEU.
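The two fixes above (ignoring `<PAD>` in the loss and trimming at `<END>` before BLEU) can be sketched in a few lines. This is illustrative, not the repository's actual code: the `<PAD>` index of 0 and the helper name `trim_at_end` are assumptions.

```python
import torch.nn as nn

# Skip <PAD> positions when computing the loss (assuming <PAD> maps to index 0 in the vocab).
PAD_IDX = 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def trim_at_end(tokens, end_token="<END>"):
    """Cut a predicted caption at the first <END> token before computing BLEU."""
    trimmed = []
    for token in tokens:
        if token == end_token:
            break
        trimmed.append(token)
    return trimmed

# Everything predicted after <END> is ignored when scoring against the reference captions.
prediction = ["A", "brown", "dog", "is", "sprayed", "<END>", "water", "water"]
print(trim_at_end(prediction))  # ['A', 'brown', 'dog', 'is', 'sprayed']
```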