---
title: 'HW4 Programming: Image Captioning'
---
# HW4 Programming: Image Captioning
:::info
Programming assignment due **Tuesday, November 12th, 2024 at 6:00 PM EST**
:::
## Getting the stencil
Please click <ins>[here](https://classroom.github.com/a/h1wSOAez)</ins> to get the stencil code. Reference this <ins>[guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg)</ins> for more information about GitHub and GitHub Classroom.
:::danger
**Do not change the stencil except where specified**. Changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.
:::
## Assignment Overview
In this assignment, we will be generating English language captions for images using the Flickr 8k dataset. This dataset contains over 8,000 images with 5 captions apiece. We provide a slightly altered version of the dataset, with uncommon words removed and all captions truncated to their first 20 words to make training easier.
Follow the instructions in `preprocessing.py` to download and process the data. This should take around 10 minutes the first time and will generate a 100MB file. The dataset file itself is around 1.1GB and can be found [here](https://www.kaggle.com/datasets/adityajn105/flickr8k?resource=download).
Note: You may need to sign in with Google and create a Kaggle account. Download the dataset and put the `Images` folder and `captions.txt` in your data directory before you run `preprocessing.py`.
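Once everything is downloaded, your layout might look something like the sketch below (the exact location of `data/` is up to you; the `../data/data.p` default used by the training commands later assumes a `data` directory one level above where you run `assignment.py`):
```
data/
├── Images/        # unzipped Flickr 8k images from Kaggle
├── captions.txt   # captions file from the same Kaggle download
└── data.p         # ~100MB file generated by preprocessing.py
```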
You will implement **two** different types of image captioning models that first encode an image and then decode the encoded image to generate its English language caption. The first model you will implement in this assignment will be based on **Recurrent Neural Networks**, while the second model is based on **Transformers**. Since both of these models solve the same problem, they share the same preprocessing, training, testing, and loss function.
To simplify the assignment and improve your models’ quality, all the images in the dataset have been passed through <ins>[ResNet-50](https://arxiv.org/abs/1512.03385)</ins> to get 2048-D feature vectors. You will use these feature vectors in both your RNN and Transformer models to generate captions.
:::warning
**Note**: A major aspect of this assignment is the implementation of attention, and then using your attention heads to implement a decoder transformer block. Although these are the last two steps of the assignment roadmap, they will probably be the most time consuming, so be sure to account for that when working on this assignment.
:::
## Roadmap
### Step 1: Training the Model
In this assignment, both of the models you are implementing perform the same task; the only difference is their implementation. Because of this, some methods can be shared between them via the `ImageCaptionModel` class, located in the `model.py` file. This class takes in a decoder layer that we will implement in the next step and gives us a fully functioning image captioning model. The class is partially filled out, except for the `train` method, which is left to you. After steps 1 and 2, you will be able to run your RNN model.
**ImageCaptionModel**
- `train` and `test`: In contrast to most of your previous assignments, which used the Keras `fit` routine, this assignment requires fine-grained control over the training and testing loops (for experimentation purposes). In cases like this, it is often better to spell out the optimization process yourself. While doing so, you can still mimic the functionality of Keras components (i.e. `compile`, `call`, etc.) so that your model architecture feels familiar to users accustomed to standard Keras practice.
:::info
__Task 1.1 [ImageCaptionModel.train]:__ Fill out the `train` method for `ImageCaptionModel` in `model.py`.
- We have provided the `test` method; you should implement the `train` method in its entirety. The `train` method is responsible for the forward and backward passes over the training data; details are provided in the code, and an illustrative training-loop sketch follows this box.
:::
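If you are unsure where to start, a hand-rolled training loop generally has the shape sketched below. The function name, argument names, `loss_function`, and batching details here are illustrative assumptions, not the stencil's exact interface; follow the comments in `model.py` for the real signatures.
```python
import tensorflow as tf

def train_epoch(model, train_captions, train_image_feats, padding_index, batch_size=100):
    """Illustrative sketch of one epoch of teacher-forced training (names are placeholders)."""
    total_loss = 0.0
    for start in range(0, len(train_captions), batch_size):
        captions = train_captions[start:start + batch_size]
        image_feats = train_image_feats[start:start + batch_size]
        decoder_input = captions[:, :-1]    # shifted inputs (teacher forcing)
        decoder_labels = captions[:, 1:]    # next-word targets
        with tf.GradientTape() as tape:
            logits = model(image_feats, decoder_input)      # forward pass (placeholder call signature)
            mask = decoder_labels != padding_index          # ignore padding when scoring
            loss = model.loss_function(logits, decoder_labels, mask)  # placeholder loss hook
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
        total_loss += loss
    return total_loss
```
In practice you would also shuffle the training examples each epoch and track per-symbol accuracy alongside the loss.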
:::warning
**Note**: You'll notice `get_config` and `from_config` in the stencil code. These are TensorFlow serialization hooks that help prevent issues with saving and loading the model on different machines. If you run into issues, ensure you are using `tensorflow==2.15`.
:::
### Step 2: RNN Image Captioning
You will now build an RNN decoder that takes in a sequence of word IDs (an incomplete sentence) and an image embedding, and outputs logits over the vocabulary for each timestep in the sequence, following a similar architecture to the one in <ins>[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf)</ins>.
:::info
__Task 2.1 [RNNDecoder.__init__]:__ Fill out the init function, and define your trainable variables in `decoder.py`.
- Remember to use keras layers for your model. **The following layers may be helpful**:
- [tf.keras.layers.GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) OR [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) for your RNN layer
- [tf.keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) for the word embeddings
- [tf.keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) for all feed forward layers.
:::
:::info
__Task 2.2 [RNNDecoder.__call__]:__ Fill out the call function using the trainable variables you've created.
**Model should return logits, not probabilities**.
- As mentioned before, 2048-dimensional ResNet embeddings are already available. You should pass the image embeddings to your RNN as its initial state. However, before you do so, you will need to project the image vector down to the size of your RNN's hidden state; otherwise, the shapes won't be compatible.
- The decoder should take the English inputs shifted over one timestep, and use the combined hidden state and shifted input to predict the next English word. This procedure is called teacher forcing; it helps stabilize training since errors are not forced to propagate through the entire chain. In other words, we initialize the decoder with the "encoded" image, then give it the previous correct English word and have it guess the next, as if we were language modeling. Both of these tasks live in the `RNNDecoder` class in `decoder.py`; an illustrative sketch of how the pieces can fit together follows this box.
:::
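To give a rough picture of how the pieces can fit together, here is a sketch. The layer choices, sizes, and constructor arguments are illustrative, not the stencil's exact signature.
```python
import tensorflow as tf

class RNNDecoderSketch(tf.keras.layers.Layer):
    """Illustrative sketch only -- mirror the stencil's RNNDecoder interface."""
    def __init__(self, vocab_size, hidden_size, window_size, **kwargs):
        super().__init__(**kwargs)
        # Project the 2048-D ResNet feature vector down to the RNN's hidden size
        # so it can serve as the initial state.
        self.image_embedding = tf.keras.layers.Dense(hidden_size)
        self.word_embedding = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.rnn = tf.keras.layers.GRU(hidden_size, return_sequences=True)
        self.classifier = tf.keras.layers.Dense(vocab_size)  # logits, no softmax

    def call(self, encoded_images, captions):
        initial_state = self.image_embedding(encoded_images)   # (batch, hidden_size)
        embeddings = self.word_embedding(captions)              # (batch, window, hidden_size)
        outputs = self.rnn(embeddings, initial_state=initial_state)
        return self.classifier(outputs)                          # (batch, window, vocab_size)
```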
#### Running your RNN Model
Your `assignment.py` should now be able to run your RNN model.
<!-- The notebook also contains functions to visualize your models and generate captions once you complete the assignment. -->
Depending on your use case, you may choose to structure your model in a variety of ways. In contrast to previous assignments, this one is intended to mimic many modern research-oriented repositories you might find in the wild. **Instead of providing easy-to-use APIs for experimenters, they rigidify their implementation to make tests replicable.** For example, they may provide a command-line interface and define testing/training procedures which log results.
First, try running
```
python assignment.py
```
You'll notice that you'll get back a nice usage error message if you are missing any required arguments. You can investigate `assignment.py` to find that main will try to parse command-line arguments and fill in a variety of defaults. Specifically, you'll find this:
```python
def parse_args(args=None):
    """
    Perform command-line argument parsing (or otherwise parse arguments with defaults).
    To parse in an interactive context (i.e. in a notebook), pass the required arguments
    as a list of strings. For example:
        parse_args(['--type', 'rnn', ...])
    """
    parser = argparse.ArgumentParser(...)
    parser.add_argument('--type', required=True, ...)
    parser.add_argument('--task', required=True, ...)
    parser.add_argument('--data', required=True, ...)
    parser.add_argument('--epochs', type=int, default=3, ...)
    parser.add_argument('--lr', type=float, default=1e-3, ...)
    parser.add_argument('--optimizer', type=str, default='adam', ...)
    parser.add_argument('--batch_size', type=int, default=100, ...)
    parser.add_argument('--hidden_size', type=int, default=256, ...)
    parser.add_argument('--window_size', type=int, default=20, ...)
    parser.add_argument('--chkpt_path', default='', ...)
    parser.add_argument('--check_valid', default=True, ...)
    if args is None:
        return parser.parse_args()      ## For calling through command line
    return parser.parse_args(args)      ## For calling through notebook.
```
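For example, to get a populated `args` object in an interactive session, you could pass the required arguments as a list of strings (the values shown are just an example):
```python
args = parse_args(['--type', 'rnn', '--task', 'both', '--data', '../data/data.p'])
print(args.epochs)  # 3, filled in from the default above
```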
Feel free to add extra CLI arguments to adjust your training process. The following command will therefore be sufficient to try what an author (or you) might consider to be a "default training run" of the model:
```
## TODO: Increase epochs to a larger size when ready (maybe 2 or 3 would be enough?)
python assignment.py --type rnn --task both --data ../data/data.p --epochs 1 --chkpt_path ../rnn_model
```
Since this command also saves the model, we should be able to load it back in and use it. Feel free to modify the saving utility as needed based on your modifications, but the default system should work fine for the initial requirements.
:::info
**Task 2.3:** Adjust the hyperparameters and train away!
:::
:::warning
**Note**: You must include a checkpoint filepath to save the model weights, as the visualizations in the notebook rely on these checkpoints.
:::
:::success
Your target perplexity should be ≤ 20, and your target per-symbol accuracy should be > 30%. For reference, our RNN model trains within 5 minutes without a GPU and reaches a validation perplexity of around 15.
:::
:::danger
**Your RNN model shouldn’t take more than 5 minutes to finish training.**
:::
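As a sanity check while tuning, remember that perplexity is just the exponential of the average per-token cross-entropy, so it can be computed directly from your masked loss. A minimal sketch, where `total_loss` and `num_tokens` are hypothetical names for a summed loss and a non-padding token count:
```python
import tensorflow as tf

# total_loss: summed cross-entropy over all non-padding tokens (hypothetical name)
# num_tokens: number of non-padding tokens scored (hypothetical name)
perplexity = tf.exp(total_loss / num_tokens)
```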
#### Hyperparameters for RNN
While not required, we’ve filled in the notebook arguments/defaults with the values we found worked best for our implementation. Additionally, we recommend choosing embedding/hidden layer sizes between 32 and 256 and training for around 2-4 epochs.
<!--Testing We've provided unit tests (`hw4_test.ipynb`) as sanity checks for your Transformer and RNN implementations.-->
### Step 3: Attention and Transformer Blocks
RNNs are neat! However, since 2017, ***Transformers*** have been the standard. Transformer-based models have shown state-of-the-art results on a variety of NLP tasks like language modeling, translation, and image captioning.
In the first part of this assignment, we used built-in Keras RNN layers for our sequence decoder, but in this part of the assignment we will implement our own Transformer layers to use in our `TransformerDecoder`.
These architectures rely on stacked self-attention modules rather than recurrence, and we will specifically be implementing a simplified version of the [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) architecture.
:::info
__Tasks 3.1-3.5 [Transformer]:__ These attention modules turn a sequence of embeddings into Queries, Keys, and Values. In self-attention, each timestep has a query (Q), key (K), and value (V) embedding. The queries are compared to every key to produce an attention matrix, which is then used to create new embeddings for each timestep.
You will need to fill out the following classes:
- `AttentionMatrix`: Computes attention given key and query matrices (a sketch of the underlying math appears after this box).
- `AttentionHead`: Contains and manages a single head of attention, which includes initializing the key, value and query weight matrices and calling AttentionMatrix to compute attention weights.
- You may find `add_weight(name=, shape=??)` to be helpful - it is a function provided by `tf.keras.Layer` to add a weight variable to a given layer. `name` can be initialized to whatever you want!
- `MultiHeadedAttention`: This class should create three self-attention heads of input size `emb_sz` and output size `emb_sz//3` and concatenate their results.
Note that the visualization notebook’s get_attention method will need to be tweaked to account for this.
- `TransformerBlock`: This component needs to compute the attention offered to both the language input and the image context, and then reason about the pair to generate a decoding for the model. You are not doing anything novel here - this is where everything you have implemented comes together to form the transformer!
- `PositionalEncoding`: Embed your labels into an optimizable vector space and apply a positional offset. We will be using a sinusoidal positional encoding scheme as described <ins>[here](https://www.tensorflow.org/text/tutorials/transformer#the_embedding_and_positional_encoding_layer)</ins>.
:::
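To make the `AttentionMatrix` computation concrete, the core operation is scaled dot-product attention. The sketch below is illustrative only; the function name, shapes, masking scheme, and exact signatures should come from the stencil.
```python
import tensorflow as tf

def attention_matrix_sketch(K, Q, use_mask=False):
    """Illustrative scaled dot-product attention weights.

    K, Q: (batch, seq_len, head_dim). The mask shown here is a standard causal mask;
    check the stencil for the exact masking it expects.
    """
    head_dim = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(head_dim)   # (batch, q_len, k_len)
    if use_mask:
        causal = tf.linalg.band_part(tf.ones_like(scores), -1, 0)    # lower-triangular per batch
        scores += (1.0 - causal) * -1e9                              # block attention to future positions
    return tf.nn.softmax(scores, axis=-1)
```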
:::warning
Self-attention can be fairly confusing. **Make sure you read and follow the instructions in the STENCIL CODE.** We also encourage students to refer back to the lecture slides. Another great resource that explains the intuition/implementation of Transformers, and all the operations we implement in this part of the assignment can be found in this <ins>[article](http://jalammar.github.io/illustrated-transformer/)</ins>. We *highly* recommend reading and understanding that one if you want to make life easier.
:::
:::warning
**Warning: MultiHeadedAttention**
Please define each of your attention heads in a separate instance variable inside your `MultiHeadedAttention`. Otherwise, TensorFlow will not be able to find the nine weight variables across the three attention heads. A wiring sketch follows this box.
:::
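For instance, the constructor and call might be wired roughly like the sketch below. The `AttentionHead` constructor arguments and the three-argument call signature shown here are placeholders; use whatever your stencil defines.
```python
import tensorflow as tf

class MultiHeadedAttentionSketch(tf.keras.layers.Layer):
    """Illustrative only: three separately stored heads, outputs concatenated."""
    def __init__(self, emb_sz, use_mask, **kwargs):
        super().__init__(**kwargs)
        # One instance variable per head, as the note above requires, so that
        # TensorFlow tracks all nine K/Q/V weight matrices.
        # (Assumes emb_sz is divisible by 3 so the concatenation is emb_sz wide.)
        self.head1 = AttentionHead(emb_sz, emb_sz // 3, use_mask)  # placeholder args
        self.head2 = AttentionHead(emb_sz, emb_sz // 3, use_mask)  # placeholder args
        self.head3 = AttentionHead(emb_sz, emb_sz // 3, use_mask)  # placeholder args

    def call(self, inputs_for_keys, inputs_for_values, inputs_for_queries):
        # Concatenate each head's (batch, seq_len, emb_sz // 3) output along the feature axis.
        return tf.concat([
            self.head1(inputs_for_keys, inputs_for_values, inputs_for_queries),
            self.head2(inputs_for_keys, inputs_for_values, inputs_for_queries),
            self.head3(inputs_for_keys, inputs_for_values, inputs_for_queries),
        ], axis=-1)
```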
:::success
**Hint**: Use [tf.keras.layers.LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) for your layer normalization operations in your Transformer Block. You should also look to section 3.3 of [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) for a general idea of what your transformer Feed-Forward network should look like.
:::
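Putting the hint above together with the task description, a decoder block's forward pass often follows the pattern sketched below. This is illustrative only; the sub-layer ordering, masking, inner feed-forward size, and the `MultiHeadedAttention` call signature are assumptions to be replaced by whatever the stencil comments specify.
```python
import tensorflow as tf

class TransformerBlockSketch(tf.keras.layers.Layer):
    """Illustrative wiring: masked self-attention over the language inputs,
    cross-attention to the image context, then a small feed-forward network,
    each followed by a residual add + LayerNormalization."""
    def __init__(self, emb_sz, **kwargs):
        super().__init__(**kwargs)
        self.self_atten = MultiHeadedAttention(emb_sz, use_mask=True)    # placeholder signature
        self.cross_atten = MultiHeadedAttention(emb_sz, use_mask=False)  # placeholder signature
        self.ff = tf.keras.Sequential([
            tf.keras.layers.Dense(emb_sz * 4, activation='relu'),  # wider inner layer, per AIAYN §3.3
            tf.keras.layers.Dense(emb_sz),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, inputs, context):
        x = self.norm1(inputs + self.self_atten(inputs, inputs, inputs))
        # Keys/values from the image context, queries from the language features (placeholder call order).
        x = self.norm2(x + self.cross_atten(context, context, x))
        return self.norm3(x + self.ff(x))
```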
### Step 4: Transformers Image Captioning
:::info
__Task 4.1 [TransformerDecoder]:__ In the `decoder.py` file, fill out the `TransformerDecoder` class as follows:
- Fill out the `__init__` function and define your trainable variables.
- Fill out the call function using the trainable variables you've created.
:::
:::warning
This part of the assignment is very similar to Step 2. The two differences are:
- You will now use a `TransformerBlock` instead of a Keras RNN layer.
- Instead of passing the image embedding as the initial state, you will pass the encoded images to the `TransformerBlock` as its context sequence. Because transformers expect sequences when computing attention, you must reshape the image vectors into a sequence of window length 1 (see the sketch after the next note).
:::
:::warning
Additionally, please note that, for this architecture, your embedding/hidden_state size must be the same for your word embeddings and transformer blocks.
:::
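Concretely, the image-as-context reshaping mentioned above can be as simple as adding a length-1 sequence axis. The sketch below is illustrative; the layer names and the `PositionalEncoding`/`TransformerBlock` constructor arguments are placeholders for whatever your stencil defines.
```python
import tensorflow as tf

class TransformerDecoderSketch(tf.keras.layers.Layer):
    """Illustrative only; mirror the stencil's TransformerDecoder interface."""
    def __init__(self, vocab_size, hidden_size, window_size, **kwargs):
        super().__init__(**kwargs)
        self.image_embedding = tf.keras.layers.Dense(hidden_size)                 # 2048 -> hidden_size
        self.encoding = PositionalEncoding(vocab_size, hidden_size, window_size)  # placeholder args
        self.decoder = TransformerBlock(hidden_size)                              # placeholder args
        self.classifier = tf.keras.layers.Dense(vocab_size)                       # logits over vocabulary

    def call(self, encoded_images, captions):
        # Treat each image embedding as a context "sequence" of length 1.
        context = self.image_embedding(encoded_images)   # (batch, hidden_size)
        context = tf.expand_dims(context, axis=1)        # (batch, 1, hidden_size)
        x = self.encoding(captions)                      # word embeddings + positional offsets
        x = self.decoder(x, context)
        return self.classifier(x)                        # (batch, window, vocab_size)
```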
#### Running your Transformer
Like your RNN, you can run your transformer with
```
python assignment.py --type transformer --task both --data ../data/data.p --epochs 1 --chkpt_path ../transformer_model
```
:::info
**Task 4.2:** Adjust the hyperparameters and train away!
:::
:::success
You should be able to reach validation perplexity in the ballpark of 15-18 by the end of training! We found that around 4 epochs was enough for our settings, but your results may vary. Though you are not constrained by any time limits, know when to stop and try to be proactive with your time.
:::
:::warning
**Note**: You must include a checkpoint filepath to save the model weights, as the visualizations in the notebook rely on these checkpoints.
:::
#### Mandatory Hyperparameters for Transformer
:::success
Similar to the RNN, your Transformer model shouldn’t take more than 5-10 minutes to finish training. Your target validation perplexity should be around 15, and your target per-symbol accuracy should be around 35%.
:::
## Grading
**Code**: You will primarily be graded on functionality. Your RNN model should have a perplexity of ≤ 20, and your Transformer model should have a perplexity of ≤ 18, both measured on the test set.
**README**: Your README should contain your final perplexity, your accuracy, and any known bugs.
## Autograder
Our autograder will load your saved model and run it, so you will not have to wait for the autograder to train your model. Additionally, the Transformer components will be tested in pseudo-isolation to make sure they function as expected.