HW4 Programming: Image Captioning

Programming assignment due Tuesday, November 12th, 2024 at 6:00 PM EST

Getting the stencil

Please click here to get the stencil code. Reference this guide for more information about GitHub and GitHub Classroom.

Do not change the stencil except where specified. Changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.

Assignment Overview

In this assignment, we will be generating English language captions for images using the Flickr 8k dataset. This dataset contains over 8000 images with 5 captions apiece. Provided is a slightly-altered version of the dataset, with uncommon words removed and all sentences limited to their first 20 words to make training easier.

Follow the instructions in preprocessing.py to download and process the data. This should take around 10 minutes the first time and will generate a 100MB file. The dataset file itself is around 1.1GB and can be found here. Note: You may need to sign in with Google and create a Kaggle account. Make sure to download the dataset and put the Images folder and captions.txt in your data directory before you run preprocessing.py.

You will implement two different types of image captioning models that first encode an image and then decode the encoded image to generate its English language caption. The first model you will implement in this assignment will be based on Recurrent Neural Networks, while the second model is based on Transformers. Since both of these models are trying to solve the same problems, they share the same preprocessing, training, testing, and loss function.

To simplify the assignment and improve your models’ quality, all the images in the dataset have been passed through ResNet-50 to get 2048-D feature vectors. You will use these feature vectors in both your RNN and Transformer models to generate captions.

Note: A major aspect of this assignment is the implementation of attention, and then using your attention heads to implement a decoder transformer block. Although these are the last two steps of the assignment roadmap, they will probably be the most time consuming, so be sure to account for that when working on this assignment.

Roadmap

Step 1: Training the Model

In this assignment, both of the models you are implementing perform the same task; the only difference is their implementation. Because of this, some methods can be shared between them in the ImageCaptionModel class, located in the model.py file. This class takes in a decoder layer that we will implement in the next step and gives us a fully functioning image captioning model. The class is mostly filled out, except for the train method, which is left to you. After steps 1 and 2, you will be able to run your RNN model.

ImageCaptionModel

  • train and test: In contrast to most of your previous assignments, which use the Keras fit routines, this assignment requires a large amount of control over the training and testing procedures (for experimentation purposes). In cases like these, it is often better to specify the optimization process in full detail. While doing so, you can still mimic the functionality of Keras components (i.e. compile, call, etc.) so that your model architecture feels familiar to users who are used to standard Keras practices.

Task 1.1 [ImageCaptionModel.train]: Fill out the train method for ImageCaptionModel in model.py.

  • We have provided the test method; you should implement the train method in its entirety and as appropriate. The train method performs the forward pass and the backward pass over the training data. Details are provided in the code.
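For intuition, here is a minimal sketch of what one training epoch could look like. Everything here is an assumption for illustration: the argument names (train_captions, train_image_features, padding_index), the model's call signature, and the provided loss_function and optimizer should all be taken from the stencil, not from this snippet.

import tensorflow as tf

def train(self, train_captions, train_image_features, padding_index, batch_size=100):
    """Illustrative outline of one training epoch; follow the stencil's actual signature."""
    total_loss = 0
    for start in range(0, len(train_captions), batch_size):
        captions = train_captions[start:start + batch_size]
        image_features = train_image_features[start:start + batch_size]

        # Teacher forcing: feed the caption shifted right and predict the next word.
        decoder_input = captions[:, :-1]
        decoder_labels = captions[:, 1:]

        with tf.GradientTape() as tape:
            logits = self(image_features, decoder_input)            # forward pass
            mask = decoder_labels != padding_index                  # ignore pad tokens
            loss = self.loss_function(logits, decoder_labels, mask)

        grads = tape.gradient(loss, self.trainable_variables)       # backward pass
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        total_loss += loss
    return total_loss

You will likely also want to shuffle the data each epoch and track accuracy and perplexity the same way the provided test method does.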

Note: You'll notice get_config and from_config in the stencil code. This is for Tensorflow and helps prevent issues with saving and loading the model on different machines. If you are having issues, ensure you are using tensorflow==2.15.

Step 2: RNN Image Captioning

You will now build an RNN decoder that takes in a sequence of word IDs (an incomplete sentence) and an image embedding, and outputs logits over the vocabulary for each timestep in the sequence, following a similar architecture to the one in Show and Tell: A Neural Image Caption Generator.

Task 2.1 [RNNDecoder.init]: Fill out the __init__ function and define your trainable variables in decoder.py.

Task 2.2 [RNNDecoder.call]: Fill out the call function using the trainable variables you've created. Your model should return logits, not probabilities.

  • As mentioned before, 2048-dimensional ResNet embeddings are already provided. You should pass the image embeddings to your RNN as its initial state. However, before you do so, you will need to project the image vector down to the size of your RNN's hidden state; otherwise, the two won't be compatible.

  • The decoder should take the English inputs shifted over one timestep and use the combined hidden state and shifted input to predict the next English word. This procedure is called Teacher Forcing, and it helps stabilize training since it doesn't force errors to propagate through the entire chain. In other words, we initialize the decoder with the "encoded" image, then give it the previous correct English word and have it guess the next, as if we were language modeling. All of this goes in the RNNDecoder class in the decoder.py file (a sketch follows below).
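To make the shapes concrete, here is a minimal sketch of one possible RNNDecoder built from standard Keras layers. The specific choices (GRU vs. LSTM, a Dense projection for the image, shared embedding/hidden sizes) are assumptions, not requirements of the stencil.

import tensorflow as tf

class RNNDecoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        # Project the 2048-D ResNet feature down to the RNN's hidden size.
        self.image_dense = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.embedding = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.rnn = tf.keras.layers.GRU(hidden_size, return_sequences=True)
        self.classifier = tf.keras.layers.Dense(vocab_size)        # logits, no softmax

    def call(self, encoded_images, captions):
        initial_state = self.image_dense(encoded_images)            # (batch, hidden_size)
        embeddings = self.embedding(captions)                        # (batch, window, hidden_size)
        outputs = self.rnn(embeddings, initial_state=initial_state)
        return self.classifier(outputs)                              # (batch, window, vocab_size)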

Running your RNN Model

Your assignment.py should now be able to run your RNN model.

Depending on your use case, you may choose to structure your model in a variety of ways. In contrast to previous assignments, this one is intended to mimic a lot of the modern research-oriented repositories you might find in the wild. Specifically: *instead of providing easy-to-use APIs for experimenters, such repositories rigidify their implementation to make experiments replicable.* In practice, that means providing a command-line interface and defining training/testing procedures that log their results.

First, try running

python assignment.py

You'll notice that you get back a helpful usage error message if any required arguments are missing. If you investigate assignment.py, you'll find that main parses command-line arguments and fills in a variety of defaults. Specifically, you'll find this:

def parse_args(args=None):
    """ 
    Perform command-line argument parsing (or otherwise parse arguments with defaults). 
    To parse in an interactive context (i.e. in a notebook), pass the required arguments
    in as a list of strings. For example: 
        parse_args(['--type', 'rnn', ...])
    """
    parser = argparse.ArgumentParser(...)
    parser.add_argument('--type',           required=True,              ...)
    parser.add_argument('--task',           required=True,              ...)
    parser.add_argument('--data',           required=True,              ...)
    parser.add_argument('--epochs',         type=int,   default=3,      ...)
    parser.add_argument('--lr',             type=float, default=1e-3,   ...)
    parser.add_argument('--optimizer',      type=str,   default='adam', ...)
    parser.add_argument('--batch_size',     type=int,   default=100,    ...)
    parser.add_argument('--hidden_size',    type=int,   default=256,    ...)
    parser.add_argument('--window_size',    type=int,   default=20,     ...)
    parser.add_argument('--chkpt_path',     default='',                 ...)
    parser.add_argument('--check_valid',    default=True,               ...)
    if args is None: 
        return parser.parse_args()      ## For calling through command line
    return parser.parse_args(args)      ## For calling through notebook.

Feel free to add extra CLI arguments to adjust your training process. The following command will therefore be sufficient to try what an author (or you) might consider to be a "default training run" of the model:

## TODO: Increase epochs to a larger size when ready (maybe 2 or 3 would be enough?)
python assignment.py --type rnn --task both --data ../data/data.p --epochs 1 --chkpt_path ../rnn_model

Since this command also saves the model, we should be able to load it back in and use it. Feel free to modify the saving utility as needed based on your modifications, but the default system should work fine for the initial requirements.

Task 2.3: Adjust the hyperparameters and train away!

Note: You must include a checkpoint filepath to save the model weights, as the visualizations in the notebook rely on these checkpoints.

Your target perplexity should be ≤ 20, and your target per-symbol accuracy should be > 30%. For reference, our RNN model trains within 5 minutes without a GPU and reaches a validation perplexity of around 15. Your RNN model shouldn't take more than 5 minutes to finish training.
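For a rough sense of what these numbers mean: perplexity is the exponential of the average per-symbol cross-entropy (computed over non-padding tokens), so a perplexity of 15 corresponds to an average loss of about ln(15) ≈ 2.7. The stencil's provided metrics are the authoritative ones; the sketch below is just a reference, with hypothetical argument names.

import tensorflow as tf

def perplexity(logits, labels, mask):
    """exp(mean per-symbol cross-entropy), counting only unmasked (non-pad) symbols."""
    losses = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    mask = tf.cast(mask, losses.dtype)
    return tf.exp(tf.reduce_sum(losses * mask) / tf.reduce_sum(mask))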

Hyperparameters for RNN

While not required, we've filled in the arguments in the notebook/defaults that we found to work best for our implementation. Additionally, we recommend choosing embedding/hidden layer sizes between 32 and 256 and training for around 2-4 epochs.

Step 3: Attention and Transformer Blocks

RNNs are neat! However, since 2017, Transformers have been the standard. Transformer-based models have shown state-of-the-art results on a variety of NLP tasks like language modeling, translation, and image captioning.

In the first part of this assignment, we used built-in Keras RNN layers for our sequence decoder, but in this part we will implement our own Transformer layers to use in our TransformerDecoder.

These architectures rely on stacked self-attention modules rather than recurrence, and we will specifically be implementing a simplified version of the Attention Is All You Need architecture.

Tasks 3.1-3.5 [Transformer]: These attention modules turn a sequence of embeddings into Queries, Keys, and Values. In self-attention, each timestep has a query (Q), key (K), and value (V) embedding. The queries are compared to every key to produce an attention matrix, which is then used to create new embeddings for each timestep (a sketch of this computation follows the list below).

You will need to fill out the following classes:

  • AttentionMatrix: Computes attention given key and query matrices.
  • AttentionHead: Contains and manages a single head of attention, which includes initializing the key, value and query weight matrices and calling AttentionMatrix to compute attention weights.
    • You may find add_weight(name=..., shape=...) to be helpful - it is a method provided by Keras layers (tf.keras.layers.Layer) for adding a trainable weight variable to a layer. name can be initialized to whatever you want!
  • MultiHeadedAttention: This class should create 3 self-attention heads, each with input size emb_sz and output size emb_sz//3, and concatenate their results. Note that the visualization notebook's get_attention method will need to be tweaked to account for this.
  • TransformerBlock: This component needs to compute the attention offered to both the language input and the image context, and then reason about the pair to generate a decoding for the model. You are not doing anything novel here - this is where everything you have implemented comes together to form the transformer!
  • PositionalEncoding: Embed your labels into an optimizable vector space and apply a positional offset. We will be using a sinusoidal positional encoding scheme as described here.
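For reference, the computation at the heart of AttentionMatrix and AttentionHead is scaled dot-product attention. The sketch below is an assumption-laden summary rather than the stencil's interface: the stencil splits this work up (AttentionMatrix produces the attention weights, while AttentionHead learns the K/Q/V projection matrices via add_weight and applies the weights to the values), and it spells out exactly how the causal mask should be built.

import tensorflow as tf

def scaled_dot_product_attention(queries, keys, values, use_mask=False):
    """Sketch only. queries: (batch, Q, d), keys: (batch, K, d), values: (batch, K, d_v)."""
    d = tf.cast(tf.shape(keys)[-1], tf.float32)
    scores = tf.matmul(queries, keys, transpose_b=True) / tf.sqrt(d)   # (batch, Q, K)
    if use_mask:
        # Causal mask for self-attention: position i may only attend to positions <= i.
        causal = tf.linalg.band_part(tf.ones_like(scores[0]), -1, 0)   # lower-triangular (Q, K)
        scores += (1.0 - causal) * -1e9
    attention_weights = tf.nn.softmax(scores, axis=-1)                 # the "attention matrix"
    return tf.matmul(attention_weights, values)                        # new embeddings per timestep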

Self-attention can be fairly confusing. Make sure you read and follow the instructions in the STENCIL CODE. We also encourage students to refer back to the lecture slides. Another great resource that explains the intuition/implementation of Transformers, and all the operations we implement in this part of the assignment can be found in this article. We highly recommend reading and understanding that one if you want to make life easier.

Warning: MultiHeadedAttention

Please define each of your attention heads in separate instance variables inside your MultiHeadedAttention. Tensorflow will not be able to find the 9 variables across the 3 attention heads otherwise.

Hint: Use tf.keras.layers.LayerNormalization for your layer normalization operations in your Transformer Block. You should also look to section 3.3 of Attention Is All You Need for a general idea of what your transformer Feed-Forward network should look like.
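As a hedged illustration of the hint above, here is one way the feed-forward sub-layer might look, with a residual connection followed by LayerNormalization. The class name, sizes, and exact residual/normalization pattern are assumptions; inside your TransformerBlock, the same residual + LayerNormalization pattern typically also wraps the masked self-attention and image cross-attention sub-layers.

import tensorflow as tf

class FeedForwardSketch(tf.keras.layers.Layer):
    """Position-wise feed-forward network in the style of section 3.3 of
    Attention Is All You Need: two dense layers with a ReLU in between."""
    def __init__(self, emb_sz, **kwargs):
        super().__init__(**kwargs)
        self.dense_relu = tf.keras.layers.Dense(emb_sz, activation='relu')
        self.dense_out = tf.keras.layers.Dense(emb_sz)
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs):
        # Residual connection around the feed-forward network, then layer normalization.
        return self.layer_norm(inputs + self.dense_out(self.dense_relu(inputs)))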

Step 4: Transformers Image Captioning

Task 4.1 [TransformerDecoder]: In the decoder.py file, fill out the TransformerDecoder class as follows:

  • Fill out the __init__ function and define your trainable variables.
  • Fill out the call function using the trainable variables you've created.

This part of the assignment is very similar to Step 2. The two differences are:

  • You will now use a TransformerBlock instead of a Keras RNN layer.
  • Instead of passing the encoded images as the initial state, you will pass them to the TransformerBlock as the context sequence. Because transformers expect sequences when computing attention, you must reshape the image vectors into a sequence of window length 1 (see the snippet below).

Additionally, please note that, for this architecture, your embedding/hidden_state size must be the same for your word embeddings and transformer blocks.
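As a concrete example of the reshaping described above (the batch size and hidden size here are illustrative):

import tensorflow as tf

image_embeddings = tf.random.uniform((64, 256))               # (batch_size, hidden_size)
context_sequence = tf.expand_dims(image_embeddings, axis=1)   # (batch_size, 1, hidden_size)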

Running your Transformer

Like your RNN, you can run your transformer with

python assignment.py --type transformer --task both --data ../data/data.p --epochs 1 --chkpt_path ../transformer_model

Task 4.2: Adjust the hyperparameters and train away!

You should be able to reach validation perplexity in the ballpark of 15-18 by the end of training! We found that around 4 epochs was enough for our settings, but your results may vary. Though you are not constrained by any time limits, know when to stop and try to be proactive with your time.

Note: you must include a checkpoint filepath to save the model weights, as the visualizations in the notebook rely on these checkpoints.

Mandatory Hyperparameters for Transformer

Similar to the RNN, your Transformer model shouldn't take more than 5-10 minutes to finish training. Your target validation perplexity should be around 15, and your target per-symbol accuracy should be around 35%.

Grading

Code: You will be primarily graded on functionality. Your RNN model should have a perplexity of ≤ 20, and your Transformer model should have a perplexity that is ≤ 18. This specifically refers to the testing set.

README: Your README should contain your perplexity, accuracy, and any bugs.

Autograder

Our autograder will take your saved model and will run it, thereby alleviating the burden of having to wait for the autograder to train your model. Additionally, the Transformer components will be tested in pseudo-isolation to make sure they function as expected.
