---
tags: hw5, handout
---
# HW5 Programming: Image Captioning
:::info
Conceptual questions due **Monday, April 8th, 2024 at 6:00 PM EST**
Programming assignment due **Sunday, April 14th, 2024 at 6:00 PM EST**
:::
## Theme
![image alt](https://nypost.com/wp-content/uploads/sites/2/2021/05/FISH-WP.png)
*The fish have made great technological advancements, including the creation of **Fishstagram**, but they are having trouble coming up with good captions. Your task is to develop a deep learning model that helps them, so this pufferfish can attract **attention** to his selfie with a diver!*
## Conceptual Questions
Please fill out the conceptual questions linked on the website under HW5: Image Captioning. You can type your answers using LaTeX or upload photos of written work.
## Getting the stencil
Please click <ins>[here](https://classroom.github.com/a/L_fpc3jZ)</ins> to get the stencil code. Reference this <ins>[guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg)</ins> for more information about GitHub and GitHub Classroom.
:::danger
**Do not change the stencil except where specified**. Changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.
:::
## Note on Autograder Tests (Important!!)
:::danger
Many of the autograder tests are based on the code in your notebook. Some of these tests check the values output in the notebook without re-running your code. Because of this, HW5 has a significant manual grading component. Please make sure you feel confident in your code, since manual inspection can lead to point deductions if the code is wrong, even if the result initially receives a full score from the autograder.
:::
## Assignment Overview
In this assignment, we will be generating English-language captions for images using the Flickr 8k dataset. This dataset contains over 8,000 images with 5 captions apiece. We provide a slightly altered version of the dataset, with uncommon words removed and all sentences truncated to their first 20 words to make training easier.
Follow the instructions in `preprocessing.py` to download and process the data. This should take around 10 minutes the first time and will generate a 100MB file. The dataset file itself is around 1.1GB and can be found [here](https://www.kaggle.com/datasets/adityajn105/flickr8k?resource=download).
Note: You may need to sign in with Google and create a Kaggle account. Make sure to download the dataset and put the `Images` folder and `captions.txt` in your `data` directory before you run `preprocessing.py`.
You will implement **two** different types of image captioning models that first encode an image and then decode the encoded image to generate its English language caption. The first model you will implement in this assignment will be based on **Recurrent Neural Networks**, while the second model is based on **Transformers**. Since both of these models are trying to solve the same problems, they share the same preprocessing, training, testing, and loss function.
To simplify the assignment and improve your models’ quality, all the images in the dataset have been passed through <ins>[ResNet-50](https://arxiv.org/abs/1512.03385)</ins> to get 2048-D feature vectors. You will use these feature vectors in both your RNN and Transformer models to generate captions.
Conceptually, this assignment is similar to the Language Model assignment: your model predicts the next word given a sequence of previous words. The difference is that in this task, your model is also given a vector (the 2048-D feature vector) that represents what the generated sentence should describe.
**Note**: A major aspect of this assignment is the implementation of attention, and then using your attention heads to implement a decoder transformer block. Although these are the last two steps of the assignment roadmap, they will probably be the most time consuming, so be sure to account for that when working on this assignment.
## Roadmap
### Step 1: Training the Model
In this assignment, both of the models you implement perform the same task; the only difference is their implementation. Because of this, some methods can be shared between them in the `ImageCaptionModel` class, located in the `model.py` file. This class takes in a decoder layer (which we will implement in the next step) and gives us a fully functioning image captioning model. The class is partially filled out for you, except for the `train` method, which is left to you. After Steps 1 and 2, you will be able to run your RNN model.
**ImageCaptionModel**
- `train` and `test`: In contrast to most of your previous assignments, which use the Keras `fit` routines, this assignment requires a large amount of control over the training procedure (for experimentation purposes). In such cases, it is often better to specify the optimization process explicitly. While doing this, you can still mimic the functionality of Keras components (i.e., `compile`, `call`, etc.) to make your model architecture feel familiar to users who are used to standard Keras practices.
- We have provided the `test` method; you should implement the `train` method in its entirety. The `train` method takes in the model and performs the forward and backward passes; details are provided in the code. A minimal sketch of such a training loop is shown after this list.
- NOTE: If you are unable to access the decoder due to a TensorFlow version error, try adding these lines of code to your `ImageCaptionModel` class:
- ```python
  def get_config(self):
      return {"decoder": self.decoder}  ## specific to ImageCaptionModel

  @classmethod
  def from_config(cls, config):
      return cls(**config)
  ```
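Below is a minimal sketch of what a custom training loop for the `train` method can look like, using `tf.GradientTape`. The call signature `model(images, decoder_input)`, the `model.loss_function` helper, and the batching details are placeholders, not the stencil's actual interface; follow the stencil's signatures and the instructions in the code comments.

```python
import tensorflow as tf

def train_epoch_sketch(model, train_captions, image_features, padding_index, batch_size=100):
    """Illustrative only: one epoch of teacher-forced training (names are placeholders)."""
    total_loss = 0
    for start in range(0, len(train_captions), batch_size):
        captions = train_captions[start:start + batch_size]
        images = image_features[start:start + batch_size]

        # Teacher forcing: the decoder sees words [0, T-1] and predicts words [1, T].
        decoder_input = captions[:, :-1]
        decoder_labels = captions[:, 1:]

        with tf.GradientTape() as tape:
            logits = model(images, decoder_input)                     # assumed call signature
            mask = decoder_labels != padding_index                    # ignore padded positions
            loss = model.loss_function(logits, decoder_labels, mask)  # assumed helper on the model

        # Backward pass: compute gradients and apply them with the model's optimizer.
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
        total_loss += loss
    return total_loss
```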
### Step 2: RNN Image Captioning
You will now build an RNN decoder that takes in a sequence of word IDs (an incomplete sentence) and an image embedding, and outputs logits over the vocabulary for each timestep in the sequence, following a similar architecture to the one in <ins>[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf)</ins>.
As mentioned before, 2048-dimensional ResNet embeddings of the images are already available. You should pass the image embedding to your RNN as its initial state. However, before you do so, you will need to resize the image vector (e.g., with a dense layer) to match the size of your RNN's hidden state; otherwise, the two won't be compatible.
The decoder performs similarly to the language model in the last assignment: it should take the English inputs shifted over one timestep and use the hidden state combined with the shifted input to predict the next English word. This procedure is called teacher forcing, and it helps stabilize training since it doesn't let errors propagate through the entire chain. In other words, we initialize the decoder with the "encoded" image, then give it the previous correct English word and have it guess the next, as if we were language modeling. In the `decoder.py` file, fill out the RNNDecoder class:
**RNNDecoder**
- Fill out the init function, and define your trainable variables.
- Fill out the call function using the trainable variables you've created.
**Your model should return logits, not probabilities**.
Remember to use Keras layers for your model; a minimal sketch using them follows the list below. **The following layers may be helpful**:
- [tf.keras.layers.GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) OR [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) for your RNN layer
- [tf.keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) for the word embeddings
- [tf.keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) for all feed forward layers.
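As promised above, here is a minimal sketch of an RNN decoder built from those layers. The layer names, sizes, and call signature are illustrative assumptions; match the stencil's `__init__` and `call` signatures.

```python
import tensorflow as tf

class RNNDecoderSketch(tf.keras.layers.Layer):
    """Illustrative sketch only; not the stencil's exact interface."""

    def __init__(self, vocab_size, hidden_size, embed_size):
        super().__init__()
        self.image_projection = tf.keras.layers.Dense(hidden_size)  # resize image vector to the RNN state size
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        self.gru = tf.keras.layers.GRU(hidden_size, return_sequences=True)
        self.classifier = tf.keras.layers.Dense(vocab_size)         # logits over the vocabulary

    def call(self, encoded_images, captions):
        initial_state = self.image_projection(encoded_images)       # (batch, hidden_size)
        word_embeddings = self.embedding(captions)                   # (batch, window, embed_size)
        outputs = self.gru(word_embeddings, initial_state=initial_state)
        return self.classifier(outputs)                              # logits, not probabilities
```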
#### Running your RNN Model
Your `assignment.py` should now be able to run your RNN model.
In this assignment, there are two ways to run your `assignment.py` file. You can run your model on your personal computer, or you can save a copy of the Jupyter Notebook associated with this assignment to your Google Drive or GitHub. With the latter, you can use Google Colab's free GPU availability to train your model faster. The notebook also contains functions to visualize your models and generate captions once you complete the assignment.
You can run it with something like the following:
```bash
python assignment.py --task both --type <MODEL TYPE> --data <DATA FILE>
```
where `<MODEL TYPE>` is `rnn` or `transformer` and `<DATA FILE>` is the path to the `im_cap_data.p` file. There are plenty of other arguments (including ones for the optimizer, learning rate), so check those out as well! Refer to the ***Running your RNN Model*** section of the notebook or the `arg_parse` function in `assignment.py` for additional details.
**Your RNN model shouldn't take more than 5 minutes to finish training**.
Your target perplexity should be ≤ 20, and your target per-symbol accuracy should be > 30%. For reference, our RNN model trains within 5 minutes without a GPU and reaches a validation perplexity of around 15.
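As a refresher, perplexity here refers to the standard exponentiated average negative log-likelihood of the correct next word, computed over the $N$ non-padding tokens (and conditioned on the image in this assignment):

$$
\text{perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p\left(w_i \mid w_{<i}, \text{image}\right)\right)
$$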
#### Hyperparameters for RNN
While not required, we've filled in the arguments in the notebook (and the defaults) that we found to work best for our implementation. Additionally, we recommend choosing embedding/hidden layer sizes between 32 and 256 and training for around 2-4 epochs.
#### Testing
We've provided unit tests (`hw5_test.ipynb`) as sanity checks for your Transformer and RNN implementations.
### Step 3: Attention and Transformer Blocks
RNNs are neat! However, since 2017, there has been a new cool kid on the Natural Language Processing block: ***Transformers***. Transformer-based models have shown state-of-the-art results on a variety of NLP tasks like language modeling, translation, and image captioning.
In the first part of this assignment, we used built-in Keras RNN layers for our sequence decoder, but in this part we will implement our own Transformer layer to use in our `TransformerDecoder`.
These architectures rely on stacked self-attention modules rather than recurrence, and we will specifically be implementing a simplified version of the [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) architecture.
These attention modules turn a sequence of embeddings into Queries, Keys, and Values. In self-attention, each timestep has a query (Q), key (K), and value (V) embedding. The queries are compared to every key to produce an attention matrix, which is then used to create new embeddings for each timestep.
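Concretely, the core computation is scaled dot-product attention: compare queries against keys, normalize with a softmax, and use the resulting attention matrix to take a weighted sum of the values. A minimal sketch is below; the stencil splits this work between `AttentionMatrix` and `AttentionHead`, and decoder self-attention typically also applies a causal mask so a timestep cannot attend to future words (check the stencil for how masking is handled).

```python
import tensorflow as tf

def scaled_dot_product_attention(queries, keys, values):
    """Sketch of the core attention computation (masking omitted)."""
    # queries: (batch, T_q, d); keys, values: (batch, T_k, d)
    d = tf.cast(tf.shape(keys)[-1], tf.float32)
    scores = tf.matmul(queries, keys, transpose_b=True) / tf.sqrt(d)  # (batch, T_q, T_k)
    weights = tf.nn.softmax(scores, axis=-1)                          # attention matrix
    return tf.matmul(weights, values)                                 # new embedding for each timestep
```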
**TODO**: You will need to fill out the following components:
- `AttentionMatrix`: Computes attention given key and query matrices.
- `AttentionHead`: Contains and manages a single head of attention, which includes initializing the key, value, and query weight matrices and calling `AttentionMatrix` to compute attention weights.
- `MultiHeadedAttention` **[Required for 2470; bonus otherwise]**: Note that the notebook's `get_attention` method will need to be tweaked to account for this.
- `TransformerBlock`: This component needs to compute attention over both the language input and the image context, and then reason about the pair to generate a decoding for the model.
- `PositionalEncoding`: Embed your labels into an optimizable vector space and apply a positional offset. We will be using a sinusoidal positional encoding scheme as described <ins>[here](https://www.tensorflow.org/text/tutorials/transformer#the_embedding_and_positional_encoding_layer)</ins>; a minimal sketch of the sinusoidal encoding follows this list.
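For the sinusoidal scheme, the encoding itself can be computed as in the TensorFlow tutorial linked above; a minimal sketch is below. Your `PositionalEncoding` layer would embed the word IDs and add these encodings to the embeddings (the class structure and any scaling are dictated by the stencil).

```python
import numpy as np
import tensorflow as tf

def positional_encoding(length, depth):
    """Sinusoidal positional encodings of shape (length, depth); assumes depth is even."""
    positions = np.arange(length)[:, np.newaxis]                  # (length, 1)
    depths = np.arange(depth // 2)[np.newaxis, :] / (depth // 2)  # (1, depth/2)
    angle_rates = 1 / (10000 ** depths)
    angle_rads = positions * angle_rates                          # (length, depth/2)
    encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)
    return tf.cast(encoding, tf.float32)
```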
Self-attention can be fairly confusing. **Make sure you read and follow the instructions in the STENCIL CODE.** We also encourage students to refer back to the lecture slides. Another great resource that explains the intuition/implementation of Transformers, and all the operations we implement in this part of the assignment can be found in this <ins>[article](http://jalammar.github.io/illustrated-transformer/)</ins>. We *highly* recommend reading and understanding that one if you want to make life easier.
**Hint**: Use [tf.keras.layers.LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) for the layer normalization operations in your Transformer block. You should also look to Section 3.3 of [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) for a general idea of what your Transformer's feed-forward network should look like; a sketch of that sub-layer is shown below.
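To give a feel for that structure, here is a sketch of the feed-forward sub-layer with a residual connection and layer normalization. The attention sub-layers and the exact ordering of operations inside your `TransformerBlock` are specified in the stencil; the sizes and names below are assumptions.

```python
import tensorflow as tf

class FeedForwardSketch(tf.keras.layers.Layer):
    """Position-wise feed-forward network (two dense layers) with residual + layer norm."""

    def __init__(self, emb_sz, hidden_sz):
        super().__init__()
        self.dense_relu = tf.keras.layers.Dense(hidden_sz, activation="relu")
        self.dense_out = tf.keras.layers.Dense(emb_sz)
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs):
        outputs = self.dense_out(self.dense_relu(inputs))
        return self.layer_norm(inputs + outputs)   # residual connection, then normalize
```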
#### Testing
We've provided unit tests (`hw5_test.ipynb`) as sanity checks for your attention and Transformer block implementations in `transformer.py`.
### Step 4: Transformers Image Captioning
This part of the assignment is very similar to Step 2 (the RNN decoder). The two differences are:
- You will now use a `TransformerBlock` instead of a Keras RNN layer.
- Instead of passing the encoded images as the initial state, you will pass them to the `TransformerBlock` as the context sequence. Because transformers expect sequences when calculating attention, you must reshape the image vectors into a sequence of length 1 (see the snippet after this list).
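The reshape itself is a one-liner; a sketch, assuming the (possibly projected) image features arrive as a `(batch_size, emb_sz)` tensor:

```python
import tensorflow as tf

image_embeddings = tf.random.normal((64, 256))             # e.g. (batch_size, emb_sz) after projection
image_sequence = tf.expand_dims(image_embeddings, axis=1)  # (batch_size, 1, emb_sz): a length-1 "sequence"
```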
In the `decoder.py` file, fill out the `TransformerDecoder` class as follows.
#### TransformerDecoder TODOS:
- Fill out the `__init__` function and define your trainable variables.
- Fill out the call function using the trainable variables you've created.
Additionally, please note that, for this architecture, your embedding/hidden_state size must be the same for your word embeddings and transformer blocks.
#### Mandatory Hyperparameters for Transformer
These are similar to the RNN hyperparameters. Your Transformer model shouldn't take more than 5-10 minutes to finish training. Your target validation perplexity should also be around 15, and your target per-symbol accuracy should be around 35%.
#### CS2470 Students
You will also have to implement the `MultiHeadedAttention` class. This class should create 3 self-attention heads, each with input size `batch_sz` * `window_sz` * `emb_sz` and output size `batch_sz` * `window_sz` * `emb_sz/3`, and concatenate their results; a minimal sketch is shown below.
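In the sketch, `AttentionHeadSketch` is a stand-in for your `AttentionHead` class (its real constructor and call signature are defined in the stencil), and the sketch assumes `emb_sz` is divisible by 3.

```python
import tensorflow as tf

class AttentionHeadSketch(tf.keras.layers.Layer):
    """Stand-in for AttentionHead: learned K/V/Q projections plus scaled dot-product attention."""

    def __init__(self, output_sz):
        super().__init__()
        self.wk = tf.keras.layers.Dense(output_sz, use_bias=False)
        self.wv = tf.keras.layers.Dense(output_sz, use_bias=False)
        self.wq = tf.keras.layers.Dense(output_sz, use_bias=False)

    def call(self, inputs_k, inputs_v, inputs_q):
        k, v, q = self.wk(inputs_k), self.wv(inputs_v), self.wq(inputs_q)
        scale = tf.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
        weights = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) / scale, axis=-1)
        return tf.matmul(weights, v)


class MultiHeadedAttentionSketch(tf.keras.layers.Layer):
    """Three heads of size emb_sz // 3 each, concatenated back to (batch_sz, window_sz, emb_sz)."""

    def __init__(self, emb_sz):
        super().__init__()
        self.heads = [AttentionHeadSketch(emb_sz // 3) for _ in range(3)]

    def call(self, inputs_k, inputs_v, inputs_q):
        return tf.concat([head(inputs_k, inputs_v, inputs_q) for head in self.heads], axis=-1)
```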
Also please complete the CS2470-only conceptual questions in addition to the coding assignment and the CS1470 conceptual questions.
## Grading
**Code**: You will be primarily graded on functionality. Your RNN model should have a perplexity of ≤ 20, and your Transformer model should have a perplexity that is ≤ 18. This applies to both CS1470 and CS2470 students, and specifically refers to the testing set.
<!-- We will also look at your notebook for partial grading purposes if necessary. -->
**Conceptual**: Primarily graded on correctness (when applicable), thoughtfulness, and clarity.
**README**: Your README should contain your perplexity, accuracy, and any bugs.
## Autograder
Our autograder will take your saved model and run it, alleviating the burden of having to wait for the autograder to train your model. Additionally, the Transformer components will be tested in pseudo-isolation to make sure they function as expected.
## Handing In
Handing in the conceptual questions of this assignment is similar to previous homeworks. On Gradescope, make sure you submit to the 1470 version ONLY if you're enrolled in 1470, or the 2470 version ONLY if you're enrolled in 2470.
:::warning
:warning: IMPORTANT! Please make sure all `*.py` files are in "hw5/code"; this is very important for our autograder to work! DELETE the data folder or add it to your `.gitignore` before you zip up your code; otherwise it may be too big to upload to Gradescope.
:::
IF YOU ARE IN 2470: PLEASE REMEMBER TO ADD A BLANK FILE CALLED “2470student” IN THE hw5/code DIRECTORY. WE ARE USING THIS AS A FLAG TO GRADE 2470-SPECIFIC REQUIREMENTS; FAILURE TO DO SO MEANS LOSING POINTS ON THIS ASSIGNMENT.