---
title: 'HW5 Programming: Image Captioning'
---
# HW5 Programming: Image Captioning
:::info
Programming assignment due **Sunday, March 29th, 2026 at 11:59 PM EST**
:::
## Theme

Bruno, hungry, has made a pit stop on his journey through the cosmos to get some food. The trouble is, he can't communicate with the locals! The only way they communicate is through pictures, and Bruno is not very good at understanding what they mean. Your task is to create a model that helps Bruno understand what the images are trying to communicate!
## Getting the stencil
Please click <ins>[here](https://classroom.github.com/a/sTCtEvjn)</ins> to get the stencil code. Reference this <ins>[guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg)</ins> for more information about GitHub and GitHub Classroom.
:::danger
**Do not change the stencil except where specified**. Changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.
**This assignment uses PyTorch.** Make sure your environment has PyTorch installed (`pip install torch torchvision`).
:::
## Assignment Overview
In this assignment, we will generate English-language captions for images using the Flickr 8k dataset, which contains over 8,000 images with 5 captions apiece. We provide a slightly altered version of the dataset, with uncommon words removed and every caption truncated to its first 20 words to make training easier.
You will implement **two** different types of image captioning models that first encode an image and then decode the encoded image to generate its English language caption. The first model is based on **Recurrent Neural Networks**, while the second is based on **Transformers**. Since both models solve the same task, they share the same preprocessing, training, testing, and loss function.
To simplify the assignment and improve model quality, all images have been passed through <ins>[ResNet-50](https://arxiv.org/abs/1512.03385)</ins> to get 2048-D feature vectors. You will use these feature vectors in both models.
:::warning
**Note**: A major aspect of this assignment is implementing attention and then using your attention heads to build a Transformer decoder block. Although these are the last two steps, they will probably be the most time-consuming so plan accordingly.
:::
:::success
The stencil comes with two notebooks in `notebooks/`:
`HW5_Colab.ipynb` contains the full assignment stencil in notebook form so you can run it on Google Colab if needed. You can complete all TODOs there, but **copy your code back to the source files before submitting.** We will not autograde the notebook.
`HW5-Explore.ipynb` is an optional notebook to explore the data and visualize your learned attention heads. You'll need a saved Transformer checkpoint before attempting the attention visualization. It is not required but worth doing.
:::
---
## Setup
From within your stencil directory, run:
```bash
conda activate csci1470
bash setup.sh   # or: chmod +x setup.sh && ./setup.sh
```
This installs `torchvision` and `kaggle` into your active environment if they are not already present, then checks that the Flickr8k dataset (`Images/` and `captions.txt`) exists in `data/`. If the data is not found, the script will print a download link and exit.
:::warning
You will need to download the Flickr8k dataset manually from [Kaggle](https://www.kaggle.com/datasets/adityajn105/flickr8k). Once you have it, extract `Images/` and `captions.txt` into the `data/` folder at the root of the repo, then re-run `bash setup.sh` to confirm everything is in order.
:::
---
### Step 1: Text Processing
Before any training can happen, you need to complete three helper functions in `preprocessing/process_text.py`. These are called by `preprocessing.py` to convert raw caption strings into the padded integer sequences the model expects.
:::info
**TODO 1.1: Implement `pad_captions` in `preprocessing/process_text.py`**
After `preprocess_captions` runs, each caption is a token list of variable length up to `window_size + 1`. Pad each caption in-place with `'<pad>'` tokens until it is exactly `window_size + 1` tokens long.
:::
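A minimal sketch of what `pad_captions` could look like, assuming `captions` is a list of token lists (check the stencil's actual signature and docstring):

```python
# Hypothetical sketch of pad_captions; `captions` is assumed to be a
# list of token lists, each at most window_size + 1 tokens long.
def pad_captions(captions, window_size):
    for caption in captions:
        # Extend in-place until the caption is exactly window_size + 1 tokens.
        caption += ['<pad>'] * (window_size + 1 - len(caption))
```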
:::info
**TODO 1.2: Implement `unk_captions` in `preprocessing/process_text.py`**
Replace every token whose frequency in `word_count` is less than or equal to `minimum_frequency` with `'<unk>'`. Modify the captions in-place.
:::
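The idea can be sketched as below, assuming `word_count` maps each token to its training-set frequency; note that the stencil may want special tokens (e.g. `'<pad>'`) exempted, so follow its instructions:

```python
# Hypothetical sketch of unk_captions. Tokens absent from word_count are
# treated as frequency 0 here; special tokens may need exempting in practice.
def unk_captions(captions, word_count, minimum_frequency):
    for caption in captions:
        for i, token in enumerate(caption):
            if word_count.get(token, 0) <= minimum_frequency:
                caption[i] = '<unk>'   # replace rare tokens in-place
```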
:::info
**TODO 1.3: Implement `build_word_dictionary` in `preprocessing/process_text.py`**
Build a `word2idx` dictionary by scanning `train_captions` in order. The first time a token is seen, assign it the next available integer index starting from 0, and replace the token with that index. Then convert `test_captions` the same way, using `'<unk>'`'s index for any token not seen during training.
Return the completed `word2idx` dictionary.
:::
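In sketch form, the logic might look like this (names follow the TODO; it assumes `'<unk>'` appears somewhere in the training captions, which holds after `unk_captions` has run on them):

```python
# Hypothetical sketch of build_word_dictionary: assigns indices in order of
# first appearance and converts both caption sets to integer sequences in-place.
def build_word_dictionary(train_captions, test_captions):
    word2idx = {}
    for caption in train_captions:
        for i, token in enumerate(caption):
            if token not in word2idx:
                word2idx[token] = len(word2idx)   # next available index, from 0
            caption[i] = word2idx[token]
    unk_idx = word2idx['<unk>']
    for caption in test_captions:
        for i, token in enumerate(caption):
            # Tokens never seen during training map to '<unk>'.
            caption[i] = word2idx.get(token, unk_idx)
    return word2idx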
Once the dataset is in place and you have implemented these functions, run preprocessing to generate the `data.p` file used during training. Make sure you run it from inside the `preprocessing` folder:
```bash
cd preprocessing
python preprocessing.py
```
This reads `data/Images/` and `data/captions.txt`, extracts 2048-D ResNet-50 feature vectors for every image, tokenises and pads the captions, and writes everything to `data/data.p` (~100 MB). **This takes about 10 minutes** the first time.
:::info
It is worth reading through `preprocessing/preprocessing.py` to understand what the data looks like before you start coding: the shapes and vocabulary conventions it establishes will matter when you implement the decoders.
:::
---
### Step 2: Training the Model
In this assignment, both models perform the same task and differ only in their decoder architecture. The `ImageCaptionModel` class in `model.py` wraps any decoder and provides shared `test()`, `accuracy_function()`, and `loss_function()` utilities. You could often use a high-level training API, but academic research typically demands fine-grained control over training, so this assignment is designed to feel like modern research-oriented code.
`forward`, `test`, and `compile` are already provided. Please do not change them.
:::info
**TODO 2.1: Implement `train_epoch()` in `model.py`**
Fill out the training loop. For each mini-batch, slice the captions into decoder inputs (all tokens except the last) and decoder labels (all tokens except the first), build a padding mask, run the forward pass, compute loss, and step the optimizer. Shuffle the data at the start of each epoch and return the running loss, accuracy, and perplexity.
:::
:::warning
**Tip:** If your training loss is unstable or exploding, try adding `torch.nn.utils.clip_grad_norm_(self.parameters(), 1.0)` between `loss.backward()` and `optimizer.step()`. This clips the gradient norm to 1 and can significantly stabilize training.
:::
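The loop described in TODO 2.1 could be sketched as follows. This is illustrative, not the stencil's exact API: the argument names (`img_feats`, `captions`, `pad_idx`), the `masked_loss` helper, and the return value are assumptions, and your real version should also track accuracy and perplexity as the TODO asks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_loss(logits, labels, mask):
    # Per-token cross-entropy, ignoring padded positions (illustrative helper;
    # the stencil provides its own loss_function).
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           labels.reshape(-1), reduction='none')
    return (loss * mask.reshape(-1).float()).sum() / mask.sum()

def train_epoch(model, optimizer, img_feats, captions, pad_idx, batch_size=64):
    indices = torch.randperm(captions.shape[0])              # shuffle each epoch
    img_feats, captions = img_feats[indices], captions[indices]
    total_loss, num_batches = 0.0, 0
    for start in range(0, captions.shape[0], batch_size):
        imgs = img_feats[start:start + batch_size]
        caps = captions[start:start + batch_size]
        dec_inputs = caps[:, :-1]                            # all tokens except the last
        dec_labels = caps[:, 1:]                             # all tokens except the first
        mask = dec_labels != pad_idx                         # padding mask
        logits = model(imgs, dec_inputs)
        loss = masked_loss(logits, dec_labels, mask)
        optimizer.zero_grad()
        loss.backward()
        # Optional stabilizer from the tip above: clip the gradient norm to 1.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    return total_loss / num_batches
```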
After Steps 2 and 3 you will be able to run your RNN model.
---
### Step 3: RNN Image Captioning
You will now build an RNN decoder that takes a sequence of word IDs and an image embedding, and outputs vocabulary logits for each timestep, following a similar architecture to <ins>[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf)</ins>.
The 2048-D ResNet embeddings are pre-computed. Pass the image embedding to your RNN as its **initial hidden state**, but you'll first need to project it down to `hidden_size` with a small feed-forward layer.
We will then use **teacher forcing**: give the decoder the previous *correct* word at each step and have it predict the next.
:::info
**TODO 3.1: Implement `RNNDecoder.__init__` in `decoder.py`**
Define the layers needed to run the decoder. You will need:
- an image embedding network to project the 2048-D features down to `hidden_size`
- a word embedding layer for caption tokens
- a recurrent unit (GRU or LSTM with `batch_first=True`)
- a classifier head to project from `hidden_size` to `vocab_size`
- The constructor accepts a `dropout` parameter
- Use [`nn.Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) in your embedding and classifier layers to regularize training
:::
:::info
**TODO 3.2: Implement `RNNDecoder.forward` in `decoder.py`**
- Project the image embedding to `hidden_size`
- Use it as the recurrent unit's initial hidden state
- Embed the input caption tokens
- Run them through the recurrent unit
- Pass the output through the classifier
- Return logits, not probabilities
:::
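Putting TODOs 3.1 and 3.2 together, a minimal decoder might look like the sketch below. The layer names, the GRU choice, and the exact `Sequential` groupings are illustrative assumptions, not the stencil's required structure:

```python
import torch
import torch.nn as nn

class RNNDecoderSketch(nn.Module):
    """Illustrative RNN decoder: image features seed the hidden state."""
    def __init__(self, vocab_size, hidden_size, dropout=0.1):
        super().__init__()
        # Project 2048-D ResNet features down to hidden_size.
        self.image_proj = nn.Sequential(
            nn.Linear(2048, hidden_size), nn.ReLU(), nn.Dropout(dropout))
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout), nn.Linear(hidden_size, vocab_size))

    def forward(self, encoded_images, captions):
        h0 = self.image_proj(encoded_images).unsqueeze(0)   # [1, B, hidden] initial state
        embedded = self.embedding(captions)                  # [B, T, hidden]
        output, _ = self.gru(embedded, h0)
        return self.classifier(output)                       # logits, not probabilities
```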
#### Running your RNN Model
```bash
python assignment.py --type rnn --task both --data ../data/data.p --epochs 1
```
:::info
**TODO 3.3:** Adjust the hyperparameters and train away! We recommend hidden sizes between 256–512 and 10–15 training epochs.
:::
:::success
Your RNN model should not take more than 5 minutes to train. Target validation perplexity ≤ 20 and per-symbol accuracy ≥ 30%. For reference, our implementation reaches ~15 perplexity within a few epochs without a GPU.
:::
---
### Step 4: Attention and Transformer Blocks
RNNs are neat! But since 2017, **Transformers** have been the standard. Transformer-based models have shown state-of-the-art results on language modeling, translation, and image captioning. Here you will implement attention from scratch, following a simplified version of <ins>[Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)</ins>.
Attention projects a sequence of embeddings into Queries, Keys, and Values: each timestep gets a query (Q), key (K), and value (V). Each query is compared against every key to produce an attention matrix, whose weights are then used to average the values into new contextualized representations.
:::warning
Self-attention can be fairly confusing. **Read and follow the instructions in the stencil code carefully.** We also encourage you to refer back to the lecture slides. The <ins>[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)</ins> is an excellent resource that explains the intuition and implementation of everything you'll build here.
:::
:::info
**TODO 4.1: Implement `AttentionMatrix` in `transformer.py`**
Computes scaled dot-product attention weights given K and Q matrices.
- Compute the raw scores, apply a causal upper-triangular mask if `self.use_mask` is true (set masked positions to `-inf`), then apply softmax along the key dimension.
- [`torch.bmm`](https://pytorch.org/docs/stable/generated/torch.bmm.html) and [`F.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) will be useful here.
:::
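As a functional sketch (your stencil wraps this logic in a module, so names and shapes here are assumptions), scaled dot-product attention weights with an optional causal mask might look like:

```python
import math
import torch
import torch.nn.functional as F

def attention_matrix(K, Q, use_mask=False):
    # K, Q: [B, T, d]. Raw scores are scaled dot products.
    d = K.shape[-1]
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d)   # [B, T_q, T_k]
    if use_mask:
        T_q, T_k = scores.shape[1], scores.shape[2]
        # Upper-triangular mask: position i may not attend to positions > i.
        causal = torch.triu(torch.ones(T_q, T_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float('-inf'))
    return F.softmax(scores, dim=-1)                          # normalize over keys
```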
:::info
**TODO 4.2: Implement `AttentionHead` in `transformer.py`**
Holds the K, V, Q linear projections and calls `AttentionMatrix` to produce the attention output for a single head. You can initialise your projections as `nn.Linear` layers without bias.
:::
:::info
**TODO 4.3: Implement `MultiHeadedAttention` in `transformer.py`**
Creates 3 attention heads each of output size `emb_sz // 3`, concatenates their outputs along the last dimension, and applies a final `nn.Linear` projection back to `emb_sz`.
:::
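A combined sketch of TODOs 4.2 and 4.3 is below. The class names are illustrative, and the attention computation is inlined here where your version should call your own `AttentionMatrix`:

```python
import torch
import torch.nn as nn

class AttentionHeadSketch(nn.Module):
    """One head: bias-free K/V/Q projections plus scaled dot-product attention."""
    def __init__(self, input_size, output_size, use_mask):
        super().__init__()
        self.K = nn.Linear(input_size, output_size, bias=False)
        self.V = nn.Linear(input_size, output_size, bias=False)
        self.Q = nn.Linear(input_size, output_size, bias=False)
        self.use_mask = use_mask

    def forward(self, keys_in, values_in, queries_in):
        K, V, Q = self.K(keys_in), self.V(values_in), self.Q(queries_in)
        scores = torch.bmm(Q, K.transpose(1, 2)) / (K.shape[-1] ** 0.5)
        if self.use_mask:
            causal = torch.triu(
                torch.ones(scores.shape[1], scores.shape[2], dtype=torch.bool), 1)
            scores = scores.masked_fill(causal, float('-inf'))
        return torch.bmm(torch.softmax(scores, dim=-1), V)

class MultiHeadedAttentionSketch(nn.Module):
    """Three heads of size emb_sz // 3, concatenated and projected back."""
    def __init__(self, emb_sz, use_mask):
        super().__init__()
        head_sz = emb_sz // 3
        self.heads = nn.ModuleList(
            AttentionHeadSketch(emb_sz, head_sz, use_mask) for _ in range(3))
        self.out = nn.Linear(3 * head_sz, emb_sz)

    def forward(self, keys_in, values_in, queries_in):
        outs = [h(keys_in, values_in, queries_in) for h in self.heads]
        return self.out(torch.cat(outs, dim=-1))   # concat heads, project to emb_sz
```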
:::info
**TODO 4.4: Implement `TransformerBlock` in `transformer.py`**
This is where everything comes together.
- Apply masked self-attention over the caption input
- cross-attention using the image context as keys and values
- feed-forward sublayer (some combination of linear, activation, dropout).
- Each sublayer should get a residual connection and a [`nn.LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html).
- The constructor accepts a `dropout` parameter
- Use [`nn.Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) within and outside of your attention and feed-forward blocks
:::
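One plausible post-norm arrangement of the three sublayers is sketched below. Here `nn.MultiheadAttention` is only a stand-in for the attention classes you build in TODOs 4.1–4.3; your block must use your own implementation:

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Illustrative decoder block: masked self-attn -> cross-attn -> feed-forward,
    each sublayer with dropout, a residual connection, and LayerNorm (post-norm)."""
    def __init__(self, emb_sz, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(emb_sz, 3, dropout=dropout,
                                               batch_first=True)
        self.cross_attn = nn.MultiheadAttention(emb_sz, 3, dropout=dropout,
                                                batch_first=True)
        self.ff = nn.Sequential(nn.Linear(emb_sz, emb_sz), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(emb_sz, emb_sz))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(emb_sz) for _ in range(3))
        self.drop = nn.Dropout(dropout)

    def forward(self, inputs, context):
        T = inputs.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        a, _ = self.self_attn(inputs, inputs, inputs, attn_mask=causal)
        x = self.norm1(inputs + self.drop(a))                 # masked self-attention
        a, _ = self.cross_attn(x, context, context)           # image as keys/values
        x = self.norm2(x + self.drop(a))
        return self.norm3(x + self.drop(self.ff(x)))          # feed-forward sublayer
```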
:::info
**TODO 4.5: Implement `positional_encoding` and `PositionalEncoding` in `transformer.py`**
`positional_encoding` should return a sinusoidal encoding matrix of shape `[length x depth]` using alternating sin/cos at different frequencies. See the scheme described <ins>[here](https://www.tensorflow.org/text/tutorials/transformer#the_embedding_and_positional_encoding_layer)</ins>.
- `PositionalEncoding.forward` should embed the token ids
- Scale by `sqrt(embed_size)`, add the positional encoding, and apply dropout
:::
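A sketch of the sinusoidal table, following the concatenated sin/cos layout from the linked tutorial (the first `depth/2` columns are sines, the rest cosines; your stencil may expect a different interleaving, so check it):

```python
import numpy as np
import torch

def positional_encoding(length, depth):
    # positions: [length, 1]; per-column frequencies: [1, depth/2]
    positions = np.arange(length)[:, None]
    depths = np.arange(depth // 2)[None, :] / (depth // 2)
    angle_rates = 1 / (10000 ** depths)
    angle_rads = positions * angle_rates                     # [length, depth/2]
    # Concatenate sin and cos halves, matching the linked tutorial's layout.
    encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)
    return torch.tensor(encoding, dtype=torch.float32)       # [length, depth]
```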
---
### Step 5: Transformer Image Captioning
This step is very similar to Step 3. Instead of passing the image as a recurrent hidden state, you pass it as the **context sequence** for cross-attention inside the `TransformerBlock`. Project the image to `hidden_size` and reshape it to `[B, N, hidden_size]` before passing it as context.
:::warning
The embedding size for `PositionalEncoding` and `TransformerBlock` must match. Pass `dropout` through to both.
:::
:::info
**TODO 5.1: Implement `TransformerDecoder.__init__` in `decoder.py`**
Define an image projection network, a `PositionalEncoding` layer for caption tokens, one or more `TransformerBlock`(s), and a classifier head projecting to `vocab_size`. Pass `dropout` through to `PositionalEncoding` and each `TransformerBlock`.
:::
:::info
**TODO 5.2: Implement `TransformerDecoder.forward` in `decoder.py`**
Project and reshape the image embedding to serve as the cross-attention context. Embed and positionally-encode the caption tokens. Pass through your transformer block(s) and classifier. Return logits, not probabilities.
:::
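To show the data flow of TODOs 5.1 and 5.2 end to end, here is a minimal sketch. The positional-encoding parameter and `nn.TransformerDecoderLayer` are unmasked stand-ins for the `PositionalEncoding` and `TransformerBlock` classes you built in Step 4; your real decoder should use those instead:

```python
import torch
import torch.nn as nn

class TransformerDecoderSketch(nn.Module):
    """Illustrative decoder: image features become the cross-attention context."""
    def __init__(self, vocab_size, hidden_size, window_size, dropout=0.1):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Linear(2048, hidden_size), nn.ReLU())
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # Learned-position stand-in for your sinusoidal PositionalEncoding.
        self.pos = nn.Parameter(torch.zeros(window_size, hidden_size))
        # Stand-in for your TransformerBlock; the real block masks internally.
        self.block = nn.TransformerDecoderLayer(hidden_size, 4, batch_first=True)
        self.classifier = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoded_images, captions):
        # Project and reshape the image to [B, N, hidden] (here N = 1).
        context = self.image_proj(encoded_images).unsqueeze(1)
        x = self.embedding(captions) + self.pos[:captions.shape[1]]
        x = self.block(x, context)
        return self.classifier(x)                 # logits, not probabilities
```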
:::warning
Stacking multiple `TransformerBlock`s tends to help significantly here.
:::
#### Running your Transformer
```bash
python assignment.py --type transformer --task both --data ../data/data.p --epochs 1
```
:::info
**TODO 5.3:** Adjust the hyperparameters and train away! We found around 10–15 epochs with a hidden size of 512 to work well.
:::
:::success
Your Transformer model should finish training in 5–10 minutes. Target validation perplexity ≤ 25 and per-symbol accuracy ≥ 30%. `MultiHeadedAttention` must use 3 heads of output size `emb_sz // 3`, concatenated and projected back to `emb_sz`.
:::
---
## Grading
**Code**: Graded primarily on functionality. RNN perplexity ≤ 20; Transformer perplexity ≤ 25 (test set). Both models must reach ≥ 30% per-symbol accuracy.
**README**: Include your perplexity, accuracy, and any known bugs.
## Autograder
The autograder will load and run your saved model rather than retraining it, so you will not have to wait for training during grading. Additionally, the Transformer components will be tested in pseudo-isolation to make sure they function as expected.