---
tags: hw6, programming
---
# HW6 Programming: CLIP
:::info
Assignment due **November 13th at 10 pm EST** on Gradescope
:::
# CLIP Image Captioning
Welcome to the Image Captioning assignment! In this assignment, you'll build an end-to-end image captioning system using a Transformer-based decoder with cross-attention. Your model will learn to generate natural language descriptions of images by combining visual features from images with a language model.
**CLIP** stands for **Contrastive Language–Image Pre-training**. Developed by OpenAI, it learns a shared embedding space for images and text by training on large numbers of image–caption pairs, which is what lets a single model bridge the two modalities.
## Overview
Your system will consist of four main components:
1. **Preprocessing**: Tokenize captions and build a vocabulary
2. **Model Architecture**: A Transformer decoder with cross-attention that generates captions
3. **Evaluation**: Metrics (BLEU and F1) to evaluate caption quality
4. **Generation**: Beam search for efficient caption decoding
This handout guides you through each step. The training loop and data loading are **PROVIDED**: you don't need to implement them, but understanding them is important!
---
### Backstory
Deeper still in his underground odyssey, Bruno has discovered something extraordinary: The Gallery of Ten Thousand Tales, an ancient subterranean library carved into the living rock by a long-lost civilization known as the Petroglyph Keepers.
The cavern walls are covered in thousands of luminous stone tablets, each displaying a perfectly preserved image. But there's a problem: the original captions that accompanied these images have eroded away over millennia. According to an ancient inscription at the gallery's entrance, these captioned images form a map, but only when their descriptions are correctly restored will the path to the surface be revealed!
Bruno has found a mysterious artifact: the CLIP Crystal (Cross-modal Language-Image Portal), which the Petroglyph Keepers used to bridge vision and language. But the crystal is dormant. To reactivate it, Bruno must teach it to understand the relationship between images and their descriptions by training it on the few partially intact tablets that remain.
Your mission: Help Bruno build a Transformer-based captioning system that can generate descriptions for the unlabeled stone tablets, ultimately revealing the map that will guide him back to Providence!
The training loop has already been carved into the CLIP Crystal by the ancients; I just need you to implement the core components. Once we can generate accurate captions for all ten thousand tablets, the map home will reveal itself. The deeper we dig into attention mechanisms, the closer we get to sunlight!
---
### Stencil
Please use the GitHub Classroom [link](https://classroom.github.com/a/G1Heuy6x) below (or click the banner) to get the stencil code: {%preview https://classroom.github.com/a/G1Heuy6x %} Reference this [guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg) for more information about GitHub and GitHub Classroom.
:::warning
Make sure you clone this repository within the same parent folder as your virtual environment! Remember from assignment 1:
```python
csci1470-course/ ← Parent directory (you made this)
├── csci1470/ ← Virtual environment (from HW1)
├── HW6-CLIP/ ← This repo (you cloned this)
└── ...
```
:::
:::danger
**Do not change the stencil except where specified**. While you are welcome to write your own helper functions, changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.
:::
You will need to use the virtual environment that you made in Assignment 1.
:::info
**REMINDER ON ACTIVATING YOUR ENVIRONMENT**
1. Make sure you are in your cloned repository folder
- Your terminal prompt should end with the name of the repository
2. You can activate your environment by running the following command:
```bash
# Activate environment
source ../csci1470/bin/activate # macOS/Linux
..\csci1470\Scripts\activate # Windows
```
:::
:::spoiler Possible Issues with the stencil code!
If you have any issues running the stencil code, be sure that your virtual environment contains at least the following packages by running `pip list` once your environment is activated:
- `python==3.11`
- `numpy`
- `tensorflow>=2.15` (2.15 or any newer version is fine)
On Windows (Command Prompt or conda prompt), you can check whether a package is installed with:
```bash
pip list | findstr <package_name>
```
On macOS/Linux, you can check with:
```bash
pip list | grep <package_name>
```
:::
---
# Before You Start: Setup Instructions
## Step 0.1: Run setup.sh
Before implementing anything, you need to set up your data and environment.
```bash
bash setup.sh
```
### What setup.sh Does
The `setup.sh` script automates all data preparation in one go. It internally calls Python scripts to handle each step. Here's what happens:
**Stage 1: Install Dependencies**
- Installs h5py (≥3.3.0) for reading HDF5 feature files
**Stage 2: Download Flickr8k Dataset** (if needed)
- Downloads 8,000+ images from Kaggle
- Saves to `data/flickr8k/images/` and `data/flickr8k/captions.txt`
- Skips download if dataset already exists
**Stage 3: Prepare Caption File**
- Runs Python script: `src/data/download_data.py`
- Note: This is a Python script, not `download_data.sh`
- Validates caption format and creates clean caption file for tokenization
**Stage 4: Extract Image Features** (if needed)
- Runs Python script: `src/data/extract_features.py`
- Uses pre-trained ResNet50 CNN to extract features from all images
- Saves features to `data/image_features.h5`
- **Takes ~30 minutes** (only runs once)
:::danger
**IMPORTANT:** The feature extraction step takes a long time! Run this EARLY and let it complete before you start coding. If `data/image_features.h5` already exists, the features have already been extracted and this stage is skipped.
:::
### Files Created
After running setup.sh, you should have:
```
data/
├── flickr8k/
│ ├── images/ # 8,091 actual image files
│ └── captions.txt # Caption annotations
├── image_features.h5 # Pre-extracted CNN features (HDF5 format)
└── vocab.pkl # Tokenizer vocabulary (created during training)
```
---
## Step 0.2: Understanding Data Dimensions
Your data will have specific dimensions at each stage. Here's a reference:
### Raw Image Data
- **Images**: 8,091 total images from Flickr8k
- **Image Format**: JPEG files (~224×224 pixels RGB)
- **Captions**: ~40,455 captions total (~5 per image)
### Extracted Features (ResNet50)
- **Feature File**: `data/image_features.h5` (HDF5 format)
- **Per-image features shape**: `(49, 2048)`
- `49` = spatial regions (7×7 grid after ResNet50)
- `2048` = feature dimension from ResNet50 final layer
- **Batch shape**: `(batch_size, 49, 2048)`
### After Image Projection
- **Projected features shape**: `(batch_size, 49, embed_dim)`
- `embed_dim` = embedding dimension (smaller than original 2048)
- Makes computation faster while keeping rich information
### Caption Data
- **Vocabulary size**: ~3,000 unique words (after filtering)
- Max vocabulary: 10,000
- Min word frequency: 5 (words appearing <5 times are removed)
- **Special tokens**: 4 reserved (PAD, START, END, UNK)
- **Caption tokens shape**: `(batch_size, 40)`
- `40` = max caption length (including START and END)
- Shorter captions are padded with PAD tokens
- Longer captions are truncated
### Model Output
- **Logits shape**: `(batch_size, 40, vocab_size)`
- `batch_size` = number of examples per batch (default 16)
- `40` = caption sequence length
  - `vocab_size` = one logit per vocabulary word (the model's score distribution over the vocabulary)
### Data Format Table
| Component | Shape | Description |
|-----------|-------|-------------|
| Raw image | (224, 224, 3) | RGB image |
| Image features | (49, 2048) | ResNet50 features per image |
| Batch features | (16, 49, 2048) | Batched image features |
| Caption tokens | (16, 40) | Batch of token sequences |
| Model output | (16, 40, 3000) | Logits for vocabulary |
| Embeddings | (16, 40, `embed_dim`) | Text embeddings |
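
If you want to sanity-check these shapes before writing any model code, a few lines of `h5py` will do it. This is a minimal sketch; the exact key layout inside `image_features.h5` depends on how `extract_features.py` stores things, so treat the indexing below as an assumption and adapt it to what gets printed.
```python
import h5py

# Peek at the pre-extracted features (key layout is an assumption).
with h5py.File("data/image_features.h5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} top-level entries, e.g. {keys[:3]}")
    sample = f[keys[0]]
    # A dataset entry should report (49, 2048); if it's a group instead,
    # list its keys and drill one level deeper to find the feature array.
    if hasattr(sample, "shape"):
        print("per-image feature shape:", sample.shape)
```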
---
## Step 0.3: Using generate_captions.ipynb
After training your model, you'll use `generate_captions.ipynb` to generate and visualize captions.
### What the Notebook Does
The notebook provides an **interactive interface** for caption generation with these features:
#### 1. **Load Pre-trained Model**
- Loads your best trained model from `checkpoints/best_model.keras`
- Can also load provided pre-trained checkpoint
- Loads tokenizer to convert tokens ↔ text
#### 2. **Generate Captions for Random Images**
- Pick a random image from the dataset
- Generate caption using specified method (greedy or beam search)
- Display image alongside generated and reference captions
#### 3. **Compare Decoding Methods**
- Run both **greedy decoding** and **beam search** on same image
- Show top-k beam search results with their scores
- Visualize the difference in caption quality
#### 4. **Evaluate on Validation Set**
- Generate captions for 50+ images
- Compute BLEU-1, BLEU-2, BLEU-3, BLEU-4, and F1 metrics
- Show sample good and bad predictions
- Display examples with scores
#### 5. **Find Best & Worst Examples**
- Compute BLEU score for many images
- Show images with highest BLEU scores (model performs well)
- Show images with lowest BLEU scores (model struggles)
- Analyze what types of images are hard
#### 6. **Visualization Features**
- Display images with matplotlib
- Show generated and reference captions side-by-side
- Color-coded output for easy reading
- Progress bars for long operations
### Key Sections in the Notebook
```python
# Load model and data
model = tf.keras.models.load_model(hp.BEST_MODEL_PATH)
tokenizer = CaptionTokenizer.load(hp.VOCAB_PATH)
image_features = load_features(hp.FEATURES_FILE)
# Generate caption (single image)
caption = generate_caption(model, features, tokenizer, method='beam')
# Compare methods
results = compare_decoding_methods(model, features, tokenizer)
# Evaluate on many images
metrics = evaluate_captions(references, candidates, verbose=True)
```
### Typical Workflow
1. **Train your model** using `src/training/train.py`
2. **Open the notebook** in Jupyter
3. **Run Setup section** to load model and data
4. **Experiment with generation** using various cells
5. **Collect results** for your assignment report
### Example Output
When you run the notebook, you'll see:
```
Image: flickr8k_12345.jpg
Generated: "a dog is running in the park"
Reference: "a brown dog runs through the grass in a park"
Greedy Decoding:
"a dog is running in the park"
Beam Search (top 2):
1. "a dog is running in the park" (score: -2.34)
2. "a dog runs in a park" (score: -2.41)
BLEU-1: 0.67
BLEU-2: 0.50
F1: 0.75
```
:::success
**Tip:** Save interesting examples from this notebook for your assignment report. Show both good and bad predictions with analysis of why your model succeeds/fails.
:::
---
# Part 1: Preprocessing & Tokenization
## Overview of Text Preprocessing
The preprocessing module handles converting raw caption text into the token sequences your model will work with. It's located in `src/preprocessing/preprocess.py`.
We've **PROVIDED** you with:
- `clean_text(text)`: Cleans raw captions (removes URLs, HTML, normalizes case, etc.)
- `load_captions(caption_file)`: Loads captions from a CSV file into a dictionary
You need to **IMPLEMENT** the `CaptionTokenizer` class to:
- Build a vocabulary from captions
- Encode captions to token sequences
- Decode token sequences back to text
- Pad sequences for batching
### Special Tokens
The tokenizer uses special tokens to denote specific positions and unknown words:
| Token | Purpose | Index |
|-------|---------|-------|
| `<PAD>` | Padding for short sequences | 0 |
| `<START>` | Beginning of caption | 1 |
| `<END>` | End of caption | 2 |
| `<UNK>` | Unknown/out-of-vocabulary words | 3 |
Regular vocabulary words start at index 4.
### Vocabulary Statistics
You'll be building a vocabulary with:
- Maximum size: 10,000 words
- Minimum frequency: Words must appear at least 5 times across all captions
- This keeps the model manageable while covering common words
---
## Task 1.1: Implement `CaptionTokenizer.fit()`
:::info
**Task 1.1 [preprocess.CaptionTokenizer.fit]:** Build a vocabulary from captions.
This method is called once at the start to create the word→index mappings.
**Steps:**
1. Count word frequencies across all captions
- Iterate through all images and their captions
- Split each caption into words (by whitespace)
- Count how many times each word appears
- Hint: Use `Counter` from collections module
2. Filter words by minimum frequency
- Keep only words that appear at least `min_word_freq` times (default: 5)
- This removes rare words and noise from the vocabulary
3. Sort by frequency and limit vocabulary size
- Sort remaining words by frequency (descending)
- Keep only the top `max_vocab_size - 4` words (reserve 4 slots for special tokens)
- Hint: Use `most_common(n)` method on Counter
4. Create `word2idx` mapping
- You'll start with special tokens at indices 0-3
- Then add regular words starting at index 4
- Final mapping: `word -> token_index`
5. Create `idx2word` mapping
- Reverse of `word2idx`: `token_index -> word`
- This is used for decoding
After this step, `self.vocab_size` should be set to the total number of tokens (special + regular words).
**Expected Output:**
- `word2idx`: Dict mapping special tokens plus the kept words to indices (at most 10,000 entries)
- `idx2word`: Reverse mapping
- `vocab_size`: At most 10,000; with Flickr8k and a minimum frequency of 5, expect roughly 3,000
- Print statement showing vocabulary size
:::
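Here is one possible shape for `fit()`, as a minimal sketch rather than a drop-in solution. It assumes the captions come in as a dict mapping image IDs to lists of caption strings (as returned by `load_captions`), and that the attribute and constant names (`self.word2idx`, `self.PAD_TOKEN`, ...) match your stencil; the actual method signature in the stencil may differ.
```python
from collections import Counter

def fit(self, captions, min_word_freq=5, max_vocab_size=10000):
    # Steps 1-2: count word frequencies over every caption
    counts = Counter()
    for caption_list in captions.values():          # {image_id: [caption, ...]}
        for caption in caption_list:
            counts.update(caption.split())

    # Step 3: most frequent words first, reserving 4 slots for special tokens
    kept = [w for w, c in counts.most_common(max_vocab_size - 4) if c >= min_word_freq]

    # Steps 4-5: special tokens at indices 0-3, regular words from index 4, reverse map
    specials = [self.PAD_TOKEN, self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN]
    self.word2idx = {tok: i for i, tok in enumerate(specials)}
    for word in kept:
        self.word2idx[word] = len(self.word2idx)
    self.idx2word = {i: w for w, i in self.word2idx.items()}

    self.vocab_size = len(self.word2idx)
    print(f"Vocabulary size: {self.vocab_size}")
```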
---
## Task 1.2: Implement `CaptionTokenizer.encode()`
:::info
**Task 1.2 [preprocess.CaptionTokenizer.encode]:** Convert caption text to token indices.
This method converts a string like `"a dog is running"` to `[1, 5, 23, 42, 8, 2]` (with START and END tokens).
**Steps:**
1. Split caption into words
- Use whitespace as delimiter
- Hint: Use `.split()` on the caption string
2. Convert words to token indices
- For each word, look it up in `word2idx`
- If the word is not in the vocabulary, use the UNK token index instead
- Hint: Use `word2idx.get(word, word2idx[self.UNK_TOKEN])`
3. Add special tokens if requested
- If `add_special_tokens=True`:
- Prepend START token index (1)
- Append END token index (2)
- The result should be: `[START, word1, word2, ..., wordN, END]`
**Example:**
```
caption = "a dog is running"
encode(caption) = [1, idx_of_a, idx_of_dog, idx_of_is, idx_of_running, 2]
```
**Important Notes:**
- START and END tokens should only be added if `add_special_tokens=True`
- When encoding captions for training, you'll want both (teacher forcing)
- When encoding for generation, you typically start from just the START token
:::
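A minimal sketch of `encode()`, assuming the mappings built in Task 1.1 and the stencil's special-token constants:
```python
def encode(self, caption, add_special_tokens=True):
    unk_idx = self.word2idx[self.UNK_TOKEN]
    # Unknown words fall back to the UNK index
    tokens = [self.word2idx.get(word, unk_idx) for word in caption.split()]
    if add_special_tokens:
        tokens = ([self.word2idx[self.START_TOKEN]]
                  + tokens
                  + [self.word2idx[self.END_TOKEN]])
    return tokens
```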
---
## Task 1.3: Implement `CaptionTokenizer.decode()`
:::info
**Task 1.3 [preprocess.CaptionTokenizer.decode]:** Convert token indices back to text.
This is the inverse of `encode()`: `[1, 5, 23, 42, 8, 2]` → `"a dog is running"`
**Steps:**
1. Convert each token index to its corresponding word
- Use `idx2word` mapping
- Iterate through the list of indices
2. Optionally skip special tokens
- If `skip_special_tokens=True`, remove PAD, START, END, UNK tokens from output
- This gives you cleaner output text
3. Join words with spaces
- Use `' '.join(words)` to create the final caption string
**Example:**
```
indices = [1, idx_of_a, idx_of_dog, idx_of_is, idx_of_running, 2]
decode(indices, skip_special_tokens=True) = "a dog is running"
```
:::
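And the matching sketch for `decode()`, under the same naming assumptions:
```python
def decode(self, indices, skip_special_tokens=True):
    specials = {self.PAD_TOKEN, self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN}
    words = []
    for idx in indices:
        word = self.idx2word.get(int(idx), self.UNK_TOKEN)
        if skip_special_tokens and word in specials:
            continue
        words.append(word)
    return ' '.join(words)
```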
---
## Task 1.4: Implement `CaptionTokenizer.pad_sequences()`
:::info
**Task 1.4 [preprocess.CaptionTokenizer.pad_sequences]:** Pad variable-length sequences to fixed length.
Neural networks require fixed-size inputs for batching. This method converts sequences of variable length into fixed-size sequences filled with padding tokens.
**Input Format:**
- `sequences`: List of sequences, where each sequence is a list of token indices
- `max_length`: Target length (default: 40)
- `padding`: Where to add padding ('post' = pad at end, 'pre' = pad at start)
**Output Format:**
- NumPy array of shape `[num_sequences, max_length]` with dtype `int32`
- Filled with PAD token index (0) for padding
**Steps:**
1. Create numpy array
- Initialize array of shape `(len(sequences), max_length)`
- Fill with PAD token index: `word2idx[self.PAD_TOKEN]`
2. Fill in sequences
- For each sequence, copy its tokens into the array
- Handle two cases:
- **Sequence too short**: Pad with PAD tokens (array already initialized with these)
- **Sequence too long**: Truncate to `max_length` (keep first `max_length` tokens)
3. Respect padding direction
- If `padding='post'`: sequence goes at start, padding at end (most common)
- If `padding='pre'`: padding at start, sequence goes at end
**Example:**
```
sequences = [[1, 5, 23], [1, 5], [1, 5, 23, 42, 8, 2, 7, 9]]
max_length = 5, padding = 'post'
Result:
[[1, 5, 23, 0, 0],
[1, 5, 0, 0, 0],
[1, 5, 23, 42, 8]]
```
**Hint:**
- For 'post' padding: truncate first (`seq = seq[:max_length]`), then `padded[i, :len(seq)] = seq`
- For 'pre' padding: truncate first (`seq = seq[-max_length:]`), then `padded[i, -len(seq):] = seq`
:::
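A minimal sketch of `pad_sequences()` that matches the hint above (attribute names are assumptions):
```python
import numpy as np

def pad_sequences(self, sequences, max_length=40, padding='post'):
    pad_idx = self.word2idx[self.PAD_TOKEN]
    padded = np.full((len(sequences), max_length), pad_idx, dtype=np.int32)
    for i, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty sequence stays all-PAD
        if padding == 'post':
            seq = seq[:max_length]        # truncate: keep the first tokens
            padded[i, :len(seq)] = seq
        else:                             # 'pre'
            seq = seq[-max_length:]       # truncate: keep the last tokens
            padded[i, -len(seq):] = seq
    return padded
```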
---
# Part 2: Model Architecture
## Overview of the Image Captioning Model
Your model combines **vision** and **language** using a Transformer decoder with cross-attention:
1. **Image Encoder** (PROVIDED): Pre-extracted features from a CNN
2. **Image Projection**: Project image features to embedding dimension (YOU)
3. **Decoder**: Transformer decoder that generates captions (YOU)
The flow is:
```
Image Features [batch, 49, 2048]
↓ (projection)
Projected Features [batch, 49, embed_dim]
↓
Captions + Positions [batch, seq_len, embed_dim]
↓
Through Decoder Blocks (self-attention + cross-attention + FFN)
↓
Logits [batch, seq_len, vocab_size]
```
---
## Task 2.1: Implement `ImageCaptioningModel.__init__()`
:::info
**Task 2.1 [model.ImageCaptioningModel.__init__]:** Set up the image projection layer.
The image projection layer projects pre-extracted image features from 2048 dimensions down to the embedding dimension (default: 128).
**What is provided:**
- Model class structure with docstrings
- `self.decoder`: Already initialized CaptionDecoder (PROVIDED)
- Hyperparameters: `embed_size`, `image_feature_dim`, etc.
**What you need to implement:**
Create the `self.image_projection` sequential model with three layers:
1. **Dense Layer**: Project from `image_feature_dim` (2048) to `embed_size`
2. **Dropout**: Apply dropout with rate `dropout_rate`
3. **LayerNormalization**: Normalize the features
**Why these components?**
- Dense: Dimension reduction and feature transformation
- Dropout: Regularization to prevent overfitting
- LayerNorm: Stabilizes training by normalizing activations
**Hint:** Use `keras.Sequential([...])` with `Dense`, `Dropout`, and `LayerNormalization` layers.
**Output Shape:** `[batch_size, num_regions, embed_size]`
:::
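The projection itself is only a few lines. The standalone sketch below shows one way to wire it up and check the output shape; in the stencil you would build the same `keras.Sequential` inside `__init__` and store it as `self.image_projection` (the `dropout_rate=0.1` default here is just an illustrative assumption).
```python
import tensorflow as tf
from tensorflow import keras

def make_image_projection(embed_size=128, dropout_rate=0.1):
    # Dense: 2048 -> embed_size, Dropout: regularization, LayerNorm: stable activations
    return keras.Sequential([
        keras.layers.Dense(embed_size),
        keras.layers.Dropout(dropout_rate),
        keras.layers.LayerNormalization(epsilon=1e-6),
    ])

# Shape check: [batch, 49, 2048] -> [batch, 49, embed_size]
projection = make_image_projection()
dummy_features = tf.random.normal((16, 49, 2048))
print(projection(dummy_features, training=False).shape)   # (16, 49, 128)
```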
---
## Task 2.2: Implement `ImageCaptioningModel.call()`
:::info
**Task 2.2 [model.ImageCaptioningModel.call]:** Implement the forward pass.
This method ties the image projection and decoder together.
**Input:**
- `inputs`: Tuple of (image_features, caption_tokens)
- image_features: `[batch_size, num_regions, image_feature_dim]` (e.g., [16, 49, 2048])
- caption_tokens: `[batch_size, seq_length]` (e.g., [16, 40])
- `training`: Boolean indicating training vs inference mode
**Steps:**
1. Unpack the inputs
- Extract image_features and caption_tokens from the tuple
2. Project image features
- Apply `self.image_projection` to image_features
- Result should have shape `[batch_size, num_regions, embed_size]`
3. Pass through decoder
- Call `self.decoder(caption_tokens, projected_features, training=training)`
- Returns logits of shape `[batch_size, seq_length, vocab_size]`
4. Return logits
**Output Shape:** `[batch_size, seq_length, vocab_size]`
**Note:** The `training` parameter is important for proper dropout behavior!
:::
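The whole forward pass is short. A minimal sketch of the method body (the argument order for `self.decoder` follows the description above; check your stencil's signature):
```python
def call(self, inputs, training=False):
    image_features, caption_tokens = inputs                   # unpack the tuple
    # [batch, 49, 2048] -> [batch, 49, embed_size]
    projected = self.image_projection(image_features, training=training)
    # -> logits [batch, seq_len, vocab_size]
    return self.decoder(caption_tokens, projected, training=training)
```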
---
## Task 2.3: Implement `DecoderBlock.__init__()`
:::info
**Task 2.3 [decoder.DecoderBlock.__init__]:** Initialize components of a single decoder block.
A decoder block contains three sublayers:
1. Self-attention on captions (already provided)
2. Cross-attention between captions and images (YOU)
3. Feedforward network (YOU)
Each sublayer follows the pattern: LayerNorm → Sublayer → Dropout → Residual Connection
**What is PROVIDED:**
- `self.self_attention`: MultiHeadSelfAttention layer (already initialized)
**What you need to initialize:**
1. **Cross-Attention Layer**
- Class: `MultiHeadCrossAttention`
- Parameters: `embed_size`, `num_heads`
- This allows the caption tokens to attend to image features
2. **Feedforward Network (2 Dense layers)**
- First Dense: `embed_size` → `ff_dim` (e.g., 128 → 256) with ReLU activation
- Second Dense: `ff_dim` → `embed_size` (e.g., 256 → 128)
- Think of this as feature transformation: expand then compress
3. **Layer Normalization layers**
- You need 3 layer norm layers (one after each sublayer)
- Use `keras.layers.LayerNormalization(epsilon=1e-6)`
4. **Dropout layers**
- You need 3 dropout layers (one after each sublayer)
- Rate: `dropout_rate`
**Architecture Pattern:**
```
For each sublayer:
1. LayerNorm on input
2. Apply sublayer
3. Add dropout
4. Residual connection (add to input)
```
**Hint:** Store these in `self.` so you can use them in the call() method. Consider naming them clearly (e.g., `self.cross_attention`, `self.ffn_dense1`, `self.ffn_dense2`, etc.)
:::
---
## Task 2.4: Implement `DecoderBlock.call()`
:::info
**Task 2.4 [decoder.DecoderBlock.call]:** Implement the forward pass through a decoder block.
**Each sublayer follows the "post-layer-norm with residual connections" pattern** (LayerNorm → sublayer → dropout → residual, as described in Task 2.3). It is your job to implement this pattern here; you already implemented it on the last assignment.
1. **Self-Attention on captions**
- You still need to pay attention to the current tokens
- You should use the same post-layer-norm with residual connections pattern here!
2. **Cross-Attention (caption ↔ image)**
- Apply cross-attention using post-layer-norm with residual connections pattern
3. **Feedforward Network**
   - Apply the feedforward network (expand to `ff_dim`, then project back to `embed_size`), again using the same pattern
**Output:** `output` with shape `[batch_size, target_seq_len, embed_size]`
**Important:** Make sure `training=training` is passed to dropout layers!
:::
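One way to arrange the three sublayers is sketched below, following the LayerNorm → sublayer → dropout → residual pattern from Task 2.3. The layer names (`self.norm1`, `self.dropout1`, ...) and the exact call signatures of the attention layers are assumptions; use whatever names and signatures your `__init__` and the stencil define.
```python
def call(self, decoder_input, encoder_output, training=False):
    # 1) Self-attention over the caption tokens
    x = self.norm1(decoder_input)
    x = self.self_attention(x)
    x = decoder_input + self.dropout1(x, training=training)

    # 2) Cross-attention: caption tokens attend to the projected image features
    y = self.norm2(x)
    y = self.cross_attention(y, encoder_output)
    y = x + self.dropout2(y, training=training)

    # 3) Feedforward: expand to ff_dim (ReLU), compress back to embed_size
    z = self.norm3(y)
    z = self.ffn_dense2(self.ffn_dense1(z))
    return y + self.dropout3(z, training=training)
```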
---
## Task 2.5: Implement `CrossAttentionHead.__init__()`
:::info
**Task 2.5 [cross_attention.CrossAttentionHead.__init__]:** Initialize a single cross-attention head.
Cross-attention differs from self-attention in that the queries no longer attend to the caption's own previous tokens. Instead, each caption token attends to the encoder output, i.e. the projected image features. This allows each caption word to selectively attend to relevant image regions!
**What you need to initialize:**
1. **Three Linear Projections:**
- Query (Q) projection: `embed_size` → `head_size`
- Key (K) projection: `embed_size` → `head_size`
- Value (V) projection: `embed_size` → `head_size`
2. **Attention Matrix Computation:**
- Important to ask yourself: Should cross-attention use causal masking?
:::
---
## Task 2.6: Implement `CrossAttentionHead.call()`
:::info
**Task 2.6 [cross_attention.CrossAttentionHead.call]:** Apply cross-attention.
**Input:**
- `decoder_input`: `[batch_size, target_seq_len, embed_size]` - caption embeddings
- `encoder_output`: `[batch_size, source_seq_len, embed_size]` - image features
**Steps:**
1. **Project Q, K, V:**
- Q = Linear(decoder_input) → `[batch_size, target_seq_len, head_size]`
- K = Linear(encoder_output) → `[batch_size, source_seq_len, head_size]`
- V = Linear(encoder_output) → `[batch_size, source_seq_len, head_size]`
2. **Compute Attention:**
- Call `self.attention_matrix([K, Q])` → attention weights
- Shape: `[batch_size, target_seq_len, source_seq_len]`
3. **Apply Attention to Values:**
- Weighted sum: `attention_weights @ V`
- Result: `[batch_size, target_seq_len, head_size]`
**Output:** `[batch_size, target_seq_len, head_size]`
**Mathematical Formula:**
```
Attention(Q,K,V) = softmax(Q @ K^T / sqrt(head_size)) @ V
```
Where each caption token (Q) computes its attention over all image regions (K) to select which regions to combine (V).
:::
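The math is the same scaled dot-product attention you have seen before, just with K and V coming from the image features. The sketch below spells it out with raw TensorFlow ops for clarity; in the stencil you would call the provided `self.attention_matrix([K, Q])` component rather than computing softmax(QKᵀ/√d) by hand, and `W_q`, `W_k`, `W_v` stand in for the Dense projections created in `__init__`.
```python
import tensorflow as tf

def cross_attention_head(decoder_input, encoder_output, W_q, W_k, W_v):
    # Queries come from the captions; keys and values come from the image features
    Q = W_q(decoder_input)        # [batch, target_len, head_size]
    K = W_k(encoder_output)       # [batch, source_len, head_size]
    V = W_v(encoder_output)       # [batch, source_len, head_size]

    head_size = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(head_size)   # [batch, target, source]
    weights = tf.nn.softmax(scores, axis=-1)   # no causal mask: every image region is visible
    return tf.matmul(weights, V)               # [batch, target_len, head_size]
```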
---
## Task 2.7: Implement `MultiHeadCrossAttention.__init__()`
:::info
**Task 2.7 [cross_attention.MultiHeadCrossAttention.__init__]:** Initialize multi-head cross-attention.
Similar to the transformer assignment, we use multiple attention heads to capture different types of relationships.
**What you need to initialize:**
1. **Multiple Attention Heads:**
- Create a list of `CrossAttentionHead` instances
- Number of heads: `num_heads`
- Each head's size: `head_size = embed_size // num_heads`
- Hint: Use a list comprehension
2. **Output Projection:**
- Linear layer: `embed_size` → `embed_size`
- This combines outputs from all heads
- Shape: Takes `[batch_size, seq_len, embed_size]` and outputs same shape
**Why multiple heads?**
- Head 1 might learn to attend to object shapes
- Head 2 might learn to attend to colors
- Head 3 might learn to attend to backgrounds
- The output projection learns how to combine these insights
:::
---
## Task 2.8: Implement `MultiHeadCrossAttention.call()`
:::info
**Task 2.8 [cross_attention.MultiHeadCrossAttention.call]:** Apply multi-head cross-attention.
**Input:**
- `decoder_input`: `[batch_size, target_seq_len, embed_size]`
- `encoder_output`: `[batch_size, source_seq_len, embed_size]`
**Steps:**
1. **Apply Each Attention Head:**
- Pass inputs through each attention head in your list
- Collect outputs from all heads
- Each output has shape `[batch_size, target_seq_len, head_size]`
2. **Concatenate Head Outputs:**
- Concatenate along the last dimension (head_size dimension)
- Result shape: `[batch_size, target_seq_len, embed_size]`
- Hint: Use `tf.concat(head_outputs, axis=-1)`
3. **Apply Output Projection:**
- Pass concatenated output through the linear projection
- Final shape: `[batch_size, target_seq_len, embed_size]`
**Output:** `[batch_size, target_seq_len, embed_size]`
**Key Concept:**
```
MultiHead(Q,K,V) = Concat(Head1, Head2, ..., HeadN) @ W^O
```
Where each head produces different attention patterns, and the projection learns their optimal combination.
:::
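A minimal sketch of the multi-head logic, written as a free function so the data flow is visible; in the stencil, `heads` (the list of `CrossAttentionHead`s) and `output_projection` (a `Dense(embed_size)`) would live on `self`, and this body would go in `call()`.
```python
import tensorflow as tf

def multi_head_cross_attention(heads, output_projection, decoder_input, encoder_output):
    # Each head returns [batch, target_len, head_size]
    head_outputs = [head(decoder_input, encoder_output) for head in heads]
    # Concatenate along the feature axis: [batch, target_len, embed_size]
    concatenated = tf.concat(head_outputs, axis=-1)
    # Learn how to combine what the different heads attended to
    return output_projection(concatenated)
```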
---
# Part 3: Evaluation Metrics
## Overview of Caption Evaluation
After generating captions, we need metrics to evaluate quality. We'll implement two metrics:
1. **BLEU (Bilingual Evaluation Understudy)**: Measures n-gram overlap with reference
2. **F1 Score**: Measures word overlap (bag-of-words)
Both compare generated captions to reference captions.
:::info
**Important Concept: N-grams**
An n-gram is a sequence of n consecutive words. For example:
For caption: "a dog is running"
- 1-grams (unigrams): ["a", "dog", "is", "running"]
- 2-grams (bigrams): ["a dog", "dog is", "is running"]
- 3-grams (trigrams): ["a dog is", "dog is running"]
- 4-grams: ["a dog is running"]
Why n-grams matter:
- 1-grams measure vocabulary overlap
- 2,3,4-grams measure fluency and phrase quality
- BLEU-4 (using all n-grams up to length 4) is the most common
:::
---
## Task 3.1: Implement `compute_ngrams()`
:::info
**Task 3.1 [metrics.compute_ngrams]:** Extract n-grams from a token list.
This helper function computes all n-grams of size `n` from a caption.
**Input:**
- `tokens`: List of tokens (already split into words)
- `n`: N-gram size (1, 2, 3, or 4)
**Steps:**
1. **Sliding Window:**
- Iterate through the token list
- Create n-grams by taking n consecutive tokens
- Join them with spaces to create n-gram strings
Example for n=2, tokens=["a", "dog", "is", "running"]:
```
"a dog"
"dog is"
"is running"
```
2. **Count N-grams:**
- Use `Counter` to count how many times each n-gram appears
- Return the Counter object
**Example Code Structure:**
```python
ngrams = []
for i in range(len(tokens) - n + 1):
    ngram = ' '.join(tokens[i:i+n])
    ngrams.append(ngram)
return Counter(ngrams)
```
**Output:** Counter object (dictionary-like) mapping n-gram strings to counts
:::
---
## Task 3.2: Implement `compute_bleu_score()`
:::info
**Task 3.2 [metrics.compute_bleu_score]:** Compute BLEU score between reference and candidate.
BLEU measures how many n-grams in the generated caption appear in the reference.
**BLEU Formula:**
```
BLEU = BP × exp(Σ w_i × log(p_i))
```
Where:
- `p_i` = precision for i-grams (what fraction of candidate i-grams appear in reference)
- `w_i` = weight for i-grams (usually 1/max_n)
- `BP` = brevity penalty (penalizes short captions)
**Brevity Penalty:**
- If candidate is longer than reference: BP = 1.0
- Otherwise: BP = exp(1 - reference_length/candidate_length)
- This prevents the model from just outputting very short sequences
**Precision for N-grams:**
For each n-gram size (1 through max_n):
1. Get all n-grams from reference (counts)
2. Get all n-grams from candidate (counts)
3. For each n-gram in candidate:
- Count matches: min(candidate_count, reference_count)
- Sum all matches
4. Precision = total_matches / total_candidate_ngrams
**Important:** Clip matches at reference count! If reference has "the" 2 times and candidate has it 5 times, count only 2 matches.
**Steps:**
1. **Compute Brevity Penalty:**
- Check if candidate length > reference length
- If yes: BP = 1.0
- If no: BP = exp(1 - len(reference)/len(candidate))
2. **Compute Precision for Each N-gram Size:**
- For n from 1 to max_n:
- Compute n-grams from both reference and candidate
- Count matches with clipping (min of counts)
- Calculate precision = matches / total_candidate_ngrams
- Handle division by zero (if precision is 0.0, return 0.0)
3. **Combine Precisions:**
- Use weighted geometric mean: exp(Σ weights × log(precisions))
- Only compute log for non-zero precisions
- If any precision is 0, return 0.0 (can't compute log(0))
4. **Final BLEU:**
- BLEU = BP × exp(weighted_sum)
**Edge Cases:**
- Empty candidate: return 0.0
- Zero precision: return 0.0
- Empty reference: handle gracefully
**Example:**
```
reference: ["a", "dog", "is", "running", "fast"]
candidate: ["a", "dog", "is", "happy"]
1-grams: candidate has 4 unigrams; 3 of them ("a", "dog", "is") appear in the reference → 3/4 = 0.75
2-grams: candidate has 3 bigrams ("a dog", "dog is", "is happy"); 2 appear in the reference → 2/3 ≈ 0.67
BP = exp(1 - 5/4) = exp(-0.25) ≈ 0.78
BLEU = 0.78 × exp((0.25×log(0.75)) + (0.25×log(0.67)) + ...)
(Note: with max_n=4 the 4-gram precision here is 0, so the full BLEU-4 for this pair is 0;
the terms above just illustrate how each n contributes.)
```
:::
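A minimal sketch of the whole computation, reusing the `compute_ngrams()` helper from Task 3.1 (the stencil's exact signature and default `max_n` may differ):
```python
import math

def compute_bleu_score(reference, candidate, max_n=4):
    if not candidate:
        return 0.0

    # Brevity penalty: penalize candidates shorter than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))

    weight = 1.0 / max_n
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = compute_ngrams(reference, n)      # Counter from Task 3.1
        cand_ngrams = compute_ngrams(candidate, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0
        # Clipped matches: never credit an n-gram more times than the reference has it
        matches = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
        if matches == 0:
            return 0.0                                  # log(0) is undefined
        log_precision_sum += weight * math.log(matches / total)

    return bp * math.exp(log_precision_sum)
```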
---
## Task 3.3: Implement `compute_f1_score()`
:::info
**Task 3.3 [metrics.compute_f1_score]:** Compute F1 score using bag-of-words overlap.
Unlike BLEU which focuses on n-grams, F1 treats captions as "bags of words" (order doesn't matter).
**F1 Formula:**
```
Precision = (words in candidate that appear in reference) / len(candidate)
Recall = (words in reference that appear in candidate) / len(reference)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
This is the harmonic mean of precision and recall.
**Intuition:**
- Precision: "Of the words you predicted, how many were correct?"
- Recall: "Of the words you should have predicted, how many did you get?"
- F1: Balanced score between these two
**Steps:**
1. **Handle Edge Cases:**
- If both are empty: return 1.0 (perfect match)
- If one is empty: return 0.0 (no match)
2. **Find Common Words:**
- Convert tokens to sets (or use Counter for word frequency)
- Find intersection: words in both
3. **Compute Precision:**
- precision = len(common_words) / len(candidate)
4. **Compute Recall:**
- recall = len(common_words) / len(reference)
5. **Compute F1:**
- If both precision and recall are 0: return 0.0
- Otherwise: F1 = 2 × (precision × recall) / (precision + recall)
**Example:**
```
reference: ["a", "dog", "is", "running"]
candidate: ["a", "big", "dog", "runs"]
common: ["a", "dog"] (2 words)
precision = 2/4 = 0.5 (2 out of 4 predicted words were correct)
recall = 2/4 = 0.5 (we got 2 out of 4 reference words)
F1 = 2 × (0.5 × 0.5) / (0.5 + 0.5) = 0.5
```
**Note:** This treats all words equally - "the" is as important as "dog"
:::
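A minimal sketch of `compute_f1_score()` that uses Counter intersection for the clipped word overlap (plain sets also work if you only care about word types):
```python
from collections import Counter

def compute_f1_score(reference, candidate):
    if not reference and not candidate:
        return 1.0          # both empty: treat as a perfect match
    if not reference or not candidate:
        return 0.0

    # Bag-of-words overlap (order does not matter); & clips counts on both sides
    common = Counter(reference) & Counter(candidate)
    num_common = sum(common.values())

    precision = num_common / len(candidate)
    recall = num_common / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```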
---
# Part 4: Generation & Beam Search
## Understanding Caption Generation
After training, you need to generate captions for new images. There are two approaches:
**Greedy Decoding** (PROVIDED):
- At each step, pick the word with highest probability
- Fast but can get stuck in local optima
- Example: always picking "a" because it has high probability
**Beam Search** (YOU):
- Maintain k hypotheses (beams) at each step
- Explore multiple paths and keep the best k
- More thorough but slower
- Usually produces better captions
:::info
**Key Concepts for Beam Search:**
1. **Beam Width (k):** Number of hypotheses to maintain
- k=1 is equivalent to greedy
- k=2 maintains top 2 sequences
- k=10 would maintain top 10
2. **Score:** Cumulative log probability
   - Log probabilities add where probabilities multiply: log P(w1) + log P(w2 | w1) = log P(w1, w2)
- We use log to avoid numerical underflow (multiplying many small probabilities)
3. **Length Penalty:** Normalizes scores by sequence length
- Without it, shorter sequences always score higher (log P is negative)
- Formula: final_score = cumulative_log_prob / (length^penalty)
- Higher penalty = longer captions preferred
4. **Teacher Forcing vs Generation:**
- Training: Decoder knows correct previous word (from data)
- Generation: Decoder must predict previous word itself
- This is why generation is harder!
:::
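To make the scoring concrete, here is a tiny worked example of the cumulative log probability and the length penalty (the penalty value 0.7 is just illustrative; use the hyperparameter from the stencil):
```python
import math

# Probabilities the model assigned to each chosen word, one per decoding step
step_probs = [0.50, 0.40, 0.60]

# Summing logs is the same as multiplying probabilities, without underflow
cumulative_log_prob = sum(math.log(p) for p in step_probs)
print(cumulative_log_prob)              # ≈ -2.12

# Length penalty: divide by length**penalty so longer beams aren't unfairly punished
length_penalty = 0.7
final_score = cumulative_log_prob / (len(step_probs) ** length_penalty)
print(final_score)                      # ≈ -0.98
```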
---
## Task 4.1: Understand Greedy Decoding
:::info
**Reading Task [generation.greedy_decode]:** Study the provided greedy decoding function.
This is PROVIDED - don't implement it, but understand it! It's simpler than beam search.
**Key Points:**
1. Start with START token
2. Loop until MAX_LENGTH:
- Get model predictions for current sequence
- Take argmax (greedy choice)
- Stop if END token generated
- Append chosen token to sequence
This is your reference for beam search.
:::
---
## Task 4.2: Implement `BeamSearchDecoder._get_next_tokens()`
:::info
**Task 4.2 [beam_search.BeamSearchDecoder._get_next_tokens]:** Get top-k next tokens for batch of sequences.
This helper computes the top k next token candidates for beam search.
**Input:**
- `sequences`: `[batch_size, seq_len]` - current token sequences (batch of beams)
- `image_features`: `[1, num_regions, embed_size]` - image features (same for all beams)
- `k`: Number of top tokens to return
**Steps:**
1. **Get Model Predictions:**
- Pass sequences through model: `model.decoder(sequences, projected_features, training=False)`
- Get logits: `[batch_size, seq_len, vocab_size]`
2. **Get Last Position Logits:**
- We only care about predicting the NEXT token
- Take logits from last position: `logits[:, -1, :]`
- Shape: `[batch_size, vocab_size]`
3. **Convert to Log Probabilities:**
- Use `tf.nn.log_softmax(logits, axis=-1)`
- Why log? Because we'll sum log probs for cumulative score
4. **Get Top-k Tokens:**
- Use `tf.nn.top_k(log_probs, k=k)`
- Returns: (top_k_values, top_k_indices)
- `top_k_values`: log probabilities - shape `[batch_size, k]`
- `top_k_indices`: token IDs - shape `[batch_size, k]`
**Output:**
- `top_k_tokens`: `[batch_size, k]` - top k token IDs
- `top_k_scores`: `[batch_size, k]` - log probabilities
**Hint:** You'll use this multiple times in the main beam search loop.
:::
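A standalone sketch of the helper (in the stencil this is a method on `BeamSearchDecoder`, so the exact plumbing for the model and features will differ):
```python
import tensorflow as tf

def get_next_tokens(model, sequences, projected_features, k):
    # projected_features must have the same batch size as `sequences`;
    # tile the single image's features across the beams before calling this if needed.
    logits = model.decoder(sequences, projected_features, training=False)  # [batch, seq, vocab]
    last_logits = logits[:, -1, :]                         # only the NEXT token matters
    log_probs = tf.nn.log_softmax(last_logits, axis=-1)    # log probs so scores can be summed
    top_k = tf.nn.top_k(log_probs, k=k)
    return top_k.indices, top_k.values                     # (top_k_tokens, top_k_scores), each [batch, k]
```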
---
## Task 4.3: Implement `BeamSearchDecoder.decode()`
:::info
**Task 4.3 [beam_search.BeamSearchDecoder.decode]:** Implement full beam search.
This is the main algorithm. It's complex but follow the algorithm carefully!
**High-Level Algorithm:**
```
1. Initialize beam_width hypotheses with START token
2. For each step until max_length:
a. For each active (non-finished) hypothesis:
- Get top beam_width next tokens
b. Combine all candidates (beam_width × beam_width total)
c. Select top beam_width by score
d. Mark hypotheses that end with END token as finished
e. Continue with remaining unfinished hypotheses
3. Apply length penalty to final scores
4. Return sorted hypotheses
```
**Data Structure:**
Each hypothesis should track:
- `sequence`: List of token IDs [START, w1, w2, ..., wt]
- `score`: Cumulative log probability (sum of log probs)
- `finished`: Boolean (True if END token was generated)
**Steps:**
1. **Initialize Beams:**
- Create beam_width hypotheses
- Each starts with START token
- Each has score = 0.0 (log(1) = 0)
- None are finished yet
2. **Main Loop (for step in range(max_length)):**
a. **Collect Candidates from Active Beams:**
```python
all_candidates = []
for each beam that's not finished:
- Create sequence tensor from this beam
- Get top beam_width next tokens using _get_next_tokens
- For each next token:
- New sequence = beam.sequence + [next_token]
- New score = beam.score + log_prob[next_token]
- Add (new_sequence, new_score, is_END_token) to candidates
```
b. **Select Top beam_width Candidates:**
- Sort all candidates by score (descending)
- Keep top beam_width candidates
- Update beams with these top candidates
c. **Check Termination:**
- If all beams are finished OR step == max_length-1: break
- Otherwise continue
3. **Apply Length Penalty:**
- For each hypothesis: `final_score = score / (len(sequence) ** length_penalty)`
- This prevents bias toward short sequences
4. **Return Results:**
- Sort hypotheses by final_score (descending)
- For each hypothesis, decode tokens to text using tokenizer
- Return list of (caption_text, final_score) tuples
**Important Details:**
- **Finished Beams:** Once a beam generates END token, mark it finished
- You can stop generating for it
- Keep it in results with current score
- **Batching:** You can batch the _get_next_tokens call:
- Concatenate active sequences into one tensor
- Call _get_next_tokens on batch
- Unbatch results
- **Edge Cases:**
- Empty candidate list: shouldn't happen if algorithm correct
- All beams finished before max_length: that's okay, stop early
- No END tokens generated: return sequences at max_length
**Example (beam_width=2, simplified):**
```
Step 0: sequences = [[START]]
Get top 2 tokens: [("the", -0.1), ("a", -0.15)]
Candidates: [([START, "the"], -0.1), ([START, "a"], -0.15)]
Keep top 2: all of them
Step 1: sequences = [[START, "the"], [START, "a"]]
From "the": top 2 = [("cat", -0.2), ("dog", -0.25)]
From "a": top 2 = [("dog", -0.3), ("cat", -0.35)]
All candidates (4 total):
([START, "the", "cat"], -0.1-0.2=-0.3)
([START, "the", "dog"], -0.1-0.25=-0.35)
([START, "a", "dog"], -0.15-0.3=-0.45)
([START, "a", "cat"], -0.15-0.35=-0.5)
Keep top 2:
([START, "the", "cat"], -0.3)
([START, "the", "dog"], -0.35)
... continue until END token or max_length
```
**Hints:**
- Use TensorFlow operations where possible (more efficient than Python loops)
- Or use Python with careful indexing
- Store scores as floats, not TensorFlow tensors (easier to debug)
- Consider using numpy arrays for bookkeeping
:::
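Putting it all together, here is a heavily condensed sketch of the loop. It reuses the `get_next_tokens` helper sketched in Task 4.2, keeps hypotheses as plain Python tuples, and assumes the tokenizer attribute names used earlier in this handout; the stencil's `BeamSearchDecoder.decode()` will differ in its exact interface.
```python
import tensorflow as tf

def beam_search(model, projected_features, tokenizer, beam_width=3,
                max_length=40, length_penalty=0.7):
    start_id = tokenizer.word2idx[tokenizer.START_TOKEN]
    end_id = tokenizer.word2idx[tokenizer.END_TOKEN]

    # Each hypothesis: (list of token IDs, cumulative log prob, finished?)
    beams = [([start_id], 0.0, False)]

    for _ in range(max_length):
        candidates = [b for b in beams if b[2]]            # finished beams carry over unchanged
        active = [b for b in beams if not b[2]]
        if not active:
            break

        # Batch every active beam and query the model once per step
        sequences = tf.constant([seq for seq, _, _ in active], dtype=tf.int32)
        features = tf.tile(projected_features, [len(active), 1, 1])   # [1,49,E] -> [n_active,49,E]
        tokens, scores = get_next_tokens(model, sequences, features, beam_width)

        for (seq, score, _), beam_tokens, beam_scores in zip(active, tokens.numpy(), scores.numpy()):
            for token, log_prob in zip(beam_tokens, beam_scores):
                token = int(token)
                candidates.append((seq + [token], score + float(log_prob), token == end_id))

        # Keep only the best beam_width hypotheses by cumulative score
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(finished for _, _, finished in beams):
            break

    # Length-normalize, decode to text, and return (caption, score) pairs, best first
    results = [(tokenizer.decode(seq, skip_special_tokens=True),
                score / (len(seq) ** length_penalty))
               for seq, score, _ in beams]
    return sorted(results, key=lambda r: r[1], reverse=True)
```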
---
# Part 5: Training & Concepts
## Overview (No Implementation Required)
The training code is **PROVIDED** in `src/training/train.py`. You don't need to implement it, but understanding these concepts is important!
## Key Training Concepts
### Teacher Forcing
In the data loader (`src/data/data_loader.py`), notice this code:
```python
# Input is all tokens except last
input_tokens = tokens[:-1]
# Target is all tokens except first
target_tokens = tokens[1:]
```
**Why?** This is called **teacher forcing**:
- During training, the model learns from the correct previous token
- It doesn't need to recover from its own mistakes yet
- This helps training converge faster
Example:
```
Caption tokens: [START, "a", "dog", "is", "running", END]
Input to model: [START, "a", "dog", "is", "running"]
Target (what to predict): ["a", "dog", "is", "running", END]
At step t: model sees correct words 0 to t-1, predicts word t
```
This makes training easier, but at test time you must generate one token at a time!
### Data Sampling
The code has optional **data sampling** (using only 20% of data):
```python
if sample_ratio is not None and 0 < sample_ratio < 1:
    np.random.choice(len(dataset), sample_size, replace=False)
```
**Why?**
- Quick prototyping: test your code on small dataset first
- Check for bugs before running 24-hour training
- Recommended workflow: try on 10-20% data, then full data
### Early Stopping
The trainer monitors validation loss:
```python
if is_best:
    self.best_val_loss = val_loss
    self.epochs_no_improve = 0
else:
    self.epochs_no_improve += 1
    if early_stopping and self.epochs_no_improve >= self.patience:
        print(f"Early stopping: no improvement for {self.patience} epochs")
        break
```
**How it works:**
- Track validation loss each epoch
- If validation loss improves: reset counter
- If no improvement for `patience` epochs (default 5): stop training
- Save the best model seen so far
**Why?**
- Prevents overfitting
- Training time is expensive
- Stop when further training doesn't help
### Loss Function
The loss is **masked sparse categorical cross-entropy**:
```python
def compute_loss(self, targets, logits, pad_token_id):
    loss = self.loss_fn(targets, logits)
    # Create mask (don't count padding tokens)
    mask = tf.cast(tf.not_equal(targets, pad_token_id), tf.float32)
    loss = loss * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```
**Why masking?**
- Padding tokens (0) shouldn't contribute to loss
- Only real caption tokens count
- Prevents model from gaming the metric
---
# Testing & Debugging
## How to Test Your Code
We recommend building incrementally:
1. **Test Tokenizer First**
- Encode a caption
- Verify shape
- Decode back
- Check vocab size
2. **Test Model Components**
- Test CrossAttentionHead in isolation
- Test DecoderBlock with dummy inputs
- Test full model with print_model_summary()
3. **Test Metrics**
- Compute BLEU on simple examples
- Verify F1 makes sense
4. **Test Generation**
- Start with greedy decoding (PROVIDED)
- Debug beam search incrementally
5. **Full Training**
- Use data sampling first (20% data)
- Check losses are decreasing
- Generate a few captions
- If working: train on full data
---
# Common Mistakes
:::danger
**1. Tokenizer mistakes:**
- Forgetting to add/remove special tokens
- Not filtering by min frequency
- Wrong index assignment for special tokens
**2. Model shape mismatches:**
- Forgetting to project image features first
- Cross-attention input shapes
- Concatenation dimension errors
**3. Beam search logic:**
- Not sorting candidates correctly
- Forgetting length penalty
- Continuing after all beams finished
- Off-by-one errors with indices
**4. Metrics:**
- Not clipping matched n-grams
- Wrong brevity penalty formula
- Forgetting to handle division by zero
**5. Generation:**
- Using training mode (training=True) during generation
- Forgetting to handle END token
- Not applying length penalty
:::
---
Good luck!