---
tags: hw5, programming
---

# HW5 Programming: Sherlock HoLLMes

:::info
Assignment due **November 6, 2025 at 10 pm EST** on Gradescope!
:::

## Assignment Overview

In this assignment, you will create language models trained on a dataset of mystery, crime, and thriller texts. These texts include the Sherlock Holmes stories, the novels of Agatha Christie, and other classic detective fiction. You will begin by implementing small language models with RNNs and progress to a Large Language Model (Transformer).

### 🕵️ The Mystery of the Missing Language Model 🐻

![ChatGPT Image Oct 16, 2025, 01_52_18 AM](https://hackmd.io/_uploads/SkrZYZ0ale.png)
**Image generated with GPT5**

London, 1895. Bruno the Bear has followed mysterious electromagnetic readings to Victorian London, where he's discovered that the great detective Sherlock Holmes has vanished! The only clue: a collection of scattered manuscript pages from famous mystery novels floating in a temporal vortex.

Dr. Watson approaches Bruno with a desperate plea: "The criminal underworld grows bolder by the day without Holmes! We need a way to predict their next moves by understanding the patterns in criminal literature."

Your mission: Create **Sherlock HoLLMes** — a revolutionary language understanding system that can:

1. Learn the intricate patterns of detective fiction and criminal psychology
2. Generate plausible mystery narratives to anticipate criminal plots
3. Help Bruno and Watson restore order to Victorian London

**Watson's Notes: "The game is afoot, but time is of the essence. These temporal anomalies won't stabilize much longer!"**

___

### Assignment Goals

1. Implement RNN-based language models from scratch
    - Build **Vanilla RNN** and **LSTM** architectures
    - Implement text **tokenization** and **sequence generation**
    - Master **autoregressive text generation** techniques
2. Construct a complete **Transformer architecture**
    - Implement **multi-head attention** mechanisms
    - Build **feed-forward** layers
    - Create a full **language model** capable of coherent text generation
3. Apply advanced text generation strategies
    - Implement **temperature**, **top-k**, and **top-p** sampling
    - Train models on the **Project Gutenberg mystery corpus**
    - Generate **Victorian detective fiction** in the style of classic authors

## Getting Started

### Stencil

Please click [here](https://classroom.github.com/a/2jL9_5li) to get the stencil code. Reference this [guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg) for more information about GitHub and GitHub Classroom.

:::warning
Make sure you clone this repository within the same parent folder as your virtual environment! Remember from assignment 1:
```
csci1470-course/              <- Parent directory (you made this)
├── csci1470/                 <- Virtual environment (from HW1)
├── HW5-Sherlock-HoLLMes/     <- This repo (you cloned this)
└── ...
```
:::

:::danger
**Do not change the stencil except where specified.** You are welcome to write your own helper functions. However, changing the stencil's method signatures **will** break the autograder.
:::

### Environment

You will need to use the virtual environment that you made in Assignment 1.

:::info
**REMINDER ON ACTIVATING YOUR ENVIRONMENT**
1. Make sure you are in your cloned repository folder
    - Your terminal line should end with the name of the repository
2. You can activate your environment by running the following command:
    ```bash
    # Activate environment
    source ../csci1470/bin/activate   # macOS/Linux
    ..\csci1470\Scripts\activate      # Windows
    ```
:::

## Setup

### Technical Notes

This assignment uses **TensorFlow eager execution mode** to simplify implementation. You'll notice this line in `main.py`:

```python
# Enable eager execution to avoid graph mode issues with RNN implementations
tf.config.run_functions_eagerly(True)
```

**Why eager execution?**
- Allows you to implement RNNs using Python loops and control flow
- Makes debugging easier since operations execute immediately
- Avoids complex TensorFlow graph compilation issues related to for loops

**Other tips for this assignment:**
- Use consistent data types: prefer `tf.int32` for token indices
- A general note: if you choose a type for your tensors, stick with it throughout the **ENTIRE** model!

## Implementation Roadmap

### Before you Begin: Our Recommended Gameplan

**CORE IDEA: Read First, Code Second!**

Before diving into any implementation, follow these steps to help make this assignment more digestible!

1. **Read this entire document from start to finish** - This is a dense, long document with many different sections. Take a minute to walk through the entire document and inspect each of the sections to familiarize yourself. You will likely be confused, but that is the whole point!
2. **Explore the provided stencil code** - Go through the repository and inspect the file structure and breakdown. You will see that some files contain a lot of stencil code, but many are filled with TODOs for you!
3. **Start with data processing** to understand the text processing pipeline
4. **Implement RNNs first** as a simpler baseline before moving to Transformers
5. **Build Transformers incrementally**: attention first, then full blocks

### Implementation Tasks

Don't worry if these tasks seem daunting at first glance! We've included detailed implementation guidance below.

1. **Data Processing** (`data.py`): Text tokenization and sequence creation
2. **RNN Models** (`RNNs.py`): Vanilla RNN and LSTM implementations
3. **Text Generation** (`language_model.py`): Sampling strategies for text generation
4. **Training Infrastructure** (`train.py`): Training loops with WandB logging
5. **Transformer Architecture** (`transformer.py`): Full attention-based language model

:::warning
**[TESTING INCREMENTALLY]** Throughout this assignment, test each component as you implement it. The complex nature of language models makes debugging much harder if you wait until the end!
:::

:::danger
**DO NOT USE `NUMPY` FUNCTIONS IN THIS ASSIGNMENT OTHER THAN IN `data.py`. GPUs REQUIRE ALL FUNCTIONS TO BE IMPLEMENTED IN TENSORFLOW.**
:::

## 1. `data.py`

Our dataset comes from [Project Gutenberg](https://www.gutenberg.org/), a free repository of texts whose copyrights have expired. It is an amazing source of natural language data without copyright concerns. For this homework, we have specifically chosen books from the top 50 (or so) of the [Crime, Mystery, and Thriller](https://www.gutenberg.org/ebooks/bookshelf/640) bookshelf.

:::info
**Task 1.1 [Process the Mystery Dataset]:** Navigate to the `data/` directory by running
```bash
cd data
```
and then run:
```bash
python process_data.py
```
*Note*: There are **no TODOs** in this file. This downloads, processes, and cleans Project Gutenberg mystery novels.
It should generate:
- `processed_text/` directory with cleaned individual texts
- `combined_mystery.txt` with all raw texts combined in one file
- `mystery_data.pkl` with tokenized sequences ready for training
:::

**Note:** The following section explains how `process_data.py` works. While you will not be implementing any functions in this file, it is important to understand what is happening, not only to see how raw text is processed, but also for your final projects, where you will need to implement a preprocessing pipeline yourself.

This operation cleans the text by removing footnotes, editor notes, special characters, chapter headings, and lots of other oddities. You can look at the full cleaning function in `process_data.py` (you may find it useful if you work with other language data).

After cleaning all the input documents, we also tokenize them for you. The tokenization process consists of splitting our entire corpus of text into individual words and assigning the most frequent words in our dataset to indices. `process_data.py` will output some of the most common words in our dataset. By default, we use a vocabulary size of 10,000, with three additional special tokens, `<pad>`, `<unk>`, and `<eos>`, for padding, unknown tokens, and end-of-sequence tokens, respectively.

**Vocabulary Size and Model Dimensions for Manageable Training:**

You can dramatically reduce training time by using smaller vocabulary and model dimensions. Our dataset has a vocabulary size of 10,000 by default, but you may wish to use a much smaller vocabulary while testing/training your assignment. When running `main.py` to train your model, you can add command line arguments for the vocabulary size and the size of the model (e.g., the hidden state size of RNNs/LSTMs or the embedding size of the transformer):

| Configuration | Use Case | Training Speed | Quality |
|---------------|----------|----------------|---------|
| `--vocab-size 1000 --d-model 128` | Fast development/debugging | Very Fast | Basic |
| `--vocab-size 3000 --d-model 256` | Balanced experiments | Moderate | Good |
| `--vocab-size 5000 --d-model 512` | Production training | Slow | Best |
| `--vocab-size 10000 --d-model 512` | Full dataset (default) | Very Slow (Run on Oscar) | Maximum |

**Tip**: It is common practice to keep the embedding dimension (`d-model`) much smaller than the vocabulary size for efficiency and performance.

:::spoiler Train Test Split
We have already split the data into separate train and test datasets. It is not as easy to split sequential data into train and test sets the way we do for other types of data (i.e., shuffle the data and split), because shuffling would break the sequential structure of the data. The core of the problem is that sequential data is inherently not i.i.d. (independent and identically distributed), because sequences **depend** on what comes before them. Even text all the way at the end of a book depends on text from the beginning.

The way we have handled this is to not shuffle the data and to do a standard 80/20 split. The final 20% of the data in the test set likely contains multiple complete texts by authors not in the training data. This makes the test set particularly challenging, but if we can avoid overfitting, then we can have extra confidence that our model is performing well.
:::

:::info
**Task 1.2 [Implement Data Loading (`data.py`)]:** Your first implementation task is to create a tokenizer to handle converting text-to-tokens and tokens-to-text.

During preprocessing, we created a mapping of tokens (strings of words/punctuation) to indices. The tokenizer will be provided with a `vocab`, which is a list of all the words known to our model (plus the three special tokens). Each word in this list has a particular index. We have also provided a dictionary that maps tokens to their respective indices for faster lookups.

1. **`TextTokenizer.encode()`** - Convert text to token indices
    - Split text on whitespace and punctuation
    - Map tokens to indices using the vocabulary mapping
    - Handle out-of-vocabulary words with the `<UNK>` token
    - Return a list of token indices (one index for each token in the sequence)
2. **`TextTokenizer.decode()`** - Takes in a list of token indices and returns the corresponding text
    - Map indices to tokens
    - Remove special tokens (`<PAD>`, `<UNK>`, `<EOS>`) if necessary (`<UNK>` can be replaced with "unk")
    - Join tokens and clean up spacing (for aesthetics)
:::
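To make the encode/decode contract concrete, here is a toy round-trip on a six-word vocabulary. This is not the stencil's `TextTokenizer` (the names `toy_encode`/`toy_decode` and the tiny vocab are made up for illustration); it only shows the index lookup, the `<unk>` fallback, and the join on decode:

```python
# Toy illustration of the encode/decode contract (not the stencil's TextTokenizer).
vocab = ["<pad>", "<unk>", "<eos>", "the", "hound", "barked"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def toy_encode(text):
    # Unknown words fall back to the <unk> index.
    return [token_to_id.get(tok, token_to_id["<unk>"]) for tok in text.lower().split()]

def toy_decode(ids):
    # Drop padding/end-of-sequence tokens and join the rest with spaces.
    return " ".join(vocab[i] for i in ids if vocab[i] not in ("<pad>", "<eos>"))

ids = toy_encode("The hound growled")   # "growled" is out of vocabulary
print(ids)                              # [3, 4, 1]
print(toy_decode(ids))                  # "the hound <unk>"
```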
You will notice there are other functions defined in `data.py`, such as `limit_vocabulary` and `load_mystery_data`. These functions are already implemented for you; they limit the vocabulary to the top `k` tokens we specify and load in the dataset, respectively.

:::info
**Task 1.3 [Sequence Creation]:** There are many ways to structure our data when we feed it into our network. We will use a **fixed window size** for all models that limits how long of a sequence we consider at once.

You *could* split our text into separate sequences for each sentence, but a certain sentence may depend on the sentence that came before it. You *could* split our sequences by paragraphs, but the same issue still applies. **Instead**, you will implement a *simple approach*: in `create_sequences`, you will split the text into fixed-width windows of a provided size with no overlap between consecutive sequences. If a piece of text with 100 words is provided and the window size is 10, you will simply divide the 100 words into 10 separate non-overlapping windows.
:::

:::warning
The way you are expected to generate sequences for training actually uses a 1-token overlap: sequence 1 might range over `[0:n+1]`, and sequence 2 will range over `[n:2n+1]`, where `n` is the desired sequence length. Therefore, the length of each sequence should actually be `seq_len + 1`. The reason for this will make more sense during training, but it allows for the creation of input and target pairs from each sequence.
:::

:::info
**Task 1.4 [Creating the Input Data]**: It is now time to write the actual functions that will be responsible for taking our processed text and turning it into a usable form for our model. You will now implement the following functions:

1. **`create_sequences()`** - Create input sequences from token lists
    - Generate sequences with the specified window size of `seq_len + 1`
    - Split tokens into consecutive, non-overlapping chunks
    - Return a numpy array of sequences for further processing (make sure to cast to `np.int32` or `tf.int32`)
2. **`create_tf_datasets()`** - Create TensorFlow datasets for training and validation
    - Use `create_sequences()` to generate sequences from the train/test tokens
    - Convert sequences to TensorFlow constants
    - Batch and configure the datasets to prepare for training ([this documentation will be very helpful](https://www.tensorflow.org/api_docs/python/tf/data/Dataset))
    - The training dataset should also be shuffled (`.shuffle(buf_size)` will be nice here)
    - Return a tuple of `(train_dataset, test_dataset)`
:::

:::warning
**What should happen to the final batch if there aren't `batch_size` items left?** The easy way to handle it in this assignment is to throw it away and not use it. You will find we are not lacking in data for this assignment... Throwing away one incomplete window's worth of data will not cause our performance to decrease.
:::
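Here is a minimal sketch of the windowing and batching ideas on a toy token list. The stride follows the 1-token-overlap convention from the warning above; treat it as illustrative and defer to the stencil's tests for the exact behavior expected of `create_sequences` and `create_tf_datasets`:

```python
import numpy as np
import tensorflow as tf

# Toy example: 13 token ids, seq_len = 4, so each window holds seq_len + 1 = 5 tokens.
tokens = list(range(13))
seq_len = 4
window = seq_len + 1

# Stride by seq_len so consecutive windows share exactly one token (the 1-token overlap).
starts = range(0, len(tokens) - window + 1, seq_len)
sequences = np.array([tokens[s:s + window] for s in starts], dtype=np.int32)
# sequences -> [[0 1 2 3 4], [4 5 6 7 8], [8 9 10 11 12]]

# Wrap in a tf.data pipeline: shuffle, batch, and drop the final incomplete batch.
dataset = (tf.data.Dataset.from_tensor_slices(tf.constant(sequences, dtype=tf.int32))
           .shuffle(buffer_size=len(sequences))
           .batch(2, drop_remainder=True))
```

During training, each 5-token window will later be split into a 4-token input and a 4-token target, as Section 4.1 explains.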
:::info
**[Task 1.4] Testing:** You can now check whether you have implemented `encode`, `decode`, and the sequence functions correctly by running the following command
```bash
python tests/run_tests.py data
```
which will run some tests to ensure the basic functionality of the functions you implemented. If you need to run any test in particular, you can run the command (the `-v` flag is optional)
```bash
python -m pytest tests/test_data.py::<test_name> [-v verbose]
```
:::warning
*Note: These are not comprehensive tests, but rather just a minimal functionality guarantee!*
:::

## 2. `RNNs.py`

Now that you've created a TensorFlow dataset, you are ready to create a model. Before working with a Large Language Model (i.e., a transformer), you'll first use RNNs to perform language modeling. For these implementations, you are free to use the Keras implementations of an embedding lookup layer (`tf.keras.layers.Embedding`), dense layers, and activation functions. Essentially, the only TensorFlow operation that is off limits is the built-in TensorFlow implementation of an RNN or LSTM cell.

:::warning
**Note:** For both models, the outputs should be logits (i.e., the outputs of the model before a softmax is applied).
:::

### What is a Recurrent Neural Network?

Unlike feedforward neural networks that process inputs independently, **Recurrent Neural Networks (RNNs)** maintain an internal hidden state that gets updated at each timestep. This allows them to process sequential data by "remembering" information from previous timesteps, which makes them ideal for tasks like language modeling where context matters.

The key difference is that while a feedforward network treats each word independently, an RNN processes sequences left-to-right, using its hidden state $\mathbf{h_t}$ as a memory mechanism that captures information about all previous words in the sequence.

### Vanilla Recurrent Neural Network (RNN)

The core computation at each timestep in our RNN is defined by:

$$\mathbf{h_t} = \tanh{(\mathbf{W_x} \times \mathbf{x_t}+\mathbf{W_h}\times\mathbf{h_{t-1}} + \mathbf{b_h})}$$

**Understanding the equation:**
- $\mathbf{x_t}$: the input at timestep $t$ (e.g., the current word embedding)
- $\mathbf{h_{t-1}}$: the hidden state from the previous timestep (the network's "memory")
- $\mathbf{W_x}$: weight matrix for transforming the current input
- $\mathbf{W_h}$: weight matrix for transforming the previous hidden state
- $\mathbf{b_h}$: bias vector
- $\mathbf{h_t}$: the new hidden state, combining the current input with past context via the $\tanh$ activation

Think of it this way: at each step, the RNN takes the current word and the "summary" of everything before it ($\mathbf{h_{t-1}}$), combines them through learned transformations, and produces a new summary ($\mathbf{h_t}$) that will be passed forward.
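To make the recurrence concrete before you open the stencil, here is a minimal sketch of the timestep loop with toy dimensions and hand-rolled variables. It is illustrative only; the stencil's scaffolding, layer names, and weight-creation API may differ:

```python
import tensorflow as tf

# Toy dimensions, chosen only for illustration.
batch_size, seq_len, emb_sz, hidden_sz, vocab_sz = 2, 5, 8, 16, 50

tokens = tf.random.uniform((batch_size, seq_len), maxval=vocab_sz, dtype=tf.int32)
embedding = tf.keras.layers.Embedding(vocab_sz, emb_sz)

W_x = tf.Variable(tf.random.normal((emb_sz, hidden_sz), stddev=0.1))
W_h = tf.Variable(tf.random.normal((hidden_sz, hidden_sz), stddev=0.1))
b_h = tf.Variable(tf.zeros((hidden_sz,)))

x = embedding(tokens)                          # [batch, seq_len, emb_sz]
h = tf.zeros((batch_size, hidden_sz))          # initial "memory"
hidden_states = []
for t in range(seq_len):
    x_t = x[:, t, :]                           # current word embedding
    h = tf.nn.tanh(x_t @ W_x + h @ W_h + b_h)  # new summary of everything so far
    hidden_states.append(h)
outputs = tf.stack(hidden_states, axis=1)      # [batch, seq_len, hidden_sz]
# A final Dense projection to vocab_sz would turn these into next-token logits.
```

The stencil wraps the same idea in a Keras layer, with the weights created in `__init__` and the loop living in `call`.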
:::info
**Task 2.1 [Vanilla RNN `__init__`]:** Now it's time for you to fill out the `__init__` function's TODOs and initialize your weights and biases!

:::warning
You may find `add_weight`, provided by the [Keras base Layer class](https://keras.io/api/layers/base_layer/), or `Dense` with `use_bias=False` to be helpful here!
:::

:::info
**Task 2.2 [Vanilla RNN `call`]:** Now implement the `call` function, where you will write out the computation above. Notice that we have provided you with the scaffolding, but you must write out the computation! Make sure you understand what the stencil code is doing, as you will be implementing this yourself in the `LSTM`!
:::

:::success
Implementation Tips:
- You will find the [documentation](https://keras.io/api/layers/base_layer/) for the Keras base Layer class very helpful for understanding the `add_weight` function
- Don't forget about appropriate shapes, initializers (e.g., `"glorot_uniform"`), and `trainable=True`
- Use `tf.nn.tanh()` for tanh activations
:::

### Long Short-Term Memory (LSTM)

While vanilla RNNs can theoretically capture long-range dependencies, they suffer from the **vanishing gradient problem** during training: gradients become exponentially small when backpropagating through many timesteps, which makes it difficult to learn long-term patterns. **LSTMs** address this through gating mechanisms that provide more direct paths for gradient flow, enabling the network to learn which information to remember, forget, or output at each step.

The LSTM architecture uses **gating mechanisms** to control information flow. At each timestep, an LSTM cell computes the following components.

**Core LSTM Equations:**

1. **Forget gate:** decides what to forget from the previous cell state
$$\mathbf{f_t} = \sigma(\mathbf{W_f} \times \mathbf{x_t} + \mathbf{U_f} \times \mathbf{h_{t-1}} + \mathbf{b_f})$$
2. **Input gate:** decides what new information to add
$$\mathbf{i_t} = \sigma(\mathbf{W_i} \times \mathbf{x_t} + \mathbf{U_i} \times \mathbf{h_{t-1}} + \mathbf{b_i})$$
3. **Candidate cell state:** new information to potentially add
$$\mathbf{\tilde{c}_t} = \tanh(\mathbf{W_c} \times \mathbf{x_t} + \mathbf{U_c} \times \mathbf{h_{t-1}} + \mathbf{b_c})$$
4. **Cell state update:** combines the forgotten past with new information
$$\mathbf{c_t} = \mathbf{f_t} \odot \mathbf{c_{t-1}} + \mathbf{i_t} \odot \mathbf{\tilde{c}_t}$$
5. **Output gate:** decides what to output from the cell state
$$\mathbf{o_t} = \sigma(\mathbf{W_o} \times \mathbf{x_t} + \mathbf{U_o} \times \mathbf{h_{t-1}} + \mathbf{b_o})$$
6. **Hidden state:** a filtered version of the cell state
$$\mathbf{h_t} = \mathbf{o_t} \odot \tanh(\mathbf{c_t})$$

Where:
- $\sigma$ is the sigmoid activation function
- $\odot$ represents element-wise multiplication (Hadamard product)
- $\mathbf{c_t}$ is the **cell state** (the LSTM's "memory")
- $\mathbf{h_t}$ is the **hidden state** (the output at timestep $t$)
- $\mathbf{W}$ matrices represent affine transformations of the **inputs**
- $\mathbf{U}$ matrices represent affine transformations of the **hidden states**

**Guiding Questions before you begin your implementation:**
1. How many weights do you need for an LSTM?
    a. How many weight matrices? (Hint: the total number of $W+U$)
    - *Note: You can implement this with fewer weight matrices if you concatenate the inputs with the previous hidden state, which can be represented as:*
    $$\mathbf{f_t}=\sigma([\mathbf{x_t},\mathbf{h_{t-1}}]\times\mathbf{W_f}+\mathbf{b_f})$$
    b. How many bias vectors do we need? (Hint: the total number of $\mathbf{b}$ terms)

:::spoiler Understanding the Gates
- **Forget gate** ($\mathbf{f_t}$): Values close to 0 mean "forget this part of memory", values close to 1 mean "keep it"
- **Input gate** ($\mathbf{i_t}$): Controls how much of the new candidate information to add
- **Output gate** ($\mathbf{o_t}$): Controls how much of the cell state to expose as the hidden state
- The **cell state** acts as a "conveyor belt" that can carry information across many timesteps with minimal modification
:::

:::info
**Task 2.3 [LSTM `__init__`]:** Now that you know which equations you need to implement, fill out the TODOs in the `__init__` function to initialize your weights and biases.

*Hint: Remember that the bias for the forget gate, $b_f$, is initialized to ones, not zeros.*
:::

:::info
**Task 2.4 [LSTM `call`]:** Now fill out the TODOs in the `call` function. Within your `call` function:
- Follow a scaffolding structure similar to VanillaRNN
- Embed the input tokens
- Initialize **both** $\mathbf{h_0}$ (hidden state) and $\mathbf{c_0}$ (cell state) to zeros
- For each timestep $t$, compute all six equations above **in order**
- Append the hidden state $\mathbf{h_t}$ (not the cell state) to your outputs list
- Stack the outputs and project to the vocabulary size
:::

:::warning
**Key Differences between Vanilla RNN and LSTM:**
- An LSTM maintains **two** state vectors (cell state $\mathbf{c_t}$ and hidden state $\mathbf{h_t}$) vs. a vanilla RNN's single state
- An LSTM uses **sigmoid** gates (values between 0 and 1) to modulate information flow, enabling "soft" addition/removal
- An LSTM has **more parameters**, letting it learn long-term dependencies better without vanishing gradients
:::

:::success
**Implementation Tips:**
- Use `tf.nn.sigmoid()` for sigmoid activations
- Use the `*` operator for element-wise multiplication ($\odot$) in TensorFlow
:::

:::info
**Task 2.5 [Testing `RNNs.py`]:** Now you can test whether you have implemented the RNN models correctly. These tests will ensure that data can flow through your models and that they return outputs of the right shapes. However, they are not checking for reasonable outputs, gradients, etc.
```bash
python tests/run_tests.py rnn
```
To run specific tests from the RNN test suite, you can run the following command
```bash
python -m pytest tests/test_rnns.py::<test_name> [-v verbose]
```
**Note: Passing these tests doesn't necessarily mean you've implemented the models correctly, just that the final outputs are of the correct shape and type.**
:::
## 3. `language_model.py`

During training, you will feed in the "correct" tokens at each timestep and generate a prediction for the next token. Test time is when we actually use that language-modeling prediction to generate new text and new sentences.

![generation (1)](https://hackmd.io/_uploads/rkmU7reAxl.gif)

Text generation is an **autoregressive process** where you start with a prompt and repeatedly predict the next token. The quality and diversity of the generated text depend heavily on how you sample from the model's probability distribution.

The most straightforward approach is known as **greedy decoding**, where we always choose the token with the highest output probability. While this may seem like a good idea, it often results in repetitive and incoherent outputs. The next simplest approach, **simple sampling**, consists of randomly selecting a token according to the distribution returned by the softmax over the logits. However, while simple sampling works, it still degenerates over longer sequences.

There are several more sophisticated sampling methods, each with different trade-offs between quality and diversity. We will go over and implement the following three sampling strategies:

### 3.1 Temperature Sampling (Multinomial Sampling)

Temperature sampling modulates the probability distribution before sampling by dividing the logits by a temperature parameter $T > 0$:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ are the logits for each token. The temperature controls the "sharpness" of the distribution:
- **Low temperature** ($T < 1$): Makes the distribution sharper, favoring high-probability tokens (more conservative)
- **High temperature** ($T > 1$): Flattens the distribution, giving more probability mass to lower-probability tokens (more creative/random)
- **$T \to 0$**: Approaches greedy decoding (the top element every time)
- **$T = 1$**: Standard softmax (no change)

**Algorithm:**
1. Divide the logits by the temperature $T$
2. Apply softmax to get probabilities
3. Sample from the resulting distribution

This is already implemented for you in `sample_categorical()`.

:::warning
Note that `tf.random.categorical` takes in the logits and internally applies softmax. Therefore, if you use this function instead of applying softmax yourself, any underlying changes to the distribution must be made to the **logits**!
:::

### 3.2 Top-K Sampling

Top-k sampling restricts sampling to only the $k$ most probable tokens, effectively truncating the long tail of unlikely tokens that could derail generation. Think of it as filtering your vocabulary to only consider the $k$ "best candidates" at each step.

**Algorithm:**
1. Find the top $k$ logits/probabilities
2. Set all other logits to $-\infty$ (or probabilities to 0)
3. Renormalize the remaining $k$ probabilities so they sum to 1
4. Sample from this restricted distribution

:::success
**Implementation hints:**
- Use `tf.nn.top_k()` to get the top $k$ values and their indices
- You'll need to create a mask or use `tf.scatter_nd()` to zero out (or set to $-\infty$) the non-top-k probabilities
- Remember to renormalize after filtering
:::

### 3.3 Top-P Sampling (aka Nucleus Sampling)

Top-p sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$ (typically 0.9 or 0.95 in most production settings). Unlike top-k, which always selects exactly $k$ tokens, nucleus sampling adapts the vocabulary size based on the confidence of the distribution. When the model is confident (a peaked distribution), fewer tokens may be needed to reach $p$. For example, if $p=0.9$ and the top token has a probability of $0.9$, then it makes up the entire probability mass and we only sample from one token. If the model is uncertain (a flatter distribution), more tokens are included. This creates an *adaptive "nucleus"* of likely tokens.

**Algorithm:**
1. Sort the probabilities in descending order
2. Compute cumulative probabilities
3. Find the cutoff index where the cumulative probability first **exceeds** $p$ (*you always want it to be greater than or equal to $p$, never less than*)
4. Mask out all tokens beyond this cutoff
5. Renormalize the remaining probabilities
6. Sample from this restricted distribution

:::success
**Implementation hints:**
- Use `tf.argsort()` or `tf.nn.top_k()` with `k=vocab_size` to sort
- Use `tf.cumsum()` to compute cumulative probabilities
- Create a mask where the cumulative probability $\leq p$
- You'll need to use the sorted indices to map back to the original token positions
:::
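As a concrete illustration of the two filtering ideas, here is a sketch on a made-up six-token distribution. It is not the stencil's `sample_top_k`/`sample_top_p` interface; it only demonstrates the `tf.nn.top_k` masking and the cumulative-probability cutoff described above:

```python
import tensorflow as tf

# Made-up logits over a 6-token vocabulary.
logits = tf.constant([[2.0, 1.0, 0.5, 0.2, -1.0, -3.0]])

# --- Top-k idea: keep the k largest logits, push the rest to -inf, then sample. ---
k = 3
top_vals, _ = tf.nn.top_k(logits, k=k)                       # [1, k], sorted descending
threshold = top_vals[:, -1, None]                            # smallest logit we keep
filtered = tf.where(logits < threshold,
                    tf.fill(tf.shape(logits), float("-inf")),
                    logits)
sample = tf.random.categorical(filtered, num_samples=1)      # softmax applied internally

# --- Top-p idea: sort, accumulate probabilities, count how many tokens reach p. ---
p = 0.9
probs = tf.nn.softmax(logits)
sorted_probs = tf.sort(probs, direction="DESCENDING")
cumulative = tf.cumsum(sorted_probs, axis=-1)
nucleus_size = tf.reduce_sum(tf.cast(cumulative < p, tf.int32), axis=-1) + 1
# nucleus_size is the number of tokens whose combined probability first reaches p.
```

Note how working on the logits (rather than the probabilities) plays nicely with `tf.random.categorical`, which applies the softmax internally, exactly as the earlier warning points out.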
:::info
**Task 3.1 [Implement Sampling Strategies]:** It is now your turn. We have already implemented temperature sampling for you in `sample_categorical()`. You will now complete the following TODOs in `language_model.py`:
- **`sample_top_k()`** - Top-k sampling
- **`sample_top_p()`** - Nucleus (top-p) sampling
:::

:::info
**Task 3.2 [Testing]** You can now test the functionality of your sampling methods by running the following command
```bash
python tests/run_tests.py sampling [-v verbose]
```
:::

## 4. `train.py`

Now that you've created your RNN models, it's time to train them. At this point, you've written this loop a couple of times in your previous assignments, and it will look pretty much the same here.

However, before we get to the actual training routine: as your training runs get larger and larger, it becomes more important to be able to track your model's progress along the way. There are a large number of stats that you might want to keep track of, like average training and validation loss, in addition to qualitative things like "How good is my LM at actually generating interesting text?". Printing alone starts to become insufficient and hard to keep track of (as you might have already found).

For this assignment, we'll be using a popular experiment tracking platform called [Weights and Biases](https://wandb.ai/site) (abbreviated wandb or w&b). Wandb provides a logging interface that allows us to log stats locally while our model is training and view plots and figures for these stats online in real time. You should at least log average training and validation loss periodically, but you may find it useful to log generated text as well. Look at the Weights and Biases [documentation](https://docs.wandb.ai/ref/python/data-types/) for information on logging different types of data.

:::info
**Task 4.1 [WandB Setup]:**
1. Run the following command, which will interact with the wandb API in order to create your account and authenticate your API key
    ```bash
    python src/utils/wandb_login.py
    ```
2. Select option "(1) Create a W&B account", which will lead you [here](https://wandb.ai/authorize?signup=true), where you will be provided with your API key.
    - When prompted to choose between a "Professional" or "Educational" account, you should choose "Educational" and input "Brown University"!
    - This is not super important, but Professional accounts expire after some time and require you to pay :cry:
3. Copy your API key and paste it into the specified spot in the terminal, and you are all set :)
4. You should also log in to wandb in a web browser to check out the GUI!

Once you are logged in, you shouldn't have to re-authenticate and log in again. We included the wandb library in the virtual environment, but if you don't have it, you can install it by running `pip install wandb`.
:::

### 4.1 Training Language Models

Training language models differs from the supervised learning you've done with feedforward neural networks (FNNs) and convolutional neural networks (CNNs) in a fundamental way: **we're training the model to predict the next token in a sequence**.

#### Sequence Structure and Shifting

Now, remember that when we prepared our training data, each sequence in our dataset had length `(seq_len + 1)`. Why the extra token? Because we need to create **input-target pairs** where the target is shifted by one position:

- **Input sequence**: tokens at positions `[0, 1, 2, ..., seq_len-1]` (the first `seq_len` tokens)
- **Target sequence**: tokens at positions `[1, 2, 3, ..., seq_len]` (everything shifted left by one)

This setup allows the model to learn: "Given tokens $0$ through $i$, predict token $i+1$". For example, if our sequence is `["The", "cat", "sat", "on", "the"]`:
- Input: `["The", "cat", "sat", "on"]`
- Target: `["cat", "sat", "on", "the"]`

At each position, the model predicts the *next* token. Providing the correct previous tokens as input during training, even if the model would have predicted something different, is called **teacher forcing**.

#### Loss Function: Sparse Categorical Cross-Entropy

For language modeling, we use **Sparse Categorical Cross-Entropy (CCE)** as our loss function. Mathematically, this is **exactly the same as regular CCE**. However, the "sparse" part is crucial here: it means our targets are represented as **integer class indices** (e.g., token ID 42) rather than one-hot encoded vectors (e.g., a vector of length `vocab_size` with a 1 at position 42 and 0s everywhere else).

Why is sparse better? With vocabularies of 10,000-50,000+ tokens, one-hot encoding would be extremely memory-intensive. Sparse CCE computes the same loss but works directly with integer labels, making it far more efficient. Mathematically, for a prediction at position $i$:

$$
\textrm{CCE}_i = -\log\big(P(\text{target token}_i \mid \text{input tokens})\big)
$$

:::warning
The total loss is the average CCE across all positions in all sequences in the batch **(hint: this is a reminder to call `tf.reduce_mean`)**.
:::

#### Perplexity: The Language Model Metric

While we calculate gradients and *optimize* using the cross-entropy loss, we typically *report* model performance using **perplexity**. Perplexity is the exponentiated average cross-entropy loss:

$$
\text{Perplexity} = e^{\text{average CCE}}
$$

**What does perplexity mean?** Intuitively, perplexity measures how "surprised" the model is by the next token. A perplexity of 100 means the model is as confused as if it had to uniformly guess among 100 possible tokens at each position. Lower perplexity = better model.

Why use perplexity instead of raw CCE? Perplexity is more interpretable and has a nice property: random guessing among $N$ tokens gives a perplexity of exactly $N$. This makes it easier to understand model quality at a glance. For reference, good language models might achieve perplexities between 20 and 50 on standard benchmarks, while a random baseline would have perplexity equal to the vocabulary size.
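To tie the shift, the loss, and perplexity together, here is a tiny self-contained sketch. The random logits stand in for a model's output, the variable names are illustrative, and remember that the stencil already provides `calculate_perplexity`:

```python
import tensorflow as tf

# One toy batch of shape [batch_size, seq_len + 1] holding token ids.
batch = tf.constant([[2, 7, 4, 9, 1],
                     [3, 5, 5, 8, 0]], dtype=tf.int32)
inputs, targets = batch[:, :-1], batch[:, 1:]      # teacher-forcing shift

# Pretend these came from model(inputs): [batch_size, seq_len, vocab_size].
vocab_size = 10
logits = tf.random.normal((2, 4, vocab_size))

# Per-position sparse CCE, averaged into a scalar, then exponentiated for perplexity.
per_token_cce = tf.keras.losses.sparse_categorical_crossentropy(targets, logits, from_logits=True)
avg_cce = tf.reduce_mean(per_token_cce)            # scalar training/validation loss
perplexity = tf.exp(avg_cce)                       # e^(average CCE)
```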
:::warning
**Important**: During training, you'll compute CCE for backpropagation, but you should also *calculate and log perplexity* (using your validation set performance) for tracking model quality.
:::

:::warning
We have provided you with a function `calculate_perplexity`, which you should use once you have calculated the average CCE.
:::

### 4.2 Checkpoints

A checkpoint is a version of our model at a given point during training. As training runs get longer, it becomes more and more important to save "checkpoints" of our models. We might, for instance, save a checkpoint after every epoch. This protects us from catastrophic events, like our computer crashing. If you wait until the end of the training loop to save a model and your computer crashes, you're in for a bad time.

TensorFlow provides a very convenient [CheckpointManager](https://www.tensorflow.org/api_docs/python/tf/train/CheckpointManager) that you can use to automatically save (and load) checkpoints. Use a checkpoint manager to save checkpoints periodically during training.

:::spoiler How often is "periodically"?
The more frequently you log your stats, the better view you have into your model's performance. However, computing your average validation loss takes time (which could be spent training your model). We typically log metrics after some number of batches (dependent on your dataset size). In this case, you may want to log every 250, 500, or 1,000 batches depending on your batch size.
:::

:::info
**Task 4.2 [Implement Training Function]:** Complete the `train()` function to train your language model. Your implementation should handle the following:

**Basic Requirements:**
- Train the model for the specified number of epochs
- (*Already implemented for you*) Support resuming training from a checkpoint (if `continue_training=True`, use the provided `start_epoch` value)

Now, breaking it down a bit more, **for each epoch:**

1. **Training Loop:**
    - Iterate over batches from your `tf.data.Dataset` (the dataset will return batches automatically)
    - For each batch, split the sequence into input and target:
        - Input: tokens at positions `[0:seq_len]`
        - Target: tokens at positions `[1:seq_len+1]` (shifted by one)
    - Compute the model's logits (forward pass)
    - Calculate the loss, compute gradients, and apply them using the optimizer
    - **Track the loss for each batch** and accumulate it for epoch averaging
    - **Periodically print batch progress** (e.g., every 50-250 batches) since you'll likely have 2000+ batches; this helps you monitor training without waiting for the full epoch
2. **Validation:**
    - After training in each epoch, evaluate on the test/validation set
    - Use the same batching and input/target sequence shifting as in training
    - Calculate the average validation loss (CCE)
    - **Calculate perplexity** from the validation loss: `perplexity = exp(avg_validation_loss)`
3. **Logging and Tracking:**
    - Update the `history` with:
        - Average training loss (CCE) for the epoch
        - Validation loss (CCE)
        - Perplexity (calculated from validation performance)
    - As a reminder, these must all be **scalar** values
    - wandb should also be updated each epoch with the values you calculate, so update the dictionary with your values!
4. **Checkpointing:**
    - Periodically save model checkpoints (using TensorFlow's CheckpointManager)
    - Consider saving every N epochs, or whenever your model reaches a new best performance (lowest validation loss)

**Helper Functions:** You may find it helpful to write separate methods such as:
- A function to compute the average loss/perplexity on the validation set
- A function to handle the input/target sequence splitting
- Any other utilities that keep your code organized and readable

Your implementation should be robust, well-organized, and provide clear feedback during the potentially long training process.
:::
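For reference, here is a minimal sketch of the inner training step described in point 1 of Task 4.2. The names `model`, `optimizer`, and `batch` are placeholders, and the stencil's model call signature may take additional arguments:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(model, optimizer, batch):
    """One gradient update on a batch of shape [batch_size, seq_len + 1]."""
    inputs, targets = batch[:, :-1], batch[:, 1:]          # input/target shift
    with tf.GradientTape() as tape:
        logits = model(inputs)                             # [batch, seq_len, vocab_size]
        loss = loss_fn(targets, logits)                    # scalar average CCE
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

An epoch is then just a loop over the dataset calling this step and accumulating the returned losses for the epoch average.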
:::info
**Task 4.3 [Testing]** You can now test the functionality of the `train` function you implemented by running the following command
```bash
python tests/run_tests.py train [-v verbose]
```
As always, if you need to run a specific test you can also run
```bash
python -m pytest tests/test_training.py::<test_name> -v
```
:::

## 5. `transformer.py`

Now you'll implement the Transformer architecture, the foundation of modern large language models. The Transformer uses self-attention mechanisms to process sequences in parallel, making it much more efficient than RNNs while achieving better performance. You'll build this up from the core attention mechanisms to the full language model.

:::danger
Make sure to cast all of your tensors to `tf.float32`! You must retain consistency, otherwise your model will crash on OSCAR!
:::

:::info
**Task 5.1 [Core Attention Mechanism]:** The attention mechanism allows the model to focus on different parts of the input sequence when processing each token. You'll implement the scaled dot-product attention from "Attention Is All You Need".

**Understanding Attention:** These attention modules turn a sequence of embeddings into Queries, Keys, and Values. In self-attention, each timestep has a query (Q), key (K), and value (V) embedding. The queries are compared to every key to produce an attention matrix, which is then used to create new embeddings for each timestep.

You will need to fill out the following classes:
- `AttentionMatrix`: Computes attention given key and query matrices.
- `AttentionHead`: Contains and manages a single head of attention, which includes initializing the key, value, and query weight matrices and calling AttentionMatrix to compute the attention weights.
    - You may find `add_weight(name=..., shape=...)` or `Dense` with `use_bias=False` to be helpful here!
- `MultiHeadedAttention`: This class should create $n$ self-attention heads of input size `emb_sz` and output size `emb_sz//n` and concatenate their results.
:::

:::warning
**Understanding Self-Attention:** Self-attention can be fairly confusing. **Make sure you read and follow the instructions in the STENCIL CODE.** We also encourage students to refer back to the lecture slides.

An excellent resource that explains the intuition/implementation of Transformers can be found in the [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) article. We *highly* recommend reading and understanding it to make implementation easier.
:::
:::success
**💡 Implementation Hints:**
- **Attention dimensions**: If `d_model=512` and `n_heads=8`, then each head has `d_k = d_model // n_heads = 64`
- **Scaling factor**: Don't forget the `1/sqrt(d_k)` scaling in the attention scores
- **Shape tracking**: Q, K, V should all have shape `[batch_size, seq_length, d_k]` after projection
:::

:::info
**Task 5.2 [Transformer Components]:** These components combine attention with other essential elements like positional encoding and feed-forward networks to create a complete transformer block. You need to implement the following two classes now:

1. **PositionalEncoding** - Sinusoidal position encodings
    - **Positional encoding formula**:
    $$PE(pos,2i) = \sin(pos/10000^{(2i/d_{model})})$$
    $$PE(pos,2i+1) = \cos(pos/10000^{(2i/d_{model})})$$
    - Add the encodings to the input embeddings
    - **Note: We have already implemented the math for you! You are only responsible for implementing `call` within this class :)**
2. **LanguageTransformerBlock** - Complete transformer block
    - Self-attention with a residual connection
    - Feed-forward network with a residual connection
    - Layer normalization for each sub-layer
:::

:::success
**💡 Implementation Hints:**
- **Residual connections**: Always add the input to the output: `output = input + sublayer(layer_norm(input))`
- **Layer normalization**: Use [tf.keras.layers.LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) for your layer normalization operations
:::

:::info
**Task 5.3 [Language Model Integration]:** Finally, you'll integrate all components into a complete language model capable of both training and text generation.

3. **`TransformerLanguageModel()`** - Complete language model
    - Token embedding with scaling
    - Add positional encodings
    - Apply dropout to the input embeddings
    - Pass through all transformer blocks
    - Apply a final layer normalization
    - Project to the vocabulary
:::

:::warning
**Important Architecture Requirements:**
- **Embedding consistency**: Your embedding/hidden state size must be the same for your word embeddings and transformer blocks
- **Dimension matching**: Ensure all components use consistent `d_model` dimensions throughout the architecture
:::

:::info
**Task 5.4 [Testing]:** You can now verify the functionality of your transformer by running the following tests:
```bash
python tests/run_tests.py transformer [-v verbose]
```
As always, if you need to run a specific test you can also run
```bash
python -m pytest tests/test_transformer.py::<test_name> [-v]
```
:::
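Before moving on to training, here is a minimal sketch of the scaled dot-product attention computation at the heart of Task 5.1. The shapes are illustrative, and whether the causal mask lives inside `AttentionMatrix` or elsewhere is dictated by the stencil; it is shown inline here only for completeness:

```python
import tensorflow as tf

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: [batch_size, seq_length, d_k]. Returns attended values of the same shape."""
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)         # [batch, seq, seq]

    # Causal mask: position i may only attend to positions <= i (needed for language modeling).
    seq_len = tf.shape(scores)[1]
    causal = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)  # lower-triangular ones
    scores += (1.0 - causal) * -1e9

    weights = tf.nn.softmax(scores, axis=-1)                          # attention matrix
    return tf.matmul(weights, V)

# Toy usage with illustrative shapes: batch of 2, sequence length 5, d_k = 8.
Q = K = V = tf.random.normal((2, 5, 8))
out = scaled_dot_product_attention(Q, K, V)                           # [2, 5, 8]
```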
## 6. Training and Evaluation

You have now put together all of the necessary parts of the model, and now it's time to train. Thankfully, we have written `main.py` for you, which calls your model based on the parameters you provide. **It will also generate the necessary submission artifacts you are required to submit!**

You can run `main.py` using the following command:
```
python main.py [-h] [--epochs EPOCHS] [--batch-size BATCH_SIZE] [--seq-length SEQ_LENGTH] [--learning-rate LEARNING_RATE] [--continue-training] [--force-fresh] [--model-type {transformer,vanilla_rnn,lstm}] [--vocab-size VOCAB_SIZE] [--d-model D_MODEL] [--n-heads N_HEADS] [--n-layers N_LAYERS] [--d-ff D_FF] [--dropout-rate DROPOUT_RATE] [--no-wandb]
```

:::warning
If you ever need a refresher on what each of these flags does, you can run
```
python main.py --help
```
which will print out the following options

:::spoiler Options
```bash
  -h, --help            show this help message and exit
  --epochs EPOCHS       Number of training epochs
  --batch-size BATCH_SIZE
                        Batch size
  --seq-length SEQ_LENGTH
                        Sequence length
  --learning-rate LEARNING_RATE
                        Learning rate
  --simple              Use simple training mode
  --interactive         Run interactive generation
  --generate-only       Skip training, only generate
  --continue-training   Continue from checkpoint
  --force-fresh         Force fresh training
  --model-type {transformer,vanilla_rnn,lstm}
                        Type of model to train
  --vocab-size VOCAB_SIZE
                        Vocabulary size
  --d-model D_MODEL     Model dimension/hidden size
  --n-heads N_HEADS     Number of attention heads
  --n-layers N_LAYERS   Number of model layers
  --d-ff D_FF           Feed-forward dimension
  --dropout-rate DROPOUT_RATE
                        Dropout rate
  --no-wandb            Disable wandb logging
```
:::

We define the following ***minimum requirements*** for each of the models you train (epochs, vocab size, sequence length, perplexity, and estimated training time (ETT)):

| Model Type | Epochs | Vocab Size | Sequence Length | Perplexity | ETT (minutes) |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Vanilla RNN | 2 | 3000 | 64 | 100-300 | 5-20 |
| LSTM | 3 | 5000 | 64 | 80-250 | 20-60 |
| Transformer | 5 | 10000 | 256 | 10-50 | 120-240 |

:::info
**Task 6.1 [Text Generation and Submission Prep]:** In order to test that your models work, you can run the following commands, which will run small versions of your models
```bash
# 1. Start with the fast RNN model
python main.py --model-type vanilla_rnn <your params>

# 2. LSTM
python main.py --model-type lstm <your params>

# 3. Train the main transformer model
python main.py --model-type transformer <your params>
```

**Test Generation:**
```bash
# Generate text with a specific model (loads pre-trained weights and requires specifying the model type)
python main.py --generate-only --model-type vanilla_rnn > vanilla_rnn_samples.txt
python main.py --generate-only --model-type lstm > lstm_samples.txt
python main.py --generate-only --model-type transformer > transformer_samples.txt

# Interactive generation with a specific model
python main.py --generate-only --model-type <model> --interactive
```
:::

:::danger
You will find it hard to train the transformer locally on the entire vocabulary for a sufficient amount of time without leveraging GPUs. That brings us to OSCAR!
:::

### OSCAR

OSCAR (Ocean State Center for Advanced Research) is Brown University's high-performance computing cluster. It consists of more than 300 multi-core nodes sharing a high-performance interconnect and file system with 8 petabytes of all-flash storage. The cluster is maintained by Brown's Center for Computation and Visualization (CCV) and is available to anyone with a Brown account, making high-performance computing accessible across disciplines.

Now, make sure you follow the steps below to get set up on OSCAR properly.
:::warning
If you don't already have an account, make one [here](https://brown.co1.qualtrics.com/jfe/form/SV_0GtBE8kWJpmeG4B). Select the "**Exploratory Account without PI**" option. Getting approved should take about a day, so *don't worry* if you feel like it's taking a while.

Once that's done, all the Oscar documentation can be found [here](https://docs.ccv.brown.edu/oscar). There are a variety of ways to access Oscar; however, our personal favorite is to use [Open OnDemand](https://ood.ccv.brown.edu/pun/sys/dashboard), here's the [documentation](https://docs.ccv.brown.edu/oscar/connecting-to-oscar/open-ondemand).
:::

:::info
**Task 6.2 [OSCAR Set-Up]:** It is now your turn to get set up on the compute nodes! Once you have created an OSCAR account, you need to get set up to clone your repository. We are going to use GitHub SSH authentication!

1. Log into an OSCAR terminal using whatever method you prefer
    - We find it easiest to use OSCAR Shell Access online through [OOD](https://ood.ccv.brown.edu/pun/sys/dashboard) or [SSH access through your local terminal](https://docs.ccv.brown.edu/oscar/getting-started#connecting-to-oscar-for-the-first-time)
2. Now you need to generate a public SSH key on OSCAR by running the following command
    ```
    ssh-keygen -t ed25519 -C "your_email@example.com"
    ```
    - You can just keep hitting "return" without any input unless you want to change the destination of your key
3. Next, start the SSH agent and add your key by running the following commands:
    ```
    eval "$(ssh-agent -s)"
    ssh-add ~/.ssh/id_ed25519
    ```
4. Now you can copy the SSH key to your clipboard, which you are going to add to GitHub in a few steps
    ```
    cat ~/.ssh/id_ed25519.pub
    ```
    - You should copy the entire line printed to the terminal, including the `ssh` and email portions
5. It is now time to add your key to GitHub by doing the following:
    - Log on to GitHub and navigate to [settings](https://github.com/settings/profile), which are accessed by hitting your profile icon in the top right corner; settings are in the dropdown
    - Find and navigate to "SSH and GPG Keys" in the left-hand sidebar
    - Click "New SSH Key"
    - For the name, you can choose whatever you want
    - Leave the key type as "Authentication Key"
    - For the "key", paste the entire public key (including the ssh and email portions) into the designated slot
    - Finally, to verify that your setup worked correctly, return to your OSCAR terminal and run the command
      ```
      ssh -T git@github.com

      # Expected Output
      Hi <name>! You've successfully authenticated, but GitHub does not provide shell access.
      ```
6. Now you can finally clone your repository onto OSCAR. Revisit the GitHub Classroom link [here](https://classroom.github.com/a/2jL9_5li). **When you go to get the link to clone the repository, make sure you select SSH, not HTTPS.**
    - You can clone the repository into any folder you want!
:::

:::info
**Task 6.3 [Setting Up Your Environment]:** Once you have cloned your repository, we need to set up your environment so you can properly load your wandb dashboard.

1. You should `cd` into the repository and then run the following command to set up `miniconda`
    ```bash
    module load miniconda3/23.11.0s && echo "Module loads OK" || echo "Module failed"
    ```
2. Now we need to set up the course environment using the following command (this will take a few minutes)
    ```
    bash setup_oscar_env.sh
    ```
3. You should then run the following commands to activate your conda environment
    ```
    conda init
    conda activate csci1470
    ```
4. Finally, you should see a `csci1470` prefix in your terminal line. You can now run the following command to log into wandb, following the same steps as earlier
    ```
    python src/utils/wandb_login.py
    ```
5. Now you can **run `conda deactivate`** and move on to the next steps
:::

:::info
**Task 6.4 [Training on Oscar]:** This is an assignment on large language models, so it's about time you make a **large** one. The previously trained models decrease the loss, but their outputs probably look mostly incomprehensible.

1. Create a Slurm script to train a model on Oscar. We have provided a suggested starting point in `slurm_template.sh` that loads the appropriate modules and TensorFlow, but you must edit the parameters for the call to `main.py` with appropriate arguments.
    - We recommend training your Transformer LLM with the following parameters: `d_model = 512`, `vocab_size = 10,000`, and Adam with a learning rate of 1e-4, trained over 5 epochs. This set of hyperparameters should perform relatively well.
    - You will also likely need to adjust the amount of time allocated to the job depending on how long it takes to run.
    - Don't forget, you can always continue training from the last checkpoint/epoch by calling `main.py` with `--continue-training`
2. Run your Slurm script using `sbatch`. You can monitor your job using the tools presented in class and through Weights and Biases.
    ```
    sbatch slurm_template.sh
    ```
:::

:::warning
You can verify your job is running by entering the following in the terminal
```
myq
```
You should get an output that prints the name of your job, how much time it has been allocated to run, and its current status.
:::

## Implementation Tips

(More tips may be added as we see common issues)

### Transformer Debugging Tips

:::warning
**Common Shape Issues:**
- Attention output: `[batch_size, seq_length, d_model]`
- Multi-head concatenation: `[batch_size, seq_length, n_heads * d_k]` where `n_heads * d_k = d_model`
- Verify shapes after each layer with `print(f"Shape after layer: {tensor.shape}")`
:::

## Submission

:::info
**Requirements:** Submit your completed stencil files with all TODO sections implemented, plus trained model weights and training plots:

**Code Files:**
- All stencil `.py` files with TODO sections implemented
- Your implementations should pass the unit tests for individual components

**Model Artifacts (Required for each model type):**
- **`submission_{model_type}.json`** - Submission file that reports your final perplexity and training regime (*you will need to rename the submission files in the `submission/` directory*)
    - If you manually alter the values in this file, you will fail the autograder tests, as we verify your submissions with hashing
- **`{model_type}_model.weights.h5`** - Trained model weights for each architecture you implement
- **`{model_type}_samples.txt`** - Generated text samples from your trained model

**Training Evidence Requirements:**
- Models must be actually trained (not randomly initialized)
- Generated text should be coherent mystery-themed content
:::