
HW4 Programming: Language Models

Conceptual questions due Monday, March 18th, 2024 at 6:00 PM EST
Programming assignment due Friday, March 22nd, 2024 at 6:00 PM EST

In this assignment, you will be building a Language Model to learn various word embedding schemas to help minimize your NLP losses. Please read this handout in its entirety before beginning the assignment.

Theme


Oh no! One of our HTAs, Raymond, has been captured by an octopus! You can't save him, but our fishy friends can! To help them help us, we must commence Operation RNN (Raymond's Nautical Neutralization)!

Introduction

Conceptual Questions

Please fill out the conceptual questions on the website under HW4: Language Models. You should be able to type up the questions with LaTeX or upload photos of written work (as long as it's clearly legible).

2470 students only: If you are in 2470, all conceptual questions (including non-2470 ones) should be written as one pdf and submitted to the 2470 submission box.

You can find the conceptual questions here.

Getting the stencil

Please click on the GitHub Classroom link to get the stencil code.

Do not change the stencil except where specified. While you are welcome to write your own helper functions, changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.

Assignment Overview

[Word Prediction] You'll get to make a language model that learns to generate sentences by training the model to predict a next word conditioned on a subset of previous words.

  • Trigram: Train a supervised formulation to predict word[i] = f(word[i-2], word[i-1]).
  • RNN: Train a recurrent neural network to predict word[i] = f(word[0], …, word[i-1]).

[LSTM/GRU] You'll get to create a recurrent layer from scratch (with some starter code) using a recurrent kernel.

Roadmap

Step 1: Preprocessing

The Corpus

When it comes to language modeling, there is a LOT of data out there! Since we're just predicting words from other words, almost any large text corpus will do. Our data comes from the Simple English Wikipedia corpus and comprises articles about various earthly topics. English, and all natural languages, are tricky, so to make things easier we have done the following simplification on the already-simple Simple Wiki.

  • Lemmatization: You will notice a few strange things in the corpus: inflectional morphology, numbers, and pronouns have been "lemmatized" (e.g. "I am teleporting" \(\rightarrow\) "be teleport"). Lemmatization is based on the principle that "teleport", "teleports", "teleported", and "teleporting" all contribute similar content to the sentence; rather than learn 4 words, the model only needs to learn one.
  • Limiting Vocabulary: Additionally, you will find things that look like <_UNK>. UNK-ing is a common preprocessing technique to remove rare words while preserving basic information about their role in the sentence. This will help the model focus more on learning patterns in general and common language. (Effectively, <_UNK> refers to an unknown word i.e. a word that is outside the defined vocabulary).
  • Meta-tokens: You will notice that each text file has one article per line, followed by a 'STOP'. For this assignment, we don't need to worry about where some articles end and others begin. Thus, in preprocessing, you should concatenate all the words from all the articles together. Doing this will mean that you can avoid padding and masking (but don't fret; we'll need masking and padding later)!

Preprocessing

Your preprocessing file should contain a get_data() function. This function takes a training file and a testing file and performs the following (a rough sketch appears after the list):

  1. Load the train words and split the words on whitespace.
  2. Load the test words and split the words on whitespace.
  3. Create a vocab dictionary that maps each word in the corpus to a unique index (its ID). Only one dictionary is needed for training and testing, as the testing set should only test on words that appear in the training set. There is no need to strip punctuation. To be specific, the ID is simply a counter that increments with each new word: every time a new word is added to the vocabulary dictionary, it is assigned the next unused integer.
  4. Convert the list of training and test words to their indices, making a 1-d list/array for each (i.e. each sentence is concatenated together to form one long list).
  5. Return an iterable of training ids, an iterable of testing ids, and the vocab dict.
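
Here is a minimal sketch of what get_data might look like under the steps above. The file handling details and the choice of lists vs. arrays for the return values are assumptions; follow your stencil's signature.

```python
def get_data(train_file, test_file):
    """Rough sketch of the preprocessing steps described above."""
    # 1-2. Load both files and split on whitespace
    with open(train_file) as f:
        train_words = f.read().split()
    with open(test_file) as f:
        test_words = f.read().split()

    # 3. Build one vocab dict from the training words, assigning incrementing ids
    vocab = {}
    for word in train_words:
        if word not in vocab:
            vocab[word] = len(vocab)

    # 4. Convert both word lists to id lists (the test set is assumed to contain
    #    only words that appear in the training set)
    train_ids = [vocab[w] for w in train_words]
    test_ids = [vocab[w] for w in test_words]

    # 5. Return iterables of ids plus the vocab dict
    return train_ids, test_ids, vocab
```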

Hyperparameters

When creating each of your Trigram/RNN models, you must use a single embedding matrix. Remember that there should be a nonlinear function (in our case ReLU) applied between the feed-forward layers.

For your Trigram, you must use two words to predict a third. For your RNN, you must use a window size of 20. You can have any batch size for either of these models.

However, your models must train in under 10 minutes on a department machine!

Step 2: Trigram Language Model

In the Trigram Language Model part of the assignment, you will build a neural network that takes two words and predicts the third. It should do this by looking up the input words in the embedding matrix, and feeding the result into a set of feed-forward layers. You can look up words in an embedding matrix with their ids using the function tf.nn.embedding_lookup or the tf.keras.layers.Embedding layer.
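
As a rough illustration (not the stencil's exact structure, and the layer sizes below are placeholder assumptions), the model might embed the two input ids, flatten the embeddings, and run them through feed-forward layers with a ReLU in between:

```python
import tensorflow as tf

class TrigramModel(tf.keras.Model):
    """Illustrative sketch only; the name and hyperparameter values are assumptions."""
    def __init__(self, vocab_size, embed_size=64, hidden_size=128):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        self.flatten = tf.keras.layers.Flatten()
        self.hidden = tf.keras.layers.Dense(hidden_size, activation="relu")
        self.out = tf.keras.layers.Dense(vocab_size, activation="softmax")

    def call(self, inputs):
        # inputs: [batch_size, 2] word ids
        embeddings = self.embedding(inputs)     # [batch_size, 2, embed_size]
        flat = self.flatten(embeddings)         # [batch_size, 2 * embed_size]
        return self.out(self.hidden(flat))      # [batch_size, vocab_size] probabilities
```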

Train and test

In the main function, you will want to preprocess your train and test data by splitting them into inputs and labels as appropriate. Initialize your model, train it for 1 epoch, and then test it. At the end, you can print out your model's perplexity and use your model to generate sentences with different starting words via the generate_sentence function (already provided).
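
Below is a hedged sketch of how that data split and training flow could look. TrigramModel, perplexity, and the file paths are assumptions tied to the other sketches in this handout, not the stencil's required names:

```python
import numpy as np

def main():
    # Placeholder paths; use whatever your stencil expects
    train_ids, test_ids, vocab = get_data("data/train.txt", "data/test.txt")
    train_ids, test_ids = np.array(train_ids), np.array(test_ids)

    # Inputs are pairs (word[i-2], word[i-1]); labels are word[i]
    train_x = np.stack([train_ids[:-2], train_ids[1:-1]], axis=1)
    train_y = train_ids[2:]
    test_x = np.stack([test_ids[:-2], test_ids[1:-1]], axis=1)
    test_y = test_ids[2:]

    model = TrigramModel(len(vocab))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=[perplexity])
    model.fit(train_x, train_y, epochs=1, batch_size=100)
    model.evaluate(test_x, test_y, batch_size=100)
    # Then call the provided generate_sentence with your trained model
    # (see the stencil for its exact arguments).
```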

Please complete the following tasks

  • Fill out the init function and define your trainable variables or layers.
  • Fill out the call function using the trainable components in the init function.
  • Compile your model as usual and feel free to play around with hyperparameters.
  • For loss, please use sparse categorical crossentropy, as we don't want to one-hot encode labels over the entire vocabulary every time.
    • Note that this function can take in logits or probabilities depending on how you set the from_logits parameter.
  • For accuracy, please write a function, perplexity, that returns perplexity. This is a relatively standard quality metric for NLP tasks and is merely exponentiated cross-entropy (see the sketch after this list).
    • You can (and should) make use of sparse_categorical_crossentropy or create a subclass of its layer analogue to write this function.
  • See accuracy threshold in grading section below.
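
One way to write this metric, treating perplexity as the exponentiated mean sparse categorical cross-entropy (a sketch; set from_logits to match whatever your call function returns):

```python
import tensorflow as tf

def perplexity(y_true, y_pred):
    # Mean sparse categorical cross-entropy over the batch, then exponentiate
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=False)
    return tf.exp(tf.reduce_mean(ce))
```

Note that exponentiating per-batch averages is only an approximation of full-corpus perplexity, but it is the natural way to report it as a Keras metric.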

FAQ

  • When we say "layer analogue," we mean the counterpart of a tf.nn function in the tf.keras.layers module. For example, tf.keras.layers.Conv2D is the layer analogue (or counterpart) of the tf.nn.conv2d function.
  • There are no restrictions regarding the use of Sequential when building your models.
  • You may run into an interesting bug where your SCCE loss keeps going down, but your perplexity decreases for a while and then takes a sharp upward turn. This means you’re really close and are just missing an operation somewhere in your forward pass…

Step 3: RNN Language Model

For this part, you'll formulate the problem as predicting the i-th word based on words 0 through i-1. This is a perfect use case for RNNs!

Train and test

This is exactly the same as trigram except for your data formulation. You will now be predicting the next word token for every single word token in the input. You will be able to achieve this by making the output an offset of the input!
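
For example (a sketch with assumed names), given the flat id array from preprocessing and a window size of 20, the inputs and labels could be built like this:

```python
import numpy as np

def make_windows(ids, window_size=20):
    """Split a flat id array into non-overlapping (input, label) windows,
    where the labels are the inputs shifted forward by one position."""
    ids = np.array(ids)
    num_windows = (len(ids) - 1) // window_size
    ids = ids[: num_windows * window_size + 1]             # drop the ragged tail
    inputs = ids[:-1].reshape(num_windows, window_size)    # words 0 .. n-1
    labels = ids[1:].reshape(num_windows, window_size)     # words 1 .. n
    return inputs, labels
```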

  • You'll still want to embed your inputs into a high-dimensional space and allow the representations to be trained up through the optimization loop, so you'll still want an embedding layer.
  • Use tf.keras.layers.GRU OR tf.keras.layers.LSTM for your RNN to predict the sequence of next tokens.
    • HINT: Make sure to play around with return_sequences=True and return_state=True as necessary/desired. You might need one of these set for your model to run properly! Note that the final state from an LSTM has two components: the last output and cell state. These are the second and third outputs of calling the RNN. If you use a GRU, the final state is the second output returned by the RNN layer.
  • Fill out the call function as appropriate (a rough sketch follows this list); everything else should be the same as trigram!
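
Here is a rough sketch of what this model could look like (the name and sizes are placeholder assumptions; adapt to your stencil):

```python
import tensorflow as tf

class RNNModel(tf.keras.Model):
    """Illustrative sketch only."""
    def __init__(self, vocab_size, embed_size=64, rnn_size=128):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        # return_sequences=True gives one output per position in the window
        self.gru = tf.keras.layers.GRU(rnn_size, return_sequences=True)
        self.out = tf.keras.layers.Dense(vocab_size, activation="softmax")

    def call(self, inputs):
        # inputs: [batch_size, window_size] word ids
        embeddings = self.embedding(inputs)   # [batch_size, window_size, embed_size]
        outputs = self.gru(embeddings)        # [batch_size, window_size, rnn_size]
        return self.out(outputs)              # [batch_size, window_size, vocab_size]
```

If the provided generate_sentence (or anything else in the stencil) needs the RNN's final state, you may also want return_state=True and to return that state from call; check the stencil to see what it expects.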

FAQ

  • Make sure you end up using non-overlapping windows when you pass data into your RNN. This should be automatically handled by the fit batching system, but be wary of information bleeding if you make custom components.
  • If you would like to use the leaky_relu activation function at any point, you may have to use tf.keras.layers.LeakyReLU() rather than the string "leaky_relu" depending on which TensorFlow version you're using.
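
For instance (a minimal illustration; the layer size is a placeholder):

```python
import tensorflow as tf

# Instead of Dense(128, activation="leaky_relu"), which some TF versions reject,
# keep the activation as its own layer and apply it in call():
dense = tf.keras.layers.Dense(128)
leaky = tf.keras.layers.LeakyReLU()
# ... then in call(): out = leaky(dense(x))
```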

Trying Out Results

We've provided a generate_sentence function for this part of the assignment. You can pass your model to this function to see sample generations. If you see lots of UNKs, have no fear! These are expected and quite common in the data! You should, however, start to see at least some coherence forming in the outputs.

Bonus Section: LSTM/GRU Manual

The RNN layers are surprisingly nuanced in their operations, but it is important to consider how these can actually be implemented when thinking about their pros and cons. There is a walk-through of roughly how to do it in the GRU and LSTM notebooks, and the process honestly isn't that bad after some initial setup which is already provided. Still, it does take a bit of time to understand what's going on since there's a notion of a recurrent kernel which has some interesting uses…

The custom_rnns folder contains notebooks to help you manually implement LSTMs and GRUs. These sections are optional and are worth 5/70 each in addition to the 70/70 you can get otherwise.
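
To give a flavor of what the notebooks build toward, here is a rough single-timestep GRU update under one common convention. This is not the notebooks' starter code; the gate ordering, kernel layout, and bias handling there may differ, so treat it purely as an illustration of how a kernel and a recurrent kernel interact:

```python
import tensorflow as tf

def gru_step(x_t, h_prev, kernel, recurrent_kernel, bias):
    """One GRU timestep (illustrative). Assumes kernel: [input_dim, 3*units],
    recurrent_kernel: [units, 3*units], bias: [3*units], gates packed as (z, r, h)."""
    # Project the input and the previous hidden state through their kernels
    x_proj = tf.matmul(x_t, kernel) + bias          # [batch, 3*units]
    h_proj = tf.matmul(h_prev, recurrent_kernel)    # [batch, 3*units]

    x_z, x_r, x_h = tf.split(x_proj, 3, axis=-1)
    h_z, h_r, h_h = tf.split(h_proj, 3, axis=-1)

    z = tf.sigmoid(x_z + h_z)              # update gate
    r = tf.sigmoid(x_r + h_r)              # reset gate
    h_tilde = tf.tanh(x_h + r * h_h)       # candidate state (reset applied component-wise)
    return z * h_prev + (1.0 - z) * h_tilde
```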

FAQ

  • Please work on GRU first, and then LSTM. GRU provides you with more starter code which serves as a good example for your LSTM implementation.
  • Please leave your code in MyGRU and MyLSTM. Make sure it stays in custom_rnns.
  • The \(\otimes\) in the GRU equations stands for component-wise multiplication.

Grading

Language Models: You will be primarily graded on functionality. Your Trigram model should have a test perplexity < 120, and your RNN model should have a test perplexity < 100, to receive full credit (this applies to both CS1470 and CS2470 students).

Conceptual: You will be primarily graded on correctness, thoughtfulness, and clarity.

LSTM/GRU Manual: This will be graded by the autograder; tests are also provided in the notebooks.

README: Put your README in the /code directory. This should just contain your accuracy and any known bugs you have.

Autograder

Your Trigram model must complete training within 10 minutes, and your RNN must also train within 10 minutes (it can take less). Our autograder will import your model via get_text_model, similar to previous assignments. The LSTM/GRU manual implementations will be pulled from the GRU.ipynb and LSTM.ipynb notebooks, and we will run tests similar to those in the notebooks.

Handing In

You should submit the assignment via Gradescope under the corresponding project assignment by zipping up your hw4 folder or through GitHub (recommended). To submit through GitHub, commit and push all changes to your repository to GitHub. You can do this by running the following three commands (this is a good resource for learning more about them):

  1. git add file1 file2 file3
    • Alternatively, git add -A will stage all changed files for you.
  2. git commit -m “commit message”
  3. git push

After committing and pushing your changes to your repo (which you can check online if you're unsure if it worked), you can now just upload the repo to Gradescope! If you’re testing out code on multiple branches, you have the option to pick whichever one you want.

If you wish to submit via zip file:

  1. Please make sure your Python files are in “hw4/code”; this is very important for our autograder to work!
  2. Make sure any data folders are not being uploaded as they may be too big for the autograder to work.

IF YOU ARE IN 2470: PLEASE REMEMBER TO ADD A BLANK FILE CALLED 2470student IN THE hw4/code DIRECTORY, WE ARE USING THIS AS A FLAG TO GRADE 2470 SPECIFIC REQUIREMENTS, FAILURE TO DO SO MEANS LOSING POINTS ON THIS ASSIGNMENT

Good luck everyone!!!