Conceptual questions due Monday, March 18th, 2024 at 6:00 PM EST
Programming assignment due Friday, March 22nd, 2024 at 6:00 PM EST
In this assignment, you will build a Language Model and learn various word embedding schemes to help minimize your NLP losses. Please read this handout in its entirety before beginning the assignment.
Please fill out the conceptual questions on the website under HW4: Language Models. You may type up your answers in LaTeX or upload photos of written work (as long as it's clearly legible).
2470 students only: If you are in 2470, all conceptual questions (including non-2470 ones) should be written as one pdf and submitted to the 2470 submission box.
You can find the conceptual questions here.
Please click on the GitHub Classroom link to get the stencil code.
Do not change the stencil except where specified. While you are welcome to write your own helper functions, changing the stencil's method signatures or removing pre-defined functions could make your code incompatible with the autograder and result in a low grade.
[Word Prediction] You'll get to make a language model that learns to generate sentences by training the model to predict the next word conditioned on a subset of the previous words:
- Trigram: word[i] = f(word[i-2], word[i-1])
- RNN: word[i] = f(word[0], …, word[i-1])
[LSTM/GRU] You'll get to create a recurrent layer from scratch (with some starter code) using a recurrent kernel.
When it comes to language modeling, there is a LOT of data out there! Since we're just predicting words from other words, almost any large text corpus will do. Our data comes from the Simple English Wikipedia corpus and comprises articles about various earthly topics. English, like all natural languages, is tricky, so to make things easier we have applied some simplifications to the already-simple Simple Wiki.
Your preprocessing file should contain a get_data() function. This function takes a training file and a testing file, tokenizes them, builds a vocabulary that maps each word to a unique id, and returns the train and test data as sequences of ids along with the vocabulary.
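For illustration, here's a minimal sketch of what such a function might look like, assuming whitespace tokenization and an *UNK* token already present in the corpus; your stencil's exact steps and return signature may differ:

```python
def get_data(train_file, test_file):
    """A sketch: read both files, build a word-to-id vocabulary from the
    training data, and return both datasets as sequences of ids."""
    with open(train_file) as f:
        train_words = f.read().split()
    with open(test_file) as f:
        test_words = f.read().split()

    # Build the vocabulary from the training words only.
    vocab = {word: i for i, word in enumerate(sorted(set(train_words)))}

    # Map words to ids; unseen test words fall back to *UNK*
    # (assumed to already appear in the simplified corpus).
    unk = vocab.get("*UNK*", 0)
    train_ids = [vocab[w] for w in train_words]
    test_ids = [vocab.get(w, unk) for w in test_words]
    return train_ids, test_ids, vocab
```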
When creating each of your Trigram/RNN models, you must use a single embedding matrix. Remember that a nonlinear function (in our case, ReLU) should be applied between the feed-forward layers.
For your Trigram, you must use two words to predict a third. For your RNN, you must use a window size of 20. You can have any batch size for either of these models.
However, your models must train in under 10 minutes on a department machine!
In the Trigram Language Model part of the assignment, you will build a neural network that takes two words and predicts the third. It should do this by looking up the input words in the embedding matrix and feeding the result into a set of feed-forward layers. You can look up words in an embedding matrix by their ids using the tf.nn.embedding_lookup function or the tf.keras.layers.Embedding layer.
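For instance, both options below turn a batch of (word1, word2) id pairs into a [batch_size, 2, embedding_size] tensor ready for the dense layers; the sizes and variable names here are illustrative, not from the stencil:

```python
import tensorflow as tf

vocab_size, embedding_size = 10000, 64
word_ids = tf.constant([[12, 845], [7, 2031]])  # a batch of two (word1, word2) pairs

# Option 1: a raw trainable matrix plus tf.nn.embedding_lookup.
E = tf.Variable(tf.random.normal([vocab_size, embedding_size], stddev=0.1))
embs = tf.nn.embedding_lookup(E, word_ids)           # shape [2, 2, 64]

# Option 2: the Keras layer, which owns its embedding matrix internally.
embed = tf.keras.layers.Embedding(vocab_size, embedding_size)
embs = embed(word_ids)                               # shape [2, 2, 64]

# Either way, concatenate the two word embeddings before the dense layers.
flat = tf.reshape(embs, [embs.shape[0], -1])         # shape [2, 128]
```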
In the main function, you will want to preprocess your train and test data by splitting them into inputs and labels as appropriate. Initialize your model, train it for 1 epoch, and then test it. At the end, you can print out your model's perplexity and use your model to generate sentences with different starting words via the generate_sentence function (already provided).
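Schematically, main might look like the sketch below; TrigramModel, train, and test are placeholder names for whatever your stencil defines, while generate_sentence is the provided helper (its exact signature may differ):

```python
import numpy as np

def main():
    train_ids, test_ids, vocab = get_data("data/train.txt", "data/test.txt")

    # Trigram formulation: each pair of consecutive words predicts the next.
    def to_pairs(ids):
        inputs = np.array([ids[i:i + 2] for i in range(len(ids) - 2)])
        labels = np.array(ids[2:])
        return inputs, labels

    train_x, train_y = to_pairs(train_ids)
    test_x, test_y = to_pairs(test_ids)

    model = TrigramModel(len(vocab))       # hypothetical model class
    train(model, train_x, train_y)         # a single epoch
    print("test perplexity:", test(model, test_x, test_y))
    generate_sentence("the", 10, vocab, model)  # provided helper; signature assumed
```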
Please complete the following tasks:
- Fill out the init function and define your trainable variables or layers.
- Fill out the call function using the trainable components from the init function.
- Be careful with the from_logits parameter when computing your loss.
- Write a function, perplexity, that returns perplexity (see the sketch after this list). This is a relatively standard quality metric for NLP tasks and is merely exponentiated cross-entropy. You can use sparse_categorical_crossentropy or create a subclass of its layer analogue to write this function. (For example, tf.keras.layers.Conv2D is the layer analogue, or counterpart, of the tf.nn.convolution function.)
- Feel free to use Sequential when building your models.
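As referenced in the list above, here is a minimal sketch of the perplexity computation, assuming your model outputs logits over the vocabulary:

```python
import tensorflow as tf

def perplexity(labels, logits):
    """Exponentiated mean sparse cross-entropy; a sketch, assuming
    `labels` are word ids and `logits` are unnormalized scores."""
    losses = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.exp(tf.reduce_mean(losses))
```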
For this part, you'll be able to formulate your problem as that of predicting the i-th word based off of words 0 through i-1. This is a perfect use case for RNNs!
This is exactly the same as the trigram setup except for the data formulation. You will now be predicting the next word token for every single word token in the input. You can achieve this by making the labels an offset of the inputs, as sketched below!
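Concretely, with a window size of 20, the labels are just the inputs shifted one position to the left; one way to set that up (names and trimming strategy are illustrative):

```python
import numpy as np

window_size = 20
ids = np.array(train_ids)  # the id sequence from preprocessing

# Trim so the sequence splits evenly into windows, keeping one extra
# token at the end so the labels can be offset by one.
n = (len(ids) - 1) // window_size * window_size
inputs = ids[:n].reshape(-1, window_size)        # words 0 .. n-1
labels = ids[1:n + 1].reshape(-1, window_size)   # words 1 .. n, i.e. inputs shifted by one
```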
- Use tf.keras.layers.GRU or tf.keras.layers.LSTM for your RNN to predict the sequence of next words.
- Set return_sequences=True and return_state=True as necessary/desired. You might need one of these set for your model to run properly! Note that the final state from an LSTM has two components, the last output and the cell state; these are the second and third outputs of calling the RNN. If you use a GRU, the final state is the second output returned by the RNN layer.
- If you use the leaky_relu activation function at any point, you may have to use tf.keras.layers.LeakyReLU() rather than the string "leaky_relu", depending on which TensorFlow version you're using.
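Putting those pieces together, one plausible shape for the RNN model follows; the class name, layer sizes, and the choice of a GRU rather than an LSTM are illustrative assumptions, not requirements of the stencil:

```python
import tensorflow as tf

class MyRNN(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size=128, rnn_size=256):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, embedding_size)
        # return_sequences=True gives an output at every timestep, which is
        # what we need to predict the next word for every input word.
        self.gru = tf.keras.layers.GRU(rnn_size, return_sequences=True,
                                       return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)  # logits over the vocab

    def call(self, inputs, initial_state=None):
        embs = self.embed(inputs)                         # [batch, window, embed]
        outputs, final_state = self.gru(embs, initial_state=initial_state)
        return self.dense(outputs), final_state           # [batch, window, vocab]
```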
We've provided a generate_sentence function for this part of the assignment. You can pass your model to this function to see sample generations. If you see lots of UNKs, have no fear! These are expected and quite common in the data. You should, however, start to at least see some reason forming in the outputs.
The RNN layers are surprisingly nuanced in their operations, but it is important to consider how these can actually be implemented when thinking about their pros and cons. There is a walk-through of roughly how to do it in the GRU and LSTM notebooks, and the process honestly isn't that bad after some initial setup which is already provided. Still, it does take a bit of time to understand what's going on since there's a notion of a recurrent kernel which has some interesting uses…
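To give a flavor of the recurrent-kernel idea before you open the notebooks, here is roughly what a single GRU timestep looks like when the update (z), reset (r), and candidate weights are packed side by side in one kernel and one recurrent kernel, similar to Keras's layout; the names and exact gate conventions here are assumptions, not the notebooks' interface:

```python
import tensorflow as tf

def gru_step(x, h, kernel, recurrent_kernel, bias):
    """One GRU timestep. x: [batch, input_dim], h: [batch, units].
    kernel: [input_dim, 3*units], recurrent_kernel: [units, 3*units],
    bias: [3*units]. A sketch, not a definitive implementation."""
    # Project the input and previous state once each, then slice out the
    # update (z), reset (r), and candidate (h~) pieces.
    xw = tf.matmul(x, kernel) + bias
    hu = tf.matmul(h, recurrent_kernel)
    xz, xr, xh = tf.split(xw, 3, axis=-1)
    hz, hr, hh = tf.split(hu, 3, axis=-1)

    z = tf.sigmoid(xz + hz)             # update gate
    r = tf.sigmoid(xr + hr)             # reset gate
    h_tilde = tf.tanh(xh + r * hh)      # candidate state
    return z * h + (1.0 - z) * h_tilde  # interpolate old and candidate state
```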
The custom_rnns folder contains notebooks to help you manually implement LSTMs and GRUs. These sections are optional and are worth 5/70 each in addition to the 70/70 you can get otherwise. Fill out MyGRU and MyLSTM, and make sure your work stays in the custom_rnns folder.

- Language Models: You will be primarily graded on functionality. Your Trigram model should have a test perplexity < 120, and your RNN model should have a test perplexity < 100 to receive full credit (this applies to both CS1470 and CS2470 students).
- Conceptual: You will be primarily graded on correctness, thoughtfulness, and clarity.
- LSTM/GRU Manual: This will be graded by the autograder. Tests are also in the notebook.
- README: Put your README in the /code directory. This should just contain your accuracy and any known bugs you have.
Your model must complete training within 10 minutes for the Trigram model and 10 minutes for the RNN (it can take less). Our autograder will import your model via get_text_model, similar to previous assignments. The LSTM/GRU manual implementations will be pulled in from the GRU.ipynb or LSTM.ipynb notebook, and we will run tests similar to those seen in the notebook.
You should submit the assignment via Gradescope under the corresponding project assignment by zipping up your hw4 folder or through GitHub (recommended). To submit through GitHub, commit and push all changes to your repository to GitHub. You can do this by running the following three commands (this is a good resource for learning more about them):
1. git add file1 file2 file3 (alternatively, git add -A will stage all changed files for you)
2. git commit -m "commit message"
3. git push
After committing and pushing your changes to your repo (you can check online if you're unsure whether it worked), you can now just upload the repo to Gradescope! If you're testing out code on multiple branches, you have the option to pick whichever one you want.
If you wish to submit via zip file, zip up your hw4 folder and upload it to the corresponding Gradescope assignment.
IF YOU ARE IN 2470: PLEASE REMEMBER TO ADD A BLANK FILE CALLED 2470student IN THE hw4/code DIRECTORY. WE ARE USING THIS AS A FLAG TO GRADE 2470-SPECIFIC REQUIREMENTS; FAILURE TO DO SO MEANS LOSING POINTS ON THIS ASSIGNMENT.
Good luck everyone!!!