# HW4: Language Models
**Conceptual Questions:** Due Monday, 10/31/2022 at 11:59 PM EST
**Programming Assignment:** Due Monday, 11/07/2022 at 11:59 PM EST
In this assignment, you will be building a Language Model to learn various word embedding schemas to help minimize your NLP losses. _Please read this handout in its entirety before beginning the assignment._
## Introduction
### Conceptual Questions
Please fill out the conceptual questions linked on the website under HW4: Language Models. You should be able to type up the questions with LaTeX or upload photos of written work (as long as it's clearly legible).
### Getting the stencil
Please click on the [GitHub Classroom link](https://classroom.github.com/a/Ex-PsmM_) to get the stencil code.
### Setup
Please use the conda env you created at the beginning of the semester using csci1470.yml or csci1470-m1.yml. Double-check that your TensorFlow version is 2.5 and your NumPy version is less than 1.19.4.
### Assignment Overview
**[Word Prediction]** You'll get to make a language model that learns to generate sentences by training the model to predict the next word conditioned on a subset of previous words.
- **Trigram:** Train a supervised formulation to predict word[i] = f(word[i-2], word[i-1]).
- **RNN:** Train a recurrent neural network to predict word[i] = f(word[0], …, word[i-1]).
**[LSTM/GRU]** 2470 component w/ bonus options. You'll get to create a recurrent layer from scratch (with some starter code) using a recurrent kernel.
## Roadmap
### Step 1: Preprocessing
#### The Corpus
When it comes to language modeling, there is a LOT of data out there! Since we're just predicting words from other words, almost any large text corpus will do. Our data comes from the Simple English Wikipedia corpus and comprises articles about various earthly topics. English, and all natural languages, are tricky, so to make things easier we have simplified the already-simple Simple Wiki.
- **Lemmatization:** You will notice a few strange things in the corpus: inflectional morphology, numbers, and pronouns have been "lemmatized" (e.g. "I am teleporting" --> "be teleport"). Lemmatization is based on the principle that "teleport", "teleports", "teleported", and "teleporting" all contribute similar content to the sentence; rather than learn 4 words, the model only needs to learn one.
- **Limiting Vocabulary:** Additionally, you will find tokens that look like `<_UNK>`. _UNK_-ing is a common preprocessing technique that removes rare words while preserving basic information about their role in the sentence. This helps the model focus on learning patterns in general, common language.
- **Meta-tokens:** You will notice that each text file has one article per line, each followed by a 'STOP'. For this assignment, we don't need to worry about where one article ends and another begins. Thus, in preprocessing, you should concatenate all the words from all the articles together. Doing this means you can avoid padding and masking (but don't fret; we'll need masking and padding later)!
#### Preprocessing
Your preprocessing file should contain a `get_data()` function. This function takes a training file and a testing file and performs the following (a rough sketch is given after the list):
1. Load the train words and split the words on whitespace.
2. Load the test words and split the words on whitespace.
3. Create a vocab dictionary that maps each word in the corpus to a unique index (its ID). Only one dictionary is needed for training and testing, since the test set should only test on words that appear in the training set.
4. Convert the list of training and test words to their indices, making a 1-d list/array for each (i.e. each sentence is concatenated together to form one long list).
5. Return an iterable of training ids, an iterable of testing ids, and the vocab dict.
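Here is a minimal sketch of what `get_data` might look like, assuming plain whitespace-separated text files; the exact file handling and return types are up to you.

```python
import numpy as np

def get_data(train_file, test_file):
    # 1-2. read each file and split on whitespace
    with open(train_file) as f:
        train_words = f.read().split()
    with open(test_file) as f:
        test_words = f.read().split()

    # 3. one vocab dict mapping each training word to a unique id
    vocab = {word: i for i, word in enumerate(sorted(set(train_words)))}

    # 4. convert both word lists into flat 1-D arrays of ids
    train_ids = np.array([vocab[w] for w in train_words])
    test_ids = np.array([vocab[w] for w in test_words])

    # 5. return train ids, test ids, and the vocab dict
    return train_ids, test_ids, vocab
```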
#### Hyperparameters
When creating each of your Trigram/RNN models, you must use a single embedding matrix. Remember that a nonlinear function (in our case, ReLU) should be applied between the feed-forward layers.
For your Trigram, you must use two words to predict a third. For your RNN, you must use a window size of 20. You can have any batch size for either of these models.
However, your models must train in under 10 minutes on a department machine!
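For reference, here is one possible way to slice the flat id sequences into these two formulations. The function names and shapes are purely illustrative and not part of the stencil.

```python
import numpy as np

def trigram_examples(ids):
    # inputs are (word[i-2], word[i-1]) pairs, labels are word[i]
    ids = np.asarray(ids)
    inputs = np.stack([ids[:-2], ids[1:-1]], axis=1)   # shape (N-2, 2)
    labels = ids[2:]                                    # shape (N-2,)
    return inputs, labels

def rnn_examples(ids, window_size=20):
    # non-overlapping windows; labels are the inputs shifted by one word
    ids = np.asarray(ids)
    num_windows = (len(ids) - 1) // window_size
    usable = ids[: num_windows * window_size + 1]
    inputs = usable[:-1].reshape(num_windows, window_size)
    labels = usable[1:].reshape(num_windows, window_size)
    return inputs, labels
```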
### Step 2: Trigram Language Model
In the Trigram Language Model part of the assignment, you will build a neural network that takes two words and predicts the third. It should do this by looking up the input words in the embedding matrix, and feeding the result into a set of feed-forward layers. You can look up words in an embedding matrix with their ids using the function [tf.nn.embedding\_lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) or the [tf.keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.
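As a rough sketch only (the layer names and sizes below are placeholder choices, not requirements), the trigram model could look something like this:

```python
import tensorflow as tf

class TrigramModel(tf.keras.Model):
    def __init__(self, vocab_size, embed_size=128, hidden_size=256):
        super().__init__()
        # single embedding matrix shared by both input words
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        self.flatten = tf.keras.layers.Flatten()
        self.hidden = tf.keras.layers.Dense(hidden_size, activation="relu")
        self.out = tf.keras.layers.Dense(vocab_size)   # logits over the vocabulary

    def call(self, inputs):
        # inputs: (batch_size, 2) word ids
        embeds = self.embedding(inputs)     # (batch_size, 2, embed_size)
        flat = self.flatten(embeds)         # concatenate the two word embeddings
        return self.out(self.hidden(flat))  # (batch_size, vocab_size)
```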
#### Train and test
In the main function, you will want to preprocess your train and test data by splitting them into inputs and labels as appropriate. Initialize your model, train it for 1 epoch, and then test it. At the end, you can print out your model's perplexity and try printing out sentences with different starting words via the `generate_sentence` function of your model (already provided).
- Fill out the init function and define your trainable variables or layers.
- Fill out the call function using the trainable components in the init function.
- Compile your model as usual and feel free to play around with hyperparameters.
- For loss, please use **sparse categorical crossentropy**, so that your labels can stay as word ids instead of being one-hot encoded over the entire vocabulary every time.
- _Note that this function can take in logits or probabilities depending on how you set the `from_logits` parameter._
- For your accuracy metric, please use perplexity. This is a relatively standard quality metric for NLP tasks and is simply exponentiated cross-entropy. However, it isn't provided for you by default, so you'll have to implement it yourself (a rough sketch is given after this list).
- If you're using the layer analogue, you could actually subclass your loss function and make a minor change in the call function to make it all work…
- See the perplexity threshold in the grading section below. It shouldn't be too hard to hit.
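One possible way to implement perplexity, assuming your model returns logits, is to exponentiate the mean sparse categorical crossentropy of each batch. Keep in mind that when this is used through `model.compile(metrics=[...])`, Keras reports a running average of the per-batch values, which approximates (but isn't exactly) the exponentiated mean loss over the whole epoch.

```python
import tensorflow as tf

def perplexity(y_true, y_pred):
    # exponentiated sparse categorical crossentropy;
    # flip from_logits to match whatever your call function returns
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, y_pred, from_logits=True
    )
    return tf.exp(tf.reduce_mean(ce))

# e.g.
# model.compile(
#     optimizer="adam",
#     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
#     metrics=[perplexity],
# )
```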
### Step 3: RNN Language Model
For this part, you'll be able to formulate your problem as that of predicting an i-th word based off of words 0 through i-1. This is a perfect use-case for RNNs!
#### Train and test
This is exactly the same as trigram except for your data formulation. You will now be predicting the next word token for every single word token in the input. You will be able to achieve this by making the output an offset of the input!
- You'll still want to embed your inputs into a high-dimensional space and allow the representations to be trained up through the optimization loop, so you'll still want an embedding layer.
- Use [tf.keras.layers.GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) OR [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) for your RNN to predict the sequence of next words.
- You are free to also use `return_sequences=True` and `return_state=True` as necessary/desired. Note that the final state from an LSTM has two components: the last hidden state and the cell state. These are the second and third outputs of calling the RNN. If you use a GRU, the final state is the second output returned by the RNN layer.
- Fill out the call function as appropriate; otherwise, everything should be the same as the trigram model! (A rough sketch is given after this list.)
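As a rough sketch only (the GRU here is an arbitrary choice over LSTM, and the sizes are placeholders), the RNN model could look something like this:

```python
import tensorflow as tf

class RNNModel(tf.keras.Model):
    def __init__(self, vocab_size, embed_size=128, rnn_size=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        self.gru = tf.keras.layers.GRU(rnn_size,
                                       return_sequences=True,
                                       return_state=True)
        self.out = tf.keras.layers.Dense(vocab_size)   # logits at every position

    def call(self, inputs, initial_state=None):
        # inputs: (batch_size, window_size) word ids
        embeds = self.embedding(inputs)                     # (batch, window, embed_size)
        outputs, final_state = self.gru(embeds, initial_state=initial_state)
        return self.out(outputs)                            # (batch, window, vocab_size)
```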
#### Common Errors
- Make sure you end up using non-overlapping windows when you pass data into your RNN. This should be automatically handled by the fit batching system, but be wary of information bleeding if you make custom components.
- If you would like to use the leaky ReLU activation function at any point, you will have to use `tf.keras.layers.LeakyReLU()` rather than the string "leaky_relu", since TensorFlow 2.5 doesn't have it.
#### Trying Out Results
We've provided a `generate_sentences` function for this part of the assignment. You can pass your model to this function to see sample generations. If you see lots of _UNKs_, have no fear! These are expected and quite common in the data! You should, however, start to see at least some coherence forming in the outputs.
### Step 4: LSTM/GRU Manual
The RNN layers are surprisingly nuanced in their operations, but it is important to consider how these can actually be implemented when thinking about their pros and cons. There is a walk-through of roughly how to do it in the GRU and LSTM notebooks, and the process honestly isn't that bad after some initial setup which is already provided. Still, it does take a bit of time to understand what's going on since there's a notion of a recurrent kernel which has some interesting uses…
**[1470]:** Correctly implement the GRU component.
**[2470]:** Correctly implement both the LSTM and GRU components.
**UPDATE:** Please leave your code in `MyGRU` and `MyLSTM`. Make sure it stays in `custom_rnns`.
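For orientation only (not a drop-in solution), a single GRU step with the standard update/reset-gate equations looks roughly like the code below. The stencil instead packs these weights into a kernel and a recurrent kernel that you will need to slice, and its gate convention may be flipped relative to this sketch.

```python
import tensorflow as tf

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    # hypothetical per-gate weights; the stencil's kernel/recurrent-kernel
    # layout groups these differently
    z = tf.sigmoid(x_t @ Wz + h_prev @ Uz + bz)            # update gate
    r = tf.sigmoid(x_t @ Wr + h_prev @ Ur + br)            # reset gate
    h_tilde = tf.tanh(x_t @ Wh + (r * h_prev) @ Uh + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                # new hidden state
```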
## Grading
**Language Models**: You will be primarily graded on functionality. Your Trigram model should have a test perplexity < 120, and your RNN model should have a test perplexity < 100 to receive full credit (this applies to both CS1470 and CS2470 students).
**Conceptual** : You will be primarily graded on correctness, thoughtfulness, and clarity.
**LSTM/GRU Manual:** This will be graded by the autograder. Tests also in the notebook.
**README** : Your README should just contain your accuracy and any bugs you have.
### Autograder
Your Trigram model must complete training within 10 minutes, and the same goes for your RNN _(it can certainly take less)_. Our autograder will import your model via `get_text_model`, similar to previous assignments. The LSTM/GRU manual implementations will be pulled in from the GRU.ipynb or LSTM.ipynb notebook, and we will run tests similar to those seen in the notebook.
## Handing In
You should submit the assignment via Gradescope under the corresponding project assignment, either by zipping up your hw folder (the path on Gradescope MUST be hw4/code/filename.py) or through GitHub (**recommended**). To submit through GitHub, commit and push all changes to your repository. You can do this by running the following three commands ([this](https://github.com/git-guides/#how-to-use-git) is a good resource for learning more about them):
1. git add file1 file2 file3 (or -A)
2. git commit -m "commit message"
3. git push
After committing and pushing your changes to your repo (which you can check online if you're unsure if it worked), you can now just upload the repo to Gradescope! If you're testing out code on multiple branches, you have the option to pick whichever one you want.
**IMPORTANT!**
1. Please make sure all your files are in hw4/code. Otherwise, the autograder will fail!
2. **2470 STUDENTS** : Add a blank file named 2470student in the hw4/code directory!
The file should have no extension, and is used as a flag to grade 2470-specific requirements. If you don't do this, YOU WILL LOSE POINTS!