HW4 Conceptual: Language Models

Conceptual questions due Monday, March 18th, 2024 at 6:00 PM EST
Programming assignment due Friday, March 22nd, 2024 at 6:00 PM EST

Answer the following questions, showing your work where necessary. Please explain your answers and work.

Please typeset your answers. We recommend the use of LaTeX, as it makes it easier for you and us.

Do NOT include your name anywhere within this submission. Points will be deducted if you do so.

DO assign pages on Gradescope. Points will be deducted if you do not assign your pages.

Theme

When your LSTM remembers less than this guy

What are the dimensions of an embedding matrix? What do they represent?
Given the following sentences (any relation to any real words is purely coincidental), plot reasonable embeddings in 2D for “Ritambhara”, “went”, “found”, “cute”, and “beautiful”. (Hint: A simple graph with some clusters is fine.)

Ritambhara went scuba diving. 
Then Ritambhara found a beautiful seashell.
A cute narwhal was next to the seashell.

What are some benefits of using RNNs over trigrams (or n-grams, generally speaking)? Illustrate with an example.
Why are LSTMs able to ‘remember’ information for longer timeframes than vanilla RNNs? Use the "dog" example from class to illustrate this point.
(Optional) Have feedback for this assignment? Found something confusing? We’d love to hear from you!
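
For reference on the n-gram question above, here is a minimal sketch of a count-based trigram model trained on the three example sentences. The function names (train_trigrams, next_word_probs), the lowercasing, and the "<s>"/"</s>" padding tokens are illustrative assumptions, not part of the course code.

```python
from collections import Counter, defaultdict

# The three example sentences from this handout, tokenized on whitespace.
corpus = [
    "Ritambhara went scuba diving .",
    "Then Ritambhara found a beautiful seashell .",
    "A cute narwhal was next to the seashell .",
]


def train_trigrams(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Assumption: lowercase everything and pad with start/end markers.
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    return counts


def next_word_probs(counts, context):
    """Maximum-likelihood next-word distribution for a two-word context."""
    following = counts.get(context)
    if not following:
        return {}  # context never observed: the model has no estimate
    total = sum(following.values())
    return {word: n / total for word, n in following.items()}


counts = train_trigrams(corpus)
seen = next_word_probs(counts, ("a", "beautiful"))
unseen = next_word_probs(counts, ("cute", "seashell"))
print(seen)
print(unseen)
```

The model only ever conditions on the two preceding words, and it returns an empty distribution for any two-word context that did not occur in the training sentences.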

The Gated Recurrent Unit (GRU) is another recurrent network cell that can, like the LSTM, retain information over long sequences. How is it able to do this? Draw the architectures (roughly) of the LSTM and GRU cells, and compare them. Also write the benefits and drawbacks of each (4-6 sentences).
While we have studied Convolutional Neural Networks (CNNs) in the context of 2D images, CNNs can also be used for 1D sequence modeling tasks, such as language modeling. Look up some papers that have attempted this, and that compare CNN language models to RNN language models (cite which papers you read). What appears to be the general consensus on the pros and cons of the two approaches?
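
As background for the CNN question above, here is a minimal sketch of a causal 1D convolution over a sequence of scalars, written in plain Python. The function name causal_conv1d and the left zero-padding scheme are illustrative assumptions; real CNN language models convolve over embedding vectors and stack many such layers.

```python
def causal_conv1d(sequence, kernel):
    """Slide `kernel` over `sequence` so output[t] depends only on inputs <= t."""
    k = len(kernel)
    # Left-pad with k - 1 zeros so the output has the same length as the input
    # and no output position can see a future input.
    padded = [0.0] * (k - 1) + list(sequence)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(sequence))
    ]


outputs = causal_conv1d([1, 2, 3, 4], [1, 1])
print(outputs)
```

With a kernel of width k, each output position sees exactly the current input and the k - 1 inputs before it, and every position can be computed independently of the others.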