Conceptual questions due Monday, April 8th, 2024 at 6:00 PM EST
Programming assignment due Sunday, April 14th, 2024 at 6:00 PM EST
Answer the following questions, explaining your answers and showing your work where necessary.
We encourage the use of LaTeX to typeset your answers, as it makes things easier for you and for us, though you are not required to do so.
Do NOT include your name anywhere within this submission. Points will be deducted if you do so.
[Image, captioned: "99% of models fail to caption this correctly"]
The attention mechanism can be used with both RNNs and Transformers.
What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated (at a high level; refer to the lecture slides)? (2-4 sentences)
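For reference, the sinusoidal scheme from the original Transformer paper ("Attention Is All You Need", Vaswani et al., 2017) produces a positional encoding vector of the same size as the token embedding, where pos is the token position, i indexes the dimension, and d_model is the embedding size:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$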
Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? (2-4 sentences)
Hint: Check out this helpful article for an in-depth explanation of transformer blocks!
Suppose we are in the encoder block learning the embedding for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will be the final output, given the values of the queries, keys, and values for both words in the sentence? Show all of the calculation steps. At the softmax stage and after, use three decimal points. Remember: for this question, the final self-attention output ($z_1$ in the figure) is being calculated only for the word "Thinking" with respect to the sentence "Thinking Machine".
(Optional) For bonus points: calculate $z_2$, the final output for "Machine" with respect to the sentence.
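If it helps to check your arithmetic, below is a minimal sketch of the scaled dot-product self-attention computation for a two-token sentence. The query, key, and value vectors here are hypothetical placeholders, not the numbers from the figure; substitute the figure's values when you work the problem by hand.

```python
# Minimal scaled dot-product self-attention for a two-token sentence.
# The q/k/v vectors are made-up placeholders, NOT the figure's values.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 3                                    # assumed key/query dimension
q = np.array([[1.0, 0.0, 2.0],             # q1 ("Thinking"), q2 ("Machine")
              [0.0, 2.0, 0.0]])
k = np.array([[1.0, 1.0, 0.0],             # k1, k2
              [0.0, 1.0, 1.0]])
v = np.array([[0.5, 1.0, 0.0],             # v1, v2
              [1.0, 0.0, 0.5]])

scores = q[0] @ k.T / np.sqrt(d_k)         # "Thinking" scored against every token
weights = np.round(softmax(scores), 3)     # three decimal points, as the question asks
z1 = weights @ v                           # weighted sum of the value vectors
print("attention weights:", weights)
print("z1:", z1)
```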
(Optional) Now suppose there is a decoder block to translate the example sentence "Thinking Machine" above into "Machine à penser". How will it be different from the encoder block? Explain in 4-5 sentences.
(Optional) Have feedback for the homework? Found something confusing?
We’d love to hear from you!
The following are questions that are only required by students enrolled in CSCI2470.
What requires more parameters: single- or multi-headed attention? Explain. Does this mean one necessarily trains faster than the other?
Transformers can also be used for language modeling. In fact, they are the current state-of-the-art method (see https://openai.com/blog/better-language-models/). How are transformer-based language models similar to convolution? What makes them more suited for language modeling?
Hint: Think about the key vectors!
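As a point of reference, here is a minimal sketch of the causal (autoregressive) masking used in transformer language models, so that each position attends only to key vectors at earlier positions. The score matrix is random toy data, purely for illustration; it is not part of the assignment.

```python
# Causal masking sketch: position i may only attend to key vectors at
# positions <= i. The scores are random toy values for illustration only.
import numpy as np

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.standard_normal((seq_len, seq_len))     # toy attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)     # 1s strictly above the diagonal
scores = np.where(mask == 1, -np.inf, scores)        # hide future key vectors
scores -= scores.max(axis=-1, keepdims=True)         # numerically stable softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))                          # row i is zero beyond column i
```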
Read about BERT, a former state-of-the-art transformer-based language model, here: https://arxiv.org/pdf/1810.04805.pdf, and answer the following questions.
a) What did the researchers claim was novel about BERT? Why was this better than previous forms of language modeling techniques?
b) What was the masked language model objective? Describe this in 1-2 sentences.
c) Pretraining and finetuning are both forms of training a model. What is the difference between pretraining and finetuning, and how does BERT use both techniques?