Conceptual questions due Monday, April 8th, 2024 at 6:00 PM EST
Programming assignment due Sunday, April 14th, 2024 at 6:00 PM EST
Answer the following questions, explaining your answers and showing your work where necessary.
We encourage the use of LaTeX to typeset your answers, as it makes things easier for you and for us, though you are not required to do so.
Do NOT include your name anywhere within this submission. Points will be deducted if you do so.
[Image, captioned: "99% of models fail to caption this correctly"]
The attention mechanism can be used with both RNNs and Transformers.
What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated (at a high level; refer to the lecture slides)? (2-4 sentences)
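For reference, the sinusoidal scheme from the original Transformer paper ("Attention Is All You Need", Vaswani et al., 2017) produces a positional encoding vector of the same size as the token embedding, where pos is the token position, i indexes the dimension, and d_model is the embedding size:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$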
Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? (2-4 sentences)
Hint: Check out this helpful article for an in-depth explanation of transformer blocks!
Suppose we are in the encoder block learning the embedding for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will be the final output, given the values of the queries, keys, and values for both words in the sentence? Show all of the calculation steps. At the softmax stage and after, use three decimal points. Remember: for this question, the final self-attention output ($z_1$ in the figure) is being calculated only for the word "Thinking" with respect to the sentence "Thinking Machine".
(Optional) For bonus points: calculate $z_2$, the final output for "Machine" with respect to the sentence.
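If it helps to check your arithmetic, below is a minimal sketch of the scaled dot-product self-attention computation for a two-token sentence. The query, key, and value vectors here are hypothetical placeholders, not the numbers from the figure; substitute the figure's values when you work the problem by hand.

```python
# Minimal scaled dot-product self-attention for a two-token sentence.
# The q/k/v vectors are made-up placeholders, NOT the figure's values.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 3                                    # assumed key/query dimension
q = np.array([[1.0, 0.0, 2.0],             # q1 ("Thinking"), q2 ("Machine")
              [0.0, 2.0, 0.0]])
k = np.array([[1.0, 1.0, 0.0],             # k1, k2
              [0.0, 1.0, 1.0]])
v = np.array([[0.5, 1.0, 0.0],             # v1, v2
              [1.0, 0.0, 0.5]])

scores = q[0] @ k.T / np.sqrt(d_k)         # "Thinking" scored against every token
weights = np.round(softmax(scores), 3)     # three decimal points, as the question asks
z1 = weights @ v                           # weighted sum of the value vectors
print("attention weights:", weights)
print("z1:", z1)
```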
(Optional) Now suppose there is a decoder block to translate the example sentence "Thinking Machine" above into "Machine à penser". How will it be different from the encoder block? Explain in 4-5 sentences.
(Optional) Have feedback for the homework? Found something confusing?
We’d love to hear from you!
The following are questions that are only required by students enrolled in CSCI2470.
What requires more parameters: single- or multi-headed attention? Explain. Does this mean one necessarily trains faster than the other?
Transformers can also be used for language modeling. In fact, they are the current state-of-the-art method (see https://openai.com/blog/better-language-models/). How are transformer-based language models similar to convolution? What makes them more suited for language modeling?
Hint: Think about the key vectors!
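As a point of reference, here is a minimal sketch of the causal (autoregressive) masking used in transformer language models, so that each position attends only to key vectors at earlier positions. The score matrix is random toy data, purely for illustration; it is not part of the assignment.

```python
# Causal masking sketch: position i may only attend to key vectors at
# positions <= i. The scores are random toy values for illustration only.
import numpy as np

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.standard_normal((seq_len, seq_len))     # toy attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)     # 1s strictly above the diagonal
scores = np.where(mask == 1, -np.inf, scores)        # hide future key vectors
scores -= scores.max(axis=-1, keepdims=True)         # numerically stable softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))                          # row i is zero beyond column i
```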
Read about BERT, a former state-of-the-art transformer-based language model, here: https://arxiv.org/pdf/1810.04805.pdf, and answer the following questions.
a) What did the researchers claim was novel about BERT? Why was this better than previous forms of language modeling techniques?
b) What was the masked language model objective? Describe this in 1-2 sentences.
c) Pretraining and finetuning are both forms of training a model. What is the difference between pretraining and finetuning, and how does BERT use both techniques?