---
tags: hw4, conceptual
---
# HW4 Conceptual: Image Captioning
:::info
Conceptual questions due **Friday, March 21st, 2025 at 10:00 PM EDT**
:::
Answer the following questions, showing your work where necessary. Please explain your answers and work.
:::info
We encourage the use of $\LaTeX$ to typeset your answers, as it makes things easier for both you and us, though you are not required to do so.
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Conceptual Questions
1. The attention mechanism can be used with both RNNs and Transformers.
* Is this statement *true* or *false*? Please explain/illustrate using lecture material. *(2-4 sentences)*
* Is there an advantage to using one framework over the other? *(2-4 sentences)*
2. What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated (at a high level; refer to the lecture slides)? *(2-4 sentences)*
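:::info
**Hint:** If a concrete reference helps for question 2, below is a minimal NumPy sketch of the sinusoidal positional-encoding scheme from the original Transformer paper ("Attention Is All You Need"). The lecture slides may present a different or simplified variant, and the function name here is purely illustrative, so treat this as a reference rather than as the expected answer.
:::
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (illustrative helper, not starter code).

    Returns an array of shape (seq_len, d_model): one encoding vector per
    position, the same size as a token embedding so the two can be added.
    """
    positions = np.arange(seq_len)[:, np.newaxis]               # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                     # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                             # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

# Example: a 10-token sentence with 512-dimensional embeddings.
print(sinusoidal_positional_encoding(10, 512).shape)            # (10, 512)
```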
3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*
:::info
**Hint:** Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation of Transformer blocks!
:::
4. Suppose we are in the encoder block, learning the embedding $z$ for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will the final output $z$ be, given the values of the queries, keys, and values for both words in the sentence? Show all calculation steps. At the softmax stage and afterward, use three decimal places. **Remember:** For this question, the final self-attention output ($z_1$ in the figure) is being calculated **only for the word "Thinking"** with respect to the sentence "Thinking Machine".
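:::info
**Hint:** Assuming the scaled dot-product attention covered in lecture (with key dimension $d_k$), the output for "Thinking" in a two-word sentence takes the form
$$
\alpha_i = \frac{\exp\!\left(q_1 \cdot k_i / \sqrt{d_k}\right)}{\sum_{j=1}^{2}\exp\!\left(q_1 \cdot k_j / \sqrt{d_k}\right)}, \qquad z_1 = \alpha_1 v_1 + \alpha_2 v_2 .
$$
Plug in the query, key, and value vectors from the figure and show the numbers at each step (scores, scaling, softmax, weighted sum).
:::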

5. Now suppose there is a decoder block to translate the example sentence "Thinking Machine" above into "*Machine à penser*". How will it differ from the encoder block? Explain in 2-3 sentences.