---
tags: hw4, conceptual
---
# HW4 Conceptual: Image Captioning
:::info
Conceptual questions due **Friday, March 21st, 2025 at 10:00 PM EDT**
:::
Answer the following questions, showing your work where necessary. Please explain your answers and work.
:::info
We encourage the use of $\LaTeX$ to typeset your answers, as it makes things easier for both you and us, though you are not required to do so.
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Conceptual Questions
1. The attention mechanism can be used with both RNNs and Transformers.
* Is this statement *true* or *false*? Please explain/illustrate using lecture material. *(2-4 sentences)*
* Is there an advantage to using one framework over the other? *(2-4 sentences)*
2. What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated (at a high level; refer to the lecture slides)? *(2-4 sentences)*
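:::info
**Hint:** If a concrete reference helps for question 2, below is a minimal NumPy sketch of the sinusoidal positional-encoding scheme from the original Transformer paper ("Attention Is All You Need"). The lecture slides may present a different or simplified variant, and the function name here is purely illustrative, so treat this as a reference rather than as the expected answer.
:::
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (illustrative helper, not starter code).

    Returns an array of shape (seq_len, d_model): one encoding vector per
    position, the same size as a token embedding so the two can be added.
    """
    positions = np.arange(seq_len)[:, np.newaxis]               # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                     # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                             # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

# Example: a 10-token sentence with 512-dimensional embeddings.
print(sinusoidal_positional_encoding(10, 512).shape)            # (10, 512)
```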
3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*
:::info
**Hint:** Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation of Transformer blocks!
:::
4. Suppose we are in the encoder block, learning the embedding $z$ for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will the final output $z$ be, given the values of the queries, keys, and values for both words in the sentence? Show all calculation steps. At the softmax stage and afterward, use three decimal places. **Remember:** For this question, the final self-attention output ($z_1$ in the figure) is being calculated **only for the word "Thinking"** with respect to the sentence "Thinking Machine".
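:::info
**Hint:** Assuming the scaled dot-product attention covered in lecture (with key dimension $d_k$), the output for "Thinking" in a two-word sentence takes the form
$$
\alpha_i = \frac{\exp\!\left(q_1 \cdot k_i / \sqrt{d_k}\right)}{\sum_{j=1}^{2}\exp\!\left(q_1 \cdot k_j / \sqrt{d_k}\right)}, \qquad z_1 = \alpha_1 v_1 + \alpha_2 v_2 .
$$
Plug in the query, key, and value vectors from the figure and show the numbers at each step (scores, scaling, softmax, weighted sum).
:::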

5. Now suppose there is a decoder block to translate the example sentence "Thinking Machine" above into "*Machine à penser*". How will it differ from the encoder block? Explain in 2-3 sentences.