---
tags: hw5, conceptual
---

# HW5 Conceptual: Language Modeling

:::info
Conceptual questions due **Thursday, October 30th, 2025 at 10:00 PM EST**.
:::

Answer the following questions, showing your work where necessary. Please explain your answers and your work.

:::info
We encourage the use of $\LaTeX$ to typeset your answers, as it makes things easier for you and for us, though you are not required to do so.
:::

:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::

## Conceptual Questions

The following resource is an excellent way to really see how transformers work. Give it a read if you are stuck on any of the parts below, or if you want to explore more good ML sources.

:::info
Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation of transformer blocks!
:::

1. The attention mechanism can be used for both RNNs and Transformers.
    * Is this statement *true* or *false*? Please explain/illustrate using lecture material. *(2-4 sentences)*
    * Is there an advantage to using one framework over the other? *(2-4 sentences)*

2. What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated at a high level (refer to the lecture slides)? *(2-4 sentences; an optional reference sketch appears at the end of this document)*

3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*

4. Suppose we are in the encoder block learning the embedding $z$ for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will be the final output $z$ given the query, key, and value vectors for both words in the sentence? Show all the calculation steps. At the softmax stage and after, use three decimal places. *(An optional code sketch of this computation appears at the end of this document.)*

    **Remember:** For this question, the self-attention final output ($z_1$ in the figure) is being calculated **only for the word "Thinking"** with respect to the sentence "Thinking Machine".

    ![Self-attention](https://hackmd.io/_uploads/rJxYIQnKa.png)

5. Now suppose there is a decoder block to translate the example sentence "Thinking Machine" above into "*Machine à penser*". How will it be different from the encoder block? Explain in 2-3 sentences.

6. Suppose your language model is generating the next token in a caption, and the raw output logits for the top 5 candidate tokens are as follows. *(An optional sketch of these sampling steps appears at the end of this document.)*

    | Token      | Logit |
    | ---------- | ----- |
    | "cat"      | 4.0   |
    | "dog"      | 2.0   |
    | "animal"   | 1.0   |
    | "pet"      | 0.5   |
    | "creature" | 0.0   |

    1. Calculate the probability distribution after applying softmax with **temperature T = 1.0**. Round to 3 decimal places. (**You do not need to show your math work, but you do need to explain why!**)
    2. Now calculate the probability distribution with **temperature T = 0.5**. What effect does lowering the temperature have on the distribution? *(1-2 sentences)*
    3. Starting from the T = 1.0 distribution, apply **top-k sampling with k = 3**. What is the renormalized probability distribution over the remaining tokens?
    4. Starting again from the T = 1.0 distribution, apply **top-p (nucleus) sampling with p = 0.90**. Which tokens are kept, and what is their renormalized probability distribution?
    5. In 2-3 sentences, explain when you might prefer top-p sampling over top-k sampling for caption generation.
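
## Optional Reference Sketches

The sketches below are for checking intuition only; they are not required and do not replace the written explanations asked for above.

For question 2, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper, one common way the encoding is computed. The function name and the `max_len` / `d_model` parameters are illustrative choices, not names taken from the lecture slides.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings.

    Each position gets a d_model-dimensional vector, the same size as a token
    embedding, so the two can simply be added together. Even dimensions use
    sine and odd dimensions use cosine, at geometrically spaced frequencies.
    """
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model / 2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # per-dimension frequencies
    angles = positions * angle_rates                        # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

# Example: encodings for a 10-token sequence with embedding size 16.
print(sinusoidal_positional_encoding(10, 16).shape)  # (10, 16)
```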
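
For question 4, here is a minimal sketch of single-head scaled dot-product attention, the computation you are asked to carry out by hand. The `Q`, `K`, and `V` values below are hypothetical placeholders, **not** the numbers from the figure; substitute the figure's values if you want to sanity-check your arithmetic.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # one score per (query, key) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                              # z vectors and attention weights

# Hypothetical 2-token example ("Thinking", "Machine"); NOT the figure's numbers.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])   # queries, one row per token
K = np.array([[1.0, 1.0], [0.0, 1.0]])   # keys
V = np.array([[0.5, 0.5], [1.0, 0.0]])   # values

z, attn = scaled_dot_product_attention(Q, K, V)
print(np.round(z[0], 3))     # z_1: output for the first token ("Thinking")
print(np.round(attn[0], 3))  # its softmax weights over both tokens
```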
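
For question 6, here is a sketch of the three transformations in the order the sub-questions apply them: temperature-scaled softmax, top-k filtering, and top-p (nucleus) filtering. The logits below are placeholders, not the values from the table; swap in the table's logits to verify your rounded answers.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; smaller T sharpens the distribution."""
    scaled = np.asarray(logits, dtype=float) / T
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p, then renormalize."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens kept
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Placeholder logits (NOT the homework table); substitute the real values to check your work.
logits = [3.0, 1.5, 0.5, 0.0, -1.0]
p1 = softmax_with_temperature(logits, T=1.0)
print(np.round(p1, 3))                                   # T = 1.0 distribution
print(np.round(softmax_with_temperature(logits, T=0.5), 3))  # sharper T = 0.5 distribution
print(np.round(top_k_filter(p1, k=3), 3))                # top-k, renormalized
print(np.round(top_p_filter(p1, p=0.90), 3))             # top-p (nucleus), renormalized
```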