---
tags: hw5, conceptual
---
# HW5 Conceptual: Language Modeling
:::info
Conceptual questions due **Thursday, October 30th, 2025 at 10:00 PM EST**
:::
Answer the following questions, showing your work where necessary and explaining your reasoning.
:::info
We encourage the use of $\LaTeX$ to typeset your answers, as it makes it easier for you and us, though you are not required to do so.
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Conceptual Questions
The following resource is an excellent way to see how transformers actually work and to build intuition for them. Give it a read if you are stuck on any of the parts below, or if you want to explore more great ML resources.
:::info
Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation on transformer blocks!
:::
1. The attention mechanism can be used for both RNNs and Transformers.
    * Is the statement *true* or *false*? Please explain/illustrate using lecture material. *(2-4 sentences)*
    * Is there an advantage of using one framework over the other? *(2-4 sentences)*
2. What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated at a high level (refer to the lecture slides)? *(2-4 sentences)*
3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*
4. Suppose we are in the encoder block learning the embedding $z$ for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention; see the figure below. What will be the final output $z$, given the values of the queries, keys, and values for both words in the sentence? Show all the calculation steps; at the softmax stage and after, use three decimal places. **Remember:** for this question, the self-attention final output ($z_1$ in the figure) is being calculated **only for the word "Thinking"** with respect to the sentence "Thinking Machine". A small sketch for checking this arithmetic is provided after Question 6.

5. Now suppose there is a decoder block that translates the example sentence "Thinking Machine" above into "*Machine à penser*". How will it be different from the encoder block? Explain in 2-3 sentences.
6. Suppose your language model is generating the next token in a caption, and the raw output logits for the top 5 candidate tokens are listed in the table below. (A small sampling sketch that you may use to check your arithmetic for parts 1-4 is provided after this question.)

    | Token      | Logit |
    | ---------- | ----- |
    | "cat"      | 4.0   |
    | "dog"      | 2.0   |
    | "animal"   | 1.0   |
    | "pet"      | 0.5   |
    | "creature" | 0.0   |

    1. Calculate the probability distribution after applying softmax with **temperature T = 1.0**. Round to 3 decimal places. (**you do not need to show your math work, but you do need to explain WHY!** ;)
    2. Now calculate the probability distribution with **temperature T = 0.5**. What effect does lowering the temperature have on the distribution? *(1-2 sentences)*
    3. Starting from the T = 1.0 distribution, apply **top-k sampling with k = 3**. What is the renormalized probability distribution over the remaining tokens?
    4. Starting again from the T = 1.0 distribution, apply **top-p (nucleus) sampling with p = 0.90**. Which tokens are kept, and what is their renormalized probability distribution?
    5. In 2-3 sentences, explain when you might prefer using top-p sampling over top-k sampling for caption generation.
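
If you would like to double-check your hand calculation for Question 4, the minimal NumPy sketch below computes scaled dot-product self-attention for a single query. It is not part of the assignment and is not required for your submission; the function name and the commented example numbers are placeholders, not the values from the figure.

```python
import numpy as np

def self_attention_output(q, K, V, d_k=None):
    """Scaled dot-product attention for one query vector.

    q   : query vector for the word we attend from (e.g. "Thinking")
    K   : matrix whose rows are the key vectors of every word in the sentence
    V   : matrix whose rows are the value vectors of every word in the sentence
    d_k : key dimension used for the 1/sqrt(d_k) scaling (defaults to K's width)
    """
    q, K, V = np.asarray(q, float), np.asarray(K, float), np.asarray(V, float)
    if d_k is None:
        d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)            # one scaled score per word in the sentence
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    return weights, weights @ V              # attention weights and z = sum_i w_i * v_i

# Hypothetical usage (made-up numbers, NOT the values from the figure):
# weights, z1 = self_attention_output(q=[1.0, 0.0], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]])
```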
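
For Question 6, the sketch below shows one way to compute a temperature-scaled softmax, top-k filtering, and top-p (nucleus) filtering, so you can check your hand-worked distributions. It is an illustrative sketch, not required for your submission; the function names and the commented example call are ours, and tie-breaking conventions may differ slightly from lecture.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over the logits after dividing them by temperature T."""
    z = np.asarray(logits, float) / T
    z = z - z.max()                          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_k_renormalize(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    probs = np.asarray(probs, float)
    keep = np.argsort(probs)[::-1][:k]       # indices of the k largest probabilities
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_renormalize(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then renormalize."""
    probs = np.asarray(probs, float)
    order = np.argsort(probs)[::-1]          # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    num_kept = int(np.searchsorted(cumulative, p)) + 1  # include the token that crosses p
    keep = order[:num_kept]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Hypothetical usage with the logits from the table above:
# probs = softmax_with_temperature([4.0, 2.0, 1.0, 0.5, 0.0], T=1.0)
# top_k_renormalize(probs, k=3)
# top_p_renormalize(probs, p=0.90)
```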