---
tags: hw5, conceptual
---
# HW5 Conceptual: Image Captioning
:::info
Conceptual questions due **Monday, April 8th, 2024 at 6:00 PM EDT**
Programming assignment due **Sunday, April 14th, 2024 at 6:00 PM EDT**
:::
Answer the following questions, showing your work where necessary and explaining your answers.
:::info
We encourage you to typeset your answers in $\LaTeX$, as it makes things easier for both you and us, though you are not required to do so.
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::
## Theme
![](https://i.imgur.com/RC7Wu8U.png)
*99% of models fail to caption this correctly*
## Conceptual Questions
1. The attention mechanism can be used for both RNNs and Transformers.
* Is this statement *true* or *false*? Please explain/illustrate using lecture material. *(4-5 sentences)*
* Is there an advantage to using one framework over the other? *(4-5 sentences)*
2. What is the purpose of positional encoding in the Transformer architecture? What is the size of a positional encoding vector, and how is it calculated at a high level (refer to the lecture slides)? *(2-4 sentences)*
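:::info
**Hint:** As a refresher, the sinusoidal scheme from lecture assigns each position a vector of sines and cosines at different frequencies. Below is a minimal NumPy sketch of that scheme; the dimensions `max_len = 4` and `d_model = 8` are arbitrary examples, not values from this assignment:
```python
import numpy as np

def positional_encoding(max_len, d_model):
    """One d_model-sized sinusoidal vector per position."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1) positions
    i = np.arange(d_model)[None, :]        # (1, d_model) dimension indices
    # Each pair of dimensions shares a frequency: 10000^(2i/d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

print(positional_encoding(4, 8).shape)  # (4, 8): one vector per position
```
:::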
3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*
:::info
**Hint:** Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation on transformer blocks!
:::
4. Suppose we are in the encoder block learning the embedding $z$ for the word "Thinking" (in the input sentence "Thinking Machine") after applying self-attention. See the figure below. What will be the final output $z$ given the values of the queries, keys, and values for both words in the sentence? Show all calculation steps. At the softmax stage and after, round to three decimal places. **Remember:** For this question, the self-attention final output ($z_1$ in the figure) is calculated **only for the word "Thinking"** with respect to the sentence "Thinking Machine".
![Self-attention](https://hackmd.io/_uploads/rJxYIQnKa.png)
(Optional) **For bonus points:** Calculate $z_2$, the final output for "Machine" with respect to the same sentence.
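:::info
**Hint:** The steps are: dot the query with every key, scale by $\sqrt{d_k}$, softmax the scores, then take the weighted sum of the value vectors. The sketch below runs those steps on made-up 3-dimensional vectors; the $q$/$k$/$v$ values here are hypothetical, **not** the ones in the figure, so substitute the figure's numbers in your own work:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical vectors for illustration only -- NOT the figure's values.
q1 = np.array([1.0, 0.0, 1.0])      # query for "Thinking"
K = np.array([[1.0, 1.0, 0.0],      # key for "Thinking"
              [0.0, 1.0, 1.0]])     # key for "Machine"
V = np.array([[0.5, 0.5, 0.5],      # value for "Thinking"
              [1.0, 0.0, 1.0]])     # value for "Machine"

d_k = K.shape[1]
scores = K @ q1 / np.sqrt(d_k)  # one score per word in the sentence
weights = softmax(scores)       # round to three decimals in your write-up
z1 = weights @ V                # weighted sum of the value vectors
print(np.round(weights, 3), np.round(z1, 3))
```
:::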
5. (Optional) Now suppose there is a decoder block that translates the example sentence "Thinking Machine" above into "*Machine à penser*". How will it differ from the encoder block? Explain in 4-5 sentences.
6. (Optional) Have feedback on the homework? Found something confusing?
We’d love to hear from you!
## 2470-only Questions
The following questions are required only of students enrolled in CSCI2470.
1. Which requires more parameters: single-headed or multi-headed attention? Explain. Does this mean one necessarily trains faster than the other?
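:::info
**Hint:** It may help to count projection-matrix entries directly. The sketch below sets up that count under the standard convention from lecture (each head projects down to `d_model // num_heads`); the dimensions `512` and `8` are arbitrary examples:
```python
# Arbitrary example dimensions -- the convention, not these numbers, matters.
d_model, num_heads = 512, 8
d_head = d_model // num_heads

single_qkv = 3 * d_model * d_model            # one full-width Q/K/V projection
multi_qkv = num_heads * 3 * d_model * d_head  # num_heads narrower projections
w_o = d_model * d_model                       # multi-head output projection W_O

print(single_qkv, multi_qkv, multi_qkv + w_o)
```
:::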
2. Transformers can also be used for language modeling; in fact, they are the current state-of-the-art method (see https://openai.com/blog/better-language-models/). How are transformer-based language models similar to convolution? What makes them better suited for language modeling?
:::info
**Hint:** Think about the key vectors!
:::
3. Read about BERT, a former state-of-the-art transformer-based language model, here: https://arxiv.org/pdf/1810.04805.pdf, and answer the following questions.
a) What did the researchers claim was novel about BERT? Why was this better than previous language modeling techniques?
b) What was the masked language model objective? Describe this in 1-2 sentences.
c) Pretraining and finetuning are both forms of training a model. What’s the difference between pretraining and finetuning, and how does BERT use both techniques?
<!-- ## Ethical Implications
![](https://i.imgur.com/vVsTexp.jpg)
***Deep Learning at Scale***
As deep learning models are scaled up to serve more users and deployed in several domains (as shown in this homework), it is worth considering how we can limit the risk these models pose. The following paper examines how unintended consequences can arise from the design of machine learning systems. The authors focus on reinforcement learning, which is a paradigm that involves real-time interactions with the world. For the sake of this class, however, we can generalize this to deep learning (not to mention that a combination of the two, called deep reinforcement learning, has become popular recently). Read this [paper](https://arxiv.org/pdf/1606.06565.pdf) from the beginning to the end of section two—pay close attention to the descriptions of the failure modes they discuss:
1. Avoiding Negative Side Effects
2. Avoiding Reward Hacking
3. Scalable Oversight
4. Safe Exploration
5. Robustness to Distributional Shift
The authors of the paper used a cleaning robot to show how these possible failure modes were problematic.
1. Pick an application of deep learning different than the one that Amodei et al. discuss. You can choose one from this [article](https://bernardmarr.com/what-is-deep-learning-ai-a-simple-guide-with-8-practical-examples/) or elsewhere. Explain how 2-3 failure modes can manifest. Be sure to provide separate explanations for each failure mode that you select. (5-7 sentences)
***Artificial General Intelligence***
You may have noticed that image captioning required using two separate models. Using ResNet-50, feature vectors were generated from images in the Flickr 8k dataset. This step encoded the images. Using either RNN or Transformer models, captions were generated from the encoded images. This is an example of combining two separate models to perform a task that neither model could achieve on its own. The composition of deep learning models specialized for one task allows for novel applications to many more tasks. Skim this [article](https://dataconomy.com/2022/06/artificial-narrow-intelligence/) to learn about Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI).
The clear scientific consensus is that we have not created a generally intelligent machine just yet, and are likely decades or centuries away. However, you've seen the power of combining deep learning models in this assignment.
2. Read this [piece](https://www.scientificamerican.com/article/artificial-general-intelligence-is-not-as-imminent-as-you-might-think1/) from the *Scientific American* which touches on the pitfalls of current state-of-the-art models. Do you agree that something more generally intelligent is so far off? Can we get something close to general intelligence just by composing the narrow models we have today? You can attack this from any angle you'd like—technical, philosophical, or ethical. Feel free to use ideas discussed in lecture, lab, or previous SRC content. (6-8 sentences)
-->