---
tags: hw5, conceptual
---
# HW5 Conceptual: Image Captioning
Due **Monday, April 10th** at 6PM EST
Answer the following questions, explaining your reasoning and showing your work where necessary.
:::info
We encourage the use of $\LaTeX$ to typeset your answers, as it makes it easier for you and us, though you are not required to do so.
:::
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::

*99% of algorithms fail to caption this correctly*
## Conceptual Questions
1. What are some differences between RNNs and Transformers? What are the limitations of RNNs that Transformers solve? *(4-5 sentences)*
2. What is the purpose of the positional encoding in the Transformer architecture? What is the size of a positional encoding vector? *(2-4 sentences)*
3. Consider the parameters for two different attention heads. Is it necessary that they be initialized randomly, or could we just start them all with the same vector of values? *(2-4 sentences)*
4. Let’s say we have the sentence: “Deep Learning is awesome”. In a transformer encoder that uses self-attention, is the attention that the word “Learning” pays to the word “awesome” the same as the attention the word “awesome” pays to the word “Learning”? Why or why not? *(3-5 sentences)*
5. The transformer architecture comprises both encoders and decoders. What are the similarities and differences between the encoder and decoder architectures, and why are they important? *(2-4 sentences)*
:::info
**Hint:** Check out [this helpful article](http://jalammar.github.io/illustrated-transformer/) for an in-depth explanation on transformer blocks!
:::
6. (Optional) Have feedback for the homework? Found something confusing?
We’d love to hear from you!
## Ethical Implications

***Deep Learning at Scale***
As deep learning models are scaled up to serve more users and deployed across many domains (as in this homework), it is worth considering how we can limit the risks these models pose. The following paper examines how unintended consequences can arise from the design of machine learning systems. The authors focus on reinforcement learning, a paradigm that involves real-time interaction with the world. For the sake of this class, however, we can generalize their discussion to deep learning (not to mention that a combination of the two, called deep reinforcement learning, has become popular recently). Read this [paper](https://arxiv.org/pdf/1606.06565.pdf) from the beginning to the end of section two—pay close attention to the descriptions of the failure modes they discuss:
1. Avoiding Negative Side Effects
2. Avoiding Reward Hacking
3. Scalable Oversight
4. Safe Exploration
5. Robustness to Distributional Shift
The authors of the paper use a cleaning robot as a running example to illustrate how each of these failure modes can become problematic.
1. Pick an application of deep learning different from the one that Amodei et al. discuss. You can choose one from this [article](https://bernardmarr.com/what-is-deep-learning-ai-a-simple-guide-with-8-practical-examples/) or elsewhere. Explain how 2-3 of the failure modes can manifest in that application. Be sure to provide separate explanations for each failure mode that you select. *(5-7 sentences)*
***Artificial General Intelligence***
You may have noticed that image captioning required using two separate models. Using ResNet-50, feature vectors were generated from images in the Flickr 8k dataset. This step encoded the images. Using either RNN or Transformer models, captions were generated from the encoded images. This is an example of combining two separate models to perform a task that neither model could achieve on its own. The composition of deep learning models specialized for one task allows for novel applications to many more tasks. Skim this [article](https://dataconomy.com/2022/06/artificial-narrow-intelligence/) to learn about Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI).
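The two-model pipeline described above can be sketched schematically. The encoder and decoder below are hypothetical stand-ins (fixed random weights, a toy vocabulary), not ResNet-50 or a trained caption model; the sketch only illustrates how an image encoder and a caption decoder compose into one captioning system.

```python
import numpy as np

VOCAB = ["<start>", "a", "dog", "runs", "<end>"]  # toy vocabulary for the sketch

def encode_image(image: np.ndarray, feature_dim: int = 8) -> np.ndarray:
    """Stand-in for ResNet-50: project a flattened image to a feature vector."""
    rng = np.random.default_rng(42)            # fixed weights, illustrative only
    W = rng.standard_normal((image.size, feature_dim))
    return image.reshape(-1) @ W

def decode_caption(features: np.ndarray, max_len: int = 4) -> list:
    """Stand-in for the RNN/Transformer decoder: greedily emit the
    highest-scoring token each step, conditioned on the image features
    and the previous token, stopping at <end> or max_len."""
    rng = np.random.default_rng(7)
    W_out = rng.standard_normal((features.size + 1, len(VOCAB)))
    caption, prev = [], 0                      # start from <start> (index 0)
    for _ in range(max_len):
        logits = np.concatenate([features, [prev]]) @ W_out
        prev = int(np.argmax(logits))
        if VOCAB[prev] == "<end>":
            break
        caption.append(VOCAB[prev])
    return caption

image = np.ones((4, 4))                        # dummy 4x4 "image"
caption = decode_caption(encode_image(image))  # composition: decoder(encoder(x))
print(caption)
```

Note the key design point: neither function alone captions an image; the captioning behavior emerges only from the composition `decode_caption(encode_image(image))`, mirroring how the assignment chains a pretrained vision model into a language decoder.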
The clear scientific consensus is that we have not yet created a generally intelligent machine, and are likely decades, if not centuries, away. However, you've seen the power of combining deep learning models in this assignment.
2. Read this [piece](https://www.scientificamerican.com/article/artificial-general-intelligence-is-not-as-imminent-as-you-might-think1/) from *Scientific American*, which touches on the pitfalls of current state-of-the-art models. Do you agree that something more generally intelligent is so far off? Can we get something close to general intelligence just by composing the narrow models we have today? You can attack this from any angle you'd like—technical, philosophical, or ethical. Feel free to use ideas discussed in lecture, lab, or previous SRC content. *(6-8 sentences)*
## 2470-only Questions
The following questions are required only for students enrolled in CSCI2470.
1. What requires more parameters: single-headed or multi-headed attention? Explain. Does this mean one necessarily trains faster than the other?
2. Transformers can also be used for language modeling; in fact, they are the current state-of-the-art method (see [this OpenAI post](https://openai.com/blog/better-language-models/)). How are transformer-based language models similar to convolutions? What makes them more suited to language modeling?
:::info
**Hint:** Think about the key vectors!
:::
3. Read about BERT, a state-of-the-art transformer-based language model, in [this paper](https://arxiv.org/pdf/1810.04805.pdf), and answer the following questions.
a) What do the researchers claim is novel about BERT? Why is this better than previous language modeling techniques?
b) What is the masked language model objective? Describe this in 1-2 sentences.
c) Pretraining and finetuning are both forms of training a model. What is the difference between pretraining and finetuning, and how does BERT use both techniques?