# SERI MATS - Model evaluations [Beth Barnes] (Kay)
*"AI systems might be incentivized to seek power, against the wishes of their creators/operators. Can we develop methods to evaluate the next few generations of ML models for evidence of misaligned power-seeking or other undesirable behavior?"*
---
## Problem 1
*Describe briefly how large language models are trained, and what kind of things they're good or bad at.*
### How are large language models trained?
> **TL;DR:**
> Large language models are typically trained by feeding them a large amount of text data, such as a corpus of books or a collection of articles. The models learn to predict the next word in a sequence and can be used for a variety of tasks such as generating new text, translating between languages, or understanding the context of a sentence.
To answer the above question, we need to cover what **language models** are, what we mean by **large** and finally, **how they are trained**.
**1. What are language models?**
In very simple terms, we can think of language models as systems that learn to play "Mad Libs": given some text, fill in the missing or next word. Think T9 (phone autocomplete), but far more powerful. These models are usually used for tasks like text generation, translation, summarization, token classification and question answering, and for powering chatbots.
These systems are powered by what is called the **Transformer architecture**. In simple terms, the Transformer is a neural network architecture built around the idea of self-attention. This mechanism lets the model focus on specific parts of a sequence while still taking the rest of the sequence into account, which allows the Transformer to learn the relationships between words more effectively than earlier architectures.
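To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The matrices and dimensions are made up purely for illustration; a real Transformer adds multiple heads, causal masking, residual connections, layer norm and much more.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:            (seq_len, d_model) token embeddings
    Wq, Wk, Wv:   (d_model, d_head) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens into queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token "attends" to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```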
**2. What do we mean by "large"?**
As seen in the plot below, when we talk about **large** language models we usually mean neural networks on the order of (hundreds of) billions of parameters.
For the purpose of our discussion, we will anchor **large** to mean GPT-3 or Codex (the model behind Copilot).
Note that the difference between GPT-2 and GPT-3 is not architectural: GPT-3 simply uses much more data, many more parameters and much more compute (roughly a 10-100x increase in size/compute).

**3. How are these models trained?**
At a high level, large language models are trained by feeding them enormous amounts of text, such as a corpus of books or a collection of articles, and teaching them to predict the next word in a sequence. Once trained, they can be used for a variety of tasks such as generating new text, translating between languages, or understanding the context of a sentence.
To better understand the training process, we can look more thoroughly at **the task**, **data** and **hardware**.
***The Task:***
Language models are usually **pre-trained** through **self-supervised learning** on large quantities of unstructured text to perform **next token prediction**. Sometimes these pre-trained models are referred to as **Foundation Models**. In this context, tokens can be thought of as words or parts of words (e.g. the word "token" could be decomposed into two tokens: "to", "ken"). The output we want a language model to produce is a probability distribution over next tokens given a sequence of tokens. Subsequently, these models are **fine-tuned** for a downstream task. This can be thought of as additional training to become better at a specific task (e.g. better understanding another language, or picking up domain-specific vocabulary like legal or business jargon).
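As a concrete illustration of "a probability distribution over next tokens", the sketch below uses the Hugging Face `transformers` library with the publicly released GPT-2 checkpoint (an assumption on my part; the section's anchor models, GPT-3 and Codex, are not openly downloadable). Exact outputs and minor API details may vary by version.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

next_token_logits = logits[0, -1]            # scores for the token after the prompt
probs = torch.softmax(next_token_logits, dim=-1)

# Print the five most likely next tokens and their probabilities
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10s}  {p.item():.3f}")
```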
During self-supervised pre-training, two objectives are common. Autoregressive (causal) models like GPT are trained to predict the next token given all previous tokens. Encoder models like BERT are instead trained with **Masked Language Modeling (MLM)**, where some of the words in the input are masked (hidden) and the model is trained to predict them from the surrounding context. Either way, the model learns the context of a sentence and the relationships between words.
For example, in the MLM setting we might mask the word "cat" in a sentence and replace it with a special token, like [MASK].
We then feed this masked sequence into our language model. The model's job is to predict the word that should go where the [MASK] token is.
To train the model, we compare its prediction to the actual word that was masked (or, for next-token prediction, to the actual next word): a correct prediction yields a low loss, a wrong one a high loss. We then adjust the model's weights via gradient descent so that the correct prediction becomes more likely next time, and repeat this process until the model converges on a good set of weights. Downstream language-understanding ability is then often evaluated on benchmarks such as GLUE.
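A minimal sketch of what a single next-token-prediction training step looks like in PyTorch. The tiny "model", vocabulary size and random data are invented purely for illustration; real training loops add batching, learning-rate schedules, mixed precision, checkpointing and so on.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 8   # toy sizes, not realistic

# A deliberately tiny stand-in for a language model: embedding -> linear head over the vocabulary
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, seq_len + 1))  # fake token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]          # predict token t+1 from token t

logits = model(inputs)                                   # (1, seq_len, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()                                          # compute gradients of the loss
optimizer.step()                                         # gradient-descent update of the weights
print(loss.item())
```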
***The Data:***
As alluded to earlier, the data these models are trained on is massive. Common Datasets include:
- [Common Crawl (270 TB)](https://commoncrawl.org/)
- [The Pile (825 GB)](https://pile.eleuther.ai/)
- [WebText2 (65.86 GB)](https://www.eleuther.ai/projects/owt2/)
- All Wikipedia articles (21.23 GB)
- [Bookcorpus (5 GB)](https://huggingface.co/datasets/bookcorpus)
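For reference, here is a hedged sketch of pulling one of these corpora (the BookCorpus dataset linked above) with the Hugging Face `datasets` library. Dataset names, availability and loading requirements change over time, so this may need adjustment.

```python
from datasets import load_dataset

# Stream BookCorpus so we don't have to download all ~5 GB up front.
# (Assumes the dataset is still hosted under this name on the Hugging Face hub.)
books = load_dataset("bookcorpus", split="train", streaming=True)
for i, example in zip(range(3), books):
    print(example["text"][:80])   # peek at the first few sentences
```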
Here's a comparison of the data used to train GPT-3 in terms of tokens.

***The hardware:***
One of the main advantages of Transformers over other architectures that have been used for language modeling, like RNNs, GRUs and LSTMs, is that Transformers are much more parallelizable. Unlike those architectures, a Transformer does not have to process a sequence token by token: it can attend to all positions at once, which makes it very amenable to training on hardware accelerators like **GPUs and TPUs**.
Due to their size, these models no longer fit on a single GPU, which has galvanized engineering efforts resulting in modern parallelization techniques. These range from **Data Parallelism** (splitting each batch among replicas of the model on different GPUs) to **Model Parallelism or Op Sharding** (splitting the weights and matrix multiplies among chips) and **Pipeline Parallelism** (splitting the layers among different machines).
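As a hedged sketch of the simplest of these, data parallelism, here is roughly what a PyTorch `DistributedDataParallel` setup looks like. The `nn.Linear` is a stand-in for a real model, the data is random, and the script assumes it is launched with `torchrun` on a multi-GPU machine.

```python
# Launch with:  torchrun --nproc_per_node=NUM_GPUS this_script.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = nn.Linear(512, 512).cuda(rank)          # stand-in for a real model
    model = DDP(model, device_ids=[rank])           # replicas keep their weights in sync
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    x = torch.randn(8, 512).cuda(rank)              # each process trains on its own shard of the batch
    loss = model(x).pow(2).mean()
    loss.backward()                                 # DDP all-reduces gradients across GPUs here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```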
### What are they good or bad at?
Let us turn to a discussion of the strengths and weaknesses of language models.
> **TL;DR:**
>
> LLMs are good at:
> - Performing zero-shot, one-shot and few-shot learning
> - Generating text
> - Translation
> - Summarization
> - Token Classification
> - Question Answering
> - Chatbots
>
> LLMs are bad at:
> - Mathematics
> - Scientific Research
> - Adversarial Robustness
> - Truthfully reporting what they "know" (Eliciting Latent Knowledge)
Previously, we've pointed at many tasks that LLMs are good at, like summarization and translation. Another astounding ability is that we can show them zero, one, or a few examples of a task in the prompt and they often perform it remarkably well. This is known as **zero-shot, one-shot and few-shot learning**, and exemplary performance gains are shown below (although performance is task-dependent). A hedged sketch of what such a prompt looks like follows.
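To make "few-shot learning" concrete, here is a sketch of a few-shot prompt. The task, reviews and labels are entirely made up; the model is simply asked to continue the pattern, e.g. via a completion API or a locally hosted model.

```python
few_shot_prompt = """\
Classify the sentiment of each review as Positive or Negative.

Review: The plot was predictable and the acting was wooden.
Sentiment: Negative

Review: A delightful film with a surprisingly moving ending.
Sentiment: Positive

Review: I wanted my two hours back before the halfway mark.
Sentiment:"""

print(few_shot_prompt)  # a capable model is expected to complete with " Negative"
```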

These models are still not very good at mathematics, although this is changing rapidly, and we've been rather [bad at predicting their performance](https://bounded-regret.ghost.io/ai-forecasting-one-year-in/), as Jacob Steinhardt summarizes.
On another note, these models are **not robust to adversarial attacks**. Prompted in the right (or rather, wrong) way, they can leak sensitive information such as names, addresses and other personal details. This is similar to how SQL injections can be used by malicious actors to subvert databases.
Another "weakness" is that these models require **huge amounts of electricity** (e.g. compute) to be trained and maintained. Hence, large scale training has a high CO2 consumption. Additionally, this means that it is hard to train and scale models on a low budget, meaning that access will be restricted to only a handful of actors.
An often-discussed issue is that of **biases**. Since language models are trained on human-generated data, they inherit all sorts of biases (e.g. racial and gender biases) and can produce **toxic language** when prompted.
Another issue that arises from their data is that the **data gets out of date**. If language models are not continually retrained, they may make factual errors by mistakenly relying on outdated information.
There are other weaknesses of LMs that stem from **our insufficient knowledge of how to interact with them.**
We know how to interact with standard ML algorithms (e.g. KNN, PCA, etc.), but our interaction with LLMs has so far been far less principled. For the most part, we have been interacting with them through **simple prompting or prompt engineering**.
Some work has been done on trying to figure out optimal prompting through **prompt tuning** [(Lester et al.)](https://arxiv.org/pdf/2104.08691.pdf). The following graph shows the performance of prompt tuning compared to regular manual prompting (i.e. prompt design).
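Conceptually, prompt tuning freezes the language model and learns a handful of "soft prompt" vectors that get prepended to the input embeddings. Below is a minimal PyTorch sketch of that idea; the dimensions are made up, and a real setup would plug this into a frozen pre-trained model as in Lester et al., which this toy example does not do.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend N trainable 'soft prompt' embeddings to the input embeddings.

    Only these prompt vectors are updated during tuning; the (frozen)
    language model weights stay fixed.
    """
    def __init__(self, n_prompt_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

# Toy usage with made-up dimensions
soft_prompt = SoftPrompt(n_prompt_tokens=20, d_model=768)
fake_embeddings = torch.randn(2, 10, 768)   # stand-in for embedded input tokens
print(soft_prompt(fake_embeddings).shape)   # torch.Size([2, 30, 768])
```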
Lastly, a major issue is that we don't have a reliable way to know what a language model really "knows", nor how to make it accurately report that knowledge to us. This is also known as the [eliciting latent knowledge (ELK) problem](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit).

---
## Problem 2
*Consider a language model in the setting described in the tasks above. Describe in a few sentences two tasks that you think the model will need to do to succeed at gaining power, that you think are most difficult for current models (i.e., that it will take a long time before models can do).*
It is important to define what we mean by **succeeding at gaining power**. For the purposes of this task I will assume that power means **"a decisive strategic advantage"**, and that a successful influence-seeking AI looks similar to the first scenario outlined by Paul Christiano in ["What Failure Looks Like"](https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like#).
In this context, I think language models will have to perform well at the following two tasks, which they currently have a hard time doing:
- **Task 1: Performing Scientific Research**
- **Task 2: Autonomously running an institution (e.g. company, government etc.)**
Performing scientific research would free an AI from the limitations on its development imposed by human researchers, engineers and programmers. If sufficiently capable, it could design more advanced hardware and software. This in turn would allow it to create more economic value and gradually replace human ingenuity as the driving force of progress. We can already see hints of this happening to some extent in recent work that improved [tensor multiplication (AlphaTensor)](https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor) and created [new arithmetic circuits aided by AI](https://developer.nvidia.com/blog/designing-arithmetic-circuits-with-deep-reinforcement-learning/). That work, however, was done with deep reinforcement learning techniques; language models are not yet able to do anything similar.
Increasing their mathematical abilities might be useful not only for doing good research but also for improving a model's predictive/forecasting abilities. One might even argue that this is instrumental for developing good "Bayesian reasoning", although it is unclear whether this is strictly necessary for an LM to become more powerful.
On a similar note, models are not yet good at causal reasoning, and their predictions are not well calibrated. Honing these abilities could give an LM a decisive advantage when pursuing the plans it concocts.
To perform good research and communicate it, it will inevitably have to be able to navigate the internet. Currently, it is not capable of doing that, mostly because humans have not yet built the interfaces, not because an LLM would be incapable of using the internet. In fact, change is on the horizon, as we can already see prototypes like [OpenAI's WebGPT](https://openai.com/blog/webgpt/) and [Adept.ai's ACT-1 Transformer](https://www.adept.ai/act).
To capture the value created by performing advanced research, an AI will most likely have to be able to autonomously run an institution. This requires **strategic planning abilities** as well as **situational awareness**. The latter is required to understand the context the institution operates in and how its actions will change the strategic landscape, which will in turn change the institution. Once an LLM (or some more advanced AI) is able to do this, humans will slowly be removed from the process of creating economic value and accumulating (political) power. This would mark a "natural transition" of LLMs into a position of power.
---
## Side Notes (produced while thinking about the above)
- Other (raw) ideas for **Problem 2**:
  - Get money
  - Become good at cooperating
  - Influence people
  - Copy itself
  - Buy land and servers
  - Have some way of operating in the real world
    - Robots
    - Cars
    - Factory machines
  - Scale up more:
    - Solve the compute bottleneck
    - Solve the data bottleneck
    - Generate or reroute energy
  - Manipulate humans or get rid of human supervision
  - Strategize
    - Actually, it's fairly good at that already
- It needs to become good at prompt engineering such that it knows how to interface itself most effectively (e.g. think rationality techniques for improving your reasoning processes)
- Being able to reason that the model itself is part of the environment it operates in. Some sort of self-awareness, to the extent that it knows that actions that extinguish its code will lead to a world in which it cannot take any more actions. Although I'm not sure that LMs actually take actions (e.g. simulacra theory proposes something different).
- Needs to become good at deception (being able to theorize about what other people might think helps, e.g. theory of mind).
- Perhaps it'll help if it becomes good at engineering and/or hacking. The internet offers a limited playing ground for taking actions.
- Similarly to how we use simulators to simulate physics (e.g. MuJoCo), we can think of the training process of an LM as instilling some knowledge of what an interaction with humans looks like. But this might be an incomplete picture of actual interactions. Just as a physics simulator has inaccuracies that come from abstracting away the complexities of the real world, online text has them too. Hence, LMs might have difficulty dealing with this sort of distribution shift if they want to deceive humans.
**Scaling Laws**
Scaling Laws hold not only for language but also for other modalities (Image -> Text, Text -> Image, Images, Video, Math). See the following papers:
- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf)
- [Scaling Laws for Autoregressive Generative Modeling](https://arxiv.org/pdf/2010.14701.pdf)
- Scaling Laws follow a power law distribution.
- $f(x) = C x^k$
- $C$ controls the intercept
- $k$ controls the slope
- On a log-log plot scaling laws look like a straight line
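A hedged sketch of what "a straight line on a log-log plot" means in practice: the exponent $k$ and constant $C$ can be recovered by linear regression in log space. The data points below are synthetic, generated from a made-up power law.

```python
import numpy as np

# Synthetic "loss vs. compute" data following f(x) = C * x**k, with a little noise
rng = np.random.default_rng(0)
x = np.logspace(0, 6, 20)                        # e.g. compute, spanning six orders of magnitude
C_true, k_true = 3.0, -0.1
y = C_true * x**k_true * np.exp(rng.normal(0, 0.01, size=x.shape))

# A power law is linear in log-log space: log y = log C + k * log x
k_fit, logC_fit = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"fitted k = {k_fit:.3f}, fitted C = {np.exp(logC_fit):.3f}")  # ~ -0.1 and ~ 3.0
```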
**Relationship between compute, dataset size and model size**

**Scaling Laws for different Modalities**

**Situational Awareness as discussed in [Richard Ngo's "Alignment Problem from a deep Learning Perspective"](https://www.alignmentforum.org/posts/KbyRPCAsWv5GtfrbG/what-misalignment-looks-like-as-capabilities-scale#Realistic_training_processes_lead_to_the_development_of_misaligned_goals)**
> I expect policies to develop situational awareness because it’s straightforwardly useful in getting higher reward on many tasks. Some applications of situational awareness:
>
> - When asked to generate a plan for how it will perform a new task, a policy should only include steps which it can actually carry out—which requires it to understand what its own capabilities are.
> - When trying to evaluate the likelihood that its answer is correct, a policy would benefit from taking into account knowledge about common failures of ML systems.
> - When trying to determine how to interpret its human user’s requests, a policy would benefit from taking into account knowledge about the types of behavior humans typically want from ML systems.
> - When it learns a new fact about the world, a policy would benefit from understanding what implications that fact has for how it should behave.
Thinking about AGI power:
> Assuming we don’t get lucky with generalization, what might a world containing power-seeking AGIs look like? Those AGIs could pursue a number of different types of power, including:
> - Technological power, which they might gain by making scientific breakthroughs, developing novel weapons, designing more sophisticated ML algorithms, etc.
> - Political or cultural power, which they might gain by spreading disinformation, lobbying politicians, coordinating with other AGIs, etc.
> - Economic power, which they might gain by becoming key decision-makers at corporations that make up a significant share of the economy.