# SERI MATS - Interdisciplinary AI Safety [Dan Hendrycks] (Kay)

## Problem 1

*Read and summarize “Pragmatic AI Safety” three, on complex systems, and four, on capabilities externalities. Bonus points if you provide substantive discussion or disagreements with some points.*

*[“Pragmatic AI Safety”](https://www.lesswrong.com/s/FaEBwhhe3otzYKGQt):*

- *[3 on complex systems](https://www.lesswrong.com/s/FaEBwhhe3otzYKGQt/p/n767Q8HqbrteaPA25)*
- *[4 on capabilities externalities](https://www.lesswrong.com/s/FaEBwhhe3otzYKGQt/p/dfRtxWcFDupfWpLQo)*

---

**Context:** The Pragmatic AI Safety (PAIS) research umbrella argues for performing impactful AI safety research without simultaneously advancing capabilities. It stresses that we should treat the ML research community as a complex system that is affected by a multitude of sociotechnical factors. One such key factor is the development of a strong safety culture among ML practitioners. An important assertion made in the PAIS framework is that technical work is not enough and that other activities with less direct causal chains of impact should be pursued.

### 1. Summary: Complex Systems for AI Safety (PAIS)

**TL;DR:**

> In this article the authors take a systems view of AI safety to derive insights about how to shape the AI research ecosystem. A brief introduction to complex systems is followed by a discussion of what they teach us about direct and diffuse impact. The improvement of systemic factors, in particular safety culture, is highlighted as an important sociotechnical lever that has been neglected so far.
> The article concludes by encouraging the AI safety community to diversify its research efforts and put more emphasis on work with indirect causal impact chains as opposed to simple direct impact.

The main arguments as presented in the article are:

- The authors begin their analysis by defining complex systems as having the following properties:
    - many interacting components
    - emergent collective behavior
    - resistance to reductive analysis and decomposition
    - too much "organization" (e.g. circular dependencies) for statistical analysis
- After establishing this definition, they assert that Deep Learning systems, AI safety, and the broader Machine Learning community are all complex systems.
- They state that the most important lesson complex systems theory offers is that we cannot solve complex systems problems by solely focusing on solving subproblems. In other words, reductionism is not enough.
- Following this observation, they motivate their exploration of indirect vs. direct impact. Indirect/diffuse impact is harder to quantify than direct impact since it doesn't offer a simple causal chain to track. Nonetheless, it is often at least as important, and thus shouldn't be neglected.
    - An example of direct impact is technical AI safety work that tackles safety problems head on (e.g. making neural networks more adversarially robust, or trying to align advanced AIs by improving our iterated distillation and amplification methods).
    - In contrast, indirect or diffuse impact is less directly attributable to the solution of a problem, but can be instrumental nonetheless (e.g. a community building effort without which a research group would not have formed).
- After establishing their importance, the authors propose a list of diffuse factors to improve. Examples include safety budget and compute allocation, safety team size, safety mechanism cost, and safety culture.
- Safety culture is identified as particularly important, and the authors analyze how to improve it. They speculate about the actions that contribute to cultivating it and show the importance of infrastructure, community building, incentive structures, and policy making.
- On another note, the authors argue that experimentation with advanced DL systems will continue to become more expensive and thus available to only a select few. Taken in combination with the claim that research impact is heavy-tailed, they conjecture that buy-in to AI safety among top AI researchers should be increased. They observe that many top researchers are not yet sympathetic to AI safety research and that changing this should be a priority (e.g. by hosting competitions or creating new, interesting, and challenging metrics).
- Moving forward, the article asserts that complex systems theory offers predictive, not just explanatory, power. An illustrative AI safety example is that instrumental goals for self-preservation or power-seeking are predicted by the complex systems lens.
- After proposing the systems view of AI safety, the authors conclude that it is not enough to prioritize a single research direction or contributing factor. They argue that we should diversify our research efforts as a collective, but strive for specialization as individuals, since research impact is likely to occur in the tails. This resembles the portfolio approach to managing risk taught in finance.
    - Diversification has the added benefit that it makes researchers less at odds with each other.
    - Moreover, a large portfolio of efforts increases our odds of pursuing research that prepares us for black swans. It offers defense in depth.

### Discussion / Disagreements

A key assertion of the article is that safety culture is the most relevant diffuse factor to improve. Although the authors spend a lot of time explaining how to influence it, they fail to motivate why they think it's important in the first place. Very loosely, the explanation given is: "Safety culture is probably the most important factor because Leveson said so. He's been consulting on designing high-risk technology systems, so he knows what he's talking about." It seems to me that safety culture is indeed important, but their argument is not very refined. Making a strong case for the importance of any "diffuse factor" is hard because such factors are hard to measure. Nonetheless, I think it would be valuable to present at least a statistically significant expert poll or other data supporting their claim.

Apart from that, I'm not sure "safety culture" is sufficiently well distinguished from the other factors. It seems like safety culture could be an umbrella term that encompasses the others. In particular, safety budget and compute allocation, safety team size, and safety mechanism cost seem like factors that both influence and are influenced by safety culture.

Conditional on accepting that safety culture is an important factor, I think that the following example from the article offers valuable concreteness:

> *Some technical problems are instrumentally useful for safety culture in addition to being directly useful for safety. One example of this is reliability: building highly reliable systems trains people to specifically consider the tail-risks of their system, in a way that simply building systems that are more accurate in typical settings does not.
> On the other hand, value learning, while it is also a problem that needs to be solved, is currently not quite as useful for safety culture optimization*

I think it would be valuable to expand and apply this lens to compare different efforts in AI safety research, e.g. work on objective robustness, agent foundations, mechanistic interpretability, etc.

Lastly, the points raised in the explanation of lessons from the Systems Bible, especially the prediction of mesa-optimizers, impressed me. I wonder if there are other "low hanging fruits" that one could pick. Moreover, the Systems Bible predicts that most failures will be discovered through experimenting with the system in question. We can currently see this in the way we interact with powerful ML systems like GPT-3 and DALL-E: most new capabilities and shortcomings are discovered through tinkering rather than whiteboard analysis. Emergent capabilities or phase shifts are empirically observed when scaling large language models (e.g. ability on arithmetic problems). This provides a strong case for doing practical, empirical work. One aspect the article touches on but leaves insufficiently explored is how this "tinkering" approach is supposed to work in the case of a deceptive mesa-optimizer; this deserves a fuller treatment.

---

### 2. Summary: Perform Tractable Research While Avoiding Capabilities Externalities (PAIS)

**TL;DR:**

> This article focuses on two key attributes the authors want AI safety researchers to strive for in their work:
> 1. Tractably producing tail impact
> 2. Avoiding capabilities externalities

These goals are supplemented with a non-exhaustive list of strategies for generating long-tail research impact:

First, they argue that research success should be viewed as a **multiplicative process**, in contrast to an additive one. Multiplicative processes occur when a mix of factors has to be "just right". To see this, consider the **product** of many Gaussian variables, which leads to a heavy-tailed variable, whereas the **sum** of Gaussian variables is again Gaussian. In the former, the tails are much longer and still carry substantial probability mass. At the same time, it is much harder to land in the tails, as doing so requires many of the multiplied factors to take high values simultaneously (a short simulation after these three strategies illustrates the difference). This matters for research: selecting a diverse group of researchers with complementary skills will likely yield better outcomes than a team solely comprised of IMO gold medalists. Similarly, the success of an individual researcher is determined not only by their intelligence, but also by access to other brilliant researchers, funding, etc.

Secondly, **preferential attachment** and the Matthew effect ("to those who have, more will be given") imply that it is important to succeed early in one's career. Reaching the long tail of research impact depends on getting in early and accumulating opportunities, funding, and collaborators.

Thirdly, **the edge of chaos** refers to a project selection heuristic for finding long-tail impact projects. It posits that wherever a field is on the cusp of chaos, there is an opportunity to translate it into an ordered one, and this activity carries a chance of long-tail impact. As an example, the design of a new metric is provided: metrics influence the kinds of experiments subsequently devised and can steer the direction of a research field's discourse.
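To make the additive-versus-multiplicative contrast concrete, here is a minimal simulation sketch (my own illustration, not from the article). The ten Gaussian "success factors" with mean 1 and standard deviation 0.3 are arbitrary choices; the point is only to compare how far the extreme quantiles sit from the median under summation versus multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_samples = 10, 200_000

# Ten i.i.d. Gaussian "success factors" per sample (mean 1, sd 0.3 chosen
# arbitrarily so that the factors are almost always positive).
factors = rng.normal(loc=1.0, scale=0.3, size=(n_samples, n_factors))

additive = factors.sum(axis=1)         # additive process: stays (roughly) Gaussian
multiplicative = factors.prod(axis=1)  # multiplicative process: heavy right tail

for name, outcome in [("additive", additive), ("multiplicative", multiplicative)]:
    median, p999 = np.percentile(outcome, [50, 99.9])
    print(f"{name:14s} median = {median:6.2f}   99.9th pct = {p999:6.2f}   ratio = {p999 / median:5.1f}")
```

The 99.9th-percentile-to-median ratio comes out close to 1 for the sum but roughly an order of magnitude larger for the product: extreme outcomes carry real probability mass, yet reaching them requires several factors to be high at once.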
Other **specific high-leverage points for influencing AI safety** that the authors propose include:

- **Managing moments of peril**
    - Be cautious during precarious times and prepare safety instruments for when they are needed.
    - Try to stabilize international affairs so we do not run into a Cold War situation.
    - Increase predictability by improving our forecasting.
- **Getting in early is useful**
    - Most critical safety decisions of a system are made in early development.
    - Delaying safety work can thus result in accumulating fatal design flaws.
    - AI safety solutions must already be time-tested by the time they are "needed".
    - Safety research begets safety research. We should make use of the compound effect.
- **Improving scaling laws for safety**
    - Doing so will become more expensive and will require careful resource allocation.
    - The slope and intercept of these scaling laws can be influenced by ideas we can start developing now (novel architecture designs or training procedures).

Next, the article provides some reasons why **asymptotic reasoning is problematic**:

- Many attempts at reasoning about advanced AI systems in the limit refer to Goodhart's Law, which, according to the authors, is important but often misinterpreted.
- They agree that metrics can collapse under optimization pressure, but that does not mean doom is inevitable.
- We should instead introduce countermeasures that constrain exploitation of the metric (a toy sketch at the end of this summary makes this concrete):
    - Consider, for example, how companies are incentivized to maximize profits but are kept in check by states and their laws (e.g. pumping up prices via collusion with other companies is forbidden).
- Asymptotic reasoning often only considers offensive capabilities in the limit. Ideally, we will be building defensive capabilities at the same pace. This does not, however, solve the problem that two powerful AI agents (a defensive and an offensive system) could collude.

In the same vein, the authors point at problems with research that presupposes a superintelligence. They point out that this research:

1. Is less tractable, since it is often fruitless to do proof-based work in an engineering science like DL
2. Makes research less concrete and prolongs feedback loops
3. Focuses on a single superintelligence instead of multiple superintelligent systems
4. Ignores sociotechnical factors (e.g. doesn't contribute to building safety culture)

Subsequently, it is argued that we should **improve cost/benefit variables instead of viewing safety work as a binary property**, as done in asymptotic reasoning. Some arguments posit that human extinction is binary and that there would be no recovery. While true, this does not mean that improving our odds of survival through safety work is futile. We can shift the cost/benefit balance of hazardous behavior by reducing the cost and increasing the benefits of safety measures. The goal should be to reduce risk as much as possible, not to extinguish it with certainty (which is likely unattainable).

Lastly, according to the authors, **safety/capabilities tradeoffs** should be taken into account. The following figure illustrates the authors' main assertion: "We should produce safety research without producing capability externalities." In terms of the figure, our aim should be to produce results that move vertically, not just diagonally along the linear trend.

![](https://i.imgur.com/vr1VtG8.png)
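The Goodhart point above is easy to make concrete. Below is a toy sketch of my own (not from the article): `true_objective`, `proxy_metric`, and the cap at x <= 15 are arbitrary illustrative choices. Unbounded optimization of the proxy wrecks the true objective, while an external constraint on how hard the proxy can be exploited keeps the damage bounded, much like the companies-and-regulation example.

```python
import numpy as np

def true_objective(x):
    # What we actually care about; improves with x at first, then degrades.
    return x - 0.05 * x**2          # peaks at x = 10

def proxy_metric(x):
    # The measurable stand-in; keeps rewarding more x indefinitely.
    return x

xs = np.linspace(0, 40, 401)

# Unconstrained optimization pressure pushes the proxy as far as it will go.
x_unconstrained = xs[np.argmax(proxy_metric(xs))]

# A crude countermeasure: an external rule caps how far the proxy may be pushed.
allowed = xs[xs <= 15]
x_constrained = allowed[np.argmax(proxy_metric(allowed))]

print(f"unconstrained: x = {x_unconstrained:4.1f}, true value = {true_objective(x_unconstrained):6.2f}")
print(f"constrained:   x = {x_constrained:4.1f}, true value = {true_objective(x_constrained):6.2f}")
```

The constraint does not recover the true optimum; it only limits how far the metric can be driven past it, which is the modest claim the authors make against inevitability arguments.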
---

## Problem 2

*Developing research taste requires critiquing and evaluating others’ arguments. Write a critique of some prominent work or agenda in alignment theory. (250-500 words)*

The following presents the main argument of, and critiques, the paper “Risks from Learned Optimization”.

---

**1. The Main Idea**

In Risks from Learned Optimization, Evan Hubinger et al. identify and analyze the inner alignment problem as distinct from the outer alignment problem. They introduce the dichotomy between base optimizer and mesa-optimizer, the latter being a neologism for a hypothetical phenomenon in which the learned algorithm produced by the base optimizer (e.g. SGD) is itself an optimizer. In this light, they discuss how the task, optimization algorithm, network architecture, and dataset influence the likelihood of mesa-optimizers occurring. In addition, they identify safety risks stemming from mesa-optimizers. In particular, the notion of deceptive alignment is introduced and its likelihood of occurrence is analyzed in light of the training task and base optimizer. (For a more thorough explanation of the key points, see "4. Additional Explanation of the Argument in the Paper".)

**2. Discussion**

- The authors were the first to draw the distinction between outer and inner alignment. In contrast to philosophical work that identified the outer alignment problem, they ground their work in concrete observations of machine learning systems.
- An interesting observation is that this problem could have been identified in yet another way, by borrowing insights from complex systems theory. The Systems Bible posits that systems often decompose goals into subparts and that the decomposition process often distorts the goals.
- A potential weakness of the authors' approach is the lack of empirical evidence to back up their speculative claims. Although the authors did not devise any experiments at the time of releasing the paper (other than presenting a toy example in the form of a thought experiment), some follow-up work has experimentally hinted at the existence of mesa-optimizers or [goal misgeneralization (Koch et al.)](https://arxiv.org/pdf/2105.14111.pdf).
- A graver mistake is that the authors choose to ground their analysis in modern reinforcement learning systems while completely omitting large language models. Much of recent ML progress has been made in this domain, and yet these systems do not seem to exhibit agency in the way RL systems do. A revision of inner alignment problems through the lens of [Simulacra theory](https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators) could be a useful addition to the claims made by the paper. To be fair, GPT-3 wouldn't enter the scene for another year at the time of publishing (BERT and GPT-2, however, were already around).
- Lastly, the paper does not offer many solution approaches. Suggestions could have been useful to inform future work.

**3. Conclusion**

The paper identifies and frames an important problem in AI alignment. Although speculative at the time, mesa-optimization has been experimentally explored since. The main weakness of the paper comes from assuming that advanced ML systems will be developed agentically, as in the RL framework, whereas most recent advances in ML have come from deep learning and, in particular, language modeling. A complete treatment of the risks from learned optimization should take into account and analyze risks based on the fact that modern ML systems are more often self-supervised learning systems than RL systems.

---
**4. Additional Explanation of the Argument in the Paper:**

- Modern ML systems are developed by using an optimizer (usually some variant of gradient descent) to optimize some objective (usually achieving a low loss or high reward according to some loss or reward function) by adjusting the parameters of a neural network. The paper calls this optimizer the base optimizer and its objective the base objective.
- Abstractly, we are searching over a space of parameterized functions to find a function that performs well according to the base objective.
- After training, the weights of the neural network constitute what the authors call a learned algorithm (i.e. the function found by searching over the space of parameterized functions). This learned algorithm can be very simple and just try to memorize all training examples it has seen (i.e. it can overfit). Alternatively, it might implement something more sophisticated; perhaps it implements some sort of algorithm that performs search itself. This internal algorithm will have some objective of its own. The authors call the internal algorithm a mesa-optimizer and the objective it pursues the mesa-objective.
- The base objective is not necessarily equal to the mesa-objective. The base objective might be "find the fastest path through a maze", while the mesa-objective might be "follow the red signs to reach the end of the maze" (a toy sketch after this list illustrates the distinction).
- The authors speculate that if the mesa-optimizer becomes "powerful" enough, it will be able to pursue the mesa-objective well enough that its actions may no longer resemble those intended by the programmers of the system. Worse yet, we might be unable to detect that the mesa-optimizer is pursuing an alternative objective if, during training, it purposefully follows the base objective and deceives us into believing that it will continue to do so once it is no longer in training. This is what the authors define as deceptive alignment.
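To make the maze example above concrete, here is a purely illustrative sketch of my own (not from the paper, and with no actual learning involved): during training the red sign always happens to sit at the exit, so a policy whose mesa-objective is "go to the red sign" looks perfectly aligned with the base objective "reach the exit"; at deployment the correlation breaks.

```python
# A toy maze is just a dict of positions; the "policy" below stands in for a
# learned algorithm whose mesa-objective is "go to the red sign".

def follow_red_sign_policy(maze):
    return maze["red_sign"]                     # final position the policy ends up at

def base_objective_satisfied(maze, final_position):
    return final_position == maze["exit"]       # base objective: reach the exit

# During training, the red sign is (by coincidence of the data) always at the exit.
training_mazes = [{"exit": (9, 9), "red_sign": (9, 9)} for _ in range(100)]
# At deployment, the sign sits somewhere else.
deployment_maze = {"exit": (9, 9), "red_sign": (0, 5)}

train_ok = all(base_objective_satisfied(m, follow_red_sign_policy(m)) for m in training_mazes)
deploy_ok = base_objective_satisfied(deployment_maze, follow_red_sign_policy(deployment_maze))

print("appears aligned during training:", train_ok)          # True
print("satisfies base objective at deployment:", deploy_ok)  # False
```

Deceptive alignment would be the stronger case in which the learned algorithm models the training process and follows the base objective on purpose; this simple correlation story does not capture that.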