# AI of Theseus
Michael Zargham
Jules Hedges
## Summary
We propose to investigate and design a class of feedback mechanisms for governing AI systems, taking into account the dynamic nature of human values and preferences. We will use tools from control theory, categorical cybernetics, and compositional game theory to create software models that allow institution designers to experiment with the dynamics of institutional change without needing to understand the technical tools behind them.
## Motivation
A lot of work on institution design, both in general and now for AI governance, treats the problem as essentially static. It is assumed that individual humans have fixed goals, and the problems are then to aggregate those goals (by deliberation, voting, etc.) and to act effectively on them through decision-making institutions. However, the world is not static: both people's values and preferences and the wider context slowly (and sometimes quickly) change over time. Institutions need to be designed with this in mind: an institution that is very difficult to change will eventually need to be replaced entirely, and many institutions slowly drift away from their original goals (*cough* universities).
A major reason this is necessary is that any AI in deployment always has an incomplete view of the outside world, for example through its training data, sensors, prompts etc. There are always facts about the world that the AI is unable to take into account. This typically includes the internal preferences of humans, which are private by default and only sometimes explicitly communicated. It is impossible for the AI to react to things it cannot observe. This is one reason why the ability to change a running system in deployment is necessary.
Abstractly this situation is an example of a dynamic control problem, and tools to understand it exist but are foreign to the institution design field.
Our goal in this project is to build models in software that allow institution designers to experiment with the dynamics of institutional change without needing to understand the technical tools behind them.
Specifically we will use
- control engineering methods for the design and testing of robust information processing mechanisms
- ...and its known connection to compositional game theory, in the guise of "categorical cybernetics". This helps partly by moving away from the purely quantitative setting of control theory, and via compositionality it allows rapid prototyping of models, which is necessary in the face of rapidly changing AI capabilities and rapid social change (itself reflexively caused by rapidly changing AI capabilities!)
- ... and the existing implementation of compositional game theory that abstracts away most of the theory
- ... and another further abstraction UI layer on top presenting things in institution design terms
- ... with the computer "compiling" down from the high-level institutional language to low-level control theory in order to run simulations
## Problem Definition
Consider a population of humans who are impacted by the decisions of an AI within some context.
There are two kinds of events in this system, a decision event and an update event.
The AI is a transformer characterized by
$$F:\Phi \times X \rightarrow Y$$
This transformer can be understood as being configured by $\phi \in \Phi$ such that
$$f_\phi:X \rightarrow Y$$
A decision event is characterized by the pair $(x,f_\phi(x))$, which, when observed by human $i\in \mathcal{M}$, produces a local evaluation
$$R:\mathcal{M} \times X \times Y \times Z \rightarrow \{0,1\}$$
where the local evaluation
$$r_i:X \times Y \times Z \rightarrow \{0,1\}$$
is agent $i$ evaluating whether they deem the behavior of the transformer acceptable, conditioned on some additional contextual information $z\in Z$ that is not known to the transformer and not explicitly known to the human. That is to say, while the evaluation is observed by the human ("they know it when they see it"), they cannot provide the data $z\in Z$ explicitly.
Between update events there is a sequence of observed decision events
$$e=\{(i,x,y):R(i, x, y, z)\}\in E.$$
The goal of this project is to design a mechanism that empowers the humans $i \in \mathcal{M}$ to aggregate their observations into a feedback process (or "collective intelligence mechanism") which updates $\phi \in \Phi$:
$$\phi^+ = g(\phi, e)$$
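As a minimal sketch of this loop, consider a toy instantiation in which $\Phi$, $X$, and $Y$ are all scalars. All concrete choices here (the additive transformer, the tolerance-based evaluation, and the averaging update rule $g$) are illustrative assumptions, not part of the proposal:

```python
# Toy instantiation: Phi = X = Y = float, and the transformer is f_phi(x) = x + phi.
def f(phi: float, x: float) -> float:
    return x + phi

# Local evaluation r_i: human i accepts output y iff it lies within a private
# tolerance z_i of a shared target. z_i is tacit: the mechanism never sees it,
# only the resulting accept/reject judgments.
def r(z_i: float, x: float, y: float, target: float = 0.0) -> bool:
    return abs(y - target) <= z_i

# Between update events, log the decision events that humans flag as unacceptable.
def collect_events(phi, inputs, tolerances):
    events = []
    for i, z_i in enumerate(tolerances):
        for x in inputs:
            y = f(phi, x)
            if not r(z_i, x, y):
                events.append((i, x, y))
    return events

# Update g: nudge phi against the average flagged output (a toy rule).
def g(phi, events, step=0.1):
    if not events:
        return phi
    avg_y = sum(y for (_, _, y) in events) / len(events)
    return phi - step * avg_y
```

Running `collect_events` between update events and then applying `g` closes the feedback loop: the configuration $\phi$ drifts toward outputs the population stops flagging, without the mechanism ever observing the private data $z_i$ directly.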
![](https://hackmd.io/_uploads/SyilyEGI3.png)
This design space includes a wide range of mechanisms; two extreme approaches are
- apply an update to $\phi$ every time an observation is logged by a user -- but this may create gaming or thrashing
- accumulate the observations and update $\phi$ periodically with a time-based mechanism -- but this may cause lags and/or waste computation when no update is needed
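The two extremes can be contrasted with a toy comparison (the scalar error signals and the step-size rule here are illustrative assumptions):

```python
# Hypothetical scalar setting: each logged observation carries an error
# signal e_t, and an update moves phi by -step * (error).

def immediate_update(phi, observations, step=0.1):
    # Extreme 1: update on every logged observation.
    # Highly responsive, but each single observation moves phi directly,
    # inviting gaming and thrashing.
    for e_t in observations:
        phi -= step * e_t
    return phi

def periodic_update(phi, observations, step=0.1):
    # Extreme 2: batch all observations between update events and apply
    # one aggregate correction. Smoother, but lags the observation stream
    # and runs even when no correction is needed.
    if not observations:
        return phi  # a wasted update event: nothing to do this period
    mean_e = sum(observations) / len(observations)
    return phi - step * mean_e
```

With two equal flagged errors, the immediate scheme steps twice while the periodic scheme applies one averaged step, illustrating how the choice of update trigger changes the gain of the feedback loop.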
Our proposal includes characterizing the class of input-output stable feedback mechanisms, as well as considerations for the representation of minority preferences among the population's members.
## Expected Outcomes
1. A comprehensive theoretical framework for designing feedback mechanisms in AI governance, considering the dynamic nature of human preferences and the need for adaptability.
2. Software models that simplify the process of experimenting with institution design for researchers and practitioners, abstracting away the complexity of the underlying theories.
3. Empirical results and insights on input-output stable feedback mechanisms and their implications for the representation of minority preferences in AI governance.