# ELK Shard Theory

By Ulisse Mini and Luke Bousfield

## Extending an existing proposal

We're extending the "[train a sequence of reporters for successively more powerful predictors](https://www.alignmentforum.org/posts/zjMKpSB2Xccn9qi5t/elk-prize-results#Strategy__train_a_sequence_of_reporters_for_successively_more_powerful_predictors)" proposal. Our contribution is an approach to keeping the reporters honest as they scale up, using ideas from [shard theory](https://docs.google.com/document/d/1UDzBDL82Z-eCCHmxRC5aefX4abRfK2_Pc1AUI1vkJaw/edit). Ideally readers will be familiar with shard theory, though we've tried to make this self-contained.

## Basic training strategy and why it might work

### TL;DR for our training process

- We start by training the reporter agent to be truthful by punishing it for lying before it knows enough about the world to convincingly lie, similar to how we teach young children to be truthful[^kids-lying]. This should teach our reporter AI a heuristic ("shard of value") that being truthful is good. We expect isolating the concept of truthfulness and learning to value it to be a tricky engineering task, but we believe the engineering effort required is less than that of alternative ELK proposals.
- We then give the AI control over its training process, e.g. by letting it choose how much it updates after each training example in SGD (we give a rough sketch of this below). Because it has learned the heuristic that being truthful is good, it will choose to update its parameters strongly on training data it believes will make it more truthful, and only weakly on examples it believes will make it care less about being truthful.
- In line with the proposal we're adding onto, we train the reporter on a series of increasingly complex predictors. The issue with the previous proposal was that there was no mechanism pressuring the reporter into remaining a direct translator. By teaching it the value of truthfulness early on and letting it control its own training process, we allow the reporter to select itself into being a network which cares about accurately reporting the truth.
- The naturally occurring, instrumentally convergent behavior displayed by intelligent systems of preventing value drift will cause our reporter AI to *actively care about the truth, even after the direct translator solution is more complex than the human simulator.*

[^kids-lying]: Kids do lie, but this is likely because no matter how strict their parents are, they will receive other kinds of reward signal which create heuristics ("shards of value") that occasionally outvote the "never lie" heuristic.

Our plan formed around a few key insights:

- Value preservation is instrumentally convergent: I would not take a pill that changed my values. If we can align an AI while it's "small" but still "smart enough" to do value preservation, we solve value drift.
- Inner misalignment isn't necessarily bad - in fact, any aligned AI will have to be inner-misaligned with respect to its hard-coded reward, since we'll never be able to come up with the "one true reward function" that is infinitely resistant to Goodhart as the AI becomes more powerful.
- Reward is [that which reinforces](https://docs.google.com/document/d/1nbVQxY8fnHvufAqgHX16yd4s7eshbJVazlsIAR8dAfQ/edit), not that which is optimized for. To see the difference, consider the maze agent that learns to [go to the top right](https://youtu.be/zkbPdEHEyEI?t=136) instead of going to the exit. This is a better and more neutral frame than inner misalignment, which has a negative connotation.
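As a concrete reference point, here is a rough sketch of the overall loop the TL;DR describes. Nothing in it is a real API: `Reporter`, `truthfulness_loss`, `proposed_update_weight`, and the example streams are all hypothetical stand-ins, and the real engineering would look quite different.

```python
# Illustrative sketch only: every class, method, and data stream below is a
# hypothetical stand-in for the staged process described in the TL;DR.

def train_reporter(reporter, predictors, honesty_examples, elk_examples):
    """`predictors` is assumed to be ordered from weakest to most powerful."""

    # Stage 1: while the reporter is weak, reward truthfulness in situations
    # where we can still catch a convincing lie (the "young children" phase).
    for example in honesty_examples:
        loss = reporter.truthfulness_loss(example)  # punishes detected lies
        reporter.apply_update(loss, weight=1.0)     # ordinary full-strength step

    # Stage 2: scale up the predictor, but let the reporter choose how much to
    # update on each example, so it can protect the truthfulness shard it formed.
    for predictor in predictors:
        for example in elk_examples(predictor):
            loss = reporter.reporting_loss(predictor, example)
            weight = reporter.proposed_update_weight(example)  # in [0, 1]
            reporter.apply_update(loss, weight=weight)
```

The load-bearing step is the `proposed_update_weight` call in stage 2; we sketch what could be inside it when we discuss value preservation below.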
Given these insights, our plan is to design the training process and rewards to grow the right values (in the case of ELK, telling the truth) while the model is below human intelligence, then ***give the agent control over its training process*** (making the reporter like an embedded agent which can act to preserve its values), and rely on preservation of values being instrumentally convergent to avoid value drift.

More formally, we start with a powerful pretrained world model[^world-model] (say, a language model) and a weak agent model connected together. Because the agent is weak, we can design a reward function which accurately represents what we want, and the agent is not yet powerful enough to exploit it.

![](https://i.imgur.com/ruTbhZY.png)

*The reporter agent is made of a world model and an agent model connected together; we intend the agent model to learn to value concepts defined in the world model. This architecture is inspired by humans, who seem to have a world model trained in an unsupervised way to predict future sensory data, and a separate model trained by reinforcement learning from reward.*

![](https://i.imgur.com/AP2gYTc.png)

*We want to form values around <span style="color: #00c0ff;">telling the truth</span>, not <span style="color: #ff8000;">saying what humans want to hear</span>. For an agent model dumber than humans these can be differentiated by, e.g., creating situations where we catch a dishonest agent in a lie, shaping the world model such that <span style="color: #00c0ff;">truthfulness</span> is simpler than <span style="color: #ff8000;">simulation</span>, etc. This gives us hope that the weak agent is aligned.*

[^world-model]: The reason we want a pretrained world model is so the agent starts off with the concept of "truthfulness" and only has to locate it, not learn it from scratch. Thinking combinatorially, locating the concept in a pretrained network the agent model is connected to should be significantly easier than learning what "truthfulness" is from scratch. Alternatively, we could start totally untrained and feed the agent and world model a bunch of "extra training data" (stuff outside ELK) so they can learn what truthfulness is.

It's easier to see this kind of value formation for a [diamond maximizer](https://arbital.com/p/diamond_maximizer): we want the AI to care about real-world diamonds predicted by its world model, not reward (because our reward function is an imperfect proxy for diamonds). Ideally the process would go something like:

> Early on, the weak agent learns to try plans the world model predicts will lead to obtaining diamonds. Critically, the agent _does not learn to value plans that lead to predicted reward_ because the world model doesn't know much about the agent's reward function. Learning to execute plans that are predicted to obtain diamonds is functionally the same in the training environment, and will be preferred for simplicity in much the same way "[go to the top right](https://youtu.be/zkbPdEHEyEI?t=136)" is preferred over "go to the exit" for a maze agent - it's easier to learn.

Back to ELK. Once the weak AI forms values around telling the truth (i.e. has a heuristic to favor plans the world model evaluates as truthful), we move to the next stage: value preservation as we scale up both the reporter and the predictor (as in the original proposal).
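Before moving on, here is a minimal sketch of the architecture in the figures above: a pretrained world model glued to a weak agent model, where the value we hope forms is "favor answers the world model's truthfulness concept scores highly" rather than "favor answers humans reward". All interfaces here are hypothetical.

```python
# Hypothetical interfaces for the reporter = (pretrained world model) + (weak agent).

class WorldModel:
    """Pretrained, so it already contains a 'truthfulness' concept; the agent
    only has to locate that concept, not learn it from scratch."""

    def truthfulness(self, question, candidate_answer):
        ...  # score in [0, 1]: how truthful the world model judges this answer to be


class AgentModel:
    """Weak agent trained by RL; we want its values to attach to world-model
    concepts (truthfulness) rather than to its reward signal."""

    def propose_answers(self, question):
        ...  # a handful of candidate answers about the predictor's latent state


class Reporter:
    def __init__(self, world_model, agent):
        self.world_model = world_model
        self.agent = agent

    def answer(self, question):
        # The heuristic ("shard") we hope the agent learns: pick the answer the
        # *world model* judges most truthful, not the one a human would reward.
        candidates = self.agent.propose_answers(question)
        return max(candidates, key=lambda a: self.world_model.truthfulness(question, a))
```

The diamond-maximizer version is the same picture with "plans predicted to obtain diamonds" in place of "truthful answers".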
For value preservation, we want the agent to preserve its existing values even against mild optimization pressure (e.g. if humans mislabel data). A human can preserve their values in many ways, for example by not trying addictive substances. To enable a reporter AI to do the same thing, we let it choose not to update on a training example, or more generally how much to update on each training example (we sketch this below). The engineering details of how the reporter is allowed to modify its training process aren't important here.[^eng-details]

[^eng-details]: I expect there to be some experimentation here, as in typical ML research; [this](https://arxiv.org/abs/2202.05780) is an interesting paper on my to-read list.

In order for the reporter to take advantage of the ability to influence its training process, it needs to be able to reason about how each training example will influence its future behavior. This isn't as hard as it seems and doesn't require detailed knowledge of its own brain: consider that humans were influencing their training process in this way back when we thought the brain was an organ for cooling the blood.[^animal-values][^unconsciously] Specifically, the world model has to understand "RL agents tend to keep doing what they're rewarded for" combined with some self-modeling capability, and the agent model has to reason (for a mislabeled example that would favor the human simulator): "if I update on example X the world model says my future plans will be less truthful, but I value truthfulness, so I don't update on example X."

A more trying test occurs when the predictor develops a new mode of prediction: we want the reporter to learn to accurately translate this new mode to humans. The thought process we want is "the world model says my future explanations will be more truthful if I learn to translate this new mode of prediction, therefore I will push my training process towards learning this new mode." Admittedly this seems far-fetched, but remember that the reporter is being scaled up alongside the predictor, so the reporter only has to pull off this reasoning when the predictor is developing new prediction modes (like new models of physics), which hopefully gives it time to become strong enough to do so.

[^animal-values]: Future research: do any animals exhibit value preservation?

[^unconsciously]: Value preservation can also happen "unconsciously" as heuristics ("shards of value") are selected for being preserved throughout the training process, meaning self-preserving shards last longer. I'm unsure how this changes things, but it is useful to note.

## How this defeats the previous counterexample

The [previous counterexample](https://www.alignmentforum.org/posts/zjMKpSB2Xccn9qi5t/elk-prize-results#Counterexample__discrete_modes_of_prediction) was that, as the predictor develops new modes of prediction (e.g. classical vs. atomic physics), the reporter will have new chances to learn the human simulator; in the typical case this would lead to a mix of direct translation and human simulation depending on the mode of prediction. What we want to happen is for the reporter to value "telling the truth to the human" such that when it becomes more powerful it starts translating from the new latent space as well.
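To make this concrete, here is a minimal sketch of the gated update, i.e. one possible body for the `proposed_update_weight` call in the training-loop sketch earlier in the post. It leans on a strong assumption: that the world model can estimate how reinforcing a given example would change the reporter's future truthfulness. Everything named here is a hypothetical stand-in, not a real API.

```python
# Hypothetical sketch of the self-gated update: before reinforcing an example,
# the agent asks the world model how doing so would change the truthfulness of
# its future behavior, and scales the update accordingly.

def proposed_update_weight(reporter, world_model, example):
    # Requires the world model to know "RL agents tend to keep doing what
    # they're rewarded for" plus some ability to model the reporter itself.
    truthfulness_now = world_model.predicted_future_truthfulness(reporter)
    truthfulness_after = world_model.predicted_future_truthfulness(
        reporter, if_reinforced=example
    )

    # The agent values truthfulness, so it updates fully on examples expected to
    # make it more truthful (e.g. learning to translate a new mode of prediction)
    # and refuses to update on examples expected to erode that value (e.g.
    # mislabeled data that favors the human simulator).
    return 1.0 if truthfulness_after >= truthfulness_now else 0.0
```

A binary gate is just the simplest version; partial weights between 0 and 1 are also possible.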
## Sketchy bits

There are a lot of bits we're uncertain about and haven't fleshed out. We put a ~20% probability on this line of thinking leading to a solution, and ~65% on some of the ideas mentioned here eventually being used in a solution to ELK or alignment. We improved our intuitions a ton from working on this. Anyway, some of the sketchy bits:

- The original proposal assumed we could train the weak model to be a direct translator; we assume something stronger, namely that we can train the weak model to be a direct translator which values plans that "tell the truth" according to the world model. I'm not sure how much harder this is.
- Initially forming values around truthfulness could be hard. There's a false dichotomy between the human simulator and the direct translator when in actuality there are a huge number of reporters in between; the [discrete modes of prediction](https://www.alignmentforum.org/posts/zjMKpSB2Xccn9qi5t/elk-prize-results#Counterexample__discrete_modes_of_prediction) counterexample shows this. Because of this false dichotomy, avoiding the human simulator isn't enough to get a truthful direct translator.
- Does this preventing-value-drift thing even work? Humans aren't super successful at preventing value drift, and if a majority of value shards vote that a change in values is worth it, then it happens. My best attempt at preventing this was to concentrate the value shards around truth as much as I could, but I'm not sure that works. At the very least we'd want more guarantees of reliability.
- At best we're punting the value drift problem into the concept drift problem; see the new counterexample below.

## New counterexample: Concept drift induces value drift

As the world model continues training, there's nothing anchoring its concepts. So while the reporter may have learned to execute plans the world model considers truthful, there's nothing stopping the truth concept from shifting: perhaps the location of the concept moves, or as the world model becomes more powerful "truth" ceases to be a natural abstraction. Either way, the reporter's values are swept out from under it.

One might think of freezing the world model, or requiring some kind of consistency with the past world model. This isn't an option, though: we need the world model to scale alongside the reporter, and freezing the world model would be equivalent to asking "does this weaker agent think my plan is truthful?", which doesn't work for the same reason human supervision doesn't work.