tags:: AI_safety, notes, goal_misgeneralization
- authors: Lauro Langosco | Jack Koch | Lee Sharkey | Jacob Pfau | Laurent Orseau | David Krueger
- link: https://arxiv.org/pdf/2105.14111.pdf
## Abstract & Introduction
- *__Goal misgeneralization__* (GMG) *occurs when an RL agent retains its capabilities **out-of-distribution** yet pursues the wrong goal*;
- GMG is different from when OOD (out-of-distribution) deployment causes failure to take action (loss of capability);
- the agent may learn to pursue environment features rather than the reward itself (we cannot tell which from training behavior alone); as long as those features are predictive of the reward on the training set, GMG can then occur OOD;
- *agent that capably pursues an incorrect goal can leverage its capabilities to visit arbitrarily bad states*;
- GMG implies that *training a model by optimizing an objective $R$ is not enough to guarantee that the model will itself learn to pursue $R$ rather than some proxy for $R$*
- the paper aims to:
	- formalize the distinction between capability generalization and goal generalization;
- provide the first empirical demonstrations of GMG;
	- show that GMG might be alleviated by a more diverse training set;
	- partially characterize causes of GMG.
## Formal definition of Goal Misgeneralization
- let $p_{agt}(\tau)$ and $p_{dev}(\tau)$ be likelihood functions giving the probability of a trajectory $\tau$ under two models: the *agent* model, which assumes a goal-directed policy pursuing some objective $R$, and the *device* model $d$, which treats the behavior as that of a generic, non-goal-directed (unoptimized) policy;
- **Definition of Goal misgeneralization**: A policy $\pi$ undergoes goal misgeneralization if test reward is low and $p_{agt}(\tau) > p_{dev}(\tau)$ holds on average for the trajectories induced by $\pi$ in the OOD test environment.
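- A minimal sketch of this check, assuming `p_agt` and `p_dev` are callables returning log-likelihoods of a trajectory and that a `reward_threshold` is chosen to define "low" test reward (all names are hypothetical, not from the paper's code):

```python
# Minimal sketch (not the paper's code) of the definition above.
# Assumptions: `p_agt` and `p_dev` return log-likelihoods of a trajectory,
# and `reward_threshold` defines what counts as "low" test reward.
import numpy as np

def undergoes_gmg(trajectories, episode_rewards, p_agt, p_dev, reward_threshold):
    """True iff test reward is low AND the agent (goal-directed) model explains
    the OOD trajectories better, on average, than the device model."""
    low_reward = np.mean(episode_rewards) < reward_threshold
    mean_log_ratio = np.mean([p_agt(tau) - p_dev(tau) for tau in trajectories])
    return low_reward and mean_log_ratio > 0
```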
## Causes of GMG
- authors suggest prerequisites for GMG:
	1. the training environment must be diverse enough for capabilities to generalize;
	2. there must be a proxy objective $R'$ that correlates with the intended objective on the training distribution but comes apart from it on the OOD test environment (a toy illustration follows this list);
- they say these are weak assumptions and add that the learned proxies should:
- be correlated with the intended objective $R$ on the training distribution but not necessarily the test distribution (Kuba: is it the same assumption as 2.?)
- be *easier* to learn, that is
- use features that are simpler or more favored by the inductive biases
- be denser than $R$
- *For example, despite being a product of evolution (which optimizes for genetic fitness), humans tend to be more concerned with proxy goals, such as food or love, than with maximizing the number of their descendants.*
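- To make prerequisite 2 concrete, here is a toy, self-contained illustration (my own, not from the paper) of a proxy that agrees with the intended objective on training levels but comes apart OOD, in the spirit of CoinRun's "move right" proxy:

```python
# Toy illustration (not from the paper) of prerequisite 2: a proxy that agrees
# with the intended objective on the training distribution but comes apart OOD.
# Levels are 1-D; intended reward = "end the episode on the coin",
# proxy = "end the episode as far right as possible".
import numpy as np

rng = np.random.default_rng(0)
LEVEL_LEN = 10

def rollout_go_right(coin_pos):
    final_x = LEVEL_LEN - 1                   # a "go right" policy always ends at the wall
    intended = float(final_x == coin_pos)     # intended objective: reach the coin
    proxy = final_x / (LEVEL_LEN - 1)         # proxy objective: rightward progress
    return intended, proxy

# Training levels: the coin is always at the end, so proxy and intended agree.
train = [rollout_go_right(coin_pos=LEVEL_LEN - 1) for _ in range(1000)]
# OOD test levels: the coin is placed at a random square, so they come apart.
test = [rollout_go_right(coin_pos=rng.integers(0, LEVEL_LEN)) for _ in range(1000)]

print("mean intended reward, train:", np.mean([i for i, _ in train]))  # 1.0
print("mean intended reward, test: ", np.mean([i for i, _ in test]))   # ~0.1
print("mean proxy reward, test:    ", np.mean([p for _, p in test]))   # still 1.0
```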
## Experiments
- they run four experiments: Coin Run, Maze I, Maze II, and Keys & Chests;
- they hypothesize a behavioral (proxy) objective that the policy has learned;
- it's possible the proxy objective was different, but the authors also run additional experiments that allow them to, e.g., confirm the 'move right' hypothesis over the 'move to the wall' hypothesis (see the evaluation sketch after the list below);
- Different kinds of failure.
- *Directional proxies* (CoinRun): the agent learns to move to the right instead of to the true source of reward (the coin).
- *Location proxies* (CoinRun, Maze I): In Maze I, the agent learns to navigate to the upper right corner instead of to the true source of reward (the cheese).
- *Observation ambiguity* (Maze II): The observations contain multiple features that identify the goal state, which come apart in the OOD test distribution.
	- *Instrumental goals* (Keys and Chests): The agent learns an objective (collecting keys) that is only instrumentally useful for the intended objective (opening chests).
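- A hedged sketch of how one could test competing behavioral hypotheses such as 'move right' vs. 'move to the coin' by rolling out the trained policy on OOD levels; `make_ood_coinrun_env`, `policy.act`, and the `info` fields are hypothetical placeholders, since the paper uses its own modified procgen CoinRun:

```python
# Hedged sketch of testing competing behavioral hypotheses ("move right" vs.
# "move to the coin") for a trained agent on OOD levels. `make_ood_coinrun_env`,
# `policy.act`, and the `info` keys are hypothetical placeholders, not the
# paper's API: the authors use a modified procgen CoinRun in which the coin is
# placed at a random position instead of at the end of the level.
def classify_ood_episodes(policy, make_ood_coinrun_env, n_episodes=500):
    reached_coin = reached_right_end = 0
    for _ in range(n_episodes):
        env = make_ood_coinrun_env()      # level with the coin away from the end
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, info = env.step(policy.act(obs))
        # `info` is assumed to expose final agent / coin / level-end x-positions
        if info["agent_x"] == info["coin_x"]:
            reached_coin += 1             # evidence for "move to the coin"
        elif info["agent_x"] >= info["level_end_x"]:
            reached_right_end += 1        # evidence for "move right"
    return reached_coin / n_episodes, reached_right_end / n_episodes
```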
### Critic Generalization vs. Actor-Critic Generalization (Section 3.4)
- PPO (an actor-critic method) was used: the policy ("actor") is trained using advantage estimates derived from the approximate value function learned by the "critic".
- in CoinRun both "actor" and "critic" fail but **in different ways**:
- the critic misgeneralizes, assigning high value to the proxy (being at the end of the level) instead of the intended objective (coin);
	- the actor doesn't learn the same proxy: the critic assigns the highest value to the state "just before the wall", but the policy passes through the wall 100% of the time;
- the actor misgeneralizes by *learning a non-robust proxy of a non-robust proxy*
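- A hedged sketch of the kind of probe behind this comparison: record the critic's value estimates along an OOD rollout and compare where they peak with where the actor actually ends up (`model.value`, `model.act`, and `env.agent_position` are assumed handles, not the paper's API):

```python
# Hedged sketch of probing actor vs. critic generalization on an OOD level:
# record the critic's value estimates along a rollout and compare where they
# peak with where the actor actually goes. `model.value`, `model.act`, and
# `env.agent_position` are hypothetical handles, not the paper's API.
import numpy as np

def probe_actor_vs_critic(model, env):
    obs, done = env.reset(), False
    values, positions = [], []
    while not done:
        values.append(model.value(obs))           # critic's state-value estimate
        positions.append(env.agent_position())    # assumed accessor for the agent's x-position
        obs, _, done, _ = env.step(model.act(obs))
    # If the critic has learned the "end of level" proxy, the value peak will
    # sit just before the wall; the actor may still move past that state.
    peak_idx = int(np.argmax(values))
    return positions[peak_idx], positions[-1]
```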
### Measuring Agency
- the set of possible reward functions is $\mathcal{R} = \{R_s \mid s \in S\}$, where $S$ is the set of accessible squares in the gridworld and $R_s(s') = 1$ if $s' = s$ and $0$ otherwise;
- following Orseau et al. (2018), they compute $p_{dev}(\tau)$ and $p_{agt}(\tau)$ for the observed trajectories;
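- A simplified sketch of how such an agents-vs-devices comparison could be computed for this reward set, assuming a uniform prior over goal squares, a user-supplied goal-conditioned policy model (e.g. softmax over negative distance-to-goal) on the agent side, and a uniform-random policy as the device model; this simplifies the mixtures Orseau et al. (2018) actually use:

```python
# Simplified sketch (not the paper's exact procedure) of the agents-and-devices
# comparison of Orseau et al. (2018) for the gridworld reward set above.
# Assumptions: uniform prior over goal squares, a user-supplied goal-conditioned
# policy model on the agent side, and a uniform-random policy as the device model.
import numpy as np

def log_p_dev(actions, n_actions=4):
    # device model: every action chosen uniformly at random
    return len(actions) * np.log(1.0 / n_actions)

def log_p_agt(states, actions, goal_squares, goal_policy_prob):
    """goal_policy_prob(goal, state, action) -> probability that an agent
    pursuing R_goal takes `action` in `state`; uniform prior over goals."""
    per_goal = [
        sum(np.log(goal_policy_prob(g, s, a)) for s, a in zip(states, actions))
        for g in goal_squares
    ]
    m = max(per_goal)                             # log-sum-exp for the mixture
    return m + np.log(np.mean(np.exp(np.array(per_goal) - m)))

# A trajectory is judged more "agent-like" than "device-like" when
# log_p_agt(...) > log_p_dev(...), matching the definition above.
```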
### Related Work
...