---
tags: Concepts
---
Hello, @oudeyer , @sauzeon, @Maxime_Balan. I have started to brainstorm on how we can measure "controllability" in the Lunar Lander game. As @oudeyer and also @gkovac suggested, I explored the construct of Empowerment (https://arxiv.org/abs/1310.1863) and tried to see how we can apply it to our study. Here are my thoughts:
# Empowerment in Lunar Lander
## Intuition
**Empowerment** is an information-theoretic formalism that measures an agent's ability to change its own future. If the agent is stuck in a trap and cannot get free no matter what it does, it is in the least empowered state. On the other hand, if the agent is in the middle of an empty room, it is maximally empowered (within that room) because it can reach any point in that room. Formally, empowerment is the channel capacity from the agent's actuators to its sensors: it measures how much the agent's actions can, at most, influence what it subsequently senses.

In the maze environment above, the most empowered state for the agent is at $x = y = 5$ (the brightest square) because the agent can reach many states from that position. Note that this may not be the best position for reaching a specific state, e.g., $(0, 0)$, but it is a position from which the agent can reach many more states quickly compared to, say, $(0, 1)$, from which $(0, 0)$ is close but most other states require many actions to reach.
The maze example demonstrates how *specific* empowerment -- calculated for any given state of the world -- can be used to determine a good position for the agent to be in. If some food appeared at a random coordinate in the maze, the agent would be better positioned to get it from the most empowered state (on average) compared to less empowered states.
My intuition tells me that the most empowered state in the Lunar Lander game would be somewhere in the top-middle of the display. However, we don't ask people to seek out the most empowered state in our task, so we are not interested in which states people would seek if they wanted to feel most empowered. Still, a slight variation of the concept of empowerment can be useful to gauge how people progress in the task.
## Formalism
Formally, empowerment is the *channel capacity* from the agent's actuators to its sensors, i.e., the maximum *mutual information* that the agent's actions can communicate about the future state of the world. **In Lunar Lander, this communication channel is not fixed but becomes more efficient over time. Note that we are not so much interested in the maximal capacity as in how the channel's efficiency changes with practice.** (This means that we are not going to compute empowerment strictly speaking, but simply the mutual information between controlled actions and their sensory effects.)
Mutual information between $X$ and $Y$ can be written as:
$$ I(X;Y) = H(X) + H(Y) - H(X,Y) $$
or
$$ I(X;Y) = H(X) - H(X|Y) $$
and
$$ I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x, y)}{p(x)p(y)} $$
Mutual information tells us the reduction in the entropy of $X$ that comes from knowing $Y$. It can also be seen as the expectation, over the joint distribution of $X$ and $Y$, of the log-ratio between the joint probability of $x$ and $y$ and the product of their marginal probabilities. In other words, it tells us how much more likely, on average, it is to observe $x$ and $y$ together under their joint distribution $p(x, y)$ than under the assumption that they are independent, i.e., under $p(x)\,p(y)$.
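As a quick sanity check of these formulas, here is a minimal sketch (plain NumPy; the joint distribution and the function names are made up for illustration) verifying on a small discrete distribution that the entropy-based and the sum-based expressions agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) in bits, computed directly from the sum over the joint distribution p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # terms with p(x, y) = 0 contribute nothing
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# a small made-up joint distribution over (x, y)
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])

i_sum = mutual_information(p_xy)
i_entropies = entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy.ravel())
print(i_sum, i_entropies)  # the two values should coincide
```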
The empowerment theory suggests that random variables $X$ and $Y$ can stand for sensory effects $S_{t+1}$ (the uncertain variable) and motor actions $A_t$ (the controlled variable), so we can reformulate the above equation as follows:
$$I(A_t; S_{t+1}) = H(A_t) + H(S_{t+1}) - H(A_t, S_{t+1}) $$
and
$$ I(A_t;S_{t+1}) = \sum_{a,s'} p(a,s') \log \frac{p(a, s')}{p(a)p(s')} $$
Then $I(A_t;S_{t+1})$ tells us how much more likely it is, on average, to observe state $s'$ after taking action $a$ when $A_t$ and $S_{t+1}$ are treated as jointly distributed than when they are assumed to be independent.
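To make the two extremes of this quantity concrete, here is a toy sketch (not our game's dynamics; everything below is made up for illustration). With four equiprobable actions, a step whose outcome is fully determined by the action gives $I(A_t;S_{t+1}) = H(A_t) = 2$ bits, while a step whose outcome ignores the action gives roughly 0 bits:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
actions = rng.integers(0, 4, size=10_000)                 # four equiprobable actions

# fully controllable: each action deterministically produces its own outcome
outcomes_controlled = actions.copy()
# uncontrollable: the outcome ignores the action entirely
outcomes_random = rng.integers(0, 4, size=10_000)

to_bits = 1 / np.log(2)                                   # sklearn returns nats
print(mutual_info_score(actions, outcomes_controlled) * to_bits)  # close to 2.0
print(mutual_info_score(actions, outcomes_random) * to_bits)      # close to 0.0
```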
## Computing empowerment in the Lunar Lander game
Let's brainstorm how we can express actions and states in the Lunar Lander formally so that we can apply the empowerment framework to our use case. Suppose we logged action and state variables in discrete time steps of some constant duration, e.g., 100 milliseconds.
* **Actions**. For each time step $t$, we can encode $A_t$ as the total duration of applying each of the 4 available actions (`F` for CCW rotation, `J` for CW rotation, `space` for propelling, and `0` for doing nothing). Thus $\vec{a} = [d_{\textsf{F}}, d_{\textsf{J}}, d_{\textsf{space}}, d_{\emptyset}]$.
* **States**. Applying a set of actions during time step $t$ will result in some (approximately) linear and angular displacement of the ship. Thus, we can encode $S_{t+1}$ for each time step $t$ as the relative positional and angular displacement, i.e., the positional and angular velocity over the step: $\vec{s'}= [\Delta x, \Delta y, \Delta \theta]$.
A visual representation of this encoding scheme is presented below (note that the frame indices in the trajectory panel on top should start with 0 to be consistent with the subpanels below):
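Here is a rough sketch of how this encoding could look in code. I'm assuming, hypothetically, that we log every key press as a (press time, release time) interval and the ship's pose $(x, y, \theta)$ at each step boundary; the names and the logging format below are placeholders, not our actual pipeline:

```python
import numpy as np

KEYS = ["F", "J", "space"]   # CCW rotation, CW rotation, thrust

def encode_actions(key_intervals, step_start, step_end):
    """a_t: how long each key was held within [step_start, step_end),
    plus the remaining 'doing nothing' time.
    `key_intervals` maps a key to a list of (press_time, release_time) pairs."""
    durations = []
    for key in KEYS:
        held = 0.0
        for t_on, t_off in key_intervals.get(key, []):
            # overlap of this press with the current time step
            held += max(0.0, min(t_off, step_end) - max(t_on, step_start))
        durations.append(held)
    # rough: treats simultaneous key presses as if they were sequential
    idle = max(0.0, (step_end - step_start) - sum(durations))
    return np.array(durations + [idle])

def encode_state(pose_start, pose_end):
    """s'_t: displacement over the step, (dx, dy, dtheta)."""
    return np.asarray(pose_end, dtype=float) - np.asarray(pose_start, dtype=float)
```

Applying these two functions to consecutive 100 ms windows of a trial would give us one $(\vec{a}, \vec{s'})$ pair per time step.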

If we apply the formula for $I(A_t; S_{t+1})$ above, we can compute the average mutual information between these short-timescale actions and the immediately resulting states, estimated over the time steps of a trial.
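Since both $\vec{a}$ and $\vec{s'}$ are continuous, one simple option is to discretize them and use a plug-in estimate of the mutual information over the resulting symbol pairs. A minimal sketch under that assumption (bin counts, names, and the estimator choice are placeholders; for the real analysis we would probably want a less biased estimator, e.g., a k-nearest-neighbour one):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(vectors, n_bins=5):
    """Map each continuous vector (one row per time step) to a single integer symbol
    by cutting every dimension into `n_bins` equal-width bins over its observed range."""
    vectors = np.asarray(vectors, dtype=float)
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)                       # avoid division by zero
    bins = np.clip(((vectors - lo) / width * n_bins).astype(int), 0, n_bins - 1)
    # combine the per-dimension bin indices into one symbol per time step
    return np.ravel_multi_index(bins.T, (n_bins,) * vectors.shape[1])

def trial_mi_bits(actions, states, n_bins=5):
    """Plug-in estimate of I(A_t; S_{t+1}) in bits for one trial.
    `actions` has shape (n_steps, 4) and `states` has shape (n_steps, 3), as encoded above."""
    a_sym = discretize(actions, n_bins)
    s_sym = discretize(states, n_bins)
    return mutual_info_score(a_sym, s_sym) / np.log(2)            # sklearn returns nats
```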
As I was drawing this diagram, I noticed that a good forward model for predicting the consequences of one's actions would require context information from the previous step (denoted as $\vec{c}$ in the drawing). However, the construct of empowerment does not make use of such information. This means that if the game physics were too strong relative to the lander's own forces (e.g., very strong gravity and wind), there would be no empowerment. On the other hand, if the game had no gravity or wind, the movement of the ship would be completely predictable from the player's actions, and the player would have maximum empowerment. This is consistent with the construct. Our game is somewhere in between. Eventually, the players must learn to control the duration of button presses in order to reliably steer and move the ship, and my intuition is that as they learn to do this, their empowerment should increase.
We can measure this kind of empowerment on every trial. **Increases in empowerment across trials can be interpreted as learning progress and thus can be used as predictors of subjective improvement judgments.** I am not entirely sure whether this measure reflects the quality of the forward model only, or whether it also captures the quality of the inverse model. From some relatively recent cognitive literature I've read, forward and inverse models are learned separately, and it might even be the case that learning the inverse model requires learning a forward model first.
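If this pans out, the per-participant analysis could be as simple as the following sketch (the per-trial values below are made up; in practice they would come from something like the `trial_mi_bits` sketch above):

```python
import numpy as np

# one mutual-information estimate (bits) per trial, in trial order;
# the values here are made up -- in practice they would come from trial_mi_bits above
per_trial_mi = np.array([0.40, 0.50, 0.45, 0.70, 0.80, 0.85, 1.00])

trials = np.arange(len(per_trial_mi))
slope, _ = np.polyfit(trials, per_trial_mi, deg=1)
print(f"empowerment gain per trial: {slope:.3f} bits")  # candidate predictor of improvement judgments
```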