---
title: "FeUdal Networks for Hierarchical Reinforcement Learning - Notes"
tags: "IvLabs, RL"
---

# FeUdal Networks for Hierarchical Reinforcement Learning

Link to the [Research Paper](https://arxiv.org/abs/1703.01161)

{%pdf https://arxiv.org/pdf/1703.01161.pdf%}

Feudal Reinforcement Learning
- Gains power and efficacy by decoupling end-to-end learning across multiple levels - different resolutions of time
- Encourages the emergence of sub-policies associated with different goals

# Introduction

- Long-term credit assignment is a challenge
- Feudal Reinforcement Learning
    - Levels of hierarchy within an agent communicate via explicit goals
    - Goal setting can be decoupled from goal achievement
- Manager - sets goals at a lower temporal resolution
- Worker - operates at a higher temporal resolution and produces primitive actions - conditioned on the goals it receives from the Manager
- No gradients propagate between Worker and Manager
- Manager - learns to select latent goals that maximise extrinsic reward

Contributions
- A consistent, end-to-end differentiable model that embodies the principles of FRL
- An approximate transition policy gradient update for training the Manager - exploits the semantic meaning of the goal
- Use of goals that are directional rather than absolute
- Dilated LSTM - extends the longevity of the recurrent state memories and allows gradients to flow through large hops in time

# Related Work

Two-level hierarchy (options framework)
- Option - a sub-policy with a termination condition
- Policy over options - queried when the termination condition is met
- Option-Critic - learns options jointly with a policy-over-options in an end-to-end fashion by extending the policy gradient theorem to options
- When options are learned end-to-end they tend to degenerate to one of two trivial solutions
    - One active option that solves the whole task
    - A policy-over-options that changes options at every step - micro-managing the behaviour
- Difference in FuN - the top level produces a meaningful and explicit goal for the bottom level to achieve

Auxiliary Losses and Rewards
- Pseudo-count based auxiliary rewards for exploration - encourage visiting new parts of the state space
- Unsupervised auxiliary tasks - refine internal representations

# The Model

FuN - a modular neural network with two modules
- Manager
    - Latent state representation - $s_t$
    - Outputs goal vector - $g_t$
- Worker
    - Input - external observation, its own state and the Manager's goal
    - Output - actions
- Manager and Worker share a perceptual module - takes the observation $x_t$ from the environment and computes a shared intermediate representation $z_t$
- Manager's $g_t$ is trained using an approximate transition policy gradient
- Dynamics

$$
\begin{align}
z_t &= f^{\text{percept}}(x_t) \\
s_t &= f^{\text{Mspace}}(z_t) \\
h_t^M, \hat g_t &= f^{\text{Mrnn}}(s_t, h^{M}_{t-1}) \\
g_t &= \hat g_t / \|\hat g_t\| \\
w_t &= \phi\left(\underset{i = t-c}{\overset{t}{\sum}}g_i\right) \\
h^W_t, U_t &= f^{\text{Wrnn}}(z_t, h^{W}_{t-1})\\
\pi_t &= \text{SoftMax}(U_t w_t)
\end{align}
$$

- rnn - Recurrent Neural Network

## Goal embedding

- Goals modulate the policy via a multiplicative interaction in a low-dimensional goal-embedding space $\Bbb R^{k}$, $k \ll d$
- Worker produces an embedding vector for every action - rows of $U\in \Bbb R^{|a|\times k}$
- For goals - the last $c$ goals are first pooled by summation and then embedded into a vector $w_t \in \Bbb R^k$ using a linear projection $\phi$
    - No bias - learnt via gradients from the Worker's actions
    - No bias means $\phi$ can never produce a constant non-zero vector - so it can never ignore the Manager's input
- Due to pooling of goals over several time-steps - conditioning from the Manager varies smoothly
- A minimal sketch of this goal-conditioned action scoring is given below
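To make the goal embedding concrete, here is a minimal sketch of the $\pi_t = \text{SoftMax}(U_t w_t)$ computation. The use of PyTorch, the class name `GoalEmbedding`, and the tensor shapes are my own assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalEmbedding(nn.Module):
    """Sketch of the Worker's goal-conditioned action scoring: pi_t = SoftMax(U_t w_t)."""

    def __init__(self, d: int, k: int):
        super().__init__()
        # phi: linear projection of the pooled goal into R^k, deliberately bias-free
        # so it can never output a constant non-zero vector (never ignores the Manager).
        self.phi = nn.Linear(d, k, bias=False)

    def forward(self, goals: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        # goals: (c, d)   - the Manager's last c goal vectors g_{t-c} .. g_t
        # U:     (|a|, k) - Worker's per-action embedding matrix U_t
        w_t = self.phi(goals.sum(dim=0))   # pool by summation, then project: w_t in R^k
        logits = U @ w_t                   # multiplicative interaction U_t w_t
        return F.softmax(logits, dim=-1)   # pi_t: distribution over primitive actions

# Usage with hypothetical sizes
c, d, k, num_actions = 10, 256, 16, 6
module = GoalEmbedding(d, k)
goals = F.normalize(torch.randn(c, d), dim=-1)   # unit-norm goals g_i
U_t = torch.randn(num_actions, k)                # produced by the Worker's LSTM
pi_t = module(goals, U_t)
```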
## Learning

- FuN produces a distribution over possible actions
- FuN is fully differentiable - we could train it end-to-end with a policy gradient algorithm operating on the actions taken by the Worker
    - The Manager would then be trained by gradients coming from the Worker
    - This would deprive the Manager's goals $g_t$ of any semantic meaning
- Instead - independently train the Manager to predict advantageous directions and reward the Worker for following these directions

$$\nabla g_t = A_t^M\nabla_\theta d_{\text{cos}}(s_{t+c} - s_t, g_t(\theta))$$

- Manager's advantage function - $A_t^M = R_t - V_t^M(x_t,\theta)$
- Cosine similarity between two vectors - $d_{\text{cos}}(\alpha, \beta) = \alpha^T\beta/(|\alpha||\beta|)$
- $g_t$ acquires a semantic meaning - an advantageous direction - at a horizon $c$, which defines the temporal resolution of the Manager

Intrinsic reward for the Worker

$$r_t^I = \frac{1}{c} \underset{i = 1}{\overset{c}{\sum}}d_{\text{cos}}(s_t-s_{t-i}, g_{t-i})$$

- Directions are more feasible for the Worker - it can reliably cause directional shifts in the latent state
- Directions also give a degree of invariance and structural generalisation
- FRL advocated completely concealing the environment reward from the lower levels of the hierarchy
- In practice - the environment reward is added to the intrinsic reward - so the Worker retains the environment reward
- Worker and Manager can have different discount factors $\gamma$ - the Worker can be greedy and focused on immediate rewards while the Manager considers the long-term perspective

## Transition Policy Gradients

- Manager update rule - a policy gradient with respect to a model of the Worker's behaviour
- Transition policy - $\pi^{\text{TP}}(s_{t+c}|s_t) = p(s_{t+c}|s_t,\mu(s_t,\theta))$
- Assume the state always transitions to the end state picked by the transition policy - $s_{t+c} = \pi^{\text{TP}}(s_t)$
- We can then apply the policy gradient theorem to the transition policy $\pi^{\text{TP}}$

$$\nabla_\theta\pi_t^{\text{TP}} = \Bbb E \left[(R_t - V(s_t))\nabla_{\theta}\text{log }p(s_{t+c}|s_t,\mu(s_t,\theta))\right]$$

- The Worker may follow a complex trajectory - a naive policy gradient would have to learn from these samples - but if the end state is known we can skip over the Worker's behaviour and follow the policy gradient of the predicted transition instead
- FuN assumes a transition model in which the direction in state space, $s_{t+c} - s_t$, follows a von Mises-Fisher distribution
- Under this model the Worker's intrinsic reward corresponds to the log-likelihood of the state trajectory

# Architecture details

- Perceptual module $f^{\text{percept}}$ - CNN + fully connected network
- Manager state space $f^{\text{Mspace}}$ - fully connected network
- Worker's recurrent network $f^{\text{Wrnn}}$ - LSTM
- Manager's recurrent network $f^{\text{Mrnn}}$ - dilated LSTM

## Dilated LSTM

- At each time step only the corresponding part of the state is updated and the output is pooled across the previous $c$ outputs - a minimal sketch is given at the end of these notes

# Experiments

- Goal - show that FuN learns non-trivial, helpful and interpretable sub-policies and sub-goals, and also validate the components of the architecture
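As referenced in the Dilated LSTM section above, here is a minimal sketch of the idea: $r$ separate LSTM state groups, only one updated per step, with the output pooled over the last $c$ steps. The use of PyTorch, a single shared `nn.LSTMCell` across groups, and sum-pooling are my own assumptions; the paper's implementation details may differ.

```python
import torch
import torch.nn as nn
from collections import deque

class DilatedLSTM(nn.Module):
    """Sketch of a dilated LSTM: r independent state groups, one updated per time step,
    with the output pooled (summed here) over the last c outputs."""

    def __init__(self, input_size: int, hidden_size: int, r: int = 10, c: int = 10):
        super().__init__()
        self.r, self.c, self.hidden_size = r, c, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)  # weights shared across groups

    def init_state(self, batch: int):
        # r separate (h, c) state groups plus a buffer of recent outputs for pooling
        h = [torch.zeros(batch, self.hidden_size) for _ in range(self.r)]
        cst = [torch.zeros(batch, self.hidden_size) for _ in range(self.r)]
        return h, cst, deque(maxlen=self.c)

    def forward(self, x, t, state):
        h, cst, outputs = state
        i = t % self.r                                  # only group t mod r is updated
        h[i], cst[i] = self.cell(x, (h[i], cst[i]))
        outputs.append(h[i])
        pooled = torch.stack(list(outputs)).sum(dim=0)  # pool over the last c outputs
        return pooled, (h, cst, outputs)

# Usage: feed the Manager's state s_t one step at a time (hypothetical sizes)
dlstm = DilatedLSTM(input_size=256, hidden_size=256)
state = dlstm.init_state(batch=1)
for t in range(100):
    s_t = torch.randn(1, 256)
    g_hat_t, state = dlstm(s_t, t, state)
```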