---
title : "FeUdal Networks for Hierarchical Reinforcement Learning - Notes"
tags : "IvLabs, RL"
---
# FeUdal Networks for Hierarchical Reinforcement Learning
Link to the [Research Paper](https://arxiv.org/abs/1703.01161)
{%pdf https://arxiv.org/pdf/1703.01161.pdf%}
Feudal Reinforcement Learning - gains power and efficacy by decoupling end-to-end learning across multiple levels - each operating at a different resolution of time
Encourages the emergence of sub-policies associated with different goals
# Introduction
- Long term credit assignment is a challenge
- Feudal Reinforcement Learning
- Levels of hierarchy within an agent communicate via explicit goals
- Goal setting can be decoupled from goal achievement
Manager - sets goals at a lower temporal resolution
Worker - Operates at a higher temporal resolution and produces primitive actions - conditioned on the goals it receives from the Manager
- No gradients propagate between Worker and Manager
- Manager - learns to select a latent goal that maximises extrinsic reward
Contributions
- A consistent, end-to-end differentiable model that embodies and generalises the principles of FRL
- Approximate transition policy gradient update for training the Manager - exploits the semantic meaning of the goals it produces
- Use of goals that are directional rather than absolute
- Dilated LSTM - extends the longevity of the recurrent state memories and allows gradients to flow through large hops in time
# Related Work
Two level hierarchy
Option - sub-policy with a termination condition
Policy over options - queried only when the current option terminates
Option-Critic - learns options jointly with a policy-over-options in an end-to-end fashion by extending the policy gradient theorem to options
When options are learned end-to-end they tend to degenerate to one of two trivial solutions
- One active option that solves the whole task
- Policy-over-option that changes options at every step - micro-managing the behaviour
Difference
- Top level produces a meaningful and explicit goal for the bottom level to achieve
Auxiliary Losses and Rewards
- Pseudo-count based auxiliary rewards for exploration - new parts of state space
Unsupervised Auxiliary Tasks
- Refine internal representations
# The Model
FuN
- Modular NN
- Two Modules
- Manager
- Latent State Representation - $s_t$
- Outputs Goal Vector - $g_t$
- Worker
- Input - External Observation, own state and the Manager's goal
- Output - Actions
- Manager and Worker - share a perceptual module - takes the observation from the environment $x_t$ and computes a shared intermediate representation $z_t$
- Manager's $g_t$ is trained using an approximate transition policy gradient
- Dynamics
$$\displaystyle\begin{align}
z_t &= f^{\text{percept}}(x_t) \\
s_t &= f^{\text{Mspace}}(z_t) \\
h_t^M, \hat g_t &= f^{\text{Mrnn}}(s_t, h^{M}_{t-1}) \\
g_t &= \hat g_t / \|\hat g_t\| \\
w_t &= \phi\Big(\underset{i = t-c}{\overset{t}{\sum}}g_i\Big) \\
h^W_t, U_t &= f^{\text{Wrnn}}(z_t, h^{W}_{t-1})\\
\pi_t &= \text{SoftMax}(U_tw_t)
\end{align}
$$
- rnn - Recurrent Neural Network
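A minimal PyTorch sketch of these dynamics (the sizes `d`, `k`, `c`, the number of actions, and the fully connected stand-in for the perceptual CNN are assumptions for illustration; the paper's Manager uses a dilated LSTM rather than the plain `LSTMCell` below):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, c, n_actions = 256, 16, 10, 6      # assumed sizes: latent dim d, embedding dim k, horizon c, |a|

f_percept = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, d), nn.ReLU())   # stand-in for the CNN encoder
f_Mspace  = nn.Linear(d, d)                        # s_t = f_Mspace(z_t)
f_Mrnn    = nn.LSTMCell(d, d)                      # plain LSTM cell here; the paper uses a dilated LSTM
f_Wrnn    = nn.LSTMCell(d, n_actions * k)          # Worker RNN; its output is reshaped into U_t
phi       = nn.Linear(d, k, bias=False)            # goal projection phi (no bias)

def fun_step(x_t, hM, hW, goal_history):
    z_t = f_percept(x_t)                                   # z_t = f_percept(x_t)
    s_t = f_Mspace(z_t)                                    # s_t = f_Mspace(z_t)
    hM  = f_Mrnn(s_t, hM)                                  # h^M_t; hM[0] plays the role of g_hat_t
    g_t = F.normalize(hM[0], dim=-1)                       # g_t = g_hat_t / ||g_hat_t||
    goal_history.append(g_t)
    w_t = phi(torch.stack(goal_history[-(c + 1):]).sum(0)) # pool the last c goals, embed into R^k
    hW  = f_Wrnn(z_t, hW)                                  # h^W_t; hW[0] reshaped gives U_t
    U_t = hW[0].view(-1, n_actions, k)
    pi_t = F.softmax(torch.einsum('bak,bk->ba', U_t, w_t), dim=-1)   # pi_t = SoftMax(U_t w_t)
    return pi_t, s_t, g_t, hM, hW

# one step on a dummy 84x84 observation
x = torch.zeros(1, 84, 84)
pi, s, g, hM, hW = fun_step(x, None, None, [])
```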
## Goal embedding
- Goal modulates via a multiplicative interaction in a low dimensional goal-embedding space $R^{k}$, $k \ll d$
- Worker produces embedding vector for every action - $U\in R^{|a|\times k}$
- For goals - the last $c$ goals are first pooled by summation and then embedded into a vector $w \in R^k$ using a linear projection $\phi$ - no bias - learnt with gradients coming from the Worker's actions
- No bias - $\phi$ can never produce a constant non-zero vector, so the Worker can never learn to ignore the Manager's input
- Due to pooling of goals over several time-steps - conditioning from the Manager varies smoothly
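A small sketch of just the goal-embedding path, to make the no-bias point concrete (sizes and names are assumed):
```python
import torch
import torch.nn as nn

d, k, n_actions, c = 256, 16, 6, 10           # assumed sizes

phi = nn.Linear(d, k, bias=False)             # linear projection phi with no bias term

def action_logits(U_t, recent_goals):
    """U_t: (|a|, k) action embeddings; recent_goals: the last c+1 goals, each in R^d."""
    w_t = phi(torch.stack(recent_goals).sum(0))    # pool by summation, then embed into R^k
    return U_t @ w_t                               # multiplicative interaction with the goal embedding

# With bias=False an all-zero goal history gives w_t = 0, zero logits and a uniform policy,
# so the Worker cannot ignore the Manager by absorbing the goal into a constant offset.
logits = action_logits(torch.randn(n_actions, k), [torch.zeros(d)] * (c + 1))
```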
## Learning
- FuN produces a distribution over possible actions
- FuN - fully differentiable
- We can train it end-to-end
- Using a policy gradient algorithm operating on the actions taken by the Worker
- Manager would be trained by gradients coming from the Worker
- This would deprive the Manager's goals g of any semantic meaning
- Independently train Manager to predict advantageous directions - and reward Worker to follow these directions
$\nabla g_t = A_t^M\nabla_\theta d_{\text{cos}}(s_{t+c} - s_t, g_t(\theta))$
- Manager's Advantage Function
$A_t^M = R_t - V_t^M(x_t,\theta)$
- Cosine similarity between two vectors
$d_{\text{cos}}(\alpha, \beta) = \alpha^T\beta/(|\alpha||\beta|)$
- $g_t$ acquires a semantic meaning - advantageous direction - at a horizon c - defines the temporal resolution of the Manager
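A hedged sketch of how this update could be implemented as a loss over a rollout (detaching the latent states and the form of the advantage estimate are simplifications, not necessarily the paper's exact treatment):
```python
import torch
import torch.nn.functional as F

def manager_loss(s, g, advantage_M, c):
    """s: (T, d) latent states, g: (T, d) goals g_t(theta), advantage_M: (T - c,) values of A^M_t."""
    direction = (s[c:] - s[:-c]).detach()              # s_{t+c} - s_t, treated as a constant target here
    cos = F.cosine_similarity(direction, g[:-c], dim=-1)
    return -(advantage_M.detach() * cos).mean()        # gradient ascent on A^M_t * d_cos(...)
```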
- Intrinsic reward for Worker
$\qquad r_t^I = 1/c \underset{i = 1}{\overset{c}{\sum}}d_{\text{cos}}(s_t-s_{t-i}, g_{t-i})$
- Directions - more feasible for Workers - reliably cause directional shifts in the latent state
- It also gives a degree of invariance and structural generalisation
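A direct transcription of the intrinsic reward into code (assuming the latent states and goals for an episode are stored as `(T, d)` tensors and `t >= c`):
```python
import torch
import torch.nn.functional as F

def intrinsic_reward(s, g, t, c):
    """r^I_t = 1/c * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i}); requires t >= c."""
    cos = [F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=0) for i in range(1, c + 1)]
    return torch.stack(cos).mean()
```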
- FRL - advocated completely concealing the environment reward from the lower levels of the hierarchy
- In practice - the environment reward is retained - the Worker is trained to maximise $R_t + \alpha R_t^I$, where $\alpha$ is a hyperparameter regulating the influence of the intrinsic reward
- Worker and Manager can have different discount factors $\gamma$ - the Worker can be greedy and focus on immediate rewards while the Manager considers a long-term perspective
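A toy sketch of how the two reward streams and discounts might be combined ($\alpha$ and the $\gamma$ values below are assumed, not the paper's settings):
```python
def discounted_returns(rewards, gamma):
    """Standard discounted return, computed backwards over an episode."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

env_rewards  = [0.0, 0.0, 1.0, 0.0]      # toy environment rewards
intr_rewards = [0.2, 0.5, 0.1, 0.3]      # toy intrinsic rewards

alpha, gamma_worker, gamma_manager = 0.8, 0.95, 0.99     # assumed hyperparameters
worker_returns = discounted_returns(
    [r + alpha * ri for r, ri in zip(env_rewards, intr_rewards)], gamma_worker)
manager_returns = discounted_returns(env_rewards, gamma_manager)   # Manager sees only env reward
```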
## Transition Policy Gradients
- Manager update rule - Policy gradient with respect to a model of the Worker's behaviour
- Transition Policy
$\pi^{\text{TP}}(s_{t+c}|s_t) = p(s_{t+c}|s_t,\mu(s_t,\theta))$
- The state always transitions to the end state picked by the transition policy
$s_{t+c} = \pi^{\text{TP}}(s_t)$
- We can apply policy gradient theorem to the transition policy $\pi^{\text{TP}}$
$\nabla_\theta\pi_t^{\text{TP}} = \Bbb E [(R_t - V(s_t))\nabla_{\theta}\text{log }p(s_{t+c}|s_t,\mu(s_t,\theta))]$
- The Worker may follow a complex trajectory to reach $s_{t+c}$ - a vanilla policy gradient would have to learn from those samples - but if the end state is known we can skip over the Worker's behaviour and instead follow the policy gradient of the predicted transition
- FuN's transition model - assumes the direction in state space, $s_{t+c} - s_t$, follows a von Mises-Fisher distribution
- Under this model the Worker's intrinsic reward corresponds (up to a constant) to the log-likelihood of the state trajectory
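Sketch of the connection (paraphrasing the paper's argument, with the von Mises-Fisher concentration parameter $\kappa$ kept implicit):
$$\displaystyle\begin{align}
p(s_{t+c}|s_t, g_t) &\propto e^{\,\kappa\, d_{\text{cos}}(s_{t+c}-s_t,\ g_t(\theta))} \\
\nabla_\theta\text{log }p(s_{t+c}|s_t,\mu(s_t,\theta)) &= \kappa\,\nabla_\theta d_{\text{cos}}(s_{t+c}-s_t, g_t(\theta)) \\
\nabla_\theta\pi_t^{\text{TP}} &\approx \Bbb E[A_t^M\,\kappa\,\nabla_\theta d_{\text{cos}}(s_{t+c}-s_t, g_t(\theta))]
\end{align}
$$
- Up to the constant $\kappa$, this recovers the Manager update $\nabla g_t$ from the Learning section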
# Architecture details
- Perceptual Module $f^{\text{percept}}$
- CNN + Fully Connected Network
- Manager state space $f^{\text{Mspace}}$
- Fully Connected Network
- Worker's recurrent network $f^{\text{Wrnn}}$
- LSTM
- Manager's recurrent network $f^{\text{Mrnn}}$
- dilated LSTM
## Dilated LSTM
- The recurrent state is partitioned into $r$ groups (cores) - at time step $t$ only the core indexed by $t \% r$ is updated - the output is pooled across the previous $c$ outputs
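A PyTorch sketch of a dilated LSTM under these assumptions (the number of cores `r` and summation as the pooling operation are illustrative choices):
```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """r recurrent state cores share one LSTMCell's parameters; at step t only core t % r
    is updated, and the output is pooled (summed here) over the cores' latest outputs."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.r, self.hidden_size = r, hidden_size

    def init_state(self, batch):
        return (torch.zeros(self.r, batch, self.hidden_size),
                torch.zeros(self.r, batch, self.hidden_size))

    def forward(self, x_t, state, t):
        h, c = state
        idx = t % self.r                                 # only this core is updated at step t
        h_new, c_new = self.cell(x_t, (h[idx], c[idx]))
        h, c = h.clone(), c.clone()
        h[idx], c[idx] = h_new, c_new
        return h.sum(dim=0), (h, c)                      # pooled output, updated per-core state

dlstm = DilatedLSTM(input_size=256, hidden_size=256, r=10)
state = dlstm.init_state(batch=1)
out, state = dlstm(torch.zeros(1, 256), state, t=0)
```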
# Experiments
- Goal - to demonstrate that FuN learns non-trivial, helpful and interpretable sub-policies and sub-goals, and to validate the components of the architecture