# Goal conditioned affordances
---
(27 Sept)
### Updates
- Training off-policy (DQN in our case) is much more sensitive to hyperparameters. We will need to tune parameters for the env (we might need to add tricks; e.g. preliminary results show an advantage from using prioritized replay sampling)
- Adding extra trajectories with solved goals ("HER trajectories") can cause problems
    - we must still keep a decent distribution of conditioning goals in the replay buffer, or solve this problem when sampling the training batch (e.g. sample uniformly over all goals, or maybe prioritized sampling is enough). Concrete example of the problem: if some goals are easier to solve, they will be over-represented in the replay buffer
    - it helps if the extra trajectories achieve only the newly set goal. Consider a trajectory (s_0, s_1, ..., s_n) where we achieve a goal **g** at state s_n that is different from the one conditioned on during the rollout. We can add extra transitions conditioned on **g** to the replay buffer, but only from the sub-trajectory starting at s_x, where no goal other than **g** is achieved in (s_x, s_{x+1}, ..., s_n) (see the sketch after this list)
- Training the policy over options at the same time as option-policy learning is very unstable (*preliminary tests*)
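
A minimal sketch of the relabelling rule above (not our exact implementation), assuming a hypothetical `goals_achieved(state)` helper that returns the set of goals achieved at a state, and the intent-completion reward `icr`; only the suffix in which **g** is the sole achieved goal is relabelled:

```python
def relabel_with_achieved_goal(transitions, goals_achieved, icr):
    """transitions: list of (s_t, a_t, s_{t+1}) for one rollout whose final
    state s_n achieves a goal g different from the conditioning goal.
    Returns extra transitions conditioned on g, taken only from the
    sub-trajectory (s_x, ..., s_n) in which no goal other than g is achieved."""
    s_n = transitions[-1][2]
    achieved = goals_achieved(s_n)
    if not achieved:
        return []
    g = next(iter(achieved))  # pick one achieved goal (arbitrary if several)

    # Walk backwards to find s_x: stop once some other goal is achieved.
    start = len(transitions)
    for i in reversed(range(len(transitions))):
        s_t = transitions[i][0]
        if goals_achieved(s_t) - {g}:  # another goal achieved at s_t
            break
        start = i

    extra = []
    for s_t, a_t, s_next in transitions[start:]:
        extra.append((s_t, g, a_t, icr(s_t, a_t, g), s_next))
    return extra
```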
### Proposal
**Analyse** option policy success on a generated test set* during option-policy training - off-policy learning with ~HER (*the off-policy version of the "data collection stage" for option-policy learning*) - no learning of the policy over options
**Compare** random sampling / sampling with soft attention / sampling with hard attention when selecting/changing the goal while collecting a new trajectory from the environment
*general protocol for collecting a test set for evaluating option-policy success. The procedure can be used for any environment and goal definition as long as we have a goal_reached function. (*Defined below*)
- We can scale to more goals
- So far, preliminary results show that we might be able to handle a broader set of environments / goal representations (e.g. goals that can be achieved multiple times in the same env) with the following "condition":
    - we know, for each state, which goals have been achieved
**Generating the test set for option-policy evaluation** (see the sketch after this list)
- Generate a random trajectory in the environment while the desired goal is not reached
- Save the initial env state and the set of actions needed to recreate the environment state up to a step from which the goal can be reached within max_option_steps
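
A minimal sketch of this protocol, assuming a gym-style `env` with `reset()` / `step()`, a snapshot method `env.get_state()` for recreating the episode, and a `goal_reached(state, goal)` predicate (all hypothetical names):

```python
def generate_eval_episode(env, goal, goal_reached, max_option_steps,
                          max_random_steps=500):
    """Random rollout until `goal` is reached; returns (initial env snapshot,
    action prefix) that recreates a state from which `goal` is reachable
    within max_option_steps, or None if the rollout never reaches the goal."""
    state = env.reset()
    init_snapshot = env.get_state()       # used later to recreate the episode
    actions, reached_at = [], None
    for t in range(max_random_steps):
        action = env.action_space.sample()
        state, _, done, _ = env.step(action)
        actions.append(action)
        if goal_reached(state, goal):
            reached_at = t + 1            # number of actions taken so far
            break
        if done:
            break
    if reached_at is None:
        return None                       # discard rollouts that never reach the goal
    # Keep only the prefix stopping at most max_option_steps before the goal.
    prefix = actions[:max(0, reached_at - max_option_steps)]
    return init_snapshot, prefix
```

Replaying `prefix` from `init_snapshot` then gives a start state for evaluating whether the option policy reaches `goal` within `max_option_steps`.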
(TODO)
- Adjust Pseudocode v1 to learn only the option policy
---
(17 Sept)
**Notes**:
- I have described below a possible off-policy version where we learn the policy over options, the option policy, and the affordances at the same time.
- Can we learn the policy over options while learning the affordances? Or do we have to do the 2 stages we did in the paper?
**Goal represented as:** [obj, obj_color, obj_state, obj_picked, obj_picked_color]
*Environment described below*
**Pseudocode v1**:
```javascript=
Given:
• intent completion function icr : S × A × G → Reward
(different from the env. reward; used for the option policy,
equivalent to our intent completion function)
• T - max number of steps per episode
• NΩ, Nω - number of optimization steps per episode (for QΩ and Qω)
• G - set of goals (goal embeddings)
• ω_max_steps - maximum number of steps to run an option for
• GSMDP a subset of G
- Generated by the environment for each new episode
- This can be the entire set G
- we will consider this as the action space for QΩ
• (IC_net) a network for predicting IC_target
- learns discounted intent completion
- (soft/hard) attention is calculated based on IC_net predictions (AFFORDANCE(IC_net(s_0, g)))
• QΩ an off-policy RL algorithm to learn Q_Ω(s, g) values
- g ∈ GSMDP goals
- learns based on env reward
- the "policy" over options learns to maximize enironment Reward
- (replay_Ω) replay memory for Q_Ω
• Qω an off-policy RL algorithm to learn Q_ω(s||g, a) values
- g ∈ G
- the "option policy"
- learns based on icr (intent completion reward)
- (replay_ω) replay memory
• a strategy S for sampling reached goals for replay_ω based on a seq of states
- S(s_i, ..., s_t) returns a list of goals reached in this traj
Initialize QΩ & Qω & IC_net . e.g. initialize neural networks
Initialize replay replay_Ω and replay_ω
def S(GSMDP, si, ..., st):
return list of all goals from GSMDP achieved in this sequence
for e = 1, M do // e episodes
// Run 1 episode in the environment and collect data for QΩ & Qω
GSMDP_e ← subset of goals of G for this episode
Sample an initial state (s_0)
// Sample a goal based on the SMDP Q-values modulated by affordances
g_0 ← π_Ω(s_0) ∝ AFFORDANCE(IC_net(s_0, ω)) q_Ω(s_0, ω) (for ω in GSMDP)
for t = 0, T − 1 do
Sample an action (a_t) using the policy from Qω:
a_t ← π_ω(s_t||g_t) || denotes concatenation
Execute the action (a_t) and observe s_{t+1}, r_env_t
if done:
break
ic_t := icr(s_t, a_t, g_t)
if ic_t > 0 or ω_max_steps reached:
g_{t+1} ← π_Ω(s_{t+1}) ∝ AFFORDANCE(IC_net(s_{t+1}, ω)) q_Ω(s_{t+1}, ω) (for ω in GSMDP)
Store the transition (s_t, g_t, r_env_t, s_{t+1}) in replay_Ω
else:
g_{t+1} = g_t
end for
// Add extra transitions for Qω based on goal achieved in the current episode(HER)
max_step = t
for t = 0, max_step do
ic_t := icr(s_t, a_t, g_t)
Store the transition (s_t||g_t, a_t, ic_t, s_{t+1}||g_t) in replay_ω
Sample a set of additional goals for replay EXTRAG := S(GSMDP, s_{t+1}..s_{max_step})
for g' ∈ EXTRAG do
ic_t' := icr(s_t, a_t, g')
Store the transition (s_t||g', a_t, ic_t', s_{t+1}||g') in replay_ω
end for
end for
// Train q-values
for i = 1, NΩ do
Sample a minibatch B from the replay buffer replay_Ω
Perform one step of optimization using QΩ and minibatch B
for i = 1, Nω do
Sample a minibatch B from the replay buffer replay_ω
Perform one step of optimization using Qω and minibatch B
// Learn affordances
for t = 1, last_n / (minibatch_size) do
Sample batch of (s, g) from replay_ω
Calculate ic_target (based on number of steps until reaching g from s / if reached)
Train IC_net to predict ic_target
end for
end for
```
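
Below is a minimal sketch (one possible reading, not a fixed implementation) of two pieces of the pseudocode: the goal-selection step π_Ω(s) ∝ AFFORDANCE(IC_net(s, ω)) q_Ω(s, ω) with soft or hard attention, and the discounted intent-completion target used to train IC_net. `ic_net`, `q_omega`, the threshold and the temperature are assumed names/hyperparameters; IC_net is assumed to output values in [0, 1].

```python
import numpy as np

def select_goal(state, gsmdp_goals, ic_net, q_omega,
                hard=False, threshold=0.5, temperature=1.0, eps=1e-8):
    """Sample g ~ pi_Omega(. | state): SMDP Q-values modulated by affordances."""
    ic = np.array([ic_net(state, g) for g in gsmdp_goals])  # affordance scores
    q = np.array([q_omega(state, g) for g in gsmdp_goals])  # SMDP Q-values

    # Softmax over Q-values (numerically stabilised).
    w = np.exp((q - q.max()) / temperature)

    if hard:
        # Hard attention: keep only goals predicted as affordable.
        mask = (ic > threshold).astype(np.float64)
        if mask.sum() == 0:        # nothing affordable -> fall back to all goals
            mask = np.ones_like(mask)
        probs = mask * w
    else:
        # Soft attention: weight the Q-value preferences by the affordance scores.
        probs = ic * w

    probs = (probs + eps) / (probs + eps).sum()
    return gsmdp_goals[np.random.choice(len(gsmdp_goals), p=probs)]

def ic_target(steps_to_goal, reached, gamma=0.99):
    """Discounted intent-completion target for IC_net: gamma^k if the goal
    was reached k steps later in the replayed trajectory, 0 otherwise."""
    return gamma ** steps_to_goal if reached else 0.0
```

For the random-sampling baseline in the comparison above, `probs` would simply be uniform over GSMDP (optionally with eps-greedy mixing).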
- Representing the goal such that the (important*) goals are only achievable once still seems weird. Can we find a better goal representation (one that can be generated naturally from the state)?
- *important: e.g. go_to_x_y_and_pick_up_key vs just go_to_x_y
- Note that in this DoorKey environment, defining goals in this way will result in affordances over these goals giving you the desired SMDP policy to reach the environment goal.
- So we could change the environment ([see other suggestions below](#OtherEnvs)) or consider a different set of GSMDP goals, defined below, in order for learning the SMDP policy to be meaningful.
- We could use eps-greedy sampling when choosing the goal while collecting a new episode (eps-greedy also disregarding the affordability constraints). This could give us enough exploration to also drop the constraint that some goals must be defined such that they can be reached only once. In that case we could represent goals simply as [x_coord, y_coord].
## Environment example
**Fully observable NxNx5** (obs[i, j] = [obj_id, obj_color, obj_state, agent_direction, carrying, activity])
**Action space 4**: move-N, move-S, move-E, move-W
**Env Reward**: 1 when achieving the goal (final goal - green box), 0 otherwise
*when moving into the key we pick it up*
*when moving into the door with the key picked up, we open it*
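
A minimal sketch, under assumptions, of what goal_reached / icr could look like for this environment and the goal representation [obj, obj_color, obj_state, obj_picked, obj_picked_color]; the state field names and the None-as-wildcard convention are hypothetical:

```python
def goal_reached(state, goal):
    """goal = [obj, obj_color, obj_state, obj_picked, obj_picked_color];
    goal fields set to None are treated as "don't care"."""
    current = [state["obj"], state["obj_color"], state["obj_state"],
               state["carrying"], state["carrying_color"]]
    return all(g == c for g, c in zip(goal, current) if g is not None)

def icr(state, action, goal):
    """Intent-completion reward for the option policy (not the env reward):
    1 when the conditioning goal is achieved in `state`, 0 otherwise."""
    return 1.0 if goal_reached(state, goal) else 0.0
```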

### OtherEnvs
**We could try more complicated envs**
E.g. "Go to the blue ball"
- here it would be good to have interesting goals affordable multiple times, e.g. go to the door at (x, y)

## HER
Pseudocode from [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495)
```javascript=
Given:
• an off-policy RL algorithm A // e.g. DQN
• a strategy S for sampling goals for replay // e.g. S(s_0, ..., s_T) = m(s_T)
• a reward function r : S × A × G → R // e.g. r(s, a, g) = −[f_g(s) = 0]
Initialize A // e.g. initialize neural networks
Initialize replay buffer R
for episode = 1, M do
Sample a goal (g) and an initial state (s_0).
for t = 0, T − 1 do
Sample an action (a_t) using the behavioral policy from A:
a_t ← π_b(s_t||g) || denotes concatenation
Execute the action (a_t) and observe a new state s_{t+1}
end for
for t = 0, T − 1 do
r_t := r(s_t, a_t, g)
Store the transition (s_t||g, a_t, r_t, s_{t+1}||g) in R // standard experience replay
Sample a set of additional goals for replay G := S(current episode)
for g' ∈ G do
r' := r(s_t, a_t, g')
Store the transition (s_t||g', a_t, r', s_{t+1}||g') in R // HER
end for
end for
for t = 1, N do
Sample a minibatch B from the replay buffer R
Perform one step of optimization using A and minibatch B
end for
end for
```
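
For reference, a minimal Python sketch of the pieces above: the "final"-state strategy S(s_0, ..., s_T) = m(s_T) and the sparse reward r(s, a, g) = −[f_g(s) = 0]; `achieved_goal` stands for the mapping m from states to goals and is an assumed name:

```python
def final_strategy(episode_states, achieved_goal):
    """S(s_0, ..., s_T) = m(s_T): relabel with the goal achieved at the final state."""
    return [achieved_goal(episode_states[-1])]

def sparse_reward(state, action, goal, achieved_goal):
    """r(s, a, g) = -[f_g(s) = 0]: 0 if `goal` is satisfied in `state`, else -1."""
    return 0.0 if achieved_goal(state) == goal else -1.0
```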