---
title: Andres Internship
tags: Templates, Talk
description: View the slide with "Slide Mode".
---
# Andres Internship
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/e_vibZVMQPujDhg72v1dAA?both
---
For paper:
---
- General tool like word embeddings
Variances:
Log-probabilities `generatorN.log_prob(test)` as a function of the offset `test - 100`:

| test − 100 | generator1 | generator10 | generator20 | generator30 | generator50 | generator80 | generator100 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | -0.9189 | -3.2215 | -3.9147 | -4.3201 | -4.8310 | -5.3010 | -5.5241 |
| 1 | -1.4189 | -3.2265 | -3.9159 | -4.3207 | -4.8312 | -5.3010 | -5.5242 |
| 10 | -50.9189 | -3.7215 | -4.0397 | -4.3757 | -4.8510 | -5.3088 | -5.5291 |
| 20 | -200.9189 | -5.2215 | -4.4147 | -4.5424 | -4.9110 | -5.3322 | -5.5441 |
| 50 | -1250.9189 | -15.7215 | -7.0397 | -5.7090 | -5.3310 | -5.4963 | -5.6491 |
| 100 | -5000.9189 | -53.2215 | -16.4147 | -9.8757 | -6.8310 | -6.0822 | -6.0241 |
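The values above look consistent with `torch.distributions.Normal` distributions centered at 100 with scales 1, 10, 20, 30, 50, 80 and 100 (an inference from the numbers, not stated in the notes). A minimal sketch to reproduce the table under that assumption:

```python
import torch
from torch.distributions import Normal

# Assumption: generatorN is Normal(loc=100, scale=N); larger variance flattens
# the penalty for points far from the mean.
scales = (1, 10, 20, 30, 50, 80, 100)
generators = {s: Normal(torch.tensor(100.0), torch.tensor(float(s))) for s in scales}
for offset in (0, 1, 10, 20, 50, 100):
    test = torch.tensor(100.0 + offset)
    print(offset, *(f"{generators[s].log_prob(test).item():.4f}" for s in scales))
```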
---
---
Paper Summaries
-------
---
MONet: Unsupervised Scene Decomposition and Representation
In terms of entities.
A VAE plus a recurrent attention mask to parse a scene into objects. Multiple objects, attention masks.
It's like a masked variational autoencoder, where the mask is learned recurrently.
Result: the reconstruction separates objects and everything else.
An unsupervised way of doing that.
---
COBRA
Continuous control.
Object-based models, curiosity.
Builds object-based models, then does model-based search.
Adversarial: encourages actions for which the transition model is bad.
- MONet for objects
- A transition model (trained with pixel loss)
- An exploration policy that is curious in the sense that it is adversarial to successful transitions, à la Pulkit
- A reward function and planning instead of a policy.
---
Hard but not Impossible: Adversarial Intrinsic Tasks in Symbolic RL
In abstract domains, so no vision, already objects:
The agent sets itself goals through intrinsic motivation, a very neat idea that encourages exploration. Much in the spirit of COBRA: learning in an exploration phase that can then be used in another phase.
Unlike the other one, it is not deep RL but search in a space of theories, which are then used to plan actions to obtain information. Heuristic.
---
Learning World Graph Decompositions to Accelerate RL
Pretrain a VAE to recover actions; this creates a latent space of states.
These are used as goals proposed by a manager, which can also learn traversal between those nodes (like the previous paper).
Ran on MiniGrid.
---
Automated Curricula Through Setter-Solver Interactions
- Goal validity
- Goal feasibility
- Goal coverage
If the valid goals are not trivially observable from the environment, "it may be difficult for the goal-setter to discover the goal structure via a generative loss alone."
The set of possible goals might be tiny compared to the set of expressible goals.
- Validity: likelihood of goals already achieved
Space of goals?
In a way, curiosity is a version of difficulty, but less clean.
Note that repeating goals doesn't encourage exploration.
---
Bootstrapping Conditional GANs for Video Game Level Generation
They use a GAN to generate game levels.
"A picture of a face where one eye
is smudged out is still recognizably as a face, and a sentence
can be agrammatical and misspelled but still readable; these
types of content do not need to function. In contrast, game
levels, in which it is impossible to find the key to the exit are
simply unplayable, and it does not matter how aesthetically
pleasing they are. The same holds true for a ruleset, which
does not specify how characters move, or a car where the
wheels do not touch the ground. In this respect, it is useful to
think of most game content as being more like program code
than like images. The problem is that most generative representations are not intrinsically well-suited to create content
with functional requirements."
---
Causal Induction From Visual Observations for Goal Directed Tasks
Generate causal graph.
The policy uses the causal graph through attention.
---
Plan Arithmetic: Compositional Plan Vectors for Multi-task Control
Learns embeddings for subtasks which are compositional,
through conditioning on the difference.
Imitation learning.
---
On the Complexity of Exploration in Goal-Driven Navigation
A measure of complexity for some MiniGrid games, based on goal dependency graphs (something about subgoals); showed that hierarchical policies worked better than flat policies.
---
Multi-Modal Imitation Learning from Unstructured Demonstrations Using GANs
Skill segmentation.
Imitation of skills.
The policy plays the role of the generator, and the discriminator compares generated behavior with the original policy to be imitated.
The point is that it is multimodal.
---
Language Grounding through Social Interactions and Curiosity-Driven Multi-Goal Learning
Curriculum resulting from the agent's intrinsic motivations
Also language grounding
Goals come from language supervision of what has been done
Arm-controlled bot.
---
Exploring Without Getting Lost: Learning to Navigate With No External Supervision
Image-realistic settings.
1.- Nearby and far points
2.- Looking for novelty
3.- Learn a reward
"our approach is most related to a recent line of
research that uses multiple stages of learning to build a set
or graph of scene observations"
---
Episodic Curiosity Through Reachability
Visual reachability.
Compares memory with new observations, taking into account the number of steps to reach the current observation: "giving a reward only for those observations which take some effort to reach (outside the already explored part of the environment). The effort is measured in the number of environment steps. To estimate it we train a neural network approximator: given two observations, it would predict how many steps separate them."
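A minimal sketch of such a step-distance approximator (my own simplification with flat observations and MLPs, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ReachabilityNet(nn.Module):
    """Siamese encoder over two observations plus a comparator head that
    predicts how many environment steps separate them (trained on pairs
    sampled from stored trajectories)."""
    def __init__(self, obs_dim, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, emb_dim))
        self.comparator = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 1))

    def forward(self, obs_a, obs_b):
        z_a, z_b = self.encoder(obs_a), self.encoder(obs_b)
        return self.comparator(torch.cat([z_a, z_b], dim=-1)).squeeze(-1)
```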
---
Procedural Content Generation: From automatically generating game levels to increasing generality in machine learning
---
Leveraging Procedural Generation to Benchmark Reinforcement Learning
Games that are procedurally generated but with pixels.
---
Journal
---
September 03 2019
Conversation with Ed and Sebastian
Some directions:
- Conditioning on external information (extension of Victor's)
- Abstract disentangled information from language (interest from people in Paris)
- Producing text
- For future agents
- As a form of planning
- Deal with large action spaces
- Free text: all letters lower/upper case
- i.e. relevant for the StarCraft team
- Intrinsic rewards to better explore (extension of Roberta's)
---
Two Ideas:
- Parallel modules for producing language
- Consistency loss for intrinsic rewards
---
September 4th Conversation with Tim
- Exploit symbolic input
- Pseudocounts
- How often close to an object
- Agents set themselves symbolic goals (i.e. try to open a door)
- Specify a combinatorial space but have the agent learn the actual exploration
- Parts of the agent can be symbolic/logic programs
- Nantas: Learn from an expert.
- Choose a problem that has an easy component
Key Questions
---
Key property:
- Neurosymbolic: deep RL + some structured representation / logic program / language
Step 1 - Directions:
- Externalize knowledge for interpretability or planning
- Exploit symbolic input for intrinsic motivations
Step 2 - Task:
- What is the minimal domain/task to start testing the research hypothesis?
---
September 5th, Conversation with Ed and Tim
Need to define:
Problem or Research Hypothesis
Method
Task-Domain
---
**Possible Research Problems and Methods**
Research Problems:
- Agent sets its own goals?
- Transferability?
- Outputting language?
- Planning based on symbolic input?
Features:
- Exploit symbolic input
- Use symbolic components in the agent
Methods:
- Consistency loss?
- Hybridization module for generalization and transferability?
---
---
Idea 1. Consistency-Imagination
---
**Goal:**
- Learning without reward through Curiosity-Imagination
(Children do not only fit to the data but imagine Dragons)
**Domain, Task:**
- SuperMario, clear Benchmark
- MiniGrid, Roberta's?
**Method/Idea:**
Original Paper: Pathak et al 2017, [Curiosity-driven Exploration by Self-supervised Prediction](https://pathak22.github.io/noreward-rl/resources/icml17.pdf)

- Inverse Model
$$\hat{a}_t = g(s_t, s_{t+1}; \theta_I)$$
decomposed into two submodules $\theta_{I1}$ and $\theta_{I2}$:
$$\hat{a}_t = g\big(\phi(s_t; \theta_{I1}), \phi(s_{t+1}; \theta_{I1}); \theta_{I2}\big)$$
$$\min_{\theta_I}\|\hat{a}_t - a_t\|$$
- Forward Model
$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big)$$
$$\min_{\theta_F}\|\hat\phi(s_{t+1}) - \phi(s_{t+1})\|$$
*Note that during forward-model training the gradient does not propagate back into the state encoding.*
**Auxiliary Consistency Loss Idea**
$$\hat{\phi}_{imagined}(s_{t+1}) = f\big(\phi(s_t), a_{imagined}; \theta_F\big)$$
$$\hat{a}_{imagined} = g\big(\phi(s_t), \hat{\phi}_{imagined}(s_{t+1}); \theta_{I2}\big)$$
$$\min_{\theta_F, \theta_{I2}} \|a_{imagined} - \hat{a}_{imagined}\|$$
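A minimal PyTorch sketch of this consistency loss, assuming discrete actions and simple MLPs for $\phi$, $f$ and $g$ (layer sizes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICMConsistency(nn.Module):
    """Sketch of the auxiliary consistency loss: imagine an action, roll the
    forward model, and require the inverse head to recover the imagined action."""
    def __init__(self, obs_dim, num_actions, emb_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))                # theta_I1
        self.inverse_head = nn.Linear(2 * emb_dim, num_actions)          # theta_I2
        self.forward_model = nn.Linear(emb_dim + num_actions, emb_dim)   # theta_F
        self.num_actions = num_actions

    def consistency_loss(self, s_t):
        # As in the forward loss above, do not push gradients back into phi here.
        phi_t = self.phi(s_t).detach()
        a_imagined = torch.randint(self.num_actions, (s_t.shape[0],))
        a_onehot = F.one_hot(a_imagined, self.num_actions).float()
        phi_next_imagined = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        logits = self.inverse_head(torch.cat([phi_t, phi_next_imagined], dim=-1))
        # Minimized over theta_F and theta_I2 only.
        return F.cross_entropy(logits, a_imagined)
```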
**Ed's Idea to mitigate the encoding of action into $\hat\phi(s_{t+1})$**
Discriminator objective, where we put proposed $\hat\phi(s)$ into a buffer $\mathcal{B}$ and assume there is a dataset (experience buffer) $\mathcal{D}$ of real states:
$$
\min_{\theta_D, \theta_{I1}} \; \mathbb{E}_{s \sim \mathcal{D}}\left[ -\log D\big(\phi(s; \theta_{I1}); \theta_D\big) \right] + \mathbb{E}_{\hat\phi(s') \sim \mathcal{B}} \left[ -\log\Big(1 - D\big(\hat\phi(s'); \theta_D\big)\Big) \right]
$$
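A small sketch of this objective, assuming `D` ends in a sigmoid and an embedding size of 64 (both placeholders):

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def discriminator_loss(phi_real, phi_buffered):
    """phi_real: encodings phi(s; theta_I1) of states from the experience buffer D;
    phi_buffered: predicted encodings phi_hat(s') drawn from the buffer B.
    The first term also trains the encoder theta_I1 (gradients flow through phi_real)."""
    real_term = -torch.log(D(phi_real) + 1e-8).mean()
    fake_term = -torch.log(1.0 - D(phi_buffered) + 1e-8).mean()
    return real_term + fake_term
```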
**More Recent Paper:** Burda et al. 2018 [Large Scale Study of Curiosity-Driven Learning](https://arxiv.org/pdf/1808.04355.pdf)

Idea 2: Hierarchical RL / Neural Modules with an Expanding Library
---
**Goal:** Decompose tasks in a compositional way
**Domain:**
- BabyAI: compositional structure, curriculum, simple language
**Method Idea**
- Have a library of primitives (essential subtasks) that grows.
- Conjunction, Sequential Concatenation Arguments
- Library that grows as in
[Library Learning for Neurally-Guided Bayesian Program Induction](https://papers.nips.cc/paper/8006-learning-libraries-of-subroutines-for-neurallyguided-bayesian-program-induction.pdf)
But neurally as in ["Neural Modular Control for Embodied Question Answering"](https://arxiv.org/pdf/1810.11181.pdf) or as in ["Modular Multitask Reinforcement Learning with Policy Sketches"](https://arxiv.org/pdf/1611.01796.pdf)
- Possibility of a population of agents that share a common library but have independent meta-controllers or additional entries in the library?
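A rough sketch of what a growing library could look like (entirely my own assumptions about the design: dot-product keys so that adding a primitive never requires resizing a layer):

```python
import torch
import torch.nn as nn

class GrowingSkillLibrary(nn.Module):
    """A library of sub-policy primitives plus a meta-controller that scores
    primitives by a dot product between the state embedding and a learned
    per-primitive key, so the library can grow during training."""
    def __init__(self, obs_dim, num_actions, key_dim=16):
        super().__init__()
        self.obs_dim, self.num_actions, self.key_dim = obs_dim, num_actions, key_dim
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, key_dim))
        self.primitives = nn.ModuleList()
        self.keys = nn.ParameterList()

    def add_primitive(self):
        self.primitives.append(nn.Sequential(nn.Linear(self.obs_dim, 64), nn.ReLU(),
                                             nn.Linear(64, self.num_actions)))
        self.keys.append(nn.Parameter(torch.randn(self.key_dim)))

    def forward(self, obs):
        query = self.encoder(obs)                             # (B, key_dim)
        keys = torch.stack(list(self.keys))                   # (K, key_dim)
        meta = torch.distributions.Categorical(logits=query @ keys.t())
        action_logits = torch.stack([p(obs) for p in self.primitives], dim=1)  # (B, K, A)
        return meta, action_logits
```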
---
---
Conversation September 11th
Sebastian, Tim, Ed
- Idea 1:
Tim: SuperMario is not the best setting.
ICM is not the best benchmark.
Ed: What would be the metric, just sample efficiency?
- Idea 2:
Ed: Potentially too many moving parts
- Idea 3 (Tim's):
An agent that sets its own goals in an adversarial way (so that they are challenging). Related to intrinsic motivation.
Can be thought of as pretraining representations of the dynamics (analogous to word embeddings or pretrained vision features) that can then be used for downstream tasks.
---
TODOs:
- For idea 2: Deeper literature review on compositionality on top of options.
- Dzmitry Bahdanau
- Chelsea Finn
- Neural Modular Control for Embodied Question Answering
Warm-up subpolicies, a compositional master policy on top
- Siddharth Karamcheti
- Engineering: Play on MiniGrid.
- Formalize idea 3
---
---
Idea 3. Adversarial Intrinsic Goals:
---
**Goal:** Learn useful representations of the dynamics, independent of any task
**Domain:** Minigrid, BabyAI, Nethack
**Key Idea:** Use adversarial loss to make agent generate challenging but feasible goals.
----
Conversation September 12th, Tim and Ed
**Task:** Learn in an unsupervised way by self-proposing goals forming a curriculum.
**New Feature:** Exploit symbolic input to propose intrinsic goals. Use an adversarial loss.
**Domain:** MiniGrid (Tim and Roberta solved first three of [Key corridor environment](https://github.com/maximecb/gym-minigrid#key-corridor-environment)). Potentially then Nethack
**Idea Setup:**
The teacher proposes subgoals.
The student tries to achieve them.
- Partially adversarial training to propose goals that are hard but not impossible.
- Rewards:
- Student rewarded for reaching subgoals
- Teacher rewarded according to a function of the time the policy needs to reach the goal (Ed's suggestion):

x axis: time for the policy to reach the goal
y axis: reward for the teacher
(One possible shaping is sketched in code after this list.)
- Goals: Initially can just be x,y coordinates
Potentially for later:
- A discriminator (GAN-like) could be used to increase diversity by discriminating based on whether proposed goals have been seen before (similarity kernel)
- Teacher could propose more complex goals, based on types of things or even sequences of goals and whole plans.
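One possible instantiation of that teacher reward (my assumption about the shape, not the agreed-upon function): reward the teacher when the student needs at least some threshold number of steps but still gets there, and penalize goals reached too quickly or never reached.

```python
def teacher_reward(steps_to_reach, reached, t_star=10, alpha=1.0, beta=1.0):
    # steps_to_reach: steps the student needed; reached: whether it got there at all.
    # t_star, alpha and beta are placeholder hyperparameters.
    if not reached:
        return -beta          # unreachable (or currently too hard) goals are penalized
    return alpha if steps_to_reach >= t_star else -beta  # too-easy goals are penalized too
```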
- What is the research hypothesis?
Can we train an agent in an unsupervised way to develop general skills for an environment, by proposing goals that are challenging but not too challenging, so that they form a curriculum during training?
- What is a minimum experiment?
- Does the agent learn to pass levels in an unsupervised way, without reward (i.e. in the MiniGrid Key Corridor environments)?
- Are the goals increasingly more complex?
- How does it compare to baselines (Automatic Goal Generation, [Asymmetric Self-Play](https://arxiv.org/pdf/1703.05407.pdf)) on their very simple, non-changing scenarios or others?
Third Week
---
Possibilities for adversarial loss:
- [Automatic Goal Generation](https://arxiv.org/pdf/1705.06366.pdf):
A discriminator is used to learn whether a goal is of intermediate difficulty or not.
- A discriminator used to increase the diversity of proposed goals: has this goal been proposed before?
- **No discriminator**: the policy tries to reach goals; the generator tries to generate goals that take time to reach.
Main differences with "Automatic Goal Generation":
- Adversarial loss
- Changing environment (quite a bit harder!). In that paper the generator doesn't even take a state as input.
- Always back to initial position after T steps.
- More symbolic goals? ("Reach an object")
---
Generator:
goal = (x,y) <- current_state
Reward: Maximize the time the policy takes to reach the goal, without going over the threshold
Policy:
actions <- current state, goal
$\pi_i, s_{new}\leftarrow_{update}\pi_{i-1}(s,g)$
Reward: Minimize time to reach goal
**Algorithm**
Changing environment:
- for num_iterations:
    - get env_config
    - for ep_per_config:
        - g <- G(s)
        - train policy
        - train generator
Alternatively, with no forced change of env_config:
- get env_config
- while training:
    - g <- G(s)
    - train policy
    - train generator
    - update state (and env_config if the level was passed or the agent died)
*Initialization bootstrap: the policy takes random actions; the generator is trained to match the empirical visitation of the policy.
*Do we need a buffer of previous goals to avoid forgetting?
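A toy, fully self-contained sketch of this loop (not the torchbeast implementation): the policy is replaced by a fixed, deliberately noisy greedy walker on an N×N grid (the "controlledly suboptimal policy" debugging trick used later), and only the generator is trained, with REINFORCE, to propose cells the walker needs at least `THRESHOLD` steps to reach. All constants are placeholders.

```python
import torch
import torch.nn as nn

N, THRESHOLD, MAX_STEPS = 8, 6, 40
goal_logits = nn.Parameter(torch.zeros(N * N))        # generator: softmax over grid cells
optimizer = torch.optim.Adam([goal_logits], lr=0.05)

def noisy_greedy_steps(goal, eps=0.3):
    """Steps a noisy greedy agent takes from (0, 0) to `goal`; None if it times out."""
    pos = torch.tensor([0, 0])
    for t in range(1, MAX_STEPS + 1):
        step = torch.randint(-1, 2, (2,)) if torch.rand(()) < eps else torch.sign(goal - pos)
        pos = (pos + step).clamp(0, N - 1)
        if torch.equal(pos, goal):
            return t
    return None

for iteration in range(500):
    dist = torch.distributions.Categorical(logits=goal_logits)
    idx = dist.sample()
    cell = idx.item()
    goal = torch.tensor([cell // N, cell % N])
    steps = noisy_greedy_steps(goal)
    # Generator reward: hard enough (>= THRESHOLD steps) but still reachable.
    reward = 1.0 if steps is not None and steps >= THRESHOLD else -1.0
    loss = -reward * dist.log_prob(idx)               # REINFORCE on the generator only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```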
---
Conversation September 17th
Tim and Ed
Some points:
- No discriminator
- If the goal is discrete, is that a problem for differentiability?
- Tim: When do you switch to the actual task? Should this be thought of as an auxiliary reward? What happens if you die? Maybe switch the external reward on after some episodes, or always have it on.
- Could the generator propose sequences of goals and pass them sequentially to the policy (so that it actually looks like key -> door)?
- The generator needs a reward that correlates with the cost incurred by the policy:
- Time: problem of just making it go around.
- Error rate: a couple of rollouts and the probability of reaching the goal.
- Something like Alice and Bob?
- Ed: It could even be meta-learned; ultimately the teacher wants to cause the most learning on the student's side.
- A network that predicts performance from the policy?
- As in the Alice and Bob [paper](https://arxiv.org/pdf/1703.05407.pdf): the teacher tries to do it itself, and there is a negative loss on its performance. [doesn't seem like the best option]
Tim: A notion of how hard a task is. Objectively it takes 10 steps; then how is the policy doing relative to that? (Comparing to an average; "do this in this much time" - if it has done it, then it is not so interesting. An explicit notion of that.)
---
Conversation September 19th
- Diversity might fall out naturally (if a goal is constantly proposed then it is not too hard); other alternatives include:
- A loss, e.g. to [maximize the entropy of goals](https://arxiv.org/pdf/1903.03698.pdf)
- A count-based heuristic
- A discriminator trained for novelty detection
- Alternatives for discrete goals
- Tim: Use a fixed visual grid surrounding the agent
- Ed: Additionally, could have a convolution over the map to create a dynamically sized vector of per-cell values, followed by a softmax to decide the proposed goal.
Architecture of the generator:

Play with strides, convolution and pooling parameters - although really play with depth and number of channels instead.
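A minimal sketch of such a fully convolutional generator head (my own layer choices; `in_channels` is the number of symbolic channels per cell), which also makes the softmax-temperature idea from later easy to add:

```python
import torch
import torch.nn as nn

class ConvGoalGenerator(nn.Module):
    """Fully convolutional generator: one logit per grid cell, so the output
    adapts to the (possibly varying) map size; temperature scales the softmax."""
    def __init__(self, in_channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, grid, temperature=1.0):
        # grid: (B, C, H, W) float tensor -> Categorical over the H*W cells
        logits = self.net(grid).flatten(1) / temperature
        return torch.distributions.Categorical(logits=logits)
```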
---
Meeting September 24th with Tim
Have a great framework pipeline, take your time:
- [Sync between Mac and cluster](https://fb.quip.com/KbDoA8lM404F)
- [Slurm](https://github.com/fairinternal/hackrl/blob/master/scripts/slurm_staircase.py)
- [Plotting (jupytex)](https://github.com/fairinternal/hackrl/blob/master/scripts/plotting/plot_sweep_tim.py)
- Use [Tim's Code](https://github.com/fairinternal/hackrl/blob/master/hackrl/models.py) for Nethack, i.e. index_select instead of embeddings
Experiments:
Make sure torchbeast runs well in some scenarios, for example in:
ObstructedMaze2Dlh, comparing to Roberta's paper.
So far it learned in:
- MiniGrid-Empty-5x5-v0
- MiniGrid-Empty-Random-5x5-v0
----
Meeting October 1st
TODO:
1- Plotting: [DONE**try]
JupyterLab is run on devfair; locally, only open the browser. Start things as a new notebook in JupyterLab.
With [Jupytext](https://github.com/mwouts/jupytext) notebooks are saved as Python files; add it as an extension:
pip install jupytext
and then enable the extension in JupyterLab
2- Add LSTM. Errors: [Done]
With the LSTM, either empty or a CUDA error, something about sizes [done:1-,~]
With 1 actor, empty as well [DONE:num_buffers]
3- Obscuremaze
4- Slurm: Tim runs it from an et terminal (not as a notebook); apparently Ed does [DONE*Runs2experiments, consider changing saving]
- Obscuremaze [???]
- Plotting [DONE]
*-Slurm [DONE]
Next:
5- Add Embeddings
6- Implement the architecture from Ed. For convolutional strides and codes see the model for Nethack (padding of 1)
7- Move to polybeast?
---
OMP_NUM_THREADS=1 python -m examples.minigrid.monobeast --env MiniGrid-Empty-5x5-v0 --num_actors 4 --num_threads 2 --total_frames 400000 --unroll_length 80 --use_lstm
---
Questions/IDEAS for Roberta:
- Negative reward for losses to encourage exploration? It hinders exploration
- More exploration? I guess there is sampling from the logits
- Did you change the time limit? Nope
- LSTM worked worse than no LSTM? Which hyperparameters?
- #total timesteps affects the linear decay of the learning rate
- Batch size of 8? What about 4?
- Did you modify it in other ways, with respect to discounting for example?
- Did you discount additionally? Because the env already discounts; where is the discount being used?
---
---
Meeting October 15th
- Propose new goals once a goal is solved.
- Both generator and teacher get rewarded for the extrinsic goal.
- Could be partial vision for agent and full vision for generator. Or just full vision but covered like in nethack.
TODO - DONE:
2.- Policy agent, goal conditioned [DONE]
4.- Generator_loss (number of steps), Gaussian function [DONE]
3.- Intrinsic_rewards [DONE]
1.- Full observation [DONE]
5.- Episode should be done if the intrinsic goal is reached, right? [DONE]
5.5.- What to do with learner_outputs, losses; finish the learner function [DONE]
- Change batch in actor (st0) [DONE]
- Batch size run [DONE]
6.- vtrace [DONE]
- Discounts [DONE]
- Clean dictionary. [DONE]
7.- DEBUG [DONE]
TODO Details - DONE:
- Reshape in line 280 in the learner [DONE]
- In learn, check that I am not screwing up by not using .clone [DONE]
- Line 313, reached goal for learn, indexing correct. Notice we delay by 1. [DONE]
- Check indexing and alignment, line 453 [DONE]
- Mean intrinsic reward is -1? [DONE]
- Logits and actions are from different steps!! **Ask Nantas later [DONE].. not true
- LSTM is not reinitialized [DONE]
NOTE/QUESTION:
Careful that goals don't make the agent not want to reach the environment reward, relying only on intrinsic motivation. Too curious :)
If it proposes a goal at the extrinsic goal, then it will never reach the curious goal, because it first has to lift the goal.
Right now the goal is to REACH a block! Changing a block can also be done (in which case it wouldn't get the extrinsic reward, only the curious reward).
- Nantas: The goal can be structured into x and y: 16x16 is a lot, propose an action.
- The actor gives st0 every frame
- The generator gets trained every B episodes
For training, use policy gradient on the generator, and independently on the policy.
Is the intrinsic reward defined within the model, or in the learner?
- Michiel: try to predict the actions of the policy; the generator could do that
-----
Debugging the Model:
- Only extrinsic rewards [WORKS]
- Fixed env, fixed goal [WORKS]
- Fixed env, no policy, does the generator learn? [Skipped]
- Fixed env, generator, non-adversarial [WORKS]
- Fixed policy (optimal) [WORKS] - problem: plateaus into local optima
- Fixed env, generator, adversarial [WORKS]
- Random env: learns but doesn't converge
TODO/Ed:
Check the reward is right at different positions.
Check the loss for the generator is correct.
Run the generator for a thousand steps.
Heat map of distance from the agent.
Temperature hyperparameter; is the softmax useful; measure entropy.
Remove embeddings.
TODO-Meta:
- Embeddings [DONE]
- Generator: [Done]
- Generator both in actor and learner. [Done]
- Logging the difficulty [DONE]
- ***Heinrich - Why, with more than 1 actor, is it repeating the actions? Extremely weird; if buffers > 80 then it works
- Setup so that I can run both through flags
- ****Visualizing the agent as it plays?
- Feudal Neural Networks paper
TODOS:
- Logging problem when no steps [DONE]
- Change-Loss step [DONE]
- Why is MiniGrid FourRooms not raising the loss even higher? [DONE]: Time
- Log Steps [DONE]
- Rewards also when goal not reached?
- Compare losses of generator and policy to see if it is training
Possible frameworks:
- Where does the goal embedding get in? FiLM.
Early conditioning is better for harder-to-learn tasks;
late conditioning is hard because it needs to compress everything. (A minimal FiLM block is sketched after this list.)
- Proposing goals anywhere might be a problem: big, not-useful scenarios?
- Episode Restart?
- Partial Observability of Policy? a la Nethack?
- Generator rewarded for Extrinsic goal?
- Generator tries to predict actions from policy?
- Reach or Modify?
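A minimal FiLM block for the early-conditioning option above (standard FiLM; sizes are placeholders):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """The goal embedding produces a per-channel scale (gamma) and shift (beta)
    that modulate the policy's convolutional feature maps."""
    def __init__(self, goal_dim, num_channels):
        super().__init__()
        self.gamma = nn.Linear(goal_dim, num_channels)
        self.beta = nn.Linear(goal_dim, num_channels)

    def forward(self, features, goal_emb):
        # features: (B, C, H, W); goal_emb: (B, goal_dim)
        gamma = self.gamma(goal_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(goal_emb).unsqueeze(-1).unsqueeze(-1)
        return gamma * features + beta
```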
1.- Revise why it is not proposing goals next to it, or, if it is, why it takes 5 steps.
2.- Add things.
Ordered TODO
0.- Play with the generator to see why it is not converging
0.1.- Train with a dummy policy (distance metric)
1.- Fix so that it is not only the reached cases
2.- Add x,y direct line
4.- FiLM architecture
- Generator, spawning randomly.
- Policy is controlledly suboptimal - Poisson distribution.
Trick for debugging: add noise to the optimal policy
- Bug, students work.
OMP_NUM_THREADS=1 python -m examples.minigrid.monobeast --env MiniGrid-Empty-5x5-v0 --num_actors 4 --num_threads 1 --total_frames 2000000 --unroll_length 80 --num_input_frames 1 --num_buffers 8
NEW Errors:
With a random generator it doesn't converge: it does converge after a bit longer!
- Learning rate of the generator
- Entropy penalty of the generator
Plotting error: [SOLVED]
Batch problem, repeating things in the batch: [CAN IGNORE for NOW]
- Heiner sent me a thing to debug
October 25th
===
### Analysis of Adversarial runs:
1.- Generator converges too narrowly and too fast (22 epochs)
2.- With a fixed environment, the policy learns fast (another 15 epochs)
3.- When the generator does happen to propose other goals, it either proposes a bad goal or, even when proposing a good goal, the policy might not reach it, so it ends up proposing the same goal.
4.- Never gets other observations. Rabbit hole.
5.- Policy steps move too slowly or don't incentivize sufficiently
**NOTE:**
FourRooms_Random 0noise (5250331): 1 transition from Goal=98 to Goal=65
- Pick 2 environments to experiment with:
FourRooms Fixed/Random? 2Road Fixed
## TODO
### Engineering
- Batch Problem (Replicate Heinrich Code). Priority: Low, Difficulty: Medium/Hard
- FILM Architecture. Priority: Medium, Difficulty: Medium
- Agent Position fed directly. Priority: High, Dif: Easy [DONE]
- Heatmap of logits, goals proposed. Priority: **High**, Dif: Medium
### Research Ideas
- Sampling Noise
(So the generator does not fixate on a single goal)
- Epsilon greedy [DONE] - not good, didn't help?
- Dropout [DONE] - seems arbitrary, didn't help?
- Explore adding Noise Vector Priority: Medium, Dif: Easy (/19-10-24_14-59-15/)
- Softmax Temperature (divide by T). Priority: **High**, Dif: Easy
- Architecture
(Shouldn't the generator learn some topology of actions? i.e. propose neighboring goals)
- Deconvolution Priority: Medium, Dif: Medium, Easy
- y|x decouple coordinates Priority: Low, Dif: Medium
- Use convolutions from the policy, which has learned more. Priority: Low, Dif: High
Could be too random as it is changing
Learn things other than topology
- Loss Function
(Instead of Gaussian place different incentives)
- Linear and then abrupt decay Priority: Highest Dif: Easy [DONE]
- Increase Variance of goals proposed
(As an additional objective)
- Sample Hardness of Goal. Priority: Low, Dif: Medium
- Change hardness of Goal depending on history of actions. Priority: Low, Dif: Hard
- Don't Restart Episode Flag Priority: High, Dif: Medium [DONE]
(After reaching an intrinsic goal, continue with episode)
- Discriminator that evaluates if it is a new goal Priority: Low, Dif: Hard
- Something count-based. Priority: Low, Dif: ?
### Next Steps
1.- Analyze Runs
2.- Pick 2 scenarios. [KeyCorridorS4R3-v0, FourRooms-v0]
3.- Build High Priority
4.- Analyze
...
### Details
- Remove V-trace, and even actor-critic altogether, from the generator [DONE]
- Cross entropy is much smaller for the generator... maybe the baseline is pushing it down
- The generator converges; look at:
/torchbeast/tmp/minigrid/19-10-24_14-59-15/eMiniGrid-FourRooms-v0-lr0.001-na40-nll1-nif1-ul100-bs8-gbs32-gec0.0005-ec0.0005-tf150000000-ulFalse-fsFalse-sed256-dueFalse-ugFalse-foTrue-nerTrue-reTrue-tv30.0-gt30.0-n0-glr0.0001-r0--00
---
---
---
#### LATER
- Setup so that I can run both through flags
- Visualizing the agent as it plays?
- Eventually move to scenarios like procedurally generated envs.
#### Other possible setups and Design Choices
- Rewards also when extrinsic goal? [DONE]
- Partial Observability of Policy? a la Nethack?
- It could Reach or Modify intrinsic cell
- Feudal Neural Networks paper
- Generator tries to predict actions from policy? (More Model Based)
---
Discussions:
What if it is getting rewarded more for proposing that same thing many times?
Two possible setups:
- No notion of an episode for the generator, just entries into batch.
-
- Visualization
- Load Model
- Visualize some actions: map, proposed goal, trajectory
- Varying difficulty [Done]
- Maybe difficulty also in the policy? Otherwise the generator becomes tougher and tougher while the policy is still quite inept
Possible things to do:
- Encouraging diversity
- Varying reward of the form: 0 and then linearly decaying [don't think so]
- In the direction of the goal (hacky)
- Threshold depending on progress [DONE]
- Change max steps (hacky and doesn't sound too important)
- Add noise? Could do as a hyperparameter [DONE]
- Restart episode --> Collapse?
- Fixed env? [TRIED]
- From run 19-11-09_07-54-49 comes the idea of starting directly with 30 steps, with no curriculum at all.
a) current_target = 33 RUNNING
b) 0 rewards otherwise for generator Running
Think of a different way for the curriculum to advance.
Is having the key maybe encoded in the agent?
Do we give it enough information for it to be solvable?
Featurization for the items.
I could hardcode the items (preprocessing) [DONE]
V-trace and a generator baseline; maybe not so important with binary rewards.
Add a generator baseline
Add V-trace
5.- Visualize to see what changes in the env when I grab or don't grab.
- Direction is on the agent, OK
- If the agent is on a door, then the door is lost, OK
- Door open or closed, OK
- There is no such thing as a goal in the environment, OK
- You don't know if you are carrying anything!!!!!!!!!!!
******- More if it actually reaches the goal
-- Fixed seed with novelty on places
All ideas together
FiLM architecture
Increase the number of steps?
New Ideas/TODO:
---
- Remove outer wall (high priority) [DONE]
- Slower Curricula (high priority) [DONE]
- Curriculum increases with a rolling mean (probably not needed)
- **Next?** Curriculum goes back (if things got messed up) (hopefully not needed). But the next thing to try
- Embeddings shared?
- Increase the number of steps
- Fixed seed with novelty on places
- FiLM architecture
- Is reaching the neighborhood already good? Even if only a little good.
- Size of the embedding? Nantas says 10 is good. Could do a hyperparameter sweep. [DONE]
Question: Should the generator be rewarded for proposing goals which make the policy reach the extrinsic goal?
Conflict of interests:
- If it reaches the goal, then keep proposing it, because that was somewhat helpful?
- Generator should learn to propose the right goals
---
- Reaching instead of just getting there. Should do it after a hyperparameter search
- If it is rising because of the extrinsic reward, do I want to raise the level? [DONE] That affected one agent in one level.
Next steps:
- Curriculum goes back? Naah
- Share embeddings?
- FiLM architecture [DONE]
- Modifying instead of just getting there? I don't think it is going to be a big difference... [Done]
- Novelty issue, which is not solving the thing.
- Does the difficulty change if it is getting extrinsic rewards? [Nah]
Besides that:
What would be the next project steps? Is the idea plus these results enough?
Change environment?
- Additional bonus if the generator proposes the goal... better only if it reaches it, no?
The generator actually doesn't know what the goal is!!
**Baseline**:
- Give rewards just for modifying random things, so only novelty.
What happened when it learned:
Problem with "reached" and "rewarded" at the same time: if it reaches the end of the episode, the frame changes and then "reached" would tend to be true.
The problem will happen only when: it successfully finishes the episode, the target goal gets modified by the change of frame, and then the generator thinks it was reached.
---
---
Next Steps:
Share embeddings?
Baseline of only novelty?
Changing max steps could potentially help; it would probably just make it work.
Interesting point, unclear if we want it:
The policy learns to collect intermediate rewards... this increases intrinsic rewards. The reward for the generator goes down because the extrinsic reward in some entries is 0. Notice this is after it is already reaching the goal.
- A problem that I don't like, but maybe it is fine to ignore.
The generator is rewarded if:
It runs out of time and the change of episode makes it "modify" the goal. This happens:
- when you want it, because the generator proposed the actual goal. ***Is this what is happening? Would be nice
- when you don't want it, because it just randomly changed the situation.
Suppose the good one never happens. Why could it be good to reward situations where the frame changed and give a positive reward to both intrinsic and extrinsic:
-
The generator is incentivized to propose things that change from one scenario to another. What things change?
Things that can change are reachable.
The generator is incentivized to propose:
- Things that are more difficult
- Novel objects
- Things that can change from one environment to the next
This last thing helps quite a lot.
2 experiments to try:
---
November 26th:
- Proposing to change things
- Interesting dynamic: the policy learns to first collect intermediate rewards.
- Better than baselines; ask Roberta if she can run them
- FiLM didn't work
- How would a paper look?
Is "it works better in these MiniGrid settings" plus displaying the idea enough? I like that, but is it?
----
Plotting
Weird dynamic
Diversity across the environments.
Hardcode a bias if needed, or something like the frequency of things proposed.
- Comparison to RIDE in all of them.
Nethack staircase: beat the baseline.
RIDE:
Partial view
Not removing walls
Grid maps are huge.
What if things change because there are enemies that move?
- Steps=100...
- Don't wait for 1000 steps, but a 4x decrease
The generator could be recurrent on the actions
Share embeddings, iterate training generator and policy
Towards objects
Towards novel objects
Proposing positions
Propose depending on the number of times something appears
Minecraft/gridworld-like things
----
- I think something that happens is that the generator just goes down the rabbit hole of proposing things that change and is satisfied with that low reward.
Different parametrization.
Direct channel, inventory.
Partial view.
Three points:
- Baselines
Partial view vs full_obs
- Objects: didn't work, but now there are several other versions
- The whole thing is not very robust; for example, a hyperparameter I didn't think would matter actually does: the 100 limit
- Should I start writing something, or what's the plan regarding that?
Do I need something to maintain the learning of the previous tasks? Like occasionally sampling a lower threshold or something like that, hmmm...
Cleanup vs Harvest
KeyCorridor vs ObstructedMaze
And just behavior in general
----------------------------
Several things to do:
1.- Make the accident happen in a more principled way.
2.- Fix that x, y have embeddings; that makes no sense (because things change from env to env, so what is the semantics that gets shared??). It should be part of a convolution, a one-hot embedding.
3.- The fact that I am using linear layers at the end is problematic because of dynamic size changes.
So I have to use dynamic pooling
And I have to use FiLM
- Run in other changing environments
- Make sure the comparison with Roberta uses the same parameters.
Questions:
- The policy has an fc layer at the end, is that ok? (ok or not ok, it's fine)
- Does the last convolution need to have an output dimension of 1? (yep)
- Carried info only in the policy. (also in the generator)
Two things that would be nice to do:
Same architecture for policy and generator, both with FiLM
Transfer from one scenario to another?
-----
Ideas:
- Transfer from one scenario to another
- Partial view
- Share embeddings (problem that Heinrich pointed out about having different training schedules)
Research obstacles/paths:
- Other scenarios are too big; not a problem with partial view.
IMPALA doesn't solve them but RIDE does.
AGENDA
- Roberta's results.
- How to transfer logs?
- Access to code, do I have to do anything?
- Actual results: results are ok but not as good as before.
- Other scenarios are too big.
-o-o-o-o-o-o-o-o-
- Bias sampling towards being close to the agent. Gaussian around the agent's location.
- Convolution -- deconvolution.
- Launchable.
- Explainable, and the narrative is clear.
- Plots. Analysis pipeline
1.- Works better in these scenarios.
- What is the storyline and the problem?
2.- Works in all the other ones.
- Mask
- Make sure it runs. See my sh file and the paths between env and not env.
To run:
module load anaconda3
source activate torchbeast
- Pass all the logs.
For January:
- Plot pipeline
- Write paper
- All games, masking
- Storyline: weird effect or clean effect.
Principled version:
- Train to classify objects.
How about loading a model trained in ObstructedMaze and transferring it to S6R6 or to ObstructedMaze 1Q...
Plan:
Review what is going on: which goals are proposed with the hack, and why it works.
----
----
For Reviews - April 5th 2020
===
There are three papers which are most similar to ours:
### 1. Asymmetric Self-Play
[Paper - ICLR 2018](https://arxiv.org/pdf/1703.05407.pdf)
[Reviews and Replies: Accepted 8-5-8](https://openreview.net/forum?id=SkT5Yg-RZ)
***Concept***
Two policies trained adversarially: Alice starts from a start point and tries to reach goals such that Bob fails to travel in reverse from the goal back to the start point.
**Downsides**
- They require 2 independent agents that act and learn in the world (can be costly)
- Their environments have to be reversible (MiniGrid is) or resettable.
- The goals proposed are only those that have already been achieved
- [Florensa 2017](https://arxiv.org/pdf/1707.05300.pdf) argues there are additional training limitations: getting stuck in local optima and Alice's rewards being too sparse
**Replication - Running their algorithm**
I wrote a wrapper for their model and ran it in MiniGrid. While it is able to learn in the easiest scenarios (Empty 5x5, Empty Random 5x5, KeyCorridorS3R1), it already fails at KeyCorridorS3R2 even after 200M steps, and at all of the harder environments too. Note that IMPALA also fails to learn at anything as hard as or harder than KeyCorridorS3R3 (I am not sure how IMPALA performs at KeyCorridorS3R2).
### 2 GoalGAN
[Paper ICML 2018](https://arxiv.org/pdf/1705.06366.pdf)
[Reviews and Replies: Rejected 6-4-8](https://openreview.net/forum?id=SyhRVm-Rb) - (Later accepted for ICML)
**Concept**
Three-step iterative procedure which consists of (i) labeling goals based on the fraction of successes of previous trajectories, (ii) training both a discriminator and a generator of a GAN to propose new goals, and (iii) training the policy.
**Downsides**
- Their generator (which is supposed to propose appropriately difficult goals) is not conditioned on the observation.
- They only train in non-varying environments, where the agent is returned to the origin after every episode
- They require a three-step iterative procedure and a discriminator (a complex process). From [Deepmind's Goal Setter](https://arxiv.org/pdf/1909.12892.pdf): "Furthermore, maintaining and sampling from a large memory buffer, and running the agents on each goal multiple times to get a label of whether it was intermediate difficulty were very costly, and their approach thus required more memory and wallclock time than ours for an equivalent amount of agent steps"
**Replication - Running their algorithm**
GoalGAN's generator is built for static environments and for locomotion. It requires a buffer to prevent catastrophic forgetting, but a buffer of proposed goals makes no sense when the environments are changing. From [Deepmind's Goal Setter](https://arxiv.org/pdf/1909.12892.pdf): "Florensa et al. did not attempt the challenge of curriculum generation in environments that vary (which is why we did not compare to their algorithm in the alchemy environment)".
I could still run their method in the environments with a fixed seed, but that would kind of lose the whole flavor of our paper, which is about procedurally generated environments; we don't even have results for AMIGo's performance when the environment doesn't change.
### 3. Deepmind's Goal Setter
[Paper - ICLR 2020](https://arxiv.org/pdf/1909.12892.pdf)
[Reviews and Replies: Accepted 6-6-6](https://openreview.net/forum?id=H1e0Wp4KvH)
This is actually the closest attempt to ours. "To the best of our knowledge, these are the first results demonstrating the success of any automated curriculum approach for goal-conditioned RL in a varying environment." In our paper we didn't emphasize it as such, partly because it is not trained in an adversarial way.
**Concept**
Three main components:
- A solver – the goal-conditioned policy.
- A judge - predicts "feasibility" (probability of success of a particular goal), with supervision from the solver's results
- A setter - a generator encouraged to propose goals which: have been achieved in the past (validity), have the appropriate difficulty according to the judge (feasibility), and which are diverse (diversity).
- At each episode, the target feasibility is sampled uniformly from [0,1]
**Downsides**
- Half of their experiments are in non-varying environments, where the setter and judge are not conditioned on the observation. The varying environments in which they do condition the setter and judge are fairly simple and focused on visual tasks. From their paper: "in more complex versions of the alchemy environment (for example, introducing more objects with more colors), even our conditioned setter algorithm could not learn to reliably generate feasible goals from raw observations".
- The model is complex. It requires a discriminator and three losses for the setter. One of the reviewers: "However, it is not convincing that the detailed techniques proposed in this paper can be easily generalized to more complicated environments and tasks. The writing of this paper is not clear and enough and the whole paper is difficult to understand" or "There are too many details and hyperparameters involved when applied to specific tasks as shown in this paper. Considering the two environments in this paper is easier than most game environments studied in other papers"
- Their approach achieves something similar to ours by encouraging the setter to propose goals with different "feasibilities", but less directly than ours.
- **NOTE:** In our paper we say that our model is trained end-to-end, suggesting theirs isn't. This was inaccurate; what we meant is that their training procedure is more complex and iterative, but it is also end-to-end.
**Code Replication**
Their model is quite complex and has no code released, so I didn't attempt to run it: too many details and hyperparameters, three losses for the setter, and no code released.