# Deep Hierarchical Planning from Pixels

https://arxiv.org/pdf/2206.04114.pdf

![](https://hackmd.io/_uploads/ryc-GEjDn.png)

A manager policy is trained to maximize the reconstruction MSE of the decoder on s_{t+1}.

![](https://hackmd.io/_uploads/ByzsfVswh.png)

The manager policy outputs a goal that is worth exploring, and the goal-conditioned "worker" policy then takes actions toward that goal. This helps the agent explore.

Problem: We let the manager policy output a state that maximizes the reconstruction error, and then use the decoder to reconstruct it as a goal. But if the reconstruction error on that state is high, why should we trust the decoder's output so much? Maybe there is a better way to define what a good goal is and how to reconstruct it.

Future work in the paper:

![](https://hackmd.io/_uploads/SJehI4iPn.png)

## Thoughts

1. #### Additional low-level exploration hurts.
   In this case, low-level exploration (e.g. trying to fall down) is worthless. Hierarchical RL has a manager policy and a worker policy, and here only the manager policy receives an exploration reward (the decoding error), which happens to avoid the low-level exploration problem. If we gave the worker policy an exploration reward too, it might try to fall down even when it already has the ability to reach a goal and doesn't need to. The manager can focus on proposing goals worth exploring, while the worker focuses on reaching them. In other words, we should attach the exploration bonus to states, not state-action pairs, because states are what we actually care about.
   <br>![](https://hackmd.io/_uploads/HygfSUsDn.png)
2. #### **The reason Plan2Explore failed but Dreamer succeeded in Ant Maze S:**&emsp;In P2E's setting, the exploration bonus is given to both the high-level and low-level policies (because there is actually no such policy separation), so the P2E agent tries to fall down every time it arrives at a new position on the map, which inflates the space it tries to explore (body poses at every position, not just the positions themselves).
   For Dreamer, without the additional exploration bonus, the agent has a higher chance of reaching the goal by accident, because it is not eager to try falling down at different places.<br>![](https://hackmd.io/_uploads/ryVuIUsv2.png)
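To make the reward split above concrete, here is a minimal numpy sketch. All names are hypothetical, and the linear encoder/decoder is a toy stand-in for the paper's learned goal autoencoder: the manager's exploration bonus is the decoder's reconstruction MSE on a state, while the worker gets only a goal-reaching reward and no exploration bonus at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "goal autoencoder": 8-dim state -> 4-dim code, lossy on purpose.
# A hypothetical stand-in for the paper's learned encoder/decoder.
W = rng.normal(size=(4, 8))

def encode(state):
    return W @ state

def decode(code):
    # Pseudo-inverse decoder: best linear reconstruction from the code.
    return np.linalg.pinv(W) @ code

def manager_exploration_reward(state):
    # Manager's bonus: how badly the decoder reconstructs this state.
    # High error = poorly modeled state = a goal worth exploring.
    recon = decode(encode(state))
    return float(np.mean((state - recon) ** 2))

def worker_reward(state, goal):
    # The worker only gets a goal-reaching reward, no exploration bonus,
    # so it has no incentive for pointless low-level novelty
    # (e.g. falling over at every new position).
    return -float(np.linalg.norm(state - goal))

# A state inside the autoencoder's code space reconstructs almost perfectly,
# so it earns the manager almost no exploration bonus ...
familiar = np.linalg.pinv(W) @ rng.normal(size=4)
# ... while a generic state has a reconstruction gap and earns a bonus.
novel = rng.normal(size=8)
print(manager_exploration_reward(familiar))  # ~0
print(manager_exploration_reward(novel))     # > 0
```

Note this only illustrates where each reward attaches (states for the manager, goal distance for the worker); the actual method learns both the world model and the goal autoencoder jointly from pixels.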