# Notes on "[DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION](https://arxiv.org/pdf/1912.01603.pdf)"

###### tags: `Latent-world-models` `Dreamer-v1`

These notes are written from an implementation point of view.

## Main contribution:

The main contribution is learning long-horizon behaviors by propagating analytic value gradients back through imagined trajectories. The authors show that this method scales empirically to complex visual control tasks.

1) Learning long-horizon behaviors by latent imagination.
2) Empirical performance on visual control.

## Algorithm:

![](https://i.imgur.com/tBugvQJ.png)

The algorithm is divided into dynamics learning, behaviour learning, and environment interaction. Of these three, dynamics learning is the most crucial, as it is the backbone of the algorithm. The most important parts are the world-model architecture and the dynamics-learning losses. (Minimal PyTorch sketches of these components are included at the end of these notes.)

## Learnt models and components:

The following neural network models are learnt in Dreamer-v1:

$p_{\theta}(s_{t}|s_{t-1},a_{t-1},o_{t})$ : Representation model
$q_{\theta}(s_{t}|s_{t-1},a_{t-1})$ : Transition model
$q_{\theta}(r_{t}|s_{t})$ : Reward model
$q_{\phi}(a_{t}|s_{t})$ : Action model (actor network)
$v_{\psi}(s_{t})$ : Value model (value network)

They use the convolutional encoder and deconvolutional decoder from World Models (2018) and the RSSM from PlaNet (2018) for the representation and transition models. All other models are fully connected neural networks.

### Recurrent State Space Model (used from PlaNet):

A latent dynamics model with both deterministic and stochastic components, which can predict a variety of possible futures (as needed for robust planning) while remembering information over many time steps.

$h_{t} = f(h_{t-1},z_{t-1},a_{t-1})$ : Deterministic state model
$\hat{z}_{t} \sim p(\hat{z}_{t}|h_{t})$ : Temporal prior over stochastic states
$z_{t} \sim p(z_{t}|h_{t},o_{t})$ : Temporal posterior over stochastic states
$o_{t} \sim p(o_{t}|z_{t},h_{t})$ : Observation model (decoder)
$r_{t} \sim p(r_{t}|z_{t},h_{t})$ : Reward model

The state is split into deterministic and stochastic parts: $h_{t}$ is the recurrent deterministic state and $z_{t}$ is the stochastic state. $\hat{z}_{t}$ is the predicted stochastic state given only the deterministic state. This prediction of the stochastic state is used for training the RSSM.

#### Block diagram of RSSM:

![](https://i.imgur.com/3HiR19I.png)

The model state, the concatenation of the deterministic and stochastic parts, carries the information about the state of the world at a particular timestep.

### Dynamics learning (reconstruction method):

Reconstruction objective (to maximise):

![](https://i.imgur.com/eI7egZC.png)

The reconstruction errors for observations and rewards are minimised while also minimising the KL divergence between the temporal posterior and the temporal prior. Intuitively, this means we want the model to reconstruct observations and rewards as accurately as possible from the temporal prior, taking only the information it needs from the current observation.

### Behaviour learning

Behaviours are learnt purely in imagination: starting from posterior model states obtained during dynamics learning, trajectories are rolled out in latent space using the transition and action models. The value network is trained to regress $V_{\lambda}$ value estimates of these imagined trajectories, and the action network is trained to maximise them by backpropagating analytic value gradients through the learnt dynamics (see the sketch at the end of these notes).

### Questions:

1) Why learn a reward model?
2) Can we use the sampled buffer data (B batches of length-L sequences) from the real world directly? Can't we just bootstrap off of this data?
3) Why does imagination in latent space start from sampled states only?
4) Some parts of a potentially complex world do not change, or are never affected by our actions. Can we exploit this to make our algorithms more efficient?
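### Implementation sketches (PyTorch)

The sketches below are minimal illustrations of the components described above, not the authors' implementation. Layer sizes, hyperparameters, and names such as `RSSM`, `pre_gru`, `prior_net`, and `post_net` are my own assumptions.

The first sketch is a single RSSM transition step, assuming a GRU cell for the deterministic path and diagonal Gaussians for the temporal prior and posterior:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td


class RSSM(nn.Module):
    """One-step RSSM transition. Layer sizes are illustrative, not the paper's."""

    def __init__(self, stoch=30, deter=200, action_dim=6, embed_dim=1024, hidden=200):
        super().__init__()
        self.pre_gru = nn.Sequential(nn.Linear(stoch + action_dim, hidden), nn.ELU())
        self.cell = nn.GRUCell(hidden, deter)  # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        self.prior_net = nn.Sequential(nn.Linear(deter, hidden), nn.ELU(),
                                       nn.Linear(hidden, 2 * stoch))  # p(z_t | h_t)
        self.post_net = nn.Sequential(nn.Linear(deter + embed_dim, hidden), nn.ELU(),
                                      nn.Linear(hidden, 2 * stoch))  # p(z_t | h_t, o_t)

    def _dist(self, stats):
        # Diagonal Gaussian over the stochastic state; softplus keeps the std positive.
        mean, std = stats.chunk(2, dim=-1)
        return td.Independent(td.Normal(mean, F.softplus(std) + 0.1), 1)

    def step(self, z_prev, a_prev, h_prev, obs_embed=None):
        # Deterministic recurrent state carries information over many time steps.
        h = self.cell(self.pre_gru(torch.cat([z_prev, a_prev], dim=-1)), h_prev)
        prior = self._dist(self.prior_net(h))  # temporal prior over z_t
        if obs_embed is None:
            return h, prior, None  # imagination: no observation available
        post = self._dist(self.post_net(torch.cat([h, obs_embed], dim=-1)))  # temporal posterior
        return h, prior, post
```

During dynamics learning the recurrence is unrolled with posterior samples; during imagination the same `step` is called with `obs_embed=None` and the prior sample is fed forward instead.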
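The second sketch computes the (negated) reconstruction objective from the decoder outputs and a posterior/prior pair. The free-nats clamp and the loss scales are assumptions in the style of PlaNet-like implementations, not values taken from the paper:

```python
import torch
import torch.distributions as td


def world_model_loss(obs_dist, obs, reward_dist, reward, post, prior,
                     free_nats=3.0, kl_scale=1.0):
    """Negative of the reconstruction objective, so it can be minimised.

    obs_dist / reward_dist are the observation-decoder and reward-model output
    distributions; post / prior are the temporal posterior and prior with event
    dimensions already folded (e.g. via td.Independent).
    """
    recon_ll = obs_dist.log_prob(obs).mean()         # observation log-likelihood
    reward_ll = reward_dist.log_prob(reward).mean()  # reward log-likelihood
    kl = td.kl_divergence(post, prior)               # KL(posterior || prior) per state
    kl = torch.clamp(kl, min=free_nats).mean()       # free-nats clamp: small KLs give no gradient
    return -(recon_ll + reward_ll - kl_scale * kl)
```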
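The last sketch covers behaviour learning on an imagined rollout: $V_{\lambda}$ targets are computed from imagined rewards and values, the actor maximises them (gradients flow back through the learnt dynamics into the action model), and the critic regresses onto the detached targets. Tensor layouts, hyperparameters, and the assumption of one optimiser per module are mine:

```python
import torch


def lambda_returns(rewards, values, discount=0.99, lam=0.95):
    """V_lambda targets over an imagined trajectory.

    rewards: [H, B]; values: [H + 1, B], i.e. the value of the final imagined
    state is appended as the bootstrap. Tensors are time-major.
    """
    returns = []
    last = values[-1]
    for t in reversed(range(rewards.shape[0])):
        last = rewards[t] + discount * ((1 - lam) * values[t + 1] + lam * last)
        returns.append(last)
    return torch.stack(list(reversed(returns)), dim=0)


def behaviour_losses(imag_rewards, imag_values, bootstrap_value):
    """Actor and critic losses on one imagined rollout (a sketch, not the full loop).

    imag_rewards / imag_values come from the reward and value models applied to
    imagined model states, so the actor loss backpropagates analytic value
    gradients through the learnt dynamics. Assumes each module has its own
    optimiser, so cross-module gradients are simply discarded.
    """
    values = torch.cat([imag_values, bootstrap_value.unsqueeze(0)], dim=0)
    targets = lambda_returns(imag_rewards, values)
    actor_loss = -targets.mean()                                    # maximise the lambda-returns
    value_loss = 0.5 * (imag_values - targets.detach()).pow(2).mean()
    return actor_loss, value_loss
```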