---
tags: RL, Planning
---

[TOC]

# Latent Planning

> Preliminaries: [Variational Autoencoders](https://hackmd.io/@bykang/Syje5YEa_)

## Summary

### Paper list

- PlaNet: [Learning Latent Dynamics for Planning from Pixels.](https://arxiv.org/pdf/1811.04551.pdf)
- Dreamer: [Dream to Control: Learning Behaviors by Latent Imagination.](https://arxiv.org/pdf/1912.01603.pdf) [[Code](https://github.com/danijar/dreamer)]
- Dreamer2: [Mastering Atari with Discrete World Models.](https://arxiv.org/pdf/2010.02193.pdf) [[Code](https://github.com/danijar/dreamerv2)]

### Sequential VAE [Need a better name]

:::info
:bulb: Question: How do we model a sequence of data with variational inference?
- Write down the likelihood.
- Be careful about how to estimate the expectation, *i.e.,* what is the distribution that a real example is drawn from?
:::

Consider a sequence $\{o_t, a_t\}_{t=1}^T$ with discrete time step $t$, where $o_t$ is the observation and $a_t$ is the (continuous) action. Suppose there is a latent variable $s_t$, which is sampled from a (transition) distribution $p(s_t| s_{t-1}, a_{t-1})$. Similar to the vanilla VAE, we start with the ***log-likelihood***,

$$
\begin{align}
\log p(o_{1:T}|a_{1:T}) &= \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) p(s_t|s_{t-1}, a_{t-1}) \, d s_{1:T} \\
&= \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) \prod_{t=1}^T p(s_t|s_{t-1}, a_{t-1}) \, d s_{1:T} \\
&= \log \int_{s_{1:T}} \left(\prod_{t=1}^T p(o_t|s_t) \right) p(s_{1:T}|a_{1:T}) \, d s_{1:T} \\
&= \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right]
\end{align}
$$

:::info
Note that in practice we do not have access to the (oracle) transition distribution, and the latent variable $s_t$ is not drawn from $p(s_t|s_{t-1}, a_{t-1})$ but from an encoded distribution (the encoder) $q(s_t|o_{\leq t}, a_{<t})$. To make the above expectation tractable, we need to do importance reweighting:
:::

$$
\begin{align}
&\log p(o_{1:T}|a_{1:T}) \\
=& \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right] \\
=& \log \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})}\left[\prod_{t=1}^T p(o_t|s_t) \frac{p(s_t | s_{t-1}, a_{t-1})}{q(s_t|o_{\leq t}, a_{<t})} \right] \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \log p(s_t | s_{t-1}, a_{t-1}) - \log q(s_t|o_{\leq t}, a_{<t}) \right]_{\text{By Jensen}} \\
=& \sum_{t=1}^T \left( \underset{\text{reconstruction}}{\underline{\mathbb{E}_{q(s_t|o_{\leq t}, a_{<t})}[\log p(o_t|s_t)]}} - \mathbb{E}_{q(s_{t-1}|o_{\leq t-1}, a_{<t-1})}\left[ \underset{\text{pushing transition $p$ towards posterior encoding $q$}}{\underline{\mathcal{D}_{\text{KL}}[q(s_t|o_{\leq t}, a_{<t})||p(s_t|s_{t-1}, a_{t-1})] }} \right] \right)
\end{align}
$$
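To make the bound concrete, here is a minimal sketch of the per-step objective in PyTorch. The function name `sequence_elbo` and the interfaces of `encoder`, `transition`, and `decoder` (each assumed to return a `torch.distributions` object) are assumptions for illustration, not any paper's actual code; the posterior is conditioned recursively through $s_{t-1}$ instead of the full history $o_{\leq t}, a_{<t}$, which is the usual filtering approximation.

```python
import torch
from torch.distributions import kl_divergence


def sequence_elbo(encoder, transition, decoder, observations, prev_actions, s0):
    """Single-sample estimate of the one-step ELBO above.

    observations: list of tensors o_1 ... o_T
    prev_actions: list of tensors where prev_actions[t] plays the role of a_{t-1}
                  (the first entry can be a zero action)
    s0:           initial latent state s_0
    encoder / transition / decoder are assumed to return torch.distributions objects.
    """
    elbo = torch.zeros(())
    s_prev = s0
    for t in range(len(observations)):
        posterior = encoder(s_prev, prev_actions[t], observations[t])  # ~ q(s_t | o_{<=t}, a_{<t})
        prior = transition(s_prev, prev_actions[t])                    # p(s_t | s_{t-1}, a_{t-1})
        s_t = posterior.rsample()                                      # reparameterized sample from q
        recon = decoder(s_t).log_prob(observations[t]).sum()           # E_q[log p(o_t | s_t)]
        kl = kl_divergence(posterior, prior).sum()                     # D_KL[q || p]
        elbo = elbo + recon - kl
        s_prev = s_t
    return elbo  # maximize this (equivalently, minimize -elbo)
```

Each summand is exactly the reconstruction term minus the KL term of the bound, estimated with a single sample from $q$.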
#### Multi-step predictive distribution

Above, we implicitly assumed the transition model $p(s_t|s_{t-1}, a_{t-1})$ does one-step prediction. One might wonder: what if $d$-step prediction is desired? The corresponding likelihood is given by

$$
\log p(o_{1:T}|a_{1:T}) = \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) p(s_t|s_{t-d}, a_{t-d:t-1}) \, d s_{1:T}.
$$

Thus, the evidence lower bound is

$$
\begin{align}
&\log p(o_{1:T}|a_{1:T}) \\
=& \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right] \\
=& \log \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})}\left[\prod_{t=1}^T p(o_t|s_t) \frac{p(s_t|s_{t-d}, a_{t-d:t-1})}{q(s_t|o_{\leq t}, a_{<t})} \right] \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \log p(s_t|s_{t-d}, a_{t-d:t-1}) - \log q(s_t|o_{\leq t}, a_{<t}) \right]_{\text{By Jensen}} \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \mathbb{E}_{p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}[\log p(s_t|s_{t-1}, a_{t-1})] - \log q(s_t|o_{\leq t}, a_{<t}) \right] \\
=& \sum_{t=1}^T \left( \underset{\text{reconstruction}}{\underline{\mathbb{E}_{q(s_t|o_{\leq t}, a_{<t})}[\log p(o_t|s_t)]}} - \mathbb{E}_{q(s_{t-d}|o_{\leq t-d}, a_{<t-d}) \, p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}\left[ \underset{\text{pushing transition $p$ towards posterior encoding $q$}}{\underline{\mathcal{D}_{\text{KL}}[q(s_t|o_{\leq t}, a_{<t})||p(s_t|s_{t-1}, a_{t-1})] }} \right] \right).
\end{align}
$$

The penultimate step follows by noting the recursion $p(s_t|s_{t-d}, a_{t-d:t-1}) = \mathbb{E}_{p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}[p(s_t|s_{t-1}, a_{t-1})]$ and applying Jensen's inequality once more to move the $\log$ inside this expectation. Since all expectations are on the outside of the objective, we can easily obtain an unbiased estimator of this bound by replacing expectations with sample averages.

### Recurrent State Space Model (RSSM)

In the above, we showed how to model a sequence of data by leveraging latent variables. Connecting the model to RL, the latent variable $s_t$ can be viewed as an underlying **state** at time $t$. It is easily verified that $s_t$ is a purely stochastic variable under the above formulation, which corresponds to (b) in the figure below; if we remove the stochasticity, it reduces to the deterministic model shown in (a). According to PlaNet, **purely stochastic transitions make it difficult for the transition model to reliably remember information over multiple time steps.** This motivates RSSM, which splits the state into stochastic and deterministic parts.

![](https://i.imgur.com/jDFR4WQ.png "Latent dynamics model designs (Fig 2 in PlaNet)")

The model is given by

$$
\begin{align}
\text{Deterministic State Model: } \quad &h_t = f(h_{t-1}, s_{t-1}, a_{t-1}) \\
\text{Stochastic State Model (prior): } \quad & s_t \sim p(s_t|h_t) \\
\text{Stochastic State Encoder (posterior): } \quad & s_t \sim q(s_t | h_t, o_t) \\
\text{Observation Decoder: } \quad & o_t \sim p(o_t | h_t, s_t) \\
\text{Reward Decoder: } \quad & r_t \sim p(r_t | h_t, s_t),
\end{align}
$$

where $f(h_{t-1}, s_{t-1}, a_{t-1})$ is implemented as an RNN.
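To make the five components concrete, below is a minimal single-step RSSM cell written as a PyTorch sketch; it is not the official PlaNet/Dreamer code. The layer sizes, the GRU choice for $f$, and the softplus-parameterized Gaussian are assumptions.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.distributions import Normal


class RSSMCell(nn.Module):
    """One transition step of an RSSM: deterministic GRU path + stochastic state."""

    def __init__(self, stoch_dim=30, deter_dim=200, action_dim=6, embed_dim=1024, hidden_dim=200):
        super().__init__()
        self.pre_rnn = nn.Linear(stoch_dim + action_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, deter_dim)                     # h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)             # parameters of p(s_t | h_t)
        self.post_net = nn.Linear(deter_dim + embed_dim, 2 * stoch_dim)  # parameters of q(s_t | h_t, o_t)

    @staticmethod
    def _to_gaussian(stats):
        mean, std = stats.chunk(2, dim=-1)
        return Normal(mean, F.softplus(std) + 0.1)       # keep the scale strictly positive

    def forward(self, h_prev, s_prev, a_prev, obs_embed=None):
        x = F.relu(self.pre_rnn(torch.cat([s_prev, a_prev], dim=-1)))
        h = self.rnn(x, h_prev)                           # deterministic state h_t
        prior = self._to_gaussian(self.prior_net(h))      # p(s_t | h_t)
        post = None
        if obs_embed is not None:                         # training: the observation o_t is available
            post = self._to_gaussian(self.post_net(torch.cat([h, obs_embed], dim=-1)))
        s = (post if post is not None else prior).rsample()
        return h, s, prior, post
```

The observation and reward decoders would be separate networks taking $[h_t, s_t]$ as input. During planning or imagination rollouts, `obs_embed` is `None` and the next state is sampled from the prior; during model fitting, the posterior is used and `kl_divergence(post, prior)` plus the reconstruction term recover the sequential ELBO of the previous section.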
### Deep Planning Network (PlaNet)

The main idea behind PlaNet consists of two parts:

:::success
- Fit a forward model (RSSM).
- Collect data by executing a policy, where the policy is obtained by planning with the learned model.
:::

The algorithm is given by the following pseudocode:

```python
dataset = [seed_trajs]  # each traj is a list of (obs, action, reward) tuples
params = init_params()  # parameters of the five model components above

while not converged():
    # Model fitting
    for i in range(model_steps):
        batch = sample(dataset, batch_size)
        params = update(params, batch)  # sequential VAE (ELBO) loss

    # Data collection
    obs = env.reset()
    traj = []
    for t in range(env_steps):
        s = sample_from(stoch_state_encoder(obs))  # q(s_t | o_{<=t}, a_{<t})
        a = planner(params, s)                     # CEM planning in latent space
        obs_next, r = env.step(a)                  # may repeat the action several times
        traj.append((obs, a, r))
        obs = obs_next
    dataset.append(traj)
```

#### Cross Entropy Method: Planning

Once the forward model is properly learned, we face another question: **how to convert a forward model into a policy?** PlaNet achieves this with the cross entropy method (CEM), which iterates the following steps:

:::success
- Sample candidate action sequences from the current action distributions (initialized as standard normals) and roll them out with the learned model.
- Select the top-$K$ best sequences by imagined return.
- Refit the action distributions to the top-$K$ sequences.
:::

See Appendix B of the PlaNet paper for more details.

```python
# H is the planning horizon; s0 is the current latent state from the encoder
act_dists = [normal(0, I) for _ in range(H)]

for i in range(opt_iters):
    # Evaluate `num_trajs` candidate action sequences with the learned model
    trajs = []
    for j in range(num_trajs):
        s, actions, ret = s0, [], 0
        for t in range(H):
            a = act_dists[t].sample()
            s, r = forward_model(s, a)
            actions.append(a)
            ret += r
        trajs.append((actions, ret))

    # Select the K best action sequences by imagined return
    topK_trajs = sorted(trajs, key=lambda x: x[-1], reverse=True)[:K]

    # Refit the action distributions
    for t in range(H):
        actions_at_t = [traj[0][t] for traj in topK_trajs]
        mu = mean(actions_at_t)
        sigma = std(actions_at_t)
        act_dists[t] = normal(mu, sigma)
```

### Dreamer

Dreamer uses exactly the same forward model (RSSM) as PlaNet; ***the difference is how it obtains a policy***. Instead of generating actions by planning, Dreamer directly fits an actor-critic on purely imagined data, summarized as follows:

:::success
- Fit a forward model (RSSM).
- Fit an actor-critic on purely imagined data.
- Collect data by executing the learned actor.
:::

The full algorithm is given by

![](https://i.imgur.com/D35YwKR.png)

#### Actor-Critic Learning

Now we detail how Dreamer performs actor-critic learning.

- **Value estimation.** Given imagined trajectories $\{s_\tau, a_\tau, r_\tau\}_{\tau=t}^{t+H}$, Dreamer uses a $\lambda$-return defined as follows (a short computational sketch is given at the end of this section),

$$
\begin{align}
V_R(s_\tau) &\doteq \mathbb{E}\left[\sum_{n=\tau}^{t+H} r_n \right] \\
V_N^k(s_\tau) &\doteq \mathbb{E} \left[\sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n + \gamma^{h-\tau} v_\psi(s_h) \right] \quad \text{with} \quad h=\min(\tau+k, t+H) \\
V_\lambda(s_\tau) &\doteq (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) + \lambda^{H-1} V_N^H(s_\tau).
\end{align}
$$

- **Actor loss.** Since we have a differentiable value network $v_\psi$, with the reparameterization trick we can backpropagate gradients from the value network through the learned dynamics to the policy network. The policy is thus trained with

$$
\max_{\theta} \mathbb{E}_{\pi_\theta}\left[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \right].
$$

The value network is optimized by regressing onto the $V_\lambda$ targets.
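The $\lambda$-return above can be computed with the standard backward recursion $V_\lambda(s_\tau) = r_\tau + \gamma\big((1-\lambda)\, v_\psi(s_{\tau+1}) + \lambda\, V_\lambda(s_{\tau+1})\big)$, bootstrapped with $v_\psi$ at the imagination horizon. Below is a minimal sketch; the function name and tensor layout are assumptions, not Dreamer's actual code.

```python
import torch


def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """Backward recursion for the lambda-return targets.

    rewards:     r_t ... r_{t+H-1},                 shape [H, batch]
    next_values: v_psi(s_{t+1}) ... v_psi(s_{t+H}), shape [H, batch]
    """
    H = rewards.shape[0]
    outputs = []
    last = next_values[-1]  # bootstrap V_lambda at the horizon with v_psi(s_{t+H})
    for i in reversed(range(H)):
        last = rewards[i] + gamma * ((1 - lam) * next_values[i] + lam * last)
        outputs.append(last)
    return torch.stack(outputs[::-1])  # shape [H, batch], aligned with rewards
```

The actor is then updated to maximize the mean of these returns (gradients flow back through the imagined trajectory thanks to reparameterized samples), while the critic regresses $v_\psi(s_\tau)$ onto the detached targets.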
### Dreamer V2

| Component | Dreamer | Dreamer V2 |
| -------- | -------- | -------- |
| Latent variable | Continuous (Gaussian) | Discrete (categorical) |
| Actor loss | Dynamics backpropagation | Reinforce only |
| KL loss | Joint optimization | KL balancing |
| Exploration | Action noise | Policy entropy maximization |

### What's Next?