---
tags: RL, Planning
---
[TOC]
# Latent Planning
> Preliminaries: [Variational Autoencoders](https://hackmd.io/@bykang/Syje5YEa_)
## Summary
### Paper list
- PlaNet: [Learning Latent Dynamics for Planning from Pixels.](https://arxiv.org/pdf/1811.04551.pdf)
- Dreamer: [Dream to Control: Learning Behaviors by Latent Imagination.](https://arxiv.org/pdf/1912.01603.pdf) [[Code](https://github.com/danijar/dreamer)]
- Dreamer2: [Mastering Atari with Discrete World Models.](https://arxiv.org/pdf/2010.02193.pdf) [[Code](https://github.com/danijar/dreamerv2)]
### Sequential VAE [Need a better name]
:::info
:bulb:Question: How to model a sequence of data with variational inference?
- Write down the likelihood
- Be careful about how to estimate the expectation, *i.e.,* what is the distribution that a real example is drawn from?
:::
Consider a sequence $\{o_t, a_t\}_{t=1}^T$ with discrete time step $t$, where $o_t$ is the observation and $a_t$ is the (continuous) action. Suppose there is a latent variable $s_t$, which is sampled from a (transition) distribution $p(s_t| s_{t-1}, a_{t-1})$. Similar to the vanilla VAE, we start with the ***log-likelihood***,
$$
\begin{align}
\log p(o_{1:T}|a_{1:T})
&= \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) p(s_t|s_{t-1}, a_{t-1}) \, ds_{1:T} \\
&= \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) \prod_{t=1}^T p(s_t|s_{t-1}, a_{t-1}) \, ds_{1:T} \\
&= \log \int_{s_{1:T}} \left(\prod_{t=1}^T p(o_t|s_t) \right) p(s_{1:T}|a_{1:T}) \, ds_{1:T} \\
&= \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right]
\end{align}
$$
:::info
Note that in practice, we do not have access to the (oracle) transition distribution, so the latent variable $s_t$ is not drawn from $p(s_t|s_{t-1}, a_{t-1})$ but from an encoded distribution (the encoder) $q(s_t|o_{\leq t}, a_{<t})$. To make the above expectation tractable we need to do importance reweighting:
:::
$$
\begin{align}
&\log p(o_{1:T}|a_{1:T}) \\
=& \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right] \\
=& \log \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})}\left[\prod_{t=1}^T p(o_t|s_t) \frac{p(s_t | s_{t-1}, a_{t-1})}{q(s_t|o_{\leq t}, a_{<t})} \right] \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \log p(s_t | s_{t-1}, a_{t-1}) - \log q(s_t|o_{\leq t}, a_{<t}) \right]_{\text{By Jensen}} \\
=& \sum_{t=1}^T \left(
\underset{\text{reconstruction}}{\underline{\mathbb{E}_{q(s_t|o_{\leq t}, a_{<t})}[\log p(o_t|s_t)]}} - \mathbb{E}_{q(s_{t-1}|o_{\leq t-1}, a_{<t-1})}\left[
\underset{\text{pushing transition $p$ towards posterior encoding $q$}}{\underline{\mathcal{D}_{\text{KL}}[q(s_t|o_{\leq t}, a_{<t})||p(s_t|s_{t-1}, a_{t-1})] }}
\right]
\right)
\end{align}
$$
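To make the bound concrete, the sketch below shows how the resulting per-sequence loss could be written with `torch.distributions`. The callables `enc`, `trans`, and `dec` (each returning a distribution object) and the filtering-style recursion that feeds the sampled $s_{t-1}$ back into the encoder are assumptions for illustration, not code from the papers.
```python
import torch
from torch.distributions import kl_divergence

def seq_elbo_loss(obs, acts, enc, trans, dec, s0):
    """Negative ELBO for one sequence (minimal sketch).

    obs:  (T, obs_dim) observations o_{1:T}
    acts: (T, act_dim) actions a_{1:T} (a_0 is taken to be zeros)
    enc(o_t, s_prev, a_prev) -> q(s_t | o_{<=t}, a_{<t})
    trans(s_prev, a_prev)    -> p(s_t | s_{t-1}, a_{t-1})
    dec(s_t)                 -> p(o_t | s_t)
    """
    s_prev, a_prev = s0, torch.zeros_like(acts[0])
    recon, kl = 0.0, 0.0
    for t in range(obs.shape[0]):
        post = enc(obs[t], s_prev, a_prev)        # posterior q
        prior = trans(s_prev, a_prev)             # transition prior p
        s_t = post.rsample()                      # reparameterized sample
        recon += dec(s_t).log_prob(obs[t]).sum()  # reconstruction term
        kl += kl_divergence(post, prior).sum()    # KL(q || p) term
        s_prev, a_prev = s_t, acts[t]
    return -(recon - kl)  # minimize the negative bound
```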
#### Multi-step predictive distribution
In the above, we implicitly assume the transition model $p(s_t|s_{t-1}, a_{t-1})$ performs one-step prediction. One might wonder: what if $d$-step prediction is desired? The corresponding likelihood is given by
$$
\log p(o_{1:T}|a_{1:T})
= \log \int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_t) p(s_t|s_{t-d}, a_{t-d:t-1}) \, ds_{1:T}.
$$
Thus, the evidence lower bound is
$$
\begin{align}
&\log p(o_{1:T}|a_{1:T}) \\
=& \log \mathbb{E}_{p(s_{1:T}|a_{1:T})}\left[ \prod_{t=1}^T p(o_t|s_t) \right] \\
=& \log \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})}\left[\prod_{t=1}^T p(o_t|s_t) \frac{p(s_t|s_{t-d}, a_{t-d:t-1})}{q(s_t|o_{\leq t}, a_{<t})} \right] \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \log p(s_t|s_{t-d}, a_{t-d:t-1}) - \log q(s_t|o_{\leq t}, a_{<t}) \right]_{\text{By Jensen}} \\
\geq& \mathbb{E}_{q(s_{1:T}|o_{1:T},a_{1:T})} \left[ \sum_{t=1}^{T} \log p(o_t|s_t) + \mathbb{E}_{p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}\left[\log p(s_t|s_{t-1}, a_{t-1})\right] - \log q(s_t|o_{\leq t}, a_{<t}) \right]_{\text{By Jensen again}} \\
=& \sum_{t=1}^T \left(
\underset{\text{reconstruction}}{\underline{\mathbb{E}_{q(s_t|o_{\leq t}, a_{<t})}[\log p(o_t|s_t)]}} - \mathbb{E}_{q(s_{t-d}|o_{\leq t-d}, a_{<t-d})\, p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}\left[
\underset{\text{pushing transition $p$ towards posterior encoding $q$}}{\underline{\mathcal{D}_{\text{KL}}[q(s_t|o_{\leq t}, a_{<t})||p(s_t|s_{t-1}, a_{t-1})] }}
\right]
\right).
\end{align}
$$
The penultimate step follows from the recursion $p(s_t|s_{t-d}, a_{t-d:t-1}) = \mathbb{E}_{p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}[p(s_t|s_{t-1}, a_{t-1})]$ together with one more application of Jensen's inequality, *i.e.,* $\log p(s_t|s_{t-d}, a_{t-d:t-1}) \geq \mathbb{E}_{p(s_{t-1}|s_{t-d}, a_{t-d:t-2})}[\log p(s_t|s_{t-1}, a_{t-1})]$.
Since all expectations are on the outside of the objective, we can easily obtain an unbiased estimator of this bound by replacing expectations with sample averages.
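As a small illustration of how this multi-step term would be estimated in practice: sample $s_{t-d}$ from the encoder, roll the one-step prior forward $d-1$ steps, and evaluate the KL at step $t$. The sketch below is my own; `posts`, `trans`, and `acts` are assumed containers/callables returning `torch.distributions` objects, not code from the papers.
```python
from torch.distributions import kl_divergence

def d_step_kl(posts, trans, acts, d):
    """Monte-Carlo estimate of the d-step KL term in the bound above.

    posts[t]: posterior q(s_t | o_{<=t}, a_{<t}) as a distribution object
    trans(s, a): one-step prior p(s' | s, a) as a distribution object
    acts[t]:  action a_t
    """
    kl = 0.0
    for t in range(d, len(posts)):
        s = posts[t - d].rsample()        # s_{t-d} ~ q(. | o_{<=t-d}, a_{<t-d})
        for k in range(t - d, t - 1):     # roll the prior forward d-1 steps
            s = trans(s, acts[k]).rsample()
        prior_t = trans(s, acts[t - 1])   # p(s_t | s_{t-1}, a_{t-1})
        kl += kl_divergence(posts[t], prior_t).sum()
    return kl
```
Setting $d=1$ recovers the one-step KL term of the previous section.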
### Recurrent State Space Model (RSSM)
In the above, we showed how to model a sequence of data by leveraging latent variables. Connecting the model to RL, the latent variable $s_t$ can be viewed as an underlying **state** at time $t$. It is easily verified that $s_t$ is a purely stochastic variable under the above formulation, which corresponds to (b) in the figure below. If we remove the stochasticity, it reduces to a deterministic model as shown in (a). According to PlaNet, **purely stochastic transitions make it difficult for the transition model to reliably remember information for multiple time steps.** This motivates RSSM, which splits the state into stochastic and deterministic parts.
*(Figure from the PlaNet paper comparing the deterministic, purely stochastic, and recurrent state-space models.)*
The model is given by
$$
\begin{align}
\text{Deterministic State Model: } \quad &h_t = f(h_{t-1}, s_{t-1}, a_{t-1}) \\
\text{Stochastic State Model (prior): } \quad & s_t \sim p(s_t|h_t) \\
\text{Stochastic State Encoder (posterior): } \quad & s_t \sim q(s_t | h_t, o_t) \\
\text{Observation Decoder: } \quad & o_t \sim p(o_t | h_t, s_t) \\
\text{Reward Decoder: } \quad & r_t \sim p(r_t | h_t, s_t),
\end{align}
$$
where $f(h_{t-1}, s_{t-1}, a_{t-1})$ is implemented as an RNN.
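A minimal single-step PyTorch sketch of this model may help; it is my own rendering for intuition, not the authors' implementation. The diagonal-Gaussian parameterization (via softplus), the layer shapes, and the `o_embed` observation embedding are assumptions, and the observation and reward decoders are omitted.
```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class RSSMCell(nn.Module):
    def __init__(self, s_dim, h_dim, a_dim, o_embed_dim):
        super().__init__()
        self.rnn = nn.GRUCell(s_dim + a_dim, h_dim)                # deterministic path f
        self.prior_net = nn.Linear(h_dim, 2 * s_dim)               # p(s_t | h_t)
        self.post_net = nn.Linear(h_dim + o_embed_dim, 2 * s_dim)  # q(s_t | h_t, o_t)

    def forward(self, h_prev, s_prev, a_prev, o_embed=None):
        # Deterministic state: h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        h = self.rnn(torch.cat([s_prev, a_prev], dim=-1), h_prev)
        # Prior p(s_t | h_t)
        mu_p, raw_p = self.prior_net(h).chunk(2, dim=-1)
        prior = Normal(mu_p, nn.functional.softplus(raw_p) + 1e-4)
        if o_embed is None:          # imagination: no observation available
            return h, prior, None
        # Posterior q(s_t | h_t, o_t)
        mu_q, raw_q = self.post_net(torch.cat([h, o_embed], dim=-1)).chunk(2, dim=-1)
        post = Normal(mu_q, nn.functional.softplus(raw_q) + 1e-4)
        return h, prior, post
```
During training, $s_t$ is sampled from the posterior and the KL between posterior and prior is penalized as in the bound above; during imagination the cell is rolled forward by sampling from the prior.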
### Deep Planning Network (PlaNet)
The main idea behind PlaNet contains two parts:
:::success
- Fit a forward model (RSSM).
- Collect data by executing a policy.
    - The policy is obtained by planning with the learned model.
:::
The algorithm is given by
```python=3
dataset = [init_traj]   # each traj is a list of (obs, action, reward) of length N
params = init_params()  # parameters of the five model components above
while not_converged():
    # Fit the model
    for i in range(model_steps):
        batch = sample(dataset, batch_size)
        params = update(params, batch)  # sequential VAE (RSSM) loss
    # Data collection
    obs, a_last = env.reset(), zero_action()
    traj = []
    for t in range(env_steps):
        s = sample_from(stoch_state_encoder(obs))  # q(s_t | o_{<=t}, a_{<t})
        a = planner(params, s, a_last)             # plan in latent space (CEM, below)
        obs_next, r = env.step(a)                  # may repeat the action
        traj.append((obs, a, r))
        obs, a_last = obs_next, a
    dataset.append(traj)
```
#### Cross Entropy Method: Planning
Once the forward model is properly learned, we face another question: **how to convert a forward model into a policy?** PlaNet achieves this with the cross-entropy method, iterating the following steps:
:::success
- Execute a policy (possibly initialized randomly) to collect trajectories with the learned model.
- Select the top-$K$ best trajectories.
- Fit a new (better) policy to the top-$K$ trajectories.
:::
See the PlaNet paper, Appendix B, for more details.
```python=3
# H is the planning horizon; s0 is the current (posterior) latent state
act_dists = [normal(0, I) for _ in range(H)]
for i in range(opt_iters):
    # Roll out `num_trajs` candidate action sequences with the learned model
    trajs = []
    for j in range(num_trajs):
        actions, ret, s = [], 0, s0
        for t in range(H):
            a = act_dists[t].sample()
            s, r = forward_model(s, a)   # predict next latent state and reward
            actions.append(a)
            ret += r
        trajs.append((actions, ret))
    # Select the K best action sequences (highest predicted return)
    topK_trajs = sorted(trajs, key=lambda x: x[-1], reverse=True)[:K]
    # Refit the action distributions
    for t in range(H):
        actions_at_t = [traj[0][t] for traj in topK_trajs]
        mu, sigma = mean(actions_at_t), std(actions_at_t)
        act_dists[t] = normal(mu, sigma)
```
### Dreamer
Dreamer uses exactly the same forward model (RSSM) as PlaNet; ***the difference is how it obtains a policy***. Instead of generating policies with planning, Dreamer directly fits an actor-critic on purely imagined data, as summarized below:
:::success
- Fit a forward model (RSSM).
- Fit an actor-critic with purely imagined data.
- Collect data by executing the learned actor.
:::
The full algorithm, which alternates these three phases, is given in the Dreamer paper; a rough sketch follows.
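This is a hypothetical pseudocode sketch in the same style as the PlaNet loop above; helper names such as `posterior_states`, `imagine`, `lambda_returns`, and the `update_*` functions are placeholders for illustration, not the authors' API.
```python
dataset = [seed_episodes]  # collected with random actions
while not_converged():
    # (1) Dynamics learning: fit the RSSM on replayed sequences
    for i in range(model_steps):
        batch = sample(dataset, batch_size)
        model_params = update_model(model_params, batch)          # sequential VAE loss
    # (2) Behavior learning: imagine H-step rollouts from posterior states
    for i in range(behavior_steps):
        starts = posterior_states(model_params, sample(dataset, batch_size))
        states, rewards = imagine(model_params, actor_params, starts, H)
        targets = lambda_returns(rewards, value_params, states)     # V_lambda targets
        actor_params = update_actor(actor_params, targets)          # maximize V_lambda
        value_params = update_value(value_params, states, targets)  # regression
    # (3) Environment interaction with the learned actor (plus exploration noise)
    dataset.append(collect_episode(env, model_params, actor_params))
```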

#### Actor-Critic Learning
Now we detail how Dreamer performs actor-critic learning.
- **Value estimation**
Consider imagined trajectories $\{s_\tau, a_\tau, r_\tau\}_{\tau=t}^{t+H}$. Dreamer uses a $\lambda$-return, defined as follows,
$$
\begin{align}
V_R(s_\tau)& \doteq \mathbb{E}\left[\sum_{n=\tau}^{t+H} r_n \right] \\
V_N^k(s_\tau) &\doteq \mathbb{E} \left[\sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n + \gamma^{h-\tau} v_\psi(s_h) \right] \quad \text{with} \quad h=\min(\tau+k, t+H) \\
V_\lambda(s_\tau) &\doteq (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) + \lambda^{H-1} V_N^H(s_\tau).
\end{align}
$$
- **Actor losses**
Note that since we have a differentiable value network $v_\psi$, with the reparameterization trick we can directly backpropagate gradients from the value network (through the learned dynamics) to the policy network. Thus the policy is trained with
$$
\max_{\theta} \mathbb{E}_{\pi_\theta}\left[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \right].
$$
The value network is optimized by regression towards the $\lambda$-return targets $V_\lambda(s_\tau)$; a sketch of computing these targets is given below.
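The $\lambda$-return above can be computed with the equivalent backward recursion $V_\lambda(s_\tau) = r_\tau + \gamma\left[(1-\lambda)\, v_\psi(s_{\tau+1}) + \lambda\, V_\lambda(s_{\tau+1})\right]$, bootstrapped with $v_\psi(s_{t+H})$ at the horizon. The sketch below is my own, assuming plain lists (or tensors) of imagined rewards and values:
```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute the V_lambda targets for one imagined rollout.

    rewards: [r_t, ..., r_{t+H-1}] predicted by the reward model
    values:  [v(s_{t+1}), ..., v(s_{t+H})] from the value network v_psi
    """
    returns = [None] * len(rewards)
    next_return = values[-1]  # bootstrap with v(s_{t+H}) at the horizon
    for tau in reversed(range(len(rewards))):
        next_return = rewards[tau] + gamma * ((1 - lam) * values[tau] + lam * next_return)
        returns[tau] = next_return
    return returns
```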
### Dreamer V2
| Component | Dreamer | Dreamer V2 |
| -------- | -------- | -------- |
| Latent variable | Continuous (Gaussian) | Discrete (categorical) |
| Actor loss | Dynamics backpropagation | Reinforce only |
| KL loss | Joint optimization | KL balancing |
| Exploration | Action noise | Policy entropy maximization |
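Two entries in the table deserve a note. KL balancing splits the KL term so that its gradient trains the prior more strongly than the posterior; below is a minimal sketch of this idea (my own, assuming logits-parameterized categorical latents and the paper's mixing weight of 0.8). The discrete latents themselves are sampled with straight-through gradients.
```python
import torch
from torch.distributions import Categorical, kl_divergence

def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """KL balancing sketch for a vector of categorical latents.

    post_logits, prior_logits: (batch, num_latents, num_classes) logits
    alpha: weight on the term that trains the prior (0.8 in the paper)
    """
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())    # stop-gradient posterior
    prior_sg = Categorical(logits=prior_logits.detach())  # stop-gradient prior
    kl_prior = kl_divergence(post_sg, prior).sum(-1)      # gradients reach the prior only
    kl_post = kl_divergence(post, prior_sg).sum(-1)       # gradients reach the posterior only
    return (alpha * kl_prior + (1 - alpha) * kl_post).mean()
```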
### What's Next?