# split importance sampling for off-policy RL
## Normal importance sampling
The policy $\pi(a\vert s)$ selects the next action given the current state. We sample trajectories $\tau$ using this policy. We would like to evaluate the expected reward of an alternative policy $\tilde{\pi}$, which we can do via importance sampling as follows:
\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau) &= \mathbb{E}_{\tau \sim p} \frac{\tilde{p}(\tau)}{p(\tau)} r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\tilde{\pi}(a_t\vert s_t)}{\pi(a_t\vert s_t)} r(\tau)
\end{align}
where $r(\tau)$ is the total (or discounted) reward of trajectory $\tau$. The problem is that the importance weights have high variance, which makes the estimator unreliable once $\tilde{\pi}$ differs too much from $\pi$, the policy we sampled data from. (This is why we often optimize within a trust region: it keeps the estimate reliable.)
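To make the estimator concrete, here is a minimal numpy sketch of the trajectory-level estimator above. It assumes we have already logged the per-step log-probabilities under both policies; the array names and shapes are my own choices, not from any particular codebase:

```python
import numpy as np

def is_estimate(logp_behaviour, logp_target, returns):
    """Ordinary importance-sampling estimate of E_{tau ~ p~}[r(tau)].

    logp_behaviour, logp_target: (N, T) arrays of per-step log-probabilities
        log pi(a_t|s_t) and log pi~(a_t|s_t) for N logged trajectories.
    returns: (N,) array with the total (or discounted) reward r(tau).
    """
    # Product of per-step ratios, accumulated in log space for stability.
    log_w = (logp_target - logp_behaviour).sum(axis=1)   # shape (N,)
    w = np.exp(log_w)
    return np.mean(w * returns), w
```

The returned weights `w` are worth inspecting: their variance typically grows with the horizon $T$ and with how far $\tilde{\pi}$ drifts from $\pi$, which is exactly the reliability problem described above.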
## Idea: split policy into two stages
What if we define our policy as a two-stage sampling process:
$$
\pi(a_t\vert s_t) = \int \pi(a_t\vert z_t) \rho (z_t\vert s_t) dz_t
$$
So we first generate a partial decision $z_t$ based on the state $s_t$, and then generate the action from this partial decision $z_t$. This construction is used in the [Causally Correct Partial Models](https://arxiv.org/pdf/2002.02836v1.pdf) paper, but for an entirely different reason from the one I'm proposing here.
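As a concrete, purely illustrative instance, here is a one-dimensional linear-Gaussian version of such a split policy. Both stages are Gaussian, so the marginal $\pi(a_t\vert s_t)$ is available in closed form; all parameter values below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(s, w_rho=1.0, sigma_rho=0.5, sigma_pi=0.3):
    """Two-stage sampling: z ~ rho(z|s), then a ~ pi(a|z)."""
    z = w_rho * s + sigma_rho * rng.standard_normal()   # partial decision
    a = z + sigma_pi * rng.standard_normal()            # final action
    return a, z

def logp_marginal(a, s, w_rho=1.0, sigma_rho=0.5, sigma_pi=0.3):
    """log pi(a|s) = log ∫ pi(a|z) rho(z|s) dz, here N(w_rho*s, sigma_rho^2 + sigma_pi^2)."""
    var = sigma_rho**2 + sigma_pi**2
    return -0.5 * ((a - w_rho * s) ** 2 / var + np.log(2 * np.pi * var))
```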
With such a split policy we can do importance sampling in two ways. Either we ignore $z_t$ and calculate the importance weights just as before:
\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau) &= \mathbb{E}_{\tau \sim p} \frac{\tilde{p}(\tau)}{p(\tau)}r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\tilde{\pi}(a_t\vert s_t)}{\pi(a_t\vert s_t)} r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\int \tilde{\pi}(a_t\vert z_t) \tilde{\rho} (z_t\vert s_t) dz_t}{\int \pi(a_t\vert z_t) \rho (z_t\vert s_t) dz_t} r(\tau)
\end{align}
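When the marginal densities in the numerator and denominator have no closed form, one option is to estimate each of them by Monte Carlo, sampling $z_t$ from the corresponding first stage and averaging $\pi(a_t\vert z_t)$. A rough sketch, with hypothetical callables standing in for whatever parameterisation the policy actually uses:

```python
import numpy as np

def logp_marginal_mc(log_pi_a_given_z, sample_z, a, s, K=256):
    """Monte Carlo estimate of log ∫ pi(a|z) rho(z|s) dz.

    sample_z(s, K): draws K latents from rho(z|s).
    log_pi_a_given_z(a, z): returns log pi(a|z) for each sampled z.
    Both callables are placeholders, not part of any specific library.
    """
    z = sample_z(s, K)                   # (K,) samples from rho(z|s)
    logp = log_pi_a_given_z(a, z)        # (K,) values of log pi(a|z_k)
    # log-mean-exp for numerical stability
    m = logp.max()
    return m + np.log(np.mean(np.exp(logp - m)))
```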
Or we can keep the sampled $z_t$ values alongside the observed trajectory $\tau$ and involve them in the importance weighting as follows:
\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau)&= \mathbb{E}_{\tau, z \sim \tilde{p}} r(\tau)\\
&= \mathbb{E}_{\tau,z\sim p} \prod_{t=1}^T \left(\frac{\tilde{\pi}(a_t\vert z_t)}{\pi(a_t\vert z_t)}\frac{\tilde{\rho}(z_t\vert s_t)}{\rho(z_t\vert s_t)} \right) r(\tau)
\end{align}
We get different importance sampling-based estimators in these two cases. The question is whether the second version, which uses extra information about how the policy makes its decisions, has any advantage over the first one.
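For intuition, here is a self-contained one-step toy comparison of the two weightings, using the same linear-Gaussian construction as above (again, all numbers are illustrative). It estimates the same target quantity with both weightings and prints the empirical variance of the weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-step toy problem. Behaviour policy: z ~ N(0, s_rho^2), a ~ N(z, s_pi^2).
# Target policy shifts the mean of the first stage by `shift` and keeps the
# second stage the same. Reward is simply r(a) = a.
s_rho, s_pi, shift, N = 1.0, 0.5, 0.8, 200_000

z = s_rho * rng.standard_normal(N)
a = z + s_pi * rng.standard_normal(N)
r = a

def logN(x, mu, var):
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))

# Estimator 1: marginal weights pi~(a|s)/pi(a|s), with z integrated out
# analytically (both marginals are Gaussian with variance s_rho^2 + s_pi^2).
var_marg = s_rho**2 + s_pi**2
w_marg = np.exp(logN(a, shift, var_marg) - logN(a, 0.0, var_marg))

# Estimator 2: joint weights over (z, a); the pi(a|z) factors cancel here
# because only the first stage differs between the two policies.
w_joint = np.exp(logN(z, shift, s_rho**2) - logN(z, 0.0, s_rho**2))

for name, w in [("marginal", w_marg), ("joint", w_joint)]:
    print(f"{name:8s} estimate={np.mean(w * r):.3f}  "
          f"weight-variance={w.var():.3f}")
# The true value of E_{p~}[a] is `shift` = 0.8 in this construction.
```

Both weightings give an unbiased estimate of the same expectation; what differs is the variance of the weights, which is exactly the quantity the question above is about.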