# split importance sampling for off-policy RL

## Normal importance sampling

The policy $\pi(a\vert s)$ selects the next action given the state. We sample trajectories $\tau$ using this policy. We would like to evaluate the reward of an alternative policy $\tilde{\pi}$, which we can do via importance sampling as follows:

\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau) &= \mathbb{E}_{\tau \sim p} \frac{\tilde{p}(\tau)}{p(\tau)} r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\tilde{\pi}(a_t\vert s_t)}{\pi(a_t\vert s_t)} r(\tau)
\end{align}

where $r(\tau)$ is the total (possibly discounted) reward of trajectory $\tau$. The second equality holds because the environment dynamics are the same under both policies, so the transition probabilities cancel in the ratio $\tilde{p}(\tau)/p(\tau)$, leaving only the per-step policy ratios.

The problem here is that the importance weights have high variance, which makes the estimator unreliable once the policy $\tilde{\pi}$ is sufficiently different from $\pi$, the one we sampled data from. (This is why we often optimize within a trust region: to ensure that our estimate remains accurate.)

## Idea: split policy into two stages

What if we define our policy as a two-stage sampling process:

$$
\pi(a_t\vert s_t) = \int \pi(a_t\vert z_t) \rho (z_t\vert s_t) dz_t
$$

So we first generate a partial decision $z_t$ based on the state $s_t$, and then generate the action from this partial decision $z_t$. This construction is used in the [Causally Correct Partial Models](https://arxiv.org/pdf/2002.02836v1.pdf) paper, but for an entirely different reason than the one I'm proposing here.

With such a split policy we can do importance sampling in two ways. Either we ignore $z_t$ and calculate importance weights just as before:

\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau) &= \mathbb{E}_{\tau \sim p} \frac{\tilde{p}(\tau)}{p(\tau)}r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\tilde{\pi}(a_t\vert s_t)}{\pi(a_t\vert s_t)} r(\tau) \\
&= \mathbb{E}_{\tau \sim p} \prod_{t=1}^T \frac{\int \tilde{\pi}(a_t\vert z_t) \tilde{\rho} (z_t\vert s_t) dz_t}{\int \pi(a_t\vert z_t) \rho (z_t\vert s_t) dz_t} r(\tau)
\end{align}

Or we can keep the sampled $z_t$ values alongside the observed trajectory $\tau$, and include them in the importance weighting as follows:

\begin{align}
\mathbb{E}_{\tau \sim \tilde{p}} r(\tau)&= \mathbb{E}_{\tau, z \sim \tilde{p}} r(\tau)\\
&= \mathbb{E}_{\tau,z\sim p} \prod_{t=1}^T \left(\frac{\tilde{\pi}(a_t\vert z_t)}{\pi(a_t\vert z_t)}\frac{\tilde{\rho}(z_t\vert s_t)}{\rho(z_t\vert s_t)} \right) r(\tau)
\end{align}

The first equality holds because $r(\tau)$ does not depend on $z$, so marginalizing over $z$ leaves the expectation unchanged.

We get two different importance sampling estimators in these two cases. The question is whether the second version, which uses extra information about how the policy arrived at its decision, has any advantage over the first one.
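
To make the comparison concrete, here is a minimal numerical sketch, not taken from any paper: a one-step ($T=1$) toy problem where $\rho$ and $\tilde{\rho}$ are Gaussians over the partial decision $z$, and $\pi(a\vert z)$ is a shared Gaussian around $z$. All means, variances, and the quadratic reward are illustrative assumptions; the script just estimates the target value with both weighting schemes and reports the empirical bias and spread of each estimator.

```python
# Toy comparison of the two importance sampling estimators (one-step case).
# Assumed setup: z ~ N(mu, sig_rho^2), a ~ N(z, sig_pi^2); the target policy
# differs only in the mean of rho. All parameter values are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

mu, mu_tilde = 0.0, 1.5                    # behaviour vs target mean of rho
sig_rho, sig_pi = 1.0, 1.0
sig_a = np.sqrt(sig_rho**2 + sig_pi**2)    # marginal std of a under either policy

def reward(a):
    return -(a - 1.0) ** 2

# Ground truth under the target policy: E[r(a)] with a ~ N(mu_tilde, sig_a^2)
true_value = -((mu_tilde - 1.0) ** 2 + sig_a**2)

n_repeats, n_samples = 2000, 500
est_marginal, est_split = [], []
for _ in range(n_repeats):
    z = rng.normal(mu, sig_rho, size=n_samples)   # partial decisions from rho
    a = rng.normal(z, sig_pi)                     # actions from pi(a|z)
    r = reward(a)

    # Estimator 1: marginal weights, with z integrated out analytically
    w_marginal = norm.pdf(a, mu_tilde, sig_a) / norm.pdf(a, mu, sig_a)

    # Estimator 2: split weights; the pi(a|z) ratio is 1 because pi is shared,
    # so only the rho ratio evaluated at the sampled z remains
    w_split = norm.pdf(z, mu_tilde, sig_rho) / norm.pdf(z, mu, sig_rho)

    est_marginal.append(np.mean(w_marginal * r))
    est_split.append(np.mean(w_split * r))

for name, est in [("marginal", est_marginal), ("split", est_split)]:
    est = np.asarray(est)
    print(f"{name:8s} bias = {est.mean() - true_value:+.3f}   std = {est.std():.3f}")
```

Both estimators are unbiased in this setup, so the interesting output is the standard deviation across repeats: it shows directly how the variance of the two weighting schemes compares as the gap between `mu` and `mu_tilde` grows.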