###### tags: `vrmc` `monte carlo` `expository`
# Variance Reduction in Monte Carlo Methods: Self-Normalised Importance Sampling
**Overview**: In this note, I will describe an extension of the method of importance sampling which enables the estimation of expectations under probability measures whose normalising constant is unknown. As in the standard case, I will characterise some notions of optimality for this class of importance sampling methods, which will suggest some strategies for constructing adaptive implementations.
## From Importance Sampling to Self-Normalised Importance Sampling
A partial drawback of the importance sampling method is that it requires us to know the normalising constant of $p$. Suppose that we only know $p$ up to a constant of proportionality, i.e.
\begin{align}
p(dx) = \frac{p_0 (x)}{Z} \, dx
\end{align}
with $Z$ unknown. If $Z$ were known, then we would report the estimator
\begin{align}
\hat{I}^{\text{IS}}_N &= \frac{1}{N} \sum_{i = 1}^N f(x_i) w(x_i) \\
&= \frac{1}{N} \sum_{i = 1}^N f(x_i) \frac{p_0 (x_i)}{Z \cdot q(x_i)} \\
&= \frac{1}{Z}\left(\frac{1}{N} \sum_{i = 1}^N f(x_i) \frac{p_0 (x_i)}{q(x_i)} \right).
\end{align}
A natural strategy is to use this same estimator, but with $Z$ replaced by an approximation. Given that we know that
\begin{align}
Z &= \int p_0 (x) \, dx \\
&= \int q(x) \, \frac{p_0 (x)}{q(x)} \, dx = \mathbf{E}_q \left[ \frac{p_0 (x)}{q(x)} \right],
\end{align}
a sensible choice of estimator seems to be
\begin{align}
\hat{Z}_N = \frac{1}{N} \sum_{i = 1}^N \frac{p_0 (x_i)}{q(x_i)}.
\end{align}
Combining these ideas leads us to the *self-normalised importance sampling* (SNIS) estimator
\begin{align}
\hat{I}^{\text{SNIS}}_N &= \frac{\frac{1}{N} \sum_{i = 1}^N f(x_i) \frac{p_0 (x_i)}{q(x_i)}}{\frac{1}{N} \sum_{i = 1}^N \frac{p_0 (x_i)}{q(x_i)}}.
\end{align}
This estimator is no longer unbiased for $I$, but under reasonable conditions it will still converge to $I$ at the standard rate, albeit with a different variance.
It bears mentioning that, by construction, this technique integrates constant functions exactly, and hence provides estimates of equal quality for $\mathbf{E}_p [f + c]$ for any $c \in \mathbf{R}$. This is to be contrasted with standard importance sampling, which **does not** automatically have this property (though there are simple ways of ameliorating this).
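To make the construction concrete, the following is a minimal sketch of the SNIS estimator in Python, under purely illustrative assumptions which are not fixed by the discussion above: the target is a standard Gaussian known only through the unnormalised density $p_0(x) = \exp(-x^2/2)$, the proposal is the wider Gaussian $q = \mathcal{N}(0, 2^2)$, and we estimate $\mathbf{E}_p [x^2] = 1$.
```python
import numpy as np

rng = np.random.default_rng(0)

def p0(x):
    # Unnormalised target density: a standard Gaussian without its 1/sqrt(2*pi)
    # factor, standing in for a p which is known only up to the constant Z.
    return np.exp(-0.5 * x**2)

def snis_estimate(f, x, q_density):
    # Self-normalised importance sampling: form unnormalised weights p0 / q,
    # then divide the weighted sum of f by the sum of the weights.
    w = p0(x) / q_density(x)
    return np.sum(w * f(x)) / np.sum(w)

# Proposal: N(0, sigma^2), wider than the target.
sigma = 2.0
q_density = lambda x: np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
x = rng.normal(0.0, sigma, size=100_000)

# Estimate E_p[x^2], which equals 1 under the N(0, 1) target.
print(snis_estimate(lambda x: x**2, x, q_density))
```
Note that the factors of $\frac{1}{N}$ cancel in the ratio, as would any constant factor in $p_0$: only the unnormalised density is ever evaluated.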
## Expressing the Variance of SNIS
Due to the nonlinear nature of the SNIS estimator, some extra care is required in deriving an expression for its asymptotic variance. Essentially, while we have unbiased estimators $(\hat{U}, \hat{V})$ for the numerator $U$ and the denominator $V$ respectively, our final estimator involves applying a nonlinear map $F: (U, V) \mapsto U / V$, which complicates matters.
We thus seek to characterise how well $F(\hat{U}, \hat{V})$ estimates $F(U, V)$, under the assumption that our estimates $(\hat{U}, \hat{V})$ are converging well to their limits. To this end, we carry out the following expansion (essentially a first-order Taylor expansion in the variables $(\hat{U}, \hat{V})$ about $(U, V)$):
\begin{align}
F(\hat{U}, \hat{V}) &= \frac{\hat{U}}{\hat{V}}\\
&= \frac{U}{V} + \frac{\hat{U} - U}{V} - \frac{U \left( \hat{V} - V \right)}{V^2} + \frac{\left( V - \hat{V} \right) \cdot \left( \hat{U} V - U \hat{V} \right)}{V^2 \cdot \hat{V}}.
\end{align}
Now, for large $N$, the Central Limit Theorem tells us that $(\hat{U}, \hat{V}) = (U, V) + \mathcal{O}_p ( N^{-1/2})$. The final term in the expansion is a product of two $\mathcal{O}_p(N^{-1/2})$ factors, hence $\mathcal{O}_p (N^{-1})$, and we choose to neglect it. We may thus write that
\begin{align}
\text{Var} \left( F(\hat{U}, \hat{V}) \right) &\approx \text{Var} \left( \frac{U}{V} + \frac{\hat{U} - U}{V} - \frac{U \left( \hat{V} - V \right)}{V^2} \right),
\end{align}
noting that this expression is now linear in $(\hat{U}, \hat{V})$. This allows us to compute
\begin{align}
\text{Var} \left( F(\hat{U}, \hat{V}) \right) &\approx \text{Var} \left( \frac{V \left( \hat{U} - U \right)}{V^2} - \frac{U \left( \hat{V} - V \right)}{V^2} \right) \\
&= \text{Var} \left( \frac{\hat{U} V - U \hat{V}}{V^2} \right).
\end{align}
We now recall that
\begin{align}
U &= ZI \\
\hat{U} &= \frac{1}{N} \sum_{i = 1}^N f(x_i) \frac{p_0 (x_i)}{q(x_i)} \\
V &= Z \\
\hat{V} &= \frac{1}{N} \sum_{i = 1}^N \frac{p_0 (x_i)}{q(x_i)}
\end{align}
to simplify the above to
\begin{align}
\text{Var} \left( F(\hat{U}, \hat{V}) \right) &\approx \text{Var} \left( \left(\frac{1}{Z}\hat{U} \right) - \left(\frac{1}{Z}U \right)\left(\frac{1}{Z}\hat{V} \right) \right) \\
&= \text{Var} \left( \left(\frac{1}{N} \sum_{i = 1}^N f(x_i) \frac{p (x_i)}{q(x_i)} \right) - I \cdot \left(\frac{1}{N} \sum_{i = 1}^N \frac{p (x_i)}{q(x_i)} \right) \right) \\
&= \frac{1}{N} \text{Var}_q \left( \left( f(x) - I\right) \cdot \frac{p(x)}{q(x)} \right) \\
&= \frac{1}{N} \text{Var}_q \left( \left( f(x) - I\right) w(x) \right),
\end{align}
noting that the factors of $\frac{1}{Z}$ combine with the unnormalised density $p_0$ to produce the normalised density $p$ throughout, so that $w = p / q$ is the usual (normalised) importance weight.
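As an empirical sanity check on this formula, the following sketch (reusing the illustrative Gaussian target and proposal from the earlier snippet, with $f(x) = x^2$ and hence $I = 1$) compares the variance of the SNIS estimator across repeated runs with the prediction $\frac{1}{N} \text{Var}_q \left( \left( f(x) - I \right) w(x) \right)$.
```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                      # proposal standard deviation, as before
N, n_reps = 2_000, 5_000
f = lambda x: x**2               # so that I = E_p[f] = 1 under the N(0, 1) target
I = 1.0

def weight(x):
    # Normalised weight w(x) = p(x) / q(x); any unknown constant factor would
    # cancel inside the self-normalised ratio in any case.
    p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    q = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return p / q

# Empirical variance of the SNIS estimator over repeated experiments.
estimates = np.empty(n_reps)
for r in range(n_reps):
    x = rng.normal(0.0, sigma, size=N)
    w = weight(x)
    estimates[r] = np.sum(w * f(x)) / np.sum(w)

# Asymptotic prediction: (1 / N) * Var_q((f(x) - I) * w(x)).
x = rng.normal(0.0, sigma, size=1_000_000)
predicted = np.var((f(x) - I) * weight(x)) / N

print(np.var(estimates), predicted)  # the two should agree up to Monte Carlo error
```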
Using similar manipulations to the standard IS case, one can write
\begin{align}
\text{Var}_q \left( \left( f(x) - I\right) w(x) \right) &= \text{Var}_q \left( \left| f(x) - I\right| w(x) \right) + \mathbf{E}_p \left[ |f(x) - I| \right]^2 \\
&= \text{Var}_q \left( \left| f(x) - I\right| w(x) \right) + \text{Var}_{q_*} \left( \left( f(x) - I\right) w(x) \right),
\end{align}
where, in the second term, $w$ is to be read as the weight $p / q_*$ associated with the proposal $q_*$, and where the optimal $q$ is now instead given by
\begin{align}
q_*(x) &= \frac{p(x) |f(x) - I|}{\mathbf{E}_p \left[ |f(x) - I| \right]}.
\end{align}
This choice is indeed optimal: the second term above does not depend on $q$, and the first term vanishes at $q = q_*$, since in that case $|f(x) - I| w(x) \equiv \mathbf{E}_p \left[ |f(x) - I| \right]$ is constant. Again, some manipulations allow us to write the variance expression as
\begin{align}
\text{Var}_q \left( \left( f(x) - I\right) w(x) \right) &= \mathbf{E}_p \left[ |f(x) - I| \right]^2 \cdot \chi^2 ( q_*, q) + \text{Var}_{q_*} \left( \left( f(x) - I\right) w(x) \right).
\end{align}
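As a quick check of this decomposition, one can evaluate the variance at the optimal proposal itself: taking $q = q_*$ makes the $\chi^2$ term vanish, and the corresponding weight is $w(x) = p(x) / q_*(x) = \mathbf{E}_p \left[ |f(x) - I| \right] / |f(x) - I|$, so that
\begin{align}
\text{Var}_{q_*} \left( \left( f(x) - I\right) w(x) \right) &= \text{Var}_{q_*} \left( \text{sign} \left( f(x) - I \right) \cdot \mathbf{E}_p \left[ |f(x) - I| \right] \right) \\
&= \mathbf{E}_p \left[ |f(x) - I| \right]^2,
\end{align}
using that $\text{sign}(f(x) - I)^2 = 1$ ($q_*$-almost surely) and $\mathbf{E}_{q_*} \left[ \text{sign}(f(x) - I) \right] = \mathbf{E}_p \left[ f(x) - I \right] / \mathbf{E}_p \left[ |f(x) - I| \right] = 0$.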
An important distinction from the standard IS case is that, unless $f$ is constant (in which case there is no need for Monte Carlo), even the estimator corresponding to the optimal $q$ has strictly positive asymptotic variance, namely $\mathbf{E}_p \left[ |f(x) - I| \right]^2 / N$; by contrast, for non-negative $f$, the optimal proposal for standard importance sampling attains zero variance.
## Examples
It is interesting to revisit the task of rare event estimation through the lens of self-normalised importance sampling, as distinct from standard importance sampling. Recall the setup: $f(x) = \mathbf{I} [x \in A]$, for some set $A \subset \mathbf{X}$, and the quantity of interest is $p_A = \mathbf{P}_p ( X \in A) \ll 1$.
We now write the optimal SNIS importance distribution:
\begin{align}
q_*(x) &= \frac{p(x) |\mathbf{I} [x \in A] - p_A|}{\mathbf{E}_p \left[ |\mathbf{I} [x \in A] - p_A| \right]} \\
&= \frac{p(x) |\mathbf{I} [x \in A] - p_A|}{2p_A(1-p_A)}.
\end{align}
Note that now $q_* \left( A \right) = q_* \left( A^\complement \right) = \frac{1}{2}$, i.e. the optimal importance distribution places half of its mass in the set $A$, and half of its mass on its complement. So even if we were able to sample directly from the restriction of $p$ to $A$ (the optimal proposal for standard IS), doing so would not merely be sub-optimal under self-normalisation, but uninformative: all of the weights would coincide and $f \equiv 1$ at every sample, so the SNIS estimator would return exactly $1$, regardless of the value of $p_A$.
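The following sketch (in Python, under the illustrative assumptions $p = \mathcal{N}(0, 1)$ and $A = \{ x > 4 \}$, so that $p_A \approx 3.17 \times 10^{-5}$) demonstrates both behaviours: the restriction of $p$ to $A$ yields an SNIS estimate which is identically $1$, while the 50/50 mixture of the restrictions of $p$ to $A$ and to $A^\complement$ (which is exactly the $q_*$ above) recovers $p_A$. Note that evaluating the density of $q_*$ requires $p_A$ itself, so this is a check of the theory rather than a usable algorithm.
```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(3)
N = 100_000
a = 4.0                                # A = {x > a}
p_A = norm.sf(a)                       # true rare-event probability, about 3.17e-5

p0 = lambda x: np.exp(-0.5 * x**2)     # unnormalised N(0, 1) target density
f = lambda x: (x > a).astype(float)    # indicator of A

def snis(x, q_density):
    # SNIS estimate from samples x drawn from a proposal with density q_density.
    w = p0(x) / q_density(x)
    return np.sum(w * f(x)) / np.sum(w)

# (i) Proposal = restriction of p to A: every sample lies in A and every weight
#     is identical, so the estimate equals 1 exactly, whatever the value of p_A.
x_restr = truncnorm.rvs(a, np.inf, size=N, random_state=rng)
q_restr = lambda x: norm.pdf(x) / p_A
print(snis(x_restr, q_restr))          # -> 1.0

# (ii) Proposal = q_*: with probability 1/2, sample from p restricted to A,
#      otherwise from p restricted to the complement of A.
in_A = rng.random(N) < 0.5
x_star = np.where(in_A,
                  truncnorm.rvs(a, np.inf, size=N, random_state=rng),
                  truncnorm.rvs(-np.inf, a, size=N, random_state=rng))
q_star = lambda x: np.where(x > a,
                            0.5 * norm.pdf(x) / p_A,
                            0.5 * norm.pdf(x) / (1.0 - p_A))
print(snis(x_star, q_star), p_A)       # close agreement, but with nonzero variance
```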
In general, it is harder to find analytic forms for the asymptotic variance of SNIS in concrete examples, due to the absolute value in the optimal importance distribution. Moreover, this absolute value means that $q_*$ has zero density wherever $f(x) = I$, typically somewhere in the bulk of its support, which tends to cause sampling difficulties. It is thus somewhat more common to devise importance distributions which resemble the optimal choice for standard importance sampling, despite the potential loss in efficiency.
## Conclusion
While standard importance sampling is a powerful tool, it does not directly apply to situations in which the normalising constant of the measure of interest is unknown. The technique of self-normalisation is a systematic way of handling this deficiency, and broadens the scope of problems to which IS can be applied. There are some important differences between IS and SNIS (such as the different conditions for optimality), but conceptually there is a common through line: sample more densely in regions of the space which contribute most to the integral of interest.
Having built up these tools in the basic Monte Carlo setting, one can begin to develop these ideas towards applications in more challenging, high-dimensional problems. One can construct IS estimators in high dimension by utilising extended state-space constructions (as in e.g. Sequential Monte Carlo samplers and Annealed Importance Sampling), by replacing exact sampling with approximate sampling (e.g. using Markov chain methods to sample from $q$), and through other related strategies. We will dip into such approaches in subsequent posts.