###### tags: `vrmc` `monte carlo` `expository`

# Variance Reduction in Monte Carlo Methods: Importance Sampling

**Overview**: In this note, I will describe a fundamental technique for variance reduction in the context of Monte Carlo methods known as *importance sampling*. I will also characterise some notions of optimality among importance sampling methods, hinting at how one might deploy such methods adaptively.

## A Recap on the Monte Carlo Method and Variance Reduction

The premise of Monte Carlo methods is to express a given quantity of interest in the form of an integral

\begin{align}
I &= \int f(x) p(dx) \\
&= \mathbf{E}_p [f],
\end{align}

where $p$ is a probability measure and $f$ is some function. One then estimates $I$ as follows:

1. Draw samples $x_1, \ldots, x_N \sim p$.
2. Form the estimator
\begin{align}
\hat{I}_N = \frac{1}{N} \sum_{i = 1}^N f(x_i).
\end{align}

The rate at which $\hat{I}_N$ converges to $I$ is governed (to leading order) by the variance of $\hat{I}_N$, which in turn is governed by the variance of the function $f$ under the law $p$. As such, given a quantity of interest $I$, it is desirable to devise pairs $(p, f)$ such that $\langle p, f \rangle = I$ and $\text{Var}_p(f)$ is small, so as to accelerate the convergence of our estimator.

## From Bilinearity to Linear Operators and Adjoints

As detailed in an earlier post, a large class of strategies for variance reduction stem from noting that the mapping

\begin{align}
(p, f) \mapsto \langle p, f \rangle := \int p(dx) f(x)
\end{align}

is bilinear in its arguments, and so one can use simple decompositions like

\begin{align}
\langle p, f + g \rangle &= \langle p, f \rangle + \langle p, g \rangle, \\
\langle p + q, f \rangle &= \langle p, f \rangle + \langle q, f \rangle
\end{align}

to re-express our original integral representation as a linear combination of other integrals, each of which can be estimated more rapidly.

While this simple approach can get us quite far, we can exploit this linear-algebraic perspective further by considering linear operators which act on either (signed) measures or functions. For example, suppose that our $f$ can be written as $Lg$, where $L$ is some linear operator on the space of functions. Almost by definition, we can write

\begin{align}
\langle p, f \rangle &= \langle p, Lg \rangle \\
&= \langle L^* p, g \rangle,
\end{align}

where $L^*$ is the *adjoint* operator of $L$, which acts on measures. If one can write $L^* p = \sum_{i \in I} \beta_i p_i$ with $\beta_i \in \mathbb{R}$ and $p_i$ probability measures, then this gives us a new strategy for estimating $I$.

## Diagonal Linear Operators for New Monte Carlo Representations

With this abstract framework in mind, we can begin to look for linear operators $L$ which are likely to result in implementable algorithms and appreciable reductions in variance. Since linear operators are essentially generalisations of matrices, we might reasonably begin by thinking of some of the simplest matrices, namely *diagonal matrices*. When expressed in coordinates, a diagonal matrix $L$ is characterised by its action on vectors $x$ as

\begin{align}
(Lx)_i = L_{ii} x_i \quad \text{for all} \, i \in I,
\end{align}

where $I$ indexes the coordinates of $x$. As such, a natural analogue of a diagonal operator on a space of functions is a *multiplication* operator, i.e. an operator $L = L_m$ such that for all $x \in \mathcal{X}$,

\begin{align}
(L_m f) (x) = m(x) f(x),
\end{align}

where $m$ is some given function.
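To make the analogy with diagonal matrices concrete, here is a minimal NumPy sketch on a finite state space (the state space and the particular $p$, $f$, $m$ below are arbitrary illustrative choices of mine, not objects from the discussion above). It checks that $L_m$ acts as a diagonal matrix on vectors of function values, and that $\langle p, L_m f \rangle = \langle L_m p, f \rangle$, which anticipates the observation about adjoints that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite state space with n points; p, f, m are arbitrary illustrative choices.
n = 5
p = rng.random(n); p /= p.sum()   # a probability vector (the "measure")
f = rng.standard_normal(n)        # a "function", stored as its vector of values
m = rng.random(n)                 # a positive reweighting function

L_m = np.diag(m)                  # the multiplication operator, as a diagonal matrix

# L_m f is just pointwise multiplication by m...
assert np.allclose(L_m @ f, m * f)

# ...and <p, L_m f> = <L_m p, f> for the pairing <p, f> = sum_i p_i f(x_i),
# i.e. the diagonal operator is its own adjoint.
assert np.isclose(p @ (L_m @ f), (L_m @ p) @ f)
```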
A particularly nice feature of diagonal operators is that their adjoints are very easy to characterise. In particular, when working with real vector spaces (as we generally will in this setting), the adjoint of $L_m$ is given by $L_m$ itself!

Now, let's bring this observation back to the world of Monte Carlo. Fix a positive function $m(x)$, and consider estimation of the integral $I = \langle p, f \rangle$. We can write

\begin{align}
I &= \langle p, f \rangle \\
&= \left\langle p, L_m \left( \frac{f}{m} \right) \right\rangle \\
&= \left\langle L_m p, \frac{f}{m}\right\rangle.
\end{align}

Now, assume that $m$ is chosen such that $Z_m := \langle p, m \rangle < \infty$. It then follows that $L_m p$ can be written as $Z_m \cdot q$, where

\begin{align}
q(dx) := \frac{p(dx) \, m(x)}{Z_m}
\end{align}

is another probability measure! We thus have the new representation

\begin{align}
I &= Z_m \cdot \langle q, g \rangle,
\end{align}

where $g(x) = f(x) / m(x)$. Thus, provided that we have a way of drawing samples from $q$, we have an alternative means of estimating $I$.

## From New Representations to Variance Reduction

The above presentation is slightly unconventional, in that it focuses on the reweighting factor $m$, rather than on the distribution $q$. In practice, because this method generally requires sampling from $q$, it is more natural to parametrise the problem in terms of $q$. As such, we write

\begin{align}
I &= \langle p, f \rangle \\
&= \left\langle q, f w\right\rangle,
\end{align}

where $w(x) = p(x) / q(x)$[^1]. We can thus define the *importance sampling* (IS) estimator of $I$ with proposal $q$ by:

1. Draw samples $x_1, \ldots, x_N \sim q$.
2. Form the estimator
\begin{align}
\hat{I}^{\text{IS}}_N = \frac{1}{N} \sum_{i = 1}^N f(x_i) w(x_i).
\end{align}

One generally refers to $q$ as the *importance distribution* (or *proposal*), and to $w$ as the *weight function*, or similar.

We now consider the variance of this estimator:

\begin{align}
N \cdot \text{Var} \left( \hat{I}^{\text{IS}}_N \right) &= \text{Var}_q \left( f(x) w(x) \right) \\
&= \mathbf{E}_q \left[ \left( f(x) w(x) \right)^2 \right] - \mathbf{E}_q \left[ f(x) w(x) \right]^2 \\
&= \mathbf{E}_q \left[ \left( |f(x)| w(x) \right)^2 \right] - \mathbf{E}_q \left[ f(x) w(x) \right]^2 \\
&= \text{Var}_q \left( |f(x)| w(x) \right) + \mathbf{E}_q \left[ |f(x)| w(x) \right]^2 - \mathbf{E}_q \left[ f(x) w(x) \right]^2 \\
&= \text{Var}_q \left( |f(x)| w(x) \right) + \mathbf{E}_p \left[ |f(x)| \right]^2 - \mathbf{E}_p \left[ f(x) \right]^2.
\end{align}

We now pause to investigate these terms:

* $\text{Var}_q \left( |f(x)| w(x) \right)$ is the only part which depends on the choice of $q$ (and hence of $w$). Moreover, if we choose $q$ such that $|f(x)| w(x)$ is constant, then this term will be as small as possible (i.e. $0$).
* $\mathbf{E}_p \left[ |f(x)| \right]^2 - \mathbf{E}_p \left[ f(x) \right]^2$ is an 'irreducible variance', in the sense that our choice of $q$ cannot change it: even for the optimal choice of $q$, our estimator will have at least this variance. Note, however, that this term vanishes whenever $f$ is either purely nonnegative or purely nonpositive, which covers many problems of practical interest.
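Before turning to the question of choosing $q$ well, here is a minimal NumPy sketch of the generic IS estimator above on a toy problem; the target $p = \mathcal{N}(0, 1)$, integrand $f(x) = e^x$, and proposal $q = \mathcal{N}(1, 1)$ are illustrative choices of mine. This particular proposal happens to make $f(x) w(x)$ constant, so, in line with the first observation above, the weighted estimate is essentially exact.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def norm_pdf(x, loc=0.0, scale=1.0):
    """Density of the normal distribution N(loc, scale^2)."""
    return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

# Toy problem: p = N(0, 1), f(x) = exp(x), so the true value is E_p[f] = exp(1/2).
f = np.exp

# Plain Monte Carlo: sample from p directly.
x_p = rng.standard_normal(N)
plain_est = f(x_p).mean()

# Importance sampling: sample from the proposal q = N(1, 1), reweight by w = p / q.
x_q = rng.normal(loc=1.0, scale=1.0, size=N)
w = norm_pdf(x_q, 0.0, 1.0) / norm_pdf(x_q, 1.0, 1.0)
is_est = (f(x_q) * w).mean()

# Here f(x) * w(x) = exp(1/2) identically, so the IS estimate has zero variance.
print(plain_est, is_est, np.exp(0.5))
```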
With these observations in mind, let us characterise the optimal $q$ explicitly:

\begin{align}
\text{Var}_q \left( |f(x)| w(x) \right) &= 0 \\
\iff |f(x)| w(x) &\equiv \text{const.} \\
\iff |f(x)| w(x) &= \mathbf{E}_q \left[ |f(x)| w(x)\right] \\
\iff |f(x)| w(x) &= \mathbf{E}_p \left[ |f(x)| \right] \\
\iff |f(x)| \frac{p(x)}{q(x)} &= \mathbf{E}_p \left[ |f(x)| \right] \\
\iff q(x) &= \frac{p(x) |f(x)|}{\mathbf{E}_p \left[ |f(x)| \right]}.
\end{align}

We will write $q_*$ for this optimal importance distribution. Revisiting our earlier expression for the variance under a general $q$, we can write

\begin{align}
\text{Var}_q \left( f(x) w(x) \right) &= \text{Var}_q \left( |f(x)| w(x) \right) + \text{Var}_{q_*} \left( f(x) w_*(x) \right),
\end{align}

where $w_* = p / q_*$ is the weight function associated with the optimal proposal; the second term is precisely the irreducible variance identified above. One can also compute that

\begin{align}
\mathbf{E}_q \left[ \left( f(x) w(x) \right)^2 \right] &= \int q(dx) \frac{f(x)^2 p(x)^2}{q(x)^2} \\
&= \mathbf{E}_p \left[ |f(x)| \right]^2 \cdot \int \frac{q_*(x)^2}{q(x)} dx \\
&= \mathbf{E}_p \left[ |f(x)| \right]^2 \cdot \left( 1 + \chi^2 ( q_*, q) \right),
\end{align}

using the definition of $q_*$ in the second line, and where $\chi^2$ denotes the $\chi^2$ divergence between measures. One can then write

\begin{align}
\text{Var}_q \left( f(x) w(x) \right) &= \mathbf{E}_p \left[ |f(x)| \right]^2 \cdot \chi^2 ( q_*, q) + \text{Var}_{q_*} \left( f(x) w_*(x) \right).
\end{align}

This characterises the error in importance sampling as comprising:

1. a factor for the scale of the problem,
2. a factor for the failure of the proposal $q$ to match the optimal proposal $q_*$, and
3. an irreducible variance term, which cannot be removed by pure importance sampling techniques.

In particular, this highlights that the success of importance sampling on a given problem (as measured by the variance of our estimator) is governed precisely by our ability to approximate the optimal importance distribution $q_*$ well, as measured by the $\chi^2$ divergence.

## Some Examples

Having established some algebraic criteria by which to characterise good importance sampling strategies, it is useful to develop intuition for how these distributions behave in practice.

### Heavy-Tailed Integrands

One category of problem for which importance sampling can be worthwhile is when $f$ has heavy tails under $p$, i.e. the probability of seeing very large values of $f$ is not that small. To be concrete, consider the example of

\begin{align}
p(dx) &= \frac{3}{x^4} \cdot \mathbf{I} [ x\geqslant 1] \, dx \\
f(x) &= x,
\end{align}

so that $\mathbf{P}_p(X \geqslant x) = x^{-3}$ for $x \geqslant 1$. This is quite a heavy-tailed problem: although $f$ does have finite variance under $p$, its third moment is infinite. This is bad news: the CLT will still hold for our Monte Carlo averages, but the approach to the Gaussian limit will be slow.

We might then consider importance distributions of the form

\begin{align}
q_\alpha(dx) &= \frac{\alpha}{x^{\alpha + 1}} \cdot \mathbf{I} [ x\geqslant 1] \, dx
\end{align}

for $\alpha > 0$. This leads to the weight function $w_\alpha(x) = \frac{3}{\alpha} \cdot x^{\alpha - 3}$, and hence to the effective integrand

\begin{align}
f_\alpha(x) &= f(x) \cdot w_\alpha(x) \\
&= \frac{3}{\alpha} \cdot x^{\alpha - 2}.
\end{align}

Now, note that:

1. For $\alpha = 3$, we are back at our original problem.
2. For $\alpha > 3$, our importance distributions have lighter tails (making large values of $x$ less likely), but our effective integrand $f_\alpha$ grows even faster as a function of $x$. As a consequence, $f_\alpha$ will have *even heavier tails* than before.
3. For $0 < \alpha \leqslant 2$, our importance distributions have heavier tails (making large values of $x$ **more** likely), but our effective integrand $f_\alpha$ is uniformly bounded above! In particular, all of the moments of $f_\alpha$ are finite, and the Monte Carlo averages should be relatively well-behaved.
4. For $\alpha = 2$, the effective integrand is constant, and so the estimator will have $0$ variance.

This example points to a more general phenomenon: one should design $q$ to place mass on the parts of the state space where $p(x) | f(x) |$ is large, and it is often safest to put too much mass in those regions, rather than too little.
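To see these regimes numerically, here is a minimal sketch of the example above (the sample size and the particular values of $\alpha$ are arbitrary choices of mine). It samples from $q_\alpha$ by inverting the CDF, and reports the IS estimate of $\mathbf{E}_p[X] = 3/2$ together with the sample standard deviation of the effective integrand $f_\alpha$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

# Target: p(dx) = 3 x^{-4} on [1, inf) with f(x) = x, so I = E_p[X] = 3/2.
# Proposal: q_alpha(dx) = alpha x^{-(alpha+1)} on [1, inf), sampled by inverse CDF.
def is_estimate(alpha, n=N):
    u = 1.0 - rng.random(n)                       # uniform on (0, 1]
    x = u ** (-1.0 / alpha)                       # Pareto(alpha) sample: P(X > x) = x^{-alpha}
    f_alpha = (3.0 / alpha) * x ** (alpha - 2.0)  # effective integrand f(x) * w_alpha(x)
    return f_alpha.mean(), f_alpha.std()

for alpha in [3.0, 3.5, 2.5, 1.0, 2.0]:
    est, sd = is_estimate(alpha)
    print(f"alpha = {alpha:3.1f}: estimate = {est:.4f}, sample std of f_alpha = {sd:.3f}")

# alpha = 3 recovers plain Monte Carlo; alpha = 2 makes f_alpha constant (zero variance);
# for alpha > 3 the reported standard deviation is itself noisy across runs.
```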
### Rare Event Estimation

Another category of problem to which importance sampling is well-suited is *rare event estimation*. This typically denotes the scenario where $f(x) = \mathbf{I} [x \in A]$ for some (typically small) set $A \subset \mathcal{X}$, and the quantity of interest is $p_A = \mathbf{P}_p ( X \in A) \ll 1$.

In this setting, the variance of the direct Monte Carlo estimator (i.e. sampling from $p$) is usually small: the integrand is bounded, so the variance is automatically bounded. The challenge is that $p_A$ is itself small, and so for the estimator to be useful, its standard deviation should be smaller than $p_A$ itself. As such, the relevant criterion is *relative variance*, rather than variance alone.

In any case, the intuition is similar: one should choose an importance distribution $q$ which places substantial mass on the set $A$, while otherwise looking fairly similar to $p$. Unsurprisingly, the optimal importance distribution is simply the restriction of $p$ to the set $A$, i.e.

\begin{align}
q_* (dx) = \frac{p(dx) \cdot \mathbf{I} [x \in A]}{p_A}.
\end{align}

As an example, consider taking

\begin{align}
p(dx) &= \exp(-x) \cdot \mathbf{I} [ x\geqslant 0] \, dx \\
A &= \{ x: x \geqslant t \}
\end{align}

for some $t \gg 0$ (so that $p_A = \exp (-t)$), and again consider importance distributions of the form

\begin{align}
q_\alpha(dx) &= \alpha \exp(-\alpha x) \cdot \mathbf{I} [ x\geqslant 0] \, dx.
\end{align}

One can compute the variance explicitly in this case as

\begin{align}
\sigma^2_\alpha &= \frac{1}{\alpha (2 - \alpha)} \exp((\alpha - 2)t) - \exp(-2t)
\end{align}

if $\alpha \in (0, 2)$, and $\infty$ otherwise. With some work, one can show that the optimal $\alpha$ is given by $\alpha(t) = 1 + \frac{1}{t} - \sqrt{1 + \frac{1}{t^2}} \approx \frac{1}{t}$, and the corresponding optimal variance is given by

\begin{align}
\sigma^2_* (t) &= \exp(-2t) \cdot \left( \left( \frac{1 + \sqrt{1 + t^2}}{2} \right) \cdot \exp \left( 1 + t - \sqrt{1 + t^2}\right) - 1 \right) \\
&\approx \frac{et}{2}\cdot \exp(-2t) \\
&\asymp p_A^2 \log \left( \frac{1}{p_A} \right).
\end{align}

This is consistent with our intuition: as the event seeps out further into the tails, we should choose an importance distribution with heavier and heavier tails. Note also that in this example, we are able to reduce the variance down to the order of $p_A^2$ (up to log factors); this rate is essentially the gold standard for rare event estimation.
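Here is a minimal sketch of this example (the threshold $t = 10$, the sample size, and the use of $\alpha = 1/t$ in place of the exact optimiser $\alpha(t)$ are illustrative choices of mine). It compares the naive estimator of $p_A = e^{-t}$ with the IS estimator based on $q_\alpha$, and reports the estimated relative standard error of the latter.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
t = 10.0                  # threshold; p_A = exp(-t)
alpha = 1.0 / t           # close to the optimal rate derived above

# Naive Monte Carlo: sample from p = Exp(1) and count exceedances of t.
x_p = rng.exponential(scale=1.0, size=N)
naive_est = (x_p >= t).mean()

# Importance sampling: sample from q_alpha = Exp(alpha), reweight by w = p / q_alpha.
x_q = rng.exponential(scale=1.0 / alpha, size=N)
w = np.exp(-x_q) / (alpha * np.exp(-alpha * x_q))
fw = (x_q >= t) * w
is_est = fw.mean()

# Relative standard error of the IS estimate (relative to the known true value).
rel_se = fw.std() / (np.sqrt(N) * np.exp(-t))
print(naive_est, is_est, np.exp(-t), rel_se)
```

With these settings, the naive estimator sees only a handful of exceedances on average (around $N p_A \approx 4.5$), whereas the IS estimator should achieve a relative error of roughly a percent.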
## Conclusion

Importance sampling is a key strategy for reducing the variance of Monte Carlo estimates: by focusing our efforts carefully on the most relevant parts of the integrand, we can achieve substantially improved efficiency, particularly for difficult problems. Moreover, there is a very clean characterisation of how to deploy this strategy optimally: use an importance distribution which assigns mass similarly to $p \cdot |f|$.

An additional benefit of importance sampling, only hinted at above, is that beyond reducing variance, it can also take us from an estimator with infinite variance to an estimator with finite variance. In some respects, this is even more impressive: IS allows us to resurrect a Central Limit Theorem where previously there was none.

In this note, I have not made explicit reference to the ambient dimension of the problem. In some sense, this is fair, as the variance expressions and so on do not have any explicit dependence on such parameters. However, it is generally acknowledged that as the dimension increases, the task of finding practical importance distributions $q$ such that $\chi^2 (q_*, q)$ is small can become increasingly challenging. Circumventing this difficulty typically requires some ingenuity.

[^1]: This assumes that the measures $p, q$ are absolutely continuous with respect to some common dominating measure. Since, in practice, one usually works directly with density functions and mass functions, I will largely overlook this subtlety.