###### tags: `monte carlo` `variance reduction` `one-offs` `expository`

# Simple Variance Reduction for Score Gradient Estimators

**Overview**: In this note, I will describe some simple strategies for variance reduction when using the Monte Carlo method to approximate integrals which involve so-called 'score gradients'.

## Setup

Consider a parametrised family of probability densities $\{ q (z | \phi) : \phi \in \Phi \}$, and define the functional

\begin{align}
F(\phi) &= \mathbf{E}_{\phi} \left[ f(z) \right] \\
&= \int q(z | \phi) f(z) dz,
\end{align}

where $f$ is some function which we can evaluate. Now, consider the task of minimising $F$ with respect to $\phi$, i.e. finding the probability measure in our family under which $f$ has the lowest mean value. A natural approach would be to apply a gradient method. However, a priori, we might not have an analytic form for the relevant gradient. In any case, under suitable regularity conditions, one can write

\begin{align}
\nabla_\phi F (\phi) &= \nabla_\phi \mathbf{E}_{\phi} \left[ f(z) \right] \\
&= \nabla_\phi \left( \int q(z | \phi) f(z) dz \right) \\
&= \int \nabla_\phi q(z | \phi) f(z) dz \\
&= \int q(z | \phi) \nabla_\phi \log q(z | \phi) f(z) dz \\
&= \mathbf{E}_{\phi} \left[ f(z) \cdot \nabla_\phi \log q(z | \phi) \right].
\end{align}

Thus, if one can draw samples from $q$, it is straightforward to form an unbiased Monte Carlo estimator of $\nabla_\phi F (\phi)$. This is known as the *score gradient estimator*, among other names.

## Variance Worries

In almost all presentations of the score gradient estimator, the derivation is followed immediately by the claim that this estimator typically has quite high variance, and that it should essentially only be used in cases for which better gradient estimators (e.g. those based on the reparametrisation trick) are unavailable.

Here, we will not quite fix this problem. Speaking very roughly, there is *some* essential truth to the claim that score gradients are not a great idea if certain alternatives are available. Nevertheless, there are some simple modifications which can be used to improve the score gradient estimator.

### Control Variates from Score Estimators

A first step is to note that in the case where $f(z) \equiv 1$, it follows that $F (\phi) \equiv 1$, and hence that $\nabla_\phi F (\phi) \equiv 0$, from which we can deduce that

\begin{align}
\mathbf{E}_{\phi} \left[ \nabla_\phi \log q(z | \phi) \right] \equiv 0,
\end{align}

i.e. that for each $\phi$, the quantity $\nabla_\phi \log q(z | \phi)$ has mean $0$.

The usual presentation then leads to the use of control variates: for any constant $c$, it follows that $\mathbf{E}_{\phi} \left[ c \cdot \nabla_\phi \log q(z | \phi) \right] = 0$, and hence that one could equally write

\begin{align}
\nabla_\phi F (\phi) &= \mathbf{E}_{\phi} \left[ \left( f(z) - c \right) \cdot \nabla_\phi \log q(z | \phi) \right].
\end{align}

One can then optimise the constant $c$ so as to minimise the variance of the corresponding Monte Carlo estimator. Equally, one might take $c$ to be matrix-valued, and play the same game. These can both be successful strategies.
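To make this concrete, here is a minimal sketch in Python/NumPy of the naive and baseline-corrected estimators, assuming (purely for illustration) the Gaussian family $q(z | \phi) = \mathcal{N}(\phi, I)$, for which the score $\nabla_\phi \log q(z | \phi) = z - \phi$ is available in closed form; the function names and the choice of family are mine, not part of the setup above.

```python
import numpy as np

def score_gradient(f, phi, n_samples=1000, baseline=0.0, rng=None):
    """Monte Carlo score gradient estimate of grad_phi E_{q(z|phi)}[f(z)],
    for the illustrative family q(z|phi) = N(phi, I).

    `baseline` is the constant control variate c; c = 0 recovers the
    naive estimator, and any other constant leaves the mean unchanged.
    """
    rng = np.random.default_rng(rng)
    z = phi + rng.standard_normal((n_samples, phi.shape[0]))  # z_i ~ q(z|phi)
    scores = z - phi                                          # grad_phi log q(z_i|phi)
    fz = np.array([f(zi) for zi in z])                        # f(z_i)
    return ((fz - baseline)[:, None] * scores).mean(axis=0)

# Example: f(z) = ||z||^2, for which the true gradient is 2 * phi.
phi = np.array([1.0, -2.0])
f = lambda z: np.sum(z ** 2)
print(score_gradient(f, phi, n_samples=50_000, baseline=0.0, rng=0))  # naive
print(score_gradient(f, phi, n_samples=50_000, baseline=7.0, rng=0))  # c close to E_phi[f(z)]
```

With a shared seed, the two calls use the same samples, so the effect of the baseline on the quality of the estimate is directly visible.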
### From Control Variates to Covariances

An alternative strategy is to use this observation to re-interpret our original representation directly. Recall that for suitable functions $\alpha, \beta$, writing $\bar{\alpha} := \mathbf{E}[\alpha(x)]$ and $\bar{\beta} := \mathbf{E}[\beta(x)]$, it holds that

\begin{align}
\textbf{Cov} [ \alpha(x), \beta(x)] &= \mathbf{E} \left[ \alpha(x) \cdot \beta(x)^T \right] - \mathbf{E} \left[ \alpha(x) \right] \cdot \mathbf{E} \left[ \beta(x) \right]^T \\
&= \mathbf{E} \left[ \left( \alpha(x) - \bar{\alpha} \right) \cdot \left( \beta(x) - \bar{\beta} \right)^T \right].
\end{align}

This encourages us to re-write

\begin{align}
\nabla_\phi F (\phi) &= \mathbf{E}_{\phi} \left[ f(z) \cdot \nabla_\phi \log q(z | \phi) \right] \\
&= \textbf{Cov}_\phi \left[ f(z), \left( \nabla_\phi \log q(z | \phi) \right)^T \right] + \mathbf{E}_\phi \left[ f(z) \right] \cdot \mathbf{E}_\phi \left[ \nabla_\phi \log q(z | \phi) \right]^T \\
&= \textbf{Cov}_\phi \left[ f(z), \left( \nabla_\phi \log q(z | \phi) \right)^T \right].
\end{align}

Written in this form, it is slightly clearer why naive score gradient estimators might struggle: one is trying to estimate a quantity which is naturally expressible as a covariance, using only a single sample!

As such, we make the simplest possible modification: instead, take $2$ samples from $q(z|\phi)$, and estimate the gradient by using the covariance representation. This leads to the estimator

\begin{align}
\nabla_\phi F (\phi) \approx \frac{1}{2} \left( f(z_1) - f(z_2) \right) \cdot \left( \nabla_\phi \log q(z_1 | \phi) - \nabla_\phi \log q(z_2 | \phi) \right).
\end{align}

Extending this estimator to the case with $M > 2$ samples (e.g. by using the standard unbiased sample covariance) is relatively straightforward, and has transparent benefits. Moreover, it can still be formed in $\mathcal{O}(M)$ time, as for standard Monte Carlo estimators.
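The covariance-form estimator is equally simple to implement. The following sketch (again in NumPy, with the same illustrative Gaussian family and function names of my own choosing) works for any $M \geq 2$, uses the unbiased sample covariance, and runs in $\mathcal{O}(M)$ time.

```python
import numpy as np

def covariance_score_gradient(fz, scores):
    """Covariance-form score gradient estimate, formed in O(M) time.

    fz     : array of shape (M,), the values f(z_m) for z_m ~ q(z|phi).
    scores : array of shape (M, d), the scores grad_phi log q(z_m|phi).

    Returns the unbiased sample covariance between f(z) and the score,
    which matches grad_phi F(phi) in expectation; with M = 2 it reduces
    to (1/2) (f(z_1) - f(z_2)) (s(z_1) - s(z_2)).
    """
    M = fz.shape[0]
    f_centred = fz - fz.mean()                 # f(z_m) - f_bar
    s_centred = scores - scores.mean(axis=0)   # s(z_m) - s_bar
    return f_centred @ s_centred / (M - 1)

# Illustration with the same (assumed) family q(z|phi) = N(phi, I):
rng = np.random.default_rng(0)
phi = np.array([1.0, -2.0])
f = lambda z: np.sum(z ** 2)      # true gradient is 2 * phi
M = 10_000
z = phi + rng.standard_normal((M, phi.shape[0]))
print(covariance_score_gradient(np.array([f(zi) for zi in z]), z - phi))
```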
## An Advanced Example: The Multi-Sample Case

Consider the following objective function, which arises in the context of (multi-sample) variational approximation, and the estimation of normalising constants:

\begin{align}
F(\phi) = \mathbf{E}_{\phi^{\otimes K}} \left[ \log \left( \frac{1}{K} \sum_{k = 1}^K \frac{\gamma (z_k)}{q (z_k | \phi)} \right) \right],
\end{align}

where the expectation is now taken with respect to $K$ independent copies of $q(z | \phi)$. One can then compute the gradient of this expression to be

\begin{align}
\nabla_\phi F(\phi) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \log \left( \frac{1}{K} \sum_{k = 1}^K w(z_k | \phi) \right) \cdot \left( \sum_{k = 1}^K s (z_k | \phi) \right) \right] \\
&\, + \mathbf{E}_{\phi^{\otimes K}} \left[ \nabla_\phi \log \left( \frac{1}{K} \sum_{k = 1}^K w(z_k | \phi) \right) \right],
\end{align}

where $w(z_k | \phi) := \gamma (z_k) / q (z_k | \phi)$ and $s (z | \phi) := \nabla_\phi \log q (z | \phi)$. Some further algebra yields that

\begin{align}
\nabla_\phi F(\phi) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \log \left( \frac{1}{K} \sum_{k = 1}^K w(z_k | \phi) \right) \cdot \left( \sum_{k = 1}^K s (z_k | \phi) \right) \right] \\
&\, - \mathbf{E}_{\phi^{\otimes K}} \left[ \frac{\sum_{k = 1}^K w (z_k | \phi) s (z_k | \phi)}{\sum_{k = 1}^K w (z_k | \phi)} \right].
\end{align}

By symmetry, it thus suffices for us to derive suitable estimators for

\begin{align}
\mathcal{G}_1 ( \phi ) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \log \left( \frac{1}{K} \sum_{k = 1}^K w(z_k | \phi) \right) \cdot s (z_1 | \phi) \right] \\
\mathcal{G}_2 ( \phi ) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \frac{ w (z_1 | \phi) s (z_1 | \phi)}{\sum_{k = 1}^K w (z_k | \phi)} \right].
\end{align}

### Unravelling $\mathcal{G}_1$

Focusing on the dependence on $z_1$ and on $z_{-1} := z_{2:K}$ separately, and writing $W_{K} (z_{-1} | \phi) := \frac{1}{K} \sum_{k = 2}^K w(z_k | \phi)$, we might rewrite

\begin{align}
\mathcal{G}_1 ( \phi ) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \left( \log \left( \frac{1}{K} w(z_1 | \phi) + W_{K} (z_{-1} | \phi) \right) \right) \cdot s (z_1 | \phi) \right] \\
&= \mathbf{E}_{\phi} \left[ s (z_1 | \phi) \cdot \mathbf{E}_{\phi^{\otimes (K - 1)}} \left[ \left( \log \left( \frac{1}{K} w(z_1 | \phi) + W_{K} (z_{-1} | \phi) \right) \right) \vert z_1 \right] \right] \\
&= \textbf{Cov}_\phi \left[ s (z_1 | \phi), \mathbf{E}_{\phi^{\otimes (K - 1)}} \left[ \left( \log \left( \frac{1}{K} w(z_1 | \phi) + W_{K} (z_{-1} | \phi) \right) \right) \vert z_1 \right] \right].
\end{align}

Despite the seemingly elaborate structure here, if one writes $g(z_1 | \phi) := \mathbf{E}_{\phi^{\otimes (K - 1)}} \left[ \left( \log \left( \frac{1}{K} w(z_1 | \phi) + W_{K} (z_{-1} | \phi) \right) \right) \vert z_1 \right]$, then we are back in familiar territory, and can proceed as follows (a code sketch of both recipes appears at the end of this section):

1. Sample $z_1^{(1)}, z_1^{(2)}, z_{2:K} \sim q(z | \phi)$.
2. For $\ell \in \{1, 2 \}$, form
    \begin{align}
    \hat{g} \left( z_1^{(\ell)} \right) := \log \left( \frac{1}{K} w(z_1^{(\ell)} | \phi) + W_{K} (z_{2:K} | \phi) \right).
    \end{align}
3. Return
    \begin{align}
    \mathcal{G}_1 ( \phi ) \approx \frac{1}{2} \left( s \left( z_1^{(1)} | \phi \right) - s \left( z_1^{(2)} | \phi \right) \right) \cdot \left( \hat{g} \left( z_1^{(1)} \right) - \hat{g} \left( z_1^{(2)} \right) \right).
    \end{align}

Note that even though the original expectation involved only $K$ terms, our estimator required the simulation of $(K + 1)$ samples. This is not too surprising, given that the expectation is expressed in terms of a covariance: more samples are needed in order to understand how quantities vary together.

Note of course that one can re-use these $(K + 1)$ samples in $\frac{K(K+1)}{2}$ ways, i.e. picking $2$ of the samples to be "$z_1$", and letting the remaining samples "make up the numbers" for the $\hat{g}$ terms. One can think of this in terms of Rao-Blackwellisation, or even just as a simple symmetry argument: all of the samples came from the same place, and so we should use all of them in all of the same ways.

### Unravelling $\mathcal{G}_2$

Along similar lines, we write

\begin{align}
\mathcal{G}_2 ( \phi ) &= \mathbf{E}_{\phi^{\otimes K}} \left[ \frac{ w (z_1 | \phi) s (z_1 | \phi)}{\sum_{k = 1}^K w (z_k | \phi)} \right] \\
&= \mathbf{E}_{\phi^{\otimes K}} \left[ \frac{ w (z_1 | \phi)}{\sum_{k = 1}^K w (z_k | \phi)} \cdot s (z_1 | \phi) \right] \\
&= \mathbf{E}_{\phi^{\otimes (K-1)}} \left[ \textbf{Cov}_\phi \left( \Omega( z_1 | z_{-1}, \phi) , s (z_1 | \phi) \right) \right],
\end{align}

with $\Omega( z_1 | z_{-1}, \phi) := w (z_1 | \phi) / \sum_{k = 1}^K w (z_k | \phi)$, where the covariance is taken over $z_1$ alone, conditionally on $z_{-1}$. This can then be estimated as follows:

1. Sample $z_1^{(1)}, z_1^{(2)}, z_{2:K} \sim q(z | \phi)$.
2. For $\ell \in \{1, 2 \}$, form
    \begin{align}
    \hat{\omega} \left( z_1^{(\ell)} \right) := \Omega( z_1^{(\ell)} | z_{2:K}, \phi).
    \end{align}
3. Return
    \begin{align}
    \mathcal{G}_2 ( \phi ) \approx \frac{1}{2} \left( s \left( z_1^{(1)} | \phi \right) - s \left( z_1^{(2)} | \phi \right) \right) \cdot \left( \hat{\omega} \left( z_1^{(1)} \right) - \hat{\omega} \left( z_1^{(2)} \right) \right).
    \end{align}

As before, one really should re-use all $(K + 1)$ samples to produce an estimator with much-reduced variance at an acceptable additional cost. The efficient implementation of such a strategy is a nice exercise, one solution to which is described [here](https://stats.stackexchange.com/a/498205/193742).
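To fix ideas, here is a minimal sketch of the two-sample recipes for $\mathcal{G}_1$ and $\mathcal{G}_2$ above, without the sample re-use, assuming (purely for illustration) the same Gaussian family $q(z | \phi) = \mathcal{N}(\phi, I)$ together with an unnormalised standard-normal target $\gamma$; all function names are hypothetical.

```python
import numpy as np

# Illustrative setup (my choice, not from the note): q(z|phi) = N(phi, I) and
# an unnormalised standard-normal target gamma(z) = exp(-||z||^2 / 2).
def log_q(z, phi):
    d = z.shape[-1]
    return -0.5 * np.sum((z - phi) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def log_gamma(z):
    return -0.5 * np.sum(z ** 2, axis=-1)

def score(z, phi):
    return z - phi  # grad_phi log q(z|phi) for this Gaussian family

def estimate_G1_G2(phi, K, rng):
    """Two-sample covariance estimators of G_1 and G_2, from (K+1) draws of q."""
    d = phi.shape[0]
    z1 = phi + rng.standard_normal((2, d))           # z_1^{(1)}, z_1^{(2)}
    z_rest = phi + rng.standard_normal((K - 1, d))   # z_{2:K}

    w1 = np.exp(log_gamma(z1) - log_q(z1, phi))      # w(z_1^{(l)} | phi), shape (2,)
    w_rest = np.exp(log_gamma(z_rest) - log_q(z_rest, phi))

    g_hat = np.log((w1 + w_rest.sum()) / K)          # hat{g}(z_1^{(l)})
    omega_hat = w1 / (w1 + w_rest.sum())             # hat{omega}(z_1^{(l)})

    ds = score(z1[0], phi) - score(z1[1], phi)       # s(z_1^{(1)}) - s(z_1^{(2)})
    G1 = 0.5 * ds * (g_hat[0] - g_hat[1])
    G2 = 0.5 * ds * (omega_hat[0] - omega_hat[1])
    return G1, G2

# By the symmetry noted above, grad_phi F(phi) = K * (G_1 - G_2); a single call
# gives a noisy unbiased estimate, so in practice one would average over many
# replicates (and re-use all (K+1) samples, as discussed).
rng = np.random.default_rng(0)
phi = np.array([1.0, -2.0])
G1, G2 = estimate_G1_G2(phi, K=8, rng=rng)
print(8 * (G1 - G2))
```

The particular pairing of $q$ and $\gamma$ here is beside the point; the sketch is only meant to show how few ingredients the two-sample recipes require.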
## Conclusion

In this note, I have described the score gradient estimator, and a simple interpretation of it in terms of covariance estimation. Using this interpretation, I have suggested two simple variance reduction schemes for different scenarios in which such estimators arise. A key aspect is that such estimators almost universally require *at least one more* sample than the naive estimator (though rarely more than that) in order to achieve this variance reduction.

A parting thought is that this strategy is closely related to the idea of (computational) symmetry: if you are estimating a quantity which has some special properties (symmetry, having a known mean, etc.), then it is usually beneficial to use an estimator which also respects these same properties. This is worth keeping in mind more broadly.