###### tags: `mcnf` `monte carlo` `expository`
# Monte Carlo Estimation of Nonlinear Functionals: Bias Reduction
**Overview**: In this note, I will return to the task of estimation of nonlinear functions of expectations. Beyond the polynomial setting, devising truly unbiased estimators is challenging, and so here I will focus on strategies for bias *reduction*. Many of these strategies have their roots in classical statistical estimation theory, rather than Monte Carlo methods, but can nevertheless be very useful.
## Formalising the Problem
We will focus on the task of estimating quantities of the form
\begin{align}
Q = \Phi ( I )
\end{align}
where $\Phi$ is a nonlinear function, and $I$ is expressible as the integral
\begin{align}
I &= \int p(dx) f(x) \\
&= \mathbf{E}_p [f].
\end{align}
The exact properties of $\Phi$ will largely be suppressed, but there will usually be some implicit assumptions of continuity, smoothness and similar.
A key building block will be the standard Monte Carlo estimator of $I$, derived by the following procedure:
1. Draw samples $x^1, \ldots, x^N \sim p$
2. Form the estimator
\begin{align}
\hat{I}_N = \frac{1}{N} \sum_{i = 1}^N f(x^i).
\end{align}
The 'naive' estimator of $Q$ is then given as $\hat{Q}_N = \Phi \left( \hat{I}_N\right)$. Many of the techniques for bias correction which we discuss are perturbative in nature, that is, they can be viewed as corrections to the naive estimator, which are hopefully small as $N$ grows.
## Towards Bias Reduction: Bias Analysis
Before getting into 'correcting' the naive estimator, it makes sense to first understand the ways in which it is incorrect. We might thus begin by Taylor-expanding our estimator about its expectation, as
\begin{align}
\hat{Q}_N &= \Phi \left( \hat{I}_N \right) \\
&= \Phi(I) + \Phi'(I) \left( \hat{I}_N - I \right) + \frac{\Phi''(I)}{2} \left( \hat{I}_N - I \right)^2 + \mathcal{O} \left( \left| \hat{I}_N - I \right|^3\right).
\end{align}
Taking expectations, we see that
\begin{align}
\mathbf{E} \left[ \hat{Q}_N \right] &= \Phi(I) + \frac{\Phi''(I)}{2} \text{Var} \left( \hat{I}_N \right) + \mathcal{O} \left( N^{-3/2} \right) \\
&= \Phi(I) + \frac{\Phi''(I)}{2N} \text{Var}_p (f) + \mathcal{O} \left( N^{-3/2} \right).
\end{align}
As such, we see that the naive estimator will typically exhibit a bias of order $N^{-1}$. When $\Phi$ is convex, we know that this bias will be nonnegative.
Similarly, we can take the variance of this expansion to estimate that
\begin{align}
\text{Var} \left( \hat{Q}_N \right) &= \text{Var} \left( \Phi(I) + \Phi'(I) \left( \hat{I}_N - I \right) + \mathcal{O}(N^{-1}) \right) \\
&= \text{Var} \left( \Phi'(I) \left( \hat{I}_N - I \right) \right) + \mathcal{O}(N^{-2})\\
&= \Phi'(I)^2 \cdot \text{Var} \left( \hat{I}_N \right) + \mathcal{O}(N^{-2}) \\
&= \frac{\Phi'(I)^2 \cdot \text{Var}_p (f)}{N} + \mathcal{O}(N^{-2}).
\end{align}
So, despite the bias which is induced by the nonlinearity, the variance of our estimator continues to decrease at the standard $N^{-1}$ rate.
## Taylor Expansions and The Delta Method
Rearranging our derivations for the bias, we might now write
\begin{align}
Q &= \mathbf{E} \left[ \hat{Q}_N \right] - \frac{\Phi''(I)}{2N} \text{Var}_p (f) + \mathcal{O} \left( N^{-3/2} \right).
\end{align}
This suggests the following modified estimator:
1. Draw samples $x^1, \ldots, x^N \sim p$
2. Form the estimators
\begin{align}
\hat{I}_N &= \frac{1}{N} \sum_{i = 1}^N f(x^i) \\
\hat{V}_N &= \frac{1}{N - 1} \sum_{i = 1}^N \left( f(x^i) - \hat{I}_N \right)^2.
\end{align}
3. Form the final estimator
\begin{align}
\tilde{Q}_N &= \Phi \left( \hat{I}_N \right) - \frac{\Phi ''\left( \hat{I}_N \right) }{2N} \cdot \hat{V}_N
\end{align}
Standard large-sample asymptotics suggest that $\Phi ''\left( \hat{I}_N \right) - \Phi''(I) \lesssim N^{-1/2}$, which should lead to the bias of $\tilde{Q}_N$ being bounded as $\mathcal{O} \left( N^{-3/2} \right)$. In particular, the bias is now of a lower order than the variance, which is reassuring.
The above derivation is related to the so-called 'Delta Method'. This term usually refers to the additional assertion that $\hat{Q}_N$ has an approximately Gaussian distribution (with the mean and variance specified above), but for our derivations, the Gaussian character is not strictly needed to establish that this method can provide us with a reduction in bias.
Many discussions of the Delta Method are happy to terminate here, as reducing the bias to be of asymptotically lower order is somehow the main achievement of the construction. However, one should generally be wary of asymptotics, particularly for difficult problems. In practice, the sizes of $\Phi(I), \Phi'(I),$ and $\Phi''(I)$ will play a large role in characterising the extent to which our modified estimator improves over the naive estimator. It is all very well and good for improvements to happen as $N$ tends to infinity, as long as we see those improvements from down here in the world of the finite!
One can of course push this idea further, by looking at higher-order Taylor expansions of $\Phi$ around $I$ and cancelling out additional sources of bias. The extent to which this is worthwhile is heavily dependent upon the smoothness of $\Phi$, and the growth of its derivatives. To my knowledge, this is not done so often in practice.
One can also relate this discussion to a previous post, which studied the case of polynomial $\Phi$, hence admitting a finite Taylor expansion. As established in that post, this allows for complete bias removal. For $\Phi$ which are well-approximated by polynomials of fixed degree, one can devise hybrid strategies based around these observations.
## From Taylor Polynomial to Uniform Polynomial Approximation
Having established in a previous post that polynomial functions of expectations can be estimated without bias, we may consider strategies which involve approximating $\Phi$ by a polynomial.
For example, suppose that a preliminary collection of samples suggest to us that $I$ is exceedingly likely to reside in the interval $[a, b]$. We might then fix a degree $d \in \mathbf{N}$, and find the polynomial $\Phi_d$ of degree $\leqslant d$ which is the best approximation of $\Phi$ on the interval in question, i.e.
\begin{align}
\Phi_{d} = \text{arg min}_{\phi \in \mathbf{R}_d [x]} \| \Phi - \phi \|_{L^\infty([a,b])}.
\end{align}
One can then proceed to estimate $\Phi_d(I)$ unbiasedly, using the techniques of the previous post. Provided that $I$ truly lies in the interval $[a, b]$, we have that the bias of this estimator is bounded from above by $\| \Phi - \Phi_{d} \|_\infty$. For smooth enough functions $\Phi$, it will often hold that $\| \Phi - \Phi_{d} \|_\infty$ decays exponentially with $d$, and so taking $d = \mathcal{O} (\log N)$ will suffice to drive the bias of our estimator below that of the variance. Note that the variance of the unbiased estimator for $\Phi_{d}$ will often grow with $d$ (sometimes exponentially so!).
For non-smooth $\Phi$, it is more common for $\| \Phi - \Phi_{d} \|_\infty$ to decay at rates which are polynomial in $d$. If the exponent is sufficiently large, one can optimally take $d = \mathcal{O}(n^c))$ for some $c \in (0, 1)$ (roughly speaking, the more derivatives $\Phi$ has, the smaller one can take $c$). For very rough $\Phi$, this whole strategy is generally not so worthwhile.
## Resampling Strategies: The Jackknife and the Bootstrap
The preceding approaches are quite specific in nature, in the sense that each application requires its own derivation, with explicit knowledge of the derivatives of $\Phi$, and so on. In this section, we will explore some more 'black-box' approaches to bias reduction.
The broad strategy will be to observe that the 'naive' estimator $\hat{Q}_N$ is asymptotically unbiased in $N$, and to assume that its behaviour as $N \to \infty$ is sufficiently regular to admit an asymptotic expansion of the form
\begin{align}
\mathbf{E} \left[ \hat{Q}_N \right] \sim \Phi(I) + \sum_{k \geqslant 1} c_k (I) N^{-k},
\end{align}
where the equivalence is understood in the sense of asymptotic expansions. In particular, the sum on the right-hand side is not required to be convergent.
Our first observation will be to note that even if we do not know the values of the $c_k (I)$ in any closed form, we can estimate them indirectly by solving an appropriate linear system, and then use them to reduce the bias of our estimator.
For a simple instance of this, suppose that $N$ is sufficiently large that we can approximate
\begin{align}
\mathbf{E} \left[ \hat{Q}_N \right] &\approx \Phi(I) + c_1 (I) N^{-1} \\
\mathbf{E} \left[ \hat{Q}_{N - 1} \right] &\approx \Phi(I) + c_1 (I) (N - 1)^{-1}.
\end{align}
Given that we can approximate both quantities on the left-hand side using our Monte Carlo samples, we now attempt to solve the set of coupled linear equations
\begin{align}
\mathbf{E} \left[ \hat{Q}_N \right] &=\Phi(I) + c_1 (I) N^{-1} \\
\mathbf{E} \left[ \hat{Q}_{N - 1} \right] &= \Phi(I) + c_1 (I) (N - 1)^{-1}
\end{align}
for $\left( \Phi(I), c_1 (I) \right)$. An explicit calculation reveals that
\begin{align}
\Phi(I) &= N \cdot \mathbf{E} \left[ \hat{Q}_N \right] - (N - 1) \cdot \mathbf{E} \left[ \hat{Q}_{N - 1} \right] \\
c_1(I) &= N ( N - 1) \cdot \left( \mathbf{E} \left[ \hat{Q}_{N - 1} \right] - \mathbf{E} \left[ \hat{Q}_N \right] \right).
\end{align}
As such, we could compute a bias-reduced estimator of $\Phi(I)$ by:
1. Estimate $\mathbf{E} \left[ \hat{Q}_N \right] \approx \Phi \left( \frac{1}{N} \sum_{i = 1}^N f(x^i) \right)$.
2. Estimate $\mathbf{E} \left[ \hat{Q}_{N - 1} \right] \approx \Phi \left( \frac{1}{N-1} \sum_{i = 1, i \neq j}^N f(x^i) \right)$ for some $j \in [N]$.
3. Return the estimate
\begin{align}
\Phi(I) \approx N \cdot \Phi \left( \frac{1}{N} \sum_{i = 1}^N f(x^i) \right)- (N - 1)\cdot \Phi \left( \frac{1}{N-1} \sum_{i = 1, i \neq j}^N f(x^i) \right).
\end{align}
This technique is historically known as *the jackknife*.
In practice, a simple way to improve the quality of our estimate of $\mathbf{E} \left[ \hat{Q}_{N - 1} \right]$ is to aggregate over all possible 'left-out' observations $j$, i.e.
\begin{align}
\mathbf{E} \left[ \hat{Q}_{N - 1} \right] \approx \frac{1}{N} \sum_{j = 1}^N \Phi \left( \frac{1}{N-1} \sum_{i = 1, i \neq j}^N f(x^i) \right),
\end{align}
which will generally reduce the variance of the estimator. However, it does increase the computational cost, and so many not always be worthwhile.
Hopefully, the derivation above is naturally suggestive of generalisations. For example, one can imagine forming a similar system based on considering $\{ \mathbf{E} \left[ \hat{Q}_N \right], \mathbf{E} \left[ \hat{Q}_{N - 1} \right], \mathbf{E} \left[ \hat{Q}_{N - 2} \right]\}$, and taking their asymptotic expansions up to and including the $c_2(I)$ term. As in previous examples, forming a more complicated estimate typically allows for further reduction in bias, at a risk of increased variance.
It should be noted that *the bootstrap*, which is in some ways the successor to the jackknife, can also be used in a similar fashion to reduce the bias of estimators. I will not provide additional details on such procedures in this note.
## Conclusion
In this note, I have outlined some strategies for estimating nonlinear, nonpolynomial functionals of integrals via Monte Carlo. The strategies vary in the extent to which they require case-specific knowledge of the nonlinearity in question, but all benefit from smoothness. Throughout, some bias remains, but is ideally reduced to a less noticeable level. Few estimators are without sin, but if we are able to take a biased estimator and modify it so that the bias is no longer one of the estimator's major problems, then progress has been made.
One benefit of the framing above is that it suggests strategies for tackling the *Nested Monte Carlo* problem, i.e. estimating quantities of the form
\begin{align}
I &= \int p(dx) \Phi( J(x)) \\
\text{where} \quad J(x) &= \int p(dy | x) f(x, y).
\end{align}
The techniques which I have presented in this note are appropriate for solving the inner problem of estimating $\Phi( J(x))$, and this opens the door to solving the outer problem of estimating $I$. This will be discussed in a future post.
The next post will focus on strategies which are able to remove bias entirely (!) in some settings, by carefully modifying the ambient space in which the original problem is posed. This possibility presents some interesting costs and benefits.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**