###### tags: `mcnf` `monte carlo` `expository`
# Monte Carlo Estimation of Nonlinear Functionals: Unbiased Estimators via Stopping Times
**Overview**: In this note, I will again return to the task of estimation of nonlinear functions of expectations. In contrast to the previous note, we will set our sights on designing truly unbiased estimators. There will be a slight twist, in that this goal is achieved by constructing estimators whose cost of simulation is random.
## A Simple Example: Estimating Rational Functions of Probabilities
Suppose that we have access to a stream of iid random variables $\{ X_i \}_{i \geqslant 1} \sim \text{Bernoulli}(p)$, and we wish to estimate functions of $p$. It is useful to observe the following:
* Given any number of samples, we can unbiasedly estimate $p$ by taking the sample mean of the $X_i$.
* Given $N \geqslant m$ samples, we can unbiasedly estimate $p^m$, using the expression $p^m = \mathbf{E}_{P^{\otimes m}}[X_1 \cdots X_m]$.
* As a corollary, given $N$ samples, we can unbiasedly any estimate any polynomial in $p$ of degree $\leqslant N$.
* Conversely, the only functions of $p$ which can be unbiasedly estimated given a fixed number of samples are polynomials in $p$ of appropriate degree.
As such, for non-polynomial $f$, we must observe that any fixed number of samples will not allow unbiased estimates to be formed. Fortunately, there are still perfectly good estimation scheme which use a non-fixed number of samples!
Consider the following protocol: given an infinite stream of $\text{Bernoulli}(p)$ random variables $\{ X_i \}_{i \geqslant 1}$, keep observing the outcomes of these variables until you observe $k \geqslant 1$ occurrences of $X_i = 1$. That is, let
\begin{align}
N = \inf \left\{ n \geqslant 1 : \sum_{i = 1}^n X_i \geqslant k \right\}.
\end{align}
The law of $N$ generated in this way is known as a *negative binomial* distribution. It is a useful exercise to prove that
\begin{align}
\mathbf{P}(N = n) = {n - 1 \choose k - 1} p^k (1-p)^{n-k} \quad \text{for} \, n \geqslant k.
\end{align}
With this construction in mind, we might wonder whether $N$ can be used to estimate functionals of $p$. We should expect the typical value of $N$ to be inversely related with $p$, i.e. for smaller values of $p$, it should take longer to observe many occurrences of $X_i = 1$, and vice versa.
In fact, one can show that the expectation of $N$ is precisely $k \cdot p^{-1}$, calculating as follows:
\begin{align}
\mathbf{E}[N] &= \sum_{n \geqslant k} n \cdot \mathbf{P}(N = n) \\
&= \sum_{n \geqslant k} n \cdot {n - 1 \choose k - 1} p^k (1-p)^{n-k} \\
&= \sum_{n \geqslant k} k \cdot {n \choose k} p^k (1-p)^{n-k} \\
&= k \cdot \sum_{(n + 1) \geqslant (k + 1)} {(n+1) - 1 \choose (k+1)-1} p^{(k + 1) - 1} (1-p)^{(n + 1) - (k + 1)} \\
&= k \cdot p^{-1} \cdot \sum_{m \geqslant (k + 1)} {m - 1 \choose (k+1)-1} p^{(k + 1)} (1-p)^{m - (k + 1)} \\
&= k \cdot p^{-1},
\end{align}
where the final sum collapses to $1$ because it is the sum of the probability mass function corresponding to a negative binomial distribution with $(k + 1)$ successes.
Hence, by randomising the number of Bernoulli trials which we observe, we are able to obtain an unbiased estimator of $p^{-1}$. This would not be possible using only a fixed number of trials.
The interested reader should seek to generalise this construction to obtain an unbiased estimator of $p^{-a}$ for $a \geqslant 1$, taking care to justify how large $k$ must be taken to enable this.
## Taylor Series and Single-Term Estimators
Suppose now that we wish to estimate $Q = f(I)$, where $I = \mathbf{E}_p [ f(x) ]$, and that $f$ admits a convergent Taylor series. For the purposes of exposition, we will assume that the Taylor series converges globally; in practice, slightly more care is often required.
Now, write
\begin{align}
Q &= f(I) \\
&= \sum_{n \geqslant 0} f_n I^n \\
&= \sum_{n \geqslant 0} f_n \mathbf{E}_p [ f(x) ]^n \\
&= \sum_{n \geqslant 0} f_n \mathbf{E}_{p^{\otimes n}} [ f(X_1) \cdots f(X_n) ].
\end{align}
We have thus expressed $Q$ as an infinite series, in which each term can be estimated without bias. However, we do not have time to evaluate all of the infinitely-many terms, and so some extra ingenuity is required.
The *single-term estimator* strategy is essentially an instance of importance sampling. Writing $Q_n$ for $f_n I^n$, one can posit some distribution $q$ which is supported on the nonnegative integers, and write
\begin{align}
Q &= \sum_{n \geqslant 0} Q_n \\
&= \sum_{n \geqslant 0} q_n \cdot \frac{Q_n}{q_n}.
\end{align}
One can then conceive of the following method:
1. Sample $N \sim q$.
2. Form an unbiased estimate of $Q_N$, and call it $\hat{Q}_N$.
3. Return $\frac{\hat{Q}_N}{q_N}$ as an estimate of $Q$.
One can verify the unbiasedness of this estimator as follows:
\begin{align}
\mathbf{E} \left[ \frac{\hat{Q}_N}{q_N} \right] &= \mathbf{E} \left[ \left[ \frac{\hat{Q}_N}{q_N} \vert N\right] \right]\\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \frac{\hat{Q}_N}{q_N} \vert N = n\right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \frac{\hat{Q}_n}{q_n} \right] \\
&= \sum_{n \geqslant 0} \mathbf{E} \left[ \hat{Q}_n\right] \\
&= \sum_{n \geqslant 0} Q_n \\
&= Q.
\end{align}
Noting that $\hat{Q}_N$ requires at least $N$ samples from $p$ to be formed, we see that this estimator again requires a random number of samples, and so we have not contradicted our earlier assertions about the existence of unbiased estimators.
## Cumulative Sum Estimators
A slightly annoying feature of the single-term strategy is that we generate $M \geqslant N$ samples from $p$, which could be used to estimate any of the $Q_m$ terms with $m \leqslant M$, but we only use them to estimate a single term. A modified estimator which exploits this structure is as follows:
1. Sample $N \sim q$.
2. Sample $X_1, \cdots, X_M \sim p$, with $M \geqslant N$.
3. Form unbiased estimators of $\{ Q_n \}_{0 \leqslant n \leqslant N}$, and call them $\{ \hat{Q}_n \}_{0 \leqslant n \leqslant N}$.
4. Return $\hat{Q}_\text{CS} = \sum_{n = 0}^N \frac{\hat{Q}_n}{\bar{q}_n}$ as an estimate of $Q$, where
\begin{align}
\bar{q}_n = \sum_{m \geqslant n} q_m.
\end{align}
The unbiasedness argument here is similar to before, again hinging on an importance sampling-type decomposition:
\begin{align}
\mathbf{E} \left[ \hat{Q}_\text{CS} \right] &= \mathbf{E} \left[ \left[ \hat{Q}_\text{CS} \vert N\right] \right]\\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \hat{Q}_\text{CS} \vert N = n\right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \sum_{m = 0}^n \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \sum_{m \geqslant 0} \mathbf{I} [ n \geqslant m] \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \mathbf{E} \left[ \sum_{n \geqslant 0} \sum_{m \geqslant 0} q_n \cdot \mathbf{I} [ n \geqslant m] \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \mathbf{E} \left[ \sum_{m \geqslant 0} \bar{q}_m \cdot \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \sum_{m \geqslant 0} \mathbf{E} [\hat{Q}_m] \\
&= \sum_{m \geqslant 0} Q_m \\
&= Q.
\end{align}
By a Rao-Blackwell-type argument, one would heuristically expect this estimator to have improved variance relative to the single-term estimator, at a modest increase in computational cost.
## Conclusion
In this note, I have described some elementary strategies for forming unbiased estimators by using randomised computation times. These strategies may seem specific (e.g. appearing to be grounded in the existence of Taylor series expansions, etc.), many of these arguments can be generalised substantially. Indeed, another route to these methods is to start with a sequence of convergent but biased estimators of a quantity of interest, artificially express their limit as a telescoping sum, and apply the techniques above to derive unbiased estimators. We will return to this strategy in future posts, in the context of coupling methods and Multilevel Monte Carlo.
It should also be remarked that this note has been quiet on the topic of variance. While these strategies are able to deliver unbiased estimators, this will often come at the cost of a variance which is increased, sometimes even to the point of being infinite. This is a particular risk for the single-term and cumulative sum estimators, which are both based on importance sampling, and product-form estimators for the inner estimate. The tradeoffs which are introduced by this additional complexity should not be navigated lightly.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**