###### tags: `mcnf` `monte carlo` `expository`
# Monte Carlo Estimation of Nonlinear Functionals: Unbiased Estimators via Stopping Times
**Overview**: In this note, I will again return to the task of estimation of nonlinear functions of expectations. In contrast to the previous note, we will set our sights on designing truly unbiased estimators. There will be a slight twist, in that this goal is achieved by constructing estimators whose cost of simulation is random.
## A Simple Example: Estimating Rational Functions of Probabilities
Suppose that we have access to a stream of iid random variables $\{ X_i \}_{i \geqslant 1} \sim \text{Bernoulli}(p)$, and we wish to estimate functions of $p$. It is useful to observe the following:
* Given any number of samples, we can unbiasedly estimate $p$ by taking the sample mean of the $X_i$.
* Given $N \geqslant m$ samples, we can unbiasedly estimate $p^m$, using the expression $p^m = \mathbf{E}_{P^{\otimes m}}[X_1 \cdots X_m]$.
* As a corollary, given $N$ samples, we can unbiasedly any estimate any polynomial in $p$ of degree $\leqslant N$.
* Conversely, the only functions of $p$ which can be unbiasedly estimated given a fixed number of samples are polynomials in $p$ of appropriate degree.
As such, for non-polynomial $f$, we must observe that any fixed number of samples will not allow unbiased estimates to be formed. Fortunately, there are still perfectly good estimation scheme which use a non-fixed number of samples!
Consider the following protocol: given an infinite stream of $\text{Bernoulli}(p)$ random variables $\{ X_i \}_{i \geqslant 1}$, keep observing the outcomes of these variables until you observe $k \geqslant 1$ occurrences of $X_i = 1$. That is, let
\begin{align}
N = \inf \left\{ n \geqslant 1 : \sum_{i = 1}^n X_i \geqslant k \right\}.
\end{align}
The law of $N$ generated in this way is known as a *negative binomial* distribution. It is a useful exercise to prove that
\begin{align}
\mathbf{P}(N = n) = {n - 1 \choose k - 1} p^k (1-p)^{n-k} \quad \text{for} \, n \geqslant k.
\end{align}
With this construction in mind, we might wonder whether $N$ can be used to estimate functionals of $p$. We should expect the typical value of $N$ to be inversely related with $p$, i.e. for smaller values of $p$, it should take longer to observe many occurrences of $X_i = 1$, and vice versa.
In fact, one can show that the expectation of $N$ is precisely $k \cdot p^{-1}$, calculating as follows:
\begin{align}
\mathbf{E}[N] &= \sum_{n \geqslant k} n \cdot \mathbf{P}(N = n) \\
&= \sum_{n \geqslant k} n \cdot {n - 1 \choose k - 1} p^k (1-p)^{n-k} \\
&= \sum_{n \geqslant k} k \cdot {n \choose k} p^k (1-p)^{n-k} \\
&= k \cdot \sum_{(n + 1) \geqslant (k + 1)} {(n+1) - 1 \choose (k+1)-1} p^{(k + 1) - 1} (1-p)^{(n + 1) - (k + 1)} \\
&= k \cdot p^{-1} \cdot \sum_{m \geqslant (k + 1)} {m - 1 \choose (k+1)-1} p^{(k + 1)} (1-p)^{m - (k + 1)} \\
&= k \cdot p^{-1},
\end{align}
where the final sum collapses to $1$ because it is the sum of the probability mass function corresponding to a negative binomial distribution with $(k + 1)$ successes.
Hence, by randomising the number of Bernoulli trials which we observe, we are able to obtain an unbiased estimator of $p^{-1}$. This would not be possible using only a fixed number of trials.
The interested reader should seek to generalise this construction to obtain an unbiased estimator of $p^{-a}$ for $a \geqslant 1$, taking care to justify how large $k$ must be taken to enable this.
## Taylor Series and Single-Term Estimators
Suppose now that we wish to estimate $Q = f(I)$, where $I = \mathbf{E}_p [ f(x) ]$, and that $f$ admits a convergent Taylor series. For the purposes of exposition, we will assume that the Taylor series converges globally; in practice, slightly more care is often required.
Now, write
\begin{align}
Q &= f(I) \\
&= \sum_{n \geqslant 0} f_n I^n \\
&= \sum_{n \geqslant 0} f_n \mathbf{E}_p [ f(x) ]^n \\
&= \sum_{n \geqslant 0} f_n \mathbf{E}_{p^{\otimes n}} [ f(X_1) \cdots f(X_n) ].
\end{align}
We have thus expressed $Q$ as an infinite series, in which each term can be estimated without bias. However, we do not have time to evaluate all of the infinitely-many terms, and so some extra ingenuity is required.
The *single-term estimator* strategy is essentially an instance of importance sampling. Writing $Q_n$ for $f_n I^n$, one can posit some distribution $q$ which is supported on the nonnegative integers, and write
\begin{align}
Q &= \sum_{n \geqslant 0} Q_n \\
&= \sum_{n \geqslant 0} q_n \cdot \frac{Q_n}{q_n}.
\end{align}
One can then conceive of the following method:
1. Sample $N \sim q$.
2. Form an unbiased estimate of $Q_N$, and call it $\hat{Q}_N$.
3. Return $\frac{\hat{Q}_N}{q_N}$ as an estimate of $Q$.
One can verify the unbiasedness of this estimator as follows:
\begin{align}
\mathbf{E} \left[ \frac{\hat{Q}_N}{q_N} \right] &= \mathbf{E} \left[ \left[ \frac{\hat{Q}_N}{q_N} \vert N\right] \right]\\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \frac{\hat{Q}_N}{q_N} \vert N = n\right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \frac{\hat{Q}_n}{q_n} \right] \\
&= \sum_{n \geqslant 0} \mathbf{E} \left[ \hat{Q}_n\right] \\
&= \sum_{n \geqslant 0} Q_n \\
&= Q.
\end{align}
Noting that $\hat{Q}_N$ requires at least $N$ samples from $p$ to be formed, we see that this estimator again requires a random number of samples, and so we have not contradicted our earlier assertions about the existence of unbiased estimators.
## Cumulative Sum Estimators
A slightly annoying feature of the single-term strategy is that we generate $M \geqslant N$ samples from $p$, which could be used to estimate any of the $Q_m$ terms with $m \leqslant M$, but we only use them to estimate a single term. A modified estimator which exploits this structure is as follows:
1. Sample $N \sim q$.
2. Sample $X_1, \cdots, X_M \sim p$, with $M \geqslant N$.
3. Form unbiased estimators of $\{ Q_n \}_{0 \leqslant n \leqslant N}$, and call them $\{ \hat{Q}_n \}_{0 \leqslant n \leqslant N}$.
4. Return $\hat{Q}_\text{CS} = \sum_{n = 0}^N \frac{\hat{Q}_n}{\bar{q}_n}$ as an estimate of $Q$, where
\begin{align}
\bar{q}_n = \sum_{m \geqslant n} q_m.
\end{align}
The unbiasedness argument here is similar to before, again hinging on an importance sampling-type decomposition:
\begin{align}
\mathbf{E} \left[ \hat{Q}_\text{CS} \right] &= \mathbf{E} \left[ \left[ \hat{Q}_\text{CS} \vert N\right] \right]\\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \hat{Q}_\text{CS} \vert N = n\right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \sum_{m = 0}^n \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \sum_{n \geqslant 0} q_n \cdot\mathbf{E} \left[ \sum_{m \geqslant 0} \mathbf{I} [ n \geqslant m] \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \mathbf{E} \left[ \sum_{n \geqslant 0} \sum_{m \geqslant 0} q_n \cdot \mathbf{I} [ n \geqslant m] \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \mathbf{E} \left[ \sum_{m \geqslant 0} \bar{q}_m \cdot \frac{\hat{Q}_m}{\bar{q}_m} \right] \\
&= \sum_{m \geqslant 0} \mathbf{E} [\hat{Q}_m] \\
&= \sum_{m \geqslant 0} Q_m \\
&= Q.
\end{align}
By a Rao-Blackwell-type argument, one would heuristically expect this estimator to have improved variance relative to the single-term estimator, at a modest increase in computational cost.
## Conclusion
In this note, I have described some elementary strategies for forming unbiased estimators by using randomised computation times. These strategies may seem specific (e.g. appearing to be grounded in the existence of Taylor series expansions, etc.), many of these arguments can be generalised substantially. Indeed, another route to these methods is to start with a sequence of convergent but biased estimators of a quantity of interest, artificially express their limit as a telescoping sum, and apply the techniques above to derive unbiased estimators. We will return to this strategy in future posts, in the context of coupling methods and Multilevel Monte Carlo.
It should also be remarked that this note has been quiet on the topic of variance. While these strategies are able to deliver unbiased estimators, this will often come at the cost of a variance which is increased, sometimes even to the point of being infinite. This is a particular risk for the single-term and cumulative sum estimators, which are both based on importance sampling, and product-form estimators for the inner estimate. The tradeoffs which are introduced by this additional complexity should not be navigated lightly.