###### tags: `one-offs` `puzzles`
# Unbiased Estimators of the Inverse of the Mean from a Poisson Stream
**Overview**: In this post, I describe an estimation puzzle related to Poisson laws, and present a simple solution.
## The Puzzle
Suppose that you have a machine which generates iid $\mathrm{Poisson}(\lambda)$ samples, with $\lambda$ unknown, and I want a non-negative, unbiased estimate of $\lambda^{-1}$. You can take as long as you need, but after a finite amount of time, you have to give me an answer. How do you do this?
## First Thoughts
Note that from a single Poisson sample, many quantities admit simple estimators, e.g.
- For $n \in \mathbf{N}$, $\mathbf{E} \left[ \mathbf{I} \left[ X = n \right] \right] = \frac{\lambda^n \cdot \exp(-\lambda)}{n!}$.
- For $k \in \mathbf{N}$, $\mathbf{E} \left[ X(X-1)(X-2)\cdots(X-k+1) \right] = \lambda^k$.
- For $t \in \mathbf{R}$, $\mathbf{E} \left[ \exp (t \cdot X) \right] = \exp(\lambda\cdot(\exp (t) - 1))$.
In practice, when taking a fixed number of Poisson samples, one can either take simple averages of these basic estimators, or use tricks related to the Rao-Blackwell theorem to obtain improved estimators.
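The factorial-moment identity above is easy to sanity-check by simulation. Below is a minimal Monte Carlo sketch (the values of `lam`, `k`, and the sample size are illustrative choices, not from the post) verifying that the falling factorial $X(X-1)\cdots(X-k+1)$ averages to $\lambda^k$:

```python
import numpy as np

# Monte Carlo check of E[X(X-1)...(X-k+1)] = lambda^k for X ~ Poisson(lambda).
# lam, k, n_samples are illustrative choices.
rng = np.random.default_rng(0)
lam, k, n_samples = 2.0, 3, 1_000_000
x = rng.poisson(lam, size=n_samples)

# Falling factorial x(x-1)...(x-k+1), computed term by term.
est = np.ones_like(x, dtype=float)
for j in range(k):
    est *= (x - j)

print(est.mean())  # should be close to lam**k = 8.0
```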
Anyways, let's see whether there is a viable estimator which is computable from just one sample. Suppose that when $X \sim \mathrm{Poisson}(\lambda)$, $f(X)$ is an unbiased estimator of $\lambda^{-1}$, for any $\lambda > 0$. Then, it holds that *as functions of $\lambda$*,
\begin{align}
\lambda^{-1} &= \sum_{n \geqslant 0} \frac{\lambda^n \cdot \exp(-\lambda)}{n!} \cdot f(n) \\
\lambda^{-1} \cdot \exp(\lambda) &= \sum_{n \geqslant 0} \frac{f(n)}{n!} \cdot \lambda^n.
\end{align}
This would relate $f$ to the Taylor expansion of $\lambda \mapsto \lambda^{-1} \cdot \exp(\lambda)$ around $\lambda = 0$. But there is a hitch: this function does not admit a Taylor expansion at $0$! Indeed, the right-hand side is a power series which tends to $f(0)$ as $\lambda \to 0^+$, whereas the left-hand side diverges in this limit. As such, no such estimator can exist.
## Second Pass
When a single sample does not suffice, sometimes a fixed number of samples will be enough, i.e. perhaps one could draw $K > 1$ independent Poisson realisations and combine them somehow. However, one can show that this strategy will also not work here. Roughly speaking, since the sum of $K$ independent $\mathrm{Poisson}(\lambda)$ variables is a sufficient statistic with law $\mathrm{Poisson}(\lambda \cdot K)$, any such strategy could be used to devise an unbiased estimator of $(\lambda \cdot K)^{-1}$ from a single $\mathrm{Poisson}(\lambda \cdot K)$ realisation, and from our earlier discussion, we know that this is not possible.
The next thing to try is then to draw a random number of samples, perhaps adaptively. For example, one could try to design a procedure which is based around drawing independent Poisson variables until one of them returns a nonzero result. This would be equivalent to simulating a Geometric random variable with success parameter $p = 1 - \exp(-\lambda)$, and so the expected number of failed trials satisfies
\begin{align}
\mathbf{E} [ F ] &= p^{-1} - 1 \\
&= (1 - \exp(-\lambda))^{-1} - 1 \\
&= ( \exp(\lambda) - 1)^{-1}.
\end{align}
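This expected-failure count is also straightforward to confirm empirically. Here is a minimal simulation sketch (the choices of `lam` and `n_reps` are illustrative), counting zero draws before the first positive one:

```python
import numpy as np

# Simulate F = number of zero Poisson(lam) draws before the first positive
# draw, and compare E[F] against (exp(lam) - 1)^{-1}. Illustrative sketch.
rng = np.random.default_rng(1)
lam, n_reps = 0.5, 200_000

def count_failures(rng, lam):
    f = 0
    while rng.poisson(lam) == 0:
        f += 1
    return f

fs = np.array([count_failures(rng, lam) for _ in range(n_reps)])
print(fs.mean(), 1.0 / (np.exp(lam) - 1.0))  # the two should roughly agree
```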
As $\lambda \to 0^+$, this behaves like $\lambda^{-1}$, which suggests that we might be on the right path. I mean this in the following sense: if we are interested in estimating quantities of the form $g(\lambda)$, life is broadly easier if $g$ is smooth, continuous, etc. As such, we might hope to 'absorb the singularity' at $\lambda = 0$ somehow.
To this end, write
\begin{align}
\lambda^{-1} &= ( \exp(\lambda) - 1)^{-1} \cdot \frac{\exp(\lambda) - 1}{\lambda}.
\end{align}
Bearing this in mind, it becomes clear that if we are also able to form an independent unbiased estimator of $\frac{\exp(\lambda) - 1}{\lambda}$, then we are done. Alternatively, it would suffice to find an additive correction with expectation $\lambda^{-1} - (\exp(\lambda) - 1)^{-1}$, and this is the route which works out neatly below.
Now, suppose that we implement this procedure, generating $F$ realisations taking the value of $0$, and then generating $X_*$ with the law of $\mathrm{Poisson} (\lambda)$ conditioned to be positive, i.e. with mass function
\begin{align}
\mathbf{P}_\lambda (X_* = n) := (\exp(\lambda) - 1)^{-1} \cdot \frac{\lambda^n}{n!}, \qquad n \geqslant 1.
\end{align}
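One can sample from this zero-truncated law by simple rejection, redrawing until the Poisson draw is positive. A minimal sketch (parameter values are illustrative), checked against the known mean $\mathbf{E}[X_*] = \lambda \exp(\lambda) / (\exp(\lambda) - 1)$:

```python
import numpy as np

# Sample the zero-truncated Poisson law of X_* by rejection:
# redraw Poisson(lam) until the result is positive. Illustrative sketch.
rng = np.random.default_rng(2)
lam, n_reps = 1.0, 200_000

def sample_truncated(rng, lam):
    while True:
        x = rng.poisson(lam)
        if x > 0:
            return x

xs = np.array([sample_truncated(rng, lam) for _ in range(n_reps)])
# E[X_*] = lam * exp(lam) / (exp(lam) - 1)
print(xs.mean(), lam * np.exp(lam) / (np.exp(lam) - 1.0))
```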
As with standard Poisson variables, the expectations under this law of several expressions involving rising and falling factorials are tractable. In particular, one can compute explicitly that
\begin{align}
\mathbf{E}_{\lambda} \left[ \frac{1}{X_* + 1} \right]
&= (\exp(\lambda) - 1)^{-1} \cdot \sum_{n \geqslant 1} \frac{\lambda^n}{(n+1)!} \\
&= (\exp(\lambda) - 1)^{-1} \cdot \lambda^{-1} \cdot \sum_{m \geqslant 2} \frac{\lambda^m}{m!} \\
&= (\exp(\lambda) - 1)^{-1} \cdot \lambda^{-1} \cdot (\exp(\lambda) - 1 - \lambda) \\
&= \lambda^{-1} - (\exp(\lambda) - 1)^{-1}.
\end{align}
It thus follows that $F + \frac{1}{X_* + 1}$ has expectation $\lambda^{-1}$, and we are done. As an aside (which is not strictly needed for the unbiasedness), it can be checked that $X_*$ and $F$ are independent of one another.
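The full scheme fits in a few lines: keep drawing $\mathrm{Poisson}(\lambda)$ variables, count the zeros, and return $F + \frac{1}{X_* + 1}$ once a positive draw appears. A minimal sketch of this procedure (the values of `lam` and `n_reps` are illustrative), with a Monte Carlo check of unbiasedness:

```python
import numpy as np

# One realisation of the non-negative unbiased estimator F + 1/(X_* + 1)
# of 1/lam, using only Poisson(lam) draws. Illustrative sketch.
rng = np.random.default_rng(3)

def inverse_mean_estimate(rng, lam):
    failures = 0
    while True:
        x = rng.poisson(lam)
        if x > 0:
            return failures + 1.0 / (x + 1)
        failures += 1

lam, n_reps = 0.8, 200_000
ests = np.array([inverse_mean_estimate(rng, lam) for _ in range(n_reps)])
print(ests.mean(), 1.0 / lam)  # the two should roughly agree
```

Note that each run terminates after finitely many draws with probability one, and the output is non-negative by construction, as the puzzle requires.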
## Closing Comments
Some further refinements of this puzzle which might be worth considering are:
- Analyse the variance of this estimator, and describe a strategy for controlling the variance (either in absolute or relative terms) uniformly in $\lambda$.
- Given $K$ replications $\{ (F_k, X_{*, k}) \}_{k \in [K]}$ of this procedure, one can improve the estimate of $\lambda^{-1}$ by simple averaging. Can one do anything smarter, in terms of variance reduction?
- Similarly, for $K \in \mathbf{N}$, devise an estimator of $\lambda^{-K}$ which makes optimal use of the samples which are generated along the way.
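As a starting point for the first of these questions, one can tabulate the empirical variance of the estimator across a few values of $\lambda$; a quick sketch (parameter values illustrative):

```python
import numpy as np

# Empirical mean and variance of the estimator F + 1/(X_* + 1) at several
# values of lam, as a first look at how its variance depends on lam.
rng = np.random.default_rng(4)

def inverse_mean_estimate(rng, lam):
    failures = 0
    while True:
        x = rng.poisson(lam)
        if x > 0:
            return failures + 1.0 / (x + 1)
        failures += 1

for lam in [0.25, 1.0, 4.0]:
    ests = np.array([inverse_mean_estimate(rng, lam) for _ in range(50_000)])
    print(lam, ests.mean(), ests.var())
```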
Published on **HackMD**