owned this note
owned this note
Published
Linked with GitHub
###### tags: `unnormalised models` `monte carlo` `one-offs`
# From the Method of Moments to Estimation of Normalising Constants
**Overview**: The goal of this note is to describe the method of moments, outline some generalisations, and then apply these tools in the context of estimation of normalising constants.
## Maximum Likelihood Estimation in Statistical Models
Consider the task of estimating the parameters of a model from which data is observed, i.e. there is
* some parameter space $\Theta$,
* some observation space $\mathcal{X}$,
* a collection of probability measures on $\mathcal{X}$ which is indexed by $\theta \in \Theta$, $\{ P_\theta \}_{\theta \in \Theta}$, and
* some observed data $x_1, \ldots, x_N$ which were drawn i.i.d. from $P_{\theta_*}$ for some unknown parameter $\theta_* \in \Theta$
and the statistician is tasked with forming an estimator $\hat{\theta} = \hat{\theta} (x_1, \ldots, x_N) \approx \theta_*$.
Classical statistical theory recommends solving this problem by *maximum likelihood estimation*, i.e. solving the problem
\begin{align}
\max_{\theta \in \Theta} \frac{1}{N} \sum_{i = 1}^N \log p_\theta (x_i)
\end{align}
where $p_\theta$ is the density of $P_\theta$ with respect to some dominating measure $P$. There is much theory to support the use of this approach for fixed-dimensional problems as $N$ grows to infinity.
## Unnormalised Models and Alternative Estimators
However, it is not always the case that this problem is straightforward to solve. For example, there is a rich class of methods for which $p_\theta$ is only known in the form.
\begin{align}
p_\theta (x) = \frac{f (x, \theta)}{Z(\theta)}
\end{align}
where $f$ can be evaluated in closed form, but $Z$ cannot. Some common examples of this include Markov Random Field models (as used in Spatial Statistics and (older) Image Processing) and Exponential Random Graph models (as used in Network Analysis).
In this setting, the problem of maximum likelihood estimation becomes significantly more challenging.
A common solution is to replace the MLE problem by either
* solving another minimisation problem, i.e.
\begin{align}
\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i = 1}^N \ell (x_i, \theta)
\end{align}
for some other function $\ell$ which satisfies
\begin{align}
\{ \theta \} = \arg\min_{\bar{\theta} \in \Theta} \mathbf{E}_\theta \left[ \ell (X, \bar{\theta} )\right].
\end{align}
Examples of this include Score Matching, KSD Minimisation, Maximum Composite Likelihood, and others.
* solving a root-finding problem, i.e. solving
\begin{align}
\frac{1}{N} \sum_{i = 1}^N s (x_i, \theta) = 0
\end{align}
for some other 'score-like' function $s$ such that
\begin{align}
\{ \theta \} = \{ \bar{\theta} \in \Theta: \mathbf{E}_\theta \left[ s (X, \bar{\theta} )\right] = 0\}.
\end{align}
Examples of this include quasi-likelihood, the (generalised) Method of Moments, and others.
## The Generalised Method of Moments
Suppose that for the family of distributions in question, there is a vector-valued function $h: \mathcal{X} \to \mathbf{R}^H$ such that $\mathbf{E}_\theta [ h(X) ] = \eta ( \theta )$ is known in closed form. The standard Method of Moments estimator is formed by
1. Computing $\hat{\eta} = \frac{1}{N} \sum_{i = 1}^N h \left( X^i \right)$.
2. Computing $\hat{\theta}$ by solving $\eta(\theta) = \hat{\eta}$.
Here, we have tacitly assumed that Step 2 admits a unique solution which we can find efficiently.
The Generalised Method of Moments instead supposes that we know a vector-valued function $g: \mathcal{X} \times \Theta \to \mathbf{R}^G$ such that for all $\theta \in \Theta$, it holds that
\begin{align}
\mathbf{E}_\theta \left[ g (X, \theta )\right] = 0.
\end{align}
Note that the standard Method of Moments is indeed a special case of this formulation where $g(X, \theta) = h(X) - \eta(\theta)$. The estimator is then formed in two steps by
1. Form the function $\hat{\gamma} (\theta) = \frac{1}{N} \sum_{i = 1}^N g \left( X^i, \theta \right)$.
2. Computing $\hat{\theta}$ by solving $\hat{\gamma}(\theta) = 0$.
Sometimes the second step is replaced by instead minimising $|\hat{\gamma}(\theta)|$, where $|\cdot|$ is some norm-like function.
## Estimating Normalising Constants
In a nice work by [Liu and coauthors](http://auai.org/uai2015/proceedings/papers/120.pdf) (links to pdf), a question is raised as to whether generic algorithms for parameter estimation in unnormalised models can be systematically converted into algorithms for estimation of normalising constants. In this section, I discuss a sketchy link between GMoM and one approach for normalising constant estimation.
To begin with, recall the definition
\begin{align}
Z (\theta) = \int_\mathcal{X} P(dx) f (x, \theta).
\end{align}
Writing $f(x, \theta) = \exp( - V(x, \theta))$, a classical variational principle (which I have seen variously attributed to Gibbs and Donsker-Varadhan) asserts that
\begin{align}
-\log Z (\theta) = \inf_{Q \in \mathcal{P} (\mathcal{X}) } \left\{ \mathbf{E}_Q \left[ V (x, \theta )\right] + \text{KL} (Q, P) \right\},
\end{align}
where $\mathcal{P} (\mathcal{X})$ is the space of probability measures over $\mathcal{X}$, and $\text{KL} (Q, P)$ is the relative entropy, or *Kullback-Leibler divergence*. Moreover, this infimum is saturated at $Q(dx) = P_\theta (dx)$.
Using this knowledge, we can constrain the variational problem to only search over $Q$ which satisfy the known moment constraints, i.e. $\mathbf{E}_Q \left[ g (x, \theta ) \right] = \mathbf{0}_G$. We thus write
\begin{align}
-\log Z (\theta) &= \inf_{Q \in \mathcal{P} (\mathcal{X}) } \left\{ \mathbf{E}_Q \left[ V (x, \theta )\right] + \text{KL} (Q, P) \right\} \quad \text{such that } \mathbf{E}_Q \left[ g (x, \theta ) \right] = \mathbf{0}_G\\
&= \inf_{Q \in \mathcal{P} (\mathcal{X}) } \left\{ \sup_{\gamma \in \mathbf{R}^G} \left\{\mathbf{E}_Q \left[ V (x, \theta ) + \langle \gamma, g (x, \theta) \rangle \right] + \text{KL} (Q, P) \right\} \right\}.
\end{align}
Slightly carelessly interchanging the order of $\inf$ and $\sup$, and minimising over $Q$ in closed for, we obtain the representation
\begin{align}
-\log Z (\theta) &=\sup_{\gamma \in \mathbf{R}^G} - \log \zeta ( \theta, \gamma) \\
\text{where} \quad \zeta ( \theta, \gamma) &= \int_\mathcal{X} P(dx) f (x, \theta) \exp( - \langle \gamma, g (x, \theta) \rangle ).
\end{align}
That is, the true normalising constant can be recovered as an extremal normalising constant over an expanded class of probability measures $\{ P_\theta^\gamma\}_{\theta \in \Theta}^{\gamma \in \mathbf{R}^G}$, which are derived from the original $P_\theta$ by exponentially tilting the measure according to the statistics $g ( \cdot, \theta)$.
## Post-Derivation Recap
With hindsight, this representation could also have been derived through an application of Jensen's inequality:
\begin{align}
\zeta ( \theta, \gamma) &= Z(\theta) \cdot \mathbf{E}_\theta \left[ \exp( - \langle \gamma, g (x, \theta) \rangle ) \right] \\
&\geqslant Z(\theta) \cdot \exp \left( -\mathbf{E}_\theta \left[ \langle \gamma, g (x, \theta) \rangle \right] \right) \\
&= Z(\theta)
\end{align}
and noting that the inequality is tight at $\gamma = 0$.
The algorithmic implications of this representation are not yet clear. If it is the case that this new family of distributions $P_\theta^\gamma$ is easier to work with (e.g. fast sampling, tractable approximation), then this could provide some avenues to new estimators.
## Direct Use of Control Variates
Another remark is that $g$ could be used as control variates more directly. The condition $\mathbf{E}_\theta \left[ g (X, \theta )\right] = 0$ implies that $\mathbf{E}_P \left[ f(x, \theta) \cdot g (X, \theta )\right] = 0$ which furnishes us with control variates under $P$. If we can approximate
\begin{align}
f (x, \theta) = \phi (\theta) + \langle \gamma (\theta), g(x, \theta) \rangle \cdot f(x, \theta) + R(x, \theta),
\end{align}
with $R$ a small remainder term, it then follows that
\begin{align}
Z (\theta) &= \mathbf{E}_P \left[ f(x, \theta) \right] \\
&= \mathbf{E}_P \left[ \phi (\theta) + \langle \gamma (\theta), g(x, \theta) \rangle \cdot f(x, \theta) + R(x, \theta) \right] \\
&= \phi(\theta) + \mathbf{E}_P \left[ R(x, \theta) \right],
\end{align}
where the hope is that $\mathbf{E}_P \left[ R(x, \theta) \right]$ can be estimated more reliably than $\mathbf{E}_P \left[ f(x, \theta) \right]$. The extent to which this is possible will be governed by the extent to which $f$ is well-approximated in the span of the $g$. This may not be prevalent in practice, as we might expect $f$ to fluctuate on exponential scales (with respect to e.g. the dimension of the space $\mathcal{X}$), whereas $g$ will fluctuate on much smaller amplitudes.
## Conclusion
Despite the calculations presented herein, I would still say that there is not yet a convincing case for the capacity of the method of moments to improve algorithms for estimation of normalising constants. The first derivation offers some flexibility, but essentially replaces one sampling task by another. The second derivation is transparent and may work well in some simple cases, but heuristic arguments suggest that the variance reduction which it can provide may only be modest. I would be excited to see other developments in this direction which are more promising from the practical perspective.