###### tags: `unnormalised models` `one-offs`
# Squeezing More out of Score Matching
**Overview**: The following exposition describes a methodology which may improve the statistical properties of score matching-type estimators, though at a potentially substantial computational cost.
## Introduction
For the purposes of this post, score matching is a methodology for parameter estimation in unnormalised statistical models, that is, models for which the likelihood can be written as the ratio of a tractable function $f(x, \theta)$ and an intractable normalising constant $Z(\theta)$, i.e.
\begin{align}
P(x|\theta) = f(x,\theta)/Z(\theta).
\end{align}
Instead of maximising the likelihood of the data as a function of the parameter $\theta$, score matching presents an alternative loss function
\begin{align}
J(\theta) = \mathbf{E}_q \left[ \frac{1}{2} \left| \nabla_x \log q(x) - \nabla_x \log P(x | \theta) \right|^2 \right],
\end{align}
where $q$ represents the law of the data, and seeks to minimise this. Using an integration by parts trick, this is equivalent to minimising
\begin{align}
J(\theta) &= \mathbf{E}_q \left[ \frac{1}{2} \left| \nabla_x \log q(x) - \nabla_x \log P(x | \theta) \right|^2 \right] \\
&= \mathbf{E}_q \left[ \frac{1}{2} \left| \nabla_x \log q(x) \right|^2 \right] + \mathbf{E}_q \left[ \Delta_x \log P(x | \theta) \right] + \mathbf{E}_q \left[ \frac{1}{2} \left| \nabla_x \log P(x | \theta) \right|^2 \right],
\end{align}
which can be treated with conventional optimisation methods, replacing the expectations over $q$ with the empirical measure of the data; note that the first term does not depend on $\theta$, and so can be discarded. The cost of optimisation is understood to generically scale linearly with the size of the dataset, and quadratically with the dimension of the ambient space, though in any specific model, it may be possible to do better.
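To make this concrete, the following is a minimal sketch (my own, not taken from any particular library) of the empirical score matching objective in Python with JAX, for a hypothetical unnormalised model `log_f`, here a Gaussian with unknown mean and log-scale. The $\theta$-independent first term is dropped, and the Laplacian is obtained by tracing the Jacobian of the score.
```python
# A minimal sketch of the integrated-by-parts score matching objective, using JAX
# for the derivatives. `log_f` is a hypothetical unnormalised log-density log f(x, theta).
import jax
import jax.numpy as jnp

def log_f(x, theta):
    # Unnormalised log-density: -|x - mu|^2 / (2 sigma^2), with theta = (mu, log_sigma).
    mu, log_sigma = theta[:-1], theta[-1]
    return -0.5 * jnp.sum((x - mu) ** 2) / jnp.exp(2.0 * log_sigma)

def sm_loss(theta, data):
    # Empirical J(theta), up to the theta-independent E_q[|grad_x log q|^2 / 2] term.
    score = jax.grad(log_f, argnums=0)                       # x -> grad_x log f(x, theta)
    def per_point(x):
        s = score(x, theta)
        laplacian = jnp.trace(jax.jacfwd(score)(x, theta))   # Delta_x log f(x, theta)
        return laplacian + 0.5 * jnp.sum(s ** 2)
    return jnp.mean(jax.vmap(per_point)(data))

# Usage: fit theta by gradient descent on the empirical objective.
key = jax.random.PRNGKey(0)
data = 1.5 + 0.7 * jax.random.normal(key, (500, 2))
theta = jnp.zeros(3)
for _ in range(200):
    theta = theta - 0.1 * jax.grad(sm_loss)(theta, data)
```
For models in which forming the full Hessian of $\log f$ is too expensive, the trace can instead be estimated stochastically (e.g. with Hutchinson-style probes), which is one route around the quadratic dependence on dimension mentioned above.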
## Statistical Inefficiency
In contrast to the method of maximum likelihood, the estimators derived from score matching are typically not _statistically efficient_, in the sense that their asymptotic variance is larger than that of the maximum likelihood estimator. As such, confidence sets will have to be larger, and we lose some precision in our estimates.
However, there are important links between maximum likelihood and score matching which may suggest a route back to efficiency.
## Score Matching as Approximate MLE
Define $q_t$ to be the law of the data upon convolution with the Gaussian measure $\gamma_t = \mathcal{N} \left(0, t \cdot I \right)$. Similarly, let $P_t^\theta$ denote the convolution of $P^\theta$ with $\gamma_t$. By de Bruijn's identity, one can compute that
\begin{align}
\frac{\mathrm d}{\mathrm d t} \mathrm{KL} \left( q_t, P_t^\theta \right) &= -\frac{1}{2} F \left( q_t, P_t^\theta \right) \\
\text{where} \quad F(q, p) &= \mathbf{E}_q \left[ \left| \nabla_x \log q(x) - \nabla_x \log p (x) \right|^2 \right].
\end{align}
Integrating this identity from $t = 0$ to $t = \infty$, and noting that $\mathrm{KL} \left( q_t, P_t^\theta \right) \to 0$ as $t \to \infty$ under suitable regularity conditions, it then holds that
\begin{align}
\mathrm{KL} \left( q, P^\theta \right) &= \int_0^\infty \mathbf{E}_{q_t} \left[ \frac{1}{2} \left| \nabla_x \log q_t (x) - \nabla_x \log P_t^\theta (x) \right|^2 \right] \, \mathrm{d} t.
\end{align}
So, in principle, we should be able to approximate the KL objective, which underpins the method of maximum likelihood. Additionally, we should be able to make this inner expectation tractable, using the same integration by parts trick, i.e.
\begin{align}
\mathbf{E}_{q_t} \left[ \frac{1}{2} \left| \nabla_x \log q_t (x) - \nabla_x \log P_t^\theta (x) \right|^2 \right] = \mathbf{E}_{q_t} \left[ \frac{1}{2} \left| \nabla_x \log q_t (x) \right|^2 \right] + \mathbf{E}_{q_t} \left[ \Delta_x \log P_t^\theta (x) \right] + \mathbf{E}_{q_t} \left[ \frac{1}{2} \left| \nabla_x \log P_t^\theta (x) \right|^2 \right].
\end{align}
So, what stands in our way?
1. We do not have samples from $q_t$ available to us.
2. We cannot compute $P_t^\theta (x)$, let alone its derivatives.
## Resolutions
It turns out that the first of these is fairly straightforward to get around: if we take our observed data and add independent $\mathcal{N} \left( 0, t \cdot I \right)$ noise to it, then the result is exactly a sample from $q_t$ (see the sketch below). The other terms will require more thought.
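As a concrete (if trivial) sketch, reusing the JAX setup from earlier: a draw from $q_t$ is just an observed data point plus $\mathcal{N} \left( 0, t \cdot I \right)$ noise.
```python
import jax
import jax.numpy as jnp

def sample_q_t(key, data, t):
    # One noisy copy of each observation: x + sqrt(t) * xi, with xi ~ N(0, I).
    return data + jnp.sqrt(t) * jax.random.normal(key, data.shape)
```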
First, consider this expression for the 'smoothed' likelihood
\begin{align}
P_t^\theta (x) = \int P^\theta \left( \mathrm d z \right) \cdot \mathcal{N} \left( x \vert z, t\cdot I \right).
\end{align}
Differentiating under the integral sign yields the following expressions, familiar to those who have worked with latent variable models (here $z \sim P^\theta$ and $\xi \sim \mathcal{N} \left( 0, I \right)$ are independent):
\begin{align}
\nabla_x \log P_t^\theta (x) &= \int P_t^\theta \left( \mathrm d z \vert x \right) \cdot \left( \frac{z - x}{t} \right) \\
&= \mathbf{E}_t^\theta \left[ \frac{z - x}{t} \vert z + \sqrt{t} \xi = x\right] \\
\nabla_x^2 \log P_t^\theta (x) &= \text{Cov}_t^\theta \left[ \frac{z - x}{t} \vert z + \sqrt{t} \xi = x\right] - t^{-1}\cdot I .
\end{align}
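As a quick sanity check (mine, not part of the original argument), take $P^\theta = \mathcal{N}(\mu, \Sigma)$, so that $P_t^\theta = \mathcal{N}(\mu, \Sigma + t \cdot I)$ and the conditional law of $z$ given $x$ is Gaussian with covariance $\left( \Sigma^{-1} + t^{-1} \cdot I \right)^{-1}$. The identities above then recover the familiar expressions
\begin{align}
\nabla_x \log P_t^\theta (x) &= -\left( \Sigma + t \cdot I \right)^{-1} (x - \mu), \\
\nabla_x^2 \log P_t^\theta (x) &= \frac{1}{t^2} \left( \Sigma^{-1} + t^{-1} \cdot I \right)^{-1} - t^{-1}\cdot I = -\left( \Sigma + t \cdot I \right)^{-1},
\end{align}
with the final equality following from the Woodbury identity.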
Thus, provided that we can approximate integrals with respect to the conditional distribution
\begin{align}
P_t^\theta (\mathrm d z \vert x) &= \frac{P^\theta ( \mathrm d z) \cdot \mathcal{N} \left( x \vert z, t\cdot I \right)}{P_t^\theta (x)},
\end{align}
we can approximate the integrand, and then try to optimise that.
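As one crude but concrete approach, note that since $\mathcal{N} \left( x \vert z, t\cdot I \right) = \mathcal{N} \left( z \vert x, t\cdot I \right)$, the conditional density of $z$ given $x$ is proportional to $f(z, \theta) \cdot \mathcal{N} \left( z \vert x, t\cdot I \right)$, and so self-normalised importance sampling with proposal $\mathcal{N} \left( x, t \cdot I \right)$ has weights proportional to $f(z, \theta)$. The sketch below (again my own, reusing the toy `log_f` from earlier) estimates the smoothed score and Hessian at a single point in this way.
```python
import jax
import jax.numpy as jnp

def smoothed_score_and_hessian(key, x, theta, t, log_f, n_samples=1000):
    # Proposal: z ~ N(x, t I). The Gaussian factor cancels between target and proposal,
    # so the self-normalised importance weights are proportional to f(z, theta).
    d = x.shape[0]
    z = x + jnp.sqrt(t) * jax.random.normal(key, (n_samples, d))
    log_w = jax.vmap(lambda zi: log_f(zi, theta))(z)
    w = jax.nn.softmax(log_w)                                     # self-normalised weights
    u = (z - x) / t                                               # the statistic (z - x) / t
    mean_u = jnp.sum(w[:, None] * u, axis=0)                      # approx. grad_x log P_t^theta(x)
    cov_u = jnp.einsum('n,ni,nj->ij', w, u - mean_u, u - mean_u)  # approx. Cov[(z - x)/t | x]
    hess = cov_u - jnp.eye(d) / t                                 # approx. Hessian of log P_t^theta at x
    return mean_u, hess
```
The Laplacian appearing in the smoothed objective is then `jnp.trace(hess)`. When this proposal is poor (e.g. for large $t$), one would instead resort to MCMC on the conditional, as discussed in the conclusion below.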
## Conclusion
The above derivations demonstrate that, in principle, the ideas which underlie score matching can enable approximation of the full MLE objective, even for unnormalised models. There are nevertheless a few remaining issues:
- While optimisation of the true maximum likelihood objective can give rise to statistically-efficient estimators, we are in fact optimising another objective, obtained via the integration by parts trick. This could prevent our estimator from achieving the desired efficiency properties.
- The integral which defines the maximum likelihood objective is over an unbounded domain, and so we need to decide how to discretise the variable $t$, either deterministically or stochastically.
- Computing integrals under the conditional distributions $P_t^\theta$ may be challenging. For difficult models, we will typically have to make use of Monte Carlo methods for this task. As $t$ grows, the conditional sampling task is tantamount to sampling from $P^\theta$, which is often difficult, if not computationally hard, in these models.
These are just the issues attached to computing the estimator. There is likely to be additional work involved beyond that stage, e.g. computing confidence sets, which may require yet further ingenuity. It is thus difficult to recommend this strategy at present, though it may still prove to be of interest in some cases.