# Connection between denoising and EBM

**Raphael Shu, 2020/3/5**

Setting: let the oracle sequence (which may consist of discrete tokens or continuous variables) be $x^*$. Noisy datapoints are produced as

$$\tilde x = x^* + \epsilon,$$

where $\epsilon$ is random noise. In the discrete case, the noise can be injected through a corruption process.

Consider a denoising model:

$$\min_S \|S(\tilde x) - x^*\|_2$$

This loss can equivalently be written as predicting the difference (the two forms are related by shifting the output of $S$ by $\tilde x$):

$$\min_S \|S(\tilde x) - (x^* - \tilde x)\|_2$$

Let $p(x)$ be the true probabilistic model that evaluates the probability of a sequence $x$. Then, up to a scaling factor determined by the noise level (for Gaussian noise, this is Tweedie's formula),

$$\nabla_{\tilde x} \log p(\tilde x) \approx x^* - \tilde x$$

This is the gradient of the log-probability (the *score*) at the point $\tilde x$. We can now see that the loss function becomes

$$\min_S \|S(\tilde x) - \nabla_{\tilde x} \log p(\tilde x)\|_2$$

Here, the output of $S(\cdot)$ is a set of vectors that matches the gradient. This loss is also equivalent to the following form, where the computation happens in the gradient domain and $E$ is an energy function with $p(\tilde x) \propto e^{-E(\tilde x)}$ (hence the minus sign):

$$\min_E \|{-\nabla_{\tilde x} E(\tilde x)} - \nabla_{\tilde x} \log p(\tilde x)\|_2$$

This loss function is known as the *score matching loss* in energy-based models. To summarize, we showed that training a denoising model is equivalent to training an energy-based model with score matching: updating a sequence by denoising is equivalent to updating it with the gradient from the EBM.
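The Gaussian case above can be verified numerically. The following is a minimal sketch (not part of the original note) that fits a linear denoiser $S(\tilde x) = w\,\tilde x$ on toy 1-D data with $x^* \sim \mathcal{N}(0, 1)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and checks that the learned denoising step $S(\tilde x) - \tilde x$ equals $\sigma^2$ times the analytic score $\nabla_{\tilde x} \log p(\tilde x)$; all variable names here are illustrative.

```python
import numpy as np

# Toy data: oracle x* ~ N(0, 1), noisy x~ = x* + sigma * N(0, 1).
# Then x~ ~ N(0, 1 + sigma^2), so the analytic score is
#   grad log p(x~) = -x~ / (1 + sigma^2),
# and the optimal denoiser is S(x~) = x~ / (1 + sigma^2).
rng = np.random.default_rng(0)
sigma = 0.5
n = 200_000

x_star = rng.standard_normal(n)                     # oracle samples
x_tilde = x_star + sigma * rng.standard_normal(n)   # noisy samples

# Fit a linear denoiser S(x~) = w * x~ by least squares on the
# denoising objective ||S(x~) - x*||^2 (closed form for one weight).
w = np.dot(x_tilde, x_star) / np.dot(x_tilde, x_tilde)
w_opt = 1.0 / (1.0 + sigma ** 2)
print(f"fitted w = {w:.4f}, optimal w = {w_opt:.4f}")

# The denoising step S(x~) - x~ matches sigma^2 * score(x~),
# i.e. denoising moves x~ along the gradient of log p.
x = np.linspace(-2.0, 2.0, 5)
denoise_step = w * x - x                  # S(x~) - x~
score = -x / (1.0 + sigma ** 2)           # grad log p(x~)
print(np.max(np.abs(denoise_step - sigma ** 2 * score)))
```

With enough samples the fitted $w$ approaches $1/(1+\sigma^2)$ and the final printed deviation is small, illustrating that the trained denoiser's update direction is the EBM gradient scaled by the noise variance.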