# Gradient-Guided Denoising Latent-Space Diffusion
###### tags: research_idea
Normal diffusion starts with a datapoint $x = z_0$ and converts it into a noise distribution $p(z_T)$ by gradually adding noise over $T$ steps, where usually $p(z_T)=\mathcal{N}(0, I)$. To generate data, we sample $z_T \sim p(z_T)$ and gradually denoise it by applying a learned $q_\phi(z_{t-1}\mid z_t)$. To train this, we use supervised training with a score-matching loss that takes the input $z_t$ and the target $z_{t-1}$ (or equivalently the noise $\epsilon$ used to create $z_t$ from $x$, or the datapoint $x$ itself).
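For reference, a minimal sketch of that training loss in the usual $\epsilon$-prediction form (PyTorch-style; `denoiser` and `alpha_bar` are placeholder names, not anything fixed by this note):

```python
import torch

def ddpm_loss(denoiser, x, alpha_bar):
    """Standard epsilon-prediction diffusion loss (illustrative names)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x.shape[0],), device=x.device)   # sample a timestep per example
    eps = torch.randn_like(x)                                  # noise used to corrupt z_0 = x
    a = alpha_bar[t].view(-1, *[1] * (x.dim() - 1))            # broadcast schedule to x's shape
    z_t = a.sqrt() * x + (1 - a).sqrt() * eps                  # forward (noising) process
    eps_hat = denoiser(z_t, t)                                 # network predicts the noise
    return ((eps_hat - eps) ** 2).mean()                       # score matching up to scaling
```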
Here, we're interested in diffusion over the parameters $\theta$ of an INR $f_\theta: u \mapsto x$, where $u$ is a low-dimensional coordinate. Imagine that $p(z_T)$ (the marginal at the final timestep) is some learned distribution and $\theta = z_0$. We parametrize $z_{t-1}$ as a gradient update of $z_t$ to better reconstruct $x$. We focus on low $T$ values $\in\{2,...,5\}$ to make this tractable (we'll need to do up to $T$ gradient steps). This should work, because it will also update the parameters of $p(z_T)$, which will then correspond to [meta-training NeRF for good initialisation](https://arxiv.org/abs/2012.02189).
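A rough sketch of how that trajectory could be built, assuming a PyTorch INR decoder `inr_forward(theta, coords)` and a learnable initialisation `theta_init` standing in for $p(z_T)$ (names and the inner learning rate are illustrative):

```python
import torch

def build_trajectory(theta_init, inr_forward, coords, x, T=4, inner_lr=1e-2):
    # theta_init is assumed to be a learnable tensor (e.g. nn.Parameter), so the
    # trajectory stays differentiable w.r.t. the initialisation, MAML-style.
    z = theta_init                                # z_T ~ p(z_T)
    trajectory = [z]
    for _ in range(T):
        recon = inr_forward(z, coords)            # f_theta(u) with theta = z_t
        loss = ((recon - x) ** 2).mean()
        grad = torch.autograd.grad(loss, z, create_graph=True)[0]
        z = z - inner_lr * grad                   # z_{t-1} as a gradient update of z_t
        trajectory.append(z)
    return trajectory                             # [z_T, z_{T-1}, ..., z_0]
```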
This gives us inputs to evaluate the score-matching loss to train the denoising distribution $q(z_{t-1}|z_t)$, which is parametrized as a standard UNet (parameters of the INR can be arranged as a multi-channel feature map, so that's fine). We can also explore other INR parametrizations.
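One illustrative (and certainly not the only) way to arrange flat INR parameters as a multi-channel feature map for a UNet: one channel per weight tensor, zero-padded to a fixed spatial size.

```python
import torch

def params_to_feature_map(param_list, size=64):
    """Pack each INR weight tensor into its own zero-padded channel (illustrative layout)."""
    channels = []
    for w in param_list:
        flat = w.reshape(-1)                       # assumes flat.numel() <= size * size
        padded = torch.zeros(size * size, device=w.device)
        padded[: flat.numel()] = flat
        channels.append(padded.view(size, size))
    return torch.stack(channels)                   # (num_weight_tensors, size, size)
```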
Now the important bit: whichever timestep $t$ we sample to train the diffusion model, we denoise $z_t$ one more time (with the diffusion model, not the gradient update) to get $\tilde{z}_{t-1}$.
We then set $\theta=\tilde{z}_{t-1}$ and reconstruct the data $x$. This allows us to compute the reconstruction loss and to use it to train both $p(z_T)$ and the diffusion model.
The hope is that the diffusion model will learn to approximate the gradient updates, thus learning a generative model over INR parameters; reconstructing the data with parameters produced by the denoiser is important because it further guides the model to produce good parameters (beyond just imitating the gradient update).
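Putting the last few paragraphs together, a hedged sketch of one training step; `denoiser`, `inr_forward`, the detached target, and the loss weighting are all placeholder choices, not settled design:

```python
import random
import torch

def training_step(trajectory, denoiser, inr_forward, coords, x):
    T = len(trajectory) - 1                         # trajectory = [z_T, ..., z_0]
    t = random.randint(1, T)
    z_t = trajectory[T - t]                         # input to the denoiser
    z_target = trajectory[T - t + 1]                # z_{t-1} produced by the gradient update
    z_pred = denoiser(z_t, t)                       # one denoising step: z~_{t-1}
    # matching term; detaching the target is one possible choice, not a given
    match_loss = ((z_pred - z_target.detach()) ** 2).mean()
    recon = inr_forward(z_pred, coords)             # decode with theta = z~_{t-1}
    recon_loss = ((recon - x) ** 2).mean()          # guides the denoiser (and p(z_T)) beyond imitation
    return match_loss + recon_loss
```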
This is very related to functa, but the differences are:
- The diffusion model is trained jointly with the INR parameters.
- The diffusion model is used to actually improve the INR parameters and potentially share structure between different datapoints (this can also be done without a learnable $p(z_T)$, e.g. with a standard normal).
We can make the diffusion model conditional on side-info (text, or a different view of the scene for view-conditional NeRF).
# Notes to self
## Conditioning the denoiser
In the above, the denoiser should be conditioned on the timestep $t$. This conditioning usually rescales $[0, T]$ into $[0, 1]$ and applies a frequency encoding. As in standard diffusion, we need to decide how many denoising steps we're going to do (choose $T$). Differently from standard diffusion, $T$ also determines the maximum number of gradient steps we do, which affects what the "noiseless" data is. Perhaps that's fine, because we'll backprop through this optimisation, similar to MAML.
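For concreteness, the usual form of such timestep conditioning looks roughly like this (a sketch; the number of frequencies is arbitrary):

```python
import torch

def timestep_embedding(t, T, num_freqs=8):
    """Rescale t from [0, T] to [0, 1] and apply a sinusoidal / Fourier encoding."""
    s = t.float() / T                                       # rescale to [0, 1]
    freqs = 2.0 ** torch.arange(num_freqs, device=t.device) * torch.pi
    angles = s[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, 2 * num_freqs)
```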
## Using task loss to train the denoiser
In MAML, once we get to the most-optimized parameters, we evaluate the task-specific loss and backprop through the inner-loop optimization to optimize the base distribution. Here we do something similar, but we first apply the denoiser (just once) before evaluating the task-specific loss. This will encourage the denoiser to produce the final (best) parameters, which will hopefully be better than those from gradient descent alone (especially if we randomize the number of SGD steps we do beforehand). In that case, we should condition the denoiser on some special value of $t$ that signifies that we are going to $t=0$ (i.e. what should be considered the best possible parameters). We could use any $t\leq T-k$ after $k$ gradient updates; perhaps we can randomize this value to encourage generalization. Not sure.
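A rough sketch of that outer step, assuming a placeholder `T_FINAL` timestep value that signals "produce the best parameters"; all names here are mine, not fixed choices:

```python
import torch

T_FINAL = 0  # special timestep meaning "go to the final/best parameters"

def outer_step(theta_init, denoiser, inr_forward, coords, x, k=3, inner_lr=1e-2):
    z = theta_init
    for _ in range(k):                               # inner loop: k gradient updates
        loss = ((inr_forward(z, coords) - x) ** 2).mean()
        grad = torch.autograd.grad(loss, z, create_graph=True)[0]
        z = z - inner_lr * grad
    z_final = denoiser(z, T_FINAL)                   # apply the denoiser once before the task loss
    task_loss = ((inr_forward(z_final, coords) - x) ** 2).mean()
    task_loss.backward()                             # trains denoiser, theta_init and the inner loop jointly
    return task_loss
```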
## How to generalize beyond the number of gradient updates used in training?
In training we'll use up to $k$ gradient updates to produce the "trained" parameter values. At test time, the optimisation will be done by a diffusion model trained over up to $k$ steps (plus the additional step described above). Does this mean that we will only be able to run up to $k+1$ denoising steps? That may be a bit disappointing. How do we generalize to higher step counts? It is possible that Fourier encoding of the step number will allow a bit of generalization, but I wouldn't count on it.
## SGD-based Optimisation is Denoising Diffusion!
One can see optimization with any SGD-style algorithm as a denoising diffusion process where $p(z_T)$ is the distribution of initial model weights (e.g. a variance-scaled normal) and $p(z_0)$ is the distribution over converged model parameters. Each update step denoises $z_t$ into $z_{t-1}$. If the updates are stochastic (as in SGD, due to data subsampling), then this denoising process is also stochastic. It is similar to DDIM, with the difference that in DDIM the injected noise is Gaussian, while here the noise need not be (and probably is not) Gaussian. I wonder if one could (or should?) investigate this more formally.
## Going Beyond T Gradient Steps by Bootstrapping with Diffusion
Gradient-guided latent diffusion is limited by the number of gradient steps we take. Models (or representations) might need thousands of gradient iterations to converge, but for efficiency reasons we won't be able to do more than 3-5, maybe 10, gradient steps. The good news is that the diffusion model will learn to generate parameters that approximate the result of those $T$ updates. So instead of doing a cold start for each training example, we can start from parameters generated by the diffusion model (conditioned on that training example, say). Generating parameters with diffusion is expensive, so we wouldn't do it every time. Instead, we can generate starting parameters for all examples in the training set once per epoch, or once every few epochs, and just save them (we won't backprop through that generation). This is a bit similar to truncated backprop-through-time.
The idea is that after $K$ bootstrapping cycles we'll effectively be starting at $KT$ gradient updates instead of just $T$ updates.
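A sketch of what one bootstrapping refresh could look like, with `generate_params` standing in for conditional parameter generation by the diffusion model, `cache` a plain dict, and the dataset assumed to yield `(x, coords)` pairs (all placeholders):

```python
import torch

def refresh_parameter_cache(dataset, generate_params, cache):
    """Once per epoch (or every few epochs), regenerate and store starting parameters."""
    with torch.no_grad():                           # we don't backprop through this generation
        for idx, (x, coords) in enumerate(dataset):
            cache[idx] = generate_params(x)         # diffusion model conditioned on the example
```

During the epoch, each example then starts its $T$ inner gradient steps from `cache[idx]` instead of the cold initialisation, which is what should give the roughly $KT$ effective updates after $K$ refresh cycles.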
I'm not sure how to parametrize time in this case. Do we use $t \in [0, 1]$ in every epoch? If so, the meaning of time will change with every epoch. Or maybe not? The $p(z_0)$ will change, but $p(z_T)$ may remain constant. If so, this will prevent the denoiser from forgetting early $t$ values (otherwise it would need to learn to contract them, as the old $[0, 1]$ interval would have to be squeezed into $[0, 1-\Delta]$).
Interesting.
But... instead of generating embeddings with diffusion after every epoch, couldn't I just maintain gradient-optimized embeddings from previous iterations and update them at every iteration? Wouldn't that be better? The embeddings would be stale in both cases, because the decoder would be updated (e.g. mid-epoch) while the embeddings would be from the beginning of the epoch. So it's unclear if this idea is worth anything.