###### tags: `numerical analysis` `expository` `one-offs`
# Multigrid Computation
**Overview**: The goal of this note is to describe some aspects of the multigrid approach to computation, as exemplified by examples of i) solving linear systems, ii) minimisation of nonlinear functions, and iii) sampling from probability measures. The presentation draws heavily on [this article](https://journals.aps.org/prd/abstract/10.1103/PhysRevD.40.2035) of Goodman-Sokal.
## Setup
Suppose that we are interested in solving some numerical problem on a high-dimensional space. For example, we might care about
* Solving a linear system $A_L x_L = y_L$, where $x_L \in \mathcal{X}_L$, $y_L \in \mathcal{Y}_L$ are high-dimensional vectors.
* Minimising a function $F_L : \mathcal{X}_L \to \mathbf{R}$, where $x_L \in \mathcal{X}_L$ lives in a high-dimensional space.
* Sampling from a probability measure $\pi_L$ which is supported on a high-dimensional space $\mathcal{X}_L$.
It is useful to think of $x_L$ as representing the spatial discretisation of some continuum field, e.g. the solution to a PDE, where $L$ denotes the level of detail in the discretisation.
In such settings, we are often furnished with a hierarchy of discretisations, i.e. a collection of spaces $\{ \mathcal{X}_\ell \}_{\ell = 0}^L$ (and $\{ \mathcal{Y}_\ell \}_{\ell = 0}^L$ in the case of linear systems), which relate to one another in a relatively simple way. One way of formalising this concept is to posit the existence of 'prolongation' maps $P_{\ell-1, \ell} : \mathcal{X}_{\ell - 1} \to \mathcal{X}_\ell$ which behave roughly like inclusion maps. Moreover, in the case of linear systems, it is useful to introduce also the dual notion of 'restriction' maps $R_{\ell, \ell - 1} : \mathcal{Y}_{\ell} \to \mathcal{Y}_{\ell - 1}$, which behave roughly like projection maps (though this time in $\mathcal{Y}$-space).
One might initially consider ignoring these additional approximations: we are interested in solving the level-$L$ problem, so what use are the other levels? The issue is (very roughly!) that finer discretisations are often associated with tight couplings between individual components in the $\mathcal{X}_L$ space, which give rise to poor conditioning, and hence to slow convergence of many algorithms.
The idea of the multigrid approach to computation is to harness this hierarchy of approximations to accelerate the convergence of numerical methods, making use of the rapid convergence of basic methods for smaller $\ell$ to complement the conditioning difficulties at larger $\ell$.
## Multigrid Solution of Linear Systems
For solving the linear system $A_L x_L = y_L$ using the multigrid approach, we need an additional ingredient: mappings $A_\ell : \mathcal{X}_\ell \to \mathcal{Y}_\ell$. These will serve as coarse-grid approximations to the map $A_L$.
Now, suppose that our prolongation and restriction operators are 'approximately commutative' with respect to $A_\ell$, in the sense that
\begin{align}
\text{for}\, x_{\ell - 1} \in \mathcal{X}_{\ell-1}, \quad R_{\ell - 1, \ell} A_\ell P_{\ell, \ell - 1} x_{\ell - 1} \approx A_{\ell - 1} x_{\ell - 1}.
\end{align}
This says that if we take some $x$, prolongate it to an element of a finer discretisation space, apply the operator $A$ in that space, and then restrict that output to the coarser space, then this is close to simply applying the operator $A$ in the original coarse space.
Suppose now that we have an approximate solution to our problem, i.e.
\begin{align}
A_L \hat{x}_L \approx y_L,
\end{align}
and let $r_L = y_L - A_L \hat{x}_L$. Now, suppose that $e_{L-1}$ solves
\begin{align}
A_{L-1} e_{L - 1} = R_{L-1, L }r_L.
\end{align}
One can then argue that
\begin{align}
A_{L-1} e_{L - 1} &\approx R_{L - 1, L} A_L P_{L, L - 1} e_{L - 1} \\
R_{L-1, L } r_L &\approx R_{L - 1, L} A_L P_{L, L - 1} e_{L - 1} \\
r_L &\approx A_L P_{L, L - 1} e_{L - 1} \\
y_L - A_L \hat{x}_L &\approx A_L P_{L, L - 1} e_{L - 1} \\
A_L \left( \hat{x}_L + P_{L, L - 1} e_{L - 1} \right) &\approx y_L,
\end{align}
i.e that solving the coarser system $A_{L-1} e_{L - 1} = R_{L-1, L }r_L$ allows for an improved solution to the finer system $A_L x_L = y_L$.
Crucially, one can apply this idea recursively, passing the burden of solving linear systems down to the coarsest level, making slight refinements on the way back up to the fine discretisations.
To this end, let us define the $\text{MultiGrid} ( A_\ell, x_\ell^0, y_\ell)$ operator, which takes an initial guess $x_\ell^0$ such that $A_\ell x_\ell^0 \approx y_\ell$, and returns an improved guess, as follows:
1. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( A_\ell, x_\ell^0, y_\ell)$.
2. If $\ell > 0$, then
a. Compute the residual $r_\ell = y_\ell - A_\ell x_\ell^1$.
b. Restrict it to the next level by $y_{\ell - 1} = R_{\ell - 1, \ell} r_\ell$.
c. Obtain an estimate of the solution to $A_{\ell - 1} x_{\ell - 1} = y_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( A_{\ell - 1}, x_{\ell - 1}^0 = 0, y_{\ell - 1})$.
d. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$.
3. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( A_\ell, x_\ell^2, y_\ell)$.
4. Return $x_\ell^3$.
One can also iterate step 2.c. multiple times within each call. I have been vague about the nature of $\text{BasicUpdate}$ so that I can avoid talking about the specifics of numerical linear algebra; my understanding is that in practice, it often amounts to a couple of iterations of Jacobi or Gauss-Seidel iterations.
The remarkable property of multigrid methods is that, in many cases, the iterations described above can converge to a solution of acceptable quality in $\mathcal{O}(1)$ steps, even as $L$ grows. Moreover, the cost of calling $\text{MultiGrid} ( A_L, x_L^0, y_L)$ is usually bounded above by a small constant multiple of the cost of calling $\text{BasicUpdate} ( A_L, x_L^0, y_L)$.
## Multigrid Minimisation of Nonlinear Functions
Consider now the slightly more general task of minimising a nonlinear function $F_L : \mathcal{X}_L \to \mathbf{R}$. By analogy with the linear case, one can define a recursive scheme for minimising $F_L$ by define a family of operators $\text{MultiGrid} ( F_\ell, x_\ell^0)$. The interpretation is that this operator takes an initial guess $x_\ell^0$ which is an approximate minimiser of $F_\ell$, and returns an improved guess. The operator can be defined as follows:
1. Improve our approximate minimiser of $F_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( F_\ell, x_\ell^0)$.
2. If $\ell > 0$, then
a. Define $F_{\ell - 1} ( \cdot) = F_\ell \left( x_\ell^1 + P_{\ell, \ell - 1} \cdot \right)$
b. Obtain an approximate minimiser of $F_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( F_{\ell - 1}, x_{\ell - 1}^0 = 0)$.
c. Improve our approximate minimiser of $F_\ell$ by $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$.
3. Improve our approximate minimiser of $F_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( F_\ell, x_\ell^2)$.
4. Return $x_\ell^3$.
Here, $\text{BasicUpdate}$ might denote a few steps of gradient descent, coordinate descent, etc. I know less about the use cases of this approach, but I still found it useful for understanding other versions of Multigrid.
## Multigrid (Markov Chain) Monte Carlo
Consider now the *even more* general task of sampling from a probability measure $\pi_L$ which is supported on $\mathcal{X}_L$. Extending the analogy further, we should design our operator $\text{MultiGrid} ( \pi_\ell, x_\ell^0)$ to be a Markov kernel which leaves $\pi_\ell$ invariant. The operator can be defined as follows:
1. Move our approximate sample from $\pi_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( \pi_\ell, x_\ell^0)$.
2. If $\ell > 0$, then
a. Define $\pi_{\ell - 1} ( \cdot) = \pi_\ell \left( x_\ell^1 + P_{\ell, \ell - 1} \cdot \right)$
b. Obtain an approximate sample from $\pi_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( \pi_{\ell - 1}, x_{\ell - 1}^0 = 0)$.
c. Move our approximate sample from $\pi_\ell$ to $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$.
3. Move our approximate minimiser of $F_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( \pi_\ell, x_\ell^2)$.
4. Return $x_\ell^3$.
Despite the name, one should be careful not to confuse this approach with Multilevel Monte Carlo, where the hierarchical approach is focused directly on the estimation of expectations, rather than upon encouraging the mixing of a Markov chain.
Here, $\text{BasicUpdate}$ might be taken as a Gibbs sampler update, a simple Metropolis-Hastings step, etc. Some more details and clarifications for those with a statistical background can be found in [this paper](https://www.jstor.org/stable/pdf/2673469.pdf) of Liu-Sabatti. One can essentially view this approach as a hierarchical combination of several simple (though slightly non-standard) Gibbs sampling updates.
## Conclusion
In this note, I have described a few instantiations of the idea behind multigrid computation, with less of a focus on why it works, and more of a focus on how it works, and how one might generalise it to new application domains.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**