Multigrid Computation

###### tags: `numerical analysis` `expository` `one-offs` # Multigrid Computation **Overview**: The goal of this note is to describe some aspects of the multigrid approach to computation, as exemplified by examples of i) solving linear systems, ii) minimisation of nonlinear functions, and iii) sampling from probability measures. The presentation draws heavily on [this article](https://journals.aps.org/prd/abstract/10.1103/PhysRevD.40.2035) of Goodman-Sokal. ## Setup Suppose that we are interested in solving some numerical problem on a high-dimensional space. For example, we might care about * Solving a linear system $A_L x_L = y_L$, where $x_L \in \mathcal{X}_L$, $y_L \in \mathcal{Y}_L$ are high-dimensional vectors. * Minimising a function $F_L : \mathcal{X}_L \to \mathbf{R}$, where $x_L \in \mathcal{X}_L$ lives in a high-dimensional space. * Sampling from a probability measure $\pi_L$ which is supported on a high-dimensional space $\mathcal{X}_L$. It is useful to think of $x_L$ as representing the spatial discretisation of some continuum field, e.g. the solution to a PDE, where $L$ denotes the level of detail in the discretisation. In such settings, we are often furnished with a hierarchy of discretisations, i.e. a collection of spaces $\{ \mathcal{X}_\ell \}_{\ell = 0}^L$ (and $\{ \mathcal{Y}_\ell \}_{\ell = 0}^L$ in the case of linear systems), which relate to one another in a relatively simple way. One way of formalising this concept is to posit the existence of 'prolongation' maps $P_{\ell-1, \ell} : \mathcal{X}_{\ell - 1} \to \mathcal{X}_\ell$ which behave roughly like inclusion maps. Moreover, in the case of linear systems, it is useful to introduce also the dual notion of 'restriction' maps $R_{\ell, \ell - 1} : \mathcal{Y}_{\ell} \to \mathcal{Y}_{\ell - 1}$, which behave roughly like projection maps (though this time in $\mathcal{Y}$-space). One might initially consider ignoring these additional approximations: we are interested in solving the level-$L$ problem, so what use are the other levels? The issue is (very roughly!) that finer discretisations are often associated with tight couplings between individual components in the $\mathcal{X}_L$ space, which give rise to poor conditioning, and hence to slow convergence of many algorithms. The idea of the multigrid approach to computation is to harness this hierarchy of approximations to accelerate the convergence of numerical methods, making use of the rapid convergence of basic methods for smaller $\ell$ to complement the conditioning difficulties at larger $\ell$. ## Multigrid Solution of Linear Systems For solving the linear system $A_L x_L = y_L$ using the multigrid approach, we need an additional ingredient: mappings $A_\ell : \mathcal{X}_\ell \to \mathcal{Y}_\ell$. These will serve as coarse-grid approximations to the map $A_L$. Now, suppose that our prolongation and restriction operators are 'approximately commutative' with respect to $A_\ell$, in the sense that \begin{align} \text{for}\, x_{\ell - 1} \in \mathcal{X}_{\ell-1}, \quad R_{\ell - 1, \ell} A_\ell P_{\ell, \ell - 1} x_{\ell - 1} \approx A_{\ell - 1} x_{\ell - 1}. \end{align} This says that if we take some $x$, prolongate it to an element of a finer discretisation space, apply the operator $A$ in that space, and then restrict that output to the coarser space, then this is close to simply applying the operator $A$ in the original coarse space. Suppose now that we have an approximate solution to our problem, i.e. \begin{align} A_L \hat{x}_L \approx y_L, \end{align} and let $r_L = y_L - A_L \hat{x}_L$. Now, suppose that $e_{L-1}$ solves \begin{align} A_{L-1} e_{L - 1} = R_{L-1, L }r_L. \end{align} One can then argue that \begin{align} A_{L-1} e_{L - 1} &\approx R_{L - 1, L} A_L P_{L, L - 1} e_{L - 1} \\ R_{L-1, L } r_L &\approx R_{L - 1, L} A_L P_{L, L - 1} e_{L - 1} \\ r_L &\approx A_L P_{L, L - 1} e_{L - 1} \\ y_L - A_L \hat{x}_L &\approx A_L P_{L, L - 1} e_{L - 1} \\ A_L \left( \hat{x}_L + P_{L, L - 1} e_{L - 1} \right) &\approx y_L, \end{align} i.e that solving the coarser system $A_{L-1} e_{L - 1} = R_{L-1, L }r_L$ allows for an improved solution to the finer system $A_L x_L = y_L$. Crucially, one can apply this idea recursively, passing the burden of solving linear systems down to the coarsest level, making slight refinements on the way back up to the fine discretisations. To this end, let us define the $\text{MultiGrid} ( A_\ell, x_\ell^0, y_\ell)$ operator, which takes an initial guess $x_\ell^0$ such that $A_\ell x_\ell^0 \approx y_\ell$, and returns an improved guess, as follows: 1. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( A_\ell, x_\ell^0, y_\ell)$. 2. If $\ell > 0$, then a. Compute the residual $r_\ell = y_\ell - A_\ell x_\ell^1$. b. Restrict it to the next level by $y_{\ell - 1} = R_{\ell - 1, \ell} r_\ell$. c. Obtain an estimate of the solution to $A_{\ell - 1} x_{\ell - 1} = y_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( A_{\ell - 1}, x_{\ell - 1}^0 = 0, y_{\ell - 1})$. d. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$. 3. Improve our estimate of the solution to $A_\ell x_\ell = y_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( A_\ell, x_\ell^2, y_\ell)$. 4. Return $x_\ell^3$. One can also iterate step 2.c. multiple times within each call. I have been vague about the nature of $\text{BasicUpdate}$ so that I can avoid talking about the specifics of numerical linear algebra; my understanding is that in practice, it often amounts to a couple of iterations of Jacobi or Gauss-Seidel iterations. The remarkable property of multigrid methods is that, in many cases, the iterations described above can converge to a solution of acceptable quality in $\mathcal{O}(1)$ steps, even as $L$ grows. Moreover, the cost of calling $\text{MultiGrid} ( A_L, x_L^0, y_L)$ is usually bounded above by a small constant multiple of the cost of calling $\text{BasicUpdate} ( A_L, x_L^0, y_L)$. ## Multigrid Minimisation of Nonlinear Functions Consider now the slightly more general task of minimising a nonlinear function $F_L : \mathcal{X}_L \to \mathbf{R}$. By analogy with the linear case, one can define a recursive scheme for minimising $F_L$ by define a family of operators $\text{MultiGrid} ( F_\ell, x_\ell^0)$. The interpretation is that this operator takes an initial guess $x_\ell^0$ which is an approximate minimiser of $F_\ell$, and returns an improved guess. The operator can be defined as follows: 1. Improve our approximate minimiser of $F_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( F_\ell, x_\ell^0)$. 2. If $\ell > 0$, then a. Define $F_{\ell - 1} ( \cdot) = F_\ell \left( x_\ell^1 + P_{\ell, \ell - 1} \cdot \right)$ b. Obtain an approximate minimiser of $F_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( F_{\ell - 1}, x_{\ell - 1}^0 = 0)$. c. Improve our approximate minimiser of $F_\ell$ by $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$. 3. Improve our approximate minimiser of $F_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( F_\ell, x_\ell^2)$. 4. Return $x_\ell^3$. Here, $\text{BasicUpdate}$ might denote a few steps of gradient descent, coordinate descent, etc. I know less about the use cases of this approach, but I still found it useful for understanding other versions of Multigrid. ## Multigrid (Markov Chain) Monte Carlo Consider now the *even more* general task of sampling from a probability measure $\pi_L$ which is supported on $\mathcal{X}_L$. Extending the analogy further, we should design our operator $\text{MultiGrid} ( \pi_\ell, x_\ell^0)$ to be a Markov kernel which leaves $\pi_\ell$ invariant. The operator can be defined as follows: 1. Move our approximate sample from $\pi_\ell$ by $x_\ell^1 \leftarrow \text{BasicUpdate} ( \pi_\ell, x_\ell^0)$. 2. If $\ell > 0$, then a. Define $\pi_{\ell - 1} ( \cdot) = \pi_\ell \left( x_\ell^1 + P_{\ell, \ell - 1} \cdot \right)$ b. Obtain an approximate sample from $\pi_{\ell - 1}$ by $x_{\ell - 1}^1 \leftarrow \text{MultiGrid} ( \pi_{\ell - 1}, x_{\ell - 1}^0 = 0)$. c. Move our approximate sample from $\pi_\ell$ to $x_\ell^2 \leftarrow x_\ell^1 + x_{\ell - 1}^1$. 3. Move our approximate minimiser of $F_\ell$ by $x_\ell^3 \leftarrow \text{BasicUpdate} ( \pi_\ell, x_\ell^2)$. 4. Return $x_\ell^3$. Despite the name, one should be careful not to confuse this approach with Multilevel Monte Carlo, where the hierarchical approach is focused directly on the estimation of expectations, rather than upon encouraging the mixing of a Markov chain. Here, $\text{BasicUpdate}$ might be taken as a Gibbs sampler update, a simple Metropolis-Hastings step, etc. Some more details and clarifications for those with a statistical background can be found in [this paper](https://www.jstor.org/stable/pdf/2673469.pdf) of Liu-Sabatti. One can essentially view this approach as a hierarchical combination of several simple (though slightly non-standard) Gibbs sampling updates. ## Conclusion In this note, I have described a few instantiations of the idea behind multigrid computation, with less of a focus on why it works, and more of a focus on how it works, and how one might generalise it to new application domains.