# Gaussian process source
[link to somewhere](https://google.com)
For now I will assume the instrumental response function just adds Gaussian noise with standard deviation $\sigma_n$.
We model the source as a zero-mean Gaussian process in the source plane by attaching a flux $\mu$ to each pixel in the image. With the notation
* $\pmb{\mu}$: set of all true fluxes;
* $\pmb{\beta}_\mathrm{src}$: set of associated source-plane points;
<!-- * $k_\theta(\cdot, \cdot)$: covariance function; -->
* $[K_{\theta}(\mathbf{z}^l)]_{ij}$: matrix of covariances between all points,
<!-- \equiv k_\theta(\pmb{\beta}_{\mathrm{src},i}, \pmb{\beta}_{\mathrm{src},j})$ -->
this means we assume
$$
\pmb{\mu} \sim P_N(\pmb{\mu} | \mathbf{0}, K_\theta(\mathbf{z}^l)),\\
\mathbf{x} \sim P_N(\mathbf{x} | \pmb{\mu}, \sigma_n).
$$
The covariance matrix depends only on the lensing parameters (and some hyperparameters), since they determine the source-plane points to which each pixel maps. Since we are ultimately not concerned with the true fluxes, we marginalize over them to get the full PDF for our model:
$$
p_\theta(\mathbf{x} | \mathbf{z}) = \int d\pmb{\mu} \ P_N(\mathbf{x} | \pmb{\mu}, \sigma_n) \ P_N(\pmb{\mu} | \pmb{0}, K_\theta(z^l)) = P_N(\mathbf{x} | \mathbf{0}, \Sigma_\theta(\mathbf{z}^l))\\
\implies \log p_\theta(\mathbf{x} | \mathbf{z}) = -\frac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x} - \frac{1}{2} \log |\Sigma| - \frac{N}{2} \log 2\pi,
$$
where $\Sigma \equiv K_\theta(\mathbf{z}^l) + \sigma_n^2 I$. Note that we have not introduced any source parameters $\mathbf{z}^s$ so far since we have marginalized them out.
For images larger than $\sim 70 \times 70$ pixels, it is impossible to directly maximize this likelihood with respect to the lensing parameters and hyperparameters. Firstly, the covariance matrix is too large to store in memory. While that can be handled using a library like [KeOps](https://www.kernel-operations.io/keops/) to compute the matrix lazily, the log determinant is still intractible. Ultimately we will circumvent this issue by only *sampling* $\mathbf{x} \sim p_\theta(\mathbf{x} | \mathbf{z})$ for fixed lensing and hyperparameters.
## Reparametrizing the likelihood
While the first term in the log likelihood can be computed using conjugate gradient, the calculation is quite slow. We can ameliorate this by rewriting the likelihood so the first term can be computed using gradient descent, allowing simultaneous optimization of the lensing parameters. This requires introducing a set of source parameters $\mathbf{z}^s$.
The trick is the following identity:
$$
\mathbf{x}^T \Sigma^{-1} \mathbf{x} = \min_{\pmb{\mu}} \left\{ \frac{|\mathbf{x} - \pmb{\mu}|^2}{\sigma_n^2} + \hat{\pmb{\mu}}^T K^{-1} \pmb{\mu} \right\}.
$$
The expression is minimized by the MAP estimate of the true fluxes, $\hat{\pmb{\mu}} \equiv K \Sigma^{-1} \mathbf{x}$, which must still be computed using conjugate gradient. However, we can reparametrize this using $\mathbf{z}^s \equiv K^{-1} \pmb{\mu}$:
$$
\mathbf{x}^T \Sigma^{-1} \mathbf{x} = \min_{\mathbf{z}^s} \left\{ \frac{|\mathbf{x} - K \mathbf{z}^s|^2}{\sigma_n^2} + {\mathbf{z}^s}^T K \mathbf{z}^s \right\}.
$$
Now we can optimize with respect to $\mathbf{z}^s$ using gradient descent. At the cost of adding a set of source parameters, we have gained the ability to optimize $p_\theta(\mathbf{x} | \mathbf{z})$ using only gradient descent, and can compute a MAP estimate of the source parameters $\pmb{\mu} = K \mathbf{z}^s$.
## Covariance function from pixel overlaps
The covariance between the surface brightness of two pixels depends on how much they overlap in the source plane:
$$
K_{ij} \propto \frac{A_{ij}}{A_i A_j},
$$
where $A_{i}$ is the area of pixel $i$ and $A_{ij}$ is overlap area between pixels $i$ and $j$ (traced back to the source plane). This can be written in terms of a _window function_ $g_i(\cdot)$:
$$
K_{ij} = \alpha^2 \int d^2\pmb{\beta} \ g_i(\pmb{\beta}) g_j(\pmb{\beta}).
$$
($\alpha$ is a hyperparameter in $\theta$.) The window functions should satisfy $g_i(\pmb{\beta}) \approx A_i^{-1} \ I[\pmb{\beta} \in R_i]$, where $R_i$ is the region which pixel $i$ maps to in the source plane and $I[\cdot]$ is the indicator function (1 if the argument is true, 0 if not). To make this analytically tractable we use Gaussians to approximate the window function, with the following normalization requirement:
$$
g_i(\pmb{\beta}) = P_N(\pmb{\beta} | \pmb{\beta}_{\mathrm{src},i}, \Sigma_i),\\
\int d^2\pmb{\beta} \ [g_i(\pmb{\beta})]^2 = A_i^{-1}.
$$
The resulting kernel is
$$
K_{ij} = \alpha^2 P_N(\pmb{\beta}_{\mathrm{src},i} | \pmb{\beta}_{\mathrm{src},j}, \Sigma_i + \Sigma_j).
$$
All that remains is to compute the window function covariance. This depends on $\mathbf{z}^l$ and can be computed using the coordinates of the pixels in the source plane. Let $\mathbf{a}$ be the difference between the source-plane positions to which the image plane pixels above and below pixel $i$ map; let $\mathbf{b}$ be the same for the pixels to the right and left of pixel $i$ in the image plane. Assuming the magnification varies smoothly over these pixels, we have
$$
\Sigma_i = \frac{1}{16 \pi} (\mathbf{a} + \mathbf{b}).
$$
Lastly, this covariance function can be generalized to include a term for the intrinsic covariance of the source, characterized by the standard Gaussian kernel function $k(\beta, \beta') \equiv P_N(\beta | beta', \sigma)$:
$$
K_{ij} = \alpha^2 \int d^2\pmb{\beta} \ g_i(\pmb{\beta}) \ k_\sigma(\pmb{\beta}_{\mathrm{src},i}, \pmb{\beta}_{\mathrm{src},j}) \ g_j(\pmb{\beta})\\
= \alpha^2 P_N(\pmb{\beta}_{\mathrm{src},i} | \pmb{\beta}_{\mathrm{src},j}, \Sigma_i + \Sigma_j + \sigma^2 I).
$$
This adds a length scale $\sigma$ to the set of hyperparameters $\theta$. In the limit where the windowing functions are taken to be $\delta$ functions ($\Sigma_i, \Sigma_j \to 0$), this reduces to $\alpha^2 \ k_\sigma(\pmb{\beta}_{\mathrm{src},i}, \pmb{\beta}_{\mathrm{src},j})$, the standard squared-exponential Gaussian process kernel.