# GPs 4 OI
###### tags: `interpolation` `papers`
---
## Important Links
* [**Mathy Bits**](/vvUAd4zrQMeGSy7sKhL6rg) - The mathy bits related specifically to this paper. (Updating as I Go along)
* [**Literature Review**](/BRqNK5qoRxqv1k4oAkeT0g) - The literature review specific to this paper (Updating as I go along)
* [**Data Details**](/d3gHdHZQRs2f8455IwdOuQ) - The data details specific to this paper (Updating as I go along)
* [**Results Details**](https://hackmd.io/FTu-EuL9T9GlPwIYs6n_HA) - Some preliminary results.
---
* My Personal [**GP Model Zoo**](https://jejjohnson.github.io/gp_model_zoo/) - A website I made during my PhD covering all things Gaussian process with modern tools!
* [Data Challenge II](https://github.com/ocean-data-challenges/2021a_SSH_mapping_OSE) - Based purely on real observations
* [Data Challenge I](https://github.com/ocean-data-challenges/2020a_SSH_mapping_NATL60) - Model simulations and simulated observations
---
# Planned TOC
**Problem Statement**
> Overview of data given, the objective, and our contributions.
* Data Given - Sparse Spatio-Temporal Observations (e.g. Satellite tracks)
* Objective - Interpolate the missing data.
* Contribution - A modern take on OI using ML tools (e.g. Gaussian processes)
<span style="color:purple">E: Straight Forward Introduction</span>
---
**Background**
> Give an overview of the data-driven OI framework and show the direct connection to Gaussian Processes (GPs).
* Coordinates vs Fields ($\boldsymbol{X} \in \mathbb{R}^{N \times D_\phi}$ vs $\mathbf{x} \in \mathbb{R}^{D}$)
* Data-Driven OI Framework
* Literature in DA Field
* Literature in ML Field (e.g. Kriging, kernel ridge regression)
<span style="color:purple">E: Easy math to show but harder to motivate tunnel vision. High-level equations only. Derivations in the appendix. </span>
---
**Overview of Proposed Methods**
> We make the case that GPs are far more scalable now (a minimal model sketch follows below).
* GP Formulation + Benefits (mean, cov, samples)
* Scalable GPs:
* Conjugate Gradient (Exact GP)
* Structured Kernels (KISS-GP)
* Inducing Points (VFE)
<span style="color:purple">E: Again, easy math to show. Step-by-Step explanation about the motivation and result. High-level equations only. Derivations in the appendix. </span>
---
## Data
1. Synthetic Data
2. Model Data
3. Real Data
---
### Synthetic Data
> A toy example in a controlled environment to showcase a few of the benefits GPs offer. We can use the data generator that Maxime Beauchamp used.
We will use a pseudo-Stochastic Differential Equation (SDE) to generate simulations. I believe it will be done via a GP (a minimal sketch follows the list below). We will vary the simulation parameters so that we can control the:
* Spatial Domain
* Step Size
* Number of Observations
* Noise
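A minimal sketch of one such generator (a single draw from a 2-D GP prior with an RBF kernel on a grid); this is illustrative only, not Maxime Beauchamp's actual generator:

```python
import numpy as np

def sample_gp_field(nx=32, ny=32, lengthscale=0.2, jitter=1e-6, seed=0):
    """Draw one realization of a 2D field from a GP prior with an RBF kernel."""
    rng = np.random.default_rng(seed)
    xx, yy = np.meshgrid(np.linspace(0, 1, nx), np.linspace(0, 1, ny))
    coords = np.column_stack([xx.ravel(), yy.ravel()])            # (nx*ny, 2)
    sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)                  # RBF covariance
    L = np.linalg.cholesky(K + jitter * np.eye(len(coords)))      # jitter for stability
    return (L @ rng.standard_normal(len(coords))).reshape(ny, nx)

field = sample_gp_field()  # vary lengthscale / grid size / added noise as listed above
```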
---
### Movies
> Movies for the toy datasets: the Gaussian field & the 1-layer QG field.
* Data
* Gradient
* Laplacian
> We also want summary statistics that would be useful for comparing the datasets:
* Wave Number Spectra (Each Dataset)
* Entropy
* Total Correlation
* Correlation
---
### Model Data
**Details**: [**Data Details**](/d3gHdHZQRs2f8455IwdOuQ)
> This toy example is also a controlled environment. Ideally, this system will be physically consistent with what we want to emulate.
We will use a 1-layer QG model to generate simulations. We will vary the simulation parameters so that we can control the:
* Spatial Domain
* Step Size
* Number of Observations
* Noise
**Code**: [TorchQG](https://github.com/hrkz/torchqg)
#### Visualization
We are particularly interested in the stream function, $\boldsymbol{\rho}$. We assume the stream function takes in a vector of coordinates and outputs a scalar value.
$$
\boldsymbol{\rho} (x,y) : \mathbb{R}^2 \rightarrow \mathbb{R}
$$
**Stream Function**
This plot shows the output of the stream function, $\boldsymbol{\rho}$, over the entire domain, $D_X, D_Y$.

**Note**: We plot each of the time steps sequentially (and independently).
**Partial Derivative Summary**
So given the coordinates over the full domain, $\mathbf{x} \in \mathbb{R}^{N_x}$, $\mathbf{y} \in \mathbb{R}^{N_y}$, we can calculate the sensitivity map for the full domain, $\boldsymbol{S} \in \mathbb{R}^{N_x \times N_y}$, where each entry is:
$$
(\boldsymbol{S_{\nabla f}})_{n_i,n_j} = \left(\frac{\partial \boldsymbol{\rho}(\mathbf{x}_{n_i}, \mathbf{y}_{n_j})}{\partial x} \right)^2 + \left(\frac{\partial \boldsymbol{\rho} (\mathbf{x}_{n_i}, \mathbf{y}_{n_j})}{\partial y} \right)^2
$$

**2nd Partial Derivative Summary**
Similarly, given the coordinates over the full domain, $\mathbf{x} \in \mathbb{R}^{N_x}$, $\mathbf{y} \in \mathbb{R}^{N_y}$, we can calculate the 2nd-order sensitivity map for the full domain, $\boldsymbol{S} \in \mathbb{R}^{N_x \times N_y}$, where each entry is:
$$
(\boldsymbol{S_{\nabla^2 f}})_{n_i,n_j} = \left(\frac{\partial^2 \boldsymbol{\rho}(\mathbf{x}_{n_i}, \mathbf{y}_{n_j})}{\partial x^2} \right)^2 + \left(\frac{\partial^2 \boldsymbol{\rho} (\mathbf{x}_{n_i}, \mathbf{y}_{n_j})}{\partial y^2} \right)^2
$$

---
### Real Data
> Using an extended version of the Data Challenge II dataset and seeing how well we can perform in terms of accuracy and scalability.
* Area of Interest: Gulf Stream (for now)
* Time period: 5 years (2015-2019)
* Methods: DUACS, Exact GP, KISS-GP, SVGP
#### Number of Observations
* 1 Month: `55,000`
* 1 Year: `660,000`
* 5 Years: `3,300,000`
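These sizes are exactly why the scalable variants matter; a back-of-the-envelope check on the memory footprint of the dense $N \times N$ kernel matrix (float64):

```python
# 8 bytes per float64 entry of the dense N x N kernel matrix
for label, n in [("1 month", 55_000), ("1 year", 660_000), ("5 years", 3_300_000)]:
    print(f"{label}: N = {n:,} -> {n**2 * 8 / 2**30:,.0f} GiB")
# 1 month  -> ~23 GiB   (already borderline for a single GPU)
# 1 year   -> ~3,245 GiB
# 5 years  -> ~81,137 GiB
```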
---
**I: Model Flexibility**
* Different kernel functions
* RBF, Matern Family, Periodic, Spectral Mixture
<span style="color:purple">E: Easy figure with different covariance structures. </span>
**II: Scalability**
* Different Scaling Methods
* Exact GP (CG + GPUs)
* Structured Kernel (KISS-GP)
* Inducing Points (SVGP)
<span style="color:purple">E: Simple figure showcasing the log-plot for scalability on the empirical dataset. </span>
---
**Experiment I**: Baseline Comparison
> Showcase that this method matches DUACS with minimal assumptions.
* Metrics: RMSE, resolved scale ($\lambda_x, \lambda_t$)
<span style="color:purple">E: More or less the same experiment as the data challenge 2021 </span>
---
**Experiment II**: Scalability
> Demonstrate with empirical tests how scalability allows for more data, which in turn allows for better accuracy (or not).
* Showcase Full Methods:
* Exact GP, Structured Kernel (KISS-GP), Inducing Points (SVGP)
* Show plots of the RMSE vs the scaling
* Show different simple kernels
<span style="color:purple">E: This is new. I've never seen anything that explicitly shows how one </span>
---
**Future Outlook**
* Smaller Component in a bigger system (e.g. init for 4DVarNet)
* **Further Analysis**:
* We can take a more Bayesian perspective on the problem, with more complete inference methods (VI, HMC), to better understand the parameters.
* **Physics-Informed**: covariance function exploration + mean function constraint (GP-NODE)
* The mean function is a natural place to encode the physics.
* **Scaling**: Trifecta - GPs, Fourier, Markovian GPs
* Perhaps we still want the full state-space perspective; if so, we can look at other representations of GPs.
---
## Problem Statement
**Data Given**
* Sparse Observations from satellite tracks.
* Spatio-Temporal Dataset
**Objective**
* Interpolate the observations to fill in the unobserved regions
---
## Contributions
A modern take on Optimal Interpolation (OI) using ML tools
* Revisit the connection between OI and Gaussian Processes (GPs) / Kriging
* Showcase benefits:
* mean, variance
* sampling
* flexible kernels - encode structure and priors
* Demonstrate scalability on global data:
* Conjugate Gradient + GPUs - $\mathcal{O}(N^2)$
* Conjugate Gradient + Kernel Interpolation - $\mathcal{O}(N + N\log N) \approx \mathcal{O}(N)$.
* Sparse approximations - $\mathcal{O}(NM^2)$
---
---
## Experiments
---
### Exp. 1:
> Demonstrate that the method is as flexible as, if not more flexible than, the standard OI baseline.
---
**OI**
* Assumptions
* Solution Equations
**Gaussian Processes**
* Learning
* Mean, Variance
* Sampling
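A sketch of what "mean, variance, sampling" buys at prediction time (GPyTorch posterior calls on a trained model; `model` and `likelihood` as in the earlier sketches):

```python
import torch
import gpytorch

model.eval(); likelihood.eval()
# test_x: (M, 3) grid coordinates to interpolate onto
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    f_dist = model(test_x)                        # latent GP posterior
    y_dist = likelihood(f_dist)                   # adds observation noise
    mean = y_dist.mean                            # the OI-style analysis field
    var = y_dist.variance                         # per-point uncertainty
    samples = f_dist.rsample(torch.Size([10]))    # 10 joint posterior samples
```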
---
## Scaling
**Extensions**
* Scaling -> Sparse Gaussian Processes
* Scaling (Subset) -> Sparse, Variational Gaussian Processes + MiniBatches
* Very Long Time Series -> Markovian Gaussian Processes
* Kernels -> Product Kernels + Composite Kernels
* Fast Sampling -> CIQ
* Fully Bayesian -> Parameters
* Physics-Informed -> Physically Consistent Mean Function
**Datasets**
* Data Challenge - 400K points
**Connections**
* GeoStats: Kriging
* CV: NeRF -> Approximate Kernel Regression
* DA: OI, MIOST
* Trifecta: Kernel, Fourier, Markovian
---
**Inspiration**
* Gaussian processes meet NeuralODEs: A Bayesian framework for learning the dynamics of partially observed systems from scarce and noisy data - [arxiv](https://arxiv.org/abs/2103.03385) | [code](https://github.com/PredictiveIntelligenceLab/GP-NODEs)
---
---
### Stage I
> The objective is to demonstrate that this can do spatial interpolation. We will use toy data that is relatively controlled. We will also work on the formulation to try to show the differences between the methods.
**Objectives**
* **Mathematical Formulation** - look for consistencies (and differences) between the OI literature and the kernel methods literature
* Try Different kernel functions, e.g. Linear, Gaussian, Polynomial
* Assess time and memory constraints for the methods
* Operate Sequentially
* Assess the limit of the gaps we artificially create
* Assess the limit of the noise we artificially inject
* Visualize everything if possible
**Data**
* Toy Examples: PDE Functions
**Dissemination**
* Walk-Through Demo of Kernel Ridge Regression
* Walk-Through Demo of Gaussian Process Regression
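A minimal seed for the KRR walk-through (scikit-learn's `KernelRidge`, which shares the GP posterior-mean equation but drops the uncertainty; the 1-D gappy data here is a stand-in):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# stand-in 1D interpolation problem with an artificial observation gap
rng = np.random.default_rng(0)
x_obs = np.concatenate([np.linspace(0, 3, 40), np.linspace(6, 10, 40)])[:, None]
y_obs = np.sin(x_obs).ravel() + 0.1 * rng.standard_normal(80)

krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)  # alpha ~ noise, gamma ~ 1/(2*lengthscale^2)
krr.fit(x_obs, y_obs)
x_test = np.linspace(0, 10, 200)[:, None]
y_pred = krr.predict(x_test)  # fills the gap, but with no variance estimate
```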
---
### Stage II
> The proof of concept should be clear. So now we want to see how well this method can scale on real data using proper machine learning techniques.
* Push the boundaries of scale - spatio-temporal data should start causing problems
* Look at the uncertainties - do they make sense?
* Look at time scales for coarser resolutions
**Data**
> We can really start to use real data here. We can look at the Data Challenge I dataset as well as the MedWest60 dataset. We can be liberal with the scales (dx, dt) because we are just pushing the boundaries of the method to find out where it fails.
---
### Stage III
> This can be integrated into the Data Challenge baselines.
**Objectives**
**Dissemination**
* Walk-Through Experiment (Toy Data, Model Data)
* Post Data Challenge Results
---
## Algorithms
**Kernel Functions**
* Linear
* RBF
* Matern Family ($\nu = 1/2, 3/2, 5/2$)
* Spectral Mixture Kernel
**Regression Methods**
* Gaussian Process Regression (GPR)
**Kernel Approximations**
* Random Fourier Features (RFF)
* Grid Interpolation Kernel
* Inducing Point Kernel
**Inference**
* Maximum Likelihood Estimation (MLE) - GPR
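A sketch of that MLE step (the standard GPyTorch type-II maximum-likelihood loop; the hyperparameters being fit are the kernel lengthscales, output scale, and observation noise, with `model`, `likelihood`, `train_x`, `train_y` as in the earlier sketches):

```python
import torch
import gpytorch

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for _ in range(100):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)   # negative log marginal likelihood
    loss.backward()
    optimizer.step()
```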