# OceanBench Starters
Hi! Thank you very much for your interest, and thanks for raising your concerns; we are still trying to improve how we explain what we are doing. Let me try to explain the different tasks (from my perspective). I would appreciate any feedback on this explanation, because I would like to include this write-up in the documentation.
***
**Dataset Available**
In general, we're working with 4 types of datasets:
* Sea Surface Height (SSH) - $\eta$
* Sea Surface Temperature (SST) - $T$
* *Discretized* AlongTrack SSH Observations - $\eta_{obs}$
* *Native* AlongTrack Observations - $\eta_{atrack}$
$$
\begin{aligned}
\text{Discretized SSH Field}: && &&
\eta = \eta(\mathbf{s},t), &&
\eta:\mathbb{R}^{D_s}\times\mathbb{R}^+\rightarrow\mathbb{R}^{D_\eta} &&
\mathbf{s}\in\Omega_{Field} \subseteq\mathbb{R}^{D_s} &&
t\in\mathcal{T}_\eta\subseteq\mathbb{R}^+\\
\text{Discretized SST Field}: && &&
T = T(\mathbf{s},t), &&
T:\mathbb{R}^{D_s}\times\mathbb{R}^+\rightarrow\mathbb{R}^{D_T} &&
\mathbf{s}\in\Omega_{Field} \subseteq\mathbb{R}^{D_s} &&
t\in\mathcal{T}_T\subseteq\mathbb{R}^+\\
\text{Discretized AlongTrack SSH Observations}: && &&
\eta_{obs} = \eta_{obs}(\mathbf{s},t), &&
\eta_{obs}:\mathbb{R}^{D_s}\times\mathbb{R}^+\rightarrow\mathbb{R}^{D_y} &&
\mathbf{s}\in\Omega_{Field} \subseteq\mathbb{R}^{D_s} &&
t\in\mathcal{T}_\eta\subseteq\mathbb{R}^+ \\
\text{Native AlongTrack SSH Observations}: && &&
\eta_{atrack} = \eta_{atrack}(\mathbf{s},t), &&
\eta_{atrack}:\mathbb{R}^{D_s}\times\mathbb{R}^+\rightarrow\mathbb{R}^{D_y} &&
\mathbf{s}\in\Omega_{atrack} \subseteq\mathbb{R}^{D_s} &&
t\in\mathcal{T}_{atrack}\subseteq\mathbb{R}^+
\end{aligned}
$$
where $\mathbf{s},t$ are the spatiotemporal coordinates, $\Omega$ is the spatial domain, and $\mathcal{T}$ is the temporal domain.
In general, we have some data that's on our target **state** spatiotemporal domain:
* Variables - SSH, SST, Discretized SSH Obs
* Spatial Domain, $\mathbf{s}\in\Omega_{Field}$ - discretized grid at a 0.05 degree resolution.
* Temporal Domain, $t\in\mathcal{T}_\eta$ - regularly spaced daily intervals
The *native* alongtrack SSH observations have their own **observation** domain:
* Variables - *Native* SSH Obs
* Spatial Domain, $\mathbf{s}\in\Omega_{atrack}$ - alongtrack grid at a 7x7 km resolution.
* Temporal Domain, $t\in\mathcal{T}_{atrack}$ - alongtrack samples at a frequency of 1 Hz
**Note**: all observations natively live on the observation domain; however, we also provide a simple discretization onto the state domain for users.
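To make the discretization step concrete, here is a minimal sketch of bin-averaging native alongtrack samples onto a regular grid. This is not OceanBench's actual preprocessing routine; the `grid_alongtrack` helper, bin sizes, and toy track are all invented for illustration.

```python
import numpy as np

def grid_alongtrack(lon, lat, ssh, lon_bins, lat_bins):
    """Bin-average native alongtrack SSH samples onto a regular lat/lon grid.

    Cells with no samples stay NaN, mimicking the gappy discretized
    observations described above. (Illustrative only, not OceanBench's code.)
    """
    sums = np.zeros((len(lat_bins) - 1, len(lon_bins) - 1))
    counts = np.zeros_like(sums)
    ix = np.digitize(lon, lon_bins) - 1   # column index of each sample
    iy = np.digitize(lat, lat_bins) - 1   # row index of each sample
    inside = (ix >= 0) & (ix < sums.shape[1]) & (iy >= 0) & (iy < sums.shape[0])
    np.add.at(sums, (iy[inside], ix[inside]), ssh[inside])
    np.add.at(counts, (iy[inside], ix[inside]), 1.0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

# toy track of 4 samples on a 0.05-degree grid
lon = np.array([0.01, 0.02, 0.08, 0.13])
lat = np.array([0.01, 0.03, 0.06, 0.11])
ssh = np.array([1.0, 3.0, 2.0, 4.0])
edges = np.arange(0.0, 0.16, 0.05)        # 4 edges -> a 3x3 grid
grid = grid_alongtrack(lon, lat, ssh, edges, edges)
```

The first two samples fall into the same cell and are averaged; cells the track never crosses remain NaN, which is exactly the gappy structure that makes the state estimation problem below nontrivial.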
***
**Estimation Problem**
For the OceanBench: SSH edition, our objective is to estimate the best state given some partial observations and some parameters or auxiliary information:
$$
f:\text{Observations}\times\text{Params} \rightarrow \text{State}
$$
In particular, we have formulated this as a state estimation problem whereby we want to estimate the state, i.e., the full SSH field, $\eta$, given some observations, e.g., $\eta_{obs}$, $\eta_{atrack}$, or $T$.
$$
\begin{aligned}
\text{OSSE Validation Data}: && &&
\mathcal{D}_{Inference} &= \{\text{Pseudo-Observations}\} \\
\text{OSE Validation Data}: && &&
\mathcal{D}_{Inference} &= \{\text{Observations}\} \\
\text{State Estimation}: && &&
\eta^* &=
\underset{\eta}{\text{argmin}}
\hspace{2mm}
\mathcal{L}(\eta;\mathcal{D}_{Inference})
\end{aligned}
$$
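As a toy illustration of this state-estimation view, here is plain gradient descent recovering a 1D field from sparse observations by minimizing a data misfit plus a smoothness penalty. This is not any of the OceanBench solvers; the field, loss weights, and step size are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = np.linspace(0, 2 * np.pi, n)
eta_true = np.sin(x)                          # "ground truth" state
obs_idx = rng.choice(n, size=12, replace=False)
eta_obs = eta_true[obs_idx]                   # sparse partial observations

lam, lr = 1.0, 0.2                            # smoothness weight, step size (invented)

def loss(e):
    """Data misfit at observed points plus a first-difference smoothness penalty."""
    return 0.5 * np.sum((e[obs_idx] - eta_obs) ** 2) + 0.5 * lam * np.sum(np.diff(e) ** 2)

eta = np.zeros(n)                             # initial state estimate
for _ in range(2000):
    grad = np.zeros(n)
    grad[obs_idx] += eta[obs_idx] - eta_obs   # gradient of the data-misfit term
    d = np.diff(eta)
    grad[:-1] -= lam * d                      # gradient of the smoothness term
    grad[1:] += lam * d
    eta -= lr * grad
```

The smoothness term plays the role of the prior/regularization that a real method (optimal interpolation, 4DVarNet, a trained network) supplies; the observations alone would only pin down 12 of the 64 state values.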
In OceanBench, we provide datasets for specific challenges as well as a framework (with examples) for creating your own datasets with our custom preprocessing routines, dataloaders, and metrics. For a user to get started with **inference** right away, you're correct: I think the [end-to-end 4DVarNet example](https://jejjohnson.github.io/oceanbench/content/getting_started/ocean_bench_4dvarnet.html) is the best place to start. It demonstrates:
* How to create an inference dataloader
* How to visualize results and compute metrics
* How to use hydra to help with configurations
So a user can load the **validation data** (only) for the different challenges and get started right away with making predictions (basically zero-shot predictions).
**However**: most people cannot start here right away if they do not have their own pretrained model, because we don't provide any training loops and this example does not give the user access to the ground truth. 4DVarNet has a "training" loop because it solves a minimization problem, but most ML people probably will not use this (yet). To try something more typical (like a UNet), they would need to start with training, which I outline below.
***
**Training Problem**
As I mentioned above, most people need to train **something** before getting started with inference, so we also provide helpful data and tools so that users can learn the parameters of their own models. We use a general definition of learning: find the best parameters, $\theta$, for some model, $f$, given some training dataset, $\mathcal{D}_{Train}$. This covers essentially every loss and objective function in the ML world: we provide the training dataset, $\mathcal{D}_{Train}$, and the user provides their own training objective, model, and trainer. In OceanBench, almost all of the challenges look like this:
$$
\begin{aligned}
\text{OSSE Training Data}: && &&
\mathcal{D}_{Train} &= \{ \text{Simulated States}, \text{Pseudo-Observations}\} \\
\text{OSE Training Data}: && &&
\mathcal{D}_{Train} &= \{\text{Observations}\} \\
\text{Parameter Learning}: && &&
\theta^* &=
\underset{\theta}{\text{argmin}}
\hspace{2mm}
\mathcal{L}(\theta;\mathcal{D}_{Train})
\end{aligned}
$$
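A toy instance of this parameter-learning setup: fit the parameters $\theta$ of a linear reconstruction model on OSSE-style pairs of simulated states and pseudo-observations. The linear model, observation mask, and noise level are invented for illustration; they are not an OceanBench baseline.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_state = 200, 8
states = rng.normal(size=(n_samples, n_state))       # simulated states (eta)
mask = np.zeros(n_state)
mask[::2] = 1.0                                      # observe every other grid point
pseudo_obs = states * mask + 0.01 * rng.normal(size=states.shape)

theta = np.zeros((n_state, n_state))                 # parameters of f(obs; theta) = obs @ theta
lr = 0.05
for _ in range(500):
    residual = pseudo_obs @ theta - states           # f(obs; theta) - state
    grad = pseudo_obs.T @ residual / n_samples       # gradient of the mean-squared error
    theta -= lr * grad

mse = np.mean((pseudo_obs @ theta - states) ** 2)
```

In the OSSE setting the simulated states play the role of labels, which is exactly what the real OSE setting lacks; here the linear model can only recover the observed components, so the residual error is roughly the variance of the unobserved half of the state.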
We try to demonstrate tools within the OceanBench framework to help users start training. So for people interested in training (almost everyone), the easiest place to start is the [from tasks to datasets](https://jejjohnson.github.io/oceanbench/content/getting_started/TaskToPatcher.html#) example. Of course, they are welcome to use any of the tools from the **estimation problem** tutorials. We only ask that users don't use any of the data we reserve for validation. I explain this more below.
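As a rough sketch of what the patcher does: slice a gridded (time, lat, lon) field into overlapping spatiotemporal patches that a training loop can iterate over. OceanBench's actual patcher works on labeled xarray objects; this numpy version, with invented patch and stride sizes, only shows the indexing idea.

```python
import numpy as np

def iter_patches(field, patch=(5, 16, 16), stride=(5, 8, 8)):
    """Yield (time, lat, lon) patches from a gridded field with a fixed stride.

    Illustrative sketch only; patch/stride sizes are invented.
    """
    nt, ny, nx = field.shape
    pt, py, px = patch
    st, sy, sx = stride
    for t0 in range(0, nt - pt + 1, st):
        for y0 in range(0, ny - py + 1, sy):
            for x0 in range(0, nx - px + 1, sx):
                yield field[t0:t0 + pt, y0:y0 + py, x0:x0 + px]

field = np.arange(10 * 32 * 32, dtype=float).reshape(10, 32, 32)
patches = list(iter_patches(field))   # 2 time offsets x 3 lat x 3 lon = 18 patches
```

Each patch could then be paired with the corresponding patch of pseudo-observations to form a supervised training sample.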
***
**OceanBench Datasets**
In OceanBench, we provide 4 challenges with their associated training and validation datasets. The four challenges available are:
* OSSE I - SSH Fields ($\eta$) + NADIR AlongTrack SSH Obs. ($\eta_{nadir}$)
* OSSE II - SSH Fields ($\eta$) + NADIR & SWOT AlongTrack SSH Obs ($\eta_{nadir},\eta_{swot}$)
* OSSE III - SSH Fields ($\eta$) + NADIR & SWOT AlongTrack SSH Obs ($\eta_{nadir},\eta_{swot}$) + SST Fields ($T$)
* OSE I - NADIR AlongTrack SSH Obs. ($\eta_{atrack}$)
The part where we take special care is the train/inference/validation split. To make things fair, for each challenge we allow users to use whatever data they want for training, $\mathcal{D}_{Train}$. They can also use the inference data, $\mathcal{D}_{Inference}$. But we ask users not to access the data reserved for evaluation, $\mathcal{D}_{Valid}$.
$$
\begin{aligned}
\text{OSSE I}: && &&
\mathcal{D}_{Train} &= \{\eta,\eta_{nadir}\} &&
\mathcal{D}_{Inference} = \{\eta_{nadir}'\} &&
\mathcal{D}_{Valid} = \{\eta'\}\\
\text{OSSE II}: && &&
\mathcal{D}_{Train} &= \{\eta,\eta_{nadir},\eta_{swot}\} &&
\mathcal{D}_{Inference} = \{\eta_{nadir}',\eta_{swot}'\} &&
\mathcal{D}_{Valid} = \{\eta'\}\\
\text{OSSE III}: && &&
\mathcal{D}_{Train} &= \{\eta,\eta_{nadir},\eta_{swot},T\} &&
\mathcal{D}_{Inference} = \{\eta_{nadir}',\eta_{swot}',T'\} &&
\mathcal{D}_{Valid} = \{\eta'\}\\
\text{OSE}: && &&
\mathcal{D}_{Train} &= \{\eta_{atrack}\} &&
\mathcal{D}_{Inference} = \{\eta_{atrack}'\} &&
\mathcal{D}_{Valid} = \{\eta'\}\\
\end{aligned}
$$
The main difference between the train, inference, and validation datasets is which period and region are used, and we take special care to make sure the inference and validation data have no overlap with the training data. See the table below.
| Challenge | Training Data Period | Inference Data Period | **Validation Period** |
| :--- | :--: | :--: | :--: |
| OSSE I, II, III | `[2013-01-01, 2013-09-30]` | `[2012-10-01, 2012-12-02]` | `[2012-10-22, 2012-12-02]` |
| OSE | `[2016-12-01, 2018-01-31]` | `[2016-12-01, 2018-01-31]` | `[2017-01-01, 2017-12-31]` |
The only challenge without distinct periods for training and inference/validation is the OSE challenge. However, these are real observations for which we don't have the full ground-truth field, so we withhold all observations from one satellite entirely and use those as our evaluation split.
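The split logic can be sketched as follows. The dates come from the table above, while the helper functions and satellite names are invented for illustration:

```python
from datetime import date

# OSSE challenges: splits are defined by time period. The validation window
# sits inside the inference window, so a date can belong to both.
OSSE_PERIODS = {
    "train": (date(2013, 1, 1), date(2013, 9, 30)),
    "inference": (date(2012, 10, 1), date(2012, 12, 2)),
    "validation": (date(2012, 10, 22), date(2012, 12, 2)),
}

def osse_splits(t):
    """Return every OSSE split whose period contains the date t."""
    return [name for name, (start, end) in OSSE_PERIODS.items() if start <= t <= end]

# OSE challenge: train and inference share the same period, so the split is
# instead made by withholding one satellite's observations entirely.
# The satellite identifiers here are placeholders, not the actual ones used.
def ose_split(satellite, held_out="c2"):
    return "validation" if satellite == held_out else "train"
```

Note that for the OSSE challenges the inference data (observations) and validation data (true fields) can cover overlapping dates; what matters is that neither overlaps the training period.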
---
**Lightning Data Module**
> I have written up a lightning datamodule that I would use for training in a [gist](https://gist.github.com/nilsleh/a38b3c681eb341ad79f2934ffeaab5aa), showing how I would use the parts from the 4dvarnet-tutorial. Not sure if this is correct, or how you intended it to be used, so I would be grateful for any feedback :)
This is the kind of contribution that would make the life of new users even easier! This is great!