<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://revealjs.com
-->
#### :yin_yang: zen3geo: Guiding :earth_asia: data on its path to enlightenment
<small> [Pangeo Machine Learning working group presentation <br> Monday 7 Nov 2022, 17:00-17:15 (UTC)](https://discourse.pangeo.io/t/monday-november-07-2022-machine-learning-working-group-presentation-zen3geo-guiding-earth-observation-data-on-its-path-to-enlightenment-by-wei-ji-leong/2883) </small>
_by **[Wei Ji Leong](https://github.com/weiji14)**_
<!-- Put the link to this slide here so people can follow -->
<small> P.S. Slides are at https://hackmd.io/@weiji14/2022zen3geo</small>
----
### Why **not** zen3geo
> [Yet another GeoML library](https://github.com/weiji14/zen3geo/discussions/70)
> It thinks [spatial is special](https://web.archive.org/web/20221013135644/https://www.linkedin.com/pulse/what-special-spatial-willy-simons)
> No data visualization tools
----
### Why zen3geo
> [Worse is better](https://www.jwz.org/doc/worse-is-better.html)
> [Simple is better than complex](https://peps.python.org/pep-0020/#the-zen-of-python)
> [Let each part do one thing and do it well](https://en.wikipedia.org/wiki/Unix_philosophy#Do_One_Thing_and_Do_It_Well)
---
### Design patterns
| | :yin_yang: [zen3geo](https://github.com/weiji14/zen3geo/tree/v0.5.0) | [torchgeo](https://github.com/microsoft/torchgeo/tree/v0.3.1) | [eo-learn](https://github.com/sentinel-hub/eo-learn/tree/v1.3.0) | [raster-vision](https://github.com/azavea/raster-vision/tree/v0.13.1) |
|--|--:|--:|--:|--:|
| [Design](https://www.brandons.me/blog/libraries-not-frameworks) | [**Lightweight** library](https://github.com/weiji14/zen3geo/blob/v0.5.0/pyproject.toml#L22-L26) | [Heavyweight library](https://github.com/microsoft/torchgeo/blob/v0.3.1/setup.cfg#L27-L66) | [Library](https://github.com/sentinel-hub/eo-learn/tree/v1.2.1#installation) / [Framework](https://github.com/sentinel-hub/eo-learn/blob/v1.2.1/setup.py#L41-L50) | [Framework](https://github.com/azavea/raster-vision/blob/master/requirements.txt#L1-L7) |
| [Paradigm](https://en.wikipedia.org/wiki/Composition_over_inheritance) | **Composition** | Inheritance | Inheritance | Chained-inheritance |
| Data model | **xarray** & geopandas | GeoDataset (numpy & shapely) | EOPatch (numpy & geopandas) | DatasetConfig (numpy & geopandas) |
<small>Relation of GeoML libraries - https://github.com/weiji14/zen3geo/discussions/70</small>
----
### Minimal core, optional extras
| `pip install ...` | Dependencies |
|:-------------------------------|---------------|
| `zen3geo` | rioxarray, torchdata |
| `zen3geo[raster]` | ... + xbatcher |
| `zen3geo[spatial]` | ... + datashader, spatialpandas |
| `zen3geo[stac]` | ... + pystac, pystac-client, stackstac |
| `zen3geo[vector]` | ... + pyogrio[geopandas] |
<small>[_Write libraries, not frameworks_](https://www.brandons.me/blog/libraries-not-frameworks)</small>
----
### Composition over Inheritance
Chaining or 'Pipe'-ing a series of operations, rather than subclassing
<small>E.g. RioXarrayReader - Given a source list of GeoTIFFs, read them into an xarray.DataArray one by one</small>
```python=
from typing import Iterator
import rioxarray
from torchdata.datapipes.iter import IterDataPipe

class RioXarrayReaderIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe: IterDataPipe[str], **kwargs) -> None:
        self.source_datapipe: IterDataPipe[str] = source_datapipe
        self.kwargs = kwargs

    def __iter__(self) -> Iterator:
        for filename in self.source_datapipe:
            yield rioxarray.open_rasterio(filename=filename, **self.kwargs)
```
<small>I/O readers, custom processors, joiners, chippers, batchers, etc. More at https://zen3geo.rtfd.io/en/v0.5.0/api.html</small>
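<small>A minimal sketch of composing the reader above by chaining its registered functional form `read_from_rioxarray` (the filenames are placeholders):</small>
```python
import torchdata
import zen3geo  # importing zen3geo registers the functional DataPipes

dp = torchdata.datapipes.iter.IterableWrapper(iterable=["scene1.tif", "scene2.tif"])
dp_rioxarray = dp.read_from_rioxarray()  # chained, not subclassed
```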
[![Sample data pipeline flowchart.](https://user-images.githubusercontent.com/23487320/200208660-3c48e003-6592-4811-8585-73eac3f2516c.png)](https://zen3geo.readthedocs.io/en/v0.5.0/vector-segmentation-masks.html#combine-and-conquer)
----
### Xarray data model
Labelled multi-dimensional data arrays!
```
<xarray.DataArray (band: 1, y: 2743, x: 3538)>
[9701196 values with dtype=uint16]
Coordinates:
* x (x) float64 2.478e+05 2.479e+05 ... 5.307e+05 5.308e+05
* y (y) float64 3.146e+05 3.145e+05 ... 9.532e+04 9.524e+04
* band (band) int64 1
spatial_ref int64 0
Attributes:
TIFFTAG_DATETIME: 2019:12:16 07:41:53
TIFFTAG_IMAGEDESCRIPTION: Sentinel-1A IW GRD HR L1
TIFFTAG_SOFTWARE: Sentinel-1 IPF 003.10
```
<small>Store multiple bands/variables, time indexes, and metadata!</small>
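<small>A minimal sketch of label-based selection on such an array, assuming the repr above is bound to `dataarray` (coordinate values are illustrative):</small>
```python
# Pick band 1 and crop by coordinate labels (y decreases downwards, so slice high to low)
subset = dataarray.sel(band=1).sel(x=slice(2.5e5, 3.0e5), y=slice(2.0e5, 1.5e5))
print(subset.rio.crs)  # rioxarray keeps the coordinate reference system attached
```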
---
### The features you've been waiting for
| | [zen3geo](https://github.com/weiji14/zen3geo/tree/v0.5.0) | [torchgeo](https://github.com/microsoft/torchgeo/tree/v0.3.1) | [eo-learn](https://github.com/sentinel-hub/eo-learn/tree/v1.3.0) | [raster-vision](https://github.com/azavea/raster-vision/tree/v0.13.1) |
|--|--:|--:|--:|--:|
| Spatiotemporal Asset Catalogs (STAC) | [Yes](https://github.com/weiji14/zen3geo/discussions/48) | [No^](https://github.com/microsoft/torchgeo/issues/403) | No | [Yes?](https://github.com/azavea/raster-vision/pull/1243) |
| BYO custom function | [Easy](https://zen3geo.readthedocs.io/en/v0.5.0/vector-segmentation-masks.html#transform-and-visualize-raster-data) | [Hard](https://torchgeo.readthedocs.io/en/v0.3.1/tutorials/transforms.html) | [Hard](https://eo-learn.readthedocs.io/en/latest/examples/core/CoreOverview.html#EOTask) | [Hard](https://docs.rastervision.io/en/0.13/pipelines.html#rastertransformer) |
----
### Cloud-native geospatial with [STAC](https://stacspec.org)
Standards-based spatiotemporal metadata!
<small>From querying STAC APIs to stacking STAC Items,
stream data directly from cloud to compute!</small>
```mermaid
graph LR
subgraph STAC DataPipeLine
A["IterableWrapper (list[dict])"] --> B
B["PySTACAPISearcher (list[pystac_client.ItemSearch])"] --> C
C["StackstacStacker (list[xarray.DataArray])"]
end
```
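<small>A rough sketch of this chain in code, assuming zen3geo's `search_for_pystac_item` and `stack_stac_items` functional forms (the catalog URL, query values and keyword arguments are illustrative):</small>
```python
# Search a STAC API, collect the matching Items, then stack them into xarray.DataArrays
query = dict(bbox=[103.6, 1.2, 104.0, 1.5], datetime="2022-01-01/2022-12-31",
             collections=["sentinel-1-grd"])
dp = torchdata.datapipes.iter.IterableWrapper(iterable=[query])
dp_search = dp.search_for_pystac_item(
    catalog_url="https://planetarycomputer.microsoft.com/api/stac/v1"
)
dp_items = dp_search.map(fn=lambda search: list(search.items()))  # ItemSearch -> pystac.Item list
dp_stack = dp_items.stack_stac_items(assets=["vh", "vv"], epsg=32647, resolution=30)
```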
<small>More info at https://github.com/weiji14/zen3geo/discussions/48</small>
----
### Transforms as functions, not classes
```python
import numpy as np
import xarray as xr

def linear_to_decibel(dataarray: xr.DataArray) -> xr.DataArray:
    # Mask out areas with 0 so that np.log10 is not undefined
    da_linear = dataarray.where(cond=dataarray != 0)
    da_decibel = 10 * np.log10(da_linear)
    return da_decibel

dp_decibel = dp.map(fn=linear_to_decibel)
```
<small>Do conversions on the original, labelled data structure (`xarray`)
[vs](https://torchgeo.readthedocs.io/en/v0.3.1/tutorials/transforms.html)
on unlabelled tensors (`torch`) via subclassing</small>
```python
import torch
from torch import nn

class LinearToDecibel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        # Replace zeros with NaN so that torch.log10 is not undefined
        tensor_linear = tensor.where(tensor != 0, other=torch.tensor(float("nan")))
        tensor_decibel = 10 * torch.log10(input=tensor_linear)
        return tensor_decibel

dataset_decibel = DatasetClass(..., transforms=LinearToDecibel())
```
---
### Features you never thought you needed
| | [zen3geo](https://github.com/weiji14/zen3geo/tree/v0.5.0) | [torchgeo](https://github.com/microsoft/torchgeo/tree/v0.3.1) | [eo-learn](https://github.com/sentinel-hub/eo-learn/tree/v1.3.0) | [raster-vision](https://github.com/azavea/raster-vision/tree/v0.13.1) |
|--|--:|--:|--:|--:|
| Multi-CRS without reprojection | [Yes](https://zen3geo.readthedocs.io/en/v0.5.0/chipping.html#pool-chips-into-mini-batches) | [No](https://github.com/microsoft/torchgeo/issues/278) | No? | No? |
| Multi-dimensional (beyond 2D+bands) | [Yes](https://zen3geo.readthedocs.io/en/v0.5.0/stacking.html), via xarray | No | No | No |
----
### Multiple coordinate reference systems
*without reprojecting!*
[E.g. if you have many satellite scenes spanning several UTM zones](https://zen3geo.readthedocs.io/en/v0.5.0/chipping.html#pool-chips-into-mini-batches)
```python
import torchdata
import zen3geo  # importing zen3geo registers the functional DataPipes used below

# Pass in a list of Sentinel-2 scenes from different UTM zones
urls = ["S2_52SFB.tif", "S2_53SNU.tif", "S2_54TWN.tif", ...]
dp = torchdata.datapipes.iter.IterableWrapper(iterable=urls)

# Read into xr.DataArray and slice into 32x32 chips
dp_rioxarray = dp.read_from_rioxarray()
dp_xbatcher = dp_rioxarray.slice_with_xbatcher(input_dims={"y": 32, "x": 32})

# Create batches of 10 chips each and shuffle the order
dp_batch = dp_xbatcher.batch(batch_size=10)
dp_shuffle = dp_batch.shuffle()
...
```
<small>Enabled by torchdata's data-agnostic Batcher iterable-style DataPipe</small>
----
### Multiple dimensions
[*stack co-located data*](https://zen3geo.readthedocs.io/en/v0.5.0/stacking.html)
E.g. [time-series data](https://zen3geo.readthedocs.io/en/v0.5.0/stacking.html#sentinel-1-polsar-time-series) or multivariate climate/ocean model outputs
```
<xarray.Dataset>
Dimensions: (time: 15, x: 491, y: 579)
Coordinates:
* time (time) datetime64[ns] 2022-01-30T1...
* x (x) float64 6.039e+05 ... 6.186e+05
* y (y) float64 1.624e+04 ... -1.095e+03
Data variables:
vh (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
vv (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
dem (y, x) float16 dask.array<chunksize=(579, 491), meta=np.ndarray>
Attributes:
spec: RasterSpec(epsg=32647, bounds=(603870, -1110, 618600, 16260)...
crs: epsg:32647
transform: | 30.00, 0.00, 603870.00|\n| 0.00,-30.00, 16260.00|\n| 0.00,...
resolution: 30
```
<small>Enabled by xarray's rich data structure</small>
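<small>A minimal sketch of slicing such a Dataset into spatiotemporal chips, assuming it flows through a DataPipe named `dp_dataset` (chip sizes are illustrative):</small>
```python
# Named dimensions mean chips can span time as well as y/x space
dp_xbatcher = dp_dataset.slice_with_xbatcher(input_dims={"time": 5, "y": 128, "x": 128})
```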
---
### Beyond zen3geo v0.5.0
| | [zen3geo](https://github.com/weiji14/zen3geo/tree/v0.5.0) | [torchgeo](https://github.com/microsoft/torchgeo/tree/v0.3.1) | [eo-learn](https://github.com/sentinel-hub/eo-learn/tree/v1.3.0) | [raster-vision](https://github.com/azavea/raster-vision/tree/v0.13.1) |
|--|--|--|--|--|
| Multi-resolution | DIY, [Yes^](https://github.com/xarray-contrib/xbatcher/issues/93) via [datatree](https://github.com/TomNicholas/datatree) | [No](https://github.com/microsoft/torchgeo/issues/74) | [Yes](https://github.com/sentinel-hub/eo-learn/blob/v1.2.1/geometry/eolearn/geometry/superpixel.py) | No |
| ML Library coupling | Pytorch or [None^](https://github.com/pytorch/data/issues/293) | Pytorch | None | Pytorch, Tensorflow (DIY) |
----
### Multiple spatiotemporal resolutions
*10m, 20m, 60m, ...*
- Handle in xbatcher via datatree (see the sketch below) - https://github.com/xarray-contrib/xbatcher/issues/93
- Allows for:
- Super-resolution tasks
- Multimodal sensors (Optical/SAR/Gravity/etc)
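<small>A rough sketch of a multi-resolution container using datatree (group names, variables and sizes below are hypothetical; xbatcher support is still being designed in the linked issue):</small>
```python
import datatree
import numpy as np
import xarray as xr

# Hypothetical Sentinel-2 style groups, one per native resolution over the same extent
tree = datatree.DataTree.from_dict({
    "r10m": xr.Dataset({"b04": xr.DataArray(np.zeros((600, 600)), dims=("y", "x"))}),
    "r20m": xr.Dataset({"b11": xr.DataArray(np.zeros((300, 300)), dims=("y", "x"))}),
    "r60m": xr.Dataset({"b01": xr.DataArray(np.zeros((100, 100)), dims=("y", "x"))}),
})
```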
----
### From `Pytorch` to `None`
Once `torchdata` becomes standalone from `pytorch`
(see https://github.com/pytorch/data/issues/293)
<small>Dropping the `torch` dependency frees up space for other ML libraries, e.g. cuML, Tensorflow, etc.</small>
> [Perfection is achieved,
> not when there is nothing more to add,
> but when there is nothing left to take away](https://www.goodreads.com/quotes/19905-perfection-is-achieved-not-when-there-is-nothing-more-to)
---
### Unleash your :earth_asia: data
- Install - https://pypi.org/project/zen3geo
- Discuss - https://github.com/weiji14/zen3geo/discussions
- Contribute - https://zen3geo.readthedocs.io/en/v0.5.0/CONTRIBUTING.html
> 水涨船高,泥多佛大
> <small>Rising water lifts all boats, more clay makes for a bigger statue</small>
{"metaMigratedAt":"2023-06-17T11:38:01.980Z","metaMigratedFrom":"YAML","breaks":true,"description":"Pangeo Machine Learning working group presentation","slideOptions":"{\"theme\":\"simple\",\"width\":\"80%\"}","title":"zen3geo: Guiding Earth Observation data on its path to enlightenment","showTags":"false","lang":"en-NZ","contributors":"[{\"id\":\"c1f3f3d8-2cb7-4635-9d54-f8f7487d0956\",\"add\":29433,\"del\":17795}]"}