zen3geo: Guiding Earth Observation data on its path to enlightenment

zen3geo: Guiding data on its path to enlightenment

Pangeo Machine Learning working group presentation
Monday 7 Nov 2022, 17:00-17:15 (UTC)

by Wei Ji Leong

P.S. Slides are at https://hackmd.io/@weiji14/2022zen3geo

Why not zen3geo

Yet another GeoML library

It thinks spatial is special

No data visualization tools

Why zen3geo

Worse is better

Simple is better than complex

Let each part do one thing and do it well

Design patterns

	zen3geo	torchgeo	eo-learn	raster-vision
Design	Lightweight library	Heavyweight library	Library / Framework	Framework
Paradigm	Composition	Inheritance	Inheritance	Chained-inheritance
Data model	xarray & geopandas	GeoDataset (numpy & shapely)	EOPatch (numpy & geopandas)	DatasetConfig (numpy & geopandas)

Relation of GeoML libraries - https://github.com/weiji14/zen3geo/discussions/70

Minimal core, optional extras

`pip install ...`	Dependencies
`zen3geo`	rioxarray, torchdata
`zen3geo[raster]`	… + xbatcher
`zen3geo[spatial]`	… + datashader, spatialpandas
`zen3geo[stac]`	… + pystac, pystac-client, stackstac
`zen3geo[vector]`	… + pyogrio[geopandas]

Write libraries, not frameworks

Composition over Inheritance

Chaining or 'Pipe'-ing a series of operations, rather than subclassing

E.g. RioXarrayReader - Given a source list of GeoTIFFs, read them into an xarray.DataArray one by one








class RioXarrayReaderIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe: IterDataPipe[str] = source_datapipe
        self.kwargs = kwargs
    
    def __iter__(self) -> Iterator:
        for filename in self.source_datapipe:
            yield rioxarray.open_rasterio(filename=filename, **self.kwargs)

I/O readers, custom processors, joiners, chippers, batchers, etc. More at https://zen3geo.rtfd.io/en/v0.5.0/api.html

Xarray data model

Labelled multi-dimensional data arrays!

<xarray.DataArray (band: 1, y: 2743, x: 3538)>
[9701196 values with dtype=uint16]
Coordinates:
  * x            (x) float64 2.478e+05 2.479e+05 ... 5.307e+05 5.308e+05
  * y            (y) float64 3.146e+05 3.145e+05 ... 9.532e+04 9.524e+04
  * band         (band) int64 1
    spatial_ref  int64 0
Attributes:
    TIFFTAG_DATETIME:          2019:12:16 07:41:53
    TIFFTAG_IMAGEDESCRIPTION:  Sentinel-1A IW GRD HR L1
    TIFFTAG_SOFTWARE:          Sentinel-1 IPF 003.10

Store multiple bands/variables, time indexes, and metadata!

The features you've been waiting for

	zen3geo	torchgeo	eo-learn	raster-vision
Spatiotemporal Asset Catalogs (STAC)	Yes	No^	No	Yes?
BYO custom function	Easy	Hard	Hard	Hard

Cloud-native geospatial with STAC

Standards based spatiotemporal metadata!

From querying STAC APIs to stacking STAC Items,
stream data directly from cloud to compute!

graph LR
    subgraph STAC DataPipeLine

    A["IterableWrapper (list[dict])"] --> B
    B["PySTACAPISearcher (list[pystac_client.ItemSearch])"] --> C
    C["StackstacStacker (list[xarray.DataArray])"]

    end

More info at https://github.com/weiji14/zen3geo/discussions/48

Transforms as functions, not classes

def linear_to_decibel(dataarray: xr.DataArray) -> xr.DataArray:
    # Mask out areas with 0 so that np.log10 is not undefined
    da_linear = dataarray.where(cond=dataarray != 0)
    da_decibel = 10 * np.log10(da_linear)
    return da_decibel

dp_decibel = dp.map(fn=linear_to_decibel)

Do conversions in original data structure (xarray)
vs
on tensors with no labels (torch) via subclassing

class LinearToDecibel(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        tensor_linear = tensor.where(tensor != 0, other=torch.FloatTensor([torch.nan]))
        tensor_decibel = 10 * torch.log10(input=tensor_linear)
        return tensor_decibel
        
dataset_decibel = DatasetClass(..., transforms=LinearToDecibel())

Features you never thought you needed

	zen3geo	torchgeo	eo-learn	raster-vision
Multi-CRS without reprojection	Yes	No	No?	No?
Multi-dimensional (beyond 2D+bands)	Yes, via xarray	No	No	No

Multiple coordinate reference systems

without reprojecting!

E.g. if you have many satellite scenes spanning several UTM zones

# Pass in list of Sentinel-2 scenes from different UTM zones
urls = ["S2_52SFB.tif", "S2_53SNU.tif", "S2_54TWN.tif", ...]
dp = torchdata.datapipes.iter.IterableWrapper(iterable=urls)

# Read into xr.DataArray and slice into 32x32 chips
dp_rioxarray = dp.read_from_rioxarray()
dp_xbatcher = dp_rioxarray.slice_with_xbatcher(input_dims={"y": 32, "x": 32})
# Create batches of 10 chips each and shuffle the order
dp_batch = dp_xbatcher.batch(batch_size=10)
dp_shuffle = dp_batch.shuffle()
...

Enabled by torchdata's data agnostic Batcher iterable-style DataPipe

Multiple dimensions

stack co-located data

E.g. time-series data or multivariate climate/ocean model outputs

<xarray.Dataset>
Dimensions:     (time: 15, x: 491, y: 579)
Coordinates:    
  * time        (time) datetime64[ns] 2022-01-30T1...
  * x           (x) float64 6.039e+05 ... 6.186e+05
  * y           (y) float64 1.624e+04 ... -1.095e+03
Data variables: 
    vh          (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
    vv          (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
    dem         (y, x) float16 dask.array<chunksize=(579, 491), meta=np.ndarray>
Attributes:
    spec:        RasterSpec(epsg=32647, bounds=(603870, -1110, 618600, 16260)...
    crs:         epsg:32647
    transform:   | 30.00, 0.00, 603870.00|\n| 0.00,-30.00, 16260.00|\n| 0.00,...
    resolution:  30

Enabled by xarray's rich data structure

Beyond zen3geo v0.5.0

	zen3geo	torchgeo	eo-learn	raster-vision
Multi-resolution	DIY, Yes^ via datatree	No	Yes	No
ML Library coupling	Pytorch or None^	Pytorch	None	Pytorch, Tensorflow (DIY)

Multiple spatiotemporal resolutions

10m, 20m, 60m, …

Handle in xbatcher via datatree - https://github.com/xarray-contrib/xbatcher/issues/93
Allows for:
- Super-resolution tasks
- Multimodal sensors (Optical/SAR/Gravity/etc)

From `Pytorch` to `None`

Once torchdata becomes standalone from pytorch,
see https://github.com/pytorch/data/issues/293

Dropping the torch dependency frees up space for other ML libraries - e.g. cuML, Tensorflow, etc

Perfection is achieved,
not when there is nothing more to add,
but when there is nothing left to take away

Unleash your data

Install - https://pypi.org/project/zen3geo
Discuss - https://github.com/weiji14/zen3geo/discussions
Contribute - https://zen3geo.readthedocs.io/en/v0.5.0/CONTRIBUTING.html

水涨船高，泥多佛大
Rising water lifts all boats, more clay makes for a bigger statue

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.

zen3geo: Guiding data on its path to enlightenment

Why not zen3geo

Why zen3geo

Design patterns

Minimal core, optional extras

Composition over Inheritance

Xarray data model

The features you've been waiting for

Cloud-native geospatial with STAC

Transforms as functions, not classes

Features you never thought you needed

Multiple coordinate reference systems

Multiple dimensions

Beyond zen3geo v0.5.0

Multiple spatiotemporal resolutions

From Pytorch to None

Unleash your data

From `Pytorch` to `None`