changed 2 years ago
Published Linked with GitHub

:yin_yang: zen3geo: Guiding :earth_asia: data on its path to enlightenment

Pangeo Machine Learning working group presentation
Monday 7 Nov 2022, 17:00-17:15 (UTC)

by Wei Ji Leong

P.S. Slides are at https://hackmd.io/@weiji14/2022zen3geo


Why not zen3geo

Yet another GeoML library

It thinks spatial is special

No data visualization tools


Why zen3geo

Worse is better

Simple is better than complex

Let each part do one thing and do it well


Design patterns

:yin_yang: zen3geo torchgeo eo-learn raster-vision
Design Lightweight library Heavyweight library Library / Framework Framework
Paradigm Composition Inheritance Inheritance Chained-inheritance
Data model xarray & geopandas GeoDataset (numpy & shapely) EOPatch (numpy & geopandas) DatasetConfig (numpy & geopandas)

Relation of GeoML libraries - https://github.com/weiji14/zen3geo/discussions/70


Minimal core, optional extras

pip install ... Dependencies
zen3geo rioxarray, torchdata
zen3geo[raster] + xbatcher
zen3geo[spatial] + datashader, spatialpandas
zen3geo[stac] + pystac, pystac-client, stackstac
zen3geo[vector] + pyogrio[geopandas]

Write libraries, not frameworks


Composition over Inheritance

Chaining or 'Pipe'-ing a series of operations, rather than subclassing

E.g. RioXarrayReader - Given a source list of GeoTIFFs, read them into an xarray.DataArray one by one

class RioXarrayReaderIterDataPipe(IterDataPipe): def __init__(self, source_datapipe, **kwargs) -> None: self.source_datapipe: IterDataPipe[str] = source_datapipe self.kwargs = kwargs def __iter__(self) -> Iterator: for filename in self.source_datapipe: yield rioxarray.open_rasterio(filename=filename, **self.kwargs)

I/O readers, custom processors, joiners, chippers, batchers, etc. More at https://zen3geo.rtfd.io/en/v0.5.0/api.html

Sample data pipeline flowchart.


Xarray data model

Labelled multi-dimensional data arrays!

<xarray.DataArray (band: 1, y: 2743, x: 3538)>
[9701196 values with dtype=uint16]
Coordinates:
  * x            (x) float64 2.478e+05 2.479e+05 ... 5.307e+05 5.308e+05
  * y            (y) float64 3.146e+05 3.145e+05 ... 9.532e+04 9.524e+04
  * band         (band) int64 1
    spatial_ref  int64 0
Attributes:
    TIFFTAG_DATETIME:          2019:12:16 07:41:53
    TIFFTAG_IMAGEDESCRIPTION:  Sentinel-1A IW GRD HR L1
    TIFFTAG_SOFTWARE:          Sentinel-1 IPF 003.10

Store multiple bands/variables, time indexes, and metadata!


The features you've been waiting for

zen3geo torchgeo eo-learn raster-vision
Spatiotemporal Asset Catalogs (STAC) Yes No^ No Yes?
BYO custom function Easy Hard Hard Hard

Cloud-native geospatial with STAC

Standards based spatiotemporal metadata!

From querying STAC APIs to stacking STAC Items,
stream data directly from cloud to compute!

graph LR
    subgraph STAC DataPipeLine

    A["IterableWrapper (list[dict])"] --> B
    B["PySTACAPISearcher (list[pystac_client.ItemSearch])"] --> C
    C["StackstacStacker (list[xarray.DataArray])"]

    end

More info at https://github.com/weiji14/zen3geo/discussions/48


Transforms as functions, not classes

def linear_to_decibel(dataarray: xr.DataArray) -> xr.DataArray:
    # Mask out areas with 0 so that np.log10 is not undefined
    da_linear = dataarray.where(cond=dataarray != 0)
    da_decibel = 10 * np.log10(da_linear)
    return da_decibel

dp_decibel = dp.map(fn=linear_to_decibel)

Do conversions in original data structure (xarray)
vs
on tensors with no labels (torch) via subclassing

class LinearToDecibel(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        tensor_linear = tensor.where(tensor != 0, other=torch.FloatTensor([torch.nan]))
        tensor_decibel = 10 * torch.log10(input=tensor_linear)
        return tensor_decibel
        
dataset_decibel = DatasetClass(..., transforms=LinearToDecibel())

Features you never thought you needed

zen3geo torchgeo eo-learn raster-vision
Multi-CRS without reprojection Yes No No? No?
Multi-dimensional (beyond 2D+bands) Yes, via xarray No No No

Multiple coordinate reference systems

without reprojecting!

E.g. if you have many satellite scenes spanning several UTM zones

# Pass in list of Sentinel-2 scenes from different UTM zones
urls = ["S2_52SFB.tif", "S2_53SNU.tif", "S2_54TWN.tif", ...]
dp = torchdata.datapipes.iter.IterableWrapper(iterable=urls)

# Read into xr.DataArray and slice into 32x32 chips
dp_rioxarray = dp.read_from_rioxarray()
dp_xbatcher = dp_rioxarray.slice_with_xbatcher(input_dims={"y": 32, "x": 32})
# Create batches of 10 chips each and shuffle the order
dp_batch = dp_xbatcher.batch(batch_size=10)
dp_shuffle = dp_batch.shuffle()
...

Enabled by torchdata's data agnostic Batcher iterable-style DataPipe


Multiple dimensions

stack co-located data

E.g. time-series data or multivariate climate/ocean model outputs

<xarray.Dataset>
Dimensions:     (time: 15, x: 491, y: 579)
Coordinates:    
  * time        (time) datetime64[ns] 2022-01-30T1...
  * x           (x) float64 6.039e+05 ... 6.186e+05
  * y           (y) float64 1.624e+04 ... -1.095e+03
Data variables: 
    vh          (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
    vv          (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
    dem         (y, x) float16 dask.array<chunksize=(579, 491), meta=np.ndarray>
Attributes:
    spec:        RasterSpec(epsg=32647, bounds=(603870, -1110, 618600, 16260)...
    crs:         epsg:32647
    transform:   | 30.00, 0.00, 603870.00|\n| 0.00,-30.00, 16260.00|\n| 0.00,...
    resolution:  30

Enabled by xarray's rich data structure


Beyond zen3geo v0.5.0

zen3geo torchgeo eo-learn raster-vision
Multi-resolution DIY, Yes^ via datatree No Yes No
ML Library coupling Pytorch or None^ Pytorch None Pytorch, Tensorflow (DIY)

Multiple spatiotemporal resolutions

10m, 20m, 60m,


From Pytorch to None

Once torchdata becomes standalone from pytorch,
see https://github.com/pytorch/data/issues/293

Dropping the torch dependency frees up space for other ML libraries - e.g. cuML, Tensorflow, etc

Perfection is achieved,
not when there is nothing more to add,
but when there is nothing left to take away


Unleash your :earth_asia: data

水涨船高,泥多佛大
Rising water lifts all boats, more clay makes for a bigger statue

Select a repo