Pangeo Machine Learning working group presentation
Monday 7 Nov 2022, 17:00-17:15 (UTC)
by Wei Ji Leong
P.S. Slides are at https://hackmd.io/@weiji14/2022zen3geo
![]() |
torchgeo | eo-learn | raster-vision | |
---|---|---|---|---|
Design | Lightweight library | Heavyweight library | Library / Framework | Framework |
Paradigm | Composition | Inheritance | Inheritance | Chained-inheritance |
Data model | xarray & geopandas | GeoDataset (numpy & shapely) | EOPatch (numpy & geopandas) | DatasetConfig (numpy & geopandas) |
Relation of GeoML libraries - https://github.com/weiji14/zen3geo/discussions/70
pip install ... |
Dependencies |
---|---|
zen3geo |
rioxarray, torchdata |
zen3geo[raster] |
… + xbatcher |
zen3geo[spatial] |
… + datashader, spatialpandas |
zen3geo[stac] |
… + pystac, pystac-client, stackstac |
zen3geo[vector] |
… + pyogrio[geopandas] |
Chaining or 'Pipe'-ing a series of operations, rather than subclassing
E.g. RioXarrayReader - Given a source list of GeoTIFFs, read them into an xarray.DataArray one by one
class RioXarrayReaderIterDataPipe(IterDataPipe): def __init__(self, source_datapipe, **kwargs) -> None: self.source_datapipe: IterDataPipe[str] = source_datapipe self.kwargs = kwargs def __iter__(self) -> Iterator: for filename in self.source_datapipe: yield rioxarray.open_rasterio(filename=filename, **self.kwargs)
I/O readers, custom processors, joiners, chippers, batchers, etc. More at https://zen3geo.rtfd.io/en/v0.5.0/api.html
Labelled multi-dimensional data arrays!
<xarray.DataArray (band: 1, y: 2743, x: 3538)>
[9701196 values with dtype=uint16]
Coordinates:
* x (x) float64 2.478e+05 2.479e+05 ... 5.307e+05 5.308e+05
* y (y) float64 3.146e+05 3.145e+05 ... 9.532e+04 9.524e+04
* band (band) int64 1
spatial_ref int64 0
Attributes:
TIFFTAG_DATETIME: 2019:12:16 07:41:53
TIFFTAG_IMAGEDESCRIPTION: Sentinel-1A IW GRD HR L1
TIFFTAG_SOFTWARE: Sentinel-1 IPF 003.10
Store multiple bands/variables, time indexes, and metadata!
Standards based spatiotemporal metadata!
From querying STAC APIs to stacking STAC Items,
stream data directly from cloud to compute!
graph LR
subgraph STAC DataPipeLine
A["IterableWrapper (list[dict])"] --> B
B["PySTACAPISearcher (list[pystac_client.ItemSearch])"] --> C
C["StackstacStacker (list[xarray.DataArray])"]
end
More info at https://github.com/weiji14/zen3geo/discussions/48
def linear_to_decibel(dataarray: xr.DataArray) -> xr.DataArray:
# Mask out areas with 0 so that np.log10 is not undefined
da_linear = dataarray.where(cond=dataarray != 0)
da_decibel = 10 * np.log10(da_linear)
return da_decibel
dp_decibel = dp.map(fn=linear_to_decibel)
Do conversions in original data structure (xarray
)
vs
on tensors with no labels (torch
) via subclassing
class LinearToDecibel(nn.Module):
def __init__(self):
super().__init__()
def forward(self, tensor: torch.Tensor) -> torch.Tensor:
tensor_linear = tensor.where(tensor != 0, other=torch.FloatTensor([torch.nan]))
tensor_decibel = 10 * torch.log10(input=tensor_linear)
return tensor_decibel
dataset_decibel = DatasetClass(..., transforms=LinearToDecibel())
without reprojecting!
E.g. if you have many satellite scenes spanning several UTM zones
# Pass in list of Sentinel-2 scenes from different UTM zones
urls = ["S2_52SFB.tif", "S2_53SNU.tif", "S2_54TWN.tif", ...]
dp = torchdata.datapipes.iter.IterableWrapper(iterable=urls)
# Read into xr.DataArray and slice into 32x32 chips
dp_rioxarray = dp.read_from_rioxarray()
dp_xbatcher = dp_rioxarray.slice_with_xbatcher(input_dims={"y": 32, "x": 32})
# Create batches of 10 chips each and shuffle the order
dp_batch = dp_xbatcher.batch(batch_size=10)
dp_shuffle = dp_batch.shuffle()
...
Enabled by torchdata's data agnostic Batcher iterable-style DataPipe
E.g. time-series data or multivariate climate/ocean model outputs
<xarray.Dataset>
Dimensions: (time: 15, x: 491, y: 579)
Coordinates:
* time (time) datetime64[ns] 2022-01-30T1...
* x (x) float64 6.039e+05 ... 6.186e+05
* y (y) float64 1.624e+04 ... -1.095e+03
Data variables:
vh (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
vv (time, y, x) float16 dask.array<chunksize=(1, 579, 491), meta=np.ndarray>
dem (y, x) float16 dask.array<chunksize=(579, 491), meta=np.ndarray>
Attributes:
spec: RasterSpec(epsg=32647, bounds=(603870, -1110, 618600, 16260)...
crs: epsg:32647
transform: | 30.00, 0.00, 603870.00|\n| 0.00,-30.00, 16260.00|\n| 0.00,...
resolution: 30
Enabled by xarray's rich data structure
10m, 20m, 60m, …
Handle in xbatcher via datatree - https://github.com/xarray-contrib/xbatcher/issues/93
Allows for:
Pytorch
to None
Once torchdata
becomes standalone from pytorch
,
see https://github.com/pytorch/data/issues/293
Dropping the torch
dependency frees up space for other ML libraries - e.g. cuML, Tensorflow, etc
水涨船高,泥多佛大
Rising water lifts all boats, more clay makes for a bigger statue