Intake-STAC Design Doc
======================
Authors: Joe Hamman, Scott Henderson
Date: Started February 26, 2019
*TL;DR: intake-stac is an intake plugin for accessing datasets described using the [SpatioTemporal Asset Catalog (STAC) specification](https://github.com/radiantearth/stac-spec).*
## Background and High-level Goals
- `STAC` is a simple catalog format that is finding wide adoption in the remote sensing world, especially for datasets stored in the Cloud.
- `Intake` is a lightweight Python package for finding, investigating, loading, and disseminating data.
The goal of `Intake-STAC` is to facilitate lazy loading of remote sensing datasets stored on remote servers into xarray datasets for analysis with Python.
### Intake-STAC should:
- ingest STAC catalogs, providing a mapping from STAC to intake catalog formats (see https://github.com/sat-utils/sat-stac)
- support queries against STAC metadata (see https://github.com/sat-utils/sat-api and https://github.com/sat-utils/sat-search)
- support reading data using the intake-xarray plugin
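
As a rough illustration of the first goal, the sketch below maps the assets of a single STAC item onto intake-style source entries. It is not the intake-stac implementation: the function name `item_to_intake_sources`, the extension-based driver choice, and the example item (including its placeholder URLs) are assumptions made for illustration. The driver names `rasterio`, `zarr`, and `netcdf` are the ones provided by intake-xarray.

```python
def item_to_intake_sources(item):
    """Map the assets of one STAC item (a GeoJSON Feature dict) to intake-style sources."""
    # Assumption: choose an intake-xarray driver from the file extension.
    drivers = {".tif": "rasterio", ".tiff": "rasterio", ".zarr": "zarr", ".nc": "netcdf"}
    sources = {}
    for key, asset in item.get("assets", {}).items():
        href = asset["href"]
        driver = next((d for ext, d in drivers.items() if href.lower().endswith(ext)), None)
        if driver is None:
            continue  # skip assets with no known lazy loader (thumbnails, text metadata, ...)
        sources[f"{item['id']}_{key}"] = {
            "driver": driver,
            "description": asset.get("title", key),
            "args": {"urlpath": href},
            "metadata": dict(item.get("properties", {})),
        }
    return {"sources": sources}


# Hypothetical item, trimmed to the fields used above; the URLs are placeholders.
example_item = {
    "type": "Feature",
    "id": "LC80010012019001",
    "properties": {"datetime": "2019-01-01T10:00:00Z", "eo:cloud_cover": 12},
    "assets": {
        "B4": {"href": "https://example.com/LC80010012019001_B4.TIF", "title": "Band 4 (red)"},
        "thumbnail": {"href": "https://example.com/LC80010012019001_thumb.jpg"},
    },
}

catalog = item_to_intake_sources(example_item)
```

The returned dictionary follows the `sources:` layout of an intake YAML catalog, so it could be dumped to a file and opened with intake. Carrying the item's `properties` along as source metadata keeps the STAC fields available for later filtering.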
## Design Philosophy
- lightweight
- let other tools do the heavy lifting (intake, xarray, s3fs, etc.)
- provide for anticipated/common query patterns
    - what types of data are available in this bounding box and time period?
    - narrow by source name ('MODIS')
    - narrow by data type ('MSLA')
    - what sources provide data of such type?
    - what types are provided by source?
    - coincidence
    - data quality and density
- Intake-STAC access automatically appears in a Pangeo JupyterHub; does this make sense? If so, what does it look like?
- where Intake-STAC functionality ends, suggest where to go next:
    - provide FAQ pointers to stable external resources (Jake's book, Scott's Binder notebooks, etc.)
- Interop w/ DataCite DOIs
- Makes data discoverable through Google
- First goal: Don't replicate what already exists
- Second goal: Take advantage of this project
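
The bounding-box and time-period query pattern listed above could be sketched over plain STAC item dictionaries as follows. This is only a minimal illustration, not how sat-search or intake-stac performs queries: the helper names, the `source` property, and the sample items are hypothetical, and timestamps are assumed to have no fractional seconds.

```python
from datetime import datetime


def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap (antimeridian not handled)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_utc(stamp):
    # Assumption: timestamps look like "2018-06-01T12:00:00Z" (no fractional seconds).
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def search(items, bbox, start, end, source=None):
    """Return ids of items that overlap bbox, fall within [start, end], and match source."""
    hits = []
    for item in items:
        when = parse_utc(item["properties"]["datetime"])
        if not bbox_intersects(item["bbox"], bbox):
            continue
        if not (parse_utc(start) <= when <= parse_utc(end)):
            continue
        # "source" is a hypothetical property name used only for this sketch.
        if source is not None and item["properties"].get("source") != source:
            continue
        hits.append(item["id"])
    return hits


# Hypothetical items, trimmed to the fields the search needs.
items = [
    {"id": "a", "bbox": [-123, 46, -121, 48],
     "properties": {"datetime": "2018-06-01T12:00:00Z", "source": "MODIS"}},
    {"id": "b", "bbox": [10, 50, 12, 52],
     "properties": {"datetime": "2018-06-15T12:00:00Z", "source": "landsat8"}},
    {"id": "c", "bbox": [-122.5, 47, -122, 47.5],
     "properties": {"datetime": "2016-01-01T00:00:00Z", "source": "MODIS"}},
]

query_bbox = [-122.6, 47.0, -122.0, 47.8]
in_window = search(items, query_bbox, "2017-01-01T00:00:00Z", "2019-01-01T00:00:00Z")
```

Narrowing by source name is then one extra keyword argument, for example `search(items, query_bbox, start, end, source="MODIS")`.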
## Scope questions
- do we limit to data stored on AWS/GCS/Azure? current STAC implementations are limited compared to archives on gov servers: https://github.com/radiantearth/stac-spec/blob/master/implementations.md
- will intake-stac support transformations from gov servers to archives of convenience (e.g. COG or Zarr on S3)?
- Element 84 has put together CMR search, which catalogs NASA's entire archive. CMR queries can return STAC catalogs, but the STAC version needs updating; maybe incorporate directly into CMR? https://github.com/Element84/cmr-stac-api-proxy
## Technical Design
- what we want:
```python
import intake  # the calls below sketch the API we want; they do not exist yet

# converting to intake catalog will enable intake tools such as gui browser
cat = intake.StacCatalog('landsat8-aws.json')
# or leverage existing tools such as sat-api/sat-search
cat = intake.StacSearch(collection='landsat8', bbox=[], datetime='2017/2019')
cat.filter(bands=['red','green','nir'], cloudcover=20)
# need to share STAC catalogs with colleagues / reproduce work later
cat.to_file('my-catalog.json')
# would be great to explore metadata as geopandas geodataframe
df = cat.to_dataframe()
# for archives on gov servers or legacy formats
cat.to_archive_of_convenience(s3bucket, awscredentials)
# currently sat-utils allows data download, but not lazy loading via xarray:
ds = cat.to_dask()
# default plots with geoviews?
cat.plot.thumbnails()
```
- currently, it takes lots of manual functions to get remote sensing time series into xarray datasets (even w/ intake): https://nbviewer.jupyter.org/github/scottyhq/pangeo-binder-test/blob/master/notebooks/3-intake-stac-landsat.ipynb
- challenges:
    - the STAC spec is changing rapidly, so intake-stac versions should match STAC spec versions (currently 0.6.1)
    - STAC `item` assets can be any format (not just COG or Zarr)
    - what to do with complex NASA HDF data?
    - I suspect we will need a 'plugin' system for subcatalogs that define options for every satellite / sensor (e.g. landsat8.yml, sentinel2.yml, modis.yml, sentinel.yml). This is what will specify defaults and parameters for the to_dask() function.
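
One way the per-sensor 'plugin' idea might look is sketched below: a registry of defaults per sensor, with user overrides merged on top before loading. Everything here is an assumption for illustration. The dictionary stands in for files such as `landsat8.yml`, the option names (`driver`, `bands`, `chunks`) and their values are made up, and `resolve_load_options` is a hypothetical helper, not part of intake or intake-stac.

```python
import copy

# Hypothetical per-sensor defaults; in the design above these would live in files
# such as landsat8.yml or sentinel2.yml. All values are illustrative only.
SENSOR_DEFAULTS = {
    "landsat8": {
        "driver": "rasterio",
        "bands": ["B4", "B3", "B2"],
        "chunks": {"x": 2048, "y": 2048},
    },
    "sentinel2": {
        "driver": "rasterio",
        "bands": ["B04", "B03", "B02"],
        "chunks": {"x": 1024, "y": 1024},
    },
}


def resolve_load_options(sensor, **overrides):
    """Return the sensor's default load options with user overrides applied on top."""
    if sensor not in SENSOR_DEFAULTS:
        raise KeyError(f"no defaults registered for sensor {sensor!r}")
    # Deep copy so callers can mutate the result without changing the registry.
    options = copy.deepcopy(SENSOR_DEFAULTS[sensor])
    options.update(overrides)
    return options


opts = resolve_load_options("landsat8", bands=["B5"])
```

Under this sketch, a `to_dask()` call would consult the resolved options, so supporting a new sensor means adding one defaults entry rather than changing loading code.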