Intake-STAC Design Doc

Authors: Joe Hamman, Scott Henderson

Date: Started February 26, 2019

TL;DR: intake-stac is an Intake plugin for accessing datasets described using the SpatioTemporal Asset Catalog (STAC) specification.

Background and High-level Goals

  • STAC is a simple catalog format that is finding wide adoption in the remote sensing world, especially for datasets stored in the cloud.

  • Intake is a lightweight Python package for finding, investigating, loading, and disseminating data.

The goal of Intake-STAC is to facilitate lazy loading of remote sensing datasets stored on remote servers into xarray datasets for analysis with Python.

Intake-STAC should:

Design Philosophy

  • lightweight
  • let other tools do the heavy lifting (intake, xarray, s3fs, etc.)
  • provide for anticipated/common query patterns
    • what types of data are available in this bounding box and time period?
      • narrow by source name ('MODIS')
      • narrow by data type ('MSLA')
    • what sources provide data of such type
    • what types are provided by source
    • coincidence
    • data quality and density
  • Intake-STAC access automatically appears in a Pangeo JupyterHub; does this make sense? If so, what would it look like?
  • where Intake-STAC functionality ends, suggest where to go next:
    • provide FAQ pointers to stable external resources - Jake's book, Scott's Binder notebooks, etc.
  • Interop w/ DataCite DOIs
    • Makes data discoverable through Google
    • First goal: Don't replicate existing
    • Second goal: Take advantage of this project
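
The first query pattern listed above (what is available in this bounding box and time period, optionally narrowed by source) can be sketched in plain Python. This is only an illustration of the filtering logic, not intake-stac code: it assumes items are already loaded as dicts carrying the STAC "bbox" and "properties.datetime" fields, and the "source" property and the search_items helper are hypothetical names made up for this sketch.

```python
from datetime import datetime, timezone


def bbox_intersects(a, b):
    """Return True if two [west, south, east, north] boxes overlap.

    Boxes that cross the antimeridian are not handled in this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_time(value):
    """Parse a UTC timestamp such as '2018-06-01T18:30:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def search_items(items, bbox, start, end, source=None):
    """Select items that overlap `bbox` and satisfy start <= datetime < end."""
    selected = []
    for item in items:
        props = item["properties"]
        if not bbox_intersects(item["bbox"], bbox):
            continue
        if not (start <= parse_time(props["datetime"]) < end):
            continue
        # "source" is a hypothetical property name used only for illustration.
        if source is not None and props.get("source") != source:
            continue
        selected.append(item)
    return selected


# Made-up items for demonstration; real items would come from a STAC catalog.
items = [
    {"id": "scene-a", "bbox": [-123.0, 46.0, -121.0, 48.0],
     "properties": {"datetime": "2018-06-01T18:30:00Z", "source": "MODIS"}},
    {"id": "scene-b", "bbox": [10.0, 40.0, 12.0, 42.0],
     "properties": {"datetime": "2018-06-02T10:00:00Z", "source": "MODIS"}},
    {"id": "scene-c", "bbox": [-122.5, 47.0, -122.0, 47.5],
     "properties": {"datetime": "2016-01-15T18:30:00Z", "source": "landsat8"}},
]
start = datetime(2017, 1, 1, tzinfo=timezone.utc)
end = datetime(2019, 1, 1, tzinfo=timezone.utc)
hits = search_items(items, [-124.0, 45.0, -120.0, 49.0], start, end, source="MODIS")
```

In practice this filtering would more likely be delegated to an existing search service or library, in keeping with the "let other tools do the heavy lifting" principle; the sketch only pins down what the query means.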

Scope questions

Technical Design

  • what we want:
# converting to intake catalog will enable intake tools such as gui browser
cat = intake.StacCatalog('landsat8-aws.json')

# or leverage existing tools such as sat-api/sat-search
cat = intake.StacSearch(collection='landsat8', bbox=[], datetime='2017/2019')
cat.filter(bands=['red','green','nir'], cloudcover=20)

# need to share STAC catalogs with colleagues / reproduce work later
cat.to_file('my-catalog.json') 

# would be great to explore metadata as geopandas geodataframe
df = cat.to_dataframe()

# for archives on gov servers or legacy formats
cat.to_archive_of_convenience(s3bucket, awscredentials)

# currently sat-utils allows data download, but not lazy loading via xarray:
ds = cat.to_dask()

# default plots with geoviews?
cat.plot.thumbnails()
  • currently, it takes many hand-written functions to get remote sensing time series into xarray datasets (even w/ intake): https://nbviewer.jupyter.org/github/scottyhq/pangeo-binder-test/blob/master/notebooks/3-intake-stac-landsat.ipynb

  • challenges:

    • the STAC spec is changing rapidly, so intake-stac versions should track STAC spec versions (currently 0.6.1)
    • STAC item assets can be in any format (not just COG or Zarr)
      • what to do with complex NASA HDF data?
    • I suspect we will need a 'plugin' system for subcatalogs that define options for every satellite / sensor (e.g. landsat8.yml, sentinel2.yml, modis.yml, sentinel.yml). These subcatalogs would specify the defaults and parameters for the to_dask() function.
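
Two of the challenges above, tracking the STAC spec version and keeping per-sensor defaults for to_dask(), could look roughly like the sketch below. Everything here is an assumption for discussion: the SENSOR_DEFAULTS registry stands in for files such as landsat8.yml, the option names (bands, chunks) and their values are invented, and check_stac_version and resolve_load_options are hypothetical helpers rather than part of any existing API. The only thing taken from the STAC spec is that a catalog carries a "stac_version" field.

```python
import copy

# Version of the STAC spec this sketch pretends to support.
SUPPORTED_STAC_VERSION = "0.6.1"

# Hypothetical per-sensor defaults; in the design above these would be read
# from subcatalog files. Keys and values are invented for illustration.
SENSOR_DEFAULTS = {
    "landsat8": {"bands": ["red", "green", "nir"],
                 "chunks": {"x": 2048, "y": 2048}},
    "sentinel2": {"bands": ["B04", "B03", "B08"],
                  "chunks": {"x": 1024, "y": 1024}},
}


def check_stac_version(catalog, supported=SUPPORTED_STAC_VERSION):
    """Raise ValueError if the catalog dict declares a different STAC version."""
    found = catalog.get("stac_version")
    if found != supported:
        raise ValueError(
            f"catalog uses STAC {found!r}, but this sketch supports {supported!r}"
        )
    return found


def resolve_load_options(sensor, **overrides):
    """Return the defaults registered for `sensor` with user overrides applied."""
    if sensor not in SENSOR_DEFAULTS:
        raise ValueError(f"no defaults registered for sensor {sensor!r}")
    # Deep-copy so callers cannot mutate the shared registry by accident.
    options = copy.deepcopy(SENSOR_DEFAULTS[sensor])
    options.update(overrides)
    return options


# Example: keep the landsat8 chunking but request a single band.
options = resolve_load_options("landsat8", bands=["red"])
```

A strict equality check on the version is the simplest possible policy; whether intake-stac should instead accept a range of spec versions is an open design question.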