<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://revealjs.com
-->
### A streaming data pipeline with Harmonized Landsat Sentinel-2 (HLS) :satellite: imagery
<small> Lunch and Learn presentation at NASA IMPACT<br> Friday 30 Jun 2023, 20:00-21:00 (UTC) </small>
_by **[Wei Ji Leong](https://github.com/weiji14)** & [Ryan Avery](https://github.com/rbavery) @ [Development Seed](https://developmentseed.org)_
<!-- Put the link to this slide here so people can follow -->
<small> P.S. Slides are at https://hackmd.io/@weiji14/2023mlpipeline</small>
---
### :floppy_disk: Typical way of producing ML-ready geospatial :earth_africa: data
1. Download imagery from cloud to local disk
2. Pre-process and chip into smaller tiles (e.g. 512x512 pixels)
3. Store as GeoTIFF/NPY/TFRecords/etc.

| Pros | Cons |
|--|--|
| Can be handled by a GIS/geospatial expert | Need to reprocess data for each new input size |
| Often faster to load into a neural network model | Loss of geospatial metadata if not careful |
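
The chipping in step 2 can be sketched as simple window arithmetic. This is a minimal illustration, not the pipeline used in the demo: `chip_windows` is a hypothetical helper, and dropping partial edge chips is an assumption made for brevity. It also shows why a new input size means reprocessing, since the chip grid is fixed when the chips are written.

```python
def chip_windows(height, width, chip_size=512):
    """Return (row_offset, col_offset, chip_height, chip_width) tuples that
    tile an image of the given shape with non-overlapping chips.

    Partial chips at the right and bottom edges are dropped (an assumption
    made here for simplicity; padding them is another common choice).
    """
    windows = []
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            windows.append((row, col, chip_size, chip_size))
    return windows


# Assumed example: a 3660x3660 pixel tile cut into 512x512 chips.
windows = chip_windows(height=3660, width=3660, chip_size=512)
print(len(windows))  # 49 chips (7 rows x 7 columns)
```

Each tuple can then be used to slice the image array before saving the chip to disk.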
---
### :cloud: Cloud-native way of creating ML-ready geospatial :earth_asia: data
1. Access data through a SpatioTemporal Asset Catalog (STAC)
2. Data-proximate pre-processing and chipping on the fly
3. Load tensors directly into GPUs in a cloud environment

| Pros | Cons |
|--|--|
| Can treat input size as a hyperparameter to experiment with | ML engineer has to manage the data pipeline |
| Saves on local storage and file management | Higher data loading latency if compute runs outside the data's cloud region |
---
### Demo
Data pipeline for
Harmonized Landsat Sentinel-2 + Burn scar masks
<img src="https://hackmd.io/_uploads/ryn96bhu2.png" alt="Harmonized Landsat Sentinel-2 image on left, burn scar mask on right" style="margin:0px auto;display:block" width="50%"/>
Follow along at https://nasa-impact.github.io/ml-pipeline/docs/01_datapipelines_with_torchdata.html
---
### Take home messages
- Data: Publish as SpatioTemporal Asset Catalogs (STAC)
- Model: Look into the STAC ML Model extension: https://github.com/stac-extensions/ml-model
- Learn: About scalable geospatial machine learning!
---
### Links
- Repo: https://github.com/NASA-IMPACT/ml-pipeline
- Jupyter Book: https://nasa-impact.github.io/ml-pipeline
- Contact:
  - weiji@developmentseed.org (@weiji14)
  - ryan@developmentseed.org (@rbavery)