<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://hackmd.io/c/codimd-documentation/%2F%40codimd%2Fmarkdown-syntax
https://revealjs.com
-->
### The ecosystem of geospatial machine learning tools in the [Pangeo](https://pangeo.io) :earth_asia: world.
<small>[FOSS4G SotM Oceania 2023 presentation <br> Wednesday 18 Oct 2023, 13:50–14:15 (NZDT)](https://talks.osgeo.org/foss4g-sotm-oceania-2023/talk/YP3KPT)</small>
_by **[Wei Ji Leong](https://github.com/weiji14)** @ [Development Seed](https://developmentseed.org/team/weiji-leong)_
<!-- Put the link to this slide here so people can follow -->
<small>P.S. Slides are at https://hackmd.io/@weiji14/foss4g2023oceania</small>
<table>
<tr>
<td>
<img src="https://github.com/pangeo-data/branding/blob/master/logo/large-logo-blue-text.png?raw=true" alt="Pangeo logo" style="float: right" width="25%">
</td>
<td>
</td>
<td>
<img src="https://hackmd.io/_uploads/SJcnovPZa.png" alt="Development Seed logo" style="float: left" width="65%">
</td>
<td>
         
</td>
</tr>
</table>
---
### Building next-gen Machine Learning (ML) tools in the Pangeo community
- Towards cloud-native and **GPU-native**
- Streaming and batching data **subsets** on-the-fly
- From single sensor to **multi-modal** models
<!--
i.e. RAPIDS AI, xbatcher and zen3geo
-->
---
<!-- .slide: style="font-size: 0.85em;" -->
### A community of big data geoscientists
- Promoting open, reproducible and scalable ways of doing ocean/atmosphere/land/climate science
- Things Pangeo folks are passionate about:
    - **Cloud/HPC infrastructure** - From local desktops to shared servers
    - **Open Source Software** - Collaborating on scientific libraries like xarray, and cloud-native standards like STAC
    - **Education/Outreach** - Tutorials, Pangeo Showcase webinars, Conference workshops
<!--
![Screenshot of Pangeo Discourse forum](https://hackmd.io/_uploads/SkU5_mG-T.png)
-->
---
## What does the Pangeo ML 'stack' look like?
<small>Answer: Framework agnostic (but opinionated), mostly Python, also all over the place</small>
---
<!-- .slide: style="display: block; top: 23px;" -->
# ![Pangeo Machine Learning Ecosystem 2023](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/pangeo_ml_ecosystem.png)
---
## Example walkthrough
On a climate/weather dataset - WeatherBench2 ERA5
- Direct-to-GPU reads from Zarr - `kvikIO`
- Slicing n-dimensional arrays on-the-fly - `xbatcher`
- Composable data pipelines - `zen3geo`
<!--
<small>Note: focus of rest of the talk will be on Earth Observation raster data</small>
-->
----
# ![NVIDIA GPUDirect Storage](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/nvidia_gpu_direct_storage.png)
<small>Reference: https://developer.nvidia.com/blog/gpudirect-storage</small>
----
# ![xbatcher n-dimensional slicing](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/xbatcher_ndim_slicing.png)
<small>Docs: https://xbatcher.readthedocs.io</small>
----
# ![zen3geo Composable DataPipes](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/zen3geo_composable_datapipes.png)
<!--
### zen3geo - Composable DataPipes for geospatial
Alt text: Flow diagram showing STAC, vector, raster, spatial and other DataPipes making up zen3geo
Features:
- Friendly walkthrough tutorials
- Handle multi-CRS/multi-res layers
- BYO custom function
Roadmap:
- More STAC DataPipes
- Decouple from PyTorch
- Refactor backend to async-first
-->
<small>Docs: https://zen3geo.readthedocs.io</small>
---
# ![Demo DataPipe code](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/demo_datapipe_code.png)
<small>Full code: https://github.com/weiji14/foss4g2023oceania</small>
<!-- https://revealjs.com/code -->
<!--
### Demo Python code
Chaining operations using zen3geo torch DataPipes
<pre><code data-trim data-line-numbers="2-4,8,10,12,14">
# Input path or url to Zarr store
dp_source = IterableWrapper(
    iterable=["gs://weatherbench2/.../era5.zarr"]
)

dp_weather_chips = (
    # Use kvikIO to read Zarr store
    dp_source.read_from_xpystac(engine="kvikio", **kwargs)
    # Custom function to select desired data variables
    .map(fn=sel_datavars)
    # Use xbatcher to slice datacube along time-dimension
    .slice_with_xbatcher(input_dims={"time": 2})
    # Custom function to convert CuPy array to Torch tensor
    .collate(collate_fn=xarray_to_tensor_collate_fn)
)
</code></pre>
-->
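
----

<!-- .slide: style="font-size: 0.65em;" -->

### Sketch: chaining pipeline steps in plain Python

The demo chains read, select, slice and collate steps. Below is a minimal pure-Python sketch of that chaining idea using generators. It does not use zen3geo, xbatcher or kvikIO: `fake_source`, `select_vars`, `slice_time`, `collate` and the path `example.zarr` are hypothetical stand-ins, and the values are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Cube = Dict[str, List[float]]  # variable name -> values along a "time" axis


def fake_source(paths: Iterable[str]) -> Iterator[Cube]:
    """Hypothetical stand-in for opening a Zarr store: one tiny cube per path."""
    for _path in paths:
        yield {
            "temperature": [280.0, 281.5, 283.0, 282.0, 279.5],
            "wind_speed": [3.0, 4.5, 2.0, 6.0, 5.5],
            "humidity": [0.6, 0.7, 0.5, 0.8, 0.9],
        }


def select_vars(cubes: Iterable[Cube], names: List[str]) -> Iterator[Cube]:
    """Keep only the requested data variables."""
    for cube in cubes:
        yield {name: cube[name] for name in names}


def slice_time(cubes: Iterable[Cube], size: int) -> Iterator[Cube]:
    """Cut each cube into non-overlapping windows of `size` time steps.

    A trailing partial window is dropped in this sketch.
    """
    for cube in cubes:
        n_steps = len(next(iter(cube.values())))
        for start in range(0, n_steps - size + 1, size):
            yield {name: vals[start:start + size] for name, vals in cube.items()}


def collate(chips: Iterable[Cube]) -> Iterator[List[List[float]]]:
    """Stack variables (sorted by name) into a nested list, one per chip."""
    for chip in chips:
        yield [chip[name] for name in sorted(chip)]


# Compose the steps; nothing is computed until the pipeline is iterated.
pipeline = collate(
    slice_time(
        select_vars(fake_source(["example.zarr"]), ["temperature", "wind_speed"]),
        size=2,
    )
)
chips = list(pipeline)
```

Each step pulls items from the previous one only when asked, so chips are produced lazily and steps can be swapped or reordered without touching the others.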
---
# ![Compare results](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/compare_results.png)
<!--
### Results: ~25% faster reads with GPUDirect Storage (GDS)
![compare_kvikio_zarr](https://github.com/weiji14/foss4g2023oceania/assets/23487320/753bdfd0-a98b-4b3d-81a2-9bdd6e5db93b)
Back of envelope savings calculations:
- Scaling to 55x more data (18.2GB to 1TB)
- Train for 100 epochs
Zarr: 16.0s x 55 x 100 = 24.4 hours
kvikIO: 11.9s x 55 x 100 = 18.2 hours
Assuming price of US$3/hour for a GPU**,
6 hours savings a day = US$18
Over 30 days (1 month) = US$540 (~NZ$900) saved!
*Technical details:
- 18.2GB ERA5 subset to 1 year at 6 hourly resolution, 3 data variables (float32)
- Zarr v2 spec, no compression, no consolidated metadata
- Benchmarked on an RTX A2000 8GB GPU, connected to PCIe Gen4 x8
**A100 40GB GPU hourly rate (range $2.39-$4.10)
from https://www.paperspace.com/gpu-cloud-comparison
-->
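
----

<!-- .slide: style="font-size: 0.75em;" -->

### Sketch: back-of-envelope time savings

The time estimate behind the comparison can be reproduced in a few lines of Python. This is a sketch of the arithmetic only: the read times (16.0 s and 11.9 s), the 55x scale-up and the 100 epochs come from the benchmark, while the variable and function names are illustrative.

```python
SECONDS_PER_HOUR = 3600

zarr_read_s = 16.0    # measured read time per pass, plain Zarr
kvikio_read_s = 11.9  # measured read time per pass, kvikIO with GPUDirect Storage
scale_factor = 55     # 18.2GB benchmark subset scaled up to about 1TB
epochs = 100          # number of passes over the data


def total_hours(read_seconds: float) -> float:
    """Total read time in hours after scaling up data volume and epochs."""
    return read_seconds * scale_factor * epochs / SECONDS_PER_HOUR


zarr_hours = total_hours(zarr_read_s)
kvikio_hours = total_hours(kvikio_read_s)
hours_saved = zarr_hours - kvikio_hours
percent_faster = (zarr_read_s - kvikio_read_s) / zarr_read_s * 100

print(f"Zarr:   {zarr_hours:.1f} hours")
print(f"kvikIO: {kvikio_hours:.1f} hours")
print(f"Saved:  {hours_saved:.1f} hours ({percent_faster:.1f}% less read time)")
```

The roughly 6 hours saved per day-long training run is the figure carried into the price and carbon estimates.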
----
<!-- .slide: style="font-size: 0.65em;" -->
### Savings in terms of price and carbon emissions
| Time saved | Price (USD3/hr)^[1]^ | Carbon (0.04842 kgCO~2~eq/hr)^[2]^ | CO~2~ emissions equivalent to |
|--|--|--|--|
| 6 hr (day) | USD18 (NZD30) | 0.29 kgCO~2~eq | 35 smartphone charges |
| 180 hr (month) | USD540 (NZD900) | 8.72 kgCO~2~eq | 44km of driving a car |
| 2190 hr (year) | USD6570 (NZD11090) | 106.04 kgCO~2~eq | 431km domestic flight (Taupo to Auckland) |
[1]: Price: USD3/hour for NVIDIA A100 40GB GPU (range $2.39-$4.10) from https://www.paperspace.com/gpu-cloud-comparison
[2]: Carbon intensity in Sydney region (538 gCO~2~eq/kWh) from<br>https://cloud.google.com/sustainability/region-carbon<br>90 W power draw * 1 hour = 0.09 kWh<br>0.09 kWh * 538 gCO~2~eq/kWh = 0.04842 kgCO~2~eq/hr
<!--
Note, not properly using markdown-it-footnote because it makes this page jump up to the first page in print-to-pdf view.
35 smartphone charges - https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results
44km of driving a Toyota Corolla 2020 - https://www.co2everything.com/category/travel-and-transport
431km domestic flight - 246g per km - https://ourworldindata.org/travel-carbon-footprint
-->
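
----

<!-- .slide: style="font-size: 0.75em;" -->

### Sketch: price and carbon arithmetic

The USD and carbon columns of the savings table follow from three inputs: the GPU hourly price, the power draw and the regional carbon intensity. The snippet below is a sketch of that arithmetic using the values quoted in the footnotes; the names are illustrative, and the NZD conversion is left out because the exchange rate is not stated.

```python
GPU_PRICE_USD_PER_HOUR = 3.0          # footnote 1: A100 40GB hourly rate
POWER_DRAW_KW = 0.09                  # footnote 2: 90 W power draw
CARBON_INTENSITY_G_PER_KWH = 538      # footnote 2: Sydney region

# 0.09 kWh per hour * 538 g/kWh = 48.42 g/hr = 0.04842 kg/hr
carbon_kg_per_hour = POWER_DRAW_KW * CARBON_INTENSITY_G_PER_KWH / 1000

hours_saved = {"day": 6, "month": 180, "year": 2190}

savings = {
    period: {
        "usd": hours * GPU_PRICE_USD_PER_HOUR,
        "kg_co2eq": hours * carbon_kg_per_hour,
    }
    for period, hours in hours_saved.items()
}

for period, row in savings.items():
    print(f"{period:>5}: USD{row['usd']:.0f}, {row['kg_co2eq']:.2f} kgCO2eq")
```

Changing the price, power draw or carbon intensity for a different GPU or cloud region only requires editing the three constants at the top.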
---
<!-- .slide: style="font-size: 0.75em;" -->
### Where to learn more
- Educational resources:
    - Project Pythia Cookbooks - https://cookbooks.projectpythia.org
    - GeoSMART - https://geo-smart.github.io
    - UW Hackweeks as a Service - https://guidebook.hackweek.io
- Pangeo ML Working Group Meeting
    - https://pangeo.io/meeting-notes.html#working-group-meetings
    - First Tuesday of each month (3pm US Eastern)
:::info
P.S. We're inviting [showcase](https://pangeo.io/pangeo-showcase.html) speakers for the Pangeo ML Working Group!
:::
---
### Thank you! :sheep:
Link to repository and slides
https://github.com/weiji14/foss4g2023oceania
![QR code to https://github.com/weiji14/foss4g2023oceania repo](https://github.com/weiji14/foss4g2023oceania/assets/23487320/63056665-63e7-4f79-bc40-b155b2792921)
<small>
Or contact me at<br>
GitHub: @weiji14 | Mastodon: @weiji14@mastodon.nz | Email: weiji@developmentseed.org
</small>