<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://hackmd.io/c/codimd-documentation/%2F%40codimd%2Fmarkdown-syntax
https://revealjs.com
-->

### The ecosystem of geospatial machine learning tools in the [Pangeo](https://pangeo.io) :earth_asia: world.

<small>[FOSS4G SotM Oceania 2023 presentation <br> Wednesday 18 Oct 2023, 13:50–14:15 (NZDT)](https://talks.osgeo.org/foss4g-sotm-oceania-2023/talk/YP3KPT)</small>

_by **[Wei Ji Leong](https://github.com/weiji14)** @ [Development Seed](https://developmentseed.org/team/weiji-leong)_

<!-- Put the link to this slide here so people can follow -->
<small>P.S. Slides are at https://hackmd.io/@weiji14/foss4g2023oceania</small>

<table>
<tr>
<td>
<img src="https://github.com/pangeo-data/branding/blob/master/logo/large-logo-blue-text.png?raw=true" alt="Pangeo logo" style="float: right" width="25%">
</td>
<td>
💚
</td>
<td>
<img src="https://hackmd.io/_uploads/SJcnovPZa.png" alt="Development Seed logo" style="float: left" width="65%">
</td>
<td>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;
</td>
</tr>
</table>

---

### Building next-gen Machine Learning (ML) tools in the Pangeo community 🧑‍🤝‍🧑

- Towards cloud-native and **GPU-native** 👾
- Streaming and batching data **subsets** on-the-fly 🛸
- From single sensor to **multi-modal** models 🐙

<!-- i.e.
RAPIDS AI, xbatcher and zen3geo -->

---

<!-- .slide: style="font-size: 0.85em;" -->

### A community of big data geoscientists 🗺️

- Promoting open, reproducible and scalable ways of doing ocean/atmosphere/land/climate science 🧑‍🔬
- Things Pangeo folks are passionate about:
  - **Cloud/HPC infrastructure** - From local desktops to shared servers ☁️
  - **Open Source Software** - Collaborating on scientific libraries like xarray, and cloud-native standards like STAC 🗃️
  - **Education/Outreach** - Tutorials, Pangeo Showcase webinars, Conference workshops 🧑‍🏫

<!-- ![Screenshot of Pangeo Discourse forum](https://hackmd.io/_uploads/SkU5_mG-T.png) -->

---

## What does the Pangeo ML 'stack' look like? 🤔

<small>Answer: Framework agnostic (but opinionated), mostly Python 🐍, also all over the place 💫</small>

---

<!-- .slide: style="display: block; top: 23px;" -->

# ![Pangeo Machine Learning Ecosystem 2023](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/pangeo_ml_ecosystem.png)

---

## Example walkthrough 🚸

On a climate/weather dataset - WeatherBench2 ERA5

- ⏩ Direct-to-GPU reads from Zarr - `kvikIO`
- ✂️ Slicing n-dimensional arrays on-the-fly - `xbatcher`
- 🎍 Composable data pipelines - `zen3geo`

<!-- <small>Note: focus of rest of the talk will be on Earth Observation raster data</small> -->

----

# ![NVIDIA GPUDirect Storage](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/nvidia_gpu_direct_storage.png)

<small>Reference: https://developer.nvidia.com/blog/gpudirect-storage</small>

----

# ![xbatcher n-dimensional slicing](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/xbatcher_ndim_slicing.png)

<small>Docs: https://xbatcher.readthedocs.io</small>

----

# ![zen3geo Composable DataPipes](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/zen3geo_composable_datapipes.png)

<!--
### zen3geo - Composable DataPipes for geospatial

Alt text: Flow diagram showing STAC,
vector, raster, spatial and other DataPipes making up zen3geo

✨ Features:
- Friendly walkthrough tutorials 🔰
- Handle multi-CRS/multi-res layers 🌐
- BYO custom function 🧰

🛤️ Roadmap:
- More STAC DataPipes 🎍
- Decouple from Pytorch 🔥
- Refactor backend to async-first 🤹
-->

<small>Docs: https://zen3geo.readthedocs.io</small>

---

# ![Demo DataPipe code](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/demo_datapipe_code.png)

<small>Full code: https://github.com/weiji14/foss4g2023oceania</small>

<!-- https://revealjs.com/code -->
<!--
### Demo Python code

Chaining operations using zen3geo torch DataPipes

<pre><code data-trim data-line-numbers="2-4,8,10,12,14">
# Input path or url to Zarr store
dp_source = IterableWrapper(
    iterable=["gs://weatherbench2/.../era5.zarr"]
)

dp_weather_chips = (
    # Use kvikIO to read Zarr store
    dp_source.read_from_xpystac(engine="kvikio", **kwargs)
    # Custom function to select desired data variables
    .map(fn=sel_datavars)
    # Use xbatcher to slice datacube along time-dimension
    .slice_with_xbatcher(input_dims={"time": 2})
    # Custom function to convert CuPy array to Torch tensor
    .collate(collate_fn=xarray_to_tensor_collate_fn)
)
</code></pre>
-->

---

# ![Compare results](https://github.com/weiji14/foss4g2023oceania/releases/download/v0.9.0/compare_results.png)

<!--
### Results: ~25% faster reads with GPUDirect Storage (GDS)

![compare_kvikio_zarr](https://github.com/weiji14/foss4g2023oceania/assets/23487320/753bdfd0-a98b-4b3d-81a2-9bdd6e5db93b)

Back of envelope savings calculations:
- Scaling to 55x more data (18.2GB to 1TB)
- Train for 100 epochs

Zarr: 16.0s x 55 x 100 = 24.4 hours
kvikIO: 11.9s x 55 x 100 = 18.2 hours

Assuming price of US$3/hour for a GPU**, 6 hours savings a day = US$18
Over 30 days (1 month) = US$540 (~NZ$900) saved!
*Technical details:
- 18.2GB ERA5 subset to 1 year at 6 hourly resolution, 3 data variables (float32)
- Zarr v2 spec, no compression, no consolidated metadata
- Benchmarked on an RTX A2000 8GB GPU, connected to PCIe Gen4 x8

**A100 40GB GPU hourly rate (range $2.39-$4.10) from https://www.paperspace.com/gpu-cloud-comparison
-->

----

<!-- .slide: style="font-size: 0.65em;" -->

### Savings in terms of price 💸 and carbon emissions 🏭

| Time saved |$ Price (USD3/hr)^[1]^ | Carbon (0.04842 kgCO~2~eq/hr)^[2]^ | CO~2~ emissions equivalent to |
|--|--|--|--|
| 6 hr (day) | USD18 (NZD30) | 0.29 kgCO~2~eq | 35 smartphone charges 📱 |
| 180 hr (month) | USD540 (NZD900) | 8.72 kgCO~2~eq | 44km of driving a car 🚗 |
| 2190 hr (year) | USD6570 (NZD11090) | 106.04 kgCO~2~eq | 431km domestic flight (Taupo to Auckland) ✈️ |

[1]: Price: USD3/hour for NVIDIA A100 40GB GPU (range $2.39-$4.10) from https://www.paperspace.com/gpu-cloud-comparison

[2]: Carbon intensity in Sydney region (538 gCO~2~eq/kWh) from<br>https://cloud.google.com/sustainability/region-carbon<br>90 W power draw * 1 hour = 0.09 kWh<br>0.09 kWh * 538 gCO~2~eq/kWh = 0.04842 kgCO~2~eq/hr

<!--
Note, not properly using markdown-it-footnote because it makes this page jump up to the first page in print-to-pdf view.
35 smartphone charges - https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results
44km of driving a Toyota Corolla 2020 - https://www.co2everything.com/category/travel-and-transport
431km domestic flight - 246g per km - https://ourworldindata.org/travel-carbon-footprint
-->

---

<!-- .slide: style="font-size: 0.75em;" -->

### Where to learn more 🧑‍🎓

- 🏫 Educational resources:
  - Project Pythia Cookbooks - https://cookbooks.projectpythia.org
  - GeoSMART - https://geo-smart.github.io
  - UW Hackweeks as a Service - https://guidebook.hackweek.io
- 💬 Pangeo ML Working Group Meeting - https://pangeo.io/meeting-notes.html#working-group-meetings
  - First Tuesday of each month (3pm US Eastern)

:::info
P.S. We're inviting [showcase](https://pangeo.io/pangeo-showcase.html) speakers for the Pangeo ML Working Group!
:::

---

### Thank you! :sheep:

Link to repository and slides
https://github.com/weiji14/foss4g2023oceania

![QR code to https://github.com/weiji14/foss4g2023oceania repo](https://github.com/weiji14/foss4g2023oceania/assets/23487320/63056665-63e7-4f79-bc40-b155b2792921)

<small>
Or contact me at<br>
GitHub: @weiji14 | Mastodon: @weiji14@mastodon.nz | Email: weiji@developmentseed.org
</small>
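
---

<!-- .slide: style="font-size: 0.75em;" -->

### Appendix: reproducing the savings estimate

A minimal sketch of the back-of-envelope arithmetic behind the price and carbon savings table, using only the figures quoted in this deck (16.0 s vs 11.9 s per pass, 55x more data, 100 epochs, USD3/hr, 90 W, 538 gCO~2~eq/kWh). The function and constant names are illustrative assumptions and do not come from the talk repository.

```python
# Back-of-envelope estimate; every input is a figure quoted on the slides.
USD_PER_GPU_HOUR = 3.0        # assumed hourly rate for an A100 40GB GPU
POWER_DRAW_KW = 0.09          # 90 W power draw
GRID_G_CO2EQ_PER_KWH = 538    # carbon intensity, Sydney region


def training_hours(seconds_per_pass, scale_factor=55, epochs=100):
    """Hours to train when one pass over the 18.2GB subset takes `seconds_per_pass`."""
    return seconds_per_pass * scale_factor * epochs / 3600


def savings(hours_saved):
    """Return (price in USD, carbon in kgCO2eq) for a number of GPU hours saved."""
    price_usd = hours_saved * USD_PER_GPU_HOUR
    carbon_kg = hours_saved * POWER_DRAW_KW * GRID_G_CO2EQ_PER_KWH / 1000
    return price_usd, carbon_kg


zarr_hours = training_hours(16.0)    # plain Zarr reads
kvikio_hours = training_hours(11.9)  # kvikIO / GPUDirect Storage reads
print(f"Zarr: {zarr_hours:.1f} h, kvikIO: {kvikio_hours:.1f} h")

# The slides round the difference down to 6 hours saved per day.
for label, hours in [("day", 6), ("month", 180), ("year", 2190)]:
    usd, kg = savings(hours)
    print(f"{label}: USD{usd:.0f}, {kg:.2f} kgCO2eq")
```

Swapping in a different GPU price, power draw, or regional carbon intensity only requires changing the three constants at the top.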