<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://hackmd.io/c/codimd-documentation/%2F%40codimd%2Fmarkdown-syntax
https://revealjs.com
-->
### Cloud- & ML-native file formats 💾 for geospatial and multi-modal applications
<small>DevSeed Team Week 2024 deep dive <br> Wednesday 21 Feb 2024, 13:30–14:30 (EST)</small>
<!-- _by **[Wei Ji Leong](https://github.com/weiji14)** @ [Development Seed](https://developmentseed.org/team/weiji-leong)_
-->
<!-- Put the link to this slide here so people can follow -->
<small>P.S. Slides are at https://hackmd.io/@weiji14/mlnativeformats</small>
---
### From cloud to ML-native formats :rocket:
Cloud-native :cloud: | ML-native :thunder_cloud_and_rain:
--|--
Efficient subsetting using range requests | Efficient random access for shuffling ops on mini-batches
Stream data in parallel into CPU RAM | Asynchronous loading into GPU RAM
Uni-modal data type (e.g. mp3, GeoTIFF) | Multi-modal formats (?)
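
<small>A minimal sketch of the two access patterns in the first row, assuming a made-up fixed-width record layout (`RECORD`) held in an in-memory buffer rather than any real file format. The helper names `write_records`, `byte_range` and `shuffled_batches` are hypothetical. Each row maps to an inclusive byte range (the same form an HTTP `Range: bytes=start-end` header takes), and shuffled mini-batches are read by seeking straight to those ranges.</small>

```python
import io
import random
import struct

# Hypothetical fixed-width record: (int32 label, float32 value) = 8 bytes
RECORD = struct.Struct("<if")


def write_records(n):
    """Write n toy records into an in-memory buffer standing in for a remote file."""
    buf = io.BytesIO()
    for i in range(n):
        buf.write(RECORD.pack(i, i * 0.5))
    buf.seek(0)
    return buf


def byte_range(row):
    """Inclusive (start, end) byte range of one row, as in `Range: bytes=start-end`."""
    start = row * RECORD.size
    return start, start + RECORD.size - 1


def shuffled_batches(buf, n_rows, batch_size, seed=0):
    """Yield mini-batches of rows in shuffled order, reading only the bytes needed."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    for i in range(0, n_rows, batch_size):
        batch = []
        for row in order[i : i + batch_size]:
            start, end = byte_range(row)
            buf.seek(start)  # random access: jump straight to the row
            batch.append(RECORD.unpack(buf.read(end - start + 1)))
        yield batch


buf = write_records(10)
batches = list(shuffled_batches(buf, n_rows=10, batch_size=4))
```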
---
#### ML is no longer just Natural Language Processing (NLP) or Computer Vision (CV) :exploding_head:
- Multi-modal datasets
  - Text :memo: (.txt)
  - Audio/Sound :musical_note: (.mp3)
  - Video :movie_camera: (.mp4)
  - Images :camera_with_flash: (.png/.jpg)
- Geospatial
  - Point clouds :fireworks: (.las), Vectors :diamond_shape_with_a_dot_inside: (.gpq)
  - Trajectory :arrow_heading_up: (MF-JSON)
  - Raster :world_map: (.tif)
---
### Can one format rule them all?
<!-- Focus on cloud-native (read-optimized) file formats: GeoParquet, Zarr, Lance -->
<img src="https://hackmd.io/_uploads/Sk-4QxQnp.png" alt="Cloud-optimized Machine Learning formats" width="70%">
<!-- https://excalidraw.com/#json=nwqXccEA9f8A8alLlpcWF,8qLCJngKCrftMyM3Pqk3OA -->
<small>Do we store Raster in Vector, or Vector in Raster?</small>
---
## GeoParquet
- Since 2021, [v1.0.0](https://github.com/opengeospatial/geoparquet/releases/tag/v1.0.0) in Sep 2023
- Pros ✅
  - Good support for structured data (text, FixedShapeTensor)
  - On the way to becoming an OGC standard
- Cons ❌
  - Slow random access to rows for ML training (e.g. shuffling is inefficient)
  - Unstructured data (images, audio, videos) is harder to store; the best case is just storing a reference
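
<small>A rough illustration of why shuffling hurts, assuming a reader that has to fetch and decode a whole row group to get at any one row (real readers can sometimes do better). The row counts and the helper `row_groups_touched` are made up for this sketch.</small>

```python
import random


def row_groups_touched(rows, row_group_size):
    """Return the set of row-group indices that contain the requested rows."""
    return {row // row_group_size for row in rows}


# Hypothetical dataset: 1M rows stored in row groups of 100k rows each
n_rows, row_group_size, batch_size = 1_000_000, 100_000, 32

contiguous = range(0, batch_size)  # sequential mini-batch
shuffled = random.Random(0).sample(range(n_rows), batch_size)  # shuffled mini-batch

n_contiguous = len(row_groups_touched(contiguous, row_group_size))
n_shuffled = len(row_groups_touched(shuffled, row_group_size))
print(f"row groups read: contiguous={n_contiguous}, shuffled={n_shuffled}")
```

<small>The sequential batch sits in a single row group, while the shuffled batch of the same size is spread over several.</small>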
---
## Zarr
- Since 2015, [v2](https://github.com/zarr-developers/zarr-python/pull/37) in 2016, v3 in progress
- Pros ✅
  - Very flexible data structure with support for arbitrary dimensions
  - Can access chunks in a performant way from cloud object stores, including direct-to-GPU
- Cons ❌
  - Nested/multi-dimensional structure may be overkill for certain applications
  - No Arrow support yet, mostly Python-based I/O drivers
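
<small>A sketch of how chunked access works in principle: a slice of the array maps to a small set of chunk objects, and only those need to be fetched from the object store. This is plain Python, not the Zarr API. The helper `chunk_keys` is hypothetical, and the `"i.j"` key naming assumes the Zarr v2 default `.` separator (other separators exist).</small>

```python
from itertools import product


def chunk_keys(selection, chunk_shape):
    """Map a selection to chunk keys.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk size along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        per_dim.append(range(first, last + 1))
    # One key per combination of chunk indices, e.g. "1.0"
    return [".".join(map(str, idx)) for idx in product(*per_dim)]


# Rows 100..299 and columns 0..255 of an array stored in 256x256 chunks
keys = chunk_keys(((100, 300), (0, 256)), chunk_shape=(256, 256))
print(keys)  # ['0.0', '1.0']
```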
---
## [Lance](https://github.com/lancedb/lance)
- Since 2022, still in v0.x.y
- Pros ✅
  - Designed for multi-modal data, built with Arrow interoperability
  - Fast random access to specific rows (i.e. supports efficient shuffling)
- Cons ❌
  - No geospatial support yet
  - Slightly bigger file size compared to Parquet (doesn't compress as well)
---
### What next? :telescope:
1. Discuss where we want to focus our attention (10min)
2. Take a shot at drafting a document with ideas (10min in silence + 10min with talking)
   - https://miro.com/app/board/uXjVNqT7DaE=/?share_link_id=592313143177
3. Summarize and plan for post-team week (10min)
---
### Example multi-modal use-cases
| Use case | Data modalities |
|:--:|:--:|
| Bioacoustics + Camera Traps | Audio + Images |
| Autonomous driving / streetview | LiDAR (depth data) + Video |
| Weather station data | Time-series (precipitation from rain gauge) + Doppler radar images |
| Land-use Land Cover | Text (OSM tags) + Images (Satellite imagery) |
<!-- <small>Pick any you'd like to work on, or come up with your own!</small> -->