Cloud- & ML-native file formats for geospatial and multi-modal applications

### Cloud- & ML-native file formats 💾 for geospatial and multi-modal applications

DevSeed Team Week 2024 deep dive

Wednesday 21 Feb 2024, 13:30–14:30 (EST)

P.S. Slides are at https://hackmd.io/@weiji14/mlnativeformats

---

### From cloud to ML-native formats :rocket:

Cloud-native :cloud: | ML-native :thunder_cloud_and_rain:
--|--
Efficient subsetting using range requests | Efficient random access for shuffling ops on mini-batches
Stream data in parallel into CPU RAM | Asynchronous loading into GPU RAM
Uni-modal data type (e.g. mp3, GeoTIFF) | Multi-modal formats (?)

---

#### ML is no longer just Natural Language Processing (NLP) or Computer Vision (CV) :exploding_head:

- Multi-modal datasets
  - Text :memo: (.txt)
  - Audio/Sound :musical_note: (.mp3)
  - Video :movie_camera: (.mp4)
  - Images :camera_with_flash: (.png/.jpg)
  - Geospatial
    - Point clouds :fireworks: (.las), Vectors :diamond_shape_with_a_dot_inside: (.gpq)
    - Trajectory :arrow_heading_up: (MF-JSON)
    - Raster :world_map: (.tif)

---

### Can one format rule them all?

<img src="https://hackmd.io/_uploads/Sk-4QxQnp.png" alt="Cloud-optimized Machine Learning formats" width="70%">

Do we store Raster in Vector, or Vector in Raster?

---

## GeoParquet

- Since 2021, [v1.0.0](https://github.com/opengeospatial/geoparquet/releases/tag/v1.0.0) in Sep 2023
- Pros ✅
  - Good support for structured data (text, FixedShapeTensor)
  - On the way to becoming an OGC standard
- Cons ❌
  - Slow random access to rows for ML training (e.g. shuffling is inefficient)
  - Unstructured data (images, audio, videos) is harder to store; the best case is storing just a reference

---

## Zarr

- Since 2015, [v2](https://github.com/zarr-developers/zarr-python/pull/37) in 2016, v3 in progress
- Pros ✅
  - Very flexible data structure with support for arbitrary dimensions
  - Can access chunks in a performant way from cloud object stores, including direct-to-GPU
- Cons ❌
  - Nested/multi-dimensional structure may be overkill for certain applications
  - No Arrow support yet, mostly Python-based I/O drivers

---

## [Lance](https://github.com/lancedb/lance)

- Since 2022, still in v0.x.y
- Pros ✅
  - Designed for multi-modal data, built with Arrow interoperability
  - Fast random access to specific rows (i.e. supports efficient shuffling)
- Cons ❌
  - No geospatial support yet
  - Slightly larger file size than Parquet (does not compress as well)

---

### What next? :telescope:

1. Discuss where we want to focus our attention (10 min)
2. Take a shot at drafting a document with ideas (10 min in silence + 10 min of discussion)
   - https://miro.com/app/board/uXjVNqT7DaE=/?share_link_id=592313143177
3. Summarize and plan for post-team week (10 min)

---

### Example multi-modal use-cases

| Use case | Data modalities |
|:--:|:--:|
| Bioacoustics + Camera Traps | Audio + Images |
| Autonomous driving / streetview | LiDAR (depth data) + Video |
| Weather station data | Time-series (precipitation from rain gauge) + Doppler radar images |
| Land-use Land Cover | Text (OSM tags) + Images (Satellite imagery) |
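
---

A rough way to see why shuffling favors row-addressable formats (the GeoParquet "slow random access" con versus the Lance "fast random access" pro) is to count how many rows a reader must fetch when rows can only be read one whole chunk at a time. The sketch below is a toy model only: the dataset size, chunk size, and the helper functions `chunks_touched` and `read_amplification` are hypothetical names and values chosen for illustration, and nothing here measures a real file format.

```python
import random


def chunks_touched(row_ids, rows_per_chunk):
    """Return the set of chunk indices that must be fetched to read row_ids."""
    return {row_id // rows_per_chunk for row_id in row_ids}


def read_amplification(row_ids, rows_per_chunk):
    """Rows fetched divided by rows wanted, when whole chunks must be read."""
    fetched_rows = len(chunks_touched(row_ids, rows_per_chunk)) * rows_per_chunk
    return fetched_rows / len(row_ids)


# Hypothetical numbers, picked only to make the effect visible.
n_rows = 100_000        # rows in the dataset
rows_per_chunk = 1_000  # rows stored together in one chunk / row group
batch_size = 32         # mini-batch size

rng = random.Random(42)
sequential_batch = list(range(batch_size))              # contiguous rows
shuffled_batch = rng.sample(range(n_rows), batch_size)  # random rows

# A contiguous batch sits in a single chunk: 1000 / 32 = 31.25
print(read_amplification(sequential_batch, rows_per_chunk))
# A shuffled batch is spread over many chunks, so far more rows are fetched.
print(read_amplification(shuffled_batch, rows_per_chunk))
```

Under these assumptions, a contiguous batch touches one chunk, while a shuffled batch of the same size is scattered across many chunks, so most of the fetched rows are discarded. Real readers may narrow the gap with caching or smaller chunks, so treat the numbers as an intuition aid rather than a benchmark.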