<!--
Docs for making Markdown slide deck on HackMD using Revealjs
https://hackmd.io/s/how-to-create-slide-deck
https://hackmd.io/c/codimd-documentation/%2F%40codimd%2Fmarkdown-syntax
https://revealjs.com
-->

### Cloud- & ML-native file formats 💾 for geospatial and multi-modal applications

<small>DevSeed Team Week 2024 deep dive
<br>
Wednesday 21 Feb 2024, 13:30–14:30 (EST)</small>

<!-- _by **[Wei Ji Leong](https://github.com/weiji14)** @ [Development Seed](https://developmentseed.org/team/weiji-leong)_ -->

<!-- Put the link to this slide here so people can follow -->
<small>P.S. Slides are at https://hackmd.io/@weiji14/mlnativeformats</small>

---

### From cloud to ML-native formats :rocket:

Cloud-native :cloud: | ML-native :thunder_cloud_and_rain:
--|--
Efficient subsetting using range requests | Efficient random access for shuffling ops on mini-batches
Stream data in parallel into CPU RAM | Asynchronous loading into GPU RAM
Uni-modal data type (e.g. mp3, GeoTIFF) | Multi-modal formats (?)

---

#### ML is no longer just Natural Language Processing (NLP) or Computer Vision (CV) :exploding_head:

- Multi-modal datasets
  - Text :memo: (.txt)
  - Audio/Sound :musical_note: (.mp3)
  - Video :movie_camera: (.mp4)
  - Images :camera_with_flash: (.png/.jpg)
- Geospatial
  - Point clouds :fireworks: (.las), Vectors :diamond_shape_with_a_dot_inside: (.gpq)
  - Trajectory :arrow_heading_up: (MF-JSON)
  - Raster :world_map: (.tif)

---

### Can one format rule them all?

<!-- Focus on cloud-native (read-optimized) file formats: GeoParquet, Zarr, Lance -->

<img src="https://hackmd.io/_uploads/Sk-4QxQnp.png" alt="Cloud-optimized Machine Learning formats" width="70%">

<!-- https://excalidraw.com/#json=nwqXccEA9f8A8alLlpcWF,8qLCJngKCrftMyM3Pqk3OA -->

<small>Do we store Raster in Vector, or Vector in Raster?</small>

---

## GeoParquet

- Since 2021, [v1.0.0](https://github.com/opengeospatial/geoparquet/releases/tag/v1.0.0) in Sep 2023
- Pros ✅
  - Good support for structured data (text, FixedShapeTensor)
  - On the way to becoming an OGC standard
- Cons ❌
  - Slow random access to rows for ML training (e.g. shuffling is inefficient)
  - Unstructured data (images, audio, videos) is harder to store; the best case is storing just a reference

---

## Zarr

- Since 2015, [v2](https://github.com/zarr-developers/zarr-python/pull/37) in 2016, v3 in progress
- Pros ✅
  - Very flexible data structure with support for arbitrary dimensions
  - Can access chunks in a performant way from cloud object stores, including direct-to-GPU
- Cons ❌
  - Nested/multi-dimensional structure may be overkill for certain applications
  - No Arrow support yet, mostly Python-based I/O drivers

---

## [Lance](https://github.com/lancedb/lance)

- Since 2022, still in v0.x.y
- Pros ✅
  - Designed for multi-modal data, built with Arrow interoperability
  - Fast random access to specific rows (i.e. supports efficient shuffling)
- Cons ❌
  - No geospatial support yet
  - Slightly bigger file size compared to Parquet (doesn't compress as well)

---

### What next? :telescope:

1. Discuss where we want to focus our attention (10min)
2. Take a shot at drafting a document with ideas (10min in silence + 10min with talking)
   - https://miro.com/app/board/uXjVNqT7DaE=/?share_link_id=592313143177
3. Summarize and plan for post-team week (10min)

---

### Example multi-modal use-cases

| Use case | Data modalities |
|:--:|:--:|
| Bioacoustics + Camera Traps | Audio + Images |
| Autonomous driving / streetview | LiDAR (depth data) + Video |
| Weather station data | Time-series (precipitation from rain gauge) + Doppler radar images |
| Land-use Land Cover | Text (OSM tags) + Images (Satellite imagery) |

<!-- <small>Pick any you'd like to work on, or come up with your own!</small> -->
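
---

#### Sketch: why shuffling stresses row-group layouts

<small>A minimal sketch, not a benchmark of any real format. It assumes a hypothetical file of 1,000 rows stored in row groups of 100 rows, and counts how many row groups a reader has to open to assemble one mini-batch. The sizes and the helper name `row_groups_touched` are illustrative assumptions.</small>

```python
import random

ROWS = 1_000          # hypothetical dataset size
ROW_GROUP_SIZE = 100  # hypothetical number of rows per row group (chunk)
BATCH_SIZE = 32       # hypothetical mini-batch size


def row_groups_touched(row_ids, row_group_size=ROW_GROUP_SIZE):
    """Return the number of distinct row groups that contain the given rows."""
    return len({row_id // row_group_size for row_id in row_ids})


# Sequential mini-batch: consecutive rows sit in the same row group.
sequential_batch = list(range(BATCH_SIZE))

# Shuffled mini-batch: rows are sampled from anywhere in the file.
rng = random.Random(42)
shuffled_batch = rng.sample(range(ROWS), BATCH_SIZE)

print("sequential:", row_groups_touched(sequential_batch), "row group(s)")
print("shuffled:  ", row_groups_touched(shuffled_batch), "row group(s)")
```

<small>In this simplified model, a reader that has to decode a whole row group to return one row pays for every group touched, so a shuffled batch costs far more I/O than a sequential one. A layout with row-level addressing would pay closer to one small read per requested row.</small>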