# OME2024 NGFF Challenge
## Front matter
### Zoom
Please paste this into the Zoom chat as new people join:
:::warning
Welcome to the call. Please be aware that this session may be recorded. Live notes for the session are available at https://hackmd.io/3emKqKQsT_2U35vLepzDEQ. Where possible, help to structure the notes for later publication rather than commenting in Zoom's chat. Thanks!
:::
### Code of conduct
The OME community is open to everybody and built upon mutual respect. Please take the time to review the code of conduct below.
https://github.com/ome/.github/blob/master/CODE_OF_CONDUCT.md
## Check-in (20241009)
### Agenda
* Discuss data URLs, usages? (Josh)
* Look at challenge site (Will)
- https://ome.github.io/ome2024-ngff-challenge/
- [Preview WIP](https://deploy-preview-53--ngff-challenge.netlify.app/)
### Session 1
* Attending
- Josh Moore
- Sebastien Besson
- Will Moore
- Frances Wong
- Koji Kyoda
- Joost de Folter
- Dominik Lindner
- Tom Boissonnet
* Notes
- Josh: look at challenge, new URLs.
- Joost: continuation?
- Josh: need to release this challenge
- other challenges?
- metadata (and upcoming RO-Crate RFC?)
- transforms (i.e., an active RFC)
    - Seb: turning RFCs into versions: there's a tension over where to spend time (challenge, RFC, implementation, ...)
        - this challenge was interesting because it mixed a couple of concepts tried together (different audiences)
        - if it's only transforms, will you still get the people who aren't immediately interested?
- be careful of limited resources (one Josh)
    - Joost: transforms sound interesting; ro-crate is interesting for EM (could provide input)
- Koji: using challenge tool to convert existing zarr data.
- Q: add multiple RO-Crate terms?
            - needs `nargs` (see the sketch at the end of this session's notes)
- https://github.com/ome/ome2024-ngff-challenge/blob/main/src/ome2024_ngff_challenge/resave.py#L721
- JM: and next? add more metadata.
- KK: discussed with Olympus. They will join GBI EoE in Japan
- Will:
- spec next step is v0.5
- https://github.com/ome/ngff/pull/242
- Anything beyond RFC-2?
- Only possibly "spec bugs" (typing, etc.)
- and transforms?
- https://github.com/ome/ngff/pull/138
- currently conflicting, waiting on reviews, etc.
- Tom
- people will delete their data?
- keep a copy of the RO-Crate
- archive to Zenodo
- Will: include thumbnails
- Will: **demo** of https://github.com/ome/ome2024-ngff-challenge/pull/53
- Josh: https://www.w3.org/TR/void/#statistics
- Josh: or just a search engine? next steps.
- It's been fun. :cl:
![image](https://hackmd.io/_uploads/HJSu2pQykl.png)
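
A minimal sketch of how multiple RO-Crate terms could be accepted on the command line using argparse's `nargs`/`append` (the option names below are illustrative, not the actual flags of `resave.py`):

```python
import argparse

parser = argparse.ArgumentParser(description="hypothetical RO-Crate options")
# action="append" collects a repeated flag into a list;
# nargs="*" accepts several values after a single flag.
parser.add_argument("--rocrate-term", action="append", default=[],
                    help="may be passed multiple times")
parser.add_argument("--rocrate-organism", nargs="*", default=[],
                    help="zero or more taxon identifiers")

args = parser.parse_args([
    "--rocrate-term", "A", "--rocrate-term", "B",
    "--rocrate-organism", "NCBI:txid9606", "NCBI:txid10090",
])
print(args.rocrate_term)      # ['A', 'B']
print(args.rocrate_organism)  # ['NCBI:txid9606', 'NCBI:txid10090']
```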
### Session 2
* Attending
- Eric Perlman
- Kiya Govek
- Dan Toloudis
- Will Moore
- Josh Moore
- Gideon Dunster
- Erick Ratamero
- Melissa Linkert
* Notes
- Data status
- JAX: No more.
- Dan: not immediately. Maybe.
- Challenge feedback
- EP:
            - everything about sharded data is better (caveat: for read-only datasets)
- 20% reduction on disk (with the defaults)
- a lot happened all at once. better product for public distribution.
- highest level: all improved
- tools
- had to wrap the resave command; was it a success?? (late to the game) robustness
- return status object (even breaking unix-ness)
- ran at scale, pretty fast. $500 conversion cost for all JAX data in challenge
            - on top of the original zarr conversion
            - with tools as they are today (bf writing converted), could tool this with scratch SSD on ephemeral cloud nodes
- turbojpeg has become a pain (won't run on mac) - hard to debug NDPI
- JM: maybe not *everything* about sharding is better. Can you just delete shards? e.g. for zarrv2 you can just not download chunks
- DT:
- didn't get hands dirty on conversion tools
- interested in converting large time-series (read-only after conversion)
- really interesting: guidelines on sharding strategically for different read scenarios.
- have explored webpage of what's been converted; gives momentum to get onto Zarr v3.
- JM: don't have that insight *yet*
- for ML training, you might want them broken into 3D bricks, for viz 2D planes (chunks)
- separate but related on the shards
            - different experimental groups are also thinking about different parameters
- EP: all data was 2D slide scans. but 3D for viz are now in single chunks.
- for 3D viz, do care about the 3D chunks
- shard as compute entity.
            - probably no client is trying to load the whole shard
- JM: geospatial person in zarr meeting had reordering problem
- suboptimal dimension ordering w/o sharding to suboptimal dimension ordering w/ sharding was a 4x speed up
- ML: re: NDPI see [temp directory](https://github.com/glencoesoftware/bioformats2raw/pull/252/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5)
- JM: response to "can we keep doing the challenge" from this morning. No - people need to be able to delete their data. Maybe something next year:
- Collecting interesting transform datasets
- `+1`: EP (non-JAX)
- DT: most complicated is lattice skew (though perhaps de-skewed)
- *simplest* chain of transforms wins (as tool developer)
- Metadata
- GD: Allen cares about metadata
- ER: :+1: for annually, but perhaps a bit smaller. every OME meeting? what/scope/effort
- ER: one idea for next step / challenge: minimal set of infrastructure & vis tools in common for all the challenges
- caching, querying
    - JM: do we want to grow a catalog of static webpages? a challenge template? catalogs of data like the Python community has?
- ER: most exciting thing about this challenge was big-picture vision of federated image repositories that speak common language
- This is valuable if there is a centralized place that can query all the repositories
- JM: https://x.com/notjustmoore/status/1346494255435554817
- EP: moving beyond CSV?
- DT: also building internally something on CSV
    - EP: TODO: minutiae list as well. (file naming!)
- ER: we did build a distributed repository in 4 months as CSV files on github
- EP: flybrain data? 8nm so 40TB. flattened segmentation down to 10TB. (raw full is 100TB)
- JM: summary of challenge: we've elevated base level so that doing things on top is a smaller lift
- WM: demo of https://deploy-preview-53--ngff-challenge.netlify.app/
- how do people want to find things? "all 3D images with 2D chunks?"
- EP: value of having a small thumbnail in the hierarchy
- JAX didn't do that; only went down to 1k x 1k
        - store a thumbnail as an extra resolution level at a non-standard size. WM: finding the optimal thumbnail resolution ahead of time (see the sketch after these notes)
- ER: include that in **guidance** (separate entry in OMERO and cached for a reason)
- EP: how are we collecting feedback from everyone on how it went?
- JM: blog post / image.sc summary. More extreme would be paper. could pull out pieces for RFC
- Lessons learned can turn into should/may statements in RFC (e.g. don't do this, it costs a lot of cloud money)
- EP: should separate teams write separate blog posts or should there be one collective post
- JM: if you had separate posts, main post could link to them, or could start from shared markdown/google doc
![image](https://hackmd.io/_uploads/rJdnxQ4y1e.png)
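
A rough sketch of the thumbnail idea discussed above: pick the pre-computed resolution level whose YX extent is closest to a target thumbnail size. This assumes Zarr v2-style attributes with `multiscales` at the group root and YX as the last two axes; it is not the approach any of the teams actually used.

```python
import zarr

TARGET = 256  # desired thumbnail edge length in pixels

group = zarr.open("image.zarr", mode="r")
datasets = group.attrs["multiscales"][0]["datasets"]

def yx_extent(ds):
    """Largest of the last two (Y, X) dimensions for a resolution level."""
    shape = group[ds["path"]].shape
    return max(shape[-2], shape[-1])

# choose the level whose YX extent is closest to the target size
best = min(datasets, key=lambda ds: abs(yx_extent(ds) - TARGET))
print("thumbnail source level:", best["path"], "extent:", yx_extent(best))
```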
## Check-in (20240925)
### Agenda
* Check-in ;)
### Session 1
* Attending
- Will Moore
- Norman Rzepka
- Sébastien Besson
- Aybuke Yoldas
- Joost de Folter
- Martin Jones
- Dominik Lindner
- Frances Wong
- Francois Sherwood
* Notes
- WM: PR with samples viewer
- see https://samples-viewer--ome2024-ngff-challenge.netlify.app/?csv=https://raw.githubusercontent.com/will-moore/ome2024-ngff-challenge/samples_viewer/samples/ngff_samples.csv
- samples/ngff_samples.csv - stores Institution + CSV url that lists zarr samples
- Norman - summarise data sizes written in CSV.
- Aybuke - we have samples and a csv: https://samples-viewer--ome2024-ngff-challenge.netlify.app/?csv=https://raw.githubusercontent.com/BioImage-Archive/ebi-ngff-challenge-samples/refs/heads/main/ebi-ngff-challenge-samples.csv
- Joost: conversion question: validator.
- Tiff stack -> ngff with converter. bioformats2raw layout with /0. Expected? Yes.
- ro-crate json at root - not shown when looking at the image. Will: could have validator check parent dir.
- Norman: add source of image (e.g. link to IDR etc) in the csv.
- Will: Author not included in metadata.
- source: for institution etc, Martin: core facilities.
- Let's use "origin": csv column for link to original data.
- Seb: ro-crate metadata for storing the data that this is "derived from"? Will, yes - but a bit late for right now. Also "Creator".
- Done: 10:39.
- Actions:
        - Will: fix ro-crate-metadata loading in samples-viewer (Done)
- Will: Support 'origin' column link in samples csv (Done)
- Will: python script to add 'written' column to csv
### Session 2
* Attending
- Will Moore
- Joel Lüthi
- Jens Wendt, Uni Münster
- Eric Perlman
- Kiya Govek
- Melissa Linkert
- Steve Taylor
* Notes:
- Eric P: where do we put csv? Will: any public csv with a URL is fine.
- Jens: help converting. big czi WSI, 840 GB.
- Will: convert to Zarr v2 first. Then resave.
- EP: issue with default shard size being too big
- Jens: what is the correct shard size?
- Will: no absolute right and wrong chunk/shard
- staylor: is there support for scanpy/anndata in NGFF?
        - Will: NGFF doesn't have a spec for that. spatialdata has its own spec built on ome-zarr; it may be fleshed out in future
        - Joel: no support for Zarr v3 yet
- Jens: 2 screens 16 plates - images with ROIs
- https://github.com/ome/omero-cli-zarr should work for plates and polygons -> labels
    - Will: Eric - validator could show shard size, e.g. with a HEAD request? HTTP is supported by all back ends. Kiya: our bucket has quite restrictive permissions (but should still be OK). Eric: 100 TB images sometimes have e.g. 100 shards, possible to check them all. (See the sketch after these notes.)
- Will: 'origin' (original data url) and 'source' (institution e.g. "IDR") columns are supported from csv
- Kiya: totals on samples-viewer not very meaningful when table has 1 image per collection.
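
A small sketch of the HEAD-request idea above: report a shard's size from the `Content-Length` header without downloading it (the URL and chunk key below are hypothetical):

```python
import requests

# one shard object inside a sharded Zarr v3 array (path is made up)
url = "https://example.org/bucket/image.zarr/0/c/0/0/0"

resp = requests.head(url, allow_redirects=True)
resp.raise_for_status()
size = int(resp.headers["Content-Length"])
print(f"shard is {size / 1e6:.1f} MB")
```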
## Check-in (20240911)
### Agenda
* Check-in ;)
### Session 1
* Attending
- Josh Moore
- Sébastien Besson
- Conni Wetzker
- Aybuke Yoldas
- Frances Wong
- Kola Babalola
- Martin Jones
- Joost de Folter
* Notes
- AY: "ramping up" (focused on the API)
- CW: dataset of interest?
- FLIM dataset (Leica) .lif exported as .ptu (no OMERO support)
- post-doc in Haase group converted it for visualization and OMERO
- leading to a small publication (soonish). nice and complete. not huge.
- Marcelo converted to OME-TIFF (<- PTU); napari plugin can visualize it.
    - KB: using OMERO for visualization of the OME-TIFF, or converting to OME-Zarr?
- CW: overview (below)
- SB: best that can be done in OMERO.web
- possible to create a module OME-TIFF but only works in insight
- AY: happy to have it in the BioImage Archive
- CW: Marcelo still on parental leave
- FW: haven't done FLIM yet, so interesting.
- MJ: "ramping up" (checking on the data; going public)
- have benchmark data up on EMPIAR
- CLEM: EMPIAR-11666 (Zarr or N5, local MoBIE projects)
- also: 11537 and 10819. Already in VolumeBrowser in some chunked format.
- may be from cleaned/reduced/aligned/slightly-preprocessed data
- in the future might be nice to have something "Raw-er"
- Josh
- see https://github.com/ome/ome2024-ngff-challenge/pull/50 for the beginning of statistics
![image](https://hackmd.io/_uploads/rkHtRAChA.png)
### Session 2
- Attending
- Josh Moore
- Melissa Linkert
- Torsten Stöter
- Eric Perlman
- Notes
- EP: another 1500 miles since last week.... ("ramping up")
- TS: taking a lot of time for the conversion.
- (LSM, EM)
- first bioformats2raw
- TS: stable?
- JM: no changes to the format. would like to get remote *writing* working though
- TS: sharding tested for the challenge. suggestions.
- previously 4k x 4k tiles, stacks.
- didn't work with default options. (OOM?)
- turned on sharding. then it went through.
- EP: would have 4k x 4k
- using for non-sharding to reduce number of files
- wouldn't use it for sharding
        - 2k x 2k is a lot lower latency (or 1k x 1k x 1k)
- suggest filing an issue for the failure.
- z-stack is aligned? (probably)
- TS: haven't tried a viewer
- EP: read chunking can happen later
- haven't looked into rechunking though
- e.g. 256 x 256 x 16 (depends on how the data is used)
- ML: working on zarr2zarr
        - latest updates in https://github.com/glencoesoftware/zarr2zarr/pull/8
- no additional planned changes
- Seb testing chunking, sharding, and metadata migration (using validator, etc.)
- EP: rechunking in multiple dimensions?
- ML: v2 as input. can set arbitrary chunk and shard on the output. (up to you)
    - EP: lets us get the 3D JAX data done. :100:
- want to get the channels co-located
- JM: performance testing?
- ML: mostly identifying pain points
- EP: path where zarr2zarr gets called for the challenge?
- ML: goal wasn't production tool for all of your data
        - more exercising zarr-java and preparing for Zarr v3
- end goal is not to need zarr2zarr but you would write from bioformats2raw
- i.e., sandbox outside bioformats2raw
- EP: i.e., ultimately wrapper code
- JM: FYI
- blosc compression issue
- https://github.com/zarr-developers/zarr-python/issues/2171
## Commitment (20240904)
### Agenda
* Round-table
* Status
* Help needed?
* Next steps (statistics, etc.)
### Session 1
* Attending
- Will Moore
- Aybuke Yoldas
- Josh Moore
- Bugra Oezdemir
- Frances Wong
- Francois Sherwood
- Joost de Folter
- Sebastien Besson
- Dom Lindner
- Susanne Kunis
- Argha Sarker
    - Norman Rzepka
* Notes
- Roundtable
- WM: actively converting IDR. listing images in the README. soon to have idr0090 done (22 plates. 0.5-1TB each). idr0044 lightsheet: 2k x 2k x 2k x 500 timepoints. not 3D downsampled. i.e.: close
- also: there are updates to the NGFF validator.
- see: https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/ome2024-ngff-challenge/idr0044/4007801.zarr
![image](https://hackmd.io/_uploads/rkYOQoHhR.png)
- Highly recommended to validate any generated data
- SK: also looking into RO-Crate generation
- where does the namespace for the key/value pairs come from? (not all OMERO instances)
- SB: supporting on the infrastructure side and getting Java tools caught up. back to a good place on sharding. now looking at converting metadata from v2 to v3. (question for RFC)
- JdF: talking to Crick EM team. have identified data.
- Q1: convert first to Zarr v2? Yes (with ngff-converter or bioformats2raw)
- Q2: RO-Crate? yes. There are some examples.
- `pip install ome2024-ngff-challenge` and `ome2024-ngff-challenge resave -h`
- FS: tested converting on small scale. working on converting metadata for the new website (where we're going to put beautiful images). still committed.
- DL: idr0157 converting to v2. 50TB with 15 or 20 different organisms.
- issues with nextflow running out of space (not deleting files)
- BO: haven't started converting. might have access to interesting data.
- --> Dominik: can hopefully help with batch conversion.
- https://github.com/Euro-BioImaging/BatchConvert
- few differences between the strategies?
- other questions
- WM: what to do with the outputs? ("catalog")
- JM: not from my side. yet.
- Javascript or Python (notebook)
- JM: First step probably move README to CSV or YAML
- WM: Current state https://ome.github.io/ome-zarr-catalog/?csv=https://raw.githubusercontent.com/ome/ome-zarr-catalog/main/public/zarr_samples.csv
- JdF: preferred way to publish?
- public URL with CORS (S3 or HTTP(S))
- SK: interest in a unique representation of all the links?
- ro-crate-preview.html
- perhaps something for the future
- AS: RO-Crate example
- JM: historical perspective of omero-marshal, omero-rdf, etc.
- AY: from BIA to come with a mapping & schema for REMBI. OME does it for OMERO. and then map between the two. but for the user namespaces, that's a whole different issue. ontological mapping is tricky (at best)
- rather than one for IDR and one for BIA, merge it in the REMBI realm.
- JM: future challenges? iterating on this?
- FS: what's being used with JSON-LD tooling? JSON tooling? or RO-Crate tooling? do you rely on the context?
- AS: relying on JSON-LD (from RDF). FS: i.e. you have an extra RDF graph object ... suggestions won't work. ;)
- FS: to make things more compatible for the future, try to use IDs rather than strings.
- "homo sapiens", "tonsil" --> to objects with a name field that links to the string
- can then add in an ID later (if you don't already have it)
- in JSON-LD flattened (RO-Crate) is then much longer, but that's fine.
![image](https://hackmd.io/_uploads/BJky2sB2A.png)
- Downsampling
- NR: Code from the `webknossos` Python package
- Calculating which resolutions to downsample: https://github.com/scalableminds/webknossos-libs/blob/master/webknossos/webknossos/dataset/_downsampling_utils.py#L37-L107
- Run the downsampling: https://github.com/scalableminds/webknossos-libs/blob/master/webknossos/webknossos/dataset/_downsampling_utils.py#L299-L366
    - 3D downsampling with an attempt to end up at isotropic voxels (see the sketch below)
- Example: 11x11x22nm input, downsample by factors of 2-2-1, 4-4-2, 8-8-4, ...
- Example: 11x11x22nm input, downsample with constant factors of 2-2-2, 4-4-4, 8-8-8, ...
- No downsampling across other axes, e.g. channels, time points
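
A toy sketch that reproduces the per-axis factors in the 11x11x22 nm example above, halving only the axes that are still at (or near) the finest voxel size; this is an illustration, not the webknossos implementation linked above:

```python
def downsample_factors(voxel_size, num_levels):
    """Cumulative per-axis factors that move voxels towards isotropy."""
    current = list(voxel_size)
    cumulative = [1] * len(voxel_size)
    factors = []
    for _ in range(num_levels):
        finest = min(current)
        # halve an axis only if its voxel size is close to the finest one
        step = [2 if v <= finest * 1.5 else 1 for v in current]
        current = [v * s for v, s in zip(current, step)]
        cumulative = [c * s for c, s in zip(cumulative, step)]
        factors.append(tuple(cumulative))
    return factors

print(downsample_factors((11, 11, 22), 3))
# [(2, 2, 1), (4, 4, 2), (8, 8, 4)]
```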
### Session 2
* Attending
- Josh Moore
- Eric Perlman
- Will Moore
- Peter Sobolewski
- Melissa Linkert
- Kiya Govek
- Seb Besson
- Erick Ratamero
- Dominik Lindner
* Notes
- WM: Validator demo again
- JM: get URLs then we can start doing interesting things.
- EP: Peter & Quarto? :) markdown+code -> generate other things
- KG: have dataset, have information to convert it, need GCP
    - EP: local filesystem-based writing is working; still need to do the GCS write.
- other Eric mentioned looking at the metadata. bar is low.
- JM: heads up on writing remotely with v3. (Josh to send link)
- ER: chasing people for licensing information
- KG: using the same ontologies
- ML: ongoing work on zarr2zarr (Java), some metadata work from Seb
- need options on chunk size, changing from v2 to v3
- looking into various zarr-java issues with Norman. going well.
- SB: more important for RFC-2 and the next release.
- could start trying zarr2zarr if you didn't want to leave bioformats2raw
- only thing missing is the insertion of the RO-Crate (not on radar)
- JM: could add a command to add the ro-crate-metadata.json
- Misc: https://forum.image.sc/t/ome-ngff-workflows-hackathon-2024-november-18th-22nd-in-zurich/100843
- ER: just license or contact information?
    - JM: at the moment just the license. lets us at least work with it.
- KG: for larger, person submitting isn't necessarily correct (and different per dataset)
- https://www.researchobject.org/ro-crate/specification/1.1/contextual-entities.html#licensing-access-control-and-copyright
## Check-in (20240821)
### Agenda
* Status updates
- Library status
- Working and ready to use
- Examples in README
- Metadata validation
* Getting started!
- Instructions
- Publicity
- Collection
### Session 1
- Attendees
- Josh Moore (OME/GerBI)
- Will Moore (OME)
- Matthew Hartley (EBI)
- Francois Sherwood (EBI)
- Aybuke Yoldas (EBI)
- Sébastien Besson (Glencoe)
- Tom Boissonnet (HHU)
- Martin Jones (Crick)
- Joost de Folter (Crick)
- Petr Walczysko (UoD)
- Notes
- Josh:
- Binary migration
- Required metadata?
- Statistics. Anything else last minute?
- Matthew:
- small scale conversions. can share but wouldn't put them in the README
- created a new bucket for the challenge (with permissions)
- good example of a naive user.
- resave size issue caused a problem
- Will: fixed now in released version
- probably won't have bigger data converted for a few weeks
- Francois:
- Is RO-Crate in the correct location? Josh: looks so.
- Haven't had time to implement the CSV input. Hopefully by end of week?
- Will: RO-Crate doesn't show up in the validator (yet)
    - Matt: will need a validator. something that looks *like* a real NCBI ID is fine.
- give people help to do the ID look up.
- or ask for descriptive text and we do the look up?
- Will: CSV? Matt: Alternative option for providing bulk parameters
        - Also an additional artifact that can hold more details (descriptions, etc.)
    - Tom: should only validate that the RO-Crate exists and the URL exists.
- Matt: back to the question of the default values
- Tom: avoid any defaults.
- Josh: but required?
- Will: minimally license (provide some choices); allow skip
- Seb: prefer no metadata rather than bad metadata
        - SHOULD: license, organism, modality (see the sketch after these notes)
- MAY: name, description
- i.e. allow skip
- Will: extract "name" from NGFF metadata? :+1:
- Aybuke: MUST for license?
- Matt: suggest stick with SHOULD?
- Josh: if missing, interpretation is "private"?
- Seb: that's ok for local testing?
- Josh: CLI validator?
    - FS: validate that it's specifically an ID? or any string?
- Josh: two alternate fields?
- Matt: allows us to help them fix it later (but adds complexity)
- Tom: any RO-Crate that is not valid we can review and contact them?
- Matt: avoid back and forth process
- AY: do we have something in place to help people look up?
- JM: limiting factor is someone to code it up.
- AY: wary about two stage. people will forget later
- MH: will have to have tooling to re-write the RO-Crate. allow the two-stage?
- conversion is expensive, but won't want to do it again.
- metadata collection is more iterative
- MH: dump FBbi and possibly NCBI today?
    - JM: human, mouse, Arabidopsis, Drosophila, zebrafish, yeast, C. elegans
- JM: suggestions for moving to production?
- bumping version number
- MH: anyone else who hasn't tested yet? JM: Kiya?
- README!
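
As a rough illustration of the SHOULD/MAY fields above, a hand-written minimal `ro-crate-metadata.json` might look like the following; the property names used for organism and modality are placeholders rather than the challenge's agreed profile:

```python
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "example image",  # MAY
            "license": "https://creativecommons.org/licenses/by/4.0/",  # SHOULD
            # placeholder keys for organism / imaging modality (SHOULD)
            "organism": {"@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606"},
            "imaging_modality": {"@id": "http://purl.obolibrary.org/obo/FBbi_00000246"},
        },
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```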
### Session 2
- Attendees
- Josh Moore (GerBI)
- Eric Perlman (JAX)
- Lorenzo Cerrone (BioVisionCenter, Zürich)
- Joel Lüthi (BioVisionCenter, Zürich)
- Norman Rzepka (scalableminds)
- Sebastien Besson (Glencoe)
- Notes
- Josh: re-summarizing morning session (see those notes)
- Licenses
- Joel: for the challenge or NGFF in general? More challenge. JL: good, need to train people
        - Eric: but we work in a realm of de facto standards so it's good to start with a proper process now
- part of an internal JAX thread
- Misc
- Eric: zarrs build issue
- Eric: turbojpeg issue (Josh: arm?) -- may need to use x86 docker
- Eric: using s3 compatibility layer (not GCS)
- Josh: stats
- Eric: unclear how to evaluate perf vs. cost in a solid way
- Norman
- talked to Francois about collections; need time to prototype
- unified Zarr streaming into webknossos. [Example](https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://data-humerus.webknossos.org/data/zarr3_experimental/scalable_minds/l4_sample/color)
        - Lorenzo: Zarr v3 and deprecating Zarr v2
- Norman: idea was to have a hard-cut, but can still create 0.4 NGFFs
        - Joel: single computational environments where you can write Zarr v2 and v3
- Norman: for zarr-python, probably forever support v2 and v3 (not too optimized)
- zarr.v2 code will go but the support will be there
- Josh: sidenote for Norman that updating v3 branch of zarr-python was breaking today
- Norman: long way to go for releases.
- Joel: months? or 2025
- Norman: ironing bugs and for normal use ok
            - but the API surface of v2 is so huge. what are we preserving?
- Lorenzo: dropping sync API?
        - Norman: we're unsure. Please chip in. Core devs don't use it or know what it should look like.
- Lorenzo: always found zarr-python API too wide
- Seb: Java also rough
- Norman: but starting from scratch so don't have the baggage
- Seb: yeah, even if more people are hacking on zarr-python
            - and the java API already includes lessons learned from zarrita(-py)
- Seb: migration (in challenge context) going well
- converting with sharding is slower (can't parallelize as much)
- hopefully in a few weeks will have a Java library that duplicates the Python one (to compare)
- Norman: thanks for the RFC2 review
        - Seb: challenge is validating it also. Then we decide on the roadmap
- Norman: opinion -- get 0.5 out ASAP.
- Josh: agreed. bigger worry is the impl ecosystem (zarr-python)
- Seb: identifying blockers as a consumer
- Joel: great to hear what's changing and what directions are going!
- Josh: lessons learned
- --> Lorenzo: definitely start using ASAP
- zarr-python v3 testing of remote writing
- codec as biggest risk
- .... "nested sharding"
- Seb: but re-converting is painful
- Joel: start processing zarr v2 with zarr-python v3 branch?
- Norman: yes, but don't use zarr.v2 imports
- Josh: preventing deletion?
- Norman: still some references
## Spec freeze? (20240731)
### Agenda
* Status updates
- dev3
- [Francois](https://github.com/ome/ome2024-ngff-challenge/pulls?q=is%3Apr+author%3Asherwoodf)
- [Tom](https://github.com/ome/ome2024-ngff-challenge/pulls?q=is%3Apr+author%3Atom-tbt)
- dev2
- [Josh](https://github.com/ome/ome2024-ngff-challenge/pulls?q=is%3Apr+author%3Ajoshmoore)
- Zarr implementations
- zarr-python
- d-v-b: [2058](https://github.com/zarr-developers/zarr-python/pull/2058) etc.
- tensorstore
- misc: [182](https://github.com/google/tensorstore/issues/182)
* Discussion points
- Configuration
- RO-Crate metadata
- Shard-size heuristics
- Scaling/Performance
- Slurm anyone?
- Generating scripts?
- Timeline: holidays, beta testing, datasets, freeze, etc.
### Session 1
- Attendees
- Josh Moore (GerBI)
- Norman Rzepka (scalableminds)
- Matthew Hartley (EBI)
- Francois Sherwood (EBI)
- Aybuke Yoldas (EBI)
- Susanne Kunis (UOS)
- Koji Kyoda (RIKEN)
- Hiroya Itoga (RIKEN)
- Tong Li (Sanger)
- David Gault (UoD)
- Dominik Lindner (UoD)
- Frances Wong (UoD)
- Notes
- dev3 (Francois):
- Prototypical code to generate objects which are REMBI-like.
- Have an example of the metadata of the flattened, compacted format. Opinions very welcome.
- Commented on the REMBI proposal from Tom based on experience from BIA. Tweaks were needed.
- Norman: had a glance, but superficial. Would be good to look together (now or later)
- FS: biggest question is whether or not to go with the REMBI model
- screenshare https://github.com/ome/ome2024-ngff-challenge/pull/9 which is loosely based on Tom's REMBI proposal
        - NR: had a hard time understanding the RO-Crate JSON. Other ways of visualizing? Is Python better?
- FS: Python felt easier. Create a crate. Add datasets, files, and contextual entities.
- JM: and reading? FS: spits it out into a similar structure.
- MH: linked data online viz tools are also available
- FS: can easily convert to other profiles even if flattened here.
- FS: working with RO-Crate, needs a `@base` specified (not normal). Can add framing.
- MH: will need to identify who the users are.
- people writing will only need to set 5 parameters
- developers will need to know the Python side
- metadata consumers should hopefully not stare at the flattened view
- NR: interested in collection stuff. up for a separate brainstorm on that?
- FS: :+1:. schedule on Zulip.
- NR: terminology. unsure if we'd use "Dataset", would use "Image"
- JM: full type system. can subclass or have multiple superclasses. should make it easy to understand
        - FS: can create the relevant terms somewhere. No linked data formalization of REMBI out there.
- find and re-use vocabulary (worry about URIs later)
- dev2 (Josh)
- pypi? Josh
- rocrate params? Francois
- convert.sh?
- chunk/shard: NR: same for all multi-resolution, but differences for 2D & 3D
- MH: everything starts from OME-Zarr v2, right? Yes.
- JM: if we wanted, then can think of bf2raw or omero-cli-zarr
- NR: not the end of the world to go first to v2
- Unsure of the state of sharding.
- datasets
- IDR/Frances: starting datasets
- looked at slides
- idr0090: in v2, HCS, rich multi-channel metadata
- idr0157: not in v2, plants with lots of organisms (not largest)
- idr0044: not in v2, large light-sheet.
- MH: nice selection. :+1: on different organisms
- BIA/MH:
- not at the choosing stuff yet
- one EM
            - with remote reading/writing tools, take platynereis (2TB)
- dependent on which storage is available (next meeting!)
- JM: testing now?
- MH: anything on https://www.ebi.ac.uk/bioimage-archive/galleries/visualisation.html e.g. BIAD144, EMPIAR-10442
- JM: visualization? MH: useful to have an example in the README.
- scalableminds/NR:
- [open PR](https://github.com/scalableminds/webknossos/pull/7941) soon to be merged that generates dev2 on the fly.
- targeting on dev2 ("write"). reading in a few weeks.
- need curation on the metadata to go to dev3. select certain datasets.
- RIKEN/Koji:
- converting SSBD to OME-Zarr
- problem: many TIFF filesets. worst case: lost metadata. setting physical pixel size.
            - MH: BIA has tools to rewrite Zarr metadata with the correct scaling (on each level of the multiscale); see the sketch after these notes
- TODO: MH to share those scripts
- JM: in the IDR, corrected in OMERO and exported from there
- DG: flags on bfconvert? Probably not.
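
A rough sketch (not the BIA scripts mentioned above) of patching the physical pixel size into the `scale` transform of each multiscale level of a Zarr v2 / NGFF 0.4 image; the path, axis order, and the factor-of-two assumption are placeholders:

```python
import json

attrs_path = "image.zarr/0/.zattrs"      # image group in bioformats2raw layout
base_scale = [1.0, 0.5, 0.065, 0.065]    # e.g. c, z, y, x sizes at full resolution

with open(attrs_path) as fh:
    attrs = json.load(fh)

for ms in attrs["multiscales"]:
    for level, ds in enumerate(ms["datasets"]):
        factor = 2 ** level  # assumes XY-only, factor-of-two downsampling
        scale = list(base_scale)
        scale[-1] *= factor
        scale[-2] *= factor
        ds["coordinateTransformations"] = [{"type": "scale", "scale": scale}]

with open(attrs_path, "w") as fh:
    json.dump(attrs, fh, indent=2)
```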
### Session 2
- Attendees
- Josh (GerBI)
- Martin Jones (Crick)
- Ken Ho (Crick)
- Melissa Linkert (Glencoe)
- Eric Perlman (JAX and stuff)
- Jordao Bragantini (CZIBioHub)
- Judith Lacoste (MIA Cellavie Inc.)
- Notes
- dev3/metadata: Josh summarizing the morning session
- Eric: interested in the result but don't have to be involved in the intermediate discussions
- Ken: collection like a dataset? Yes, but with metadata.
- Eric: see an example, e.g., of an NDPI. something on each of the channels.
- probably subset of IDR metadata
- Judith: so Tom's trying to use REMBI? Working in that direction, but requires adaptation
- Judith: when working with the data model, equipment has filters associated with the hardware. but specimen isn't necessarily a match
- dev2/binary:
- chunk/shards at each level?
- Eric: fine for the challenge but for anisotropic would need it
- codecs:
- very tricky
- out of memory
- Eric: try configuration for buffers, etc.? No.
- Josh: difference in tensorstore-py? Don't know.
- scripts?
- Eric: good, but could see just printing to stdout
- Jordao: on HCS? fields X resolutions
- Jordao: parallelize on t? That's an upstream question
- zarr-java and sharding (`JM->ML`)
- ETA of the end of year, but not within the next month.
- any other path? only possibly aicsimageio, but just a subset.
- Eric: fine for the challenge. possibly measure conversion time.
- even planning for Zarr v4 :smile:
- summarizing
- Start testing! :smile:
- TODO: Josh to put up sample data (Eric: :+1:)
- Datasets - anything weird?
- Josh: reported on the IDR datasets from Frances
- Martin: lots of MoBIE generated datasets. alone aren't too interesting. (transforms later)
- Next meeting in 3 weeks
- Eric: JAX meeting with mgmt about data just afterwards
## Status Sync (20240717)
### Rough agenda
Please keep an eye on this or add your own. :point_down:
* Status updates:
* dev2:
- spec: https://ngff.openmicroscopy.org/rfc/2/
- generation: https://github.com/ome/ome2024-ngff-challenge/pull/3
- https://github.com/glencoesoftware/zarr2zarr
- validation: https://github.com/ome/ome-ngff-validator/pull/36
- visualization: https://github.com/hms-dbmi/vizarr/pull/172 (along with https://github.com/google/neuroglancer/issues/606)
* dev3:
- [RO-Crate](https://github.com/ome/ome2024-ngff-challenge/pull/2)
- Collections?
* Other items:
* FYI: NGFF (v0.4) data that was generated for the IDR can be found under https://idr.openmicroscopy.org/webclient/usertags/?show=tag-38304992
    * Is 2D versus 3D chunking orthogonal to the sharding and the challenge? Or should it be included?
* Additionally, is there any interest in proposing multiple sharding configurations for the challenge?
* Likely "biological entity" terms that will be used? See https://github.com/ome/ome2024-ngff-challenge/pull/2#discussion_r1674340139
### Session 1
* Attendees
* Josh Moore (GerBI)
* Bugra Oezdemir (EuBI)
* Tom Boissonnet (HHU)
* Joost de Folter (Crick)
* Frances Wong (UoD)
* Sébastien Besson (Glencoe)
* Dominik Lindner (UoD)
* Aybuke Yoldas (EBI)
* Koji Kyoda (RIKEN)
* Norman Rzepka (scalableminds)
* Eric Perlman (The World)
* Francois Sherwood (EBI)
* Hiroya Itoga (RIKEN)
* Notes
    * Josh: dev2 (binary) is in a fairly good state
- https://github.com/ome/ome2024-ngff-challenge/tree/main/dev2
* Seb: zarr2zarr - not on the dev2/rfc-2 format yet. Solely upgrades v2 to v3 (and v3 to v2). (Josh: i.e. dev1).
- Doesn't work on HCS.
- Doesn't shard
        - Question of how to validate. Currently doing that via the roundtrip back to v2. (See the round-trip sketch at the end of this session's notes.)
- Norman: use some zarrita implementations for zarr-python v3. using byte to byte. could also use zarr-python alphas now. https://github.com/zarr-developers/zarr-java/blob/main/src/test/java/dev/zarr/zarrjava/ZarrTest.java#L61
- Josh: validator would be the "best way" for the moment (also: works locally)
- Seb: next step is choosing the public bucket for the IDR datasets
* Aybuke: haven't done conversions yet but will do. Kola is likely to be the person.
* Next steps on binary
- Eric: been adding GCS parameters to the resave script
- Eric: interested in the compression flags (blosc optimizer)
- Seb: where?
- Eric: specific zulip thread? (Norman can join on a zulip thread)
- Norman: blosc with zstd level 5, chunk size based on use case (32 cubic)
- Eric: different for large 2D screens.
* Metadata/dev3 (Francois)
- https://github.com/ome/ome2024-ngff-challenge/pull/2
- Things to cover:
- who is keen to be part of the decision making group?
- additionally: Norman, Tom, Frances
- where do we store the RO-Crate file relative to the .zarr?
            - what goes into it, and formatting issues?
- RO-Crate is relatively rigid (Norman: and complex)
- what does a collection definition look like? (Norman)
- e.g., multiple images and assign relationships, type attributes, etc.
- ro-crates of multiple ro-crates?
- FS: not mentioned in the spec
- JM: from their POV it's allowed, but no special handling
- https://github.com/ome/ome2024-ngff-challenge/pull/2
- FS: JSON-LD as RDF or as JSON is two different communities. number 4
- TB: in favor of ro-crate JSON within the zarr.json
- need a top-level for collections
- AY: similar to Norman's collections?
- FS: worst case highly duplicative JSONs (nothing stopping you from describing larger datasets at each level; would get copied over at each level)
- JM: ...rdf spiel...
- Norman: moving NGFF metadata to the ro-crate metadata.
- breaking but worth an experiment?
- summary/decisions:
- FS: sounds like everyone is roughly happy with 2 and 3 (and not 4). so unless there are other suggestions/experiments, can work on that proposal
- Josh: i.e., a ro-crate-metadata.json is acceptable beside any `zarr.json/.zattrs` (but we will need references between them)
            - NR: image and a prediction, multiple images, all sharing the same sample. Another use case: multiple images of different samples that share some metadata, e.g., a single image knows its parent to get something about the sample.
- AY: if moving metadata out of zarr.json, how would the tools work? NR: it wouldn't, all tools would need to be updated.
- AY: channels? don't always have the right information.
- JM: anything a MUST in zarr.json? NR: don't think so.
- TB: can reference zarr from the RO-Crate
- NR: need to be careful about mixing the two and making them mandatory
- for a tool that can only read zarr, what does it need?
- Geo community is working on multiscales
- Maybe transforms and axes
- Careful decision
- FS: need NGFF in JSON-LD format for this
- JM: unsure if there will be open rebellion of flattened, compacted NGFF form :smile:
- FS: happy to create a new PR with examples for position, flattened/compacted NGFF,
- parallelization (Norman)
- JM: would defer to tensorstore. anyone?
- NR: not in next two weeks, but then perhaps capacity
* ACTIONS
- "pass your Zarrs around" (all)
- sharding/chunking/compression benchmarks
- zulip compression thread (JM, et al)
- new RO-Crate PR (FS)
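
A minimal sketch of the round-trip check mentioned in the notes above (v2 to v3 and back to v2), comparing the two v2 arrays element by element; the paths are hypothetical, and for large data you would compare chunk by chunk rather than loading everything:

```python
import numpy as np
import zarr

original = zarr.open("original_v2.zarr/0", mode="r")
roundtrip = zarr.open("roundtrip_v2.zarr/0", mode="r")

# metadata checks first, then the pixel data itself
assert original.shape == roundtrip.shape
assert original.dtype == roundtrip.dtype
np.testing.assert_array_equal(original[:], roundtrip[:])
print("round trip identical")
```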
### Session 2
* Attendees
* Josh Moore (GerBI)
* Melissa Linkert (Glencoe)
* Peter Sobolewski (JAX)
* Joel Lüthi (UZH)
* Tom Boissonnet (HHU)
* Jordao Bragantini (Biohub)
* Kiya Govek (JAX)
* Davis Bennett (independent)
* Aybuke Yoldas (EBI)
* Cameron Fraser (AICS)
* Erick Ratamero (JAX)
* Notes
* JM: Main looming topic left on Dev2 is sharding / compression technique (in next few weeks)
* JB: also looking into compression.
- Davis: every codec is going to work. will depend on the data.
        - Davis: would like to see tools in zarr-python that make it possible to benchmark codecs against samples (see the sketch at the end of this session's notes)
- Josh: are all implemented? DB: list is shorter.
- DB: also spec is unclear if a codec is unsupported
- ... check with your data.
* PS/fyi: at SciPy, xarray discussions around Earthmover hiring for biomedical. OME-Zarr (metadata) was explicitly discussed.
        - Davis: have a package that turns OME-Zarr into xarray: https://pypi.org/project/xarray-ome-ngff/
* Melissa/fyi: zarr2zarr "watch this space"
- demonstrates using zarr-java to write & read zarr v3
- simple copy from v2 to v3 and a round trip back to v2
- local only, no HCS, needs work on sharding options
- all on TODO for the next 2 weeks
- Question to the group: best way to validate that the v3 is "correct"? metadata updated, sharding is configured correctly, etc.
- Josh: currently best validator for ome-zarr is web ome-ngff-validator.
            - Python can work because that's where it's implemented.
- Can't check that compressions are doing what you want - basically have to check hashsums
- Davis: don't need. round tripping is sufficient
        - Melissa: zarr-java is a work-in-progress, so independently of NGFF, how do we confirm?
- Josh: there are a lot of zarr implementations to check for: https://github.com/zarr-developers/zarr_implementations
- Davis: zarr v3 shed for testing: https://github.com/d-v-b/zarr-workbench/tree/main/v3-sharding-compat
- Melissa: had discussed
* Josh: by end of July to settle on spec. then we will stop changing things. we will need to ramp up actual conversion of data
- Davis: not going to consider converting until visualization tools work, specifically Fiji
- Jordao: is conversion metadata only?
- Davis: you *can* do metadata-only thing
- Jordao: we will probably use slurm
- AY: could use slurm
- JL: also have slurm
- JB: what are people using?
- AY: conversion part of bigger pipeline. Are using snakemake, could move to nextflow
- EBI as a whole is moving towards slurm, so eventually slurm will be better for us
- ER: we will be using custom google cloud stuff, might not be helpful for others
- JL: eventually will wrap into Fractal
* Jordao/Josh: benchmark?
* Josh: will want to be able to play with the data. read metadata. point tools to datasets of interest
* Josh: big matrix of compression settings, modality/dimensionality of data
* ER: interesting benchmark is process of conversion itself. how long does it take? how much does it cost in human hours and processing?
* JL: on-prem file numbers! DB: some of it can be done through calculations. (experience with sharding has been good; but may need to test your HTTP implementations)
* JB: nested directories work fine with timelapses of moderate volumes (~20GB per volume), but need sharding for plates and large 3d images
* AY: benchmark is that visualization tools work, and the happiness of the IT crowd.
* Josh: hope having some data out there in new format is enticing enough to have application/vis developers update their tools
* Josh: https://ome.github.io/ome-ngff-tools/ records state of community after 3 month period
* Josh: Proposal from this morning:
* RO-crate sidecar file
        * Need an example ro-crate - Francois working on this, Tom/Josh/Norman also interested
* JL: is there a viewer/abstraction layer that people are using to look at ro-crate, or are they looking at json?
* Josh: tool for looking at hierarchy of metadata attached: https://language-research-technology.github.io/crate-o/#/
* Probably what we want for our community is little widgets in our existing tool set (e.g. vizarr). Would let you drill down to metadata and select images by metadata. Don't currently exist.
* Davis: hard sell for biologists if no tool
* Josh: benefit to ro-crate is that spec already exists for metadata
* Josh: list of ro-crate tools: https://www.researchobject.org/ro-crate/tools
* JL: my question was more "what do people here use to look at them?"
* chat: JB: 'Is the tensor store `zarr3 driver` compliant with the current specs for sharding? https://google.github.io/tensorstore/driver/zarr3/index.html'
* DB: it should be
* Cameron: allen institute for cell science, work on their web visualization tool. work with Dan Toloudis. mostly fly on the wall today
* Josh: are people doing 2D or 3D chunking?
* ER: even WSI is not trivial at JAX
* JB: on-prem we keep chunks as big as possible to limit file sizes. created smaller chunks for neuroglancer for responsive async loading, but still 3D because neuroglancer orthoview
* Josh (standing in for Francois): What are biological entities people are wanting to include in challenge?
* Zebrafish (BioHub)
* Mice (JAX)
* Stem cells - human cell lines, some mouse (UZH)
* Josh: there will be pre-defined categories for some of the metadata. Hope was that we could work from one ontology, but looking like we will need to pull from a set of ontologies
* Jordao: will there be domain-specific metadata?
* Josh: hope is to set up a structure where you can include whatever metadata you want
* Josh: two we need are biological organism and imaging modality, but you can include anything else
* DB: if communities are defining schemas for their own metadata, how does it get validated?
* Josh: there are ways of writing down schemas for these extensions
* Josh: could be that people who write schemas also write validators, hopefully some of that overhead will discourage adding too many schemas
* Joel: hackathon in fall on NGFF in Zurich. deciding dates: https://docs.google.com/forms/d/e/1FAIpQLSdmbWkuBa11MnFcbIHYnxQLEf1p0xna3rL26j_YrWKlStMFSw/viewform
- ACTION:
- Josh: ask what managers are of interest.
- Josh: outline the benchmark interest (so people keep numbers)
- Josh: ask about biological entities (see agenda)
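
In the spirit of the "big matrix of compression settings" idea above, a small sketch of benchmarking codecs against a sample chunk with `numcodecs`; the codecs and the synthetic chunk are arbitrary examples, not recommendations:

```python
import time
import numpy as np
from numcodecs import Blosc, GZip, Zstd

# stand-in sample chunk; replace with a real chunk from your data
chunk = np.random.poisson(30, size=(256, 256, 32)).astype("uint16")

codecs = {
    "blosc-zstd-5": Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
    "zstd-5": Zstd(level=5),
    "gzip-1": GZip(level=1),
}

for name, codec in codecs.items():
    start = time.perf_counter()
    encoded = codec.encode(chunk)
    elapsed = time.perf_counter() - start
    ratio = chunk.nbytes / len(encoded)
    print(f"{name:>12}: {ratio:5.2f}x compression in {elapsed * 1000:6.1f} ms")
```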
## Spike 1 (20240703)
### Rough agenda
* Status round table
* To cover:
* Successes & blockers
* Any things that need merging, etc.
* Any updates on who is interested
* Groups: dev1, dev2, dev3
* Anyone else?
* Next steps
* Additional calls, etc.
* Next target
### Session 1
- Attendees
- Josh Moore (GerBI)
- Davis Bennett (freelance)
- Joost de Folter (Crick)
- Will Moore (UoD)
- Frances Wong (UoD)
- Matthew Hartley (EBI)
- Francois Sherwood (EBI)
- Aybuke Yoldas (EBI)
- Tom Boissonnet (HHU)
- Guillaume Gay (FBI)
- Norman Rzepka (scalable minds)
- Seb Besson (Glencoe)
- Notes
- Josh:
- 3 months left. We will need to start
- dev1 (Will/Davis)
- Will: not much. New code from Joost loading remote data for conversion.
- Davis: problem space is straight-forward. e.g., no metadata changes.
- Josh: had had problems with zarr-python v3. had a proposal on the table to use
- Will: couldn't use dask
- Davis: wrinkle with array constructor. fix has been merged in v3 even if there is no release. caveat for tensorstore: no wheels for windows.
- `dask.from_array()` rather than `from_zarr`
- dev2 (Norman)
- PR for the next version of RFC-2: https://github.com/ome/ngff/pull/250
- Josh: so do we move dev1 to dev2 and work from that branch?
- Norman: https://github.com/ome/ngff/pull/242
- Will: Validator working for everyone?
- Josh: yes, but not the thumbnail.
- Joost: didn't like label subimages
- Will: to look at both of those.
- dev3
- Matthew: looking at minimal RO-Crate example
- with minimal organism and imaging modality
- https://github.com/ome/ome2024-ngff-challenge/pull/2
- Francois: laziest possible metadata with *some* information
- linking to NCBI taxon ID
- tried to link to *some* linked data
- could be better; not wrong.
- working on BIA models at the same time.
- SVGs in repo for importing.
- Will: json is within the zarr or above the zarr?
- Matthew: above the zarr.
- Josh: was assuming within the zarr.
- Seb: describes the whole zarr or pieces
- Tom: assumed that things on the root apply to everything, etc.
- Davis: will the RO-Crate not be accessible to Zarr tools?
- Josh: working towards a ZEP. Maybe in the zarr.json.
- Guillaume: separate file question -- would like to export legacy data with omero-cli-transfer. something that would be useful for NGFF & OMERO/OME-TIFF would make sense.
- Matthew: current proposal does that, but doesn't
- Need to enumerate the trade-offs, then experiment and see what we learn.
- misc: sample data (Davis)
- https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/test.20data
- https://github.com/d-v-b/zarr-workbench/tree/main/v3-sharding-compat/data
- also testing zarr-python v3 against sharding
- Seb:
- :+1: on RFC-2. Timeline of two weeks for reviewers' comments. Clock is ticking, would be good to discuss if all the blockers have been addressed. Seb to discuss with Melissa
- zarr-java: some work from Norman *et al.* for a release. waiting to get things configured from the zarr-developers org. Let's get it there.
### Session 2
- Attendees
- Josh Moore (GerBI)
- Will Moore (UoD)
- Argha Sarker (UOS)
- Melissa Linkert
- Peter Sobolewski (JAX)
- Erick Ratamero (JAX)
- Susanne Kunis (UOS)
- Dan Toloudis (AICS)
- Eric Perlman (self)
- Kiya Govek (JAX)
- Matthew Hartley (EBI)
- Notes
    - Quick summary from session 1: (**dev2**) With the [updated proposal for RFC-2](https://github.com/ome/ngff/pull/250), folks working on dev1 propose to migrate to dev2 with all their activities over the course of the next 2 weeks, i.e. we begin updating the metadata for the OME-Zarr datasets that are being written. (**dev1**) Otherwise, previous issues from dev1 for using zarr-python have been fixed in the zarr-python v3 branch; additionally, a version using dask.from_array rather than dask.from_zarr will be attempted. (**dev3**) Work on dev3 has also begun with https://github.com/ome/ome2024-ngff-challenge/pull/2. Comments welcome on the PR even though it is merged. Also welcome are other PRs/commits with alternative proposals. Francois will be tracking the various pros and cons of the alternatives.
- dev1/dev2 (binary):
        - Eric: looking into what it takes to convert (others like Will working on the actual code), e.g., writing with tensorstore to GCS; balancing performance (see the sketch after these notes)
- Josh: example in https://github.com/ome/ome2024-ngff-challenge of using zarr-py for __(?) and tensorstore for writing. maybe makes more sense to focus on tensorstore layer than dask, or work on both concurrently
- Eric: want out-of-the-box set of tools that work locally, then we (JAX team / Eric) will make work with GCP and object storage
- Josh: Anyone else working with object storage?
- Will: s3 mounted or remotely? did that with MinIO
- Dan: I like direct writes to s3, but for the purposes of our institute we're probably writing locally and copying up to s3
- Will: NGFF validator updated to work against Norman's PR. Difficulty that the schemas are not there until the PR is merged. Do we need relative links to schemas instead of absolute links to fix this?
- https://github.com/ome/ome-ngff-validator/pull/36
        - Josh: Maven Central changed its deployment process, so things from a few months ago need to be redone. Might deploy to a GitHub repo and skip Maven Central for the challenge.
- Josh: check out links above in misc section for test data of zarrs in v2 and v3 versions
- dev3 (metadata)
- summary thanks to Matthew: how we structure the thing and what goes in it. Upcoming issue for summarizing the pros & cons. Decision to be made at the next meeting so that the tooling can be updated.
- Argha: have been looking at including the nodes and what is included as well as the contexts. Will be adding a response on the PR.
- Susanne: also looking at an OMERO.openlink example of RO-Crate with the OMERO metadata. Map annotations are a blackbox. Also the question of what happens in the local file case when all URLs are relative.
- Erick: playing catchup (lots of off time) now and probably through October. Will get involved with dev3 but in a month or so. No opinions. Carte blanche.
- misc
- Peter: labels not working in 0.5.0 release of napari. handling of zarrs that have a lot of chunks via dask is poor. tensorstore might be more performant. BFIO folks have plugin that thought could read ome-zarr metadata but didn't work out
- Peter: is anyone else zarr-related going to be at scipy conference?
- Josh: Dan if you get a tool working with the challenge version of ome-zarr spec, let us know. will need to start working with viz tooling soon
- Will: vizarr doesn't work with challenge spec yet. could look at omero-cli-zarr, but probably won't be main way of working with it
- Dan: overseeing web-based viewer and c++ code for rendering ome-zarr. have not yet looked at implementing challenge spec yet. c++ viewer uses tensorstore, web viewer uses zarrita.
- Josh: only change so far is swapping out v2 for v3 zarr. over next few weeks we will need to decide what's in and what's out for rfc2
- Josh: how does c++ handle groups? Dan: it doesn't. Josh: when transforms come that might get more complicated, but outside scope of challenge
- Josh: most communication happening in ngff stream of imagesc zulip
- https://imagesc.zulipchat.com/#narrow/stream/328251-NGFF
- Eric: where would right place be to look for summaries before zoom meetings? Josh: next time will have beefier agenda
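
A minimal sketch of the tensorstore-to-GCS approach mentioned above, using the `zarr3` driver; the bucket, path, shape, and chunk sizes are placeholders:

```python
import numpy as np
import tensorstore as ts

spec = {
    "driver": "zarr3",
    "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "image.zarr/0/"},
    "metadata": {
        "shape": [4096, 4096],
        "data_type": "uint16",
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": [2048, 2048]},
        },
    },
}

# create the array, then write one chunk-aligned region
arr = ts.open(spec, create=True).result()
tile = np.random.randint(0, 1000, size=(2048, 2048), dtype="uint16")
arr[0:2048, 0:2048].write(tile).result()
```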
## Kick-off (20240619)
### Rough agenda
- As necessary
- Introductions, questions, etc
- Repositories, datasets, buckets
- Tools, developers, **examples**
- Motivation & scope
- Review of the last week
- PRs
- Blockers
- Specific questions/tasks
- Java status statement
- Parallelization frameworks (dask & spark)
- Shard access benchmarks
- Modalities (file count reduction, etc.)
### Session 1
- Attendees
- Josh Moore (GerBI)
- Sébastien Besson (Glencoe Software)
- Will Moore (OME)
- Jean-Marie Burel (OME)
- Joost de Folter
- Tom Boissonnet
- Davis Bennett
- Petr (OME)
- David (OME)
- Frances (IDR)
- Guillaume Gay (France BioImaging)
- Notes
- Introductions, etc.
- Tom: not many datasets from HHU. could go fishing but it's not going to be significant (certainly not annotated). happy to be involved with the REMBI/RO-Crate work discussion. Poor understanding of what needs to develop. Python coding.
- Team 1 (Zarr v3)
- Team 2 RFC-2 (version etc)
- Sounds like: Team 3 (metadata, RO-Crate etc) this afternoon
        - Joost: also team 3. Have lots of datasets from 2 projects. Multimodal. Interested in generalizability, including transforms. Python coding at the Crick. Generally interested in the challenge. Would love to see benchmark data. Might not be able to share the Crick data. Have inhouse compute & storage.
- Josh: selling point is "self-publication platform", but 'dev' version of the data might be a blocker. Convert to NGFF v0.4 (Zarr v2) might be good first step.
- Team 1: v2/v3 (Python-spin)
- Josh: been looking at Davis' and Will's PRs
- Will: not sure what the next step is. dask issue as a blocker? try tensorstore for reading/writing? briefly looked at that. unsure how to read remote zarr data. for converting something much larger without download, would be good to work from remote.
- Davis: tensorstore has the concept of KV store drivers. One is S3. Unsure if it works on HTTP. Config is all raw JSON/dict driven.
- Will: currently blocked on dask.
        - Davis: would never use the dask.to_zarr and from_zarr methods. A trap. Joost: why? Davis: arrays and groups have a model/attributes. Dask papers over those, meaning people don't use Zarr to its fullest. Can also just use regular dask arrays.
- Will: using https://github.com/ome/ome-ngff-validator/pull/35#issuecomment-2168345157 to convert. Best way to update that for remote data?
- working RemoteStore (which is the new FSStore)
- tensorstore
- Will: missing documentation for sharding. Davis: Struggling with coming up with user-facing documentation that's easy to understand. Would encourage you to read the spec and possibly Norman's tests. Basically a compression algorithm. Declares a mini-array internally. (Can you have nested sharding codecs?!)
            - `a = root.create_array(str(ds_path), shape=data.shape, chunk_shape=shard_shape, dtype=data.dtype, codecs=[ShardingCodec(chunk_shape=chunk_shape, codecs=[BytesCodec(), BloscCodec()])])` (see the fuller sketch after this session's notes)
- Davis: parallelize over entire shards, use large chunk size matching the shard
- No good API for exposing the sharding size. That's what you would parallelize on for reading.
- ergonomics of zarr-python v3. independent of the metadata. good if those discussions can happen in the zarr zulip (https://ossci.zulipchat.com/).
- Javascript
- zarrita.js sharding: https://github.com/manzt/zarrita.js/pull/94
- Will: moving sharding or chunks? Davis: what should happen is -- client requests chunks, passed to sharding codec, range requests are resolved ... Will: how do the URLs work? range request headers
- Java
- Seb: (more this afternoon) Java teams are trying to do the same as discussed under Python but starting at a more prototype stage. Assessing java library (zarr-java). Working on an example of converting a v2 to v3 and then converting back. It will trigger very similar question about how to introspect a shard, etc. Try to keep things in sync with the Python ecosystem. **bioformats2raw** is not in scope at the moment. Focused just on the arrays. Similar as the discussion around dask, we will then start investigating how to scale that up (toy example, etc.) Spark, etc. as the next API question(s).
- Davis: re: API at the top-level would be to represent all arrays as sharded. In the v2 case, you just have one chunk per shard. In v3, there are multiple chunks per shard. That would provide a consistent API at least. Seb: is the compression global to the array? currently yes. Josh: chunks/subchunks..
- Seb (chat): Sharding logically splits chunks (“shards”) into sub-chunks (“inner chunks”) that can be individually compressed and accessed. This allows to colocate multiple chunks within one storage object, bundling them in shards. The vocabulary is definitely complex for a newcomer (and confusing). Davis: for "chunk" we think of a read/write symmetry, but we break that with the "shard". Still looking for good terminology.
- Distribution
- Seb: where to publish zarr-java? need a release like pre-release on pypi
- TODOs:
- Josh: conda environment for everyone (script to repo)
- Davis: ping when RemoteStore's fixed
- Will/Josh: tensorstore/RemoteStorage investigation (script to repo)
- Josh: announcements on image.sc (zulip when useful)
- Josh: RO-Crate kick-off (minimally after this afternoon meeting)
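
Expanding the `create_array` snippet from the notes above into a self-contained sketch, assuming the zarr-python v3 alpha of the time (import paths and keyword names were still in flux, so treat this as illustrative rather than definitive):

```python
import numpy as np
import zarr
from zarr.codecs import BloscCodec, BytesCodec, ShardingCodec

data = np.random.randint(0, 255, size=(4096, 4096), dtype="uint8")
shard_shape = (2048, 2048)  # one storage object per shard
chunk_shape = (256, 256)    # inner chunks, compressed individually

root = zarr.open_group("example.zarr", mode="w")
a = root.create_array(
    "0",
    shape=data.shape,
    chunk_shape=shard_shape,  # the outer (write) chunk is the shard
    dtype=data.dtype,
    codecs=[ShardingCodec(chunk_shape=chunk_shape,
                          codecs=[BytesCodec(), BloscCodec()])],
)
a[:] = data  # parallelize over whole shards when writing at scale
```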
### Session 2
- Attendees
- Josh Moore (GerBI)
- Kiya Govek (JAX)
- Tom Boissonnet (Heinrich heine University)
- Peter Sobolewski (JAX)
- Norman Rzepka (scalable minds)
- Hannes Spitz (scalable minds)
- Kola Babalola (EMBL-EBI)
- Melissa Linkert (Glencoe)
- Seb Besson (Glencoe)
- Notes
- Hannes: with scalableminds
- Kola: getting up to speed
- Josh: intro on 3 teams
- Team 1: v2 to v3 tooling
- https://github.com/ome/ome2024-ngff-challenge using tensorstore for array conversion, blocker to using dask
- Team 2: RFC-2 - move to zarr v3 in the ome-zarr spec, might move close to Team 1
- Team 3: metadata, Ro-crate
- Kiya: pitching to JAX leadership
- what is the pitch for this challenge outside of this community?
- Josh: immediate value other than code/tools would be a blog post saying "we can do this" (POC of process)
- Josh: ultimately about scalability - benchmarking numbers will be useful
- Josh: Is there a way to pitch this to get more money? Presenting in Japan next week, but maybe too early for that.
- Josh: team 1 goal: Write a python script to convert ome-zarr to v3, put that data somewhere
- team 3 will need similar goal for ro-crate. would start with putting ro-crate inside zarr directory, then build validator web app
- ome-ngff-validator on v0.5-dev1: https://deploy-preview-35--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/v0.5-dev1/6001240.zarr
- Tom: tutorial for building ro-crate?
- https://www.researchobject.org/packaging_data_with_ro-crate/
- https://docs.google.com/document/d/1prxAKrz4yHtC7kt6IOZ1qjwTcNtz1j3hJGjEFGzm6wg/edit#heading=h.uesvkk2ldyg
- Davis: what's the idea with the ro-crate specification with zarr?
- Josh: use ro-crate standard to attach collection metadata
- Davis: clients who want to consume that metadata couldn't do that with zarr, would need to buy in to ro-crate model
    - Josh: if this is useful, would take back to zarr community and add to a ZEP(?)
- Josh: who's doing something in the next two weeks?
- Tom: need to read tutorial for ro-crate
- Josh: Kola, your team could get involved if you have a JSON-LD representation of REMBI
- Kola: they(?) in the middle of developing model for JSON-LD, hopefully finished in the next few weeks
- Josh: communication on zulip: https://imagesc.zulipchat.com/
- Norman: Hannes working on zarr-java, developing tests on zarrita
- Melissa: Glencoe expanding work on testing v2 and v3: https://github.com/zarr-developers/zarr-java/issues/1
- Does zarr-java have support for v2? Norman: some support for v2 but not all of it works.
- Norman: Options would be to build out to parity with jzarr, or let jzarr handle v2
- Melissa: would be most helpful to have zarr-java built and published somewhere. Priority is good v3 support in zarr-java, but v2 would be nice to have
- Seb: how much sharding is in zarr-java?
- Norman: there is an example somewhere, code is implemented, it also does partial reading
- Melissa: if there is a tool for v2 to v3 and v3 to v2, we should be able to test round-trip sharding
- Josh: what are you thinking for deploying? maven central?
- Seb: might need to apply for group on maven central
- Norman: could we reuse dev.zarr? We'll be in touch about how Maven Central works and about credentials
- Josh (chat): https://central.sonatype.com/artifact/dev.zarr/jzarr/0.3.7/versions
- Peter: on the napari side, numpy2 getting ironed out, pre-testing on zarr-python 3.0 but there are test errors (related to dask?)
- Josh: those are issues that exist externally of the challenge and need to get fixed before we can make progress?
- Peter: important that something can visualize these files
- Peter: napari 0.5.0 is "real soon now" and might have better support
- Josh: would it make sense to use pre-release of 0.5.0?
- Peter: yeah, don't use 0.4.19
- Josh: what is best candidate for visualizer for v3 arrays?
- Norman: webknossos has v3 support, but question is ome metadata. Neuroglancer can do v3 and can currently read 0.4 metadata
- Josh: does someone have a working neuroglancer example that we could pass around? (see the neuroglancer sketch at the end of these notes)
- Norman: don't have a dataset with 0.4 metadata in zarr v3, but if we had one we could put it into neuroglancer.
- Tom/Kola: RO-Crate tutorial
- Josh: neuroglancer example for dev1
- Peter & co.: napari release (independently of challenge)
- Hannes/Norman: deployment of zarr-java
- Josh: send updates, etc.
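- A minimal sketch of the tensorstore-based copy Team 1 is building around, assuming tensorstore's `zarr` (v2) and `zarr3` drivers: open a Zarr v2 array, create a Zarr v3 array with the sharding codec, and copy. Paths, shapes, and codec choices here are placeholders, not the challenge tool's defaults; see `resave.py` in the repo for the actual implementation.

```python
import tensorstore as ts

# Placeholder path to an existing Zarr v2 array (read-only).
src = ts.open({
    "driver": "zarr",                                   # Zarr v2
    "kvstore": {"driver": "file", "path": "input_v2.zarr/0"},
}, read=True).result()

# Create a Zarr v3 array with the sharding codec and matching shape/dtype.
# Shapes below assume a 5D (t, c, z, y, x) array; adjust to your data.
dest = ts.open({
    "driver": "zarr3",                                  # Zarr v3
    "kvstore": {"driver": "file", "path": "output_v3.zarr/0"},
    "metadata": {
        # One storage object per shard...
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": [1, 1, 64, 1024, 1024]}},
        # ...holding many individually compressed inner chunks.
        "codecs": [{"name": "sharding_indexed", "configuration": {
            "chunk_shape": [1, 1, 64, 256, 256],
            "codecs": [{"name": "bytes", "configuration": {"endian": "little"}},
                       {"name": "gzip", "configuration": {"level": 5}}],
        }}],
    },
}, create=True, dtype=src.dtype, shape=src.shape).result()

# Simplest possible copy; the real tool streams shard-by-shard instead of
# reading the whole array into memory.
dest[...] = src.read().result()
```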
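- On the question of a neuroglancer example to pass around: a minimal sketch of opening a remote OME-Zarr with the neuroglancer Python package. The URL is a placeholder, and the `zarr://` source scheme / Zarr v3 auto-detection behaviour should be double-checked against the neuroglancer version in use.

```python
import webbrowser
import neuroglancer

viewer = neuroglancer.Viewer()
with viewer.txn() as s:
    # Placeholder URL; "zarr://" lets neuroglancer probe the store
    # (recent versions also have explicit "zarr2://" / "zarr3://" schemes).
    s.layers["image"] = neuroglancer.ImageLayer(
        source="zarr://https://example.org/path/to/image.zarr/0"
    )

print(viewer)                 # prints a local URL hosting the viewer state
webbrowser.open(str(viewer))
```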
## Kick-off (20240612)
- Rough agenda
- Introductions, questions, etc
- Repositories, datasets, buckets
- Motivation & scope
- Tools, developers, **examples**
- Timeline, meetings, communication
### Session 1
- Attendees:
- Josh Moore (GerBI)
- Will Moore (OME)
- Sébastien Besson (Glencoe Software)
- Norman Rzepka (scalable minds)
- Eric Perlman (Yikes LLC / JAX)
- Guillaume Gay (France BioImaging)
- Hiroya ITOGA (SSBD, RIKEN, Japan)
- Koji Kyoda (SSBD, RIKEN, Japan)
- Susanne Kunis
- Davis Bennett (freelance)
- Stefanie Weidtkamp-Peters (GerBI)
- Also: Khaled, Dominik, Frances,
- Notes
- Seb: adjusting calendars now. How much more can change? Josh: not much I assume. An hour earlier for the AM session and an hour later for the PM session is likely the limit
- Frances: expected to join both? (Hell) no. :) Choose what works best for you.
- Datasets & repositories
- Davis: OpenOrganelle (XX TBs); pay for compute; AWS open data.
- Eric: JAX. [KOMP](https://www.jax.org/research-and-faculty/resources/knockout-mouse-project) data. Dave is discussing internally. Perhaps beyond research IT group. ~10 TBs. (Have more data as well, but that's the highest priority). Paying for conversion and storage.
- Frances: IDR. Conversion/storage ok. What parameters? Which datasets are interesting? (type, size)
- Guillaume: FBI. Unsure of what data. X TB from MiFoBio. Will also collect from colleagues. storage/compute ok.
- Hiroya/Koji: SSBD, already started converting SSBD image data (in OME-Zarr v0.4). storage/compute fine. plan to convert about 40 TB (physical servers/VMs)
- Norman: data proxied (on-the-fly). ~150 TB.
- Josh/Susanne: N4BI: no numbers but fairly large. 1PB so can share space with others.
- Scope
- Will: "ephemeral" seems unfortunate. What's the benefit? If we limit the scope more, then it can be usable.
- Norman: the expensive part is converting other formats to Zarr; the cheaper part is the metadata (no standards by the fall), so people need to be willing to change a number of JSON files.
- Eric: the commitment is being willing to make these changes and to keep making them up to a given date.
- Seb: implementation components of RFCs are still open. Does this count as an implementation for those RFCs? (counting towards validation). i.e. "early adoption process"
- Davis: only transition from zarr v2 to v3.
- Josh: so pin down sharding today.
- Seb: do completely minimal scope and ONLY change the chunks to shards.
- Davis/Eric: lot of interest in the sharding
- Will: OMEZarrReader: how it will handle arbitrary order and number of dimensions.
- Eric: but the metadata people aren't here.
- Norman: interested in the RO-Crate (experimental)
- Tools
- Seb: Glencoe (tooling/Java)
- Davis (tooling/Python)
- v2 to v3: can convert in Python.
- unclear how to do that efficiently (dask? see the dask sketch at the end of these notes)
- Josh: rechunker? zarrita can do it. can resurrect.
- conversion code in zarrita https://github.com/scalableminds/zarrita/blob/async/zarrita/array_v2.py#L452-L559
- Seb: zarr-java is the only thing known of; will spend time looking at that. Replicate the work from zarr-python v3, test the round-tripping (read/write) (Melissa will also propose)
- Norman: have been doing more work over the last weeks. there is also some v2 support in zarr-java. Needs more testing.
- Eric: open to a world of multiple passes.
- Josh: so focus
- Norman: work on RFC-2 more so we can convert the metadata forward as well. Josh: ETA? proposal in 2 weeks.
- "Teams":
- RFC-2: Norman, Will, Seb (re-review)
- v2/v3: Eric, Will, Davis (possibly just consumer)
- rocrate (or "metadata team"): Josh, Norman, Guillaume (review)
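- On "unclear how to do that efficiently (dask?)": one possible shape of a dask-parallelised copy, sketched here with tensorstore as the Zarr v3 destination since it accepts numpy-style assignment. This is only a sketch of the idea, not what the challenge tooling ended up doing; paths and shapes are made up, and dask chunks should line up with shard boundaries to avoid partial-shard writes.

```python
import dask.array as da
import tensorstore as ts

# Source: an existing Zarr v2 array, read lazily with dask.
# (Path and component name are placeholders.)
src = da.from_zarr("input_v2.zarr", component="0")

# Destination: a pre-created Zarr v3 sharded array, opened for writing.
dest = ts.open({
    "driver": "zarr3",
    "kvstore": {"driver": "file", "path": "output_v3.zarr/0"},
}, open=True, write=True).result()

# Align dask chunks with the shard shape so each task writes whole shards.
shard_shape = (1, 1, 64, 1024, 1024)   # made-up example
src = src.rechunk(shard_shape)

# dask writes block-by-block via __setitem__ on the destination;
# tensorstore handles the sharded encoding.
da.store(src, dest, lock=False)
```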
### Session 2
- Attendees
- Joel Lüthi
- Aastha Mathur
- Matthew Hartley
- Eric Perlman (Yikes LLC, JAX)
- Fernando Cervantes (JAX)
- Melissa Linkert
- Davis Bennett
- Peter Sobolewski (JAX)
- Kiya Govek (JAX)
- Josh Moore
- Jason Swedlow
- Will Moore
- Bugra Oezdemir
- Notes
- Data & repositories
- Aastha/Bugra: HD - few submissions and the nodes. need some time to figure it out over the next few weeks.
- Chat: Nanotomy.org is a node for HD, would be good incentive. Has anyone reached out?
- Josh has talked to Ben - they are in the process of submitting it to IDR
- Matt: some of the data is in BIA and IDR. Fraction already converted to OME-Zarr (v2)
- Kiya & Eric: JAX - 8TBs as NDPI in KOMP, bigger as Zarrs. Reaching out to Broad. Heard a soft no.
- Fernando & Peter as well. Interested in the sharding (use them ASAP)
- Joel: not much public data. Interested from the image processing on large numbers of images in shards. Also want to hear the experiences.
- Matthew: approx 1PB of data that *should* be in OME-Zarr. Looking at how to do it at scale; long-term this requires software & database development, so balancing that and the ephemeral work. 100s of datasets across techniques. Converting to v2 now and to v3 later is good (already have 150 TB of v2 and existing pipelines). For the Autumn target, 250TB-ish. Could possibly go beyond based on the internal storage team (may include the storage available to IDR)
- compression issue means that some datasets will be 10x larger. (300TB -> 3PB)
- metadata: problems are solvable later on in the process. cheap operation in CPU cycles.
- JRS: IDR - Chicken and eggs
- IDR depends on resources at EBI
- What's strategic? Just visually?
- Broad is an internal target. Perhaps with momentum they will re-join.
- Melissa: Glencoe - no novel datasets. Focus on having a separate Java tool for v2 to v3 and for sharding.
- not necessarily bioformats2raw
- iterate quickly - zarr-java is a moving target right now
- Process & tools
- Might be good for groups to convert to v2 today so that they can use the v2-to-v3 tool when it comes out.
- zarr v3 is out in python now
- tools in zarr-python or zarrita for doing the v2-to-v3 conversion. But performant scaling-up? dask-based?
- Matthew: metadata. Basic validation tool, perhaps under ro-crate. 2-3 weeks is hard.
- but the metadata side can come a bit later.
- Jason: similar in terms of time scale, but IDR team is interested in metadata.
- getting something "consistent" with REMBI guidelines
- deeper discussion with IDR team next Monday.
- Matthew: would be great to have an official project validator
- Matthew: can do a lot of testing of v2 to v3.
- Josh: think about platform for scaling up (dask, spark, rechunker, etc.)
- Matthew: re: target spectrum, bioformats2raw is planar, so using rechunker for the 3D downsampling.
- Josh will get ro-crate sample within next 2 weeks
- Norman will make a proposal for ro-crate
- We can do ro-crate for now without committing to it fully because this is ephemeral
- Peter: visualization side. napari-ome-zarr for the JAX datasets.
- Josh: would be a lot of work to keep up with each of the iterations
- First step would be to see if napari can read a zarr with sharding
- some people already have frankensteinian ome-zarrs with v3 sharding - should find a place to store those
- Davis: if we have a spec, could update pydantic zarr
- Josh: not going to change 0.4 spec, could we have a branch of pydantic zarr for now?
- Davis: currently pr combines v3 and metadata
- could instead relax requirement to allow zarr v3 with same metadata requirements
- Will: don't see how to separate zarr v3 changes from metadata changes in PR
- Josh: set up a dev version of the spec for the challenge; remove mention of .zgroup and .zarray from the spec
- stage 1: addition of .zattrs to zarr.json (v2 to v3)
- stage 2: addition of RFC-2 metadata
- stage 3: ro-crate.json (see the metadata sketch at the end of these notes)
- Davis: so we are proposing changes to the spec without rfc process?
- Josh: it won't be a new version of the spec - just for generating this ephemeral data
- Davis: what could go wrong?
- Josh: sharding is pretty safe. The spec wouldn't support it if someone needs to use both v2 and v3
- Josh: might end up being developed into the spec in the end, but we shouldn't put that pressure on ourselves
- JRS: substantial collections of v2 data we don't think we can get to v3 in the timeline
- referring to mixed v2 and v3
- Josh: valid for purposes of challenge: datasets converted directly to v3, or datasets converted from v2 to v3
- so that challenge is internally consistent in v3
- Timeline of challenge
- Meetings may not be exactly every two weeks, but that's the attempt
- There is another meeting next week
- Dates on image.sc post: https://forum.image.sc/t/ome2024-ngff-challenge/97363
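- To make the staged plan above concrete, a hand-written sketch of what the on-disk metadata could look like: stage 1 moves the former .zattrs content into the `attributes` key of the v3 `zarr.json`, and stage 3 drops an `ro-crate-metadata.json` next to it (stage 2, the RFC-2 metadata changes, is omitted). All key names, values, and paths below are illustrative only; the exact layout is what the teams above are working out.

```python
import json
from pathlib import Path

root = Path("output_v3.zarr")          # placeholder path
root.mkdir(parents=True, exist_ok=True)

# Stage 1: the former .zattrs content now lives under "attributes"
# inside the Zarr v3 group metadata (zarr.json).
zarr_json = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        # Illustrative multiscales block, as it would have appeared in .zattrs.
        "multiscales": [{
            "axes": [{"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
            "datasets": [{"path": "0",
                          "coordinateTransformations": [{"type": "scale",
                                                         "scale": [1.0, 1.0]}]}],
        }],
    },
}
(root / "zarr.json").write_text(json.dumps(zarr_json, indent=2))

# Stage 3: a minimal RO-Crate metadata descriptor alongside the image data.
ro_crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json",
         "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
         "about": {"@id": "./"}},
        {"@id": "./",
         "@type": "Dataset",
         "name": "example challenge dataset",          # placeholder
         "license": "https://creativecommons.org/licenses/by/4.0/"},
    ],
}
(root / "ro-crate-metadata.json").write_text(json.dumps(ro_crate, indent=2))
```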