# LEAPS-INNOV WP7 parallel session
April 20, 2021
- indico agenda for the entire kick-off meeting: https://indico.desy.de/event/29568/
## VC details:
- dial into the LEAPS-INNOV zoom conference: https://desy.zoom.us/j/92974199260 (the passcode was sent to you by mail)
- use the breakout self-assign feature to enter the break-out room
## Attendees:
- HZDR: Peter Steinbach, Uwe Konrad, Guido Juckeland
- DESY: Anton Barty, David Pennicard, Frank Schluenzen
- ESRF: Vincent Favre-Nicolin,
- Soleil: Brigitte Gagey
- PSI: Alun Ashton
- ALBA: Nicolas Soler
- MAX IV: Zdenek Matej, Clemens Weninger
- SOLARIS: Michal Piekarski, Michal Falowski
- ELETTRA: Francesco Guzzi
- DIAMOND: Sateesh Maheswaran
- HZB: André Hilger, Ingo Manke, M. Osenberg
- Andy Gotz
- Ireneusz Zadworny
- Jerome Kieffer
- Mark Heron
- Darren Spruce
- Lorenzo Pivetta
## Agenda
Our parallel session will run as follows (times estimated):
- Greeting and introduction: 5 mins
- Short presentation (a few slides each) on tasks 7.1-7.4; estimated 30 mins total (depending on discussion)
- Presentations from institutes; 6 people responded, and if we take up to 10 mins each plus some discussion this could be maybe 1.5 hours
+ David Pennicard - DESY
+ Brigitte Gagey - Soleil
+ Alun Ashton – PSI
+ Francesco Guzzi - Elettra
+ Nicolas Soler - ALBA
+ Vincent Favre-Nicolin - ESRF
+ Peter Steinbach - HZDR
+ Zdenek Matej and Clemens Weninger - MAX IV
- Time for discussion
So, we will likely be finished by 1300, but if we're getting a lot of useful discussion and want to continue we could rejoin in the afternoon.
# Notes / Questions
Use this chapter of the document to take notes, share resources or ask questions.
## Coffee break
- worked well :-)
## Greetings / David Pennicard
- staffing equally distributed within WP7
- mean number of FTE months = 8-12, hence not all partners can be fully active at the same time
- main goal of WP is knowledge sharing. D.S: Based on the comments from the EU, it is clear that individual facilities will be invisible with regard to future EU funding applications. So it would be advantageous if we could achieve a sustainable knowledge sharing platform.
- **Please send all slides presented to David and Darren**
## Short presentations on tasks 7.1 - 7.4
### WP7.1 / David Pennicard
Tasks:
- setting up collaboration platform
- also code and data repository for sharing
- regular conference meetings
- [ ] Question by DP: a webportal - could this act as a general source of information?
- PS: we need this to share progress and deliverables, question here for me is how public we share information?
- [ ] Question by DP: what platforms or tools are required?
- Any objections to using git for version control?
- PS: whatever we choose everyone should have access and it needs to be clear who administers that; in the best of all worlds I'd prefer a github.com/gitlab.com team (on neutral ground so to speak) where we only have to identify the administrators
- MF: some labs moved to gitlab.com
- PS: gitlab.com is also a bit more feature rich :wink:
- GJ: we also have a common setup where we replicate repositories to local gitlab CE instances for CI/CD
- [ ] Question by DP: how often should we meet?
- proposal is to meet every three months. DS (from Alun): should we rotate between the various facilities to host it?
- Question: should we invite similar projects to participate/present?
- Helmholtz has the new topic "Data Management and Analysis (DMA)" where subtopic 1 also covers data reduction
- [ ] Question: Should we have a single common repository/organization/group
- Until we develop code or some material together, such repos tend to be quite empty
- On the other side we need to have some overview
- PS: for WP7.3 there will definitely be code/pipelines developed and shared
- [ ] hosting the 3-monthly meeting: maybe that could be done on a rotating basis between facilities, so the sites can showcase their own activities?
### WP7.2 / Nicolas Soler - Vincent Favre-Nicolin
- will organize interviews with stakeholders
- establish limits/metrics for loss in lossy algorithms
- establish metrics for performance in data reduction
- for each facility:
- identify needs
- identify contacts
- need from each partner:
- poll on global needs
- ???
- Question: what resources are devoted?
- DP: staffing is between 8-18 FTE months; funding for people is available but not really for new people
- DS: we know work is ongoing; funding aside, LEAPS is perhaps available to focus on compression
- AB: should we address this?
- AG: presentations this morning discussed matching the funding by a factor of 2
- VFN: main goal for this kick-off is to know whom to connect to (developers, scientists, ...)
- MH: putting this together was constrained by available budget
- main goal was to catalyze collaboration and build on what has already been done
- challenge: limited resources granted and coordinating
- DS: code sharing may be tricky and should not get in the way
- TODO after meeting:
- gather documents on benchmarks for compression or internal white papers which evaluated different schemes
- poll to formally gather names, projects and relevant techniques ?
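The metrics task above (limits for loss in lossy algorithms, performance of data reduction) could start from two simple figures of merit, compression ratio and PSNR. A minimal, stdlib-only Python sketch (function names and the example signal are illustrative assumptions, not project code):

```python
import math
import zlib

def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Ratio of original to compressed size (higher is better)."""
    return len(raw) / len(compressed)

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; math.inf means lossless."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# Example: a lossless codec -> PSNR is infinite, only the ratio matters.
signal = bytes(i % 256 for i in range(10_000))
packed = zlib.compress(signal, level=9)
print(f"ratio: {compression_ratio(signal, packed):.1f}x, "
      f"psnr: {psnr(signal, zlib.decompress(packed))}")
```

For lossy schemes the same `psnr` call on the reconstructed data gives a finite number, which is one way to "establish limits/metrics for loss" across techniques.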
### WP7.3 / Peter Steinbach
- create a corpus of datasets (open data)
- identify and evaluate viable approaches for compression/reduction including pipelines
- Compression pipeline: preprocessing/compression/transport or storage/decompression
- Focus:
- new algorithms
- flexible, performant pipelines
- validated & open results
- Invite all collaborators to share information
- Question: [EXPANDS](https://expands.eu/) collected datasets -> maybe use them for this project?
- AB: what is the effect on the science output imposed by the compression/reduction; need to get our hands on reconstruction pipelines for example
- PS: website available?
- AB: document by EXPANDS is a deliverable there: https://zenodo.org/record/4558708#.YFJ3A537SUk
- AB will search the link
- GJ: we should use the EU requirements for PaNOSC and ExPaNDS to publish their results; hence the data sets should also be visible via OpenAIRE
- Question: frameworks available for pipelines?
- PS: the closer you are to the detector (the higher the requirements for compression bandwidth), the fewer such frameworks exist
- some i/o frameworks like [hdf5](https://www.hdfgroup.org/)/[zarr](https://zarr.readthedocs.io/en/stable/tutorial.html)/[adios2](https://github.com/ornladios/ADIOS2) offer pipeline facilities
- for offline compression (on an HPC cluster or in the cloud), several workflow engines exist; [nextflow](https://nextflow.io) or [snakemake](https://snakemake.readthedocs.io/) are two examples among many others
- JK: blosc2 (https://github.com/Blosc/c-blosc2) offers building blocks for such pipelines
- PS: @JK can you point me to the API reference of blosc2 for pipelines, please?
- JK: I believe this page describes the different compressor and pre-filters available in Blosc2: https://blosc-doc.readthedocs.io/en/latest/frame_format.html
- PS: super cool, so this "pipeline" is a fixed combination of filter steps which come with blosc2 - not arbitrary filters that a user can "bring in"
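The preprocessing/compression/decompression pipeline structure named in the WP7.3 bullets can be illustrated with a minimal, stdlib-only Python sketch (the delta filter and function names are illustrative assumptions, standing in for real filters such as bitshuffle):

```python
import zlib

def delta_encode(data: bytes) -> bytes:
    """Preprocessing: store byte-wise differences, which often
    compress better for smoothly varying detector data."""
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)

def delta_decode(data: bytes) -> bytes:
    """Inverse preprocessing: cumulative sum of the differences."""
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        prev = (prev + b) & 0xFF
        out[i] = prev
    return bytes(out)

def pipeline_compress(raw: bytes) -> bytes:
    # preprocessing -> compression (transport or storage would follow)
    return zlib.compress(delta_encode(raw), level=6)

def pipeline_decompress(blob: bytes) -> bytes:
    # decompression -> inverse preprocessing
    return delta_decode(zlib.decompress(blob))

# Smoothly varying "signal": the deltas are small and highly repetitive.
frame = bytes((i // 16) % 256 for i in range(4096))
blob = pipeline_compress(frame)
assert pipeline_decompress(blob) == frame  # lossless round trip
```

This is the same filter-then-codec pattern the hdf5/blosc2 discussion above refers to, just spelled out end to end.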
### WP7.4 / Alun Ashton
- Contacts:
- Identify who is involved in the WP
- open dialogue with WP8
- industrial contacts
- IBM
- AG: IBM very interested to get involved, PSI also worked with IBM on "hardware" compression; need to formulate what help we want
- AA: other venues possible as well (FPGAs)
- HDF GROUP
- AG: good connections to hdfgroup, so would need to forward demands to them
- PS: having hdf5 be thread-safe from the get-go would be nice (at least for shared-memory machines)
- NVIDIA
- has a development center in Lund; several former LU professors were headhunted there, in particular with expertise in data compression on GPUs; they were keen on collaboration (also some meetings with ESS)
## Presentations from institutes
- idea for presentations is to provide an overview
- each partner gets a chance to show ongoing or conducted work
### David Pennicard - DESY
- people involved: Anton, Luca, David, Thorsten, Volker (or Patrick?)
- strong focus: serial crystallography
- central software: cheetah
- publication: https://asu.pure.elsevier.com/en/publications/cheetah-software-for-high-throughput-reduction-and-analysis-of-se
- website: https://www.desy.de/~barty/cheetah/Cheetah/Welcome.html
- code: https://github.com/antonbarty/cheetah
- Question PS: the github repo looks rather quiet, where does the main development of cheetah happen?
- main development happens internally at DESY on stash.desy.de
- latest developments will be released soon
- machine learning has strong interest @ DESY i.e. easier to get funding / attract interest
- Question AG: compression by fitting the background and only fitting the peak?
- Chuck ??? at SLAC did work on this
- might be cool for corner of detector
- AB is unclear whether this is a solution for all applications
- yield from this depends on crystal in experiment
- Question AB: how to handle data conservation/preservation if compressing the day after the experiment?
- AG: definition of raw data depends on technique (discussed in PANOSC)
- question of finances: if you can afford the storage, store raw data; if not, store compressed data
- AG: impact of compression is the question; example from tomography where jpeg2k lossy compression was fine to use as it retained the same level of reconstruction
- Question AG: resources for compression located at DESY?
- AB: typically looking into this topic - mostly geared towards future experiments
- Question PS: Anton Barty or Andy Götz: generally speaking, who publishes the entire reconstruction pipelines for your experiments? (background: for assessing the quality of compression efforts, seamless access to such reconstruction pipelines is crucial)
- do you mean software or scientific methodology? The software is a mix of locally developed software (open source) and externally developed software (usually open source but not always). Quality of compression is determined and evaluated by beamline scientists, usually with help from software engineers. Setting up of pipelines is done by data scientists and/or beamline experts.
- PS: Thanks for the elaboration. I basically meant to ask how easy it will be for us to reproduce your pipelines given some exemplary (raw) data sets?
### Brigitte Gagey - Soleil
- Question AG: switch off compression (slide 14)?
- at some pipeline stage (data acquisition), compression had to be stopped as it was too time-consuming
- use Eiger detector built-in compression
- use lz4 at storage system
- Question ZM: how is data used if the storage system uses compression?
- lz4 is used by storage layer (transparent to users, i.e. users don't see it)
- Question DS: what are your ideas about our project?
- compression closer to the detector would be nice
- Question AG: will soleil do serial crystallography?
- 2 beamlines in operation using this technique
- One report was made inhouse to compare compression schemes depending on the techniques. Brigitte will ask Emmanuel Farhi if this can be shared
- **We reconvene at 12:45pm!**
### Alun Ashton – PSI (starting 12:45pm)
- REDML project: reduction of high volume experimental data using machine learning. Primary case MX (structural biology: MX=Macromolecular Crystallography)? Swiss Data Science funded project
- Question PS: regarding the data rates, 10 GB/s is quite high. I wonder how long each experiment sustains this data rate? In other words, if the 10 GB/s have to be sustained for 2 seconds only, the task becomes a bit easier.
- AA: should be using bursts (a few seconds), accidents can happen
- AA: challenge is also to re-read the data once it lands on disk
- Question VFN: do you use Power9 and GPFS together?
- AA: Power9 used to process data stream
- ZM: GPFS works with Power9 natively -> site specific issues possible
- VFN: bad network performance on Power9 experienced in last couple of months (would be interested in exchange of knowledge/experience)
- AG: power9 nodes run Ubuntu instead of RHEL
- ZM: will provide contact to MAXIV IT department
- ZM: we have power8 and power9 using GPFS alright (correction: power8 has GPFS, power9 at MAX IV does not); ZM will investigate what the setup was on the IBM customer cluster in the US
### Francesco Guzzi - Elettra
- Question AG: software available as open-source?
- FG: need to ask
- FG: publication https://www.nature.com/articles/s41598-020-66435-6
- Question DP: compressive sensing for xray microscopy, does this change operation of experiment?
- FG: only selectively illuminate object -> use compressed sensing reconstruction and hence reduce data generated
- DP: interesting idea to depart from row-wise scans, **but** not contained in LEAPS-INNOV proposal
### Nicolas Soler - ALBA
- Question AG: how much data is compressed on disk?
- some beamlines converge to HDF5 based compression
- Question DS: file formats from "detectors"?
- mostly CBF (crystallographic binary format) files
### Vincent Favre-Nicolin - ESRF
- [pyFAI](https://github.com/silx-kit/pyFAI)
- usage for serial Xtallography with [lima2](https://gitlab.esrf.fr/limagroup/lima2)
- multi-frame compression for xray photon correlation spectroscopy: [dynamix](https://github.com/silx-kit/dynamix)
- tomography: use jpeg2000 compression
- software:
- https://github.com/silx-kit/hdf5plugin to use codecs inside hdf5
- contributed to blosc2
- PS: Can you elaborate a bit more on what you mean with "plugin API"
- AG: one idea is to mimic pyhdf5 interface also for in-memory stuff so that people can continue to use the same programs as for offline analysis
- VFN: the idea is that we need an API which is generic enough to be re-used in any/most hardware & network configuration.
- Question PS: I am not so familiar with Power9 hardware details, what codec does Power9 implement in hardware?
- JK: online gzip compression
### Peter Steinbach - HZDR
- AG: How will the 18 PM at HZDR be set up?
- PS: HZDR is adding in-kind contribution and we plan to have one person active for the whole project runtime
- SM: What is the compression algorithm behind ADIOS2?
- PS: ADIOS has a pipeline fashion similar to HDF5, but is somewhat restricted to HPC systems (needs a certain software environment, and is also more I/O-bandwidth optimized)
- also worth a look: https://www.nvidia.com/en-gb/data-center/magnum-io/
- DP: What would HZDR would like to learn from the Synchrotron and FEL community?
- PS: we are also users at those facilities and want to contribute back
- GJ: we also operate the ROBL beamline at ESRF, HIBEF at euXFEL
- GJ: we also want to learn about different (and future HZDR) data sources to have our data management environment able to plugin these data sets and analysis workflows as well
- NS: what is the ROFEX data actually representing?
- PS: turbulent flow in a centrifugal pump (resolution of figure is rather low to see the gas turbulences inside the liquid of the pump)
- SM: we should also have put a focus on compressing lots of small files
- DS: yes, it is important. Let's defer that more "data management" question
- VFN: what would you label as ‘small’ files examples. Individual 2D files below a few Mbytes or 1D data e.g. for spectroscopy ?
- SM: so this will depend on how your file system is configured
- SM: generally speaking, with GPFS for example, they will define at GB level
- VFN: These days we pack these ’small’ files in hdf5 so we don’t have issues with a large number of files.
- SM: but important to realise these are all tuneable parameters and how one configures parallel file system is a dark art and site specific (largely based on file access patterns of the applications)
- Peter referenced ADIOS2 See https://www.sciencedirect.com/science/article/pii/S2352711019302560 or https://github.com/ornladios/ADIOS2
### Zdenek Matej and Clemens Weninger - MAX IV
- Presentation: https://lu.box.com/s/hh0u1xgwjsejbcdf9zpjhalth7slsq39
- Question PS to all: is there an overview what software people use to analyze data? In other words, which ecosystems should our efforts target?
- VFN: I think we need a formal poll after (well before the next 3-monthly meeting) to gather all this information
- ZM: whatever we do, it needs to be compatible with downstream software (needed by consuming scientists)
- PS: at HZDR we see people use fiji for images, jupyter notebooks for sure or hdfview for h5 files - but I have no clear overview what the distribution is
- AG: PSI plugin for fiji (https://github.com/paulscherrerinstitute/ch.psi.imagej.hdf5 ?)
- AG: data preservation is essential -> how to look at the data in 10 years from now
- Comment MH: compression is a very broad field
- several dependencies (technology, algorithms, infrastructure, science)
- maybe bound/limit what we focus on
- PS: dataset corpus is one way to establish this focus
- MH: need to consider yield that can be expected from datasets
- Comment MH: not try to reinvent the wheel -> be prepared to look beyond while considering available resources (work force)
- AG: nice algorithms out there, but we should be mindful of our community
- AG: need to liaise with scientists to collect demands (matrix of solutions)
- AB: make the point of limited storage to scientists
- MH: scientists will never speak with one voice
- VFN: make poll to infer who is involved in what topic with what data set
### Michal Falowski - SOLARIS
- DP: cryo-EM is an important data producer
- AG: let's get in touch with the communities
- NS: for some centers, cryo-EM is considered a beamline -> this should be included?
- MH: should stick with data from photon sources
- MF: I've got short info from the cryo-EM person; the only compression for now is TIFF LZW. There is a new method of saving only the data (omitting zeros): "eee" electron event representation.
- MF: For archiving, he doesn't know any compression method, because it needs to be lossless and operate on TBs.
# Summary
- VFN: How to proceed?
- DP: share presentations on indico
- DP: go through material of today, find important links to projects that are established (matrix of modalities/frameworks/technology)
- NS: how to communicate? (mailing lists, chat platforms, ...)
- VFN: first identify fields, then go about including scientists
- PS: I second that idea! First isolate domains/datasets which are promising, then propose this list to the community -> then dive into compressing these (it takes time to understand experimental conditions)
- VFN: virtual seminar on compression approaches
- AG: good idea -> hdf5 compression plugins were developed with the photon community
- AG: how to capture the inner workings of a compression algorithm
- PS: good idea -> showcase what is possible and what worked
- ZM: be mindful of calendaring
- DP: use LEAPS-Innov portal to channel events to "outsiders"
- AG: should we focus on decompression?
- ZM: should codec be independent from container? (for example, the lz4 block format can be used by other codecs);
- PS: the zstd tool implements this today, e.g. it can also read and write the gzip format
- VFN: compression should be as transparent as possible
- AB: matlab imposed problems
- ZM: native mat files (from v7.3) are hdf5 based under the hood
- VFN/NS: poll with partners will be the next step -> confer with other coordinators or anyone interested
- GJ: so this will depend on how your file system is configured
- DP: will bring this up with LEAPS-INNOV coordination
- MH: currently only high level plan available -> detail this into deliverables
- maybe set up by dedicated channels
- VFN: there is a wp7 mailing list? -> DP will check
- not sure if this is intended
- MH: contact names from DIAMOND will be circulated next week
- MH: plan ahead and declare what is being done to avoid getting superficial information only; experience from EXPANDS
- AB: project manager needed
- AB: define concrete goal for each dataset, provide reconstruction pipeline, compare compression approaches with respect to information loss
- PS: How to get people to "react" -> how about mandating each partner to provide two data sets that are troubling them at the moment and turn this into a paper (assess state-of-the-art), repeat this at the end of the project
- VFN: let's not do this per center but per modality
- PS: could produce a paper now that everyone is invited to contribute (citations might motivate people)
- MH: people around me are not judged by citations
- PS: sure, but this idea was meant as a motivation across the community (scientists might be interested in these citations)
- DP: maybe concentrate on tomography? Can we find people?
- VFN: sure that the community will respond (or can be forced to)
- PS: I agree, outlook for more science is viable
- VFN: we should go for the poll first -> use this as the basis to sub-divide
- will circulate a draft with everyone
- poll could be conducted within one month's time
- AB: more active leadership needed in the beginning (to get the train on track)
- DS: FTE staffing was designed to spread the load across many shoulders
- DP: HZDR received 18 months FTE as the task is very concrete
- DS: specify reason for next meeting?
- DP: doodle poll on the next meeting will be circulated
- MH: within a month's time
- PS: happy to contribute/help, but WP7.3 needs corpus to work on
- PS: I think a paper based on the poll by WP7.2(VFN) is a valid goal to get engagement by community (both IT and scientific)
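ZM's codec-vs-container point above can be illustrated with DEFLATE, one codec that Python's stdlib ships in three containers (raw, zlib, gzip); only the framing differs, the compressed bitstream core is the same:

```python
import gzip
import zlib

payload = b"leaps-innov wp7 " * 256

# One codec (DEFLATE), three containers:
co = zlib.compressobj(wbits=-15)
raw_deflate = co.compress(payload) + co.flush()  # no header/checksum
zlib_stream = zlib.compress(payload)             # zlib header + adler32
gzip_stream = gzip.compress(payload)             # gzip header + crc32

# The same decoder core reads all three; only the wbits flag
# selects the container framing.
assert zlib.decompress(raw_deflate, wbits=-15) == payload
assert zlib.decompress(zlib_stream) == payload
assert zlib.decompress(gzip_stream, wbits=31) == payload
```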
# Feedback
## Let us know what you liked about today's parallel session :+1:
- I think the note taking added useful content to the discussion and we could store this somewhere for later reuse
- I very much liked the overview of the activities of all centers, coupled with concrete tips on tools/approaches that work
## Let us know what you didn't like or would love being improved about today's parallel session :-1:
- I would appreciate using the meeting time more effectively; I sometimes had the feeling that decisions or smaller polls were pushed into the indefinite future (with hackmd and zoom we have the tools to collect this information and move forward right away)
- :+1: I agree, at least we need to store the need for a decision on an agenda for a future meeting
- :+1: I also agree - it is OK for a kickoff to be a bit long to gather a wide range of opinions - but if we have more regular meetings these must be shorter so the tasks leaders will have to provide a clear agenda.