# Helmholtz AI Consultants All Hands Meeting, June 17, 2021

:::info
:bulb: Notes:
1. Collaborative note-taking is encouraged. Please feel free to add notes below.
2. At the end of the talks, please provide anonymous and constructive feedback.
:::

- Announcements: no all-hands meeting in July and August 2021; this BlueJeans meeting room will keep working, so if people want to get together, feel free to use it.
- Introducing two new community members:
    - Elisabeth Georgii :wave: (consultant with HMGU)
    - Gerome Vivar :wave: (consultant with HMGU)

---

## Talk 1: "Efficient data loading and benchmarking based on WeatherBench", Jakob Lüttgau (DKRZ)

## Notes:
<!-- Please write below this line -->

[WeatherBench repo](https://github.com/pangeo-data/WeatherBench)

- As a general rule: the I/O performance you see on your laptop (e.g. with an SSD) can be quite different in an HPC context.
    - The core reason is that an HPC system serves different purposes.
- For those not familiar with parallel filesystems:
    - Typically the storage system is split over several cabinets full of hard drives.
    - When you store a file in the filesystem, the software splits up your file and distributes these splits (referred to as stripes) among several of these cabinets.
    - Hence, choosing the number of stripes or the stripe size has quite some impact on your I/O performance.
    - As an example: if you choose the stripe size too small (say you have a 1 GB file and choose a 0.1 MB stripe size, i.e. roughly 10,000 stripes), then looking up a single 0.1 MB stripe has quite some overhead, which accumulates quickly.
- zarr: https://zarr.readthedocs.io/en/stable/tutorial.html (the "better HDF5", but development has focused on other data types than what HDF5 is good at)
- HDF5: https://support.hdfgroup.org/HDF5/whatishdf5.html (one of the standard formats for experimental data these days)
- Climate data is typically stored in netCDF: https://www.unidata.ucar.edu/software/netcdf/

Optimizing I/O (see the sketch below this list):
1. Determine model batch consumption
2. Familiarize yourself with data format tuneables
3. Familiarize yourself with storage service tuneables
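A minimal sketch of what steps 1 and 2 could look like with zarr (mentioned above): chunk the array so that one chunk corresponds to one training batch, so a data loader touches one object per batch instead of many small pieces. File name, array sizes, and batch size below are made up for illustration, not taken from the talk.

```python
import numpy as np
import zarr

# Hypothetical dimensions: 8760 hourly time steps on a 32 x 64 lat/lon grid,
# and a training loader that consumes batches of 32 consecutive time steps.
n_time, n_lat, n_lon = 8760, 32, 64
batch_size = 32

# Chunk along the time axis so that one chunk matches one training batch;
# the spatial dimensions stay whole, since a sample always needs the full grid.
store = zarr.open(
    "weather_example.zarr",
    mode="w",
    shape=(n_time, n_lat, n_lon),
    chunks=(batch_size, n_lat, n_lon),
    dtype="f4",
)

# Write dummy data chunk by chunk (in practice this would come from netCDF).
for start in range(0, n_time, batch_size):
    store[start:start + batch_size] = np.random.rand(batch_size, n_lat, n_lon).astype("f4")

# Reading one batch now touches exactly one chunk in the store.
batch = store[0:batch_size]
print(batch.shape)  # (32, 32, 64)
```

The same idea carries over to the storage-service tuneables (step 3): aligning chunk/stripe sizes with the access pattern avoids the small-stripe lookup overhead described above.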
## Questions and Remarks:
<!-- Please write below this line -->

1. How large are these datasets typically? Are we speaking GBs or TBs?
    - PS: I think there were hints to this in one of the first slides. IIRC, it depends on the angular resolution.
    - CB: Ok, I must have missed this :)
    - PS: No problem, you might not be the only one ;-)
2. The team at HZDR that runs quite large simulations, which also rely on quite large data (> TB range), relies heavily on [ADIOS2](https://github.com/ornladios/ADIOS2); they ran this successfully on machines like Titan (11k GPUs working as a team) and others.
    - Did you have a look at this?
3. Page 16: which I/O framework did you use for this slide?
    - PS: I am leading an effort to increase the use of compression in the life science community. I suggest we get in touch.

---
---

## Talk 2: "Inverting the beamline", Peter Steinbach (HZDR)

The slides are available at: https://figshare.com/articles/presentation/20210610-iktp-inversebeamlines-Steinbach_pdf/14768982

## Notes:
<!-- Please write below this line -->

- Problem: invert a simulation, i.e. infer simulation parameters.
    - Typically approached with a 'fit'.
    - Alternative: bootstrapping and simulation-based inference.
    - The simulation represents the likelihood; we want to get the posterior on the data.
    - (Side note: paper on "What are the most important statistical ideas of the past 50 years?", http://www.stat.columbia.edu/~gelman/research/unpublished/stat50.pdf)
- Example application: BESSY, find the simulation parameters that reproduce the beam profile.
    - Need to learn the invertible mapping, with normalizing flows.
    - Their loss function is based on the change-of-variables formula; they can infer the latent variable z and optimize the exact log-likelihood of the data.
    - Implementation = Invertible Neural Networks (https://arxiv.org/pdf/1808.04730.pdf), expanded to Conditional INNs (https://arxiv.org/pdf/2105.02104.pdf).
    - A promising avenue besides VAEs and GANs.
- To go further: sequential neural posterior estimation, from David Greenberg.
- The model seems to be learning beam parameters more reasonably. The example was carried out on synthetic data.

Takeaways — normalizing flows:
- emerged as a learnable transformation between distributions
- emerged as a central building block for conditional density estimation

## Questions and Remarks:
<!-- Please write below this line -->

- MP: Can you highlight the difference with standard parameter regression / MLE etc.? What are the typical numbers: # of parameters that can be inferred, # of training examples, # of experimental observations?
    - The likelihood should be learned by the NN, and the parameter space sampling should be more efficient.
    - Current experiments: 500,000 simulations, 12 parameters, 120 epochs.
- -> Workshop in September?
- While Peter has been working with simulated data, various other fields (e.g. astronomy?) are already using these methods (normalizing flows) with experimental data as well.

---
---

## Feedback

---

### Share something you liked :+1:
<!-- Please write below this line -->
1. Interesting mixture of topics today
2. Please add comment here
3.

### Share something that could be improved
<!-- Please write below this line -->
1. Please add comment here
2. Please add comment here
3.