<center> <img src="https://hackmd.io/_uploads/HJ4BhW_c3.jpg" width="250"> </center>
<br>

This is where we'll keep notes on everything that we talk about at PyHEP.dev.

* Gordon's talk:
  - > If you have to write `axis=1` in an argument to numpy, columnar programming has failed (UX).
  - Started a discussion about being able to assign physics-meaningful aliases (e.g. `axis="jets"`)
  - c.f. [Awkward behaviors](https://awkward-array.org/doc/main/reference/ak.behavior.html)
  - c.f. [xarray](https://docs.xarray.dev/) --- beloved by the broader Scientific Python community
    - Example from user guide: https://docs.xarray.dev/en/stable/user-guide/indexing.html#

---

## ML R & D Breakout session (Tuesday)

Present: Philip, Kilian, Richa, Raghav, Josue, Mike

Some of the questions that were discussed:
- What frameworks do people use (Lightning & friends)?
  - PyTorch Lightning
  - MLflow might also do some of the things that Lightning does
  - ONNX for plugging ML into other frameworks / model exchange
- Dashboards (wandb & friends)?
  - MLflow
  - Weights & Biases

Projects that were mentioned:
- https://github.com/jet-net/JetNet: interface for jet datasets (combining different data sources).
- Kilian: https://github.com/gnn-tracking/gnn_tracking
- https://github.com/FAIR4HEP/cookiecutter4fair: cookiecutter for data science

**Conclusions:**
- Dashboards (like W&B / MLflow) are a good way to bring people "on the same page" and to compare/review/debug performance
- Frameworks that are built around hooks and a plugin/callback structure are a good way to allow extensibility without growing "dinosaur classes". For example, Lightning hooks like `on_validation_epoch_end` allow you to write callbacks that do things at the end of an epoch rather than subclassing/modifying your class

## User experience for physics data analysis tools/Documentation/Training (Tuesday)

> **Note**: This has been copied down to the Thursday session

## About partitions and Analysis Workflows (Tuesday)

Present: Lindsey, Peter, Marcel, Carlos, Ben, Zoe, Juraj, Nikolai, Massimiliano, Matt Bellis

- Dataset location and specification
  - Where is the data, given the analysis objective?
- Dataset exploration
  - Rapid turnaround for basic shapes, test ideas for intuition, basic histograms and benchmarks.
  - Use dask.persist, da.to_parquet, etc. to reuse results.
  - Add provenance to da.to_parquet so that it works like dask.persist but on disk?
  - How to annotate task graphs for reuse or optimization?
  - Spill to disk in the dask cluster manager.
  - Lifetimes:
    - Workflow, spill to disk (temporary), spill to disk (more permanent).
- Establishment of initial cuts and benchmark histograms
  - Typically where an analysis framework starts
- Shape of "final" distributions
- In general
  - How to recreate all these steps quickly for new analysis ideas.
  - Frameworks may be too rigid; they use graphs but not task graphs. But we can use Luigi to link graphs or computations outside dask.
  - Analysis is more than just the dask graph.
  - Luigi for coarse coordination.

![](https://hackmd.io/_uploads/ByzlUKT5h.jpg)

## Naming axes breakout (Tuesday)

Present: Angus, Alex, Ioana, Matthew, Ianna, Jim, Tal, Remco, Jonas, Mason, Clemens

- We should be able to label / name axes for readability.
- Awkward Array can/should add support for named axes, c.f. xarray with dimension names
  - No need to add labels, because those are much more specific to plotting
- New function, e.g. `array = ak.with_axis_name(array, 0, "events")`, to permit `ak.sum(array, axis="events")`
- The `flatten(array[argmax(..., keepdims=True)])` pattern is complex to understand, despite being a *pattern*
  - We could add an accessor that allows us to force ragged indexing. This permits a single-index accessor if need be, one that directly consumes the result of `argmax(..., keepdims=False)`, e.g. `array.at[...]`. We could also have a similar-yet-different `array.select[...]` that accepts a `keepdims=True` positional reducer result, and flattens the result afterwards (we know that `keepdims=True` produces regular dimensions, which can be identified statically).
- Interest in removing repeated array names: `event.foo.x[event.foo.y > 2]` → `event.foo.x[.y > 2]` (or better)
  - Could introduce a `this` proxy that refers to the array being sliced.
- Tools:
  - xarray
    - Example from user guide: https://docs.xarray.dev/en/stable/user-guide/indexing.html#
  - Nathan: [JAX also has some (experimental) support for named axes and parallelism](https://jax.readthedocs.io/en/latest/notebooks/xmap_tutorial.html#introducing-and-eliminating-named-axes)
- Examples
  - [Alex's example from the AGC](https://gist.github.com/alexander-held/a9ff4928fe33e8e0c09dd27fbdfd24d9)
  - [Gordon's fantasy rewrite](https://gist.github.com/gordonwatts/87d29b9e1dd13f0958968cd194e7b929) of Alex's cell 3.
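For reference, a minimal sketch of what named-axis reductions and indexing look like in xarray today (regular arrays only — the ragged case is the hard part for Awkward, and the `ak.with_axis_name` spelling above is only a proposal); the data here is made up:

```python
import numpy as np
import xarray as xr

# Toy rectangular data: pT of 4 jets in each of 1000 events (made-up numbers).
pt = xr.DataArray(
    np.random.default_rng(42).exponential(50.0, size=(1000, 4)),
    dims=("events", "jets"),
    name="jet_pt",
)

mean_pt_per_event = pt.mean(dim="jets")   # instead of axis=1
leading_jet_pt = pt.isel(jets=0)          # positional indexing, but by axis name
first_event = pt.isel(events=0)
```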
## Horizontal Scaling (Tuesday)

* In discussion, Lukas brought up the view of coarseness/analysis resolution:
  * A workflow language (like Luigi) allows the user to define coarse, high-level analysis operations (e.g. event selection, histogramming, statistical analysis) (scale: what resources you need)
    * Lukas: Should think about orthogonalization of views. Need the workflow systems to think about how the operations are distributed.
  * Dask then takes the blocks of the things that are defined in this workflow language and efficiently operates on them (scale: what expression you are calculating)
* Sharing our outputs/provenance:
  * If you do an expensive computation you want to be able to share that with colleagues
  * [Tiled](https://github.com/bluesky/tiled): in the future will be able to store Awkward arrays
    * Upcoming IRIS-HEP topical seminar
* Workflow languages
  - Lukas points out that in the bioinformatics world (where Snakemake came from) the workflows are much more focused on using well-designed CLI tools, whereas our HEP workflows are much more complicated and more unique.

Conclusions:
- nested workflows (Luigi & Dask)
- getting people to use it ("making it advantageous")
  - need to train early (the LHCb Starterkit is moderately successful in [getting people to use Snakemake](https://hsf-training.github.io/analysis-essentials/snakemake/README.html?highlight=snakemake))
  - cookiecutter-like starters
- prior experiences
  - ATLAS requiring yadage for 4 years; not promising
    - Matthew: I think "not promising" is maybe a bit too bleak. But more a shift of where to _start_ using workflows.
  - Lukas: Successes / Problems:
    - Success: new physics results are coming out that reuse published ATLAS analyses. People are using it, and there is evidence that analyses actually are expressible in a workflow language.
    - Problems: people don't tend to use yadage/workflow languages in their day-to-day work (a first step is usually building a new Docker image for your analysis).
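As a concrete (hypothetical) illustration of the "nested workflows" idea above — Luigi coordinating the coarse steps, dask doing the columnar work inside each step — here is a minimal sketch; the task names, file paths, columns, and cuts are all made up:

```python
import luigi
import pandas as pd
import dask.dataframe as dd


class Skim(luigi.Task):
    """Coarse step 1: a columnar skim, with dask doing the fine-grained work."""

    dataset = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"skims/{self.dataset}.parquet")

    def run(self):
        events = dd.read_parquet(f"inputs/{self.dataset}/*.parquet")
        selected = events[events.n_jets >= 4]  # hypothetical column/cut
        self.output().makedirs()
        selected.compute().to_parquet(self.output().path)


class Histogram(luigi.Task):
    """Coarse step 2: only runs once its Skim dependency exists."""

    dataset = luigi.Parameter()

    def requires(self):
        return Skim(dataset=self.dataset)

    def output(self):
        return luigi.LocalTarget(f"hists/{self.dataset}.csv")

    def run(self):
        skim = pd.read_parquet(self.input().path)
        counts = skim.groupby("n_jets").size()  # stand-in for a real histogram fill
        with self.output().open("w") as f:
            counts.to_csv(f)


if __name__ == "__main__":
    luigi.build([Histogram(dataset="ttbar")], local_scheduler=True)
```

law (which columnflow, discussed next, builds on) layers remote targets and batch-system submission on top of essentially this Luigi task model.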
## Workflows (Tuesday)

* Marcel showed us [`columnflow`](https://github.com/columnflow/columnflow) workflows
* Discussed why people might want to use a workflow system: while there can still be an "escape hatch", this also allows you to work for a long time before you might need to "pull it".
* What is the "workflow file" and how does it work? For columnflow these are Python files, vs. Snakemake/Yadage, which use a YAML definition file.
* Integration with REANA:
  * Seems like this should be quite doable with Luigi/Law-and-order!
* Already an AGC implementation using columnflow
  * [Open PR](https://github.com/iris-hep/analysis-grand-challenge/pull/183) to add it to the AGC implementation listing
* Should have seamless integration with coffea on dask (but not yet)

## Fitting and Histograms (Wednesday)

* Seemed like there was quite a bit of disagreement and uncertainty on what "fitting" means, so moving things later into the week (tomorrow, Wednesday)
* Focus on the importance of serialization that isn't just binary but also "human readable" (to Matthew, "human readable" means JSON or YAML, regardless of how gnarly)
  * Mike Sokoloff suggested writing both binary and human readable
  * Jim suggested [Avro](https://avro.apache.org/docs/1.2.0/)
* Within the PWA/amplitude-analysis community, some of this has been attempted before
  * https://pypwa.jlab.org/0309052.pdf
  * http://www-meg.phys.cmu.edu/pwa/main/index.php/software
  * Why did these not take off (code rot, people's shifting interests, etc.)?

## Unit tests of fitting / inference approaches (Wednesday)

* Mike Sokoloff, Josue Molina, Matt Bellis

Our discussion broke down along two lines: a) datasets for comparing and stress-testing fitting/inference frameworks, and b) agreed-upon values for "standard" amplitudes (e.g. Breit-Wigner).

#### Datasets for comparing frameworks

Imagine a location where datasets are hosted in a hypersimple, future-proof (lol) format like ASCII text. These datasets are *observables* with the features labeled in some way (README, .csv, .json (?)). Each dataset could come with *suggested* PDFs or amplitudes that could be used as a hypothesis for the underlying distribution. These suggestions should be typed up in LaTeX or taken from some paper and *not* provided as code. A framework developer can then try to code up this "physics" however they see fit. *Should the dataset contributors be required to provide information on how the data was generated?*

Is [Project Euler](https://projecteuler.net/) a template for this type of website?

There should be some agreed way (HS3?) for framework developers to compare their outputs. In general, these comparisons should involve
* Central values of the parameters
* Uncertainties (this means different things to different people)
* Likelihood scans
* Confidence intervals
* Nuisance parameters / constraints (how they are varied)
* Anything else?

So as an example, imagine a single-column dataset of an invariant mass distribution that was generated as two interfering Breit-Wigner functions (BW) with some resolution applied. The suggested fits could be
* 1 or 2 Gaussian functions
* 1 BW
* 2 BWs
* 2 BWs with resolution (convolution with a Gaussian), with the "correct" resolution
* 2 BWs with an *incorrect* resolution provided as the central value (e.g. for a Gaussian constraint)

We should come up with examples for binned fits as well. But the above lets someone play around with "native" PDFs *or* PDFs generated from interfering amplitudes.
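To make the example concrete, here is a minimal sketch of how such a toy dataset might be generated (all masses, widths, couplings, and the resolution are made-up numbers, and the lineshape is deliberately simplistic):

```python
import numpy as np

rng = np.random.default_rng(7)

def bw(m, m0, gamma):
    """Very simple relativistic Breit-Wigner amplitude (no barrier factors, etc.)."""
    return 1.0 / (m0**2 - m**2 - 1j * m0 * gamma)

def intensity(m):
    # Coherent sum of two resonances; the relative complex coupling produces interference.
    c1, c2 = 1.0, 0.7 * np.exp(1j * np.pi / 3)
    amp = c1 * bw(m, 1.20, 0.10) + c2 * bw(m, 1.45, 0.08)
    return np.abs(amp) ** 2

# Accept-reject sampling over a mass window, then apply a Gaussian resolution.
m = rng.uniform(0.9, 1.8, size=1_000_000)
bound = 1.5 * intensity(1.20)  # crude envelope, adequate for these made-up parameters
keep = rng.uniform(0.0, bound, size=m.size) < intensity(m)
masses = m[keep] + rng.normal(0.0, 0.01, size=keep.sum())

np.savetxt("two_bw_with_resolution.txt", masses)  # the "hypersimple" ASCII format
```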
It should be straightforward for people to upload datasets that they feel are useful for the community (PWA, hist-templates, etc.) to check their frameworks on.

This makes it easier for frameworks to compare not just the code but the assumptions about the physics and how it was coded up. Josue has experience with this, comparing `rio++` and `laura++` and finding that they gave different results with a particular physics model, because of how it was coded up in `laura++` and how the integrals were calculated.

[Here is a hyper-simplified version of what a website hosting these datasets could look like if it was on GitHub](https://github.com/mattbellis/datasets-for-testing-fitting-frameworks/tree/main)

#### "Unit tests" for functions

There could be datasets that map observable values to amplitude values. This way people could test their code to see if they get the same numbers. For example, imagine a 2-column dataset that is designed to vet code for a BW: the first column is an observed mass and the second column is the amplitude as a complex number.

Challenges are:
* The BW depends on the mass and width, so do you have this very finely grained for many different masses and widths?
* How do you agree upon the canonical values?

Related to this was a spin-off discussion of how amplitude analyses should share their results. In addition to the parameter values, should they also share their
* Data
* Efficiency maps
  * Need to make sure the phase space is sampled well enough to provide good enough coverage

Chalkboard of our discussion:

![](https://hackmd.io/_uploads/rkFlkAR92.jpg)

<!-- Feel free to uncomment: There has been a meeting on this in 2021: https://indico.cern.ch/event/1094888 Some good overview slides for PWA challenges + live notes can be found there -->

## Histogram and Fit Serialization (Wednesday)

Present: Peter, Jim, Angus, Ben, Henry

* General
  - Jim: It is not a technical difficulty, but a standardization difficulty.
* Desirables
  - From and to ROOT files
  - Good C++ and Python interfaces
  - Bin and edge data should be binary
  - Embeddable in ROOT or HDF5
* Nice to have
  - Metadata as human readable; the rest could/needs to be binary
  - Complete custom streamer. From boost_histogram, define the serialization as a separate C++ library. `uproot` can recognize that and deserialize appropriately.
* Difficulties
  - Cannot put everything in JSON efficiently; need to add binary data.

### Discussion:
- Peter: could add metadata through the attributes of HDF5. Metadata could be a JSON with attributes.
- Jim: The most distinguishing thing that may work is to serialize it as a `TString`, but ROOT won't be able to inspect it.
- Jim: Hans proposed to do the serialization in Python using Boost.
- HDF5 is easy to inspect; tons of libraries for that.
- Protocol buffers.
- HS3 standard to/from JSON.
- Angus: If the HS3 standard is JSON, we could define a superset that maps directly onto e.g. BSON.
- Jim: But, there are many JSON-like binary formats.
- C++ needs to be able to define the ROOT streamers.
  - Can't depend upon ROOT → need to use special ROOT macros in streamers.
  - Implies multiple libraries: `histlib` C++ histogram serialization, and another library that depends upon `histlib` to serialize into ROOT.
  - Henry: we could avoid worrying about a dependency tree and just write the code twice
    - Angus: I imagine this code might not change much, so that's probably reasonable
    - Henry: It probably will with feature additions (schema evolution) etc.
- We can avoid writing a ROOT deserialiser given that we have a TH1 converter for boost-histogram
  - Jim: Streamers are bidirectional, so this is a moot point
- A common miscommunication is that any-JSON means all-JSON
- Angus: What does human-readable mean?
  - We should try and define this for the purpose of future conversations
  - Protobuf requires the protocol definition to understand metadata, so not very readable
  - HDF5 metadata is easily findable
  - JSON tools _can_ include linked data, but do we really worry about not following JSON standards?
- Henry: should bin edges be part of metadata?
  - Worst case: a very long list of edges
  - Angus: should this not be out-of-band for variable edges?
- Henry: consider a zip file with two files: metadata + binary blob
  - Angus: is this intended to be easily modified, or is it write-once read-once?
    - Yes!
- Jim: there are ways to embed in zip, ROOT, HDF5.
  - Zip can have many files (single file with a well-defined name, metadata).
  - For HDF5 you'd need to make a group with a special interpreter for that group.
  - ROOT is all packed up in streamers anyway.
- If the primary expression of the histogram metadata was JSON, then we could use cling on the JSON schema to generate the C++ classes representing the JSON.
- If we start by saying that we want a JSON schema and binary blob(s).
- We have decided upon JSON-*like* data, i.e. it trivially maps to JSON (without wedding ourselves to a particular format)
- Jim: if we're using a JSON schema to describe the metadata, then we're "all good".
- We need to be really clear that we're talking about binary blobs out-of-band.
- Henry: it would be complicated to have edges as data (not metadata)
- Jim: multiple-data (exploded) exposes internals more, at a tradeoff of memory usage
- We want to say: it should be easy enough to put things into HDF5, ROOT, zip.
- We (Jim, Angus) are thinking about this as we do for Awkward Arrays; we can target _many_ formats using this, even those that we don't yet know of.
- Juraj: Example of JSON + binary blobs: [glTF](https://en.wikipedia.org/wiki/GlTF); the format has two modes, "everything in one file" or "structure + addresses"

## Model Building (Wednesday)

(also inference discussions happened here too)

* Lukas points out that https://www.statsmodels.org/ sits on top of SciPy stats
  - Could probably add histogram fitting; would need to work with them
* For hist, need to be able to fit a simple model to a histogram very easily
* Lukas: There could be a separate inference library, where pyhf or zfit provide the model and logpdf function and then the inference library can do the operations
  * Think of something similar to blackjax, where you provide the logpdf
* Alex: The things you need to do require you to be able to modify the model (e.g. hold parameters constant or remove sections of it). If you need to go through some generic interface it becomes much harder. How do we fix this?
* Alex: How do you write the model? For HistFactory you either go through some framework or you reinvent the wheel. So how would you reasonably write out the model in some language (e.g. Stan, PyMC --- but not these, probably, but something along these ideas)?
* Nathan: Can we bridge the models that we use the most to probabilistic programming languages like PyMC?
* Lukas: We did this for https://github.com/malin-horstmann/Bayesian_pyhf, where the model becomes something like $\textrm{Pois}(n \mid \lambda(\theta))\, \textrm{N}(a \mid \gamma(\theta))$ in PyMC
  * So maybe pyhf should just go from JSON to $\textrm{Pois}(n \mid \lambda(\theta))$
  * Having a clean separation of providing these functions and then being able to work with probabilistic programming languages would be something beneficial to do.
* Nick: It will be important to have explicit LogNormal functions, as constraints can be different if you start with a Normal and then construct a log-normal in a composite manner.
* Lukas: A problem with using probabilistic programming languages for model building is that all these tools are designed for Bayesian methodology. If we were going to try to use a probabilistic programming language in a **frequentist** manner we might have to do all this work ourselves, which is maybe quite a heavy lift.
* To be able to interface between the inference library and the model, we need a clear API
  * Alex: The part for external libraries that is perhaps interesting is being able to do things that are more complicated than MLE, like root finding and being able to freeze specific model parameters. All of this _should_ be able to be abstracted away, though it is unclear if making this generic is actually useful.
  * Lukas: Model mutations should happen on the modeling library side.
* Nathan: It seems that the key ingredient would be a minimum viable API for what the model should be/implement in order to do impact plots and the like for any model.
* Alex: Has an example of where this becomes quite complicated with goodness-of-fit. Example in cabinetry: [cabinetry.fit._goodness_of_fit](https://github.com/scikit-hep/cabinetry/blob/59b49fbb5e3caffb76b12df5277d71912e909712/src/cabinetry/fit/__init__.py#L366-L428): this relies on evaluating constraint terms separately and requires a notion of what those are
* Nick: [pytrees](https://jax.readthedocs.io/en/latest/pytrees.html) are a nice way of representing structured parameter vectors while still easily flattening them for the purpose of minimization. Each bin in a binned fit could be annotated with labels that allow one to isolate different contributions to the total NLL. I assume the same is true for domains of integration in unbinned fits.
  * (nathan) even using a dict (which is still a pytree) helps here more than a vector -- you can pass a subset of params etc. by name into a method!
* Lindsey: We need to break everything in the above out into individual parts at the high level and then agree on APIs there.
* Nathan: random infodump -- probabilistic programming framework in JAX that has some nice building blocks: https://www.tensorflow.org/probability/oryx/notebooks/probabilistic_programming

### Model protocol

Present: Jonas, Massimiliano, Philip, Remco

Discussion revolved around how to represent a model in the most generic way, so that frameworks can 'talk' to each other.

Idea:
```
func(x: Mapping[str, ArrayLike]) -> Mapping[str, ArrayLike]
```
This represents $f: \mathbb{R}^n \to \mathbb{R}^m$ (potentially $\mathbb{C}$). Some examples:
- PDF: `P(x: Mapping[str, float | ArrayLike]) -> ArrayLike`
- Likelihood function: `L(x: Mapping[str, float]) -> ArrayLike`
- Gradient: `Df(x: Mapping[str, float]) -> Mapping[str, ArrayLike]`

Criticism:
- Why `str`? Seems very specific to fitting (namely: the `str` represents a parameter or variable). Why enforce that on non-fitting packages that don't care about names? In other words: `x` is more than just $\mathbb{R}^n$.
  - Answer (partial): for example, what if you want to vary just _some_ of the parameters?
- Warning: each entry can carry its own dimensions, because it's `ArrayLike`
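One way this idea could be spelled as a Python protocol, just to make the discussion concrete (the class name and the toy likelihood below are illustrative, not an agreed API):

```python
from collections.abc import Mapping
from typing import Protocol, runtime_checkable

import numpy as np
from numpy.typing import ArrayLike


@runtime_checkable
class Model(Protocol):
    def __call__(self, x: Mapping[str, ArrayLike]) -> Mapping[str, ArrayLike]:
        """Map named inputs (parameters and/or observables) to named outputs."""
        ...


def gaussian_nll(x: Mapping[str, ArrayLike]) -> Mapping[str, ArrayLike]:
    """Toy negative log-likelihood that satisfies the protocol."""
    data = np.asarray(x["data"])
    mu, sigma = np.asarray(x["mu"]), np.asarray(x["sigma"])
    nll = 0.5 * np.sum(((data - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))
    return {"nll": nll}


print(isinstance(gaussian_nll, Model))  # plain functions qualify via __call__
print(gaussian_nll({"data": [0.5, 1.5], "mu": 1.0, "sigma": 1.0}))
```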
## InterOp between C++ and Python (Wednesday)

Present: Baidyanath, Ianna, Ioana, Juraj

- **Use case from Juraj:** The FCC team uses RDataFrame to run their analysis. New users expect to write everything in Python, but RDataFrame requires C++ functions to be passed into it. They are heavily dependent on C++ libraries. They use Spack for distribution, which can interfere with other environments. Solutions:
  - RDataFrame can add better Python support for all functions passed to it. This is ongoing but will take a bit more time. Depends on tighter Numba-Cppyy integration.
  - Awkward Array can be used. This solution is not viable for many use cases due to the user's heavy dependence on C++-based libraries (will most probably be fixed once cppyy 3.1.0 is out). Another issue is that additional columns cannot be added to the data.
  - Investigate the interplay between EDM4hep and Awkward Array.

## JIT compilation

Issues:
- JIT compile time
  - What is a good amount of time for compilation: within an order of magnitude of an out-of-run (ahead-of-time) compile
  - This was a problem in Garima's project: gcc/clang took well under 1 second; `gInterpreter.Declare` took 17 seconds.
  - Less knowledge in JIT compilation than in AOT; hard for it to know what to spend time optimizing.
  - How about users of JIT compilation having hooks to be able to inject knowledge about which collection instances are big (and need more optimization than the rest)?
  - Is this related to C++? Does C++ JIT need to compile for cases that may never be called?
- Isolating state of JITted code
  - So if the JITted code crashes, the outer binary does not crash, and is able to restart.
  - This is a C++ thing; Numba and Julia don't provide a way to access raw pointers / the ability to segfault.
- Python/C++ interoperability
- Cppyy maintenance
  - Avoiding the ROOT dependency (faster turn-around time than asking the experiments if adding a feature breaks them).
  - Hard to find people who want to work on PyROOT.
  - clang-repl is already in LLVM; libInterop should be, too (eventually).
- xeus
  - QuantStack wants to move xeus-cling to clang-repl.
  - A demo exists of passing variables from C++ to Python in a Jupyter notebook, but not the other way.
  - C++ JIT compilation in WASM is working.
- Is there an alternative to cppyy?
  - [pybind11 binder](https://github.com/RosettaCommons/binder)
- Numba
  - numba-rvsdg to deal with new, rapidly changing Python bytecode.
  - Jim's side-project: extending this (without LLVM) to provide Python bytecode → AST.
- Awkward
  - has extensions for Numba and for cppyy; Numba and cppyy have integrations; will Awkward work with Numba _and_ cppyy?
  - unclear what Jim is asking, whether there are any problems there.

## Automatic Differentiation and Gradient Passing (Wednesday)

* Lukas: If we start basic with the IRIS-HEP AGC, what are the parameters that can be varied?
  - Alex: cuts that you apply, the machine learning piece that gets applied
  - Lukas: One initial approach is to replace cuts with weights (each event gets a weight)
  - What is the thing that you want to optimize? Can then plot the loss landscape in 1D, and this would then allow us to understand if this is something that would be achievable for optimization with AD. **Visualization** of this would be quite important.
  - Run the AGC at full fidelity as-is with many cut values for 1 parameter that can be wiggled.
- For the AGC there is a way to run a small version with limited data all in memory on a single machine, so we could possibly get enough events for a sufficient loss landscape
* Lukas: You might somewhat solve that by, when you fill the histogram, filling it with an array of weights the size of the histogram.
  - The AGC currently uses `hist`.
  - Normally you have, say, 3 bins. If you wiggle the parameters a bit, then an event will fall into different bins.
  - If now you say that each event goes into all bins, with each bin getting a different weight (weights that sum to 1), then as you wiggle the parameters the weights in each bin shift. This is basically doing a binned KDE.
  - [Example of this in code from relaxed](https://github.com/gradhep/relaxed/blob/2bff24096f06068eba9b6ec75fdb0ab69a1243bc/src/relaxed/ops.py#L46)
  - For `hist`, instead of trying to change `hist` itself, one could also just implement a custom backward AD rule to deal with it, as this is just going to deal with summation.
* Lukas: Plan for IRIS-HEP Analysis Systems
  1. Run the AGC and visualize the landscape for 1 cut.
  2. Change the AGC to work with weights.
  3. Change to all-bin filling.

In the event-weight scenario, every event must remain evaluatable downstream: if an event fails a selection cut on 4 leptons and you downstream require a 5th lepton pT, then this won't work and you would be required to take a stochastic approach.

* Gordon
  * Example of [passing gradients between JAX and PyTorch](https://gist.github.com/mattjj/e8b51074fed081d765d2f3ff90edf0e9); line 27 is probably the key line. Looks quite simple, other than knowing what is being passed (metadata).
  * [Gordon's talk at MODE](https://indico.cern.ch/event/1242538/contributions/5432836/attachments/2689673/4669514/2023%20-%20MODE%20-%20NN%20and%20Cuts.pdf) on using cuts in a differentiable pipeline.
* Hacking session
  * put together a simple "analysis": read jet four-vector information from file, calibrate energy as a function of a nuisance parameter, and calculate the invariant mass of the dijet pair
  * do this with awkward+jax and take derivatives of the mean of invariant masses (across events) wrt. the nuisance parameter
  * notebook: https://gist.github.com/alexander-held/9ca8eae6ab99572d7a332ab8c1b02beb, `awkward` follow-up: https://github.com/scikit-hep/awkward/issues/2591

## Sharded Histograms (Wednesday)

Present: Peter, Lindsey, Nick, Ben

- Peter's implementation
  - Actor histogramming: dask automatically knows about actors and triggers a connection to the actor where the data is stored.
  - Histograms compressed in memory with HDF5. Efficient because they are often sparse.
  - One histogram internally chunked in HDF5 for compression.
  - One possible issue is that hist.fill needs to keep everything in memory.
* Ideas:
  - Henry: Maybe subclass hist.fill so that it knows to work in chunks, rather than keeping everything in memory. Work in "snapshots" of the fill.
  - Lindsey: At the end of computation, how do the histograms look to the user?
  - Ben: When does the computation end? Does it end when all chunks have been filled for a postprocess analysis? Or when they have been _virtually_ merged?
  - How to do the dispatch?
  - Henry: Transfer all fills to all workers and then filter out what is not needed. E.g. some histograms just look at slices of the axes.
  - How do you do the sums across the fills?
  - How to do conditional dask? E.g. split on categories before dispatch (numpy masks?)
  - Ben: Lists of numpy masks that select rows of the array. Label compute nodes so that they process a particular mask.
- Peter: Treat the fills like a Triton server. Send a signal that dumps the histogram at the end.
- Peter: How do you know that a data chunk belongs to a particular fill? Lindsey: the binning is done beforehand.
- Lindsey:
```python
h = hist.dask.sharded.new....()
# then do...
h.fill(*arrays)
# rows of *arrays are dispatched to compute nodes, and...
dask.compute(... sum over bin segments ....)
```

## Teaching, training, documentation and coordination

Present Tuesday session: Matthew Bellis, Angus, Oksana, Mason, Aman, Zoe, Kilian, Ben, Remco, Clemens, Josue, Benjamin, Juraj

[gh issue](https://github.com/HSF/PyHEP.dev-workshops/issues/9)

**Questions & discussion**

* **Documentation interlinking**:
  * How to connect the different resources (how-to guides, training)? Scikit-HEP has decentralized development, but a central vision. Possible solution: analysis gallery
* **Analysis gallery**: Should there be a central "analysis" gallery?
  * [LHCb Starterkit Analysis Essentials](https://hsf-training.github.io/analysis-essentials/advanced-python/20DataAndPlotting.html) has an example analysis for beginners that brings everything together
  * Astropy also has learn.astropy.org, a collection of notebooks that serve as how-to guides
  * Matthew: Does the complexity of our "real analyses" map onto such simple examples?
  * Matt: Also, astropy is just a single package (that has the majority of the field behind it)
  * Matthew: ROOT has a gallery of examples. Could convert these. (https://root.cern/doc/master/group__Tutorials.html)
  * Alex: How to curate? What's the threshold from "too simple" to "too hard"?
  * Aman: Could have an interactive "map" that is clickable and links to the documentation
  * Jim: Might host training and documentation together
  * Clemens: Could have "learning paths"
* **Discoverability**
  * How to make training discoverable?
  * [HSF Training Center](https://hepsoftwarefoundation.org/training/curriculum.html)
    * have Plausible web analytics around all of our training material to investigate discoverability/user behavior
    * Had a GSoC proposal to rebuild this as a more dynamic website, able to list more and filter it by need (and some alpha versions were made that we could start from)
  * Need to find a good balance between too narrow in focus and too wide
    * Negative example for "too wide": "awesome lists" that keep on expanding and stop being useful
  * How to interlink different documentations/trainings
  * Relation between different kinds of material: Diataxis: https://diataxis.fr/ - lots of work, but can be used as an overall guide
* **Getting people to contribute & getting user feedback**
  * Should we have a mechanism for third parties to contribute documentation to specific repositories from a centralised source?
  * How can we get users (especially new users) to write documentation?
  * How to add a "comment" box for notebooks:
    * Jim: Could have a link to a gh issue/hackmd/etc.
    * How to give notification to the developer
    * Could use "Hypothesis"
  * How can we break down hurdles:
    * Lean more heavily on Web IDEs (GitHub Codespaces etc.) for PRs
  * Essential to give users the option to give feedback quickly (without having to create issues/pull requests). Ideal: feedback/comment button. Examples: [`sphinx-comments`](https://sphinx-comments.readthedocs.io), [`sphinx-disqus`](https://sphinx-disqus.readthedocs.io) (from [this](https://docs.readthedocs.io/en/latest/faq.html#i-want-comments-in-my-docs) FAQ on RTD)
  * Could people earn "karma"? Become official contributors to the project if they report a documentation issue.
* Angus: Let's come up with a "social ethos" to help build a community of users. How to build a community?
* Matthew: People are very shy/apprehensive about doing things "in public". Could there be a sandbox for it? Or have private tickets (also think of GDPR)? Philip: Pythia does this.
* Juraj: Tutorials ready from CI.
* Angus: create stubs:
  - Users can see which topics are already identified
  - New developers can see where to start contributing
  - Long-term developers can contribute in free time
* **Training paradigms**:
  * Matthew: People may start to prefer video resources rather than reading material. One problem is that it is harder to keep video up-to-date.
  * Kilian: There is already some prototype video documentation for some HSF Training modules (like Docker etc.).
  * Having more things as prerequisites to even out the level of participants and to make sure we don't lose time with "trivial things"
* **Platforms for running code for training**:
  * Angus: Should we have a central "Binder" service?
* **Forum & chat**
  * similar to the ROOT forum?
  * Both chat and forum have merit
  * We've been talking about this for over a year.
  * what to use for chat? Discord?
  * How to balance chat vs. forum?
  * Jerry: Bot in chat that generates a Discourse post, initially invisible, for posterity.
  * Forum ~ Stack Overflow-ish
* **Other ideas/suggestions:**
  * Benjamin: Workshop about "what to do if there's no documentation?"
  * Philip: Discussion sounds similar to HepForge: find packages, link them together on a page. Do we repeat history? So what can we learn from that? Reasons HepForge failed:
    * ran out of funding
    * overambitious: wanted to be more than just an organizing/discoverability project, but wanted to solve versioning as well (and failed)
    * switch from SVN to Phabricator/git (and lost people in the switch)
  * Oksana: How can we train developers to engage users and write proper documentation?
  * Office hours similar to SciPy?
    - Repeating new-contributors meeting

### ✨ Conclusions & actionable items ✨

* **Discoverability**: Need to link resources together and make them discoverable:
  * Documentation/training material should be interlinked
  * [HSF Training Center](https://hepsoftwarefoundation.org/training/curriculum.html) can be expanded to make tutorials discoverable. Must strike a balance between notable/maintained and inclusive. Considerations:
    * Resources have to be curated. Set minimum standards for quality and notability.
    * Be clear about scope; don't be HepForge or one of the awesome-xyz lists
    * [Plausible](https://plausible.io) can be used to understand where users come from/where they go
* **Contributions**: Making it easier to contribute/give comments (rather than opening PRs). Options include:
  * teaching people about GitHub Web IDEs for simple PRs
  * include more feedback/comment buttons (like sphinx-comments), ideally also anonymous
  * include stubs for things that are missing in docs and should be filled in
* **Prerequisites**: Having prerequisites for workshops can "even out" experience levels of people and avoid "trivial questions"
* **Feedback**: Jim recommended directly implementing feedback buttons into notebooks (e.g., via Slido)
* **Maintainability**: guaranteeing that code examples work (see also CI remark; documentation from Jupyter notebooks) and that interlinking is correct (e.g. [`linkcheck`](https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-the-linkcheck-builder)).
* **Hacking away & other ideas**:
  * Regular "documentation day"
  * Half-day workshop as part of PyHEP to help developers write good (or any at all) user guides/documentation. Could also use that to get everyone to interlink things.

**Actionable items**

- Fork (or take inspiration from) <https://learn.astropy.org>
  - HSF Training has around 3-5 alternative "from scratch" implementations/PoCs of similar platforms that we could consider/start from. Might also rope in some of the GSoC candidates with JS knowledge.
  - Purpose is to provide a minimal (basis?) set of tutorials that provide a starting point for self-learning(?).
  - Do we split tutorials from guides at the top level, or as a filter criterion?
- Design an updated pipeline/map for AS (maybe even interactive)
  - https://iris-hep.org/as.html
  - Can be non-linear, to provide a better visual overview of how different packages integrate/can be used at different stages
- Add a place for videos (for the ones that have been presented live), and a place for a time estimate: "This tutorial will take 10 minutes."
- Easy pipeline for contributors to add new tutorials.

## Amplitude analysis (PWA)

Present: Henry, Jonas, Josue, Mike, Remco

- General questions
  - Can amplitude building be completely separated from (efficient) function evaluation and fitting? If so, it would make it easier to let PWA frameworks talk to each other. → Easier to do with Python than with C++
  - What do we consider to be a PDF? How does it relate to amplitudes? How does it do its normalisation?
  - Can we standardise plotting in amplitude analyses?
- **Issues that might bring PWA frameworks together**
  - Can PWA frameworks become consumers of more general fitter packages?
  - Can we define a standard to describe amplitudes and amplitude models? (Like [UFO](https://arxiv.org/abs/1108.2040), [DecayLanguage](https://github.com/scikit-hep/decaylanguage), ...)
  - We need to improve comparisons between different frameworks to make our results more reliable (and reproducible!)
  - A central place for documentation would be great. References to important literature and an inventory of frameworks.
  - Benchmark analyses / unit tests; would be solved by a UHI-like test package
- Challenges with comparisons
  - Most PWA tools are isolated frameworks (see list [here](https://pwa.readthedocs.io/software.html)); hard to make them talk to each other
  - Differences in lineshapes, handling of spin, sign conventions
  - Integration strategies for likelihoods
- **Concrete follow-up steps**
  - Regular PWA software meetings (tie in to existing conferences like [PWA/ATHOS](https://indico.cern.ch/event/885396))
  - Would be great if there were a PWA package that, like [UHI](https://uhi.readthedocs.io) and [scikit-hep-testdata](https://github.com/scikit-hep/scikit-hep-testdata) (see also the [unit test session](#“Unit-tests”-for-functions)), would:
    - streamline amplitude analyses (optional dependencies), i.e. input/output to different frameworks
    - standardise protocols for data and/or amplitudes
    - provide test suites for comparing performance and result consistency, think [array-api-tests](https://data-apis.org/array-api-tests)
    - host test data sets (e.g.
through release notes, like GooFit does)
- Note: `pwa` is still available on PyPI; [pwa.readthedocs.io](https://pwa.readthedocs.io) is already 'ours' ;)

![](https://hackmd.io/_uploads/ry3_DHgi2.jpg)

## Task Graphs and Using Annotations to Match Resources and Modify the Computation

Present: Nick, Ben, Angus

- Dask annotations: metadata about tasks that schedulers may choose to respect, e.g.:
```python=
with dask.annotate(big_mem_machine=True, prefer_gpu=True):
    z = something.compute()

with dask.annotate(machine_profile=..., checkpoint=True, environment=...):
    ...

# then somewhere in the custom scheduler:
dask.get_annotations()
# {'big_mem_machine': True, 'prefer_gpu': True}
# or:
[l.annotations for (n, l) in z.dask.layers.items()]
```
- E.g.: give a hint to the scheduler that it is a good idea to cache something because it was expensive to compute. But cache only what is relevant in the context of the task graph in question (i.e. do not cache everything, as persist would do).
- An easy case to test is to use these annotations as ClassAds for HTCondor.
- Label the graph for things like systematic variations, and modify the graph accordingly:
```python=
with dask.annotate(systematic="jec"):
    scalefactor = da.full(2, 1.01)  # dask.delayed(1.01)
    events["jets", "pt"] = events.jets.pt * scalefactor
```
- Offload expensive sections of code to accelerators
```python=
def superMLalgo(jet):
    # pretend this needs a GPU to be fast enough
    return jet.pt*20 + jet.eta

with dask.annotate(needs_gpu=True):
    events["jets", "discriminant"] = superMLalgo(events.jets)
```
- Cache results that are expensive to compute and/or small relative to the input
```python=
cut = ak.any(events.jets.discriminant > 0.5, axis=1)
with dask.annotate(please_cache=True):
    events = events[cut]
```

## Analysis Grand Challenge

* Alex, Oksana, Lindsey, Clemens, Matthew, Tal, Massimiliano, Jayjeet, Mason, Marcel, Peter, Juraj

- The new CMS Open Data campaign (NanoAOD2016) will require adapting the analysis → let's track it via an issue (e.g. year-dependent analysis decisions / specific file handling)
- L. Gray: wishlist item to have a proper AGC tutorial as well, with more CMS complexity
- L. Gray: a training example with ParticleNet would be amazing (very computationally expensive model to train / evaluate)
- Autodiff integration: how to check what is already differentiable
  - moving in small prototypes: continue until hitting issues, then feed back to developers to resolve them
  - small prototypes can already conceptually capture lots of important functionality and be extended in various directions
  - What would using the dask+jax pattern look like?
    - supposedly can work, none of us has practical experience
  - Training in differentiable analysis: we will need a different model each time a cut changes, meaning that the model has to be updated, making this very expensive
  - Q from Tal: what about systematics? Would the whole chain be aware that the pT cut was changed?
    - yes, can conceptually propagate this all through (would require e.g. a differentiable `correctionlib`)
- Q: is there interest in including running event generators in the AGC setup? → We expect that the AGC pipeline starts from "common" data formats; however, this is rather interesting for workflows such as those based on `madminer`
- Next step will be to move to coffea 2023: no obvious blockers, need to test this again
  - can also adopt recent changes like the ML interface that should help
- Marcel asks if we can add a "debugging style" AGC implementation (e.g.
cutflow); it could be useful for teaching
  - should come "for free" with PackedSelection in coffea now; other such types of functionality would be useful to probe
- showcase of the columnflow implementation of the AGC

## File formats

* Jim, Matt B., Zoe, Jerry, Nikolai, Ioana
* Nikolai (PHYSLITE in ATLAS): stores some classes, not just simple data values.

Hierarchy:
```
ROOT --> TTree --> (NanoAOD [CMS], DAOD_PHYSLITE [ATLAS]) --> RNTuple
Arrow / Parquet --> Feather
HDF5 --> (hepfile, TDTree, HEP-CCE (DUNE, Argonne, Fermilab))
```

`RNTuple` does not support storing classes (yeah!)
`RNTuple` may have been informed by early discussions about Parquet.

Arrow is probably the way things are going forward because of how carefully they have considered the memory layout, informed by frustrations with the `pandas` dataframe. Arrow is in-memory; Parquet and Feather are on-disk storage.

#### Action items

* What should we do about the HDF5 projects? (hepfile, HEP-CCE, TDTree)
  * Matt B will push forward with `hepfile` so that we can profile its performance going forward.
* Jerry and Jim (CHEP 21 suite) will provide Matt with some [tests](https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark) to convert to the different formats. This could turn into a general performance suite for file formats (ROOT, Feather, HDF5, etc.). Profile tests could also help with the *discoverability* of different approaches.
* Nikolai will work on an output module for Belle II efforts.
* `odo` tool to convert from one format to another.

#### Future items

* Could the AGC provide a test case to see how much time is spent doing the following? Since this is a real analysis test case, it would allow us to understand where time is spent.
  * Decompressing data
  * Reading from file
  * In-memory operations (after reading in from file)

![](https://hackmd.io/_uploads/r1d25rlo3.jpg)

## Code readability

Present: Angus, Aman, Ianna, Kilian, Gordon

Defining readability:
- How many "units" of code a concept requires to explain
- Extrema of APL vs code-gen
- Tradeoff between terseness and units of action per line
- Tradeoff between modularity and succinctness
- User happiness vs developer happiness. And who's the user here?

Examples:
- Sometimes the most _performant_ code comes at a cost in readability
- The `axis=1` discussion → named axes

General discussion:
- Tradeoff between production and development w.r.t. terseness
- Tradeoff between users and developers; both are important, and different kinds of readability matter
- Type hints:
  - Is this a good goal, does it help readability?
  - Can we type hint array libraries?
- Ianna suggests that we touch on FuncADL
- Gordon has a hitlist.
  - `argmin`, `argmax`, `axis=1`
  - `argmin` - the common pattern is to slice into a record to pull out multiple fields. To be able to minimise with respect to a field.
- Should we have some kind of manual pass through repos?
  - Jim's work to analyse repos?
- How do the language models feel about our syntax?
  - Use ChatGPT et al. to explain what code does, which implicitly encodes user understanding?
- Office hours?
  - Requires advertising in multiple formats, and driving the point home.
- Physicists' nature is to assume they are misunderstanding things, rather than the tooling being confusing.
  - We should try and improve this.

### pyhep forum & chat

**There will be a vote to reach community consensus about what to do**

Outline concrete, actionable items below.