# NSF Project Report - CSSI 2023

# Accomplishments

## What are the major goals of the project?

The PONDD project is a collaborative project to increase meaningful access to dark matter detector data and astrophysical simulations. It utilizes and expands upon systems such as Kaitai Struct for declarative data format definitions, ServiceX for distributed data access and analysis, Awkward Array for array-based computation, and yt for access to astrophysical datasets. In addition, the project will couple these disparate systems and provide dataset identification mechanisms, both to enable access to the datasets and to identify additional datasets of relevance. Once a dataset has been "resolved" by the dataset identification method, ServiceX "extractors" and "transformers" can be applied to it, enabling subsets of larger datasets to be returned and accessed.

## What was accomplished under these goals and objectives?

### Major Activities

During this project year, the primary progress of PI Roberts' group has been to begin implementing a Kaitai compiler for Awkward Arrays. This is a core piece of the proposed software stack. Once complete, it will allow anyone with a Kaitai description of their data to read that data through the Awkward Array Python interface, a memory-efficient and fast interface for large datasets (GB scale). This is needed for scientific data because Kaitai's existing Python interface is only suitable for small files (kB scale).

- Began the Kaitai to Awkward Array compiler. This work is at https://github.com/ManasviGoyal/kaitai_struct_compiler and tests are at https://github.com/ManasviGoyal/awkward_kaitai_tests
- Created and tested a description of the current SuperCDMS data format. The test data and description are at https://github.com/det-lab/dataReaderWriter.
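
The memory argument above can be illustrated with a minimal sketch. It assumes a hypothetical record layout (an event ID, a sample count, then that many 16-bit samples), which is not the SuperCDMS format, and it uses only the Python standard library rather than the Kaitai or Awkward Array APIs. It parses variable-length records into flat columns plus an offsets array, which is the general idea behind columnar storage of variable-length lists, instead of building one Python object per event.

```python
import struct
from array import array

# Hypothetical record layout, assumed purely for illustration:
#   uint32 event_id, uint16 n_samples, then n_samples little-endian int16 samples.
HEADER = struct.Struct("<IH")


def write_events(events):
    """Serialize (event_id, samples) pairs into the hypothetical binary layout."""
    out = bytearray()
    for event_id, samples in events:
        out += HEADER.pack(event_id, len(samples))
        out += struct.pack(f"<{len(samples)}h", *samples)
    return bytes(out)


def read_columnar(buf):
    """Parse the buffer into flat columns instead of one object per event."""
    event_ids = array("I")     # one entry per event
    offsets = array("q", [0])  # event i owns samples[offsets[i]:offsets[i + 1]]
    samples = array("h")       # all samples from all events, concatenated
    pos = 0
    while pos < len(buf):
        event_id, n = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        samples.extend(struct.unpack_from(f"<{n}h", buf, pos))
        pos += 2 * n  # each int16 sample occupies two bytes
        event_ids.append(event_id)
        offsets.append(len(samples))
    return event_ids, offsets, samples


def samples_of(i, offsets, samples):
    """Return the samples of event i as a slice of the flat column."""
    return samples[offsets[i]:offsets[i + 1]]


if __name__ == "__main__":
    buf = write_events([(1, [10, -20, 30]), (2, [7]), (3, [5, 6])])
    ids, offsets, samples = read_columnar(buf)
    print(list(ids), list(offsets), list(samples_of(0, offsets, samples)))
```

In this layout the per-event cost is one offset entry rather than a full Python object, which is why a columnar reader can remain practical as files grow from kB to GB scale.
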
### Specific Objectives

- Create the Kaitai-Awkward compiler to enable handling of GB-scale datasets
- Identify collaboration opportunities beyond SuperCDMS where PONDD would be useful

### Significant Results

- We have begun the Kaitai-Awkward compiler. Specific achievements are: identifying the path forward, identifying the minimal set of needed features, and writing several data descriptions along with creating small test datasets.
- PI Roberts is currently working on becoming a member of HALO, a neutrino observatory co-located with SuperCDMS and several other dark matter and neutrino experiments at SNOLAB.

## What opportunities for training and professional development has the project provided?

PI Roberts currently has two undergraduate students working on registering SuperCDMS data. The current students had no prior programming experience and have increased their skills in:

- Python
- Version control
- Metadata stewardship

## Have the results been disseminated to communities of interest? If so, please provide details

Nothing to report

## What do you plan to do during the next reporting period to accomplish the goals?

- Complete the minimal feature set of the Kaitai-Awkward compiler that we have identified as critical for scientific datasets.
- Become a member of HALO and work with SNO+, with the goal of identifying coincident events in these two co-located detectors. We can support these efforts without Andrea Zonca, so this work will proceed immediately.
- Once Andrea Zonca is under subcontract, identify at least one additional use case from another field (materials databases and climate change models are the use cases we intend to explore first)
- Implement minimal working examples from those use cases

# Participants/Organizations

# Impact

## What is the impact on the development of the principal discipline(s) of the project?

This project will deliver Python data-reading software to dark matter and neutrino experiments that use custom binary formats and have GB-scale files. Experiments currently write data-reading software on their own; in the case of SuperCDMS, this software is too simplistic and slow for practical use. Demand within the collaboration for memory-efficient and fast reading is increasing, particularly as machine learning applications for our unprocessed data become more common.

## What is the impact on other disciplines?

Custom binary data is not unique to dark matter experiments; we expect the code from this project to be directly useful wherever fast, memory-efficient access to binary data is needed. Jim Pivarski (listed in the participants section) has contacts in multiple fields where Awkward Array is used, and this software, once complete, may be of interest to them. The first target here is observational astronomy; we expect to use participant support funds to support describing a set of standard observational astronomy file formats in the Kaitai data-description language.

## What is the impact on the development of human resources?

This project has directly developed undergraduate students' skills in software engineering as well as data and metadata management.

## What was the impact on teaching and educational experiences?

## What is the impact on physical resources that form infrastructure?

Nothing to report

## What is the impact on institutional resources that form infrastructure?

Nothing to report

## What is the impact on information resources that form infrastructure?

Nothing to report

## What is the impact on technology transfer?

Nothing to report

## What is the impact on society beyond science and technology?

## What percentage of the award's budget was spent in a foreign country?

None

# Changes/Problems

## Changes in approach and reasons for change

Changes from previous years are:

1. We will take use cases from outside of SuperCDMS as our first use cases.
While we are still using SuperCDMS as an example of scientific data, the SuperCDMS data catalog infrastructure, which is a critical part of PONDD, is delayed. The most likely use case is unprocessed SuperCDMS data stored on the Open Storage Network along with metadata, and we continue to stage data suitable for this task. However, pursuing other examples (like the HALO and SNO+ neutrino experiments) is a good way to reduce the risk of delay.
2. We have requested that the postdoc position be rebudgeted to provide continued support for Andrea Zonca and a Master's student; this request was approved.

## Actual or Anticipated problems or delays and actions or plans to resolve them

We faced significant delays in spending this year, and this has delayed most of our projects by a full year. A no-cost extension may be a way to address this delay.

## Changes that have significant impact on expenditures

We have underspent significantly this year. One issue (now resolved) was rebudgeting to support Andrea Zonca and a graduate student rather than a postdoc. The second issue (not yet resolved) has been an ongoing delay in issuing the subcontract to Andrea Zonca at SDSC.