# Calibration internal review minutes

## Meeting with referees: 2023/12/04

**Participants:**

- Giulio Dujany
- Umberto Tamponi
- Markus Prim
- Karim Trabelsi
- Michael De Nuccio
- Stefano Lacaprara
- Ueda
- Takanori Hara (joined late)

**Update of the resources requested for recalibration:**

- **Permanent storage:** 400 TB (up to 2023) + 180 TB/year (from 2024 on)
- **Temporary storage:** 180 TB if the permanent storage is on disk, 540 TB if on tape
- **CPU for collectors:** 3-45 kHEPSpec06/year (~600 opportunistic CPU slots)
- **Memory for algorithms:** 20 GB

**Available at KEKCC**

- 1 PB of disk and a high-priority job queue with up to 400 slots available for calibration
- 1.5 PB of disk for data production

**Plan up to 2028**

- Use the calibration disk as "permanent storage", plus ~300 TB of the dataprod disk in 2027-2028
- In case we need to reproduce the cDSTs, use ~180 TB of the dataprod disk as "temporary storage"

**What after 2028? (to be decided by the end of 2025 to prepare the new KEKCC request)**

- Use tape (need to understand the staging time; the note assumes ~150 TB/month)
- Ask for more disk or use more of the dataprod disk (1 PB of disk can cover 5 more years)

**Usage of tape at DESY**

- At DESY, the calibration managers asked which data needed to be staged on disk. They also set an expiration date, the "pinning": the data is guaranteed to stay on disk until that date, and after that date it can be removed from disk (or stay there until the disk space is needed), since a copy remains on tape.
- It was useful to have the data on disk for as long as it was needed, without the fear that it would disappear.

**Discussion about tape usage at KEKCC**

- At KEKCC the pinning is not feasible. The data will need to be copied from tape to a temporary disk (e.g. the dataprod disk, according to the plan above).
- At KEKCC the tape system is designed mainly to deal with the raw data. Any other request has lower priority, so the expected throughput varies a lot depending on whether we are collecting data at the same time or not.
- **Need to understand the minimum and maximum throughput we can expect, in order to plan accordingly (see the rough estimate sketched after this list). Ueda-san, together with Hara-san, will look into the specifications.**
- **Need to measure the actual throughput with the actual cDSTs (for the file-size dependency).**
- The discussions about the new KEKCC from 2028 will start more than two years in advance. We should have a clear plan for calibration by the end of 2025 if we want our input to be taken into account.
- A significant increase of disk space is unlikely. We should thus plan how to use the tape.
- The throughput of the tape system depends on the size of the files. We cannot merge cDSTs from different runs, but apart from that there is some flexibility on the maximum file size. Currently we were aiming at 2 GB per file.
- The throughput depends on two factors: the number of tape drives and the performance of a single drive.
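As a rough illustration of the staging-time question above, a minimal sketch: the volumes are the ones quoted in these minutes, while the throughput values are assumed placeholders to be replaced by the actual KEKCC figures once Ueda-san and Hara-san provide them.

```python
# Back-of-the-envelope staging-time estimate (assumptions, not measurements).
# Volumes are taken from the resource summary above; the throughput scenarios
# are placeholders pending the measured KEKCC numbers.

TEMP_STORAGE_TB = 540    # temporary storage needed if the permanent copy is on tape
YEARLY_GROWTH_TB = 180   # cDST growth per year from 2024 on

# The note assumes ~150 TB/month; the real min/max throughput at KEKCC is still unknown.
for throughput_tb_per_month in (50, 150, 300):
    months_full_set = TEMP_STORAGE_TB / throughput_tb_per_month
    months_one_year = YEARLY_GROWTH_TB / throughput_tb_per_month
    print(f"{throughput_tb_per_month:>4} TB/month: "
          f"full temporary set staged in {months_full_set:.1f} months, "
          f"one year of cDSTs in {months_one_year:.1f} months")
```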
**Clarifications about the current note**

- A clear conclusion stating what can be done with the currently available resources at KEKCC should be added. For example, Table IV now speaks of 600 CPU slots, while at KEKCC 400 are available.
- The `b2_prod` queue and the `b2_calib` queue should not be confused: the purpose of the `b2_prod` queue is different from that of the `b2_calib` queue, and calibration jobs should not use the `b2_prod` queue. Currently, 400 job slots are assigned to the `b2_calib` queue. If more resources are needed at specific times for calibration, this should be negotiated, and slots may eventually be added to the `b2_calib` queue. <s>It is assumed that `b2_prod` will almost no longer be used.</s> [corrected by THara] [ueda]: *The assumption at the time of making the `b2_calib` queue was that the resource usage with `b2_prod` for HLT skimming could be **moved** to the calibration usage, so one should not expect that 400 (`b2_prod`) + 400 (`b2_calib`) slots can be used for calibration work.*

**Conclusion**

- A new version of the note with the requested clarifications will be distributed by the end of the week for the final sign-off; this will conclude the work of this review committee.
- The follow-up on how to use the tape after 2028 will continue in the DataProd-Computing meetings.

## Meeting with referees: 2023/06/20

Participants:

- Alexander Glazov
- Giulio Dujany
- Takanori Hara
- Umberto Tamponi (he/him)
- Markus Prim

Two separate topics: proc16 and the long-term future.

**KEKCC status** (reports by Umberto and Hara-san):

- **Database** will be sorted out: currently there is a Squid server at KEKCC for cvmfs. It is to be checked whether we can use the same one or need a dedicated one; getting a new server should be fine. Markus proposed to set it up well ahead of s-proc5 and will coordinate offline with Hara-san and Ueda-san. Markus points out that a solution for calibration will in any case be useful at KEKCC, as some experts, like Chris Hearty, already perform some calibration jobs there.
- **Disk space**: on top of the dataprod disk, we have at our disposal for recalibration 100 TB of local disk space for the output of the collectors and algorithms, plus 500 TB to store the cDSTs. These resources are expected to increase in the future. We will thus have no issue on this front as long as all the cDSTs fit into the available disk. When we need to start using tape, however, this will become an issue, as the system will become significantly more complex and we do not have someone at KEK who will take care of it. Tape usage can thus be a significant obstacle to the one-centre-at-KEKCC scenario in the long run.
- **CPUs**: the dataprod queue is enough for the collector jobs. A dedicated server can probably be set up for the algorithms.
- **Airflow interface**: this part still needs to be explored. Umberto is confident that all the technical issues can be solved; moreover, Airflow already performed some calibration at KEKCC in the past. Markus remarks that the usual issue is that we have just one Airflow expert, who will soon leave.
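Since the Airflow interface still has to be explored, a minimal sketch of what a calibration chain could look like as an Airflow DAG is given below. It assumes Airflow 2.x; the task names and the scripts they wrap are purely illustrative placeholders, not the actual CAF/b2cal setup.

```python
# Minimal illustrative Airflow DAG for a collector -> algorithm chain.
# Script names and the submission details are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="recalibration_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually per (re)calibration campaign
    catchup=False,
) as dag:
    # Collector jobs run over the cDSTs; a real setup would submit them to the
    # b2_calib queue at KEKCC and wait for completion instead of this placeholder.
    collect = BashOperator(
        task_id="run_collectors",
        bash_command="./run_collectors.sh",  # placeholder script
    )
    # Algorithm step merges the collector output and computes the payloads.
    compute_payloads = BashOperator(
        task_id="run_algorithm",
        bash_command="./run_algorithm.sh",  # placeholder script
    )
    collect >> compute_payloads
```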
**Plans for the future**

- It would be nice to have s-proc5 and, afterwards, proc16 run at KEKCC. To make this happen we need to set up the Squid server at KEKCC, copy the cDSTs there, and configure Airflow to work for KEKCC. All this can and should be done ahead of s-proc5. If we fail to do so, s-proc5 cannot wait, but it is unclear where it will be done.
- From what has been shown up to now, it looks like the single-centre solution is a viable option and should be preferred, as the multi-centre solution is more complex and needs some extra software development for which there is no person power available.
- Markus, Umberto et al. will provide a new version of the technical note in a couple of weeks with the latest estimates of the requested resources (for example, Umberto reported that the memory requirement for the alignment will probably be reduced) and what is currently available at KEKCC. A paragraph will also be added on the possible issue of needing the tape for reprocessing, which could make KEKCC not suitable in the very long run as a single recalibration centre.
- After summer, we will use s-proc5 to get a more refined estimate of the needs of a recalibration, we will update the note with that estimate, and we will conclude the review in time to present the conclusions at the autumn BPAC.

## Meeting with referees: 2023/05/25

Participants:

- Stefano Lacaprara
- Alexander Glazov
- Giulio Dujany
- Jake Bennett
- Karim Trabelsi
- Michael De Nuccio (they/them)
- Michel Hernandez Villanueva
- Takanori Hara
- Ueda (guest)
- Umberto Tamponi (he/him)

* KEKCC resources?
  * Tape is OK (we will have more than the pledge).
  * Disk: we will have 0.5 PB in addition to the 1.5 PB of dataprod.
  * A Squid server should be possible.
  * The algorithm machine can be complex:
    * we can use NAF/BNL in that case,
    * or a standard LSF queue at KEKCC.
  * The large-memory machine is for the alignment, which we do not plan to run promptly unless errors are found.
* Summer shutdown:
  * Not really an issue: no calibration is foreseen in summer.
* Grid option:
  * 5 GB is a limit for the grid, both for input and output files (a simple size check is sketched below, after this list).
  * gbasf2 tends to fail when creating such big mDSTs, and so rejects them as inputs too.
  * Do we really want/need to use gbasf2, or can we use the gb2prod tools?
    * Collector jobs are very similar to analysis jobs, so gbasf2 seems very fitting (out of the box).
    * gbasf2 is what is used for analysis: list of inputs, steering file, and you get a number of outputs.
    * The gb2prod tools require tuning and upgrades to work on such things; they are meant for something else, and we probably don't want to introduce these extra features/steps/complexities in (re)calibration.
    * The gb2prod tools are used for production: the interface is more complex and more tasks are delegated to the tools themselves, e.g. there is a merge step in gb2prod that doesn't exist in gbasf2, and there is a "tail" of processes.
  * gb2prod will have automatic staging, gbasf2 does not.
    * This is a point in favour of gb2prod, because we will need that for (re)calibration.
* Missing info:
  * whether KEKCC agrees to provide these resources;
  * how hard it is to make a gbasf2 CAF backend.
* Umberto: by September we will know whether the KEKCC-local option is a stable solution; if not, we must go for the grid option.
* Ueda: we will not know by September, because the staging can only be tested at the end of it (?).
* Umberto: isn't it sort of an already-solved problem?
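As an illustration of the 5 GB grid limit discussed above, a hypothetical pre-submission check of cDST file sizes follows. It is not an existing gbasf2 or gb2prod utility; the helper name and the input directory are placeholders, and only the 5 GB limit and the ~2 GB target per file come from these minutes.

```python
# Hypothetical pre-submission sanity check against the 5 GB grid I/O limit.
# Not an existing gbasf2 tool; purely an illustration of the constraint.
from pathlib import Path

GRID_LIMIT_BYTES = 5 * 1024**3   # 5 GB limit for grid input/output files
TARGET_BYTES = 2 * 1024**3       # current aim of ~2 GB per cDST file


def check_cdst_sizes(input_dir: str) -> list[Path]:
    """Return the cDST files that would exceed the grid limit."""
    too_big = []
    for path in sorted(Path(input_dir).glob("*.root")):
        size = path.stat().st_size
        status = "OK" if size < GRID_LIMIT_BYTES else "TOO BIG for the grid"
        print(f"{path.name}: {size / 1024**3:.2f} GB ({status})")
        if size >= GRID_LIMIT_BYTES:
            too_big.append(path)
    return too_big


if __name__ == "__main__":
    # Example usage with a placeholder local cDST directory.
    oversized = check_cdst_sizes("/path/to/cdst")
    if oversized:
        print(f"{len(oversized)} file(s) exceed the 5 GB grid limit")
```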