# LArSoft Production with Balsam

###### tags: `larsoft` `theta` `adsp` `production` `balsam`

[ToC]

## Balsam - Get Started

Balsam is the software that will help us manage the jobs.

- Balsam instructions: https://balsam.readthedocs.io/en/latest/index.html
- We need to use the latest Balsam at: https://github.com/argonne-lcf/balsam/
- There is also the [older balsam documentation](https://balsam.readthedocs.io/en/master/), which may be useful.

There is a virtual environment where Balsam is already installed:

```bash=
source /grand/neutrino_osc_ADSP/env_balsam/bin/activate
```

To check that we can run a simple test job, create a Balsam site and submit a couple of test jobs. This is taken from the Balsam [tutorial](https://balsam.readthedocs.io/en/latest/tutorials/theta-quickstart.html) itself, with some modifications. You may want to read the tutorial too.

```bash=0
# [cd to your local area]
balsam site init balsam_testsite
# [Select the default configuration for Theta-KNL]
cd balsam_testsite/
balsam site start
```

Balsam login doesn't work, but Misha can generate a token. I had to put the following block in `~/.balsam/client.yml`:

```
api_root: https://balsam-dev.alcf.anl.gov
client_class: balsam.client.BasicAuthRequestsClient
connect_timeout: 6.2
read_timeout: 120.0
retry_count: 3
token: XXX
token_expiry: '2021-07-25T19:35:08.727143'
username: deltutto
```

You can ask Corey to get something like this generated.

I also changed the following in `settings.yml`:

```yaml=
idle_ttl_sec: 300
```

Then, we need to uncomment this line in the `job-template.sh` script:

```
# export https_proxy=http://theta-proxy.tmi.alcf.anl.gov:3128
```

This is needed because the jobs live in an external database: the login nodes can interact with it since they have outbound network access, but the compute nodes don't, so the server (which used to be internal to ALCF and is now public) is unreachable from them. This export tunnels outbound traffic through a proxy server to let the compute nodes reach the outside world.

Now the Balsam site is running, and we can create a couple of test jobs:

```bash=0
balsam app create
> Application Name (of the form MODULE.CLASS): test.Hello
> Application Template [e.g. 'echo Hello {{ name }}!']: echo hello {{ name }} && sleep {{ sleeptime }} && echo goodbye

balsam job create --app test.Hello --workdir test/1 --param name="world" --param sleeptime=2
balsam job create --app test.Hello --workdir test/2 --param name="balsam" --param sleeptime=1

balsam queue submit -q debug-cache-quad -A neutrino_osc_ADSP -n 1 -t 10 -j mpi
```

The jobs can be monitored with:

```bash=0
watch "qstat -u $USER && balsam queue ls"
```

and the machine activity can be seen here: https://status.alcf.anl.gov/theta/activity. The output is placed in the `data/` directory.
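Job creation can also be done from Python with the Balsam API, which is how the production jobs are created later in this document. The following is a minimal sketch, assuming the `test.Hello` app registered above and that the script is run from inside the Balsam site directory (so that `site_config` can resolve the site):

```python=
# Minimal sketch: create the two Hello test jobs with the Balsam Python API
# instead of `balsam job create`. Assumes the `test.Hello` app registered
# above and that this script is run from inside the Balsam site directory.
from balsam.api import App, Job, site_config

hello_app = App.objects.get(site_id=site_config.site_id, class_path="test.Hello")

for i, (name, sleeptime) in enumerate([("world", "2"), ("balsam", "1")], start=1):
    job = Job(
        workdir=f"test/{i}",
        app_id=hello_app.id,
        parameters={"name": name, "sleeptime": sleeptime},
    )
    job.save()
    print(job)
```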
## LArSoft Test Job with Balsam

Create a new app in the Balsam site that will run a prodsingle LArSoft job:

```bash=0
balsam app create
> Application Name (of the form MODULE.CLASS): prodtest.ProdSingle
> Application Template [e.g. 'echo Hello {{ name }}!']: to_be_written
```

Then open the file `apps/prodtest.py` and modify the `command_template` attribute to:

```python=0
command_template = '''
singularity run -B /lus/grand/projects/neutrino_osc_ADSP:/lus/grand/projects/neutrino_osc_ADSP:rw /lus/grand/projects/neutrino_osc_ADSP/fnal-wn-sl7.sing <<EOF

#setup SBNDCODE:
source /lus/grand/projects/neutrino_osc_ADSP/software/larsoft/products/setup
setup sbndcode v09_24_02 -q e20:prof

lar -c prodsingle_mu_bnblike.fcl

EOF
'''
```

Save it and run `balsam sync app` to sync the newly modified app.

Create a new job:

```bash=
balsam job create --app prodtest.ProdSingle --workdir prodtest_single/0
```

And check that the job has been created with:

```bash=
balsam job ls
```

Now let's run it:

```bash=
balsam queue submit -q debug-cache-quad -A neutrino_osc_ADSP -n 1 -t 10 -j mpi
watch "qstat -u $USER && balsam queue ls"
```

If it is successful, the output files will be in `data/prodtest_single/0`.

## LArSoft Production Workflow

### Introduction

Up to now we've been running jobs in `mpi` mode, in which Balsam runs on the `mom` node. This mode runs the Balsam executable, which launches a bunch of `aprun` calls (the Theta version of `mpirun`), and those go onto the compute nodes.

In `serial` mode, it's a bit inverted: Balsam runs on each compute node (via `aprun`) and does `Popen` calls without `aprun` involved, so it is a direct fork of the process on the compute node. The advantage here is that each Balsam job can use a portion of a compute node, instead of a granularity of one node or higher. With 64 cores per node, we have been running the LArSoft jobs with 64 `lar` calls per node, and scaling out onto hundreds of nodes. For this, we end up packing in a ton of LArSoft jobs, launching a big Balsam job, and they all run together.

We want to have several stages, though, which requires a workflow. When `serial` mode (or `mpi` mode, too) asks the database for jobs, it uses the list of jobs that are `PREPROCESSED`. Jobs can also have parents and children, which prevents them from running right away. Check out the diagram down the page here: https://balsam.readthedocs.io/en/master/userguide/app.html

We can have a job (say, prodsingle) that is a parent to a geant job, which is a parent to a detsim job, etc. We can break societal norms too: jobs can have unlimited numbers of parents or children.

One optimization that was found to help is to do something like:

- Job 1: `lar -c prodsingle.fcl -n 25`
- Jobs 2-26: `lar -c g4.fcl -n 1 --nskip {JOB_INDEX} -i singles.root`, each has Job 1 as a parent
- Jobs 27-51: `lar -c detsim.fcl -n 1 -i singles_g4_JOB_INDEX.root`, each has exactly 1 parent from the previous step
- Job 52: `lar -c merge.fcl -i {all detsim outputs}`, this job has every job from 27 to 51 as its parent

This gives a graph of jobs that Balsam will take, dynamically running every job whose parents have finished, until there are no more jobs (a sketch of the fan-in merge step is shown at the end of this section).

You can also slice between workflows with `tags`, so one job could be `sbnd:g4`, one could be `uboone:prodsingle`, and one could be `sbnd:detsim`, and you can select only `sbnd`, only `detsim`, only whatever.
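As a concrete illustration of the fan-in step in the list above (the final merge job that has every detsim job as a parent), here is a minimal sketch using the Balsam Python API introduced in the next section. The `production.Merge` class path, its `output` parameter, and the `detsim_jobs` argument are assumptions for illustration only, not taken from the production code:

```python=
# Minimal sketch of the fan-in merge step: one job whose parents are all of
# the detsim jobs. The `production.Merge` app, its `output` parameter and the
# `detsim_jobs` argument are hypothetical, for illustration only.
from balsam.api import App, Job, site_config

def create_merge_job(detsim_jobs, node_packing_count=64):
    merge_app = App.objects.get(site_id=site_config.site_id, class_path="production.Merge")
    merge_job = Job(
        workdir="prod_merge/0",
        app_id=merge_app.id,
        tags={'sbnd': 'merge'},
        parameters={"output": "merged_sbnd.root"},
        node_packing_count=node_packing_count,
        num_nodes=1,
        parent_ids={j.id for j in detsim_jobs},  # every detsim job is a parent
    )
    merge_job.save()
    return merge_job
```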
### Creating a Workflow

The following shows an example of how we can create a prodsingle job, followed by several Geant4 jobs which have the prodsingle one as the parent.

First, we need to make two apps, one for prodsingle and one for Geant4. This is done similarly to the example above, but now we need to pass some arguments.

For prodsingle, the `command_template` should look something like:

```python=
command_template = '''
singularity run -B /lus/grand/projects/neutrino_osc_ADSP:/lus/grand/projects/neutrino_osc_ADSP:rw /lus/grand/projects/neutrino_osc_ADSP/fnal-wn-sl7.sing <<EOF

#setup SBNDCODE:
source /lus/grand/projects/neutrino_osc_ADSP/sbndcode/setup
setup sbndcode v09_24_02 -q e20:prof

lar -c prodsingle_mu_bnblike.fcl --nevts {{nevts}} --output {{output}}

EOF

echo "\nJob finished in: $(pwd)"
echo "Available files:"
ls
'''
```

While for Geant4:

```python=
command_template = '''
singularity run -B /lus/grand/projects/neutrino_osc_ADSP:/lus/grand/projects/neutrino_osc_ADSP:rw /lus/grand/projects/neutrino_osc_ADSP/fnal-wn-sl7.sing <<EOF

#setup SBNDCODE:
source /lus/grand/projects/neutrino_osc_ADSP/sbndcode/setup
setup sbndcode v09_24_02 -q e20:prof

lar -c standard_g4_sbnd.fcl --nevts {{nevts}} --nskip {{nskip}} --output {{output}} --source {{source}}

EOF

echo "\nJob finished in: $(pwd)"
echo "Available files:"
ls
'''
```

Instead of creating jobs from the command line, we will use the Balsam API. This Python script creates the jobs:

```python=
from balsam.api import Job, App, site_config

n_events = 5
node_packing_count = 64  # This should always be 64

#
# ProdSingle
#

prodsingle_app = App.objects.get(site_id=site_config.site_id, class_path="production.ProdSingle")

prodsingle_jobs = []
for i in range(1):
    job = Job(
        workdir=f"prod_prodsingle/{i}",
        app_id=prodsingle_app.id,
        tags={'sbnd': 'prodsingle'},
        parameters={
            "nevts": f"{n_events}",
            "output": f"prodsingle_sbnd_{i}.root"},
        node_packing_count=node_packing_count,
        num_nodes=1,
        parent_ids=(),
    )
    print(job)
    job.save()
    prodsingle_jobs.append(job)

#
# Geant4
#

geant4_app = App.objects.get(site_id=site_config.site_id, class_path="production.Geant4")

geant4_jobs = []
for i in range(n_events):
    job = Job(
        workdir=f"prod_geant4/{i}",
        app_id=geant4_app.id,
        tags={'sbnd': 'geant4'},
        parameters={
            "nevts": "1",
            "nskip": f"{i}",
            "output": f"prodsingle_geant4_sbnd_{i}.root",
            "source": f"{prodsingle_jobs[0].parameters['output']}"},
        node_packing_count=node_packing_count,
        num_nodes=1,
        parent_ids={prodsingle_jobs[0].id},
    )
    print(job)
    job.save()
    geant4_jobs.append(job)
```

Running this should create the following:

```bash=
$> balsam job ls
   ID  Site                          App                    Workdir            State             Tags
67191  thetalogin6:balsam_testsite2  production.Geant4      prod_geant4/4      AWAITING_PARENTS  {'sbnd': 'geant4'}
67187  thetalogin6:balsam_testsite2  production.Geant4      prod_geant4/0      AWAITING_PARENTS  {'sbnd': 'geant4'}
67190  thetalogin6:balsam_testsite2  production.Geant4      prod_geant4/3      AWAITING_PARENTS  {'sbnd': 'geant4'}
67189  thetalogin6:balsam_testsite2  production.Geant4      prod_geant4/2      AWAITING_PARENTS  {'sbnd': 'geant4'}
67186  thetalogin6:balsam_testsite2  production.ProdSingle  prod_prodsingle/0  STAGED_IN         {'sbnd': 'prodsingle'}
67188  thetalogin6:balsam_testsite2  production.Geant4      prod_geant4/1      AWAITING_PARENTS  {'sbnd': 'geant4'}
```

We can then submit this with the usual:

```bash=
balsam queue submit -q debug-cache-quad -A neutrino_osc_ADSP -n 1 -t 30 -j mpi
watch "qstat -u $USER && balsam queue ls"
```

### Full Workflow

The full workflow involves:

- Generation stage: here I am doing 50 events per gen job.
- Geant4 stage: one job for each of the 50 generated events above.
- Detsim: one job for each of the Geant4 jobs.
- Merge: one job per 50 events above. It merges all the 50 detsim files into one.

Example of a Balsam `app` file that defines all four applications above: https://gist.github.com/marcodeltutto/ec536144fb876dc11d861302d227b3cf

Example of a Python script that creates all the needed jobs: https://gist.github.com/marcodeltutto/caccf9e09716c85a919e841f6d97457b
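The gists above contain the complete definitions. As a condensed sketch of how the detsim stage extends the script from the previous section (one detsim job per Geant4 job, each with exactly one parent), assuming a hypothetical `production.Detsim` app with `output` and `source` parameters analogous to the Geant4 one:

```python=
# Condensed sketch of the detsim stage: one detsim job per Geant4 job, each
# with exactly one parent. The `production.Detsim` class path and its
# parameters are assumptions, following the Geant4 pattern above; the real
# definitions are in the linked gists.
from balsam.api import App, Job, site_config

def create_detsim_jobs(geant4_jobs, node_packing_count=64):
    detsim_app = App.objects.get(site_id=site_config.site_id, class_path="production.Detsim")
    detsim_jobs = []
    for i, g4_job in enumerate(geant4_jobs):
        job = Job(
            workdir=f"prod_detsim/{i}",
            app_id=detsim_app.id,
            tags={'sbnd': 'detsim'},
            parameters={
                "output": f"prodsingle_detsim_sbnd_{i}.root",
                "source": g4_job.parameters["output"]},
            node_packing_count=node_packing_count,
            num_nodes=1,
            parent_ids={g4_job.id},  # exactly one parent: the matching Geant4 job
        )
        job.save()
        detsim_jobs.append(job)
    return detsim_jobs
```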
### Optimizing the Full Workflow

Assume you submit 100,000 events, which turns out to be 2,000 prodsingle jobs of 50 events each. The g4 and detsim jobs do 1 event each, and then there is a merge job at the end that merges each batch of 50 events back into one file.

The minimum job size in Theta's default queue is 128 nodes, i.e. 8,192 cores (64 cores per node). 100,000 events would need ~1,500 nodes to process all the g4 and detsim jobs simultaneously. Let's assume you do 128 nodes. Balsam will start by filling nodes with prodsingle jobs, using 2,000 cores (roughly 32 nodes), and the other nodes will sit idle at first. When the first prodsingle job comes back, Balsam will mark it as finished and its child jobs will get preprocessed (at this stage, a process on the mom node links the input files in). Once those g4 jobs get preprocessed, they will start to launch onto compute nodes. Meanwhile, the prodsingle jobs will keep coming back, and g4 jobs will keep getting preprocessed and launched. Eventually, once enough prodsingle jobs have come back (a bit over 160 of them, at 50 events each), you will have more than 8,192 g4 jobs ready to go and Balsam will fill up the rest of the cores. And, as g4 jobs finish, more g4 jobs will slip in, but so will detsim jobs, once they are preprocessed after their g4 parent finishes.

With such an asymmetric parent/child compute balance, it makes a lot of sense to launch one or two jobs on the debug queue to do just prodsingle, and then, when they are starting to finish processing, launch a big job for g4 + detsim:

```bash=
balsam queue submit [...] --tag sbnd=prodsingle
```

For example, if we want to produce 12,800 events, we can do 256 gen jobs of 50 events each. First, we submit only the gen jobs to the debug queue:

```bash=
balsam queue submit -q debug-cache-quad \
  -A neutrino_osc_ADSP -n 8 -t 60 -j serial \
  --tag sbnd=prodsingle
```

When they are completed, we submit all the others to the default queue:

```bash=
balsam queue submit -q default \
  -A neutrino_osc_ADSP -n 128 -t 60 -j serial
```
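As a sanity check on these numbers before submitting, a small helper like the following can be used. It is not part of Balsam, just plain arithmetic encoding the workflow described above (gen jobs of 50 events, one g4 and one detsim job per event, one merge job per gen job, 64 single-core jobs packed per node):

```python=
# Hypothetical sizing helper (not part of Balsam): counts the Balsam jobs and
# the nodes needed for the gen/g4/detsim/merge workflow described above.
import math

def size_production(n_events, events_per_gen=50, node_packing_count=64):
    n_gen = math.ceil(n_events / events_per_gen)
    n_g4 = n_events       # one g4 job per event
    n_detsim = n_events   # one detsim job per event
    n_merge = n_gen       # one merge job per batch of `events_per_gen` events
    return {
        "gen_jobs": n_gen,
        "g4_jobs": n_g4,
        "detsim_jobs": n_detsim,
        "merge_jobs": n_merge,
        "total_balsam_jobs": n_gen + n_g4 + n_detsim + n_merge,
        # Nodes needed to run every gen (or g4) job at the same time.
        "nodes_for_all_gen": math.ceil(n_gen / node_packing_count),
        "nodes_for_all_g4": math.ceil(n_g4 / node_packing_count),
    }

# 100,000 events -> 2,000 gen jobs; ~1,563 nodes to run every g4 job at once.
print(size_production(100_000))

# 12,800 events -> 256 gen jobs, which need only 4 nodes' worth of job slots.
print(size_production(12_800))
```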