---
tags: HTC, Virgo, LIGO
---
# Virgo IGWN job submission tutorial
This document describes the procedures to follow in order to submit a computing job on the IGWN (International Gravitational-Wave Observatory Network) resource pool.
The ideal workflow is:
0. Get an x509 certificate and register it in the Virgo VOMS
1. Create a set of credentials to get identified by the IGWN resources
2. Get access to the collaboration resources
3. Connect to the IGWN submit nodes and use HTCondor
4. Describe your task and make it a `.sub`
5. Distributed data and software
<!-- 6. An example from A to Z (using distributed data and software)
7. What about long lasting jobs? -->
Each of the above points will be discussed in the following paragraphs.
---
## 0. Get an x509 certificate and register it in the Virgo VOMS
A valid x509 certificate is needed for almost every procedure shown below.
Please get in touch with your Home Institution in order to obtain a GRID-enabled x509 certificate.
Once the certificate is obtained install it on the machine(s) you intend to use to connect to the IGWN resources, following the instructions provided by your Home Institution.
Once you have a valid certificate, you are allowed to complete the procedure for the registration of your certificate in the Virgo VOMS (Virtual Organization Membership Service) by connecting to [this page](https://voms.cnaf.infn.it:8443/voms/virgo/user/home.action).
When opening this page the browser should prompt you to provide a certificate to connect to the website: select your x509 certificate from the suggestions, or pick it manually if needed.
Once connected to the page you can apply for VOMS memberships using the provided button. Apply for the `/virgo` and `/virgo/virgo` memberships by selecting them from the dropdown menu.

You'll be notified by email upon membership approval.
Once done, your VOMS groups and roles should include at least `/virgo` and `/virgo/virgo`.

The certificate may come in many different formats.
In the end it should be converted to an x509 certificate and then split into its private key and the certificate itself.
These two files, `userkey.pem` and `usercert.pem`, should be placed in `~/.globus`. For format-specific instructions on how to extract these two components please refer to your Home Institution.
The same `~/.globus` folder with the two contained files should be present on any machine from which you expect to create new proxy certificates (e.g. your laptop, a workstation, and your home directory on any remote IGWN submission machine).
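As an illustration, if your certificate was delivered as a PKCS#12 bundle (here hypothetically named `mycert.p12`), the split can usually be performed with `openssl`:
```bash
# Extract the certificate and the private key from a hypothetical PKCS#12 bundle
mkdir -p ~/.globus
openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in mycert.p12 -nocerts -out ~/.globus/userkey.pem
# The private key must be readable by its owner only
chmod 644 ~/.globus/usercert.pem
chmod 400 ~/.globus/userkey.pem
```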
---
## 1. Create your identification credentials
For this tutorial, let's assume your username will be `albert.einstein`.
First of all please install the `voms-proxy-*` tools and `ligo-proxy-init`. See the following table for the typical commands to perform such installations:
| OS | Install `voms-proxy-*` | Install `ligo-proxy-init` |
| -------- | -------- | -------- |
| macOS | `brew install voms` | [this page](https://www.lsc-group.phys.uwm.edu/lscdatagrid/doc/installclient.html) |
| Linux | `sudo apt-get install voms-clients`, `sudo dnf install voms-clients-cpp` or `sudo yum install voms-clients-cpp` | [this page](https://www.lsc-group.phys.uwm.edu/lscdatagrid/doc/installclient.html) |
The first command to issue on your connecting machine creates a valid proxy certificate which authenticates you (`albert.einstein`) and certifies that you are part of the collaboration (LIGO/Virgo/KAGRA).
The command to create such an impersonation proxy is:
```bash
ligo-proxy-init albert.einstein
```
It will prompt you for your account password (in our case `emc2_is_cool`).
In case you don't remember it, note that these credentials are the same ones used to log in to the official LIGO pages, for example [here](https://git.ligo.org/users/sign_in).
The output of this command should be the following:
```
Your identity: albert.einstein@LIGO.ORG
Enter pass phrase for this identity:
Creating proxy .................................... Done
Your proxy is valid until: Feb 14 22:26:34 2020 GMT
```
The proxy created via this command is indeed a VOMS proxy, hence the usual `voms-proxy-*` utilities are compatible and allow for detailed insight into the proxy.
In order to verify that the proxy has been correctly created simply issue:
```bash
voms-proxy-info
```
If the proxy has been correctly created you should get something along the lines of:
```
subject : /DC=org/DC=cilogon/C=US/O=LIGO/CN=Albert Einstein albert.einstein@ligo.org
issuer : /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon Basic CA 1
identity : /DC=org/DC=cilogon/C=US/O=LIGO/CN=Albert Einstein albert.einstein@ligo.org
type : EEC
strength : 2048
path : /tmp/x509up_u501
timeleft : 275:59:42
key usage : Digital Signature, Key Encipherment, Data Encipherment
```
Otherwise:
```
Proxy not found: /tmp/x509up_u501 (No such file or directory)
```
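For a more detailed view, including the VOMS attributes attached to the proxy (the `/virgo` memberships registered earlier), the standard `--all` flag can be used:
```bash
# Show full proxy details, including the VOMS FQANs
voms-proxy-info --all
```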
The proxy can be destroyed by calling:
```bash
voms-proxy-destroy
```
---
## 2. Get access to the collaboration resources
In order to access the collaboration resources two options are available:
* Through the `ssh.ligo.org` portal, a gateway to the collaboration's ssh-enabled hosts. One connects via `ssh` to the gateway and then selects the desired site; the gateway performs the ssh redirection.
* Directly through proxy authentication with `gsissh` towards a specific resource.
As a side note, a direct `ssh` connection can be achieved as well, but it won't be discussed in this tutorial.
Before connecting to the collaboration resources some procedures have to be completed:
1. Request an LDG account [here](https://grouper.ligo.org/ldg/request/).
2. Once the account has been created, one should upload the public ssh key [here](https://grouper.ligo.org/ldg/manage_ssh/).
3. Verify that the default login shell is set to anything but `nologin` at the page https://grouper.ligo.org/ldg/users/albert.einstein/ (where `albert.einstein` should be replaced with your username). If it isn't, set it to the desired option via the dropdown menu and then click on "Update all".
In order to check that registration and setup succeeded, one should perform two operations from a shell:
* Try to perform a connection using `gsissh`
```bash
gsissh albert.einstein@ldas-grid.ligo.caltech.edu
```
* Try to perform an `ssh` connection through the gateway `ssh.ligo.org`:
```bash
ssh albert.einstein@ssh.ligo.org
```
You should be prompted with a site selection message like the following:
```console
::: LIGO Data Grid Login Menu :::
Select from these LDG sites:
0. Logout
1. AEI - Albert Einstein Institute, Hannover (Germany)
2. CIT - California Institute of Technology, Pasadena, CA (USA)
3. LHO - LIGO Hanford Observatory - Hanford, WA (USA)
4. LLO - LIGO Livingston Observatory - Livingston, LA (USA)
5. UWM - Center for Gravitation, Cosmology & Astrophysics (Milwaukee, WI, USA)
6. IUCAA - Inter-University Centre for Astronomy & Astrophysics, Pune (India)
7. CDF - Cardiff University, Cardiff (UK) [Registered Users Only]
-------------------------------------------------------------
Z. Specify alternative user account
Enter selection from the list above:
```
Try to select one option to verify that you can connect using your credentials.
---
## 3. Connect to an IGWN submit node and submit a job
Once the credentials are correctly configured, you should have access to the submit hosts installed at several computing centers of the collaboration.
Each submit host is configured to connect to the underlying HTCondor computing workload manager.
Any computing task one wishes to run on the IGWN pool should be submitted from one of such submit hosts.
The list of submit hosts follows:
* stro.nikhef.nl @ Nikhef (not yet operational)
* ldas-osg.ligo.caltech.edu @ Caltech
* osg-ligo-1.t2.ucsd.edu @ UCSD
* cabinet-10-10-4.t2.ucsd.edu @ UCSD (restricted)
The connection follows the same directions given above to connect to generic collaboration resources, hence both `ssh` and `gsissh` access is supported.
The workload manager adopted by the collaboration is HTCondor.
In HTCondor the job description language takes the form of a submit file, identified by the `.sub` extension.
On the submit machine HTCondor is used via a set of CLI (Command Line Interface) commands.
Two commands are of crucial importance to submit and inspect a computing task:
1. `condor_submit` followed by the `.sub` path submits the described task to the computing pool. It returns a unique job ID used by the following commands. An example output is:
```bash
$: condor_submit test-job.sub
Submitting job(s).
1 job(s) submitted to cluster 972902.
```
2. `condor_q` followed by the unique ID returned by the submit command lets you inspect the status of the job, to make sure it is running or to notice that it went into the error/held/idle state. An example of output is:
```bash
$: condor_q 972902
-- Schedd: schedd.ip.address : <192.168.1.1:9619?... @ 02/03/20 16:39:08
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo029 ID: 972902 2/3 16:37 _ _ 1 1 972902.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 739 jobs; 3 completed, 2 removed, 199 idle, 487 running, 48 held, 0 suspended
```
Always remember that the declared output of a job is automatically propagated back to the directory on the submit node from which the submission happened.
In addition, the files declared in the `log`, `output` and `error` fields are created in that same directory.
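If a job stays idle for a long time, the standard `condor_q -better-analyze` command can help explain why it is not matching any slot:
```bash
# Ask the scheduler why job 972902 has not started yet
condor_q -better-analyze 972902
```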
---
## 4. Describe your task and make it a `.sub`
Your computing task can be unambiguously described by its input(s), the executable to run, the parameters of such executable and the produced output(s).
Following this schema it is trivial to create an HTCondor-specific job descriptor (namely a `.sub` file, or submit file), which can be used to launch a specific task on the IGWN pool.
Every example reported in the following sections is available for download on [GitHub](https://github.com/gabrielefronze/virgo-htc-tutorial).
Feel free to clone this repository.
### How to describe a job
A skeleton of a valid submit file is:
```htmlmixed=
universe = vanilla
transfer_input_files =
executable =
arguments =
transfer_output_files =
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein
queue 1
```
where `transfer_input_files`, `executable`, `arguments` and `transfer_output_files` have to be filled, while the `log`, `output` and `error` lines can be modified to redirect logging, output and error messages to other files.
The `ShouldTransferFiles` and `WhenToTransferOutput` lines are related to the job submission mechanics and shouldn't be changed, and the final `queue` line is the one that triggers the enqueuing of the task.
The `accounting_group` and `accounting_group_user` lines should be filled using the tags generated from [this site](https://accounting.ligo.org/user) and (typically) your own name in the form `albert.einstein`.
The GitHub repository contains two tools, `set_IGWN_group.sh` and `set_IGWN_user.sh`, which are provided to modify the accounting information of the submit files in bulk, as shown below.
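A minimal session using the two helper scripts might look like this (the accounting tag below is a placeholder; use the one generated for you by the accounting site):
```bash
# Bulk-edit the accounting fields of all .sub files in the repository
./set_IGWN_group.sh ligo.dev.o3.cbc.explore.test   # placeholder tag: substitute your own
./set_IGWN_user.sh albert.einstein
```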
#### Input/Output -less job - Exercise 1
In order to figure out how to fill those fields, let's consider a simple example where one simply wants to run the following bash command on the worker node:
```bash
ls -lrt .
```
which will list the content of the job's working directory on the HTCondor worker node.
There is no input needed for such an operation, nor is any output file created: the only interesting output ends up in `std.out`, where the output of `ls` is streamed.
In this case the submit file becomes:
```htmlmixed=
universe = vanilla
executable = /bin/ls
arguments = -lrt .
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein
queue 1
```
As you can see the executable is `/bin/ls`, while all additional arguments are put in the `arguments` field.
The executable path should not depend on the current working directory, hence it should be absolute.
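Assuming the submit file above is saved as `exercise1.sub` (a name chosen here for illustration), the full round trip looks like:
```bash
condor_submit exercise1.sub   # prints the cluster ID, e.g. 972902
condor_q 972902               # repeat until the job leaves the queue
cat std.out                   # the directory listing produced by ls on the worker node
```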
#### Input-only job - Exercise 2
Now let's figure out how to describe a job which takes a file as input and prints its contents to `std.out`.
Let's assume a file `something_to_print.txt` is present alongside the `.sub` file.
The command to print the file content would be:
```bash
cat ./something_to_print.txt
```
In this case the submit file becomes:
```htmlmixed=
universe = vanilla
transfer_input_files = ./something_to_print.txt
executable = /bin/cat
arguments = ./something_to_print.txt
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein
queue 1
```
The path given to the `transfer_input_files` field is relative to the path where the job submission happens on the submission host.
The main executable is `/bin/cat` and its argument is the path of the transferred input file, relative to the job's working directory on the worker node.
The inputs are transferred automatically at job submission.
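To try it out, one can create the input file next to the submit file (here hypothetically saved as `exercise2.sub`) and submit:
```bash
echo "Some text to print" > something_to_print.txt   # create the input file
condor_submit exercise2.sub
# once the job completes, std.out contains the contents of the input file
```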
#### Output-only job - Exercise 3
A job which takes no input and generates a file or a directory can be represented by the following command:
```bash
touch my_test_output_file.txt
```
which creates an empty file named `my_test_output_file.txt`.
In this case the submit file becomes:
```htmlmixed=
universe = vanilla
executable = /bin/touch
arguments = my_test_output_file.txt
transfer_output_files = ./my_test_output_file.txt
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein
queue 1
```
which should be of trivial interpretation.
Note that if the `arguments` value had to contain literal double quotes (e.g. to echo a string such as "Hello world!"), HTCondor syntax would require escaping each of them as two double quotes (`""`).
#### Input/Output job - Exercise 4
For this example let's assume one wants to perform a byte copy of an input file, creating a new file as output. Since shell redirections are not available in the `arguments` field, the job uses `cp`; on the command line the equivalent is:
```bash
cp ./my_input.txt ./my_output.txt
```
The submit file becomes:
```htmlmixed=
universe = vanilla
transfer_input_files = ./my_input.txt
executable = /bin/cp
arguments = my_input.txt my_output.txt
transfer_output_files = my_output.txt
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein
queue 1
```
which should be trivial to understand given the previous examples.
#### Script-executing job - Exercise 5
Let's assume a job runs a bash script named `script.sh`, available on the submit node.
In this case the custom script is simply set as the job's `executable`: HTCondor transfers the executable to the worker node automatically.
The submit file becomes:
```htmlmixed=
universe = vanilla
executable = ./script.sh
arguments = test_argument
transfer_output_files = ./surprise.txt
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein
queue 1
```
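As a sketch, a `script.sh` consistent with the submit file above (the actual script in the tutorial repository may differ) could be:
```bash
#!/usr/bin/env bash
# Hypothetical script.sh: writes its first argument into surprise.txt,
# which HTCondor then transfers back via transfer_output_files.
echo "Argument received: $1" > surprise.txt
```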
---
## 5. Distributed data and software
The collaboration deals with scientific data and software distribution.
Most of the collaboration's official data and software is already available on the worker nodes and does not need to be transferred manually via the submit file as shown above.
### Distributed data
At most computing sites the LIGO/Virgo GWF frame files are available via CVMFS, under `/cvmfs/oasis.opensciencegrid.org/ligo` or `/cvmfs/oasis.opensciencegrid.org/virgo`.
Access to such data is protected via x509 authentication.
The following line in the `.sub` file ensures that an x509 proxy is present on the worker node as well, so that the job is authenticated to access the GWF files.
```htmlmixed=
use_x509userproxy = true
```
An additional line can be added to the `.sub` file to require running on sites that provide data files via CVMFS. The line is:
```htmlmixed=
requirements = (HAS_LIGO_FRAMES=?=True)
```
If such a requirement is enforced, the worker node will be able to use the distributed copy of the data without the need to define any input file in the `.sub` job descriptor.
Typically the available `gwdatafind` installation will be configured to rely on the local CVMFS distribution if present.
`gwdatafind` returns a list of file paths satisfying a set of query parameters. Such parameters are:
* **observatory** [see here](https://computing.docs.ligo.org/guide/data/#local-data-discovery) for available names
* **data type** [see here](https://computing.docs.ligo.org/guide/data/#datasets) for available types
* **GPS start** and **stop** times, expressed in GPS seconds
For example, a `gwdatafind` call is:
```bash
python -m gwdatafind -o L -t L1_HOFT_C00 -s 1187008866 -e 1187008898
```
which requests Livingston (`L`) h(t) data produced by the real-time (`C00`) calibration pipeline for the given GPS interval.
The typical output of such a command is:
```bash
'file://localhost/cvmfs/oasis.opensciencegrid.org/ligo/frames/O2/hoft/L1/L-L1_HOFT_C00-11870/L-L1_HOFT_C00-1187008512-4096.gwf'
```
which is indeed a local CVMFS path at:
```bash
/cvmfs/oasis.opensciencegrid.org/ligo/frames/O2/hoft/L1/L-L1_HOFT_C00-11870/L-L1_HOFT_C00-1187008512-4096.gwf
```
Such data queries can be performed on any machine configured with the LIGO/Virgo CVMFS data distribution, hence on any HTCondor worker node satisfying the `HAS_LIGO_FRAMES` requirement.
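As a sketch, a worker-node script could turn the returned URLs directly into local paths (assuming, as in the sample output above, one `file://localhost/...` URL per line, possibly quoted):
```bash
#!/usr/bin/env bash
# Hypothetical helper: list the local CVMFS frame files for the queried interval
python -m gwdatafind -o L -t L1_HOFT_C00 -s 1187008866 -e 1187008898 \
  | sed -e "s|['\"]||g" -e 's|^file://localhost||' \
  | while read -r path; do
        ls -lh "$path"   # the frame file is directly readable from CVMFS
    done
```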
For further details see [this page](https://computing.docs.ligo.org/guide/data/).
### Distributed software
Similarly to what happens with the collaboration data, two CVMFS repositories for LIGO and Virgo official software are distributed and made available on the worker nodes.
If your executable happens to be distributed via such channel, it will already be available on the worker node without the need for a manual transfer as input.
If a script named `my_test_script.sh` is distributed via CVMFS, you can directly use its CVMFS path as the main executable (e.g. `/cvmfs/.../my_test_script.sh`).
#### CVMFS software-executing job
Let's assume a job runs an executable named `my_exe` available in the Virgo CVMFS repository at path `/cvmfs/virgo.ego-gw.it/sw/albert.einstein/my_exe`.
The submit file becomes:
```htmlmixed=
universe = vanilla
executable = /cvmfs/virgo.ego-gw.it/sw/albert.einstein/my_exe
transfer_executable = False
arguments = <put your arguments here>
transfer_output_files = <register here the outputs you want to retrieve>
log = std.log
output = std.out
error = std.err
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein
queue 1
```
where the `arguments` and `transfer_output_files` lines might be deleted if not necessary.
Note the `transfer_executable = False` line, which avoids an unnecessary transfer: the executable is already available on the worker node via CVMFS.
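When in doubt, the presence of the executable can be checked beforehand from any machine with the CVMFS repository mounted (e.g. the submit host, if so configured):
```bash
# Sanity check: verify the CVMFS-distributed executable is visible
ls -l /cvmfs/virgo.ego-gw.it/sw/albert.einstein/my_exe
```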
<!-- ---
## 6. An example from A to Z (using distributed data and software)
---
## 7. What about long lasting jobs?
TODO -->
## How to run at CNAF directly
In order to run an HTCondor job at CNAF using the local resource pool, one should connect via `ssh` to `ui01-virgo.cnaf.infn.it` or `ui02-virgo.cnaf.infn.it`, both reachable from `bastion.cnaf.infn.it`.
All of the above instructions are still valid with the following conversion map:
| IGWN submit node | CNAF |
| -------- | -------- |
| `condor_submit SUB_FILE` | `condor_submit -name sn-01.cr.cnaf.infn.it -spool SUB_FILE` |
| `condor_q JOB_ID` | `condor_q -name sn-01.cr.cnaf.infn.it JOB_ID` |
In addition, since this kind of submission relies on an HTCondor-CE, the job output is not automatically transferred back to the submit host.
In order to retrieve the job output (the outputs defined in `transfer_output_files`, plus the `output`, `error` and `log` files) one should run the following command:
```bash
condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it JOB_ID
```
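Putting it all together, a typical CNAF session might look like this (the job ID `972902` is a placeholder for the cluster ID returned by the submit command):
```bash
# Submit, monitor and retrieve the output of a job at CNAF
condor_submit -name sn-01.cr.cnaf.infn.it -spool test-job.sub
condor_q -name sn-01.cr.cnaf.infn.it 972902
condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 972902
```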