# How to Run `ipyrad` on Hydra
***All of these steps should be run on one of the login nodes.***
## Create an `IPython` Profile Compatible with Hydra
(You only have to do this once)
1. Load the ipyrad module:
```
$ module load bioinformatics/ipyrad/0.7.29
```
1. Configure ipyrad to run on Hydra by running `config4hydra`:
```
$ config4hydra
```
- This script creates an "sge" `IPython` profile. (The name *sge* is arbitrary, but it is used throughout the rest of this walk-through.) It creates a directory called `~/.ipython` in your HOME directory and prepopulates it with default config files that have been edited to use Hydra's Grid Engine (GE).
3. **From your ipyrad working directory:** run the `cp_templates` command to copy the 3 template files (`template`, `run-ipyrad.job` and `start-stop-ipcluster.csh`) into your working directory (a quick check follows below):
```
$ cp_templates
```
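- To verify that both steps worked, check that the profile directory and the three template files exist. (This assumes the default profile name `sge`; the `profile_sge` location follows the standard `IPython` layout.)
```sh
# the Hydra-ready IPython profile created by config4hydra
ls ~/.ipython/profile_sge/
# the three template files copied by cp_templates
ls template run-ipyrad.job start-stop-ipcluster.csh
```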
## Testing `ipcluster` on Hydra
(Again, you only have to do this once, to ensure that the previous steps were done correctly)
1. **From your ipyrad working directory:** start `ipcluster` as follows: `ipcluster start -n N --profile=sge --daemonize`, where you replace `N` with the number of 'engines' to start (for example, 4). This will start N+1 jobs: one `ipcontroller` and N `ipengine`s, submitted as N tasks of one job array.
`$ ipcluster start -n 4 --profile=sge --daemonize`
1. Check that the N+1 jobs have been started by the GE and are (eventually) in the 'r' (running) state.
*Make sure to wait at least 1 minute, to get past the 60-second delay that we programmed into the config files (see Appendix).*
```
$ module load tools/local
$ q+ +a%
```
1. If you see N+1 entries in the queue, everything is working properly. (An alternative check with plain `qstat` is shown after this list.)
2. Now stop the `ipcluster`, otherwise it will keep running with N+1 jobs doing nothing:
`$ ipcluster stop --profile=sge`
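- As an alternative to `q+`, the same check can be done with plain `qstat` by counting the running `controller` and `engine` jobs; this mirrors what `start-stop-ipcluster.csh` does internally (see the script further below). With N=4 you should get:
```sh
qstat -s r -u $USER | grep -c " controller "   # should print 1
qstat -s r -u $USER | grep -c " engine "       # should print 4 (i.e., N)
```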
## Running an `ipyrad` Job on Hydra
***All of these steps should be run on one of the login nodes, using a distinct working directory.***
1. Prepare a parameter file following the instructions here: https://ipyrad.readthedocs.io/tutorial_intro_cli.html#create-an-ipyrad-params-file
1. That params file will later be used when `ipyrad` is started via the job file. You can create it with something like this:
`$ ipyrad -n project_name`
which writes `params-project_name.txt` in the current directory; then edit the parameters appropriately with a text editor.
1. You can adjust two parameters in the `start-stop-ipcluster.csh` file (an example of changing both follows the script below):
* The first is `N`, the number of `IPython` "engines" to run, which we have set to **20** on line 4.
* The second is the queue and the amount of memory _each_ `IPython` "engine" requests. This is done by editing the value of the `queueSpec` variable on line 5.
- By default, the engines run in the long high-memory queue `lThM.q`, requesting **30GB** per core. This has been tested on a large dataset and should be sufficient for most projects.
- The default `start-stop-ipcluster.csh` file looks like this:
```sh
#!/bin/csh
#
# no of engines, queue specification, stop file name, how often to check for the stop file, and the ipcluster profile name
@ N = 20
set queueSpec = 'lThM.q -l himem,mres=30G'
set stopFile = stop-ipcluster-profile=sge.now
set waitTime = 5m
set ipProfile = sge
#
# load the ipyrad module
module load bioinformatics/ipyrad/0.7.29
#
# remove the stop file, if there is one
rm -f $stopFile
#
# start the predefined IPython cluster (SGE type), with the queue spec, --daemonize MUST be last arg
echo + `date` ipcluster start --n $N --profile=$ipProfile --BatchSystemLauncher.queue="$queueSpec" --daemonize
ipcluster start --n $N --profile=$ipProfile --BatchSystemLauncher.queue="$queueSpec" --daemonize
#
# wait to let all the engines start,
# by qstating and counting the engines every 30 sec, for 10 passes (5m)
@ nPassMax = 10
@ iPass = 1
loop:
echo + `date` sleep 30
sleep 30
@ nc = `qstat -s r -u $USER | grep -c " controller "`
@ ne = `qstat -s r -u $USER | grep -c " engine "`
echo + `date` "$nc controller and $ne engine(s) running in the queue"
if ($nc == 0 || $ne != $N) then
if ($iPass > $nPassMax) then
echo + `date` "no controller or wrong number of engine(s) in the queue at pass # $iPass"
echo + `date` ipcluster stop --profile=$ipProfile
ipcluster stop --profile=$ipProfile
echo + `date` $iPass passes, exiting
exit 1
endif
@ iPass++
goto loop
endif
#
# submit the job, passing on the stop file name and the ipProfile
echo + `date` qsub run-ipyrad.job $stopFile $ipProfile
qsub run-ipyrad.job $stopFile $ipProfile
#
# now loop until the stop file is found
echo + `date` looking for $stopFile every $waitTime
@ i = 0
while (! -e $stopFile)
sleep $waitTime
@ i++
if ( ($i % 12) == 0) echo + `date` looking for $stopFile every $waitTime
end
#
# job completed, stopping the IPython cluster
echo + `date` ipcluster stop --profile=$ipProfile
ipcluster stop --profile=$ipProfile
#
# we're done
exit
```
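For example, to run 10 engines, each requesting 16 GB instead of 30 GB, you would change lines 4 and 5 of the script to something like this (illustrative values; the resource syntax follows the default shown above, so adjust the queue and memory to what your project actually needs):
```sh
# 10 engines instead of 20, each requesting 16 GB in the long high-memory queue
@ N = 10
set queueSpec = 'lThM.q -l himem,mres=16G'
```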
4. The default template file, `template`, specifies the parameters needed to run `ipcluster` (which `ipyrad` uses) on Hydra. It should not need to be edited; by default it looks like this (an illustrative rendered version follows the listing):
```sh
#
#$ -cwd -j y -o $JOB_NAME.$JOB_ID.$TASK_ID.log
#$ -q {queue}
#$ -t 1-{n}
#
module load bioinformatics/ipyrad/0.7.29
set type = $JOB_NAME
set echo
python -m ipyparallel.$type --profile-dir="{profile_dir}" --cluster-id="{cluster_id}"
```
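When `ipcluster` starts, it fills in the `{queue}`, `{n}`, `{profile_dir}` and `{cluster_id}` placeholders and writes the `controller` and `engine` files it submits (see the note on generated files further below). Purely for illustration, with N=4 and the default `queueSpec` the generated `engine` file would look roughly like this (the home path and cluster id shown here are placeholders and will differ on your account):
```sh
#
#$ -cwd -j y -o $JOB_NAME.$JOB_ID.$TASK_ID.log
#$ -q lThM.q -l himem,mres=30G
#$ -t 1-4
#
module load bioinformatics/ipyrad/0.7.29
set type = $JOB_NAME
set echo
python -m ipyparallel.$type --profile-dir="/home/username/.ipython/profile_sge" --cluster-id=""
```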
1. The job file `run-ipyrad.job` submits your `ipyrad` job on Hydra and uses the `ipcluster` jobs spun up by `start-stop-ipcluster.csh`.
The one line you will want to change is the `ipyrad` command line (the `ipyrad -p ...` line near the end of the file): set it to the parameter file created for this run and the list of steps to execute (an example follows the listing below).
```sh
#!/bin/csh
#$ -cwd -j y -o run-ipyrad.log
#$ -N ipyrad
#$ -q lThC.q
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
#
set stopFile = $1
set ipProfile = $2
module load bioinformatics/ipyrad/0.7.29
#
# now start ipyrad
ipyrad -p params-project_name.txt -s 1234567 -t 1 --ipcluster=$ipProfile
#
# when done, tell start-stop to shut down the IPython cluster
date > $stopFile
#
echo = `date` job $JOB_NAME done
```
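For example, if your parameter file is `params-my_project.txt` (a placeholder name) and you only want to run steps 1 through 3, the `ipyrad` line would become:
```sh
ipyrad -p params-my_project.txt -s 123 -t 1 --ipcluster=$ipProfile
```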
1. In your working directory you should now have 4 files:
```
$ ls -l
-rw-rw-r-- params-project_name.txt
-rw-rw-r-- run-ipyrad.job
-rwxrwxr-x start-stop-ipcluster.csh
-rwx------ template
```
- **NOTE**: Make sure that `start-stop-ipcluster.csh` is executable and that the `template` file has the correct permissions; if not, do the following:
`$ chmod +x start-stop-ipcluster.csh`
`$ chmod 700 template`
1. Finally, start your `ipyrad` job as follows:
`$ ./start-stop-ipcluster.csh &> start-stop-ipcluster-login01.log &`
* The `&` at the end of the line runs the command in the background, so that you can continue working on Hydra while the job runs. If you start it on login02, adjust the log file name accordingly.
* That script starts and stops everything: `ipcluster`, one `controller`, N `engines`, and the `ipyrad` job.
* `ipcluster` produces a `controller` and an `engine` file from the `template` file, as well as a controller log file (`controller.NNNN.1.log`) and N engine log files (`engine.NNNN.M.log`). The `ipyrad` job produces a log file as well (`run-ipyrad.log`). You can use these logs to monitor progress, as shown below.
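For instance, while the job is running you can keep an eye on it from the login node like this (the log file names follow the defaults above):
```sh
# follow the ipyrad progress output
tail -f run-ipyrad.log
# check the start/stop script's own log (the name you chose when launching it)
tail start-stop-ipcluster-login01.log
# list the controller, engine and ipyrad jobs still in the queue
qstat -u $USER
```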
## Stopping a Job
* The `start-stop-ipcluster.csh` script initiates `ipcluster`, starts the `controller` and `engine` jobs, and submits the `ipyrad` job on Hydra using the GE.
* When the `ipyrad` job completes, it generates a stop file (`stop-ipcluster-profile=sge.now`) that is detected by the `start-stop-ipcluster.csh` script, which then stops all the running `ipcluster` processes in the queue and on the login node.
* So if all goes well, everything stops on its own _cleanly_.
**If you wish to kill a running ipyrad job** for whatever reason, you _must_ do so with the following two steps:
* (1) kill the `ipyrad` job with `qdel`, and
* (2) create the `stop-ipcluster-profile=sge.now` file in your working directory, as follows:
`$ echo kill >stop-ipcluster-profile=sge.now`
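For example, assuming `qstat -u $USER` shows your `ipyrad` job with job ID 1234567 (an illustrative number):
```sh
qdel 1234567                                 # (1) kill the ipyrad job
echo kill > stop-ipcluster-profile=sge.now   # (2) tell start-stop-ipcluster.csh to shut everything down
```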
## Before Submitting any Subsequent ipyrad Jobs
1. Please check that you have no leftover `ipcluster` processes running on the login node.
`$ top -u <your_user_name>`
Each user should only have `sshd` and `bash` processes running by default. You will also see an entry for the `top` command you just ran, i.e. something like this:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11010 gonzalez 20 0 16096 2308 848 R 3.7 0.0 0:00.04 top
82511 gonzalez 20 0 101m 1984 956 S 0.0 0.0 0:00.05 sshd
82512 gonzalez 20 0 106m 2004 1468 S 0.0 0.0 0:00.08 bash
```
* You exit `top` by hitting the `q` key.
* If you have other processes running, kill them with `kill -9 <PID>`, where `<PID>` is the process id listed by `top`.
2. Check the queue for `controller` or `engine` jobs, with `qstat` or `q+`.
3. **CAVEAT**: Each user can only have one `ipyrad` instance running at a time.
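To do both checks quickly from the command line (assuming the default job names `controller`, `engine` and `ipyrad`):
```sh
# any leftover ipcluster/ipcontroller processes on this login node?
ps -fu $USER | egrep -i "ipcluster|ipcontroller" | grep -v grep
# any leftover controller, engine or ipyrad jobs still in the queue?
qstat -u $USER | egrep "controller|engine|ipyrad"
```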
## Appendix: Changes made to config files to make them suitable for Hydra
1. Edit `ipcluster_config.py`
```python
#line 127
c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
#line 178
c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
#line 187
c.IPClusterStart.delay = 60.0
#line 388
c.BatchSystemLauncher.queue = u'mThC.q'
```
1. Edit `ipcontroller_config.py`
```python
#Line 239
c.RegistrationFactory.ip = u'*'
#Line 259
c.HubFactory.client_ip = u'*'
```
1. Edit `ipengine_config.py`
```python
#Line 377
c.RegistrationFactory.ip = u'*'
```
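For reference, `config4hydra` creates this profile and its config files for you. If you ever need to regenerate the profile by hand before re-applying the edits above, the standard `IPython` command is:
```sh
# creates ~/.ipython/profile_sge/ with the default ipcluster_config.py,
# ipcontroller_config.py and ipengine_config.py files
ipython profile create sge --parallel
```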