# 11657077 (gomez.124, Problems submitting / running jobs)
## Demian's description of the problem
<pre>
I've been submitting jobs to the scheduler but there seems to be some
problem. First I started with a test script for my system and ended up
submitting a script that only echoes some information. No logs were
generated and the jobs always stay in the queue with status Q:
[gomez.124@unity-1 pg]$ qstat -u gomez.124
unity-1.asc.ohio-state.edu:
                                                                             Req'd  Req'd   Elap
Job ID                  Username    Queue    Jobname          SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1436089.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT      0     4    160    3gb  01:30:00 Q   --
1436099.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT      0     2     32    3gb  00:05:00 C   --
1436104.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT      0     2     32    3gb  00:05:00 C   --
1436105.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT      0     2     32    3gb  00:05:00 C   --
1436106.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT      0     2     32    3gb  00:05:00 Q   --
If I trace the job it seems it wants to start running but something may be failing:
[gomez.124@unity-1 pg]$ tracejob -v 1436106
/var/spool/torque/server_priv/accounting/20191207: Permission denied
/var/spool/torque/server_logs/20191207: Successfully located matching job records
/var/spool/torque/mom_logs/20191207: No such file or directory
/var/spool/torque/sched_logs/20191207: No such file or directory
Job: 1436106.unity-1.asc.ohio-state.edu
12/07/2019 10:07:08.352 S enqueuing into batch, state 1 hop 1
12/07/2019 10:07:34.584 S Job Run at request of root@unity-1.asc.ohio-state.edu
12/07/2019 10:07:34.604 S child reported success for job after 0 seconds (dest=???), rc=0
12/07/2019 10:07:34.604 S Not sending email: User does not want mail of this type.
12/07/2019 10:07:34.663 S obit received - updating final job usage info
12/07/2019 10:07:34.664 S job exit status -3 handled
Again, the script is super simple (it doesn't do anything fancy):
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=16
#PBS -j oe
#PBS -N Parallel_GAMIT
echo " >> Started at " $(date)
headnode=$( hostname | cut -d. -f1 ) # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode)) # Create a variable with the list of worker nodes.
workers=$(echo ${nodes[@]} | tr ' ' ,) # Create a string separated by commas with all the worker nodes.
echo " >> Running on $headnode with workers $workers"
echo " >> End"
One thing that was very strange is that the first time I submitted a job to
the queue I requested abe for emails and I got 100 or so, all with the same
information (when I saw this I killed the job using qdel). At no point
before killing it did the job appear to be running. No logs were found.
Here's one of the emails (got another 99 of the same):
PBS Job Id: 1436088.unity-1.asc.ohio-state.edu
Job Name: Parallel.GAMIT
Exec host: u011.unity/0-39+u071.unity/0-39+u041.unity/0-39+u040.unity/0-39
Begun execution
Error_Path: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel.GAMIT.o1436088
Output_Path: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel.GAMIT.o1436088
My experience with PBS is limited. Am I doing anything wrong?
Thanks,
Demián
</pre>
## Observations
Job 1436249 is active (in some sense) now.
### `qstat` v. `showq`
It shows as queued with `qstat`:
```
$ qstat -u gomez.124
unity-1.asc.ohio-state.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1436249.unity-1.asc.oh gomez.124 batch Parallel_GAMIT 0 2 32 3gb 00:05:00 Q --
```
However `showq` shows it as running; you can run this repeatedly and watch the remaining time dwindle until the start time resets, as though the job is restarting.
```
$ showq -u gomez.124
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
1436249 gomez.12 Running 32 00:04:52 Mon Dec 9 14:30:22
1 active job 32 of 2944 processors in use by local jobs (1.09%)
2 of 82 nodes active (2.44%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total job: 1
```
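A throwaway polling loop (job id hardcoded) makes the churn easy to watch: `REMAINING` counts down for about 30 seconds, then `STARTTIME` jumps forward and the countdown resets.
```
while true; do
    showq -u gomez.124 | grep 1436249   # one line per poll; STARTTIME keeps moving forward
    sleep 10
done
```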
### `tracejob`
`tracejob` shows the job apparently failing and being retried. The exit status -3 is, if I'm reading Torque's MOM exit codes right, `JOB_EXEC_RETRY` ('job execution failed, do retry'), which would explain the loop:
```
$ tracejob -v 1436249 | head -n 15
/var/spool/torque/server_priv/accounting/20191209: Successfully located matching job records
/var/spool/torque/server_logs/20191209: Successfully located matching job records
/var/spool/torque/mom_logs/20191209: No such file or directory
/var/spool/torque/sched_logs/20191209: No such file or directory
Job: 1436249.unity-1.asc.ohio-state.edu
12/09/2019 09:44:37.408 S enqueuing into batch, state 1 hop 1
12/09/2019 09:44:37 A queue=batch
12/09/2019 09:44:58.449 S Job Run at request of root@unity-1.asc.ohio-state.edu
12/09/2019 09:44:58.466 S child reported success for job after 0 seconds (dest=???), rc=0
12/09/2019 09:44:58.467 S preparing to send 'b' mail for job 1436249.unity-1.asc.ohio-state.edu to gomez.124@unity-1.asc.ohio-state.edu (---)
12/09/2019 09:44:58.524 S obit received - updating final job usage info
12/09/2019 09:44:58.524 S job exit status -3 handled
12/09/2019 09:44:58 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=1 start=1575902698 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:45:29 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=2 start=1575902729 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:46:19 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=3 start=1575902779 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:46:50 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=4 start=1575902810 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:47:21 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=5 start=1575902841 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
```
The job seems to restart about every 30 seconds; it has been doing this for about four hours now.
```
12/09/2019 14:33:38 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=511 start=1575920018 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 14:34:09 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=512 start=1575920049 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 14:34:40 A user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=513 start=1575920080 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
```
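The ~30-second figure can be checked directly by diffing successive `start=` epochs for this job in the accounting file (ad-hoc one-liner; the `start=` pattern deliberately does not match `start_count=`):
```
grep 1436249 /var/spool/torque/server_priv/accounting/20191209 \
    | grep -o 'start=[0-9]*' | cut -d= -f2 \
    | awk 'NR > 1 { print $1 - prev } { prev = $1 }'    # gaps between restarts, in seconds
```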
### `checkjob`
```
$ checkjob -v 1436249
job 1436249 (RM job '1436249.unity-1.asc.ohio-state.edu')
AName: Parallel_GAMIT
State: Running
Creds: user:gomez.124 group:ses-users class:batch
WallTime: 00:00:00 of 00:05:00
SubmitTime: Mon Dec 9 09:44:37
(Time Queued Total: 4:51:05 Eligible: 00:00:21)
StartTime: Mon Dec 9 14:35:42
TemplateSets: DEFAULT
Total Requested Tasks: 32
Req[0] TaskCount: 32 Partition: UNITY-1
Dedicated Resources Per Task: PROCS: 1 MEM: 96M
TasksPerNode: 16
Allocated Nodes:
[u019.unity:16][u082.unity:16]
SystemID: Moab
SystemJID: 1436249
Notification Events: JobStart,JobEnd,JobFail Notification Address: gomez.124@osu.edu
Task Distribution: u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,...
IWD: /home/gomez.124/pg
UMask: 0000
OutputFile: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel_GAMIT.o1436249
ErrorFile: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel_GAMIT.e1436249
StartCount: 515
Partition List: UNITY-1
SrcRM: UNITY-1 DstRM: UNITY-1 DstRMJID: 1436249.unity-1.asc.ohio-state.edu
Submit Args: submit_task.sh
Flags: BACKFILL,RESTARTABLE
Attr: BACKFILL,checkpoint
StartPriority: 1
IterationJobRank: 0
PE: 32.00
Reservation '1436249' (-00:00:21 -> 00:04:39 Duration: 00:05:00)
```
### Logs
In `/var/spool/torque/server_priv/accounting/20191209`, I see the job start about every 30 seconds (the `start_count` field climbs in step with `checkjob`'s `StartCount`); it never ends.
### Output/error files
Neither the output file nor the error file is ever created:
```
$ sudo ls -la /fs/project/gomez.124/logs/Parallel_GAMIT.o1436249
ls: cannot access /fs/project/gomez.124/logs/Parallel_GAMIT.o1436249: No such file or directory
$ sudo ls -la /fs/project/gomez.124/logs/Parallel_GAMIT.e1436249
ls: cannot access /fs/project/gomez.124/logs/Parallel_GAMIT.e1436249: No such file or directory
```
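While a Torque job runs, stdout/stderr normally accumulate in the MOM's spool directory on the mother-superior node and are only copied to their final destination at job exit. So, assuming the default `/var/spool/torque/spool` layout, the in-flight files should be visible there (node name from the job's `exec_host`):
```
ssh u019 'ls -l /var/spool/torque/spool/ | grep 1436249'
```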
## Job submission script
```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=16
#PBS -o /fs/project/gomez.124/logs
#PBS -j oe
#PBS -m abe
#PBS -M gomez.124@osu.edu
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
echo " >> Started at " $(date)
headnode=$( hostname | cut -d. -f1 ) # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode)) # Create a variable with the list of worker nodes.
workers=$(echo ${nodes[@]} | tr ' ' ,) # Create a string separated by commas with all the worker nodes.
echo " >> Running on $headnode with workers $workers"
echo " >> End"
```
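For reference, on a healthy two-node allocation the body of this script should print something like the following (node names illustrative, taken from the `exec_host` seen above):
```
 >> Started at  <date>
 >> Running on u019 with workers u082.unity
 >> End
```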
## Run as me
Deleted the lines in the script that direct output to Demian's area and email him, then submitted it myself.
It runs fine, except that `$workers` comes out empty:
```
>> Started at Mon Dec 9 14:47:22 EST 2019
>> Running on u001 with workers
>> End
```
Everything in `qstat`, `showq`, and `tracejob` looks fine.
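The empty `$workers` has a mundane explanation: if the whole allocation happens to land on one host (here u001), `grep -v $headnode` filters out every line of `$PBS_NODEFILE`. A standalone reproduction with a fabricated node list:
```
headnode=u001                                  # pretend `hostname | cut -d. -f1` gave u001
printf 'u001.unity\nu001.unity\n' > nodefile   # fake single-host $PBS_NODEFILE
nodes=($( cat nodefile | sort | uniq | grep -v $headnode ))
workers=$(echo ${nodes[@]} | tr ' ' ,)
echo "workers='$workers'"                      # prints: workers=''
```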
### Update 12/16/19
Finally got this to fail for me. Edited the PBS script to request 4 nodes with 24 cores per node (this should force the allocation to span at least two physical nodes). Also simplified how the list of nodes is found.
Script:
```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -m abe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
echo " >> Started at " $(date)

#headnode=$( hostname | cut -d. -f1 ) # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq)) # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode)) # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,) # Create a string separated by commas with all the worker nodes.

#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""

#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"

echo " >> nodes: $nodes"

echo " >> End"
```
Job id is 1436948; some documentation:
`checkjob`
```
$ checkjob -v 1436948
job 1436948 (RM job '1436948.unity-1.asc.ohio-state.edu')
AName: Parallel_GAMIT
State: Running
Creds: user:shew.1 group:research-asctech-rcs class:batch
WallTime: 00:00:00 of 00:05:00
BecameEligible: Fri Dec 13 14:52:35
SubmitTime: Fri Dec 13 14:51:39
(Time Queued Total: 2:19:20:49 Eligible: 2:13:57)
StartTime: Mon Dec 16 10:12:28
TemplateSets: DEFAULT
Total Requested Tasks: 96
Req[0] TaskCount: 96 Partition: UNITY-1
Dedicated Resources Per Task: PROCS: 1 MEM: 32M
TasksPerNode: 24
Allocated Nodes:
u001.unity*48:u072.unity*24:u073.unity*24
SystemID: Moab
SystemJID: 1436948
Notification Events: JobStart,JobEnd,JobFail
Task Distribution: u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,...
IWD: /home/shew.1/cur-support/ticket-11657077-gomez
UMask: 0000
OutputFile: unity-1.asc.ohio-state.edu:/home/shew.1/cur-support/ticket-11657077-gomez/Parallel_GAMIT.o1436948
ErrorFile: unity-1.asc.ohio-state.edu:/home/shew.1/cur-support/ticket-11657077-gomez/Parallel_GAMIT.e1436948
StartCount: 7576
BypassCount: 10
Partition List: UNITY-1
SrcRM: UNITY-1 DstRM: UNITY-1 DstRMJID: 1436948.unity-1.asc.ohio-state.edu
Submit Args: submit_task.sh
Flags: BACKFILL,RESTARTABLE
Attr: BACKFILL,checkpoint
StartPriority: 133
IterationJobRank: 0
PE: 96.00
Reservation '1436948' (-00:00:02 -> 00:04:58 Duration: 00:05:00)
```
### Update 12/17/19
Stripping the PBS script down. Everything fails in the same way.
Using no environment variables fails:
```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
#echo " >> Started at " $(date)
echo " >> Started"
#headnode=$( hostname | cut -d. -f1 ) # Create a variable containing the name of the headnode
#nodes=($( cat $PBS_NODEFILE | sort | uniq)) # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode)) # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,) # Create a string separated by commas with all the worker nodes.
#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""
#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"
#echo " >> nodes: $nodes"
echo " >> End"
```
Doing nothing fails:
```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
#echo " >> Started at " $(date)
#echo " >> Started"
#headnode=$( hostname | cut -d. -f1 ) # Create a variable containing the name of the headnode
#nodes=($( cat $PBS_NODEFILE | sort | uniq)) # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode)) # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,) # Create a string separated by commas with all the worker nodes.
#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""
#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"
#echo " >> nodes: $nodes"
#echo " >> End"
```
Really doing nothing fails:
```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
```
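Each stripped-down variant is exercised the same way, and each shows the same signature in the logs (commands for the record; job id is whatever `qsub` returns):
```
qsub submit_task.sh               # prints the new job id
tracejob -v <jobid> | tail -n 5   # same tail every time: rc=0, then 'job exit status -3 handled'
```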
One of my old scripts, adapted to use multiple nodes, fails too (per the commented-out directives, it used to run single-node):
```
#!/usr/bin/env bash
#PBS -l walltime=00:10:00
## #PBS -l nodes=u029.unity:ppn=1
## #PBS -l nodes=1:ppn=1
#PBS -l nodes=4:ppn=24
#PBS -N five-minute-job
#PBS -j oe
#PBS -m abe
#PBS -M shew.1@osu.edu
cd $PBS_O_WORKDIR
echo "Going to sleep: `date`"
for i in `seq 1 10`;
do
sleep 30
echo "--just slept 30 seconds--iteration $i at `date`"
done
echo "Awake: `date`"
```
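For contrast, the obvious control is the same script with the old single-node request restored, which per the commented-out directives is the configuration that used to run:
```
#!/usr/bin/env bash
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1            # single-node request restored; body unchanged
#PBS -N five-minute-job
#PBS -j oe
cd $PBS_O_WORKDIR
echo "Going to sleep: `date`"
for i in `seq 1 10`; do
    sleep 30
    echo "--just slept 30 seconds--iteration $i at `date`"
done
echo "Awake: `date`"
```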