# 11657077 (gomez.124, Problems submitting / running jobs)

## Demián's description of the problem

<pre>
I've been submitting jobs to the scheduler but there seems to be some problem.
First I started with a test script for my system and ended up submitting a
script that only echoes some information. No logs were generated and the jobs
always stay in the queue with status Q:

[gomez.124@unity-1 pg]$ qstat -u gomez.124

unity-1.asc.ohio-state.edu:
                                                                                  Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK  Memory     Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1436089.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     4    160       3gb  01:30:00 Q        --
1436099.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     2     32       3gb  00:05:00 C        --
1436104.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     2     32       3gb  00:05:00 C        --
1436105.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     2     32       3gb  00:05:00 C        --
1436106.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     2     32       3gb  00:05:00 Q        --

If I trace the job it seems it wants to start running but something may be failing:

[gomez.124@unity-1 pg]$ tracejob -v 1436106
/var/spool/torque/server_priv/accounting/20191207: Permission denied
/var/spool/torque/server_logs/20191207: Successfully located matching job records
/var/spool/torque/mom_logs/20191207: No such file or directory
/var/spool/torque/sched_logs/20191207: No such file or directory

Job: 1436106.unity-1.asc.ohio-state.edu

12/07/2019 10:07:08.352 S    enqueuing into batch, state 1 hop 1
12/07/2019 10:07:34.584 S    Job Run at request of root@unity-1.asc.ohio-state.edu
12/07/2019 10:07:34.604 S    child reported success for job after 0 seconds (dest=???), rc=0
12/07/2019 10:07:34.604 S    Not sending email: User does not want mail of this type.
12/07/2019 10:07:34.663 S    obit received - updating final job usage info
12/07/2019 10:07:34.664 S    job exit status -3 handled

Again, the script is super simple (it doesn't do anything fancy):

#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=16
#PBS -j oe
#PBS -N Parallel_GAMIT

echo " >> Started at " $(date)

headnode=$( hostname | cut -d. -f1 )                             # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode))  # Create a variable with the list of worker nodes.
workers=$(echo ${nodes[@]} | tr ' ' ,)                           # Create a string separated by commas with all the worker nodes.

echo " >> Running on $headnode with workers $workers"
echo " >> End"

One thing that was very strange is that the first time I submitted a job to
the queue I requested abe for emails and I got 100 or so, all with the same
information (when I saw this I killed the job using qdel). At no point before
I killed it did the job appear to be running. No logs were found. Here's one
of the emails (got another 99 of the same):

PBS Job Id: 1436088.unity-1.asc.ohio-state.edu
Job Name:   Parallel.GAMIT
Exec host:  u011.unity/0-39+u071.unity/0-39+u041.unity/0-39+u040.unity/0-39
Begun execution
Error_Path: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parall
	el.GAMIT.o1436088
Output_Path: unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parall
	el.GAMIT.o1436088

My experience with PBS is limited. Am I doing anything wrong?

Thanks,
Demián
</pre>
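Side note: the worker-list logic in that script can be exercised outside PBS by pointing it at a fake nodefile (a sketch; the hostnames and the pretend headnode are made up):

```
#!/usr/bin/env bash
# Sketch: drive the worker-list logic with a fabricated $PBS_NODEFILE.
PBS_NODEFILE=$(mktemp)
printf '%s\n' u011 u011 u071 u041 u040 > "$PBS_NODEFILE"    # hypothetical contents

headnode=u011                                               # pretend we are on u011
nodes=($(sort -u "$PBS_NODEFILE" | grep -v "$headnode"))    # drop the headnode
workers=$(echo "${nodes[@]}" | tr ' ' ,)                    # comma-separated worker list
echo "workers: $workers"                                    # -> workers: u040,u041,u071
rm -f "$PBS_NODEFILE"
```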
## Observations

Job 1436249 is active (in some sense) now.

### `qstat` vs. `showq`

The job shows as queued with `qstat`:

```
$ qstat -u gomez.124

unity-1.asc.ohio-state.edu:
                                                                                  Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK  Memory     Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1436249.unity-1.asc.oh  gomez.124   batch    Parallel_GAMIT        0     2     32       3gb  00:05:00 Q        --
```

However, `showq` shows it as running; run it repeatedly and you can watch the remaining time dwindle until the start time resets, as though the job is restarting.

```
$ showq -u gomez.124

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

1436249            gomez.12    Running    32    00:04:52  Mon Dec  9 14:30:22

1 active job              32 of 2944 processors in use by local jobs (1.09%)
                            2 of 82 nodes active      (2.44%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 blocked jobs

Total job: 1
```
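`qstat` reports pbs_server's view of the job while `showq` reports Moab's, so the two can disagree. The state fields can be compared directly (a sketch):

```
$ qstat -f 1436249 | grep job_state     # pbs_server's view: expect "job_state = Q"
$ checkjob 1436249 | grep '^State'      # Moab's view: expect "State: Running"
```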
### `tracejob`

`tracejob` shows the job apparently failing and retrying (exit status -3):

```
$ tracejob -v 1436249 | head -n 15
/var/spool/torque/server_priv/accounting/20191209: Successfully located matching job records
/var/spool/torque/server_logs/20191209: Successfully located matching job records
/var/spool/torque/mom_logs/20191209: No such file or directory
/var/spool/torque/sched_logs/20191209: No such file or directory

Job: 1436249.unity-1.asc.ohio-state.edu

12/09/2019 09:44:37.408 S    enqueuing into batch, state 1 hop 1
12/09/2019 09:44:37     A    queue=batch
12/09/2019 09:44:58.449 S    Job Run at request of root@unity-1.asc.ohio-state.edu
12/09/2019 09:44:58.466 S    child reported success for job after 0 seconds (dest=???), rc=0
12/09/2019 09:44:58.467 S    preparing to send 'b' mail for job 1436249.unity-1.asc.ohio-state.edu to gomez.124@unity-1.asc.ohio-state.edu (---)
12/09/2019 09:44:58.524 S    obit received - updating final job usage info
12/09/2019 09:44:58.524 S    job exit status -3 handled
12/09/2019 09:44:58     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=1 start=1575902698 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:45:29     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=2 start=1575902729 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:46:19     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=3 start=1575902779 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:46:50     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=4 start=1575902810 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 09:47:21     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=5 start=1575902841 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
```

It seems to restart about every 30 seconds, and has been doing so for about four hours now:

```
12/09/2019 14:33:38     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=511 start=1575920018 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 14:34:09     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=512 start=1575920049 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
12/09/2019 14:34:40     A    user=gomez.124 group=ses-users jobname=Parallel_GAMIT queue=batch ctime=1575902677 qtime=1575902677 etime=1575902677 start_count=513 start=1575920080 owner=gomez.124@unity-1.asc.ohio-state.edu exec_host=u019.unity/4-19+u082.unity/12-27 Resource_List.walltime=00:05:00 Resource_List.nodes=2:ppn=16 Resource_List.nodect=2 Resource_List.neednodes=2:ppn=16 Resource_List.mem=3gb
```
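The latest attempt number can be pulled straight from the accounting records (a sketch; assumes read access to the file):

```
$ grep 1436249 /var/spool/torque/server_priv/accounting/20191209 \
    | grep -o 'start_count=[0-9]*' | tail -n 1
```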
### `checkjob`

```
$ checkjob -v 1436249

job 1436249 (RM job '1436249.unity-1.asc.ohio-state.edu')

AName: Parallel_GAMIT
State: Running
Creds:  user:gomez.124  group:ses-users  class:batch
WallTime:   00:00:00 of 00:05:00
SubmitTime: Mon Dec  9 09:44:37
  (Time Queued  Total: 4:51:05  Eligible: 00:00:21)

StartTime: Mon Dec  9 14:35:42
TemplateSets:  DEFAULT
Total Requested Tasks: 32

Req[0]  TaskCount: 32  Partition: UNITY-1
Dedicated Resources Per Task: PROCS: 1  MEM: 96M
TasksPerNode: 16

Allocated Nodes:
[u019.unity:16][u082.unity:16]

SystemID:   Moab
SystemJID:  1436249
Notification Events: JobStart,JobEnd,JobFail  Notification Address: gomez.124@osu.edu
Task Distribution: u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,u019.unity,...

IWD:            /home/gomez.124/pg
UMask:          0000
OutputFile:     unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel_GAMIT.o1436249
ErrorFile:      unity-1.asc.ohio-state.edu:/fs/project/gomez.124/logs/Parallel_GAMIT.e1436249
StartCount:     515
Partition List: UNITY-1
SrcRM:          UNITY-1  DstRM: UNITY-1  DstRMJID: 1436249.unity-1.asc.ohio-state.edu
Submit Args:    submit_task.sh
Flags:          BACKFILL,RESTARTABLE
Attr:           BACKFILL,checkpoint
StartPriority:  1
IterationJobRank: 0
PE: 32.00
Reservation '1436249' (-00:00:21 -> 00:04:39  Duration: 00:05:00)
```

### Logs

In `/var/spool/torque/server_priv/accounting/20191209`, I see the job start about every 30 seconds; it never ends.

### output/error files

Neither the output file nor the error file is created:

```
$ sudo ls -la /fs/project/gomez.124/logs/Parallel_GAMIT.o1436249
ls: cannot access /fs/project/gomez.124/logs/Parallel_GAMIT.o1436249: No such file or directory
$ sudo ls -la /fs/project/gomez.124/logs/Parallel_GAMIT.e1436249
ls: cannot access /fs/project/gomez.124/logs/Parallel_GAMIT.e1436249: No such file or directory
```
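Exit status -3 is Torque's `JOB_EXEC_RETRY` (execution failed in a way the server considers retryable), and the job carries the `RESTARTABLE` flag, which would explain the endless relaunch loop (my reading of the Torque exit codes, not confirmed on this system). If the script ever actually started, its in-flight stdout/stderr should appear in the mom's spool on the mother-superior node; a sketch, assuming default Torque build paths:

```
$ ssh u019 ls /var/spool/torque/spool/ | grep 1436249      # in-flight output
$ ls /var/spool/torque/undelivered/ | grep 1436249         # failed copy-back
```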
## Job submission script

```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=16
#PBS -o /fs/project/gomez.124/logs
#PBS -j oe
#PBS -m abe
#PBS -M gomez.124@osu.edu
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
echo " >> Started at " $(date)

headnode=$( hostname | cut -d. -f1 )                             # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode))  # Create a variable with the list of worker nodes.
workers=$(echo ${nodes[@]} | tr ' ' ,)                           # Create a string separated by commas with all the worker nodes.

echo " >> Running on $headnode with workers $workers"
echo " >> End"
```

## Run as me

Deleted the lines in the script that send output to Demián's area and email to him, then submitted it myself. It runs fine, except that `$workers` is never set:

```
 >> Started at  Mon Dec  9 14:47:22 EST 2019
 >> Running on u001 with workers
 >> End
```

Everything in `qstat`, `showq`, and `tracejob` looks fine.
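A guess at why `$workers` comes out empty (untested): if the whole allocation lands on one physical host, `grep -v $headnode` filters out every nodefile entry. Dumping both sides of the comparison from inside the job would confirm it (a sketch):

```
# Sketch: show what the filter is actually comparing (add to the job script)
echo "headnode: $(hostname | cut -d. -f1)"   # e.g. u001
sort -u "$PBS_NODEFILE"                      # unique hosts actually allocated
```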
### Update 12/16/19

Finally got this to fail for me. I edited the PBS script to request 4 nodes with 24 cores per node (which should force the job onto at least two different physical nodes) and simplified how the allocated nodes are found. Script:

```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -m abe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
echo " >> Started at " $(date)

#headnode=$( hostname | cut -d. -f1 )                             # Create a variable containing the name of the headnode
nodes=($( cat $PBS_NODEFILE | sort | uniq))                       # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode))  # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,)                           # Create a string separated by commas with all the worker nodes.

#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""

#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"

echo " >> nodes: $nodes"

echo " >> End"
```

Job id is 1436948; some documentation from `checkjob`:

```
$ checkjob -v 1436948

job 1436948 (RM job '1436948.unity-1.asc.ohio-state.edu')

AName: Parallel_GAMIT
State: Running
Creds:  user:shew.1  group:research-asctech-rcs  class:batch
WallTime:   00:00:00 of 00:05:00
BecameEligible: Fri Dec 13 14:52:35
SubmitTime: Fri Dec 13 14:51:39
  (Time Queued  Total: 2:19:20:49  Eligible: 2:13:57)

StartTime: Mon Dec 16 10:12:28
TemplateSets:  DEFAULT
Total Requested Tasks: 96

Req[0]  TaskCount: 96  Partition: UNITY-1
Dedicated Resources Per Task: PROCS: 1  MEM: 32M
TasksPerNode: 24

Allocated Nodes:
u001.unity*48:u072.unity*24:u073.unity*24

SystemID:   Moab
SystemJID:  1436948
Notification Events: JobStart,JobEnd,JobFail
Task Distribution: u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,u073.unity,...

IWD:            /home/shew.1/cur-support/ticket-11657077-gomez
UMask:          0000
OutputFile:     unity-1.asc.ohio-state.edu:/home/shew.1/cur-support/ticket-11657077-gomez/Parallel_GAMIT.o1436948
ErrorFile:      unity-1.asc.ohio-state.edu:/home/shew.1/cur-support/ticket-11657077-gomez/Parallel_GAMIT.e1436948
StartCount:     7576
BypassCount:    10
Partition List: UNITY-1
SrcRM:          UNITY-1  DstRM: UNITY-1  DstRMJID: 1436948.unity-1.asc.ohio-state.edu
Submit Args:    submit_task.sh
Flags:          BACKFILL,RESTARTABLE
Attr:           BACKFILL,checkpoint
StartPriority:  133
IterationJobRank: 0
PE: 96.00
Reservation '1436948' (-00:00:02 -> 00:04:58  Duration: 00:05:00)
```

### Update 12/17/19

Stripping the PBS script down: everything fails in the same way. With no use of environment variables, it fails:

```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
#echo " >> Started at " $(date)
echo " >> Started"

#headnode=$( hostname | cut -d. -f1 )                             # Create a variable containing the name of the headnode
#nodes=($( cat $PBS_NODEFILE | sort | uniq))                      # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode))  # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,)                           # Create a string separated by commas with all the worker nodes.

#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""

#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"

#echo " >> nodes: $nodes"

echo " >> End"
```

Doing nothing fails:

```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
# ----------------------------------------------------------------------------------------------------------------------
# Do not change anything above this line unless you're sure of what you're doing.
# -------------------------------------------------------------------------------
# Current cluster settings: limit the time on the cluster to 5 minutes, use 2 nodes with 1 cpu per node and set up a
# shared parallel file system, the project number, submit to the debug queue, combine standard error and standard out
# into one file.
# ----------------------------------------------------------------------------------------------------------------------
# Most of the commands in the script are straight forward Linux commands, here is a brief description of the ones that
# aren't:
# module: Deals with the various tools available on the HPC.
# pbsdcp: Distributes or collects folders on the cluster.
# ----------------------------------------------------------------------------------------------------------------------
#echo " >> Started at " $(date)
#echo " >> Started"

#headnode=$( hostname | cut -d. -f1 )                             # Create a variable containing the name of the headnode
#nodes=($( cat $PBS_NODEFILE | sort | uniq))                      # Create a variable with the list of worker nodes.
#nodes=($( cat $PBS_NODEFILE | sort | uniq | grep -v $headnode))  # Create a variable with the list of worker nodes.
#workers=$(echo ${nodes[@]} | tr ' ' ,)                           # Create a string separated by commas with all the worker nodes.

#my_nodes=$( cat $PBS_NODEFILE)
#my_workers=""

#cat $PBS_NODEFILE
#echo " >> Running on $headnode with workers $workers"
#echo " >> Or, Running on $headnode with workers $my_workers"

#echo " >> nodes: $nodes"

#echo " >> End"
```

Really doing nothing fails:

```
#!/usr/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=24
#PBS -j oe
#PBS -N Parallel_GAMIT
```

One of my old scripts, adapted to use multiple nodes, fails:

```
#!/usr/bin/env bash
#PBS -l walltime=00:10:00
## #PBS -l nodes=u029.unity:ppn=1
## #PBS -l nodes=1:ppn=1
#PBS -l nodes=4:ppn=24
#PBS -N five-minute-job
#PBS -j oe
#PBS -m abe
#PBS -M shew.1@osu.edu

cd $PBS_O_WORKDIR

echo "Going to sleep: `date`"
for i in `seq 1 10`; do
    sleep 30
    echo "--just slept 30 seconds--iteration $i at `date`"
done
echo "Awake: `date`"
```
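For completeness, the variants above can be submitted in one sweep so each job id stays associated with the right script (a sketch; the filenames are hypothetical):

```
# Sketch: submit each test variant and record which job id belongs to which script
for s in no_env.sh do_nothing.sh bare.sh sleep_loop.sh; do
    echo "$s -> $(qsub "$s")"    # qsub prints the new job id on stdout
done
```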