Fluent is known to start MPI jobs from within its own fluent wrapper script.
Usually, passwordless ssh to the allocated nodes is needed before any batch script can be submitted. The following three lines are sufficient to ensure you have this set up:
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -f ~/.ssh/id_rsa -q -N ""
test -f ~/.ssh/authorized_keys || cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
grep -qf ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys || cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
To check whether passwordless ssh works, allocate two nodes and log in to one of them within the next 5 minutes:
[leon@viz ~]$ salloc --nodes=2 --partition=rome,haswell --time=5:00 --mem=0
salloc: Granted job allocation 59168
salloc: Waiting for resource configuration
salloc: Nodes cn[41-42] are ready for job
[leon@viz ~]$ ssh cn42
Warning: Permanently added 'cn42,10.0.2.142' (ECDSA) to the list of known hosts.
Last login: Mon Nov 29 10:54:26 2021
[leon@cn42 ~]$ exit
logout
Connection to cn42 closed.
[leon@viz ~]$ exit
exit
salloc: Relinquishing job allocation 59168
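For a quick non-interactive check from within such an allocation, one can also use ssh in batch mode, which fails instead of prompting for a password. A minimal sketch (cn42 is just the node name from the allocation above):
# BatchMode=yes makes ssh fail instead of asking for a password,
# so a zero exit status means key-based login works.
ssh -o BatchMode=yes cn42 true && echo "passwordless ssh OK"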
The following sbatch example comprises these features:
- sbatch my-lam8.sbatch 3 will resubmit itself 2 times (18 hours in total).
- squeue: the job name 2*lam8 means that there is one remaining job to be resubmitted after this job completes (see the example below).
- my-lam8.jou: the starting journal file.
- ntasks: required. It is suggested to select a multiple of 48 on the rome partition.
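While the chain is running, the remaining resubmission count can be read directly from the job name. A hedged sketch using standard squeue format fields (job id, name, state, elapsed time):
# List your own jobs; a name such as 2*lam8 means one more
# resubmission will follow after the current job completes.
squeue --user=$USER --format="%.10i %.12j %.8T %.10M"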
[leon@viz lam8]$ cat my-lam8.sbatch
#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=
#SBATCH --partition rome
#SBATCH --ntasks=96 # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48
#SBATCH --job-name=lam8
#SBATCH --mem=120G
#SBATCH --time=6:00:00 # time limit days-hh:mm:ss
#SBATCH --signal=B:USR1@120
trap 'touch exit-fluent' USR1
module purge
module load ANSYS/20.1
if test -f '#restart.inp'
then journal=restart.jou
mv '#restart.inp' $journal
else
journal=my-lam8.jou
rm -f *.log *.trn *-[0-9]*.{cas,dat.h5,cas.h5,cdat} cleanup*.sh
fi
grep -q checkpoint/exit-filename $journal || sed -i \
-e '1i(set! checkpoint/check-filename "./check-fluent")' \
-e '1i(set! checkpoint/exit-filename "./exit-fluent")' $journal
fluent 3ddp -g -slurm -t${SLURM_NTASKS} -mpi=ibmmpi -pib -i $journal &
wait; kill -0 $! 2> /dev/null; process_status=$?; wait
if test $process_status = 0 -a $# = 1 -a 0$1 -gt 1 -a -f '#restart.inp'
then b=/tmp/${SLURM_JOB_ID}.sbatch
sed /job-name[=]/s/=[0-9]*[*]*/=$(($1-1))*/ $0>$b
sbatch $b $(($1-1)) && rm -f $b
else
echo "Fluent job(s) finished" ; rm -f restart.jou
fi
Line 11 specifies that the batch script should receive a user signal 120 seconds before the job would otherwise be terminated for reaching the specified --time limit.
Line 13 traps that signal and executes a single command that creates the exit-fluent file when SLURM signals the batch script. This file is specified at the beginning of the Fluent journal file my-lam8.jou and causes Fluent to stop all processes and write a #restart.inp file that serves as journal input for the next run.
[leon@viz lam8]$ head -3 my-lam8.jou
; Scheme commands to specify the check pointing and emergency exit commands:
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")
One can also use touch check-fluent in the working directory to force Fluent to save the files while it is running.
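A minimal sketch, assuming the job's working directory is the one containing the journal and the checkpoint settings shown above:
cd /path/to/run-directory    # hypothetical path of the running Fluent job
touch check-fluent           # ask Fluent to write intermediate case and data files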
Line 18 tests for the presence of the Fluent restart file (#restart.inp); if it exists, the journal file (restart.jou) is selected in line 19 and created in line 20. If no #restart.inp has been created, we can only start from the initial my-lam8.jou journal file set in line 22. Before we can safely start from scratch we need to carefully clean up the old log, transcript, and timestep files that could interfere with a clean start and with restarts. See man 7 glob for more info on the patterns used for removing files.
In line 25 we check for the presence of the checkpoint settings (exit-filename) in the journal file; if such a line is missing, the Scheme set! commands are added at the beginning of the journal file. Therefore, you do not need to add these commands to your journal file yourself. The other reason is that #restart.inp is missing them too, and we always add these lines to allow multiple restarts.
Line 29 starts Fluent in the background with an ampersand (&) at the end of the line. This is necessary to allow the sbatch script to catch the USR1 signal.
The first wait in line 30 is interrupted by the USR1 signal and executes the touch from line 13. The second wait waits for Fluent to read the exit-fluent file and gracefully save everything. If Fluent finishes regularly, both waits return and there is no deadlock. The harmless kill between the waits probes whether Fluent is still running in the background. Its exit status will be zero if the user signal was received while waiting, meaning that the time limit is approaching and the exit-fluent file was created. If Fluent stops normally after reaching the specified number of iterations, some other convergence criterion, or any failure in journal processing, the kill exit status will be 1, meaning there is no Fluent run to resubmit.
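The same trap/wait/kill/wait pattern can be tried in isolation, with a plain sleep standing in for Fluent and a manually sent signal standing in for SLURM's --signal delivery. A minimal sketch, not part of the production script:
#!/bin/bash
# Stand-alone illustration of the signal handling used in the sbatch script.
trap 'touch exit-fluent' USR1      # same trap as in the script above
sleep 600 &                        # stand-in for the background fluent process
wait                               # returns early if USR1 arrives
kill -0 $! 2> /dev/null            # exit status 0: background job still running
process_status=$?
wait                               # let the background job finish
echo "kill -0 exit status: $process_status"
Sending kill -USR1 <pid> to this script from another shell mimics the signal that SLURM delivers two minutes before the time limit.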
The if test that follows checks the Fluent status, the number of command-line arguments, and the presence of the Fluent #restart.inp file, and uses the first argument as the number of repeats, which is decremented by one at each sbatch resubmit.
In line 34 we create a new sbatch script (named in line 33) that slightly updates the --job-name=lam8 line with the stream editor, adding a number* prefix to the name.
After the batch is submitted in line 35, the temporary script is removed as it is no longer needed. If there are no repeats left and everything is OK, the (last) script echoes in line 37 that the Fluent jobs are finished and removes the remaining restart journal file.
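The effect of that stream editor substitution can be checked on its own. A small sketch (the here-string is a hypothetical job-name line; the literal 2 plays the role of the decremented repeat count $(($1-1))):
# '=lam8' becomes '=2*lam8'; applied again with count 1,
# '=2*lam8' would become '=1*lam8'.
sed '/job-name[=]/s/=[0-9]*[*]*/=2*/' <<< '#SBATCH --job-name=lam8'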
IMPORTANT! The sbatch script removes all cas, dat and cdat files left over from previous restarts. If you want to keep files in these formats, rewrite line 22 to rm -f *.log *.trn cleanup*.sh.
We recommend using the file extension .sbatch for sbatch-aware scripts, although using .sh or another convention does not affect the sbatch submission.
Running any ANSYS/21.1 GUI or solver on rome nodes fails with
/opt/pkg/software/ANSYS/21.1/v211/licensingclient/linx64/ansyscl: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
To overcome this problem one can preload the correct Kerberos 5 library by adding
export LD_PRELOAD=/opt/pkg/software/ANSYS/21.1/compatibility_fix/libk5crypto.so.3
just after module load ANSYS/21.1. Alternatively, LD_LIBRARY_PATH can be prepended.
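A sketch of that alternative, assuming the same compatibility_fix directory as in the LD_PRELOAD line above:
module load ANSYS/21.1
# Prepend the directory with the fixed libk5crypto.so.3 so it is found first.
export LD_LIBRARY_PATH=/opt/pkg/software/ANSYS/21.1/compatibility_fix:${LD_LIBRARY_PATH}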
Newer versions of ANSYS (2021 R2) are RHEL 8.4 compatible and the above fix is not needed anymore. However, it was observed that the following preload is needed for the ANSYS/2021R2 module:
export LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so:${LD_PRELOAD}
OpenMPI 4.x is included within Fluent and uses the UCX communicator as the preferred fabric for InfiniBand.
#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so
#SBATCH --partition rome
#SBATCH --ntasks=96 # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48
#SBATCH --job-name=lam8r2
#SBATCH --mem=120G
#SBATCH --time=4:00:00 # time limit days-hh:mm:ss
#SBATCH --signal=B:USR1@120
trap 'touch exit-fluent' USR1
module purge
module load ANSYS/2021R2
if test -f '#restart.inp'
then journal=restart.jou
mv '#restart.inp' $journal
else
journal=my-lam8.jou
rm -f *.log *.trn *-[0-9]*.{cas,dat.h5,cas.h5,cdat} cleanup*.sh
fi
grep -q checkpoint/exit-filename $journal || sed -i \
-e '1i(set! checkpoint/check-filename "./check-fluent")' \
-e '1i(set! checkpoint/exit-filename "./exit-fluent")' $journal
NODEFILE=${SLURM_SUBMIT_DIR}/slurmhosts.${SLURM_JOB_ID}.txt
scontrol show hostname ${SLURM_NODELIST} > ${NODEFILE}
fluent 3ddp -g -slurm -t ${SLURM_NTASKS} -mpi=openmpi -cnf=${NODEFILE} -pib -i $journal &
wait; kill -0 $! 2> /dev/null; process_status=$?; wait
if test $process_status = 0 -a $# = 1 -a 0$1 -gt 1 -a -f '#restart.inp'
then b=/tmp/${SLURM_JOB_ID}.sbatch
sed /job-name[=]/s/=[0-9]*[*]*/=$(($1-1))*/ $0>$b
sbatch $b $(($1-1)) && rm -f $b
else
echo "Fluent job(s) finished" ; rm -f restart.jou
fi
Instead of -mpi=openmpi one can use -mpi=intel2019 or simply -mpi=intel.
Note that -mpi=ibmmpi is deprecated and actually starts Intel MPI, which requires -cnf=${NODEFILE} in any case.
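For example, the launch line from the script above could be rewritten as follows (a sketch only; ${NODEFILE} and $journal are the variables already defined in the script):
# Same launch as above, but with Intel MPI selected explicitly.
fluent 3ddp -g -slurm -t ${SLURM_NTASKS} -mpi=intel -cnf=${NODEFILE} -pib -i $journal &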
IMPORTANT! The sbatch script removes all cas, dat and cdat files left over from previous restarts. If you want to keep files in these formats, rewrite line 22 to rm -f *.log *.trn cleanup*.sh.
To restart an ANSYS Fluent simulation and resubmit your case to the queue after the time limit, run your batch script with: sbatch my-lam8.sbatch 3
For more information on how to resubmit a job, see the section 'ANSYS Fluent 20.1 with checkpointing and resubmit'.
Depending on whether you are running a steady or an unsteady/transient simulation, the starting journal file my-lam8.jou should be written accordingly.
For a steady simulation, my-lam8.jou reads the case and data files (lines 4 and 5) and starts 20000 iterations.
[kkovacic@viz]$ cat my-lam8.jou
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")
file/read-case G40_upper_chamber.cas.h5
file/read-data G40_upper_chamber.dat.h5
solve/monitors/residual/print yes
solve/set/flow-warnings no
solve/iterate 20000
;parallel/timer usage
;file/stop-transcript
exit ok
When the time limit is reached, #restart.inp is created and later turned into the restart.jou journal file. #restart.inp contains the number of remaining iterations. For example, after the third restart, the saved case G40_upper_chamber.cas0080.cas0144.cas0205.cas will be started with the remaining 19795 iterations (it 19795) until the time limit is reached again and a new case with another cas suffix is saved for an additional restart.
Example of #restart.inp:
[kkovacic@viz]$ cat \#restart.inp
rc G40_upper_chamber.cas0080.cas0144.cas0205.cas
rd G40_upper_chamber.cas0080.cas0144.cas0205.dat
it 19795
For a transient simulation, solve/dti 12345 30 (line 21) sets 12345 time steps to calculate and a maximum of 30 iterations per time step.
[kkovacic@viz]$ cat my-lam8.jou
; Scheme commands to specify the check pointing and emergency exit commands:
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")
; Reading initial case and data file
file/read-case G25L500.cas.h5
file/read-data G25L500.dat.h5
; Printing residuals
solve/monitors/residual/print yes
solve/set/flow-warnings no
; Reseting flow time-optional
;(rpsetvar 'time-step 77239)
;(rpsetvar 'flow-time 0.003246279814592568)
; Initial time step
solve/set/time-step 1.0e-8
; Max. iterations per time step
solve/set/transient-controls/max-iterations-per-time-step 30
; Choosing duration method: 0-Incremental Time Steps, 1-Total Time Steps, 2-Total Time, 3-Incremental Time
solve/set/transient-controls/duration-specification-method 0
; Specify nr. of iterations and max. iterations per time-step and start transient calculation
solve/dual-time-iterate 12345 30
;parallel/timer usage
;file/stop-transcript
exit ok
yes
When the time limit is reached, #restart.inp is created and later turned into the restart.jou journal file. With checkpointing (lines 3 and 4) your case is saved at the last time step, and when restarting, the calculation continues by first completing the iterations within the last-saved time step and then proceeding with the remaining time steps. The checkpointing commands (lines 3 and 4) have to be included in the generated journal file to allow another restart (see the sbatch file, lines 23-26). Below is an example of a restart.jou journal file generated after restarting.
[kkovacic@viz]$ cat restart.jou
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")
rc G25L500.cas0084.cas0144.cas
rd G25L500.cas0084.cas0144.dat
it 26
/solve dti 12340 30
;parallel/timer usage
;file/stop-transcript
exit ok
yes
G25L500.cas0084.cas0144.cas is the saved case after two restarts. it 26 (line 6) completes the remaining iterations within the last-saved time step. /solve dti 12340 30 (line 7) proceeds with the remaining time steps with a maximum of 30 iterations per time step.
IMPORTANT!
If your solver is set to duration specification method [2] Total Time or [3] Incremental Time, specified in seconds, the #restart.inp will contain e.g. /solve dti -1 30 instead of /solve dti 12340 30. The following error will appear:
/solve dti -1 Total Time should be set to be greater than the current flow time in order to proceed further. 30 Specified flow time reached flow time = 1.264e-08, total time = 1.264e-08
and your case will be stopped after completing the iterations (it 26) within the last-saved time step. Make sure the duration specification method is set to 0 or 1 in your initial journal file or in the case file uploaded to HPCFS.
The following sample script starts ANSYS System Coupling 2021R2 and runs the script run.py. Additional command-line arguments can be used when starting System Coupling. For more details please refer to the System Coupling documentation: System Coupling Settings and Commands Reference -> Command-Line Options.
[jzevnik@viz rome]$ cat runsc
#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so
#SBATCH --partition=rome # specify partition: rome or haswell
#SBATCH --ntasks=96 # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48 # number of tasks per node
#SBATCH --job-name=run_fsi # specify job name
#SBATCH --mem=120G
#SBATCH --time=1-00:00:00 # time limit days-hh:mm:ss
"/opt/pkg/software/ANSYS/2021R2/v212/SystemCoupling/bin/systemcoupling" -R run.py > output.log
The following run.py script opens the System Coupling input file "fsi.sci" and starts Mech and Fluent solvers. Additionally, working directories for both solvers are specified along with the corresponding input files.
Allocation of computational resources is done in Line 7. For more details please refer to the System Coupling documentation: System Coupling User's Guide -> Using System Coupling's User Interfaces -> Advanced Coupled Analysis Tasks -> Using Parallel Processing Capabilities.
ImportSystemCouplingInputFile(FilePath = 'fsi.sci')
execCon = DatamodelRoot().CouplingParticipant
execCon['Solution'].ExecutionControl.InitialInput = 'solid.dat'
execCon['Solution'].ExecutionControl.WorkingDirectory = 'Mechanical'
execCon['Solution 1'].ExecutionControl.InitialInput = 'fluid.jou'
execCon['Solution 1'].ExecutionControl.WorkingDirectory = 'Fluent'
PartitionParticipants(AlgorithmName = 'SharedAllocateMachines',NamesAndFractions = [('Solution', 8.0/96.0),('Solution 1', 96.0/96.0)])
PrintSetup()
Solve()
Additional start-up arguments for each coupling participant can be specified with the "AdditionalArguments" command:
execCon['Solution 1'].ExecutionControl.AdditionalArguments = '-pib'
A System Coupling run can be restarted from any of the previously saved restart points. The following restart.py script opens the restart point at the 1000th coupling step and continues with the simulation.
Open(CouplingStep = 1000)
PrintSetup()
Solve()