
ANSYS at HPCFS

ANSYS Fluent and SSH

Fluent is known to start MPI jobs from within its own fluent wrapper script.
Usually, passwordless ssh to the allocated nodes is needed before any batch script can be submitted.
The following three lines of code are sufficient to ensure you have this set up:

test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -f ~/.ssh/id_rsa -q -N ""
test -f ~/.ssh/authorized_keys || cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
grep -qf ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys || cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To check that passwordless ssh works, allocate two nodes and log in to one of them within the next 5 minutes:

[leon@viz ~]$ salloc --nodes=2 --partition=rome,haswell --time=5:00 --mem=0
salloc: Granted job allocation 59168
salloc: Waiting for resource configuration
salloc: Nodes cn[41-42] are ready for job
[leon@viz ~]$ ssh cn42
Warning: Permanently added 'cn42,10.0.2.142' (ECDSA) to the list of known hosts.
Last login: Mon Nov 29 10:54:26 2021
[leon@cn42 ~]$ exit
logout
Connection to cn42 closed.
[leon@viz ~]$ exit
exit
salloc: Relinquishing job allocation 59168
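
Passwordless access can also be checked non-interactively: with BatchMode, ssh fails instead of prompting for a password. A minimal sketch, reusing the node cn42 allocated above:

ssh -o BatchMode=yes cn42 hostname || echo "passwordless ssh is NOT set up"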

ANSYS Fluent 20.1 with checkpointing and resubmit

The following sbatch example comprises these features:

  • Gracefully stopping Fluent two minutes before the maximum time limit (6 hours) is reached
  • Automatic resubmission, with the number of submits given as an argument to the script. For example, sbatch my-lam8.sbatch 3 will resubmit itself 2 times (18 hours in total).
  • Marking the remaining runs in the job name when observing with squeue. The job name 2*lam8 means that there is one remaining job to be resubmitted after this job completes.
  • Checkpointing (saving files) by adding a file watch to the Fluent journal file my-lam8.jou.
  • Automatic sizing of the job by changing only the required ntasks. It is suggested to select a multiple of 48 on the rome partition.
[leon@viz lam8]$ cat my-lam8.sbatch
#!/bin/bash

#SBATCH --export=ALL,LD_PRELOAD=
#SBATCH --partition rome
#SBATCH --ntasks=96              # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48
#SBATCH --job-name=lam8
#SBATCH --mem=120G
#SBATCH --time=6:00:00           # time limit days-hh:mm:ss
#SBATCH --signal=B:USR1@120

trap 'touch exit-fluent' USR1

module purge
module load ANSYS/20.1

if test -f '#restart.inp'; then
    journal=restart.jou
    mv '#restart.inp' $journal
else
    journal=my-lam8.jou
    rm -f *.log *.trn *-[0-9]*.{cas,dat.h5,cas.h5,cdat} cleanup*.sh
fi
grep -q checkpoint/exit-filename $journal || sed -i \
    -e '1i(set! checkpoint/check-filename "./check-fluent")' \
    -e '1i(set! checkpoint/exit-filename "./exit-fluent")' $journal

fluent 3ddp -g -slurm -t${SLURM_NTASKS} -mpi=ibmmpi -pib -i $journal &
wait; kill -0 $! 2> /dev/null; process_status=$?; wait
if test $process_status = 0 -a $# = 1 -a 0$1 -gt 1 -a -f '#restart.inp'
then
    b=/tmp/${SLURM_JOB_ID}.sbatch
    sed /job-name[=]/s/=[0-9]*[*]*/=$(($1-1))*/ $0>$b
    sbatch $b $(($1-1)) && rm -f $b
else
    echo "Fluent job(s) finished" ; rm -f restart.jou
fi

Line 11 specifies that the batch script should receive the user signal (USR1) 120 seconds before the job would otherwise be terminated for reaching the specified --time limit.
Line 13 traps the signal and executes a single command that creates the exit-fluent file when SLURM signals the batch script. This file name is specified at the beginning of the Fluent journal file my-lam8.jou and instructs Fluent to stop all processes and write the #restart.inp file, which serves as input for the restart journal.

[leon@viz lam8]$ head -3 my-lam8.jou
; Scheme commands to specify the check pointing and emergency exit commands:
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")

One can also run touch check-fluent in the working directory to force saving of the files while Fluent is running.
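
The interplay of --signal (line 11) and trap (line 13) can also be tested in isolation, without Fluent. A minimal sketch, with an arbitrary job name and a sleep standing in for the solver:

#!/bin/bash
#SBATCH --job-name=sigtest
#SBATCH --time=5:00              # 5 minute limit
#SBATCH --signal=B:USR1@120      # send USR1 to the batch shell 120 s before the limit

trap 'echo "USR1 received"; touch exit-fluent' USR1
sleep 600 &                      # stand-in for a long-running solver
wait                             # interrupted by USR1 so that the trap can run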

Line 18 tests for the presence of the Fluent restart file (#restart.inp). If it exists, the journal file name (restart.jou) is set in line 19 and the file is created in line 20 by renaming #restart.inp. If no #restart.inp was created, we can only start from our initial my-lam8.jou journal file, set in line 22. Before we can safely start from scratch, we need to carefully remove (line 23) the old log, transcript, and timestep files that could interfere with a clean start and with restarts. See man 7 glob for more information on the patterns used for removing files.

In line 25 we check for the presence of the checkpoint settings (exit-filename) in the journal file; if such a line is missing, the Scheme set! commands are added at the beginning of the journal file (lines 26-27). You therefore do not need to add these commands to your journal file yourself. The other reason is that #restart.inp is missing them too, so the lines are always added to allow multiple restarts.
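
What the grep/sed pair does can be checked on a scratch journal; a sketch with a hypothetical demo.jou:

printf 'rc case.cas\nit 100\n' > demo.jou
grep -q checkpoint/exit-filename demo.jou || sed -i \
    -e '1i(set! checkpoint/check-filename "./check-fluent")' \
    -e '1i(set! checkpoint/exit-filename "./exit-fluent")' demo.jou
head -2 demo.jou
# (set! checkpoint/check-filename "./check-fluent")
# (set! checkpoint/exit-filename "./exit-fluent")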

Line 29 starts Fluent in the background with an ampersand (&) at the end of the line. This is necessary so the sbatch script can catch the USR1 signal.
The first wait in line 30 is interrupted by the USR1 signal, which executes the touch from line 13. The second wait waits for Fluent to read the exit-fluent file and gracefully save everything. If Fluent finishes regularly, both waits simply return and there is no deadlock. The harmless kill between the waits probes whether Fluent is still running in the background. Its exit status is zero if the user signal was received while waiting, meaning that the time limit is approaching and the exit-fluent file was created. If Fluent stops normally, because the specified number of iterations or some other convergence criterion was reached, or because of a failure in journal processing, the kill exit status is 1, meaning that there is no Fluent run to resubmit.
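
The wait; kill -0; wait idiom can be exercised with a plain sleep in place of Fluent; a sketch showing how process_status distinguishes the two outcomes:

trap 'touch exit-fluent' USR1    # as in line 13 of the script
sleep 300 &                      # background job, like fluent ... &
wait                             # returns early if USR1 arrives
kill -0 $! 2> /dev/null          # succeeds only while the job is still alive
process_status=$?                # 0: signalled, job still running; 1: job finished normally
wait                             # wait once more for the job to really terminate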
Line 31 tests the Fluent status, the number of command-line arguments, and the presence of the Fluent #restart.inp file, and uses the first argument as the number of repeats, which is decremented by one at each sbatch resubmit.
In line 34 we create a new sbatch script (named in line 33) by slightly updating the --job-name=lam8 line with the stream editor, adding a number* in front of the name.
After the batch job is submitted in line 35, the temporary script is removed as it is not needed anymore. If there are no repeats left and everything is OK, the (last) script echoes in line 37 that the Fluent job(s) finished and removes the last restart journal file, if present.
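
The stream editor expression can be tried on its own; a sketch of how the job name evolves over resubmits:

echo '#SBATCH --job-name=lam8'   | sed '/job-name[=]/s/=[0-9]*[*]*/=2*/'   # -> --job-name=2*lam8
echo '#SBATCH --job-name=2*lam8' | sed '/job-name[=]/s/=[0-9]*[*]*/=1*/'   # -> --job-name=1*lam8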

IMPORTANT! The sbatch script removes all previously saved cas, dat and cdat files when starting a run from scratch (line 23). If you want to keep files in these formats, rewrite line 23 to rm -f *.log *.trn cleanup*.sh.

We recommend using the .sbatch file extension for sbatch-aware scripts, although using .sh or another convention does not affect sbatch submission.
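
For example, a full cycle of three runs could be submitted and then monitored with squeue; a sketch (the squeue output format is chosen for illustration):

sbatch my-lam8.sbatch 3                        # first run; resubmits itself twice
squeue -u $USER -o '%.10i %.12j %.8T %.10M'    # job name column shows remaining runs, e.g. 2*lam8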

ANSYS 21.1 on rome partition

Running any ANSYS/21.1 GUI or solver on rome nodes fails with

    /opt/pkg/software/ANSYS/21.1/v211/licensingclient/linx64/ansyscl: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b

To overcome this problem, one can preload the correct Kerberos 5 library by adding

export LD_PRELOAD=/opt/pkg/software/ANSYS/21.1/compatibility_fix/libk5crypto.so.3

just after module load ANSYS/21.1. Alternatively, the directory can be prepended to LD_LIBRARY_PATH.
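
A sketch of both variants; the compatibility_fix directory is inferred from the preload path above:

module load ANSYS/21.1
# variant 1: preload only the fixed Kerberos 5 library
export LD_PRELOAD=/opt/pkg/software/ANSYS/21.1/compatibility_fix/libk5crypto.so.3
# variant 2: prepend the directory to the library search path instead
export LD_LIBRARY_PATH=/opt/pkg/software/ANSYS/21.1/compatibility_fix:${LD_LIBRARY_PATH}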

Newer ANSYS versions (2021 R2) are RHEL 8.4 compatible, and the above fix is not needed anymore. However, it was observed that the following preload is needed for the ANSYS/2021R2 module:

export LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so:${LD_PRELOAD}

FLUENT 2021R2 with OpenMPI or Intel MPI

OpenMPI 4.x is included within Fluent and uses UCX as the preferred communication fabric for InfiniBand.

#!/bin/bash

#SBATCH --export=ALL,LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so
#SBATCH --partition rome
#SBATCH --ntasks=96              # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48
#SBATCH --job-name=lam8r2
#SBATCH --mem=120G
#SBATCH --time=4:00:00           # time limit days-hh:mm:ss
#SBATCH --signal=B:USR1@120

trap 'touch exit-fluent' USR1

module purge
module load ANSYS/2021R2

if test -f '#restart.inp'; then
    journal=restart.jou
    mv '#restart.inp' $journal
else
    journal=my-lam8.jou
    rm -f *.log *.trn *-[0-9]*.{cas,dat.h5,cas.h5,cdat} cleanup*.sh
fi
grep -q checkpoint/exit-filename $journal || sed -i \
    -e '1i(set! checkpoint/check-filename "./check-fluent")' \
    -e '1i(set! checkpoint/exit-filename "./exit-fluent")' $journal

NODEFILE=${SLURM_SUBMIT_DIR}/slurmhosts.${SLURM_JOB_ID}.txt
scontrol show hostname ${SLURM_NODELIST} > ${NODEFILE}

fluent 3ddp -g -slurm -t ${SLURM_NTASKS} -mpi=openmpi -cnf=${NODEFILE} -pib -i $journal &
wait; kill -0 $! 2> /dev/null; process_status=$?; wait
if test $process_status = 0 -a $# = 1 -a 0$1 -gt 1 -a -f '#restart.inp'
then
    b=/tmp/${SLURM_JOB_ID}.sbatch
    sed /job-name[=]/s/=[0-9]*[*]*/=$(($1-1))*/ $0>$b
    sbatch $b $(($1-1)) && rm -f $b
else
    echo "Fluent job(s) finished" ; rm -f restart.jou
fi

Instead of -mpi=openmpi one can use -mpi=intel2019 or simply -mpi=intel.
Note that -mpi=ibmmpi is deprecated and actually starts Intel MPI, which requires -cnf=${NODEFILE} in any case.
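
With Intel MPI, only the -mpi option in the Fluent start line of the script above changes; -cnf stays:

fluent 3ddp -g -slurm -t ${SLURM_NTASKS} -mpi=intel -cnf=${NODEFILE} -pib -i $journal &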

IMPORTANT! The sbatch script removes all previously saved cas, dat and cdat files when starting a run from scratch (line 23). If you want to keep files in these formats, rewrite line 23 to rm -f *.log *.trn cleanup*.sh.

Restarting Ansys Fluent 2021R2

To restart an Ansys Fluent simulation and resubmit your case to the queue after the time limit, run your batch script with: sbatch my-lam8.sbatch 3
For more information on how to resubmit a job, see the section 'ANSYS Fluent 20.1 with checkpointing and resubmit'.
Depending on whether you are running a steady or an unsteady/transient simulation, the starting journal file my-lam8.jou should be written as follows:

  1. Steady-state simulation
    The starting journal file my-lam8.jou reads the case and data file (Lines 4 and 5) and starts 20000 iterations.
[kkovacic@viz]$ cat my-lam8.jou
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")

file/read-case G40_upper_chamber.cas.h5
file/read-data G40_upper_chamber.dat.h5
solve/monitors/residual/print yes
solve/set/flow-warnings no
solve/iterate 20000
;parallel/timer usage
;file/stop-transcript
exit
ok

When the time limit is reached, #restart.inp is created and later renamed to the restart.jou journal file. #restart.inp contains the number of remaining iterations. For example, after the third restart the saved case G40_upper_chamber.cas0080.cas0144.cas0205.cas is started with the remaining 19795 iterations (it 19795) until the time limit is reached again, and a new case with an additional cas[0-9]* suffix is saved for the next restart.
Example of #restart.inp:

[kkovacic@viz]$ cat \#restart.inp
rc G40_upper_chamber.cas0080.cas0144.cas0205.cas
rd G40_upper_chamber.cas0080.cas0144.cas0205.dat
it 19795
  2. Unsteady/transient simulation
    The crucial thing is to set the duration specification method in your solver to Incremental Time Steps [0] or to Total Time Steps [1] (Line 19 in the following journal file example). With the command solve/dual-time-iterate 12345 30 (Line 21), abbreviated as solve/dti 12345 30, you set 12345 time steps to calculate and at most 30 iterations per time step.
[kkovacic@viz]$ cat my-lam8.jou
; Scheme commands to specify the check pointing and emergency exit commands:

(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")
; Reading initial case and data file
file/read-case G25L500.cas.h5
file/read-data G25L500.dat.h5
; Printing residuals
solve/monitors/residual/print yes
solve/set/flow-warnings no
; Resetting flow time - optional
;(rpsetvar 'time-step 77239)
;(rpsetvar 'flow-time 0.003246279814592568)
; Initial time step
solve/set/time-step 1.0e-8
; Max. iterations per time step
solve/set/transient-controls/max-iterations-per-time-step 30
; Choosing duration method: 0-Incremental Time Steps, 1-Total Time Steps, 2-Total Time, 3-Incremental Time
solve/set/transient-controls/duration-specification-method 0
; Specify nr. of time steps and max. iterations per time-step and start transient calculation
solve/dual-time-iterate 12345 30
;parallel/timer usage
;file/stop-transcript
exit
ok
yes

When the time limit is reached, #restart.inp is created and later renamed to the restart.jou journal file. With checkpointing (Lines 3 and 4) your case is saved at the last time step, and when restarting it, the calculation first completes the iterations within the last saved time step and then proceeds with the remaining number of time steps. The checkpointing commands have to be included in the generated journal file to allow another restart; the sbatch script adds them automatically (lines 25-27). Below is an example of a restart.jou journal file generated after a restart.

[kkovacic@viz]$ cat restart.jou
(set! checkpoint/check-filename "./check-fluent")
(set! checkpoint/exit-filename "./exit-fluent")

rc G25L500.cas0084.cas0144.cas
rd G25L500.cas0084.cas0144.dat
it 26
/solve dti 12340 30
;parallel/timer usage
;file/stop-transcript
exit
ok
yes

G25L500.cas0084.cas0144.cas is the case saved after two restarts. it 26 (Line 6) calculates the remaining iterations within the last saved time step. /solve dti 12340 30 (Line 7) proceeds with the remaining time steps, with at most 30 iterations per time step.

IMPORTANT!
If your solver is set to duration specification method [2] Total Time or [3] Incremental Time, both specified in seconds, #restart.inp will contain e.g. /solve dti -1 30 instead of /solve dti 12340 30. The error Total Time should be set to be greater than the current flow time in order to proceed further. Specified flow time reached: flow time = 1.264e-08, total time = 1.264e-08 will appear, and your case will be stopped after completing the iterations (it 26) within the last saved time step. Make sure the duration specification method is set to 0 or 1 in your initial journal file or case file uploaded to HPCFS.

ANSYS System Coupling 2021R2

The following sample script starts ANSYS System Coupling 2021R2 and runs the script run.py. Additional command-line arguments can be used when starting System Coupling. For more details please refer to the System Coupling documentation: System Coupling Settings and Commands Reference -> Command-Line Options.

[jzevnik@viz rome]$ cat runsc
#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=/opt/pkg/software/ANSYS/2021R2/v212/commonfiles/MPI/Intel/2018.3.222/linx64/lib/libstrtok.so
#SBATCH --partition=rome         # specify partition: rome or haswell
#SBATCH --ntasks=96              # total number of cores requested
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=48     # number of tasks per node
#SBATCH --job-name=run_fsi       # specify job name
#SBATCH --mem=120G
#SBATCH --time=1-00:00:00        # time limit days-hh:mm:ss

"/opt/pkg/software/ANSYS/2021R2/v212/SystemCoupling/bin/systemcoupling" -R run.py > output.log

The following run.py script opens the System Coupling input file "fsi.sci" and starts the Mechanical and Fluent solvers. Additionally, working directories for both solvers are specified, along with the corresponding input files.

Allocation of computational resources is done in Line 7. For more details please refer to the System Coupling documentation: System Coupling User's Guide -> Using System Coupling's User Interfaces -> Advanced Coupled Analysis Tasks -> Using Parallel Processing Capabilities.

ImportSystemCouplingInputFile(FilePath = 'fsi.sci')
execCon = DatamodelRoot().CouplingParticipant
execCon['Solution'].ExecutionControl.InitialInput = 'solid.dat'
execCon['Solution'].ExecutionControl.WorkingDirectory = 'Mechanical'
execCon['Solution 1'].ExecutionControl.InitialInput = 'fluid.jou'
execCon['Solution 1'].ExecutionControl.WorkingDirectory = 'Fluent'
PartitionParticipants(AlgorithmName = 'SharedAllocateMachines', NamesAndFractions = [('Solution', 8.0/96.0), ('Solution 1', 96.0/96.0)])
PrintSetup()
Solve()

Additional start-up arguments for each coupling participant can be specified via the "AdditionalArguments" setting:

execCon['Solution 1'].ExecutionControl.AdditionalArguments = '-pib'

Restarting a Coupled Analysis

A System Coupling run can be restarted from any of the previously saved restart points. The following restart.py script opens the restart point at the 1000th coupling step and continues with the simulation.

Open(CouplingStep = 1000)
PrintSetup()
Solve()
tags: HPCFS SLURM ANSYS System Coupling