Run on 256 nodes with 8 ranks per node and 6 OMP threads per rank. This is an attempt to run without `MPICH_ASYNC_PROGRESS` and `MPI_THREAD_MULTIPLE`.
```
/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/raps/bin/SLURM/lumi/FAILING/tco1279l137/hz9o/hres/cray.craympich/lum-g.cray.sp/h2.N256.T1856xt7xh1+ioT96xt7xh0.nextgems_6h.i16r0w16.eORCA12_Z75.htco1279-5082533
```
## Ranks with error in MPIDI_OFI_handle_cq_error
The error occurs in the `MPI_Alltoallv()` call on some of the ranks. The set of failing ranks looks fairly regular rather than random, so the failure may be tied to some pattern in the model. The error logs reported by the failed ranks are all very similar and look like this:
```
libfabric:43457:1701722707::cxi:ep_data:report_send_completion():3570<warn> nid006870: TXC (0xe001:7:0): Request dest_addr: 637 caddr.nic: 0XA921 caddr.pid: 5 rxc_id: 0 error: 0x316ca810 (err: 5, PTLTE_NOT_FOUND)
libfabric:43457:1701722707::cxi:ep_data:report_send_completion():3570<warn> nid006870: TXC (0xe001:7:0): Request dest_addr: 289 caddr.nic: 0X7E60 caddr.pid: 1 rxc_id: 0 error: 0x316c7fd0 (err: 5, PTLTE_NOT_FOUND)
libfabric:43457:1701722707::cxi:ep_data:report_send_completion():3570<warn> nid006870: TXC (0xe001:7:0): Request dest_addr: 173 caddr.nic: 0X7C23 caddr.pid: 5 rxc_id: 0 error: 0x1b8eda0 (err: 5, PTLTE_NOT_FOUND)
libfabric:43457:1701722707::cxi:ep_data:report_send_completion():3570<warn> nid006870: TXC (0xe001:7:0): Request dest_addr: 1565 caddr.nic: 0XF221 caddr.pid: 5 rxc_id: 0 error: 0x1b8a450 (err: 5, PTLTE_NOT_FOUND)
libfabric:43457:1701722707::cxi:ep_data:report_send_completion():3570<warn> nid006870: TXC (0xe001:7:0): Request dest_addr: 521 caddr.nic: 0X9F41 caddr.pid: 1 rxc_id: 0 error: 0x316c9840 (err: 5, PTLTE_NOT_FOUND)
MPICH ERROR [Rank 1159] [job id 5082533.1] [Mon Dec 4 22:45:07 2023] [nid006870] - Abort(674306703) (rank 1159 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x328e37c0, scnts=0x7fffffb41f60, sdispls=0x7fffffb3fc60, dtype=0x4c000427, rbuf=0x3ce0aac0, rcnts=0x7fffffb41e60, rdispls=0x7fffffb3df60, datatype=dtype=0x4c000427, comm=comm=0xc4000015) failed
MPIR_CRAY_Alltoallv(1187)......:
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
```
The number and contents of the libfabric messages differ slightly from rank to rank. The gdb-generated backtrace is:
```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2726381 in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4a8b36 in mpl_alltoallv_real4 (psendbuf=<error reading variable: value requires 12779520 bytes, which is more than max-value-size>, ksendcounts=..., precvbuf=<error reading variable: value requires 12779520 bytes, which is more than max-value-size>, krecvcounts=..., ksenddispl=..., krecvdispl=..., kmp_type=<error reading variable: Cannot access memory at address 0x0>, kcomm=-1006632939, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=<error reading variable: Cannot access memory at address 0x0>, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_alltoallv_mod.F90:319
#8 0x000015551a029333 in trmtol (pfbuf_in=<error reading variable: value requires 12779520 bytes, which is more than max-value-size>, pfbuf=<error reading variable: value requires 12779520 bytes, which is more than max-value-size>, kfield=<optimized out>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trmtol_mod.F90:336
#9 0x0000155519ff04d4 in ltinv_ctl (kf_out_lt=2, kf_uv=1, kf_scalars=0, kf_scders=0, pspvor=..., pspdiv=..., pspscalar=<error reading variable: Location address is not set.>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kfldptruv=<error reading variable: Location address is not set.>, kfldptrsc=<error reading variable: Location address is not set.>, fspgl_proc=0x0) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ltinv_ctl_mod.F90:125
#10 0x0000155519feb78f in inv_trans_ctl (kf_uv_g=1, kf_scalars_g=0, kf_gp=2, kf_fs=2, kf_out_lt=2, kf_uv=1, kf_scalars=0, kf_scders=0, pspvor=<error reading variable: value requires 102480 bytes, which is more than max-value-size>, pspdiv=<error reading variable: value requires 102480 bytes, which is more than max-value-size>, pspscalar=<error reading variable: Location address is not set.>, kvsetuv=..., kvsetsc=<error reading variable: Location address is not set.>, pgp=..., fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:292
#11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=3556, kvsetuv=..., kvsetsc=<error reading variable: Location address is not set.>, kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=..., pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155519676a50 in speuv_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x0000155519552b24 in suorog_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x00001555195b4e22 in suspec_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555194d1d88 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#21 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#22 0x00000000004064dc in master_ ()
```
The 40 affected ranks:
```
28, 57, 115, 144, 173, 202, 260, 289, 318, 347, 376, 405, 463, 521, 608, 637, 666, 695, 724, 753, 782, 898, 927, 956, 985, 1159, 1217, 1275, 1333, 1362, 1478, 1536, 1565, 1594, 1623, 1652, 1710, 1739, 1826, 1855
```
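One way to cross-check the libfabric warnings against this list is to pull the `dest_addr` values out of a rank's log and test whether they are themselves affected ranks. The sketch below does that for the five warnings printed by rank 1159. It assumes that `dest_addr` can be read as the MPI rank of the destination, which this report does not confirm. The helper name `failed_destinations` is hypothetical.

```python
import re

# Affected ranks, copied from the list above.
AFFECTED = {
    28, 57, 115, 144, 173, 202, 260, 289, 318, 347, 376, 405, 463, 521,
    608, 637, 666, 695, 724, 753, 782, 898, 927, 956, 985, 1159, 1217,
    1275, 1333, 1362, 1478, 1536, 1565, 1594, 1623, 1652, 1710, 1739,
    1826, 1855,
}

# Matches the destination address in a libfabric CXI send-completion warning.
DEST_RE = re.compile(r"Request dest_addr: (\d+)")


def failed_destinations(log_text):
    """Return the sorted, de-duplicated dest_addr values found in log_text."""
    return sorted({int(m.group(1)) for m in DEST_RE.finditer(log_text)})


# The five warnings from rank 1159, with the libfabric prefix trimmed.
SAMPLE_LOG = """\
nid006870: TXC (0xe001:7:0): Request dest_addr: 637 caddr.nic: 0XA921 caddr.pid: 5 rxc_id: 0 error: 0x316ca810 (err: 5, PTLTE_NOT_FOUND)
nid006870: TXC (0xe001:7:0): Request dest_addr: 289 caddr.nic: 0X7E60 caddr.pid: 1 rxc_id: 0 error: 0x316c7fd0 (err: 5, PTLTE_NOT_FOUND)
nid006870: TXC (0xe001:7:0): Request dest_addr: 173 caddr.nic: 0X7C23 caddr.pid: 5 rxc_id: 0 error: 0x1b8eda0 (err: 5, PTLTE_NOT_FOUND)
nid006870: TXC (0xe001:7:0): Request dest_addr: 1565 caddr.nic: 0XF221 caddr.pid: 5 rxc_id: 0 error: 0x1b8a450 (err: 5, PTLTE_NOT_FOUND)
nid006870: TXC (0xe001:7:0): Request dest_addr: 521 caddr.nic: 0X9F41 caddr.pid: 1 rxc_id: 0 error: 0x316c9840 (err: 5, PTLTE_NOT_FOUND)
"""

dests = failed_destinations(SAMPLE_LOG)
print(dests)                                # [173, 289, 521, 637, 1565]
print(all(d in AFFECTED for d in dests))    # True
```

Under that assumption, every destination that rank 1159 failed to reach is itself one of the affected ranks. Running the same extraction over the logs of the other failed ranks would show whether this holds in general.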
The distances between consecutive affected ranks (all multiples of 29):
```
29, 87, 29, 58, 29, 29, 29, 29, 58, 116, 29, 58, 58, 58, 174, 29, 29, 29, 116, 29, 29, 29, 29, 29, 29, 87, 58, 58, 29, 29, 29, 29, 29, 58, 29, 29, 29, 58, 29, 29
```
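The regularity can be checked directly from the rank list. The following is a minimal sketch: it recomputes the gaps between consecutive ranks and the residue of each rank modulo 29. The stride of 29 is simply read off the distances above, and the helper name `gaps` is hypothetical.

```python
# Affected ranks, copied from the list above.
AFFECTED = [
    28, 57, 115, 144, 173, 202, 260, 289, 318, 347, 376, 405, 463, 521,
    608, 637, 666, 695, 724, 753, 782, 898, 927, 956, 985, 1159, 1217,
    1275, 1333, 1362, 1478, 1536, 1565, 1594, 1623, 1652, 1710, 1739,
    1826, 1855,
]

STRIDE = 29  # smallest distance seen between two affected ranks


def gaps(ranks):
    """Return the differences between consecutive ranks in ascending order."""
    ordered = sorted(ranks)
    return [b - a for a, b in zip(ordered, ordered[1:])]


rank_gaps = gaps(AFFECTED)
residues = {r % STRIDE for r in AFFECTED}

print(all(g % STRIDE == 0 for g in rank_gaps))  # True
print(residues)                                 # {28}

# Ranks on the same stride, up to the highest affected rank,
# that did not report the MPI_Alltoallv error.
missing = [r for r in range(min(AFFECTED), max(AFFECTED) + 1, STRIDE)
           if r not in AFFECTED]
print(missing)
```

Every affected rank is congruent to 28 modulo 29, that is, it is the last rank in a block of 29 consecutive ranks. Rank 1797, which fails in the FFT instead, is on the same stride and appears in `missing`.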
## The rank with FFT error
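Rank 1797 shows a different failure, an FFT error. Its backtrace ends in `hipModuleLoadData`, reached from `hipfftPlanMany` during rocFFT runtime kernel compilation: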
```
#0 0x00001554f1a77cf4 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#1 0x00001554f1aa1651 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#2 0x00001554f18d7345 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#3 0x00001554f18d7390 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#4 0x00001554f1897d88 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#5 0x00001554f1897e99 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#6 0x00001554f19e9fe9 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#7 0x00001554f19bd0b0 in hipModuleLoadData () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#8 0x00001554f166c818 in RTCKernel::RTCKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<char, std::allocator<char> > const&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#9 0x00001554f166e8ae in RTCKernel::runtime_compile(TreeNode&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#10 0x00001554f160aa8e in RuntimeCompilePlan(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#11 0x00001554f160622f in ProcessNode(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#12 0x00001554f1605c1a in rocfft_plan_create_internal(rocfft_plan_t*, rocfft_result_placement_e, rocfft_transform_type_e, rocfft_precision_e, unsigned long, unsigned long const*, unsigned long, rocfft_plan_description_t*) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#13 0x00001554f1606a6b in rocfft_plan_create () from /opt/rocm-5.2.3/lib/librocfft.so.0
#14 0x000015550014167a in hipfftMakePlan_internal(hipfftHandle_t*, unsigned long, unsigned long*, hipfftType_t, unsigned long, hipfft_plan_description_t*, unsigned long*, bool) () from /opt/rocm-5.2.3/lib/libhipfft.so
#15 0x0000155500140c80 in hipfftMakePlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so
#16 0x00001555001404d5 in hipfftPlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so
#17 0x000015551a039e94 in create_plan_ffth_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x000015551a02520e in create_plan_fft (kplan=0x74654070, ktype=<error reading variable: Cannot access memory at address 0x1>, kn=<error reading variable: Cannot access memory at address 0x0>, klot=<error reading variable: Cannot access memory at address 0x1>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/tpm_ffth.F90:112
#19 0x000015551a048910 in ftinv (preel=<error reading variable: value requires 26339328 bytes, which is more than max-value-size>, kfields=64) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_mod.F90:132
#20 0x000015551a042f95 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=1, kf_uv=0, kf_scalars=1, kf_scders=0, kf_gp=1, kf_fs=1, kf_out_lt=1, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=..., pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:171
#21 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=1, kf_gp=1, kf_fs=1, kf_out_lt=1, kf_uv=0, kf_scalars=1, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 102480 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=..., fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300
#22 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=3555, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=..., pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#23 0x000015551967549f in speree_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#24 0x0000155519552a09 in suorog_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#25 0x00001555195b4e22 in suspec_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#26 0x00001555194d1d88 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#27 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#28 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#29 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#30 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#31 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#32 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#33 0x00000000004064dc in master_ ()
```
# The environment variables and MPI threading level
The environment variables set for the run:
```
MPICH_GPU_SUPPORT_ENABLED=1
CRAY_ACC_FORCE_EARLY_INIT=1 # force init at program start of all devices
FI_CXI_OPTIMIZED_MRS="false"
MPICH_ABORT_ON_ERROR=1
FI_CXI_RX_MATCH_MODE=software
MPICH_SMP_SINGLE_COPY_MODE=NONE
MPICH_ALLTOALL_INTRA_ALGORITHM=pairwise
FI_CXI_EQ_ACK_BATCH_SIZE=1
MPICH_OFI_NIC_POLICY=GPU
ROCFFT_RTC_CACHE_PATH=/tmp
FI_CXI_DEFAULT_CQ_SIZE=131072
FI_CXI_OFLOW_BUF_SIZE=268435456
FI_CXI_OFLOW_BUF_COUNT=4
FI_CXI_DEFAULT_TX_SIZE=32768
FI_LOG_LEVEL=info
```
The MPI threading level is `MPI_THREAD_FUNNELED`.
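When re-running with a modified configuration, it can help to confirm that the job environment really carries these settings. The sketch below compares a process environment against the values listed above. The helper name `env_mismatches` is hypothetical, and the check only compares strings; it says nothing about how MPICH or libfabric interpret them.

```python
import os

# Expected settings, copied from the list above (shell quotes removed).
EXPECTED = {
    "MPICH_GPU_SUPPORT_ENABLED": "1",
    "CRAY_ACC_FORCE_EARLY_INIT": "1",
    "FI_CXI_OPTIMIZED_MRS": "false",
    "MPICH_ABORT_ON_ERROR": "1",
    "FI_CXI_RX_MATCH_MODE": "software",
    "MPICH_SMP_SINGLE_COPY_MODE": "NONE",
    "MPICH_ALLTOALL_INTRA_ALGORITHM": "pairwise",
    "FI_CXI_EQ_ACK_BATCH_SIZE": "1",
    "MPICH_OFI_NIC_POLICY": "GPU",
    "ROCFFT_RTC_CACHE_PATH": "/tmp",
    "FI_CXI_DEFAULT_CQ_SIZE": "131072",
    "FI_CXI_OFLOW_BUF_SIZE": "268435456",
    "FI_CXI_OFLOW_BUF_COUNT": "4",
    "FI_CXI_DEFAULT_TX_SIZE": "32768",
    "FI_LOG_LEVEL": "info",
}


def env_mismatches(expected, env=None):
    """Map each variable whose value differs to (expected, found).

    `found` is None when the variable is not set at all.
    """
    env = os.environ if env is None else env
    return {
        name: (want, env.get(name))
        for name, want in expected.items()
        if env.get(name) != want
    }


if __name__ == "__main__":
    # Report every variable that is missing or set to a different value.
    for name, (want, got) in sorted(env_mismatches(EXPECTED).items()):
        print(f"{name}: expected {want!r}, found {got!r}")
```

Running the script inside the job step (for example from the launch wrapper) prints nothing when all fifteen variables match.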