128-node execution, 8 ranks per node, 6 OMP threads. This is an attempt to run without `MPICH_ASYNC_PROGRESS` and without `MPI_THREAD_MULTIPLE`, and also without any extra `MPI_xxx`, `MPICH_xxx` or `FI_xxx` environment variables.
Logs are saved in:
```
/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/LATEST_RAPS/raps/bin/SLURM/lumi/comment/tco1279l137/hz9o/hres/cray.craympich/lum-g.cray.sp/h5.N128.T768xt7xh1+ioT128xt7xh0.nextgems_6h.i16r0w16.eORCA12_Z75.htco1279-5257824
```
The total number of ranks is 1024 (128 nodes × 8 ranks per node).
## MultiIO ranks
Ranks 768-1023 run different code (the MultiIO subroutines). They are simply killed after the crash, so they can be excluded from the investigation.
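The rank layout described above can be written down as a small sketch, which is handy when filtering per-rank logs. The numbers are the ones quoted in this report; the helper name `is_multio` is hypothetical.

```python
# Rank layout as described in this report.
NODES = 128
RANKS_PER_NODE = 8
FIRST_MULTIO_RANK = 768  # ranks 768-1023 run the MultiIO code

total_ranks = NODES * RANKS_PER_NODE
compute_ranks = range(0, FIRST_MULTIO_RANK)
multio_ranks = range(FIRST_MULTIO_RANK, total_ranks)


def is_multio(rank: int) -> bool:
    """True if the rank belongs to the MultiIO group and can be skipped."""
    return rank in multio_ranks


print(total_ranks, len(compute_ranks), len(multio_ranks))  # prints: 1024 768 256
```

This leaves 768 ranks (0-767) to account for in the sections below.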
## Ranks hanging in MPI wait/progress procedures
292 ranks were hanging and were terminated by the ALARM signal from a debugger. All of them were waiting in an MPI function (such as `MPI_Waitall()`) called from one of the `trltog` or `trmtol` subroutines.
## Ranks aborted in MPIDI_OFI_handle_cq_error()
388 ranks hit an error in the `MPIDI_OFI_handle_cq_error()` call. The GDB stacks are:
```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d96458 in PMPI_Isend () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f3fc in pmpi_isend__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4d09a2 in mpl_send_real4 (pbuf=..., kdest=704, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kmp_type=<optimized out>, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=0, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_send_mod.F90:176
#8 0x0000155519f75fa3 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1
#9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=18, kf_scders=0, kf_gp=137, kf_fs=18, kf_out_lt=18, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268
#10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=18, kf_out_lt=18, kf_uv=0, kf_scalars=18, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1226736 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300
#11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```
or:
```
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2726381 in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4a8b36 in mpl_alltoallv_real4 (psendbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, ksendcounts=..., precvbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, krecvcounts=..., ksenddispl=..., krecvdispl=..., kmp_type=<error reading variable: Cannot access memory at address 0x0>, kcomm=-1006632939, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=<error reading variable: Cannot access memory at address 0x0>, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_alltoallv_mod.F90:319
#8 0x000015551a029333 in trmtol (pfbuf_in=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, pfbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, kfield=<optimized out>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trmtol_mod.F90:336
#9 0x0000155519ff04d4 in ltinv_ctl (kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=..., pspdiv=..., pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kfldptruv=<error reading variable: Location address is not set.>, kfldptrsc=<error reading variable: Location address is not set.>, fspgl_proc=0x0) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ltinv_ctl_mod.F90:125
#10 0x0000155519feb78f in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:292
#11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```
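With several hundred ranks, it can help to reduce each GDB backtrace to its list of function names and group ranks by the failing MPI entry point. The following is a minimal sketch, assuming backtrace lines of the form shown above; `frame_names` and `mpi_entry_point` are hypothetical helper names, and the sample text is frames #0 to #6 of the second stack.

```python
import re
from typing import List, Optional

# "#<n> 0x<address> in <function> (" at the start of a backtrace line.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(")


def frame_names(backtrace: str) -> List[str]:
    """Return the function names of a GDB backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def mpi_entry_point(names: List[str]) -> Optional[str]:
    """Return the first frame starting with 'PMPI_', i.e. the failing MPI call."""
    return next((n for n in names if n.startswith("PMPI_")), None)


sample = """\
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2726381 in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
"""

names = frame_names(sample)
print(mpi_entry_point(names))  # prints: PMPI_Alltoallv
```

Applied to the two stacks above, this separates the `PMPI_Isend` failures (from `trltog`) from the `PMPI_Alltoallv` failures (from `trmtol`).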
The call stack as reported by Cray-MPICH:
```
MPICH ERROR [Rank 8] [job id 5257824.1] [Thu Dec 14 20:04:58 2023] [nid005504] - Abort(672771983) (rank 8 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)................: MPI_Isend(buf=0xad130640, count=1910, dtype=0x4c000427, dest=1, tag=20000, comm=0xc400000c, request=0x7fffff722d6c) failed
MPID_Isend(578)................:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
```
Or:
```
MPICH ERROR [Rank 729] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid006352] - Abort(741415567) (rank 729 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x5f00e440, scnts=0x7fffff729de0, sdispls=0x7fffff728b60, dtype=0x4c000427, rbuf=0x52a05680, rcnts=0x7fffff729c60, rdispls=0x7fffff727f60, datatype=dtype=0x4c000427, comm=comm=0xc4000015) failed
MPIR_CRAY_Alltoallv(1187)......:
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
```
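The two Cray-MPICH reports above differ mainly in the MPI call and in the trailing code (`VNI_NOT_FOUND` vs `PTLTE_NOT_FOUND`). A minimal sketch of extracting these fields from a report is given below. The regular expressions are written against the log format shown in this section only, and `parse_mpich_error` is a hypothetical helper, not part of any tool used in the run.

```python
import re
from typing import Optional

HEADER_RE = re.compile(
    r"MPICH ERROR \[Rank (?P<rank>\d+)\] \[job id (?P<job>[\d.]+)\] "
    r"\[(?P<time>[^\]]+)\] \[(?P<node>\w+)\] - .*?"
    r"Fatal error in (?P<call>\w+):"
)
# One error-stack frame: "<function>(<line>)....: <optional message>"
FRAME_RE = re.compile(r"^(?P<func>\w+)\((?P<line>\d+)\)\.*: ?(?P<msg>.*)$")
# Trailing code in the deepest frame, e.g. "... - VNI_NOT_FOUND)"
CODE_RE = re.compile(r" - (?P<code>[A-Z_]+)\)$")


def parse_mpich_error(report: str) -> Optional[dict]:
    """Extract rank, node, MPI call, deepest frame and trailing code."""
    header = None
    last_frame = None
    for line in report.strip().splitlines():
        line = line.strip()
        found = HEADER_RE.search(line)
        if found:
            header = found
            continue
        frame = FRAME_RE.match(line)
        if frame:
            last_frame = frame  # keep overwriting: the last one is the deepest
    if header is None or last_frame is None:
        return None
    code = CODE_RE.search(last_frame.group("msg"))
    return {
        "rank": int(header.group("rank")),
        "node": header.group("node"),
        "call": header.group("call"),
        "frame": last_frame.group("func"),
        "code": code.group("code") if code else None,
    }


report = """\
MPICH ERROR [Rank 8] [job id 5257824.1] [Thu Dec 14 20:04:58 2023] [nid005504] - Abort(672771983) (rank 8 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)................: MPI_Isend(buf=0xad130640, count=1910, dtype=0x4c000427, dest=1, tag=20000, comm=0xc400000c, request=0x7fffff722d6c) failed
MPID_Isend(578)................:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
"""

result = parse_mpich_error(report)
print(result["rank"], result["call"], result["frame"], result["code"])
# prints: 8 PMPI_Isend MPIDI_OFI_handle_cq_error VNI_NOT_FOUND
```

Counting the resulting `(call, code)` pairs over all ranks would show how the 388 failures split between the two variants.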
## Ranks aborted in MPIDI_CRAY_XPMEM_do_attach()
88 ranks show a failure in the XPMEM MPI subroutines. Typical stacks, as reported by Cray-MPICH and by GDB, are:
Variant 1:
Cray-MPICH stack:
```
xpmem_attach error: : No such file or directory
MPICH ERROR [Rank 540] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid005852] - Abort(138520079) (rank 540 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
PMPI_Irecv(166)......................: MPI_Irecv(buf=0xb409cc10, count=24805, dtype=0x4c000427, src=539, tag=20000, comm=0xc4000005, request=0x7fffff722dcc) failed
MPID_Irecv(529)......................:
MPIDI_irecv_unsafe(163)..............:
MPIDI_SHM_mpi_irecv(462).............:
MPIDI_SHM_mpi_imrecv(514)............:
MPIDI_SHM_mmods_try_matched_recv(167):
MPIDI_CRAY_Common_lmt_handle_recv(44):
MPIDI_CRAY_Common_lmt_import_mem(210):
MPIDI_CRAY_XPMEM_do_attach(528)......: xpmem_attach failed on rank 4 (src_rank 539, vaddr 0xb36d0700, len 99220)
```
GDB stack:
```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7fb in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d4fe7b in PMPI_Irecv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f4fc in pmpi_irecv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4c1a3d in mpl_recv_real4 (pbuf=<error reading variable: value requires 99220 bytes, which is more than max-value-size>, ksource=540, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kfrom=<optimized out>, krecvtag=<optimized out>, kount=<error reading variable: Cannot access memory at address 0x0>, kmp_type=5, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=-335539279, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_recv_mod.F90:241
#8 0x0000155519f75df7 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1
#9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268
#10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300
#11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```
Variant 2:
Cray-MPICH stack:
```
xpmem_attach error: : No such file or directory
MPICH ERROR [Rank 100] [job id 5257824.1] [Thu Dec 14 20:04:52 2023] [nid005681] - Abort(541175567) (rank 100 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)...........................: MPI_Isend(buf=0xb3fa6480, count=1464, dtype=0x4c000427, dest=66, tag=20000, comm=0xc4000005, request=0x7fffff722dc8) failed
MPID_Isend(578)...........................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(210).....:
MPIDI_CRAY_XPMEM_do_attach(528)...........: xpmem_attach failed on rank 4 (src_rank 101, vaddr 0xb4f1e630, len 123496)
```
GDB stack:
```
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7fb in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d96458 in PMPI_Isend () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f3fc in pmpi_isend__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4d09a2 in mpl_send_real4 (pbuf=..., kdest=67, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kmp_type=<optimized out>, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=0, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_send_mod.F90:176
#8 0x0000155519f75fa3 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1
#9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268
#10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1165520 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300
#11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```
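The `xpmem_attach failed` lines carry enough detail for a quick consistency check. The sketch below parses the line from Variant 1 and compares it with the failing `MPI_Irecv` on rank 540. Two assumptions are made here and are not confirmed by the logs: that the "rank 4" in the message is the node-local rank, and that global ranks are placed on nodes in consecutive blocks of 8. The element size of 4 bytes follows from the `mpl_recv_real4` frame. `parse_xpmem_failure` is a hypothetical helper.

```python
import re
from typing import Optional

XPMEM_RE = re.compile(
    r"xpmem_attach failed on rank (?P<local>\d+) "
    r"\(src_rank (?P<src>\d+), vaddr (?P<vaddr>0x[0-9a-fA-F]+), len (?P<len>\d+)\)"
)

RANKS_PER_NODE = 8  # from the run description
REAL4_BYTES = 4     # the message goes through mpl_recv_real4


def parse_xpmem_failure(line: str) -> Optional[dict]:
    """Extract the fields of an 'xpmem_attach failed' message."""
    match = XPMEM_RE.search(line)
    if match is None:
        return None
    return {
        "local_rank": int(match.group("local")),  # assumed node-local rank
        "src_rank": int(match.group("src")),
        "vaddr": int(match.group("vaddr"), 16),
        "len": int(match.group("len")),
    }


variant1 = parse_xpmem_failure(
    "MPIDI_CRAY_XPMEM_do_attach(528)......: xpmem_attach failed on rank 4 "
    "(src_rank 539, vaddr 0xb36d0700, len 99220)"
)

# MPI_Irecv on rank 540 was posted with count=24805 elements.
print(variant1["len"] == 24805 * REAL4_BYTES)          # prints: True
print(540 % RANKS_PER_NODE == variant1["local_rank"])  # prints: True
```

Under these assumptions the attach length matches the posted receive (24805 × 4 = 99220 bytes), and the source rank 539 is on the same node as rank 540, which is consistent with the shared-memory (`MPIDI_SHM_*`) path in the stack.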
## Ranks with other failures
A single rank (rank 201) has a different failure, with the following messages and GDB stack:
```
Memory access fault by GPU node-5 (Agent handle: 0x18d17b0) on address 0x15516ac7d000. Reason: Unknown.
Thread 2 "ifsMASTER.SP" received signal SIGABRT, Aborted.
[Switching to Thread 0x1553b42cf700 (LWP 43546)]
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554640f8b65 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#3 0x00001554640f5b39 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#4 0x00001554640b3497 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#5 0x000015550d0076ea in start_thread () from /lib64/libpthread.so.0
#6 0x000015550d13949f in clone () from /lib64/libc.so.6
```
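As a sanity check, the per-category counts quoted in this report can be tallied against the number of non-MultiIO ranks. The counts below are copied from the sections above; the category labels are only descriptive.

```python
# Rank counts per failure category, as quoted in this report.
categories = {
    "hang in MPI wait/progress": 292,
    "MPIDI_OFI_handle_cq_error": 388,
    "MPIDI_CRAY_XPMEM_do_attach": 88,
    "GPU memory access fault": 1,
}
COMPUTE_RANKS = 768  # ranks 0-767, MultiIO ranks excluded

accounted = sum(categories.values())
print(accounted, accounted - COMPUTE_RANKS)  # prints: 769 1
```

The tally comes to 769, one more than the 768 compute ranks. One possible explanation, not verified against the logs, is that rank 201 is also counted in one of the first three categories.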
## Environment and MPI threading mode
The MPI-related environment variables for this execution:
```
MPICH_GPU_SUPPORT_ENABLED=1
CRAY_ACC_FORCE_EARLY_INIT=1 # force init at program start of all devices
MPICH_ABORT_ON_ERROR=1
ROCFFT_RTC_CACHE_PATH=/tmp
```
Threading mode: `MPI_THREAD_FUNNELED`
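To confirm that a run really has no extra `MPI_xxx`, `MPICH_xxx` or `FI_xxx` settings, the environment can be filtered by prefix. This is a minimal sketch; `mpi_tuning_vars` is a hypothetical helper, and `run_env` simply repeats the variables listed above.

```python
import os

MPI_PREFIXES = ("MPI_", "MPICH_", "FI_")


def mpi_tuning_vars(environ) -> dict:
    """Return the variables whose names start with an MPI-related prefix."""
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(MPI_PREFIXES)
    }


# Variables listed in this report for the run.
run_env = {
    "MPICH_GPU_SUPPORT_ENABLED": "1",
    "CRAY_ACC_FORCE_EARLY_INIT": "1",
    "MPICH_ABORT_ON_ERROR": "1",
    "ROCFFT_RTC_CACHE_PATH": "/tmp",
}

print(list(mpi_tuning_vars(run_env)))
# prints: ['MPICH_ABORT_ON_ERROR', 'MPICH_GPU_SUPPORT_ENABLED']

# Inside a job script, the live environment can be checked the same way:
live = mpi_tuning_vars(os.environ)
```

Only the two `MPICH_` variables match, so no `MPI_xxx` or `FI_xxx` tuning is in effect for this execution according to the list above.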