128-node run, 8 ranks per node, 6 OMP threads. This attempt runs without MPICH_ASYNC_PROGRESS and without MPI_THREAD_MULTIPLE, and also without any extra MPI_xxx, MPICH_xxx or FI_xxx environment variables.

Logs are saved in:

```
/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/LATEST_RAPS/raps/bin/SLURM/lumi/comment/tco1279l137/hz9o/hres/cray.craympich/lum-g.cray.sp/h5.N128.T768xt7xh1+ioT128xt7xh0.nextgems_6h.i16r0w16.eORCA12_Z75.htco1279-5257824
```

The total number of ranks is 1024.

## MultiIO ranks

Ranks 768-1023 run different code (the MultiIO subroutines). They are simply killed after the crash, so they can be excluded from the investigation.

## Ranks hanging in MPI wait/progress procedures

292 ranks were hanging and terminated only on the ALARM signal from the debugger. All of them were waiting in some MPI function (like `MPI_Waitall()`) called from one of the `trltog` or `trmtol` subroutines.

## Ranks aborted in MPIDI_OFI_handle_cq_error()

388 ranks got an error in the `MPIDI_OFI_handle_cq_error()` call. The stacks are:

```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6 #0 0x000015550d06bd2b in raise () from /lib64/libc.so.6 #1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6 #2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12 #3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12 #4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12 #5 0x00001554e2d96458 in PMPI_Isend () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12 #6 0x000015551178f3fc in pmpi_isend__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12 #7 0x000015550d4d09a2 in mpl_send_real4 (pbuf=..., kdest=704, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kmp_type=<optimized out>, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=0, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_send_mod.F90:176 #8 0x0000155519f75fa3 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1 #9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=18, kf_scders=0, kf_gp=137, kf_fs=18, kf_out_lt=18, kvsetuv=<error reading variable: Location address is not 
set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268 #10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=18, kf_out_lt=18, kf_uv=0, kf_scalars=18, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1226736 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648 #12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #15 
0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

or:

```
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2726381 in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4a8b36 in mpl_alltoallv_real4 (psendbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, ksendcounts=..., precvbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, krecvcounts=..., ksenddispl=..., krecvdispl=..., kmp_type=<error reading
variable: Cannot access memory at address 0x0>, kcomm=-1006632939, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=<error reading variable: Cannot access memory at address 0x0>, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_alltoallv_mod.F90:319 #8 0x000015551a029333 in trmtol (pfbuf_in=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, pfbuf=<error reading variable: value requires 33168384 bytes, which is more than max-value-size>, kfield=<optimized out>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trmtol_mod.F90:336 #9 0x0000155519ff04d4 in ltinv_ctl (kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=..., pspdiv=..., pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kfldptruv=<error reading variable: Location address is not set.>, kfldptrsc=<error reading variable: Location address is not set.>, fspgl_proc=0x0) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ltinv_ctl_mod.F90:125 #10 0x0000155519feb78f in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 
bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:292 #11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

The call stack as it is reported by Cray-MPICH:

```
MPICH ERROR [Rank 8] [job id 5257824.1] [Thu Dec 14 20:04:58 2023] [nid005504] - Abort(672771983) (rank 8 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)................: MPI_Isend(buf=0xad130640, count=1910, dtype=0x4c000427, dest=1,
tag=20000, comm=0xc400000c, request=0x7fffff722d6c) failed
MPID_Isend(578)................:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
```

Or:

```
MPICH ERROR [Rank 729] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid006352] - Abort(741415567) (rank 729 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x5f00e440, scnts=0x7fffff729de0, sdispls=0x7fffff728b60, dtype=0x4c000427, rbuf=0x52a05680, rcnts=0x7fffff729c60, rdispls=0x7fffff727f60, datatype=dtype=0x4c000427, comm=comm=0xc4000015) failed
MPIR_CRAY_Alltoallv(1187)......:
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
```

## Ranks aborted in MPIDI_CRAY_XPMEM_do_attach()

88 ranks show a failure from XPMEM MPI subroutines.
Typical MPI-reported stacks are:

Variant 1:

Cray-MPICH stack:

```
xpmem_attach error: : No such file or directory
MPICH ERROR [Rank 540] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid005852] - Abort(138520079) (rank 540 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
PMPI_Irecv(166)......................: MPI_Irecv(buf=0xb409cc10, count=24805, dtype=0x4c000427, src=539, tag=20000, comm=0xc4000005, request=0x7fffff722dcc) failed
MPID_Irecv(529)......................:
MPIDI_irecv_unsafe(163)..............:
MPIDI_SHM_mpi_irecv(462).............:
MPIDI_SHM_mpi_imrecv(514)............:
MPIDI_SHM_mmods_try_matched_recv(167):
MPIDI_CRAY_Common_lmt_handle_recv(44):
MPIDI_CRAY_Common_lmt_import_mem(210):
MPIDI_CRAY_XPMEM_do_attach(528)......: xpmem_attach failed on rank 4 (src_rank 539, vaddr 0xb36d0700, len 99220)
```

GDB stack:

```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7fb in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d4fe7b in PMPI_Irecv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f4fc in pmpi_irecv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4c1a3d in mpl_recv_real4 (pbuf=<error reading variable: value requires 99220 bytes, which is more than max-value-size>, ksource=540, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kfrom=<optimized out>, krecvtag=<optimized out>, kount=<error reading variable: Cannot access memory at address 0x0>,
kmp_type=5, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=-335539279, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_recv_mod.F90:241 #8 0x0000155519f75df7 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1 #9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268 #10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, 
kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1158584 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error 
reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648 #12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #19 0x0000155516641b77 in cnt0_ () from 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

Variant 2:

Cray-MPICH stack:

```
xpmem_attach error: : No such file or directory
MPICH ERROR [Rank 100] [job id 5257824.1] [Thu Dec 14 20:04:52 2023] [nid005681] - Abort(541175567) (rank 100 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)...........................: MPI_Isend(buf=0xb3fa6480, count=1464, dtype=0x4c000427, dest=66, tag=20000, comm=0xc4000005, request=0x7fffff722dc8) failed
MPID_Isend(578)...........................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(210).....:
MPIDI_CRAY_XPMEM_do_attach(528)...........: xpmem_attach failed on rank 4 (src_rank 101, vaddr 0xb4f1e630, len 123496)
```

GDB stack:

```
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7fb in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d96458 in PMPI_Isend () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f3fc in pmpi_isend__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4d09a2 in mpl_send_real4 (pbuf=..., kdest=67, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kmp_type=<optimized out>, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=0, cdstring=...)
at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_send_mod.F90:176 #8 0x0000155519f75fa3 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1 #9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268 #10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error 
reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1165520 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 4717184 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 4717184 bytes, which is more than 
max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

## Ranks with other failures

A single rank (rank 201) has a
different failure, with the following messages and GDB stack:

```
Memory access fault by GPU node-5 (Agent handle: 0x18d17b0) on address 0x15516ac7d000. Reason: Unknown.
Thread 2 "ifsMASTER.SP" received signal SIGABRT, Aborted.
[Switching to Thread 0x1553b42cf700 (LWP 43546)]
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554640f8b65 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#3 0x00001554640f5b39 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#4 0x00001554640b3497 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#5 0x000015550d0076ea in start_thread () from /lib64/libpthread.so.0
#6 0x000015550d13949f in clone () from /lib64/libc.so.6
```

## Environment and MPI threading mode

The MPI-related environment variables for this execution:

```
MPICH_GPU_SUPPORT_ENABLED=1
CRAY_ACC_FORCE_EARLY_INIT=1 # force init at program start of all devices
MPICH_ABORT_ON_ERROR=1
ROCFFT_RTC_CACHE_PATH=/tmp
```

Threading mode: `MPI_THREAD_FUNNELED`
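
## Sketch: tallying the MPICH failure classes

The per-class rank counts in the sections above can be reproduced from the job's stderr with a small script. The following is only a minimal sketch, not the tool that was used for this report. It assumes every aborting rank prints one `MPICH ERROR [Rank N]` header followed by its error stack, as in the excerpts above. The function name `classify`, the labels in `SIGNATURES` and the abbreviated `SAMPLE` text are hypothetical and introduced for illustration. Hanging ranks and the GPU memory access fault are assumed to print no such header, so they would have to be counted separately.

```python
import re
from collections import Counter

# Header printed by Cray MPICH for each aborting rank (format as in the logs above).
HEADER = re.compile(r"MPICH ERROR \[Rank (\d+)\]")

# Substring found in the error stack -> label. The labels are made up for this sketch.
SIGNATURES = {
    "MPIDI_CRAY_XPMEM_do_attach": "xpmem_attach",
    "VNI_NOT_FOUND": "ofi_vni_not_found",
    "PTLTE_NOT_FOUND": "ofi_ptlte_not_found",
}

FIRST_MULTIIO_RANK = 768  # ranks 768-1023 run MultiIO and are excluded


def classify(log_text):
    """Return {rank: label} for every compute-rank MPICH ERROR record."""
    # split() with one capture group yields [preamble, rank, body, rank, body, ...]
    parts = HEADER.split(log_text)
    result = {}
    for rank_str, body in zip(parts[1::2], parts[2::2]):
        rank = int(rank_str)
        if rank >= FIRST_MULTIIO_RANK:
            continue
        label = next((name for sig, name in SIGNATURES.items() if sig in body), "other")
        result[rank] = label
    return result


# Abbreviated records taken from the excerpts above (intermediate stack lines dropped).
SAMPLE = """\
MPICH ERROR [Rank 8] [job id 5257824.1] [Thu Dec 14 20:04:58 2023] [nid005504] - Abort(672771983) (rank 8 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
MPICH ERROR [Rank 729] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid006352] - Abort(741415567) (rank 729 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
xpmem_attach error: : No such file or directory
MPICH ERROR [Rank 540] [job id 5257824.1] [Thu Dec 14 20:04:51 2023] [nid005852] - Abort(138520079) (rank 540 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
MPIDI_CRAY_XPMEM_do_attach(528)......: xpmem_attach failed on rank 4 (src_rank 539, vaddr 0xb36d0700, len 99220)
"""

by_rank = classify(SAMPLE)
print(Counter(by_rank.values()))
```

The XPMEM class is keyed on `MPIDI_CRAY_XPMEM_do_attach` rather than on the `xpmem_attach error:` line, because that line is printed before the rank's header and would otherwise be attributed to the preceding record.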