64-node run, 8 ranks per node, 6 OMP threads. This is an attempt to run without MPICH_ASYNC_PROGRESS and without MPI_THREAD_MULTIPLE.

```
/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/raps/bin/SLURM/lumi/FAILING/tco1279l137/hz9o/hres/cray.craympich/lum-g.cray.sp/h2.N64.T488xt7xh1+ioT12xt7xh0.nextgems_6h.i16r0w16.eORCA12_Z75.htco1279-5093989
```

## Ranks with MPIDI_OFI_handle_cq_error

The error appears on 72 of the 512 ranks. The mechanism seems to be the same in every case: libfabric prints one or more warnings, then the MPI function that called into it fails with an internal fatal error. Three typical stacks:

### 1. MPI_Alltoallv

```
libfabric:18908:1701781678::cxi:ep_data:report_send_completion():3570<warn> nid006381: TXC (0xc1f2:0:0): Request dest_addr: 464 caddr.nic: 0X10DF2 caddr.pid: 0 rxc_id: 0 error: 0x8ab2f660 (err: 5, VNI_NOT_FOUND)
MPICH ERROR [Rank 136] [job id 5093989.1] [Tue Dec 5 15:07:58 2023] [nid006381] - Abort(540088975) (rank 136 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x499f6a00, scnts=0x7fffff496c34, sdispls=0x7fffff495ea0, dtype=0x4c000427, rbuf=0x4ceb88c0, rcnts=0x7fffff496b40, rdispls=0x7fffff495700, datatype=dtype=0x4c000427, comm=comm=0xc4000024) failed
MPIR_CRAY_Alltoallv(1187)......:
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
```

gdb stack:

```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2726381 in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4a8b36 in mpl_alltoallv_real4 (psendbuf=<error reading variable: value requires 55319680 bytes, which is more than max-value-size>, ksendcounts=..., precvbuf=<error reading variable: value requires 55319680 bytes, which is more than max-value-size>, krecvcounts=..., ksenddispl=..., krecvdispl=..., kmp_type=<error reading variable: Cannot access memory at address 0x0>, kcomm=-1006632924, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=<error reading variable: Cannot access memory at address 0x0>, cdstring=...)
at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_alltoallv_mod.F90:319 #8 0x000015551a028ac6 in trltom (pfbuf_in=<error reading variable: value requires 55319680 bytes, which is more than max-value-size>, pfbuf=<error reading variable: value requires 55319680 bytes, which is more than max-value-size>, kfield=<optimized out>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltom_mod.F90:345 #9 0x0000155519fed86e in ltdir_ctl (kf_fs=18, kf_uv=0, kf_scalars=18, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1937232 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kfldptruv=<error reading variable: Location address is not set.>, kfldptrsc=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ltdir_ctl_mod.F90:90 #10 0x0000155519fd99ea in dir_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=18, kf_uv=0, kf_scalars=18, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1937232 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, 
pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/dir_trans_ctl_mod.F90:258 #11 0x0000155519f7ea4b in dir_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/dir_trans.F90:525 #12 0x0000155515e92d74 in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #13 0x00001555194d21e3 in suinif_ () from 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

### 2. MPI_Isend

```
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x8e78d2c0 (err: 5, VNI_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x8b5f13f0 (err: 5, VNI_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x8b5f26a0 (err: 5, VNI_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x89993830 (err: 5, VNI_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x899943b0 (err: 5, VNI_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x8b5e7ba0 (err: 5, PTLTE_NOT_FOUND)
libfabric:122654:1701781680::cxi:ep_data:recv_req_report():357<warn> nid005148: RXC (0x7151:0:0) PtlTE 19: Request error: 0x3151b530 (err: 5, VNI_NOT_FOUND)
MPICH ERROR [Rank 8] [job id 5093989.1] [Tue Dec 5 15:08:00 2023] [nid005148] - Abort(1008316303) (rank 8 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(161)................: MPI_Isend(buf=0xb11e6620, count=147872, dtype=0x4c000427, dest=9, tag=20000, comm=0xc400000c, request=0x7fffff4917b8) failed
MPID_Isend(578)................:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - VNI_NOT_FOUND)
```

gdb stack:

```
Thread 1 "ifsMASTER.SP" received signal SIGABRT, Aborted.
0x000015550d06bd2b in raise () from /lib64/libc.so.6
#0 0x000015550d06bd2b in raise () from /lib64/libc.so.6
#1 0x000015550d06d3e5 in abort () from /lib64/libc.so.6
#2 0x00001554e25b4c98 in MPID_Abort.cold () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e44bb6cc in MPIR_Handle_fatal_error () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e44bb7db in MPIR_Err_return_comm () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e2d96458 in PMPI_Isend () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x000015551178f3fc in pmpi_isend__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#7 0x000015550d4d09a2 in mpl_send_real4 (pbuf=..., kdest=10, ktag=20000, kcomm=<error reading variable: Cannot access memory at address 0x0>, kmp_type=<optimized out>, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=292632730, cdstring=...) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_send_mod.F90:176
#8 0x0000155519f75fa3 in trltog (pglat=..., kf_fs=<optimized out>, kf_gp=<optimized out>, kf_scalars_g=137, kvset=..., kptrgp=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:1
#9 0x000015551a044928 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=<optimized out>, kf_scalars=18, kf_scders=0, kf_gp=137, kf_fs=18, kf_out_lt=18, kvsetuv=<error reading variable: Location address is
not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:268 #10 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=18, kf_out_lt=18, kf_uv=0, kf_scalars=18, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1939536 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #11 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648 #12 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #13 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #14 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #15 
0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x00000000004064dc in master_ ()
```

### 3. MPI_Waitall

```
libfabric:33403:1701781677::cxi:ep_data:report_send_completion():3570<warn> nid007685: TXC (0x10db3:4:0): Request dest_addr: 477 caddr.nic: 0X10D23 caddr.pid: 5 rxc_id: 0 error: 0x86858800 (err: 5, PTLTE_NOT_FOUND)
MPICH ERROR [Rank 468] [job id 5093989.1] [Tue Dec 5 15:07:57 2023] [nid007685] - Abort(740962447) (rank 468 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(378)..............: MPI_Waitall(count=46, req_array=0x7fffff491740, status_array=0x7fffff3bf280) failed
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
```

gdb stack:

```
Thread 1 "ifsMASTER.SP" received signal SIGALRM, Alarm clock.
cxip_util_cq_progress (util_cq=0x2052f40) at prov/cxi/src/cxip_cq.c:566
#0 cxip_util_cq_progress (util_cq=0x2052f40) at prov/cxi/src/cxip_cq.c:566
#1 0x00001553ba678111 in ofi_cq_readfrom (cq_fid=0x2052f40, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#2 0x00001554e350c544 in MPIR_Waitall_impl () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#3 0x00001554e3572caf in MPIR_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#4 0x00001554e42d39e6 in MPIC_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#5 0x00001554e449cfbd in MPIR_CRAY_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#6 0x00001554e2726cda in PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpi_cray.so.12
#7 0x00001555117902a2 in pmpi_alltoallv__ () from /opt/cray/pe/mpich/8.1.27/ofi/cray/14.0/lib/libmpifort_cray.so.12
#8 0x000015550d4a8b36 in mpl_alltoallv_real4 (psendbuf=<error reading variable: value requires 51081888 bytes, which is more than max-value-size>, ksendcounts=..., precvbuf=<error reading variable: value requires 51081888 bytes, which is more than max-value-size>, krecvcounts=..., ksenddispl=..., krecvdispl=..., kmp_type=<error reading variable: Cannot access memory at address 0x0>, kcomm=-1006632939, kerror=<error reading variable: Cannot access memory at address 0x0>, krequest=<error reading variable: Cannot access memory at address 0x0>, cdstring=...)
at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/fiat/src/fiat/mpl/internal/mpl_alltoallv_mod.F90:319 #9 0x000015551a028ac6 in trltom (pfbuf_in=<error reading variable: value requires 51081888 bytes, which is more than max-value-size>, pfbuf=<error reading variable: value requires 51081888 bytes, which is more than max-value-size>, kfield=<optimized out>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/trltom_mod.F90:345 #10 0x0000155519fed86e in ltdir_ctl (kf_fs=17, kf_uv=0, kf_scalars=17, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1824576 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kfldptruv=<error reading variable: Location address is not set.>, kfldptrsc=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ltdir_ctl_mod.F90:90 #11 0x0000155519fd99ea in dir_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_uv=0, kf_scalars=17, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1824576 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, 
pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/dir_trans_ctl_mod.F90:258 #12 0x0000155519f7ea4b in dir_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/dir_trans.F90:525 #13 0x0000155515e92d74 in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #14 0x00001555194d21e3 in suinif_ () from 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#15 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#16 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#17 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#19 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#20 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#21 0x00000000004064dc in master_ ()
```

## Ranks with FFT crashes

Rank 12:

```
Thread 1 "ifsMASTER.SP" received signal SIGSEGV, Segmentation fault.
0x00001554f1a77cf4 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#0 0x00001554f1a77cf4 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#1 0x00001554f1aa1651 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#2 0x00001554f18d7345 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#3 0x00001554f18d7390 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#4 0x00001554f1897d88 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#5 0x00001554f1897e99 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#6 0x00001554f19e9fe9 in ??
() from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#7 0x00001554f19bd0b0 in hipModuleLoadData () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#8 0x00001554f166c818 in RTCKernel::RTCKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<char, std::allocator<char> > const&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#9 0x00001554f166e8ae in RTCKernel::runtime_compile(TreeNode&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#10 0x00001554f160aa8e in RuntimeCompilePlan(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#11 0x00001554f160622f in ProcessNode(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#12 0x00001554f1605c1a in rocfft_plan_create_internal(rocfft_plan_t*, rocfft_result_placement_e, rocfft_transform_type_e, rocfft_precision_e, unsigned long, unsigned long const*, unsigned long, rocfft_plan_description_t*) () from /opt/rocm-5.2.3/lib/librocfft.so.0
#13 0x00001554f1606a6b in rocfft_plan_create () from /opt/rocm-5.2.3/lib/librocfft.so.0
#14 0x000015550014167a in hipfftMakePlan_internal(hipfftHandle_t*, unsigned long, unsigned long*, hipfftType_t, unsigned long, hipfft_plan_description_t*, unsigned long*, bool) () from /opt/rocm-5.2.3/lib/libhipfft.so
#15 0x0000155500140c80 in hipfftMakePlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so
#16 0x00001555001404d5 in hipfftPlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so
#17 0x000015551a039e94 in create_plan_ffth_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#18 0x000015551a02520e in create_plan_fft (kplan=0x15551c3a3600 <__STATIC_LOCAL_6>, ktype=<error reading variable: Cannot access memory at address 0x1>, kn=<error reading variable: Cannot access memory at address 0x0>, klot=<error reading variable: Cannot access memory at address 0x1>) at
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/tpm_ffth.F90:112 #19 0x000015551a048910 in ftinv (preel=<error reading variable: value requires 105408000 bytes, which is more than max-value-size>, kfields=244) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_mod.F90:132 #20 0x000015551a042f95 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=0, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:171 #21 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading variable: value requires 1831784 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, fspgl_proc=0x0, 
pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #22 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at 
/pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648
#23 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#24 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#25 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#26 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#27 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#28 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#29 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#30 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so
#31 0x00000000004064dc in master_ ()
```

Rank 477:

```
Thread 1 "ifsMASTER.SP" received signal SIGSEGV, Segmentation fault.
0x00001554f1a77cf4 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#0 0x00001554f1a77cf4 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#1 0x00001554f1aa1651 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5
#2 0x00001554f18d7345 in ??
() from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #3 0x00001554f18d7390 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #4 0x00001554f1897d88 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #5 0x00001554f1897e99 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #6 0x00001554f19e9fe9 in ?? () from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #7 0x00001554f19bd0b0 in hipModuleLoadData () from /opt/rocm-5.2.3/lib/libamdhip64.so.5 #8 0x00001554f166c818 in RTCKernel::RTCKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<char, std::allocator<char> > const&) () from /opt/rocm-5.2.3/lib/librocfft.so.0 #9 0x00001554f166e8ae in RTCKernel::runtime_compile(TreeNode&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /opt/rocm-5.2.3/lib/librocfft.so.0 #10 0x00001554f160aa8e in RuntimeCompilePlan(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0 #11 0x00001554f160622f in ProcessNode(ExecPlan&) () from /opt/rocm-5.2.3/lib/librocfft.so.0 #12 0x00001554f1605c1a in rocfft_plan_create_internal(rocfft_plan_t*, rocfft_result_placement_e, rocfft_transform_type_e, rocfft_precision_e, unsigned long, unsigned long const*, unsigned long, rocfft_plan_description_t*) () from /opt/rocm-5.2.3/lib/librocfft.so.0 #13 0x00001554f1606a6b in rocfft_plan_create () from /opt/rocm-5.2.3/lib/librocfft.so.0 #14 0x000015550014167a in hipfftMakePlan_internal(hipfftHandle_t*, unsigned long, unsigned long*, hipfftType_t, unsigned long, hipfft_plan_description_t*, unsigned long*, bool) () from /opt/rocm-5.2.3/lib/libhipfft.so #15 0x0000155500140c80 in hipfftMakePlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so #16 0x00001555001404d5 in hipfftPlanMany () from /opt/rocm-5.2.3/lib/libhipfft.so #17 0x000015551a039e94 in create_plan_ffth_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #18 0x000015551a02520e 
in create_plan_fft (kplan=0x15551c3a3600 <__STATIC_LOCAL_6>, ktype=<error reading variable: Cannot access memory at address 0x1>, kn=<error reading variable: Cannot access memory at address 0x0>, klot=<error reading variable: Cannot access memory at address 0x1>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/tpm_ffth.F90:112 #19 0x000015551a048910 in ftinv (preel=<error reading variable: value requires 105408000 bytes, which is more than max-value-size>, kfields=244) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_mod.F90:132 #20 0x000015551a042f95 in ftinv_ctl (kf_uv_g=<optimized out>, kf_scalars_g=137, kf_uv=0, kf_scalars=17, kf_scders=0, kf_gp=137, kf_fs=17, kf_out_lt=17, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kptrgp=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/ftinv_ctl_mod.F90:171 #21 0x0000155519febb40 in inv_trans_ctl (kf_uv_g=0, kf_scalars_g=137, kf_gp=137, kf_fs=17, kf_out_lt=17, kf_uv=0, kf_scalars=17, kf_scders=0, pspvor=<error reading variable: Location address is not set.>, pspdiv=<error reading variable: Location address is not set.>, pspscalar=<error reading 
variable: value requires 1823896 bytes, which is more than max-value-size>, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, fspgl_proc=0x0, pspsc3a=<error reading variable: Location address is not set.>, pspsc3b=<error reading variable: Location address is not set.>, pspsc2=<error reading variable: Location address is not set.>, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgpuv=<error reading variable: Location address is not set.>, pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/internal/inv_trans_ctl_mod.F90:300 #22 0x0000155519f946ce in inv_trans (pspvor=..., pspdiv=..., pspscalar=..., pspsc3a=..., pspsc3b=..., pspsc2=..., fspgl_proc=0x0, ldscders=<error reading variable: Cannot access memory at address 0x0>, ldvorgp=<error reading variable: Cannot access memory at address 0x0>, lddivgp=<error reading variable: Cannot access memory at address 0x0>, lduvder=<error reading variable: Cannot access memory at address 0x0>, ldlatlon=<error reading variable: Cannot access memory at address 0x0>, kproma=16, kvsetuv=<error reading variable: Location address is not set.>, kvsetsc=..., kresol=1, kvsetsc3a=<error reading variable: Location address is not set.>, kvsetsc3b=<error reading variable: Location address is not set.>, kvsetsc2=<error reading variable: Location address is not set.>, pgp=<error reading variable: value requires 7417728 bytes, which is more than max-value-size>, pgpuv=<error reading variable: Location address is not set.>, 
pgp3a=<error reading variable: Location address is not set.>, pgp3b=<error reading variable: Location address is not set.>, pgp2=<error reading variable: Location address is not set.>) at /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/source/ectrans/src/trans/gpu/external/inv_trans.F90:648 #23 0x0000155515e9259d in specrt_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #24 0x00001555194d21e3 in suinif_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #25 0x000015551679bb31 in csta_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #26 0x00001555166466c5 in cnt3_glo_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #27 0x0000155516645cad in cnt3_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #28 0x0000155516645790 in cnt2_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #29 0x0000155516645099 in cnt1_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #30 0x0000155516641b77 in cnt0_ () from /pfs/lustrep3/scratch/project_465000454/vsingh/SCALING_27NOV/thread_safety_funneled/ifs-bundle/build.lumi-g_funneled/bin/../lib/libarpifs.SP.so #31 0x00000000004064dc in master_ () ``` ## MPI environment vars and MPI threading level The set of env variables: ``` MPICH_GPU_SUPPORT_ENABLED=1 CRAY_ACC_FORCE_EARLY_INIT=1 # 
force init at program start of all devices FI_CXI_OPTIMIZED_MRS="false" MPICH_ABORT_ON_ERROR=1 FI_CXI_RX_MATCH_MODE=software MPICH_SMP_SINGLE_COPY_MODE=NONE MPICH_ALLTOALL_INTRA_ALGORITHM=pairwise FI_CXI_EQ_ACK_BATCH_SIZE=1 MPICH_OFI_NIC_POLICY=GPU ROCFFT_RTC_CACHE_PATH=/tmp FI_CXI_DEFAULT_CQ_SIZE=131072 FI_CXI_OFLOW_BUF_SIZE=268435456 FI_CXI_OFLOW_BUF_COUNT=4 FI_CXI_DEFAULT_TX_SIZE=32768 export FI_LOG_LEVEL=info ``` MPI threading level is `MPI_THREAD_FUNNELED`
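
To rule out a mismatch between the intended settings and what each rank actually inherits, the environment can be dumped per task before the model starts. The following is a minimal Python sketch and was not part of the original run: the helper name `report_env` and the `<unset>` placeholder are illustrative, and the variable list is simply copied from the listing in this section.

```python
import os

# Variable names copied from the environment listing in this section.
MPI_ENV_VARS = [
    "MPICH_GPU_SUPPORT_ENABLED",
    "CRAY_ACC_FORCE_EARLY_INIT",
    "FI_CXI_OPTIMIZED_MRS",
    "MPICH_ABORT_ON_ERROR",
    "FI_CXI_RX_MATCH_MODE",
    "MPICH_SMP_SINGLE_COPY_MODE",
    "MPICH_ALLTOALL_INTRA_ALGORITHM",
    "FI_CXI_EQ_ACK_BATCH_SIZE",
    "MPICH_OFI_NIC_POLICY",
    "ROCFFT_RTC_CACHE_PATH",
    "FI_CXI_DEFAULT_CQ_SIZE",
    "FI_CXI_OFLOW_BUF_SIZE",
    "FI_CXI_OFLOW_BUF_COUNT",
    "FI_CXI_DEFAULT_TX_SIZE",
    "FI_LOG_LEVEL",
]


def report_env(environ=None):
    """Return 'NAME=value' lines, marking variables that are not set.

    `environ` defaults to the process environment; passing a dict makes
    the helper easy to exercise without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return [f"{name}={env.get(name, '<unset>')}" for name in MPI_ENV_VARS]


if __name__ == "__main__":
    # SLURM_PROCID is set by srun for each task; fall back to '?' elsewhere.
    rank = os.environ.get("SLURM_PROCID", "?")
    for line in report_env():
        print(f"[rank {rank}] {line}")
```

Run under the same `srun` geometry as the model, this prints one line per variable per task, so differences between nodes or ranks show up with a plain `sort | uniq -c` on the output. It only inspects the environment; the threading level is negotiated in `MPI_Init_thread` and has to be checked from the `provided` value returned there.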