# Hackathon notes - Francesco Andreucci
## Scope of the project
My main objective is to compare the performance of some NCCL routines to that of their MPI counterparts. To do this, I wrote three mini-app codes: one for the bidirectional ping-pong, one for allreduce and one for all-to-all.
### Bidirectional pingpong
| 1 node | 2 nodes |
| -------- | -------- |
|  |  |
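For context, here is a minimal sketch of what the core loop of the bidirectional ping-pong looks like with NCCL; the variable names, the CUDA-event timing and the bandwidth line are my assumptions, not a verbatim copy of the mini-app. Both ranks post their send and their receive inside one group call, so the two directions are in flight at the same time.
```
/* Assumed setup: rank is 0 or 1, d_send/d_recv are device buffers of
   `count` ints, comm is an initialised ncclComm_t, stream a cudaStream_t. */
int peer = 1 - rank;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
for (int it = 0; it < niters; it++) {
    ncclGroupStart();
    ncclSend(d_send, count, ncclInt, peer, comm, stream);
    ncclRecv(d_recv, count, ncclInt, peer, comm, stream);
    ncclGroupEnd();
}
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
/* bidirectional bandwidth per rank in GB/s: bytes moved in both
   directions, divided by the elapsed time */
double bw = 2.0 * sizeof(int) * count * (double)niters / (ms * 1e6);
```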
### AllReduce
For the allreduce, here are the plots I already have for 1, 2, 4 and 8 nodes (HPCX vs NCCL):
| 1 node | 2 nodes |
| -------- | -------- |
|  |  |
| 4 nodes | 8 nodes |
|  |  |
1) Intranode: NCCL is evidently able to utilise NVLink better than HPCX (which seems a bit surprising to me, since HPCX is from NVIDIA too...)
2) Internode: on 4 and 8 nodes NCCL saturates the bandwidth (the bandwidth for the collective is taken as `BW = size*niters/time`; see the timing sketch after the table below), while HPCX gets progressively worse as the number of nodes increases (which is more or less acceptable). On the other hand, on 2 nodes NCCL gains an extra factor of 2.
Indeed, profiling on 2, 4 and 8 nodes shows that on 2 nodes there are roughly 3 communications per NIC per iteration, while on 4 and 8 nodes there are 6. As to why this happens, it is probably related to the innards of NCCL and how the actual reduction algorithm is implemented. One hypothesis that was suggested is that NCCL is able to recognise that there are only 2 nodes, in which case only one exchange is needed to perform the allreduce.
| 2 nodes (8 GPUs) | 4 nodes (16 GPUs) | 8 nodes (32 GPUs) |
| -------- | -------- | -------- |
|  |  |  |
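For reference, this is roughly how the numbers above can be obtained around the allreduce call; a sketch assuming CUDA-event timing and an in-place `ncclAllReduce` (buffer and variable names are placeholders, not necessarily what the mini-app uses):
```
/* Assumed setup: d_buf is a device buffer of `count` floats, comm an
   initialised ncclComm_t, stream a cudaStream_t. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
for (int it = 0; it < niters; it++) {
    ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);
}
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
/* BW = size * niters / time, with size in bytes and time in seconds,
   expressed here in GB/s */
double bw = (double)count * sizeof(float) * niters / (ms * 1e6);
```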
### All-to-all
Here the main objective is to figure out the better interplay between streams in the NCCL implementation. There are mainly two options: one stream per peer, or a single stream for all the send/recv pairs.
```
/* Option 1: one stream per destination rank */
cudaStream_t *streams = (cudaStream_t *)malloc(nprocs * sizeof(cudaStream_t));
for (int i = 0; i < nprocs; i++) {
    cudaStreamCreate(&streams[i]);
}
ncclGroupStart();
for (int k = 0; k < nprocs; k++) {
    ncclSend(&send[dim*k], dim, ncclInt, k, comm, streams[k]);
    ncclRecv(&recv[dim*k], dim, ncclInt, k, comm, streams[k]);
}
ncclGroupEnd();
for (int j = 0; j < nprocs; j++) {
    cudaStreamSynchronize(streams[j]);
}
```
or:
```
/* Option 2: a single stream for all the send/recv pairs */
ncclGroupStart();
for (int k = 0; k < nprocs; k++) {
    ncclSend(&send[dim*k], dim, ncclInt, k, comm, stream);
    ncclRecv(&recv[dim*k], dim, ncclInt, k, comm, stream);
}
ncclGroupEnd();
cudaStreamSynchronize(stream);
```
From profiling with nsys it appears that NCCL just replaces whatever is inside the group calls with a single kernel, `ncclKernel_SendRecv_Sum`, so the choice of running with one or multiple streams is immaterial.
One can think of two possible ways to initialise the buffers for testing the all-to-all:
```
#ifdef partial
/* allocate only for the number of tasks actually used in this run */
int *send = (int *)malloc(sizeof(int) * dim * nprocs);
int *recv = (int *)malloc(sizeof(int) * dim * nprocs);
#else
/* allocate for the maximum number of tasks, Nprocs */
int *send = (int *)malloc(sizeof(int) * Nprocs * dim);
int *recv = (int *)malloc(sizeof(int) * Nprocs * dim);
#endif
```
where `Nprocs` is the maximum number of tasks we are going to use (with one task per GPU, running on 1, 2, 4, 8 or 16 nodes gives Nprocs = 64). Which variant to pick depends on whether one wants to test weak or strong scaling.
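Since NCCL collectives normally operate on device memory, I assume the host buffers above are then staged on the GPU (or replaced by `cudaMalloc`'d ones) before being passed to `ncclSend`/`ncclRecv`; a minimal sketch under that assumption, with an arbitrary fill pattern and `d_send`/`d_recv` as illustrative names (the sizes would follow the same `#ifdef` as above):
```
/* fill the host buffer with something recognisable, e.g. the sender's rank */
for (int i = 0; i < dim * nprocs; i++) {
    send[i] = rank;
}

/* device buffers actually handed to ncclSend/ncclRecv */
int *d_send, *d_recv;
cudaMalloc((void **)&d_send, sizeof(int) * dim * nprocs);
cudaMalloc((void **)&d_recv, sizeof(int) * dim * nprocs);
cudaMemcpy(d_send, send, sizeof(int) * dim * nprocs, cudaMemcpyHostToDevice);
```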
#### Preliminary weak scaling results
##### One node

##### Two nodes
