# Comparison between different allreduce algorithms on different platforms

## Horovod Results

Each custom operation was run 50 times and the timings were averaged:

```bash
x=50
while [ $x -gt 0 ]; do
    command        # placeholder for the benchmarked allreduce operation
    x=$((x - 1))
done
```

The benchmarks were run on 8 nodes with 4 GPUs each (32 GPUs total), over both 10 Gb/s and 100 Gb/s networks.

<!-- ![](https://i.imgur.com/QDGHQUK.png)
![](https://i.imgur.com/aKEnBTX.png) -->

![](https://i.imgur.com/eLCk0O4.png)
![](https://i.imgur.com/aqC9U8l.png)

#### Results Verification on Horovod Timeline

Env: 10 Gb/s network, 8 nodes x 4 GPUs = 32 GPUs, message size ~200 MB

* `NCCL_REDUCE` + `MPI_ALLREDUCE` + `NCCL_BCAST`
![](https://i.imgur.com/FwJajV8.png)
![](https://i.imgur.com/mckujKA.png)
* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/ckjPyto.png)
![](https://i.imgur.com/cRMPfN0.png)
* `MPI_ALLREDUCE`
![](https://i.imgur.com/1cLNTEK.png)
* `NCCL_REDUCE` + `NCCL_ALLREDUCE` + `NCCL_BCAST`
![](https://i.imgur.com/21OiWrk.png)
![](https://i.imgur.com/cPwdFiW.png)
* `NCCL_ALLREDUCE`
![](https://i.imgur.com/Ic3HYgb.png)
* `NCCL_REDUCESCATTER` + `NCCL_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/OpbOOxp.png)
![](https://i.imgur.com/JUcnzBH.png)

#### Torch Record Time (ms) vs. Python Time (s)

Env: 10 Gb/s network, 2 nodes x 4 GPUs = 8 GPUs

* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/GhgFdJq.png)

Confirming the MPI start and end times across ranks:

* `MPI_ALLREDUCE`
![](https://i.imgur.com/7FkzgtP.png)
-> `MPI_ALLREDUCE` starts at the same time on the same node
* `NCCL_ALLREDUCE`
![](https://i.imgur.com/TBQkZfg.png)
![](https://i.imgur.com/9epF484.png)
-> `NCCL_ALLREDUCE` starts at the same time on the same node

## OSU_BENCHMARKS + OMPI

<!-- ![](https://i.imgur.com/AZEUY33.png)
![](https://i.imgur.com/9tFBdAE.png)
![](https://i.imgur.com/Dj1sp6N.png) -->

## NCCL_TEST

#### Upgraded NCCL from `2.4.7` to `2.6.4`

#### Updated the script to force the network interface (`NET/SOCKET/0` and `NET/IB/0`)

![](https://i.imgur.com/ZiM8KR0.png)
![](https://i.imgur.com/4nOl6SL.png)

<!-- ![](https://i.imgur.com/anGlbZp.png) -->
<!-- ![](https://i.imgur.com/wYBQsJU.png)
![](https://i.imgur.com/S87Cnfb.png) -->
<!-- ![](https://i.imgur.com/xR34BAK.png)
![](https://i.imgur.com/kW2YqfZ.png) -->

---

#### Throughput

![](https://i.imgur.com/Nm198U7.png)
![](https://i.imgur.com/ewHOo3X.png)

<!-- ![](https://i.imgur.com/CGSmt8o.png)
![](https://i.imgur.com/7fk2wb3.png) -->
<!-- * Note: The average latency of openmpi allreduce algorithm indeed reduces ~10 times from 10g network to 100g network, however the throughput remains similar. The bandwidth test results also confirm this finding.
![](https://i.imgur.com/0qsoIaj.png) -->
<!-- ![](https://i.imgur.com/KjSK2Mz.png)
![](https://i.imgur.com/N3fpBe3.png) -->
<!-- ![](https://i.imgur.com/vwWT0cv.png)
![](https://i.imgur.com/kqbiiWN.png) -->

---

<!-- ![](https://i.imgur.com/gwSyj7I.png)
![](https://i.imgur.com/svdzzIc.png) -->
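For reproducibility, minimal launch sketches for each benchmark family above follow. For the Horovod runs, the timeline screenshots come from Horovod's built-in tracing. The hostnames and training script below are placeholders, and `HOROVOD_HIERARCHICAL_ALLREDUCE` is Horovod's stock toggle for intra-node NCCL + inter-node MPI allreduce; the custom operation combinations measured above may have been wired up differently:

```bash
# Horovod writes a chrome://tracing-compatible JSON trace for the run.
export HOROVOD_TIMELINE=/tmp/timeline.json

# Toggle Horovod's hierarchical allreduce (NCCL within a node, MPI across nodes).
export HOROVOD_HIERARCHICAL_ALLREDUCE=1

# 8 nodes x 4 GPUs = 32 processes; node1..node8 and train.py are placeholders.
horovodrun -np 32 \
    -H node1:4,node2:4,node3:4,node4:4,node5:4,node6:4,node7:4,node8:4 \
    python train.py
```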
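For the OSU_BENCHMARKS + OMPI measurements, an `osu_allreduce` run of roughly this shape matches the setup above; the hostfile and the pinned algorithm number are assumptions, included only to show how Open MPI's tuned allreduce algorithm can be selected explicitly:

```bash
# osu_allreduce from the OSU micro-benchmarks, over 32 ranks.
# coll_tuned_allreduce_algorithm pins Open MPI's implementation
# (e.g. 4 = ring, 6 = Rabenseifner); omit it to let Open MPI decide.
mpirun -np 32 --hostfile hosts \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./osu_allreduce -m 1024:268435456 -i 50   # 1 KB..256 MB, 50 iterations
```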
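Finally, for NCCL_TEST, forcing `NET/SOCKET/0` vs. `NET/IB/0` maps onto NCCL's standard environment variables, and the throughput curves correspond to `all_reduce_perf`-style sweeps from nccl-tests. A sketch, with placeholder interface and HCA names:

```bash
# NCCL_DEBUG=INFO logs which transport gets selected (NET/Socket vs NET/IB).
export NCCL_DEBUG=INFO

# Force plain TCP sockets on a specific interface ...
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

# ... or force a specific InfiniBand HCA instead:
# export NCCL_IB_DISABLE=0
# export NCCL_IB_HCA=mlx5_0

# all_reduce_perf from nccl-tests: 8 B to 256 MB, doubling sizes, 1 GPU per rank.
mpirun -np 32 --hostfile hosts ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```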