# Comparison between different allreduce algorithms on different platforms

## Horovod Results

Each custom operation was run 50 times and the timings were averaged:

```bash
x=50
while [ $x -gt 0 ]; do
    command        # placeholder for the benchmarked allreduce operation
    x=$((x - 1))
done
```

The benchmarks were run on 8 nodes with 4 GPUs each (32 GPUs total), over both 10 Gb/s and 100 Gb/s networks.

<!-- ![](https://i.imgur.com/QDGHQUK.png)
![](https://i.imgur.com/aKEnBTX.png) -->

![](https://i.imgur.com/eLCk0O4.png)
![](https://i.imgur.com/aqC9U8l.png)

#### Results Verification on Horovod Timeline

Env: 10 Gb/s network, 8 nodes x 4 GPUs = 32 GPUs, message size ~200 MB

* `NCCL_REDUCE` + `MPI_ALLREDUCE` + `NCCL_BCAST`
![](https://i.imgur.com/FwJajV8.png)
![](https://i.imgur.com/mckujKA.png)
* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/ckjPyto.png)
![](https://i.imgur.com/cRMPfN0.png)
* `MPI_ALLREDUCE`
![](https://i.imgur.com/1cLNTEK.png)
* `NCCL_REDUCE` + `NCCL_ALLREDUCE` + `NCCL_BCAST`
![](https://i.imgur.com/21OiWrk.png)
![](https://i.imgur.com/cPwdFiW.png)
* `NCCL_ALLREDUCE`
![](https://i.imgur.com/Ic3HYgb.png)
* `NCCL_REDUCESCATTER` + `NCCL_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/OpbOOxp.png)
![](https://i.imgur.com/JUcnzBH.png)

#### Torch Record Time (ms) vs. Python Time (s)

Env: 10 Gb/s network, 2 nodes x 4 GPUs = 8 GPUs

* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`
![](https://i.imgur.com/GhgFdJq.png)

Confirming the MPI start and end times across ranks:

* `MPI_ALLREDUCE`
![](https://i.imgur.com/7FkzgtP.png)
-> `MPI_ALLREDUCE` starts at the same time on the same node
* `NCCL_ALLREDUCE`
![](https://i.imgur.com/TBQkZfg.png)
![](https://i.imgur.com/9epF484.png)
-> `NCCL_ALLREDUCE` starts at the same time on the same node

## OSU_BENCHMARKS + OMPI

<!-- ![](https://i.imgur.com/AZEUY33.png)
![](https://i.imgur.com/9tFBdAE.png)
![](https://i.imgur.com/Dj1sp6N.png) -->

## NCCL_TEST

#### Upgraded NCCL from `2.4.7` to `2.6.4`

#### Updated the script to force the network interface (`NET/SOCKET/0` and `NET/IB/0`)

![](https://i.imgur.com/ZiM8KR0.png)
![](https://i.imgur.com/4nOl6SL.png)

<!-- ![](https://i.imgur.com/anGlbZp.png) -->
<!-- ![](https://i.imgur.com/wYBQsJU.png)
![](https://i.imgur.com/S87Cnfb.png) -->
<!-- ![](https://i.imgur.com/xR34BAK.png)
![](https://i.imgur.com/kW2YqfZ.png) -->

---

#### Throughput

![](https://i.imgur.com/Nm198U7.png)
![](https://i.imgur.com/ewHOo3X.png)

<!-- ![](https://i.imgur.com/CGSmt8o.png)
![](https://i.imgur.com/7fk2wb3.png) -->
<!-- * Note: The average latency of openmpi allreduce algorithm indeed reduces ~10 times from 10g network to 100g network, however the throughput remains similar. The bandwidth test results also confirm this finding.
![](https://i.imgur.com/0qsoIaj.png) -->
<!-- ![](https://i.imgur.com/KjSK2Mz.png)
![](https://i.imgur.com/N3fpBe3.png) -->
<!-- ![](https://i.imgur.com/vwWT0cv.png)
![](https://i.imgur.com/kqbiiWN.png) -->

---

<!-- ![](https://i.imgur.com/gwSyj7I.png)
![](https://i.imgur.com/svdzzIc.png) -->
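For reproducibility, minimal launch sketches for each benchmark family above follow. For the Horovod runs, the timeline screenshots come from Horovod's built-in tracing. The hostnames and training script below are placeholders, and `HOROVOD_HIERARCHICAL_ALLREDUCE` is Horovod's stock toggle for intra-node NCCL + inter-node MPI allreduce; the custom operation combinations measured above may have been wired up differently:

```bash
# Horovod writes a chrome://tracing-compatible JSON trace for the run.
export HOROVOD_TIMELINE=/tmp/timeline.json

# Toggle Horovod's hierarchical allreduce (NCCL within a node, MPI across nodes).
export HOROVOD_HIERARCHICAL_ALLREDUCE=1

# 8 nodes x 4 GPUs = 32 processes; node1..node8 and train.py are placeholders.
horovodrun -np 32 \
    -H node1:4,node2:4,node3:4,node4:4,node5:4,node6:4,node7:4,node8:4 \
    python train.py
```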
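For the OSU_BENCHMARKS + OMPI measurements, an `osu_allreduce` run of roughly this shape matches the setup above; the hostfile and the pinned algorithm number are assumptions, included only to show how Open MPI's tuned allreduce algorithm can be selected explicitly:

```bash
# osu_allreduce from the OSU micro-benchmarks, over 32 ranks.
# coll_tuned_allreduce_algorithm pins Open MPI's implementation
# (e.g. 4 = ring, 6 = Rabenseifner); omit it to let Open MPI decide.
mpirun -np 32 --hostfile hosts \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./osu_allreduce -m 1024:268435456 -i 50   # 1 KB..256 MB, 50 iterations
```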
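Finally, for NCCL_TEST, forcing `NET/SOCKET/0` vs. `NET/IB/0` maps onto NCCL's standard environment variables, and the throughput curves correspond to `all_reduce_perf`-style sweeps from nccl-tests. A sketch, with placeholder interface and HCA names:

```bash
# NCCL_DEBUG=INFO logs which transport gets selected (NET/Socket vs NET/IB).
export NCCL_DEBUG=INFO

# Force plain TCP sockets on a specific interface ...
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

# ... or force a specific InfiniBand HCA instead:
# export NCCL_IB_DISABLE=0
# export NCCL_IB_HCA=mlx5_0

# all_reduce_perf from nccl-tests: 8 B to 256 MB, doubling sizes, 1 GPU per rank.
mpirun -np 32 --hostfile hosts ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```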