# Comparison between different allreduce algorithms on different platforms
## Horovod Results
Each custom operation was run 50 times and the results were averaged.
```bash
x=50                 # number of repetitions
while [ $x -gt 0 ]; do
    command          # placeholder for the benchmarked operation
    x=$(($x-1))
done
```
The results were collected on 8 nodes, each with 4 GPUs (32 GPUs in total), over both a 10 Gb/s and a 100 Gb/s network.


#### Results Verification on Horovod Timeline
Env: 10 Gb/s network, 8 nodes x 4 GPUs = 32 GPUs, message size ~200 MB
* `NCCL_REDUCE`+`MPI_ALLREDUCE`+`NCCL_BCAST`


* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`


* `MPI_ALLREDUCE`

* `NCCL_REDUCE` + `NCCL_ALLREDUCE` + `NCCL_BCAST`


* `NCCL_ALLREDUCE`

* `NCCL_REDUCESCATTER` + `NCCL_ALLREDUCE` + `NCCL_ALLGATHER`
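
Timelines for each combination were inspected in the Horovod timeline viewer. As a sketch, a timeline for one of these runs can be recorded like this (the hostnames and training script are placeholders; `HOROVOD_TIMELINE` is Horovod's timeline switch):

```bash
# Record a Horovod timeline for a 32-GPU run (8 nodes x 4 GPUs).
# node1..node8 and train.py are placeholders for this sketch.
HOROVOD_TIMELINE=/tmp/timeline.json \
horovodrun -np 32 \
    -H node1:4,node2:4,node3:4,node4:4,node5:4,node6:4,node7:4,node8:4 \
    python train.py
```

The resulting JSON can be opened in `chrome://tracing` to check when each `NCCL_*`/`MPI_*` phase starts and ends on every rank.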


#### Torch Record Time (ms) vs. Python Time (s)
Env: 10 Gb/s network, 2 nodes x 4 GPUs = 8 GPUs
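A minimal sketch of how the two timings can be compared (the tensor size and the multiply are stand-ins for the real collective op; a single CUDA device is assumed):

```bash
python - <<'EOF'
import time
import torch

x = torch.randn(50 * 1024 * 1024, device="cuda")  # ~200 MB of fp32

start_ev = torch.cuda.Event(enable_timing=True)
end_ev = torch.cuda.Event(enable_timing=True)

t0 = time.time()              # Python wall clock, in seconds
start_ev.record()             # torch record time, in milliseconds
y = x * 2                     # stand-in for the benchmarked collective
end_ev.record()
torch.cuda.synchronize()      # wait for the GPU before reading either timer
t1 = time.time()

print("torch record time: %.3f ms" % start_ev.elapsed_time(end_ev))
print("python time:       %.6f s" % (t1 - t0))
EOF
```
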
* `NCCL_REDUCESCATTER` + `MPI_ALLREDUCE` + `NCCL_ALLGATHER`

Confirming the MPI start and end times on different ranks:
* `MPI_ALLREDUCE`

-> `MPI_ALLREDUCE` starts at the same time on the same node
* `NCCL_ALLREDUCE`


-> `NCCL_ALLREDUCE` starts at the same time on the same node
## OSU_BENCHMARKS+OMPI
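As a sketch of this setup, the OSU allreduce benchmark can be driven through Open MPI while pinning the tuned allreduce algorithm (the hostfile, message sizes, and algorithm number are assumptions; `coll_tuned_*` are Open MPI MCA parameters):

```bash
mpirun -np 32 --hostfile hosts \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    osu_allreduce -m 128:1048576
```

`ompi_info --param coll tuned --level 9` lists the available allreduce algorithm numbers.
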
## NCCL_TEST
#### Upgrade NCCL version from `2.4.7` to `2.6.4`
#### Update script to force use of network interfaces `NET/SOCKET/0` and `NET/IB/0`
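A sketch of how the script can force each transport (the interface and HCA names are placeholders; the `NCCL_*` variables and the `all_reduce_perf` flags are standard):

```bash
# Force NET/Socket: disable InfiniBand and pin the TCP interface (eth0 assumed).
NCCL_DEBUG=INFO NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=eth0 \
mpirun -np 32 --hostfile hosts ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

# Force NET/IB: re-enable InfiniBand and pin the HCA (device name assumed).
NCCL_DEBUG=INFO NCCL_IB_DISABLE=0 NCCL_IB_HCA=mlx5_0 \
mpirun -np 32 --hostfile hosts ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```

`NCCL_DEBUG=INFO` prints which `NET/...` transport NCCL actually selected, so the forced interface can be verified in the logs.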


---
#### Throughput
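For reading the throughput numbers: nccl-tests reports both an algorithm bandwidth and a bus bandwidth. For allreduce over `n` ranks with message size `S` and elapsed time `t`, the formula documented in the nccl-tests repo is:

$$
\text{algbw} = \frac{S}{t}, \qquad
\text{busbw} = \text{algbw} \cdot \frac{2(n-1)}{n}
$$

The `2(n-1)/n` factor accounts for each byte crossing the ring twice (reduce-scatter plus allgather), which makes busbw comparable across rank counts.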


* Note: the average latency of the Open MPI allreduce algorithms does drop roughly 10x going from the 10 Gb/s network to the 100 Gb/s network, yet the throughput remains similar. The bandwidth test results confirm this finding.