Highlighting the major concepts and components needed for distributed training:
https://pytorch.org/tutorials/beginner/dist_overview.html
It is a single-program multiple-data (SPMD) training paradigm. The model is replicated on every GPU, and each GPU trains on a different subset of the data. Communication is needed to synchronize the gradients, so this approach relies on libraries that perform allreduce and broadcast operations, such as NCCL, Gloo, or an MPI implementation.
It only supports data parallelism, but it is the easiest way to do distributed training.
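The gradient synchronization step that data parallelism relies on can be illustrated with a minimal pure-Python sketch. This is not real NCCL/Gloo code; the `allreduce_mean` helper and the replica gradient lists are hypothetical, shown only to make the averaging behavior of an allreduce concrete:

```python
def allreduce_mean(grads_per_replica):
    """Average each parameter's gradient across replicas,
    mimicking the allreduce that NCCL or Gloo performs."""
    n_replicas = len(grads_per_replica)
    n_params = len(grads_per_replica[0])
    return [
        sum(replica[p] for replica in grads_per_replica) / n_replicas
        for p in range(n_params)
    ]

# Two replicas, each holding gradients for two parameters,
# computed from its own shard of the data.
replica_grads = [
    [0.2, -0.4],  # gradients on GPU 0
    [0.6,  0.0],  # gradients on GPU 1
]

synced = allreduce_mean(replica_grads)
print(synced)  # every replica then applies these averaged gradients
```

After this step all replicas hold identical gradients, which is what keeps the model copies in sync across training iterations.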
The distributed RPC framework supports multiple forms of parallelism, such as distributed pipeline parallelism and the parameter-server paradigm. Its API helps manage remote object lifetimes and access memory beyond machine boundaries.
This approach provides Remote Procedure Call (RPC), Remote Reference (RRef), Distributed Autograd, and Distributed Optimizer to control parameters across devices; these APIs are built on top of communication libraries such as NCCL, Gloo, or MPI. This approach can express many parallel strategies, but it is more complicated to implement.
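The parameter-server paradigm that the RPC framework enables can be sketched in plain Python. The `ParameterServer` and `Worker` classes below are hypothetical stand-ins: in the real framework the server would live on a remote node and be reached through RRefs and `rpc_sync` calls rather than direct method calls:

```python
class ParameterServer:
    """Holds the authoritative copy of the parameters; in the RPC
    framework this object would live on one node and be accessed
    through an RRef instead of a direct Python reference."""
    def __init__(self, params):
        self.params = list(params)

    def get_params(self):
        return list(self.params)

    def apply_gradients(self, grads, lr=0.1):
        # Simple SGD update applied at the server.
        self.params = [p - lr * g for p, g in zip(self.params, grads)]


class Worker:
    """Pulls parameters, computes gradients on its local data,
    and pushes them back; each call would be an RPC in practice."""
    def __init__(self, server):
        self.server = server

    def step(self, local_grads):
        _ = self.server.get_params()              # pull (remote read)
        self.server.apply_gradients(local_grads)  # push (remote update)


server = ParameterServer([1.0, 2.0])
workers = [Worker(server), Worker(server)]
workers[0].step([0.5, 0.5])
workers[1].step([0.5, -0.5])
print(server.params)
```

The design point this illustrates is that only the server mutates the parameters; workers stay stateless, which is what makes the paradigm flexible but also why the RPC framework needs explicit remote-object lifetime management.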
single node (varying batch size):
  256 -> out of memory
  128 -> 64.624
   64 -> 64.411
   32 -> 65.466
   16 -> 63.068
2 nodes
  dist     weak: 10.97, 11.16   strong: 11.4842, 10.87
  horovod  weak: 10.16          strong: 10.06
4 nodes
  dist     weak: 21.82          strong: 22.04, 20.33
  horovod  weak: 19.67          strong: 18.8