Highlighting the major concepts and components needed for distributed training:
https://pytorch.org/tutorials/beginner/dist_overview.html
It is a single-program multiple-data (SPMD) training paradigm. The model is replicated on every GPU, and each GPU trains on a different subset of the data. Communication is needed to synchronize the gradients, so this approach relies on libraries that perform allreduce and broadcast operations, such as NCCL, Gloo, or an MPI implementation.
It only supports data parallelism, but it is the easiest way to do distributed training.
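The gradient synchronization step that data parallelism relies on can be illustrated with a minimal pure-Python sketch. This is not real NCCL/Gloo code; the `allreduce_mean` helper and the replica gradient lists are hypothetical, shown only to make the averaging behavior of an allreduce concrete:

```python
def allreduce_mean(grads_per_replica):
    """Average each parameter's gradient across replicas,
    mimicking the allreduce that NCCL or Gloo performs."""
    n_replicas = len(grads_per_replica)
    n_params = len(grads_per_replica[0])
    return [
        sum(replica[p] for replica in grads_per_replica) / n_replicas
        for p in range(n_params)
    ]

# Two replicas, each holding gradients for two parameters,
# computed from its own shard of the data.
replica_grads = [
    [0.2, -0.4],  # gradients on GPU 0
    [0.6,  0.0],  # gradients on GPU 1
]

synced = allreduce_mean(replica_grads)
print(synced)  # every replica then applies these averaged gradients
```

After this step all replicas hold identical gradients, which is what keeps the model copies in sync across training iterations.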
The distributed RPC framework supports multiple forms of parallelism, such as distributed pipeline parallelism and the parameter-server paradigm. Its API helps manage remote object lifetimes and access memory beyond machine boundaries.
This approach provides Remote Procedure Call (RPC), Remote Reference (RRef), Distributed Autograd, and Distributed Optimizer to control parameters across devices; these APIs are built on top of communication libraries such as NCCL, Gloo, or MPI. This approach can express many parallel strategies, but it is more complicated to implement.
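The parameter-server paradigm that the RPC framework enables can be sketched in plain Python. The `ParameterServer` and `Worker` classes below are hypothetical stand-ins: in the real framework the server would live on a remote node and be reached through RRefs and `rpc_sync` calls rather than direct method calls:

```python
class ParameterServer:
    """Holds the authoritative copy of the parameters; in the RPC
    framework this object would live on one node and be accessed
    through an RRef instead of a direct Python reference."""
    def __init__(self, params):
        self.params = list(params)

    def get_params(self):
        return list(self.params)

    def apply_gradients(self, grads, lr=0.1):
        # Simple SGD update applied at the server.
        self.params = [p - lr * g for p, g in zip(self.params, grads)]


class Worker:
    """Pulls parameters, computes gradients on its local data,
    and pushes them back; each call would be an RPC in practice."""
    def __init__(self, server):
        self.server = server

    def step(self, local_grads):
        _ = self.server.get_params()              # pull (remote read)
        self.server.apply_gradients(local_grads)  # push (remote update)


server = ParameterServer([1.0, 2.0])
workers = [Worker(server), Worker(server)]
workers[0].step([0.5, 0.5])
workers[1].step([0.5, -0.5])
print(server.params)
```

The design point this illustrates is that only the server mutates the parameters; workers stay stateless, which is what makes the paradigm flexible but also why the RPC framework needs explicit remote-object lifetime management.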
single node (varying batch size):
  256 -> out of memory
  128 -> 64.624
   64 -> 64.411
   32 -> 65.466
   16 -> 63.068
2 nodes
  dist     weak: 10.97, 11.16   strong: 11.4842, 10.87
  horovod  weak: 10.16          strong: 10.06
4 nodes
  dist     weak: 21.82          strong: 22.04, 20.33
  horovod  weak: 19.67          strong: 18.8