# GSoC 2023: [Mpi4py in DaCe] Add support for Mpi4py in DaCe
Contributor: Fu-Chiang, Chang
Mentor: Alexandros Nikolaos Ziogas
[DaCe](https://github.com/spcl/dace)
[Timeline](https://docs.google.com/document/d/19rWyh7smI7ZwfqhR2Q7_GeQqjMP5czjO_1BryMVA-R0/edit?pli=1)
## Introduction
DaCe, the Data-Centric Parallel Programming framework, takes code written in Python and other programming languages and translates it into highly optimized CPU, GPU, or FPGA programs through the Stateful DataFlow multiGraph (SDFG), a dataflow-based intermediate program representation. This intermediate representation exposes how much data a program moves and where, and allows a programmer to interactively optimize the application.
The goal of this project is to implement support for mpi4py in the DaCe framework. The Message Passing Interface (MPI) is the uncontested standard for communication in distributed-memory systems. While the MPI standard focuses on C and Fortran, the traditional languages of high-performance computing, an increasing number of scientific applications are written in Python. Mpi4py is a package that offers Python bindings for the MPI standard. ([[Mpi4py in DaCe] Add support for Mpi4py in DaCe](https://github.com/spcl/.github/blob/main/profile/gsoc.md#mpi4py-in-dace-add-support-for-mpi4py-in-dace))
This summer my work consisted of two parts: (1) adding new MPI support to the SDFG IR through LibraryNodes and (2) mapping mpi4py syntax to SDFG subgraphs that utilize the new nodes through the DaCe Python frontend. The following MPI capabilities were added, enabling DaCe to generate high-performance distributed programs for a wider audience.
1. send/recv
2. Alltoall
3. Comm split/free
4. MPI RMA functions: put/get/accumulate/fence/lock/unlock/free
5. A connector for ensuring the execution order
6. A synchronization checker for RMA functions
The rest of this report is organized as follows. The Methodology section introduces how each new feature is implemented via a library node and an mpi4py replacement, and the Implementation details section presents the details of each MPI capability. The Evaluation section evaluates the RMA functionality at scale. The Experience section summarizes this summer of code, and the Future work and Debugging notes sections provide advice for future development.
## Methodology
Adding new functionality to DaCe requires two pieces: a replacement, so that the Python frontend can parse the corresponding call, and a library node, to represent the function in the SDFG.
For most features, I follow a four-step process:
1. Implement [MPI function] library node
2. Implement test for [MPI function] library node
3. Implement replacement for [MPI function]
4. Implement test for [MPI function] replacement
### Library node
The library node is a crucial part of DaCe: it abstracts common operations and allows DaCe to provide performance portability across different platforms. In the SDFG representation, a library node describes the inputs and outputs of the function it represents.

> A library node in SDFG
This library node is later expanded into C code for better performance.

> C expansion of a library node
In order to create a library node, the developer needs to provide the input/output definition and its expansion, as in the [following example](https://github.com/Com1t/dace/blob/mpi4py_dev/dace/libraries/mpi/nodes/comm_split.py).
You can find a tutorial [here](https://github.com/spcl/dace/blob/ae3e6fe68425925c0827826591cd03af6715a6f4/tutorials/library_nodes.ipynb).
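To give a feeling for the structure, below is a minimal, simplified sketch of a library node with a single C expansion. The `Barrier` node, its `_in`/`_out` connectors, and the tasklet code are illustrative placeholders rather than the actual nodes added in this project; the decorators and the `environments.mpi.MPI` environment follow the pattern used by the existing MPI library nodes in DaCe.

```python
import dace
import dace.library
import dace.sdfg.nodes
from dace.transformation.transformation import ExpandTransformation
from dace.libraries.mpi import environments


@dace.library.expansion
class ExpandBarrierMPI(ExpandTransformation):
    # Reuse DaCe's MPI environment so the generated code is compiled
    # and linked against an MPI implementation.
    environments = [environments.mpi.MPI]

    @staticmethod
    def expansion(node, parent_state, parent_sdfg, **kwargs):
        # The expansion turns the library node into a C++ tasklet whose
        # body is the MPI call that will appear in the generated code.
        return dace.sdfg.nodes.Tasklet(node.name,
                                       node.in_connectors,
                                       node.out_connectors,
                                       "MPI_Barrier(MPI_COMM_WORLD);",
                                       language=dace.dtypes.Language.CPP)


@dace.library.node
class Barrier(dace.sdfg.nodes.LibraryNode):
    # Available expansions and the one chosen by default.
    implementations = {"MPI": ExpandBarrierMPI}
    default_implementation = "MPI"

    def __init__(self, name, *args, **kwargs):
        # "_in"/"_out" are dummy connectors used only to order this node
        # with respect to other nodes in the SDFG.
        super().__init__(name, *args, inputs={"_in"}, outputs={"_out"}, **kwargs)
```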
### Replacement
DaCe uses the Python `ast` module to parse Python code into an SDFG.
A function replacement substitutes a recognized function call with the specified library node.
All distributed replacements are defined in "[dace/frontend/common/distr.py](https://github.com/Com1t/dace/blob/mpi4py_dev/dace/frontend/common/distr.py)".
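As an illustration, here is a hedged sketch of what such a replacement looks like, reusing the illustrative `Barrier` node from the previous sketch. The frontend name `dace.comm.my_barrier` and the module `my_barrier_node` are hypothetical, but the `op_repository` decorator and the access-node/memlet wiring follow the pattern used in distr.py.

```python
import dace
from dace.frontend.common import op_repository as oprepo
from dace.sdfg import SDFG, SDFGState

# Hypothetical module holding the Barrier library node sketched above.
from my_barrier_node import Barrier


@oprepo.replaces('dace.comm.my_barrier')  # hypothetical frontend name
def _my_barrier(pv, sdfg: SDFG, state: SDFGState, buffer: str):
    libnode = Barrier('_Barrier_')
    desc = sdfg.arrays[buffer]
    read = state.add_read(buffer)
    write = state.add_write(buffer)
    # Route the buffer through the node's "_in"/"_out" connectors so that
    # the SDFG dataflow captures the ordering constraint.
    state.add_edge(read, None, libnode, '_in', dace.Memlet.from_array(buffer, desc))
    state.add_edge(libnode, '_out', write, None, dace.Memlet.from_array(buffer, desc))
    return None
```

After such a registration, a call to `dace.comm.my_barrier(buf)` inside a `@dace.program` would be turned into the corresponding library node in the generated SDFG.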
## Implementation details
1. send/recv
Implemented two replacements, one for send and one for recv.
2. Alltoall
Alltoall was probably the only common collective still missing when I started coding. This function comes in handy when redistributing data among all processes.
3. Comm split/free
Communicator split helps the user further divide `MPI_COMM_WORLD` into multiple smaller communicators, and communicator free releases them.
4. MPI RMA functions
MPI Remote Memory Access (RMA) functions are a group of MPI functions that allow an MPI process to communicate without both ends specifying the data movement, which reduces the amount of synchronization and provides better scalability.
This summer, I implemented all RMA transmission functions and the synchronization functions for both active and passive target modes (a usage sketch follows this list).
5. A connector for ensuring the execution order
By default, DaCe optimizes the SDFG by analyzing its data flow and separating subgraphs without explicit data dependencies into different parallel sections.
However, some MPI functions, such as the synchronization functions, have no explicit data flow. In that case, the optimization produces incorrect results. Thus, we introduce a new connector for specifying the ordering.

> Extra "_out" connector
6. A synchronization checker for RMA functions
RMA transmission functions such as put/get/accumulate are only allowed to start a transmission within an RMA epoch. Thus, an "_in" connector is introduced for these library nodes to ensure that a synchronization has already happened before the transmission. If no synchronization is found, a warning is raised to inform the user about the potential error.

> Extra "_in" connector
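As a usage illustration, here is a hedged sketch of the kind of mpi4py-style RMA code these replacements target inside a `@dace.program`. The program name, array sizes, and the exact subset of supported mpi4py calls and keyword arguments shown here are assumptions for illustration, not a definitive description of what the frontend accepts.

```python
import numpy as np
from mpi4py import MPI
import dace


@dace.program
def rma_shift(win_buf: dace.int32[10], send_buf: dace.int32[10],
              rank: dace.int32, size: dace.int32):
    # Active-target RMA: the epoch is opened and closed with Fence,
    # and Put transfers data into the window of the next rank.
    win = MPI.Win.Create(win_buf, comm=MPI.COMM_WORLD)
    win.Fence(0)
    win.Put(send_buf, target_rank=(rank + 1) % size)
    win.Fence(0)
    win.Free()


if __name__ == '__main__':
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    window = np.zeros(10, dtype=np.int32)
    payload = np.full(10, rank, dtype=np.int32)
    rma_shift(window, payload, rank=rank, size=size)
```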
## Evaluation - multi-node C-stationary matrix multiplication
To wrap up the summer, we decided to implement a C-stationary distributed GEMM algorithm [1] to test the scalability and stability of DaCe's mpi4py support. The original work focuses on SpMM, a common linear algebra operation, especially in AI and scientific computing: a sparse matrix A is multiplied by a dense matrix B, producing a dense result matrix C. For simplicity, however, we evaluate the algorithm with dense matrices only.
A matrix-to-matrix multiplication can be expressed as
`C = A @ B`
Each process owns a certain partition of the three matrices.
C-stationary means that the C matrix is never transmitted during the calculation; only blocks of A or B need to be collected on the fly.

> Data distribution of C-stationary GEMM[1]
The pseudo-code of this algorithm can be expressed as follows.

> Pseudo-code of C-stationary GEMM algorithm[1]
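For reference, the following is a hedged Python/mpi4py sketch of the C-stationary (SUMMA-like) loop, assuming a sqrt(p) x sqrt(p) process grid and square blocks of A, B, and C on every process; the function and variable names are illustrative and do not correspond to the exact code used in the evaluation.

```python
import numpy as np
from mpi4py import MPI


def c_stationary_gemm(A_local, B_local, comm):
    """C-stationary GEMM sketch on a sqrt(p) x sqrt(p) process grid."""
    p = comm.Get_size()
    sqrt_p = int(np.sqrt(p))
    rank = comm.Get_rank()
    px, py = rank // sqrt_p, rank % sqrt_p      # grid coordinates of this process

    row_comm = comm.Split(color=px, key=py)     # processes in the same grid row
    col_comm = comm.Split(color=py, key=px)     # processes in the same grid column

    C_local = np.zeros((A_local.shape[0], B_local.shape[1]), dtype=A_local.dtype)
    for k in range(sqrt_p):
        # The owner of A[px, k] broadcasts it along the row, the owner of
        # B[k, py] broadcasts it along the column; C never moves.
        A_k = A_local.copy() if py == k else np.empty_like(A_local)
        B_k = B_local.copy() if px == k else np.empty_like(B_local)
        row_comm.Bcast(A_k, root=k)
        col_comm.Bcast(B_k, root=k)
        C_local += A_k @ B_k                    # local dense multiply

    row_comm.Free()
    col_comm.Free()
    return C_local
```

In a DaCe version, the same pattern would live inside a `@dace.program`, with the process-local multiply presumably handled by MKL, as in the experimental setup below.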
To measure the performance of distributed GEMM, we conducted scaling experiments on the Taiwania 3 supercomputer. Each node has two 28-core Intel Xeon Platinum 8280 CPUs and 192 GB of memory, and the nodes are connected through InfiniBand HDR100. The DaCe-generated C/C++ code is compiled with the Intel C compiler and the Intel MPI library from Intel oneAPI 2022.2. We spawn one process per socket (2 processes per node) and 28 threads per process (equal to the number of physical cores in a socket) to better utilize MKL for process-local computation. The sizes of matrices A, B, and C are around 32768 * 32768 floats; for dimensions that are not divisible by sqrt(p), a round-up is introduced.

> Strong scalability (in time) of C-stationary GEMM
DaCe demonstrates commendable scalability: as the number of processes increases, the calculation time decreases. Up to 36 processes, the benefits of scaling are pronounced. Beyond 36 processes, however, the rate of decrease in calculation time slows down. This might be attributed to a low computational load per process, which allows each processor to finish its local computation quickly.

> Strong scalability (in parallel efficiency) of C-stationary GEMM
Parallel efficiency shows a similar trend: more processes improve performance, but the gain is not perfectly proportional to the number of processes. This could have two causes: (1) a sub-optimal matrix size was chosen, which cannot fully utilize the CPU, and (2) the overlap of computation and communication has not been studied yet, so communication is not hidden behind computation.
## Experience
Being a part of GSoC 2023 was a very enjoyable experience for me. Last year, DaCe was the reproducibility challenge of the Student Cluster Competition at SC22, and I was the person in charge of that challenge, validating the results of the work. GSoC 2023 gave me the chance to transition from a validator to a developer and to contribute to this project.
I want to thank Alex for being my mentor and for helping me explore HPC and the DaCe project further. I enjoyed all the meetings, code reviews, and design discussions. Thank you for your time and your patience!
## Future work
- Tuple for specifying displacement in MPI
Most of the transmission functions currently only support sending/receiving from the beginning of the buffer; accepting a tuple would allow specifying a displacement.
- A more thorough RMA synchronization check
The current implementation only checks the correctness of the synchronization logic when a new put/get/accumulate operation is added. However, there might be other cases to cover.
- MPI-IO
It is probably the only remaining unimplemented part of DaCe's mpi4py support.
- Integrate the sequence connector into all MPI nodes
Only the newly introduced MPI nodes have this feature.
- Find a pattern to hide communication behind computation
Currently, DaCe does not yet implement an optimization that overlaps computation with communication. Such an optimization should improve performance significantly!
## Debugging notes
1. When testing a new library node, DaCe might raise a `ValueError`. This happens because DaCe tries to simplify the SDFG and promote some values to symbols. What I do is turn off `simplify` when creating the SDFG (see the snippet after this list).

> Promotion Error
2. If the program does not run as expected, it might result from DaCe's fusion optimization. When a function has no explicit dependency in the SDFG, DaCe automatically puts it into an OpenMP parallel section, which might break the logic of some blocking functions.
3. How to [connect memlet and library node](https://hackmd.io/BMS60oLSQ3u7RNbcnWt2ig)
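For the first point, a minimal sketch of disabling simplification when building the SDFG (assuming an illustrative `@dace.program` named `prog`) looks like this:

```python
import dace


@dace.program
def prog(A: dace.float64[10]):
    A[:] = 0


# Building the SDFG without simplification avoids the scalar-promotion
# ValueError described above; simplification can still be applied later.
sdfg = prog.to_sdfg(simplify=False)
```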
## Reference
[1] Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2021. Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication. In Proceedings of the ACM International Conference on Supercomputing (ICS '21). Association for Computing Machinery, New York, NY, USA, 431–442. https://doi.org/10.1145/3447818.3461472