# Course / Book / Implementation : MPI \& Distributed Systems

# MPI Programming

> Reference \:
> * Parallel and High Performance Computing
> https://www.manning.com/books/parallel-and-high-performance-computing

## MPI Basics

### OpenMPI \& MPICH

> Reference \:
> * Which one is easier to master, OpenMPI or MPICH?
> https://www.reddit.com/r/HPC/comments/1b9mtgq/which_one_is_easier_to_master_openmpi_or_mpich/
> * MPICH vs OpenMPI
> https://stackoverflow.com/questions/2427399/mpich-vs-openmpi#25493270

:::info
:information_source: **MPI Commands**
* _mpicc_ : Compiler wrapper for C code
* _mpic++_ : Compiler wrapper for C++ code
* _mpirun -n \<nprocs\>_ : Start an MPI program with \<nprocs\> processes
:::

### Basic Program Structure

```cpp=
#include <mpi.h>
#include <stdio.h> // for C++ use cstdio

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank : process id within the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); // nprocs : number of processes in the communicator

    MPI_Finalize(); // synchronizes and shuts down the MPI environment
    return 0;
}
```

### Collective Communications

> Reference \:
> * Parallel and High Performance Computing
> https://www.manning.com/books/parallel-and-high-performance-computing

![{1F63B748-1C92-4CDE-807B-CA999BF83A15}](https://hackmd.io/_uploads/ByeVG4t5aR.png =x400)

## Message Passing

### Blocking Send/Receive

A blocking send or receive does not return until its condition is fulfilled \: a send returns once the send buffer is safe to reuse, and a receive returns once the message has arrived. This makes certain call orderings deadlock.

#### Program Hang Example # 1

Every rank calls *MPI_Recv()* first, so every rank blocks waiting for data that no one ever sends; the program hangs.

```cpp=
MPI_Recv(recvData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Send(sendData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD);
```

#### Program Hang Example # 2

The preallocated (eager) buffer might not be large enough, in which case *MPI_Send()* waits until the matching receive is posted. If both ranks block in *MPI_Send()* at the same time, the program hangs.

```cpp=
MPI_Send(sendData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD);
MPI_Recv(recvData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```
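Another standard way to break both deadlocks is to use the non-blocking *MPI_Isend()* / *MPI_Irecv()* calls. The following is a minimal sketch of my own (not from the book), reusing the variables from the examples above: both operations are posted immediately, so neither rank ever sits blocked inside a single send or receive.

```cpp=
// Non-blocking sketch: post both transfers first, then wait for completion.
MPI_Request requests[2];
MPI_Irecv(recvData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD, &requests[0]);
MPI_Isend(sendData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD, &requests[1]);
// The buffers must not be touched until MPI_Waitall() returns.
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
```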
#### Program Example (using **MPI_Sendrecv**)

By using the combined *MPI_Sendrecv* call, the responsibility for avoiding hangs and deadlocks, as well as the responsibility for good performance, is delegated to the MPI library.

```cpp=
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

constexpr int dataSize = 64;

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double sendData[dataSize], recvData[dataSize]; // local data
    for (int i = 0; i < dataSize; ++i)
        sendData[i] = static_cast<double>(i);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs % 2 == 1) {
        if (rank == 0) // only one process prints the error message
            printf("Should invoke this program with an even number of processes\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE); // terminate all MPI processes
    }

    int tag = rank / 2; // integer division creates a pair-group id
    int partnerRank = ((rank / 2) * 2) + (rank % 2 == 0); // (rank % 2 == 0) yields [1, 0, 1, 0, ...], pairing ranks (0,1), (2,3), ...
    printf("rank : %d, partnerRank : %d\n", rank, partnerRank);

#ifdef FOR_HANG
    // If the message is small, the preallocated buffer can store the send data and the call returns (no hang).
    // If the message is too big, the send waits for the matching receive to allocate a buffer (hang).
    MPI_Send(sendData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD);
    MPI_Recv(recvData, dataSize, MPI_DOUBLE, partnerRank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif

    MPI_Sendrecv(sendData, dataSize, MPI_DOUBLE, partnerRank, tag,
                 recvData, dataSize, MPI_DOUBLE, partnerRank, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 1) {
        for (int i = 0; i < dataSize; ++i) {
            if (i % 8 == 0)
                printf("\n");
            printf("%lf, ", recvData[i]);
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
```

### Reduction Functions

MPI provides built-in reduction operations that are passed to the reduction function call.

:::info
:information_source: **MPI Reduction Operations**
* _MPI\_MAX_ : return the maximum value
* _MPI\_MIN_ : return the minimum value
* _MPI\_SUM_ : return the sum across processes
* _MPI\_MAXLOC_ : return the maximum value and its location
* _MPI\_MINLOC_ : return the minimum value and its location
:::

#### Example

Compile with \: *mpic++ fileName -o programName **-lpthread***

```cpp=
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <chrono>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double startTime, mainTime, minTime, maxTime, avgTime;

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD); // make sure every rank starts timing together
    startTime = MPI_Wtime();

    std::this_thread::sleep_for(std::chrono::seconds(5)); // stand-in for real work

    mainTime = MPI_Wtime() - startTime;

    MPI_Reduce(&mainTime, &maxTime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&mainTime, &minTime, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&mainTime, &avgTime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        avgTime /= nprocs;
        printf("Time for work is Min: %lf Max: %lf Avg: %lf seconds\n", minTime, maxTime, avgTime);
    }

    MPI_Finalize();
    return 0;
}
```
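*MPI_MAXLOC* and *MPI_MINLOC* operate on (value, index) pairs, so they need a paired datatype such as *MPI_DOUBLE_INT*. Below is a minimal sketch of my own (not from the book) that finds the maximum of a per-rank value together with the rank that owns it:

```cpp=
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // MPI_DOUBLE_INT matches a {double, int} pair; the int carries the location.
    struct { double value; int rank; } local, global;
    local.value = static_cast<double>((rank * 37) % 11); // arbitrary per-rank value
    local.rank = rank;

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %lf found on rank %d\n", global.value, global.rank);

    MPI_Finalize();
    return 0;
}
```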
### Scatter and Gather

Distribute data from one process to all processes, or gather data from all processes back to one process.

The following program snippet is part of the MergeSort implementation on GitHub:
https://github.com/Chen-KaiTsai/Project_Notes/blob/main/Performance_Engineering/OpenMP%26OpenMPI/MergeSort.cpp

```cpp=
int main(int argc, char **argv)
{
    int rank, nprocs;
    int *input_data;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int partial_data_size = N / nprocs;
    int *partial_data = (int *)malloc(partial_data_size * sizeof(int));

    if (rank == 0) { // only the root process allocates and fills the full input
        input_data = (int *)malloc(N * sizeof(int));
        int i;
        for (i = 0; i < N; i++)
            input_data[i] = rand();
    }

    double time_x = MPI_Wtime();

    MPI_Scatter(input_data, partial_data_size, MPI_INT,
                partial_data, partial_data_size, MPI_INT, 0, MPI_COMM_WORLD);
    ...
```

### Ghost Cells / Halo Cells

*TODO \: Continue when working on multi-GPU project.*

> Reference \:
> https://github.com/essentialsofparallelcomputing/Chapter8/tree/master/GhostExchange

A technique that helps *avoid if statements in the computational loop and keeps message transfers to a minimum*.

> 1. https://stackoverflow.com/questions/66272972/process-to-process-computation-and-communication-in-mpi
> 2. http://www.claudiobellei.com/2018/09/30/julia-mpi/

The two links above provide detailed concepts and implementations.

![](https://hackmd.io/_uploads/H18VTEU43.png)

Generally, we want to avoid if statements in the main computational loop. Instead of checking boundaries explicitly, add cells surrounding the mesh and set them to appropriate values before the main computational loop runs.

* *Domain-boundary halo* \: For imposing a specific set of boundary conditions.

### MPI_Pack Example

```cpp
int MPI_Pack(const void *inbuf, int incount, MPI_Datatype datatype,
             void *outbuf, int outsize, int *position, MPI_Comm comm)
```

:::info
:information_source: **Parameters**
* _inbuf_ : input buffer
* _incount_ : input data size **in number of elements** (integer)
* _datatype_ : datatype of the input data items
* _outbuf_ : output buffer
* _outsize_ : output buffer size **in bytes** (integer)
* _position_ : current position in the buffer **in bytes** (integer) **[incremented by this function]**
* _comm_ : communicator for the packed message (handle)
:::
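As a minimal sketch of how the pieces fit together (my own example, not from the book; `partnerRank` and `tag` are borrowed from the earlier examples), the sender packs mixed data into one buffer and ships it as *MPI_PACKED*, and the receiver unpacks in the same order:

```cpp=
// Sender: pack an int and a double array into one buffer.
char buffer[256];
int position = 0; // advanced by every MPI_Pack call

int count = 4;
double values[4] = {0.5, 1.5, 2.5, 3.5};

MPI_Pack(&count, 1, MPI_INT, buffer, 256, &position, MPI_COMM_WORLD);
MPI_Pack(values, count, MPI_DOUBLE, buffer, 256, &position, MPI_COMM_WORLD);

// position now holds the packed size in bytes.
MPI_Send(buffer, position, MPI_PACKED, partnerRank, tag, MPI_COMM_WORLD);

// Receiver: MPI_Unpack mirrors the MPI_Pack calls in the same order.
MPI_Recv(buffer, 256, MPI_PACKED, partnerRank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
position = 0;
MPI_Unpack(buffer, 256, &position, &count, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buffer, 256, &position, values, count, MPI_DOUBLE, MPI_COMM_WORLD);
```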
## MPI Data Types for Performance and Code Simplification

Complex data can be encapsulated in a custom MPI datatype, so that it can be transferred with a reduced number of `send` or `receive` calls.

* `MPI_Type_contiguous` \: Replicates a datatype into contiguous locations.
* `MPI_Type_vector` \: Replicates a datatype into equally sized blocks with a regular stride (see the sketch after this section).
* `MPI_Type_create_subarray` \: Creates a rectangular subset of a larger array.
* `MPI_Type_indexed` or `MPI_Type_create_hindexed` \: Blocks with arbitrary lengths and displacements (the `h` variant takes displacements in bytes).

![{3D3DB7D2-9E72-4CB8-B6AD-E905DD0598A6}](https://hackmd.io/_uploads/rkDzVp5T0.png =x250)
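As an example of the vector constructor, here is a minimal sketch of my own (`partnerRank` and `tag` assumed as in the earlier examples): *MPI_Type_vector* describes one column of a row-major 2D array, so the whole column can be sent with a single *MPI_Send*.

```cpp=
// Describe one column of a row-major NROWS x NCOLS array.
constexpr int NROWS = 4, NCOLS = 8;
double grid[NROWS][NCOLS];
for (int r = 0; r < NROWS; ++r)
    for (int c = 0; c < NCOLS; ++c)
        grid[r][c] = r * NCOLS + c;

MPI_Datatype columnType;
// NROWS blocks of 1 element each, separated by a stride of NCOLS elements.
MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &columnType);
MPI_Type_commit(&columnType); // must commit before use in communication

// Send column 2: the start address is &grid[0][2], count is 1 of the new type.
// The receiver may receive it as NROWS contiguous MPI_DOUBLEs.
MPI_Send(&grid[0][2], 1, columnType, partnerRank, tag, MPI_COMM_WORLD);

MPI_Type_free(&columnType); // release the datatype when no longer needed
```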
# Building Cluster

> Reference \:
> * Running an MPI Cluster within a LAN
> https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
> * Raspberry Pi Cluster
> https://www.instructables.com/Raspberry-Pi-Cluster/
> * How to build a Raspberry Pi cluster
> https://www.raspberrypi.com/tutorials/cluster-raspberry-pi-tutorial/
> * Building a Raspberry Pi Cluster
> https://glmdev.medium.com/building-a-raspberry-pi-cluster-784f0df9afbd
> * Building a Raspberry Pi Cluster: Step-by-Step Guide and Practical Applications
> https://www.sunfounder.com/blogs/news/building-a-raspberry-pi-cluster-step-by-step-guide-and-practical-applications

## Hardware

*Some of the hardware is old, unused hardware that can be repurposed as part of the cluster.*

*Note that the first iteration will not include the two existing compute servers in my current server build, nor the NAS. I want to keep it simple.*

### Hardware Features that can be Taken into Consideration

:::info
:information_source: **PoE** \(Power over Ethernet\) \: lets a switch power the boards over the network cable, reducing cabling.
:::

### Hardware List

* **Power Supply** \: [PX PWC-10013B 100W GaN](https://www.px.com.tw/zh-tw/products/PWC-10013B#%E7%94%A2%E5%93%81%E5%8A%9F%E8%83%BD)
* **USB-A to USB Type-C \(kept for supplying power\)** \: [Kamera USB 3.2 Type-C male to USB-A female OTG adapter - 10Gbps/120W/20V/6A](https://24h.pchome.com.tw/prod/DCACKN-A900H2780)
* **Nodes Board \(Candidate \#1\)** \: [Orange Pi One](https://shopee.tw/%E6%B8%85%E5%80%89%E8%B2%B7%E4%B8%80%E9%80%81%E4%B8%80%EF%BC%8C%E9%A6%99%E6%A9%99%E6%B4%BEH3%E6%99%B6%E7%89%87%E9%96%8B%E7%99%BC%E6%9D%BF-%E8%89%AF%E5%93%81%E7%89%88-orange-pi-one-%E5%96%AE%E6%9D%BF%E9%9B%BB%E8%85%A6-i.236794101.29173021966?sp_atk=207280b0-5400-41a9-877b-97dd625dc2e0&xptdk=207280b0-5400-41a9-877b-97dd625dc2e0)
* **Switch \(Candidate \#1\)** \: [Zyxel GS1100-16v3](https://24h.pchome.com.tw/prod/DRAF0I-A900B35IX)
* **AP for WiFi** \: TOTOLINK N300RB-Plus
* **Nodes Board \(Candidate \#2\)** \: [Asrock Beebox Series J3000](https://www.asrock.com/nettop/Intel/Beebox%20Series/index.tw.asp?cat=)
* **NFS Storage** \: [SanDisk Ultra Fit USB 3.2 flash drive 512GB](https://24h.pchome.com.tw/prod/DGCA3K-A900GKLD3)

### Diagram in Planning Stage \(2025\/04\/20\)

![image](https://hackmd.io/_uploads/Hyd_aRfJxx.png)

The current setup \(2025\/05\/08\) will be \:

* No switch
* VLAN \#2 goes into the TOTOLINK N300RB-Plus
* The N300RB-Plus will be in DHCP mode
* Compute nodes are Orange Pi One boards

## Applications

### Running an MPI Cluster within a LAN

https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

# 從異世界歸來發現只剩自己不會 Kubernetes 系列 \(iThome series\: "Back from another world, only to find I'm the only one who doesn't know Kubernetes"\)

> Reference \:
> https://ithelp.ithome.com.tw/articles/10287149

:::info
:information_source: **Microservices**
* Wiki
https://zh.wikipedia.org/wiki/%E5%BE%AE%E6%9C%8D%E5%8B%99
:::

## Kubernetes Cluster

> Reference \:
> {%preview https://kubernetes.io/docs/concepts/architecture/ %}

# Designing Distributed Systems

> Reference \:
> * Designing Distributed Systems
> https://www.oreilly.com/library/view/designing-distributed-systems/9781098156343/

*This book is mostly conceptual. Therefore, I will only include extra material that I find while studying it.*

:::info
:information_source: **HTTP Response Codes**
* Wiki
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
:::

> In general, the goal of a container is to establish boundaries around specific resources (e.g., this application needs two cores and 8 GB of memory). Likewise, the boundary delineates team ownership (e.g., this team owns this image). Finally, the boundary is intended to provide separation of concerns (e.g., this image does this one thing).

## Single Node Pattern

### The Sidecar Pattern

Use a sidecar container alongside an application container to provide additional functionality.

:::info
:information_source: `topz`
https://hub.docker.com/r/brendanburns/topz
:::

:::info
:information_source: **Dockerfile \& Docker Compose**
https://old-oomusou.goodjack.tw/docker/dockerfile-dockercompose/
:::

# Setup Distributed System \: Orange Pi One Cluster

> Reference \:
> * Official Website
> https://www.armbian.com/
> * Building a Raspberry Pi Cluster Part I
> http://glmdev.medium.com/building-a-raspberry-pi-cluster-784f0df9afbd
> * Building a Raspberry Pi Cluster Part II
> https://glmdev.medium.com/building-a-raspberry-pi-cluster-aaa8d1f3d2ca
> * Building a Raspberry Pi Cluster Part III
> https://glmdev.medium.com/building-a-raspberry-pi-cluster-f5f2446702e8

## Hardware

![image](https://hackmd.io/_uploads/r1j2IyFwxx.png)

* Switch \: Zyxel GS1200-8
* Nodes \: [Orange Pi One](http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-One.html)

## Armbian OS Basics

### OS Consideration

For the cluster OS, I am going to use Armbian and its variants. The image I am going to use is `Armbian_25.2.2_Orangepione_noble_current_6.6.72_xfce_desktop.img.xz`.

The reason I use Armbian is that it is maintained with a relatively new Linux kernel, while the official image is quite old.

```
~$ uname --all
Linux orangepione 6.6.72-current-sunxi #2 SMP Fri Jan 17 12:36:27 UTC 2025 armv7l armv7l armv7l GNU/Linux
```

### SSH Server

The `ssh` server is enabled and running in the default image, so connecting to the Orange Pi One works out of the box.

### Monitor System with `armbianmonitor`

The following is the help page of `armbianmonitor`:

```
:~$ armbianmonitor -h
Usage: armbianmonitor [-h] [-b] [-c $path] [-d $device] [-D] [-m] [-p] [-r] [-u]

Options:
   -c /path/to/test    Performs disk health/performance tests
   -d                  Monitors writes to $device
   -D                  Tries to upload debug disk info to improve armbianmonitor
   -m                  Provides simple CLI monitoring - scrolling output
   -M                  Provides simple CLI monitoring - fixed-line output
   -n                  Provides simple CLI network monitoring - scrolling output
   -N                  Provides simple CLI network monitoring - fixed-line output
   -p                  Tries to install cpuminer for performance measurements
   -r                  Tries to install RPi-Monitor
   -u                  Tries to upload armbian-hardware-monitor.log for support purposes
   -v                  Tries to verify installed package integrity
   -z                  Runs a quick 7-zip benchmark to estimate CPU performance
```

## Separation of Network Domains

![image](https://hackmd.io/_uploads/B1Dz7ytPxe.png)

The router allows me to assign a different subnet to each port, so these nodes will not share an IP range with all the other servers. This makes debugging easier since they are also physically separated.

## SLURM Primer

> Reference \:
> * Slurm basics \(Slurm 基本說明\)
> https://man.twcc.ai/@twccdocs/doc-twnia2-main-zh/https%3A%2F%2Fman.twcc.ai%2F%40twccdocs%2Fguide-twnia2-slurm-intro-zh
> * Day 12\: Introduction to the National Center for High-performance Computing \(國家高速網路與計算中心\)
> https://ithelp.ithome.com.tw/m/articles/10327016
> * Wiki
> https://zh.wikipedia.org/zh-tw/Slurm%E5%B7%A5%E4%BD%9C%E8%B0%83%E5%BA%A6%E5%B7%A5%E5%85%B7

## SLURM Setup

> Reference \:
> * Slurm installation guide \(Slurm 安裝流程\)
> https://hackmd.io/@stargazerwang/S1t60i6lF
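To make the primer concrete, here is a minimal SLURM batch-script sketch of my own (the job name, program name, and resource numbers are placeholders, not taken from the referenced guides):

```bash
#!/bin/bash
#SBATCH --job-name=mpi_test      # name shown by squeue
#SBATCH --nodes=2                # number of nodes to allocate
#SBATCH --ntasks-per-node=4      # MPI ranks per node
#SBATCH --time=00:10:00          # wall-clock time limit

# srun launches one process per task; with an MPI-aware build it
# takes the place of mpirun.
srun ./programName
```

The script would be submitted with `sbatch job.sh` and monitored with `squeue`.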