# CS3210 Assignment 2

## Steps to reproduce the result

Our program can be run on a Linux machine with `mpicc` installed via:

`make slots=<slots> file=<filename>`

```sh
# e.g. to run locally:
make local slots=4 file=test.in
```

```sh
# e.g. to run in the compute cluster:
make slots=4 file=test.in
```

### Note

This signals the program to utilize 4 slots. If there are fewer than 4 slots available, the [`--oversubscribe`](https://www.open-mpi.org/faq/?category=running#oversubscribing) flag will still enable the code to run.

#### If running on the compute cluster

Configure `machinefile.1` to store the correct machine hostnames. You should have SSH access to these nodes.

#### If running locally

Ensure that OpenMPI is installed, then run `make local slots=4 file=test.in`.

### Input file format

```
Number of particles on the square surface (N)
Size of the square (L)
Radius of the particle (R)
Number of steps (S)
print || perf              // either "print" or "perf"
id id_x id_y id_vx id_vy   // optional, for all particles
...
```

## Strategy for Assignment I (Recap)

For assignment 1, the main gist of the parallelism was to split the program into stages and parallelize each stage. The following are the stages that we implemented.

- Determine the collision preference of each particle. This could be with a wall, a particle, or nothing.
- Move the particles depending on the allocated preference.
- Flush the particles' preferences and repeat for the number of steps.

We derived some nice results to detect collisions continuously. By characterising the position and velocity of each particle as $\vec{p_{x}}$ and $\vec{v_{x}}$ respectively, we can detect _if_ two particles collide at a time $t$. This is great because for every pair of particles we can run a few FLOPs and figure out if they collide.

However, we noted that it would be computationally intensive to create a temporary matrix to store intermediate computation (the potential collision between _any_ two particles), so we opted to parallelize the computation for _every particle_. For every particle $p_i$ (one particle in one thread), we query every _other_ particle $p_j$, noting which particle it would prefer to collide with (based on the earliest time of collision). We also check for collision with the wall. After this, for every particle $p_i$ (one particle in one thread), we check if the particle's preferred collide-ee prefers $p_i$ as well. If there is a match, we collide both particles. Alternatively, if there is a pending collision with a wall instead, we collide with the wall.

### Mathematics for detecting collision

\begin{equation*}
\begin{split}
\left\| {v_{x1} \choose v_{y1}}t + {x_1 \choose y_1} - \left({v_{x2} \choose v_{y2}}t + {x_2 \choose y_2}\right)\right\| &= 2r \\
\left\| {v_{x1} - v_{x2} \choose v_{y1} - v_{y2}}t + {x_1 - x_2 \choose y_1 - y_2}\right\| &= 2r \\
[(v_{x1} - v_{x2})t + (x_1 - x_2)]^2 + [(v_{y1} - v_{y2})t + (y_1 - y_2)]^2 &= 4r^2
\end{split}
\end{equation*}

Hence:

\begin{equation*}
\begin{split}
a &= v_{x1} - v_{x2} \\
b &= x_1 - x_2 \\
c &= v_{y1} - v_{y2} \\
d &= y_1 - y_2 \\
4r^2 &= (at + b)^2 + (ct + d)^2 \\
(a^2 + c^2) t^2 + 2(ab + cd)t + (b^2 + d^2 - 4r^2) &= 0 \\
e &= a^2 + c^2 \\
f &= 2(ab + cd) \\
g &= b^2 + d^2 - 4r^2 \\
t &= \frac{-f \pm \sqrt{f^2 - 4eg}}{2e}
\end{split}
\end{equation*}
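To make the quadratic concrete, here is a minimal sketch in C of how the earliest collision time between two particles could be computed from it. The `particle_t` struct and the function name `collision_time` are illustrative assumptions, not taken verbatim from our source.

```c
#include <math.h>
#include <stdbool.h>

/* Illustrative particle representation: position (x, y) and velocity (vx, vy). */
typedef struct {
    double x, y;
    double vx, vy;
} particle_t;

/* Returns true and writes the earliest non-negative collision time into *t_out
 * if particles p and q (radius r) would touch; returns false otherwise. */
bool collision_time(const particle_t *p, const particle_t *q, double r, double *t_out) {
    double a = p->vx - q->vx;
    double b = p->x  - q->x;
    double c = p->vy - q->vy;
    double d = p->y  - q->y;

    double e = a * a + c * c;              /* quadratic coefficient */
    double f = 2.0 * (a * b + c * d);      /* linear coefficient */
    double g = b * b + d * d - 4.0 * r * r;

    if (e == 0.0) return false;            /* no relative motion */

    double disc = f * f - 4.0 * e * g;
    if (disc < 0.0) return false;          /* paths never come within 2r */

    /* The smaller root is the moment the surfaces first touch. */
    double t = (-f - sqrt(disc)) / (2.0 * e);
    if (t < 0.0) return false;             /* collision lies in the past */

    *t_out = t;
    return true;
}
```

Because each pairwise test is just these few FLOPs, a thread (or slot) can scan all other particles and keep the minimum `t` as its collision preference.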
## Improved strategy for assignment II

The above strategy was optimized for CUDA and OpenMP, which have shared memory. However, in OpenMPI, we are distributing computation across several machines. Hence, the communication model becomes non-trivial. Indeed, we also considered the master-slave model, where a master slot sends multiple jobs (comparisons) to other slots. However, we realised that we could do better.

We opted to model the problem as a distributed system, where **each slot represents a few particles**. This was inspired by our reading about trivial parallel sorting. The core idea is that, using a linear ($O(N)$) number of requests, a particle can figure out what it should be doing for the current step. Unfortunately, for I/O purposes, we initially use the master/slave model to bootstrap the data for all the slots.

### Steps in Assignment II approach

#### Initially, the "master":

- Allocates slots with particles. These slots will be responsible for computing the collisions of these particles.
- Note that the master slot is also allocated some particles for a more even distribution of work.

#### Initially, other slots:

- Receive the environment variables and particle information.

#### For each step, _each_ slot:

- Sends its particle(s) to all other slots.
- Receives particle(s) from other slots, and determines the collision preference for each particle that the slot is in charge of.
- Sends its collision preference for each particle that it is in charge of to every other slot.
- Receives the collision preference for every other particle, checking it against its own for a collision match (whether both the particle and its colliding particle want to collide with each other).
- Moves the particles (see the communication sketch after this list).
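The per-step exchange maps naturally onto MPI collectives. Below is a minimal sketch, assuming $N$ is divisible by the number of slots and that each particle is packed as four doubles (x, y, vx, vy); `simulation_step`, `local_pref` and the commented-out helpers are placeholders rather than our actual routines, and an uneven split would use `MPI_Allgatherv` instead.

```c
#include <mpi.h>

void simulation_step(double *local, int local_n,      /* particles this slot owns  */
                     double *all,                     /* buffer for all N particles */
                     int *local_pref, int *all_pref,  /* collision preferences      */
                     MPI_Comm comm) {
    /* 1. Every slot sends the particles it owns to every other slot, so each
     *    slot ends up with a full view of the system for this step. */
    MPI_Allgather(local, local_n * 4, MPI_DOUBLE,
                  all,   local_n * 4, MPI_DOUBLE, comm);

    /* 2. For each owned particle, pick the earliest collision partner
     *    (a particle index, a wall code, or "none"), e.g. with the pairwise
     *    collision_time() test sketched earlier. */
    /* compute_preferences(local, local_n, all, local_pref); */

    /* 3. Exchange the preferences the same way. */
    MPI_Allgather(local_pref, local_n, MPI_INT,
                  all_pref,   local_n, MPI_INT, comm);

    /* 4. Apply a particle-particle collision only when both particles prefer
     *    each other; otherwise resolve the wall collision or move freely. */
    /* resolve_and_move(local, local_n, all, all_pref); */
}
```

Each step therefore costs two collective exchanges of $O(N)$ data per slot, matching the linear request count described above.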
### Analysis of strategy in assignment II

- We experience a slower computation time when we use more than one machine, as the communication overhead is higher than the computation; in particular, it depends on the network throughput.
- There is more communication than in assignment I, as we do not have shared memory.
- It scales better than CUDA and OpenMP, as we have a mechanism to add more machines.
- We can combine OpenMPI with OpenMP to parallelize local computations.

## Experimental results and analysis

Unless stated otherwise, we fix $N = 10000$, $L = 10000$, $R = 1.0$ and $S = 10$ steps. This ensures that:

- There are $\frac{N}{L^2} = 0.0001$ particles per unit area, which is dense enough without breaking the assumptions laid out in the assignment PDF.
- The input is computationally expensive for a sequential algorithm, which would take $O(N^2)$ time.

### Legends:

- B: number of blocks (for CUDA)
- T: number of threads (for OpenMP and CUDA)

---

### Varying number of particles

Using all the machines available, and setting the number of slots to the number of cores, we see the following performance:

| Number of particles | $t_{1, mpi}$ (s) | $t_{2, mpi}$ (s) | $t_{3, mpi}$ (s) | $t_{4, mpi}$ (s) | $\bar{t}_{mpi}$ (s) |
| -- | -- | -- | -- | -- | -- |
| 2000 | 10.5 | 10.3 | 10.1 | 10.3 | 10.3 |
| 4000 | 41.3 | 42.9 | 44.7 | 46.3 | 43.8 |
| 6000 | 96.2 | 100.3 | 98.2 | 98.8 | 98.5 |
| 8000 | 149.0 | 150.2 | 153.8 | 153.4 | 151.6 |
| 10000 | 178.9 | 180.4 | 195.2 | 166.7 | 180.3 |

Runtime of CUDA against MPI:

![](https://i.imgur.com/PX59Wd6.png)

We note that for a small input size, CUDA is faster due to locality of data; at that scale the network overhead is too costly. However, once we scale the input size up, the slowdown from network costs is overshadowed by the faster $O(N)$ collision algorithm. Note that in order to achieve this, we would need the nodes to all be in the same network.

## Notes about OpenMPI assignment

1. For this assignment, we focused mainly on the **scalability** of the program. We wanted to create a solution that truly leverages MPI, and did not want to simply port over the old data-parallelism program.
2. We notice that double arithmetic performs much better now, as we are not using CUDA warps to run the FLOPs.
3. As mentioned before, while we can change $L$, $S$ and $r$, Part I already established how they affect run-time.

## Appendix

### Team members

- Ang Wei Neng [A0164178X]
- Vignesh Shankar [A0167354Y]