# Homework 2.1 Parallelizing a Particle Simulation

> Zixi Cai: implemented the blocking algorithm and parallelized and optimized it with OpenMP.
>
> Gang Yao: implemented the quadtree algorithm and parallelized it with OpenMP, but it was not efficient.

## Serial Blocking Algorithm Performance

### Performance Diagram

![center](https://i.imgur.com/uX9ptXl.png)

| N_PARTICLES | 10 | 100 | 1000 | 10000 | 100000 | 1000000 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Simulation Time (s) | 0.00235066 | 0.0236218 | 0.245257 | 3.6771 | 50.4904 | 610.136 |

### Algorithm Overview

First, define the block size as the particle cutoff distance plus a small safety margin, in our case `block_size = cutoff * 1.03`. Each block is represented by an `std::list` of pointers to the actual particle instances, and all blocks are stored in an `std::vector`. To update each frame, we also keep a vector with one `PLIST::iterator` per particle, recording where that particle sits in its block's list, so that particles can be removed from their old block and added to their new block efficiently.

```cpp
const double BLOCK_SIZE = cutoff * 1.03;

typedef std::list<particle_t *> PLIST;
typedef std::vector<PLIST> PLISTVEC;

PLISTVEC blocks;
std::vector<PLIST::iterator> parts_iters;
```

In the initialization step, we iterate through all the particles, put each one into the block where it belongs, and initialize the particle iterator vector to point to each particle's entry at the head of its block's linked list.

When simulating one step, we first iterate through all the blocks and check potential collisions between the current block and its neighbors. By Newton's third law, every action has an equal and opposite reaction. To exploit this property and avoid recomputation, we only check four of the eight neighbors of a given block. After iterating through all the blocks, we have the updated accelerations for all the particles.

Then we iterate through all the particles and update their positions based on the acceleration and the time step. With the updated position of a particle, we recompute the block bucket it belongs to. If its new position dictates that it should belong to a new bucket, we remove it from its old bucket and add it to the head of the new bucket.

![](https://i.imgur.com/7wDmolH.png)
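To make the neighbor traversal concrete, here is a minimal sketch of the serial force-computing phase using the data structures above. `apply_mutual_force` is a hypothetical helper (not part of the framework as written) that adds the pairwise interaction to *both* particles, which is what makes visiting only four of the eight neighbors sufficient; the block-index arithmetic is likewise illustrative.

```cpp
#include <iterator>  // std::next

// Hypothetical helper: adds the pairwise interaction to BOTH particles.
void apply_mutual_force(particle_t &a, particle_t &b);

// Offsets of the 4 neighbors each block visits; the remaining 4 are covered
// when those neighbors take their turn as the "current" block.
const int DX[4] = {1, 1, 0, -1};
const int DY[4] = {0, 1, 1,  1};

void compute_forces(PLISTVEC &blocks, int nx_blocks, int ny_blocks) {
    for (int bx = 0; bx < nx_blocks; ++bx) {
        for (int by = 0; by < ny_blocks; ++by) {
            PLIST &cur = blocks[bx * ny_blocks + by];

            // All distinct pairs inside the current block.
            for (auto i = cur.begin(); i != cur.end(); ++i)
                for (auto j = std::next(i); j != cur.end(); ++j)
                    apply_mutual_force(**i, **j);

            // Current block against 4 of its 8 neighbors.
            for (int k = 0; k < 4; ++k) {
                int nx = bx + DX[k], ny = by + DY[k];
                if (nx < 0 || nx >= nx_blocks || ny < 0 || ny >= ny_blocks)
                    continue;
                PLIST &nbr = blocks[nx * ny_blocks + ny];
                for (particle_t *p : cur)
                    for (particle_t *q : nbr)
                        apply_mutual_force(*p, *q);
            }
        }
    }
}
```

Because each unordered pair of neighboring blocks is visited exactly once, every particle pair within the cutoff is processed once, and applying the force to both particles halves the work compared to checking all eight neighbors.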
## Serial Quadtree Blocking Algorithm Performance

### Performance Diagram

![](https://i.imgur.com/aK4GUnL.png)

| N_PARTICLES | 10 | 100 | 1000 | 10000 | 100000 |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Simulation Time (s) | 0.00559471 | 0.0643842 | 0.929185 | 18.3705 | 426.512 |

### Algorithm Overview

A quadtree is a data structure used for 2D spatial partitioning. It is an extension of a binary tree, but each node has four children instead of two. Each node holds a rectangle descriptor and a list of particle pointers. We first define the maximum number of particles a quadtree node can hold as `max_ele`. Whenever the number of particles in a node exceeds `max_ele`, we split the node into four children; any particle that lies on a split boundary stays in the parent node. The QuadTree class offers a set of public interfaces including `insert`, `remove`, `update` and `query`.

Once the tree is constructed, we can iterate over the particle list and use each particle's bounding rectangle to query the quadtree. Any quadtree node that intersects the query rectangle is added to the list of potential collision candidates, and we then compute collisions between that list and the current particle. To slightly improve performance, we instead insert each particle while iterating over the particle list, check collisions only against the nodes already in the tree, and apply the mutual collision effect.

![](https://i.imgur.com/CgdWo9G.png)

However, this algorithm is almost four times slower than the simple blocking algorithm. Constructing the quadtree has a large overhead: every insertion and deletion takes O(log N) time, and so does every query. Since we iterate over all the particles, the overall complexity is O(N log N), whereas the simple blocking algorithm iterates over all the blocks and thus has guaranteed O(N) performance when the particles are distributed roughly evenly. In our case, the simple blocking algorithm is more suitable for N-body collision detection and also maps better onto a parallel computing model.

## Parallel Blocking Algorithm Performance

### Performance Diagram

![](https://i.imgur.com/yW8AHfH.png)
![](https://i.imgur.com/n9sEy3c.png)

| N_PARTICLES | 10 | 100 | 1000 | 10000 | 100000 | 1000000 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Simulation Time (s) | 0.0387869 | 0.0437257 | 0.0538881 | 0.173959 | 1.70446 | 22.1686 |
| Serial Simulation Time (s) | 0.00235066 | 0.0236218 | 0.245257 | 3.6771 | 50.4904 | 610.136 |
| Speedup | 0.0606 | 0.5402 | 4.5512 | 21.1378 | 29.6225 | 27.5225 |

### Synchronization

Every step of the simulation consists of two phases: calculating accelerations and moving particles. Both for-loops can be parallelized with `#pragma omp for`, and OpenMP automatically spawns the threads and manages their stacks for us.

For the first phase, if we checked all 8 neighboring blocks against a given block, all the threads spawned from the parallel for-loop would share the same memory `std::vector<PLIST> blocks`, but with one thread per block each thread would only modify the accelerations of the particles in its own block, so there would be no race conditions and no locks would be needed. However, to incorporate the "mutual force" optimization from the serial algorithm, potential race conditions emerge when two threads try to modify the acceleration of the same particle, so we need proper locking. We first need one lock per particle: `std::vector<omp_lock_t> parts_locks;`. Then, when updating the accelerations of two interacting particles, we surround each update with `omp_set_lock(&parts_locks[i])` and `omp_unset_lock(&parts_locks[i])`. This ensures that a competing update only writes the acceleration after the first one finishes.

For the second phase, when moving each particle on its own thread, we need to decide whether the particle has to be moved to another block. If two threads try to push their particles onto the front of the same block list, or try to remove particles from the same block, a race condition occurs. So we lock the block data before each remove and insert operation and release the lock afterwards. For this purpose we declare another lock vector, `std::vector<omp_lock_t> blocks_locks;`, in the initialization step.
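A minimal sketch of how these locks could wrap the two contended updates is shown below. `pair_interaction`, the `particles` array, and the `ax`/`ay` fields are assumptions standing in for the corresponding framework pieces, and every lock is assumed to have been initialized with `omp_init_lock` during setup.

```cpp
#include <omp.h>
#include <vector>

std::vector<omp_lock_t> parts_locks;   // one lock per particle
std::vector<omp_lock_t> blocks_locks;  // one lock per block

// Force phase: the mutual-force update writes BOTH particles, so each
// particle's acceleration is guarded by that particle's own lock.
void add_mutual_force(particle_t &a, particle_t &b, int ia, int ib) {
    double fx, fy;
    pair_interaction(a, b, fx, fy);    // hypothetical pure pairwise computation

    omp_set_lock(&parts_locks[ia]);
    a.ax += fx;  a.ay += fy;           // assumed particle_t fields
    omp_unset_lock(&parts_locks[ia]);

    omp_set_lock(&parts_locks[ib]);
    b.ax -= fx;  b.ay -= fy;           // equal and opposite contribution
    omp_unset_lock(&parts_locks[ib]);
}

// Move phase: erasing from the old block's list and pushing onto the new
// block's list both mutate shared std::lists, so each list is guarded by
// its own block lock.
void rebucket(int pi, int old_block, int new_block) {
    omp_set_lock(&blocks_locks[old_block]);
    blocks[old_block].erase(parts_iters[pi]);
    omp_unset_lock(&blocks_locks[old_block]);

    omp_set_lock(&blocks_locks[new_block]);
    blocks[new_block].push_front(&particles[pi]);
    parts_iters[pi] = blocks[new_block].begin();
    omp_unset_lock(&blocks_locks[new_block]);
}
```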
### Design Choices

Applying locks can sometimes hurt performance, since threads have to wait when contention occurs. In our case, however, contention is rare, so the benefit of cutting the computation in half by applying mutual forces clearly outweighs the locking overhead.

Another thing to consider: given that the blocking strategy already dramatically reduces the number of neighboring particles we need to visit when computing the force on a particle, we still have flexibility in how to parallelize the force-computing phase and the moving phase. For each phase we tried two strategies: block-wise parallelism (BP), dividing work by blocks, i.e. applying `#pragma omp for` to the loop iterating over all blocks; and particle-wise parallelism (PP), dividing work by particles, i.e. applying `#pragma omp for` to the loop iterating over all particles. We tried all combinations on the 1000000-particle test; the results are shown below (each combination is denoted by phase 1 strategy + phase 2 strategy).

| | BP+BP | BP+PP | PP+BP | PP+PP |
|-|-|-|-|-|
| Simulation Time (s) | 31.32 | **22.78** | 48.77 | 29.33 |

The best performance comes from applying block-wise parallelism to the force-computing phase and particle-wise parallelism to the moving phase, which is what we eventually settled on (a sketch of this combination is shown at the end of this section). Here is some analysis. In the force-computing phase we frequently access surrounding particles, so it is beneficial to access particles in a roughly spatially sequential manner to take advantage of caches; by accessing particles in the order given by the blocks, we maximize temporal locality. In the moving phase, the particles are independent of each other and we do not need to access particles spatially close to the one being moved, so no such locality can be exploited. If we applied block-wise parallelism here, we would increase the chance that two threads modify the same block, because threads working on adjacent blocks are more likely to touch each other's blocks when particles cross block boundaries, and we would also enlarge the overhead of keeping caches coherent across cores, since adjacent blocks might sit in the same cache line. Particle-wise parallelism avoids these problems because the particles are effectively in random order in memory.

There is one last optimization we tried: two-layered blocking. In the force-computing phase, we divide the work by superblocks (superblock-wise parallelism), i.e. big blocks composed of a matrix of the original blocks. A core processes one superblock at a time, and inside the superblock it processes blocks in a way similar to the serial code. Theoretically there is no difference between one-layered and two-layered blocking, because no matter how the parallel structure is laid out, the number of blocks we need to iterate over is the same, and so is the theoretical depth of the program. But in practice, a proper superblock size (several hundred blocks per superblock) gave us a minor performance boost, especially with large particle counts. Our guess is that a superblock is likely to fit into cache as a whole, so while a core iterates over the blocks in its assigned superblock, the cache hit rate increases.
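As a concrete illustration, here is a sketch of the BP+PP combination we settled on; `compute_forces_for_block`, `block_index_of`, `old_block_of`, and `move` are stand-in names for the corresponding pieces described above, not the exact submitted code.

```cpp
void simulate_one_step(particle_t *particles, int n) {
    #pragma omp parallel
    {
        // Phase 1: block-wise parallelism (BP) for force computation.
        // The implicit barrier after this loop guarantees all accelerations
        // are ready before any thread starts moving particles.
        #pragma omp for
        for (int b = 0; b < (int)blocks.size(); ++b)
            compute_forces_for_block(b);   // same 4-neighbor traversal as the serial code

        // Phase 2: particle-wise parallelism (PP) for moving and rebucketing.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            move(particles[i]);                      // integrator
            int nb = block_index_of(particles[i]);   // block implied by the new position
            if (nb != old_block_of[i]) {             // only touch the lists on a block change
                rebucket(i, old_block_of[i], nb);    // lock-protected, see the sketch above
                old_block_of[i] = nb;
            }
        }
    }
}
```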
### Performance Analysis

We see from the performance diagram that when N_PARTICLES is small, the OpenMP program shows minimal improvement or even worse performance. This is because, for small N_PARTICLES, a large share of the total work is thread communication, and the synchronization cost is more visible, taking up a larger portion of the total run time. But when N_PARTICLES is large, the speedup is obvious and settles at a roughly constant factor of around 30x.

<!--We know that the cori system has 16 cores and 2 threads per core. So the limit to P time speedup would be roughly 32 times. But when we account for synchronization time and communication time, the ideal speedup can never be reached. -->

Regarding idealized speedup: since our simulation is fully parallelized, with both phases running in parallel, the idealized speedup should equal the number of cores used. The chart below shows the actual and idealized speedup for different numbers of cores, tested with 100000 particles.

![](https://i.imgur.com/8UmRtgO.png)

| Number of Cores | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|-|-|-|-|-|-|-|-|
| Simulation Time (s) | 54.48 | 32.41 | 22.81 | 11.95 | 6.22 | 3.30 | 1.79 |
| Speedup | 0.927 | 1.558 | 2.214 | 4.225 | 8.117 | 15.300 | 28.207 |

It is interesting that although the actual speedup is far below the idealized speedup, the two are roughly linearly related. Breaking the runtime down into computation time and synchronization/communication time, the computation time is roughly reduced to 1/p for p cores, while the synchronization/communication time grows as p scales up, but at a rate that is less than linear in p.
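As a rough way to write this down (our notation, not fitted to the measurements): let $T_{\text{comp}}$ be the serial computation time and $T_{\text{sync}}(p)$ the synchronization/communication time on $p$ cores. Then

$$
T(p) \approx \frac{T_{\text{comp}}}{p} + T_{\text{sync}}(p),
\qquad
S(p) = \frac{T(1)}{T(p)},
$$

so as long as $T_{\text{sync}}(p)$ grows more slowly than linearly in $p$, $T(p)$ keeps shrinking as cores are added and the measured speedup keeps improving, while remaining below the ideal $S(p) = p$.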