- [Introduction](#introduction)
  - [Team](#team)
- [Approach to Bitonic Sort in Batched + Multi-Threading](#approach-to-bitonic-sort-in-batched--multi-threading)
  - [Effects of batch size and thread count](#effects-of-batch-size-and-thread-count)
  - [Thread Pool & Bitonic Sort](#thread-pool--bitonic-sort)
  - [Implementation with MPI](#implementation-with-mpi)
- [Performance](#performance)
  - [Results](#results)
  - [Performance Caution](#performance-caution)
  - [Possible Reasons for Slowdown](#possible-reasons-for-slowdown)
    - [A note about the MOC](#a-note-about-the-moc)
- [Other Small Changes](#other-small-changes)
- [Usage Notes](#usage-notes)
- [Future Changes](#future-changes)

# Introduction

This document describes our approach to, and results from, parallelizing bitonic sort within the *Secrecy* framework using MPI.

## Team

- Lucas Ou
- Ian Saucy
- Bang Tran

# Approach to Bitonic Sort in Batched + Multi-Threading

Although bitonic sort inherently lends itself to parallelization, special care has to be taken in this case because *Secrecy* depends on batched operation for optimization. In addition, the nature of the MPC environment further complicates the implementation.

The general approach is as follows: each phase of bitonic sort has one or more columns, and the comparisons within a column can all be performed in parallel because they are guaranteed not to overlap with any other comparison in the same column. Thus, we can partition the comparisons in a particular column of a given phase so that threads share the load of performing them. Each thread gets the following partition size:

<div markdown="1" align="center">
<img src="https://bit.ly/2Sd6kJb" align="center" border="0" alt="\text{partition size} = \frac{\text{input size}}{2 * \text{total threads}}" width="235" height="36" />
</div>

Naturally, the starting comparison number for the ith thread is:

<div markdown="1" align="center">
<img src="https://bit.ly/3eN1CcH" align="center" border="0" alt="partition_i = thread_i * \frac{\text{input size}}{2*\text{thread count}}" width="279" height="36" />
</div>

Note that this is the starting comparison *number*, not the index of the starting comparison. The index must be found via a separate method that depends on the current column, phase, and round of bitonic sort; see `get_next_index()` in `primitives.c` for exactly how this is done.

The first challenge is ensuring that each thread on each party operates on the same partition of data as its corresponding thread on the next and previous parties in the ring. That is to say, the ith thread should be communicating with the ith thread on the other parties in the ring. This is accomplished through a consistent partitioning scheme that lets each party compute the same partitions independently, without any communication. We partition based on the input size, the batch size, and finally the number of threads.

Next, we need to make sure all threads across all parties are on the same column. This is handled through pthread barrier synchronization: we wait for all threads to complete their work for a given column before moving on to the next column.
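To make the scheme above concrete, here is a minimal sketch of a worker thread. The names (`worker_args_t`, `sort_worker`) are illustrative assumptions, not *Secrecy*'s actual API: every thread derives its own slice of comparisons from the same arithmetic, so all parties agree on the partitions without communicating, and a shared pthread barrier keeps all threads on the same column.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types and names for illustration only. */
typedef struct {
    long input_size;                    /* number of rows, a power of two */
    int  thread_count;                  /* number of worker threads       */
    int  thread_id;                     /* this worker's index i          */
    pthread_barrier_t *column_barrier;  /* shared by all workers          */
} worker_args_t;

static void *sort_worker(void *arg)
{
    worker_args_t *w = (worker_args_t *)arg;

    /* Every column contains input_size / 2 comparisons, so each thread
     * owns a fixed-size, non-overlapping slice of them. */
    long partition_size = w->input_size / (2L * w->thread_count);
    long first_cmp      = (long)w->thread_id * partition_size;

    int phases = 0;                     /* log2(input_size) */
    for (long n = w->input_size; n > 1; n >>= 1)
        phases++;

    for (int phase = 1; phase <= phases; phase++) {
        for (int column = phase; column >= 1; column--) {
            /* Map comparison numbers [first_cmp, first_cmp + partition_size)
             * to element indices for this phase/column -- in Secrecy this is
             * the job of get_next_index() -- then run the batched secure
             * comparisons for this slice. */

            /* No thread may start the next column until every thread has
             * finished the current one. */
            pthread_barrier_wait(w->column_barrier);
        }
    }
    return NULL;
}
```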
## Effects of batch size and thread count

Given that the partitioning of comparisons in a given column depends on the input size, the batch size, and the thread count, a couple of considerations must be kept in mind when selecting the batch size and thread count. There is a maximum number of threads usable for a given input and batch size, because each thread has to have at least a batch-size number of comparisons to complete. For this reason we calculate the maximum number of threads supported by the current input and verify that it is not smaller than the supplied thread count; if it is, our code falls back to using the calculated maximum. In concrete terms, since each column contains half as many comparisons as there are input rows, the maximum number of threads for a given input is:

<div align="center" markdown="1">

**max threads = input size / (2 × batch size)**

</div>

Thus, care should be taken when choosing a batch size and thread count for a given input in order to increase the utilization of the cluster and achieve improved performance.

## Thread Pool & Bitonic Sort

We implemented a thread pool using pthreads to help parallelize the sorting operations. The advantage of a thread pool is that it lets us reuse threads from one column to the next, saving the considerable cost of creating new threads each time. Most of our API for the thread pool is fairly straightforward and written to support general-purpose use of the pool. The most notable feature is the method of synchronization: since no thread may advance until everyone is ready to move on to the next column, we use barrier synchronization to move everyone forward at once. Barrier synchronization simplifies our code drastically, but has the drawback that any shared resource the threads use (such as MPI) experiences very high contention at the beginning of each round of work.

## Implementation with MPI

Given the importance of having threads communicate within their sub-groups (or, more accurately, sub-rings), we had to identify a method to ensure that each thread only received the shares that belonged to it. We accomplished this through MPI message tags, which allow both the sender and the receiver to specify the tag for the current message, letting us 'filter' messages to the correct thread at each party. That said, 'filter' is not quite the right word, since this all happens seamlessly inside the MPI library.

We did look at other approaches, such as separate MPI communication contexts or `MPI_Comm_spawn`, but these methods did not fit the structure of *Secrecy*. The latter approach has the benefit of running separately from the current MPI communicator, but it requires that the operation being spawned be a compiled binary; in essence, it is a way of starting an MPI session from within an MPI session. Although it would be possible to rewrite the code to support this, we decided against it because of the complexity involved.
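As a minimal sketch of the tagging scheme (the function name and buffer layout are illustrative assumptions, not *Secrecy*'s actual code), using the thread id as the message tag is enough for MPI to route each message to the matching thread on the neighboring parties, even though all threads share one communicator:

```c
#include <mpi.h>

/* Illustrative only. Exchanges one batch of shares with the successor and
 * predecessor parties in the ring. Assumes MPI was initialized with
 * MPI_THREAD_MULTIPLE so that several threads may call into MPI at once. */
void exchange_shares(long long *send_buf, long long *recv_buf, int count,
                     int succ_rank, int pred_rank, int thread_id)
{
    /* The thread id doubles as the message tag: the i-th thread on this
     * party only matches messages sent by the i-th thread on its neighbor. */
    MPI_Sendrecv(send_buf, count, MPI_LONG_LONG, succ_rank, thread_id,
                 recv_buf, count, MPI_LONG_LONG, pred_rank, thread_id,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```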
# Performance

We tested performance on AWS instances within the same region, using `c6g.4xlarge` instances, which have 16 cores, 32 GB of RAM, and network bandwidth of up to 10 Gbps. Our tests consisted of sorting an input of random data with a length of `2^19` (524,288) rows on a single sort column. We also tested various batch sizes and thread counts.

## Results

As can be seen below, the batch size did not make a large difference in execution time, but we see a solid decrease in computation time all the way through 16 threads.

<div align="center" markdown="1">

![Execution time of threaded bitonic sort across batch sizes and thread counts](bitonic-sort-graph.png)

</div>

See [here](../results-emp/bitonic_sort_threaded.csv) for the complete data behind this graph.

## Performance Caution

Although the results are promising, they do not include the time required to generate the random numbers that are essential to the secure functioning of *Secrecy*. For the scope of this project we removed random number generation until a thread-safe version is built. Thus, we expect the final results to be slightly worse under the same thread counts.

## Possible Reasons for Slowdown

Given that bitonic sort in *Secrecy* is mostly CPU-bound, one would expect computation time to keep decreasing as threads increase, but in practice this is bounded by network bandwidth and MPI overhead. The most likely reason we see no gains beyond 8 or 16 threads is that MPI's mechanism for letting multiple threads send data at the same time may force some sort of queueing. Thus, as the thread count increases, contention for MPI increases and performance drops. Another possible cause is a network bottleneck on AWS; a different batch size could yield increased performance, as could moving to VMs that support higher network throughput.

### A note about the MOC

Most of our initial development and testing was done on the MOC. This suffices for development, but another provider should be used for performance testing. This is primarily because network throughput is throttled on the MOC, causing inaccurate test results.

# Other Small Changes

We made a few other changes throughout the code, the primary one being the removal of random number generation until a thread-safe version exists, as discussed in the performance caution above.

# Usage Notes

The most important note is that the thread count must be a power of two; this has to do with how we calculate partitions. Otherwise, the normal restrictions of *Secrecy* apply.

# Future Changes

Although the results from our parallelization of bitonic sort are very good (over 600% improvement!), further optimizations could be made. The major one we have identified is removing the need to use MPI for general communication. It is unclear exactly how MPI handles multiple threads under the hood, and this is most likely impacting performance. An improved solution might use simple TCP sockets between parties, allowing faster communication.
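As a rough illustration of that direction (the function, the port scheme, and the one-socket-per-thread-pair idea are all assumptions, not an implemented design), each pair of corresponding threads on neighboring parties could hold its own dedicated TCP connection, which would remove the contention on the single shared MPI context:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical sketch: open a dedicated TCP connection to a neighboring
 * party. With one connection per thread pair, threads would no longer
 * contend for a single shared communication channel as they do with MPI. */
int connect_to_party(const char *ip, int base_port, int thread_id)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    /* One port per thread pair keeps streams separated without tags. */
    addr.sin_port = htons((uint16_t)(base_port + thread_id));
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```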