# Leetcode but Parallel - A Collection of Parallel (ish?) Algorithms

*This post will include algorithm implementations from multiple sources. The algorithms will not always need to be parallelized; however, I will think beyond the scope of a leetcode question and extend my implementations accordingly.*

*Implementations will possibly include **a sequential implementation, a parallel implementation and a GPU implementation.***

*The goal is to practice as many parallel algorithms as possible. I will start with algorithms included in books that I previously read (study notes follow).*

# Source Code Repo

*TODO*

# Books to Start From

> * The Art of Concurrency
> https://www.tenlong.com.tw/products/9789863471257
> * The Art of Writing Efficient Programs
> https://www.tenlong.com.tw/products/9781800208117
> * C++ Data Structures and Algorithm Design Principles: Leverage the power of modern C++ to build robust and scalable applications
> https://www.amazon.com/Data-Structures-Algorithm-Design-Principles-ebook/dp/B07SYJSGVD
> * LeetCode 101: A Grinding Guide
> https://github.com/changgyhub/leetcode_101
> * Parallel and Distributed Algorithms by MobyLab
> https://mobylab.docs.crescdi.pub.ro/en/docs/parallelAndDistributed/introduction/

# Leetcode Algorithms

*In this section, the topics in focus come from **LeetCode 101: A Grinding Guide**.*

# Books or Other Algorithms

*In this section, I implement algorithms directly, without correlating them with a leetcode question.*

## Book Algorithm Implementations: C++ Data Structures and Algorithm Design Principles: Leverage the power of modern C++ to build robust and scalable applications

> Reference:
> https://www.amazon.com/Data-Structures-Algorithm-Design-Principles-ebook/dp/B07SYJSGVD

*In this section, only selected Exercises or Activities will be implemented.
They will be marked with the same numbers as in the book.*

*See my other post for study notes:*

* Book: C++ Data Structure & Algorithm Design
https://hackmd.io/@Erebustsai/S1ThGKDYa

### 1. Create a Data Structure for a Filesystem

TODO: Github Page

This implementation differs from the book's implementation: I prefer loops over recursion.

### 2. Streaming Median

:::info
:information_source: **Find Median from a Set of Data**
* **Quick Select for Median filter**
https://github.com/Chen-KaiTsai/GPGPU_CUDA/blob/main/ImageFilter/filter.cu
* **Median of Five**
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/blob/main/MedianOfFive/main.cpp
:::

TODO: Github Page

### 3. K-Way Merge Using Heaps

:::info
:information_source: **`std::push_heap`**
Note that the input range must already be a valid heap, except for the last element.
https://cplusplus.com/reference/algorithm/push_heap/
:::

:::info
:information_source: **`std::pop_heap`**
https://cplusplus.com/reference/algorithm/pop_heap/
:::

:::info
:information_source: **lambda function for `std::priority_queue`**
https://notes.boshkuo.com/docs/C++/STL/priority_queue
:::

### 4. STL Sorting with `list` & `forward_list`

```cpp
constexpr size_t DATASIZE = 1024 * 1024 * 8;

// `data`: a pre-filled container of DATASIZE elements (setup not shown)
std::list<int> linkedList(data.begin(), data.end());
std::forward_list<int> forwardList(data.begin(), data.end());

linkedList.sort();
forwardList.sort();
```

```
Time: 10.325000
Time: 12.114000
```

### 5. Vaccinations

:::info
:information_source: **`std::ptrdiff_t` vs `size_t`**
https://stackoverflow.com/a/4377401

This happens in the code:

```cpp
auto length = std::distance(first, last); // the return value is std::ptrdiff_t
```
:::

### 6. Parallel Binary Search

> Reference:
> https://mobylab.docs.crescdi.pub.ro/en/docs/parallelAndDistributed/laboratory3/parallelBinarySearch/

Every stage needs to be synchronized. However, creating and joining threads is really costly. Therefore, I use a threshold to enforce a minimum data size for forking.
This value can be tuned.

```cpp
while (chunkSize / nThreads > nThreads * threshold) {
    chunkSize /= nThreads; // each thread scans an equal share of the remaining range
    flag = -1;             // reset before every round of forking

    // Start threads
    for (unsigned int t = 0; t < nThreads; ++t) {
        tHandles.emplace_back(std::thread(pRangeCheck, data.get(), &flag, value, front, end, chunkSize, t));
    }
    // Join threads
    for (unsigned int t = 0; t < nThreads; ++t) {
        tHandles[t].join();
    }
    tHandles.clear();

    if (flag == -1) {
        std::cout << "Not Found\n\n";
        break;
    }

    front = front + chunkSize * flag;
    end = front + chunkSize;
}
```

Every forked thread checks whether the value being searched for lies within the range of data it is responsible for.

```cpp
void pRangeCheck(int* __restrict__ data, int* flag, const int &value, const size_t &front, const size_t &end, const size_t &chunkSize, const int threadIdx)
{
    size_t first = front + chunkSize * threadIdx;
    size_t last = first + chunkSize;

    if (last > end) {
        last = end;
    }

    if (value >= data[first] && value <= data[last - 1]) {
        *flag = threadIdx;
    }
}
```

### 7. STL `sort()` vs. `partial_sort()` vs. `nth_element()`

> * `std::partial_sort` in C++
> https://www.geeksforgeeks.org/cpp/stdpartial_sort-in-cpp/
> * `sort()` vs. `partial_sort()` vs. `nth_element() + sort()` in C++ STL
> https://www.geeksforgeeks.org/cpp/sort-vs-partial_sort-vs-nth_element-sort-in-c-stl/

TODO: Github Link

### 8. Parallel Disjoint-Set Study

> Reference:
> * Lock-free parallel disjoint set data structure
> https://github.com/wjakob/dset
> * Introduction to Disjoint Set (Union-Find Data Structure)
> https://www.geeksforgeeks.org/dsa/introduction-to-disjoint-set-data-structure-or-union-find-algorithm/
> * Union By Rank and Path Compression in Union-Find Algorithm
> https://www.geeksforgeeks.org/dsa/union-by-rank-and-path-compression-in-union-find-algorithm/

**Sequential Implementation with Union by Rank & Path Compression**

```cpp
/**
 * @brief Since the ID and the index of a node are separated, we need to find the element first.
 */
size_t find(int x) {
    auto node_it = std::find_if(nodes.begin(), nodes.end(), [x](Vertex n) { return n.ID == x; });
    if (node_it == nodes.end()) {
        return std::numeric_limits<size_t>::max();
    }

    size_t node_idx = std::distance(nodes.begin(), node_it);
    size_t parent_idx = parent[node_idx];

    std::vector<size_t> tmp;
    tmp.reserve(nodes.size());

    while (parent_idx != node_idx) {
        tmp.push_back(node_idx);
        node_idx = parent_idx;
        parent_idx = parent[node_idx];
    }

    // path compression
    for (auto& i : tmp) {
        parent[i] = parent_idx;
    }

    return parent_idx;
}
```

The original design separates the vertex ID from the index, so we need to search for whether the node ID exists; in addition, when doing path compression, extra space is required to store the search path.

TODO: Github Link

### 9. Kruskal Minimum Spanning Tree with Disjoint-Set

TODO

### 10. Borůvka's Minimum Spanning Tree & Parallelized

> Reference:
> * Boruvka's algorithm
> https://www.youtube.com/watch?v=lZrEwUHE3vw
> * Boruvka's algorithm | Greedy Algo-9 (Implementation)
> https://www.geeksforgeeks.org/dsa/boruvkas-algorithm-greedy-algo-9/
> * Parallelizing Minimum Spanning Tree Using Borůvka’s Algorithm in Parlaylib and CUDA
> https://jzaia18.github.io/15618-Final/

#### Sequential Implementation

> Reference:
> * Kruskals Algorithm
> https://hackmd.io/uRQooisDStG9uYADQJESXg?view#Minimum-Spanning-Tree--Kruskals-Algorithm
> * Prim's Algorithm
> https://hackmd.io/uRQooisDStG9uYADQJESXg?view#Prims-MST-Algorithm
> * Boruvka's algorithm | Greedy Algo-9 (Implementation)
> https://www.geeksforgeeks.org/dsa/boruvkas-algorithm-greedy-algo-9/

* Input a connected, weighted and undirected graph.
* Initialize all vertices as individual components (or sets).
* Initialize the MST as empty.
* While there is more than one component, do the following for each component:
    * Find the cheapest edge that connects this component to any other component.
    * Add this cheapest edge to the MST if not already added.
* Return the MST.

TODO: Github Link

#### Parallel Study

> * Parallelizing Minimum Spanning Tree Using Borůvka’s Algorithm in Parlaylib and CUDA
> https://jzaia18.github.io/15618-Final/

*The goal of studying their work is to make an OpenCL version of it.*

* **Data Structure**: a list of edges
* **Synchronization**: updating the cheapest edge for each vertex (connected component)
* **Edge Indexing**: if two edges have the same weight, they choose the edge with the lower index in the edge list. This ensures that a cycle of equal-weight edges cannot become part of the MST; edges are effectively ordered by weight first, with the index as a tie-breaker.
* **Checking Edges**:
> we instead kept a boolean for each of the original edges. As we iterate through the cheapest edges in parallel, we update the boolean corresponding to the edge.
We had to be careful about not double counting edges as we kept track of the number of edges added to our MST (which was used to determine when to terminate our algorithm in CUDA implementation). We could check if an edge was already counted by checking our MST boolean array atomically. Alternatively, we could check that the current edge did not count as the cheapest edge for any other vertex. In practice, we didn’t notice any significant difference and used the second implementation.

* **Union-find**:
> the union-find was stored as an array of length n, where n represents the number of vertices in the entire graph. Each array element tracks the cheapest edge for the given vertex as well as a parent in the component to which the vertex belongs. For root nodes of a component, the vertex stores itself. To union 2 components, the root of one component must set its parent to any node contained in the other. To find the root of a component, you must loop over the “parent” nodes until the root is reached.

* **Edge Update**: We replace the vertices of an edge with the components those vertices belong to, so the edge now goes between 2 components of our updated graph.

* **CUDA Implementation**:
> The first kernel resets the cheapest edge array as well as flattens our union-find data structure (this sets up the second kernel to perform find on a flat tree). Although it would be possible to merge the functionality of this kernel into the other two, it does not perform as well. The second kernel, assign_cheapest() corresponds to the first portion of the algorithm which involves finding the cheapest step. The third kernel, update_mst() corresponds to contracting the graph. This involves iterating through our cheapest edge array in parallel to merge vertices along those edges. Finally, the host copies a value from the device to determine when the MST is fully formed, and these 3 kernels are iterated until this point.
* **tie-breaking**: If two edges have the same weight, [tie-breaking](https://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm) is needed to prevent cycles.

### 11. Prim's Minimum Spanning Tree & Parallelized

#### Sequential Version #1

```cpp
struct Label {
    size_t ID;
    int distance;

    Label(size_t ID, int distance) {
        this->ID = ID;
        this->distance = distance;
    }

    inline bool operator< (const Label& l) const {
        return this->distance < l.distance;
    }
    inline bool operator> (const Label& l) const {
        return this->distance > l.distance;
    }

    // inline bool operator() (const Label& l) const {
    //     return this->distance > l.distance;
    // }
};
```

TODO: Github Link

#### Sequential Version #2 (for parallel)

```cpp
/**
 * @brief Weights are non-negative; distances are stored as signed `int`
 *        so that `-1` can mark vertices already in the MST
 *        (weights are assumed to fit in an `int`).
 */
void Prims(const unsigned int **W, std::vector<Edge> &Edges, const size_t N)
{
    std::unique_ptr<size_t[]> nearNode = std::make_unique<size_t[]>(N);
    std::unique_ptr<int[]> minDist = std::make_unique<int[]>(N);

    for (size_t i = 1; i < N; i++) {
        nearNode[i] = 0;
        minDist[i] = static_cast<int>(W[i][0]);
    }

    for (size_t i = 0; i < N - 1; i++) {
        int min = std::numeric_limits<int>::max();
        size_t k = 0;
        for (size_t j = 1; j < N; j++) {
            // minDist[j] == -1 means j is already in the MST
            if (minDist[j] >= 0 && minDist[j] < min) {
                min = minDist[j];
                k = j;
            }
        }

        Edges[i].v_1 = nearNode[k];
        Edges[i].v_2 = k;

        minDist[k] = -1;
        for (size_t j = 0; j < N; j++) {
            if (static_cast<int>(W[j][k]) < minDist[j]) {
                minDist[j] = static_cast<int>(W[j][k]);
                nearNode[j] = k;
            }
        }
    }
}
```

* There are no negative weights; weights are `unsigned int` (distances are stored as `int` so that `-1` can be used as an in-MST marker).
* The input `W` is a matrix in the following format:
![image](https://hackmd.io/_uploads/ry0x17Sixl.png)
* The input `Edges` is a list of edges, each denoted by its two vertices, holding the resulting MST.
* The input `N` is the number of vertices.
* `nearNode[j]` denotes the vertex in the current MST that is nearest to vertex `j`.
* `minDist[j]` denotes the minimum distance between vertex `j` and the MST.
* If a vertex is in the MST, its distance is updated to `-1`.

#### Parallel Version

There are two parts that can be parallelized. The first part is ***find the minimum
distance to the MST from the other vertices***; the second part, ***updating the distances of vertices not yet in the MST***, can simply be parallelized with an OpenMP `for` loop.

**The first part**

Note that in the sequential version the loop index `j` starts at `1` so that vertex `0` is not compared with itself; however, the program also works if `j` starts at `0`. The parallel version therefore does not need to special-case index `0`.

```cpp
void pSearchMin(int* __restrict__ minDist, int* index, const int threadIdx, size_t N, size_t nThreads)
{
    *index = -1; // -1 means this thread found no candidate

    size_t pSize = N / nThreads;
    size_t start = threadIdx * pSize;
    size_t end = (threadIdx + 1) * pSize;

    if (threadIdx == static_cast<int>(nThreads - 1)) // boundary checking
        end = N;

    int min = std::numeric_limits<int>::max();

    for (size_t i = start; i < end; ++i) {
        // minDist[i] == -1 means vertex i is already in the MST
        if (minDist[i] >= 0 && minDist[i] < min) {
            min = minDist[i];
            *index = static_cast<int>(i);
        }
    }
}
```

After finding all the local minima, the main thread performs a final reduction to find the global minimum. This final reduction is not too expensive, since the number of threads is usually small.

**The second part**

```cpp
#pragma omp parallel for
for (int j = 0; j < (int)N; j++) {
    if ((int)W[j][k] < minDist[j]) {
        minDist[j] = (int)W[j][k];
        nearNode[j] = k;
    }
}
```

TODO: Testing & Github Link

### 12. Dijkstra's Shortest Path Problem

```cpp
std::vector<size_t> dijkstra_shortest_path(const Graph& G, size_t src, size_t dest)
{
    std::priority_queue<Label, std::vector<Label>, std::greater<Label>> frontier;
    std::set<size_t> visited;
    std::vector<size_t> parent(G.size());
    std::vector<unsigned int> distance(G.size(), std::numeric_limits<unsigned int>::max());

    std::vector<size_t> shortest_path;
    shortest_path.reserve(G.size());

    frontier.emplace(src, 0);
    parent[src] = src;

    while (!frontier.empty()) {
        Label current_vertex = frontier.top();
        frontier.pop();

        if (current_vertex.ID == dest) {
            std::cout << "Destination " << current_vertex.ID << " reached.\n";
            break;
        }

        if (visited.find(current_vertex.ID) == visited.end()) {
            std::cout << "Setting vertex " << current_vertex.ID << "\n";

            for (auto &e : G.get_edges(current_vertex.ID)) {
                if (current_vertex.distance + e.weight < distance[e.dest]) {
                    frontier.emplace(Label{ e.dest, current_vertex.distance + e.weight });
                    parent[e.dest] = current_vertex.ID;
                    distance[e.dest] = current_vertex.distance + e.weight;
                }
            }
            visited.insert(current_vertex.ID);
        }
    }

    // trace back to src from dest with the parent vector
    size_t pre_vertex_ID = dest;
    while (pre_vertex_ID != src) {
        shortest_path.push_back(pre_vertex_ID);
        pre_vertex_ID = parent[pre_vertex_ID];
    }
    shortest_path.push_back(src);

    std::reverse(shortest_path.begin(), shortest_path.end()); // to (src to dest)
    return shortest_path;
}
```

Note that Dijkstra's algorithm finds the shortest paths from the source vertex to **ALL** the other vertices.

### 13. Bellman-Ford's Shortest Path Problem

Its time complexity is higher than Dijkstra's, but it can handle negative edge weights.
```cpp
void BellmanFord(Graph &G, size_t start)
{
    std::vector<Edge> Edges = G.get_edges();
    const size_t N = G.size(); // number of vertices

    std::vector<int> distance(N, std::numeric_limits<int>::max());
    distance[start] = 0;

    // Relax every edge N - 1 times
    for (size_t i = 0; i < N - 1; ++i) {
        for (const Edge &e : Edges) {
            size_t src = e.src;
            size_t dest = e.dest;
            int w = e.weight;

            if (distance[src] == std::numeric_limits<int>::max()) {
                continue;
            }

            if (distance[src] + w < distance[dest]) {
                distance[dest] = distance[src] + w;
            }
        }
    }

    /**
     * @brief Check all edges one more time; if any distance can still shrink, a negative cycle exists.
     */
    for (const Edge &e : Edges) {
        size_t src = e.src;
        size_t dest = e.dest;
        int w = e.weight;

        if (distance[src] == std::numeric_limits<int>::max()) {
            continue;
        }

        if (distance[src] + w < distance[dest]) {
            std::cout << "Negative cycle found\n";
            return;
        }
    }

    std::cout << "All distances from vertex: " << start << std::endl;
    for (size_t i = 0; i < N; ++i) {
        std::cout << "\t" << i << ": ";
        if (distance[i] == std::numeric_limits<int>::max()) {
            std::cout << "Not Visited" << std::endl;
        } else {
            std::cout << distance[i] << std::endl;
        }
    }
}
```

A negative-weight cycle can be detected by checking all edges one additional time after the main algorithm finishes. If some distance can still become smaller, the algorithm is not wrong; the source graph must contain a negative-weight cycle.

# GPGPU Algorithms

*See my other posts:*

* GPGPU Algorithm Implementation
https://hackmd.io/@Erebustsai/Byul7e-Up
* GPGPU_OpenCL [Github Repo]
https://github.com/Chen-KaiTsai/GPGPU_OpenCL
* GPGPU_CUDA [Github Repo]
https://github.com/Chen-KaiTsai/GPGPU_CUDA