# 1. Parallel Data Processing
## Parallelism?
Process a large task by dividing it into smaller tasks that are then worked on simultaneously. Database processing works well with parallelism ***(coarse-grained parallelism)***.
The primary objective of parallel database processing is to gain a performance improvement.
## Why parallel processing?
1. The processing speed of a processor depends on the transmission speed between electronic components, which is ultimately limited by the speed of light.
2. The density of transistors on a processor chip is limited.
3. Other hardware limitations
## Measure of Performance Improvement : Throughput vs Response Time
1. Throughput: the number of tasks that can be completed in a given time interval.
2. Response Time: the amount of time taken to complete a single task from the time it is submitted.
## Metrics of Performance Improvement : Speed Up - Scale Up
1. Speed Up: related to the increase in speed when the same task is executed on more processors.
$$
\text{Speed-Up} = \frac{\text{elapsed time on uniprocessor}}{\text{elapsed time on multiprocessor}}
$$
2. Scale Up: the ability to handle a larger task (or more tasks) in the same elapsed time as more processors are added.
* Transaction Scale-Up: relates to increasing the number of transactions that can be handled.
* Data Scale-Up: relates to increasing the size of the data that can be handled.
$$
\text{Scale-Up} = \frac{\text{uniprocessor elapsed time on small system}}{\text{multiprocessor elapsed time on larger system}}
$$
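A quick numerical illustration of the two metrics; the elapsed times below are made-up values, purely for illustration:

```python
# Hypothetical elapsed times (seconds) -- illustrative values only.
uni_time = 120.0       # a task on 1 processor
multi_time = 15.0      # the same task on 8 processors
print("speed-up:", uni_time / multi_time)             # 8.0 -> linear speed-up

small_sys_time = 120.0   # small task on 1 processor
large_sys_time = 130.0   # 8x larger task on 8 processors
print("scale-up:", small_sys_time / large_sys_time)   # ~0.92; 1.0 would be linear scale-up
```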
## Parallel Obstacles
1. Start-up and consolidation cost
* Start-up cost relates to the initiation of multiple processes in parallel processing.
* Consolidation cost relates to the aggregation of results from multiple processors by the host processor.
> Amdahl's law states that the compute time can be divided into a parallel part and a serial part; no matter how high the degree of parallelism in the former, the speed-up is asymptotically limited by the latter, which must be performed on a single processing element (a small sketch of this bound follows at the end of this list).
2. Interference and Communication
* Interference happens when processing units compete with one another for shared resources.
* Communication costs, such as the time spent waiting for messages from other processors.
3. Skew
Refers to the unevenness of workload partitioning. An equal workload (load balance) among all processing elements is one of the critical factors in achieving linear speed-up.
The Zipf distribution is used to model skewness ($\theta$ is the degree of skewness, $|R|$ is the number of records in the table, $|R_i|$ is the number of records in processor $i$, and $N$ is the number of processors); a small sketch follows the list below:
$$
|R_i| = \frac{|R|}{i^\theta \times \sum^{N}_{j=1} \frac{1}{j^\theta}}
$$
* **Data Skew**: caused by uneven data placement on the disks of the local processors, or by the previous operator.
* **Processing Skew**: caused by the processing itself, and may initially be propagated by data skew.
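As flagged above, a minimal sketch of Amdahl's law, which bounds the speed-up obtainable when work such as start-up and consolidation must run serially (the 5% serial fraction is an arbitrary assumption):

```python
def amdahl_speed_up(serial_fraction: float, num_processors: int) -> float:
    """Upper bound on speed-up when a fraction of the work must run serially."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)

# Even a small serial part (e.g. start-up and consolidation) flattens the curve:
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speed_up(0.05, n), 2))
# With a 5% serial part the speed-up can never exceed 1 / 0.05 = 20.
```

And a small sketch of the Zipf skew model above, computing how many records each processor would receive for a chosen degree of skewness (the record count and the θ values are illustrative assumptions):

```python
def zipf_partition_sizes(total_records: int, num_processors: int, theta: float) -> list[int]:
    """|R_i| = |R| / (i^theta * sum_{j=1..N} 1/j^theta), for i = 1..N."""
    norm = sum(1.0 / j**theta for j in range(1, num_processors + 1))
    return [round(total_records / (i**theta * norm)) for i in range(1, num_processors + 1)]

print(zipf_partition_sizes(100_000, 4, theta=0.0))  # [25000, 25000, 25000, 25000] -- no skew
print(zipf_partition_sizes(100_000, 4, theta=1.0))  # [48000, 24000, 16000, 12000] -- heavy skew
```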
## Forms of Parallelism
1. Interquery Parallelism
Different queries are executed in parallel with one another. The aim is scale-up. **One query per processor**.

2. Intraquery Parallelism
Execution of a single query in parallel across multiple processors (**one subquery per processor**). The aim is speed-up.

3. Intraoperation Parallelism
Execution of an individual operation in parallel: a single operation is executed by multiple processors. Also called partitioned parallelism. Intraoperation parallelism raises the need for parallel versions of basic database operations, such as parallel search, parallel sort, etc.

4. Interoperation Parallelism
Execution of different operations of the same query in parallel.
* **Pipeline Parallelism**: operations form a sequential chain in which each operation consumes the result of the previous operation. Useful for a small number of processors, but does not scale up well.

* **Independent Parallelism**: operations in a query that do not depend on one another are executed in parallel.

5. Mixed parallelism
A mixture of all available parallelism forms.

## Parallel Database Architectures
1. **Shared-Memory Architecture**: all processors share a common main memory and secondary memory. Load balancing is easy, but the architecture suffers from memory and bus contention.
2. **Shared-Disk Architecture**: each processor has its own local main memory, but all processors share the disks. Load balancing is easy, but the architecture suffers from contention on the interconnection to the shared disks.
3. **Shared-Nothing Architecture**: each processor has its own local main memory and disk. Load balancing is difficult.
4. **Shared-Something Architecture**: a mixture of the shared-memory and shared-nothing architectures. Also called a cluster architecture, where each node is a shared-memory system (SMP) connected to other SMP nodes through an interconnection network (bus, mesh, hypercube). Advantage: flexibility in configuration; it combines the easy intra-node load balancing of shared-memory with the extensibility of shared-nothing.
## Interconnection Architectures
1. **Bus**
All processing units are connected via a single communication line. Works well for a small number of processors but does not scale well.
2. **Mesh**
Processing units are arranged as nodes in a grid (each node connects to its adjacent nodes). Scales better than a bus.
3. **Hypercube**
A 2-dimensional hypercube is exactly the same as a 2 × 2 mesh structure. In a 3-dimensional hypercube, the processing elements are connected as the corners of a cube.
## Summary
1. Why is parallelism needed?
Because of the need to process large volumes of data in a reasonable elapsed time.
2. What can be achieved by parallelism in database processing?
Linear speed-up and linear scale-up.
3. How is parallelism achieved in database processing?
Through five different forms of parallelism: interquery, intraquery, intraoperation, interoperation, or mixed.
4. What facilities of parallel computing can be used?
Shared-memory, shared-disk, shared-nothing, and shared-something architectures.
# 2. Data Partitioning
Depending on the architecture, data partitioning can be done physically or logically. In a shared-nothing architecture, data is **placed permanently** over several disks, whereas in a shared-everything (i.e., shared-memory and shared-disk) architecture, data is **assigned logically** to each processor.
## Data Partitioning
Basic Data Partitioning:
1. Vertical Data Partitioning
2. Horizontal Data Partitioning
## Horizontal Data Partitioning
1. Round-robin data partitioning
2. Hash data partitioning
3. Range data partitioning
4. Random-unequal data partitioning
### a. Round-robin partitioning
- Each record in turn is allocated to a processing element in a clockwise (round-robin) manner.
- Advantage: data is distributed evenly.
- Disadvantage: records within a partition are not grouped semantically.
### b. Hash data partitioning
- Makes a partition more meaningful
- Records in the same partition have the same hash value
- Best for exact match retrieval
- Not suitable for range search
- Might be skewed, because the data value distribution is likely to be non-uniform.

### c. Range Partitioning
- Spreads records based on a given range of the partitioning attribute.
- Suitable for range retrieval
- Might suffer from data skew

### d. Random-Unequal
- Partitioning is not based on a particular attribute
- The size of each partition is likely to be unequal
- Common, especially when the operation works on temporary results obtained from previous operations.

## Attribute-based vs Non Attribute Based data partitioning
Attribute-based: Hash and Range partitioning
Non-attribute-based: Round-Robin and Random-Unequal partitioning
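A minimal sketch of round-robin, hash, and range partitioning over a toy record list; the attribute names, hash function, and range boundaries are all made up for illustration (Python's built-in `hash` is salted per process, so a real system would use a stable hash function):

```python
records = [{"id": i, "name": n} for i, n in
           enumerate(["Adams", "Bernard", "Chan", "Dickens", "Eliot", "Fuller"])]
N = 3  # number of processors

# Round-robin: record k goes to processor k mod N (even spread, no semantic grouping).
round_robin = [[r for k, r in enumerate(records) if k % N == p] for p in range(N)]

# Hash: records with equal hash values of the partitioning attribute land together
# (good for exact match, useless for range search).
hashed = [[r for r in records if hash(r["name"]) % N == p] for p in range(N)]

# Range: partition on attribute ranges (good for range search, may be skewed).
boundaries = ["D", "K"]  # partition 0: name < "D"; partition 1: "D" <= name < "K"; partition 2: name >= "K"
bucket = lambda name: sum(name >= b for b in boundaries)
ranged = [[r for r in records if bucket(r["name"]) == p] for p in range(N)]

print(round_robin, hashed, ranged, sep="\n")
```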

## Complex Data Partitioning
- Hybrid-Range Partitioning System (HRPS)
 - Partitions the table into fragments using range partitioning; each fragment is then distributed across the processors using round-robin.

- Support for Small Tables
- Support for Tables with Nonuniform Distributions of the Partitioning Attribute Values
Three cases in HRPS (N = number of processors, M = number of participating processors required by the query):
1. All processors are used (M = N)
 - Compared with range partitioning: the query would use only 1-2 processors, and is hence less optimal
 - Compared with hash partitioning: hash would use all N processors, and is hence less efficient due to start-up, communication, and termination overheads
2. More participating processors required than available (M > N)
 - HRPS still uses all N processors
 - Compared with range partitioning: range would likely use only a single processor
3. Fewer participating processors required than available (M < N)
 - Compared with range partitioning: the query would use only 1-2 processors, and is hence less optimal
 - Compared with hash partitioning: hash would use all N processors, and is hence less efficient due to start-up, communication, and termination overheads
- Multiattribute Grid Declustering (MAGIC)
Based on multiple attributes; supports both range and exact-match search.

Example:
- Query 1 (exact match on Slname=="Roberts")
- MAGIC → Use 6 processors (5, 11, 17, 23, 29, 35) (most optimal)
- Hash → Only use a single processor (less optimal)
- Range → Use all processors because it is partitioned based on Sid not Slname (not necessary)
- Query 2 (discrete range SID = [98555-98600])
- MAGIC → Use 6 processors (31, 32, 33, 34, 35, 36) (most optimal)
- Hash → Use all processors because it is partitioned based on Slname not Sid (not necessary)
- Range → Use 1 processor (less efficient)
- Bubba’s Extended Range Declustering (BERD)
Two levels of data partitioning: primary-attribute partitioning and secondary-attribute partitioning.
Steps:
1. Partition the table on the primary partitioning attribute using range partitioning
2. Scan each fragment and create an auxiliary table
3. Partition the auxiliary table on the secondary partitioning attribute
4. Place the fragments from steps 1-3 onto multiple processors.

The auxiliary table provides enough information to direct the execution of a query to at least one processor (when there is a high correlation between the two partitioning attribute values) and at most two processors (low correlation). For example, an exact match for student “Chan” first directs the process to the first processor and then to the third processor, where record “Chan” is actually located.
# 3. Searching Operation
## Kinds of Search Queries
1. Exact-Match Search
2. Range Search (Continuous & Discrete)
3. Multi Attribute Search
## Serial Search
- Linear Search
- Binary Search
## Parallel Search
- Processor activation or involvement: **depends on the data partitioning method used**

- Local searching method: **depends on the data ordering**

- Key comparison: **depends on the search attribute values (uniqueness of the data)**
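A small sketch of a parallel exact-match search, assuming the table was hash-partitioned on the search attribute: only the processor whose partition can contain the value is activated, and its local search is a linear scan because the partition is unordered. All names and values here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4
# Assume the table was hash-partitioned on "name" into N partitions.
partitions = [[] for _ in range(N)]
for name in ["Adams", "Bernard", "Chan", "Dickens", "Eliot", "Roberts"]:
    partitions[hash(name) % N].append({"name": name})

def local_search(partition, key):
    # Unordered partition -> linear scan; an ordered one could use binary search.
    return [r for r in partition if r["name"] == key]

key = "Roberts"
# Processor activation: hash partitioning on the search attribute lets us
# activate a single processor instead of all N.
target = hash(key) % N
with ThreadPoolExecutor() as pool:
    result = pool.submit(local_search, partitions[target], key).result()
print(result)  # [{'name': 'Roberts'}]
```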

# 4. Join Operation

## Serial Join
- Nested-Loop
- Hash Join
- Sort-Merge Join
## Parallel Join Steps
- Data Partitioning
- Local Join
## Data Partitioning in Parallel Join
- Divide and Broadcast Algorithm
- Disjoint data partitioning
## Divide and Broadcast Algorithm
- Divide one table into multiple disjoint partitions (the division can simply be an equal split)
- Broadcast (copy) the other, smaller table to all partitions
- Perform local join
- Aggregate join result
Problems:
- No load imbalance problem, but the broadcasting method is inefficient
- The problem of workload imbalance will occur if the table is already partitioned using random-unequal partitioning
- If shared-memory is used, then there is no replication of the broadcast table.
- Each processor will access the entire table S and a portion of table R. But if each processor does not have enough working space, then the local join might not be able to use a hash-based join
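A minimal sketch of the divide-and-broadcast steps listed above, using in-memory lists to stand in for the processors' fragments (table contents and the join key are illustrative):

```python
N = 3
# Divide table R into N roughly equal, disjoint partitions.
R = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
R_parts = [R[i::N] for i in range(N)]

# Broadcast the smaller table S to every processor (this copying is the
# inefficiency: S is effectively replicated N times unless memory is shared).
S = [(2, "x"), (4, "y"), (6, "z")]

def local_join(r_part, s_copy):
    s_index = {key: val for key, val in s_copy}          # build a hash table on S
    return [(k, rv, s_index[k]) for k, rv in r_part if k in s_index]

# Each processor joins its R fragment with the full copy of S, then the
# results are consolidated (a simple union, since the R partitions are disjoint).
result = [row for p in range(N) for row in local_join(R_parts[p], S)]
print(result)
```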
## Disjoint Partitioning-based Algorithm
- Partition both tables on the join attribute using range or hash partitioning
- Perform local join
- Aggregate the result
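And a corresponding sketch of the disjoint-partitioning approach: both tables are hash-partitioned on the join attribute, so matching rows land on the same processor and no broadcasting is needed (again, the table contents are illustrative):

```python
N = 3
R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(2, "x"), (3, "y"), (9, "z")]

# Hash-partition both tables on the join attribute; rows with equal keys
# are guaranteed to be sent to the same processor.
R_parts = [[row for row in R if row[0] % N == p] for p in range(N)]
S_parts = [[row for row in S if row[0] % N == p] for p in range(N)]

def local_hash_join(r_part, s_part):
    s_index = {k: v for k, v in s_part}
    return [(k, rv, s_index[k]) for k, rv in r_part if k in s_index]

result = [row for p in range(N) for row in local_hash_join(R_parts[p], S_parts[p])]
print(result)  # [(3, 'c', 'y'), (2, 'b', 'x')] in some order
```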
## Cost Model For Data Partition

| Cost | Divide and Broadcast | Disjoint Partitioning-based |
|--------------------------|----------------------|-----------------------------|
| Scan Cost | Si/P x IO | (Si/P + Ri/P) x IO |
| Select Cost | \|Si\| x (tr+tw) | (\|Si\| + \|Ri\|)x(tr+tw) |
| Transfer Cost | (S/P - Si/P)x(ml+mp) | (Si/P + Ri/P)x(ml+mp) |
| Finding Destination Cost | - | (\|Si\| + \|Ri\|)x(td) |
| Receiving Cost | (S/P - Si/P)x mp | (Si/P + Ri/P)x mp |
| Disk Writing Cost | (S/P - Si/P)xIO | (Si/P + Ri/P)xIO |
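The table above can be read as a pair of cost formulas. Below is a rough sketch that turns them into code; the parameter names and their interpretations (fragment sizes Si/P and Ri/P in pages, record counts |Si| and |Ri|, and the unit costs IO, tr, tw, td, ml, mp) are assumptions based on the symbols in the table, and for simplicity the components are just summed:

```python
# Rough sketch of the two data-partitioning cost formulas from the table above.
#   si_pages, ri_pages : pages of the local S and R fragments (Si/P, Ri/P)
#   s_pages            : pages of the whole table S (S/P)
#   si_recs, ri_recs   : records in the local fragments (|Si|, |Ri|)
#   io, tr, tw, td     : unit costs for disk I/O, record reading, writing, destination lookup
#   ml, mp             : message latency and per-page message cost

def divide_and_broadcast_cost(si_pages, s_pages, si_recs, io, tr, tw, ml, mp):
    scan = si_pages * io
    select = si_recs * (tr + tw)
    transfer = (s_pages - si_pages) * (ml + mp)
    receive = (s_pages - si_pages) * mp
    disk_write = (s_pages - si_pages) * io
    return scan + select + transfer + receive + disk_write

def disjoint_partitioning_cost(si_pages, ri_pages, si_recs, ri_recs, io, tr, tw, td, ml, mp):
    scan = (si_pages + ri_pages) * io
    select = (si_recs + ri_recs) * (tr + tw)
    transfer = (si_pages + ri_pages) * (ml + mp)
    destination = (si_recs + ri_recs) * td
    receive = (si_pages + ri_pages) * mp
    disk_write = (si_pages + ri_pages) * io
    return scan + select + transfer + destination + receive + disk_write
```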
## Projectivity Ratio (π) and Selectivity Ratio (σ)
1. Projectivity ratio π is the ratio between the projected attribute size and the original record length.
2. Selectivity ratio σ is a ratio between the total output records, which is determined by the number of records in the query result, and the original total number of records.
## Cost Model For Local Join
- Scan Cost : (Ri/P + Si/P) x IO
- Select Cost : (|Ri|+|Si|) x (tr+tw)
- Join Cost : (|Ri| x (tr + th)) + (|Si| x (tr + th + tj))
- Query Result Cost

## Parallel Join Optimization
1. Load Balancing Issue :
- In terms of parallelism, the reduction in the query elapsed time is achieved by having each processor finish its execution as early as possible and as evenly as possible.
- No load imbalance in divide and broadcast-based parallel join. But this kind of parallel join is unattractive, due to the heavy broadcasting
- In disjoint-based parallel join algorithms, processing skew is common
- To solve this skew problem, create more fragments than the available processors, and then rearrange the placement of the fragments

2. Managing Memory Issue: In the disjoint partitioning, after the data is distributed to the designated processors, the data has to be stored on disk. Then, in the local join, the data has to be loaded from disk again.

## Outer Join
- ROJA (Redistribution Outer Join Algorithm)
- Redistribute
- Local Join
- DOJA (Duplication Outer Join Algorithm)
- Replicate
- Inner Join
- Hash and Redistribute
- Outer Join
- DER (Duplication and Efficient Redistribution)
- Replicate
- Local Inner Join
 - Select the row IDs of rows with no match and hash them
 - Redistribute the row IDs
 - Keep the row IDs that appear as many times as the number of processors (these rows are dangling)
- Inner Join
- OJSO (Outer Join Skew Optimization)
- Redistribute R,S
- Outer join R,S, store the result into Jredis and Jlocal
- Redistribute Jredis and T
- Union the final result
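Of the four, ROJA is the simplest to sketch: both tables are hash-redistributed on the join attribute, and each processor then performs a local left outer join (the table contents are illustrative):

```python
N = 2
R = [(1, "a"), (2, "b"), (3, "c")]   # left (preserved) table
S = [(2, "x"), (4, "y")]

# Redistribute: hash both tables on the join key.
R_parts = [[row for row in R if row[0] % N == p] for p in range(N)]
S_parts = [[row for row in S if row[0] % N == p] for p in range(N)]

def local_left_outer_join(r_part, s_part):
    s_index = {k: v for k, v in s_part}
    # Dangling R rows are padded with None, as in a LEFT OUTER JOIN.
    return [(k, rv, s_index.get(k)) for k, rv in r_part]

result = [row for p in range(N) for row in local_left_outer_join(R_parts[p], S_parts[p])]
print(result)  # [(2, 'b', 'x'), (1, 'a', None), (3, 'c', None)]
```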
# 5. Sorting Operation
## Internal vs External Sorting
- Internal Sorting: when the data fits into main memory
- External Sorting: when the data does not fit into main memory. Usually a sort-merge algorithm is used. Number of passes = $\log_{B-1}(\text{number of subfiles at phase 0}) + 1$
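A quick check of the passes formula above (the buffer count and file size are made-up numbers; a ceiling is applied because the number of passes is a whole number):

```python
import math

pages = 10_000   # pages in the file to be sorted (assumption)
B = 11           # in-memory buffer pages (assumption)

subfiles_at_phase_0 = math.ceil(pages / B)                    # 910 sorted runs after phase 0
passes = math.ceil(math.log(subfiles_at_phase_0, B - 1)) + 1  # merge passes + the initial pass
print(subfiles_at_phase_0, passes)                            # 910 runs -> 4 passes in total
```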
## Serial Sorting Algorithm
- Quick Sort
- Bubble Sort
- Insertion Sort
- Sort-merge
## Parallel Sorting Algorithm
- Parallel Merge-All Sort: local sort, then a final merge by one processor
- Parallel Binary-Merge Sort: local sort, binary merges in pairs, final merge
- Redistribution Binary-Merge Sort: local sort, redistribute, intermediate binary merges
- Redistribution Merge-All Sort: local sort, redistribute, merge
- Parallel Partitioned Sort: redistribute (partition) first, then local sort; no merge is needed
### Parallel Merge-All Sort

- Traditional approach
- K-way merging: the search for the smallest value among the k partitions is done at the same time; k-way merging requires k files to be open simultaneously
- Load is balanced in the local sort phase
- Problems: heavy load on one processor during the merge, and network contention
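A small sketch of the final k-way merging step, using the standard library's heap-based merge over the locally sorted runs (the run contents are illustrative):

```python
import heapq

# Locally sorted results from k = 3 "processors" (illustrative values).
sorted_runs = [
    [1, 5, 9],
    [2, 3, 11],
    [4, 6, 7],
]

# heapq.merge keeps one cursor per run open simultaneously and repeatedly
# picks the smallest head value: exactly the k-way merging described above.
final_sorted = list(heapq.merge(*sorted_runs))
print(final_sorted)  # [1, 2, 3, 4, 5, 6, 7, 9, 11]
```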

### Parallel Binary-Merge Sort

- Traditional approach
- Merging is done in pairs, which can be time-consuming if the list is long; the pipelined process in binary merging incurs extra overhead
- Problems: merging is still heavy on the final processor, and the height of the merge tree adds steps

### Redistribution Binary-Merge Sort

Benefit: merging becomes lighter, because the workload is spread by the redistribution.
Problem: the height of the merge tree.
### Redistribution Merge-All Sort

Benefits: true parallelism in merging, and a reduced tree height.
Problem: skew in the merging phase.
### Parallel Partitioned Sort

- Partitioning may cause skew
- No need for merging
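A minimal sketch of parallel partitioned sort: values are range-partitioned first, each processor sorts its partition locally, and the global order is obtained by simple concatenation, with no merging (the data and range boundaries are made up):

```python
values = [42, 7, 93, 15, 68, 3, 77, 51]
boundaries = [30, 60]   # partition 0: < 30, partition 1: 30..59, partition 2: >= 60
N = len(boundaries) + 1

def bucket(v):
    return sum(v >= b for b in boundaries)

partitions = [[v for v in values if bucket(v) == p] for p in range(N)]

# Local sort in each "processor"; concatenation already gives the global order,
# so no merge phase is needed (but a skewed range makes some partitions larger).
result = [v for part in partitions for v in sorted(part)]
print(result)  # [3, 7, 15, 42, 51, 68, 77, 93]
```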
As a rule of thumb for selecting an appropriate parallel sorting algorithm, the following rules can be adopted:
- If processing skew degree is high, then use parallel redistribution merge-all sort.
- If both data skew and processing skew degrees are high OR no skew, then use parallel partitioned sort
# 6. Group By Operation
## Serial Group By
- Read the records one by one and hash each record on the group-by attribute
- Add it to the hash table (the hash table is kept in memory), accumulating the aggregate value for its group
- The query result is stored to disk
But what if the hash table is very big and does not fit in memory?
- Divide the table according to the memory size
- Use the disk to store the hash table
- Use memory as an intermediate hash table
If the memory is large enough, then super linear speed-up can be achieved.
## Parallel Group By
- Traditional Methods (May use: Merge-All and Hierarchical Merge)
- Two-Phase Merge
- Redistribution Method
## Traditional Methods
- Local aggregation in each processor
- Global aggregation (May use: Merge-All and Hierarchical Merge)
Problems:
- No parallelism in the global aggregation stage
- Special attention is needed for some aggregate functions (e.g., AVG) when performing local aggregation: carry partial SUM and COUNT locally rather than a local AVG
- Global aggregation is carried out by a single processor
- Network bottleneck when transferring local results from the child processors to the host processor
## Two-phase Merge
- Local aggregation in each processor
- Distribute the local results based on the grouping attribute
- Global Aggregation

Problems:
- Skew, due to the redistribution
- No load balancing option
- Does not work well if the number of groups is large, because local aggregation becomes too costly
- Not effective when the group-by selectivity does not reduce the original records greatly
## Redistribution Method
- Redistribute the raw records first, based on the grouping attribute
- Perform local aggregation on each processor

Problems:
- Skew, due to the redistribution. A **task stealing** method can be used to solve this problem by moving data from heavily loaded processors to idle or lighter processors. Hence, the method still works well when the number of groups is large.
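A small sketch of the redistribution method: raw records are hash-redistributed on the grouping attribute first, and each processor then computes the full aggregate for the groups it owns (the records and the SUM aggregate are illustrative choices):

```python
from collections import defaultdict

records = [("red", 3), ("blue", 5), ("red", 2), ("green", 7), ("blue", 1)]
N = 2

# Redistribute first: hash on the grouping attribute so that all records
# of a group end up on the same processor.
partitions = [[r for r in records if hash(r[0]) % N == p] for p in range(N)]

def local_group_by(part):
    sums = defaultdict(int)
    for group, value in part:
        sums[group] += value   # local aggregation is also the final result
    return dict(sums)

result = {}
for p in range(N):
    result.update(local_group_by(partitions[p]))
print(result)  # {'red': 5, 'blue': 6, 'green': 7} (key order may differ)
```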