3. Velocity - HackMD

# 3. Velocity ## 1. Data stream Data stream is a real-time, continues, ordered sequence (by arrival time or timestamp explicitely) of items. ### Windowing Approach to reduce the importance of older tuples without eliminating their influence on further analysis. Type of Stream Windows: - Time-based ![](https://i.imgur.com/SbCuTlh.png) - Tuple-based ![](https://i.imgur.com/MsRrPiV.png) ### Burst Arrival - Time Based Window ![](https://i.imgur.com/dWdOS0B.png) - Tuple Based Windows ![](https://i.imgur.com/RjJzBM3.png) ### Database vs Stream Processing | Database | Stream Processing | |-----------------------------------------|----------------------------------------| | Bounded Data | Unbounded Data (data keeps coming without end) | | Relatively Static Data | Dynamic Data | | Complex, ad-hoc query | Simple, continues query | | Possible to backtrack during processing | No backtracking, single pass operation | | Exact Answer to Query | Approximate answer to query | | Tuples arrival rate is low | Tuples arrival rate is high | ### Event vs Processing time - Event Time: the time when the data is produced by the source - Processing time: the time when data is arrived at the processing unit - Ideally event time==processing time, in reality event time is earlier - Data may arrive in burst. ### Streaming Technologies: Kafka - Scalable - Fault-tolerant - Publish-Subscribe messaging system - Enable distributed applications ### How Kafka Works - Publish-Subscribe messaging system - All messages are persisted and replicated to peer brokers for fault-tolerance - Messages persisted in specific amount of time - There are no deletes - Does not track who consume messages ## 2. Stream Join ## a. General Stream ### General Stream Join Process ![](https://i.imgur.com/27qPF9O.png) - When tuple r is comming: - Probe matching tuples in current window of stream S - Insert tuple r into stream R window - Invalidate all expired tuples in stream R window ### General Stream Join Types: - Nested-Loop Join - Check each tuple in streams R and match with all tuples in streams S - Problems: Overlapping tuples when windows size = slide size - Sort-Merge Join - Sort tuples and both stream's window - Merge both stream - Problems: Overlapping tuples when windows size = slide size - Hash Join - Hash every tuple in stream window S, store in hash table - Proble every tuple in window S with hash table - Symmetric Hash Join - When tuple r arrives: - Probe tuple r to hash table S - Hash tuple r and add to hash table R - Insert tuple r into stream R - When tuple s arrives: - Probe tuple s with hash table R - Hash tuple s and add to hash table S - Insert tuple s into stream S ## b. Unbounded Stream Join Join is applied to tuples in running window. ### 1. Time Based window Stream Join (Using M-Join) - Similar to Symmetric Hash Join - Each tuple comming for example tuple r, probe to hash table S and T ![](https://i.imgur.com/FkuBWUJ.png) - Do similar thing when tuple s and t coming. ### 2. Tuple Based window Stream Join (Using Handshake Join) Problems with tuple based window: - Window size of each stream (number of tuples), can they be different? ![](https://i.imgur.com/WG0rvjf.png) - How if windows amont streams will not overlap? ![](https://i.imgur.com/75zIm8K.png) Handshake Join, Inspired by player handshake in soccer: ![](https://i.imgur.com/ldZIlMS.png) Missing Handshake Solution: - Adding empty tuple between tuples ![](https://i.imgur.com/jqkupv0.png) - Perform additional handshake, before move forward. Thus, tuple in stream A not only make handshake with pair tuple in stream B, but also with next tuple in stream B. ![](https://i.imgur.com/7IviPiv.png) ## c. Bounded Stream Join (both streams has Start and End) 1. **Nested-Loop Join** - Same like explained before - We can perform m-way join (join 3 streams) - Pipelining join between stream R and S - Will result rs as incoming tuple, then perform RS join T 2. **Symetric Hash Join** - As explained before using hash table for both stream R and S 3. **M-Join** - Like symetric Has Join, but with 3 or more stream 4. **AM-Join** - Advanced M Join - Using additional Hash Tables, Bit-vector Hash Table (BiHT) - When tuple r coming, instead of probing to others Hash table, probe with BiHT ![](https://i.imgur.com/qAtwHp0.png) Issues: Colision in BiHT and Updating BiHT ## 3. Granularity Reduction in Data Streams - Granularity is the level of detail at which data are stored in a database. Used for managing complexity and efficient retrieval. - For Overlapped Windows: No Granularity Reduction applied. No reduction in terms of number of records. This is a pure moving average ### Mixed Levels of Granularity - Temporal-based - Spatial-based ### Sensor-Arrays 1. Multiple sensors measuring the same things - Reduce and then Merge - Merge and then Reduce 2. Multiple sensors measuring the different things - Reduce, Normalize, and then Merge - Normalize, Merge and then Reduce > Normalisation used to convert the raw data into a category, which binds different sensors into one common thread