# 3. Velocity
## 1. Data stream
Data stream is a real-time, continues, ordered sequence (by arrival time or timestamp explicitely) of items.
### Windowing
Approach to reduce the importance of older tuples without eliminating their influence on further analysis.
Type of Stream Windows:
- Time-based

- Tuple-based

### Burst Arrival
- Time Based Window

- Tuple Based Windows

### Database vs Stream Processing
| Database | Stream Processing |
|-----------------------------------------|----------------------------------------|
| Bounded Data | Unbounded Data (data keeps coming without end) |
| Relatively Static Data | Dynamic Data |
| Complex, ad-hoc query | Simple, continues query |
| Possible to backtrack during processing | No backtracking, single pass operation |
| Exact Answer to Query | Approximate answer to query |
| Tuples arrival rate is low | Tuples arrival rate is high |
### Event vs Processing time
- Event Time: the time when the data is produced by the source
- Processing time: the time when data is arrived at the processing unit
- Ideally event time==processing time, in reality event time is earlier
- Data may arrive in burst.
### Streaming Technologies: Kafka
- Scalable
- Fault-tolerant
- Publish-Subscribe messaging system
- Enable distributed applications
### How Kafka Works
- Publish-Subscribe messaging system
- All messages are persisted and replicated to peer brokers for fault-tolerance
- Messages persisted in specific amount of time
- There are no deletes
- Does not track who consume messages
## 2. Stream Join
## a. General Stream
### General Stream Join Process

- When tuple r is comming:
- Probe matching tuples in current window of stream S
- Insert tuple r into stream R window
- Invalidate all expired tuples in stream R window
### General Stream Join Types:
- Nested-Loop Join
- Check each tuple in streams R and match with all tuples in streams S
- Problems: Overlapping tuples when windows size = slide size
- Sort-Merge Join
- Sort tuples and both stream's window
- Merge both stream
- Problems: Overlapping tuples when windows size = slide size
- Hash Join
- Hash every tuple in stream window S, store in hash table
- Proble every tuple in window S with hash table
- Symmetric Hash Join
- When tuple r arrives:
- Probe tuple r to hash table S
- Hash tuple r and add to hash table R
- Insert tuple r into stream R
- When tuple s arrives:
- Probe tuple s with hash table R
- Hash tuple s and add to hash table S
- Insert tuple s into stream S
## b. Unbounded Stream Join
Join is applied to tuples in running window.
### 1. Time Based window Stream Join (Using M-Join)
- Similar to Symmetric Hash Join
- Each tuple comming for example tuple r, probe to hash table S and T

- Do similar thing when tuple s and t coming.
### 2. Tuple Based window Stream Join (Using Handshake Join)
Problems with tuple based window:
- Window size of each stream (number of tuples), can they be different?

- How if windows amont streams will not overlap?

Handshake Join, Inspired by player handshake in soccer:

Missing Handshake Solution:
- Adding empty tuple between tuples

- Perform additional handshake, before move forward. Thus, tuple in stream A not only make handshake with pair tuple in stream B, but also with next tuple in stream B.

## c. Bounded Stream Join (both streams has Start and End)
1. **Nested-Loop Join**
- Same like explained before
- We can perform m-way join (join 3 streams)
- Pipelining join between stream R and S
- Will result rs as incoming tuple, then perform RS join T
2. **Symetric Hash Join**
- As explained before using hash table for both stream R and S
3. **M-Join**
- Like symetric Has Join, but with 3 or more stream
4. **AM-Join**
- Advanced M Join
- Using additional Hash Tables, Bit-vector Hash Table (BiHT)
- When tuple r coming, instead of probing to others Hash table, probe with BiHT

Issues: Colision in BiHT and Updating BiHT
## 3. Granularity Reduction in Data Streams
- Granularity is the level of detail at which data are stored in a database.
Used for managing complexity and efficient retrieval.
- For Overlapped Windows: No Granularity Reduction applied. No reduction in terms of number of records. This is a pure moving average
### Mixed Levels of Granularity
- Temporal-based
- Spatial-based
### Sensor-Arrays
1. Multiple sensors measuring the same things
- Reduce and then Merge
- Merge and then Reduce
2. Multiple sensors measuring the different things
- Reduce, Normalize, and then Merge
- Normalize, Merge and then Reduce
> Normalisation used to convert the raw data into a category, which binds different sensors into one common thread