# Dataflow Discussion
by Aneesh Melkot 1001750503

## Contents
[TOC]
## When would you use a batch system vs. streaming system?
### Batch Systems
**Batch** processing deals with large data sets all at once. The data to be processed has typically been collected and stored over some period of time, and the entire set is then processed in a single "batch".
> **Example**: When a car company needs to perform historical analysis on its **Sales** data, batch processing is applied to the data that has been collected over the years. [color=#6C63FF]
#### Benefits of Batch
- Well suited to repeated workloads.
- Workloads can also be handled offline.
- Very little idle time.
- No need for specialized hardware or software.
### Stream Systems
**Streaming** systems, on the other hand, process data as it becomes available, typically while the data is still in transit (on its way to being stored). The data is processed and transformed as required and then stored in raw and/or final form.
In legacy systems the data was highly structured and organized, but modern data is aggregated from many sources and doesn't necessarily follow a single structure. Hence, stream processing is gaining traction.
> **Example**: A company like Amazon Prime Video might want to process user data to fine-tune its recommendation system. For this purpose, it will process every `click`, `view`, and `like` the user performs in the app, as the user performs it, and update the **recommendations** for that user. This is stream processing, handling these events on the fly. (Not to be confused with video streaming, which is a different concept.) [color=#6C63FF]
#### Benefits of Stream
- Continuous processing/management of data.
- Easy filtering and cleaning.
- Less data management once a pipeline is established and tested.
- Can help boost user experience.
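
As a rough sketch of the batch side, the pipeline below (written with Apache Beam, the SDK that Dataflow runs) reads a bounded dataset, aggregates it once, and finishes. The records created with `beam.Create` are invented here and stand in for stored historical sales data.

```python
import apache_beam as beam

# Minimal batch-style pipeline: the whole bounded dataset is processed
# in one go and the job then terminates. The records are made up and
# stand in for sales data read from storage.
with beam.Pipeline() as p:
    (p
     | 'ReadHistory' >> beam.Create([('sedan', 3), ('suv', 5), ('sedan', 2)])
     | 'TotalPerModel' >> beam.CombinePerKey(sum)  # one total per car model
     | 'Print' >> beam.Map(print))
```

A streaming version of the same pipeline would swap `beam.Create` for an unbounded source (e.g. Pub/Sub) and add windowing, which is covered below.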
## What about the Lambda architecture?
The **Lambda** architecture is a hybrid approach for processing enormous amounts of data (also known as "Big Data") that makes both batch-processing and stream-processing techniques available, and it supports computing arbitrary functions over the data. Three layers make up the Lambda architecture:
- Batch Layer: Used for ETL and Data Warehousing
- Serving Layer: Indexing for Batch and Speed layers.
- Speed Layer: Data that is most recent and not processed in Batch is processed in this layer.
> **Example**: Since this architecture places no practical limit on how much data can be held, one of its best features is the ability to create a Data Lake. Companies can leverage this to dump their data into the cloud. This gives company-wide visibility and transparency into the data, and multiple teams can run their specialized workloads on this huge data set to deliver reports and derive insights. [color=#6C63FF]
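
To make the three layers concrete, here is a hypothetical sketch of what the serving layer does at query time: it merges the complete-but-stale batch view with the speed view, which covers only the events that arrived since the last batch run. The function and view names are illustrative, not a real API.

```python
# Hypothetical serving-layer merge: the batch view is complete but stale
# (recomputed periodically), while the speed view covers only the events
# that arrived since the last batch run.
def total_sales(model: str, batch_view: dict, speed_view: dict) -> int:
    return batch_view.get(model, 0) + speed_view.get(model, 0)

# The batch run counted 120 sedan sales; 3 more arrived since then.
print(total_sales('sedan', batch_view={'sedan': 120}, speed_view={'sedan': 3}))  # 123
```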
#### Benefits of Lambda
- Easy Management
- Scalability
- Availability
## What is the purpose of the Windowing model in Google Cloud Dataflow?
Cloud Dataflow is a serverless data processing service that executes jobs written with the Apache Beam SDK. Running a job on Cloud Dataflow creates a cluster of virtual machines, distributes the work among the VMs, and dynamically scales the cluster based on the job's performance. To optimize your job, it may even alter the pipeline's processing sequence.
Windowing typically comes into the picture when we are dealing with a subset of the data.
> **Example**: We want to process data for a particular day, say `Wednesday`. [color=#6C63FF]
In a distributed system, data is not stored the instant it occurs; there can be delays and inconsistencies in the data store. This latency might bring in some data from the previous day, i.e. `Tuesday`, while some data from the day in question might still be missing.
We can utilize **windowing** to resolve this issue. The objective is to separate the data into several *time-based* segments. For this to work, each element must carry a `timestamp` so that the data processing program knows which window to place it in.
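
A minimal sketch of this idea in Apache Beam (the records and timestamps are invented for illustration): each element is stamped with its event time, and `WindowInto` then groups elements into one-day windows.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# (label, count, event time in seconds since epoch); 80_000 falls on
# day 0, while 90_000 and 95_000 fall on day 1.
events = [('tue', 1, 80_000), ('wed', 1, 90_000), ('wed', 1, 95_000)]

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create(events)
     | 'Stamp' >> beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
     | 'Daily' >> beam.WindowInto(FixedWindows(24 * 60 * 60))  # one window per day
     | 'CountPerDay' >> beam.CombinePerKey(sum)
     | 'Print' >> beam.Map(print))
```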
### Why were different types of Windowing patterns required for different applications at Google (please give some examples)?
When dealing with batch data, all data is available for processing, so it can be treated as a single global window.
Different windowing patterns are needed because unbounded data has no structure and is not consistent; it is sporadic. Hence there is a need to group the data using different strategies, in anticipation of its arrival, in order to process it efficiently.
When dealing with streaming or unbounded data, the following windowing concepts are used at Google with respect to Dataflow:
#### Tumbling/Fixed Window
The windows have a fixed, equal size and do not overlap.

> **Example**: An application that is continuously sending data can be assigned an arbitrary fixed window size, say 2 minutes. Any kind of aggregation can then be performed on each window, and the output is emitted once per window. [color=#6C63FF]
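
As a sketch, this is how the 2-minute tumbling window from the example might look in Beam (the variable name is arbitrary):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

# Two-minute tumbling windows: every element lands in exactly one window,
# and an aggregation downstream emits one result per window.
fixed = beam.WindowInto(FixedWindows(2 * 60))  # window size in seconds
```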
#### Hopping/Sliding Window
The windows have a fixed, equal size and can overlap, so the same data can be found in multiple windows.

> **Example**: In a streaming application, the window size can be set to 10 minutes with a new window starting every 5 minutes. In this case, data at the 6th minute will be present in both windows 1 and 2. This is useful for running averages and running totals on the incoming dataset: a 10-minute running total can be recalculated every 5 minutes. [color=#6C63FF]
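
The corresponding Beam sketch for that example, assuming Beam's `SlidingWindows` transform (size and period in seconds):

```python
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

# Ten-minute windows that start every five minutes: windows overlap, so
# an element can belong to two windows (good for running totals/averages).
sliding = beam.WindowInto(SlidingWindows(size=10 * 60, period=5 * 60))
```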
#### Session Window
Windows of varying size and interval, with no overlap.

> **Example**: A user logs on to an application, performs some actions, goes idle, and later comes back and uses the application again. Here we see a staggered usage pattern. Session windows are useful for user activity that is unevenly distributed. [color=#6C63FF]
This window takes only one parameter, the `gap` (idle time), which is fixed.
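
A minimal Beam sketch of a session window, assuming an arbitrary 10-minute idle gap:

```python
import apache_beam as beam
from apache_beam.transforms.window import Sessions

# A session closes after 10 minutes of inactivity per key, so the
# resulting window sizes vary with each user's activity pattern.
sessions = beam.WindowInto(Sessions(gap_size=10 * 60))
```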
> Images taken from Dataflow Docs.
:::warning
Custom windows can also be defined based on requirement.
:::
### What is the purpose of triggering in Dataflow? How is it related to/different from the windowing models?
Before talking about triggers, let's take a look at what watermarks are.
#### Watermark
This is the threshold that determines by when the data in a window should have arrived. An element is `late` if it belongs to the window but arrives after the watermark has passed. A watermark is essentially a pointer that moves from the start of a window to its end.
#### Triggering
The results have to be emitted after the data in a window is processed; the **trigger** determines when to emit them.
By default, the trigger emits the result when the watermark crosses the end of the current window.
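
As a sketch of how this is expressed in Beam, the snippet below makes the default watermark behavior explicit and also adds a speculative early firing (the one-minute window and 30-second delay are arbitrary choices):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

# Fire when the watermark passes the end of each one-minute window, plus
# an early, speculative pane every 30 seconds of processing time.
windowed = beam.WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(early=AfterProcessingTime(30)),
    accumulation_mode=AccumulationMode.ACCUMULATING)  # each pane includes all data so far
```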
### Windowing vs Triggering
Triggers complement the windowing model: each affects the system's behavior along a distinct axis of time.
Windowing | Triggering
:----:|:----:
How is data grouped? | When are the results of every group emitted?
## Is event-time windowing that important?
**YES**. This is because unprocessed data is sporadic and inconsistent, and can arrive out of order.
The system should also be able to go back in time and update the result as older data arrives at a later point.
> **Example**: Due to latency, data from 2 days ago could come in today. This data needs to be appended in a window in the past and the results have to be recalculated for that window. [color=#6C63FF]
We need the ability to get **"the result so far..."**.
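
In Beam terms, this "result so far" behavior can be sketched by allowing late data and re-firing the window whenever a late element arrives (the two-day lateness bound mirrors the example above):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterWatermark)

# Keep each window open for two extra days; whenever a late element
# arrives, re-emit an updated, accumulated result for that window.
late_tolerant = beam.WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(late=AfterCount(1)),
    allowed_lateness=2 * 24 * 60 * 60,  # seconds
    accumulation_mode=AccumulationMode.ACCUMULATING)
```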
## References
[Dataflow Docs](https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines)
[Dataflow Paper](https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)