# Azure EventHubs Data Streaming Options
**_Challenge:_**
Payloads (JSON objects) coming from the HBT (Homeowner Business Transformation) application can have a different schema for each object. As a result, the objects must be handled differently and written to different sinks/tables. By default, Spark Structured Streaming assumes that all payloads share the same schema and expects a single schema to be applied to the payload before proceeding.
**_Options:_**
| Solution | API Language | Required Spark Version | Pros | Cons |
|--|--|--|--|--|
| Option 1: Use the "foreachBatch" method | Python | 2.4.4 | Data lands in Staging History in near real-time | Would require an upgrade of CMHC AIP from Cloudera 5.14 to 6.3, which would require an impact analysis |
| Option 2: Use the "foreach" method | Scala | 2.3.0 | Data lands in Staging History in near real-time | Using Scala brings new challenges |
| Option 3: Two-stage pipeline | Python | Any | No dependency on Spark 2.4 and no need for Scala | Data in Staging History is 30 minutes to 1 hour old |
**_Option 1:_**
The "foreachBatch" method offers a way to work with streaming dataframe containing different objects. However, it requires Spark v2.4.4 for Python. AIP is using 2.3 and upgrading the version will require a lot of effort and cost. As a result, this option is not viable.
Current CMHC AIP Spark versions:
| ENV |SPARK |
|--|--|
|DEV |2.2.0 |
|PRD |2.3.0 |
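For reference, a minimal sketch of how "foreachBatch" could route each micro-batch to per-object tables. The connection string, object types, and table names are placeholders, and the azure-eventhubs-spark connector is assumed; this is not runnable on the current AIP clusters since the Python foreachBatch API needs Spark 2.4.4:

```python
# Option 1 sketch: requires the Python foreachBatch API (Spark >= 2.4.4 per
# the table above). Connection string, object types, and table names are
# placeholders, not the actual HBT configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("hbt-foreachbatch-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Assumes the azure-eventhubs-spark connector; newer connector versions
# expect the connection string to be encrypted via EventHubsUtils.encrypt.
eh_conf = {"eventhubs.connectionString": "<EventHubs connection string>"}
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()
events = raw.select(col("body").cast("string").alias("body"))

def route_batch(batch_df, batch_id):
    # Split each micro-batch by object type and append to that object's table.
    for obj_type in ["loan", "application", "portfolio"]:
        matches = batch_df.filter(
            col("body").contains(".%s." % obj_type))  # e.g. ...event.loan.created
        matches.write.mode("append").saveAsTable("staging_history_" + obj_type)

events.writeStream.foreachBatch(route_batch).start().awaitTermination()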
**_Option 2:_**
The "foreach" method offers a way to work with streaming dataframe containing different objects at a row level. It is available for Scala/Spark 2.3.0.
**_Option 3:_**
Apply a two-stage pipeline:
| From | To | Use |
|--|--|--|
| Azure EventHubs | Landing Zone | Spark Structured Streaming |
| Landing Zone | Staging History Zone | Batch processing/PySpark |

**_Approach_**
The overall approach is as follows:
1) EventHubs
2) Spark Structured Streaming from EventHubs to Landing Zone
3) Landing Zone: Directories categorizing JSON files by Object-Event, then Date
4) Batch Processing from Landing Zone to Staging Zone
5) Staging Zone: Hive Parquet Tables
6) ETL from Staging Zone to Data Warehouse
7) Integrated Zone: SQL Server Data Warehouse
**_Stage 1_**
Spark Structured Streaming will consume JSON objects from Azure EventHubs. The JSON objects will be partitioned into M folders in the Landing Zone, where M is the number of distinct object types (e.g., loan, application, portfolio).
EventHubs messages will be in JSON format, with the following fields (an illustrative example follows the list):
- correlationId
- data (the actual payload/JSON object)
- sequenceNumber
- [object name]
- type; examples:
    - ca.cmhc.hobt.event.[portfolio].created
    - ca.cmhc.hobt.event.[portfolio].updated
    - ca.cmhc.hobt.event.[portfolio].removed
- dataContentType
- id
- source
- specVersion
- timestamp
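An illustrative message, with fabricated placeholder values (the field names follow the list above):

```python
# Illustrative EventHubs message; all values are fabricated placeholders.
sample_message = {
    "correlationId": "7f1c9e0a-0000-0000-0000-000000000000",
    "data": {"portfolioId": "P-123", "status": "ACTIVE"},  # the actual payload
    "sequenceNumber": 42,
    "type": "ca.cmhc.hobt.event.portfolio.created",
    "dataContentType": "application/json",
    "id": "b2d7a6c4-0000-0000-0000-000000000000",
    "source": "hbt",
    "specVersion": "1.0",
    "timestamp": "2020-06-01T12:00:00Z",
}
```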
Data will be stored in the Landing Zone as JSON files partitioned by event type and date.
Example event type and date partition layouts are shown in the comments of the sketch below.
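A minimal sketch of the Stage 1 job, assuming the azure-eventhubs-spark connector, an ISO 8601 timestamp, and placeholder connection string and paths:

```python
# Stage 1 sketch: stream from EventHubs into Landing as JSON, partitioned by
# event type and date. Connection string and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object, to_date

spark = SparkSession.builder.appName("hbt-landing-stream").getOrCreate()

eh_conf = {"eventhubs.connectionString": "<EventHubs connection string>"}
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Keep the full message body and derive the two partition columns from it
# (timestamp format assumed ISO 8601, e.g. 2020-06-01T12:00:00Z).
events = (raw
    .select(col("body").cast("string").alias("body"))
    .withColumn("event_type", get_json_object("body", "$.type"))
    .withColumn("event_date", to_date(get_json_object("body", "$.timestamp"))))

# Resulting layout (illustrative), event type first, then date:
#   /landing/hbt/event_type=ca.cmhc.hobt.event.portfolio.created/event_date=2020-06-01/part-....json
(events.writeStream
    .format("json")
    .partitionBy("event_type", "event_date")
    .option("path", "/landing/hbt")
    .option("checkpointLocation", "/landing/hbt/_checkpoints")
    .start()
    .awaitTermination())
```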
**_Stage 2_**
Batch processing will load the JSON file data from the Landing Zone into Hive Parquet tables in the Staging Zone. There will be M tables in Hive, one for each folder in Landing, corresponding to each object type.
There will be periodic loads every 30 to 60 minutes from the Landing folders into the Staging tables.
Created/updated/removed events will be appended to each object's history table, as in the sketch below.
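A minimal sketch of the Stage 2 batch load for one object, assuming placeholder paths, database, and table names; a production job would also need to track which date folders have already been loaded:

```python
# Stage 2 sketch: load one Landing date partition into a Hive Parquet history
# table. Paths, database, and table names are placeholders; a real job must
# track which date folders were already loaded to avoid duplicates.
import sys
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hbt-staging-batch")
         .enableHiveSupport()
         .getOrCreate())

load_date = sys.argv[1] if len(sys.argv) > 1 else "2020-06-01"

# Read all created/updated/removed portfolio events for the given date.
portfolio_events = spark.read.json(
    "/landing/hbt/event_type=ca.cmhc.hobt.event.portfolio.*/event_date=%s/"
    % load_date)

# Append to the object's history table, stored as Parquet.
(portfolio_events.write
    .mode("append")
    .format("parquet")
    .saveAsTable("staging.portfolio_history"))
```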
**_Working Notes_**
Jo and Aboud:
- Diagram for the options below
- Bullet points
- Pros and cons

For Nesrin:
1. Event Hub --> one streaming app (filter by message type) --> M folders (partitioned by type + date, as JSON files) --> batch job --> one Parquet table per object (loan, application, portfolio). This is not real-time.
    - Does not need foreachBatch or foreach.
    - Pros: no dependency on Spark 2.4 and no need to use Scala.
    - Cons: the JSON tables are broken down by object and by create/update/delete; this solution is not real-time.
2. If we can use Scala: Event Hub --> one streaming app (using foreach) --> M Parquet tables (one per object).
    - Pros: the Parquet tables are near real-time.
    - Currently Spark 2.3 ==> Spark 2.4.

Deploy from DEV to PRD:
1. Commit and push to the remote repository.
2. Pull from Git in PRD.
3. Use it in PRD!

Teams: AIP, AIP Support (Pradeep, Boris), AIP AO (application outsourcing; no staff at the moment), AO (all applications).

DevOps automation:
- Make sure the required libraries are available in DEV and PRD.
- Azure SQL (managed service).