###### tags: `WebHack#38`

# A Glance of Data Engineering

Speaker: __Ansel Lin__, Senior Data Engineer at Treasure Data

## Talk

### Mission of data engineering

Prepare easy-to-use data

### What kind of data?

#### Structured and Unstructured Data

##### Structured Data
- Data engineering mainly focuses on this

##### Unstructured Data
- Media

### Where does data come from?

#### Master Data
- User profile
- Personal info

#### Behavioral Data
- Logs of events
- Data keeps accumulating over time

### How large is the data & how fast is it growing?
- Behavior data grows much faster than master data

### How to define "easy-to-use"
> How will the data be used?
> What questions can help people make decisions?
- We can categorize those questions into two types: look-up and analytical

### Possible challenges
- Processing efficiency

### How should we store the data
- Row-wise storage: arrange the data row by row
- Columnar storage: arrange the data column by column

### Columnar storage

#### Benefit
- Good for analytical queries

#### Limitation
- Updating a record is hard

### Behavior Data
- Append only
- Partitioned into time segments
- For each time segment: immutable data

### How to prepare the data
![](https://i.imgur.com/nOF8wkE.png)

#### How the data flow works
Input (web service, user import, ...) -> Output (external database, dashboard, ad-hoc query)

### Data ingestion
- Push-based ingestion: ingestion is triggered by the data provider
- Pull-based ingestion: ingestion is triggered by the data receiver

#### Publisher-Subscriber model
- The data provider uses push-based ingestion to push data to the publisher-subscriber component (broker)
- The data receiver uses pull-based ingestion to pull data from the publisher-subscriber component

##### Benefit
- Decouples the concerns of data providers and data receivers

### How data flows to the user
- Data flowing to the user goes through a "read path"
- If the data set is large, walking through the read path takes a long time

#### How to optimize
- Pre-generate intermediate state
- Then, each time a read request comes in, the read path is shorter

### Data pipelines

#### Possible challenges
- Transformations have many stages, so it's possible some stage will fail
- Bad data from a data source can pollute your downstream data

#### Required operational utilities
- Monitoring & alerting
- Workflow management tool: the transformation flow has many dependencies, and this tool helps us model them

#### Through the data pipelines, the data stays immutable
![](https://i.imgur.com/TuTSGR1.png)

### How quickly do we want the data to be visible
- Daily batch processing
- Hourly batch processing
- Mini-batch processing
- Streaming processing

### Lambda Architecture
- Old data (e.g. up to yesterday) from batch processing, with higher accuracy
- Live data (e.g. today) from streaming processing, with better timeliness
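To make the batch/streaming split concrete, here is a minimal Python sketch of how a serving layer might merge the two views at query time. The names (`batch_view`, `speed_view`, `count_events`) and the per-day event-count example are hypothetical illustrations, not something shown in the talk.

```python
# A minimal Lambda-architecture serving sketch (hypothetical names), assuming:
# - the batch layer has materialized daily event counts up to yesterday
# - the streaming layer keeps an approximate in-memory counter for today
from datetime import date, timedelta

# Batch view: recomputed from the full, immutable history (high accuracy, slow to refresh).
batch_view = {
    ("user_1", date.today() - timedelta(days=1)): 42,
    ("user_2", date.today() - timedelta(days=1)): 17,
}

# Speed view: updated from the stream as events arrive (low latency, approximate).
speed_view = {
    ("user_1", date.today()): 3,
}

def count_events(user_id: str, day: date) -> int:
    """Serve a query by merging the batch view and the speed view."""
    if day < date.today():
        return batch_view.get((user_id, day), 0)  # old data: trust the batch layer
    return speed_view.get((user_id, day), 0)      # today's data: use the streaming layer

print(count_events("user_1", date.today() - timedelta(days=1)))  # 42, from the batch view
print(count_events("user_1", date.today()))                      # 3, from the speed view
```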
## Q & A

- How to ensure idempotency in practice? And how to apply it across the whole data flow?
    - Isolate the output scope of each transformation (e.g. an independent directory or partition), then by default clean up that output scope before the transformation executes (a sketch at the end of these notes illustrates this)
    - Apply the idempotency principle from the data ingestion process through every single transformation stage
    - If the source or upstream is dynamic, avoid rerunning the ingestion twice
- What is the cost of the Lambda architecture? Any approach to reduce it?
    - Duplicated implementations & computing resources in both batch & streaming processing
    - Instead of implementing fully duplicated pipelines, only implement the required part of the streaming data flow
    - Implement with a framework compatible with both batch & streaming, e.g. Spark or Flink (Kappa architecture), to reduce the implementation and maintenance cost
- Is data cleansing included in data pipelines?
    - Yes, definitely. Data cleansing is usually the first transformation step in the pipelines, so that it provides cleaner data to the downstream stages.
- In the publisher-subscriber model, how do data capacity and the retention policy work?
    - First, the broker component should be scalable, so that it can easily scale out as capacity requirements grow.
    - Because the model provides a "subscription" service to an unspecified, possibly large, number of subscribers, the data should be retained for a certain time frame, e.g. 7 days or 30 days.

## Retrospective

## Networking
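Following up on the idempotency answer in the Q & A, here is a minimal sketch of the "isolate the output scope, then clean it up before execution" pattern. It assumes hourly partitions written to per-partition directories; the paths, the `process_partition` helper, and the pass-through transformation are hypothetical, not from the talk.

```python
# A minimal idempotent-transformation sketch: each run owns one isolated output
# scope (a per-hour directory) and removes it before writing, so rerunning the
# same partition always leaves exactly one consistent copy of the output.
import shutil
from pathlib import Path

def process_partition(input_dir: Path, output_root: Path, hour: str) -> None:
    """Transform one hourly partition into its own isolated output directory."""
    out_dir = output_root / f"hour={hour}"   # output scope owned by this run only
    if out_dir.exists():
        shutil.rmtree(out_dir)               # clean up the scope before executing: safe to rerun
    out_dir.mkdir(parents=True)

    for src in sorted(input_dir.glob("*.jsonl")):
        # Hypothetical transformation: pass records through unchanged.
        (out_dir / src.name).write_text(src.read_text())

# Running the same partition twice produces the same result (idempotent):
# process_partition(Path("raw/hour=2023-01-01T00"), Path("clean"), "2023-01-01T00")
# process_partition(Path("raw/hour=2023-01-01T00"), Path("clean"), "2023-01-01T00")
```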