###### tags: `WebHack#38`
# A Glance of Data Engineering
Speaker: __Ansel Lin__, senior data engineer at Treasure Data
## Talk
### Mission of data engineering
Prepare easy-to-use data
### What kind of data?
#### Structured and Unstructured Data
##### Structured Data
- Data engineering mainly focuses on structured data
##### Unstructured Data
- Media (e.g. images, audio, video)
### Where does data come from
#### Master Data
- User profile
- Personal info
#### Behavioral Data
- Logs of events
- Data keeps accumulating over time
### How large is the data & how fast is it growing
- Behavioral data grows much faster than master data
### How to define "easy-to-use"
> How will the data be used?
> What questions can help people make decisions?
- We can categorize those questions into two types: look-up and analytical (contrasted in the sketch below)
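
A minimal sketch of the two query shapes, using hypothetical event records (all field names are illustrative):

```python
# Hypothetical behavioral events (field names are illustrative only).
events = [
    {"user_id": 1, "page": "/home", "latency_ms": 120},
    {"user_id": 2, "page": "/cart", "latency_ms": 340},
    {"user_id": 1, "page": "/cart", "latency_ms": 95},
]

# Look-up query: fetch a few whole records by key.
user_1_events = [e for e in events if e["user_id"] == 1]

# Analytical query: scan one field across all records and aggregate.
avg_latency = sum(e["latency_ms"] for e in events) / len(events)

print(user_1_events)
print(avg_latency)
```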
### Possible challenges
- Processing efficiency
### How should we store the data
- Row-wise storage: arrange the data row by row
- Columnar storage: arrange the data column by column
### Columnar storage
#### Benefit
- Good for analytical queries
#### Limitation
- Updating a record is expensive, since a single row is scattered across every column (see the sketch below)
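
To make the trade-off concrete, here is a toy sketch (hypothetical data, plain Python lists standing in for storage files) of the two layouts:

```python
# Three hypothetical user records.
rows = [
    ("alice", 30, "JP"),
    ("bob",   25, "US"),
    ("carol", 35, "TW"),
]

# Row-wise layout: all values of one record sit together.
row_store = list(rows)

# Columnar layout: all values of one column sit together.
col_store = {
    "name":    ["alice", "bob", "carol"],
    "age":     [30, 25, 35],
    "country": ["JP", "US", "TW"],
}

# An analytical query ("average age") only needs to scan one column...
avg_age = sum(col_store["age"]) / len(col_store["age"])

# ...but updating a single record touches every column list, which is
# why in-place updates are the weak spot of columnar storage.
col_store["name"][1] = "bobby"
col_store["age"][1] = 26
```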
### Behavioral Data
- Append-only
- Partitioned into time segments
- Each time segment holds immutable data (sketched below)
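
A minimal sketch of that idea, assuming events carry a Unix timestamp field named `ts` (the field name and hour-level granularity are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical append-only store: one bucket per hour-level time segment.
partitions = defaultdict(list)

def append_event(event: dict) -> None:
    """Append the event to its time-segment partition; never update in place."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    partitions[ts.strftime("%Y-%m-%d/%H")].append(event)

append_event({"ts": 1562000000, "user_id": 1, "action": "click"})
# Past segments are treated as immutable, so they can be compacted,
# replicated, or re-read safely without coordination.
```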
### How to prepare the data
![](https://i.imgur.com/nOF8wkE.png)
#### How the data flow works
Input (web service, user import, ...) -> Output (external database, dashboard, ad-hoc query)
### Data ingestion
- Push-based ingestion: ingestion is triggered by the data provider
- Pull-based ingestion: ingestion is triggered by the data receiver
#### Publisher-Subscriber model
- Data providers use push-based ingestion to push data to the publisher-subscriber (broker)
- Data receivers use pull-based ingestion to pull data from the publisher-subscriber
##### Benefit
- Decouples data providers from data receivers (see the broker sketch below)
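
A toy in-memory broker showing both halves (class and method names are made up for illustration; real systems use e.g. Kafka):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory publisher-subscriber broker (API is made up)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered messages
        self.offsets = defaultdict(int)   # (topic, group) -> next offset

    def publish(self, topic: str, message: dict) -> None:
        # Push side: data providers push into the broker.
        self.topics[topic].append(message)

    def poll(self, topic: str, group: str, max_messages: int = 10) -> list:
        # Pull side: each receiver group pulls at its own pace.
        start = self.offsets[(topic, group)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(topic, group)] += len(batch)
        return batch

broker = Broker()
broker.publish("page_views", {"user_id": 1, "page": "/home"})
print(broker.poll("page_views", group="warehouse-loader"))
```

Because the broker buffers messages and each receiver group tracks its own offset, providers and receivers never need to know about each other.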
### How data flow to the user
- Data flows to the user through a "read path"
- If the data set is large, it takes a long time to walk through the read path
#### How to optimize
- Pre-generate intermediate state
- Each incoming read request then traverses a much shorter read path (sketched below)
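
A minimal sketch, assuming a daily pre-aggregation job over hypothetical latency events:

```python
# Hypothetical raw behavioral data.
events = [{"day": "2019-07-01", "latency_ms": 120},
          {"day": "2019-07-01", "latency_ms": 80},
          {"day": "2019-07-02", "latency_ms": 200}]

# Without intermediate state, every read walks the full data set:
def avg_latency_slow(day: str) -> float:
    hits = [e["latency_ms"] for e in events if e["day"] == day]
    return sum(hits) / len(hits)

# Pre-generate intermediate state once (e.g. in a nightly batch job)...
daily = {}
for e in events:
    total, count = daily.get(e["day"], (0, 0))
    daily[e["day"]] = (total + e["latency_ms"], count + 1)

# ...so each read request only touches the small pre-aggregated table.
def avg_latency_fast(day: str) -> float:
    total, count = daily[day]
    return total / count

assert avg_latency_slow("2019-07-01") == avg_latency_fast("2019-07-01") == 100.0
```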
### Data pipelines
#### Possible challenges
- The transformation flow has many stages, so some stage may fail
- Bad data from a source can pollute your downstream data
#### Operation required utility
- Monitoring & alerting
- Workflow management tool: the transformation flow has many dependencies, and such a tool helps us model them (see the sketch below)
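
A toy sketch of what such a tool models: stages as a dependency graph with a hook for alerting (stage names are hypothetical; real tools like Airflow or Digdag add scheduling and retries):

```python
# Toy dependency graph: stage -> stages it depends on (names are made up).
dag = {
    "ingest":    [],
    "cleanse":   ["ingest"],
    "aggregate": ["cleanse"],
    "export":    ["aggregate"],
}

def run(stage: str, done: set) -> None:
    """Run all upstream dependencies first, then the stage itself."""
    for dep in dag[stage]:
        if dep not in done:
            run(dep, done)
    try:
        print(f"running {stage}")  # a real tool would execute a query/script
        done.add(stage)
    except Exception as exc:
        print(f"ALERT: {stage} failed: {exc}")  # monitoring & alerting hook
        raise

run("export", set())  # runs ingest -> cleanse -> aggregate -> export
```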
#### Throughout the data pipeline, the data stays immutable
![](https://i.imgur.com/TuTSGR1.png)
### How quickly we want the data to be visible
- Daily batch processing
- Hourly batch processing
- Mini-batch processing
- Streaming processing
### Lambda Architecture
- Old data (e.g. up to yesterday) comes from batch processing, with higher accuracy
- Live data (e.g. today) comes from streaming processing, with better timeliness (merged as in the sketch below)
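
A minimal serving-layer sketch, assuming hypothetical pre-computed `batch_view` and `speed_view` tables:

```python
# Hypothetical pre-computed views (table names and numbers are made up).
batch_view = {"2019-06-29": 980, "2019-06-30": 1200}  # accurate, up to yesterday
speed_view = {"2019-07-01": 317}                      # fresh but approximate

def page_views(day: str) -> int:
    """Prefer the accurate batch view; fall back to the streaming
    (speed) view for days the batch layer has not processed yet."""
    if day in batch_view:
        return batch_view[day]
    return speed_view.get(day, 0)

print(page_views("2019-06-30"))  # served by the batch layer
print(page_views("2019-07-01"))  # served by the speed layer
```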
## Q & A
- How to ensure idempotency in practice? And how to apply it across the whole data flow?
    - Isolate the output scope of each transformation, e.g. an independent directory, partition, etc., then by default clean up the output scope before the transformation executes (sketched below)
    - Ensure the idempotency principle from the data ingestion process through every single transformation stage
    - If the source or upstream is dynamic, avoid rerunning the ingestion, since two runs may not fetch the same data
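
A minimal sketch of an idempotent transformation stage under these rules (the paths and helper name are hypothetical):

```python
import shutil
from pathlib import Path

def run_transformation(input_rows: list, out_dir: Path) -> None:
    """Idempotent stage: it owns an isolated output scope (one directory
    per run/partition) and cleans it up before writing, so a rerun can
    never leave partial or duplicated output behind."""
    if out_dir.exists():
        shutil.rmtree(out_dir)              # clean the output scope by default
    out_dir.mkdir(parents=True)
    (out_dir / "part-0000.csv").write_text(
        "\n".join(",".join(map(str, row)) for row in input_rows)
    )

rows = [(1, "click"), (2, "view")]
target = Path("/tmp/pipeline/daily/2019-07-01")  # hypothetical partition path
run_transformation(rows, target)
run_transformation(rows, target)  # rerunning is safe: same input, same output
```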
- What is the cost of the Lambda architecture? Any approach to reduce it?
    - Duplicated implementations & computing resources in both batch & streaming processing
    - Instead of implementing fully duplicated pipelines, only implement the required part of the streaming data flow
    - Implement with a framework compatible with both batch & streaming, e.g. Spark or Flink (Kappa Architecture), to reduce the implementation and maintenance cost
- Is data cleansing included in data pipelines?
    - Yes, definitely. Data cleansing is usually the first transformation step in the pipeline, so that downstream stages receive cleaner data (see the sketch below).
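
A toy cleansing step, assuming hypothetical event records with `user_id` and `page` fields:

```python
def cleanse(rows: list) -> list:
    """Hypothetical first pipeline stage: drop or normalize bad records
    so every downstream transformation receives clean data."""
    clean = []
    for row in rows:
        if not isinstance(row.get("user_id"), int):
            continue                      # drop records with a broken key
        page = (row.get("page") or "").strip().lower()
        if not page:
            continue                      # drop records missing required fields
        clean.append({**row, "page": page})
    return clean

raw = [{"user_id": 1, "page": " /Home "},
       {"user_id": "oops", "page": "/cart"},
       {"user_id": 2}]
print(cleanse(raw))  # -> [{'user_id': 1, 'page': '/home'}]
```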
- In the publisher-subscriber model, how do the data capacity and retention policy work?
    - First, the broker component should be scalable, so it can easily scale out for any increase in capacity requirements.
    - Because the model provides a "subscription" service to an unspecified, possibly large, number of subscribers, the data should be retained for a certain time frame, e.g. 7 days or 30 days.
## Retrospective
## Networking