
How to Design a Successful (Intern) Project with Apache Beam? - Kir Chou

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative writing workspace: https://hackmd.io/@pycontw/2023
On mobile, please tap the button above to expand the agenda list.

Collaborative writing starts from below

Modern data processing:

  • Massive-scale data
  • Unbounded data
  • Out-of-order data

Goal:

  • Correctness (to the level you need)
  • Low enough latency
  • Acceptable cost

The hero of this talk: Apache Beam

"Table" - becomes -> "Collection"

Questions

Is it costly to maintain Apache Beam infrastructure in-house?

If you need to ask this question, you are probably better off using a managed cloud solution.
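For context, Beam pipelines are runner-agnostic, so the same code can be handed to a managed service instead of self-hosted infrastructure. A sketch, assuming Google Cloud Dataflow as the managed runner; the project, region, and bucket values are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All values below are placeholders for your own cloud setup.
options = PipelineOptions(
    runner="DataflowRunner",             # managed runner instead of in-house infra
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
```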

Are there any key things we should know when using Beam with streaming data? For example, how do we prevent missing data or duplicate processing?

Of the three goals mentioned earlier, this is about correctness: how correct do you need to be? If you need to be extremely correct, you have to sacrifice something else, e.g. results will take longer to produce. The papers mentioned in the talk work through many scenarios for this trade-off.
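As an illustration of that trade-off, here is a minimal sketch of the knobs Beam exposes for it: windows, triggers, and allowed lateness control how long you wait for out-of-order data versus how early you emit (possibly incomplete) results. The data and durations are illustrative only:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("click", 1), ("click", 65), ("click", 70)])
        # Attach event timestamps so windowing happens in event time.
        | beam.MapTuple(lambda key, ts: window.TimestampedValue(key, ts))
        | beam.WindowInto(
            window.FixedWindows(60),  # 1-minute event-time windows
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(10),  # speculative early results
                late=trigger.AfterCount(1),             # re-fire per late element
            ),
            allowed_lateness=Duration(seconds=600),     # accept data up to 10 min late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | beam.combiners.Count.PerElement()
        | beam.Map(print)
    )
```

Waiting longer (larger allowed lateness) buys correctness; firing earlier (early triggers) buys latency; both cost more compute, which is exactly the correctness/latency/cost triangle from the start of the talk.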

Are there any good tools for designing the data flow, which seems to be the key to using Beam well?

We use a whiteboard: write down what data you have and in what format, then work out the transformations from there. This design step is a one-off and you have to think it through carefully; once it is done, you can implement it (see the sketch below). Unfortunately, I have no other suggestion that just "solves" this.
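For illustration, a whiteboard diagram can map one-to-one onto a pipeline, with each labelled transform corresponding to one box on the board. The file names, record format, and steps here are hypothetical:

```python
import json
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw events" >> beam.io.ReadFromText("events.jsonl")  # box 1 on the board
        | "Parse JSON" >> beam.Map(json.loads)                       # box 2
        | "Key by user" >> beam.Map(lambda e: (e["user"], e))        # box 3
        | "Group per user" >> beam.GroupByKey()                      # box 4
        | "Format report" >> beam.MapTuple(
            lambda user, events: f"{user}\t{len(list(events))}")
        | "Write report" >> beam.io.WriteToText("report")            # box 5
    )
```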

We have big data; how do we optimise the flow or the intermediate tables for this?

My company has many teams working on pipeline optimisation, and there is no general solution yet. One recurring idea, sketched below: can some data be dropped earlier in the pipeline, so there is less to process downstream? Beyond that, nothing very generic.
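A sketch of that "drop data early" idea, assuming the Beam Python SDK; the field names and the expensive step are hypothetical:

```python
import apache_beam as beam

def expensive_enrichment(event):
    # Placeholder for a costly step, e.g. a join or an external lookup.
    return event

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([{"kind": "click", "user": "a", "payload": "..."},
                       {"kind": "view", "user": "b", "payload": "..."}])
        # 1) Drop irrelevant records as early as possible.
        | beam.Filter(lambda e: e["kind"] == "click")
        # 2) Project away fields the later stages do not need.
        | beam.Map(lambda e: {"user": e["user"]})
        # 3) Only then run the expensive part, on much less data.
        | beam.Map(expensive_enrichment)
    )
```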

Below is the part where the speaker updated or corrected the talk/tutorial slides after the speech