---
title: "How to Design a Successful (Intern) Project with Apache Beam? - Kir Chou"
tags: PyConTW2023, 2023-organize, 2023-共筆
---

# How to Design a Successful (Intern) Project with Apache Beam? - Kir Chou

{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}

<iframe src=https://app.sli.do/event/uTGZwb1A7fYXEtB48xiZTn height=450 width=100%></iframe>

> Collaborative writing starts below

Modern data processing:

- Massive-scale data
- Unbounded data
- Out-of-order data

Goal:

- Correctness (to a level)
- Low enough latency
- Acceptable cost

The hero of this talk: [Apache Beam](https://beam.apache.org/)

"Table" - becomes -> "Collection"

## Questions

### Is it costly to maintain Apache Beam infrastructure in-house?

If you are asking this question, you are likely better off using a cloud solution.

### Are there any key points we should know when using Beam with streaming data? E.g., how to prevent missing data or duplicated data processing.

Of the three goals listed earlier, this one is about "correctness". So how correct do you need to be? If you have to be extremely correct, you have to sacrifice something else, e.g. processing will take longer. The papers mentioned in the talk cover many scenarios for thinking about this trade-off.

### Are there any good tools for designing the data flow, which is the "key" to using Beam well?

We use a whiteboard. Write down what data you have and in what format, and work from there. This part is pretty much a one-off and has to be thought through carefully. Once it is done, you can implement it. Unfortunately, there are no other suggestions that just "solve" this.

### We have big data. How do we optimise the flow or intermediate tables for this?

My company has many teams working on pipeline optimisation. There is currently no general solution. One thought, though: can some data be dropped earlier, so there is less to process? Beyond that, there is nothing very generic.

Below is the part that the speaker updated or corrected in the talk/tutorial slides after the speech.