Spark Task Optimization Journey: How I Increased 10x Speed by Performance Tuning - 游騰林

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative writing entry point: https://hackmd.io/@pycontw/2023
On mobile, tap the button above to expand the agenda list.

Collaborative writing starts below

Data

Credit card transaction data
size: 80GB

Methods of transferring (bulk) data

local files
- cons: takes up disk space
Apache Sqoop
- 2hrs
pandas read_sql
- 3.5hrs
pyspark read table
- read method: dbtable or query
- 3hrs
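
The two read methods noted for pyspark (dbtable or query) map to two JDBC options that cannot be given together, and per the Spark JDBC documentation `query` also cannot be combined with `partitionColumn`. The following is a minimal sketch of that rule as a plain-Python option builder. The helper `jdbc_read_options`, the URL, and the table and column names (`card_txn`, `txn_date`, `txn_amt`) are all hypothetical and not from the talk.

```python
def jdbc_read_options(url, *, dbtable=None, query=None, partition_column=None):
    """Build an options dict for a Spark JDBC read (illustrative helper only)."""
    # Spark accepts either dbtable or query, never both.
    if (dbtable is None) == (query is None):
        raise ValueError("specify exactly one of dbtable or query")
    # Partitioned reads need dbtable; a SQL statement can be wrapped as a subquery.
    if query is not None and partition_column is not None:
        raise ValueError("query cannot be combined with partitionColumn; "
                         "pass the SQL as a dbtable subquery instead")
    options = {"url": url}
    if dbtable is not None:
        options["dbtable"] = dbtable
    else:
        options["query"] = query
    if partition_column is not None:
        options["partitionColumn"] = partition_column
    return options


URL = "jdbc:postgresql://db.example.com:5432/warehouse"  # hypothetical URL

whole_table = jdbc_read_options(URL, dbtable="card_txn")
filtered = jdbc_read_options(
    URL, query="SELECT * FROM card_txn WHERE txn_date >= '2022-02-01'")
partitioned = jdbc_read_options(
    URL,
    dbtable="(SELECT * FROM card_txn WHERE txn_amt > 0) AS t",
    partition_column="txn_date",
)
print(partitioned)
```

With a live SparkSession, a dict like this could be passed as `spark.read.format("jdbc").options(**partitioned)`, together with the bounds and numPartitions that a partitioned read also requires.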

Spark connection properties: performance-related parameters

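Example code

spark_df = (
    spark.read
    .format("jdbc")
    # ... (remaining connection options elided in the original notes)
    # Partition on a date column so the date bounds below match its type;
    # "txn_date" is a hypothetical name for the transaction-date column.
    .option("partitionColumn", "txn_date")
    .option("numPartitions", 10)
    .option("lowerBound", "2022-02-01")
    .option("upperBound", "2023-02-01")
    .load()
)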
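
To show what the bounds and numPartitions do, here is a simplified approximation of how Spark turns them into one WHERE clause per partition. It is a sketch, not Spark's actual implementation, which differs in details such as stride rounding. The column name `txn_date` is hypothetical. The point to notice is that the bounds only set the stride: the first and last partitions are open-ended, so no rows are filtered out.

```python
from datetime import date, timedelta


def partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses for a date column."""
    if num_partitions < 2:
        raise ValueError("this sketch only covers num_partitions >= 2")
    stride = (upper - lower).days // num_partitions  # days per partition
    predicates = []
    current = lower
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lo = f"{column} >= '{current}'" if i != 0 else None
        current = current + timedelta(days=stride)
        hi = f"{column} < '{current}'" if i != num_partitions - 1 else None
        if lo is None:
            predicates.append(f"{hi} OR {column} IS NULL")
        elif hi is None:
            predicates.append(lo)
        else:
            predicates.append(f"{lo} AND {hi}")
    return predicates


preds = partition_predicates("txn_date", date(2022, 2, 1), date(2023, 2, 1), 10)
for p in preds:
    print(p)
```

Each clause corresponds to one query sent to the warehouse, which is why a large numPartitions means many concurrent requests.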

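fetchsize
- No noticeable improvement
partitionColumn
- Must be used together with upperBound / lowerBound and numPartitions
  - numPartitions: 10 (recommended) to 50; this is the number of requests sent to the data warehouse, so too high a value sends too many requests
- partitionColumn must be a numeric, date, or timestamp column, but some data has no suitable column
  - solution: split evenly on an index instead
- If the load time does not come down, the data may be concentrated in a single partition, so you need to study or experiment with which column to partition on (this may require domain knowledge) so that the data is distributed more evenly; the speaker, working in banking, partitions on the credit card transaction date
  - The speaker first used the transaction amount as partitionColumn, but amounts cluster in a certain range, so one partition ended up with far more data than the others; switching to the transaction date spread the data across partitions more evenly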
spark maxExecutor
spark executor memory
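
The skew described above (amount versus date versus index as the partition column) can be reproduced with synthetic data. This is a rough sketch only. The distributions are assumptions chosen for illustration, not the speaker's data: amounts are drawn from a long-tailed log-normal, and dates are uniform over 365 days.

```python
import random
from collections import Counter


def bucket(value, lower, upper, num_partitions):
    """Index of the stride-based partition a value falls into (ends are open)."""
    stride = (upper - lower) / num_partitions
    idx = int((value - lower) // stride)
    return min(max(idx, 0), num_partitions - 1)


random.seed(0)
n = 100_000
# Assumed distributions: mostly small amounts with a long tail, evenly spread dates.
amounts = [random.lognormvariate(6, 1) for _ in range(n)]
days = [random.randrange(365) for _ in range(n)]

amt_counts = Counter(bucket(a, 0, 100_000, 10) for a in amounts)
day_counts = Counter(bucket(d, 0, 365, 10) for d in days)
idx_counts = Counter(bucket(i, 0, n, 10) for i in range(n))  # row index

# Share of all rows that lands in the largest partition for each choice.
amt_share = max(amt_counts.values()) / n
day_share = max(day_counts.values()) / n
idx_share = max(idx_counts.values()) / n
print(f"amount: {amt_share:.1%}  date: {day_share:.1%}  index: {idx_share:.1%}")
```

Under these assumptions nearly all rows fall into the first amount partition, while the date and index splits stay close to 10% each, which matches the behaviour the notes describe.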

Other techniques for speeding up Spark jobs

Spark SQL's Catalyst Optimizer
- code readability > code efficiency
cache
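Evenly sized partitions speed up Spark
- Check the state of each stage as needed and use repartition to balance the amount of data across partitions
When you need a pandas DataFrame, repartition first and then call toPandas
- If the repartition count is too low, a single partition holds too much data and uses too much memory, so the task gets killed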
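
One way to avoid a too-low repartition count is to derive it from the data size and a target size per partition. The helper below is a hypothetical rule of thumb, not something stated in the talk. The 128 MB target is an assumption (it matches Spark's default `spark.sql.files.maxPartitionBytes`), and the 80 GB input is simply the dataset size from the notes.

```python
import math


def estimate_num_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Rough partition count so that each partition holds about the target size."""
    if total_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_bytes / target_partition_bytes))


n_parts = estimate_num_partitions(80 * 1024**3)  # 80 GB, as in the notes
print(n_parts)  # 640

# With a live SparkSession this might be used as (not executed here):
#   pdf = spark_df.repartition(n_parts).toPandas()
```

The right target depends on executor memory, so the count is best treated as a starting point to adjust after checking each stage.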

Below is the part where the speaker updated or corrected the slides after the talk