Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus
Welcome to PyCon TW 2024 Collaborative Writing
Collaborative writing workspace: https://hackmd.io/@pycontw/2024
On mobile, please tap the button above to unfold the agenda.
Collaborative writing starts from below
Memray
Use PySpark
- `import pyspark.pandas as pd` (the pandas API on Spark)
- Results pulled back to the driver are capped by `spark.driver.maxResultSize`
- `df.to_numpy()` loads all data into the driver (see the sketch below)
- `pyspark.pandas.DataFrame.shape` is very slow (it runs a distributed count)
- `pyspark.pandas.DataFrame.__len__` is expensive for the same reason
- `pyspark.pandas.DataFrame.apply(f)` differs from pandas when `axis=0`: `f` may be called on chunks of each column rather than the whole column
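A minimal sketch of these pitfalls, assuming a local Spark session; `ps` is the alias used in the pyspark.pandas docs, and the data size is illustrative:

```python
import pyspark.pandas as ps  # aliased `pd` above; `ps` avoids shadowing pandas

psdf = ps.range(10_000_000)  # a pandas-on-Spark DataFrame (size is illustrative)

# Each of these pulls work, or all the data, back to the driver:
arr = psdf.to_numpy()    # collects every row; subject to spark.driver.maxResultSize
n = len(psdf)            # __len__ triggers a full distributed count job
rows, cols = psdf.shape  # shape runs the same count, hence "very slow"

# With axis=0 the function sees *chunks* of each column, not the whole column,
# so anything that needs global state (here, the mean) can differ from pandas:
centered = psdf.apply(lambda s: s - s.mean(), axis=0)
```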
More scalable
- `pyspark.sql.DataFrame.mapInPandas(f)` streams batches of rows through `f` on the executors
- Can still OOM, e.g. if `f` materializes the whole iterator or a single batch is too large
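A runnable sketch of the `mapInPandas` pattern (the column name and schema come from `spark.range`; the transformation is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)  # one long column named "id"

def double(batches):
    # `batches` is an iterator of pandas DataFrames, one per Arrow batch.
    # Handling them one at a time keeps executor memory bounded; calling
    # e.g. pd.concat(list(batches)) instead is how you still hit OOM.
    for pdf in batches:
        yield pdf.assign(id=pdf["id"] * 2)

out = df.mapInPandas(double, schema="id long")
out.show(3)
```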
PySpark built-in memory profiler
- Set the Spark config `spark.python.profile.memory` to `true`
- Call `sc.show_profiles()` to print the collected per-UDF memory profiles
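A minimal end-to-end sketch, assuming PySpark 3.4+ and that the `memory-profiler` package (which the built-in profiler builds on) is installed on the workers:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# The config must be set before the session/context is created
spark = (
    SparkSession.builder
    .config("spark.python.profile.memory", "true")
    .getOrCreate()
)

@pandas_udf("long")
def plus_one(s):
    return s + 1  # the profiler reports line-by-line memory use of this body

spark.range(10).select(plus_one("id")).collect()
spark.sparkContext.show_profiles()  # memory_profiler-style report per UDF
```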
Use Memray
- `with memray.Tracker(filename, native_traces=True): ...` traces allocations made inside the block
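A sketch of using Memray on both sides of a Spark job. The driver part follows the line above; profiling executors from inside `mapInPandas` is one possible approach, not necessarily the speaker's exact setup, and `run_job` and the output paths are hypothetical:

```python
import uuid
import memray

# Driver side: trace allocations (native frames included) while the job runs,
# then render the capture with `memray flamegraph driver.bin`.
with memray.Tracker("driver.bin", native_traces=True):
    run_job()  # hypothetical: the driver code you want to profile

# Executor side (one possible approach): start a Tracker inside the function
# passed to mapInPandas, writing one capture per task.
def profiled(batches):
    path = f"/tmp/memray-{uuid.uuid4().hex}.bin"  # illustrative output path
    with memray.Tracker(path, native_traces=True):
        for pdf in batches:
            yield pdf  # replace with the real per-batch work
```

Each capture can then be turned into an HTML flame graph with `memray flamegraph <file>`.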
Below is the part of the slides the speaker updated or corrected after the talk