Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus

Welcome to PyCon TW 2024 Collaborative Writing

Collaborative writing workspace: https://hackmd.io/@pycontw/2024
On mobile, please tap the button above to unfold the agenda.

Collaborative writing starts below

Memray

Use PySpark

  • import pyspark.pandas as pd

spark.driver.maxResultSize

  • df.to_numpy() loads all data into the driver

pyspark.pandas.DataFrame.shape is very slow

pyspark.pandas.DataFrame.__len__ is expensive

pyspark.pandas.DataFrame.apply(f) behaves differently from pandas when axis = 0

More scalable

  • pyspark.sql.DataFrame.mapInPandas(f)
  • Can still OOM

PySpark built-in memory profiler

  • Set the Spark config spark.python.profile.memory to true
  • sc.show_profiles()

Use Memray

  • with memray.Tracker(filename, native_traces=True):

Below are the updates and corrections the speaker made to the talk/slides after the speech