---
title: "Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus"
tags: PyConTW2024, 2024-organize, 2024-共筆
---
# Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus
> Collaborative writing starts from below
## Memray
- A memory profiler for Python: https://pypi.org/project/memray/
- Run a script under the profiler: `python -m memray run script.py`
- The captured trace can then be rendered with `memray flamegraph <output file>`
## Using PySpark through the pandas API
- `import pyspark.pandas as pd` lets existing pandas code run on Spark as a near drop-in replacement
- `spark.driver.maxResultSize` caps the total size of serialized results collected back to the driver
- `df.to_numpy()` collects the entire DataFrame into driver memory, so it can exhaust the driver on large data
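Collect-heavy calls like `df.to_numpy()` hit the `spark.driver.maxResultSize` limit (default `1g`). A minimal configuration sketch for raising it, assuming a standard `SparkSession` builder (not runnable without a Spark installation):

```python
# Sketch: raising the cap on results collected to the driver.
# Default is 1g; "0" disables the limit entirely, which trades the
# error for a possible driver OOM.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)
```

Raising the limit only postpones the problem; the scalable fix is to avoid collecting the full dataset to the driver at all.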
## pandas API pitfalls
- `pyspark.pandas.DataFrame.shape` is very slow: the row count requires scanning the whole dataset
- `pyspark.pandas.DataFrame.__len__` is expensive for the same reason, so a bare `len(df)` triggers a full count
- `pyspark.pandas.DataFrame.apply(f)` differs from pandas when `axis=0`: `f` may be called on batches of each column rather than on the whole column at once
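The `axis=0` difference can be illustrated without a cluster: pandas-on-Spark splits each column into batches and calls the function per batch, so any function that depends on the whole series (here, centering by the mean) gives different results. A plain-pandas sketch simulating two batches:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Plain pandas: apply(axis=0) sees each full column at once,
# so the mean is computed over all four values.
full = df.apply(lambda s: s - s.mean())

# pyspark.pandas may instead call the function on column *batches*
# (roughly one per internal partition). Simulated here by splitting
# the column into two batches and applying the function to each:
batches = [df.iloc[:2], df.iloc[2:]]
batched = pd.concat(b.apply(lambda s: s - s.mean()) for b in batches)

print(full["x"].tolist())     # [-1.5, -0.5, 0.5, 1.5]
print(batched["x"].tolist())  # [-0.5, 0.5, -0.5, 0.5]
```

Per-row functions are unaffected; only functions that need the whole column (means, ranks, cumulative sums) silently change meaning.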
## A more scalable alternative
- `pyspark.sql.DataFrame.mapInPandas(f)` streams each partition to `f` as an iterator of pandas DataFrames
- This can still OOM if `f` materializes too much data per batch
## PySpark's built-in memory profiler
- Set the Spark config `spark.python.profile.memory` to `true`
- Call `sc.show_profiles()` on the driver to print the collected per-UDF profiles
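Putting the two bullets together, a configuration sketch (assuming PySpark 3.4+, where the UDF memory profiler was added, with the `memory-profiler` package available on the executors; not runnable without a Spark installation):

```python
# Sketch: enabling PySpark's built-in UDF memory profiler.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.python.profile.memory", "true")
    .getOrCreate()
)

# ... run Python UDF / mapInPandas jobs here ...

# Print per-UDF memory profiles gathered from the executors:
spark.sparkContext.show_profiles()
```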
## Using Memray on executors
- Wrap the code to be profiled: `with memray.Tracker(filename, native_traces=True):`
- `native_traces=True` also records native (C/C++) stack frames, e.g. allocations made inside NumPy
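A self-contained sketch of the `Tracker` pattern. The workload (`allocate`) is a hypothetical stand-in for a UDF body, and the code degrades gracefully when Memray is not installed; in a real job, each executor task would write its own trace file to be collected and rendered with `memray flamegraph <file>`:

```python
import os
import tempfile

try:
    import memray  # pip install memray (Linux/macOS only)
except ImportError:
    memray = None

def allocate(n: int) -> list:
    # Hypothetical workload standing in for a UDF body.
    return [bytes(1024) for _ in range(n)]

def run_with_tracking(out_path: str) -> list:
    """Run the workload, capturing a Memray trace if Memray is available."""
    if memray is None:
        return allocate(1000)  # profiling unavailable; run unprofiled
    # native_traces=True also records C/C++ frames in the trace.
    with memray.Tracker(out_path, native_traces=True):
        return allocate(1000)

path = os.path.join(tempfile.mkdtemp(), "udf_trace.bin")
result = run_with_tracking(path)
print(len(result))  # 1000
```

Choosing a distinct `out_path` per task (e.g. embedding the partition id) avoids executors clobbering each other's trace files on shared storage.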
---
Below are the updates or corrections the speaker made to the slides after the talk