---
title: "Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus"
tags: PyConTW2024, 2024-organize, 2024-共筆
---
# Effectively memory profiling distributed PySpark code - Kaashif Hymabaccus
> Collaborative writing starts from below
## Memray
- A memory profiler for Python: https://pypi.org/project/memray/
- Run a script under the profiler: `python -m memray run script.py`
- The captured trace can then be rendered with `memray flamegraph <output file>`
## Using PySpark through the pandas API
- `import pyspark.pandas as pd` lets existing pandas code run on Spark as a near drop-in replacement
- `spark.driver.maxResultSize` caps the total size of serialized results collected back to the driver
- `df.to_numpy()` collects the entire DataFrame into driver memory, so it can exhaust the driver on large data
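Collect-heavy calls like `df.to_numpy()` hit the `spark.driver.maxResultSize` limit (default `1g`). A minimal configuration sketch for raising it, assuming a standard `SparkSession` builder (not runnable without a Spark installation):

```python
# Sketch: raising the cap on results collected to the driver.
# Default is 1g; "0" disables the limit entirely, which trades the
# error for a possible driver OOM.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)
```

Raising the limit only postpones the problem; the scalable fix is to avoid collecting the full dataset to the driver at all.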
## pandas API pitfalls
- `pyspark.pandas.DataFrame.shape` is very slow: the row count requires scanning the whole dataset
- `pyspark.pandas.DataFrame.__len__` is expensive for the same reason, so a bare `len(df)` triggers a full count
- `pyspark.pandas.DataFrame.apply(f)` differs from pandas when `axis=0`: `f` may be called on batches of each column rather than on the whole column at once
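The `axis=0` difference can be illustrated without a cluster: pandas-on-Spark splits each column into batches and calls the function per batch, so any function that depends on the whole series (here, centering by the mean) gives different results. A plain-pandas sketch simulating two batches:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Plain pandas: apply(axis=0) sees each full column at once,
# so the mean is computed over all four values.
full = df.apply(lambda s: s - s.mean())

# pyspark.pandas may instead call the function on column *batches*
# (roughly one per internal partition). Simulated here by splitting
# the column into two batches and applying the function to each:
batches = [df.iloc[:2], df.iloc[2:]]
batched = pd.concat(b.apply(lambda s: s - s.mean()) for b in batches)

print(full["x"].tolist())     # [-1.5, -0.5, 0.5, 1.5]
print(batched["x"].tolist())  # [-0.5, 0.5, -0.5, 0.5]
```

Per-row functions are unaffected; only functions that need the whole column (means, ranks, cumulative sums) silently change meaning.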
## A more scalable alternative
- `pyspark.sql.DataFrame.mapInPandas(f)` streams each partition to `f` as an iterator of pandas DataFrames
- This can still OOM if `f` materializes too much data per batch
## PySpark's built-in memory profiler
- Set the Spark config `spark.python.profile.memory` to `true`
- Call `sc.show_profiles()` on the driver to print the collected per-UDF profiles
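Putting the two bullets together, a configuration sketch (assuming PySpark 3.4+, where the UDF memory profiler was added, with the `memory-profiler` package available on the executors; not runnable without a Spark installation):

```python
# Sketch: enabling PySpark's built-in UDF memory profiler.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.python.profile.memory", "true")
    .getOrCreate()
)

# ... run Python UDF / mapInPandas jobs here ...

# Print per-UDF memory profiles gathered from the executors:
spark.sparkContext.show_profiles()
```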
## Using Memray on executors
- Wrap the code to be profiled: `with memray.Tracker(filename, native_traces=True):`
- `native_traces=True` also records native (C/C++) stack frames, e.g. allocations made inside NumPy
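A self-contained sketch of the `Tracker` pattern. The workload (`allocate`) is a hypothetical stand-in for a UDF body, and the code degrades gracefully when Memray is not installed; in a real job, each executor task would write its own trace file to be collected and rendered with `memray flamegraph <file>`:

```python
import os
import tempfile

try:
    import memray  # pip install memray (Linux/macOS only)
except ImportError:
    memray = None

def allocate(n: int) -> list:
    # Hypothetical workload standing in for a UDF body.
    return [bytes(1024) for _ in range(n)]

def run_with_tracking(out_path: str) -> list:
    """Run the workload, capturing a Memray trace if Memray is available."""
    if memray is None:
        return allocate(1000)  # profiling unavailable; run unprofiled
    # native_traces=True also records C/C++ frames in the trace.
    with memray.Tracker(out_path, native_traces=True):
        return allocate(1000)

path = os.path.join(tempfile.mkdtemp(), "udf_trace.bin")
result = run_with_tracking(path)
print(len(result))  # 1000
```

Choosing a distinct `out_path` per task (e.g. embedding the partition id) avoids executors clobbering each other's trace files on shared storage.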
---
Below are the updates or corrections the speaker made to the slides after the talk