The Dark Corners of Python FFI - scc

---
title: "The Dark Corners of Python FFI - scc"
tags: PyConTW2025, 2025-organize, 2025-共筆
---

# The Dark Corners of Python FFI - scc

:::success
AI translation subtitles and summaries are available for this talk. Click here to access >> [PyCon Taiwan AI Notebook](https://pycontw.connyaku.app/?room=8TV2RAg1bJP3qXDUH7bO)
:::

> Collaborative notes start below

## What is FFI

- FFI stands for Foreign Function Interface
- It bridges the Python world and the C++ world: Python World --> C++ World
- Function calls across this boundary are frequent

## The hidden corners

### Benchmark ctypes, cffi, pybind11, PyO3: why free-threaded makes them slower

- `PyMutex_LockFast` and `PyMutex_Unlock` are called frequently
- The free-threaded build has to guard against race conditions
- Recap
    - Python 3.13 NoGIL shows higher overhead than 3.13 GIL when crossing FFI boundaries
    - Python 3.14 NoGIL shows higher overhead than 3.14 GIL for FFI calls

### Why is PyO3 fast but ctypes slow?

- `take_gil`
- CDLL releases the GIL during the call; PyDLL does not release it
- In the free-threaded build, PyDLL does not need to take the GIL
- Recap
    - With the GIL enabled, PyDLL outperforms CDLL by more than 20%.
    - PyO3 includes deep optimizations for vectorcall without the GIL.
    - The libffi trampoline emerges as the next bottleneck.
- Interesting finding: `Python --> pybind11 --> C++` is slower than `Python --> PyO3 --> Rust --> C++`

### Racing global states after No-GIL

- The GIL protects foreign functions, but that protection disappears once we discard it

| | GIL | No-GIL |
|:--:|:--:|:--:|
| Thread-safe FF | Safe | Safe |
| Non-thread-safe FF | Uncertain | Race |

#### Suggestions

- When migrating to the free-threaded build, foreign functions should be thread-safe
- Otherwise, do not call foreign functions from multithreaded Python
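
The suggestion above that foreign functions be thread-safe can be illustrated with a small sketch. It was not shown in the talk, and it assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library. The libc pair `srand`/`rand` shares hidden global state, so the sketch serializes the pair behind a Python lock. The helper names `seeded_rand` and `draw_in_threads` are hypothetical.

```python
import ctypes
import threading

# Assumption: POSIX system where CDLL(None) gives access to libc symbols.
libc = ctypes.CDLL(None)
libc.srand.argtypes = [ctypes.c_uint]
libc.srand.restype = None
libc.rand.argtypes = []
libc.rand.restype = ctypes.c_int

# One lock guards every call that touches libc's hidden PRNG state.
_prng_lock = threading.Lock()


def seeded_rand(seed: int) -> int:
    """Seed and draw as one atomic step so threads cannot interleave."""
    with _prng_lock:
        libc.srand(seed)
        return libc.rand()


def draw_in_threads(seed: int, n_threads: int = 8, n_calls: int = 1000) -> set:
    """Call seeded_rand from several threads and collect the distinct results."""
    seen = set()
    seen_lock = threading.Lock()

    def worker() -> None:
        local = {seeded_rand(seed) for _ in range(n_calls)}
        with seen_lock:
            seen.update(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With the lock in place, every thread should observe the same value for a given seed, so `draw_in_threads(42)` is expected to return a set with a single element. Removing the lock lets one thread's `srand` land between another thread's `srand` and `rand`, which is the "Race" cell in the table above. A coarse lock like this trades parallelism for safety, so it is a stopgap rather than a replacement for a thread-safe foreign library.
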
## Takeaway

- Python 3.14t is impressive: its performance is almost as good as the GIL version
- Roll out with canary or shadow deployment
- Make sure all components are thread-safe

## QA

- Why is PyO3 + Rust faster? ==PyO3 takes about 75 ns per function call, pybind11 about 150 ns, and Rust --> C++ about 10 ns==
- Which stack profiling tool was used? ==perf; some kernel options need to be enabled==

## Links

- https://pycontw2025.scc.tw
- scc@scc.tw

Below is the part the speaker updated or corrected after the talk.