# A basic guide to Python profiling
* A dynamic performance analysis that measures the behavior (incl. time and space complexity) of a program.
* This technique is commonly used for performance optimization.
* The program is slow... Where is it slow (hotspot analysis)? Why is it slow (performance bottleneck)? How can it be sped up (optimization)?
* Sometimes fast, sometimes slow? Jason will cover pytest-benchmark next time.
## Motivation: Knowing the behavior to plan an optimization strategy
* Find the **hotspot** of a program.
* **Amdahl's law**: Don't spend too much effort on minor parts; the achievable overall speedup is capped by the fraction of runtime they account for (see the formula after this list).
* Do profiling in a **top-down** manner, instead of bottom-up.
* Find the reason for the slowness (**bottleneck**).
* **Von Neumann architecture**: a program runs on a handful of device classes (CPU, memory, IO), and any of them can be the limiting one.

* **xxx-bound / xxx is the bottleneck**: The device on which the program spends the most time. (e.g., CPU-bound, memory-bound, IO-bound, GPU-bound)
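For reference, Amdahl's law as a formula: speeding up a fraction of the runtime by some factor gives a bounded overall speedup.
```latex
% Amdahl's law: speeding up a fraction p of the runtime by a factor s
% yields an overall speedup of
S(s) = \frac{1}{(1 - p) + \frac{p}{s}} \le \frac{1}{1 - p}
% e.g., optimizing a part that takes 10% of the runtime (p = 0.1)
% can never speed up the whole program by more than ~1.11x.
```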
## Hotspot analysis
* Find the hot code segment.
### cProfile
* Documentation: https://docs.python.org/3/library/profile.html
* Why **c**Profile? It's a C extension with less overhead than the pure-Python `profile` module.
* **Record the trace of a program.**
* Profile the entire program.
```bash
python -m cProfile -o output_file (-m module | myscript.py)
```
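* For example (the output file name here is illustrative):
```bash
# Profile myscript.py and save the stats to profile.stats for later inspection.
python -m cProfile -o profile.stats myscript.py
```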
* Profile a code segment.
```python
# Python 3.8+
import cProfile
with cProfile.Profile() as pr:
    ...  # the code segment to profile
pr.dump_stats("stats")
```
```python
# Works on any Python version: enable/disable the profiler explicitly.
import cProfile

pr = cProfile.Profile()
pr.enable()
# ... the code segment to profile ...
pr.disable()
pr.dump_stats("stats")
```
* **Flat report**
* Enter the interactive shell of `Stats`.
```bash
python3 -m pstats STATS_FILE
```
* Commands
* help
* sort (cumtime | tottime | ...)
* **cumtime**: Cumulative time incl. subroutines. Commonly used to find the hotspot at a high level.
* **tottime**: Total time excl. subroutines. Commonly used to pinpoint the exact hotspot.
* reverse: Reverse the list.
* stats [n]: Print the report (showing only the first n rows).
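* The same report can also be produced programmatically; a minimal sketch, assuming the stats were dumped to a file named `stats`:
```python
import pstats

stats = pstats.Stats("stats")  # load the dump produced by cProfile
stats.sort_stats("cumtime")    # same as `sort cumtime` in the shell
stats.print_stats(10)          # same as `stats 10`: show the top 10 rows
```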
* Result: (screenshot of the flat report omitted)

* **Call-graph report**
* `gprof2dot` is a powerful tool to visualize a variety of profiler outputs, including those of `gprof`, `cProfile`, and `Valgrind`.
* Repo: https://github.com/jrfonseca/gprof2dot
* Usage:
```bash
gprof2dot -f pstats STATS_FILE | dot -Tpng -o output.png
```
* Result: (example call-graph image omitted)

### line_profiler
* Line-by-line profiling
* Repo: https://github.com/rkern/line_profiler
* Usage
* Annotate the function of interest.
```python
@profile  # injected into builtins by kernprof; no import needed
def foo():
    ...  # function body of interest
```
* Collect the trace of the program
```bash
kernprof -l program.py
```
* Get the flat report
```bash
python -m line_profiler program.py.lprof
```
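* The profiler can also be driven from Python instead of `kernprof`; a minimal sketch, assuming `foo` is the function of interest:
```python
from line_profiler import LineProfiler

def foo():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

lp = LineProfiler()
wrapped = lp(foo)  # decorate foo with the profiler
wrapped()
lp.print_stats()   # print the line-by-line report to stdout
```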
* Result: (screenshot of the line-by-line report omitted)

### Pitfalls
* Mind the caches (e.g., application cache, page cache): a warm cache can hide the real cost (see the sketch after this list).
* Mind lazy evaluation (e.g., NumPy): work may be deferred until the result is actually used, so the time shows up in the wrong place.
* Mind input sensitivity (e.g., excisional biopsy slides vs. needle biopsy slides).
* Mind the noise (e.g., chenchc is running a program named thrashing.py on the same machine).
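A minimal sketch of the cache pitfall, using a hypothetical `load_slide` wrapped with `functools.lru_cache`: only the first call pays the real cost, so timing later calls measures the cache, not the work.
```python
import functools
import time

@functools.lru_cache(maxsize=None)
def load_slide(path):  # hypothetical expensive loader
    time.sleep(1.0)    # stands in for slow IO / decoding
    return path

for _ in range(2):
    t0 = time.perf_counter()
    load_slide("a.svs")
    print(time.perf_counter() - t0)  # ~1 s on the first call, ~0 s afterwards
```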
## Find the performance bottleneck: Scratch the surface through system monitoring
* Quantitatively collect coarse, whole-program metrics to roughly determine **the bottleneck with nearly no effort**.
* CPU-bound
* Memory-bound
* IO-bound
* Background
* **Program**: A sequence or a set of instructions. The CPU can only execute machine code, so high-level programs must be compiled (C/C++) or interpreted (Python) before execution.

* **User-space** and **kernel-space**:
* Without a system call, a user program can only access the ALU and its own memory space (i.e., user-space operations).

* **Interrupt**: A request for the processor to interrupt the currently executing process.
* **TL;DR: Interrupts are expensive and mostly related to inefficient memory access and IO.**
* Common interrupt types:
* **System call (a.k.a. trap)**: e.g., requesting the OS to open a file.
* **Page fault**: e.g., accessing the newly opened file through memory mapping. (This is a deep rabbit hole...)
* **IO signal**: e.g., the SSD has read the requested page of the file and signals the CPU to handle it.
* **Exception**: e.g., 1.0 / 0.0 raises a divide-by-zero exception.
* **Timer interrupt**: For scheduling.
* Handling interrupts is time-consuming (see the `strace` sketch below).
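To see where a program spends time in system calls, one option (assuming Linux with `strace` installed) is its summary mode:
```bash
# -c prints a per-syscall time/count summary; -f follows child processes.
strace -c -f python foo.py
```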

* Common metrics:
* **Execution time (a.k.a. wall-clock time, real time)**
* **User-space CPU time (a.k.a. CPU time, user time)**
* **CPU time**: Summation of the time spent on each CPU core.
* **User-space time**: Time spent running user code, excluding the time spent in the kernel handling interrupts and system calls.
* If user time / (real time * parallelism) ~ 1.0, the program is CPU-bound.
* **Kernel-space CPU time (a.k.a. sys time)**
* Time dealing with interrupts.
* If this value is extremely high, the whole system may grind to a halt; this situation is called **thrashing**. The direct cause is usually **excessive page faults**, and the root cause is **a huge number of inefficient memory accesses**.
* **Memory consumption (a.k.a. working set size, resident set size)** (see the sketch below)
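A minimal sketch for reading the peak resident set size from within Python using only the standard library (`resource` is Unix-only):
```python
import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak}")  # kilobytes on Linux, bytes on macOS
```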
* Tools (you are probably already familiar with these)
* time (bash command)
* `time python foo.py`
* /usr/bin/time
* `/usr/bin/time python foo.py`
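* GNU time can also report memory; a sketch (the `-v` flag belongs to GNU time, not the shell builtin):
```bash
# -v additionally prints "Maximum resident set size", i.e., peak memory use.
/usr/bin/time -v python foo.py
```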
* top
* htop

* timeit (Python module)
* `python -m timeit -s 'from foo import bar' 'bar()'` (put the import in the setup so that only `bar()` itself is timed)
```python
import timeit

from foo import bar

if __name__ == "__main__":
    # timeit.timeit accepts a callable; by default it runs 1,000,000 iterations.
    print(timeit.timeit(bar))
    # Or `timeit.Timer(bar).autorange()` to automatically select a proper
    # number of iterations.
```
* IPython
```python
In [1]: from foo import bar
In [2]: %timeit bar()
```
## Strategy for optimization
* **CPU-bound** (a.k.a., compute-bound)
* Implement the algorithm with lower **time** complexity.
* Use an application-level cache: Trade space for time.
* Use C-based libraries.
* Compilation enables instruction-level parallelism (ILP) and data-level parallelism (DLP) optimizations.

* Exploit thread-level parallelism (TLP) (see the sketch after this list).
* multiprocessing
* concurrent.futures
* OpenMP (in C/C++)
* GPU offloading.
* Use PyTorch operations to replace compute-intensive NumPy operations.
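Because of the GIL, CPU-bound pure-Python code usually needs processes rather than threads; a minimal sketch with `concurrent.futures` (`cpu_heavy` is a stand-in for real work):
```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    # Stand-in for a compute-intensive task.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Processes side-step the GIL, so CPU-bound work can scale with cores.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_heavy, [10_000_000] * 8))
```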
* **Memory-bound**
* Implement the algorithm with lower **space** complexity.
* Use an application-level cache.
* Use C-based libraries.
* Compilation enables cache-friendly memory access patterns.

* Increase the **locality** of memory access.
* Bad
```python
# The slice is only a view with large strides, so complex_preprocess
# walks memory non-contiguously (poor locality).
patch = image[10000:12345, 10000:12345, :]
new_patch = complex_preprocess(patch)
```
* Good
```python
# np.ascontiguousarray (NumPy) copies the strided view into one contiguous
# buffer, so later passes over `patch` are cache-friendly.
patch = np.ascontiguousarray(image[10000:12345, 10000:12345, :])
new_patch = complex_preprocess(patch)
```
* **Multithreading and GPU can make it worse!**
* **IO-bound**
* Use an application-level cache.
* Increase the **locality** of data access.
* Sequential access is better than random access (see the sketch at the end of this section).
* **Multithreading and GPU can make it worse!**
* Use cash $$$ (i.e., buy faster hardware, e.g., an NVMe SSD).
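A minimal sketch contrasting the two access patterns (`data.bin` is a hypothetical large file):
```python
import os
import random

PATH = "data.bin"  # hypothetical large input file
size = os.path.getsize(PATH)

# Sequential access: read front to back; the OS read-ahead can prefetch.
with open(PATH, "rb") as f:
    while f.read(1 << 20):  # 1 MiB chunks
        pass

# Random access: scattered seeks defeat read-ahead and the page cache.
with open(PATH, "rb") as f:
    for _ in range(1000):
        f.seek(random.randrange(size))
        f.read(4096)
```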