---
title: 'High Performance Python (Part 2)'
tags: Python
disqus: hackmd
---

High Performance Python (Part 2)
==
>[name=Cheng-Chin Chiang] [time=August 12, 2022]

## Table of Contents
[TOC]

## References
1. [High Performance Python](https://www.amazon.com/High-Performance-Python-Performant-Programming-ebook/dp/B087YTVL8F) by Micha Gorelick and Ian Ozsvald, with [code samples](https://github.com/mynameisfiber/high_performance_python_2e) on GitHub.
![](https://i.imgur.com/ZKdvNdm.jpg =300x400)
2. [Cython: A Guide for Python Programmers](https://www.oreilly.com/library/view/cython/9781491901731/) by Kurt Smith, with [code samples](https://github.com/cythonbook/examples) on GitHub.
![](https://i.imgur.com/Vz3cdg6.jpg =300x400)

## [Python Road Map](https://roadmap.sh/python)
![](https://i.imgur.com/4AfkTOf.png =600x800)

## Compiling to C
Assuming you have chosen good algorithms and reduced the amount of data you are processing, the easiest way to get your code to run faster is to compile it down to machine code. Python offers a number of options for this:
- AOT (ahead-of-time) compiler: Cython
- JIT (just-in-time) compilers: Numba and PyPy

### Cython
Cython converts type-annotated Python (via the `cdef` keyword) into a compiled extension module. ==The type annotations are C-like.== Its strengths are wide usage, maturity, and OpenMP support.

For example, when calculating the Julia set, we rewrite an expensive calculation function in a file named `cythonfn.pyx`:
:::info
:::spoiler `cythonfn.pyx`
```python=
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:  # Instead of using abs(z) < 2
            z = z * z + c
            n += 1
        output[i] = n
    return output
```
:::

Then we set up the pre-compiling script in `setup.py`:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx",
                            compiler_directives={"language_level": "3"}))  # For Python 3.x
```
:::

Then run the command to pre-compile this function module:
```
$ python setup.py build_ext --inplace
```
A file `cythonfn.c` and a binary file `cythonfn[...].so` are created.
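
Before measuring anything, it can help to see how much of the `.pyx` file still interacts with Python objects. Cython's annotation mode (the `-a` flag of the `cython` command-line tool) writes a `cythonfn.html` report in which lines shaded darker yellow make more calls into the Python virtual machine, while white lines compile down to plain C:
```
$ cython -a cythonfn.pyx
```
The fewer yellow lines inside the hot loop, the better the expected speedup.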
Now we can call the module `cythonfn` in the main Python code `julia1.py`:
:::info
:::spoiler `julia1.py`
```python=
import time

#from cythonfn import calculate_z
import cythonfn  # <1>

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193


def calc_pure_python(desired_width, max_iterations):
    """Create a list of complex co-ordinates (zs) and complex parameters (cs),
    build Julia set and display"""
    x_step = (x2 - x1) / desired_width
    y_step = (y1 - y2) / desired_width
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step
    # build a list of co-ordinates and the initial condition for each cell.
    # Note that our initial condition is a constant and could easily be removed,
    # we use it to simulate a real-world scenario with several inputs to our function
    zs = []
    cs = []
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
            cs.append(complex(c_real, c_imag))

    print("Length of x:", len(x))
    print("Total elements:", len(zs))
    start_time = time.time()
    # output = calculate_z(max_iterations, zs, cs)
    output = cythonfn.calculate_z(max_iterations, zs, cs)  # <2>
    end_time = time.time()
    secs = end_time - start_time
    print(f"Took {secs:0.2f} seconds")

    # this sum is expected for 1000^2 grid with 300 iterations
    assert sum(output) == 33219980


# Calculate the Julia set using a pure Python solution with
# reasonable defaults for a laptop
calc_pure_python(desired_width=1000, max_iterations=300)
```
:::

### N-Body Simulations
This is one of the examples from the Cython book (Chapter 4). We can also refer to the [computer language benchmarks game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/) website. It simulates the evolution of the solar system in phase space. This is a chaotic system, which means the long-term evolution of the system is very sensitive to the initial positions and velocities of all bodies.
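
The heart of the `advance` function in the listing below is the pairwise gravitational update. With the gravitational constant absorbed into the mass units (`SOLAR_MASS = 4 * PI * PI` makes $G = 1$ in AU/year units), each time step applies

$$
\mathbf{v}_i \leftarrow \mathbf{v}_i - \Delta t \, m_j \, \frac{\mathbf{r}_i - \mathbf{r}_j}{\lVert \mathbf{r}_i - \mathbf{r}_j \rVert^{3}},
\qquad
\mathbf{r}_i \leftarrow \mathbf{r}_i + \Delta t \, \mathbf{v}_i
$$

(velocities first, then positions: the semi-implicit Euler scheme). This is why the code computes `mag = dt / (ds * sqrt(ds))`: `ds` holds the squared distance, so `ds * sqrt(ds)` is the cubed distance.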
:::info
:::spoiler `nbody.pyx`
```python=
# The Computer Language Benchmarks Game
# http://benchmarksgame.alioth.debian.org/
#
# originally by Kevin Carson
# modified by Tupteq, Fredrik Johansson, and Daniel Nanz
# modified by Maciej Fijalkowski
# 2to3

from libc.math cimport pow, sqrt

import sys

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

cdef struct body_t:
    double x[3]
    double v[3]
    double m

DEF NBODIES = 5

BODIES = {
    'sun': ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),

    'jupiter': ([4.84143144246472090e+00,
                 -1.16032004402742839e+00,
                 -1.03622044471123109e-01],
                [1.66007664274403694e-03 * DAYS_PER_YEAR,
                 7.69901118419740425e-03 * DAYS_PER_YEAR,
                 -6.90460016972063023e-05 * DAYS_PER_YEAR],
                9.54791938424326609e-04 * SOLAR_MASS),

    'saturn': ([8.34336671824457987e+00,
                4.12479856412430479e+00,
                -4.03523417114321381e-01],
               [-2.76742510726862411e-03 * DAYS_PER_YEAR,
                4.99852801234917238e-03 * DAYS_PER_YEAR,
                2.30417297573763929e-05 * DAYS_PER_YEAR],
               2.85885980666130812e-04 * SOLAR_MASS),

    'uranus': ([1.28943695621391310e+01,
                -1.51111514016986312e+01,
                -2.23307578892655734e-01],
               [2.96460137564761618e-03 * DAYS_PER_YEAR,
                2.37847173959480950e-03 * DAYS_PER_YEAR,
                -2.96589568540237556e-05 * DAYS_PER_YEAR],
               4.36624404335156298e-05 * SOLAR_MASS),

    'neptune': ([1.53796971148509165e+01,
                 -2.59193146099879641e+01,
                 1.79258772950371181e-01],
                [2.68067772490389322e-03 * DAYS_PER_YEAR,
                 1.62824170038242295e-03 * DAYS_PER_YEAR,
                 -9.51592254519715870e-05 * DAYS_PER_YEAR],
                5.15138902046611451e-05 * SOLAR_MASS)}


def combinations(l):
    result = []
    for x in range(len(l) - 1):
        ls = l[x+1:]
        for y in ls:
            result.append((l[x], y))
    return result

cdef void make_cbodies(list bodies, body_t *cbodies, int num_cbodies):
    cdef body_t *cbody
    for i, body in enumerate(bodies):
        if i >= num_cbodies:
            break
        (x, v, m) = body
        cbody = &cbodies[i]
        cbody.x[0], cbody.x[1], cbody.x[2] = x
        cbody.v[0], cbody.v[1], cbody.v[2] = v
        cbodies[i].m = m

cdef list make_pybodies(body_t *cbodies, int num_cbodies):
    pybodies = []
    for i in range(num_cbodies):
        x = [cbodies[i].x[0], cbodies[i].x[1], cbodies[i].x[2]]
        v = [cbodies[i].v[0], cbodies[i].v[1], cbodies[i].v[2]]
        pybodies.append((x, v, cbodies[i].m))
    return pybodies

def advance(double dt, int n, bodies):
    # Note that this does not take a `pairs` argument, and *returns* a new
    # `bodies` list. This is slightly different from the original pure-Python
    # version of `advance`.
    cdef:
        int i, ii, jj
        double dx, dy, dz, mag, b1m, b2m, ds
        body_t *body1
        body_t *body2
        body_t cbodies[NBODIES]

    make_cbodies(bodies, cbodies, NBODIES)

    for i in range(n):
        for ii in range(NBODIES-1):
            body1 = &cbodies[ii]
            for jj in range(ii+1, NBODIES):
                body2 = &cbodies[jj]
                dx = body1.x[0] - body2.x[0]
                dy = body1.x[1] - body2.x[1]
                dz = body1.x[2] - body2.x[2]

                ds = dx * dx + dy * dy + dz * dz
                mag = dt / (ds * sqrt(ds))
                b1m = body1.m * mag
                b2m = body2.m * mag

                body1.v[0] -= dx * b2m
                body1.v[1] -= dy * b2m
                body1.v[2] -= dz * b2m

                body2.v[0] += dx * b1m
                body2.v[1] += dy * b1m
                body2.v[2] += dz * b1m

        for ii in range(NBODIES):
            body2 = &cbodies[ii]
            body2.x[0] += dt * body2.v[0]
            body2.x[1] += dt * body2.v[1]
            body2.x[2] += dt * body2.v[2]

    return make_pybodies(cbodies, NBODIES)

def report_energy(bodies):
    # Note that this takes just a `bodies` list, and computes `pairs`
    # internally, which is a slight modification from the original pure-Python
    # version of `report_energy`.
    e = 0.0
    pairs = combinations(bodies)
    for (((x1, y1, z1), v1, m1),
         ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    print("%.9f" % e)

def offset_momentum(ref, bodies):
    px = py = pz = 0.0
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def main(n, bodies=BODIES, ref='sun'):
    system = list(bodies.values())
    offset_momentum(bodies[ref], system)
    report_energy(system)
    system = advance(0.01, n, system)
    report_energy(system)
```
:::

### pyximport
If your code has a simple setup and doesn't require third-party modules, you can simplify the build process via `pyximport`, without a `setup.py` file. For example, we add the following lines at the start of the previous `julia1.py` file:
:::info
:::spoiler `julia1.py`
```python=
import pyximport
pyximport.install(language_level=3)  # For Python 3.x
import cythonfn
...
```
:::

Any subsequently imported `.pyx` file will be compiled automatically.

### Cython and Numpy
We can use numpy arrays in the `.pyx` file to get better speed when accessing 1-D data. For example, we modify the `cythonfn.pyx` file as:
:::info
:::spoiler `cythonfn.pyx`
```python=
import numpy as np
cimport numpy as np


def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)  # <1>
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output
```
:::

Also modify `setup.py` as:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup
import numpy as np
from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx",
                            compiler_directives={"language_level": "3"}),
      include_dirs=[np.get_include()])
```
:::
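
Note that the `double complex[:]` typed-memoryview signature no longer accepts plain Python lists, so the caller has to hand in numpy arrays. A minimal sketch of the matching change in `julia1.py`, assuming the `zs` and `cs` lists built in `calc_pure_python`:
```python=
import numpy as np

# Convert the Python lists into contiguous complex128 arrays, which is
# what the `double complex[:]` memoryview parameters expect
zs_np = np.array(zs, dtype=np.complex128)
cs_np = np.array(cs, dtype=np.complex128)
output = cythonfn.calculate_z(max_iterations, zs_np, cs_np)
```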
### Parallelize Calculations with OpenMP
In Cython, we can use the `prange` operator and add the `-fopenmp` compiler flag in `setup.py` to parallelize a loop. For example, we modify `cythonfn.pyx` as:
:::info
:::spoiler `cythonfn.pyx`
```python=
from cython.parallel import parallel, prange  # <1>
import numpy as np
cimport numpy as np


def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():  # <2>
        for i in prange(length, schedule="guided"):  # <3>
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output
```
:::

Also modify `setup.py` as:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup
from distutils.extension import Extension
import numpy as np

ext_modules = [Extension(
    "cythonfn",
    ["cythonfn.pyx"],
    extra_compile_args=['-Xclang -fopenmp'],  # for Apple clang; plain gcc takes just '-fopenmp'
    extra_link_args=['-Xclang -fopenmp'],
)]

from Cython.Build import cythonize

setup(ext_modules=cythonize(ext_modules,
                            compiler_directives={"language_level": "3"},),
      include_dirs=[np.get_include()])
```
:::

In this example, we release the ==GIL (Global Interpreter Lock)== with the `nogil` keyword and parallelize the calculation with the `parallel()` context and the `prange` operator. We also set `schedule="guided"`, which dynamically assigns work so that fewer threads sit waiting for new work.
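
After rebuilding with `python setup.py build_ext --inplace` as before, the number of OpenMP threads can be chosen at runtime through the standard `OMP_NUM_THREADS` environment variable (by default OpenMP uses one thread per available core). A hypothetical session:
```
$ OMP_NUM_THREADS=4 python julia1.py
```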
### Foreign Function Interfaces
Sometimes the automatic solutions are not enough, and you need to write custom C code yourself. This could be because the compilation methods don't find some potential optimizations, or because ==you want to take advantage of libraries or language features that are not available in Python==. For example, suppose we want to use an external library to solve the 2D diffusion equation. The solver function in C looks like the following (`diffusion.c`):
:::info
:::spoiler `diffusion.c`
```c=
void evolve(double in[][512], double out[][512], double D, double dt) {
    int i, j;
    double laplacian;
    for (i=1; i<511; i++) {
        for (j=1; j<511; j++) {
            laplacian = in[i+1][j] + in[i-1][j] + in[i][j+1] + in[i][j-1] - 4 * in[i][j];
            out[i][j] = in[i][j] + D * dt * laplacian;
        }
    }
}
```
:::

We compile `diffusion.c` using gcc:
```
$ gcc -O3 -std=gnu11 -c diffusion.c
$ gcc -shared -o diffusion.so diffusion.o
```
Then we call `diffusion.so` via the `ctypes` module in the Python code:
:::info
:::spoiler `diffusion1.py`
```python=
import ctypes
import time

import numpy as np

grid_shape = (512, 512)
_diffusion = ctypes.CDLL("diffusion.so")  # <1>

# Create references to the C types that we will need to simplify future code
TYPE_INT = ctypes.c_int  # <2>
TYPE_DOUBLE = ctypes.c_double  # <2>
TYPE_DOUBLE_SS = ctypes.POINTER(ctypes.POINTER(ctypes.c_double))  # <2>

# Initialize the signature of the evolve function to:
# void evolve(double**, double**, double, double)
_diffusion.evolve.argtypes = [TYPE_DOUBLE_SS, TYPE_DOUBLE_SS, TYPE_DOUBLE, TYPE_DOUBLE]  # <3>
_diffusion.evolve.restype = None  # <3>


def evolve(grid, out, dt, D=1.0):
    # First we convert the python types into the relevant C types
    assert grid.shape == (512, 512)
    cdt = TYPE_DOUBLE(dt)  # <4>
    cD = TYPE_DOUBLE(D)  # <4>
    pointer_grid = grid.ctypes.data_as(TYPE_DOUBLE_SS)  # <4>
    pointer_out = out.ctypes.data_as(TYPE_DOUBLE_SS)  # <4>

    # Now we can call the function
    _diffusion.evolve(pointer_grid, pointer_out, cD, cdt)  # <4>


def run_experiment(num_iterations):
    scratch = np.zeros(grid_shape, dtype=ctypes.c_double)
    grid = np.zeros(grid_shape, dtype=ctypes.c_double)

    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005

    start = time.time()
    for i in range(num_iterations):
        evolve(grid, scratch, 0.1)
        grid, scratch = scratch, grid
    return time.time() - start


if __name__ == "__main__":
    t = run_experiment(500)
    print(t)
```
:::

Or we can call `diffusion.so` via the `cffi` module in the Python code:
:::info
:::spoiler `diffusion2.py`
```python=
import time

from cffi import FFI, verifier

import numpy as np

grid_shape = (512, 512)

ffi = FFI()  # <1>
ffi.cdef("void evolve(double **in, double **out, double D, double dt);")  # <1>
lib = ffi.dlopen("diffusion.so")  # <1>


def evolve(grid, dt, out, D=1.0):
    pointer_grid = ffi.cast("double**", grid.ctypes.data)  # <2>
    pointer_out = ffi.cast("double**", out.ctypes.data)  # <2>
    lib.evolve(pointer_grid, pointer_out, D, dt)  # <2>


def run_experiment(num_iterations):
    scratch = np.zeros(grid_shape, dtype=np.double)  # <3>
    grid = np.zeros(grid_shape, dtype=np.double)  # <3>

    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005

    start = time.time()
    for i in range(num_iterations):
        evolve(grid, 0.1, scratch)  # <3>
        grid, scratch = scratch, grid
    return time.time() - start


if __name__ == "__main__":
    t = run_experiment(500)
    print(t)
    verifier.cleanup_tmpdir()
```
:::
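
For reference, here is a pure-numpy sketch (not one of the book's listings) of the same five-point stencil that `diffusion.c` implements; it makes explicit what the C loops compute, with boundary cells left untouched to match the `1..510` loop ranges:
```python=
import numpy as np

def evolve_numpy(grid, out, dt, D=1.0):
    """Same stencil as the C evolve(): update interior points only"""
    laplacian = (grid[2:, 1:-1] + grid[:-2, 1:-1]
                 + grid[1:-1, 2:] + grid[1:-1, :-2]
                 - 4 * grid[1:-1, 1:-1])
    out[1:-1, 1:-1] = grid[1:-1, 1:-1] + D * dt * laplacian
```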
### Numba
Numba is a just-in-time compiler that specializes in numpy code, compiling it via the LLVM compiler (not via g++ or gcc) at runtime. Unlike Cython, Numba does not bind to external C libraries, but ==it can automatically generate code for GPUs (which Cython cannot)==. By simply adding the `@jit` decorator to a function, you let Numba optimize its performance where it can. For example, we modify `julia1.py` as:
:::info
:::spoiler `julia1.py`
```python=
import numba
from numba import jit
import time
import numpy as np

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193


@jit(nopython=False)
def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n

...
```
:::

The `nopython` specifier means that if Numba cannot compile all of the code, it fails. With `nopython=False` we turn off that failure mode, and Numba silently falls back to a Python mode for whatever it cannot compile.

If you run the same function a second time in the same Python session, it runs even faster, because the target function does not need to be recompiled on the second pass if the argument types are the same. On the second run, the Numba result is equivalent to the Cython-with-numpy result. ==So it comes out as fast as Cython for very little work.==

We can also add `OpenMP` parallelization support with `prange`, as shown below:
:::info
:::spoiler `julia2.py`
```python=
import numba
from numba import jit
from numba import prange
import numpy as np

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193


@jit(nopython=True, parallel=True)
def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in prange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n

...
```
:::

We expand the decorator to require `nopython` and `parallel`.

### PyPy
[PyPy](https://www.pypy.org/) is a drop-in replacement for CPython and offers all the built-in modules. The JIT compiler in PyPy is very effective, and good speedups can be seen with little or no work on your part. PyPy can run our pure Python Julia demo without any modifications.
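
For instance, assuming PyPy is installed and `julia1.py` is the pure-Python version (with the plain `calculate_z`, not the `cythonfn` import), a hypothetical session is simply:
```
$ pypy julia1.py
```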
### Summary

|                          | Cython | Numba | PyPy |
| ------------------------ | ------ | ----- | ---- |
| Mature                   | Y      | Y     | Y    |
| Widespread               | Y      | -     | -    |
| *numpy* support          | Y      | Y     | Y    |
| Nonbreaking code changes | -      | Y     | Y    |
| Need *C* knowledge       | Y      | -     | -    |
| Support *OpenMP*         | Y      | Y     | -    |

### Using GPUs via PyTorch
[PyTorch](https://pytorch.org/) is a dynamic computational graph tensor library that is particularly user-friendly and has an intuitive API for anyone familiar with numpy. We can apply GPU calculations via PyTorch with simple code modifications. For example, we modify the 2D diffusion code as:
:::info
:::spoiler `diffusion_pytorch.py`
```python=
import timeit

#from numpy import roll, zeros
import torch
from torch import (roll, zeros)  # <1>

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # <2>
GRID_SHAPE = (2048, 2048)


def laplacian(grid):
    return (
        roll(grid, +1, 0)
        + roll(grid, -1, 0)
        + roll(grid, +1, 1)
        + roll(grid, -1, 1)
        - 4 * grid
    )


def evolve(grid, dt, D=1):
    return grid + dt * D * laplacian(grid)


def run_experiment(num_iterations, grid_shape=GRID_SHAPE):
    grid = zeros(grid_shape)

    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005
    #grid = grid.cuda()  # <3>
    grid = grid.to(DEVICE)  # <3>
    for i in range(num_iterations):
        grid = evolve(grid, 0.1)
    return grid


if __name__ == "__main__":
    print("torch.cuda.is_available: {}".format(torch.cuda.is_available()))
    n_iter = 100
    N, runtime = timeit.Timer(f"run_experiment({n_iter})",
                              globals=globals()).autorange()
    print(f"Runtime with grid {GRID_SHAPE}: {runtime / N:0.4f}s")
```
:::

However, the biggest speed constraint for GPUs is the ==transfer time of data from the system memory to the GPU memory==. When we use `tensor.to(DEVICE)`, we trigger a transfer of data whose cost depends on the speed of the ==GPU's bus== and the ==amount of data== being transferred. In fact, one of the biggest things slowing down PyTorch code in deep learning is copying training data from the host to the GPU. Often the training data is simply too big to fit on the GPU, and these constant data transfers are an unavoidable penalty.
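
Because GPU kernels launch asynchronously, naive wall-clock timing can under-count the compute and hide the transfer cost. A minimal sketch of timing the two phases separately (`torch.cuda.synchronize()` is the real PyTorch call that blocks until queued work finishes; the shape and iteration count here are just illustrative):
```python=
import time

import torch

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

grid = torch.zeros((2048, 2048))

start = time.time()
grid = grid.to(DEVICE)  # host-to-device copy: pay this cost once, up front
if DEVICE.type == "cuda":
    torch.cuda.synchronize()  # wait for the copy to complete
print(f"transfer: {time.time() - start:0.4f}s")

start = time.time()
for _ in range(100):
    grid = grid + 0.1 * grid  # stays on the GPU; no further transfers
if DEVICE.type == "cuda":
    torch.cuda.synchronize()  # wait for all queued kernels before timing
print(f"compute: {time.time() - start:0.4f}s")
```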