---
title: 'High Performance Python (Part 2)'
tags: Python
disqus: hackmd
---
High Performance Python (Part 2)
==
>[name=Cheng-Chin Chiang] [time=August 12, 2022]
## Table of Contents
[TOC]
## References
1. [High Performance Python](https://www.amazon.com/High-Performance-Python-Performant-Programming-ebook/dp/B087YTVL8F) by Micha Gorelick, with [code samples](https://github.com/mynameisfiber/high_performance_python_2e) on GitHub.

2. [Cython: A Guide for Python Programmers](https://www.oreilly.com/library/view/cython/9781491901731/) by Kurt Smith, with [code samples](https://github.com/cythonbook/examples) on GitHub.

## [Python Road Map](https://roadmap.sh/python)

## Compiling to C
Assuming you have already chosen good algorithms and reduced the amount of data you are processing, the easiest way to get your code to run faster is to compile it down to machine code. Python offers a number of options for this:
- AOT (ahead-of-time) compiler: Cython
- JIT (just-in-time) compiler: Numba and PyPy
### Cython
Cython converts type-annotated Python (via the `cdef` keyword) into a compiled extension module. ==The type annotations are C-like.== It enjoys wide usage, is mature, and supports OpenMP. For example, when calculating the Julia set, we rewrite the expensive calculation function in a file named `cythonfn.pyx`:
:::info
:::spoiler `cythonfn.pyx`
```python=
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:  # instead of using abs(z) < 2
            z = z * z + c
            n += 1
        output[i] = n
    return output
```
:::
Then we set up the compilation script `setup.py` as:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("cythonfn.pyx", compiler_directives={"language_level": "3"})) # For Python 3.x version
```
:::
Then run the command to pre-compile this function module:
```
$ python setup.py build_ext --inplace
```
A file `cythonfn.c` and a binary `cythonfn[...].so` are created. Now we can call the `cythonfn` module from the main Python code, `julia1.py`:
:::info
:::spoiler `julia1.py`
```python=
import time
# from cythonfn import calculate_z
import cythonfn  # <1>

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193

def calc_pure_python(desired_width, max_iterations):
    """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
    x_step = (x2 - x1) / desired_width
    y_step = (y1 - y2) / desired_width
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step
    # build a list of co-ordinates and the initial condition for each cell.
    # Note that our initial condition is a constant and could easily be removed;
    # we use it to simulate a real-world scenario with several inputs to our function
    zs = []
    cs = []
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
            cs.append(complex(c_real, c_imag))
    print("Length of x:", len(x))
    print("Total elements:", len(zs))
    start_time = time.time()
    # output = calculate_z(max_iterations, zs, cs)
    output = cythonfn.calculate_z(max_iterations, zs, cs)  # <2>
    end_time = time.time()
    secs = end_time - start_time
    print(f"Took {secs:0.2f} seconds")
    assert sum(output) == 33219980  # this sum is expected for a 1000^2 grid with 300 iterations

# Calculate the Julia set using a pure Python solution with
# reasonable defaults for a laptop
calc_pure_python(desired_width=1000, max_iterations=300)
```
:::
### N-Body Simulations
This is one of the examples from the Cython book (Chapter 4); see also the [computer language benchmarks game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/) website. It simulates the evolution of the solar system in phase space. This is a chaotic system, which means the long-term evolution of the system is very sensitive to the initial positions and velocities of all bodies.
:::info
:::spoiler `nbody.pyx`
```python=
# The Computer Language Benchmarks Game
# http://benchmarksgame.alioth.debian.org/
#
# originally by Kevin Carson
# modified by Tupteq, Fredrik Johansson, and Daniel Nanz
# modified by Maciej Fijalkowski
# 2to3

from libc.math cimport pow, sqrt

import sys

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

cdef struct body_t:
    double x[3]
    double v[3]
    double m

DEF NBODIES = 5

BODIES = {
    'sun': ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),

    'jupiter': ([4.84143144246472090e+00,
                 -1.16032004402742839e+00,
                 -1.03622044471123109e-01],
                [1.66007664274403694e-03 * DAYS_PER_YEAR,
                 7.69901118419740425e-03 * DAYS_PER_YEAR,
                 -6.90460016972063023e-05 * DAYS_PER_YEAR],
                9.54791938424326609e-04 * SOLAR_MASS),

    'saturn': ([8.34336671824457987e+00,
                4.12479856412430479e+00,
                -4.03523417114321381e-01],
               [-2.76742510726862411e-03 * DAYS_PER_YEAR,
                4.99852801234917238e-03 * DAYS_PER_YEAR,
                2.30417297573763929e-05 * DAYS_PER_YEAR],
               2.85885980666130812e-04 * SOLAR_MASS),

    'uranus': ([1.28943695621391310e+01,
                -1.51111514016986312e+01,
                -2.23307578892655734e-01],
               [2.96460137564761618e-03 * DAYS_PER_YEAR,
                2.37847173959480950e-03 * DAYS_PER_YEAR,
                -2.96589568540237556e-05 * DAYS_PER_YEAR],
               4.36624404335156298e-05 * SOLAR_MASS),

    'neptune': ([1.53796971148509165e+01,
                 -2.59193146099879641e+01,
                 1.79258772950371181e-01],
                [2.68067772490389322e-03 * DAYS_PER_YEAR,
                 1.62824170038242295e-03 * DAYS_PER_YEAR,
                 -9.51592254519715870e-05 * DAYS_PER_YEAR],
                5.15138902046611451e-05 * SOLAR_MASS),
}

def combinations(l):
    result = []
    for x in range(len(l) - 1):
        ls = l[x+1:]
        for y in ls:
            result.append((l[x], y))
    return result

cdef void make_cbodies(list bodies, body_t *cbodies, int num_cbodies):
    cdef body_t *cbody
    for i, body in enumerate(bodies):
        if i >= num_cbodies:
            break
        (x, v, m) = body
        cbody = &cbodies[i]
        cbody.x[0], cbody.x[1], cbody.x[2] = x
        cbody.v[0], cbody.v[1], cbody.v[2] = v
        cbodies[i].m = m

cdef list make_pybodies(body_t *cbodies, int num_cbodies):
    pybodies = []
    for i in range(num_cbodies):
        x = [cbodies[i].x[0], cbodies[i].x[1], cbodies[i].x[2]]
        v = [cbodies[i].v[0], cbodies[i].v[1], cbodies[i].v[2]]
        pybodies.append((x, v, cbodies[i].m))
    return pybodies

def advance(double dt, int n, bodies):
    # Note that this does not take a `pairs` argument, and *returns* a new
    # `bodies` list. This is slightly different from the original pure-Python
    # version of `advance`.
    cdef:
        int i, ii, jj
        double dx, dy, dz, mag, b1m, b2m, ds
        body_t *body1
        body_t *body2
        body_t cbodies[NBODIES]

    make_cbodies(bodies, cbodies, NBODIES)

    for i in range(n):
        for ii in range(NBODIES-1):
            body1 = &cbodies[ii]
            for jj in range(ii+1, NBODIES):
                body2 = &cbodies[jj]
                dx = body1.x[0] - body2.x[0]
                dy = body1.x[1] - body2.x[1]
                dz = body1.x[2] - body2.x[2]
                ds = dx * dx + dy * dy + dz * dz
                mag = dt / (ds * sqrt(ds))
                b1m = body1.m * mag
                b2m = body2.m * mag
                body1.v[0] -= dx * b2m
                body1.v[1] -= dy * b2m
                body1.v[2] -= dz * b2m
                body2.v[0] += dx * b1m
                body2.v[1] += dy * b1m
                body2.v[2] += dz * b1m
        for ii in range(NBODIES):
            body2 = &cbodies[ii]
            body2.x[0] += dt * body2.v[0]
            body2.x[1] += dt * body2.v[1]
            body2.x[2] += dt * body2.v[2]

    return make_pybodies(cbodies, NBODIES)

def report_energy(bodies):
    # Note that this takes just a `bodies` list, and computes `pairs`
    # internally, which is a slight modification from the original pure-Python
    # version of `report_energy`.
    e = 0.0
    pairs = combinations(bodies)
    for (((x1, y1, z1), v1, m1),
         ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    print("%.9f" % e)

def offset_momentum(ref, bodies):
    px = py = pz = 0.0
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def main(n, bodies=BODIES, ref='sun'):
    system = list(bodies.values())
    offset_momentum(bodies[ref], system)
    report_energy(system)
    system = advance(0.01, n, system)
    report_energy(system)
```
:::
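To build the module, the same `setup.py` pattern as before works, pointing `cythonize` at `nbody.pyx`. A minimal, hypothetical driver might then look like:
```python=
# run_nbody.py -- hypothetical driver, assuming nbody.pyx was compiled as shown earlier
import nbody

# Advance the system 100,000 steps of dt = 0.01; main() prints the total
# energy before and after, and the two values should agree closely.
nbody.main(100000)
```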
### pyximport
If your code has a simple setup and doesn't require third-party modules, you can simplify the build via `pyximport` and skip the `setup.py` file entirely. For example, we add the following lines at the start of the previous `julia1.py` file:
:::info
:::spoiler `julia1.py`
```python=
import pyximport
pyximport.install(language_level=3) # For Python 3.x
import cythonfn
...
```
:::
Any subsequently imported `.pyx` file will then be compiled automatically.
### Cython and Numpy
We can use numpy arrays (via typed memoryviews) in the `.pyx` file for faster access to the 1-D data. For example, we modify the `cythonfn.pyx` file as:
:::info
:::spoiler `cythonfn.pyx`
```python=
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)  # <1>
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output
```
:::
Also modify the `setup.py` as:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup

import numpy as np
from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx", compiler_directives={"language_level": "3"}),
      include_dirs=[np.get_include()])
```
:::
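Note that `calculate_z` now takes typed memoryviews, so the caller must pass buffer objects such as numpy arrays instead of Python lists. A minimal change in `julia1.py` (names here are illustrative) might be:
```python=
import numpy as np

# convert the coordinate lists to contiguous complex128 arrays before the call
zs_np = np.array(zs, dtype=np.complex128)
cs_np = np.array(cs, dtype=np.complex128)
output = cythonfn.calculate_z(max_iterations, zs_np, cs_np)
```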
### Parallelize Calculations with OpenMP
In Cython, we can use the `prange` operator and add the `-fopenmp` compiler flag in `setup.py` to parallelize a loop. For example, we modify `cythonfn.pyx` as:
:::info
:::spoiler `cythonfn.pyx`
```python=
from cython.parallel import parallel, prange  # <1>
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():  # <2>
        for i in prange(length, schedule="guided"):  # <3>
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output
```
:::
Also modify `setup.py` as:
:::info
:::spoiler `setup.py`
```python=
from distutils.core import setup
from distutils.extension import Extension

import numpy as np
from Cython.Build import cythonize

ext_modules = [Extension(
    "cythonfn",
    ["cythonfn.pyx"],
    # with Apple clang, different flags are needed (e.g., "-Xclang", "-fopenmp" plus "-lomp")
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)]

setup(ext_modules=cythonize(ext_modules, compiler_directives={"language_level": "3"}),
      include_dirs=[np.get_include()])
```
:::
In this example, we release the ==GIL (Global Interpreter Lock)== with the `nogil` keyword and parallelize the calculation with the `parallel()` context and the `prange` operator. We also set `schedule="guided"`, which assigns work dynamically so that threads spend less time waiting for new work.
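By default, OpenMP typically uses one thread per core. The thread count can be controlled with the standard `OMP_NUM_THREADS` environment variable (Cython's `prange` also accepts a `num_threads` argument), for example:
```
$ OMP_NUM_THREADS=4 python julia1.py
```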
### Foreign Function Interfaces
Sometimes the automatic solutions are not enough, and you need to write custom C code yourself. This could be because the compilation methods don't find some potential optimizations, or because ==you want to take advantage of libraries or language features that are not available in Python==. For example, suppose we want to use an external library to solve the 2D diffusion equation. The solver function in C looks like the following (`diffusion.c`):
:::info
:::spoiler `diffusion.c`
```c=
void evolve(double in[][512], double out[][512], double D, double dt) {
    int i, j;
    double laplacian;
    for (i = 1; i < 511; i++) {
        for (j = 1; j < 511; j++) {
            laplacian = in[i+1][j] + in[i-1][j] + in[i][j+1] + in[i][j-1] - 4 * in[i][j];
            out[i][j] = in[i][j] + D * dt * laplacian;
        }
    }
}
```
:::
We compile the `diffusion.c` using gcc:
```
$ gcc -O3 -std=gnu11 -c diffusion.c
$ gcc -shared -o diffusion.so diffusion.o
```
Then we call into `diffusion.so` via the `ctypes` module in the Python code:
:::info
:::spoiler `diffusion1.py`
```python=
import ctypes
import time

import numpy as np

grid_shape = (512, 512)
_diffusion = ctypes.CDLL("diffusion.so")  # <1>

# Create references to the C types that we will need to simplify future code
TYPE_INT = ctypes.c_int  # <2>
TYPE_DOUBLE = ctypes.c_double  # <2>
TYPE_DOUBLE_SS = ctypes.POINTER(ctypes.POINTER(ctypes.c_double))  # <2>

# Initialize the signature of the evolve function to:
# void evolve(double**, double**, double, double)
_diffusion.evolve.argtypes = [TYPE_DOUBLE_SS, TYPE_DOUBLE_SS, TYPE_DOUBLE, TYPE_DOUBLE]  # <3>
_diffusion.evolve.restype = None  # <3>

def evolve(grid, out, dt, D=1.0):
    # First we convert the Python types into the relevant C types
    assert grid.shape == (512, 512)
    cdt = TYPE_DOUBLE(dt)  # <4>
    cD = TYPE_DOUBLE(D)  # <4>
    pointer_grid = grid.ctypes.data_as(TYPE_DOUBLE_SS)  # <4>
    pointer_out = out.ctypes.data_as(TYPE_DOUBLE_SS)  # <4>
    # Now we can call the function
    _diffusion.evolve(pointer_grid, pointer_out, cD, cdt)  # <4>

def run_experiment(num_iterations):
    scratch = np.zeros(grid_shape, dtype=ctypes.c_double)
    grid = np.zeros(grid_shape, dtype=ctypes.c_double)
    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005
    start = time.time()
    for i in range(num_iterations):
        evolve(grid, scratch, 0.1)
        grid, scratch = scratch, grid
    return time.time() - start

if __name__ == "__main__":
    t = run_experiment(500)
    print(t)
```
:::
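For stricter type checking at the boundary, numpy's `np.ctypeslib.ndpointer` can be used in the `argtypes` declaration so that ctypes verifies the array's dtype, dimensionality, and layout for us. A hedged sketch:
```python=
import ctypes

import numpy as np

_diffusion = ctypes.CDLL("diffusion.so")
# ctypes now rejects arrays with the wrong dtype, dimensionality, or layout
_diffusion.evolve.argtypes = [
    np.ctypeslib.ndpointer(dtype=np.double, ndim=2, flags="C_CONTIGUOUS"),
    np.ctypeslib.ndpointer(dtype=np.double, ndim=2, flags="C_CONTIGUOUS"),
    ctypes.c_double,
    ctypes.c_double,
]
_diffusion.evolve.restype = None
```
The grid arrays can then be passed to `_diffusion.evolve` directly, without the manual `data_as` casts.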
Alternatively, we can call into `diffusion.so` via the `cffi` module:
:::info
:::spoiler `diffusion2.py`
```python=
import time

import numpy as np
from cffi import FFI, verifier

grid_shape = (512, 512)

ffi = FFI()  # <1>
ffi.cdef("void evolve(double **in, double **out, double D, double dt);")  # <1>
lib = ffi.dlopen("diffusion.so")  # <1>

def evolve(grid, dt, out, D=1.0):
    pointer_grid = ffi.cast("double**", grid.ctypes.data)  # <2>
    pointer_out = ffi.cast("double**", out.ctypes.data)  # <2>
    lib.evolve(pointer_grid, pointer_out, D, dt)  # <2>

def run_experiment(num_iterations):
    scratch = np.zeros(grid_shape, dtype=np.double)  # <3>
    grid = np.zeros(grid_shape, dtype=np.double)  # <3>
    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005
    start = time.time()
    for i in range(num_iterations):
        evolve(grid, 0.1, scratch)  # <3>
        grid, scratch = scratch, grid
    return time.time() - start

if __name__ == "__main__":
    t = run_experiment(500)
    print(t)
    verifier.cleanup_tmpdir()
```
:::
### Numba
Numba is a just-in-time compiler that specializes in numpy code, compiling via the LLVM compiler (not via g++ or gcc) at runtime. Numba cannot bind to external C libraries (as Cython can), but ==it can automatically generate code for GPUs (which Cython cannot)==. By simply adding the `@jit` decorator to a function, Numba optimizes it where it can. For example, we modify `julia1.py` as:
:::info
:::spoiler `julia1.py`
```python=
import numba
from numba import jit
import time
import numpy as np

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193

@jit(nopython=False)
def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
...
```
:::
where the `nopython` specifier means that Numba must compile the entire function to machine code or fail. With `nopython=False`, we turn off that failure mode, and Numba silently falls back to an object (Python) mode for whatever it cannot compile.
If you run the same function a second time in the same Python session, it runs even faster, because there is no need to recompile the target function when the argument types are the same. On the second run, the Numba result is equivalent to the Cython-with-numpy result. ==So it came out as fast as Cython for very little work.==
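A quick way to see the compilation overhead is to time two successive calls yourself. A minimal sketch, assuming `calculate_z`, `c_real`, and `c_imag` are defined as above; the input sizes are illustrative:
```python=
import time

import numpy as np

zs = np.zeros(1000000, dtype=np.complex128)  # illustrative inputs
cs = np.full(1000000, complex(c_real, c_imag), dtype=np.complex128)
output = np.empty(len(zs), dtype=np.int32)

for attempt in range(2):
    start = time.time()
    calculate_z(300, zs, cs, output)
    # the first call includes JIT compilation; the second reuses the cached machine code
    print(f"Attempt {attempt}: {time.time() - start:0.2f} seconds")
```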
We can also add `OpenMP` parallelization support with `prange`, as shown below:
:::info
:::spoiler `julia2.py`
```python=
import numba
from numba import jit
from numba import prange
import numpy as np

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193

@jit(nopython=True, parallel=True)
def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in prange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
...
```
:::
We expand the decorator to require `nopython` and `parallel`.
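If needed, the number of threads used by `parallel=True` regions can be capped at runtime via `numba.set_num_threads`, which is available in recent Numba versions:
```python=
import numba

# cap the threads used by prange loops; the default is one thread per core
numba.set_num_threads(4)
```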
### PyPy
[PyPy](https://www.pypy.org/) is a drop-in replacement for CPython and offers all the built-in modules. The JIT compiler in PyPy is very effective, and good speedups can be seen with little or no work on your part. PyPy can run our pure Python Julia demo without any modifications.
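For example, assuming PyPy 3 is installed and `julia1.py` uses the pure-Python `calculate_z` (the commented-out import) instead of the compiled Cython module:
```
$ pypy3 julia1.py
```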
### Summary
| | Cython | Numba | PyPy |
| --------- | -------- | -------- | ---- |
| Mature | Y | Y | Y |
| Widespread| Y | - | - |
| *numpy* support | Y | Y | Y |
| Nonbreaking code changes | - | Y | Y |
| Need *C* knowledge | Y | - | - |
| Support *OpenMP* | Y | Y | - |
### Using GPUs via PyTorch
[PyTorch](https://pytorch.org/) is a dynamic computational graph tensor library that is particularly user-friendly and has an intuitive API for anyone familiar with numpy. We can move calculations onto the GPU via PyTorch with simple code modifications. For example, we modify the 2D diffusion code as:
:::info
:::spoiler `diffusion_pytorch.py`
```python=
import timeit

# from numpy import roll, zeros
import torch
from torch import (roll, zeros)  # <1>

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # <2>
GRID_SHAPE = (2048, 2048)

def laplacian(grid):
    return (
        roll(grid, +1, 0)
        + roll(grid, -1, 0)
        + roll(grid, +1, 1)
        + roll(grid, -1, 1)
        - 4 * grid
    )

def evolve(grid, dt, D=1):
    return grid + dt * D * laplacian(grid)

def run_experiment(num_iterations, grid_shape=GRID_SHAPE):
    grid = zeros(grid_shape)
    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)
    grid[block_low:block_high, block_low:block_high] = 0.005
    # grid = grid.cuda()  # <3>
    grid = grid.to(DEVICE)  # <3>
    for i in range(num_iterations):
        grid = evolve(grid, 0.1)
    return grid

if __name__ == "__main__":
    print("torch.cuda.is_available: {}".format(torch.cuda.is_available()))
    n_iter = 100
    N, runtime = timeit.Timer(f"run_experiment({n_iter})", globals=globals()).autorange()
    print(f"Runtime with grid {GRID_SHAPE}: {runtime / N:0.4f}s")
```
:::
However, the biggest speed constraint for GPUs is the ==transfer time of data from the system memory to the GPU memory==. When we use `tensor.to(DEVICE)`, we are triggering a transfer of data that may take some time depending on the speed of the ==GPU's bus== and the ==amount of data== being transferred.
In fact, one of the biggest things slowing down PyTorch code when it comes to deep learning is copying training data from the host into the GPU. Often the training data is simply too big to fit on the GPU, and doing these constant data transfers is an unavoidable penalty.
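To see the transfer cost in isolation, we can time the host-to-device copy directly. A minimal sketch; the `torch.cuda.synchronize()` call is needed because CUDA operations run asynchronously:
```python=
import time

import torch

grid = torch.zeros((2048, 2048))  # allocated in host (CPU) memory
if torch.cuda.is_available():
    start = time.time()
    grid_gpu = grid.to(torch.device("cuda:0"))  # triggers the host-to-GPU copy
    torch.cuda.synchronize()  # wait until the copy has actually finished
    print(f"Transfer took {time.time() - start:0.4f} seconds")
```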