# Numba Programming Reference

> * https://lulaoshi.info/gpu/python-cuda/numba
> * A ~5 minute guide to Numba: https://numba.pydata.org/numba-doc/latest/user/5minguide.html

*In this note, I will provide a way to correctly install Numba in WSL2.*

Numba is a just-in-time compiler for Python that takes Python bytecode and compiles it into native machine code, so the interpreter is bypassed.

## Install Numba

Add the following lines to `.bashrc` in your home directory. Note that `"NVIDIA Tesla P40"` should be changed to the GPU you use, and the visible device should be set according to the GPU you choose if you have multiple GPUs in your system.

```bash
export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"
export MESA_D3D12_DEFAULT_ADAPTER_NAME="NVIDIA Tesla P40"
export CUDA_VISIBLE_DEVICES=0
```

## Device Info

This is the report that `numba.cuda.detect()` prints for the system used in this note:

```bash
Found 1 CUDA devices
id 0      b'Tesla P40'                              [SUPPORTED]
                      Compute Capability: 6.1
                           PCI Device ID: 0
                              PCI Bus ID: 129
                                    UUID: GPU-fe961b5a-b59b-3a4b-b15f-ba94324892e6
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported
```

![image](https://hackmd.io/_uploads/SyiJ54s0p.png =x500)

## Example: Monte Carlo Pi

```python
from random import random

from numba import jit, prange


@jit(nopython=True, parallel=True, cache=True)
def monte_carlo_pi_parallel(itr=10000):
    hit_count = 0
    # prange (rather than range) marks the loop for parallel execution;
    # Numba handles the hit_count accumulation as a parallel reduction
    for _ in prange(itr):
        x = random()
        y = random()
        if (x ** 2 + y ** 2) < 1.0:
            hit_count += 1
    result = 4.0 * hit_count / itr
    return result
```

This example uses Numba for a Monte Carlo estimation of the value of pi. To use Numba in Windows Subsystem for Linux, `export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"` needs to be added to the bash configuration file, as shown in the install section above.

## Example: Vector Add

```python
from numba import cuda, jit, prange


@jit(nopython=True, parallel=True, cache=True)
def vector_add_parallel(x, y, size):
    # Multi-threaded CPU version: prange marks the loop as parallel
    for i in prange(size):
        y[i] = x[i] + y[i]


@cuda.jit
def vector_add_cuda(x, y, size):
    # Absolute position of this thread in the 1D grid
    i = cuda.grid(1)
    # Guard against threads past the end of the arrays
    if i < size:
        y[i] = x[i] + y[i]
```

**With array size 1024 x 1024**

Since it is a naive kernel, not much computation happens inside the kernel. Notice that a warning is reported by the just-in-time compiler.

```bash
0.56.4
Found 1 CUDA devices
id 0      b'Tesla P40'                              [SUPPORTED]
                      Compute Capability: 6.1
                           PCI Device ID: 0
                              PCI Bus ID: 129
                                    UUID: GPU-fe961b5a-b59b-3a4b-b15f-ba94324892e6
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported
[2. 2. 2. ... 2. 2. 2.]
Take Times: 0.6530523300170898
[2. 2. 2. ... 2. 2. 2.]
Take Times: 0.2872049808502197
```

***Works well with Numba***

* Numerical code with loops
* Large amounts of data
* Data-parallel operations

***Things that can be tricky to optimize, particularly on CUDA***

* Code using lots of strings or dicts
* Inherently serial logic
* Calling lots of already-compiled code
* Code with a lot of object-oriented patterns and features
* Code that's already been heavily optimized using another tool / paradigm

***`@njit` decorator***

* Shorthand for `@jit(nopython=True)`
* Compiles the Python bytecode to native code
* Single-threaded on the CPU

***Parallelize execution on the CPU***

* Pass `parallel=True` to the decorator
* Use `prange` instead of `range` to mark the loop that needs parallelism

The first call to the function triggers the compiler, which costs a noticeable amount of time; the sketch below makes this warm-up cost visible.
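Here is a minimal sketch of the warm-up cost (the function, array size, and printed labels are illustrative assumptions, not from the original note). The first call includes compilation, while the second reuses the compiled machine code:

```python
import time

import numpy as np
from numba import njit, prange


@njit(parallel=True)
def sum_parallel(a):
    total = 0.0
    # prange marks this reduction loop for multi-threaded execution
    for i in prange(a.size):
        total += a[i]
    return total


a = np.ones(10_000_000)

start = time.time()
sum_parallel(a)  # first call: JIT compilation + execution
print("first call :", time.time() - start)

start = time.time()
sum_parallel(a)  # second call: compiled code only
print("second call:", time.time() - start)
```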
We can also pass `cache=True` to `@njit`, which saves the compiled object code to disk and avoids recompiling the next time the program runs.

***`@cuda.jit` decorator***

* The thread index can be calculated from `cuda.grid()`
* As in CUDA C programming, the kernel indices need to be checked against the array bounds
* The data needs to be copied to the device first; otherwise, implicit copies are invoked every time the kernel is called and executed (see the sketch after this list)
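Below is a minimal sketch of the explicit-copy pattern, reusing the `vector_add_cuda` kernel from the example above; the array size and launch configuration (256 threads per block) are assumptions for illustration. `cuda.to_device` copies each input to the GPU once, and `copy_to_host` brings the result back:

```python
import numpy as np
from numba import cuda


@cuda.jit
def vector_add_cuda(x, y, size):
    i = cuda.grid(1)
    if i < size:
        y[i] = x[i] + y[i]


size = 1024 * 1024
x = np.ones(size, dtype=np.float32)
y = np.ones(size, dtype=np.float32)

# Copy the inputs to the GPU once, instead of letting Numba create
# implicit host-to-device copies on every kernel launch
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)

threads_per_block = 256
blocks_per_grid = (size + threads_per_block - 1) // threads_per_block
vector_add_cuda[blocks_per_grid, threads_per_block](d_x, d_y, size)

# Copy the result back to the host explicitly
result = d_y.copy_to_host()
print(result[:4])  # expected: [2. 2. 2. 2.]
```

Keeping `d_x` and `d_y` on the device also lets repeated kernel launches reuse them without transferring the data again.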