## [Custom C++ and CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html)

### C++ extensions come in two flavors:
- They can be built “ahead of time” (AOT) with setuptools,
- or “just in time” (JIT) via torch.utils.cpp_extension.load().

### AOT
- First, write a setup.py that uses setuptools to compile our C++ code (lltm.cpp).

```python
# setup.py
from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name='lltm_cpp',
      ext_modules=[cpp_extension.CppExtension('lltm_cpp', ['lltm.cpp'])],
      cmdclass={'build_ext': cpp_extension.BuildExtension})

# ext_modules is the list of extension modules to build.
# The Extension name must match the TORCH_EXTENSION_NAME used by pybind11,
# while setup's name is the package name.
# Naming convention:
# - Coincide when possible: for single-module packages, keep the package name
#   and the extension module name the same for simplicity.
# - Different names for multi-module packages: use a broader package name and
#   more specific module names when the package contains multiple extensions.
```

- C++ code example:

```cpp
// inside lltm.cpp
#include <torch/extension.h>
#include <iostream>

torch::Tensor d_sigmoid(torch::Tensor z) {
  auto s = torch::sigmoid(z);
  return (1 - s) * s;
}
```

- <torch/extension.h> is the one-stop header to include all the necessary PyTorch bits to write C++ extensions. It includes:
    - the ATen library, which is our primary API for tensor computation,
    - pybind11, which is how we create Python bindings for our C++ code,
    - headers that manage the details of interaction between ATen and pybind11.
- Once our operations are written in C++ and ATen, we can use pybind11 to bind our C++ functions or classes into Python in a very simple manner (lltm_forward and lltm_backward are the full LLTM implementations from the tutorial):

```cpp
// at the end of lltm.cpp
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &lltm_forward, "LLTM forward");
  m.def("backward", &lltm_backward, "LLTM backward");
}
```

- Finally, run:

```bash
python setup.py install
```

- The result:

```python
In [1]: import torch
In [2]: import lltm_cpp
In [3]: lltm_cpp.forward
Out[3]: <function lltm_cpp.PyCapsule.forward>
```

- A wonderful fact about PyTorch’s ATen backend is that it abstracts the computing device you are running on. This means the same code we wrote for CPU can also run on GPU, and individual operations will correspondingly dispatch to GPU-optimized implementations.

### JIT
- The JIT compilation mechanism compiles and loads your extensions on the fly through a single function in PyTorch’s API, torch.utils.cpp_extension.load(). For the LLTM, this looks as simple as:

```python
from torch.utils.cpp_extension import load

lltm_cpp = load(name="lltm_cpp", sources=["lltm.cpp"])
```

- Under the hood, this is what load() does:
    - create a temporary directory such as /tmp/torch_extensions/lltm_cpp,
    - emit a Ninja build file into that directory,
    - compile your source files into a shared library,
    - import this shared library as a Python module.
- If you pass verbose=True to cpp_extension.load(), you will be informed about the process.
- The first call to load() takes longer, because the extension has to be compiled; subsequent calls reuse the cached build.
- In most cases, the JIT approach is sufficient.

### CUDA kernel
- The best practice (a minimal file-pair sketch follows this list):
    - C++ file as boilerplate:
        - You start by writing a C++ file that defines the functions that will be called from Python. These functions serve as the interface between the Python code and the underlying CUDA kernels.
        - In the C++ file, you use pybind11 to bind the C++ functions to Python, making them accessible from the Python side.
        - Additionally, in the C++ file, you declare the CUDA functions that are implemented in separate CUDA (.cu) files.
        - The C++ functions perform the necessary checks, prepare the data, and then forward the calls to the corresponding CUDA functions.
    - CUDA files:
        - In separate CUDA files (with the .cu extension), you write the actual CUDA functions, i.e. the ones declared in the C++ file.
        - Each needs two things: a host function that performs the operations we don’t wish to write by hand and calls into the CUDA kernels, and the actual CUDA kernels for the parts we want to speed up.
    - Compilation process:
        - You must give your CUDA file a different name than your C++ file, because setuptools cannot handle two source files with the same name but different extensions.
        - The cpp_extension package (via BuildExtension) ensures that each file is handled by the compiler best suited to it: the C++ compiler for .cpp files and nvcc for .cu files.
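To make this layout concrete, here is a minimal sketch of such a file pair for the d_sigmoid op above. The file names (d_sigmoid.cpp, d_sigmoid_cuda.cu), the d_sigmoid_cuda function, and the CHECK_* macros are illustrative assumptions, not the tutorial's full LLTM code:

```cpp
// d_sigmoid.cpp -- hypothetical C++ boilerplate file
#include <torch/extension.h>

// Declaration of the host function implemented in the .cu file.
torch::Tensor d_sigmoid_cuda(torch::Tensor z);

// Checks performed on the C++ side before forwarding to CUDA.
#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")

torch::Tensor d_sigmoid(torch::Tensor z) {
  CHECK_CUDA(z);
  CHECK_CONTIGUOUS(z);
  return d_sigmoid_cuda(z);  // forward the call to the CUDA side
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("d_sigmoid", &d_sigmoid, "derivative of sigmoid (CUDA)");
}
```

```cpp
// d_sigmoid_cuda.cu -- hypothetical CUDA file (note: a different
// name than the C++ file, as required by the setup.py method)
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

// The actual CUDA kernel: the part we want to speed up.
template <typename scalar_t>
__global__ void d_sigmoid_kernel(const scalar_t* __restrict__ z,
                                 scalar_t* __restrict__ out,
                                 size_t n) {
  const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    const scalar_t s = scalar_t(1) / (scalar_t(1) + exp(-z[i]));
    out[i] = (scalar_t(1) - s) * s;
  }
}

// The host function: allocates the output, picks a launch
// configuration, and calls into the kernel.
torch::Tensor d_sigmoid_cuda(torch::Tensor z) {
  auto out = torch::empty_like(z);
  const size_t n = z.numel();
  const int threads = 1024;
  const int blocks = (n + threads - 1) / threads;
  AT_DISPATCH_FLOATING_TYPES(z.scalar_type(), "d_sigmoid_cuda", ([&] {
    d_sigmoid_kernel<scalar_t><<<blocks, threads>>>(
        z.data_ptr<scalar_t>(), out.data_ptr<scalar_t>(), n);
  }));
  return out;
}
```

Whether compiled AOT or with load(), the Python side then simply calls the bound d_sigmoid on a CUDA tensor.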
### Under the hood
- pybind11:
    - Role: bridges your C++ (or CUDA) code with Python. It allows you to define functions, classes, and variables in C++/CUDA and expose them to Python so they can be used just like Python functions and classes.
    - Example: you write some high-performance C++ code, like a matrix multiplication function, and use pybind11 to make this function callable from Python.
- CUDAExtension:
    - Role: specifies how to compile your C++/CUDA code into a Python extension module. It tells the build system which source files to compile, what flags to use, and how to link everything together.
    - Example: you have a .cu file (CUDA source file) that contains the matrix multiplication implementation. CUDAExtension handles the compilation of this file into a form that Python can import and use.
- BuildExtension:
    - Role: manages the build process within the setup.py script. It tells setuptools how to use CUDAExtension to compile your C++/CUDA code, with its pybind11 bindings, into a Python module.
    - Example: when you run python setup.py build, BuildExtension ensures that the .cu file is compiled into a .so (shared object, on Linux/macOS) or .pyd (on Windows) file that can be imported in Python.
- How they work together (a setup.py sketch follows this section):
    - Write C++/CUDA code: you start by writing your high-performance code in C++/CUDA and use pybind11 to create bindings for this code so that Python can call it.
    - Set up CUDAExtension: in your setup.py, you use CUDAExtension to specify how to compile your C++/CUDA source files. This includes setting up include directories, compile flags, and source files.
    - Use BuildExtension to build: finally, BuildExtension is used in your setup.py script to actually compile the code. It integrates with setuptools and ensures that everything is compiled correctly and placed where Python can find and import it.
- In a nutshell:
    - pybind11: exposes your C++/CUDA code to Python.
    - CUDAExtension: specifies how to compile that C++/CUDA code.
    - BuildExtension: tells the build system to actually compile everything into a Python module.
- So, you write the code, set up the build instructions, and BuildExtension takes care of turning your C++/CUDA code into something Python can use, thanks to pybind11.
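A minimal setup.py sketch tying the three pieces together, reusing the hypothetical file and module names from the CUDA sketch above:

```python
# setup.py -- a minimal sketch; module and file names are the hypothetical
# ones from the CUDA example above, not fixed by the API.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='d_sigmoid_cuda',
    ext_modules=[
        # CUDAExtension lists the sources; BuildExtension routes the .cpp
        # file to the C++ compiler and the .cu file to nvcc, then links
        # both into one importable module.
        CUDAExtension('d_sigmoid_cuda', [
            'd_sigmoid.cpp',
            'd_sigmoid_cuda.cu',
        ]),
    ],
    cmdclass={'build_ext': BuildExtension})
```

After python setup.py install, import d_sigmoid_cuda exposes the bound function, exactly as in the CPU-only lltm_cpp case.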