Argonne Visit (14 weeks): Proposal
===
###### tags: `ANL`
## Title
**Differentiable GPU-Offloading beyond CUDA: Handling Machine Learning with OpenMP & Enzyme**
## Executive Summary
> Differentiable GPU computing for some of the largest workloads, such as machine learning, is today mostly performed through NVIDIA's proprietary backends cuBLAS and cuDNN. In this project we build on LLVM's open-source stack by combining OpenMP and Enzyme: first to simplify the user-facing syntax, and subsequently to construct a proof-of-concept PyTorch backend that uses OpenMP's GPU-offloading and Enzyme's differentiation engine to perform all differentiation and offloading within the LLVM compiler stack.
## Key Deliverables
1. An easy-to-use syntax for Enzyme within OpenMP at the pragma level
2. Unit tests that ensure this syntax plays nicely with OpenMP GPU-offloading
3. Integrate an (experimental) OpenMP backend into PyTorch for [MLsys 2023](https://mlsys.org)
## Description
#### 1. Easily accessible syntax of Enzyme in OpenMP
Enzyme currently requires its users to adhere to its exact calling convention, which poses a number of hurdles for first-time users, such as defining and initializing the shadow (gradient) memory. To make Enzyme's gradients more accessible, easier to integrate into existing codes, and friendlier overall, we seek to expose Enzyme at the pragma level of OpenMP. See the example below for a potential syntax:
```cpp
// Proposed syntax: a new enzyme(...) clause requests reverse-mode
// differentiation of the loop, with "shadow" naming the gradient storage.
#pragma omp parallel for firstprivate(..) enzyme(reverse, shadow)
for (int k = 0; k < loopEnd; ++k)
{
    func1(..);
    func2(..);
    func3(..);
}
```
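For comparison, below is a minimal sketch of what a first-time user currently has to write with Enzyme's C/C++ interface: the shadow buffer `d_x` must be allocated, zero-initialized, and passed explicitly. The `sumOfSquares` function is purely illustrative, not code from this proposal.
```cpp
#include <vector>

// Enzyme's activity annotations and autodiff entry point.
extern int enzyme_dup;
extern int enzyme_const;
void __enzyme_autodiff(void*, ...);

// Example primal: sum of squares over a buffer.
double sumOfSquares(const double* x, int n) {
  double s = 0.0;
  for (int k = 0; k < n; ++k)
    s += x[k] * x[k];
  return s;
}

int main() {
  int n = 16;
  std::vector<double> x(n, 2.0);
  std::vector<double> d_x(n, 0.0);  // the "shadow": must exist and be zeroed

  // Reverse mode: d_x accumulates d(sumOfSquares)/dx = 2*x.
  __enzyme_autodiff((void*)sumOfSquares,
                    enzyme_dup, x.data(), d_x.data(),
                    enzyme_const, n);
  return 0;
}
```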
Open question:
- Can we even automate the creation of the shadow to make it truly seamless?
#### 2. Tests for Enzyme integration with OpenMP GPU-offloading
Before proceeding, we have to properly test OpenMP GPU-offloading in conjunction with Enzyme, both to harmonize the two and to weed out potential bugs.
- Write unit tests that run Enzyme together with OpenMP GPU-offloading, exercise the major features of OpenMP on GPUs, and verify the resulting gradients (see the sketch below).
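A minimal sketch of such a gradient check, assuming Enzyme's existing `__enzyme_autodiff` interface and that the differentiated `target` region compiles through the offloading pipeline; the dot-product kernel and the tolerance are illustrative only:
```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Enzyme's activity annotations and autodiff entry point.
extern int enzyme_dup;
extern int enzyme_const;
void __enzyme_autodiff(void*, ...);

// Kernel under test: a dot product offloaded through an OpenMP target region.
double dot(const double* x, const double* y, int n) {
  double sum = 0.0;
  #pragma omp target teams distribute parallel for \
      map(to: x[0:n], y[0:n]) map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i < n; ++i)
    sum += x[i] * y[i];
  return sum;
}

int main() {
  const int n = 1024;
  std::vector<double> x(n), d_x(n, 0.0), y(n);
  for (int i = 0; i < n; ++i) { x[i] = 1.0 + i; y[i] = 0.5 * i; }

  // Reverse mode: d_x accumulates d(dot)/dx, which analytically equals y.
  __enzyme_autodiff((void*)dot,
                    enzyme_dup, x.data(), d_x.data(),
                    enzyme_const, y.data(),
                    enzyme_const, n);

  // Verify the computed gradient against the analytic result.
  for (int i = 0; i < n; ++i)
    if (std::fabs(d_x[i] - y[i]) > 1e-12) {
      std::printf("gradient mismatch at %d: %f vs %f\n", i, d_x[i], y[i]);
      return 1;
    }
  std::printf("gradient check passed\n");
  return 0;
}
```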
#### 3. (Experimental) OpenMP backend for PyTorch
To give PyTorch a more flexible backend, and most importantly to enable the first proper comparison of Enzyme's efficiency against a state-of-the-art, purpose-built machine-learning AD engine, we will add an OpenMP backend to PyTorch. For this we will rely in large part on the stability and cleaner syntax delivered by parts 1 & 2.
<div style="text-align:center">
<img src="https://i.imgur.com/6xWSuQV.png" alt="drawing" width="400"/>
</div>
Here there are multiple paths to consider, such as the previous work on [Triton](https://github.com/openai/triton) (see the figure above), and earlier machine-learning compilers such as
- [PlaidML](https://github.com/plaidml/plaidml)
- [Tensor Comprehensions](https://github.com/facebookresearch/TensorComprehensions)

Auxiliary work also exists in [Intel oneDNN](https://github.com/oneapi-src/oneDNN), which already utilizes OpenMP for its CPU-based runtime and hence offers several starting points to build the code for this task on. oneDNN's design, however, diverges from our goal of relying purely on OpenMP:
- its GPU engine relies on OpenCL, the oneAPI DPC++ Compiler, or a CUDA backend
- oneAPI DPC++ contains part of the logic for OpenMP GPU offload, but oneDNN's GPU engine cannot rely on OpenMP offload alone and always has to include at least OpenCL.

For the initial prototype there are multiple potential pathways, such as working at the level of PyTorch's [C++ API](https://pytorch.org/cppdocs/) (see the sketch after the list below), or building on the previous [embedding work](https://arxiv.org/abs/2010.08439) to let the Clang compiler do the heavy lifting. Both pathways leave out the entire Python frontend in the beginning and hence significantly reduce complexity at this stage. We can furthermore draw on the Triton IR/compiler, as sketched above, for specific optimization passes and for how to integrate with PyTorch itself. Triton will also serve as a necessary benchmark to compare against later on, i.e.
- PyTorch with CUDA backend
- Triton with PyTorch's Autograd AD
- PyTorch with Enzyme and OpenMP GPU-offloading
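As one concrete illustration of the C++-API pathway, the sketch below registers a custom autograd node whose backward pass calls a hypothetical C entry point `enzyme_relu_grad`, standing in for a gradient kernel produced by the OpenMP/Enzyme pipeline of parts 1 & 2; the ReLU example, the float-only tensors, and the symbol name are assumptions, not existing code.
```cpp
#include <torch/torch.h>

// Hypothetical C binding to an Enzyme-generated gradient kernel; the real
// entry point would be produced by the OpenMP/Enzyme pipeline of parts 1 & 2.
extern "C" void enzyme_relu_grad(const float* x, const float* grad_out,
                                 float* grad_in, int64_t n);

// Custom autograd node: forward runs the (eventually OpenMP-offloaded) kernel,
// backward calls the Enzyme-derived gradient instead of PyTorch's built-in one.
struct EnzymeReLU : public torch::autograd::Function<EnzymeReLU> {
  static torch::Tensor forward(torch::autograd::AutogradContext* ctx,
                               torch::Tensor input) {
    ctx->save_for_backward({input});
    return input.clamp_min(0);  // placeholder for the offloaded forward kernel
  }

  static torch::autograd::tensor_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::tensor_list grad_outputs) {
    auto input = ctx->get_saved_variables()[0].contiguous();
    auto grad_out = grad_outputs[0].contiguous();
    auto grad_in = torch::empty_like(input);
    enzyme_relu_grad(input.data_ptr<float>(), grad_out.data_ptr<float>(),
                     grad_in.data_ptr<float>(), input.numel());
    return {grad_in};
  }
};
```
Usage would then be `torch::Tensor y = EnzymeReLU::apply(x);`, which drops the Enzyme-backed operator into PyTorch's existing autograd graph.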
> Targeted towards MLSys, whose submission date should coincide with the end date at Argonne (Oct 8 or 15)
## Notes from Call
[Notes](/5y9NhGBsS56FIb812YgwkQ)