As introduced in [Tensorflow XLA: The Fusion Compiler for Tensorflow](https://www.geeksforgeeks.org/deep-learning/tensorflow-xla-the-fusion-compiler-for-tensorflow/), one of XLA's most important optimizations is Fusion. This article briefly discusses what Fusion is, why it's crucial, and the actual performance benefit of applying it in the context of ZK.

## What is Fusion?

**Loop Fusion**, for example, means combining loops like the following:

```c++
for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
for (int i = 0; i < n; i++) {
  d[i] = c[i] * a[i];
}
```

into a single, fused loop:

```c++
for (int i = 0; i < n; i++) {
  d[i] = (a[i] + b[i]) * a[i];
}
```

In compilers like ZKX, element-wise operations are primary candidates for fusion. For example, consider the following HLO-like operations:

```hlo
ENTRY %main {
  %x = babybear[n] parameter(0)
  %y = babybear[n] parameter(1)
  %0 = babybear[n] add(%x, %y)
  ROOT %1 = babybear[n] multiply(%0, %x)
}
```

Through fusion, these operations are combined into a single one, as shown below:

```hlo
%f {
  %x = babybear[n] parameter(0)
  %y = babybear[n] parameter(1)
  %0 = babybear[n] add(%x, %y)
  ROOT %1 = babybear[n] multiply(%0, %x)
}

ENTRY %main {
  %x = babybear[n] parameter(0)
  %y = babybear[n] parameter(1)
  ROOT %0 = babybear[n] fusion(%x, %y), kind=kLoop, calls=%f
}
```

So, let's explore why combining operations is so important.

## Why is Fusion Important?

The problem we are tackling is **memory-bound**, so we must exploit the memory hierarchy to minimize memory transfers.

![image](https://hackmd.io/_uploads/S1nFLr-g-g.png)

You can think of GPU data movement in three tiers. The lower you go, the **faster and smaller** it gets.

1. **Host ↔ Device transfers**: PCIe/NVLink, etc. The most expensive (high latency, limited bandwidth).
2. **Device Global Memory** accesses: high bandwidth but **off-chip**, so latency is still large.
3. **On-chip resources**: L2/L1, **Shared Memory**, registers. Fastest.

The key principle: minimize round-trips in (1) and (2), and stay in (3) as long as possible.

**When do Host ↔ Device transfers happen?** Whenever we copy data from the host to the GPU or back (e.g., via `cudaMemcpy` or Unified Memory migration), we incur expensive host-device transfers.

**When do Global Memory accesses happen?** To share data across kernel boundaries, intermediates must be written to Global Memory and then read back by the next kernel. This cost accumulates quickly. Once a kernel is launched, computation is distributed over Grid / Block / Thread, and within a kernel, Shared Memory and caches handle local sharing. But if temporal or spatial locality is broken, Global Memory round-trips become costly.

### Revisit Our Example

```c++
for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
for (int i = 0; i < n; i++) {
  d[i] = c[i] * a[i];
}
```

Let's assume each loop is launched as a separate GPU kernel. Then two problems arise:

1. Two separate kernel launches are required, each paying launch overhead.
2. The intermediate array `c` must be written to Global Memory so that the next kernel can read it back to compute `d`.

Now, let's look at the fused version again:

```c++
for (int i = 0; i < n; i++) {
  d[i] = (a[i] + b[i]) * a[i];
}
```

Fusion solves both problems: there is a single kernel launch, and the intermediate sum lives in a register instead of Global Memory. This is a critical optimization because this computational pattern (a sequence of element-wise operations) is extremely common in ZK, particularly in procedures like Sumcheck and Quotienting.
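To make this concrete, here is a minimal CUDA sketch of both versions, assuming plain `float` arrays already resident on the device. The kernel names and launch configuration are illustrative, not ZKX's actual code generation.

```cuda
// Unfused version: two kernel launches. The intermediate array `c` makes a
// full round-trip through Global Memory between the two kernels.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];  // writes `c` to Global Memory
}

__global__ void mul_kernel(const float* c, const float* a, float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] = c[i] * a[i];  // reads `c` back from Global Memory
}

// Fused version: one launch; the intermediate sum never leaves a register.
__global__ void fused_kernel(const float* a, const float* b, float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float t = a[i] + b[i];  // intermediate stays on-chip
    d[i] = t * a[i];
  }
}

// Launch (sized so every element gets a thread):
//   int threads = 256, blocks = (n + threads - 1) / threads;
//   add_kernel<<<blocks, threads>>>(a, b, c, n);   // unfused: 2 launches
//   mul_kernel<<<blocks, threads>>>(c, a, d, n);
//   fused_kernel<<<blocks, threads>>>(a, b, d, n); // fused: 1 launch
```

The fused kernel removes both the second launch and the Global Memory traffic for `c`; the only remaining traffic is the unavoidable reads of `a` and `b` and the write of `d`.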
## Result

This principle seems simple, but it often breaks down when using frameworks like ICICLE for GPU programming, because each vector operation (vecop) is typically a separate kernel call, as seen in their [vecops API](https://dev.ingonyama.com/api/cpp/vec_ops).

We ran a benchmark for the operation $x^2 - x$ (where $x \in \mathbb{F}^{2^k}$) on an RTX 5090. The result shows a 1.5x to 1.8x performance gap between the fused (ZKX) and unfused (ICICLE) versions.

| k  | ZKX (ms) | ICICLE (ms) | ICICLE / ZKX |
| -- | -------- | ----------- | ------------ |
| 20 | 10.8636  | 17.1061     | 1.57         |
| 21 | 21.396   | 35.3345     | 1.65         |
| 22 | 41.4869  | 71.9025     | 1.73         |
| 23 | 81.8148  | 143.849     | 1.76         |
| 24 | 160.608  | 288.029     | 1.79         |
| 25 | 320.259  | 582.057     | 1.82         |

## Conclusion

To be fair, the ICICLE team is aware of fusion's importance and provides a way for users to apply fusion manually via their [program API](https://dev.ingonyama.com/api/cpp/program). However, this manual approach only scratches the surface of what a compiler can automate. Expecting a user to understand the intricate hardware architecture and manually handle the full suite of loop optimizations, such as tiling, vectorization, unrolling, and software pipelining, is an extremely difficult and unreasonable burden. This is precisely the work a compiler should do.
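As a closing illustration of what the user is otherwise left to write by hand, below is a hypothetical hand-fused CUDA kernel for the benchmarked $x^2 - x$. The modulus `P` (BabyBear, $2^{31} - 2^{27} + 1$) and the naive 64-bit reduction are simplifying assumptions; real field libraries use Montgomery form and further loop optimizations on top of this.

```cuda
#include <cstdint>

// BabyBear prime 2^31 - 2^27 + 1; inputs are assumed reduced (x[i] < P).
constexpr uint32_t P = 2013265921u;

// Fused x^2 - x over F_P: a single launch, with the square kept in a
// register instead of being written to Global Memory between two vecops.
__global__ void square_minus_x(const uint32_t* x, uint32_t* out, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {                      // grid-stride loop
    uint32_t sq = (uint32_t)((uint64_t)x[i] * x[i] % P);   // x^2 mod P
    out[i] = (sq + P - x[i]) % P;                          // subtract without underflow
  }
}
```

An unfused vecop pipeline computes the same result with two launches (a multiply, then a subtraction) and a Global Memory round-trip for the square, which is consistent with the 1.5x to 1.8x gap measured above.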