# The Decentralized Processing Unit
## Abstract
Currently, all zkEVM implementations and zk rollups are single-threaded and do not support asynchronous computation. I argue that in order for Ethereum and rollups to be considered a "World (Super) Computer" capable of efficient general computation, there must be a mechanism that allows asynchronous execution of code and parallel computing across nodes. This would make the rollup/zkEVM network a "giant GPU", or a DPU: a Decentralized Processing Unit.
## Ethereum - the World Computer?
One often hears that Ethereum is a "world computer", but what does that actually mean? Could we somehow do arbitrary general computation on the Ethereum network? Currently the answer is a "limited yes", but for efficient parallel computation it is a resounding "no". I'm going to investigate the possibility of generalized parallel computation in the context of neural networks.
Suppose we had a simple Neural Network $f$ that takes in a 256x256 RGB image $x$, i.e. an array of shape `[256, 256, 3]`, and spits out a number $y$. This neural network is also a function of some set of parameters $w$, which we assume to be adjusted such that the output $y$ is positive for every image $x$ that contains a hotdog, and negative otherwise. In other words,
$$
y = f(x, w) > 0 \ \ \ \text{if} \ \ \ x = \text{"hotdog"}.
$$

*Hotdog: an example input image.*
Now suppose we have a smart contract on the Ethereum network, which we can call with the input image and a transaction fee. A "smart contract" can simply be understood as a program which calls the above function $f$ "on the Ethereum network". What happens then is visualized in the image below:

The user sends the hotdog image to an Ethereum node (Node~1~), which crunches the numbers and outputs $y$. The result and the transaction (green) are composed into a "block" together with other transactions. We can assume that Node~1~ is selected as the block proposer. Because of the blockchain consensus mechanism, this block will be sent to all the other nodes for validation. This means that **all the transactions in the block, including the neural network function call, will be re-computed on all the participating consensus nodes**. In other words, the computational capacity or throughput of the network does not increase at all with the number of nodes participating in the network!
From the point of view of parallel/distributed computation, this is of course completely unacceptable. One would expect the computational throughput to increase in proportion to the number of nodes, but this simply cannot be the case in a blockchain: the security of a trustless network requires that each and every transaction is validated on the other nodes, in order to ensure that the distributed ledger containing all the address balances is kept correctly in sync. In fact, this transaction validation procedure has already proven quite inefficient even in "traditional" applications, such as DeFi and NFTs, with Ethereum fees sometimes surging to tens of dollars. It would seem that the blockchain is an utterly useless tool for general computation.
## Enter rollups
Luckily there is an amazing fix to the situation. In fact there are many, but I'll focus on what I personally think is the most elegant one: the zero knowledge rollup. The idea is quite ingenious: the node that processes a batch of transactions does not submit all the transactions to be validated by all the other nodes, but instead computes a mathematical proof $P$ (typically SNARK or STARK) out of all the transactions, and submits this to be verified by the other nodes. The broad idea is that **the proof is easy to verify by the other nodes**.
How is this possible? I'm not going to go into the details of proof systems (you can read more from e.g. zkSync or Starknet[^cairo] web pages or whitepapers), but here's a somewhat contrived, completely non-cryptographic example: suppose the computation task $f$ is "find the positive root of polynomial $x^2 - 0.41\cdot x -1.41=0$", where $f$ would then use e.g. Newton's method to iterate several steps until the value is close enough to the solution, $x \approx 1.41$. Note that *finding* the root takes several steps, but *verifying* that the solution is $x \approx 1.41$ is simple - just substitute the solution into the equation! S\*ARKs "arithmetize" general programs into similar cryptographic "proof" equations, which can then be verified easily.
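To make the find/verify asymmetry concrete, here's a small, purely illustrative Python sketch (there is nothing cryptographic about it): *finding* the root takes many Newton iterations, while *verifying* a claimed root takes a single substitution.

```python
def f(x):
    return x**2 - 0.41 * x - 1.41

def f_prime(x):
    return 2 * x - 0.41

def find_root(x0=1.0, steps=20):
    """The expensive part: iterate Newton's method many times."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / f_prime(x)
    return x

def verify_root(x, tol=1e-9):
    """The cheap part: a single substitution into the polynomial."""
    return abs(f(x)) < tol

root = find_root()        # many iterations on the "prover"
assert verify_root(root)  # one cheap check on each "verifier"
print(root)               # ~1.41
```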
I've visualized what happens in the following image:

Compared to the L1/base layer case in the previous visualization, now Node~1~ is an L2 zk-rollup node. It outputs a proof "P" and submits this proof to the other nodes in the Ethereum L1 network for verification. Alternatively, the entire network is composed of L2 nodes. Normally, if all the nodes needed to validate the transactions, the validation time would be the same as on Node~1~, i.e. proportional to the number of operations required to process the entire batch. Let's call that number $T$. For a STARK, for example, the verification time is instead proportional to $\log(T)^2$[^cairo]. This means that **as more and more computations are added into the batch, the cost of verifying becomes negligible**, because $\log(T)^2$ grows much more slowly than $T$. This means we get both scalability and security in the Ethereum+L2 network!
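To get a feel for how dramatic this gap is, here's a quick numeric illustration (base-2 logarithm chosen arbitrarily; real verifier costs involve constants and details this ignores):

```python
import math

# Re-execution cost grows like T; verification cost grows like log(T)^2.
for T in (10**3, 10**6, 10**9):
    print(f"T = {T:>10}, log2(T)^2 ~ {math.log2(T) ** 2:.0f}")
# T =       1000, log2(T)^2 ~ 99
# T =    1000000, log2(T)^2 ~ 397
# T = 1000000000, log2(T)^2 ~ 894
```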
There is however a bit of a problem if we return to thinking about the neural network example above: basically all the EVM and corresponding L2 Virtual Machines (zkEVM, Cairo VM etc.) are **single threaded**, i.e. all the computation inside the VMs is done sequentially, or in **serial** fashion. Neural Networks and many other computations however benefit a lot from **parallel** computing. You might have heard that NNs run much faster on GPUs. Let's dig a little deeper into the reasons why this is so.
## Parallelism
Let's be a little more explicit about our neural network $f$. Suppose we "flatten" the input image into a vector of length $N$. We'll also assume the net consists of a simple linear map and a sum operation, like this:
$$
y = f(x; W) \doteq \text{sum}\left( \begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1N}\\
W_{21} & \ddots & & \vdots \\
\vdots & & \ddots & \\
W_{M1} & \cdots & & W_{MN}
\end{bmatrix} \cdot \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \right)
$$
The matrix $W$ has shape $[M, N]$, which means that the output of the matrix-vector product has shape $[M]$. The $\text{sum}$ operation then simply sums all the elements, which results in the scalar value $y$. We assume that the matrix elements ("weights") $W_{ij}$ are adjusted so that hotdog images result in a positive number and images without a sufficient amount of hotdogness result in a negative one. This is about the simplest kind of "neural network" one can imagine, and would not work very well in real life (although one *could* actually express a single convolutional layer like this). Running this computation for given $x$ and $W$ goes like this:
1) Multiply vector $[W_{11}, \cdots, W_{1N}]$ elementwise with $[x_{1}, \cdots, x_{N}]$
2) Sum over the resulting vector, define that as $z_1$
3) Repeat above for all remaining $M-1$ rows => vector $[z_{1}, \cdots, z_{M}]$
4) Sum over this vector, resulting in $y = \text{sum}([z_{1}, \cdots, z_{M}])$
And that's the final output. Doing all of these steps in a serial fashion on a single thread means doing $N$ multiplications in step 1) and $N-1$ additions in step 2), repeating these a total of $M$ times, and in the end doing one more summation over $M$ elements. In total, **this is an $O(MN)$ time complexity operation**.
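As a concrete reference point, here is a minimal serial Python sketch of steps 1)–4) above; every product is computed one after another on a single thread, which is exactly where the $O(MN)$ cost comes from:

```python
def classify(W, x):
    """Serial 'hotdog detector': y > 0 means hotdog."""
    M, N = len(W), len(x)
    z = [0.0] * M
    for i in range(M):              # M rows...
        for j in range(N):          # ...times N products/additions per row
            z[i] += W[i][j] * x[j]  # steps 1) and 2)
    return sum(z)                   # step 4): final sum over the M row sums
```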
### Parallelism on GPU
Things are a bit different on a GPU. Typically a GPU has _thousands_ of little cores, and can run even millions of threads concurrently in a SIMT (Single Instruction Multiple Threads) fashion[^cuda_toolkit]. If we assume that we have lots of cores, we could do some pretty extreme parallelization like this (we assume both $x$ and $W$ have been copied to the GPU's internal "shared memory"):
1) Multiply each $W_{ij}$ with $x_j$ on *different* cores in parallel for all $i, j$
2) Compute $z_i \doteq \sum_j W_{ij} x_j$ on each core $i=1, \ldots, M$ in parallel
3) Compute the sum $y = \text{sum}([z_{1}, \cdots, z_{M}])$
Step 1) is an $O(1)$ operation. The summation operation on all $M$ cores in step 2) might seem like an $O(N)$ operation, but in fact the summation can be done in $O(\log_2(N))$ if we do really extreme parallelization via a "reduce add"[^reduce_add]. So if $M \approx N$, we'd have total $O(\log_2(N))$ time complexity, compared to $O(MN)$! I've intentionally neglected various issues with e.g. copying values in memory, memory cost, etc., so in reality the improvement wouldn't be this radical, but the broad point bears repeating: **parallelization reduces the problem from $O(MN)$ to $O(\log(N))$** in the ideal case! Even the more pessimistic $O(MN) \to O(M)$ improvement would be a significant boost in speed.
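Here is a small Python sketch of such a tree-style "reduce add": each round halves the number of partial sums, so $N$ values are summed in $\lceil \log_2(N) \rceil$ rounds. On a GPU every pair within a round would be added by a different core simultaneously; the list comprehension below just simulates one such round.

```python
def reduce_add(values):
    """Sum a list in ceil(log2(N)) 'parallel' rounds; returns (sum, rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        if len(vals) % 2:  # pad odd-length rounds with a zero
            vals.append(0.0)
        # one parallel round: every pair could be added concurrently
        vals = [vals[2 * k] + vals[2 * k + 1] for k in range(len(vals) // 2)]
        rounds += 1
    return vals[0], rounds

print(reduce_add(range(8)))  # (28, 3): 8 values summed in log2(8) = 3 rounds
```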
### Parallelism on L2
Now let's think of the problem in a smart contract context. Assume we're running Python/Vyper on a single-threaded zkEVM, but instead of a single monolithic rollup, **let's assume there's an entire network of zkEVM nodes**. For comparison, here's a serial Vyper/Python pseudocode of the above:
{%gist harpone/c5fe6241960f8a84d0d91f919b62be04 %}
This is just a for loop over the rows of $W$. `dot` could be an external smart contract implementing the vector-vector dot product (although it might make more sense to have a separate smart contract corresponding to each of the rows of $W$). Then after the loop, we do a final sum (which could also be an external contract call).
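Spelled out, a minimal Python sketch of such a serial version might look like this (`dot` here is just an illustrative stand-in for the external contract call):

```python
def dot(row, x):
    """Stand-in for an external `dot` smart contract call."""
    return sum(w_j * x_j for w_j, x_j in zip(row, x))

def forward_serial(W, x):
    zs = []
    for row in W:             # one `dot` call per row, strictly in order
        zs.append(dot(row, x))
    return sum(zs)            # final sum (could also be a contract call)
```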
Now let's imagine a parallel version:
{%gist harpone/dceebd22135c2149ab78557f80f6e313 %}
I'm using a completely fictitious `contractpool` library here, similar to Python's `multiprocessing` library. The main process still runs on a single thread at the `p.map`, but the difference is that the execution will *not* wait until one row of $W$ and $x$ has been processed by `dot` before moving on. Instead, the different `dot` smart contract instances will write to `zs` asynchronously, typically finishing in any possible order. Then we add a `p.sync()` command, which waits until all the subcontracts have finished before continuing with executing the main contract. This is semantically similar to how SIMT style parallelization is done on NVIDIA GPUs with CUDA C.
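Spelled out, the hypothetical `contractpool` usage might look something like this (the API is invented, loosely modeled on Python's `multiprocessing.Pool`, reusing the `dot` stand-in from above):

```python
from contractpool import Pool  # fictitious library, per the text above

def forward_parallel(W, x):
    p = Pool()                                # one `dot` subcontract per row of W
    zs = p.map(dot, [(row, x) for row in W])  # async dispatch; any finish order
    p.sync()                                  # barrier: wait for all subcontracts
    return sum(zs)                            # main contract resumes serially
```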
Nothing like this would of course be possible on L1, because of the more or less random completion order of the `p.map` operation, which would result in the validating nodes doing the computation in different orders. Note however that the end result is independent of the order, so the asynchronicity shouldn't be a problem. Indeed, there has been some research into enabling parallelization in blockchains, at least in cases where various operations can be swapped without changing the end result[^parallel_blockchain].
On L2, however, something like this might be feasible. The execution order doesn't really matter, as long as valid proofs are generated. This kind of parallelization would probably also help with proof generation, if all the subcontracts generate their individual sub-proofs, which are then combined into a final proof. Nowadays the typical proof generation happens after a large number of transactions have been rolled up, and can take a significant amount of time. The above parallel implementation would basically be a recursive proof, which is being implemented in at least [zkSync](https://blog.matter-labs.io/zksync-v1-1-reddit-edition-recursion-up-to-3-000-tps-subscriptions-and-more-fea668b5b0ff) and [Starknet](https://medium.com/starkware/recursive-starks-78f8dd401025).
Note that it would be possible to parallelize the above example on the client side with e.g. a simple JS script, by simply defining $M$ smart contracts, feeding $x$ to each contract, and doing the final summation on the client side. This is essentially what @guiltygyoza did with [Cairo on Starknet](https://github.com/guiltygyoza/tiny-dnn-on-starknet). This approach however has a serious downside: typically NNs are of course quite _deep_ (whence the "deep learning"), with even hundreds of layers. One would then need lots of back and forth communication between the client and the network, and much of the NN logic would need to be implemented on the client side. It would also not be possible to reach a similar depth of parallelization as in the parallel subcontract scheme.
Then how fast would this actually be in real life? Well, it depends. Mostly on the network latency, that is. We can do some very simple back of the envelope computations to estimate the parallelization efficiency. Let's consider a single operation (OP), which could be for example a vector-vector dot product or even a single floating point op (FLOP), and denote the time it takes to compute this on a single zkEVM thread as $T_{exe}$. Suppose $T_{lat}$ is the average back and forth communication time between two L2 nodes (remember we're now assuming a large _network_ of L2 nodes instead of a single monolithic rollup node). Then doing $N$ such OPs serially takes $T_{exe} \cdot N$ seconds in total, while doing them in parallel takes $T_{lat} + T_{exe}$ seconds. Therefore, if $T_{lat} + T_{exe} < T_{exe} \cdot N$, it is faster to do these $N$ OPs in parallel. In other words, as we increase the number of OPs, i.e. the size of the NN, this inequality is guaranteed to be fulfilled eventually, which means that it will be beneficial to parallelize the computation no matter what the latency is! We can also write the inequality as $N > 1 + T_{lat} / T_{exe}$. Suppose then for example that $T_{lat} = T_{exe} = 10\text{ms}$. This puts the break-even point already at $N=2$, i.e. at just *two* such (quite big) operations! In other words, if the NN would take longer than $20\text{ms}$ on a single zkEVM node, it would be beneficial to split it into two or more parts if network latency is $10\text{ms}$. This assumes the NN is completely parallelizable, which is not the case, but these are just rough estimates anyway.
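The same arithmetic in one place, using the example figures from the text:

```python
T_exe = 0.010  # time for one OP on a single zkEVM thread (10 ms)
T_lat = 0.010  # round-trip latency between two L2 nodes (10 ms)

def parallel_is_faster(n_ops):
    # parallel: one round trip plus one OP; serial: n_ops OPs back to back
    return T_lat + T_exe < n_ops * T_exe

print(parallel_is_faster(2))  # False: exactly break-even at N = 2 (20 ms both ways)
print(parallel_is_faster(3))  # True: anything longer than 20 ms serially wins
```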
In reality this would of course require a fast L2 network of nodes, combined with an efficient routing protocol and message passing interfaces, robustness to nodes entering/exiting the network and so on, which is probably not something the current L2s can do. But theoretically such a network would not be that different from what happens inside a GPU.
Now you might be wondering: why not use GPUs in each of the nodes, if they are indeed so fast (this is similar to what e.g. [Solana is doing](https://medium.com/solana-labs/sealevel-parallel-processing-thousands-of-smart-contracts-d814b378192))[^solana_note]? Yes, that would be possible, but what happens when your node runs out of GPU cores? Then you need to parallelize the model between *two* or more GPUs. And that's basically the operating principle in e.g. Deep Learning model training on cloud clusters: each cloud machine has for example 8 GPUs, and one can typically use even hundreds of such machines to train a single large NN. Parallelizing such a task in DL training is usually not overly complicated, because typically it is simply the _data_ that is parallelized, i.e. the NN fits nicely on a single GPU, and one feeds large batches of images into the model on each GPU device. But [_model_ parallelism is much more complicated](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html), with at least three levels of parallelism hierarchy: inside the GPU, between GPUs and between machines. On an L2 network, there would be no such hierarchy, just a very large number of nodes. **Each of the nodes would play the role of a "core" in a giant "GPU", which we could justifiably instead call the DPU, a Decentralized Processing Unit.** Think about it: a "GPU" with millions or even *billions* of "cores", running planet-wide neural networks! That's a pretty wild thought (and yeah, pretty far from reality still :D).
## Concluding remarks
Given that today's zk rollups are not networks of zkEVM (or other VM) nodes, but are typically monolithic centralized sequencers, these kinds of thoughts might seem like building castles in the sky. There are however ongoing plans to eventually decentralize the sequencers as well[^zksync_decent_sequencer][^starknet_decent_sequencer]. It's interesting to note that the typical arguments for decentralized sequencers are liveness, censorship resistance and the like, but in my opinion **the most important reason for L2 decentralization is parallel computation**. Once that is implemented, Ethereum and the parallel L2s can finally be justifiably called a World (Super) Computer.
## Acknowledgements
My heartfelt thanks to [@guiltygyoza](https://twitter.com/guiltygyoza), [Fran Algaba](https://twitter.com/franalgaba_), [@Qhuesten](https://twitter.com/qhuesten), [Omar Azhar](https://www.linkedin.com/in/omarazhar/) and the zkSync discord for inspiration and interesting technical zk rollup discussions.
## References
[^cairo]: [Goldberg, Lior, Shahar Papini, and Michael Riabzev. 2021. "Cairo – a Turing-Complete STARK-Friendly CPU Architecture."](https://eprint.iacr.org/2021/1063)
[^reduce_add]: [sodocumentation.net: Parallel reduction (e.g. how to sum an array)](https://sodocumentation.net/cuda/topic/6566/parallel-reduction--e-g--how-to-sum-an-array-)
[^parallel_blockchain]: [Bartoletti, Massimo, Letterio Galletta, and Maurizio Murgia. 2021. "A Theory of Transaction Parallelism in Blockchains." Logical Methods in Computer Science 17, Issue 4 (November).](https://doi.org/10.46298/lmcs-17(4:10)2021)
[^solana_note]: If I'm not mistaken, it's still not possible to asynchronously call subcontracts inside Solana contracts.
[^cuda_toolkit]: [More on how CUDA GPUs work at the CUDA toolkit docs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
[^zksync_decent_sequencer]: [zkSync decentralized sequencer plan](https://docs.zksync.io/userdocs/decentralization/#how-decentralized-is-zksync)
[^starknet_decent_sequencer]: [Starknet decentralized sequencer plan](https://medium.com/starkware/starknet-on-to-the-next-challenge-96a39de7717)