Heikki Arponen
# The Decentralized Processing Unit

## Abstract

Currently all zkEVM implementations and zk rollups are single threaded and do not support asynchronous computation. I'm arguing that in order for Ethereum and rollups to be considered a "World (Super) Computer" able to perform efficient general computation, there must be a mechanism allowing asynchronous execution of code and parallel computing across nodes. This would make the rollup/zkEVM network a "giant GPU", or a DPU: a Decentralized Processing Unit.

## Ethereum - the World Computer?

One often hears that Ethereum is a "world computer", but what does that actually mean? Could we somehow do arbitrary general computation on the Ethereum network? Currently the answer is a "limited yes", but running efficient parallel computation is a resounding "no". I'm going to investigate the possibility of generalized parallel computation in the context of neural networks.

Suppose we had a simple neural network $f$ that takes in a 256x256 RGB image $x$, i.e. an array of shape `[256, 256, 3]`, and spits out a number $y$. This neural network is also a function of some set of parameters $w$, which we assume to be adjusted such that the output $y$ is positive for every image $x$ that contains a hotdog, and negative otherwise. In other words,

$$
y = f(x, w) > 0 \ \ \ \text{if} \ \ \ x = \text{"hotdog"}.
$$

![Hotdog](https://i.imgur.com/9WikEAU.png)
Hotdog - Input image.

Now suppose we have a smart contract on the Ethereum network, which we can call with the input image and a transaction fee. A "smart contract" can simply be understood as a program which calls the above function $f$ "on the Ethereum network". What happens then is visualized in the image below:

![](https://i.imgur.com/VGZjC4n.jpg)

The user sends the hotdog image to an Ethereum node (Node~1~), which crunches the numbers and outputs $y$. The result and the transaction (green) are composed into a "block" together with other transactions. We can assume that Node~1~ is selected as the block proposer. Because of the blockchain consensus mechanism, this block will be sent to all the other nodes for validation. This means that **all the transactions in the block, including the neural network function call, will be re-computed on all the participating consensus nodes**. In other words, the computational capacity or throughput of the network will not increase at all with the number of nodes participating in the network!

From the point of view of parallel/distributed computation, this is of course completely unacceptable. One would expect the computational throughput to increase in proportion to the number of nodes, but this simply cannot be the case in a blockchain: the security of a trustless network requires that each and every transaction is validated on the other nodes, in order to ensure that the distributed ledger containing all the address balances is kept correctly in sync. In fact, this transaction validation procedure has already been seen to be quite inefficient even in "traditional" applications, such as DeFi and NFTs, with Ethereum fees sometimes surging to tens of dollars. It would seem like the blockchain is an utterly useless tool for general computation.

## Enter rollups

Luckily there is an amazing fix to the situation. In fact there are many, but I'll focus on what I personally think is the most elegant one: the zero knowledge rollup.

The idea is quite ingenious: the node that processes a batch of transactions does not submit all the transactions to be validated by all the other nodes, but instead computes a mathematical proof $P$ (typically a SNARK or STARK) out of all the transactions, and submits this to be verified by the other nodes. The broad idea is that **the proof is easy to verify by the other nodes**.

How is this possible? I'm not going to go into the details of proof systems (you can read more from e.g. the zkSync or Starknet[^cairo] web pages or whitepapers), but here's a somewhat contrived, completely non-cryptographic example: suppose the computation task $f$ is "find the positive root of the polynomial $x^2 - 0.41\cdot x - 1.41 = 0$", where $f$ would then use e.g. Newton's method to iterate several steps until the value is close enough to the solution, $x \approx 1.41$. Note that *finding* the root takes several steps, but *verifying* that the solution is $x \approx 1.41$ is simple - just substitute the solution into the equation! S\*ARKs "arithmetize" general programs into similar cryptographic "proof" equations, which can then be verified easily.
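To make the find-vs-verify asymmetry concrete, here's a tiny Python sketch (purely illustrative; the function names, starting point and tolerance are arbitrary choices of mine, and this of course has nothing to do with real cryptographic proof systems):

```python
def f(x):
    # The polynomial from the example: x^2 - 0.41*x - 1.41
    return x**2 - 0.41 * x - 1.41

def find_root(x0=1.0, steps=20):
    """'Prover': iterate Newton's method x <- x - f(x)/f'(x), many steps of work."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / (2 * x - 0.41)
    return x

def verify_root(x, tol=1e-6):
    """'Verifier': a single substitution into the polynomial."""
    return abs(f(x)) < tol

root = find_root()                # several iterations to find ~1.41
print(root, verify_root(root))    # verifying is just one evaluation
```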
I've visualized what happens in the following image:

![](https://i.imgur.com/O2gwL3x.jpg)

Compared to the L1/base layer case in the previous visualization, now Node~1~ is an L2 zk-rollup node. It outputs a proof "P", and submits this proof for validation by the other nodes in the Ethereum L1 network. Alternatively, the entire network is composed of L2 nodes. Normally, if all the nodes needed to validate the transactions, the validation time would be the same as on Node~1~, i.e. proportional to the number of operations required to process the entire batch. Let's call that number $T$. For e.g. a STARK, the verification time is instead proportional to $\log(T)^2$[^cairo]. This means that **as more and more computations are added into the batch, the cost of verifying becomes negligible**, because $\log(T)^2$ grows much more slowly than $T$. This means we get both scalability and security in the Ethereum+L2 network!

There is however a bit of a problem if we return to the neural network example above: basically all the EVM and corresponding L2 virtual machines (zkEVM, Cairo VM etc.) are **single threaded**, i.e. all the computation inside the VMs is done sequentially, or in a **serial** fashion. Neural networks and many other computations however benefit a lot from **parallel** computing. You might have heard that NNs run much faster on GPUs. Let's dig a little deeper into the reasons why this is so.

## Parallelism

Let's be a little more explicit about our neural network $f$. Suppose we "flatten" the input image into a vector of length $N$. We'll also assume the net consists of a simple linear map and a sum operation, like this:

$$
y = f(x; W) \doteq \text{sum} \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1N}\\ W_{21} & \ddots \\ \vdots & & & W_{MN} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}
$$

The matrix $W$ has shape $[M, N]$, which means that the output of the matrix-vector product has shape $[M]$. The $\text{sum}$ operation then means that we simply sum all the elements, which results in the scalar value $y$. We assume that the matrix elements ("weights") $W_{ij}$ are adjusted so that hotdog images result in a positive number and images without a sufficient amount of hotdogness result in a negative one.
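Just to pin the notation down, here's the same computation as a few lines of NumPy (the shapes and random values below are placeholders of mine, not a real trained model):

```python
import numpy as np

N = 256 * 256 * 3            # length of the flattened 256x256 RGB image
M = 128                      # number of rows in W (arbitrary choice here)

rng = np.random.default_rng(0)
x = rng.random(N)            # stand-in for the flattened input image
W = rng.normal(size=(M, N))  # stand-in for the trained weights

y = np.sum(W @ x)            # matrix-vector product, then sum over the M outputs
print("hotdog!" if y > 0 else "not hotdog")
```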
This is about the simplest kind of a "neural network" one can imagine, and it would not work very well in real life (although one *could* actually express a single convolutional layer like this). Running this kind of a computation for given $x$ and $W$ goes like this:

1) Multiply the vector $[W_{11}, \cdots, W_{1N}]$ elementwise with $[x_{1}, \cdots, x_{N}]$
2) Sum over the resulting vector, define that as $z_1$
3) Repeat the above for all remaining $M-1$ rows => vector $[z_{1}, \cdots, z_{M}]$
4) Sum over this vector, resulting in $y = \text{sum}([z_{1}, \cdots, z_{M}])$

And that's the final output. Doing all of these steps in a serial fashion on a single thread means doing $N$ computations at steps 1) and 2), repeating these a total of $M$ times, and in the end doing one more summation in $M$ steps. In total, **this is an $O(MN)$ time complexity operation**.

### Parallelism on GPU

Things are a bit different on a GPU. Typically a GPU has _thousands_ of little cores, and can run even millions of threads concurrently in a SIMT (Single Instruction Multiple Threads) fashion[^cuda_toolkit]. If we assume that we have lots of cores, we could do some pretty extreme parallelization like this (we assume both $x$ and $W$ have been copied to the GPU's internal "shared memory"):

1) Multiply each $W_{ij}$ with $x_j$ on *different* cores in parallel for all $i, j$
2) Compute $z_i \doteq \sum_j W_{ij} x_j$ on each core $i=1, \ldots, M$ in parallel
3) Compute the sum $y = \text{sum}([z_{1}, \cdots, z_{M}])$

Step 1) is an $O(1)$ operation. The summation on all $M$ cores in step 2) might seem like an $O(N)$ operation, but in fact it could be done in $O(\log_2(N))$ if we did really extreme parallelization by doing a "reduce add"[^reduce_add]. So if $M \approx N$, we'd have total $O(\log_2(N))$ time complexity, compared to $O(MN)$! I've intentionally neglected various issues with e.g. copying values in memory, memory cost etc., so in reality the improvement wouldn't be this radical, but this is the broad point which bears repeating: **parallelization reduces the problem from $O(MN)$ to $O(\log(N))$** in the ideal case! Even the more pessimistic $O(MN) \to O(M)$ improvement would be a significant boost in speed.
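To see where the $\log_2(N)$ comes from, here's a sketch of the tree-style "reduce add" in plain NumPy. Each iteration of the loop stands for one parallel step in which all the pairwise additions would run simultaneously on different cores (a real CUDA reduction is more involved, see the footnote above):

```python
import numpy as np

def tree_sum(z):
    """Pairwise ('reduce add') summation: O(log2(len(z))) parallel steps,
    assuming each pairwise addition within a step runs on its own core."""
    z = np.asarray(z, dtype=float)
    steps = 0
    while len(z) > 1:
        if len(z) % 2:                 # pad odd lengths with a zero
            z = np.append(z, 0.0)
        z = z[0::2] + z[1::2]          # all pairs added "in parallel"
        steps += 1
    return z[0], steps

total, steps = tree_sum(np.arange(1024))
print(total, steps)   # 523776.0 in 10 parallel steps for N = 1024
```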
### Parallelism on L2

Now let's think of the problem in a smart contract context. Assume we're running Python/Vyper on a single-threaded zkEVM, but instead of a single monolithic rollup, **let's assume there's an entire network of zkEVM nodes**. For comparison, here's a serial Vyper/Python pseudocode of the above:

{%gist harpone/c5fe6241960f8a84d0d91f919b62be04 %}

This is just a for loop over the rows of $W$. `dot` could be an external smart contract implementing the vector-vector dot product (although it might make more sense to have a separate smart contract corresponding to each of the rows of $W$). Then after the loop, we do a final sum (which could also be an external contract call).

Now let's imagine a parallel version:

{%gist harpone/dceebd22135c2149ab78557f80f6e313 %}

I'm using a completely fictitious `contractpool` library here, similar to Python's `multiprocessing` library. The main process still runs on a single thread at the `p.map`, but the difference is that the execution will *not* wait for `dot` to finish processing one row of $W$ with $x$ before moving on. Instead, the different `dot` smart contract instances will write to `zs` asynchronously, and will typically finish in any possible order. Then we add a `p.sync()` command, which will wait until all the subcontracts have finished before continuing with executing the main contract. This is semantically similar to how SIMT style parallelization is done on NVIDIA GPUs with CUDA C.
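Since `contractpool` is fictitious, here's what the same map-then-sync pattern looks like with Python's standard `multiprocessing` library, purely as an analogy (worker processes stand in for the `dot` subcontracts; the sizes are arbitrary):

```python
import numpy as np
from multiprocessing import Pool

def dot(args):
    # Stand-in for the `dot` subcontract: one row of W times x
    w_row, x = args
    return float(np.dot(w_row, x))

if __name__ == "__main__":
    M, N = 8, 16
    rng = np.random.default_rng(0)
    W, x = rng.normal(size=(M, N)), rng.normal(size=N)

    with Pool(processes=4) as p:
        result = p.map_async(dot, [(W[i], x) for i in range(M)])  # dispatch rows, don't wait
        result.wait()       # analogous to p.sync(): block until every worker is done
        zs = result.get()   # results come back in row order, regardless of finish order

    print(sum(zs))
```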
Nothing like this contract-level parallelism could of course be possible on L1, because the more or less random completion order of the `p.map` operation would result in the validating nodes doing the computation in a different order. Note however that the end result is independent of the order, so the asynchronicity shouldn't be a problem. Indeed, there has been some research into enabling parallelization in blockchains, at least in cases where operations can be swapped without changing the end result[^parallel_blockchain].

On L2 however, something like this might be feasible. The execution order doesn't really matter, as long as valid proofs are generated. This kind of parallelization would probably also help with proof generation, if all the subcontracts generate their individual sub-proofs, which are then combined into a final proof. Nowadays the typical proof generation happens after a large number of transactions have been rolled up, and can take a significant amount of time. The above parallel implementation would basically amount to recursive proofs, which are being implemented in at least [zkSync](https://blog.matter-labs.io/zksync-v1-1-reddit-edition-recursion-up-to-3-000-tps-subscriptions-and-more-fea668b5b0ff) and [Starknet](https://medium.com/starkware/recursive-starks-78f8dd401025).

Note that it would be possible to parallelize the above example on the client side with e.g. a simple JS script, by simply defining $M$ smart contracts, feeding $x$ to each contract, and doing the final summation on the client side. This is essentially what @guiltygyoza did with [Cairo on Starknet](https://github.com/guiltygyoza/tiny-dnn-on-starknet). This approach however has a serious downside: NNs are of course typically quite _deep_ (whence the "deep learning"), with even hundreds of layers. One would then need lots of back and forth communication between the client and the network, and of course much of the NN logic would need to be implemented on the client side. Also, it would not be possible to reach a similar depth of parallelization as in the parallel subcontract scheme.

Then how fast would this actually be in real life? Well, it depends. Mostly on the network latency, that is. We can try to do some very simple back of the envelope computations to estimate the parallelization efficiency. Let's consider a single operation (OP), which could be for example a vector-vector dot product or even a single floating point op (FLOP), and denote the time it takes to compute this on a single zkEVM thread as $T_{exe}$. Suppose $T_{lat}$ is the average back and forth communication time between two L2 nodes (remember we're now assuming a large _network_ of L2 nodes instead of a single monolithic rollup node). Then doing $N$ such OPs serially will take a total of $T_{exe} \cdot N$ seconds, while doing them in parallel will take $T_{lat} + T_{exe}$ seconds. Therefore, if $T_{lat} + T_{exe} < T_{exe} \cdot N$, it will be faster to do these $N$ OPs in parallel. In other words, as we increase the number of OPs, i.e. the size of the NN, this inequality is eventually guaranteed to be fulfilled, which means that it will be beneficial to parallelize the computation no matter what the latency is! We can also write the inequality as $N > 1 + T_{lat} / T_{exe}$.

Suppose then for example that $T_{lat} = T_{exe} = 10\text{ ms}$. This means the break-even point is already at $N = 2$: if the NN takes longer than $20\text{ ms}$ on a single zkEVM node, it would be beneficial to split it into two or more parts when the network latency is $10\text{ ms}$. This assumes the NN is completely parallelizable, which is not the case, but these are just rough estimates anyway. In reality this would of course require a fast L2 network of nodes, combined with an efficient routing protocol and message passing interfaces, robustness to nodes entering/exiting the network and so on, which is probably beyond what current L2s can do. But theoretically such a network would not be that different from what happens inside a GPU.
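The same back-of-the-envelope estimate in a few lines of Python, using the $10\text{ ms}$ figures assumed above (idealized, fully parallelizable case):

```python
def speedup(n_ops, t_exe=0.010, t_lat=0.010):
    """Serial time vs. idealized fully parallel time for n_ops operations."""
    serial = n_ops * t_exe        # T_exe * N
    parallel = t_lat + t_exe      # T_lat + T_exe
    return serial / parallel

for n in (1, 2, 4, 100):
    print(n, round(speedup(n), 1))   # 0.5, 1.0, 2.0, 50.0
# break-even at N = 1 + T_lat / T_exe = 2; above that, parallel wins
```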
Now you might be wondering, why not use GPUs in each of the nodes if they are indeed so fast (this is similar to what e.g. [Solana is doing](https://medium.com/solana-labs/sealevel-parallel-processing-thousands-of-smart-contracts-d814b378192))[^solana_note]? Yes, that would be possible, but what happens when your node runs out of GPU cores? Then you would need to parallelize the model between *two* or more GPUs. And that's basically the operating principle in e.g. Deep Learning model training on cloud clusters: each cloud machine has for example 8 GPUs, and one can typically use even hundreds of such machines to train a single large NN. Parallelizing such a task in DL training is usually not overly complicated, because typically it is simply the _data_ that is parallelized, i.e. the NN fits nicely on a single GPU, and one feeds large batches of images into the model on each GPU device. But [_model_ parallelism is much more complicated](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html), with at least three levels of parallelism hierarchy: inside the GPU, between GPUs and between machines. On an L2 network, there would be no such hierarchy, just a very large number of nodes. **Each of the nodes would play the role of a "core" in a giant "GPU", which we could justifiably instead call the DPU, a Decentralized Processing Unit.** Think about it: a "GPU" with millions or even *billions* of "cores", running planet-wide neural networks! That's a pretty wild thought (and yeah, pretty far from reality still :D).

## Concluding remarks

Given that today's zk rollups are not networks of zkEVM (or other VM) nodes, but are typically monolithic centralized sequencers, these kinds of thoughts might seem like painting cloud castles in the sky. There are however ongoing plans to eventually decentralize the sequencers as well[^zksync_decent_sequencer][^starknet_decent_sequencer]. It's interesting to note that the typical arguments for decentralized sequencers are liveness, censorship resistance and other such reasons, but in my opinion **the most important reason for L2 decentralization is parallel computation**. Once that is implemented, Ethereum and its parallel L2s can finally be justifiably called a World (Super) Computer.

## Acknowledgements

My heartfelt thanks to [@guiltygyoza](https://twitter.com/guiltygyoza), [Fran Algaba](https://twitter.com/franalgaba_), [@Qhuesten](https://twitter.com/qhuesten), [Omar Azhar](https://www.linkedin.com/in/omarazhar/) and the zkSync discord for inspiration and interesting technical zk rollup discussions.

## References

[^cairo]: [Goldberg, Lior, Shahar Papini, and Michael Riabzev. n.d. "Cairo – a Turing-Complete STARK-Friendly CPU Architecture."](https://eprint.iacr.org/2021/1063)
[^reduce_add]: [sodocumentation.net: Parallel reduction (e.g. how to sum an array)](https://sodocumentation.net/cuda/topic/6566/parallel-reduction--e-g--how-to-sum-an-array-)
[^parallel_blockchain]: [Bartoletti, Massimo, Letterio Galletta, and Maurizio Murgia. 2021. "A Theory of Transaction Parallelism in Blockchains." Logical Methods in Computer Science 17, Issue 4 (November).](https://doi.org/10.46298/lmcs-17(4:10)2021)
[^solana_note]: If I'm not mistaken, it's still not possible to asynchronously call subcontracts inside Solana contracts.
[^cuda_toolkit]: [More on how CUDA GPUs work at the CUDA toolkit docs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
[^zksync_decent_sequencer]: [zkSync decentralized sequencer plan](https://docs.zksync.io/userdocs/decentralization/#how-decentralized-is-zksync)
[^starknet_decent_sequencer]: [Starknet decentralized sequencer plan](https://medium.com/starkware/starknet-on-to-the-next-challenge-96a39de7717)
