# Slush, a proposal for Fractal scaling

**2023, Edit**: this construction is primarily a thought experiment showing the possibility of linear scaling, in a model where every validium is responsible for its own data. In practice the DA layer will allow the RUs access to each other's data, and it has great scaling properties, even if it is not linear but $\sqrt{N}$ in the total number of light clients. The DA layer in itself is not an RU architecture (it does not specify how RUs should interact). But the DA layer allows this construction to be superseded by a [newer one](https://hackmd.io/@kalmanlajko/BkmTTADai). Some elements of this construction (the bridging, virtual tokens, RU/validium exiting for L2, L3 systems, and even some elements of the consensus mechanism) are still valuable.

## Tl; Dr

The basic question of this document is what fractal scaling will look like, how to make bridging cheap between the different rollups, and how to make the whole fractal system secure for chained validiums.

We start with fractal scaling, then we follow with what bridging will look like between the rollups, first for users, then for smart contracts. Then we discuss the question of validium security: exit strategies and the required DA. We find that the existing solutions are not scalable, and propose our own solution, a common consensus mechanism for the descendant validiums. Then we generalise this solution in two parts: the first generalisation enables a majority of descendant validiums to take control of the parent validium, making exiting easier in case of a malicious parent layer. The second generalisation replaces the L1 PoS consensus mechanism with the new validium consensus (basically enabling the consensus mechanism to handle forking). Following this we shape the tree structure so that the resulting system has fast confirmation time and low data requirements on each node. Finally we summarise our results (TlDr ^2): a method of fractal scaling that enables transaction throughput that is nearly linear, $\mathcal{O}(N/\text{log}(N))$, in the number of nodes, with $\mathcal{O}(\text{log}(N))$ finality time and $\mathcal{O}(\text{log}(N))$ storage requirements.

## Document long outline

- We first summarise fractal scaling of ZK rollups and why there will be a need for bridging between different rollups instead of routing everything through L1.
- We continue with an overview of the current bridging solutions: trustless swapping, authority bridging, routing and pooled routing, and we discuss their drawbacks.
- Bridging
- Then we describe our bridging solution as an extreme form of pooled routing, its benefits, and how different versions of it might be deployed.
- Then we extend this bridging so that smart contracts can use it (i.e. atomicity). This comes in two parts: first we enable smart contracts to confirm the bridging process, and if there is no confirmation we enable them to route the transaction through the parent rollups (this is expensive, but it is only a fallback option).
- Then we look at a toy example of an AMM in this sharded setting.
- Security
- Then we look at the security of rollups/validiums, and how they rely on the L1's security. We discuss two different solutions for secure exiting: the first is broadly how Starkex validium apps do security, the second is how the Starknet rollup does it. We will see that neither of these solutions scales when trying to do iterated validiums.
We then discuss a way of exiting that is scalable, but instead of relying only on the L1 security, this mechanism enables descendant validiums to exit (the parent validiums) if a majority of the descendants are honest.
- Then we generalise this solution in two parts. The first generalisation enables a majority of descendant validiums to take control of the parent validium, so that it is not the large number of descendant validiums that have to leave if the parent layer is malicious, but the single parent layer. This makes economic sense in case there is a conflict between the parent layer and the descendants, as there are fewer parent nodes than descendant nodes. The second generalisation replaces the L1 PoS consensus mechanism with the new validium consensus. For this the consensus mechanism has to be able to handle forking at the L1 level.
- After this we look at the tree structure that makes this system the most efficient in terms of confirmation time (the time it takes for a transaction to be confirmed and irreversible), and in terms of required data storage. We realise two things here: first, it makes sense for the tree to be narrow and deep (each validium should only have a few child validiums), and secondly, the content of each validium (except the child validiums' ZKPs) can be externalised into a separate validium, so that the parent validium contains only ZKPs.
- Finally we summarise our results: a method of fractal scaling that enables nearly linear $\mathcal{O}(N/\text{log}(N))$ transaction throughput in the number of nodes, all relying on a common consensus mechanism.
- As an Appendix we include some calculations.

## Zk Rollups and Current bridging

First we review the existing fractal scaling and bridging landscape.

### Zk Rollups

First let's look at the fractal-like, multiple-layer ZK rollup solution to scaling. In the first image, we have Virtual Machines (rectangles) that output their compressed state as ZKPs to VMs higher in the tree. The ZKPs are verified by ZKP verifier smart contracts on the parent layer.

![](https://i.imgur.com/ydu16VF.png)
*Fig. 1. Basic setup*

When a state update happens, in our image VM 4 is updated with an internal transaction, the ZKP is updated in the hosting group VM, and the hosting VM's ZKP changes as well. This propagates up the tree to the L1, which is VM -1. We show this with the purple colour.

![](https://i.imgur.com/cP2wPKz.png)
*Fig. 2. A simple state update*

In this setting bridging is the act of taking data from one VM to another on a different branch of the tree, without burdening L1. This is necessary as each rollup can be assumed to have roughly constant transaction throughput, and the L1 could not handle the load if everything were routed through it.

### Current Bridging

The area of trustless bridging is saturated with companies like Hop, Celer, deBridge or Connext. First we will broadly categorise the mechanisms of these companies and find that there is a trilemma. Then we briefly explain the way out of the trilemma, its cost, and how this will be used in the ideal bridging solution.

The first option for bridging would be routing your funds from the source rollup up to L1 and then down to the destination rollup. This option is trustless, however if everyone used it, it would overload L1, so this option will always be expensive. The costs can be alleviated if you pool funds with other people going to the same destination. This is pooled routing, and it is essentially how the Hop network does bridging.
When going from L3 or deeper, the different pools can be further pooled on L2. However, this pooling can only be used for applications that have a large user base on both the sending and the receiving rollup.

Alternatively you can go directly between the layers (such as Connext). Although these are substantially faster and cheaper than going through L1, these systems only swap your funds on one chain for funds on another chain, which means there needs to be some other mechanism that compensates for net fund movement between the layers. This incurs additional costs, but more importantly it means that this method actually relies on other solutions (such as pooled routing).

There is another way of going directly between layers, done by an authority that burns and mints the bridged tokens on each layer (such as Wormhole). This authority usually takes the form of a smaller blockchain, and there are security risks in having to trust these small blockchains. The drawbacks of such systems are obvious.

### The trilemma

As we see, we have a trilemma: the systems do not satisfy all three required properties of 1. being trustless, 2. not touching L1, and 3. having real transfers, not just swaps. The reason for this is that if we have real transfers and not swaps, then the funds have to be burned on one layer and minted on another. The minting can only happen if the burning has already happened. Confirming that the burning happened needs either trust in some authority, or a message passed down from L1, as the two layers are only connected trustlessly through L1. This is the reason the trilemma exists.

![](https://i.imgur.com/asucrRj.png)
<em>Fig. 3. The trilemma, choose 2.</em>

#### Disclaimer on Stargate

After writing this, I found [Stargate](https://stargate.finance/). They use a similar idea to the one I will outline, proof of transaction membership based on the block's state hash. What is worse, they use the same language of "bridging trilemma"! However, their trilemma is different, it is about "instant guaranteed finality, unified liquidity, and native asset transactions". They also do bridging between different L1s, and not on the fractal tree. In our classification they are not trustless, and only do swaps. More importantly, the main part of our document is the second, Security part, and our bridging process is optimised for the structures described in that part of the document.

### Virtual tokens vs Bridged tokens

If we want to avoid L1 fees, and do not want to move funds on L1, we can only transact using virtual tokens, similar to Hop protocol's hTokens. But these can be backed on L1 with the real tokens that they represent. These virtual tokens can be burned and minted as necessary. They could over time outcompete the natively bridged tokens on each rollup, as they would be cheaper to transfer to other rollups. However, they would be less secure, as an attack on a single rollup can compromise all the locked funds on L1. So it makes sense to deploy these tokens only on sufficiently trusted rollups.

## Our trustless bridging

### The trilemma’s “solution”

In our bridging we make a different compromise. We can trustlessly confirm that the virtual token was burned on the sender chain if we link the burning together with the L1 state root in a ZKP*. This is enough proof for the destination chain, as it has access to the L1 state root trustlessly, and such a ZKP cannot be faked.
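To make concrete the statement such a ZKP has to establish, here is a minimal sketch (in Python, with purely illustrative helper names) of what the destination chain effectively checks: a chain of Merkle inclusions from the burn receipt on the sender rollup up to the trusted L1 state root. In practice the whole chain is compressed into a single ZKP rather than checked with explicit Merkle paths.

```python
# Sketch only: the recursive inclusion statement behind the bridging proof,
# written out with explicit Merkle paths instead of a ZKP. Names are illustrative.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def climb(leaf_hash: bytes, path: list[tuple[bytes, bool]]) -> bytes:
    """Recompute a Merkle root from a leaf hash and its authentication path;
    the bool says whether the sibling sits on the right."""
    node = leaf_hash
    for sibling, sibling_on_right in path:
        node = h(node, sibling) if sibling_on_right else h(sibling, node)
    return node

def burn_included_in_l1(burn_receipt: bytes,
                        paths_per_layer: list[list[tuple[bytes, bool]]],
                        l1_state_root: bytes) -> bool:
    """Check: burn is in the L_n state, which is in the L_{n-1} state, ... up to L1.
    paths_per_layer[0] proves the burn inside the sender rollup's state root;
    each later path proves the previous root inside the next layer's state."""
    node = h(burn_receipt)
    for i, path in enumerate(paths_per_layer):
        node = climb(node, path)            # state root of the current layer
        if i < len(paths_per_layer) - 1:
            node = h(node)                  # that root is itself a leaf one layer up
    return node == l1_state_root
```

The destination chain only needs the trusted L1 state root to run this check; everything else comes from the sender's branch of the tree.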
Viewed against the trilemma, this is in fact an extreme form of pooled routing, as it effectively pools a bridging transaction not only with other bridging transactions to the destination chain, but with every other transaction happening on the rollup. So when the rollup's ZKP changes on the parent chain, that is the "pooled" transaction that contains the burning. The compromise we make in the trilemma is that the message does not have to pass through L1, we just need indirect inclusion in L1 via the ZKPs. This proving can happen off-chain, which is cheap. The verification has to happen on-chain, but only at the destination, and if that is an n-th layer, computation is cheap there. The L1 hash needs to be passed down from L1, but this is trustless and not a lot of data. With this in mind, we can explain the mechanism of this bridging method.

*Technical Note: This linking via a ZKP can be done as the burning on $L_n$ is included in the $L_n$ state hash via a Merkle tree, and this state hash is included in the $L_{n-1}$ state via the $L_n$'s ZKP. Repeating this iteratively, we can chain the ZKPs together until we reach L1.

### New bridging

Now we get to the interesting part, sending a transaction from VM 3 to VM 2 trustlessly. First the transaction has to be sent on VM 3. This propagates up the tree to the root. Now we can create a ZKP that contains the L1's hash and proves that the transaction happened. We denote this by the red triangle.

![](https://i.imgur.com/tHDSf3z.png)
*Fig. 4. Starting a transaction*

Now we can create a transaction that includes this ZKP. We can send this tx on VM 2. VM 2 has access to the L1's hash, so it can verify that this transaction is valid and update its state accordingly. The transaction has happened, the bridging is complete. We can also produce a yellow triangle ZKP that shows the receiving happened, which we will use later.

![](https://i.imgur.com/G3135eO.png)
*Fig. 5. Receiving a transaction*

So on VM 3 the transaction was sent, and on VM 2 it can be claimed. We have achieved our objectives: this method is trustless, we did not incur L1 costs, and we did actually transfer funds between different layers.

### Summary of Bridging

Here we explained a mechanism for transferring a virtual token, meaning it has to be burned on the sending and minted on the receiving rollup. But of course, any kind of data or even function call can be bridged via this method, and the price is independent of how many other people use this method. Another advantage is that there are no additional L1 fees involved, as the cost of the rollup's ZKP update on the parent is shared with the other users of the rollup. The drawback of this solution is the proof verification, which is a substantial calculation that has to happen on-chain, even if only at the destination layer. And as a final note, this solution needs special smart contracts that verify the ZKPs and mint the tokens.

## Automatic bridging and the sharded VM

This bridging method is great, but it has a weakness: smart contracts can't use it. Ideally, smart contracts could also bridge funds to other rollups; that way they could make function calls and there would be interoperability between the different rollups. Up to this point we do not have this capability. If I bridge funds for myself and don't perform the receiving part of the bridging, then I just lose the funds. Smart contracts need an assurance that the transaction is actually received.
(Incidentally, we do have this assurance when sending transactions to child rollups, as the new ZKP of the child rollup has to consume the transactions sent to it.) This section is about having that assurance even when sending to parent rollups, thereby enabling smart contracts to use this bridging mechanism to send data anywhere in the tree.

There can in general be no such guarantee, as the receiving rollup might have failed and stopped updating. But we can guarantee that if the receiving rollup continues updating, it will have to take the transaction as an input. We guarantee this by routing the transaction through L1. But as we know, that is expensive and does not scale. So we use routing only as a failsafe, and most of the time we will use a shortcut: our direct bridging method. If there are bots that scan the tree and pass the transactions between the rollups, then we can send, receive, and confirm the transactions all with our bridging method. However, if this bridging does not happen, we have to revert to routing. This means that after the transaction is sent on the sender rollup, the rollup has to be frozen until it either confirms that the transaction was received, or until it starts routing the transaction through L1.

### Bridging to solve automatic bridging

Our goal here is that if a smart contract sends a transaction via our previous bridging method to another rollup, then the destination rollup has to receive the transaction. The sender rollup cannot by itself send the transaction over, it needs an external bot to do that. But if the bot does facilitate this information transfer, then it can quite easily verify that the transaction did go through: we just need to bridge back a ZKP confirmation that the transaction was received, using our bridging method!

Continuing with our previous images, after VM 2 receives the transaction, the yellow triangle ZKP can be computed (we did not use this ZKP in the previous step). This ZKP can be sent back to VM 3 for confirmation. This is similar to the previous update when VM 2 received the transaction. Importing this verification unfreezes VM 3's state, so it can process transactions and internal state updates again. This confirmation and freezing is necessary so that funds are not lost: if VM 3 sends something, VM 2 needs to receive it.

![](https://i.imgur.com/wp4ufUL.png)
*Fig. 6. Confirming a transaction*

### Fallback option

What happens, however, if no bot sends the transaction to the receiving chain, or brings the confirmation back? There is already a method that can incentivise rollups to perform operations: [forced operations](https://docs.starkware.co/starkex-v4/starkex-deep-dive/regular-flows/flows-for-off-chain-accounts/forced-operations). This is used by apps using Starkex that might not have censorship resistance; it allows users to send transactions to the ZKP verifier contract on L1, and the prover has to process the transaction in a given timeframe or risk freezing the app. We can use the same method here! If the nodes running the layer do not confirm that a transaction happened, then the only ZKP update possible is the one in which the transaction is pushed up the tree. After this it has to be pushed upward toward the common ancestor of the sender and recipient, and then further forced operations can pull the transaction down to the recipient.
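To summarise the send / confirm / fallback logic above, here is a toy state machine for a single outgoing bridging transaction on the sender rollup (the states and method names are illustrative, not from any implementation):

```python
# Toy sketch of the sender rollup's lifecycle for one outgoing bridging
# transaction, as described above. Illustrative only.
from enum import Enum, auto

class BridgeState(Enum):
    FROZEN_AWAITING_RESULT = auto()   # tx sent, sender state frozen
    CONFIRMED = auto()                # confirmation ZKP imported, state unfrozen
    ROUTED = auto()                   # pushed up the tree via forced operations
    HALTED = auto()                   # the rollup stopped updating instead

class SenderRollup:
    def __init__(self):
        self.state = BridgeState.FROZEN_AWAITING_RESULT

    def import_confirmation(self, confirmation_zkp_valid: bool) -> None:
        """Fast path: a bot brought back the receiving rollup's confirmation ZKP."""
        if self.state is BridgeState.FROZEN_AWAITING_RESULT and confirmation_zkp_valid:
            self.state = BridgeState.CONFIRMED

    def force_route_upwards(self) -> None:
        """Fallback: the only allowed ZKP update pushes the tx towards the common ancestor."""
        if self.state is BridgeState.FROZEN_AWAITING_RESULT:
            self.state = BridgeState.ROUTED

    def fail_forced_operation(self) -> None:
        """If neither happens within the allowed window, the rollup stays frozen for good."""
        if self.state is BridgeState.FROZEN_AWAITING_RESULT:
            self.state = BridgeState.HALTED
```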
In total there are three options: the transaction is confirmed, it is routed via forced operations, or one of the rollups stops updating after it fails to complete a forced operation. This is the failsafe mechanism in case VM 3 and VM 2 have some communication error or one of them is compromised. It corresponds to reverting back to traditional message passing in this setting, which means that we pass the message up towards the root, then we pull it down towards the recipient.

If VM 2 does not receive the ZKP, or VM 3 does not receive the confirmation, then the last two images (Fig. 5. and Fig. 6.) do not happen. This means the transaction was sent but it still needs to be received. In this case VM 3's state is locked, and it can be unlocked by pushing the ZKP of the original transaction upwards towards the parents. The parents then push it upwards as high as necessary, in this case up to VM -1.

![](https://i.imgur.com/wfg1abU.png)
*Fig. 7. Pushing a transaction upwards*

After the transaction has been pushed upwards, the children should only be able to update their state if they pull and process the transaction into their VM.

![](https://i.imgur.com/E14JbVP.png)
*Fig. 8. Pulling a transaction downward*

### Sharded Virtual machine with AMM as an example

With this interoperability property, smart contracts can call and send data to other rollups. It would then make less sense to have multiple AMMs on each rollup than to have a separate rollup for an AMM that can handle transactions from and to all the other rollups. The AMM could be hosted on a censorship-resistant and decentralised rollup. This rollup would communicate with the other rollups via our bridging method, meaning both it and the other rollups need the smart contracts that verify against the state root. Once we have this, calling the AMM would burn one type of virtual token on the sender rollup, the AMM rollup would receive this transaction via a bot, convert the funds to the other type of virtual token, and send this back to the original rollup. Then the new type of virtual token can be minted there.

### Operational methods, incentives

Now we see that it is possible to treat the different rollups as separate shards handling different parts of a single big VM. The passing of messages between the rollups needs to be incentivised for bots, or done by the users themselves. This relies on the open nature of the rollups; sending to closed rollups is still possible if the rollup's host accepts the message.

## Security, and the problem of iterated exits

Security has two parts. On the one hand it means verifying that all the transactions and state updates are correct, at which both ZK rollups and validiums are great because of the ZK proofs. But security also means avoiding censorship by the rollup/validium's operator and being able to continue updating the validium in case the operators stop. This is necessary so that the users' funds aren't locked into the rollup/validium. Satisfying these conditions is non-trivial even for validiums, as we shall see later. Relying only on the L1's security means an L1 node should be able to reclaim funds from the rollup in case of censorship/failure.

Broadly speaking there are two big approaches to solving the second part of rollup/validium security.

- The first is the one employed by Starkex, which focuses on trustlessly exiting the validium in case it fails.
- The second solution is making sure that the rollup's consensus operates in a decentralised, censorship-resistant way, with adequate data availability for L1 nodes to continue the consensus mechanism in case of failure. This is what the Starknet rollup is aiming for.

Up to this point, we could have been talking about bridging between L2s. But we can also do L3s, L4s, etc. This makes execution of computation cheaper but increases security concerns, as iterated rollups mean multiple sequential exits or multiple consensus mechanisms that we have to trust. When thinking about security, we will also consider the security of these layers. With iterated rollups and validiums, security is more and not less important on L2, because if L2 nodes start censoring transactions or stop updating the L2 state, then this ruins the network for all of the L3, L4 rollups/validiums.

### Trustless exits

The first solution is giving the rollups exits similar to Starkex. What this means is that we can exit L2 by making a transaction on L1, and then the L2 will have to process the transaction or freeze. This is a [forced transaction](https://docs.starkware.co/starkex-v4/starkex-deep-dive/regular-flows/flows-for-off-chain-accounts/forced-operations). This is still not perfect, as the L2 might freeze, in which case our funds are lost. This can be solved, however. We can first prove on L1 that we have some funds in the L2. Then the ZKP verifier smart contract on L1 should allow us to withdraw from L2, without there being an L2 state update.

#### Rollups quitting rollups

This exit strategy is even possible for smart contracts. This would mean proving on L1 that a smart contract with all its data was part of L2, and redeploying it on the L1 with its data. With this method we could even exit the L2 not just with any smart contract, but with the ZKP verifier smart contract of an L3. This would mean redeploying the L3 on L1 so that it becomes an L2. This means that child rollups can exit a parent rollup.

![](https://i.imgur.com/ot1YAEV.png)
*Fig. 9. Preparing for exit*

![](https://i.imgur.com/012x0dB.png)
*Fig. 10. Exiting*

![](https://i.imgur.com/JgPrfip.png)
*Fig. 11. Post exit update*

#### Data challenges

There are some technical problems, however, if we apply this same mechanism to L2 validiums. In order to construct the proof for exiting, some amount of data has to be published on L1 so that our proofs can be verified. For example the hashes of the smart contracts might be published with the ZKP of L2 on L1. Then it is easy to exit the L2 with a smart contract, we just need to know all the data of our smart contract to compose the proof. This is unfortunately untenable for validiums, as putting every smart contract's data onto L1 would go against the definition of a validium.

At the other end of the spectrum we can put the minimal amount of data possible onto L1, namely the state hash of the L2. Then, if we have all the data of L2, we can also compose a proof which shows that our smart contract is in fact part of the state hash of the L2, via a Merkle tree. For this to work, we need to know the state of L2, so that we can prove that our smart contract was part of it. So here we have three options: either the validium is only partially trustless compared to L1 (freezing the updates), or we push data to L1 in a way that only scales with L1, or we solve the data availability problem some other way.
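To illustrate the second option, here is a minimal sketch of the Merkle-inclusion exit proof (illustrative only; a real system would verify such a statement inside a ZKP rather than with explicit paths on-chain): given the full L2 state, prove that one contract's data is part of the L2 state hash stored on L1.

```python
# Sketch of an exit proof: prove that one contract's data is included in the
# L2 state hash that is stored on L1. Illustrative only.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Levels of a binary Merkle tree; levels[0] = hashed leaves, levels[-1][0] = root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:                          # duplicate the last node on odd levels
            prev = prev + [prev[-1]]
        levels.append([h(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Authentication path for leaf `index`; the bool says the sibling is on the right."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], sibling > index))
        index //= 2
    return path

def verify(leaf: bytes, path: list[tuple[bytes, bool]], root: bytes) -> bool:
    node = h(leaf)
    for sibling, sibling_on_right in path:
        node = h(node, sibling) if sibling_on_right else h(sibling, node)
    return node == root

# Usage: an exiting contract proves membership in the L2 state hash stored on L1.
l2_state = [b"contract A data", b"contract B data", b"our L3 verifier + its state"]
levels = build_tree(l2_state)
l1_stored_hash = levels[-1][0]
assert verify(l2_state[2], prove(levels, 2), l1_stored_hash)
```

Note that composing the proof requires the whole L2 state, which is exactly the data availability requirement discussed above.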
There is a cheaper way of addressing the DA question, which we will see in the next section when discussing the other solution for L2 validium security. That solution ties the L2 validium's consensus mechanism together with its DA.

### Security via L2 consensus mechanism

Starknet is currently aiming for [decentralising](https://community.starknet.io/t/starknet-decentralization-kicking-off-the-discussion/711/4) the operator role (=[sequencer](https://community.starknet.io/t/decentralized-consensus-potential-candidate-longest-chain/824) and prover) of the rollup. This would enable censorship resistance. [In addition](https://starknet.io/documentation/on-chain-data/) to this, when a new ZKP is pushed to L1, the L2 state difference is also pushed to L1 as calldata. This solves the data availability question for the L2, as the L1 nodes can query the events on L1 for the state differences. If all the L2 nodes fail, then we can reconstruct the L2 state from the past events on L1. This allows the L2 consensus system to be permissionless. It is also relatively cheap, as calldata is cheap on L1.

However, this solution "only works once", it does not scale to iterated rollups: we will again have the data availability problem. If we have an L3 running on the L2, and both the L2 and L3 operators fail at the same time, then either the L3 state diffs also have to be pushed to L1 (in which case the solution worked once, but it does not scale, as we have to push more and more data to L1 as we add more and more layers, meaning we scale with L1's DA capacity), or, if they are only pushed to L2, they are lost along with the L2 events, since only the state is saved on L1. This means that we run into issues of data availability when trying to secure this consensus mechanism for multiple rollups. The similarities with the previous solution are clear: we either have to push the L3's data to L1 as calldata, or we have to find some other way to make the data available to the child rollup nodes.

**What this really means is that with a permissionless consensus mechanism, the question of rollup/validium security is actually the question of data availability.** As we can see, neither pushing to L1 storage nor pushing to L1 calldata is a solution that scales well for iterated validiums. **For iterated validiums, we need to address the DA question for the children validiums head on, without relying on L1.**

### The challenge of iterated Security

As we have seen, there are two options for solving security: pushing data up to L1, or making the data available to the children nodes. Starknet is aiming for a blend of the two solutions, pushing data up to L1 as calldata. This is the data availability solution, as this calldata cannot be used on L1 for security, but due to L1 security the data is available for everyone. However, it inherits a bad property of the "pushing up data" solution, namely that it does not scale well for iterated validiums. But this is necessary if you want to rely only on the L1's security.

Here we can make a tradeoff: we do not need the L2 validium to rely purely on the L1's security. Not relying purely on the L1's security means that the L1 nodes' relationship to the L2 will not be trustless, they will not have access to the L2 state in case of failure. We will instead rely on another consensus mechanism to solve the DA question. This will mean additional trust assumptions in the L2's new consensus mechanism, but if we can solve the security of iterated validiums, then this is worth it.
The nodes we can include in this consensus must include the L2 nodes and the descendant validiums, i.e. L3, L4 nodes. What would be the best way of guaranteeing data availability for them?

#### Challenges

Solving the L2 DA question for the L2 validium nodes themselves would be easy, after all they are L2 nodes, so a PoS consensus mechanism would be enough. The children nodes are harder. Let's sketch the problems.

- The descendant nodes need to receive DA for the L2 state, even though they are not interested in every smart contract in the L2 state.
- The L2 nodes should also be able to update the L2 state without the child nodes' permission.
- If the L2 nodes update but do not send the new data to the descendant nodes, the descendant nodes still need to be able to exit based on the state for which they did have DA.
- This means some evidence of older L2 states should remain on L1 if there are DA problems for the descendant nodes and the child nodes want to use a previous L2 state hash to exit.
- However, we do not want to store all the previous L2 state hashes on L1, as that would cause state bloat.
- The descendant nodes should not be able to exit the L2 using arbitrarily old L2 states, as exiting with an outdated state would revert the exiting validium's own later updates.

We can solve all of this if the descendant nodes have some decision power over the old L2 state hashes on L1. If a majority of them have DA for a state hash, they need to be able to signal this to the L1. We only keep the last state hash for which a majority of them have DA; the older ones should be erased, to save space on L1. But another, more important reason for deleting the older hashes is that the descendant validiums need to have a finality time, after which they can no longer exit L2 using a very old hash and thus revert their state to what it was in that hash. This also means that a child validium's state is considered finalised not when it reaches L1, but when a majority of the descendant nodes have DA for it in a provable way.

The descendant nodes need to get DA sometimes; they can do this before they submit updates. The exits for child nodes only come into play in emergencies, when they try to update and get DA but find that it is not provided. Then they can exit using the old state on L1. It might happen that they do not have DA for the old hash on L1, as a majority of the descendant nodes could have updated since the last time our specific validium did. This is not a problem, as if we have an honest majority among the validium nodes, then at least one of the honest validiums does have the data associated with the current old hash (and will share that data).

#### The mechanism

We can create a consensus mechanism based on these principles. The consensus mechanism works as a normal PoS for the L2 nodes. When a child validium wants to update its ZKP, it needs to ask for the L2 state. If it receives it, then it has DA, and it stores the state in case it needs to exit the L2. After receiving it, it can push the ZKP update. This means that we can tie together ZKP updates and DA: the child nodes will only update their ZKP if they have DA. This allows us to "measure" the DA for descendant nodes, and only consider an L2 hash on L1 accepted if enough descendant nodes have updated onto it. This means that there have to be multiple L2 hashes on L1, the oldest hash being the last one that a majority of descendant nodes have accepted. How does this measurement happen?
Similarly to the L1, on each validium we have to store, for each child ZKP verifier contract, different hashes of the child validium state, some older and some newer. Each of these hashes represents a point in time. (For synchronising time we will use the L1's block number.) For each such hash (which represents a time and the corresponding state on the child validium), we will have a number pair $(a, b)$, where $a$ gives us how many of the descendant nodes have pushed an update since that time, and $b$ gives us the total number of descendant nodes. Here $a$ changes with each child ZKP update, while $b$ is normally constant.

We will calculate the validium's pair based on the child validiums' pairs. For the validium each state corresponds to a time (=L1 block number), similarly to the child validiums. When pushing an update, the corresponding state hash will have acceptance ratio $(0, D+1)$ (where $D$ is the total number of descendants), as no child has pushed a state later than it, and the validium has also not updated after that hash (you get DA of a state after you push the *next* update, so the initial number is 0). After that, each ZKP update changes the acceptance rate of that hash (and new hashes are also added). When the child validiums have pairs $(a_1, b_1), (a_2, b_2), (a_3, b_3)$ for times later than our hash, we calculate the new acceptance rate as $(a_1 + a_2 + a_3 + 1,\ b_1 + b_2 + b_3 + 1 = D+1)$. When this acceptance ratio reaches 50%, we no longer have to store earlier hashes for exits on the validium itself.

![](https://i.imgur.com/hUQ5SEV.png)
*Fig. 12. Measuring descendant acceptance*

(Note: all validiums have equal weight in the above calculation. This means that when deploying a validium, it has to be given permission. Otherwise there could be a sybil attack via creating an arbitrary number of validiums. For simplicity this permission can be granted via a staking mechanism. Other weighting solutions are possible which do not require permission, but those come with different trust requirements, i.e. more trust in nodes that are higher up in the tree.)

What happens when L2 nodes stop sharing data with their descendants, let's say the L3 nodes, assuming that a majority of the L3 nodes are honest? The L3 nodes have access to some previous L2 state, and this state has a hash that is still on L1. The L3 nodes can use this hash to exit the L2, and to move their validium to someplace else in the fractal tree.

With this exit mechanism we will not have data build-up on L1. The data that is required to be stored on L1 are the old hashes of L2. If we assume that the validiums are updating at a constant pace (e.g. 1 block / 15 secs), then the amount of time it takes for a majority of validiums to confirm an update is $\mathcal{O}(d)=\mathcal{O}(\text{log}(N))$ (where $N$ is the total number of validiums and $d$ is the average depth). So the number of hashes stored on L1 will also be proportional to this. Where we will actually have substantial data build-up is at the bottom: each node has to record the data of each of its ancestor validiums, which means $\mathcal{O}(d)$ data at depth $d$ (this data is needed for the exits).

#### Liquid consensus

There can be another way of exiting the L2 validium. After all, if the L2 nodes are malicious, then they ruin the tree for all the L3, L4… validiums. This means that a minority (the L2 nodes) of all the validium nodes ruins the tree for all the descendant nodes. Worse, all the descendants have to exit separately!
It would be much easier for the majority of the descendant nodes if they could simply replace the L2 nodes with new ones. (This means changing some public keys which specify who can push updates onto L1.) For the L2 validium this merely means that the L2 nodes will have to exit with their state on L1, instead of all the descendant nodes.

![](https://i.imgur.com/MmmVwyV.png)
*Fig. 13. Replacing L2 is more efficient than exiting, but the simple method does not work*

In this scenario, if the L3 nodes do not receive the data from the L2 nodes before updating, they need to tell the other L3 validiums. These validiums' nodes verify that DA is compromised. A majority of the L3 nodes need to be able to take "emergency control" of the L2's ZKP. To prevent just anybody from taking "emergency control" we could store each L3 node's public key on L1. This does not scale when iterating, as we would have to store the L4, L5 keys as well. To solve this in a linearly scalable way we can use the restricting factor that the L3 ZKPs cannot be updated by anybody, only by the L3 nodes (or their descendants, if they take emergency control). So here the L3 ZKP verifier smart contracts are not just any smart contracts, they are in a special position in the consensus mechanism of the L2.

So after the L3 nodes realise that the L2 node is not providing DA for the latest L2 state on L1, they take the old L2 state which is on L1 and for which they have DA. They can each push a new ZKP to this L2 state, which "elects" new L2 nodes (changes the public keys). The new L2 nodes (who were just elected) can then receive the old L2 state and the new ZKPs which elect them, do the proving, and push the new L2 ZKP to L1. (Here the L2 hash on L1 is "forked", as the L2 state that was pushed to L1 by the previous nodes (the state that we did not have DA for) was also on L1. However, this L2 state was not accepted by the majority of the descendant nodes, so this is not actually forking: an update is finalised only once a majority of the descendant nodes have accepted it.) For the old L2 nodes security is also guaranteed: after the emergency control happens, they still have the state of the L2 validium, so they can exit with that on L1.

![](https://i.imgur.com/6nqVEIB.png)
*Fig. 14. A successful replacement of L2, circumventing the malicious parent node*

(Note: There is one potential victim here. If every L2 node and the majority of L3 validiums are malicious, then a child node might not have access to the L2 state, and might be left out of the new emergency update as well. In this case their data is lost, and they cannot reclaim it on L1, as they do not have access to the L2 data. This is only a critical factor if we cannot trust the majority of validiums.)

Can this method work for iterated validiums? Yes it can, we can use the same principle for L4 nodes. When both the L3 and L2 nodes are malicious, but the L4 nodes are not, then the L4 nodes can first take control of and update the L4's ZKP on the old L3 state, then the L3's ZKP on the old L2 state, with which they can take control of and update the L2's ZKP on L1, which still has the old L2 state hash.

So with this method, we can turn the secure exit strategy into a consensus system that relies on the majority of validiums being honest. This is a logical consequence of the first proposal: after all, we can think of the measurements as Stakers (the descendant nodes) accepting a state update. The state updates are proposed by a single Proposer (the L2). If the Proposer misbehaves, a new one is chosen.
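To make the acceptance bookkeeping of the last two sections concrete, here is a toy sketch of how a validium could aggregate its children's acceptance pairs for one stored parent state hash (the data layout and names are illustrative; in the real mechanism this rule would be enforced by the ZKP updates themselves):

```python
# Toy sketch of the acceptance measurement: each child subtree reports, for a
# given state hash's timestamp (an L1 block number), a pair (a, b) =
# (descendants that pushed an update since that time, total descendants).
# The parent sums these and adds itself. Illustrative only.
from dataclasses import dataclass

@dataclass
class AcceptancePair:
    updated_since: int   # "a": nodes in the subtree that updated after the hash's time
    total: int           # "b": total nodes in the subtree (incl. the child itself)

def aggregate(children: list[AcceptancePair], self_updated_since: bool) -> AcceptancePair:
    """Acceptance pair of a validium for one of its stored state hashes."""
    a = sum(c.updated_since for c in children) + (1 if self_updated_since else 0)
    b = sum(c.total for c in children) + 1          # +1 for this validium itself
    return AcceptancePair(a, b)

def majority_accepted(pair: AcceptancePair) -> bool:
    """Once a majority has updated past a hash, older hashes can be dropped."""
    return pair.updated_since * 2 > pair.total

# Example: three child subtrees report (2,3), (1,3), (3,3); this validium has
# itself pushed an update since the hash in question.
pair = aggregate([AcceptancePair(2, 3), AcceptancePair(1, 3), AcceptancePair(3, 3)],
                 self_updated_since=True)
print(pair, majority_accepted(pair))   # AcceptancePair(updated_since=7, total=10) True
```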
##### Replacing L1

**Disclaimer**: This part still needs further research. The broad outlines are clear, but the precise mechanism that stops forking quickly still needs careful analysis.

Up to this point we had an underlying L2 with a traditional (PoW/PoS) consensus system. This is fine and compatible with this system: we have the potential for parallel transactions between different validiums in a fractal tree that are efficiently secured by their common consensus mechanism. However, we have two separate consensus mechanisms here, the validium consensus and the PoS of L2. We can actually get rid of the PoS consensus mechanism on each level, since the validium consensus mechanism already guarantees DA, validity is verified by the verifier contracts, and forking is stopped by PoS on L1. This means that a single node can run a single layer, and be verified by all the descendant nodes.

A more ambitious goal is replacing the L1 PoS consensus mechanism with the common validium consensus mechanism. This would enable a single node to run L1, similar to each validium. For this we need to ask: what does the L1 PoS currently provide that the validium consensus does not? The answer is fork prevention: validity is checked on each validium, but we use L1 to stop the forking of L2, and hence indirectly of the deeper layers as well. To replace PoS with the validium consensus, we need to ensure that the L1 node cannot fork the L1 by itself (if a majority of the descendant validiums are also malicious, then forking cannot be avoided).

What we would like is to make this validium consensus mechanism stronger, so that descendant validiums could stop forking at the L1 level. We already have a way to measure whether the descendant validiums accept the parent validium's state, namely that they push a ZKP to it, and on each validium we can measure what percentage of the descendant validiums have accepted the state. However, when discussing forking this is not enough, as the same ZKP can be published to multiple forks. We can stop this, however, as the ZKP can also include a choice of the L1 state hash. When a validium receives ZKPs from its children, each child also gives it an L1 state hash that it accepts as the valid choice of history. If these state hashes are consistent (meaning that they follow each other in the L1 history), then the validium can prove this, and pick the latest one in time. It can then include this latest hash (or perhaps even pick a later one) with its own ZKP and pass that to its parent. When the L1 receives the different ZKPs and the included L1 hashes, it will not be able to include the ZKP in both of the forks, since the included hashes pick exactly one of the forks it can be added to. At this point the L1 will not be able to produce a new valid block, so it is as if the node stopped working.

![](https://i.imgur.com/yVw7hut.png)
*Fig. 15. Forking*

With the traditional L1, it was implicitly guaranteed that there were always new blocks; here that is not the case. Fortunately we already have a method in the validium consensus that enables the descendant nodes to keep updating the system if a validium node stops updating: they take emergency control, elect a new node for the parent validium (remember, each validium can be run by a single node), and the new parent node can publish a new ZKP.
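A toy sketch of this fork-choice rule, under the simplifying assumption that a candidate L1 history is visible as a list of block hashes (in the real mechanism the consistency of the children's hashes would be proven inside the validium's own ZKP):

```python
# Toy sketch: each child ZKP carries an L1 state hash; the parent checks that all
# reported hashes lie on one L1 history and forwards the latest with its own ZKP.
from typing import Optional

def pick_fork_choice(child_hashes: list[bytes],
                     l1_history: list[bytes]) -> Optional[bytes]:
    """Return the latest of the children's L1 hashes if they all lie on the given
    L1 history; otherwise return None (the ZKPs cannot be aggregated onto it)."""
    height = {blk: i for i, blk in enumerate(l1_history)}   # hash -> block height
    if not all(blk in height for blk in child_hashes):
        return None                                         # some child follows another fork
    return max(child_hashes, key=lambda blk: height[blk])   # latest consistent choice
```

Because the chosen hash travels upward inside the ZKPs, the aggregated proof commits to exactly one L1 history, which is what stops the same ZKP from being published on two forks.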
The emergency-control method can also be used here: each node could run a virtual L0 node solely to see whether the L1 node pushes valid updates or not (these updates also have to be accepted by enough descendant nodes, as described in the previous paragraph). If the L1 node does not push updates, then it is considered faulty/malicious, and a new one has to be elected by the L2 nodes (the old L1 node could also stake some tokens, which could be slashed for good measure). If the L2 nodes do not elect an L1 node, they are also considered faulty/malicious, and they need to be re-elected as well. So together the methods of fork choosing, checking for updates, and elections enable the L1 to be run by a single node secured by the common validium consensus mechanism, such that the L1 state stops updating if the L1 node tries to fork it, and if the L1 state stops updating, then the children nodes can elect a new L1 node, and also punish the old one.

In this way, the traditional security model of L1-L2 can be reversed. It is no longer the many L1 nodes that store the data and create the security for rollups (like in Danksharding, where the rollup data is posted to L1); instead the many different validium nodes store and secure their ancestors' data in the tree, and hence create the security for the L1.

## Speed and efficiency and numbers

A quick summary of the current result: we have a fractal scaling solution, each validium running on a single node, with trustless bridging directly between validiums; all of the validiums are secured with a common consensus mechanism, and L1 is also included in this consensus mechanism in such a way that forking is avoided.

What are the most important constraints for each node with this mechanism? We have two. The first is confirmation time, which is the time it takes for an update to be included into the L1 state and for a majority of nodes to confirm that transaction (which means pushing their own ZKPs). The other constraint is the amount of data stored, as for the security of each validium, it has to store the data of each of its ancestor validiums. How do we make this efficient, decrease the confirmation time, and minimise the stored data?

Up to this point, we have not discussed the shape of the tree. For a given number of validiums $N$, there are multiple ways of distributing them into a tree, but broadly speaking there are two big options: we can either have a wide and shallow tree, where each of the validiums has a large number of children but the depth of the tree is small, or we can have a deep and narrow tree, with each validium hosting a small number of other validiums. Current implementations take the wide and shallow option, as we had a relatively big L1 and big L2: security was guaranteed by the L1, so iterated validiums were less secure, and there was no way to bridge funds directly between different validiums, so being close to L1 (where most of the activity happens) made sense. In this new setting these problems do not matter, we have a common consensus mechanism and direct bridging, so position in the tree is less relevant. This enables us to consider the deep and narrow shape for the tree. This is the efficient way to speed up confirmation times for the nodes and to minimise stored data. (Check the Quick estimates for actual numbers.)

Another way of making the parents' state updates quicker is to minimise each validium by emptying out its "own" state (the part of the VM that is not the ZKPs of the children nodes) into a separate ZKP and VM.
In the pictures we show these VMs with circles, while the VMs of the actual validiums, which only host ZKPs, are rectangles. This makes updates quicker, as the whole state does not need to be proved, only a ZKP of it. It also minimises the stored data, as only the ZKP has to be stored, not the state itself. This is already how we depicted the tree, with 4 ZKPs in each group VM, 3 corresponding to the children's VMs and 1 for its own state.

![](https://i.imgur.com/YioiSZi.png)
*Fig. 16. The new layout*

Once the tree has been narrowed down and the sizes of the nodes' states higher up in the tree have been minimised, it is possible for lower nodes to store their ancestor group VMs. This means storing $4 \cdot PS \cdot \text{log}(N) = 4 \cdot PS \cdot d$ data for $N$ nodes in the tree, with the tree being $d$ deep and $PS$ being the average proof size. The confirmation time is also $4 \cdot C \cdot d$, where $C$ is the time to prove the verification of a single ZKP.

## Summary

In the fractal scaling of ZK rollups there will be a need for bridging between different rollups. We started with an overview of what the current solutions are, and their drawbacks. Then we described our bridging solution and its benefits: it has a stable price and can be used to transfer any kind of data. Then we generalised this bridging, giving the sender rollup a guarantee that the transaction is received. This guarantee enables smart contracts to finish the bridging process, instead of only being able to start it. This means bridging can handle function calls between the rollups, which makes the different layers host different shards of a sharded virtual machine, one that fundamentally relies on the security of L1.

Then we looked at the security requirements for fractal scaling, and found that the current solutions scale with the DA of L1. We proposed our own validium consensus mechanism which solves the security problem. Then we extended this so that the failure of a single layer does not mean every descendant node has to exit. Then we extended this consensus mechanism to L1 such that forking is not possible. Finally we looked at the tree structure and ways to make the mechanism more efficient.

## Quick estimates

1. If a tree with $N$ nodes is $w>3$ wide (each node has $w$ children) with depth $d=\text{log}_w(N)$, and we refactor this tree to be $w'=3$ wide, keeping the number of nodes $N'=N$, then the new depth becomes $d'=\text{log}_3(N)=\text{log}_w(N) \cdot \text{log}_3(w)=d \cdot \text{log}_3(w)$. If the proving time is linear in the number of children, then for an update to be processed by every parent node takes $C \cdot w \cdot d$ time (where $C$ is some constant), which is worse than if the tree is narrow, as then this becomes $C \cdot 3 \cdot \text{log}_3(w) \cdot d$ (where we assume that $C$ is the same constant). So it makes sense to make the tree narrow.
2. Also, as we can do transactions in this tree that do not touch L1, and the DA question is solved locally, we can have the "block size" (which we approximate by the total number of new transactions in a given period) growing linearly with the number of nodes (compared to growing with the square root of the number of nodes, as with Celestia). We have this linearity because multiple transactions can happen in parallel between different nodes. This means that parallel transactions can be executed on different hardware. (There are existing attempts at parallelising transactions in Starknet as well, but there it is only a virtual separation, as all of the transactions will be proved by the same prover.)
3. However, there is a downside compared to single-layer blockchains: tx finalisation is not instantaneous.
It takes $4 \cdot C \cdot d$ time to confirm a transaction, where $C$ is the time required to prove the verification of one ZKP. As the sender node cannot send multiple transactions to other rollups in parallel, this time is important.
4. The previous points together give our transaction throughput, which is the number of transactions / time to finality. This is $\mathcal{O}(N/\text{log}(N))$.

FAQ: How does this stop double spending on node 3?

Double spending is stopped by the inner VM logic and the ZKP updates. You can only update the ZKP in the parent node by taking the previous VM state and ZKP proof, verifying that this ZKP proves the previous state, and applying a valid VM tx to the VM state, resulting in the new VM state. Now we can take this whole computation, make a ZKP of it, and update the previous ZKP. This method stops double spending, as only a single valid state transition is accepted on the parent node.
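To spell out this FAQ answer, here is a toy sketch of the parent-side rule (illustrative names, with a boolean standing in for real ZKP verification): the parent only accepts a child ZKP that proves a valid transition from the child's currently committed state hash, so at most one successor state can ever be accepted.

```python
# Toy sketch of the parent-side update rule that prevents double spending.
from dataclasses import dataclass

@dataclass
class TransitionProof:
    prev_state_hash: bytes    # state the proof starts from
    new_state_hash: bytes     # state the proof ends at
    valid: bool               # stands in for actual ZKP verification

class ChildVerifier:
    """The child's ZKP verifier contract as hosted on the parent VM."""
    def __init__(self, genesis_hash: bytes):
        self.committed_hash = genesis_hash

    def update(self, proof: TransitionProof) -> bool:
        if not proof.valid:
            return False                              # not a valid state transition
        if proof.prev_state_hash != self.committed_hash:
            return False                              # builds on a stale or conflicting state
        self.committed_hash = proof.new_state_hash    # exactly one successor is accepted
        return True
```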