Refunds in Async Transactions

### Disclaimer

This document shares our internal in-progress research to gather community feedback. Some sections may be unclear, so feel free to ask questions or suggest edits. Note: this research is exploratory and may not be included in the final zkSharding design. For context, please refer to our [documentation](https://docs.nil.foundation/nil/intro) if you're unfamiliar with zkSharding.

### Motivation

In zkSharding, cross-shard transactions (CSTs) are non-atomic. This means that a transaction may sometimes fail on the destination shard. If the transaction has tokens attached, the tokens need to be returned to the sender. In this document, we try to figure out how to achieve that.

## 0.0 Summary of Recommendations

1. Implement the mailbox to enable (optional) renewed funding of fee tokens and retrying of failed CSTs.
2. Failed base token transfers (not mailboxed, or cancelled from the mailbox) are reverted for free.
3. Unused fee tokens are refunded for free.
    1. Aggregation of fee token refunds is not necessary for an MVP. Its benefits can be determined more clearly once we have usage data from testnets and a better understanding of the proving systems that would need to incorporate refund aggregation.
4. Reversion in general is not supported and is the responsibility of developers. When a CST fails, its error code can be read from the mailbox and acted upon.

[To reduce the number of out-of-gas failures, all CSTs derived from a tx have their gas price fixed according to the gas price in the tx block’s subgraph of shard blocks (shardDAG). However, this is also not necessary for an MVP and depends on other economic considerations.]

## 1.0 Definitions

This document deals with both *refunds* (returning unused gas fee tokens) and *reversions* (undoing failed state changes), which are described below. Both refunds and reversions require creating return/callback CST(s) once a transaction ends.
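
To make the distinction concrete, here is a minimal sketch of which return CSTs could be created once a CST ends, following summary recommendations 2 and 3 (failed base token transfers are reverted for free, unused fee tokens are refunded for free) and ignoring aggregation. It is written in Python for illustration only: the `CST`, `Outcome` and `ReturnCST` types, their fields, and the `return_csts` helper are hypothetical and not part of any zkSharding API. A failed CST that is held in a mailbox returns nothing until it is cancelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CST:
    """Hypothetical, simplified cross-shard transaction (illustrative fields only)."""
    sender: str
    recipient: str
    base_tokens: int  # native tokens being transferred
    fee_tokens: int   # fee tokens attached to pay for gas


@dataclass
class Outcome:
    """Hypothetical result of processing a CST on the destination shard."""
    gas_used: int
    gas_price: int
    succeeded: bool
    mailboxed: bool = False  # failed CST kept in the sender's mailbox for retry


@dataclass
class ReturnCST:
    """Hypothetical return CST sent back to the original sender."""
    to: str
    base_tokens: int
    fee_tokens: int
    kind: str  # "reversion" or "refund"


def return_csts(cst: CST, outcome: Outcome) -> List[ReturnCST]:
    """Return CSTs created once a CST ends (no aggregation, no refund fee)."""
    if not outcome.succeeded and outcome.mailboxed:
        # Funds stay with the mailbox until the CST is retried or cancelled.
        return []
    out: List[ReturnCST] = []
    spent = min(cst.fee_tokens, outcome.gas_used * outcome.gas_price)
    unused = cst.fee_tokens - spent
    if not outcome.succeeded and cst.base_tokens > 0:
        # Reversion: the failed base token transfer goes back to the sender.
        out.append(ReturnCST(cst.sender, cst.base_tokens, 0, "reversion"))
    if unused > 0:
        # Refund: fee tokens not consumed by gas go back to the sender.
        out.append(ReturnCST(cst.sender, 0, unused, "refund"))
    return out
```

With 100 attached fee tokens and 20 spent on gas before a failure, a non-mailboxed transfer of 50 base tokens yields one reversion of 50 base tokens and one refund of 80 fee tokens; a successful CST that consumes all of its fee tokens yields no return CST at all.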

**REFUND**: Return unused gas fee tokens from transactions.

- Small refunds: unused fee tokens are likely to be small amounts.
    - It may cost more to return them than is returned.
    - However, not returning unspent tokens may cause users to underpay (to avoid losing unused tokens) and may therefore increase the tx failure rate.
        - Users may need to overpay to account for CST logic that depends on state that other users also interact with and that therefore cannot be predicted.
- Lots of refunds: almost every tx that creates a CST will require an extra return-fee-token CST at the end of processing, because there will usually be leftover fee tokens.
- Conditions for returning fee tokens:
    1. If the tx completes, return the unspent fee tokens.
    2. If the tx fails/reverts with partial fee usage before reaching a failure condition, return the tokens left unspent after the failure.

**REVERSIONS**: Undo state changes. There are two main categories of reversions:

1. Native token transfers that fail, where the tokens must be returned to the sender's wallet. Failure modes:
    - Insufficient fees to pay for processing on the destination shard. If CST gas prices are fixed via the shardDAG, this can be avoided, because the gas usage of a simple transfer is known before the tx begins processing. However, the decision whether to use shardDAG pricing has not been made.
    - The recipient account receives the tokens and also executes other logic. If the other logic fails/reverts, then perhaps the token transfer should revert too.
    - Anything else?
2. General CST failure causing tx failure. These could perhaps be managed by ‘CST failed’ callbacks, but it is still not clear that this is feasible:
    1. Return an error code by creating a new CST that calls the SC that called the failed CST? Prior CST smart contracts would need (developer-created) error handling, and developers manage reversion. This requires fee tokens/gas, which may not be available.
    2. A sequential aggregation design makes this simpler [[Message Aggregation Scheme](https://www.notion.so/Message-Aggregation-Scheme-7bc4cf2e30a54e1cb5e0bdb97732e44e?pvs=21)]. Without sequential aggregation, if one branch fails it isn't clear how the failed branch knows about the other branches and can then message the failure to them.

## 2.0 Tx/CST Failure modes

Below is a list of possible failure modes:

- Insufficient fee tokens to pay for gas.
    - Fixing the gas price for each shard according to the latest shard gas prices in the tx's shardDAG will help minimize the number of out-of-gas failures.
- Exceeding a block limit.
    - The gas used by a single tx/CST on its own is larger than the block gas limit, so it cannot be processed.
    - The size of the new CSTs created by a single tx/CST on its own is larger than the outbox size limit for a single block, so it cannot be processed.
- Timeout: if a CST remains unprocessed for a long period, it automatically fails.
    - This may become an MEV attack vector, so we may not want this.
- The CST attempts some invalid operation/reverts.
    - In some cases good application design should incorporate reversion. E.g. if a pool is closed for deposits, the receiving smart contract should create a CST to return the failed deposit, not just revert.
    - There may be cases where the application design cannot incorporate its own reversion code.
- [Not in current scope] User-instructed failure via a special ‘cancel pending tx/CST’ transaction, see [Shard Overloading and Load Balancing](https://www.notion.so/Shard-Overloading-and-Load-Balancing-972196cd6fff424ca0cfc13884a8d6e8?pvs=21).
    - These may not need gas, since they can be handled at the protocol level rather than within the VM. However, making them free might mean malicious actors can abuse them.
    - This could consume gas from the pending tx/CST.

## 3.0 Options for returning unused fee tokens

1. Individual refunds: each tx creates a refund CST when its last CST does not create any further CST(s). We would need block validity checks/a ZK proof that shards create these refund CSTs correctly.
    - This means (almost) every tx creates a refund CST, which may not be efficient.
2. Aggregate refunds. There are several options:
    1. Aggregating at the shard block level.
        - Shard blocks aggregate all refunds to each shard s0, s1, s2, …, and send one refund CST to each shard containing all refunds to that shard (no refund CST if there are no refunds to a shard).
        - A master chain block containing 1 shard block for each of $N$ shards would contain up to $N^2$ refund CSTs.
        - We would need block validity checks/a ZK proof that shards create these refund CSTs correctly.
    2. Aggregating at the master chain level.
        - Shard blocks do not contain refund CSTs for the txs that finish within each shard block.
        - The master chain identifies all finished txs in shard blocks and aggregates their refunds, creating one refund CST to each shard s0, s1, s2, …; these refund CSTs are included in the master chain block.
        - A master chain block containing 1 shard block for each of $N$ shards would contain up to $N$ refund CSTs.
    3. A combination of aggregation at both the shard and master chain levels.
        - Shard blocks aggregate all refunds into a single CST whose destination is the master chain. We would need block validity checks/a ZK proof that shards create these refund CSTs correctly.
        - The master chain processes all refund CSTs in shard blocks and aggregates them into one refund CST for each shard s0, s1, s2, ….
        - A master chain block containing 1 shard block for each of $N$ shards would contain up to $2N$ refund CSTs.

    - For options 2 and 3, the per-shard aggregated refund CST could potentially be further aggregated with other base-token state updates for validator rewards, penalties, etc.
    - Aggregation would create many fewer refund CSTs than the per-tx refunds of option 1.
    - Qu: Which aggregation of refunds is beneficial for ZK proof generation?
3. Synchronous transactions that are completed entirely within a single shard (no CSTs) can refund unused fee tokens, as in a non-sharded blockchain.

## 4.0 Retrying Failed Transactions/CSTs

It is not possible to support reversion of transactions in general when a cross-shard transaction (CST) fails/reverts. Instead, we provide a mailbox feature that supports retrying failed CSTs so that transactions can complete. In some cases retrying a CST can eliminate the need for transaction reversion; however, retrying may not be useful for some use cases.

A simple example where retrying is useful, a basic token swap, is shown below. The only satisfactory outcomes of a token swap are:

- all tokens are moved to their destination accounts, or
- the swap fails and no tokens are moved.

However, it is possible that one token transfer completes and the other fails (perhaps due to a lack of fee tokens), which leaves the token swap in an unacceptable state. If one of the CSTs transferring tokens (shown in red) fails, the mailbox allows that CST to be retried so that the swap completes.

![Mailbox](https://hackmd.io/_uploads/Hk3Ve4dRA.jpg)

## 4.1 Mailbox Specification

The ability to retry failed CSTs is supported by a mailbox. The mailbox feature has the following properties:

- It is associated with an account/smart contract, but is optional.
- It contains a list of outgoing CSTs.
- Each CST in the mailbox is in one of the following states:
    1. The CST has been sent, but no response has been received yet.
    2. The CST has been sent, and a failure with an error code has been received.
- Each CST in a mailbox has a counter that counts the number of retries.
- All outgoing CSTs from a mailbox must themselves return a CST to the mailbox that records success or failure (with an error code).
- A failed CST in a mailbox can be retried by adding more fee tokens.
- Options for who gets to retry a failed CST are configured by the smart contract owning the mailbox:
    - The account that originated the initial tx
    - The smart contract that created the mailbox
    - The destination account/smart contract of the failed CST
    - A specific list of accounts
    - Any account
- Options for who (or what) gets to permanently cancel a failed CST are configured by the smart contract owning the mailbox:
    - The account that created the original tx
    - The smart contract that created the mailbox
    - The destination account/smart contract of the failed CST
    - Automatic cancellation after a maximum number of retries
    - Automatic cancellation after a maximum number of blocks has elapsed
    - A specific list of accounts
    - Any account
- Upon cancelling a failed CST:
    - A refund is sent to return the transferred base tokens and the unused fee tokens.
    - The cancel operation may call a predesignated smart contract that implements some other code, probably to revert.
        - This can be used to enforce outcomes in cases where multiple actors are able to attempt retry/cancel operations.

<aside>
⚠️ The mailbox is implemented using precompiles so that it is not expensive. To avoid state bloat, instead of storing the CSTs in the mailbox state, the mailbox may store pointers to CSTs in prior blocks. Each time a CST is retried, the CST is included in a new shard block and the mailbox pointer can be updated. However, it is not clear whether this pointer approach is problematic for ZKPs, particularly because it may require proving a relation to a block in the distant past if a CST is very old when retried.
</aside>

- In a branching transaction that simultaneously spans multiple shards, there may be different mailboxes on different shards.
  CSTs in different shards can interact with mailboxes on other shards to read state, fund CST retries, or cancel failed CSTs.

## 4.2 Mailroom

It may be beneficial to organise mailboxes into a ‘mailroom’:

- Each shard has a ‘mailroom’ containing the mailboxes of all accounts/smart contracts that request one.
- Mailboxes are easily searchable because they are collected together in a known location.

## 4.3 Use Cases Where Mailbox Retries are Insufficient and Reversion is Desired

- The state changes after the failure such that a satisfactory outcome can no longer be obtained. Some transactions are time-sensitive, e.g. an NFT mint failed, and after the failure all NFTs have been minted, so retrying is pointless.
- The mailbox does not solve the ‘train and hotel’ problem discussed during Ethereum’s sharding development. When booking a holiday, either both the train and hotel bookings must succeed or both must fail; it is not acceptable for one to succeed and the other to fail. If the bookings are on different shards and one booking reverts because no tickets/rooms are available, retrying doesn't help.
- The CST can never complete and will always error, so retrying is pointless, e.g. it consumes more gas than the maximum gas per block.
- The user changes their mind and doesn't want to retry; they would rather revert.

## 4.4 Potential Mailbox Exploits

Developers need to be careful not to create situations in which retrying allows exploits to occur. In particular, developers should note that:

- State can change between retries, so a retried CST may produce a different result from the one the original CST would have produced had it been processed successfully.
- If exploits are possible, malicious users may deliberately underfund transactions, hoping for CSTs to fail and expose the exploit.

## 5.0 How to Revert prior Tx/CST state changes

<aside>
💡 This may require some input from the async programming design, especially regarding whether CST state changes are committed part way through tx processing or after tx completion.
</aside>

- In general, it is not possible to automagically revert a CST state update.
    - Subsequent state updates created by other txs might be based on the CST state update to be reverted, meaning that a more general rollback, which also reverts those other transactions, would be required.
- The protocol allows special failure callback CSTs to be created. However, developers must design smart contracts to deal with reversion via failure callback CST messages. At this stage it is not clear how developers would structure this.
    - One option is that important smart contract state is not committed within the contract until a final ‘tx complete’ CST is received, but how can the complete status be synchronized across shards? There would need to be a consensus mechanism between contracts on different shards. Simple example: contract *a* on shard A waits until contract *b* on shard B completes and B sends a ‘complete’ message back to A. When *a* receives the complete message from *b*, *a* may commit its state update. However, if message delivery is not guaranteed, *b* does not know whether *a* has received the message and committed its state update, so *b* cannot commit its own state update and must wait for another message from *a*, and so on. Further, *a* doesn't actually know whether *b* will commit its state update, because *a* knows that *b* has no confirmation, so *a* may not want to commit its own state update either. This is a common knowledge/consensus problem; to solve it, each app would need its own consensus mechanism.
- If reversion is complex and requires significant gas, it probably requires creating a new tx with sufficient gas. It is not clear how this can be automated.
- For some applications it may be possible to allow ‘refunds’ as separate txs, like buying something from a shop and returning it under conditions such as returning it within a specified time frame or the item being faulty/damaged. This functionality would be created by developers.
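
The developer-side handling described above can be sketched as a compensating action triggered by a failure callback CST. This is essentially the saga/compensation pattern from distributed systems, used here as an analogy rather than as the zkSharding design; `DepositContract`, `FailureCallback` and the handler names are hypothetical, and emitting the outgoing CST is only simulated by returning a request id. The sketch assumes a success or failure callback is always delivered, as the mailbox specification requires for mailboxed CSTs.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FailureCallback:
    """Hypothetical failure callback CST returned to the sending contract."""
    request_id: int
    error_code: int


@dataclass
class DepositContract:
    """Hypothetical contract on shard A that deposits user funds into a pool
    on shard B via a CST and compensates locally if that CST fails."""
    balances: Dict[str, int]
    pending: Dict[int, Tuple[str, int]] = field(default_factory=dict)  # id -> (user, amount)
    errors: Dict[int, int] = field(default_factory=dict)  # id -> error code
    next_id: int = 0

    def deposit_to_remote_pool(self, user: str, amount: int) -> int:
        """Debit the user now and remember the request; a real contract would
        also emit the outgoing CST here (simulated by returning its id)."""
        if self.balances.get(user, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[user] -= amount
        request_id = self.next_id
        self.next_id += 1
        self.pending[request_id] = (user, amount)
        return request_id

    def on_success(self, request_id: int) -> None:
        """Success callback: the remote deposit happened, nothing to undo."""
        self.pending.pop(request_id)

    def on_failure(self, callback: FailureCallback) -> None:
        """Failure callback: run the developer-written compensation."""
        user, amount = self.pending.pop(callback.request_id)
        self.errors[callback.request_id] = callback.error_code
        self.balances[user] += amount  # re-credit the user on shard A
```

The compensation only undoes this contract's own state change. It cannot roll back other txs that have already built on the intermediate state, and it avoids the common knowledge problem by committing locally and compensating on failure rather than waiting for a mutual ‘complete’ agreement, so the intermediate state is visible to other users in the meantime.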

It is not recommended to have any significant protocol-level automatic reversion features, because there is no generally applicable approach to supporting reversion. Protocol support is limited to failed CSTs returning a failure callback CST to notify the failure; it is the job of developers to create functionality to deal with it.

## 6.0 How are refunds/reversions paid for?

Options:

1. Use unspent fee tokens if they are available.
    1. What if there are insufficient unspent fee tokens? We do not want users to be scared of conducting txs because of the risk of not being able to refund/revert. There doesn't seem to be any satisfactory approach here: if we make it free, malicious actors can abuse it.
2. Refunds are free; they are part of the protocol state updates for consensus rewards, penalties, etc.
    - Could some unused fee tokens be burned to penalise deliberate/malicious failure?
    - When refunds are free, the concern is that a malicious actor could cause congestion by spamming many small (cheap) token transfers that deliberately fail. However, the attack still has a cost, so it is not free.
    - We could deduct a refund fee from the available fee tokens, or potentially from the returned tokens themselves.
    - We could also charge a base tx gas (in addition to the gas used by the tx) that pays for refunding unused gas tokens.
3. Attach distinct ‘reversion fee tokens’ that are only used if failure occurs (otherwise they are refunded) and that pay for reversion CST processing.

**I recommend:**

- **free refunds of unused gas tokens.**
- **free reversion of base token transfers.**

## Notes

- I don't think we want to support reversion at the protocol level, but if we do, state updates for CSTs that could be reverted should probably not be posted to L1. However, these will be mixed up with completed tx state updates, and it may be difficult/impossible to untangle them.
- The ‘train and hotel problem’ has been referred to in previous Ethereum sharding research.
  It refers to booking both a hotel and a train for a trip, each on a separate shard; if either booking fails, both bookings should fail. The hotel and train booking SCs are distinct dapps that don't know about each other and aren't coordinated. Past Ethereum research refers to the notion of ‘yanking’ to move an SC between shards [[Cross-shard contract yanking](https://ethresear.ch/t/cross-shard-contract-yanking/1450)].

---

**Links**

- [Website](https://nil.foundation/)
- [Documentation](https://docs.nil.foundation/)
- [Our other research on HackMD](https://hackmd.io/MPkGZS0NSBGZstB6FsvNtA)

**Connect**

- Email: [research@nil.foundation](mailto:research@nil.foundation?subject=HackMD_Test)
- Twitter: [@nil_foundation](https://x.com/nil_foundation)