Authors: Bhargav, Jeff and Alistair (@bhargav:web3.foundation)

## Problem at hand: Parachain Censorship

The relay chain only ensures that parachain blocks do not violate safety; it gives no guarantees on liveness or censorship resistance at the parachain level. With the proposal to move several core functionalities of the relay chain, such as Balances, Staking and Governance, onto a system parachain (Polkadot-Hub), it is crucial for Polkadot's security to explore how we can achieve censorship resistance for parachains.

For example, assume governance is moved to a system parachain. If a runtime upgrade of the relay chain is up for vote, malicious collators can systematically censor aye-votes to sway the outcome of the referendum. This is similarly significant for Asset-Hub/Plaza.

## Threat Model

We consider the setting with **at least one honest collator**. It is hard to achieve censorship resistance without fixing the parachain's block production mechanism. Here, we explore how to make parachains running AURA censorship resistant. Note that all system parachains use AURA via the Aura-Ext pallet.

We assume the backer backing the block being censored (by collators) is honest, else the block will always get censored. This is a reasonable assumption given 2/3rd honesty on the relay chain and the random assignment of backers. We also assume that the availability layer is robust and that a collator can fetch the latest parablock directly from the availability layer (or from the backer) in a reasonable time.

## Making AURA Censorship Resistant

### Current Checks by the PVF

The two points where the AURA block authorship rights are checked are:

1. `on_state_proof` in the ConsensusHook of the [Aura-Extension](https://github.com/paritytech/polkadot-sdk/blob/8730f3c2fa1d36161fccdc6a318e175eda459d0f/cumulus/pallets/aura-ext/src/consensus_hook.rs#L73) pallet:
   - Check that the relay slot of the parablock being built is $\geq$ the relay slot of the latest block included.
   - Check that the velocity condition is not violated, i.e., the number of parablocks per relay slot is less than the velocity.
   - Check that the parablock timestamp is not too far in the future, i.e., the parablock timestamp is $\leq$ the expected timestamp derived from the relay chain (i.e., relay_slot * relay_slot_duration / para_slot_duration).
2. The current slot should only be increasing in the [Aura pallet](https://github.com/paritytech/polkadot-sdk/blob/8730f3c2fa1d36161fccdc6a318e175eda459d0f/substrate/frame/aura/src/lib.rs#L128C6-L128C19).

### Concrete Attack

Assume we have 3 collators A, B and C assigned to consecutive slots. A and C want to censor B, i.e., prevent B's block from getting backed. A produces a block and feeds it to the backers but selectively withholds it from B. Then C produces its block and gets it in before B can fetch A's block by recovering the data from the availability layer.

## Multi-Slot Collations

We let the round-robin rotate every `x` slots instead of every slot. Once a collator is selected, it is responsible for building `x` consecutive blocks, and the AURA authorship checks that are part of the PVF have to reflect this change (see the sketch below). We choose `x` large enough such that a collator being censored has enough time to recover the previous blocks and build on them, ensuring a block gets backed in at least 1 of the $x$ slots. As in our previous setting with collators $A, B, C$: if collator $A$ does not share with $B$, then for $0 \le i < x$, $B$ recovers and imports the blocks by $A$ for slot $A+i$ before its own slot $B+i$, so $B$ can produce a block by its slot $B+x$.
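As an illustration of the intended rule change, here is a minimal Rust sketch of a round-robin that rotates every `x` slots. The constant `SLOTS_PER_ROTATION` and the function names are hypothetical and do not correspond to the existing Aura/Aura-Ext pallet code.

```rust
/// A minimal sketch of an AURA round-robin that rotates every
/// `SLOTS_PER_ROTATION` (= x) slots instead of every slot. Hypothetical names,
/// not the existing Aura/Aura-Ext code.
const SLOTS_PER_ROTATION: u64 = 4; // the parameter x

/// Index of the collator allowed to author at `para_slot`.
/// With SLOTS_PER_ROTATION = 1 this degenerates to the current
/// one-slot-per-author round-robin.
fn expected_author_index(para_slot: u64, num_authorities: u64) -> u64 {
    (para_slot / SLOTS_PER_ROTATION) % num_authorities
}

/// The PVF-side authorship check: the claimed author must own the rotation
/// window that `para_slot` falls into.
fn check_authorship(para_slot: u64, claimed_author_index: u64, num_authorities: u64) -> bool {
    claimed_author_index == expected_author_index(para_slot, num_authorities)
}
```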
Collators being censored can only get 1 block in while others produce `x` blocks in each round-robin round. Worse, we should be somewhat more careful about how $A$ delaying its own blocks impacts timing. ((Jeff: Added, not sure how we should express Alistair's concerns here)) This approach is compatible with the (collator timestamp-dependent) slot-based collation.

This approach, however, is vulnerable to liveness attacks: adversarial collators do not show up in order to stall liveness, but then they also lose out on block production rewards. If the ratio of adversarial collators is $\alpha$ and the collator set is $\mathcal{C}$, then the number of blocks per round is $\alpha\cdot |\mathcal{C}| + (1-\alpha)\cdot x\cdot |\mathcal{C}|$ instead of $x\cdot|\mathcal{C}|$.

Multi-slot collations work well if the collator can prioritise the transactions that are being censored when building its block.

### Determining the parameter $x$

Clearly, the number of consecutive slots ($x$) in the round-robin is lower bounded by the time required to reconstruct the previous block from the availability layer ($b$) plus the block building time ($a$). Hence, we need to set $x$ such that $x\geq a+b$. But with async backing, a malicious collator can sequentially withhold its blocks and just-in-time front-run the honest collator for all of the unincluded_segment blocks. Hence, we require $x\geq (a+b)\cdot m$, where $m$ is the maximum allowed candidate depth (allowed unincluded segment).

Independently, there is a check on the relay chain which filters out parablocks anchored to very old relay parents in [`verify_backed_candidates`](https://github.com/paritytech/polkadot-sdk/blob/ec700de9cdca84cdf5d9f501e66164454c2e3b7d/polkadot/runtime/parachains/src/inclusion/mod.rs#L1237): any parablock anchored to a relay parent older than the oldest element in `allowed_relay_parents` gets rejected. Hence, the malicious collator cannot front-run and censor the subsequent collator after this delay, as the parablock is no longer valid. The update of `allowed_relay_parents` occurs in [`process_inherent_data`](https://github.com/paritytech/polkadot-sdk/blob/ec700de9cdca84cdf5d9f501e66164454c2e3b7d/polkadot/runtime/parachains/src/paras_inherent/mod.rs#L321), where the buffer length of AllowedRelayParents is set by the scheduler parameter [`lookahead`](https://github.com/paritytech/polkadot-sdk/blob/875437c4aecf99e1f0ffeb8278a3b0b0017acbc2/polkadot/primitives/src/v8/mod.rs#L2148) (set to 3 by default). Therefore, the async backing delay ($async\_delay$) tolerated by the relay chain backers is $3 \times 6s = 18s$. Hence, the number of consecutive slots is bounded by the minimum of the above two values:

$$
x \geq \min\big((a+b)\cdot m,\; a + b + async\_delay\big)
$$

where $m$ is the $max\_candidate\_depth$ (or unincluded segment length as seen from the collator's perspective).

**Parameters for Plaza:** If we assume the previous block data can be fetched from backers, then we comfortably have $a+b \leq 6s$. Using the current $async\_delay$ of 18s, we can set $x$ to 4. If the max_candidate_depth ($m$) for Plaza is set to $m\leq3$, then this reduces (improves) $x$ from 4 to $m$ (see the sketch after the open questions below).

**Open Questions:**
- There is an inherent restriction on the async backing parameters for them to make sense. Is there an equation relating $async\_delay$, $max\_candidate\_depth$ and velocity? It is not clear.
- Can we claim that the elements of the `allowed_relay_parents` vector are always consecutive? Essentially, this update is performed purely by the relay chain runtime, and every new relay chain block is appended while ensuring the overall buffer does not grow beyond `max_ancestry_len`.
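To make the bound concrete, here is a small sketch evaluating $x \geq \min\big((a+b)\cdot m,\; a+b+async\_delay\big)$ under the Plaza-style assumptions above; the function and parameter names are illustrative only.

```rust
/// Lower bound on the number of consecutive slots x per collator, following
/// x >= min((a + b) * m, a + b + async_delay). All quantities are measured in
/// relay-chain slots; the names mirror the prose (a: block building time,
/// b: availability-recovery time, m: max_candidate_depth) and are illustrative.
fn lower_bound_x(a_plus_b: u64, m: u64, async_delay: u64) -> u64 {
    core::cmp::min(a_plus_b * m, a_plus_b + async_delay)
}

fn main() {
    // Plaza-style assumptions from the text: a + b fits in one 6s relay slot,
    // async_delay = lookahead (3) relay blocks = 3 slots.
    for m in [3, 4, 5] {
        println!("m = {m}: x >= {}", lower_bound_x(1, m, 3));
    }
}
```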
## Additional Discussions

- Can we have all honest validators start the recovery process when the censoring collator does not share the block? Is this too much load on the availability layer, or should only the collators recover the previous block just-in-time before it is their turn to produce a block?
- The authorship rights in multi-slot collations can also be based on the relay parent slot number instead of the relay parent block number, if that is easier to implement.
- **Do we introduce a fork-choice rule for backers?** One approach to resolving the issue of malicious collators front-running the honest ones is to introduce a fork-choice rule for the backer, where the backer prefers blocks with newer relay parents over older relay parents (a minimal sketch of such a rule follows this list). This is of course outside the PVF checks and introduces a stronger honesty assumption on the backers. But we already assume backers are honest, and with random assignment this is fairly reasonable. Our initial impression is that it does not result in a lower $x$.
- **2/3rd +1 honesty on collators:** In this model, the idea of having an additional consensus round (i.e., similar to Acknowledgements in the [Low-Latency-Parachain](https://docs.google.com/document/d/1Vnn_IRk38mri7WQLLYGnYvhTdTRAaQlAAvCcfH-6drQ/edit?usp=sharing) design, or similar ideas) makes sense. With 2/3rd +1 honesty, there are no liveness or censorship issues where collators do not acknowledge valid blocks (liveness attack) or do not acknowledge when the transaction being censored is included. We should be able to achieve lower latency for the targeted transaction making it into the acknowledgement phase and eventually into the relay chain. This may be a reasonable assumption for system parachains. We will have to look carefully into the collator selection mechanism; this requires quite some consideration, as the collator incentive mechanisms would now shoulder the safety of the relay chain. Ideally, we would want to get away without additional stake being locked by collators.
- A simple DoS attack where censoring collators overwhelm the data availability layer can hamper censorship resistance. To avoid such attacks, where tens of collators request the block from the availability layer and clog up its bandwidth, there needs to be an access control mechanism where only the potential block-building collators are allowed to request blocks. This is not high priority, as the current system can handle 30 concurrent requests.
- If we assume that the collator being censored first attempts to fetch the block from the backer and only later requests it from the availability layer, then the multi-slot collation alternative is probably better. Very likely, the time required to fetch from the backer is less than one slot duration, hence the collator misses only one slot and gets the rest in.
- Claim queue: how does it replace the async backing parameters? Do we still have the notion of max_depth and unincluded_segment_len? Our initial impression is that only the data structures used for storing these parameters are different.
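Below is a minimal sketch of the backer fork-choice rule mentioned above, assuming a simplified candidate summary type; none of these names correspond to existing polkadot-sdk APIs, and the tie-break on para block number is our own assumption.

```rust
/// Simplified stand-in for the candidate metadata a backer would compare;
/// not an existing polkadot-sdk type.
#[derive(Clone, Copy)]
struct CandidateSummary {
    relay_parent_number: u32,
    para_block_number: u32,
}

/// Backer-side fork-choice: of two competing candidates for the same core,
/// prefer the one anchored to the newer relay parent; break ties (an assumed
/// choice) by the higher para block number.
fn prefer(a: CandidateSummary, b: CandidateSummary) -> CandidateSummary {
    if a.relay_parent_number != b.relay_parent_number {
        if a.relay_parent_number > b.relay_parent_number { a } else { b }
    } else if a.para_block_number >= b.para_block_number {
        a
    } else {
        b
    }
}
```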
## References

- [Collator Selection RFC](https://github.com/polkadot-fellows/RFCs/blob/main/text/0007-system-collator-selection.md)
- [Write-up by Jeff](https://hackmd.io/@rgbPIkIdTwSICPuAq67Jbw/S1Dc27261g)
- [Issue: Allow building on older relay parents](https://github.com/paritytech/polkadot-sdk/pull/8299)
- [Enforcing Relay Parent Gap](https://github.com/paritytech/polkadot-sdk/issues/7780)
- [AllowedRelayParent](https://github.com/paritytech/polkadot-sdk/blob/4cd07c56378291fddb9fceab3b508cf99034126a/polkadot/runtime/parachains/src/shared.rs#L64): check the `update`, `acquire_info`, and [`verify_backed_candidates`](https://github.com/paritytech/polkadot-sdk/blob/ec700de9cdca84cdf5d9f501e66164454c2e3b7d/polkadot/runtime/parachains/src/inclusion/mod.rs#L1237) functions.

## Alternate Approach: Adding timeouts

As an alternative to multi-slot collations, we can add conditions in the PVF checks to ensure that B has sufficient time to reconstruct A's block and get its own block in before C. One approach is to use wall-clock times as suggested [here](https://hackmd.io/NNMh4WWQTmyeacn2fIF4ng?both), but we can possibly get away with measuring relative to relay chain block numbers. This approach overlaps with some of the ideas presented in [block-based collators](https://github.com/paritytech/polkadot-sdk/issues/4813), but with a different objective.

We fix some terminology:
- $s$: the relay-parent slot number of the last parablock included.
- $n$: the current relay slot number (from RelayChainStateProof).
- $j$: the para slot of the last parablock included.
- $k$: the para slot of the current parablock.

We add the following conditions to the `on_state_proof` checks (a minimal sketch follows the griefing analysis below):
- If $k > j+1$, then $n \geq \lceil s + a(k-j-1) + b \rceil$. Here $b$ is the time for a collator to reconstruct the parablock from the availability layer, measured in relay chain slots, plus the time for producing the new parablock (for example, if the time required is 12s and the relay chain slot is 6s, then $b=2$). We can set $a$ to 1 if the honest collators use the strategy of eagerly producing a block immediately when they do not see a block from the previous slot owner; if that block later shows up, they build on it, abandoning the eagerly produced block.
- If $k = j+1$, then the already existing checks in `on_state_proof` suffice.

Since we are unsure whether the parameters $a$ and $b$ will have unintended consequences on other subsystems, we should probably run some tests/experiments.

For the above mechanism to work, the authorship claims (of the AURA cycles) need to be relative to the previous parachain block that was included in the relay parent. The parachain authorship claims cannot depend only on parachain slots (based on wall-clock times), as in the current slot-based collator. Otherwise, B can grief by creating a delay of $b$ such that C is pushed out of its slot.

### Griefing Analysis

Since we now have the condition of waiting for $B$ before letting $C$ author, $B$ can abuse this timeout to attempt a griefing attack.

*Happy path*: If the collators do not attempt to misuse the above fix, i.e., they keep producing parablocks as and when it is their slot, then this does not reduce the throughput of parachain blocks being backed.

*Worst case*: Latency is affected only when collators attempt a griefing attack. An attacker can wait for the period $b$ before authoring its block. In fact, the best strategy is not to have consecutive adversarial collators but to alternate them (assuming $b \gg a$), with each of them waiting $b$ before producing its block. Such an attack is aggravated with elastic scaling. Possibly, for a parachain with $c$ cores, we can have collators produce $c$ consecutive blocks before rotating to the next collator. This requires changing the AURA round-robin mechanism.
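To make the timeout condition concrete, here is a minimal sketch of the extra `on_state_proof` check. The function name and the assumption that $a$ and $b$ are whole numbers of relay slots (so the ceiling disappears) are ours, not part of the existing Aura-Ext code.

```rust
/// Sketch of the proposed timeout condition for `on_state_proof` (hypothetical
/// function, not the existing Aura-Ext code). `s` and `n` are relay slot
/// numbers, `j` and `k` are para slot numbers, and `a`, `b` are assumed to be
/// whole numbers of relay slots, so the ceiling from the prose is not needed.
fn timeout_condition_holds(s: u64, n: u64, j: u64, k: u64, a: u64, b: u64) -> bool {
    if k > j + 1 {
        // Para slots were skipped: the current relay slot must be late enough
        // that the skipped slot owners had time to recover and build,
        // i.e. n >= s + a * (k - j - 1) + b.
        n >= s + a * (k - j - 1) + b
    } else {
        // k == j + 1: the existing on_state_proof checks already cover this case.
        true
    }
}
```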