Polkadot state sync

Polkadot Host implementations use Fast/Warp sync to synchronize missing block headers. They are followed by the state sync. The faster we can synchronize missing block headers and the most recent state, the faster our node will be able to start executing the most recent blocks and create new blocks (if we are running validator). In this post we will explore current strategies to sync the state as well as propose a new approach. **The goal is to come up with state sync protocol that:** **1. Easily parallelizable** **2. To save bandwidth responses do not contain overlapping data** **3. Resistant to DOS attacks** If you are already familiar with how state sync works, just scroll down to [Proposed solution](.Proposed_solution) section. ## State ### Key-value Each block in polkadot blockchain has associated state. State is key-value collection. <pre style="background: #161616; color: #d4d4d4"> state["money:alice"] = "100$" state["money:bob"] = "200$" </pre> ### State hash Block state is immutable. Each block stores hash of state **in a form of merkle root** to ensure state integrity. Forged state will have different hash. <pre style="background: #161616; color: #d4d4d4"> hash1 = hash(state({ "money:alice": "100$" })) hash2 = hash(state({ "money:alice": "1000$" })) hash1 != hash2 </pre> ### Polkadot trie Polkadot stores state as merkle trie. #### Trie Trie is prefix tree. Nodes with common prefix are children of parent node with that prefix. <pre style="background: #161616; color: #d4d4d4"> [ "money:alice", "money:bob", ] -> node("money:", [ node("money:alice"), node("money:bob"), ]) </pre> Parent node stores children by letter which follows parent prefix. <pre style="background: #161616; color: #d4d4d4"> [ "money:alice", "money:bob", ] -> node("money:", { "a": node("money:alice"), "b": node("money:bob"), }) </pre> Node key omits redundant prefix. <pre style="background: #161616; color: #d4d4d4"> [ "money:alice", "money:bob", ] -> node(partial="money:", { "a": node(partial="lice"), "b": node(partial="ob"), }) </pre> #### Merkle trie Merkle trie is merkle tree. Block stores hash of root trie node. Immutable node is uniquely identified by it's hash. Parent node stores child hash. Polkadot node database stores node by hash. <pre style="background: #161616; color: #d4d4d4"> trie1 = node("money:", [ node("money:alice"="100$"), node("money:bob"="200$"), ]) trie2 = node("money:", [ node("money:alice"="1000$"), node("money:bob"="200$"), ]) -> { hash1: node("money:", [hash3, hash4]), hash2: node("money:", [hash5, hash4]), hash3: node("money:alice"="100$"), hash4: node("money:bob"="200$"), hash5: node("money:alice"="1000$"), } </pre> ## /state/2 Polkadot "/state/2" protocol is used to obtain state from other peers. There are two subprotocols inside "/state/2" protocol. They are distinguished by `request.proof: bool` field. ### /state/2 proof=false This protocol requests next page of key-values. It stops when `response.end: bool` field is `true`. Then it re-creates trie from key-values. <pre style="background: #161616; color: #d4d4d4"> step 1 request(after="") -> response( "money:alice"="100$", "money:bob"="200$", end=false, ) step 2 request(after="money:bob") -> response( "money:charlie"="300$", "money:dave"="400$", end=true, ) </pre> #### Problems We can't check root hash until we receive all key-values. Re-creating trie nodes from key-values may change root hash (inconsistency during trie v1 migration). Waiting for all requests sequentially may be slow. We may want to send parallel requests to different peers. But we don't know range of keys in trie. We may blindly guess keys, but responses may overlap **, increasing bandwidth**. <pre style="background: #161616; color: #d4d4d4"> step 1 (parallel) request(after="a") request(after="b") request(after="c") -> response(["a", "b", "c"]) response([ "b", "c"]) response([ "c"]) </pre> ### /state/2 proof=true This protocol requests trie nodes (merkle proof) required to get next page of key-values. **That increases response size, especially if trie is getting deeper.** It stops when client receives all trie nodes. <pre style="background: #161616; color: #d4d4d4"> step 1 request(after="") -> response({ hash1: node("money:", [hash2, hash3, hash4, hash5]), hash2: node("money:alice"), hash3: node("money:bob"), }) step 2 request(after="money:bob") -> response({ hash1: node("money:", [hash2, hash3, hash4, hash5]), hash4: node("money:charlie"), hash5: node("money:dave"), }) </pre> We can check root hash for each request. It receives trie nodes as is, so doesn't need to re-create them. Prefixes from received nodes limit possible keys **for future requests**. But we don't know how many keys exist under prefix, so responses may overlap. <pre style="background: #161616; color: #d4d4d4"> step 1 request(after="") -> response({ hash1: node("money:", { "a": hash2, "b": hash3, "c": hash4, "d": hash5, }), }) step 2 (parallel) request(after="money:a") request(after="money:b") request(after="money:c") request(after="money:d") -> response([hash1 hash2 hash3]) response([hash1 hash3 hash4]) response([hash1 hash4 hash5]) response([hash1 hash5]) </pre> Even sequential request responses duplicate parent nodes (root, "m", "mo", "mon", "money:"). ## Problems - Responses duplicate parent nodes. - Uses more network bandwith. - Client already knows these nodes. - Parallel responses overlap. - Uses more network bandwith. - Reduces profit of parallel over sequential requests. ## Proposed solution Protocol can request trie nodes by hashes. Response will contain one or more requested nodes. Response may contain children/decendants of requested nodes. Unknown child hashes from response can be used in new requests. Unknown hashes can be requested in parallel from different peers. Response doesn't return keys outside of requested node prefix. Client requests non-overlapping subtrees. So responses don't overlap. Response doesn't contain parent nodes, so it doesn't duplicate them. <pre style="background: #161616; color: #d4d4d4"> tree hash1 => node([ hash2 => node([ hash4 => node(…) hash5 => node(…) ]) hash3 => node([ hash6 => node(…) hash7 => node(…) ]) ]) step 1 request(hash1) -> response(node([hash2, hash3])) step 2 (parallel) request(hash2) request(hash3) -> response(node([hash4, hash5])) response(node([hash6, hash7])) step 3 (parallel) request(hash4) request(hash5) request(hash6) request(hash7) -> response(node(…)) response(node(…)) response(node(…)) response(node(…)) done </pre> If client needs only particular keys (like "/light/2" protocol), or wants to receive them as early as possible, request can contain such keys. Server will return nodes on path along this key, ignoring nodes outside this prefix. ### Spam Malicious client can request many invalid hashes from server. Server may not have all valid nodes, so it can't prove that hash is invalid. Server will process these hashes independently. Such invalid request with N invalid hashes will be handled in O(N) db reads. Client may prove server that requested hashes exist in trie. Simple merkle proof can consist of trie nodes containing requested hashes. Valid node hash must be root, or child of valid node. Server already knows all valid trie nodes which client may request. So proof can use hashes instead of nodes. Proof and request hashes will be organized as tree, to preserve parent-child relation. <pre style="background: #161616; color: #d4d4d4"> request( hashes=[hash8, hash7], proof={ hash1: node(…, [hash2, hash3, hash4, …]), hash2: node(…, [hash5, hash6, hash7, …]), hash5: node(…, [hash8, hash9, …]), }, ) -> compact(hash1, [ compact(hash2, [ compact(hash5, [ compact(hash8, []), ]), compact(hash7, []), ]), ]) </pre> Proof validation bound is same for valid and invalid proof. First db read will check if server has root node. If it doesn't it stops. If it has node, before next db read it will check that next hash is child of current node. So invalid hashes won't be read from db. <pre style="background: #161616; color: #d4d4d4"> hash1 = node([hash2, hash3]) valid = compact(hash1, [ compact(hash2, […]), ]) invalid = compact(hash1, [ compact(hash2, […]), compact(invalid1, […]), compact(invalid2, […]), … ]) </pre>