# Byzantine View Synchronisation: preliminary research

## Selected state-of-the-art byzantine view synchronisation protocols

| | Cogsworth | Naor-Keidar | Lewis-Pye |
| --- | --- | --- | --- |
| **Expected latency** | $O(1)$ | $O(1)$ | $O(n)$ |
| **Worst-case latency** | $O(n)$ | $O(n)$ | $O(n)$ |
| **Expected message complexity** | $O(n)$ with benign faults, $O(n^{2})$ with byzantine faults | $O(n)$ | $O(n^{2})$ |
| **Worst-case message complexity** | $O(n^{2})$ with benign faults, $O(n^{3})$ with byzantine faults | $O(n^{3})$ | $O(n^{2})$ |
| **Leader/committee selection mechanism** | Randomised, but can be Round Robin, only affecting latency and message complexity | Same | Round Robin, but can be randomised |
| **Can be used to help lagging replicas catch up** | Yes, but only if the leader driving progress communicates with the lagging replica (an honest leader will) | Same, but a lagging replica can update its `next_view` variable if it learns that at least $f+1$ other replicas are voting for a higher view (relevant for liveness and for a possible improvement against byzantine faults) | Yes, unless the leader is byzantine |
| **Stabilization property** | No: a byzantine leader can omit some replicas when broadcasting the confirmation for a new view, and if that leaves fewer than $f+1$ replicas behind, they might not be able to advance | Yes, solves the issue with Cogsworth by adding an extra phase | No for S1: $f+1$ correct replicas are guaranteed only for the first view in each epoch, because all $f+1$ epoch leaders are contacted; in views within an epoch, a byzantine leader might omit some replicas when broadcasting a view certificate. But S2 should hold. |
| **Optimistic responsiveness** | No | No | Yes, thanks to being coupled with the progress protocol |
| **Guarantees monotonicity** | No | Yes | Yes |
| **Threshold signatures required for the stated latency and message complexity** | Yes | Yes | Yes |
| **Sufficiently unambiguous to implement** | No | Yes | Yes, though it's specified as a HotStuff-like consensus protocol rather than a pluggable Pacemaker module, so we'd need to adapt the view sync mechanism to our protocol |
| **Buffer-optimal** | No, might require keeping messages for all views in the buffer | Yes: thanks to the extra "finalize" phase, replicas can delete all messages for previous views after "finalize" | Yes, only messages for future views are kept |
| **Must be implemented as a new thread** | Yes | Yes | No, can be part of progress mode |
| **Number of replicas agreeing on the same next view required to move to that view** | $f+1$ | $f+1$ | $2f+1$ |
| **Liveness** | Not sure | Yes | Not sure: what if, because of a few consecutive byzantine leaders, replicas' views go completely out of sync, and no quorum of votes can be achieved for the same view or epoch? Perhaps this doesn't happen because of the relative values of the view and view sync timeouts; I need to think about this |

## Pros and cons

1. **Cogsworth:** (+) latency and no need to sample relays; (-) not very clear, tricky implementation, and might not satisfy liveness.
2. **Naor-Keidar:** (+) latency, reasonably clear, and I am convinced about liveness; (-) not sure if it can be made optimistically responsive, and sampling a new set of relays for every round is suboptimal.
3. **Lewis-Pye:** (+) the view sync part is very clear and neat, chained like HotStuff, no need to sample relays, can be made optimistically responsive easily; (-) specified as part of a HotStuff-like protocol, and the liveness proof is also for the entire protocol, so I'd need to investigate liveness and our ability to detach the view sync protocol from the consensus protocol.

## Important considerations

### Stabilization property (from Naor-Keidar)

> *For any $t$ during the run, let $t_0$ be the first time when a correct process enters view $v_{max}(t)$. If no correct process enters any view $v > v_{max}(t)$ then:*
>
> *S1. From some time $t_1$ onwards, at least $f+1$ correct processes are in view $v_{max}(t)$.*
>
> *S2. If $t_{0} \geq GST$ and $Relay(v_{max}(t), 1)$ (the first relay of view $v_{max}(t)$, which processes try to contact) is correct, then from some time $t_{2}$ onward all correct processes enter $v_{max}(t)$ and $t_{2} - t_{0} \leq c_{1}$ for some constant $c_{1}$.*

Note that given a fair leader rotation mechanism, this implies that all lagging replicas will *eventually* catch up with the rest. However, in the worst case it might take $f$ views, and in the absolute worst case it could mean that for $f$ views there is no quorum and no progress is made in progress mode. We can try to remedy this with a block-sync mechanism.

### Optimistic responsiveness

In the partial synchrony model, we would like that after stabilization time $\Delta$ after GST, the consensus protocol moves at network speed, uninhibited by an imposed timeout, whenever progress can be made faster than the timeout. This is in fact a property of the HotStuff consensus protocol, but it fails to hold if the Pacemaker is not optimistically responsive.
Byzantine view synchronisation protocols (possible Pacemakers) generally expose an `advance()` API, which replicas call when they want to move to a new view. The safest way as far as liveness is concerned, suggested in Cogsworth and Naor-Keidar, is that for each view a replica calls `advance()` after a fixed, constant timeout for the view. However, this is not optimistically responsive, as it might require replicas to wait before calling `advance()` even when voting for the current view is already complete. For optimistic responsiveness, we'd like replicas to call `advance()` immediately after they have completed their work for the current view (as suggested in Lewis-Pye). The challenge with implementing this is that:

* Replicas should call `advance()` once they send their votes to the next leader
* The next leader should call `advance()` once they're done collecting votes for the next view

The latter might take longer than the former, but the replicas that called `advance()` early will have a majority, and if the view update protocol moves fast, they might trigger a view update before the next leader is done collecting votes. This would be a liveness threat.

Here is one solution to this problem, compatible with the Lewis-Pye protocol. Since the next leader is the one collecting votes for the view change, they might delay sending the confirmation for the view update (with the vote aggregate) until either:

* They manage to collect a QC, or
* Their view times out (for this, the view timeout must be lower than the replicas' timeout for contacting the leader of the next view by at least one roundtrip time)

This works well with Lewis-Pye because it operates on epochs of size $f+1$ where (1) the leader of each view in the epoch is the same as the leader collecting votes in the view sync protocol, and (2) there are no relays; rather, if the leader of a view can't get replicas to enter its view, the view is skipped.
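A minimal sketch of the leader-side gating described above: the leader of the next view collects view-sync votes, but releases the view-update confirmation (the vote aggregate) only once it holds a QC or its view has timed out. All names here (`ViewSyncLeader`, `QUORUM`, the injected clock) are illustrative, not taken from any of the cited protocols.

```python
import time

QUORUM = 3  # e.g. 2f + 1 with f = 1, n = 4

class ViewSyncLeader:
    """Leader of the next view, gating the view-update confirmation."""

    def __init__(self, view, view_timeout, clock=time.monotonic):
        self.view = view
        # Per the note above, view_timeout must undercut the replicas'
        # timeout for contacting the next view's leader by at least one
        # roundtrip time.
        self.view_timeout = view_timeout
        self.votes = set()
        self.clock = clock
        self.started_at = clock()

    def on_vote(self, replica_id):
        # Record a view-sync vote from a replica that called advance().
        self.votes.add(replica_id)

    def confirmation_ready(self):
        """Release the confirmation only on QC or view timeout."""
        have_qc = len(self.votes) >= QUORUM
        timed_out = self.clock() - self.started_at >= self.view_timeout
        return have_qc or timed_out
```

With this gating, replicas that called `advance()` early cannot push the view update through before the leader has either aggregated a QC or given up at its (shorter) view timeout.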
However, Cogsworth and Naor-Keidar have $f+1$ relays for each view, such that if a relay fails to send a vote aggregate, the next relay in line is contacted. Perhaps if we ensure that the first relay is the leader of the view, and set the view timeouts and the view sync timeouts for contacting relays accordingly (as above), then the leader of the view can delay the progress of view sync until it collects a QC or times out, as above.

## Design challenges

1. Optimal buffer
2. Random leader/committee selection (if not Round Robin)
3. Optimistic responsiveness
4. Possible multi-threading
5. **Setting the timeouts such that liveness is ensured** (they should not be defined by the user)

## Random committee selection (TODO)

We can try something similar to what [Ethereum](https://eth2book.info/capella/part2/building_blocks/committees/) is doing, i.e.:

* Extract a pseudo-random seed from the recent updates to the blockchain
* Sample the committee/leader from this random seed

Note that a potential issue with this approach is that replicas that are lagging behind might not get the same seed, unless we do block sync every time before a new seed is determined.

## Block sync trigger mechanism (TODO)

## Sources

1. [Cogsworth](https://cryptoeconomicsystems.pubpub.org/pub/naor-cogsworth-synchronization/release/5)
2. [Naor-Keidar](https://arxiv.org/pdf/2002.07539.pdf)
3. [Lewis-Pye](https://arxiv.org/pdf/2201.01107.pdf)
4. [The latest view on view synchronisation](https://blog.chain.link/view-synchronization/#pacemakers)
5. [Optimal latency and communication for SMR view synchronisation](https://blog.chain.link/optimal-latency-and-communication-smr-view-synchronization/)
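As a closing sketch, the seed-based sampling idea from the committee-selection section above could look like the following. This is a toy model, much simpler than Ethereum's actual committee shuffling: all function names are illustrative, and a real scheme would need an unbiasable seed plus the block-sync caveat noted above.

```python
import hashlib

def seed_from_blocks(block_hashes):
    """Fold recent block hashes into a single pseudo-random seed."""
    h = hashlib.sha256()
    for bh in block_hashes:
        h.update(bh)
    return h.digest()

def leader_for_view(seed, view, n):
    """Deterministically map (seed, view) to one of n replicas.

    Every replica holding the same seed derives the same leader,
    so no extra communication is needed for leader election.
    """
    h = hashlib.sha256(seed + view.to_bytes(8, "big")).digest()
    return int.from_bytes(h, "big") % n
```

Replicas that agree on the recent block hashes derive the same leader for every view; a lagging replica with stale hashes derives a different seed, which is exactly why block sync before each seed change would be required.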