---
tags: Time, Ethereum2.0
---
# BFT Clock Sync
Clock rates are unstable, so even if two clocks are perfectly synchronized at some point in time, they will drift apart and the clock discrepancy will, on average, grow over time.
While the drift can be small enough not to be a problem over a chosen time scale, typically clocks must be resynchronized periodically.
Syncing two clocks is conceptually simple - one should be able to read both clocks (approximately) at the same time and calculate the difference between the readings.
An implicit additional requirement is that the clocks' drift relative to each other is bounded and the bound is known. Otherwise it's not possible to calculate a proper resynchronization interval.
Thus, one kind of clock fault is exceeding the assumed drift bound.
In a distributed system, a clock can also return different results to different processes - a Byzantine clock failure.
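The required resynchronization interval follows directly from the drift bound and the tolerated disparity. A minimal sketch of the calculation (function name and parameters are illustrative):

```python
def max_resync_interval(max_disparity_s: float, drift_bound_ppm: float) -> float:
    """Longest interval two clocks, perfectly synced now, can run before
    their mutual disparity may exceed max_disparity_s.

    Worst case: one clock runs fast by +rho while the other runs slow
    by -rho, so the disparity grows at 2 * rho seconds per second.
    """
    rho = drift_bound_ppm * 1e-6          # dimensionless rate error
    return max_disparity_s / (2.0 * rho)  # seconds

# E.g. a 500 ms disparity budget at a 100 ppm drift bound allows
# roughly 2500 seconds (~42 minutes) between resyncs.
```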
## Requirements
There are several sources of requirements in Ethereum 2.0:
- BCP
- network layer
- cryptoeconomics
- EVM/smart contracts
BCP and the network layer require synchronization of participant clocks with each other. However, the sets of participants differ:
- validator nodes in BCP
- validators plus other nodes in p2p, which can be a problem
The network layer assumes a stricter requirement on clock disparity - 500 ms - whereas BCP can probably tolerate a disparity on the order of `SECONDS_PER_SLOT`, which is 12 seconds.
Cryptoeconomic considerations require syntonization with the World Time Standard (UTC) rather than synchronization, i.e. the slot duration as achieved in BCP should not differ much from `SECONDS_PER_SLOT` measured in UTC seconds. Note that a clock synchronization algorithm can ensure low clock disparity but can still shift the clock rate, e.g. BCP time can become driven by the fastest clock.
Time usage in the EVM, including smart contracts, may additionally require that time is synchronized to the World Time Standard.
In theory, there can be several BCP Time Services suitable for different needs. This can help to achieve graceful degradation if one service fails to reach its quality goals.
For example, if clock synchronization among validators is implemented by synchronizing their clocks to UTC, then compromising the latter will lead to BCP-level problems. However, if BCP clock sync is implemented as a separate protocol, then compromising the UTC sync protocol still allows BCP to function without much trouble. While it is possible that the clock rate will depart from UTC's, sooner or later the UTC clock sync will be repaired.
Since BCP relies on clock synchronization to obtain lock-step execution, this can also be achieved using logical clocks.
## Local clocks
Every computer has one or several Crystal Oscillators (XO), either driving its CPU or a Real-Time Clock (RTC). An XO is relatively stable: one can expect a clock drift bound of around 100 ppm. When calibrated, it can be even more predictable, since the bound then describes the clock rate uncertainty relative to the calibrated rather than the nominal clock rate.
The clock drift bound is important in many clock synchronization algorithms, since it can be used to filter out outliers, e.g. faulty readings. Better clock stability also allows increasing the time between clock synchronizations and thus reducing resource usage.
A useful approach would be to simultaneously filter outliers and calibrate clocks, i.e. a form of Robust Clock Calibration.
## Clock Sync Strategies
### Validator responsibility/No Clock Sync protocol
One strategy is that validators are responsible for correctly setting up their Time Service. The main requirement from the BCP point of view is that such services are independent, in the sense that their faults are uncorrelated.
Unfortunately, many validators may limit themselves to using a default NTP setup, which introduces implicit centralization into the BCP and thus makes it prone to correlated faults, e.g. attacks. More robust alternatives include:
- GNSS clocks
- Radio clocks
- Atomic clocks (expensive)
- NTP config with carefully chosen public servers
### Clock Exchange Protocol
A simple way to detect a problematic NTP setup is to implement a Clock Exchange protocol, i.e. validators exchange pairwise clock offsets with each other. This gives each validator access to many Time Sources, which can be compared for consistency.
If a (weighted) majority of validators have correct Time Source setups, then at least low-staked or low-qualification validators can detect problems with their own Time Source setups.
The Clock Exchange protocol should possess BFT properties.
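With pairwise offsets collected, each validator can compare its own clock against a robust aggregate of its peers' clocks. A hypothetical sketch, assuming offsets are in seconds with positive values meaning the peer's clock is ahead (the median-based rule and the names are assumptions, not part of any specification):

```python
from statistics import median

def own_clock_suspect(peer_offsets: list[float], tolerance_s: float) -> bool:
    """peer_offsets[i] is the measured offset of peer i's clock relative
    to ours. If a majority of peers have correct Time Sources, the median
    offset approximates our own clock's error and is resistant to a
    minority of Byzantine peers reporting arbitrary values.
    """
    if not peer_offsets:
        return False
    return abs(median(peer_offsets)) > tolerance_s

# Our clock is ~2 s behind most peers; one Byzantine peer reports -1000 s,
# but the median ignores it and the problem is still detected.
```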
### Clock Sync protocol
The Clock Exchange protocol can be augmented with a clock correction mechanism, leading to a Clock Sync protocol.
There can be various strategies and algorithms for implementing Clock Synchronization:
- a separate (from BCP) protocol
- CS protocol over a BFT protocol
- CS protocol piggybacked on BCP message flow and/or libp2p message flow
- CS protocol on libp2p level
BCP Clock Sync can rely on UTC sync (Time Sources) or on local clocks only. Clock rate syntonization may be achieved with pressure from a separate UTC sync protocol.
BFT CS can use different network configurations:
- n-to-n channels
- over an existing p2p graph (pubsub, gossipsub/floodsub)
## Clock Sync Algos
### Models
#### Local clock model
The local clock drift is limited to around $\pm 100\,\mathrm{ppm}$ relative to the nominal rate. However, after calibration (learning its rate by comparing it to a time standard), a clock can be more predictable.
Due to aging, the clock rate can deviate by more than $100\,\mathrm{ppm}$ from the nominal rate.
So, either looser bounds or a calibration procedure may be required.
Formally, the two models can be specified as follows:
a. $(1 - \rho)\,\delta \le C(t + \delta) - C(t) \le (1 + \rho)\,\delta$
b. $(r - \rho)\,\delta \le C(t + \delta) - C(t) \le (r + \rho)\,\delta$
If the clock rate deviates by more than the bound, it is a clock fault. One reason can be a clock adjustment by an NTP daemon or an administrator.
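A reading that violates the model bounds can then be flagged as a clock fault. A sketch of the check for model (b), the calibrated-rate model (parameter names are illustrative; model (a) is the special case $r = 1$):

```python
def within_model(delta_ref: float, delta_clock: float,
                 r: float = 1.0, rho: float = 100e-6) -> bool:
    """Check that a local clock advanced consistently with
        (r - rho) * delta <= C(t + delta) - C(t) <= (r + rho) * delta,
    where delta_ref is the elapsed reference time and delta_clock is the
    observed advance of the local clock over that interval.
    With r = 1.0 this is model (a), bounds relative to the nominal rate.
    """
    return (r - rho) * delta_ref <= delta_clock <= (r + rho) * delta_ref

# Gaining 5 ms over 100 s (50 ppm) is within a 100 ppm bound;
# gaining 50 ms over 100 s (500 ppm) is a fault.
```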
We ignore more complex models, which estimate aging and/or temperature dependency, since they assume a more complex configuration.
#### Network model
We assume network delays are bounded, in the sense that if a delay exceeds the bounds, it's a fault. A lost message means an infinite delay.
An out-of-bounds delay is a fault, which can be caused by network problems or by adversarial behavior (in general, the two cannot be distinguished).
The network connectivity model is either complete (every pair of correct processes is connected) or assumes that any correct process can reach any other correct process (p2p graph).
### Building blocks
#### Clock offset measurement
Assume the sender can set a sending-time field in a message, which is cryptographically signed. If that is not possible, an additional message and an appropriate protocol to prevent forging may be required.
The receiver can then timestamp the message upon receiving it. The difference between the two timestamps, adjusted with delay bounds, will contain the clock offset difference, if no fault happened.
Since in reality a message cannot arrive earlier than it was sent but can be delayed for an arbitrarily long period, the timestamp difference is an upper bound on the clock offset difference. However, some CS algorithms require a lower bound too. This is useful because otherwise the resulting clock can become driven by the fastest clock in the system.
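The offset interval obtainable from a single timestamped message can be sketched as follows, assuming a known minimum delay and, optionally, a maximum (names are illustrative):

```python
def offset_bounds(t_send: float, t_recv: float,
                  d_min: float = 0.0, d_max: float = float("inf")):
    """Bounds on offset = (receiver clock) - (sender clock), ignoring
    drift over the message flight time.

    T_recv = T_send + delay + offset, with delay in [d_min, d_max], so
        offset in [T_recv - T_send - d_max, T_recv - T_send - d_min].
    With only d_min known (d_max = inf), just the upper bound is useful,
    matching the one-way measurement described above.
    """
    diff = t_recv - t_send
    return (diff - d_max, diff - d_min)

# Sent at 10.000 (sender clock), received at 10.030 (receiver clock),
# delay known to lie in [5 ms, 20 ms]: offset is within [10 ms, 25 ms].
```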
#### Roundtrip measurement
To get a lower bound, the nodes should exchange their roles, i.e. the second node should send a message to the first node.
Often, it is desirable that the second message is sent soon after receiving the first one, so that clock drift during the round trip can be ignored.
Since both nodes need to synchronize their clocks, a third message may be needed to pass the measurement results back to the first node.
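The round-trip exchange above can be computed NTP-style from four timestamps: $t_1$ request send (first node's clock), $t_2$ request receive, $t_3$ reply send (second node's clock), $t_4$ reply receive (first node's clock). A sketch, assuming roughly symmetric delays and negligible drift during the round trip:

```python
def roundtrip_offset_delay(t1: float, t2: float, t3: float, t4: float):
    """NTP-style estimate of the second node's clock offset relative to
    the first, plus the total network round-trip delay:

        offset = ((t2 - t1) + (t3 - t4)) / 2
        delay  = (t4 - t1) - (t3 - t2)

    The true offset lies within +/- delay/2 of the estimate, which gives
    both the lower and the upper bound mentioned above.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

# Second node 1 s ahead, 10 ms one-way delays, 10 ms processing time:
# roundtrip_offset_delay(0.0, 1.01, 1.02, 0.03) gives offset ~1.0 s
# and delay ~0.02 s.
```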
#### Bursts
Network delay can vary, so bursts of messages can be used to measure the clock difference more precisely.
However, if there is a regular message flow between nodes, then (robust) regression can be used to estimate the clock difference together with the relative clock skew.
Robust regression is preferred, since there can be occasional high delays. In the case of Byzantine faults, it can also help to limit the induced clock skew.
There is a number of methods to do robust regression: [Theil Sen estimator](https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator), [Repeated median regression](https://en.wikipedia.org/wiki/Repeated_median_regression), [RANSAC](https://en.wikipedia.org/wiki/Random_sample_consensus), [M-estimators](https://en.wikipedia.org/wiki/M-estimator), Quantile regression, etc.
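As a concrete example, a tiny Theil-Sen fit over (send time, measured offset) pairs recovers the relative skew (slope) and offset (intercept) while tolerating a minority of arbitrary outliers (a sketch; the input format is an assumption):

```python
from itertools import combinations
from statistics import median

def theil_sen(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Robust line fit y ~ slope * x + intercept:
    the slope is the median of all pairwise slopes, the intercept is the
    median residual. The breakdown point is about 29% of outliers.
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept

# Offsets growing at 100 ppm from a 2 s base, one Byzantine sample:
# theil_sen([(0, 2.0), (10, 2.001), (20, 2.002), (30, 50.0), (40, 2.004)])
# recovers slope ~1e-4 and intercept ~2.0 despite the outlier.
```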
### Byzantine fault tolerance
Sometimes it can be reasonable to limit the kinds of faults tolerated by the CS protocol, since full BFT can be expensive.
However, it is reasonable to use cryptography to make CS messages unforgeable.
### Delay measurement
An important aspect of a CS protocol is delay measurement.
It is not strictly necessary, since one can assume delay bounds. However, such bounds will limit precision, so measuring delays can be useful if high precision is required.
Delay measurement cons