Handling Graceful Updates on Mithril Network

What

When Mithril runs on Mainnet, thousands of signers will be running at the same time. We must prevent any gap in the certificate chain despite production hazards. Among those hazards, upgrading the version of the nodes has an impact: different versions of the API, messages, or signatures may cause the loss of a significant part of the signer population over one epoch or more.

Why

We need to keep enough signer nodes and the aggregator able to work together in order to produce at least one certificate per epoch.

For example, if the population of signer nodes is split into two sets running incompatible versions, the quorum of the Mithril protocol might not be reached, and no multi-signature could be produced.

Subsequent Questions

How much is "enough"? Is it configurable? Do we need a safety margin on top of it?
We know that in order to have full security we need to reach 95% of the Cardano stake involved in the Mithril protocol. Does this mean that we want 100% of the signers until we ramp up to that threshold?

What is the impact of a signer's stake on the computation of the threshold (e.g. when the stake evolves, or when a signer de-registers from Mithril or retires from Cardano), and which epoch should we consider?

Can we have a "ramp up" period after a major update, when we are below a TBD threshold?

Does that mean we need monitoring tools to track what is going on in the signer population?

What is the legal impact of monitoring (i.e. gathering data from third parties)? GDPR?

Do we handle breaking changes and soft updates differently?

Will the solution we design also work in the future decentralized setup?

What is needed to track node compatibility?
Today we have several versions:

  • Distribution version: version of the software packaging of all the node types
  • Software version: version of each software node
  • API version: version of the HTTP API data structures
  • Database version: version of the nodes database structure
  • Protocol version: version of mithril-stm (used in the metadata of the certificate)

How about automatic upgrades? They would have to be secure: we need to prevent attacks on our upgrade delivery system. This could be done by posting a transaction on the chain that contains the signed hash of the updated software to download. Would that system be mandatory or optional?

Do we need to exclude signers from the threshold computation if they don't meet some criteria (possibly for security reasons)? If yes, which criteria (e.g. the signer registered too recently, or it did not register for the next epoch)?

How to organise our work?

  • Write an ADR: a draft, in order to answer most of the questions raised above. This will allow us to discuss our solution with other parties.

  • Exploration:

    • Separate PoC that interacts with the chain (to activate a new version): read & write transactions.
    • PoC to find the best way to handle backward compatibility of API messages (with protobuf, Avro, in-house development, etc.)

Glossary

breaking change: a release whose new features can only be switched on once "enough" of the signers have upgraded to it.

soft update: a release which is compatible with nodes running the last breaking change.

version: a given state in the evolution of a piece of software, a data schema, or a protocol. It is often associated with a number that identifies it and, in some cases, with a symbolic name.

feature flag: a parameter stored on the blockchain that indicates whether a feature is to be activated or not.

Draft

We need a monitoring solution to ensure a new version is sufficiently spread to represent a majority of the stake. This means the nodes need an external signal to switch from the old feature to the new one.
One way to store this signal is to write it on the blockchain. This means the nodes must read the chain at the same moment to update their behavior.
This moment could be the epoch change, since most new features are likely to enter service at that point.

Another idea:
Changes in the inner structures (certificates, signatures, message to be signed) might be handled by the Aggregator. Today, it knows the stake distribution and could know whether signers are able to handle these new structures.

Changes that can break the way the signatures are made:

  • changes in the message to be signed
  • changes in the way single signatures are produced (crypto), perhaps to a lesser extent.

The envisaged scope of the solution for signing different messages is the following:

  • a software update provides both (old & new) ways of composing messages
  • at a given epoch, the software switches the algorithm it uses to compute certificates
  • first, the aggregator will collect the signers' software versions in order to monitor the deployment rate
  • the information about when to switch is stored on the blockchain
  • the algorithm does not imply database migrations on the fly
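
To illustrate how the aggregator could monitor the deployment rate, here is a minimal sketch that weights each announced software version by stake. Everything in it is an assumption made for illustration: the `SignerInfo` type, its fields, and the function names are hypothetical and do not come from the Mithril code base, and a real threshold would have to follow the answers to the questions raised above.

```rust
use std::collections::HashMap;

/// Hypothetical view of what the aggregator knows about one registered signer.
#[derive(Debug, Clone)]
pub struct SignerInfo {
    pub party_id: String,
    pub stake: u64,
    /// Software version announced by the signer, `None` when it is unknown.
    pub software_version: Option<String>,
}

/// Sums the stake behind each announced software version.
/// Signers that did not announce a version are grouped under "unknown".
pub fn stake_by_version(signers: &[SignerInfo]) -> HashMap<String, u64> {
    let mut totals = HashMap::new();
    for signer in signers {
        let version = signer
            .software_version
            .clone()
            .unwrap_or_else(|| "unknown".to_string());
        *totals.entry(version).or_insert(0) += signer.stake;
    }
    totals
}

/// Fraction (between 0.0 and 1.0) of the total stake held by signers
/// that announced exactly `version`. Returns 0.0 when there is no stake at all.
pub fn adoption_rate(signers: &[SignerInfo], version: &str) -> f64 {
    let total: u64 = signers.iter().map(|s| s.stake).sum();
    if total == 0 {
        return 0.0;
    }
    let adopted: u64 = signers
        .iter()
        .filter(|s| s.software_version.as_deref() == Some(version))
        .map(|s| s.stake)
        .sum();
    adopted as f64 / total as f64
}
```

With three signers holding 600, 300 and 100 units of stake, where only the first one announced the new version, `adoption_rate` returns 0.6 for that version. Weighting by stake rather than counting nodes matches the fact that the quorum depends on stake, not on the number of signers.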

Write the feature flag on the blockchain:

  • all Signers & Aggregators read the blockchain at start-up, at a specified address, to check whether the feature flag has been set for a given Epoch.
  • if it has been set and the given Epoch has already been reached, the new feature is enabled.
  • if it has been set and the given Epoch has not been reached yet, the new feature will be enabled once that Epoch is reached.
  • if it has not been set, the blockchain will be probed again at each new Epoch.
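
The decision rules above can be sketched as a small pure function. This is only an illustration under assumptions: the `FeatureFlag` and `FeatureStatus` types and the function names are hypothetical, an epoch is modelled as a plain `u64`, and reading the flag from the chain is left out (the caller passes `None` when no flag was found at the specified address).

```rust
/// Epoch number (assumption: a plain unsigned integer).
pub type Epoch = u64;

/// Hypothetical feature flag as it would be read from the chain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FeatureFlag {
    /// Epoch at which the new feature must be switched on.
    pub activation_epoch: Epoch,
}

/// What a node should do after looking for the flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FeatureStatus {
    /// No flag found on the chain.
    NotSet,
    /// Flag found, but its epoch is still in the future.
    Scheduled { activation_epoch: Epoch },
    /// Flag found and its epoch has been reached.
    Enabled,
}

/// Applies the rules listed above to the flag found on chain (if any)
/// and the current epoch.
pub fn feature_status(flag: Option<FeatureFlag>, current_epoch: Epoch) -> FeatureStatus {
    match flag {
        None => FeatureStatus::NotSet,
        Some(f) if current_epoch >= f.activation_epoch => FeatureStatus::Enabled,
        Some(f) => FeatureStatus::Scheduled {
            activation_epoch: f.activation_epoch,
        },
    }
}

/// The chain only needs to be probed again at the next epoch
/// when no flag has been found yet.
pub fn must_probe_again(status: FeatureStatus) -> bool {
    status == FeatureStatus::NotSet
}
```

For instance, a flag set for epoch 12 yields `Scheduled` at epoch 10 and `Enabled` from epoch 12 onwards. Keeping the decision separate from the chain access makes the rule easy to test without a running node.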

The certificate lifecycle is composed of Eras, within which certificates are compatible with one another. Feature flags make it possible to switch from one Era to another. This is convenient because Eras can be represented as Enums in the Rust code, which the software can match on to decide which code to run. Once the Epoch of the Era has started, the software starts using the new algorithm, so the old code can be removed from the software (soft update).
When the software detects on the blockchain an Era it does not support, it can emit warnings to ask for an update. It can also stop with an explicit error if the software is no longer compatible with the current Era.
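
As a rough sketch of the Era idea, the code below represents Eras as a Rust enum, resolves the current Era from markers read on chain, warns about upcoming unsupported Eras, and fails with an explicit error when the current Era is unsupported. The era names (`legacy`, `next`), the `EraMarker` type, the message prefixes, and the rule that `Legacy` applies before any marker are all assumptions made for illustration.

```rust
use std::fmt;

/// Eras known to this (hypothetical) software version.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Era {
    Legacy,
    Next,
}

/// Explicit error raised when the software cannot work in the current Era.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EraError {
    Unsupported(String),
}

impl fmt::Display for EraError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            EraError::Unsupported(name) => {
                write!(f, "era '{}' is not supported, please update", name)
            }
        }
    }
}

impl std::error::Error for EraError {}

/// Hypothetical era marker read from the chain.
#[derive(Debug, Clone)]
pub struct EraMarker {
    pub name: String,
    pub start_epoch: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct EraResolution {
    pub current: Era,
    /// Warnings about upcoming Eras this software does not know.
    pub warnings: Vec<String>,
}

fn parse_era(name: &str) -> Result<Era, EraError> {
    match name {
        "legacy" => Ok(Era::Legacy),
        "next" => Ok(Era::Next),
        other => Err(EraError::Unsupported(other.to_string())),
    }
}

/// Picks the most recent marker that has already started as the current Era.
/// Fails if that Era is unknown; unknown future Eras only produce warnings.
pub fn resolve_era(markers: &[EraMarker], current_epoch: u64) -> Result<EraResolution, EraError> {
    let active = markers
        .iter()
        .filter(|m| m.start_epoch <= current_epoch)
        .max_by_key(|m| m.start_epoch);
    let current = match active {
        Some(marker) => parse_era(&marker.name)?,
        None => Era::Legacy, // assumption: Legacy applies before any marker
    };
    let warnings = markers
        .iter()
        .filter(|m| m.start_epoch > current_epoch && parse_era(&m.name).is_err())
        .map(|m| format!("era '{}' starts at epoch {}: update needed", m.name, m.start_epoch))
        .collect();
    Ok(EraResolution { current, warnings })
}

/// Era-dependent behavior is selected with a `match` on the enum.
pub fn compose_message(era: Era, digest: &str) -> String {
    match era {
        Era::Legacy => format!("v1:{}", digest),
        Era::Next => format!("v2:{}", digest),
    }
}
```

Once the `Next` Era has started everywhere, a later release could drop the `Legacy` variant and its match arms. The compiler then points at every place that still depends on the old behavior.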
