This document is aiming to illustrate the thoughts of section key chain handling during the situation of too many elders within one section dropped at the same time. Under such situation, a concensus within a section cannot be maintained any more, under the current algorithms, as no more enough number of provable existing elders to pass down the concensus status to new elected elders via majority votes.
# Basic Concepts
Before going further into the detailed description of the handling thoughts, there are some concepts worth to be clarified or defined first.
## What is the Section Key Chain here
Every node within the safe network will have a [ED25519](https://en.wikipedia.org/wiki/EdDSA#Ed25519) keypair, whose public key actually becomes that node's name. Other nodes can use that public key to verify the signatures that node signed with its private key. For the convenience, we use `Key_ED(x)` to stand for such key pair for node x.
Meanwhile, when a node behaves as an elder for a `Section(pfx)` (a section with a prefix of pfx), it will also maintain a bls_keyshare and bls_publickey, as a result of a DKG process (here is [a high level explaination of DKG process used for safe network](https://hackmd.io/ERglUv22SrW_c2EjuGz-Cg?both=#DKG)).
The bls_publickey is known to public, and can be used to verify the signature_shares signed by correspondent bls_keyshares.
Note that each elder has its own bls_keyshare, meanwhile bls_publickey is the same among the elders paticipated for this round.
For convenience, let's use `Key_BLS(Vn)` to stand for the bls_publickey with a version of N (i.e. elders status got changed for N times).
When we have such bls_publickey of version Vn-1 to sign the key generated for version Vn, then they are `chained`, and we call such chain as `section key chain`.
[Section chain and proof chain](https://hackmd.io/ERglUv22SrW_c2EjuGz-Cg?both=#1-Section-chain-and-proof-chain) gives a detailed explaination of it.
We can imagine that such section key chain is having a pattern like:
```
Key_Genesis->Key_BLS(V1)->Key_BLS(V2)->Key_BLS(V3)
```
where Key_Genesis stands for a root that being generated by the first node of the safe network.
This makes the chain become provable to others, i.e. any outsider can verify whether a `Key_BLS(Vn)` has been approved by the valid elders from Vn-1.
It need to be highlighted that the `bls key sign` means aggreate a bls_signature from bls_signature_shares collected from Vn-1 elders and can be verified by `Key_BLS(Vn-1)`.
## When the chain got broken
The `aggregation` of bls_signature_shares requires over defined threshold (currently defined as super_majority) of the particants to contribute.
In the situation when over `Elder_Size - super_majority` elders got dropped at the same time after `Key_BLS(Vn)`, then even the `Key_BLS(Vn+1)` got generated,
there won't be enough bls_signature_shares to be collected from `Key_BLS(Vn)` to aggregate a valid provable proof.
We call `the chain is broken` in this case, and such `broken chain` prevents the section growing or handling any further operations and eventually the whole network collapse.
# Meld the broken chain
To make the network survive from such `broken chain` situation, we need to figure out a resolvement that can make such broken chain still working for the network and eventually restored to normal. We call such attempt as `meld` and following are some proposed ideas proposed.
## Using Neighbouring Section to meld the broken
As mentioned above, section keys are chained. After split, each section will have its own chain with common part at begin and new keys growing from tail.
For example, chains for `Section(1)` and `Section(0)` that split from `Section()` will be:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4)
```
and
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(1), V3) -> Key_BLS(pfx(1), V4)
```
So if say the Section(0) got broken after V4 and generated `Key(pfx(0), V5)` but cannot using V4 key to sign it to compose the chain. We could use the key from the neighbouring Section(1), i.e. `Key_BLS(pfx(1), V4)`, to sign the newly generated key, which is also provable to pulbic. As the chain of neighbouring section is already public provable.
The chain of Section(0) then can be imagined as:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) X + -> Key(pfx(0), V5)
+-> Key_BLS(pfx(1), V3) -> Key_BLS(pfx(1), V4) +
```
The pseudo code for the remaining elders within Section(0) for the process can be presented as:
```rust
When too many elders dropped {
startup a new DKG process to generate a new BLS key among the remaining elders
send the newly generated BLS key to the closest neighbouring section
------
when received the signed key block proof from the neighbouring section, verify the signatures and signers along the chain of that section
once validated, include the key block into own chain, then promote qualified adults to restore the section back to normal
}
```
The pseudo code for the elders of the neighbouring section can be presented as:
```rust
When received key signing request from the Section(0) remaining elders {
contact the known elders of Section(0) to confirm the claimed dropped elders are really dropped out
sign the newly generated key with own keyshare and send the signatureshare among own section elders
once the signatureshares aggregated, send the block proof back to the Section(0)
}
```
The benefit of this approacch is using the current chain structure seemlessly. The neighbouring section chain is already known, and the key block proof remain as BLS aggregated.
However, this approach requires at least one neighbouring section still having enough elders left over.
## Using Neighbouring Elders to meld the broken
During the extreme disruption of the network, it maybe possible that all neighbouring sections are lost too many elders at the same time. Which even the above approach unable to restore the network.
Under such extreme scenario, It is suggested that we may use ELDER_SIZE elders that cross multiple sections (not including own) to sign the newly generated key using their ED25519 key.
Note that as the elders are from multiple sections, their BLS signature share won't be able to aggregated. Hence it is ED25519 key used.
The meld chain then can be imagined as:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) X + -> Key(pfx(0), V5)
[Key_ED(pfx(10), Elder(x1)), Key_ED(pfx(10), Elder(x2)), Key_ED(pfx(11), Elder(y1))] +
```
The pseudo code for the remaining elders within Section(0) for the process can be presented as:
```rust
When too many elders dropped {
startup a new DKG process to generate a new BLS key among the remaining elders
collect the ELDER_SIZE elders from the neighbouring sections, according to the XOR address order
send the newly generated BLS key to the collected elders
------
when received a signature from a requested elder, verify the signer is valid along the chain of that section
once accumlated enough signatures, compose the signatures into a key block proof and include it into own chain, then promote qualified adults to restore the section back to normal
}
```
The pseudo code for the elders received the signing request can be presented as:
```rust
When received key signing request from the Section(0) remaining elders {
contact the known elders of Section(0) to confirm the claimed dropped elders are really dropped out
sign the newly generated key with own ED25519 key and send the signature back to the Section(0) remaining elders
}
```
This approach can also be used to replace the previous approach during normal scenario, i.e. all ELDER_SIZE elders selected are from the same section.
The benefit of doing so is that for the signing side, there is no need to carry out the aggregation step.
However, this requires the refactor of existing section chain structure to allow BLS_KEY proof and ED_KEY proof to co-exist within the same chain.
## Using Adults to meld the broken
It is also suggested to let adults from the own section to sign the newly generated key as proof, using their ED25519 key.
The chain will be quite similiar to the one within above approach:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) X + -> Key(pfx(0), V5)
[Key_ED(pfx(0), Adult(x1)), Key_ED(pfx(0), Adult(x2)), Key_ED(pfx(0), Adult(x3))] +
```
The pseudo code for the remaining elders within Section(0) for the process can be presented as:
```rust
When too many elders dropped {
startup a new DKG process to generate a new BLS key among the remaining elders
collect the ELDER_SIZE adults within own section, according to the XOR address order
send the newly generated BLS key to the collected adults
------
once accumlated enough signatures, compose the signatures into a key block proof and include it into own chain, then promote qualified adults to restore the section back to normal
}
```
The pseudo code for the adults received the signing request can be presented as:
```rust
When received key signing request from the Section(0) remaining elders {
contact the known elders of Section(0) to confirm the claimed dropped elders are really dropped out
sign the newly generated key with own ED25519 key and send the signature back to the Section(0) remaining elders
}
```
The issue of this approach is that the adults' proof is not `offline public provable` with the chain structure only contain elders.
For the neighbouring section elders to accept the new key, they can only contact the broken section during run time.
And any offline meld chain containing such `meld point`, can only be validated via contacting existing online elders knowing the longer chain before the meld point.
To make such meld chain offline provable, the chain structure has to be refactored to contain the adults status as well (at least the closest ELDER_SIZE adults).
One possible solution is that we make the tail of the section chain is a signing of ELDER_SIZE closest adults membership, and make this tail keeps updated and broadcasted to neighbouring section elders.
The chain with such refactor for Section(0) having closest Adults(a1, a2, a3), will be:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) -> Adults(a1, a2, a3)
```
Once there is an Adult(a4) becomes closer than Adult(a4), then the chain got updated to:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) -> Adults(a1, a2, a4)
```
Then the `meld chain` under broken situation becomes:
```
Key_Genesis -> Key_BLS(pfx(), V1) -> Key_BLS(pfx(), V2) -> Key_BLS(pfx(0), V3) -> Key_BLS(pfx(0), V4) -> Adults(a1, a2, a4) X + -> Key(pfx(0), V5)
[Key_ED(pfx(0), Adult(a1)), Key_ED(pfx(0), Adult(a2)), Key_ED(pfx(0), Adult(a3))] +
```
# Summary
This document explains three proposed approaches to handle the situation that during a severe network disrution, too many elders dropped at the same time, making a section lost concensus ability to restore the network.
The first approach has the least impact to the existing section chain structure. However it has the limitation that requires at least one elder remained within the broken section.
The latter two approaches requires certain refactory on the chain structure. But they have the benefit that requring less work on the `signing` side.
The last approach, with a further refactory, can even make a chain `self-provable`, without need to requiring multiple section chain instances to be referred to.
# Discussions