# Churn
FIXME: the sequence diagrams don't render on discourse. Replace them with offline-generated images or with pseudocode.
This document describes in detail how churn (nodes joining and leaving the network) is handled in routing.
## Glossary
### Address
See [Name](#name).
### Agreement
Any action that affects the section must be agreed on by a supermajority (> 2/3) of the section elders. This is implemented by the section elders *proposing* the action: each elder creates a message containing the proposal and signs it with its *BLS secret key share*, then broadcasts the signed proposal to the rest of the elders. The elders *aggregate* the *signature shares* of these proposals and, once they receive enough of them (> 2/3), combine them into a full *BLS signature*. This signature, together with the corresponding [public key](#section-key), forms an undeniable and independently verifiable cryptographic proof of the action, and the elders can then perform it.
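Using the pseudocode conventions defined later in this document, the flow can be sketched roughly as follows (`signature_shares`, `supermajority`, `aggregate` and `perform` are illustrative names, not the actual code):

```rust
// Proposing an action (run by the elder initiating it):
fn propose(action) {
    broadcast(proposal(action)) // signed with our BLS secret key share
}

// Called when a signed proposal arrives from another elder.
fn handle_proposal(action, signature_share) {
    signature_shares.insert(signature_share)

    // Supermajority: strictly more than 2/3 of ELDER_SIZE shares.
    if signature_shares.len() >= supermajority(ELDER_SIZE) {
        let signature = aggregate(signature_shares) // full BLS signature
        perform(action, signature) // signature + section key form the proof
    }
}
```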
### ELDER_SIZE
The number of elders in a section. This is a whole-network constant so every section has the same number of elders. This number is currently `7` and is chosen as a compromise between two opposing requirements:
1. It should be small because the more elders there are the more computation and message traffic is needed which affects performance.
2. It should be large to make the section resilient to failure and malice.
With `ELDER_SIZE` of 7 the number of elders is still relatively small, and at the same time sections can tolerate up to 2 faulty elders, which seems like a reasonable fault tolerance. This number might change in the future.
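As a quick sanity check of these numbers, the supermajority bound ("strictly more than 2/3") can be computed as follows; `supermajority` here is an illustrative helper, not necessarily the exact function used in the code:

```rust
// Smallest integer strictly greater than 2/3 of n (a sketch).
fn supermajority(n: usize) -> usize {
    n * 2 / 3 + 1
}

fn main() {
    // With ELDER_SIZE = 7, five signature shares are needed,
    // so a section tolerates 7 - 5 = 2 faulty elders.
    assert_eq!(supermajority(7), 5);
}
```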
### Message signatures
Messages sent over the network are cryptographically signed to prevent tampering. There are two types of signatures we use:
1. Messages sent from a single node are signed by its *Ed25519* key.
2. Messages sent from a whole section (see [Signature aggregation](#signature-aggregation)) are signed by the *BLS* secret key shares, and the resulting *signature shares* are aggregated into a full *BLS signature*.
### Name
A way to identify objects (nodes, data, ...) on the network. It is a 256-bit long number that is typically randomly generated or obtained from a hash of some data. Thus the probability of collision is very low and so names can be considered to be globally unique.
Names form a [metric space](https://en.wikipedia.org/wiki/Metric_space) where the distance function is the `XOR` operator. This is also called **XOR space** and names are called **XOR names**. Note that [the XOR distance](https://medium.com/@maidsafe/structuring-network-with-xor-431e785b5ee7) is a purely mathematical concept and has no relation to geographical distance, so nodes with names that are "close" according to the XOR metric can still be on opposite sides of the globe.
A name is sometimes referred to as an **Address** in some contexts. It is implemented as the `XorName` struct in the code.
See also [Prefix](#prefix).
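As a toy illustration of the XOR metric (with names shortened to 8 bits; real names are 256-bit), names sharing a long common prefix are close, while names differing in a high bit are far apart:

```rust
fn main() {
    let a: u8 = 0b0110_1010;
    let b: u8 = 0b0110_1111; // shares a 5-bit prefix with `a`
    let c: u8 = 0b1110_1010; // differs from `a` in the very first bit

    assert_eq!(a ^ b, 0b0000_0101); // distance 5: "close" to `a`
    assert_eq!(a ^ c, 0b1000_0000); // distance 128: "far" from `a`
}
```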
### Node state
Each node holds a state that contains information about the node itself (`Node` structure), about the section it's a member of (`Section` structure) and about the other sections in the network (`Network` structure).
### Prefix
The first `N` bits of a [Name](#name) where `N` is the prefix length. For example `(011)` is a prefix of length 3. A prefix defines a subset of the name space which contains all names that start with the prefix. An empty prefix covers the whole name space. Prefixes are used to identify [Sections](#section).
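For illustration, prefix matching on toy 8-bit names can be sketched like this (`matches` is a hypothetical helper, not the actual `Prefix` API):

```rust
// Does `name` start with the `prefix_len` bits given in `prefix_bits`?
fn matches(prefix_bits: u8, prefix_len: u32, name: u8) -> bool {
    name >> (8 - prefix_len) == prefix_bits
}

fn main() {
    // The prefix (011) of length 3:
    assert!(matches(0b011, 3, 0b0110_1010)); // this name starts with 011
    assert!(!matches(0b011, 3, 0b0010_1010)); // this one starts with 001
}
```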
### Proposal
When a node wants to perform some action that affects the section, it must first propose this action to the rest of the section elders and wait for the proposal to reach agreement. See [Agreement](#agreement).
### RECOMMENDED_SECTION_SIZE
A constant that determines when a section splits. A section can split when both subsections would have at least `RECOMMENDED_SECTION_SIZE` members. This also indirectly affects the maximum average section size - it's twice the recommended section size. Currently this number is set to 2 * [`ELDER_SIZE`](#elder_size) which is a compromise between two opposing requirements:
1. We want the sections to be big, because that makes them more resilient to failure (even if a lot of nodes are dropped, there are still enough left to cover for them).
2. We want the sections to be small, so that each section is responsible for a smaller chunk of the address space, which means less data and less network traffic per section.
### Relocation
The process by which a node is moved ("relocated") from one section to another. Explained in detail in the [Node ageing](#node-ageing) section.
### Section
A group of nodes on the network whose names have the same [prefix](#prefix). Initially the network has only one section with the empty prefix (covering all names). As the network grows, sections split into two subsections whose prefixes are one bit longer than the original section's. For example, a section with prefix `(00)` splits into two with prefixes `(000)` and `(001)`.
### Section key
The *BLS* public key that corresponds to the secret key shares held by the current section elders. This key correlates to a hypothetical *secret key* which is the *aggregation* of all those *secret key shares*; however, this secret key doesn't physically exist, as the shares are generated through a *DKG*. Every section key is signed by the previous section key (the key of the previous set of elders) to form a *Section chain*. The only exception is the very first section key, called the *genesis key*, which is the key of the section consisting of only one node - the *genesis node*, or the first node of the whole network.
### Section structure
Structure that contains information about the section the node is a member of. It contains:
- The list of current section elders
- The section [prefix](#prefix)
- The [section chain](#1-section-chain-and-proof-chain) (the last key of the chain is the [section key](#section-key))
- The container of all the members of the section
The current elders + prefix are grouped into a structure called `EldersInfo` which is signed using the current section key. The individual entries in the members container are also signed using the respective section key that was current at the time the entry was created.
### Signature aggregation
We use the [BLS](https://en.wikipedia.org/wiki/BLS_digital_signature) signature scheme (implemented by the [threshold_crypto](https://github.com/poanetwork/threshold_crypto) library), which is a [threshold scheme](https://en.wikipedia.org/wiki/Threshold_cryptosystem). A BLS signature is created by *aggregating* multiple *signature shares*. Only the elders of a section hold the secret key shares, and we set the BLS threshold to 2/3 of the number of elders. This means that each BLS signature is a proof that at least a supermajority of elders signed the corresponding message. This is how sections reach agreement on any actions without the need for a centralized authority.
See also [Agreement](#agreement).
### Sync message
A message that contains a copy of the `Section` and `Network` structures of the sending node. This message is used to synchronize the node state among the section members. All the data in the message is already signed with BLS signatures, so it can be actioned immediately without needing any additional agreement or aggregation. Actioning a `Sync` means *merging* the `Section` and `Network` contained in the message into the `Section` and `Network` the node already has.
### Upper layer
Also called *downstream* / *user* / *consumer* / ... - the library or binary that uses the sn-routing library.
### XorName
See [Name](#name).
## Pseudocode conventions
The pseudocode samples in this document use a syntax based on a simplified subset of [rust](https://www.rust-lang.org/). Additionally, the following constructs are used:
- `send(dst, message)` - send `message` to the peer at the given `dst` socket address (note this is a socket address (IP address + port), not a "XOR address").
- `broadcast(message)` - broadcast `message` to all the elders in the section.
- `receive()` - wait for an incoming message and return it.
- `emit(event)` - emit the given event to the [upper layer](#upper-layer).
- `error(error)` - signal that the given `error` has occurred. Errors are usually returned to the caller or logged.
- `proposal(proposal)` - create a [proposal](#proposal) which is a message signed by the BLS secret key share of the sender.
-----
## Join
When a node wants to join the network, it must first be provided with an [Ed25519](https://en.wikipedia.org/wiki/EdDSA#Ed25519) keypair. This can either be auto-generated or assigned by the user (loaded from a config file, ...). The public key also serves as the node's [name](#name).
Second, it needs a list of bootstrap contacts (socket addresses) of which at least one must belong to a currently active node in the network. It then bootstraps into the network by trying to connect to all of those contacts in parallel and the first connection that succeeds is then used as the initial bootstrap node.
Then it requests the information about the section to join, which is the section whose [prefix](#Prefix) matches the node's name. To do this, the node sends a `GetSectionQuery` to the bootstrap node and awaits a response. When the bootstrap node is already an elder in the matching section, it replies with `GetSectionResponse::Success` which contains the necessary information about the section:
```rust
struct SectionInfo {
    // Prefix of the section
    prefix: Prefix,
    // BLS public key set of the section
    pk_set: bls::PublicKeySet,
    // Current section elders
    elders: BTreeMap<XorName, SocketAddr>,
}
```
> [name=Gabriel] don't we need the key chain? and if so, we can also explain that the joining node would have the genesis key hard-coded so it can verify is joining the expected network and genuines Elders/section??
Otherwise, the bootstrap node replies with `GetSectionResponse::Redirect` which contains a list of socket addresses to repeat the query to. These should be "closer" to the matching section than the current bootstrap node is:
- When the bootstrap node is not an elder, it replies with addresses of its elders.
- When the bootstrap node is an elder, but not in the right section, it replies with addresses of elders of the section that is closest to the joining node name among all the sections it knows of.
When the joining node receives a `GetSectionResponse::Redirect`, it updates the list of bootstrap nodes and repeats the `GetSectionQuery` to all of them in parallel. This process keeps going until it eventually receives a `GetSectionResponse::Success`.
When that happens, it sends a `JoinRequest` with an empty resource proof to all the elders in the section it is joining. The section elders will respond with a `ResourceChallenge` message containing the data for the resource challenge the node must pass. This is to verify the node has enough resources (CPU, memory, storage and network bandwidth) to be useful to the network.
The joining node must solve the challenges from all the elders and send back a `JoinRequest` which contains the respective solution to each of them.
NOTE: The resource proof generator/solver is implemented in the [resource_proof](https://github.com/maidsafe/resource_proof) crate.
When an elder receives this message, it performs all the following checks:
- Validates the [message signature](#Message-signatures)
- Checks the node isn't or wasn't already a member of this section (under this name)
- Checks the node name matches the prefix of this section
- Checks the node has the latest section key
- Checks whether joins are allowed (`joins_allowed` flag is `true`)
- Checks whether it passed the resource proof
If all those checks pass, it proposes `Online` for the new node. This proposal contains the `MemberInfo` entry for the node with the `state` field set to `Joined`.
When the proposal reaches [agreement](#agreement), the elders insert the contained `MemberInfo` entry into the section members container and send a `NodeApproval` message to the new node (each elder sends one). Then they raise a `MemberJoined` event to the upper layers. From the point of view of the elder, the new node is now a valid member of the section.
When the new node receives the first valid `NodeApproval` message, it uses the information contained in it to initialize the [`Section` structure](#section-structure) and becomes a member of the section as an adult. Any other `NodeApproval` messages received afterwards are discarded.
### Pseudocode
From the point of view of the joining node
```rust
fn join() {
    let section_info = get_section_info() // get_section_info defined below

    for addr in section_info.elders {
        send(addr, JoinRequest { resource_proof_response: None })
    }

    loop {
        let message = receive()

        if message == ResourceChallenge {
            let solution = solve_resource_challenge(message.challenge)
            send(
                message.sender_addr,
                JoinRequest { resource_proof_response: Some(solution) },
            )
        }

        if message == NodeApproval {
            return create_node_state(message)
        }
    }
}

fn get_section_info() {
    for addr in hardcoded_contacts {
        connect_to(addr) // asynchronous
    }

    let bootstrap_addr = wait_for_first_successful_connection()
    let bootstrap_addrs = [bootstrap_addr]

    loop {
        for addr in bootstrap_addrs {
            send(addr, GetSectionQuery)
        }

        let message = receive()

        if message == GetSectionResponse::Success {
            return message
        }

        if message == GetSectionResponse::Redirect {
            bootstrap_addrs.remove(message.sender_addr)
            bootstrap_addrs.append(message.new_addrs)
        }
    }
}
```
From the point of view of a section elder
```rust
// Called when a `JoinRequest` message is received.
fn handle_join_request(request) {
    if !validate_signature(request) {
        error(InvalidSignature)
    }

    if !section.prefix.matches(request.sender_name) {
        error(PrefixMismatch)
    }

    if section.members.contains(request.sender_name) {
        error(AlreadyMember)
    }

    if request.resource_proof_response == None {
        send(request.sender_addr, ResourceChallenge)
    } else {
        if validate_resource_proof(request.resource_proof_response) {
            broadcast(proposal(Online(MemberInfo {
                peer: Peer {
                    name: request.sender_name,
                    addr: request.sender_addr,
                    // MIN_AGE is 'Infant', MIN_AGE + 1 is the minimal 'Adult' age
                    age: MIN_AGE + 1,
                },
                state: Joined,
            })))
        } else {
            error(InvalidResourceProofResponse)
        }
    }
}

// Called when an `Online` proposal reaches agreement.
fn handle_agreement_on_online(member_info) {
    section.members.update(member_info)
    send(member_info.peer.addr, NodeApproval)
    emit(MemberJoined(member_info.peer.name))
}
```
## Leave
When an elder tries to send a message to another member and that send fails, it declares the node as lost and proposes it `Offline`. This proposal contains the same `MemberInfo` entry as the corresponding `Online` proposal except the `state` field is set to `Left`.
When this proposal reaches agreement, the elders insert the contained `MemberInfo` entry with `State::Left` into the section members container, overwriting the previous entry which had the `State::Joined` state. Then they raise a `MemberLeft` event to the [upper layer](#Upper-layer). From the point of view of the elders, the node is no longer a member of the section (except when the node is an elder undergoing delayed relocation, which will be explained later).
NOTE: to speed up the lost-peer detection process, when the qp2p layer notifies us about a peer being disconnected, we immediately send them a `Ping` message. If this message gets sent successfully, it means the connection was re-established and we carry on. Otherwise we declare the peer lost, the same as for any other unsuccessfully delivered message as described above.
### Pseudocode
Graceful disconnect:
```rust
fn handle_disconnect(peer) {
    if send(peer.addr, Ping) == Error {
        broadcast(proposal(Offline(MemberInfo { peer, state: Left })))
    }
}

fn handle_agreement_on_offline(member_info) {
    section.members.update(member_info)
    emit(MemberLeft(member_info.peer.name))
}
```
Non-graceful disconnect:
```rust
// ...
if send(peer.addr, any_message) == Error {
    broadcast(proposal(Offline(MemberInfo { peer, state: Left })))
}
// ...
```
> [name=Gabriel] I'm thinking we can briefly mention the possible states of the nodes somewhere above, before exploring the joining and leaving flows, i see some appendix/sections exist further down, so at least a link or somehow re-orginise the sections?
## Incoming relocation
A [relocated](#relocation) node joining its new section goes through almost the same process as a newly joining node, except there is no resource proof; instead, the node must provide a `RelocatePayload` which proves that it has been relocated by the source section. The elders of the joining section perform the same steps as described in the [Join](#Join) part of this document, and additionally they verify the `RelocatePayload` and then accept the node with its new age.
## Elder churn
After every churn event (join or leave), the elders check whether the current [`ELDER_SIZE`](#ELDER_SIZE) [oldest](#node-ageing) members of the section differ from the current set of elders and, if so, trigger an elder churn, which is the process by which the section transitions to the new set of elders. While this process is ongoing, the current elders retain their status and perform their duties in the section as normal.
We will call the `ELDER_SIZE` oldest members of the section, when they differ from the current elders, the **elder candidates**.
NOTE: the elder candidates usually (but not necessarily) overlap with the current elders. For example:
- If a node that is older than the youngest current elder joins the section, the elder candidates will consist of the new node plus all the current elders except the youngest one.
- If one of the current elders goes offline, the elder candidates will consist of all the remaining current elders + the oldest adult of the section.
However, in rare cases the elder candidates might be completely different from the current elders. This most likely happens during [split](#split) where all the current elders end up in only one of the subsections and the candidates for the other one consist of only adults.
### Determining elder candidates
Because multiple nodes can have the same age, it's not enough to use just the age to decide the elder candidates. Instead the current members are sorted according to the following comparison rules:
1. The member with higher age wins over the one with lower age.
2. If they have the same age, the one that is currently elder wins.
3. If both are currently elders or both are currently not elders, the BLS signatures of their `Online` proposals are compared numerically and the smaller one wins.
The final rule is an arbitrary one whose only purpose is to break ties deterministically in a way that is hard to game.
After being sorted according to these rules, the first `ELDER_SIZE` ones are taken and become the elder candidates.
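In this document's pseudocode conventions, the ordering above can be sketched like this (the field and helper names are illustrative, not the actual code):

```rust
fn compare_candidates(a, b) {
    if a.age != b.age {
        return b.age.cmp(a.age) // 1. the older member wins
    }
    if a.is_elder != b.is_elder {
        return b.is_elder.cmp(a.is_elder) // 2. the current elder wins
    }
    // 3. tie-break on the numeric value of the `Online` proposal signatures;
    // the smaller one wins.
    a.online_signature.cmp(b.online_signature)
}

let elder_candidates = members.sorted_by(compare_candidates).take(ELDER_SIZE)
```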
### DKG
> [name=Gabriel Viganotti] nothing is said of how many new elder candidates at a time can be part of this process per candidates set, when and how are the candidates picked?
> [name=Adam Cigánek]
> > how many new elder candidates at a time can be part of this process
>
> All of them actually, but that's a detail of the DKG implementation we use which is the bls-dkg crate. But maybe it should be mentioned here as well?
> > when and how are the candidates picked
>
> This is described in the previous section. Or Is that not clear enough?
The elder candidates need to go through a Distributed Key Generation (DKG) process to get their section keys before they can become the new elders.
Once the elder candidates are determined, the current elders send them a `DKGStart` message (each elder sends one to each elder candidate):
```rust
struct DKGStart {
    elders_info: EldersInfo,
    dkg_key: DkgKey, // Unique identifier of the DKG session
}

struct EldersInfo {
    elders: BTreeSet<Peer>,
    prefix: Prefix,
}
```
(Note these structures are slightly simplified compared to how they are actually defined in the code, but the differences are not significant here.)
The elder candidates accumulate these messages and, once they receive them from a supermajority of the current elders, they start the DKG. The reason to wait for a supermajority is to prevent a single malicious node from triggering spurious DKGs in order to spam the network.
The actual DKG is implemented in the [bls-dkg](https://github.com/maidsafe/bls_dkg) crate and won't be described here in detail. But in summary, it consists of multiple rounds of messages being exchanged among the elder candidates until either the process completes successfully and the new keys are generated, or it fails.
> [name=Gabriel Viganotti] here the candidates in these rounds are only the new ones or this actually means the existing Elders plus the new candidates?? ...I guess the later, as they all (current Elders plus new candidates) need to agree on the same pk set?? it seems we are sometimes calling candidates to current Eleders plus new candidates, and sometime we call candidates only to the new ones ??
#### Successful DKG completion
When the DKG completes, it produces the new `PublicKeySet` (which is the same for everyone) and `SecretKeyShare` (which is unique per candidate and they must keep it secret). The candidates then sign the `EldersInfo` from the `DKGStart` message using this new `SecretKeyShare` and broadcast it back to the current elders in a `SectionInfo` message.
When the current elders receive a supermajority of these `SectionInfo` messages, they check whether the [current elder candidates](#Determining-elder-candidates) are the same as the elders set in the message and if so, they create an `OurElders` proposal containing the new key in order to append it to the section chain. Otherwise they discard it.
There are two rounds of proposals: one is the elder candidates proposing the new `EldersInfo` using the new key and one is the current elders proposing the new key using their current section key. This is necessary because we need two pieces of information and they both need to be signed:
1. The new section key (signed by the old section key) and
2. The new `EldersInfo` (signed by the new section key)
Once this completes, the new key is inserted into the section chain and the new `EldersInfo` replaces the old `EldersInfo` in the [`Section` struct](#Section-structure). It's important that these two things happen as a single atomic step to maintain the invariant that the current `EldersInfo` is signed by the current last key in the section chain.
Finally, the old elders broadcast a [`Sync` message](#Sync-message) to all the members of the section to update them about the new state of the section.
#### Failed DKG
When the candidate detects a DKG failure, it creates a failure proof and broadcasts it to the other candidates (not the current elders!). When a sufficient number of these proofs are accumulated, they are combined into a DKG failure agreement which is only then broadcast to the current elders. The reason the failures are accumulated by the candidates and not the current elders is to have the bulk of the work (including keeping any necessary state) be done by the candidates so there is less burden on the current elders who have enough other responsibilities already.
NOTE: It's not necessary to accumulate a supermajority of these proofs. `N - supermajority(N) + 1` of them (where `N` is the number of candidates) is enough, because at that point a successful completion is no longer possible. For example, with 7 candidates the supermajority is 5, so 3 failure proofs suffice.
When the current elders receive this failure agreement, they check whether the [current elder candidates](#Determining-elder-candidates) are still the same as the ones reporting the failure and if so, send them the `DKGStart` message again to restart the DKG. Otherwise they ignore it.
### Transition to the new elders
A node can detect that the current elders changed either through observing an agreement on the `OurElders` proposal or by receiving a `Sync` message. In both cases the subsequent process is the same.
When a node observes an elder change of its section, it raises an `EldersChanged` event to the upper layers:
```rust
struct EldersChanged {
    prefix: Prefix,
    key: bls::PublicKey,
    sibling_key: Option<bls::PublicKey>,
    elders: BTreeSet<XorName>,
    self_status_change: NodeElderChange,
}
```
When the node was not an elder before but is an elder now, the `self_status_change` is set to `Promoted`. When the node was an elder and is no longer one, it is set to `Demoted`. Otherwise it is set to `None`.
When the section splits, `prefix` contains the new post-split prefix and `sibling_key` contains the section key of the sibling section. When there is no split, `prefix` is the same as before the elder change and `sibling_key` is `None`. Splits will be explained in more detail later.
This concludes the elder change.
#### Lagging DKG
Only a supermajority of the elder candidates are necessary to trigger the elders change because only a supermajority is needed to aggregate the `SectionInfo` proposal. This means that some of the candidates (at most `N - supermajority(N)`) might observe the elders' change before they observe the completion of the DKG. In this case they are unable to sign section messages with the new key until they observe the DKG completion. This is not a problem because at least a supermajority of the elders have the new keys and so any section messages sent by them can already be aggregated.
### Multiple sets of elder candidates
In times of heavy churn there can be multiple sets of elder candidates being considered at the same time. Because different nodes might process the events in different orders, there is no guaranteed way to tell which of the candidate sets is going to be processed first. Some nodes might see one set first but others might see a different one. Due to eventual consistency, everybody should eventually see all the sets; however, the order in which this happens is unspecified. We don't use any ordering algorithm and instead let all the sets be processed concurrently. This means that there can be multiple DKGs (some nodes might even be participating in more than one DKG at a time), multiple different `SectionInfo` agreements and multiple `OurElders` agreements. The conflicts are only eventually resolved by the section chain which - being a CRDT - produces a consistent total order of the section keys without requiring any consensus. See the [Section chain appendix](https://hackmd.io/ERglUv22SrW_c2EjuGz-Cg?both#1-Section-chain-and-proof-chain) for more details.
This means that forks (conflicts) are **not prevented**, but they are **immediately resolved**. The difference is subtle, but important. It means there is no way to commit in advance to a particular set of elder candidates as a "better" set of elder candidates might come in at any time.
### Sequence diagram
```mermaid
sequenceDiagram
    participant A as Current Elders
    participant B as Elder Candidates
    participant C as Other members

    Note over A: Churn event
    A->>B: DKGStart
    Note over B: Receives DKGStart from a supermajority of elders
    B->>B: DKG

    alt DKG completes
        B->>A: SectionInfo
        Note over A: SectionInfo reaches agreement
        A->>A: Vote OurElders
        Note over A: OurElders reaches agreement
        Note over A: Update Section
        Note over A: Raise EldersChanged
        A->>A: Sync
        A->>B: Sync
        Note over B: Update Section
        Note over B: Raise EldersChanged
        A->>C: Sync
        Note over C: Update Section
        Note over C: Raise EldersChanged
    else DKG fails
        B->>B: DKGFailureObservation
        Note over B: DKGFailureObservations accumulate
        B->>A: DKGFailureAgreement
        opt elder candidates still the same
            Note over A: Trigger DKG restart
            A->>B: DKGStart
        end
    end
```
## Split
When a section grows past a certain size, it splits into two (roughly) equally sized subsections. The split is processed essentially as two concurrent elder changes - one for each subsection.
### Split detection
To trigger a split, the section must have at least [`RECOMMENDED_SECTION_SIZE`](#RECOMMENDED_SECTION_SIZE) members for each subsection. That is, given `N` is the length of the current section prefix (note the genesis section has a prefix length of 0), there must be at least `RECOMMENDED_SECTION_SIZE` members whose name has `0` as the bit at index `N`, and at least the same number whose name has `1` as the bit at index `N`.
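As an illustration with toy 8-bit names (real names are 256-bit), the split check amounts to counting members on each side of the bit at index `N`; `can_split` is a sketch, not the actual code:

```rust
// `bit_index` is the length of the current section prefix.
fn can_split(names: &[u8], bit_index: u32, recommended: usize) -> bool {
    // Count names whose bit at `bit_index` (from the left) is 1.
    let ones = names
        .iter()
        .filter(|&&name| (name >> (7 - bit_index)) & 1 == 1)
        .count();
    let zeros = names.len() - ones;
    ones >= recommended && zeros >= recommended
}

fn main() {
    // Genesis section (prefix length 0): look at the first bit of each name.
    let names: [u8; 4] = [0b0000_0001, 0b0100_0000, 0b1000_0000, 0b1100_0000];
    assert!(can_split(&names, 0, 2)); // two members for each subsection
    assert!(!can_split(&names, 0, 3)); // not enough for either subsection
}
```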
### Split processing
The split is processed the same way as a normal elder change, except it happens twice - once for each subsection. The two subsections are processed concurrently. This means there are two DKGs, then two `SectionInfo` agreements and finally two `OurElders` agreements. Each elder, however, applies only the change that pertains to their own subsection. They do, however, send the `Sync` messages for both subsections. This way, even if all the pre-split elders end up in only one of the subsections, the other one still receives the `Sync`s, which causes its new elders to be promoted.
NOTE: as with normal elders change, there can be multiple splits being processed concurrently and they are only eventually resolved after being applied to the section chain.
### Split barrier
A split is only driven to completion if the pre-split elders observe the `OurElders` agreement for both subsections. If they observe only one, they cache the `OurElders` proposal in a data structure called `SplitBarrier` and only when they observe the other one they finalize the split. The reason this is done is to avoid forgetting the sibling section info before the elders send the `Sync` message to them.
### Sequence diagram
```mermaid
sequenceDiagram
    participant E as Current Elders
    participant C0 as Elder candidates 0
    participant C1 as Elder candidates 1

    Note over E: Split triggered

    par
        E->>C0: DKGStart(0)
        C0->>C0: DKG
        C0->>E: SectionInfo(0)
        E->>E: Vote OurElders(0)
        Note over E: OurElders(0) reaches agreement
    and
        E->>C1: DKGStart(1)
        C1->>C1: DKG
        C1->>E: SectionInfo(1)
        E->>E: Vote OurElders(1)
        Note over E: OurElders(1) reaches agreement
    end

    Note over E: Update section

    par
        E->>C0: Sync(0)
        Note over C0: Update section
    and
        E->>C1: Sync(1)
        Note over C1: Update section
    end
```
## Node ageing
Each node in the network is assigned an age which is roughly a measure of how long the node has spent in the network doing useful work. The older a node is, the more likely it is to become an elder and thus receive important responsibilities. This is a [sybil attack](https://en.wikipedia.org/wiki/Sybil_attack) prevention mechanism: if an attacker tried to spawn a large number of zombie nodes, they would all start with low ages and thus have little effect on the network.
### Initial age
Nodes that are in the process of joining the network are called **infants** and they start with age 4. This gets incremented to 5 immediately when they join a section and become **adults**. The seemingly arbitrary number 4 was chosen to avoid too frequent relocations of young nodes (as would happen if it was 0, 1, ...).
### Relocations
A node's age increases when it gets relocated from one section to another. When the source section decides to relocate one of its members, it provides a proof of the node's new age, which the relocated node presents to the destination section when joining it. Based on this proof, the node joins the destination section with its new age.
### Relocation trigger
Relocations are triggered by churn events (joins and leaves). On each churn event, we perform the following check with each member of the section:
```
signature % 2^age == 0
```
Where `signature` is the BLS signature of the churn event and `age` is the age of the member being checked. If this returns true, that member will be relocated. This means that assuming the signature distribution is roughly uniform, the probability of a node being relocated on any churn event is `1/2^age`.
NOTE: if the check passes for an age `X` it also passes for any age < `X`. To avoid excessive relocations we limit them to only the oldest matching nodes.
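The check is easy to illustrate with the signature stood in for by a small integer (real BLS signatures are much larger, but the arithmetic is the same); `should_relocate` is a sketch, not the actual code:

```rust
// Mirrors `signature % 2^age == 0`: the low `age` bits must all be zero.
fn should_relocate(signature: u64, age: u32) -> bool {
    signature % 2u64.pow(age) == 0
}

fn main() {
    assert!(should_relocate(0b1010_0000, 5)); // the low 5 bits are zero
    assert!(!should_relocate(0b1010_0001, 5));
    // If the check passes for age 5, it also passes for any lower age:
    assert!(should_relocate(0b1010_0000, 4));
}
```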
### Relocation destination
After a node is chosen for relocation, the destination is calculated by obtaining the `sha3-256` hash of the relocated node name concatenated with the name of the node from the churn event:
```
let destination_addr = sha3_256(relocated-node-name <> churn-event-node-name);
```
This destination address determines the next section the node will join - it's the section that is [closest in XOR distance](https://medium.com/@maidsafe/structuring-network-with-xor-431e785b5ee7) to the destination address.
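Picking the XOR-closest candidate can be illustrated like this (names are shrunk to single bytes for brevity; real names are 256-bit `XorName`s, but the comparison works the same way):

```rust
// Returns the candidate name closest to `destination` in XOR distance,
// or `None` if there are no candidates. Names are single bytes here
// purely for illustration.
fn closest_to(destination: u8, section_names: &[u8]) -> Option<u8> {
    section_names.iter().copied().min_by_key(|name| name ^ destination)
}
```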
### Outgoing relocations
The elders of the section then create `RelocateDetails`:
```rust
struct RelocateDetails {
// The pre-relocation name of the node.
name: XorName,
// The destination to relocate the node to.
destination: XorName,
// A known section key of the destination section.
destination_key: bls::PublicKey,
// The post-relocation age of the node.
age: u8,
}
```
sign it with their BLS secret key shares and send it to the relocated node in a `Relocate` message. Then they propose the node `Offline`, but with the `state` field set to `Relocated` instead of `Left`. The reason for this is to prevent relocated nodes from rejoining the source section (rejoining is explained in more detail later). When the relocated node accumulates a supermajority of those `Relocate` messages, it combines the signature shares into a full BLS signature, stores it, leaves the section and re-joins the network.
The `destination_key` field is set to the latest section key of the destination section that the source section knows and is used by the relocated node to check trust of messages sent to it by the destination section.
### Relocation joins
The relocated node's process of rejoining the network starts similarly to a normal join - the node first requests information about the section to join. However, this time it requests the section matching the `destination` name instead of its own name. When it receives the response, it first generates a new `ed25519` keypair such that the name derived from this keypair matches the destination section prefix. It then signs this new name using the old keypair as proof that this is the node being relocated and not some impostor that somehow got hold of the `RelocateDetails`. The BLS-signed `RelocateDetails` plus the signature of the new name are then packed into a `RelocatePayload` structure, which is included in the `JoinRequest` message sent to the destination section. There is no resource proof challenge this time, as the `RelocatePayload` itself is an implicit proof that the node already passed one when it joined for the first time.
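The "generate a keypair until the derived name matches the prefix" step can be sketched as a loop. This is only a model: real code generates fresh `ed25519` keypairs and derives the name from the public key, whereas here the standard library's `DefaultHasher` over a counter stands in for key generation, and names are single bytes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for deriving a name from a freshly generated keypair: real code
// hashes a new ed25519 public key, here we hash an attempt counter instead.
fn derive_name(attempt: u64) -> u8 {
    let mut hasher = DefaultHasher::new();
    attempt.hash(&mut hasher);
    (hasher.finish() & 0xff) as u8
}

// Keep "generating keypairs" until the derived name falls under the
// destination section prefix (here: the top `prefix_len` bits of the name).
fn find_matching_name(prefix: u8, prefix_len: u32) -> u8 {
    (0u64..)
        .map(derive_name)
        .find(|&name| name >> (8 - prefix_len) == prefix)
        .expect("an 8-bit name space always contains a match")
}
```

Since each attempt matches a `prefix_len`-bit prefix with probability `1/2^prefix_len`, the expected number of attempts is `2^prefix_len`, which stays small for realistic prefix lengths.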
### Incoming relocations
When the elders of a section receive a `JoinRequest` from a relocated node, they verify the included `RelocatePayload` (they check the `ed25519` signature of the new name and the `BLS` signature of the `RelocateDetails`) in addition to the verification steps of a normal join. They use the contained `destination_key` to provide a sufficiently long proof chain for the `NodeApproval` message so that the joining node will trust it.
### Relocation of an elder (delayed relocation)
The relocation process is slightly different when the relocated node is currently an elder. The reason for this is to avoid relocating too many elders before their replacements are finalized. This is because any section agreement requires a supermajority of elders and so losing `N - supermajority(N) + 1` or more elders would mean the section would become incapable of reaching any agreement, including agreement on the next elder candidates. There are mechanisms for dealing with such situations, but they are meant as an emergency measure in case of a catastrophic node loss (due to large network outages, etc...). There is no reason for the network to cause such a situation itself if it can be avoided.
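The arithmetic here can be spelled out as a small sketch, where `supermajority` follows the "> 2/3" rule from the glossary:

```rust
/// Smallest number of elders strictly greater than 2/3 of `n`.
fn supermajority(n: usize) -> usize {
    n * 2 / 3 + 1
}

/// Losing this many elders leaves fewer than a supermajority behind,
/// making the section unable to reach any agreement.
fn fatal_loss(n: usize) -> usize {
    n - supermajority(n) + 1
}
```

With `ELDER_SIZE` of 7, `supermajority(7)` is 5, so losing `fatal_loss(7) == 3` elders at once would leave the section unable to agree on anything, including its next elder candidates.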
So, when an elder is selected for relocation, the actual relocation is delayed until the elder gets demoted. Instead of sending the node a `Relocate` message as they would for a non-elder, the elders send a `RelocatePromise`:
```rust
struct RelocatePromise {
// Pre-relocation name of the node
name: XorName,
// The destination to relocate the node to.
destination: XorName,
}
```
This message is also BLS-signed by the section, so it can be used as proof that the node was at some point selected for relocation. Then the elders propose the node `Offline` as well (with `state` set to `Relocated`, too).
When the relocated elder receives this message, it stores it locally and continues performing its elder duties as normal. It keeps doing so even after the `Offline` proposal reaches agreement. It is incentivized to do so by the promise that it will eventually be relocated and have its age increased.
Eventually that `Offline` agreement triggers an elder change where the elder candidates no longer include the relocated node(s). Once that change gets finalized and the relocated former elder observes its own demotion, it sends the `RelocatePromise` back to the new elders. They verify its BLS signature and send back a regular `Relocate` message. From that point on everything is handled just like the regular relocation as already described.
### Rejoins
When a node leaves the network for any reason other than relocation, it in theory loses all its age. If the leave is only temporary (a network outage, a device restart, ...), this would be too harsh, especially if the node has accumulated substantial age up to that point. To make things fairer, a node is allowed to rejoin the network under its existing name and retain a portion of its age. This is possible because even nodes that left are still kept in the members container (with `state` set to `Left`), so the section is always able to tell whether a node is joining for the first time or rejoining. If a rejoin is detected, the node is immediately relocated to the same section (`destination` is the current node name) with its age halved. However, if the halved age is less than or equal to the initial age (4), there would be no benefit in doing this, as the resulting age would be the same as that of a newly joined node. So to avoid pointless work, the join request is ignored and the node must join as an infant again.
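The rejoin decision above can be sketched as a single function (a sketch; the function name is illustrative):

```rust
const INITIAL_AGE: u8 = 4;

/// Decide what to do with a rejoining node, per the rules described above.
/// Returns the new (halved) age if the rejoin is worth a relocation, or
/// `None` if the request should be ignored and the node must join as an
/// infant again.
fn rejoin_age(old_age: u8) -> Option<u8> {
    let halved = old_age / 2;
    if halved > INITIAL_AGE {
        Some(halved)
    } else {
        None
    }
}
```

For example, a node of age 10 rejoins with age 5, while a node of age 8 or 9 halves to 4, gains nothing over an infant, and is therefore ignored.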
#### Abuse prevention
Let's say a node was relocated to a new section and it misbehaves there and is punished by being kicked out of it. It might try to circumvent this punishment by rejoining the original section, hoping to retain at least half of its age. To prevent this type of abuse, relocated nodes cannot rejoin their original section. This is achieved by setting their `state` to `Relocated` instead of `Left` when they are being proposed `Offline`. Only nodes with `Left` are allowed to rejoin.
### Sequence diagrams
Relocation of an adult:
```mermaid
sequenceDiagram
participant R as Relocated Node (R)
participant S as Src Elders
participant D as Dst Elders
Note over S: Churn event
Note over S: Relocation of R triggered
par
S->>R: Relocate
Note over R: Relocate accumulates
Note over R: Leaves src section
R->>D: GetSectionQuery
D->>R: GetSectionResponse
Note over R: Changes name to match dst section
R->>D: JoinRequest { relocate_payload: Some(..) }
Note over D: validates the JoinRequest
D->>D: Vote Online(R)
Note over D: Mark R as Joined
D->>R: NodeApproval
Note over R: Becomes member of dst section
and
S->>S: Vote Offline(R)
Note over S: Vote reaches agreement
Note over S: Mark R as Relocated
end
```
Relocation of an elder:
```mermaid
sequenceDiagram
participant R as Relocated elder (R)
participant O as Other Src Elders
participant N as New Src Elders
Note over R,O: Churn event
Note over R,O: Relocation of R triggered
par
O->>R: RelocatePromise(R)
Note over R: RelocatePromise accumulates
Note over R: RelocatePromise stored
and
Note over R,O: Vote Offline(R)
end
Note over R,O: Offline reaches agreement
Note over R,N: Elder change
Note over R: Observe own demotion
R->>N: RelocatePromise(R)
N->>R: Relocate(R)
Note over R,N: Continue like when relocating an adult...
```
## Appendix
### 1. Section chain and proof chain
The section chain is a data structure which holds an uninterrupted and independently verifiable chain of section keys ([BLS](https://en.wikipedia.org/wiki/BLS_digital_signature) public keys) all the way from the **genesis key**. It is used as a means to verify trust.
The **genesis key** is the first key in the chain and it forms the root of trust. It can be thought of as the identifier of the whole network and might be hardcoded in the node and client libraries/binaries. Alternatively, it can be published in some reasonably trusted location. The genesis key is the only key that is not signed. All the other keys in the chain are signed by some previous key.
The chain does not prevent forks, but it does automatically resolve them. This means that it's possible the chain contains two or more keys that are all signed with the same parent key, but the chain still presents the keys in a single linear sequence to the outside (thus the "chain" in the name). This fork resolution is completely local (doesn't require any coordination with the other nodes). The order is established using this algorithm:
1. Arrange the keys in a tree where a key that signs some other key is its parent.
2. Sort the keys having the same parent numerically (using the `Ord` relation of the `PublicKey` type).
3. Walk the resulting tree in a breadth-first fashion.
Example (assuming the lexicographical order corresponds to the numerical order here):
```
A->B->C->D
   |
   +->E->F->G
      |
      +->H
```
Would be sorted as:
```
A->B->C->E->D->F->H->G
```
All the mutable operations on the section chain are commutative, associative and idempotent, which means the section chain is a CRDT and thus achieves consistency without consensus.
A **proof chain** is a part of the section chain (could be the whole chain as well) which is attached to some signed data (e.g. a message) in order to provide a means to verify its trust. The complete verification process looks like this:
1. Verify the signature against the public key
2. Check the public key is present in the proof chain
3. Verify the proof chain against a set of already trusted keys.
When the data passes all those checks, it is considered trusted, as this means:
1. the data was signed by some group of nodes
2. the group of nodes was approved by some previous group of nodes which in turn was approved by yet another group of nodes and so on all the way to a group we already trust.
The genesis key is the root of trust so even if one doesn't have any other trusted keys they can always verify a proof chain that starts with the genesis key.
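The chain-verification part of the process (step 3) can be modeled with a toy structure. Keys are `u64`s here and a "signature" is simply the signing key recorded next to the signed key; real code verifies actual BLS signatures instead of comparing the `signed_by` field:

```rust
use std::collections::HashSet;

// A link records which earlier key vouched for (signed) this key.
struct Link {
    key: u64,
    signed_by: u64,
}

/// Toy proof-chain check: the chain must start from an already trusted key,
/// and every subsequent key must be signed by a key that is already part of
/// the chain.
fn verify_proof_chain(first_key: u64, links: &[Link], trusted: &HashSet<u64>) -> bool {
    if !trusted.contains(&first_key) {
        return false;
    }
    let mut known = HashSet::from([first_key]);
    links.iter().all(|link| {
        let signer_known = known.contains(&link.signed_by);
        known.insert(link.key);
        signer_known
    })
}
```

In this model, seeding `trusted` with only the genesis key is enough to verify any proof chain that starts from it, mirroring the statement above.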