# Practical Byzantine Fault Tolerance Summary
###### tags: `Distrib-Comp`, `Fault Tolerance`, `Byzantine Faults`
## Properties of PBFT
Fault tolerant up to `⌊(n-1)/3⌋` faulty replicas (out of `n` total). Clients receive replies to their requests, and the replies are correct according to linearizability.
Offers both liveness and safety.
Does **not** rely on synchrony for safety - does not need known bounds on message delays and process speeds.
_Single_ round-trip to execute read-only operations - _two_ for read-write.
Efficient message auth - Public Key Cryptography is used only when there are _faults_.
## System Assumptions
Network connecting nodes is _unreliable_: message dropping, duplication, delays, out-of-orderness, etc. are all expected.
Faulty nodes behave _arbitrarily_ and fail _independently_: there is no single cause of failure shared across nodes (like, say, someone breaching every node with the same root password).
**All** replicas know the public keys of the **other** replicas to verify their signatures.
Adversaries can coordinate faulty nodes, delay communication, or delay correct nodes - but not indefinitely. We also assume adversaries are _computationally bound_ such that it is infeasible to subvert cryptographic techniques:
- They are unable to produce valid signatures for correct nodes
- Unable to recover a message's plaintext given only its digest
- Unable to cause digest collisions (hash collisions)
## Service Assumptions
The replicated service is invoked synchronously: a client blocks waiting for the reply to its request. Clients and replicas are non-faulty if they follow the specified algorithm **and** no attacker can forge their signature.
Safety is guaranteed regardless of the number of faulty clients: operations performed by faulty clients are _observed in a consistent way_ by non-faulty clients.
Safety means that the service satisfies linearizability: it behaves like a centralized implementation that executes operations atomically.
We guard against faulty clients by mandating some form of access control: the algorithm ensures that the effects of _access revocation_ are **observed consistently by all clients**.
Synchrony is relied upon for _liveness_, not safety: clients eventually receive replies to their requests provided that message travel time (assuming retransmission) `delay(t)` does not grow faster than `t` indefinitely.
The system does not address fault-tolerant privacy: a faulty replica may leak information to an attacker. Since service operations may perform arbitrary computations on their arguments and the service state, that information must be available in the clear at each replica to execute operations efficiently.
## Algorithm
### Fault Tolerance Threshold
`3f+1` is the minimum number of replicas that allow an asynchronous system to provide safety and liveness properties when up to `f` replicas (out of `n`) are faulty. It must still be possible for a decision to proceed after communicating with `n-f` nodes:
1. `f` nodes may be faulty and not respond: `n-f` must still allow a decision to be made.
2. Of these `n-f` respondents, up to `f` may themselves be faulty, so the non-faulty responses must outnumber the faulty ones: `n - 2f > f`, i.e. `n > 3f` (see the sketch below).
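A minimal arithmetic sketch of these bounds (plain Python; nothing here is PBFT-specific):

```python
# Minimal sketch: the replica-count and quorum arithmetic behind n = 3f + 1.
def min_replicas(f: int) -> int:
    """Smallest n that tolerates f Byzantine replicas (n > 3f)."""
    return 3 * f + 1

def quorum_size(n: int) -> int:
    """Replicas we can wait for without stalling: n - f.
    With n = 3f + 1 this is 2f + 1, so any two quorums intersect
    in at least f + 1 replicas, i.e. at least one non-faulty replica."""
    f = (n - 1) // 3
    return n - f

for f in range(4):
    n = min_replicas(f)
    print(f"f={f}: n={n}, quorum={quorum_size(n)}")
```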
### The Algorithm: Replicas
The service is modeled as a state machine replicated across the nodes of a distributed system: the set of replicas `R` is labelled `{0, ..., |R|-1}`, and `|R| = 3f+1`.
Replicas maintain the service state and implement the service operations. Within each configuration of replicas, called a `view`, a replica takes one of two roles:
1. The primary (the replica whose `ID == v mod |R|`, where `v` is the view number)
2. Backups (all other replicas)
View changes happen when it _seems_ that the **Primary has failed**.
#### Algorithm Process
1. Client sends request to _invoke service_ to **primary**.
2. **Primary** multicasts the request to the backups.
3. Replicas execute the request and send their replies to the _client_.
4. The client waits for `f+1` replies from _different replicas_ with the **same result** (our 'majority' vote requirement) which is the result of the operation.
Assumptions (illustrated by the sketch after this list):
- Replicas are _deterministic_: executing a given operation in a given state with a given set of arguments **always** produces the same result.
- Replicas all start in the **same state**.
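A toy illustration of these two assumptions (the `KVService` below is a hypothetical service, not part of PBFT): two replicas that start in the same state and apply the same operations in the same order produce identical results.

```python
# Minimal sketch: a deterministic service means same start state + same
# operation order => same results on every replica.
class KVService:
    def __init__(self):
        self.state = {}                 # every replica starts in the same state

    def execute(self, op: tuple):
        kind, key, *rest = op
        if kind == "put":
            self.state[key] = rest[0]
            return "ok"
        if kind == "get":
            return self.state.get(key)
        raise ValueError(f"unknown op {kind!r}")

ops = [("put", "x", 1), ("get", "x")]
a, b = KVService(), KVService()
assert [a.execute(o) for o in ops] == [b.execute(o) for o in ops]
```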
### Client Actions
:::info
- `<m>signed(i)` denotes Message `m` being signed by node `i`.
- `D(m)` denotes the digest of message `m`.
- Signing means that the message **digest** is signed and appended to the plaintext of the message, rather than signing the full message.
:::
Clients request execution of operations by sending `<REQ, op, t, client>signed(client)`, where:
- `REQ` marks that the message is a request
- `op` is the operation being requested
- `t` is the timestamp of the request (used to ensure _exactly-once_ semantics for execution of client requests - since requests block)
- `client` indicates which client node is sending the message.
Timestamps are ordered in such a way that later requests have higher timestamps than earlier ones: could be the value of the local clock.
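A sketch of building such a request under the signing convention above. The helper names (`digest`, `make_request`) and the use of an HMAC as a stand-in for a public-key signature are assumptions for illustration, not the paper's construction:

```python
import hashlib, hmac, time

def digest(message: bytes) -> bytes:
    """D(m): collision-resistant digest of the serialized message."""
    return hashlib.sha256(message).digest()

def make_request(op: bytes, client_id: int, client_key: bytes) -> dict:
    """Assemble <REQ, op, t, client>signed(client): the digest is signed
    and carried alongside the plaintext."""
    t = time.time_ns()                                  # local clock works as a timestamp
    plaintext = b"REQ|%d|%d|" % (t, client_id) + op
    # Stand-in for a signature: an HMAC keyed by the client's secret.
    sig = hmac.new(client_key, digest(plaintext), hashlib.sha256).digest()
    return {"plaintext": plaintext, "op": op, "t": t,
            "client": client_id, "sig": sig}

req = make_request(b"put x 1", client_id=7, client_key=b"client-7-secret")
```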
#### Replica Reply
Each message sent by the replicas to clients includes the current view number: allows clients to track `view` and hence current **primary**.
Replicas send replies to `REQ` directly to clients in the form `<REPLY,view,t,client,i, result>signed(rep_num)`, where:
- `REPLY` denotes that the message is a reply;
- `view` denotes the current view of the replicas;
- `t` is the timestamp of the request being replied to;
- `client` denotes the ID of the client node being replied to;
- `i` denotes which replica replied;
- `result` denotes the result of executing the requested operation.
Clients wait for `f+1` replies with valid signatures from _different_ replicas with the same `t` and the same `result` before accepting the result. This ensures that the result is valid, since at most `f` replicas may be faulty.
Clients that do not receive replies soon enough (after a timeout or similar) broadcast their request to **all replicas**: a replica that has already processed the request simply re-sends its reply (replicas remember the last reply they sent to each client); otherwise, a backup relays the request to the primary.
`Client -> timeout -> broadcast request to all replicas -> replica re-sends reply if already processed, relays to primary if not`
:::info
Note that the client is assumed to **wait for a request to complete before starting another**.
:::
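A sketch of the client-side reply collection described above. The reply layout (`i`, `t`, `result` keys) is an assumption; signature and view checks are elided:

```python
from collections import Counter

def accept_result(replies: list, f: int):
    """Accept a result once f + 1 replies from distinct replicas agree on it.
    Each reply is assumed to carry 'i' (replica id), 't' (request timestamp)
    and 'result'; signatures and timestamps are assumed already verified."""
    votes, seen = Counter(), set()
    for r in replies:
        if r["i"] in seen:              # count each replica at most once
            continue
        seen.add(r["i"])
        votes[r["result"]] += 1
        if votes[r["result"]] >= f + 1:
            return r["result"]
    return None                         # not enough agreement yet: keep waiting / retransmit

print(accept_result([{"i": 0, "t": 1, "result": "ok"},
                     {"i": 2, "t": 1, "result": "ok"}], f=1))   # -> 'ok'
```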
### Normal Case Operations
Each replica includes the state of the service, a _message log_ that contains accepted messages, and some indicator of the current view.
### Atomic Multicasting from Primary to Replicas
When the primary `p` receives a client request `m`, it begins a three-phase protocol (`pre-prepare`, `prepare`, `commit`): the first two phases totally order requests within a view (even if the primary is faulty), and the `prepare` and `commit` phases together ensure that requests that commit are totally ordered across views.
#### Phase 1: `pre-prepare`
The primary assigns a sequence number `n` to the request, multicasts a `pre-prepare` message with `m` to all backups, and adds the message to its own log.
The message is of the form `<PRE-PREPARE, v, n, d>signed(p), m`, where:
- `v` indicates the view (number) in which the message is being sent;
- `n` is the sequence number assigned;
- `d` is `D(m)` (the digest of `m`);
- `m` is the client's request message.
Pre-prepare messages are used in view changes as proof that the request was assigned sequence number `n` in view `v`.
Backups accept pre-prepare messages and **log them** provided that (these checks are sketched below):
- The signatures on both the `request` and `pre-prepare` messages are correct
- `d = D(m)`
- `v` matches the backup's current view
- No `pre-prepare` with the same `v` and `n` but a different digest has already been accepted
- The sequence number `n` is within the acceptable range `[h,H]` between known low-water and high-water marks.
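A minimal sketch of these acceptance checks. `BackupState`, its fields, and the `sigs_ok` flag (standing in for signature verification) are assumptions for illustration:

```python
from dataclasses import dataclass, field
import hashlib

def D(m: bytes) -> bytes:
    return hashlib.sha256(m).digest()

@dataclass
class BackupState:
    view: int
    h: int                                              # low-water mark
    H: int                                              # high-water mark
    pre_prepares: dict = field(default_factory=dict)    # (v, n) -> accepted digest

def accept_pre_prepare(st: BackupState, v: int, n: int, d: bytes,
                       m: bytes, sigs_ok: bool) -> bool:
    """Backup-side checks before logging a PRE-PREPARE."""
    if not sigs_ok:
        return False                          # request and pre-prepare must be properly signed
    if d != D(m):
        return False                          # d must equal D(m)
    if v != st.view:
        return False                          # must match the backup's current view
    prior = st.pre_prepares.get((v, n))
    if prior is not None and prior != d:
        return False                          # no other digest accepted for this (v, n)
    if not (st.h <= n <= st.H):
        return False                          # sequence number inside the watermarks
    st.pre_prepares[(v, n)] = d               # accept: log it
    return True
```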
#### Phase 2: `prepare`
A backup that accepts a `pre-prepare` enters the prepare phase by multicasting `<PREPARE, v,n,d,i>signed(i)` to all other replicas and adding both messages (the PRE-PREPARE and its own PREPARE) to its log:
- `v` indicates the view number of the replica `i`
- `n` is the sequence number assigned
- `d` is the digest of m, `D(m)`
- `i` is the ID of the replica sending out the message.
A replica `i` is `prepared(m,v,n,i)` iff it has logged:
- request `m`
- `pre-prepare` of `m` with view `v` and sequence number `n`
- `2f` prepares from different backups that match `pre-prepare` with same view, sequence number, and digest.
This ensures that if a non-faulty replica `i` is prepared for `m` with view `v` and sequence number `n`, then no non-faulty replica can be prepared for any other `m'` with the same `v` and `n` where `D(m') != D(m)`.
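A sketch of the `prepared` predicate over an assumed per-replica log structure (the `Log` dataclass and its field layout are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Log:
    requests: set = field(default_factory=set)         # digests of accepted client requests
    pre_prepares: dict = field(default_factory=dict)   # (v, n) -> digest
    prepares: set = field(default_factory=set)         # (v, n, digest, sender_id)
    commits: set = field(default_factory=set)          # (v, n, digest, sender_id)

def prepared(log: Log, d: bytes, v: int, n: int, f: int) -> bool:
    """prepared(m, v, n, i): request m, a matching pre-prepare, and 2f
    matching PREPAREs from distinct backups are all in i's log."""
    if d not in log.requests or log.pre_prepares.get((v, n)) != d:
        return False
    senders = {i for (pv, pn, pd, i) in log.prepares if (pv, pn, pd) == (v, n, d)}
    return len(senders) >= 2 * f
```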
#### Phase 3: `commit`
Once a replica `i` is `prepared` with a given request `m` with sequence number `n` and view `v`, it multicasts `<COMMIT, v,n,D(m),i>signed(i)` to the other replicas, starting the commit phase.
Replicas accept `COMMIT` messages and insert them into their logs as long as:
- They have been signed correctly (matching public-private key pair for `i`)
- The `v` is equal to the replica's current view number
- The sequence number is in `[h,H]` (the limiting range)
A request is `committed(m,v,n)` iff `prepared(m,v,n,i)` is true for every replica `i` in some set of `f+1` non-faulty replicas; it is `committed-local(m,v,n,i)` iff `prepared(m,v,n,i)` is true and replica `i` has accepted `2f+1` COMMIT messages (including its own) from different replicas that match the `pre-prepare` for `m` - that is, with the same `v`, `n`, and `D(m)`.
This ensures that if `committed-local(m,v,n,i)` is true at some non-faulty replica, then `committed(m,v,n)` is true - so non-faulty replicas agree on the sequence numbers of requests that commit locally, even if they commit in different views at different replicas.
Furthermore, it ensures that any request that commits locally at a non-faulty replica will eventually commit at `f+1` or more non-faulty replicas.
Each replica `i` executes the operation requested by `m` after `committed-local(m,v,n,i)` becomes true and after `i` has executed all requests with lower sequence numbers.
This ensures that all non-faulty replicas execute requests in the same order as required for safety. Replicas reply to the client after execution, discarding requests whose timestamp is lower than the timestamp in the last reply they sent to the client.
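Continuing the hypothetical `Log`/`prepared` sketch above, the committed-local check and the in-order execution rule might look like this (the `executable` map and `service` object are assumptions):

```python
def committed_local(log: Log, d: bytes, v: int, n: int, f: int) -> bool:
    """committed-local(m, v, n, i): prepared(...) holds at i and i has
    accepted 2f + 1 matching COMMITs (its own included) from distinct replicas."""
    senders = {i for (cv, cn, cd, i) in log.commits if (cv, cn, cd) == (v, n, d)}
    return prepared(log, d, v, n, f) and len(senders) >= 2 * f + 1

def execute_in_order(last_executed: int, executable: dict, service) -> int:
    """Execute committed-local requests strictly by sequence number;
    a request waits until every lower sequence number has executed."""
    while last_executed + 1 in executable:
        last_executed += 1
        service.execute(executable.pop(last_executed))   # reply to the client after execution
    return last_executed
```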
### Garbage Collection (Deleting the log)
Replicas need proof that their state is correct: request-related messages in the log must be kept until the requests they concern have been executed by at least `f+1` non-faulty replicas **and** this can be proven to others during view changes.
Replicas that miss messages must also be brought up-to-date by transferring their service state.
Hence:
- `checkpoints`: snapshots of the service state, produced every `x` requests (e.g. whenever the sequence number of an executed request is a multiple of some constant);
- `stable checkpoints`: checkpoints that carry a proof of correctness (described below).
Replicas must maintain multiple logical copies of service state:
1. Last `stable checkpoint`
2. 0+ `checkpoints` that have occurred since the last stable checkpoint (not yet stable)
3. Current state
When a replica `i` produces a checkpoint, it multicasts `<CHECKPOINT, n, d, i>signed(i)` to the other replicas:
- `n` is the sequence number of the last-executed request
- `d` is the digest of the state
- `i` is the replica
Once `2f+1` signed unique copies (including its own) with the same sequence number `n` and digest `d` are collected, they serve as the proof of correctness of a checkpoint.
A `checkpoint` with such a proof is a `stable checkpoint`, and the replica can then discard all `pre-prepare`, `prepare`, and `commit` messages with sequence number `n'` <= the checkpoint's `n`, as well as all earlier checkpoints and `CHECKPOINT` messages.
The checkpoint protocol is also used to advance the range of acceptable sequence numbers `[h,H]` (sketched after this list):
- `h` is the sequence number of the last stable checkpoint
- `H` is determined as `H = h+k`, where `k` is some number large enough that replicas do not stall waiting for a `checkpoint` to become a `stable checkpoint`.
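A sketch of collecting checkpoint proofs and sliding the watermarks. The `Watermarks` type, the `proofs` map, and the constant `K` are illustrative assumptions:

```python
from dataclasses import dataclass

K = 200   # hypothetical k: H = h + k, large enough that replicas don't stall

@dataclass
class Watermarks:
    h: int = 0          # sequence number of the last stable checkpoint
    H: int = K          # high-water mark, h + k

def on_checkpoint(proofs: dict, marks: Watermarks, n: int, d: bytes,
                  sender: int, f: int) -> bool:
    """Collect CHECKPOINT messages keyed by (n, d).  Once 2f + 1 distinct
    replicas (our own message included) agree, the checkpoint is stable:
    advance the watermarks and let the caller discard pre-prepare, prepare,
    commit and checkpoint messages with sequence number <= n."""
    proofs.setdefault((n, d), set()).add(sender)
    if len(proofs[(n, d)]) >= 2 * f + 1 and n > marks.h:
        marks.h, marks.H = n, n + K
        return True        # stable: garbage-collect the log up to n
    return False
```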
### View Changing
View changes are triggered by timeouts, which prevent backups from waiting indefinitely for a request to execute: a backup is _waiting_ when it has received a valid request but **has not executed it**.
A backup starts its timer when:
1. It receives a `request`, and
2. The timer is not already running.
It stops the timer when it is no longer waiting to execute that request, but restarts it if at that point it is waiting to execute some other queued request.
When the timer of a backup `i` times out in view `v`, it starts a view change to move the system to view `v+1`:
1. It stops accepting messages other than `checkpoint, view-change, new-view`.
2. It multicasts `<VIEW-CHANGE, v+1, n, C, P, i>signed(i)` to all replicas, where:
    1. `v+1`: the new view to change to
    2. `n`: the sequence number of the last `stable checkpoint`, `s`, known to `i`
    3. `C`: a set of `2f+1` valid checkpoint messages proving `s`
    4. `P`: a set containing one set `P(m)` for each request `m` that prepared at `i` with a sequence number greater than `n` (assembling this message is sketched after the list); each `P(m)`:
        1. Contains the valid `pre-prepare` message for `m` (without the client message)
        2. Contains `2f` matching, valid `prepare` messages from different backups with the same view, sequence number, and digest
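A sketch of assembling the `VIEW-CHANGE` message, reusing the hypothetical `Log`/`prepared` structures from earlier. The replica attribute names (`stable_checkpoint_seqnum`, `stable_checkpoint_proof`, `log`, `f`, `id`) are assumptions for illustration:

```python
def build_view_change(replica, new_view: int) -> dict:
    """Assemble <VIEW-CHANGE, v+1, n, C, P, i> from the replica's log."""
    n = replica.stable_checkpoint_seqnum              # last stable checkpoint s
    C = replica.stable_checkpoint_proof               # 2f + 1 CHECKPOINT messages proving s
    P = {}
    for (v, seq), d in replica.log.pre_prepares.items():
        if seq <= n or not prepared(replica.log, d, v, seq, replica.f):
            continue                                   # only requests prepared with seqnum > n
        P[seq] = {
            "pre_prepare": (v, seq, d),                # without the client message
            "prepares": [p for p in replica.log.prepares
                         if p[:3] == (v, seq, d)],     # the 2f matching prepares
        }
    return {"type": "VIEW-CHANGE", "v": new_view, "n": n,
            "C": C, "P": P, "i": replica.id}
```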
When the primary `p` of view `v+1` receives `2f` valid `view-change` messages for `v+1` from other replicas, it multicasts `<NEW-VIEW, v+1, V{}, O{}>signed(p)` to all other replicas, where:
1. `V{}` is a set containing the valid `view-change` messages received by `p` plus the `view-change` to `v+1` sent by the primary;
2. `O{}` is the set of `pre-prepare` messages without the request:
    1. The primary determines the sequence number `min-s` of the latest `stable checkpoint` in `V{}` and the highest sequence number `max-s` in a `prepare` message within `V{}`.
    2. The primary creates a new `pre-prepare` message for view `v+1` for each sequence number `n'` in `(min-s, max-s]`:
    3. If there is at least one set in the `P` component of some `view-change` message in `V{}` for sequence number `n'`:
        - The primary makes `<PRE-PREPARE, v+1, n', d>signed(p)`, where `d` is the digest of the request in the `pre-prepare` message for sequence number `n'` with the highest view number in `V{}`.
    4. If there is no such set:
        - The primary makes `<PRE-PREPARE, v+1, n', d(null)>signed(p)`, where `d(null)` is the digest of a special `null` request that causes no execution.
The primary then appends the messages in `O{}` to its log; if `min-s` is greater than the sequence number of its latest stable checkpoint, it also inserts the proof of stability for the checkpoint with sequence number `min-s` into its log and discards older information according to [Garbage Collection](#garbage-collection-deleting-the-log).
Only then does it **enter** `v+1`, accepting new messages for `v+1`.
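A sketch of computing `O{}` from `V{}` as described above. The dictionary layout of each `view-change` entry (keys `n` and `P`) and the `NULL_DIGEST` constant are assumptions:

```python
NULL_DIGEST = b"null-request-digest"       # digest of the special no-op request

def build_O(V: list, new_view: int) -> list:
    """Compute the O{} set of a NEW-VIEW message: one PRE-PREPARE per
    sequence number in (min-s, max-s]."""
    min_s = max(vc["n"] for vc in V)                          # latest stable checkpoint in V
    max_s = max((seq for vc in V for seq in vc["P"]), default=min_s)
    O = []
    for n in range(min_s + 1, max_s + 1):
        # Any prepared set for n in V?  Take the pre-prepare with the highest view.
        candidates = [vc["P"][n]["pre_prepare"] for vc in V if n in vc["P"]]
        if candidates:
            v, _, d = max(candidates, key=lambda pp: pp[0])
            O.append(("PRE-PREPARE", new_view, n, d))         # re-propose the prepared request
        else:
            O.append(("PRE-PREPARE", new_view, n, NULL_DIGEST))  # fill the gap with a null request
    return O
```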
A backup accepts `NEW-VIEW` messages for a view `v+1` if:
1. It is signed properly;
2. The `VIEW-CHANGE` messages it contains are valid for `v+1`;
3. The set `O{}` is correct: the backup verifies this by performing the same computation the primary used to generate `O{}`.
It then adds the new information to its log as described for the primary, multicasts `prepare` for each message in `O{}`, adds the `prepare`s to its log, and enters `v+1`.
The protocol then enters normal operation like [normal.](#normal-case-operations)
#### Redoing the Protocol and State Transfer After a View Change
Replicas redo the normal protocol for the messages between `min-s` and `max-s` but avoid re-executing client requests by using stored information about last reply sent to each client.
Replicas missing some request message `m` or a stable checkpoint (these are not included in `new-view` messages) can obtain the missing information from another replica, e.g.:
1. Obtain the missing checkpoint state `a` from one of the replicas whose checkpoint messages certified its correctness in `V{}`.
2. Since `f+1` of those replicas must be correct, replica `i` will always obtain `a` or a certified stable checkpoint with a later sequence number.
3. To bring a replica up to date, only the state partitions where it is out of date need to be sent.
## Proof of Algorithm Correctness
### Safety of System
Safety is achieved when all non-faulty replicas agree on sequence numbers of requests committed locally.
If `prepared(m,v,n,i)` is true at some non-faulty replica `i` in view `v`, then `prepared(m',v,n,j)` cannot be true at any non-faulty replica `j` for any `m'` with `D(m') != D(m)`.
This means that non-faulty replicas agree on the sequence numbers of requests that commit locally in the same view.
`View-change` protocol ensures that non-faulty replicas also agree on the sequence numbers of requests that `commit-local` in different views at different replicas.
Request `m` commits locally at a non-faulty replica with seqnum `n` in view `v` only if `committed(m,v,n)` is true - meaning that there is a set `R1` containing at least `f+1` non-faulty replicas such that `prepared(m,v,n,i)` is true for every replica `i` in the set.
Non-faulty replicas will not accept a `pre-prepare` for a view `v' > v` without first receiving a `NEW-VIEW` message for `v'` (which is what allows them to accept messages in the new view) - and any correct `new-view` message for `v' > v` contains correct `view-change` messages from a set of `2f+1` replicas. Since there are only `3f+1` replicas in total, this set must intersect `R1` in at least one non-faulty replica, whose `view-change` message carries the fact that `m` prepared with sequence number `n`. The new-view processing therefore re-proposes `m` for `n`, so no different request can ever commit with sequence number `n` in a later view:
**Replicas Agree on the Request that Commits Locally with Seqnum `n`**
### Liveness of System
Providing liveness means replicas must move to a new view if they are unable to execute a request; at the same time, we must maximise the period of time in which at least `2f+1` non-faulty replicas are in the same view, and ensure this period keeps growing (exponentially) until some requested operation executes.
To avoid starting a `view-change` too soon, a replica that multicasts a `VIEW-CHANGE` message for view `v+1` waits for `2f+1` view-change messages for view `v+1` and then starts a timer to expire after some time `T`:
If the timer expires before:
- a valid `new-view` message for `v+1` arrives, or
- the replica executes a request in `v+1` that it had not executed previously,
then it starts a `view-change` for view `v+2`, this time waiting `2T` before starting a view change for `v+3`, and so on.
If a replica receives a set of `f+1` valid `view-change` messages for a view greater than the current view, it sends `view-change` for the smallest view number within the set, even if its own timer has not expired: This prevents it from starting a view-change too late.
Faulty replicas are also unable to impede progress by forcing frequent view changes: a view change only happens when at least `f+1` replicas send `view-change` messages, so the (at most `f`) faulty replicas cannot force one on their own. A faulty replica **can** cause view changes while it is the primary (by not sending messages, or by sending bad ones), but since the primary of view `v` is the replica `p = v mod |R|`, faulty replicas can hold the primary role for at most `f` consecutive views.
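A small sketch of the doubling view-change timeout described above (the base period `T` is an assumed deployment parameter):

```python
def view_change_timeout(T: float, k: int) -> float:
    """Timeout used while trying to move from view v to view v + 1 + k:
    T, 2T, 4T, ...  Doubling gives correct replicas ever-longer windows
    in which at least 2f + 1 of them share a view."""
    return T * (2 ** k)

print([view_change_timeout(0.5, k) for k in range(4)])   # [0.5, 1.0, 2.0, 4.0]
```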