# Dealing with failure

## Logging, Consensus, and Real World Applications

In a distributed system, it is less a question of if and more a question of when a node will fail. As such, it is important to have a backup plan. How can a system ensure that, upon a failure, other nodes can pick up where the failed node left off? Furthermore, how can it ensure that the nodes being failed over to are consistent with the current state of the system?

I’m Alexander Williams, and in this blog I’ll be covering how these problems can be solved. I’ll be using *Why Logical Clocks Are Easy* by Baquero and Preguiça to first introduce the concept of tracking what happens, and when, without using physical time. Logical clocks in turn will enable a better understanding of Raft, a consensus algorithm introduced in *In Search of an Understandable Consensus Algorithm* by Ongaro and Ousterhout. Raft by itself offers a solution to our two problems, but does not provide the necessary context of how it is actually used in the field. To showcase consensus algorithms in practice, and to answer the age-old question “Who cares?”, I will use *The Google File System* by Ghemawat, Gobioff, and Leung. While it does not use Raft, GFS is a very illustrative example of when exactly consensus is applicable, if not an absolute necessity.

### Logical Clocks: The "What" of logging system state

And so we start with logical clocks. Logical clocks, in short, are a way to capture potential causality by logging events in relation to the events that came before them in the distributed system. They are an alternative to physical timestamps, which are generally unreliable since clocks can fall out of sync and produce misleading results about when an event actually occurred. With event-based logging, this issue is sidestepped entirely because each event’s “timestamp” is based upon what happened before it.

The happens-before relationship was originally defined by Lamport in 1978: event A happens before event B iff A occurred before B on the same node, or B knows about A through some chain of communication originating from a node that knows about A.

___

Happens-before can be used to create comprehensive histories in multiple ways. First, Baquero presents causal histories, which log every known event each time a new one happens. The set of known events then solidifies happens-before relationships (if an event is in the set, it happened before the current event) but carries the downside of massive inefficiency: as more events happen on each node, the set continues to grow, and as more nodes are added and more events happen, the space penalty only increases.

![Causal Histories](https://dl.acm.org/cms/attachment/02af3fd3-0207-4b34-b608-91095be37a00/baquero2.png)

:::info
The above diagram from Baquero illustrates both happens-before and the inefficiencies of causal histories:
- a1 happens before a2, because at a2 the history includes a1 and a2. a1 also happens before b2, because communication between node A and node B meant B knew about a1.
- c2 does not have a happens-before relationship with b3 (they are concurrent): b3 is not in c2’s history (which contains only c1 and c2), and c2 is not in b3’s.
- c3 begins to show how rapidly causal histories grow in size, because every event that has happened before is re-logged each time.
:::
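To make the idea concrete, here is a minimal sketch of causal histories in Python. It is my own illustration rather than code from the paper: each event’s history is the set of every event known when it occurred, and happens-before reduces to set membership.

```python
def happens_before(event_a, event_b, history_b):
    """a -> b iff a appears in b's causal history (and a is not b itself)."""
    return event_a != event_b and event_a in history_b

# Node A records a1, then a2; each history contains the event itself
# plus everything known beforehand.
hist_a1 = {"a1"}
hist_a2 = hist_a1 | {"a2"}

# Node B records b1, then receives a message carrying a1's history and records b2.
hist_b1 = {"b1"}
hist_b2 = hist_b1 | hist_a1 | {"b2"}

print(happens_before("a1", "b2", hist_b2))  # True: a1 is in b2's history
print(happens_before("b1", "a2", hist_a2))  # False: b1 and a2 are concurrent
```

The inefficiency is visible even in this toy example: every new event copies the entire set of everything that came before it.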
___

As a solution to this cost, Baquero presents a more efficient logging structure: the vector clock. Instead of logging every event that has ever occurred, a vector clock maintains one counter per node, tracking the most recent event seen from that node. The end result is a significantly smaller data structure with a simple representation of happens-before: A happens before B iff A[i] is less than or equal to B[i] for every index i, and the two vectors are not identical.

![Vector Clock Diagram](https://i.imgur.com/zk0CdnA.png)

:::info
This diagram, also from Baquero, displays both the simplicity and efficiency of vector clocks. [1,0,0] happens before [2,0,0] and [2,2,0], but does not happen before [0,1,0]. No matter how many events occur, in this three-node system an event's vector will only ever have 3 elements.
:::
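That comparison is simple enough to fit in a couple of lines; the sketch below is again my own illustration, using plain Python lists for the vectors.

```python
def happens_before(a, b):
    """Vector clock comparison: a -> b iff a <= b element-wise and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    """Neither event happened before the other."""
    return not happens_before(a, b) and not happens_before(b, a)

# The clocks from the diagram above (one slot per node: A, B, C).
print(happens_before([1, 0, 0], [2, 2, 0]))  # True
print(happens_before([1, 0, 0], [0, 1, 0]))  # False
print(concurrent([2, 0, 0], [0, 1, 0]))      # True: no causal path either way
```

Compared to causal histories, the cost per event is fixed by the number of nodes rather than growing with the number of events.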
___

With that, we now have a logging system that can track a system's state, and a fairly simple one at that. It successfully solves the problem of tracking causal relationships without relying on physical timestamps, but it carries one significant drawback. If all nodes remain operational forever there is no issue, but as soon as one node fails, an event may be improperly logged.

> Imagine Node 2 fails. If Node 1 communicates with Node 2, that communication will not get logged. If Node 2 comes back online and performs another action before it learns of Node 1's communication, the logs are now out of order.

We need a more fault tolerant system, one that is able to handle not every node being consistent. This is where Raft steps in.

### Raft: The "How" of logging state with potential failure

Raft, introduced by Ongaro and Ousterhout in 2014, is a consensus algorithm with the goal of managing a replicated log. Proposed as an alternative to Paxos, an algorithm designed by Lamport, Raft is intended to be a simpler, fault tolerant algorithm with equivalent performance. To illustrate its usefulness, I will split the description into three parts: a normal election, normal log replication, and then fault tolerance. Through this description, we will be able to see how a logging system can effectively track events in a distributed system in the face of realistic failure expectations.

___

An election works as follows: when a node joins, it is a follower and inherits the current leader’s term. The leader is responsible for sending heartbeat messages to all follower nodes in order to remain leader. If a node receives no heartbeat for a period of time known as the election timeout, it increments its term and becomes a candidate. As a candidate, it sends vote requests to the other nodes, at which point one of three things happens:

- The candidate is elected leader after receiving a majority of votes
- Another node is elected leader and the candidate returns to being a follower
- No node is elected leader due to a split vote, in which case the election process starts again.

Split votes are rare because election timeouts are randomized. In addition, we begin to see the use of logical clocks in terms: one of the guarantees of Raft is that if an entry has an earlier term and has been committed, it must have happened before the current term.

___

We now move on to log replication. As leader, a node is responsible for sending append-entries calls to each of the followers and committing each entry once it is logged by a majority of nodes. These entries contain the term, the index in the log, and the command to execute. Three properties are maintained as invariants:

- If an entry has an earlier term and has been committed, it must have happened before the current term
- Entries in different logs with the same term and index must log the same command
- If those entries are the same, all preceding entries must also be the same.

Once again we are reminded of vector clocks: if the most recent entry is present, all previous entries must also be present. In addition, terms serve as an extra logical mechanism, preserving happens-before between leaders. This time, we also have one more important guarantee: the previous entries are in the correct order, solving the issue raised earlier with logical clocks.

___

Now that the mechanism for maintaining the system's current state is in place, we can introduce fault tolerance to the algorithm. First, to ensure election safety, some restrictions apply. The most important is that a new leader must have all the committed entries from previous terms. With this restriction, a new leader must respect the happens-before relationship and cannot overwrite old committed indexes.

> In practice: if a node fails, misses several commits, and comes back online during an election, it cannot be elected leader.

For replication safety, a new leader never commits entries from previous terms on its own; this means new leaders do not have to reason about entries that were only partially replicated under a previous term, and such entries instead become committed once the leader commits an entry from its own term.

> In practice: a leader trying to reorganize old, uncommitted entries leads to incredible complexity (especially if it fails while trying to make that commit). By simply ignoring and overwriting conflicting uncommitted entries, this complexity is avoided.

Additionally, a leader will continue to send its current index along with append entries and heartbeats until all nodes are up to date. In doing so, it allows followers that have failed or lost an entry to eventually catch back up.

> In practice: if an append is missed due to network unreliability or a follower failure, the follower can be brought back up to date as long as a later append carrying the same information eventually reaches it.
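To tie these rules together, here is a simplified sketch of how a follower might handle a leader's append call. This is my own illustration of the mechanics described above, not the paper's reference design, and it glosses over persistence, elections, and several edge cases (real implementations are more careful about which entries they truncate, for example).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int      # term of the leader that created the entry
    command: str   # command to apply to the state machine

class Follower:
    def __init__(self):
        self.current_term = 0
        self.log = []           # self.log[0] holds index 1 in Raft's 1-based numbering
        self.commit_index = 0   # highest index known to be committed

    def append_entries(self, leader_term, prev_index, prev_term, entries, leader_commit):
        """Returns True on success; False tells the leader to retry with an earlier prev_index."""
        # Reject calls from stale leaders.
        if leader_term < self.current_term:
            return False
        self.current_term = leader_term

        # Consistency check: the log must already contain an entry at prev_index
        # with prev_term. If it doesn't, the leader backs up and resends older
        # entries, which is how a failed or lagging follower catches back up.
        if prev_index > 0 and (len(self.log) < prev_index
                               or self.log[prev_index - 1].term != prev_term):
            return False

        # Drop everything after prev_index (anything conflicting there is
        # uncommitted and safe to overwrite) and append the leader's entries.
        self.log = self.log[:prev_index] + list(entries)

        # Advance the commit index as far as the leader says is safe.
        self.commit_index = min(leader_commit, len(self.log))
        return True
```

The failure cases from the callouts above map directly onto this function: a missing entry at prev_index makes the call fail until the leader backs up far enough, and conflicting uncommitted entries are simply overwritten.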
___

While I cannot speak to the authors’ goal of making a more understandable consensus algorithm, I do believe they achieved their goal of building a strong foundation for practical systems by separating out the core elements of a consensus protocol. With these fault tolerance measures, Raft’s logs remain reliable in the face of failure, and in turn the larger system as a whole is able to recover from those logs; the guarantee of consistency between replicas is key to this.

### Google File System: The "Why" of logging and consensus

Raft shows that it is possible to have an effective backup plan for distributed systems, enabling systems to be built with the expectation that machines will fail and that recovery must be possible. While the Google File System does not use Raft as its consensus protocol, it is a perfect real-world example of fault tolerant consensus at work. From a high level, GFS is a distributed file system that treats component failure as one of its core assumptions. Its architecture accounts for those failures by using easily replicated chunkservers, all serviced by a single master node that holds all system metadata and frequently communicates with the chunkservers to provide instructions and collect state.

The most relevant part of the Google File System is actually hidden in its single master node. Since the master holds all of the metadata, its failure would be absolutely catastrophic, and as such it is important to maintain several replicas. To accomplish this, the master’s operation log is maintained on multiple machines, and a change is only committed once its log record has been replicated on all of the other replicas. If this sounds Raft-like, it should: the consensus algorithm allows for instant failover to replicas, and if an uncommitted change is lost it will only affect a few clients (as would losing a chunkserver) rather than all of them.

Between the replicas on chunkservers and the replicated logs of the master, GFS effectively accomplishes the authors’ goal of creating a highly available, fault tolerant file system. Moreover, it provides a case study of how, in the field, consensus algorithms and proper logging enable fault tolerance in distributed systems as a whole.

### So why should I care?

With this whirlwind tour of logging and an example of it in action, I hope to have instilled a greater appreciation for causality as a mechanism for enabling fault tolerance in distributed systems. We started with logical clocks, which serve as the backbone of consensus algorithms by determining causality at the event level. They freed systems from relying on physical timestamps and allowed distributed systems to precisely answer the question of what happened, and in what order. We then brought in Raft, which took the idea of logical timestamps and both asked and answered what should happen as nodes fail, making the logging system itself fault tolerant. Finally, we saw a distributed system using a consensus algorithm in action, relying on its mechanics to make the greater system fault tolerant and proving that in practice, these algorithms are effective. With these tools to track causality, we are able to build much more intricate distributed systems and shift the focus away from preventing failure and towards the architecture itself.