## Why was Raft proposed? [1]
1. Paxos is quite difficult to understand.
2. To facilitate the development of intuitions that are essential for system builders.
### Raft
Novel Features:
1. **Strong leader**: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
2. **Leader election**: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
3. **Membership changes**: Raft’s mechanism for changing the set of servers in the cluster uses a new ***joint consensus*** approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
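The randomized election timer in point 2 is the key trick that keeps split votes rare: each follower waits a different random interval before standing for election. A minimal sketch (the constants and class names here are illustrative, not from the paper; the paper suggests timeouts on the order of 150–300 ms):

```python
import random

# Hypothetical bounds; the Raft paper suggests e.g. 150-300 ms.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def new_election_timeout():
    """Pick a fresh randomized timeout so followers rarely time out
    at the same instant, which keeps split votes rare."""
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)

class Follower:
    def __init__(self):
        self.remaining_ms = new_election_timeout()

    def on_heartbeat(self):
        # A heartbeat (AppendEntries) from the leader resets the timer.
        self.remaining_ms = new_election_timeout()

    def tick(self, elapsed_ms):
        # If no heartbeat arrives before the timeout expires,
        # the follower becomes a candidate and starts an election.
        self.remaining_ms -= elapsed_ms
        return "candidate" if self.remaining_ms <= 0 else "follower"
```

Because each node draws a fresh random timeout, usually only one node times out first, wins the election, and suppresses the others with heartbeats.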
### Replicated state machines
* State machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down.
* Fault tolerance in distributed systems.
* Typically implemented using a replicated log.
* Keeping the replicated log consistent is the job of the consensus algorithm.

**Figure 1**: Replicated state machine architecture.
1. The consensus module on a server receives commands from clients.
2. Communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order.
3. Once commands are properly replicated, each server’s state machine processes them in log order.
4. The outputs are returned to clients.
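The four steps above can be sketched as a tiny state machine over a replicated log. This is an illustrative toy (the class and method names are mine, not the paper's); the real consensus module would replicate each entry to a majority before it is marked committed:

```python
class ReplicatedStateMachine:
    """Minimal sketch of Figure 1: once entries are committed,
    each replica applies them in log order to identical state."""

    def __init__(self):
        self.log = []          # replicated log of commands
        self.commit_index = 0  # entries before this index are safely replicated
        self.last_applied = 0
        self.state = {}        # the state machine: a simple key/value map

    def append(self, command):
        # In real Raft, the consensus module replicates this entry
        # to a majority of servers before it can be committed.
        self.log.append(command)

    def mark_committed(self, index):
        self.commit_index = index

    def apply_committed(self):
        """Apply committed-but-unapplied entries, in order, and
        return their outputs (which would go back to clients)."""
        outputs = []
        while self.last_applied < self.commit_index:
            key, value = self.log[self.last_applied]
            self.state[key] = value
            outputs.append((key, value))
            self.last_applied += 1
        return outputs
```

Because every replica applies the same committed entries in the same order, all replicas compute identical state even if some servers are temporarily down.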
#### Consensus algorithms for practical systems typically have the following properties:
* Safety: never returning an incorrect result.
* Availability.
* Do not depend on timing for the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
> [name=Hong-Jhih Lin] Time Coordination Issue in a Distributed System
* A command can complete as soon as a majority of the cluster has responded.
### Designing for understandability
* **Problem decomposition**: divide problems into separate pieces that can be solved, explained, and understood relatively independently. For example, Raft separates leader election, log replication, safety, and membership changes.
* **Simplify the state space** by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.
* Raft mostly tries to eliminate nondeterminism, but in some situations nondeterminism actually improves understandability; in particular, randomization simplifies the Raft leader election algorithm.
### The Raft consensus algorithm
> Please see [1] for more detail.
## Kafka Needs No Keeper [2]
### Limitations of Zookeeper
1. **Not really a high bandwidth system**. If you have a lot of consumers doing offset fetches and offset commits, it can really start to bog down the system.
2. **Security**: all of the security that we implemented was actually on the broker, and by going through Zookeeper you're bypassing that layer of security.
### Metadata
When a new controller comes up, we actually have to load all of the metadata from Zookeeper because remember, the metadata is not stored on the controller. The metadata is stored on Zookeeper.

* As your number of partitions increases, this loading time will increase as well.
* This loading time is actually critically important because while the controller is loading the metadata it can't be doing things like electing a new leader, stuff like that.
* This creates unavailability if you need the controller to do something while it is still loading the metadata.
* Pushing all of the metadata to the brokers is another O(N) operation.
* The number of brokers factors into this cost as well, because the information has to be sent to every broker.
### Why Raft
1. Kafka needs a self-managed quorum, with no dependency on an external system to choose the leader.
2. Leader election is by a majority of the nodes rather than being done by an external system.
3. The replication protocols are not too different. Raft has terms, Kafka has epochs. They're pretty similar.
4. Practically instant failover because Raft election is pretty quick.
### Metadata as an Event Log
- Each change becomes a message.
- Changes are propagated to all brokers.
- Clear ordering.
- Can send deltas.
- Offset tracks consumer position.
- Easy to measure lag.
> "delta" generally refers to a change or difference
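The bullets above can be made concrete with a small sketch: the log holds ordered metadata deltas, and each broker tracks its position with an offset, so lag is just a difference of offsets. The class names and delta format here are illustrative, not Kafka's actual implementation:

```python
class MetadataLog:
    """Sketch of metadata-as-an-event-log: each change is an
    ordered message appended to the log."""

    def __init__(self):
        self.events = []  # ordered metadata deltas

    def append(self, delta):
        self.events.append(delta)
        return len(self.events)  # log end offset

class BrokerCache:
    """A broker's local metadata cache, tracked by an offset."""

    def __init__(self):
        self.offset = 0
        self.metadata = {}

    def fetch(self, log):
        # Pull only the deltas past our current offset and apply
        # them in order; ordering makes the result deterministic.
        for delta in log.events[self.offset:]:
            self.metadata.update(delta)
        self.offset = len(log.events)

    def lag(self, log):
        # Lag is just a difference of offsets: easy to measure.
        return len(log.events) - self.offset
```

Because a broker only fetches deltas beyond its offset, a restart does not require re-reading the entire metadata state, and the leader can tell at a glance how far behind each broker is.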

### The Controller Raft Quorum
* The leader is the active controller.
* Controls reads / writes to the log.
* Typically 3 or 5 nodes, like ZooKeeper.

#### Instant Failover
* Low-latency failover via Raft election.
* Standbys contain all data in memory.
* Brokers do not need to refetch.
#### Metadata Caching
* Brokers can persist metadata to disk.
* Only fetch what they need.
* Use snapshots if we’re too far behind.
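The caching policy above implies a simple decision on restart: replay individual deltas when the broker is close to the end of the log, or load a snapshot when it has fallen too far behind. A hedged sketch (the threshold value and function name are hypothetical, not from the talk):

```python
# Hypothetical threshold: how far behind a broker may be before
# loading a full snapshot is cheaper than replaying deltas.
SNAPSHOT_THRESHOLD = 1000

def catch_up_plan(broker_offset, log_end_offset):
    """Decide how a restarting broker catches up on metadata:
    replay deltas when close, load a snapshot when too far behind."""
    lag = log_end_offset - broker_offset
    if lag == 0:
        return "up-to-date"
    if lag <= SNAPSHOT_THRESHOLD:
        return "fetch-deltas"
    return "load-snapshot"
```

Persisting metadata to disk keeps `broker_offset` nonzero across restarts, so in the common case a broker only needs the delta path.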
#### Shared controller node

* Fewer resources used.
* Single node clusters.
#### Separate controller nodes

* Better resource isolation.
* Good for big clusters.
#### Conclusion
Metadata should be managed as a log:
- Deltas, ordering, caching.
- Controller failover, fencing.
- Improved scalability, robustness, easier deployment.

The metadata log must be self-managed:
- Raft.
- Controller quorum.
## References and More Info.
1. [[PDF] In Search of an Understandable Consensus Algorithm, Stanford University](https://raft.github.io/raft.pdf)
2. [[InfoQ] Kafka Needs No Keeper, Mar 03, 2020](https://www.infoq.com/presentations/kafka-zookeeper/)