Kademlia DHT Learning

# Kademlia DHT Learning ## Preliminary Concepts ### Map Map is a data structure used to store key-value pairs. ### Hash Hash function is used to generate a fixed-length key from an input of any size. This fixed-length key is crucial in DHTs because it allows keys to be organized in a binary tree structure. In Kademlia, the hash function produces a 160-bit output, which is used as the Node ID or key ### Binary Tree In Kademlia, once keys are hashed, they are organized in a binary tree structure. This structure allows efficient searching and routing. **Nodes are treated as leaves in binary tree, with each node's position deteramained by the shortest unique prefix of its ID.** ## Structure And Definiation ### NodeID NodeID is a unique identifier for each node in the Kademlia network. It is generated using a hash function and is 160 bits long. NodeIDs are essential for determining the position of a node in the network and for calculating distances between nodes. ### Distance Kademlia uses the XOR metric to measure the distance between two Node IDs. The distance between two nodes x and y is calculated as x ⊕ y . This metric is symmetric and unidirectional, meaning the distance from x to y is the same as from y to x. root / \ 0 1 / \ / \ 00 01 10 11 / \ / \ / \ / \ 000 001 010 011 100 101 110 111 Imagine a binary tree where each node represents a possible Node ID. The root of the tree is at the top, and each level of the tree represents one more bit of the Node ID. #### Relationship between Prefix and Distance Calculation Nodes that share a longer common prefix are closer to each other in the tree. For example, nodes 000 and 001 share a longer common prefix (00) than nodes 000 and 100, which only share the root (0). ### Routing Table The routing table in Kademlia is a key component that helps nodes find other nodes efficiently. It consists of multiple k-buckets, each covering an exponentially increasing range of distances from the node. #### k-bucket A k-bucket is a list of nodes that fall within a specific distance range from the current node. Each k-bucket can store up to \( k \) nodes (typically 20). Nodes in the same k-bucket are those with distances between \( 2^i \) and \( 2^{i+1} \) from the current node. #### Bucket Index and Depth - **Bucket Index**: The index \( i \) of a k-bucket corresponds to the distance range it covers. For example, the first k-bucket covers distances from \( 2^0 \) to \( 2^1 \), the second from \( 2^1 \) to \( 2^2 \), and so on. - **Depth**: The depth \( h \) of a node is defined as \( 160 - i \), where \( i \) is the smallest index of a non-empty bucket. This represents the distance from the root of the binary tree to the first non-empty k-bucket. ## Procedure ### Find Node `Find Node` takes 160-bits long ID as an argument and the receipent returns k closest nodes to the ID it knows in its k-buckets. The reponse contains no less than k items unless the receipent only knows less than k contacts. ### Find Value `Find Value` takes 160-bits long ID as an argument and the receipent returns k closest nodes to this ID it knows in its k-buckets, or returns the value if it knows. ### Node Lookup Node lookup is the most important procedure in Kademlia. It allows a node to find the closest nodes to a given target ID. The process employs a recursive algorithm to locate the k closest nodes to some given node ID. 1. **Initialization**: - The lookup initiator starts by picking α nodes from its closest non-empty k-bucket (or, if that bucket has fewer than α entries, it just takes the α closest nodes it knows of). - α is a system-wide concurrency parameter, typically set to 3. 2. **Sending FIND_NODE RPCs**: - The initiator sends parallel, asynchronous FIND_NODE RPCs to the α nodes it has chosen. 3. **Recursive Step**: - In the recursive step, the initiator resends the FIND_NODE RPC to nodes it has learned about from previous RPCs. This recursion can begin before all α of the previous RPCs have returned. - Of the k nodes the initiator has heard of closest to the target, it picks α that it has not yet queried and resends the FIND_NODE RPC to them. - Nodes that fail to respond quickly are removed from consideration until and unless they do respond. 4. **Handling Failures**: - If a round of FIND_NODEs fails to return a node any closer than the closest already seen, the initiator resends the FIND_NODE to all of the k closest nodes it has not already queried. 5. **Termination Conditions**: - The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen. ### Store The Store procedure is used to store a key-value pair in the network. The node hashes the key to determine its NodeID and then use `node lookup` to locate the k closest nodes to that ID. It sends STORE requests to these nodes to save the key-value pair. ### Refresh The Refresh procedure ensures that k-buckets are up-to-date. A node periodically selects random IDs within the ranges covered by its k-buckets and performs a `node lookup` for these IDs. This process helps discover new nodes and remove outdated entries from the k-buckets. ## Handling Node Joins and Leaves ### Node Join ### Bucket Splitting ## Data Redundancy and Replication ### Data Replication Data replication ensures that key-value pairs are stored on multiple nodes to prevent data loss. Each key-value pair is stored on the k closest nodes to the key's hashed ID. This redundancy ensures that even if some nodes leave the network, the data remains accessible. ### Republishing Keys To maintain data availability, nodes periodically republish the key-value pairs they store. This process involves performing a Node Lookup for each key and sending STORE requests to the k closest nodes. Republishing helps recover from node departures and ensures that data remains replicated across the network. ### Avoid over-caching The expiration time of a key-value pair is exponentially inversely propotional to the number of nodes between current node and the node whose ID is closest to the key ID. ## References For further reading and detailed understanding of the Kademlia DHT protocol, the following references are highly recommended: 1. **Kademlia: A Peer-to-peer Information System Based on the XOR Metric** - [Kademlia Paper](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf) 2. **Kademlia DHT Explained (Video)** - [Kademlia DHT Explained on YouTube](https://www.youtube.com/watch?v=7o0pfKDq9KE)