# Kademlia DHT Learning
## Preliminary Concepts
### Map
Map is a data structure used to store key-value pairs.
### Hash
Hash function is used to generate a fixed-length key from an input of any size. This fixed-length key is crucial in DHTs because it allows keys to be organized in a binary tree structure.
In Kademlia, the hash function produces a 160-bit output, which is used as the Node ID or key
### Binary Tree
In Kademlia, once keys are hashed, they are organized in a binary tree structure. This structure allows efficient searching and routing.
**Nodes are treated as leaves in binary tree, with each node's position deteramained by the shortest unique prefix of its ID.**
## Structure And Definiation
### NodeID
NodeID is a unique identifier for each node in the Kademlia network. It is generated using a hash function and is 160 bits long. NodeIDs are essential for determining the position of a node in the network and for calculating distances between nodes.
### Distance
Kademlia uses the XOR metric to measure the distance between two Node IDs. The distance between two nodes x and y is calculated as x ⊕ y . This metric is symmetric and unidirectional, meaning the distance from x to y is the same as from y to x.
root
/ \
0 1
/ \ / \
00 01 10 11
/ \ / \ / \ / \
000 001 010 011 100 101 110 111
Imagine a binary tree where each node represents a possible Node ID. The root of the tree is at the top, and each level of the tree represents one more bit of the Node ID.
#### Relationship between Prefix and Distance Calculation
Nodes that share a longer common prefix are closer to each other in the tree. For example, nodes 000 and 001 share a longer common prefix (00) than nodes 000 and 100, which only share the root (0).
### Routing Table
The routing table in Kademlia is a key component that helps nodes find other nodes efficiently. It consists of multiple k-buckets, each covering an exponentially increasing range of distances from the node.
#### k-bucket
A k-bucket is a list of nodes that fall within a specific distance range from the current node. Each k-bucket can store up to \( k \) nodes (typically 20). Nodes in the same k-bucket are those with distances between \( 2^i \) and \( 2^{i+1} \) from the current node.
#### Bucket Index and Depth
- **Bucket Index**: The index \( i \) of a k-bucket corresponds to the distance range it covers. For example, the first k-bucket covers distances from \( 2^0 \) to \( 2^1 \), the second from \( 2^1 \) to \( 2^2 \), and so on.
- **Depth**: The depth \( h \) of a node is defined as \( 160 - i \), where \( i \) is the smallest index of a non-empty bucket. This represents the distance from the root of the binary tree to the first non-empty k-bucket.
## Procedure
### Find Node
`Find Node` takes 160-bits long ID as an argument and the receipent returns k closest nodes to the ID it knows in its k-buckets. The reponse contains no less than k items unless the receipent only knows less than k contacts.
### Find Value
`Find Value` takes 160-bits long ID as an argument and the receipent returns k closest nodes to this ID it knows in its k-buckets, or returns the value if it knows.
### Node Lookup
Node lookup is the most important procedure in Kademlia. It allows a node to find the closest nodes to a given target ID. The process employs a recursive algorithm to locate the k closest nodes to some given node ID.
1. **Initialization**:
- The lookup initiator starts by picking α nodes from its closest non-empty k-bucket (or, if that bucket has fewer than α entries, it just takes the α closest nodes it knows of).
- α is a system-wide concurrency parameter, typically set to 3.
2. **Sending FIND_NODE RPCs**:
- The initiator sends parallel, asynchronous FIND_NODE RPCs to the α nodes it has chosen.
3. **Recursive Step**:
- In the recursive step, the initiator resends the FIND_NODE RPC to nodes it has learned about from previous RPCs. This recursion can begin before all α of the previous RPCs have returned.
- Of the k nodes the initiator has heard of closest to the target, it picks α that it has not yet queried and resends the FIND_NODE RPC to them.
- Nodes that fail to respond quickly are removed from consideration until and unless they do respond.
4. **Handling Failures**:
- If a round of FIND_NODEs fails to return a node any closer than the closest already seen, the initiator resends the FIND_NODE to all of the k closest nodes it has not already queried.
5. **Termination Conditions**:
- The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen.
### Store
The Store procedure is used to store a key-value pair in the network. The node hashes the key to determine its NodeID and then use `node lookup` to locate the k closest nodes to that ID. It sends STORE requests to these nodes to save the key-value pair.
### Refresh
The Refresh procedure ensures that k-buckets are up-to-date. A node periodically selects random IDs within the ranges covered by its k-buckets and performs a `node lookup` for these IDs. This process helps discover new nodes and remove outdated entries from the k-buckets.
## Handling Node Joins and Leaves
### Node Join
### Bucket Splitting
## Data Redundancy and Replication
### Data Replication
Data replication ensures that key-value pairs are stored on multiple nodes to prevent data loss. Each key-value pair is stored on the k closest nodes to the key's hashed ID. This redundancy ensures that even if some nodes leave the network, the data remains accessible.
### Republishing Keys
To maintain data availability, nodes periodically republish the key-value pairs they store. This process involves performing a Node Lookup for each key and sending STORE requests to the k closest nodes. Republishing helps recover from node departures and ensures that data remains replicated across the network.
### Avoid over-caching
The expiration time of a key-value pair is exponentially inversely propotional to the number of nodes between current node and the node whose ID is closest to the key ID.
## References
For further reading and detailed understanding of the Kademlia DHT protocol, the following references are highly recommended:
1. **Kademlia: A Peer-to-peer Information System Based on the XOR Metric**
- [Kademlia Paper](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf)
2. **Kademlia DHT Explained (Video)**
- [Kademlia DHT Explained on YouTube](https://www.youtube.com/watch?v=7o0pfKDq9KE)