NAT traversal in Codex can be separated into two parts:
1. UPnP/NAT-PMP support;
2. hole punching.
The first part can be addressed by [nim-nat-traversal](https://github.com/status-im/nim-nat-traversal), which has [as simple an API as it gets](https://github.com/status-im/nim-nat-traversal/blob/master/examples/miniupnpc_test.nim).
Hole punching, however, is a whole different story, and it is what this write-up deals with.
## Background
Recent versions of libp2p, including nim-libp2p, implement TCP/UDP hole punching by means of [circuit relay v2](https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md). In circuit relay v2, an unreachable node $n_u$ must _latch_ onto an available relay node $n_r$, which then becomes its connection proxy to other nodes.
The multiaddress of the unreachable node then becomes a tuple $(n_u, n_r)$ which contains the signed peer record (SPR) of $n_u$ as well as the SPR of the relay node. Nodes wishing to reach $n_u$ can then contact the relay node, which will [orchestrate either a connection reversal process or a hole punching attempt](https://github.com/libp2p/specs/blob/master/relay/DCUtR.md).
In more detail, a node $n_f$ attempting to reach $n_u$ will issue a dial to the multiaddress, so that libp2p will: _i)_ establish a relayed connection; _ii)_ use the relayed connection to transmit IP and port data to both ends and to allow RTT measurements; _iii)_ perform a synchronized hole punch attempt if both $n_f$ and $n_u$ are behind a NAT, or a connection reversal if $n_f$ is publicly reachable.
If the hole punch or connection reversal succeeds, the libp2p connection gets _upgraded_ to a direct connection.
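The choice made in step _iii)_ can be sketched as follows. This is only an illustration of the logic described above, not nim-libp2p code: `Upgrade` and `choose_upgrade` are hypothetical names, and the target $n_u$ is assumed to be unreachable (otherwise no relay would be needed in the first place).

```python
from enum import Enum


class Upgrade(Enum):
    CONNECTION_REVERSAL = "connection reversal"
    HOLE_PUNCH = "synchronized hole punch"


def choose_upgrade(dialer_is_public: bool) -> Upgrade:
    """Pick how a relayed connection to an unreachable node gets upgraded.

    Hypothetical helper: the target (n_u) is assumed to be behind a NAT,
    so only the reachability of the dialer (n_f) matters.
    """
    if dialer_is_public:
        # n_u can dial n_f back, since outbound connections pass its NAT.
        return Upgrade.CONNECTION_REVERSAL
    # Both ends are behind NATs, so both must dial at roughly the same time.
    return Upgrade.HOLE_PUNCH


print(choose_upgrade(dialer_is_public=True).value)
print(choose_upgrade(dialer_is_public=False).value)
```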
## Issues
### Codex DHT
The first big issue is that the Codex DHT does not use libp2p for communication: it runs over raw UDP. This means none of the libp2p NAT traversal machinery described above applies to DHT traffic.
### Public Node Discovery
The libp2p machinery relies heavily on unreachable nodes being able to reach one or more public nodes which can then work both as relay nodes and as AutoNAT[^1] servers for them.
While this is touted as "decentralized hole punching", it means that _every unreachable peer must know, at all times, at least one publicly reachable peer_ for this to work.
This leads to two possible paths:
1. System providers keep a pool of public peers which work as relays/AutoNAT servers for users while ensuring capacity and load balancing. This is not decentralized.
2. System providers figure out a way to do decentralized public node discovery.
While libp2p does specify a [rendezvous service](https://docs.libp2p.io/concepts/discovery-routing/rendezvous/) which would appear to be suitable to that end, this merely pushes the problem one layer up: instead of discovering public nodes, now you need to discover rendezvous nodes.
While one could argue it is better to have a centralized rendezvous service than centralized public nodes, it is still a centralized piece which will need to be in the loop well beyond bootstrap.
## A Possible Solution
**Naive approach: use libp2p in the DHT.** To construct our proposal, we start with one seemingly simple solution, which would be to modify the DHT so that it uses libp2p (with a UDP transport) instead of raw UDP.
This would have pretty severe implications for DHT performance: as DHT lookups will typically traverse and require dialing a lot of previously unknown nodes, we would get a large number of relay node contacts plus slow connection upgrades at every lookup.
This is illustrated in the lookup depicted in Figure 1, in which public nodes are represented as circles with thicker lines. As we can see, the number of hops becomes significantly larger (Figure 1b) because messages need to traverse relays to reach their destinations, and this does not even account for the actual connection upgrade process, which makes lookups costlier still.
<div style="display: grid; text-align: center; grid-template-columns: 1fr 1fr; gap: 10px; padding-bottom: 10px">
<div>
<img src="https://hackmd.io/_uploads/HyZhQOkrR.png" alt="a. Two-hop DHT lookup" style="width: 100%; height: auto;">
</div>
<div>
<img src="https://hackmd.io/_uploads/r1QT7OyrC.png" alt="b. Two-hop DHT lookup on relays" style="width: 100%; height: auto;">
</div>
<div><b>a.</b></div>
<div><b>b.</b></div>
</div>
**Figure 1.** **(a)** two-hop DHT lookup and **(b)** initial messages in the same two-hop lookup when using relays.
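As a rough way to read Figure 1, the toy model below counts only the initial one-way messages of a lookup, charging one hop for a contact reached directly and two for a contact reached through its relay. The function name and the reachability flags are illustrative assumptions and are not read off the figure; connection upgrades are ignored, as in Figure 1b.

```python
def initial_hops(contacts_reachable, via_relays):
    """Count the one-way hops needed to reach each contacted node once.

    contacts_reachable: one bool per contacted node (True = public).
    via_relays: if False, every contact is assumed to be directly dialable
    (the idealised lookup); if True, each unreachable contact costs an
    extra hop through its relay.
    """
    hops = 0
    for reachable in contacts_reachable:
        hops += 1 if (reachable or not via_relays) else 2
    return hops


# A two-hop lookup whose two contacts are both behind NATs (an assumption).
lookup = [False, False]
print(initial_hops(lookup, via_relays=False))  # 2
print(initial_hops(lookup, via_relays=True))   # 4
```

Under these assumptions the relayed lookup doubles the number of initial messages, before any upgrade cost is added on top.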
If DHT traffic is expected to be low, however, one could argue that we could simply transmit it over the relayed connections, without ever performing a connection upgrade for what are essentially ephemeral contacts. While still expensive, this costs less than upgrading connections at every hop.
Looking more closely at either of these solutions, however, both imply that most if not all DHT traffic from unreachable nodes needs to pass through public nodes. This is quite visible from the sort of topology that emerges in Figure 1b: unreachable nodes are simply latched onto relay nodes and funnel their DHT traffic through them. This realization leads to a second, perhaps more approachable solution, which we describe next.
**Have unreachable nodes work only as DHT clients.** In this scenario, depicted in Figure 2, unreachable nodes access and use the DHT to look up nodes and publish their provider records but do not participate in the DHT itself -- only public nodes do. Unreachable nodes still latch onto public nodes as before, except that instead of upgrading connections, they can run the lookup algorithm by looking for the right relay node directly: in the picture, node $3$ is looking for the relay for node $8$ (which is node $2$), so it simply asks node $6$ for the closest nodes, gets a reply, then contacts node $2$ directly to initiate DCUtR.
This works because an unreachable node can still issue `findNode` and `provide` requests since those are simple NAT-friendly request/response cycles and in principle[^2] do not require an unreachable node to be dialed back.
<center style="padding-bottom: 10px">
<img src="https://hackmd.io/_uploads/BJLk9OJB0.png" width="60%">
</center>
**Figure 2.** Unreachable peers as client nodes.
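One way to read this scenario is sketched below: the client only ever sends requests to public nodes and follows their replies until it reaches the relay it is looking for. Everything in the sketch is an assumption made for illustration, including the XOR metric, the routing tables, the `find_relay` helper, and the presence of a node $5$; only nodes $3$, $6$, $2$ and $8$ come from the example in the text.

```python
def closest(nodes, target, k):
    """Return the k nodes closest to target under the XOR metric."""
    return sorted(nodes, key=lambda n: n ^ target)[:k]


def find_relay(tables, entry, relay_id, k=2):
    """Client-side iterative lookup over public nodes only.

    tables: public node id -> ids of the public nodes it knows (assumed).
    entry: a public node the client already knows.
    Returns the nodes queried, in order; the lookup succeeded if the last
    one is relay_id.
    """
    queried = []
    shortlist = [entry]
    while True:
        pending = [n for n in shortlist if n not in queried]
        if not pending:
            return queried  # nothing left to ask
        node = closest(pending, relay_id, 1)[0]
        queried.append(node)
        if node == relay_id:
            return queried  # reached the relay
        # Plain request/response: the client asks, the public node replies.
        reply = closest(tables[node], relay_id, k)
        shortlist = closest(set(shortlist) | set(reply), relay_id, k)


# Hypothetical routing tables of the public nodes (node 5 is made up).
tables = {2: [5, 6], 5: [2, 6], 6: [2, 5]}

# Node 3 (the client) knows node 6 and wants node 2, the relay of node 8.
print(find_relay(tables, entry=6, relay_id=2))  # [6, 2]
```

With the relay found, node $3$ would contact node $2$ directly, as in the text, without any public node ever having to dial node $3$ back.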
**Use the DHT for public node discovery.** This doubles as a solution to public node discovery: if we know that only public nodes are part of the DHT, then we can perform random lookups in its ID space to sample neighbors, and maintain a small set of connected public nodes to use as relays and AutoNAT servers. This spreads the load organically and scales better than having a purely centralized solution.
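The sampling idea can be sketched as follows. This is a toy model under stated assumptions: IDs are 16-bit integers, distance is XOR, and `closest` stands in for a real DHT lookup by searching a global list of public nodes, which an actual node would not have. The names `sample_relays` and `max_lookups` are hypothetical.

```python
import random

ID_BITS = 16  # toy ID space; real DHT IDs are much wider


def closest(public_nodes, target):
    """Stand-in for a DHT lookup: the public node nearest to target (XOR)."""
    return min(public_nodes, key=lambda n: n ^ target)


def sample_relays(public_nodes, want, rng, max_lookups=1000):
    """Collect up to `want` distinct public nodes via random-ID lookups."""
    chosen = set()
    limit = min(want, len(public_nodes))
    for _ in range(max_lookups):
        if len(chosen) >= limit:
            break
        # Pick a random point in the ID space and see who owns it.
        target = rng.getrandbits(ID_BITS)
        chosen.add(closest(public_nodes, target))
    return sorted(chosen)


# Made-up public node IDs.
public = [0x1A2B, 0x4C3D, 0x7E5F, 0x9A01, 0xC0DE, 0xF00D]
relays = sample_relays(public, want=3, rng=random.Random(7))
print([hex(n) for n in relays])
```

Because each lookup lands on a random region of the ID space, different clients tend to end up with different relay sets, which is the load-spreading effect described above.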
Having only public nodes in the DHT may mean that the set of nodes making up the DHT is fairly small. As noted before, however, even if we do use relays, we will already have in essence a DHT of public nodes as every contact to an unreachable node needs to go through a public (relay) node first. I would therefore argue that this problem will exist regardless of the solution.
[^1]: AutoNAT is the local service that helps a peer understand whether or not it is publicly reachable. It works by asking public peers to ping it back and seeing if that succeeds.
[^2]: I say in principle because if replies take too long then NAT bindings can expire. I do not know how big of an issue this would be.