/* Introduction */
IPFS works because… magic. Ok, that is not actually true. But how does an IPFS client ask everyone for anything and actually get a reasonable response in a reasonable amount of time?
The answer is: content resolution algorithms built on distributed database architectures. That's not helpful? Ok then how about this: the Distributed Hash Table, referred to from here on out in this post as the DHT.
If you're a developer who has spent most of their coding life using HTTP, then you're one of around twenty million people… almost none of whom have actually built production applications on top of a peer-to-peer architecture. You're going to need to know a few things in order to build, profile, debug… and to know who to curse at. If something goes wrong while you're building with IPFS, this post will ground your debugging approach in some practical knowledge of how the underlying system works.
If you already know roughly what this layer is supposed to do, you can skip ahead to the Torrent section.
Content routing is the art and science of sending data from where you have it to where you want it quickly and efficiently. The first generation of peer-to-peer networks used the approaches of flooding and random walks, which were simple and straightforward but almost impossible to scale. Then DHTs came on the scene: they let any node find content by querying only a handful of other nodes, with no central coordinator to maintain or take down. However, a plain DHT lookup still takes several network round trips across a constantly changing swarm of nodes, which is why accelerated DHT clients exist (more on those at the end).
Let's assume you want to download a file (I'm gonna refer to this example file as Qmfoo here). We need to find a node which hosts Qmfoo, connect to it, and ask it "Please, I would like to download Qmfoo", and the other node will reply with the file.
Let's recap how much we know about that file at this point: we only know the CID. We can decode it with cid.ipfs.io (with a real CID; this fake one would just error). Using Qmdd7368UxUcLM1bdCE5KHRMkeUxMx7h4QwUXsFpRivJ5s for example, we see the data inside the CID looks like this:
- base58btc
- cidv0
- dag-pb
- (sha2-256 : 256 : 3165D7D9E71759F5DB82F26318F03A5BEEA4BAFD9AE1EECC5D366403CD38C16)
Most of what you see is metadata explaining how to read the actual data. What is interesting is between the parentheses, and you might notice that it's not much. We have:
- sha2-256, the hash algorithm used.
- 256, the length of the hash in bits. Yes, it's already implied by the previous field (the -256 already means 256 bits long); other algorithms may have varying output lengths, and the spec just says to always include it.
- 3165D7D9E71759F5DB82F26318F03A5BEEA4BAFD9AE1EECC5D366403CD38C16, the hash itself (I'm not gonna go into the details of why it's not actually the hash of the file itself; you can read more about IPLD if you want to know about that).

The important thing to realise is that nowhere in this data structure will you find an address. That is an issue because whatever transport we are gonna use (Internet Protocol, Tor, Ethernet, …) doesn't support dialing random file hashes like that.
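To make the "metadata + digest" structure concrete, here is a minimal sketch in Go (standard library only, not the real go-cid API; the input data is made up) that builds the binary multihash a CIDv0 wraps: the code for sha2-256, the digest length, then the digest itself.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	data := []byte("hello IPFS") // pretend this is our file's root block

	digest := sha256.Sum256(data)

	// A multihash is: <hash-function code><digest length><digest>.
	// 0x12 is the registered code for sha2-256, 0x20 (32) is the digest length in bytes.
	multihash := append([]byte{0x12, 0x20}, digest[:]...)

	// A CIDv0 is just this multihash encoded in base58btc (encoding omitted here),
	// which is why decoding a CID gives you exactly those fields and nothing else,
	// in particular, no address.
	fmt.Println("sha2-256 / 256 bits /", hex.EncodeToString(digest[:]))
	fmt.Println("multihash bytes:", hex.EncodeToString(multihash))
}
```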
So we need a way to get from the CID of a single file to a list of connection details for nodes who might have it. The connection details we need are the public key of the other node, plus an IP address and port, a Tor address, … or whatever your transport uses. That's right, IPFS is transport agnostic: it can use many different network transports for any given interaction. I'm gonna stick to IP address + port for this explanation.
BTW, we want a list of them because each entry is a single node; if multiple nodes are hosting the file, we want to know all of them in order to download from as many places at once as possible, making the download potentially faster and increasing the probability that it will complete successfully.
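If it helps to picture the goal, here is a rough sketch of the kind of record content routing has to hand back. The names are mine, not the actual libp2p types; the addresses reuse the examples from later in this post.

```go
package main

import "fmt"

// ProviderRecord is a made-up illustration of what resolving a CID should yield:
// the identity of a node plus enough transport details to actually dial it.
type ProviderRecord struct {
	PeerID string   // identity derived from the node's public key
	Addrs  []string // transport-specific addresses: IP + port, a Tor address, …
}

func main() {
	// What we want at the end of content routing for Qmfoo: a list like this.
	providers := []ProviderRecord{
		{PeerID: "QmXyz", Addrs: []string{"10.42.76.7:1337"}},
		{PeerID: "QmAbc", Addrs: []string{"192.168.0.3:5432"}},
	}
	for _, p := range providers {
		fmt.Println(p.PeerID, p.Addrs)
	}
}
```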
HTTP uses the URL (Uniform Resource Locator) to resolve content.
Let's look at one for example:
https://xkcd.com/1319/
Without including the more complex parts of the standard, a URL is typically composed of:
- https, the scheme (which protocol to talk to the server with)
- xkcd.com, the host (where on the network that server is)
- 1319/, the path of the resource on that server
It's trivial to see how you would go about securely requesting that file: resolve xkcd.com to an IP, open a connection to it (TLS, since the scheme is https), and ask that server for 1319/.
The first thing to notice is that the security model is a complex authority system that requires you to delegate trust to multiple third parties. In contrast, IPFS embeds the hash of the file (well, the hash of the root, which in the end contains the hashes of the leaves; security-wise it's the same thing) in the CID. You can verify that what you received is actually what you asked for. Trust moves out-of-band, decoupled from the transport layer.
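Here is a minimal sketch of that check (hypothetical helper names, standard library only, not an IPFS API): hash what arrived and compare it against the digest that was inside the CID you asked for. An HTTP client has nothing equivalent to compare against.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// verify re-hashes the received block and compares it to the digest we extracted
// from the CID. If they differ, the peer sent us garbage and we simply discard it.
func verify(received []byte, expectedDigest []byte) bool {
	got := sha256.Sum256(received)
	return bytes.Equal(got[:], expectedDigest)
}

func main() {
	// Pretend this digest came out of the CID we requested.
	want := sha256.Sum256([]byte("the block we actually asked for"))

	fmt.Println(verify([]byte("the block we actually asked for"), want[:])) // true
	fmt.Println(verify([]byte("something a malicious peer sent"), want[:])) // false
}
```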
And URLs just literally include the location. Without even getting into DNS and how centralised that protocol is, this is a big no-no for IPFS, because we don't know and don't even care where things are. With HTTP and URLs, if your URL points at a specific server that is slow or down, you can't download that file even if a nearby fast server has the same file, while IPFS would just connect to another nearby node and download it from there.
In short, HTTP says where things are while IPFS says what they should be. A URL is mostly just a path on a specific computer somewhere. A CID is a Content Identifier, representing a unique piece of content.
Torrent is a much different story than HTTP.
Let's inspect a random torrent on my FS:
$ btcheck -lvni ubuntu-21.04-desktop-amd64.iso.torrent
Announce URL : https://torrent.ubuntu.com/announce
announce : https://torrent.ubuntu.com/announce
announce : https://ipv6.torrent.ubuntu.com/announce
File Name : ubuntu-21.04-desktop-amd64.iso
File Length : 2818738176
Piece Length : 262144
Torrent Hash : 64a980abe6e448226bb930ba061592e44c3781a1
Comment : Ubuntu CD releases.ubuntu.com
We see some similarities with a CID:
- The Torrent Hash 64a980abe6e448226bb930ba061592e44c3781a1 allows us to check the integrity of the downloaded file, so another node / server can't send us an invalid file (for example by trying to inject malware into a program); we would catch this in the hashing step.
- The Piece Length 262144 tells us the file is cut into 256 KiB shards, and the torrent also stores the hash of each individual shard, as sketched just below (I'm not including the ten-thousand-plus piece hashes here for your own sanity).
- The announce entries such as https://torrent.ubuntu.com/announce are URLs, and there are many of them, so you might think that, like HTTP earlier, those point to the file, just with an additional layer of integrity thanks to the hashing. But no!
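As a sketch of what those piece fields buy you (this is not a real torrent parser, just the idea), a client splits the download into Piece Length-sized shards and checks each one against the hash stored for it in the .torrent file; BitTorrent v1 uses SHA-1 for these piece hashes.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"fmt"
)

const pieceLength = 262144 // from the .torrent metadata above

// checkPiece returns true if a downloaded shard matches the SHA-1 hash stored
// for it in the .torrent file. A bad shard is thrown away and re-requested
// from another peer.
func checkPiece(piece []byte, expected [20]byte) bool {
	got := sha1.Sum(piece)
	return bytes.Equal(got[:], expected[:])
}

func main() {
	// Toy data standing in for one 256 KiB shard of the ISO.
	piece := bytes.Repeat([]byte{0x42}, pieceLength)
	expected := sha1.Sum(piece) // in reality this comes from the torrent's piece list

	fmt.Println(checkPiece(piece, expected))                         // true
	fmt.Println(checkPiece(bytes.Repeat([]byte{0x00}, 8), expected)) // false: corrupted or forged shard
}
```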
Remember our original problem: In order to talk to a server or other node we need its location on the network (IP), then we can use TCP as usual.
So those announce URLs (usually called trackers) point to servers that keep a list of who hosts the file. When someone hosts a file, they share this information along with connection details (IP, port, protocol version, …) with the tracker. So our host node will contact the tracker (here torrent.ubuntu.com/announce) and say something like this:
I'm the node Xyz and I'm hosting the file Qmfoo; if anyone needs it, my IP is 10.42.76.7:1337.
Then when you want to download the file Qmfoo, you contact torrent.ubuntu.com yourself, asking something along the lines of:
Hey, I'm looking for the file Qmfoo, do you know anyone who has it?
Then the tracker responds with a list of nodes who host this file:
- Xyz 10.42.76.7:1337
- Abc 192.168.0.3:5432
Then you proceed to contact them, actually download shards from them, check those shards, and rebuild the original file.
It's important to remember that at no point do trackers hold the file, and they never exchange shards. They only store who is interested in which file.
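A toy, in-memory version of a tracker makes that point obvious (this is nothing like the real BitTorrent announce protocol, just the data structure it maintains): all it ever stores is a mapping from content to peer addresses.

```go
package main

import "fmt"

// Tracker is a toy stand-in: it maps a content identifier to the peers that
// announced they host it. It never sees the content itself.
type Tracker struct {
	hosts map[string][]string // content ID -> list of "peerID@ip:port"
}

func NewTracker() *Tracker {
	return &Tracker{hosts: make(map[string][]string)}
}

// Announce is what a hosting node does: "I'm Xyz, I host Qmfoo, here is my address."
func (t *Tracker) Announce(contentID, peer string) {
	t.hosts[contentID] = append(t.hosts[contentID], peer)
}

// Lookup is what a downloader does: "who has Qmfoo?"
func (t *Tracker) Lookup(contentID string) []string {
	return t.hosts[contentID]
}

func main() {
	tracker := NewTracker()
	tracker.Announce("Qmfoo", "Xyz@10.42.76.7:1337")
	tracker.Announce("Qmfoo", "Abc@192.168.0.3:5432")

	fmt.Println(tracker.Lookup("Qmfoo")) // [Xyz@10.42.76.7:1337 Abc@192.168.0.3:5432]
}
```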
You can't ask a tracker for the content of the file Qmfoo; it can only give you the list it knows of nodes hosting Qmfoo.
This is a pretty good system. It works; however, there are challenges: trackers make the system less resilient. They are yet another piece of infrastructure to maintain, and they have huge power over the network: they know at all times everyone who is downloading and hosting files. If you want to stop a torrent from circulating, you just need to shut down those two announce servers; nodes will not be able to contact each other anymore and the exchange will stop.
Torrent has since been extended, and some clients (like Transmission) also feature a DHT, so the use of trackers can be omitted.
It's also possible to use magnet links, which are really close to what CIDs are.
The tracker system is generally more complex and also knows how complete each node's copy is (trackers can tell seeders apart from peers still downloading).
The first thing IPFS does is say that every node is now a tracker too.
So you could ask any node if it knows where your file Qmfoo is located.
The issue is that there are ~20k publicly dialable nodes right now (nodes running on public IPs like servers, or nodes with an open NAT thanks to UPnP or port forwarding), and this number keeps going up.
So you could just ask anyone and hope for the best (IPFS does that too), but most of the time that just doesn't work. It simply doesn't scale.
So what we will try instead is to centralise the management of a given file on a few nodes. Instead of asking the whole network, you would just have to ask 20 or so nodes. Note that those 20 or so nodes are for one specific file; another file will likely be handled by 20 different nodes in the network.
So how do we do that?
When we peeked at the CID earlier, we saw that, unlike a torrent, which explicitly specifies the trackers to use, the CID is just the file hash.
This is the part that seems to make no sense at first. Don't look for meaning yet; all the individual steps I'll describe are meaningless without the bigger picture. If you don't understand at first, just keep reading and come back to this part once you know the later steps.
We choose those nodes according to their closeness to the CID of the file.
To do so we introduce a new notion: distance. It is the XOR of the SHA-256 hashes of the two keys we are comparing.
WTF are the keys in the first place?
A key is whatever we are comparing; most of the time it's the CID of a file or the peer ID of a node.
But aren't those already hashes? Yes, but some savage people, like pubsub, input UTF-8 strings as keys.
So we hash every key once more. First of all, this gives us fixed-length things to compare. Let's say I ask you which of cat and cots is closer to dogs: if your mind isn't filled with CS concepts and you don't instantly scream "Levenshtein distance", you will start asking tricky questions like "how do we count adding and deleting characters?". Using a hashing function is a cryptographically friendly way to make everything exactly 256 bits long and remove all ambiguity about comparing variable-length things.
This also creates a good random distribution from related inputs. Some people with no respect for civilisation genuinely input things such as pubsub/v0.5.6/workplace as keys in the DHT; imagine many other keys ending in /workplaces, /working, … Passing them through a hash function spreads these related inputs randomly across the DHT, which helps avoid overloading one single tracker.
(If you were offended by the civilisation part: ♥ pubsub, it's awesome, and since we hash string keys anyway it's not so bad. But I still believe you should avoid such keys, as a matter of principle.)
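You can see that spreading effect with a few lines of Go. The keys are the ones from the example above (the variants are ones I made up following the same pattern); the exact digests don't matter, only that they share nothing.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// Related-looking keys...
	keys := []string{
		"pubsub/v0.5.6/workplace",
		"pubsub/v0.5.6/workplaces",
		"pubsub/v0.5.6/working",
	}
	// ...produce completely unrelated 256-bit hashes, so they land in completely
	// different regions of the DHT instead of piling up on one single tracker.
	for _, k := range keys {
		h := sha256.Sum256([]byte(k))
		fmt.Printf("%-26s -> %s (first 4 bytes of the hash)\n", k, hex.EncodeToString(h[:4]))
	}
}
```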
Then we take those hashes and XOR them; the result is our distance.
XOR is a bitwise operation: it takes two numbers as input, compares their bits, and for each bit outputs a 0 if they match and a 1 if they differ.
So for example:
```
    101 (5)
XOR 100 (4)
  = 001 (1)   match, match, differ
```
In practice this just means that keys whose hashes look similar produce a low distance number, with a really strong bias toward matching the most significant bits.
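Here is a minimal sketch of that distance function in Go (standard library only; real implementations do the same thing with their own types): hash both keys with SHA-256, XOR the results, and treat the outcome as one big number.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/big"
)

// distance returns XOR(sha256(a), sha256(b)) interpreted as a 256-bit integer.
// Smaller means "closer" in the DHT's made-up key space.
func distance(a, b string) *big.Int {
	ha := sha256.Sum256([]byte(a))
	hb := sha256.Sum256([]byte(b))

	xored := make([]byte, len(ha))
	for i := range ha {
		xored[i] = ha[i] ^ hb[i]
	}
	return new(big.Int).SetBytes(xored)
}

func main() {
	cid := "Qmfoo" // stand-in for the real binary CID bytes

	// Compare two peer IDs against the same content key.
	fmt.Println(distance(cid, "Qmbar"))
	fmt.Println(distance(cid, "Qmfaa"))
	// Whichever number is smaller is the "closer" peer. Note that this says
	// nothing about geography or network latency.
}
```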
If you are asking something like "how does this calculation get the distance between two nodes? Nowhere do we input any location data", well, it doesn't. This distance is a completely made-up thing we created; it has no real-world meaning, and two nodes that are close in the DHT can be physically really far away or really close. They are just two different things.
Good. Now that we have a way to compare keys, what we could do is take the list of all nodes, compare their peer IDs with our CID, and take the 20 closest ones.
That would work, and it's highly probable a DHT client similar to that will be made one day. However, we still don't like this, because it requires keeping the list of all nodes in the network.
That takes some space. OK, for ~20k nodes that's likely fine, and a good database splitting disk and memory access could maybe get the job done. The issue is that many of those nodes are temporary: they appear, stay a bit, and leave.
This makes any join or leave in the IPFS network a broadcast operation that requires all nodes to be updated. You could maybe make it work; BGP (the protocol that makes the internet work) does this to an extent. But that doesn't scale well enough for us; we want to talk less.
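For the record, that naive "sort everyone by distance and keep 20" approach looks roughly like this. It's a sketch: the distance function is the same one as above, repeated so the snippet stands alone, 20 is the count mentioned in the text, and the peer IDs are the made-up ones from this post.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/big"
	"sort"
)

const k = 20 // how many closest peers we keep per key

func distance(a, b string) *big.Int {
	ha := sha256.Sum256([]byte(a))
	hb := sha256.Sum256([]byte(b))
	xored := make([]byte, len(ha))
	for i := range ha {
		xored[i] = ha[i] ^ hb[i]
	}
	return new(big.Int).SetBytes(xored)
}

// closest sorts the whole peer list by XOR distance to the key and keeps the k nearest.
// This is the part that doesn't scale: it assumes we know every peer in the network.
func closest(key string, allPeers []string) []string {
	sorted := append([]string(nil), allPeers...)
	sort.Slice(sorted, func(i, j int) bool {
		return distance(key, sorted[i]).Cmp(distance(key, sorted[j])) < 0
	})
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	peers := []string{"Qmbar", "Qmfaa", "Qmfoa", "QmGft", "QmGoo", "QmXyz"}
	fmt.Println(closest("Qmfoo", peers))
}
```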
What we do instead is the same thing (finding the closest nodes), but not over the complete network, just over all the peers we know (also called a peerset) at that time.
Then we ask them about the key. What they will do is compare it against their own local peerset, and if they find anyone closer, they redirect us to them instead.
Then we repeat the operation with that peer, until we find someone who doesn't know anyone closer. Those nodes will be our trackers.
That's all!
OK, that doesn't make much sense to me either when written out like that, so let's go through an example instead.
We want to search for the file Qmfoo. I'm gonna ignore the extra SHA-256 step in the distance calculation because the example will be clearer that way, so I'm comparing raw CIDs and raw peer IDs here.
First, the providing side: we host the file Qmfoo, so we search our peerset for the closest node; it happens to be the node Qmbar.
We ask Qmbar:
Hey, I host Qmfoo, can you remember that pls ?
And they respond:
No, sorry I'm not in charge of that, I know Qmfaa which is closer.
So now we ask Qmfaa:
Hey, I host Qmfoo, can you remember that pls ?
And they respond:
No, sorry I'm not in charge of that, I know Qmfoa which is closer.
So now we ask Qmfoa:
Hey, I host Qmfoo, can you remember that pls ?
And they respond:
Yes ok, I'll note your connection details for the next one.
Now, the downloading side: we want the file Qmfoo, so we search our peerset for the closest node; it happens to be the node QmGft.
We ask QmGft:
Hey, I want Qmfoo do you know about it ?
And they respond:
No I don't, but I know QmGoo which is closer.
So now we ask QmGoo:
Hey, I want Qmfoo do you know about it ?
And they respond:
No I don't, but I know Qmfoa which is closer.
So now we ask Qmfoa:
Hey, I want Qmfoo do you know about it ?
And they respond:
Yes, I know the node QmXyz, reachable over TCP at the IP 10.42.123.7 and port 1337.
Then the downloader contacts the host and they start exchanging shards and rebuilding the file using Bitswap.
Notice that the two paths were different. This is normal: there are multiple ways to reach the nodes tracking a file, and since nodes know many, many peers (the default config keeps around 300~600 connections), you actually skip most of the network while routing.
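To tie the provide and find walkthroughs together, here is a toy simulation of the whole loop: an in-memory "network", the made-up peer IDs from the example above, and a walk function that hops from peer to peer until nobody knows anyone closer. Real IPFS does the same thing over the wire with Kademlia RPCs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/big"
)

// A toy in-memory "network": peer ID -> node.
type node struct {
	id        string
	peerset   []string            // peers this node knows about
	providers map[string][]string // key -> connection details of announced hosts
}

var network = map[string]*node{}

func distance(a, b string) *big.Int {
	ha, hb := sha256.Sum256([]byte(a)), sha256.Sum256([]byte(b))
	x := make([]byte, len(ha))
	for i := range ha {
		x[i] = ha[i] ^ hb[i]
	}
	return new(big.Int).SetBytes(x)
}

// closerPeer returns a peer from n's peerset that is strictly closer to key than n itself.
func (n *node) closerPeer(key string) (string, bool) {
	best, found := n.id, false
	for _, p := range n.peerset {
		if distance(key, p).Cmp(distance(key, best)) < 0 {
			best, found = p, true
		}
	}
	return best, found
}

// walk hops from peer to peer until nobody knows anyone closer to key.
// The node it stops at is the one acting as tracker for that key.
func walk(start, key string) *node {
	current := network[start]
	for {
		next, ok := current.closerPeer(key)
		if !ok {
			return current
		}
		fmt.Printf("%s redirects us to %s\n", current.id, next)
		current = network[next]
	}
}

func main() {
	// Tiny fully-connected toy topology, so both walks are guaranteed to converge;
	// in the real network each node only knows a slice of it and walks take more hops.
	ids := []string{"Qmbar", "Qmfaa", "Qmfoa", "QmGft", "QmGoo"}
	for _, id := range ids {
		peers := []string{}
		for _, other := range ids {
			if other != id {
				peers = append(peers, other)
			}
		}
		network[id] = &node{id: id, peerset: peers, providers: map[string][]string{}}
	}

	// Provide: the host (starting from Qmbar) walks toward Qmfoo and leaves its record at the end.
	tracker := walk("Qmbar", "Qmfoo")
	tracker.providers["Qmfoo"] = append(tracker.providers["Qmfoo"], "QmXyz over TCP at 10.42.123.7:1337")

	// Find: a downloader starting somewhere else ends up at the same tracker.
	fmt.Println(walk("QmGft", "Qmfoo").providers["Qmfoo"])
}
```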
Everything I've said relies on an assumption: that nodes know their neighbors.
If we just let things flow randomly, we could end up with pockets of nodes that are really close to each other in the DHT but don't know each other. This is bad, because if your search steps over the entrance of such a pocket, you will end up at different tracking nodes than the ones the provider used.
This is solved by searching for all of your neighbors first, and then regularly.
Simply put, nodes search for themselves. Obviously, since they are new, they will not succeed, but the last nodes in that journey (the ones that can't find anyone closer and just give up) are their closest neighbors. So a node searches for itself and remembers / holds a connection with those close neighbors.
Note that this doesn't exactly apply to current IPFS implementations; this is a thing Chord does. Current IPFS uses Kademlia instead, which has a similar mechanism but reasons over a 256-dimensional vector space over GF(2).
If you only know your neighbors, that's great because lookups will work, but they will be slow.
So we also want to know peers from all over the network. Imagine you are the node Qmbar and you search for the file Qmfoo: those keys are really far from each other, so all of your neighbors (who are close to you) are also far from the target, and every hop would only get you a tiny bit closer.
So nodes also search for nodes that are far away. What your node could do is befriend (bucket) a node like Qmfaa. This is good because when you do a search in the Qmf region, you can skip the whole path from Qmb to Qmf, and if you search for something in Qme, it's still likely much closer to go through Qmfaa, because f is closer to e than b is.
So you search for those skips. For optimal routing speed (for the people asking you, not for your own searches), you need a few skips to far-away regions, and more and more skips to closer and closer regions, up to your own region where you need to know everyone.
Ideally the skips follow some kind of logarithmic distribution. For the farthest regions you might know just one node for every three regions, say.
Then for closer ones, you know a node per region.
Then closer still, a node per department.
Then closer still, a node per town.
Up to your own building, where you know everyone.
(Sorry for the geographical analogy, it's just me trying really hard not to bring out the equations.)
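If you prefer code to geography, here is a sketch of how Kademlia-style routing tables decide "how far away" a peer is: count the leading bits its hashed ID shares with yours. Each shared-prefix length is one bucket, and by construction roughly half the network lands in bucket 0, a quarter in bucket 1, and so on, which is exactly the logarithmic distribution described above. (This is the general Kademlia idea, not the exact go-libp2p code.)

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// commonPrefixLen counts how many leading bits of sha256(a) and sha256(b) match.
// In a Kademlia-style routing table this picks the bucket a peer goes into:
// bucket 0 holds peers that differ in the very first bit (about half the network),
// bucket 1 those that share one bit (about a quarter), and so on. Few contacts
// for far-away regions, more and more detail close to yourself.
func commonPrefixLen(a, b string) int {
	ha, hb := sha256.Sum256([]byte(a)), sha256.Sum256([]byte(b))
	cpl := 0
	for i := range ha {
		x := ha[i] ^ hb[i]
		if x == 0 {
			cpl += 8
			continue
		}
		return cpl + bits.LeadingZeros8(x)
	}
	return cpl
}

func main() {
	self := "Qmbar"
	for _, peer := range []string{"Qmfaa", "Qmfoa", "QmGft", "QmGoo", "QmXyz"} {
		fmt.Printf("%s goes into bucket %d\n", peer, commonPrefixLen(self, peer))
	}
}
```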
This brings us back to the accelerated DHT client mentioned at the start. The main difference is in how the peer IDs are stored: the accelerated client just uses one big table and puts all known peers there, which is CPU intensive, while the default one sorts peers into buckets, which allows smarter and less aggressive searches.