# Passive Estimation of the Size of the IPFS Network

Assessing the size and understanding the dynamic nature of the IPFS network, as well as the popularity of the content on it, presents several challenges. This is mainly due to the decentralized nature of the network, which has no central registry of content or clients. The proposed system monitors incoming peer connections and bitswap requests, using this information to provide an estimation of the IPFS network's size and content popularity.

The system is intended to operate in a self-contained fashion on a single stand-alone server, with all measurements being accumulated in RAM. Measurements may periodically be persisted to external storage. The design preserves privacy by maintaining only aggregated statistics rather than storing individual user activities.

## Outline of Design

The design relies on operating passive nodes with unlimited connection capacity. The nodes intercept all Bitswap messages, monitoring the content requests received to gauge their frequency and popularity. The nodes are specifically configured to refrain from sending Bitswap requests or data. As a result, the nodes will typically only receive root CID requests since they do not participate in sessions. This limits the estimate of content popularity to the DAG level, not the block level.

The node is implemented as a [Kubo plugin](https://github.com/ipfs/kubo/blob/master/docs/plugins.md) that uses a custom bitswap [`Tracer`](https://pkg.go.dev/github.com/ipfs/boxo@v0.8.1/bitswap/tracer#Tracer) that intercepts all bitswap messages, and a libp2p network [`Notifiee`](https://pkg.go.dev/github.com/libp2p/go-libp2p/core/network#Notifiee) to be notified of all peer connections.

The design enables the following measurements to be estimated:

- Network size by number of root CIDs
- Network size by number of peers (total and core size)
- Raw request popularity
- Unique request popularity

The total size of the IPFS network would still need to be extrapolated from the local view of the monitoring node.

## Estimating Network Size by Number of Root CIDs

The proposed design estimates the network size by tracking the number of unique CIDs using HyperLogLog (HLL) sketches (more precisely: HyperLogLog++[^ref4]), a probabilistic data structure used for cardinality estimation in large datasets. One HLL sketch would be used for each time period of interest (daily, weekly and monthly) to gain a view of the size of the network over time.

## Estimating Network Size by Number of Peers

For each day, the system maintains a separate HyperLogLog sketch to keep track of the peers. Whenever a peer connects, the system adds the peer ID to the current day's HLL. This strategy gives us a daily count of unique peers interacting with our passive nodes. We store seven days' worth of HLLs, allowing for a week-long retrospective view of the network's unique peer count. At the start of each day, the system initiates a new HyperLogLog and adds all peers from the current connection list to it.

### Estimating Total and Core Network Size

HyperLogLog sketches can be merged to provide combined cardinality estimates. By performing a union of the HLLs for each day, the node can estimate the total number of unique peers seen over the week. This yields an understanding of the overall size and reach of the IPFS network; a minimal code sketch of the daily and weekly peer counting follows below.
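The sketch below illustrates the daily peer counting and the weekly union. It is a minimal sketch only: the `peerCounter` type is hypothetical, and the axiomhq/hyperloglog package is assumed purely as an example of an HLL++ implementation offering insert, merge and estimate operations.

```go
package main

import (
	"fmt"

	"github.com/axiomhq/hyperloglog"
)

// peerCounter keeps one HLL sketch per day for the last seven days.
type peerCounter struct {
	days [7]*hyperloglog.Sketch
	cur  int // index of today's sketch
}

func newPeerCounter() *peerCounter {
	pc := &peerCounter{}
	for i := range pc.days {
		pc.days[i] = hyperloglog.New14() // precision 14, ~0.8% standard error
	}
	return pc
}

// observe records a connection from the given peer ID in today's sketch.
func (pc *peerCounter) observe(peerID string) {
	pc.days[pc.cur].Insert([]byte(peerID))
}

// rotate is called at the start of each day: it advances to the next slot and
// resets it, discarding the oldest day's sketch. In the real plugin the peers
// in the current connection list would then be re-inserted, as described above.
func (pc *peerCounter) rotate() {
	pc.cur = (pc.cur + 1) % len(pc.days)
	pc.days[pc.cur] = hyperloglog.New14()
}

// weeklyTotal merges the seven daily sketches to estimate the number of
// unique peers seen over the whole week.
func (pc *peerCounter) weeklyTotal() uint64 {
	union := hyperloglog.New14()
	for _, day := range pc.days {
		_ = union.Merge(day)
	}
	return union.Estimate()
}

func main() {
	pc := newPeerCounter()
	pc.observe("12D3KooWExamplePeerA")
	pc.observe("12D3KooWExamplePeerB")
	pc.observe("12D3KooWExamplePeerA") // duplicates are absorbed by the sketch
	fmt.Println("today:", pc.days[pc.cur].Estimate(), "week:", pc.weeklyTotal())
}
```

In the plugin, `observe` would be driven by the libp2p `Notifiee` connection callback rather than called directly.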
In addition to the total network size, we can attempt to identify the long-term, consistent peers that form the "core network." These are the nodes that persistently contribute to the network's activity over time. The node can estimate the size of this core network by calculating the intersection of each day's HyperLogLog using the MinHash technique, a method used for estimating the similarity between sets[^ref2]. Each day, the intersection of the seven HLLs can be taken to estimate the cardinality of the core network, i.e. peers that appear as connections on each of the seven days.

## Estimating Content Popularity

Two potential measures of popularity are[^ref1]:

- RRP: Raw Request Popularity - the total number of requests received for a CID over a given period
- URP: Unique Request Popularity - the number of distinct peers that requested a CID over a given period

Minimal code sketches of both measures are given at the end of this section.

### Estimating Raw Request Popularity

The node uses a Count-Min sketch to track all Bitswap 'wants' by CID. The Count-Min sketch is a probabilistic data structure that allows the frequency of events in a data stream to be estimated, in this case CID requests. The sketch trades a degree of accuracy for space efficiency. We can estimate RRP for the top-n CIDs by estimating the total number of requests received for a CID over a given period. This works as follows:

1. When a bitswap CID request is received by a passive node, the system adds it to the Count-Min sketch.
2. The sketch is then queried to estimate the count of requests for that specific CID.
3. The CID is inserted into a heap (the RRP heap) with its request count to maintain a list of the top-n most requested CIDs.
4. The RRP heap is periodically iterated to produce a list of the top-n CIDs by request count, which is the RRP rating.

### Estimating Unique Request Popularity

Estimating URP can be done by maintaining a Bloom filter for each connected peer:

1. When a 'want' is received from a peer, the system adds the corresponding CID to that peer's Bloom filter.
2. If the CID was not previously in the Bloom filter (indicating this is the first time this peer requested this CID), the system increments the count in a separate Count-Min sketch dedicated to tracking URP.
3. The CID is also inserted into a URP heap with its request count to maintain a list of the top-n CIDs.
4. The URP heap is periodically iterated to produce a list of the top-n CIDs by unique requesting peer count, which is the URP rating.

One of the characteristics of the Bloom filter is its possibility of false positives: the filter may occasionally indicate that a CID was previously requested when it has not been. The false positive probability p can be estimated by the equation:

p ≈ (1 - e^(-kn/m))^k

Where:

- k represents the number of hash functions used,
- m is the size of the filter in bits,
- n stands for the number of elements inserted into the filter.

This characteristic means the URP calculated will be inaccurate due to false positives. A false positive means that the popularity count for a CID will not be incremented for that peer, since the node will incorrectly assume that it is a duplicate request. This leads to an undercount. However, the estimate is tolerant of this since the counts are used to produce a popularity ranking, not a count of the number of unique requests. Since every CID is subject to the same false positive rate, the ranking order should not be affected, as each count is degraded in the same way relative to the others.
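To make the RRP steps concrete, here is a minimal, self-contained Count-Min sketch. The `CountMin` type, its FNV-based hashing and the dimensions used in `main` are illustrative assumptions; in the plugin, the estimate returned by `Add` would be pushed into the RRP heap (step 3 above).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// CountMin is a minimal Count-Min sketch: a d x w matrix of counters,
// one row per hash function.
type CountMin struct {
	w, d  int
	table [][]uint32
}

func NewCountMin(w, d int) *CountMin {
	t := make([][]uint32, d)
	for i := range t {
		t[i] = make([]uint32, w)
	}
	return &CountMin{w: w, d: d, table: t}
}

// index derives the bucket for a row by hashing the key together with the row number.
func (c *CountMin) index(row int, key []byte) int {
	h := fnv.New64a()
	var seed [8]byte
	binary.LittleEndian.PutUint64(seed[:], uint64(row))
	h.Write(seed[:])
	h.Write(key)
	return int(h.Sum64() % uint64(c.w))
}

// Add increments the counters for key and returns the new frequency estimate
// (the minimum across rows, which bounds the overestimate from collisions).
func (c *CountMin) Add(key []byte) uint32 {
	est := uint32(0)
	for row := 0; row < c.d; row++ {
		col := c.index(row, key)
		c.table[row][col]++
		if row == 0 || c.table[row][col] < est {
			est = c.table[row][col]
		}
	}
	return est
}

func main() {
	cm := NewCountMin(32768, 5) // dimensions match the RRP sizing discussion later
	var last uint32
	for i := 0; i < 3; i++ {
		last = cm.Add([]byte("example-root-cid")) // a root CID request
	}
	// In the plugin, this estimate would be pushed into the RRP heap here.
	fmt.Println("estimated requests:", last)
}
```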
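The URP bookkeeping, a per-peer Bloom filter feeding a unique-request count, can be sketched as follows. The bits-and-blooms/bloom package is assumed purely as an example Bloom filter implementation, and the plain `uniqueRequests` map is only a stand-in for the URP count-min sketch and heap described above.

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3"
)

// urpTracker keeps one Bloom filter per connected peer and counts, per CID,
// how many distinct peers have requested it.
type urpTracker struct {
	perPeer map[string]*bloom.BloomFilter
	// Stand-in for the URP count-min sketch and heap described above.
	uniqueRequests map[string]uint32
}

func newURPTracker() *urpTracker {
	return &urpTracker{
		perPeer:        make(map[string]*bloom.BloomFilter),
		uniqueRequests: make(map[string]uint32),
	}
}

// onWant is called for every Bitswap want received from a peer.
func (t *urpTracker) onWant(peerID, cid string) {
	f, ok := t.perPeer[peerID]
	if !ok {
		// Sized for ~5M root CIDs per day at a 1% false positive rate,
		// matching the sizing discussion later in this document.
		f = bloom.NewWithEstimates(5_000_000, 0.01)
		t.perPeer[peerID] = f
	}
	// TestAndAdd reports whether the CID was (probably) already present
	// before adding it. Only first-time requests increment the URP count.
	if !f.TestAndAdd([]byte(cid)) {
		t.uniqueRequests[cid]++
	}
}

func main() {
	t := newURPTracker()
	t.onWant("peerA", "cid1")
	t.onWant("peerA", "cid1") // duplicate from the same peer: not counted again
	t.onWant("peerB", "cid1")
	fmt.Println("URP(cid1):", t.uniqueRequests["cid1"]) // 2
}
```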
### Other Potential Measurements

Measurements can also be taken of the following low-cardinality properties (using a simple hash table):

- protocols supported by connected peers
- user agents announced by peers

## Implementation Details

### Time Windowing

The various sketches and measurements need to be windowed over time to provide an evolving view of the network. Resetting periodically and taking measurements just before each reset is one possibility. However, a general way to window bucketed sketches such as count-min is to duplicate each bucket and cycle through them over the window period[^ref3]. This allows measurements to be taken on a rolling basis. Concretely (a minimal sketch of this rotation appears after the count-min sizing discussion below):

1. Each bucket is split into two: old and new.
2. A process cycles through the buckets with a frequency such that each bucket is visited once per time window.
3. When a bucket is visited, the old version is emptied and filled with the new bucket's value. The new bucket is then emptied.
4. All writes are made to the new bucket.
5. All reads are taken from the sum of the old and new buckets.

This algorithm can be applied to most sketches that can be represented in a bucketed fashion, including a Bloom filter.

#### Time Windowing Heaps

Some of the measurement techniques couple a count-min sketch and a heap to maintain a top-n ranking. The windowing can be carried over to the heap in the following way:

1. When a bucket is visited by the windowing process, all items in the heap need to be rescored.
2. The heap is iterated and the score for each item looked up in the count-min sketch.
3. The heap is rebalanced with the adjusted scores.

With a time window of 24 hours and 32000 buckets in the count-min sketch (see below), this requires a full heap reconstruction every 2.7 seconds. This could be reduced to a reconstruction every other bucket visit, or even every tenth, without too much loss of accuracy.

### Resource Requirement Estimations

#### HyperLogLog

The size of an HLL structure in bits is given by:

ceil(log2(L + 1 − p)) * 2^p

Where:

- L is the number of bits in the hash (64 for HLL++)
- p is the required precision

p is chosen to achieve the required standard error, given by:

se = 1.04 / sqrt(2^p)

Typically p is 14 or 16, giving a standard error of 0.008125 and 0.0040625 respectively. With p = 14, the memory requirement is 114688 bits, or 14kB. An HLL of this size can estimate cardinalities up to 2^L (2^64 in this case).

#### Count-min

The Count-Min sketch has two important parameters that control its accuracy: the width w and the depth d. The sketch is an array of counters with width w and depth d, where the depth is equivalent to the number of hash functions used.

The width w is chosen to control the probability of an overestimate due to collisions, and is typically set as w = e/ε, where ε is the desired error in the frequency estimate and e is the base of the natural logarithm. The depth d is chosen to control the confidence in the accuracy of the estimate, and is typically set as d = ln(1/δ), where δ is the desired probability that the estimate is not within the error bound. In other words, with probability 1 − δ the error is at most ε × (sum of all counts).

This means that to obtain an estimate whose error is at most 0.1% of the total count, 99% of the time, the parameters would be ε = 0.001 and δ = 0.01:

w = e/ε ≈ 2.71828 / 0.001 ≈ 2718

d = ln(1/δ) ≈ ln(1/0.01) ≈ 4.6 ≈ 5

The width can be increased to reduce the chances of collision for very high counts.
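The parameter choice can be checked mechanically; the snippet below simply evaluates the two formulas for the ε and δ used above, and the resulting memory for 32-bit counters (the 32-bit counter size is the one assumed in the RRP sizing below).

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	epsilon := 0.001 // error bound as a fraction of the total count
	delta := 0.01    // probability of exceeding the error bound

	w := int(math.Ceil(math.E / epsilon))    // width  ≈ 2719
	d := int(math.Ceil(math.Log(1 / delta))) // depth  ≈ 5

	fmt.Println("width:", w, "depth:", d)
	fmt.Println("size with 32-bit counters:", w*d*4, "bytes")
}
```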
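Looking back at the time-windowing scheme described under Implementation Details, the old/new bucket rotation can be sketched for a single row of counters as follows. The `windowedRow` type and its method names are hypothetical; the same idea applies to any bucketed sketch.

```go
package main

import "fmt"

// windowedRow applies the old/new bucket scheme to a single row of counters.
// Writes go to the new buckets; reads return old+new; a cycling process
// visits one bucket at a time, copying new into old and clearing new, so
// each counter is fully refreshed once per window.
type windowedRow struct {
	oldBuckets []uint32
	newBuckets []uint32
	next       int // next bucket to be visited by the cycling process
}

func newWindowedRow(width int) *windowedRow {
	return &windowedRow{
		oldBuckets: make([]uint32, width),
		newBuckets: make([]uint32, width),
	}
}

// Increment records a write in the new bucket.
func (r *windowedRow) Increment(i int) {
	r.newBuckets[i]++
}

// Read returns the windowed value: the sum of the old and new buckets.
func (r *windowedRow) Read(i int) uint32 {
	return r.oldBuckets[i] + r.newBuckets[i]
}

// Visit is called by the cycling process; with W buckets and a window of
// T seconds it would be called every T/W seconds.
func (r *windowedRow) Visit() {
	i := r.next
	r.oldBuckets[i] = r.newBuckets[i]
	r.newBuckets[i] = 0
	r.next = (r.next + 1) % len(r.newBuckets)
}

func main() {
	row := newWindowedRow(4)
	row.Increment(2)
	fmt.Println(row.Read(2)) // 1
	row.Visit()              // visits bucket 0
	row.Visit()              // bucket 1
	row.Visit()              // bucket 2: moves the count into the old bucket
	fmt.Println(row.Read(2)) // still 1 (old=1, new=0)
	row.Increment(2)
	fmt.Println(row.Read(2)) // 2
}
```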
#### Bloom-filter

A Bloom filter is sized as follows:

m = -(n * ln(p)) / (ln(2))^2

k = (m/n) * ln(2)

Where m represents the size of the Bloom filter in bits, n is the number of items expected to be inserted into the Bloom filter, p is the false positive probability, and k is the number of hash functions.

#### Network CID Count Sizing

One HyperLogLog is maintained for each time period. For the monthly count we assume up to 100 million distinct CIDs may need to be counted. Assuming an error rate of 0.1% and a bin width of 3 bits:

m = 1.0816 / (0.001*0.001) = 1081600

size = 1081600*3 = 3244800 bits = 405600 bytes = 396 kB

This would double due to the time-window bucket approach described earlier, requiring roughly 792 kB per time period.

#### Network Peer Count Sizing

The network peer count requires seven HyperLogLog sketches, each capable of counting up to 500k distinct peers. Assuming an error rate of 0.1% we get:

m = 1.0816 / (0.001*0.001) = 1081600

width = ceil(ln(ln(500000))) = 3

size = 1081600*3 = 3244800 bits = 405600 bytes = 396 kB

(This is no different to the sizing needed for 100M CIDs above.)

Seven of these are required, doubled for windowing, for a total memory requirement of 5.4MB.

#### RRP Count-min and Heap Sizing

RRP estimation requires a single count-min sketch and an associated heap. Here we are dealing with a set of high cardinality, certainly in the hundreds of millions of CIDs. It would be sensible to size the width an order of magnitude greater than 2718, perhaps to 32k. With d = 5 and using a 32-bit integer as a count, that suggests a count-min size of:

2^15 * 5 * 4 bytes = 655360 bytes = 640kB

This would double due to the time-window bucket approach described earlier.

Assuming a maximum of 64 bytes per CID, an associated 32-bit count and a sorted-array-based heap implementation, the RRP heap would require 68 bytes per entry. With n = 10000 we could track the top-n most requested CIDs using:

68 * 10000 = 680000 bytes = 664.06 kB

#### URP Bloom Filter Sizing

URP estimation requires a Bloom filter per connected peer, a count-min sketch and an associated heap. Assume the count-min and heap are the same size as in the RRP case.

Some compromises have to be made here due to the memory requirements of the Bloom filters. We either have a low-accuracy Bloom filter per peer, or higher accuracy but only for a fraction of connected peers. Since the ranking estimate is tolerant of false positives we can sacrifice some accuracy.

The Bloom filters are intended to record a day's worth of unique requests per peer. For a false positive rate of 0.01 for up to 5 million root CIDs we'd need:

m = -(5000000 * ln(0.01)) / (ln(2))^2 ≈ 47925291 bits ≈ 5.72 megabytes

k = (47925291/5000000) * ln(2) ≈ 6.64 ≈ 7 hashes

For 10M CIDs the false positive rate rises to 0.15 (15%), and for 20M it rises to 0.67 (67%).

The space requirements would be doubled (11.4MB) for the time-window approach, if adopted. At 11.4MB per peer, this is about 334GB of RAM for 30k connected peers. A more reasonable approach would be to record URP using only 1000 peers, giving a memory requirement of 11.4GB.
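As a cross-check of these figures, the snippet below evaluates the Bloom filter sizing formulas and the false positive equation from the URP section for the numbers used above; it is a worked calculation only, not part of the design.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate evaluates p ≈ (1 - e^(-kn/m))^k for a filter of m bits
// with k hash functions after n insertions.
func falsePositiveRate(k, n, m float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	n := 5_000_000.0 // expected daily root CIDs per peer
	p := 0.01        // target false positive rate

	// Sizing formulas from the Bloom-filter section above.
	m := -(n * math.Log(p)) / math.Pow(math.Ln2, 2) // ≈ 47,925,292 bits ≈ 5.7 MB
	k := math.Ceil((m / n) * math.Ln2)              // ≈ 7 hash functions

	fmt.Printf("m = %.0f bits (%.2f MB), k = %.0f\n", m, m/8/1024/1024, k)

	// How the false positive rate degrades if the filter is overfilled.
	for _, inserted := range []float64{5e6, 10e6, 20e6} {
		fmt.Printf("n = %.0fM: p ≈ %.2f\n", inserted/1e6, falsePositiveRate(k, inserted, m))
	}
}
```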
[^ref1]: Monitoring Data Requests in Decentralized Data Storage Systems: A Case Study of IPFS; Leonhard Balduf, Sebastian Henningsen, Martin Florian, Sebastian Rust, Björn Scheuermann; [arXiv:2104.09202](https://arxiv.org/abs/2104.09202)

[^ref2]: [HyperLogLog and MinHash](https://tech.nextroll.com/blog/data/2013/07/10/hll-minhash.html) by Andrew Pascoe

[^ref3]: Sliding Sketches: A Framework using Time Zones for Data Stream Processing in Sliding Windows; https://yindazhang.github.io/files/sliding-sketch.pdf

[^ref4]: HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm; https://research.google/pubs/pub40671/