# Double Hashed DHT Metrics and Analysis (Revised)

This document contains metrics collected using [Testground](https://github.com/testground/testground) on four different variations of go-libp2p-kad-dht:

- "vanilla": the current DHT implementation, where a CID is used internally as the provided/looked-up key.
- double hashed: the CID that is provided/looked-up is internally hashed, and the hash of the CID is what is passed between nodes.
- prefix lookup: uses double hashing, but also optionally allows a variable-length prefix of the hash to be passed between nodes.
- provider encrypt: uses double hashing and (optional) prefix lookups, but also encrypts and signs the peer ID of the provider before it is stored in the DHT. On receipt of a provider record, the ciphertext must be decrypted to recover the peer ID. The encryption/decryption key is the lookup CID (i.e. the pre-image of the double-hashed value passed between nodes).

# 1. Testing Environment

### 1.1 Tested Branches

| Name | Github | Commit ID |
| ---- | ---- | ---- |
| Vanilla (with hop counts) | https://github.com/ChainSafe/go-libp2p-kad-dht/tree/noot/measure-hops | `7af377cd4175ecffab591c862f928de12b3aee21` |
| Double Hash | https://github.com/ChainSafe/go-libp2p-kad-dht/tree/noot/dh-hops | `2e7ec8334806cdf475378908d563863302f7296b` |
| Prefix Lookup | https://github.com/ChainSafe/go-libp2p-kad-dht/tree/noot/pl-hops | `2fbe0738960eaf9c87df1c0c10b50768a54956e6` |
| Provider Encrypt | https://github.com/ChainSafe/go-libp2p-kad-dht/tree/noot/pe-hops | `3ac1963ff033e89f8a6e1916716df5978e9edd14` |

> Note: the `Double Hash`, `Prefix Lookup` and `Provider Encrypt` branches build off one another; for example, the `Prefix Lookup` branch contains all the changes in the `Double Hash` branch, and the `Provider Encrypt` branch contains all the changes in the `Prefix Lookup` branch. The runs therefore test incremental changes to the DHT and exercise the behaviour of each addition.

### 1.2 Environment Resource Allocation

The tests were run with Testground using Docker. The resources allocated to the Docker process were as follows:

| Resource | Value |
|----|----|
| CPUs | 8 |
| Memory | 10GB |
| Swap | 4GB |

# 2. Testing Parameters

### 2.1 Double Hashed DHT Tests

The Testground test plan starts 40 nodes from a completely fresh state and connects each node to the previously started node. After the nodes are connected, each node performs 5 random-walk queries to populate its routing table. The test then has each node put 20 records in the DHT sequentially by calling `dht.Provide()`; the nodes do this concurrently with one another. Then, each node looks up every record that was put, again concurrently, by calling `dht.FindProvidersAsync()`. The test records the time to provide a record, as well as the time to find each provider, amongst other metrics, and completes after all providers are found. A simplified sketch of the per-node provide/lookup loop is shown below.

> Note: runs 1-9 used a 256-bit prefix length for the `Prefix Lookup` and `Provider Encrypt` branches, which is equivalent to a full-key lookup.
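For illustration, here is a minimal sketch of the per-node loop described in section 2.1. The function name, the constructed `*dht.IpfsDHT` instance, and the slice of test CIDs are hypothetical; the real test plan (linked in the appendix) also handles Testground sync barriers and metric emission.

```go
// Sketch of the per-node test loop: provide 20 records, then look each one up
// and time how long it takes to find all providers. Illustrative only.
package dhtbench

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func runProvideAndLookup(ctx context.Context, d *dht.IpfsDHT, cids []cid.Cid) error {
	// Provide each record sequentially; other nodes do the same concurrently.
	for _, c := range cids {
		start := time.Now()
		if err := d.Provide(ctx, c, true); err != nil {
			return err
		}
		fmt.Printf("time-to-provide %s: %s\n", c, time.Since(start))
	}

	// Look up every record and wait until the provider channel is drained.
	for _, c := range cids {
		start := time.Now()
		for p := range d.FindProvidersAsync(ctx, c, 40) {
			fmt.Printf("found provider %s after %s\n", p.ID, time.Since(start))
		}
		fmt.Printf("time-to-find %s: %s\n", c, time.Since(start))
	}
	return nil
}
```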
Runs 1-3 are the baseline benchmark tests, using 40 nodes and 20 provider records per node.

| Run | Branch | Latency | Test Parameters |
| -- | -- | -- | -- |
| 1-3 | Vanilla | 0ms | https://gist.github.com/araskachoi/194b2ac55fe22bd3ad4dfcef0d74cce3 |
| 1-3 | Double Hash | 0ms | https://gist.github.com/araskachoi/88741d0c29714a26cc4dabfe11569332 |
| 1-3 | Prefix Lookup | 0ms | https://gist.github.com/araskachoi/a24c242fddf17c55df29d60a50946c65 |
| 1-3 | Provider Encrypt | 0ms | https://gist.github.com/araskachoi/af2f41ff0a9dd1d48f0e7955acff6628 |

> \* note: the `Provider Encrypt` branch was tested 9 times due to a high amount of variance; we wanted to smooth out outliers by taking more runs and averaging them.

| Run | Branch | Latency | Test Parameters |
| -- | -- | -- | -- |
| 4-6 | Vanilla | 500ms | https://gist.github.com/araskachoi/5e45b81654a639f2a9e9fb6f9549d36e |
| 4-6 | Double Hash | 500ms | https://gist.github.com/araskachoi/88741d0c29714a26cc4dabfe11569332 |
| 4-6 | Prefix Lookup | 500ms | https://gist.github.com/araskachoi/78a35459e6921dda0e72941d3be16fb5 |
| 4-6 | Provider Encrypt | 500ms | https://gist.github.com/araskachoi/367c35b9dfe3ab6e5f70db9df158d65d |

| Run | Branch | Latency | Test Parameters |
| -- | -- | -- | -- |
| 7-9 | Vanilla | 1000ms | https://gist.github.com/araskachoi/45dd7e1568a4372c1d92fa8d10a5cbda |
| 7-9 | Double Hash | 1000ms | https://gist.github.com/araskachoi/395e187071cc0d5e03e6dcb96347f881 |
| 7-9 | Prefix Lookup | 1000ms | https://gist.github.com/araskachoi/a5d7ba9bbcba8dff7c5126c0b5de8ff2 |
| 7-9 | Provider Encrypt | 1000ms | https://gist.github.com/araskachoi/a113c558b2a3c859455ce9eb4d6b8c58 |

| Run | Branch | Prefix length | Test Parameters |
| -- | -- | -- | -- |
| 10-12 | Prefix Lookup | 128 bit | https://gist.github.com/araskachoi/0ff06ff35972f65f8013d63f0e308050 |
| 13-15 | Prefix Lookup | 64 bit | https://gist.github.com/araskachoi/7d34aca1b2330351c8cc189a20b6d16d |
| 16-18 | Prefix Lookup | 32 bit | https://gist.github.com/araskachoi/a3b719f0a81e7995a80fe73a96b02bfd |
| 19-21 | Prefix Lookup | 16 bit | https://gist.github.com/araskachoi/7fc8961256ec688016389b35c28d5b1c |
| 22-24 | Prefix Lookup | 8 bit | https://gist.github.com/araskachoi/b30b778ae6f6b7e57c6ddefb8428e4cf |
| 25-27 | Prefix Lookup | 4 bit | https://gist.github.com/araskachoi/394d9719e22eeef22a556dca2c7303dc |

> \* For the "Prefix Length" tests, we decided to run only against the `Prefix Lookup` branch, as it is the branch that exposes prefix-length variability and can therefore produce adequate data sets conveying the effects of different prefix lengths on the Kademlia DHT.
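As a point of reference for these prefix-length parameters, the sketch below shows one way a variable-length prefix of the double-hashed key could be derived. SHA-256 as the second hash and the exact masking scheme are assumptions for illustration; the tested branches may use a different construction.

```go
// Sketch: derive the double-hashed lookup key from a CID and truncate it to a
// prefix of prefixBits bits. SHA-256 is an assumed choice of second hash.
package dhtbench

import (
	"crypto/sha256"

	"github.com/ipfs/go-cid"
)

// doubleHash returns the second hash of the CID's multihash bytes.
func doubleHash(c cid.Cid) [32]byte {
	return sha256.Sum256(c.Hash())
}

// keyPrefix masks the double-hashed key down to its first prefixBits bits,
// e.g. prefixBits = 16 keeps only the first two bytes of the 256-bit key.
func keyPrefix(key [32]byte, prefixBits int) []byte {
	fullBytes := prefixBits / 8
	prefix := make([]byte, 0, fullBytes+1)
	prefix = append(prefix, key[:fullBytes]...)
	if rem := prefixBits % 8; rem != 0 {
		mask := byte(0xFF << (8 - rem)) // keep the top `rem` bits of the next byte
		prefix = append(prefix, key[fullBytes]&mask)
	}
	return prefix
}
```

With `prefixBits = 256` this reduces to the full double-hashed key, matching the full-key lookups used in runs 1-9.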
# 3. Metrics Collected

<details><summary>DHT data points</summary>

- "barrierbootstrapping0"
- "barrierprovider-records1"
- "barrierprovider-records2"
- "full bootstrapping0"
- "full provider-records1"
- "full provider-records2"
- "peers-found|done"
- "peers-missing|done"
- "signal bootstrapping0"
- "signal provider-records1"
- "signal provider-records2"
- "time-to-find-first"
- "time-to-find-last"
- "time-to-find|done"
- "time-to-provide"
- "bandwidth-total-in"
- "bandwidth-total-out"
- "bandwidth-rate-in"
- "bandwidth-rate-out"

</details>

<details><summary>Metrics collected</summary>

- "EnableGC"
- "HeapAlloc"
- "LastGC"
- "NumGC"
- "MCacheSys"
- "StackSys"
- "StackInuse"
- "Sys"
- "NumThread"
- "HeapIdle"
- "HeapInuse"
- "Lookups"
- "MSpanInuse"
- "Frees"
- "NextGC"
- "NumCgoCall"
- "PauseTotalNs"
- "NumGoroutine"
- "BuckHashSys"
- "HeapObjects"
- "GCCPUFraction"
- "TotalAlloc"
- "DebugGC"
- "HeapReleased"
- "HeapSys"
- "MCacheInuse"
- "Alloc"
- "Mallocs"
- "MSpanSys"
- "pauseNs"
- "readMemStats"

</details>

# 4. Results

For the following metrics, there were 20 provider records put in the DHT per node. Since there were 40 nodes, there were 800 records in total. For the "Prefix Lookup" and "Provider Encrypt" runs, a prefix length of 256 bits (32 bytes) was used, which is the same as a full-key lookup.

<!-- Note that the graphs are only of "run 1"; additional graphs can be found in the [appendix](https://docs.google.com/spreadsheets/d/1ywNNrEQxqDDfTMi7SYh57_w-kh6ZCURdzpEdt64maiM/). -->

The values in the tables are as follows:

- "total avg": the average of all the data points for all nodes.
- "min avg": the average of each node's data points was calculated, and this value is the minimum of those node averages.
- "max avg": the average of each node's data points was calculated, and this value is the maximum of those node averages.

## 4.1 CPU Usage

The table values are %CPU used. This metric was measured by sampling how much CPU each node was using at regular intervals. We do not expect the values to differ much between branches.

| Branch | Run | Total Avg CPU% | Min Avg CPU% | Max Avg CPU% |
| -------- | -------- | -------- | -------- | -------- |
| Vanilla | 1-3 | 9.399 | 0.04 | 61.03 |
| Double Hash | 1-3 | 10.424 | 0.03 | 120.66 |
| Prefix Lookup | 1-3 | 9.638 | 0.05 | 83.12 |
| Provider Encrypt | 1-3 | 9.568 | 0.03 | 81.77 |

> \* note: the averages listed here are the per-run averages of each metric of all nodes. The "total avg" is the average of all node averages, the "min avg" is the minimum node average, and the "max avg" is the maximum node average found.

The disparity between the maximum and minimum average CPU values is quite large because CPU logging starts before the test begins and therefore also captures nodes that have not yet started their script (finding peers, providing, health checks, etc.). Thus, a low minimum does not indicate low CPU usage in the middle of the test; it most likely corresponds to a period before the test had begun. The higher Max Avg CPU% observed in the `Double Hash` series could be attributed to some throttling experienced during the test; however, the total average CPU% shows that overall CPU usage was consistent and comparable to the other series.
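The total/min/max aggregation described in the notes above can be summarised with a short sketch: each node's samples are averaged first, and the tables then report the mean, minimum, and maximum of those per-node averages. This is a hypothetical helper for illustration, not code from the metrics parser linked in the appendix.

```go
// Sketch of the aggregation used in the tables: average each node's samples,
// then report the mean, minimum, and maximum of those per-node averages.
package dhtbench

func aggregate(samplesPerNode map[string][]float64) (totalAvg, minAvg, maxAvg float64) {
	nodeAvgs := make([]float64, 0, len(samplesPerNode))
	for _, samples := range samplesPerNode {
		if len(samples) == 0 {
			continue // skip nodes that reported no data points
		}
		var sum float64
		for _, s := range samples {
			sum += s
		}
		nodeAvgs = append(nodeAvgs, sum/float64(len(samples)))
	}
	if len(nodeAvgs) == 0 {
		return 0, 0, 0
	}
	minAvg, maxAvg = nodeAvgs[0], nodeAvgs[0]
	var total float64
	for _, avg := range nodeAvgs {
		total += avg
		if avg < minAvg {
			minAvg = avg
		}
		if avg > maxAvg {
			maxAvg = avg
		}
	}
	return total / float64(len(nodeAvgs)), minAvg, maxAvg
}
```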
### Local tests

Note: these were not run with testground.

- tested using [dht-tester repo](https://github.com/ChainSafe/dht-tester) on an Intel i7-8650U (8) @ 1.900GHz CPU
- for 100 nodes, doing constant lookups
- max CPU% after 10 minutes

| DHT type | Max CPU% | Max CPU% per DHT node |
| -------- | -------- | -------- |
| Vanilla | 292 | 2.92 |
| Double Hash | 318 | 3.18 |
| Prefix Lookup (full prefix) | 325 | 3.25 |
| Provider Encrypt | 342 | 3.42 |

### Analysis

Overall, we did not observe notable changes in CPU usage between branches, which is what was expected.

## 4.2 Thread Usage

This metric measures how many threads were used per node.

**x-axis: n/a; y-axis: number of threads used per node**

> \* note: the x-axis ranges from the start of the test to when the test completes. The timestamps are not plotted as the y-values are average values, and thus the timestamps don't line up for every node and run.

![](https://i.imgur.com/EAbWaqB.png)

> \* note: if any plot extends further than the others, that test took longer to complete.

| Branch | Total Avg | Min Avg | Max Avg |
| -------- | -------- | -------- | -------- |
| Vanilla | 17.136 | 11.800 | 19.200 |
| Double Hash | 17.054 | 11.975 | 19.4 |
| Prefix Lookup | 14.924 | 11.675 | 18.85 |
| Provider Encrypt | 15.675 | 11.675 | 19.5 |

> \* note: the averages listed here are the per-run averages of each metric of all nodes. The "total avg" is the average of all node averages, the "min avg" is the minimum node average, and the "max avg" is the maximum node average found.

### Analysis

We expect the number of threads used to be similar for each branch, and this is what was observed.

## 4.3 time-to-provide

This metric measures the time it took to put a provider record in the DHT, i.e. how long the call to `dht.Provide()` took.

| branch | total avg [s] | min avg [s] | max avg [s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 33.023 | 27.523 | 41.180 |
| Double Hash | 37.264 | 28.326 | 51.626 |
| Prefix Lookup | 42.485 | 14.645 | 69.303 |
| Provider Encrypt | 33.012 | 15.883 | 48.568 |

### Analysis

We expect the time to provide to be similar for each branch. The vanilla branch was the lowest, but the other branches were close to it. The disparity between the min and max averages is likely due to non-determinism from each node putting its records in the DHT concurrently.

## 4.4 Hop Count

This metric is the maximum number of hops measured before the call to `dht.FindProvidersAsync()` returned.

| branch | total avg [# hops] | min avg [# hops] | max avg [# hops] |
| -------- | -------- | -------- | -------- |
| Vanilla | 1.732 | 1.3 | 2.15 |
| Double Hash | 1.895 | 1.4 | 2.5 |
| Prefix Lookup | 1.960 | 1.3 | 2.9 |
| Provider Encrypt | 1.959 | 1.2 | 2.7 |

### Analysis

We expect the number of hops to be around the same for each branch. We see that the average hop count increases slightly from `Vanilla` to `Provider Encrypt`.

## 4.5 time-to-find-first

The time-to-find-first metric measures the time it took to find the first provider after calling `dht.FindProvidersAsync()`.

| branch | total avg [s] | min avg [s] | max avg [s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 0.769 | 0.0211 | 3.031 |
| Double Hash | 1.282 | 0.0547 | 4.181 |
| Prefix Lookup | 1.329 | 0.001 | 9.329 |
| Provider Encrypt | 0.732 | 0.005 | 5.034 |

### Analysis

We expect this duration to be around the same for all implementations. The average time-to-find-first for the `Vanilla` and `Provider Encrypt` branches is similar. However, for the `Double Hash` and `Prefix Lookup` branches, the time to find the first provider almost doubles. This is likely due to test non-determinism rather than implementation differences, as none of the branches should greatly affect `time-to-find-first`. Since `Vanilla` and `Provider Encrypt` are similar, and `Provider Encrypt` builds off `Prefix Lookup` and `Double Hash`, we can conclude that `time-to-find-first` was not negatively impacted. The sketch below shows how the first-provider, last-provider, and total lookup timings relate to the `dht.FindProvidersAsync()` call.
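For context, this is a minimal, hypothetical sketch of how `time-to-find-first`, `time-to-find-last`, and `time-to-find` can be derived from a single `dht.FindProvidersAsync()` call. It mirrors the metric descriptions in this section but is not the test plan's actual instrumentation.

```go
// Sketch: timing the first provider, the last provider, and the full lookup
// from one FindProvidersAsync call. Illustrative only; the real test plan
// also reports these values as Testground metrics.
package dhtbench

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func timeLookup(ctx context.Context, d *dht.IpfsDHT, c cid.Cid, expected int) (first, last, total time.Duration) {
	start := time.Now()
	found := 0
	for range d.FindProvidersAsync(ctx, c, expected) {
		found++
		if found == 1 {
			first = time.Since(start) // time-to-find-first
		}
		last = time.Since(start) // final value is time-to-find-last
	}
	total = time.Since(start) // time-to-find: channel closed, lookup complete
	return first, last, total
}
```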
## 4.6 time-to-find-last

The time-to-find-last metric measures the time it took to find the last provider after calling `dht.FindProvidersAsync()`.

| branch | total avg [s] | min avg [s] | max avg [s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 1.893 | 0.339 | 3.718 |
| Double Hash | 2.774 | 0.869 | 5.223 |
| Prefix Lookup | 3.040 | 0.240 | 14.376 |
| Provider Encrypt | 4.738 | 0.754 | 19.508 |

### Analysis

We expect this to be around the same for all implementations, except the `Provider Encrypt` branch, which may have a slightly higher duration due to the time added for decrypting provider records. We observe that `time-to-find-last` is higher than expected for `Provider Encrypt`. The impact of decryption would be more pronounced for `time-to-find-last` than for `time-to-find-first` due to the cumulative effect of decrypting multiple records successively; however, it seems somewhat unlikely that provider record decryption alone would have such a significant effect on the time until the last provider is found. More specific encryption benchmarks might be needed to confirm whether the effect is due to encryption rather than test non-determinism.

## 4.7 time-to-find

The time-to-find metric measures the time it took to find a provider after calling `dht.FindProvidersAsync()`.

| branch | total avg [s] | min avg [s] | max avg [s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 47.120 | 38.011 | 60.595 |
| Double Hash | 64.340 | 50.697 | 89.296 |
| Prefix Lookup | 61.980 | 18.036 | 117.536 |
| Provider Encrypt | 80.795 | 23.601 | 164.369 |

### Local tests

Note: these were not run with testground.

- tested using [dht-tester repo](https://github.com/ChainSafe/dht-tester) on an Intel i7-8650U (8) @ 1.900GHz CPU
- for 100 nodes and 1000 provider records, doing constant lookups

| DHT type | Average time to find (ms) |
| -------- | -------- |
| Vanilla | 5.056243 |
| Double Hash | 5.782872 |
| Prefix Lookup (full prefix) | 5.814049 |
| Provider Encrypt | 7.187447 |

### Analysis

Like the above, we expect this to be around the same for all implementations, except the `Provider Encrypt` branch, which may have a higher duration due to the time added for decrypting provider records. Similarly to above, we observe that `time-to-find` is higher than expected for `Provider Encrypt`. The impact of decryption would be more pronounced for `time-to-find` than for `time-to-find-first` due to the cumulative effect of decrypting multiple records successively; however, it seems somewhat unlikely that provider record decryption alone would have such a significant effect on the time until the last provider is found. We also note that the minimum average for `Provider Encrypt` is actually lower than for `Vanilla`, which suggests the testground result may be due to test non-determinism. To confirm whether `Provider Encrypt` actually increased time-to-find, local tests were run with the `dht-tester` repo, which runs many DHT nodes, connects them, and puts/looks up DHT records, the number of which can be configured by a client. The time-to-find (measured the same way as in the test plan) was higher for `Provider Encrypt` by around 23.6% (7.19 ms vs. 5.81 ms for `Prefix Lookup` with a full prefix), which confirms that encryption does add latency to the lookup time.
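To give a sense of the per-record work `Provider Encrypt` adds on the lookup path, here is a hedged sketch of decrypting a provider record with a key derived from the lookup CID (the pre-image of the double-hashed key). SHA-256 key derivation and AES-GCM are assumptions for illustration; the branch's actual construction, and its signature verification step, may differ.

```go
// Sketch: decrypt an encrypted provider peer ID using a symmetric key derived
// from the lookup CID. AES-GCM and SHA-256 key derivation are illustrative
// assumptions, not necessarily what the Provider Encrypt branch implements.
package dhtbench

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

func decryptProvider(c cid.Cid, nonce, ciphertext []byte) (peer.ID, error) {
	key := sha256.Sum256(c.Bytes()) // derive a 256-bit key from the lookup CID
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return "", err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	plaintext, err := aead.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return "", err
	}
	return peer.IDFromBytes(plaintext) // recover the provider's peer ID
}
```

The relevant point for this section is that this work is repeated for every provider record returned, so any overhead accumulates towards `time-to-find-last` and `time-to-find` rather than `time-to-find-first`.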
## 4.8 peers-found

The peers-found metric measures how many providers were found for each CID looked up. There were 40 providers for each CID, so the expected value is 40 (or close to it) for each.

| branch | total avg | min avg | max avg |
| -------- | -------- | -------- | -------- |
| Vanilla | 40 | 40 | 40 |
| Double Hash | 40 | 40 | 40 |
| Prefix Lookup | 40 | 40 | 40 |
| Provider Encrypt | 40 | 40 | 40 |

### Analysis

We expect the peers found to be similar for each branch, which is what was observed. In fact, for each run, each node was able to connect to every other node.

## 4.9 peers-missing

The peers-missing metric measures how many providers were *not* found for each CID looked up (out of the 40 providers). The data here is essentially the inverse of the above (i.e. each point is 40 - (peers-found)).

| branch | total avg | min avg | max avg |
| -------- | -------- | -------- | -------- |
| Vanilla | 0 | 0 | 0 |
| Double Hash | 0 | 0 | 0 |
| Prefix Lookup | 0 | 0 | 0 |
| Provider Encrypt | 0 | 0 | 0 |

### Analysis

We expect the peers missing to be similar for each branch, which is what was observed. In fact, for each run, each node was able to connect to every other node.

## 4.10 bandwidth-total-in

The bandwidth-total-in metric logs the total inbound bandwidth of each node (reported below in megabytes).

| branch | total avg [MB] | min avg [MB] | max avg [MB] |
| -------- | -------- | -------- | -------- |
| Vanilla | 0.997 | 0.143 | 2.363 |
| Double Hash | 1.054 | 0.148 | 2.312 |
| Prefix Lookup | 0.469 | 0.166 | 2.660 |
| Provider Encrypt | 1.311 | 0.173 | 3.149 |

**x-axis: n/a; y-axis: inbound bandwidth (in megabytes)**

![](https://i.imgur.com/1E8tR0D.png)

> \* note: the x-axis ranges from the start of the test to when the test completes. The timestamps are not plotted as the y-values are average values, and thus the timestamps don't line up for every node and run.

### Analysis

We expect this to be around the same for all branches, as no implementation significantly increases the message size or number of messages passed. Provider encryption slightly increases the size of `AddProvider` and `GetProvider` messages due to the addition of a signature and public key. Overall, there were no significant bandwidth increases between implementations, except `Provider Encrypt`, which was higher than the other branches. However, this may be due to the test running longer (as seen in the graph), which increases total bandwidth as more messages are passed.

## 4.11 bandwidth-total-out

The bandwidth-total-out metric logs the total outbound bandwidth of each node (reported below in megabytes).

| branch | total avg [MB] | min avg [MB] | max avg [MB] |
| -------- | -------- | -------- | -------- |
| Vanilla | 1.013 | 0.151 | 2.471 |
| Double Hash | 1.072 | 0.155 | 2.338 |
| Prefix Lookup | 0.476 | 0.169 | 2.876 |
| Provider Encrypt | 1.318 | 0.168 | 3.238 |

**x-axis: n/a; y-axis: outbound bandwidth (in megabytes)**

![](https://i.imgur.com/RDsSA6F.png)

> \* note: the x-axis ranges from the start of the test to when the test completes. The timestamps are not plotted as the y-values are average values, and thus the timestamps don't line up for every node and run.

### Analysis

We expect this to be around the same for all branches, as no implementation significantly increases the message size or number of messages passed.
Provider encryption slightly increases the size of `AddProvider` and `GetProvider` messages due to the addition of a signature and public key. Overall, there were no significant bandwidth increases between implementations, except `Provider Encrypt`, which was higher than the other branches. However, this may be due to the test running longer (as seen in the graph), which increases total bandwidth as more messages are passed.

## 4.12 bandwidth-rate-in

The bandwidth-rate-in metric measures the inbound bandwidth rate of each node (reported below in KB/s).

| branch | total avg [KB/s] | min avg [KB/s] | max avg [KB/s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 15.288 | 2.66e-4 | 92.258 |
| Double Hash | 17.012 | 3.811e-4 | 84.309 |
| Prefix Lookup | 11.806 | 3.597e-6 | 89.680 |
| Provider Encrypt | 23.614 | 2.062e-6 | 128.440 |

### Analysis

We expect this to be around the same for all branches, as no implementation significantly changes the rate at which messages are sent. We observed that `Provider Encrypt` had the highest inbound rate, as well as the greatest difference between min and max averages. This is most likely due to test non-determinism rather than implementation differences.

## 4.13 bandwidth-rate-out

The bandwidth-rate-out metric measures the outbound bandwidth rate of each node (reported below in KB/s).

| branch | total avg [KB/s] | min avg [KB/s] | max avg [KB/s] |
| -------- | -------- | -------- | -------- |
| Vanilla | 16.106 | 2.7e-4 | 94.953 |
| Double Hash | 16.799 | 3.945e-4 | 76.078 |
| Prefix Lookup | 12.361 | 3.302e-6 | 69.706 |
| Provider Encrypt | 23.562 | 2.954e-6 | 135.279 |

### Analysis

We expect this to be around the same for all branches, as no implementation significantly changes the rate at which messages are sent. We observed that `Provider Encrypt` had the highest outbound rate, as well as the greatest difference between min and max averages. This is most likely due to test non-determinism rather than implementation differences.

## 4.14 Number of hops for varying prefix lengths

In this section, the prefix length (in bits) of the lookup key is varied for each set of runs. A shorter prefix length provides more anonymity, but may increase the number of hops before a provider record is found: the queried nodes return a list of closer nodes that match the prefix, but these closer nodes may or may not be closer to the actual desired key. The num-hops metric measures the maximum number of hops before a provider record was found.
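To make that intuition concrete: during a prefix lookup a node can only compare XOR distances over the revealed bits, so two candidate peers that tie on the prefix distance may be very different distances from the full target key. The sketch below is illustrative only and assumes 256-bit keys compared as big-endian byte strings.

```go
// Sketch: XOR distance restricted to the first prefixBits bits of two keys.
// Peers that tie on prefix distance can still be far apart in full-key
// distance, which is why shorter prefixes can add lookup hops.
package dhtbench

import "math/big"

// xorDistance returns the Kademlia XOR distance between two equal-length keys.
func xorDistance(a, b []byte) *big.Int {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return new(big.Int).SetBytes(d)
}

// prefixXORDistance masks both keys to their first prefixBits bits before
// computing the XOR distance, mirroring what a node can see during a
// prefix lookup.
func prefixXORDistance(a, b []byte, prefixBits int) *big.Int {
	mask := func(key []byte) []byte {
		masked := make([]byte, len(key))
		for i := 0; i < len(key) && prefixBits > i*8; i++ {
			if bits := prefixBits - i*8; bits >= 8 {
				masked[i] = key[i]
			} else {
				masked[i] = key[i] & (0xFF << (8 - bits))
			}
		}
		return masked
	}
	return xorDistance(mask(a), mask(b))
}
```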
The maximum hop counts observed were as follows:

| Branch | Run | Avg Hop Count |
| ---- | ---- | ---- |
| Vanilla | 1-3 | 1.73 |
| Double Hash | 1-3 | 1.90 |
| Prefix Lookup (256 bit prefix) | 1-3 | 1.96 |
| Provider Encrypt | 1-3 | 1.83 |

| Branch | Run | Avg Hop Count |
| ---- | ---- | ---- |
| Vanilla | 4-6 | 1.71 |
| Double Hash | 4-6 | 1.78 |
| Prefix Lookup (256 bit prefix) | 4-6 | 1.99 |
| Provider Encrypt | 4-6 | 1.96 |

| Branch | Run | Avg Hop Count |
| ---- | ---- | ---- |
| Vanilla | 7-9 | 1.70 |
| Double Hash | 7-9 | 1.81 |
| Prefix Lookup (256 bit prefix) | 7-9 | 1.87 |
| Provider Encrypt | 7-9 | 2.05 |

| Branch | Run | Prefix Length | Avg Hop Count |
| ---- | ---- | ---- | ---- |
| Prefix Lookup | 10-12 | 128 | 2.08 |
| Prefix Lookup | 13-15 | 64 | 2.14 |
| Prefix Lookup | 16-18 | 32 | 2.07 |
| Prefix Lookup | 19-21 | 16 | 2.32 |
| Prefix Lookup | 22-24 | 8 | 2.32 |
| Prefix Lookup | 25-27 | 4 | 2.12 |

Overall, the average number of hops did not vary much across the different implementations. Notably, the vanilla runs had the lowest hop counts. Using the Vanilla runs 1-3 as the baseline (1.73 hops), there is a slight performance drop in terms of the number of hops required to reach a provider: the largest average hop count observed, 2.32 for Prefix Lookup runs 19-21, is about 34.1% higher than the baseline. Although that percentage increase may seem large, an average hop count of roughly 2 is not, in absolute terms, a substantial number of hops. From a benchmarking standpoint, the results do not show a significant performance decrease for the tested branches. A more thorough examination using a much higher node count may reveal a larger disparity for this metric.

# 5. Conclusions

Overall, performance was not significantly affected by the implementation of `double hashing`, `prefix lookups`, and `provider encryption`. We note that `provider encryption` increases the time for many successive (1000+) lookups by around 23-30%, but this effect is not seen when the first provider is found for a lookup. Other metrics such as CPU, bandwidth, thread count, time to provide, and peer count were not affected.

# 6. Appendix

- Test-plan Implementation: https://github.com/ChainSafe/test-plans/tree/araska/updatesdkgo
- Testground Metrics Parser Implementation: https://github.com/araskachoi/testground-parser
- Test result additional graphs: https://github.com/araskachoi/testground-parser/tree/master/images
  - includes graphs for each run and each metric
- Test Results Spreadsheet: https://docs.google.com/spreadsheets/d/1EOnQAWIwTI9PvhS2sMu_9KMIlTKkjF5gVxI813fx1tk/edit?pli=1#gid=212830386
  - includes average values for each node for each run and metric
  - also includes average/min/max values of each of these averages