# Privacy preserving content routing

We consider the setting of privacy preserving content routing for a large network, where there are up to 10^15 files and 10^9 content providers, and a smaller but more powerful set of content routers. Concretely we consider around 10-100 full routers and around 1,000-10,000 caching routers containing the most frequent elements. While not the case now, it is reasonable to assume the caching routers will hold synchronised data-sets. It is also reasonable to assume that each router and client can hold or fetch an up-to-date table of most, if not all, routers.

The client wants to access a specific piece of data. This is done by hashing the data to get a succinct content ID (cid). This cid is contained in a Distributed Hash Table (DHT) collectively held by the routers. The table maps the cid to a list of IP addresses of content providers, each of which should hold the data in question. Note that this DHT approach builds in resilience and tolerates a lack of full synchronisation, so there are multiple candidate content routers that may or may not hold a specific cid in question.

We expect there to be around 5,000-10,000 unique client queries per second, and a total of 50,000 non-unique queries per second (i.e. where a client contacts around 10 routers with the same request). This model and these numbers match the expected near-future usage of IPFS.

We want to hide from the content routers which cid a client wishes to access, along with the identity of the client. Eventually we also want to be able to hide the identity of clients providing content and the content they provide. That is, for now we only consider *reader privacy* and *identity reader privacy*. Specifically we consider *content reader privacy* towards the routing providers, i.e. that *which* content is accessed is hidden from the routers. We note that *identity reader privacy*, i.e. who the client is, can heuristically be achieved through existing approaches like onion routing (Tor) or through a trusted third party (VPN), as there is no authentication used and thus only metadata can leak the clients' identities.

## Problem sketch

In this setting there are two different patterns that need hiding: 1) the ID of the content router that might actually hold the piece of data the client is looking for; 2) if the current router holds the data, exactly which cid the client is trying to access.

Based on the content ID (cid) the client wishes to access, along with the specification of the current content router and its nearest neighbours, the client will know if it is at a content router that could hold the cid it is looking for, or if it will need to jump to another content router. As mentioned, for reliability IPFS works with multiple content routers that *may* contain the cid. So even if the client is at the router that is supposed to hold the cid according to the classical Kademlia algorithm, that router might not actually hold the cid, but one of its close neighbours will. Whether the client is looking for a new router in its search or the cid is stored in the current router, what the client wishes to do is basically a binary search for a cid: either for the actual cid (in case this content router might hold it) or a search for the set of content routers who hold cids closest to what the client is looking for.
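As a concrete illustration of this search, the sketch below shows the client-side selection of candidate routers by XOR distance, in the style of Kademlia. The `Router` type, the 256-bit ID space and the `closest_routers` helper are illustrative assumptions for this document, not part of any IPFS codebase.

```python
# A minimal sketch of the Kademlia-style lookup the client performs: given a
# cid, rank routers by XOR distance and try the closest few candidates, since
# any of them may hold the cid. All names and parameters are illustrative.
import hashlib
from dataclasses import dataclass

@dataclass
class Router:
    node_id: int   # position in an assumed 256-bit Kademlia ID space
    addr: str      # network address of the router

def cid_of(content: bytes) -> int:
    """Derive a content ID by hashing the data, as described above."""
    return int.from_bytes(hashlib.sha256(content).digest(), "big")

def closest_routers(cid: int, routers: list[Router], count: int = 3) -> list[Router]:
    """Return the `count` routers XOR-closest to the cid; any of these
    *may* hold the cid, so the client must try several candidates."""
    return sorted(routers, key=lambda r: r.node_id ^ cid)[:count]

# Usage: the client ranks its (assumed locally held) router table.
routers = [Router(node_id=int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big"),
                  addr=f"10.0.0.{i}") for i in range(20)]
candidates = closest_routers(cid_of(b"example file"), routers)
```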
While we might reasonably be able to assume the client can hold all the information about the content routers, the approach used for finding a cid on a router can equally be used to find the IPs of the next content routers to visit: the challenge is the same; the client holds a key to a map and wants to learn the corresponding value.

### Partial privacy

We might decide to initiate the deployment of private content routing by having it be optional. This will limit the total network overhead coming from adding privacy preservation. However, this approach can of course be an issue too, since now clients that execute the privacy preserving protocol might be flagged as suspicious. To prevent this, queries must be randomly "upgraded" to the privacy preserving kind, or padded with random non-privacy preserving ones, such that a certain percentage of all queries from a given client will be privacy preserving. This is unfortunately not trivial, as it could require holding a counter for all IPs sending queries, which would in itself be a breach of privacy. Furthermore, in case a client only does privacy preserving queries, these will need to be augmented with non-privacy preserving queries to random elements. Unfortunately this will significantly increase the traffic of that client, as we might only want 1% or 0.1% of the queries to be private. It will also be hard to add convincing random queries, as they will likely be to very infrequently accessed cids, whereas the client may only access the most frequently used cids. This can be seen from traffic analysis, since the privacy preserving queries will then exclusively go to the caching routers while the non-privacy preserving queries will mainly go to the full routers. This issue is related to a suggested hardening technique which we discuss further [here](#Noise-drowning).

## Single routing provider model

In the following we consider the model where a client, the content consumer, only interacts with a single content router. Implicitly we assume that *some* content routers may be malicious and collude. However, we cannot make assumptions on *which* content routers may be malicious or collude.

We consider a solution where the content router gets to learn *if* it holds the cid the client is looking for. This can be achieved using multiple approaches, such as 2PC and PIR, and we note that one approach already exists which achieves exactly this: [the unbalanced, one-way PSI](https://www.usenix.org/system/files/sec19-kales.pdf). Unfortunately this requires sending a very large data object (a Cuckoo filter) to every client. Asymptotically it will be quasi-linear in the number of keys in the map, but *not* dependent on the size of the values in the map. However, this is not enough, as the client also needs to learn the value associated with the key it is looking for. For this an unbalanced one-way [*labeled* PSI](https://eprint.iacr.org/2018/787.pdf) approach can be used. We note that unbalanced one-way labeled PSI is equivalent to PIR when the small set contains a single cid. Thus what we really want is actually an unbalanced one-way PSI followed by a PIR. We do note that this requires the content to be ordered in a way where the client knows exactly which place in a contiguous data structure they need to access; then they can use PIR to retrieve that record. A complete solution for this is achieved by [Cong et al.](https://eprint.iacr.org/2021/1116.pdf), which is not as efficient as one could hope for and may lead to leakage in the multi-client setting. So more investigation will be needed if this path is chosen.

Otherwise a general PIR solution may be used. This could be used both to avoid the need to send a large Cuckoo filter to the client *and* for the client to retrieve the cid needed if held by the router. Further investigation must be undertaken in this area to find a solution that allows a client-independent preprocessing phase which can be used by multiple, possibly colluding clients. The approach should also allow for relatively efficient updates to the database, as content router tables may change often. A particularly promising approach is for example [Spiral](https://eprint.iacr.org/2022/368.pdf), although its scalability is far from the level required for IPFS.

### An alternative to PIR

It is possible to replace the sending of a large Cuckoo filter and the PIR protocol with a k-anonymity approach, where the Cuckoo filter and map are cut into sub-filters/maps, each of which is small enough for the client to download fully. Thus the client can request the appropriate filter/map, but at the cost that the router learns which chunk of k elements the client wishes to access.
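A minimal sketch of this chunking idea follows, assuming the routing map is bucketed by a short prefix of the hashed cid; the prefix length and all names are illustrative choices, not a fixed design.

```python
# A minimal sketch of the k-anonymity alternative: the routing map is cut into
# sub-maps small enough to download whole, indexed by a short prefix of the
# hashed cid. The router only learns the prefix, i.e. which group of elements
# the client is interested in, never the exact cid.
import hashlib

PREFIX_BITS = 12  # 4096 chunks; tune so each chunk stays small enough to ship

def chunk_index(cid: bytes) -> int:
    """All cids sharing this hash prefix fall in the same chunk."""
    digest = hashlib.sha256(cid).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - PREFIX_BITS)

# Router side: bucket the cid -> provider-IP map into sub-maps.
def build_chunks(table: dict[bytes, list[str]]) -> dict[int, dict[bytes, list[str]]]:
    chunks: dict[int, dict[bytes, list[str]]] = {}
    for cid, ips in table.items():
        chunks.setdefault(chunk_index(cid), {})[cid] = ips
    return chunks

# Client side: fetch the whole chunk, then resolve the cid locally, so the
# router never sees which element within the chunk was wanted.
def lookup(chunks: dict[int, dict[bytes, list[str]]], cid: bytes) -> list[str]:
    chunk = chunks.get(chunk_index(cid), {})   # stands in for the network fetch
    return chunk.get(cid, [])
```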
## Multiple routing provider model

In the following we consider a protocol assuming there is a non-colluding third party assisting the content router. Our first approach requires it to hold exactly the same data as the content router, whereas our second approach is agnostic to the content of the router; the latter builds on top of the single routing provider solution sketched above.

### Multi-server PIR

PIR can be efficiently and statelessly realised using information theoretic crypto when multiple non-colluding servers collaborate. However, such an approach requires the servers to have the exact same database and do a preprocessing step based on this data. This might not be a feasible assumption in the IPFS model.
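For intuition, below is a toy version of the classic 2-server XOR-based PIR, assuming two non-colluding servers holding identical databases of fixed-size records. Each query vector on its own is uniformly random, so neither server learns the queried index; the record size and names are illustrative.

```python
# A toy 2-server information-theoretic PIR. The client sends a random
# selection vector to server 1 and the same vector with the wanted bit
# flipped to server 2; XORing the two answers yields the wanted record.
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(db: list[bytes], selection: list[int]) -> bytes:
    """Each server XORs together the records selected by the client's vector."""
    acc = bytes(len(db[0]))
    for idx, bit in enumerate(selection):
        if bit:
            acc = xor_bytes(acc, db[idx])
    return acc

def query(db_size: int, wanted: int) -> tuple[list[int], list[int]]:
    """Each of the two vectors alone is uniformly random."""
    v1 = [secrets.randbelow(2) for _ in range(db_size)]
    v2 = v1.copy()
    v2[wanted] ^= 1
    return v1, v2

# Usage: all records except the wanted one cancel out in the XOR.
db = [i.to_bytes(8, "big") for i in range(16)]
q1, q2 = query(len(db), wanted=5)
record = xor_bytes(server_answer(db, q1), server_answer(db, q2))
assert record == db[5]
```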
### An ORAM solution

An ORAM construction generally works in the setting where a single client with limited storage wishes to outsource the storage of large amounts of data to an untrusted server. Simply encrypting and authenticating the data before sending it to a cloud provider would be sufficient to achieve this. However, in an ORAM it is furthermore desired that the client can obliviously access, modify and add to the data already stored, *without* leaking the access pattern to the server.

This somewhat fits the problem of privacy preserving content routing, although we have an essentially unlimited number of clients, each with their own data, who cannot be trusted to only access and modify their own data. Furthermore, the data in itself is not secret, and since clients can come and go, the server could also play a client and thus try to learn which pieces of data it has stored where, and hence extract information about the access pattern of future clients. The latter problem can, however, be abstracted away by assuming that data is identified by a high entropy ID.

Research does exist that adds support for multiple clients. However, the solutions generally assume either a [PIR substructure](https://eprint.iacr.org/2017/329.pdf) or a [trusted proxy](https://sites.cs.ucsb.edu/~rachel.lin/papers/TaoStore-CameraReady.pdf). In fact, without the usage of a proxy, there is a [lower bound](https://eprint.iacr.org/2017/329.pdf) of Ω(n) server computation time and Ω(sqrt(n)) communication per query, although solutions with O(log n) communication exist based on [FHE (although still with O(n) computation) or, in the multi-server setting, on distributed point functions](https://eprint.iacr.org/2020/1551.pdf). While [ConcurORAM](https://zxr.io/research/sion2019ndss.1.pdf) aims to avoid these problems, it only manages to do so if clients behave honestly. [Another scheme](https://eprint.iacr.org/2018/363.pdf) uses ideas from coding theory, along with fully homomorphic encryption, to construct a threshold scheme which is secure as long as a certain threshold of *clients* don't collude. Furthermore, the above-mentioned solutions all require non-constant client storage.

None of these solutions is optimal in our setting, as they require either shared state between multiple servers, heavy O(n) computation, or a trusted proxy. We need a more efficient and advanced solution. Concretely we have the following requirements:

1. Clients can join ad-hoc and must not be required to hold a state.
2. Clients can read and write data, with a high entropy identifier being associated with each element. Crucially this means that *any* client knowing the identifier of an element should be able to access it, even if they did not write it.
3. The system must be secure against multiple, colluding clients. I.e. a set of colluding clients cannot compromise the privacy of the access pattern of online clients.
4. Data *cannot* be secret-shared between multiple servers. I.e. only a single server can use storage linear in the data.
5. Communication must be sublinear in n, ideally polylog(n).
6. Both server and client computation must be sublinear in n.

However, we are able to assume that a client knows the position of the data they wish to access or write. This can be achieved through the use of augmented Cuckoo filters and by assuming each element is identified through a high entropy ID.

When looking at how an ORAM is constructed, we see that the client holds a state consisting not only of encryption/authentication keys, but also of information on how their data is pseudorandomly distributed on the server. This data is used during a read to find the file the client is looking for, but crucially it is also updated, since after reading, the file's location on the server must change. In a multi-client ORAM such information must necessarily be shared and updated between the clients. But not only that: care must also be taken to ensure that clients colluding with the ORAM server cannot learn information about other, honest, clients' access patterns.

To handle this we suggest starting from the idea of a proxy, as [suggested above](https://sites.cs.ucsb.edu/~rachel.lin/papers/TaoStore-CameraReady.pdf). However, we cannot simply add a trusted proxy, so instead we suggest realizing such a proxy through threshold cryptography. The threshold proxy will store knowledge of the state held at the server, while remaining oblivious to the content. It will interact with the real clients and, without learning which element a client wishes to access, help them compute the positions of the data elements on the server they will need to access. The proxy will then, in collaboration with the clients, ensure that the position of the data gets rerandomized and the data gets reencrypted. Similarly for the case of writing new data.
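As a rough illustration of how such a proxy could derive positions, the sketch below uses an n-out-of-n distributed PRF: each proxy server holds an independent key, and the position tag is the XOR of the per-server PRF outputs, so no strict subset of proxy servers can predict a tag. This is only the key-distribution half of the idea; in a real protocol the evaluation would additionally have to be *oblivious*, so the proxy servers do not learn which element identifier is queried. All names and parameters are illustrative.

```python
# A minimal n-out-of-n distributed-PRF sketch for deriving an element's
# current server-side position. The epoch input forces fresh positions after
# each reshuffle. Oblivious evaluation (hiding the element id from the proxy
# servers) is deliberately omitted here.
import hmac, hashlib, secrets, functools

NUM_PROXIES = 3
proxy_keys = [secrets.token_bytes(32) for _ in range(NUM_PROXIES)]  # one key per proxy server

def proxy_share(key: bytes, element_id: bytes, epoch: int) -> bytes:
    """One proxy server's PRF share of the position tag."""
    return hmac.new(key, epoch.to_bytes(8, "big") + element_id, hashlib.sha256).digest()

def position_tag(element_id: bytes, epoch: int, table_size: int) -> int:
    """XOR of all shares; pseudorandom unless *every* proxy key is known."""
    shares = [proxy_share(k, element_id, epoch) for k in proxy_keys]
    combined = functools.reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)
    return int.from_bytes(combined, "big") % table_size

# Usage: any client knowing the identifier can be helped to the same position.
pos = position_tag(b"element-id", epoch=7, table_size=1 << 20)
```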
Based on this we want to realize a system with the following security guarantees:

1. A malicious server will not be able to learn anything about the data it holds or the access pattern of clients.
2. Assuming not all parties realizing the proxy collude, they will learn nothing about the elements clients access.
3. Security of honest clients must hold, even if some clients are malicious and colluding with the server and a strict subset of the proxy servers.

The last requirement seems to be the hardest, as it implies that the position of data elements must remain hidden as long as at least one proxy server remains honest. It also requires that data can be authenticated and encrypted as long as at least one proxy server is honest.

#### How to realize a threshold proxy ORAM

Previous approaches to multi-client ORAM have started from [Path ORAM](https://eprint.iacr.org/2013/280.pdf) (a tree-based ORAM). With our requirements this gets tricky, since a tree-based ORAM has a random path associated with each element. Since we cannot require the clients to keep state, such a path must be stored by the threshold proxy. We could imagine a path being derived using a threshold PRF. However, the path must be updated for each element at each access. Since any client must be able to access any element, as long as they know the identifier, it is not clear how to update the path in a way where other clients obliviously become aware of such an update.

For this and other reasons, it seems more feasible to start from square root ORAM, or more generally [hierarchical ORAM](https://dl.acm.org/doi/10.1145/233551.233553). In such constructions the stored data is obliviously shuffled at certain intervals. Concretely this is done by assigning random integers to the elements and then obliviously sorting the elements. Assigning random integers to the elements can easily be done by the proxy using various standard threshold techniques. The oblivious sorting can of course be realised using generic MPC, but the cost will be high, so a more clever solution should be found. Furthermore, (re)encryption of elements must also be done by the proxy. This can likewise be realized using standard techniques such as threshold encryption, or using one-time pads based on [oblivious PRFs](https://eprint.iacr.org/2022/302.pdf). It is also of interest to note that oblivious PRFs can be used to map elements to positions in a Cuckoo filter, and hence facilitate an efficient way for clients to learn which positions in the database they need to access.
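To make the shuffle step concrete, here is a stripped-down sketch of one reshuffle interval: each element gets a fresh random integer and the table is re-ordered by sorting on it. The sort runs in the clear purely for clarity; in the actual construction the integers would come from the threshold techniques above and the sort must be oblivious (e.g. a sorting network inside MPC).

```python
# A minimal sketch of the hierarchical-ORAM reshuffle described above: fresh
# pseudorandom sort keys re-randomize the server-side order, so positions
# learned before the shuffle become useless afterwards.
import secrets

def reshuffle(elements: list[bytes]) -> list[bytes]:
    tagged = [(secrets.randbits(64), elem) for elem in elements]
    tagged.sort(key=lambda t: t[0])          # obliviously sorted in the real protocol
    return [elem for _, elem in tagged]      # elements would also be re-encrypted here
```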
## Hardening techniques

We go through some hardening techniques that might not yield formally provable privacy, but still increase the privacy achieved in practice. We note that these approaches can be combined.

### Noise drowning

Having each client request multiple random entries from the same content router, besides the entry it is actually looking for, can provide some plausible deniability in relation to the file accessed, for non-repeated access. This is known as k-anonymity. However, since the cids are hash digests, they will look random, and thus care must be taken to ensure that a client can efficiently request cids that do exist at a specific content router. This issue can be solved by adding the routing information to an augmented Cuckoo filter. In a bit more detail:

1. The routers construct a Cuckoo filter of fingerprints derived from the cids they must link. However, instead of storing only the fingerprints in each bucket, they also store the IPs of the content providers holding the data associated with each of the cids in a given bucket.
    * Cuckoo filters use two hash functions: 1) the tag-hash function hashes the input to a short tag, which is what is actually stored in the filter; 2) the position-hash function hashes the data to a position in the Cuckoo filter. To ensure consistency in client requests in step 2, we define the position-hash function to first hash the input using the tag-hash function, and then use this digest as input to another hash function.
2. The client requests the content of the two possible indexes, in the Cuckoo filter, of the cid it is looking for. However, it also picks k-1 random tags. Based on these it computes the two possible positions in the filter where they could be stored. The client requests the content of the buckets in these 2k positions; that is, both the tags *and* the IPs of the content providers of the elements in the buckets.
3. The client checks if the content of the buckets in the true positions it was looking for contains the tag computed from the cid. If it does, it can now use the IPs to retrieve the content at the content providers. If not, it will need to contact another router.

We make some observations based on this approach:

* The table does not need to store the entire cid, but only a fingerprint of it which is large enough to avoid non-negligible collisions of items. Thus space could perhaps be saved over what is currently done.
* Deriving the position-hash function from the tag-hash function allows the client to fake consistent positions in the hash table despite not knowing the cid it is faking. That is, similar to the double hashing approach, the router does not learn any cid of the data it holds, nor the cid of any data a client requests.
* Thus the server will not know which was the real tag the client was looking for, if it was found, nor which cid the client was actually looking for.
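To make the request flow concrete, here is a sketch of the tag and position derivation plus the k-anonymous bucket request. The bucket count, tag length, and all names are illustrative assumptions; note how both candidate positions are functions of the tag alone, so random tags yield consistent fake requests.

```python
# A minimal sketch of the augmented Cuckoo filter request flow from the
# steps above: positions derive from the tag (not the cid), so the client can
# mix its real tag with k-1 random tags mapping to consistent buckets.
import hashlib, secrets

NUM_BUCKETS = 1 << 16   # power of two so the XOR trick keeps positions in range
TAG_BYTES = 4

def tag_hash(cid: bytes) -> bytes:
    """Short fingerprint actually stored in the filter."""
    return hashlib.sha256(b"tag" + cid).digest()[:TAG_BYTES]

def positions(tag: bytes) -> tuple[int, int]:
    """Both candidate buckets are functions of the tag alone."""
    i1 = int.from_bytes(hashlib.sha256(b"pos" + tag).digest()[:4], "big") % NUM_BUCKETS
    i2 = i1 ^ (int.from_bytes(hashlib.sha256(b"alt" + tag).digest()[:4], "big") % NUM_BUCKETS)
    return i1, i2

def client_request(cid: bytes, k: int) -> tuple[bytes, list[int]]:
    """Real tag plus k-1 random tags -> 2k bucket indexes to request."""
    real = tag_hash(cid)
    tags = [real] + [secrets.token_bytes(TAG_BYTES) for _ in range(k - 1)]
    wanted = [i for t in tags for i in positions(t)]
    secrets.SystemRandom().shuffle(wanted)   # hide any ordering of real vs. noise
    return real, wanted

def resolve(real_tag: bytes, buckets: dict[int, list[tuple[bytes, list[str]]]]) -> list[str]:
    """Scan the returned buckets for the real tag; its entry carries the
    provider IPs (the 'augmented' part of the filter)."""
    for idx in positions(real_tag):
        for tag, ips in buckets.get(idx, []):
            if tag == real_tag:
                return ips
    return []   # not here -- try another router
```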
While this approach allows for plausible deniability for the client, the actual privacy achieved may depend on the popularity of the actual cid the client is looking for, compared to the k-1 random cids it implicitly picks as noise. To properly hide the cid the client requests, it must pick k-1 cids which are either approximately as popular as the true cid, or the total set of k cids picked must cover elements of all possible popularities. A possible solution would be to have the routers store multiple augmented Cuckoo filters, one for each level of popularity. The client then requests k elements from each of the augmented Cuckoo filters and checks each of them when returned. The servers can update the different filters by keeping a counter associated with each bucket, which is incremented each time the bucket is requested. Once a bucket in a less-frequent Cuckoo filter has a higher counter than one in a more-frequent Cuckoo filter, the tags, and their associated IPs, in the buckets get moved from one filter to the other.

### Encrypting content provider ID

Encrypting the content provider IDs using a private key derived from the content itself will hide the content providers from the content router, and thus make it harder to obtain auxiliary information on the file the client is accessing. This is currently what is being achieved using the new ["double hashing" approach](https://www.notion.so/pl-strflt/Double-Hashing-for-Privacy-ff44e3156ce040579289996fec9af609), where some anonymity is also added by storing routing information based on the hash of the cid, instead of on the cid itself.

However, the entropy in a cid is low and somewhat public. Thus increasing the entropy used in the encryption is highly desirable. It would also be highly desirable to add entropy to the table stored by the content router, such that each entry will not just be based on the cid, but on the router ID and some extra entropy. We can achieve this through the use of a third, somewhat trusted server. The idea is that instead of just using the cid to compute the encryption key for encrypting the provider info, the router's identity and some randomness from a third party are combined to derive the encryption key. This can be achieved through the use of an OPRF: the entropy derived from the cid, along with the ID of the content router, is given as input to an OPRF, where the third party holds the PRF key. The output of this OPRF is then used as the encryption key. Thus the somewhat trusted server does not learn anything about the file the client is looking for, and the encrypted IDs of the content providers will be different across different content routers.

However, more importantly, the somewhat trusted third party will be able to throttle excessive requests. In particular, if the OPRF used is partially blind, and the content router ID is allowed to be learned by the server, then it can block a malicious content router trying to decrypt its own database, should that router have managed to learn the cids it holds. We finally note that the somewhat trusted third party could actually be realized by a threshold algorithm instead of a single server. Thus the Medusa network could be used to provide such a hardening.

Computing the server part of a Diffie-Hellman based partially blind OPRF is computationally bounded by a [single pairing operation and a constant number of group exponentiations](https://eprint.iacr.org/2022/302.pdf), regardless of whether the OPRF is distributed or not. In case of distribution we note that it is generally trivial to realise arbitrary *t-out-of-n* thresholds in the *static* model, and that *proactive* security can be achieved cheaply. It can be expected that a powerful CPU core can handle around 50-100 requests per second. Thus, using a powerful CPU with e.g. 32 cores, and assuming that a server can be realized as a cluster, we would require around 30 physical CPUs per server instance to handle the expected demand of roughly 50,000 queries per second. If choosing to realise this in a threshold manner, then a small t and large n would be needed to ensure up-time and reliability. Of course this would also require the servers to be quite trusted, in that it should be unlikely that t of them collude.
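For intuition, here is a toy evaluation of a Diffie-Hellman style blind OPRF of the kind referenced above: the client learns the PRF output on its input while the server holding the key never sees that input. The parameters are deliberately tiny (an 11-bit safe prime) so the algebra is easy to follow; a real deployment would use a standard large group or elliptic curves, and the partially blind variant (router ID visible to the server) needs a different construction and is not shown.

```python
# Toy DH blind OPRF: client blinds H(input)^r, server exponentiates with its
# key k, client unblinds with r^-1 to obtain H(input)^k, used as an
# encryption key. Parameters are toy-sized for readability only.
import hashlib, secrets

P = 2039            # toy safe prime, P = 2*Q + 1
Q = (P - 1) // 2    # prime order of the quadratic-residue subgroup

def hash_to_group(data: bytes) -> int:
    """Hash into the order-Q subgroup by squaring."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big") % P
    return pow(h, 2, P)

server_key = secrets.randbelow(Q - 1) + 1  # the (possibly thresholdized) PRF key k

# Client blinds: a = H(cid-entropy || router-id)^r, hiding the input.
oprf_input = b"cid-entropy" + b"router-42"  # illustrative inputs
r = secrets.randbelow(Q - 1) + 1
a = pow(hash_to_group(oprf_input), r, P)

# Server: a single exponentiation with its key, b = a^k.
b = pow(a, server_key, P)

# Client unblinds: b^(r^-1 mod Q) = H(input)^k, then hashes down to a key.
unblinded = pow(b, pow(r, -1, Q), P)
enc_key = hashlib.sha256(oprf_input + unblinded.to_bytes(2, "big")).digest()
assert unblinded == pow(hash_to_group(oprf_input), server_key, P)
```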
### Secure hardware

General purpose secure hardware for confidential computing, such as ARM TrustZone and Intel SGX, *supposedly* allows for private and verifiable computation of arbitrary programs without much of a performance penalty compared to the execution of a regular program. Unfortunately, [multiple attacks have surfaced against](https://sgaxe.com/files/SGAxe.pdf) these supposedly secure execution environments. Despite this, multiple advanced and security critical applications are [still constructed](https://www.town-crier.org) and [even put to use](https://medium.com/@maniacbolts/signal-increases-their-reliance-on-sgx-f46378f336d3). The reason for this is that, while many attacks are feasible, they do require physical and administrator access to the SGX machine, and are neither trivial nor efficient to carry out.

So while SGX and related technologies cannot be used in lieu of cryptographic techniques like MPC, they provide a solid hardening technique compared to running everything without the use of a secure enclave. Concretely, a client encrypts the entry it wishes to access under a key stored in the router's secure enclave, and also encrypts a symmetric key known only to the client. The client sends the encrypted entry and encrypted symmetric key to the router. The router runs a program in SGX which decrypts the entry and the symmetric key, searches for the entry, and encrypts the result under the supplied symmetric key. If the entry is not there, a null-string is computed instead. The encrypted result is then returned to the client, who can decrypt it.

SGX can use up to 128 MB of RAM on the machine. Depending on how liberal assumptions one wishes to make about a malicious router, it might be fine to leak which 128 MB chunk of the routing table the user will access, *and* to run a binary search within this chunk. Alternatively, the table could be stored using the augmented Cuckoo filter approach mentioned previously. In that case, when combining with the [noise drowning](#Noise-drowning) approach, we could assume the routing table is segregated into 128 MB chunks where, by default, the most frequently used chunk resides in the SGX memory. Thus in most cases there would be no need for the router to load new data into memory, and it would be able to answer a query in constant time by running a Cuckoo filter look-up in SGX.

## Questions

### Double hashing

Original spec / plan for DHT: https://www.notion.so/protocollabs/IPFS-Double-Hashing-Repurpose-8fdaae8748414ae592a5d24d59c0d8ed

For network indexers: https://github.com/ipni/specs/pull/5