:::info
:bulb: This document provides a summary of [🔗 Original Source: "Content Censorship in the InterPlanetary File System"](https://arxiv.org/pdf/2307.12212.pdf) as well as provides an extended background on DHTs and how they operate in IPFS.
:::
<center>
<img src="https://hackmd.io/_uploads/H1VaOr4j3.png" width="450" height="450">
</center>
[ToC]
## Abstract
From, [🔗 Original Source: "Content Censorship in the InterPlanetary File System"](https://arxiv.org/pdf/2307.12212.pdf):
>Abstract—The InterPlanetary File System (IPFS) is currently the largest decentralized storage solution in operation, with thousands of active participants and millions of daily content transfers. IPFS is used as remote data storage for numerous blockchain-based smart contracts, Non Fungible Tokens (NFT), and decentralized applications.
>
> We present a content censorship attack that can be executed with minimal effort and cost, and that prevents the retrieval of any chosen content in the IPFS network. The attack exploits a conceptual issue in a core component of IPFS, the Kademlia Distributed Hash Table (DHT), which is used to resolve content IDs to peer addresses. We provide efficient detection and mitigation mechanisms for this vulnerability. Our mechanisms achieve a 99.6% detection rate and mitigate 100% of the detected attacks with minimal signaling and computational overhead. We followed responsible disclosure procedures, and our countermeasures are scheduled for deployment in the future versions of IPFS.
## Introduction
IPFS is a content-centric network where each piece of content is identified by a Content Identifier (CID), similar to BitTorrent or Content-Centric Networking. Data retrieval in a content-centric network requires mapping CIDs into network identifiers of nodes hosting the content, called providers.
The author of [🔗 Original Source](https://arxiv.org/pdf/2307.12212.pdf) provides a description and implementation for a censorship attack of content on the IPFS network by creating many sybils that are "closest" to the content requested. The author also provides a decscription and implementation of how to detect the attack and how to mitagate by improving on prior work.
## Background
A Distributed Hash Table (DHT) is a decentralized distributed system that provides a lookup service similar to a hash table: (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can join or leave the network with minimal disruption.
DHTs form an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, instant messaging, and more.
One of the most well-known DHTs is Kademlia, which is used by IPFS. Kademlia is a specific type of DHT that uses a system of nodes to store and retrieve data. Each node in Kademlia has a unique identifier, and when a node wants to store or retrieve data, it uses these identifiers to find a path to the node that has the data. This makes Kademlia extremely efficient and scalable, as the number of steps it takes to find a node is logarithmic relative to the total number of nodes.
In the context of IPFS, the DHT is used for content addressing. Each file and all of the blocks within it are given a unique fingerprint called a cryptographic hash. IPFS removes duplications across the network and tracks version history for each file. Each network node stores only content it is interested in, and some indexing information that helps figure out who is storing what.
### Simplified View
When a user requests a file from the IPFS network, the user's IPFS client converts the human-readable file request into a CID, which is a cryptographic hash of the file's content. The client then uses the DHT to find peers who have the content associated with that CID. This is done by querying the DHT for the CID. The client doesn't need to query every node in the network, but instead, it queries a subset of nodes that should, according to the DHT's algorithms, be closer to the information it needs.
The queried nodes respond with either the requested content or with a closer node (or nodes) to the CID in the DHT. This process continues until the client finds a node that has the content. Once the client has found a node with the content, it retrieves the content directly from that node.
A simplified diagram to illustrate this process:
```mermaid
graph TB
A[User] -->|Request File| B[IPFS Client]
B -->|Convert to CID| C[CID]
C -->|Query DHT| D[DHT]
D -->|Find Node with Content| E[Node]
E -->|Retrieve Content| F[Content]
F -->|Return File| A
classDef green fill:#9f6,stroke:#333,stroke-width:2px;
classDef orange fill:#f96,stroke:#333,stroke-width:4px;
class A,B green
class C,D,E,F orange
```
In this diagram:
- The User (green) requests a file from their IPFS Client (green).
- The IPFS Client converts the file request into a CID (orange).
- The CID is used to query the DHT (orange) to find a Node (orange) that has the content.
- The Node retrieves the Content (orange).
- The retrieved Content is returned to the User (green).
For more detailed information about IPFS and how it works, you might want to refer to the [official IPFS documentation](https://docs.ipfs.io/concepts/how-ipfs-works/).
### Advanced View
In order to engage with the DHT, each peer generates a public/private key pair. From this, it derives an identity peerid ∈ {0, 1}^256, which is the hash of its public key. Ideally, each peer generates a random key pair and as such, peer IDs are uniformly and independently distributed over the space {0, 1}^256. While honest nodes abide by this rule, malicious nodes may generate and choose from an arbitrary number of key pairs. Each peer maintains a routing table consisting of m = 256 buckets. The i-th bucket contains the addresses of up to k = 20 peers, with whom it shares an i-bit common prefix with the peer's own peer ID.
A new participant node connects to the IPFS network by reaching out to one of the hardcoded bootstrap nodes. The bootstrap node supplies the new node with initial peers, which enables the new node to join the DHT. This information facilitates a DHT walk towards the new node's peer ID. This walk serves a dual purpose:
1. Ensure no other node in the network possesses the same ID.
2. Discover new peers and populate the newcomer's DHT routing table.
Simultaneously, the newcomer establishes Bitswap connections with a subset of encountered peers (usually around 300 of them). Bitswap's primary function is to facilitate bilateral content transfer and to act as a cache for recently accessed content.
The main operation of the DHT, `GETCLOSESTPEERS(key)`, returns the k = 20 closest peers to the key. In Kademlia, the distance between two keys x and y in the key space is defined by x⊕y ∈ {0, ..., 2^256−1}, where ⊕ signifies the bitwise XOR operation on the keys. When a client wants to find the peers with IDs closest to the key, it sends a request to the α = 3 peers in its routing table whose peer IDs are closest to the key. This process continues until no more peers closer to the key are discovered. This process has limitations due to network churn and imperfect routing tables, as observed in our experiments (more details in Section VIII, Figure 6).
Content in IPFS is resolved in a specific manner since it is a content-centric network. IPFS allows participants to request files without specifying their location. Content is indexed by content IDs (cid ∈ {0, 1}^256) derived from a hash of the content. Both peer IDs and CIDs serve as keys in the DHT.
Every node in the network can take on the role of a provider, downloader, or resolver. The process of content resolution is as follows:
- Providers wishing to publish content on IPFS create a provider record containing the content ID and provider’s address. They first use `GETCLOSESTPEERS(cid)` to find the k = 20 peers whose peer IDs are closest to the cid and then send them a `PutProvider` message with the provider record. These peers are called the resolvers for the cid.
- When a downloader wishes to fetch content, it sends a request to all its Bitswap peers. If none of them have the content, the downloader uses the DHT-based resolution system (Bitswap unstructured search). If this fails, the downloader resorts to `FINDPROVIDERS(cid)`. This operation uses a DHT walk similar to `GETCLOSESTPEERS(cid)` to find k resolvers and also queries encountered nodes for a provider record for the cid.
- Upon receipt of a provider record, the client connects to the address in the provider record to retrieve the actual content. Note that provider records are not authenticated, and hence malicious providers may respond with incorrect provider records or may not respond at all. However, the integrity of the content is preserved since the hash of the retrieved content can be verified against its cid.
```mermaid
sequenceDiagram
participant D as Downloader
participant B as Bitswap Peers
participant R as DHT-based Resolution System
participant P as Provider
D->>B: Request content
B-->>D: No content found
D->>R: Use FINDPROVIDERS(cid)
R-->>D: Return k resolvers and provider records
D->>P: Connect to provider address
P-->>D: Return actual content
```
## The Attack
The attack exploits a conceptual issue in a core component of IPFS, the Kademlia Distributed Hash Table (DHT), which is used to resolve content IDs to peer addresses. The attack can be performed from a single, resource-constrained machine at very little cost ($4 using AWS) and makes the provider records unavailable after a time that ranges from a few seconds to up to 48 hours depending on the initial setup.
The attack proceeds as follows. First, the attacker uses the IPFS API to retrieve the current k = 20 closest peer IDs to the target CID, sort them by distance to the CID, and identify the closest one. The attacker then repeatedly generates random public/private key pairs and computes new peer IDs by hashing the public key. If a generated peer ID is closer to the CID than to the currently closest one, the attacker replaces the closest peer ID with the newly generated one.
The attacker can easily make any content undiscoverable in the network by strategically placing a small number of Sybil identities in the DHT. The effectiveness of the attack was confirmed by removing multiple, specifically crafted content from the live IPFS network.
```mermaid
sequenceDiagram
participant A as Attacker
participant IPFS as IPFS Network
participant DHT as Kademlia DHT
A->>IPFS: Retrieve k=20 closest peer IDs to target CID
loop Generate new peer IDs
A->>A: Generate random public/private key pairs
A->>A: Compute new peer IDs by hashing public key
end
A->>DHT: Place Sybil identities close to target CID
Note over A,DHT: Content becomes undiscoverable
```
This sequence diagram illustrates the steps of the attack. The attacker first retrieves the closest peer IDs to the target CID from the IPFS network. The attacker then generates new peer IDs and places Sybil identities in the DHT close to the target CID, making the content undiscoverable.
## Detection Technique
The first step in countering an attack is its reliable detection. It enables activating mitigation techniques only when needed and avoids unnecessary overhead when the network is not under attack. Importantly, the detection cannot simply rely on the unavailability of the provider records as this could be due to natural network dynamics.
The authors present a reliable attack detection technique that analyzes the distribution of peer IDs in the network using the KL Divergence metric. This method extends previous work and leverages a local density-based network size estimator to automatically adjust the detection to the dynamic size of the IPFS network. The detection can be performed by any node during regular content resolution operations and does not incur any additional message overhead.
The detection method flags a CID as under attack if the divergence is above a certain threshold. In the experiments, the detection method was able to detect 99% of the attacks with a false positive rate of 4%.
The detection method is robust to changes in the network size because it automatically calculates the model distribution based on the current network size. This makes it adaptable to the dynamic nature of the IPFS network.
## Mitigation Technique
From [🔗 Original Source: "Content Censorship in the InterPlanetary File System"](https://arxiv.org/pdf/2307.12212.pdf),
>Main idea. The core observation behind our approach is that, while an attacker can spawn additional Sybil identities, it has no way of removing the honest ones from the network. As long as the provider can send its provider record to the initial honest resolvers, and the downloader can communicate with these resolvers, the censorship attack will be mitigated. To maintain communication with the initial honest resolvers even during an attack, we propose region-based DHT queries. Rather than communicating with the k = 20 closest peers to a CID, that an attacker can easily control, we communicate with all the nodes in the hash space region that k = 20 uniformly distributed peer IDs should cover (Figure 4). The size of this region is calculated using the network size estimate and using the assumption that honest peer IDs are distributed uniformly over the key space. Such an approach ensures that regardless of the number of Sybil nodes placed by an attacker, the provider can store provider records on ≈ 20 honest resolvers, and the downloader also communicates with ≈ 20 honest resolvers to reliably retrieve the correct provider records. To prevent additional overhead when there is no attack, we run the region-based queries only when an attack is detected using the detection mechanism
The mitigation technique introduced allows the discovery of provider records regardless of the number of Sybil nodes placed by an attacker around the target CID. The mitigation replaces the regular put and get DHT operation by hash space region-based queries. Using these, providers always find honest resolvers to store their provider records and querying nodes always discover these honest resolvers and receive true provider records. While introducing an overhead linear in the number of Sybil nodes placed close to the target CID, this mitigation is only enforced when suspicions exist about the existence of an attack, as indicated by the detection mechanism.
The mitigation technique involves querying the Sybil nodes for provider records. The Sybil nodes may return a large number of fake provider records so that downloaders keep trying them, thereby slowing down the resolution. The impact of such an attack can be reduced if the downloader only tries a single provider record obtained from each resolver and prefers records obtained from resolvers with diverse IP addresses (i.e. from different /24 networks). Any attempts to significantly delay the resolution would sharply increase the attacker’s cost.
The detection and mitigation implementations are fully compatible with the unmodified IPFS clients and can be incrementally deployed in the system. Therefore, while nodes that have not implemented the detection and mitigation techniques remain vulnerable to the attack, they do not prevent other nodes from detecting and mitigating the attack.
```mermaid
sequenceDiagram
participant P as Provider
participant D as Downloader
participant S as Sybil Node
participant R as Honest Resolver
P->>R: Store provider records
D->>S: Query for provider records
S->>D: Return fake provider records
D->>R: Query for provider records
R->>D: Return true provider records
Note over D,R: Content becomes discoverable
```
This sequence diagram illustrates the steps of the mitigation technique. The provider stores provider records with honest resolvers. The downloader queries for provider records from both Sybil nodes and honest resolvers. The Sybil nodes return fake provider records, while the honest resolvers return true provider records, making the content discoverable.
## Conclusion
In conclusion, the original paper unveils a serious threat to the IPFS network, demonstrating how the Kademlia DHT could be exploited to render content undiscoverable with minimal cost. However, the paper also proposes efficient solutions to counter this problem.
The detection technique, leveraging the KL Divergence metric, can accurately identify attacks while adapting to the network's dynamic nature. The mitigation approach introduces region-based DHT queries, ensuring providers can store and downloaders can retrieve content despite the presence of Sybil nodes. These techniques can be deployed incrementally and are compatible with unmodified IPFS clients.
This solution provides weak probabilistic gurantees of censorship resistance.