owned this note
owned this note
Published
Linked with GitHub
# Formalizing Content Routing
## 3. Definitions
**Summary**:
- There are 3 kinds of agents: Content Providers ($p_i \in P$), Content Consumers ($c_i \in C$), and Content Routers ($r_i \in R$).
- Providers store data ($d_i \in D$) and make it available to consumers.
- Providers and consumers find each other through routers.
- In some protocols, Providers create advertisements called provider records ($\mathsf{rec}^p_d$, which include content identifiers $\mathsf{cid}_d$), and propagate them to the Routers. Consumers query the routers using $\mathsf{cid}_d$, getting 0 or more corresponding provider records.
### 3.1. Entities
#### $P:$ Content Providers
- $P$ is a *dynamic* set of *content providers*, which store data and make it available to *data consumers*, and advertise which data their have through *provider records* and *content routers*.
- $p_i \in P$ is the $i$th content provider
- $p_i$ stores a dynamic collection of data $D^i$
- $p_i$ can produce provider record $\mathsf{rec}^i_j$ which advertises that $p_i$ provides datum $d_j$, where $d_j \in D^i$ (more on data and provider records below).
#### $R:$ Content Routers
- $R$ is a *dynamic* set of *content routers*
- $r_i \in R$ is the $i$th content router
- routers store routing information (eg provider records) in order to facilitate the search process between consumers and providers.
- system designs vary significantly, and therefore the role of routers can be vastly different from one system to another
- In a content routing system, consider providers and consumers to be "light/thin clients" which implement a small software module, and content routers do most of the work in the routing system.
#### $C:$ Content Consumers
- $C$ is a *dynamic* set of *content consumers*
- $c_i \in C$ is the $i$th content consumer
- consumers make requests to routers (to find providers associated with particular data) and to providers (to retrieve data)
- In some systems, consumer requests to routers (routing queries) may be recursive (the router searches other routers on behalf of the consumer) or iterative (routers with partial information give consumers information about the next router to try).
- In some systems, routers may return routing information to consumers (eg content routing records), and consumers may carry out operations to verify authenticity, validity, and suitability of records and candidate providers.
#### $D:$ Data (the content)
- $D$ is a *dynamic* set of *linked data* provided by *content providers*, and requested by *content consumers* across the network, and routable via *content routers*.
- $d_i \in D$ is the $i$th datum.
- $d_i$ may contain links to other data $d_j$, either through immutable links ($\mathsf{cid}_j$) or mutable links (out of scope here).
### 3.2. Useful Data Structures
The following data structures are useful for content routing systems. Not all systems will use these.
### $\mathsf{cid}_d:$ Content Identifiers for Data
- $\mathsf{cid}_i \leftarrow \mathsf{cid}(d_i)$ is a function that returns a *self-certifying*, *verifiable* identifier for datum $d_i$.
- For simplicity, we restrict to using [CIDs](https://github.com/multiformats/cid), which are cryptographic hash-function based identifiers.
- CIDs can use many different hash functions and data encoding functions. They achieve this through self-description.
- This means that: In reality, the function is: $\mathsf{cid}_{h,e,i} \leftarrow \mathsf{cid}(d_i, h, e)$ for hash function $h$ and data encoding function $e$. But for simplicity, we use $\mathsf{cid}_i$ and deal with $h$ and $e$ implicitly, or make $d_i$ specify $h$ and $e$.
### $\mathsf{rec}^p_d:$ provider records
- $A$ is a *dynamic* set of *provider records* (advertisements).
- $\mathsf{rec}^p_d \in A$ is provider record that advertises: datum $d_d$ is available at provider $p_p$.
- Properties
- (**non-repudiable**): a record $\mathsf{rec}^p_i$ is *non-repudiable* if the record can only be produced with direct authorization by $p_p$. An easy way to achieve this is by making $\mathsf{rec}^p_i$ include a digital signature over a signature less record ($\sigma \leftarrow \mathsf{sign}(\mathsf{sk}_p, \mathsf{pk}_p || \mathsf{cid}_d)$).
- (**repudiable**): a record $\mathsf{rec}^p_i$ is *repudiable* if the record can be produced **without** direct authorization by $p_p$. It is the opposite of non-repudiable. Repudiable records are useful in order to produce content-routing records that offer plausible deniability.
- (**content-verified**): a record $\mathsf{rec}^p_i$ is *content-verified* if the record can only be produced with access to $d_d$. An easy way to achieve this is by making $\mathsf{rec}^p_i$ include a *proof-of-storage* over $d_d$ ($\pi \leftarrow \mathsf{PoS}(d_d)$)
- (**temporary**): a record is *temporary* if the network has a notion of time, and the record is defined to only last for an amount of such time ($\mathsf{TTL}$ - time to live). Temporary records may be:
- (**valid**) if the record is being read within its designated valid time span.
- (**invalid**) if the record is being read *outside* its designated valid time span (before or after).
- (**expired**) if the record is being read *after* its designated valid time span (most systems only have a notion of expiry and records cannot be read *before*.
- (**spam-cost**): a record may have an associated production cost in order to rate-limit the production of records, either to reduce honest load in the system or to reduce adversarial use. Many systems desire repudiability, adversarial spam prevention, and low cost to honest users. One plausible way to achieve this is with cost-based repudiability: $p_p$ can produce records $\mathsf{rec}^p_d$ records cheaply. Other entities can produce $\mathsf{rec}^p_d$ records, but only expensively (eg by searching for a weak private key).
- (**aggregatable**): many systems desire routing records that are aggregatable while preserving other properties, either by datum (many records about $d_d$), or by provider (many records by $p_p$). Some cryptographic primitives can be used to construct aggregatable records (eg bilinear pairings).
- (**encrypted**): a record may be encrypted such that only agents with secret information may decrypt them. For example, records might be encrypted using a key derived from the data's content identifier, so that records only reveal their contents to agents who have the relevant $\mathsf{cid_d}$, and routers are unable to enumerate who provides what. These records can provide partial reader-writer privacy.
- (**homomorphicly-encrypted**): a record may be homomorphically-encrypted to enable routers to route and consumers to look up without learning the identities of the providers or the content. These may be useful in networks that seek to provide full reader and writer privacy.
### 3.3. $\Psi:$ Content Routing System properties
- $\Psi = (P, R, C, D, A, \Pi)$ is a *content routing system* with five sets (providers, routers, consumers, data, routing entries) and a protocol $\Pi$ defining the functions for programs implementing the agents $(P, R, C)$ and data structures $(D, A)$. Properties below.
- **Authenticity**
- (**authenticated**) $\Psi$ is *authenticated* if agents have their own identities, and protocol messages can be associated with authors.
- Example: $\psi$ assigns cryptographic identities to all parties using public-key cryptography, and signs some or all messages between entities.
- (**self-certifying**) $\Psi$ is *self-certifying* if agents have their own public-key based identities, which can sign messages.
- **Privacy**
- (**transport-encrypted**) $\Psi$ is *transport-encrypted* if the messages between participants are encrypted such that malicious observers cannot decrypt or extract the messages.
- (**writer-privacy**) $\Psi$ provides *writer-privacy* if providers can provide content to a subset of consumers, without unauthorized consumers, routers, or adversaries learning what the provider has published. There are many different notions of writer-privacy:
- (**identity writer-privacy**) adversaries cannot learn *who* provided some content. Adversaries may see advertisements and learn *what* data they are associated to, but not learn the identity of the provider.
- (**content writer-privacy**) adversaries cannot learn *what* content is being provided. Adversaries may detect that a provider is publishing advertisements, but not learn *what* data the advertisements are about.
- (**action writer-privacy**) adversaries cannot learn that a particular provider has taken any action. This is difficult to achieve, as solutions tend to require slow round-based protocols that produce lots of noise.
- (**full writer-privacy**) no entity learns what entities provide content or what content is being provided, at all. This is extremely difficult, and requires corresponding writer-privacy preserving retrieval protocols between providers and consumers as well.
- (**reader-privacy**) $\Psi$ provides *reader-privacy* if consumers can find content (eg lookup records) without unauthorized providers, routers, or adversaries learning what the consumer has requested. There are many different notions of reader-privacy:
- (**identity reader-privacy**) adversaries cannot learn *who* is looking for some content. Adversaries may see requests and learn *what* data they are associated to, but not learn the identity of the consumer.
- (**content reader-privacy**) adversaries cannot learn *what* content is being searched. Adversaries may detect that a particular consumer is making requests, but not learn *what* data the requests are about.
- (**action reader-privacy**) adversaries cannot learn that a particular consumer has taken any action. This is difficult to achieve, as solutions tend to require slow round-based protocols that produce lots of noise.
- (**full reader-privacy**) no entity learns what entities request content or what content is being requested, at all. This is extremely difficult, and requires corresponding reader-privacy preserving retrieval protocols between providers and consumers as well.
- **Scalable and decentralized set membership**
- (**$S$-permissionless**) $\Psi$ is *permissionless* with respect to set $S$ if new agents can add themselves to agent set $S$ without requiring permission from any authority, or the group.
- Example: $\psi$ is $C$-permissionless and $P$-permissionless but not $R$-permissionless if it allows consumers and providers to join and leave permissionlessly, but routers can only enter (or leave) with permission.
- (**$S$-consistent**) $\Psi$ is *consistent* with respect to set $S$ if there is a canonical registry that maintains the exact state of set $S$, with safety -- meaning that as participants join and leave (or records are introduced), any registries or information about $S$ are maintained consistently across the network.
- (**$S$-sloppy**) $\Psi$ is *sloppy* with respect to set $S$ if information or registries maintained about the state of set $S$ are partial, incomplete, or inconsistent -- meaning that participants join and leave, and any registries or information about $S$ are sloppily maintained without requiring consistency (or not maintained at all). Sloppy sets are extremely useful in high-availabilty and high-traffic distributed systems. It is a limited (and easier to implement) form of eventual consistency.
- Example: $\psi$ is $C$-sloppy, $R$-sloppy, and $A$-sloppy, but not $P$-sloppy if it maintains a sloppy registry of routers, no registry of clients, sloppy information about advertisements, and an **exact** registry of providers.
- **Liveness**
- (**liveness**) $\Psi$ maintains *liveness* if it eventually makes progress in the presence of some fraction of (honest or malicious) failures.
- (**partition-tolerant**) $\Psi$ is *partition-tolerant* if the system maintains availability through network partitions. This means that in the partitioned subnetwork, providers, routers, and consumers are able to continue their normal operation, albeit only with the connected subset of the network.
- **Game Security**
- (**$(n,f)$-BFT**) $\Psi$ is $(n,f)$-BFT (Byzantine Fault-Tolerant) if the system can tolerate a total $\frac{f}{n}$ adversaries ($f$ adversaries out of $n$ total entities)
- (**$(n,f)$-PFT**) $\Psi$ is $(n,f)$-PFT (Power Fault-Tolerant) if the system can tolerate a total $\frac{f}{n}$ adversarial *power* ($f$ adversarial *power* out of $n$ total power)
- (**BART**) $\Psi$ is [Byzantine, Altruistic, Rational Tolerant](https://www.cs.cornell.edu/lorenzo/papers/sosp05.pdf) if it guarantees operation in the presence of all rational deviations from the protocol.
- (**Incentive Compatible**) $\Psi$ is *Incentive Compatible* if incentives guarantee that it is in the best interest of all rational entities (or power) to follow the protocol.
## 2. Topologies
### Large number of small routers
Some content routing systems use a large number of small routers, which contain a tiny fraction of the routing information. These systems trade off resource-sharing for performance, as large networks tend to introduce large latencies and resource waste.
### Small number of Large routers
Some content routing systems use a small number of large routers, which contain large fractions of the routing information. These systems prefer performance (specially latency reduction) and depend on dedicated, high performance nodes.
### Specialized router network
Some content routing systems provide standalone router nodes that answer queries for both consumers and providers.
```graphviz { engine = "neato"}
graph {
subgraph C {
style=filled;
color=lightgrey;
node [style=filled, color=lightblue];
c0 [pos="-3, 2!"]
c1 [pos="-3, 1!"]
c2 [pos="-3, 0!"]
c3 [pos="-3,-1!"]
c4 [pos="-3,-2!"]
}
subgraph R {
style=filled;
color=lightgray;
node [style=filled, color=cyan];
r0 [pos="-1, 1!"]
r1 [pos="-1, 0!"]
r2 [pos="-1, -1!"]
r3 [pos=" 0, 1!"]
r4 [pos=" 0, 0!"]
r5 [pos=" 0,-1!"]
r6 [pos=" 1, 1!"]
r7 [pos=" 1, 0!"]
r8 [pos=" 1,-1!"]
}
subgraph P {
style=filled;
color=lightgrey;
node [style=filled, color=yellow];
p0 [pos="3, 2!"]
p1 [pos="3, 1!"]
p2 [pos="3, 0!"]
p3 [pos="3,-1!"]
p4 [pos="3,-2!"]
}
c0 -- r0
c1 -- r0
c1 -- r1
c2 -- r0
c2 -- r1
c2 -- r2
c3 -- r2
c3 -- r1
c4 -- r2
r0 -- r1 -- r2
r1 -- r4 -- r7
r2 -- r5 -- r8
r0 -- r3 -- r6
r3 -- r4 -- r5
r6 -- r7 -- r8
r0 -- r4 -- r8
r2 -- r4 -- r6
r1 -- r3 -- r7
r1 -- r5 -- r7
p0 -- r6
p1 -- r6
p1 -- r7
p2 -- r6
p2 -- r7
p2 -- r8
p3 -- r8
p3 -- r7
p4 -- r8
}
```
### No specialized router nodes
Some content routing systems are deployed as software run by the providers, the consumers, or both providers and consumers. These systems offer simple scalability across networks, though yield complex performance issues and failures.
```graphviz { engine = "neato"}
graph {
subgraph C {
c0 [label="c0 (r0)", pos="-2, 2!"]
c1 [label="c1 (r2)", pos="-2, 1!"]
c2 [label="c2 (r4)", pos="-2, 0!"]
c3 [label="c3 (r6)", pos="-2,-1!"]
c4 [label="c4 (r8)", pos="-2,-2!"]
}
subgraph P {
p0 [label="p0 (r1)", pos="2, 2!"]
p1 [label="p1 (r3)", pos="2, 1!"]
p2 [label="p2 (r5)", pos="2, 0!"]
p3 [label="p3 (r7)", pos="2,-1!"]
p4 [label="p4 (r9)", pos="2,-2!"]
}
c0 -- p4
c1 -- p3
c2 -- p2
c3 -- p1
c4 -- p0
}
```
### No content routers
Content Routing systems without routers are possible, when there are strategies for content dispersal that are compressible into functions ahead of time. For example, a system may use a massive O(1) distributed hash table (or distributed key-value store), where all participants know exactly what providers have what content. This kind of system side-steps the content routing problem, though it is fairly inflexible and not very realistic for massive scale decentralized systems.
```graphviz { engine = "neato"}
graph {
subgraph C {
c0 [pos="-2, 2!"]
c1 [pos="-2, 1!"]
c2 [pos="-2, 0!"]
c3 [pos="-2,-1!"]
c4 [pos="-2,-2!"]
}
subgraph P {
p0 [pos="2, 2!"]
p1 [pos="2, 1!"]
p2 [pos="2, 0!"]
p3 [pos="2,-1!"]
p4 [pos="2,-2!"]
}
c0 -- p4
c1 -- p3
c2 -- p2
c3 -- p1
c4 -- p0
}
```